Newsgroups: comp.ai.neural-nets
Path: cantaloupe.srv.cs.cmu.edu!bb3.andrew.cmu.edu!newsfeed.pitt.edu!newsflash.concordia.ca!news.nstn.ca!coranto.ucs.mun.ca!news.unb.ca!news.uoregon.edu!enews.sgi.com!news.mathworks.com!newsgate.duke.edu!interpath!news.interpath.net!sas!newshost.unx.sas.com!saswss
From: saswss@hotellng.unx.sas.com (Warren Sarle)
Subject: Re: Bootstrapping vs. Bayesian
Originator: saswss@hotellng.unx.sas.com
Sender: news@unx.sas.com (Noter of Newsworthy Events)
Message-ID: <Dxp61u.2Ir@unx.sas.com>
Date: Sat, 14 Sep 1996 00:44:18 GMT
X-Nntp-Posting-Host: hotellng.unx.sas.com
References: <TAP.96Sep6190028@pearson.epi.terryfox.ubc.ca> <50vagm$qj8@newsbf02.news.aol.com> <50vlu3$jj3@delphi.cs.ucla.edu> <th.28.323661D6@skull.dcn.ed.ac.uk>
Organization: SAS Institute Inc.
Lines: 341


In article <th.28.323661D6@skull.dcn.ed.ac.uk>, th@skull.dcn.ed.ac.uk writes:
|> ...
|> Like the original poster, I find the bootstrap very attractive.  it seems to 
|> provide a way of assessing how sensitive your system is to the particular 
|> training set used, which is absolutely crucial for my current application.  

The bootstrap is indeed attractive and often useful. But, like neural
nets, the bootstrap is often portrayed as a magical spell that works
without any thought or care on the part of the analyst. S&T (see
references below) dispel this myth very effectively and very
technically.

|> Although the theory is rather deep (for me), the algorithms are simplicity 
|> itself - for example see:
|> 
|>      Baxt, W.G, and H. White, Bootstrapping confidence intervals for clinical
|>      input variable effects in a network trained to identify the presence of 
|>      acute myocardial infarction, Neural Computation, 7: 624-638, 1995
|> 
|> I would be very interested in any more that could be said re the pitfalls of 
|> this approach.

Here is some material excerpted from the documentation for my
bootstrap macros (which are available from our Tech Support
department for any of you who happen to be SAS users). Funny looking
words in uppercase and beginning with a percent sign are names of
SAS things that people who are not using SAS can ignore.

The two most important issues are (1) how to deal with regression models
and (2) how to compute confidence intervals from biased estimators.
Issue (1) applies to all neural nets used for function approximation but
is rather glossed over by Baxt and White. Issue (2) certainly applies to
neural nets when there is a possibility of underfitting; overfitting is
(I would guess) less of a problem, but I haven't seen any definite
results one way or the other.

Bootstrapping Regression Models
-------------------------------

In regression models, there are two main ways to do bootstrap
resampling, depending on whether the predictor variables are random or
fixed.

If the predictors are random, you resample observations just as you
would for any simple random sample. This method is usually called
"bootstrapping pairs".

If the predictors are fixed, the resampling process should keep the same
values of the predictors in every resample and change only the values of
the response variable by resampling the residuals.

Either method of resampling for regression models (observations or
residuals) can be used regardless of the form of the error distribution.
However, residuals should be resampled only if the errors are
independent and identically distributed and if the functional form of
the model is correct to within a reasonable approximation.  If these
assumptions are questionable [as they often are in neural net
applications], it is safer to resample observations.
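The two resampling schemes can be sketched in a few lines of Python (this is an illustration, not the SAS macros; the helper names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def boot_pairs(x, y):
    # Predictors random: resample whole (x, y) observations,
    # just as in any simple random sample ("bootstrapping pairs").
    idx = rng.integers(0, len(y), size=len(y))
    return x[idx], y[idx]

def boot_residuals(x, y, fitted):
    # Predictors fixed: keep x exactly as-is and perturb only the
    # response by resampling residuals. Assumes i.i.d. errors and
    # an approximately correct functional form.
    resid = y - fitted
    idx = rng.integers(0, len(y), size=len(y))
    return x, fitted + resid[idx]
```

When the i.i.d.-error assumption is doubtful, as is common in neural net applications, boot_pairs is the safer choice.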


Confidence Intervals
--------------------

The terminology for bootstrap confidence intervals is confused. The
keywords used with the %BOOTCI macro follow S&T:

   Keyword      Terms from the references
   -------      -------------------------
   PCTL or      "bootstrap percentile" in S&T;
   PERCENTILE   "percentile" in E&T;
                "other percentile" in Hall;
                "Efron's `backwards' pecentile" in Hjorth

   HYBRID       "hybrid" in S&T;
                 no term in E&T;
                "percentile" in Hall;
                "simple" in Hjorth

   T            "bootstrap-t" in S&T and E&T;
                "percentile-t" in Hall;
                "studentized" in Hjorth

   BC           "BC" in all

   BCA          "BCa" in S&T, E&T, and Hjorth; "ABC" in Hall
                (cannot be used for bootstrapping residuals in
                regression models)

There is considerable controversy concerning the use of bootstrap
confidence intervals. To fully appreciate the issues, it is important to
read S&T and Hall in addition to E&T.  Asymptotically in simple random
samples, the T and BCa methods work better than the traditional normal
approximation, while the percentile, hybrid, and BC methods have the
same accuracy as the traditional normal approximation.  In small
samples, things get much more complicated:

 * The percentile method simply uses the alpha/2 and 1-alpha/2
   percentiles of the bootstrap distribution to define the interval.
   This method performs well for quantiles and for statistics that are
   unbiased and have a symmetric sampling distribution.  For a
   statistic that is biased, the percentile method amplifies the bias.
   The main virtue of the percentile method and the closely related BC
   and BCa methods is that the intervals are equivariant under
   transformation of the parameters. One consequence of this
   equivariance is that the interval cannot extend beyond the possible
   range of values of the statistic.  In some cases, however, this
   property can be a vice--see the "Cautionary Example" below.

 * The BC method corrects the percentile interval for bias--median
   bias, not mean bias. The correction is performed by adjusting the
   percentile points to values other than alpha/2 and 1-alpha/2.  If a
   large correction is required, one of the percentile points will be
   very small; hence a very large number of resamples will be required
   to approximate the interval accurately. See the "Cautionary
   Example" below.

 * The BCa method corrects the percentile interval for bias and
   skewness. This method requires an estimate of the acceleration,
   which is related to the skewness of the sampling distribution. For
   simple random samples, the acceleration can be estimated by
   jackknifing, which of course requires extra computation. For
   bootstrapping residuals in regression models, no general method for
   estimating the acceleration is known. If the acceleration is not
   estimated accurately, the BCa interval will perform poorly.  The
   length of the BCa interval is not monotonic with respect to alpha
   (Hall, pp 134-135, 137).  For large values of the acceleration and
   large alpha, the BCa interval is excessively short.  The BCa
   interval is no better than the BC interval for nonsmooth statistics
   such as the median.

 * The HYBRID method is the reverse of the percentile method. While
   the percentile method amplifies bias, the HYBRID method
   automatically adjusts for bias and skewness. The HYBRID method
   works well if the standard error of the statistic does not depend
   on any unknown parameters; otherwise, the T method works better if
   a good estimate of the standard error is available. Of all the
   methods in %BOOTCI, the HYBRID method seems to be the least likely
   to yield spectacularly wrong results, but often suffers from low
   coverage in relatively easy cases. The HYBRID method and the
   closely related T method are not equivariant under transformation
   of the parameters.

 * The T method requires an estimate of the standard error (or a
   constant multiple thereof) of each statistic being bootstrapped.
   This requires more work from the user. If the standard errors are
   not estimated accurately, the T method may perform poorly. In
   simulation studies, T intervals are often found to be very long.
   E&T (p 160) claim that the T method is erratic and sensitive to
   outliers.
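Given a vector of bootstrap replicates theta_star and the original-sample estimate theta_hat, the PCTL, HYBRID, T, and BC intervals above can be sketched in Python (again an illustration, not the %BOOTCI implementation):

```python
import numpy as np
from statistics import NormalDist

def percentile_ci(theta_star, alpha=0.05):
    # PCTL: raw alpha/2 and 1-alpha/2 quantiles of the
    # bootstrap distribution (amplifies any bias).
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def hybrid_ci(theta_star, theta_hat, alpha=0.05):
    # HYBRID: reflect the bootstrap quantiles about theta_hat,
    # which adjusts for bias and skewness.
    lo, hi = np.quantile(theta_star, [alpha / 2, 1 - alpha / 2])
    return 2 * theta_hat - hi, 2 * theta_hat - lo

def t_ci(theta_star, se_star, theta_hat, se_hat, alpha=0.05):
    # T: studentize each replicate by its own standard-error
    # estimate, then invert the studentized quantiles.
    t = (theta_star - theta_hat) / se_star
    tlo, thi = np.quantile(t, [alpha / 2, 1 - alpha / 2])
    return theta_hat - thi * se_hat, theta_hat - tlo * se_hat

def bc_ci(theta_star, theta_hat, alpha=0.05):
    # BC: shift the percentile points to correct for median bias.
    nd = NormalDist()
    z0 = nd.inv_cdf(np.mean(theta_star < theta_hat))
    a1 = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    a2 = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    lo, hi = np.quantile(theta_star, [a1, a2])
    return lo, hi
```

Note how a large z0 in bc_ci pushes one of the adjusted percentile points toward 0 or 1, which is exactly why a large bias correction demands a very large number of resamples.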

Numerous other methods exist for bootstrap confidence intervals that
require nested resampling, i.e., each resample of the original
sample is itself reresampled multiple times. Since the total number
of reresamples required is typically 25,000 or more, these methods
are extremely expensive and have not yet been implemented in the
%BOOT and %BOOTCI macros.


A Cautionary Example
--------------------

Jackknifing and bootstrapping are no remedy for an inadequate sample
size. For nonparametric resampling methods, the sample distribution must
be reasonably close in some sense to the population distribution to
obtain accurate inferences. In parametric methods, only the estimated
parameters need be reasonably close to the population parameters to
obtain accurate inferences. The smaller the sample size, the greater the
fluctuations in the distribution of the sample. Nonparametric methods
that are sensitive to a wide variety of such fluctuations will suffer
more from small sample sizes than will parametric methods _if_ the
assumptions of the parametric methods are valid.

In this example, the purpose of the analysis is to find a 95% confidence
interval for R**2 [in neural net terminology, R**2 (R-squared) is one
minus the normalized mean squared error] in a linear regression with 20
observations and 10 predictors.  The predictors and response are
generated from a multivariate normal distribution, so normal-theory
methods are applicable. With real data, if the distribution were not
known to be normal, you might be tempted to use the jackknife or
bootstrap on the theory that normal approximations could not be trusted
in such a small sample size. In fact, most of the jackknife and
bootstrap methods cannot be trusted either.
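The setup can be reproduced in Python (not the omitted SAS code; a sketch in which only the first predictor matters, with its coefficient chosen so the true R**2 is 0.10):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 10  # 20 observations, 10 predictors, as in the example

def r_squared(X, y):
    # Plug-in R**2 from an OLS fit with an intercept.
    Xd = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]
    rss = np.sum((y - Xd @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

# A coefficient of 1/3 on a unit-variance predictor with unit error
# variance gives true R**2 = (1/9) / (1/9 + 1) = 0.10.
X = rng.standard_normal((n, p))
y = X[:, 0] / 3 + rng.standard_normal(n)

r2_hat = r_squared(X, y)  # plug-in estimate, typically far above 0.10

# Bootstrap distribution of R**2 by resampling observations.
boot = np.empty(1000)
for i in range(1000):
    idx = rng.integers(0, n, size=n)
    boot[i] = r_squared(X[idx], y[idx])
```

With n barely larger than p, the plug-in R**2 is severely biased upward, and the bootstrap distribution inherits that bias instead of recovering the true sampling distribution.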

This example computes a 95% confidence interval with each of the methods
available in %JACK and %BOOT using 1000 resamples. ...
assembled into a single data set called CI for compari

[SAS code omitted]

The actual sampling distribution of R**2, based on 10000 simulated data
sets, looks like this:

      Frequency

      1000 +                               *
           |                             * * *
           |                           * * * *
           |                           * * * * *
       800 +                         * * * * * *
           |                         * * * * * * *
           |                       * * * * * * * *
           |                       * * * * * * * *
       600 +                       * * * * * * * * *
           |                     * * * * * * * * * *
           |                     * * * * * * * * * *
           |                   * * * * * * * * * * *
       400 +                   * * * * * * * * * * * *
           |                   * * * * * * * * * * * *
           |                 * * * * * * * * * * * * *
           |                 * * * * * * * * * * * * *
       200 +               * * * * * * * * * * * * * * *
           |             * * * * * * * * * * * * * * * *
           |             * * * * * * * * * * * * * * * * *
           |         * * * * * * * * * * * * * * * * * * * *
           ------------------------------------------------------
             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
             . . . . . . . . . . . . . . . . . . . . . . . . . .
             0 0 0 1 1 2 2 2 3 3 4 4 4 5 5 6 6 6 7 7 8 8 8 9 9 0
             0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0

The bootstrap distribution computed from the one data set in this
example is not even close to the true sampling distribution:

       Frequency

           |                                                   *
       300 +                                                   *
           |                                                   *
           |                                                   *
           |                                                   *
           |                                                   *
       200 +                                                   *
           |                                                   *
           |                                                   *
           |                                                 * *
           |                                               * * *
       100 +                                           * * * * *
           |                                         * * * * * *
           |                                       * * * * * * *
           |                                   * * * * * * * * *
           |                               * * * * * * * * * * *
           ------------------------------------------------------
             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
             . . . . . . . . . . . . . . . . . . . . . . . . . .
             0 0 0 1 1 2 2 2 3 3 4 4 4 5 5 6 6 6 7 7 8 8 8 9 9 0
             0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0 4 8 2 6 0

The table of confidence intervals printed by the final PROC PRINT step
is:

               METHOD                ALCL        AUCL

               Normal theory        0.00000    0.62876
               Jackknife           -0.44648    0.54393
               Bootstrap Normal     0.07400    0.51324
               Bootstrap Hybrid     0.18391    0.56566
               Bootstrap PCTL       0.61824    1.00000
               Bootstrap BC         0.51547    0.57231
               Bootstrap BCa        0.51547    0.57231
               Bootstrap t         -3.11368    0.56556

The true value of R**2 in this example is 0.10, the sample plug-in
estimate is 0.59, and the adjusted estimate is 0.14. The normal-theory
interval can be considered the "right answer".

The jackknife interval has a negative lower limit, and the upper limit
is rather low, but the interval covers the true value.

The bootstrap interval based on a normal approximation is short but does
cover the true value. However, a glance at the chart of the bootstrap
distribution shows that a normal approximation is suspect.

The bootstrap hybrid interval is even shorter and does not cover the
true value. The hybrid interval is poor because the bootstrap
distribution is less variable and far more skewed than the true sampling
distribution.

The plug-in estimate is very biased, so it is no surprise that the
bootstrap PCTL method works poorly. However, the PCTL interval lies
entirely above the plug-in estimate, a dramatic illustration of Hall's
claim that the PCTL interval is "backwards"!

The bootstrap BC interval is extremely short and is not even close to
the true value.  The lower percentile point for computing the BC
interval is .00000000010453, so billions of resamples would be required
for an accurate approximation. The lower percentile point for the BCa
interval is even smaller at 7.3099E-17, and would require an
astronomical number of resamples for an accurate approximation.

The bootstrap t interval has a wildly negative lower limit, and the upper
limit is rather low, but the interval covers the true value.

A simulation was performed by repeating the above analysis 2000 times on
randomly generated data sets. For each method, the coverage probability
(COVERAGE), the average length (LENGTH), and the positive part of the
length (POSLEN=AUCL-MAX(0,ALCL)) were computed. Among the jackknife and
bootstrap methods, the only acceptable coverage probability is for the
bootstrap t interval, which is nevertheless very poor with regard to the
length of the interval. Considering only the positive part of the
interval, the bootstrap t interval is quite good, but it works well in
this example only because we know a lower bound for the parameter and
have an analytic expression for the standard error.

      METHOD               COVERAGE      LENGTH      POSLEN
      ------------------------------------------------------
      Bootstrap BC          0.00426    0.148825    0.148825
      Bootstrap BCa         0.00426    0.148575    0.148575
      Bootstrap Hybrid      0.42997    0.418778    0.351726
      Bootstrap Normal      0.54917    0.485234    0.361531
      Bootstrap PCTL              0    0.418778    0.418778
      Bootstrap T           0.95828    4.351963    0.542228
      Jackknife             0.66050    0.561788    0.333603
      Normal theory         0.95360    0.573977    0.573977


References
----------

Dixon   Dixon, P.M. (1993), "The bootstrap and the jackknife: 
        Describing the precision of ecological indices," in Scheiner,
        S.M. and Gurevitch, J., eds., Design and Analysis of Ecological
        Experiments, New York: Chapman & Hall, pp 290-318.

E&T     Efron, B. and Tibshirani, R.J. (1993), An Introduction to the
        Bootstrap, New York: Chapman & Hall.

Hall    Hall, P. (1992), The Bootstrap and Edgeworth Expansion, New York:
        Springer-Verlag.

Hjorth  Hjorth, J.S.U. (1994), Computer Intensive Statistical Methods,
        London: Chapman & Hall.

S&T     Shao, J. and Tu, D. (1995), The Jackknife and Bootstrap, New York:
        Springer-Verlag.

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
