Notes on Bootstrapping
Jeff Witmer
10 February 2016
New sample, new statistic

We want to estimate a parameter (e.g., a population mean µ). We have a sample of data and a statistic (e.g., the sample mean). A new sample would give a new statistic; many samples would give many statistics, and their distribution is the sampling distribution. But in practice we can only imagine taking repeated samples of size n and observing how the sample mean behaves, e.g., finding out how far it typically is from µ.
Traditional method: Use theory to study how the statistic should behave.

Bootstrap method: Use a computer to simulate how the statistic behaves.

Brad Efron developed the bootstrap in the 1980s. David Freedman once described Efron as "infinitely intelligent, but he doesn't know any distribution theory." Freedman was exaggerating: Efron knows plenty of distribution theory; he developed the bootstrap anyway.
The Bootstrap

We wonder what would happen (how a statistic would behave) in repeated samples from the population. Basic idea: simulate the sampling distribution of any statistic (such as a sample mean or proportion) by repeatedly resampling from the data. A "pseudo-sample" is to the sample as the sample is to the population. We hope.
The Bootstrap

Basic idea: simulate the sampling distribution of any statistic (such as a sample mean) by repeatedly resampling from the data. Draw a "pseudo-sample" (bootstrap sample) and fit a model in order to calculate a statistic (i.e., an estimate of a parameter) from it. Do this MANY times, fitting the model for each bootstrap sample, and collect the estimates.
Why "bootstrap"?

"Pull yourself up by your bootstraps": lift yourself into the air simply by pulling up on the laces of your boots. It is a metaphor for accomplishing an "impossible" task without any outside help. Perhaps an unfortunate name...
Sampling Distribution

[Figure: a population (a "tree") producing many samples ("seeds"), whose statistics form the sampling distribution.] BUT in practice we don't see the "tree" or all of the "seeds"; we only have ONE seed.
Bootstrap Distribution

What can we do with just one seed? Treat it as a bootstrap "population" and grow a new tree! Note: this analogy fits the parametric bootstrap; these notes are mostly about the non-parametric bootstrap. Hat tip: Robin Lock.
The Bootstrap

We hope that the bootstrap statistic is to the bootstrap "population" (the sample) as the sample statistic is to the population. If so, then by studying the distribution of bootstrap statistics we can understand how our sample statistic arose. E.g., we can get a good idea of what the SE is, and thus create a good confidence interval.
The Bootstrap Process

Original Sample  -->  Sample Statistic
Bootstrap Sample -->  Bootstrap Statistic
Bootstrap Sample -->  Bootstrap Statistic
  ...
Bootstrap Sample -->  Bootstrap Statistic

Collected together, the bootstrap statistics form the Bootstrap Distribution.
A small bootstrap example

Consider these 7 data points on the variable Y = pulse: 48, 50, 60, 66, 68, 72, 90 (from 6 STAT 113 students from California plus 1 from China). The sample mean is 64.9 (and the SD is 14.3). Let's take a bootstrap sample by writing the 7 values on 7 cards, shuffling, and making 7 draws with replacement. Then calculate the mean of those 7 values (we can call this a bootstrap mean). Repeat this a few times. This is tedious; technology helps... StatKey...
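The card-shuffling procedure above can be sketched in a few lines of code. The slides use StatKey and R; this is just an illustrative Python translation of the same steps:

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

pulse = [48, 50, 60, 66, 68, 72, 90]

# One bootstrap sample: 7 draws WITH replacement from the 7 values,
# like shuffling the 7 cards and drawing one (then replacing it) 7 times.
boot_sample = random.choices(pulse, k=len(pulse))
boot_mean = sum(boot_sample) / len(boot_sample)

# Repeat a few times and collect the bootstrap means.
boot_means = [sum(random.choices(pulse, k=7)) / 7 for _ in range(5)]
print(boot_mean, boot_means)
```

Each bootstrap mean is an average of values drawn from the original 7, so it always lands between 48 and 90; the interesting part is how the means vary from resample to resample.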
Bootstrapping with StatKey

We can go to StatKey, choose Bootstrap Confidence Interval and then CI for Single Mean, Median, Std. Dev.; choose any of the built-in data sets, click the Edit Data button, and type in our data (48, 50, 60, 66, 68, 72, 90). The Generate 1 Sample button will then create a bootstrap sample, and we can get 1000 bootstrap samples easily.
Bootstrapping with StatKey

From 5000 bootstrap samples we see that the SE (the "st. dev." of the bootstrap distribution) is almost exactly 5. We can make a 95% CI for µ via
Estimate ± 2 * SE
64.9 ± 2*(5)
64.9 ± 10
(54.9, 74.9)
Good news: we can bootstrap a difference in means, a regression coefficient, or... Bad news: bootstrapping when n is small does not overcome the problem that a small sample might not mimic the shape and spread of the population.
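The same SE and ±2 SE interval can be computed directly. A Python sketch of what StatKey is doing behind the scenes:

```python
import random
import statistics

random.seed(2)  # fixed seed for reproducibility

pulse = [48, 50, 60, 66, 68, 72, 90]
B = 5000

# 5000 bootstrap means of the pulse data.
boot_means = [statistics.fmean(random.choices(pulse, k=7)) for _ in range(B)]

# The SE of the sample mean is estimated by the st. dev. of the
# bootstrap means -- it should come out close to 5.
se = statistics.stdev(boot_means)
estimate = statistics.fmean(pulse)  # 64.9

ci = (estimate - 2 * se, estimate + 2 * se)  # roughly (54.9, 74.9)
print(round(se, 2), [round(x, 1) for x in ci])
```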
Bootstrap Percentile CIs

If the bootstrap distribution is not symmetric, then "estimate ± 2*SE" won't be right. And what if we want something other than 95% confidence? We can take the middle 95% of the bootstrap distribution as our interval, or the middle 90% for a 90% CI, etc.
Recall the pulse data

n = 7, data are (48, 50, 60, 66, 68, 72, 90). We can make a 95% CI for µ via
Estimate ± 2 * SE
64.9 ± 2*(5)
64.9 ± 10
(54.9, 74.9)
Or click the "Two-Tail" button: the middle 95% of the bootstrap distribution is (55.4, 75.1).
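The percentile interval is just the middle 95% of the bootstrap means; a Python sketch (the slides use StatKey's Two-Tail button for this):

```python
import random
import statistics

random.seed(3)  # fixed seed for reproducibility

pulse = [48, 50, 60, 66, 68, 72, 90]
boot_means = sorted(statistics.fmean(random.choices(pulse, k=7))
                    for _ in range(5000))

# Middle 95%: cut off 2.5% of the bootstrap means in each tail.
lo = boot_means[int(0.025 * 5000)]   # 125th smallest
hi = boot_means[int(0.975 * 5000)]   # 125th from the top
print(round(lo, 1), round(hi, 1))    # close to StatKey's (55.4, 75.1)
```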
Bootstrapping other statistics: Correlation

We can bootstrap a correlation. Consider the Commute Atlanta data (Time as a function of Distance). The sample correlation is r = 0.81. Generate 5000 bootstrap samples; each time, get the correlation between Distance and Time. Click the "Two-Tail" button: the middle 95% of the bootstrap distribution of correlations is (0.72, 0.87). Note that this is not symmetric around 0.81!
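Bootstrapping a correlation means resampling the (Distance, Time) pairs together. A Python sketch; the real Commute Atlanta data are not reproduced on the slides, so synthetic stand-in data are used here:

```python
import math
import random

random.seed(4)  # fixed seed for reproducibility

def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic stand-in for (Distance, Time): linear trend plus noise.
dist = [random.uniform(1, 40) for _ in range(200)]
time = [5 + 1.5 * d + random.gauss(0, 8) for d in dist]

r = pearson_r(dist, time)

# Resample (Distance, Time) PAIRS with replacement, keeping each
# commute's distance and time together, and recompute r each time.
pairs = list(zip(dist, time))
boot_rs = sorted(
    pearson_r(*zip(*random.choices(pairs, k=len(pairs))))
    for _ in range(5000)
)
ci = (boot_rs[125], boot_rs[4875])  # middle 95% of bootstrap correlations
print(round(r, 2), [round(c, 2) for c in ci])
```

As on the slides, the resulting interval is typically not symmetric around the sample r, since r is bounded above by 1.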
Atlanta commute correlation bootstrapped

[Figure: histogram of the 5000 bootstrap correlations.]
Bootstrapping other statistics

We can use the bootstrap for almost anything, but not everything works... Consider the Ten Counties (Area) data, which are skewed and have a small n of 10:
If we try to bootstrap the median, we don't get a reasonably smooth and symmetric graph. Instead, we get clumpiness:

[Figure: clumpy, gap-filled bootstrap distribution of the median.]
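The clumpiness is easy to see in code: with n = 10, a bootstrap median must be the average of two of the original ten values, so only a handful of distinct medians are possible. A Python sketch, using made-up skewed values in place of the Ten Counties areas (which are not reproduced on the slides):

```python
import random
import statistics

random.seed(5)  # fixed seed for reproducibility

# Ten skewed stand-in values for the county areas.
areas = [35, 52, 61, 70, 88, 110, 160, 240, 520, 1400]

boot_medians = [statistics.median(random.choices(areas, k=10))
                for _ in range(5000)]

# The 5000 bootstrap medians pile up on a small set of distinct values
# (averages of pairs of the original ten), hence the clumpy histogram.
distinct = sorted(set(boot_medians))
print(len(distinct))
```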
Other Bootstrap CIs

There are lots of ways to make a bootstrap CI. These include:
(a) Percentile method
(b) Normal method / t with bootstrap SE
(c) "Basic" bootstrap (aka "reverse percentiles", which is like inverting a randomization test)
(d) T method / bootstrap t interval
(e) BCa (bias-corrected and accelerated)
(a) and (b) are commonly used but may not work well, especially if n is small. (c) is fairly common but often doesn't work well. (d) and (e) are better, but harder to do.
T method: (theta-hat − q(0.975)*SE, theta-hat − q(0.025)*SE). That is, (i) we get quantiles from the bootstrap t-distribution rather than from a t-table, and (ii) we subtract an upper quantile of the bootstrap t-distribution to get the lower endpoint.
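To make methods (a) and (c) concrete, here is a Python sketch contrasting them on the pulse data (my illustration; the slides do not compute a basic interval). The basic ("reverse percentile") interval reflects the percentile endpoints around the estimate, so the upper bootstrap quantile determines the lower endpoint:

```python
import random
import statistics

random.seed(6)  # fixed seed for reproducibility

pulse = [48, 50, 60, 66, 68, 72, 90]
theta_hat = statistics.fmean(pulse)

boot = sorted(statistics.fmean(random.choices(pulse, k=7))
              for _ in range(5000))
q_lo, q_hi = boot[125], boot[4875]  # 2.5% and 97.5% bootstrap quantiles

# (a) Percentile interval: the middle 95% of the bootstrap distribution.
percentile_ci = (q_lo, q_hi)

# (c) Basic / reverse percentile interval: reflect the quantiles around
# the estimate -- the UPPER quantile gives the LOWER endpoint.
basic_ci = (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)
print([round(x, 1) for x in percentile_ci],
      [round(x, 1) for x in basic_ci])
```

When the bootstrap distribution is symmetric the two intervals nearly coincide; they differ when it is skewed.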
N.B. We use the bootstrap to get CIs

The center of the bootstrap distribution is not the center of the sampling distribution. Bootstrapping informs us about the standard error, it can alert us to bias, etc. An unfortunate name? Taking thousands of bootstrap samples does not increase the original sample size, nor, e.g., push the sample mean closer to µ. We don't really "get something for nothing," but we do learn about the uncertainty in the sample statistic.
Bootstrap and Regression/Correlation

Suppose we have a set of (x, y) pairs in a regression setting. If we don't trust a t-test or t-based CI, how might we bootstrap?
(1) Randomly resample (x, y) pairs and fit a regression model to the new "boot-sample."
(2) Randomly resample the residuals from the fitted model, attach them to the fitted values ŷ at the (fixed) x's, and fit a regression model to this new "boot-sample."
If we think of the x's as fixed (e.g., a designed experiment), then (2) looks good. If not, then (1).
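The two schemes can be sketched side by side. A Python illustration on toy data (any regression data set would do; the slides don't specify one):

```python
import random

random.seed(7)  # fixed seed for reproducibility

def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Toy (x, y) data: a linear trend with noise.
xs = [float(i) for i in range(1, 21)]
ys = [2 + 0.5 * x + random.gauss(0, 1) for x in xs]

b1 = ols_slope(xs, ys)
b0 = sum(ys) / len(ys) - b1 * sum(xs) / len(xs)
fitted = [b0 + b1 * x for x in xs]
resid = [y - f for y, f in zip(ys, fitted)]

# (1) Pairs bootstrap: resample (x, y) PAIRS with replacement.
pairs = list(zip(xs, ys))
pair_slopes = [ols_slope(*zip(*random.choices(pairs, k=len(pairs))))
               for _ in range(2000)]

# (2) Residual bootstrap: keep the x's FIXED, resample residuals,
# attach them to the fitted values y-hat, and refit.
resid_slopes = []
for _ in range(2000):
    e_star = random.choices(resid, k=len(resid))
    y_star = [f + e for f, e in zip(fitted, e_star)]
    resid_slopes.append(ols_slope(xs, y_star))

print(len(pair_slopes), len(resid_slopes))
```

The spread of either collection of slopes estimates the SE of the slope; (2) trusts the fitted model more, since the resampled y's are built from it.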
Implementing the bootstrap

How does one actually do this? If it were 1991, you might write your own code. Today, use software such as StatKey for a small problem in a standard setting (1 mean, 2 means, simple regression, etc.). In general, use R and either (a) write a function for the statistic you want to bootstrap and use the boot() command, or (b) use the do()* command in the mosaic package.
Example

A consulting client recently needed a CI for a ratio estimator, so I wrote a short script in R. Here is part of it:

ratioBoot <- function(mydata, indices) {
  d <- mydata[indices,]  # allows boot to select a sample
  ratio <- sum(d$Y)/sum(d$X)
  return(ratio)
}
mydata = ratio.data    # this is the client's data file
library(boot)          # make the boot package active
bootratios5000 = boot(mydata, ratioBoot, R=5000)
boot.ci(bootratios5000, type="norm")   # normal-method CI
boot.ci(bootratios5000, type="perc")   # percentile CI
hist(bootratios5000$t)   # a histogram of the 5000 values
mean(bootratios5000$t)
sd(bootratios5000$t)

Note: Lots of digits were reported here because the client is dealing with millions of dollars (and courts/judges tend to like things reported to the nearest dollar, if not the nearest penny).
Using the do()* command

The mosaic package in R has a command that will do something (anything) many times.

library(mosaic)
pulse = c(48, 50, 60, 66, 68, 72, 90)
mean(pulse)
mean(resample(pulse))
do(5) * resample(pulse)
do(5) * mean(resample(pulse))
mymeans = do(5) * mean(resample(pulse))
mymeans
mymeans = do(10000) * mean(resample(pulse))
hist(mymeans$mean, breaks=100)
mean(mymeans$mean)
sd(mymeans$mean)
quantile(mymeans$mean, probs=c(0.025, 0.975))   # gives 55.7 to 75.1
There is more...

There are yet other variations on the theme, e.g., the "wild bootstrap," in which you keep each residual with its observation but randomly multiply the residual by -1 with probability 1/2. [Bootstrap tilting: reweight the data to satisfy a null hypothesis, sample from the weighted distribution, then invert a hypothesis test to get a CI.]

Note: these notes have been about the nonparametric bootstrap. There is also the smoothed bootstrap (sample from a smooth, kernel-density-estimated population) and the parametric bootstrap (estimate the parameters, then sample from the fitted parametric distribution) [the "grow a new tree" analogy].
References

Hesterberg, T. (2015), "What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum," The American Statistician. (This is a summary of a longer 2014 paper with the same title.)

Efron, B. and Tibshirani, R. (1993), An Introduction to the Bootstrap, Springer.