Application of the Bootstrap
Estimating a Population Mean
Movie Average Shot Lengths
Sources:
- Barry Sands’ Average Shot Length Movie Database
- L. Chihara and T. Hesterberg (2011). Mathematical Statistics with Resampling and R. Wiley, Hoboken, NJ.
Data Description
- Average Shot Length (ASL, in seconds) for a population of 11001 films (Barry Sands’ movie database)
- Very highly right-skewed population: Min = 1.330, LQ = 4.510, Median = 6.400, UQ = 8.910, Max = 1000
- Population mean μ = 7.739, population standard deviation σ = 12.765
- Coefficient of Variation: CV = 100(σ/μ) = 100(12.765/7.739) = 164.94%
- Goal: Small-sample estimation of μ when the shape of the small-sample sampling distribution of the sample mean is unknown
Introduction to the Bootstrap
- Uses a sample from a population to estimate the sampling distribution of a statistic/estimator
- Treats the sample as an “estimate” of the population of measurements: the sample empirical cumulative distribution function (ECDF) serves as the estimate of the population CDF
Population and Sample Empirical CDFs (sample size: n = 25)
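As a small illustration of the sample side of this comparison, the sketch below builds the empirical CDF of the n = 25 sample listed in the ASL example, using base R's `ecdf()`. The object name `Fhat` is chosen here for illustration; the population CDF is not reproduced because the full database is not part of these slides.

```r
# n = 25 sample of average shot lengths (seconds) from the ASL example
ASL.sample1 <- c(4.40, 14.98, 7.80, 9.50, 9.50, 6.70, 7.50, 9.20, 3.70, 8.04,
                 4.47, 9.40, 8.40, 8.88, 5.50, 16.30, 6.70, 3.65, 4.27, 11.60,
                 9.30, 3.40, 2.90, 12.00, 16.60)

# ecdf() returns a step function: Fhat(x) = (number of observations <= x) / n
Fhat <- ecdf(ASL.sample1)

# Proportion of sampled films with ASL <= 8.04 seconds (13 of 25 values)
Fhat(8.04)   # 0.52
```

Calling `plot(Fhat)` draws the step function, which can then be compared visually with a population CDF when one is available.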
Applying the Bootstrap
1. Obtain a random sample of size n from the population
2. Determine the estimator(s) of interest
3. Compute the estimate(s) based on the sample
4. Determine B, the number of bootstrap samples to be taken
5. Obtain B random samples of size n, drawn with replacement from the original sample
6. Compute the estimate for each bootstrap sample
- The bootstrap distribution is the collection of the B estimates
- The bootstrap standard error is the standard deviation of the B estimates
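The steps above can be sketched in a few lines of R. This is a minimal sketch rather than the code behind the slides: the helper names `boot_dist` and `cv` are hypothetical, the seed is arbitrary, and the data are the n = 25 ASL values from the example. The coefficient of variation is used as the statistic to show that the procedure is not tied to the mean.

```r
# Hypothetical helper (not from the slides): bootstrap distribution of a statistic
boot_dist <- function(x, statistic, B = 10000) {
  # Each replicate resamples length(x) values with replacement from x,
  # then applies the statistic to the resample
  replicate(B, statistic(sample(x, size = length(x), replace = TRUE)))
}

# Coefficient of variation, in percent
cv <- function(x) 100 * sd(x) / mean(x)

set.seed(1)  # arbitrary seed, for reproducibility only

ASL.sample1 <- c(4.40, 14.98, 7.80, 9.50, 9.50, 6.70, 7.50, 9.20, 3.70, 8.04,
                 4.47, 9.40, 8.40, 8.88, 5.50, 16.30, 6.70, 3.65, 4.27, 11.60,
                 9.30, 3.40, 2.90, 12.00, 16.60)

cv.hat  <- cv(ASL.sample1)             # estimate from the original sample
ASL.CV  <- boot_dist(ASL.sample1, cv)  # bootstrap distribution (B estimates)
boot.se <- sd(ASL.CV)                  # bootstrap standard error of the CV
```

Passing `mean` or `sd` in place of `cv` gives the bootstrap distributions of those statistics in the same way.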
Properties of the Bootstrap Sampling Distribution
- Center: The bootstrap sampling distribution is centered (approximately) at the estimate from the original sample, not at the population parameter being estimated
- Spread: The spread is representative of the spread of the estimator’s sampling distribution
- Bias: The bootstrap bias estimate is the difference between the center of the bootstrap sampling distribution and the estimate from the original sample; it is representative of the bias of the estimator for the true parameter
- Skewness: Skewness in the bootstrap sampling distribution is representative of the skewness of the estimator’s sampling distribution
Example – Movie Average Shot Lengths (ASL)
- Interested in approximating the sampling distribution of the sample mean
- Population value: μ = 7.739
- (Pseudo) random sample of n = 25 films’ ASLs:
   4.40  14.98   7.80   9.50   9.50
   6.70   7.50   9.20   3.70   8.04
   4.47   9.40   8.40   8.88   5.50
  16.30   6.70   3.65   4.27  11.60
   9.30   3.40   2.90  12.00  16.60
Bootstrap Samples
- Taking B = 10000 bootstrap samples from the original sample
- Summaries for the original sample and for the bootstrap distributions of the mean, SD, and CV:

                          Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
  > summary(ASL.sample1)  2.900   4.470   8.040   8.188   9.500  16.600
  > summary(ASL.mean)     5.560   7.666   8.182   8.190   8.687  11.100
  > summary(ASL.sd)       1.916   3.423   3.800   3.772   4.137   5.494
  > summary(ASL.CV)       26.19   42.28   46.19   46.16   50.13   67.20
Bootstrap Standard Error and Sampling Distribution
In terms of the sampling distribution of the sample mean:
- Mean of bootstrap sample means: 8.1899 (close to the original sample mean, 8.1876; not so close to the population mean, 7.7394)
- Bootstrap estimate of bias: 8.1899 - 8.1876 = 0.0023
- Bootstrap standard error (standard deviation of the 10000 bootstrap sample means): 0.7620
- Bias/BSE = 0.0023/0.7620 = 0.0030 (0.30%)
- Bootstrap 95% percentile interval, the (.025, .975) quantiles of the bootstrap distribution of the mean: (6.7444, 9.7113), which does include the population mean (7.739)
- Note: The interval is asymmetric about the sample mean, reflecting an asymmetric bootstrap sampling distribution: (8.1876 - 1.4432, 8.1876 + 1.5237)
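The quantities on this slide can be recomputed along the following lines. This is a sketch under stated assumptions: the seed is arbitrary, so the results will differ slightly from the slide's values in the later decimal places, and the names `boot.bias`, `boot.se`, and `pct.ci` are introduced here for illustration.

```r
set.seed(2011)  # arbitrary seed; not the one used for the slides

ASL.sample1 <- c(4.40, 14.98, 7.80, 9.50, 9.50, 6.70, 7.50, 9.20, 3.70, 8.04,
                 4.47, 9.40, 8.40, 8.88, 5.50, 16.30, 6.70, 3.65, 4.27, 11.60,
                 9.30, 3.40, 2.90, 12.00, 16.60)
n <- length(ASL.sample1)
B <- 10000

# Bootstrap distribution of the sample mean
ASL.mean <- replicate(B, mean(sample(ASL.sample1, size = n, replace = TRUE)))

# Bias estimate: center of the bootstrap distribution minus the original estimate
boot.bias <- mean(ASL.mean) - mean(ASL.sample1)

# Bootstrap standard error: SD of the B bootstrap means
boot.se <- sd(ASL.mean)

# 95% percentile interval: the .025 and .975 quantiles of the bootstrap means
pct.ci <- quantile(ASL.mean, probs = c(0.025, 0.975), names = FALSE)

# Distances from the sample mean to each endpoint (unequal if the distribution is skewed)
c(lower = mean(ASL.sample1) - pct.ci[1], upper = pct.ci[2] - mean(ASL.sample1))
```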
Bootstrap t Confidence Interval for μ
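For reference, the usual textbook construction of the bootstrap t interval (as presented, for example, in Chihara and Hesterberg) is written out below. The notation is an assumption made here: \(\bar{x}\) and \(s\) are the mean and standard deviation of the original sample, \(\bar{x}^*_b\) and \(s^*_b\) are those of the b-th bootstrap sample, and \(Q_1, Q_2\) are the .025 and .975 quantiles of the bootstrap t statistics.

```latex
\[
T^*_b = \frac{\bar{x}^*_b - \bar{x}}{s^*_b / \sqrt{n}}, \qquad b = 1, \ldots, B
\]
\[
\text{95\% bootstrap } t \text{ interval for } \mu: \quad
\left( \bar{x} - Q_2 \frac{s}{\sqrt{n}}, \;\; \bar{x} - Q_1 \frac{s}{\sqrt{n}} \right)
\]
```

Note that the upper quantile \(Q_2\) sets the lower endpoint and the lower quantile \(Q_1\) sets the upper endpoint.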
ASL Example
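A sketch of how a bootstrap t interval could be computed for the n = 25 ASL sample is given below. It follows the standard construction (studentize each bootstrap mean, take the .025 and .975 quantiles of the resulting t statistics, then invert them around the sample mean); it is not a reproduction of the slide's own output. The seed and the names `Tstar`, `Q`, and `boot.t.ci` are assumptions.

```r
set.seed(42)  # arbitrary seed

x <- c(4.40, 14.98, 7.80, 9.50, 9.50, 6.70, 7.50, 9.20, 3.70, 8.04,
       4.47, 9.40, 8.40, 8.88, 5.50, 16.30, 6.70, 3.65, 4.27, 11.60,
       9.30, 3.40, 2.90, 12.00, 16.60)
n    <- length(x)
xbar <- mean(x)
se   <- sd(x) / sqrt(n)   # estimated standard error from the original sample
B    <- 10000

# Studentized statistic for each bootstrap sample
Tstar <- replicate(B, {
  xs <- sample(x, size = n, replace = TRUE)
  (mean(xs) - xbar) / (sd(xs) / sqrt(n))
})

# .025 and .975 quantiles of the bootstrap t statistics
Q <- quantile(Tstar, probs = c(0.025, 0.975), names = FALSE)

# The upper quantile gives the lower endpoint, and vice versa
boot.t.ci <- c(lower = xbar - Q[2] * se, upper = xbar - Q[1] * se)
boot.t.ci
```

Because the quantiles come from the bootstrap t statistics rather than from a t table, the interval need not be symmetric about the sample mean.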
Comparison of 3 Methods – 95% CI for μ
- Repeat the methods described previously on each of M = 1000 random samples of n = 25 drawn from the original population, with B = 1000 bootstrap samples per random sample
- Obtain the empirical coverage rate and average interval width for each method:

  Method                                    Coverage Probability   Average Width (seconds)
  1. t-interval (normality assumption)      .869                   5.05
  2. Bootstrap percentile interval          .849                   4.40
  3. Bootstrap t confidence interval        .903                   22.23
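A coverage study of this kind can be sketched as follows. The 11001-film database is not included here, so the code uses a synthetic lognormal population as a stand-in; that is an assumption, and because the stand-in is far less heavy-tailed than the real data (whose maximum is 1000), its coverage rates and widths should not be expected to match the table. The function names `one_sample_cis` and `coverage_sim`, the seed, and the reduced defaults M = 200 and B = 500 are also assumptions, chosen to keep the run short.

```r
set.seed(25)  # arbitrary seed

# ASSUMPTION: synthetic right-skewed stand-in for the real ASL population
pop <- rlnorm(11001, meanlog = 1.8, sdlog = 0.6)

# Three 95% intervals for the mean from one sample x (rows: method; columns: lower, upper)
one_sample_cis <- function(x, B = 1000, level = 0.95) {
  n <- length(x); xbar <- mean(x); se <- sd(x) / sqrt(n)
  a <- (1 - level) / 2
  # B bootstrap resamples stored as the columns of an n x B matrix
  xs     <- matrix(sample(x, n * B, replace = TRUE), nrow = n)
  m.star <- colMeans(xs)
  s.star <- apply(xs, 2, sd)
  t.star <- (m.star - xbar) / (s.star / sqrt(n))
  q.t    <- quantile(t.star, c(a, 1 - a), names = FALSE)
  rbind(t          = xbar + qt(c(a, 1 - a), df = n - 1) * se,
        percentile = quantile(m.star, c(a, 1 - a), names = FALSE),
        boot.t     = c(xbar - q.t[2] * se, xbar - q.t[1] * se))
}

# Empirical coverage and average width over M random samples from the population
coverage_sim <- function(pop, n = 25, M = 200, B = 500) {
  mu <- mean(pop)
  res <- replicate(M, {
    ci <- one_sample_cis(sample(pop, n), B)   # sample without replacement from pop
    c(cover = ci[, 1] <= mu & mu <= ci[, 2], width = ci[, 2] - ci[, 1])
  })
  rowMeans(res)  # named: cover.t, cover.percentile, cover.boot.t, width.t, ...
}

res <- coverage_sim(pop)
round(res, 3)
```

Calling `coverage_sim(pop, M = 1000, B = 1000)` matches the simulation sizes described on the slide, at the cost of a longer run.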