Presentation is loading. Please wait.

Presentation is loading. Please wait.

Bootstrap and randomization methods

Similar presentations


Presentation on theme: "Bootstrap and randomization methods"— Presentation transcript:

1 Bootstrap and randomization methods
University of Ottawa - Bio 4158 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

2 Randomization methods: background
Most biological hypotheses generate predictions which specify certain patterns in a data set. In such cases, the appropriate statistical null hypotheses specify a lack of pattern, i.e., randomness. Statistical hypothesis testing is then concerned with estimating the likelihood that the observed pattern is simply due to chance. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

3 Randomization methods: background (cont’d)
In standard hypothesis testing, some test statistic S is chosen, which has observed value s in the sample. This value is then compared to the distribution of S expected under the null hypothesis of no pattern to estimate p. Distribution of t under H0. -3 -2 -1 1 2 3 t = 2.01 Probability University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

4 The distribution of S under H0
The distribution of S under H0 is based on a number of assumptions, some or all of which may not hold for the particular sample under consideration E.g. in case of t-test, (1) random sample; (2) equal within-group variances; (3) normal distribution of X within groups. If all assumptions are not valid, the distribution of S under H0 may not be the “theoretical” distribution… …and p will be incorrect. Distribution of t under H0. -3 -2 -1 1 2 3 t = 2.01 Probability University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

5 The bootstrap Data consists of a sample of n observations x1, x2, …, xn from some unknown probability distribution F. Choose m samples of p  n observations, sampling with replacement. For each sample, estimate value of desired parameter(s). Entire sample (n observations) Sample 1, p observations Frequency Sample 2, p observations University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

6 23, 28, 30, 50, 61 Example Bootstrap Samples (B = 5000)
Original Sample (n = 5) 23, 28, 30, 50, 61 Bootstrap Samples (B = 5000) 1 2 3 . 5000 28, 50, 30, 23, 23 30, 50, 50, 61, 28 61, 23, 30, 23, 28 . 28, 50, 30, 61, 30 University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

7 Uncertainty in the Mean
Original Sample (n = 5) 23, 28, 30, 50, 61 = 38.4 Bootstrap Samples (B = 5000) 1 2 3 . 5000 28, 50, 30, 23, 23 30, 50, 50, 61, 28 61, 23, 30, 23, 28 . 28, 50, 30, 61, 30 University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

8 Example: bootstrapped regression
Want to estimate slope (b) of regression of log10 plant species richness (S) on log10 wetland area (A). Estimate b, for m = 500 samples using n = 50. Calculate average b and SD based on bootstrapped estimates. PLANT SPECIES 100 0.2 90 80 70 60 Count Count 50 Proportion per bar 0.1 Proportion per Bar 40 30 20 10 0.0 0.17 0.18 0.19 0.20 0.21 0.22 0.23 CONSTANT University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

9 Comments on usage Usually, p = n, the size of the original data set.
“Naïve” jackknife and bootstrap often provide poor estimates when underlying distributions are “heavy-tailed”. This caveat notwithstanding, it is a good idea to check parametric analysis with non-parametric resampling analysis. If the results are very different, the resampling results are generally considered more accurate. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

10 Randomization tests: the principle
If the null hypothesis (of no pattern) is true, then all possible orders of the data are equally likely. Hence if we randomly reorder the data, and compute s for each reordering, we can generate the distribution of S under H0. We then compare the observed value of S with it’s randomized distribution. The “significance” of s is then the proportion p of randomized values that are as or more extreme than s. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

11 A simple example Females Males
Data consists of set of mandible lengths of 10 female and 10 male golden jackals. Biological prediction: males are larger than females. H0: mfemales= mmales Step 1: calculate average values for males and females, and the difference (D*) between them. Step 2: Put 20 values together, and choose 10 at random. Call these females; the other 10 are males. Calculate difference in average lengths. D* = 4.8 mm. “Females” “Males” University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

12 A simple example (cont’d)
1024 times Step 3. Repeat step 210 = 1024 times (corresponding to all possible data combinations (permutations)) to generate randomized distribution of D. Step 4. Calculate the proportion of randomizations for which D > D* (p = .0018). “Females” “Males” “Males” randomizations) Frequency (no. of D University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

13 Sampling the randomized distribution versus complete enumeration
Females Males Sampling the randomized distribution versus complete enumeration D* = 4.8 mm. When the number of possible permutations is small, complete enumeration is possible… …but more usually, the randomization distribution is resampled using the bootstrap… …with (usually) little loss in accuracy. N bootstrap samples “Females” “Males” randomizations) Frequency (no. of D University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

14 Application of randomization methods
Randomization methods can be applied in many statistical procedures, both univariate (e.g. ANOVA, simple and multiple regresssion, ANCOVA) and multivariate (e.g. MANOVA, PCA, discriminant function analysis, etc.) Randomization tests should be considered in “non-standard” situations or when the assumptions underlying standard assumptions are unlikely to be met. When assumptions of standard tests are met, randomization tests usually give similar significance levels as parametric analysis. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

15 Randomization tests: advantages and disadvantages
Generalizations from the conclusion of a randomization test to a population of interest may not be valid because the results pertain only to the sample at hand. Special software or programming expertise is often required. Valid even when the sample is non-random It is fairly easy to take into account specifics of the situation of interest and use non-standard test statistics. results are exact University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

16 How to bootstrap with S-Plus
Data on biomass of Simuliidae and Hydropsychidae in Southern Québec streams with velocity, depth, rock area, distance from upstream lake. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

17 Example S-Plus commands
attach(simuliidae.2) boot.coef <-bootstrap(simuliidae.2, +coef(lm(bsim~v+z+distance+cs+a+bhydae)),+B=1000, seed=0, trace=F) boot.coef summary(boot.coef) plot(boot.coef) University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

18 attach(simuliidae.2) Pre-load a data set so that variables contained in the data set can be used in formula without the prefix University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

19 boot. coef <-bootstrap(simuliidae
boot.coef <-bootstrap(simuliidae.2, +coef(lm(bsim~v+z+distance+cs+a+bhydae)), +B=1000, seed=0, trace=F) Bootstraps data in simuliidae.2, computes coefficients of the model bsim~v+z+distance+cs+a+bhydae Does 1000 runs (B=1000) Do not show output on screen (Trace=F) Save results in object boot.coef Takes about 30s on my relatively fast computer. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

20 boot.coef Displays general info about results stored in boot.coef
Call: bootstrap(data = simuliidae.2, statistic = coef(lm(bsim ~ v + z + distance + cs + a + bhydae)), B = 1000, seed = 0, trace = F) Number of Replications: 1000 Summary Statistics: Observed Bias Mean SE (Intercept) v z distance cs a bhydae University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

21 summary(boot.coef) Displays summary stats about stored results
Empirical Percentiles: 2.5% % % % (Intercept) v z distance cs a bhydae BCa Confidence Limits: (Intercept) e v e z e distance e cs e a e bhydae e University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

22 plot(boot.coef) University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

23 University of Ottawa - Bio 4118 – Applied Biostatistics
© Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

24 The bootstrap: A panacea?
Very promising, still a bit too nerdy for mainstream use (alas) Increases computing requirements by a factor of 1,000 to 10,000 As for all statistical methods, sample MUST be representative University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM


Download ppt "Bootstrap and randomization methods"

Similar presentations


Ads by Google