Bootstrap and randomization methods

Slides:



Advertisements
Similar presentations
Computational Statistics. Basic ideas  Predict values that are hard to measure irl, by using co-variables (other properties from the same measurement.
Advertisements

CHAPTER 21 Inferential Statistical Analysis. Understanding probability The idea of probability is central to inferential statistics. It means the chance.
Inferential Statistics & Hypothesis Testing
Generalized Linear Models (GLM)
Université d’Ottawa / University of Ottawa 2001 Bio 4118 Applied Biostatistics L10.1 CorrelationCorrelation The underlying principle of correlation analysis.
Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.
STAT 101 Dr. Kari Lock Morgan Exam 2 Review.
Today Concepts underlying inferential statistics
Simple Linear Regression Analysis
Bootstrap spatobotp ttaoospbr Hesterberger & Moore, chapter 16 1.
Chapter 12 Inferential Statistics Gay, Mills, and Airasian
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 21/09/2015 7:46 PM 1 Two-sample comparisons Underlying principles.
Estimation Bias, Standard Error and Sampling Distribution Estimation Bias, Standard Error and Sampling Distribution Topic 9.
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 03/10/2015 6:40 PM Final project: submission Wed Dec 15 th,2004.
Bootstrapping (And other statistical trickery). Reminder Of What We Do In Statistics Null Hypothesis Statistical Test Logic – Assume that the “no effect”
PARAMETRIC STATISTICAL INFERENCE
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 08/10/ :23 PM 1 Some basic statistical concepts, statistics.
+ Chapter 12: Inference for Regression Inference for Linear Regression.
9 Mar 2007 EMBnet Course – Introduction to Statistics for Biologists Nonparametric tests, Bootstrapping
Educational Research: Competencies for Analysis and Application, 9 th edition. Gay, Mills, & Airasian © 2009 Pearson Education, Inc. All rights reserved.
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 21/10/ :24 PM 1 Review and important concepts Biological.
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 23/10/2015 9:22 PM 1 Two-sample comparisons Underlying principles.
CHAPTER 17: Tests of Significance: The Basics
PCB 3043L - General Ecology Data Analysis. OUTLINE Organizing an ecological study Basic sampling terminology Statistical analysis of data –Why use statistics?
Chapter 7 Probability and Samples: The Distribution of Sample Means.
Stat 112: Notes 2 Today’s class: Section 3.3. –Full description of simple linear regression model. –Checking the assumptions of the simple linear regression.
Limits to Statistical Theory Bootstrap analysis ESM April 2006.
Statistics: Unlocking the Power of Data Lock 5 Exam 2 Review STAT 101 Dr. Kari Lock Morgan 11/13/12 Review of Chapters 5-9.
Introduction to Statistical Inference Jianan Hui 10/22/2014.
28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.
Analyzing Statistical Inferences July 30, Inferential Statistics? When? When you infer from a sample to a population Generalize sample results to.
Tutorial I: Missing Value Analysis
Université d’Ottawa / University of Ottawa 2001 Bio 4118 Applied Biostatistics L14.1 Lecture 14: Contingency tables and log-linear models Appropriate questions.
Chapter 9: Introduction to the t statistic. The t Statistic The t statistic allows researchers to use sample data to test hypotheses about an unknown.
Non-parametric Approaches The Bootstrap. Non-parametric? Non-parametric or distribution-free tests have more lax and/or different assumptions Properties:
BIOL 582 Lecture Set 2 Inferential Statistics, Hypotheses, and Resampling.
Educational Research Inferential Statistics Chapter th Chapter 12- 8th Gay and Airasian.
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Simulation-based inference beyond the introductory course Beth Chance Department of Statistics Cal Poly – San Luis Obispo
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 28/06/2016 4:11 PM 1 Review and important concepts.
Université d’Ottawa / University of Ottawa 2001 Bio 4118 Applied Biostatistics L12.1 Lecture 12: Generalized Linear Models (GLM) What are they? When do.
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 22/11/ :12 AM 1 Contingency tables and log-linear models.
Stats Methods at IC Lecture 3: Regression.
Chapter 3 INTERVAL ESTIMATES
32931 Technology Research Methods Autumn 2017 Quantitative Research Component Topic 4: Bivariate Analysis (Contingency Analysis and Regression Analysis)
Chapter 3 INTERVAL ESTIMATES
ECO 173 Chapter 10: Introduction to Estimation Lecture 5a
Assumptions For testing a claim about the mean of a single population
Chapter 4. Inference about Process Quality
Modify—use bio. IB book  IB Biology Topic 1: Statistical Analysis
PCB 3043L - General Ecology Data Analysis.
Chapter 8: Inference for Proportions
12 Inferential Analysis.
ECO 173 Chapter 10: Introduction to Estimation Lecture 5a
The Practice of Statistics in the Life Sciences Fourth Edition
Test for Mean of a Non-Normal Population – small n
BOOTSTRAPPING: LEARNING FROM THE SAMPLE
I. Statistical Tests: Why do we use them? What do they involve?
Ch13 Empirical Methods.
Categorical Data Analysis Review for Final
12 Inferential Analysis.
Elements of a statistical test Statistical null hypotheses
Simple Linear Regression
Cross-validation Brenda Thomson/ Peter Fox Data Analytics
Chapter 13: Inferences about Comparing Two Populations
Inferences for Regression
MGS 3100 Business Analysis Regression Feb 18, 2016
Introductory Statistics
Evaluation David Kauchak CS 158 – Fall 2019.
Presentation transcript:

Bootstrap and randomization methods University of Ottawa - Bio 4158 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Randomization methods: background Most biological hypotheses generate predictions which specify certain patterns in a data set. In such cases, the appropriate statistical null hypotheses specify a lack of pattern, i.e., randomness. Statistical hypothesis testing is then concerned with estimating the likelihood that the observed pattern is simply due to chance. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Randomization methods: background (cont’d) In standard hypothesis testing, some test statistic S is chosen, which has observed value s in the sample. This value is then compared to the distribution of S expected under the null hypothesis of no pattern to estimate p. Distribution of t under H0. -3 -2 -1 1 2 3 t = 2.01 Probability University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

The distribution of S under H0 The distribution of S under H0 is based on a number of assumptions, some or all of which may not hold for the particular sample under consideration E.g. in case of t-test, (1) random sample; (2) equal within-group variances; (3) normal distribution of X within groups. If all assumptions are not valid, the distribution of S under H0 may not be the “theoretical” distribution… …and p will be incorrect. Distribution of t under H0. -3 -2 -1 1 2 3 t = 2.01 Probability University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

The bootstrap Data consists of a sample of n observations x1, x2, …, xn from some unknown probability distribution F. Choose m samples of p  n observations, sampling with replacement. For each sample, estimate value of desired parameter(s). Entire sample (n observations) Sample 1, p observations Frequency Sample 2, p observations University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

23, 28, 30, 50, 61 Example Bootstrap Samples (B = 5000) Original Sample (n = 5) 23, 28, 30, 50, 61 Bootstrap Samples (B = 5000) 1 2 3 . 5000 28, 50, 30, 23, 23 30, 50, 50, 61, 28 61, 23, 30, 23, 28 . 28, 50, 30, 61, 30 University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Uncertainty in the Mean Original Sample (n = 5) 23, 28, 30, 50, 61 = 38.4 Bootstrap Samples (B = 5000) 1 2 3 . 5000 28, 50, 30, 23, 23 30, 50, 50, 61, 28 61, 23, 30, 23, 28 . 28, 50, 30, 61, 30 University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Example: bootstrapped regression Want to estimate slope (b) of regression of log10 plant species richness (S) on log10 wetland area (A). Estimate b, for m = 500 samples using n = 50. Calculate average b and SD based on bootstrapped estimates. PLANT SPECIES 100 0.2 90 80 70 60 Count Count 50 Proportion per bar 0.1 Proportion per Bar 40 30 20 10 0.0 0.17 0.18 0.19 0.20 0.21 0.22 0.23 CONSTANT University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Comments on usage Usually, p = n, the size of the original data set. “Naïve” jackknife and bootstrap often provide poor estimates when underlying distributions are “heavy-tailed”. This caveat notwithstanding, it is a good idea to check parametric analysis with non-parametric resampling analysis. If the results are very different, the resampling results are generally considered more accurate. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Randomization tests: the principle If the null hypothesis (of no pattern) is true, then all possible orders of the data are equally likely. Hence if we randomly reorder the data, and compute s for each reordering, we can generate the distribution of S under H0. We then compare the observed value of S with it’s randomized distribution. The “significance” of s is then the proportion p of randomized values that are as or more extreme than s. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

A simple example Females Males Data consists of set of mandible lengths of 10 female and 10 male golden jackals. Biological prediction: males are larger than females. H0: mfemales= mmales Step 1: calculate average values for males and females, and the difference (D*) between them. Step 2: Put 20 values together, and choose 10 at random. Call these females; the other 10 are males. Calculate difference in average lengths. D* = 4.8 mm. “Females” “Males” University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

A simple example (cont’d) 1024 times Step 3. Repeat step 210 = 1024 times (corresponding to all possible data combinations (permutations)) to generate randomized distribution of D. Step 4. Calculate the proportion of randomizations for which D > D* (p = .0018). “Females” “Males” “Males” randomizations) Frequency (no. of D University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Sampling the randomized distribution versus complete enumeration Females Males Sampling the randomized distribution versus complete enumeration D* = 4.8 mm. When the number of possible permutations is small, complete enumeration is possible… …but more usually, the randomization distribution is resampled using the bootstrap… …with (usually) little loss in accuracy. N bootstrap samples “Females” “Males” randomizations) Frequency (no. of D University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Application of randomization methods Randomization methods can be applied in many statistical procedures, both univariate (e.g. ANOVA, simple and multiple regresssion, ANCOVA) and multivariate (e.g. MANOVA, PCA, discriminant function analysis, etc.) Randomization tests should be considered in “non-standard” situations or when the assumptions underlying standard assumptions are unlikely to be met. When assumptions of standard tests are met, randomization tests usually give similar significance levels as parametric analysis. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Randomization tests: advantages and disadvantages Generalizations from the conclusion of a randomization test to a population of interest may not be valid because the results pertain only to the sample at hand. Special software or programming expertise is often required. Valid even when the sample is non-random It is fairly easy to take into account specifics of the situation of interest and use non-standard test statistics. results are exact University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

How to bootstrap with S-Plus Data on biomass of Simuliidae and Hydropsychidae in Southern Québec streams with velocity, depth, rock area, distance from upstream lake. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

Example S-Plus commands attach(simuliidae.2) boot.coef <-bootstrap(simuliidae.2, +coef(lm(bsim~v+z+distance+cs+a+bhydae)),+B=1000, seed=0, trace=F) boot.coef summary(boot.coef) plot(boot.coef) University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

attach(simuliidae.2) Pre-load a data set so that variables contained in the data set can be used in formula without the prefix University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

boot. coef <-bootstrap(simuliidae boot.coef <-bootstrap(simuliidae.2, +coef(lm(bsim~v+z+distance+cs+a+bhydae)), +B=1000, seed=0, trace=F) Bootstraps data in simuliidae.2, computes coefficients of the model bsim~v+z+distance+cs+a+bhydae Does 1000 runs (B=1000) Do not show output on screen (Trace=F) Save results in object boot.coef Takes about 30s on my relatively fast computer. University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

boot.coef Displays general info about results stored in boot.coef Call: bootstrap(data = simuliidae.2, statistic = coef(lm(bsim ~ v + z + distance + cs + a + bhydae)), B = 1000, seed = 0, trace = F) Number of Replications: 1000 Summary Statistics: Observed Bias Mean SE (Intercept) 1286.0921 -11.81883 1274.2733 1030.0889 v 12.9293 1.47244 14.4017 8.6100 z -21.1681 -0.37184 -21.5399 15.0382 distance -0.3414 -0.08628 -0.4277 0.3639 cs 1593.8676 -266.76905 1327.0985 991.8828 a -12.0341 0.13088 -11.9032 7.1988 bhydae 2.2011 -0.04142 2.1596 1.1954 University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

summary(boot.coef) Displays summary stats about stored results Empirical Percentiles: 2.5% 5% 95% 97.5% (Intercept) -673.6972 -364.8509 3014.47002 3351.15131 v 0.6127 2.1767 29.39770 34.19882 z -52.5363 -46.8128 1.79636 5.79851 distance -1.3285 -1.0104 -0.01941 0.02848 cs -965.4570 -533.9443 2613.58281 2803.40006 a -26.7147 -24.2897 -1.13750 0.88042 bhydae 0.2071 0.5157 3.44577 3.78338 BCa Confidence Limits: (Intercept) -549.5441 -244.7880 3.135e+003 3453.05689 v 0.9249 2.4455 2.995e+001 35.39557 z -56.2970 -49.9729 -3.548e-001 3.94172 distance -1.1765 -0.8783 6.617e-003 0.04704 cs -489.0288 -40.4751 2.806e+003 2982.26977 a -27.1140 -24.9182 -1.663e+000 0.47114 bhydae 0.2804 0.7097 3.580e+000 4.00391 University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

plot(boot.coef) University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM

The bootstrap: A panacea? Very promising, still a bit too nerdy for mainstream use (alas) Increases computing requirements by a factor of 1,000 to 10,000 As for all statistical methods, sample MUST be representative University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 30/04/2019 3:02 AM