Bootstraps and Jackknives
Hal Whitehead
BIOL4062/5062
Confidence in estimators
Why use bootstraps or jackknives?
The jackknife
The parametric bootstrap
The non-parametric bootstrap (“The bootstrap”)
Estimation without confidence (standard error, confidence interval) has little value
Confidence in estimates: Traditional approach
[Diagram: DATA, biological model, estimator (statistic), statistical model, confidence in estimator?]
Confidence in estimates: Traditional approach
e.g. What is the sex ratio of a vole population?
Trap: 12 males, 15 females
Estimated ratio (proportion male): 12/(12+15) = 0.444
Using the binomial distribution: $\mathrm{SE} = \sqrt{0.444 \times (1 - 0.444)/(12+15)} = 0.096$
So: the sex ratio is estimated to be 0.444 (SE 0.096)
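The arithmetic on this slide can be checked in a few lines. This is a minimal sketch using only the counts given above; the variable names are illustrative.

```python
import math

males, females = 12, 15           # trapped voles, from the slide
n = males + females

p = males / n                     # estimated proportion of males
se = math.sqrt(p * (1 - p) / n)   # binomial standard error of a proportion

print(f"sex ratio = {p:.3f} (SE {se:.3f})")
```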
e.g. Asymmetry of size among nestlings in nests of 6
Measure: difference between the size of a nestling and its most similar neighbour
{ } => [ ] = 0.58
But what confidence have we in this?
Confidence in estimator: Mean distance between animals
In a small population: what is the expected distance between any two animals?
Estimate: the mean of the distances between all pairs of animals
What is the confidence in this estimate?
–No easy formula: the pairwise distances are not independent, because each animal contributes to many pairs
Use Bootstraps and Jackknives when:
No clear biological model
Deriving a statistical model is very difficult, impossible, or tedious
Statistical model too complicated to be useful
Model may not be quite valid
Accurate measure of precision under the statistical model is only possible with large n
The Jackknife
Data $D = \{X_1, X_2, X_3, \ldots, X_n\}$ => statistic $s$
Jackknife replicates miss out units (or groups of units) in turn:
–$J_1 = \{X_2, X_3, \ldots, X_n\}$ => statistic $s_{-1}$ (missing unit 1)
–$J_2 = \{X_1, X_3, \ldots, X_n\}$ => statistic $s_{-2}$ (missing unit 2)
–etc.
Convert into pseudovalues:
–$\phi_1 = n \cdot s - (n-1)\, s_{-1}$
–$\phi_2 = n \cdot s - (n-1)\, s_{-2}$
–etc.
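The Jackknife
The jackknifed estimate of $s$ is then:
–$s_J = \mathrm{mean}(\phi_1, \ldots, \phi_n)$
$\mathrm{SE}(s) = \mathrm{SE}(\phi_1, \ldots, \phi_n)$, i.e. the standard error of the mean of the pseudovalues, $\mathrm{SD}(\phi_1, \ldots, \phi_n)/\sqrt{n}$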
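The pseudovalue recipe can be sketched as a small function. The function name `jackknife`, the helper `biased_variance`, and the sample values are illustrative assumptions, not part of the lecture. The example jackknifes the variance computed with divisor n, a biased statistic, to show the bias correction at work.

```python
import math
import statistics

def jackknife(data, statistic):
    """Return (jackknifed estimate, jackknife SE, pseudovalues) for a list."""
    n = len(data)
    s = statistic(data)                           # statistic on the full data
    pseudovalues = []
    for i in range(n):
        s_minus_i = statistic(data[:i] + data[i + 1:])   # leave out unit i
        pseudovalues.append(n * s - (n - 1) * s_minus_i)
    s_j = statistics.mean(pseudovalues)
    se = statistics.stdev(pseudovalues) / math.sqrt(n)
    return s_j, se, pseudovalues

def biased_variance(x):
    """Variance with divisor n (biased downwards for small samples)."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# Hypothetical sample, for illustration only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

s_j, se, _ = jackknife(sample, biased_variance)
print(f"biased variance     = {biased_variance(sample):.4f}")
print(f"jackknifed estimate = {s_j:.4f} (SE {se:.4f})")
print(f"unbiased variance   = {statistics.variance(sample):.4f}")
```

For this particular statistic the jackknifed estimate coincides with the usual unbiased sample variance (divisor n-1), which makes it a convenient check that the pseudovalues are computed correctly.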
The Jackknife
Jackknifed estimate reduces bias (it removes the leading, order 1/n, term)
Jackknife SE is “rough and ready”
–usually “conservative” (overestimates SE)
Jackknife on blocks of units if the data are not independent
Confidence intervals assume normality
Correlation between gill weight and body weight in 12 crabs
[Table columns: Gill (mg), Body (g), $r_{-i}$, $\phi_i$]
Jackknife $r$ = mean($\phi_i$), SE = SD($\phi_i$)/$\sqrt{12}$
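A sketch of the same calculation for a correlation coefficient follows. The twelve gill and body weights below are hypothetical stand-ins, not the values from the slide; the point is that the unit left out each time is a whole crab, i.e. a (gill, body) pair.

```python
import math
import statistics

# Hypothetical measurements for 12 crabs (NOT the data from the slide).
gill_mg = [160, 230, 180, 320, 195, 270, 140, 210, 355, 120, 250, 190]
body_g = [14.4, 15.2, 11.3, 23.1, 12.0, 17.5, 9.8, 13.9, 22.7, 8.6, 16.1, 12.9]

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

n = len(gill_mg)
r = pearson_r(gill_mg, body_g)            # correlation on all 12 crabs

pseudovalues = []
for i in range(n):
    # Drop crab i from both variables, then recompute r.
    r_minus_i = pearson_r(gill_mg[:i] + gill_mg[i + 1:],
                          body_g[:i] + body_g[i + 1:])
    pseudovalues.append(n * r - (n - 1) * r_minus_i)

jackknife_r = statistics.mean(pseudovalues)
jackknife_se = statistics.stdev(pseudovalues) / math.sqrt(n)
print(f"r = {r:.3f}; jackknife r = {jackknife_r:.3f} (SE {jackknife_se:.3f})")
```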
Bootstraps
Parametric Bootstrap
Assume the data are produced by a model with some unknown parameters, which need to be estimated:
–Model => Data => Parameter estimates ($s$)
The bootstrap process:
–Model + Parameter estimates ($s$) => Random data => Bootstrap replicate estimates ($s^*$)
The distribution of the bootstrap replicate estimates (the $s^*$'s) gives the distribution, confidence intervals and standard error of $s$ (plus an indicator of bias)
Usually use ,000 bootstrap replicates
Parametric Bootstrap: an example
Mark-Recapture Estimate
Mark 25 animals
Recapture 46, of which 12 are marked
What is the population size?
“Petersen” estimate is 25 × 46/12 = 95.8
What is the confidence in this estimate, and its expected bias?
Parametric Bootstrap: an example
Mark-Recapture Estimate
Mark 25 animals; recapture 46, of which 12 are marked
“Petersen” estimate is 25 × 46/12 = 95.8
What is the confidence, expected bias?
Parametric bootstrap replicates:
–Assume 96 animals, mark 25, recapture 46
–How many marked?
–From simulation ($m_s$ =):
–Calculate population estimates ($n_s = 25 \times 46/m_s$)
Parametric Bootstrap: an example
Mark-Recapture Estimate
“Petersen” estimate is 25 × 46/12 = 95.8
Bootstrap population estimates (assuming n = 96)
Expected bias:
–mean($n_s$) - 96 = 3.7
Estimated standard error:
–SD($n_s$) = 20.4
So the bias-corrected population estimate is 95.8 - 3.7 = 92.1 (SE 20.4)
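The simulation behind this example might look like the sketch below. It assumes the recapture sample is drawn without replacement from a closed population of 96, which the slides do not state explicitly. The replicate count of 5000, the seed, and the rule of discarding replicates with no marked animals are likewise assumptions made for illustration. Results will differ slightly from the slide's 3.7 and 20.4 because the random draws differ.

```python
import random
import statistics

random.seed(1)                       # arbitrary seed, for repeatability only

MARKED, RECAPTURED, OBS_MARKED = 25, 46, 12      # from the slide
petersen = MARKED * RECAPTURED / OBS_MARKED      # 95.8
N_ASSUMED = round(petersen)                      # model population size: 96
REPLICATES = 5000                                # assumed replicate count

# Model population: True = marked animal, False = unmarked animal.
population = [True] * MARKED + [False] * (N_ASSUMED - MARKED)

estimates = []
while len(estimates) < REPLICATES:
    # Simulate one recapture session (sampling without replacement).
    m_s = sum(random.sample(population, RECAPTURED))
    if m_s == 0:
        continue                     # Petersen estimate undefined; redraw
    estimates.append(MARKED * RECAPTURED / m_s)  # n_s = 25 x 46 / m_s

bias = statistics.mean(estimates) - N_ASSUMED    # expected bias
se = statistics.stdev(estimates)                 # estimated standard error
corrected = petersen - bias                      # bias-corrected estimate

print(f"bias = {bias:.1f}, SE = {se:.1f}, corrected estimate = {corrected:.1f}")
```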
Non-Parametric Bootstrap (a.k.a. “The Bootstrap”)
Data $D = \{X_1, X_2, X_3, \ldots, X_n\}$ => statistic $s$
Bootstrap replicates:
–$D^{*1} = \{X^*_1, X^*_2, X^*_3, \ldots, X^*_n\}$ => statistic $s^{*1}$
–$D^{*2} = \{X^*_1, X^*_2, X^*_3, \ldots, X^*_n\}$ => statistic $s^{*2}$
–...
$X^*_1, X^*_2, X^*_3, \ldots, X^*_n$ are randomly selected, with replacement, from $X_1, X_2, X_3, \ldots, X_n$ (a fresh draw for each replicate)
The distribution, confidence interval and SE of $s$ are estimated from the distribution, percentiles and standard deviation of the $s^*$'s
Usually use ,000 bootstrap replicates
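A minimal sketch of the resampling loop follows, applied to the mean so that the bootstrap SE can be compared with the familiar SD/√n formula. The function name `bootstrap_replicates`, the default of 1000 replicates, and the sample values are assumptions for illustration.

```python
import math
import random
import statistics

def bootstrap_replicates(data, statistic, n_boot=1000):
    """Return n_boot values of the statistic on resamples of the data."""
    n = len(data)
    # random.choices draws n values with replacement from the data.
    return [statistic(random.choices(data, k=n)) for _ in range(n_boot)]

random.seed(2)                       # arbitrary seed, for repeatability only

# Hypothetical sample, for illustration only.
sample = [4.1, 5.6, 3.8, 6.2, 5.0, 4.7, 5.9, 4.4, 6.8, 5.3]

replicates = bootstrap_replicates(sample, statistics.mean)
boot_se = statistics.stdev(replicates)           # SD of the s* values
analytic_se = statistics.stdev(sample) / math.sqrt(len(sample))

print(f"bootstrap SE = {boot_se:.3f}, analytic SE = {analytic_se:.3f}")
```

The two standard errors should be close for the mean. The value of the bootstrap lies in statistics for which no analytic formula is available, where only the `statistic` argument needs to change.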
Non-Parametric Bootstrap: an example: Median Gill Weight in Crabs
Gill weights (in mg): median = 195 mg
[Table: the real sample and the bootstrap replicates (rows labelled B), each with its median]
Non-Parametric Bootstrap: an example: Median Gill Weight in Crabs
Gill weights (in mg): median = 195 mg
Bootstrap (1000 samples): mean of the bootstrap medians = 188 mg
95% c.i. = mg [b(25) - b(975)], i.e. from the 25th to the 975th of the 1000 sorted bootstrap medians
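The percentile interval b(25) to b(975) can be sketched as follows. The gill weights here are hypothetical, since the slide's values are not reproduced above, so the numbers printed will not match the 195 mg and 188 mg on the slide.

```python
import random
import statistics

random.seed(3)                       # arbitrary seed, for repeatability only

# Hypothetical gill weights in mg (NOT the data from the slide).
gill_mg = [132, 148, 163, 171, 184, 192, 201, 215, 228, 243, 267, 305]
N_BOOT = 1000                        # as on the slide

# Median of each bootstrap resample, sorted from smallest to largest.
boot_medians = sorted(
    statistics.median(random.choices(gill_mg, k=len(gill_mg)))
    for _ in range(N_BOOT)
)

# b(25) and b(975): the 25th and 975th sorted values (list indices 24, 974).
lower, upper = boot_medians[24], boot_medians[974]

print(f"sample median = {statistics.median(gill_mg)} mg")
print(f"mean of bootstrap medians = {statistics.mean(boot_medians):.1f} mg")
print(f"95% c.i. = {lower} to {upper} mg")
```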
Bootstraps in Molecular Genetics
Calculate tree based on genetic data
–(e.g. 20 species and 300 loci)
For each bootstrap replicate:
–Resample loci with replacement
–(20 species with 300 loci, some repeated)
–Calculate tree
Look at agreement between original and bootstrap trees
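The resampling step is sketched below: the units resampled are loci (columns), and every species keeps all 300 positions. Tree construction itself is beyond a short example, so the sketch swaps in a much simpler stand-in. It asks which species is the nearest neighbour of species 0 by Hamming distance, and how often the bootstrap replicates agree with the original answer. The alignment is randomly generated and entirely hypothetical, as are the function names.

```python
import random

random.seed(4)                       # arbitrary seed, for repeatability only
N_SPECIES, N_LOCI = 20, 300          # sizes from the slide

# Hypothetical binary character matrix: one row per species.
alignment = [[random.randint(0, 1) for _ in range(N_LOCI)]
             for _ in range(N_SPECIES)]
# Make species 1 a close relative of species 0 (differs at 10 loci only).
alignment[1] = [1 - c if j < 10 else c for j, c in enumerate(alignment[0])]

def resample_loci(aln):
    """Resample columns (loci) with replacement; all species are kept."""
    n_loci = len(aln[0])
    cols = random.choices(range(n_loci), k=n_loci)
    return [[row[c] for c in cols] for row in aln]

def nearest_neighbour(aln, i):
    """Stand-in for tree building: species closest to species i."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    others = (j for j in range(len(aln)) if j != i)
    return min(others, key=lambda j: hamming(aln[i], aln[j]))

original = nearest_neighbour(alignment, 0)
N_BOOT = 100
agree = sum(nearest_neighbour(resample_loci(alignment), 0) == original
            for _ in range(N_BOOT))
support = agree / N_BOOT             # proportion of replicates that agree
print(f"nearest neighbour of species 0: {original}, support = {support:.2f}")
```

In a real analysis the `nearest_neighbour` stand-in would be replaced by a tree-building method, and support would be tallied for each grouping in the original tree.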
[Figure: Bootstrapped spanning tree. Glazko & Nei, Mol. Biol. Evol. 2003]
Bootstraps
“Better” estimate of confidence
Variable n
Self-comparisons a problem
–e.g. Mean of associations
Gives SEs, confidence intervals and profile of confidence

Jackknives
“Worse” estimate of confidence
–Usually conservative (underestimates precision)
Fixed n
Self-comparisons not a problem
Reduces bias
Only directly gives SE
–Confidence intervals need assumption of normality
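One reading of the self-comparison problem: when the statistic compares units with each other, a bootstrap resample can contain the same animal twice, and an animal compared with itself has distance zero. The sketch below illustrates this with the mean pairwise distance between animals. The coordinates are randomly generated and hypothetical, and this interpretation of the slide is an assumption.

```python
import itertools
import math
import random
import statistics

random.seed(5)                       # arbitrary seed, for repeatability only

# Hypothetical (x, y) positions of 10 animals.
animals = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(10)]

def mean_pairwise_distance(points):
    """Mean Euclidean distance over all pairs of entries in points."""
    pairs = itertools.combinations(points, 2)
    return statistics.mean(math.dist(a, b) for a, b in pairs)

original = mean_pairwise_distance(animals)

# Resampling animals with replacement creates duplicate animals, and each
# duplicated pair contributes a distance of zero to the replicate mean.
boot = [mean_pairwise_distance(random.choices(animals, k=len(animals)))
        for _ in range(1000)]
boot_mean = statistics.mean(boot)

print(f"original mean distance        = {original:.1f}")
print(f"mean over bootstrap replicates = {boot_mean:.1f}")
```

The bootstrap replicates come out systematically lower than the original estimate, an artefact of the duplicated animals rather than a property of the population. A jackknife over animals avoids this, because leaving an animal out never creates a duplicate.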
Bootstraps and Jackknives
Give estimates of confidence (and bias) when:
–distributions unknown, approximate, or intractable
Parametric bootstrap
–very useful if model known
–needs programming
Non-parametric bootstrap
–widely applicable (except self-referencing situations)
–few assumptions
Jackknife
–approximate
–only standard error given directly
–useful when bootstrap not applicable