Sampling And Resampling Risk Analysis for Water Resources Planning and Management Institute for Water Resources May 2007
Learning Objectives At the end of this session participants will be able to: list sampling techniques; estimate desired sample size; describe a resample and a bootstrap procedure
Why Sample? Expense Time Impossible/impractical Better data from a sample
We Want Representativeness (figure: men, M, and women, W; one sample labeled Representative, two labeled Not Representative)
Impediments to Representativeness Bias: systematic differences between sample and population; can be eliminated by good sample design. Sample error: unavoidable differences due to chance in selection of a sample; this is “dumb luck”; it cannot be eliminated or avoided, so we have to address it.
Sampling Techniques Simple Random Sample Stratified Sample Cluster Sample Sequential Sample Non-Probability Sample/Convenience Sample
Think of Random Like This You want to identify each element of the population. Place a unique number from 1 to N on it. Randomly select a number between 1 and N. Find the item with that number and measure it. Replace the number and repeat. If the same number comes up again, ignore it, replace it and choose again.
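The numbered-tag procedure above can be sketched in Python. This is a minimal illustration rather than part of the course material: the function name `simple_random_sample`, the seed, and the frame of 501 house labels are all hypothetical.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements by repeatedly picking a tag number from 1 to N."""
    N = len(population)
    if n > N:
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < n:
        k = rng.randint(1, N)      # random number between 1 and N, inclusive
        if k in chosen:
            continue               # same number came up again: ignore it and draw again
        chosen.add(k)
    return [population[k - 1] for k in sorted(chosen)]

# Hypothetical sampling frame of N = 501 homes
homes = [f"home-{i}" for i in range(1, 502)]
sample = simple_random_sample(homes, 43, seed=1)
print(len(sample))
```

In practice the standard library call `random.sample(population, n)` draws the same kind of sample (without replacement) in one step.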
How do you take a sample? Objectives of survey What question(s) are you trying to answer? What information do you need? ID target population Obtain sample frame Sample design Method of measurement Measurement instrument
How do you take a sample? Select and train field workers Pretest Organize field work Organize data management Data analysis
How big should a sample be? Trade-offs determine this: precision (size of interval estimate), accuracy (capturing the value), sample size (what’s it going to cost?). What is important to your decision process? You pick any two and the third is determined for you.
Sample Size for Mean n = (z × s / E)^2, where n is the size of the sample; E is the allowable error (precision); z is the z-value for the level of confidence (accuracy); s is the sample SD (from a pilot survey or a guesstimate)
Example Mean house value N = 501, E = $3000, z = 1.96 (95%), s = $10,000; n = [(1.96 × 10000)/3000]^2 = 43
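The arithmetic in this example can be checked with a short Python sketch. The function name `sample_size_for_mean` is hypothetical; it returns the unrounded value so the rounding step stays visible.

```python
def sample_size_for_mean(z, s, E):
    """Unrounded sample size for estimating a mean: (z * s / E) squared."""
    return (z * s / E) ** 2

# Values from the house-value example: z = 1.96, s = $10,000, E = $3,000
n = sample_size_for_mean(z=1.96, s=10_000, E=3_000)
print(round(n, 1), round(n))  # 42.7 43
```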
Sample Size for Proportion n = p × (1 - p) × (z / E)^2, where p is the proportion with the characteristic; 1 - p is the proportion without the characteristic; z and E as before
Example Proportion of homes with basement N = 501, p = .5, 1 - p = .5, z = 1.96, E = .03; n = .5 × .5 × (1.96/.03)^2 = 1067
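A short Python sketch of the proportion calculation follows. The function name `sample_size_for_proportion` is hypothetical. It uses E = 0.03, the value that appears in the slide's arithmetic and reproduces n = 1067; for comparison, E = 0.05 gives a much smaller sample.

```python
def sample_size_for_proportion(p, z, E):
    """Unrounded sample size for estimating a proportion: p(1-p)(z/E)^2."""
    return p * (1 - p) * (z / E) ** 2

n_03 = sample_size_for_proportion(p=0.5, z=1.96, E=0.03)
n_05 = sample_size_for_proportion(p=0.5, z=1.96, E=0.05)
print(round(n_03))  # 1067
print(round(n_05))  # 384
```

Using p = 0.5 gives the largest value of p(1-p), so it is the conservative choice when nothing is known about the proportion.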
What happens when the population has fewer members than the calculated sample size requires? Step One: Calculate the sample size as before. Step Two: Calculate the new sample size, n = n_0 / (1 + n_0/N), where n_0 is the sample size calculated in step one.
What Happens if n > N? First, calculate the sample size as before. Second, calculate the new sample size using: n_new = n_old / [1 + (n_old/N)]; n_new = 1067 / [1 + (1067/501)] = 340.9, or about 341
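The two-step adjustment can be sketched in Python as follows. The function name `adjust_for_finite_population` is hypothetical; the inputs are the 1067 and N = 501 from the basement example.

```python
def adjust_for_finite_population(n_old, N):
    """Shrink a calculated sample size for a population of only N members."""
    return n_old / (1 + n_old / N)

n_new = adjust_for_finite_population(n_old=1067, N=501)
print(round(n_new, 1))  # 340.9
```

The adjusted value is always smaller than N, so the correction also resolves the case where the first calculation asks for more units than the population contains.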
How n is Chosen in Practice Arbitrarily select a sample size As large a sample as you can get for a budget Pick a percentage for your sample Identify sample size required to obtain precision and accuracy desired!
With Good Samples…. We have classical statistical techniques that enable us to make inferences about the populations from which the samples were drawn Confidence intervals Hypothesis testing
Resampling Statistics is changing: computers make computational methods that were once inconceivable possible. Bootstrap; permutation tests; other resampling methods
Advantages of Resampling Fewer assumptions—normal and large n not required Greater accuracy—can be better than classical methods in some cases Generality—approach is pretty similar Promote understanding—not so theoretical
Bootstrapping Procedure 1) Resample 2) Calculate bootstrap distribution 3) Use bootstrap distribution
Bootstrap Idea Original sample represents population Take resamples by sampling with replacement from original random sample They “represent” many samples from population Bootstrap distribution of statistic represents sampling distribution
Concept 594 structure values ($1,000s) You want the population mean Glance says not normal Mean = SD = 20.6
Original & Resample
Calculate Bootstrap Distribution Calculate statistic for each resample and make distribution of them
Resampling Distribution Took 500 samples of n = 594 with replacement from the original sample Calculated (500) means of these 500 samples Plot the resampling distribution of means (nearly normal) Mean = (close) SD = 0.8
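The resampling step described above can be sketched with the Python standard library. The course's 594 structure values are not reproduced here, so the data are a synthetic, skewed stand-in; the function name `bootstrap_distribution`, the seeds, and the lognormal parameters are all assumptions for illustration.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, n_resamples=500, seed=None):
    """Apply `statistic` to n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # random.choices samples WITH replacement; each resample has the original size n
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Synthetic stand-in for 594 skewed values (NOT the course data)
data_rng = random.Random(0)
values = [data_rng.lognormvariate(5.0, 0.15) for _ in range(594)]

boot_means = bootstrap_distribution(values, statistics.fmean, n_resamples=500, seed=1)
print("sample mean:        ", statistics.fmean(values))
print("mean of resamples:  ", statistics.fmean(boot_means))
print("SD of resample means:", statistics.stdev(boot_means))
```

The SD of the resample means should land close to the sample SD divided by the square root of n, which is why the slide's 20.6 shrinks to roughly 0.8 for n = 594.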
Bootstrap a Statistic Draw hundreds of resamples with replacement from original sample Inspect the bootstrap distribution of resampled statistics Bootstrap distribution approximates sampling distribution Approximate shape and spread, centers on original statistic not parameter Does not replace or add to data
Use Bootstrap Distribution Study characteristics of resampling distribution for insight
Bootstrap Mean & Confidence Intervals Sample Mean 155.4; Resamples Mean 155.9; Bias +0.5; Standard Error; percentile, percentile, percentile, percentile 157.6; Confidence Interval 95% (t) 155.9 ± 1.6
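The quantities on this slide (bias, standard error, percentiles, and a 95% interval) can be computed from a bootstrap distribution roughly as follows. This is a sketch under stated assumptions: the data are synthetic, the name `bootstrap_summary` is hypothetical, the interval is centered on the sample mean, and the multiplier 1.96 stands in for the t value, which is nearly identical at this sample size.

```python
import random
import statistics

def bootstrap_summary(sample, n_resamples=1000, seed=None):
    """Bias, standard error, and two 95% intervals for the sample mean."""
    rng = random.Random(seed)
    n = len(sample)
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    sample_mean = statistics.fmean(sample)
    boot_mean = statistics.fmean(means)
    se = statistics.stdev(means)                 # bootstrap standard error
    cuts = statistics.quantiles(means, n=40)     # 39 cut points at 2.5% steps
    return {
        "sample_mean": sample_mean,
        "resample_mean": boot_mean,
        "bias": boot_mean - sample_mean,
        "standard_error": se,
        "percentile_ci": (cuts[0], cuts[-1]),    # 2.5th and 97.5th percentiles
        "normal_ci": (sample_mean - 1.96 * se, sample_mean + 1.96 * se),
    }

# Synthetic stand-in data (NOT the course data)
data_rng = random.Random(0)
values = [data_rng.lognormvariate(5.0, 0.15) for _ in range(594)]
summary = bootstrap_summary(values, seed=1)
for name, value in summary.items():
    print(name, value)
```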
Why Bootstrapping Works Seems to create data out of nothing? Resamples not used as if real data Resample means are used to estimate how the sample mean for a sample of size 594 varies because of random sampling Use data twice Once to estimate population mean (original) Once to estimate variation in sample mean (resamples)
Applies to Other Statistics 25% trimmed mean (middle 50%) Difference between means Ratio of means Median Correlation coefficient Most anything
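Swapping in a different statistic only changes the function applied to each resample. The sketch below bootstraps a 25% trimmed mean and a median; the helper names `trimmed_mean` and `bootstrap_distribution` and the synthetic data are assumptions for illustration.

```python
import random
import statistics

def trimmed_mean(values, proportion=0.25):
    """Mean of the middle values after dropping `proportion` from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    return statistics.fmean(ordered[k:len(ordered) - k])

def bootstrap_distribution(sample, statistic, n_resamples=500, seed=None):
    """Apply `statistic` to n_resamples resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)]

# Synthetic stand-in data (NOT the course data)
data_rng = random.Random(0)
values = [data_rng.lognormvariate(5.0, 0.15) for _ in range(594)]

boot_trimmed = bootstrap_distribution(values, trimmed_mean, seed=1)
boot_medians = bootstrap_distribution(values, statistics.median, seed=2)
print("SE of 25% trimmed mean:", statistics.stdev(boot_trimmed))
print("SE of median:          ", statistics.stdev(boot_medians))
```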
Take Away Points Sampling is a cost-effective way to gather data. Resampling offers analysts a powerful numerical technique for statistical analysis. Resampling is relatively simple with resampling software.
Accuracy Bootstrap based on large sample (n>100): shape and spread do not depend much on original sample; does show shape and spread of sampling distribution. Bootstrap based on small samples: almost all variation for a statistic comes from the original sample, so reduce variation with a larger original sample; does not overcome weakness of small samples as basis for inference. Some methods (BCa, tilting) are better than standard methods.
Beyond the Basics Bootstrap bias-corrected accelerated (BCa): adjusts percentile endpoints for 95% CI, e.g., 4.1 to 98.6 instead of 2.5 to 97.5 for the 95%. Bootstrap tilting: adjusts process of randomly forming resamples; more efficient than BCa. Use one of these more accurate methods if your software offers it.
Permutation Tests Imagine experiment with 23 assigned randomly to control and 25 to treatment (n=48) Choose 25 of 48 at random and call this treatment (others to control) This is SRS without replacement—permutation resample Repeat 100s of times, calculate statistic of interest Permutation distribution—for 2 sample problems We can see if observed difference is so large that it would rarely occur if treatment did not matter!
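A two-sample permutation test along these lines can be sketched in Python. The group sizes (25 treatment, 23 control) follow the slide, but the scores are synthetic, and the function name `permutation_test`, the one-sided comparison, and the difference in means as the statistic are assumptions for illustration.

```python
import random
import statistics

def permutation_test(treatment, control, n_resamples=1000, seed=None):
    """One-sided permutation test for mean(treatment) - mean(control)."""
    rng = random.Random(seed)
    observed = statistics.fmean(treatment) - statistics.fmean(control)
    pooled = list(treatment) + list(control)
    n_t = len(treatment)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                      # random relabeling, without replacement
        diff = statistics.fmean(pooled[:n_t]) - statistics.fmean(pooled[n_t:])
        if diff >= observed:
            count += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    p_value = (count + 1) / (n_resamples + 1)
    return observed, p_value

# Synthetic scores: 25 treatment and 23 control subjects (n = 48)
data_rng = random.Random(0)
treatment = [data_rng.gauss(55, 10) for _ in range(25)]
control = [data_rng.gauss(50, 10) for _ in range(23)]
observed, p_value = permutation_test(treatment, control, seed=1)
print("observed difference:", observed)
print("one-sided p-value:  ", p_value)
```

A small p-value means the observed difference would rarely occur if group labels were assigned at random, which is the comparison the slide describes.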