Sampling: What you don’t know can hurt you Juan Muñoz
Outline of presentation Basic concepts –Scientific sampling –Simple Random Sampling –Sampling errors and confidence intervals –Sampling errors and sample size –Sample size and population size –Non-sampling errors –Sampling for rare events –Two-stage sampling and clustering –Stratification –Design effect Implementation issues –Planning the survey –Sample frames –Excluded strata –Paneling –Nonresponse
Random Sampling Random Sampling (a.k.a. Scientific Sampling) is a selection procedure that gives each element of the population a known, positive probability of being included in the sample Random Sampling permits establishing Sampling Errors and Confidence Intervals Other sampling procedures (purposive sampling, quota sampling, etc.) cannot do that Other sampling procedures can also yield biased conclusions
Simple Random Sampling
In a Simple Random Sample, households are chosen
–With the same probability
–Independently of each other
In a Simple Random Sample, the selection probability of each household is p = n / N, where
–n = sample size
–N = size of the population
A Simple Random Sample is self-weighted
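The selection rule p = n / N can be illustrated with a minimal sketch. It assumes a complete list of households exists; the frame of 10,000 identifiers, the sample size of 500 and the function name `simple_random_sample` are hypothetical and chosen only for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units without replacement, each with probability n / len(frame)."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical frame: household identifiers 1..10,000 (illustrative only).
frame = list(range(1, 10_001))
sample = simple_random_sample(frame, n=500, seed=1)

p = len(sample) / len(frame)   # selection probability p = n / N
weight = 1 / p                 # the same for every household: self-weighted
```

Because every household carries the same weight, unweighted sample proportions estimate population proportions directly.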
Simple Random Sampling
A simple random sample would be hard to implement...
–A list of all households in the country is generally not available to select the sample from
–In other words, we don’t have a good sample frame
–High transportation costs
–Difficult management
...but can be used to illustrate some basic facts about sampling
–Sampling Errors and Confidence Intervals
–The relationship between sampling error and sample size
–The relationship between sample size and population size
–Sampling vs. non-sampling errors
Sampling error and sample size
Standard error e when estimating a prevalence P in a sample of size n taken from an infinite population:
$e = \sqrt{P(1-P)/n}$
Confidence intervals
In a sample of 1,000 households, 280 households (28 percent) have preschool children.
Standard error is 1.42 percent.
95 percent confidence interval: 28 ± 2.8 percent (about two standard errors)
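The figures on this slide can be reproduced with a short sketch. It assumes the usual binomial formula e = sqrt(P(1-P)/n) for an infinite population and a normal approximation with z = 1.96 for the 95 percent interval; the function names are illustrative.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n units (infinite population)."""
    return math.sqrt(p * (1 - p) / n)

def confidence_interval(p, n, z=1.96):
    """Normal-approximation interval p ± z * e; z = 1.96 gives about 95 percent."""
    e = standard_error(p, n)
    return p - z * e, p + z * e

p_hat = 280 / 1000                              # 28 percent
e = standard_error(p_hat, 1000)                 # about 0.0142, i.e. 1.42 percent
low, high = confidence_interval(p_hat, 1000)    # about 0.252 to 0.308
```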
Sampling error and sample size
[Figure: standard error plotted against sample size]
To halve the sampling error, the sample size must be quadrupled
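The quadrupling rule follows from the square root in the standard error. A minimal check, assuming the binomial formula e = sqrt(P(1-P)/n) and the illustrative prevalence of 0.28:

```python
import math

def standard_error(p, n):
    """Standard error of a proportion p estimated from n units (infinite population)."""
    return math.sqrt(p * (1 - p) / n)

sizes = [1_000, 4_000, 16_000]                 # each sample is four times the previous one
errors = [standard_error(0.28, n) for n in sizes]

# Ratio of each error to the previous one: 0.5 every time the sample is quadrupled.
ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
```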
Sample size and population size
Standard error e when estimating a prevalence P in a sample of size n taken from a population of size N:
$e = \sqrt{P(1-P)/n} \cdot \sqrt{1 - n/N}$
The factor $\sqrt{1 - n/N}$ is the finite population correction
Sample size and population size
[Figure: sample size needed for a given precision, plotted against population size]
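How little the population size matters can be sketched numerically. The code assumes the finite population correction takes the common form (1 - n/N), so that solving e² = P(1-P)/n · (1 - n/N) for n gives n = n0 / (1 + n0/N) with n0 = P(1-P)/e². The target precision (2 percentage points at P = 0.5) and the population sizes are hypothetical.

```python
import math

def required_sample_size(p, e, N=None):
    """Sample size needed for standard error e; N=None means an infinite population."""
    n0 = p * (1 - p) / e**2              # size needed without the correction
    if N is None:
        return math.ceil(n0)
    return math.ceil(n0 / (1 + n0 / N))  # size needed with the (1 - n/N) correction

# Hypothetical population sizes, from a small town to a large country.
populations = [10_000, 100_000, 1_000_000, 100_000_000]
needed = {N: required_sample_size(0.5, 0.02, N) for N in populations}
```

Under these assumptions the required sample is 589 households for a population of 10,000 and 625 for a population of 100 million: the needed sample size is almost independent of population size.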
Sampling vs. non-sampling errors
[Figure: sampling error, non-sampling error and total error plotted against sample size]
Absolute and relative errors
The standard error formula gives the absolute error
But we are often interested in the relative error
For rare events (small P), the relative error can be large, even with very big samples
This may be the case for some of the MDGs
–Infant / maternal mortality
–HIV/AIDS prevalence
–Extreme poverty
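A small numerical sketch of the contrast, assuming the binomial standard error e = sqrt(P(1-P)/n) and defining the relative error as e / P. The rare-event prevalence of 0.5 percent and the sample of 10,000 are hypothetical values chosen for illustration.

```python
import math

def absolute_error(p, n):
    """Standard error of a proportion p estimated from n units."""
    return math.sqrt(p * (1 - p) / n)

def relative_error(p, n):
    """Standard error expressed as a fraction of the prevalence itself."""
    return absolute_error(p, n) / p

common = relative_error(0.28, 1_000)    # about 0.05: 5 percent of the estimate
rare = relative_error(0.005, 10_000)    # about 0.14, despite a sample ten times larger
```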
Two-stage sampling The country is divided into small Primary Sampling Units (PSUs) In the first stage, PSUs are selected In the second stage, households are chosen within the selected PSUs
Two-stage sampling Solves the problems of Simple Random Sampling Provides an opportunity to link community-level factors to household behavior The sample can be made self-weighted if –In the first stage, PSUs are selected with Probability Proportional to Size (PPS) –In the second stage, a fixed number of households are chosen within each of the selected PSUs The price to pay is cluster effect
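The self-weighting property can be checked with a minimal sketch. It assumes systematic PPS selection in the first stage (one common way to implement PPS, not necessarily the one used by the author) and a fixed take of m households per PSU; the frame of 200 PSUs and all names are hypothetical.

```python
import random

def pps_systematic(sizes, k, seed=None):
    """Select k PSU indices with probability proportional to size (systematic PPS)."""
    rng = random.Random(seed)
    total = sum(sizes)
    step = total / k                       # sampling interval on the cumulated sizes
    start = rng.uniform(0, step)           # random start within the first interval
    points = [start + i * step for i in range(k)]
    chosen, cumulative, j = [], 0, 0
    for index, size in enumerate(sizes):
        cumulative += size
        while j < k and points[j] < cumulative:
            chosen.append(index)           # this PSU covers selection point j
            j += 1
    return chosen

# Hypothetical frame: 200 PSUs with 50 to 150 households each (illustrative only).
rng = random.Random(0)
sizes = [rng.randint(50, 150) for _ in range(200)]
k, m = 20, 10                              # 20 PSUs, 10 households in each
selected = pps_systematic(sizes, k, seed=1)
total = sum(sizes)

# Overall probability = P(PSU selected) * P(household selected within the PSU).
probabilities = [(k * sizes[i] / total) * (m / sizes[i]) for i in selected]
```

The PSU size cancels out, so every household ends up with the same overall probability k·m / total, whatever the size of its PSU.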
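Cluster effect
Standard error grows when the sample of size n is drawn from k PSUs, with m households in each PSU (n = km):
$e_{TSS}^2 = e_{SRS}^2 \, [1 + \rho (m - 1)]$
–$e_{TSS}$ = standard error of the Two-Stage Sample
–$e_{SRS}$ = standard error of a Simple Random Sample of the same size
–$\rho$ = intra-cluster correlation coefficient
–$1 + \rho (m - 1)$ = cluster effect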
Cluster effects
[Table: cluster effect by intra-cluster correlation coefficient, for different numbers of PSUs and households per PSU, for a total sample size of 12,000 households]
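A table of this kind can be rebuilt with a short sketch. It assumes the standard approximation cluster effect = 1 + ρ(m - 1); the values of ρ, the cluster sizes and the prevalence of 0.5 are hypothetical and are not the figures from the original table.

```python
import math

def cluster_effect(rho, m):
    """Variance inflation from sampling m households per PSU with correlation rho."""
    return 1 + rho * (m - 1)

n = 12_000                                   # total sample size, as on the slide
p = 0.5                                      # hypothetical prevalence
e_srs = math.sqrt(p * (1 - p) / n)           # standard error under simple random sampling

rows = []
for m in (6, 12, 24, 60):                    # hypothetical households per PSU
    for rho in (0.01, 0.05, 0.10):           # hypothetical intra-cluster correlations
        effect = cluster_effect(rho, m)
        # Columns: number of PSUs, households per PSU, rho, cluster effect, standard error.
        rows.append((n // m, m, rho, effect, e_srs * math.sqrt(effect)))
```

For a fixed total sample, fewer and larger clusters mean a larger cluster effect, and the penalty grows with the intra-cluster correlation.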
Stratified Sampling
The population is divided up into subgroups or “strata”. A separate sample of households is then selected from each stratum.
There are two primary reasons for using a stratified sampling design:
–To potentially reduce sampling error by gaining greater control over the composition of the sample.
–To ensure that particular groups within a population are adequately represented in the sample.
These objectives are often contradictory in practice
The sampling fraction generally varies across strata.
Sampling weights need to be used to analyze the data
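Why weights matter when sampling fractions differ can be shown with a minimal sketch. It assumes each sampled household gets the weight N_h / n_h of its stratum; the two strata and all their figures are hypothetical.

```python
# Hypothetical strata: population size N, sample size n, households with the trait.
strata = {
    "urban": {"N": 600_000, "n": 500, "positives": 100},    # sampled at a low rate
    "rural": {"N": 400_000, "n": 1_500, "positives": 600},  # oversampled
}

def weighted_prevalence(strata):
    """Prevalence estimate where each household carries the weight N_h / n_h."""
    numerator = sum(s["N"] / s["n"] * s["positives"] for s in strata.values())
    denominator = sum(s["N"] / s["n"] * s["n"] for s in strata.values())
    return numerator / denominator

def unweighted_prevalence(strata):
    """Naive estimate that ignores the different sampling fractions."""
    positives = sum(s["positives"] for s in strata.values())
    return positives / sum(s["n"] for s in strata.values())

weighted = weighted_prevalence(strata)       # 0.28
unweighted = unweighted_prevalence(strata)   # 0.35, biased toward the oversampled stratum
```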
Design effect
In a two-stage sample: Cluster effect = $e_{TSS}^2 / e_{SRS}^2$
In a more complex sample (with two or more stages, stratification, etc.): Design effect = Deff = $e_{CS}^2 / e_{SRS}^2$
It can be interpreted as an apparent shrinking of the sample size, as a result of clustering and stratification
It can be estimated with specialized software (such as Stata’s svy commands)
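The "apparent shrinking" can be made concrete with a short sketch, assuming the effective sample size is n / Deff. The complex-sample standard error of 2 percent is a hypothetical figure; in practice it would come from survey software.

```python
import math

def design_effect(e_complex, e_srs):
    """Deff: ratio of the complex-sample variance to the SRS variance."""
    return (e_complex / e_srs) ** 2

def effective_sample_size(n, deff):
    """Size of the simple random sample that would give the same precision."""
    return n / deff

n = 1_000
e_srs = math.sqrt(0.28 * 0.72 / n)         # about 0.0142 under simple random sampling
e_cs = 0.02                                # hypothetical standard error of the complex sample
deff = design_effect(e_cs, e_srs)          # about 1.98
n_eff = effective_sample_size(n, deff)     # about 504 households
```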
First stage sample frame: The list of Census Enumeration Areas
–Exhaustive
–Unambiguous
–Linked with cartography
–Measure of size (for PPS selection)
–Up to date (?)
–Area Units of adequate size
Second stage sample frame: The household listing operation
–What is involved? Training, organization, supervision, forms
–How long does it take? Depends on the number of households listed per enumerator per day
–How much does it cost? ~15% of the total cost of fieldwork
–How much earlier than the survey? As close as possible
–Is it always needed? Yes (almost)
–Dwellings or households? A dwelling listing is more permanent
–Who draws the sample? Ideally, central staff
–Asking extra questions during listing: Not recommended
–Can new technologies help? Yes (GPS)
Planning the survey Selected PSUs should be allocated –Among teams –During the survey period
Excluded strata
Parts of the country may need to be excluded from the sample for security or other reasons
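Panel Surveys can measure change better
[Figure: the estimates $Y_{2001}$ and $Y_{2005}$]
It seems that $Y_{2001} > Y_{2005}$, but both measures are affected by sampling errors ($e_{2001}$ and $e_{2005}$)
The error of the difference $Y_{2005} - Y_{2001}$ is
–$\sqrt{e_{2001}^2 + e_{2005}^2}$ if the two samples are independent
–only $\sqrt{e_{2001}^2 + e_{2005}^2 - 2\rho\, e_{2001} e_{2005}}$ if the sample is the same, where $\rho$ is the correlation between $Y_{2001}$ and $Y_{2005}$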
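The gain from a panel can be sketched numerically. The code assumes ρ denotes the correlation between the two estimates, so that the covariance term equals 2ρ·e1·e2; the standard errors of 1.42 percent and the correlation of 0.7 are hypothetical.

```python
import math

def error_of_difference(e1, e2, rho=0.0):
    """Standard error of the difference between two estimates with correlation rho."""
    return math.sqrt(e1**2 + e2**2 - 2 * rho * e1 * e2)

independent = error_of_difference(0.0142, 0.0142)       # about 0.0201
panel = error_of_difference(0.0142, 0.0142, rho=0.7)    # about 0.0110
```

With a positive correlation between rounds, the same households give a noticeably smaller error on the change than two independent samples of the same size.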
Advantages and disadvantages of panels
Analytical advantages
–Can measure changes better
–Permit a better understanding of why things changed
–Permit correlating past and present behavior
Analytical disadvantages
–Become progressively less representative of the population
Practical disadvantages
–Sample attrition
–Much harder to manage
–Better to design them prospectively rather than as an afterthought
Practical advantages
–No sampling design needed for the second and subsequent surveys
Nonresponse
Possible solutions…
–Replace nonrespondents with similar households
–Increase the sample size to compensate for it
–Use correction formulas
–Use imputation techniques (hot-deck, cold-deck, warm-deck, etc.) to simulate the answers of nonrespondents
–None of the above ✔
The best way to deal with nonresponse is to prevent it Lohr, Sharon L. Sampling: Design & Analysis (1999)
Total Nonresponse
[Diagram: factors contributing to total nonresponse, grouped under Interviewers, Type of survey and Respondents. Factors shown: Training, Work load, Motivation, Qualification, Data collection method, Demographic, Socio-economic, Economic, Burden, Motivation, Proxy, Availability]
Source: “Some factors affecting Non-Response.” by R. Platek, Survey Methodology
Case study: The IHSES (Iraq Household Socio-Economic Survey)
Presenter: Ms Najla Murad - COSIT
Total sample size: 18,144 households
56 strata = 18 governorates x 3 zones (5 in Baghdad)
Zones: Urban Center / Other Urban / Rural
No explicitly excluded strata
Within each stratum: 324 households, selected in two stages:
–54 blocks, selected with PPS
–In each block: 6 households (a cluster), selected with equal probability (EP)
The 162 clusters of each governorate were allocated
–To fieldworkers: 3 teams x 3 interviewers x 18 clusters
–In time: 18 waves x 9 clusters (randomly)
One wave = 20 days; fieldwork period = 12 months
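The design figures on this slide are mutually consistent, as a short arithmetic check shows. The variable names are illustrative, and the count of 162 clusters is taken to apply to a governorate with three zones.

```python
governorates = 18
strata = (governorates - 1) * 3 + 5            # 3 zones each, 5 in Baghdad: 56 strata

blocks_per_stratum = 54
households_per_block = 6
households_per_stratum = blocks_per_stratum * households_per_block   # 324

total_households = strata * households_per_stratum                   # 18,144
total_clusters = strata * blocks_per_stratum                         # 3,024

# Clusters in a governorate with three zones, and their two allocations.
clusters_per_governorate = 3 * blocks_per_stratum                    # 162
by_fieldworkers = 3 * 3 * 18       # teams x interviewers x clusters per interviewer
by_waves = 18 * 9                  # waves x clusters per wave

fieldwork_days = 18 * 20           # 360 days, about 12 months
```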
Case study: The IHSES (Iraq Household Socio-Economic Survey)
Performance of the contingency plans
If a cluster could not be visited at the scheduled time, it was swapped with one of the selected clusters not yet visited, chosen at random.
At the end of fieldwork, 75 of the 3,024 originally selected clusters could not be visited (2.5 percent)
However, over 30 percent of the clusters were not visited at the scheduled time
In the clusters that could be visited, nonresponse was negligible (~1.5 percent)