Practical Sampling for Impact Evaluations
Aidan Coville* (*presentation draws from slides presented by Laura Chioda)
Development Impact Evaluation Initiative, Innovations in Investment Climate Reforms, Paris, Nov 14, 2012, in collaboration with the Investment Climate Global Practice
Sampling Variation
[Chart: heights (cm) measured in two random samples from the same population, Sample 1 and Sample 2, with means around 149 cm and 157 cm]
Which sample is correct? Answer: neither... and both.
So what do we do now?
We want accurate measures, but we have to deal with sampling error. If we take a census we'll get exact measures, so let's take a census… But first we need to think about the marginal value of added observations.
The answer is… = 42
The End Questions?
Calculating sample size
Think of the sample size as the accuracy of a measuring device (assuming representative random samples are drawn).
Example: guess the sentence below. Here, the number of revealed letters is analogous to the number of observations, where each letter, say, costs US$100,000. You have US$2M with which to uncover up to 20 letters (all of them). If you guess wrong, you lose all of your investment.
More observations → more precision → more confidence.
Calculating sample size
Let's increase the number of "observations" (in this case, letters). This is so much easier.
Outline (a search for "n")
1. Ingredients to determine sample size: detectable effect size; confidence (probabilities of avoiding Type I & Type II errors in inference); variance of outcome(s); clustering level
2. Enhancements: multiple treatments; group-disaggregated results
3. Detractions: take-up; data quality
4. So what can we do? (a guide to maximizing power)
Let's run through an example
Intervention: risk-based inspection.
Q: What is the impact of a new risk-based inspections procedure on restaurant compliance with health and safety standards?
Method: randomize the implementation of the regime at the town level.
Sample size: …?
Outline
1. Ingredients to determine sample size: detectable effect size; confidence (probabilities of avoiding Type I & Type II errors in inference); variance of outcome(s); clustering level
2. Enhancements: multiple treatments; group-disaggregated results
3. Detractions: take-up; data quality
4. So what can we do? (a guide to maximizing power)
Ingredients: Detectable Effect Size | Confidence | Variance of Outcomes | Clustering
Detectable Effect Size (1/3)
We do not know in advance the effect of our policy. We want to design a precise way of measuring it, but precision is not cheap: we need a cost-benefit analysis to decide.
1st ingredient: the smallest program effect size that you wish to detect, i.e. the smallest effect for which we would be able to conclude that it is statistically different from zero ("detect" is used in a statistical sense).
Detectable Effect Size (2/3)
Cost-benefit analysis guides us in determining the "smallest detectable effect": what is the smallest effect of the program below which we would consider it a failure? That is, the smallest effect that could be useful for policy and could justify the cost of the impact evaluation.
The smaller the (EXPECTED) differences between treatment and control, the more precise the instrument has to be to detect them, and the larger the sample needs to be.
Detectable Effect Size (3/3)
Who is taller? The larger the sample, the more precise the measuring device, and the easier it is to detect smaller effects.
Increasing sample size ≈ increasing precision (of our measuring device).
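The detectable-effect tradeoff can be sketched with the standard two-sample size formula. This is an illustration, not part of the presentation: the helper function and the figures (effect in standard-deviation units, 5% significance, 80% power) are hypothetical defaults.

```python
# Sketch: n per arm needed to detect a difference `delta` between treatment
# and control means, given outcome standard deviation `sigma`, two-sided
# significance level `alpha`, and desired `power`.
import math
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

# Halving the detectable effect roughly quadruples the required sample.
print(n_per_arm(delta=0.5, sigma=1.0))   # -> 63 per arm
print(n_per_arm(delta=0.25, sigma=1.0))  # -> 252 per arm
```

The quadratic cost of precision is the point of the slide: detecting effects half as large needs four times the observations.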
Ingredients: Detectable Effect Size | Confidence | Variance of Outcomes | Clustering
Confidence: Type I Error (1/2)
The possibility of getting a false positive: observing a difference between two groups when a TRUE difference does not exist.
Example: because of budget concerns, treatment and control have 25 observations each. By pure chance, treatment businesses are more diligent, so treatment compliance appears (statistically) larger than control compliance.
Confidence: Type II Error (2/2)
Failing to detect an effect when a TRUE effect really does exist.
Suppose treatment compliance looks very similar (≈) to control compliance, i.e. treatment and control outcomes are not statistically different. We could conclude that our program has "no" effect for two reasons:
1. Because our instrument is not precise (bad inference)
2. Because the program indeed had no effect (good inference)
Unless we have "enough" observations, we will not be able to decide with confidence between possibilities 1 and 2.
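A small Monte Carlo makes the Type II risk concrete. Here a TRUE effect exists (compliance rises from 50% to 60%); the simulation asks how often a simple two-proportion test detects it at different sample sizes. The sketch and its numbers are illustrative assumptions, not figures from the presentation.

```python
# Simulate many experiments with a true 10-percentage-point effect and count
# how often the difference comes out statistically significant (= power).
import random
import math

def detect_rate(n_per_arm, p_control=0.50, p_treat=0.60,
                reps=500, seed=1):
    rng = random.Random(seed)
    z_crit = 1.96  # two-sided 5% critical value
    hits = 0
    for _ in range(reps):
        c = sum(rng.random() < p_control for _ in range(n_per_arm))
        t = sum(rng.random() < p_treat for _ in range(n_per_arm))
        p_pool = (c + t) / (2 * n_per_arm)
        se = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_arm)
        if se > 0 and abs(t - c) / n_per_arm / se > z_crit:
            hits += 1
    return hits / reps

small = detect_rate(25)   # 25 obs per arm: usually misses the real effect
large = detect_rate(500)  # 500 obs per arm: detects it most of the time
print(small, large)
```

With 25 observations per arm the true effect is detected only rarely (bad inference dominates); with 500 per arm it is detected most of the time.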
Ingredients: Detectable Effect Size | Confidence | Variance of Outcomes | Clustering
Variance of Outcomes (1/4)
How does the variance of the outcome affect our ability to detect an impact?
Example: of the two (circled) populations, which animals are bigger? How many observations from each circle would you need to decide?
Variance of Outcomes (2/4)
Example: on average, which group has the larger animals? The comparison is more complicated in this case, so you need more information (i.e. a larger sample): the answer may depend on which members of the blue and red groups you observe.
Variance of Outcomes (3/4)
A more economic example: let's look at our businesses and compliance rates. Imagine that risk-based inspection leads to an increase in compliance (impact) from 50% to 60% on average.
Case A: businesses are all very similar and the distribution of compliance rates is very concentrated.
Case B: businesses are very different and the distributions of compliance rates are spread out (the distributions overlap more).
Which case requires a more precise measuring device?
Variance of Outcomes (4/4)
In sum: more underlying variance (heterogeneity) → more difficult to detect a difference → need a larger sample size.
Tricky: how do we know about heterogeneity before we decide our sample size and collect our data?
- Ideal: pre-existing data… but often non-existent
- Can use pre-existing data from a similar population (e.g. enterprise surveys, labor force surveys)
- Common sense
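The variance link can be shown directly: in the standard two-sample size formula, the required n is proportional to the outcome variance, so doubling the standard deviation quadruples the sample needed for the same detectable effect. The function and the Case A/Case B standard deviations below are hypothetical illustrations.

```python
# Required sample size per arm (exact, before rounding) for detectable
# effect `delta` when the outcome has standard deviation `sigma`.
from statistics import NormalDist

def n_exact(delta, sigma, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (z * sigma / delta) ** 2

tight = n_exact(delta=0.1, sigma=0.2)   # Case A: similar businesses
spread = n_exact(delta=0.1, sigma=0.4)  # Case B: heterogeneous businesses
print(spread / tight)  # -> 4.0
```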
Ingredients: Detectable Effect Size | Confidence | Variance of Outcomes | Clustering
Clustering (1/4)
Is random sampling done at the business level, business-group level, village/port/… level, or province level?
Depends on: the question being asked / intervention type; sampling frame availability; cost/feasibility; potential spillovers.
Clustering (2/4)
What is the added value of more samples in the same cluster?
[Diagram: sampled units within Villages 1–4]
Clustering (3/4)
[Diagram: additional units sampled within the same Villages 1–4]
Clustering (4/4)
Takeaway: larger within-cluster correlation (units in the same cluster are similar) → lower marginal value per extra sampled unit in the cluster → a higher sample size / more clusters needed than with a simple random sample.
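This takeaway is often quantified with the design effect, 1 + (m − 1)ρ, where m is the cluster size and ρ the intra-cluster correlation; dividing the raw sample size by it gives the "effective" number of independent observations. The formula is standard, but the specific numbers below (20 restaurants per town, ρ = 0.05) are hypothetical.

```python
# Design effect sketch: how much information is lost when sampling
# m units per cluster with intra-cluster correlation rho.
def design_effect(m, rho):
    return 1 + (m - 1) * rho

n = 1000            # units actually surveyed
m, rho = 20, 0.05   # e.g. 20 restaurants per town, modest within-town similarity
deff = design_effect(m, rho)
print(deff)      # -> 1.95
print(n / deff)  # ~513 effective independent observations out of 1000
```

So even a modest within-cluster correlation roughly halves the effective sample here, which is why clustered designs need more total units (or more clusters).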
Outline
1. Ingredients to determine sample size: detectable effect size; confidence (probabilities of avoiding Type I & Type II errors in inference); variance of outcome(s); clustering level
2. Enhancements: multiple treatments; group-disaggregated results
3. Detractions: take-up; data quality
4. So what can we do? (a guide to maximizing power)
Multiple Treatments (1/2)
Risk-based inspections may increase compliance. But what if restaurants aren't able to upgrade their processes to comply because of a lack of access to credit?
Treatment 1: risk-based inspections
Treatment 2: matching grant to upgrade safety processes
Treatment 3: inspections and grant
Intuition: the more comparisons (treatments), the larger the sample size needed to be "confident".
Multiple Treatments (2/2)
Comparing treatment groups to each other requires very large samples. The more comparisons you make, the more observations you need, especially because when the various treatments are very similar, the differences between treatment groups can be expected to be smaller.
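One way to see the cost of extra comparisons is through a multiple-testing adjustment. A Bonferroni correction (a common but conservative choice, used here purely as an illustration, not as the presenters' prescription) tests each of k comparisons at alpha/k, which raises the critical value and hence the required sample per arm.

```python
# Sketch: required n per arm at the unadjusted 5% level vs. at the
# Bonferroni-adjusted level for 6 pairwise comparisons (4 groups:
# control + 3 treatments). Effect and sigma values are hypothetical.
import math
from statistics import NormalDist

def n_per_arm(delta, sigma, alpha, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z * sigma / delta) ** 2)

baseline = n_per_arm(0.5, 1.0, alpha=0.05)      # one comparison
adjusted = n_per_arm(0.5, 1.0, alpha=0.05 / 6)  # six comparisons
print(baseline, adjusted)  # adjusted is noticeably larger
```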
Strata (1/5)
Strata (2/5)
Group-disaggregated results: are effects different for men and women? For different sectors? If genders/sectors are expected to react in a similar way, then estimating differences in treatment impact also requires very large samples.
To ensure balance across treatment and comparison groups, it is good to divide the sample into strata before assigning treatment. Strata are sub-populations; common strata include geography, gender, sector, and baseline values of the outcome variable. Treatment assignment (or sampling) occurs within these groups (i.e. randomize within strata).
Strata (3/5)
Example: what is the impact in a particular region?
[Map: businesses across regions A, B, and C, randomly assigned to treatment and control]
Can you assess with confidence the impact on compliance within regions?
Strata (4/5)
To answer, consider a few regions:
Region A: we have almost no businesses in the control group
Region B: very few observations; can you be confident?
Region C: no observations at all
Strata (5/5)
How do we prevent these imbalances and restore confidence in estimates within strata? Sampling within each region can overcome the issue: randomly assign to treatment within geographical units, so that within each unit ½ will be treatment and ½ will be control. Similar logic applies for gender, industry, firm size, etc.
Which strata? Your research and policy questions should guide you.
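The within-unit assignment rule can be sketched in a few lines: group units by stratum, shuffle within each stratum, and split each stratum half-and-half. The unit IDs and region labels below are hypothetical.

```python
# Stratified random assignment sketch: exactly half of each stratum
# (here, each region) goes to treatment, guaranteeing within-region balance.
import random

def assign_within_strata(units, seed=0):
    """units: list of (unit_id, stratum) pairs; returns {unit_id: arm}."""
    rng = random.Random(seed)
    strata = {}
    for uid, stratum in units:
        strata.setdefault(stratum, []).append(uid)
    assignment = {}
    for ids in strata.values():
        rng.shuffle(ids)
        half = len(ids) // 2
        for uid in ids[:half]:
            assignment[uid] = "treatment"
        for uid in ids[half:]:
            assignment[uid] = "control"
    return assignment

# 10 hypothetical businesses in region A, 10 in region B
units = [(i, "A") for i in range(10)] + [(i + 10, "B") for i in range(10)]
groups = assign_within_strata(units)
print(sum(groups[i] == "treatment" for i in range(10)))  # -> 5 treated in A
```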
Outline
1. Ingredients to determine sample size: detectable effect size; confidence (probabilities of avoiding Type I & Type II errors in inference); variance of outcome(s); clustering level
2. Enhancements: multiple treatments; group-disaggregated results
3. Detractions: take-up; data quality
4. So what can we do? (a guide to maximizing power)
1. Take-Up
Example: there is no discretionary participation in inspections, BUT we can only offer the matching grant; we cannot force businesses to use it. We offer the grant to 500 businesses and only 50 participate.
In practice, because of the low take-up rate, we end up with a less precise measuring device: we won't be able to detect differences with precision, and can only find an effect if it is really large.
Low take-up lowers the precision of our comparisons: it effectively decreases the sample size.
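The dilution can be made precise under a simple assumption: if only a fraction t of the offered group takes up the grant (and non-takers are unaffected), the average effect in the offered group shrinks to t times the effect on participants, so the required sample grows like 1/t². This is a textbook intent-to-treat sketch, not a calculation from the presentation; the effect size is hypothetical.

```python
# Sketch: with 10% take-up (50 of 500 offered), the detectable average
# effect shrinks 10-fold, so the required sample grows 100-fold.
from statistics import NormalDist

def n_exact(delta, sigma=1.0, alpha=0.05, power=0.80):
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return 2 * (z * sigma / delta) ** 2  # per arm, before rounding

effect_on_participants = 0.5           # hypothetical
full = n_exact(effect_on_participants)          # 100% take-up
partial = n_exact(0.1 * effect_on_participants)  # 10% take-up
print(partial / full)  # -> 100.0
```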
1. Take-Up
[Chart: matching grant application vs. completion rates in Mozambique]
2. Data Quality
Poor data quality (missing observations, high measurement error) effectively increases the required sample size. This can be partly addressed with a field coordinator on the ground monitoring data collection.
Overview
Who to interview is ultimately determined by our research/policy questions. How many:
- The smaller the effects that we want to detect → the larger the sample size will have to be
- The more (statistical) confidence/precision → larger sample size
- The more underlying heterogeneity (variance) → larger sample size
- The more clustering in samples → larger sample size
- The more complicated the design (multiple treatments, strata) → larger sample size
- The lower the take-up → larger sample size
- The lower the data quality → larger sample size
Mo precision mo money
Need to be realistic
How can we boost power?
- Focus on a homogeneous group
- Collect high-frequency data on core indicators
- Increase take-up!!!!
- Collect better quality data (it's worth it…)
- Avoid clustering where possible
- Use factorial designs:

                  Inspections          No Inspections
Matching grant    500 (interaction)    500
No grant          500                  500 (control)
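The 2×2 factorial assignment in the table can be sketched as a cross-randomization over 2,000 firms, 500 per cell. A sketch only: the cell labels and firm count mirror the table, everything else is hypothetical.

```python
# 2x2 factorial assignment sketch: shuffle firm ids, then fill the four
# cells (control, inspections, grant, both) with equal counts.
import random
from collections import Counter

def factorial_assign(n_units, seed=42):
    rng = random.Random(seed)
    cells = ["control", "inspections", "grant", "inspections+grant"]
    ids = list(range(n_units))
    rng.shuffle(ids)
    per_cell = n_units // len(cells)
    return {uid: cells[i // per_cell] for i, uid in enumerate(ids)}

assignment = factorial_assign(2000)
print(Counter(assignment.values()))  # 500 firms in each of the 4 cells
```

A factorial design boosts power for the main effects because every firm contributes to both the inspections comparison and the grant comparison.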