Practical Sampling for Impact Evaluation (aka shedding light on voodoo)
Laura Chioda (LAC Chief Economist office & adopted by DIME)
Reducing Fragility, Conflict, Crime, and Violence
Lisbon, Portugal, 23-27 March 2014
Introduction

Now that you know how to build treatment and control groups in theory, how do you do it in practice?
1. Which population or groups are we interested in, and where do we find them? Selecting whom to interview.
2. From that population, how many people/neighborhoods/units should be interviewed or observed? Sample size.
Seemingly trivial, but "the devil is in the details."
Example: Suppose we want to understand whether a mix of pro-social mentoring and cognitive behavioral therapy for at-risk youth can mitigate anti-social and violent behavior.
Heller et al. (NBER 2013): "Preventing Youth Violence and Dropout: A Randomized Field Experiment"
Introduction

Example (1): "Whom to interview" is informed by the research/policy question.
1. Everyone (male, female, kids, elderly)?
2. All youth aged 14-16?
3. All youth aged 14-16 in urban areas?
4. All youth aged 14-16 in a particular city and in public schools?
We need some information before sampling: a complete listing of all units of observation available for sampling in each area or group.
Introduction

"How many" (sample size) depends on a few ingredients.
Example (2), intuitively. Sample size = 2:
- One adolescent receives mentoring to reduce antisocial behavior (treatment).
- A second adolescent does not (control).
- The two have been selected at random.
Impact = the difference between the two adolescents in:
- the number of times they come into contact with police (e.g., being stopped or arrested);
- the number of times they have been disciplined in school or gotten into altercations or fights.
Why does sample size matter? If it is too small, you may draw conclusions that are not robust. What if the youth receiving mentoring by chance has very violent peers? Or, on the contrary, what if the one not receiving mentoring was by chance more risk-averse and less impulsive?
Introduction

Why not assign the entire population (e.g., all youth) to either the treatment or the control group?
- Ideal world: without budget or time constraints, interviewing everyone would be a good solution.
- In practice, interviews are costly and time-consuming, so this is not feasible (e.g., a census every 10 years vs. more frequent household surveys that sample only a fraction of households).
In sum:
- Whom to interview is ultimately determined by our research/policy questions.
- Sample size matters and determines the credibility of results: it allows us to say with some "confidence" whether the average outcome in the treatment group is higher or lower than that in the comparison group.
Road Map

What will we do with the rest of the time?
1. What do we mean by confidence? How does confidence relate to sample size?
2. Ingredients to determine sample size:
   - Detectable effect size
   - Probabilities of avoiding mistakes in inference (type I and type II errors)
   - Variance of outcome(s)
3. Multiple treatments
4. Group-disaggregated results
5. Take-up
6. Data quality
Sample Size & Confidence (in your results)

Think of sample size as the accuracy of a measuring device:
- the more observations you have,
- the more precise your "measuring device" is,
- and the more confident you are about the conclusions of your evaluation.
Example: guess the sentence below knowing only 2 letters.
- The number of revealed letters is analogous to the number of observations, where each letter, say, costs US$100,000.
- You have US$2M with which to uncover up to 21 letters (all of them).
- If you guess wrong, you lose all of your investment.
Sample Size & Confidence (in the results)

Let's increase the number of "observations" (in this case, letters):
- This is so much easier.
- You feel more confident about guessing.
Common sense: the more complicated the sentence, the more letters you need.
Below, we discuss the sense in which impacts can be "complicated" to detect and thus require larger samples.
Calculating the Sample Size

We understand confidence to mean "with some degree of certainty" or "with little error."
We are in luck: this time, the statistical jargon and plain language point to the same notion. The same holds true in the statistical sense; it only entails formalizing what is meant by "error."
The statistical derivation of the ideal sample size yields an ugly formula (it looks like voodoo). Would you like me to derive this formula?
Calculating the Sample Size

Hopefully you answered "no" to my previous question (otherwise, early lunch break).
Our intuitive approach will focus on 3 main ingredients:
1. Detectable effect size
2. Errors in inference: type II (and type I) errors
3. Variance of outcome(s)
We will answer the following question: How do these 3 ingredients affect the credibility of your results, and therefore your choice of sample size?
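For reference, the "ugly formula" the slides allude to is the standard one for comparing two equal-sized groups' means with a two-sided test: n per arm = 2σ²(z₁₋α/₂ + z₁₋β)²/δ², where δ is the smallest effect to detect and σ the outcome's standard deviation. A minimal sketch using only the Python standard library (the numeric inputs below are illustrative, not from the deck):

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.8):
    """Sample size per arm for a two-sided, two-sample comparison of means
    (normal approximation, equal variances and group sizes)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the type I error rate
    z_beta = z.inv_cdf(power)           # quantile giving the desired power
    return math.ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)

# Illustrative: detecting a 10-incident drop when outcomes have sd = 20
print(n_per_arm(delta=10, sd=20))  # 63 per arm
```

Note how the three ingredients appear directly: δ in the denominator (smaller effects need more data), the z-terms encoding the tolerated inference errors, and σ² in the numerator. Halving δ roughly quadruples the required n.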
1st Ingredient: Smallest Effect Size

We do not know in advance the effect of our policy; we want to design a precise way of measuring it. But precision is not cheap: we need a cost-benefit analysis to decide.
1st ingredient: the smallest program effect size that you wish to detect, i.e., the smallest effect for which we would be able to conclude that it is statistically different from zero ("detect" is used in a statistical sense).
Example: What if mentoring lowers the cost of crime (e.g., policing and incarceration expenditures) by 5%, but costs (extra man-hours, tutoring materials, etc.) grow by 4.5%? What if the aggregate benefits are lower than the cost of the impact evaluation?
1st Ingredient: Smallest Effect Size

Cost-benefit analysis guides us in determining the smallest detectable effect:
- that could be useful for policy;
- that could justify the cost of an impact evaluation, etc.
The smaller the (expected) differences between treatment and control, the more precise the instrument has to be to detect them, and the larger the sample needs to be.
1st Ingredient: Smallest Effect Size

The larger the sample,
- the more precise the measuring device,
- and the easier it is to detect smaller effects.
Increasing sample size ≈ increasing precision (of our measuring device).
Who is taller? Which pair requires a more precise measuring device?
2nd Ingredient: Type II Error

Why is it important to be able to measure differences with precision?
Example: Treatment = cognitive behavioral therapy (CBT); #Arrests(Treatment) is very similar to (≈) #Arrests(Control).
If treatment and control outcomes are not statistically different, we could conclude that our program has "no" effect for 2 reasons:
1. because our instrument is not precise (bad inference);
2. because the program indeed had no effect (good inference).
Unless we have "enough" observations, we cannot decide with confidence whether the "no effect" resulted from possibility 1 or 2.
2nd Ingredient: Type I Error (false positive)

In the previous example, suppose that by pure chance, treatment youth tend to have parents who are more involved in their children's upbringing (high-quality parental investments).
#Arrests(Treatment) is (statistically) SMALLER than #Arrests(Control). We conclude that our program has an effect (despite there being none in truth). However, the difference depends only on the difference in parents' involvement (bad inference).
Good news: the larger the sample size, the smaller we can make the probability of committing this type of error.
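The "by pure chance" mechanism above can be checked with a quick simulation: under random assignment, the chance gap in a covariate the program does not touch (such as parental involvement) shrinks as the sample grows. A hedged sketch (the function name and all numbers are ours, for illustration only):

```python
import random
import statistics

random.seed(0)

def mean_covariate_imbalance(n, reps=500):
    """Average absolute treatment-control gap in a standardized covariate
    that the program does not affect, under pure-chance assignment."""
    gaps = []
    for _ in range(reps):
        treat = [random.gauss(0, 1) for _ in range(n)]
        control = [random.gauss(0, 1) for _ in range(n)]
        gaps.append(abs(statistics.mean(treat) - statistics.mean(control)))
    return statistics.mean(gaps)

small = mean_covariate_imbalance(10)    # 10 youth per arm
large = mean_covariate_imbalance(500)   # 500 youth per arm
print(small, large)  # the chance imbalance is far smaller in the larger sample
```

This is why the slide can promise that larger samples make this kind of spurious "effect" less likely: chance imbalances in background characteristics average out.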
One More Ingredient: Variance of Outcomes (1)

How does the variance of the outcome affect our ability to detect an impact?
Example: Of the two (circled) populations, which animals are bigger on average? How many observations from each circle would you need to decide?
One More Ingredient: Variance of Outcomes (2)

Example: On average, which group has the larger animals?
The comparison is more complicated in this case, so you need more information (i.e., a larger sample); the answer may depend on which members of the blue and red groups you observe.
One More Ingredient: Variance of Outcomes (3)

Economic example: let's look at our adolescents and mentoring. Imagine that mentoring leads to a decline in disruptive incidents over 2 years (the impact) from 60 to 50.
- Case A: Children are all very similar within treatment arms; distributions of incidents are very concentrated.
- Case B: Children are more heterogeneous, with distributions of incidents much more spread out (distributions overlap more).
Which instance requires a more precise measuring device?
One More Ingredient: Variance of Outcomes (4)

In sum: the more underlying variance (heterogeneity), the more difficult it is to detect differences, and the larger the sample size needs to be.
Tricky: How do we know about outcome heterogeneity before we determine our sample size and collect our data?
- Ideal: pre-existing data, but these are often not available.
- We can use pre-existing data from a similar population (e.g., surveys from other school districts, labor force surveys, institutional data on crime, etc.).
- Common sense.
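The variance effect can be made concrete with the standard per-arm formula n = 2σ²(z₁₋α/₂ + z₁₋β)²/δ²: since n scales with σ², tripling the outcome's standard deviation multiplies the required sample by roughly nine. A sketch under illustrative numbers of our own choosing (the deck's Case A/Case B give no figures):

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.8):
    """Standard two-sided, two-sample size formula (normal approximation)."""
    z = NormalDist()
    return math.ceil(2 * (sd * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / delta) ** 2)

# Same 10-incident impact (60 -> 50), homogeneous vs heterogeneous children:
print(n_per_arm(delta=10, sd=15))  # concentrated outcomes (Case A)
print(n_per_arm(delta=10, sd=45))  # spread-out outcomes (Case B): ~9x the sample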
What Else to Consider When Determining Sample Size

Additional features of the design/data that may have implications for the determination of sample size:
1. Multiple treatment arms
2. Group-disaggregated results
3. Take-up
4. Data quality
1. Multiple Treatments

Testing different mechanisms: Is CBT alone enough? Can I increase its effectiveness?
Suppose we are interested in the effects of youth mentoring on academic performance (reference: Cook et al., 2014). We would like to test three different treatments:
- Treatment 1: Academic remediation only
- Treatment 2: Cognitive Behavioral Therapy (CBT) only
- Treatment 3: Mentoring/CBT and academic remediation
Intuition: the more comparisons (treatments), the larger the sample size needed to be "confident."
1. Multiple Treatments

Comparing multiple treatment groups requires very large samples; it is analogous to bundling "multiple" impact evaluations into one.
- The more comparisons you make, the more observations you need.
- If the various treatments are very similar, differences between the treatment groups can be expected to be particularly small.
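One common way (though not the only one) to see why multiple comparisons inflate sample size is a Bonferroni correction: with three treatment-vs-control contrasts, each is tested at α/3, which pushes up the required n per arm. A sketch with illustrative numbers of our own (the deck does not specify any):

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.8):
    """Standard two-sided, two-sample size formula (normal approximation)."""
    z = NormalDist()
    return math.ceil(2 * (sd * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / delta) ** 2)

# Three treatment arms vs control: Bonferroni tests each contrast at alpha / 3.
single = n_per_arm(delta=10, sd=20, alpha=0.05)      # one comparison
multi = n_per_arm(delta=10, sd=20, alpha=0.05 / 3)   # three comparisons
print(single, multi)  # the corrected test needs noticeably more observations per arm
```

And if the arms are similar, the relevant δ for arm-vs-arm contrasts shrinks too, compounding the increase.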
2. Group-Disaggregated Results: Why Do We Need Strata?

- Gender: Are effects different for boys and girls?
- Location: For different neighborhoods?
- For different family structures (e.g., 1- vs. 2-parent households)?
Punch line: To ensure balance across treatment and comparison groups, it is good to divide the sample into strata (aka groups) before assigning treatment.
- Strata = sub-populations (sub-groups or sub-sets).
- Common strata: geography, gender, age, baseline values of the outcome variable.
- Treatment assignment (or sampling) occurs within these groups (i.e., randomize within strata).
What Can Go Wrong If You Do Not Use Strata?

Example: You randomize without stratification. Now you ask: What is the impact in a particular neighborhood?
(Figure: youth marked as treatment or control, assigned randomly, across neighborhoods A, B, and C.)
Can you assess with confidence the impact of mentoring within neighborhoods?
Why Do We Need Strata?

To answer, consider a few neighborhoods (the figures highlight A, B, and C in turn):
- Neighborhood A: we have almost no kids in the control group.
- Neighborhood B: very few observations; can you be confident?
- Neighborhood C: no observations at all.
Why Do We Need Strata?

How can we prevent these imbalances and restore confidence in estimates within strata?
Example: you have 6 neighborhoods. Instead of sampling 2,400 students regardless of their neighborhood of origin, draw a sample within each neighborhood:
- Sample 2,400 ÷ 6 = 400 per neighborhood: 200 treatment and 200 control.
- I.e., random assignment to treatment within geographical units; within each unit, half will be treatment and half control.
Similar logic applies to gender, family structure, age, etc.
Which strata? Your research and policy questions should guide you.
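The within-neighborhood assignment described above can be sketched in a few lines; the function and variable names are ours, not from the deck:

```python
import random

random.seed(42)

def assign_within_strata(students):
    """Randomly assign half of each stratum (here: a neighborhood) to
    treatment and half to control."""
    strata = {}
    for sid, hood in students:
        strata.setdefault(hood, []).append(sid)
    assignment = {}
    for hood, ids in strata.items():
        random.shuffle(ids)          # randomize order within the stratum
        half = len(ids) // 2
        for sid in ids[:half]:
            assignment[sid] = "treatment"
        for sid in ids[half:]:
            assignment[sid] = "control"
    return assignment

# 6 neighborhoods x 400 students each, as in the 2,400-student example
students = [(i, f"hood_{i % 6}") for i in range(2400)]
assignment = assign_within_strata(students)
```

By construction, every neighborhood ends up with exactly 200 treatment and 200 control students, so within-neighborhood comparisons are always possible.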
Why Do We Need Strata?

What about now? The treatment and control youth look "balanced" within neighborhoods. Much better!
Take-up: Example

Rarely can we force people into programs (exception: mandatory schooling ages). Suppose we would like to offer youth mentoring but can only:
- offer an incentive (e.g., skipping one class during the mentorship/treatment);
- advertise the program (communication campaign).
What if we offer the inducement to 500 youth and only 50 participate (often not at random)?
In practice, because of the low take-up rate, we end up with a less precise measuring device:
- We won't be able to detect differences with precision.
- We can only find an effect if it is really large.
In sum: low take-up lowers the precision of our comparisons; it effectively decreases the sample size.
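The dilution can be quantified: if only a share of the offered group participates, the offered-vs-control contrast (the intention-to-treat effect) shrinks to take-up × true effect, and the required sample grows roughly with 1/take-up². A sketch using the standard formula, with illustrative effect sizes of our own choosing:

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.8):
    """Standard two-sided, two-sample size formula (normal approximation)."""
    z = NormalDist()
    return math.ceil(2 * (sd * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / delta) ** 2)

take_up = 0.10            # 50 of the 500 offered youth participate
true_effect = 10          # illustrative: incidents averted among actual participants
itt_effect = take_up * true_effect  # the contrast we can actually measure

n_full = n_per_arm(true_effect, sd=20)  # full take-up
n_itt = n_per_arm(itt_effect, sd=20)    # 10% take-up: ~100x the sample
print(n_full, n_itt)
```

This is the sense in which low take-up "effectively decreases sample size": to keep the same power, the sample must grow by the square of the inverse take-up rate.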
Data Quality

Poor data quality effectively increases the required sample size:
- Missing observations: quality of data collection, attrition, migration.
- High measurement error: answers are not always precise (e.g., self-reported behavior or victimization status, poorly reported peer associations, recollection bias, framing, pleasing the interviewer).
Poor data quality can be partly addressed with a field coordinator on the ground monitoring data collection.
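One way to see why measurement error raises required sample size: under classical (random, additive) measurement error, the observed outcome variance is the true variance plus the noise variance, and n scales with that total. A hedged sketch with illustrative magnitudes of our own:

```python
import math
from statistics import NormalDist

def n_per_arm(delta, sd, alpha=0.05, power=0.8):
    """Standard two-sided, two-sample size formula (normal approximation)."""
    z = NormalDist()
    return math.ceil(2 * (sd * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / delta) ** 2)

# Classical measurement error: observed sd = sqrt(true_var + noise_var)
sd_true = 20.0
sd_noise = 15.0   # illustrative: recall error in self-reported incidents
sd_observed = math.sqrt(sd_true ** 2 + sd_noise ** 2)

print(n_per_arm(10, sd_true))      # clean data
print(n_per_arm(10, sd_observed))  # noisy data needs a larger sample
```

(Non-classical problems such as attrition or systematic misreporting are worse: they can bias the estimate itself, not just inflate its variance.)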
In Conclusion

Whom to interview is ultimately determined by our research/policy questions.
How many (sample size): each of the following implies that the sample size will have to be larger:
- the more (statistical) confidence/precision we want;
- the smaller the effects we want to detect;
- the more underlying heterogeneity (variance);
- the more complicated the design (multiple treatments, strata);
- the lower the take-up rate;
- the lower the data quality.
Power Calculation in Practice: an Example

Calculations can be made in many statistical packages, e.g., Stata or Optimal Design. The Optimal Design software is freely downloadable from the University of Michigan website: http://sitemaker.umich.edu/group-based/optimal_design_software
Power Calculation in Practice: an Example

Example: an experiment in Ghana designed to increase the profits of microenterprise firms. Baseline profits are 50 cedi per month. Profits data are typically noisy, so a coefficient of variation > 1 is common.
Example Stata code to detect a 10% increase in profits:
sampsi 50 55, p(0.8) pre(1) post(1) r1(0.5) sd1(50) sd2(50)
Having both a baseline and an endline decreases the required sample size (pre and post).
Results:
- 10% increase (from 50 to 55): 1,178 firms in each group.
- 20% increase (from 50 to 60): 295 firms in each group.
- 50% increase (from 50 to 75): 48 firms in each group (but this effect size is not realistic).
What if take-up is only 50%? Offer business training that increases profits by 20%, but only half the firms take it up.
- Mean for the treated group = 0.5 × 50 + 0.5 × 60 = 55.
- This is equivalent to detecting a 10% increase with 100% take-up: we need 1,178 firms in each group instead of 295.
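The figures above can be cross-checked by hand. They match the ANCOVA variant of the two-sample formula, in which one baseline and one follow-up measurement with pre/post correlation r scale the required variance by (1 − r²); we assume this is the adjustment sampsi applied here. A Python sketch under that assumption:

```python
import math
from statistics import NormalDist

def n_ancova(mean0, mean1, sd, r, alpha=0.05, power=0.8):
    """Per-arm n for a two-sample comparison with one baseline and one
    follow-up, analysed by ANCOVA: the standard formula times (1 - r^2).
    Assumed to mirror Stata's sampsi with pre(1) post(1) r1(r)."""
    z = NormalDist()
    core = 2 * (sd * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / (mean1 - mean0)) ** 2
    return math.ceil((1 - r ** 2) * core)

print(n_ancova(50, 55, sd=50, r=0.5))  # 10% increase -> 1178 firms per group
print(n_ancova(50, 60, sd=50, r=0.5))  # 20% increase -> 295
print(n_ancova(50, 75, sd=50, r=0.5))  # 50% increase -> 48
```

With r = 0.5, the baseline cuts the required sample by a quarter relative to an endline-only design, which is the "pre and post" saving the slide mentions.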