Planning sample size for randomized evaluations
Marc Shotland, Director of Training
J-PAL, MIT
povertyactionlab.org
Today’s Question
How large does the sample need to be to “credibly” detect a given treatment effect? There are two parts to this question: how large, and what “credibly” means.
What does “credibly” mean? Are the results believable? Are they systematically biased? It means that we can be reasonably sure that the difference between the group that received the program and the group that did not is due to the program.
Randomization removes bias, but it does not remove noise: randomization works because of the “law of large numbers,” so it gives the right answer on average.
But how large must “large” be? That is a statistics question.
Lecture Overview
Estimation
Intro to the scientific method
Hypothesis testing
Statistical significance
Factors that influence power: effect size, sample size
Cluster randomized trials
Estimation
We do not observe the entire population, just a sample. Each child in the sample was given an exam to test language and math competencies. Is their average score close to the mean learning level in the population?
We estimate the mean of the population by computing the average in the sample. If we have very few children in our sample, the averages are imprecise.
When we see a difference in sample averages (between treatment and control groups), we do not know whether it comes from the effect of the treatment or from sampling error.
Basic set up
At the end of an experiment, we compare the average outcome of interest in the treatment group with the average outcome of interest in the control group.
We are interested in the difference: Mean(treatment) − Mean(control) = Effect size
Example: we want to know the effect of giving out textbooks on test scores. We have the scores of students in treatment schools (with books) and in control schools (without books).
Simple Example
Effect of the program
Subtract the average of the control group from the average of the treatment group, or
Run a regression of the outcome (Y) on an indicator of being in the treatment group: Yi = a + b·Ti + ei
Simple Example
Effect of the program
Effect size: the difference in means, or, in the regression Yi = a + b·Ti + ei, the coefficient b = effect size (the slope of the line).
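As an illustration of this equivalence, here is a minimal sketch with simulated (hypothetical) data; the 6-point effect, score spread, and sample size are all assumed for the example:

```python
# Minimal sketch (hypothetical data): the difference in means and the OLS
# coefficient b on the treatment dummy are the same estimate of the effect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
T = rng.integers(0, 2, size=n)           # 1 = treatment, 0 = control
y = 50 + 6 * T + rng.normal(0, 15, n)    # assumed true effect of 6 points

diff_in_means = y[T == 1].mean() - y[T == 0].mean()

X = sm.add_constant(T)                   # regression Yi = a + b*Ti + ei
b = sm.OLS(y, X).fit().params[1]

print(diff_in_means, b)                  # identical up to floating-point error
```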
Accuracy versus Precision
Unbiased and sample size
When we see this… is it from this distribution?
…or this one?
…or this one?
…how about this one?
Estimation When we do estimation Sample size allows us to say something about our distribution But it doesn’t help us know what the truth is
With sample size, we can tell whether it’s this distribution
…or this one…
With other evaluation methods The truth could be here….
Or here….
Or here….
Or here….
With randomized evaluation We can be fairly certain the truth is somewhere around here
But without sufficient sample The truth could be here If our experiment gives us an estimate around here…
But without sufficient sample… The truth could be here If our experiment gives us an estimate around here…
With randomized evaluation AND sufficient sample We can be fairly certain the truth is within here
Unbiased and sample size (figure: four scenarios, with the truth marked in each)
Scientific method
Does the scientific method apply to social science? The scientific method is typically associated with the physical sciences (biology, chemistry, physics), but it is a process, and that process applies here too.
The scientific method involves: 1) proposing a hypothesis, 2) designing experimental studies to test the hypothesis.
How do we test hypotheses? Take measurements, then use statistics to compare the measurements.
Basic set up
We start with our hypothesis. At the end of an experiment, we test that hypothesis: we compare the outcome of interest in the treatment and the comparison groups.
Hypothesis testing
In criminal law, most institutions follow the rule “innocent until proven guilty.” This is meant to reduce the number of innocent people who get sent to jail, accepting that some guilty people will go free.
The prosecutor wants to prove the hypothesis that the accused person is guilty; the burden is on the prosecution to show guilt.
The jury or judge starts with the “null hypothesis” that the accused person is innocent.
Hypothesis testing
In program evaluation, instead of “presumption of innocence,” the rule is “presumption of insignificance.” This is meant to reduce the number of ineffective programs that are pronounced effective, accepting that some real effects will go undetected.
Policymaker’s hypothesis: the program improves learning. Evaluators approach the experiment with the hypothesis that the program has zero impact, and then test this “null hypothesis” (H0).
The burden of proof is on the program: it must show a statistically significant impact.
Hypothesis testing
If our measurements show a difference between the treatment and control groups, our first assumption is that, in truth, there is no impact (H0 is still true) and there is some margin of error due to sampling: “this difference is solely the result of chance (random sampling error).”
Still assuming H0 is true, we then use statistics to calculate how likely it is that we would see a difference this large from random chance alone.
Is this difference due to random chance? (control vs. treatment distributions) Perhaps…
Is this difference due to random chance? (control vs. treatment distributions) Probably not…
Hypothesis testing: conclusions
If it is very unlikely (less than a 5% probability) that the difference is solely due to chance, we “reject our null hypothesis”: we can stop assuming “insignificance” and conclude there is “a statistically significant difference.” The evidence is sufficient to overturn the presumption of insignificance, and we may now say: “our program has a statistically significant impact.”
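As a concrete illustration, here is a minimal sketch (simulated data; the score distributions are assumed) of testing the null hypothesis of zero impact at the 5% level with a standard two-sample t-test, one common way to do this comparison rather than the exact test used in any particular study:

```python
# Minimal sketch: compare treatment and control means and test H0: zero impact.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(25, 15, size=500)      # assumed control test scores
treatment = rng.normal(31, 15, size=500)    # assumed treatment test scores

t_stat, p_value = stats.ttest_ind(treatment, control)

if p_value < 0.05:
    print(f"p = {p_value:.3f}: reject H0 -> statistically significant impact")
else:
    print(f"p = {p_value:.3f}: cannot reject H0")
```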
Hypothesis testing: conclusions
Are we now 100 percent certain there is an impact? No, we may be only 95% confident, and we accept that if we use that 5% threshold, this conclusion may be wrong 5% of the time. That is the price we are willing to pay, since we can never be 100% certain: because we can never see the counterfactual, we must use random sampling and random assignment, and rely on statistical probabilities.
Is this difference due to random chance? (control vs. treatment distributions) …but it’s not impossible
Example: Pratham Balsakhi (Vadodara)
Revisiting the balsakhi example we used in Lecture 3: Why Randomize.
Baseline test score data in Vadodara
This was the distribution of test scores at baseline. The test was out of 100. Some students did really well; most did not. Many actually scored zero.
Endline test scores
Was there an impact? Now look at the improvement: very few scored zero, and many scored much closer to the 40-point range. Looking at these two graphs, would you say there is an impact? [wait for response]
Post-test: control & treatment
Stop! That blue distribution was in fact the control group; the treatment group is green. If the balsakhi program had had zero impact, we still could have expected to see an improvement. Without a control group, we might have falsely attributed that improvement to the program. But now we have a treatment group that can be compared against the control group, and the treatment distribution appears shifted to the right. How do we measure the difference, and is it statistically significant?
Average difference: 6 points
Typically, we use the “average” or “mean” to measure the difference. This is the true difference between the two groups. But can we tell whether it is statistically significant?
Population versus Sample
The 6-point gap is the true difference, but is it due to the program? We must rely on a combination of the randomized design and statistical properties.
It is useful to think about the difference between the “population” and the “sample.” A simple example: if we measured the incomes of all the people with blue eyes in the country and all the people with brown eyes and found a difference, there is one question we might care about: does eye color affect income? If instead we took a random sample of 100 people with blue eyes and 100 people with brown eyes and measured the difference, there would be two questions: do our sample averages reflect the true population averages, and if so, does eye color affect income? For the next forty minutes we will focus on the first question, but we are ultimately interested in the second (next slide).
Population versus Sample How many children would we need to randomly sample to detect that the difference between the two groups is statistically significantly different from zero? OR How many children would we need to randomly sample to approximate the true difference with relative precision?
Testing statistical significance
Significance: control sampling dist
“Significance level” (5%) Critical region
“Significance level” (5%) equals 5% of this total area
Significance: Sample size = 4
Significance: Sample size = 9
Significance: Sample size = 100
Significance: Sample size = 6,000
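The preceding slides show the control sampling distribution tightening as the sample grows. Here is a minimal simulation sketch of the same idea (the test-score population below is assumed purely for illustration):

```python
# Minimal simulation: the standard error of the sample mean shrinks as n grows.
import numpy as np

rng = np.random.default_rng(2)
population = rng.normal(26, 20, size=100_000)   # hypothetical test-score population

for n in (4, 9, 100, 6000):
    sample_means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    print(f"n = {n:>5}: std. error of the mean ~ {np.std(sample_means):.2f}")
```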
Hypothesis testing: conclusions
What if the probability is greater than 5%? We cannot reject our null hypothesis. Are we 100 percent certain there is no impact? No, the result just did not meet the statistical threshold to conclude otherwise. Perhaps there is indeed no impact; or perhaps there is an impact but not enough sample to detect it most of the time; or we got a very unlucky sample this time. How do we reduce this error? POWER!
Hypothesis testing: conclusions
When we use a “95% confidence interval,” we accept that we may come to the wrong conclusion 5% of the time. This is analogous to the criminal justice system saying: “we are willing to send 5% of innocent people to jail.”
How frequently will we “detect” effective programs, i.e. how frequently will we convict the truly guilty? That is statistical power.
Hypothesis testing: 95% confidence
                         YOU CONCLUDE: Effective          YOU CONCLUDE: No effect
THE TRUTH: Effective     (correct)                        Type II error (low power)
THE TRUTH: No effect     Type I error (5% of the time)    (correct)
Intermission… Please return in 10 minutes!
Hypothesis testing: 95% confidence
                         YOU CONCLUDE: Effective          YOU CONCLUDE: No effect
THE TRUTH: Effective     (correct)                        Type II error (low power)
THE TRUTH: No effect     Type I error (5% of the time)    (correct)
Power: How frequently will we “detect” effective programs?
Power: main ingredients
Variance: the “noisier” the outcome is to start with, the harder it is to measure effects. Example: if all children have very similar learning levels without the program, even a very small impact will be easy to detect.
Effect size to be detected: the smaller (more precise) the effect size we want to detect, the larger the sample we need. What is the smallest effect size with practical or policy significance?
Sample size: the more children we sample, the more likely we are to approximate the true difference.
Variance
Variance
There is very little we can do to reduce the noise: the underlying variance is what it is. But we can try to “absorb” variance by using a baseline and by controlling for other variables.
Effect Size
To calculate statistical significance we start with the “null hypothesis”: effect size = 0. That anchors our bell curve (sampling distribution).
To think about statistical power, we need to propose a second hypothesis: effect size = ??? This will anchor a new bell curve that will be compared against the first one.
2 Hypotheses & “significance level” The following is an example…
Null Hypothesis: assume zero impact “Impact = 0” There’s a sampling distribution around that.
Effect Size: 1 “standard deviation” We hypothesize another possible “true effect size”
Effect Size: 1 “standard deviation” And there’s a new sampling distribution around that
Effect Size: 3 standard deviations The less overlap the better…
Significance level: reject H0 in critical region
True effect is 1 SD
Power: when is H0 rejected?
Power: 26% If the true impact was 1SD… The Null Hypothesis would be rejected only 26% of the time
Power: if we change the effect size?
Power: assume effect size = 3 SDs
Power: 91%
The Null Hypothesis would be rejected 91% of the time.
Picking an effect size
What is the smallest effect that should justify the program being adopted? Think of the cost of this program versus the benefits it brings (cost-benefit), and the cost of this program versus the alternative use of the money (cost-effectiveness).
If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero. In contrast, if any effect larger than that would justify adopting this program, we want to be able to distinguish it from zero.
DO NOT USE the “expected” effect size.
Standardized effect sizes
How large an effect you can detect with a given sample depends on how variable the outcome is. The standardized effect size is the effect size divided by the standard deviation of the outcome: δ = effect size / standard deviation.
Common standardized effect sizes: δ = 0.20 (small), δ = 0.40 (medium), δ = 0.50 (large).
Standardized effect size
An effect size of 0.2 is considered modest: the average member of the treatment group had a better outcome than the 58th percentile of the control group.
An effect size of 0.5 is considered large: the average member of the treatment group had a better outcome than the 69th percentile of the control group.
An effect size of 0.8 is considered very large: the average member of the treatment group had a better outcome than the 79th percentile of the control group.
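A quick check of the percentile interpretation above, assuming normally distributed outcomes (the percentile is the normal CDF evaluated at δ):

```python
# Minimal check: with standardized effect size delta and normal outcomes, the
# average treated individual outscores Phi(delta) of the control group.
from scipy import stats

for delta in (0.2, 0.5, 0.8):
    pct = stats.norm.cdf(delta) * 100
    print(f"delta = {delta}: ~{pct:.0f}th percentile of the control group")
# delta = 0.2 -> ~58th, 0.5 -> ~69th, 0.8 -> ~79th
```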
Effect Size: Bottom Line
You should not alter the effect size to achieve power; the effect size is more of a policy question.
One variable that can affect the effect size is take-up. If your job training program increases income by 20%, but only half of the people in your treatment group participate, you need to adjust your impact estimate accordingly: from 20% to 10% (see the sketch below).
So how do you increase power? Try increasing the sample size.
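A tiny worked version of that take-up adjustment (numbers from the slide):

```python
# Minimal sketch: diluting the effect by take-up.
effect_on_participants = 0.20    # program raises income 20% for participants
take_up = 0.50                   # only half of the treatment group participates
average_treatment_group_effect = effect_on_participants * take_up
print(average_treatment_group_effect)   # 0.10, i.e. 10%
```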
Sample size Increasing sample size reduces the “spread” of our bell curve The more observations we randomly pull, the more likely we get the “true average”
Power: Effect size = 1SD, Sample size = 1
Power: Sample size = 4
Power: 64%
Power: Sample size = 9
Power: 91%
Sample size In this example: a sample size of 9 gave us good power But the effect size we used was very large (1 SD)
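Here is a minimal sketch of the power calculation these slides illustrate. It assumes the slides' framing, a one-sided test at the 5% level on the sample mean with standard error σ/√n, and reproduces the 26%, 64%, and 91% figures shown above:

```python
# Minimal power sketch: one-sided 5% test on the mean, standard error sigma/sqrt(n).
from scipy import stats

def power(effect_sd, n, alpha=0.05):
    critical = stats.norm.ppf(1 - alpha)        # one-sided 5% critical value
    noncentrality = effect_sd * n ** 0.5        # effect / (sigma / sqrt(n))
    return 1 - stats.norm.cdf(critical - noncentrality)

for effect_sd, n in ((1.0, 1), (3.0, 1), (1.0, 4), (1.0, 9)):
    print(f"effect = {effect_sd} SD, n = {n}: power ~ {power(effect_sd, n):.2f}")
# effect = 1 SD, n = 1: ~0.26   effect = 3 SD, n = 1: ~0.91
# effect = 1 SD, n = 4: ~0.64   effect = 1 SD, n = 9: ~0.91
```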
Calculating power
When planning an evaluation, with some preliminary research we can calculate the minimum sample we need to: test a pre-specified null hypothesis (e.g. treatment effect = 0), for a pre-specified significance level (e.g. 0.05), given a pre-specified effect size (e.g. 0.2 standard deviations of the outcome of interest), to achieve a given power.
A power of 80% tells us that, in 80% of the experiments of this sample size conducted in this population, if H0 is in fact false (i.e. the treatment effect is not zero), we will be able to reject it. The larger the sample, the larger the power. Commonly used power: 80%, 90%.
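As an illustration, here is a minimal sketch using statsmodels' power calculator to find the minimum sample size per arm for an assumed 0.2 SD effect, a 5% significance level, and 80% power (two-sided test, equal group sizes; clustering is not yet accounted for):

```python
# Minimal sketch: solve for the sample size per arm given effect size, alpha, power.
from statsmodels.stats.power import TTestIndPower

n_per_arm = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.80,
                                        ratio=1.0, alternative='two-sided')
print(round(n_per_arm))   # roughly 390-400 observations per arm under these assumptions
```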
Clustered design: intuition You want to know how close the upcoming national elections will be Method 1: Randomly select 50 people from entire Indian population Method 2: Randomly select 5 families, and ask ten members of each family their opinion
Clustered design: intuition If the response is correlated within a group, you learn less information from measuring multiple people in the group It is more informative to measure unrelated people Measuring similar people yields less information
Clustered design Cluster randomized trials are experiments in which social units or clusters rather than individuals are randomly allocated to intervention groups The unit of randomization (e.g. the school) is broader than the unit of analysis (e.g. students) That is: randomize at the school level, but use child-level tests as our unit of analysis
Consequences of clustering
The outcomes for all the individuals within a unit may be correlated: all villagers are exposed to the same weather, all students share a schoolmaster, the program affects all students at the same time, and the members of a village interact with each other.
We call ρ (rho) the correlation between units within the same cluster.
Values of ρ (rho)
Like a percentage, ρ must be between 0 and 1. When working with clustered designs, a lower ρ is more desirable. It is sometimes low (0, 0.05, 0.08), but it can be high (0.62):
Madagascar (math + language): 0.5
Busia, Kenya (math + language): 0.22
Udaipur, India (math + language): 0.23
Mumbai, India (math + language): 0.29
Vadodara, India (math + language): 0.28
Busia, Kenya (math): 0.62
A few examples of sample size
Women’s Empowerment — interventions (+ control): 2; clusters: Rajasthan 100, West Bengal 161; sample: 1,996 and 2,813 respondents
Pratham Read India — interventions (+ control): 4; clusters: 280 villages; sample: 17,500 children
Pratham Balsakhi — clusters: Mumbai 77 schools, Vadodara 122 schools; sample: 10,300 and 12,300 children
Kenya Extra Teacher Program — interventions (+ control): 8; clusters: 210 schools; sample: 10,000 children
Deworming — interventions (+ control): 3; clusters: 75 schools; sample: 30,000 children
Bednets — interventions (+ control): 5; clusters: 20 health centers; sample: 545 women
Implications for design and analysis
Analysis: the standard errors will need to be adjusted to take into account the fact that observations within a cluster are correlated. Adjustment factor (design effect): for a given total sample size, with clusters of size m and intra-cluster correlation ρ, the size of the smallest effect we can detect increases by a factor of √(1 + (m − 1)ρ) compared to a non-clustered design.
Design: we need to take clustering into account when planning the sample size.
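A minimal sketch of that adjustment factor (the cluster size and the ρ values below are illustrative, several taken from the table of ρ values above):

```python
# Minimal sketch of the design effect on the minimum detectable effect (MDE):
# with clusters of size m and intra-cluster correlation rho, the MDE grows by
# sqrt(1 + (m - 1) * rho) relative to an unclustered design of the same size.
def mde_inflation(m, rho):
    return (1 + (m - 1) * rho) ** 0.5

for m, rho in ((40, 0.0), (40, 0.05), (40, 0.22), (40, 0.5)):
    print(f"cluster size {m}, rho = {rho}: MDE x {mde_inflation(m, rho):.2f}")
```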
Implications
If the experimental design is clustered, we now need to consider ρ when choosing a sample size (as well as the other factors). It is extremely important to randomize an adequate number of groups: often the number of individuals within groups matters less than the total number of groups.
Parameters that affect sample size
Difference between average outcomes (effect size)
Variance of the outcome
Continuous vs. binary outcomes
Power
Significance criterion
Covariates
Clusters
Group sizes
Repeated observations
Blocking (pairing / stratification)