TRANSLATING RESEARCH INTO ACTION Sampling and Sample Size Marc Shotland J-PAL HQ

Outline Sampling distributions Detecting impact

Outline Sampling distributions – population distribution – sampling distribution – law of large numbers/central limit theorem – standard deviation and standard error Detecting impact

Balsakhi baseline test scores

Mean = 26

Standard Deviation = 20

Good, the average of the sample is about 26… Population v. sampling distribution

But… I remember that as my sample size goes up, isn’t the sampling distribution supposed to turn into a bell curve? (Central Limit Theorem) Is it that my sample isn’t large enough?

This is the distribution of my sample of 10,000 students! Population v. sampling distribution

Possible results & probability: 1 die This is our Population Distribution (a uniform distribution)

What’s the average result? If you were to roll a die once, what is the “expected result”? (i.e. the average)

Rolling 1 die: possible results & average

What’s the average result? If you were to roll two dice once, what is the expected sum of the two dice?

Rolling 2 dice: possible totals. 11 possible totals (2 to 12), 36 permutations

Rolling 2 dice: Possible totals & likelihood

What’s the average result? If you were to roll two dice once, what is the expected sum of the two dice?

What’s the average result? If you were to roll one die twice, what is the expected average of the two dice?

Rolling 2 dice: Average of dice & likelihood This is called a SAMPLING DISTRIBUTION of the mean

Sampling distribution Gives you: 1. All possible outcomes 2. The likelihood of each of those outcomes – Each column represents one possible outcome (average result) – Each block within a column represents one possible permutation (to obtain that average)

Rolling 3 dice: 16 results (3 to 18), 216 permutations

Rolling 4 dice: 21 results, 1296 permutations

Rolling 5 dice: 26 results, 7776 permutations

Rolling 10 dice: 51 results (10 to 60), >60 million permutations. Looks like a bell curve, or a normal distribution

Rolling 30 dice: 151 results (30 to 180), about 2 × 10^23 permutations (about as many as there are stars in the universe)

Rolling 100 dice: 501 results (100 to 600), about 6.5 × 10^77 permutations. >99% of all rolls will yield an average between 3 and 4
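A minimal simulation sketch of the dice example (the function name and the use of numpy are my additions, not part of the slides): average n dice many times and watch the spread of those averages shrink as n grows.

import numpy as np

rng = np.random.default_rng(0)

def sampling_distribution(n_dice, n_rolls=100_000):
    """Average of n_dice fair dice, repeated n_rolls times."""
    rolls = rng.integers(1, 7, size=(n_rolls, n_dice))  # faces 1-6
    return rolls.mean(axis=1)

for n in (1, 2, 10, 100):
    means = sampling_distribution(n)
    print(f"{n:>3} dice: mean of averages = {means.mean():.2f}, "
          f"spread (SD) of averages = {means.std():.3f}")
# The mean of the averages stays near 3.5 (law of large numbers), while their
# spread shrinks roughly as 1/sqrt(n) and the shape approaches a bell curve
# (central limit theorem).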

Sampling in the real world When you take a sample from the real world, you usually get one shot only The population distribution isn’t always as nice and tidy as the uniform distribution of the dice

Population v. sampling distribution: N=1

Sampling Distribution: N=4

Law of Large Numbers: N=9

Law of Large Numbers: N=100

Central Limit Theorem: N=1

Central Limit Theorem : N=4

Central Limit Theorem : N=9

Central Limit Theorem : N =100

Standard deviation/error What’s the difference between the standard deviation and the standard error?

Standard Deviation/ Standard Error

Sample size ↑ x4, SE ↓ ½

Sample size ↑ x9, SE ↓ ?

Sample size ↑ x100, SE ↓?
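A short worked relation behind the three quiz slides above (the standard formula for the standard error of a sample mean, not new material from these slides):

\[
SE = \frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
n \times 4 \Rightarrow SE \div 2, \qquad
n \times 9 \Rightarrow SE \div 3, \qquad
n \times 100 \Rightarrow SE \div 10 .
\]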

Outline Sampling distributions Detecting impact – significance – effect size – power – baseline and covariates – clustering – stratification

Baseline test scores

Was there an impact? Endline test scores

Stop! That was the control group. The treatment group is green. Post-test: control & treatment

This is the true difference between the 2 groups Average difference: 6 points

Population – It is the truth – But is it due to the program? – Uncertain: we must rely on a combination of: the randomized design, and Statistical properties Population versus Sample

Sampling When you take a single random sample (one roll of the dice) you get only one permutation (and therefore one outcome) If you didn’t know the true average, that would be your best estimate of the average If it was a true random roll, it would be unbiased

Sample – How many children would we need to randomly sample to detect that the difference between the two groups is statistically significantly different from zero? OR – How many children would we need to randomly sample to approximate the true difference with relative precision? Population versus Sample

If our measurements show a difference between the treatment and control group, our first assumption is: – In truth, there is no impact (our H0 is still true) – There is some margin of error due to sampling – “This difference is solely the result of chance (random sampling error)” We (still assuming H0 is true) then use statistics to calculate how likely this difference is in fact due to random chance Hypothesis testing

If it is very unlikely (less than a 5% probability) that the difference is solely due to chance: – We “reject our null hypothesis” We may now say: – “our program has a statistically significant impact” Hypothesis testing: conclusions

Are we now 100 percent certain there is an impact? – No, we are only 95% confident – And we accept that if we use that 5% threshold, this conclusion may be wrong 5% of the time – That is the price we’re willing to pay since we can never be 100% certain – Because we can never see the counterfactual, we must use random sampling and random assignment, and rely on statistical probabilities Hypothesis testing: conclusions

Testing statistical significance

Significance: control sampling dist

“Significance level” (5%) Critical region

“Significance level” (5%) equals 5% of this total area

Significance: Sample size = 4

Significance: Sample size = 9

Significance: Sample size = 100

Significance: Sample size = 6,000
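To make the significance-versus-sample-size slides concrete, here is an illustrative simulation sketch (my own parameter choices, loosely echoing the Balsakhi numbers: control mean 26, SD 20, a true 6-point impact; scipy is assumed and is not mentioned in the slides): a two-sample t-test at several sample sizes.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_experiment(n_per_group, true_impact=6.0, mu=26.0, sd=20.0):
    """Simulate one endline survey and test treatment vs. control."""
    control = rng.normal(mu, sd, n_per_group)
    treatment = rng.normal(mu + true_impact, sd, n_per_group)
    return stats.ttest_ind(treatment, control).pvalue

for n in (4, 9, 100, 6000):
    print(f"n per group = {n:>5}: p-value = {one_experiment(n):.3f}")
# With tiny samples the same 6-point impact is usually not significant;
# with thousands of children it almost always is.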

POWER! What if the probability is greater than 5%? – We can’t reject our null hypothesis – Are we 100 percent certain there is no impact? No, it just didn’t meet the statistical threshold to conclude otherwise – Perhaps there is indeed no impact – Or perhaps there is an impact, but not enough sample to detect it most of the time, or we got a very unlucky sample this time How do we reduce this error? Hypothesis testing: conclusions

When we use a “95% confidence interval,” how frequently will we “detect” effective programs? That is statistical power. Hypothesis testing: conclusions

Hypothesis testing: 95% confidence

                         YOU CONCLUDE: Effective           YOU CONCLUDE: No Effect
THE TRUTH: Effective     ✓ (correct)                       Type II Error (low power)
THE TRUTH: No Effect     Type I Error (5% of the time)     ✓ (correct)

Power: main ingredients 1. Effect Size 2. Variance 3. Sample Size 4. Proportion of sample in T vs C 5. Clustering 6. Take-up

1. Effect Size to be detected – The finer (more precise) the effect size we want to detect, the larger the sample we need – Smallest effect size with practical / policy significance? 2. Variance – The more “noisy” it is to start with, the harder it is to measure effects 3. Sample Size – The more children we sample, the more likely we are to obtain the true difference 4. Proportion of sample in T vs C 5. Clustering and rho 6. Take-up (back to effect size) Power: main ingredients

To calculate statistical significance, we start with the null hypothesis. To think about statistical power, we need to propose a secondary (alternative) hypothesis. Effect Size

Effect Size: standard normal 1 Standard Deviation

Effect Size: 1 standard deviation 1 Standard Deviation

Effect Size: 3 standard deviations 3 Standard Deviations

Power: control group = H 0

Power: reject H 0 in critical region

Power: true effect is 1 SD

Power: when is H 0 rejected?

Power: 26%

Power: assume impact = 1 SD

Power: assume impact = 3 SDs

Power: 91%

DO NOT USE: “Expected” effect size. What is the smallest effect that should justify the program being adopted? If the effect is smaller than that, it might as well be zero: we are not interested in proving that a very small effect is different from zero. In contrast, if any effect larger than that would justify adopting this program, we want to be able to distinguish it from zero. Picking an effect size

How large an effect you can detect with a given sample depends on how variable the outcome is. The standardized effect size is the effect size divided by the standard deviation of the outcome. Common effect sizes are shown below. Standardized effect sizes

Standardized effect size

An effect size of…   Is considered…                    …and it means that…
0.2                  Modest                            The average member of the treatment group had a better outcome than the 58th percentile of the control group
0.5                  Large                             The average member of the treatment group had a better outcome than the 69th percentile of the control group
0.8                  Whoa… that’s a big effect size!   The average member of the treatment group had a better outcome than the 79th percentile of the control group
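The percentile column can be checked against the standard normal CDF; a quick sketch (scipy assumed, not part of the original slides):

from scipy.stats import norm

for effect_size in (0.2, 0.5, 0.8):
    percentile = norm.cdf(effect_size) * 100
    print(f"effect size {effect_size}: beats ~{percentile:.0f}th percentile of control")
# 0.2 -> ~58th, 0.5 -> ~69th, 0.8 -> ~79th percentile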

Effect Size: Bottom Line — You should not alter the effect size to achieve power; the effect size is more of a policy question. One variable that can affect the effect size is take-up! If your job training program increases income by 20%, but only ½ of the people in your treatment group participate, you need to adjust your impact estimate accordingly: from 20% to 10%. So how do you increase power? Try increasing the sample size.
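A worked relation for the take-up point (a sketch of the standard result used in power-calculation guides, not stated explicitly in the slides): if only a fraction c of the treatment group takes up the program, the measured intention-to-treat effect is roughly c times the effect on participants, so the sample size needed to detect it grows by roughly 1/c².

\[
\text{ITT effect} \approx c \cdot \text{effect on participants},
\qquad
N_{\text{required}} \propto \frac{1}{c^{2}}
\quad\Rightarrow\quad
c = \tfrac{1}{2} \ \Rightarrow\ N \text{ rises by a factor of } 4 .
\]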

Power: main ingredients 1. Effect Size 2. Variance 3. Sample Size 4. Proportion of sample in T vs C 5. Clustering 6. Take-up

Power: assume sample size = N

Power: assume sample size = 4*N

Power: 64%

Power: assume sample size = 9*N

Power: 91%
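A sketch of how power numbers like these can be computed (statsmodels assumed; the effect size and base sample below are illustrative choices of mine, not the parameters behind the slides):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
base_n = 25  # observations per group; illustrative only

for multiple in (1, 4, 9):
    power = analysis.power(effect_size=0.5, nobs1=base_n * multiple,
                           alpha=0.05, ratio=1.0, alternative='two-sided')
    print(f"{multiple}x the base sample: power = {power:.2f}")
# Power rises steeply with sample size for a fixed effect size.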

Power: main ingredients 1. Effect Size 2. Variance 3. Sample Size 4. Proportion of sample in T vs C 5. Clustering 6. Take-up

Sample split: 50% C, 50% T

Power: 91%

Sample split: 50% C, 50% T

Sample split: 25% C, 75% T

Power: 83%

Sample split: 75% C, 25% T

Power: 83%

Allocation to T v C
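A short sketch of why the 50/50 split is best (standard variance algebra, not taken from the slides): with total sample N and a fraction p assigned to treatment,

\[
\operatorname{Var}\left(\bar{Y}_T - \bar{Y}_C\right)
= \sigma^{2}\left(\frac{1}{pN} + \frac{1}{(1-p)N}\right)
= \frac{\sigma^{2}}{N}\cdot\frac{1}{p(1-p)},
\]

which is smallest at p = 1/2. A 25/75 (or 75/25) split inflates this variance by a factor of 4/3, i.e. standard errors grow by about 15%, which is why power falls symmetrically on either side of an even split.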

Power: main ingredients 1. Effect Size 2. Variance 3. Sample Size 4. Proportion of sample in T vs C 5. Clustering 6. Take-up

There is sometimes very little we can do to reduce the noise The underlying variance is what it is We can try to “absorb” variance: – using a baseline – controlling for other variables Variance

Baseline and Covariates
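One way the “absorb variance” idea is typically implemented (a sketch with made-up column names, using statsmodels, which the slides do not mention): regress the endline outcome on treatment and the baseline score, so residual variance shrinks and the treatment coefficient is estimated more precisely.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 1000
baseline = rng.normal(26, 20, n)      # e.g., baseline test scores
treat = rng.integers(0, 2, n)         # random assignment (0 = control, 1 = treatment)
endline = 0.8 * baseline + 6 * treat + rng.normal(0, 10, n)
df = pd.DataFrame({"endline": endline, "baseline": baseline, "treat": treat})

simple = smf.ols("endline ~ treat", data=df).fit()
with_baseline = smf.ols("endline ~ treat + baseline", data=df).fit()
print("SE of treatment effect without baseline:", round(simple.bse["treat"], 2))
print("SE of treatment effect with baseline:   ", round(with_baseline.bse["treat"], 2))
# Controlling for the baseline absorbs variance, shrinking the standard error
# of the treatment effect for the same sample size.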

Power: main ingredients 1. Effect Size 2. Variance 3. Sample Size 4. Proportion of sample in T vs C 5. Clustering 6. Take-up

You want to know how close the upcoming national elections will be Method 1: Randomly select 50 people from the entire Indian population Method 2: Randomly select 5 families, and ask ten members of each family their opinion Clustered design: intuition

If the response is correlated within a group, you learn less information from measuring multiple people in the group It is more informative to measure unrelated people Measuring similar people yields less information Clustered design: intuition

Cluster randomized trials are experiments in which social units or clusters rather than individuals are randomly allocated to intervention groups The unit of randomization (e.g. the school) is broader than the unit of analysis (e.g. students) That is: randomize at the school level, but use child-level tests as our unit of analysis Clustered design

The outcomes for all the individuals within a unit may be correlated We call rho the intra-cluster correlation: the correlation between units within the same cluster Consequences of clustering

Like percentages, rho must be between 0 and 1. When working with clustered designs, a lower rho is more desirable. It is sometimes low (0, .05, .08), but can be high:

Values of rho
Madagascar         Math + Language   0.5
Busia, Kenya       Math + Language   0.22
Udaipur, India     Math + Language   0.23
Mumbai, India      Math + Language   0.29
Vadodara, India    Math + Language   0.28
Busia, Kenya       Math              0.62

Analysis: The standard errors will need to be adjusted to take into account the fact that the observations within a cluster are correlated. Adjustment factor (design effect): for a given total sample size, clusters of size m, and intra-cluster correlation rho, the size of the smallest effect we can detect increases by √(1 + (m − 1) · rho) compared to a non-clustered design. Design: We need to take clustering into account when planning sample size. Implications for design and analysis
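A sketch of the design-effect arithmetic (plain Python; the example cluster size and rho below are illustrative, not taken from the slides):

import math

def design_effect(m, rho):
    """Design effect for clusters of size m with intra-cluster correlation rho."""
    return 1 + (m - 1) * rho

def mde_inflation(m, rho):
    """Factor by which the minimum detectable effect grows vs. a non-clustered design."""
    return math.sqrt(design_effect(m, rho))

# Example (hypothetical numbers): 40 children per school, rho = 0.22
print(round(mde_inflation(40, 0.22), 2))  # ~3.1: the detectable effect roughly triples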

Few examples of sample size

Study                         # of interventions (+ Control)   Total number of clusters                     Total sample size
Women’s Empowerment           2                                Rajasthan: 100; West Bengal: …               … respondents; 2,813 respondents
Pratham Read India            4                                280 villages                                 17,500 children
Pratham Balsakhi              2                                Mumbai: 77 schools; Vadodara: 122 schools    10,300 children; 12,300 children
Kenya Extra Teacher Program   8                                210 schools                                  10,000 children
Deworming                     3                                75 schools                                   30,000 children

Testing multiple treatments [diagram: minimum detectable effect sizes between pairs of treatment and control groups, ranging from 0.05 SD to 0.25 SD]

Stratification Why? – Power: increasing precision for the same sample size What? – With large samples, stratify on everything – With small samples, stratify on variables that explain most variation in outcome How? – Using baseline and/or administrative data
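A minimal sketch of stratified random assignment (pandas assumed; the column names and data are illustrative, not from the slides): randomize separately within each stratum so treatment and control are balanced on the stratifying variables.

import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Illustrative data: 120 schools stratified by district and baseline-score tercile
schools = pd.DataFrame({
    "district": rng.choice(["A", "B"], 120),
    "score_tercile": rng.choice([1, 2, 3], 120),
})

# Randomize separately within each (district, tercile) stratum
schools["arm"] = "C"
for _, idx in schools.groupby(["district", "score_tercile"]).groups.items():
    treated = rng.choice(idx, size=len(idx) // 2, replace=False)
    schools.loc[treated, "arm"] = "T"

# Treatment and control are now balanced within every stratum
print(schools.groupby(["district", "score_tercile", "arm"]).size())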