Impact Evaluation Methods Sebastian Martinez Impact Evaluation Cluster, AFTRL Slides by Paul J. Gertler & Sebastian Martinez
Motivation Objective in evaluation is to estimate the CAUSAL effect of intervention X on outcome Y What is the effect of a cash transfer on household consumption? For causal inference we must understand the data generation process For impact evaluation, this means understanding the behavioral process that generates the data how benefits are assigned
How to assess impact What is the effect of a cash transfer on household consumption? Formally, program impact is: α = (Y | P=1) - (Y | P=0)
Motivation We observe an outcome indicator:
Motivation And its value increases after the program: Intervention
Motivation However, we have to identify the counterfactual: Intervention
Tool Belt of IE Methods Randomized Experiments Quasi-experiments Randomized Promotion-Instrumental Variables Regression Discontinuity Difference in difference – panel data Matching All cases involve knowing the rule for assigning treatment
Choosing your design For impact evaluation, we will identify the “best” possible design given the operational context Best design = fewest risks for contamination Omitted Variables (biased estimates) Selection (results not generalizable)
Case Study Effect of cash transfers on consumption Estimate impact of cash transfer on consumption per capita Make sure: Cash transfer comes before change in consumption Cash transfer is correlated with consumption Cash transfer is the only thing changing consumption Example based on Oportunidades
Oportunidades National anti-poverty program in Mexico (1997) Cash transfers and in-kind benefits conditional on school attendance and health care visits. Transfer given preferably to mother of beneficiary children. Large program with large transfers: 5 million beneficiary households in 2004 Large transfers, capped at: $95 USD for HH with children through junior high $159 USD for HH with children in high school
Oportunidades Evaluation Phasing in of intervention 50,000 eligible rural communities Random sample of 506 eligible communities in 7 states - evaluation sample
“Counterfeit” Counterfactual Number 1 Before and after: Assume we have data on Treatment households before the cash transfer Treatment households after the cash transfer Estimate “impact” of cash transfer on household consumption: Compare consumption per capita before the intervention to consumption per capita after the intervention Difference in consumption per capita between the two periods is “treatment”
Case 1: Before and After Compare Y before and after intervention αi = (CPCi,t | T=1) - (CPCi,t-1 | T=0) Estimate of counterfactual: (CPCi,t | T=0) = (CPCi,t-1 | T=0) “Impact” = A-B [Figure: CPC against Time, Before (t-1) and After (t), points A and B]
Case 1: Before and After
Case 1: Before and After Compare Y before and after intervention αi = (CPCi,t | T=1) - (CPCi,t-1 | T=0) Estimate of counterfactual: (CPCi,t | T=0) = (CPCi,t-1 | T=0) “Impact” = A-B Does not control for time-varying factors Recession: Impact = A-C Boom: Impact = A-D [Figure: CPC against Time, Before (t-1) and After (t), points A, B, C?, D?]
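The before-and-after problem can be made concrete with a small sketch. All numbers below are hypothetical and chosen only for illustration: the observed change mixes the true effect of the transfer with whatever else changed between t-1 and t.

```python
# Hypothetical numbers, chosen only to illustrate the bias of a
# before-and-after comparison. They are not Oportunidades data.

def before_after_estimate(cpc_before, cpc_after):
    """Naive 'impact': outcome after minus outcome before (A - B)."""
    return cpc_after - cpc_before

TRUE_EFFECT = 30.0   # assumed true impact of the transfer
CPC_BEFORE = 200.0   # consumption per capita at t-1

results = {}
for label, time_shock in [("stable", 0.0), ("recession", -20.0), ("boom", 20.0)]:
    # Observed outcome at t = baseline + economy-wide shock + true effect
    cpc_after = CPC_BEFORE + time_shock + TRUE_EFFECT
    results[label] = before_after_estimate(CPC_BEFORE, cpc_after)
    bias = results[label] - TRUE_EFFECT
    print(f"{label}: estimate={results[label]:.1f}, bias={bias:.1f}")
```

The estimate equals the true effect only when nothing else changes over time; in the recession and boom cases the time shock is attributed to the program.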
“Counterfeit” Counterfactual Number 2 Enrolled/Not Enrolled Voluntary enrollment in the program Assume we have a cross-section of post-intervention data on: Households that did not enroll Households that enrolled Estimate “impact” of cash transfer on household consumption: Compare consumption per capita of those who did not enroll to consumption per capita of those who enrolled Difference in consumption per capita between the two groups is “treatment”
Case 2: Enrolled/Not Enrolled
Those who did not enroll... Impact estimate: αi = (Yi,t | P=1) - (Yj,t | P=0), Counterfactual: (Yj,t | P=0) ≠ (Yi,t | P=0) Examples: Those who choose not to enroll in program Those who were not offered the program Conditional Cash Transfer Job Training program Cannot control for all the reasons why some choose to sign up and others do not Reasons could be correlated with outcomes We can control for observables... But are still left with the unobservables
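A minimal simulation of this selection problem, under stated assumptions: an unobserved trait (called `motivation` here, a hypothetical name) drives both enrollment and the outcome, and the true effect is fixed at 5. None of the numbers come from the evaluation.

```python
import random
from statistics import mean

rng = random.Random(0)   # fixed seed so the sketch is reproducible
TRUE_EFFECT = 5.0        # assumed true program effect

enrolled, not_enrolled = [], []
for _ in range(20000):
    motivation = rng.gauss(0, 1)                 # unobservable to the evaluator
    enrolls = motivation + rng.gauss(0, 1) > 0   # self-selection into the program
    outcome = 100 + 10 * motivation + rng.gauss(0, 1)
    if enrolls:
        outcome += TRUE_EFFECT
    (enrolled if enrolls else not_enrolled).append(outcome)

# Enrolled vs. not enrolled: mixes the program effect with selection.
naive_estimate = mean(enrolled) - mean(not_enrolled)
print(f"true effect = {TRUE_EFFECT}, naive estimate = {naive_estimate:.1f}")
```

Because the enrolled are more motivated on average, the naive comparison overstates the effect, and no set of observed controls removes a difference that is never measured.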
Impact Evaluation Example: Two counterfeit counterfactuals What is going on? Which of these do we believe? Problem with Before-After: Cannot control for other time-varying factors Problem with Enrolled-Not Enrolled: Do not know why the treated are treated and the others are not
Solution to the Counterfeit Counterfactual Sick 2 days Sick 10 days Observe Y with treatment ESTIMATE Y without treatment Impact = 2 - 10 = - 8 days sick! On AVERAGE, is a good counterfactual for
Measuring Impact Randomized Experiments Quasi-experiments Randomized Promotion-Instrumental Variables Regression Discontinuity Difference in difference – panel data Matching
Choosing the methodology….. Choose the most robust strategy that fits the operational context Use program budget and capacity constraints to choose a design, i.e. pipeline: Universe of eligible individuals typically larger than available resources at a single point in time Fairest and most transparent way to assign benefit may be to give all an equal chance of participating randomization
Randomization The “gold standard” in impact evaluation Give each eligible unit the same chance of receiving treatment Lottery for who receives benefit Lottery for who receives benefit first
Randomization External Validity (sample) Internal Validity (identification)
Case 3: Oportunidades Randomization Random assignment of benefits by community: 320 treatment communities (14,446 households) First transfers distributed April 1998 186 control communities (9,630 households) First transfers November 1999
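A lottery of this kind can be sketched in a few lines. The community counts (506 in the evaluation sample, 320 assigned to treatment) come from the slides; the community IDs, the seed, and the function name `lottery` are hypothetical, and this is not the procedure actually used by the evaluation team.

```python
import random

def lottery(unit_ids, n_treated, seed):
    """Assign treatment by lottery: every unit has the same chance of selection."""
    rng = random.Random(seed)
    treated = set(rng.sample(unit_ids, n_treated))
    return {unit: unit in treated for unit in unit_ids}

# 506 evaluation communities, 320 of them drawn for early treatment.
communities = list(range(1, 507))      # hypothetical community IDs
assignment = lottery(communities, n_treated=320, seed=1998)

n_treatment = sum(assignment.values())
n_control = len(assignment) - n_treatment
print(n_treatment, n_control)          # 320 186
```

Fixing the seed makes the draw reproducible and auditable, which supports the transparency argument for assigning benefits by lottery.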
Baseline characteristics
Case 3: Randomization
Impact Evaluation Example: No Design vs. Randomization
Measuring Impact Randomized Experiments Quasi-experiments Randomized Promotion-Instrumental Variables Regression Discontinuity Difference in difference – panel data Matching
Randomized Promotion Common scenario: Voluntary enrollment in program National Program with universal eligibility Can’t “control” who enrolls and who does not Can we compare enrolled to not enrolled?
Randomized Promotion Possible solution: random promotion or incentives into the program Information Encouragement (small gift or prize) Transport Other help/incentives Those who get promotion are more likely to enroll But who got promotion was determined randomly, so not correlated with other observables/non-observables Compare average outcomes of two groups: promoted/not promoted
Encouragement Design Never Takeup Takeup if Encouraged Always Takeup NOT Takeup = 30% Y = 90 Change Takeup = 50% Change Y=10 Impact = 20 Never Takeup Takeup if Encouraged Always Takeup
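The arithmetic behind the impact figure on this slide can be written out as follows. Promotion can change outcomes only through the households that take up because they were encouraged, so the change in the average outcome is scaled by the change in takeup:

```latex
\[
\text{Impact}
  = \frac{\Delta Y}{\Delta \text{Takeup}}
  = \frac{10}{0.50}
  = 20
\]
```

This is the effect for the "Takeup if Encouraged" group only; it says nothing about those who never or always take up.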
Examples – Randomized Promotion Maternal Child Health Insurance in Argentina Intensive information campaigns Employment Program in Argentina Transport voucher Community Based School Management in Nepal Assistance from NGO Health Risk Funds in India Assistance from Community Resource Teams
Randomized Promotion Just an example of an Instrumental Variable A variable correlated with treatment but nothing else (i.e. random promotion) Again, we really just need to understand how the benefits are assigned Don’t have to exclude anyone
Two Stage Least Squares (2SLS) Model with endogenous Treatment (T): Stage 1: Regress endogenous variable on the IV (Z) and other exogenous regressors Calculate predicted value for each observation: T hat
Two stage Least Squares (2SLS) Stage 2: Regress outcome y on predicted variable (and other exogenous variables) Need to correct Standard Errors (they are based on T hat rather than T) In practice just use STATA - ivreg Intuition: T has been “cleaned” of its correlation with ε.
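The two stages can be sketched in plain Python for the simplest case: one binary instrument Z (for example, randomized promotion) and no other regressors. The simulated data, the variable `ability`, and the true effect of 20 are assumptions made for illustration. The sketch returns only the point estimate; as noted above, standard errors from a manual second stage are not valid.

```python
import random
from statistics import mean

def ols(x, y):
    """Simple OLS of y on x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def two_stage_least_squares(z, t, y):
    # Stage 1: regress the endogenous treatment T on the instrument Z
    slope1, intercept1 = ols(z, t)
    t_hat = [intercept1 + slope1 * zi for zi in z]   # predicted T ("T hat")
    # Stage 2: regress the outcome y on T hat
    beta, _ = ols(t_hat, y)
    return beta

rng = random.Random(42)
TRUE_EFFECT = 20.0                     # assumed effect of T on y
z, t, y = [], [], []
for _ in range(20000):
    zi = 1 if rng.random() < 0.5 else 0            # randomly promoted or not
    ability = rng.gauss(0, 1)                      # unobserved: affects T and y
    ti = 1 if ability + 1.5 * zi + rng.gauss(0, 1) > 0.75 else 0
    z.append(zi)
    t.append(ti)
    y.append(50 + TRUE_EFFECT * ti + 10 * ability + rng.gauss(0, 1))

naive, _ = ols(t, y)                   # biased: T is correlated with the error
iv = two_stage_least_squares(z, t, y)
print(f"naive OLS = {naive:.1f}, 2SLS = {iv:.1f}, truth = {TRUE_EFFECT}")
```

In this simulation the naive regression overstates the effect because takeup is correlated with `ability`, while the 2SLS estimate uses only the variation in T that comes from the random promotion.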
Case 6: IV Estimate TOT effect of Oportunidades on consumption Run 2SLS regression
Measuring Impact Randomized Experiments Quasi-experiments Randomized Promotion-Instrumental Variables Regression Discontinuity Difference in difference – panel data Matching
Case 4: Regression Discontinuity Assignment to treatment is based on a clearly defined index or parameter with a known cutoff for eligibility RD is possible when units can be ordered along a quantifiable dimension which is systematically related to the assignment of treatment The effect is measured at the discontinuity – estimated impact around the cutoff may not generalize to entire population
Indexes are common in targeting of social programs Anti-poverty programs targeted to households below a given poverty index Pension programs targeted to population above a certain age Scholarships targeted to students with high scores on standardized test CDD Programs awarded to NGOs that achieve highest scores
Example: effect of cash transfer on consumption Target transfer to poorest households Construct poverty index from 1 to 100 with pre-intervention characteristics Households with a score <=50 are poor Households with a score >50 are non-poor Cash transfer to poor households Measure outcomes (i.e. consumption) before and after transfer
Treatment Effect
Case 4: Regression Discontinuity Oportunidades assigned benefits based on a poverty index Where Treatment = 1 if score <=750 Treatment = 0 if score >750
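A rough sketch of how the jump at the cutoff might be estimated: fit a separate line on each side of the cutoff within a bandwidth and compare the two predictions at the cutoff. The cutoff of 750 and the rule "treated if score <= 750" come from the slide; the simulated scores and outcomes, the bandwidth of 100, and the true jump of 30 are assumptions.

```python
import random
from statistics import mean

CUTOFF = 750        # eligibility rule: Treatment = 1 if score <= 750
TRUE_JUMP = 30.0    # hypothetical true effect at the cutoff

def fit_line(x, y):
    """Simple OLS of y on x; returns (slope, intercept)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def rd_estimate(scores, outcomes, cutoff, bandwidth):
    """Difference between treated and untreated predictions at the cutoff."""
    # Center scores at the cutoff so each intercept is the value at the cutoff.
    below = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff - bandwidth <= s <= cutoff]
    above = [(s - cutoff, y) for s, y in zip(scores, outcomes)
             if cutoff < s <= cutoff + bandwidth]
    _, treated_at_cutoff = fit_line(*zip(*below))
    _, untreated_at_cutoff = fit_line(*zip(*above))
    return treated_at_cutoff - untreated_at_cutoff

rng = random.Random(7)
scores = [rng.uniform(500, 1000) for _ in range(20000)]
outcomes = [400 - 0.2 * s + (TRUE_JUMP if s <= CUTOFF else 0.0) + rng.gauss(0, 20)
            for s in scores]

estimate = rd_estimate(scores, outcomes, CUTOFF, bandwidth=100)
print(f"estimated jump at cutoff = {estimate:.1f}")
```

Only observations near the cutoff enter the estimate, which is why the result is local and why fewer observations are effectively used than the full sample contains.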
Case 4: Regression Discontinuity Baseline – No treatment
Case 4: Regression Discontinuity Treatment Period
Potential Disadvantages of RD Local average treatment effects – not always generalizable Power: effect is estimated at the discontinuity, so we generally have fewer observations than in a randomized experiment with the same sample size Specification can be sensitive to functional form: make sure the relationship between the assignment variable and the outcome variable is correctly modeled, including: Nonlinear Relationships Interactions
Advantages of RD for Evaluation RD yields an unbiased estimate of treatment effect at the discontinuity Can often take advantage of a known rule for assigning the benefit; such rules are common in the design of social policy No need to “exclude” a group of eligible households/individuals from treatment
Measuring Impact Randomized Experiments Quasi-experiments Randomized Promotion-Instrumental Variables Regression Discontinuity Difference in difference – panel data Matching
Case 5: Diff in diff Compare change in outcomes between treatment and non-treatment groups Impact is the difference in the change in outcomes Impact = (Yt1-Yt0) - (Yc1-Yc0)
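The formula translates directly into code. The household values below are hypothetical and exist only to show the calculation.

```python
from statistics import mean

def diff_in_diff(treat_before, treat_after, control_before, control_after):
    """Impact = (Yt1 - Yt0) - (Yc1 - Yc0), using group means."""
    change_treated = mean(treat_after) - mean(treat_before)
    change_control = mean(control_after) - mean(control_before)
    return change_treated - change_control

# Hypothetical consumption per capita for three households per group.
treat_before, treat_after = [200, 210, 190], [240, 250, 230]
control_before, control_after = [220, 230, 210], [230, 240, 220]

impact = diff_in_diff(treat_before, treat_after, control_before, control_after)
print(impact)   # 30
```

The treated group gains 40 and the control group gains 10, so 10 of the treated group's change is attributed to time and the remaining 30 to the program.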
[Figure: Outcome against Time for Treatment Group and Control Group, points A, B, C, D; start of Treatment and Average Treatment Effect marked]
[Figure: Outcome against Time for Treatment Group and Control Group; Estimated Average Treatment Effect shown alongside the Average Treatment Effect]
Diff in diff Fundamental assumption that trends (slopes) are the same in treatments and controls Need a minimum of three points in time to verify this and estimate treatment (two pre-intervention)
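With two pre-intervention periods, one simple check is to compare the groups' changes before the program starts. The function name `pre_trend_gap` and the group means below are hypothetical. A gap near zero is consistent with equal trends but does not prove they would have stayed equal afterwards.

```python
def pre_trend_gap(treat_pre, control_pre):
    """Difference between groups in the change across two pre-intervention periods.

    Each argument is a pair (mean in earlier period, mean in later period).
    """
    t_early, t_late = treat_pre
    c_early, c_late = control_pre
    return (t_late - t_early) - (c_late - c_early)

# Hypothetical group means for the two periods before the intervention.
gap_parallel = pre_trend_gap((190, 200), (210, 220))   # both groups rise by 10
gap_diverging = pre_trend_gap((190, 215), (210, 220))  # treated rise faster
print(gap_parallel, gap_diverging)   # 0 15
```

A non-zero gap, as in the second case, signals that part of any later diff-in-diff estimate could be a pre-existing difference in trends.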
Case 5: Diff in Diff
Impact Evaluation Example – Summary of Results
Measuring Impact Randomized Experiments Quasi-experiments Randomized Promotion-Instrumental Variables Regression Discontinuity Difference in difference – panel data Matching
Matching Pick up the ideal comparison that matches the treatment group from a larger survey. The matches are selected on the basis of similarities in observed characteristics This assumes no selection bias based on unobservable characteristics. Source: Martin Ravallion
Propensity-Score Matching (PSM) Controls: non-participants with the same characteristics as participants In practice, it is very hard: the entire vector of X observed characteristics could be huge. Rosenbaum and Rubin: match on the basis of the propensity score P(Xi) = Pr(Di=1 | Xi) Instead of aiming to ensure that the matched control for each participant has exactly the same value of X, the same result can be achieved by matching on the probability of participation. This assumes that participation is independent of outcomes given X.
Steps in Score Matching Representative and highly comparable surveys of non-participants and participants. Pool the two samples and estimate a logit (or probit) model of program participation. Restrict samples to ensure common support (lack of common support is an important source of bias in observational studies). For each participant, find a sample of non-participants that have similar propensity scores. Compare the outcome indicators: the difference is the estimate of the gain due to the program for that observation. Calculate the mean of these individual gains to obtain the average overall gain.
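The later steps (common support, matching, comparing, averaging) can be sketched as below. The sketch assumes the propensity scores have already been estimated by a logit or probit, and every (score, outcome) pair is hypothetical. It uses nearest-neighbor matching with replacement, which is one of several possible matching rules.

```python
from statistics import mean

def common_support(participants, non_participants):
    """Keep units whose score lies in the overlap of the two score ranges."""
    low = max(min(s for s, _ in participants), min(s for s, _ in non_participants))
    high = min(max(s for s, _ in participants), max(s for s, _ in non_participants))
    def keep(units):
        return [(s, y) for s, y in units if low <= s <= high]
    return keep(participants), keep(non_participants)

def average_gain(participants, non_participants, k=1):
    """Mean over participants of: own outcome minus mean outcome of k nearest matches."""
    gains = []
    for score, outcome in participants:
        nearest = sorted(non_participants, key=lambda u: abs(u[0] - score))[:k]
        gains.append(outcome - mean(y for _, y in nearest))
    return mean(gains)

# Hypothetical (propensity score, outcome) pairs.
participants = [(0.95, 300), (0.70, 260), (0.58, 250), (0.40, 240)]
non_participants = [(0.72, 235), (0.60, 230), (0.50, 225), (0.35, 215), (0.05, 180)]

p_cs, np_cs = common_support(participants, non_participants)
gain = average_gain(p_cs, np_cs, k=1)
print(len(p_cs), len(np_cs), gain)   # 3 3 20
```

The participant with score 0.95 has no comparable non-participant and is dropped, so the estimate applies only to participants inside the region of common support.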
[Figure: density of propensity scores for participants, with the region of common support marked; horizontal axis: propensity score]
PSM vs an experiment Pure experiment does not require the untestable assumption of independence conditional on observables PSM requires large samples and good data
Lessons on Matching Methods Typically used when randomization, RD, and other quasi-experimental options are not possible (e.g. no baseline) Be cautious of ex-post matching: Matching on endogenous variables Matching helps control for OBSERVABLE heterogeneity Matching at baseline can be very useful: Estimation: combine with other techniques (e.g. diff in diff) Know the assignment rule (match on this rule) Sampling: selecting non-randomized evaluation samples Need good quality data Common support can be a problem
Case 7: Matching
Impact Evaluation Example – Summary of Results
Measuring Impact Randomized Experiments Quasi-experiments Randomized Promotion-Instrumental Variables Regression Discontinuity Difference in difference – panel data Matching Combinations of the above
Remember….. Objective of impact evaluation is to estimate the CAUSAL effect of a program on outcomes of interest In designing the program we must understand the data generation process behavioral process that generates the data how benefits are assigned Fit the best evaluation design to the operational context