Impact Evaluation Methods

Presentation transcript:

Impact Evaluation Methods Sebastian Martinez Impact Evaluation Cluster, AFTRL Slides by Paul J. Gertler & Sebastian Martinez

Motivation. “Traditional” M&E: Is the program being implemented as designed? Could the operations be more efficient? Are the benefits getting to those intended? Monitoring trends: are indicators moving in the right direction? There is NO inherent causality. Impact Evaluation: What was the effect of the program on outcomes? Because of the program, are people better off? What would happen if we changed the program? The focus is causality.

Need at least 10 people who would be willing to volunteer {to answer some type of question}. Everyone else: randomly draw a survey form and fill it out. Answers are secret and anonymous, so don’t show your answer to your neighbors (and don’t look at your neighbor’s)!

Motivation. The objective in evaluation is to estimate the CAUSAL effect of intervention X on outcome Y. Example: what is the effect of a cash transfer on household consumption? For causal inference we must understand the data generation process. For impact evaluation, this means understanding the behavioral process that generates the data and how benefits are assigned.

Causation versus Correlation. Recall: correlation is NOT causation. It is a necessary but not sufficient condition. Correlation: X and Y are related. A change in X is related to a change in Y, and a change in Y is related to a change in X. Causation: if we change X, how much does Y change? A change in X is related to a change in Y, but not necessarily the other way around.

Causation versus Correlation. Three criteria for causation: (1) The independent variable precedes the dependent variable. (2) The independent variable is related to the dependent variable. (3) There are no third variables that could explain why the independent variable is related to the dependent variable. External validity (generalizability): whether the causal inference generalizes outside the sample population or setting.

Motivation. The word “cause” is not in the vocabulary of standard probability theory. Probability theory: two events are mutually correlated, or dependent, so if we find one, we can expect to encounter the other. Example: age and income. For impact evaluation, we supplement the language of probability with a vocabulary for causality.

Statistical Analysis & Impact Evaluation. Statistical analysis: typically involves inferring the causal relationship between X and Y from observational data, with many challenges and complex statistics. Impact Evaluation: retrospectively, it faces the same challenges as statistical analysis. Prospectively, we generate the data ourselves through the program’s design, so the evaluation design makes things much easier!

How to assess impact. What is the effect of a cash transfer on household consumption? Formally, program impact is: α = (Y | P=1) - (Y | P=0). Compare the same individual with and without the program at the same point in time. So what’s the problem?

Solving the evaluation problem. Problem: we never observe the same individual with and without the program at the same point in time. We need to estimate what would have happened to the beneficiary if he or she had not received benefits. Counterfactual: what would have happened without the program. The difference between the treated observation and the counterfactual is the estimated impact.

Estimate effect of X on Y. Compare the same individual with and without treatment at the same point in time (counterfactual). Program impact is the outcome with the program minus the outcome without the program. With the program: sick 2 days. Without the program: sick 10 days. Impact = 2 - 10 = -8 days sick!

Finding a good counterfactual. The treated observation and the counterfactual have identical factors/characteristics, except for benefiting from the intervention. There are no other explanations for differences in outcomes between the treated observation and the counterfactual. The only reason for the difference in outcomes is the intervention.

Measuring Impact. Tool belt of impact evaluation design options: randomized experiments; quasi-experiments; regression discontinuity; difference in difference (panel data); other (using instrumental variables, matching, etc.). In all cases, these will involve knowing the rule for assigning treatment.

Choosing your design. For impact evaluation, we will identify the “best” possible design given the operational context. The best possible design is the one that has the fewest risks for contamination: omitted variables (biased estimates) and selection (results not generalizable).

Case Study: Effect of cash transfers on consumption. Estimate the impact of a cash transfer on consumption per capita. Make sure that: the cash transfer comes before the change in consumption; the cash transfer is correlated with consumption; the cash transfer is the only thing changing consumption. Example based on Oportunidades.

Oportunidades. National anti-poverty program in Mexico (1997). Cash transfers and in-kind benefits conditional on school attendance and health care visits. Transfer given preferably to the mother of beneficiary children. Large program: 5 million beneficiary households in 2004. Large transfers, capped at $95 USD for households (HH) with children through junior high and $159 USD for HH with children in high school.

Oportunidades Evaluation. Phasing in of the intervention across 50,000 eligible rural communities. A random sample of 506 eligible communities in 7 states forms the evaluation sample. Random assignment of benefits by community: 320 treatment communities (14,446 households), first transfers distributed April 1998; 186 control communities (9,630 households), first transfers November 1999.

Oportunidades Example

Common Counterfeit Counterfactuals. 1. Before and After (2005 vs. 2007): sick 2 days, then sick 15 days. Impact = 15 - 2 = 13 more days sick? 2. Enrolled / Not Enrolled: sick 2 days vs. sick 1 day. Impact = 2 - 1 = +1 day sick?

“Counterfeit” Counterfactual Number 1: Before and after. Assume we have data on treatment households before the cash transfer and on the same treatment households after the cash transfer. Estimate the “impact” of the cash transfer on household consumption: compare consumption per capita before the intervention to consumption per capita after the intervention. The difference in consumption per capita between the two periods is taken as the “treatment” effect.

Case 1: Before and After. Compare Y before and after the intervention: α_i = (CPC_{i,t} | T=1) - (CPC_{i,t-1} | T=0). Estimate of the counterfactual: (CPC_{i,t} | T=0) = (CPC_{i,t-1} | T=0). “Impact” = A - B. [Figure: CPC against time, with point B before the intervention (t-1) and point A after it (t).]

Case 1: Before and After

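Case 1: Before and After. Compare Y before and after the intervention: α_i = (CPC_{i,t} | T=1) - (CPC_{i,t-1} | T=0). Estimate of the counterfactual: (CPC_{i,t} | T=0) = (CPC_{i,t-1} | T=0). “Impact” = A - B. This does not control for time-varying factors. Recession: Impact = A - C. Boom: Impact = A - D. [Figure: CPC against time, with point B before the intervention (t-1), point A after it (t), and possible counterfactual points C? and D? at time t.]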

“Counterfeit” Counterfactual Number 2: Enrolled/Not Enrolled. Voluntary enrollment in the program. Assume we have a cross-section of post-intervention data on households that did not enroll and households that enrolled. Estimate the “impact” of the cash transfer on household consumption: compare consumption per capita of those who did not enroll to consumption per capita of those who enrolled. The difference in consumption per capita between the two groups is taken as the “treatment” effect.

Case 2: Enrolled/Not Enrolled

Those who did not enroll. Impact estimate: α_i = (Y_{i,t} | P=1) - (Y_{j,t} | P=0). Counterfactual: (Y_{j,t} | P=0) ≠ (Y_{i,t} | P=0). Examples of comparison groups: those who choose not to enroll in the program; those who were not offered the program. Example programs: conditional cash transfer; job training program. We cannot control for all the reasons why some chose to sign up and others didn’t, and those reasons could be correlated with outcomes. We can control for observables, but we are still left with the unobservables.

Impact Evaluation Example: Two counterfeit counterfactuals. What is going on? Which of these do we believe? Problem with Before-After: cannot control for other time-varying factors. Problem with Enrolled-Not Enrolled: we do not know why the treated are treated and the others are not.
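The two counterfeit counterfactuals can be illustrated with a small sketch. All numbers below (the true effect, the time trend, and the household baselines) are hypothetical values chosen only for illustration and do not come from the Oportunidades data. The sketch shows how the before-after comparison absorbs the time trend and how the enrolled/not-enrolled comparison absorbs baseline differences.

```python
# Hypothetical numbers, chosen only to illustrate the two biases.
TRUE_EFFECT = 10.0   # assumed true gain in consumption per capita
TIME_TREND = 5.0     # assumed change everyone experiences between t-1 and t


def outcome(baseline, enrolled, period):
    """Consumption per capita in period 0 (before) or period 1 (after)."""
    effect = TRUE_EFFECT if (enrolled and period == 1) else 0.0
    return baseline + TIME_TREND * period + effect


def mean(values):
    return sum(values) / len(values)


# Self-selection (assumed): households that enroll are poorer at baseline.
enrolled_baselines = [80.0, 90.0, 100.0]
non_enrolled_baselines = [120.0, 130.0, 140.0]

before_enrolled = mean([outcome(b, True, 0) for b in enrolled_baselines])
after_enrolled = mean([outcome(b, True, 1) for b in enrolled_baselines])
after_non_enrolled = mean([outcome(b, False, 1) for b in non_enrolled_baselines])

# Counterfeit 1: before and after (true effect plus the time trend).
before_after = after_enrolled - before_enrolled

# Counterfeit 2: enrolled vs. not enrolled (true effect plus baseline gap).
enrolled_vs_not = after_enrolled - after_non_enrolled

print(before_after, enrolled_vs_not, TRUE_EFFECT)
```

Neither comparison recovers the assumed true effect: one overstates it, the other even gets the sign wrong.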

Solution to the Counterfeit Counterfactual. Observe Y with treatment: sick 2 days. ESTIMATE Y without treatment: sick 10 days. Impact = 2 - 10 = -8 days sick! On AVERAGE, the comparison group is a good counterfactual for the treated group.

Possible Solutions. We need to understand the data generation process: how beneficiaries are selected and how benefits are assigned. Guarantee comparability of treatment and control groups, so that the ONLY difference is the intervention.

Measuring Impact Experimental design/randomization Quasi-experiments Regression Discontinuity Double differences (diff in diff) Other options

Choosing the methodology. Choose the most robust strategy that fits the operational context. Use program budget and capacity constraints to choose a design, e.g. a pipeline: the universe of eligible individuals is typically larger than the resources available at a single point in time. The fairest and most transparent way to assign benefits may be to give all an equal chance of participating, i.e. randomization.

Randomization. The “gold standard” in impact evaluation. Give each eligible unit the same chance of receiving treatment: a lottery for who receives the benefit, or a lottery for who receives the benefit first.

Randomization. [Figure: randomization at two stages: the first for External Validity (sample), the second for Internal Validity (identification).]

External & Internal Validity. The purpose of the first stage is to ensure that the results in the sample will represent the results in the population within a defined level of sampling error (external validity). The purpose of the second stage is to ensure that the observed effect on the dependent variable is due to some aspect of the treatment rather than other confounding factors (internal validity).

Case 3: Randomization. Randomized treatment/controls with community-level randomization: 320 treatment communities and 186 control communities. Pre-intervention characteristics are well balanced.
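A lottery of this kind can be sketched in a few lines. This is only an illustration of community-level random assignment, not the procedure actually used in the evaluation: the community identifiers, the `assign_communities` helper, and the seed are all hypothetical; only the group sizes (320 and 186 out of 506) come from the slides.

```python
import random


def assign_communities(community_ids, n_treatment, seed=0):
    """Randomly split communities into treatment and control groups.

    Every community has the same chance of landing in the treatment group.
    The seed is fixed only so that the illustration is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(community_ids)
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_treatment]), sorted(shuffled[n_treatment:])


# Hypothetical identifiers for the 506 sampled communities.
communities = list(range(506))
treatment, control = assign_communities(communities, n_treatment=320)

print(len(treatment), len(control))
```

In practice one would then compare baseline characteristics across the two groups to check that they are balanced.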

Baseline characteristics

Case 3: Randomization

Impact Evaluation Example: No Design vs. Randomization

Case 4: Regression Discontinuity. Assignment to treatment is based on a clearly defined index or parameter with a known cutoff for eligibility. RD is possible when units can be ordered along a quantifiable dimension which is systematically related to the assignment of treatment. The effect is measured at the discontinuity: the estimated impact around the cutoff may not generalize to the entire population.

Indexes are common in targeting of social programs. Anti-poverty programs: targeted to households below a given poverty index. Pension programs: targeted to the population above a certain age. Scholarships: targeted to students with high scores on a standardized test. CDD programs: awarded to NGOs that achieve the highest scores.

Example: effect of cash transfer on consumption. Target the transfer to the poorest households. Construct a poverty index from 1 to 100 with pre-intervention characteristics. Households with a score <= 50 are poor; households with a score > 50 are non-poor. The cash transfer goes to poor households. Measure outcomes (e.g. consumption) before and after the transfer.

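[Figure: households labeled Poor and Non-Poor on either side of the poverty-index cutoff.]

[Figure: Treatment Effect at the cutoff.]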

Case 4: Regression Discontinuity. Oportunidades assigned benefits based on a poverty index, where Treatment = 1 if score <= 750 and Treatment = 0 if score > 750.
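The assignment rule and a simple estimate at the cutoff can be sketched as follows. The cutoff of 750 comes from the slide; everything else is an assumption made for illustration: the noise-free data, the 30-unit jump, the bandwidth, and the choice of a separate straight-line fit on each side of the cutoff. Real RD analyses need to test the functional form and bandwidth.

```python
CUTOFF = 750  # eligibility threshold on the poverty index (from the slide)


def is_treated(score):
    """Assignment rule: Treatment = 1 if score <= 750, else 0."""
    return score <= CUTOFF


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


def rd_estimate(scores, outcomes, bandwidth):
    """Difference between the two fitted lines, evaluated at the cutoff."""
    pairs = list(zip(scores, outcomes))
    left = [(s, y) for s, y in pairs if CUTOFF - bandwidth <= s <= CUTOFF]
    right = [(s, y) for s, y in pairs if CUTOFF < s <= CUTOFF + bandwidth]
    a_left, b_left = fit_line(*zip(*left))
    a_right, b_right = fit_line(*zip(*right))
    # Treated units are at or below the cutoff, so impact = left - right.
    return (a_left + b_left * CUTOFF) - (a_right + b_right * CUTOFF)


# Hypothetical, noise-free data: consumption rises with the index, and
# treated households get an assumed jump of 30 units.
scores = list(range(700, 801))
outcomes = [50.0 + 0.1 * s + (30.0 if is_treated(s) else 0.0) for s in scores]

estimate = rd_estimate(scores, outcomes, bandwidth=50)
print(round(estimate, 6))
```

Because only observations near the cutoff are used, the estimate is local to households with scores around 750.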

Case 4: Regression Discontinuity. Baseline – No treatment

Case 4: Regression Discontinuity. Treatment Period

Potential Disadvantages of RD. Local average treatment effects: not always generalizable. Power: the effect is estimated at the discontinuity, so we generally have fewer observations than in a randomized experiment with the same sample size. Specification can be sensitive to functional form: make sure the relationship between the assignment variable and the outcome variable is correctly modeled, including nonlinear relationships and interactions.

Advantages of RD for Evaluation. RD yields an unbiased estimate of the treatment effect at the discontinuity. It can often take advantage of a known rule for assigning the benefit; such rules are common in the design of social policy. There is no need to “exclude” a group of eligible households/individuals from treatment.

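Case 5: Diff in diff. Compare the change in outcomes between the treatment and non-treatment groups. Impact is the difference in the change in outcomes: Impact = (Y_{t1} - Y_{t0}) - (Y_{c1} - Y_{c0}), where t and c denote the treatment and control groups and 0 and 1 denote the periods before and after the intervention.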
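The double difference is simple arithmetic on four group means. The sketch below uses hypothetical mean outcomes (not figures from the presentation) and a helper name, `diff_in_diff`, introduced only for this illustration.

```python
def diff_in_diff(y_t0, y_t1, y_c0, y_c1):
    """Impact = (Y_t1 - Y_t0) - (Y_c1 - Y_c0).

    y_t0, y_t1: mean outcome of the treatment group before and after.
    y_c0, y_c1: mean outcome of the control group before and after.
    """
    return (y_t1 - y_t0) - (y_c1 - y_c0)


# Hypothetical group means.
y_t0, y_t1 = 100.0, 125.0   # treatment group: change of 25
y_c0, y_c1 = 110.0, 120.0   # control group: change of 10

impact = diff_in_diff(y_t0, y_t1, y_c0, y_c1)
before_after_only = y_t1 - y_t0

print(impact, before_after_only)
```

The control group's change stands in for what would have happened to the treatment group without the program, which is why the method relies on both groups following the same trend.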

[Figure: Outcome against Time for the Treatment Group and the Control Group, before and after Treatment, with the Average Treatment Effect marked.]

[Figure: the same plot, contrasting the Estimated Average Treatment Effect with the Average Treatment Effect.]

Diff in diff. Fundamental assumption: trends (slopes) are the same in treatments and controls. We need a minimum of three points in time (two of them pre-intervention) to verify this and estimate the treatment effect.

Case 5: Diff in Diff

Impact Evaluation Example – Summary of Results

Measuring Impact Experimental design/randomization Quasi-experiments Regression Discontinuity Double differences (Diff in diff) Other options Instrumental Variables Matching

Other options for Impact Evaluation. There are a few others out there. Common scenario: voluntary enrollment in the program, so we can’t “control” who enrolls and who does not. Possible solution: random promotion of, or incentives into, the program: information, money, other help/incentives.

Random Promotion. Those who get the promotion are more likely to enroll. But who got the promotion was determined randomly, so it is not correlated with other observables or non-observables. Compare the average outcomes of the two groups, promoted and not promoted: this gives the effect of offering the program (ITT). The effect of the intervention itself (TOT) is: TOT = (effect of offering the program) / (proportion of those who took up as a result of the offer).

Encouragement Design. [Figure: three groups (Never Takeup, Takeup if Encouraged, Always Takeup), shown with and without encouragement. NOT encouraged: Takeup = 30%, Y = 90. Change: Takeup = 50%, Y = 10. Impact = 10 / 50% = 20.]

Example – Community Based School Management. Chaudhury, Gertler, Vermeersch (work in progress). Estimate the effect of decentralization of school management on learning outcomes. Grant for funding of community based management. Community management of hiring, budgeting, oversight. 1500 schools in the evaluation. Each community chooses whether to participate in the program. The community submits a proposal for program participation.

Evaluation Design. Community based school management. Provision of technical assistance and training by NGOs for submission of the grant application. Random selection of communities with NGO support. Random promotion is an Instrumental Variable.

Technique called Instrumental Variables. Some fancy statistics: find a variable Z which satisfies two conditions: (1) correlated with T: corr(Z, T) ≠ 0; (2) uncorrelated with ε: corr(Z, ε) = 0. Z is the random promotion in our example.

Indirect least squares – Case 1. Columns: Promotion, No-Promotion, Change. Takeup (T): 0.5. Test Score (S): 100, 80, 20.

Indirect least squares – Case 2. Takeup (T): Promotion 0.8, No-Promotion 0.3, Change 0.5. Test Score (S): Promotion 100, No-Promotion 90, Change 10.
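The Case 2 numbers can be turned into a treatment-on-the-treated estimate by dividing the change in the outcome by the change in take-up. The sketch below is a minimal illustration of that ratio; the function name `indirect_least_squares` is hypothetical, while the inputs are the Case 2 values from the table.

```python
def indirect_least_squares(score_promoted, score_not_promoted,
                           takeup_promoted, takeup_not_promoted):
    """Ratio of the change in the outcome to the change in take-up."""
    change_in_score = score_promoted - score_not_promoted      # ITT effect
    change_in_takeup = takeup_promoted - takeup_not_promoted
    if change_in_takeup == 0:
        raise ValueError("promotion did not change take-up; ratio undefined")
    return change_in_score / change_in_takeup


# Case 2 values: test scores 100 vs. 90, take-up 0.8 vs. 0.3.
tot = indirect_least_squares(100, 90, 0.8, 0.3)
print(tot)
```

The ratio is only defined when the promotion actually moves take-up, which is the first instrumental-variable condition, corr(Z, T) ≠ 0.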

Two Stage Least Squares (2SLS). Model with endogenous treatment (T). Stage 1: regress the endogenous variable on the IV (Z) and other exogenous regressors. Calculate the predicted value for each observation: T hat.

Two Stage Least Squares (2SLS). Stage 2: regress outcome y on the predicted variable T hat (and other exogenous variables). Need to correct the standard errors (they are based on T hat rather than T). In practice just use STATA: ivreg. Intuition: T has been “cleaned” of its correlation with ε.
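The two stages can be sketched by hand for the simplest case of one instrument and no other regressors. The data below are entirely hypothetical: three types of units (always take up, take up only if promoted, never take up) with an unobserved term u that is correlated with take-up but balanced across promotion groups, and an assumed true effect of 20. The sketch only produces point estimates; as the slide notes, standard errors from a manual second stage are not correct.

```python
def ols(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b


def two_stage_least_squares(z, t, y):
    """Stage 1: regress T on Z and predict T hat. Stage 2: regress y on T hat."""
    a1, b1 = ols(z, t)
    t_hat = [a1 + b1 * zi for zi in z]
    _, beta = ols(t_hat, y)
    return beta


# Hypothetical data: rows of (promoted Z, take-up T, unobserved u).
rows = []
for promoted in (1, 0):
    rows += [(promoted, 1, 10.0)] * 3         # always take up
    rows += [(promoted, promoted, 0.0)] * 5   # take up only if promoted
    rows += [(promoted, 0, -5.0)] * 2         # never take up

z = [r[0] for r in rows]
t = [r[1] for r in rows]
u = [r[2] for r in rows]
y = [70.0 + 20.0 * ti + ui for ti, ui in zip(t, u)]  # assumed true effect: 20

_, naive_ols = ols(t, y)              # biased: T is correlated with u
iv_estimate = two_stage_least_squares(z, t, y)

print(round(naive_ols, 2), round(iv_estimate, 2))
```

With a single binary instrument, this estimate coincides with the ratio of the change in the outcome to the change in take-up.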

Instrumental Variables. A variable correlated with treatment but nothing else (e.g. random promotion). Again, we really just need to understand how the data are generated. We don’t have to exclude anyone.

Case 6: IV. Estimate the TOT effect of Oportunidades on consumption. Run a 2SLS regression.

Matching. Pick the ideal comparison group that matches the treatment group from a larger survey. The matches are selected on the basis of similarities in observed characteristics. This assumes no selection bias based on unobservable characteristics. Source: Martin Ravallion.

Propensity-Score Matching (PSM). Controls: non-participants with the same characteristics as participants. In practice, it is very hard: the entire vector X of observed characteristics could be huge. Rosenbaum and Rubin: match on the basis of the propensity score P(X_i) = Pr(D_i = 1 | X_i). Instead of aiming to ensure that the matched control for each participant has exactly the same value of X, the same result can be achieved by matching on the probability of participation. This assumes that participation is independent of outcomes given X.

Steps in Score Matching. (1) Obtain representative and highly comparable surveys of non-participants and participants. (2) Pool the two samples and estimate a logit (or probit) model of program participation. (3) Restrict the samples to assure common support (lack of common support is an important source of bias in observational studies). (4) For each participant, find a sample of non-participants that have similar propensity scores. (5) Compare the outcome indicators; the difference is the estimate of the gain due to the program for that observation. (6) Calculate the mean of these individual gains to obtain the average overall gain.
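The steps above can be sketched on a toy data set. Everything here is an assumption made for illustration: a single observed characteristic x, a logit fitted by plain gradient ascent rather than a statistics package, one-nearest-neighbor matching, noise-free outcomes, and an assumed true gain of 5.

```python
import math


def fit_logit(xs, ds, steps=5000, lr=0.5):
    """Fit Pr(D=1 | x) = 1 / (1 + exp(-(a + b*x))) by gradient ascent."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, d in zip(xs, ds):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (d - p) / n
            grad_b += (d - p) * x / n
        a += lr * grad_a
        b += lr * grad_b
    return a, b


def propensity(a, b, x):
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


TRUE_GAIN = 5.0  # assumed program effect


def outcome(x, d):
    return 50.0 + 10.0 * x + TRUE_GAIN * d


# Hypothetical samples: participants tend to have lower x.
participants_x = [i / 10 for i in range(0, 7)]        # 0.0 .. 0.6
non_participants_x = [i / 10 for i in range(4, 11)]   # 0.4 .. 1.0

# Pool the samples and estimate the participation model.
xs = participants_x + non_participants_x
ds = [1] * len(participants_x) + [0] * len(non_participants_x)
a, b = fit_logit(xs, ds)

treated = [(propensity(a, b, x), outcome(x, 1)) for x in participants_x]
controls = [(propensity(a, b, x), outcome(x, 0)) for x in non_participants_x]

# Restrict both samples to the region of common support.
low = max(min(p for p, _ in treated), min(p for p, _ in controls))
high = min(max(p for p, _ in treated), max(p for p, _ in controls))
treated_cs = [(p, y) for p, y in treated if low <= p <= high]
controls_cs = [(p, y) for p, y in controls if low <= p <= high]

# Match each participant to the nearest non-participant by score,
# then average the individual gains.
gains = []
for p, y in treated_cs:
    _, y_match = min(controls_cs, key=lambda c: abs(c[0] - p))
    gains.append(y - y_match)
average_gain = sum(gains) / len(gains)

naive = (sum(y for _, y in treated) / len(treated)
         - sum(y for _, y in controls) / len(controls))
print(round(naive, 2), round(average_gain, 2))
```

In this toy setting matching recovers the assumed gain only because participation depends on the observed x alone; with selection on unobservables it would not.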

[Figure: density of propensity scores for participants, with the region of common support marked; horizontal axis: propensity score, up to 1.]

PSM vs an experiment. A pure experiment does not require the untestable assumption of independence conditional on observables. PSM requires large samples and good data.

Lessons on Matching Methods. Typically used when randomization, RD, and other quasi-experimental options are not possible (e.g. no baseline). Be cautious of ex-post matching: it risks matching on endogenous variables. Matching helps control for OBSERVABLE heterogeneity. Matching at baseline can be very useful. Estimation: combine with other techniques (e.g. diff in diff); know the assignment rule (match on this rule). Sampling: selecting non-randomized evaluation samples. Need good quality data. Common support can be a problem.

Case 7: Matching

Impact Evaluation Example – Summary of Results

Measuring Impact Experimental design/randomization Quasi-experiments Regression Discontinuity Double differences (Diff in diff) Other options Instrumental Variables Matching Combinations of the above

Remember. The objective of impact evaluation is to estimate the CAUSAL effect of a program on outcomes of interest. In designing the program we must understand the data generation process: the behavioral process that generates the data and how benefits are assigned. Fit the best evaluation design to the operational context.