The logic of Counterfactual Impact Evaluation
To understand counterfactuals, it is necessary to understand impacts.
Impacts differ in one fundamental way from outputs and results: outputs and results are observable quantities.
Can we observe an impact? No, we can't.
Output indicators measure outputs and result indicators measure results, so impact indicators measure impacts? Sorry, they don't.
Almost everything about programmes can be observed (at least in principle): outputs (beneficiaries served, activities done, training courses offered, km of roads built, sewage systems cleaned) and outcomes/results (income levels, inequality, well-being of the population, pollution, congestion, inflation, unemployment, birth rate).
What is needed for the M&E of outputs and results is a set of BITs (baselines, indicators, and targets).
Unlike outputs and results, to define, detect, understand, and measure impacts, one needs to deal with causality.
"Causality is in the mind" (J.J. Heckman)
Why this focus on causality? Because, unless we can attribute changes (or differences) to policies, we do not know whether the intervention works, for whom it works, and even less why it works (or does not). Causal questions represent a bigger challenge than non-causal questions (descriptive, normative, exploratory).
The social science community defines an impact/effect as the difference between the situation observed after a stimulus has been applied and the situation that would have occurred without that stimulus.
A very intuitive example of the role of causality in producing credible evidence for policy decisions
Does playing chess have an impact on math learning?
Policy-relevant question: should we make chess part of the regular curriculum in elementary schools, to improve mathematics achievement? What kind of evidence do we need to make this decision in an informed way? We can think of three types of evidence, from the most naive to the most credible.
1. The naive evidence: the pre-post difference. Take a sample of pupils in fourth grade; measure their achievement in math at the beginning of the year; teach them to play chess during the year; test them again at the end of the year.
Results of the pre-post comparison: at the beginning of the year the pupils' average score is 40 points; at the end of the year the same pupils' average score is 52 points. Difference = 12 points (+30%). Question: what are the implications for making chess compulsory in schools? Have we proven anything?
Can we attribute the increase in test scores to playing chess? OBVIOUSLY NOT. The data only tell us that the effect is somewhere between zero and 12 points. There is no doubt that many more factors are at play, so we must dismiss the 12-point increase as unable to tell us anything about impact.
The great pre-post temptation. Pre-post comparisons have a great advantage: they seem kind of obvious (the pop definition of impact coincides with the pre-post difference), particularly when the intervention is big and the theory suggests that the outcomes should be affected. This is not the case here, but in general we should be careful about making causal inferences based on pre-post comparisons.
The risky alternative: the with-without difference. Impact = difference between treated and non-treated? Compare math test scores for kids who have learned chess by themselves and kids who have not.
Not really. Average score of pupils who already play chess on their own (25% of the total) = 66 points; average score of pupils who do NOT play chess on their own (75% of the total) = 45 points. Difference = 21 points (+47%). This difference is OBJECTIVE, but what does it mean, really? Does it have any implication for policy?
This evidence tells us almost nothing about making chess compulsory for all students. The data only tell us that the effect of playing chess is between zero and 21 points. Why? Because the observed difference could be entirely due to differences in mathematical ability that exist between the two groups prior to any chess course.
[Causal diagram: innate math ability drives both playing chess (selection process) and math test scores (direct influence); the question is whether playing chess has an impact on math test scores.] Ignoring math ability could severely bias the results if we intend to interpret them as a causal effect. 66 - 45: real effect or the fruit of sorting?
Counterfeit counterfactuals: both the raw difference between self-selected participants and non-participants and the raw change between pre and post are a caricature of the counterfactual logic. In the case of raw differences, the problem is selection bias (predetermined differences); in the case of raw changes, the problem is maturation bias (a.k.a. natural dynamics).
The modern way to understand causality is to think in terms of POTENTIAL OUTCOMES. Let us imagine we know the score that kids would get if they played chess and the score they would get if they did not.
Let's say there are three levels of ability: kids in the top quartile (top 25%) learn to play chess on their own; kids in the two middle quartiles learn only if they are taught in school; kids in the bottom quartile (bottom 25%) never learn to play chess.
High math ability (25%): play chess by themselves. Mid math ability (50%): do not play chess unless taught in school. Low math ability (25%): never learn to play.
Potential outcomes (score if they do play chess / score if they do NOT play chess / impact = gain from playing chess):
High math ability: 66 / 56 / 10
Mid math ability: 54 / 48 / 6
Low math ability: 40 / 40 / 0
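The potential-outcomes bookkeeping above can be sketched in a few lines of Python (the numbers are the slide's; the code itself is only an illustration, not part of the original material):

```python
# Potential outcomes by ability group: the score a kid would get with chess
# and the score the same kid would get without chess (never both observed).
potential_outcomes = {
    "high ability": {"if_chess": 66, "if_no_chess": 56},
    "mid ability":  {"if_chess": 54, "if_no_chess": 48},
    "low ability":  {"if_chess": 40, "if_no_chess": 40},
}

for group, scores in potential_outcomes.items():
    impact = scores["if_chess"] - scores["if_no_chess"]
    print(f"{group}: impact = {impact}")  # 10, 6 and 0 points respectively
```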
Observed outcomes: for those who play chess (high math ability), the observed average is 66; for those who do not play chess, the observed averages are 48 (mid math ability) and 40 (low math ability), i.e. 45 for the mid/low groups combined. The difference of 21 points is NOT an IMPACT, it is just an OBSERVED difference.
The problem: we do not observe the counterfactual(s). For the treated, the counterfactual is 56, but we do not see it; the true impact is 10, but we do not see it. Still, we cannot use 45, which is the observed outcome of the untreated. We can think of decomposing the 66 - 45 difference as the sum of the true impact on the treated and the effect of sorting.
Decomposing the observed difference: players (high math ability) score 66 when they play chess and would score 56 if they did not, so the impact for players = 10. Non-players (low/mid math ability) score 45, so the observed difference = 66 - 45 = 21, of which 11 points are preexisting differences (56 - 45). 21 = 10 + 11.
21 = 10 + 11. Observed difference = Impact + Preexisting differences (selection bias). The heart of impact evaluation is getting rid of selection bias, by using experiments or by using non-experimental methods.
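A minimal sketch of that decomposition in code, with the slide's numbers (the variable names are mine):

```python
# Observed difference = impact on the treated + preexisting differences (selection bias).
observed_players = 66        # high-ability kids who play chess on their own
counterfactual_players = 56  # what they would have scored without chess (never observed)
observed_non_players = 45    # mid/low-ability kids who do not play chess

impact_on_treated = observed_players - counterfactual_players   # 10
selection_bias = counterfactual_players - observed_non_players  # 11
observed_difference = observed_players - observed_non_players   # 21

assert observed_difference == impact_on_treated + selection_bias  # 21 = 10 + 11
```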
Experimental evidence to the rescue: schools get a free instructor to teach chess to one class, if they agree to select the class at random among the fourth-grade classes. Now we have the following situation.
Results of the randomized experiment: pupils in the selected classes (randomized chess players) average 60 points; pupils in the excluded classes (non-players) average 52 points. Difference = 8 points. Question: what does this difference tell us?
The results tell us that teaching chess truly improves math performance (by 8 points, about 15%). Thus we are able to isolate the effect of chess from other factors (but some problems remain).
Potential outcomes and composition of the population (score if they do play chess / score if they do NOT play chess / share of population):
High ability: 66 / 56 / 25%
Mid ability: 54 / 48 / 50%
Low ability: 40 / 40 / 25%
Averages: 54 / 48 / 100%
Impact = 54 - 48 = 6 = Average Treatment Effect (ATE)
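A sketch of the ATE computation implied by the table (note that the exact weighted average is 53.5, which the slide rounds to 54):

```python
# ATE: average of the group impacts, weighted by each group's share of the population.
groups = [
    {"share": 0.25, "if_chess": 66, "if_no_chess": 56},  # high ability
    {"share": 0.50, "if_chess": 54, "if_no_chess": 48},  # mid ability
    {"share": 0.25, "if_chess": 40, "if_no_chess": 40},  # low ability
]

mean_if_chess = sum(g["share"] * g["if_chess"] for g in groups)        # 53.5 (the slide shows 54)
mean_if_no_chess = sum(g["share"] * g["if_no_chess"] for g in groups)  # 48.0
ate = mean_if_chess - mean_if_no_chess                                 # 5.5, i.e. the ATE of about 6
print(mean_if_chess, mean_if_no_chess, ate)
```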
[Causal diagram: playing chess and math ability each influence math test scores.] Note that the experiment does not solve all the cognitive problems related to policy design: for example, it does not by itself identify impact heterogeneity (for whom it works).
The ATE is the average effect if every member of the population is treated. Generally there is more policy interest in the Average Treatment effect on the Treated (ATT). In the chess example ATT = 10, while ATE = 6 (and we ran an experiment and got an impact of 8: can you think why this happens?).
Schools that volunteered vs. schools that did NOT volunteer: the volunteering schools contain high-ability kids (50%, true impact 10) and mid-ability kids (50%, true impact 6), while the low-ability kids (true impact 0) are in the schools that did not volunteer. EXPERIMENTAL group: mean of 66 and 54 = 60. CONTROL group: mean of 56 and 48 = 52. Impact = 60 - 52 = 8. Internal validity, but little external validity.
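A sketch that puts the three numbers side by side, assuming, as the slide suggests, that only high-ability kids self-select into chess and that volunteering schools contain only the high- and mid-ability groups:

```python
# ATT, ATE and the experimental estimate come from the same group impacts,
# averaged over different populations.
impacts = {"high": 10, "mid": 6, "low": 0}
population_shares = {"high": 0.25, "mid": 0.50, "low": 0.25}  # whole population
volunteer_shares = {"high": 0.50, "mid": 0.50, "low": 0.00}   # assumed mix in volunteering schools

att = impacts["high"]                                                # 10: only high-ability kids play on their own
ate = sum(population_shares[g] * impacts[g] for g in impacts)        # 5.5, about 6
experiment = sum(volunteer_shares[g] * impacts[g] for g in impacts)  # 8.0: internally valid, limited external validity
print(att, ate, experiment)
```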
Lessons learned: impacts are differences, but not all differences are impacts. Differences (and changes) have many causes, but we do not need to understand all of them. We are especially interested in one cause, the policy, and we would like to eliminate all the confounding causes of the difference (or change). Internal vs. external validity.
An example of a real ERDF policy: grants to small enterprises to invest in R&D.
To design an impact evaluation, one needs to answer three important questions: 1. Impact of what? 2. Impact for whom? 3. Impact on what?
R&D EXPENDITURES AMONG THE FIRMS RECEIVING GRANTS
PRE: average = 65.000, N = 2.400
POST: average = 75.000, N = 2.400
OBSERVED CHANGE = 10.000
Is 10.000 the true average impact of the grant?
The fundamental challenge to this assumption, namely that nothing would have changed without the policy, is the well-known fact that things change over time by natural dynamics. How do we disentangle the change due to the policy from the myriad changes that would have occurred anyway?
NON-TREATED (T=0): average = 60.000, N = 2.600
TREATED (T=1): average = 75.000, N = 2.400
DIFFERENCE TREATED - NON-TREATED = +15.000
Is 15.000 the true impact of the policy?
WITH-WITHOUT COMPARISON (IDENTIFYING ASSUMPTION: NO PRE-INTERVENTION DIFFERENCES)
DECOMPOSITION OF WITH-WITHOUT DIFFERENCES
We cannot use experiments with firms, for obvious (?) political reasons. The good news is that there are lots of non-experimental counterfactual methods.
The difference-in-differences (DID) strategy is a combination of the first two strategies, and it is a good way to understand the logic of (non-experimental) counterfactual evaluation.
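In symbols (the notation below is mine, not the slides'), the DID estimate subtracts the change observed among the untreated from the change observed among the treated:

```latex
\widehat{\Delta}_{DID} =
  \left(\bar{Y}^{\,\text{treated}}_{\text{post}} - \bar{Y}^{\,\text{treated}}_{\text{pre}}\right)
  - \left(\bar{Y}^{\,\text{untreated}}_{\text{post}} - \bar{Y}^{\,\text{untreated}}_{\text{pre}}\right)
```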
[Chart: POST DIFFERENCE vs. PRE DIFFERENCE between treated and non-treated firms.]
POST DIFFERENCE (15.000) - PRE DIFFERENCE (10.000) = Impact = 5.000
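A sketch of the same arithmetic in code; the untreated pre-intervention average of 55.000 is implied by the stated pre difference rather than shown explicitly in the slides:

```python
# DID with the R&D grant figures (thousands separated as in the slides: 65.000 = 65,000 euro).
treated_pre, treated_post = 65_000, 75_000      # firms receiving grants
untreated_pre, untreated_post = 55_000, 60_000  # comparison firms (pre value implied by the 10.000 pre difference)

post_difference = treated_post - untreated_post  # 15,000
pre_difference = treated_pre - untreated_pre     # 10,000
did_impact = post_difference - pre_difference    # 5,000
# Equivalently: the treated changed by 10,000, the untreated by 5,000, so DID = 5,000.
print(did_impact)
```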
CAN WE TEST THE PARALLELISM ASSUMPTION? With four observed means, we cannot. Parallelism becomes testable if we have two additional pre-intervention data points (PRE-PRE).
WHEN TO USE DIFF-IN-DIFF? First, when we have longitudinal data and have reasons to believe that most of what drives selection is individual unobserved characteristics.
Second, the path taken by the controls must be a plausible approximation of what would happen to the treated. The following is an example in which it would be better NOT to use DID.
Pre-pre to pre period: treated go from 58.000 to 65.000 (change +7.000); controls go from 57.000 to 55.000 (change -2.000); difference in pre-period changes = 9.000.
Pre to post period: treated go from 65.000 to 75.000 (change +10.000); controls go from 55.000 to 67.000 (change +12.000); diff-in-diff = -2.000.
Diff-in-diff-in-diff = -2.000 - 9.000 = -11.000.
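A sketch of that triple-difference arithmetic, under the reading of the table given above:

```python
# Use the pre-period DID (a placebo comparison) to correct the post-period DID.
treated = {"pre_pre": 58_000, "pre": 65_000, "post": 75_000}
controls = {"pre_pre": 57_000, "pre": 55_000, "post": 67_000}

placebo_did = (treated["pre"] - treated["pre_pre"]) - (controls["pre"] - controls["pre_pre"])  # 7,000 - (-2,000) = 9,000
post_did = (treated["post"] - treated["pre"]) - (controls["post"] - controls["pre"])           # 10,000 - 12,000 = -2,000
triple_diff = post_did - placebo_did                                                           # -11,000
print(placebo_did, post_did, triple_diff)
```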
Linearly projecting the treated group's own pre-existing trend: from 58.000 (pre-pre) to 65.000 (pre) is a change of +7.000, so the projected post value is 72.000; the observed post value is 75.000. Linearly projected impact = 3.000.
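And a sketch of that linear projection:

```python
# Project the treated group's pre-existing trend forward and compare with the observed post value.
pre_pre, pre, post = 58_000, 65_000, 75_000
pre_trend = pre - pre_pre                 # +7,000 per period before the intervention
projected_post = pre + pre_trend          # 72,000: counterfactual under a linear trend
projected_impact = post - projected_post  # 3,000
print(projected_impact)
```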