The Fundamental Problem of Causal Inference
Alexander Tabarrok
Ideal and Real Data
T = treatment (0, 1)
Y_i^T = outcome for i when T = 1
Y_i^NT = outcome for i when T = 0
What can we Learn from Real Data?
Danger! The average outcome among the treated minus the average outcome among the untreated is not, except under special circumstances, equal to the average treatment effect.
What can we Learn from Real Data?
But is any of it interesting?
ATE, ATT, ATU
Call what we observe the Naïve Average Treatment Effect, NATE. We just showed that NATE = ATT + Selection Effect. But what about the ATE? How does it differ from the ATT? The ATT is not the same as the ATE; where is the difference in this chart (not shown)?

An example: a disease will kill if not treated. There is a drug that, if taken, will save the patient's life, but it works only for patients with a certain genetic makeup. A doctor tests patients with the disease for the gene and administers the drug to those with the gene. Note that for the ATT, the people who were not treated are a good stand-in for what would have happened to the treated had they not been treated: they would have died. So the ATT is 100%; everyone who got the drug would have died without it. But this is not the same as the ATE, since if everyone had been given the drug it would not have saved everyone. Thus the treated are not a good stand-in for what would have happened to the untreated had they been treated.

Homework: let Pi equal the proportion of the treated. Write NATE in terms of ATE. Use ATT, ATE, ATU.
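The drug example can be simulated in a few lines. This is a minimal sketch, not from the slides: the 30% gene frequency and the survive/die outcomes are made-up numbers, chosen so that treatment is perfectly selected on the gene.

```python
import random

random.seed(0)

# Illustrative simulation of the drug example (numbers are made up).
N = 100_000
gene = [random.random() < 0.3 for _ in range(N)]
treated = gene[:]                           # the doctor treats exactly the gene carriers

# Potential outcomes: 1 = survive, 0 = die.
y_treated = [1 if g else 0 for g in gene]   # the drug works only with the gene
y_untreated = [0] * N                       # untreated, everyone dies

# ATT: among the treated (all gene carriers) the drug saves everyone.
att = sum(y_treated[i] - y_untreated[i] for i in range(N) if treated[i]) / sum(treated)

# ATE: averaged over everyone, the drug helps only the ~30% with the gene.
ate = sum(y_treated[i] - y_untreated[i] for i in range(N)) / N

print(att, ate)
```

Here the ATT is exactly 1.0 (every treated patient was saved), while the ATE is only about 0.3, the share of the population the drug can actually help.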
The Gold Standard
The gold standard is randomization. If units are randomly assigned to treatment, then the selection effect disappears; i.e., with random assignment, the group selected for treatment and the group not selected would have had the same outcomes on average if not treated. With random assignment, the average among the treated minus the average among the untreated measures the average treatment effect on the treated (and in fact, with random assignment, this is also equal to the average treatment effect). In a randomized experiment we select N individuals from the population and randomly split them into two groups: the treated, with N_t members, and the untreated, with N − N_t.
Regression
In a regression context we can run the regression Y_i = a + B_T T_i + ε_i, and B_T will measure the treatment effect. It's useful to run through this once in the simple case to prove that it's true. See the handout.
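The claim can be checked numerically. A sketch on simulated data (made-up numbers; true effect = 2): with only a constant and a treatment dummy as regressors, the OLS coefficient on T equals the treated-minus-untreated difference in means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated randomized experiment (made-up numbers; true effect = 2).
n = 10_000
T = rng.integers(0, 2, n)                      # random assignment
y = 1.0 + 2.0 * T + rng.normal(0, 1, n)

# OLS of y on a constant and the treatment dummy.
X = np.column_stack([np.ones(n), T])
b_T = np.linalg.lstsq(X, y, rcond=None)[0][1]

# With only an intercept and a dummy regressor, the OLS coefficient
# on T is exactly the difference in group means.
diff_means = y[T == 1].mean() - y[T == 0].mean()
```

The two numbers agree to machine precision, and both are close to the true effect of 2.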
When can we use matching?
Day 3 - Technical Track Session VI: Matching
What if assignment to the treatment is done not randomly, but on the basis of observables? This is where matching methods come in. Matching methods allow you to construct comparison groups when assignment to the treatment is done on the basis of observable variables. Warning: matching doesn't control for the selection bias that arises when assignment to the treatment is done on the basis of non-observables. (N.b. neither does regression; you need IV.) Slide from Gertler et al., World Bank.
Matching
If individuals in the treatment and control groups differ in observable ways (selection on observables), but conditional on the observables there is random assignment, then there are a variety of "matching" techniques, including exact matching, nearest-neighbor matching, regression with indicators, propensity score matching, reweighting, etc.
Peer Effects and Matching
Sacerdote, B. (Dartmouth), 2001. "Peer Effects with Random Assignment: Results for Dartmouth Roommates." Quarterly Journal of Economics, 116(2).
Random within blocks
Five yes/no questions: I smoke. I like to listen to music while studying. I keep late hours. I am more neat than messy. I am male. That gives 2^5 = 32 blocks (25 non-empty). Within a block, assignment is random!
Show Pre-Treatment Variables are Randomized
Note that messy male smokers who like to listen to music and stay up late could have different SAT scores (probably they do), but once we control for that group, i.e. within the Xs, the assignment is random. Note that for a pure matching estimator, Sacerdote would run the regression within blocks. Here, to increase efficiency, he uses dummies, which assumes that the GPA/ability effect is the same across blocks and that the effect of the block is a simple change in intercept.
Peer Effects
For every 1 point increase (decrease) in the roommate's GPA, a student's GPA increased (decreased) by about .12 points. If you would have been a 3.0 student with a 3.0 roommate, but you were assigned to a 2.0 roommate, your GPA would be 2.88. Note that the peer effect in ability is 27% as large as the own effect! Peer effects are even larger in social choices such as the choice to join a fraternity. (Dorm effects are large here as well.)
Random within blocks
In the case of the Dartmouth experiment we knew that roommates were matched randomly within the blocks, so we just had to control for the block: an exact match. Suppose we think there are two (or more) variables that determine treatment. We can then match with a distance measure.
Slides from Gary King
Note that matching could also be called pruning: eliminating observations that are not in the common support.
The Curse of Dimensionality
Matching breaks down when we add covariates. E.g. suppose we have two variables, each with 10 levels; then we need 100 cells, and we need treated and untreated members of each cell. Add one more 10-level variable and we need 1000 cells. Regression "solves" this problem by imposing linear relationships, e.g.
Y = α + β1 PSA + β2 Age + β3 Age × PSA + β4 T
We have reduced (squashed!) a 100-cell problem to 3 parameters, but at the price of assuming away most of the possible variation.
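The cell-count problem is easy to see by simulation. A sketch (the sample size of 500 is made up): with three 10-level covariates there are 1000 cells, and a modest sample leaves most of them empty, so exact matching has nothing to compare in most cells.

```python
import random

random.seed(0)

# Sketch of the cell-count explosion: three covariates with 10 levels
# each define 10**3 = 1000 cells. With a sample of 500 (made-up size),
# most cells contain no observations at all.
n_obs, levels, k = 500, 10, 3
cells = {tuple(random.randrange(levels) for _ in range(k)) for _ in range(n_obs)}

total_cells = levels ** k          # 1000
occupied = len(cells)              # far fewer than 1000
empty = total_cells - occupied
```

And even an occupied cell only helps if it contains both a treated and an untreated unit, so the usable fraction is smaller still.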
Matching based on the Propensity Score
Definition: the propensity score is the conditional probability of receiving the treatment given the pre-treatment variables:
p(X) = Pr{T = 1 | X} = E{T | X}
Lemma 1: if p(X) is the propensity score, then T ⊥ X | p(X). "Given the propensity score, the pre-treatment variables are balanced between beneficiaries and non-beneficiaries."
Lemma 2: Y1, Y0 ⊥ T | X implies Y1, Y0 ⊥ T | p(X). "Suppose that assignment to treatment is unconfounded given the pre-treatment variables X. Then assignment to treatment is unconfounded given the propensity score p(X)."
Does the propensity score approach solve the dimensionality problem?
Yes! The balancing property of the propensity score (Lemma 1) ensures that observations with the same propensity score have the same distribution of observable covariates independently of treatment status, and that, for a given propensity score, assignment to treatment is "random" and therefore treatment and control units are observationally identical on average.
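A small simulation illustrates the balancing property. This is my own sketch with made-up propensities: two binary covariates enter the score only through their sum, so units with (x1=1, x2=0) and (x1=0, x2=1) share a propensity score even though x1 differs, and within that score stratum x1 is balanced across treatment status.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two binary covariates; treatment probability depends only on their sum
# (made-up numbers), so different (x1, x2) pairs can share one score.
n = 200_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
score = x1 + x2
p = np.array([0.2, 0.5, 0.8])[score]      # true propensity score
T = rng.random(n) < p

# Unconditionally, treated units have more x1 (selection on observables).
unbalanced_gap = x1[T].mean() - x1[~T].mean()

# Within the p = 0.5 stratum, x1 is balanced between treated and controls.
s = score == 1
balanced_gap = x1[s & T].mean() - x1[s & ~T].mean()
```

The unconditional gap is large (around 0.3 here), while the within-stratum gap is near zero, which is exactly what Lemma 1 promises.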
The Philosophy of PS Matching
In regression, Y = a + β1 X1 + β2 X2 + β3 T + ε, we focus on controlling for the other determinants of Y. We ask: what is the effect of T on Y after we remove the effect of X1 and X2 on Y? In PS matching we ask instead: what determines T? By removing the systematic determinants of T, the idea is that we "uncover" an experiment. If we have two groups of people, both of whom were on average equally likely to be treated, then the fact that one was treated and the other was not is random, and thus we can estimate an ATE.
Implementation of the estimation strategy
Remember, we're discussing a strategy for estimating the average treatment effect on the treated, called δ.
Step 1: Estimate the propensity score (e.g. logit or probit).
Step 2: Estimate the average treatment effect given the propensity score: match treated and controls with "nearby" propensity scores, compute the effect of treatment for each value of the (estimated) propensity score, and obtain the average of these conditional effects.
Step 2: Estimate the average treatment effect given the propensity score
The closest we can get to exact matching is to match each treated unit with the nearest control in terms of the propensity score. "Nearest" can be defined in many ways, and the different definitions correspond to different ways of doing matching: stratification on the score, nearest-neighbor matching on the score, or weighting on the basis of the score.
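Nearest-neighbor matching on the score can be sketched as follows. This is an illustrative simulation, not the session's code: the propensity score is taken as known rather than estimated, "nearest" is plain absolute distance in the score, and all coefficients are made up (the true treatment effect is 1).

```python
import numpy as np

rng = np.random.default_rng(2)

# Selection on an observable x: treated units have higher x, and x also
# raises the outcome, so the naive comparison is biased. True effect = 1.
n = 4000
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-x))                  # true propensity score
T = rng.random(n) < p
y = 2.0 * x + 1.0 * T + rng.normal(0, 0.1, n)

naive = y[T].mean() - y[~T].mean()        # biased well above 1

# Nearest-neighbor matching on the score: for each treated unit, find
# the control with the closest propensity score and difference outcomes.
p_t, p_c = p[T], p[~T]
y_t, y_c = y[T], y[~T]
matches = np.abs(p_t[:, None] - p_c[None, :]).argmin(axis=1)
atet = (y_t - y_c[matches]).mean()        # close to the true effect of 1
```

The naive difference in means is badly biased, while the matched estimate lands near the true effect, because matching on the score balances x between each treated unit and its matched control.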
Side note: squeezing everything into one dimension loses information, so distance matching or coarsened matching work better in general (Gary King).
Again note that matching prunes the observations.
Matching Methods in Stata
Nearest-neighbor matching using the Mahalanobis metric:
teffects nnmatch (y x) (t), atet
Propensity score matching using logit:
teffects psmatch (y) (t x, logit), atet
Propensity score matching using PSMatch2:
psmatch2 t x, outcome(y)
Coarsened exact matching:
cem x (cutpoints/cutalgorithm), treatment(t)
In the case of CEM the command returns a dataset indicating whether each observation was matched or not; you then run regression, difference in means, etc. on the matched dataset. Note that the matched dataset uses the original observations; you only coarsen for matching.
Inverse Probability Weighting
Rather than matching one to one, it is possible to match each treated unit to all of the untreated, adjusting the weights on the untreated to account for similarity. This uses more data and is also unbiased. When the objective is to estimate the ATET, a treated person receives a weight of 1 while a control person receives a weight of p(X)/(1 − p(X)).
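The weighting scheme can be sketched on simulated data. This is my own illustrative setup (made-up coefficients, true effect = 1), and the true propensity score is used for the weights rather than an estimated one.

```python
import numpy as np

rng = np.random.default_rng(3)

# Selection on an observable x (made-up coefficients): x raises both the
# treatment probability and the outcome; the true treatment effect is 1.
n = 100_000
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-x))                  # true propensity score
T = rng.random(n) < p
y = 2.0 * x + 1.0 * T + rng.normal(0, 1, n)

# ATET by inverse probability weighting: treated get weight 1,
# controls get weight p / (1 - p).
w = p[~T] / (1 - p[~T])
atet_ipw = y[T].mean() - np.average(y[~T], weights=w)   # close to 1
```

The p/(1 − p) weights up-weight controls who look like the treated, so the weighted control mean estimates the counterfactual untreated outcome of the treated group.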
Propensity Score Matching as Diagnostic and Explanatory Tool
We need to adjust the standard errors since p is estimated (one can bootstrap, drawing from the distribution of p scores, say, 1000 times). We can also add covariates (+ B x_i). And we can run a local, non-parametric regression, such as a smoothed mean (lowess in Stata), of Y on the propensity score for T = 1 and for T = 0. The regression analogue is:
Y_i = α + θ T_i + δ1 p_i + δ2 T_i (p_i − μ_p) + ε_i
The circle labeled "earnings" illustrates variation in the variable to be explained. Education and Ability are correlated explanatory variables and Ability is not observed. The blue area within the instrument circle represents variation in education that is uncorrelated with Ability and which can be used to consistently estimate the coefficient on education. Note that the only reason the instrument is correlated with Earnings is through education.
Angrist-Krueger IV
(Cunningham, Mixtape.) Children born in December and children born in January are similar, but at around age 6 the former start school while the latter are still in kindergarten. Either, however, can quit at age 16, so the December quitter will have had more school at age 16 than the January (1st QOB) quitter.
Instruments in Action (Angrist and Krueger 1991)
Instrumental variables with weak instruments and correlation with unobserved influences.
Bias in the IV estimator is determined by the covariance of the instrument with education (blue within instrument circle) relative to the covariance between the instrument and the unobserved factors (red within instrument circle). Thus IV with weak instruments can be more biased than OLS.
Voluntary job training program
Day 3 - Technical Track Session IV: Instrumental Variables
Say we decide to compare outcomes for those who participate in a job training program to the outcomes of those who do not. A simple model to do this:
y = α + β1 P + β2 x + ε
P = 1 if the person participates in training, 0 if not
x = control variables (exogenous and observed)
Why does this not work? Two problems: variables that we omit (for various reasons) but that are important, and the decision to participate in training is endogenous.
Problem #1: Omitted Variables
Even if we try to control for "everything", we'll miss characteristics that we didn't know mattered and characteristics that are too complicated to measure (not observable or not observed): talent, motivation, level of information and access to services, the opportunity cost of participation. The full model would be:
y = γ0 + γ1 x + γ2 P + γ3 M1 + η
But we cannot observe M1, the "missing" and unobserved variables.
Omitted variable bias
The true model is:
y = γ0 + γ1 x + γ2 P + γ3 M1 + η
But we estimate:
y = β0 + β1 x + β2 P + ε
If there is a correlation between M1 and P, then the OLS estimator of β2 will not be a consistent estimator of γ2, the true impact of P. Why? When M1 is missing from the regression, the coefficient on P will "pick up" some of the effect of M1.
Problem #2: Endogenous Decision to Participate
The true model is:
y = γ0 + γ1 x + γ2 P + η
with P = π0 + π1 x + π2 M2 + ξ
where M2 is a vector of unobserved / missing characteristics (i.e. we don't fully know why people decide to participate). Since we don't observe M2, we can only estimate a simplified model:
y = β0 + β1 x + β2 P + ε
Is β2,OLS an unbiased estimator of γ2?
Problem #2: Endogenous Decision to Participate
We estimate: y = β0 + β1 x + β2 P + ε
But the true model is: y = γ0 + γ1 x + γ2 P + η, with P = π0 + π1 x + π2 M2 + ξ.
Is β2,OLS an unbiased estimator of γ2? Look at the covariance between the error and P:
Cov(ε, P) = Cov(ε, π0 + π1 x + π2 M2 + ξ) = π1 Cov(ε, x) + π2 Cov(ε, M2) = π2 Cov(ε, M2)
If there is a correlation between the missing variables that determine participation (e.g. talent) and the outcomes not explained by observed characteristics, then the OLS estimator will be biased.
What can we do to solve this problem?
We estimate: y = β0 + β1 x + β2 P + ε, so the problem is the correlation between P and ε. How about we replace P with "something else", call it Z? Z needs to be similar to P but uncorrelated with ε.
Back to the job training program
P = participation. ε = the part of outcomes that is not explained by program participation or by observed characteristics. I'm looking for a variable Z that is closely related to participation P but doesn't directly affect people's outcomes Y, other than through its effect on participation. So this variable must come from outside.
Generating an outside variable for the job training program
Say that a social worker visits unemployed persons to encourage them to participate. She visits only 50% of the persons on her roster, and she chooses randomly whom to visit. If she is effective, many of the people she visits will enroll, so there will be a correlation between receiving a visit and enrolling. But the visit has no direct effect on outcomes (e.g. income) apart from its effect through enrollment in the training program. Randomized "encouragement" or "promotion" visits are an instrumental variable.
Characteristics of an instrumental variable
Define a new variable Z: Z = 1 if the person was randomly chosen to receive the encouragement visit from the social worker, and Z = 0 if not.
Corr(Z, P) > 0: people who receive the encouragement visit are more likely to participate than those who don't.
Corr(Z, ε) = 0: there is no correlation between receiving a visit and the benefit of the program apart from the effect of the visit on participation.
Z is called an instrumental variable.
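The encouragement design can be simulated to show both the OLS bias and how the instrument fixes it. This is an illustrative sketch with made-up coefficients: "talent" plays the role of the unobserved M2, and the IV estimate is computed with the Wald ratio Cov(Z, y) / Cov(Z, P).

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical encouragement design (all coefficients made up). Unobserved
# "talent" raises both participation and outcomes, so OLS is biased.
n = 200_000
talent = rng.normal(0, 1, n)
Z = rng.integers(0, 2, n)                       # randomized visit
# Participation is more likely with a visit and with more talent.
P = (0.7 * Z + talent + rng.normal(0, 1, n) > 0.5).astype(float)
y = 1.0 * P + 2.0 * talent + rng.normal(0, 1, n)    # true effect of P is 1

# OLS of y on P is biased upward: talent sits in the error term.
ols = np.cov(P, y)[0, 1] / np.var(P)

# IV (Wald) estimator: Z is random, so Cov(Z, y) / Cov(Z, P)
# recovers the effect of participation.
iv = np.cov(Z, y)[0, 1] / np.cov(Z, P)[0, 1]
```

OLS lands far above the true effect of 1, while the IV estimate is close to it: the visit shifts participation but, by construction, affects y only through P.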
Regression Discontinuity Design: What is the value of an Olympic Gold?
A Good RD Design
Regression discontinuity design is based on the simple idea that if process A jumps when process B jumps, then we have good grounds for thinking that the change in B caused the change in A, so long as the other causes of A vary smoothly.
Regression Discontinuity Design
Not a Good RD Design Regression discontinuity design is based on the simple idea that if process A jumps when process B jumps then we have good grounds for thinking that the change in B caused the change in A so long as other causes of A vary smoothly.
Regression Discontinuity Design
We have a running variable X, an index with a defined cutoff C. Units with a score X ≤ C are eligible; units with a score X > C are not eligible (or vice-versa). Intuitive explanation of the method: units just above the cutoff point are very similar to units just below it, so they make a good comparison group. Compare outcomes Y for units just above and below the cutoff point. For a discontinuity design you need a running variable and a sharp cutoff.
Estimating 1
The simplest RD design occurs if we have lots of observations with X = C + ε (and thus T = 0) and lots of observations with X = C − ε (T = 1), where ε is small. In this case we can just compare means between the two groups on either side of the cutoff. Since the two groups are similar to within an ε, this estimates the causal effect of treatment. More typically, however, we will have only a few observations clustered within ε of the cutoff, so we need to use all of the observations to estimate the regression line(s) around the cutoff value.
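The simplest estimator can be sketched in a few lines. Illustrative numbers throughout; for variety, treatment here switches on at X ≥ C, the mirror image of the eligibility rule above, and the true jump is 1.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sharp RD sketch (made-up numbers; treatment turns on at X >= C).
n = 50_000
C = 0.0
X = rng.uniform(-1, 1, n)                 # running variable
T = (X >= C).astype(float)
y = 2.0 + 0.5 * X + 1.0 * T + rng.normal(0, 1, n)   # true jump = 1

# Compare means within a small window eps on either side of the cutoff.
eps = 0.05
above = y[(X >= C) & (X < C + eps)].mean()
below = y[(X < C) & (X >= C - eps)].mean()
rd_naive = above - below                  # approximately the jump of 1
```

With a wide window this comparison would pick up the slope in X as well as the jump, which is why the regression-based estimators below are usually preferred when data near the cutoff are scarce.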
Example: Effect of a fertilizer program on agricultural production
Goal: improve agricultural production (rice yields) for small farmers.
Method: farms with ≤ 50 hectares of land are classified as small; farms with > 50 hectares are not.
Intervention: small farmers receive subsidies to purchase fertilizer.
Regression Discontinuity Design-Baseline
(Figure, not shown: baseline outcomes for eligible and not-eligible farms on either side of the cutoff.)
Regression Discontinuity Design-Post Intervention
(Figure, not shown: post-intervention outcomes, with the jump at the cutoff labeled IMPACT.)
Estimating 2
A linear model, e.g. Y = α + β1 X + β2 T + ε, can be estimated very simply, where X is the running variable and T the treatment, which occurs when X > C. But this model imposes a number of restrictions on the data, including linearity in X and an identical regression slope for X < C and X ≥ C.
Non-linearity
Non-linearity may be mistaken for discontinuity. To handle this we can estimate using, for example, polynomial terms in the running variable (X², X³, ...) alongside the treatment dummy.
Estimation 3
One can allow different functions pre and post C. Interpreting coefficients in a regression with interaction terms can be tricky. To aid interpretation it's often useful to normalize the running variable so that it's zero at the cutoff: create a new variable X̃ = X − C, so X̃ = 0 at the cutoff, and then run the regression with T interacted with X̃ and its powers. The coefficient on T measures the gap at X̃ = X − C = 0, which is the jump at C: the estimate of the causal effect. It's also possible to estimate the function in X using a non-parametric approach, which is even more flexible.
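The normalization trick can be sketched on simulated data. This is my own illustrative example (made-up coefficients; linear on each side for brevity, where the slide's specification allows higher powers): because X̃ = 0 at the cutoff, the coefficient on T is the jump.

```python
import numpy as np

rng = np.random.default_rng(6)

# Different slopes on each side of the cutoff C (made-up coefficients).
n = 50_000
C = 10.0
X = rng.uniform(5, 15, n)
T = (X >= C).astype(float)
Xt = X - C                                 # normalized running variable
y = 3.0 + 0.2 * Xt + 0.6 * T * Xt + 1.0 * T + rng.normal(0, 1, n)

# OLS of y on [1, T, Xt, T*Xt]: because Xt = 0 at the cutoff, the
# coefficient on T is the jump at C, the causal-effect estimate.
M = np.column_stack([np.ones(n), T, Xt, T * Xt])
beta = np.linalg.lstsq(M, y, rcond=None)[0]
jump = beta[1]                             # close to the true jump of 1
```

Without centering, the coefficient on T would instead measure the gap between the two fitted lines extrapolated to X = 0, which is rarely the quantity of interest.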
Example: The Power of Incumbency
Politicians are routinely reelected at 90%+ rates. Is this because of the advantages of incumbency, or is it because of a selection effect? The best politicians will be the ones in the sample!
Fuzzy Regression Discontinuity
Fuzzy RD
Use predicted class size as an instrument to estimate the effect of class size on average test scores. The estimate suggests that a reduction of 8 students would raise reading scores by 2.2 points (0.29 of a standard deviation).
A FICO score of 620 was the cutoff for most loans: if the borrower's score was 619, the loan couldn't be resold to an investment bank in a CDO; at 621 it could. This number was picked arbitrarily by the GSEs in the 1990s and kept by the hedge funds and investment bankers who were trading the stuff in the 2000s. Since it is arbitrary, a FICO score of 621 is just marginally better than 619; there is no magical jump in how FICO is measured at that point. Loans at 621 could be securitized and sold off to others, and thus received less monitoring. Note that 621 is a slightly better FICO score than 619, but these loans defaulted more. Note also that if the authors' control strategy doesn't work, it goes against their theory, so the fact that they find an effect is even more impressive. From "Did Securitization Lead to Lax Screening? Evidence from Subprime Loans" (Benjamin J. Keys, Tanmoy Mukherjee, Amit Seru, Vikrant Vig).
Minimum Legal Drinking Age
Carpenter, C., & Dobkin, C. (2011). "The Minimum Legal Drinking Age and Public Health." Journal of Economic Perspectives, 25(2): 133–156.
Regression Discontinuity, a Warning: Heaping
Consider the following data from NYC hygiene inspections (from the NY Times). A restaurant receiving any score from 0 to 13 points gets an A, but the difference from one end of that range to the other is substantial: a zero score means that inspectors found no violations at all, while 13 points means they found a host of concerns. The graph (not shown) plots the distribution of A- and B-rated restaurants: the horizontal axis tracks the number of violation points and the vertical axis the number of restaurants with that score; blue bars are A-rated restaurants, green bars B-rated.

There are at least two possible explanations for the unusual distribution. One has to do with a provision of the system that allows non-A restaurants to request a re-inspection in hopes of earning a better score, hanging a "Grade Pending" sign in their windows in the meanwhile. The spike in restaurants with 11, 12, or 13 points could be the result of restaurants with initial scores in the B range cleaning up their acts just enough to qualify for an A. An alternative explanation relates to the inspectors themselves. Knowing that restaurants that get B grades are likely to appeal them, inspectors may be more likely to rate a restaurant on the cusp with an A-range score; after all, the difference between 13 points and 14 is marginal, but the difference between an A and a B is meaningful. Given the subjective nature of the inspection process and the discretion inspectors have to assign scores, the data suggest that inspectors may be disproportionately likely to assign restaurants a just-made-it A score rather than a just-missed B.
Low Birth Weight Babies
Almond, D., Doyle, J. J., Kowalski, A. E., & Williams, H. (2010). "Estimating Marginal Returns to Medical Care: Evidence from At-Risk Newborns." Quarterly Journal of Economics, 125(2): 591–634. VLBW babies (below 1500 grams) get extra treatment; the 1500 g threshold is conventional but arbitrary. The authors find that mortality decreases just below 1500 g, suggesting the value of the additional medical care.
Barreca, A. I., Guldi, M., Lindo, J. M., & Waddell, G. R. (2011). "Saving Babies? Revisiting the Effect of Very Low Birth Weight Classification." Quarterly Journal of Economics, 126(4): 2117–2123. Finds that outcomes around 1500 grams, and in general around many round numbers, are peculiar both above and below the cutoff; nurses and doctors may be manipulating recorded weights. A reminder that not all discontinuities are equally useful: exogenous discontinuities, like election outcomes, are more useful.