Published by Russell Watts. Modified over 9 years ago.
1
Evaluation Design: Experiments, Quasi-Experiments, and Non-Experiments
[Diagram: T1 → x → T2 above C1 → C2 — "something happens here" (pre-post reflexive design)]
2
Core Concepts
1. Causal Analysis and the Counter-Factual
   1. The Competing Hypotheses Framework
2. Calculation of Program Effect (effect size)
3. Randomization Process
4. Control versus Comparison Group
5. Treatment Effect
   1. Average Treatment Effect (ATE)
   2. Intention to Treat (ITT)
   3. Treatment on the Treated (TOT)
3
THE COUNTERFACTUAL:
4
Compared to what? [Diagram: T=1 → Program → T=2: program effect?] The counter-factual is the exercise of trying to figure out what the outcome would have been if there had been no program intervention. So we need a point of reference to measure that scenario…
5
Were suicide rates HIGH for a specific high school in suburban California? 95% CI: average rate per year at the HS. Candidate null hypotheses for the population average: all HS students, or all Californians, or all suburban HS students. These are all valid counter-factuals; how we define our comparison drives the conclusions.
6
What is the implied counter-factual? [Diagram: T=1 → Program → T=2: program effect?] If you see a p-value, there is often an implied counter-factual. Once you figure out what it is, you often know what went wrong with the research design.
10
Another Example:
11
[Diagram: four groups G1–G4 each observed at t=0 through t=4; x marks treatment onset, staggered across groups. Treatment groups vs. control groups.]
12
[Diagram: the same staggered design — four groups G1–G4 observed at t=0 through t=4, with x marking treatment onset; treatment group vs. control group.] Specific tests: treatment gains for late treatment?
13
A valid counter-factual allows us to answer the following two questions: 1) Compared to what? The program outcomes are different from outcomes in the comparison group. The comparison group is defined by the researcher. In some special cases the comparison group is identical (statistically speaking) to the treatment group; in this case we call it a "control" group.
14
Selecting a good (valid) counter-factual is hard: there is no going back… we can't unlearn, undo the effects of a drug, or reverse the effects of a program. How do we identify a plausible counter-factual?
15
A valid counter-factual allows us to answer the following two questions: 1) Compared to what? The program outcomes are different from outcomes in the comparison group. The comparison group is defined by the researcher. In some special cases the comparison group is identical (statistically speaking) to the treatment group; in this case we call it a "control" group. 2) How big is the program effect? Is the difference meaningful (statistically significant and socially salient)? In the simple case the program effect is just the difference between the average outcomes of the treatment and control groups, but in practice there are many ways we calculate an effect.
16
Effect Size. Is the Princeton Review class effective? Treatment group: GRE = 580. Control group: GRE = 440. Effect = 580 − 440 = 140. X̄_T − X̄_C = 140; p-value = 0.023 against the null of diff = 0. Is it significant? How big is the effect? "Effect" translates to a calculation of impact plus statistical significance.
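The effect-size arithmetic above can be sketched as follows; only the group means (580 and 440) come from the slide, and the individual scores are invented for illustration:

```python
# Hypothetical GRE scores: individual values invented, group means match the slide.
treatment = [560, 600, 570, 590, 580]   # mean = 580
control   = [430, 450, 440, 460, 420]   # mean = 440

def mean(xs):
    return sum(xs) / len(xs)

# Simple effect: difference in group means (significance would require a t-test).
effect = mean(treatment) - mean(control)
print(effect)  # 140.0
```

In practice the p-value would come from a two-sample test on the same data; the point here is only that the "effect" starts as a difference of means.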
17
Effect Size in a Correlation Study: for a one-unit change in X, we expect a β₁ change in Y. [Plot: estimated slope β₁ = 140 against the null of slope = 0.] How big is the effect? Is it significant?
18
THE THREE WAYS WE CALCULATE EFFECTS
19
Two "counterfeit" or "weak" counterfactuals: the pre-post effect and the post-only effect. [Diagrams: time=1 → Program → time=2 for each design.] These are only valid when certain conditions are met.
20
Example: Do Charter Schools Outperform Public Schools? http://www.rightmichigan.com/story/2011/6/21/23927/4600 “Don't get me wrong. I am not opposed to charter schools on principle. My beef with charter schools is that most skim the most motivated students out of the poorest communities, and many have disproportionately small numbers of children who need special education or who are English-language learners. The typical charter, operating in this way, increases the burden on the regular public schools, while privileging the lucky few. Continuing on this path will further disable public education in the cities and hand over the most successful students to private entrepreneurs.” http://blogs.edweek.org/edweek/Bridging-Differences/2009/11/obama-and-duncan-are-wrong-abo.html
21
"Obama and Duncan Are Wrong About Charters," by Diane Ravitch, November 16, 2009. http://blogs.edweek.org/edweek/Bridging-Differences/2009/11/obama-and-duncan-are-wrong-abo.html
22
The difference-in-difference estimator:

Group / Time | T=1 | T=2
Treatment    | T1  | T2
Control      | C1  | C2

Difference between treated and untreated: T2 − C2. But what about the trend? Program impact = (T2 − T1) − (C2 − C1): gains during the program period minus gains as a result of trend.
23
The difference-in-difference estimator. [Graph: outcomes from time=1 to time=2 for the policy/program group (T1 → T2) and the control/comparison group (C1 → C2); the C1 → C2 rise is the trend, and the treatment group's total change sits above it.] First difference: Effect = T2 − T1 (total change). Second difference: Effect = T2 − C2 (removes the level, not the trend). Difference in difference: Effect = (T2 − T1) − (C2 − C1).
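The estimator above is one line of arithmetic; a minimal sketch, with invented numbers:

```python
def diff_in_diff(t1, t2, c1, c2):
    """Program impact = treatment-group gain minus control-group gain (the trend)."""
    return (t2 - t1) - (c2 - c1)

# Hypothetical values: the treatment group gains 25 in total,
# but the control group shows the trend alone was worth 10.
print(diff_in_diff(t1=50, t2=75, c1=48, c2=58))  # 15
```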
24
Example of the difference-in-difference estimate in practice
25
Comparing teacher “effects” Distribution of diff-in-diff scores
26
Breaking it Down: at pre, control = a and treatment = a + b; at post, control = a + c and treatment = a + b + c + d.
Single Diff 1 (control gain) = (a + c) − a = c
Single Diff 2 (treatment gain) = (a + b + c + d) − (a + b) = c + d
Effect = (T2 − T1) − (C2 − C1) = (c + d) − c = d
27
As a regression of dummies:

Group / Time | T=1 | T=2
Treatment    | T1  | T2
Control      | C1  | C2

Coefficient | Effect                | Interpretation
β0          | C1                    | Baseline
β1          | C2 − C1               | Trend
β2          | T1 − C1               | Initial difference
β3          | (T2 − T1) − (C2 − C1) | Treatment effect
28
Putting Graph & Regression Together: Y_{i,t} = a + b·Treat_{i,t} + c·Post_{i,t} + d·(Treat_{i,t} × Post_{i,t}) + e_{i,t}

[Decomposition: pre values a (control) and a + b (treatment); post values a + c (control) and a + b + c + d (treatment).]
Single Diff 1 = (a + c) − a = c
Single Diff 2 = (a + b + c + d) − (a + b) = c + d
Diff-in-Diff = Single Diff 2 − Single Diff 1 = (c + d) − c = d
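In a saturated two-group, two-period design, the OLS dummy coefficients reduce exactly to the cell-mean contrasts. A minimal sketch with invented micro-data:

```python
# Hypothetical individual outcomes for the 2x2 design (numbers invented).
# For the saturated model Y = b0 + b1*Post + b2*Treat + b3*Treat*Post,
# OLS coefficients equal these cell-mean contrasts.
data = {
    ("control", "pre"):  [10, 12, 11],   # cell mean C1 = 11
    ("control", "post"): [13, 15, 14],   # cell mean C2 = 14
    ("treat",   "pre"):  [12, 14, 13],   # cell mean T1 = 13
    ("treat",   "post"): [20, 22, 21],   # cell mean T2 = 21
}
m = {k: sum(v) / len(v) for k, v in data.items()}
C1, C2 = m[("control", "pre")], m[("control", "post")]
T1, T2 = m[("treat", "pre")], m[("treat", "post")]

b0 = C1                      # baseline
b1 = C2 - C1                 # trend
b2 = T1 - C1                 # initial difference
b3 = (T2 - T1) - (C2 - C1)   # treatment effect (diff-in-diff)
print(b0, b1, b2, b3)  # 11.0 3.0 2.0 5.0
```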
29
Validity of two "counterfeit" counterfactuals:
Pre-post: (T2 − T1) − (C2 − C1) = T2 − T1 iff C2 − C1 = 0 (no trend).
Post-only: (T2 − T1) − (C2 − C1) = T2 − C2 iff C1 − T1 = 0 (groups equivalent at time 1).
[Graph: outcomes T1, T2, C1, C2 from time=1 to time=2.] This is why we use randomization or matching.
30
Textbook notation for the effect calculation: Program Effect = E( Y | P=1 ) – E( Y | P=0 ) In plain English, we calculate the effect by subtracting the average outcome E(Y) of the control group (P=0) from the average outcome of the treatment group (P=1). Why might this be misleading?
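One answer to "why might this be misleading": if people self-select into the program, E(Y | P=1) − E(Y | P=0) picks up the baseline gap as well as the program effect. A minimal simulation under invented assumptions (true effect of 5, the more able self-select in):

```python
import random
random.seed(1)

TRUE_EFFECT = 5
# Hypothetical self-selection: people with higher baseline ability enroll.
ability = [random.gauss(50, 10) for _ in range(10000)]
treated = [a > 55 for a in ability]                       # the motivated opt in
outcome = [a + (TRUE_EFFECT if t else 0) for a, t in zip(ability, treated)]

mean_t = sum(y for y, t in zip(outcome, treated) if t) / sum(treated)
mean_c = sum(y for y, t in zip(outcome, treated) if not t) / (len(treated) - sum(treated))
naive = mean_t - mean_c
print(round(naive, 1))  # far larger than the true effect of 5
```

The naive difference bundles the selection gap with the treatment effect, which is exactly the omitted-variable problem randomization is meant to solve.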
33
Validity of two "counterfeit" counterfactuals:
Pre-post: (T2 − T1) − (C2 − C1) = T2 − T1 iff C2 − C1 = 0 (no trend).
Post-only: (T2 − T1) − (C2 − C1) = T2 − C2 iff C1 − T1 = 0 (groups equivalent at time 1).
[Graph: outcomes T1, T2, C1, C2 from time=1 to time=2.] Randomization gives us equivalence at time 1, so we can use post-test-only estimators. This is the exception, not the rule.
34
DIFFERENT TYPES OF COUNTERFACTUALS IN PRACTICE
35
Varieties of the Counter-Factual: pre-post effect; post-only effect; pre-post with control (diff-in-diff; effect = A − B); interrupted time series; regression discontinuity (effect measured at the program qualification cutoff). [Diagrams: T=1 → Program → T=2 for each design.]
36
Interrupted Time Series
37
Regression discontinuity design Source: Martinez, 2006, Course notes
38
Regression Discontinuity
39
Core Concepts
1. Causal Analysis and the Counter-Factual
   1. The Competing Hypotheses Framework
2. Calculation of Program Effect (effect size)
3. Randomization Process
4. Control versus Comparison Group
5. Treatment Effect
   1. Average Treatment Effect (ATE)
   2. Intention to Treat (ITT)
   3. Treatment on the Treated (TOT)
40
SUCCESSFUL AND UNSUCCESSFUL RANDOMIZATION
42
Is this problematic?
43
Bonferroni Correction: When we want to be 95% confident that two groups are the same, and we measure those groups using a set of contrasts, our decision rule is no longer to reject the null (that the groups are the same) whenever a p-value < 0.05. A "contrast" is a comparison of means of any measured characteristic between two groups. If each contrast has a 5% chance of producing a p-value below 0.05 by chance alone, then the probability of observing at least one such contrast across the whole set is greater than 5%. It is approximately n × 0.05 (exactly 1 − (1 − 0.05)^n for independent contrasts), where n is the number of contrasts. So if we want the 5% error rate to apply to the groups as a whole (not just a single contrast), we adjust our decision rule to 0.05/n. For example, with 10 contrasts the decision rule becomes 0.05/10 = 0.005: the p-value of at least one contrast must fall below 0.005 before we conclude that the groups are different.

x1 <- rbinom( 10000, 6, 0.05 )
table( x1 ) / 10000
y1 <- rbinom( 10000, 6, 0.05/6 )
table( y1 ) / 10000
44
Test for "Happy" Randomization: 0.05 / 6 = 0.0083

x1 <- rbinom( 10000, 6, 0.05 )
table( x1 ) / 10000
y1 <- rbinom( 10000, 6, 0.05/6 )
table( y1 ) / 10000
45
RCT versus Natural Experiments: 1. An RCT assumes complete control over the assignment process. 2. Natural experiments often utilize randomization that occurs in the world: the charter schools example; the Vietnam veterans example.
46
"Control" Versus "Comparison" Groups. [Graph: outcomes from time=1 to time=2 for the policy/program group (T1 → T2) and the control/comparison group (C1 → C2), with the trend marked.] First difference: Effect = T2 − T1. Second difference: Effect = T2 − C2. Difference in difference: Effect = (T2 − T1) − (C2 − C1). Two considerations: (1) Does C1 = T1? Not usually for comparison groups. (2) Is C2 − C1 an accurate reflection of the trend? In some cases the comparison group can adequately capture the trend.
47
DIFFERENT INTERPRETATIONS OF PROGRAM EFFECTS
48
Estimation of the counter-factual: Program Effect = E( Y | P=1 ) − E( Y | P=0 ), operationalized as Program Effect = (T2 − T1) − (C2 − C1). [Diagram: control group; those given bed nets; those given bed nets AND who use them!]
49
Calculation of Treatment Effects:
50
Terminology: "average" treatment effects; Treatment on the Treated (TOT) effects; Intention to Treat (ITT) effects. [Diagram: those given bed nets vs. those given bed nets AND who use them!]
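A minimal sketch of the ITT/TOT distinction using the bed-net framing; all numbers are invented, and the TOT rescaling assumes no one in the control group uses a net:

```python
# Hypothetical trial: everyone in the treatment arm is GIVEN a net,
# but only 60% actually use one.
mean_assigned = 4.0   # average outcome in the assigned-to-treatment arm
mean_control  = 1.0   # average outcome in the control arm
compliance    = 0.60  # share of the assigned group that used the nets

itt = mean_assigned - mean_control   # effect of being OFFERED the program
tot = itt / compliance               # effect on those who actually USED it
print(itt, tot)  # 3.0 5.0
```

The ITT estimate answers the policy question ("what does offering the program do?"); the TOT estimate answers the treatment question ("what does the program do for those who take it up?").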
51
Exam Question: What is the difference between non-compliance and attrition?
52
CAMPBELL SCORES: ELIMINATING COMPETING HYPOTHESES
53
http://www.youtube.com/watch?v=7DDF8WZFnoU Can Ants Count?
54
Competing Hypotheses The Program Hypothesis: The change that we saw in our study group above and beyond the comparison group (the effect size) was a result of the program. The Competing Hypothesis: The change that we saw in our study group above and beyond the comparison group was a result of _______. (insert any item of the Campbell Score)
55
The Campbell Score:
Omitted Variable Bias: Selection / Omitted Variables; Non-Random Attrition
Trends in the Data: Maturation; Secular Trends; Testing; Seasonality; Regression to the Mean
Study Calibration: Measurement Error; Time-Frame of Study
Contamination Factors: Intervening Events
56
Competing Hypothesis #1: Selection Into a Program. If people have a choice to enroll in a program, those that enroll will be different from those that do not. This is a source of omitted variable bias. The Fix: Randomization into treatment and control groups. Randomization must be "happy"!
57
Test for "Happy" Randomization

x1 <- rbinom( 10000, 6, 0.05 )
table( x1 ) / 10000
58
Competing Hypothesis #2: Non-Random Attrition. If the people that leave a program or study differ from those that stay, the calculation of effects will be biased. The Fix: Examine characteristics of those that stay versus those that leave. [Microfinance example: savings balances of $1.00, $1.50, $2.00, $2.50, $3.00; when the poorest clients attrit, the mean rises from $2.00 to $2.50 — an artificial effect size.]
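The microfinance attrition example on the slide reduces to simple arithmetic: dropping the poorest clients raises the mean with no program effect at all.

```python
# Savings balances from the slide's microfinance example.
before = [1.00, 1.50, 2.00, 2.50, 3.00]   # full sample, mean = 2.00
after  = [2.00, 2.50, 3.00]               # the two poorest clients attrit

mean_before = sum(before) / len(before)
mean_after  = sum(after) / len(after)
print(mean_before, mean_after)  # 2.0 2.5 — an artificial "gain" of 0.50
```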
59
Test for Attrition. Attrition can also be tested another way: do the background characteristics of the sample at T1 match those of the sample remaining at T2?
60
Separating Trend from Effects. [Graph: C1 = T1 at time=1; the total gains during the study split into trend (C2 − C1) and treatment effect.] When C1 = T1, T2 − C2 removes the trend.
61
Separating Trend from Effects. [Graph: C1 ≠ T1; the actual trend exceeds the total gain removed by T2 − C2.] Here T2 − C2 does NOT fully remove the trend. NOTE: diff-in-diff separates trends even when the groups are not equivalent at baseline.
62
Separating Trend from Effects. [Graph: C1 ≠ T1; the total gain removed by T2 − C2 exceeds the actual trend.] Here T2 − C2 removes too much trend. NOTE: diff-in-diff separates trends even when the groups are not equivalent at baseline.
63
Competing Hypothesis #3: Maturation. Occurs when growth is expected naturally, such as the increase in children's cognitive ability due to natural development, independent of program effects. The Fix: Use a comparison group to remove the trend. [Pre-post with control: Effect = A − B.]
64
Competing Hypothesis #4: Secular Trends. Very similar to maturation, except the trend in the data is caused by a global process outside of individuals, such as economic or cultural trends. The Fix: Use a comparison group to remove the trend. [Pre-post with control: Effect = A − B.]
65
Competing Hypothesis #5: Seasonality. Data with seasonal trends or other cycles have natural highs and lows. The Fix: Only compare observations from the same point in the cycle, or average observations over an entire year (or cycle period).
66
Competing Hypothesis #6: Testing. When the same group is exposed repeatedly to the same set of questions or tasks, performance can improve independent of any training. The Fix: This problem only applies to a small set of programs. Change tests, use post-test-only designs, or use a control group that also receives the test. [Pre-post with control: Effect = A − B.]
67
Competing Hypothesis #7: Regression to the Mean. Whenever you observe an extreme outcome, the next period's outcome is more likely to be closer to the mean than to stay the same or become more extreme. As a result, quality-improvement programs aimed at low-performing units often have a built-in improvement bias regardless of program effects. The Fix: Take care not to select a study group from the top or bottom of the distribution based on a single time period (only high or low performers).
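A minimal simulation of the problem, under invented parameters: stable units are measured twice with noise, and the "low performers" are selected on the first measurement. They improve at the second measurement with no program at all.

```python
import random
random.seed(7)

# Stable true quality, observed twice with independent measurement noise.
true_quality = [random.gauss(100, 5) for _ in range(5000)]
score_t1 = [q + random.gauss(0, 10) for q in true_quality]
score_t2 = [q + random.gauss(0, 10) for q in true_quality]

# Select the "low performers" at t1, as a quality-improvement program would.
low = [i for i, s in enumerate(score_t1) if s < 90]
mean_t1 = sum(score_t1[i] for i in low) / len(low)
mean_t2 = sum(score_t2[i] for i in low) / len(low)
print(round(mean_t1, 1), round(mean_t2, 1))  # t2 mean is noticeably higher
```

The apparent "improvement" is pure regression to the mean: units selected partly for bad luck at t1 get average luck at t2.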
68
Regression to the Mean example. [Chart: average = 15.6.]
69
Competing Hypothesis #8: Measurement Error. If there is significant measurement error in the dependent variable, the added noise makes effects harder to detect and can make programs look less effective than they are. The Fix: Use better measures of the dependent variable.
70
Competing Hypothesis #9: Study Time-Frame. If the study is not long enough, it may look like the program had no impact when in fact it did. If the study is too long, attrition becomes a problem. The Fix: Use prior knowledge or research from the study domain to pick an appropriate study period. Examples: the Michigan affirmative action study; the Iowa liquor-law change.
72
Competing Hypothesis #10: Intervening Events. Has something happened during the study that affects one of the groups (treatment or control) but not the other? The Fix: If there is an intervening event, it may be hard to remove its effects from the study.