Comparing Results from RCTs and Quasi-Experiments that Share the Same Intervention Group
Thomas D. Cook, Northwestern University

Why RCTs are to be preferred
- Statistical theory regarding expectations
- Relative advantage over other bias-free methods, e.g., regression-discontinuity (RDD) and instrumental variables (IV)
- Ad hoc theory and research on implementation
- Privileged credibility in science and policy
- Claim that non-experimental alternatives routinely fail to produce similar causal estimates

Dissimilar Estimates
- Come from empirical studies comparing experimental and non-experimental results on the same topic
- Strongest are within-study comparisons
- These take an experiment, throw out the control group, and substitute a non-equivalent comparison group
- Given that the intervention group is a constant, this is a test of the different control groups

Within-Study Comparison Literature
- 20 studies, mostly in job training
- Reviews of the 14 in job training contend:
  (1) No study produces a clearly similar causal estimate, including Dehejia & Wahba
  (2) Some design and analysis features are associated with less bias, but bias remains
  (3) The average of the experiments is not different from the average of the non-experiments--but be careful here and note that the variance of the effect sizes differs by design type

Brief History of the Literature on Within-Study Comparisons
- LaLonde; Fraker & Maynard
- 12 subsequent studies in job training
- Extension to examples in education in the USA and social welfare in Mexico, not yet reviewed

Policy Consequences
- Department of Labor, as early as 1985
- Health and Human Services, job training and beyond
- National Academy of Sciences
- Institute of Education Sciences
- Do within-study comparisons deserve all this?

We will:
- Deconstruct "non-experiment" and compare experimental estimates to
  1. Regression-discontinuity estimates
  2. Estimates from difference-in-differences (fixed effects) designs
- Ask: Is the general conclusion about the inadequacy of non-experiments true across at least these different kinds of non-experiment?

Criteria of a Good Within-Study Comparison Design
1. Variation in mode of assignment--random or not
2. No third variables correlated with both assignment and outcome--e.g., measurement
3. Randomized experiment properly executed
4. Quasi-experiment a good instance of its "type"
5. Both design types estimate the same causal entity--e.g., LATE in regression-discontinuity
6. Acceptable criteria of correspondence between design types--effect sizes seem similar; do not formally differ; statistical significance patterns do not differ, etc.

Experiments vs. Regression-Discontinuity Design Studies

Three Known Within-Study Comparisons of Exp. and R-D
- Aiken, West et al. (1998): R-D study; experiment; LATE; analysis; results
- Buddelmeyer & Skoufias (2003): R-D study; experiment; LATE; analysis; results
- Black, Galdo & Smith (2005): R-D study; experiment; LATE; analysis; results

Comments on R-D vs. Exp.
- Cumulative correspondence demonstrated over three cases
- Is this theoretically trivial, though?
- Is it pragmatically significant, given variation in implementation in both the experiment and R-D?
- As an "existence proof", it belies the over-generalized argument that non-experiments don't work
- As a practical issue, does it mean we should support RDD when treatments are assigned by need or merit?
- Emboldens us to deconstruct the non-experiment further
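
A minimal sketch of the regression-discontinuity estimation logic may help here. The Python code below simulates data and estimates the LATE at the cutoff with a local linear regression; the variable names (score, treat, y), the cutoff, and the bandwidth are illustrative assumptions, not features of the Aiken, Buddelmeyer & Skoufias, or Black, Galdo & Smith studies.

    # Minimal sketch of a sharp regression-discontinuity (RD) estimate of the
    # local average treatment effect (LATE) at the cutoff. Simulated data;
    # all names and constants are illustrative.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 1000
    score = rng.uniform(-1, 1, n)            # assignment variable, cutoff at 0
    treat = (score < 0).astype(int)          # e.g., remediation assigned below the cutoff
    y = 0.5 * score + 0.4 * treat + rng.normal(0, 1, n)   # true effect = 0.4

    df = pd.DataFrame({"y": y, "score": score, "treat": treat})

    # Local linear regression within a bandwidth around the cutoff,
    # with separate slopes on each side of it.
    bw = 0.5
    local = df[df["score"].abs() < bw]
    fit = smf.ols("y ~ treat + score + treat:score", data=local).fit()
    print(fit.params["treat"])               # RD estimate of the LATE at the cutoff

Because the estimate applies only at the cutoff, it targets the same LATE that the benchmark experiment must estimate for the comparison to be fair (criterion 5 above).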

Experiment vs. Difference-in-Differences
- Most frequent non-experimental design by far across many fields of study
- Also modal in within-study comparisons in job training, and so it provides the major basis for the past opinion that non-experiments are routinely biased
- We review:
  - 3 studies with comparable estimates
  - 14 job training studies with dissimilar estimates
  - 2 education examples with dissimilar estimates
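
Since difference-in-differences anchors the rest of this section, here is a minimal sketch of its logic on simulated two-period data. The names (group, period, y) and effect sizes are illustrative; this is not the Bloom et al. analysis.

    # Minimal sketch of a difference-in-differences (fixed-effects) estimate.
    # Simulated data; names and constants are illustrative.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    n = 500
    group = rng.integers(0, 2, n)                 # 1 = non-equivalent treatment group
    pre = 1.0 * group + rng.normal(0, 1, n)       # stable group gap of 1.0 at pretest
    post = 1.3 * group + rng.normal(0, 1, n)      # same gap plus a true effect of 0.3

    long = pd.DataFrame({
        "y": np.concatenate([pre, post]),
        "group": np.tile(group, 2),
        "period": np.repeat([0, 1], n),
    })

    # The group main effect absorbs the stable selection difference;
    # the group:period interaction is the difference-in-differences estimate.
    fit = smf.ols("y ~ group + period + group:period", data=long).fit()
    print(fit.params["group:period"])

The design removes any time-invariant difference between the non-equivalent groups; selection that changes over time is what the choice of comparison group and the statistical adjustments reviewed below must handle.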

Bloom et al.
- Bloom et al. (2002; 2005): job training is the topic
- Experiment: 11 sites, 8 pre-intervention earnings waves, 20 post
- Non-experiment: 5 within-state comparisons, 4 within-city; all comparison subjects enrolled in welfare
- We present only the control/comparison contrast because the treatment time series is a constant

The issue is:
- Is there an overall difference between control groups randomly or non-randomly formed?
- If yes, can statistical controls--OLS, IV (incl. Heckman models), propensity scores, random growth models--eliminate this difference?
- Tested 10 modes, but only one longitudinal
- Why we treat this as d-in-d rather than ITS

Bloom et al. Results

Bloom et al. Results (continued)

Implications of Bloom et al.
- Averaging across the 4 within-city sites showed no difference; also true if the 5th, between-city site is added
- Selecting local, within-city comparisons obviated the need for statistical adjustments for non-equivalence--design alone did it
- Bloom et al. tested differential effects of statistical adjustments in between-state comparisons where there were large differences
- None worked, and none did better than OLS

Aiken et al. (1998) Revisited
- The experiment: remember that the sample was selected on a narrow range of test score values
- Quasi-experiment: sample selection limited to students who registered late or could not be found in the summer but who scored in the same range as the experiment
- No differences between experiment and non-experiment on test scores or pretest writing tests
- Measurement identical in experiment and non-experiment

Results for Aiken et al.
- Writing standardized test = .59 and .57 (significant)
- Rated essay = .06 and .16 (not significant)
- High degree of comparability in statistical test results and effect size estimates

Implications of Aiken et al.
- Like Bloom et al., careful selection of the sample gets close correspondence on important observables
- Little need for statistical adjustment; the remaining non-equivalence is limited to unobservables
- Statistical adjustment is minor compared to the use of sampling design to construct initial correspondence

What happens if there is an initial selection difference? Shadish, Luellen & Clark (2006)

Figure 1: Design of Shadish et al. (2006)
- N = 445 undergraduate psychology students: pretests, then random assignment to one of two arms
- Randomized experiment (n = 235): randomly assigned to mathematics training (n = 119) or vocabulary training (n = 116)
- Nonrandomized experiment (n = 210): self-selected into mathematics training (n = 79) or vocabulary training (n = 131)
- All participants measured on both mathematics and vocabulary outcomes

What's special in Shadish et al.
- Variation in mode of assignment
- Holds constant most other factors through the first random assignment--population, measures, activity patterns
- Good experiment? Pretests; short-term and attrition; no chance for contamination
- Good quasi-experiment? Selection process; quality of measurement; analysis and the role of Rosenbaum

Results: Shadish et al.

Implications of Shadish et al.
- Here the sampling design produced non-equivalent groups on observables, unlike Bloom
- Here the statistical adjustments worked when computed as propensity scores
- However, there was big overlap in experimental and non-experimental scores due to the first-stage random assignment, making the propensity scores more valid
- Extensive, unusually valid measurement of a relatively simple, though not homogeneous, selection process
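
For readers unfamiliar with the technique, a minimal sketch of propensity-score adjustment via inverse-probability weighting is given below on simulated data. It is not the Shadish et al. analysis itself; the covariate names and the selection model are illustrative assumptions.

    # Minimal sketch of propensity-score adjustment by inverse-probability
    # weighting (IPW). Simulated data; not the Shadish et al. analysis.
    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    n = 400
    math_pre = rng.normal(0, 1, n)                    # illustrative pretest covariates
    vocab_pre = rng.normal(0, 1, n)
    # Self-selection into "treatment" depends on the observed pretests
    p_select = 1 / (1 + np.exp(-(0.8 * math_pre - 0.5 * vocab_pre)))
    treat = rng.binomial(1, p_select)
    y = 0.5 * treat + 0.7 * math_pre + rng.normal(0, 1, n)   # true effect = 0.5

    df = pd.DataFrame({"y": y, "treat": treat,
                       "math_pre": math_pre, "vocab_pre": vocab_pre})

    # Step 1: model the selection process with the observed covariates.
    ps = smf.logit("treat ~ math_pre + vocab_pre", data=df).fit(disp=0).predict(df)

    # Step 2: weight each case by the inverse probability of the condition it is in.
    df["w"] = np.where(df["treat"] == 1, 1 / ps, 1 / (1 - ps))

    # Step 3: a weighted outcome regression recovers the effect only to the
    # extent that the measured covariates capture the selection process.
    fit = smf.wls("y ~ treat", data=df, weights=df["w"]).fit()
    print(fit.params["treat"])

The last comment restates the point of the slide above: the adjustment worked in Shadish et al. because the selection process was simple, well conceptualized, and unusually well measured.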

Limitations to Shadish et al.
- What about more complex settings?
- What about more complex selection processes?
- What about OLS and other analyses?
- This is not a unique test of propensity scores!

Examine Within-Study Comparison Studies with Different Results
- The bulk of the job training comparisons
- Two examples from education

Earliest Job Training Studies: Adding to the Smith/Todd Critique
- Mode of assignment clearly varied
- We assume the RCT was implemented reasonably well
- But third-variable irrelevancies were not controlled, especially location and measurement, given dependence on matching from extant data sets
- Large initial differences between randomly and non-randomly formed comparison groups
- Reliance on statistical adjustment, rather than initial design, to reduce selection bias

Recent Educational Examples

Agodini & Dynarski (2004)
- Drop-out prevention experiment, 16 middle/high schools
- Individual students, likely dropouts, were randomly assigned within schools--16 replicates
- Quasi-experiment: students matched from 2 quite different sources, middle school controls in another study and national NELS data; matching on individual and school demographic factors
- 4 outcomes examined, so the non-experiment required 128 propensity scores (16 x 4 x 2), computed basically from demographic background variables

Results
- Balanced matches were obtained in only 29 of 128 cases
- Why was quality matching so rare? In the non-experiment, the groups hardly overlap: the treatment group is high and middle schools, but comparisons are middle school only or from a very non-local national data set
- Mixed pattern of outcome correspondences in the 29 cases with computable propensity scores--not good
- OLS did as well as propensity scores
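
Because the argument turns on covariate balance, a minimal sketch of one standard balance diagnostic, the standardized mean difference, is shown below on simulated data. The grade-level covariate, group sizes, and the 0.25 threshold are illustrative assumptions, not values from Agodini & Dynarski.

    # Minimal sketch of a covariate-balance check via the standardized mean
    # difference (SMD). Simulated data; all numbers are illustrative.
    import numpy as np

    def smd(x_treat, x_comp):
        """Standardized mean difference with a pooled-SD denominator."""
        pooled_sd = np.sqrt((x_treat.var(ddof=1) + x_comp.var(ddof=1)) / 2)
        return (x_treat.mean() - x_comp.mean()) / pooled_sd

    rng = np.random.default_rng(3)
    treat_grade = rng.normal(9.5, 1.0, 200)   # treatment pool: mostly high-school grades
    comp_grade = rng.normal(7.0, 0.8, 500)    # comparison pool: middle-school grades

    value = smd(treat_grade, comp_grade)
    print(f"SMD = {value:.2f}")
    # A common rule of thumb flags |SMD| above ~0.25 as poor balance.
    print("balanced" if abs(value) < 0.25 else "not balanced")

When the pools barely overlap, as in the slide above, matching cannot manufacture balance, which is why so few of the 128 propensity-score models yielded balanced matches.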

Critique
- Who would design a quasi-experiment this way?
- Is a mediocre non-experiment being compared to a good experiment?
- An alternative design might have been:
  1. Regression-discontinuity
  2. Local comparison schools, with the same selection mechanism used to select similar comparison students
  3. Use of multi-year prior achievement data

Wilde & Hollister (2005)
- The experiment: reducing class size in 11 sites; no pretest used at the individual level
- Quasi-experimental design: individuals in reduced classes matched to individual cases from the other 10 sites
- Propensity scores, mostly demographic
- The analysis treats each site as a separate experiment, and so there are 11 replicates comparing an experimental and a non-experimental effect size

Results
- Low level of correspondence between experimental and non-experimental effect sizes across the 11 sites
- So at each site, it makes a causal difference whether the estimate comes from the experiment or the quasi-experiment
- When aggregated across sites, results are closer: exp = .68; non-exp = 1.07
- But they still reliably differ

Critique
- Who would design a quasi-experiment on this topic without a pretest on the same scale as the outcome?
- Who would design it with these controls? Instead, select controls from one or more schools matched on prior achievement history
- Again, a good experiment is being compared to a bad quasi-experiment
- Who would treat this as 11 separate experiments rather than a more stable pooled experiment? Even for the authors, the pooled results are much more congruent

The hypothesis is that...
- The job training and educational examples that produce different conclusions from the experiment are examples of poor quasi-experimental design
- To compare a good experiment to a poor quasi-experiment is to confound a design type with the quality of its implementation--a logical fallacy
- But I reach this conclusion ex post facto and knowing the randomized experimental results in advance

Big Conclusions
- R-D has given results not much different from the experiment in three of three cases
- Simpler quasi-experiments tend to give the same results as the experiment if: (a) there is population matching in the sampling design, as in the Bloom and Aiken studies; or (b) there is careful conceptualization and measurement of the selection model, as in Shadish et al.

What I am not concluding:
- That a well-designed quasi-experiment is as good as an experiment. The designs differ in:
  - Number and transparency of assumptions
  - Statistical power
  - Knowledge of implementation
  - Social and political acceptance
- If you have the option, do an experiment, because you can rarely put right by statistics what you have messed up by design

What I am suggesting you consider:
- Whether this should be a unit on RCTs or on quality causal studies
- Whether you want to do RDD studies in cases where an experiment is not possible because resources are distributed otherwise
- Whether you want to do quasi-experiments if group matching on the pretest is possible, as in many school-level interventions

More contentiously, also consider quasi-experiments if:
- The selection process can be conceptualized, observed, and measured very well
- An abbreviated ITS analysis is possible, as in Bloom et al.
- The instinct to avoid quasi-experiments is correct, but it reduces the scope of the causal issues that can be examined

Shadish, Luellen & Clark (2006)

Results: Aiken et al.
- Pretest values on SAT/CAT and 2 writing measures; measurement framework the same
- Pretest ACTs and writing: experiment vs. non-experiment differences not significant (OLS tests)
- Results for writing test = .59 and .57 (significant)
- Results for essay = .06 and .16 (not significant)

Bloom et al. Revisited
- Analysis at the individual level
- Within city, within welfare-to-work center, same measurement design
- Absolute bias: yes
- Average bias: none across the 5 within-state sites, even without statistical tests
- Average bias limited to the small site and the non-within-city site (Detroit vs. Grand Rapids)

Correspondence Criteria
- Random error, so no exact agreement expected
- Shared statistical-significance pattern relative to zero: 68%
- Two effect sizes not statistically different (see the sketch below)
- "Comparable" magnitude of estimates
- One as a percentage of the other
- Indulgence, common sense, and a mix of criteria
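
A minimal sketch of the third criterion follows, assuming two independent effect-size estimates with known standard errors. The effect sizes echo the Aiken et al. writing-test values only as an example; the standard errors are hypothetical.

    # Minimal sketch: z-test of whether an experimental and a non-experimental
    # effect size differ statistically. Standard errors below are hypothetical.
    import math
    from scipy.stats import norm

    def es_difference_test(es_exp, se_exp, es_nonexp, se_nonexp):
        """Two-sided z-test for the difference between two independent effect sizes."""
        diff = es_exp - es_nonexp
        se_diff = math.sqrt(se_exp ** 2 + se_nonexp ** 2)
        z = diff / se_diff
        p = 2 * (1 - norm.cdf(abs(z)))
        return diff, z, p

    diff, z, p = es_difference_test(0.59, 0.10, 0.57, 0.12)
    print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p:.3f}")

A non-significant difference is a weak criterion on its own, which is why the slide ends with "indulgence, common sense, and a mix" of criteria.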

Our Research Issues
- Deconstructing "non-experiment": do experimental and non-experimental effect sizes correspond differently for R-D, for ITS, and for simple non-equivalent designs?
- How far can we generalize results about the invalidity of non-experiments beyond job training?
- Do these within-study comparison studies bear the weight ascribed to them in evaluation policy at DoL and IES?