Evidence-based Methods for improving Evidence-based Policy Thomas D. Cook, Northwestern University and Mathematica Policy Research, Inc. Funded by NSF.

Presentation transcript:

Evidence-based Methods for improving Evidence-based Policy Thomas D. Cook, Northwestern University and Mathematica Policy Research, Inc. Funded by NSF Grant DRL

My Confessions 1. I am a randomista. Nothing I say challenges that RCTs are best for testing causal hypotheses because they control for all potential biases -- observed and unobserved. 2. However, I am a conditional randomista. I believe there is strong empirical evidence that certain kinds of non-experiments provide acceptable causal answers. I am not afraid of non-experiments, therefore.

Why are they Acceptable Causal Estimates? Because the methods generating them have often reproduced similar estimates to RCTs in rigorous tests. Why is this important?

Hierarchy of Accepted Causal Methods in Clearinghouses Coalition for Evidence-Based Policy What Works Clearinghouse – Education Blueprints – historically about crime and violence prevention Cochrane Collaboration in Medicine Many other inventories of effective practices from NGOs or government agencies All agree on “best” single method, but they disagree about acceptability of other methods.

One Gain? We assume most social effects are co-conditioned by many extraneous factors that cannot be held constant in the worlds of application as one might do in a lab E.g., designer effects, social context effects, temporal comparison effects We usually need many studies to probe causal robustness/conditionality. Nice to have more than just RCTs if we trust ‘em

The Rigorous Method: Within Study Comparisons aka Design Experiments

WSC Design: Three-Arm Study

WSC Design: Four-Arm Study. POPULATION randomly assigned to: Randomized Experiment (Treatment, Control) or Observational Study (Treatment, Control). ATE = ?

Conditions for a good WSC A well implemented RCT, with minimal sampling error No third variable confounds –like from measurement Comparable estimands – RD and RCT Blinding to the RCT or adjusted QE results Defensible criterion for correspondence of RCT and adjusted QE results
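
A minimal sketch of the correspondence check a WSC makes, assuming a simple difference in means as the estimator in both arms and the RCT control-group SD as the standardizer. The function names and the toy data are hypothetical illustrations, not taken from any of the studies discussed here.

```python
from statistics import mean, stdev


def mean_difference(treated, control):
    """Treatment effect estimate as a raw difference in group means."""
    return mean(treated) - mean(control)


def wsc_bias(rct_treated, rct_control, qe_treated, qe_control):
    """Return (raw, standardized) gap between the QE estimate and the RCT benchmark.

    The standardizer (SD of the RCT control group) is an assumption of this sketch.
    """
    benchmark = mean_difference(rct_treated, rct_control)
    qe_estimate = mean_difference(qe_treated, qe_control)
    raw = qe_estimate - benchmark
    return raw, raw / stdev(rct_control)


# Hypothetical toy outcomes for the four arms.
rct_t = [12.0, 14.0, 13.0, 15.0]
rct_c = [10.0, 11.0, 9.0, 12.0]
qe_t = [13.0, 15.0, 14.0, 16.0]
qe_c = [10.0, 11.0, 9.0, 12.0]

raw_bias, std_bias = wsc_bias(rct_t, rct_c, qe_t, qe_c)
print(round(raw_bias, 3), round(std_bias, 3))
```

In practice the QE arm would be adjusted (matching, covariates) before this comparison, and sampling error in both arms would have to be considered when judging correspondence.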

Limitations of WSCs Can only be done on topics where an RCT is possible No reason to believe that a given design will always replicate experimental findings; our more modest goal is to identify designs that often replicate findings. This is inductive and requires a large sample size of WSCs. This talk is not the final word. A more definitive word requires more WSCs

PURPOSES We will now compare RCT to non-RCT results using this method, differentiating non-RCT Regression-Discontinuity (RD), especially comparative RD (CRD) Interrupted Time Series (ITS), especially comparative ITS (CITS) Non-Equivalent Control Group Designs without RD or ITS feature

COMPARING RCT TO REGRESSION DISCONTINUITY (RD) AND CRD RESULTS

RDD Visual Depiction: Comparison

RDD Visual Depiction: Comparison, Treatment, Counterfactual regression line, Discontinuity (or treatment effect)

RDD Visual Depiction: Comparison, Treatment

Three Major Limitations of RDD Functional form assumptions – see green line Causal generalization, LATE at cutoff Statistical power relative to RCTs – about 3 times lower
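
The functional form limitation can be made concrete with a minimal sketch of a sharp RD estimate: fit one regression line on each side of the cutoff and take the gap between the two lines at the cutoff. This sketch assumes linear functional forms, treatment for units at or above the cutoff, and noise-free hypothetical data; none of these choices come from the studies cited in this talk.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def sharp_rd_estimate(assignment, outcome, cutoff):
    """Gap between the two fitted lines at the cutoff (a local effect only)."""
    # Center the assignment variable so each intercept is the prediction at the cutoff.
    below = [(a - cutoff, y) for a, y in zip(assignment, outcome) if a < cutoff]
    above = [(a - cutoff, y) for a, y in zip(assignment, outcome) if a >= cutoff]
    intercept_below, _ = fit_line(*zip(*below))
    intercept_above, _ = fit_line(*zip(*above))
    return intercept_above - intercept_below


# Hypothetical data: linear trend plus a jump of 3 at the cutoff of 0.
assignment = [x / 2 for x in range(-10, 10)]
outcome = [2 + 0.5 * a + (3 if a >= 0 else 0) for a in assignment]

rd_effect = sharp_rd_estimate(assignment, outcome, cutoff=0.0)
print(round(rd_effect, 6))
```

If the true relationship were curved, the linear fits would misstate the gap, which is the functional form problem named above; the estimate also speaks only to units at the cutoff.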

WSC Results for RD There have now been 10 WSCs and all report a similar finding – generally comparable effect estimates at the cutoff, but no meta-analysis Could be considered: (i) theoretically trivial; (ii) a traditional empirical test of a statistical hypothesis; or (iii) as we prefer, a test of robustness of RD in practice – it is generally good enough in real research practice despite sampling variability in both the RD and RCT

Addressing 3 Major Limitations of RD by Adding a Comparison Function The no-treatment comparison regression function can be one or several pretest time points or a non-equivalent comparison group Functional form estimation in RD? Causal generalization? Statistical power?

Posttest regression Pretest regression

Example 1: Effects of Head Start – Tang & Cook (2014) Random selection of HS centers (89% agree) followed by random assignment within centers of 3-year-olds Outcome = math, literacy; social behavior CRD-Pre has pretest as no-treatment regression function CRD-CG has non-equivalent group of 4-year-olds from same locations

Summary: CRD-Pre above the cutoff

Summary: CRD-CG above the cutoff

Example 2: Wing & Cook (2013) RCT = professionals or family making decisions about services for disabled. Here we examine how much of allotment spent Assignment variable = age (35, 50 and 70) Comparison = payments made before the RCT began (pretest measure of the outcome) No CRD-CG

Comparative RD Results in Standardized Difference from RCT away from Cutoff: Non-Parametric Analysis. Table columns: State; Cutoff Age. Rows: AK, NJ, FL

CRD 3 studies to date using WSC methods All 3 include pretest as comparison regression function 2 include non-equivalent comparison group function – e.g., 4-year-olds for 3-year-olds in Head Start All show plausibly parallel functional forms All show much smaller SEs than simple RD and close to RCT All show unbiased causal inference at cutoff and also away from it (highest = .13 SDs)

Summary for RD vs RCT Little doubt from 10 studies that unbiased causal inference results at the cutoff in actual research Empirical reason from 3 studies to believe that generic limitations of RD can be mitigated by adding a reliable RDD function from pretest or from non-equivalent control group We should call for more CRD designs in the future; their results are particularly close to those of RCT in terms of bias and precision. Yet CRD is not a category in any compendium.

COMPARISON OF RCT AND INTERRUPTED TIME SERIES (ITS) RESULTS

Interrupted Time Series Can Provide Strong Evidence for Causal Effects Clear Intervention Time Point Huge and Immediate Effect Clear Pretest Functional Form + many Observations No Alternative at Intervention Can Explain Change

Limitations of Simple One-Group ITS History alternative explanations around the intervention point Functional form extrapolation needed Analysis has to account for correlated errors (we will not deal with this issue here) First two points suggest the advisability of a comparative ITS
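
A minimal sketch of the comparative ITS logic, assuming a linear pre-intervention trend in each group and ignoring correlated errors: each group's post-intervention observations are compared with its own extrapolated trend, and the comparison group's deviation (a stand-in for history effects) is subtracted from the treated group's. Function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def post_deviation(pre, post):
    """Mean gap between observed post values and the extrapolated pre-period trend."""
    intercept, slope = fit_line(list(range(len(pre))), pre)
    start = len(pre)
    gaps = [y - (intercept + slope * (start + i)) for i, y in enumerate(post)]
    return sum(gaps) / len(gaps)


def cits_estimate(treated_pre, treated_post, comparison_pre, comparison_post):
    """Treated deviation from trend minus comparison deviation from trend."""
    return post_deviation(treated_pre, treated_post) - post_deviation(
        comparison_pre, comparison_post
    )


# Hypothetical series: both groups get a +1 history shock, the treated group +2 more.
treated_pre, treated_post = [10.0, 11.0, 12.0, 13.0], [17.0, 18.0]
comparison_pre, comparison_post = [8.0, 9.0, 10.0, 11.0], [13.0, 14.0]

cits_effect = cits_estimate(treated_pre, treated_post, comparison_pre, comparison_post)
print(round(cits_effect, 6))
```

A one-group ITS on the treated series alone would report a deviation of 3 here, mixing the history shock with the treatment effect; the comparison series is what separates them.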

Hypothetical NCLB effects on public (red) versus private schools (blue). Figure labels: NCLB; NAEP Test Score; Time

What if Everyone in Canada Flushed at the Same Time *

WSC and CITS Four studies in medicine, three in education All claim similar causal inferences No meta-analysis to date No analysis of file drawer problem

St. Clair, Cook, & Hallberg (In Press) RCT: Study of Indiana’s system for feedback on student performance (schools as unit of assignment) Comparative ITS comparison groups – Basically all schools in the state – Matched schools in the state

Math (All schools)

Math: WSC Results

ELA (All Schools)

ELA: WSC Results

With matching C to T Units, instead of Modeling Baseline Means/Slopes Same results Somers et al. got the same results An environmental science study replicated the RCT only with matching

CITS Summary To date, CITS does well relative to RCT Get similar effects despite possible group differences in (a) pre-treatment trend; (b) historical events at treatment; (c) changes in instrument; (d) statistical regression – all these could be confounds, but they have not been to date CITS is not in any compendium except Cochrane

NON-EQUIVALENT CONTROL GROUP DESIGNS : (A) MODELING A LIKELY FULLY KNOWN SELECTION PROCESS

Statistical Theory Knowing the selection process and measuring it perfectly always gives unbiased causal inference Rarely do we know it fully, but we often know major elements of selection process – why children are retained in grade; why couples self-select into divorce Here’s one example – why students self-select into learning more about English or mathematics

Possibly fully known selection process: Shadish, Clark & Steiner (2008). N = 445 undergraduate psychology students randomly assigned to: Randomized Experiment (N = 235), then randomly assigned to Mathematics Training (N = 119) or Vocabulary Training (N = 116); or Observational Study (N = 210), then self-selected into Mathematics Training (N = 79) or Vocabulary Training (N = 131). ATE = ?

23 Constructs and 5 Construct Domains assessed prior to Intervention Proxy-pretests (2 multi-item constructs): 36-item Vocabulary Test II, 15-item Arithmetic Aptitude Test Prior academic achievement (3 multi-item constructs): High school GPA, current college GPA, ACT college admission score Topic preference (6 multi-item constructs): Liking literature, liking mathematics, preferring mathematics over literature, number of prior mathematics courses, major field of study (math-intensive or not), 25-item mathematics anxiety scale

Construct Domains Psychological predisposition (6 multi-item constructs): Big five personality factors (50 items on extroversion, emotional stability, agreeableness, openness to experience, conscientiousness), Short Beck Depression Inventory (13 items) Demographics (5 single-item constructs): Student's age, sex, race (Caucasian, Afro-American, Hispanic), marital status, credit hours

Was there Bias in the QE with Self-Selection into Tracks? RCT showed effects for each outcome. But both math and vocab effects were larger when students self-selected into T versus C So our question is: How much of this self-selection bias is reduced by use of covariates measuring several different possible selection processes?

Bias Reduction: Construct Domains Mathematics

Bias Reduction: Single Constructs Mathematics

Bias Reduction: Construct Domains Vocabulary

Bias Reduction: Single Constructs Vocabulary

Given Initial Group Differences 1. The choice of covariates for selection adjustment is crucial 2. How you analyze the outcome using covariates (OLS and PS matching) makes little difference, though PS preferred in theory 3. Replicated in Pohl et al. (2011)

NON-EQUIVALENT CONTROL GROUP DESIGN: (B) FORMS OF INTACT GROUP MATCHING AND CASE MATCHING WHEN SELECTION IS PARTIALLY KNOWN

Case 1: Intact Group Matching Have T group and select an intact control group without case matching – each unit in T is matched to some Cs Diaz & Handa (2006) – Oportunidades Aiken, West et al. (1999) Same result as RCT without any case matching But no guarantee – job training But it can be Step 1 to maximize overlap followed by case matching

Case 2: Local, Focal and Hybrid Case Matching Local Matching: school districts or labor markets – hope is to achieve match on some unobservables, esp all those local policies that apply to T and C. Reduces bias in labor econ Focal matching. Product of analyses of variables responsible for selection and correlated with outcome but depends on getting the right covariates – e.g., just seen in Shadish et al.

Hybrid sampling model of Stuart and Rubin (2008) Define caliper for adequate case matches Match all LOCAL Cs to T that fall within caliper For those Ts with no adequate matches, perform a match on the basis of the best propensity score after analysis of possible selection processes Now have a mix of acceptably matched local Cs, being preferred for control over some unobservables, and of acceptably matched non-local Cs, matched only on observables
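
The steps above can be sketched as follows. This is a simplified illustration, not the published procedure: it assumes one-to-one nearest-neighbor matching with replacement on an already estimated propensity score, and the tuple layout, function name, and toy units are all hypothetical.

```python
def hybrid_match(treated, controls, caliper):
    """Match each treated unit to a local control within the caliper, else to the
    closest non-local control on the propensity score.

    Units are (unit_id, site, propensity_score) tuples. Returns
    {treated_id: (control_id, "local" or "non-local")}.
    """
    matches = {}
    for t_id, t_site, t_ps in treated:
        # Step 1: local candidates (same site) that fall within the caliper.
        local = [
            (abs(c_ps - t_ps), c_id)
            for c_id, c_site, c_ps in controls
            if c_site == t_site and abs(c_ps - t_ps) <= caliper
        ]
        if local:
            matches[t_id] = (min(local)[1], "local")
            continue
        # Step 2: no adequate local match, so fall back to the best non-local score.
        non_local = [
            (abs(c_ps - t_ps), c_id)
            for c_id, c_site, c_ps in controls
            if c_site != t_site
        ]
        if non_local:
            matches[t_id] = (min(non_local)[1], "non-local")
    return matches


# Hypothetical units: (id, district, propensity score).
treated_units = [("T1", "A", 0.50), ("T2", "B", 0.80)]
control_units = [("C1", "A", 0.52), ("C2", "A", 0.90), ("C3", "B", 0.30), ("C4", "C", 0.78)]

matched = hybrid_match(treated_units, control_units, caliper=0.05)
print(matched)
```

Trying the local pool first reflects the stated preference for local controls, which may also be matched on some unobservables; the fallback pool is matched only on what the propensity score captures.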

Hallberg, Wong, & Cook (in press) This paper draws on a WSC to examine correspondence with the RCT benchmark (Indiana student feedback study) after matching – Within district as long as the schools do not differ by more than 0.75 standard deviations of the propensity score (Local) – For others match on observed school-level covariates known to be highly correlated with the outcome of interest (Focal) – Combine both T and C matched cases (Hybrid)

Performance of local, focal and hybrid matching across two dependent variables

Percentage of times observational approach performed best across 1000 replications

Summary Intact group matching increases overlap. Useful first stage in a QE design strategy Local matching is always useful and often brings about RCT result. Neither is a guarantee Hybrid matching is perhaps best, but only one study and at school and not individual level Need for more studies of hybrid matching

NON-EQUIVALENT CONTROL GROUP DESIGN: (C) AMONG COVARIATES, HOW SPECIAL IS A PRETEST MEASURE OF STUDY OUTCOME?

Claims about Pretest Claim that pretest is privileged for precision, but the issue here is bias reduction In studies limited to modeling the outcome, pretest is often the most highly correlated covariate, but the issue is correlation of pretest with selection into T Though selection on the pretest may be frequent, no one knows how often and when The next WSC studies vary whether the pretest does or does not correlate with selection

Existing Empirical Evidence WSCs provide some support for privileging true pretest because it is better than others at reducing bias, but does not always reduce all bias Workforce development (Glazerman, Levy, & Myers, 2003; Bloom, Michalopoulos, & Hill, 2005; Smith & Todd, 2005) Yet Magnet school study (Bifulco, 2010) and earlier CITS studies here This study examines the bias reduction due to conditioning on pretest measures when we vary the correlation with selection both between and within studies

Between-Studies: Kindergarten Retention Hong and Raudenbush (2005; 2006) used the rich covariates in the ECLS-K to estimate the effect of kindergarten retention on academic outcomes in math and reading

Retention Selection Process Past academic performance plays a critical role in identifying which students will be retained – Students are retained “to remedy inadequate academic progress and to aid in the development of students who are judged to be emotionally immature” (Jackson, 1975, p. 614) – “It is a ‘high risk’ profile generally – for academic setbacks in the near-term, for a lifetime of struggle over the longer term.” (Alexander, Entwisle, and Dauber, 2003, p. 68)

Dataset 1: Correlation with Selection. Table columns: Correlation with Retention in Kindergarten; Correlation Lower Bound; Percent of Lower Bound. Reading Pretest: -0.185*; Math Pretest: -0.179*

Dataset 1: Math Results

Dataset 1: ELA Results

Between-Study Contrast Dataset 2: Indiana Benchmark Assessment Study (Grade 5) 56 K-8 schools serving 5th graders randomly assigned to: – Treatment: implementation of the state’s benchmark assessment system (n=34) – Control schools: business as usual (n=22) – Outcomes: Math and ELA ISAT scores Quasi-experimental comparison group drawn from all other schools in the state that served 5th grade students (n = 681) Rich set of student and school covariates with multiple waves of pretest data

Dataset 2: No Meaningful Correlation with Selection. Correlation with Selection into Benchmark Assessment System: Reading Pretest 0.041; Math Pretest -0.012

Dataset 2: Math Results

Dataset 2: ELA Results

Correlation with Selection. Correlation with Selection into Vocabulary Training: Reading Pretest 0.169*; Math Pretest -0.090

Dataset 3: ELA Results where Pretest and Selection correlate

Math Results where Pretest and Selection do not correlate

Summary of Pretest Results Cannot assume the pretest is always related to selection even if it often is You should probably always include it But consistent with the principle of fully knowing the selection process, you should include it in a PS analysis predicated on measures guided by theoretical explication of all other plausible selection processes

Data-Bound Summary No meta-analysis to date, but looks good for RD, CRD, and CITS It also looks good for designing prospective studies and including measures to account for multiple possible selection processes Intact, local and focal matching with heterogeneous covariates each sometimes reduce all bias, almost always reduce some bias, but likely best together in the form of hybrid matching Pretests do not always reduce bias, though they sometimes achieve this. They afford no guarantee, but should be a significant part of a bias reduction strategy with other sampling and covariate choice efforts. This presentation would be very different five years from now, not so much with respect to RD and ITS, but with respect to the non-equivalent control group design options worth disseminating and suppressing.

Broader Summary Let us all acknowledge that RCT is best in theory and not get into meaningless fights. Let’s ask: is the assumption warranted that the RCT is “far” superior for warranting causal inference? Is an evidence-based empirical rationale already emerging for including some QE studies as acceptable contributions to evidence-based policy suggestions?

Broader Summary The second assumption is that evidence-based policy will be better if we have more info about external validity so as to learn about robustness or conditions under which effect sizes vary for the same treatment and effect Will having more acceptable studies in our knowledge compendia promote external validity, an Achilles Heel of much evidence-based practice research in the social and educational domains at least?

END and THANKS

Reliability of Construct Measurement Steiner, Cook & Shadish (2011) How important is the reliable measurement of constructs (given selection on latent constructs)? – Does including many covariates in the PS model compensate for any one covariate’s unreliability? – We add measurement error to the observed covariates in a simulation study – Assume that original set of covariates is measured without error and removes 100% of selection bias – Systematically added measurement error such that the reliability of each covariate was .6, .7, .8, .9, or 1.0
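
One way to add measurement error at a target reliability can be sketched under the classical test theory assumption that reliability equals true-score variance divided by observed-score variance, so the required error variance is true variance times (1 - reliability) / reliability. This is an illustrative assumption, not necessarily the procedure used in the cited simulation; the function names, seed, and sample size are hypothetical.

```python
import random
from statistics import pvariance


def error_variance(true_variance, reliability):
    """Error variance that yields the target reliability under classical test theory."""
    return true_variance * (1.0 - reliability) / reliability


def add_measurement_error(values, reliability, rng):
    """Return the values with independent normal noise added at the target reliability."""
    noise_sd = error_variance(pvariance(values), reliability) ** 0.5
    return [v + rng.gauss(0.0, noise_sd) for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
true_scores = [rng.gauss(0.0, 1.0) for _ in range(20000)]
noisy_scores = add_measurement_error(true_scores, 0.6, rng)

# Empirical check: share of observed variance that is true-score variance.
observed_reliability = pvariance(true_scores) / pvariance(noisy_scores)
print(round(observed_reliability, 2))
```

Applying the same function to each covariate at reliabilities from .6 to 1.0 would produce degraded covariate sets whose bias reduction can then be compared with the error-free set.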

Mathematics: Reliability 1.0

Mathematics: Reliability .6

The three untreated parallel segments of comparative regression discontinuity (CRD) design

Bias of CRD-Pre above cutoff only

Results: precision of CRD-Pre

Results: bias of CRD-CG above the cutoff

Results: precision of CRD-CG