Better Quasi-Experimental Design Thomas D. Cook Northwestern University and Mathematica, Inc. Stockholm, 2015.

Introduction The theory of cause we discuss today – manipulability theory RCT as the best reflection of it because of Primacy of manipulation in each theory Statistical theory of comparability on all variables, in expectation at least Assumptions clear and testable Social consensus about when met or not

RCTs not always possible; we need Alternatives Closest approximations are quasi-experiments QEs CAN work in theory –Rubin Causal Model BUT conditions where work are opaque Need a method to learn which QEs are usually effective or not in producing results like RCTs Today will present a method for identifying QE design and analysis alternatives that “often” produce results similar to those of RCT

What is at Stake? Evidence-Based Practice Rhetoric of evidence-based practice What is acceptable as “evidence” In causal realm, RCTs are sometimes sole (Evidence-based Coalition), always preferred and often more heavily weighted than other causal designs (WWC) Entails slow rate of progress and requires Accepting all the limitations of RCTs –external validity

At Stake: Future of Bigger Datasets More variables, hence more constructs More versions of same construct --reliability More frequent assessments – more time series More cases More local sampling – better comparisons More linkage potential to other data sets Are better QE designs those with data attributes like that that are becoming more available?

We will explore today Which alternatives to the RCT are worth trusting and not trusting so that we can accumulate results faster over greater range of UTOS One concept of trustworthy is theory; another is that the causal estimates are routinely close to those of an RCT. Today is about the Empirical Criterion of Correspondence between RCT and QE results

Overview of Day First, present and discuss the method called Within Study Comparison (WSC) or design experiment. Then break Then report results of WSCs for two versions of RD and for CITS. Then break. Then report results of WSCs for simpler designs with no pretest time series nor a fully known selection mechanism. Then break Finally, Conclusions and Open discussion

WSC Design: Three-Arm Study

WSC Design: Four-Arm Study. Population randomly assigned to either (1) Randomized Experiment: Treatment vs. Control, or (2) Observational Study: Treatment vs. Control. ATE = ?

Within-Study Comparison aka Design Experiment Have a benchmark– RCT in 64 of the 72 examples to date. Compute causal estimate and SE Quasi-experiment of many different “types” – ITS, RD, NECGD with pretest and local match, NECGD with fully known selection process, with partially known, with scarcely known etc. – Attach same treatment group to RCT and QE, adjust QE, and then compute QE estimate Compare QE and RCT estimates, and conclude
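The comparison logic on this slide can be sketched in a few lines. This is a minimal illustration, not the procedure of any specific WSC: the function names and the outcome scores are hypothetical, and the QE arm is left unadjusted, whereas a real WSC would adjust the QE before comparing.

```python
from statistics import mean

def mean_difference(treated, control):
    """Unadjusted effect estimate: difference in group means."""
    return mean(treated) - mean(control)

def wsc_bias(shared_treated, rct_control, qe_control):
    """Compare an RCT benchmark with a QE that reuses the same treatment group.

    Returns (rct_estimate, qe_estimate, bias) with bias = QE - RCT.
    No covariate adjustment is applied to the QE arm in this sketch.
    """
    rct_estimate = mean_difference(shared_treated, rct_control)
    qe_estimate = mean_difference(shared_treated, qe_control)
    return rct_estimate, qe_estimate, qe_estimate - rct_estimate

# Hypothetical outcome scores, invented for illustration only.
shared_treated = [12.0, 14.0, 13.0, 15.0]
rct_control = [10.0, 11.0, 12.0, 11.0]   # randomized controls
qe_control = [8.0, 9.0, 10.0, 9.0]       # non-equivalent comparison group

rct_est, qe_est, bias = wsc_bias(shared_treated, rct_control, qe_control)
print(rct_est, qe_est, bias)
```

Because both designs share the treatment group, any gap between the two estimates comes from the comparison groups, which is what the adjustment step then tries to remove.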

Evolution of WSC Purposes Began : CAN we get similar results? existence proof; job training; Limitations of approach Does widely used QE practice work? E.g., use of pretest, of local comparisons, combining the 2; of PS analysis Bias-reduction potential of alternatives in separate studies – e.g., covariate choice vs analysis Direct comparison of alternatives – Pretest when correlated with selection process or not Novel alternatives – e.g., Stuart & Rubin (2008) Later = more sophisticated questions better linked to statistical theory

Why do WSCs? All QE results can be disputed in terms of possible alternative interpretations, but not necessarily plausible ones Single WSCs of limited value – if similar results by design, existence proof of what stat theory tells us; if not similar, would they have been by adding X, Y or Z? Goal is not to identify QEs that will always replicate the RCT finding; it is to identify designs that often do so With multiple studies, (a) we could have external warrant for claims that a given QE practice often works, given no internal warrant from theory, as with RCT; (b) we will escape from QE and OS language to be more specific in language re designs

Conditions for a good WSC A well implemented RCT, with minimal sampling error No third variable confounds – like from measurement Comparable estimands – RD and RCT Blinding to the RCT or adjusted QE results Defensible criterion for correspondence of RCT and adjusted QE results

Limitations of WSCs - I Blinding is rare and so are protocols – is there reason to assume that folks most likely to do WSCs want RCT or QE to “win out”? No perfect criterion for correspondence, given sampling error in RCT and QE – of similar signs, stat test results by design, of null and equivalence tests of design differences Can only be done on topics with a benchmark, and 80%-90% of WSCs have an RCT benchmark. Don’t we want to know how QEs do when no RCT is possible?

Discussion of WSCs Break.

WSC Results for Two Variants of the Regression Discontinuity Design (RD): Simple and Comparative RD

RDD Visual Depiction Comparison Treatment

RDD Visual Depiction Comparison Treatment Counterfactual regression line Discontinuity, or treatment effect

Stress for simple RD that Process of assignment into treatment completely known and perfectly measured We should be surprised, therefore, if we do not get same results as RCT estimated at same point in sharp RD. Trivial theory test BUT Non-trivial test of implementability of RD in real world

Why same Causal Estimate in Simple RD and RCT? Rationale 1: process of selection into treatment is completely known – “sharp” RD Rationale 2 for “sharp RD” – like RCT around the cutoff and only there. Rationale 3: If there is contamination – T cases in untreated area and/or C cases in treated area, the way of dealing with this is the same in RD and RCT – instrumental variables (IV)
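As a rough illustration of estimating the effect at the cutoff in a sharp RD, the sketch below fits a separate straight line on each side of the cutoff and takes the gap between the two predictions at the cutoff. The linear form, the function names, and the data are assumptions made for the example; the studies discussed here also use non-parametric local linear regression, which this sketch does not implement.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def sharp_rd_estimate(assign, outcome, cutoff):
    """Gap at the cutoff between lines fitted separately on each side.

    Assumes units with assign >= cutoff are treated (sharp assignment).
    """
    left = [(a, y) for a, y in zip(assign, outcome) if a < cutoff]
    right = [(a, y) for a, y in zip(assign, outcome) if a >= cutoff]
    b0_l, b1_l = fit_line([a for a, _ in left], [y for _, y in left])
    b0_r, b1_r = fit_line([a for a, _ in right], [y for _, y in right])
    return (b0_r + b1_r * cutoff) - (b0_l + b1_l * cutoff)

# Hypothetical noise-free data: slope 0.5 and a jump of 3 at the cutoff.
assign = list(range(20))
cutoff = 10
outcome = [2.0 + 0.5 * a + (3.0 if a >= cutoff else 0.0) for a in assign]

effect = sharp_rd_estimate(assign, outcome, cutoff)
print(effect)
```

The estimate is only identified at the cutoff, which is why the slides treat it as local rather than as an average over the whole treated range.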

Why Larger Standard Errors, less Power, in Simple RD? RD analysis requires measure of the treatment assignment (1/0) as treatment indicator, and of the assignment variable as selection control The two are correlated, one a binary measure totally located within the assignment variable Hence if any slope in the relationship of assignment variable and outcome, The treatment cutoff score and the assignment variable will be CO-LINEAR
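The co-linearity point can be made concrete with a toy calculation. The sketch below, under the assumption of an evenly spread assignment variable with the cutoff in the middle, computes the correlation between the treatment indicator and the assignment variable and the corresponding variance inflation factor, 1 / (1 - r^2). The setup is hypothetical and is not meant to reproduce the empirical standard-error ratios reported for the WSCs.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical assignment variable: 200 evenly spaced scores, cutoff at 0.
assign = [i / 100 for i in range(-100, 100)]
treat = [1.0 if a >= 0 else 0.0 for a in assign]

r = pearson(assign, treat)
vif = 1.0 / (1.0 - r ** 2)   # variance inflation for the treatment coefficient
print(round(r, 3), round(vif, 2))
```

In an RCT the treatment indicator is uncorrelated in expectation with any covariate, so the corresponding inflation factor would be close to 1.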

13 WSCs of RD vs RCT at the Cutoff Almost all causal results at cutoff are similar More so as N increases, and across parametric and non-parametric analyses Results in many different substantive fields No meta-analysis yet; no thorough examination of file drawer problem Results look promising for internal validity of RD in the crucible of practice. But SEs are about 3X larger for same sample size

Conclusion re Simple RD You can trust simple RD to give unbiased causal estimate at the cutoff despite great heterogeneity in how it is implemented It is less efficient than RCT by a factor of about 3 In practice, many RDs have larger sample sizes than a comparable RCT would, thus reducing the statistical power loss in practice.

Alas, Simple RD is very limited Less Statistical Power than RCT Functional Form or Bandwidth Assumption Lesser causal generalization – LATE vs ATE Is there a way to do RD better so that (a) support for extrapolation always needed; (b) more power, and especially (c) unbiased causal estimates in all the treated area and not just at the cutoff

Now we will examine Comparative Regression Discontinuity (CRD) Visually, what is it?

Posttest regression Pretest regression

Non-Equivalent Regression Function From Pretest From Non-Equivalent Comparison Group From Non-Equivalent Dependent variables Ludwig and Miller health results 5 years after Head Start – pretest health; local cohort too old for HS; health outcomes do or not affect little kids –pulmonary problems vs accidents How do we study CRD? Via form of WSC

Creating the synthetic RD from the RCT

Now Imagine Dropping the treated cases in the untreated part of the assignment variable Dropping the untreated cases in the treated part of the assignment variable. You are left with two groups instead of the four when the RCT is subdivided by cutoff score. This is the very simplest RD.
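The dropping step described on this slide can be sketched as a simple filter. The record layout (`assign`, `treated`) and the choice that the treated side lies at or above the cutoff are assumptions for illustration.

```python
def synthetic_rd(cases, cutoff):
    """Build a sharp-RD sample from RCT cases.

    Keeps treated cases at or above the cutoff and untreated cases below it;
    the other two cells of the RCT-by-cutoff table are dropped.
    """
    kept = []
    for case in cases:
        on_treated_side = case["assign"] >= cutoff
        if case["treated"] == on_treated_side:
            kept.append(case)
    return kept

# Hypothetical RCT cases: one per cell of the 2 x 2 (treatment x cutoff side).
rct_cases = [
    {"assign": 65, "treated": True},    # treated, untreated side: dropped
    {"assign": 65, "treated": False},   # untreated, untreated side: kept
    {"assign": 75, "treated": True},    # treated, treated side: kept
    {"assign": 75, "treated": False},   # untreated, treated side: dropped
]

rd_cases = synthetic_rd(rct_cases, cutoff=70)
print(rd_cases)
```

The dropped cells stay available in the RCT, which is what lets the RD estimate be checked against the experimental benchmark on the same population.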

Posttest regression Pretest regression

Walk you through three examples, testing How well does CRD do relative to both RCT and simple RD with respect to: Functional form estimation Statistical power Bias in all Area away from Cutoff

Example One Wing & Cook (JAPAM; 2013)

Case 1: Cash and Counseling Demonstration: RCT T=having control over Medicaid funds for disability services vs Business as usual = Medicaid selecting providers of service DV = total expenditures for disability services cos Medicaid dispenses less than the allotment usually Question: DO families spend more when they have control over funds?

CRD Design Specifics Set an assignment variable = age Set cutoff: 35, 50, and 70 – use only 70 here cos of age distribution Comparison function is pretest spending RDD analysis both parametric and non-para (LLR) – report only LLR here 3 States: NJ, Ark, Fla. About 1,000 cases per site

Research questions again: If you add pretest RDD function as in this example, do you Have more confidence in functional form? -- How comparable are the 3 untreated regression segments? Have lower standard errors, how close to RCT Get causal estimates for whole age range of treated from 70 to 90+ and not just at 70? Begin with support

What about standard errors of estimates away from cutoff? For cutoff at age 70, higher than RCT by 1.3 across all the area away from the cutoff At the cutoff, smaller than the RD across all comparisons of RD and CRD-Pre at age 70 (and other ages, too).

What about Bias?: Comparisons
State        Estimation   Cut-Off   Bias at Cut-Off: Post-Test Only   Bias at Cut-Off: Pre-Test Design   Bias Above Cut-Off: Pre-Test Design
Arkansas     LLR          70        -0.06                             0.07                               0.04
New Jersey   LLR          70        0.01                              0.08                               0.12
Florida      LLR          70        0.08                              -0.02                              -0.04

Example 2: Effects of Head Start – Tang & Cook (2014) Random selection of HS centers (89% agree) followed by random assignment within centers of 3 year olds Outcome = math, literacy; social behavior CRD-Pre has pretest as no-treatment regression function, as Wing and Cook CRD-CG has non-equivalent group of 4 year olds from same locations, not in W & C

Two Forms of CRD tested CRD-Pre – Supplement the basic RD design with pretest scores of the same individuals CRD-CG – Supplement the basic RD design with a non-equivalent comparison group. – Two different cutoff scores for replication – a test score is one and date of testing is other

Sample sizes RCT is 2326 RD with IRT-generated PPVT as the assignment variable: 1163 CRD-Pre: 1163 subjects with 2326 observations RD with date of assessment as the assignment variable: 1045 CRD-CG: 1856 subjects (observations)

What about support? 3 untreated segments of CRD-Pre – CRD-CG similar

Results: Precision of CRD-Pre

Results: Precision of CRD-CG

Results: bias of CRD-Pre above the cutoff

Results: bias of CRD-CG above the cutoff

Summary: CRD-Pre above the cutoff

Summary: CRD-CG above the cutoff

Case 3: Stress Test Effects of Training Kisbu-Sakarya, Tang & Cook

Shadish, Clark & Steiner (2008): N = 445 undergraduate psychology students, randomly assigned to either (1) Randomized Experiment (N = 235), then randomly assigned to Mathematics Training (N = 119) or Vocabulary Training (N = 116); or (2) Observational Study (N = 210), then self-selected into Mathematics Training (N = 79) or Vocabulary Training (N = 131). ATE = ?

“Stress Test” due to Modest Ns N for RCT is 235 N for basic RD and CRD-Pre is 123 for math and 112 for vocabulary N for CRD-CG is 254 for the math outcome (123+131) and 191 for vocab (112+79). These are small sample sizes for regression techniques with individual data

Support for Regression Assumption: CRD-Pre math outcome

Support for Regression Assumption: CRD-CG math outcome: Lowess

Above cutoff for math

Support for Regression Assumption: CRD-Pre vocabulary outcome

Support for Regression Assumption: CRD-CG vocabulary outcome: Lowess

Above cutoff for vocab

SEs: At cutoff for vocab

Overall Conclusions about CRD With either CRD-Pre or CRD-CG, the added functional form can help if the untreated functional forms are parallel-ish and if sample size large enough for reasonable stability. The addition will: Increase confidence in functional form extrapolation Increase power relative to RD and close to that of RCT Lead to unbiased causal inference at the cutoff AND ALSO AWAY FROM IT. CRD shrinks the advantages of RCT, but without entirely eliminating them
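One simple way to read the "parallel-ish" condition is as a constant vertical offset between the pretest and posttest functions in the untreated region. The sketch below uses that reading for CRD-Pre: it estimates the offset below the cutoff and carries it above the cutoff to form a counterfactual. This is an assumption-laden shortcut with hypothetical names and data, not the parametric or local linear analyses used in the studies cited.

```python
from statistics import mean

def crd_pre_effect_above_cutoff(cases, cutoff):
    """Average effect above the cutoff under a constant-offset assumption.

    Each case needs 'assign', 'pre' (pretest outcome) and 'post' (posttest
    outcome). Cases with assign >= cutoff are treated.
    """
    untreated = [c for c in cases if c["assign"] < cutoff]
    treated = [c for c in cases if c["assign"] >= cutoff]
    # Offset between posttest and pretest functions where nobody is treated.
    offset = mean(c["post"] - c["pre"] for c in untreated)
    # Counterfactual posttest for a treated case is pre + offset.
    return mean(c["post"] - (c["pre"] + offset) for c in treated)

# Hypothetical noise-free data: parallel functions, true effect of 2.0.
cutoff = 70
cases = []
for a in range(60, 80):
    pre = 1.0 + 0.2 * a
    post = pre + 1.5 + (2.0 if a >= cutoff else 0.0)
    cases.append({"assign": a, "pre": pre, "post": post})

effect_above = crd_pre_effect_above_cutoff(cases, cutoff)
print(effect_above)
```

If the untreated functions are not parallel, the offset estimated below the cutoff no longer describes the counterfactual above it, so the support checks shown in the examples come first.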

Why do Simple RD? Why tolerate its disadvantages if they are so easily mitigated by a non-treated regression function that can be observed and will be even more feasible in “big data” era? Why is the design of choice not automatically some form of CRD rather than RD? Analog here to the development of RCT. How many posttest-only RCT designs in social science practice? Most have covariates at least

MORE PRETEST DATA POINTS: RCT VS. INTERRUPTED TIME SERIES (ITS) AND ESPECIALLY COMPARATIVE INTERRUPTED TIME SERIES (CITS)

Interrupted Time Series Can Provide Strong Evidence for Causal Effects Clear Intervention Time Point Huge and Immediate Effect Clear Pretest Functional Form + many Observations No Alternative at Intervention Can Explain Change

Limitations of Simple One-Group ITS History, around the intervention point Instrumentation Stat Regression Functional form extrapolation needed Analysis has to account for correlated errors (we will not deal with this issue here) Suggest the advisability of a comparative ITS

WSCs on Simple ITS All except one done by Fretheim. Now almost a dozen datasets comparing RCT and ITS Inconsistency in ability to recreate RCT results Why? Inherent weakness of design? Let’s look at most feasible alternative.

[Figure: Hypothetical NCLB effects on NAEP test scores over time, public (red) versus private schools (blue)]

WSC and CITS Six studies in medicine, four in education, one in environmental sciences All claim causal inferences similar No meta-analysis to date No analysis of file drawer problem Remarkable cos the internal validity threats of differential history, instrumentation and regression could have operated but did not
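A minimal sketch of the comparative logic: fit each group's pre-intervention trend, project it into the post period, and take the treated group's average deviation from its projection minus the comparison group's. The linear pre-trend, the function names, and the data are assumptions for illustration, and the sketch ignores correlated errors.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def post_deviation(times, values, start):
    """Mean post-period deviation from the projected pre-period trend."""
    pre = [(t, v) for t, v in zip(times, values) if t < start]
    post = [(t, v) for t, v in zip(times, values) if t >= start]
    intercept, slope = fit_line([t for t, _ in pre], [v for _, v in pre])
    return sum(v - (intercept + slope * t) for t, v in post) / len(post)

def cits_estimate(times, treated, comparison, start):
    """Treated deviation from trend minus comparison deviation from trend."""
    return (post_deviation(times, treated, start)
            - post_deviation(times, comparison, start))

# Hypothetical series: both groups share a 0.5 history shock at time 6;
# only the treated group gets an extra 3.0 from the intervention.
times = list(range(10))
start = 6
comparison = [200.0 + 1.0 * t + (0.5 if t >= start else 0.0) for t in times]
treated = [205.0 + 1.2 * t + (3.5 if t >= start else 0.0) for t in times]

effect = cits_estimate(times, treated, comparison, start)
print(effect)
```

Subtracting the comparison deviation removes the shared shock, which is how the comparison series addresses history effects that a one-group ITS cannot rule out.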

St. Clair, Cook, & Hallberg (2014) RCT: Study of Indiana’s system for feedback on student performance (schools as unit of assignment) Comparative ITS comparison groups – Basically all schools in the state – Matched schools in the state

Math (All schools)

Math: WSC Results

ELA (All Schools)

ELA: WSC Results

What about Matching C to T Units? We can match C to T units, though this entails some case loss. Then no need to assume functional form is correct Same results with matching Somers et al. got the same results The environmental science study replicated the RCT only with matching Matching is the safest analysis unless sure of functional form (FF)

CITS Summary To date, CITS does well relative to RCT Matching is the most consistent to date Models with the correct functional form do well; and one can observe the functional form Similar effects despite possible group differences in (a) pre-treatment trend,(b) historical events at treatment; (c) changes in instrument; (d) stat regression– have never been confounds

Less Elaborate QEs - NECGDs NO known selection process and no pretest time trends Probably the bulk of all current QEs, but will change with bigger data towards CITS Within currently dominant practice, trick is: (a) To reduce the size of initial difference through how the comparison case is sampled or comparison cases are sampled - overlap maximization; and then to reduce remaining selection bias through (b) choice of covariates and (c) mode of data analysis – most action with (b) and (c), though (a) likely more important, (b) next and (c) quite trivial.

NEXT SECTION Non-Equivalent Control Group Designs without RD or pretest time series This is a matter of How to select comparison population so as to reduce the initial group non-equivalence How to select covariates so as to reduce selection How to analyze the data

Flavor of Two Positions Rubin: Study the process of selection into treatment in one or many of many different ways and use this to select covariates. Heckman and his students – choose local comparisons, choose pretest measure of outcome, choose “rich” collection of other covariates

1. SELECTING NON-EQUIVALENT COMPARISON GROUPS TO REDUCE INITIAL NON-EQUIVALENCE

The Trick with most QEs is To select an intact C group as similar to T as possible to minimize selection difference thru sampling. Contrast is with making them seem similar through individual case matching To use covariates in analysis that reduce any selection difference still remaining. This is where propensity scores, ANCOVA come in.

What does Local “Mean”? Identical twins, non-identical, sibs, cousins Same grade cohort in schools, birth cohort Schools in same district vs other Job training sites in same local labor market Towns at border of different states vs all state More local the better since it matches on more unobservables as well as observables

Local intact comparison groups Past empirical research in Cook et al. (2008) shows 3 cases in different fields where local choice eliminated all bias. Two more WSCs since, and two others earlier with same result. But some counter-cases in job training. Always reduces bias but DOES NOT ALWAYS ELIMINATE IT Problem is: Not all local matches are good How can we take advantage of its bias-reduction qualities without bias elimination? Come back to this later after discussing covariate choice

2. GIVEN AN OBSERVED PRETEST DIFFERENCE BETWEEN TREATMENT AND CONTROLS, HOW TO MODEL (A) STRONGLY SUSPECTED SELECTION PROCESS

Statistical Theory Knowing selection and measuring it perfectly gives unbiased causal inference BUT rarely know it fully – RDD exception Yet we often know major selection elements: why retained in grade; why self-select into divorce; why use emergency rooms? How to make selection process better known? Here’s one example – why students self-select into learning English or math

Strongly suspected selection process. Shadish, Clark & Steiner (2008): N = 445 undergraduate psychology students, randomly assigned to either (1) Randomized Experiment (N = 235), then randomly assigned to Mathematics Training (N = 119) or Vocabulary Training (N = 116); or (2) Observational Study (N = 210), then self-selected into Mathematics Training (N = 79) or Vocabulary Training (N = 131). ATE = ?

23 Constructs and 5 Construct Domains assessed prior to Intervention Proxy-pretests (2 multi-item constructs): 36-item Vocabulary Test II, 15-item Arithmetic Aptitude Test Prior academic achievement (3 multi-item constructs): High school GPA, current college GPA, ACT college admission score Topic preference (6 multi-item constructs): Liking literature, liking mathematics, preferring mathematics over literature, number of prior mathematics courses, major field of study (math-intensive or not), 25-item mathematics anxiety scale

Construct Domains Psychological predisposition (6 multi-item constructs): Big five personality factors (50 items on extroversion, emotional stability, agreeableness, openness to experience, conscientiousness), Short Beck Depression Inventory (13 items) Demographics (5 single-item constructs): Student‘s age, sex, race (Caucasian, Afro-American, Hispanic), marital status, credit hours

Was there Bias in the QE with Self-Selection into Tracks? RCT showed effects for each outcome. Both math and vocab effects larger than in RCT when there was self-selection into T versus C – thus, bias in QE. Our question is: How much of self-selection bias is reduced by use of covariates measuring several different possible selection processes?
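The question of how much bias is reduced can be expressed as a percentage of the initial bias. The formula below is a common way to summarize this in within-study comparisons and is offered as an assumption about the metric, not as the exact computation behind the charts that follow; the numbers are invented.

```python
def percent_bias_reduction(rct_estimate, unadjusted_qe, adjusted_qe):
    """Share of the initial QE bias removed by covariate adjustment.

    Bias is measured against the RCT benchmark:
        initial   = unadjusted QE - RCT
        remaining = adjusted QE   - RCT
    Returns 100 * (1 - |remaining| / |initial|).
    """
    initial_bias = unadjusted_qe - rct_estimate
    remaining_bias = adjusted_qe - rct_estimate
    if initial_bias == 0:
        raise ValueError("no initial bias to reduce")
    return 100.0 * (1.0 - abs(remaining_bias) / abs(initial_bias))

# Hypothetical estimates, for illustration only.
reduction = percent_bias_reduction(rct_estimate=4.0,
                                   unadjusted_qe=5.0,
                                   adjusted_qe=4.25)
print(reduction)
```

A value above 100 is impossible under this definition, but a negative value can occur when adjustment moves the QE estimate further from the benchmark than it started.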

Bias Reduction: Construct Domains Mathematics

Bias Reduction: Single Constructs Mathematics

Bias Reduction: Construct Domains Vocabulary

Bias Reduction: Single Constructs Vocabulary

Given Initial Group Differences 1. Choice of covariates is crucial 2. Reliability counts, but secondary within bounds of 1 to .60. 3. Mode of analyzing covariates (OLS and PS matching) makes little difference, though PS preferred in theory 4. Replicated in Pohl et al. (2011)

2. GIVEN OBSERVED DIFFERENCE, HOW SPECIAL IS (B) PRETEST MEASURE OF STUDY OUTCOME FOR BIAS REDUCTION?

Claims about Pretest Claim that pretest is privileged for bias reduction; yet by itself did little for math in Shadish et al. In studies modeling the outcome only, pretest often the most highly correlated single variable But issue is correlation of pretest with selection into T Though we suspect selection on pretest to be frequent, we do not know how often and when Next WSC studies vary whether the pretest does or does not correlate with selection

Existing Empirical Evidence WSCs support privileging true pretest because it is better than others at reducing bias - Heckman Sometimes reduces all by itself -- Magnet school study (Bifulco, 2010) and earlier CITS studies here But it does not always reduce all bias – e.g., Shadish et al. and workforce development lit This study examines bias reduction due to pretest when we vary the correlation with selection both between and within studies

Between-Studies: Kindergarten Retention Hong and Raudenbush (2005; 2006) used rich covariates in ECLS-K to estimate the effect of kindergarten retention on math and reading Two prior waves Evidence of selection-maturation: Retained have lower mean and lower rate of change. Selection process largely known: past perf and teacher ratings –both available at 2 pretest times

Dataset 1: Correlation with Selection (correlation with retention in kindergarten)
Measure           Correlation   Lower Bound   Percent of Lower Bound
Reading Pretest   -0.185*       -0.38         48.7%
Math Pretest      -0.179*       -0.37         48.4%

Data set 1: Analytic Approach Broke 144 covariates into three groups: – One wave of pretest data (spring of K) – Two waves (fall and spring of K) – 140 other covariates Created propensity scores with each cov set and estimated reading and math effects Note: Bias reduction compared to benchmark model, not RCT!

Dataset 1: Math Results

Dataset 1: ELA Results

Dataset 2: Indiana Benchmark Assessment Study (Grade 5) 56 K-8 schools 5th graders randomly assigned to: – Treatment: state benchmark assess system (n=34) – Control schools: business as usual (n=22) – Outcomes: Math and ELA ISAT scores QE comparison group from all other schools in state serving 5th grade students (n = 681) Rich set of student and school covariates with multiple waves of pretest data

Dataset 2: Selection Schools selected into study cos interested in implementing the program Principals interviewed and cited – Taking advantage of free resource from the state – A commitment to data driven decision making – Knowledge of other schools implementing – No mention of participation due to school’s past academic performance – i.e., the pretest

Dataset 2: No Correlation with Selection (correlation with selection into Benchmark Assessment System)
Measure           Correlation
Reading Pretest   0.041
Math Pretest      -0.012

Dataset 2: Math Results

Dataset 2: ELA Results

Shadish et al. Correlation with Selection (correlation with selection into Vocabulary Training)
Measure           Correlation
Reading Pretest   0.169*
Math Pretest      -0.090

Dataset 3: ELA Results where Pretest and Selection correlate

Math Results where Pretest and Selection do not correlate

Summary of Pretest Results Cannot assume the pretest is always related to selection, even if it often is You should probably always include it But you are better guided by theoretical explication of all plausible selection processes Better supplementing it with more waves and other covariates.

2. GIVEN PRETEST DIFFERENCE, (C) WHAT HAPPENS IF THE SELECTION PROCESS IS NOT KNOWN BUT HAVE “RICH” SET OF COVARIATES?

Steiner, Cook & Li (2015) “Rich” covariates – more domains (presumptively independent constructs) and higher reliability (number of items assessing each construct) Theory = pick up increasingly more parts of the true but unknown selection process Two data sets – one with 156 covariates at one pretest and the other with 144 over two pretest time points. Each has reasonably known theory of selection; We identify it and then throw away those variables to ask: How well do the remaining covariates function collectively, though they are individually imperfect?

Remove effective single covariates Mathematics

All Covariates

Critical Covariates Removed

Conclusion: “Rich” Covariates w/o Independent Info on Selection Helps reduce some bias More so with more reliable assessments Within limits we imposed of 12 domains, still 40% of bias remaining If more domains, each of 5 items, who knows?

“Rich” Covariates Useful cos it increases chances of choosing the true selection variables But no guarantee If put together “rich” covariates, local comparison group choice and pretest (Heckman), each does mostly OK by self and the three together might be even better But an even better option is possible

Hybrid sampling model of Stuart and Rubin (2008) Define caliper for adequacy of a match Match all LOCAL Cs to T that fall within caliper For others, perform a match using a PS predicated on analysis of selection processes Result = mix of acceptably matched local Cs that control for more unobservables, and acceptably matched non-local Cs, but matched only on observables
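The two-stage matching rule can be sketched as follows. This is a simplified illustration under several assumptions: each unit carries a single precomputed `score` (for example a propensity score), "local" means sharing a `district` label, matching is one-to-one without replacement, and the fallback takes the nearest non-local unit with no caliper. None of these details are claimed to be those of Stuart and Rubin (2008).

```python
def nearest(target, pool):
    """Closest unit in pool by score, or None if the pool is empty."""
    return min(pool, key=lambda c: abs(c["score"] - target["score"]),
               default=None)

def hybrid_match(treated, comparisons, caliper):
    """Prefer a local match within the caliper; otherwise match non-locally.

    Returns a list of (treated_id, comparison_id, kind) tuples, where kind
    is 'local' or 'non-local'. Each comparison unit is used at most once.
    """
    unused = list(comparisons)
    matches = []
    for t in treated:
        local_pool = [c for c in unused if c["district"] == t["district"]]
        best, kind = nearest(t, local_pool), "local"
        if best is None or abs(best["score"] - t["score"]) > caliper:
            outside_pool = [c for c in unused
                            if c["district"] != t["district"]]
            best, kind = nearest(t, outside_pool), "non-local"
        if best is not None:
            unused.remove(best)
            matches.append((t["id"], best["id"], kind))
    return matches

# Hypothetical units, for illustration only.
treated = [
    {"id": "T1", "district": "A", "score": 0.50},
    {"id": "T2", "district": "B", "score": 0.80},
]
comparisons = [
    {"id": "C1", "district": "A", "score": 0.55},
    {"id": "C2", "district": "B", "score": 0.20},
    {"id": "C3", "district": "C", "score": 0.78},
]

matches = hybrid_match(treated, comparisons, caliper=0.10)
print(matches)
```

Here T1 finds an acceptable local match, while T2's only local candidate falls outside the caliper, so it is matched to a unit from another district on observables alone.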

Hallberg, Wong, & Cook (in press) This paper draws on a WSC to examine correspondence with the RCT benchmark (Indiana student feedback study) after matching – Within district as long as the schools do not differ by more than 0.75 standard deviations of the propensity score (Local) – For others match on observed school-level covariates known to be highly correlated with the outcome of interest (Focal) – Combine both T and C matched cases (Hybrid)

Performance of local, focal and hybrid matching across two dependent variables

Percentage of times observational approach performed best across 1000 replications

Summary Intact group matching increases overlap. Useful first stage in a QE design strategy? But have been counter-cases in job training We will see focal matching is no guarantee either, though we know when it is better Is this hybrid model best? Too early to tell. Need more studies of it

Conclusions re Weaker Designs than RD and ITS It is not just a matter of analysis. Minor It’s not just a matter of reliability of covariates It’s a matter of how you select intact comparison groups – local and hybrid Matter of how much you know about remaining selection bias Matter of correspondence between your covariates and knowledge of selection

Conclusions re Weaker Designs than RD and ITS Heckman’s Advice? Pretest, local comparisons and rich covariates – probably OK but not yet tested “Rich Covariates” alone - problematic? “Which variables are on hand” Disaster Demographics only – disaster Best = hybrid matching? If so, more needed on caliper choice for local part, focal part needs all the care needed when initial difference –

BIG PICTURE CONCLUSIONS RD is advisable, but CRD is much preferred to it, though its assumptions need to be checked CITS is advisable, but ITS is not For NECGDs, Rubin’s advice is helpful but not complete and sometimes impossible Heckman’s advice seems very likely to work Hybrid Model may be better than all the others but not clear yet. THINKING HELPS; JUST ANALYZING DOES NOT
