Sessions 22 & 23: (Second Best) Alternatives to Randomized Designs Mark W. Lipsey Vanderbilt University IES/NCER Summer Research Training Institute, 2008

Quasi-Experimental Designs to Discuss
Regression discontinuity
Nonrandomized comparison groups with statistical controls
–Analysis of covariance
–Matching
–Propensity scores

The Regression-Discontinuity (R-D) Design

Advantages of the R-D design?
When well-executed, its internal validity is strong – comparable to a randomized experiment.
It is adaptable to many circumstances where it may be difficult to apply a randomized design.

Think first of a pretest-posttest randomized field experiment

Posttest on pretest regression for randomized experiment (with no treatment effect)
[Figure: a common regression line for T and C relating posttest (Y) to pretest (S), with the pretest and posttest means marked, plus the corresponding regression equation (T: 1 = treatment, 0 = control)]
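
Written out in standard notation (the slide's own rendering of the equation did not survive transcription), the regression is

    Y_i = \beta_0 + \beta_1 S_i + \beta_2 T_i + \varepsilon_i

where Y_i is the posttest, S_i the pretest, T_i the treatment dummy (1 = treatment, 0 = control), and \beta_2 the treatment effect – zero in the null case pictured.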

Pretest-posttest randomized experiment (with treatment effect)
[Figure: T and C regression lines of posttest (Y) on pretest (S); the groups share a common pretest mean, and the vertical displacement Δ between the lines separates the T and C posttest means]

Regression discontinuity (with no treatment effect)
[Figure: posttest (Y) on selection variable (S); the T and C regression segments on either side of the cutting point fall on one continuous line, so there is no gap at the cutoff, and the T and C means differ only because the groups sit on different parts of the S scale]

Regression discontinuity (with treatment effect)
[Figure: posttest (Y) on selection variable (S); the regression line shifts by Δ at the cutting point, and that discontinuity is the treatment effect estimate]
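
A minimal simulation sketch of the estimate pictured above (not from the slides; all names and numbers are illustrative assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, cut, effect = 1000, 50.0, 5.0

S = rng.uniform(0, 100, n)                  # selection variable
T = (S >= cut).astype(float)                # assignment strictly by cutoff
Y = 20 + 0.4 * S + effect * T + rng.normal(0, 5, n)  # posttest

# Center S at the cutting point so the coefficient on T is the
# discontinuity at the cutoff, i.e., the treatment effect estimate.
X = sm.add_constant(np.column_stack([S - cut, T]))
print(sm.OLS(Y, X).fit().params)            # [intercept, slope, effect ~ 5]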

Regression discontinuity effect estimate compared with RCT estimate
[Figure: T and C regression lines of posttest (Y) on the selection variable (S), with the gap Δ at the cutting point juxtaposed against the corresponding randomized-experiment estimate]

Regression discontinuity scatterplot (null case)
[Figure: scatterplot of posttest (Y) against the selection variable (S), with T cases on one side of the cutting point and C cases on the other; no break in the point cloud at the cutoff]

Regression discontinuity scatterplot (Tx effect)
[Figure: scatterplot of posttest (Y) against the selection variable (S); the T side of the point cloud is displaced at the cutting point, showing the discontinuity]

The selection variable for R-D
A continuous quantitative variable measured on every candidate for assignment to T or C who will participate in the study
Assignment to T or C strictly on the basis of the score obtained and a predetermined cutting point
Does not have to correlate highly with the outcome variable (more power if it does)
Can be tailored to represent an appropriate basis for the assignment decision in the setting

Why does it work?
There is selection bias and nonequivalence between the T and C groups, but its source is perfectly specified by the cutting point variable and can be statistically modeled (think perfect propensity score).
Any difference between the T and C groups that might affect the outcome, whether known or unknown, has to be correlated with the cutting point variable and be “controlled” by it to the extent that it is related to the outcome (think perfect covariate).

R-D variants: Tie-breaker randomization
[Figure: two cutting points on the selection variable (S); cases beyond one point receive T, cases beyond the other receive C, and cases falling between the two points are randomly assigned]

R-D variants: Double cutting points
[Figure: two cutting points divide the selection variable (S) into three regions – T1 below the lower point, C between the points, and T2 above the upper point]

Special issues with the R-D design
Correctly fitting the functional form – possibility that it is not linear
–curvilinear functions
–interaction with the cutting point
–consider short, dense regression lines
Statistical power
–sample size requirements relative to RCT
–when covariates are helpful

Lines fit to curvilinear function
[Figure: straight regression lines fit separately to T and C data generated by a curvilinear function produce a false discontinuity at the cutting point]

R-D effect estimate with an interaction compared with RCT estimate
[Figure: T and C regression lines with different slopes, i.e., a treatment-by-S interaction; with S centered at the cutting point, the effect estimate is the gap between the lines at the cutoff]

Modeling the functional form
Visual inspection of scatterplots with candidate functions superimposed is important
If possible, establish the functional form on data observed prior to implementation of treatment, e.g., pretest and posttest archival data for a prior school year
Reverse stepwise modeling – fit higher-order functions and successively drop those that are not needed (sketched below)
Use regression diagnostics – R², goodness-of-fit indicators, distribution of residuals
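
Continuing the simulated example above (Y, T, S, and cut as defined there), a sketch of the reverse stepwise approach:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"Y": Y, "T": T, "Sc": S - cut})  # Sc centered at cutoff

# Start flexible: quadratic terms and their interactions with treatment.
full = smf.ols("Y ~ T + Sc + I(Sc**2) + T:Sc + T:I(Sc**2)", data=df).fit()

# Reverse stepwise: refit after dropping the higher-order terms whose
# coefficients are not reliably different from zero.
reduced = smf.ols("Y ~ T + Sc + T:Sc", data=df).fit()
linear = smf.ols("Y ~ T + Sc", data=df).fit()

for m in (full, reduced, linear):
    print(round(m.rsquared, 3))  # fit barely changes when dropped terms
                                 # were unneeded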

Short dense regressions for R-D
[Figure: regression lines fit to short, dense segments of data on either side of the cutting point]

Statistical power
Typically requires about 3 times as many participants as a comparable RCT (formalized below)
Lower when the correlation between the cutting point continuum and treatment variable is large
Higher when the correlation between the cutting point continuum and the outcome variable is large
Improved by adding covariates correlated with the outcome but not the cutting point continuum
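
These power properties can be summarized by a standard variance-inflation result (not spelled out on the slide; usually attributed to Goldberger):

    \frac{\operatorname{Var}(\hat{\beta}_{RD})}{\operatorname{Var}(\hat{\beta}_{RCT})} \approx \frac{1}{1 - \rho_{TS}^{2}}

where \rho_{TS} is the correlation between the treatment indicator and the selection variable. For a normally distributed S cut at its mean, \rho_{TS}^{2} \approx 2/\pi, an inflation factor of roughly 2.75 – the source of the "about 3 times" rule above.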

Example: Effects of pre-k
W. T. Gormley, T. Gayer, D. Phillips, & B. Dawson (2005). The effects of universal pre-k on cognitive development. Developmental Psychology, 41(6).

Study overview
Universal pre-k for four-year-old children in Oklahoma
Eligibility for pre-k determined strictly on the basis of age – cutoff by birthday
Overall sample of 1,567 children just beginning pre-k plus 1,461 children just beginning kindergarten who had been in pre-k the previous year
Woodcock-Johnson (WJ) Letter-Word, Spelling, and Applied Problems tests as outcome variables

Samples and testing
[Diagram: two cohorts across Year 1 and Year 2; the first cohort is tested with the WJ battery at the beginning of kindergarten, after its pre-k year, while the second cohort is tested at the same time, at the beginning of its pre-k year]

Entry into Pre-K Selected by Birthday
[Figure: WJ test score plotted against age; children born before September 1 (T) completed pre-K and are tested at the beginning of kindergarten, children born after September 1 (C) have not yet had pre-K and are tested at the beginning of the pre-K year; the gap at the birthday cutoff is the effect estimate]

Excerpts from Regression Analysis
(B coefficients; — = coefficient not legible in the transcript)

Variable                    Letter-Word   Spelling   Applied Probs
Treatment (T)                   3.00*        1.86*        1.94*
Age: days ± from Sept 1           —            —           .02*
Days²                             —            —           .00
Days × T                          —            —            —
Days² × T                         —            —           .00
Free lunch                     -1.28*        -.89*       -1.38*
Black                             —            —          -2.34*
Hispanic                       -1.70*        -.48*       -3.66*
Female                           .92*        1.05*         .76*
Mother's educ: HS                .59*         .57*        1.25*

* p < .05

Selected references on R-D
Shadish, W., Cook, T., & Campbell, D. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.
Mohr, L. B. (1988). Impact analysis for program evaluation. Chicago: Dorsey.
Hahn, J., Todd, P., & Van der Klaauw, W. (2001). Identification and estimation of treatment effects with a regression-discontinuity design. Econometrica, 69(1).
Cappelleri, J. C., & Trochim, W. (2000). Cutoff designs. In S.-C. Chow (Ed.), Encyclopedia of Biopharmaceutical Statistics. NY: Marcel Dekker.
Cappelleri, J., Darlington, R. B., & Trochim, W. (1994). Power analysis of cutoff-based randomized clinical trials. Evaluation Review, 18.
Jacob, B. A., & Lefgren, L. (2002). Remedial education and student achievement: A regression-discontinuity analysis. Working Paper 8919, National Bureau of Economic Research.
Kane, T. J. (2003). A quasi-experimental estimate of the impact of financial aid on college-going. Working Paper 9703, National Bureau of Economic Research.

Nonrandomized Comparison Groups with Statistical Controls
ANCOVA/OLS statistical controls
Matching
Propensity scores

Nonequivalent comparison analog to the completely randomized design
Individuals are selected into treatment and control conditions through some nonrandom, more-or-less natural process.
[Table: individuals 1 … n listed under Treatment and Comparison columns]

Nonequivalent comparison analog to the randomized block design
[Table: blocks 1 … m as columns; within each block, Treatment holds individuals 1 … n and Comparison holds individuals n+1 … 2n]

The nonequivalent comparison analog to the hierarchical design
[Table: blocks 1 … m nested under Treatment and blocks m+1 … 2m nested under Comparison, with individuals 1 … n within each block]

Issues for obtaining good treatment effect estimates from nonrandomized comparison groups
The fundamental problem: selection bias
Knowing/measuring the variables necessary and sufficient to statistically control the selection bias
–characteristics on which the groups differ that are related to the outcome
–relevant characteristics not highly correlated with other characteristics already accounted for
Using an analysis model that properly adjusts for the selection bias, given appropriate control variables

Nonrandomized comparisons of possible interest
Nonequivalent comparison/control group for estimating treatment effects
Attrition analysis – comparing leavers and stayers, adjusting for differential attrition
Treatment-on-the-treated (TOT) analysis – estimating treatment effects on those who actually received treatment

Nonequivalent comparison groups: Pretest/covariate and posttest means
[Figure: T and C regression lines of posttest (Y) on pretest/covariate(s) (X), showing the difference in posttest means and the difference in pretest/covariate means between the groups]

Nonequivalent comparison groups: Covariate-adjusted treatment effect estimate
[Figure: T and C regression lines of posttest (Y) on pretest/covariate(s) (X); the vertical gap Δ between the lines is the covariate-adjusted treatment effect estimate]
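
A minimal ANCOVA sketch of the adjusted estimate pictured above, with simulated nonequivalent groups (all numbers are illustrative assumptions):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 500
X = rng.normal(0, 1, n)                          # pretest/covariate
T = (X + rng.normal(0, 1, n) > 0).astype(float)  # nonrandom selection on X
Y = 2.0 * X + 3.0 * T + rng.normal(0, 1, n)      # true effect = 3

design = sm.add_constant(np.column_stack([X, T]))
print(sm.OLS(Y, design).fit().params[2])         # adjusted estimate, ~3

Here selection depends only on a perfectly measured covariate, so the adjustment recovers the true effect; the slides that follow show how this breaks down.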

Covariate-adjusted treatment effect estimate with a relevant covariate left out
[Figure: with a relevant covariate omitted, the adjusted gap Δ between the T and C lines misstates the treatment effect]

Nonequivalent comparison groups: Unreliability in the covariate
[Figure: measurement error in the pretest/covariate (X) flattens the within-group regressions of posttest (Y) on X, biasing the covariate-adjusted difference between T and C]
Note: Will not always under-estimate, may over-estimate
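
Extending the ANCOVA sketch above, a hedged illustration of this point (all numbers are assumptions; the bias direction depends on the direction of group nonequivalence):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, true_effect, reliability = 5000, 0.0, 0.6

x_true = rng.normal(0, 1, n)
T = (x_true + rng.normal(0, 1, n) > 0).astype(float)  # selection on x_true
Y = 1.0 * x_true + true_effect * T + rng.normal(0, 1, n)

# Observed covariate = true covariate + measurement error, scaled so
# that var(true)/var(observed) equals the chosen reliability.
err_sd = np.sqrt((1 - reliability) / reliability)
x_obs = x_true + rng.normal(0, err_sd, n)

for x in (x_true, x_obs):
    b = sm.OLS(Y, sm.add_constant(np.column_stack([x, T]))).fit().params[2]
    print(round(b, 3))  # ~0 with the true covariate; biased with x_obs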

Using control variables via matching
Groupwise matching: select the comparison group to be groupwise similar to the treatment group, e.g., schools with similar demographics, geography, etc. Generally a good idea.
Individual matching: select individuals from the potential control pool that match treatment individuals on one or more observed characteristics. May not be a good idea.

Potential problems with individual-level matching
Basic problem with nonequivalent designs – the need to match on all relevant variables to obtain a good treatment effect estimate. If you match on too few variables, you may omit some that are important to control.
If you try to match on too many variables, the sample will be restricted to the cases that can be matched and may be overly narrow.
If you must select disproportionately from one tail of the treatment distribution and the other tail of the control distribution, you may have a regression-to-the-mean artifact.

Regression to the mean: Matching on the pretest
[Figure: overlapping T and C pretest distributions; matches can be found only in the region of overlap, drawing disproportionately from opposite tails of the two distributions]

Propensity scores
What is a propensity score (Rosenbaum & Rubin, 1983)?
The probability (or propensity) of being in the treatment group instead of the comparison group (or stayers vs. leavers, treated vs. untreated)
Estimated (“predicted”) from data on the characteristics of individuals in the sample
As a probability, it ranges from 0 to 1
Is calculated for each member of a sample
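
Formally, in Rosenbaum and Rubin's notation, the propensity score for an individual with observed covariates X_i = x is

    e(x) = \Pr(T_i = 1 \mid X_i = x)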

Computing the propensity score
First estimate T_i = f(S_1i, S_2i, S_3i, …, S_ki) for all T and C members in the sample
–Logistic regression typically used; other methods include probit regression and classification trees
–All relevant characteristics assumed to be included among the predictor variables (covariates)
E.g., fit the logistic regression logit[Pr(T_i = 1)] = B_1 S_1i + B_2 S_2i + … + B_k S_ki + e_i
Compute propensity_i as the predicted probability from that model, i.e., the logistic transform of B_1 S_1i + B_2 S_2i + … + B_k S_ki
The propensity score thus combines all the information from the covariates into a single variable optimized to distinguish the characteristics of the T sample from those of the C sample (a worked sketch follows below)
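
A sketch of this computation with simulated data (covariate names, effect sizes, and the selection process are all illustrative assumptions):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({
    "s1": rng.normal(size=n),          # hypothetical covariates
    "s2": rng.normal(size=n),
    "s3": rng.binomial(1, 0.5, n),
})
# Nonrandom selection into treatment that depends on the covariates.
df["T"] = (0.8 * df["s1"] - 0.5 * df["s3"] + rng.normal(size=n) > 0).astype(int)
df["Y"] = df["s1"] + 2.0 * df["T"] + rng.normal(size=n)  # true effect = 2

# Logistic regression of treatment status on the covariates; the
# propensity score is the predicted probability, a value in [0, 1].
covs = sm.add_constant(df[["s1", "s2", "s3"]])
df["pscore"] = sm.Logit(df["T"], covs).fit(disp=False).predict(covs)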

What to do with the propensity score
Determine how similar the treatment and control groups are on the composite of variables in the propensity score; decide if one or both need to be trimmed to create a better overall match.
Use the propensity score to select T and C cases that match (a matching sketch follows below).
Use the propensity score as a statistical control variable, e.g., in an ANCOVA.
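
Continuing the example above, a deliberately simple sketch of 1:1 nearest-neighbor matching on the propensity score, without replacement:

treated = df[df["T"] == 1]
pool = df[df["T"] == 0].copy()

pair_diffs = []
for _, row in treated.iterrows():
    j = (pool["pscore"] - row["pscore"]).abs().idxmin()  # closest control
    pair_diffs.append(row["Y"] - pool.loc[j, "Y"])
    pool = pool.drop(j)                   # each control used at most once

print(sum(pair_diffs) / len(pair_diffs))  # matched-pairs estimate, ~2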

Propensity score distribution before trimming (example from Hill pre-K Study)

                             25th    50th    75th    Mean
Comparison group (n=1,144)     —       —     0.53    0.39
Treatment group (n=908)        —       —     0.60    0.50

(— = value not legible in the transcript)

Propensity score distribution after trimming (example from Hill pre-K Study)

                             25th    50th    75th    Mean
Treatment group (n=908)        —       —     0.60    0.50
Comparison group (n=908)       —       —     0.56    0.46

(— = value not legible in the transcript)

Estimate the treatment effect, e.g., by differences between matched strata
[Figure: treatment and control groups divided into propensity score quintiles, with matches formed between corresponding strata]
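
Continuing the same example, a sketch of the stratified estimate pictured above: form propensity score quintiles, take the T-C difference within each stratum, and average, weighting by stratum size:

import pandas as pd

df["stratum"] = pd.qcut(df["pscore"], 5, labels=False)

effects, weights = [], []
for _, g in df.groupby("stratum"):
    t, c = g[g["T"] == 1]["Y"], g[g["T"] == 0]["Y"]
    if len(t) and len(c):                 # need both groups in the stratum
        effects.append(t.mean() - c.mean())
        weights.append(len(g))

print(sum(e * w for e, w in zip(effects, weights)) / sum(weights))  # ~2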

Estimate the treatment effect, e.g., by using the propensity score as a covariate
[Figure: T and C regression lines of posttest (Y) on the propensity score (P); the vertical gap Δ between the lines is the treatment effect estimate]

Discouraging evidence about the validity of treatment effect estimates
The relatively few studies with head-to-head comparisons of nonequivalent comparison groups with statistical controls and randomized controls show very uneven results – some cases where treatment effects are comparable, many where they are not.
Failure to include all the relevant covariates appears to be the main problem.

Selected references on nonequivalent comparison designs
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1).
Luellen, J. K., Shadish, W. R., & Clark, M. H. (2005). Propensity scores: An introduction and experimental test. Evaluation Review, 29(6).
Schochet, P. Z., & Burghardt, J. (2007). Using propensity scoring to estimate program-related subgroup impacts in experimental program evaluations. Evaluation Review, 31(2).
Bloom, H. S., Michalopoulos, C., & Hill, C. J. (2005). Using experiments to assess nonexperimental comparison-group methods for measuring program effects. Chapter 5 in H. S. Bloom (Ed.), Learning more from social experiments: Evolving analytic approaches. NY: Russell Sage Foundation.
Agodini, R., & Dynarski, M. (2004). Are experiments the only option? A look at dropout prevention programs. The Review of Economics and Statistics, 86(1).
Wilde, E. T., & Hollister, R. (2002). How close is close enough? Testing nonexperimental estimates of impact against experimental estimates of impact with education test scores as outcomes. Institute for Research on Poverty Discussion Paper.
Glazerman, S., Levy, D. M., & Myers, D. (2003). Nonexperimental versus experimental estimates of earnings impacts. Annals of the AAPSS, 589.
Arceneaux, K., Gerber, A. S., & Green, D. P. (2006). Comparing experimental and matching methods using a large-scale voter mobilization experiment. Political Analysis, 14.