Sample Design for Group-Randomized Trials
Howard S. Bloom, Chief Social Scientist, MDRC
Prepared for the IES/NCER Summer Research Training Institute held at Vanderbilt University, June 17 – 28, 2007.

Today we will examine:
Sample size determinants
Precision requirements
Sample allocation
Covariate adjustments
Matching and blocking
Subgroup analyses
Generalizing findings for sites and blocks
Using two-level data for three-level situations

Part I: The Basics

Statistical properties of group-randomized impact estimators
Unbiased estimates:
Y_ij = α + B_0·T_j + e_j + ε_ij
E(b_0) = B_0
Less precise estimates:
VAR(ε_ij) = σ², VAR(e_j) = τ²
ρ = τ² / (τ² + σ²)
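The variance decomposition above can be illustrated with a short calculation. This is an illustrative sketch: the variance components τ² and σ² used below are hypothetical values, not figures from the slides.

```python
def intraclass_correlation(tau_sq, sigma_sq):
    """ICC (rho): the share of total outcome variance that lies
    between randomized groups rather than within them."""
    return tau_sq / (tau_sq + sigma_sq)

# Hypothetical variance components: tau^2 = 0.1 between schools,
# sigma^2 = 0.9 between students within schools.
rho = intraclass_correlation(0.1, 0.9)
print(round(rho, 2))  # 0.1
```

Even a ρ this small matters for design, because the between-group component does not shrink as more students per group are added.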

Design Effect (for a given total number of individuals), by intraclass correlation (ρ) and number of individuals per group (n). [Table values not legible in this transcript.]
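Although the tabulated values did not survive transcription, the design effect for a group-randomized sample follows the standard formula DE = 1 + (n − 1)ρ: the factor by which the variance of the impact estimate is inflated relative to randomizing the same number of individuals. A minimal sketch:

```python
def design_effect(n, rho):
    """Variance inflation from randomizing groups of size n with
    intraclass correlation rho, relative to individual randomization."""
    return 1 + (n - 1) * rho

# Even a small ICC inflates variance sharply when groups are large:
print(round(design_effect(60, 0.01), 2))  # 1.59
print(round(design_effect(60, 0.10), 2))  # 6.9
```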

Sample design parameters:
Number of randomized groups (J)
Number of individuals per randomized group (n)
Proportion of groups randomized to program status (P)

Reporting precision
A minimum detectable effect (MDE) is the smallest true effect that has a “good chance” of being found to be statistically significant.
We typically define an MDE as the smallest true effect that has 80 percent power for a two-tailed test of statistical significance at the 0.05 level.
An MDE is reported in natural units, whereas a minimum detectable effect size (MDES) is reported in units of standard deviations.

Minimum Detectable Effect Sizes for a Group-Randomized Design with α = 0.05 and no Covariates, by number of randomized groups (J) and individuals per group (n). [Table values not legible in this transcript.]

Implications for sample design
It is extremely important to randomize an adequate number of groups.
It is often far less important how many individuals per group you have.
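Both implications follow from the standard MDES approximation for a two-level design with no covariates, MDES ≈ M · sqrt( ρ/(P(1−P)J) + (1−ρ)/(P(1−P)Jn) ), where M ≈ 2.8 for a two-tailed test with power 0.80 and α = 0.05 (given 20 or more degrees of freedom). The sketch below uses a hypothetical ICC of 0.15 to show that adding groups helps far more than adding individuals per group:

```python
import math

def mdes(J, n, rho, P=0.5, M=2.8):
    """Minimum detectable effect size for a group-randomized design
    with no covariates; M is the degrees-of-freedom multiplier."""
    var = rho / (P * (1 - P) * J) + (1 - rho) / (P * (1 - P) * J * n)
    return M * math.sqrt(var)

# 20 schools x 500 students (10,000 students) vs.
# 100 schools x 20 students (2,000 students):
print(round(mdes(J=20, n=500, rho=0.15), 2))   # 0.49
print(round(mdes(J=100, n=20, rho=0.15), 2))   # 0.25
```

With an ICC of 0.15, the design with five times fewer students but five times more schools detects effects roughly half the size.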

Part II Determining required precision

When assessing how much precision is needed, always ask: “relative to what?”
Program benefits
Program costs
Existing outcome differences
Past program performance

Effect Size Gospel According to Cohen and Lipsey
Cohen (speculative): Small = 0.2σ, Medium = 0.5σ, Large = 0.8σ
Lipsey (empirical): Small = 0.15σ, Medium = 0.45σ, Large = 0.90σ

Five-year impacts of the Tennessee class-size experiment
Treatment: small versus regular class sizes
Effect sizes: 0.11σ to 0.22σ for reading and math
Findings are summarized from Nye, Barbara, Larry V. Hedges and Spyros Konstantopoulos (1999) “The Long-Term Effects of Small Classes: A Five-Year Follow-up of the Tennessee Class Size Experiment,” Educational Evaluation and Policy Analysis, 21(2).

Annual reading and math growth: effect sizes for annual growth in reading and math, by grade-to-grade transition (K through high school). [Table values not legible in this transcript.]
Based on work in progress using documentation on the national norming samples for the CAT5, SAT9, Terra Nova CTBS, Gates MacGinitie (for reading only), MAT8, Terra Nova CAT, and SAT10. 95% confidence intervals range in reading from ±0.03 to ±0.15 and in math from ±0.03 to ±0.22.

Performance gap between “average” (50th percentile) and “weak” (10th percentile) schools, by subject (reading and math) and grade, for Districts I – IV. [Table values not legible in this transcript.]
Source: District I outcomes are based on ITBS scaled scores, District II on SAT 9 scaled scores, District III on MAT NCE scores, and District IV on SAT 8 NCE scores.

Demographic performance gap in reading and math, Main NAEP scores: Black-White, Hispanic-White, Male-Female, and free/reduced-price-lunch eligible-ineligible gaps, by subject and grade. [Table values not legible in this transcript.]
Source: U.S. Department of Education, Institute of Education Sciences, National Center for Education Statistics, National Assessment of Educational Progress (NAEP), 2002 Reading Assessment and 2000 Mathematics Assessment.

ES Results from Randomized Studies: number of studies (n) and mean effect size, by achievement measure, for elementary school (standardized test, broad; standardized test, narrow; specialized topic/test), middle schools, and high schools. [Table values not legible in this transcript.]

Part III The ABCs of Sample Allocation

Sample allocation alternatives
Balanced allocation:
maximizes precision for a given sample size
maximizes robustness to distributional assumptions
Unbalanced allocation:
precision erodes slowly with imbalance for a given sample size
imbalance can facilitate a larger sample
imbalance can facilitate randomization

Variance relationships for the program and control groups
Equal variances: when the program does not affect the outcome variance.
Unequal variances: when the program does affect the outcome variance.

MDES for equal variances without covariates

How allocation affects MDES

Minimum Detectable Effect Size for Sample Allocations Given Equal Variances: ratio of the MDES for each allocation (from 0.5/0.5 through the most lopsided split shown) to that of the balanced allocation. [Most table values not legible in this transcript; the balanced 0.5/0.5 allocation has ratio 1.00, and the most lopsided allocation shown has ratio 1.67.]
* Example is for n = 20, J = 10, α = 0.05, a one-tail hypothesis test and no covariates.
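Given equal variances, the precision penalty from imbalance depends only on P(1 − P): relative to a balanced design with the same J and n, the MDES is inflated by sqrt(0.25 / (P(1 − P))). A minimal sketch (this holds the degrees-of-freedom multiplier fixed, so it slightly understates the penalty in very small samples):

```python
import math

def mdes_ratio_to_balanced(P):
    """MDES inflation factor for allocating proportion P of groups
    to treatment, relative to a balanced (P = 0.5) allocation,
    assuming equal outcome variances in the two arms."""
    return math.sqrt(0.25 / (P * (1 - P)))

for P in (0.5, 0.6, 0.7, 0.8, 0.9):
    print(P, round(mdes_ratio_to_balanced(P), 2))
```

The ratio stays near 1.0 for mild imbalance (1.02 at 0.6/0.4) but reaches 1.67 at a 0.9/0.1 split, which is why precision "erodes slowly" at first.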

Implications of unbalanced allocations with unequal variances

Implications Continued
The estimated standard error is unbiased:
when the allocation is balanced
when the variances are equal
The estimated standard error is biased upward when the larger sample has the larger variance.
The estimated standard error is biased downward when the larger sample has the smaller variance.

Interim Conclusions
Don’t use the equal-variance assumption for an unbalanced allocation with many degrees of freedom.
Use a balanced allocation when there are few degrees of freedom.

References
Gail, Mitchell H., Steven D. Mark, Raymond J. Carroll, Sylvan B. Green and David Pee (1996) “On Design Considerations and Randomization-Based Inferences for Community Intervention Trials,” Statistics in Medicine, 15: 1069 –
Bryk, Anthony S. and Stephen W. Raudenbush (1988) “Heterogeneity of Variance in Experimental Studies: A Challenge to Conventional Interpretations,” Psychological Bulletin, 104(3): 396 – 404.

Part IV Using Covariates to Reduce Sample Size

Basic ideas
Goal: reduce the number of clusters randomized.
Approach: reduce the standard error of the impact estimator by controlling for baseline covariates.
Alternative covariates:
Individual-level
Cluster-level
Pretests
Other characteristics

Impact Estimation with a Covariate
y_ij = α + B_0·T_j + β·X_j + γ·x_ij + e_j + ε_ij
where:
y_ij = the outcome for student i from school j
T_j = 1 for treatment schools and 0 for control schools
X_j = a covariate for school j
x_ij = a covariate for student i from school j
e_j = a random error term for school j
ε_ij = a random error term for student i from school j

Minimum Detectable Effect Size with a Covariate
MDES = M_(J-K) · sqrt( ρ(1 − R₂²) / (P(1 − P)J) + (1 − ρ)(1 − R₁²) / (P(1 − P)Jn) )
where:
MDES = minimum detectable effect size
M_(J-K) = a degrees-of-freedom multiplier¹
J = the total number of schools randomized
n = the number of students in a grade per school
P = the proportion of schools randomized to treatment
ρ = the unconditional intraclass correlation (without a covariate)
R₁² = the proportion of variance across individuals within schools (at level 1) predicted by the covariate
R₂² = the proportion of variance across schools (at level 2) predicted by the covariate
¹ For 20 or more degrees of freedom, M_(J-K) equals 2.8 for a two-tail test and 2.5 for a one-tail test with statistical power of 0.80 and statistical significance of 0.05.
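A minimal sketch of this formula in code; the values used for ρ, R₂², J, and n below are hypothetical illustrations, not estimates from the slides:

```python
import math

def mdes_with_covariate(J, n, rho, R1_sq=0.0, R2_sq=0.0, P=0.5, M=2.8):
    """MDES when a covariate predicts R2_sq of the school-level variance
    and R1_sq of the student-level variance; M is the df multiplier."""
    denom = P * (1 - P) * J
    var = rho * (1 - R2_sq) / denom + (1 - rho) * (1 - R1_sq) / (denom * n)
    return M * math.sqrt(var)

# A school-level pretest that predicts much of the between-school
# variance (R2_sq = 0.75 here, hypothetically) sharply cuts the MDES:
print(round(mdes_with_covariate(J=40, n=60, rho=0.15), 2))              # 0.36
print(round(mdes_with_covariate(J=40, n=60, rho=0.15, R2_sq=0.75), 2))  # 0.20
```

Because the school-level term usually dominates, a covariate's R₂² matters far more than its R₁², which is why school-level pretests do so well in the findings below.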

Questions Addressed Empirically about the Predictive Power of Covariates
School-level vs. student-level pretests
Earlier vs. later follow-up years
Reading vs. math
Elementary vs. middle vs. high school
All schools vs. low-income schools vs. low-performing schools

Empirical Analysis
Estimate ρ, R₂² and R₁² from data on thousands of students from hundreds of schools, during multiple years, at five urban school districts.
Summarize these estimates for reading and math in grades 3, 5, 8 and 10.
Compute implications for minimum detectable effect sizes.

Estimated Parameters for Reading with a School-level Pretest Lagged One Year: ρ and R₂², by grade (3, 5, 8, and 10), for school districts A – E. [Table values not reliably legible in this transcript.]

Minimum Detectable Effect Sizes for Reading with a School-Level Pretest (Y₋₁) or a Student-Level Pretest (y₋₁) Lagged One Year, by grade (3, 5, 8, and 10) and by number of schools randomized, with and without a covariate. [Table values not legible in this transcript.]

Key Findings
Using a pretest improves precision dramatically.
This improvement increases appreciably from elementary school to middle school to high school because R₂² increases.
School-level pretests produce as much precision as do student-level pretests.
The effect of a pretest declines somewhat as the time between it and the post-test increases.
Adding a second pretest increases precision slightly.
Using a pretest for a different subject increases precision substantially.
Narrowing the sample to schools that are similar to each other does not improve precision beyond that achieved by a pretest.

Source: Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2007) “Using Covariates to Improve Precision for Studies that Randomize Schools to Evaluate Educational Interventions,” Educational Evaluation and Policy Analysis, 29(1): 30 – 59.

Part V The Putative Power of Pairing A Tail of Two Tradeoffs (“It was the best of techniques. It was the worst of techniques.” Who the dickens said that?)

Pairing
Why match pairs?
for face validity
for precision
How to match pairs?
rank-order clusters by the covariate
pair clusters in the rank-ordered list
randomize clusters in each pair

When to pair?
When the gain in predictive power outweighs the loss of degrees of freedom.
Degrees of freedom:
J − 2 without pairing
J/2 − 1 with pairing

Deriving the Minimum Required Predictive Power of Pairing: MDES without pairing, MDES with pairing, and the breakeven R². [The derivation’s equations are not legible in this transcript.]
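The breakeven logic can be sketched as follows. Without pairing the MDES is proportional to the multiplier M with J − 2 degrees of freedom; with pairing it is proportional to the multiplier with J/2 − 1 degrees of freedom times sqrt(1 − R²), where R² is the share of variance in cluster means predicted by pairing. Setting the two equal gives the breakeven R²_min = 1 − (M_unpaired / M_paired)². A sketch under these assumptions (the multiplier values passed in below are hypothetical, not taken from the slides):

```python
def breakeven_r_squared(M_unpaired, M_paired):
    """Minimum R^2 at which pairing improves the MDES, given the
    degrees-of-freedom multipliers without pairing (df = J - 2) and
    with pairing (df = J/2 - 1); M_paired >= M_unpaired because
    pairing halves the degrees of freedom."""
    return 1 - (M_unpaired / M_paired) ** 2

# Many clusters: multipliers nearly equal, so a small R^2 suffices.
print(round(breakeven_r_squared(2.9, 3.1), 3))  # 0.125
# Few clusters: multipliers diverge, so pairing must predict a lot.
print(round(breakeven_r_squared(3.0, 4.0), 3))  # 0.438
```

This is the tradeoff in the slide title: pairing buys predictive power at the price of degrees of freedom, and the price is steep only when J is small.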

The Minimum Required Predictive Power of Pairing: required predictive power (R²_min), by number of randomized clusters (J). [Table values not legible in this transcript.]
* For a two-tail test.

A few key points about blocking
Blocking for face validity vs. blocking for precision
Treating blocks as fixed effects vs. random effects
Defining blocks using baseline information

Part VI Subgroup Analyses: Learning from Diversity

Purposes
To assess generalizability through description (by exploring how impacts vary)
To enhance generalizability through explanation (by exploring what predicts impact variation)

Considerations
Research protocol: maximize ex ante specification through theory and thought to minimize ex post data mining.
Assessment criteria:
Internal validity
Precision
Defining features:
Program characteristics
Randomized-group characteristics
Individual characteristics

Defining Subgroups by the Characteristics of Programs
Based only on program features that were randomized.
Thus one cannot use implementation quality.

Defining Subgroups by Characteristics of Randomized Groups
Types of impacts:
Net impacts
Differential impacts
Internal validity:
only use pre-existing characteristics
Precision:
Net impact estimates are limited by the reduced number of randomized groups.
Differential impact estimates are triply limited (and often need four times as many randomized groups).

Defining Subgroups by Characteristics of Individuals
Types of impacts:
Net impacts
Differential impacts
Internal validity:
Only use pre-existing characteristics.
Only use subgroups with sample members from all randomized groups.
Precision:
For net impacts: can be almost as good as for the full sample.
For differential impacts: can be even better than for the full sample.

Part VII Generalizing Results from Multiple Sites and Blocks

Fixed vs. Random Effects Inference
Known vs. unknown populations
Broader vs. narrower inferences
Weaker vs. stronger precision
Few vs. many sites or blocks

Weighting Sites and Blocks
Implicitly, through a pooled regression
Explicitly, based on:
Number of schools
Number of students
Explicitly, based on precision:
Fixed effects
Random effects
Bottom line: the question addressed is what counts.
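Precision weighting under a fixed-effects view can be sketched as follows: each site's impact estimate is weighted by the inverse of its squared standard error, so precisely estimated sites count more toward the pooled impact. The site estimates below are hypothetical:

```python
def precision_weighted_impact(impacts, std_errors):
    """Fixed-effects pooled impact: weight each site's estimate
    by 1 / SE^2 (inverse-variance weighting)."""
    weights = [1 / se ** 2 for se in std_errors]
    return sum(w * b for w, b in zip(weights, impacts)) / sum(weights)

# Three hypothetical sites; the precisely estimated first site dominates.
pooled = precision_weighted_impact([0.30, 0.10, 0.20], [0.05, 0.20, 0.10])
print(round(pooled, 3))  # 0.271
```

Note how this answers a different question than weighting by the number of schools or students: the pooled estimate is pulled toward whichever sites happen to be well estimated, which is the "bottom line" caution above.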

Part VIII Using Two-Level Data for Three-Level Situations

The Issue
General question: What happens when you design a study with randomized groups that comprise three levels based on data which do not account explicitly for the middle level?
Specific example: What happens when you design a study that randomizes schools (with students clustered in classrooms within schools) based on data for students clustered in schools?

3-level vs. 2-level Variance Components

3-level vs. 2-level MDES for Original Sample

Further References
Bloom, Howard S. (2005) “Randomizing Groups to Evaluate Place-Based Programs,” in Howard S. Bloom, editor, Learning More From Social Experiments: Evolving Analytic Approaches (New York: Russell Sage Foundation).
Bloom, Howard S., Lashawn Richburg-Hayes and Alison Rebeck Black (2005) “Using Covariates to Improve Precision: Empirical Guidance for Studies that Randomize Schools to Measure the Impacts of Educational Interventions” (New York: MDRC).
Donner, Allan and Neil Klar (2000) Cluster Randomization Trials in Health Research (London: Arnold).
Hedges, Larry V. and Eric C. Hedberg (2006) “Intraclass Correlation Values for Planning Group Randomized Trials in Education” (Chicago: Northwestern University).
Murray, David M. (1998) Design and Analysis of Group-Randomized Trials (New York: Oxford University Press).
Raudenbush, Stephen W., Andres Martinez and Jessaca Spybrook (2005) “Strategies for Improving Precision in Group-Randomized Experiments” (University of Chicago).
Raudenbush, Stephen W. (1997) “Statistical Analysis and Optimal Design for Cluster Randomized Trials,” Psychological Methods, 2(2): 173 – 185.
Schochet, Peter Z. (2005) “Statistical Power for Random Assignment Evaluations of Education Programs” (Princeton, NJ: Mathematica Policy Research).