The Multiple Comparisons Problem in IES Impact Evaluations: Guidelines and Applications Peter Z. Schochet and John Deke June 2009, IES Research Conference.

What Is the Problem?
Multiple hypothesis tests are often conducted in impact studies:
– Outcomes
– Subgroups
– Treatment groups
Standard testing methods could yield:
– Spurious significant impacts
– Incorrect policy conclusions

Overview of Presentation
– Background
– Testing guidelines adopted by IES
– Examples of their use by the RELs
– New guidance on statistical methods for “between-domain” analyses

Background

Assume a Classical Hypothesis Testing Framework
– Test H0j: Impact_j = 0
– Reject H0j if the p-value of the t-test < α = .05
– The chance of finding a spurious impact is 5 percent for each test alone

But If Tests Are Considered Together and No True Impacts…
Probability that at least 1 t-test is statistically significant, by number of tests (assumes independent tests): with N tests this probability is 1 − (.95)^N, e.g., about .23 for 5 tests, .40 for 10, .64 for 20, and .92 for 50.
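The relationship on this slide is easy to reproduce. A minimal sketch in Python (mine, not the presentation's), assuming independent tests at α = .05:

```python
def prob_any_false_positive(num_tests, alpha=0.05):
    # P(at least one significant test | all null hypotheses true, independent tests)
    return 1 - (1 - alpha) ** num_tests

for n in (1, 5, 10, 20, 50):
    print(n, round(prob_any_false_positive(n), 2))
```

With 10 independent tests, for example, the chance of at least one spurious "significant" impact is already about 40 percent.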

Impact Findings Can Be Misrepresented
– Publishing bias
– A focus on “stars”

Adjustment Procedures Lower α Levels for Individual Tests
Methods control the “combined” error rate. Many available methods:
– Bonferroni: compare p-values to (.05 / # of tests)
– Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953)
– Resampling methods (Westfall and Young 1993)
– Benjamini-Hochberg (1995)
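As a concrete illustration of the Bonferroni rule named above (a sketch, not code from the presentation):

```python
def bonferroni_significant(p_values, fwer=0.05):
    # Bonferroni: compare each p-value to fwer / (number of tests)
    cutoff = fwer / len(p_values)
    return [p < cutoff for p in p_values]

# Three tests: cutoff is .05 / 3, roughly .0167
print(bonferroni_significant([0.010, 0.030, 0.200]))  # [True, False, False]
```

Note that 0.030 would be significant on its own at α = .05 but is not after the adjustment.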

These Methods Reduce Statistical Power: The Chances of Finding Real Effects
Simulated statistical power, by number of tests, for unadjusted and Bonferroni-adjusted tests (table values not shown). Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests.

Basic Testing Guidelines Balance Type I and II Errors

Problem Should Be Addressed by First Structuring the Data
– Structure will depend on the research questions, previous evidence, and theory
– Adjustments should not be conducted blindly across all contrasts

The Plan Must Be Specified Up Front
To avoid “fishing” for findings, study protocols should specify:
– Data structure
– Confirmatory analyses
– Exploratory analyses
– Testing strategy

Delineate Separate Outcome Domains
– Based on a conceptual framework
– Represent key clusters of constructs
– Domain “items” are likely to measure the same underlying trait (have high correlations): test scores, teacher practices, student behavior

Testing Strategy: Both Confirmatory and Exploratory Components
Confirmatory component:
– Addresses central study hypotheses
– Used to make overall decisions about the program
– Must adjust for multiple comparisons
Exploratory component:
– Identifies impacts or relationships for future study
– Findings should be regarded as preliminary

Focus of Confirmatory Analysis Is on Experimental Impacts
– Focus is on key child outcomes, such as test scores
– Targeted subgroups: e.g., ELL students
– Some experimental impacts could be exploratory: subgroups, secondary child and teacher outcomes

Confirmatory Analysis Has Two Potential Parts
1. Domain-specific analysis
2. Between-domain analysis

Domain-Specific Analysis: Test Impacts for Outcomes as a Group
Create a composite domain outcome, a weighted average of standardized outcomes:
– Equal weights
– Expert judgment
– Predictive validity weights
– Factor analysis weights
– MANOVA not recommended
Conduct a t-test on the composite
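A minimal sketch of the composite-and-t-test idea with equal weights (my illustration, not the presentation's; the data layout — one row of domain-item scores per student — is an assumption):

```python
from statistics import mean, stdev

def standardize(column):
    # z-score one outcome across all students
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def equal_weight_composite(rows):
    # rows: one list of domain-item scores per student
    cols = [standardize(list(col)) for col in zip(*rows)]
    return [mean(vals) for vals in zip(*cols)]

def two_sample_t(treat, ctrl):
    # pooled-variance t-statistic for the composite outcome
    nt, nc = len(treat), len(ctrl)
    sp2 = ((nt - 1) * stdev(treat) ** 2 + (nc - 1) * stdev(ctrl) ** 2) / (nt + nc - 2)
    return (mean(treat) - mean(ctrl)) / (sp2 * (1 / nt + 1 / nc)) ** 0.5
```

Standardizing each item first keeps one item's scale from dominating the composite; the t-test is then run once, on the composite, rather than item by item.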

Between-Domain Analysis: Test Impacts for Composites Across Domains
– Are impacts significant in all domains? No adjustments are needed
– Are impacts significant in any domain? Adjustments are needed (discussed later)

Application of Guidelines by the Regional Educational Labs

Basic Features of the REL Studies
25 randomized controlled trials:
– Single treatment and control groups
– Testing diverse interventions
– Typically grades K-8
– Fall-spring data collection, some longer
– Collecting data on teachers and students

Each RCT Provided a Detailed Analysis Plan to IES
Each plan included information on:
– Confirmatory research questions
– Confirmatory domains and outcomes
– Within- and between-domain testing strategy
– Study samples
– Statistical power levels

Key Features of Confirmatory Domains
Student academic achievement domains are specified in all RCTs. Some domains pertain to:
– Behavioral outcomes
– A specific time period for longitudinal studies
– Subgroups: ELL students

Most RCTs Have Specified Structured Research Questions
– Most have fewer than 3 domains; some have only 1
– Most domains have a small number of outcomes
– Main between-domain question: “Are there positive impacts in any domain?”

Adjustment Methods for Between-Domain Confirmatory Analyses

Focus on Methods to Control the Familywise Error Rate (FWER)
FWER = Prob(find ≥ 1 significant impact given that no impacts truly exist)
Preferred over the false discovery rate developed by Benjamini-Hochberg (BH):
– BH is a preponderance-of-evidence method
– BH does not control the FDR for all forms of dependence across test statistics

Consider Four FWER Adjustment Methods
– Sidak: exact adjustment when tests are independent
– Bonferroni: approximate adjustment when tests are independent
– Generalized Tukey: adjusts for correlated tests that follow a multivariate t-distribution
– Resampling: robust adjustment for correlated tests with general distributions

Main Research Questions
– How do these four methods work?
– Are the more complex methods likely to provide more powerful tests for between-domain analyses? (There are no single-routine statistical packages for the complex methods under clustered designs.)

Basic Setup for the Between-Domain Analysis
– Assume N domain composites
– Test whether any domain composite is statistically significant
– Aim to control the FWER at α = .05
– All methods reduce the α level for individual tests: α* = .05 / fact

Sidak
– Uses the relation FWER = 1 − Pr(correctly rejecting all N null hypotheses)
– For independent tests, FWER = 1 − (1 − α*)^N
– Sidak picks α* so that FWER = 0.05
– For example, if N = 3: α* = 1 − (.95)^{1/3} ≈ 0.0170, so fact = 0.05 / 0.0170 ≈ 2.95
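The Sidak arithmetic can be checked directly (a sketch, not from the slides):

```python
def sidak_alpha(num_tests, fwer=0.05):
    # per-test level alpha* solving FWER = 1 - (1 - alpha*)^N for independent tests
    return 1 - (1 - fwer) ** (1 / num_tests)

alpha_star = sidak_alpha(3)
fact = 0.05 / alpha_star
print(round(alpha_star, 4), round(fact, 2))
```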

The Bonferroni Method Tends to Be More Conservative
α* = (.05 / N); fact = N
For the Bonferroni, fact equals N exactly, while the Sidak fact is slightly smaller (e.g., about 2.95 vs. 3.00 when N = 3). (Table of fact values by N not shown.)

Sidak and Bonferroni Are Likely To Be Conservative with Correlated Tests
Correlated tests can occur if:
– Domain composites are correlated
– Treatment effects are heterogeneous
This yields tests with lower power

Generalized Tukey and Resampling Methods Adjust for Correlated Tests
Let p_i be the p-value from test i. Both methods use the relation:
FWER = Pr(min(p_1, p_2, …, p_N) ≤ .05 | H_0 is true)
Both methods calculate the FWER using the distribution of min(p_1, p_2, …, p_N) or max(t_1, t_2, …, t_N)
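The min-p relation can be checked by simulation. A sketch (pure Python; the one-factor normal model for correlated test statistics and the function names are my assumptions, not the presentation's):

```python
import math
import random

def two_sided_p(z):
    # two-sided normal p-value via the error function
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def simulated_fwer(n_tests, rho, alpha_star, reps=20000, seed=1):
    # FWER = Pr(min p-value <= alpha_star) when every null is true and the
    # test statistics share pairwise correlation rho (one common factor)
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        f = rng.gauss(0, 1)
        zs = [math.sqrt(rho) * f + math.sqrt(1 - rho) * rng.gauss(0, 1)
              for _ in range(n_tests)]
        if min(two_sided_p(z) for z in zs) <= alpha_star:
            hits += 1
    return hits / reps
```

Run with the Sidak level for N = 3, the simulated FWER sits near .05 when ρ = 0 but drops below it as ρ grows, which is the conservatism under correlated tests noted above.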

Generalized Tukey
– Assumes test statistics have a multivariate t-distribution with known correlations
– The MULTCOMP package in R can implement this adjustment (Hothorn, Bretz, Westfall 2008): a multi-stage procedure that requires user inputs

Using the MULTCOMP Package
– Inputs are a vector of impact estimates and the corresponding variance-covariance matrix
– The challenge is to get cross-equation covariances of the impact estimates
– One option: use the suest command in STATA, then copy the resulting covariance matrix to R (uses GEE rather than HLM to adjust for clustering)

Resampling/Bootstrapping
– The distribution of the maximum t-statistic can be estimated through resampling (Westfall and Young 1993), which allows for general forms of correlations and outcome distributions
– Resampling must be performed “under the null hypothesis”

Homoskedastic Bootstrap Algorithm
1. Calculate impacts and t-statistics using the original data
2. Define Y* as the residuals from these regressions
3. Repeat the following at least 10,000 times:
– Randomly sample schools, with replacement, from Y*
– Randomly assign sampled schools to treatment and control groups in the same proportion as in the original data
– Calculate impacts and save the maximum absolute t-statistic
4. Adjusted p-values = proportion of maximum t-statistics that lie above the absolute values of the original t-statistics
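The steps above can be sketched for school-level (aggregated) data as follows. This is my illustration, not the presentation's code: the data layout, function names, and group-centering used to impose the null are assumptions, and a production version would handle within-school clustering directly.

```python
import random
from statistics import mean, stdev

def t_stat(treat, ctrl):
    # pooled-variance two-sample t-statistic
    nt, nc = len(treat), len(ctrl)
    sp2 = ((nt - 1) * stdev(treat) ** 2 + (nc - 1) * stdev(ctrl) ** 2) / (nt + nc - 2)
    return (mean(treat) - mean(ctrl)) / (sp2 * (1 / nt + 1 / nc)) ** 0.5

def maxt_adjusted_p(treat_schools, ctrl_schools, reps=10000, seed=1):
    # treat_schools / ctrl_schools: one list of outcome values per school
    rng = random.Random(seed)
    n_out = len(treat_schools[0])
    col = lambda group, j: [s[j] for s in group]
    # Step 1: t-stats on the original data
    orig = [t_stat(col(treat_schools, j), col(ctrl_schools, j)) for j in range(n_out)]
    # Step 2: residuals Y* -- center each outcome within its group (no true impact)
    def center(group):
        ms = [mean(col(group, j)) for j in range(n_out)]
        return [[s[j] - ms[j] for j in range(n_out)] for s in group]
    pool = center(treat_schools) + center(ctrl_schools)
    nt, nc = len(treat_schools), len(ctrl_schools)
    # Step 3: resample schools with replacement, reassign T/C, keep max |t|
    max_ts = []
    for _ in range(reps):
        draw = [rng.choice(pool) for _ in range(nt + nc)]
        t_grp, c_grp = draw[:nt], draw[nt:]
        max_ts.append(max(abs(t_stat(col(t_grp, j), col(c_grp, j)))
                          for j in range(n_out)))
    # Step 4: adjusted p-value = share of max |t| draws above each original |t|
    return [sum(m > abs(t) for m in max_ts) / reps for t in orig]
```

Aggregating to school means sidesteps the clustering problem, in the spirit of the simple MULTTEST approach discussed on a later slide.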

Example of Resampling Method
Original t-statistics are 0.793 and 3.247; adjusted p-values are 0.89 and 0.00. For each bootstrap draw, the maximum absolute t-statistic is recorded; the adjusted p-value for test 1 is the share of draws with max t-statistic > 0.793, and for test 2 the share with max t-statistic > 3.247. (Table of bootstrap draws not shown.)

Implementation of Resampling
– The MULTTEST procedure in SAS implements resampling, but only for non-clustered data
– Simple approach: aggregate data to the school level, and use MULTTEST
– More complex approach: write a program to implement the algorithm with clustering

Comparing Methods
– Assume 3 composite domain outcomes with correlations of 0.20, 0.50, and 0.80
– Outcomes are normally distributed or heavily skewed normals (focus on skewed)
Four types of comparisons:
– FWER
– Values of fact
– Minimum Detectable Effect Size (MDES)
– “Goal Line” scenario

FWER Values Are Similar by Method Except With Large Correlations
FWER values by method (No Adjustment, Bonferroni, Sidak, Generalized Tukey, Bootstrap) and test correlations (ρ = 0.2, 0.5, 0.8); table values not shown.

Values of fact Are Similar by Method Except With Large Correlations
Values of fact, by method and test correlation (ρ = 0.2, 0.5, 0.8):
– Bonferroni: 3.00 at every ρ
– Sidak: 2.95 at every ρ
– Generalized Tukey and Bootstrap: values not shown

All Methods Yield Similar MDES
MDES values, by method and test correlation (ρ = 0.2, 0.5, 0.8):
– No Adjustment: 0.21 at every ρ
– Bonferroni: 0.25 at every ρ
– Sidak: 0.24 at every ρ
– Generalized Tukey and Bootstrap: values not shown
Assumes 60 schools, 60 students per school, R² = 0.50, ICC = 0.15

“Goal Line” Scenario: The Method Could Matter for Marginally Significant Impacts
Adjusted p-values, by method and test correlation (ρ = 0.2, 0.5, 0.8):
– No Adjustment: 0.019 at every ρ
– Bonferroni: 0.057 at every ρ
– Sidak: 0.054 at every ρ
– Generalized Tukey and Bootstrap: values not shown
Assumes 60 schools, 60 students per school, R² = 0.50, ICC = 0.15

Summary and Conclusions
Multiple comparisons guidelines:
– Specify confirmatory analyses in study protocols
– Delineate outcome domains
– Conduct hypothesis tests on domain composites
The RELs have implemented these guidelines

Summary and Conclusions
Adjustments are needed for between-domain analyses:
– For calculating MDEs in the design stage, using the Bonferroni is sufficient
– For estimating impacts, the more complex methods may be preferred in “goal-line” situations when test correlations are large

References and Contact Information
– Guidelines for Multiple Testing in Impact Evaluations (Schochet 2008): ies.ed.gov/ncee/pubs/ asp
– Resampling-Based Multiple Testing (Westfall and Young 1993; John Wiley and Sons)