Published by Neil Montgomery. Modified over 8 years ago.
1
Guidelines for Multiple Testing in Impact Evaluations
Peter Z. Schochet
June 2008
2
What Is the Problem?
Multiple hypothesis tests are often conducted in impact studies
–Outcomes
–Subgroups
–Treatment groups
Standard testing methods could yield:
–Spurious significant impacts
–Incorrect policy conclusions
3
Overview of Presentation
Background
Suggested testing guidelines
4
Background
5
Assume a Classical Hypothesis Testing Framework
True impacts are fixed for the study population
Test H_0j: Impact_j = 0
Reject H_0j if the p-value of the t-test < α = .05
Chance of finding a spurious impact is 5 percent for each test alone
6
But If Tests Are Considered Together and No True Impacts…
Number of Tests    Probability That at Least 1 t-test Is Statistically Significant (a)
1                  .05
5                  .23
10                 .40
20                 .64
50                 .92
(a) Assumes independent tests
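The probabilities on this slide follow from the independence assumption: if every null hypothesis is true and each test is run at level α, the chance that at least one of n tests is significant is 1 − (1 − α)^n. A minimal sketch of that arithmetic (the function name is illustrative, not from the presentation):

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_tests` independent tests rejects
    at level `alpha` when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** num_tests


# Evaluate the formula for the numbers of tests shown on the slide.
for n in (1, 5, 10, 20, 50):
    print(n, round(familywise_error_rate(n), 2))
```

With correlated tests the combined error rate would differ, so the formula should be read as applying only under the slide's stated assumption.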
7
Impact Findings Can Be Misrepresented
Publishing bias
A focus on “stars”
8
Adjustment Procedures Lower α Levels for Individual Tests
Methods control the “combined” error rate
Many available methods:
–Bonferroni: Compare p-values to (.05 / # of tests)
–Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953)
–Resampling methods (Westfall and Young 1993)
–Benjamini-Hochberg (1995)
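To make the list concrete, here is a minimal sketch of three of the named procedures: Bonferroni, the Holm step-down rule, and the Benjamini-Hochberg step-up rule. The function names and the example p-values are illustrative assumptions, not taken from the presentation. Note that Benjamini-Hochberg controls the false discovery rate rather than the chance of any false rejection, which is why it tends to reject more hypotheses.

```python
def bonferroni(pvalues, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / (number of tests)."""
    threshold = alpha / len(pvalues)
    return [p <= threshold for p in pvalues]


def holm(pvalues, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...) to
    alpha / (m - k) and stop at the first one that fails."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


def benjamini_hochberg(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up: find the largest rank k with
    p_(k) <= k * alpha / m and reject the k smallest p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values for five outcomes.
example = [0.001, 0.012, 0.020, 0.039, 0.300]
print("Bonferroni:        ", bonferroni(example))
print("Holm:              ", holm(example))
print("Benjamini-Hochberg:", benjamini_hochberg(example))
```

On these made-up p-values the three rules reject one, two, and four hypotheses respectively, which illustrates how the choice of procedure trades protection against spurious findings for power.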
9
These Methods Reduce Statistical Power: The Chances of Finding Real Effects
Simulated Statistical Power (a)
Number of Tests    Unadjusted    Bonferroni
5                  .80           .59
10                 .80           .50
20                 .80           .41
50                 .80           .31
(a) Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests
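The slide reports simulated power. As a rough cross-check, the sketch below uses a large-sample normal approximation instead of simulation (an assumption of this example, not the presentation's method): it backs out the standardized impact implied by 80 percent unadjusted power at a two-sided α of .05, then recomputes power at the Bonferroni threshold α / (number of tests). The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_power(num_tests, unadjusted_power=0.80, alpha=0.05):
    """Approximate power of one two-sided z-test after a Bonferroni
    correction, given its power without any correction.

    Ignores the negligible chance of rejecting in the wrong tail.
    """
    std_normal = NormalDist()
    # Standardized impact (impact / standard error) implied by the
    # unadjusted power at the unadjusted critical value.
    z_unadjusted = std_normal.inv_cdf(1 - alpha / 2)
    standardized_impact = z_unadjusted + std_normal.inv_cdf(unadjusted_power)
    # Critical value once alpha is divided by the number of tests.
    z_adjusted = std_normal.inv_cdf(1 - alpha / (2 * num_tests))
    return std_normal.cdf(standardized_impact - z_adjusted)


for n in (5, 10, 20, 50):
    print(n, round(bonferroni_power(n), 2))
```

Under these assumptions the approximation gives .59, .50, .41, and .31 for 5, 10, 20, and 50 tests, in line with the Bonferroni column on the slide.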
10
Big Debate on Whether To Use Adjustment Procedures
What is the proper balance between Type I and Type II errors?
11
To Adjust or Not To Adjust?
12
February, July, December 2007 Advisory Panel Meetings Held at IES
Chairs:
Phoebe Cottingham, IES
Rob Hollister, Swarthmore
Rebecca Maynard, U. of PA
Participants:
Steve Bell, Abt
Howard Bloom, MDRC
John Burghardt, MPR
Mark Dynarski, MPR
Andrew Gelman, Columbia
David Judkins, Westat
Jeff Kling, Brookings
David Myers, AIR
Larry Orr, Abt
Peter Schochet, MPR
13
Views Expressed Here May Not Represent Those of all Panel Members
14
Basic Testing Principles
15
The Problem Should Not Be Ignored
Erroneous conclusions can result otherwise
But need to balance Type I and II errors
16
Limiting the Number of Outcomes and Subgroups Can Help
But not always possible or desirable
Need flexible strategy for confirmatory and exploratory analyses
17
Problem Should Be Addressed by First Structuring the Data
Structure will depend on the research questions
Adjustments should not be conducted blindly across all contrasts
18
Suggested Testing Guidelines
19
The Plan Must Be Specified Up Front
Study protocols should specify:
–Data structure
–Confirmatory analyses
–Testing strategy
20
Delineate Separate Outcome Domains
Based on a conceptual framework
Represent key clusters of constructs
Domain “items” are likely to measure the same underlying trait (have high correlations)
–Test scores
–Teacher practices
–School attendance
21
Testing Strategy: Both Confirmatory and Exploratory Components
Confirmatory component
–Addresses central study hypotheses
–Must adjust for multiple comparisons
Exploratory component
–Identify impacts or relationships for future study
–Findings should be regarded as preliminary
22
Confirmatory Analysis Has Two Potential Parts
1. Domain-specific analysis
2. Between-domain analysis
23
Domain-Specific Analysis
24
Test Impacts for Outcomes as a Group
Create a composite domain outcome
–Weighted average of standardized outcomes
  –Equal weights
  –Expert judgment
  –Predictive validity weights
  –Factor analysis weights
MANOVA not recommended
Conduct a t-test on the composite
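The composite approach can be sketched as follows. This is only an illustration under stated assumptions: each outcome is standardized with its full-sample mean and standard deviation (the slide does not say which reference group to use), equal weights are the default, the test is an unequal-variance t-statistic, and the p-value uses a normal approximation that is reasonable only for large samples. All function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def standardize(values):
    """Convert one outcome to z-scores using its full-sample mean and SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def composite_scores(outcomes, weights=None):
    """Weighted average of standardized outcomes for each person.

    `outcomes` is a list of columns, one list per outcome, all in the
    same person order. Equal weights are used when none are given.
    """
    if weights is None:
        weights = [1.0] * len(outcomes)
    total = sum(weights)
    z_columns = [standardize(column) for column in outcomes]
    n = len(z_columns[0])
    return [
        sum(w * column[i] for w, column in zip(weights, z_columns)) / total
        for i in range(n)
    ]


def composite_t_test(outcomes, treated, weights=None):
    """Return (impact, t-statistic, approximate two-sided p-value) for the
    treatment-control difference in the composite."""
    scores = composite_scores(outcomes, weights)
    t_group = [s for s, flag in zip(scores, treated) if flag]
    c_group = [s for s, flag in zip(scores, treated) if not flag]
    impact = mean(t_group) - mean(c_group)
    std_error = sqrt(
        stdev(t_group) ** 2 / len(t_group) + stdev(c_group) ** 2 / len(c_group)
    )
    t_stat = impact / std_error
    # Normal approximation to the t distribution (large-sample assumption).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return impact, t_stat, p_value


# Toy data: two outcomes in one domain for six people, three of them treated.
outcomes = [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5]]
treated = [False, False, False, True, True, True]
print(composite_t_test(outcomes, treated))
```

Expert-judgment, predictive-validity, or factor-analysis weights would be passed through the `weights` argument; how to derive them is outside this sketch.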
25
What About Tests for Individual Domain Outcomes?
If impact on composite is significant
–Test impacts for individual domain outcomes without multiplicity corrections
–Use only for interpretation
If impact on composite is not significant
–Further tests are not warranted
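This rule treats the composite test as a gate for the individual outcome tests. A minimal sketch of that logic, assuming p-values have already been computed elsewhere (the function name, outcome labels, and p-values are hypothetical):

```python
def domain_specific_tests(composite_p, outcome_pvalues, alpha=0.05):
    """Apply the gate: individual outcomes are examined, without
    multiplicity corrections, only if the composite is significant.

    Returns a dict mapping outcome name to True/False (significant or
    not). An empty dict means no further tests are warranted.
    """
    if composite_p >= alpha:
        return {}
    # Unadjusted tests, to be used for interpretation only.
    return {name: p < alpha for name, p in outcome_pvalues.items()}


items = {"reading_score": 0.01, "math_score": 0.08}
print(domain_specific_tests(0.02, items))  # composite significant
print(domain_specific_tests(0.20, items))  # composite not significant
```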
26
Between-Domain Analysis
27
Applicable If Studies Require Summative Evidence of Impacts
Constructing “unified” composites may not make sense
–Domains are likely to measure different latent traits
Test domain composites individually using adjustment procedures
28
Testing Strategy Will Depend on the Research Questions
Are impacts significant in all domains?
–No adjustments are needed
Are impacts significant in any domain?
–Adjustments are needed
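The difference between the two questions can be sketched in code. Requiring significance in every domain is already a stricter claim than any single test, so each composite is compared to the unadjusted level. Claiming an impact in at least one domain gives several chances at a spurious finding, so some adjustment is needed. The slide does not name one for this case; Bonferroni is used here purely as an illustrative choice, and the domain names and p-values are made up.

```python
def significant_in_all_domains(domain_pvalues, alpha=0.05):
    """'All domains' question: every composite must reject at the
    unadjusted level, so no multiplicity adjustment is applied."""
    return all(p < alpha for p in domain_pvalues.values())


def significant_in_any_domain(domain_pvalues, alpha=0.05):
    """'Any domain' question: at least one composite must reject at an
    adjusted level (Bonferroni here, as an assumed example)."""
    threshold = alpha / len(domain_pvalues)
    return any(p < threshold for p in domain_pvalues.values())


# Hypothetical composite p-values for three domains.
pvalues = {"test_scores": 0.03, "teacher_practices": 0.04, "school_attendance": 0.20}
print(significant_in_all_domains(pvalues))
print(significant_in_any_domain(pvalues))
```

In this example two domains fall below .05 on their own, yet neither question is answered "yes": one domain fails the unadjusted test, and no domain clears the adjusted threshold of .05 / 3.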
29
Other Situations That Require Multiplicity Corrections
30
Designs With Multiple Treatment Groups
Need stringent evidence to conclude that some treatments are preferred over others
Apply Tukey-Kramer, Dunnett, Orthogonal Contrasts or resampling methods to domain composites
31
Subgroup Analyses That Are Part of the Confirmatory Analysis
Limit to a few educationally meaningful subgroups
–Justify subgroups
–Stratify by subgroup in sampling
Conduct F-tests for differences across subgroup impacts
32
Exploratory Analyses?
Two schools of thought
The use of corrections does not make exploratory findings confirmatory
33
Statistical Power
Studies must be designed to have sufficient statistical power for all confirmatory analyses
–Includes subgroup analyses
34
Reporting
Qualify confirmatory and exploratory analysis findings in reports
–No one way to present adjusted and unadjusted p-values
–Confidence intervals could be helpful
–Confirmatory analysis results should be emphasized in the executive summary
35
Testing Approach Summary
Specify plan in study protocols
Structure the data
–Delineate outcome domains
Confirmatory analysis
–Within and between domains
Exploratory analysis
Qualify findings appropriately