Guidelines for Multiple Testing in Impact Evaluations
Peter Z. Schochet
June 2008

What Is the Problem?
Multiple hypothesis tests are often conducted in impact studies
  – Outcomes
  – Subgroups
  – Treatment groups
Standard testing methods could yield:
  – Spurious significant impacts
  – Incorrect policy conclusions

Overview of Presentation
Background
Suggested testing guidelines

Background

Assume a Classical Hypothesis Testing Framework
True impacts are fixed for the study population
Test H_0j: Impact_j = 0
Reject H_0j if p-value of t-test < α = .05
Chance of finding a spurious impact is 5 percent for each test alone

But If Tests Are Considered Together and No True Impacts…
[Table: Number of Tests versus Probability That ≥ 1 t-test Is Statistically Significant (a); the values did not survive in this transcript]
a Assumes independent tests
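
Under the slide's stated assumption of independent tests, and with no true impacts, the probability that at least one of N tests is significant at level α is 1 − (1 − α)^N. The sketch below tabulates that quantity. The test counts in the loop are chosen for illustration only and are not the values from the original slide's table.

```python
def prob_at_least_one_significant(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that >= 1 of n_tests independent tests rejects when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Illustrative test counts (an assumption; not taken from the slide).
for n in (1, 5, 10, 20, 50):
    print(f"{n:>3} tests: {prob_at_least_one_significant(n):.3f}")
```

With correlated tests the probability would differ from this formula, so the numbers should be read as applying to the independent case only.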

Impact Findings Can Be Misrepresented
Publishing bias
A focus on “stars”

Adjustment Procedures Lower α Levels for Individual Tests
Methods control the “combined” error rate
Many available methods:
  – Bonferroni: Compare p-values to (.05 / # of tests)
  – Fisher’s LSD, Holm (1979), Sidak (1967), Scheffe (1959), Hochberg (1988), Rom (1990), Tukey (1953)
  – Resampling methods (Westfall and Young 1993)
  – Benjamini-Hochberg (1995)
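
To make three of the listed procedures concrete, here is a minimal sketch of Bonferroni, Holm, and Benjamini-Hochberg written from their textbook definitions. The function names and the example p-values are hypothetical, and the slides do not say which implementation the author used. Note that Benjamini-Hochberg controls the false discovery rate, while the other two control the chance of any false rejection.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each test whose p-value is at most alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the smallest p-value to alpha/m, the next to
    alpha/(m-1), and so on, stopping at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank = 0 for the smallest p-value
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


def benjamini_hochberg(p_values, q=0.05):
    """BH step-up: find the largest rank i with p_(i) <= i*q/m and reject
    every test ranked at or below it."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.012, 0.014, 0.035, 0.300]
print("Bonferroni:", bonferroni(example_p))
print("Holm:      ", holm(example_p))
print("BH:        ", benjamini_hochberg(example_p))
```

On these example values the three procedures reject progressively more hypotheses, which reflects how conservative each one is.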

These Methods Reduce Statistical Power: The Chances of Finding Real Effects
[Table: Simulated Statistical Power (a) by Number of Tests, Unadjusted versus Bonferroni; the values did not survive in this transcript]
a Assumes 1,000 treatments and 1,000 controls, 20 percent of all null hypotheses are true, and independent tests
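
The power loss can be illustrated analytically rather than by simulation. The sketch below uses a normal approximation to a two-sided, two-sample test with 1,000 members per group, as in the slide's footnote. The true effect size of 0.10 standard deviations is an assumption made here for illustration; the slide's effect size and simulated values are not recoverable, so these numbers will not reproduce the original table.

```python
from statistics import NormalDist


def power_two_sample(effect_sd: float, n_per_group: int, alpha: float) -> float:
    """Approximate power of a two-sided two-sample z-test for a standardized effect."""
    z = NormalDist()
    se = (2.0 / n_per_group) ** 0.5          # SE of a difference in standardized means
    crit = z.inv_cdf(1.0 - alpha / 2.0)      # two-sided critical value
    shift = effect_sd / se
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


EFFECT_SD = 0.10    # hypothetical true impact in SD units (not from the slides)
N_PER_GROUP = 1000  # from the slide's footnote

for n_tests in (1, 5, 10, 20, 50):
    unadjusted = power_two_sample(EFFECT_SD, N_PER_GROUP, 0.05)
    bonf = power_two_sample(EFFECT_SD, N_PER_GROUP, 0.05 / n_tests)
    print(f"{n_tests:>3} tests: unadjusted {unadjusted:.2f}, Bonferroni {bonf:.2f}")
```

Unadjusted power per test stays constant as the number of tests grows, while Bonferroni power falls because each test faces a stricter threshold.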

Big Debate on Whether To Use Adjustment Procedures
What is the proper balance between Type I and Type II errors?

To Adjust or Not To Adjust?

February, July, December 2007 Advisory Panel Meetings Held at IES
Chairs:
  – Phoebe Cottingham, IES
  – Rob Hollister, Swarthmore
  – Rebecca Maynard, U. of PA
Participants:
  – Steve Bell, Abt
  – Howard Bloom, MDRC
  – John Burghardt, MPR
  – Mark Dynarski, MPR
  – Andrew Gelman, Columbia
  – David Judkins, Westat
  – Jeff Kling, Brookings
  – David Myers, AIR
  – Larry Orr, Abt
  – Peter Schochet, MPR

Views Expressed Here May Not Represent Those of all Panel Members

Basic Testing Principles

The Problem Should Not Be Ignored
Erroneous conclusions can result otherwise
But need to balance Type I and II errors

Limiting the Number of Outcomes and Subgroups Can Help
But not always possible or desirable
Need flexible strategy for confirmatory and exploratory analyses

Problem Should Be Addressed by First Structuring the Data
Structure will depend on the research questions
Adjustments should not be conducted blindly across all contrasts

Suggested Testing Guidelines

The Plan Must Be Specified Up Front
Study protocols should specify:
  – Data structure
  – Confirmatory analyses
  – Testing strategy

Delineate Separate Outcome Domains
Based on a conceptual framework
Represent key clusters of constructs
Domain “items” are likely to measure the same underlying trait (have high correlations)
  – Test scores
  – Teacher practices
  – School attendance

Testing Strategy: Both Confirmatory and Exploratory Components
Confirmatory component
  – Addresses central study hypotheses
  – Must adjust for multiple comparisons
Exploratory component
  – Identify impacts or relationships for future study
  – Findings should be regarded as preliminary

Confirmatory Analysis Has Two Potential Parts
1. Domain-specific analysis
2. Between-domain analysis

Domain-Specific Analysis

Test Impacts for Outcomes as a Group
Create a composite domain outcome
  – Weighted average of standardized outcomes
      – Equal weights
      – Expert judgment
      – Predictive validity weights
      – Factor analysis weights
      – MANOVA not recommended
Conduct a t-test on the composite

What About Tests for Individual Domain Outcomes?
If impact on composite is significant
  – Test impacts for individual domain outcomes without multiplicity corrections
  – Use only for interpretation
If impact on composite is not significant
  – Further tests are not warranted

Between-Domain Analysis

Applicable If Studies Require Summative Evidence of Impacts
Constructing “unified” composites may not make sense
  – Domains are likely to measure different latent traits
Test domain composites individually using adjustment procedures

Testing Strategy Will Depend on the Research Questions
Are impacts significant in all domains?
  – No adjustments are needed
Are impacts significant in any domain?
  – Adjustments are needed
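
The difference between the two research questions can be sketched as two decision rules applied to the domain-composite p-values. Requiring significance in every domain is already conservative, so each test can use the unadjusted α. Claiming success when any domain is significant inflates the error rate, so the threshold is adjusted. Bonferroni is used here as one possible adjustment (an assumption; the slide does not name a method), and the p-values are hypothetical.

```python
def significant_in_all_domains(p_values, alpha=0.05):
    """'All domains' question: every composite must reject at the unadjusted alpha."""
    return all(p <= alpha for p in p_values)


def significant_in_any_domain(p_values, alpha=0.05):
    """'Any domain' question: at least one composite must reject at a
    Bonferroni-adjusted threshold (one possible adjustment)."""
    m = len(p_values)
    return any(p <= alpha / m for p in p_values)


# Hypothetical domain-composite p-values, for illustration only.
modest_everywhere = [0.030, 0.020, 0.040]
strong_in_one = [0.004, 0.300, 0.600]

for label, p in (("modest everywhere", modest_everywhere),
                 ("strong in one", strong_in_one)):
    print(label,
          "| all:", significant_in_all_domains(p),
          "| any:", significant_in_any_domain(p))
```

The two example sets give opposite answers under the two rules, which is why the research question has to be fixed before testing.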

Other Situations That Require Multiplicity Corrections

Designs With Multiple Treatment Groups
Need stringent evidence to conclude that some treatments are preferred over others
Apply Tukey-Kramer, Dunnett, Orthogonal Contrasts or resampling methods to domain composites

Subgroup Analyses That Are Part of the Confirmatory Analysis
Limit to a few educationally meaningful subgroups
  – Justify subgroups
  – Stratify by subgroup in sampling
Conduct F-tests for differences across subgroup impacts

Exploratory Analyses?
Two schools of thought
The use of corrections does not make exploratory findings confirmatory

Statistical Power
Studies must be designed to have sufficient statistical power for all confirmatory analyses
  – Includes subgroup analyses

Reporting
Qualify confirmatory and exploratory analysis findings in reports
  – No one way to present adjusted and unadjusted p-values
  – Confidence intervals could be helpful
  – Confirmatory analysis results should be emphasized in the executive summary

Testing Approach Summary
Specify plan in study protocols
Structure the data
  – Delineate outcome domains
Confirmatory analysis
  – Within and between domains
Exploratory analysis
Qualify findings appropriately