Multiple Testing in Impact Evaluations: Discussant Comments IES Research Conference June 11, 2008 Larry L. Orr.


The Guidelines

Just a couple of comments on Peter's presentation:
- Sensible advice, masterfully presented. I urge you to adopt the guidelines.
- On the open issue of adjusting exploratory tests, I come down on the side of adjusting (with a lower significance threshold).

The remainder of my remarks focuses on an issue on which Peter was (appropriately) agnostic: which adjustment should we use, and what is its effect on power?

Disclaimer: I was a member of the working group that developed the guidelines Peter presented. My remarks today represent my own views, not those of the working group.

Different Adjustments Deal with Different Issues
- Many adjustments (e.g., Bonferroni, Holm, Tukey-Kramer) control the family-wise error rate (FWER), i.e., the probability of any false positive among the tests. That is not usually what concerns us.
- Typical situation: we have some set of estimates that are significant by conventional standards, and we want assurance that most of them reflect real effects. That is, we are concerned with the false discovery rate.
- The Benjamini-Hochberg procedure controls the false discovery rate.
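The Benjamini-Hochberg procedure itself takes only a few lines to state. A minimal sketch of the standard step-up algorithm (the function name and example p-values are mine, not from the talk):

```python
# Benjamini-Hochberg step-up procedure: control the false discovery
# rate at level q across m simultaneous tests.

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean list: True where the hypothesis is rejected
    with false discovery rate controlled at q."""
    m = len(p_values)
    # Sort p-values, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k/m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    # Reject the k_max smallest p-values (step-up: all of them,
    # even if some intermediate ones exceeded their own threshold).
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(pvals, q=0.05))
```

Note that five of these eight p-values fall below .05, but B-H rejects only the two smallest: this is exactly the trade of fewer discoveries for fewer false ones discussed below.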

The False Discovery Rate (FDR)
- FDR = proportion of significant estimates that are false positives (Type I errors).
- Example: suppose we have 20 statistically significant estimates, of which 8 reflect true nonzero impacts and 12 are false positives. Then FDR = 12/20 = .6.
- A low FDR is good.

An Example

Suppose we estimate impacts on 4 outcomes for each of the following subgroups:
- Gender (2 groups)
- Ethnicity (4 groups)
- Region (4 groups)
- School size (2 groups)
- Central city/suburban/rural (3 groups)
- SES (3 groups)
- Number of siblings (4 groups)
- Pretest score (3 groups)

That is 100 estimates (4 outcomes x 25 subgroups), not atypical for an education study.

Example (cont'd)
- Suppose 10 estimates are significant at the .05 level.
- That might reflect: 10 true nonzero impacts; 9 true nonzero impacts and 1 false positive; 8 true nonzero impacts and 2 false positives; and so on.
- If only 5 of the 100 true impacts are nonzero, the expected mix is roughly 5 true nonzero impacts and 5 false positives, implying FDR = 50%.
- But you can never know what the actual mix is, and you cannot know which estimates are which.

IES Research Conference9 Expected FDR as function of proportion true nonzero impacts (assumes no MC adjustment, significance level =.05; power =.80)

Implications
- When only 5% of all true impacts are nonzero, FDR = .5; that is, half of the significant estimates are likely to be Type I errors (but you cannot know which ones they are!).
- The FDR remains quite high until the proportion of true impacts that are nonzero rises above 25%.
- Only when more than 50% of true impacts are nonzero is the FDR relatively low (< .06).
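These numbers follow from a simple expected-value calculation. A sketch assuming, as the chart does, significance level .05 and power .80, and treating expected counts as if realized:

```python
# Expected FDR as a function of the proportion p of true impacts that
# are nonzero, with no multiple-comparisons adjustment.
# Assumes alpha = .05 and power = .80, as stated on the chart.

def expected_fdr(p, alpha=0.05, power=0.80):
    false_pos = alpha * (1 - p)   # expected share of tests that are false positives
    true_pos = power * p          # expected share that are true positives
    return false_pos / (false_pos + true_pos)

for p in (0.05, 0.25, 0.50):
    print(f"proportion nonzero = {p:.2f}: expected FDR = {expected_fdr(p):.3f}")
```

At p = .05 this gives an FDR of about .54, and only above p = .50 does it drop below .06, matching the slide.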

Simulations
- Real education data from the ECLS-K Demonstration.
- 4 outcomes: reading, math, attendance, peers.
- 25 subgroups (see earlier list).
- Imputed zero or nonzero (effect size = .2) impacts for varying proportions of subgroups.
- Measured FDR with and without the Benjamini-Hochberg correction.
- 500 replications of 100 estimates.
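A hypothetical re-creation of this simulation design (not the authors' code, and drawing simulated z-statistics rather than using the ECLS-K data; the noncentrality 2.8 is my choice, giving roughly .80 power at alpha = .05 two-sided):

```python
# Monte Carlo sketch: 500 replications of 100 tests, a fraction
# p_nonzero of which have a true effect. Pools false and true
# discoveries across replications to estimate the realized FDR,
# unadjusted vs. Benjamini-Hochberg adjusted.

import random
from statistics import NormalDist

norm = NormalDist()
DELTA = 2.8   # noncentrality: ~.80 power at alpha = .05, two-sided

def simulate_fdr(p_nonzero, n_tests=100, reps=500, alpha=0.05, seed=1):
    rng = random.Random(seed)
    fp_u = tp_u = fp_bh = tp_bh = 0
    for _ in range(reps):
        truth = [i < round(p_nonzero * n_tests) for i in range(n_tests)]
        zs = [rng.gauss(DELTA if t else 0.0, 1.0) for t in truth]
        pvals = [2 * (1 - norm.cdf(abs(z))) for z in zs]
        # Unadjusted: reject every p <= alpha.
        for t, p in zip(truth, pvals):
            if p <= alpha:
                tp_u += t
                fp_u += not t
        # Benjamini-Hochberg step-up at q = alpha.
        order = sorted(range(n_tests), key=lambda i: pvals[i])
        k_max = 0
        for rank, i in enumerate(order, start=1):
            if pvals[i] <= rank / n_tests * alpha:
                k_max = rank
        for rank, i in enumerate(order, start=1):
            if rank <= k_max:
                tp_bh += truth[i]
                fp_bh += not truth[i]
    fdr_u = fp_u / max(1, fp_u + tp_u)
    fdr_bh = fp_bh / max(1, fp_bh + tp_bh)
    return fdr_u, fdr_bh

fdr_u, fdr_bh = simulate_fdr(p_nonzero=0.05)
print(f"unadjusted FDR = {fdr_u:.3f}, B-H adjusted FDR = {fdr_bh:.3f}")
```

Even this toy version reproduces the qualitative result on the next slide: at a low nonzero-impact rate the unadjusted FDR is around one-half, while B-H holds it near the nominal .05.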

[Chart] Simulation results: FDR as a function of the true zero-impact rate, unadjusted vs. B-H adjusted. Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration.

Implications
- B-H does indeed control the FDR in real-world education data (at least, in these data).
- Even at very low nonzero impact rates, the FDR is well below 5%.
- This comes at a price, however...

[Chart] The effect of the B-H adjustment on Type II errors, adjusted vs. unadjusted. Based on 500 replications of estimated impacts on 4 outcomes for 25 subgroups with simulated effect size (ES) = 0 or ES = .20, using data from the ECLS-K Demonstration.

The Cost of Adjusting for Multiple Comparisons within a Fixed Sample
- For a given sample, reducing the chance of Type I errors (false positives) increases the chance of Type II errors (missing true effects).
- In this case, at very low nonzero impact rates, the Type II error rate for a typical subgroup (the probability of missing a true effect when there is one) went from .28 to .70 (i.e., power fell from .72 to .30!).
- At high nonzero impact rates, the power loss is much smaller: when the nonzero impact rate is 95%, the adjustment increases the Type II error rate only from .27 to .33 (i.e., power falls from .73 to .67).

Does This Mean We Must Sacrifice Power to Deal with Multiple Comparisons?
- Yes, if you have already designed your sample ignoring the multiple-comparisons problem.
- But if you take the adjustment into account at the design stage, you can build the power loss associated with MC adjustments into the sample-size calculation and maintain power.
- This means, of course, larger samples and more expensive studies (sorry about that, Phoebe).
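To see how much larger, here is a back-of-envelope sketch (my illustration, not from the talk) of per-group sample size for a two-sample z-test as the effective significance threshold tightens. The .005 threshold is an arbitrary example of a tightened cutoff, not the actual B-H threshold, which depends on the whole set of p-values:

```python
# Per-group sample size for a two-sample z-test to detect standardized
# effect size es with the given power at two-sided level alpha:
#   n = 2 * ((z_{1-alpha/2} + z_{power}) / es)^2

from statistics import NormalDist

norm = NormalDist()

def n_per_group(es, alpha, power):
    z_a = norm.inv_cdf(1 - alpha / 2)
    z_b = norm.inv_cdf(power)
    return 2 * ((z_a + z_b) / es) ** 2

es, power = 0.20, 0.80
n_unadj = n_per_group(es, 0.05, power)    # conventional alpha = .05
n_tight = n_per_group(es, 0.005, power)   # illustrative tightened threshold
print(round(n_unadj), round(n_tight))
```

Tightening the threshold by a factor of ten inflates the required sample by roughly 70% in this example, which is the design-stage cost the slide alludes to.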

For a copy of this presentation, send an email to: