Optimality Considerations in Testing Massive Numbers of Hypotheses
Peter H. Westfall and Ananda Bandulasiri
Texas Tech University
Hypotheses; FWE and FDR
H_0i (point null) vs. H_1i, i = 1,…,k. k is large!
A decision algorithm for classifying k tests produces R total rejections, with V erroneous.
FWE = P(V > 0)
FDR = E(V/R⁺), where R⁺ = max(R, 1)
To control FWE: Hochberg, Westfall-Young, …
To control FDR: Benjamini and Hochberg, …
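For concreteness, the FDR-controlling step-up procedure of Benjamini and Hochberg can be sketched as a minimal NumPy implementation (my sketch, not code from the talk):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a boolean mask of rejected hypotheses, controlling
    FDR at level q for independent (or positively dependent) p-values.
    """
    p = np.asarray(pvals, dtype=float)
    k = p.size
    order = np.argsort(p)
    # step-up: find the largest i with p_(i) <= (i/k) * q
    below = p[order] <= (np.arange(1, k + 1) / k) * q
    reject = np.zeros(k, dtype=bool)
    if below.any():
        last = np.nonzero(below)[0].max()
        reject[order[: last + 1]] = True
    return reject
```

All hypotheses whose p-values are at most the largest p_(i) satisfying the step-up condition are rejected, even those individual p-values that exceed their own thresholds.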
Scale Up, wrt k
FWE-controlling methods do not scale up as k → ∞: reject H_0i when p_i ~ α/k.
FDR-controlling methods do scale up as k → ∞: reject H_0i when p_i ≤ τ_k, where τ_k → τ > 0 as k → ∞.
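The contrast can be seen numerically: the Bonferroni-type cutoff α/k shrinks to zero, while the data-dependent Benjamini-Hochberg cutoff settles near a positive value when a fixed fraction of hypotheses is non-null. A small illustrative simulation (the 20% non-null fraction and effect size 3 are my assumptions, not from the talk):

```python
import math
import numpy as np

def bh_cutoff(pvals, q=0.05):
    """Largest rejected p-value under Benjamini-Hochberg (0.0 if none)."""
    p = np.sort(np.asarray(pvals, dtype=float))
    k = p.size
    below = p <= (np.arange(1, k + 1) / k) * q
    return float(p[np.nonzero(below)[0].max()]) if below.any() else 0.0

rng = np.random.default_rng(0)
for k in (400, 2000, 10000, 50000):
    n_alt = k // 5                                        # 20% non-null (assumed)
    z = np.concatenate([rng.standard_normal(k - n_alt),   # null z-scores
                        rng.standard_normal(n_alt) + 3.0])  # shifted effects
    p = np.array([math.erfc(abs(zi) / math.sqrt(2)) for zi in z])  # 2-sided p
    print(f"k={k:6d}  alpha/k={0.05 / k:.2e}  BH cutoff={bh_cutoff(p):.4f}")
```

As k grows, the printed α/k column collapses toward zero while the BH cutoff column stays roughly constant, which is the scale-up property in the bullet above.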
FDR Convergence as k → ∞
Critical t₃ for FDR(.05) is 4.93; the marginal (unadjusted) critical value is …
Application: EEG Responses to Light Stimuli
43 time-series responses; 62 scalp locations; 70 independent replications; 5 treatments: (1) G60%, (2) R90%, (3) G80%, (4) R100%, (5) G100%.
Average EEG Curves
Questions of Interest
Validity check: differences should exist between responses for different intensities.
Main question: are there differences between red and green stimuli? When? Where?
Number of tests: k = (10 treatment comparisons) × (62 scalp spots) × (43 time locations) = 26,660.
Histograms of t-statistics with null reference (figure slides, one panel per pairwise comparison): G60 v R90, G60 v G80, G60 v R100, G60 v G100, R90 v G80, R90 v R100, R90 v G100, G80 v R100, G80 v G100, R100 v G100.
Results
Westfall-Young FWE-controlling method: no significant R100 v G100 comparisons; significant comparisons for all other contrasts.
Benjamini-Hochberg FDR-controlling method: 23 significant R100 v G100 comparisons.
Claim: the FWE-controlling method gave the right answer.
A Comment
FDR scales up better as k → ∞, but that does not necessarily mean the results are “better,” even for large k.
Scale Up, wrt n
Model for test statistics Z_i, i = 1,…,k:
Z_i | δ_i ~ N(n^{1/2} δ_i, 1); δ_i = θ_i/σ_xi = standardized effect size; δ_i ~ F.
Suppose P(δ_i = 0) = 0. Then:
FDR does not scale up as n → ∞.
FWE might scale up, but only serendipitously, if n and k diverge at appropriate rates.
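A quick simulation illustrates the claim under this model: when every δ_i is nonzero (δ_i drawn from a continuous F), the fraction of hypotheses rejected by an FDR procedure climbs toward 1 as n grows, so the FDR guarantee stops acting as a stable filter. This is my illustrative sketch (τ = 0.05 scale for the δ's and k = 5000 are assumptions, not numbers from the talk):

```python
import math
import numpy as np

def bh_reject_fraction(pvals, q=0.05):
    """Fraction of hypotheses rejected by Benjamini-Hochberg."""
    p = np.sort(np.asarray(pvals, dtype=float))
    k = p.size
    below = p <= (np.arange(1, k + 1) / k) * q
    return (np.nonzero(below)[0].max() + 1) / k if below.any() else 0.0

rng = np.random.default_rng(2)
k = 5000
delta = rng.normal(0.0, 0.05, k)     # every delta_i is nonzero: P(delta=0)=0
for n in (10, 100, 1000, 10000):
    # Z_i | delta_i ~ N(sqrt(n) * delta_i, 1), as in the model above
    z = math.sqrt(n) * delta + rng.standard_normal(k)
    p = np.array([math.erfc(abs(zi) / math.sqrt(2)) for zi in z])
    print(f"n={n:6d}  fraction rejected={bh_reject_fraction(p):.3f}")
```

With small n most tiny effects look null; with large n essentially all of them are rejected, which is the sense in which FDR "does not scale up as n → ∞."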
Efron’s Method
JASA (2004).
Estimate an “empirical null distribution,” f₀(z), from the center of the histogram of z’s.
Estimate the combined distribution, f(z).
Estimate a “local FDR” for each z_i as fdr(z_i) = f₀(z_i)/f(z_i).
Choose as “interesting” the cases with fdr(z_i) < 0.1.
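The pipeline above can be sketched as follows. This is a crude stand-in: a moment fit for the empirical null and a histogram estimate of f(z), both much simpler than Efron's actual estimators; the `center_width` and `bins` settings are my assumptions, while the 0.1 cutoff matches the slide. (The moment fit on a truncated central window also understates the null spread, which Efron's central matching avoids.)

```python
import math
import numpy as np

def local_fdr(z, center_width=1.5, bins=60, threshold=0.1):
    """Crude sketch of Efron-style local fdr:
    - fit a normal 'empirical null' f0 to z-scores near the median,
    - estimate the mixture density f(z) by a histogram,
    - flag cases with fdr(z) = f0(z) / f(z) below `threshold`.
    """
    z = np.asarray(z, dtype=float)
    # empirical null from the center of the histogram (moment fit)
    central = z[np.abs(z - np.median(z)) < center_width]
    mu0, sd0 = central.mean(), central.std(ddof=1)
    # mixture density f(z) evaluated at each z via its histogram bin
    counts, edges = np.histogram(z, bins=bins, density=True)
    idx = np.clip(np.searchsorted(edges, z, side="right") - 1, 0, bins - 1)
    f = np.maximum(counts[idx], 1e-12)
    f0 = np.exp(-0.5 * ((z - mu0) / sd0) ** 2) / (sd0 * math.sqrt(2 * math.pi))
    fdr = np.minimum(f0 / f, 1.0)
    return fdr, fdr < threshold
```

Cases flagged by the boolean mask are the "interesting" ones; z-scores near the center of the histogram get fdr near 1 and are never flagged.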
Discussion
P(δ_i = 0) > 0 is usually false, but a reasonable approximation for small n.
As n → ∞ we need more realistic models: “P(δ_i = 0) > 0 is never true,” but even if it were true, unobserved covariates, failed model assumptions, and imperfect sampling procedures make the “empirical null” sensible.
Results from Efron’s Method
Significant differences only for 2 v 3, 2 v 4, 2 v 5.
No significant R100 v G100 comparisons (the right answer).
What is the “Right Answer”?
Methods that have good utility are “right.”
MCPs must have reasonable utility; otherwise they would have disappeared long ago.
DECISION THEORY: the right answer is to maximize utility / minimize loss.
Loss Functions
[Figure: piecewise loss functions with threshold θ₀ and levels δ_k, C₁, C₂.]
Optimal Decision Rule
If θ were known, classify to:
“UE” when θ < −θ₀
“OE” when θ > θ₀
“NC” when |θ| < θ₀
regardless of the distribution of θ.
Problem: we observe x = θ + ε, not θ. Here distributions *do* matter.
Assumptions
x | θ ~ N(θ, σ_x²) (σ_x known, say)
θ ~ N(0, τ₀²) w.p. π₀; θ ~ N(0, τ₁²) w.p. 1 − π₀.
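Under a two-component normal mixture prior for θ (my reading of this slide: θ ~ N(0, τ₀²) with probability π₀, else θ ~ N(0, τ₁²), observed through x | θ ~ N(θ, σ_x²)), the posterior of θ given x is again a normal mixture, so P(θ > θ₀ | x) and P(θ < −θ₀ | x) have closed forms and can drive the UE/OE/NC classification. A sketch with σ_x = 1 assumed and the other defaults taken from the baseline-parameters slide:

```python
import math

def phi_cdf(t):
    """Standard normal CDF via erfc."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))

def posterior_tail_probs(x, sigma_x=1.0, tau0=0.05, tau1=1.0, pi0=0.80,
                         theta0=0.223):
    """P(theta > theta0 | x) and P(theta < -theta0 | x) under the
    two-component normal mixture prior.  tau0, tau1, pi0, theta0 are the
    baseline values from the slides; sigma_x = 1 is an assumption."""
    comps = []
    for tau, pi in ((tau0, pi0), (tau1, 1.0 - pi0)):
        v = tau * tau + sigma_x * sigma_x            # marginal variance of x
        w = pi * math.exp(-0.5 * x * x / v) / math.sqrt(v)  # component weight
        b = tau * tau / v                            # posterior shrinkage factor
        comps.append((w, b * x, math.sqrt(b) * sigma_x))  # weight, mean, sd
    total = sum(w for w, _, _ in comps)
    p_hi = sum((w / total) * (1.0 - phi_cdf((theta0 - m) / s))
               for w, m, s in comps)
    p_lo = sum((w / total) * phi_cdf((-theta0 - m) / s) for w, m, s in comps)
    return p_hi, p_lo
```

Classifying to OE or UE when the corresponding tail probability is large (relative to the loss-function costs) is one way the distribution of θ enters the rule once only x is observed.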
Baseline Parameters
Model parameters: σ_x = …; τ₀ = 0.05; τ₁ = 1.00; π₀ = 0.80.
Loss function parameters: C₁ = 0.4; C₂ = 0.2; δ_k = 0.99; θ₀ = 0.223.
Test Stats & Multiple Tests
For testing H_0i: θ_i = 0, i = 1,…,k, p-values are computed from the test statistics t_i.
FWE- and FDR-controlling methods use the p_i; Efron’s method uses the t_i.
Average Loss
Simulation: θ’s generated, then x’s | θ. Independence across tests.
All combinations of:
p = (400, 2000, 10000, 50000)
n = (10, 20, 40, 80, 160)
Concluding Comments
Consider scale-up in both p and n.
FWE often OK (serendipity).
Efron promising for scaling up both ways.
Recommendations: either
a. Learn to recognize situations where FWE/FDR/Efron/… have good utility, or
b. Bite the bullet and construct loss functions.