Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans Qunyuan Zhang Division of Statistical Genomics Washington University School of Medicine
2 Multiple Comparison (strategy 1): Type I error (false positive) vs. Type II error (false negative); high power vs. low power. P-value adjustment/correction (Bonferroni, FDR). Empirical p-value (permutation, bootstrap).
3 Multiple Comparison (strategy 2): Type I error (false positive) vs. Type II error (false negative). Larger sample size; meta-analysis; biological info or evidence; …… More powerful statistical approach: SMDP (Sequential Multiple Decision Procedure).
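As a rough illustration of the p-value adjustments named on this slide, the sketch below implements the Bonferroni correction and Benjamini-Hochberg FDR-adjusted p-values in plain Python. The function names and the example p-values are made up for illustration and are not taken from the talk.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example = [0.01, 0.04, 0.03, 0.5]  # illustrative p-values only
print(bonferroni(example))
print(benjamini_hochberg(example))
```

Bonferroni controls the genome-wise error rate and is the stricter of the two; the FDR adjustment tolerates a controlled fraction of false positives among the reported hits.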
4 What is SMDP? A generalized framework for ranking and selection, using optimum sample sizes. A combination of sequential analysis and multiple hypothesis testing.
5 Feature 1 of SMDP: Sequential Analysis. Start from a small sample size n0. Increase the sample size (n0+1, n0+2, …, n0+i, …), with a sequential test at each stage. Stop when the stopping rule is satisfied.
6 Feature 2 of SMDP: Multiple Decision. Traditional methods: independent tests, one binary hypothesis test per SNP (test 1 for SNP1, test 2 for SNP2, …, test n for SNPn). SMDP: a simultaneous test, i.e. a multiple hypothesis test that splits SNP1, SNP2, …, SNPn into a signal group and a noise group.
7 Binary Hypothesis Test used by traditional methods. SNP1, SNP2, SNP3, …, SNPn. Test 1: H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0; test 2: H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0; tests 3 to 6: ……; test n: H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0. Test-wise error vs. genome-wise error: the multiple testing issue.
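To make the gap between test-wise and genome-wise error concrete, here is a minimal sketch. It assumes the n tests are independent, which is a simplification for SNPs in linkage disequilibrium; the function names are illustrative.

```python
def genome_wise_error(alpha, n_tests):
    """P(at least one false positive) over n independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Test-wise level that keeps the genome-wise error at or below alpha."""
    return alpha / n_tests


for n in (1, 10, 100, 500_000):
    print(n, genome_wise_error(0.05, n), bonferroni_threshold(0.05, n))
```

With 500K SNPs, a test-wise level of 0.05 makes at least one false positive practically certain, which is why a threshold of the form 0.05/500K is used for genome-wide scans.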
8 Multiple Hypothesis Test used by SMDP. SNP1, SNP2, SNP3, …, SNPn. H1: SNP1,2,3 are truly different from the others; H2: SNP1,2,4 are truly different from the others; H3, H4: ……; H5: SNP4,5,6 are truly different from the others; H6: ……; Hu: SNPn,n-1,n-2 are truly different from the others. Goal: search for the best one. H: any t SNPs are truly different from the other (n-t) SNPs. u = number of all possible combinations of t out of n.
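The set of candidate hypotheses on this slide can be enumerated directly for a small example. The sketch below uses n = 6 SNPs and t = 3, with illustrative SNP labels; each tuple stands for one hypothesis "these t SNPs are truly different from the others".

```python
from itertools import combinations
from math import comb


def candidate_hypotheses(snps, t):
    """All ways to pick t SNPs out of the list; one hypothesis per combination."""
    return list(combinations(snps, t))


snps = ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP6"]
hypotheses = candidate_hypotheses(snps, 3)

print(len(hypotheses), comb(len(snps), 3))  # u, computed two ways
print(hypotheses[0])                        # first hypothesis in lexicographic order
```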
9 General Rule of SMDP (Bechhofer et al., 1968). Selecting the t best of M K-D populations. Sequential sampling at stages 1, 2, …, h, h+1, …. Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t | Pop. t+1, Pop. t+2, …, Pop. M, with the two groups separated by a distance D. Sequential statistic at stage h: Y_{1,h}, Y_{2,h}, …, Y_{t,h}, …, Y_{M,h}. U possible combinations of t out of M (for each combination u). Stopping rule: Prob. of correct selection (PCS) > P*, whenever D>D*.
10 Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968). The frequency/density function of a K-D population can be written in the form f(x) = exp{P(x)Q(θ) + R(x) + S(θ)}. Examples: A. the normal density function with unknown mean and known variance; B. the normal density function with unknown variance and known mean; C. the exponential density function with unknown scale parameter and known location parameter; D. the Poisson distribution with unknown mean; …… The distance of two K-D populations
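As a worked check of the K-D form, cases A and B can be written out explicitly. The symbols μ and σ² for the known quantities are introduced here for illustration, and the split into P, Q, R, S is one valid choice rather than the only one (constants can be moved between the terms).

```latex
% Case A: normal density, unknown mean \theta, known variance \sigma^2
\begin{align*}
f(x) &= \frac{1}{\sqrt{2\pi\sigma^2}}
        \exp\left\{-\frac{(x-\theta)^2}{2\sigma^2}\right\}
      = \exp\left\{ x\,\frac{\theta}{\sigma^2}
        - \frac{x^2}{2\sigma^2} - \tfrac{1}{2}\ln(2\pi\sigma^2)
        - \frac{\theta^2}{2\sigma^2} \right\}, \\
P(x) &= x, \quad Q(\theta) = \frac{\theta}{\sigma^2}, \quad
R(x) = -\frac{x^2}{2\sigma^2} - \tfrac{1}{2}\ln(2\pi\sigma^2), \quad
S(\theta) = -\frac{\theta^2}{2\sigma^2}.
\end{align*}

% Case B: normal density, unknown variance \theta, known mean \mu
\begin{align*}
f(x) &= \frac{1}{\sqrt{2\pi\theta}}
        \exp\left\{-\frac{(x-\mu)^2}{2\theta}\right\}
      = \exp\left\{ (x-\mu)^2 \left(-\frac{1}{2\theta}\right)
        - \tfrac{1}{2}\ln(2\pi\theta) \right\}, \\
P(x) &= (x-\mu)^2, \quad Q(\theta) = -\frac{1}{2\theta}, \quad
R(x) = 0, \quad S(\theta) = -\tfrac{1}{2}\ln(2\pi\theta).
\end{align*}
```

In case B the statistic P(x) is a squared deviation from the known mean, which is why sums of squares appear as the sequential statistic when this case is applied to regression residuals.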
12 Combine SMDP With Regression Model (M.A. Province, 2000, page 319) Case B : the normal density function with unknown variance and known mean;
13 SMDP - Regression (M.A. Province, 2000). Data pairs for a marker: (Z_1, X_1), (Z_2, X_2), (Z_3, X_3), …, (Z_h, X_h), (Z_{h+1}, X_{h+1}), …, (Z_N, X_N). Sequential sum of squares of regression residuals. Y_{i,h} denotes Y for marker i at stage h (see slide 7).
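A minimal sketch of a sequential residual sum of squares for one marker follows. It rests on several assumptions not stated on the slide: Z is treated as the response and X as the predictor, the model is a simple linear regression, and the line is refitted on the first h pairs at every stage. The function names and the starting size n0 are illustrative.

```python
def residual_sum_of_squares(x, z):
    """RSS of the least-squares line z = a + b*x (assumed model)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx if sxx > 0 else 0.0  # constant predictor: fit the mean only
    intercept = mean_z - slope * mean_x
    return sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))


def sequential_rss(x, z, n0=3):
    """One value per stage h = n0, ..., N, using only the first h data pairs."""
    return [residual_sum_of_squares(x[:h], z[:h]) for h in range(n0, len(x) + 1)]


# Toy data: genotype codes as predictor, a made-up phenotype as response.
print(sequential_rss([0, 1, 2, 1], [0.0, 1.0, 2.0, 3.0]))
```

Running this for every marker i gives one series per marker, which is the kind of input a sequential selection rule compares across markers at each stage.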
14 A Real Data Example ( M.A. Province, 2000, page 308)
15 Simulation Results M.A. Province, 2000, page 312
16 SMDP: Computational Problem. Sequential stages: …, h, h+1, …, N. At stage h: Y_{1,h}, Y_{2,h}, …, Y_{k,h}, Y_{k+1,h}, Y_{k+2,h}, …, Y_{M,h}. U sums for the U possible combinations of t out of M; each sum contains t members of Y_{i,h}. Computer time?
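The size of the problem can be gauged by counting U for a few values of M and t. The values of M and t below are chosen for illustration (5841 matches the SNP count of the application slide); the helper name is hypothetical.

```python
from math import comb


def n_combinations(m, t):
    """U: number of ways to choose t populations out of M."""
    return comb(m, t)


for m in (10, 100, 1000, 5841):
    for t in (1, 2, 3, 5):
        print(f"M={m:5d} t={t}  U={n_combinations(m, t):,}")
```

Since all U sums would have to be evaluated at every sequential stage, U grows far too quickly for exhaustive enumeration at genome-wide scale once t exceeds 2 or 3.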
17 Simplified Stopping Rule (Zhang & Province, 2005). U-S+1 = Top Combination Number (TCN). When TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule. When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule. How to choose TCN? Balance between computational accuracy and computational time.
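The stopping-rule formula itself did not survive on this slide, so the following is only a sketch under an assumed form: a Bechhofer-type statistic W that sums exp(-D* × (best sum − other sum)) over the competing combination sums and stops when W ≤ (1 − P*)/P*. The assumed form, the convention that larger Y is better, and all function names are hypotheses for illustration. What the sketch does show is the role of TCN: only the top TCN combination sums enter the statistic, so it has U − S = TCN − 1 terms.

```python
import math
from itertools import combinations


def stopping_statistic(y, t, d_star, tcn=None):
    """Assumed statistic: sum over the top (TCN - 1) competitors of the best sum."""
    sums = sorted(sum(c) for c in combinations(y, t))  # ascending, best sum last
    u = len(sums)
    if tcn is None:
        tcn = u                      # TCN = U: every combination is used
    if not 2 <= tcn <= u:
        raise ValueError("TCN must lie between 2 and U")
    top = sums[-tcn:]                # the TCN largest sums, best included
    best = top[-1]
    return sum(math.exp(-d_star * (best - s)) for s in top[:-1])


def should_stop(w, p_star):
    """Assumed threshold for the stopping decision."""
    return w <= (1.0 - p_star) / p_star


y = [5.0, 1.0, 0.5, 0.2]             # toy per-marker statistics at one stage
print(stopping_statistic(y, 2, 1.0))          # all U - 1 terms
print(stopping_statistic(y, 2, 1.0, tcn=2))   # single-term version
```

This toy version still enumerates every combination, so it illustrates the truncation rather than the time saving. Because dropped terms are non-negative, the truncated statistic is never larger than the full one, which is the accuracy side of the trade-off named on the slide.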
19 Application to Pharmacogenetics Data. Sample size: 85 cell lines. Genotype: 5841 SNPs. Phenotype: ViabFu7. P*=0.95, D*=10, TCN= SNPs P<0.01
20 SMDP for GWAS. Some technical/programming problems: 1. Computer time (approximation & parallelization); 2. Missing data; 3. Stability at early stage; 4. Rare SNPs. SMDP can now be run on GWAS data (500K chip, 1000 subjects) within 10 hours on a cluster.
21 Simulation SNPs 1 true signal 500 replications
22 Simulation 2: Multiple signals. Genotype data: GAW16 problem 3, 500K SNP data. Phenotype data: simulated LDL (measured at the first visit), ~6500 subjects, 200 replications. Analyses: for each replication, randomly draw 1000 SNPs without true effects and 10 SNPs with minor polygenic effects, and keep all 6 SNPs with relatively major effects, to create a subset of genotypes. Recode the genotypes to 0, 1 and 2 according to the copy number of minor alleles. Apply SMDP to the selected data and repeat the analysis over the 200 replications.
23 Modified SMDP (analysis procedure): (1) Start the analysis (or experiment) from a small sample size; (2) Perform multiple decision analysis to simultaneously test whether a group of markers is significant; (3) Eliminate significant markers from the list (if any are identified); (4) Add one or more new samples to the data; (5) Repeat (2), (3), (4) …; (6) Stop the procedure when all samples have been used and no more markers are identified.
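The six steps above can be sketched as a loop. The multiple decision analysis of step (2) is passed in as a callback, `decide`, which is a hypothetical placeholder: the slide does not specify its interface, and the toy rule used in the example is made up purely to exercise the control flow.

```python
def modified_smdp(samples, markers, decide, n0=10, step=1):
    """Run the sequential loop; returns (identified, remaining).

    decide(subsample, remaining_markers) is assumed to return the markers
    judged significant on the current subsample.
    """
    remaining = list(markers)
    identified = []                       # (marker, sample size when found)
    n = min(n0, len(samples))             # step 1: small starting sample
    while True:
        hits = set(decide(samples[:n], remaining)) & set(remaining)  # step 2
        identified.extend((m, n) for m in remaining if m in hits)
        remaining = [m for m in remaining if m not in hits]          # step 3
        if n >= len(samples) and not hits:                           # step 6
            break
        n = min(n + step, len(samples))                              # step 4
    return identified, remaining


# Toy decision rule: marker m becomes "significant" once enough samples are in.
needed = {"A": 3, "B": 5, "C": 100}
toy_decide = lambda sub, rem: {m for m in rem if needed[m] <= len(sub)}

print(modified_smdp(list(range(8)), ["A", "B", "C"], toy_decide, n0=2))
```

The sample size recorded with each identified marker is what an average sample number (ASN) would be computed from, and samples beyond that size remain available for validation.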
24 ROC Curves of SMDP and Regular Regression Analyses. Ar, Br: regular regression using all samples; As, Bs: SMDP analyses; Ars, Brs: regular regression using SMDP’s average sample sizes (ASN). Ar, As and Ars: analysis of SNPs with major effects; Br, Bs and Brs: analysis of SNPs with minor effects. ASN: the average sample size used in SMDP, presented as a proportion of the entire sample size.
25 Power comparison of SMDP and regular regression (type I error rate = ). Columns: SNPs with true effects; simulated h2; SMDP power; SMDP ASN**; validation*; power of regular regression using ASN. rs rs rs NA 0.00 rs rs rs *Validation: proportion of significant tests (P<0.05), based on regression using the rest of the samples after SMDP stops. **ASN: average sample number used in SMDP. Conclusion: given the same sample size, SMDP-regression is more powerful than regular regression.
26 Application to Real Data: The NHLBI Family Heart Study. Illumina HumanMap550 array data; 983 subjects; Coronary Artery Calcification (CAC). SMDP identifies 69 SNPs using fewer than 811 samples. Traditional regression analysis of all 983 samples identifies: SNPs (p<0.05); 15 SNPs (FDR<0.05), 11 identified by SMDP; 1 SNP (p<0.05/500K), also identified by SMDP.
27 Summary of SMDP (advantages): Efficient use of sample size; the extra sample size after stopping can be used for validation. Simultaneously tests a group of signals, avoiding one-by-one tests and p-value adjustment. Increases power (or decreases false positives) given the same average sample size. Flexible experimental design.
28 Summary of SMDP (limitations): Compute time (needs approximation & parallelization). Requirement of the Koopman-Darmois distribution family.
29 SMDP: P*, t, D*. P*: arbitrary, 0.95. t: fixed or varied. D*: indifference zone. Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t | Pop. t+1, Pop. t+2, …, Pop. M. SMDP stopping rule: Prob. of correct selection (PCS) > P* whenever D>D*. Correct selection: populations with Q(θ) > Q(θ_t)+D* are selected. (Figure: axis marking Q(θ_t), Q(θ_t)+D* and the gap D*.)
30 References. R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential identification and ranking procedures. The University of Chicago Press, Chicago. M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19: Q. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. Proceedings of American Statistical Association, Biometrics section: 463~468.
31 Application to GWAS slide 9 slide 10
33 Zhang & Province, 2005, page 467. P*=0.95, D*=10, TCN= SNPs P<0.01
34 Simplified Stopping Rule M.A. Province, 2000 page
35 A Real Data Example ( M.A. Province, 2000, page 310)
36 Simulation Results (2) M.A. Province, 2000, page 313
37 Simplified SMDP (Bechhofer et al., 1968) U-S+1= Top Combination Number (TCN) How to choose TCN? Balance between computational accuracy and computational time
38 Relation of W and t (h=50, D*=10) Effective Top Combination Number ETCN Zhang & Province,2005,page 465
39 ETCN Curve Zhang & Province,2005,page 466
40 t =? Zhang & Province,2005,page 466
41 SMDP Summary. Advantages: tests and identifies all signals simultaneously, with no multiple comparisons; uses a “minimal” N to find significant signals (efficient); tight control of statistical errors, Type I and II (powerful); saves the rest of N for validation (reliable). Further studies: computer time; extension to more methods/models; extension to non-K-D distributions.