Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans Qunyuan Zhang Division of Statistical Genomics Washington University.


Sequential & Multiple Hypothesis Testing Procedures for Genome-wide Association Scans Qunyuan Zhang Division of Statistical Genomics Washington University School of Medicine

2 Multiple Comparison (strategy 1)
Type I error = false positive; Type II error = false negative (the trade-off between them determines whether power is high or low)
• P-value adjustment/correction (Bonferroni, FDR)
• Empirical p-value (permutation, bootstrap)
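
As a concrete illustration of the p-value adjustments named on this slide, here is a minimal sketch of the Bonferroni and Benjamini-Hochberg (FDR) corrections. The function names and the example p-values are made up for illustration; they are not from the presentation.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values, step-up procedure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five SNP tests.
pvals = [0.001, 0.01, 0.03, 0.04, 0.5]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

With these inputs Bonferroni keeps only the first two tests at the 0.05 level, while the FDR adjustment keeps four, which is the power difference the slide alludes to.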

3 Multiple Comparison (strategy 2)
Type I error = false positive; Type II error = false negative
• Larger sample size
• Meta analysis
• Biological info or evidence
• ……
• More powerful statistical approach: SMDP (Sequential Multiple Decision Procedure)

4 What is SMDP?
• A generalized framework for ranking and selection, using optimum sample sizes
• A combination of sequential analysis and multiple hypothesis testing

5 Feature 1 of SMDP: Sequential Analysis
• Start from a small sample size n0
• Increase the sample size (n0+1, n0+2, …, n0+i, …), with a sequential test at each stage
• Stop when the stopping rule is satisfied

6 Feature 2 of SMDP: Multiple Decision
• Independent tests (binary hypothesis test): test 1 for SNP1, test 2 for SNP2, …, test n for SNPn
• Simultaneous test (multiple hypothesis test): SNP1, SNP2, …, SNPn are split into a signal group and a noise group

7 Binary Hypothesis Test used by traditional methods
test 1: H0: Eff.(SNP1)=0 vs. H1: Eff.(SNP1)≠0
test 2: H0: Eff.(SNP2)=0 vs. H1: Eff.(SNP2)≠0
test 3 … test 6: ……
test n: H0: Eff.(SNPn)=0 vs. H1: Eff.(SNPn)≠0
Test-wise error and genome-wise error: the multiple testing issue

8 Multiple Hypothesis Test used by SMDP
H: any t SNPs are truly different from the other (n-t) SNPs
H1: SNP1,2,3 are truly different from the others
H2: SNP1,2,4 are truly different from the others
H3 ……
H4 ……
H5: SNP4,5,6 are truly different from the others
H6 ………
Hu: SNPn,n-1,n-2 are truly different from the others
u = number of all possible combinations of t out of n
Goal: search for the best one
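
To make the size of this hypothesis space concrete, the sketch below enumerates the candidate hypotheses for a toy panel of six SNPs with t = 3 and counts them with the binomial coefficient. The SNP labels and the helper name `candidate_hypotheses` are illustrative only.

```python
from itertools import combinations
from math import comb


def candidate_hypotheses(snps, t):
    """Every way of choosing t SNPs as the 'truly different' group."""
    return list(combinations(snps, t))


snps = ["SNP1", "SNP2", "SNP3", "SNP4", "SNP5", "SNP6"]
hyps = candidate_hypotheses(snps, 3)

print(hyps[0])                   # ('SNP1', 'SNP2', 'SNP3'), i.e. H1 on the slide
print(len(hyps), comb(6, 3))     # u for n = 6, t = 3
print(comb(1000, 3))             # u grows quickly with n
```

Even at n = 1000 and t = 3 the count is already in the hundreds of millions, so explicit enumeration is only practical for small panels.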

9 General Rule of SMDP (Bechhofer et al., 1968)
Selecting the t best of M K-D populations
Sequential sampling: stages 1, 2, …, h, h+1, …
Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t (the t best), separated by distance D from Pop. t+1, Pop. t+2, …, Pop. M
Sequential statistic at stage h: Y 1,h, Y 2,h, …, Y t,h, …, Y M,h
U possible combinations of t out of M; the rule is evaluated for each combination u
Stopping rule: Prob. of correct selection (PCS) > P*, whenever D > D*

10 Koopman-Darmois (K-D) Populations (Bechhofer et al., 1968)
The frequency/density function of a K-D population can be written in the form:
f(x) = exp{P(x)Q(θ) + R(x) + S(θ)}
A. The normal density function with unknown mean and known variance;
B. The normal density function with unknown variance and known mean;
C. The exponential density function with unknown scale parameter and known location parameter;
D. The Poisson distribution with unknown mean;
……
The distance of two K-D populations
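
As a worked check of the K-D form, cases A and B can be written out explicitly. This is standard algebra on the normal density, not taken from the slides; the symbols μ and σ² for the known and unknown parameters are notation chosen here.

```latex
% Case A: normal, unknown mean \theta, known variance \sigma^2
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\left\{-\frac{(x-\theta)^2}{2\sigma^2}\right\}
     = \exp\!\left\{ x\cdot\frac{\theta}{\sigma^2}
       - \frac{x^2}{2\sigma^2} - \tfrac12\ln(2\pi\sigma^2)
       - \frac{\theta^2}{2\sigma^2} \right\}
% so P(x) = x,\quad Q(\theta) = \theta/\sigma^2,\quad
%    R(x) = -x^2/(2\sigma^2) - \tfrac12\ln(2\pi\sigma^2),\quad
%    S(\theta) = -\theta^2/(2\sigma^2)

% Case B: normal, known mean \mu, unknown variance \theta = \sigma^2
f(x) = \exp\!\left\{ (x-\mu)^2\cdot\left(-\frac{1}{2\sigma^2}\right)
       - \tfrac12\ln(2\pi\sigma^2) \right\}
% so P(x) = (x-\mu)^2,\quad Q(\sigma^2) = -1/(2\sigma^2),\quad
%    R(x) = 0,\quad S(\sigma^2) = -\tfrac12\ln(2\pi\sigma^2)
```

In case B the data enter only through squared deviations from the known mean, which is why a sum of squared residuals is the natural statistic when SMDP is combined with regression.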

11

12 Combine SMDP With Regression Model (M.A. Province, 2000, page 319)
Case B: the normal density function with unknown variance and known mean

13 SMDP - Regression (M.A. Province, 2000)
Data pairs for a marker: (Z 1, X 1), (Z 2, X 2), (Z 3, X 3), …, (Z h, X h), (Z h+1, X h+1), …, (Z N, X N)
Sequential sum of squares of regression residuals
Y i,h denotes Y for marker i at stage h (see slide 7)
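
The slide's formula did not survive in the transcript, so the following is only a sketch of what a sequential residual sum of squares could look like. It assumes a simple linear regression of Z on X that is refit on the first h data pairs at each stage; both the direction of the regression and the refitting are assumptions, and the function names are hypothetical.

```python
def residual_ss(z, x):
    """Residual sum of squares from the least-squares line z = a + b*x."""
    n = len(z)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    slope = sxz / sxx if sxx > 0 else 0.0  # constant x: fall back to mean-only fit
    intercept = mean_z - slope * mean_x
    return sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z))


def sequential_residual_ss(z, x, h_start=3):
    """Residual SS for one marker at each stage h = h_start, ..., N (assumed form)."""
    return {h: residual_ss(z[:h], x[:h]) for h in range(h_start, len(z) + 1)}


# Toy marker: genotype coded 0/1/2, phenotype exactly linear in genotype.
x = [0, 1, 2, 0, 1, 2]
z = [1.0, 3.0, 5.0, 1.0, 3.0, 5.0]
print(sequential_residual_ss(z, x))
```

Refitting from scratch at every stage costs O(h) per marker per stage; a production version would more likely update running sums incrementally.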

14 A Real Data Example (M.A. Province, 2000, page 308)

15 Simulation Results (M.A. Province, 2000, page 312)

16 SMDP: Computational Problem
Sequential stages: …, h, h+1, …, N
Statistics at stage h: Y 1,h, Y 2,h, …, Y k,h, Y k+1,h, Y k+2,h, …, Y M,h
U sums for the U possible combinations of t out of M; each sum contains t members of Y i,h
Computer time?

17 Simplified Stopping Rule (Zhang & Province, 2005)
U-S+1 = Top Combination Number (TCN)
When TCN=2 (i.e. S=U-1, U-S=1) => the simplest stopping rule
When TCN=U (i.e. S=1, U-S=U-1) => the original stopping rule
How to choose TCN? Balance between computational accuracy and computational time
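
The formula for the stopping rule is not preserved in this transcript. The sketch below assumes the Bechhofer-type form, stopping when the sum of exp(-D*(W_best - W_u)) over the competing combinations is at most (1-P*)/P*, where W_u is the sum of the t statistics in combination u; the simplified rule then keeps only the TCN-1 closest competitors. Treat that form, the function name `stop_now`, and the brute-force enumeration as assumptions made for illustration.

```python
from itertools import combinations
from math import exp


def stop_now(y, t, d_star, p_star, tcn=None):
    """Evaluate the (assumed) stopping rule at one stage.

    y      : per-population statistics Y_{i,h} at the current stage
    t      : number of populations to select
    tcn    : Top Combination Number; None means use all U combinations
    """
    # Brute force: all U sums of t statistics, largest first.
    sums = sorted((sum(c) for c in combinations(y, t)), reverse=True)
    if tcn is None:
        tcn = len(sums)
    best = sums[0]
    # Compare the best combination with the next tcn-1 combinations only.
    total = sum(exp(-d_star * (best - w)) for w in sums[1:tcn])
    return total <= (1.0 - p_star) / p_star


y = [3.0, 0.0, 0.0]
print(stop_now(y, t=1, d_star=1.0, p_star=0.95))          # original rule, TCN = U
print(stop_now(y, t=1, d_star=1.0, p_star=0.95, tcn=2))   # simplest rule, TCN = 2
```

Under this assumed form, truncating the sum can only make it smaller, so a smaller TCN never stops later than the original rule. That is the accuracy-versus-time balance the slide refers to.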

18

19 Application to Pharmacal Genetics Data
Sample size: 85 cell lines
Genotype: 5841 SNPs
Phenotype: ViabFu7
P*=0.95 D*=10 TCN= SNPs P<0.01

20 SMDP for GWAS
Some technical/programming problems:
1. Computer time (approximation & parallelization)
2. Missing data
3. Stability at early stage
4. Rare SNPs
SMDP can now be run on GWAS data (500K chip, 1000 subjects) within 10 hours on a cluster

21 Simulation SNPs 1 true signal 500 replications

22 Simulation 2: Multiple signals
Genotype data: GAW16 problem 3, 500K SNP data
Phenotype data: simulated LDL (measured at the first visit), ~6500 subjects, 200 replications
Analyses:
• For each replication, randomly draw 1000 SNPs without true effects and 10 SNPs with minor polygenic effects, and keep all 6 SNPs with relatively major effects, to create a subset of genotypes
• Recode the genotypes to 0, 1 and 2 according to the copy number of minor alleles
• Apply SMDP to the selected data and repeat the analysis over 200 replications

23 Modified SMDP (analysis procedure)
(1) Start the analysis (or experiment) from a small sample size;
(2) Perform multiple decision analysis to simultaneously test whether a group of markers is significant;
(3) Eliminate significant markers from the list (if any are identified);
(4) Add one or multiple new samples to the data;
(5) Repeat (2), (3), (4) …
(6) Stop the procedure when all samples have been used and no more markers are identified.
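
The six steps above can be sketched as a loop. The multiple decision step itself is left as a caller-supplied function (`decide`, a hypothetical name), since the slide does not spell it out; `n0` and `step` are likewise illustrative parameters. The early exit when no markers remain is an addition for convenience, not part of the slide.

```python
def modified_smdp(samples, markers, decide, n0=10, step=1):
    """Skeleton of the modified SMDP loop.

    decide(current_samples, remaining_markers) must return the markers it
    declares significant at the current sample size (possibly an empty list).
    Returns (identified, remaining), where identified holds (marker, n) pairs.
    """
    total = len(samples)
    remaining = list(markers)
    identified = []
    n = min(n0, total)                                   # step 1
    while remaining:
        hits = [m for m in decide(samples[:n], remaining) if m in remaining]  # step 2
        for m in hits:                                   # step 3
            remaining.remove(m)
            identified.append((m, n))
        if n >= total and not hits:                      # step 6
            break
        n = min(n + step, total)                         # step 4, then repeat (step 5)
    return identified, remaining


# Toy decision rule: a marker is flagged once enough samples have accrued.
thresholds = {"a": 12, "b": 15, "c": 100}
toy_decide = lambda cur, rem: [m for m in rem if len(cur) >= thresholds[m]]

found, left = modified_smdp(list(range(20)), ["a", "b", "c"], toy_decide)
print(found, left)
```

Recording the sample size at which each marker is identified makes it easy to compute the average sample number and to see how many samples are left over for validation.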

24 ROC Curves of SMDP and Regular Regression Analyses
Ar, Br: Regular regression using all samples
As, Bs: SMDP analyses
Ars, Brs: Regular regression using SMDP's average sample sizes (ASN)
Ar, As and Ars: Analysis of SNPs with major effects
Br, Bs and Brs: Analysis of SNPs with minor effects
ASN: the average sample size used in SMDP, presented as a proportion of the entire sample size

25 Power comparison of SMDP and regular regression (type I error rate = )
Table columns: SNPs with true effects | Simulated h2 | SMDP power | SMDP ASN* | Validation* | Power of regular regression using ASN
Table rows: six SNPs (rs…), values not preserved in the transcript
*ASN: Average sample number used in SMDP
*Validation: Proportion of significant tests (P<0.05), based on regression using the rest of the samples after SMDP stops
Conclusion: given the same sample size, SMDP-regression is more powerful than regular regression.

26 Application to Real Data
The NHLBI Family Heart Study
Illumina HumanMap550 array data, 983 subjects
Coronary Artery Calcification (CAC)
SMDP identifies 69 SNPs using fewer than 811 samples
Traditional regression analysis of all 983 samples identifies:
• SNPs (p<0.05)
• 15 SNPs (FDR<0.05), 11 identified by SMDP
• 1 SNP (p<0.05/500K), also identified by SMDP

27 Summary of SMDP (advantages)
• Efficient use of sample size; extra samples left after stopping can be used for validation
• Simultaneously tests a group of signals, avoiding one-by-one tests and p-value adjustment
• Increases power (or decreases false positives) given the same average sample size
• Flexible experimental design

28 Summary of SMDP (limitations)
• Compute time (needs approximation & parallelization)
• Requirement of the Koopman-Darmois distribution family

29 SMDP: P*, t, D*
P*: arbitrary, 0.95
t: fixed or varied
D*: indifference zone
Populations: Pop. 1, Pop. 2, …, Pop. t-1, Pop. t | Pop. t+1, Pop. t+2, …, Pop. M
SMDP stopping rule: Prob. of correct selection (PCS) > P* whenever D > D*
Correct selection: populations with Q(θ) > Q(θ t) + D* are selected

30 References
• R.E. Bechhofer, J. Kiefer, M. Sobel. 1968. Sequential identification and ranking procedures. The University of Chicago Press, Chicago.
• M.A. Province. 2000. A single, sequential, genome-wide test to identify simultaneously all promising areas in a linkage scan. Genetic Epidemiology, 19:
• Q. Zhang, M.A. Province. 2005. Simplified sequential multiple decision procedures for genome scans. Proceedings of American Statistical Association, Biometrics section: 463-468

31 Application to GWAS slide 9 slide 10

33 Zhang & Province, 2005, page 467
P*=0.95 D*=10 TCN= SNPs P<0.01

34 Simplified Stopping Rule M.A. Province, 2000 page

35 A Real Data Example (M.A. Province, 2000, page 310)

36 Simulation Results (2) M.A. Province, 2000, page 313

37 Simplified SMDP (Bechhofer et al., 1968)
U-S+1 = Top Combination Number (TCN)
How to choose TCN? Balance between computational accuracy and computational time

38 Relation of W and t (h=50, D*=10)
Effective Top Combination Number (ETCN)
Zhang & Province, 2005, page 465

39 ETCN Curve
Zhang & Province, 2005, page 466

40 t = ?
Zhang & Province, 2005, page 466

41 SMDP Summary
Advantages:
• Test and identify all signals simultaneously, no multiple comparisons
• Use "minimal" N to find significant signals: efficient
• Tight control of statistical errors (Type I, II): powerful
• Save the rest of N for validation: reliable
Further studies:
• Computer time
• Extension to more methods/models
• Extension to non-K-D distributions