Qunyuan Zhang, Ingrid Borecki, Michael A. Province

Name: Qunyuan Zhang, Ingrid Borecki, Michael A. Province
Uploaded: 2017-10-10T08:31:49+00:00
Duration: PTM9S41
Channel: Lorin Rice

Statistical Methods for Rare Variant Association Test Using Summarized Data
Qunyuan Zhang, Ingrid Borecki, Michael A. Province
Division of Statistical Genomics

Motivation
Next-generation sequencing => rare variants
Two types of data:
Individual level: one row per subject (1, 2, 3, 4, 5, 6, …), with a column for each variant (V1, V2, V3) and the trait (case or control).
Summarized level: per-variant counts by group, from pooled DNA sequencing or public data (as control). Example:
Variant: V1 V2 V3
Variant No. in cases: 10 8 3
Variant No. in controls: 2 1
No. of cases: 300
No. of controls: 500

Models for Individual-level Data
Single-variant test (regular GWAS)
Collective/group test
Burden/collapsing test

Methods for Individual-level Data
CMC (Li and Leal, 2008)
WSS (Madsen and Browning, 2009)
VT (Price et al., 2010)
aSum (Han and Pan, 2010)
KBAC (Liu and Leal, 2010)
RBT (Ionita-Laza et al., 2011)
PWST (Zhang et al., 2011)
SKAT (Wu et al., 2011)
EREC (Lin et al., 2011)
…

Methods for Summarized Data
Columns: Method, Description, Bi-directional effects, Ref.
EFT (Exclusive Frequency Test): testing mutually exclusive allele/carrier freq. Bi-directional effects: ×. Ref.: commonly used in publications, such as Cohen et al., 2004.
TFT (Total Frequency Test): testing total allele/carrier freq.
CAST (Cohort Allele Sum Test): testing total allele/carrier number. Ref.: Morgenthaler & Thilly, 2006.
C-alpha: testing variance. Bi-directional effects: √. Ref.: Neale et al., 2011.

An Example of Summarized Data
Jonathan C. Cohen, et al. Science 305, 869 (2004)

An Example of Summarized Data (cont.)
                  Variant allele number   Reference allele   Total
Low-HDL group     20                      236                256
High-HDL group    2                       254                256
Total             22                      490                512

QQ Plots of Existing Methods (under the null)
[Figure: QQ plots under the null, one panel each for EFT, TFT, CAST and C-alpha]
EFT and C-alpha: inflated with false positives.
TFT and CAST: no inflation, but need to assume a single direction of effects.
Objective: more general, non-inflated, powerful methods …

Structure of Summarized data
One table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k
Strategy: instead of testing total freq./number, we test the randomness of all tables.

Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution
2. Calculating the log-transformed joint probability (L) for multiple tables
3. Enumerating all possible table combinations and their L scores
4. Calculating the p-value P = Prob.( )
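The four EPT steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`log_comb`, `table_log_prob`, `ept_p_value`) are hypothetical, the group sizes are taken as allele (or carrier) totals per group, and, because the expression inside Prob.( ) did not survive in the transcript, the p-value is assumed here to be the total probability of all table combinations whose L is no larger than the observed L.

```python
from itertools import product
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def table_log_prob(a, m, n_cases, n_controls):
    """Log hypergeometric probability that a of a variant's m copies fall in cases."""
    return (log_comb(n_cases, a) + log_comb(n_controls, m - a)
            - log_comb(n_cases + n_controls, m))


def ept_p_value(case_counts, control_counts, n_cases, n_controls):
    """Exact-probability p-value over one table per variant (assumed definition)."""
    totals = [a + b for a, b in zip(case_counts, control_counts)]
    # Steps 1-2: log-probability of each observed table, summed into L.
    l_obs = sum(table_log_prob(a, m, n_cases, n_controls)
                for a, m in zip(case_counts, totals))
    # Step 3: every table each variant could show with its margins held fixed.
    per_variant = [
        [table_log_prob(a, m, n_cases, n_controls)
         for a in range(max(0, m - n_controls), min(m, n_cases) + 1)]
        for m in totals
    ]
    # Step 4 (assumption): add up the probability of every combination that is
    # no more likely than the observed one; 1e-9 absorbs floating-point ties.
    p = 0.0
    for combo in product(*per_variant):
        l = sum(combo)
        if l <= l_obs + 1e-9:
            p += exp(l)
    return min(p, 1.0)


# Single-table case using the Cohen et al. counts (20 vs. 2 variant alleles,
# 256 alleles per group); the two-variant call uses made-up counts.
print(ept_p_value([20], [2], 256, 256))
print(ept_p_value([3, 2], [0, 1], 20, 30))
```

The enumeration in step 3 grows as the product of the number of possible tables per variant, so the cost rises quickly with the number of variants and with their counts.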

Likelihood Ratio Test (LRT)
Binomial distribution
Maximum likelihood estimation
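The slide names only the two ingredients, so the following is one plausible construction rather than the authors' exact test. It assumes that each variant's count in a group is binomial, that the null hypothesis gives cases and controls a shared frequency per variant, that the alternative lets the two frequencies differ, and that the statistic is referred to a chi-square distribution with one degree of freedom per variant. All function names are hypothetical.

```python
from math import erfc, exp, gamma, log, sqrt


def binom_loglik(k, n, p):
    """Binomial log-likelihood without the constant binomial coefficient."""
    ll = 0.0
    if k > 0:
        ll += k * log(p)
    if n - k > 0:
        ll += (n - k) * log(1.0 - p)
    return ll


def chi2_sf(x, df):
    """Upper tail of the chi-square distribution for integer df >= 1."""
    y = x / 2.0
    # Start from Q(1, y) or Q(1/2, y), then use Q(a+1, y) = Q(a, y) + y^a e^-y / Gamma(a+1).
    if df % 2 == 0:
        a, q = 1.0, exp(-y)
    else:
        a, q = 0.5, erfc(sqrt(y))
    while a < df / 2.0:
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q


def lrt_p_value(case_counts, control_counts, n_cases, n_controls):
    """Return (statistic, p-value) for the assumed per-variant binomial LRT."""
    stat = 0.0
    for a, b in zip(case_counts, control_counts):
        p_case, p_ctrl = a / n_cases, b / n_controls   # MLEs under the alternative
        p_null = (a + b) / (n_cases + n_controls)      # shared MLE under the null
        stat += 2.0 * (binom_loglik(a, n_cases, p_case)
                       + binom_loglik(b, n_controls, p_ctrl)
                       - binom_loglik(a, n_cases, p_null)
                       - binom_loglik(b, n_controls, p_null))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    # Assumption: one extra free parameter per variant, hence df = number of variants.
    return stat, chi2_sf(stat, len(case_counts))


# Cohen et al. counts: 20 vs. 2 variant alleles, 256 alleles per group.
print(lrt_p_value([20], [2], 256, 256))
```

Because every maximum likelihood estimate has a closed form, the cost is linear in the number of variants. The chi-square reference distribution is a large-sample approximation.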

Q-Q Plots of EPT and LRT (under the null)

Power Comparison significance level=0.00001
Simulation: logistic model; N = 500, 1000, 3000; 50% cases, 50% controls; 5-15 variants; MAF < 1%
Positive causal: 80% (OR = 2 to 6); Neutral: 20%; Negative causal: 0%
[Figure: three panels of power vs. sample size]

Simulation: logistic model; N = 500, 1000, 3000; 50% cases, 50% controls; 5-15 variants; MAF < 1%
Positive causal: 60% (OR = 2 to 6); Neutral: 20%; Negative causal: 20% (OR = 1/6 to 1/2)
[Figure: power vs. sample size]

Simulation: logistic model; N = 500, 1000, 3000; 50% cases, 50% controls; 5-15 variants; MAF < 1%
Positive causal: 40% (OR = 2 to 6); Neutral: 20%; Negative causal: 40% (OR = 1/6 to 1/2)
[Figure: power vs. sample size]

-LOG10 p-values of 933 cancer-related genes
Application
Cases: 460 ovarian cancer cases, germline exome data, from TCGA
Controls: ~3500 individuals from the NHLBI exome project
[Figure: -LOG10 p-values of 933 cancer-related genes]

Individual-level Data Based Methods vs. Summarized Data Based Methods
An interesting question: If we have individual-level data, but we choose to perform summarized data based analysis, will there be any power gain or loss?

positive : neutral : negative (%)
Power Comparison: individual-level data vs. summarized data
N=1000, significance level=
Individual-level data based methods: CMC (Li & Leal, 2008), SKAT (Wu et al., 2011)
[Figure: power vs. variant proportion, positive : neutral : negative (%)]

(This study has not been published)
Conclusions
EFT and C-alpha produce inflated p-values.
TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.
EPT produces correct p-values and maintains power regardless of effect directions, but needs more computer time.
LRT produces slightly biased p-values for small N, which improve with larger N; it has power similar to EPT and needs less computer time, making it a good alternative for large datasets.
If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.
(This study has not been published)

(for providing the TCGA and NHLBI exome data)
Acknowledgements
Dr. Li Ding, Charles Lu, Krishna-Latha Kanchi (for providing the TCGA and NHLBI exome data)
