Statistical Methods for Rare Variant Association Test Using Summarized Data
Qunyuan Zhang, Ingrid Borecki, Michael A. Province
Division of Statistical Genomics

Motivation
Next generation sequencing => rare variants
Two types of data: individual level and summarized level

Individual level
Subject   V1   V2   V3   Trait
1         0    0    0    case
2         1    0    0
3         0    0    0
4         0    0    0    control
5         0    0    0
6         0    0    1
…         …    …    …    …

Summarized level (pooled DNA sequencing; public data as control)
Variant                   V1   V2   V3
Variant No. in cases      10   8    3
Variant No. in controls   2    0    1
No. of cases              300
No. of controls           500

Existing Methods
(Method: description; handles bi-directional effects; reference)
EFT: Exclusive Frequency Test, testing mutually exclusive allele/carrier freq.; bi-directional effects: ×; ref.: commonly used in publications, such as Cohen et al., 2004
TFT: Total Frequency Test, testing total allele/carrier freq.; bi-directional effects: ×
CAST: Cohort Allele Sum Test, testing total allele/carrier number; bi-directional effects: ×; ref.: Morgenthaler & Thilly, 2006
C-alpha: testing variance; bi-directional effects: √; ref.: Neale et al., 2011

Q-Q Plots of Existing Methods (under the null)
[Q-Q plot panels: CAST, C-alpha, EFT, TFT]
EFT and C-alpha: inflated with false positives
TFT and CAST: no inflation, but assume a single effect direction
Objective: more general, powerful methods …

Structure of Summarized Data
[Figure: one table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k]
Strategy: Instead of testing total freq./number, we test the randomness of all tables.

Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution
2. Calculating the log-transformed joint probability (L) for all k tables
3. Enumerating all possible tables and L scores
4. Calculating the p-value: P = Prob.( )
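The four EPT steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. Assumptions: counts are carrier counts per variant, each table's margins (total carriers, number of cases, number of controls) are held fixed, and the p-value is taken as the total probability of all table sets whose joint log-probability L is no larger than the observed one (the exact expression inside Prob.( ) did not survive on the slide). The function names `hypergeom_pmf` and `ept_pvalue` and the example counts are hypothetical. Full enumeration grows quickly with the number of variants, so this sketch only suits small k.

```python
from itertools import product
from math import comb, exp, log


def hypergeom_pmf(a, m, n_cases, n_controls):
    """Probability that a of the m carriers are cases, with all margins fixed."""
    return comb(n_cases, a) * comb(n_controls, m - a) / comb(n_cases + n_controls, m)


def ept_pvalue(case_counts, control_counts, n_cases, n_controls):
    """Exact p-value by enumerating every possible set of k tables.

    ASSUMPTION: p = sum of joint probabilities of all table sets whose
    log joint probability L is <= the observed L.
    """
    per_variant_logp = []  # log-probabilities of all feasible tables, per variant
    l_obs = 0.0            # observed joint log-probability (step 2)
    for a_obs, b_obs in zip(case_counts, control_counts):
        m = a_obs + b_obs  # total carriers of this variant (fixed margin)
        lo, hi = max(0, m - n_controls), min(m, n_cases)
        # Step 1: hypergeometric probability of each feasible table
        logp = {a: log(hypergeom_pmf(a, m, n_cases, n_controls))
                for a in range(lo, hi + 1)}
        per_variant_logp.append(list(logp.values()))
        l_obs += logp[a_obs]

    tol = 1e-9  # guards against floating-point ties with the observed L
    p = 0.0
    # Steps 3 and 4: enumerate all table sets, accumulate those as extreme as observed
    for combo in product(*per_variant_logp):
        l_score = sum(combo)
        if l_score <= l_obs + tol:
            p += exp(l_score)
    return min(p, 1.0)


if __name__ == "__main__":
    # Illustrative counts: three variants, 300 cases, 500 controls
    print(ept_pvalue([10, 8, 3], [2, 0, 1], 300, 500))
```

Because L sums over all k tables without regard to the direction of each deviation, variants enriched in cases and variants enriched in controls both push the observed L downward, which is consistent with the bi-directional behaviour the slides attribute to EPT.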

Likelihood Ratio Test (LRT) Binomial distribution
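The formulas for this slide were not preserved, so the following Python sketch shows only one plausible way to build a binomial likelihood ratio test from summarized counts. Assumptions, none confirmed by the source: carrier counts in cases and controls are modeled as independent binomials per variant, the null hypothesis is a shared carrier frequency in both groups, the statistic is twice the summed log-likelihood difference across variants, and it is referred to a chi-square distribution with k degrees of freedom. The names `chi2_sf` and `lrt_pvalue` are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper tail of a chi-square distribution with integer df (closed-form series)."""
    t = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-t)              # df = 2 base case
    else:
        a, q = 0.5, math.erfc(math.sqrt(t))   # df = 1 base case
    while a < df / 2.0:
        # Recurrence Q(a + 1, t) = Q(a, t) + t^a * exp(-t) / Gamma(a + 1)
        q += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return q


def _binom_loglik(x, n, p):
    """x*log(p) + (n - x)*log(1 - p), treating 0*log(0) as 0."""
    ll = 0.0
    if x > 0:
        ll += x * math.log(p)
    if n - x > 0:
        ll += (n - x) * math.log(1.0 - p)
    return ll


def lrt_pvalue(case_counts, control_counts, n_cases, n_controls):
    """Return (statistic, p-value) for the ASSUMED binomial LRT described above."""
    stat = 0.0
    for a, b in zip(case_counts, control_counts):
        p_pooled = (a + b) / (n_cases + n_controls)  # null: one shared frequency
        ll_null = (_binom_loglik(a, n_cases, p_pooled)
                   + _binom_loglik(b, n_controls, p_pooled))
        ll_alt = (_binom_loglik(a, n_cases, a / n_cases)      # alternative:
                  + _binom_loglik(b, n_controls, b / n_controls))  # separate frequencies
        stat += 2.0 * (ll_alt - ll_null)
    stat = max(stat, 0.0)  # clip tiny negative values caused by rounding
    return stat, chi2_sf(stat, len(case_counts))


if __name__ == "__main__":
    # Illustrative counts: three variants, 300 cases, 500 controls
    print(lrt_pvalue([10, 8, 3], [2, 0, 1], 300, 500))
```

The chi-square reference distribution is an asymptotic approximation, which is one reason a test of this kind can be miscalibrated when counts are small and improves as N grows.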

Q-Q Plots of EPT and LRT (under the null)
[Q-Q plot panels: EPT N=500, EPT N=3000, LRT N=500, LRT N=3000]

Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 80%, neutral 20%, negative causal 0%
[Plots: power vs. sample size]

Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 60%, neutral 20%, negative causal 20%
[Plot: power vs. sample size]

Power Comparison (significance level = 0.00001)
Variant proportion: positive causal 40%, neutral 20%, negative causal 40%
[Plot: power vs. sample size]

Power Comparison: individual-level data vs. summarized data (N = 1000, significance level = 0.00001)
[Plot: power vs. variant proportion, positive : neutral : negative (%)]
Individual-level methods shown: CMC (Li & Leal, 2008), SKAT (Wu et al., 2011)

Application
-log10 p-values of 933 cancer-related genes
Cases: 460 ovarian cancer cases, germline exome data, from TCGA
Controls: ~3500 individuals, exome data, from NHLBI

Conclusions
- EFT and C-alpha produce inflated p-values.
- TFT and CAST produce correct p-values, but lose power in detecting bi-directional effects.
- EPT produces correct p-values and maintains power regardless of effect directions, but requires more computer time.
- LRT produces slightly biased p-values for small N, which improves with larger N; it has power similar to EPT and needs less computer time, making it a good alternative for large datasets.
- If no confounders need to be modeled, there is no significant loss of power in the use of summarized data.

Acknowledgements
Dr. Li Ding, Charles Lu, Krishna-Latha Kanchi (for providing the TCGA and NHLBI exome data)
