1 Association Analysis of Rare Genetic Variants Qunyuan Zhang Division of Statistical Genomics Course M21-621 Computational Statistical Genetics.

Presentation transcript:

1 Association Analysis of Rare Genetic Variants Qunyuan Zhang Division of Statistical Genomics Course M21-621 Computational Statistical Genetics

2 Rare Variants
• Low allele frequency: usually less than 1%
• Low power: for most analyses, because the observations show little variation
• High false positive rate: for some model-based analyses, because sparse data lead to unstable or biased parameter estimates and inflated significance (p-values that are too small)

3 An Example of Low Power
Jonathan C. Cohen et al., Science 305, 869 (2004)

An Example of High False Positive Rate (Q-Q plots from GWAS data, unpublished)
Panels: N≈2500, MAF>0.03; N≈2500, MAF<0.03; N≈2500, MAF<0.03, permuted; N=50000, MAF<0.03, bootstrapped

5 Three Levels of Rare Variant Data
• Level 1: Individual-level
• Level 2: Summarized over subjects
• Level 3: Summarized over both subjects and variants

6 Level 1: Individual-level
Table columns: Subject, V1, V2, V3, V4, Trait-1, Trait

7 Level 2: Summarized over subjects (by group)
Jonathan C. Cohen et al., Science 305, 869 (2004)

Level 3: Summarized over subjects (by group) and variants (usually by gene)
Table columns: Variant allele number, Reference allele number, Total
Table rows: Low-HDL group, High-HDL group, Total

9 Methods For Level 3 Data

10 Single-variant Test vs. Total Freq. Test (TFT)
Jonathan C. Cohen et al., Science 305, 869 (2004)

11 What we have learned …
• A single-variant test of rare variants has very low power for detecting association, because of the extremely low frequency (usually < 0.01)
• Testing the collective effect of a set of rare variants may increase the power (sum test, collective test, group test, collapsing test, burden test …)
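The contrast between a single-variant test and a total frequency test can be sketched as follows. This is a minimal illustration, assuming a two-sided Fisher exact test on a 2x2 table of variant and reference allele counts; the allele counts and the helper name `fisher_exact_two_sided` are hypothetical and are not taken from Cohen et al.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x variant alleles in the first group
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Hypothetical data: 128 subjects (256 alleles) per group, five rare variants
n_alleles = 256
low_group = [3, 2, 4, 1, 2]    # variant allele counts in the low-trait group
high_group = [0, 0, 1, 0, 0]   # variant allele counts in the high-trait group

# Single-variant tests: one 2x2 table per variant
single_p = [fisher_exact_two_sided(l, n_alleles - l, h, n_alleles - h)
            for l, h in zip(low_group, high_group)]

# Total frequency test: pool the variant alleles over all variants first
tl, th = sum(low_group), sum(high_group)
total_p = fisher_exact_two_sided(tl, n_alleles - tl, th, n_alleles - th)

print("single-variant p-values:", [round(p, 3) for p in single_p])
print("total frequency p-value:", round(total_p, 4))
```

With these made-up counts no single variant reaches p < 0.05, while the pooled table does, which is the pattern the slide describes.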

12 Methods For Level 2 Data
• Allow different sample sizes for different variants
• Different variants can be weighted differently

13 CAST: A cohort allelic sums test
Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56
S: variant number; N: sample size
Under H0: $S_{\text{cases}}/(2N_{\text{cases}}) - S_{\text{controls}}/(2N_{\text{controls}}) = 0$
$T = S_{\text{cases}} - S_{\text{controls}}\,N_{\text{cases}}/N_{\text{controls}} = S_{\text{cases}} - S^{*}_{\text{controls}}$
(S can be calculated variant by variant and weighted differently; the final $T = \sum_i W_i S_i$)
$Z = T/\sqrt{\mathrm{Var}(T)} \sim N(0,1)$
$\mathrm{Var}(T) = \mathrm{Var}(S_{\text{cases}} - S^{*}_{\text{controls}}) = \mathrm{Var}(S_{\text{cases}}) + \mathrm{Var}(S^{*}_{\text{controls}}) = \mathrm{Var}(S_{\text{cases}}) + \mathrm{Var}(S_{\text{controls}})\,[N_{\text{cases}}/N_{\text{controls}}]^{2}$
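The CAST statistic can be computed from per-variant counts along these lines. The slide does not say how Var(S) is obtained, so this sketch assumes variant counts are approximately Poisson, giving Var(S) ≈ S; that assumption, the function name `cast_z`, and the example counts are illustrative only.

```python
from math import sqrt, erfc

def cast_z(s_cases, s_controls, n_cases, n_controls, weights=None):
    """Return (T, Z, two-sided p) for per-variant allele counts.

    ASSUMPTION: Var(S) is approximated by S (Poisson counts); the slide
    leaves the variance estimator unspecified.
    """
    w = weights if weights is not None else [1.0] * len(s_cases)
    ratio = n_cases / n_controls
    # T = sum_i w_i * (S_i(cases) - S_i(controls) * N(cases)/N(controls))
    t = sum(wi * (sc - su * ratio)
            for wi, sc, su in zip(w, s_cases, s_controls))
    # Var(T) = sum_i w_i^2 * (Var(S_i(cases)) + Var(S_i(controls)) * ratio^2)
    var_t = sum(wi ** 2 * (sc + su * ratio ** 2)
                for wi, sc, su in zip(w, s_cases, s_controls))
    z = t / sqrt(var_t)
    p = erfc(abs(z) / sqrt(2))  # two-sided p-value from N(0, 1)
    return t, z, p

# Hypothetical counts for three variants, 100 cases and 200 controls
t, z, p = cast_z([3, 2, 4], [1, 0, 1], n_cases=100, n_controls=200)
print(t, round(z, 3), round(p, 4))
```

Rescaling the control counts by N(cases)/N(controls) is what lets the two groups have different sample sizes.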

14 C-alpha
PLoS Genetics, 2011 | Volume 7 | Issue 3 | e
Effect direction problem

15 C-alpha

Q-Q Plots of Existing Methods (under the null)
Panels: CAST, C-alpha, EFT, TFT
• EFT and C-alpha: inflated with false positives
• TFT and CAST: no inflation, but they assume a single effect direction
Objective: more general, powerful methods …

17 More Generalized Methods For Level 2 Data

Structure of Level 2 data
One table per variant: variant 1, variant 2, variant 3, …, variant i, …, variant k
Strategy: instead of testing the total frequency/number, we test the randomness of all tables.

Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution
2. Calculating the log-transformed joint probability (L) for all k tables
3. Enumerating all possible tables and L scores
4. Calculating the p-value: P = Prob.( )
ASHG Meeting 1212, Zhang
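The four steps can be sketched for a small data set. The probability expression in step 4 is missing from the transcript, so this sketch assumes the p-value is the total probability of all table configurations whose joint log-probability L is no larger than the observed one. The function names and counts are hypothetical, and full enumeration is practical only for a few rare variants.

```python
from itertools import product
from math import comb, log, exp

def table_probs(m, n_case, n_ctrl):
    """Step 1: hypergeometric probability of every feasible split of
    m variant alleles between n_case case alleles and n_ctrl control alleles."""
    denom = comb(n_case + n_ctrl, m)
    lo, hi = max(0, m - n_ctrl), min(m, n_case)
    return {a: comb(n_case, a) * comb(n_ctrl, m - a) / denom
            for a in range(lo, hi + 1)}

def exact_probability_test(case_counts, ctrl_counts, n_case, n_ctrl):
    per_variant = [table_probs(a + b, n_case, n_ctrl)
                   for a, b in zip(case_counts, ctrl_counts)]
    # Step 2: joint log-probability L of the observed k tables
    l_obs = sum(log(probs[a]) for probs, a in zip(per_variant, case_counts))
    # Steps 3 and 4: enumerate all configurations and accumulate the
    # probability of those at least as unlikely as the observed one
    # (ASSUMPTION: this is the tail used for the p-value)
    p_value = 0.0
    for combo in product(*(probs.items() for probs in per_variant)):
        l = sum(log(prob) for _, prob in combo)
        if l <= l_obs + 1e-9:
            p_value += exp(l)
    return l_obs, p_value

# Hypothetical counts: two variants enriched in cases, one in controls
l_obs, p = exact_probability_test([3, 0, 2], [0, 3, 0], n_case=20, n_ctrl=20)
print(round(l_obs, 3), round(p, 4))
```

Because each table contributes through its probability rather than its sign, variants enriched in cases and variants enriched in controls both push L down instead of cancelling.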

Likelihood Ratio Test (LRT) Binomial distribution ASHG Meeting 1212, Zhang
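The slide gives only the name of the test and the distribution, so the following is one plausible reading rather than the author's formulation: for each variant, the variant allele counts in cases and controls are treated as binomial, the null model shares one allele frequency across groups, the alternative allows group-specific frequencies, and the per-variant log-likelihood ratios are summed. The function names and counts are hypothetical.

```python
from math import log

def _xlogy(x, y):
    # Convention 0 * log(0) = 0, needed when a count is zero
    return 0.0 if x == 0 else x * log(y)

def binom_loglik(x, n, p):
    # Binomial log-likelihood without the binomial coefficient,
    # which cancels in the likelihood ratio
    return _xlogy(x, p) + _xlogy(n - x, 1.0 - p)

def lrt_statistic(case_counts, ctrl_counts, n_case, n_ctrl):
    """Return (statistic, degrees of freedom) summed over variants."""
    stat = 0.0
    for a, b in zip(case_counts, ctrl_counts):
        p_null = (a + b) / (n_case + n_ctrl)          # shared frequency
        ll_null = (binom_loglik(a, n_case, p_null)
                   + binom_loglik(b, n_ctrl, p_null))
        ll_alt = (binom_loglik(a, n_case, a / n_case)  # group-specific MLEs
                  + binom_loglik(b, n_ctrl, b / n_ctrl))
        stat += 2.0 * (ll_alt - ll_null)
    return stat, len(case_counts)

stat, df = lrt_statistic([5, 0, 2], [0, 5, 2], n_case=200, n_ctrl=200)
print(round(stat, 3), df)
```

Under this reading the statistic would be compared with a chi-square distribution with one degree of freedom per variant, although that large-sample reference may be poor for very sparse counts.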

Q-Q Plots of EPT and LRT (under the null)
Panels: EPT N=500; EPT N=3000; LRT N=500; LRT N=3000

Power Comparison (significance level= )
Variant proportion: positive causal 80%, neutral 20%, negative causal 0%
Plots: power vs. sample size

Power Comparison (significance level= )
Variant proportion: positive causal 60%, neutral 20%, negative causal 20%
Plot: power vs. sample size

Power Comparison (significance level= )
Variant proportion: positive causal 40%, neutral 20%, negative causal 40%
Plot: power vs. sample size

25 Methods For Level 1 Data
• Including covariates
• Extended to quantitative traits
• Better control for population structure
• More sophisticated models

26 Collapsing (C) test
Step 1
Step 2: logit(y) = a + b*X + e (logistic regression)
Li and Leal, The American Journal of Human Genetics 2008 (83): 311–321

27 Variant Collapsing
Variant labels: (+) (.)
Table columns: Subject, V1, V2, V3, V4, Collapsed, Trait
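Collapsing can be illustrated with a small genotype matrix. In this sketch the collapsed value is 1 when a subject carries at least one rare variant and 0 otherwise. For a binary trait and a single binary predictor, the maximum-likelihood slope b of the logistic regression equals the log odds ratio of the carrier-by-trait 2x2 table, so that is computed directly here instead of fitting a model. The data and function names are hypothetical.

```python
from math import log

def collapse(genotypes):
    """1 if the subject carries any rare variant allele, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def log_odds_ratio(x, y):
    """Slope of logit(y) = a + b*x for binary x and y (requires no empty cell)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return log((a * d) / (b * c))

# Hypothetical genotypes: rows are subjects, columns are variants V1..V4
genotypes = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 0],
]
trait = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical binary trait

collapsed = collapse(genotypes)
b_hat = log_odds_ratio(collapsed, trait)
print(collapsed, round(b_hat, 3))
```

A subject carrying several variants still gets a collapsed value of 1, which is what distinguishes collapsing from a plain sum of variant counts.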

28 WSS

29 WSS

30 WSS

31 Weighted Sum Test
• Collapsing test (Li & Leal, 2008): w_i = 1, and s is set to 1 if s > 1
• Weighted-sum test (Madsen & Browning, 2009): w_i calculated from the allele frequency in the control group
• aSum, adaptive sum test (Han & Pan, 2010): w_i = -1 if b < 0 and p < 0.1, otherwise w_i = 1
• KBAC (Liu & Leal, 2010): w_i = left-tail p-value
• RBT (Ionita-Laza et al., 2011): w_i = log-scaled probability
• PWST, p-value weighted sum test (Zhang et al., 2011): w_i = rescaled left-tail p-value, incorporating both significance and direction
• EREC (Lin et al., 2011): w_i = estimated effect size
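As one concrete instance of these weighting schemes, the Madsen & Browning weights can be sketched as below. The formulas follow the published weighted-sum method as best recalled here (frequency estimated in controls with a +1/+2 adjustment, weight equal to the estimated standard deviation of the allele count, genotypes divided by the weight) and should be checked against the original paper; the data and function names are hypothetical.

```python
from math import sqrt

def madsen_browning_weights(genotypes, is_control):
    """Per-variant weights from the allele frequency in the control group."""
    n = len(genotypes)                 # all subjects
    n_u = sum(is_control)              # controls ("unaffected")
    k = len(genotypes[0])
    weights = []
    for i in range(k):
        # Variant allele count among controls for variant i
        m_u = sum(row[i] for row, c in zip(genotypes, is_control) if c)
        q = (m_u + 1) / (2 * n_u + 2)  # adjusted frequency, never exactly 0
        weights.append(sqrt(n * q * (1 - q)))
    return weights

def weighted_sum_scores(genotypes, weights):
    """Per-subject score: variant allele counts divided by their weights."""
    return [sum(g / w for g, w in zip(row, weights)) for row in genotypes]

# Hypothetical data: two cases followed by two controls, two variants
genotypes = [[1, 0], [0, 0], [0, 1], [0, 0]]
is_control = [0, 0, 1, 1]

weights = madsen_browning_weights(genotypes, is_control)
scores = weighted_sum_scores(genotypes, weights)
print([round(w, 4) for w in weights], [round(s, 4) for s in scores])
```

Variants that are rarer in controls get smaller weights and therefore contribute more to the score. In the published method the scores are then ranked and significance is obtained by permuting the phenotype, a step omitted from this sketch.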

32 When there are only causal (+) variants …
Variant labels: (+)
Table columns: Subject, V1, V2, Collapsed, Trait
Collapsing (Li & Leal, 2008) works well; power is increased

33 When there are causal (+) and non-causal (.) variants …
Variant labels: (+) (.)
Table columns: Subject, V1, V2, V3, V4, Collapsed, Trait
Collapsing still works; power is reduced

34 When there are causal (+), non-causal (.) and causal (-) variants …
Variant labels: (+) (.) (-)
Table columns: Subject, V1, V2, V3, V4, V5, V6, Collapsed, Trait
Power of the collapsing test goes down significantly

35 P-value Weighted Sum Test (PWST)
Variant labels: (+) (.) (-)
Table columns: Subject, V1, V2, V3, V4, V5, V6, Collapsed, pSum, Trait
Additional rows: t, p(x≤t), *(p-0.5)
The rescaled left-tail p-value, in [-1, 1], is used as the weight
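The weight described on this slide can be sketched as follows. Two points are assumptions: the rescaling is taken to be 2*(p - 0.5), which is the simplest map from a left-tail p-value in [0, 1] to [-1, 1], and the left-tail probability is computed from a standard normal distribution as a stand-in for the t distribution. The per-variant statistics and function names are hypothetical.

```python
from math import erf, sqrt

def rescaled_left_tail_weight(t):
    """Map a per-variant test statistic t to a weight in [-1, 1]."""
    # Left-tail p-value P(X <= t); ASSUMPTION: normal approximation
    p_left = 0.5 * (1.0 + erf(t / sqrt(2.0)))
    # ASSUMPTION: rescaling is 2 * (p - 0.5)
    return 2.0 * (p_left - 0.5)

def p_sum(genotype_row, weights):
    """Weighted sum of one subject's variant allele counts."""
    return sum(w * g for w, g in zip(weights, genotype_row))

# Hypothetical statistics for a (+), a (.) and a (-) variant
t_stats = [2.5, 0.0, -2.5]
weights = [rescaled_left_tail_weight(t) for t in t_stats]

carrier_of_first = p_sum([1, 0, 0], weights)   # carries only the (+) variant
carrier_of_last = p_sum([0, 0, 1], weights)    # carries only the (-) variant
carrier_of_all = p_sum([1, 1, 1], weights)     # opposite effects offset
print([round(w, 4) for w in weights])
```

A strongly positive variant gets a weight near +1, a strongly negative one a weight near -1, and a neutral one a weight near 0, so the sign of the weight carries the effect direction and its size carries the significance.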

36 P-value Weighted Sum Test (PWST)
The power of the collapsing test is retained even when there are bidirectional effects

37 PWST: Q-Q Plots Under the Null
Direct test: inflation of type I error
Corrected by permutation test (permutation of phenotype)
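A phenotype permutation test of the kind mentioned here can be sketched generically. The point for data-driven weights is that whatever the statistic learns from the phenotype must be re-learned on every permuted data set, which is why the whole statistic is passed in as a function. The statistic used below (difference in mean rare-allele burden between cases and controls) is only a simple stand-in, and all names and data are hypothetical.

```python
import random

def permutation_p_value(statistic, genotypes, phenotype, n_perm=999, seed=0):
    """Two-sided permutation p-value obtained by shuffling the phenotype."""
    rng = random.Random(seed)
    observed = abs(statistic(genotypes, phenotype))
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # The statistic, including any weights, is recomputed from scratch
        if abs(statistic(genotypes, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero
    return (hits + 1) / (n_perm + 1)

def mean_burden_difference(genotypes, phenotype):
    """Stand-in statistic: mean rare-allele count in cases minus controls."""
    burden = [sum(row) for row in genotypes]
    cases = [b for b, y in zip(burden, phenotype) if y == 1]
    controls = [b for b, y in zip(burden, phenotype) if y == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)

# Hypothetical data: every case carries one rare allele, no control does
genotypes = [[1, 0]] * 10 + [[0, 0]] * 10
phenotype = [1] * 10 + [0] * 10

p = permutation_p_value(mean_burden_difference, genotypes, phenotype)
print(p)
```

Shuffling the phenotype breaks any genotype-trait association while keeping the genotype structure intact, so the permuted statistics describe the null distribution even when the statistic itself was tuned on the trait.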

38 Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)

39 GLMM & WST
• Y: quantitative trait or logit(binary trait)
• α: intercept
• β: regression coefficient of the weighted sum
• m: number of RVs to be collapsed
• w_i: weight of variant i
• g_i: genotype (recoded) of variant i
• Σ w_i g_i: weighted sum (WS)
• X: covariate(s), such as population structure variable(s)
• τ: fixed effect(s) of X
• Z: design matrix corresponding to γ
• γ: random polygenic effects for individual subjects, γ ~ N(0, G), G = 2σ²K, where K is the kinship matrix and σ² the additive polygenic genetic variance
• ε: residual
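The model equation itself did not survive in the transcript. Assembled from the symbol definitions on this slide, it presumably has the following form; this is a reconstruction, not a quotation of the slide.

```latex
Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon,
\qquad \gamma \sim N(0, G), \quad G = 2\sigma^{2} K
```

Testing association then amounts to testing β = 0, with the random effect γ absorbing the correlation among related subjects.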

40 Weight
• Based on allele frequency: binary (0, 1) or continuous, fixed or variable threshold
• Based on functional annotation/prediction: SIFT, PolyPhen, etc.
• Based on sequencing quality (coverage, mapping quality, genotyping quality, etc.)
• Data-driven: using both genotype and phenotype data, learning weights from the data or adaptive selection, permutation test
• Any combination …

41 Application 1: Family Data
Adjusting for relatedness in family data for non-data-driven tests of rare variants.
γ ~ N(0, 2σ²K)
Unadjusted:
Adjusted:

42 Q-Q Plots of –log10(P) under the Null (From Zhang et al., 2011, BMC Proc.)
• Li & Leal's collapsing test, ignoring family structure: inflation of type-1 error
• Li & Leal's collapsing test, modeling family structure via GLMM: inflation is corrected

43 Application 2: Permuting Family Data
MMPT: Mixed Model-based Permutation Test
Adjusting for relatedness in family data for data-driven permutation tests of rare variants.
γ ~ N(0, 2σ²K)
Figure labels: Permuted; Non-permuted, subject IDs fixed

44 Q-Q Plots under the Null (From Zhang et al., 2011, IGES Meeting)
Panels: WSS, SPWST, PWST, aSum
Permutation test, ignoring family structure: inflation of type-1 error

Q-Q Plots under the Null (From Zhang et al., 2011, IGES Meeting)
Panels: WSS, SPWST, PWST, aSum
Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected

46 Burden Test vs. Non-burden Test
Burden test: T-test, Likelihood Ratio Test, F-test, score test, …
Non-burden test: SKAT (sequence kernel association test)
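The difference between the two families of tests can be sketched with per-variant score statistics for a quantitative trait and no covariates. The forms used here, burden = (Σ w_j S_j)² and SKAT-type = Σ (w_j S_j)², follow a common convention in the SKAT literature, but the equal weights, the data, and the function names are assumptions made for illustration.

```python
def score_per_variant(genotypes, trait):
    """S_j = sum_i g_ij * (y_i - mean(y)) for each variant j."""
    mean = sum(trait) / len(trait)
    resid = [y - mean for y in trait]
    k = len(genotypes[0])
    return [sum(row[j] * r for row, r in zip(genotypes, resid))
            for j in range(k)]

def burden_statistic(scores, weights):
    # Sum first, then square: opposite-direction effects cancel
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def skat_statistic(scores, weights):
    # Square first, then sum: opposite-direction effects accumulate
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

# Hypothetical data: variant 1 raises the trait, variant 2 lowers it
genotypes = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
trait = [2.0, 2.0, -2.0, -2.0, 0.0, 0.0]
weights = [1.0, 1.0]  # ASSUMPTION: equal weights

scores = score_per_variant(genotypes, trait)
q_burden = burden_statistic(scores, weights)
q_skat = skat_statistic(scores, weights)
print(scores, q_burden, q_skat)
```

In this example the burden statistic is zero while the SKAT-type statistic is large. Obtaining a p-value for the SKAT-type statistic requires its null distribution (a mixture of chi-squares), which this sketch does not compute.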

Extension of SKAT to Family Data
Model term labels: kinship matrix; polygenic heritability of the trait; residual
Han Chen et al., 2012, Genetic Epidemiology

49 Other problems
• Missing genotypes & imputation
• Genotyping errors & QC (family consistency, sequence review)
• Population stratification
• Inherited variants and de novo mutation
• Family data & linkage information
• Variant validation and association validation
• Public databases
• And more …