1 Association Analysis of Rare Genetic Variants Qunyuan Zhang Division of Statistical Genomics Course M Computational Statistical Genetics
2 Rare Variants. Low allele frequency: usually less than 1%. Low power: for most analyses, because the observations show little variation. High false positive rate: for some model-based analyses, due to sparsely distributed data, unstable or biased parameter estimation, and inflated p-values.
3 An Example of Low Power (Jonathan C. Cohen et al., Science 305, 869 (2004))
An Example of High False Positive Rate (Q-Q plots from GWAS data, unpublished). Panels: N=~2500, MAF>0.03; N=~2500, MAF<0.03; N=~2500, MAF<0.03, permuted; N=50000, MAF<0.03, bootstrapped.
5 Three Levels of Rare Variant Data. Level 1: Individual-level. Level 2: Summarized over subjects. Level 3: Summarized over both subjects and variants.
6 Level 1: Individual-level. Table columns: Subject, V1, V2, V3, V4, Trait-1, Trait
7 Level 2: Summarized over subjects (by group) (Jonathan C. Cohen et al., Science 305, 869 (2004))
Level 3: Summarized over subjects (by group) and variants (usually by gene). 2×2 table: rows are Low-HDL group, High-HDL group, Total; columns are Variant allele number, Reference allele number, Total.
9 Methods For Level 3 Data
10 Single-variant Test vs Total Frequency Test (TFT) (Jonathan C. Cohen et al., Science 305, 869 (2004))
11 What we have learned … Single-variant tests of rare variants have very low power for detecting association, because of the extremely low allele frequency (usually < 0.01). Testing the collective effect of a set of rare variants may increase power (sum test, collective test, group test, collapsing test, burden test…).
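The contrast between single-variant testing and a total frequency test can be sketched on Level 3 style counts. This is a minimal illustration, assuming the TFT is a Fisher-type exact test on the pooled 2×2 table of variant versus reference alleles; the counts are invented for the example (they are not Cohen et al.'s data), and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that row 1 holds x of the col1 variant alleles.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Sum all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Hypothetical counts (invented for illustration): 5 rare variants,
# 128 subjects (256 alleles per site) in each group.
alleles_per_group = 256
low_hdl = [3, 4, 3, 5, 5]    # variant alleles per site, low-HDL group
high_hdl = [0, 1, 0, 1, 1]   # variant alleles per site, high-HDL group

# Single-variant tests: one 2x2 table per variant.
single_p = [
    fisher_exact_two_sided(x, alleles_per_group - x, y, alleles_per_group - y)
    for x, y in zip(low_hdl, high_hdl)
]

# Total frequency test: pool variant and reference alleles over all sites.
k = len(low_hdl)
tot_low, tot_high = sum(low_hdl), sum(high_hdl)
tft_p = fisher_exact_two_sided(
    tot_low, k * alleles_per_group - tot_low,
    tot_high, k * alleles_per_group - tot_high,
)
print(single_p, tft_p)
```

With these made-up counts no single variant reaches nominal significance, while the pooled table does, which is the pattern the slide describes.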
12 Methods For Level 2 Data. Allow different sample sizes for different variants. Different variants can be weighted differently.
13 CAST: A cohort allelic sums test (Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56)
S: variant number; N: sample size
Under H0: S(cases)/2N(cases) − S(controls)/2N(controls) = 0
T = S(cases) − S(controls) × N(cases)/N(controls) = S(cases) − S*(controls)
(S can be calculated variant by variant and can be weighted differently; the final T = sum(W_i S_i))
Var(T) = Var(S(cases) − S*(controls)) = Var(S(cases)) + Var(S*(controls)) = Var(S(cases)) + Var(S(controls)) × [N(cases)/N(controls)]^2
Z = T/SQRT(Var(T)) ~ N(0,1)
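The CAST statistic above can be sketched in a few lines. The slide does not say how Var(S) is obtained, so this sketch assumes each S is binomial, S ~ Binomial(2N, p), with p pooled over cases and controls under H0, and it assumes variants are independent so that per-variant variances add. The function name `cast_z` and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def cast_z(s_cases, s_controls, n_cases, n_controls, weights=None):
    """CAST-style Z test from per-variant allele counts (Level 2 data)."""
    weights = weights or [1.0] * len(s_cases)
    ratio = n_cases / n_controls
    t = var_t = 0.0
    for w, sc, su in zip(weights, s_cases, s_controls):
        # ASSUMPTION: S ~ Binomial(2N, p), with p pooled over groups under H0.
        p = (sc + su) / (2 * (n_cases + n_controls))
        var_sc = 2 * n_cases * p * (1 - p)
        var_su = 2 * n_controls * p * (1 - p)
        t += w * (sc - su * ratio)                        # S(cases) - S*(controls)
        var_t += w ** 2 * (var_sc + var_su * ratio ** 2)  # variances add (independence)
    z = t / sqrt(var_t)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))          # two-sided
    return t, z, p_value

# Hypothetical counts: 5 variants, 128 cases and 128 controls.
cast_t, cast_zval, cast_p = cast_z([3, 4, 3, 5, 5], [0, 1, 0, 1, 1], 128, 128)
print(cast_t, cast_zval, cast_p)
```

Passing a `weights` list gives the weighted form T = sum(W_i S_i); a per-variant sample size would be a straightforward extension.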
14 C-alpha PLOS Genetics, 2011 | Volume 7 | Issue 3 | e Effect direction problem
15 C-alpha
QQ Plots of Existing Methods (under the null). Panels: CAST, C-alpha, EFT, TFT. EFT and C-alpha are inflated with false positives. TFT and CAST show no inflation, but assume a single effect direction. Objective: more general, more powerful methods …
17 More Generalized Methods For Level 2 Data
Structure of Level 2 data: one table per variant (variant 1, variant 2, variant 3, …, variant i, …, variant k). Strategy: instead of testing total frequency/number, we test the randomness of all tables.
Exact Probability Test (EPT) (ASHG Meeting 1212, Zhang)
1. Calculating the probability of each table based on the hypergeometric distribution
2. Calculating the log-transformed joint probability (L) for all k tables
3. Enumerating all possible tables and L scores
4. Calculating the p-value: P = Prob.( )
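The four EPT steps can be sketched by brute-force enumeration for a small number of variants. The argument of Prob.( ) is missing in the source, so this sketch assumes the p-value is the total probability of all table sets whose joint log-probability L is no larger than the observed one; that rule, the function names, and the counts are assumptions made for illustration.

```python
from itertools import product
from math import comb, log

def table_probs(m, n_case, n_ctrl):
    """Step 1: hypergeometric probability of every 2x2 table with fixed margins
    (m variant alleles in total, n_case and n_ctrl alleles per group)."""
    denom = comb(n_case + n_ctrl, m)
    lo, hi = max(0, m - n_ctrl), min(m, n_case)
    return {a: comb(n_case, a) * comb(n_ctrl, m - a) / denom
            for a in range(lo, hi + 1)}

def exact_probability_test(case_counts, ctrl_counts, n_case, n_ctrl):
    per_variant = [table_probs(a + b, n_case, n_ctrl)
                   for a, b in zip(case_counts, ctrl_counts)]
    # Step 2: observed log joint probability L over all k tables.
    l_obs = sum(log(pv[a]) for pv, a in zip(per_variant, case_counts))
    # Steps 3-4: enumerate every combination of tables and accumulate the
    # probability of those that are no more likely than the observed set.
    # ASSUMPTION: p = Prob(L <= L_obs); the source leaves Prob.( ) blank.
    p_value = 0.0
    for combo in product(*(pv.values() for pv in per_variant)):
        l_score = sum(log(p) for p in combo)
        if l_score <= l_obs + 1e-9:
            joint = 1.0
            for p in combo:
                joint *= p
            p_value += joint
    return l_obs, p_value

# Hypothetical counts: variant 1 only in cases, variant 2 only in controls,
# variant 3 mostly in cases; 200 alleles per group.
l_obs, ept_p = exact_probability_test([4, 0, 3], [0, 4, 1], 200, 200)
# A perfectly balanced single table is the most probable one.
_, balanced_p = exact_probability_test([1], [1], 200, 200)
print(l_obs, ept_p, balanced_p)
```

Because the test looks at how improbable each table is rather than at a pooled frequency, variants enriched in opposite groups both count as evidence instead of cancelling. Full enumeration grows exponentially with k, so this form is only practical for small variant sets.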
Likelihood Ratio Test (LRT) Binomial distribution ASHG Meeting 1212, Zhang
Q-Q Plots of EPT and LRT (under the null). Panels: EPT N=500, EPT N=3000, LRT N=500, LRT N=3000.
Power Comparison (significance level= ). Variant proportion: positive causal 80%, neutral 20%, negative causal 0%. Axes: power vs. sample size.
Power Comparison (significance level= ). Variant proportion: positive causal 60%, neutral 20%, negative causal 20%. Axes: power vs. sample size.
Power Comparison (significance level= ). Variant proportion: positive causal 40%, neutral 20%, negative causal 40%. Axes: power vs. sample size.
25 Methods For Level 1 Data. Including covariates. Extended to quantitative traits. Better control for population structure. More sophisticated models.
26 Collapsing (C) test. Step 1: variant collapsing. Step 2: logit(y) = a + b*X + e (logistic regression). (Li and Leal, The American Journal of Human Genetics 2008(83): 311–321)
27 Variant Collapsing. Table columns: Subject, V1, V2, V3, V4, Collapsed, Trait. Variants are marked (+) or (.).
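The two-step collapsing test can be sketched as follows. The sketch assumes a binary trait and no covariates; in that special case the maximum-likelihood slope b of the logistic regression on a single binary predictor equals the log odds ratio of the 2×2 table, and its Wald standard error is the square root of the summed reciprocal cell counts, so no iterative fitting is needed. The genotypes and trait values are invented.

```python
from math import log, sqrt
from statistics import NormalDist

def collapse(genotypes):
    """Step 1: X = 1 if a subject carries any rare variant, else 0."""
    return [1 if any(g > 0 for g in row) else 0 for row in genotypes]

def collapsing_test(genotypes, trait):
    """Step 2: logistic regression of a binary trait on the collapsed X."""
    x = collapse(genotypes)
    cells = {(xi, yi): 0 for xi in (0, 1) for yi in (0, 1)}
    for xi, yi in zip(x, trait):
        cells[(xi, yi)] += 1
    if min(cells.values()) == 0:
        raise ValueError("empty cell: the closed-form Wald test is undefined")
    # With one binary predictor, the ML slope is the log odds ratio.
    b = log(cells[(1, 1)] * cells[(0, 0)] / (cells[(1, 0)] * cells[(0, 1)]))
    se = sqrt(sum(1 / n for n in cells.values()))
    z = b / se
    return x, b, 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical data: rows are subjects, columns are variants V1..V4.
genotypes = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0],
]
trait = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = case, 0 = control
collapsed, slope, collapsing_p = collapsing_test(genotypes, trait)
print(collapsed, slope, collapsing_p)
```

With covariates or a quantitative trait the regression would have to be fitted with a statistics package instead of this closed form.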
28 WSS
29 WSS
30 WSS
31 Weighted Sum Test
Collapsing test (Li & Leal, 2008): w_i = 1, and s = 1 if s > 1
Weighted-sum test (Madsen & Browning, 2009): w_i calculated based on allele frequency in the control group
aSum, adaptive sum test (Han & Pan, 2010): w_i = -1 if b < 0 and p < 0.1, otherwise w_i = 1
KBAC (Liu and Leal, 2010): w_i = left-tail p-value
RBT (Ionita-Laza et al., 2011): w_i = log-scaled probability
PWST, p-value weighted sum test (Zhang et al., 2011): w_i = rescaled left-tail p-value, incorporating both significance and direction
EREC (Lin et al., 2011): w_i = estimated effect size
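As one concrete instance of the general form sum(w_i g_i), the control-frequency weighting of the weighted-sum test can be sketched as below. The smoothing q_i = (m_i + 1)/(2 n_U + 2) and the scaling by sqrt(n q_i (1 − q_i)) follow the usual description of Madsen & Browning's weights; treat the details as an assumption here, since the slide only says the weights are based on allele frequency in controls. The data are invented.

```python
from math import sqrt

def wss_weights(genotypes, trait):
    """Weights from control-group allele frequency (rarer variant -> larger weight)."""
    n = len(genotypes)
    controls = [row for row, y in zip(genotypes, trait) if y == 0]
    n_u = len(controls)
    weights = []
    for i in range(len(genotypes[0])):
        m_u = sum(row[i] for row in controls)        # variant alleles in controls
        q = (m_u + 1) / (2 * n_u + 2)                # smoothed control frequency
        weights.append(1 / sqrt(n * q * (1 - q)))    # inverse estimated std. dev.
    return weights

def weighted_sums(genotypes, weights):
    """Per-subject weighted sum: sum over variants of w_i * g_i."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

# Hypothetical data: rows are subjects, columns are variants V1..V4.
genotypes = [
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0],
]
trait = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = case, 0 = control
wss_w = wss_weights(genotypes, trait)
wss_scores = weighted_sums(genotypes, wss_w)
print(wss_w, wss_scores)
```

The per-subject scores would then be compared between cases and controls; because the weights depend on affection status, significance is normally assessed by permutation rather than by a direct test.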
32 When there are only causal (+) variants … Table columns: Subject, V1, V2, Collapsed, Trait. Collapsing (Li & Leal, 2008) works well; power is increased.
33 When there are causal (+) and non-causal (.) variants … Table columns: Subject, V1, V2, V3, V4, Collapsed, Trait. Collapsing still works, but power is reduced.
34 When there are causal (+), non-causal (.) and causal (-) variants … Table columns: Subject, V1, V2, V3, V4, V5, V6, Collapsed, Trait. The power of the collapsing test drops significantly.
35 P-value Weighted Sum Test (PWST). Table columns: Subject, V1, V2, V3, V4, V5, V6, Collapsed, pSum, Trait, with variants marked (+), (.) or (-). Per-variant rows: t; p(x≤t); 2*(p-0.5). The rescaled left-tail p-value, which lies in [-1,1], is used as the weight.
36 P-value Weighted Sum Test (PWST). The power of the collapsing test is retained even when there are bidirectional effects.
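The weighting idea can be sketched for a quantitative trait. This is an illustration under stated assumptions, not the published PWST procedure: the per-variant statistic t is taken to be the slope t statistic from regressing the trait on the genotype, its left-tail p-value uses a normal approximation, and the weight is 2*(p − 0.5) so that it falls in [-1, 1]. All names and data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def left_tail_p(g, y):
    """Left-tail p-value p(x <= t) for one variant's association with the trait."""
    n = len(y)
    mg, my = sum(g) / n, sum(y) / n
    sgy = sum((a - mg) * (b - my) for a, b in zip(g, y))
    sgg = sum((a - mg) ** 2 for a in g)
    syy = sum((b - my) ** 2 for b in y)
    r = sgy / sqrt(sgg * syy)
    t = r * sqrt((n - 2) / (1 - r * r))   # t statistic of the regression slope
    return NormalDist().cdf(t)            # ASSUMPTION: normal approximation

def pwst_weights(genotypes, trait):
    """Rescale each left-tail p-value from [0, 1] to a signed weight in [-1, 1]."""
    columns = list(zip(*genotypes))
    return [2 * (left_tail_p(col, trait) - 0.5) for col in columns]

def p_sum(genotypes, weights):
    """Per-subject p-value weighted sum (the pSum column)."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

# Hypothetical data: V1 raises the trait (+), V2 is neutral (.), V3 lowers it (-).
genotypes = [
    [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0],
    [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 1],
]
trait = [2.1, 1.8, 0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -1.9, -2.2]
pw = pwst_weights(genotypes, trait)
psum = p_sum(genotypes, pw)
print(pw, psum)
```

A plain collapsed indicator would treat carriers of V1 and V3 alike, whereas the signed weights push them to opposite ends of pSum. Since the weights are learned from the same phenotype, a direct test of pSum against the trait is not valid on its own.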
37 PWST: Q-Q Plots Under the Null. Direct test: inflation of type I error. Corrected by permutation test (permutation of phenotype).
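The permutation correction can be sketched generically: the phenotype is shuffled, and the whole data-driven statistic, including its weights, is recomputed on every shuffle. The statistic used here, `signed_sum_stat`, is a hypothetical stand-in with data-driven ±1 weights, not PWST itself; the function names, the number of permutations and the seed are arbitrary choices.

```python
import random

def signed_sum_stat(genotypes, trait):
    """Hypothetical data-driven statistic: each variant gets weight +1 or -1
    from the sign of its own association with the trait."""
    mean_y = sum(trait) / len(trait)
    total = 0.0
    for i in range(len(genotypes[0])):
        c = sum(row[i] * (y - mean_y) for row, y in zip(genotypes, trait))
        total += abs(c)   # equals sign(c) * c: the weight is learned from the data
    return total

def permutation_p_value(statistic, genotypes, trait, n_perm=999, seed=0):
    """Empirical p-value from permuting the phenotype; genotypes stay fixed."""
    rng = random.Random(seed)
    observed = statistic(genotypes, trait)
    shuffled = list(trait)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Weights are re-learned inside statistic() on every permuted data set.
        if statistic(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: V1 raises the trait, V2 is neutral, V3 lowers it.
genotypes = [
    [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0],
    [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 1],
]
trait = [2.1, 1.8, 0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -1.9, -2.2]
perm_p = permutation_p_value(signed_sum_stat, genotypes, trait)
print(perm_p)
```

Adding 1 to both the hit count and the number of permutations keeps the estimate away from zero. Shuffling phenotypes freely assumes the subjects are exchangeable, which does not hold for related individuals.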
38 Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)
39 GLMM & WST
Y: quantitative trait or logit(binary trait)
α: intercept
β: regression coefficient of the weighted sum
m: number of RVs to be collapsed
w_i: weight of variant i
g_i: genotype (recoded) of variant i
Σ w_i g_i: weighted sum (WS)
X: covariate(s), such as population structure variable(s)
τ: fixed effect(s) of X
Z: design matrix corresponding to γ
γ: random polygene effects for individual subjects, γ ~ N(0, G), G = 2σ²K, where K is the kinship matrix and σ² the additive polygene genetic variance
ε: residual
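Assembled from the symbol definitions on this slide, the model presumably takes the following form. This is a reconstruction from those definitions, not a formula quoted from the source.

```latex
Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon,
\qquad \gamma \sim N(0, G), \quad G = 2\sigma^{2} K
```

Testing the rare-variant set then amounts to testing β = 0, with relatedness among subjects absorbed by the random effect γ.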
40 Weight. Based on allele frequency: binary (0,1) or continuous, fixed or variable threshold. Based on functional annotation/prediction: SIFT, PolyPhen, etc. Based on sequencing quality (coverage, mapping quality, genotyping quality, etc.). Data-driven: using both genotype and phenotype data, learning weights from the data or by adaptive selection, with a permutation test. Any combination …
41 Application 1: Family Data. Adjusting for relatedness in family data for non-data-driven tests of rare variants. γ ~ N(0, 2σ²K). Unadjusted: Adjusted:
42 Q-Q Plots of -log10(P) under the Null. Li & Leal's collapsing test, ignoring family structure: inflation of type I error. Li & Leal's collapsing test, modeling family structure via GLMM: inflation is corrected. (From Zhang et al., 2011, BMC Proc.)
43 Application 2: Permuting Family Data. MMPT: Mixed Model-based Permutation Test. Adjusting for relatedness in family data for data-driven permutation tests of rare variants. γ ~ N(0, 2σ²K). Diagram labels: permuted; non-permuted, subject IDs fixed.
44 Q-Q Plots under the Null. Panels: WSS, SPWST, PWST, aSum. Permutation test, ignoring family structure: inflation of type I error. (From Zhang et al., 2011, IGES Meeting)
Q-Q Plots under the Null. Panels: WSS, SPWST, PWST, aSum. Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected. (From Zhang et al., 2011, IGES Meeting)
46 Burden Test vs. Non-burden Test. Burden test: T-test, Likelihood Ratio Test, F-test, score test, … Non-burden test: SKAT (sequence kernel association test).
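The difference between the two families can be illustrated with per-variant score statistics S_j = sum over subjects of g_ij (y_i − mean(y)). A burden-type statistic squares the weighted sum of the scores, whereas the SKAT statistic with a weighted linear kernel sums the squared weighted scores. The sketch assumes a quantitative trait, an intercept-only null model and unit weights, and it computes the statistics only; the SKAT p-value, which requires a mixture of chi-square distributions, is not computed. The data are invented.

```python
def score_vector(genotypes, trait):
    """Per-variant score S_j = sum_i g_ij * (y_i - mean(y))."""
    mean_y = sum(trait) / len(trait)
    n_variants = len(genotypes[0])
    return [sum(row[j] * (y - mean_y) for row, y in zip(genotypes, trait))
            for j in range(n_variants)]

def burden_stat(scores, weights):
    """Burden-type statistic: square of the weighted sum of scores."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def skat_stat(scores, weights):
    """SKAT-type statistic (weighted linear kernel): sum of squared weighted scores."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

# Hypothetical data: V1 raises the trait, V2 is neutral, V3 lowers it.
genotypes = [
    [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0],
    [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 1], [0, 0, 1],
]
trait = [2.1, 1.8, 0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -1.9, -2.2]
scores = score_vector(genotypes, trait)
unit_weights = [1.0, 1.0, 1.0]
q_burden = burden_stat(scores, unit_weights)
q_skat = skat_stat(scores, unit_weights)
print(scores, q_burden, q_skat)
```

In this example the positive and negative scores nearly cancel inside the burden statistic, while the SKAT-type statistic accumulates both, which is why non-burden tests are less sensitive to mixed effect directions.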
Extension of SKAT to Family Data. Equation labels: kinship matrix; polygenic heritability of the trait; residual. (Han Chen et al., 2012, Genetic Epidemiology)
49 Other problems. Missing genotypes & imputation. Genotyping errors & QC (family consistency, sequence review). Population stratification. Inherited variants and de novo mutations. Family data & linkage information. Variant validation and association validation. Public databases. And more …