Association tests
Basics of association testing
Consider the evolutionary history of individuals at loci proximal to the disease-causing mutation.
Association testing
The goal of association testing is to identify SNPs that ‘associate’ (are correlated) with the phenotype.
Recall that spatially close SNPs are correlated because of linkage disequilibrium (LD).
As we move farther from the causal mutation, recombination changes the evolutionary history, and the SNPs are no longer correlated.
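Tests for association: Pearson
Case-control phenotype:
– Build a 3×2 contingency table of genotype by disease status.
– Apply the Pearson test (2 df): $\chi^2 = \sum_{i=1}^{6} (O_i - E_i)^2 / E_i$, where $O_i$ is the observed count in cell $i$ and $E_i$ is the count expected under no association, e.g. $E_1 = P_{MM} \cdot \#\text{cases}$, with $P_{MM}$ the frequency of genotype MM in the pooled sample.

      Cases   Controls
MM    O_1     O_2
Mm    O_3     O_4
mm    O_5     O_6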
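The Pearson test on a 3×2 genotype table can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `pearson_chi2_3x2` and the counts are hypothetical, every row and column is assumed to be non-empty, and the p-value uses the fact that the chi-square survival function with 2 degrees of freedom is exp(-x/2).

```python
import math

def pearson_chi2_3x2(table):
    """table: three rows [cases, controls] for genotypes MM, Mm, mm."""
    n_cases = sum(row[0] for row in table)
    n_controls = sum(row[1] for row in table)
    n = n_cases + n_controls
    col_totals = (n_cases, n_controls)
    chi2 = 0.0
    for row in table:
        # Genotype frequency pooled over cases and controls, e.g. P_MM.
        p_genotype = (row[0] + row[1]) / n
        for observed, col_total in zip(row, col_totals):
            expected = p_genotype * col_total  # e.g. E_1 = P_MM * #cases
            chi2 += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom, Pr(chi2_2 >= x) = exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value

# Hypothetical counts, for illustration only.
counts = [[30, 10], [40, 40], [30, 50]]
chi2, p = pearson_chi2_3x2(counts)
print(f"chi2 = {chi2:.3f}, p = {p:.4g}")
```

The closed-form p-value only holds for 2 degrees of freedom; tables of another shape would need a general chi-square distribution function.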
Test for association: Fisher exact test
Here P is the probability of seeing the exact counts in the table, given the row and column sums.
The actual significance is computed by summing P over all tables that are at least as extreme as the observed one.
To identify such tables, rank all tables according to Pearson’s test statistic.

      Cases   Controls
MM    a       b
Mm    c       d
mm    e       f
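Fisher exact test
For the 3×2 table with cells (a,b), (c,d), (e,f) in rows MM, Mm, mm and columns cases, controls, let $n = a+b+c+d+e+f$ and $P = \text{Num}/\text{Den}$.
– Num: number of ways of getting the configuration (a,b,c,d,e,f), i.e. $n! / (a!\,b!\,c!\,d!\,e!\,f!)$
– Den: number of ways of ensuring that the row sums and column sums are fixed, i.e. $\frac{n!}{(a+b)!\,(c+d)!\,(e+f)!} \cdot \frac{n!}{(a+c+e)!\,(b+d+f)!}$
– Hence $P = \frac{(a+b)!\,(c+d)!\,(e+f)!\,(a+c+e)!\,(b+d+f)!}{n!\,a!\,b!\,c!\,d!\,e!\,f!}$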
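As a rough sketch of the exact test, the code below computes the probability of a 3×2 table given its margins and then sums the probabilities of all tables with the same margins that rank at least as extreme by the Pearson statistic. The function names and the cell ordering (a, b, c, d, e, f) read row by row are assumptions made for this example, and all row and column sums are assumed to be non-zero.

```python
from math import factorial

def table_probability(cells):
    """Probability of the table (a, b, c, d, e, f) given its row and column sums."""
    a, b, c, d, e, f = cells
    rows = (a + b, c + d, e + f)
    cols = (a + c + e, b + d + f)
    numerator = 1
    for total in rows + cols:
        numerator *= factorial(total)
    denominator = factorial(sum(cells))
    for cell in cells:
        denominator *= factorial(cell)
    return numerator / denominator

def pearson_stat(cells):
    """Pearson chi-square statistic, used here only to rank tables."""
    a, b, c, d, e, f = cells
    rows = (a + b, c + d, e + f)
    cols = (a + c + e, b + d + f)
    n = sum(cells)
    stat = 0.0
    for i, row_total in enumerate(rows):
        for j, col_total in enumerate(cols):
            expected = row_total * col_total / n
            stat += (cells[2 * i + j] - expected) ** 2 / expected
    return stat

def fisher_exact_3x2(cells):
    """Sum the probabilities of all tables at least as extreme as `cells`."""
    a, b, c, d, e, f = cells
    rows = (a + b, c + d, e + f)
    n_cases = a + c + e
    observed_stat = pearson_stat(cells)
    p_value = 0.0
    # Enumerate every table with the same row and column sums.
    for a2 in range(min(rows[0], n_cases) + 1):
        for c2 in range(min(rows[1], n_cases - a2) + 1):
            e2 = n_cases - a2 - c2
            if e2 > rows[2]:
                continue
            candidate = (a2, rows[0] - a2, c2, rows[1] - c2, e2, rows[2] - e2)
            # Small tolerance so that ties with the observed table are counted.
            if pearson_stat(candidate) >= observed_stat - 1e-9:
                p_value += table_probability(candidate)
    return p_value

# Hypothetical counts, for illustration only.
print(fisher_exact_3x2((2, 0, 1, 1, 0, 2)))
```

Full enumeration is only practical for small counts; for large samples the Pearson approximation is the usual choice.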
Continuous outcomes
Instead of discrete (case/control) data, we have real-valued phenotypes.
– Ex: diastolic blood pressure
In this case, how do we test for association?
Continuous outcome ANOVA
Often, the phenotype is not given as case/control status but as a continuous variable.
– Ex: blood-pressure measurements
Question: Are the mean values of the two groups (MM and mm) significantly different?
Two-sided t-test
For two categories, ANOVA is also known as the t-test.
Assume that the variables from the two sets are drawn from Normal distributions.
– Different means, equal variances
The null hypothesis is that they are both from the same distribution.
t-test continued
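Two-sample t-test
As the variance is not known, we use the pooled estimate $S$, defined by
$S^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$,
where $\bar{X}_i$, $S_i^2$, and $n_i$ are the sample mean, sample variance, and size of group $i$.
The T-statistic is given by
$T = \frac{\bar{X}_1 - \bar{X}_2}{S \sqrt{1/n_1 + 1/n_2}}$.
Significant deviations from 0 are used to reject the null hypothesis.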
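Two-sample t-test (unequal variances)
If the variances cannot be assumed to be equal, we use the separate sample variances $S_1^2$ and $S_2^2$ of the two groups.
The T-statistic is given by
$T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$,
where $\bar{X}_i$ and $n_i$ are the sample mean and size of group $i$.
Significant deviations from 0 are used to reject the null hypothesis.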
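The two T-statistics can be computed directly with the standard library. The sketch below uses the standard pooled-variance and unequal-variance (Welch) forms; the function names and the phenotype values are hypothetical, and no p-value is computed because that would need a t-distribution function.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample T-statistic assuming equal variances in both groups."""
    n1, n2 = len(x), len(y)
    # Pooled estimate S^2 built from the two sample variances.
    s2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(s2 * (1 / n1 + 1 / n2))

def welch_t(x, y):
    """Two-sample T-statistic without the equal-variance assumption."""
    n1, n2 = len(x), len(y)
    return (mean(x) - mean(y)) / sqrt(variance(x) / n1 + variance(y) / n2)

# Hypothetical phenotype values for genotype groups MM and mm.
group_MM = [1, 2, 3, 4, 5]
group_mm = [3, 4, 5, 6, 7]
print(pooled_t(group_MM, group_mm), welch_t(group_MM, group_mm))
```

When the two groups have the same size and the same sample variance, as in this toy data, the two statistics coincide.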
Continuous outcome ANOVA
How do we extend the t-test when we have multiple groups (e.g. genotype classes such as MM and mm)?
T-test again
F-statistic for 2 groups
F-statistic for g groups
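For reference, the standard one-way ANOVA statistic for g groups is $F = \frac{\sum_i n_i (\bar{X}_i - \bar{X})^2 / (g - 1)}{\sum_i \sum_j (X_{ij} - \bar{X}_i)^2 / (N - g)}$, where $N$ is the total sample size; the notation is an assumption of this note and may differ from the slides. The sketch below computes it with the standard library; `f_statistic` and the data are hypothetical.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of groups (each a list of numbers)."""
    g = len(groups)
    n = sum(len(grp) for grp in groups)
    grand_mean = mean(v for grp in groups for v in grp)
    # Between-group sum of squares, weighted by group size.
    ss_between = sum(len(grp) * (mean(grp) - grand_mean) ** 2 for grp in groups)
    # Within-group sum of squares around each group's own mean.
    ss_within = sum((v - mean(grp)) ** 2 for grp in groups for v in grp)
    return (ss_between / (g - 1)) / (ss_within / (n - g))

# Hypothetical phenotype values for two and three genotype groups.
two_groups = [[1, 2, 3, 4, 5], [3, 4, 5, 6, 7]]
three_groups = two_groups + [[5, 6, 7, 8, 9]]
print(f_statistic(two_groups), f_statistic(three_groups))
```

For two groups, F equals the square of the equal-variance T-statistic: the toy data above gives T = -2 and F = 4.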
Haplotype testing
Why test with multiple SNPs?
– Pros: haplotypes might be better correlated with the disease outcome.
The tests are similar, except that instead of 3 rows, we have a certain number (k) of haplotypes.
Linear regression
Sometimes, we have additional information on phenotype values.
– Ex: the phenotype value might be additive in the number of alleles
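Linear regression
The parameters can be estimated using linear regression analysis.
– Let $X_{ij}$ be the phenotypic value of the j-th individual in class i (genotype i).
– $X_{ij} = \mu + \alpha_i + \epsilon_{ij}$, with overall mean $\mu$, class effect $\alpha_i$, and residual $\epsilon_{ij}$
– $\sum_i \alpha_i = 0$
Generally,
– $X = C\beta + \epsilon$
– The goal is to estimate $\beta$ so that $\|\epsilon\|$ is minimized.
How is this useful?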
Linear regression testing
Recall that we want to test if the genotype is useful in predicting the phenotype (X).
If not, then the null model $X_{ij} = \mu + \epsilon_{ij}$ should have the same amount of variance in the residual $\epsilon_{ij}$ as the model that includes the genotype effect.
Solving for least squares
$\min_x \|Ax - b\|$
It is solved by $x = (A^T A)^{-1} A^T b$ (assuming $A^T A$ is invertible).
Using partial derivatives
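A sketch of the standard derivation, written for the squared norm (which has the same minimizer) and assuming $A^T A$ is invertible:

```latex
\begin{aligned}
f(x) &= \|Ax - b\|^2 = (Ax - b)^T (Ax - b) = x^T A^T A x - 2\, b^T A x + b^T b \\
\nabla_x f(x) &= 2 A^T A x - 2 A^T b = 0 \\
A^T A x &= A^T b \\
x &= (A^T A)^{-1} A^T b
\end{aligned}
```

The third line is the system usually called the normal equations.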
Linear regression summary
Linear regression methods can be used to estimate the parameters of
– $X = C\beta + \epsilon$
To test for association, estimate the parameters for two models.
– Ex: $X_{ij} = \mu + \alpha_i + \epsilon_{ij}$ vs $X_{ij} = \mu + \epsilon'_{ij}$
Note that both $\epsilon$ and $\epsilon'$ are assumed to be random variables with mean 0, and that $\mathrm{Var}(\epsilon) \le \mathrm{Var}(\epsilon')$.
We can test for association by asking if the reduction in variance $\mathrm{Var}(\epsilon') - \mathrm{Var}(\epsilon)$ is significant.
– This can be done parametrically (Ex: F-test)
– Or non-parametrically, using a permutation test
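The non-parametric option can be sketched as follows. This is one possible permutation scheme, not necessarily the one intended in the lecture: the statistic is the reduction in residual sum of squares between the null model (grand mean only) and the genotype model (one mean per class), and phenotypes are shuffled across individuals. The function names, `n_perm`, and `seed` are assumptions of this example.

```python
import random
from statistics import mean

def rss_reduction(phenotypes, genotypes):
    """Residual sum of squares of the null model minus that of the genotype model."""
    grand = mean(phenotypes)
    rss_null = sum((x - grand) ** 2 for x in phenotypes)
    by_class = {}
    for x, g in zip(phenotypes, genotypes):
        by_class.setdefault(g, []).append(x)
    rss_full = sum((x - mean(vals)) ** 2 for vals in by_class.values() for x in vals)
    return rss_null - rss_full

def permutation_p_value(phenotypes, genotypes, n_perm=2000, seed=0):
    """Fraction of label permutations with a reduction at least as large as observed."""
    rng = random.Random(seed)
    observed = rss_reduction(phenotypes, genotypes)
    shuffled = list(phenotypes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rss_reduction(shuffled, genotypes) >= observed - 1e-12:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: phenotype shifts by 10 per genotype class.
genotypes = [0] * 10 + [1] * 10 + [2] * 10
phenotypes = [10 * g + (i % 5) for i, g in enumerate(genotypes)]
print(permutation_p_value(phenotypes, genotypes))
```

The smallest p-value this scheme can report is 1 / (n_perm + 1), so the number of permutations bounds the achievable significance.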
Association test summary (single locus)
Discrete outcomes (case-control)
– Pearson’s test / Fisher exact test
Continuous variables
– t-test (2 categories)
– ANOVA (multiple categories)
– Linear regression (multiple categories with linearity assumption)
Single locus can be extended to haplotypes
– Multiple correlated SNPs
– The only change is that the number of categories expands.
Epistatic and gene-environment interactions
The typical Mendelian disorder model assumes that there is a single causal variation.
– Having the variation predisposes you to a certain phenotype
For complex diseases, this may not be a correct model.
– Different variants may interact combinatorially
Two-way ANOVA
Suppose that there are two ways of classifying individuals.
– Ex: genotypes at two loci
– Ex: genotype versus sex
– Ex: genotype versus environment
Assume that there are sufficient individuals in each cell.
– Estimate the means/variances in each cell
An ANOVA test may be used to determine if the values in the cells are significantly different.
(Figure: grid of cells indexed by genotype aa, Aa, AA and by sex M, F.)
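2-way ANOVA model
$X_{ijk}$: phenotype value for the k-th individual in cell (i,j)
Assume that $X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$
– $\alpha_i$, $\beta_j$ are fixed parameters contributing to classes i and j
– $\gamma_{ij}$ is a parameter corresponding to the interaction between classes i and j
– $\sum_i n_i \alpha_i = 0$, $\sum_j n_j \beta_j = 0$, $\sum_{i,j} n_{ij} \gamma_{ij} = 0$, where $n_i$, $n_j$, $n_{ij}$ are the numbers of individuals in class i, class j, and cell (i,j)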
Interaction between genes
Biologically, the interaction between genes can be
– Additive, recessive, or more complex
– With two biallelic loci there are 9 two-locus genotypes (aa, Aa, AA crossed with bb, Bb, BB), so $2^9$ epistatic combinations are possible.
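ANOVA model
We have two questions about the 2-way model $X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$:
– Are the loci associated with the disease? To answer this, test this model against the null model $X_{ijk} = \mu + \epsilon_{ijk}$.
– Is epistatic interaction important? Test this model against $X_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}$ (set $\gamma_{ij} = 0$ in the null hypothesis).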
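Both comparisons are tests between nested linear models, so one standard way to carry them out, assuming Normal residuals, is the nested-model F-test sketched below. The symbols $\mathrm{RSS}_0$, $\mathrm{RSS}_1$ (residual sums of squares of the null and full model), $p_0$, $p_1$ (numbers of free parameters), and $N$ (number of individuals) are notation introduced for this note.

```latex
F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1) / (p_1 - p_0)}{\mathrm{RSS}_1 / (N - p_1)}
\;\sim\; F_{\,p_1 - p_0,\; N - p_1} \quad \text{under the null model}
```

For a 3×3 grid of two-locus genotypes the full model has $p_1 = 9$ free parameters (one mean per cell). The first question uses $p_0 = 1$, giving 8 numerator degrees of freedom; the second uses the additive model with $p_0 = 5$, giving 4.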
Detecting multiple loci
The most naïve strategy is to look at all pairs of loci (or all k-tuples) that might influence a complex disease.
This is computationally intensive and also has a problem with multiple testing.
Other strategies:
– Consider a subset S of SNPs that show an association individually.
– Limit association testing to pairs in which either
  – at least one of the SNPs comes from S, or
  – both SNPs come from S.
Exhaustive search might be needed
In some interactions, neither SNP might individually be correlated with the disease.
Exhaustive search might be the only answer.
Marchini et al. tested this extensively.
Genome-wide multi-locus testing
Marchini et al. considered many models of interaction and simulated data according to each model.
In each model table, an entry corresponds to the odds of getting the disease for that two-locus genotype.
Nature Genetics 37, (2005)
Simulating data
Define Penetrance(genotype) = Pr(“case” | genotype).
Given the penetrance values for each of the 9 two-locus genotypes, it is easy to simulate data.
– Generate n individuals.
– Individual i is assigned a genotype g independently, assuming Hardy-Weinberg equilibrium (HWE) and linkage equilibrium (LE).
  – Ex: Pr(“aa,Bb”) = $P_a^2 \cdot 2 P_B P_b$
– Each individual is assigned to “case” with probability equal to the penetrance of its genotype, and to “control” otherwise.
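The simulation steps above can be sketched in Python as follows. The function name, the encoding of a genotype as the number of capital-letter alleles (0, 1, or 2) at each locus, and the penetrance numbers are all assumptions made for illustration; they are not values from the lecture or from Marchini et al.

```python
import random

def simulate_cases_controls(n, freq_A, freq_B, penetrance, seed=0):
    """Simulate n individuals at two loci.

    freq_A, freq_B: frequencies of alleles A and B.
    penetrance[i][j]: Pr(case | i copies of A, j copies of B).
    """
    rng = random.Random(seed)
    individuals = []
    for _ in range(n):
        # HWE: the two alleles at a locus are drawn independently.
        g_a = sum(rng.random() < freq_A for _ in range(2))
        # LE: the second locus is drawn independently of the first.
        g_b = sum(rng.random() < freq_B for _ in range(2))
        # Case with probability equal to the genotype's penetrance.
        status = "case" if rng.random() < penetrance[g_a][g_b] else "control"
        individuals.append((g_a, g_b, status))
    return individuals

# Hypothetical penetrance table: risk is raised only for the AA,BB genotype.
toy_penetrance = [
    [0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05],
    [0.05, 0.05, 0.50],
]
sample = simulate_cases_controls(1000, 0.3, 0.4, toy_penetrance)
print(sum(1 for _, _, status in sample if status == "case"), "cases out of", len(sample))
```

In a real case-control design one would usually keep sampling until fixed numbers of cases and controls are reached, since cases are rare when penetrance is low.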
Parameters
The penetrance values depend upon the model of interaction chosen.
Here, each model entry corresponds to an odds.
– Ex: Pr(C|“aa”,“bb”) / Pr(N|“aa”,“bb”), where C denotes a case and N a non-case
The actual penetrance values can be computed from the odds: penetrance = odds / (1 + odds).
To compute values of the odds, they use empirical estimates.
Two-locus testing results
The power represents the fraction of times the test succeeded in detecting the right pair.
The pair-wise models often do much better than the other models.
(Figure panels: Model 1, Model 2, Model 3.)
Exhaustive versus 2-stage
The two-stage strategy may not capture all epistatic interactions.
Q: Is there a scheme that can capture interacting loci without considering all pairs?
Efficient detection of interactions
Warning: these are all half-baked ideas, not established knowledge.
Consider two loci that interactively associate with a disease but are not individually that significant.
Also assume (for now) that we only have binary data (instead of ternary) at each SNP.
Re-mapping the problem
Consider the distribution of haplotypes for diseased individuals (cases).
For the two SNPs, we have 4 haplotypes.
– 00, 01, 10, 11
Let #{00} represent the number of diseased individuals with haplotype 00.
Claim: If the two SNPs are correlated, we should see an excess of #{00,11} or of #{01,10}.
Is this claim true? Can you prove it?
Re-mapping the problem
Define an interacting pair as a pair in which #{00,11} is HIGH.
– We will deal with the #{01,10} case later.
Given m SNPs, the goal is to identify all interacting pairs in $o(m^2)$ time.
Ex: if $m = 10^6$, can we identify all such pairs in time
– $O(m) \sim 10^6$
– $O(m \log m) \sim 2 \times 10^7$ (taking $\log_2$)
– $O(m^{3/2}) \sim 10^9$
We modify this problem as follows:
– Given m SNPs, identify a subset S’ of $o(m^2)$ pairs in $o(m^2)$ time such that, w.h.p., all interacting pairs are in S’.
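Coin tosses
Suppose you have a loaded coin (p = 0.7).
How can you detect with high confidence that it is loaded?
Toss the coin n times. The number of heads should differ from the n/2 expected of a fair coin.
– For n = 100: about 70 heads expected, versus 50 for a fair coin
– For n = 10000: about 7000 heads expected, versus 5000 for a fair coin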
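To see why more tosses give more confidence, one can compare the expected number of heads of the loaded coin with the mean and standard deviation of a fair coin. The sketch below does this; `expected_z_score` is a hypothetical helper name, and the Normal approximation to the binomial is an assumption of this note.

```python
from math import sqrt

def expected_z_score(n, p_loaded=0.7, p_fair=0.5):
    """Expected heads of the loaded coin, in fair-coin standard deviations from n/2."""
    expected_heads = n * p_loaded
    null_mean = n * p_fair
    null_sd = sqrt(n * p_fair * (1 - p_fair))
    return (expected_heads - null_mean) / null_sd

for n in (100, 10000):
    print(n, expected_z_score(n))
```

The gap between the two means grows like n while the standard deviation grows like the square root of n, so the separation is 0.4·√n standard deviations: about 4 for n = 100 and about 40 for n = 10000.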
Hamming distance problem
For interacting SNPs on n individuals, the Hamming distance is LOW: $\le k_1/n$.
For non-interacting SNPs on n individuals, the Hamming distance is $\ge k_2/n$.
Can you identify (w.h.p.) all the low-Hamming-distance pairs while including only a small fraction of the high-Hamming-distance pairs?
      s1   s2
1:    0    0
2:    0    0
3:    0    0
4:    0    1
5:    0    0
6:    1    1
7:    1    1
8:    1    1
9:    1    0

Think of every SNP as a binary string of length n.
Choose l positions at random for a SNP s, and consider the binary string due to those l positions as h(s).
Ex: $h_{1,3,7}(s_1) = 001$
$\Pr(h(s_1) = h(s_2) \mid \text{SNPs interact}) = ?$
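A small sketch of this sampling hash, using the two SNP columns from the table. One way to approach the question: if the l positions are drawn independently and uniformly with replacement (an assumption of this note) and the two SNPs differ at d of the n positions, each sampled position agrees with probability 1 - d/n, so the hashes collide with probability (1 - d/n)^l. All function names are hypothetical.

```python
import random

def sample_positions(n, l, seed=0):
    """Draw l positions uniformly at random, with replacement (0-based indices)."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(l)]

def h(snp, positions):
    """Binary string of the SNP restricted to the sampled positions."""
    return "".join(str(snp[i]) for i in positions)

def hamming(s, t):
    """Number of individuals at which the two SNPs differ."""
    return sum(a != b for a, b in zip(s, t))

def collision_probability(d, n, l):
    """Pr(h(s) = h(t)) when s and t differ at d of n positions."""
    return (1 - d / n) ** l

s1 = [0, 0, 0, 0, 0, 1, 1, 1, 1]
s2 = [0, 0, 0, 1, 0, 1, 1, 1, 0]

# Positions 1, 3, 7 of the table are indices 0, 2, 6 here.
print(h(s1, [0, 2, 6]))
d = hamming(s1, s2)
print(d, collision_probability(d, len(s1), 3))
```

Pairs at low Hamming distance collide far more often than pairs at high distance, so bucketing SNPs by h(s), and repeating with fresh random positions, yields candidate pairs without comparing all pairs.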