Association tests

Basics of association testing
Consider the evolutionary history of individuals proximal to the disease-carrying mutation.

Association testing
The goal of association testing is to identify SNPs that ‘associate’ (are correlated) with the phenotype.
Recall that spatially close SNPs are correlated because of LD (linkage disequilibrium).
As we move farther from the causal mutation, recombination changes the evolutionary history, and the SNPs are no longer correlated.

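Tests for association: Pearson
Case-control phenotype:
– Build a 3×2 contingency table of observed counts, with one row per genotype (mm, Mm, MM) and one column each for cases and controls.
– Pearson test (2 df): $\chi^2 = \sum_i (O_i - E_i)^2 / E_i$, where $O_i$ is the observed count in cell $i$ and $E_i$ is the count expected under no association, e.g. $E_1 = P_{MM} \cdot \#\text{cases}$ for the MM/cases cell, with $P_{MM}$ the overall frequency of genotype MM.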
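
A minimal sketch of the Pearson test on a 3×2 genotype table, using only the standard library. The counts are hypothetical, and the function names are illustrative. The p-value uses the fact that a chi-square variable with 2 degrees of freedom has upper-tail probability exp(-x/2), so this shortcut applies only to the 3×2 case.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under no association, e.g. P_MM * #cases.
            expected = row_sums[i] * col_sums[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_2df(stat):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)

# Hypothetical counts: rows mm, Mm, MM; columns cases, controls.
counts = [[30, 50], [50, 40], [20, 10]]
chi2 = pearson_chi2(counts)
p = p_value_2df(chi2)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```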

Test for association: Fisher exact test
        Cases  Controls
mm        a       b
Mm        c       d
MM        e       f
$P = \frac{(a+b)!\,(c+d)!\,(e+f)!\,(a+c+e)!\,(b+d+f)!}{n!\,a!\,b!\,c!\,d!\,e!\,f!}$, with $n = a+b+c+d+e+f$.
Here P is the probability of seeing the exact counts, given the row and column sums.
The actual significance is computed by summing P over all tables that are at least this extreme.
To identify such tables, rank order all tables according to Pearson’s test.

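Fisher exact test
        Cases  Controls
mm        a       b
Mm        c       d
MM        e       f
The probability of the table is P = Num / Den, with $n = a+b+c+d+e+f$.
– Num: number of ways of getting configuration (a,b,c,d,e,f), i.e. $n! / (a!\,b!\,c!\,d!\,e!\,f!)$
– Den: number of ways of ensuring that the row sums and column sums are fixed, i.e. $\frac{n!}{(a+b)!\,(c+d)!\,(e+f)!} \cdot \frac{n!}{(a+c+e)!\,(b+d+f)!}$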
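
A rough sketch of the exact test for a 3×2 table, following the recipe on the slides: enumerate every table with the same row and column sums, rank them by Pearson's statistic, and sum the probabilities of those at least as extreme as the observed one. The counts are hypothetical, all margins are assumed to be positive, and the brute-force enumeration is only practical for small samples.

```python
from math import comb

def margins(table):
    """Row totals, number of cases, and grand total of a [(cases, controls), ...] table."""
    rows = [cases + controls for cases, controls in table]
    n_cases = sum(cases for cases, _ in table)
    return rows, n_cases, sum(rows)

def table_probability(table):
    """Probability of the exact counts, given fixed row and column sums."""
    rows, n_cases, n = margins(table)
    ways = 1
    for (cases, _), row_total in zip(table, rows):
        ways *= comb(row_total, cases)   # choose which members of each row are cases
    return ways / comb(n, n_cases)       # choose the cases among all n individuals

def pearson(table):
    """Pearson statistic, used here only to rank tables (margins assumed positive)."""
    rows, n_cases, n = margins(table)
    cols = [n_cases, n - n_cases]
    stat = 0.0
    for i in range(3):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

def fisher_exact_3x2(table):
    """Sum the probabilities of all same-margin tables at least as extreme as `table`."""
    rows, n_cases, _ = margins(table)
    observed = pearson(table)
    p = 0.0
    for a in range(rows[0] + 1):
        for c in range(rows[1] + 1):
            e = n_cases - a - c
            if 0 <= e <= rows[2]:
                t = [(a, rows[0] - a), (c, rows[1] - c), (e, rows[2] - e)]
                if pearson(t) >= observed - 1e-9:
                    p += table_probability(t)
    return p

# Hypothetical counts: rows mm, Mm, MM; each row is (cases, controls).
observed_table = [(3, 1), (2, 2), (0, 4)]
p_value = fisher_exact_3x2(observed_table)
print(f"exact p-value = {p_value:.4f}")
```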

Continuous outcomes
Instead of discrete (case/control) data, we have real-valued phenotypes.
– Ex: diastolic blood pressure
In this case, how do we test for association?

Continuous outcome ANOVA
Often, the phenotypes are not given as case/control labels but as a continuous variable.
– Ex: blood-pressure measurements
Question: Are the mean values of the two groups (genotypes MM and mm) significantly different?

Two-sided t-test
For two categories, ANOVA is equivalent to the t-test.
Assume that the variables from the two sets are drawn from Normal distributions.
– Different means, equal variances
The null hypothesis is that they are both from the same distribution.

t-test continued

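Two-sample t-test
As the variance is not known, we use an estimate S, defined by $S^2 = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2}$, where $n_i$ and $S_i^2$ are the size and sample variance of group $i$.
The T-statistic is given by $T = \frac{\bar{X}_1 - \bar{X}_2}{S \sqrt{1/n_1 + 1/n_2}}$.
Significant deviations from 0 are used to reject the null hypothesis.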

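Two-sample t-test (unequal variances)
If the variances cannot be assumed to be equal, we use the separate sample variances $S_1^2$ and $S_2^2$ of the two groups, of sizes $n_1$ and $n_2$.
The T-statistic is given by $T = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S_1^2/n_1 + S_2^2/n_2}}$.
Significant deviations from 0 are used to reject the null hypothesis.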
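
A small sketch of both T-statistics, assuming the standard pooled-variance form for equal variances and the Welch form for unequal variances. The blood-pressure values are made up for illustration, and the sketch stops at the statistic: turning it into a p-value needs the t distribution, which the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample T-statistic with a pooled variance estimate (equal variances assumed)."""
    n1, n2 = len(x), len(y)
    s2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(s2 * (1 / n1 + 1 / n2))

def welch_t(x, y):
    """Two-sample T-statistic using each group's own sample variance."""
    n1, n2 = len(x), len(y)
    return (mean(x) - mean(y)) / sqrt(variance(x) / n1 + variance(y) / n2)

# Hypothetical phenotype values for genotype groups MM and mm.
group_MM = [80, 85, 90]
group_mm = [70, 75, 80]
t_pooled = pooled_t(group_MM, group_mm)
t_welch = welch_t(group_MM, group_mm)
print(f"pooled T = {t_pooled:.3f}, Welch T = {t_welch:.3f}")
```

With equal group sizes and equal sample variances, as in this toy input, the two statistics coincide; they differ once the groups are unbalanced or have different spreads.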

Continuous outcome ANOVA
How do we extend the t-test when we have multiple groups?

T-test again

F-statistic for 2 groups

F-statistic for g groups
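
A minimal sketch of an F-statistic for g groups, assuming the standard one-way ANOVA form: the between-group mean square divided by the within-group mean square. The data are hypothetical. For g = 2 this F equals the square of the pooled two-sample T-statistic.

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of groups of observations."""
    g = len(groups)
    n = sum(len(x) for x in groups)
    grand = mean(v for x in groups for v in x)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(x) * (mean(x) - grand) ** 2 for x in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(x)) ** 2 for x in groups for v in x)
    return (ss_between / (g - 1)) / (ss_within / (n - g))

# Hypothetical phenotype values for genotypes MM, Mm, mm.
f_two = f_statistic([[80, 85, 90], [70, 75, 80]])
f_three = f_statistic([[80, 85, 90], [70, 75, 80], [60, 65, 70]])
print(f"F (2 groups) = {f_two:.3f}, F (3 groups) = {f_three:.3f}")
```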

Haplotype testing
Why test with multiple SNPs?
– Pros: haplotypes might be better correlated with disease outcome.
The tests are similar, except that instead of 3 genotype rows, the table has one row for each of a certain number (k) of haplotypes.

Linear regression
Sometimes, we have additional information on phenotype values.
– Ex: the phenotype value might be additive in the number of alleles

Linear regression
The parameters can be estimated using linear regression analysis.
– Let $X_{ij}$ be the phenotypic value of the j-th individual in class i (genotype i)
– $X_{ij} = \mu + \alpha_i + \epsilon_{ij}$
– $\sum_i \alpha_i = 0$
Generally,
– $X = C\beta + \epsilon$
– Goal is to estimate $\beta$ so that $\|\epsilon\|$ is minimized
How is this useful?

Linear regression testing
Recall that we want to test if the genotype is useful in predicting phenotype (X).
If not, then the null model $X_{ij} = \mu + \epsilon_{ij}$ should leave the same amount of variance in the residual $\epsilon_{ij}$ as the model that includes the genotype.

Solving for least squares
$\min_x \|Ax - b\|$
It is solved by $x = (A^T A)^{-1} A^T b$ (assuming $A^T A$ is invertible).
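
A minimal sketch of solving the least-squares problem through the normal equations $(A^T A) x = A^T b$, written in plain Python so that it stays self-contained. The example fits a hypothetical additive model (phenotype against number of alleles); the data and the function name are illustrative, and $A^T A$ is assumed to be invertible.

```python
def solve_normal_equations(A, b):
    """Return x minimising ||Ax - b|| by solving (A^T A) x = A^T b."""
    m, k = len(A), len(A[0])
    # Build the k x k matrix A^T A and the length-k vector A^T b.
    ata = [[sum(A[r][i] * A[r][j] for r in range(m)) for j in range(k)] for i in range(k)]
    atb = [sum(A[r][i] * b[r] for r in range(m)) for i in range(k)]
    # Gaussian elimination with partial pivoting on the augmented system.
    aug = [row + [rhs] for row, rhs in zip(ata, atb)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, k))
        x[i] = (aug[i][k] - tail) / aug[i][i]
    return x

# Hypothetical data: each row of A is [1, number of alleles]; b holds phenotypes.
allele_counts = [0, 1, 2, 0, 1, 2]
phenotypes = [70, 75, 80, 72, 77, 82]
design = [[1.0, float(c)] for c in allele_counts]
intercept, slope = solve_normal_equations(design, phenotypes)
print(f"intercept = {intercept:.2f}, slope per allele = {slope:.2f}")
```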

Using partial derivatives

Linear regression summary
Linear regression methods can be used to estimate the parameters of
– $X = C\beta + \epsilon$
To test for association, estimate the parameters for two models.
– Ex: $X_{ij} = \mu + \alpha_i + \epsilon_{ij}$ vs $X_{ij} = \mu + \epsilon'_{ij}$
Note that both $\epsilon$, $\epsilon'$ are assumed to be random variables with mean 0, and that $\mathrm{Var}(\epsilon) \le \mathrm{Var}(\epsilon')$.
We can test for association by asking if the reduction in variance $\mathrm{Var}(\epsilon') - \mathrm{Var}(\epsilon)$ is significant.
– This can be done parametrically (Ex: F-test)
– Or, non-parametrically, using a permutation test
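
A sketch of the non-parametric route, under the assumption that the reduction in residual variance is summarised by the usual F ratio and that significance is judged by shuffling the genotype labels. The phenotype values, the number of permutations, and the seed are all hypothetical choices.

```python
import random
from statistics import mean

def residual_ss(groups):
    """Residual sum of squares when each group is fitted by its own mean."""
    return sum((v - mean(g)) ** 2 for g in groups for v in g)

def f_stat(values, labels):
    """F ratio comparing the genotype model against the null model."""
    classes = sorted(set(labels))
    groups = [[v for v, lab in zip(values, labels) if lab == c] for c in classes]
    n, g = len(values), len(groups)
    ss_null = residual_ss([values])   # null model: a single overall mean
    ss_full = residual_ss(groups)     # genotype model: one mean per genotype
    return ((ss_null - ss_full) / (g - 1)) / (ss_full / (n - g))

def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Fraction of label shufflings whose F ratio is at least the observed one."""
    rng = random.Random(seed)
    observed = f_stat(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if f_stat(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical phenotypes for three genotype classes.
values = [70, 72, 71, 76, 75, 77, 81, 80, 83]
labels = ["mm"] * 3 + ["Mm"] * 3 + ["MM"] * 3
p_perm = permutation_p_value(values, labels)
print(f"F = {f_stat(values, labels):.2f}, permutation p = {p_perm:.4f}")
```

The permutation version makes no Normality assumption, at the cost of recomputing the statistic for every shuffle.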

Association test summary (single locus)
Discrete outcomes (case-control)
– Pearson’s/Fisher exact test
Continuous variables
– T-test (2 categories)
– ANOVA (multiple categories)
– Linear regression (multiple categories with linearity assumption)
Single locus can be extended to haplotypes
– Multiple correlated SNPs
– Only change is that the number of categories expands.

Epistatic and gene-environment interactions
The typical model of a Mendelian disorder assumes that there is a single causal variation.
– Having the variation predisposes you to a certain phenotype
For complex disease, this may not be a correct model.
– Different variants may combinatorially interact

Two-way ANOVA
Suppose that there are two ways of classifying individuals.
– Ex: genotypes at two loci
– Ex: genotype versus sex
– Ex: genotype versus environment
Assume that there are sufficient individuals in each cell.
– Estimate the means/variances in each cell
An ANOVA test may be used to determine if the values are significantly different.
(Figure: a grid of cells, genotypes aa, Aa, AA against sex M, F.)

2-way ANOVA model
$X_{ijk}$: phenotype value for the k-th individual in cell (i,j)
Assume that $X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$
– $\alpha_i$, $\beta_j$ are fixed parameters contributing to class i, j
– $\gamma_{ij}$ is a parameter corresponding to interaction between class i, j
– $\sum_i n_i \alpha_i = 0$, $\sum_j n_j \beta_j = 0$, $\sum_{i,j} n_{ij} \gamma_{ij} = 0$

Interaction between genes
Biologically, the interaction between genes can be
– Additive, recessive, or more complex
– In general, $2^9$ epistatic combinations are possible over the 9 two-locus genotypes (aa, Aa, AA at one locus against bb, Bb, BB at the other)

ANOVA model
We have two questions:
– Are the loci associated with the disease? To answer this, test the model $X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$ against the null model $X_{ijk} = \mu + \epsilon_{ijk}$.
– Is epistatic interaction important? Test this model against $X_{ijk} = \mu + \alpha_i + \beta_j + \epsilon_{ijk}$ (set $\gamma_{ij} = 0$ in the null hypothesis).
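
A rough sketch of the second question (is the interaction term needed?), assuming a balanced design with the same number of individuals in every cell and the textbook interaction F-statistic. The cell values are hypothetical: the first data set is purely additive, the second has an extra effect in one cell.

```python
from statistics import mean

def interaction_f(cells):
    """F-statistic for the interaction term; cells[i][j] lists phenotypes in cell (i, j).

    Assumes a balanced design (every cell has the same number n > 1 of individuals).
    """
    a, b = len(cells), len(cells[0])
    n = len(cells[0][0])
    cell_mean = [[mean(cells[i][j]) for j in range(b)] for i in range(a)]
    row_mean = [mean(cell_mean[i]) for i in range(a)]
    col_mean = [mean(cell_mean[i][j] for i in range(a)) for j in range(b)]
    grand = mean(row_mean)
    # What is left of each cell mean after removing the two main effects.
    ss_inter = n * sum((cell_mean[i][j] - row_mean[i] - col_mean[j] + grand) ** 2
                       for i in range(a) for j in range(b))
    # Variation of individuals around their own cell mean.
    ss_error = sum((x - cell_mean[i][j]) ** 2
                   for i in range(a) for j in range(b) for x in cells[i][j])
    df_inter = (a - 1) * (b - 1)
    df_error = a * b * (n - 1)
    return (ss_inter / df_inter) / (ss_error / df_error)

# Hypothetical 2 x 2 layouts (e.g. two genotypes by two sexes), two individuals per cell.
additive = [[[10, 12], [14, 16]], [[20, 22], [24, 26]]]
epistatic = [[[10, 12], [14, 16]], [[20, 22], [34, 36]]]
f_additive = interaction_f(additive)
f_epistatic = interaction_f(epistatic)
print(f"F (additive) = {f_additive:.2f}, F (epistatic) = {f_epistatic:.2f}")
```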

Detecting multiple loci
The most naïve strategy is to look at all pairs of loci (or all k-tuples) that influence a complex disease.
This is computationally intensive, and also has a problem with multiple testing.
Other strategies:
– Consider a subset S of SNPs that show an association individually.
– Limit association testing to pairs in which either at least one of the SNPs comes from S, or both SNPs come from S.
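
To make the cost of each strategy concrete, the sketch below counts how many pairs are tested when there are m SNPs and |S| = s of them pass the single-locus screen. The values of m and s are hypothetical, and the Bonferroni threshold is only one common way to account for multiple testing, used here as an assumption.

```python
from math import comb

def pairs_tested(m, s):
    """Number of SNP pairs tested under each strategy, for m SNPs and |S| = s."""
    all_pairs = comb(m, 2)
    at_least_one_in_s = all_pairs - comb(m - s, 2)  # drop pairs with neither SNP in S
    both_in_s = comb(s, 2)
    return all_pairs, at_least_one_in_s, both_in_s

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate near alpha."""
    return alpha / n_tests

all_pairs, one_in_s, both_in_s = pairs_tested(m=1000, s=10)
for name, count in [("all pairs", all_pairs),
                    ("at least one in S", one_in_s),
                    ("both in S", both_in_s)]:
    print(f"{name}: {count} tests, threshold {bonferroni_threshold(0.05, count):.2e}")
```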

Exhaustive search might be needed
In some interactions, neither SNP might individually be correlated with disease.
Exhaustive search might be the only answer.
Marchini et al. tested this extensively.

Genome-wide multi-locus testing
They considered many models of interaction, and simulated data according to each model.
Each entry (of a model's table) corresponds to the odds of getting a disease.
Nature Genetics 37, (2005)

Simulating data
Define Penetrance(genotype) = Pr(“case” | genotype).
Given the penetrance values for each of the 9 two-locus genotypes, it is easy to simulate data.
Generate n individuals. Individual i is assigned a genotype g independently, assuming HWE (Hardy-Weinberg equilibrium) and LE (linkage equilibrium).
– $\Pr(\text{“aa,Bb”}) = P_a^2 \cdot 2 P_B P_b$
Each individual is assigned to “case” with probability given by the penetrance of its genotype, and to “control” otherwise.
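
A minimal sketch of the simulation described on this slide. Genotypes at the two loci are drawn independently (LE) with Hardy-Weinberg frequencies, and case status is then drawn from a penetrance table. The allele frequencies, the penetrance values, and the seed are hypothetical.

```python
import random

def genotype_probs(p):
    """HWE frequencies of genotypes carrying 0, 1, 2 copies of an allele of frequency p."""
    q = 1 - p
    return [q * q, 2 * p * q, p * p]

def simulate(n, p_A, p_B, penetrance, seed=0):
    """Return n tuples (copies of A, copies of B, status) under HWE and LE."""
    rng = random.Random(seed)
    probs_a, probs_b = genotype_probs(p_A), genotype_probs(p_B)
    data = []
    for _ in range(n):
        # Independent draws at the two loci encode linkage equilibrium.
        ga = rng.choices([0, 1, 2], weights=probs_a)[0]
        gb = rng.choices([0, 1, 2], weights=probs_b)[0]
        status = "case" if rng.random() < penetrance[ga][gb] else "control"
        data.append((ga, gb, status))
    return data

# Hypothetical penetrance table: rows aa, Aa, AA; columns bb, Bb, BB.
penetrance = [[0.05, 0.05, 0.05],
              [0.05, 0.10, 0.20],
              [0.05, 0.20, 0.40]]
sample = simulate(1000, p_A=0.3, p_B=0.2, penetrance=penetrance)
n_cases = sum(1 for _, _, status in sample if status == "case")
print(f"{n_cases} cases out of {len(sample)} simulated individuals")
```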

Parameters
The penetrance values depend upon the model of interaction chosen.
Here, the parameter corresponds to an odds
– Pr(C|”aa”,”bb”) / Pr(N|”aa”,”bb”), where C and N denote case and non-case status
The actual penetrance values can be computed from the odds: penetrance = odds / (1 + odds).
To compute values of odds, they use empirical estimates.
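
The conversion from odds to penetrance is a one-line formula; the sketch below applies it to a table of odds. The baseline odds and the per-allele multiplier are invented for illustration and are not the values or the models used in the cited study.

```python
def penetrance_from_odds(odds):
    """Convert odds Pr(case)/Pr(non-case) into the probability Pr(case)."""
    return odds / (1.0 + odds)

def odds_from_penetrance(p):
    """Inverse conversion, useful as a sanity check."""
    return p / (1.0 - p)

# Hypothetical interaction model: each extra risk allele multiplies the odds by 1.5.
baseline_odds = 0.05
odds_table = [[baseline_odds * 1.5 ** (i + j) for j in range(3)] for i in range(3)]
penetrance_table = [[penetrance_from_odds(o) for o in row] for row in odds_table]
for row in penetrance_table:
    print(" ".join(f"{p:.3f}" for p in row))
```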

Two-locus testing results
The power represents the fraction of times the test succeeded in detecting the right pair.
The pair-wise models often do much better than the other models.
(Figure panels: Model 1, Model 2, Model 3.)

Exhaustive versus 2-stage
The two-stage strategy may not capture all epistatic interactions.
Q: Is there a scheme that can capture interacting loci without considering all pairs?

Efficient detection of interactions
Warning: these are all half-baked ideas, not established knowledge.
Consider two loci that interactively associate with a disease, but are not individually that significant.
Also assume (for now) that we only have binary data (instead of ternary) at each SNP.

Re-mapping the problem
Consider the distribution of haplotypes for diseased individuals (cases).
For the two SNPs, we have 4 haplotypes
– 00, 01, 10, 11
Let #{00} represent the number of diseased individuals with 00.
Claim: If the haplotypes are correlated, we should see an excess of #{00,11}, or #{01,10}.
Is this claim true? Can you prove it?

Re-mapping the problem
Define an interacting pair as a pair in which #{00,11} is HIGH.
– We will deal with the #{01,10} case later
Given m SNPs, the goal is to identify all interacting pairs in $o(m^2)$ time.
Ex: if $m = 10^6$, can we identify all such pairs in time
– $O(m) \sim 10^6$
– $O(m \log m) \sim 10^7$
– $O(m^{3/2}) \sim 10^9$
We modify this problem as follows:
– Given m SNPs, identify a subset S’ of $o(m^2)$ pairs in $o(m^2)$ time such that w.h.p. all interacting pairs are in S’

Coin tosses
Suppose you have a loaded coin (p = 0.7).
How can you detect with high confidence that it is loaded?
Toss the coin n times. The number of heads should differ from the n/2 expected for a fair coin, and the gap stands out more clearly as n grows.
Compare n = 100 with n = 10000.
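
A small worked calculation for the two sample sizes on this slide, using the binomial mean np and standard deviation sqrt(np(1-p)). The "separation" is the gap between the loaded and fair expectations measured in fair-coin standard deviations; that measure is a choice made for this sketch.

```python
from math import sqrt

def heads_summary(n, p):
    """Mean and standard deviation of the number of heads in n tosses."""
    return n * p, sqrt(n * p * (1 - p))

def separation(n, p_loaded, p_fair=0.5):
    """Gap between loaded and fair expectations, in fair-coin standard deviations."""
    mean_fair, sd_fair = heads_summary(n, p_fair)
    mean_loaded, _ = heads_summary(n, p_loaded)
    return (mean_loaded - mean_fair) / sd_fair

for n in (100, 10000):
    mean_fair, sd_fair = heads_summary(n, 0.5)
    mean_loaded, sd_loaded = heads_summary(n, 0.7)
    print(f"n={n}: fair {mean_fair:.0f} +/- {sd_fair:.1f}, "
          f"loaded {mean_loaded:.0f} +/- {sd_loaded:.1f}, "
          f"separation {separation(n, 0.7):.0f} sd")
```

The expected gap grows like n while the standard deviation grows like sqrt(n), which is why more tosses give higher confidence.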

Hamming distance problem
For interacting SNPs on n individuals, the Hamming distance is LOW: <= $k_1/n$.
For non-interacting SNPs on n individuals, the Hamming distance is >= $k_2/n$.
Can you identify (w.h.p.) all the low Hamming distance pairs while only including a small fraction of the high Hamming distance pairs?

     s1  s2
1:    0   0
2:    0   0
3:    0   0
4:    0   1
5:    0   0
6:    1   1
7:    1   1
8:    1   1
9:    1   0
Think of every SNP as a binary string of length n.
Choose l positions at random for a SNP s, and consider the binary string due to those l positions as h(s).
Ex: $h_{1,3,7}(s_1) = 001$
$\Pr(h(s_1) = h(s_2) \mid \text{SNPs interact}) = ?$
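
A rough sketch of the sampling idea, using the two SNP strings from the table. SNPs whose sampled bits agree fall into the same bucket and become candidate pairs. The third SNP, the number of rounds, and the seed are hypothetical. If the l positions are drawn independently with replacement, two SNPs at Hamming distance d collide in one round with probability (1 - d/n)^l; that sampling scheme is an assumption of this sketch, not something the slide specifies.

```python
import random
from collections import defaultdict
from itertools import combinations

def h(snp, positions):
    """Concatenate the bits of `snp` at the given 1-based positions."""
    return "".join(str(snp[i - 1]) for i in positions)

def hamming(s, t):
    """Number of individuals at which the two SNPs differ."""
    return sum(a != b for a, b in zip(s, t))

def candidate_pairs(snps, l, rounds, seed=0):
    """Pairs of SNP names that share a bucket in at least one sampling round."""
    rng = random.Random(seed)
    n = len(next(iter(snps.values())))
    found = set()
    for _ in range(rounds):
        positions = [rng.randint(1, n) for _ in range(l)]  # sampled with replacement
        buckets = defaultdict(list)
        for name, snp in snps.items():
            buckets[h(snp, positions)].append(name)
        for names in buckets.values():
            found.update(combinations(sorted(names), 2))
    return found

s1 = [0, 0, 0, 0, 0, 1, 1, 1, 1]
s2 = [0, 0, 0, 1, 0, 1, 1, 1, 0]
s3 = [1 - bit for bit in s1]  # hypothetical SNP that differs from s1 everywhere

d, n, l = hamming(s1, s2), len(s1), 3
collision_probability = (1 - d / n) ** l
pairs = candidate_pairs({"s1": s1, "s2": s2, "s3": s3}, l=l, rounds=50)
print(h(s1, [1, 3, 7]), h(s2, [1, 3, 7]), d, round(collision_probability, 3))
print(sorted(pairs))
```

Repeating the sampling over several rounds raises the chance that a close pair is caught at least once, while pairs that differ almost everywhere rarely share a bucket.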