Genome-wide Association

Slides:



Advertisements
Similar presentations
Statistical methods for genetic association studies
Advertisements

Population structure.
Generalized Regional Admixture Mapping (RAM) and Structured Association Testing (SAT) David T. Redden, Associate Professor, Department of Biostatistics,
Association Tests for Rare Variants Using Sequence Data
Multiple Comparisons Measures of LD Jess Paulus, ScD January 29, 2013.
Basics of Linkage Analysis
Association Mapping David Evans. Outline Definitions / Terminology What is (genetic) association? How do we test for association? When to use association.
MALD Mapping by Admixture Linkage Disequilibrium.
Lab 13: Association Genetics. Goals Use a Mixed Model to determine genetic associations. Understand the effect of population structure and kinship on.
Estimating “Heritability” using Genetic Data David Evans University of Queensland.
More Powerful Genome-wide Association Methods for Case-control Data Robert C. Elston, PhD Case Western Reserve University Cleveland Ohio.
MSc GBE Course: Genes: from sequence to function Genome-wide Association Studies Sven Bergmann Department of Medical Genetics University of Lausanne Rue.
Thoughts about the TDT. Contribution of TDT: Finding Genes for 3 Complex Diseases PPAR-gamma in Type 2 diabetes Altshuler et al. Nat Genet 26:76-80, 2000.
Give me your DNA and I tell you where you come from - and maybe more! Lausanne, Genopode 21 April 2010 Sven Bergmann University of Lausanne & Swiss Institute.
Shaun Purcell & Pak Sham Advanced Workshop Boulder, CO, 2003
Analysis of genome-wide association studies
1 Association Analysis of Rare Genetic Variants Qunyuan Zhang Division of Statistical Genomics Course M Computational Statistical Genetics.
Genome-Wide Association Study (GWAS)
Lecture 19: Association Studies II Date: 10/29/02  Finish case-control  TDT  Relative Risk.
Experimental Design and Data Structure Supplement to Lecture 8 Fall
Type 1 Error and Power Calculation for Association Analysis Pak Sham & Shaun Purcell Advanced Workshop Boulder, CO, 2005.
Jianfeng Xu, M.D., Dr.PH Professor of Public Health and Cancer Biology Director, Program for Genetic and Molecular Epidemiology of Cancer Associate Director,
CORRELATION: Correlation analysis Correlation analysis is used to measure the strength of association (linear relationship) between two quantitative variables.
INTRODUCTION TO ASSOCIATION MAPPING
Lab 13: Association Genetics December 5, Goals Use Mixed Models and General Linear Models to determine genetic associations. Understand the effect.
Tutorial #10 by Ma’ayan Fishelson. Classical Method of Linkage Analysis The classical method was parametric linkage analysis  the Lod-score method. This.
1 B-b B-B B-b b-b Lecture 2 - Segregation Analysis 1/15/04 Biomath 207B / Biostat 237 / HG 207B.
Lecture 3: Statistics Review I Date: 9/3/02  Distributions  Likelihood  Hypothesis tests.
Lecture 21: Quantitative Traits I Date: 11/05/02  Review: covariance, regression, etc  Introduction to quantitative genetics.
Association analysis Genetics for Computer Scientists Biomedicum & Department of Computer Science, Helsinki Päivi Onkamo.
A Transmission/disequilibrium Test for Ordinal Traits in Nuclear Families and a Unified Approach for Association Studies Heping Zhang, Xueqin Wang and.
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
Lecture 22: Quantitative Traits II
Lecture 23: Quantitative Traits III Date: 11/12/02  Single locus backcross regression  Single locus backcross likelihood  F2 – regression, likelihood,
Efficient calculation of empirical p- values for genome wide linkage through weighted mixtures Sarah E Medland, Eric J Schmitt, Bradley T Webb, Po-Hsiu.
Increasing Power in Association Studies by using Linkage Disequilibrium Structure and Molecular Function as Prior Information Eleazar Eskin UCLA.
Association Mapping in Families Gonçalo Abecasis University of Oxford.
Power and Meta-Analysis Dr Geraldine M. Clarke Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015 Africa Centre for.
Lecture 2 Survey Data Analysis Principal Component Analysis Factor Analysis Exemplified by SPSS Taylan Mavruk.
Power Calculations for GWAS
University of Colorado at Boulder
GS/PPAL Section N Research Methods and Information Systems
MULTIPLE GENES AND QUANTITATIVE TRAITS
Genome-wide Association
Complex disease and long-range regulation: Interpreting the GWAS using a Dual Colour Transgenesis Strategy in Zebrafish.
upstream vs. ORF binding and gene expression?
Genetic Association Analysis
Mendelian randomization with invalid instruments: Egger regression and Weighted Median Approaches David Evans.
Genome Wide Association Studies using SNP
CJT 765: Structural Equation Modeling
Marker heritability Biases, confounding factors, current methods, and best practices Luke Evans, Matthew Keller.
12 Inferential Analysis.
Elementary Statistics
Population stratification
Epidemiology 101 Epidemiology is the study of the distribution and determinants of health-related states in populations Study design is a key component.
MULTIPLE GENES AND QUANTITATIVE TRAITS
Genome-wide Associations
The ‘V’ in the Tajima D equation is:
Genome-wide Association Studies
Beyond GWAS Erik Fransen.
Correlation for a pair of relatives
Chapter 8: Weighting adjustment
12 Inferential Analysis.
Inferential Statistics
Lecture 9: QTL Mapping II: Outbred Populations
Perspectives from Human Studies and Low Density Chip
An Expanded View of Complex Traits: From Polygenic to Omnigenic
BOULDER WORKSHOP STATISTICS REVIEWED: LIKELIHOOD MODELS
Presentation transcript:

Genome-wide Association David Evans

This Session Tests of association in unrelated individuals Population Stratification Assessing significance in genome-wide association Replication Population Stratification Practical

Tests of Association in Unrelated Individuals

Simple Additive Regression Model of Association (Unrelated individuals) Yi = a + bXi + ei where Yi = trait value for individual i Xi = number of ‘A’ alleles an individual has 1.2 1 0.8 Y 0.6 0.4 0.2 X 1 2 Association test is whether b > 0

Linear Regression Including Dominance Yi = a + bxXi + bzZi + ei where Yi = trait value for individual i Xi = 1 if individual i has genotype ‘AA’ Zi= 0 for ‘AA’ 0 if individual i has genotype ‘Aa’ 1 for ‘Aa’ -1 if individual i has genotype ‘aa‘ 0 for ‘aa’ 1.2 1 d a 0.8 Y 0.6 -a 0.4 0.2 1 X 2

Genetic Case Control Study Cases G/T G/G T/T T/T T/G T/G T/T T/T T/G G/G T/G T/G T/G T/T Allele G is ‘associated’ with disease

Allele-based tests Each individual contributes two counts to 2x2 table. Test of association where X2 has χ2 distribution with 1 degrees of freedom under null hypothesis.   Cases Controls Total G n1A n1U n1· T n0A n0U n0· n·A n·U n··

Genotypic tests SNP marker data can be represented in 2x3 table. Test of association where X2 has χ2 distribution with 2 degrees of freedom under null hypothesis. Cases Controls Total GG n2A n2U n2· GT n1A n1U n1· TT n0A n0U n0· n·A n·U n··

Dominance Model Each individual contributes two counts to 2x2 table. Test of association where X2 has χ2 distribution with 1 degrees of freedom under null hypothesis. Cases Controls Total GG/GT n1A n1U n1· TT n0A n0U n0· n·A n·U n··

Logistic regression framework Model case/control status within a logistic regression framework. Let πi denote the probability that individual i is a case, given their genotype Gi. Logit link function where

Indicator variables Represent genotypes of each individual by indicator variables: Additive model Genotype model Genotype Z(M)i Z(Mm)i Z(MM)i mm -1 Mm 1 MM 2

Likelihood calculations Log-likelihood of case-control data given marker genotypes where yi = 1 if individual i is a case, and yi = 0 if individual i is a control. Maximise log-likelihood over β parameters, denoted . Models fitted using PLINK.

Model comparison Compare models via deviance, having a χ2 distribution with degrees of freedom given by the difference in the number of model parameters. Models Deviance df Additive vs null 1 Genotype vs null 2

Covariates It is straightforward to incorporate covariates in the logistic regression model: age, gender, and other environmental risk factors. Generalisation of link function, e.g. for additive model: where Xij is the response of individual i to the jth covariate, and γj is the corresponding covariate regression coefficient.

Caution with Covariates! Covariates useful for: Controlling for confounding Increasing power Should be used with caution! Lung Cancer SNP Smoking

Caution with Covariates! SNP SNP “Collider” Bias BMI CRP G, E (-SNP) SNP “Collider” Bias Outcome Covariate

Caution with Covariates! Intuition is different for binary traits! Case control studies only Can increase or decrease power Depends on prevalence of disease (<20%) Most apparent for strongly associated covariates

Population Stratification

Definitions: Stratification and Admixture Stratification / Sub-structure Refers to the situation where a sample of individuals consists of several discrete subgroups which do not interbreed as a single randomly mating unit Admixture Implies that subgroups also interbreed. Therefore individuals may be a mixture of different ancestries.

My Samples Sample 1 Americans χ2=0 p=1 Use of Chopsticks A Yes No Total A1 320 640 A2 80 160 400 800

My Samples Sample 2 Chinese χ2=0 p=1 Use of Chopsticks A Yes No Total 320 20 340 A2 640 40 680

Sample 3 Americans + Chinese My Samples Sample 3 Americans + Chinese χ2=34.2 p=4.9x10-9 Use of Chopsticks A Yes No Total A1 640 340 980 A2 400 100 500 1040 440 1480

Population structure Marchini, Nat Genet (2004)

Study without knowledge of genetic background: ADMIXTURE: (DIABETES IN AMERICAN INDIANS) Full heritage American Indian Population + - Gm3;5,13,14 ~1% ~99% (NIDDM Prevalence  40%) Caucasian Population + - Gm3;5,13,14 ~66% ~34% (NIDDM Prevalence  15%) Study without knowledge of genetic background: OR=0.27 95%CI = 0.18 - 0.40

Index of Indian Heritage ADMIXTURE: (DIABETES IN AMERICAN INDIANS) 39.3% 35.9% 8 28.8% 28.3% 4 19.9% 17.8% - + Gm3;5,13,14 Index of Indian Heritage Gm haplotype serves as a marker for Caucasian admixture

QQ plots McCarthy et al. (2008) Nature Genetics

Solutions (common variants) Family-based Analysis Stratified Analysis Analyze Chinese and American samples separately then combine statistically Model the confounder Include a term for Chinese or American ancestry in a logistic regression model Principal Components Genomic Control Linear Mixed Models LD score regression

Transmission Disequilibrium Test Rationale: Related individuals have to be from the same population Compare number of times heterozygous parents transmit “A” vs “C” allele to affected offspring Many variations AC AA AC

TDT Spielman et al 1993 AJHG

TDT Advantages Robust to stratification Identification of Mendelian Inconsistencies Parent of Origin Effects More accurate haplotyping AC AA AC

TDT Disadvantages Difficult to gather families Difficult to get parents for late onset / psychiatric conditions Genotyping error produces bias Inefficient for genotyping (particularly GWA) AC AA AC

Case-control versus TDT α = 0.05; RAA = RAa = 2

Genomic control 2 No stratification Test locus Unlinked ‘null’ markers 2 Stratification  adjust test statistic

Genomic control TN /  ~ 21 “λ” is Genome-wide inflation factor Test statistic is distributed under the null: TN /  ~ 21 Problems…

Principal Components Analysis Principal Components Analysis is applied to genotype data to infer continuous axes of genetic variation Each axis explains as much of the genetic variance in the data as possible with the constraint that each component is orthogonal to the preceding components The top principal Components tend to describe population ancestry Include principal components in regression analysis => correct for the effects of stratification EIGENSTRAT, SHELLFISH

Principal Component Two Novembre et al, Nature (2008) Principal Component Two Principal Component One

Wellcome Trust Case Control Consortium

Population structure -  Disease 1 1.15 2 1.08 3 1.09 4 1.26 5 1.06 6 1.07 7 1.10 Genomic control -  genome-wide inflation of median test statistic

Disease collection center 1 No. of samples 524 2 271 3 439 4 465 5 301 Center 3:  = 1.77 All others:  = 1.09

Multi-dimensional Scaling

Linear Mixed Models The test of association is performed in the fixed effects part of the model (“model for the means”) “Relatedness” between individuals (due to both population structure and cryptic relatedness) is captured in the modelling of the covariance between individuals Can increase power by implicitly conditioning on associated loci other than the candidate locus (quantitative traits) Variety of software packages (e.g. GCTA, GEMMA)

Linear Mixed Models y = Xβ + g + ε y is N x 1 vector of observed phenotypes X is N x k vector of observed covariates β is k x 1 vector of fixed effects coefficients g is N x 1 vector of total genetic effects per individual g ~(0, Aσg2) A is the GRM between different individuals V = Aσg2 + Iσε2

Example Sawcer et al, Nature (2011)

Comparison of Approaches in Sawcer et al. No correction PCA correction (top 100 PCs) Mixed-model correction

Linear Mixed Models - Complexities Many markers required for proper control of stratification Inclusion of the causal variant in the GRM will decrease power to detect association (GCTA-LOCO) Case-control analyses are a different story and these sorts of models can involve a substantial decrease in power

LD Score Regression

LD Score Regression- Key Points A key issue in GWAS is how to distinguish inflation by polygenicity from bias This is increasingly important as the size of GWAS (meta-analyses) increases LD score regression quantifies the contribution of each by examining the relationship between the test statistics and LD Estimates a more accurate measure of test score inflation than genomic control

LD Score Regression- Basic Idea The basic idea is that the more genetic variation a marker tags, the higher the probability that it will tag a causal variant In contrast, variation from population stratification/cryptic relatedness shouldn’t correlate with LD Regress test statistics from GWAS against LD score. The intercept minus one from this regression is an estimator of the mean contribution of confounding to the inflation of the test statistics

Simulation of a genetic signal in polygenic architecture

Simulation of a genetic signal in polygenic architecture LD blocks

Simulation of a genetic signal in polygenic architecture Lonely SNPs [no LD] LD blocks

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP Sim 1

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1 * * Sim 2

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1 * * Sim 2

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1 * * Sim 2 * * Sim 3

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1 * * Sim 2 * * Sim 3

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1 * * Sim 2 * * Sim 3 ... Sim N * *

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP * * Sim 1 * * Sim 2 * * Sim 3 ... Sim 20 * * Prob for 1 causal SNP 9/20 1/20 4/20 1/20 5/20

Simulation of a genetic signal in polygenic architecture LD Block Lonely SNP * Causal SNP Assuming a uniform prior, we see SNPs with more LD friends showing more association ie., the more you tag, the more likely you are to tag a causal variant * * Sim 1 * * Sim 2 * * Sim 3 ... Sim 20 * * Prob for 1 causal SNP 9/20 1/20 4/20 1/20 5/20

What happens under genetic drift? LD Block Lonely SNP Probability of drift

What happens under genetic drift? LD Block Lonely SNP Probability of drift 1/5 1/5 1/5 1/5 1/5

What happens under genetic drift? LD Block Lonely SNP Probability of drift 1/5 1/5 1/5 1/5 1/5 Under pure drift we expect LD to have no relationship to differences in allele frequencies between populations

LD Score Regression (LD Score) N = sample size M = number of SNPs h2 / M = average heritability per SNP a = Population structure / cryptic relatedness

Polygenic Architecture Lambda = 1.30; LD score regression intercept = 1.02

Population Stratification Lambda = 1.30; LD score regression intercept = 1.32

Empirical Results Bulik-Sullivan et al. (2015) Nat Genet

Imputation (Sarah Medland)

Meta-analysis (Meike Bartels)

Assessing “Significance” in Genome-wide Association Studies

Asymptotic P values “The probability of observing the test result or a more extreme value than the test result under the null hypothesis” The p value is NOT the probability that the null hypothesis is true The probability that the null/alternate hypothesis is true is a function of the evidence contained in the data (p value), the power of the test, and the prior probability that the association is true/false The p value is a fluid measure of the strength of evidence against the null hypothesis that was designed to be interpreted in conjunction with other (pre-existing) evidence

Interpreting p values STRONGER EVIDENCE WEAKER EVIDENCE Genotyping error unlikely “Suspicious” SNP Stratification unlikely Stratification possible Low p value Borderline p value Powerful Study Weak Study High MAF Low MAF Candidate Gene Intergenic region Previous Association No previous evidence

Multiple Testing Multiple Testing Problem: The probability of observing a “significant” result purely by chance increases with the number of statistical tests performed For testing 500,000 SNPs 5,000 expected to be significant at α < .01 500 expected to be significant at α < .001 … 0.05 expected to be significant at α < 10-7 One solution is to maintain αFWER = .05 Bonferroni correction for m tests Set significance level to α = .05/m “Effective number of statistical tests “Genome-wide Significance” suggested at around α = 5 x 10-8 for European populations

Permutation Testing The distribution of the test statistic under the null hypothesis can be derived by shuffling case-control status relative to the genotypes, and performing the test of association many times Permutation breaks down the relationship between genotype and phenotype but maintains the pattern of linkage disequilibrium in the data Appropriate for rare genotypes, small studies, non-normal phenotypes etc.

Replication

Replication Replicating the genotype-phenotype association is the “gold standard” for “proving” an association is genuine Most loci underlying complex diseases will not be of large effect It is unlikely that a single study will unequivocally establish an association without the need for replication

Guidelines for Replication Replication studies should be of sufficient size to demonstrate the effect The same SNP should be tested The replicated signal should be in the same direction Replication studies should conducted in independent datasets Joint analysis should lead to a lower p value than the original report Replication should involve the same phenotype Replication should be conducted in a similar population Well designed negative studies are valuable

Practical