Heritability, prediction and the genomic architecture of complex traits


Heritability, prediction and the genomic architecture of complex traits
David Balding and Doug Speed
EMBO Practical Course on Genotype to Phenotype Mapping of Complex Traits, Hinxton, July 30, 2014
Any questions, email doug.speed@ucl.ac.uk

Datafiles

data.bed, data.bim, data.fam: genotypic data for 1500 individuals and 60k SNPs (chr 1, 6 & 10)
phen.pheno: three phenotypes
sex.covar: gender
mlist1, mlist2: lists of kinship files
genelist.txt: gene annotations
weights.txt: SNP weightings
train.fam: (randomly drawn) list of individuals used to train the prediction model
test.fam: list of individuals used to test the prediction model

Note: all arguments are preceded by two dashes (even if they appear as one long dash!)

Why use allelic correlations A = XX^T/N?

The linear mixed model with $A = XX^T/N$ is equivalent to a random effects regression model:

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{60{,}000} X_{60{,}000} + e$$

Suppose $\beta_j \sim N(0, \sigma_g^2/N)$ and $e_i \sim N(0, \sigma_e^2)$, where $N$ is the number of SNPs. Then $g = \sum_j X_j \beta_j \sim N(0, A\sigma_g^2)$, where $A = XX^T/N$.

We obtain the mixed model:

$$Y = \alpha + g + e \sim N(\alpha, A\sigma_g^2 + I\sigma_e^2)$$

Variance explained by all SNPs is $\sigma_g^2 / (\sigma_g^2 + \sigma_e^2)$.
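To make the definition A = XX^T/N concrete, here is a minimal sketch in plain Python. It assumes each SNP column is standardised with its sample mean and variance, which is a simplification (GCTA scales by allele frequencies), and the toy genotype matrix and function names are made up for illustration.

```python
import math

def standardise_columns(genotypes):
    """Centre and scale each SNP (column) to mean 0 and variance 1."""
    n = len(genotypes)
    n_snps = len(genotypes[0])
    columns = []
    for j in range(n_snps):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        columns.append([(v - mean) / sd for v in col])
    # Transpose back to an individuals-by-SNPs layout.
    return [[columns[j][i] for j in range(n_snps)] for i in range(n)]

def allelic_correlations(genotypes):
    """Return A = X X^T / N for the standardised genotype matrix X."""
    x = standardise_columns(genotypes)
    n_snps = len(x[0])
    return [[sum(a * b for a, b in zip(xi, xk)) / n_snps for xk in x] for xi in x]

# Toy data (assumption): 4 individuals x 5 SNPs coded as allele counts 0/1/2.
toy = [
    [0, 1, 2, 0, 1],
    [1, 1, 0, 2, 0],
    [2, 0, 1, 1, 1],
    [0, 2, 1, 0, 2],
]
A = allelic_correlations(toy)
for row in A:
    print(" ".join(f"{v:6.3f}" for v in row))
```

With this standardisation the diagonal of A averages exactly one and each row sums to zero, which is why off-diagonal entries can be read as similarity relative to the sample average.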

Using GCTA (Genome-wide Complex Trait Analysis)

1. Calculate similarity matrices separately for Chromosomes 1, 6 and 10 [because $XX^T = X_1X_1^T + X_6X_6^T + X_{10}X_{10}^T$]

    gcta --bfile data --make-grm --chr 1 --out chr1
    gcta --bfile data --make-grm --chr 6 --out chr6
    gcta --bfile data --make-grm --chr 10 --out chr10

Other options: --extract, --maf or --keep, or for dosage data --dosage-mach

2. Join these three similarity matrices together [mlist1 contains the prefixes of the matrices]

    gcta --mgrm mlist1 --make-grm --out chrALL

Elements of chrALL.grm.bin reflect "genome-wide" similarity for pairs of individuals.

The GRM is stored in binary format, but you can view it using

    gcta --mgrm mlist1 --make-grm-gz --out chrALL
    less -S chrALL.grm.gz
    gunzip -c chrALL.grm.gz | awk '{if($1!=$2 && $4>0.25) print $0}'

E.g. individual pair 314 & 236 has kinship 0.404, while 433 & 264 has kinship 0.999.
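The awk filter above can also be written in Python, which may be easier to extend. This is a sketch that assumes the four-column layout the awk command relies on (two individual indices, a SNP count, then the kinship value); the function names and the example rows are invented for illustration.

```python
import gzip

def related_pairs(lines, threshold=0.25):
    """Yield (id1, id2, kinship) for off-diagonal pairs above the threshold."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        first, second, value = fields[0], fields[1], float(fields[3])
        if first != second and value > threshold:
            yield first, second, value

def related_pairs_from_file(path, threshold=0.25):
    """Apply the same filter to a gzipped text GRM such as chrALL.grm.gz."""
    with gzip.open(path, "rt") as handle:
        return list(related_pairs(handle, threshold))

# Made-up rows: index 1, index 2, number of SNPs, kinship.
example = [
    "1 1 60000 1.002",
    "2 1 60000 0.013",
    "2 2 60000 0.998",
    "3 1 60000 0.404",
]
print(list(related_pairs(example)))
```

Calling `related_pairs_from_file("chrALL.grm.gz")` would list the same pairs as the awk one-liner, assuming the file was produced by the --make-grm-gz step.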

Using GCTA (Genome-wide Complex Trait Analysis)

3. Identify and remove closely related individuals (lose 18 individuals)

    gcta --bfile data --grm chrALL --grm-cutoff 0.05 --make-grm --out filter

4. Perform PCA (used to identify population outliers, which should be removed)

    gcta --bfile data --grm chrALL --keep filter.grm.id --pca 2 --out filter

Using GCTA (Genome-wide Complex Trait Analysis)

5. Estimate total variance explained by SNPs for Phenotype 1

    gcta --reml --grm chrALL --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out p1

Could add covariates with option --covar. Find that total variance explained is 93%.

6a. Estimate total variance explained by SNPs for Phenotype 3 (binary trait)

    gcta --reml --grm chrALL --pheno phen.pheno --mpheno 3 --keep filter.grm.id --out p3

Find that the estimate on the observed scale is 63%.

6b. To convert to the liability scale, provide the prevalence (assumed to be 1 in 100)

    gcta --reml --grm chrALL --pheno phen.pheno --mpheno 3 --keep filter.grm.id --out binary --prevalence 0.01

On the liability scale the variance explained is 36%.
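The observed-to-liability conversion in step 6b can be illustrated with the commonly used transformation h2_liability = h2_observed * K^2 (1-K)^2 / (z^2 P (1-P)), where K is the population prevalence, P the fraction of cases in the sample and z the standard normal density at the liability threshold. The sketch below is not GCTA's code, and the case fraction of 0.5 is an assumption (the slides do not state it), so the result need not reproduce the 36% quoted above exactly.

```python
from statistics import NormalDist

def observed_to_liability(h2_observed, prevalence, case_fraction):
    """Rescale an observed-scale heritability to the liability scale."""
    std_normal = NormalDist()
    # Liability threshold above which an individual is a case.
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    z = std_normal.pdf(threshold)
    scale = (prevalence * (1.0 - prevalence)) ** 2 / (
        z ** 2 * case_fraction * (1.0 - case_fraction)
    )
    return h2_observed * scale

# 0.63 and 0.01 come from the slide; case_fraction=0.5 is assumed.
h2_liability = observed_to_liability(0.63, prevalence=0.01, case_fraction=0.5)
print(round(h2_liability, 3))
```

The scaling factor depends strongly on the case fraction, so the sample ascertainment should be checked before interpreting a liability-scale estimate.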

Genome partitioning

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{32} X_{32} + e$$

Suppose $\beta_1, \beta_2, \dots, \beta_{16} \sim N(0, \sigma_{a1}^2/N_1)$ and $\beta_{17}, \beta_{18}, \dots, \beta_{32} \sim N(0, \sigma_{a2}^2/N_2)$, where $N_1 = N_2 = 16$ are the numbers of SNPs in each set.

Now we have the mixed model:

$$Y = \alpha + g + e \sim N(\alpha, A_1\sigma_{a1}^2 + A_2\sigma_{a2}^2 + I\sigma_e^2)$$

where $A_1 = X_{(1)}X_{(1)}^T/N_1$ and $A_2 = X_{(2)}X_{(2)}^T/N_2$, with $X_{(1)}$ and $X_{(2)}$ the genotype matrices of the first and last 16 SNPs.

Variance explained by the first 16 SNPs is $\sigma_{a1}^2 / (\sigma_{a1}^2 + \sigma_{a2}^2 + \sigma_e^2)$; by the last 16 SNPs it is $\sigma_{a2}^2 / (\sigma_{a1}^2 + \sigma_{a2}^2 + \sigma_e^2)$.

Genome partitioning

Genome partitioning has been used to estimate the proportion of a trait's phenotypic variance explained by individual chromosomes.

[Figure: variance explained per chromosome for a) height, b) BMI, c) vWF, d) QTi]

Used to demonstrate that traits are polygenic.

Testing for inflation due to cryptic relatedness

Divide the SNPs $X$ into a left half $X_1$ and a right half $X_2$ (e.g. $X_1$ contains SNPs in Chr 1-8; $X_2$ contains those in Chr 9-22).

Estimate total $h^2$: mixed model with $K = XX^T/N$
Estimate left half $h^2_L$: mixed model with $K_1 = X_1X_1^T/N_1$
Estimate right half $h^2_R$: mixed model with $K_2 = X_2X_2^T/N_2$

If cryptic relatedness is not inflating estimates, then we would expect $h^2 = h^2_L + h^2_R$. In that case, the estimates of $h^2_L$ and $h^2_R$ from the joint model $Y \sim N(\alpha, K_1\sigma_{g1}^2 + K_2\sigma_{g2}^2 + I\sigma_e^2)$ should match those from the individual models $Y \sim N(\alpha, K_1\sigma_{g1}^2 + I\sigma_e^2)$ and $Y \sim N(\alpha, K_2\sigma_{g2}^2 + I\sigma_e^2)$ [5].

Genome partitioning

7a. Estimate variance explained for Phenotype 1 by each chromosome jointly

    gcta --reml --mgrm mlist1 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out gp

Get breakdown as 47%, 23% & 24% (sum to 95%, slightly higher than combined).

7b. Can compare this with variance explained by each chromosome separately

    gcta --reml --grm chr1 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out chr1
    gcta --reml --grm chr6 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out chr6
    gcta --reml --grm chr10 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out chr10

Estimates are 45%, 22% & 27% (sum to 93%).

The sums differ only slightly (93% versus 95%), which indicates only a mild contribution from inflation due to residual relatedness and population structure (a useful check).

Bivariate analysis

SNP-based heritability analysis can be used to assess the concordance between traits:

$$Y_1 = \alpha_1 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + e_1$$
$$Y_2 = \alpha_2 + \lambda_1 X_1 + \lambda_2 X_2 + \lambda_3 X_3 + \dots + e_2$$

The individual analyses suppose $\beta_j \sim N(0, \sigma_{g1}^2/N)$ and $\lambda_j \sim N(0, \sigma_{g2}^2/N)$. But we can additionally specify $\mathrm{cor}(\beta_j, \lambda_j) = \tau$, where $\tau$ takes values in $[-1, 1]$.

This is similar to the univariate analysis, except that we now analyse Traits 1 and 2 jointly using a generalized version of REML which estimates $\sigma_{g1}^2$, $\sigma_{g2}^2$ and $\tau$ [9].

For different traits, we are interested in testing whether $\tau$ is significantly non-zero (pleiotropy). For sub-phenotypes, we can also test whether $\tau$ is significantly less than one (heterogeneity).
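A short derivation may help show how the effect-size correlation enters the joint model. Writing $g_1 = \sum_j X_j \beta_j$ and $g_2 = \sum_j X_j \lambda_j$, and assuming (as in the univariate case) that effects at different SNPs are independent, the cross-covariance of the two genetic components is:

```latex
\begin{aligned}
\operatorname{cov}(\beta_j, \lambda_j) &= \tau \,\frac{\sigma_{g1}\,\sigma_{g2}}{N}, \\
\operatorname{cov}(g_1, g_2)
  &= \sum_{j} X_j X_j^{T}\, \operatorname{cov}(\beta_j, \lambda_j)
   = \tau\,\sigma_{g1}\,\sigma_{g2}\,\frac{X X^{T}}{N}
   = \tau\,\sigma_{g1}\,\sigma_{g2}\, A .
\end{aligned}
```

So the same matrix $A$ that carries the genetic variance of each trait also carries their genetic covariance, scaled by $\tau\,\sigma_{g1}\,\sigma_{g2}$. Any covariance between the residuals $e_1$ and $e_2$ is left out of this sketch.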

Bivariate analysis

Mainly used to test concordance between traits (pleiotropy), e.g. to demonstrate overlap between bipolar disorder and schizophrenia.

But can also be used to test concordance within traits, e.g. how similar are focal and non-focal epilepsy?

Bivariate analysis

8a. Bivariate analysis between Phenotypes 1 and 2

    gcta --reml-bivar 1 2 --grm both --pheno phen.pheno --keep filter.grm.id --out bivar

Estimate the correlation between Traits 1 and 2 to be 53%.

8b. Bivariate analysis: test whether the correlation is significantly different from 0 (pleiotropy)

    gcta --reml-bivar 1 2 --grm both --pheno phen.pheno --keep filter.grm.id --out bivar0 --reml-bivar-lrt-rg 0

The correlation is significantly greater than 0 (P < 1e-16, chi-squared test statistic 527).

8c. Bivariate analysis: test whether the correlation is significantly lower than 1 (heterogeneity)

    gcta --reml-bivar 1 2 --grm both --pheno phen.pheno --keep filter.grm.id --out bivar1 --reml-bivar-lrt-rg 1

The correlation is significantly lower than 1 (P < 1e-16, chi-squared test statistic 82).
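As a sanity check on the reported significance, the likelihood ratio statistics can be converted to p-values by hand. The sketch below treats each statistic as chi-squared with one degree of freedom, which is an assumption; a one-sided boundary test would halve these values. The function name is made up.

```python
import math

def chi2_sf_1df(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom.

    Uses P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))

# Test statistics quoted in steps 8b and 8c.
for label, stat in [("rg = 0", 527.0), ("rg = 1", 82.0)]:
    print(f"H0: {label}, statistic {stat:.0f}, p = {chi2_sf_1df(stat):.3g}")
```

Both p-values fall far below 1e-16, consistent with the bounds quoted above.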

Using LDAK (Linkage Disequilibrium Adjusted Kinships)

The aim of the SNP weightings is to equalise the signal variance across the genome. If SNPs are weighted equally, then signal in high-LD (well tagged) regions will be overrepresented when calculating similarity, and signal in low-LD regions underrepresented.

E.g. if two SNPs are identical, we could give each a weighting of 0.5.

In practice, many SNPs get weighting zero, which means that their signal is completely explained by the SNPs that remain.
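The two-identical-SNPs example can be checked numerically. This sketch assumes a weighted kinship of the form K = sum_j w_j x_j x_j^T / sum_j w_j, which is a simplification of what LDAK computes, and uses made-up centred genotype values for three individuals.

```python
def weighted_kinship(x, weights):
    """K = sum_j w_j x_j x_j^T / sum_j w_j for an individuals-by-SNPs matrix x."""
    total = sum(weights)
    n = len(x)
    return [
        [sum(w * x[i][j] * x[k][j] for j, w in enumerate(weights)) / total
         for k in range(n)]
        for i in range(n)
    ]

# Made-up centred genotype values for 3 individuals at two distinct SNPs.
snp_a = [1.0, -1.0, 0.0]
snp_b = [0.5, 0.5, -1.0]

distinct = [[a, b] for a, b in zip(snp_a, snp_b)]
duplicated = [[a, a, b] for a, b in zip(snp_a, snp_b)]  # snp_a appears twice

reference = weighted_kinship(distinct, [1.0, 1.0])
unweighted = weighted_kinship(duplicated, [1.0, 1.0, 1.0])
reweighted = weighted_kinship(duplicated, [0.5, 0.5, 1.0])

print("reference :", reference[0])
print("unweighted:", unweighted[0])
print("reweighted:", reweighted[0])
```

Giving each copy of the duplicated SNP weight 0.5 recovers the kinship computed from the two distinct SNPs, whereas equal weights let the duplicated signal count twice.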

Using LDAK (Linkage Disequilibrium Adjusted Kinships)

1. Calculate SNP weightings (we will not run these during the practical)

    ldak3 --cut-weights sections --bfile data
    ldak3 --calc-weights sections --bfile data --section 1
    ...
    ldak3 --calc-weights sections --bfile data --section 21
    ldak3 --join-weights sections

Weightings (prepared earlier) are stored in weights.txt.

For computational reasons, it is necessary to divide the genome into sections (default size 3000 SNPs). For sparse genotyping, 3000 SNPs ~ 30 Mb, which is sufficiently large. For dense (sequence / imputed) data, 3000 SNPs ~ 1 Mb, so we suggest calculating weightings "twice", the second time using only SNPs with non-zero weights from the first run. This will increase the window size about 10-fold.

Using LDAK (Linkage Disequilibrium Adjusted Kinships)

2. Calculate kinships

    ldak3 --cut-kins partitions --bfile data --by-chr YES
    ldak3 --calc-kins partitions --bfile data --weights weights.txt --partition 1
    ldak3 --calc-kins partitions --bfile data --weights weights.txt --partition 2
    ldak3 --calc-kins partitions --bfile data --weights weights.txt --partition 3
    ldak3 --join-kins partitions --kinship-matrix YES

Other options: --min-maf, --min-var, --max-maf and --min-obs

To facilitate genome partitioning, can provide lists of SNPs.

Can examine kinships using

    less -S partitions/kinshipALL.grm.raw

Using LDAK (Linkage Disequilibrium Adjusted Kinships)

3. Decompose kinship

    ldak3 --decompose partitions/kinshipALL.eigen --grm partitions/kinshipALL --keep filter.grm.id

Will save eigenvalues and eigenvectors in partitions/kinshipALL.eigen

[Figure: two panels, "Without weightings" and "After weightings"]

Using LDAK (Linkage Disequilibrium Adjusted Kinships)

4a. Estimate total variance explained by SNPs for Phenotype 1

    ldak3 --reml outp1 --grm partitions/kinshipALL --pheno phen.pheno --mpheno 1 --keep filter.grm.id

As before, could add covariates with option --covar. Find that total variance explained is 88% (compared to 93% without weightings).

4b. Estimate total variance explained by SNPs for Phenotype 3

    ldak3 --reml outp3 --grm partitions/kinshipALL --pheno phen.pheno --mpheno 3 --keep filter.grm.id

Conversion from observed to liability scale is not yet implemented, sorry.

[Figure: two panels, "Without weightings" and "After weightings"]

Using LDAK (Linkage Disequilibrium Adjusted Kinships)

5a. Genome partitioning for Phenotype 1

    ldak3 --reml outgp --mgrm mlist2 --pheno phen.pheno --mpheno 1 --keep filter.grm.id

The breakdown is 41%, 18% and 27% (sum to 86%).

5b. Estimate variance explained for Phenotype 1 by each chromosome separately

    ldak3 --reml outchr1 --grm partitions/kinship1 --pheno phen.pheno --mpheno 1 --keep filter.grm.id
    ldak3 --reml outchr6 --grm partitions/kinship2 --pheno phen.pheno --mpheno 1 --keep filter.grm.id
    ldak3 --reml outchr10 --grm partitions/kinship3 --pheno phen.pheno --mpheno 1 --keep filter.grm.id

Individual estimates are 44%, 18% and 32% (sum to 94%).

Gene-based / regional tests

$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_{500{,}000} X_{500{,}000} + e$$

Can focus in on very small regions or individual genes (tens of SNPs).

Suppose we want to compute the variance explained by SNPs 14 to 18. The naive approach is to compute K from these SNPs and perform REML. But because the number of SNPs (N = 5) is smaller than the number of individuals, fast computational tricks can be used (FastLMM).

Gene-based analysis

6a. Divide genome into genes

    ldak3 --cut-genes genes --bfile data --genefile genelist.txt --weights weights.txt --gene-buffer 10000

3356 of the 4580 genes are found (have at least one SNP with non-zero weight). Can split the genes into partitions and analyse each in parallel, but this is typically not necessary.

6b. Test each gene for association with Phenotype 3

    ldak3 --calc-genes-reml genes --bfile data --weights weights.txt --pheno phen.pheno --mpheno 3 --keep filter.grm.id --covar sex.covar --partition 1

6c. Join the partitions (here, only one partition)

    ldak3 --join-genes-reml genes

Gene-based analysis

[Results table with columns: Heritability, LRT Pvalue, Score Pvalue]

The LRT Pvalue can be conservative, although it suffices for our purposes. Listgarten and Lippert implement a permutation-based p-value.

Chunk-based analysis

Restricting only to genes can miss a lot!

7a. Divide into 75,000 base pair chunks

    ldak3 --cut-genes chunks --bfile data --chunks-bp 75000 --weights weights.txt

Can also divide according to weighting using --chunks

7b. Now test each chunk

    ldak3 --calc-genes-reml chunks --bfile data --weights weights.txt --pheno phen.pheno --mpheno 3 --keep filter.grm.id --covar sex.covar --partition 1

BLUP

Standard BLUP using a single (genetic) random effect

8a. Perform REML analysis (this time restricted to training individuals)

    ldak3 --reml blupALL --grm partitions/kinshipALL --pheno phen.pheno --mpheno 3 --keep train.fam --covar sex.covar

8b. Now get the BLUP estimates of effect sizes and project the genotypes onto these

    ldak3 --calc-blups blupALL --remlfile blupALL.reml --grm partitions/kinshipALL --bfile data

blupALL.blup contains effect sizes
blupALL.pred contains predictions

Can measure prediction performance by the correlation between predicted and observed phenotypic values. With standard BLUP, performance is 2%.
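Prediction performance as described here is a correlation between predicted and observed values for the test individuals. The sketch below computes a Pearson correlation in plain Python; the numbers are made up, and reading blupALL.pred is left out because its column layout is not given in the slides.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up observed and predicted phenotypes for five test individuals.
observed = [1.2, 0.4, -0.3, 2.1, 0.0]
predicted = [0.9, 0.1, 0.2, 1.5, -0.4]

r = pearson(predicted, observed)
print(f"correlation = {r:.3f}, squared correlation = {r * r:.3f}")
```

Whether performance is reported as the correlation or its square should be stated alongside any figure, since the two differ substantially for weak predictors.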

MultiBLUP

Standard BLUP model:

$$Y = \alpha + g + e, \quad g \sim N(0, \sigma_g^2 A), \quad e \sim N(0, \sigma_e^2 I)$$

MultiBLUP model:

$$Y = \alpha + g_1 + g_2 + \dots + g_M + e, \quad g_m \sim N(0, \sigma_m^2 A_m), \quad e \sim N(0, \sigma_e^2 I)$$

Each similarity matrix $A_m$ corresponds to a subset of SNPs.

Prediction is improved when subsets of SNPs have distinct effect size variances. The performance of MultiBLUP depends on how well the SNP subsets are chosen. We provide an algorithm for finding suitable SNP subsets.
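For reference, the predictors implied by these models follow from standard results for the multivariate normal distribution. This is a sketch that treats the variance components and $\alpha$ as known (in practice they are replaced by their REML estimates), and writes $X_m$ for the genotype matrix of the $N_m$ SNPs in subset $m$, so that $A_m = X_m X_m^T / N_m$:

```latex
\begin{aligned}
V &= \sum_{m=1}^{M} \sigma_m^2 A_m + \sigma_e^2 I, \\
\hat{g}_m &= \sigma_m^2 A_m V^{-1} (Y - \alpha), \\
\hat{\beta}_m &= \frac{\sigma_m^2}{N_m}\, X_m^{T} V^{-1} (Y - \alpha),
\qquad \hat{g}_m = X_m \hat{\beta}_m .
\end{aligned}
```

Standard BLUP is the case $M = 1$. The effect-size form is what allows prediction for new individuals: their genotypes are multiplied by $\hat{\beta}_m$ and summed over subsets.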

Adaptive MultiBLUP

Step 1: Divide the genome into (say) 75kb overlapping chunks (--chunks-bp)
Step 2: Test each chunk for association (--calc-genes-reml)
Step 3: Identify all significant chunks (say, P < 0.000001). Create regions by merging these with neighbouring chunks with P < 0.001 (--join-genes-reml --sig1 1e-6 --sig2 0.001)
Step 4: Run MultiBLUP using these local regions plus the background region (--reml with --grm, --region-prefix and --region-number)
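The merging idea in Step 3 can be sketched as follows. This is an illustration of the description above, not LDAK's implementation: it assumes chunks are given in genomic order with one p-value each, ignores chunk overlap, and uses hypothetical names and made-up p-values.

```python
def merge_significant_chunks(p_values, sig1=1e-6, sig2=1e-3):
    """Return inclusive (start, end) index ranges of merged regions.

    A region is seeded by a chunk with p < sig1 and extended in both
    directions while the neighbouring chunk has p < sig2.
    """
    regions = []
    n = len(p_values)
    for seed, p in enumerate(p_values):
        if p >= sig1:
            continue  # not significant enough to seed a region
        if regions and seed <= regions[-1][1]:
            continue  # already absorbed into the previous region
        start = seed
        while start > 0 and p_values[start - 1] < sig2:
            start -= 1
        end = seed
        while end + 1 < n and p_values[end + 1] < sig2:
            end += 1
        regions.append((start, end))
    return regions

# Made-up chunk p-values in genomic order.
chunk_p = [0.5, 5e-4, 1e-8, 2e-4, 0.2, 1e-7, 0.3, 1e-4]
print(merge_significant_chunks(chunk_p))
```

In this toy input the chunk with P = 1e-4 is not kept: it passes the second threshold but has no neighbouring seed chunk to attach to.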

Adaptive MultiBLUP

9a. Collect the results of 7b

    ldak3 --join-genes-reml chunks --bfile data --sig1 1e-6 --sig2 1e-3

This creates six regions, chunks/region1, ... , chunks/region6

Adaptive MultiBLUP

9b. Run MultiBLUP using these 6 regions and the background kinship matrix

    ldak3 --reml outmb --grm partitions/kinshipALL --region-prefix chunks/region --region-number 6 --pheno phen.pheno --mpheno 3 --keep train.fam --weights weights.txt --bfile data

9c. Compute effect size estimates and predictions

    ldak3 --calc-blups outmb --remlfile outmb.reml --bfile data --grm partitions/kinshipALL

Prediction improves from 2% to 25%.