Heritability, prediction and the genomic architecture of complex traits
David Balding and Doug Speed
EMBO Practical Course on Genotype to Phenotype Mapping of Complex Traits
Hinxton, July 30, 2014
Datafiles
data.bed, data.bim, data.fam – contain genotypic data for 1500 individuals and 60k SNPs (chr 1, 6 & 10)
phen.pheno – contains three phenotypes
sex.covar – contains gender
mlist1, mlist2 – lists of kinship files
genelist.txt – gene annotations
weights.txt – SNP weightings
train.fam – (randomly drawn) list of individuals used to train the prediction model
test.fam – list of individuals used to test the prediction model
Note: all arguments are preceded by two dashes (even if it appears as one long dash!)
Why use allelic correlations A = XXᵀ/N?
The linear mixed model with A = XXᵀ/N is equivalent to a random-effects regression model:
Y = α + β1X1 + β2X2 + β3X3 + ... + β60,000X60,000 + e
Suppose βj ~ N(0, σg²/N) and ei ~ N(0, σe²)
Then g = Σj Xj βj ~ N(0, A σg²), where A = XXᵀ/N
We obtain the mixed model: Y = α + g + e ~ N(α, A σg² + I σe²)
Variance explained by all SNPs is σg² / (σg² + σe²)
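To make the equivalence concrete, here is a minimal numpy sketch (toy genotypes and variance components; everything is illustrative, not GCTA output): standardise the genotype columns, form A = XXᵀ/N, draw per-SNP effects with variance σg²/N, and check that the resulting genetic values have variance close to σg².

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_snp = 200, 1000

# toy genotypes (0/1/2), column-standardised as in the usual GRM definition
G = rng.binomial(2, 0.3, size=(n_ind, n_snp)).astype(float)
X = (G - G.mean(axis=0)) / G.std(axis=0)

A = X @ X.T / n_snp          # allelic correlations, A = XX^T / N

# per-SNP effects beta_j ~ N(0, sigma_g^2 / N), then g = X beta
sigma_g2, sigma_e2 = 0.6, 0.4
beta = rng.normal(0.0, np.sqrt(sigma_g2 / n_snp), size=n_snp)
g = X @ beta
y = 1.0 + g + rng.normal(0.0, np.sqrt(sigma_e2), size=n_ind)

# var(g) across individuals should be close to sigma_g2,
# and cov(g) equals A * sigma_g2 in expectation
print(g.var(), sigma_g2)
```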
Using GCTA (Genome-wide Complex Trait Analysis)
1 – calculate similarity matrices separately for Chromosomes 1, 6 and 10 [because XXᵀ = X1X1ᵀ + X6X6ᵀ + X10X10ᵀ]
gcta --bfile data --make-grm --chr 1 --out chr1
gcta --bfile data --make-grm --chr 6 --out chr6
gcta --bfile data --make-grm --chr 10 --out chr10
Other options: --extract, --maf or --keep, or for dosage data --dosage-mach
2 – join these three similarity matrices together [mlist1 contains the prefixes of the matrices]
gcta --mgrm mlist1 --make-grm --out chrALL
Elements of chrALL.grm.bin reflect "genome-wide" similarity for pairs of individuals
The GRM is stored in binary format, but it can be viewed using
gcta --mgrm mlist1 --make-grm-gz --out chrALL
less -S chrALL.grm.gz
gunzip -c chrALL.grm.gz | awk '{if($1!=$2 && $4>0.25) print $0}'
E.g. individual pairs 314 & 236 have kinship 0.404, while 433 & 264 have kinship 0.999
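The identity XXᵀ = X1X1ᵀ + X6X6ᵀ + X10X10ᵀ is why per-chromosome GRMs can be merged; because each per-chromosome matrix is divided by its own SNP count, joining them amounts to a SNP-count-weighted average. A small numpy sketch of that bookkeeping (function and variable names are illustrative, not GCTA's):

```python
import numpy as np

def combine_grms(grms, snp_counts):
    """SNP-count-weighted average of per-chromosome GRMs.

    Each A_c = X_c X_c^T / N_c, so
    A_all = sum_c N_c * A_c / sum_c N_c = X X^T / N.
    """
    counts = np.asarray(snp_counts, dtype=float)
    weighted = sum(n * np.asarray(a, dtype=float) for n, a in zip(counts, grms))
    return weighted / counts.sum()

# toy example with three "chromosomes"
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 300))
X1, X6, X10 = X[:, :100], X[:, 100:200], X[:, 200:]
A1, A6, A10 = (x @ x.T / x.shape[1] for x in (X1, X6, X10))

A_all = combine_grms([A1, A6, A10], [100, 100, 100])
assert np.allclose(A_all, X @ X.T / 300)
```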
Using GCTA (Genome-wide Complex Trait Analysis)
3 – identify and remove closely related individuals (lose 18 individuals)
gcta --bfile data --grm chrALL --grm-cutoff 0.05 --make-grm --out filter
4 – perform PCA (used to identify population outliers, which should be removed)
gcta --bfile data --grm chrALL --keep filter.grm.id --pca 2 --out filter
Using GCTA (Genome-wide Complex Trait Analysis)
5 – estimate total variance explained by SNPs for Phenotype 1
gcta --reml --grm chrALL --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out p1
Covariates could be added with the option --covar
We find that the total variance explained is 93%
6a – estimate total variance explained by SNPs for Phenotype 3 (binary trait)
gcta --reml --grm chrALL --pheno phen.pheno --mpheno 3 --keep filter.grm.id --out p3
We find that the estimate on the observed scale is 63%
6b – to convert to the liability scale, provide the prevalence (assumed to be 1 in 100)
gcta --reml --grm chrALL --pheno phen.pheno --mpheno 3 --keep filter.grm.id --out binary --prevalence 0.01
On the liability scale the variance explained is 36%
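For reference, the observed-to-liability conversion that --prevalence performs follows the standard Lee et al. (2011) transformation. A hedged sketch in Python (the case proportion P below is an assumed value for illustration; it only differs from the prevalence K when the sample is ascertained):

```python
from scipy.stats import norm

def obs_to_liability(h2_obs, K, P=None):
    """Convert heritability on the observed 0/1 scale to the liability scale.

    K: population prevalence; P: proportion of cases in the sample
    (defaults to K, i.e. no ascertainment). Lee et al. 2011 transformation.
    """
    if P is None:
        P = K
    t = norm.isf(K)          # liability threshold
    z = norm.pdf(t)          # normal density at the threshold
    return h2_obs * (K * (1 - K) / z**2) * (K * (1 - K) / (P * (1 - P)))

# prevalence of 1 in 100; P=0.40 is an assumed case proportion, not given on the slide
print(obs_to_liability(0.63, K=0.01, P=0.40))   # roughly 0.36 under these assumptions
```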
Genome partitioning
Y = α + β1X1 + β2X2 + β3X3 + ... + β32X32 + e
Suppose: β1, β2, β3, ..., β16 ~ N(0, σa1²/N1) and β17, β18, β19, ..., β32 ~ N(0, σa2²/N2)
Now we have the mixed model: Y = α + g1 + g2 + e ~ N(α, A1 σa1² + A2 σa2² + I σe²)
where: A1 = X1X1ᵀ/N1 and A2 = X2X2ᵀ/N2
Variance explained by the first 16 SNPs is σa1² / (σa1² + σa2² + σe²), and by the last 16 SNPs is σa2² / (σa1² + σa2² + σe²)
Genome partitioning
Genome partitioning has been used to estimate the proportion of a trait's phenotypic variance explained by individual chromosomes
[Figure: per-chromosome estimates for a) height, b) BMI, c) vWF, d) QTi]
Used to demonstrate that traits are polygenic
Testing for inflation due to cryptic relatedness
Divide the SNPs X into a left half X1 and a right half X2 (e.g. X1 contains SNPs on Chr 1-8; X2 contains those on Chr 9-22)
Estimate the total h² with the mixed model using K = XXᵀ/N
Estimate the left-half h²L with the mixed model using K1 = X1X1ᵀ/N1
Estimate the right-half h²R with the mixed model using K2 = X2X2ᵀ/N2
If cryptic relatedness is not inflating estimates, then we would expect h² = h²L + h²R
In which case, estimates of h²L and h²R from the joint model Y ~ N(α, K1 σg1² + K2 σg2² + I σe²) should match those from the individual models Y ~ N(α, K1 σg1² + I σe²) and Y ~ N(α, K2 σg2² + I σe²) [5]
Genome partitioning
7a – estimate the variance explained for Phenotype 1 by each chromosome jointly
gcta --reml --mgrm mlist1 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out gp
We get the breakdown 47%, 23% & 24% (summing to 95%, slightly higher than the combined estimate)
7b – compare this with the variance explained by each chromosome separately
gcta --reml --grm chr1 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out chr1
gcta --reml --grm chr6 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out chr6
gcta --reml --grm chr10 --pheno phen.pheno --mpheno 1 --keep filter.grm.id --out chr10
The individual estimates are 45%, 22% & 27% (summing to 93%)
That the two sums differ only slightly (95% vs 93%) indicates only a mild contribution from inflation due to residual relatedness and population structure (a useful check).
Bivariate analysis
SNP-based heritability analysis can be used to assess the concordance between traits
Y1 = α1 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + e1
Y2 = α2 + λ1X1 + λ2X2 + λ3X3 + λ4X4 + λ5X5 + λ6X6 + e2
The individual analyses suppose βj ~ N(0, σg1²/N) and λj ~ N(0, σg2²/N)
But we can additionally specify cor(βj, λj) = τ, where τ takes values in [-1, 1]
Similar to the univariate analysis, except now we analyse Traits 1 and 2 jointly using a generalised version of REML which estimates σg1², σg2² and τ [9]
For different traits, we are interested in testing whether τ is significantly non-zero (pleiotropy)
For sub-phenotypes, we can also test whether τ is significantly less than one (heterogeneity)
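A toy simulation of the bivariate model may help fix ideas: two traits share the same SNPs, and their per-SNP effects (βj, λj) are drawn with correlation τ. This is purely illustrative; GCTA estimates σg1², σg2² and τ by bivariate REML rather than simulating them.

```python
import numpy as np

rng = np.random.default_rng(3)
n_ind, n_snp = 1000, 2000
tau = 0.5                                 # genetic correlation between the two traits
sigma_g1_2, sigma_g2_2 = 0.6, 0.4

X = rng.normal(size=(n_ind, n_snp))       # standardised genotypes (toy)

# draw (beta_j, lambda_j) jointly with the required correlation
cov = np.array([[sigma_g1_2, tau * np.sqrt(sigma_g1_2 * sigma_g2_2)],
                [tau * np.sqrt(sigma_g1_2 * sigma_g2_2), sigma_g2_2]]) / n_snp
effects = rng.multivariate_normal([0.0, 0.0], cov, size=n_snp)
beta, lam = effects[:, 0], effects[:, 1]

y1 = X @ beta + rng.normal(0, np.sqrt(1 - sigma_g1_2), n_ind)
y2 = X @ lam + rng.normal(0, np.sqrt(1 - sigma_g2_2), n_ind)
print(np.corrcoef(beta, lam)[0, 1])       # close to tau
```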
Bivariate analysis
Mainly used to test concordance between traits (pleiotropy), e.g. to demonstrate overlap between bipolar disorder and schizophrenia
But can also be used to test concordance within traits, e.g. how similar are focal and non-focal epilepsy?
Bivariate analysis
8a – bivariate analysis between Phenotypes 1 and 2
gcta --reml-bivar 1 2 --grm both --pheno phen.pheno --keep filter.grm.id --out bivar
We estimate the correlation between Traits 1 and 2 to be 53%
8b – test whether the correlation is significantly different from 0 (pleiotropy)
gcta --reml-bivar 1 2 --grm both --pheno phen.pheno --keep filter.grm.id --out bivar0 --reml-bivar-lrt-rg 0
The correlation is significantly greater than 0 (P < 1e-16, chi-squared test statistic 527)
8c – test whether the correlation is significantly lower than 1 (heterogeneity)
gcta --reml-bivar 1 2 --grm both --pheno phen.pheno --keep filter.grm.id --out bivar1 --reml-bivar-lrt-rg 1
The correlation is significantly lower than 1 (P < 1e-16, chi-squared test statistic 82)
Using LDAK (Linkage Disequilibrium Adjusted Kinships)
The aim of the SNP weightings is to equalise the signal variance across the genome. If SNPs are weighted equally, then signal in high-LD (well-tagged) regions will be over-represented when calculating similarity, and signal in low-LD regions under-represented. E.g., suppose two SNPs are identical; then we could give each a weighting of 0.5. In practice, many SNPs get weighting zero, which means that their signal is completely explained by the SNPs which remain.
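A toy numpy illustration of the duplicated-SNP example: with one ordinary SNP a and a SNP b that appears twice, unweighted averaging over the three columns lets b's signal count twice, whereas giving each copy of b weight 0.5 restores equal contributions (this is only the intuition, not LDAK's actual weight solver):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
a = rng.normal(size=n); a = (a - a.mean()) / a.std()   # an ordinary SNP
b = rng.normal(size=n); b = (b - b.mean()) / b.std()   # a SNP present twice (perfect LD with its copy)

# unweighted kinship over the three columns [a, b, b]: b's signal counts twice
K_unweighted = (np.outer(a, a) + 2 * np.outer(b, b)) / 3

# LDAK-style weights: 1 for a, 0.5 for each copy of b
w = np.array([1.0, 0.5, 0.5])
K_weighted = (1.0 * np.outer(a, a) + 0.5 * np.outer(b, b) + 0.5 * np.outer(b, b)) / w.sum()

# after weighting, a and b contribute equally, as if b appeared only once
print(np.allclose(K_weighted, (np.outer(a, a) + np.outer(b, b)) / 2))   # True
```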
Using LDAK (Linkage Disequilibrium Adjusted Kinships)
1 – calculate SNP weightings (we will not carry these out during the practical)
ldak3 --cut-weights sections --bfile data
ldak3 --calc-weights sections --bfile data --section 1
...
ldak3 --calc-weights sections --bfile data --section 21
ldak3 --join-weights sections
Weightings (prepared earlier) are stored in weights.txt
For computational reasons, it is necessary to divide the genome into sections (default size 3000 SNPs). For sparse genotyping, 3000 SNPs ~ 30 Mb, which is sufficiently large. For dense (sequence / imputed) data, 3000 SNPs ~ 1 Mb, so we suggest calculating weightings "twice", the second time using only SNPs with non-zero weights from the first run. This increases the window size about 10-fold.
Using LDAK (Linkage Disequilibrium Adjusted Kinships)
2 – calculate kinships
ldak3 --cut-kins partitions --bfile data --by-chr YES
ldak3 --calc-kins partitions --bfile data --weights weights.txt --partition 1
ldak3 --calc-kins partitions --bfile data --weights weights.txt --partition 2
ldak3 --calc-kins partitions --bfile data --weights weights.txt --partition 3
ldak3 --join-kins partitions --kinship-matrix YES
Other options: --min-maf, --min-var, --max-maf and --min-obs
To facilitate genome partitioning, lists of SNPs can be provided
The kinships can be examined using less -S partitions/kinshipALL.grm.raw
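Conceptually, the matrix being built here is a weighted version of A = XXᵀ/N, with each SNP's contribution scaled by its weight. A minimal sketch of that formula (illustrative only; LDAK additionally applies the MAF/variance scaling options listed above):

```python
import numpy as np

def weighted_kinship(X, w):
    """Weighted kinship K = X diag(w) X^T / sum(w) for standardised genotypes X."""
    w = np.asarray(w, dtype=float)
    return (X * w) @ X.T / w.sum()

# with all weights equal to 1 this reduces to the ordinary A = XX^T / N
rng = np.random.default_rng(7)
X = rng.normal(size=(50, 200))
assert np.allclose(weighted_kinship(X, np.ones(200)), X @ X.T / 200)
```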
Using LDAK (Linkage Disequilibrium Adjusted Kinships)
3 – decompose the kinship matrix
ldak3 --decompose partitions/kinshipALL.eigen --grm partitions/kinshipALL --keep filter.grm.id
This saves the eigenvalues and eigenvectors in partitions/kinshipALL.eigen
[Figure: comparison without weightings vs. after weightings]
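The decomposition step computes the eigenvalues and eigenvectors of the kinship matrix once, so that later REML fits can rotate the phenotype and work with a diagonal covariance. In numpy terms (a sketch only, not LDAK's file format):

```python
import numpy as np

# K is the (symmetric) kinship matrix restricted to the kept individuals
rng = np.random.default_rng(8)
X = rng.normal(size=(100, 500))
K = X @ X.T / 500

evals, evecs = np.linalg.eigh(K)     # K = U diag(evals) U^T

# after rotating by U^T, the mixed-model covariance U^T (K*sg2 + I*se2) U
# becomes diagonal, which makes each REML likelihood evaluation cheap
assert np.allclose(evecs @ np.diag(evals) @ evecs.T, K)
```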
Using LDAK (Linkage Disequilibrium Adjusted Kinships)
4a – estimate the total variance explained by SNPs for Phenotype 1
ldak3 --reml outp1 --grm partitions/kinshipALL --pheno phen.pheno --mpheno 1 --keep filter.grm.id
As before, covariates could be added with the option --covar
We find that the total variance explained is 88% (compared to 93% without weightings)
4b – estimate the total variance explained by SNPs for Phenotype 3
ldak3 --reml outp3 --grm partitions/kinshipALL --pheno phen.pheno --mpheno 3 --keep filter.grm.id
Conversion from the observed to the liability scale is not yet implemented, sorry
[Figure: estimates without weightings vs. after weightings]
Using LDAK (Linkage Disequilibrium Adjusted Kinships)
5a – genome partitioning for Phenotype 1
ldak3 --reml outgp --mgrm mlist2 --pheno phen.pheno --mpheno 1 --keep filter.grm.id
The breakdown is 41%, 18% and 27% (summing to 86%)
5b – estimate the variance explained by each chromosome separately
ldak3 --reml outchr1 --grm partitions/kinship1 --pheno phen.pheno --mpheno 1 --keep filter.grm.id
ldak3 --reml outchr6 --grm partitions/kinship2 --pheno phen.pheno --mpheno 1 --keep filter.grm.id
ldak3 --reml outchr10 --grm partitions/kinship3 --pheno phen.pheno --mpheno 1 --keep filter.grm.id
The individual estimates are 44%, 18% and 32% (summing to 94%)
Gene-based / regional tests
Y = α + β1X1 + β2X2 + β3X3 + ... + β500,000X500,000 + e
We can focus in on very small regions or individual genes (tens of SNPs)
Suppose we want to compute the variance explained by SNPs 14 to 18
The naive approach is to compute K from these SNPs and perform REML
But because the number of SNPs in the region (here 5) is much smaller than the number of individuals, fast computational tricks can be used (as in FaST-LMM)
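A sketch of the naive regional kinship for a handful of SNPs: the matrix is n × n but its rank is at most the number of SNPs in the region, which is what the FaST-LMM-style tricks exploit (indices and scaling here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n_ind, n_region = 1000, 5                     # e.g. SNPs 14 to 18

region = rng.normal(size=(n_ind, n_region))   # standardised genotypes for the region (toy)
K_region = region @ region.T / n_region       # naive regional kinship

# K_region is n_ind x n_ind but has rank <= 5, so REML implementations can
# work with the 5-column matrix `region` directly instead of the full kinship
print(np.linalg.matrix_rank(K_region))        # 5
```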
Gene-based analysis
6a – divide the genome into genes
ldak3 --cut-genes genes --bfile data --genefile genelist.txt --weights weights.txt --gene-buffer 10000
3356 of the 4580 genes are found (i.e. have at least one SNP with non-zero weight)
The genes can be split into partitions and analysed in parallel, but typically this is not necessary
6b – test each gene for association with Phenotype 3
ldak3 --calc-genes-reml genes --bfile data --weights weights.txt --pheno phen.pheno --mpheno 3 --keep filter.grm.id --covar sex.covar --partition 1
6c – join the partitions (here, only one partition)
ldak3 --join-genes-reml genes
Gene-based analysis
[Output columns: Heritability, LRT P-value, Score P-value]
The LRT P-value can be conservative, although it suffices for our purposes
Listgarten and Lippert implement a permutation-based p-value
Chunk-based analysis
Restricting only to genes can miss a lot!
7a – divide the genome into 75,000 base-pair chunks
ldak3 --cut-genes chunks --bfile data --chunks-bp 75000 --weights weights.txt
Can also divide according to weighting using --chunks
7b – now test each chunk
ldak3 --calc-genes-reml chunks --bfile data --weights weights.txt --pheno phen.pheno --mpheno 3 --keep filter.grm.id --covar sex.covar --partition 1
BLUP
Standard BLUP using a single (genetic) random effect
8a – perform the REML analysis (this time restricted to the training individuals)
ldak3 --reml blupALL --grm partitions/kinshipALL --pheno phen.pheno --mpheno 3 --keep train.fam --covar sex.covar
8b – now obtain the BLUP estimates of effect sizes and project onto these
ldak3 --calc-blups blupALL --remlfile blupALL.reml --grm partitions/kinshipALL --bfile data
blupALL.blup contains the effect sizes; blupALL.pred contains the predictions
Prediction performance can be measured by the correlation between predicted and observed phenotypic values
With standard BLUP the performance is 2%
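Conceptually, the two steps amount to fitting the single-random-effect model on the training individuals and then predicting the genetic values of the test individuals through their kinship with the training set. A minimal sketch of that standard BLUP form, with the variance components taken as given rather than estimated by REML (illustrative only, not LDAK's implementation):

```python
import numpy as np

def blup_predict(K, y_train, train_idx, test_idx, sigma_g2, sigma_e2):
    """Predict genetic values for test individuals from a kinship matrix K.

    g_hat_test = K[test, train] (K[train, train] + lambda I)^-1 (y_train - mean),
    with lambda = sigma_e2 / sigma_g2.
    """
    lam = sigma_e2 / sigma_g2
    K_tt = K[np.ix_(train_idx, train_idx)]
    K_st = K[np.ix_(test_idx, train_idx)]
    resid = y_train - y_train.mean()
    alpha = np.linalg.solve(K_tt + lam * np.eye(len(train_idx)), resid)
    return K_st @ alpha

# toy usage: performance is the correlation between predicted and observed values
rng = np.random.default_rng(6)
n, m = 600, 2000
X = rng.normal(size=(n, m))
K = X @ X.T / m
g = X @ rng.normal(0, np.sqrt(0.5 / m), m)
y = g + rng.normal(0, np.sqrt(0.5), n)
train, test = np.arange(400), np.arange(400, n)
pred = blup_predict(K, y[train], train, test, sigma_g2=0.5, sigma_e2=0.5)
print(np.corrcoef(pred, y[test])[0, 1])
```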
MultiBLUP
Standard BLUP model:
Y = α + g + e, with g ~ N(0, σg²A) and e ~ N(0, σe²I)
MultiBLUP model:
Y = α + g1 + g2 + ... + gM + e, with gm ~ N(0, σm²Am) and e ~ N(0, σe²I)
Each similarity matrix Am corresponds to a subset of SNPs
Prediction is improved when subsets of SNPs have distinct effect-size variances
The performance of MultiBLUP depends on how well the SNP subsets are chosen
We provide an algorithm for finding suitable SNP subsets
Adaptive MultiBLUP
Step 1: Divide the genome into (say) 75kb overlapping chunks (--chunks-bp)
Step 2: Test each chunk for association (--calc-genes-reml)
Step 3: Identify all significant chunks (say, P < 1e-6), then create regions by merging these with neighbouring chunks with P < 0.001 (--join-genes-reml --sig1 1e-6 --sig2 1e-3)
Step 4: Run MultiBLUP using these local regions plus the background region (--reml with --grm, --region-prefix and --region-number)
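A toy sketch of the chunk-selection logic in Steps 2–3, using the thresholds on this slide. The exact merging rule here (grow a region outwards from each significant chunk while neighbours stay below the weaker threshold, then merge overlaps) is an illustrative reading of the procedure, not LDAK's code:

```python
def select_regions(chunk_pvalues, sig1=1e-6, sig2=1e-3):
    """Return regions as (start, end) chunk indices, inclusive.

    A region is seeded by any chunk with P < sig1 and extended over
    neighbouring chunks with P < sig2; overlapping regions are merged.
    """
    regions = []
    for i, p in enumerate(chunk_pvalues):
        if p >= sig1:
            continue
        lo = i
        while lo > 0 and chunk_pvalues[lo - 1] < sig2:
            lo -= 1
        hi = i
        while hi < len(chunk_pvalues) - 1 and chunk_pvalues[hi + 1] < sig2:
            hi += 1
        if regions and lo <= regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], max(hi, regions[-1][1]))
        else:
            regions.append((lo, hi))
    return regions

# toy usage
pvals = [0.5, 2e-4, 1e-8, 5e-4, 0.3, 0.9, 1e-7, 0.2]
print(select_regions(pvals))   # [(1, 3), (6, 6)]
```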
Adaptive MultiBLUP
9a – collect the results of the chunk analysis (steps 7a–7b)
ldak3 --join-genes-reml chunks --bfile data --sig1 1e-6 --sig2 1e-3
This creates six regions: chunks/region1, ..., chunks/region6
Adaptive MultiBLUP
9b – run MultiBLUP using these 6 regions and the background kinship matrix
ldak3 --reml outmb --grm partitions/kinshipALL --region-prefix chunks/region --region-number 6 --pheno phen.pheno --mpheno 3 --keep train.fam --weights weights.txt --bfile data
9c – compute effect-size estimates and predictions
ldak3 --calc-blups outmb --remlfile outmb.reml --bfile data --grm partitions/kinshipALL
Prediction performance improves from 2% to 25%