Impact of negative selection on common variant disease architectures
Alkes L. Price
Harvard School of Public Health
October 19, 2018
Slides of this talk: google “Alkes HSPH”
What is negative selection?
Negative selection is the negative pressure on allele frequencies of mutations that reduce fitness.
[Figure: two panels of allele frequency (y-axis) vs. time (x-axis)]
Kryukov et al. 2007 Am J Hum Genet, Kiezun et al. 2013 PLoS Genet
Negative selection: causal effect on trait is larger for rare and low-frequency variants
• Let p be minor allele frequency (MAF).
• Let β be per-allele causal effect on disease/trait.
• Let $h^2 = \beta^2 \cdot 2p(1-p)$ be variance explained by a SNP.
Selection: α model
$\mathrm{Var}(\beta) \sim [p(1-p)]^{\alpha}$
$E(h^2) \sim [p(1-p)]^{1+\alpha}$
• Rare variants explain more $h^2$ if α < 0 (negative selection) than if α = 0 (no selection).
• α = −0.38 across 25 UK Biobank traits (rare variants: larger causal effects but smaller per-SNP $h^2$ vs. common variants).
Speed et al. 2012 Am J Hum Genet; Schoech et al. biorxiv 09/13/17 (analytical derivations support validity of α model for p above a threshold); also see Zeng et al. 2018 Nat Genet
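A minimal numerical sketch of the α model, assuming only the two proportionalities stated on the slide (Var(β) ∝ [p(1−p)]^α and E(h²) ∝ [p(1−p)]^(1+α)). The function names and the two example MAFs (0.5% and 25%) are illustrative choices, not values from the talk.

```python
def relative_effect_variance(p: float, alpha: float) -> float:
    """Per-allele causal effect variance, up to a constant: [p(1-p)]^alpha."""
    return (p * (1.0 - p)) ** alpha


def relative_per_snp_h2(p: float, alpha: float) -> float:
    """Expected per-SNP variance explained, up to a constant: [p(1-p)]^(1+alpha)."""
    return (p * (1.0 - p)) ** (1.0 + alpha)


# Hypothetical MAFs chosen for illustration only.
rare, common = 0.005, 0.25

for alpha in (0.0, -0.38):
    var_ratio = relative_effect_variance(rare, alpha) / relative_effect_variance(common, alpha)
    h2_ratio = relative_per_snp_h2(rare, alpha) / relative_per_snp_h2(common, alpha)
    print(f"alpha={alpha:+.2f}: rare/common Var(beta) = {var_ratio:.2f}, "
          f"rare/common E(h2) = {h2_ratio:.3f}")
```

Under α = −0.38 the rare SNP has the larger effect variance but still the smaller expected per-SNP h², which is the pattern the slide describes.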
Rare variants explain limited trait heritability, despite larger causal effects + many rare variants
[Figure labels: MAF = 5%; α = 0 (no selection)]
Schoech et al. biorxiv 09/13/17; also see Zeng et al. 2018 Nat Genet

Beyond the α model: does negative selection impact common variant disease architectures?
[Figure labels: MAF = 5%; α = 0 (no selection)]
Schoech et al. biorxiv 09/13/17; also see Zeng et al. 2018 Nat Genet
Outline 1. LD-dependent architectures 2. Functional architectures 3. Polygenicity
What does “LD-dependent architecture” mean?
• SNPs with higher LD have higher average $\chi^2$ association statistics due to increased tagging of causal variants (Pritchard & Przeworski 2001 Am J Hum Genet).
• “LD-dependent architecture”: dependence of causal effect sizes on the level of LD of a SNP (Speed et al. 2012 Am J Hum Genet).
• Common SNPs have higher LD and higher causal variance than rare SNPs => SNPs with higher LD have higher causal variance (Schoech et al. biorxiv 09/13/17; also see Zeng et al. 2018 Nat Genet).
• “LD-dependent architecture”: dependence of causal effect sizes on the level of LD of a SNP, after conditioning on MAF.
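Inferring LD-dependent architectures from summary statistics using S-LDSC
Extend S-LDSC (Finucane et al. 2015 Nat Genet) to continuous annot. q:
$E(\chi^2_m) = 1 + N \sum_q \tau_q \, \mathrm{LDscore}_q(m)$
$\mathrm{LDscore}_q(m) = \sum_{m'} a_{m',q} \, r^2_{m,m'}$
$\tau^*_q = \dfrac{M \cdot \mathrm{sd}(a_q)}{h^2} \, \tau_q$ = normalized conditional effect of annot. q (proportionate change in trait $h^2$ per 1 s.d. increase in annot. q)
• $a_{m,q}$ = value of annot. q at SNP m
• $\tau_q$ = conditional effect of annot. q
• $h^2$ = genome-wide trait heritability
• N = GWAS sample size, M = number of SNPs, $r^2_{m,m'}$ = LD between SNPs m and m'
Gazal et al. 2017 Nat Genet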
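The regression model can be sketched in a few lines. This is a toy illustration, assuming the standard S-LDSC definition of a stratified LD score (annotation values weighted by r² and summed over SNPs) and the normalization τ* = M·sd(a)·τ/h² described in Gazal et al. 2017. The LD matrix, annotation values, τ values and sample size below are hypothetical, and the code computes the expected χ² only; it is not the estimation procedure.

```python
from typing import List


def stratified_ld_scores(r2: List[List[float]], annot: List[List[float]]) -> List[List[float]]:
    """ld[m][q] = sum over SNPs k of annot[k][q] * r2[m][k]."""
    n_snps, n_annot = len(r2), len(annot[0])
    return [
        [sum(annot[k][q] * r2[m][k] for k in range(n_snps)) for q in range(n_annot)]
        for m in range(n_snps)
    ]


def expected_chi2(ld_scores: List[List[float]], tau: List[float], n_samples: int) -> List[float]:
    """E(chi2_m) = 1 + N * sum_q tau_q * LDscore_q(m)."""
    return [1.0 + n_samples * sum(t * l for t, l in zip(tau, row)) for row in ld_scores]


def normalized_tau(tau_q: float, sd_q: float, n_snps: int, h2: float) -> float:
    """tau* = M * sd(a_q) * tau_q / h2 (assumed form, following Gazal et al. 2017)."""
    return n_snps * sd_q * tau_q / h2


# Toy data: SNPs 0 and 1 are in LD (r2 = 0.5); SNP 2 is independent.
r2 = [[1.0, 0.5, 0.0],
      [0.5, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
# Column 0: all-SNPs annotation; column 1: a continuous annotation (hypothetical values).
annot = [[1.0, 2.0],
         [1.0, 0.0],
         [1.0, -1.0]]
tau = [1e-5, 2e-6]  # hypothetical coefficients
N = 100_000         # hypothetical GWAS sample size

ld = stratified_ld_scores(r2, annot)
chi2 = expected_chi2(ld, tau, N)
print(ld)
print(chi2)
```

In practice τ is the unknown: the observed χ² statistics are regressed on the stratified LD scores to estimate it.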
Inferring LD-dependent architectures using continuous LLD annotation
Level of LD (LLD): MAF-adjusted LD score (MAF-stratified quantile normalization)
LD score for the continuous LLD annotation: $\mathrm{LDscore}_{LLD}(m) = \sum_{m'} \mathrm{LLD}_{m'} \, r^2_{m,m'}$ ($r^2_{m,m'}$ = LD between SNPs m and m')
• Include “baseline model” annotations (Finucane et al. 2015 Nat Genet)
• Also include binary annotations for 10 common SNP MAF bins
• Simulations confirm robust results (not shown)
Gazal et al. 2017 Nat Genet
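One way to picture MAF-stratified quantile normalization is the sketch below: within each MAF bin, LD scores are ranked and the ranks are mapped to standard normal quantiles. The bin edges, the (rank − 0.5)/n offset and the toy data are assumptions made for illustration; the exact procedure used in Gazal et al. 2017 may differ in these details.

```python
from statistics import NormalDist
from typing import Dict, List, Sequence


def maf_bin(maf: float, edges: Sequence[float]) -> int:
    """Index of the first bin whose upper edge is >= maf."""
    for i, upper in enumerate(edges):
        if maf <= upper:
            return i
    raise ValueError(f"MAF {maf} exceeds the last bin edge")


def level_of_ld(ld_scores: Sequence[float], mafs: Sequence[float],
                edges: Sequence[float]) -> List[float]:
    """Quantile-normalize LD scores to N(0, 1) separately within each MAF bin."""
    std_normal = NormalDist()
    bins: Dict[int, List[int]] = {}
    for idx, maf in enumerate(mafs):
        bins.setdefault(maf_bin(maf, edges), []).append(idx)

    out = [0.0] * len(ld_scores)
    for members in bins.values():
        ranked = sorted(members, key=lambda i: ld_scores[i])
        n = len(ranked)
        for rank, i in enumerate(ranked, start=1):
            # (rank - 0.5) / n keeps the quantile strictly inside (0, 1).
            out[i] = std_normal.inv_cdf((rank - 0.5) / n)
    return out


# Toy data: LD scores are about 10x larger in the higher-MAF bin.
mafs = [0.06, 0.07, 0.08, 0.30, 0.35, 0.40]
ld_scores = [20.0, 10.0, 30.0, 200.0, 100.0, 300.0]
edges = [0.10, 0.50]  # hypothetical upper edges of two MAF bins

lld = level_of_ld(ld_scores, mafs, edges)
print(lld)
```

After normalization, the lowest-LD SNP in each bin receives the same LLD value even though the raw LD scores differ by an order of magnitude, so LLD is comparable across MAF bins.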
SNPs with lower MAF-adjusted level of LD (LLD) have larger causal effect sizes Same sign of effect across all 56 traits (average N=101K)
Many annotations correlated to LD could contribute to LD-dependent architectures
LD-related annotations:
• Predicted allele age (ARGweaver; Rasmussen et al. 2014 PLoS Genet)
• LLD in Africans (LLD-AFR)
• Recombination rate (±10kb window; Hussin et al. 2015 Nat Genet)
• GC-content (±1Mb window; Loh et al. 2015b Nat Genet)
• Replication timing (Koren et al. 2012 Am J Hum Genet)
• Background selection (1 − B statistic; McVicker et al. 2009 PLoS Genet)
• Nucleotide diversity (SNPs per kb; ±10kb window)
• CpG content (±50kb window)
Functional annotations (Finucane et al. 2015 Nat Genet):
• Coding, regulatory, conserved, etc.
Many annotations correlated to LD could contribute to LD-dependent architectures
[Figure: LD-related annotations and functional annotations from “baseline model” (Finucane et al. 2015 Nat Genet)]
Many LD-related annotations impact causal effect sizes
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF; r = −0.63]
• Recombination rate has discordant sign of effect (Hill & Robertson 1966 Genet Res)
• Heritability is enriched in SNPs with low LLD in low recombination rate regions
Meta-analysis of 31 independent traits
Many LD-related annotations impact causal effect sizes after conditioning on baseline model
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF; label: 0.37x]
• LLD effect is 0.37x smaller when including annotations from baseline model
• Some, but not all, of LD-dependent architecture due to DHS, enhancers, etc.
Meta-analysis of 31 independent traits
Many LD-related annotations impact causal effect sizes after conditioning on baseline model
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF]
• LLD effect is 0.51x smaller after adding baseline model
• Predicted allele age has largest effect.
Meta-analysis of 31 independent traits
Many LD-related annotations impact causal effect sizes in joint fit with baseline model
baseline-LD model
[Figure legend: Annotation + MAF; Annotation + baseline model + MAF; Joint-fit annotations + baseline model + MAF]
• 6 significant annotations in joint fit
• Predicted allele age has largest effect
Meta-analysis of 31 independent traits
Forward simulations show that negative selection explains LD-dependent architectures
• Forward simulations using SLiM (Messer 2013 Genetics) under African-European demographic model (Gravel et al. 2011 PNAS)
• Jointly regress selection coeff s on 4 LD-related annotations and minor allele frequency
[Figure: panels “31 traits” and “Simulations” (forward simulations: impact on s); legend: Annotation + MAF; Annotation + baseline model + MAF; Joint-fit annotations + baseline model + MAF]
Quintiles illustrate large effects of LD-related annotations from baseline-LD model
[Figure: proportion of heritability (y-axis, 0% to 40%) by annotation quintile]
• Youngest 20% explain 3.8x more heritability than oldest 20% vs. 1.8x for MAF
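The quintile comparison can be illustrated with a small sketch: sort SNPs by an annotation, split them into five equal groups, and sum the per-SNP heritability in each group. The function name and the toy inputs (ten SNPs with heritability decreasing in allele age) are hypothetical; in the talk, per-SNP heritabilities come from the fitted baseline-LD model rather than being given directly.

```python
from typing import List, Sequence


def heritability_by_quintile(annot_values: Sequence[float],
                             per_snp_h2: Sequence[float]) -> List[float]:
    """Proportion of total heritability in each quintile of the annotation (lowest first)."""
    order = sorted(range(len(annot_values)), key=lambda i: annot_values[i])
    n = len(order)
    total = sum(per_snp_h2)
    props = []
    for q in range(5):
        members = order[q * n // 5:(q + 1) * n // 5]
        props.append(sum(per_snp_h2[i] for i in members) / total)
    return props


# Hypothetical toy data: younger alleles (lower age) carry more per-SNP heritability.
allele_age = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
per_snp_h2 = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

props = heritability_by_quintile(allele_age, per_snp_h2)
youngest_vs_oldest = props[0] / props[4]
print(props, youngest_vs_oldest)
```

The ratio of the first to the last quintile is the kind of summary quoted on the slide (e.g. youngest 20% vs. oldest 20%).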
Quintiles illustrate large effects of TMRCA annotation inferred using ASMC
• ASMCavg annotation: average TMRCA inferred by Ascertained Sequentially Markovian Coalescent (ASMC) in GoNL WGS data
• Jointly statistically significant with other LD-related annotations (τ* = −0.25±0.01)
[Figure: proportion of heritability by annotation quintile]
• Low-TMRCA 20% explain 3.8x more heritability than high-TMRCA 20%
Palamara et al. 2018 Nat Genet
LD-dependent architectures can lead to bias in estimates of heritability and functional enrichment Modeling LD-dependent architectures is critically important. Speed et al. 2012 Am J Hum Genet, Gusev et al. 2013 PLoS Genet, Yang et al. 2015 Nat Genet, Speed et al. 2017 Nat Genet, Gazal et al. 2017 Nat Genet
How well does the baseline-LD model fit the data?
• Idea (Speed et al. 2017 Nat Genet): use out-of-sample likelihoods for formal model comparisons
• Speed et al. 2017 Nat Genet: LDAK model > infinitesimal model (“GCTA model”) in analysis of 1000G SNPs
2.8M 1000G SNPs: LDAK model > GCTA model
[Figure: y-axis = change in log likelihood vs. LDAK model (16 UK Biobank traits); panels: 2.8M SNPs from 1000G, 4.6M SNPs from HRC]
• “GCTA” = infinitesimal model
ASHG 2018 poster 3432/W (Price); Gazal et al. biorxiv 10/16/18

4.6M HRC SNPs: GCTA model (>) LDAK model
[Figure: y-axis = change in log likelihood vs. LDAK model (16 UK Biobank traits); panels: 2.8M SNPs from 1000G, 4.6M SNPs from HRC]
• “GCTA” = infinitesimal model
ASHG 2018 poster 3432/W (Price); Gazal et al. biorxiv 10/16/18

baseline-LD > LDAK and GCTA in both SNP sets
[Figure: y-axis = change in log likelihood vs. LDAK model (16 UK Biobank traits); panels: 2.8M SNPs from 1000G, 4.6M SNPs from HRC]
• “GCTA” = infinitesimal model (also see Yang et al. 2015 Nat Genet; LDMS model)
• “Gazal-LD” = LD + MAF annotations only from baseline-LD model
• “baseline-LD+LDAK” = model with baseline-LD + LDAK annotations
ASHG 2018 poster 3432/W (Price); Gazal et al. biorxiv 10/16/18
Outline 1. LD-dependent architectures 2. Functional architectures 3. Polygenicity image from Shlyueva et al. 2014 Nat Rev Genet
Common variant functional architectures: coding + regulatory (tissue-specific)
• Coding variants explain ~10% of common variant $h^2$
• Regulatory variant enrichments are often tissue/cell-type-specific
Finucane et al. 2015 Nat Genet; also see Finucane et al. 2018 Nat Genet
Low-frequency variant functional architectures: ??? + ???
• Coding variants explain ??? of low-frequency variant $h^2$
• Regulatory variants explain ??? of low-frequency variant $h^2$
Coding variants likely important for low-frequency variant architectures: UK10K 2015 Nature, Astle et al. 2016 Cell, Marouli et al. 2017 Nature
Inferring low-frequency variant functional architectures by extending S-LDSC
Multi-linear regression: $E(\chi^2_m) = 1 + \sum_q (N \tau_q) \, \mathrm{LDscore}_q(m)$
• Common variant enrichment (CVE) of an annotation = prop. of $h^2_c$ / prop. of common SNPs
• Low-frequency variant enrichment (LFVE) of an annotation = prop. of $h^2_{lf}$ / prop. of low-frequency SNPs
• Separate annotations for common and low-frequency SNPs
• Also include binary annotations for 5 low-frequency MAF bins
• UK Biobank target samples + UK10K LD reference samples
• Simulations confirm robust results (not shown)
ASHG 2018 poster 2699/F (Gazal); Gazal et al. 2018 Nat Genet
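The two enrichment definitions share one form, shown in the sketch below. All numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not estimates from the talk.

```python
def enrichment(h2_in_annot: float, h2_total: float,
               snps_in_annot: int, snps_total: int) -> float:
    """Enrichment = (proportion of heritability) / (proportion of SNPs) for one annotation."""
    prop_h2 = h2_in_annot / h2_total
    prop_snps = snps_in_annot / snps_total
    return prop_h2 / prop_snps


# Hypothetical annotation covering 1% of SNPs in each frequency class.
cve = enrichment(h2_in_annot=0.02, h2_total=0.20,       # 10% of common-variant h2
                 snps_in_annot=50_000, snps_total=5_000_000)
lfve = enrichment(h2_in_annot=0.006, h2_total=0.02,     # 30% of low-frequency-variant h2
                  snps_in_annot=30_000, snps_total=3_000_000)

print(f"CVE = {cve:.1f}, LFVE = {lfve:.1f}, LFVE/CVE = {lfve / cve:.1f}")
```

Because CVE and LFVE are each normalized by the proportion of SNPs in their own frequency class, their ratio can be compared across annotations of different sizes.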
LFVE is correlated to CVE
LFVE > CVE when CVE is large
[Figure: low-frequency variant enrichment (LFVE) vs. common variant enrichment (CVE)]
• 33 main annotations: r(LFVE,CVE) = 0.79
• Non-synonymous variants: 17.3% of $h^2_{lf}$ vs. 2.1% of $h^2_c$
• Even larger LFVE for non-synonymous variants predicted as damaging (PolyPhen-2) or in genes under strong selection ($s_{het}$)
Meta-analysis across 40 UK Biobank traits (average N = 363K); assoc. method: BOLT-LMM (Loh et al. 2018 Nat Genet)
LFVE ≈ CVE for most regulatory annotations, but LFVE > CVE for brain annotations
[Figure: low-frequency variant enrichment (LFVE) vs. common variant enrichment (CVE)]
• 637 cell-type-specific (CTS) annotation-trait pairs with significant CVE (Finucane et al. 2018 Nat Genet)
• 55 brain annotation-trait pairs with LFVE/CVE>2x
• H3K4me3 in brain DPFC-Neuroticism: 56.9% of $h^2_{lf}$ vs. 11.7% of $h^2_c$ (P = 0.0002)
LFVE/CVE ratio depends primarily on strength of selection
• $s_{dn}$ = avg selection coefficient of deleterious de novo variants
• π = prob. that de novo variant is causal for trait
[Figure: forward simulations (SLiM2 + τEyre-Walker); LFVE/CVE ratio vs. proportion of causal variants (π)]
• Non-synonymous variants: LFVE/CVE=5x, $s_{dn}$=−0.003
• 55 brain annotation-trait pairs: LFVE/CVE>2x, $s_{dn}$<−0.0006 (potentially useful for WGS)
Outline 1. LD-dependent architectures 2. Functional architectures 3. Polygenicity image from Evangelou et al. 2018 Nat Genet
Complex traits are extremely polygenic Systolic blood pressure: GWAS of 1 million people identifies 901 genome-wide significant loci explaining 5.7% of trait variance (vs. total SNP-heritability = 21%) Evangelou et al. 2018 Nat Genet; also see Purcell et al. 2009 Nature, Yang et al. 2010 Nat Genet, Stahl et al. 2012 Nat Genet, PGC-SCZ 2014 Nature, Loh et al. 2015b Nat Genet, Zhang et al. 2018 Nat Genet
Omnigenic model: polygenicity arises from extraordinary biological complexity Boyle et al. 2017 Cell also see Wray et al. 2018 Cell, Liu et al. biorxiv 09/24/18
Flattening hypothesis: polygenicity arises because negative selection flattens common variant effect sizes (always small) ASHG 2018 poster 3528/W O’Connor + O’Connor et al. biorxiv 09/18/18
New definition of polygenicity: effective number of associated SNPs ($M_a$)
No LD between causal SNPs: $M_a = \dfrac{3M}{\kappa}$, where $\kappa = \dfrac{E(\beta^4)}{E(\beta^2)^2}$ (M = #SNPs, β = normalized effect size)
• If SNP effects follow normal distribution: $M_a$ = number of SNPs
• If SNP effects follow point-normal distribution: $M_a$ = number of causal SNPs
• Estimates of the number of causal SNPs under a point-normal model are sample size dependent (Zhang et al. 2018 Nat Genet, O’Connor et al. biorxiv 09/18/18)
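The two special cases can be checked numerically. The sketch below assumes the kurtosis-based form Ma = 3M/κ with κ = E(β⁴)/E(β²)², which reproduces both special cases listed on the slide when there is no LD. The simulation sizes and random seed are arbitrary choices for illustration, and this is a direct calculation on known effect sizes, not the S-LD4M estimator.

```python
import random
from typing import Sequence


def effective_num_snps(betas: Sequence[float]) -> float:
    """Ma = 3M / kappa, with kappa = E[beta^4] / E[beta^2]^2 (no LD assumed)."""
    m = len(betas)
    second_moment = sum(b ** 2 for b in betas) / m
    fourth_moment = sum(b ** 4 for b in betas) / m
    return 3.0 * m * second_moment ** 2 / fourth_moment


rng = random.Random(0)   # fixed seed so the run is reproducible
M = 100_000              # hypothetical number of SNPs
n_causal = 10_000        # hypothetical number of causal SNPs

# Normal model: every SNP has a normally distributed effect.
normal_effects = [rng.gauss(0.0, 1.0) for _ in range(M)]
# Point-normal model: n_causal SNPs have normal effects, the rest are exactly zero.
point_normal_effects = [rng.gauss(0.0, 1.0) if i < n_causal else 0.0 for i in range(M)]

ma_normal = effective_num_snps(normal_effects)
ma_point_normal = effective_num_snps(point_normal_effects)
print(f"normal: Ma/M = {ma_normal / M:.3f}")
print(f"point-normal: Ma/n_causal = {ma_point_normal / n_causal:.3f}")
```

Both ratios should be close to 1 up to sampling noise: a normal distribution has kurtosis 3, and zeroing out a fraction of SNPs inflates the kurtosis by the inverse of the causal fraction.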
New method to estimate $M_a$: Stratified LD 4th moments regression (S-LD4M)
S-LDSC (Finucane et al. 2015 Nat Genet):
• Regress $\chi^2$ statistics on stratified LD scores ($\sum r^2$)
• Include baseline-LD model annotations (Gazal et al. 2017 Nat Genet)
S-LD4M (O’Connor et al. biorxiv 09/18/18):
• Regress squared $\chi^2$ statistics on stratified LD 4th moments ($\sum r^4$)
• Include baseline-LD model annotations (Gazal et al. 2017 Nat Genet)
• Applicable to genome-wide SNPs or categories of SNPs (e.g. low-frequency SNPs, coding SNPs, etc.)
• Robust results in simulations
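The two regressors named above differ only in the power of the LD correlation. The sketch below computes unstratified LD scores (∑ r²) and LD 4th moments (∑ r⁴) from a toy correlation matrix; the matrix is hypothetical, and the stratified versions and the S-LD4M regression itself are not implemented here.

```python
from typing import List, Tuple


def ld_moments(r: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-SNP LD scores (sum of r^2) and LD 4th moments (sum of r^4)."""
    ld_scores = [sum(x ** 2 for x in row) for row in r]
    ld_4th = [sum(x ** 4 for x in row) for row in r]
    return ld_scores, ld_4th


# Toy LD correlation matrix: SNPs 0 and 1 are correlated (r = 0.8); SNP 2 is independent.
r = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

ld_scores, ld_4th = ld_moments(r)
print(ld_scores)
print(ld_4th)
```

Raising r to the 4th power down-weights weak LD relative to strong LD, so the 4th moment is dominated by a SNP's closest proxies.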
Approaches to understanding polygenicity Love is Understanding. -- Madonna Data is Understanding. -- Alkes
Brain-related traits are particularly polygenic (Number of children is even more polygenic) Results sub-selected from 33 diseases and complex traits (average N = 361K)
Common variants are more polygenic than low-frequency variants
[Figure: polygenicity ($M_a$) of common vs. low-frequency SNPs]
• Common variants: ~4x more polygenic than low-frequency variants (evolutionary modeling: ≥30x more polygenic than de novo variants)

Functional categories are more polygenic (in proportion to heritability enrichment)
[Figure: main functional categories from baseline-LD model; results aggregated across common + low-frequency variants]
• Heritability enrichment reflects differences in the number of associations rather than their magnitude (which is constrained by selection)
Flattening hypothesis: implications for GWAS
• GWAS effect sizes are largely determined by negative selection, not just the biological importance of the implicated gene.
• Weak perturbations to strongly constrained genes will yield more insights than strong perturbations to weakly constrained genes.
  37 fine-mapped IBD loci (Huang et al. 2017 Nature): 0/8 candidate genes with fine-mapped coding variants vs. 12/29 candidate genes near fine-mapped non-coding variants were loss-of-function intolerant (pLI ≥ 0.9; Lek et al. 2016 Nature) (P = 0.006 for difference)
• Rare variant association studies (in very large sample sizes) will usefully complement GWAS, as rare variant architectures are less impacted by flattening due to negative selection.
Outline 1. LD-dependent architectures 2. Functional architectures 3. Polygenicity
Conclusions
• Low-LD variants have larger causal effect sizes (at a given MAF), consistent with negative selection (Gazal et al. 2017 Nat Genet); the baseline-LD model attains higher likelihoods than other models in formal model comparisons (Gazal et al. biorxiv 10/16/18). Modeling LD-dependent architectures is critically important.
• Non-synonymous + conserved + some brain-related annotations have LFVE >> CVE, consistent with strong negative selection (Gazal et al. 2018 Nat Genet).
• Common variants are more polygenic than low-frequency variants + common variants are far more polygenic than de novo variants, due to negative selection (O’Connor et al. biorxiv 09/18/18).
Acknowledgements
Harvard T.H. Chan School of Public Health:
• Steven Gazal
• Luke O’Connor
Broad Institute:
• Hilary Finucane
BWH/Harvard Medical School:
• Shamil Sunyaev
• Po-Ru Loh
• All authors of Gazal et al. 2017 Nat Genet, Gazal et al. biorxiv 10/16/18, Gazal et al. 2018 Nat Genet, O’Connor et al. biorxiv 09/18/18
Additional thanks to UK Biobank and 23andMe