Marker heritability: Biases, confounding factors, current methods, and best practices. Luke Evans, Matthew Keller.


Background: What Matt Keller presented. GREML-SC uses a single genetic relatedness matrix (GRM) to estimate SNP heritability (h2SNP). It relates allele sharing at genome-wide SNPs to phenotypic similarity. The GRM is a proxy for allele sharing at causal variants (CVs).

Background: GCTA-style approach. Unrelated individuals (e.g., A_ij < 0.05). Common markers from SNP arrays (e.g., MAF > 0.05, m = 500,000 to 2.5M SNPs). Low to moderate stratification in samples (e.g., UK Biobank, GoT2D, AMD). Homogeneous populations (e.g., North Finland Birth Cohort, Sardinia).
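The "unrelated individuals" criterion above can be sketched as a pruning step on the GRM. This is a minimal illustration, not GCTA's own procedure: the function name `prune_related`, the greedy removal rule, and the toy matrix are all assumptions made for the example; only the 0.05 cutoff comes from the slide.

```python
def prune_related(grm, cutoff=0.05):
    """Return indices kept so that no remaining pair has A_ij >= cutoff.

    Greedy heuristic (an assumption for illustration): repeatedly drop the
    individual who is related to the most remaining individuals.
    """
    keep = set(range(len(grm)))
    while keep:
        # For each kept individual, count kept partners at or above the cutoff.
        related = {
            i: sum(1 for j in keep if j != i and grm[i][j] >= cutoff)
            for i in keep
        }
        worst = max(related, key=lambda i: (related[i], i))
        if related[worst] == 0:
            break  # every remaining pair is below the cutoff
        keep.remove(worst)
    return sorted(keep)


# Hypothetical 4-person GRM: person 1 is related to persons 0 and 2.
toy_grm = [
    [1.00, 0.50, 0.00, 0.01],
    [0.50, 1.00, 0.30, 0.00],
    [0.00, 0.30, 1.00, 0.02],
    [0.01, 0.00, 0.02, 1.00],
]
unrelated = prune_related(toy_grm)
```

Dropping the most-connected individual first tends to retain more of the sample than dropping one member of each related pair arbitrarily, though it is not guaranteed to be optimal.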

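Background: GCTA-style approach.
GRM: $A_{ij} = \frac{1}{m} \sum_{k=1}^{m} \frac{(x_{ik} - 2p_k)(x_{jk} - 2p_k)}{2p_k(1 - p_k)}$
Phenotype: $y_i = g_i + e_i$
$\mathrm{var}(\mathbf{y}) = \mathbf{A}\sigma_v^2 + \mathbf{I}\sigma_e^2$
$h^2_{SNP} = \sigma_v^2 / (\sigma_v^2 + \sigma_e^2)$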
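As a rough sketch of the GRM formula on this slide, the following computes A_ij from a small genotype matrix coded as 0/1/2 allele counts. The function names, the toy genotypes, and the choice to skip monomorphic SNPs (where 2p(1-p) = 0) are assumptions for illustration; allele frequencies are estimated from the sample itself.

```python
def allele_freqs(genotypes):
    """Sample allele frequency p_k for each SNP (genotypes are 0/1/2 counts)."""
    n = len(genotypes)
    return [sum(row[k] for row in genotypes) / (2 * n)
            for k in range(len(genotypes[0]))]


def grm(genotypes):
    """A_ij = (1/m) * sum_k (x_ik - 2p_k)(x_jk - 2p_k) / (2 p_k (1 - p_k))."""
    p = allele_freqs(genotypes)
    # Assumption: monomorphic SNPs are dropped, so m counts polymorphic SNPs only.
    used = [k for k, pk in enumerate(p) if 0.0 < pk < 1.0]
    if not used:
        raise ValueError("no polymorphic SNPs")
    n = len(genotypes)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = sum(
                (genotypes[i][k] - 2 * p[k]) * (genotypes[j][k] - 2 * p[k])
                / (2 * p[k] * (1 - p[k]))
                for k in used
            )
            A[i][j] = A[j][i] = total / len(used)
    return A


def h2_snp(var_v, var_e):
    """h2SNP = sigma_v^2 / (sigma_v^2 + sigma_e^2)."""
    return var_v / (var_v + var_e)


# Hypothetical genotypes: 4 individuals (rows) by 4 SNPs (columns).
toy_genotypes = [
    [0, 1, 2, 1],
    [1, 1, 0, 2],
    [2, 1, 1, 0],
    [1, 0, 1, 1],
]
A = grm(toy_genotypes)
```

Because each SNP is centered by its own sample frequency, every row of A sums to zero in this setup, which is a quick sanity check on an implementation.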

Examine biases using real genotypes and simulated phenotypes. Genotypes: Haplotype Reference Consortium whole genome sequences, relatively homogeneous subset. Build the GRM from Axiom array positions only, or from whole genome sequence variants. Vary the MAF of markers for the GRM.
GRM elements: $A_{ij} = \frac{1}{m} \sum_{k=1}^{m} \frac{(x_{ik} - 2p_k)(x_{jk} - 2p_k)}{2p_k(1 - p_k)}$

Examine biases using real genotypes and simulated phenotypes. Phenotypes: simulated from whole genome sequence. 1,000 CVs drawn randomly from sequence data. Vary the MAF of CVs.
$y_i = g_i + e_i$
$g_i = \sum_k w_{ik} \beta_k$
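The simulation design on this slide can be sketched as follows. Several details are assumptions, since the slide does not give them: w_ik is taken to be the raw 0/1/2 allele count, effects are drawn from a standard normal, the target heritability `h2` is a free parameter, and the residuals are rescaled so the sample variance ratio matches `h2` exactly.

```python
import random
import statistics


def simulate_phenotype(genotypes, n_causal, h2, seed=0):
    """Simulate y_i = g_i + e_i with g_i = sum_k w_ik * beta_k over random CVs."""
    rng = random.Random(seed)
    n, m = len(genotypes), len(genotypes[0])
    causal = rng.sample(range(m), n_causal)            # CVs drawn at random
    beta = {k: rng.gauss(0.0, 1.0) for k in causal}    # assumed effect distribution
    g = [sum(row[k] * beta[k] for k in causal) for row in genotypes]
    var_g = statistics.pvariance(g)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Rescale the noise so that var(g) / (var(g) + var(e)) equals h2 in-sample.
    target_var_e = var_g * (1.0 - h2) / h2
    scale = (target_var_e / statistics.pvariance(noise)) ** 0.5
    e = [scale * z for z in noise]
    y = [gi + ei for gi, ei in zip(g, e)]
    return causal, g, e, y


# Hypothetical data: 50 individuals, 20 SNPs, 5 causal variants, h2 = 0.5.
_rng = random.Random(1)
toy_genotypes = [[_rng.choice([0, 1, 2]) for _ in range(20)] for _ in range(50)]
causal, g, e, y = simulate_phenotype(toy_genotypes, n_causal=5, h2=0.5)
```

To vary the MAF of the CVs as the slide describes, the `rng.sample` call would be restricted to variants whose frequency falls in the desired range.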

CVs from whole genome sequence include many rare variants

Simulated phenotypes, GRM from common Axiom array. [Figure: mean +/- 95% CI over 100 replicates, by causal variant minor allele frequency (rarer, common, very common).]

Simulated phenotypes, GRM from common Axiom array. Unbiased estimate of heritability: h2SNP ~ h2. [Figure: mean +/- 95% CI over 100 replicates, by causal variant minor allele frequency (rarer, common, very common).]

Simulated phenotypes, GRM from common Axiom array. Overestimated heritability: h2SNP >> h2. [Figure: mean +/- 95% CI over 100 replicates, by causal variant minor allele frequency (rarer, common, very common).]

Simulated phenotypes, GRM from common Axiom array. Underestimated heritability: h2SNP << h2. [Figure: mean +/- 95% CI over 100 replicates, by causal variant minor allele frequency (rarer, common, very common).]

Simulated phenotypes, GRM from common whole genome sequence variants. Simply adding more common markers (e.g., WGS or imputed) won't fix biases. [Figure: mean +/- 95% CI over 100 replicates, by GRM markers and causal variant minor allele frequency (rarer, common, very common).]

CVs from whole genome sequence include many rare variants

Simulated phenotypes, Axiom array or whole genome GRM. When the GRM includes all variants and CVs are drawn from all variants, WGS MAF ≃ CV MAF and the estimate is unbiased. [Figure: mean +/- 95% CI over 100 replicates, by GRM markers and causal variant minor allele frequency (rarer, common, very common, all variants).]

Why is there a relationship between GREML-SC heritability estimates and MAF? Estimates are unbiased when marker MAF is the same as the CV MAF, underestimated when CVs are rarer than the markers used, and overestimated when CVs are more common than the markers. MAF is related to LD, and LD is related to biases in h2 estimation. Details: Wray 2005 Twin Res. Hum. Gen.; Speed et al. 2012 AJHG; Yang et al. 2015 NG.

MAF is related to LD (Wray 2005 Twin Res. Hum. Gen., Figure 1). Common SNPs can't be in high LD with very rare SNPs. [Figure panels: r2 ≥ 0.2, r2 ≥ 0.5, r2 ≥ 0.8.]

When does GREML-SC correctly estimate h2? It depends on LD among markers and between markers and CVs (Yang et al. 2015 NG): $h^2_{SNP} = h^2 (r^2_{QM} / r^2_{MM})$, where $r^2_{QM}$ is the average LD between markers and CVs genome-wide and $r^2_{MM}$ is the average LD among markers genome-wide. [Figure: CV MAF versus GRM marker MAF; categories: rarer, common, very common, all variants.]

When does GREML-SC correctly estimate h2? $r^2_{QM} = r^2_{MM}$: unbiased estimate of h2. [Figure: CV MAF versus GRM marker MAF; categories: rarer, common, very common, all variants.]

When does GREML-SC correctly estimate h2? $r^2_{QM} \ll r^2_{MM}$: underestimate h2. [Figure: CV MAF versus GRM marker MAF; categories: rarer, common, very common, all variants.]

When does GREML-SC correctly estimate h2? $r^2_{QM} \gg r^2_{MM}$: overestimate h2. [Figure: CV MAF versus GRM marker MAF; categories: rarer, common, very common, all variants.]
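The three cases above follow directly from the ratio in h2SNP = h2(r2QM / r2MM). A small numeric sketch makes the direction of each bias concrete; the function name and all of the LD values below are hypothetical numbers chosen for illustration, not results from the talk.

```python
def expected_h2_snp(h2, r2_qm, r2_mm):
    """Expected GREML-SC estimate: h2 * (r2_QM / r2_MM), per the slide's formula."""
    if r2_mm <= 0:
        raise ValueError("average LD among markers must be positive")
    return h2 * (r2_qm / r2_mm)


true_h2 = 0.5  # hypothetical true heritability
# Hypothetical (r2_QM, r2_MM) pairs for the three cases on the slides.
scenarios = {
    "CVs tagged like markers (unbiased)": (0.10, 0.10),
    "CVs rarer than markers (underestimate)": (0.05, 0.10),
    "CVs more common than markers (overestimate)": (0.15, 0.10),
}
for label, (r2_qm, r2_mm) in scenarios.items():
    print(f"{label}: {expected_h2_snp(true_h2, r2_qm, r2_mm):.2f}")
```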

Heritability estimate related to LD patterns of markers and CVs. LD among markers and between markers and CVs (Yang et al. 2015 NG): $h^2_{SNP} = h^2 (r^2_{QM} / r^2_{MM})$, where $r^2_{QM}$ is the average LD between markers and CVs genome-wide and $r^2_{MM}$ is the average LD among markers genome-wide. [Figure: h2SNP against $r^2_{QM} / r^2_{MM}$, by CV MAF (random from full distribution, common, uncommon, rare, very rare).]

Multiple Component GREML (Yang 2011, Yang 2015). Can correct for many of these biases by using GRMs from various MAF or LD bins. Bin variants into MAF and/or LD categories and create a GRM for each. GCTA will partition phenotypic variance among all GRMs (plus error). The sum of all genetic variance components, as a proportion of phenotypic variance, is the total h2SNP. Partitioned estimates can explore aspects of genetic architecture (e.g., rare vs. common variants).
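The arithmetic of the partition can be sketched as below, assuming the variance components have already been estimated (for example by a multi-GRM REML fit). The function name, bin labels, and variance values are hypothetical.

```python
def partition_h2(component_variances, residual_variance):
    """Per-bin and total h2SNP from estimated variance components.

    component_variances maps a bin label to its genetic variance estimate.
    """
    total_variance = sum(component_variances.values()) + residual_variance
    per_bin = {name: v / total_variance for name, v in component_variances.items()}
    return per_bin, sum(per_bin.values())


# Hypothetical variance component estimates for three MAF bins plus error.
per_bin_h2, total_h2 = partition_h2(
    {"rare": 0.1, "uncommon": 0.1, "common": 0.2},
    residual_variance=0.6,
)
```

Reporting the per-bin shares alongside the total is what allows the rare-versus-common comparison mentioned on the slide.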

MAF-stratified approach: allows the variance to change among MAF bins.
$\beta_k \sim N(0, 1/[2p_k(1 - p_k)])$
$\mathbf{V} = \sum_i \mathbf{A}_i \sigma_i^2 + \mathbf{I}\sigma_e^2$
Relationship between markers and causal variants within bins: $h^2_{SNP} = h^2 (r^2_{QM} / r^2_{MM})$. [Figure: $\beta_k$ against MAF, with rare, uncommon, and common MAF bins.]

MAF-stratified approach: allows the variance to change among MAF bins. [Figure panels: $\beta_k$ against MAF under $\beta_k \sim N(0, 1/[2p_k(1 - p_k)])$ and under $\beta_k \sim N(0, 1)$.]
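The binning step behind the MAF-stratified model can be sketched as follows: assign each variant to a MAF bin, then build one GRM per bin so that each A_i gets its own variance component. The bin edges, function names, and toy genotypes are assumptions for illustration; the talk does not state the bin boundaries used.

```python
from bisect import bisect_right


def maf_bins(freqs, edges):
    """Map bin index -> variant indices, binning on MAF = min(p, 1 - p).

    A variant whose MAF equals an edge falls in the upper bin.
    """
    bins = {}
    for k, p in enumerate(freqs):
        maf = min(p, 1.0 - p)
        if maf > 0:  # skip monomorphic variants
            bins.setdefault(bisect_right(edges, maf), []).append(k)
    return bins


def grm_for(genotypes, freqs, variants):
    """GRM built from the given subset of variants only."""
    n = len(genotypes)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            total = sum(
                (genotypes[i][k] - 2 * freqs[k]) * (genotypes[j][k] - 2 * freqs[k])
                / (2 * freqs[k] * (1 - freqs[k]))
                for k in variants
            )
            A[i][j] = A[j][i] = total / len(variants)
    return A


def stratified_grms(genotypes, edges):
    """One GRM per MAF bin, keyed by bin index."""
    n = len(genotypes)
    freqs = [sum(row[k] for row in genotypes) / (2 * n)
             for k in range(len(genotypes[0]))]
    return {b: grm_for(genotypes, freqs, ks)
            for b, ks in maf_bins(freqs, edges).items()}


# Hypothetical data: 4 individuals, 3 SNPs, one bin edge at MAF = 0.2.
toy_genotypes = [[0, 1, 2], [0, 1, 0], [1, 1, 1], [0, 2, 1]]
grms = stratified_grms(toy_genotypes, edges=[0.2])
```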

MAF-stratified GREML is unbiased. Whole genome sequence, 4 MAF bins. [Figure: estimates by MAF range of 1,000 random causal variants (rare, common, all variants); x-axis: causal variant minor allele frequency.]

MAF-stratified GREML correctly partitions variance to the correct MAF range. [Figure: x-axis: causal variant minor allele frequency.]

PRACTICAL 2

Stratification: population structure influences LD (and therefore h2SNP). [Figure panels: Europe-wide (HRC data); homogeneous subset.] Environmental sharing is fairly easily controlled by PCs. A separate but less recognized issue is the effect of stratification on LD in the absence of environmental confounding. Ancestry PCs account for mean differences; here there may be no mean difference, so controlling for ancestry PCs doesn't remove the effect of stratification on LD (the correlation between CVs and markers).

Stratification and confounding. Remember the stratification talks. Environments can also be confounded with ancestry. Other covariates: sex, batch, etc. Typically, PC scores for some number of axes are included as covariates (Price et al. 2010, Yang et al. 2014, etc.). Included covariates correct for mean differences, but not for the LD effects of stratification.
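What "correcting for mean differences" amounts to can be sketched with a single covariate. This is a simplification: in practice several PCs and other covariates are fit jointly as fixed effects in the model, whereas the function below (a hypothetical helper) removes the linear effect of one covariate by ordinary least squares. The phenotype and PC values are made up.

```python
import statistics


def residualize(y, covariate):
    """Residuals of y after regressing out an intercept and one covariate."""
    mean_y = statistics.fmean(y)
    mean_c = statistics.fmean(covariate)
    sxx = sum((c - mean_c) ** 2 for c in covariate)
    if sxx == 0:
        raise ValueError("covariate has no variance")
    slope = sum((c - mean_c) * (v - mean_y) for c, v in zip(covariate, y)) / sxx
    return [v - mean_y - slope * (c - mean_c) for c, v in zip(covariate, y)]


# Hypothetical phenotype and first ancestry PC for five individuals.
phenotype = [1.0, 2.0, 4.0, 3.0, 6.0]
pc1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
adjusted = residualize(phenotype, pc1)
```

The residuals are uncorrelated with the PC, so any mean shift along that axis is gone. Nothing in this step changes the correlations between markers and CVs, which is why the LD effects of stratification remain.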

Homogeneous vs. stratified samples: $h^2_{SNP} = h^2 (r^2_{QM} / r^2_{MM})$. Rare, ancestry-informative alleles are in high LD, driving up LD scores. The shift in r2 ratios leads to biases. [Figure panels: homogeneous sample and stratified sample; h2SNP against $r^2_{QM} / r^2_{MM}$, by CV MAF (random from full distribution, common, uncommon, rare, very rare).]

Homogeneous vs. stratified samples: $h^2_{SNP} = h^2 (r^2_{QM} / r^2_{MM})$. Rare, ancestry-informative alleles are in high LD, driving up LD scores. Very rare variants have higher LD due to stratification. The shift in r2 ratios leads to biases. [Figure panels: homogeneous sample and stratified sample; h2SNP against $r^2_{QM} / r^2_{MM}$, by CV MAF (random from full distribution, common, uncommon, rare, very rare).]

MAF-stratified or single component in homogeneous samples vs. structured samples. [Figure panels: single GRM using WGS; MAF-stratified GRMs using WGS; x-axis: causal variant minor allele frequency.]

Imputation vs. genome sequence: GREML-MS. [Figure: x-axis: causal variant minor allele frequency.]

Numerous methods have been developed. Relative performance varies and depends on the model and its assumptions. [Figure: x-axis: causal variant minor allele frequency.]

BEST PRACTICES: Careful QC and appropriate covariates. Whole genome sequence is best. Impute! Use the Haplotype Reference Consortium. Remove related individuals: relatives share environmental effects that confound the estimate, and using unrelated samples avoids this. Carefully interpret results from studies that use a single GRM in GREML. There are clear biases from this approach, yet most studies have used GREML-SC. GREML-MS or GREML-LDMS are much preferred. Methods compared: GREML-SC, GREML-MS/LDMS, LD score regression. Other methods were also tested; they performed poorly.