Genome-wide Complex Trait Analysis and extensions

Genome-wide Complex Trait Analysis and extensions. Matthew Keller and Teresa de Candia, University of Colorado at Boulder.

Outline: Issues and extensions of GCTA (de Candia). SNP variance estimates and heritability; estimating multiple genetic variances (e.g., two groups of SNPs); bivariate models (e.g., two traits); practical.

Estimating Multiple Genetic Variances: Just as we can simultaneously estimate the partial (independent) effects of several predictors in standard linear regression, $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \varepsilon$, we can estimate the independent effects of several groups of SNPs: $\mathrm{Var}(Y) = G_1\sigma^2_{g1} + G_2\sigma^2_{g2} + G_3\sigma^2_{g3} + \dots + I\sigma^2_e$, where each $G_k$ is the GRM built from the k-th group of SNPs and $I$ is the identity matrix. GRMs do not let us examine the independent effects of individual SNPs, because SNP effects are estimated in aggregate.
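The covariance structure above can be sketched numerically. This is a minimal illustration, not the GCTA implementation: the genotypes are simulated, the split into two SNP groups is arbitrary, the variance values are made up, and the function names `grm` and `phenotypic_covariance` are hypothetical. It only assembles the model-implied covariance; it does not run REML.

```python
import numpy as np

def grm(genotypes):
    """GRM from an (individuals x SNPs) array of 0/1/2 allele counts."""
    p = genotypes.mean(axis=0) / 2.0              # sample allele frequencies
    keep = (p > 0) & (p < 1)                      # drop monomorphic SNPs
    g, p = genotypes[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))    # standardise each SNP
    return z @ z.T / g.shape[1]

def phenotypic_covariance(grms, genetic_variances, residual_variance):
    """Var(Y) = sum_k G_k * sigma2_gk + I * sigma2_e."""
    n = grms[0].shape[0]
    v = residual_variance * np.eye(n)
    for g_k, sigma2_gk in zip(grms, genetic_variances):
        v = v + sigma2_gk * g_k
    return v

# Simulated data (assumption): 50 individuals, 400 SNPs in two groups of 200.
rng = np.random.default_rng(0)
freqs = rng.uniform(0.2, 0.5, size=400)
geno = rng.binomial(2, freqs, size=(50, 400))

G1 = grm(geno[:, :200])   # e.g. SNPs in one category
G2 = grm(geno[:, 200:])   # e.g. SNPs in another category
V = phenotypic_covariance([G1, G2], [0.3, 0.2], 0.5)
print(V.shape, round(float(np.diag(V).mean()), 2))
```

Each group of SNPs contributes its own GRM and its own variance component, which is what allows the components to be estimated jointly.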

Estimating Multiple Genetic Variances, advantages: (1) The ability to ask questions about specific categories of CVs by grouping similar SNPs together; example SNP categories are chromosome, expression, pathway, and allele frequency. (2) Increased robustness of the model to overestimates caused by confounding of the GRM, which can arise from: stratification when environmental influences DO differ by ethnicity (π-hats confounded with environmental effects); systematic plate effects, e.g., in case-control data (π-hats confounded with genotyping artifacts); and cryptic relatedness (π-hats confounded with rare or non-additive genetic variance, or with shared environmental effects). Systematic plate effects and other biases are more likely to occur in ascertained case/control samples. When there are environmental differences that affect the phenotype, stratification can lead to overestimates because the GRM is confounded with another variable, such as environmental main effects.

SNP variances for schizophrenia, by chromosome, in an ethnically homogeneous sample (Lee et al., 2012): this shows evidence for polygenicity.

SNP variances for schizophrenia, by chromosome, in an ethnically homogeneous sample (Lee et al., 2012): SEs on estimates for specific GRMs (e.g., chromosomal) can be quite large, so comparisons between them usually aren't feasible.

SNP variances for schizophrenia, by chromosome, in an ethnically homogeneous sample (Lee et al., 2012): Look at the difference in the sums of point estimates. The difference between MGRM and GRM-sum tells you that confounds exist here, even though the effect is relatively small.

Bivariate Models: Can be used to examine genetic overlap between two separate measures thought to be related, such as different phenotypes; the same phenotype across different datasets or genotyping procedures; or the same phenotype across different populations or environments. Importantly, model estimation does not require individuals to be assessed on both measures, which makes it useful for examining rare traits.

Bivariate Models: For 2 measures, using all available pairs of individuals i and j, we use 3 different parts of the G and Y matrices: 1 matrix for each of measures 1 and 2, and 1 matrix for the covariances between measures 1 and 2. This model simultaneously estimates 3 genetic parameters: $\sigma^2_{g1}$, $\sigma^2_{g2}$, and $\sigma_{g12}$. Using these we can calculate a SNP correlation: $r_{SNP} = \sigma_{g12} / (\sigma_{g1}\sigma_{g2})$. [Figure: block matrix with rows and columns labelled Trait1 and Trait2.]
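As a worked example of the SNP correlation formula, here is a minimal sketch. The function name `snp_correlation` and the input values are illustrative assumptions, not estimates from the practical.

```python
import math

def snp_correlation(var_g1, var_g2, cov_g12):
    """r_SNP = sigma_g12 / (sigma_g1 * sigma_g2), from two variances and a covariance."""
    if var_g1 <= 0 or var_g2 <= 0:
        raise ValueError("genetic variances must be positive")
    return cov_g12 / math.sqrt(var_g1 * var_g2)

# Made-up components: sigma2_g1 = 0.64, sigma2_g2 = 0.25, sigma_g12 = 0.20
r_snp = snp_correlation(0.64, 0.25, 0.20)
print(round(r_snp, 3))
```

Note that the denominator uses standard deviations (square roots of the estimated variances), so a non-positive variance estimate leaves the correlation undefined.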

Bivariate Models: SNP correlations (rSNP) are only our best estimates of the underlying genetic correlations (rg). They reflect the extent to which more common CVs are shared between traits; rSNP is not a direct estimate of the correlation of effect sizes of causal alleles. Systematic genotyping artifacts and population structure (distinct populations with MAF and background LD differences) will produce underestimates of rg. If we look across different traits, each measured in a separate dataset, rSNP between traits can be biased downward. It is therefore important to make apples-to-apples comparisons (different traits, same dataset) and/or to use benchmarks (e.g., same trait, different datasets). Expect rSNP to be biased downward for the same reasons that SNP heritabilities are.

SNP-h2 < Narrow-sense h2 (usually, maybe). 1. Estimates rely on LD between SNPs and causal variants (CVs), and are therefore imperfect. Datasets with lower SNP density will capture less heritability (denser means SNPs are closer together). Estimates are biased if background LD around CVs does not mirror that around SNPs; LDAK has been used to correct for this but may over-adjust (Speed et al., 2013). If the allele frequency spectrum of the SNPs differs from that of the CVs, the estimate will be too low: rarer CVs are not well represented by SNP panels.

Max R2 as a function of SNP MAF for several CV allele frequencies: The figure shows the upper bound on how much additive genetic variance (VA) GWAS can detect. We only have SNPs to the right of the dashed line, so really rare CVs are not going to be tagged at all. Common CVs will be well detected by SNPs across a range of MAFs. Traits whose CVs have lower MAF will give lower h2 estimates because those CVs are poorly tagged by common SNPs. Therefore, heritability estimates reflect the aggregate importance of COMMON causal variants.

SNP-h2 < Narrow-sense h2 (usually, maybe). 2. Noise in SNP calls (thus, π-hats) and phenotypes tends to bias SNP-h2 downward when: (a) Random plate effects inflate the variance across π-hats compared with the true πs. This seems to be a problem with ascertained case-control samples as well (Golan et al., 2014). With $G_{SNP} = cG_{CV}$, the model is $\mathrm{Var}(Y) = cG_{CV}\sigma^2_g + I\sigma^2_e$, and h2 is underestimated when c is a scalar > 1 on $G_{CV}$, inflating var(π-hat). (b) Stratification is present, but environmental influences DO NOT differ by ethnicity. (c) Genetic heterogeneity exists, such that two "phenotypes" that are genetically quite different are regarded as the same thing.
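The scaling argument can be made concrete with a little arithmetic. This is a simplified sketch under strong assumptions: the fitted GRM is exactly c times the causal-variant GRM, and the estimator recovers the variance components without error. Under those assumptions, Var(Y) = G_CV * s2g + I * s2e = (c * G_CV) * (s2g / c) + I * s2e, so the genetic variance attributed to the fitted GRM is s2g / c. The function name `fitted_h2` and the numbers are illustrative.

```python
def fitted_h2(true_var_g, var_e, c):
    """h2 attributed to a GRM that is c times the causal-variant GRM."""
    if c <= 0:
        raise ValueError("c must be positive")
    fitted_var_g = true_var_g / c          # variance absorbed by the scaled GRM
    return fitted_var_g / (fitted_var_g + var_e)

# Made-up example: true h2 = 0.5, evaluated for increasing inflation c.
for c in (1.0, 1.25, 2.0):
    print(c, round(fitted_h2(0.5, 0.5, c), 3))
```

With c = 1 the true value is recovered; any c > 1 pulls the estimate below it.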

[Figure: SNP variances for a simulated trait as a function of stratification; axis: proportion of genetic variance detected.]

SNP-h2 < Narrow-sense h2 (usually, maybe). But, not so shabby. Overall, things seem to work: SNPs pick up a substantial proportion of variance for a lot of tested traits, and simulations of phenotypes using real data confirm that the methods are relatively robust to most assumptions. Given published SNP h2, the amount of missing heritability is not surprising, especially if we take into account that estimates from family studies include rare and non-additive heritability as well as shared environmental effects.

Practical Objectives: Let's estimate some univariate and bivariate models using simulated genotype and phenotype data. Suppose we have plink binary files for two studies, dat1 and dat2. (We have also merged these plink files into a single file, dat.) Suppose the first dataset (dat1) is of 2k females measured on height, and the second (dat2) is of 2k females measured on BMI. Each individual is present in only one dataset. Our aim is first to estimate heritability separately for each of the two traits in univariate models, and then to jointly estimate two heritabilities and a genetic correlation in a bivariate model. To do this we will be a) calculating GRMs, and b) running REML to estimate model parameters. This is simulated GWA data covering 20Mb, so less than 1% of the genome. We have 6500 SNPs in this region, which is roughly equivalent to a SNP chip with a genome-wide density of ~1M SNPs. QC has already been done.
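The density claim can be checked with quick arithmetic. The sketch assumes a genome length of roughly 3,000 Mb, a round figure that is not given on the slide; the variable names are illustrative.

```python
GENOME_MB = 3000      # assumed approximate human genome length in Mb
REGION_MB = 20        # simulated region, from the slide
REGION_SNPS = 6500    # SNPs in the region, from the slide

fraction_of_genome = REGION_MB / GENOME_MB
genome_wide_equivalent = REGION_SNPS * GENOME_MB / REGION_MB

print(f"{fraction_of_genome:.2%} of the genome")
print(f"{genome_wide_equivalent:,.0f} SNPs at the same density genome-wide")
```

Under that assumption the region is about two thirds of a percent of the genome, and the same SNP density genome-wide comes out close to 1M SNPs.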

GCTA Software: Can be used for data management (similar to PLINK); calculation of the GRM from genome-wide SNPs (this can also be done in PLINK); model estimation by REML; and PCA, simulations, etc.

Input Files: Binary PLINK files: fam file (.fam), bim file (.bim), and bed file (.bed).

Data management: Inclusion criteria: --keep mylist.txt, --remove mylist.txt (individuals); --extract mysnps.txt, --exclude mysnps.txt (SNPs); --chr 6, --autosome (chromosomes). Using phenotype files: --pheno. Using covariate files: --covar, --qcovar.

Calculating GRM:
gcta --bfile dat1 --make-grm-gz --thread-num 2 --out dat1.gcta
Generates dat1.gcta.grm.gz and dat1.gcta.grm.id. --thread-num allows for parallelizing if you're on a multi-threaded machine.

Genetic Relationship Matrix (GRM): The ID file contains FID and IID. The GZ file contains the row number for individual 1, the row number for individual 2, the number of non-missing SNPs, and the π-hat.
example.grm.id:
10 01
10 02
17 01
28 01
33 01
33 02
37 50
38 01
45 50
46 01
example.grm.gz:
1 1 273588 0.99629
2 1 273566 0.47804
2 2 273600 0.99192
3 1 269152 0.00656
3 2 269164 0.00215
3 3 269192 0.99075
4 1 273582 0.00004
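To inspect a GRM outside GCTA, the two files can be read into a square matrix. The following is a minimal sketch that assumes the four-column layout described on the slide; `read_grm_gz` is a hypothetical helper, not part of GCTA, and the demo writes the first three individuals of the slide's example to a temporary directory so the snippet is self-contained.

```python
import gzip
import os
import tempfile

import numpy as np

def read_grm_gz(prefix):
    """Read <prefix>.grm.id and <prefix>.grm.gz into (ids, n_snps, grm)."""
    with open(prefix + ".grm.id") as fh:
        ids = [tuple(line.split()) for line in fh if line.strip()]
    n = len(ids)
    grm = np.full((n, n), np.nan)
    n_snps = np.zeros((n, n), dtype=int)
    with gzip.open(prefix + ".grm.gz", "rt") as fh:
        for line in fh:
            if not line.strip():
                continue
            row, col, count, value = line.split()
            i, j = int(row) - 1, int(col) - 1          # file indices are 1-based
            grm[i, j] = grm[j, i] = float(value)       # fill both triangles
            n_snps[i, j] = n_snps[j, i] = int(float(count))
    return ids, n_snps, grm

id_lines = ["10 01", "10 02", "17 01"]
grm_lines = [
    "1 1 273588 0.99629",
    "2 1 273566 0.47804",
    "2 2 273600 0.99192",
    "3 1 269152 0.00656",
    "3 2 269164 0.00215",
    "3 3 269192 0.99075",
]

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "example")
    with open(prefix + ".grm.id", "w") as fh:
        fh.write("\n".join(id_lines) + "\n")
    with gzip.open(prefix + ".grm.gz", "wt") as fh:
        fh.write("\n".join(grm_lines) + "\n")
    ids, n_snps, grm = read_grm_gz(prefix)

print(ids)
print(grm)
```

In this example the first two individuals share a family ID and have a π-hat near 0.5, while the third is essentially unrelated to both.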

Estimating SNP h2: Estimate SNP h2 for trait 1 (and then do the same for trait 2):
gcta --grm-gz dat1.gcta --pheno dat.pheno --mpheno XXX --reml --out dat1.results
Jointly estimate SNP h2 for both measures as well as the SNP correlation:
gcta --grm-gz dat.gcta --pheno dat.pheno --reml-bivar XXX XXX --out dat.results
"XXX" will be 1 for phenotype data in the 3rd column, 2 for phenotype data in the 4th column. Exactly two columns must be specified for the bivariate model. Both traits are in the phenotype file dat.pheno: height is in column 3 and BMI is in column 4. The extension of the results files is ".hsq".
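If you prefer to drive these runs from a script, the two command lines can be assembled programmatically. This is a hedged sketch: `reml_command` is a hypothetical helper, the executable name `gcta` follows the slides and may differ on your system, and the trait indices 1 and 2 follow the column convention described above. The snippet only builds and prints the commands; it does not run GCTA.

```python
import shlex

def reml_command(grm_prefix, pheno_file, out_prefix, traits, executable="gcta"):
    """Build a univariate (one trait) or bivariate (two traits) REML command."""
    if len(traits) == 1:
        model = ["--mpheno", str(traits[0]), "--reml"]
    elif len(traits) == 2:
        model = ["--reml-bivar", str(traits[0]), str(traits[1])]
    else:
        raise ValueError("give one trait index (univariate) or two (bivariate)")
    return [executable, "--grm-gz", grm_prefix, "--pheno", pheno_file,
            *model, "--out", out_prefix]

univariate = reml_command("dat1.gcta", "dat.pheno", "dat1.results", [1])
bivariate = reml_command("dat.gcta", "dat.pheno", "dat.results", [1, 2])

print(shlex.join(univariate))
print(shlex.join(bivariate))
```

Passing either list to `subprocess.run(..., check=True)` would execute it, assuming the executable is on your PATH.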

Phenotype File: example.pheno has one row per individual with columns FID, IID, trait 1, and trait 2. Individuals from dataset 1 have a value for trait 1 and NA for trait 2; individuals from dataset 2 have NA for trait 1 and a value for trait 2.
Dataset 1:
fdfs1 fdfs1 0.99629 NA
absd2 absd2 -0.47804 NA
edgg3 edgg3 0.49192 NA
rkls4 rkls4 0.00656 NA
Dataset 2:
eedf1 eedf1 NA 0.00215
aaaa2 aaaa2 NA -0.99075
bbbb3 bbbb3 NA 0.00004
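A stacked phenotype file of this shape can be produced from two per-dataset tables. The sketch below assumes each dataset is held as a mapping from (FID, IID) to a trait value and that no individual appears in both, as stated for the practical; `stacked_pheno_lines` is a hypothetical helper and the values are taken from the example rows.

```python
def stacked_pheno_lines(dataset1, dataset2):
    """Return 'FID IID trait1 trait2' lines, with NA for the unmeasured trait."""
    overlap = set(dataset1) & set(dataset2)
    if overlap:
        raise ValueError(f"individuals present in both datasets: {sorted(overlap)}")
    lines = []
    for (fid, iid), value in dataset1.items():
        lines.append(f"{fid} {iid} {value} NA")      # measured on trait 1 only
    for (fid, iid), value in dataset2.items():
        lines.append(f"{fid} {iid} NA {value}")      # measured on trait 2 only
    return lines

height = {("fdfs1", "fdfs1"): 0.99629, ("absd2", "absd2"): -0.47804}
bmi = {("eedf1", "eedf1"): 0.00215}

pheno_lines = stacked_pheno_lines(height, bmi)
print("\n".join(pheno_lines))
```

The NA entries are what allow the bivariate model to use individuals measured on only one of the two traits.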

Help and Questions: Go to your desktop's “Home” directory, create a directory called “GCTA”, and copy everything from Matt's subdirectory “Boulder2015” into it. Use “GCTA_2015.Practical.R” to do all this. GCTA website: http://www.complextraitgenomics.com/software/gcta/ Questions: What is the SNP h2 for measure 1? For measure 2? What are the SEs on those point estimates? How genetically correlated are the two measures? What SNP r would you expect across datasets assessed on exactly the same measure?

Results: Pretty different SNP h2 between traits: SNP h2 estimated for height is ~.78 (simulated h2 was .8), and SNP h2 for BMI is ~.36 (simulated h2 was .4). Relatively high SNP r between traits: ~.64. Are these simulated SNP h2 estimates realistic?

From Wikipedia “Apples and Oranges” “At least two tongue-in-cheek scientific studies have been conducted on the subject, each of which concluded that apples can be compared with oranges fairly easily and on a low budget and the two fruits are quite similar. …[One] study … concluded: ‘…the comparing apples and oranges defense should no longer be considered valid. This is a somewhat startling revelation. It can be anticipated to have a dramatic effect on the strategies used in arguments and discussions in the future.’” “In many languages, oranges are, implicitly or explicitly, referred to as a type of apple”. “Oranges, like apples, grow on trees.” Additionally, one figure with the subtitle “Not all apples are alike”, at the very least, possibly calls into question the use of the phrase “apples-with-apples comparison”.