Washington State University


Statistical Genomics, Lecture 16: CMLM. Zhiwu Zhang, Washington State University

Objective: criticism on MLM; CMLM; ECMLM

Hidden, observed, induction, and modeling (diagram). Categories: hidden, observed, induction, modeling. Elements: genes, SNPs, PCs, K, BV, BLUP, y, residual. Models: y=SNP+e; y=SNP+PC+e; y=SNP+PC+K+e; y=SNP+PC+BLUP+e; BLUP=SNP+e; BLUP=SNP+PC+e; Residual=SNP+e; Residual=SNP+PC+e

MLM for GWAS: Y = SNP + Q (or PCs) + Kinship + e. Y is the phenotype, Q (or PCs) accounts for population structure, and Kinship accounts for unequal relatedness. SNP and Q (or PCs) are fixed effects; Kinship is a random effect. The General Linear Model (GLM) contains only the fixed effects; the Mixed Linear Model (MLM) adds the random kinship effect (Yu et al. 2005, Nature Genetics).
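The word equation above can be written in matrix form. The block below is a sketch in one common notation; the symbols X, β, Z, u, K, σ_a² and σ_e² are assumptions introduced here, not taken from the slide, and the factor 2 applies when K holds coancestry coefficients.

```latex
\begin{aligned}
y &= X\beta + Zu + e,\\
u &\sim N\!\left(0,\; 2K\sigma_a^2\right), \qquad
e \sim N\!\left(0,\; I\sigma_e^2\right),\\
h^2 &= \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2}.
\end{aligned}
```

Here X holds the fixed effects (intercept, Q or PCs, and the tested SNP), Z maps individuals to the random effects u, and K is the kinship matrix.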

GWAS does not work for traits associated with structure (figure from Atwell et al., Nature 2010; credit: Magnus Nordborg). Panels: a, test with no correction; b, correction with MLM.

Phenotype simulation

# Genotypes (first column is the taxa name) and SNP information
myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T)
setwd("~/Dropbox/Current/ZZLab/WSUCourse/CROPS545/Demo")
source("G2P.R")
source("GWASbyCor.R")
X=myGD[,-1]
# Keep markers on chromosomes 1 to 5 as candidate QTNs
index1to5=myGM[,2]<6
X1to5 = X[,index1to5]
set.seed(99164)
mySim=G2P(X= X1to5,h2=.75,alpha=1,NQTN=10,distribution="norm")
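G2P() is sourced from a course file that is not shown here. To make the idea concrete, the following is a minimal sketch of how a phenotype with a target heritability could be simulated. The function simulate_phenotype() is a hypothetical stand-in, not the course's G2P(); it only mirrors the fields the slides use (add, residual, y, QTN.position) and assumes normally distributed QTN effects.

```r
# Hypothetical stand-in for G2P(); not the original course function.
simulate_phenotype <- function(X, h2, NQTN) {
  X <- as.matrix(X)
  n <- nrow(X)
  m <- ncol(X)
  # Sample QTN positions and draw their effects from a standard normal
  QTN.position <- sample(m, NQTN)
  effect <- rnorm(NQTN)
  # Additive genetic value (breeding value) of each individual
  add <- as.vector(X[, QTN.position, drop = FALSE] %*% effect)
  # Residual variance that yields the requested heritability
  ve <- var(add) * (1 - h2) / h2
  residual <- rnorm(n)
  residual <- (residual - mean(residual)) / sd(residual) * sqrt(ve)
  list(add = add, residual = residual, y = add + residual,
       QTN.position = QTN.position)
}

# Small demo on random genotypes coded 0/1/2
set.seed(1)
X.demo <- matrix(sample(0:2, 100 * 50, replace = TRUE), nrow = 100)
sim <- simulate_phenotype(X.demo, h2 = 0.75, NQTN = 10)
h2.realized <- var(sim$add) / (var(sim$add) + var(sim$residual))
```

Because the residual is rescaled to the exact target variance, h2.realized equals the requested h2 up to floating-point error; a simulation that draws the residual without rescaling would only match it on average.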

Inflation by structure

Single marker test

y=mySim$y
G=myGD[,-1]
n=nrow(G)
m=ncol(G)
P=matrix(NA,1,m)
for (i in 1:m){
  x=G[,i]
  if(max(x)==min(x)){
    p=1
  }else{
    # Least squares fit of y on intercept and marker
    X=cbind(1,x)
    LHS=t(X)%*%X
    C=solve(LHS)
    RHS=t(X)%*%y
    b=C%*%RHS
    yb=X%*%b
    e=y-yb
    n=length(y)
    ve=sum(e^2)/(n-1)
    vt=C*ve
    t=b/sqrt(diag(vt))
    p=2*(1-pt(abs(t),n-2))
  } #end of testing variation
  P[i]=p[length(p)]
} #end of looping for markers

# QQ plot (screen 1) and Manhattan plot (screen 2)
split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
screen(1)
par(mar = c(0, 0, 0, 0))
p.obs=P
m2=length(p.obs)
p.uni=runif(m2,0,1)
order.obs=order(p.obs)
order.uni=order(p.uni)
plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
abline(a = 0, b = 1, col = "red")
screen(2)
color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
m=nrow(myGM)
plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
close.screen(all.screens = TRUE)
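The same marker loop is repeated on the following slides with different covariates, so it may help to see it wrapped as a function. This is a sketch under stated assumptions: gwas_lm() is a hypothetical helper, not part of the course code, and it uses n - ncol(X) as the residual degrees of freedom for both the variance and the t test, which differs slightly from the loop above. The demo runs on simulated genotypes and cross-checks one marker against lm().

```r
# Hypothetical helper: single-marker scan with optional covariates.
gwas_lm <- function(y, G, covariates = NULL) {
  G <- as.matrix(G)
  n <- length(y)
  m <- ncol(G)
  P <- rep(NA_real_, m)
  for (i in seq_len(m)) {
    x <- G[, i]
    if (max(x) == min(x)) {
      P[i] <- 1                      # monomorphic marker: nothing to test
    } else {
      X <- cbind(1, covariates, x)   # the marker is always the last column
      C <- solve(crossprod(X))
      b <- C %*% crossprod(X, y)
      e <- y - X %*% b
      df <- n - ncol(X)              # residual degrees of freedom (assumption)
      ve <- sum(e^2) / df
      t.stat <- b / sqrt(diag(C) * ve)
      p <- 2 * pt(abs(t.stat), df, lower.tail = FALSE)
      P[i] <- p[length(p)]           # p value of the marker effect
    }
  }
  P
}

# Demo on simulated data
set.seed(2)
G.demo <- matrix(sample(0:2, 60 * 20, replace = TRUE), nrow = 60)
G.demo[, 5] <- 1                     # force one monomorphic marker
y.demo <- rnorm(60) + 0.8 * G.demo[, 1]
P.demo <- gwas_lm(y.demo, G.demo)
# Cross-check the first marker against lm()
p.lm <- summary(lm(y.demo ~ G.demo[, 1]))$coefficients[2, 4]
```

With this helper, the PC-adjusted scans would correspond to passing the chosen columns of PCA$x as covariates.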

Add 2nd PC as covariate

# PCA on all markers. The earlier marker loop overwrites X with a design matrix,
# so the genotypes are taken from myGD directly.
PCA=prcomp(myGD[,-1])
y=mySim$y
G=myGD[,-1]
n=nrow(G)
m=ncol(G)
P=matrix(NA,1,m)
for (i in 1:m){
  x=G[,i]
  if(max(x)==min(x)){
    p=1
  }else{
    # Intercept, 2nd PC as covariate, then the marker
    X=cbind(1, PCA$x[,2],x)
    LHS=t(X)%*%X
    C=solve(LHS)
    RHS=t(X)%*%y
    b=C%*%RHS
    yb=X%*%b
    e=y-yb
    n=length(y)
    ve=sum(e^2)/(n-1)
    vt=C*ve
    t=b/sqrt(diag(vt))
    p=2*(1-pt(abs(t),n-2))
  } #end of testing variation
  P[i]=p[length(p)]
} #end of looping for markers

# QQ plot (screen 1) and Manhattan plot (screen 2)
split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
screen(1)
par(mar = c(0, 0, 0, 0))
p.obs=P
m2=length(p.obs)
p.uni=runif(m2,0,1)
order.obs=order(p.obs)
order.uni=order(p.uni)
plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
abline(a = 0, b = 1, col = "red")
screen(2)
color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
m=nrow(myGM)
plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
close.screen(all.screens = TRUE)

Inflation reduced

Inflation controlled better

Using three PCs

y=mySim$y
G=myGD[,-1]
n=nrow(G)
m=ncol(G)
P=matrix(NA,1,m)
for (i in 1:m){
  x=G[,i]
  if(max(x)==min(x)){
    p=1
  }else{
    # Intercept, first three PCs as covariates, then the marker
    X=cbind(1, PCA$x[,1:3],x)
    LHS=t(X)%*%X
    C=solve(LHS)
    RHS=t(X)%*%y
    b=C%*%RHS
    yb=X%*%b
    e=y-yb
    n=length(y)
    ve=sum(e^2)/(n-1)
    vt=C*ve
    t=b/sqrt(diag(vt))
    p=2*(1-pt(abs(t),n-2))
  } #end of testing variation
  P[i]=p[length(p)]
} #end of looping for markers

# QQ plot (screen 1) and Manhattan plot (screen 2)
split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
screen(1)
par(mar = c(0, 0, 0, 0))
p.obs=P
m2=length(p.obs)
p.uni=runif(m2,0,1)
order.obs=order(p.obs)
order.uni=order(p.uni)
plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
abline(a = 0, b = 1, col = "red")
screen(2)
color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
m=nrow(myGM)
plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
close.screen(all.screens = TRUE)
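The slides judge inflation by eye from the QQ plot. A single number can complement that picture. The sketch below computes the genomic inflation factor (lambda), a common summary that is not part of the lecture code; inflation_factor() is a hypothetical helper name. Values near 1 suggest p values close to the uniform expectation, and values well above 1 suggest inflation.

```r
# Hypothetical helper: genomic inflation factor from a vector of p values.
inflation_factor <- function(p) {
  # Convert p values to 1-df chi-square statistics
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  # Ratio of the observed median to the expected median under the null
  median(chisq) / qchisq(0.5, df = 1)
}

set.seed(3)
lambda.null <- inflation_factor(runif(10000))        # uniform p values
lambda.inflated <- inflation_factor(runif(10000)^2)  # p values pushed toward 0
```

Applied to the scans above, inflation_factor(as.vector(P)) would give one value per model, which makes the comparison between no PCs, one PC and three PCs easier to report.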

Using breeding value as observation

# The simulated additive genetic value replaces the phenotype
y=mySim$add
G=myGD[,-1]
n=nrow(G)
m=ncol(G)
P=matrix(NA,1,m)
for (i in 1:m){
  x=G[,i]
  if(max(x)==min(x)){
    p=1
  }else{
    X=cbind(1, x)
    LHS=t(X)%*%X
    C=solve(LHS)
    RHS=t(X)%*%y
    b=C%*%RHS
    yb=X%*%b
    e=y-yb
    n=length(y)
    ve=sum(e^2)/(n-1)
    vt=C*ve
    t=b/sqrt(diag(vt))
    p=2*(1-pt(abs(t),n-2))
  } #end of testing variation
  P[i]=p[length(p)]
} #end of looping for markers

# QQ plot (screen 1) and Manhattan plot (screen 2)
split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
screen(1)
par(mar = c(0, 0, 0, 0))
p.obs=P
m2=length(p.obs)
p.uni=runif(m2,0,1)
order.obs=order(p.obs)
order.uni=order(p.uni)
plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
abline(a = 0, b = 1, col = "red")
screen(2)
color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
m=nrow(myGM)
plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
close.screen(all.screens = TRUE)

Still inflated by structure

PCs remove inflation (many apps before MLM GWAS)

Using three PCs

y=mySim$add
G=myGD[,-1]
n=nrow(G)
m=ncol(G)
P=matrix(NA,1,m)
for (i in 1:m){
  x=G[,i]
  if(max(x)==min(x)){
    p=1
  }else{
    # Intercept, first three PCs as covariates, then the marker
    X=cbind(1, PCA$x[,1:3],x)
    LHS=t(X)%*%X
    C=solve(LHS)
    RHS=t(X)%*%y
    b=C%*%RHS
    yb=X%*%b
    e=y-yb
    n=length(y)
    ve=sum(e^2)/(n-1)
    vt=C*ve
    t=b/sqrt(diag(vt))
    p=2*(1-pt(abs(t),n-2))
  } #end of testing variation
  P[i]=p[length(p)]
} #end of looping for markers

# QQ plot (screen 1) and Manhattan plot (screen 2)
split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
screen(1)
par(mar = c(0, 0, 0, 0))
p.obs=P
m2=length(p.obs)
p.uni=runif(m2,0,1)
order.obs=order(p.obs)
order.uni=order(p.uni)
plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
abline(a = 0, b = 1, col = "red")
screen(2)
color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
m=nrow(myGM)
plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
close.screen(all.screens = TRUE)

Using residual as observation

# The simulated residual replaces the phenotype
y=mySim$residual
G=myGD[,-1]
n=nrow(G)
m=ncol(G)
P=matrix(NA,1,m)
for (i in 1:m){
  x=G[,i]
  if(max(x)==min(x)){
    p=1
  }else{
    X=cbind(1,x)
    LHS=t(X)%*%X
    C=solve(LHS)
    RHS=t(X)%*%y
    b=C%*%RHS
    yb=X%*%b
    e=y-yb
    n=length(y)
    ve=sum(e^2)/(n-1)
    vt=C*ve
    t=b/sqrt(diag(vt))
    p=2*(1-pt(abs(t),n-2))
  } #end of testing variation
  P[i]=p[length(p)]
} #end of looping for markers

# QQ plot (screen 1) and Manhattan plot (screen 2)
split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
screen(1)
par(mar = c(0, 0, 0, 0))
p.obs=P
m2=length(p.obs)
p.uni=runif(m2,0,1)
order.obs=order(p.obs)
order.uni=order(p.uni)
plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
abline(a = 0, b = 1, col = "red")
screen(2)
color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
m=nrow(myGM)
plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
close.screen(all.screens = TRUE)

This is not silly! It works for traits with low heritability

Using genetic effect as covariates

y=mySim$y
G=myGD[,-1]
n=nrow(G)
m=ncol(G)
P=matrix(NA,1,m)
for (i in 1:m){
  x=G[,i]
  if(max(x)==min(x)){
    p=1
  }else{
    # Intercept, simulated genetic effect as covariate, then the marker
    X=cbind(1, mySim$add,x)
    LHS=t(X)%*%X
    C=solve(LHS)
    RHS=t(X)%*%y
    b=C%*%RHS
    yb=X%*%b
    e=y-yb
    n=length(y)
    ve=sum(e^2)/(n-1)
    vt=C*ve
    t=b/sqrt(diag(vt))
    p=2*(1-pt(abs(t),n-2))
  } #end of testing variation
  P[i]=p[length(p)]
} #end of looping for markers

# QQ plot (screen 1) and Manhattan plot (screen 2)
split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
screen(1)
par(mar = c(0, 0, 0, 0))
p.obs=P
m2=length(p.obs)
p.uni=runif(m2,0,1)
order.obs=order(p.obs)
order.uni=order(p.uni)
plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
abline(a = 0, b = 1, col = "red")
screen(2)
color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
m=nrow(myGM)
plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
close.screen(all.screens = TRUE)

Everything absorbed

Critical thinking on MLM: (1) computationally intensive, cubic in sample size (n^3); (2) convergence problems (h2 = 0 or 1); (3) Q (PCs) and K are derived from the same set of markers, so they are double counted; (4) the testing marker is confounded with Q (PCs) and K; (5) disappointment on the opposite side of inflated p values

Queen + King

Compressed MLM: y = SNP + Q (or PCs) + Kinship + e, written as y = x1b1 + x2b2 + x3b3 + x4b4 + Zu + e, with the random effect u defined on groups. Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355–360 (2010).

Group by kinship
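As a rough illustration of grouping by kinship, the sketch below clusters individuals on a marker-based similarity matrix and builds the incidence matrix Z that maps individuals to groups. Everything in it is an assumption for illustration: the genotypes are simulated, the kinship is a simple centered cross-product rather than the kinship GAPIT computes, and average-linkage hclust() with s = 6 groups is just one possible choice.

```r
set.seed(4)
n <- 30    # individuals
m <- 200   # markers
G <- matrix(sample(0:2, n * m, replace = TRUE), nrow = n)

# Simple marker-based similarity (assumption, not GAPIT's kinship algorithm)
W <- scale(G, center = TRUE, scale = FALSE)
K <- tcrossprod(W) / m

# Turn similarity into distance: larger kinship means smaller distance
D <- as.dist(max(K) - K)
tree <- hclust(D, method = "average")

# Cut the tree into s groups; n / s is the average number of individuals per group
s <- 6
group <- cutree(tree, k = s)

# Incidence matrix Z (n x s): one 1 per row, in the column of the individual's group
Z <- model.matrix(~ factor(group) - 1)
```

Setting s = n gives one individual per group (the full MLM), and s = 1 puts everyone in a single group, which removes the random effect's ability to separate individuals.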

Compression improves power (figure; x-axis: average number of individuals per group)

Fit matches power (figure; x-axis: average number of individuals per group)

Compression is robust across species (figure). Panels: Human (n=1315), Dog (n=292), Maize (n=277). Rows: fit of model and statistical power; x-axis: compression level. Curve labels range from 0.04sd (0.03%) to 0.5sd (4.99%).

Compressed MLM is more general (diagram). GLM (1 group): SA, GC, PCA and QTDT. Compressed MLM (s groups, n ≥ s ≥ 1): sire model. Full MLM (n groups): Henderson's MLM (pedigree-based kinship) and unified MLM (marker-based kinship).

Enriched Compressed MLM. Kinship: among individuals -> among groups, summarized as average, maximum, minimum, median, … (figure: example kinship matrices)
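The step from kinship among individuals to kinship among groups can be sketched as follows. group_kinship() is a hypothetical helper, not a GAPIT function; it assumes the group-level value is a summary (mean, max, min, median, ...) over all individual-level entries in the corresponding block, including the diagonal for within-group blocks. The 4 x 4 kinship matrix is made up for the demo.

```r
# Hypothetical helper: summarize individual kinship K into group kinship.
group_kinship <- function(K, group, fun = mean) {
  ids <- sort(unique(group))
  s <- length(ids)
  KG <- matrix(NA_real_, s, s,
               dimnames = list(as.character(ids), as.character(ids)))
  for (a in seq_len(s)) {
    for (b in seq_len(s)) {
      # All individual-level kinships between group a and group b
      block <- K[group == ids[a], group == ids[b], drop = FALSE]
      KG[a, b] <- fun(block)
    }
  }
  KG
}

# Made-up kinship among four individuals, two per group
K.demo <- matrix(c(1.00, 0.50, 0.25, 0.00,
                   0.50, 1.00, 0.25, 0.00,
                   0.25, 0.25, 1.00, 0.50,
                   0.00, 0.00, 0.50, 1.00), nrow = 4, byrow = TRUE)
group.demo <- c(1, 1, 2, 2)

KG.mean <- group_kinship(K.demo, group.demo, mean)
KG.max  <- group_kinship(K.demo, group.demo, max)
```

Passing the summary as a function argument keeps the choice of group kinship open, so the alternatives named on the slide can be compared by swapping in min or median.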

Better optimization with group kinship (figure; panels: A, human; B, dog; C, maize; D, Arabidopsis)

Dimensions of parameter space: 1. Structure (BLUE); 2. Kinship (BLUP); 3. Variance components; 4. Group numbers; 5. Group method; 6. Group kinship. More dimensions, better optimization.

Statistical power improvement (credit: Meng Li; source: BMC Biology, 2014)

Method shift                    Human   Dog     Maize   Arabidopsis
GLM to MLM                      3.6%    13.8%   10.1%   29.6%
MLM to compression              4.0%    14.2%   7.6%    2.5%
Compression to group kinship    6.4%    13.3%   2.9%    2.6%

GWAS by CMLM

library('MASS') # required for ginv
library(multtest)
library(gplots)
library(compiler) #required for cmpfun
library("scatterplot3d")
source("http://www.zzlab.net/GAPIT/emma.txt")
source("http://www.zzlab.net/GAPIT/gapit_functions.txt")
setwd("~/Desktop/temp")
# Phenotype table: taxa names plus the simulated phenotype
myY=cbind(as.data.frame(myGD[,1]), mySim$y)
myGAPIT=GAPIT(
  Y=myY,
  GD=myGD,
  GM=myGM,
  QTN.position=mySim$QTN.position,
  PCA.total=3,
  group.from=1,
  group.to=1000000,
  group.by=10,
  memo="CMLM")

Highlight: criticism on MLM; CMLM; ECMLM