Washington State University

Statistical Genomics
Lecture 16: CMLM
Zhiwu Zhang
Washington State University

Objective
Criticism on MLM
CMLM (compressed MLM)
ECMLM (enriched compressed MLM)

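Hidden, observed, induction, and modeling
[Diagram: Genes, SNPs, PCs, K, BV, BLUP, y and Residual, arranged in layers labeled Hidden, Observed, Induction and Modeling]
Models shown:
y = SNP + e
y = SNP + PC + e
y = SNP + PC + K + e
y = SNP + PC + BLUP + e
BLUP = SNP + e
BLUP = SNP + PC + e
Residual = SNP + e
Residual = SNP + PC + e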

MLM for GWAS
Y = SNP + Q (or PCs) + Kinship + e
Y: phenotype
SNP: fixed effect
Q (or PCs): population structure, fixed effect
Kinship: unequal relatedness, random effect
General Linear Model (GLM): fixed effects only
Mixed Linear Model (MLM): fixed effects plus the random kinship effect (Yu et al. 2005, Nature Genetics)
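The difference between the GLM and the MLM can be sketched in a few lines of R. This is a minimal illustration on toy data, not the method used by any particular package: it assumes the heritability `h2` is known, whereas real MLM software estimates the variance components (for example by REML). The function name `gls_test` and all toy objects are hypothetical.

```r
# Minimal sketch (assumption: heritability h2 is known, not estimated)
set.seed(1)
n <- 100; m <- 200
G <- matrix(rbinom(n * m, 2, 0.3), n, m)     # toy genotypes coded 0/1/2
K <- tcrossprod(scale(G)) / m                # simple marker-based kinship
Q <- prcomp(G)$x[, 1:2]                      # first two PCs as structure
y <- rnorm(n)                                # toy phenotype

gls_test <- function(y, x, Q, K, h2) {
  n <- length(y)
  X <- cbind(1, Q, x)                        # fixed effects: mean, Q, SNP
  V <- h2 * K + (1 - h2) * diag(n)           # covariance of y up to a scale factor
  Vi <- solve(V)
  C <- solve(t(X) %*% Vi %*% X)
  b <- C %*% t(X) %*% Vi %*% y               # generalized least-squares estimates
  e <- y - X %*% b
  df <- n - ncol(X)
  s2 <- as.numeric(t(e) %*% Vi %*% e) / df   # estimate of the scale factor
  t <- b / sqrt(diag(C) * s2)
  p <- 2 * pt(-abs(t), df)
  p[length(p)]                               # p value of the SNP (last term)
}

p_mlm <- gls_test(y, G[, 1], Q, K, h2 = 0.5) # kinship as random effect
p_glm <- gls_test(y, G[, 1], Q, K, h2 = 0)   # h2 = 0 reduces to the GLM
```

With `h2 = 0` the covariance is the identity matrix, so the test is ordinary least squares with SNP and Q as fixed effects.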

GWAS does not work for traits associated with structure
[Figure from Atwell et al., Nature 2010 (Magnus Nordborg): a, no correction test; b, correction with MLM]

Phenotype simulation
    # Numeric genotypes (taxa in column 1) and SNP map from the GAPIT demo data
    myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
    myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T)
    setwd("~/Dropbox/Current/ZZLab/WSUCourse/CROPS545/Demo")
    source("G2P.R")
    source("GWASbyCor.R")
    X=myGD[,-1]                # drop the taxa column
    index1to5=myGM[,2]<6       # markers on chromosomes 1 to 5
    X1to5 = X[,index1to5]
    set.seed(99164)
    # Simulate a phenotype with 10 QTNs and heritability 0.75
    mySim=G2P(X= X1to5,h2=.75,alpha=1,NQTN=10,distribution="norm")
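For readers without access to `G2P.R`, the idea of a genotype-to-phenotype simulation can be sketched as follows. This is a hypothetical stand-in, not the course's `G2P` function: the name `simulate_phenotype`, the normal QTN effects and the rescaling step are assumptions. Only the names of the returned elements (`y`, `add`, `residual`, `QTN.position`) follow the way `mySim` is used in these slides.

```r
# Hypothetical stand-in for a genotype-to-phenotype simulator (not G2P.R)
simulate_phenotype <- function(X, h2, NQTN) {
  X <- as.matrix(X)
  QTN.position <- sample(ncol(X), NQTN)        # pick causal markers at random
  effect <- rnorm(NQTN)                        # normally distributed QTN effects
  add <- as.numeric(X[, QTN.position, drop = FALSE] %*% effect)  # additive genetic value
  residual <- rnorm(nrow(X))
  # Rescale residuals so that var(add) / (var(add) + var(residual)) equals h2
  residual <- residual * sqrt(var(add) * (1 - h2) / h2 / var(residual))
  list(y = add + residual, add = add, residual = residual,
       QTN.position = QTN.position)
}

set.seed(99164)
X_toy <- matrix(rbinom(200 * 50, 2, 0.4), 200, 50)   # toy genotypes, 200 x 50
sim <- simulate_phenotype(X_toy, h2 = 0.75, NQTN = 10)
h2_realized <- var(sim$add) / (var(sim$add) + var(sim$residual))
```

Here `h2` must lie strictly between 0 and 1, because the residual variance is derived from the genetic variance.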

Inflation by structure
    y=mySim$y
    G=myGD[,-1]
    n=nrow(G)
    m=ncol(G)
    P=matrix(NA,1,m)
    for (i in 1:m){
      x=G[,i]
      if(max(x)==min(x)){
        p=1                    # monomorphic marker: nothing to test
      }else{
        X=cbind(1,x)           # intercept + marker
        LHS=t(X)%*%X
        C=solve(LHS)
        RHS=t(X)%*%y
        b=C%*%RHS              # least-squares estimates
        yb=X%*%b
        e=y-yb
        n=length(y)
        df=n-ncol(X)           # residual degrees of freedom
        ve=sum(e^2)/df
        vt=C*ve
        t=b/sqrt(diag(vt))
        p=2*(1-pt(abs(t),df))
      } #end of testing variation
      P[i]=p[length(p)]        # p value of the marker (last term)
    } #end of looping for markers
    # Single marker test: QQ plot (screen 1) and Manhattan plot (screen 2)
    split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
    screen(1)
    par(mar = c(0, 0, 0, 0))
    p.obs=P
    m2=length(p.obs)
    p.uni=runif(m2,0,1)
    order.obs=order(p.obs)
    order.uni=order(p.uni)
    plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
    abline(a = 0, b = 1, col = "red")
    screen(2)
    color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
    m=nrow(myGM)
    plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
    abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
    close.screen(all.screens = TRUE)
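The QQ plot shows inflation visually. A single number can complement it: the genomic inflation factor, a standard summary that these slides do not use. The sketch below is an illustration on toy p values; the function name `inflation_factor` is hypothetical.

```r
# Sketch: lambda = median observed chi-square / expected median under the null
inflation_factor <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)   # p values to 1-df chi-square
  median(chisq, na.rm = TRUE) / qchisq(0.5, df = 1)
}

set.seed(1)
p_null <- runif(10000)        # well-calibrated p values
p_inflated <- p_null^2        # toy p values skewed toward zero
lambda_null <- inflation_factor(p_null)
lambda_inflated <- inflation_factor(p_inflated)
```

A value near 1 indicates well-calibrated p values and larger values indicate inflation. On the p values computed in these slides it could be applied as `inflation_factor(as.numeric(P))`.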

Add 2nd PC as covariate: inflation reduced
    # X is overwritten inside the marker loop, so run the PCA on the genotype matrix directly
    PCA=prcomp(myGD[,-1])
    y=mySim$y
    G=myGD[,-1]
    n=nrow(G)
    m=ncol(G)
    P=matrix(NA,1,m)
    for (i in 1:m){
      x=G[,i]
      if(max(x)==min(x)){
        p=1                    # monomorphic marker: nothing to test
      }else{
        X=cbind(1, PCA$x[,2],x)  # intercept + 2nd PC + marker
        LHS=t(X)%*%X
        C=solve(LHS)
        RHS=t(X)%*%y
        b=C%*%RHS
        yb=X%*%b
        e=y-yb
        n=length(y)
        df=n-ncol(X)           # residual degrees of freedom
        ve=sum(e^2)/df
        vt=C*ve
        t=b/sqrt(diag(vt))
        p=2*(1-pt(abs(t),df))
      } #end of testing variation
      P[i]=p[length(p)]        # p value of the marker (last term)
    } #end of looping for markers
    # QQ plot (screen 1) and Manhattan plot (screen 2)
    split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
    screen(1)
    par(mar = c(0, 0, 0, 0))
    p.obs=P
    m2=length(p.obs)
    p.uni=runif(m2,0,1)
    order.obs=order(p.obs)
    order.uni=order(p.uni)
    plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
    abline(a = 0, b = 1, col = "red")
    screen(2)
    color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
    m=nrow(myGM)
    plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
    abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
    close.screen(all.screens = TRUE)

Using three PCs: inflation controlled better
    y=mySim$y
    G=myGD[,-1]
    n=nrow(G)
    m=ncol(G)
    P=matrix(NA,1,m)
    for (i in 1:m){
      x=G[,i]
      if(max(x)==min(x)){
        p=1                    # monomorphic marker: nothing to test
      }else{
        X=cbind(1, PCA$x[,1:3],x)  # intercept + first three PCs + marker
        LHS=t(X)%*%X
        C=solve(LHS)
        RHS=t(X)%*%y
        b=C%*%RHS
        yb=X%*%b
        e=y-yb
        n=length(y)
        df=n-ncol(X)           # residual degrees of freedom
        ve=sum(e^2)/df
        vt=C*ve
        t=b/sqrt(diag(vt))
        p=2*(1-pt(abs(t),df))
      } #end of testing variation
      P[i]=p[length(p)]        # p value of the marker (last term)
    } #end of looping for markers
    # QQ plot (screen 1) and Manhattan plot (screen 2)
    split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
    screen(1)
    par(mar = c(0, 0, 0, 0))
    p.obs=P
    m2=length(p.obs)
    p.uni=runif(m2,0,1)
    order.obs=order(p.obs)
    order.uni=order(p.uni)
    plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
    abline(a = 0, b = 1, col = "red")
    screen(2)
    color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
    m=nrow(myGM)
    plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
    abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
    close.screen(all.screens = TRUE)
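The marker loops on these slides differ only in the covariates placed between the intercept and the marker. A possible way to factor that out is sketched below on toy data. The helper `single_marker_p` is hypothetical and not part of the course code. Its residual degrees of freedom are n minus the number of columns of the design matrix, which matches what `lm()` reports.

```r
# Hypothetical helper: single-marker test with optional covariates
single_marker_p <- function(y, G, covariates = NULL) {
  y <- as.numeric(y)
  G <- as.matrix(G)
  n <- length(y)
  sapply(seq_len(ncol(G)), function(i) {
    x <- G[, i]
    if (max(x) == min(x)) return(1)          # monomorphic marker
    X <- cbind(1, covariates, x)             # intercept, covariates, marker last
    C <- solve(crossprod(X))
    b <- C %*% crossprod(X, y)               # least-squares estimates
    e <- y - X %*% b
    df <- n - ncol(X)                        # residual degrees of freedom
    t <- b / sqrt(diag(C) * sum(e^2) / df)
    2 * pt(-abs(t[length(t)]), df)           # p value of the marker
  })
}

set.seed(2)
G_toy <- matrix(rbinom(150 * 20, 2, 0.3), 150, 20)   # toy genotypes
PC_toy <- prcomp(G_toy)$x[, 1:3]
y_toy <- rnorm(150)
p0 <- single_marker_p(y_toy, G_toy)                       # no correction
p3 <- single_marker_p(y_toy, G_toy, covariates = PC_toy)  # three PCs as covariates
```

With the objects from these slides, a call such as `single_marker_p(mySim$y, myGD[,-1], PCA$x[,1:3])` would correspond to the three-PC analysis.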

Using breeding value as observation: still inflated by structure
    y=mySim$add                # true additive genetic value as the observation
    G=myGD[,-1]
    n=nrow(G)
    m=ncol(G)
    P=matrix(NA,1,m)
    for (i in 1:m){
      x=G[,i]
      if(max(x)==min(x)){
        p=1                    # monomorphic marker: nothing to test
      }else{
        X=cbind(1, x)          # intercept + marker
        LHS=t(X)%*%X
        C=solve(LHS)
        RHS=t(X)%*%y
        b=C%*%RHS
        yb=X%*%b
        e=y-yb
        n=length(y)
        df=n-ncol(X)           # residual degrees of freedom
        ve=sum(e^2)/df
        vt=C*ve
        t=b/sqrt(diag(vt))
        p=2*(1-pt(abs(t),df))
      } #end of testing variation
      P[i]=p[length(p)]        # p value of the marker (last term)
    } #end of looping for markers
    # QQ plot (screen 1) and Manhattan plot (screen 2)
    split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
    screen(1)
    par(mar = c(0, 0, 0, 0))
    p.obs=P
    m2=length(p.obs)
    p.uni=runif(m2,0,1)
    order.obs=order(p.obs)
    order.uni=order(p.uni)
    plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
    abline(a = 0, b = 1, col = "red")
    screen(2)
    color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
    m=nrow(myGM)
    plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
    abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
    close.screen(all.screens = TRUE)

PCs remove inflation (many apps before MLM GWAS)
    y=mySim$add                # true additive genetic value as the observation
    G=myGD[,-1]
    n=nrow(G)
    m=ncol(G)
    P=matrix(NA,1,m)
    for (i in 1:m){
      x=G[,i]
      if(max(x)==min(x)){
        p=1                    # monomorphic marker: nothing to test
      }else{
        X=cbind(1, PCA$x[,1:3],x)  # intercept + first three PCs + marker
        LHS=t(X)%*%X
        C=solve(LHS)
        RHS=t(X)%*%y
        b=C%*%RHS
        yb=X%*%b
        e=y-yb
        n=length(y)
        df=n-ncol(X)           # residual degrees of freedom
        ve=sum(e^2)/df
        vt=C*ve
        t=b/sqrt(diag(vt))
        p=2*(1-pt(abs(t),df))
      } #end of testing variation
      P[i]=p[length(p)]        # p value of the marker (last term)
    } #end of looping for markers
    # Using three PCs: QQ plot (screen 1) and Manhattan plot (screen 2)
    split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
    screen(1)
    par(mar = c(0, 0, 0, 0))
    p.obs=P
    m2=length(p.obs)
    p.uni=runif(m2,0,1)
    order.obs=order(p.obs)
    order.uni=order(p.uni)
    plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
    abline(a = 0, b = 1, col = "red")
    screen(2)
    color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
    m=nrow(myGM)
    plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
    abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
    close.screen(all.screens = TRUE)

Using residual as observation
This is not silly! It works for low heritable traits
    y=mySim$residual           # simulated residual as the observation
    G=myGD[,-1]
    n=nrow(G)
    m=ncol(G)
    P=matrix(NA,1,m)
    for (i in 1:m){
      x=G[,i]
      if(max(x)==min(x)){
        p=1                    # monomorphic marker: nothing to test
      }else{
        X=cbind(1,x)           # intercept + marker
        LHS=t(X)%*%X
        C=solve(LHS)
        RHS=t(X)%*%y
        b=C%*%RHS
        yb=X%*%b
        e=y-yb
        n=length(y)
        df=n-ncol(X)           # residual degrees of freedom
        ve=sum(e^2)/df
        vt=C*ve
        t=b/sqrt(diag(vt))
        p=2*(1-pt(abs(t),df))
      } #end of testing variation
      P[i]=p[length(p)]        # p value of the marker (last term)
    } #end of looping for markers
    # QQ plot (screen 1) and Manhattan plot (screen 2)
    split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
    screen(1)
    par(mar = c(0, 0, 0, 0))
    p.obs=P
    m2=length(p.obs)
    p.uni=runif(m2,0,1)
    order.obs=order(p.obs)
    order.uni=order(p.uni)
    plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
    abline(a = 0, b = 1, col = "red")
    screen(2)
    color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
    m=nrow(myGM)
    plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
    abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
    close.screen(all.screens = TRUE)

Using genetic effect as covariates: everything absorbed
    y=mySim$y
    G=myGD[,-1]
    n=nrow(G)
    m=ncol(G)
    P=matrix(NA,1,m)
    for (i in 1:m){
      x=G[,i]
      if(max(x)==min(x)){
        p=1                    # monomorphic marker: nothing to test
      }else{
        X=cbind(1, mySim$add,x)  # intercept + true genetic effect + marker
        LHS=t(X)%*%X
        C=solve(LHS)
        RHS=t(X)%*%y
        b=C%*%RHS
        yb=X%*%b
        e=y-yb
        n=length(y)
        df=n-ncol(X)           # residual degrees of freedom
        ve=sum(e^2)/df
        vt=C*ve
        t=b/sqrt(diag(vt))
        p=2*(1-pt(abs(t),df))
      } #end of testing variation
      P[i]=p[length(p)]        # p value of the marker (last term)
    } #end of looping for markers
    # QQ plot (screen 1) and Manhattan plot (screen 2)
    split.screen(rbind(c(0.8,0.98,0.1,0.98),c(0.05,0.73,0.1,0.98)))
    screen(1)
    par(mar = c(0, 0, 0, 0))
    p.obs=P
    m2=length(p.obs)
    p.uni=runif(m2,0,1)
    order.obs=order(p.obs)
    order.uni=order(p.uni)
    plot(-log10(p.uni[order.uni]), -log10(p.obs[order.obs]))
    abline(a = 0, b = 1, col = "red")
    screen(2)
    color.vector <- rep(c("deepskyblue","orange","forestgreen","indianred3"),10)
    m=nrow(myGM)
    plot(t(-log10(P))~seq(1:m),col=color.vector[myGM[,2]])
    abline(v=mySim$QTN.position, lty = 2, lwd=2, col = "black")
    close.screen(all.screens = TRUE)

Critical thinking on MLM
Computationally intensive: cubic in sample size (n^3)
Convergence problems (h2 = 0 or 1)
Q (PCs) and K come from the same set of markers, so the same information is counted twice
The tested marker is confounded with Q (PCs) and K
Disappointing results on the opposite side of inflated p values

Queen + King

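Compressed MLM
y = SNP + Q (or PCs) + Kinship + e
y = x1b1 + x2b2 + x3b3 + x4b4 + Zu + e
Group: in the compressed MLM, the random effect u is defined for groups of individuals rather than for single individuals.
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355–360 (2010).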
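In matrix form the model above can be written as follows. This is a sketch in common mixed-model notation. The symbols β, K, σ_g² and σ_e² are assumptions that do not appear on the slide, and Xβ collects the fixed-effect terms x1b1 + x2b2 + x3b3 + x4b4.

```latex
\begin{aligned}
y &= X\beta + Zu + e,\\
u &\sim N(0,\ \sigma_g^2 K), \qquad e \sim N(0,\ \sigma_e^2 I),\\
\operatorname{Var}(y) &= \sigma_g^2\, Z K Z^\top + \sigma_e^2 I .
\end{aligned}
```

In the full MLM with one record per individual, Z is the n × n identity and K is the n × n kinship among individuals. Under compression into s groups, Z becomes an n × s incidence matrix assigning individuals to groups and K becomes the s × s kinship among groups.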

Group by kinship
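Grouping by kinship can be sketched with base R clustering. This illustration runs on toy genotypes and makes several assumptions: average-linkage hierarchical clustering, `max(K) - K` as the distance, and a simple marker-based kinship. It does not claim to reproduce the clustering used in the published method.

```r
# Sketch: compress n individuals into s groups by clustering on kinship
set.seed(3)
n <- 60; m <- 300; s <- 10
G <- matrix(rbinom(n * m, 2, 0.3), n, m)          # toy genotypes coded 0/1/2
K <- tcrossprod(scale(G)) / m                     # toy marker-based kinship
D <- as.dist(max(K) - K)                          # high kinship = small distance
tree <- hclust(D, method = "average")
group <- cutree(tree, k = s)                      # group label for each individual
Z <- model.matrix(~ factor(group) - 1)            # n x s incidence matrix
compression_level <- n / s                        # average individuals per group
```

Each row of `Z` has a single 1 marking the group of that individual, which is the `Z` in the term `Zu` once `u` holds group effects.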

Compression improves power
[Figure: x-axis: average number of individuals per group]

Fit matches power
[Figure: two panels, x-axis: average number of individuals per group]

Compression is robust across species
[Figure: fit of model and statistical power plotted against compression level for Human (n=1315), Dog (n=292) and Maize (n=277). Curve labels: 0.04sd (0.03%), 0.08sd (0.13%), 0.1sd (0.21%), 0.12sd (0.30%), 0.16sd (0.53%), 0.2sd (0.83%), 0.3sd (1.85%), 0.4sd (3.25%), 0.5sd (4.99%)]

Compressed MLM is more general
GLM (1 group): SA, GC, PCA and QTDT
Compressed MLM (s groups), n ≥ s ≥ 1: sire model (pedigree-based kinship), compressed MLM (marker-based kinship)
Full MLM (n groups): Henderson's MLM (pedigree-based kinship), unified MLM (marker-based kinship)

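Enriched Compressed MLM
Kinship: among individuals -> among groups
Group kinship is derived from the kinship among individuals as the average, maximum, minimum, median, …
[Figure: example kinship matrices among individuals and among groups]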
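Deriving kinship among groups from kinship among individuals can be sketched as below. The helper `group_kinship` is hypothetical, the four-individual kinship matrix is a toy example rather than the matrix on the slide, and including self-kinship in the within-group summary is an assumption.

```r
# Hypothetical helper: summarize individual kinship K into group kinship
group_kinship <- function(K, group, FUN = mean) {
  g <- sort(unique(group))
  KG <- matrix(NA_real_, length(g), length(g),
               dimnames = list(as.character(g), as.character(g)))
  for (a in seq_along(g)) for (b in seq_along(g)) {
    # Summarize all pairs of individuals between group a and group b
    KG[a, b] <- FUN(K[group == g[a], group == g[b]])
  }
  KG
}

# Toy kinship among four individuals placed in two groups
K <- matrix(c(1.000, 0.50, 0.25, 0.125,
              0.500, 1.00, 0.25, 0.250,
              0.250, 0.25, 1.00, 0.750,
              0.125, 0.25, 0.75, 1.000), 4, 4)
group <- c(1, 1, 2, 2)
KG_mean <- group_kinship(K, group, mean)
KG_max  <- group_kinship(K, group, max)
KG_min  <- group_kinship(K, group, min)
KG_med  <- group_kinship(K, group, median)
```

Passing the summary as a function makes the choice among average, maximum, minimum and median one more setting that can be tuned.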

Better optimization with group kinship
[Figure panels: A, Human; B, Dog; C, Maize; D, Arabidopsis]

Dimensions of parameter space
1. Structure (BLUE)
2. Kinship (BLUP)
3. Variance components
4. Group numbers
5. Group method
6. Group kinship
More dimensions, better optimization

Statistical power improvement
Meng Li
Method shift                   Human   Dog     Maize   Arabidopsis
GLM to MLM                     3.6%    13.8%   10.1%   29.6%
MLM to compression             4.0%    14.2%   7.6%    2.5%
Compression to group kinship   6.4%    13.3%   2.9%    2.6%
BMC Biology, 2014

GWAS by CMLM
    library('MASS') # required for ginv
    library(multtest)
    library(gplots)
    library(compiler) #required for cmpfun
    library("scatterplot3d")
    source("http://www.zzlab.net/GAPIT/emma.txt")
    source("http://www.zzlab.net/GAPIT/gapit_functions.txt")
    setwd("~/Desktop/temp")
    # Phenotype table: taxa names plus the simulated phenotype
    myY=cbind(as.data.frame(myGD[,1]), mySim$y)
    myGAPIT=GAPIT(
      Y=myY,
      GD=myGD,
      GM=myGM,
      QTN.position=mySim$QTN.position,
      PCA.total=3,
      group.from=1,
      group.to=1000000,
      group.by=10,
      memo="CMLM")

Highlight
Criticism on MLM
CMLM
ECMLM