Statistical Genomics
Lecture 23: Cross validation
Zhiwu Zhang
Washington State University
Administration
Homework 5, due April 12, Wednesday, 3:10PM
Final exam: May 4 (Thursday), 120 minutes (3:10-5:10PM), 50
Course evaluation and response

Genomic selection methods with packages in R
GS by GWAS
rrBLUP
gBLUP
cBLUP
sBLUP
Bayesian: A, B, CPi
LASSO
Outline
GS by GWAS
Over fitting
Cross validation
K-fold validation
Jack knife
Re-sampling
Two ways of calculating accuracy
Bias and correction
Setup GAPIT
#source("http://www.bioconductor.org/biocLite.R")
#biocLite("multtest")
#install.packages("gplots")
#install.packages("scatterplot3d") #The downloaded link at: http://cran.r-project.org/package=scatterplot3d
library('MASS') # required for ginv
library(multtest)
library(gplots)
library(compiler) #required for cmpfun
library("scatterplot3d")
source("http://www.zzlab.net/GAPIT/emma.txt")
source("http://www.zzlab.net/GAPIT/gapit_functions.txt")
Import data and simulate phenotype
myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T)
myCV=read.table(file="http://zzlab.net/GAPIT/data/mdp_env.txt",head=T)
#Simulate 10 QTN on the first half of the chromosomes (1 to 5)
X=myGD[,-1]
index1to5=myGM[,2]<6
X1to5 = X[,index1to5]
taxa=myGD[,1]
set.seed(99164)
GD.candidate=cbind(taxa,X1to5)
source("~/Dropbox/GAPIT/Functions/GAPIT.Phenotype.Simulation.R")
mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10,
  effectunit =.95,QTNDist="normal",CV=myCV,cveff=c(.51,.51))
setwd("~/Desktop/temp")
Prediction with PC and ENV
myGAPIT <- GAPIT(
  Y=mySim$Y,
  GD=myGD,
  GM=myGM,
  PCA.total=3,
  CV=myCV,
  group.from=1,
  group.to=1,
  group.by=10,
  QTN.position=mySim$QTN.position,
  #SNP.test=FALSE,
  memo="GLM")
#Squared correlation of the prediction with the simulated phenotype (ry2) and with mySim$u (ru2)
ry2=cor(myGAPIT$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT$Pred[,8],mySim$u)^2
#Two stacked scatter plots: prediction vs. phenotype, prediction vs. mySim$u
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
Choosing the top ten SNPs
ntop=10
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
Prediction with top ten SNPs
myGAPIT2<- GAPIT(
  Y=mySim$Y,
  GD=myGD,
  GM=myGM,
  #PCA.total=3,
  CV=myQTN,
  group.from=1,
  group.to=1,
  group.by=10,
  QTN.position=mySim$QTN.position,
  SNP.test=FALSE,
  memo="GLM+QTN")
ry2=cor(myGAPIT2$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT2$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT2$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT2$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
Slide annotation on both plots: Improved
Prediction with top 200 SNPs
ntop=200
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
myGAPIT2<- GAPIT(
  Y=mySim$Y,
  GD=myGD,
  GM=myGM,
  #PCA.total=3,
  CV=myQTN,
  group.from=1,
  group.to=1,
  group.by=10,
  QTN.position=mySim$QTN.position,
  SNP.test=FALSE,
  memo="GLM+QTN")
Slide annotations on the two plots: Improved; No improvement
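The contrast between a small and a large set of top SNPs can be explored without GAPIT. The following is a minimal sketch in base R on simulated data, not the GAPIT workflow from the slides. The marker matrix, the sample sizes, and the helper `fit_r2` are hypothetical, and markers are ranked by absolute correlation with the phenotype as a stand-in for the GWAS P values. It uses 150 rather than 200 top markers so that the number of covariates stays below the number of simulated individuals. Because the two marker sets are nested, the in-sample R square with the phenotype cannot decrease when more SNPs are added. The R square with the simulated genetic effect `u` has no such guarantee, which is the over-fitting concern.

```r
# Hypothetical simulation: n individuals, m markers, nqtn causal markers.
set.seed(99164)
n <- 200; m <- 300; nqtn <- 10; h2 <- 0.5
X <- matrix(rbinom(n * m, size = 2, prob = 0.5), nrow = n)  # genotypes coded 0/1/2
u <- as.vector(X[, sample(m, nqtn)] %*% rnorm(nqtn))        # simulated genetic effect
y <- u + rnorm(n, sd = sqrt(var(u) * (1 - h2) / h2))        # phenotype with h2 = 0.5

# Rank markers by |correlation| with y, using ALL individuals (no validation).
snp.order <- order(abs(cor(X, y)), decreasing = TRUE)

# Fit the top k markers as fixed effects and evaluate on the same individuals.
fit_r2 <- function(k) {
  pred <- fitted(lm(y ~ X[, snp.order[1:k]]))
  c(ntop = k, r2.y = cor(pred, y)^2, r2.u = cor(pred, u)^2)
}

res <- rbind(fit_r2(10), fit_r2(150))
print(round(res, 3))
```

Comparing the `r2.y` and `r2.u` columns across the two rows shows whether the extra markers fit the genetic signal or only the noise in the phenotype.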
Validation
[Diagram labels: All individuals; Training; Testing; Phenotype; Genotype; Phenotype; Accuracy; SNP effect; Prediction]
Cross validation
[Diagram labels: All individuals; Testing; Training; Phenotype; Genotype; Phenotype; Accuracy; Prediction; SNP effect]
Five fold Cross validation
[Figure labels: Inference; Reference. Figure by Yao Zhou]
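A five fold cross validation can be sketched in base R as follows. This is an illustration on simulated data under stated assumptions, not the course's GAPIT code: the data, the helper `gs_by_gwas`, and the choice of ten top SNPs are hypothetical. Each fold takes a turn as the inference (testing) set while the other four folds form the reference (training) set, and the top SNPs are selected within the reference only, so the inference phenotypes never influence the model.

```r
# Hypothetical simulated data: n individuals, m markers, nqtn causal markers.
set.seed(99164)
n <- 100; m <- 200; nqtn <- 10; h2 <- 0.5; K <- 5
X <- matrix(rbinom(n * m, 2, 0.5), n, dimnames = list(NULL, paste0("snp", 1:m)))
u <- as.vector(X[, sample(m, nqtn)] %*% rnorm(nqtn))
y <- u + rnorm(n, sd = sqrt(var(u) * (1 - h2) / h2))

# Hypothetical helper: choose the top SNPs in the reference only,
# fit them as fixed effects, then predict the inference individuals.
gs_by_gwas <- function(Xref, yref, Xinf, ntop = 10) {
  top <- order(abs(cor(Xref, yref)), decreasing = TRUE)[1:ntop]
  fit <- lm(y ~ ., data = data.frame(y = yref, Xref[, top, drop = FALSE]))
  as.vector(predict(fit, newdata = data.frame(Xinf[, top, drop = FALSE])))
}

fold <- sample(rep(1:K, length.out = n))  # random assignment to K folds
pred <- rep(NA_real_, n)
for (k in 1:K) {
  inf <- fold == k                        # current inference (testing) fold
  pred[inf] <- gs_by_gwas(X[!inf, ], y[!inf], X[inf, , drop = FALSE])
}

# Accuracy evaluated once every individual has received a prediction.
accuracy <- cor(pred, y)
print(accuracy)
```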
Jack Knife
Repeat until every individual gets predicted
[Figure label: Inference]
Jack Knife: extreme case of K=N
N: number of individuals
K: number of folds
Leave-one-out cross-validation
Inference (testing) contains only one individual
Not possible to calculate the correlation between observed and predicted within inference
Evaluation of accuracy must be held until every individual receives a prediction
Resampling is not available
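The leave-one-out procedure can be sketched in a few lines of base R. This is a minimal illustration on simulated data, assuming a hypothetical set of five markers fitted as fixed effects with `lm`. It is not the prediction model used in the course. Each individual is left out once, predicted from the remaining N-1 individuals, and the correlation is computed only after the loop finishes, because a single inference individual gives nothing to correlate.

```r
# Hypothetical simulated data: n individuals and m markers, all with an effect.
set.seed(99164)
n <- 60; m <- 5
X <- matrix(rbinom(n * m, 2, 0.5), n, dimnames = list(NULL, paste0("snp", 1:m)))
u <- as.vector(X %*% rnorm(m))
y <- u + rnorm(n, sd = sd(u))          # residual variance equal to genetic variance
dat <- data.frame(y = y, X)

pred <- numeric(n)
for (i in 1:n) {
  fit <- lm(y ~ ., data = dat[-i, ])   # reference: everyone except individual i
  pred[i] <- predict(fit, newdata = dat[i, ])  # inference: individual i only
}

# Accuracy is held until every individual has a prediction.
accuracy <- cor(pred, y)
print(accuracy)
```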
Re-sampling
Sample part of the population, e.g., 20%, as inference (testing), and leave the rest as reference (training)
Instantly evaluate accuracy of inference
Repeat multiple times
Average accuracy across replicates
Some individuals may never be in the testing set
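The re-sampling scheme above can be sketched as follows. This is a hedged illustration on simulated data: the helper `ridge_predict` is a hypothetical ridge regression written in base R (loosely in the spirit of rrBLUP, but not the rrBLUP package), and the shrinkage value `lambda` is arbitrary rather than tuned. Each replicate draws 20% of the individuals as inference, evaluates the accuracy instantly within that sample, and the accuracies are averaged at the end.

```r
# Hypothetical simulated data.
set.seed(99164)
n <- 100; m <- 200; nqtn <- 10; h2 <- 0.5
X <- matrix(rbinom(n * m, 2, 0.5), n)
u <- as.vector(X[, sample(m, nqtn)] %*% rnorm(nqtn))
y <- u + rnorm(n, sd = sqrt(var(u) * (1 - h2) / h2))

# Hypothetical helper: ridge regression on centered markers.
ridge_predict <- function(Xref, yref, Xinf, lambda) {
  mu <- mean(yref)
  cm <- colMeans(Xref)
  Z <- sweep(Xref, 2, cm)
  beta <- solve(crossprod(Z) + diag(lambda, ncol(Z)), crossprod(Z, yref - mu))
  as.vector(mu + sweep(Xinf, 2, cm) %*% beta)
}

nrep <- 50
ntest <- round(0.2 * n)                 # 20% of the population as inference
acc <- numeric(nrep)
tested <- logical(n)
for (r in 1:nrep) {
  inf <- sample(n, ntest)
  pred <- ridge_predict(X[-inf, ], y[-inf], X[inf, , drop = FALSE], lambda = m)
  acc[r] <- cor(pred, y[inf])           # instant accuracy of this replicate
  tested[inf] <- TRUE
}

print(mean(acc))      # average accuracy across replicates
print(sum(!tested))   # individuals that were never in the testing set
```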
Negative prediction accuracy
Massman JM, Gordillo A, Lorenzana RE, Bernardo R. Genomewide predictions from maize single-cross data. Theor Appl Genet. 2013 Jan;126(1):13-22.
Two ways of calculating correlation
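Following the wording of the earlier slides, the two calculations can be read as a "hold" accuracy (one correlation across all individuals, once everyone has been predicted) and an "instant" accuracy (one correlation per testing fold, averaged over folds). The sketch below writes both as small R functions. The function names and the simulated `obs`, `pred`, and `fold` vectors are hypothetical stand-ins for real cross-validated predictions.

```r
# Hold accuracy: one correlation over all individuals.
hold_accuracy <- function(obs, pred) cor(obs, pred)

# Instant accuracy: correlation within each fold, then averaged over folds.
instant_accuracy <- function(obs, pred, fold) {
  r <- sapply(split(seq_along(obs), fold), function(i) cor(obs[i], pred[i]))
  mean(r)
}

# Hypothetical stand-in data: pred plays the role of cross-validated predictions.
set.seed(99164)
n <- 100; K <- 5
obs <- rnorm(n)
pred <- 0.5 * obs + rnorm(n)
fold <- sample(rep(1:K, length.out = n))

acc <- c(hold = hold_accuracy(obs, pred),
         instant = instant_accuracy(obs, pred, fold))
print(acc)
```

The two values generally differ. They coincide when there is a single fold, since both then reduce to the same correlation.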
Artifactual negative hold accuracy
Hold bias relates to number of folds
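The figure for this slide is not recoverable, so the following is only one way to illustrate how a negative hold accuracy can arise as an artifact and how it depends on the number of folds. It assumes the simplest possible case, which is an assumption of this sketch and not taken from the slides: the phenotype is pure noise, and each individual is predicted by the mean of the reference (training) folds. An individual in a fold with a high mean is then predicted from a reference with a slightly lower mean, which induces a negative correlation across the whole population. With K=N the correlation is exactly -1.

```r
# Hold accuracy under the null: no signal, mean-only predictor (assumption).
hold_null <- function(K, y) {
  fold <- sample(rep(1:K, length.out = length(y)))
  # Each individual is predicted by the mean of all other folds.
  pred <- sapply(seq_along(y), function(i) mean(y[fold != fold[i]]))
  cor(pred, y)
}

set.seed(99164)
n <- 100                      # divisible by every K below, so folds are equal-sized
Ks <- c(2, 5, 10, 20, 100)    # K = 100 is the jack knife (leave-one-out)
nrep <- 200

# Average hold accuracy over replicates for each number of folds.
res <- sapply(Ks, function(K) mean(replicate(nrep, hold_null(K, rnorm(n)))))
names(res) <- Ks
print(round(res, 2))
```

In this setting the average hold accuracy is negative for every K and becomes more negative as K grows, even though there is nothing to predict.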
Problem of instant accuracy
Small sample causes bias
Correction of instant accuracy
Highlight
GS by GWAS
Over fitting
Cross validation
K-fold validation
Jack knife
Re-sampling
Two ways of calculating accuracy
Bias and correction