Lecture 23: Cross validation

Slides:



Advertisements
Similar presentations
Genetic Statistics Lectures (5) Multiple testing correction and population structure correction.
Advertisements

Aaron Lorenz Department of Agronomy and Horticulture
Sequential Multiple Decision Procedures (SMDP) for Genome Scans Q.Y. Zhang and M.A. Province Division of Statistical Genomics Washington University School.
CS685 : Special Topics in Data Mining, UKY The UNIVERSITY of KENTUCKY Validation.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 26: Kernel method.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 19: SUPER.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 25: Ridge Regression.
Washington State University
Statistical Genomics Zhiwu Zhang Washington State University Lecture 29: Bayesian implementation.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 16: CMLM.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 7: Impute.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 20: MLMM.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 4: Statistical inference.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 11: Power, type I error and FDR.
Statistical Genomics Zhiwu Zhang Washington State University Lecture 27: Bayesian theorem.
Genome Wide Association Studies Zhiwu Zhang Washington State University.
Bootstrap and Model Validation
Washington State University
A Multi-stage Approach to Detect Gene-gene Interactions Associated with Multiple Correlated Phenotypes Zhou Xiangdong,Keith Chan, Danhong Zhu Department.
Lecture 28: Bayesian methods
Lecture 10: GWAS by correlation
Washington State University
Lecture 28: Bayesian Tools
Washington State University
Washington State University
Lecture 22: Marker Assisted Selection
Lecture 10: GWAS by correlation
Lecture 12: Population structure
Washington State University
Washington State University
Washington State University
Lecture 12: Population structure
Washington State University
IMPORTANT: 20 minute assemblies
Washington State University
Washington State University
Washington State University
Washington State University
Lecture 10: GWAS by correlation
Washington State University
Washington State University
Investigating Inheritance
The effect of using sequence data instead of a lower density SNP chip on a GWAS EAAP 2017; Tallinn, Estonia Sanne van den Berg, Roel Veerkamp, Fred van.
Lecture 23: Cross validation
Lecture 23: Cross validation
Washington State University
Canine hip dysplasia is predictable by genotyping
Washington State University
Lecture 10: GWAS by correlation
What are BLUP? and why they are useful?
Washington State University
Lecture 26: Bayesian theory
Washington State University
Cross-validation Brenda Thomson/ Peter Fox Data Analytics
Lecture 11: Power, type I error and FDR
Washington State University
Lecture 11: Power, type I error and FDR
Lecture 12: Population structure
Washington State University
Lecture 27: Bayesian theorem
Washington State University
Lecture 18: Heritability and P3D
Washington State University
Lecture 17: Likelihood and estimates of variances
Washington State University
CS639: Data Management for Data Science
Lecture 29: Bayesian implementation
Lecture 22: Marker Assisted Selection
Washington State University
Evaluation David Kauchak CS 158 – Fall 2019.
Presentation transcript:

Lecture 23: Cross validation Statistical Genomics Lecture 23: Cross validation Zhiwu Zhang Washington State University

Administration Homework 5, due April 13, Wednesday, 3:10PM Final exam: May 3, 120 minutes (3:10-5:10PM), 50

Course evaluation and response Genomic selection methods with packages in R GS by GWAS rrBLUP gBLUP cBLUP sBLUP Bayesian LASSO

Outline GS by GWAS Over fitting Cross validation K-fold validation Jack knife Re-sampling Two ways of calculating accuracy Bias and correction

Setup GAPIT #source("http://www.bioconductor.org/biocLite.R") #biocLite("multtest") #install.packages("gplots") #install.packages("scatterplot3d")#The downloaded link at: http://cran.r-project.org/package=scatterplot3d library('MASS') # required for ginv library(multtest) library(gplots) library(compiler) #required for cmpfun library("scatterplot3d") source("http://www.zzlab.net/GAPIT/emma.txt") source("http://www.zzlab.net/GAPIT/gapit_functions.txt")

Import data and simulate phenotype myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T) myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T) myCV=read.table(file="http://zzlab.net/GAPIT/data/mdp_env.txt",head=T) #Simultate 10 QTN on the first half chromosomes X=myGD[,-1] index1to5=myGM[,2]<6 X1to5 = X[,index1to5] taxa=myGD[,1] set.seed(99164) GD.candidate=cbind(taxa,X1to5) source("~/Dropbox/GAPIT/Functions/GAPIT.Phenotype.Simulation.R") mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10, effectunit =.95,QTNDist="normal",CV=myCV,cveff=c(.51,.51)) setwd("~/Desktop/temp")

Prediction with PC and ENV myGAPIT <- GAPIT( Y=mySim$Y, GD=myGD, GM=myGM, PCA.total=3, CV=myCV, group.from=1, group.to=1, group.by=10, QTN.position=mySim$QTN.position, #SNP.test=FALSE, memo="GLM",) ry2=cor(myGAPIT$Pred[,8],mySim$Y[,2])^2 ru2=cor(myGAPIT$Pred[,8],mySim$u)^2 par(mfrow=c(2,1), mar = c(3,4,1,1)) plot(myGAPIT$Pred[,8],mySim$Y[,2]) mtext(paste("R square=",ry2,sep=""), side = 3) plot(myGAPIT$Pred[,8],mySim$u) mtext(paste("R square=",ru2,sep=""), side = 3)

Choosing the top ten SNPs ntop=10 index=order(myGAPIT$P) top=index[1:ntop] myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])

Prediction with top ten SNPs myGAPIT2<- GAPIT( Y=mySim$Y, GD=myGD, GM=myGM, #PCA.total=3, CV=myQTN, group.from=1, group.to=1, group.by=10, QTN.position=mySim$QTN.position, SNP.test=FALSE, memo="GLM+QTN",) ry2=cor(myGAPIT2$Pred[,8],mySim$Y[,2])^2 ru2=cor(myGAPIT2$Pred[,8],mySim$u)^2 par(mfrow=c(2,1), mar = c(3,4,1,1)) plot(myGAPIT2$Pred[,8],mySim$Y[,2]) mtext(paste("R square=",ry2,sep=""), side = 3) plot(myGAPIT2$Pred[,8],mySim$u) mtext(paste("R square=",ru2,sep=""), side = 3) Improved Improved

Prediction with top 200SNPs ntop=200 index=order(myGAPIT$P) top=index[1:ntop] myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)]) myGAPIT2<- GAPIT( Y=mySim$Y, GD=myGD, GM=myGM, #PCA.total=3, CV=myQTN, group.from=1, group.to=1, group.by=10, QTN.position=mySim$QTN.position, SNP.test=FALSE, memo="GLM+QTN",) Improved No Improve

Validation All individuals training Testing Phenothpe Genotype Phenotype Accuracy SNP effect Prediction

Cross validation All individuals Testing Training Phenothpe Genotype Phenotype Accuracy Prediction SNP effect

Five fold Cross validation Inference Reference By Yao Zhou

Until every individuals get predicted Jack Knife Until every individuals get predicted Inference Inference

Jack Knife: extreme case of K=N N: number of individuals K: number of folds Leave-one-out cross-validation Inference (training) contain only one individuals Not possible to calculate correlation between observed and predicted within inference Evaluation of accuracy must be hold until every individuals receive predictions. Resampling is not available

Re-sampling Sample partial population, e.g., 20%, as inference (testing), and leave the rest as reference (Training) Instantly evaluate accuracy of inference Repeated for multiple times Average accuracy across replicates Some individuals may never be in the testing

Negative prediction accuracy Theor Appl Genet. 2013 Jan;126(1):13-22 Genomewide predictions from maize single-cross data. Massman JM1, Gordillo A, Lorenzana RE, Bernardo R.

Two ways of calculating correlation

Artifactual negative hold accuracy

Hold bias relates to number of fold

Problem of instant accuracy

Small sample causes bias

Correction of instant accuracy

Highlight GS by GWAS Over fitting Cross validation K-fold validation Jack knife Re-sampling Two ways of calculating accuracy Bias and correction