Lecture 23: Cross validation

Presentation transcript:

Statistical Genomics. Lecture 23: Cross validation. Zhiwu Zhang, Washington State University

Administration: Homework 5 is due Wednesday, April 12, at 3:10 PM. Final exam: Thursday, May 4, 120 minutes (3:10-5:10 PM), 50

Course evaluation and response: genomic selection methods with packages in R: GS by GWAS; rrBLUP; gBLUP; cBLUP; sBLUP; Bayesian (A, B, CPi); LASSO

Outline: GS by GWAS; overfitting; cross validation; K-fold validation; Jack knife; re-sampling; two ways of calculating accuracy; bias and correction

Setup GAPIT
#source("http://www.bioconductor.org/biocLite.R")
#biocLite("multtest")
#install.packages("gplots")
#install.packages("scatterplot3d") #The downloaded link at: http://cran.r-project.org/package=scatterplot3d
library('MASS') # required for ginv
library(multtest)
library(gplots)
library(compiler) #required for cmpfun
library("scatterplot3d")
source("http://www.zzlab.net/GAPIT/emma.txt")
source("http://www.zzlab.net/GAPIT/gapit_functions.txt")

Import data and simulate phenotype
myGD=read.table(file="http://zzlab.net/GAPIT/data/mdp_numeric.txt",head=T)
myGM=read.table(file="http://zzlab.net/GAPIT/data/mdp_SNP_information.txt",head=T)
myCV=read.table(file="http://zzlab.net/GAPIT/data/mdp_env.txt",head=T)
#Simulate 10 QTN on the first half of the chromosomes (1 to 5)
X=myGD[,-1]
index1to5=myGM[,2]<6
X1to5 = X[,index1to5]
taxa=myGD[,1]
set.seed(99164)
GD.candidate=cbind(taxa,X1to5)
source("~/Dropbox/GAPIT/Functions/GAPIT.Phenotype.Simulation.R")
mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10, effectunit =.95,QTNDist="normal",CV=myCV,cveff=c(.51,.51))
setwd("~/Desktop/temp")

Prediction with PC and ENV
myGAPIT <- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
PCA.total=3,
CV=myCV,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
#SNP.test=FALSE,
memo="GLM")
#Squared correlation of the prediction with the phenotype (ry2) and with the true genetic effect (ru2)
ry2=cor(myGAPIT$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)

Choosing the top ten SNPs
ntop=10
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])

Prediction with top ten SNPs
myGAPIT2<- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
#PCA.total=3,
CV=myQTN,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
SNP.test=FALSE,
memo="GLM+QTN")
ry2=cor(myGAPIT2$Pred[,8],mySim$Y[,2])^2
ru2=cor(myGAPIT2$Pred[,8],mySim$u)^2
par(mfrow=c(2,1), mar = c(3,4,1,1))
plot(myGAPIT2$Pred[,8],mySim$Y[,2])
mtext(paste("R square=",ry2,sep=""), side = 3)
plot(myGAPIT2$Pred[,8],mySim$u)
mtext(paste("R square=",ru2,sep=""), side = 3)
Figure annotations: "Improved" on both panels.

Prediction with top 200 SNPs
ntop=200
index=order(myGAPIT$P)
top=index[1:ntop]
myQTN=cbind(myGAPIT$PCA[,1:4], myCV[,2:3],myGD[,c(top+1)])
myGAPIT2<- GAPIT(
Y=mySim$Y,
GD=myGD,
GM=myGM,
#PCA.total=3,
CV=myQTN,
group.from=1,
group.to=1,
group.by=10,
QTN.position=mySim$QTN.position,
SNP.test=FALSE,
memo="GLM+QTN")
Figure annotations: "Improved" on one panel, "No improvement" on the other.
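The contrast between ten and 200 top SNPs can be reproduced without GAPIT. The following is a minimal sketch under stated assumptions, not the lecture's own code: genotypes and phenotypes are simulated here, markers are ranked by marginal correlation as a stand-in for the GWAS P values, and the hypothetical helper fit_top fits an ordinary linear model to all individuals. Because the models are nested, the fit to the phenotype can only go up as markers are added, while the fit to the true genetic effect has no such guarantee.

```r
set.seed(1)
n <- 200; m <- 500; nqtn <- 10; h2 <- 0.5

# Simulated data (assumption): genotypes coded 0/1/2, 10 QTN, h2 = 0.5
X <- matrix(rbinom(n * m, 2, 0.3), n, m)
qtn <- sample(m, nqtn)
u <- as.vector(X[, qtn] %*% rnorm(nqtn))            # true genetic effect
y <- u + rnorm(n, sd = sqrt(var(u) * (1 - h2) / h2))  # phenotype

# Rank markers by absolute marginal correlation with the phenotype
r <- as.vector(cor(X, y))

# Hypothetical helper: fit the top markers on ALL individuals (no validation)
fit_top <- function(ntop) {
  top <- order(abs(r), decreasing = TRUE)[1:ntop]
  pred <- fitted(lm(y ~ X[, top, drop = FALSE]))
  c(ntop = ntop, r2_y = cor(pred, y)^2, r2_u = cor(pred, u)^2)
}

res <- rbind(fit_top(10), fit_top(150))
print(res)
```

Comparing the r2_y and r2_u columns across the two rows shows whether the extra markers explain the genetic effect or only the noise in the phenotype.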

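Validation (diagram). All individuals are divided into a training set and a testing set. Training: phenotype and genotype give the SNP effects. Testing: genotype and SNP effects give the prediction; comparing the prediction with the phenotype gives the accuracy.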

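Cross validation (diagram). All individuals are divided into a testing set and a training set, and the sets then exchange roles. SNP effects estimated from the phenotype and genotype of the training set give the prediction for the testing set; comparing the prediction with the phenotype gives the accuracy.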

Five-fold cross validation (figure by Yao Zhou). In each round one fold serves as inference (testing) and the remaining folds serve as reference (training).
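The five-fold scheme can be sketched in base R as follows. Everything here is an assumption made for illustration: the data are simulated, and the hypothetical helper ridge_predict (ridge regression with a fixed penalty) stands in for whichever genomic prediction method is being evaluated. Accuracy is reported both as the average of the within-fold correlations and as one correlation computed after every individual has been predicted.

```r
set.seed(23)
n <- 100; m <- 200; K <- 5

# Simulated data (assumption): genotypes coded 0/1/2, 10 QTN, h2 = 0.5
X <- matrix(rbinom(n * m, 2, 0.3), n, m)
u <- as.vector(X[, sample(m, 10)] %*% rnorm(10))
y <- u + rnorm(n, sd = sd(u))

# Hypothetical helper: ridge regression on centered markers, fixed penalty
ridge_predict <- function(Xtr, ytr, Xte, lambda = 50) {
  mu <- colMeans(Xtr)
  Xc <- sweep(Xtr, 2, mu)
  b <- solve(crossprod(Xc) + diag(lambda, ncol(Xc)),
             crossprod(Xc, ytr - mean(ytr)))
  as.vector(mean(ytr) + sweep(Xte, 2, mu) %*% b)
}

fold <- sample(rep(1:K, length.out = n))  # random, equally sized folds
pred <- rep(NA_real_, n)
fold_acc <- numeric(K)
for (k in 1:K) {
  test <- fold == k                        # inference (testing) fold
  pred[test] <- ridge_predict(X[!test, ], y[!test], X[test, , drop = FALSE])
  fold_acc[k] <- cor(pred[test], y[test])  # correlation within this fold
}

within_fold_acc <- mean(fold_acc)  # average of the K within-fold correlations
overall_acc <- cor(pred, y)        # one correlation across all individuals
print(c(within_fold = within_fold_acc, overall = overall_acc))
```

The model is refitted from the reference folds in every round, so no individual contributes to its own prediction.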

Jack knife (figure). The inference set moves through the population until every individual has been predicted.

Jack knife: the extreme case of K = N, where N is the number of individuals and K is the number of folds. It is also called leave-one-out cross-validation. The inference (testing) set contains only one individual, so the correlation between observed and predicted values cannot be calculated within the inference set. Evaluation of accuracy must be held until every individual has received a prediction. Re-sampling is not available.
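A leave-one-out loop might look like the sketch below. The simulated data and the hypothetical helper ridge_predict (ridge regression with a fixed penalty) are assumptions for illustration only. Each round predicts a single individual, so the only accuracy that can be computed is the one held until the loop has finished.

```r
set.seed(181)
n <- 60; m <- 100

# Simulated data (assumption): genotypes coded 0/1/2, 10 QTN, h2 = 0.5
X <- matrix(rbinom(n * m, 2, 0.3), n, m)
u <- as.vector(X[, sample(m, 10)] %*% rnorm(10))
y <- u + rnorm(n, sd = sd(u))

# Hypothetical helper: ridge regression on centered markers, fixed penalty
ridge_predict <- function(Xtr, ytr, Xte, lambda = 50) {
  mu <- colMeans(Xtr)
  Xc <- sweep(Xtr, 2, mu)
  b <- solve(crossprod(Xc) + diag(lambda, ncol(Xc)),
             crossprod(Xc, ytr - mean(ytr)))
  as.vector(mean(ytr) + sweep(Xte, 2, mu) %*% b)
}

pred <- rep(NA_real_, n)
for (i in 1:n) {
  # Individual i is the inference set; the other n - 1 are the reference
  pred[i] <- ridge_predict(X[-i, ], y[-i], X[i, , drop = FALSE])
}

# A single observed/predicted pair has no correlation, so accuracy is
# computed once, after every individual has received a prediction
hold_acc <- cor(pred, y)
print(hold_acc)
```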

Re-sampling: sample part of the population, e.g., 20%, as inference (testing), and leave the rest as reference (training). Accuracy of the inference set can be evaluated instantly. The sampling is repeated multiple times, and accuracy is averaged across replicates. Some individuals may never be in the testing set.
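The re-sampling procedure can be sketched as follows, again with simulated data and a hypothetical ridge_predict helper as stand-ins (both are assumptions, as are the 20 replicates). Each replicate draws 20% of the individuals as inference, evaluates accuracy immediately, and the accuracies are averaged at the end. The counter times_tested makes it possible to check whether any individual was never drawn.

```r
set.seed(183)
n <- 100; m <- 200; nrep <- 20
n_test <- round(0.2 * n)   # 20% of the population as inference

# Simulated data (assumption): genotypes coded 0/1/2, 10 QTN, h2 = 0.5
X <- matrix(rbinom(n * m, 2, 0.3), n, m)
u <- as.vector(X[, sample(m, 10)] %*% rnorm(10))
y <- u + rnorm(n, sd = sd(u))

# Hypothetical helper: ridge regression on centered markers, fixed penalty
ridge_predict <- function(Xtr, ytr, Xte, lambda = 50) {
  mu <- colMeans(Xtr)
  Xc <- sweep(Xtr, 2, mu)
  b <- solve(crossprod(Xc) + diag(lambda, ncol(Xc)),
             crossprod(Xc, ytr - mean(ytr)))
  as.vector(mean(ytr) + sweep(Xte, 2, mu) %*% b)
}

acc <- numeric(nrep)
times_tested <- integer(n)
for (r in 1:nrep) {
  test <- sample(n, n_test)                      # random inference set
  pred <- ridge_predict(X[-test, ], y[-test], X[test, , drop = FALSE])
  acc[r] <- cor(pred, y[test])                   # instant accuracy
  times_tested[test] <- times_tested[test] + 1L
}

mean_acc <- mean(acc)                    # average across replicates
never_tested <- sum(times_tested == 0)   # individuals never in testing
print(c(mean_accuracy = mean_acc, never_tested = never_tested))
```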

Negative prediction accuracy. Massman JM, Gordillo A, Lorenzana RE, Bernardo R. Genomewide predictions from maize single-cross data. Theor Appl Genet. 2013 Jan;126(1):13-22.

Two ways of calculating correlation

Artifactual negative hold accuracy
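The figure for this slide is not preserved in the transcript. One well-understood way a negative hold accuracy can arise as an artifact is sketched below, under the assumption that the model has no predictive signal and simply predicts each fold by the mean phenotype of the remaining individuals. A fold with an above-average phenotype leaves a below-average reference mean, so predictions and observations are negatively related once all folds are pooled. With leave-one-out the relationship is exactly linear and the pooled correlation is -1.

```r
set.seed(189)
n <- 100; K <- 5
y <- rnorm(n)   # simulated phenotype with no predictable signal (assumption)
fold <- sample(rep(1:K, length.out = n))   # equally sized folds

# K-fold: predict every individual in a fold by the mean of the other folds
pred_kfold <- numeric(n)
for (k in 1:K) {
  test <- fold == k
  pred_kfold[test] <- mean(y[!test])
}
hold_kfold <- cor(pred_kfold, y)   # pooled over all individuals

# Leave-one-out: the prediction for i is the mean of everyone except i
pred_loo <- (sum(y) - y) / (n - 1)
hold_loo <- cor(pred_loo, y)

print(c(hold_kfold = hold_kfold, hold_loo = hold_loo))
```

Within a single fold the prediction is a constant, so no within-fold correlation exists for this null model; the negative value appears only in the pooled calculation.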

Hold bias relates to the number of folds

Problem of instant accuracy

Small sample causes bias

Correction of instant accuracy

Highlight: GS by GWAS; overfitting; cross validation; K-fold validation; Jack knife; re-sampling; two ways of calculating accuracy; bias and correction