Statistical Genomics Zhiwu Zhang Washington State University Lecture 20: MLMM.

Presentation transcript:

Statistical Genomics Zhiwu Zhang Washington State University Lecture 20: MLMM

Administration
- Homework 4 graded
- Homework 5, due April 13, Wednesday, 3:10PM
- Final exam: May 3, 120 minutes (3:10-5:10PM), 50
- Department seminar (March 28), Brigid Meints, “Breeding Barley and Beans for Western Washington”

My beliefs
- After the final exam, details like these fade:
  Xi ~ N(0,1), Y = Sum(Xi^2) over n, Y ~ X2(n)
  y = Xb + Zu + e
  Var(y) = 2K SigmaA + I SigmaE
  rep(rainbow(7),100)
  sample(100,5, replace=F)
  QTNs on CHR1-5, signals pop out on CHR
  % "prediction accuracy" on a trait with h2=0
- But something remains lifelong

Core values behind statistics, programming, genetics, GWAS and GS in CROPS545
- Doing >> looking
- Reasoning
- Learn = (re)invent
- Creative
- Self confidence

Doing >> looking

Reasoning
Teaching model
- Hypothesis: There is no room to improve
- Objective: Reject the null hypothesis
- Method: Increase statistical power

Learn = (re)Invent

Creative Dare to break the rules with judgment

Self confidence
- Questioning why decreasing the missing rate does not improve the accuracy of stochastic imputation, by Chongqing
- Questioning what "u" is in MLM, by Joe
- Finding how to set the seed in the impute (KNN) package, by Louisa
- One more example of my own

Evaluation
Comment: Much more work than other WSU courses
Adjustment
1. Assignments: 9 to 6
2. Requirements: No experience with statistics and programming
3. Easy to pass, or a grade of C- after the 1st assignment unless unusual behavior or recommended to withdraw

Outline
- Stepwise regression
- Criteria
- MLMM
- Power vs FDR and Type I error
- Replicate and mean

Testing SNPs, one at a time
Y = SNP + Q (or PCs) + Kinship + e
- Y: phenotype
- SNP: fixed effect
- Q (or PCs): population structure, fixed effect
- Kinship: unequal relatedness, random effect
General Linear Model (GLM): Y = SNP + Q (or PCs) + e
Mixed Linear Model (MLM): Y = SNP + Q (or PCs) + Kinship + e (Yu et al. 2005, Nature Genetics)

Stepwise regression
Choose m predictive variables from M (M >> m) variables
The challenges:
1. Choosing the best m from M is an NP-hard problem
2. Option: approximation
3. Non-unique criteria

Stepwise regression procedures
1. Sequence of F-tests or t-tests
2. Adjusted R-square
3. Akaike information criterion (AIC)
4. Bayesian information criterion (BIC)
5. Mallows's Cp
6. PRESS
7. False discovery rate (FDR)
Why so many?
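One way to see why the criteria are not interchangeable is to compute a few of them on the same pair of nested models. The following is a minimal sketch in base R on simulated data (the data, variable names and effect sizes are made up for illustration, not taken from the lecture):

```r
# Compare adjusted R-square, AIC and BIC for two nested linear models.
# All data here are simulated; x2 has no true effect on y.
set.seed(2)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 + rnorm(n)

fit1 <- lm(y ~ x1, data = d)        # smaller model
fit2 <- lm(y ~ x1 + x2, data = d)   # adds the irrelevant x2

criteria <- data.frame(
  model = c("y ~ x1", "y ~ x1 + x2"),
  adjR2 = c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared),
  AIC   = c(AIC(fit1), AIC(fit2)),
  BIC   = c(BIC(fit1), BIC(fit2))
)
print(criteria)
```

AIC charges 2 per extra parameter while BIC charges log(n), so with n = 100 BIC penalizes the extra variable more heavily; the criteria can therefore disagree on which model is "best".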

Forward stepwise regression (t or F test)
1. Test M variables one at a time
2. Is the most influential variable significant?
   - Yes: fit the most significant variable as a covariate, test the remaining variables one at a time, and repeat step 2
   - No: end
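The forward procedure can be sketched in base R with ordinary linear models. This is an illustration on simulated data, not the MLMM implementation: the Bonferroni-style threshold, the variable names and the effect sizes are all assumptions made for the example.

```r
# Forward stepwise selection with t-tests on simulated data (base R only).
set.seed(1)
n <- 200; M <- 20
X <- matrix(rnorm(n * M), n, M, dimnames = list(NULL, paste0("x", 1:M)))
y <- 2 * X[, 3] - 1.5 * X[, 7] + rnorm(n)   # only x3 and x7 are causal
threshold <- 0.05 / M                        # assumed significance cutoff

selected <- character(0)
repeat {
  candidates <- setdiff(colnames(X), selected)
  if (length(candidates) == 0) break
  # Test each remaining variable one at a time, given the current covariates
  pvals <- sapply(candidates, function(v) {
    fit <- lm(y ~ X[, c(selected, v), drop = FALSE])
    coefs <- summary(fit)$coefficients
    coefs[nrow(coefs), 4]                    # p-value of the newly added variable
  })
  # Most influential variable not significant: end
  if (min(pvals) >= threshold) break
  # Otherwise keep it as a covariate and test the rest again
  selected <- c(selected, names(which.min(pvals)))
}
print(selected)
```

Once a variable enters `selected` it is never removed, which is the defining property of the forward procedure.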

Backward stepwise regression (t or F test)
1. Test m variables simultaneously
2. Is the least influential variable significant?
   - No: remove it, test the rest (m) simultaneously, and repeat step 2
   - Yes: end
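A matching sketch of the backward procedure, again in base R on simulated data. The threshold of 0.01 and the column names X1..X6 are assumptions for the example only.

```r
# Backward elimination with t-tests on simulated data (base R only).
set.seed(3)
n <- 200
d <- data.frame(matrix(rnorm(n * 6), n, 6))   # columns are named X1..X6
d$y <- 1.5 * d$X2 - 2 * d$X5 + rnorm(n)       # only X2 and X5 are causal
threshold <- 0.01                             # assumed significance cutoff

vars <- paste0("X", 1:6)
while (length(vars) > 0) {
  # Test all remaining variables simultaneously
  fit <- lm(reformulate(vars, response = "y"), data = d)
  p <- summary(fit)$coefficients[-1, 4]       # drop the intercept row
  names(p) <- vars
  # Least influential variable is significant: end
  if (max(p) < threshold) break
  # Otherwise remove it and test the rest
  vars <- setdiff(vars, names(which.max(p)))
}
print(vars)
```

Backward elimination needs all m variables to fit in one model, which is why MLMM applies it only to the short list of pseudo QTNs rather than to all markers.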

Hint from MHC (major histocompatibility complex)

[Figure: GLM, MLM and MLMM results with two QTNs. Nature Genetics, 2012, 44]

MLMM
y = SNP + Q + K + e
Most significant SNP as pseudo QTN
y = SNP + QTN1 + Q + K + e
y = SNP + QTN1 + QTN2 + Q + K + e
So on and so forth until…

Forward regression
y = SNP + QTN1 + QTN2 + … + Q + K + e
Stop when the ratio Var(u) / Var(y) is close to zero
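The stopping ratio can be written in terms of variance components. The block below is a sketch that reuses the course's earlier expression Var(y) = 2K SigmaA + I SigmaE and assumes one observation per individual (Z = I); the symbol h² for the scalar summary of the ratio is notation introduced here, not taken from the slide.

```latex
\begin{aligned}
y &= X\beta + u + e, \qquad \operatorname{Var}(u) = 2K\sigma_a^2, \qquad \operatorname{Var}(e) = I\sigma_e^2, \\
\operatorname{Var}(y) &= 2K\sigma_a^2 + I\sigma_e^2, \\
h^2 &= \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2} \;\longrightarrow\; 0
\quad \text{as the pseudo QTNs in } X \text{ absorb the genetic variance.}
\end{aligned}
```

Each pseudo QTN added as a fixed effect explains part of what the kinship term previously captured, so the estimated genetic variance shrinks from step to step.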

Backward elimination
y = QTN1 + QTN2 + … + QTNt + Q + K + e
y = QTN1 + QTN2 + … + QTNt-1 + Q + K + e
Remove the least significant pseudo QTN
Until all pseudo QTNs are significant

Final p values
Pseudo QTNs: y = QTN1 + QTN2 + … + Q + K + e
Other markers: y = SNP + QTN1 + QTN2 + … + Q + K + e

MLMM R on GitHub

rm(list=ls())
# Load the MLMM function and the packages it needs
setwd('/Users/Zhiwu/Dropbox/Current/ZZLab/WSUCourse/CROPS545/mlmm-master')
source('mlmm_cof.r')
library("MASS") # required for ginv
library(multtest)
library(gplots)
library(compiler) # required for cmpfun
library("scatterplot3d")
# source("   (target missing in the transcript)
# source("   (target missing in the transcript)
source("/Users/Zhiwu/Dropbox//GAPIT/functions/gapit_functions.txt")
# Read genotype data (numeric format) and SNP map
setwd("/Users/Zhiwu/Dropbox/Current/ZZLab/WSUCourse/CROPS512/Demo")
myGD <- read.table("mdp_numeric.txt", header = TRUE)
myGM <- read.table("mdp_SNP_information.txt", header = TRUE)
# For PC and K
setwd("~/Desktop/temp")
myGAPIT0=GAPIT(GD=myGD,GM=myGM,PCA.total=3)
myPC=as.matrix(myGAPIT0$PCA[,-1])
myK=as.matrix(myGAPIT0$kinship[,-1])
myX=as.matrix(myGD[,-1])
# Simulate 10 QTNs on the first five chromosomes
X=myGD[,-1]
index1to5=myGM[,2]<6
X1to5 = X[,index1to5]
taxa=myGD[,1]
set.seed(99164)
GD.candidate=cbind(taxa,X1to5)
mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10,QTNDist="norm")
# Run MLMM with the first two PCs as cofactors
myy=as.numeric(mySim$Y[,-1])
myMLMM<-mlmm_cof(myy,myX,myPC[,1:2],myK,nbchunks=2,maxsteps=20)
myP=myMLMM$pval_step[[1]]$out[,2]
myGI.MP=cbind(myGM[,-1],myP)
# Manhattan and QQ plots
setwd("~/Desktop/temp")
GAPIT.Manhattan(GI.MP=myGI.MP,seqQTN=mySim$QTN.position)
GAPIT.QQ(myP)

GAPIT.FDR.TypeI Function
myGWAS=cbind(myGM,myP,NA)
myStat=GAPIT.FDR.TypeI(WS=c(1e0,1e3,1e4,1e5),GM=myGM,seqQTN=mySim$QTN.position,GWAS=myGWAS)

Return

Area Under Curve (AUC)
par(mfrow=c(1,2),mar = c(5,2,5,2))
plot(myStat$FDR[,1],myStat$Power,type="b")
plot(myStat$TypeI[,1],myStat$Power,type="b")
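For intuition about what an area under a power curve measures, the trapezoid rule can be applied to a handful of points. This is a base R sketch with made-up FDR and power values; `auc_trapezoid` is a hypothetical helper written for this example, and GAPIT.FDR.TypeI may compute its AUC differently.

```r
# Area under a power-vs-FDR curve by the trapezoid rule (base R only).
auc_trapezoid <- function(x, y) {
  o <- order(x)                     # sort points by the x axis
  x <- x[o]; y <- y[o]
  # Sum of trapezoid areas between consecutive points
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

fdr   <- c(0, 0.1, 0.3, 0.6, 1)     # made-up values for illustration
power <- c(0, 0.4, 0.7, 0.9, 1)
auc_trapezoid(fdr, power)           # 0.75
```

A method with no discriminating ability follows the diagonal and gives an area of 0.5; curves that rise faster toward full power give larger areas.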

Replicates
nrep=10
set.seed(99164)
statRep=replicate(nrep, {
  mySim=GAPIT.Phenotype.Simulation(GD=GD.candidate,GM=myGM[index1to5,],h2=.5,NQTN=10,QTNDist="norm")
  myy=as.numeric(mySim$Y[,-1])
  myMLMM<-mlmm_cof(myy,myX,myPC[,1:2],myK,nbchunks=2,maxsteps=20)
  myP=myMLMM$pval_step[[1]]$out[,2]
  myGWAS=cbind(myGM,myP,NA)
  myStat=GAPIT.FDR.TypeI(WS=c(1e0,1e3,1e4,1e5),GM=myGM,seqQTN=mySim$QTN.position,GWAS=myGWAS)
})

str(statRep)

Means over replicates
power=statRep[[2]]
#FDR
s.fdr=seq(3,length(statRep),7)
fdr=statRep[s.fdr]
fdr.mean=Reduce ("+", fdr) / length(fdr)
#AUC: power vs. FDR
s.auc.fdr=seq(6,length(statRep),7)
auc.fdr=statRep[s.auc.fdr]
auc.fdr.mean=Reduce ("+", auc.fdr) / length(auc.fdr)
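The `seq(3, length(statRep), 7)` indexing relies on how `replicate` stores a list returned from each replicate: the results are laid out column by column, so the k-th element of every replicate sits at positions k, k + 7, k + 14, and so on. A small base R sketch with a stand-in that returns three elements instead of seven (the element names `a`, `b`, `c` are made up):

```r
# How replicate() lays out list results, and how Reduce() averages them.
set.seed(4)
nrep <- 5
statRep <- replicate(nrep, {
  # Stand-in for a function returning a list of 3 elements per replicate
  list(a = 1, b = runif(3), c = matrix(runif(6), 3, 2))
})
dim(statRep)                          # 3 rows (elements) by nrep columns

# Element 3 of each replicate sits at linear positions 3, 6, 9, ...
idx  <- seq(3, length(statRep), 3)
mats <- statRep[idx]

# Element-wise mean of the matrices across replicates
mean_mat <- Reduce("+", mats) / length(mats)
print(mean_mat)
```

Indexing by row, `statRep[3, ]`, selects the same matrices and may be easier to read than the stride arithmetic.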

Plots of power vs. FDR
theColor=rainbow(4)
plot(fdr.mean[,1],power, type="b", col=theColor[1],xlim=c(0,1))
for(i in 2:ncol(fdr.mean)){
  lines(fdr.mean[,i], power, type="b", col=theColor[i])
}

Highlight
- Stepwise regression
- Criteria
- MLMM
- Power vs FDR and Type I error
- Replicate and mean