Lecture 19 Classification analysis – R code


MCB 416A/516A Statistical Bioinformatics and Genomic Analysis
Prof. Lingling An, University of Arizona

Last time: Introduction to classification analysis
Why gene selection
Performance assessment
Case study

ALL data: preprocessed gene expression data (Chiaretti et al., Blood, 2004)
12625 genes (hgu95av2 Affymetrix GeneChip)
128 samples (arrays)
phenotypic data on all 128 patients, including:
95 B-cell cancer
33 T-cell cancer

Packages needed

#### install the required packages (first time only) ####
source("http://www.bioconductor.org/biocLite.R")
biocLite("ALL")
biocLite("Biobase")
biocLite("genefilter")
biocLite("hgu95av2.db")
biocLite("MLInterfaces")
install.packages("gplots")
install.packages("e1071")

Then …

#### load the packages ####
library("ALL")   ## the quotes are optional in library()
library("Biobase")
library("genefilter")
library("hgu95av2.db")
library("MLInterfaces")
library("gplots")
library("e1071")

Read in data and other practices

data()                ## list all data sets available in loaded packages
data(package="ALL")   ## list all data sets in the "ALL" package
data(ALL)             ## load the ALL data into the workspace (R does not load data automatically, to save memory)
?ALL                  ## description of the ALL dataset
ALL                   ## the data set "ALL"; an instance of the "ExpressionSet" class
help(ExpressionSet)
exprs(ALL)[1:3,]      ## first three rows of the expression matrix

Data manipulation

pData(ALL)          ## phenotype information about the samples/experiments
names(pData(ALL))   ## column names of that information; varLabels(ALL) does the same
ALL.1 <- ALL[, order(ALL$mol.bio)]   ### order samples by molecular subtype
ALL.1$mol.bio

Effect on gene selection, shown by a correlation map

#### plot the correlation matrix of the 128 samples using all 12625 genes ####
library(gplots)
heatmap(cor(exprs(ALL.1)), Rowv=NA, Colv=NA, scale="none",
        labRow=ALL.1$mol.bio, labCol=ALL.1$mol.bio,
        RowSideColors=as.character(as.numeric(ALL.1$mol.bio)),
        ColSideColors=as.character(as.numeric(ALL.1$mol.bio)),
        col=greenred(75))

Simple gene filtering

#### a simple filter: keep genes whose standard deviation (sd) across samples is larger than 1 ####
ALL.sd <- apply(exprs(ALL.1), 1, sd)
ALL.new <- ALL.1[ALL.sd > 1, ]
ALL.new   #### 379 genes left

Heatmap of the selected genes

heatmap(cor(exprs(ALL.new)), Rowv=NA, Colv=NA, scale="none",
        labRow=ALL.new$mol.bio, labCol=ALL.new$mol.bio,
        RowSideColors=as.character(as.numeric(ALL.new$mol.bio)),
        ColSideColors=as.character(as.numeric(ALL.new$mol.bio)),
        col=greenred(75))
ALL.new$BT
## In the correlation plot, NEG has two subgroups: one is B-cell (BT = B) and the other is T-cell (BT = T).
## BT: the type and stage of the disease; B indicates B-cell ALL, T indicates T-cell ALL.

The exploration above illustrates why we need to do gene selection. Now let's go back and focus on classification analysis.

Classification analysis: preprocessing and selection of samples

# back to the ALL data and select all B-cell ALL samples from the BT column:
> table(ALL$BT)
 B B1 B2 B3 B4  T T1 T2 T3 T4
 5 19 36 23 12  5  1 15 10  2
> bcell <- grep("^B", as.character(ALL$BT))

> table(ALL$mol.biol)   ## take a look at the summary of mol.biol
ALL1/AF4  BCR/ABL E2A/PBX1      NEG   NUP-98  p15/p16
      10       37        5       74        1        1

# select the BCR/ABL abnormality and the negative controls from the mol.biol column:
moltype <- which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL"))

# keep only the B-cell patients that are BCR/ABL or NEG
ALL_bcrneg <- ALL[, intersect(bcell, moltype)]
ALL_bcrneg$mol.biol <- factor(ALL_bcrneg$mol.biol)   ## drop unused levels

Select the samples

# structure of the new dataset
> head(pData(ALL_bcrneg))
> table(ALL_bcrneg$mol.biol)
BCR/ABL     NEG
     37      42

Gene filtering

library("hgu95av2.db")
library(genefilter)
# apply non-specific filtering: among other default steps, drop the 75% of genes with the lowest variance
small <- nsFilter(ALL_bcrneg, var.cutoff=0.75)$eset
small        ## how many genes are left?
dim(small)   ## same information
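nsFilter also reports what it removed; inspecting that log shows how the default steps (control probes, features without annotation, duplicates, low variance) reduced the feature count. A brief sketch using the same call as above (the object name fl is ours):

fl <- nsFilter(ALL_bcrneg, var.cutoff=0.75)
fl$filter.log   ## number of features removed at each filtering step
dim(fl$eset)    ## should match dim(small)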

> table(small$mol.biol)
BCR/ABL     NEG
     37      42

#... define positive and negative cases
> Negs <- which(small$mol.biol == "NEG")
> Bcr <- which(small$mol.biol == "BCR/ABL")
> Negs
 [1]  2  4  5  6  7  8 11 12 14 19 22 24 26 28 31 35 37 38 39
[20] 43 44 45 46 49 50 51 52 54 55 56 57 58 61 62 65 66 67 68
[39] 70 74 75 77
> Bcr
 [1]  1  3  9 10 13 15 16 17 18 20 21 23 25 27 29 30 32 33 34 36 40 41
[23] 42 47 48 53 59 60 63 64 69 71 72 73 76 78 79

Split data into training and validation sets

#... randomly sample 20 individuals from each group and put them together as the training set
set.seed(7)
S1 <- sample(Negs, 20, replace=FALSE)
S2 <- sample(Bcr, 20, replace=FALSE)
TrainInd <- c(S1, S2)

#... the remainder is for validation
> TestInd <- setdiff(1:79, TrainInd)
> TestInd
 [1]  1  3  4  6 10 11 14 15 17 20 22 23 24 25 27 28 32 33 38 39 40 41 43 44 45 46 50 51 52
[30] 53 54 58 61 64 68 71 74 75 79

Further gene selection

#... perform gene/feature selection on the TRAINING SET only
Train <- small[, TrainInd]
Traintt <- rowttests(Train, "mol.biol")
dim(Traintt)     ## check the size
> head(Traintt)  ## take a look at the top of the table
            statistic          dm   p.value
41654_at    0.9975424  0.22064713 0.3248114
35430_at    0.4318284  0.10770465 0.6683066
38924_s_at -0.3245558 -0.06028665 0.7472972
36023_at    1.1890007  0.19535279 0.2418165
266_s_at    0.3070361  0.12969821 0.7604922
37569_at    1.1079437  0.26620492 0.2748499

#... order the genes by the absolute value of the test statistic
> ordTT <- order(abs(Traintt$statistic), decreasing=TRUE)
> head(ordTT)
[1]  677 1795  680 1369 1601  684

#... select the top 50 significant genes/features for simplicity
> featureNamesFromtt <- featureNames(small[ordTT[1:50],])
> featureNamesFromtt
 [1] "1635_at"    "37027_at"   "40480_s_at" "36795_at"   "41071_at"   "1249_at"
 [7] "38385_at"   "33232_at"   "41038_at"   "32562_at"   "37398_at"   "37762_at"
[13] "37363_at"   "40504_at"   "41478_at"   "35831_at"   "34798_at"   "1140_at"
[19] "37021_at"   "41193_at"   "32434_at"   "40167_s_at" "39837_s_at" "35940_at"
[25] "40202_at"   "35162_s_at" "33819_at"   "39329_at"   "39319_at"   "31886_at"
[31] "36131_at"   "39372_at"   "41015_at"   "36638_at"   "41742_s_at" "38408_at"
[37] "40074_at"   "37377_i_at" "37598_at"   "37033_s_at" "40196_at"   "760_at"
[43] "31898_at"   "40051_at"   "36591_at"   "36008_at"   "33206_at"   "35763_at"
[49] "32612_at"   "33362_at"

#... create a reduced ExpressionSet with the top 50 significant features selected on the TRAINING SET
> esetShort <- small[featureNamesFromtt,]
> dim(esetShort)
Features  Samples
      50       79

Summary so far: we have selected the 79 samples with B-cell type cancer and mol.biol = BCR/ABL or NEG, and we have selected/filtered genes twice:
First, filtering out the genes with small variation across the 79 samples.
Second, selecting the top 50 genes (via two-sample t-tests) on a randomly selected training set (20 NEG + 20 BCR/ABL).
Data set: esetShort
Now we will use these 50 genes as marker genes, build classifiers on the training set, and calculate the classification error rate (i.e., misclassification rate) on the test set.
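The error-rate calculation from a 2x2 confusion matrix recurs for every classifier below; it can be wrapped in a small helper. A minimal sketch (the name errRate is ours; the slides instead repeat the calculation inline):

errRate <- function(cm) {
  ## off-diagonal entries of a 2x2 confusion matrix are the misclassified samples
  (cm[1,2] + cm[2,1]) / sum(cm)
}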

Classification methods

library(MLInterfaces)
library(help=MLInterfaces)   ## overview of the package
vignette("MLInterfaces")     ## package vignette
?MLearn                      ## the unified model-fitting interface

Use the validation set to estimate predictive accuracy

# ----------- K nearest neighbors -----------
Knn.out <- MLearn(mol.biol ~ ., data=esetShort, knnI(k=1, l=0), TrainInd)
show(Knn.out)
confuMat(Knn.out)
bb <- confuMat(Knn.out)
Err <- (bb[1,2] + bb[2,1]) / sum(bb)   ## misclassification rate on the test set
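k = 1 is only one choice; a quick way to see how sensitive KNN is to k is to repeat the fit for several values. A minimal sketch using the same MLearn call as above and the errRate helper sketched earlier (the objects ks and errs are ours):

ks <- c(1, 3, 5, 7, 9)
errs <- sapply(ks, function(k) {
  out <- MLearn(mol.biol ~ ., data=esetShort, knnI(k=k, l=0), TrainInd)
  errRate(confuMat(out))   ## test-set misclassification rate for this k
})
names(errs) <- ks
errs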

# ----------- Linear discriminant analysis -----------
Lda.out <- MLearn(mol.biol ~ ., data=esetShort, ldaI, TrainInd)
show(Lda.out)
confuMat(Lda.out)
bb <- confuMat(Lda.out)
Err <- (bb[1,2] + bb[2,1]) / sum(bb)   ## misclassification rate

# ----------- Support vector machine -----------
library(e1071)
Svm.out <- MLearn(mol.biol ~ ., data=esetShort, svmI, TrainInd)
show(Svm.out)
bb <- confuMat(Svm.out)
Err <- (bb[1,2] + bb[2,1]) / sum(bb)   ## misclassification rate

# ----------- Classification tree -----------
Ct.out <- MLearn(mol.biol ~ ., data=esetShort, rpartI, TrainInd)
show(Ct.out)
confuMat(Ct.out)

# ----------- Logistic regression -----------
# it may be worthwhile using a more specialized implementation of logistic regression
Lr.out <- MLearn(mol.biol ~ ., data=esetShort, glmI.logistic(threshold=.5), TrainInd, family=binomial)
show(Lr.out)
confuMat(Lr.out)
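One such specialized implementation is penalized logistic regression: with 50 correlated genes and only 40 training samples, plain maximum-likelihood logistic regression can fail to converge. A minimal sketch, assuming the CRAN package glmnet is installed (the object names are ours):

library(glmnet)
x <- t(exprs(esetShort))   ## glmnet wants samples in rows, genes in columns
y <- esetShort$mol.biol
## L1-penalized (lasso) logistic regression on the training samples,
## with the penalty chosen by glmnet's internal cross-validation
cv <- cv.glmnet(x[TrainInd, ], y[TrainInd], family="binomial")
pred <- predict(cv, newx=x[TestInd, ], s="lambda.min", type="class")
mean(pred != as.character(y[TestInd]))   ## test-set misclassification rate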

Cross validation

Define functions for random partitioning (note the closing brace of ranPartition, which is easy to drop when copying):

ranpart = function(K, data) {
  N = nrow(data)
  cu = as.numeric(cut(1:N, K))
  sample(cu, size = N, replace = FALSE)
}

ranPartition = function(K) function(data, clab, iternum) {
  p = ranpart(K, data)
  which(p == iternum)
}
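A quick sanity check of ranpart (our own illustration, not from the slides): with 79 samples and K = 5, each fold label should appear roughly 79/5, i.e. about 16 times.

set.seed(2)
table(ranpart(5, data.frame(x = 1:79)))   ## fold sizes, approximately equal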

Random 5-fold cross-validation

set.seed(1)
kk <- 100
error <- rep(0, kk)
for (i in 1:kk) {
  r1 <- MLearn(mol.biol ~ ., esetShort, knnI(k = 1, l = 0),
               xvalSpec("LOG", 5, partitionFunc = ranPartition(5)))
  bb <- confuMat(r1)
  error[i] <- (bb[1,2] + bb[2,1]) / sum(bb)
}
mean(error)    ## average misclassification rate over the cross-validation runs
boxplot(error)

Classification methods available in Bioconductor

"MLInterfaces" package: meant to be a unifying platform for machine learning procedures (including classification and clustering methods). Useful, but the package can easily become a black box!

Separate functions:
Linear and quadratic discriminant analysis: "lda" and "qda"
KNN classification: "knn"
CART: "rpart"
Bagging and AdaBoosting: "bagging" and "logitboost"
Random forest: "randomForest"
Support vector machines: "svm"
Artificial neural network: "nnet"
Nearest shrunken centroids: "pamr"

An example of calling one of these functions directly is sketched below.
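To avoid the black-box feeling, the same 1-nearest-neighbor analysis can be run with the standalone knn() function from the class package, which ships with R. A minimal sketch under the same train/test split as above (the object names X and pred are ours):

library(class)
X <- t(exprs(esetShort))   ## knn() wants samples in rows, genes in columns
pred <- knn(train = X[TrainInd, ], test = X[TestInd, ],
            cl = esetShort$mol.biol[TrainInd], k = 1)
table(pred, esetShort$mol.biol[TestInd])   ## confusion matrix on the test set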