Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site:

Slides:



Advertisements
Similar presentations
The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke
Advertisements

COMPUTER AIDED DIAGNOSIS: CLASSIFICATION Prof. Yasser Mostafa Kadah –
Random Forest Predrag Radenković 3237/10
CSCE555 Bioinformatics Lecture 15 classification for microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Ensemble Methods An ensemble method constructs a set of base classifiers from the training data Ensemble or Classifier Combination Predict class label.
Model generalization Test error Bias, variance and complexity
Indian Statistical Institute Kolkata
Model assessment and cross-validation - overview
Microarray Data Analysis
Sparse vs. Ensemble Approaches to Supervised Learning
Resampling techniques Why resampling? Jacknife Cross-validation Bootstrap Examples of application of bootstrap.
Discrimination and clustering with microarray gene expression data Terry Speed, Jane Fridlyand, Yee Hwa Yang and Sandrine Dudoit* Department of Statistics,
Discrimination or Class prediction or Supervised Learning.
Predictive Automatic Relevance Determination by Expectation Propagation Yuan (Alan) Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani.
Classification in Microarray Experiments
Discrimination Class web site: Statistics for Microarrays.
Discrimination Methods As Used In Gene Array Analysis.
Classification 10/03/07.
Basics of discriminant analysis
DIMACS Workshop on Machine Learning Techniques in Bioinformatics 1 Cancer Classification with Data-dependent Kernels Anne Ya Zhang (with Xue-wen.
CHAPTER 4: Parametric Methods. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 Parametric Estimation X = {
CHAPTER 4: Parametric Methods. Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 Parametric Estimation Given.
Ensemble Learning (2), Tree and Forest
1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.
Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.
Classification (Supervised Clustering) Naomi Altman Nov '06.
Chapter 10 Boosting May 6, Outline Adaboost Ensemble point-view of Boosting Boosting Trees Supervised Learning Methods.
Xuelian Wei Department of Statistics Most of Slides Adapted from by Darlene Goldstein Classification.
Analysis and Management of Microarray Data Dr G. P. S. Raghava.
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
1 Classification and Clustering Methods for Gene Expression Kim-Anh Do, Ph.D. Department of Biostatistics The University of Texas M.D. Anderson Cancer.
LOGO Ensemble Learning Lecturer: Dr. Bo Yuan
SLIDES RECYCLED FROM ppt slides by Darlene Goldstein Supervised Learning, Classification, Discrimination.
The Broad Institute of MIT and Harvard Classification / Prediction.
Overview of Supervised Learning Overview of Supervised Learning2 Outline Linear Regression and Nearest Neighbors method Statistical Decision.
Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San.
Learning Theory Reza Shadmehr Linear and quadratic decision boundaries Kernel estimates of density Missing data.
1 Advanced analysis: Classification, clustering and other multivariate methods. Statistics for Microarray Data Analysis – Lecture 4 The Fields Institute.
Ensembles. Ensemble Methods l Construct a set of classifiers from training data l Predict class label of previously unseen records by aggregating predictions.
Ensemble Learning Spring 2009 Ben-Gurion University of the Negev.
1 E. Fatemizadeh Statistical Pattern Recognition.
Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
Guest lecture: Feature Selection Alan Qi Dec 2, 2004.
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
Supervised learning in high-throughput data  General considerations  Dimension reduction with outcome variables  Classification models.
Parameter Estimation. Statistics Probability specified inferred Steam engine pump “prediction” “estimation”
Tree and Forest Classification and Regression Tree Bagging of trees Boosting trees Random Forest.
Ch 1. Introduction Pattern Recognition and Machine Learning, C. M. Bishop, Updated by J.-H. Eom (2 nd round revision) Summarized by K.-I.
Part 3: Estimation of Parameters. Estimation of Parameters Most of the time, we have random samples but not the densities given. If the parametric form.
Topics in analysis of microarray data : clustering and discrimination
Ensemble Classifiers.
Bagging and Random Forests
Heping Zhang, Chang-Yung Yu, Burton Singer, Momian Xiong
CH 5: Multivariate Methods
Alan Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani
Supervised Learning: Classification
Boosting For Tumor Classification With Gene Expression Data
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Class Prediction Based on Gene Expression Data Issues in the Design and Analysis of Microarray Experiments Michael D. Radmacher, Ph.D. Biometric Research.
Generally Discriminant Analysis
Ensemble learning Reminder - Bagging of Trees Random Forest
Model generalization Brief summary of methods
Parametric Methods Berlin Chen, 2005 References:
Multivariate Methods Berlin Chen
Mathematical Foundations of BME
Multivariate Methods Berlin Chen, 2005 References:
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Presentation transcript:

Statistics for Microarray Data Analysis with R Session 8: Discrimination Class web site:

Today’s Outline Discrimination (classification) Some classification rules Performance assessment Discrimination (classification) in R Exercises

cDNA gene expression data Data on G genes for n samples Genes mRNA samples Gene expression level of gene i in mRNA sample j = (normalized) Log( Red intensity / Green intensity) sample1sample2sample3sample4sample5 …

Classification Task: assign objects to classes (groups) on the basis of measurements made on the objects Unsupervised: classes unknown, want to discover them from the data (cluster analysis) Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Discrimination Objects (e.g. arrays) are to be classified as belonging to one of a number of predefined classes {1, 2, …, K} Each object associated with a class label (or response) Y  {1, 2, …, K} and a feature vector (vector of predictor variables) of G measurements: X = (X 1, …, X G ) Aim: predict Y from X

Example: Tumor Classification Reliable and precise classification essential for successful cancer treatment Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables Uncertainties in diagnosis remain; likely that existing classes are heterogeneous Characterize molecular variations among tumors by monitoring gene expression (microarray) Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)

Tumor Classification Using Gene Expression Data Three main types of statistical problems associated with tumor classification: Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering) Classification of malignancies into known classes (supervised learning – discrimination) Identification of “marker” genes that characterize the different tumor classes (feature or variable selection)

Classifiers A predictor or classifier partitions (divides) the space of gene expression profiles into K disjoint subsets, A 1,..., A K, such that for a sample with expression profile X=(X 1,...,X G ) in A k the predicted class is k Classifiers are built from a learning set (LS) L = (X 1, Y 1 ),..., (X n,Y n ) Classifier C built from a learning set L: C(.,L): X  {1,2,...,K} Predicted class for observation X: C(X,L) = k if X is in A k

Maximum likelihood discriminant rule A maximum likelihood estimator (MLE) chooses the parameter value that makes the chance of the observations the highest For example, if I toss a coin 10 times and I see 7 heads, the MLE for p = probability of heads is.7 If the histograms for the individual classes are known, the maximum likelihood (ML) discriminant rule predicts the class of an observation X to be the one with the highest bar on the histogram (density scale)

Fisher Linear Discriminant Analysis First applied in 1935 by M. Barnard at the suggestion of R. A. Fisher (1936), Fisher linear discriminant analysis (FLDA): 1.finds linear combinations of the gene expression profiles X=X 1,...,X G with large ratios of between-groups to within-groups sums of squares - discriminant variables; 2.predicts the class of an observation X by the class whose mean vector is closest to X in terms of the discriminant variables

Gaussian ML Discriminant Rules For multivariate Gaussian (normal) class densities X|Y= k ~ N(  k,  k ), the ML classifier is C(X) = argmin k {(X -  k )  k -1 (X -  k )’ + log|  k |} In general, this is a quadratic rule (Quadratic discriminant analysis, or QDA) In practice, population mean vectors  k and covariance matrices  k are estimated by corresponding sample quantities

Gaussian ML Discriminant Rules When all class densities have the same covariance matrix,  k =  the discriminant rule is linear (Linear discriminant analysis, or LDA; FLDA for k = 2) When all class densities have the same diagonal covariance matrix  =diag(  1 2 …  G 2 ), the discriminant rule is again linear (Diagonal linear discriminant analysis, or DLDA)

Nearest Neighbor Classification Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation) k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows: –find the k observations in the learning set closest to X –predict the class of X by majority vote, i.e., choose the class that is most common among those k observations. The number of neighbors k can be chosen by cross-validation (more on this later)

Classification Trees Partition the feature space into a set of rectangles, then fit a simple model in each one Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets (starting with X itself) Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier

Classification Tree

Three Aspects of Tree Construction Split Selection Rule Split-stopping Rule Class assignment Rule Different approaches to these three issues (e.g. CART: Classification And Regression Trees, Breiman et al. (1984); C4.5 and C5.0, Quinlan (1993))

Three Rules (CART) Splitting: At each node, choose split maximizing decrease in impurity (e.g. misclassification error) Split-stopping: Grow large tree, prune to obtain a sequence of subtrees, then use cross- validation to identify the subtree with lowest misclassification rate Class assignment: For each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node

Other Classifiers Include… Support vector machines (SVMs) Neural networks Bayesian regression methods

Features Feature selection –Automatic with trees –For DA, NN need preliminary selection –Need to account for selection when assessing performance Missing data –Automatic imputation with trees –Otherwise, impute (or ignore)

Performance assessment (I) Resubstitution estimation: error rate on the learning set –Problem: downward bias Test set estimation: divide cases in learning set into two sets, L 1 and L 2 ; classifier built using L 1, error rate computed for L 2. L 1 and L 2 must be iid. –Problem: reduced effective sample size

Performance assessment (II) V-fold cross-validation (CV) estimation: Cases in learning set randomly divided into V subsets of (nearly) equal size (for example, 5 or 10 subsets). Build classifiers leaving one set out; test set error rates computed on left out set and averaged to get overall error –Bias-variance tradeoff: smaller V can give larger bias but smaller variance Out-of-bag estimation: covered below ROC curves: not covered at all, I’m still learning about this!

Cross-validation learning test learningtestlearning testlearning error average error

Performance assessment (III) Common to do feature selection using all of the data, then CV only for model building and classification However, usually features are unknown and the intended inference includes feature selection. Then, CV estimates as above tend to be downward biased Features should be selected only from the learning set used to build the model (and not the entire set)

How NOT to estimate error ** DON’T DO THIS ** DON’T DO THIS ** Use the whole data set to choose which variables (features) to use in the classifier Divide the data into (10, say) subsets for CV Leave out a subset and build a classifier with features chosen from the whole data set Use the classifier to predict the left out subset Average over left out subsets to estimate error ** DON’T DO THIS ** DON’T DO THIS **

Aggregating classifiers Breiman (1996, 1998) found that gains in accuracy could be obtained by aggregating predictors built from perturbed versions of the learning set Many possibilities for perturbing The multiple versions of the predictor are aggregated by (weighted) voting

Bagging Bagging = Bootstrap aggregating Nonparametric Bootstrap (standard bagging): perturbed learning sets drawn at random with replacement from the learning sets; predictors built for each perturbed dataset and aggregated by plurality voting (w b = 1) Parametric Bootstrap: perturbed learning sets are some known distribution (e.g. multivariate Gaussian) Convex pseudo-data (Breiman 1996)

Aggregation By-products: Out- of-bag estimation of error rate Out-of-bag error rate estimate: unbiased Expect about 1/3 of cases to be left out of each bootstrap sample Use these left out cases from each bootstrap sample as a test set Classify these test set cases, and compare to the true class labels to get the out-of-bag estimate of the error rate

Boosting Freund and Schapire (1997), Breiman (1998) Data resampled adaptively so that the weights in the resampling are increased for those cases most often misclassified Predictor aggregation done by weighted voting

Comparison of classifiers Dudoit, Fridlyand, Speed (JASA, 2002) FLDA DLDA DQDA NN CART Bagging and boosting

Comparison study datasets Leukemia – Golub et al. (1999) n = 72 samples, G = 3,571 genes 3 classes (B-cell ALL, T-cell ALL, AML) Lymphoma – Alizadeh et al. (2000) n = 81 samples, G = 4,682 genes 3 classes (B-CLL, FL, DLBCL) NCI 60 – Ross et al. (2000) N = 64 samples, p = 5,244 genes 8 classes

Leukemia data, 2 classes: Test set error rates;150 LS/TS runs

Leukemia data, 3 classes: Test set error rates;150 LS/TS runs

Lymphoma data, 3 classes: Test set error rates; N=150 LS/TS runs

NCI 60 data :Test set error rates;150 LS/TS runs

Results In the main comparison, NN and DLDA had the smallest error rates, FLDA had the highest Aggregation improved the performance of CART classifiers, the largest gains being with boosting and bagging with convex pseudo-data Dettling and Bühlmann, improved the performance of boosting (LogitBoost)

Results, cont For the lymphoma and leukemia datasets, increasing the number of genes to G=200 didn't greatly affect the performance of the various classifiers; there was an improvement for the NCI 60 dataset. More careful selection of a small number of genes (10) improved the performance of FLDA dramatically

Comparison study – Discussion (I) “Diagonal” LDA: ignoring correlation between genes helped here Unlike classification trees and nearest neighbors, LDA is unable to take into account gene interactions Although nearest neighbors are simple and intuitive classifiers, their main limitation is that they give very little insight into mechanisms underlying the class distinctions

Comparison study – Discussion (II) Classification trees are capable of handling and revealing interactions between variables Useful by-product of aggregated classifiers: prediction votes, variable importance statistics Variable selection: A crude criterion such as BSS/WSS may not identify the genes that discriminate between all the classes and may not reveal interactions between genes With larger training sets, expect improvement in performance of aggregated classifiers

Acknowledgements Sandrine Dudoit Jane Fridlyand Yee Hwa (Jean) Yang Terry Speed

R: discrimination A number of R packages (libraries) contain functions to carry out discrimination, including: –MASS : lda, qda –sma: dlda –class : knn –rpart : classification and regression trees (recursive partitioning) –ipred: bagging –e1071: svm –LogitBoost : boosting

Exercise: discrimination These ideas have been hard! The handout gives a very brief guide to using some of the classification routines in R, including a little bit of cross-validation There are many others as well, feel free to explore...