Microarray Workshop 1 Introduction to Classification Issues in Microarray Data Analysis Jane Fridlyand Jean Yee Hwa Yang University of California, San Francisco Elsinore, Denmark May 17-21, 2004

Microarray Workshop 2 Motivating example: breast cancer prognosis. Objects: arrays; feature vectors: gene expression measurements. Predefined classes: clinical outcome, bad prognosis (recurrence < 5 yrs) vs. good prognosis (recurrence > 5 yrs). A classification rule is learned from the learning set and applied to a new array, assigning it a class, e.g. good prognosis (metastasis > 5 yrs). Reference: L. van't Veer et al. (2002), Gene expression profiling predicts clinical outcome of breast cancer, Nature, Jan.

Microarray Workshop 3 Discrimination and allocation. Discrimination: a classification technique is applied to the learning set (data with known classes) to produce a classification rule. Prediction (allocation): the rule is applied to data with unknown classes to make class assignments.

Microarray Workshop 4 Classification rule: maximum likelihood discriminant rule. A maximum likelihood estimator (MLE) chooses the parameter value that makes the observed data most probable. For known class conditional densities $p_k(X)$, the maximum likelihood (ML) discriminant rule predicts the class of an observation $X$ by $C(X) = \arg\max_k p_k(X)$.
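A minimal sketch of the ML rule for the idealized case where the class densities really are known, using two hypothetical univariate normal densities (the densities and test values are illustrative, not from the workshop):

```python
# ML discriminant rule C(x) = argmax_k p_k(x) for known class densities.
# The two normal densities below are assumed purely for illustration.
from scipy.stats import norm

densities = {0: norm(loc=0, scale=1).pdf, 1: norm(loc=2, scale=1).pdf}

def ml_classify(x):
    # Pick the class whose density assigns x the highest likelihood.
    return max(densities, key=lambda k: densities[k](x))

print(ml_classify(0.3), ml_classify(1.7))  # -> 0 1
```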

Microarray Workshop 5 Gaussian ML discriminant rules. For multivariate Gaussian (normal) class densities $X|Y=k \sim N(\mu_k, \Sigma_k)$, the ML classifier is $C(X) = \arg\min_k \{(X - \mu_k)\,\Sigma_k^{-1}\,(X - \mu_k)' + \log|\Sigma_k|\}$. In general this is a quadratic rule (quadratic discriminant analysis, or QDA). In practice, the population mean vectors $\mu_k$ and covariance matrices $\Sigma_k$ are estimated by the corresponding sample quantities.
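A hedged sketch of the QDA rule above with sample means and covariances plugged in, on simulated data (function names and data are illustrative). Note that with microarray dimensions p >> n the full covariance estimates are singular, which motivates the diagonal rules on the next slide:

```python
# QDA sketch: estimate (mu_k, Sigma_k) per class, classify by the rule
# argmin_k (x - mu_k)' Sigma_k^{-1} (x - mu_k) + log|Sigma_k|.
import numpy as np

def qda_fit(X, y):
    """Per-class sample mean vectors and covariance matrices."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return params

def qda_predict(params, x):
    scores = {}
    for k, (mu, Sigma) in params.items():
        d = x - mu
        scores[k] = d @ np.linalg.solve(Sigma, d) + np.log(np.linalg.det(Sigma))
    return min(scores, key=scores.get)

# Toy usage: simulated data, 2 classes, 5 "genes", 20 arrays per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(1, 1, (20, 5))])
y = np.repeat([0, 1], 20)
print(qda_predict(qda_fit(X, y), X[0]))
```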

Microarray Workshop 6 ML discriminant rules - special cases. [DLDA] Diagonal linear discriminant analysis: class densities have the same diagonal covariance matrix $\Sigma = \mathrm{diag}(s_1^2, \ldots, s_p^2)$. [DQDA] Diagonal quadratic discriminant analysis: class densities have different diagonal covariance matrices $\Sigma_k = \mathrm{diag}(s_{1k}^2, \ldots, s_{pk}^2)$. Note. Weighted gene voting of Golub et al. (1999) is a minor variant of DLDA for two classes (different variance calculation).
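A minimal DLDA sketch under the assumptions above (a common diagonal covariance, pooled across classes); with diagonal $\Sigma$ the quadratic rule reduces to comparing $\sum_j (x_j - \bar{x}_{kj})^2 / s_j^2$ across classes. Function names are illustrative:

```python
# DLDA sketch: pooled per-gene variances, one mean vector per class.
import numpy as np

def dlda_fit(X, y):
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    # Residuals about each class mean; pooled variance uses n - K d.o.f.
    resid = np.vstack([X[y == k] - X[y == k].mean(axis=0) for k in classes])
    var = (resid ** 2).sum(axis=0) / (len(X) - len(classes))
    return classes, means, var

def dlda_predict(model, X_new):
    classes, means, var = model
    # Discriminant: sum_j (x_j - mean_kj)^2 / var_j, minimized over k.
    d = ((X_new[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[d.argmin(axis=1)]
```

DQDA would be the same sketch with per-class variances (computed within each class) in place of the pooled `var`.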

Microarray Workshop 7 Classification with SVMs. A generalization of the idea of separating hyperplanes in the original space: linear boundaries between classes in a higher-dimensional space correspond to non-linear boundaries in the original space. Adapted from internet
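A short sketch of this idea with scikit-learn (the library choice and the simulated p >> n data are assumptions, not part of the slides): the RBF kernel implicitly maps to a higher-dimensional space, giving a non-linear boundary in the original coordinates.

```python
# Compare a linear SVM with an RBF-kernel SVM on simulated high-dimensional
# data (many more features than samples, as in microarray studies).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=60, n_features=100, n_informative=10,
                           random_state=0)
for kernel in ("linear", "rbf"):
    acc = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5).mean()
    print(kernel, round(acc, 2))
```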

Microarray Workshop 8 Nearest neighbor classification. Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation). The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows: -find the k observations in the learning set closest to X -predict the class of X by majority vote, i.e., choose the class that is most common among those k observations. The number of neighbors k can be chosen by cross-validation (more on this later).
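A sketch of choosing k by cross-validation, as suggested above, using scikit-learn on simulated data (both are assumptions; the default metric here is Euclidean, and one minus correlation would require a custom metric):

```python
# Choose the number of neighbors k by 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=80, n_features=50, random_state=1)
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```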

Microarray Workshop 9 Nearest neighbor rule

Microarray Workshop 10 Classification tree. Partition the feature space into a set of rectangles, then fit a simple model in each one. Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the measurement space X into two descendant subsets, starting with X itself. Each terminal subset is assigned a class label; the resulting partition of X corresponds to the classifier.

Microarray Workshop 11 Classification tree (example). [Figure: a tree whose root splits on Gene 1 (M_i1 < cutoff; yes/no) and then on Gene 2 (M_i2 > cutoff), with the corresponding rectangular partition of the (Gene 1, Gene 2) plane.]

Microarray Workshop 12 Three aspects of tree construction. Split selection rule: -Example: at each node, choose the split maximizing the decrease in impurity (e.g. Gini index, entropy, misclassification error). Split-stopping: -Example: grow a large tree, prune to obtain a sequence of subtrees, then use cross-validation to identify the subtree with the lowest misclassification rate. Class assignment: -Example: for each terminal node, choose the class minimizing the resubstitution estimate of misclassification probability, given that a case falls into this node. Supplementary slide
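A sketch of these three steps with scikit-learn (an assumption; the slides name no software): Gini splits, cost-complexity pruning to obtain the subtree sequence, cross-validation to pick the subtree, with class assignment by majority in each terminal node handled by the fitted tree.

```python
# Grow a large tree, then choose a pruned subtree by 5-fold CV over the
# cost-complexity pruning path (simulated data, for illustration only).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=20, random_state=2)
full = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
alphas = full.cost_complexity_pruning_path(X, y).ccp_alphas
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": alphas}, cv=5).fit(X, y)
print(search.best_params_)
```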

Microarray Workshop 13 Other classifiers include… Neural networks Projection pursuit Bayesian belief networks …

Microarray Workshop 14 Why select features? Can lead to better classification performance by removing variables that are noise with respect to the outcome. May provide useful insights into the etiology of a disease. Can eventually lead to diagnostic tests (e.g., a "breast cancer chip").

Microarray Workshop 15 Approaches to feature selection. Methods fall into three basic categories: -Filter methods -Wrapper methods -Embedded methods The simplest and most frequently used are the filter methods. Adapted from A. Hartemink

Microarray Workshop 16 Filter methods. Data in R^p → feature selection → R^s (s << p) → classifier design. Features are scored independently and the top s are used by the classifier. Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc. Easy to interpret; can provide some insight into the disease markers. Adapted from A. Hartemink
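A minimal filter sketch: score every gene with a two-sample t-statistic and keep the top s (the function name is illustrative; any of the scores listed above could replace the t-statistic):

```python
# Filter method: rank genes by |t| between the two classes, keep the top s.
import numpy as np
from scipy.stats import ttest_ind

def top_genes(X, y, s):
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0)
    return np.argsort(-np.abs(t))[:s]   # indices of the s top-scoring genes

# Usage: genes = top_genes(X, y, s=50); the classifier sees only X[:, genes].
```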

Microarray Workshop 17 Problems with filter methods. Redundancy in selected features: features are scored independently, not on the basis of whether they contribute new information. Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others). The classifier has no say in which features are used: some scores may be more appropriate in conjunction with some classifiers than others. Supplementary slide Adapted from A. Hartemink

Microarray Workshop 18 Wrapper methods. Data in R^p → feature selection → R^s (s << p) → classifier design. Iterative approach: many feature subsets are scored based on classification performance and the best is used. Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc. Adapted from A. Hartemink
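A sketch of greedy forward selection wrapped around a classifier (LDA here is an arbitrary choice and the function is illustrative): at each step, add the single feature that most improves cross-validated accuracy.

```python
# Greedy forward selection: score candidate subsets by 5-fold CV accuracy.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features, clf=None):
    clf = clf or LinearDiscriminantAnalysis()
    selected = []
    while len(selected) < n_features:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

This also shows the expense flagged on the next slide: every candidate subset requires building and cross-validating a classifier.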

Microarray Workshop 19 Problems with wrapper methods. Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated. No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only. Easy to overfit. Supplementary slide Adapted from A. Hartemink

Microarray Workshop 20 Embedded methods. Attempt to jointly or simultaneously train both a classifier and a feature subset. Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features. Intuitively appealing. Some examples: tree-building algorithms, shrinkage methods (LDA, kNN). Adapted from A. Hartemink
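One embedded example from the slide's list, tree building, sketched with scikit-learn (an assumption): fitting the tree performs feature selection as a by-product, and the fitted model reports which features it actually split on.

```python
# Embedded selection: the fitted tree uses only a handful of the 200 genes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=200, n_informative=5,
                           random_state=3)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("features used:", np.nonzero(tree.feature_importances_)[0])
```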

Microarray Workshop 21 Performance assessment. Any classification rule needs to be evaluated for its performance on future samples. It is almost never the case in microarray studies that a large independent population-based collection of samples is available at the initial classifier-building phase. One needs to estimate future performance based on what is available: often the same set that is used to build the classifier. Performance can be assessed by: -Cross-validation -Test set -Independent testing on a future dataset

Microarray Workshop 22 Performance assessment (I). Resubstitution estimation: error rate on the learning set. -Problem: downward bias. Test set estimation: 1) divide the learning set into two sub-sets, L and T, build the classifier on L and compute the error rate on T; or 2) build the classifier on the training set (L) and compute the error rate on an independent test set (T). -L and T must be independent and identically distributed (i.i.d.). -Problem: reduced effective sample size. Supplementary slide

Microarray Workshop 23 Performance assessment (II). V-fold cross-validation (CV) estimation: cases in the learning set are randomly divided into V subsets of (nearly) equal size. Classifiers are built leaving one subset out; test error rates are computed on the left-out subset and averaged. -Bias-variance tradeoff: smaller V can give larger bias but smaller variance. -Computationally intensive. Leave-one-out cross-validation (LOOCV) is the special case V = n. Works well for stable classifiers (k-NN, LDA, SVM). Supplementary slide
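A sketch of V-fold CV (V = 5) and LOOCV error estimates for a fixed classifier, using scikit-learn on simulated data (library and data are assumptions):

```python
# Compare 5-fold CV and leave-one-out error estimates for a fixed LDA rule.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=60, n_features=30, random_state=4)
clf = LinearDiscriminantAnalysis()
cv_err = 1 - cross_val_score(clf, X, y, cv=5).mean()
loo_err = 1 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
print(round(cv_err, 2), round(loo_err, 2))
```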

Microarray Workshop 24 Performance assessment (III). It is common practice to do feature selection on the full learning set, and then use CV only for model building and classification. However, usually the features are unknown in advance and the intended inference includes feature selection; then CV estimates as above tend to be downward biased. Features (variables) should be selected anew within each fold, only from the portion of the learning set used to build the model (and not from the entire set), as sketched below.
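A hedged sketch of the correct protocol with a scikit-learn Pipeline (an assumption): putting the filter inside the pipeline makes cross_val_score redo the gene selection within every fold, so the held-out samples never influence which genes are chosen.

```python
# Feature selection inside each CV fold: the selector is refit on the
# training portion of every fold, avoiding the downward bias described above.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=60, n_features=500, n_informative=5,
                           random_state=5)
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LinearDiscriminantAnalysis())])
# cross_val_score refits the whole pipeline, selection included, per fold.
print(round(1 - cross_val_score(pipe, X, y, cv=5).mean(), 2))
```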