Boosting for tumor classification

Boosting for tumor classification with gene expression data
Kfir Barhum

Overview – Today:
- Review of the Classification Problem
- Features (genes) Preselection
- Weak Learners & Boosting
- The LogitBoost Algorithm
- Reduction to the Binary Case
- Errors & ROC Curves
- Simulation Results

Classification Problem
Given n training data pairs (X_1, Y_1), ..., (X_n, Y_n), with X_i ∈ R^p and Y_i ∈ {0, ..., J−1}, where
X – feature vector of p features (gene expression values)
Y – class label
Typically n is between 20 and 80 samples, while p varies from 2,000 to 20,000.

Our Goal
Construct a classifier C: R^p → {0, 1}, from which a new tissue sample is classified based on its expression vector X. For the optimal C, the misclassification probability P[C(X) ≠ Y] is minimal. We first handle only binary problems, for which Y ∈ {0, 1}.
Problem: p >> n, so we use boosting in conjunction with decision trees!

Features (genes) Preselection
Problem: p >> n, the sample size is much smaller than the feature dimension (the number of genes p), and many genes are irrelevant for discrimination. One option is dimensionality reduction, which was discussed earlier. Here, instead, we score each individual gene g ∈ {1, ..., p} according to its strength for phenotype discrimination.

Features (genes) Preselection
Denote by x_{g,i} the expression value of gene g for individual i, and let N_0 and N_1 (of sizes n_0 and n_1) be the sets of indices having response Y = 0 and Y = 1, respectively. Define
s(g) = Σ_{i ∈ N_0} Σ_{j ∈ N_1} 1[ x_{g,i} − x_{g,j} < 0 ],
which counts, for each sample with Y = 0, the number of samples with Y = 1 whose expression difference with it is negative.

Features (genes) Preselection
A gene does not discriminate if its score is about n_0·n_1 / 2; it discriminates best when s(g) = n_0·n_1 or even s(g) = 0. Therefore define the quality measure
q(g) = | s(g) − n_0·n_1 / 2 |.
We then simply take the genes with the highest values of q(g) as our top features; the number of preselected genes can be chosen formally via cross-validation.
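
To make this preselection step concrete, here is a minimal NumPy sketch (the function name `preselect_genes` and its arguments are ours, not from the slides): it computes s(g) by counting class-0/class-1 pairs with a negative expression difference, ranks genes by q(g), and keeps the top-scoring ones.

```python
import numpy as np

def preselect_genes(X, y, n_keep):
    """Rank genes by the pairwise score s(g) and keep the n_keep strongest.

    X : (n_samples, n_genes) expression matrix
    y : binary class labels in {0, 1}
    """
    y = np.asarray(y)
    X0, X1 = X[y == 0], X[y == 1]                   # samples of each class
    n0, n1 = len(X0), len(X1)
    # s(g): number of pairs (i in class 0, j in class 1) with x_{g,i} < x_{g,j}
    s = np.array([np.sum(X0[:, g][:, None] < X1[:, g][None, :])
                  for g in range(X.shape[1])])
    # Uninformative genes score near n0*n1/2; the best ones score 0 or n0*n1.
    q = np.abs(s - n0 * n1 / 2)
    return np.argsort(q)[::-1][:n_keep]             # indices of the top genes
```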

Weak Learners & Boosting
Suppose we had a "weak" learner which can learn the data and make a reasonable estimate. Problem: our learner has an error rate which is too high for us. We search for a method to "boost" those weak classifiers.

Weak Learners & Boosting
Introducing: Boosting! Create an accurate combined classifier from a sequence of weak classifiers. Weak classifiers are fitted to iteratively reweighted versions of the data. In each boosting iteration m, with m = 1, ..., M:
- observations that were misclassified at the previous step have their weights increased
- observations that were classified correctly have their weights decreased

Weak Learners & Boosting
The m-th weak classifier f^(m) is thus forced to concentrate on the individual inputs that were classified wrongly at earlier iterations. Now, suppose we have remapped the output classes Y(x) into {−1, 1} instead of {0, 1}. We have M different classifiers. How shall we combine them into a stronger one?

Weak Learners & Boosting
"The Committee": define the combined classifier as a weighted majority vote of the "weak" classifiers,
C(x) = sign( Σ_{m=1..M} α_m f^(m)(x) ).
Points which still need to be clarified to fully specify the algorithm:
i) which weak learners shall we use?
ii) how to reweight the data, and how to choose the aggregation weights α_m
iii) how many iterations (choosing M)?

Weak Learners & Boosting
Which type of "weak" learners? In our case we use a special kind of decision tree, called a stump: a tree with only two terminal nodes. Stumps are simple "rules of thumb" which test on a single attribute.
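
As an illustration of how simple these weak learners are, a stump is just a single threshold test on one attribute; a small sketch (the class and attribute names, as well as the example numbers, are ours):

```python
import numpy as np

class Stump:
    """A two-leaf decision tree: one gene, one threshold, one answer per side."""
    def __init__(self, gene, threshold, left_value, right_value):
        self.gene = gene                  # index of the single attribute tested
        self.threshold = threshold        # split point on that gene's expression
        self.left_value = left_value      # output when expression <= threshold
        self.right_value = right_value    # output when expression >  threshold

    def predict(self, X):
        return np.where(X[:, self.gene] > self.threshold,
                        self.right_value, self.left_value)

# A hypothetical rule of thumb: "call the sample tumorous (1) when gene 17
# is expressed above 1.3, healthy (0) otherwise".
rule = Stump(gene=17, threshold=1.3, left_value=0, right_value=1)
```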

Weak Learners & Boosting
The Additive Logistic Regression Model. Consider the logistic model
p(x) = P[Y = 1 | X = x] = exp(F(x)) / ( exp(F(x)) + exp(−F(x)) ).
This logit-type transformation guarantees that for any F(x), p(x) is a probability in [0, 1]. Inverting, we get
F(x) = (1/2) log( p(x) / (1 − p(x)) ).

LogitBoost Algorithm
So how do we update the weights? We define a loss function and follow the gradient descent principle. AdaBoost uses the exponential loss
L(y, F) = exp(−y F(x)),   y ∈ {−1, 1},
whereas LogitBoost uses the binomial log-likelihood. Let y* = (y + 1)/2 ∈ {0, 1} be the {0,1}-coded response and define
l(y, p(x)) = y* log p(x) + (1 − y*) log(1 − p(x)),
with p(x) obtained from F(x) as on the previous slide.

LogitBoost Algorithm

LogitBoost Algorithm
Step 1: Initialization. Set the committee function F^(0)(x) ≡ 0 and the initial probabilities p(x_i) = 1/2 for all i.
Step 2: LogitBoost iterations. For m = 1, 2, ..., M repeat:

LogitBoost Algorithm
A. Fitting the weak learner. Compute the working response and weights for i = 1, ..., n:
z_i = ( y*_i − p(x_i) ) / ( p(x_i)(1 − p(x_i)) ),   w_i = p(x_i)(1 − p(x_i)).
Fit a regression stump f^(m) to the z_i by weighted least squares with weights w_i.

LogitBoost Algorithm
B. Updating and classifier output. Update
F^(m)(x) = F^(m−1)(x) + (1/2) f^(m)(x),   p(x) = exp(F^(m)(x)) / ( exp(F^(m)(x)) + exp(−F^(m)(x)) ),
and after the last iteration output the classifier C(x) = sign(F^(M)(x)), i.e. class 1 whenever p(x) > 1/2.
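
Putting steps A and B together, here is a minimal NumPy sketch of the whole LogitBoost loop with stumps. The function names, the exhaustive search over split points, and the clipping of weights and working responses (a common numerical safeguard) are our choices, not something specified on the slides; y plays the role of the {0,1}-coded response y*.

```python
import numpy as np

def fit_stump_wls(X, z, w):
    """Weighted least-squares regression stump: returns (gene, threshold, c_left, c_right)."""
    best, best_err = None, np.inf
    for g in range(X.shape[1]):
        for t in np.unique(X[:, g])[:-1]:                 # candidate split points
            left = X[:, g] <= t
            c_l = np.average(z[left], weights=w[left])    # optimal leaf values are the
            c_r = np.average(z[~left], weights=w[~left])  # weighted means of z on each side
            err = np.sum(w * (z - np.where(left, c_l, c_r)) ** 2)
            if err < best_err:
                best_err, best = err, (g, t, c_l, c_r)
    return best

def stump_predict(stump, X):
    g, t, c_l, c_r = stump
    return np.where(X[:, g] <= t, c_l, c_r)

def logitboost_stumps(X, y, M=100):
    """Binary LogitBoost (y in {0,1}) with stumps; returns the fitted committee."""
    y = np.asarray(y, dtype=float)
    F = np.zeros(len(y))                           # committee function, F^(0) = 0
    p = np.full(len(y), 0.5)                       # initial probabilities
    committee = []
    for _ in range(M):
        w = np.clip(p * (1 - p), 1e-5, None)       # weights w_i
        z = np.clip((y - p) / w, -4, 4)            # working response z_i, clipped
        stump = fit_stump_wls(X, z, w)
        F += 0.5 * stump_predict(stump, X)         # F <- F + f/2
        p = 1.0 / (1.0 + np.exp(-2.0 * F))         # p = e^F / (e^F + e^-F)
        committee.append(stump)
    return committee

def predict_proba(committee, X):
    F = sum(0.5 * stump_predict(s, X) for s in committee)
    return 1.0 / (1.0 + np.exp(-2.0 * F))          # classify as 1 when this exceeds 1/2
```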

LogitBoost Algorithm
Choosing the stopping parameter M. Overfitting occurs when the model no longer concentrates on the general aspects of the problem but on the specifics of its own training set. In general, boosting is quite resistant to overfitting, so a fixed choice such as M = 100 is usually good enough. Alternatively, one can compute the binomial log-likelihood at each iteration and stop where it is approximately maximal.

Reduction to the Binary Case
Our algorithm handles only binary classification. For J > 2 classes, we simply reduce to J binary problems as follows: define the j-th problem as
Y^(j) = 1 if Y = j, and Y^(j) = 0 otherwise.

Reduction to the Binary Case
Now we run the entire procedure J times, including feature preselection and estimation of the stopping parameter, once per binary problem; different classes may preselect different features (genes). This yields probability estimates
p̂_j(x) ≈ P[ Y^(j) = 1 | X = x ]   for j = 1, ..., J.

Reduction to the Binary Case
These can be converted into probability estimates for the event Y = j via normalization:
P̂[ Y = j | X = x ] = p̂_j(x) / Σ_{k=1..J} p̂_k(x).
Note that there also exists a LogitBoost algorithm for J > 2 classes which treats the multiclass problem simultaneously; here it yielded error rates more than 1.5 times higher.
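
A hedged sketch of this one-vs-rest reduction (the helper `fit_binary` stands for any binary probability estimator, e.g. the LogitBoost sketch above run together with its own gene preselection for each class; the function names are ours):

```python
import numpy as np

def one_vs_rest_probabilities(X_train, y_train, X_new, fit_binary):
    """Solve J binary problems Y^(j) = 1[Y = j] and renormalise the estimates.

    fit_binary(X, y01, X_new) must return estimates of P(Y^(j)=1 | x) for the
    rows of X_new; each call may do its own feature preselection and stopping.
    """
    classes = np.unique(y_train)
    # p_hat[:, j] ~ P(Y^(j) = 1 | x) from the j-th binary problem
    p_hat = np.column_stack([fit_binary(X_train, (y_train == j).astype(int), X_new)
                             for j in classes])
    # Normalise so the J estimates sum to one over the classes.
    return p_hat / p_hat.sum(axis=1, keepdims=True)
```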

Errors & ROC Curves
We measure errors by leave-one-out cross-validation: for i = 1 to n,
- set aside the i-th observation;
- carry out the whole process (i.e. feature selection and classifier fitting) on the remaining n − 1 data points;
- predict the class label for the i-th observation.
Now define the error estimate as the fraction of misclassified left-out observations,
error = (1/n) Σ_{i=1..n} 1[ Ĉ^(−i)(x_i) ≠ y_i ].
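
A minimal sketch of this leave-one-out loop (the callback name `fit_and_predict` is ours; it must rerun the entire pipeline, preselection included, on each reduced training set and return a single predicted label):

```python
import numpy as np

def loo_error(X, y, fit_and_predict):
    """Leave-one-out estimate of the misclassification rate."""
    y = np.asarray(y)
    n = len(y)
    wrong = 0
    for i in range(n):
        keep = np.arange(n) != i                            # drop the i-th observation
        y_hat = fit_and_predict(X[keep], y[keep], X[i:i + 1])
        wrong += int(y_hat != y[i])                         # count misclassifications
    return wrong / n
```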

Errors & ROC Curves
False positive error: classifying a negative case as a positive one. False negative error: classifying a positive case as a negative one. Our algorithm uses equal misclassification costs, i.e. it punishes false-positive and false-negative errors equally. Question: should this be the situation?

Recall Our Problem… No! In our case a false positive means we diagnosed a normal tissue as a tumorous one; probably further tests will be carried out. A false negative means we just classified a tumorous tissue as a healthy one; the outcome might be deadly.

Errors & ROC Curves
ROC curves illustrate how accurate classifiers are under asymmetric losses. Each point corresponds to a specific probability chosen as the threshold for positive classification, and the curve shows the tradeoff between false positive and false negative errors. The closer the curve is to the point (0, 1) on the graph, the better the test. ROC stands for Receiver Operating Characteristic; the term comes from a field called signal detection theory, developed in World War II, when radar operators had to decide whether a signal was a friendly ship, an enemy, or just background noise.

Errors & ROC Curves
ROC curve for the colon data without feature preselection. X-axis: fraction of negative examples classified as positive (tumorous); Y-axis: fraction of positives classified correctly. Each point on the graph corresponds to a threshold β chosen from [0, 1] for positive classification.
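
The points of such a curve can be produced by sweeping the threshold β and recording the two rates; a small sketch, assuming `p_hat` holds (cross-validated) probabilities of the tumour class (function and argument names are ours):

```python
import numpy as np

def roc_points(p_hat, y, n_thresholds=101):
    """Return (false-positive rate, true-positive rate) pairs for thresholds in [0, 1]."""
    y = np.asarray(y)
    points = []
    for beta in np.linspace(0.0, 1.0, n_thresholds):
        pred = (p_hat >= beta).astype(int)                              # positive when p_hat >= beta
        fpr = np.sum((pred == 1) & (y == 0)) / max(np.sum(y == 0), 1)   # healthy called tumorous
        tpr = np.sum((pred == 1) & (y == 1)) / max(np.sum(y == 1), 1)   # tumours correctly detected
        points.append((fpr, tpr))
    return points
```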

Simulation
The algorithm worked better than benchmark methods on the real data we examined, but real datasets are hard and expensive to obtain, and relevant differences between discrimination methods might be difficult to detect on them. So let us simulate gene expression data with a large dataset.

Simulation
Produce gene expression profiles from a multivariate normal distribution whose covariance structure is taken from the colon dataset; we took p = 2000 genes. Then assign one of two response classes, with probabilities specified as follows.

Simulation
The conditional probabilities are taken as follows: for j = 1, ..., 10, pick a set C_j of genes whose size is uniformly random on {1, ..., 10}, and use the mean expression values across these randomly chosen genes. The expected number of relevant genes is therefore 10 · 5.5 = 55. The values are picked from normal distributions with standard deviations 2, 1 and 0.5, respectively.

Simulation
The training size was set to n = 200 samples, and the classifiers were tested on 1000 new observations. The whole process was repeated 20 times and compared against two well-known benchmarks: 1-nearest-neighbor and a classification tree. LogitBoost did better than both, even with an arbitrarily fixed number of iterations (150).

Results
The boosting method was tried on 6 publicly available datasets (Leukemia, Colon, Estrogen, Nodal, Lymphoma, NCI). The data was processed and tested against other benchmarks: AdaBoost, 1-nearest-neighbor and a classification tree. On all 6 datasets the choice of the actual stopping parameter did not matter much, and the choice of 100 iterations did fairly well.

Results
Tests were made for several numbers of preselected features, as well as for using all of them. When all genes are used, the classical 1-nearest-neighbor method is disturbed by noise variables, and the boosting methods outperform it.

Results

-fin-