Boosting For Tumor Classification With Gene Expression Data


Boosting For Tumor Classification With Gene Expression Data
Marcel Dettling and Peter Bühlmann, Bioinformatics, Vol. 19, No. 9, 2003, pp. 1061-1069
Presented by Hojung Cho, Topics for Bioinformatics, Oct 10, 2006

Outline
Background: microarrays, scoring algorithm, decision stumps, boosting (primer)
Methods: feature preselection, LogitBoost, choice of the stopping parameter, multiclass approach
Results: data preprocessing, error rates, ROC curves, validation of the results, simulation
Discussion

Microarray data (p >> n) (Park et al., 2001)
Boosting of decision trees has been applied to the classification of gene expression data (Ben-Dor et al., 2000; Dudoit et al., 2002)
"AdaBoost did not yield good results compared to other classifiers"

The objective
Improve the performance of boosting for the classification of gene expression data by modifying the algorithm
The strategies:
Feature preselection with a nonparametric scoring method
Binary LogitBoost with decision stumps
Multiclass approach: reduce the problem to multiple binary classifications

Background: Scoring Algorithm
Nonparametric: allows data analysis without assuming an underlying distribution
Used for feature preselection: score each gene according to its strength for phenotype discrimination and consider only genes that are differentially expressed across samples

Scoring Algorithm
Sort the expression levels of a gene; group membership then appears as a sequence of 0's and 1's
The correspondence between expression levels and group membership is measured by how well the 0's and 1's cluster together
The score is the smallest number of swaps of consecutive digits needed to arrive at a perfect splitting (in this example: score = 4); this amounts to a pairwise, rank-based comparison of the two groups
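
A minimal Python sketch of this swap counting, assuming a two-class problem with labels coded 0/1 and the expression values of a single gene: the smallest number of swaps of consecutive digits needed to reach one fixed perfect splitting equals the number of label inversions after sorting by expression level.

import numpy as np

def score(expression, labels):
    """Swap-counting score for one gene (labels in {0, 1}).

    Sort samples by expression level; the minimum number of adjacent swaps
    needed to reach the arrangement 0...01...1 equals the number of
    (1 before 0) inversions in the sorted label sequence.
    """
    sorted_labels = np.asarray(labels)[np.argsort(expression)]
    swaps, ones_seen = 0, 0
    for lab in sorted_labels:
        if lab == 1:
            ones_seen += 1
        else:
            swaps += ones_seen  # each preceding 1 must be swapped past this 0
    return swaps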

Background: Scoring Algorithm
Allows ordering of genes according to their potential significance; captures to what extent a gene discriminates the response categories
Both a score near zero and a score near the maximum n0*n1 indicate a differentially expressed gene
This yields a quality measure per gene; the boosting classifier is restricted to work with the subset of top-quality genes
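
Continuing the sketch above, a quality measure consistent with this description (large when the score is near 0 or near its maximum n0*n1) is max(s, n0*n1 - s); the exact functional form used in the paper may differ, but the ranking idea is the same.

def quality(expression, labels):
    """Large when the gene separates the two classes well in either direction."""
    labels = np.asarray(labels)
    n1 = int(labels.sum())
    n0 = len(labels) - n1
    s = score(expression, labels)
    return max(s, n0 * n1 - s)

def preselect_genes(X, labels, n_keep):
    """Rank the columns (genes) of X by quality and keep the top n_keep."""
    q = np.array([quality(X[:, g], labels) for g in range(X.shape[1])])
    return np.argsort(q)[::-1][:n_keep]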

Background: Decision Stumps
Decision trees with only a single split: the presence or absence of a single condition acts as the predicate, e.g. "predict basketball player if and only if height > 2 m"
Used as the base learner for the boosting procedure
A weak learner is a subroutine that returns a hypothesis for a given finite training set and performs only slightly better than random guessing; its performance may be enhanced by combining it with feature preselection
[Slide figure: a stump tests a single attribute and assigns Label A or Label B]
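
A sketch of the weak learner in the form LogitBoost needs it, i.e. a regression stump fit by weighted least squares; the class interface is an assumption, chosen so that it plugs into the LogitBoost sketch later in these notes.

import numpy as np

class Stump:
    """Regression stump: one split on one feature, a constant value on each side."""

    def fit(self, X, z, w):
        best_err = np.inf
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                left = X[:, j] <= t
                right = ~left
                if not left.any() or not right.any():
                    continue  # skip degenerate splits
                c_left = np.sum(w[left] * z[left]) / np.sum(w[left])
                c_right = np.sum(w[right] * z[right]) / np.sum(w[right])
                pred = np.where(left, c_left, c_right)
                err = np.sum(w * (z - pred) ** 2)  # weighted squared error
                if err < best_err:
                    best_err = err
                    self.feature, self.threshold = j, t
                    self.c_left, self.c_right = c_left, c_right
        return self

    def predict(self, X):
        return np.where(X[:, self.feature] <= self.threshold,
                        self.c_left, self.c_right)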

Background: Boosting
AdaBoost "does not yield very impressive results" on this kind of data
LogitBoost relies on the binomial log-likelihood as its loss function
It has been found to have a slight edge over AdaBoost in many classification problems
It usually performs better on noisy data, or when there are misspecifications or inhomogeneities of the class labels in the training data (common in microarray studies)
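
For reference, the two loss functions behind this comparison, in the additive logistic regression view of boosting of Friedman, Hastie and Tibshirani (labels y in {-1, +1}, committee function F(x)):

L_{Ada}(y, F) = \exp(-y F(x)), \qquad L_{Logit}(y, F) = \log\bigl(1 + \exp(-2 y F(x))\bigr)

The binomial log-likelihood grows only linearly in the margin for badly misclassified points, while the exponential loss grows exponentially, which is one explanation for LogitBoost's robustness to noisy or mislabeled samples.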

Methods
Rank and select features: choose the genes with the highest values of the quality measure; the number of selected genes can be determined by cross-validation
Train the LogitBoost classifier, using decision stumps as weak learners
Choice of the stopping parameter: leave-one-out cross-validation, selecting the number of boosting iterations m that maximizes the cross-validated log-likelihood l(m)
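
A sketch of the leave-one-out choice of the stopping parameter, written against hypothetical helpers fit_fn(X, y, m) and proba_fn(model, X) rather than a specific implementation (a concrete LogitBoost sketch follows the next slide); m is chosen to maximize the cross-validated binomial log-likelihood l(m).

import numpy as np

def choose_m_by_loocv(X, y, m_grid, fit_fn, proba_fn):
    """Return the m in m_grid maximizing the leave-one-out log-likelihood."""
    n = len(y)
    loglik = np.zeros(len(m_grid))
    for i in range(n):
        train = np.arange(n) != i
        for k, m in enumerate(m_grid):
            model = fit_fn(X[train], y[train], m)
            p = np.clip(proba_fn(model, X[i:i + 1])[0], 1e-12, 1 - 1e-12)
            loglik[k] += y[i] * np.log(p) + (1 - y[i]) * np.log(1 - p)
    return m_grid[int(np.argmax(loglik))]

In practice one would fit a single boosting path per left-out sample and evaluate it at every intermediate iteration rather than refitting for each candidate m; the double loop here just keeps the idea explicit.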

LogitBoost Algorithm
With p(x) = P[Y = 1 | X = x], initialize p^(0)(x) = 1/2 and F^(0)(x) = 0
Exercise: show the first step of the iteration when i = 1
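
A minimal sketch of two-class LogitBoost with stumps, following the standard algorithm of Friedman, Hastie and Tibshirani (2000) on which the paper builds; Stump is the regression stump sketched earlier, and the weight clipping is a numerical simplification.

import numpy as np

def logitboost_fit(X, y, m_stop):
    """Binary LogitBoost: y in {0, 1}, starting from F(0) = 0 and p(0) = 1/2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    F = np.zeros(n)
    p = np.full(n, 0.5)
    stumps = []
    for _ in range(m_stop):
        w = np.clip(p * (1 - p), 1e-5, None)   # working weights w_i = p(1 - p)
        z = (y - p) / w                        # working response z_i
        stump = Stump().fit(X, z, w)           # weighted least-squares fit
        stumps.append(stump)
        F += 0.5 * stump.predict(X)            # update the committee F
        p = 1.0 / (1.0 + np.exp(-2.0 * F))     # p(x) = e^F / (e^F + e^-F)
    return stumps

def logitboost_predict_proba(stumps, X):
    F = 0.5 * sum(s.predict(X) for s in stumps)
    return 1.0 / (1.0 + np.exp(-2.0 * F))

At the very first iteration p = 1/2 everywhere, so the weights are all 1/4 and the working response reduces to z_i = 4(y_i - 1/2) = +/-2, which is the first step the slide asks for.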

Reducing multiclass to binary
Multiclass -> multiple binary problems: match each class against all the other classes (one-against-all)
Combine the binary probabilities into multiclass probabilities: probability estimates for Y = j are obtained via normalization and plugged into the Bayes classifier
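
In symbols, with \hat{p}_j(x) denoting the probability estimate from the binary classifier for class j against the rest, the normalization and plug-in Bayes rule described above are:

p_j(x) = \frac{\hat{p}_j(x)}{\sum_{k=1}^{J} \hat{p}_k(x)}, \qquad \hat{C}(x) = \arg\max_{j \in \{1,\dots,J\}} p_j(x)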

Sample data sets
Leukemia: 47 ALL vs 25 AML, 3571 genes, Affymetrix oligo chip
Colon: 40 tumor vs 22 normal, 6500 genes
Estrogen and Nodal: 25 ER+ vs 24 ER- samples, 7129 genes
Lymphoma: 42 DLBCL, 9 follicular, 11 chronic, 4026 genes, cDNA chip
NCI: 7 breast, 5 CNS, 7 colon, 6 leukemia, 8 melanoma, 9 NSCLC, 6 ovarian, 9 renal, 5244 genes

Data preprocessing
Leukemia, NCI: thresholding (floor 100, ceiling 16000), filtering (fold change > 5, max - min > 500), log transform, normalization
Colon: log transform, normalization
Estrogen and Nodal: thresholding, log transform, normalization
Lymphoma: normalization
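
A sketch of the Leukemia/NCI-style preprocessing listed above; standardizing each gene to zero mean and unit variance, and using a base-10 log, are assumptions about what "normalization" and "log transform" mean here.

import numpy as np

def preprocess(X, floor=100.0, ceiling=16000.0,
               min_fold_change=5.0, min_range=500.0):
    """X: samples x genes matrix of raw intensities."""
    # 1. Thresholding
    X = np.clip(X, floor, ceiling)
    # 2. Filtering: keep genes with max/min > 5 and max - min > 500
    gmax, gmin = X.max(axis=0), X.min(axis=0)
    keep = (gmax / gmin > min_fold_change) & (gmax - gmin > min_range)
    X = X[:, keep]
    # 3. Log transform
    X = np.log10(X)
    # 4. Normalization (assumed: standardize each gene to mean 0, sd 1)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    return X, keep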

Results – Error Rates (1): test errors using symmetric (equal) misclassification costs

Results – Error Rates (2)

Results: Number of Iterations & Performance
The choice of the stopping parameter for boosting is not very critical in any of the six datasets
"Stopping after a large, but arbitrary number of 100 iterations is a reasonable strategy" for microarray data

Results: ROC curves
Test errors using asymmetric misclassification costs
Both boosting classifiers yield curves closer to the ideal ROC curve (red line) than the curve from classification trees
Boosting has an advantage at small false negative rates
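
A decision-theoretic reading of the asymmetric-cost setting (an assumption about how the curves are generated, not a detail spelled out on the slide): with cost c_{FP} for a false positive and c_{FN} for a false negative, the cost-minimizing rule predicts the positive class whenever

\hat{p}(Y = 1 \mid x) > \frac{c_{FP}}{c_{FP} + c_{FN}},

and sweeping this threshold over (0, 1) traces out the ROC curve.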

Validation of the results
Comparison with published error rates (columns in the original table: this study, AdaBoost, SVM, others):
Leukemia: this study 1/34; AdaBoost (Ben-Dor et al.) 2.78±1.39%; SVM (Furey et al.) 2-4/34; others (Golub et al.) 5/34
Colon: this study 12.90-14.52%; AdaBoost 17.74±9.68%; SVM 9.68%
NCI: this study 22.9%; 48% (Dudoit et al.)
Estrogen and Nodal: better predictions than the Bayesian approach of West et al.
Lymphoma: N/A

Multiclass vs. one-against-all error rates:
Disease     # of classes   Multiclass (Friedman et al.)   One-against-all
NCI         8              36.10%                         22.90%
Lymphoma    3              8.06%                          1.61%

Simulation
Model dataset: gene expression profiles are drawn from a multivariate normal distribution, with the covariance matrix estimated from the colon dataset
Each sample is assigned one of two response classes according to a Bernoulli distribution
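
A sketch of such a model dataset, assuming the mean vector and covariance matrix have already been estimated from the colon data; the logistic response mechanism in a few informative genes is purely illustrative, not necessarily the paper's exact design.

import numpy as np

def simulate_dataset(mean, cov, n_samples, informative_genes, beta, rng=None):
    """Draw expression profiles from N(mean, cov) and Bernoulli class labels."""
    rng = np.random.default_rng(rng)
    X = rng.multivariate_normal(mean, cov, size=n_samples)
    # Class probability: logistic in a few informative genes (illustrative choice)
    eta = X[:, informative_genes] @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    y = rng.binomial(1, p)
    return X, y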

Conclusion
Feature preselection generally improved predictive power
LogitBoost performed slightly better than AdaBoost
Reducing multiclass problems to multiple binary problems yielded more accurate results

Discussion
The edge of LogitBoost over AdaBoost is marginal and "far from significant"
Did feature preselection really improve performance, or was the setup tuned to make LogitBoost perform better?
On cross-validating algorithms against published results: authors may have considerations other than raw performance on the training datasets, and leave-one-out is only one way to cross-validate
Biological interpretation remains an open point