Boosting for Tumor Classification with Gene Expression Data. Marcel Dettling and Peter Bühlmann, Bioinformatics, Vol. 19, No. 9, 2003, 1061-1069. Presented by Hojung Cho, Topics for Bioinformatics, Oct 10, 2006
Outline
Background: microarrays, scoring algorithm, decision stumps, boosting (primer)
Methods: feature preselection, LogitBoost, choice of the stopping parameter, multiclass approach
Results: data preprocessing, error rates, ROC curves, validation of the results, simulation
Discussion
Microarray data (p ≫ n) (Park et al., 2001). Boosting of decision trees has previously been applied to the classification of gene expression data (Ben-Dor et al., 2000; Dudoit et al., 2002): "AdaBoost did not yield good results compared to other classifiers."
The objective: improve the performance of boosting for the classification of gene expression data by modifying the algorithm. The strategies: feature preselection via a nonparametric scoring method; binary LogitBoost with decision stumps; a multiclass approach that reduces the problem to multiple binary classifications.
Background: Scoring Algorithm. Nonparametric: allows data analysis without assuming an underlying distribution. Used for feature preselection: score each gene according to its strength for phenotype discrimination, and consider only genes that are differentially expressed across samples.
Scoring Algorithm. Sort the expression levels of a gene across all samples; group membership then yields a sequence of 0's and 1's. How well expression levels correspond to group membership is measured by how well these labels cluster together: the score is defined as the smallest number of swaps of consecutive digits necessary to arrive at a perfect splitting (in the slide's example, score = 4). This amounts to counting pairwise rank comparisons between the two groups, i.e. a Wilcoxon/Mann-Whitney-type statistic.
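A minimal sketch of that swap count, assuming class labels coded 0/1 and the convention that the "perfectly split" target puts class 0 first (the opposite convention simply turns a score s into n0·n1 − s):

```python
import numpy as np

def swap_score(x, y):
    """Swap score of one gene: sort the samples by expression level, read off
    the induced 0/1 class labels, and count the adjacent swaps needed to move
    every 0 in front of every 1 (equivalently, the number of out-of-order
    (1, 0) pairs)."""
    order = np.argsort(x)              # sort samples by expression value
    labels = np.asarray(y)[order]      # induced sequence of 0's and 1's
    swaps, ones_seen = 0, 0
    for lab in labels:
        if lab == 1:
            ones_seen += 1
        else:                          # each 1 to the left must hop over this 0
            swaps += ones_seen
    return swaps
```

A score of 0 or of n0·n1 means the gene separates the two classes perfectly in one direction or the other.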
Background: Scoring Algorithm. Allows ordering of genes according to their potential significance and captures to what extent a gene discriminates the response categories. Both a near-zero score and a score near the maximum n0·n1 indicate a differentially expressed gene. The resulting quality measure is used to restrict the boosting classifier to work with this subset of genes.
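A sketch of the resulting preselection, reusing swap_score from the sketch above; taking the quality measure as q = max(s, n0·n1 − s) is my reading of "both near-zero and maximal scores are informative", not a quote of the paper's exact formula:

```python
import numpy as np

def preselect_genes(X, y, n_keep=200):
    """Rank genes by a quality measure that treats near-zero and near-maximal
    swap scores as equally informative, then keep the top-ranked genes.
    X has shape (n_samples, n_genes); y holds 0/1 class labels."""
    y = np.asarray(y)
    n0, n1 = int(np.sum(y == 0)), int(np.sum(y == 1))
    scores = np.array([swap_score(X[:, g], y) for g in range(X.shape[1])])
    quality = np.maximum(scores, n0 * n1 - scores)   # assumed form of q(g)
    return np.argsort(quality)[::-1][:n_keep]        # indices of selected genes
```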
Background: Decision Stumps. Decision trees with only a single split: the test is the presence or absence of a single condition ("predict basketball player if and only if height > 2 m"), and each branch ends in a class label. They serve as the base learner for the boosting procedure. A weak learner is a subroutine that returns a hypothesis for any finite training set and performs only slightly better than random guessing; it may be enhanced when combined with feature preselection.
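An illustrative weighted stump fitted by exhaustive search over features and thresholds; the weight argument is what a boosting wrapper would supply at each round (a generic sketch, not the paper's code):

```python
import numpy as np

def fit_stump(X, y, w=None):
    """Fit a single-split classifier: for every feature and every observed
    threshold, predict one class above the split and the other below, and keep
    the split with the smallest weighted misclassification error."""
    n, p = X.shape
    w = np.full(n, 1.0 / n) if w is None else np.asarray(w, float) / np.sum(w)
    best = {"err": np.inf}
    for j in range(p):
        for thr in np.unique(X[:, j]):
            above = (X[:, j] > thr).astype(int)
            for pred, flipped in ((above, False), (1 - above, True)):
                err = float(np.sum(w * (pred != y)))
                if err < best["err"]:
                    best = {"err": err, "feature": j, "thr": thr, "flipped": flipped}
    return best
```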
Background: Boosting. "AdaBoost does not yield very impressive results." LogitBoost relies on the binomial log-likelihood as its loss function and has been found to have a slight edge over AdaBoost in many classification problems. It usually performs better on noisy data or when there are misspecifications or inhomogeneities of the class labels in the training data (both common in microarray studies).
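For reference, the two loss functions being contrasted, in the notation of Friedman, Hastie and Tibshirani (2000), with ensemble F(x), label y ∈ {0, 1}, and p(x) = e^{F(x)} / (e^{F(x)} + e^{-F(x)}):

$$
-\ell\bigl(y, p(x)\bigr) = -\Bigl[\, y \log p(x) + (1 - y)\,\log\bigl(1 - p(x)\bigr) \Bigr] \quad\text{(LogitBoost)},
\qquad
L\bigl(y, F(x)\bigr) = \exp\bigl(-(2y - 1)\,F(x)\bigr) \quad\text{(AdaBoost)}.
$$

The binomial log-likelihood grows only linearly in the margin of a badly misclassified sample, whereas the exponential loss grows exponentially, which is one explanation for LogitBoost's robustness to mislabelled or noisy training data.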
Methods. (1) Rank and select features: choose the genes with the highest values of the quality measure; the number of selected genes can be determined by cross-validation. (2) Train the LogitBoost classifier using decision stumps as weak learners. (3) Choice of the stopping parameter: leave-one-out cross-validation, picking the number of iterations m that maximizes the cross-validated log-likelihood l(m).
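A sketch of that leave-one-out choice of m; `fit` and `predict_proba` are hypothetical hooks standing in for an actual LogitBoost implementation that can be evaluated after any number of iterations:

```python
import numpy as np

def choose_stopping_m(X, y, fit, predict_proba, m_max=100):
    """Leave-one-out cross-validation of the stopping parameter: l(m) sums,
    over left-out samples, the binomial log-likelihood of the classifier
    trained on the remaining samples and stopped after m rounds."""
    X, y = np.asarray(X, float), np.asarray(y)
    n = len(y)
    ll = np.zeros(m_max)                              # ll[m-1] accumulates l(m)
    for i in range(n):
        keep = np.arange(n) != i
        model = fit(X[keep], y[keep], m_max)          # train once per left-out sample
        for m in range(1, m_max + 1):
            p = np.clip(predict_proba(model, X[i], m), 1e-12, 1 - 1e-12)
            ll[m - 1] += y[i] * np.log(p) + (1 - y[i]) * np.log(1 - p)
    return int(np.argmax(ll)) + 1                     # the m maximizing l(m)
```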
LogitBoost algorithm. Notation: p(x) = P[Y = 1 | X = x], with the initialization p^(0)(x) = 1/2 and F^(0)(x) = 0. The first step of the iteration is worked out below.
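Following the LogitBoost update of Friedman et al. (2000), each round m computes weights and working responses, fits a stump f_m by weighted least squares, and updates F and p:

$$
w_i = p(x_i)\bigl(1 - p(x_i)\bigr), \qquad
z_i = \frac{y_i - p(x_i)}{p(x_i)\bigl(1 - p(x_i)\bigr)}, \qquad
F(x) \leftarrow F(x) + \tfrac{1}{2} f_m(x), \qquad
p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.
$$

At the first step, the initialization p^(0)(x_i) = 1/2, F^(0)(x) = 0 makes the weights constant:

$$
w_i^{(1)} = \tfrac{1}{4}, \qquad
z_i^{(1)} = \frac{y_i - \tfrac12}{\tfrac14} = 4y_i - 2 \in \{-2, +2\},
$$

so the first stump f_1 is an ordinary (equally weighted) least-squares fit to the responses ±2, after which F^{(1)}(x) = ½ f_1(x) and p^{(1)}(x) = e^{F^{(1)}(x)} / (e^{F^{(1)}(x)} + e^{-F^{(1)}(x)}).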
Reducing multiclass to binary (one-against-all). Match each class against all the other classes, obtaining one binary LogitBoost classifier per class. Combine the binary probabilities into multiclass probability estimates for Y = j via normalization, and plug these into the Bayes classifier (formulas below).
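With p_j(x) the probability returned by the binary classifier for class j against the rest, the normalization and the resulting plug-in Bayes rule read (my transcription of the slide's formulas):

$$
\hat P[\,Y = j \mid X = x\,] = \frac{p_j(x)}{\sum_{k} p_k(x)},
\qquad
\hat C(x) = \arg\max_{j}\; \hat P[\,Y = j \mid X = x\,].
$$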
Sample data sets
Leukemia: 47 ALL, 25 AML; Affymetrix oligo; 3571 genes
Colon: 40 tumor, 22 normal; Affymetrix oligo; 6500 genes
Estrogen and Nodal: 25 ER+ samples, 24 ER- samples; Affymetrix oligo; 7129 genes
Lymphoma: 42 DLBCL, 9 follicular, 11 chronic lymphocytic leukemia; cDNA; 4026 genes
NCI: 7 breast, 5 CNS, 7 colon, 6 leukemia, 8 melanoma, 9 NSCLC, 6 ovarian, 9 renal; cDNA; 5244 genes
Data preprocessing
Leukemia, NCI: thresholding (floor 100, ceiling 16000), filtering (fold change > 5, max - min > 500), log transform, normalization
Colon: log transform, normalization
Estrogen and Nodal: thresholding, log transform, normalization
Lymphoma: normalization
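A sketch of the thresholding/filtering pipeline described for the Affymetrix sets; the base-10 log and the per-gene standardization are assumptions, since the slide only says "log transform, normalization":

```python
import numpy as np

def preprocess_affymetrix(X, floor=100.0, ceiling=16000.0, fold=5.0, spread=500.0):
    """X has shape (n_samples, n_genes). Threshold intensities, drop genes with
    little variation across samples, log-transform, then standardize each gene."""
    X = np.clip(X, floor, ceiling)                             # thresholding
    gene_max, gene_min = X.max(axis=0), X.min(axis=0)
    keep = (gene_max / gene_min > fold) & (gene_max - gene_min > spread)  # filtering
    X = np.log10(X[:, keep])                                   # log transform
    X = (X - X.mean(axis=0)) / X.std(axis=0)                   # normalization
    return X, keep
```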
Results – Error Rates (1): the test error using equal (symmetric) misclassification costs
Results – Error Rates (2)
Results: No. of Iterations & Performance. The choice of the stopping parameter for boosting is not very critical in any of the six datasets: "stopping after a large, but arbitrary number of 100 iterations is a reasonable strategy in the microarray data".
Results: ROC curves. The test error using asymmetric misclassification costs. Both boosting classifiers yield curves closer to the ideal ROC curve (red line) than the one from classification trees. Boosting has an advantage when small false negative rates are required.
Validation of the results
Leukemia: this study 1/34; AdaBoost (Ben-Dor et al.) 2.78 ± 1.39%; SVM (Furey et al.) 2-4/34; others (Golub et al.) 5/34
Colon: this study 12.90-14.52%; AdaBoost (Ben-Dor et al.) 17.74 ± 9.68%; SVM (Furey et al.) 9.68%
NCI: this study 22.9%; others (Dudoit et al.) 48%
Estrogen and Nodal: better predictions than the Bayesian approach (West et al.)
Lymphoma: N/A
Multiclass vs. one-against-all test errors
NCI (8 classes): multiclass (Friedman et al.) 36.10%; one-against-all 22.90%
Lymphoma (3 classes): multiclass (Friedman et al.) 8.06%; one-against-all 1.61%
Simulation. Model dataset: gene expression profiles are drawn from a multivariate normal distribution whose covariance matrix is estimated from the colon dataset; each sample is then assigned one of two response classes according to a Bernoulli distribution.
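A sketch of such a generator; making the Bernoulli success probability depend on a couple of "informative" genes through a logistic link is an assumption, since the slide does not specify how the class probabilities were defined:

```python
import numpy as np

def simulate_expression_data(Sigma, n_samples=200, n_informative=2, seed=None):
    """Draw expression profiles from N(0, Sigma) (Sigma estimated from real data,
    e.g. the colon set) and assign binary class labels from a Bernoulli whose
    probability depends on the first few genes."""
    rng = np.random.default_rng(seed)
    n_genes = Sigma.shape[0]
    X = rng.multivariate_normal(np.zeros(n_genes), Sigma, size=n_samples)
    signal = X[:, :n_informative].sum(axis=1)        # hypothetical informative genes
    prob = 1.0 / (1.0 + np.exp(-signal))             # P[Y = 1 | X]
    y = rng.binomial(1, prob)                        # Bernoulli class assignment
    return X, y
```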
Conclusion Feature preselection generally improved the predictive power Slightly better performance of LogitBoost over AdaBoost Reducing multiclass problems to multiple binary problems yielded more accurate results
Discussion. Biological interpretation of the selected genes. The edge of LogitBoost over AdaBoost is marginal and "far from significant". Did feature preselection really improve the performance, or was the setup tuned to make LogitBoost perform better? Cross-validating algorithms on published data: the authors may have considerations beyond the raw performance of the algorithms on the training datasets, and leave-one-out is just one way to cross-validate.