Ensemble Classification Methods Rayid Ghani IR Seminar – 9/26/00.

What is Ensemble Classification?
A set of classifiers whose decisions are combined in "some" way
Often more accurate than the individual classifiers
What properties should the base learners have?

Why should it work?
More accurate ONLY if the individual classifiers disagree
Each error rate < 0.5 and the errors are independent
The ensemble's error rate is highly correlated with the correlation among the errors made by the different learners (Ali & Pazzani)
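
To make the independence condition concrete, here is a minimal sketch (Python; the specific numbers are illustrative, and the independence assumption is rarely met exactly in practice) that computes the majority-vote error of n independent classifiers, each with individual error rate p, as a binomial tail:

```python
from math import comb

def majority_vote_error(n: int, p: float) -> float:
    """Error of a majority vote over n independent classifiers,
    each with individual error rate p (n assumed odd)."""
    # The ensemble errs when more than half of its members err.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# 21 independent classifiers with error 0.3 give a vote error of about 0.026;
# with individual error 0.6 (worse than chance) the vote error rises to about 0.83.
print(majority_vote_error(21, 0.3))
print(majority_vote_error(21, 0.6))
```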

Averaging Fails!
Use delta functions as classifiers (predict +1 at a single point and -1 everywhere else)
For a training sample of size m, construct a set of at most 2m classifiers such that the majority vote is always correct on the training set:
Associate one delta function with every example
Add M+ (# of positive examples) copies of the function that predicts +1 everywhere and M- (# of negative examples) copies of the function that predicts -1 everywhere
Applying boosting to this ensemble gives zero training error but bad generalization
Applying the margin analysis also gives zero training error, but the margin is small, O(1/m)
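
The construction can be checked numerically. Below is a small sketch (Python; it assumes each delta function predicts its own example's label at that point and the opposite label everywhere else, which is the two-sided reading of the description above): the unweighted majority vote is correct on every training example, but only by a margin of order 1/m.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
X = rng.normal(size=(m, 2))              # training points
y = rng.choice([-1, 1], size=m)          # labels in {-1, +1}

def delta_classifier(xi, yi):
    # Predicts yi exactly at xi and -yi everywhere else.
    return lambda x: yi if np.array_equal(x, xi) else -yi

classifiers = [delta_classifier(X[i], y[i]) for i in range(m)]
m_plus = int(np.sum(y == 1))
m_minus = m - m_plus
classifiers += [lambda x: 1] * m_plus    # M+ copies of the always-+1 classifier
classifiers += [lambda x: -1] * m_minus  # M- copies of the always--1 classifier

# The majority vote is correct on every training point, but the margin
# (vote difference divided by ensemble size) is only 2 / (2m) = 1/m.
for xi, yi in zip(X, y):
    votes = sum(c(xi) for c in classifiers)
    assert np.sign(votes) == yi
    print(f"label={yi:+d}  margin={votes / len(classifiers):.3f}")
```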

Ideas?
Subsampling training examples: Bagging, Cross-Validated Committees, Boosting
Manipulating input features: choose different feature subsets
Manipulating output targets: ECOC and variants
Injecting randomness: NN (different initial weights), DT (pick different splits), injecting noise, MCMC
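
As a concrete instance of the first idea, here is a minimal bagging sketch (Python; the use of scikit-learn decision trees and 0/1 labels are assumptions, not part of the original slides): each base classifier is trained on a bootstrap resample of the training data and the predictions are combined by unweighted voting.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def bagged_trees(X, y, n_estimators=25, seed=0):
    """Train n_estimators trees, each on a bootstrap resample of (X, y)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
        models.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return models

def majority_vote(models, X):
    """Unweighted majority vote; assumes binary labels coded as 0/1."""
    preds = np.stack([m.predict(X) for m in models])  # (n_models, n_samples)
    return (preds.mean(axis=0) > 0.5).astype(int)
```

Cross-validated committees and boosting differ only in how the per-model training sets (or example weights) are chosen; the voting step is the same.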

Combining Classifiers
Unweighted voting: Bagging, ECOC, etc.
Weighted voting: weight ∝ accuracy (on a training or holdout set); LSR (weights ∝ 1/variance)
Bayesian model averaging
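
A sketch of the weighted-voting option (Python; weighting by holdout accuracy is just one of the schemes listed above, and the -1/+1 label coding is an assumption):

```python
import numpy as np

def holdout_weights(models, X_val, y_val):
    """Weight each classifier in proportion to its holdout accuracy."""
    accs = np.array([np.mean(m.predict(X_val) == y_val) for m in models])
    return accs / accs.sum()

def weighted_vote(models, weights, X):
    """Weighted vote for binary labels coded as -1/+1."""
    votes = np.stack([w * m.predict(X) for m, w in zip(models, weights)])
    return np.sign(votes.sum(axis=0))
```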

BMA
All possible models in the model space are used, weighted by their probability of being the "correct" model
Optimal given the correct model space and priors
Not widely used, even though it has been argued not to overfit (Buntine, 1990)

BMA - Equations
$P(c \mid x, D) = \sum_{M} P(c \mid x, M)\, P(M \mid D)$, with $P(M \mid D) \propto P(M)\, P(D \mid M)$
prior: $P(M)$; likelihood: $P(D \mid M)$, defined by a noise model

Equations
Posterior computed under different likelihoods: a uniform noise model or a pure classification model
Model space too large – approximation required: use the model with the highest posterior, or sampling
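
As an illustration only (not the exact likelihood used in the papers discussed here), the following Python sketch combines a fixed set of classifiers BMA-style under a two-class uniform noise model with an assumed error rate eps: each model's posterior weight is prior times likelihood on the training data, and predictions are averaged with those weights.

```python
import numpy as np

def bma_weights(models, X_train, y_train, eps=0.1, priors=None):
    """Posterior weight of each model: prior * likelihood under a uniform
    noise model (each training label explained with prob. 1-eps if the
    model predicts it correctly, eps otherwise)."""
    n_models = len(models)
    priors = np.full(n_models, 1.0 / n_models) if priors is None else np.asarray(priors)
    log_post = np.log(priors)
    for i, m in enumerate(models):
        correct = (m.predict(X_train) == y_train)
        log_post[i] += np.sum(np.where(correct, np.log(1 - eps), np.log(eps)))
    log_post -= log_post.max()          # subtract max for numerical stability
    w = np.exp(log_post)
    return w / w.sum()

def bma_predict(models, weights, X):
    """Posterior-weighted average of per-model votes; labels coded -1/+1."""
    votes = np.stack([m.predict(X) for m in models])
    return np.sign(weights @ votes)
```

Because the likelihood multiplies in one factor per training example, a model that gets even a few more examples right quickly dominates the weights; this is the skewness reported on the next slide.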

BMA of Bagged C4.5 Rules
Bagging can be viewed as a form of importance sampling where all samples are weighted equally
Experimental results: every version of BMA performed worse than bagging on 19 out of 26 datasets
Posteriors were skewed – dominated by a single rule model – so BMA behaved like model selection rather than averaging

BMA of various learners
RISE (rule sets with partitioning): 8 databases from UCI; BMA was worse than RISE in every domain
Trading rules: the intuition is that there is no single right rule, so BMA should help; in practice BMA was similar to choosing the single best rule

Overfitting in BMA
The issue of overfitting is usually ignored (Freund et al. 2000)
Is overfitting the explanation for the poor performance of BMA?
Overfitting: preferring a hypothesis that does not truly have the lowest error of any hypothesis considered, but by chance has the lowest error on the training data
Overfitting results from the likelihood's exponential sensitivity to random fluctuations in the sample, and increases with the number of models considered
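
A tiny numeric illustration of that exponential sensitivity (Python, with an assumed noise rate eps): under the uniform noise model, a model that happens to classify just a handful more training examples correctly receives an exponentially larger likelihood, so the posterior concentrates on the "luckiest" model.

```python
eps = 0.1            # assumed noise rate in the likelihood
extra_correct = 5    # model A gets 5 more training examples right than model B

# Each extra correct example multiplies the likelihood by (1 - eps) / eps.
ratio = ((1 - eps) / eps) ** extra_correct
print(f"P(D|A) / P(D|B) = {ratio:.0f}")  # = 9^5 = 59049: A dominates the posterior
```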

To BMA or not to BMA?
The net effect depends on which effect prevails:
Increased overfitting (small if few models are considered)
Reduction in error obtained by giving some weight to alternative models (skewed weights => small effect)
Ali & Pazzani (1996) report good results, but bagging wasn't tried
Domingos (2000) used bootstrapping before BMA, so the models were built from less data

Why do they work?
Bias/variance decomposition
Training data may be insufficient for choosing a single best classifier
Learning algorithms may not be "smart" enough
The hypothesis space may not contain the true function

Definitions
Bias is the persistent/systematic error of a learner, independent of the training set. It is zero for a learner that always makes the optimal prediction.
Variance is the error incurred by fluctuations in response to different training sets. It is independent of the true value of the predicted variable, and zero for a learner that always predicts the same class regardless of the training set.

Bias–Variance Decomposition
Kong & Dietterich (1995): variance can be negative and noise is ignored
Breiman (1996): undefined for any given example, and variance can be zero even when the learner's predictions fluctuate
Tibshirani (1996), Hastie (1997)
Kohavi & Wolpert (1996): allows the bias of the Bayes optimal classifier to be non-zero
Friedman (1997): leaves bias and variance for zero-one loss undefined

Domingos (2000)
A single definition of bias and variance
Applicable to "any" loss function
Explains the margin effect (Schapire et al. 1997) using the decomposition
Incorporates variable misclassification costs
Experimental study

Unified Decomposition
Loss functions: squared $L(t,y) = (t-y)^2$; absolute $L(t,y) = |t-y|$; zero-one $L(t,y) = 0$ if $y = t$, else $1$
Goal: minimize the average $L(t,y)$ over all weighted examples
Expected loss decomposes as $c_1 N(x) + B(x) + c_2 V(x)$
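
Written out (a reconstruction in standard notation following Domingos's definitions, where $y$ is the learner's prediction, $y_*$ the optimal prediction, $y_m$ the main prediction, and $t$ the true value):

```latex
y_m = \operatorname*{argmin}_{y'} \; \mathbb{E}_{D}\big[ L(y, y') \big]   % main prediction
\qquad N(x) = \mathbb{E}_{t}\big[ L(t, y_*) \big]                         % noise
\qquad B(x) = L(y_*, y_m)                                                 % bias
\qquad V(x) = \mathbb{E}_{D}\big[ L(y_m, y) \big]                         % variance

\mathbb{E}_{D,t}\big[ L(t, y) \big] = c_1\, N(x) + B(x) + c_2\, V(x)
```

For two-class zero-one loss, $c_2$ is +1 on unbiased examples and -1 on biased ones, which is why increasing variance on biased examples can reduce error (compare the kNN results later in the talk).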

Properties of the unified decomposition
Relation to an order-correct learner
Relation to the margin of a learner
Maximizing margins is a combination of reducing the number of biased examples, decreasing variance on unbiased examples, and increasing it on biased ones

Experimental Study
30 UCI datasets
Methodology: 100 bootstrap samples; estimates averaged over the test set with uniform weights
Estimate bias, variance, and zero-one loss for decision trees, kNN, and boosting
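
A minimal sketch of that methodology (Python; the scikit-learn tree, 0/1 labels, and the decision to ignore noise are assumptions): train the learner on many bootstrap samples, take the most frequent prediction at each test point as the main prediction, and read off bias and variance for zero-one loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def bias_variance_01(X_tr, y_tr, X_te, y_te, n_boot=100, seed=0):
    """Estimate zero-one loss, bias, and variance from bootstrap replicates.
    Noise is ignored, so the optimal prediction is taken to be y_te."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_boot, len(X_te)), dtype=int)
    for b in range(n_boot):
        idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bootstrap sample
        model = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
        preds[b] = model.predict(X_te)

    main = (preds.mean(axis=0) > 0.5).astype(int)  # main prediction (0/1 labels)
    loss = np.mean(preds != y_te)       # average zero-one loss over replicates
    bias = np.mean(main != y_te)        # B(x) averaged over the test set
    variance = np.mean(preds != main)   # V(x) averaged over the test set
    return loss, bias, variance
```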

Boosting C4.5 - Results
Decreases both bias and variance
The bulk of the bias reduction happens in the first few rounds
Variance reduction is more gradual and is the dominant effect

kNN results
As k increases, bias increases and dominates the variance reduction
However, increasing k reduces variance on unbiased examples while increasing it on biased ones

Issues
The decomposition does not work with "any" loss function, e.g. absolute loss
The decomposition is not purely additive, unlike the original one for squared loss

Spectrum of ensembles
Bagging → Boosting → BMA
Asymmetry of weights and overfitting increase along this spectrum

Open issues concerning ensembles
Best way to construct ensembles? No extensive comparison done
Computationally expensive
Not easily comprehensible

Bibliography
Overview: T. Dietterich; Bauer & Kohavi
Averaging: Domingos; Freund, Mansour & Schapire; Ali & Pazzani
Bias–Variance Decomposition: Kohavi & Wolpert; Domingos; Friedman; Kong & Dietterich