Ensembles (Bagging, Boosting, and all that)

Slides:



Advertisements
Similar presentations
Longin Jan Latecki Temple University
Advertisements

Paper presentation for CSI5388 PENGCHENG XI Mar. 23, 2005
ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.
Sparse vs. Ensemble Approaches to Supervised Learning
Ensemble Learning what is an ensemble? why use an ensemble?
2D1431 Machine Learning Boosting.
Ensemble Learning: An Introduction
Adaboost and its application
Three kinds of learning
Introduction to Boosting Aristotelis Tsirigos SCLT seminar - NYU Computer Science.
Sparse vs. Ensemble Approaches to Supervised Learning
For Better Accuracy Eick: Ensemble Learning
Ensembles of Classifiers Evgueni Smirnov
Machine Learning CS 165B Spring 2012
AdaBoost Robert E. Schapire (Princeton University) Yoav Freund (University of California at San Diego) Presented by Zhi-Hua Zhou (Nanjing University)
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
© Jude Shavlik 2006 David Page 2010 CS 760 – Machine Learning (UW-Madison) Ensembles (Bagging, Boosting, and all that) Old View Learn one good modelLearn.
CS 391L: Machine Learning: Ensembles
LOGO Ensemble Learning Lecturer: Dr. Bo Yuan
Ensemble Classification Methods Rayid Ghani IR Seminar – 9/26/00.
CS Fall 2015 (© Jude Shavlik), Lecture 7, Week 3
BOOSTING David Kauchak CS451 – Fall Admin Final project.
CS Machine Learning 21 Jan CS 391L: Machine Learning: Decision Tree Learning Raymond J. Mooney University of Texas at Austin.
Combining multiple learners Usman Roshan. Bagging Randomly sample training data Determine classifier C i on sampled data Goto step 1 and repeat m times.
Today Ensemble Methods. Recap of the course. Classifier Fusion
Ensemble Methods: Bagging and Boosting
Ensembles. Ensemble Methods l Construct a set of classifiers from training data l Predict class label of previously unseen records by aggregating predictions.
Ensemble Learning Spring 2009 Ben-Gurion University of the Negev.
CS Ensembles1 Ensembles. 2 A “Holy Grail” of Machine Learning Automated Learner Just a Data Set or just an explanation of the problem Hypothesis.
CLASSIFICATION: Ensemble Methods
Tony Jebara, Columbia University Advanced Machine Learning & Perception Instructor: Tony Jebara.
Today’s Topics HW1 Due 11:55pm Today (no later than next Tuesday) HW2 Out, Due in Two Weeks Next Week We’ll Discuss the Make-Up Midterm Be Sure to Check.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
Ensemble Methods in Machine Learning
Classification Ensemble Methods 1
1 January 24, 2016Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 7 — Classification Ensemble Learning.
Decision Trees IDHairHeightWeightLotionResult SarahBlondeAverageLightNoSunburn DanaBlondeTallAverageYesnone AlexBrownTallAverageYesNone AnnieBlondeShortAverageNoSunburn.
… Algo 1 Algo 2 Algo 3 Algo N Meta-Learning Algo.
Combining multiple learners Usman Roshan. Decision tree From Alpaydin, 2010.
1 Machine Learning Lecture 8: Ensemble Methods Moshe Koppel Slides adapted from Raymond J. Mooney and others.
Copyright © 2004 by Jinyan Li and Limsoon Wong Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li Limsoon Wong.
Ensemble Classifiers.
Combining Models Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya.
Machine Learning: Ensemble Methods
HW 2.
CS Fall 2016 (Shavlik©), Lecture 5
Data Mining Practical Machine Learning Tools and Techniques
Reading: R. Schapire, A brief introduction to boosting
Ensembles (Bagging, Boosting, and all that)
Trees, bagging, boosting, and stacking
The Boosting Approach to Machine Learning
Machine Learning: Ensembles
COMP61011 : Machine Learning Ensemble Models
Ensemble Learning Introduction to Machine Learning and Data Mining, Carla Brodley.
ECE 5424: Introduction to Machine Learning
A “Holy Grail” of Machine Learing
INTRODUCTION TO Machine Learning
Combining Base Learners
Data Mining Practical Machine Learning Tools and Techniques
Introduction to Data Mining, 2nd Edition
cs540 - Fall 2016 (Shavlik©), Lecture 20, Week 11
CS Fall 2016 (© Jude Shavlik), Lecture 7, Week 4
Ensemble Methods for Machine Learning: The Ensemble Strikes Back
Ensemble learning.
Model Combination.
Ensemble learning Reminder - Bagging of Trees Random Forest
Ensembles An ensemble is a set of classifiers whose combined results give the final decision. test feature vector classifier 1 classifier 2 classifier.
INTRODUCTION TO Machine Learning 3rd Edition
CS 391L: Machine Learning: Ensembles
Advisor: Dr.vahidipour Zahra salimian Shaghayegh jalali Dec 2017
Presentation transcript:

Ensembles (Bagging, Boosting, and all that) Old View Learn one good model New View Learn a good set of models Probably best example of interplay between “theory & practice” in Machine Learning Naïve Bayes, k-NN, neural net d-tree, SVM, etc © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Ensembles of Neural Networks (or any supervised learner) OUTPUT Combiner Network Network Network INPUT Ensembles often produce accuracy gains of 5-10 percentage points! Can combine “classifiers” of various types E.g., decision trees, rule sets, neural networks, etc. © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Combining Multiple Models Three ideas Simple (unweighted) votes Standard choice Weighted votes e.g., weight by tuning-set accuracy Train a combining function Prone to overfitting? “Stacked generalization” (Wolpert) © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Some Relevant Early Papers Hansen & Salamen, PAMI:20, 1990 If (a) the combined predictors have errors that are independent from one another And (b) prob any given model correct predicts any given testset example is > 50%, then Think about flipping N coins, each with prob > ½ of coming up heads – what is the prob more than half will come up heads? © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Some Relevant Early Papers Schapire, MLJ:5, 1990 (“Boosting”) If you have an algorithm that gets > 50% on any distribution of examples, you can create an algorithm that gets > (100% - ), for any  > 0 - Impossible by NFL theorem (later) ??? Need an infinite (or very large, at least) source of examples - Later extensions (eg, AdaBoost) address this weakness Also see Wolpert, “Stacked Generalization,” Neural Networks, 1992 © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Some Methods for Producing “Uncorrelated” Members of an Ensemble k times randomly choose (with replacement) N examples from a training set of size N - give each training set to a std ML algo “Bagging” by Brieman (MLJ, 1996) Want unstable algorithms Part of HW3 Reweight examples each cycle (if wrong, increase weight; else decrease weight) “AdaBoosting” by Freund & Schapire (1995, 1996) © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Some More Methods for Producing “Uncorrelated” Members of an Ensemble Directly optimize accuracy + diversity Opitz & Shavlik (1995; used genetic algo’s) Melville & Mooney (2004-5; DECORATE algo) Different number of hidden units in a neural network, different k in k -NN, tie-breaking scheme, example ordering, etc Various people See 2005 and 2006 papers of Caruana’s group for large-scale empirical study of ensembles © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Variance/Diversity Creating Methods (cont.) Train with different associated tasks Caruana (1996) Use different input features, randomly perturb training examples, etc Cherkauer (UW-CS), Melville & Mooney X age Other functions related to the main task of X “Multi-Task Learning” X gender X income © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Variance/Diversity Creating Methods (cont.) Assign each category an error correcting code, and train on each bit separately Dietterich et al. (ICML 1995) Cat1 = 1 1 1 0 1 1 1 Cat2 = 1 1 0 1 1 0 0 Cat3 = 1 0 1 1 0 1 0 Cat4 = 0 1 1 1 0 0 1 Predicting 5 of 7 bits correctly suffices Want: Large Hamming distance between rows Large Hamming distance between columns © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Random Forests (Breiman, MLJ 2001; related to Ho, 1995) A variant of BAGGING Algorithm Repeat k times Draw with replacement N examples, put in train set Build d-tree, but in each recursive call Choose (w/o replacement) i features Choose best of these i as the root of this (sub)tree Do NOT prune Let N = # of examples F = # of features i = some number << F © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

CS 760 – Machine Learning (UW-Madison) More on Random Forests Increasing i Increases correlation among individual trees (BAD) Also increases accuracy of individual trees (GOOD) Can use tuning set to choose good setting for i Overall, random forests Are very fast (e.g., 50K examples, 10 features, 10 trees/min on 1 GHz CPU in 2004) Deal with large # of features Reduce overfitting substantially Work very well in practice © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

AdaBoosting (Freund & Schapire) Wi,j = weight on exj on cycle i Initially weight all ex’s equally (ie, 1/N, N=#examples) Let Hi = concept/hypothesis learned on current weighted train set Let I = weighted error of Hi on current train set If i>1/2, return {H1, H2, …, Hi-1} (all previous hypotheses) Reweight correct ex’s: Note: since i<1/2, wi+1< wi Renormalize, so sum wgts = 1 i  i+1, goto 1 © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Using the Set of Hypothesis Produced by AdaBoost Output for example x = where (false)  0, (true)  1 - ie, count weighted votes for hypotheses that predict category y for input x © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Dealing with Weighted Examples in an ML Algo Two approaches Sample from this probability distribution and train as normal (ie, create prob dist from wgts, then sample to create an unweighted train set) Alter learning algorithm so it counts weighted examples and not just examples eg) from accuracy = # correct / # total to weighted accuracy = wi of correct / wi of all #2 preferred – avoids sampling effects © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

CS 760 – Machine Learning (UW-Madison) AdaBoosting & ID3 Apply to PRUNED trees* – otherwise no trainset error! (Can avoid i = 0 via m -est’s) ID3’s calc’s all based on weighted sums, so easy to extend to weighted examples © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Boosting & Overfitting Often get better test-set results, even when (and after) train error is ZERO Hypothesis (see papers by Schurmanns or Schapire) Still improving number/strength of votes even though getting all train-set ex’s correct Wider “margins” between pos and neg ex’s – relates to SVM’s test train Error (on unweighted examples) cycles © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

CS 760 – Machine Learning (UW-Madison) Empirical Studies (from Freund & Schapire; reprinted in Dietterich’s AI Mag paper) Error Rate of C4.5 Error Rate of AdaBoost Error Rate of Bagging On average, boosting slightly better? Boosting and Bagging helped almost always! (Each point one data set) Error Rate of Bagged (Boosted) C4.5 © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Large Empirical Study of Bagging vs. Boosting Opitz & Maclin (UW CS PhD’s), JAIR Vol 11, pp 169-198, 1999 www.jair.org/abstracts/opitz99a.html Bagging almost always better than single D-tree or ANN (artificial neural net) Boosting can be much better than Bagging However, boosting can sometimes be harmful (too much emphasis on “outliers”?) © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Thought Experiment (possible class project) Consider a learning curve Plot, averaged across many datasets, error rates of Best single model Some ensemble method We know that for many #examples, ensembles have lower error rates What happens as #examples   ? © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)

Boosting/Bagging/Etc Wrapup An easy to use and usually highly effective technique always consider it (bagging, at least) when applying ML to practical problems Does reduce “comprehensibility” of models see work by Craven & Shavlik though (“rule extraction”) Also an increase in runtime, but cycles usually much cheaper than examples © Jude Shavlik 2006 David Page 2007 CS 760 – Machine Learning (UW-Madison)