Ensemble Methods for Machine Learning: The Ensemble Strikes Back

Outline
- Motivations and techniques
- Bias and variance: bagging
- Combining learners vs. choosing between them: bucket of models, stacking & blending
- PAC-learning theory: boosting
- Relation of boosting to other learning methods: optimization, SVMs, …

Review Of Boosting

Review: bagging resamples the training set with replacement; boosting instead reweights it: increase the weight of x_i if h_t gets it wrong, decrease it if h_t gets it right. The final classifier is a linear combination of the base hypotheses, and the best weight α_t depends on the error of h_t.
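To make the reweighting rule and the role of α_t concrete, here is a minimal AdaBoost sketch in Python/NumPy. It is only a sketch under stated assumptions: the synthetic dataset, the use of scikit-learn decision stumps as the base hypotheses h_t, and T = 50 rounds are illustrative choices, not taken from the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1                        # relabel classes to {-1, +1}

n, T = len(y), 50
D = np.full(n, 1.0 / n)                # distribution over training examples
hs, alphas = [], []
for t in range(T):
    h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
    pred = h.predict(X)
    eps = D[pred != y].sum()                           # weighted error of h_t
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # best weight alpha_t for h_t
    D *= np.exp(-alpha * y * pred)                     # increase weight on mistakes, decrease on hits
    D /= D.sum()                                       # renormalize (this sum is Z_t)
    hs.append(h)
    alphas.append(alpha)

f = sum(a * h.predict(X) for a, h in zip(alphas, hs))  # f(x) = sum_t alpha_t h_t(x)
print("training error of sign(f):", np.mean(np.sign(f) != y))
```

On toy data like this, the training error of sign(f) usually drops rapidly with the number of rounds, which is the behaviour that the analysis in the slides below bounds.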

Boosting: A toy example (thanks, Rob Schapire) [sequence of figure-only slides omitted]

Boosting improved decision trees… [timeline figure: 1950 - T, 1988 - KV, 1989 - S, 1993 - DSS, 1995 - FS, …; results figure omitted]

Analysis Of Boosting

Theorem 1 (training error rate): write the combined classifier as H(x) = sign(f(x)), with f(x) = Σ_t α_t h_t(x). Then the training error of H is at most ∏_t Z_t, where Z_t is the normalizer of the weight update at round t.

Theorem 1, proof sketch: because H(x) = sign(f(x)), the 0/1 error on example i, [H(x_i) ≠ y_i], has the upper bound exp(-y_i f(x_i)). Averaging this upper bound over the training set and unrolling the weight-update recurrence gives exactly ∏_t Z_t. QED!
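The equations on this slide were images; the following is a hedged reconstruction of the standard chain of steps, writing D_t for the distribution over the m training examples at round t (notation assumed, but consistent with the weight update described above).

```latex
\begin{align*}
\frac{1}{m}\sum_{i=1}^{m} \mathbf{1}\!\left[H(x_i)\neq y_i\right]
  &\le \frac{1}{m}\sum_{i=1}^{m} e^{-y_i f(x_i)}
     && \text{since } \mathbf{1}[z\le 0]\le e^{-z} \\
  &= \frac{1}{m}\sum_{i=1}^{m} \Big(m \prod_{t=1}^{T} Z_t\Big)\, D_{T+1}(i)
     && \text{unrolling } D_{t+1}(i)=\frac{D_t(i)\,e^{-\alpha_t y_i h_t(x_i)}}{Z_t},\;\; D_1(i)=\tfrac{1}{m} \\
  &= \prod_{t=1}^{T} Z_t
     && \text{since } \textstyle\sum_i D_{T+1}(i)=1 .
\end{align*}
```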

Theorem 1 says: pick the h's and α's to minimize the Z's. Simplified notation: drop the t's and let u_i = y_i h(x_i); remember that u_i = +1 (h is right on i) or u_i = -1 (h is wrong on i). Claim: e^(-αu) ≤ ((1 + u)/2) e^(-α) + ((1 - u)/2) e^(α); equality holds for u = +1, -1, and the inequality holds for -1 ≤ u ≤ +1 (by convexity of the exponential). So: let's minimize Z(α) = Σ_i D(i) e^(-α u_i) to pick a best α.

Minimize Z(α) = (1 - ε) e^(-α) + ε e^(α), where ε = Σ_{i: u_i = -1} D(i) is the weighted error of h. Setting dZ/dα = 0 gives the best weight α = (1/2) ln((1 - ε)/ε).
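The derivation on the original slide was an image; here is one plausible reconstruction of the short calculation (standard AdaBoost algebra, not copied from the original).

```latex
\[
Z(\alpha) = (1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha},
\qquad
\frac{dZ}{d\alpha} = -(1-\epsilon)e^{-\alpha} + \epsilon e^{\alpha} = 0
\;\Rightarrow\;
e^{2\alpha} = \frac{1-\epsilon}{\epsilon}
\;\Rightarrow\;
\alpha^{*} = \tfrac{1}{2}\ln\frac{1-\epsilon}{\epsilon},
\qquad
Z(\alpha^{*}) = 2\sqrt{\epsilon(1-\epsilon)} .
\]
```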

Theorem 2: when α_t = (1/2) ln((1 - ε_t)/ε_t), then Z_t ≤ 2 sqrt(ε_t (1 - ε_t)), and hence the training error is bounded by ∏_t 2 sqrt(ε_t (1 - ε_t)). Comment: if h(x) = +/- 1 then the bound is tight, i.e. Z_t = 2 sqrt(ε_t (1 - ε_t)) exactly.
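A quick numerical sanity check of the two closed forms above (purely illustrative; ε = 0.3 is an arbitrary value, and scipy is used only for the 1-D minimization).

```python
import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3
Z = lambda a: (1 - eps) * np.exp(-a) + eps * np.exp(a)  # Z as a function of alpha
best = minimize_scalar(Z)                               # numeric 1-D minimization
print(best.x, 0.5 * np.log((1 - eps) / eps))            # numeric vs. closed-form alpha
print(best.fun, 2 * np.sqrt(eps * (1 - eps)))           # minimum vs. 2*sqrt(eps*(1-eps))
```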

Boosting as Optimization

Even boosting single features worked well, e.g. on the Reuters newswire corpus [results figure omitted].

Some background facts. Coordinate descent optimization to minimize f(w), where w = <w1, …, wN>:
  For t = 1, …, T, or till convergence:
    For i = 1, …, N:
      Pick w* to minimize f(<w1, …, wi-1, w*, wi+1, …, wN>)
      Set wi = w*
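A minimal runnable sketch of this loop in Python/NumPy, applied to a least-squares objective f(w) = ||Aw - b||^2 so that each one-coordinate minimization has a closed form; the quadratic objective and the random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
f = lambda w: np.sum((A @ w - b) ** 2)      # objective to minimize

w = np.zeros(5)
for t in range(20):                         # for t = 1..T or till convergence
    for i in range(len(w)):                 # for i = 1..N
        r = b - A @ w + A[:, i] * w[i]      # residual with coordinate i's contribution removed
        w[i] = (A[:, i] @ r) / (A[:, i] @ A[:, i])  # w* = exact 1-D minimizer over coordinate i
print("f(w) after coordinate descent:", f(w))
```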

Boosting as optimization using coordinate descent: with a small number of possible h's, you can think of boosting as finding a linear combination of these, f(x) = Σ_j α_j h_j(x). So boosting is sort of like stacking, in that it learns a weighted combination of base learners; but boosting uses coordinate descent on the α's to minimize an upper bound on the training error rate, Σ_i exp(-y_i f(x_i)).
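This view can be made concrete with a small sketch: fix a finite pool of ±1-valued hypotheses (here just random columns, one of them made weakly informative; purely illustrative) and run coordinate descent on the exponential-loss upper bound. The per-step update recovers the AdaBoost α formula from the analysis above, assuming the best hypothesis in the pool has weighted error below 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 25
y = rng.choice([-1, 1], size=n)                    # labels
H = rng.choice([-1, 1], size=(n, m))               # H[i, j] = h_j(x_i): a fixed hypothesis pool
H[:, 0] = np.where(rng.random(n) < 0.8, y, -y)     # make hypothesis 0 weakly informative

alpha = np.zeros(m)
for t in range(20):
    w = np.exp(-y * (H @ alpha)); w /= w.sum()     # AdaBoost-style weights from current f = H @ alpha
    eps = (w[:, None] * (H * y[:, None] < 0)).sum(axis=0)  # weighted error of each hypothesis
    j = int(np.argmin(eps))                        # coordinate with the best weighted error
    e = float(np.clip(eps[j], 1e-12, 1 - 1e-12))
    alpha[j] += 0.5 * np.log((1 - e) / e)          # exact line search on the exponential loss
    print(t, j, np.exp(-y * (H @ alpha)).mean())   # exp-loss upper bound on the training error rate
```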

Boosting and optimization (1999 - FHT on the timeline): Jerome Friedman, Trevor Hastie and Robert Tibshirani, "Additive logistic regression: a statistical view of boosting", The Annals of Statistics, 2000. Compared using AdaBoost to set feature weights vs. direct optimization of feature weights to minimize log-likelihood, squared error, …

Boosting as Margin Learning

Boosting didn't seem to overfit…(!) [figure omitted: training and test error curves over boosting rounds]

…because it turned out to be increasing the margin of the classifier [figure omitted: cumulative margin distributions after 100 and 1000 rounds]

Boosting movie [animation omitted]

Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!) In boosting, the "coordinates" are being extended by one in each round of boosting (usually; unless you happen to generate the same tree twice).

Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!) Both boosting and linear SVMs can be seen as margin maximizers; they differ mainly in which norms define the margin (compare the two expressions sketched below).
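The two margin formulas on the original slide were images; the following is a hedged reconstruction of the comparison usually drawn (notation assumed: α are the hypothesis weights, w is the SVM weight vector).

```latex
\[
\text{Boosting margin of } (x_i, y_i):\quad
\frac{y_i \sum_t \alpha_t h_t(x_i)}{\lVert \alpha \rVert_1},
\qquad
\text{Linear-SVM margin of } (x_i, y_i):\quad
\frac{y_i \,(w \cdot x_i)}{\lVert w \rVert_2 \, \lVert x_i \rVert_2}.
\]
% Boosting normalizes by the L1 norm of the hypothesis weights (the "features"
% h_t(x) take values in {-1,+1}, so their L-infinity norm is 1); SVMs pair L2 with L2.
```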

Wrap-Up on Boosting

Boosting in the real world: William's wrap-up
- Boosting is not discussed much in the ML research community any more; it's much too well understood.
- It's really useful in practice as a meta-learning method, e.g. boosted Naïve Bayes usually beats Naïve Bayes.
- Boosted decision trees are almost always competitive with respect to accuracy: very robust against rescaling of numeric features, extra features, non-linearities, …, though somewhat slower to learn and use than many linear classifiers.
- But getting probabilities out of them is a little less reliable (see the sketch below).
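A practical sketch of that advice using scikit-learn; the dataset, depth-3 trees, 200 rounds, and isotonic calibration are illustrative choices, not from the lecture. Calibration is one common way to address the less-reliable probabilities mentioned above.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Boosted shallow trees ("estimator=" is "base_estimator=" in older scikit-learn releases).
boosted = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=200, random_state=0)
print("boosted trees accuracy:", boosted.fit(X_tr, y_tr).score(X_te, y_te))

# If calibrated probabilities matter, wrap the booster in a calibrator.
calibrated = CalibratedClassifierCV(boosted, method="isotonic", cv=3)
print("calibrated accuracy:", calibrated.fit(X_tr, y_tr).score(X_te, y_te))
```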