CMPUT 466/551 Principal Source: CMU


Boosting CMPUT 466/551 Principal Source: CMU

Boosting Idea We have a weak classifier, i.e., its error rate is only slightly better than 0.5 (chance). Boosting combines many such weak learners to build a strong classifier, whose error rate is much lower than 0.5.

Boosting: Combining Classifiers What is a 'weighted sample'?

Discrete Ada(ptive)Boost Algorithm
Create a weight distribution W(x) over the N training points
Initialize W0(x) = 1/N for all x; set step T = 0
At each iteration T:
  Train weak classifier CT(x) on the data using weights WT(x)
  Get the weighted error rate εT; set αT = log[(1 − εT)/εT]
  Update WT+1(xi) = WT(xi) ∙ exp[αT ∙ I(yi ≠ CT(xi))]
Final classifier: CFINAL(x) = sign[ ∑T αT CT(x) ]
Assumes the weak method CT can use the weights WT(x); if this is hard, we can instead sample the training data according to WT(x)
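As a concrete illustration, here is a minimal sketch of the loop above in Python, using scikit-learn decision stumps as the weak learner. The function names (fit_discrete_adaboost, predict_discrete_adaboost) and the n_rounds parameter are illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_discrete_adaboost(X, y, n_rounds=50):
    """Discrete AdaBoost with decision stumps; y is assumed coded as +1/-1."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # W_0(x) = 1/N
    classifiers, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)         # weak learner trained with weights W_T
        miss = stump.predict(X) != y
        eps = w[miss].sum() / w.sum()            # weighted error rate eps_T
        if eps <= 0.0 or eps >= 0.5:             # perfect, or no better than chance: stop
            break
        alpha = np.log((1.0 - eps) / eps)        # alpha_T
        w = w * np.exp(alpha * miss)             # up-weight the misclassified points
        w = w / w.sum()                          # renormalize
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def predict_discrete_adaboost(classifiers, alphas, X):
    votes = sum(a * c.predict(X) for c, a in zip(classifiers, alphas))
    return np.sign(votes)
```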

Real AdaBoost Algorithm
Create a weight distribution W(x) over the N training points
Initialize W0(x) = 1/N for all x; set step T = 0
At each iteration T:
  Train weak classifier CT(x) on the data using weights WT(x)
  Obtain class probabilities pT(xi) (estimates of P(y = 1 | xi) under the weights) for each data point xi
  Set fT(xi) = ½ log[ pT(xi)/(1 − pT(xi)) ]
  Update WT+1(xi) = WT(xi) ∙ exp[−yi ∙ fT(xi)] for all xi
Final classifier: CFINAL(x) = sign[ ∑T fT(x) ]
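For contrast, a sketch of a single Real AdaBoost round (again an illustrative assumption rather than code from the course), where the weak learner must supply class-probability estimates pT(xi):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost_round(X, y, w, eps=1e-12):
    """One Real AdaBoost round; y coded as +1/-1, w are the current weights W_T."""
    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X, y, sample_weight=w)
    pos = list(stump.classes_).index(1)                         # column holding class +1
    p = np.clip(stump.predict_proba(X)[:, pos], eps, 1 - eps)   # p_T(x_i), clipped to avoid log(0)
    f = 0.5 * np.log(p / (1.0 - p))                             # real-valued contribution f_T(x_i)
    w_new = w * np.exp(-y * f)                                  # re-weight: exp(-y_i * f_T(x_i))
    return stump, f, w_new / w_new.sum()
```

The final classifier is again the sign of the accumulated sum of the fT contributions.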

Boosting With Decision Stumps

First classifier

First 2 classifiers

First 3 classifiers

Final Classifier learned by Boosting

Performance of Boosting with Stumps
Problem: Xj are standard Gaussian variables
About 1000 positive and 1000 negative training examples
10,000 test observations
Weak classifier is a "stump", i.e., a two-terminal node classification tree
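A quick way to reproduce a comparable experiment is scikit-learn's AdaBoostClassifier, whose default base learner is a depth-1 decision tree (a stump). The labeling rule below (y = +1 when ∑j Xj² exceeds the median of a χ²₁₀ variable) is the nested-spheres setup from Hastie et al.; the slide itself does not spell out the rule, so treat this as an assumed reconstruction.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

def make_data(n, p=10):
    """Ten standard Gaussian features; label +1 iff the sum of squares exceeds the chi^2_10 median."""
    X = rng.standard_normal((n, p))
    y = np.where((X ** 2).sum(axis=1) > chi2.ppf(0.5, df=p), 1, -1)
    return X, y

X_train, y_train = make_data(2000)    # roughly 1000 examples per class by construction
X_test, y_test = make_data(10000)

# The default base estimator is DecisionTreeClassifier(max_depth=1), i.e. a stump.
model = AdaBoostClassifier(n_estimators=400).fit(X_train, y_train)
print("test error:", 1.0 - model.score(X_test, y_test))
```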

AdaBoost is Special The properties of the exponential loss function make the AdaBoost algorithm simple: each step has a closed-form solution, expressed in terms of the minimized training error on the weighted data. This simplicity is very special and does not hold for other loss functions!

Boosting: An Additive Model
Consider an additive model: a weighted sum of basis functions (written out below)
N: number of training data points; L: loss function; b: basis functions
Can we minimize this cost function directly? This joint optimization is non-convex and hard!
Boosting takes a greedy approach
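For reference, the additive model and its fitting criterion have the following standard form (the equation image on the original slide is not reproduced in this transcript, so this is the usual statement in the notation of Hastie et al.):

```latex
% Additive model and its fitting criterion
f(x) = \sum_{m=1}^{M} \beta_m \, b(x;\gamma_m),
\qquad
\min_{\{\beta_m,\gamma_m\}_{1}^{M}} \;
\sum_{i=1}^{N} L\!\Big(y_i,\; \sum_{m=1}^{M} \beta_m \, b(x_i;\gamma_m)\Big)
```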

Boosting: Forward stagewise greedy search Add basis functions one at a time, keeping the previously fitted terms fixed
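Concretely, the stagewise update solves for one new basis function at each step without revisiting the earlier coefficients; a standard statement (an assumption about what the slide's figure showed) is:

```latex
% Forward stagewise additive modeling: at step m, fit one new basis function
(\beta_m,\gamma_m) = \arg\min_{\beta,\gamma}
  \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\big),
\qquad
f_m(x) = f_{m-1}(x) + \beta_m\, b(x;\gamma_m)
```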

Boosting As Additive Model
Simple case: squared-error loss
Forward stagewise modeling amounts to just fitting the residuals from the previous iteration
Squared-error loss not robust for classification
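With squared-error loss, the residual-fitting interpretation is a one-line identity:

```latex
% Squared-error loss at step m reduces to fitting the current residuals r_{im}
L\big(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i;\gamma)\big)
  = \big(y_i - f_{m-1}(x_i) - \beta\, b(x_i;\gamma)\big)^2
  = \big(r_{im} - \beta\, b(x_i;\gamma)\big)^2,
\qquad r_{im} \equiv y_i - f_{m-1}(x_i)
```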

Boosting As Additive Model
AdaBoost for classification: L(y, f(x)) = exp(−y ∙ f(x)), the exponential loss function
Margin ≡ y ∙ f(x)
Note that we use a property of the exponential loss function at this step
Many other functions (e.g. absolute loss) would start getting in the way…
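The property being used is that the exponential loss factors: the already-fitted model turns into a per-example weight, leaving a weighted problem in the new classifier alone (standard derivation, filling in the equations not shown in the transcript):

```latex
% Exponential loss factors the old model into observation weights w_i^{(m)}
\sum_{i=1}^{N} \exp\!\big(-y_i\,[\,f_{m-1}(x_i) + \beta\, G(x_i)\,]\big)
  = \sum_{i=1}^{N} w_i^{(m)} \exp\!\big(-\beta\, y_i\, G(x_i)\big),
\qquad
w_i^{(m)} \equiv \exp\!\big(-y_i\, f_{m-1}(x_i)\big)
```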

Boosting As Additive Model First, assume that β is a constant and minimize with respect to G:

Boosting As Additive Model First, assume that β is a constant and minimize with respect to G: if we choose G such that the training error errm on the weighted data is minimized, that is our optimal G.
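Since yi G(xi) is +1 for correct and −1 for incorrect predictions, the weighted objective splits into two sums, and for any fixed β > 0 it is minimized by the G with the smallest weighted training error (standard derivation; the slide's own equations are not in the transcript):

```latex
% For fixed beta > 0, minimizing over G only involves the weighted error
\sum_i w_i^{(m)} e^{-\beta\, y_i G(x_i)}
  = e^{-\beta}\!\! \sum_{y_i = G(x_i)}\!\! w_i^{(m)}
  \;+\; e^{\beta}\!\! \sum_{y_i \neq G(x_i)}\!\! w_i^{(m)}
  = \big(e^{\beta} - e^{-\beta}\big) \sum_i w_i^{(m)} I\big(y_i \neq G(x_i)\big)
    \;+\; e^{-\beta} \sum_i w_i^{(m)}
```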

Boosting As Additive Model Next, given this optimal G, minimize with respect to β. Another property of the exponential loss function is that we get an especially simple derivative.
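Setting the derivative with respect to β to zero gives the familiar closed form in terms of the weighted error errm (again the standard result, filling in the equation not shown in the transcript):

```latex
% Closed-form beta_m and the resulting weight update.
% Note: the alpha_T = log((1-eps_T)/eps_T) on the Discrete AdaBoost slide equals 2*beta_m;
% the factor of 2 does not change the sign of the final vote.
\beta_m = \tfrac{1}{2} \log \frac{1 - \mathrm{err}_m}{\mathrm{err}_m},
\qquad
\mathrm{err}_m = \frac{\sum_i w_i^{(m)} I\big(y_i \neq G_m(x_i)\big)}{\sum_i w_i^{(m)}},
\qquad
w_i^{(m+1)} = w_i^{(m)} \exp\!\big(-\beta_m\, y_i\, G_m(x_i)\big)
```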

Boosting: Practical Issues
When to stop?
  Most improvement comes from the first 5 to 10 classifiers
  Significant gains up to about 25 classifiers
  Generalization error can continue to improve even after training error reaches zero!
  Methods: cross-validation; a discrete estimate of the expected generalization error EG
How are bias and variance affected?
  Variance usually decreases
  Boosting can give a reduction in both bias and variance

Boosting: Practical Issues
When can boosting have problems?
  Not enough data
  Really weak learner
  Really strong learner
  Very noisy data (although this can be mitigated, e.g. by detecting outliers or by regularization methods)
  Boosting can be used to detect noise: look for points with very high weights

Population Minimizers
Why do we care about them?
We try to approximate the optimal Bayes classifier: predict the label with the largest posterior probability
All we really care about is finding a function whose sign agrees with the optimal Bayes classifier
By approximating the population minimizer (which must satisfy certain weak conditions), we approximate the optimal Bayes classifier
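For the exponential loss, the population minimizer can be written down explicitly, and its sign matches the Bayes rule (standard formula, not shown in the transcript):

```latex
% Population minimizer of the exponential loss: half the log-odds,
% so sign(f*) is exactly the Bayes classifier
f^{*}(x) = \arg\min_{f(x)} \mathbb{E}\big[e^{-Y f(x)} \mid x\big]
         = \tfrac{1}{2} \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)}
```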

Features of Exponential Loss
Advantages:
  Leads to a simple decomposition into observation weights + a weak classifier
  Smooth, with gradually changing derivatives
  Convex
Disadvantages:
  Incorrectly classified outliers may get weighted too heavily (exponentially increased weights), leading to over-sensitivity to noise

Squared Error Loss Explanation of Fig. 10.4: squared error is not a monotone decreasing function of the margin y ∙ f(x); it also penalizes points that are correctly classified with large margins, which makes it a poor surrogate for classification error.
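A one-line calculation makes this concrete: because y = ±1, squared error depends on the margin alone, and it grows again once the margin exceeds 1.

```latex
% With y in {-1,+1}, squared error is a function of the margin y f(x)
\big(y - f(x)\big)^2 = y^2 - 2\,y f(x) + f(x)^2 = \big(1 - y f(x)\big)^2
```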

Other Loss Functions For Classification
Logistic loss
  Very similar population minimizer to exponential
  Similar behavior for positive margins, very different for negative margins
  Logistic is more robust against outliers and misspecified data
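Written in the same margin notation (using the scaling of Hastie et al., under which the population minimizer is again the half log-odds):

```latex
% Binomial deviance ("logistic loss") as a function of the margin y f(x)
L\big(y, f(x)\big) = \log\!\big(1 + e^{-2\,y f(x)}\big),
\qquad
f^{*}(x) = \tfrac{1}{2} \log \frac{P(Y = 1 \mid x)}{P(Y = -1 \mid x)}
```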

Other Loss Functions For Classification
Hinge (SVM)
General hinge (SVM)
These can give improved robustness or accuracy, but require more complex optimization methods
  Boosting with exponential loss is linear optimization; the SVM is quadratic optimization
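For reference, the hinge loss underlying the SVM, in the same margin notation (the "general hinge" variant mentioned on the slide is not reproduced in the transcript, so only the standard form is shown here):

```latex
% Hinge (support vector) loss as a function of the margin
L\big(y, f(x)\big) = \big[\,1 - y f(x)\,\big]_{+} = \max\!\big(0,\; 1 - y f(x)\big)
```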

Robustness of different loss functions

Loss Functions for Regression
Squared-error loss weights outliers very heavily: more sensitive to noise and to long-tailed error distributions
Absolute loss
Huber loss is a hybrid of the two:
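The Huber loss referred to above is quadratic for small residuals and linear for large ones (standard definition, filling in the formula that follows the colon on the slide):

```latex
% Huber loss: squared error near zero, absolute error in the tails
L\big(y, f(x)\big) =
\begin{cases}
  \big(y - f(x)\big)^2, & |y - f(x)| \le \delta,\\[4pt]
  2\delta\,|y - f(x)| - \delta^2, & \text{otherwise}.
\end{cases}
```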

Robust Loss function for Regression

Boosting and SVM
Boosting increases the margin y ∙ f(x) by additive stagewise optimization
The SVM also maximizes the margin y ∙ f(x)
The difference is in the loss function: AdaBoost uses the exponential loss, while the SVM uses the hinge loss
The SVM is more robust to outliers than AdaBoost
Boosting can turn weak base classifiers into a strong one; the SVM is itself a strong classifier

Summary
Boosting combines weak learners to obtain a strong one
From the optimization perspective, boosting is a forward stagewise minimization that maximizes a classification/regression margin
Its robustness depends on the choice of the loss function
Boosting with trees has been claimed to be the "best off-the-shelf" classification algorithm
Boosting can overfit!