
BOOSTING David Kauchak CS451 – Fall 2013

Admin: Final project

Ensemble learning Basic idea: if one classifier works well, why not use multiple classifiers!

Ensemble learning Basic idea: if one classifier works well, why not use multiple classifiers! Training (figure): the training data is fed to m learning algorithms, producing model 1, model 2, …, model m.

Ensemble learning Basic idea: if one classifier works well, why not use multiple classifiers! Testing (figure): the example to label is given to model 1, model 2, …, model m, producing prediction 1, prediction 2, …, prediction m.

Idea 4: boosting (figure). Each training example (data + label) gets a weight (e.g. 0.2); each boosting round produces a new “training” data set with different example weights.

“Strong” learner Given: - a reasonable amount of training data - a target error rate ε - a failure probability p A strong learning algorithm will produce a classifier with error rate < ε with probability 1 − p

“Weak” learner Given: - a reasonable amount of training data - a failure probability p A weak learning algorithm will produce a classifier with error rate < 0.5 with probability 1 − p Weak learners are much easier to create!

Weak learners for boosting (figure): weighted training data (data, label, weight, e.g. 0.2) goes into a weak learning algorithm, which produces a weak classifier. We need a weak learning algorithm that can handle weighted examples. Which of our algorithms can handle weights?

Boosting: basic algorithm Training: start with equal example weights; then, for some number of iterations: - learn a weak classifier and save it - change the example weights Classify: - get a prediction from all learned weak classifiers - take a weighted vote based on how well each weak classifier did when it was trained
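A minimal sketch of this generic train/classify skeleton in Python (the helper names learn_weak and reweight are placeholders, not from the slides; AdaBoost below fills in the concrete reweighting and scoring rules):
# Sketch of the generic boosting loop described above.
def boost_train(examples, labels, iterations, learn_weak, reweight):
    n = len(examples)
    weights = [1.0 / n] * n                      # start with equal example weights
    classifiers, scores = [], []
    for _ in range(iterations):
        h, score = learn_weak(examples, labels, weights)   # learn a weak classifier and save it
        classifiers.append(h)
        scores.append(score)                     # how well it did on the weighted data
        weights = reweight(weights, h, examples, labels)   # change the example weights
    return classifiers, scores
def boost_classify(x, classifiers, scores):
    vote = sum(s * h(x) for h, s in zip(classifiers, scores))   # weighted vote
    return 1 if vote >= 0 else -1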

Boosting basics (figure: examples E1-E5 with their weights): start with equally weighted examples, then learn a weak classifier: weak 1

Boosting (figure: weak 1 splits examples E1-E5 into classified correct and classified incorrect). We want to reweight the examples and then learn another weak classifier. How should we change the example weights?

Boosting (examples E1-E5 and their weights): - decrease the weight for those we’re getting correct - increase the weight for those we’re getting incorrect

Boosting (figure: reweighted examples E1-E5): learn another weak classifier: weak 2

Boosting (figure: weak 2 on the reweighted examples E1-E5)

Boosting (examples E1-E5 and their updated weights): again, - decrease the weight for those we’re getting correct - increase the weight for those we’re getting incorrect

Classifying: weak 1 and weak 2 each make a prediction (prediction 1, prediction 2, …); take a weighted vote based on how well they classify the training data. Here weak_2_vote > weak_1_vote since it got more right.

Notation: x_i is example i in the training data; w_i is the weight for example i, and we will enforce w_i ≥ 0 and Σ_i w_i = 1; classifier_k(x_i) ∈ {+1, −1} is the prediction of classifier k on example i.

AdaBoost: train for k = 1 to iterations: - classifier_k = learn a weak classifier based on the weights - calculate the weighted error for this classifier - calculate a “score” for this classifier - change the example weights

AdaBoost: train. classifier_k = learn a weak classifier based on the weights. The weighted error for this classifier is: error_k = Σ_i w_i · 1[classifier_k(x_i) ≠ label_i]. What does this say?

AdaBoost: train. The weighted error, error_k = Σ_i w_i · 1[classifier_k(x_i) ≠ label_i], is the weighted sum of the errors/mistakes: the indicator asks “did we get the example wrong?”, and w_i weights it. What is the range of possible values?

AdaBoost: train. The weighted error error_k = Σ_i w_i · 1[classifier_k(x_i) ≠ label_i] is the weighted sum of the mistakes. It is between 0, if we get all examples right, and 1, if we get them all wrong.
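As a small illustration (assuming numpy arrays, +1/−1 labels and predictions, and weights that sum to 1), the weighted error could be computed like this:
import numpy as np
# Weighted error: the sum of the weights of the misclassified examples.
def weighted_error(weights, labels, predictions):
    mistakes = predictions != labels      # which examples did we get wrong?
    return np.sum(weights[mistakes])      # between 0 (all right) and 1 (all wrong)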

AdaBoost: train. classifier_k = learn a weak classifier based on the weights. The “score” or weight for this classifier is: α_k = (1/2) ln((1 − error_k) / error_k). What does this look like as a function of the error (between 0 and 1)?

AdaBoost: train (figure: plot of α_k versus the error): - α_k is positive for errors below 0.5 and negative for errors above 0.5, growing without bound as the error approaches 0 or 1 - an error of 50% gives α_k = 0
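A short sketch of the score computation (the clipping of the error away from 0 and 1 is an added numerical safeguard, not part of the slides):
import numpy as np
# Classifier "score": alpha_k = 0.5 * ln((1 - error_k) / error_k).
# It is 0 at 50% error, positive below 0.5, negative above 0.5, and grows
# without bound as the error approaches 0 or 1.
def classifier_score(error, eps=1e-10):
    error = np.clip(error, eps, 1 - eps)
    return 0.5 * np.log((1 - error) / error)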

AdaBoost: classify. label(x) = sign(Σ_k α_k · classifier_k(x)). What does this do?

AdaBoost: classify. The weighted vote of the learned classifiers, weighted by α (remember α goes from positive, for low training error, to negative, for training error above 50%). What happens if a classifier has error > 50%?

AdaBoost: classify. The weighted vote of the learned classifiers, weighted by α. If a classifier has error > 50%, its α is negative: we actually vote the opposite of its prediction!
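A sketch of this classification rule (assuming the learned classifiers are callables returning +1/−1; the names are illustrative):
# Final prediction: the sign of the alpha-weighted vote of the weak classifiers.
# A classifier with error > 50% has a negative alpha, so its vote is flipped.
def adaboost_classify(x, classifiers, alphas):
    vote = sum(alpha * h(x) for h, alpha in zip(classifiers, alphas))
    return 1 if vote >= 0 else -1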

AdaBoost: train, updating the weights. Update the example weights: w_i = (w_i / Z) · exp(−α_k · label_i · classifier_k(x_i)). Remember, we want to enforce Σ_i w_i = 1. Z is called the normalizing constant; it is used to make sure that the weights sum to 1. What should it be?

AdaBoost: train. Update the example weights: w_i = (w_i / Z) · exp(−α_k · label_i · classifier_k(x_i)). Since we want to enforce Σ_i w_i = 1, the normalizing constant is the sum of the “new” (unnormalized) w_i: Z = Σ_i w_i · exp(−α_k · label_i · classifier_k(x_i)).

AdaBoost: train update the example weights What does this do?

AdaBoost: train. Update the example weights: label_i · classifier_k(x_i) is positive when we get the example correct and negative when we get it incorrect. So what does exp(−α_k · label_i · classifier_k(x_i)) give for correct versus incorrect examples?

AdaBoost: train. Update the example weights: label_i · classifier_k(x_i) is positive for correct examples and negative for incorrect ones, so (for positive α_k) exp(−α_k · label_i · classifier_k(x_i)) is a small value for correct examples and a large value for incorrect examples. Note: we only change weights based on the current classifier (not all previous classifiers).
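A sketch of this weight update under the same assumptions (numpy arrays, +1/−1 labels and predictions):
import numpy as np
# Weight update: w_i <- w_i * exp(-alpha_k * label_i * prediction_i) / Z.
# For a positive alpha, correct examples (label * prediction = +1) get a small
# multiplier and incorrect examples (label * prediction = -1) get a large one.
def update_weights(weights, alpha, labels, predictions):
    new_w = weights * np.exp(-alpha * labels * predictions)
    return new_w / np.sum(new_w)          # Z = sum of the new weights, so they sum to 1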

AdaBoost: train update the example weights What does the α do?

AdaBoost: train. Update the example weights. What does the α do? If the classifier was good (< 50% error), α is positive: trust the classifier output and move as normal. If the classifier was bad (> 50% error), α is negative: the classifier is so bad, consider the opposite of its prediction.

AdaBoost: train. Update the example weights. If the classifier was good (< 50% error), α is positive: correct examples get a small multiplier and incorrect examples get a large one. If the classifier was bad (> 50% error), α is negative and the effect flips.

AdaBoost justification: the example weight update uses exp(−α_k · label_i · classifier_k(x_i)). Does this look like anything we’ve seen before?

AdaBoost justification: exponential loss! The weight update has the same form as the exponential loss, loss = exp(−label · prediction). AdaBoost turns out to be another approach for minimizing the exponential loss!
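For concreteness, a sketch of the exponential loss of the boosted ensemble f(x) = Σ_k α_k · classifier_k(x) (the names are illustrative; this is the quantity AdaBoost can be viewed as greedily minimizing):
import numpy as np
# Average exponential loss over the training set: mean of exp(-label_i * f(x_i)).
def exponential_loss(examples, labels, classifiers, alphas):
    f = np.array([sum(a * h(x) for h, a in zip(classifiers, alphas)) for x in examples])
    return np.mean(np.exp(-np.asarray(labels) * f))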

Other boosting variants (figure: loss versus margin, with mistakes to the left and correct predictions to the right, comparing the 0-1 loss with the losses used by AdaBoost, LogitBoost, and BrownBoost)

Boosting example: start with an equally weighted data set.

Boosting example: the weak learner is a line; a hypothesis h with p(error) = 0.5 is at chance. What would be the best line learned on this data set?

Boosting example: this one seems to be the best. This is a ‘weak classifier’: it performs slightly better than chance. How should we reweight the examples?

Boosting example (figure): on one side of the learned line, reds get more weight and blues get less; on the other side, reds get less weight and blues get more (i.e., the misclassified points are upweighted). What would be the best line learned on this data set?

Boosting example How should we reweight examples?

Boosting example What would be the best line learned on this data set?

Boosting example

The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers (figure: f1, f2, f3, f4).

AdaBoost: train for k = 1 to iterations: - classifier_k = learn a weak classifier based on the weights - the weighted error for this classifier is error_k = Σ_i w_i · 1[classifier_k(x_i) ≠ label_i] - the “score” or weight for this classifier is α_k = (1/2) ln((1 − error_k) / error_k) - change the example weights What can we use as a classifier?

AdaBoost: train (the same loop as above). What can we use as a classifier? - Anything that can train on weighted examples - For most applications, it must be fast! Why?

AdaBoost: train (the same loop as above). What can we use as a classifier? - Anything that can train on weighted examples - For most applications, it must be fast, because each iteration we have to train a new classifier!
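Putting the pieces above together, a compact AdaBoost training loop might look like the sketch below (it assumes a learn_weak(X, y, w) routine that accepts example weights and returns a +1/−1 classifier; the small eps is an added numerical safeguard):
import numpy as np
def adaboost_train(X, y, learn_weak, iterations, eps=1e-10):
    n = len(y)
    w = np.full(n, 1.0 / n)                          # equal starting weights
    classifiers, alphas = [], []
    for _ in range(iterations):
        h = learn_weak(X, y, w)                      # must train on *weighted* examples
        pred = np.array([h(x) for x in X])           # +1/-1 predictions
        error = np.sum(w[pred != y])                 # weighted error
        alpha = 0.5 * np.log((1 - error + eps) / (error + eps))   # classifier score
        w = w * np.exp(-alpha * y * pred)            # upweight mistakes, downweight correct
        w = w / np.sum(w)                            # normalize so the weights sum to 1
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas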

Boosted decision stumps One of the most common classifiers to use is a decision tree: - can use a shallow (2-3 level) tree - even more common is a 1-level tree, called a decision stump, which asks a question about a single feature What does the decision boundary look like for a decision stump?

Boosted decision stumps One of the most common classifiers to use is a decision tree: - can use a shallow (2-3 level tree) - even more common is a 1-level tree - called a decision stump - asks a question about a single feature What does the decision boundary look like for boosted decision stumps?

Boosted decision stumps One of the most common classifiers to use is a decision tree: - can use a shallow (2-3 level tree) - even more common is a 1-level tree - called a decision stump - asks a question about a single feature - Linear classifier! - Each stump defines the weight for that dimension - If you learn multiple stumps for that dimension then it’s the weighted average
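One possible weighted decision stump learner, written as a deliberately simple brute-force search (a sketch, not an optimized implementation; it plugs into the adaboost_train sketch above as learn_weak):
import numpy as np
# Brute-force weighted decision stump: pick the (feature, threshold, sign) with
# the lowest weighted error.  X is an (n, d) array, y is +1/-1, w sums to 1.
# Simple but slow; real implementations sort each feature once.
def learn_stump(X, y, w):
    best, best_err = None, float("inf")
    for d in range(X.shape[1]):
        for thresh in np.unique(X[:, d]):
            for sign in (1, -1):
                pred = np.where(X[:, d] > thresh, sign, -sign)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best_err, best = err, (d, thresh, sign)
    d, thresh, sign = best
    return lambda x, d=d, t=thresh, s=sign: s if x[d] > t else -s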

Boosting in practice: very successful on a wide range of problems. One of the keys is that boosting tends not to overfit, even for a large number of iterations (in the example shown, even when using 2,000,000 parameters!).

Adaboost application example: face detection

To give you some context of its importance:

“Weak” learners: 4 types of “rectangle filters” (similar to Haar wavelets; Papageorgiou et al.). Based on a 24x24 grid: 160,000 features to choose from. g(x) = sum(WhiteArea) − sum(BlackArea)

“Weak” learners: F(x) = α_1 f_1(x) + α_2 f_2(x) + …, where f_i(x) = 1 if g_i(x) > θ_i, and −1 otherwise.
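A sketch of how such a rectangle-filter weak learner could be evaluated with an integral image (the rectangle coordinates and the threshold theta are illustrative placeholders; this follows the g(x) = sum(white) − sum(black) definition above):
import numpy as np
def integral_image(img):
    # ii[r, c] = sum of img[0:r+1, 0:c+1]
    return img.cumsum(axis=0).cumsum(axis=1)
def rect_sum(ii, top, left, bottom, right):
    # Sum of img[top:bottom, left:right] from 4 integral-image lookups.
    total = ii[bottom - 1, right - 1]
    if top > 0:
        total -= ii[top - 1, right - 1]
    if left > 0:
        total -= ii[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
def g(ii, white, black):
    # g(x) = sum(WhiteArea) - sum(BlackArea); white/black are (top, left, bottom, right)
    return rect_sum(ii, *white) - rect_sum(ii, *black)
def f(ii, white, black, theta):
    # f_i(x) = 1 if g_i(x) > theta_i, -1 otherwise
    return 1 if g(ii, white, black) > theta else -1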

Example output

Solving other “face” tasks: facial feature localization, demographic analysis, profile detection.

“weak” classifiers learned

Bagging vs Boosting

Boosting neural networks (figure: change in error rate over the standard classifier for Ada-Boosting, Arcing, and Bagging; the white bar represents 1 standard deviation).

Boosting Decision Trees