Lecture 06: Bagging and Boosting


Lecture 06: Bagging and Boosting
CS480/680: Intro to ML. 09/25/18, Yao-Liang Yu

Announcement
- A2 on course webpage
- ICLR reproducibility challenge: https://reproducibility-challenge.github.io/iclr_2019/
- Need to submit on Kaggle
- Do not cheat!
- Final exam: Dec 19, 4-6:30pm

Decision Stump Recalled
[Figure: a one-split stump on the eye feature: low → No, high → Yes]
Seems to be overly weak… Let's make it strong!
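
For concreteness, a decision stump is just a depth-one decision tree. A minimal scikit-learn sketch (the toy one-feature data below is made up for illustration):

```python
# A decision stump: one feature, one threshold, i.e. a depth-1 decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.1], [0.3], [0.7], [0.9]])   # a single "eye" feature
y = np.array([0, 0, 1, 1])                   # 0 = No, 1 = Yes

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(stump.predict([[0.2], [0.8]]))         # expected: [0 1]
```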

More Generally
Which algorithm to use for my problem? Cheap answers:
- Deep learning, but then which architecture?
- I don't know; whatever my boss tells me
- Whatever I can find in scikit-learn
- I try many and pick the "best"
Why don't we combine a few algorithms? But how?

The Power of Aggregation
- Train $h_t$ on data set $D_t$, say each with accuracy $p > \tfrac{1}{2}$.
- Assuming the $D_t$ are independent, the $h_t$ are independent.
- Predict with the majority label among the $(2T+1)$ $h_t$'s.
- What is the accuracy? With $B_t \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(p)$,
  $\Pr\Big(\sum_{t=1}^{2T+1} B_t \ge T+1\Big) = \sum_{k=T+1}^{2T+1} \binom{2T+1}{k} p^k (1-p)^{2T+1-k} \approx 1 - \Phi\Big(\tfrac{T+1-(2T+1)p}{\sqrt{(2T+1)p(1-p)}}\Big) \to 1$ as $T \to \infty$.
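
A quick numerical check of this formula, as a sketch (the choice p = 0.6 and the values of T are arbitrary):

```python
# Majority vote over 2T+1 independent classifiers, each correct with prob. p:
# exact binomial tail vs. the Gaussian approximation from the slide.
from math import comb, sqrt
from statistics import NormalDist

def majority_accuracy(p, T):
    """Exact: Pr(at least T+1 of the 2T+1 iid Bernoulli(p) votes are correct)."""
    n = 2 * T + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(T + 1, n + 1))

def normal_approx(p, T):
    """Approximation: 1 - Phi((T+1 - (2T+1)p) / sqrt((2T+1) p (1-p)))."""
    n = 2 * T + 1
    return 1 - NormalDist().cdf((T + 1 - n * p) / sqrt(n * p * (1 - p)))

for T in (1, 5, 25, 100):
    print(T, round(majority_accuracy(0.6, T), 4), round(normal_approx(0.6, T), 4))
# The accuracy climbs toward 1 as T grows, even though each voter is only 60% accurate.
```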

Bootstrap Aggregating (Breiman'96)
- Can't afford to have many independent training sets. Bootstrapping!
- [Figure: from the training set {1, …, n}, draw T bootstrap samples by sampling indices with replacement; train h1, …, hT, one per sample.]
- Combine the predictions, e.g. by majority vote.
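
A minimal bagging sketch along these lines (toy data, decision trees as base learners; scikit-learn's BaggingClassifier packages the same idea):

```python
# Bagging by hand: T bootstrap samples (indices drawn with replacement),
# one tree per sample, predictions combined by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, random_state=0)   # toy data, labels in {0, 1}
T = 25

all_preds = []
for t in range(T):
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    all_preds.append(tree.predict(X))

vote = (np.mean(all_preds, axis=0) > 0.5).astype(int)   # majority vote over T trees
print("bagged accuracy on the training set:", (vote == y).mean())
```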

Bagging for Regression
- Simply average the outputs of the ht, each trained on a bootstrapped training set.
- With T independent ht, averaging would reduce the variance by a factor of T.
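
The factor-of-T claim is the usual identity for averages of iid random variables (independence is an idealization here, since bootstrap replicates overlap):
$\mathrm{Var}\!\Big(\frac{1}{T}\sum_{t=1}^{T} h_t(x)\Big) = \frac{1}{T^2}\sum_{t=1}^{T}\mathrm{Var}\big(h_t(x)\big) = \frac{\mathrm{Var}\big(h_1(x)\big)}{T}.$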

When Bagging Works
- Bagging is essentially averaging.
- Beneficial if the base classifiers have high variance (are unstable), i.e., their performance changes a lot if the training set is slightly perturbed.
- Like decision trees, but not k-NN.

Randomizing Output (Breiman'00)
- For regression: add small Gaussian noise to each yi (leaving xi untouched); train many ht and average their outputs.
- For classification:
  - Use one-hot encoding and reduce to regression.
  - Randomly flip a small proportion of labels in the training set; train many ht and take a majority vote.
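
A sketch of the regression variant (the noise scale below is an arbitrary choice):

```python
# Output randomization for regression: perturb each target y_i with small
# Gaussian noise, train a fresh tree on each perturbed copy, average the outputs.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
T, sigma = 25, 0.1 * y.std()   # noise level: an illustrative choice

models = [DecisionTreeRegressor().fit(X, y + rng.normal(0.0, sigma, size=len(y)))
          for _ in range(T)]
y_hat = np.mean([m.predict(X) for m in models], axis=0)   # average the T outputs
```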

Random Forest (Breiman'01)
- A collection of tree-structured classifiers {h(x; Θt), t = 1, …, T}, where the Θt are iid random vectors:
  - Bagging: random samples
  - Random feature splits
  - Or both
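
In scikit-learn this corresponds to RandomForestClassifier, which combines bootstrap samples with random feature subsets at each split (a sketch on toy data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# bootstrap=True gives the bagging part; max_features="sqrt" gives random splits.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```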

Leo Breiman (1928-2005)

Boosting
- Given many classifiers ht, each only slightly better than random guessing, is it possible to construct a classifier with nearly optimal accuracy?
- Yes! First shown by Schapire (1990).

The Power of Majority (Freund, 1995) (Schapire, 1990)

Hedging (Freund & Schapire'97)

What Guarantee?
[Figure: the Hedge bound, annotated with: your total loss, the total loss of the best expert, the # of experts, the # of rounds.]
- The excess loss per round goes to 0 as T → ∞ if β is chosen appropriately.
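
A minimal sketch of the Hedge update itself, i.e. the multiplicative-weights scheme of Freund & Schapire '97 (the losses below are synthetic):

```python
# Hedge: keep one weight per expert; after each round multiply each expert's
# weight by beta**loss, so experts with small cumulative loss dominate.
import numpy as np

def hedge(losses, beta=0.9):
    """losses: array of shape (T, N) with per-round, per-expert losses in [0, 1]."""
    T, N = losses.shape
    w = np.ones(N)
    my_total = 0.0
    for loss_t in losses:
        p = w / w.sum()          # distribution over experts this round
        my_total += p @ loss_t   # (expected) loss suffered this round
        w *= beta ** loss_t      # multiplicative update
    return my_total, losses.sum(axis=0).min()   # your loss vs. best expert's loss

rng = np.random.default_rng(0)
mine, best = hedge(rng.random((1000, 5)))
print(mine, best)
# Choosing beta closer to 1 for longer horizons makes the average gap to the
# best expert vanish, matching the guarantee above.
```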

Adaptive Boost (Freund & Schapire'97) (Schapire & Singer, 1999)

Look Closely (annotations on the AdaBoost pseudo-code)
- Bigger ε_t means bigger β_t, hence a smaller coefficient on h_t (the coefficient can be optimized).
- ε_t is the expected (weighted) error of h_t; the algorithm is "adaptive" as long as ε_t ≤ 1/2.
- Labels y are 1 or 0.
- The weight update is a discount: the closer h(x) is to y, the bigger the exponent.
- What happens when w_i^t = 0?
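
A minimal AdaBoost sketch with decision stumps, using the common equivalent formulation with labels in {-1, +1} and coefficients α_t = ½ ln((1-ε_t)/ε_t) (the slide's version with β_t = ε_t/(1-ε_t) and labels in {0, 1} is the same algorithm up to reparameterization); the data is a toy set:

```python
# AdaBoost sketch: each round, fit a stump to the weighted data, then up-weight
# the examples it got wrong so the next stump focuses on them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, random_state=0)
y = 2 * y01 - 1                        # relabel to {-1, +1}
n, T = len(y), 50

w = np.full(n, 1.0 / n)                # example weights
stumps, alphas = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = w[pred != y].sum()                   # weighted error epsilon_t
    if eps <= 0 or eps >= 0.5:                 # perfect, or no better than chance
        break
    alpha = 0.5 * np.log((1 - eps) / eps)      # coefficient of this stump
    w *= np.exp(-alpha * y * pred)             # mistakes get up-weighted
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

H = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (H == y).mean())
```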

Does It Work?

Exponential Decay of Training Error
- The training error decays exponentially in the # of weak classifiers, as long as each weighted error is slightly smaller than ½.
- Basically a form of gradient descent to minimize the exponential loss $\sum_i e^{-y_i h(\boldsymbol{x}_i)}$, with h in the conic hull of the ht's.
- Overfitting? Use simple base classifiers (e.g., decision stumps).
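
The standard bound behind the exponential-decay claim, writing the weighted error of the t-th weak classifier as $\varepsilon_t = \tfrac{1}{2} - \gamma_t$ (this is the usual Freund & Schapire-style statement, not copied from the slide):

$\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{H(\boldsymbol{x}_i) \neq y_i\} \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \;=\; \prod_{t=1}^{T}\sqrt{1-4\gamma_t^2} \;\le\; \exp\Big(-2\sum_{t=1}^{T}\gamma_t^2\Big),$

so the training error drops exponentially in the number of weak classifiers as long as each $\gamma_t$ stays bounded away from 0.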

Will AdaBoost Overfit?
[Figure: margin distributions; margin = y·h(x).]

Seriously? (Grove & Schuurmans'98; Breiman'99)

Pros and Cons
- Pros: a "straightforward" way to boost performance; flexible, works with any base classifier.
- Cons: less interpretable; longer training time, hard to parallelize (in contrast to bagging).

Extensions
- Variants: LogitBoost, GradBoost, L2Boost, … you name it.
- Other settings: multi-class, regression, ranking.

Face Detection (Viola & Jones'01)
- Each detection window results in ~160k features.
- Speed is crucial for real-time detection!
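
The real-time speed comes from evaluating each rectangular (Haar-like) feature in constant time via an integral image; the slides don't show this code, so the following is only a sketch of that standard trick:

```python
# Integral image: ii[r, c] = sum of all pixels in img[:r+1, :c+1].
# Any rectangle sum then takes 4 lookups, so each rectangle feature is O(1).
import numpy as np

def integral_image(img):
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from 4 corner lookups on the integral image."""
    total = ii[r1 - 1, c1 - 1]
    if r0 > 0:
        total -= ii[r0 - 1, c1 - 1]
    if c0 > 0:
        total -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += ii[r0 - 1, c0 - 1]
    return total

img = np.arange(36.0).reshape(6, 6)        # toy "image"
ii = integral_image(img)
assert rect_sum(ii, 1, 2, 4, 5) == img[1:4, 2:5].sum()
```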

Cascading
- 38 layers, with ~6k features selected in total.

Examples

Asymmetry (Viola & Jones'02)
- Too many non-faces vs. few faces.
- Trivial to achieve small false positives by sacrificing true positives.
- Asymmetric loss, as a cost matrix over the true label y (rows) and the prediction 𝒚 (columns):

              𝒚 = 1    𝒚 = -1
    y = 1                 k
    y = -1      1/k

SATzilla (Xu et al., 2008)

Questions?