
1 Lecture 06: Bagging and Boosting
CS480/680: Intro to ML. Yao-Liang Yu, 09/25/18.

2 Announcement
A2 is on the course webpage; you need to submit on Kaggle. Do not cheat!
ICLR reproducibility challenge
Final exam: Dec 19, 4-6:30pm

3 Decision Stump Recalled
[Figure: a decision stump splitting on an eye feature: low → No, high → Yes]
Seems to be overly weak… Let's make it strong!

4 More Generally
Which algorithm should I use for my problem? Cheap answers:
Deep learning, but then which architecture?
I don't know; whatever my boss tells me.
Whatever I can find in scikit-learn.
I try many and pick the "best".
Why don't we combine a few algorithms? But how?

5 The Power of Aggregation
Train h_t on data set D_t, say each with accuracy p > 1/2. Assuming the D_t are independent, hence the h_t are independent. Predict with the majority label among (2T+1) h_t's. What is the accuracy?
With B_t iid ~ Bernoulli(p) indicating whether h_t is correct, the accuracy is
$$\Pr\Big(\textstyle\sum_{t=1}^{2T+1} B_t \ge T+1\Big) \;=\; \sum_{k=T+1}^{2T+1} \binom{2T+1}{k}\, p^k (1-p)^{2T+1-k} \;\approx\; 1-\Phi\!\left(\frac{T+1-(2T+1)p}{\sqrt{(2T+1)\,p(1-p)}}\right) \;\to\; 1 \ \text{ as } T\to\infty.$$
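As a quick numerical check of this claim, here is a small sketch (illustrative, not from the slides) that evaluates the majority-vote accuracy with scipy's binomial distribution; p = 0.6 is an arbitrary choice.

```python
# Sketch (illustrative, not from the slides): accuracy of a majority vote
# over 2T+1 independent classifiers, each correct with probability p > 1/2.
from scipy.stats import binom

def majority_vote_accuracy(p: float, T: int) -> float:
    n = 2 * T + 1
    # Pr(at least T+1 of the n classifiers are correct); sf(T) = Pr(X > T).
    return binom.sf(T, n, p)

for T in [0, 5, 25, 100]:
    print(T, majority_vote_accuracy(0.6, T))  # accuracy climbs toward 1
```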

6 Bootstrap Aggregating (Breiman'96)
Can't afford to have many independent training sets? Bootstrapping!
[Figure: from the training set {1, 2, …, n}, draw T bootstrap samples by sampling n points with replacement, train h_1, …, h_T on them, and combine their predictions, e.g. by majority vote]
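A minimal sketch of this procedure (my own illustration, assuming numpy arrays X, y with labels in {-1, +1} and decision trees as the base learner):

```python
# Bagging sketch: T bootstrap samples, one tree per sample, majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([h.predict(X) for h in models])
    return np.sign(votes.sum(axis=0))             # majority vote (labels in {-1, +1})
```

scikit-learn's BaggingClassifier packages the same idea.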

7 Bagging for Regression
Simply average the outputs of the h_t, each trained on a bootstrapped training set.
With T independent h_t, averaging would reduce the variance by a factor of T.
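A one-line justification (standard variance algebra, not spelled out on the slide): if the h_t(x) are independent with common variance σ², then

$$\operatorname{Var}\!\Big(\frac{1}{T}\sum_{t=1}^{T} h_t(x)\Big) \;=\; \frac{1}{T^2}\sum_{t=1}^{T}\operatorname{Var}\big(h_t(x)\big) \;=\; \frac{\sigma^2}{T}.$$

In practice the bootstrapped h_t are correlated, so the actual reduction is smaller.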

8 When Bagging Works
Bagging is essentially averaging. It is beneficial when the base classifiers have high variance (are unstable), i.e., their performance changes a lot if the training set is slightly perturbed.
This is true of decision trees, but not of k-NN.

9 Randomizing Output (Breiman’00)
For regression: add small Gaussian noise to each y_i (leaving x_i untouched), then train many h_t and average their outputs.
For classification: use a one-hot encoding and reduce to regression, or randomly flip a small proportion of the labels in the training set, then train many h_t and take a majority vote.
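A minimal sketch of the regression variant ("output smearing"); the noise level and tree learner are illustrative choices of mine, not from the slide:

```python
# Output smearing sketch: perturb the targets y (not the inputs), train many
# regressors, and average their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def smeared_ensemble_fit(X, y, T=25, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        y_noisy = y + rng.normal(0.0, noise_std, size=len(y))  # Gaussian noise on y_i only
        models.append(DecisionTreeRegressor().fit(X, y_noisy))
    return models

def smeared_ensemble_predict(models, X):
    return np.mean([h.predict(X) for h in models], axis=0)     # average the outputs
```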

10 Random Forest (Breiman’01)
A collection of tree-structured classifiers {h(x; Θ_t), t = 1, …, T}, where the Θ_t are iid random vectors.
The randomness can come from bagging (random samples), random feature splits, or both.
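For reference, a minimal scikit-learn usage sketch; the dataset and parameter values are illustrative, not from the lecture:

```python
# Random forest = bagged trees + random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators: number of bootstrapped trees; max_features="sqrt": number of
# features considered at each split (the random feature split).
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))
```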

11 Leo Breiman

12 Boosting
Given many classifiers h_t, each only slightly better than random guessing, is it possible to construct a classifier with nearly optimal accuracy? Yes! First shown by Schapire (1990).

13 The Power of Majority (Freund, 1995) (Schapire, 1990)

14 Hedging (Freund & Schapire’97)

15 What Guarantee?
The bound involves the number of rounds T, the total loss of the best expert, and the number of experts.
Choosing β appropriately, your per-round regret (your total loss minus the best expert's, divided by T) goes to 0 as T → ∞.
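For reference, the guarantee these annotations point to is the standard Hedge bound of Freund & Schapire (1997), stated here from the paper rather than the slide:

$$L_{\mathrm{Hedge}(\beta)} \;\le\; \frac{\ln(1/\beta)\,\min_i L_i \;+\; \ln N}{1-\beta},$$

where L_i is the total loss of expert i over the T rounds and N is the number of experts; choosing β as a suitable function of T makes the per-round gap to the best expert vanish.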

16 Adaptive Boost (Freund & Schapire’97)
(Schapire & Singer, 1999)

17 Look Closely
Bigger ε_t means bigger β_t and hence a smaller coefficient on h_t.
ε_t is the (weighted) expected error of h_t; labels y are 1 or 0.
Adaptive: it only needs ε_t ≤ 1/2 at each round.
The weight update discounts examples the classifier gets right: the closer h(x) is to y, the bigger the exponent.
The coefficients can be optimized further. What happens when w_i^t = 0?
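To make these annotations concrete, here is a minimal AdaBoost sketch with decision stumps. It uses the common {-1, +1} label convention with coefficients α_t = ½ ln((1-ε_t)/ε_t), an equivalent parameterization of the β_t version annotated above; the details below are my own illustrative choices.

```python
# AdaBoost sketch with decision stumps (labels assumed to be in {-1, +1}).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(X)
    w = np.full(n, 1.0 / n)                    # example weights w_i, start uniform
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-12, None)   # weighted error eps_t
        if eps >= 0.5:                         # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # bigger eps_t -> smaller coefficient
        w *= np.exp(-alpha * y * pred)         # discount correct, up-weight incorrect examples
        w /= w.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    score = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(score)                      # weighted majority vote
```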

18 Does It Work?

19 Exponential decay of … training error!
As long as each ε_t is slightly smaller than 1/2, the training error decays exponentially in the number of weak classifiers.
AdaBoost is basically a form of gradient descent to minimize the exponential loss $\sum_i e^{-y_i h(\boldsymbol{x}_i)}$, with h in the conic hull of the h_t's.
Overfitting? Use simple base classifiers (e.g., decision stumps).
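The standard bound behind this claim (Freund & Schapire, 1997; stated here for reference, not copied from the slide): writing γ_t = 1/2 - ε_t for the edge of the t-th weak classifier,

$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big[H(x_i)\neq y_i\big] \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \;\le\; \exp\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big),$$

so the training error decays exponentially as long as each ε_t stays bounded away from 1/2.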

20 Will AdaBoost Overfit?
margin: y·h(x)

21 Seriously? (Grove & Schuurmans’98; Breiman’99)

22 Pros and Cons
A "straightforward" way to boost performance.
Flexible: works with any base classifier.
Less interpretable.
Longer training time; hard to parallelize (in contrast to bagging).

23 Extensions
LogitBoost, GradBoost, L2Boost, … you name it.
Multi-class, regression, ranking.

24 Face Detection (Viola & Jones’01)
Each detection window results in ~160k features. Speed is crucial for real-time detection!

25 Cascading
38 layers with ~6k features selected.
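A schematic of the cascade idea (an illustrative sketch of mine, not Viola & Jones' actual implementation): a window is declared a face only if every stage accepts it, so most non-face windows are rejected by the cheap early stages.

```python
# Cascade sketch: each stage is a boosted classifier returning a real score;
# a window must clear every stage's threshold to be accepted as a face.
def cascade_predict(stages, thresholds, window):
    for stage, thr in zip(stages, thresholds):
        if stage(window) < thr:
            return -1    # rejected early: not a face (cheap for most windows)
    return +1            # passed all stages: face
```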

26 Examples

27 Asymmetry (Viola & Jones'02)
Too many non-faces vs. few faces. It is trivial to achieve few false positives by sacrificing true positives.
Asymmetric loss (rows are the true label y, columns the prediction ŷ):
            ŷ = 1    ŷ = -1
  y = 1                 k
  y = -1     1/k

28 SATzilla (Xu et al., 2008)

29 Questions?

