CS480/680: Intro to ML
Lecture 06: Bagging and Boosting
Yao-Liang Yu (09/25/18)
Announcement
A2 is on the course webpage; submit on Kaggle. Do not cheat!
ICLR reproducibility challenge
Final exam: Dec 19, 4-6:30pm
Decision Stump Recalled
[Diagram: a single split on an eye feature; low → No, high → Yes.]
Seems to be overly weak… Let's make it strong!
More Generally
Which algorithm should I use for my problem? Cheap answers:
- Deep learning, but then which architecture?
- I don't know; whatever my boss tells me
- Whatever I can find in scikit-learn
- I try many and pick the "best"
Why don't we combine a few algorithms? But how?
The Power of Aggregation
Train h_t on data set D_t, each with accuracy p > 1/2. Assuming the D_t are independent, hence the h_t are independent, predict with the majority label among the 2T+1 classifiers. What is the accuracy?
With B_t iid ~ Bernoulli(p),
\Pr\left(\sum_{t=1}^{2T+1} B_t \ge T+1\right) = \sum_{k=T+1}^{2T+1} \binom{2T+1}{k} p^k (1-p)^{2T+1-k} \approx 1 - \Phi\left(\frac{T+1-(2T+1)p}{\sqrt{(2T+1)p(1-p)}}\right) \to 1 \text{ as } T \to \infty.
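To make the claim concrete, here is a small numerical check (not from the slides; the value p = 0.6 and the use of scipy's binomial tail are illustrative choices): it evaluates the binomial tail above and confirms with a Monte Carlo simulation that the majority vote of 2T+1 independent classifiers approaches perfect accuracy as T grows.

```python
# Numerical check of the majority-vote accuracy of 2T+1 independent
# classifiers, each correct with probability p (illustrative, not from the slides).
import numpy as np
from scipy.stats import binom

def majority_accuracy(p, T):
    """Pr(at least T+1 of the 2T+1 iid Bernoulli(p) votes are correct)."""
    n = 2 * T + 1
    return binom.sf(T, n, p)   # P(X > T) = P(X >= T+1)

p = 0.6
for T in [1, 5, 25, 100]:
    exact = majority_accuracy(p, T)
    # Monte Carlo confirmation
    votes = np.random.rand(20000, 2 * T + 1) < p
    mc = (votes.sum(axis=1) >= T + 1).mean()
    print(f"T={T:4d}  exact={exact:.4f}  simulated={mc:.4f}")
```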
Bootstrap Aggregating (Breiman'96)
We can't afford to have many independent training sets, so bootstrap!
[Diagram: from the original sample {1, 2, …, n}, draw T bootstrap samples with replacement; train h_1, …, h_T on them and combine, e.g., by majority vote. A minimal code sketch follows.]
Bagging for Regression
Simply average the outputs of the h_t, each trained on a bootstrapped training set. With T independent h_t, averaging would reduce the variance by a factor of T.
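A quick numerical illustration of the variance claim (my own, under the idealized assumption that the T estimators really are independent; in bagging the bootstrapped h_t are correlated, so the reduction is smaller in practice).

```python
# Variance of an average of T independent estimates vs. a single estimate.
import numpy as np

rng = np.random.default_rng(0)
T, trials = 10, 100_000
single = rng.normal(0.0, 1.0, size=trials)                       # one estimator per trial
averaged = rng.normal(0.0, 1.0, size=(trials, T)).mean(axis=1)   # average of T per trial

print(single.var())    # ~1.0
print(averaged.var())  # ~1/T = 0.1
```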
When Bagging Works
Bagging is essentially averaging. It is beneficial if the base classifiers have high variance (are unstable), i.e., their performance changes a lot when the training set is slightly perturbed. Decision trees are like this; k-NN is not.
Randomizing Output (Breiman'00)
For regression: add small Gaussian noise to each y_i (leaving x_i untouched), train many h_t, and average their outputs (sketched below).
For classification:
- Use one-hot encoding and reduce to regression, or
- Randomly flip a small proportion of labels in the training set; train many h_t and take a majority vote.
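A minimal sketch of the regression variant; the noise level and the regression-tree base learner are illustrative assumptions, not prescribed by the slides.

```python
# Output randomization for regression (Breiman '00), sketched with
# an illustrative noise level and scikit-learn regression trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def randomized_output_fit(X, y, T=25, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        y_noisy = y + rng.normal(0.0, noise, size=len(y))  # perturb targets only
        models.append(DecisionTreeRegressor(max_depth=3).fit(X, y_noisy))
    return models

def randomized_output_predict(models, X):
    return np.mean([m.predict(X) for m in models], axis=0)
```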
Random Forest (Breiman'01)
A collection of tree-structured classifiers {h(x; Θ_t), t = 1, …, T}, where the Θ_t are iid random vectors. The randomness can come from bagging (random samples), from random feature splits, or from both (see the scikit-learn example below).
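In practice one usually calls an off-the-shelf implementation; here is a minimal scikit-learn example (the dataset and hyperparameters are illustrative choices of mine).

```python
# Random forest via scikit-learn: bagging + random feature splits combined.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100,      # T trees
                            max_features="sqrt",   # random subset of features per split
                            random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())      # average 5-fold accuracy
```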
Leo Breiman
[Photo slide.]
Boosting
Given many classifiers h_t, each slightly better than random guessing, is it possible to construct a classifier with nearly optimal accuracy? Yes! First shown by Schapire (1990).
The Power of Majority (Schapire, 1990; Freund, 1995)
Hedging (Freund & Schapire'97)
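The algorithm box on this slide did not survive extraction. Below is a minimal sketch of Hedge(β) as described by Freund & Schapire '97, keeping one weight per expert, playing the weighted mixture, and multiplying each expert's weight by β^loss; the loss sequence in the usage example is made up for illustration.

```python
# Minimal Hedge(beta) sketch: multiplicative weight updates over N experts.
import numpy as np

def hedge(losses, beta=0.9):
    """losses: array of shape (T, N) with per-round expert losses in [0, 1]."""
    T, N = losses.shape
    w = np.ones(N)                           # uniform initial weights
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                      # distribution over experts
        total_loss += p @ losses[t]          # expected loss suffered this round
        w *= beta ** losses[t]               # discount experts that did badly
    return total_loss

# Made-up example: expert 0 is usually right, expert 1 is a coin flip.
rng = np.random.default_rng(0)
losses = np.stack([rng.random(1000) < 0.1,
                   rng.random(1000) < 0.5], axis=1).astype(float)
print(hedge(losses), losses.sum(axis=0).min())  # learner's loss vs. best expert's loss
```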
What Guarantee?
With T the number of rounds, N the number of experts, and L* the total loss of the best expert, Hedge(β) guarantees a total loss of at most (L* \ln(1/\beta) + \ln N)/(1-\beta). Choosing β appropriately, your average loss approaches that of the best expert, i.e., the regret per round goes to 0 as T → ∞.
Adaptive Boost (Freund & Schapire'97)
(see also Schapire & Singer, 1999)
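The pseudocode figure is missing from the extraction. Here is a minimal sketch in the {0, 1}-label formulation the next slide refers to, with β_t = ε_t/(1−ε_t) and weights discounted by β_t^{1−|h_t(x_i)−y_i|}; the decision-stump base learner and the numerical guard are my own choices.

```python
# Minimal AdaBoost sketch with labels y in {0, 1}, following the
# epsilon_t / beta_t notation discussed on the next slide.
# The base learner (decision stumps via scikit-learn) is an illustrative choice.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(X)
    w = np.ones(n) / n                               # initial example weights
    hs, alphas = [], []
    for t in range(T):
        p = w / w.sum()                              # normalized distribution
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=p)
        pred = h.predict(X)
        eps = p @ np.abs(pred - y)                   # weighted error of h_t
        if eps >= 0.5:                               # weak-learning assumption violated: stop
            break
        beta = max(eps, 1e-12) / (1.0 - eps)         # guard against eps == 0
        w = w * beta ** (1.0 - np.abs(pred - y))     # discount correctly classified examples
        hs.append(h)
        alphas.append(np.log(1.0 / beta))            # coefficient of h_t in the final vote
    return hs, np.array(alphas)

def adaboost_predict(hs, alphas, X):
    # Weighted vote: output 1 iff sum_t alpha_t h_t(x) >= (1/2) sum_t alpha_t.
    scores = sum(a * h.predict(X) for a, h in zip(alphas, hs))
    return (scores >= 0.5 * alphas.sum()).astype(int)
```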
Look Closely
- Bigger ε_t means bigger β_t, hence a smaller coefficient log(1/β_t) on h_t in the final vote; the coefficient can also be optimized.
- ε_t is the expected (weighted) error of h_t; labels are y = 1 or 0.
- Adaptive: only requires ε_t ≤ 1/2 in each round.
- The update w_i ← w_i β_t^{1−|h_t(x_i)−y_i|} is a discount: the closer h_t(x_i) is to y_i, the bigger the exponent and the smaller the new weight.
- What happens when w_i^t = 0?
Does It Work?
[Figure slide.]
Exponential Decay of Training Error
[Plot: training error vs. number of weak classifiers, each with ε_t slightly smaller than ½.]
AdaBoost is basically a form of gradient descent to minimize the exponential loss \sum_i e^{-y_i h(x_i)}, with h in the conic hull of the h_t's.
Overfitting? Use simple base classifiers (e.g., decision stumps).
Will AdaBoost Overfit?
[Plot of the margin y·h(x) distribution.]
Seriously? (Grove & Schuurmans'98; Breiman'99)
Pros and Cons
Pros: a "straightforward" way to boost performance; flexible with any base classifier.
Cons: less interpretable; longer training time; hard to parallelize (in contrast to bagging).
Extensions
Variants: LogitBoost, GradBoost, L2Boost, … you name it.
Settings: multi-class, regression, ranking.
Face Detection (Viola & Jones'01)
Each detection window results in ~160k features. Speed is crucial for real-time detection!
Cascading
38 layers with ~6k features selected.
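A minimal sketch of the cascade idea (the stage classifiers and thresholds here are hypothetical stand-ins, not the actual Viola-Jones detector): a window is declared a face only if every stage accepts it, so cheap early stages reject most non-faces without ever running the later, more expensive ones.

```python
# Sketch of an attentional cascade: a window must pass every stage.
# The stages are hypothetical (score_fn, threshold) pairs, not the
# actual Viola-Jones detector.

def cascade_detect(window, stages):
    """stages: list of (score_fn, threshold); each score_fn maps a window to a float."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False            # rejected early; remaining stages are skipped
    return True                     # survived all stages: report a detection
```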
Examples
[Example detection images.]
Asymmetry (Viola & Jones'02)
Too many non-faces vs. few faces; it is trivial to achieve a small false-positive rate by sacrificing true positives.
Asymmetric loss (rows: true label y; columns: predicted label):

            pred = 1   pred = -1
  y = 1                    k
  y = -1      1/k
SATzilla (Xu et al., 2008)
Questions?