Lecture 18: Bagging and Boosting
CS489/698: Intro to ML. Yao-Liang Yu, 11/21/17.
Decision Stump Recalled
[Figure: a decision stump splitting on an eye feature (low vs. high) into No vs. Yes.]
Seems to be overly weak… Let's make it strong!
More Generally
Which algorithm to use for my problem? Cheap answers:
Deep learning, but then which architecture?
I don't know; whatever my boss tells me
Whatever I can find in scikit-learn
I try many and pick the "best"
Why don't we combine a few algorithms? But how?
The Power of Aggregation
Train h_t on data set D_t, say each with accuracy p > 1/2. Assuming the D_t are independent, hence the h_t are independent. Predict with the majority label among the (2T+1) h_t's. What is the accuracy? With B_t iid ~ Bernoulli(p),

$\Pr\Big(\sum_{t=1}^{2T+1} B_t \ge T+1\Big) = \sum_{k=T+1}^{2T+1} \binom{2T+1}{k} p^k (1-p)^{2T+1-k} \approx 1 - \Phi\Big(\tfrac{T+1-(2T+1)p}{\sqrt{(2T+1)p(1-p)}}\Big) \to 1 \text{ as } T \to \infty.$
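As a quick numerical check on this formula, here is a small Python sketch (assuming SciPy is available; the function name is mine) comparing the exact binomial sum with the normal approximation:

```python
# Accuracy of the majority vote of 2T+1 independent classifiers, each correct w.p. p.
from scipy.stats import binom, norm

def majority_accuracy(p, T):
    n = 2 * T + 1
    exact = binom.sf(T, n, p)  # Pr(#correct votes >= T+1), the exact binomial sum
    approx = 1 - norm.cdf((T + 1 - n * p) / (n * p * (1 - p)) ** 0.5)  # normal approximation
    return exact, approx

for T in (1, 5, 25, 100):
    print(T, majority_accuracy(0.6, T))  # both quantities tend to 1 as T grows
```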
Bootstrap Aggregating (Breiman’96)
Can't afford to have many independent training sets. Bootstrapping!
[Figure: from the original training set {1, …, n}, draw T bootstrap samples with replacement; train h_1, …, h_T, one per sample; combine, e.g., by majority vote.]
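A minimal bagging-by-hand sketch under some assumptions (NumPy arrays, integer class labels, scikit-learn decision stumps as base learners; not the lecture's code):

```python
# Bagging: bootstrap T datasets, train one stump per dataset, predict by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        h = DecisionTreeClassifier(max_depth=1)   # decision stump as the base learner
        models.append(h.fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([h.predict(X) for h in models])   # shape (T, n_test)
    # e.g. majority vote: most frequent label per test point
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```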
Bagging for Regression
Simply average the outputs of the h_t, each trained on a bootstrapped training set.
With T independent h_t, averaging would reduce variance by a factor of T.
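To spell the variance claim out: if the h_t(x) are independent with common variance σ², then

```latex
\mathrm{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} h_t(x)\right)
  = \frac{1}{T^2}\sum_{t=1}^{T}\mathrm{Var}\big(h_t(x)\big)
  = \frac{\sigma^2}{T}.
```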
When Bagging Works
Bagging is essentially averaging.
Beneficial if base classifiers have high variance (unstable), i.e., performance changes a lot if the training set is slightly perturbed.
Like decision trees, but not k-NN.
Randomizing Output (Breiman’00)
For regression: add small Gaussian noise to each y_i (leaving x_i untouched), then train many h_t and average their outputs (sketched below).
For classification, either:
use one-hot encoding and reduce to regression, or
randomly flip a small proportion of labels in the training set;
then train many h_t and majority vote.
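A hedged sketch of the regression variant (the noise scale sigma and the function names are my own choices, not from the slide):

```python
# Output randomization for regression: perturb the targets y_i, not the inputs x_i.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def randomized_output_ensemble(X, y, T=25, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        y_noisy = y + rng.normal(0.0, sigma, size=len(y))   # small Gaussian noise on each y_i
        models.append(DecisionTreeRegressor().fit(X, y_noisy))
    return models

def ensemble_predict(models, X):
    return np.mean([h.predict(X) for h in models], axis=0)  # average the T outputs
```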
Random Forest (Breiman’01)
A collection of tree-structured classifiers {h(x; Θ_t), t = 1, …, T}, where the Θ_t are iid random vectors.
Sources of randomness:
Bagging: random samples
Random feature split
Both
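For reference, scikit-learn (mentioned earlier) combines bootstrap samples with a random subset of features at each split; a minimal usage sketch on a built-in dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Each tree sees a bootstrap sample (bagging) and a random feature subset per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
print(cross_val_score(rf, X, y, cv=5).mean())
```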
Leo Breiman (1928–2005)
Boosting
Given many classifiers h_t, each slightly better than random guessing, is it possible to construct a classifier with nearly optimal accuracy?
Yes! First shown by Schapire (1990).
The Power of Majority (Freund, 1995) (Schapire, 1990)
Hedging (Freund & Schapire’97)
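The algorithm box from this slide is missing from the extracted text. As a rough sketch (my own naming, not the slide's), the Hedge(β) scheme of Freund & Schapire '97 keeps one weight per expert and multiplies it by β raised to that expert's loss after every round:

```python
# Hedge(beta): online allocation over N experts via multiplicative weight updates.
import numpy as np

def hedge(losses, beta=0.9):
    """losses: array of shape (T, N); losses[t, i] in [0, 1] is expert i's loss in round t."""
    T, N = losses.shape
    w = np.ones(N)                       # start with uniform weights
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                  # play the normalized weight vector
        total_loss += p @ losses[t]      # suffer the mixture loss
        w *= beta ** losses[t]           # discount each expert by beta^loss
    return total_loss
```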
What Guarantee?
The bound involves three quantities: the number of rounds, the total loss of the best expert, and the number of experts.
Choosing β appropriately, your average regret (your total loss minus the best expert's, per round) goes to 0 as T → ∞.
Adaptive Boost (Freund & Schapire’97)
(Schapire & Singer, 1999)
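The algorithm box itself did not survive extraction. As a stand-in, here is a hedged sketch of AdaBoost in the Freund & Schapire '97 form (labels in {0, 1}, β_t = ε_t/(1−ε_t), scikit-learn decision stumps as the weak learner); function names are my own, not the lecture's:

```python
# AdaBoost sketch (Freund & Schapire '97 formulation, labels in {0, 1}).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(X)
    w = np.ones(n) / n                                 # uniform initial weights
    stumps, betas = [], []
    for t in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = w[pred != y].sum()                       # weighted error eps_t (w sums to 1)
        if not (0 < eps < 0.5):                        # sketch assumes 0 < eps_t < 1/2
            break
        beta = eps / (1 - eps)                         # beta_t = eps_t / (1 - eps_t)
        w = w * np.where(pred == y, beta, 1.0)         # discount correctly classified points
        w = w / w.sum()                                # renormalize
        stumps.append(h)
        betas.append(beta)
    return stumps, betas

def adaboost_predict(stumps, betas, X):
    coeffs = np.log(1.0 / np.array(betas))             # coefficient log(1/beta_t) per stump
    votes = np.stack([h.predict(X) for h in stumps])   # shape (T, n_test), entries in {0, 1}
    return (coeffs @ votes >= 0.5 * coeffs.sum()).astype(int)
```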
Look Closely (annotations on the algorithm):
Labels are y = 1 or 0; ε_t is the expected (weighted) error of h_t.
Bigger ε_t gives bigger β_t = ε_t/(1−ε_t), hence a smaller coefficient log(1/β_t) for h_t in the final vote; β_t can be optimized.
The weight update w_i ← w_i · β_t^(1−|h_t(x_i)−y_i|) discounts examples h_t handles well: the closer h_t(x) is to y, the bigger the exponent, the bigger the discount.
Adaptive: works as long as ε_t ≤ ½.
What happens when w_i^t = 0?
Does It Work?
Exponential decay of … training error
If each weak classifier's error ε_t is slightly smaller than ½, the training error decays exponentially in the number of weak classifiers!
Basically a form of gradient descent to minimize the exponential loss $\sum_i e^{-y_i h(x_i)}$, with h in the conic hull of the h_t's.
Overfitting? Use simple base classifiers (e.g., decision stumps).
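For concreteness, the standard AdaBoost training-error bound behind this claim, with γ_t = ½ − ε_t (quoted from memory, so treat it as a reference rather than the slide's exact formula):

```latex
\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\,[\,H(x_i) \ne y_i\,]
  \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)}
  \;=\; \prod_{t=1}^{T} \sqrt{1-4\gamma_t^{2}}
  \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big).
```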
Will AdaBoost Overfit?
[Figure; margin: y·h(x)]
Seriously? (Grove & Schuurmans’98; Breiman’99)
Pros and Cons
"Straightforward" way to boost performance
Flexible with any base classifier
Less interpretable
Longer training time; hard to parallelize (in contrast to bagging)
Extensions
LogitBoost, GradBoost, L2Boost, … you name it
Multi-class
Regression
Ranking
Face Detection (Viola & Jones’01)
Each detection window results in ~160k features.
Speed is crucial for real-time detection!
Cascading
38 layers with ~6k features selected.
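As an illustration of the cascading idea only (not the actual Viola–Jones detector): a window is evaluated stage by stage and rejected as soon as any stage's score falls below its threshold, so most non-face windows are discarded after a few cheap stages. The names `stages` and `thresholds` below are hypothetical.

```python
# Cascade evaluation: early rejection keeps the average cost per window low.
def cascade_predict(window, stages, thresholds):
    """stages[k](window) returns a real-valued score; thresholds[k] is that stage's cutoff."""
    for stage, tau in zip(stages, thresholds):
        if stage(window) < tau:
            return -1          # rejected: not a face, stop immediately
    return +1                  # survived all stages: report a face
```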
Examples
Asymmetry (Viola & Jones’02)
Too many non-faces vs. few faces.
Trivial to achieve small false positives by sacrificing true positives.
Asymmetric loss (confusion matrix): penalty k when a face (y = 1) is predicted as a non-face (ŷ = -1), penalty 1/k when a non-face (y = -1) is predicted as a face (ŷ = 1), and 0 otherwise.
SATzilla (Xu et al., 2008)
Questions?