
1 Lecture 06: Bagging and Boosting
CS480/680: Intro to ML. Yao-Liang Yu, 09/25/18.

2 Announcement
A2 is on the course webpage; you need to submit on Kaggle. Do not cheat!
ICLR reproducibility challenge
Final exam: Dec 19, 4-6:30pm

3 Decision Stump Recalled
[Figure: a decision stump splitting on an eye feature: low → No, high → Yes]
Seems to be overly weak… Let's make it strong!

4 More Generally
Which algorithm should I use for my problem? Cheap answers:
Deep learning, but then which architecture?
I don't know; whatever my boss tells me.
Whatever I can find in scikit-learn.
I try many and pick the "best".
Why don't we combine a few algorithms? But how?

5 The Power of Aggregation
Train h_t on data set D_t, say each with accuracy p > 1/2. Assuming the D_t are independent, hence the h_t are independent. Predict with the majority label among (2T+1) h_t's. What is the accuracy?
With B_t iid ~ Bernoulli(p) indicating whether h_t is correct, the accuracy is
$$\Pr\Big(\textstyle\sum_{t=1}^{2T+1} B_t \ge T+1\Big) \;=\; \sum_{k=T+1}^{2T+1} \binom{2T+1}{k}\, p^k (1-p)^{2T+1-k} \;\approx\; 1-\Phi\!\left(\frac{T+1-(2T+1)p}{\sqrt{(2T+1)\,p(1-p)}}\right) \;\to\; 1 \ \text{ as } T\to\infty.$$
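As a quick numerical check of this claim, here is a small sketch (illustrative, not from the slides) that evaluates the majority-vote accuracy with scipy's binomial distribution; p = 0.6 is an arbitrary choice.

```python
# Sketch (illustrative, not from the slides): accuracy of a majority vote
# over 2T+1 independent classifiers, each correct with probability p > 1/2.
from scipy.stats import binom

def majority_vote_accuracy(p: float, T: int) -> float:
    n = 2 * T + 1
    # Pr(at least T+1 of the n classifiers are correct); sf(T) = Pr(X > T).
    return binom.sf(T, n, p)

for T in [0, 5, 25, 100]:
    print(T, majority_vote_accuracy(0.6, T))  # accuracy climbs toward 1
```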

6 Bootstrap Aggregating (Breiman'96)
Can't afford to have many independent training sets? Bootstrapping!
[Figure: from the training set {1, 2, …, n}, draw T bootstrap samples by sampling n points with replacement, train h_1, …, h_T on them, and combine their predictions, e.g. by majority vote]
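A minimal sketch of this procedure (my own illustration, assuming numpy arrays X, y with labels in {-1, +1} and decision trees as the base learner):

```python
# Bagging sketch: T bootstrap samples, one tree per sample, majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, T=25, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)          # sample n indices with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.stack([h.predict(X) for h in models])
    return np.sign(votes.sum(axis=0))             # majority vote (labels in {-1, +1})
```

scikit-learn's BaggingClassifier packages the same idea.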

7 Bagging for Regression
Simply average the outputs of the h_t, each trained on a bootstrapped training set.
With T independent h_t, averaging would reduce the variance by a factor of T.
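A one-line justification (standard variance algebra, not spelled out on the slide): if the h_t(x) are independent with common variance σ², then

$$\operatorname{Var}\!\Big(\frac{1}{T}\sum_{t=1}^{T} h_t(x)\Big) \;=\; \frac{1}{T^2}\sum_{t=1}^{T}\operatorname{Var}\big(h_t(x)\big) \;=\; \frac{\sigma^2}{T}.$$

In practice the bootstrapped h_t are correlated, so the actual reduction is smaller.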

8 When Bagging Works
Bagging is essentially averaging. It is beneficial when the base classifiers have high variance (are unstable), i.e., their performance changes a lot if the training set is slightly perturbed.
This is true of decision trees, but not of k-NN.

9 Randomizing Output (Breiman’00)
For regression: add small Gaussian noise to each y_i (leaving x_i untouched), then train many h_t and average their outputs.
For classification: use a one-hot encoding and reduce to regression, or randomly flip a small proportion of the labels in the training set, then train many h_t and take a majority vote.
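A minimal sketch of the regression variant ("output smearing"); the noise level and tree learner are illustrative choices of mine, not from the slide:

```python
# Output smearing sketch: perturb the targets y (not the inputs), train many
# regressors, and average their predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def smeared_ensemble_fit(X, y, T=25, noise_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(T):
        y_noisy = y + rng.normal(0.0, noise_std, size=len(y))  # Gaussian noise on y_i only
        models.append(DecisionTreeRegressor().fit(X, y_noisy))
    return models

def smeared_ensemble_predict(models, X):
    return np.mean([h.predict(X) for h in models], axis=0)     # average the outputs
```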

10 Random Forest (Breiman’01)
A collection of tree-structured classifiers {h(x; Θ_t), t = 1, …, T}, where the Θ_t are iid random vectors.
The randomness can come from bagging (random samples), random feature splits, or both.
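For reference, a minimal scikit-learn usage sketch; the dataset and parameter values are illustrative, not from the lecture:

```python
# Random forest = bagged trees + random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# n_estimators: number of bootstrapped trees; max_features="sqrt": number of
# features considered at each split (the random feature split).
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))
```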

11 Leo Breiman

12 Boosting
Given many classifiers h_t, each only slightly better than random guessing, is it possible to construct a classifier with nearly optimal accuracy? Yes! First shown by Schapire (1990).

13 The Power of Majority (Freund, 1995) (Schapire, 1990)

14 Hedging (Freund & Schapire’97)

15 What Guarantee?
The bound involves the number of rounds T, the total loss of the best expert, and the number of experts.
Choosing β appropriately, your per-round regret (your total loss minus the best expert's, divided by T) goes to 0 as T → ∞.
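For reference, the guarantee these annotations point to is the standard Hedge bound of Freund & Schapire (1997), stated here from the paper rather than the slide:

$$L_{\mathrm{Hedge}(\beta)} \;\le\; \frac{\ln(1/\beta)\,\min_i L_i \;+\; \ln N}{1-\beta},$$

where L_i is the total loss of expert i over the T rounds and N is the number of experts; choosing β as a suitable function of T makes the per-round gap to the best expert vanish.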

16 Adaptive Boost (Freund & Schapire’97)
(Schapire & Singer, 1999)

17 Look Closely
Bigger ε_t means bigger β_t and hence a smaller coefficient on h_t.
ε_t is the (weighted) expected error of h_t; labels y are 1 or 0.
Adaptive: it only needs ε_t ≤ 1/2 at each round.
The weight update discounts examples the classifier gets right: the closer h(x) is to y, the bigger the exponent.
The coefficients can be optimized further. What happens when w_i^t = 0?
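To make these annotations concrete, here is a minimal AdaBoost sketch with decision stumps. It uses the common {-1, +1} label convention with coefficients α_t = ½ ln((1-ε_t)/ε_t), an equivalent parameterization of the β_t version annotated above; the details below are my own illustrative choices.

```python
# AdaBoost sketch with decision stumps (labels assumed to be in {-1, +1}).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(X)
    w = np.full(n, 1.0 / n)                    # example weights w_i, start uniform
    stumps, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        eps = np.clip(w[pred != y].sum(), 1e-12, None)   # weighted error eps_t
        if eps >= 0.5:                         # no better than random: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)  # bigger eps_t -> smaller coefficient
        w *= np.exp(-alpha * y * pred)         # discount correct, up-weight incorrect examples
        w /= w.sum()
        stumps.append(h)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    score = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
    return np.sign(score)                      # weighted majority vote
```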

18 Does It Work?

19 Exponential decay of … training error!
As long as each ε_t is slightly smaller than 1/2, the training error decays exponentially in the number of weak classifiers.
AdaBoost is basically a form of gradient descent to minimize the exponential loss $\sum_i e^{-y_i h(\boldsymbol{x}_i)}$, with h in the conic hull of the h_t's.
Overfitting? Use simple base classifiers (e.g., decision stumps).
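The standard bound behind this claim (Freund & Schapire, 1997; stated here for reference, not copied from the slide): writing γ_t = 1/2 - ε_t for the edge of the t-th weak classifier,

$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\big[H(x_i)\neq y_i\big] \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \;\le\; \exp\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big),$$

so the training error decays exponentially as long as each ε_t stays bounded away from 1/2.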

20 Will AdaBoost Overfit?
margin: y·h(x)

21 Seriously? (Grove & Schuurmans’98; Breiman’99)

22 Pros and Cons
A "straightforward" way to boost performance.
Flexible: works with any base classifier.
Less interpretable.
Longer training time; hard to parallelize (in contrast to bagging).

23 Extensions
LogitBoost, GradBoost, L2Boost, … you name it.
Multi-class, regression, ranking.

24 Face Detection (Viola & Jones’01)
Each detection window results in ~160k features. Speed is crucial for real-time detection!

25 Cascading
38 layers with ~6k features selected.
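A schematic of the cascade idea (an illustrative sketch of mine, not Viola & Jones' actual implementation): a window is declared a face only if every stage accepts it, so most non-face windows are rejected by the cheap early stages.

```python
# Cascade sketch: each stage is a boosted classifier returning a real score;
# a window must clear every stage's threshold to be accepted as a face.
def cascade_predict(stages, thresholds, window):
    for stage, thr in zip(stages, thresholds):
        if stage(window) < thr:
            return -1    # rejected early: not a face (cheap for most windows)
    return +1            # passed all stages: face
```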

26 Examples

27 Asymmetry (Viola & Jones'02)
Too many non-faces vs. few faces. It is trivial to achieve few false positives by sacrificing true positives.
Asymmetric loss (rows are the true label y, columns the prediction ŷ):
            ŷ = 1    ŷ = -1
  y = 1                 k
  y = -1     1/k

28 SATzilla (Xu et al., 2008)

29 Questions?

