CS480/680: Intro to ML
Lecture 06: Bagging and Boosting
Yao-Liang Yu, 09/25/18
Announcement
- A2 is on the course webpage; you need to submit on Kaggle. Do not cheat!
- ICLR reproducibility challenge: https://reproducibility-challenge.github.io/iclr_2019/
- Final exam: Dec 19, 4-6:30pm
Decision Stump Recalled
(Diagram: a stump splits on the eye feature; low leads to "No", high leads to "Yes".)
Seems to be overly weak… Let's make it strong!
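A decision stump is a one-level decision tree: threshold a single feature and predict one label on each side. A minimal sketch, not the lecture's code; labels are assumed to be -1/+1, and the optional sample weights anticipate boosting later in the lecture:

    import numpy as np

    class DecisionStump:
        """One-level decision tree: threshold a single feature."""
        def fit(self, X, y, sample_weight=None):
            n, d = X.shape
            w = np.ones(n) / n if sample_weight is None else sample_weight
            best_err = np.inf
            # try every feature / threshold / polarity, keep the smallest weighted error
            for j in range(d):
                for thr in np.unique(X[:, j]):
                    for sign in (+1, -1):
                        pred = np.where(X[:, j] > thr, sign, -sign)
                        err = np.sum(w * (pred != y))
                        if err < best_err:
                            best_err, self.j, self.thr, self.sign = err, j, thr, sign
            return self

        def predict(self, X):
            return np.where(X[:, self.j] > self.thr, self.sign, -self.sign)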
More Generally
Which algorithm should I use for my problem? Cheap answers:
- Deep learning, but then which architecture?
- I don't know; whatever my boss tells me.
- Whatever I can find in scikit-learn.
- I try many and pick the "best".
Why don't we combine a few algorithms? But how?
The Power of Aggregation
- Train h_t on data set D_t, say each with accuracy p > 1/2.
- Assuming the D_t are independent, the h_t are independent.
- Predict with the majority label among the 2T+1 classifiers. What is the accuracy? Writing $B_t \overset{\text{iid}}{\sim} \mathrm{Bernoulli}(p)$ for the event that h_t is correct,

$$\Pr\Big(\sum_{t=1}^{2T+1} B_t \ge T+1\Big) \;=\; \sum_{k=T+1}^{2T+1}\binom{2T+1}{k}\, p^k (1-p)^{2T+1-k} \;\approx\; 1-\Phi\!\left(\frac{T+1-(2T+1)p}{\sqrt{(2T+1)\,p(1-p)}}\right) \;\to\; 1 \quad\text{as } T\to\infty.$$
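A quick numerical check of the formula above (the per-classifier accuracy p = 0.6 is an illustrative value, not from the slide):

    from math import comb

    def majority_accuracy(p, T):
        """Probability that the majority of 2T+1 independent classifiers,
        each correct with probability p, is correct."""
        n = 2 * T + 1
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(T + 1, n + 1))

    # accuracy grows toward 1 as T increases
    for T in (1, 5, 25, 100):
        print(T, round(majority_accuracy(0.6, T), 4))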
Bootstrap Aggregating (Breiman'96)
- We can't afford to have many independent training sets, so bootstrap: from the original training set {1, …, n}, draw n examples with replacement; repeat T times to get T bootstrapped training sets.
- Train h_1, …, h_T, one per bootstrapped set, and aggregate their predictions, e.g., by majority vote.
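A minimal sketch of bagging for classification, reusing the DecisionStump sketch above as the base learner (any classifier with fit/predict would do; labels are again -1/+1):

    import numpy as np

    def bagging_fit(X, y, base_learner, T, seed=0):
        """Train T copies of base_learner on bootstrap resamples of (X, y)."""
        rng = np.random.default_rng(seed)
        n = len(y)
        models = []
        for _ in range(T):
            idx = rng.integers(0, n, size=n)   # sample n indices with replacement
            models.append(base_learner().fit(X[idx], y[idx]))
        return models

    def bagging_predict(models, X):
        """Majority vote over the ensemble."""
        votes = np.sum([m.predict(X) for m in models], axis=0)
        return np.where(votes >= 0, 1, -1)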
Bagging for Regression
- Simply average the outputs of the h_t, each trained on a bootstrapped training set.
- With T independent h_t, averaging reduces the variance by a factor of T (in practice the bootstrapped h_t are correlated, so the reduction is smaller).
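The variance claim is the usual average-of-independent-estimators calculation, assuming each h_t(x) has variance σ² and the h_t are uncorrelated:

$$\mathrm{Var}\!\left(\frac{1}{T}\sum_{t=1}^{T} h_t(x)\right) \;=\; \frac{1}{T^{2}}\sum_{t=1}^{T}\mathrm{Var}\big(h_t(x)\big) \;=\; \frac{\sigma^{2}}{T}.$$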
When Bagging Works
- Bagging is essentially averaging.
- It is beneficial when the base classifiers have high variance (are unstable), i.e., their performance changes a lot if the training set is slightly perturbed.
- Decision trees are like this; k-NN is not.
Randomizing Output (Breiman'00)
- For regression: add small Gaussian noise to each y_i (leaving x_i untouched), train many h_t, and average their outputs.
- For classification: either use one-hot encoding and reduce to regression, or randomly flip a small proportion of labels in the training set; train many h_t and take a majority vote.
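A minimal regression sketch of output randomization (the noise scale and the sklearn-style base learner interface are assumptions for illustration, not from the slide):

    import numpy as np

    def randomized_output_fit(X, y, base_learner, T, noise_std=0.1, seed=0):
        """Train T regressors, each on the same X but with Gaussian noise added to y."""
        rng = np.random.default_rng(seed)
        return [base_learner().fit(X, y + rng.normal(0.0, noise_std, size=len(y)))
                for _ in range(T)]

    def randomized_output_predict(models, X):
        """Average the ensemble's predictions."""
        return np.mean([m.predict(X) for m in models], axis=0)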
Random Forest (Breiman'01)
- A collection of tree-structured classifiers {h(x; Θ_t), t = 1, …, T}, where the Θ_t are iid random vectors.
- Sources of randomness: bagging (random samples), random feature splits, or both.
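Since scikit-learn was mentioned earlier as a go-to toolbox, here is a typical usage sketch with its RandomForestClassifier (the dataset and hyperparameter values are only illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # bootstrap samples + a random subset of features at each split
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
    forest.fit(X_tr, y_tr)
    print("test accuracy:", forest.score(X_te, y_te))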
Leo Breiman (1928-2005)
Boosting
- Given many classifiers h_t, each only slightly better than random guessing, is it possible to construct a classifier with nearly optimal accuracy?
- Yes! First shown by Schapire (1990).
The Power of Majority (Schapire, 1990; Freund, 1995)
Hedging (Freund & Schapire'97)
What Guarantee?
With N experts and T rounds, Hedge's total loss L is bounded in terms of the total loss L* of the best expert:

$$L \;\le\; \frac{\ln(1/\beta)\, L^{*} \,+\, \ln N}{1-\beta}.$$

Choosing β appropriately, the gap between your total loss and the best expert's, averaged over rounds, goes to 0 as T → ∞.
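A minimal sketch of the multiplicative-weights update behind Hedge(β), assuming losses lie in [0, 1] (the β value is only illustrative):

    import numpy as np

    def hedge(loss_matrix, beta=0.9):
        """loss_matrix[t, i] = loss of expert i in round t, assumed in [0, 1].
        Returns the algorithm's total expected loss."""
        T, N = loss_matrix.shape
        w = np.ones(N)                      # one weight per expert
        total_loss = 0.0
        for t in range(T):
            p = w / w.sum()                 # play experts in proportion to their weights
            total_loss += p @ loss_matrix[t]
            w *= beta ** loss_matrix[t]     # shrink weights of experts that did badly
        return total_loss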
Adaptive Boost (Freund & Schapire'97; Schapire & Singer, 1999)
Look Closely (annotations on the AdaBoost pseudocode)
- ε_t is the weighted (expected) error of h_t; labels here are y = 1 or 0.
- Bigger ε_t gives bigger β_t = ε_t/(1-ε_t) and hence a smaller coefficient for h_t in the final combination; this coefficient can also be optimized.
- The method is adaptive, provided ε_t ≤ 1/2.
- The weight update discounts examples the weak learner handles well: the closer h_t(x_i) is to y_i, the bigger the exponent and the more w_i is shrunk.
- What happens when w_i^t = 0?
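A compact sketch of AdaBoost, written with labels in {-1, +1} and the equivalent coefficient α_t = ½ ln((1-ε_t)/ε_t) instead of the slide's β_t notation; the DecisionStump sketch above can serve as the weak learner:

    import numpy as np

    def adaboost_fit(X, y, weak_learner, T):
        """Discrete AdaBoost with labels y in {-1, +1}."""
        n = len(y)
        w = np.ones(n) / n                            # example weights
        models, alphas = [], []
        for _ in range(T):
            h = weak_learner().fit(X, y, sample_weight=w)
            pred = h.predict(X)
            eps = np.sum(w * (pred != y))             # weighted error
            alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
            w *= np.exp(-alpha * y * pred)            # up-weight mistakes, discount correct ones
            w /= w.sum()
            models.append(h)
            alphas.append(alpha)
        return models, alphas

    def adaboost_predict(models, alphas, X):
        score = sum(a * m.predict(X) for m, a in zip(models, alphas))
        return np.where(score >= 0, 1, -1)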
Does It Work?
Exponential Decay of Training Error
(Plot: training error vs. number of weak classifiers.)
- The training error decays exponentially in the number of weak classifiers, as long as each weak classifier's weighted error is slightly smaller than ½.
- AdaBoost is basically a form of gradient descent minimizing the exponential loss $\sum_i e^{-y_i h(\boldsymbol{x}_i)}$ over h in the conic hull of the h_t's.
- Overfitting? Use simple base classifiers (e.g., decision stumps).
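The classical bound behind the first bullet (not spelled out on the slide): writing the weighted error of round t as $\varepsilon_t = \tfrac{1}{2} - \gamma_t$, the training error of the final classifier H after T rounds satisfies

$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{H(\boldsymbol{x}_i)\ne y_i\} \;\le\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t(1-\varepsilon_t)} \;=\; \prod_{t=1}^{T}\sqrt{1-4\gamma_t^{2}} \;\le\; \exp\!\Big(-2\sum_{t=1}^{T}\gamma_t^{2}\Big),$$

so a uniform edge γ_t ≥ γ > 0 gives exponential decay in T.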
Will AdaBoost Overfit?
(Margin plots; margin = y·h(x).)
Seriously? (Grove & Schuurmans'98; Breiman'99)
Pros and Cons
- A "straightforward" way to boost performance.
- Flexible: works with any base classifier.
- Less interpretable.
- Longer training time; hard to parallelize (in contrast to bagging).
Extensions
- Variants: LogitBoost, GradBoost, L2Boost, … you name it.
- Other settings: multi-class, regression, ranking.
Face Detection (Viola & Jones'01)
- Each detection window results in ~160k features.
- Speed is crucial for real-time detection!
Cascading
- 38 layers, with ~6k features selected.
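The point of the cascade is that the vast majority of windows are non-faces, so cheap early stages reject them immediately and only surviving windows reach the later, more expensive stages. A minimal sketch of that control flow (the stage scoring functions and thresholds are placeholders, not values from the lecture):

    def cascade_predict(stages, window):
        """stages: list of (score_fn, threshold) pairs, cheapest stage first.
        Each score_fn maps a window to a real-valued score (e.g., a boosted sum)."""
        for score_fn, threshold in stages:
            if score_fn(window) < threshold:
                return -1          # rejected early: not a face
        return +1                  # passed every stage: report a detection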
Examples
Asymmetry (Viola & Jones'02)
- Too many non-faces vs. few faces.
- It is trivial to achieve few false positives by sacrificing true positives.
- Asymmetric loss (confusion-matrix layout; mistakes on faces, y = 1, cost k, while mistakes on non-faces cost only 1/k):

              𝒚 = 1    𝒚 = -1
    y = 1       0         k
    y = -1     1/k        0
SATzilla (Xu et al., 2008)
Questions?