1
Ensemble Methods for Machine Learning: The Ensemble Strikes Back
2
Outline
- Motivations and techniques
- Bias, variance: bagging
- Combining learners vs choosing between them: bucket of models, stacking & blending
- PAC-learning theory: boosting
- Relation of boosting to other learning methods: optimization, SVMs, …
3
Review Of Boosting
4
Sample with replacement
Increase the weight of x_i if h_t is wrong; decrease it if h_t is right. The final hypothesis is a linear combination of the base hypotheses; the best weight α_t depends on the error of h_t.
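To make the update concrete, here is a minimal AdaBoost sketch (not from the slides; the weak_learner interface and helper names are assumptions):

```python
# Minimal AdaBoost sketch, assuming a weak_learner(X, y, sample_weight)
# callable that returns an object whose .predict(X) gives labels in {-1, +1}.
import numpy as np

def adaboost(X, y, weak_learner, T=50):
    m = len(y)
    D = np.full(m, 1.0 / m)                     # D_1(i) = 1/m
    hyps, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(D[pred != y].sum(), 1e-10, 1 - 1e-10)  # weighted error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)   # best alpha_t depends on error of h_t
        D = D * np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight correct ones
        D = D / D.sum()                         # normalize (divide by Z_t)
        hyps.append(h)
        alphas.append(alpha)
    def H(Xnew):                                # H(x) = sign(sum_t alpha_t h_t(x))
        f = sum(a * h.predict(Xnew) for a, h in zip(alphas, hyps))
        return np.sign(f)
    return H
```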
5
Boosting: A toy example
Thanks, Rob Schapire
6
Boosting: A toy example
Thanks, Rob Schapire
7
Boosting: A toy example
Thanks, Rob Schapire
8
Boosting: A toy example
Thanks, Rob Schapire
9
Boosting: A toy example
Thanks, Rob Schapire
10
Boosting improved decision trees…
11
Analysis Of Boosting
12
Theorem 1: error rate
The training error of the final hypothesis H(x) = sign(f(x)), where f(x) = Σ_t α_t h_t(x), is at most ∏_t Z_t, the product of the per-round normalizers.
13
Theorem 1: error rate
Proof: recall H(x) = sign(f(x)) where f(x) = Σ_t α_t h_t(x). Unwinding the weight updates gives D_{T+1}(i) = exp(−y_i f(x_i)) / (m ∏_t Z_t). Since exp(−y_i f(x_i)) is an upper bound on "[error on i]", i.e. exp(−y_i f(x_i)) ≥ [H(x_i) ≠ y_i], summing over i and using Σ_i D_{T+1}(i) = 1 bounds the training error by ∏_t Z_t. QED!
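For reference, here is the same chain written out (the standard Freund and Schapire argument; D_t and Z_t are the round-t weights and normalizers):

```latex
% Standard AdaBoost training-error bound.
\begin{align*}
D_{T+1}(i) &= \frac{1}{m}\,\frac{\prod_{t=1}^{T} e^{-\alpha_t y_i h_t(x_i)}}{\prod_{t=1}^{T} Z_t}
            = \frac{e^{-y_i f(x_i)}}{m\prod_{t} Z_t},
  \qquad f(x) = \sum_t \alpha_t h_t(x) \\
\frac{1}{m}\sum_i \mathbf{1}\!\left[H(x_i)\neq y_i\right]
  &\le \frac{1}{m}\sum_i e^{-y_i f(x_i)}
   = \sum_i D_{T+1}(i)\prod_t Z_t
   = \prod_{t=1}^{T} Z_t .
\end{align*}
```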
14
Theorem 1: so, pick the h's and α's to minimize the Z's.
Simplified notation: drop the t's and let u_i = y_i h(x_i); remember that u_i = +1 or −1.
Claim: Z = Σ_i D(i) exp(−α u_i) ≤ Σ_i D(i) [ ((1+u_i)/2) e^{−α} + ((1−u_i)/2) e^{α} ], with equality for u = +1, −1; the inequality holds for −1 ≤ u ≤ +1 (convexity of exp).
So: let's minimize f(α) = Σ_i D(i) [ ((1+u_i)/2) e^{−α} + ((1−u_i)/2) e^{α} ] to pick a best α.
15
Minimize f(α) = (1−ε) e^{−α} + ε e^{α}, where ε = Σ_{i: u_i = −1} D(i) is the weighted error of h. Setting the derivative to zero gives α = ½ ln((1−ε)/ε), and plugging back in, Z = 2√(ε(1−ε)).
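The minimization, worked out (standard AdaBoost algebra, with ε the weighted error of h as above):

```latex
% Minimizing f(alpha) = (1-eps) e^{-alpha} + eps e^{alpha}.
\begin{align*}
f'(\alpha) &= -(1-\epsilon)\,e^{-\alpha} + \epsilon\,e^{\alpha} = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon}{\epsilon}
  \;\Longrightarrow\; \alpha^{*} = \tfrac{1}{2}\ln\frac{1-\epsilon}{\epsilon} \\
Z = f(\alpha^{*}) &= (1-\epsilon)\sqrt{\tfrac{\epsilon}{1-\epsilon}}
   + \epsilon\sqrt{\tfrac{1-\epsilon}{\epsilon}}
   = 2\sqrt{\epsilon(1-\epsilon)} .
\end{align*}
```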
16
Theorem 1: so, pick the h's and α's to minimize the Z's.
Theorem 2: when α_t = ½ ln((1−ε_t)/ε_t), for ε_t = ½ − γ_t, then Z_t = 2√(ε_t(1−ε_t)) = √(1 − 4γ_t²) ≤ exp(−2γ_t²), and hence the training error is bounded by exp(−2 Σ_t γ_t²).
Comment: if h(x) = ±1, then the convexity bound from the previous slide holds with equality, so Z_t = 2√(ε_t(1−ε_t)) exactly.
18
Boosting as Optimization
19
Even boosting single features worked well…
(Experiments on the Reuters newswire corpus.)
20
Some background facts
Coordinate descent optimization to minimize f(w), where w = <w1, …, wN>:
For t = 1, …, T (or till convergence):
  For i = 1, …, N:
    Pick w* to minimize f(<w1, …, wi-1, w*, wi+1, …, wN>)
    Set wi = w*
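A minimal sketch of that loop (illustrative only; the per-coordinate "pick w* to minimize" step is done here by a simple grid search, which is an assumption, not part of the slide):

```python
import numpy as np

def coordinate_descent(f, w0, T=50, grid=np.linspace(-5, 5, 201)):
    """Minimize f(w) one coordinate at a time, via grid search per coordinate."""
    w = np.array(w0, dtype=float)
    for t in range(T):                    # for t = 1, ..., T (or till convergence)
        for i in range(len(w)):           # for i = 1, ..., N
            trials = []
            for w_star in grid:           # pick w* to minimize f(..., w*, ...)
                cand = w.copy()
                cand[i] = w_star
                trials.append(f(cand))
            w[i] = grid[int(np.argmin(trials))]   # set w_i = w*
    return w

# Example: f(w) = (w1 - 1)^2 + (w2 + 2)^2, minimized at (1, -2)
w_hat = coordinate_descent(lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2, [0.0, 0.0], T=5)
```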
21
Boosting as optimization using coordinate descent
With a small number of possible h's, you can think of boosting as finding a linear combination of these: f(x) = Σ_j α_j h_j(x). So boosting is sort of like stacking: the base hypotheses' predictions act as the features, and the α's are the combiner's weights. Boosting uses coordinate descent (adjusting one α_j per round) to minimize an upper bound on the error rate: (1/m) Σ_i exp(−y_i f(x_i)) = ∏_t Z_t.
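A sketch of that view, assuming a fixed pool of base hypotheses whose ±1 predictions are precomputed in a matrix preds[j, i] = h_j(x_i) (the names preds and boost_by_coordinate_descent are hypothetical; the greedy coordinate choice mirrors a weak learner that always returns the best h):

```python
# Coordinate descent on the exponential-loss upper bound over a fixed pool of h's.
import numpy as np

def exp_loss(alpha, preds, y):
    # The objective being minimized: (1/m) sum_i exp(-y_i f(x_i)) = prod_t Z_t
    return np.mean(np.exp(-y * (preds.T @ alpha)))

def boost_by_coordinate_descent(preds, y, rounds=20):
    J, m = preds.shape
    alpha = np.zeros(J)
    for _ in range(rounds):
        w = np.exp(-y * (preds.T @ alpha))                      # current example weights,
        w /= w.sum()                                            # like D_t(i)
        eps = np.array([np.clip(w[preds[j] != y].sum(), 1e-10, 1 - 1e-10)
                        for j in range(J)])
        j = int(np.argmax(np.abs(0.5 - eps)))                   # coordinate that shrinks Z most
        alpha[j] += 0.5 * np.log((1 - eps[j]) / eps[j])         # AdaBoost's alpha as the step
    return alpha
```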
22
Boosting and optimization
Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000.
FHT compared using AdaBoost to set feature weights vs directly optimizing the feature weights to minimize log-likelihood, squared error, …
23
Boosting as Margin Learning
24
Boosting didn’t seem to overfit…(!)
[Plot: train error and test error vs. number of boosting rounds]
25
…because it turned out to be increasing the margin of the classifier
[Plot: margin distributions of the classifier after 100 rounds and after 1000 rounds]
26
Boosting movie
27
Some background facts
Coordinate descent optimization to minimize f(w), where w = <w1, …, wN>:
For t = 1, …, T (or till convergence):
  For i = 1, …, N:
    Pick w* to minimize f(<w1, …, wi-1, w*, wi+1, …, wN>)
    Set wi = w*
28
Boosting is closely related to margin classifiers like SVM, voted perceptron, … (!)
Boosting: the "coordinates" are being extended by one in each round of boosting (usually, unless you happen to generate the same tree twice).
29
Boosting is closely related to margin classifiers like SVM, voted perceptron, … (!)
Boosting: the margin of example (x, y) is y · Σ_t α_t h_t(x), with the weights α measured in the L1 norm and the vector of base-hypothesis predictions <h_1(x), …, h_T(x)> measured in the L∞ norm.
Linear SVMs: the margin is y · (w · x), with the weights w and the features x both measured in the L2 norm.
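In symbols (the usual way this contrast is written; the normalization details are the standard ones, not read off the slide):

```latex
% Margin maximized by boosting (L1 weights, Linf base predictions)
% vs. a linear SVM (L2 weights, L2 features).
\begin{align*}
\text{Boosting:}\quad
  \operatorname{margin}(x,y) &= \frac{y\sum_t \alpha_t h_t(x)}{\lVert\alpha\rVert_1},
  & \lVert(h_1(x),\dots,h_T(x))\rVert_\infty &\le 1 \\
\text{Linear SVM:}\quad
  \operatorname{margin}(x,y) &= \frac{y\,(w\cdot x)}{\lVert w\rVert_2},
  & &\text{features measured in } \lVert x\rVert_2 .
\end{align*}
```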
30
Wrapup On Boosting
31
Boosting in the real world
William's wrap-up:
- Boosting is not discussed much in the ML research community any more; it's much too well understood now.
- It's really useful in practice as a meta-learning method. E.g., boosted Naïve Bayes usually beats Naïve Bayes.
- Boosted decision trees are almost always competitive with respect to accuracy:
  - very robust against rescaling numeric features, extra features, non-linearities, …
  - somewhat slower to learn and use than many linear classifiers
  - but getting probabilities out of them is a little less reliable.
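As a practical aside (not from the slides), boosted decision stumps are easy to try in scikit-learn; the exact constructor keyword has changed across versions, so treat this as a sketch:

```python
# Boosted decision trees in practice (scikit-learn). Older releases use
# base_estimator=..., newer ones use estimator=...; adjust for your version.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # boosted decision stumps
    n_estimators=200,
)
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```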