Ensemble Methods for Machine Learning: The Ensemble Strikes Back
Outline
- Motivations and techniques
- Bias and variance: bagging
- Combining learners vs. choosing between them: bucket of models, stacking & blending
- PAC-learning theory: boosting
- Relation of boosting to other learning methods: optimization, SVMs, …
Review Of Boosting
Review of AdaBoost: sample with replacement according to the current example weights (or reweight the data directly). Increase the weight of x_i if h_t gets it wrong; decrease the weight if h_t gets it right. The final hypothesis is a linear combination of the base hypotheses; the best weight α_t depends on the error of h_t.
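To make the review concrete, here is a minimal AdaBoost sketch (my own illustration, not code from the slides), using decision stumps as the base hypotheses; helper names like `fit_stump` are invented for the example.

```python
# Minimal AdaBoost sketch: binary labels y in {-1,+1}, base learner = depth-1 "stumps".
import numpy as np

def fit_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature, threshold, sign)

def adaboost(X, y, T=50):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)    # best weight depends on error of h_t
        pred = sign * np.where(X[:, j] <= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)           # up-weight mistakes, down-weight correct
        w /= w.sum()                             # normalize (divide by Z_t)
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    f = np.zeros(X.shape[0])
    for alpha, j, thr, sign in ensemble:
        f += alpha * sign * np.where(X[:, j] <= thr, 1, -1)
    return np.sign(f)                            # H(x) = sign(f(x))
```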
Boosting: A toy example (thanks, Rob Schapire)
Boosting improved decision trees… (timeline: 1950 - T, 1988 - KV, 1989 - S, 1993 - DSS, 1995 - FS, …)
Analysis Of Boosting
Theorem 1 (error rate): the training error of H(x) = sign(f(x)), where $f(x) = \sum_t \alpha_t h_t(x)$, is at most $\prod_t Z_t$.
Proof: unrolling the weight update over all rounds gives $D_{T+1}(i) = \frac{\exp(-y_i f(x_i))}{m \prod_t Z_t}$ for $m$ training examples. Since $\exp(-y_i f(x_i)) \ge 1$ whenever $H(x_i) \ne y_i$, it is an upper bound on "[error on i]"; summing and using $\sum_i D_{T+1}(i) = 1$, $\frac{1}{m}\sum_i [\![H(x_i) \ne y_i]\!] \le \frac{1}{m}\sum_i \exp(-y_i f(x_i)) = \prod_t Z_t$. QED!
So (Theorem 1): pick the h's and α's to minimize the Z's. Simplified notation: drop the t's and let $u_i = y_i h_t(x_i)$; remember that $u_i = +1$ or $-1$. Claim: $e^{-\alpha u} \le \frac{1+u}{2}\,e^{-\alpha} + \frac{1-u}{2}\,e^{\alpha}$, with equality for $u = \pm 1$ (the inequality holds for all $-1 \le u \le +1$). So: let's minimize $f(\alpha) = Z(\alpha) = \sum_i D(i)\, e^{-\alpha u_i}$ to pick the best α (worked out below).
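Carrying out the minimization is a short calculus step the slide leaves implicit; here it is worked out (my own filling-in), writing $\epsilon = \sum_{i: u_i = -1} D(i)$ for the weighted error:

```latex
\begin{align*}
Z(\alpha)   &= \sum_i D(i)\, e^{-\alpha u_i}
             = (1-\epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} \\
Z'(\alpha)  &= -(1-\epsilon)\, e^{-\alpha} + \epsilon\, e^{\alpha} = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon}{\epsilon}
  \;\Longrightarrow\; \alpha^{*} = \tfrac{1}{2}\ln\frac{1-\epsilon}{\epsilon} \\
Z(\alpha^{*}) &= (1-\epsilon)\sqrt{\tfrac{\epsilon}{1-\epsilon}}
              + \epsilon\sqrt{\tfrac{1-\epsilon}{\epsilon}}
              = 2\sqrt{\epsilon(1-\epsilon)}
\end{align*}
```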
So (Theorem 1): pick the h's and α's to minimize the Z's. Theorem 2: when $\epsilon_t = \frac{1}{2} - \gamma_t$ for every round $t$, then $Z_t = \sqrt{1 - 4\gamma_t^2} \le e^{-2\gamma_t^2}$, and hence the training error is bounded by $\prod_t Z_t \le \exp\big(-2\sum_t \gamma_t^2\big)$. Comment: if $h(x) = \pm 1$ then $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$ exactly.
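As a quick sanity check on the bound (my own illustration, not from the slides), the snippet below compares the exact product of the $Z_t$'s with the $\exp(-2\sum_t \gamma_t^2)$ bound for an assumed constant weak-learner edge:

```python
# Numeric check of Theorem 2: with a constant edge gamma, compare
# prod_t Z_t against the looser bound exp(-2 * sum_t gamma^2).
import math

gamma, T = 0.1, 50                       # assumed edge and number of rounds
eps = 0.5 - gamma                        # weighted error each round
Z = 2 * math.sqrt(eps * (1 - eps))       # Z_t for the optimal alpha_t
exact_bound = Z ** T                     # prod_t Z_t
loose_bound = math.exp(-2 * T * gamma ** 2)
print(exact_bound, loose_bound)          # the exact product is the tighter of the two
assert exact_bound <= loose_bound
```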
Boosting as Optimization
Even boosting single features worked well… (experiments on the Reuters newswire corpus)
Some background facts. Coordinate descent optimization to minimize f(w), where w = <w1, …, wN>:
For t = 1, …, T or till convergence:
  For i = 1, …, N:
    Pick w* to minimize f(<w1, …, wi-1, w*, wi+1, …, wN>)
    Set wi = w*
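A minimal sketch of that loop in code (my own illustration; the inner 1-D minimization uses scipy's scalar minimizer rather than anything from the slides):

```python
# Generic coordinate descent: repeatedly minimize f along one coordinate at a time.
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(f, w0, T=100, tol=1e-8):
    w = np.array(w0, dtype=float)
    for _ in range(T):
        w_old = w.copy()
        for i in range(len(w)):
            def f_i(wi, i=i):                  # f as a function of coordinate i only
                w_try = w.copy()
                w_try[i] = wi
                return f(w_try)
            w[i] = minimize_scalar(f_i).x      # pick w* minimizing f along coordinate i
        if np.max(np.abs(w - w_old)) < tol:    # stop early once no coordinate moves
            break
    return w

# Example: minimize the quadratic f(w) = (w1 - 1)^2 + (w2 + 2)^2
w_opt = coordinate_descent(lambda w: (w[0] - 1) ** 2 + (w[1] + 2) ** 2, [0.0, 0.0])
print(w_opt)   # ≈ [1, -2]
```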
Boosting as optimization using coordinate descent. With a small number of possible h's, you can think of boosting as finding a linear combination of these: $f(x) = \sum_j w_j h_j(x)$. So boosting is sort of like stacking: the "features" are the predictions of the base hypotheses, $x \mapsto \langle h_1(x), \ldots, h_N(x) \rangle$. Boosting uses coordinate descent (one $w_j$ updated per round) to minimize an upper bound on error rate: $\sum_i \exp(-y_i f(x_i)) \ge \sum_i [\![\mathrm{sign}(f(x_i)) \ne y_i]\!]$.
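To make that connection concrete, here is a sketch (my own, not the slides') of one coordinate-descent step per round on the exponential-loss upper bound, over a fixed finite pool of candidate hypotheses; the exact line search along the chosen coordinate recovers AdaBoost's α = ½ ln((1-ε)/ε).

```python
# Boosting viewed as coordinate descent on L(w) = sum_i exp(-y_i * sum_j w_j h_j(x_i)).
# H is an (n_hypotheses x n_examples) matrix of +/-1 predictions; y has entries +/-1.
import numpy as np

def boost_by_coordinate_descent(H, y, T=50):
    n_hyp, n_ex = H.shape
    w = np.zeros(n_hyp)                        # linear combination f = sum_j w_j h_j
    for _ in range(T):
        margins = (w @ H) * y                  # y_i * f(x_i) for every example
        d = np.exp(-margins)
        d /= d.sum()                           # current distribution D_t over examples
        errs = np.array([d[(H[j] * y) < 0].sum() for j in range(n_hyp)])
        j = int(np.argmin(errs))               # coordinate giving the steepest descent
        eps = np.clip(errs[j], 1e-12, 1 - 1e-12)
        w[j] += 0.5 * np.log((1 - eps) / eps)  # exact line search along coordinate j
    return w
```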
Boosting and optimization (1999 - FHT). Jerome Friedman, Trevor Hastie and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 2000. Compared using AdaBoost to set feature weights vs. direct optimization of the feature weights to minimize log-likelihood, squared error, …
Boosting as Margin Learning
Boosting didn’t seem to overfit…(!) [figure: training error and test error vs. number of boosting rounds]
…because it turned out to be increasing the margin of the classifier. [figure: margin distributions after 100 vs. 1000 rounds of boosting]
Boosting movie
Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!) In boosting, the "coordinates" are extended by one in each round of boosting; usually, that is, unless you happen to generate the same tree twice.
Boosting is closely related to margin classifiers like SVMs, the voted perceptron, … (!) Boosting: the margin of example $i$ is $\frac{y_i \sum_t \alpha_t h_t(x_i)}{\sum_t |\alpha_t|}$ (an $L_1$ norm on the weights, with the base-hypothesis values $h_t(x) = \pm 1$ bounded in $L_\infty$). Linear SVMs: the margin of example $i$ is $\frac{y_i\,(w \cdot x_i)}{\|w\|_2}$ with $\|x_i\|_2$ bounded ($L_2$ norms on both the weights and the features).
Wrapup On Boosting
Boosting in the real world
William’s wrap-up:
- Boosting is not discussed much in the ML research community any more; it’s much too well understood.
- It’s really useful in practice as a meta-learning method. E.g., boosted Naïve Bayes usually beats Naïve Bayes.
- Boosted decision trees are almost always competitive with respect to accuracy; very robust against rescaling numeric features, extra features, non-linearities, …; somewhat slower to learn and use than many linear classifiers.
- But getting probabilities out of them is a little less reliable.
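For practical use, the easiest route is an off-the-shelf implementation. Below is a minimal example with scikit-learn's AdaBoostClassifier (whose default base learner is a depth-1 decision tree); the dataset and hyperparameters are arbitrary, chosen only to illustrate the meta-learning point above.

```python
# Boosted decision trees as an off-the-shelf meta-learner (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=200, random_state=0)   # stumps by default
clf.fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))

# Note: predict_proba outputs tend to be poorly calibrated (the "probabilities" caveat);
# consider sklearn.calibration.CalibratedClassifierCV if calibrated probabilities matter.
print("probabilities for first test example:", clf.predict_proba(X_te[:1]))
```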