Introduction to Boosting Hojung Cho Topics for Bioinformatics Oct 10 2006
Boosting: underlying principle
Building a highly accurate prediction rule is not an easy task, but it is not hard to come up with very rough rules of thumb (“weak learners”) that are only moderately accurate, and to combine these into a highly accurate classifier.
Outline
The boosting framework
Choice of α
AdaBoost
LogitBoost
References
Training error: the prediction error of the final hypothesis on the training data.
The Rules for Boosting
1) Set all weights of the training examples equal.
2) Train a weak learner on the weighted examples.
3) See how well the weak learner performs on the data and give it a weight based on how well it did.
4) Re-weight the training examples and repeat.
5) When done, predict by weighted majority vote.
Weak learner: a “rough and moderately inaccurate” predictor, but one that can predict better than chance (1/2). Boosting shows the strength of weak learnability.
Start with an algorithm for finding the rough rules of thumb (the “weak learner”). The boosting algorithm calls the weak learner repeatedly, each time feeding it a different distribution or weighting over the training examples (in effect, a different subset). Each time it is called, the base learning algorithm generates a new weak prediction rule. After many rounds, the boosting algorithm combines these weak rules into a single prediction rule that can be much more accurate than any single weak rule.
Two fundamental questions for designing the boosting algorithm
How should each distribution or weighting (subset of examples) be chosen on each round? Place the most weight on the examples most often misclassified by the preceding weak rules, forcing the weak learner to focus on the “hardest” examples.
How should the weak learners be combined into a single rule? Take a weighted majority vote of their predictions; the vote weights α can be chosen analytically or numerically.
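The five rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes labels in {-1, +1}, uses one-feature threshold rules ("decision stumps") as the weak learner, and weights each stump with the standard AdaBoost choice α = ½ ln((1 − ε)/ε). The function names and the toy data are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, sign) for the best stump.

    A stump predicts `sign` if x[feature] > threshold, else -sign.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted(set(x[j] for x in X))
        # Candidate thresholds: one below the minimum, then each observed value.
        for thr in [values[0] - 1.0] + values:
            for sign in (+1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if (sign if xi[j] > thr else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, x):
    _, j, thr, sign = stump
    return sign if x[j] > thr else -sign

def adaboost(X, y, rounds):
    m = len(X)
    w = [1.0 / m] * m                              # rule 1: equal weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)               # rule 2: fit on weighted data
        err = min(max(stump[0], 1e-10), 1 - 1e-10) # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # rule 3: vote weight
        ensemble.append((alpha, stump))
        # rule 4: raise the weight of mistakes, lower the rest, renormalize
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    # rule 5: weighted majority vote
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D data that no single stump separates (pattern + + - - + +).
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(X, y, rounds=3)
print([predict(ensemble, x) for x in X])
```

On this toy set the best single stump misclassifies two of the six points, while the vote of three stumps fits all six, which is the effect the rules are meant to produce.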
A Boosting approach
Binary classification: AdaBoost
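For reference, AdaBoost for binary labels is commonly stated as below, following Freund and Schapire (1997). The notation is an assumption made here for illustration: m training examples, T rounds, D_t the weighting on round t, h_t the weak rule, ε_t its weighted error, α_t its vote weight, and Z_t the normalizer.

```latex
\begin{align*}
&\text{Given } (x_1,y_1),\dots,(x_m,y_m) \text{ with } y_i\in\{-1,+1\};
  \qquad D_1(i)=\tfrac{1}{m}.\\
&\text{For } t=1,\dots,T:\\
&\quad \text{train the weak learner on } D_t \text{ to get } h_t : X \to \{-1,+1\},\\
&\quad \epsilon_t=\sum_{i:\,h_t(x_i)\neq y_i} D_t(i),
  \qquad \alpha_t=\tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},\\
&\quad D_{t+1}(i)=\frac{D_t(i)\,e^{-\alpha_t y_i h_t(x_i)}}{Z_t},
  \qquad Z_t=\sum_{i=1}^{m} D_t(i)\,e^{-\alpha_t y_i h_t(x_i)}.\\
&\text{Output } H(x)=\operatorname{sign}\Bigl(\sum_{t=1}^{T}\alpha_t h_t(x)\Bigr).
\end{align*}
```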
Simple example (figures not reproduced): Round 1, Round 2, Round 3, Final Hypothesis
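Choice of α
Schapire and Singer proved that the training error of the final hypothesis H is bounded by
$$\frac{1}{m}\,\bigl|\{i : H(x_i) \neq y_i\}\bigr| \;\le\; \prod_{t=1}^{T} Z_t, \qquad Z_t = \sum_{i=1}^{m} D_t(i)\, e^{-\alpha_t y_i h_t(x_i)},$$
where D_t is the weighting over the m training examples on round t and h_t is the weak rule chosen on that round.
From the theorem above, we can derive the choice of α_t that minimizes Z_t when h_t takes values in {-1, +1}:
$$\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},$$
where ε_t is the weighted error of h_t under D_t.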
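The analytic and numeric routes to α can be compared directly. The sketch below assumes the standard AdaBoost setting with weak rules taking values in {-1, +1}, where the per-round normalizer reduces to Z(α) = (1 − ε)e^(−α) + εe^(α) and the closed-form minimizer is α = ½ ln((1 − ε)/ε). The function names and the grid search are illustrative choices, not part of the original slides.

```python
import math

def z(alpha, eps):
    """Normalizer Z for a weak rule with weighted error eps and outputs in {-1,+1}."""
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

def alpha_closed_form(eps):
    """Analytic minimizer of z(., eps)."""
    return 0.5 * math.log((1 - eps) / eps)

def alpha_grid_search(eps, lo=-5.0, hi=5.0, steps=20001):
    """Numeric minimizer of z(., eps) over an evenly spaced grid (illustrative only)."""
    grid = (lo + (hi - lo) * k / (steps - 1) for k in range(steps))
    return min(grid, key=lambda a: z(a, eps))

for eps in (0.05, 0.2, 0.4, 0.5):
    a = alpha_closed_form(eps)
    print(f"eps={eps:.2f}  analytic={a:.4f}  "
          f"numeric={alpha_grid_search(eps):.4f}  Z={z(a, eps):.4f}")
```

The two columns should agree to within the grid spacing, and α shrinks to zero as ε approaches 1/2, so a rule no better than chance gets no vote.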
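Proof
So as ε_t decreases, α_t increases: an accurate weak rule gets a larger vote, and the samples it misclassifies receive larger weights D_{t+1}(i) on the next round, which emphasizes them. Conversely, a poor classifier (ε_t large) gets a smaller α_t and shifts less weight onto its misclassified samples.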
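A sketch of the standard argument, under the assumption that h_t takes values in {-1, +1}, so that y_i h_t(x_i) is +1 on correctly classified examples and -1 on mistakes. The symbols Z_t, D_t, ε_t and α_t follow the usual AdaBoost notation and are assumed here.

```latex
\begin{align*}
Z_t(\alpha) &= \sum_{i} D_t(i)\,e^{-\alpha y_i h_t(x_i)}
             = (1-\epsilon_t)\,e^{-\alpha} + \epsilon_t\,e^{\alpha},\\
\frac{dZ_t}{d\alpha} &= -(1-\epsilon_t)\,e^{-\alpha} + \epsilon_t\,e^{\alpha} = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
  \;\Longrightarrow\; \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},\\
Z_t(\alpha_t) &= 2\sqrt{\epsilon_t(1-\epsilon_t)},\\
D_{t+1}(i) &=
  \begin{cases}
    \dfrac{D_t(i)}{2(1-\epsilon_t)} & \text{if } h_t(x_i) = y_i,\\[2ex]
    \dfrac{D_t(i)}{2\epsilon_t}     & \text{if } h_t(x_i) \neq y_i.
  \end{cases}
\end{align*}
```

Since ε_t < 1/2 for a weak learner, the factor 1/(2ε_t) exceeds 1 and grows as ε_t shrinks, which is why the mistakes of an accurate rule are emphasized most strongly.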
Boosting and additive logistic regression (Friedman et al., 2000)
Boosting can be viewed as an approximation to additive modeling on the logistic scale, using maximum Bernoulli likelihood (multinomial in the multiclass case) as the criterion.
Friedman et al. propose more direct approximations that exhibit nearly identical results to boosting (AdaBoost), along with a way to reduce computation.
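When f(x) is the weighted average of the base classifiers produced by AdaBoost, the probability that y = +1 is represented by p(x):
$$p(x) = \Pr[y = +1 \mid x] = \frac{e^{f(x)}}{e^{f(x)} + e^{-f(x)}}.$$
Note the close connection between the log loss (negative log likelihood) of the model above,
$$\sum_i \ln\bigl(1 + e^{-2 y_i f(x_i)}\bigr),$$
and the function we attempt to minimize in AdaBoost,
$$\sum_i e^{-y_i f(x_i)}.$$
For any distribution over pairs (x, y), both expectations, $E[\ln(1 + e^{-2 y f(x)})]$ and $E[e^{-y f(x)}]$, are minimized by the function
$$f(x) = \frac{1}{2}\ln\frac{\Pr[y = +1 \mid x]}{\Pr[y = -1 \mid x]}.$$
Rather than minimizing the exponential loss, we can attempt to directly minimize the logistic loss (the negative log likelihood): this is LogitBoost.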
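The shared minimizer can be checked numerically at a single point x. The sketch below assumes Pr[y = +1 | x] = p, writes the expected exponential loss as p·e^(−f) + (1 − p)·e^(f) and the expected log loss as p·ln(1 + e^(−2f)) + (1 − p)·ln(1 + e^(2f)), and compares a grid-search minimizer of each with half the log-odds. All function names are hypothetical.

```python
import math

def p_from_f(f):
    """Model probability of y=+1: e^f / (e^f + e^-f), written as 1 / (1 + e^(-2f))."""
    return 1.0 / (1.0 + math.exp(-2.0 * f))

def expected_exp_loss(f, p):
    """E[exp(-y f)] when Pr[y=+1|x] = p."""
    return p * math.exp(-f) + (1 - p) * math.exp(f)

def expected_log_loss(f, p):
    """E[ln(1 + exp(-2 y f))] when Pr[y=+1|x] = p."""
    return (p * math.log1p(math.exp(-2.0 * f))
            + (1 - p) * math.log1p(math.exp(2.0 * f)))

def argmin_on_grid(loss, p, lo=-4.0, hi=4.0, steps=16001):
    """Brute-force minimizer over an evenly spaced grid (illustrative only)."""
    grid = (lo + (hi - lo) * k / (steps - 1) for k in range(steps))
    return min(grid, key=lambda f: loss(f, p))

def half_log_odds(p):
    return 0.5 * math.log(p / (1 - p))

for p in (0.1, 0.5, 0.9):
    print(f"p={p:.1f}  half log-odds={half_log_odds(p):+.4f}  "
          f"exp-loss argmin={argmin_on_grid(expected_exp_loss, p):+.4f}  "
          f"log-loss argmin={argmin_on_grid(expected_log_loss, p):+.4f}")
```

Both losses share the same population minimizer, but they penalize large negative margins differently (exponentially versus roughly linearly), which is one motivation for minimizing the logistic loss directly.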
References
Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997.
Ron Meir and Gunnar Rätsch. An introduction to boosting and leveraging. In Advanced Lectures on Machine Learning (LNAI 2600), 2003.
Robert E. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison, M. H. Hansen, C. Holmes, B. Mallick, B. Yu, editors, Nonlinear Estimation and Classification. Springer, 2003.