A Brief Introduction to Adaboost
Hongbo Deng
6 Feb, 2007
Some of the slides are borrowed from Derek Hoiem & Jan Šochman.
Outline Background Adaboost Algorithm Theory/Interpretations
What's So Good About Adaboost
Can be used in conjunction with many different classifiers and learning algorithms to improve their performance
Improves classification accuracy
Commonly used in many application areas
Simple to implement
Relatively resistant to overfitting in practice
A Brief History
Resampling for estimating a statistic: Bootstrapping
Resampling for classifier design: Bagging, Boosting (Schapire 1989), AdaBoost (Freund & Schapire 1995)
Bootstrap Estimation
Used to estimate a statistic (parameter) and its variance
Repeatedly draw n samples (with replacement) from the training set D
For each set of samples, estimate the statistic
The bootstrap estimate is the mean of the individual estimates
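A minimal sketch of bootstrap estimation, assuming NumPy and the sample mean as the statistic of interest (both are illustrative choices, not fixed by the slide):

```python
# Bootstrap estimate of a statistic and of its variance.
import numpy as np

def bootstrap_estimate(data, statistic=np.mean, n_resamples=1000, rng=None):
    """Repeatedly resample `data` with replacement and apply `statistic`."""
    rng = np.random.default_rng(rng)
    n = len(data)
    estimates = np.array([
        statistic(rng.choice(data, size=n, replace=True))
        for _ in range(n_resamples)
    ])
    # The bootstrap estimate is the mean of the individual estimates;
    # their spread estimates the variance of the statistic.
    return estimates.mean(), estimates.var(ddof=1)

# Example: bootstrap estimate of the sample mean and its variance.
data = np.array([2.1, 3.5, 1.8, 4.0, 2.9, 3.3])
est, var = bootstrap_estimate(data, np.mean, n_resamples=2000, rng=0)
```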
Bagging - Aggregate Bootstrapping
The previous slide addressed the use of resampling for estimating statistics; we now turn to resampling methods for training classifiers.
Bagging uses multiple versions of the training set, each created by drawing n* < n samples from D with replacement.
For i = 1 .. M: draw n* samples from D with replacement and train component classifier Ci on them.
The final classification decision is a vote of C1 .. CM.
Increases classifier stability / reduces variance.
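A minimal sketch of bagging, assuming scikit-learn decision trees as the component classifiers and binary labels in {-1, +1} (illustrative assumptions; the slide does not fix a particular learner):

```python
# Bagging: train M component classifiers on bootstrap samples, then vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=10, n_star=None, rng=None):
    """Train M component classifiers, each on a bootstrap sample of size n*."""
    rng = np.random.default_rng(rng)
    n = len(X)
    n_star = n_star or n
    classifiers = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n_star)   # draw n* samples with replacement
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return classifiers

def bagging_predict(classifiers, X):
    """Final classification is a majority vote of C1 .. CM."""
    votes = np.stack([clf.predict(X) for clf in classifiers])
    # For labels in {-1, +1}, the sign of the summed votes is the majority
    # (ties map to 0 in this simple sketch).
    return np.sign(votes.sum(axis=0))
```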
Boosting (Schapire 1989)
The goal of boosting is to improve the accuracy of any given learning algorithm. Consider creating three component classifiers for a two-category problem through boosting:
Randomly select n1 < n samples from D without replacement to obtain D1; train weak learner C1 on D1.
Select n2 < n samples from D to obtain D2, such that half of them are classified correctly by C1 and half are misclassified; train weak learner C2 on D2.
Select all remaining samples from D on which C1 and C2 disagree to obtain D3; train weak learner C3 on D3.
The final classifier is a vote of the three weak learners (a rough sketch follows below).
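A rough sketch of the three-classifier scheme above, assuming binary labels in {-1, +1} and scikit-learn decision stumps as the weak learners (illustrative assumptions; degenerate cases such as an empty D3 are not handled):

```python
# Schapire-style boosting with three component classifiers.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost_three(X, y, n1, rng=None):
    rng = np.random.default_rng(rng)
    n = len(X)
    # D1: n1 < n samples drawn from D without replacement; train C1 on it.
    idx1 = rng.choice(n, size=n1, replace=False)
    C1 = DecisionTreeClassifier(max_depth=1).fit(X[idx1], y[idx1])
    # D2: half classified correctly by C1, half misclassified; train C2 on it.
    rest = np.setdiff1d(np.arange(n), idx1)
    pred1 = C1.predict(X[rest])
    right, wrong = rest[pred1 == y[rest]], rest[pred1 != y[rest]]
    k = min(len(right), len(wrong))
    idx2 = np.concatenate([right[:k], wrong[:k]])
    C2 = DecisionTreeClassifier(max_depth=1).fit(X[idx2], y[idx2])
    # D3: remaining samples on which C1 and C2 disagree; train C3 on it.
    rest3 = np.setdiff1d(rest, idx2)
    disagree = rest3[C1.predict(X[rest3]) != C2.predict(X[rest3])]
    C3 = DecisionTreeClassifier(max_depth=1).fit(X[disagree], y[disagree])
    # Final classifier: majority vote of the three weak learners.
    return lambda Xq: np.sign(C1.predict(Xq) + C2.predict(Xq) + C3.predict(Xq))
```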
Adaboost - Adaptive Boosting
Instead of resampling, AdaBoost re-weights the training set: each training sample carries a weight that determines its probability of being selected for training a component classifier.
AdaBoost is an algorithm for constructing a "strong" classifier as a linear combination of simple "weak" classifiers.
The final classification is based on a weighted vote of the weak classifiers.
Adaboost Terminology
ht(x) … "weak" or basis classifier (classifier = learner = hypothesis)
H(x) … "strong" or final classifier
Weak classifier: error < 50% over any distribution
Strong classifier: thresholded linear combination of the weak classifier outputs
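Written out (the formula itself appears only as a figure on the original slide), the strong classifier used throughout the rest of the slides is the standard weighted vote:

H(x) = \mathrm{sign}\!\left( \sum_{t=1}^{T} \alpha_t\, h_t(x) \right)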
Discrete Adaboost Algorithm
D(i) is the weight of training sample i; it determines the probability of that sample being selected for training an individual component classifier. Epsilon is the sum of the weights of the misclassified samples.
Specifically, we initialize the weights across the training set to be uniform.
On each iteration t, we find a classifier ht(x) that minimizes the error with respect to the current distribution.
Next we increase the weights of training examples misclassified by ht(x) and decrease the weights of the examples it classifies correctly.
The new distribution is used to train the next classifier, and the process is iterated.
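A minimal sketch of discrete AdaBoost as described above, assuming binary labels in {-1, +1} and scikit-learn decision stumps as the weak learners (illustrative choices; the slides do not fix a particular weak learner):

```python
# Discrete AdaBoost: reweight the training set on each round.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    n = len(X)
    D = np.full(n, 1.0 / n)                # initialize weights uniformly
    learners, alphas = [], []
    for t in range(T):
        # Find a weak classifier minimizing the weighted error under D.
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = D[pred != y].sum()           # sum of weights of misclassified samples
        if eps >= 0.5:                     # weak learner must beat random guessing
            break
        eps = max(eps, 1e-10)              # guard against a perfect weak learner
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        # Increase weights of misclassified samples, decrease correct ones, renormalize.
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Strong classifier: sign of the weighted vote of the weak classifiers.
    scores = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(scores)
```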
Find the Weak Classifier
On each round, the weak classifier is chosen to minimize the weighted error, which is determined with respect to the current distribution Dt.
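Written out (the slide shows this only as a figure), the weighted error of the weak classifier on round t is:

\epsilon_t \;=\; \sum_{i=1}^{n} D_t(i)\,\mathbf{1}[\,h_t(x_i) \neq y_i\,] \;=\; \Pr_{i \sim D_t}[\,h_t(x_i) \neq y_i\,]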
The algorithm core
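The core quantity of the algorithm (shown only as a figure in the original slides) is the weight given to each weak classifier; in standard discrete AdaBoost it is:

\alpha_t \;=\; \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}

so \alpha_t > 0 whenever \epsilon_t < 1/2, and it grows as the weak classifier becomes more accurate.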
Reweighting
After each round, the weight of sample i is decreased when it is classified correctly (y_i * ht(x_i) = 1) and increased when it is misclassified (y_i * ht(x_i) = -1).
In this way, AdaBoost "focuses on" the informative or "difficult" examples.
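The update rule behind this (shown only graphically on the slides) is the standard one:

D_{t+1}(i) \;=\; \frac{D_t(i)\, \exp\!\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}

where Z_t is a normalization constant chosen so that D_{t+1} is again a probability distribution.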
Algorithm recapitulation
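The recapitulated algorithm appears only as figures in the original slides; a minimal text restatement of standard discrete AdaBoost, consistent with the preceding slides, is:

Given (x_1, y_1), ..., (x_n, y_n) with y_i in {-1, +1}:
1. Initialize D_1(i) = 1/n.
2. For t = 1 .. T:
   a. Train weak classifier h_t using distribution D_t.
   b. Compute the weighted error \epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i].
   c. Set \alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}.
   d. Update D_{t+1}(i) \propto D_t(i)\, \exp(-\alpha_t y_i h_t(x_i)) and renormalize.
3. Output the strong classifier H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t h_t(x)\big).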
Pros and cons of AdaBoost
Advantages
Very simple to implement
Does feature selection, resulting in a relatively simple classifier
Fairly good generalization
Disadvantages
Suboptimal solution
Sensitive to noisy data and outliers; be wary when using noisy or unlabeled data
References
Duda, Hart, and Stork – Pattern Classification
Freund – "An adaptive version of the boost by majority algorithm"
Freund and Schapire – "Experiments with a new boosting algorithm"
Freund and Schapire – "A decision-theoretic generalization of on-line learning and an application to boosting"
Friedman, Hastie, and Tibshirani – "Additive Logistic Regression: A Statistical View of Boosting"
Jin, Liu, et al. (CMU) – "A New Boosting Algorithm Using Input-Dependent Regularizer"
Li, Zhang, et al. – "FloatBoost Learning for Classification"
Opitz and Maclin – "Popular Ensemble Methods: An Empirical Study"
Rätsch and Warmuth – "Efficient Margin Maximization with Boosting"
Schapire, Freund, et al. – "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods"
Schapire and Singer – "Improved Boosting Algorithms Using Confidence-rated Predictions"
Schapire – "The Boosting Approach to Machine Learning: An Overview"
Zhang, Li, et al. – "Multi-view Face Detection with FloatBoost"
Appendix Bound on training error Adaboost Variants
Bound on Training Error (Schapire)
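The bound itself appears only as a figure on the slide; the standard statement (Freund and Schapire) is that the training error of the strong classifier H satisfies:

\frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[\,H(x_i) \neq y_i\,] \;\le\; \prod_{t=1}^{T} Z_t \;=\; \prod_{t=1}^{T} 2\sqrt{\epsilon_t (1 - \epsilon_t)} \;\le\; \exp\!\Big(-2 \sum_{t=1}^{T} \gamma_t^2\Big), \qquad \gamma_t = \tfrac{1}{2} - \epsilon_t

So as long as each weak classifier is even slightly better than random guessing (\gamma_t \ge \gamma > 0), the training error drops exponentially fast in T.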
Discrete Adaboost (DiscreteAB) (Friedman’s wording)
Discrete Adaboost (DiscreteAB) (Freund and Schapire’s wording)
Adaboost with Confidence Weighted Predictions (RealAB)
The larger the absolute value of the classifier's output, the more confident the classifier is in its prediction.
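In the real-valued (confidence-rated) variant, following Friedman et al. and Schapire and Singer, the weak hypothesis is commonly taken to be a half log-odds estimate under the current weights (the exact formula appears only as a figure on the slide):

h_t(x) \;=\; \frac{1}{2} \ln \frac{P_w(y = +1 \mid x)}{P_w(y = -1 \mid x)}, \qquad H(x) = \mathrm{sign}\Big(\sum_{t=1}^{T} h_t(x)\Big)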
Adaboost Variants Proposed By Friedman
LogitBoost: fits an additive logistic regression model by stage-wise optimization of the logistic loss; requires care to avoid numerical problems.
GentleBoost: the update is fm(x) = P(y=1 | x) - P(y=0 | x) instead of the half log-ratio used by Real AdaBoost, so each update is bounded.
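For reference, the two updates contrasted above (the Real AdaBoost side is missing from the extracted slide text) can be written as follows, where P_w denotes a class-probability estimate under the current weights:

f_m(x) = P_w(y=1 \mid x) - P_w(y=0 \mid x) \quad \text{(GentleBoost, bounded in } [-1, 1]\text{)}

f_m(x) = \frac{1}{2} \log \frac{P_w(y=1 \mid x)}{P_w(y=0 \mid x)} \quad \text{(Real AdaBoost, unbounded)}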
Adaboost Variants Proposed By Friedman LogitBoost
Adaboost Variants Proposed By Friedman GentleBoost
Thanks!!! Any comments or questions?