Boosting
Rong Jin
Inefficiency with Bagging
[Diagram: bootstrap sampling from the training set D produces samples D1, D2, ..., Dk, which are used to train classifiers h1, h2, ..., hk]
Inefficiency with bootstrap sampling:
- Every example has an equal chance of being sampled
- No distinction between "easy" examples and "difficult" examples
Inefficiency with model combination:
- A constant weight for each classifier
- No distinction between accurate classifiers and inaccurate classifiers
Improve the Efficiency of Bagging
Better sampling strategy:
- Focus on the examples that are difficult to classify correctly
Better combination strategy:
- Accurate classifiers should be assigned larger weights
Intuition: Education in China
[Diagram: Classifier 1 is trained on examples (x1, y1), ..., (x4, y4) and makes mistakes on (x1, y1) and (x3, y3); Classifier 2 focuses on those mistakes and still errs on (x1, y1); Classifier 3 focuses on that remaining mistake]
Combining the three classifiers gives no training mistakes!
But the combination may overfit the training data!
AdaBoost Algorithm
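In outline, AdaBoost starts from uniform example weights, repeatedly fits a weak classifier on the weighted data, computes its weighted error ε_t, sets its vote weight α_t, and reweights the examples before the next round. Below is a minimal Python sketch of this standard procedure; the names adaboost_train, adaboost_predict, and weak_learner are illustrative, not taken from the slides, and weak_learner is assumed to return a callable classifier.

```python
# Minimal AdaBoost sketch for binary labels y in {-1, +1}.
import numpy as np

def adaboost_train(X, y, weak_learner, T):
    """Train T weak classifiers and return them with their weights alpha_t."""
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_0: every example equally weighted
    classifiers, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)              # fit a weak classifier on the weighted data
        pred = h(X)                            # predictions in {-1, +1}
        eps = float(np.sum(D[pred != y]))      # weighted training error of h_t
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against log(0) / division by zero
        if eps >= 0.5:                         # weak learner must beat random guessing
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight correct examples
        D = D / D.sum()                        # renormalize: this is D_{t+1}
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    """H_T(x) = sign(sum_t alpha_t * h_t(x))."""
    scores = sum(a * h(X) for h, a in zip(classifiers, alphas))
    return np.sign(scores)
```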
AdaBoost Example: α_t = ln 2
D0: each of (x1, y1), ..., (x5, y5) has weight 1/5
Sample {(x5, y5), (x3, y3), (x1, y1)} and train h1
Update weights → D1: (2/7, 1/7, 2/7, 1/7, 1/7)
Sample {(x3, y3), (x1, y1)} and train h2; combined classifier: 3/5 h1 + 2/5 h2
Update weights → D2: (2/9, 1/9, 4/9, 1/9, 1/9)
Sample again ...
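The distributions on this slide follow from a simplified update in which only the weights of misclassified examples are multiplied by e^{α_t} = 2 before renormalizing; which examples h1 and h2 misclassify is inferred from the numbers shown. A short sketch that reproduces D1 and D2 under that assumption:

```python
# Reproduce the example's weight updates with alpha_t = ln 2:
# double the weights of misclassified examples, then renormalize.
import numpy as np

def update(D, mistakes):
    D = D.copy()
    D[mistakes] *= 2.0        # e^{ln 2} = 2 for misclassified examples
    return D / D.sum()

D0 = np.full(5, 1 / 5)        # (x1, ..., x5) all start with weight 1/5
D1 = update(D0, [0, 2])       # assume h1 misclassifies x1 and x3
print(D1)                     # [2/7, 1/7, 2/7, 1/7, 1/7]
D2 = update(D1, [2])          # assume h2 misclassifies x3
print(D2)                     # [2/9, 1/9, 4/9, 1/9, 1/9]
```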
How to Choose α_t in AdaBoost?
Problem with a constant weight α_t:
- No distinction between accurate classifiers and inaccurate classifiers
Consider how to construct the best distribution D_{t+1}(i) given D_t(i) and h_t:
1. D_{t+1}(i) should be significantly different from D_t(i)
2. D_{t+1}(i) should create a situation in which classifier h_t performs poorly (one such update is shown below)
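One standard update that meets both requirements (assumed here; it is consistent with the derivation two slides below) multiplies each weight by an exponential factor and renormalizes:

```latex
D_{t+1}(i) \;=\; \frac{D_t(i)\,\exp\!\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t},
\qquad
Z_t \;=\; \sum_{j=1}^{n} D_t(j)\,\exp\!\bigl(-\alpha_t\, y_j\, h_t(x_j)\bigr)
```

With α_t = (1/2) ln((1 − ε_t)/ε_t), the weighted error of h_t under D_{t+1} is exactly 1/2, i.e., h_t is no better than random guessing on the new distribution, which is precisely requirement 2.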
Optimization View for Choosing α_t
h_t(x): X → {+1, -1}, a basis (weak) classifier
H_T(x) = α_1 h_1(x) + ... + α_T h_T(x): a linear combination of basis classifiers
Goal: minimize the training error
Approximate the 0/1 training error with an exponential function (the bound is spelled out below)
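The approximation referred to here is the standard exponential upper bound on the 0/1 training error: each misclassified point contributes at least 1 to the right-hand side, because y_i H_T(x_i) ≤ 0 implies exp(−y_i H_T(x_i)) ≥ 1.

```latex
\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\bigl[\, y_i \ne \operatorname{sign}(H_T(x_i)) \,\bigr]
\;\le\;
\frac{1}{n}\sum_{i=1}^{n} \exp\bigl(-y_i\, H_T(x_i)\bigr),
\qquad
H_T(x) = \sum_{t=1}^{T} \alpha_t\, h_t(x)
```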
AdaBoost: A Greedy Approach to Optimize the Exponential Function
Exponential cost function: sum over i of exp(-y_i H_T(x_i))
Use the inductive form H_T(x) = H_{T-1}(x) + α_T h_T(x)
Minimize the exponential function over α_T by splitting the sum into the data points that h_T(x) predicts correctly and those it predicts incorrectly (derivation below)
AdaBoost is a greedy approach. Does it overfit?
- Empirical studies show that AdaBoost is robust in general
- AdaBoost tends to overfit with noisy data
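Spelled out, the greedy step uses the inductive form to factor the cost and split it over the points h_T classifies correctly and incorrectly (standard AdaBoost derivation, written here to match the slide's notation):

```latex
\sum_{i=1}^{n} e^{-y_i H_T(x_i)}
= \sum_{i=1}^{n} e^{-y_i H_{T-1}(x_i)}\, e^{-\alpha_T y_i h_T(x_i)}
= e^{-\alpha_T}\!\!\sum_{i:\, h_T(x_i)=y_i}\!\! w_i
\;+\; e^{\alpha_T}\!\!\sum_{i:\, h_T(x_i)\ne y_i}\!\! w_i,
\qquad w_i = e^{-y_i H_{T-1}(x_i)}
```

Setting the derivative with respect to α_T to zero gives α_T = (1/2) ln( Σ_{correct} w_i / Σ_{mistake} w_i ) = (1/2) ln((1 − ε_T)/ε_T), where ε_T is the weighted error of h_T.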
Empirical Study of AdaBoost
AdaBoosting decision trees:
- Generate 50 decision trees through the AdaBoost procedure
- Linearly combine the decision trees using the weights computed by the AdaBoost algorithm
In general: AdaBoost = Bagging > C4.5
AdaBoost usually needs fewer classifiers than Bagging
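A sketch of the kind of comparison this slide reports, with scikit-learn standing in for the original C4.5-based setup; the synthetic dataset and parameters are illustrative, not the data behind the reported result:

```python
# Compare a single decision tree, bagging (50 trees), and AdaBoost (50 rounds).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "single decision tree": DecisionTreeClassifier(random_state=0),
    # BaggingClassifier uses a decision tree as its default base learner
    "bagging, 50 trees": BaggingClassifier(n_estimators=50, random_state=0),
    # AdaBoostClassifier boosts depth-1 decision trees (stumps) by default
    "AdaBoost, 50 rounds": AdaBoostClassifier(n_estimators=50, random_state=0),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```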
Bias-Variance Tradeoff for AdaBoost
AdaBoost can reduce both model variance and model bias
[Figure: bias and variance compared for a single decision tree, bagged decision trees, and AdaBoosted decision trees]