Bagging and Random Forests. IOM 530: Intro. to Statistical Learning, Chapter 08 (part 02). Disclaimer: This PPT is modified from IOM 530: Intro. to Statistical Learning materials.
Outline: Bagging; Bootstrapping; Bagging for Regression Trees; Bagging for Classification Trees; Out-of-Bag Error Estimation; Variable Importance (Relative Influence Plots); Random Forests; Boosting.
Bagging
Problem! Decision trees as discussed earlier suffer from high variance: if we randomly split the training data into two parts and fit a decision tree to each part, the results could be quite different. We would like models with low variance. To address this, we can use bagging (bootstrap aggregating).
Chap 5: Bootstrapping is simple! A bootstrap sample is a resample of the observed dataset, of the same size as the observed dataset, obtained by random sampling with replacement from the original dataset.
Toy examples: in R, sample(1:3, 3, replace = TRUE) draws one bootstrap sample of {1, 2, 3}; re-running it after set.seed(100), set.seed(594), set.seed(500), or set.seed(200) gives different resamples, typically with some observations repeated and others left out.
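As a slightly fuller sketch (not from the slides), the same resampling idea can be used to bootstrap a statistic. Here the data vector x and the number of resamples B are made up for illustration:

```r
set.seed(100)
x <- rnorm(50, mean = 10, sd = 3)   # made-up observed dataset
B <- 1000                           # number of bootstrap resamples

# For each resample: draw n observations with replacement, recompute the statistic
boot_means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

sd(boot_means)           # bootstrap estimate of the standard error of the mean
sd(x) / sqrt(length(x))  # compare with the usual formula
```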
What is bagging? Bagging is an extremely powerful idea based on two things: averaging, which reduces variance, and bootstrapping, which supplies plenty of training datasets. Why does averaging reduce variance? Recall that given a set of n independent observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is $\sigma^2 / n$.
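A quick R check of this variance-reduction fact (a minimal sketch; the choices of n, sigma, and the number of simulations are arbitrary):

```r
set.seed(1)
n <- 25; sigma <- 2; nsim <- 10000

single_obs   <- rnorm(nsim, mean = 0, sd = sigma)            # one Z per simulation
sample_means <- replicate(nsim, mean(rnorm(n, 0, sigma)))    # mean of n Z's per simulation

var(single_obs)    # close to sigma^2 = 4
var(sample_means)  # close to sigma^2 / n = 0.16
```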
How does bagging work? Generate B different bootstrapped training datasets by repeatedly sampling with replacement from the (single) training dataset. Train the statistical learning method on each of the B bootstrapped training datasets and obtain B predictions. For prediction: regression, average the B predictions from the B trees; classification, take a majority vote among the B trees.
Bagging for Regression Trees. Construct B regression trees using B bootstrapped training datasets and average the resulting predictions. Note: these trees are grown deep and not pruned, so each individual tree has high variance but low bias. Averaging the B trees reduces the variance, so the bagged predictor ends up with both low bias and low variance.
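A minimal hand-rolled sketch of bagged regression trees (not the course's lab code). It assumes the rpart package and placeholder data frames train and test with a numeric response y:

```r
library(rpart)

bag_regression <- function(train, test, B = 100) {
  n <- nrow(train)
  preds <- matrix(NA, nrow = nrow(test), ncol = B)
  for (b in 1:B) {
    idx  <- sample(n, n, replace = TRUE)               # bootstrap sample
    tree <- rpart(y ~ ., data = train[idx, ],
                  control = rpart.control(cp = 0))     # deep, unpruned tree
    preds[, b] <- predict(tree, newdata = test)
  }
  rowMeans(preds)                                      # average the B predictions
}
```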
Bagging for Classification Trees. Construct B classification trees using B bootstrapped training datasets. For prediction there are two approaches: record the class each tree predicts and take the most commonly occurring one as the overall prediction (majority vote); or, if the classifier produces probability estimates, average the probabilities across trees and predict the class with the highest average probability. Both methods work well.
A Comparison of Error Rates. Here the green line represents the simple majority-vote approach and the purple line corresponds to averaging the probability estimates. Both do far better than a single tree (dashed red) and get close to the Bayes error rate (dashed grey).
Example 1: Housing Data. The red line shows the test mean squared error using a single tree; the black line shows the test error for bagging.
Example 2: Car Seat Data. The red line represents the test error rate using a single tree. The black line corresponds to the bagging error rate using majority vote, while the blue line averages the probabilities.
Out-of-Bag Error Estimation. Since bootstrapping selects a random subset of observations (with replacement) to build each training dataset, the non-selected observations can serve as test data for that tree. On average, each bagged tree makes use of around 2/3 of the observations, so roughly 1/3 of the observations are left out-of-bag and can be used for testing that tree.
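A sketch of reading the OOB error with the randomForest package (bagging corresponds to mtry equal to the number of predictors; the data frame train and the response name y are placeholders):

```r
library(randomForest)

p   <- ncol(train) - 1    # number of predictors (assumes y is the only non-predictor column)
bag <- randomForest(y ~ ., data = train, mtry = p, ntree = 500)

bag              # the printout includes the OOB error estimate
predict(bag)     # OOB predictions: each observation predicted only by trees that did not use it
```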
Variable Importance Measure. Bagging typically improves accuracy over prediction with a single tree, but the resulting model is harder to interpret: with hundreds of trees, it is no longer clear which variables are most important to the procedure. Thus bagging improves prediction accuracy at the expense of interpretability. However, we can still get an overall summary of the importance of each predictor using relative influence plots.
Relative Influence Plots. How do we decide which variables are most useful in predicting the response? We can compute relative influence plots, which give a score for each variable. The scores represent the decrease in MSE attributable to splits on a particular variable. A score close to zero indicates the variable is not important and could be dropped; the larger the score, the more influence the variable has.
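A sketch of the corresponding importance output in R, assuming the bagged randomForest fit bag from the previous sketch (for boosted models, summary() on a gbm fit gives the analogous relative influence plot):

```r
importance(bag)   # per-variable importance scores
varImpPlot(bag)   # dot chart of the scores, most influential variables at the top
```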
Example: Housing Data. Median income is by far the most important variable; longitude, latitude, and average occupancy are the next most important.
Random Forests
Random Forests. A very efficient statistical learning method that builds on the idea of bagging but provides an improvement because it de-correlates the trees. A single decision tree can easily achieve a 0% error rate on the training data (if each training example gets its own leaf). A random forest is bagging of decision trees, but resampling the training data alone is not sufficient: we also randomly restrict the features/questions used at each split. How does it work? Build a number of decision trees on bootstrapped training samples, but each time a split in a tree is considered, choose a random sample of m predictors as split candidates from the full set of p predictors (usually $m \approx \sqrt{p}$).
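A sketch of fitting a random forest with the usual m ≈ sqrt(p) rule via the randomForest package (train, test, and the response y are placeholders):

```r
library(randomForest)

p  <- ncol(train) - 1
rf <- randomForest(y ~ ., data = train,
                   mtry = floor(sqrt(p)),   # m ~ sqrt(p) predictors tried at each split
                   ntree = 500, importance = TRUE)

pred <- predict(rf, newdata = test)         # prediction aggregated over all trees
```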
Why do we consider a random sample of m predictors instead of all p predictors at each split? Suppose we have one very strong predictor in the dataset along with a number of other moderately strong predictors. Then, in the collection of bagged trees, most or all of them will use the very strong predictor for the first split, so all the bagged trees will look similar and their predictions will be highly correlated. Averaging many highly correlated quantities does not lead to a large variance reduction. Random forests "de-correlate" the bagged trees, leading to a greater reduction in variance.
Random Forests with different values of m. Notice that when a random forest is built using m = p, this amounts simply to bagging.
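A sketch of comparing a few values of m, again with placeholder train/test data frames and a numeric response y:

```r
library(randomForest)

p        <- ncol(train) - 1
m_vals   <- c(p, floor(p / 2), floor(sqrt(p)))   # bagging, m = p/2, m = sqrt(p)
test_mse <- sapply(m_vals, function(m) {
  fit <- randomForest(y ~ ., data = train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test) - test$y)^2)
})
data.frame(m = m_vals, test_mse)
```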
Boosting
Boosting. Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Consider boosting for decision trees. Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model; notably, each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.
Boosting algorithm for regression trees. See more details in this PPT: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Ensemble%20(v6).pdf
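The algorithm itself is not reproduced on the slide. As a reference point, here is a minimal sketch of the standard sequential boosting loop for regression trees (repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the model); the values of B, the shrinkage lambda, the tree depth, and the train/test data frames are illustrative placeholders:

```r
library(rpart)

boost_regression <- function(train, test, B = 1000, lambda = 0.01, depth = 2) {
  X         <- train[, setdiff(names(train), "y"), drop = FALSE]  # predictors only
  resid     <- train$y                     # start from f(x) = 0, so residuals r = y
  pred_test <- rep(0, nrow(test))
  for (b in 1:B) {
    tree <- rpart(r ~ ., data = cbind(X, r = resid),
                  control = rpart.control(maxdepth = depth, cp = 0))  # small tree on residuals
    resid     <- resid - lambda * predict(tree, newdata = X)          # update residuals
    pred_test <- pred_test + lambda * predict(tree, newdata = test)   # accumulate shrunken trees
  }
  pred_test                                # boosted prediction = sum of shrunken trees
}
```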
What is the idea behind this procedure?
Boosting. The R package gbm (generalized boosted regression models) handles a variety of regression and classification problems.
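A minimal gbm() sketch for boosted regression trees (the formula, data frame, and tuning values are placeholders, not settings from the course lab):

```r
library(gbm)

boost <- gbm(y ~ ., data = train,
             distribution = "gaussian",    # squared-error loss for regression
             n.trees = 5000,               # B: number of trees
             interaction.depth = 2,        # d: depth of each small tree
             shrinkage = 0.01)             # lambda: learning rate

summary(boost)                                          # relative influence of each predictor
pred <- predict(boost, newdata = test, n.trees = 5000)
```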
R Lab in Chap 8
Deviance (page 206). Deviance is a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.
Ensemble: Boosting, Improving Weak Classifiers. Disclaimer: All slides below are from: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Ensemble%20(v6).pdf
Boosting. Training data: $\{(x^1, \hat{y}^1), \dots, (x^n, \hat{y}^n), \dots, (x^N, \hat{y}^N)\}$ with $\hat{y} = \pm 1$ (binary classification). Boosting guarantee: if your ML algorithm can produce a classifier with error rate smaller than 50% on the training data, you can obtain a classifier with 0% error rate on the training data after boosting. Framework of boosting: use an ML algorithm to obtain a first classifier $f_1(x)$; if $f_1(x)$ is weak (poor performance on the training data), find another function $f_2(x)$ to help it. However, if $f_2(x)$ is similar to $f_1(x)$, it will not help much, so we want $f_2(x)$ to be complementary to $f_1(x)$ (how?). Obtain the second classifier $f_2(x)$, and so on; finally, combine all the classifiers. The classifiers are learned sequentially.
How to obtain different classifiers? Train on different training data sets. How to get different training data sets: re-sample your training data to form a new set, or re-weight your training data to form a new set. In a real implementation you only have to change the cost/objective function: the unweighted objective $L(f) = \sum_n l\big(f(x^n), \hat{y}^n\big)$ becomes the weighted objective $L(f) = \sum_n u^n\, l\big(f(x^n), \hat{y}^n\big)$. Example: training examples $(x^1, \hat{y}^1, u^1)$, $(x^2, \hat{y}^2, u^2)$, $(x^3, \hat{y}^3, u^3)$ with initial weights $u^1 = u^2 = u^3 = 1$, re-weighted to, say, $u^1 = 0.4$, $u^2 = 2.1$, $u^3 = 0.7$.
Idea of AdaBoost. Train $f_2(x)$ on a new training set on which $f_1(x)$ fails. How do we find a training set that fails $f_1(x)$? Let $\varepsilon_1$ be the weighted error rate of $f_1(x)$ on its training data: $\varepsilon_1 = \dfrac{\sum_n u_1^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_1}$, where $Z_1 = \sum_n u_1^n$ and $\varepsilon_1 < 0.5$. Change the example weights from $u_1^n$ to $u_2^n$ such that $\dfrac{\sum_n u_2^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_2} = 0.5$, i.e. the performance of $f_1$ under the new weights is no better than random. Then train $f_2(x)$ based on the new weights $u_2^n$.
Re-weighting Training Data. Idea: train $f_2(x)$ on the new training set that fails $f_1(x)$. Toy example: four training examples $(x^1, \hat{y}^1, u^1), \dots, (x^4, \hat{y}^4, u^4)$, all with initial weight $u^n = 1$. Suppose $f_1(x)$ misclassifies exactly one of them, so $\varepsilon_1 = 0.25$. Re-weight: the misclassified example gets weight $\sqrt{3}$ and the three correctly classified examples get weight $1/\sqrt{3}$. Under the new weights the weighted error of $f_1(x)$ becomes $0.5$, and $f_2(x)$ trained on the re-weighted data has $\varepsilon_2 < 0.5$.
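A quick numeric check of this toy re-weighting in R, using the value $d_1 = \sqrt{(1-\varepsilon_1)/\varepsilon_1}$ derived on the following slides:

```r
eps1 <- 0.25
d1   <- sqrt((1 - eps1) / eps1)       # = sqrt(3), about 1.732

u_wrong   <- 1 * d1                    # weight of the misclassified example goes up
u_correct <- 1 / d1                    # weights of the 3 correct examples go down

u_wrong / (u_wrong + 3 * u_correct)    # weighted error of f1 under the new weights = 0.5
```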
Re-weighting Training Data (rule). If $x^n$ is misclassified by $f_1$ (i.e. $f_1(x^n) \neq \hat{y}^n$): $u_2^n \leftarrow u_1^n \cdot d_1$, so its weight increases. If $x^n$ is correctly classified by $f_1$ (i.e. $f_1(x^n) = \hat{y}^n$): $u_2^n \leftarrow u_1^n / d_1$, so its weight decreases. Intuitively, the examples $f_1$ gets wrong become more important. $f_2$ is then learned using the example weights $u_2^n$. What is the value of $d_1$?
Re-weighting Training Data (finding $d_1$). Recall $Z_1 = \sum_n u_1^n$ and $\varepsilon_1 = \dfrac{\sum_n u_1^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_1}$. The new weights are $u_2^n = u_1^n d_1$ when $f_1(x^n) \neq \hat{y}^n$ and $u_2^n = u_1^n / d_1$ when $f_1(x^n) = \hat{y}^n$, and we require
$$\frac{\sum_n u_2^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_2} = 0.5, \qquad Z_2 = \sum_n u_2^n = \sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1 + \sum_{f_1(x^n) = \hat{y}^n} u_1^n / d_1 .$$
Equivalently,
$$\frac{\sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1}{\sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1 + \sum_{f_1(x^n) = \hat{y}^n} u_1^n / d_1} = 0.5, \quad \text{i.e.} \quad \sum_{f_1(x^n) = \hat{y}^n} u_1^n / d_1 = \sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1 .$$
Using $\sum_{f_1(x^n) \neq \hat{y}^n} u_1^n = Z_1 \varepsilon_1$ and $\sum_{f_1(x^n) = \hat{y}^n} u_1^n = Z_1 (1 - \varepsilon_1)$, this becomes $Z_1 (1 - \varepsilon_1) / d_1 = Z_1 \varepsilon_1 d_1$, so
$$d_1 = \sqrt{\frac{1 - \varepsilon_1}{\varepsilon_1}} > 1 \qquad (\text{since } \varepsilon_1 < 0.5).$$
Algorithm for AdaBoost. Given training data $\{(x^1, \hat{y}^1, u_1^1), \dots, (x^n, \hat{y}^n, u_1^n), \dots, (x^N, \hat{y}^N, u_1^N)\}$ with $\hat{y} = \pm 1$ (binary classification) and equal initial weights $u_1^n = 1$. For $t = 1, \dots, T$: train a weak classifier $f_t(x)$ with weights $u_t^1, \dots, u_t^N$, and let $\varepsilon_t$ be its weighted error rate under these weights. Then, for $n = 1, \dots, N$: if $x^n$ is misclassified by $f_t(x)$ (i.e. $\hat{y}^n \neq f_t(x^n)$), set $u_{t+1}^n = u_t^n \times d_t = u_t^n \times \exp(\alpha_t)$; otherwise set $u_{t+1}^n = u_t^n / d_t = u_t^n \times \exp(-\alpha_t)$, where $d_t = \sqrt{(1 - \varepsilon_t)/\varepsilon_t}$ and $\alpha_t = \ln\sqrt{(1 - \varepsilon_t)/\varepsilon_t}$. Since the outputs are $\pm 1$, both cases can be written compactly as $u_{t+1}^n \leftarrow u_t^n \times \exp\big(-\hat{y}^n f_t(x^n)\, \alpha_t\big)$.
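A compact sketch of this AdaBoost loop in R, using depth-1 rpart trees (decision stumps) as the weak classifiers. The data frame train with labels y coded as +1/-1 is a placeholder, and the stump settings are illustrative (the sketch assumes each weighted error satisfies 0 < epsilon_t < 0.5):

```r
library(rpart)

adaboost <- function(train, T = 10) {
  ynum    <- train$y                            # numeric +1/-1 labels
  train$y <- factor(train$y)                    # factor response for classification trees
  train$w <- rep(1 / nrow(train), nrow(train))  # equal initial example weights
  stumps  <- vector("list", T); alpha <- numeric(T)

  for (t in 1:T) {
    stumps[[t]] <- rpart(y ~ . - w, data = train, weights = w,
                         control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- ifelse(predict(stumps[[t]], train, type = "class") == "1", 1, -1)
    eps  <- sum(train$w * (pred != ynum)) / sum(train$w)     # weighted error epsilon_t
    alpha[t] <- log(sqrt((1 - eps) / eps))                   # alpha_t = ln sqrt((1 - eps)/eps)
    train$w  <- train$w * exp(-ynum * pred * alpha[t])       # mistakes up-weighted, correct down-weighted
  }
  list(stumps = stumps, alpha = alpha)
}

# Final classifier: H(x) = sign( sum_t alpha_t f_t(x) )
adaboost_predict <- function(model, newdata) {
  votes <- sapply(seq_along(model$stumps), function(t)
    model$alpha[t] * ifelse(predict(model$stumps[[t]], newdata, type = "class") == "1", 1, -1))
  sign(rowSums(votes))
}
```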
Algorithm for AdaBoost (aggregation). We obtain a set of functions $f_1(x), \dots, f_t(x), \dots, f_T(x)$. How do we aggregate them? Uniform weights: $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} f_t(x)\big)$. Non-uniform weights: $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t f_t(x)\big)$ with $\alpha_t = \ln\sqrt{(1 - \varepsilon_t)/\varepsilon_t}$, so a smaller error $\varepsilon_t$ gives a larger weight $\alpha_t$ in the final vote: for example, $\varepsilon_t = 0.1$ gives $\alpha_t = 1.10$, while $\varepsilon_t = 0.4$ gives $\alpha_t = 0.20$. (Recall the weight update $u_{t+1}^n = u_t^n \times \exp(-\hat{y}^n f_t(x^n)\, \alpha_t)$.)
Toy Example (t = 1). T = 3, weak classifier = decision stump. [Figure: ten training points, each with initial weight 1.0; the first stump $f_1(x)$ misclassifies three of them.] $\varepsilon_1 = 0.30$, $d_1 = 1.53$, $\alpha_1 = 0.42$; the misclassified points are re-weighted to 1.53 and the correctly classified points to 0.65.
Toy Example (t = 2). On the re-weighted data the second stump $f_2(x)$ has $\varepsilon_2 = 0.21$, $d_2 = 1.94$, $\alpha_2 = 0.66$; the weights of the points it misclassifies are multiplied by 1.94 (e.g. 0.65 becomes 1.26) and the rest are divided by 1.94 (e.g. 1.53 becomes 0.78 and 0.65 becomes 0.33).
Toy Example (t = 3). On the re-weighted data the third stump $f_3(x)$ has $\varepsilon_3 = 0.13$, $d_3 = 2.59$, $\alpha_3 = 0.95$.
Toy Example (final classifier). $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t f_t(x)\big) = \mathrm{sign}\big(0.42\, f_1(x) + 0.66\, f_2(x) + 0.95\, f_3(x)\big)$, which classifies all of the training points correctly.
Warning of Math. Claim: with $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t f_t(x)\big)$ and $\alpha_t = \ln\sqrt{(1-\varepsilon_t)/\varepsilon_t}$, as we add more and more $f_t$ (as $T$ increases), $H(x)$ achieves a smaller and smaller error rate on the training data.
Error Rate of the Final Classifier. Let $g(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, so $H(x) = \mathrm{sign}\big(g(x)\big)$ and $\alpha_t = \ln\sqrt{(1-\varepsilon_t)/\varepsilon_t}$. The training data error rate is
$$\frac{1}{N} \sum_n \delta\big(H(x^n) \neq \hat{y}^n\big) = \frac{1}{N} \sum_n \delta\big(\hat{y}^n g(x^n) < 0\big) \le \frac{1}{N} \sum_n \exp\big(-\hat{y}^n g(x^n)\big),$$
since $\exp(-\hat{y}^n g(x^n))$ upper-bounds the 0/1 loss as a function of the margin $\hat{y}^n g(x^n)$.
What is $Z_{T+1}$? With $g(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, the training data error rate is at most $\frac{1}{N} \sum_n \exp(-\hat{y}^n g(x^n)) = \frac{1}{N} Z_{T+1}$, where $Z_t$ denotes the sum of the example weights used for training $f_t$. Indeed, since $u_1^n = 1$ and $u_{t+1}^n = u_t^n \exp(-\hat{y}^n f_t(x^n)\, \alpha_t)$, we have $u_{T+1}^n = \prod_{t=1}^{T} \exp(-\hat{y}^n f_t(x^n)\, \alpha_t)$, so
$$Z_{T+1} = \sum_n u_{T+1}^n = \sum_n \prod_{t=1}^{T} \exp\big(-\hat{y}^n f_t(x^n)\, \alpha_t\big) = \sum_n \exp\Big(-\hat{y}^n \sum_{t=1}^{T} \alpha_t f_t(x^n)\Big) = \sum_n \exp\big(-\hat{y}^n g(x^n)\big).$$
Bounding $Z_{T+1}$. Start from $Z_1 = N$ (equal weights). The weight update gives
$$Z_{t+1} = Z_t\, \varepsilon_t \exp(\alpha_t) + Z_t (1 - \varepsilon_t) \exp(-\alpha_t),$$
where the first term is the misclassified portion of $Z_t$ (weights multiplied by $\exp(\alpha_t)$) and the second is the correctly classified portion (weights multiplied by $\exp(-\alpha_t)$). Substituting $\alpha_t = \ln\sqrt{(1-\varepsilon_t)/\varepsilon_t}$,
$$Z_{t+1} = Z_t\, \varepsilon_t \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}} + Z_t (1 - \varepsilon_t) \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} = Z_t \cdot 2\sqrt{\varepsilon_t (1 - \varepsilon_t)} .$$
Therefore $Z_{T+1} = N \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$, and
$$\text{training data error rate} \le \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1 - \varepsilon_t)},$$
where each factor is $< 1$ (since $\varepsilon_t < 0.5$), so the bound gets smaller and smaller as $T$ grows.
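A quick numeric check of this bound using the $\varepsilon_t$ values from the toy example above (0.30, 0.21, 0.13):

```r
eps   <- c(0.30, 0.21, 0.13)           # weighted errors from the toy example
bound <- cumprod(2 * sqrt(eps * (1 - eps)))
round(bound, 3)                        # 0.917 0.747 0.502: the bound shrinks as T grows
```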
End of Warning. There are many good explanations of this material online.
Even though the training error is 0, why does the testing error still decrease? With $H(x) = \mathrm{sign}\big(g(x)\big)$ and $g(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, define the margin of an example as $\hat{y}\, g(x)$. Adding more weak classifiers continues to increase the margins of the training examples even after the training error has reached 0.
Large Margin? Training data error rate $= \frac{1}{N}\sum_n \delta\big(H(x^n) \neq \hat{y}^n\big) \le \frac{1}{N}\sum_n \exp\big(-\hat{y}^n g(x^n)\big) = \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1-\varepsilon_t)}$, which keeps getting smaller as $T$ increases. [Figure: the 0/1 loss, the AdaBoost exponential loss, the logistic regression loss, and the SVM hinge loss plotted as functions of the margin $\hat{y}^n g(x^n)$; the exponential upper bound continues to reward larger margins even after the 0/1 training error reaches 0.]