Bagging and Random Forests. IOM 530: Intro. to Statistical Learning, Chapter 08 (part 02). Disclaimer: This PPT is modified from IOM 530: Intro. to Statistical Learning materials.
Outline: Bagging; Bootstrapping; Bagging for Regression Trees; Bagging for Classification Trees; Out-of-Bag Error Estimation; Variable Importance (Relative Influence Plots); Random Forests; Boosting.
Bagging
Problem! Decision trees as discussed earlier suffer from high variance: if we randomly split the training data into two parts and fit a decision tree to each part, the results could be quite different. We would like models with low variance. To address this, we can use bagging (bootstrap aggregating).
Chap 5: Bootstrapping is simple! A bootstrap sample is a resample of the observed dataset, of the same size as the observed dataset, obtained by random sampling with replacement from the original dataset.
Toy examples: in R, sample(1:3, 3, replace = TRUE) draws one bootstrap sample of {1, 2, 3}; re-running it after set.seed(100), set.seed(594), set.seed(500), or set.seed(200) gives different resamples, typically with some observations repeated and others left out.
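As a slightly fuller sketch (not from the slides), the same resampling idea can be used to bootstrap a statistic. Here the data vector x and the number of resamples B are made up for illustration:

```r
set.seed(100)
x <- rnorm(50, mean = 10, sd = 3)   # made-up observed dataset
B <- 1000                           # number of bootstrap resamples

# For each resample: draw n observations with replacement, recompute the statistic
boot_means <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

sd(boot_means)           # bootstrap estimate of the standard error of the mean
sd(x) / sqrt(length(x))  # compare with the usual formula
```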
What is bagging? Bagging is an extremely powerful idea based on two things: averaging, which reduces variance, and bootstrapping, which supplies plenty of training datasets. Why does averaging reduce variance? Recall that given a set of n independent observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is $\sigma^2 / n$.
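A quick R check of this variance-reduction fact (a minimal sketch; the choices of n, sigma, and the number of simulations are arbitrary):

```r
set.seed(1)
n <- 25; sigma <- 2; nsim <- 10000

single_obs   <- rnorm(nsim, mean = 0, sd = sigma)            # one Z per simulation
sample_means <- replicate(nsim, mean(rnorm(n, 0, sigma)))    # mean of n Z's per simulation

var(single_obs)    # close to sigma^2 = 4
var(sample_means)  # close to sigma^2 / n = 0.16
```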
How does bagging work? Generate B different bootstrapped training datasets by repeatedly sampling with replacement from the (single) training dataset. Train the statistical learning method on each of the B bootstrapped training datasets and obtain B predictions. For prediction: regression, average the B predictions from the B trees; classification, take a majority vote among the B trees.
Bagging for Regression Trees. Construct B regression trees using B bootstrapped training datasets and average the resulting predictions. Note: these trees are grown deep and not pruned, so each individual tree has high variance but low bias. Averaging the B trees reduces the variance, so the bagged predictor ends up with both low bias and low variance.
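A minimal hand-rolled sketch of bagged regression trees (not the course's lab code). It assumes the rpart package and placeholder data frames train and test with a numeric response y:

```r
library(rpart)

bag_regression <- function(train, test, B = 100) {
  n <- nrow(train)
  preds <- matrix(NA, nrow = nrow(test), ncol = B)
  for (b in 1:B) {
    idx  <- sample(n, n, replace = TRUE)               # bootstrap sample
    tree <- rpart(y ~ ., data = train[idx, ],
                  control = rpart.control(cp = 0))     # deep, unpruned tree
    preds[, b] <- predict(tree, newdata = test)
  }
  rowMeans(preds)                                      # average the B predictions
}
```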
Bagging for Classification Trees. Construct B classification trees using B bootstrapped training datasets. For prediction there are two approaches: record the class each tree predicts and take the most commonly occurring one as the overall prediction (majority vote); or, if the classifier produces probability estimates, average the probabilities across trees and predict the class with the highest average probability. Both methods work well.
A Comparison of Error Rates. Here the green line represents the simple majority-vote approach and the purple line corresponds to averaging the probability estimates. Both do far better than a single tree (dashed red) and get close to the Bayes error rate (dashed grey).
Example 1: Housing Data. The red line shows the test mean squared error using a single tree; the black line shows the test error for bagging.
Example 2: Car Seat Data. The red line represents the test error rate using a single tree. The black line corresponds to the bagging error rate using majority vote, while the blue line averages the probabilities.
Out-of-Bag Error Estimation. Since bootstrapping selects a random subset of observations (with replacement) to build each training dataset, the non-selected observations can serve as test data for that tree. On average, each bagged tree makes use of around 2/3 of the observations, so roughly 1/3 of the observations are left out-of-bag and can be used for testing that tree.
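A sketch of reading the OOB error with the randomForest package (bagging corresponds to mtry equal to the number of predictors; the data frame train and the response name y are placeholders):

```r
library(randomForest)

p   <- ncol(train) - 1    # number of predictors (assumes y is the only non-predictor column)
bag <- randomForest(y ~ ., data = train, mtry = p, ntree = 500)

bag              # the printout includes the OOB error estimate
predict(bag)     # OOB predictions: each observation predicted only by trees that did not use it
```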
Variable Importance Measure. Bagging typically improves accuracy over prediction with a single tree, but the resulting model is harder to interpret: with hundreds of trees, it is no longer clear which variables are most important to the procedure. Thus bagging improves prediction accuracy at the expense of interpretability. However, we can still get an overall summary of the importance of each predictor using relative influence plots.
Relative Influence Plots. How do we decide which variables are most useful in predicting the response? We can compute relative influence plots, which give a score for each variable. The scores represent the decrease in MSE attributable to splits on a particular variable. A score close to zero indicates the variable is not important and could be dropped; the larger the score, the more influence the variable has.
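A sketch of the corresponding importance output in R, assuming the bagged randomForest fit bag from the previous sketch (for boosted models, summary() on a gbm fit gives the analogous relative influence plot):

```r
importance(bag)   # per-variable importance scores
varImpPlot(bag)   # dot chart of the scores, most influential variables at the top
```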
Example: Housing Data. Median income is by far the most important variable; longitude, latitude, and average occupancy are the next most important.
Random Forests
Random Forests. A very efficient statistical learning method that builds on the idea of bagging but provides an improvement because it de-correlates the trees. A single decision tree can easily achieve a 0% error rate on the training data (if each training example gets its own leaf). A random forest is bagging of decision trees, but resampling the training data alone is not sufficient: we also randomly restrict the features/questions used at each split. How does it work? Build a number of decision trees on bootstrapped training samples, but each time a split in a tree is considered, choose a random sample of m predictors as split candidates from the full set of p predictors (usually $m \approx \sqrt{p}$).
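A sketch of fitting a random forest with the usual m ≈ sqrt(p) rule via the randomForest package (train, test, and the response y are placeholders):

```r
library(randomForest)

p  <- ncol(train) - 1
rf <- randomForest(y ~ ., data = train,
                   mtry = floor(sqrt(p)),   # m ~ sqrt(p) predictors tried at each split
                   ntree = 500, importance = TRUE)

pred <- predict(rf, newdata = test)         # prediction aggregated over all trees
```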
Why do we consider a random sample of m predictors instead of all p predictors at each split? Suppose we have one very strong predictor in the dataset along with a number of other moderately strong predictors. Then, in the collection of bagged trees, most or all of them will use the very strong predictor for the first split, so all the bagged trees will look similar and their predictions will be highly correlated. Averaging many highly correlated quantities does not lead to a large variance reduction. Random forests "de-correlate" the bagged trees, leading to a greater reduction in variance.
Random Forests with different values of m. Notice that when a random forest is built using m = p, this amounts simply to bagging.
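A sketch of comparing a few values of m, again with placeholder train/test data frames and a numeric response y:

```r
library(randomForest)

p        <- ncol(train) - 1
m_vals   <- c(p, floor(p / 2), floor(sqrt(p)))   # bagging, m = p/2, m = sqrt(p)
test_mse <- sapply(m_vals, function(m) {
  fit <- randomForest(y ~ ., data = train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test) - test$y)^2)
})
data.frame(m = m_vals, test_mse)
```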
Boosting
Boosting. Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Consider boosting for decision trees. Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model; notably, each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.
Boosting algorithm for regression trees. See more details in this PPT: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Ensemble%20(v6).pdf
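The algorithm itself is not reproduced on the slide. As a reference point, here is a minimal sketch of the standard sequential boosting loop for regression trees (repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the model); the values of B, the shrinkage lambda, the tree depth, and the train/test data frames are illustrative placeholders:

```r
library(rpart)

boost_regression <- function(train, test, B = 1000, lambda = 0.01, depth = 2) {
  X         <- train[, setdiff(names(train), "y"), drop = FALSE]  # predictors only
  resid     <- train$y                     # start from f(x) = 0, so residuals r = y
  pred_test <- rep(0, nrow(test))
  for (b in 1:B) {
    tree <- rpart(r ~ ., data = cbind(X, r = resid),
                  control = rpart.control(maxdepth = depth, cp = 0))  # small tree on residuals
    resid     <- resid - lambda * predict(tree, newdata = X)          # update residuals
    pred_test <- pred_test + lambda * predict(tree, newdata = test)   # accumulate shrunken trees
  }
  pred_test                                # boosted prediction = sum of shrunken trees
}
```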
What is the idea behind this procedure?
Boosting. The R package gbm (generalized boosted regression models) handles a variety of regression and classification problems.
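A minimal gbm() sketch for boosted regression trees (the formula, data frame, and tuning values are placeholders, not settings from the course lab):

```r
library(gbm)

boost <- gbm(y ~ ., data = train,
             distribution = "gaussian",    # squared-error loss for regression
             n.trees = 5000,               # B: number of trees
             interaction.depth = 2,        # d: depth of each small tree
             shrinkage = 0.01)             # lambda: learning rate

summary(boost)                                          # relative influence of each predictor
pred <- predict(boost, newdata = test, n.trees = 5000)
```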
R Lab in Chap 8
Deviance (page 206). Deviance is a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.
Ensemble: Boosting, Improving Weak Classifiers. Disclaimer: All slides below are from: http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/Ensemble%20(v6).pdf
Boosting. Training data: $\{(x^1, \hat{y}^1), \dots, (x^n, \hat{y}^n), \dots, (x^N, \hat{y}^N)\}$ with $\hat{y} = \pm 1$ (binary classification). Boosting guarantee: if your ML algorithm can produce a classifier with error rate smaller than 50% on the training data, you can obtain a classifier with 0% error rate on the training data after boosting. Framework of boosting: use an ML algorithm to obtain a first classifier $f_1(x)$; if $f_1(x)$ is weak (poor performance on the training data), find another function $f_2(x)$ to help it. However, if $f_2(x)$ is similar to $f_1(x)$, it will not help much, so we want $f_2(x)$ to be complementary to $f_1(x)$ (how?). Obtain the second classifier $f_2(x)$, and so on; finally, combine all the classifiers. The classifiers are learned sequentially.
How to obtain different classifiers? Train on different training data sets. How to get different training data sets: re-sample your training data to form a new set, or re-weight your training data to form a new set. In a real implementation you only have to change the cost/objective function: the unweighted objective $L(f) = \sum_n l\big(f(x^n), \hat{y}^n\big)$ becomes the weighted objective $L(f) = \sum_n u^n\, l\big(f(x^n), \hat{y}^n\big)$. Example: training examples $(x^1, \hat{y}^1, u^1)$, $(x^2, \hat{y}^2, u^2)$, $(x^3, \hat{y}^3, u^3)$ with initial weights $u^1 = u^2 = u^3 = 1$, re-weighted to, say, $u^1 = 0.4$, $u^2 = 2.1$, $u^3 = 0.7$.
Idea of AdaBoost. Train $f_2(x)$ on a new training set on which $f_1(x)$ fails. How do we find a training set that fails $f_1(x)$? Let $\varepsilon_1$ be the weighted error rate of $f_1(x)$ on its training data: $\varepsilon_1 = \dfrac{\sum_n u_1^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_1}$, where $Z_1 = \sum_n u_1^n$ and $\varepsilon_1 < 0.5$. Change the example weights from $u_1^n$ to $u_2^n$ such that $\dfrac{\sum_n u_2^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_2} = 0.5$, i.e. the performance of $f_1$ under the new weights is no better than random. Then train $f_2(x)$ based on the new weights $u_2^n$.
Re-weighting Training Data. Idea: train $f_2(x)$ on the new training set that fails $f_1(x)$. Toy example: four training examples $(x^1, \hat{y}^1, u^1), \dots, (x^4, \hat{y}^4, u^4)$, all with initial weight $u^n = 1$. Suppose $f_1(x)$ misclassifies exactly one of them, so $\varepsilon_1 = 0.25$. Re-weight: the misclassified example gets weight $\sqrt{3}$ and the three correctly classified examples get weight $1/\sqrt{3}$. Under the new weights the weighted error of $f_1(x)$ becomes $0.5$, and $f_2(x)$ trained on the re-weighted data has $\varepsilon_2 < 0.5$.
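A quick numeric check of this toy re-weighting in R, using the value $d_1 = \sqrt{(1-\varepsilon_1)/\varepsilon_1}$ derived on the following slides:

```r
eps1 <- 0.25
d1   <- sqrt((1 - eps1) / eps1)       # = sqrt(3), about 1.732

u_wrong   <- 1 * d1                    # weight of the misclassified example goes up
u_correct <- 1 / d1                    # weights of the 3 correct examples go down

u_wrong / (u_wrong + 3 * u_correct)    # weighted error of f1 under the new weights = 0.5
```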
Re-weighting Training Data (rule). If $x^n$ is misclassified by $f_1$ (i.e. $f_1(x^n) \neq \hat{y}^n$): $u_2^n \leftarrow u_1^n \cdot d_1$, so its weight increases. If $x^n$ is correctly classified by $f_1$ (i.e. $f_1(x^n) = \hat{y}^n$): $u_2^n \leftarrow u_1^n / d_1$, so its weight decreases. Intuitively, the examples $f_1$ gets wrong become more important. $f_2$ is then learned using the example weights $u_2^n$. What is the value of $d_1$?
Re-weighting Training Data (finding $d_1$). Recall $Z_1 = \sum_n u_1^n$ and $\varepsilon_1 = \dfrac{\sum_n u_1^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_1}$. The new weights are $u_2^n = u_1^n d_1$ when $f_1(x^n) \neq \hat{y}^n$ and $u_2^n = u_1^n / d_1$ when $f_1(x^n) = \hat{y}^n$, and we require
$$\frac{\sum_n u_2^n\, \delta\big(f_1(x^n) \neq \hat{y}^n\big)}{Z_2} = 0.5, \qquad Z_2 = \sum_n u_2^n = \sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1 + \sum_{f_1(x^n) = \hat{y}^n} u_1^n / d_1 .$$
Equivalently,
$$\frac{\sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1}{\sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1 + \sum_{f_1(x^n) = \hat{y}^n} u_1^n / d_1} = 0.5, \quad \text{i.e.} \quad \sum_{f_1(x^n) = \hat{y}^n} u_1^n / d_1 = \sum_{f_1(x^n) \neq \hat{y}^n} u_1^n d_1 .$$
Using $\sum_{f_1(x^n) \neq \hat{y}^n} u_1^n = Z_1 \varepsilon_1$ and $\sum_{f_1(x^n) = \hat{y}^n} u_1^n = Z_1 (1 - \varepsilon_1)$, this becomes $Z_1 (1 - \varepsilon_1) / d_1 = Z_1 \varepsilon_1 d_1$, so
$$d_1 = \sqrt{\frac{1 - \varepsilon_1}{\varepsilon_1}} > 1 \qquad (\text{since } \varepsilon_1 < 0.5).$$
Algorithm for AdaBoost. Given training data $\{(x^1, \hat{y}^1, u_1^1), \dots, (x^n, \hat{y}^n, u_1^n), \dots, (x^N, \hat{y}^N, u_1^N)\}$ with $\hat{y} = \pm 1$ (binary classification) and equal initial weights $u_1^n = 1$. For $t = 1, \dots, T$: train a weak classifier $f_t(x)$ with weights $u_t^1, \dots, u_t^N$, and let $\varepsilon_t$ be its weighted error rate under these weights. Then, for $n = 1, \dots, N$: if $x^n$ is misclassified by $f_t(x)$ (i.e. $\hat{y}^n \neq f_t(x^n)$), set $u_{t+1}^n = u_t^n \times d_t = u_t^n \times \exp(\alpha_t)$; otherwise set $u_{t+1}^n = u_t^n / d_t = u_t^n \times \exp(-\alpha_t)$, where $d_t = \sqrt{(1 - \varepsilon_t)/\varepsilon_t}$ and $\alpha_t = \ln\sqrt{(1 - \varepsilon_t)/\varepsilon_t}$. Since the outputs are $\pm 1$, both cases can be written compactly as $u_{t+1}^n \leftarrow u_t^n \times \exp\big(-\hat{y}^n f_t(x^n)\, \alpha_t\big)$.
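A compact sketch of this AdaBoost loop in R, using depth-1 rpart trees (decision stumps) as the weak classifiers. The data frame train with labels y coded as +1/-1 is a placeholder, and the stump settings are illustrative (the sketch assumes each weighted error satisfies 0 < epsilon_t < 0.5):

```r
library(rpart)

adaboost <- function(train, T = 10) {
  ynum    <- train$y                            # numeric +1/-1 labels
  train$y <- factor(train$y)                    # factor response for classification trees
  train$w <- rep(1 / nrow(train), nrow(train))  # equal initial example weights
  stumps  <- vector("list", T); alpha <- numeric(T)

  for (t in 1:T) {
    stumps[[t]] <- rpart(y ~ . - w, data = train, weights = w,
                         control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2))
    pred <- ifelse(predict(stumps[[t]], train, type = "class") == "1", 1, -1)
    eps  <- sum(train$w * (pred != ynum)) / sum(train$w)     # weighted error epsilon_t
    alpha[t] <- log(sqrt((1 - eps) / eps))                   # alpha_t = ln sqrt((1 - eps)/eps)
    train$w  <- train$w * exp(-ynum * pred * alpha[t])       # mistakes up-weighted, correct down-weighted
  }
  list(stumps = stumps, alpha = alpha)
}

# Final classifier: H(x) = sign( sum_t alpha_t f_t(x) )
adaboost_predict <- function(model, newdata) {
  votes <- sapply(seq_along(model$stumps), function(t)
    model$alpha[t] * ifelse(predict(model$stumps[[t]], newdata, type = "class") == "1", 1, -1))
  sign(rowSums(votes))
}
```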
Algorithm for AdaBoost (aggregation). We obtain a set of functions $f_1(x), \dots, f_t(x), \dots, f_T(x)$. How do we aggregate them? Uniform weights: $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} f_t(x)\big)$. Non-uniform weights: $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t f_t(x)\big)$ with $\alpha_t = \ln\sqrt{(1 - \varepsilon_t)/\varepsilon_t}$, so a smaller error $\varepsilon_t$ gives a larger weight $\alpha_t$ in the final vote: for example, $\varepsilon_t = 0.1$ gives $\alpha_t = 1.10$, while $\varepsilon_t = 0.4$ gives $\alpha_t = 0.20$. (Recall the weight update $u_{t+1}^n = u_t^n \times \exp(-\hat{y}^n f_t(x^n)\, \alpha_t)$.)
Toy Example (t = 1). T = 3, weak classifier = decision stump. [Figure: ten training points, each with initial weight 1.0; the first stump $f_1(x)$ misclassifies three of them.] $\varepsilon_1 = 0.30$, $d_1 = 1.53$, $\alpha_1 = 0.42$; the misclassified points are re-weighted to 1.53 and the correctly classified points to 0.65.
Toy Example (t = 2). On the re-weighted data the second stump $f_2(x)$ has $\varepsilon_2 = 0.21$, $d_2 = 1.94$, $\alpha_2 = 0.66$; the weights of the points it misclassifies are multiplied by 1.94 (e.g. 0.65 becomes 1.26) and the rest are divided by 1.94 (e.g. 1.53 becomes 0.78 and 0.65 becomes 0.33).
Toy Example (t = 3). On the re-weighted data the third stump $f_3(x)$ has $\varepsilon_3 = 0.13$, $d_3 = 2.59$, $\alpha_3 = 0.95$.
Toy Example (final classifier). $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t f_t(x)\big) = \mathrm{sign}\big(0.42\, f_1(x) + 0.66\, f_2(x) + 0.95\, f_3(x)\big)$, which classifies all of the training points correctly.
Warning of Math. Claim: with $H(x) = \mathrm{sign}\big(\sum_{t=1}^{T} \alpha_t f_t(x)\big)$ and $\alpha_t = \ln\sqrt{(1-\varepsilon_t)/\varepsilon_t}$, as we add more and more $f_t$ (as $T$ increases), $H(x)$ achieves a smaller and smaller error rate on the training data.
Error Rate of the Final Classifier. Let $g(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, so $H(x) = \mathrm{sign}\big(g(x)\big)$ and $\alpha_t = \ln\sqrt{(1-\varepsilon_t)/\varepsilon_t}$. The training data error rate is
$$\frac{1}{N} \sum_n \delta\big(H(x^n) \neq \hat{y}^n\big) = \frac{1}{N} \sum_n \delta\big(\hat{y}^n g(x^n) < 0\big) \le \frac{1}{N} \sum_n \exp\big(-\hat{y}^n g(x^n)\big),$$
since $\exp(-\hat{y}^n g(x^n))$ upper-bounds the 0/1 loss as a function of the margin $\hat{y}^n g(x^n)$.
What is $Z_{T+1}$? With $g(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, the training data error rate is at most $\frac{1}{N} \sum_n \exp(-\hat{y}^n g(x^n)) = \frac{1}{N} Z_{T+1}$, where $Z_t$ denotes the sum of the example weights used for training $f_t$. Indeed, since $u_1^n = 1$ and $u_{t+1}^n = u_t^n \exp(-\hat{y}^n f_t(x^n)\, \alpha_t)$, we have $u_{T+1}^n = \prod_{t=1}^{T} \exp(-\hat{y}^n f_t(x^n)\, \alpha_t)$, so
$$Z_{T+1} = \sum_n u_{T+1}^n = \sum_n \prod_{t=1}^{T} \exp\big(-\hat{y}^n f_t(x^n)\, \alpha_t\big) = \sum_n \exp\Big(-\hat{y}^n \sum_{t=1}^{T} \alpha_t f_t(x^n)\Big) = \sum_n \exp\big(-\hat{y}^n g(x^n)\big).$$
Bounding $Z_{T+1}$. Start from $Z_1 = N$ (equal weights). The weight update gives
$$Z_{t+1} = Z_t\, \varepsilon_t \exp(\alpha_t) + Z_t (1 - \varepsilon_t) \exp(-\alpha_t),$$
where the first term is the misclassified portion of $Z_t$ (weights multiplied by $\exp(\alpha_t)$) and the second is the correctly classified portion (weights multiplied by $\exp(-\alpha_t)$). Substituting $\alpha_t = \ln\sqrt{(1-\varepsilon_t)/\varepsilon_t}$,
$$Z_{t+1} = Z_t\, \varepsilon_t \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}} + Z_t (1 - \varepsilon_t) \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}} = Z_t \cdot 2\sqrt{\varepsilon_t (1 - \varepsilon_t)} .$$
Therefore $Z_{T+1} = N \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1 - \varepsilon_t)}$, and
$$\text{training data error rate} \le \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1 - \varepsilon_t)},$$
where each factor is $< 1$ (since $\varepsilon_t < 0.5$), so the bound gets smaller and smaller as $T$ grows.
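A quick numeric check of this bound using the $\varepsilon_t$ values from the toy example above (0.30, 0.21, 0.13):

```r
eps   <- c(0.30, 0.21, 0.13)           # weighted errors from the toy example
bound <- cumprod(2 * sqrt(eps * (1 - eps)))
round(bound, 3)                        # 0.917 0.747 0.502: the bound shrinks as T grows
```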
End of Warning. There are many good explanations of this material online.
Even though the training error is 0, why does the testing error still decrease? With $H(x) = \mathrm{sign}\big(g(x)\big)$ and $g(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, define the margin of an example as $\hat{y}\, g(x)$. Adding more weak classifiers continues to increase the margins of the training examples even after the training error has reached 0.
Large Margin? Training data error rate $= \frac{1}{N}\sum_n \delta\big(H(x^n) \neq \hat{y}^n\big) \le \frac{1}{N}\sum_n \exp\big(-\hat{y}^n g(x^n)\big) = \prod_{t=1}^{T} 2\sqrt{\varepsilon_t (1-\varepsilon_t)}$, which keeps getting smaller as $T$ increases. [Figure: the 0/1 loss, the AdaBoost exponential loss, the logistic regression loss, and the SVM hinge loss plotted as functions of the margin $\hat{y}^n g(x^n)$; the exponential upper bound continues to reward larger margins even after the 0/1 training error reaches 0.]