Bagging and Random Forests
IOM 530: Intro. to Statistical Learning, Chapter 08 (part 02). Disclaimer: this slide deck is modified from the IOM 530: Intro. to Statistical Learning course materials.
Outline: Bagging (Bootstrapping; Bagging for Regression Trees; Bagging for Classification Trees; Out-of-Bag Error Estimation; Variable Importance: Relative Influence Plots); Random Forests; Boosting
Bagging
Problem!
Decision trees discussed earlier suffer from high variance! If we randomly split the training data into two parts and fit a decision tree on each part, the two results can be quite different. We would like models with low variance. To address this problem, we can use bagging (bootstrap aggregating).
Chap5: Bootstrapping is simple!
Bootstrapping draws resamples of the observed dataset (each of the same size as the observed dataset), where each resample is obtained by random sampling with replacement from the original dataset.
Toy examples (R):

set.seed(100)
sample(1:3, 3, replace = TRUE)
# Repeat with other seeds, e.g. set.seed(15), set.seed(594), set.seed(500), set.seed(200),
# to see how different bootstrap samples of {1, 2, 3} can repeat or omit values.
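The same idea extends from {1, 2, 3} to an entire dataset: resample row indices with replacement. A minimal sketch (the data frame name train is a placeholder):

n <- nrow(train)
idx <- sample(n, size = n, replace = TRUE)   # bootstrap row indices
boot_train <- train[idx, ]                   # one bootstrapped training set, same size as the original
oob_rows <- setdiff(1:n, idx)                # rows never drawn ("out-of-bag"), usable as test data later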
What is bagging? Bagging is an extremely powerful idea based on two things: averaging (reduces variance!) and bootstrapping (plenty of training datasets!). Why does averaging reduce variance? Recall that given a set of n independent observations Z1, ..., Zn, each with variance σ², the variance of the mean of the observations is σ²/n.
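A quick numerical illustration of this fact, as a sketch (not from the original slides): averaging n = 25 independent draws shrinks the variance by a factor of about 25.

set.seed(1)
z <- matrix(rnorm(10000 * 25, sd = 2), ncol = 25)  # 10,000 samples, each a set of n = 25 draws with variance 4
var(z[, 1])        # variance of a single observation: about 4
var(rowMeans(z))   # variance of the mean of 25 observations: about 4 / 25 = 0.16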
How does bagging work? Generate B different bootstrapped training datasets by repeatedly sampling (with replacement) from the single training dataset. Train the statistical learning method on each of the B bootstrapped training datasets and obtain B predictions. For prediction: Regression: average the B predictions from the B trees. Classification: take a majority vote among the B trees. (A code sketch follows.)
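A minimal hand-rolled bagging sketch for regression trees using the rpart package (train, test, and the response column y are placeholder names):

library(rpart)

B <- 100
preds <- matrix(NA, nrow = nrow(test), ncol = B)
for (b in 1:B) {
  idx <- sample(nrow(train), replace = TRUE)        # bootstrap sample b
  fit <- rpart(y ~ ., data = train[idx, ],
               control = rpart.control(cp = 0))     # grow a deep, unpruned tree
  preds[, b] <- predict(fit, newdata = test)
}
bagged_pred <- rowMeans(preds)                      # regression: average the B predictions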
Bagging for Regression Trees
Construct B regression trees using B bootstrapped training datasets and average the resulting predictions. Note: these trees are grown deep and are not pruned, so each individual tree has high variance but low bias. Averaging the trees reduces the variance, so the bagged ensemble ends up with low bias and reduced variance.
Bagging for Classification Trees
Construct B classification trees using B bootstrapped training datasets. For prediction, there are two approaches: (1) record the class predicted by each bootstrapped tree and take the most commonly occurring class as the overall prediction (majority vote); (2) if the classifier produces probability estimates, average the probabilities and predict the class with the highest average probability. Both methods work well. (A code sketch follows.)
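Both prediction approaches in one hand-rolled sketch, again with rpart (train, test, and a two-level factor response y are placeholders):

library(rpart)

B <- 100
prob_sum <- rep(0, nrow(test))                  # running sum of P(second class)
votes <- matrix(NA_character_, nrow(test), B)   # class predicted by each tree
for (b in 1:B) {
  idx <- sample(nrow(train), replace = TRUE)
  fit <- rpart(y ~ ., data = train[idx, ], method = "class",
               control = rpart.control(cp = 0))
  prob_sum <- prob_sum + predict(fit, newdata = test, type = "prob")[, 2]
  votes[, b] <- as.character(predict(fit, newdata = test, type = "class"))
}
majority_vote <- apply(votes, 1, function(v) names(which.max(table(v))))   # approach 1: majority vote
avg_prob_pred <- ifelse(prob_sum / B > 0.5,
                        levels(train$y)[2], levels(train$y)[1])            # approach 2: average the probabilities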
A Comparison of Error Rates
In this figure, the green line represents a simple majority-vote approach and the purple line corresponds to averaging the probability estimates. Both do far better than a single tree (dashed red) and get close to the Bayes error rate (dashed grey).
Example 1: Housing Data. The red line represents the test mean squared error using a single tree; the black line corresponds to the bagging test error.
Example 2: Car Seat Data. The red line represents the test error rate using a single tree. The black line corresponds to the bagging error rate using majority vote, while the blue line averages the probabilities.
Out-of-Bag Error Estimation
Since bootstrapping selects observations at random (with replacement) to build each training dataset, the observations not selected for a given tree can serve as test data for that tree. On average, each bagged tree uses around 2/3 of the observations, so roughly 1/3 of the observations are left "out of bag" and can be used to estimate the test error.
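A quick check of the 2/3 figure, as a sketch (not from the slides): the chance that a given observation appears in a bootstrap sample of size n is 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 0.632, so roughly 1/3 of the observations are out-of-bag for each tree.

n <- 1000
1 - (1 - 1/n)^n                                 # about 0.632
set.seed(1)
mean(replicate(2000, length(unique(sample(n, replace = TRUE))) / n))   # simulated fraction of distinct rows per bootstrap sample, also about 0.632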
Variable Importance Measure
Bagging typically improves prediction accuracy over a single tree, but the resulting model is harder to interpret: with hundreds of trees, it is no longer clear which variables matter most to the procedure. Bagging thus improves prediction accuracy at the expense of interpretability. However, we can still get an overall summary of the importance of each predictor using relative influence plots.
Relative Influence Plots
How do we decide which variables are most useful in predicting the response? We can compute relative influence plots, which give a score for each variable. For regression, the score reflects the total decrease in MSE due to splits on that variable. A score close to zero indicates the variable is not important and could be dropped; the larger the score, the more influence the variable has.
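In R, the randomForest package reports a comparable per-variable importance summary; a minimal sketch (train and the response y are placeholders):

library(randomForest)

bag_fit <- randomForest(y ~ ., data = train,
                        mtry = ncol(train) - 1,   # consider all predictors at each split, i.e. bagging
                        importance = TRUE)
importance(bag_fit)   # importance score for each predictor
varImpPlot(bag_fit)   # plot the scores, analogous to a relative influence plot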
Example: Housing Data. Median Income is by far the most important variable; Longitude, Latitude, and Average Occupancy are the next most important.
Random Forests
Random forests are a very efficient statistical learning method. They build on the idea of bagging, but provide an improvement because they de-correlate the trees.
Decision tree: it is easy to achieve a 0% error rate on the training data, e.g. if each training example gets its own leaf.
Random forest: bagging of decision trees, where resampling the training data alone is not enough, so we also randomly restrict the features/questions considered at each split.
How does it work? Build a number of decision trees on bootstrapped training samples, but each time a split in a tree is considered, choose a random sample of m predictors as split candidates from the full set of p predictors (usually m ≈ √p).
Why consider a random sample of m predictors instead of all p predictors at each split? Suppose the data set contains one very strong predictor along with a number of moderately strong predictors. Then in the collection of bagged trees, most or all of them will use the very strong predictor for the first split, so all the bagged trees will look similar and their predictions will be highly correlated. Averaging many highly correlated quantities does not give a large variance reduction; by forcing each split to consider only a random subset of predictors, random forests "de-correlate" the bagged trees, leading to a greater reduction in variance.
Random Forest with different values of “m”
Notice that when a random forest is built using m = p, this amounts simply to bagging. (A sketch comparing different values of m follows.)
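A minimal sketch comparing out-of-bag error for different values of m with the randomForest package (train and y are placeholder names; mtry is the package's argument for m):

library(randomForest)

p <- ncol(train) - 1                              # number of predictors
for (m in c(floor(sqrt(p)), floor(p / 2), p)) {   # m = p is plain bagging
  rf <- randomForest(y ~ ., data = train, mtry = m, ntree = 500)
  cat("m =", m, " OOB MSE =", tail(rf$mse, 1), "\n")   # regression; for classification inspect rf$err.rate instead
}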
Boosting
Boosting Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Consider boosting for decision trees. Recall that bagging involves creating multiple copies of the original training data set using the bootstrap, fitting a separate decision tree to each copy, and then combining all of the trees in order to create a single predictive model. Notably, each tree is built on a bootstrap data set, independent of the other trees. Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees.
Boosting algorithm for regression trees
See the referenced PPT for more details on the boosting algorithm for regression trees.
What is the idea behind this procedure?
The idea is that, unlike fitting a single large tree (which fits the data hard and can overfit), boosting learns slowly: each new small tree is fit to the residuals of the current model, so the fit improves gradually in the areas where it does not yet perform well.
Boosting The R package gbm (gradient boosted models) handles a variety of regression and classification problems.
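A minimal gbm sketch for a regression problem (train, test, and y are placeholder names; the tuning values are only illustrative):

library(gbm)

boost_fit <- gbm(y ~ ., data = train,
                 distribution = "gaussian",   # squared-error loss for regression
                 n.trees = 5000,              # number of trees B
                 interaction.depth = 2,       # depth of each small tree
                 shrinkage = 0.01)            # learning rate
summary(boost_fit)                            # relative influence of each predictor
pred <- predict(boost_fit, newdata = test, n.trees = 5000)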
R Lab in Chap 8
Deviance (page 206): deviance is a measure that plays the role of RSS for a broader class of models. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit.
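A small illustration, as a sketch with simulated data (not from the slides): for logistic regression with a binary 0/1 response, the residual deviance reported by glm equals minus two times the maximized log-likelihood.

set.seed(2)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial)
deviance(fit)                  # residual deviance
-2 * as.numeric(logLik(fit))   # the same value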
Improving Weak Classifiers
Ensemble: Boosting (Improving Weak Classifiers). Disclaimer: all slides below are from:
Boosting
Training data: (x^1, y^1), ..., (x^n, y^n), ..., (x^N, y^N), with y = ±1 (binary classification).
Guarantee: if your ML algorithm can produce a classifier with error rate smaller than 50% on the training data, you can obtain a classifier with 0% training error after boosting.
Framework of boosting: obtain a first classifier f_1(x), then find another function f_2(x) to help f_1(x). However, if f_2(x) is similar to f_1(x), it will not help much, so we want f_2(x) to be complementary to f_1(x) (how?). Obtain the second classifier f_2(x), and so on; finally, combine all the classifiers. The classifiers are learned sequentially.
We use an ML algorithm to obtain a function f_1(x); if f_1(x) is weak, i.e. it performs poorly on the training data, what can we do?
How to obtain different classifiers?
Train on different training data sets. How to get different training data sets: re-sample your training data to form a new set, or re-weight your training data to form a new set. In a real implementation, you only have to change the cost/objective function: the unweighted loss L(f) = Σ_n l(f(x^n), y^n) becomes the weighted loss L(f) = Σ_n u^n · l(f(x^n), y^n). For example, examples (x^1, y^1), (x^2, y^2), (x^3, y^3) that all start with weights u^1 = u^2 = u^3 = 1 might be re-weighted to u^1 = 0.4, u^2 = 2.1, u^3 = 0.7.
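A tiny sketch of the weighted objective (names are placeholders): re-weighting only changes how much each example contributes.

weighted_error <- function(pred, y, u) {
  sum(u * (pred != y)) / sum(u)   # weighted 0-1 loss; with u = rep(1, N) this is the usual error rate
}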
Idea of AdaBoost
Idea: train f_2(x) on a new training set on which f_1(x) fails. How do we find a training set that fails f_1(x)? Let ε_1 be the weighted error rate of f_1(x) on its training data:
ε_1 = (Σ_n u_1^n · δ(f_1(x^n) ≠ y^n)) / Z_1, where Z_1 = Σ_n u_1^n and ε_1 < 0.5.
Change the example weights from u_1^n to u_2^n such that
(Σ_n u_2^n · δ(f_1(x^n) ≠ y^n)) / Z_2 = 0.5,
i.e. under the new weights the performance of f_1 is no better than random. Then train f_2(x) based on the new weights u_2^n.
Re-weighting Training Data
A four-example illustration of finding a training set that fails f_1(x): start with (x^1, y^1), (x^2, y^2), (x^3, y^3), (x^4, y^4), each with weight u^n = 1, and suppose f_1 misclassifies only x^2, so ε_1 = 0.25. Re-weight with d_1 = √3: the misclassified example gets weight √3 and each correctly classified example gets weight 1/√3. Under the new weights the error of f_1 is √3 / (√3 + 3/√3) = 0.5, and f_2 trained on the new weights achieves ε_2 < 0.5.
Re-weighting Training Data
The general re-weighting rule: if x^n is misclassified by f_1 (f_1(x^n) ≠ y^n), set u_2^n ← u_1^n · d_1 (increase its weight); if x^n is correctly classified by f_1 (f_1(x^n) = y^n), set u_2^n ← u_1^n / d_1 (decrease its weight). The intuitive reason: f_2 will then be learned with example weights u_2^n that emphasize the examples f_1 gets wrong. What is the value of d_1?
Re-weighting Training Data
We require the re-weighted error of f_1 to be exactly 0.5:
(Σ_{f_1(x^n) ≠ y^n} u_2^n) / Z_2 = 0.5, where Z_2 = Σ_n u_2^n = Σ_{f_1(x^n) ≠ y^n} u_2^n + Σ_{f_1(x^n) = y^n} u_2^n.
Substituting the update rule (u_2^n = u_1^n · d_1 for misclassified examples, u_2^n = u_1^n / d_1 for correctly classified ones) gives
Z_2 = Σ_{f_1(x^n) ≠ y^n} u_1^n · d_1 + Σ_{f_1(x^n) = y^n} u_1^n / d_1,
and the condition above is equivalent to Z_2 / (Σ_{f_1(x^n) ≠ y^n} u_1^n · d_1) = 2, i.e.
Σ_{f_1(x^n) = y^n} u_1^n / d_1 = Σ_{f_1(x^n) ≠ y^n} u_1^n · d_1.
Re-weighting Training Data
Solving for d_1: since ε_1 = (Σ_{f_1(x^n) ≠ y^n} u_1^n) / Z_1, we have Σ_{f_1(x^n) ≠ y^n} u_1^n = Z_1 ε_1 and Σ_{f_1(x^n) = y^n} u_1^n = Z_1 (1 − ε_1). The condition Σ_{f_1(x^n) = y^n} u_1^n / d_1 = Σ_{f_1(x^n) ≠ y^n} u_1^n · d_1 becomes Z_1 (1 − ε_1) / d_1 = Z_1 ε_1 d_1, so
d_1 = √((1 − ε_1) / ε_1) > 1 (since ε_1 < 0.5).
Algorithm for AdaBoost
Given training data {(x^1, y^1, u_1^1), ..., (x^n, y^n, u_1^n), ..., (x^N, y^N, u_1^N)}, with y = ±1 (binary classification) and equal initial weights u_1^n = 1.
For t = 1, ..., T:
Train a weak classifier f_t(x) with weights u_t^1, ..., u_t^N, and let ε_t be its weighted error rate under those weights.
For n = 1, ..., N:
If x^n is misclassified by f_t(x), i.e. y^n ≠ f_t(x^n): u_{t+1}^n = u_t^n × d_t = u_t^n × exp(α_t).
Else: u_{t+1}^n = u_t^n / d_t = u_t^n × exp(−α_t).
Here d_t = √((1 − ε_t) / ε_t) and α_t = ln d_t = ln √((1 − ε_t) / ε_t).
Because the outputs are ±1, both cases collapse into a single update: u_{t+1}^n ← u_t^n × exp(−y^n f_t(x^n) α_t). (An R sketch follows.)
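A minimal R sketch of this algorithm with decision stumps as the weak classifier (all function and variable names are my own; there is no guard for a stump reaching zero weighted error):

adaboost_stumps <- function(X, y, rounds = 10) {
  # X: numeric matrix of predictors; y: labels in {-1, +1}
  N <- nrow(X)
  u <- rep(1, N)                                   # u_1^n = 1 (equal weights)
  stumps <- vector("list", rounds)
  alpha <- numeric(rounds)
  for (t in 1:rounds) {
    w <- u / sum(u)
    best <- list(err = Inf)
    for (j in 1:ncol(X)) {                         # exhaustive search over stumps (feature, threshold, direction)
      for (s in unique(X[, j])) {
        for (dir in c(1, -1)) {
          pred <- ifelse(X[, j] > s, dir, -dir)
          err <- sum(w * (pred != y))              # weighted error rate epsilon_t
          if (err < best$err) best <- list(j = j, s = s, dir = dir, err = err)
        }
      }
    }
    alpha[t] <- log(sqrt((1 - best$err) / best$err))   # alpha_t = ln sqrt((1 - eps_t) / eps_t)
    stumps[[t]] <- best
    pred <- ifelse(X[, best$j] > best$s, best$dir, -best$dir)
    u <- u * exp(-y * pred * alpha[t])             # u_{t+1}^n = u_t^n * exp(-y^n f_t(x^n) alpha_t)
  }
  list(stumps = stumps, alpha = alpha)
}

adaboost_predict <- function(model, X) {
  g <- rep(0, nrow(X))
  for (t in seq_along(model$alpha)) {
    st <- model$stumps[[t]]
    g <- g + model$alpha[t] * ifelse(X[, st$j] > st$s, st$dir, -st$dir)
  }
  sign(g)                                          # H(x) = sign(sum_t alpha_t f_t(x))
}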
Algorithm for AdaBoost
We obtain a set of functions f_1(x), ..., f_t(x), ..., f_T(x). How do we aggregate them?
Uniform weights: H(x) = sign(Σ_{t=1}^T f_t(x)).
Non-uniform weights: H(x) = sign(Σ_{t=1}^T α_t f_t(x)), with α_t = ln √((1 − ε_t)/ε_t), the same α_t used in the weight update u_{t+1}^n = u_t^n × exp(−y^n f_t(x^n) α_t).
A smaller error ε_t gives a larger weight in the final vote: for example, ε_t = 0.1 gives α_t = 1.10, while ε_t = 0.4 gives α_t = 0.20.
Toy Example (T = 3, weak classifier = decision stump), round t = 1. [Figure: the training points (plus and minus classes) and the first stump f_1(x); each point starts with weight 1.0, then misclassified points are up-weighted to 1.53 and correctly classified points down-weighted to 0.65.] ε_1 = 0.30, d_1 = 1.53, α_1 = 0.42.
Toy Example, round t = 2. [Figure: the re-weighted points and the second stump f_2(x), with α_1 = 0.42 carried over from round 1.] ε_2 = 0.21, d_2 = 1.94, α_2 = 0.66.
Toy Example, round t = 3. [Figure: the re-weighted points and the third stump f_3(x), with α_1 = 0.42 and α_2 = 0.66 carried over.] ε_3 = 0.13, d_3 = 2.59, α_3 = 0.95.
Toy Example, final classifier: H(x) = sign(Σ_{t=1}^T α_t f_t(x)) = sign(0.42 f_1(x) + 0.66 f_2(x) + 0.95 f_3(x)). [Figure: the decision regions of the combined classifier on the toy data.]
Warning of Math
Claim: with H(x) = sign(Σ_{t=1}^T α_t f_t(x)) and α_t = ln √((1 − ε_t)/ε_t), as we add more and more f_t (T increases), H(x) achieves a smaller and smaller error rate on the training data.
Error Rate of Final Classifier
Let g(x) = Σ_{t=1}^T α_t f_t(x), so H(x) = sign(g(x)). Then the training data error rate
= (1/N) Σ_n δ(H(x^n) ≠ y^n)
= (1/N) Σ_n δ(y^n g(x^n) < 0)
≤ (1/N) Σ_n exp(−y^n g(x^n)),
since the 0/1 loss δ(y^n g(x^n) < 0) is upper-bounded by the exponential loss exp(−y^n g(x^n)) as a function of the margin y^n g(x^n).
What is Z_{T+1}? (Here Z_t denotes the sum of the example weights used when training f_t, and α_t = ln √((1 − ε_t)/ε_t).)
Since u_1^n = 1 and u_{t+1}^n = u_t^n × exp(−y^n f_t(x^n) α_t), we have
u_{T+1}^n = Π_{t=1}^T exp(−y^n f_t(x^n) α_t),
so Z_{T+1} = Σ_n u_{T+1}^n = Σ_n Π_{t=1}^T exp(−y^n f_t(x^n) α_t) = Σ_n exp(−y^n Σ_{t=1}^T α_t f_t(x^n)) = Σ_n exp(−y^n g(x^n)).
Therefore the training error bound above equals (1/N) Z_{T+1}.
With Z_1 = N (equal weights), the recursion for Z_t is
Z_t = Z_{t−1} ε_t exp(α_t) + Z_{t−1} (1 − ε_t) exp(−α_t),
where the first term is the misclassified portion of Z_{t−1} (weights multiplied by exp(α_t)) and the second is the correctly classified portion (weights multiplied by exp(−α_t)). Plugging in exp(α_t) = √((1 − ε_t)/ε_t) gives
Z_t = Z_{t−1} ε_t √((1 − ε_t)/ε_t) + Z_{t−1} (1 − ε_t) √(ε_t/(1 − ε_t)) = Z_{t−1} × 2√(ε_t (1 − ε_t)).
Therefore Z_{T+1} = N Π_{t=1}^T 2√(ε_t (1 − ε_t)), and
Training data error rate ≤ (1/N) Z_{T+1} = Π_{t=1}^T 2√(ε_t (1 − ε_t)),
where each factor 2√(ε_t (1 − ε_t)) < 1 (since ε_t < 0.5), so the bound gets smaller and smaller as T grows.
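A quick numeric check of this bound using the toy example's error rates ε_1 = 0.30, ε_2 = 0.21, ε_3 = 0.13 (a sketch, not from the slides):

eps <- c(0.30, 0.21, 0.13)
prod(2 * sqrt(eps * (1 - eps)))   # about 0.50, so after T = 3 rounds the training error is at most ~50%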
End of Warning. There are many good explanations of this result!
Even after the training error reaches 0, the testing error can still decrease. Why?
With g(x) = Σ_{t=1}^T α_t f_t(x) and H(x) = sign(g(x)), define the margin of a training example as y · g(x).
Large Margin?
Training data error rate = (1/N) Σ_n δ(H(x^n) ≠ y^n), which AdaBoost upper-bounds by the exponential loss (1/N) Σ_n exp(−y^n g(x^n)) = Π_{t=1}^T 2√(ε_t (1 − ε_t)), a bound that keeps shrinking as T increases. [Figure: loss as a function of the margin y^n g(x^n), comparing the 0/1 error with the exponential loss minimized by AdaBoost and the losses used by logistic regression and SVM; minimizing these surrogate losses keeps pushing the margins larger even after the 0/1 training error reaches 0.]