1 Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 8: Combining Methods and Ensemble Learning
2 OUTLINE
- Objectives: explain the motivation for combining methods; representative methods; examples and applications
- Motivation for combining methods
- Committee of networks
- Bagging
- Boosting
- Summary and discussion
3 Motivation for Combining Methods
General setting (used in this course):
- given training data set
- flexible model parameterization ~ learning method
- empirical loss function
- optimization method
Select the best model via a single application of a learning method to the data:
Learning Method + Data -> Predictive Model
4 Motivation (cont'd)
Learning Method + Data -> Predictive Model
Theoretical and empirical evidence: no single 'best' method exists. It is always possible to find:
- the best method for a given data set
- the best data set for a given method
Philosophical connections (e.g. Epicurus, Eastern philosophy, Bayesian averaging): combine several theories (models) explaining the data.
5 Collective Decision Making
Commonly used in our daily lives:
- Politburo
- Jury trial
- Multiple expert opinions (in medicine, law)
6 Collective Decision Making
When it works:
- experts are indeed intelligent
- experts' opinions are different (not correlated)
- their decisions are combined intelligently
When it does not work:
- the majority are dumb
- experts give similar opinions (highly correlated)
- their decisions are not combined intelligently
Similar conditions hold for combining in predictive learning. Note the difference between predictive modeling and social systems.
7 Strategies for Combining Methods
Standard inductive learning setting. Two combining strategies (for improved generalization):
1. Apply different learning methods to the same data: Committee of Networks, Stacking, Bayesian averaging
2. Apply the same method to different (modified) realizations of the training data: Bagging, Boosting
Combining methods are used for both classification and regression; the result is not a single analytic model.
8 OUTLINE
- Objectives
- Motivation for combining methods
- Committee of networks
- Bagging
- Boosting
- Summary and discussion
9 Combining Strategy 1
Apply N different methods (parameterizations) to the same data, obtaining N distinct models. Then form a (linear) combination of the N models (one common way to write this is shown below).
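The slide's combining formula is not reproduced in this transcript; a standard way to write such a linear combination (notation assumed here) is
\[
f_{\mathrm{comb}}(\mathbf{x}) \;=\; \sum_{j=1}^{N} c_j\, f_j(\mathbf{x}),
\]
where f_1, ..., f_N are the N component models and c_1, ..., c_N are the combining coefficients.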
10 Combining Strategy 1 (cont'd)
Design issues:
- What parameterizations (methods) to use? As different as possible.
- How many component models?
- How to combine component models? Via empirical risk minimization (neural network strategy) or Bayesian averaging (statistical strategy).
11 Committee of networks approach
- Given training data
- Estimate N candidate (regression) models using different methods
- Construct the combined model as a weighted combination of the component models, where the coefficients are estimated via minimization of MSE under constraints (a code sketch follows below).
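A minimal sketch of this construction, assuming the component models are already fitted and the constraints are the usual ones (non-negative coefficients summing to one; the slide's exact formula is not shown in this transcript):

```python
import numpy as np
from scipy.optimize import minimize

def committee_weights(component_preds, y):
    """Estimate combining coefficients c_j by minimizing the MSE of the
    weighted combination, subject to c_j >= 0 and sum(c_j) = 1.
    component_preds has shape (n_samples, N_models): predictions of each
    already-fitted component model on the same data."""
    n_models = component_preds.shape[1]
    c0 = np.full(n_models, 1.0 / n_models)          # start from equal weights

    def mse(c):
        return np.mean((component_preds @ c - y) ** 2)

    return minimize(
        mse, c0, method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,
        constraints=[{"type": "eq", "fun": lambda c: np.sum(c) - 1.0}],
    ).x

def committee_predict(models, weights, X):
    """Combined (committee) prediction: weighted sum of component predictions."""
    preds = np.column_stack([m.predict(X) for m in models])
    return preds @ weights
```

Here `models` can be any list of fitted regression estimators exposing a `predict` method (for example, the polynomial and trigonometric models of the following slides); the function names are illustrative.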
12 Example of Committee Approach
Regression data set with x-values uniform in [0,1] and a fixed noise variance (target function shown in the slide figure).
Regression methods used: (a) polynomial, (b) trigonometric, (c) combined (Committee of Networks).
Model selection: VC model selection for (a) and (b); empirical risk minimization for (c).
13 Comparison for 50 training samples
Figure: red ~ target function; green ~ polynomial model (test MSE = 0.254); blue ~ trigonometric model (test MSE = 0.299); black ~ combined model (test MSE = 0.249).
14 Committee Approach for Classification
Noisy Hyperbolas data set (50 training samples). Apply an RBF SVM and a polynomial SVM, then construct the combined model.
15 Committee Approach for Classification
Comparison results for the three methods (two component SVMs and the combined model):

                 RBF SVM   POLY SVM   Combined
Training error   1%        5%         1%
Test error       0.25%     4.5%       1.2%

Note: RBF SVM is much better than POLY SVM (for this data set), so combining does not improve the test error.
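A small sketch of how such a combined classifier can be set up with scikit-learn (the kernel settings and the averaging rule below are assumptions; the slides do not state them):

```python
import numpy as np
from sklearn.svm import SVC

def fit_svm_committee(X_train, y_train):
    """Fit an RBF SVM and a polynomial SVM on the same data (labels -1/+1)."""
    rbf = SVC(kernel="rbf", gamma="scale", C=1.0).fit(X_train, y_train)
    poly = SVC(kernel="poly", degree=3, C=1.0).fit(X_train, y_train)
    return rbf, poly

def committee_decision(rbf, poly, X):
    # One simple combining rule: average the two decision functions
    # and take the sign as the predicted class label.
    scores = 0.5 * (rbf.decision_function(X) + poly.decision_function(X))
    return np.sign(scores)
```

A committee-of-networks weighting (previous sketch) could also be used in place of the plain average.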
16 Combining for Improved Model Selection
Consider a flexible method with several tuning parameters, e.g. an RBF SVM classifier with parameters (C, gamma). Optimal (C, gamma) values are those that provide minimum validation error. When there are several near-optimal (C, gamma) values, it may be better to combine their corresponding models than to select a single best model. Model combination can be done via the Committee of Networks approach and may result in more robust model selection (see Example 8.3 in the textbook).
17 When the Committee approach works
- Component models are optimally estimated from the training data (~ optimally tuned complexity)
- Component models yield similar training/test error rates ~ no single method is significantly better than the others
- Component models make different errors on the training set; this is achieved by:
  - different model parameterizations
  - different initialization (for MLP networks)
18 OUTLINE
- Objectives
- Motivation for combining methods
- Committee of networks
- Bagging: motivation (unstable methods/estimators); combining strategy (bagging); bootstrapping; examples
- Boosting
- Summary and discussion
19 Stable methods (estimators)
Application of a learning method to training data (n samples) results in a sequence of models obtained via SRM, where m is the complexity parameter. Note that this sequence depends on the particular random realization of the training data.
Unstable estimator: a small change in the training data leads to large changes in the sequence of models.
Model selection is difficult for unstable estimators.
20 Model selection
Consider two realizations of the training data. Figure: model selection behavior for an unstable estimator vs. a stable estimator.
21 How to improve unstable methods?
- Unstable methods: MARS, CART, subset selection (all based on greedy optimization)
- Stable methods: k-nearest neighbors, ridge regression, MLP, SOM
An unstable method could be stabilized if we had many (N) random realizations of the finite training sample: applying the unstable method to each data set yields an unstable model, but the combined model (obtained by averaging) is stable.
22 Combining Strategy: Bagging
How can we obtain many random copies of the training data? Given a single training sample, 'fake' many copies via bootstrapping.
23 Bootstrap Aggregation (= Bagging)
Bootstrap Aggregation is a procedure for stabilizing unstable estimators by using many bootstrapped realizations of a finite training sample.
Bootstrapping: let Z denote the training set of size n. Then a bootstrap set Z* is an i.i.d. sample of size n selected from Z with replacement.
24 Example of Bootstrapping
Consider a training data set Z (classification) with sample IDs 1 through 10. Two bootstrap sets of size 10, drawn with replacement:

Sample ID:        1  2  3  4  5  6  7  8  9 10
Bootstrap set 1:  2 10  8  6  2  9  2  5  3  4
Bootstrap set 2:  4  8  7  7  1  6  8  7  1  3
25 Example of Bootstrapping (cont'd)
Figure: a bootstrap set (shown in the slide graphic).
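For reference, bootstrap sets like the ones above can be drawn with a single NumPy call (the random seed and the particular draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sample_ids = np.arange(1, 11)        # original training set: sample IDs 1..10
boot_1 = rng.choice(sample_ids, size=10, replace=True)   # sample WITH replacement
boot_2 = rng.choice(sample_ids, size=10, replace=True)
print(boot_1, boot_2)                # some IDs repeat, others are left out, as in the slide
```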
26 Bagging procedure
- Given a training set Z of size n
- Form N bootstrapped replicates of size n
- Apply a learning method to each replicate, yielding a component model
- Form the bagged predictor by averaging (regression) or majority vote (classification)
A minimal code sketch is given below.
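A minimal sketch of the bagging procedure, assuming scikit-learn-style base estimators (the base method, the number of replicates, and the -1/+1 label coding are placeholders):

```python
import numpy as np
from sklearn.base import clone

def bagging_fit(base_estimator, X, y, n_replicates, seed=0):
    """Fit one copy of the base method on each bootstrap replicate."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(n_replicates):
        idx = rng.choice(n, size=n, replace=True)    # bootstrap replicate of size n
        models.append(clone(base_estimator).fit(X[idx], y[idx]))
    return models

def bagging_predict_regression(models, X):
    # Regression: average the component predictions.
    return np.mean([m.predict(X) for m in models], axis=0)

def bagging_predict_classification(models, X):
    # Classification with labels -1/+1: majority vote via the sign of the average.
    return np.sign(np.mean([m.predict(X) for m in models], axis=0))
```

With `base_estimator = DecisionTreeClassifier()`, this mirrors the CART-based experiments on the following slides.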
27 Example 1: Noisy Hyperbolas
Noisy Hyperbolas data set, 100 training samples. Apply CART (with Splitmin=2) to each bootstrap set.
28 Example 1: Noisy Hyperbolas
Form the final combined model by majority voting.

Number of bootstrap replicates:  1       5       10      20      40
Test error:                      7.75%   2.70%   2.50%   1.80%   2.65%

- Bagging improves generalization.
- It is not clear how to choose the optimal number of replicates, i.e. how to control model complexity.
- Note: RBF SVM gives ~0.25% test error for this data set.
29 Example 2: Digit Recognition
Handwritten digits 5 vs. 8 data: 1,000 training samples. Apply CART (Splitmin=2) to each bootstrap set and form the final bagged model by majority voting. Each bootstrap model is likely to overfit.

Number of bootstrap replicates: 1, 5, 10, 20, 40
Test error: 9.11%, 5.47%, 4.66%, 4.39%

Note: RBF SVM test error is under 2% for this data set.
30 Example 3: Univariate Regression
Sine-squared target function; 25 training / 1,000 test samples. An MLP with 200 hidden units is used as the base method. Two bootstrap samples and their MLP estimates are shown in the slide figure; each bootstrap model overfits.

Number of bootstrap replicates:  1       5       10      20      40
MSE test error:                  0.125   0.079   0.093   0.078   0.082
31 OUTLINE
- Objectives
- Motivation for combining methods
- Committee of networks
- Bagging
- Boosting: motivation; AdaBoost algorithm; examples
- Summary and discussion
32 Motivation for Boosting
Often there exist simple 'rule-of-thumb' rules for classification. Example rules for e-mail spam classification:
- if a message contains the word 'viagra', then spam
- if a message is from my wife, then not spam
Each rule by itself is not accurate, but their intelligent combination may be very accurate. Issues:
- how to generate such simple rules (from data)?
- how to combine these rules into a good classifier?
Boosting provides a solution to these questions.
33 Base Learner (base classifier)
Simple 'rules of thumb' are estimated sequentially by applying a base classifier. The base classifier can be any learning method; in practice, simple methods are used:
- a small-size tree
- a decision stump ~ one input variable + a split point, partitioning the input space into two halves so that the (weighted) training error is minimized (see the sketch below)
The base learner is applied many times to modified training data, and the weights of the component models are adjusted.
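A minimal from-scratch sketch of such a weighted decision stump (variable and function names are illustrative); it searches over one input variable and one split point to minimize the weighted training error:

```python
import numpy as np

def fit_stump(X, y, w):
    """Decision stump for labels y in {-1,+1} with sample weights w:
    pick the (variable, threshold, sign) minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):                  # candidate input variable
        for thr in np.unique(X[:, j]):           # candidate split point
            for sign in (+1, -1):                # which half is predicted as +1
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w[pred != y])       # weighted training error
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best                                  # (error, variable, threshold, sign)

def stump_predict(stump, X):
    _, j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)
```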
34 Boosting strategy
Apply the base method to many (re-weighted) realizations of the training data (illustrated in the slide figure).
35 AdaBoost algorithm (Freund and Schapire, 1996)
Given training data (x_i, y_i), i = 1, ..., n, for binary classification with y_i in {-1, +1}.
Initialize the sample weights: w_i = 1/n.
Repeat for k = 1, ..., N:
1. Apply the base method to the training samples with weights w_i, producing the component model f_k(x).
2. Calculate the weighted error of this classifier, eps_k = sum_i w_i * I(f_k(x_i) != y_i), and its weight alpha_k = (1/2) * ln((1 - eps_k)/eps_k).
3. Update the data weights: w_i <- w_i * exp(-alpha_k * y_i * f_k(x_i)), then normalize them to sum to one.
Combine the classifiers via weighted majority voting: F(x) = sign(sum_k alpha_k * f_k(x)).
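A compact code sketch of this loop, using a depth-1 scikit-learn tree as the base classifier (the fixed iteration count and the numerical safeguard are assumptions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_iter):
    """AdaBoost for labels y in {-1,+1}: returns component models and their weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                              # initialize sample weights
    models, alphas = [], []
    for _ in range(n_iter):
        stump = DecisionTreeClassifier(max_depth=1)      # base method (decision stump)
        stump.fit(X, y, sample_weight=w)                 # step 1: fit on weighted data
        pred = stump.predict(X)
        eps = np.sum(w * (pred != y))                    # step 2: weighted error ...
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # ... and classifier weight
        w = w * np.exp(-alpha * y * pred)                # step 3: re-weight the samples
        w = w / np.sum(w)
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    # Weighted majority vote of the component classifiers.
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```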
36 Example of AdaBoost algorithm
Original training data: 10 samples (shown in the slide figure).
37 First iteration of AdaBoost
Figure: the first (weak) classifier and the resulting sample weight changes.
38 Second iteration of AdaBoost
Figure: the second (weak) classifier and the resulting sample weight changes.
39 Third iteration of AdaBoost
Figure: the third (weak) classifier.
40 Combine base classifiers
41 Example of AdaBoost classification
75 training samples: a mixture of three Gaussians, centered at (-2,0) and (2,0) for class +1 and at (0,0) for class -1.
42 Example (cont'd)
Base classifier: decision stump. The first 10 component classifiers are shown in the slide figure.
43 Example (cont'd)
Generalization performance (N = 100 iterations), with ~75 training and ~600 test samples. The test error does not suggest overfitting even for large N.
44 Theoretical Properties of AdaBoost
- Why can AdaBoost generalize well?
- Why is there no overfitting (even for large N)?
- What factors control its generalization: selection of the base classifier? the number of iterations N?
- How does it compare to other methods (SVM)?
There is no general consensus on these issues.
45 Two Main Aspects of AdaBoost
- AdaBoost yields a large separation margin (similar to SVM classifiers).
- Fast reduction of the training error: at each iteration k the weighted error eps_k is bounded, and the final AdaBoost model's training error is bounded by a product over the iterations (the standard bound is written out below).
Assumption: the weak learner consistently reduces the error.
(1) The number of iterations N is a tuning parameter.
(2) Potential overfitting for noisy data (non-separable case).
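The bound formulas are not reproduced in this transcript; the standard AdaBoost training-error bound, which is presumably what the slide refers to, reads (with \(\varepsilon_k\) the weighted error at iteration \(k\) and \(\gamma_k = 1/2 - \varepsilon_k\)):
\[
\frac{1}{n}\sum_{i=1}^{n} I\bigl(F(\mathbf{x}_i) \ne y_i\bigr)
\;\le\; \prod_{k=1}^{N} 2\sqrt{\varepsilon_k (1-\varepsilon_k)}
\;=\; \prod_{k=1}^{N} \sqrt{1 - 4\gamma_k^{2}}
\;\le\; \exp\Bigl(-2\sum_{k=1}^{N}\gamma_k^{2}\Bigr).
\]
So if every weak classifier is slightly better than chance (\(\gamma_k \ge \gamma > 0\)), the training error decreases exponentially with the number of iterations N.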
46 Boosting as an additive method
Additive models: the final model is a (weighted) sum of component models. Boosting implements a backfitting procedure, but uses a loss function appropriate for classification.
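In this additive view (notation assumed), the boosted classifier can be written as below, and AdaBoost can be interpreted as stagewise fitting of this expansion under the exponential loss:
\[
F_N(\mathbf{x}) \;=\; \sum_{k=1}^{N} \alpha_k f_k(\mathbf{x}),
\qquad
\hat{y}(\mathbf{x}) \;=\; \operatorname{sign}\bigl(F_N(\mathbf{x})\bigr),
\qquad
L\bigl(y, F_N(\mathbf{x})\bigr) \;=\; \exp\bigl(-y\,F_N(\mathbf{x})\bigr),
\]
where each \(f_k\) is a base classifier and \(\alpha_k\) its weight; at each step a new term is added while the previously fitted terms are kept fixed.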
47 AdaBoost Modeling: Noisy Hyperbolas
Decision stump used as the base learner; 100 training samples and 100 validation samples (for choosing the optimal number of iterations N). AdaBoost models for two realizations of the training data give test error ~ 2.2% (vs. ~ 0.42% test error for RBF SVM).
48 AdaBoost Modeling: Digit Recognition
Handwritten digits 5 vs. 8 data: 100 training + 100 validation samples (for model selection). Decision stump as the base learner; model selection ~ choosing the optimal number of iterations N. For comparison, SVM test error ~ 4.7%.

Experiment   Training error   Validation error   Test error   Optimal N
 1           0                0.05               0.0809       16
 2           0                0.04               0.1217       16
 3           0                0.1                0.1066       32
 4           0                0.1                0.1061       32
 5           0                0.07               0.112        32
 6           0                0.13               0.1147       64
 7           0                0.14               0.1217       16
 8           0                0.1                0.0981       32
 9           0                0.09               0.089        32
10           0                0.12               0.0916       16
Ave          0                0.094              0.1042
St. dev.     0                0.0327             0.014
49 AdaBoost Summary
- Simplicity: little or no need for model selection
- Robust performance (for many data sets, especially high-dimensional ones)
- Algorithmic simplicity (fast training): no need for complex optimization algorithms; supports online training and re-training (for slowly changing distributions); popular in computer vision and robotics applications
Limitations:
- prone to overfitting with noisy data
- no provision for different misclassification costs
- lack of theoretical bounds on generalization performance
50 OUTLINE
- Objectives
- Motivation for combining methods
- Committee of networks
- Bagging
- Boosting
- Summary and discussion
51 Summary and Discussion
Ensemble methods:
- developed under the inductive learning setting
- for classification and regression problems
- motivated by improved generalization + robustness
Generalization usually depends on:
- the complexity of the base learner
- the number of component models N
- (note the lack of theoretical bounds on generalization)
Other applications of ensemble methods:
- data fusion
- dealing with large volumes of training data
- evaluating confidence (i.e., the fraction of classifiers that agree on a prediction gives a measure of confidence)