Overfitting, Bias/Variance tradeoff, and Ensemble methods

Overfitting, Bias/Variance tradeoff, and Ensemble methods
Pierre Geurts, Stochastic methods (Prof. L. Wehenkel), University of Liège

Content of the presentation:
- Bias and variance definitions
- Parameters that influence bias and variance
- Decision/regression tree variance
- Bias and variance reduction techniques

Content of the presentation:
- Bias and variance definitions: a simple regression problem with no input, generalization to full regression problems, a short discussion about classification
- Parameters that influence bias and variance
- Decision/regression tree variance
- Bias and variance reduction techniques

Regression problem, no input. Goal: predict as well as possible the height of a Belgian male adult. More precisely: choose an error measure, for example the square error, and find an estimation ŷ such that the expectation Ey{(y - ŷ)²} over the whole population of Belgian male adults is minimized.

Regression problem, no input. The estimation that minimizes the error can be computed by setting the derivative of the expected error with respect to ŷ to zero (see the derivation below). So, the estimation which minimizes the error is Ey{y}. In machine learning, it is called the Bayes model. But in practice, we cannot compute the exact value of Ey{y} (this would require measuring the height of every Belgian male adult).
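The derivation behind this statement, written out as a standard computation (restated here, not copied from the slides):

\[
E_y\{(y-\hat{y})^2\} = E_y\{y^2\} - 2\,\hat{y}\,E_y\{y\} + \hat{y}^2
\]
\[
\frac{\partial}{\partial \hat{y}}\,E_y\{(y-\hat{y})^2\} = -2\,E_y\{y\} + 2\,\hat{y} = 0
\quad\Longrightarrow\quad
\hat{y}^{*} = E_y\{y\}
\]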

Learning algorithm. As p(y) is unknown, find an estimation ŷ from a sample of individuals, LS = {y1, y2, …, yN}, drawn from the Belgian male adult population. Examples of learning algorithms: the sample mean ŷ1 = (1/N)·Σi yi, or a shrinkage estimator such as ŷ2 = λ·ŷ1 + (1 - λ)·180 (if we know that the height is close to 180).

Good learning algorithm. As LS is randomly drawn, the prediction ŷ will also be a random variable. A good learning algorithm should not be good only on one learning sample but on average over all learning samples (of size N), so we want to minimize E = ELS{Ey{(y - ŷ)²}}. Let us analyse this error in more detail.

Bias/variance decomposition (1)

Bias/variance decomposition (2): E = Ey{(y - Ey{y})²} + ELS{(Ey{y} - ŷ)²}. The first term, Ey{(y - Ey{y})²} = vary{y}, is the residual error: the minimal attainable error.

Bias/variance decomposition (3)

Bias/variance decomposition (4): E = vary{y} + (Ey{y} - ELS{ŷ})² + …, where ELS{ŷ} is the average model (over all LS) and bias² = (Ey{y} - ELS{ŷ})² is the error between the Bayes model and the average model.

Bias/variance decomposition (5): E = vary{y} + bias² + ELS{(ŷ - ELS{ŷ})²}. The last term, varLS{ŷ} = ELS{(ŷ - ELS{ŷ})²}, is the estimation variance, a consequence of over-fitting.

Bias/variance decomposition (6): putting everything together, E = vary{y} + bias² + varLS{ŷ}.

Our simple example. From statistics, ŷ1 (the sample mean) is the best estimate with zero bias; but because of its variance, it may not be the best estimator overall. There is a bias/variance tradeoff with respect to λ: shrinking towards 180 introduces bias but reduces variance. A small simulation of this tradeoff is sketched below.
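A minimal Monte Carlo sketch of this tradeoff in Python. The shrinkage estimator ŷ2 and the population parameters (mean 178 cm, standard deviation 7 cm, N = 10) are illustrative assumptions, not values taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_MEAN, TRUE_STD, N, N_TRIALS = 178.0, 7.0, 10, 10000  # assumed population and sample size

def bias_variance(estimator):
    """Monte Carlo estimate of bias^2 and variance of an estimator over many learning samples."""
    preds = np.array([estimator(rng.normal(TRUE_MEAN, TRUE_STD, N))
                      for _ in range(N_TRIALS)])
    return (preds.mean() - TRUE_MEAN) ** 2, preds.var()

b2, var = bias_variance(lambda ls: ls.mean())              # ŷ1: the unbiased sample mean
print(f"sample mean   bias^2={b2:.3f}  variance={var:.3f}")
for lam in (0.7, 0.3):                                     # ŷ2: shrinkage towards 180
    b2, var = bias_variance(lambda ls, l=lam: l * ls.mean() + (1 - l) * 180.0)
    print(f"lambda={lam:.1f}    bias^2={b2:.3f}  variance={var:.3f}")
```

Smaller λ means more shrinkage: the variance drops while the squared bias grows, which is exactly the tradeoff stated above.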

Bayesian approach (1). Hypotheses: the average height is close to 180 cm, i.e. the prior on ŷ is Gaussian centred at 180; the height of one individual is Gaussian around this mean, y ~ N(ŷ, σ²). What is the most probable value of ŷ after having seen the learning sample?

Bayesian approach (2). By Bayes' theorem, and since P(LS) is constant, P(ŷ | LS) ∝ P(LS | ŷ)·P(ŷ); by independence of the learning cases, P(LS | ŷ) = Πi P(yi | ŷ). Maximizing this posterior gives the estimate (a worked version follows).
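A worked version of this derivation under the stated Gaussian assumptions, with prior ŷ ~ N(180, σ0²) and likelihood yi ~ N(ŷ, σ²); the σ symbols are notation introduced here, not taken from the slides:

\[
P(\hat{y}\mid LS)\;\propto\;
\exp\!\Big(-\tfrac{(\hat{y}-180)^2}{2\sigma_0^2}\Big)\,
\prod_{i=1}^{N}\exp\!\Big(-\tfrac{(y_i-\hat{y})^2}{2\sigma^2}\Big)
\]
Setting the derivative of the log-posterior to zero gives the maximum a posteriori estimate
\[
\hat{y}_{MAP}
=\frac{\sigma_0^2\sum_{i=1}^{N} y_i+\sigma^2\cdot 180}{N\sigma_0^2+\sigma^2},
\]
a weighted average of the sample sum and the prior guess 180, i.e. exactly a shrinkage estimator of the kind discussed in the previous slides.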

Regression problem, full (1). Actually, we want to find a function ŷ(x) of several inputs, so we average over the whole input space. The error becomes Ex{Ey|x{(y - ŷ(x))²}}, and over all learning sets: ELS{Ex{Ey|x{(y - ŷ(x))²}}}.

Regression problem, full (2): ELS{Ey|x{(y - ŷ(x))²}} = Noise(x) + Bias²(x) + Variance(x), where
- Noise(x) = Ey|x{(y - hB(x))²} quantifies how much y varies around hB(x) = Ey|x{y}, the Bayes model;
- Bias²(x) = (hB(x) - ELS{ŷ(x)})² measures the error between the Bayes model and the average model;
- Variance(x) = ELS{(ŷ(x) - ELS{ŷ(x)})²} quantifies how much ŷ(x) varies from one learning sample to another.

Illustration (1). Problem definition: one input x, a uniform random variable in [0,1]; y = h(x) + ε where ε ~ N(0,1); h(x) = Ey|x{y}.

Illustration (2). A low variance, high bias method leads to underfitting. (Figure: the average model ELS{ŷ(x)} plotted against the data.)

Illustration (3). A low bias, high variance method leads to overfitting. (Figure: the average model ELS{ŷ(x)} plotted against the data.)

Illustration (4). No noise does not imply no variance, but it does imply less variance. (Figure: the average model ELS{ŷ(x)} on noise-free data.)

Classification problems (1). The mean misclassification error is the probability that ŷ(x) differs from y. The best possible model is the Bayes model (the model that predicts the most probable class given x), and the "average" model is, e.g., the majority vote of the models obtained over all learning samples. Unfortunately, there is no such decomposition of the mean misclassification error into a bias term and a variance term. Nevertheless, we observe the same phenomena.

Classification problems (2). (Figure: decision boundaries of a full decision tree and of a single test node, each trained on two different learning samples LS1 and LS2.)

Classification problems (3). Bias: the systematic error component (independent of the learning sample). Variance: the error due to the variability of the model with respect to the learning sample randomness. There are errors due to bias and errors due to variance. (Figure: errors of a single test node vs. a full decision tree.)

Content of the presentation:
- Bias and variance definitions
- Parameters that influence bias and variance: complexity of the model, complexity of the Bayes model, noise, learning sample size, learning algorithm
- Decision/regression tree variance
- Bias and variance reduction techniques

Illustrative problem. Artificial problem with 10 inputs, all uniform random variables in [0,1]. The true function depends only on 5 inputs: y(x) = 10·sin(π·x1·x2) + 20·(x3 - 0.5)² + 10·x4 + 5·x5 + ε, where ε is a N(0,1) random variable. Experiments: ELS is approximated by an average over 50 learning sets of size 500, and Ex,y by an average over 2000 cases, which lets us estimate variance and bias (+ residual error). A sketch of this estimation protocol is given below.
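A minimal sketch of this estimation protocol in Python, assuming numpy and scikit-learn are available. The regression tree is just one possible plug-in learner for illustration; the exact numbers in the later tables will not be reproduced by this sketch:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def friedman(n):
    """Generate n cases of the illustrative problem (10 inputs, only 5 relevant)."""
    X = rng.uniform(0, 1, size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, n))
    return X, y

# Protocol from the slide: 50 learning sets of size 500, 2000 test cases.
X_test, y_test = friedman(2000)
preds = []
for _ in range(50):
    X_ls, y_ls = friedman(500)
    model = DecisionTreeRegressor().fit(X_ls, y_ls)   # any learner can be plugged in here
    preds.append(model.predict(X_test))
preds = np.array(preds)                               # shape (50, 2000)

avg_model = preds.mean(axis=0)                        # E_LS{ŷ(x)} at each test point
h_bayes = (10 * np.sin(np.pi * X_test[:, 0] * X_test[:, 1])
           + 20 * (X_test[:, 2] - 0.5) ** 2 + 10 * X_test[:, 3] + 5 * X_test[:, 4])
noise = 1.0                                           # variance of the N(0,1) noise term
bias2 = ((h_bayes - avg_model) ** 2).mean()
variance = preds.var(axis=0).mean()
print(f"noise={noise:.2f}  bias^2={bias2:.2f}  variance={variance:.2f}  "
      f"E~={noise + bias2 + variance:.2f}")
```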

Complexity of the model. (Figure: E = bias² + var, bias², and var as functions of model complexity.) Usually, the bias is a decreasing function of the complexity, while the variance is an increasing function of the complexity.

Complexity of the model – neural networks Error, bias, and variance w.r.t. the number of neurons in the hidden layer

Complexity of the model – regression trees Error, bias, and variance w.r.t. the number of test nodes

Complexity of the model – k-NN Error, bias, and variance w.r.t. k, the number of neighbors

Learning problem.
- Complexity of the Bayes model: at fixed model complexity, bias increases with the complexity of the Bayes model. However, the effect on variance is difficult to predict.
- Noise: variance increases with noise and bias is mainly unaffected. E.g. with (full) regression trees.

Learning sample size (1) At fixed model complexity, bias remains constant and variance decreases with the learning sample size. E.g. linear regression

Learning sample size (2) When the complexity of the model is dependent on the learning sample size, both bias and variance decrease with the learning sample size. E.g. regression trees

Learning algorithms – linear regression

Method         Err²   Bias²+Noise   Variance
Linear regr.    7.0       6.8          0.2
k-NN (k=1)     15.4       5.0         10.4
k-NN (k=10)     8.5       7.2          1.3
MLP (10)        2.0       1.2          0.8
MLP (10-10)     4.6       1.4          3.2
Regr. tree     10.2       3.5          6.7

Linear regression has very few parameters: small variance. The goal function is not linear: high bias.

Learning algorithms – k-NN (same table as above). Small k: high variance and moderate bias (k=1: Err² 15.4, variance 10.4). High k: smaller variance but higher bias (k=10: Err² 8.5, variance 1.3).

Learning algorithms – MLP (same table as above). Small bias; variance increases with the model complexity (one hidden layer of 10 neurons: variance 0.8; two layers of 10: variance 3.2).

Learning algorithms – regression trees (same table as above). Small bias: a (complex enough) tree can approximate any non-linear function. High variance (6.7, the largest of all methods).

Content of the presentation:
- Bias and variance definitions
- Parameters that influence bias and variance
- Decision/regression tree variance
- Bias and variance reduction techniques

Decision/regression tree variance (1). Decision and regression trees are among the machine learning methods with the highest variance: even a small change of the learning sample can result in a very different tree, and even small trees have a high variance.

Method                  E     Bias   Variance
k-NN (k=10)            8.5     7.2      1.3
MLP (10-10)            4.6     1.4      3.2
RT, no test           25.5    25.4      0.1
RT, 1 test            19.0    17.7      1.3
RT, 3 tests           14.8    11.1      3.7
RT, full (250 tests)  10.2     3.5      6.7

Decision/regression tree variance (2). Possible sources of variance:
- Discretization of numerical attributes: the selected threshold has a high variance (see next slide).
- Structure choice: sometimes, attribute scores are very close.
- Estimation at leaf nodes: because of the recursive partitioning, predictions at leaf nodes are based on very small samples of objects.
Consequences: sub-optimality in terms of accuracy, and questionable interpretability since the parameters cannot be trusted.

Decision/regression tree variance (3). The discretization thresholds chosen in trees are very unstable; this variance puts the interpretability of the tree into question. (Figure: trees built on different learning samples pick different thresholds on attribute A1: A1 < 0.61, A1 < 0.82, A1 < 0.48.)

Content of the presentation:
- Bias and variance definitions
- Parameters that influence bias and variance
- Decision/regression tree variance
- Bias and variance reduction techniques: introduction, dealing with the bias/variance tradeoff of one algorithm, ensemble methods

Bias and variance reduction techniques.
- In the context of a given method: adapt the learning algorithm to find the best trade-off between bias and variance. Not a panacea, but the least we can do. Examples: pruning, weight decay.
- Ensemble methods: change the bias/variance trade-off. Universal, but destroys some features of the initial method. Examples: bagging, boosting.

Variance reduction: 1 model (1). General idea: reduce the ability of the learning algorithm to fit the LS.
- Pruning reduces the model complexity explicitly.
- Early stopping reduces the amount of search.
- Regularization reduces the size of the hypothesis space; weight decay with neural networks consists in penalizing high weight values.
These three knobs are illustrated in the sketch below.
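A minimal scikit-learn sketch of these three knobs; the specific estimators and parameter values are illustrative assumptions, not taken from the slides:

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# Pruning: limit tree complexity explicitly (depth limit / cost-complexity pruning).
pruned_tree = DecisionTreeRegressor(max_depth=5, ccp_alpha=0.01)

# Early stopping: stop training the MLP when a held-out validation score stops improving.
early_stopped_mlp = MLPRegressor(hidden_layer_sizes=(10,),
                                 early_stopping=True, validation_fraction=0.1)

# Weight decay / regularization: penalize large weights via the L2 penalty `alpha`.
weight_decayed_mlp = MLPRegressor(hidden_layer_sizes=(10,), alpha=1e-2)
```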

Variance reduction: 1 model (2). (Figure: E = bias² + var, bias², and var as functions of the level of fitting; the optimal fitting lies at the minimum of E.) Selection of the optimal level of fitting: either a priori (not optimal) or by cross-validation (less efficient), using the fact that bias² is roughly the error on the learning set and E the error on an independent test set.

Variance reduction: 1 model (3). Examples: post-pruning of regression trees, early stopping of MLP by cross-validation.

Method                   E     Bias   Variance
Full regr. tree (250)  10.2     3.5      6.7
Pruned regr. tree (45)  9.1     4.3      4.8
Fully learned MLP       4.6     1.4      3.2
Early stopped MLP       3.8     1.5      2.3

As expected, variance decreases but bias increases.

Ensemble methods. Combine the predictions of several models built with a learning algorithm, in order to improve with respect to the use of a single model. Two important families:
- Averaging techniques: grow several models independently and simply average their predictions. Examples: bagging, random forests. They decrease mainly variance.
- Boosting-type algorithms: grow several models sequentially. Examples: Adaboost, MART. They decrease mainly bias.

Bagging (1). Recall that ELS{Err(x)} = Ey|x{(y - hB(x))²} + (hB(x) - ELS{ŷ(x)})² + ELS{(ŷ(x) - ELS{ŷ(x)})²}. Idea: the average model ELS{ŷ(x)} has the same bias as the original method but zero variance. Bagging (Bootstrap AGGregatING): to compute ELS{ŷ(x)}, we should draw an infinite number of LS (of size N); since we have only one single LS, we simulate sampling from nature by bootstrap sampling from the given LS. Bootstrap sampling = sampling with replacement of N objects from LS (where N is the size of LS).

Bagging (2). (Figure: T bootstrap samples LS1, LS2, …, LST are drawn from LS; a model is built on each and their predictions ŷ1(x), ŷ2(x), …, ŷT(x) are combined.) In regression: ŷ(x) = (1/T)·(ŷ1(x) + ŷ2(x) + … + ŷT(x)). In classification: ŷ(x) = the majority class in {ŷ1(x), …, ŷT(x)}. A from-scratch sketch is given below.
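A from-scratch sketch of bagging for regression, using scikit-learn trees as the base learner; T = 25 echoes the table on the next slide, the rest is an illustrative assumption:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, T=25, seed=0):
    """Fit T regression trees, each on a bootstrap sample (with replacement) of the LS."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(T):
        idx = rng.integers(0, n, size=n)      # bootstrap: N draws with replacement
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Regression: average the T individual predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)
```

For classification, the averaging step would be replaced by a majority vote over the T predicted classes.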

Bagging (3). Usually, bagging greatly reduces the variance without increasing the bias too much. Application to regression trees:

Method                 E     Bias   Variance
3-test regr. tree    14.8    11.1      3.7
  + bagging (T=25)   11.7    10.7      1.0
Full regr. tree      10.2     3.5      6.7
  + bagging (T=25)    5.3     3.8      1.5

Strong variance reduction without increasing the bias (although the model is much more complex than a single tree).

Bagging (4). (Figure: y vs. x on the one-input illustration.)

Other averaging techniques. Perturb and Combine paradigm: perturb the data or the learning algorithm to obtain several models that are good on the learning sample, then combine the predictions of these models. Usually, these methods decrease the variance (because of averaging) but (slightly) increase the bias (because of the perturbation). Examples: bagging perturbs the learning sample; learning several neural networks with random initial weights; random forests. A sketch of the random-initial-weights variant is given below.

Method               E     Bias   Variance
MLP (10-10)         4.6     1.4      3.2
Average of 10 MLPs  2.0              0.6
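A minimal sketch of the "several neural networks with random initial weights" variant in scikit-learn; the (10, 10) architecture echoes the MLP (10-10) row of the table, the remaining settings are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def average_of_mlps(X, y, n_models=10, hidden=(10, 10)):
    """Perturb the learning algorithm: same data, different random initial weights."""
    models = [MLPRegressor(hidden_layer_sizes=hidden, random_state=s,
                           max_iter=2000).fit(X, y)
              for s in range(n_models)]
    return lambda X_new: np.mean([m.predict(X_new) for m in models], axis=0)
```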

Random forests (1). A Perturb and Combine algorithm specifically designed for trees. It combines bagging and random attribute subset selection:
- Build each tree from a bootstrap sample.
- Instead of choosing the best split among all attributes, select the best split among a random subset of k attributes (this reduces to bagging when k is equal to the number of attributes).
There is a bias/variance tradeoff with respect to k: the smaller k, the greater the reduction of variance but also the higher the increase of bias. See the usage sketch below.
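A minimal usage sketch in scikit-learn terms; mapping the slide's k onto max_features is an assumption about the interface, and n_estimators=25 is also an assumption:

```python
from sklearn.ensemble import RandomForestRegressor

# k attributes are considered at each split; k equal to the total number of
# attributes (here 10) recovers plain bagging of trees.
forests = {k: RandomForestRegressor(n_estimators=25, max_features=k, bootstrap=True)
           for k in (10, 7, 5, 3)}
# forests[5].fit(X_ls, y_ls); forests[5].predict(X_test)   # plug in any learning sample
```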

Random forests (2). Application to our illustrative problem:

Method                  E     Bias   Variance
Full regr. tree       10.2     3.5      6.7
Bagging (k=10)         5.3     3.8      1.5
Random forests (k=7)   4.8              1.0
Random forests (k=5)   4.9     4.0      0.9
Random forests (k=3)   5.6     4.7      0.8

Other advantage: random forests decrease computing times with respect to bagging, since only a subset of all attributes needs to be considered when splitting a node.

Boosting methods (1). The motivation of boosting is to combine the outputs of many "weak" models to produce a powerful ensemble of models. Weak model = a model that has a high bias (strictly, in classification, a model slightly better than random guessing). Differences with the previous ensemble methods:
- Models are built sequentially on modified versions of the data.
- The predictions of the models are combined through a weighted sum/vote.

Boosting methods (2). (Figure: modified learning samples LS1, LS2, …, LST are built sequentially from LS; a model ŷt(x) is fitted on each.) In regression: ŷ(x) = β1·ŷ1(x) + β2·ŷ2(x) + … + βT·ŷT(x). In classification: ŷ(x) = the majority class in {ŷ1(x), …, ŷT(x)} according to the weights {β1, β2, …, βT}.

Adaboost (1). Assume that the learning algorithm accepts weighted objects. This is the case for many learning algorithms: with trees, simply take the weights into account when counting objects; in neural networks, minimize the weighted squared error. At each step, Adaboost increases the weights of the cases from the learning sample that were misclassified by the last model; thus, the algorithm focuses on the difficult cases from the learning sample. In the weighted majority vote, Adaboost gives higher influence to the more accurate models.

Adaboost (2). Input: a learning algorithm and a learning sample {(xi, yi): i = 1, …, N}.
- Initialize the weights wi = 1/N, i = 1, …, N.
- For t = 1 to T:
  - Build a model ŷt(x) with the learning algorithm using the weights wi.
  - Compute the weighted error: errt = Σi wi·I(yi ≠ ŷt(xi)) / Σi wi.
  - Compute βt = log((1 - errt)/errt).
  - Update the weights: wi ← wi·exp[βt·I(yi ≠ ŷt(xi))].
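A from-scratch sketch of this loop for binary classification, with scikit-learn decision stumps as the weak learner; the choice of base learner, T, and the early-stop rule are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """y must be in {-1, +1}. Returns the weak models and their weights beta_t."""
    N = len(y)
    w = np.full(N, 1.0 / N)
    models, betas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (stump.predict(X) != y).astype(float)
        err = np.dot(w, miss) / w.sum()
        if err <= 0 or err >= 0.5:            # degenerate weak model: stop early
            break
        beta = np.log((1 - err) / err)
        w *= np.exp(beta * miss)              # increase weights of misclassified cases
        models.append(stump)
        betas.append(beta)
    return models, betas

def adaboost_predict(models, betas, X):
    """Weighted majority vote of the weak models."""
    votes = sum(b * m.predict(X) for m, b in zip(models, betas))
    return np.sign(votes)
```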

MART (multiple additive regression trees). MART is a boosting algorithm for regression. Input: a learning sample {(xi, yi): i = 1, …, N}.
- Initialize ŷ0(x) = (1/N)·Σi yi and ri = yi, i = 1, …, N.
- For t = 1 to T:
  - For i = 1 to N, compute the residuals ri ← ri - ŷt-1(xi).
  - Build a regression tree ŷt(x) from the learning sample {(xi, ri): i = 1, …, N}.
- Return the model ŷ(x) = ŷ0(x) + ŷ1(x) + … + ŷT(x).
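A from-scratch sketch of this procedure in Python; the shallow scikit-learn trees and the values of T and max_depth are illustrative assumptions, echoing the small-tree discussion on the next slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def mart_fit(X, y, T=50, max_depth=2):
    """Boosting for regression: repeatedly fit a tree to the current residuals."""
    y0 = y.mean()                       # ŷ0(x): constant initial model
    r = y - y0                          # residuals after the initial model
    trees = []
    for _ in range(T):
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
        r = r - tree.predict(X)         # update residuals with the newly fitted tree
        trees.append(tree)
    return y0, trees

def mart_predict(y0, trees, X):
    """ŷ(x) = ŷ0(x) + ŷ1(x) + … + ŷT(x)."""
    return y0 + sum(t.predict(X) for t in trees)
```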

Boosting methods. Adaboost and MART are only two boosting variants; there are many other boosting-type algorithms. Boosting decision/regression trees improves their accuracy, often dramatically. However, boosting is more sensitive to noise than averaging techniques (overfitting). For boosting to work, the models must not be perfect on the learning sample. With trees, there are two possible strategies: use pruned trees (pre-pruned, or post-pruned by cross-validation), or limit the number of tree tests (and split the most impure nodes first). So there is again a bias/variance tradeoff with respect to the tree size.

Experiment with MART. On our illustrative problem, boosting reduces the bias but increases the variance; however, with respect to full trees, it decreases both bias and variance.

Method                    E     Bias   Variance
Full regr. tree         10.2     3.5      6.7
Regr. tree with 1 test  18.9    17.8      1.1
  + MART (T=50)          5.0     3.1      1.9
  + bagging (T=50)      17.9    17.3      0.6
Regr. tree with 5 tests 11.7     8.8      2.9
  + MART (T=50)          6.4     1.7      4.7
  + bagging (T=50)       9.1     8.7      0.4

Interpretability and efficiency of ensembles. Since we average several models, we lose interpretability and efficiency, which are two of the main advantages of decision/regression trees. However, we can still use the ensembles to compute variable importance by averaging over all trees; actually, this even stabilizes the estimates. Averaging techniques can be parallelized and boosting-type algorithms use smaller trees, so the increase in computing times is not so detrimental.

Experiments on Golub's microarray data. 72 objects, 7129 numerical attributes (gene expressions), 2 classes (ALL and AML). Leave-one-out error with several variants, and variable importance with boosting:

Method                          Error
1 decision tree                 22.2% (16/72)
Random forests (k=85, T=500)     9.7% (7/72)
Extra-Trees (sth=0.5, T=500)     5.5% (4/72)
Adaboost (1 test node, T=500)    1.4% (1/72)

Conclusion (1). The notions of bias and variance are very useful to predict how changing the (learning and problem) parameters will affect the accuracy; e.g. this explains why very simple methods can work much better than more complex ones on very difficult tasks. Variance reduction is a very important topic: reducing bias is easy, but keeping variance low is not as easy, especially in the context of new applications of machine learning to very complex domains: temporal data, biological data, Bayesian network learning, text mining… Not all learning algorithms are equal in terms of variance; trees are among the worst methods by this criterion.

Conclusion (2). Ensemble methods are very effective techniques to reduce bias and/or variance; they can transform a not-so-good method into a competitive method in terms of accuracy. Adaboost with trees is considered one of the best "off-the-shelf" classification methods. Interpretability of the model and efficiency of the method are difficult to preserve if we want to reduce variance significantly. There are other ways to tackle the variance/overfitting problem, e.g. Bayesian approaches (related to averaging techniques) and support vector machines (which maintain a low variance by maximizing the classification margin).

References. About bias and variance:
- Neural networks and the bias/variance dilemma, S. Geman et al., Neural Computation, 4(1), 1992, pp. 1-58.
- Neural Networks for Pattern Recognition, C. M. Bishop, Oxford University Press, 1995.
- The Elements of Statistical Learning, T. Hastie et al., Springer, 2001.
- Contribution to decision tree induction: bias/variance tradeoff and time series classification, P. Geurts, PhD thesis, 2002.
About ensemble methods:
- Bagging predictors, L. Breiman, Machine Learning, 24, 1996.
- A decision-theoretic generalization of on-line learning and an application to boosting, Y. Freund and R. Schapire, Journal of Computer and System Sciences, 1995.
- Random Forests, L. Breiman, Machine Learning, 45, 2001.
- Ensemble methods in machine learning, T. Dietterich, First International Workshop on Multiple Classifier Systems, 2000.
- An introduction to boosting and leveraging, R. Meir and G. Rätsch, Advanced Lectures on Machine Learning, Springer, 2003.

Software. Random forests: http://stat-www.berkeley.edu/users/breiman/rf.html and the R package randomForest. Boosting: see www.boosting.org.