Overfitting, Bias/Variance tradeoff

2 Content of the presentation
– Bias and variance definitions
– Parameters that influence bias and variance
– Bias and variance reduction techniques

3 Content of the presentation
– Bias and variance definitions:
  – A simple regression problem with no input
  – Generalization to full regression problems
  – A short discussion about classification
– Parameters that influence bias and variance
– Bias and variance reduction techniques

4 Regression problem – no input
Goal: predict as well as possible the height of a Belgian male adult.
More precisely:
– Choose an error measure, for example the squared error.
– Find an estimation ŷ such that the expectation E_y{(y − ŷ)²} over the whole population of Belgian male adults is minimized.

5 Regression problem – no input
The estimation that minimizes the error can be found by taking the derivative of E_y{(y − ŷ)²} with respect to ŷ and setting it to zero.
So, the estimation which minimizes the error is ŷ = E_y{y}. It is called the Bayes model.
But in practice, we cannot compute the exact value of E_y{y} (this would require measuring the height of every Belgian male adult).
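For completeness, the minimization step written out (standard algebra, not reproduced in the transcript), in LaTeX:

    \frac{\partial}{\partial \hat{y}}\, E_y\{(y-\hat{y})^2\}
      = E_y\{-2\,(y-\hat{y})\}
      = -2\,\bigl(E_y\{y\}-\hat{y}\bigr) = 0
      \quad\Longrightarrow\quad \hat{y} = E_y\{y\}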

6 Learning algorithm
As p(y) is unknown, we find an estimation ŷ from a sample of individuals, LS = {y_1, y_2, …, y_N}, drawn from the Belgian male adult population.
Examples of learning algorithms:
– the sample mean of the learning sample,
– an estimator that exploits the prior knowledge that the height is close to 180, e.g. by shrinking the sample mean toward 180 (a sketch of both is given below).
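The exact formulas of the two example algorithms are not in the transcript. A minimal sketch of the two ideas in Python; the shrinkage form, the parameter lam, and the N(178, 7²) height distribution are illustrative assumptions, not the slides' definitions:

    import numpy as np

    def estimator_mean(ls):
        # ŷ1: simply return the sample mean of the learning sample
        return np.mean(ls)

    def estimator_shrunk(ls, lam=0.5):
        # hypothetical second estimator: pull the sample mean toward the
        # prior guess of 180 cm; lam controls how much we trust that guess
        return lam * 180.0 + (1.0 - lam) * np.mean(ls)

    # usage on one random learning sample of size N = 10
    rng = np.random.default_rng(0)
    ls = rng.normal(loc=178.0, scale=7.0, size=10)   # assumed height distribution
    print(estimator_mean(ls), estimator_shrunk(ls))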

7 Good learning algorithm
As LS is randomly drawn, the prediction ŷ is also a random variable.
A good learning algorithm should not be good only on one learning sample but on average over all learning samples (of size N), so we want to minimize:
E = E_LS{E_y{(y − ŷ)²}}
Let us analyse this error in more detail.
[Figure: distribution p_LS(ŷ) of the prediction ŷ over learning samples]

8 Bias/variance decomposition (1)

9 Bias/variance decomposition (2)
E = E_y{(y − E_y{y})²} + E_LS{(E_y{y} − ŷ)²}
The first term is var_y{y}, the residual error: the minimal attainable error.
[Figure: distribution of y around E_y{y}, with spread var_y{y}]

10 Bias/variance decomposition (3)

11 Bias/variance decomposition (4)
E = var_y{y} + (E_y{y} − E_LS{ŷ})² + …
E_LS{ŷ} is the average model (over all LS).
bias² = (E_y{y} − E_LS{ŷ})² is the error between the Bayes model and the average model.
[Figure: distribution of ŷ, showing E_y{y}, E_LS{ŷ} and the gap bias²]

12 Bias/variance decomposition (5)
E = var_y{y} + bias² + E_LS{(ŷ − E_LS{ŷ})²}
var_LS{ŷ} = E_LS{(ŷ − E_LS{ŷ})²} is the estimation variance, a consequence of over-fitting.
[Figure: distribution of ŷ around E_LS{ŷ}, with spread var_LS{ŷ}]

13 Bias/variance decomposition (6)
E = var_y{y} + bias² + var_LS{ŷ}
[Figure: distribution of y around E_y{y} (var_y{y}) and of ŷ around E_LS{ŷ} (var_LS{ŷ}), with bias² the gap between E_y{y} and E_LS{ŷ}]
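The decomposition can be checked numerically. A minimal Monte Carlo sketch for the no-input problem, assuming heights distributed as N(178, 7²) and the hypothetical shrinkage estimator from the earlier sketch (all numbers illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, N, lam = 178.0, 7.0, 10, 0.5          # assumed "true" height distribution

    def estimator(ls):
        # the hypothetical shrinkage estimator toward 180 used above
        return lam * 180.0 + (1.0 - lam) * np.mean(ls)

    # draw many learning samples and record the prediction ŷ of each
    preds = np.array([estimator(rng.normal(mu, sigma, N)) for _ in range(20000)])

    residual = sigma ** 2                            # var_y{y}
    bias2 = (mu - preds.mean()) ** 2                 # (E_y{y} - E_LS{ŷ})²
    variance = preds.var()                           # var_LS{ŷ}

    # direct Monte Carlo estimate of E = E_LS{E_y{(y - ŷ)²}} for comparison
    ys = rng.normal(mu, sigma, preds.size)
    E_direct = np.mean((ys - preds) ** 2)

    print(residual + bias2 + variance, E_direct)     # the two values should roughly agree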

14 Our simple example
– From statistics, ŷ1 (the sample mean) is the best unbiased estimate: it has zero bias.
– Yet it may not be the best estimator overall, because of its variance: the biased estimator that shrinks toward 180 can reach a lower total error (there is a bias/variance tradeoff with respect to the amount of shrinkage).

15 Regression problem – full (1)
Actually, we want to find a function ŷ(x) of several inputs, so we average over the whole input space.
The error becomes: E_x{E_y|x{(y − ŷ(x))²}}
Over all learning sets: E = E_LS{E_x{E_y|x{(y − ŷ(x))²}}}

16 Regression problem – full (2)
E_LS{E_y|x{(y − ŷ(x))²}} = Noise(x) + Bias²(x) + Variance(x)
– Noise(x) = E_y|x{(y − h_B(x))²} quantifies how much y varies around h_B(x) = E_y|x{y}, the Bayes model.
– Bias²(x) = (h_B(x) − E_LS{ŷ(x)})² measures the error between the Bayes model and the average model.
– Variance(x) = E_LS{(ŷ(x) − E_LS{ŷ(x)})²} quantifies how much ŷ(x) varies from one learning sample to another.

17 Illustration (1)
Problem definition:
– One input x, a uniform random variable in [0,1]
– y = h(x) + ε, where ε ~ N(0,1) and h(x) = E_y|x{y}
[Figure: scatter plot of y versus x with the curve h(x)]

18 Illustration (2)
A low variance, high bias method → underfitting.
[Figure: the average model E_LS{ŷ(x)} is far from h(x) but varies little across learning samples]

19 Illustration (3)
A low bias, high variance method → overfitting.
[Figure: the average model E_LS{ŷ(x)} follows h(x) closely, but individual models vary strongly from one learning sample to another]
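A small simulation of this 1-D illustration. The transcript does not give h(x), so the sketch assumes h(x) = sin(2πx) and uses polynomial fits of degree 1 (high bias, low variance) and degree 10 (low bias, high variance) as the two methods:

    import numpy as np

    rng = np.random.default_rng(2)

    def h(x):
        # assumed target function (the slide's h(x) is not in the transcript)
        return np.sin(2 * np.pi * x)

    x_grid = np.linspace(0, 1, 200)
    n_ls, N = 100, 30                        # number of learning samples, size of each

    def fit_predict(degree):
        # fit a polynomial of the given degree on each learning sample and
        # return every model's predictions on x_grid
        preds = np.empty((n_ls, x_grid.size))
        for i in range(n_ls):
            x = rng.uniform(0, 1, N)
            y = h(x) + rng.normal(0, 1, N)
            preds[i] = np.polyval(np.polyfit(x, y, degree), x_grid)
        return preds

    for degree in (1, 10):                   # low complexity vs high complexity
        preds = fit_predict(degree)
        bias2 = np.mean((h(x_grid) - preds.mean(axis=0)) ** 2)
        var = np.mean(preds.var(axis=0))
        print(f"degree {degree:2d}: bias² ≈ {bias2:.2f}, variance ≈ {var:.2f}")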

20 Classification problems (1)
The mean misclassification error is E = E_LS{E_x,y{1(y ≠ ŷ(x))}}.
The best possible model is the Bayes model: h_B(x) = arg max_c P(y = c | x).
The “average” model is the majority vote over learning samples: arg max_c P_LS(ŷ(x) = c).
Unfortunately, there is no such decomposition of the mean misclassification error into a bias and a variance term.
Nevertheless, we observe the same phenomena.

21 Classification problems (2)
[Figure: decision boundaries of a full decision tree and of a single test node, each built on two different learning samples LS1 and LS2]

22 Classification problems (3)
– Bias → the systematic error component (independent of the learning sample)
– Variance → the error due to the variability of the model with respect to the learning sample randomness
There are errors due to bias and errors due to variance (see the sketch below).
[Figure: misclassified regions due to bias and to variance, for one test node versus a full decision tree]
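A minimal sketch of these phenomena on a toy classification problem; the data generator, sample sizes, and the use of scikit-learn decision trees are assumptions for illustration, not the slides' experiment:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(3)

    def make_data(n):
        # toy 2-class problem with an oblique boundary plus 10% label noise
        X = rng.uniform(-1, 1, size=(n, 2))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        flip = rng.random(n) < 0.1
        return X, np.where(flip, 1 - y, y)

    X_test, y_test = make_data(2000)

    for name, model in [("one test node", DecisionTreeClassifier(max_depth=1)),
                        ("full tree", DecisionTreeClassifier())]:
        # train on 50 independent learning samples and collect the predictions
        preds = np.array([model.fit(*make_data(200)).predict(X_test) for _ in range(50)])
        error = np.mean(preds != y_test)                  # mean misclassification error
        vote = (preds.mean(axis=0) > 0.5).astype(int)     # "average" (majority vote) model
        disagreement = np.mean(preds != vote)             # variability across learning samples
        print(f"{name:13s}: error ≈ {error:.3f}, disagreement with vote ≈ {disagreement:.3f}")

The stump mostly makes systematic (bias) errors, while the full tree's errors vary with the learning sample (variance), which is the contrast the slide illustrates.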

23 Content of the presentation
– Bias and variance definitions
– Parameters that influence bias and variance:
  – Complexity of the model
  – Complexity of the Bayes model
  – Noise
  – Learning sample size
  – Learning algorithm
– Bias and variance reduction techniques

24 Illustrative problem
Artificial problem with 10 inputs, all uniform random variables in [0,1].
The true function depends only on 5 inputs (the Friedman #1 benchmark function):
y(x) = 10·sin(π·x1·x2) + 20·(x3 − 0.5)² + 10·x4 + 5·x5 + ε, where ε is a N(0,1) random variable.
Experiments:
– E_LS → average over 50 learning sets of size 500
– E_x,y → average over 2000 test cases
⇒ estimate variance and bias (+ residual error); a sketch of this protocol follows below.
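A sketch of this experimental protocol in Python with scikit-learn. The target function is the reconstruction given above, and the estimators and exact numbers are illustrative; they will not match the original slides:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(4)

    def friedman(n):
        # 10 uniform inputs, only the first 5 matter; additive N(0,1) noise
        X = rng.uniform(0, 1, size=(n, 10))
        h = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
             + 10 * X[:, 3] + 5 * X[:, 4])
        return X, h, h + rng.normal(0, 1, n)

    X_test, h_test, y_test = friedman(2000)          # 2000 test cases

    def bias_variance(make_model, n_ls=50, ls_size=500):
        preds = np.empty((n_ls, len(y_test)))
        for i in range(n_ls):                        # 50 learning sets of size 500
            X_ls, _, y_ls = friedman(ls_size)
            preds[i] = make_model().fit(X_ls, y_ls).predict(X_test)
        avg = preds.mean(axis=0)                     # average model E_LS{ŷ(x)}
        bias2 = np.mean((h_test - avg) ** 2)         # Bias²(x), averaged over x
        variance = np.mean(preds.var(axis=0))        # Variance(x), averaged over x
        return bias2, variance

    for name, make_model in [("linear regr.", LinearRegression),
                             ("k-NN (k=1)", lambda: KNeighborsRegressor(n_neighbors=1)),
                             ("k-NN (k=10)", lambda: KNeighborsRegressor(n_neighbors=10)),
                             ("regr. tree", DecisionTreeRegressor)]:
        b2, var = bias_variance(make_model)
        print(f"{name:14s} bias² ≈ {b2:5.2f}  variance ≈ {var:5.2f}  (+ noise ≈ 1.0)")

On this problem, linear regression tends to show large bias and small variance, while 1-NN and full trees show the opposite, in line with the comparison on slides 32–35.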

25 Complexity of the model
Usually, the bias is a decreasing function of the complexity, while the variance is an increasing function of the complexity.
[Figure: bias² (decreasing), variance (increasing) and the total error E = bias² + var as functions of model complexity]

26 Complexity of the model – neural networks
Error, bias, and variance w.r.t. the number of neurons in the hidden layer.

27 Complexity of the model – regression trees
Error, bias, and variance w.r.t. the number of test nodes.

28 Complexity of the model – k-NN
Error, bias, and variance w.r.t. k, the number of neighbors.
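The complexity effect of slides 25–28 can be reproduced by sweeping k with the same protocol as in the earlier sketch (larger k means a simpler model); a minimal, self-contained version:

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(5)

    def friedman(n):
        # same illustrative data generator as in the sketch after slide 24
        X = rng.uniform(0, 1, size=(n, 10))
        h = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
             + 10 * X[:, 3] + 5 * X[:, 4])
        return X, h, h + rng.normal(0, 1, n)

    X_test, h_test, _ = friedman(2000)

    for k in (1, 2, 5, 10, 25, 50):
        preds = np.empty((50, 2000))
        for i in range(50):
            X_ls, _, y_ls = friedman(500)
            preds[i] = KNeighborsRegressor(n_neighbors=k).fit(X_ls, y_ls).predict(X_test)
        bias2 = np.mean((h_test - preds.mean(axis=0)) ** 2)
        var = np.mean(preds.var(axis=0))
        print(f"k={k:3d}: bias² ≈ {bias2:5.2f}, variance ≈ {var:5.2f}")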

29 Learning problem
Complexity of the Bayes model:
– At fixed model complexity, bias increases with the complexity of the Bayes model. However, the effect on variance is difficult to predict.
Noise:
– Variance increases with noise, while bias is mainly unaffected.
– E.g. with (full) regression trees.

30 Learning sample size (1)
At fixed model complexity, bias remains constant and variance decreases with the learning sample size. E.g. linear regression.

31 Learning sample size (2)
When the complexity of the model depends on the learning sample size, both bias and variance decrease with the learning sample size. E.g. regression trees (see the sketch below).
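A minimal sketch of the sample-size effect with the same illustrative protocol, comparing a fixed-complexity model (linear regression, slide 30) with full regression trees, whose complexity grows with the learning sample (slide 31):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(6)

    def friedman(n):
        # same illustrative data generator as in the sketch after slide 24
        X = rng.uniform(0, 1, size=(n, 10))
        h = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
             + 10 * X[:, 3] + 5 * X[:, 4])
        return X, h, h + rng.normal(0, 1, n)

    X_test, h_test, _ = friedman(2000)

    for name, make_model in [("linear regr.", LinearRegression),
                             ("regr. tree", DecisionTreeRegressor)]:
        for n in (50, 200, 1000):
            preds = np.empty((50, 2000))
            for i in range(50):
                X_ls, _, y_ls = friedman(n)
                preds[i] = make_model().fit(X_ls, y_ls).predict(X_test)
            bias2 = np.mean((h_test - preds.mean(axis=0)) ** 2)
            var = np.mean(preds.var(axis=0))
            print(f"{name:13s} N={n:5d}: bias² ≈ {bias2:5.2f}, variance ≈ {var:5.2f}")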

32 Learning algorithms – linear regression
– Very few parameters: small variance.
– The goal function is not linear: high bias.
[Table: Err², Bias² + Noise, and Variance for linear regr., k-NN (k=1), k-NN (k=10), MLP (10), MLP (10–10), and regr. tree; numeric values not recoverable from the transcript]

33 Learning algorithms – k-NN
– Small k: high variance and moderate bias.
– High k: smaller variance but higher bias.
[Table: same comparison table as slide 32]

34 Learning algorithms – MLP
– Small bias.
– Variance increases with the model complexity.
[Table: same comparison table as slide 32]

35 Learning algorithms – regression trees
– Small bias: a (complex enough) tree can approximate any non-linear function.
– High variance.
[Table: same comparison table as slide 32]

36 Content of the presentation
– Bias and variance definitions
– Parameters that influence bias and variance
– Decision/regression tree variance
– Bias and variance reduction techniques:
  – Introduction
  – Dealing with the bias/variance tradeoff of one algorithm
  – Ensemble methods

37 Bias and variance reduction techniques
In the context of a given method:
– Adapt the learning algorithm to find the best trade-off between bias and variance.
– Not a panacea, but the least we can do.
– Examples: pruning, weight decay.
Ensemble methods:
– Change the bias/variance trade-off.
– Universal, but they destroy some features of the initial method.
– Examples: bagging, boosting.

38 Variance reduction: 1 model (1)
General idea: reduce the ability of the learning algorithm to fit the LS.
– Pruning reduces the model complexity explicitly.
– Early stopping reduces the amount of search.
– Regularization reduces the size of the hypothesis space; e.g. weight decay with neural networks consists in penalizing high weight values.
A sketch of these knobs on standard models follows below.
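A minimal illustration of the three knobs with scikit-learn estimators; the model choices, parameter values, and data are illustrative, not those used in the presentation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import Ridge
    from sklearn.neural_network import MLPRegressor

    # Three ways to restrict how closely a single model can fit the learning sample.
    pruned_tree = DecisionTreeRegressor(ccp_alpha=0.01)       # cost-complexity pruning
    ridge = Ridge(alpha=1.0)                                   # regularized linear model
    mlp = MLPRegressor(hidden_layer_sizes=(10,),
                       alpha=1e-3,                             # weight decay (L2 penalty)
                       early_stopping=True,                    # stop when validation score stalls
                       max_iter=2000, random_state=0)

    # usage: fit any of them like an ordinary scikit-learn estimator
    rng = np.random.default_rng(7)
    X = rng.uniform(0, 1, size=(500, 10))
    y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + rng.normal(0, 1, 500)
    for model in (pruned_tree, ridge, mlp):
        print(type(model).__name__, model.fit(X, y).score(X, y))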

39 Variance reduction: 1 model (2)
Selection of the optimal level of fitting:
– a priori (not optimal)
– by cross-validation (computationally less efficient): bias² ≈ error on the learning set, E ≈ error on an independent test set
[Figure: bias², variance and E = bias² + var as functions of the level of fitting, with the optimal fitting at the minimum of E]
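A sketch of complexity selection by cross-validation, here choosing k for k-NN with scikit-learn's GridSearchCV; the estimator, the grid, and the data are illustrative:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(8)
    X = rng.uniform(0, 1, size=(500, 10))
    y = 10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 10 * X[:, 3] + rng.normal(0, 1, 500)

    # pick the level of fitting (here: k, the complexity knob of k-NN)
    # by 5-fold cross-validation on the learning sample
    search = GridSearchCV(KNeighborsRegressor(),
                          param_grid={"n_neighbors": [1, 2, 5, 10, 20, 50]},
                          cv=5, scoring="neg_mean_squared_error")
    search.fit(X, y)
    print("selected k:", search.best_params_["n_neighbors"])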

40 Variance reduction: 1 model (3)
Examples:
– Post-pruning of regression trees
– Early stopping of MLP by cross-validation
[Table: E, bias, and variance for a full regr. tree (250), a pruned regr. tree (45), a fully learned MLP, and an early-stopped MLP; numeric values not recoverable from the transcript]
As expected, variance decreases but bias increases.

41 Ensemble methods
Combine the predictions of several models built with a learning algorithm in order to improve over the use of a single model.
Two important families:
– Averaging techniques: grow several models independently and simply average their predictions. Ex: bagging, random forests. They decrease mainly variance.
– Boosting-type algorithms: grow several models sequentially. Ex: AdaBoost, MART. They decrease mainly bias.
A sketch of both families follows below.
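A minimal sketch of the two families with scikit-learn, using bagged full trees (averaging) and gradient-boosted shallow trees (a MART-style boosting algorithm); the models, data, and parameter values are illustrative:

    import numpy as np
    from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(9)
    X = rng.uniform(0, 1, size=(500, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, 500))

    models = {
        # averaging: bag full (high-variance) trees to reduce variance
        "single tree": DecisionTreeRegressor(random_state=0),
        "bagged trees": BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                                         random_state=0),
        # boosting: grow shallow (high-bias) trees sequentially to reduce bias
        "boosted trees": GradientBoostingRegressor(n_estimators=200, max_depth=2,
                                                   learning_rate=0.1, random_state=0),
    }
    for name, model in models.items():
        mse = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
        print(f"{name:14s} CV mean squared error ≈ {mse:.2f}")

Bagging mainly stabilizes the high-variance full trees, while boosting reduces the bias of the shallow base trees, which is the distinction the slide draws.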

42 Conclusion (1)
The notions of bias and variance are very useful to predict how changing the (learning and problem) parameters will affect accuracy. E.g. this explains why very simple methods can work much better than more complex ones on very difficult tasks.
Variance reduction is a very important topic:
– Reducing bias is easy, but keeping variance low is not as easy.
– Especially in the context of new applications of machine learning to very complex domains: temporal data, biological data, Bayesian network learning, text mining…
Not all learning algorithms are equal in terms of variance; trees are among the worst methods by this criterion.

43 Conclusion (2)
Ensemble methods are very effective techniques to reduce bias and/or variance. They can turn a not-so-good method into a competitive one in terms of accuracy.