Overfitting and Its Avoidance


Overfitting and Its Avoidance (Chapter 5)

Overfitting When we build a predictive model (a tree or a mathematical function), we try to fit the model to the data, but we do not want to overfit. When our learner outputs a classification that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

Overfitting We want models to apply not just to the exact training set but to the general population from which the training data came. All data mining procedures have the tendency to overfit to some extent—some more than others.

Overfitting The answer is not to use a data mining procedure that doesn’t overfit because all of them do. Nor is the answer to simply use models that produce less overfitting, because there is a fundamental trade-off between model complexity and the possibility of overfitting. The best strategy is to recognize overfitting and to manage complexity in a principled way.

Holdout data Because evaluation on training data provides no assessment of how well the model generalizes to unseen cases, what we need to do is to “hold out” some data for which we know the value of the target variable, but which will not be used to build the model. Creating holdout data is like creating a “lab test” of generalization performance.

Holdout data We will simulate the use scenario on these holdout data: we will hide from the model (and possibly the modelers) the actual values for the target on the holdout data. The model will predict the values. Then we estimate the generalization performance by comparing the predicted values with the hidden true values. Thus, when the holdout data are used in this manner, they often are called the “test set.”
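A minimal sketch of this lab-test idea, assuming scikit-learn and a synthetic dataset (both our own choices for illustration, not the book's): the model is built only on the training portion and then scored on the held-out test set, whose true labels were hidden from the learner.

```python
# Minimal holdout ("test set") sketch; toolkit and data are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled data: we know the target value for every instance.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 30% of the data; it is not used to build the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Training accuracy:", model.score(X_train, y_train))      # typically near 1.0
print("Holdout (test) accuracy:", model.score(X_test, y_test))  # the generalization estimate
```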

Overfitting Examined This figure shows the difference between a modeling procedure’s accuracy on the training data and the accuracy on holdout data as model complexity changes. Figure 1. A typical fitting graph.

Overfitting in Tree Induction Recall how we built tree-structured models for classification. If we continue to split the data, eventually the subsets will be pure—all instances in any chosen subset will have the same value for the target variable. These will be the leaves of our tree. Figure 3. A typical fitting graph for tree induction.

Overfitting in Tree Induction Any training instance given to the tree for classification will make its way down, eventually landing at the appropriate leaf. It will be perfectly accurate, predicting correctly the class for every training instance. Figure 3. A typical fitting graph for tree induction.

Overfitting in Tree Induction A procedure that grows trees until the leaves are pure tends to overfit. This figure shows a typical fitting graph for tree induction. At some point (sweet spot) the tree starts to overfit: it acquires details of the training set that are not characteristic of the population in general, as represented by the holdout set. Figure 3. A typical fitting graph for tree induction.
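A rough way to trace such a fitting graph, again assuming scikit-learn; max_depth serves as a stand-in for tree complexity (None lets the tree grow until the leaves are pure). Training accuracy keeps climbing with complexity while holdout accuracy levels off and eventually drops.

```python
# Tracing a fitting graph: training vs. holdout accuracy as the tree grows deeper.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=1)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=1)

for depth in [1, 2, 4, 8, 16, None]:          # None = grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={tree.score(X_tr, y_tr):.3f}, "
          f"holdout={tree.score(X_ho, y_ho):.3f}")
```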

Overfitting in Mathematical Functions As we add more attributes x_i, the function becomes more and more complicated. Each x_i has a corresponding weight w_i, which is a learned parameter of the model. In two dimensions you can fit a line to any two points, and in three dimensions you can fit a plane to any three points.

Overfitting in Mathematical Functions This concept generalizes: as you increase the dimensionality, you can perfectly fit larger and larger sets of arbitrary points. And even if you cannot fit the dataset perfectly, you can fit it better and better with more dimensions---that is, with more attributes.

Example: Overfitting Linear Functions Data: sepal width, petal width. Types: Iris Setosa, Iris Versicolor. Two different separation lines: logistic regression and support vector machine. Figure 5-4.

Example: Overfitting Linear Functions Figure 5-4 Figure 5-5

Example: Overfitting Linear Functions Figure 5-4 Figure 5-6

Overfitting in Mathematical Functions In Figures 5-5 and 5-6, logistic regression appears to be overfitting. Arguably, the examples introduced in each are outliers that should not have a strong influence on the model—they contribute little to the “mass” of the species examples. Yet in the case of logistic regression they clearly do. The SVM tends to be less sensitive to individual examples.
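A hedged sketch of how this comparison could be recreated, assuming scikit-learn's bundled iris data; the column indices and class codes are our assumptions about how to reproduce the setup, not the book's code, and the result will not match Figures 5-4 to 5-6 exactly.

```python
# Setosa vs. Versicolor on sepal width and petal width, separated by
# logistic regression and a linear SVM (illustrative reconstruction).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
mask = iris.target < 2                    # class 0 = Setosa, class 1 = Versicolor
X = iris.data[mask][:, [1, 3]]            # sepal width (cm), petal width (cm)
y = iris.target[mask]

lr = LogisticRegression().fit(X, y)
svm = LinearSVC().fit(X, y)
print("Logistic regression boundary:", lr.coef_[0], lr.intercept_[0])
print("Linear SVM boundary:        ", svm.coef_[0], svm.intercept_[0])
```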

Example: Overfitting Linear Functions Adding an additional feature, sepal width squared, gives a boundary that is a parabola. Figures 6 and 7.

Example: Why is Overfitting Bad? The short answer is that as a model gets more complex it is allowed to pick up harmful spurious correlations. These correlations are idiosyncrasies of the specific training set used and do not represent characteristics of the population in general. The harm occurs when these spurious correlations produce incorrect generalizations in the model. This is what causes performance to decline when overfitting occurs.

Example: Why is Overfitting Bad? Consider a simple two-class problem with classes c1 and c2 and attributes x and y. We have a population of examples, evenly balanced between the classes. Attribute x has two values, p and q, and y has two values, r and s. In the general population, x = p occurs 75% of the time in class c1 examples and in 25% of the c2 examples, so x provides some prediction of the class.

Example: Why is Overfitting Bad? By design, y has no predictive power at all, and indeed we see that in the data sample both of y’s values occur in both classes equally. In short, the instances in this domain are difficult to separate, with only x providing some predictive power. The best we can achieve is 75% accuracy by looking at x.

Example: Why is Overfitting Bad? Table 5-1 shows a very small training set of examples from this domain. What would a classification tree learner do with these?

Table 5-1. A small set of training examples
Instance  x  y  Class
1         p  r  c1
2         p  r  c1
3         p  r  c1
4         q  s  c1
5         p  s  c2
6         q  r  c2
7         q  s  c2
8         q  r  c2

Example: Why is Overfitting Bad? We first obtain the tree on the left:

Example: Why is Overfitting Bad? However, observe from Table 5-1 that in this particular dataset y’s values of r and s are not evenly split between the classes, so y does seem to provide some predictiveness. Specifically, once we choose x=p, we see that y=r predicts c1 perfectly (instances 1–3 in Table 5-1 above).

Example: Why is Overfitting Bad? Hence, from this dataset, tree induction would achieve information gain by splitting on y’s values and create two new leaf nodes, shown in the right tree.

Example: Why is Overfitting Bad? Based on our training set, the right tree performs well, better than the left tree. It classifies seven of the eight training examples correctly, whereas the left tree classifies only six out of eight correctly.

Example: Why is Overfitting Bad? But this is because y=r correlates with class c1 purely by chance in this data sample; in the general population there is no such correlation (see Table 5-1 above).

Example: Why is Overfitting Bad? We have been misled, and the extra branch in the right tree is not simply extraneous; it is harmful.
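The whole example can be reproduced in a few lines, assuming scikit-learn; the 0/1 encodings of p/q, r/s, and c1/c2 are our own. A tree grown to purity splits first on x and then picks up the chance correlation between y=r and c1, reaching 7/8 training accuracy with a split that is harmful on the general population.

```python
# Table 5-1 in code; the integer encodings are illustrative choices.
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: x (p=1, q=0), y (r=1, s=0); target: c1=1, c2=0 (instances 1-8 of Table 5-1)
X = [[1, 1], [1, 1], [1, 1], [0, 0], [1, 0], [0, 1], [0, 0], [0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x", "y"]))
print("Training accuracy:", tree.score(X, y))   # 7/8 here, but the split on y is spurious
```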

Summary First, this phenomenon is not particular to classification trees. Trees are convenient for this example because it is easy to point to a portion of a tree and declare it to be spurious, but all model types are susceptible to overfitting effects.

Summary Second, this phenomenon is not due to the training data in Table 5-1 being atypical or biased. Every dataset is a finite sample of a larger population, and every sample will have variations even when there is no bias in the sampling.

Summary Finally, as we have said before, there is no general analytic way to determine in advance whether a model has overfit or not. In this example we defined what the population looked like so we could declare that a given model had overfit. In practice, you will not have such knowledge and it will be necessary to use a holdout set to detect overfitting.

From Holdout Evaluation to Cross-Validation At the beginning of this chapter we introduced the idea that in order to have a fair evaluation of the generalization performance of a model, we should estimate its accuracy on holdout data—data not used in building the model, but for which we do know the actual value of the target variable. Holdout testing is similar to other sorts of evaluation in a “laboratory” setting.

From Holdout Evaluation to Cross-Validation While a holdout set will indeed give us an estimate of generalization performance, it is just a single estimate. Should we have any confidence in a single estimate of model accuracy?

Cross-Validation Cross-validation is a more sophisticated holdout training and testing procedure. Cross-validation makes better use of a limited dataset. Unlike splitting the data into one training and one holdout set, cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing.

Cross-Validation Cross-validation begins by splitting a labeled dataset into k partitions called folds. Typically, k will be five or ten.

Cross-Validation The top pane of Figure 5-9 shows a labeled dataset (the original dataset) split into five folds. Cross-validation then iterates training and testing k times, in a particular way. As depicted in the bottom pane of Figure 5-9, in each iteration of the cross-validation, a different fold is chosen as the test data.

Cross-Validation In this iteration, the other k–1 folds are combined to form the training data. So, in each iteration we have (k–1)/k of the data used for training and 1/k used for testing.

From Holdout Evaluation to Cross-Validation Holdout evaluation splits the data into only one training set and one holdout set. Cross-validation computes its estimates over all the data by performing multiple splits and systematically swapping out samples for testing (k folds, typically k = 5 or 10).
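A minimal cross-validation sketch, again assuming scikit-learn and a synthetic dataset; cv=10 produces ten folds, each serving once as the test set.

```python
# k-fold cross-validation: every instance is used for testing exactly once.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy: %.3f  (std %.3f)" % (scores.mean(), scores.std()))
```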

The Churn Dataset Revisited Consider again the churn dataset introduced in “Example: Addressing the Churn Problem with Tree Induction” on page 73. In that section we used the entire dataset both for training and testing, and we reported an accuracy of 73%. We ended that section by asking a question, Do you trust this number?

The Churn Dataset Revisited

The Churn Dataset Revisited Figure 5-10 shows the results of ten-fold cross-validation. In fact, two model types are shown. The top graph shows results with logistic regression, and the bottom graph shows results with classification trees.

The Churn Dataset Revisited To be precise: the dataset was first shuffled, then divided into ten partitions. Each partition in turn served as a single holdout set while the other nine were collectively used for training. The horizontal line in each graph is the average of accuracies of the ten models of that type.

Observations (cf. “Example: Addressing the Churn Problem with Tree Induction” in Chapter 3). The average accuracy of the folds with classification trees is 68.6%—significantly lower than our previous measurement of 73%. This means there was some overfitting occurring with the classification trees, and this new (lower) number is a more realistic measure of what we can expect.

Observations (continued). For accuracy with classification trees, there is variation in performance across the different folds (the standard deviation of the fold accuracies is 1.1), so it is a good idea to average them to get a notion of both the expected performance and the variation we can expect from inducing classification trees on this dataset.

Observations (continued). The logistic regression models show slightly lower average accuracy (64.1%) with higher variation (a standard deviation of 1.3). Neither model type did very well on Fold Three, and both performed well on Fold Ten. Classification trees may be preferable to logistic regression here because of their greater stability and performance.

Learning Curves All else being equal, the generalization performance of data-driven modeling generally improves as more training data become available, up to a point. (The book illustrates this with learning curves for the telecommunications churn problem.)

Learning Curves With more flexibility in a model comes more overfitting. Logistic regression has less flexibility, which allows it to overfit less with small data but keeps it from modeling the full complexity of the data. Tree induction is much more flexible, leading it to overfit more with small data but to model more complex regularities with larger training sets.

Learning Curves The learning curve has additional analytical uses. For example, we’ve made the point that data can be an asset. The learning curve may show that generalization performance has leveled off so investing in more training data is probably not worthwhile; instead, one should accept the current performance or look for another way to improve the model, such as by devising better features.
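A sketch of how a learning curve could be computed; scikit-learn's learning_curve and the synthetic data are our choices for illustration.

```python
# Learning curve: cross-validated accuracy as a function of training-set size.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=20, n_informative=8, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5))

for n, acc in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> cross-validated accuracy {acc:.3f}")
```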

Avoiding Overfitting with Tree Induction Tree induction has much flexibility and therefore will tend to overfit a good deal without some mechanism to avoid it. Tree induction commonly uses two techniques to avoid overfitting: (i) stop growing the tree before it gets too complex, or (ii) grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).

To stop growing the tree before it gets too complex The simplest method to limit tree size is to specify a minimum number of instances that must be present in a leaf. A key question becomes what threshold we should use.

To stop growing the tree before it gets too complex How few instances are we willing to tolerate at a leaf? Five instances? Thirty? One hundred? There is no fixed number, although practitioners tend to have their own preferences based on experience. However, researchers have developed techniques to decide the stopping point statistically.

To stop growing the tree before it gets too complex For stopping tree growth, an alternative to setting a fixed minimum leaf size is to conduct a hypothesis test at every leaf to determine whether the observed difference in (say) information gain could have arisen by chance (for example, requiring a p-value of at most 5%). If the test concludes that the difference is unlikely to be due to chance, the split is accepted and tree growing continues.
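A sketch of the simpler stopping rule (a minimum leaf size), assuming scikit-learn; the thresholds below are illustrative only, since, as noted above, there is no universally correct number.

```python
# Pre-pruning: limit tree size via a minimum number of instances per leaf.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_ho, y_tr, y_ho = train_test_split(X, y, test_size=0.3, random_state=0)

for leaf in [1, 5, 30, 100]:
    t = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X_tr, y_tr)
    print(f"min_samples_leaf={leaf:3d}  leaves={t.get_n_leaves():3d}  "
          f"train={t.score(X_tr, y_tr):.3f}  holdout={t.score(X_ho, y_ho):.3f}")
```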

To grow the tree until it is too large, then “prune” it back Pruning means to cut off leaves and branches, replacing them with leaves. One general idea is to estimate whether replacing a set of leaves or a branch with a leaf would reduce accuracy. If not, then go ahead and prune. The process can be iterated on progressive subtrees until any removal or replacement would reduce accuracy.
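A post-pruning sketch. scikit-learn's built-in post-pruning is cost-complexity pruning (the ccp_alpha parameter), a different criterion from the accuracy-based replacement described above, but the overall idea is the same: grow a large tree, then cut it back, keeping the version that does best on validation data.

```python
# Post-pruning via cost-complexity pruning, choosing alpha on a validation split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate pruning strengths for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best_alpha = max(path.ccp_alphas, key=lambda a: DecisionTreeClassifier(
    ccp_alpha=a, random_state=0).fit(X_tr, y_tr).score(X_va, y_va))

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_tr, y_tr)
print(f"alpha={best_alpha:.5f}  leaves={pruned.get_n_leaves()}  "
      f"validation accuracy={pruned.score(X_va, y_va):.3f}")
```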

A General Method for Avoiding Overfitting Nested holdout testing: Split the original set into a training set and a test set, saving the test set for a final assessment. We can then take the training set and split it again into a training subset and a validation set. Use the training subset and validation set to find the best model, e.g., a tree with a complexity of 122 nodes (the “sweet spot”). The final holdout (test) set is used to estimate the actual generalization performance. One more twist: we could use the training subset/validation split to pick the best complexity without tainting the test set, and then build a model of this best complexity on the entire training set (training subset plus validation). This approach is used in many sorts of modeling algorithms to control complexity; the general method is to choose the value for some complexity parameter by using some sort of nested holdout procedure. (Diagram: the original data split into a training set and a test (holdout) set; the training set is further split into a training subset and a validation set, with the test set kept as the final holdout.)
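A sketch of this nested holdout procedure, with max_depth standing in for "complexity" and scikit-learn as the assumed toolkit: pick the complexity on the validation split, then rebuild at that complexity on the entire training set and score once on the untouched test set.

```python
# Nested holdout: training subset + validation set choose the complexity;
# the final test set is touched only once, for the generalization estimate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_sub, X_val, y_sub, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

depths = [2, 4, 6, 8, 10, None]
val_acc = [DecisionTreeClassifier(max_depth=d, random_state=0)
           .fit(X_sub, y_sub).score(X_val, y_val) for d in depths]
best_depth = depths[int(np.argmax(val_acc))]

# "One more twist": rebuild at the chosen complexity on training subset + validation.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
print("Chosen max_depth:", best_depth, " final test accuracy:", final.score(X_test, y_test))
```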

Nested cross-validation Nested cross-validation is more complicated, but it works as you might suspect. Say we would like to do cross-validation to assess the generalization accuracy of a new modeling technique, which has an adjustable complexity parameter C, but we do not know how to set it.

Nested cross-validation So, we run cross-validation as described above. However, before building the model for each fold, we take that fold's training set (refer to Figure 5-9) and first run an experiment: we run another entire cross-validation on just that training set to find the value of C estimated to give the best accuracy.

Nested cross-validation The result of that experiment is used only to set the value of C to build the actual model for that fold of the cross-validation. Then we build another model using the entire training fold, using that value for C, and test on the corresponding test fold.

Nested cross-validation The only difference from regular cross-validation is that for each fold we first run this experiment to find C, using another, smaller, cross-validation.

Nested cross-validation If you understood all that, you will realize that if we used 5-fold cross-validation in both cases, we would actually build 30 models in the process (yes, thirty). This sort of experimental complexity-controlled modeling only gained broad practical application over the last decade or so, because of the obvious computational burden involved.
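A sketch of nested cross-validation, with GridSearchCV over max_depth standing in for the adjustable complexity parameter C (our substitutions, not the book's): the inner 5-fold search sets the parameter for each outer training fold, and the outer 5 folds estimate generalization accuracy.

```python
# Nested cross-validation: an inner search chooses the complexity parameter,
# an outer cross-validation estimates accuracy. Many models are built in total.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

inner_search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                            param_grid={"max_depth": [2, 4, 8, None]}, cv=5)
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print("Nested CV accuracy: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))
```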

Nested cross-validation This idea of using the data to choose the complexity experimentally, as well as to build the resulting model, applies across different induction algorithms and different sorts of complexity. For example, we mentioned that complexity increases with the size of the feature set, so it is usually desirable to cull (i.e., select a subset of) the feature set.

Nested cross-validation A common method for doing this is to run with many different feature sets, using this sort of nested holdout procedure to pick the best.

Sequential forward selection For example, sequential forward selection (SFS) of features uses a nested holdout procedure to first pick the best individual feature, by looking at all models built using just one feature. After choosing a first feature, SFS tests all models that add a second feature to this first chosen feature. The best pair is then selected.

Sequential forward selection Next the same procedure is done for three, then four, and so on. When adding a feature does not improve classification accuracy on the validation data, the SFS process stops.

Sequential backward selection There is a similar procedure called sequential backward elimination of features. As you might guess, it works by starting with all features and discarding features one at a time. It continues to discard features as long as there is no performance loss.
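A sketch of both procedures, assuming scikit-learn version 0.24 or later, whose SequentialFeatureSelector implements forward and backward search; note that it selects a fixed number of features rather than stopping automatically when accuracy stops improving, so n_features_to_select only approximates the stopping rule described above.

```python
# Sequential forward selection and backward elimination of features,
# each scored with an inner cross-validation (cv=5).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=15, n_informative=4, random_state=0)
base = DecisionTreeClassifier(random_state=0)

forward = SequentialFeatureSelector(base, n_features_to_select=4,
                                    direction="forward", cv=5).fit(X, y)
backward = SequentialFeatureSelector(base, n_features_to_select=4,
                                     direction="backward", cv=5).fit(X, y)
print("Forward selection picks features:", forward.get_support(indices=True))
print("Backward elimination keeps:      ", backward.get_support(indices=True))
```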

Nested cross-validation This is a common approach. In modern environments with plentiful data and computational power, the data scientist routinely sets modeling parameters by experimenting using some tactical, nested holdout testing (often nested cross-validation).