Cross-validation for detecting and preventing overfitting


A Regression Problem
y = f(x) + noise. Can we learn f from this data? Let's consider three methods…
[figure: scatter plot of the data, y against x]

Linear Regression
[figure: a straight-line fit to the data, y against x]

Linear Regression
Univariate linear regression with a constant term. The data are pairs (x_k, y_k):

  X   Y
  3   7
  1   3
  :   :

Stacked into vectors: X = (3, 1, ...)^T and y = (7, 3, ...)^T, so x_1 = 3, y_1 = 7, and so on.
Originally discussed in the previous Andrew Lecture: "Neural Nets".

Linear Regression
Univariate linear regression with a constant term. Prepend a constant 1 to each input, so z_k = (1, x_k):

  X   Y
  3   7
  1   3
  :   :

Z = [1 3; 1 1; ...] and y = (7, 3, ...)^T, so z_1 = (1, 3), y_1 = 7, and so on.

Linear Regression
Univariate linear regression with a constant term, with z_k = (1, x_k), Z = [1 3; 1 1; ...] and y = (7, 3, ...)^T.
Solve for the coefficients with b = (Z^T Z)^-1 (Z^T y) and predict with y_est = b0 + b1 x.
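Here is a minimal numpy sketch of that computation; the data values are invented for illustration, since the slide only shows the first couple of rows.

```python
import numpy as np

# Invented stand-ins for the slide's (x_k, y_k) pairs.
x = np.array([3.0, 1.0, 2.0, 4.0, 5.0])
y = np.array([7.0, 3.0, 5.0, 8.0, 11.0])

# Build Z by prepending a column of ones: z_k = (1, x_k).
Z = np.column_stack([np.ones_like(x), x])

# Normal equations: b = (Z^T Z)^-1 (Z^T y).
b = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Prediction: y_est = b0 + b1 * x.
y_est = Z @ b
```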

Quadratic Regression
[figure: a quadratic fit to the data, y against x]

Quadratic Regression
Same data, but each input is now expanded to a quadratic basis, z = (1, x, x^2):

  X   Y
  3   7
  1   3
  :   :

X = (3, 1, ...)^T, y = (7, 3, ...)^T, Z = [1 3 9; 1 1 1; ...]
b = (Z^T Z)^-1 (Z^T y), and the prediction is y_est = b0 + b1 x + b2 x^2.
Much more about this in the future Andrew Lecture: "Favorite Regression Algorithms".
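The same sketch extends directly to the quadratic basis (again with invented data values):

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0, 4.0, 5.0])   # invented values
y = np.array([7.0, 3.0, 5.0, 8.0, 11.0])

# Quadratic basis: z_k = (1, x_k, x_k^2).
Z = np.column_stack([np.ones_like(x), x, x**2])

b = np.linalg.solve(Z.T @ Z, Z.T @ y)      # b = (Z^T Z)^-1 (Z^T y)
y_est = Z @ b                              # y_est = b0 + b1*x + b2*x^2
```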

Join-the-dots
[figure: the data points connected by straight segments, y against x]
Also known as piecewise linear nonparametric regression, if that makes you feel better.
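For completeness, a short numpy version of join-the-dots (prediction by linear interpolation between neighboring training points; the data values are again invented):

```python
import numpy as np

# Training points, sorted by x (np.interp requires increasing x values).
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented values
y_train = np.array([3.0, 5.0, 7.0, 8.0, 11.0])

# Piecewise linear interpolation between the training points.
y_pred = np.interp(np.array([1.5, 3.7]), x_train, y_train)
```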

Which is best?
[figures: the three fits side by side]
Why not choose the method with the best fit to the data?

What do we really want?
[figures: the three fits side by side]
Why not choose the method with the best fit to the data? Because what we actually care about is: "How well are you going to predict future data drawn from the same distribution?"

The test set method
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
[figure: the data, with the test-set points highlighted]

The test set method (linear regression example)
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
[figure: the linear fit, trained on the training points only]

The test set method (linear regression example)
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.
[figure: the linear fit, evaluated on the held-out test points]
Mean Squared Error = 2.4

The test set method (quadratic regression example)
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.
[figure: the quadratic fit, evaluated on the held-out test points]
Mean Squared Error = 0.9

The test set method (join-the-dots example)
1. Randomly choose 30% of the data to be in a test set.
2. The remainder is a training set.
3. Perform your regression on the training set.
4. Estimate your future performance with the test set.
[figure: the join-the-dots fit, evaluated on the held-out test points]
Mean Squared Error = 2.2
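A sketch of the whole procedure for the linear and quadratic fits; the synthetic data, the 30% split, and the helper names are my own choices, so the MSE values will not match the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_poly(x, y, degree):
    """Least-squares fit of a degree-d polynomial (degree=1 is the linear model)."""
    Z = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(Z.T @ Z, Z.T @ y)

def predict_poly(b, x):
    return np.vander(x, len(b), increasing=True) @ b

# Invented data of the form y = f(x) + noise.
x = rng.uniform(0.0, 5.0, size=40)
y = 2.0 * np.sin(1.5 * x) + rng.normal(size=x.size)

# Steps 1-2: randomly hold out 30% as a test set, keep the rest for training.
is_test = rng.random(x.size) < 0.30
x_tr, y_tr = x[~is_test], y[~is_test]
x_te, y_te = x[is_test], y[is_test]

# Steps 3-4: fit each method on the training set, score it on the test set.
for degree in (1, 2):
    b = fit_poly(x_tr, y_tr, degree)
    mse = np.mean((predict_poly(b, x_te) - y_te) ** 2)
    print(f"degree {degree}: test-set MSE = {mse:.2f}")
```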

The test set method
Good news: very, very simple. You can then simply choose the method with the best test-set score.
Bad news: what's the downside?

The test set method
Good news: very, very simple. You can then simply choose the method with the best test-set score.
Bad news: it wastes data (we get an estimate of the best method to apply to 30% less data), and if we don't have much data, our test set might just be lucky or unlucky. We say that the "test-set estimator of performance has high variance".

LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
[figure: the full dataset, y against x]

LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
[figure: the dataset with one point removed]

LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
[figure: the fit trained on the remaining points]

LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
[figure: the fit's error on the held-out point]

LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
[figure: the fit's error on the held-out point]

LOOCV (Leave-one-out Cross Validation)
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
[figures: a 3×3 grid of panels, each showing the linear fit with one point held out]
MSE_LOOCV = 2.12
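A generic LOOCV loop, written against the fit_poly / predict_poly helpers sketched earlier (those helper names are my own, not from the slides):

```python
import numpy as np

def loocv_mse(x, y, fit, predict):
    """Leave-one-out CV: for each k, train on the other R-1 points,
    record the squared error on (x_k, y_k), and return the mean."""
    errors = []
    for k in range(len(x)):
        keep = np.arange(len(x)) != k          # everything except record k
        model = fit(x[keep], y[keep])
        pred_k = predict(model, x[k:k + 1])[0]
        errors.append((pred_k - y[k]) ** 2)
    return np.mean(errors)

# e.g. comparing the linear and quadratic fits:
# loocv_mse(x, y, lambda a, b: fit_poly(a, b, 1), predict_poly)
# loocv_mse(x, y, lambda a, b: fit_poly(a, b, 2), predict_poly)
```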

LOOCV for Quadratic Regression
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
[figures: a 3×3 grid of panels, each showing the quadratic fit with one point held out]
MSE_LOOCV = 0.962

LOOCV for Join-the-dots
For k = 1 to R:
1. Let (x_k, y_k) be the kth record.
2. Temporarily remove (x_k, y_k) from the dataset.
3. Train on the remaining R-1 datapoints.
4. Note your error on (x_k, y_k).
When you've done all points, report the mean error.
[figures: a 3×3 grid of panels, each showing the join-the-dots fit with one point held out]
MSE_LOOCV = 3.33

Which kind of Cross Validation?

                 Downside                                              Upside
  Test-set       Variance: unreliable estimate of future performance   Cheap
  Leave-one-out  Expensive. Has some weird behavior                    Doesn't waste data

…can we get the best of both worlds?

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).
[figure: the data, with points colored by partition]

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).
For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
[figure: the fit trained without the red points, evaluated on them]

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).
For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
[figure: the fit trained without the green points, evaluated on them]

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).
For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.
[figure: the fit trained without the blue points, evaluated on them]

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).
For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.
Then report the mean error.
[figure: the three held-out fits for linear regression]
Linear Regression: MSE_3FOLD = 2.05

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).
For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.
Then report the mean error.
[figure: the three held-out fits for quadratic regression]
Quadratic Regression: MSE_3FOLD = 1.11

k-fold Cross Validation
Randomly break the dataset into k partitions (in our example we'll have k=3 partitions, colored red, green and blue).
For the red partition: train on all the points not in the red partition; find the test-set sum of errors on the red points.
For the green partition: train on all the points not in the green partition; find the test-set sum of errors on the green points.
For the blue partition: train on all the points not in the blue partition; find the test-set sum of errors on the blue points.
Then report the mean error.
[figure: the three held-out fits for join-the-dots]
Join-the-dots: MSE_3FOLD = 2.93
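A matching k-fold sketch (the random partitioning and helper names are my own; with k=3 this mirrors the red/green/blue example above):

```python
import numpy as np

def kfold_mse(x, y, fit, predict, k=3, seed=0):
    """k-fold CV: randomly split the data into k partitions, hold each one
    out in turn, train on the rest, and report the mean error overall."""
    rng = np.random.default_rng(seed)
    fold_of = rng.permutation(len(x)) % k      # random partition label per point
    total_sq_error = 0.0
    for fold in range(k):
        held_out = fold_of == fold
        model = fit(x[~held_out], y[~held_out])
        total_sq_error += np.sum((predict(model, x[held_out]) - y[held_out]) ** 2)
    return total_sq_error / len(x)

# e.g. 3-fold CV for the quadratic fit:
# kfold_mse(x, y, lambda a, b: fit_poly(a, b, 2), predict_poly, k=3)
```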

Which kind of Cross Validation?

                 Downside                                                           Upside
  Test-set       Variance: unreliable estimate of future performance                Cheap
  Leave-one-out  Expensive. Has some weird behavior                                 Doesn't waste data
  10-fold        Wastes 10% of the data. 10 times more expensive than the test set  Only wastes 10%. Only 10 times more expensive instead of R times
  3-fold         More wasteful than 10-fold. More expensive than the test set       Slightly better than test-set
  R-fold         Identical to Leave-one-out

Which kind of Cross Validation?

                 Downside                                                           Upside
  Test-set       Variance: unreliable estimate of future performance                Cheap
  Leave-one-out  Expensive. Has some weird behavior                                 Doesn't waste data
  10-fold        Wastes 10% of the data. 10 times more expensive than the test set  Only wastes 10%. Only 10 times more expensive instead of R times
  3-fold         More wasteful than 10-fold. More expensive than the test set       Slightly better than test-set
  R-fold         Identical to Leave-one-out

But note: one of Andrew's joys in life is algorithmic tricks for making these cheap.
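The slide doesn't say what those tricks are; one standard example (my addition, not stated in the slides) is that for linear least squares LOOCV has a closed form via the hat matrix, so it costs only a single fit:

```python
import numpy as np

def loocv_mse_linear(Z, y):
    """Closed-form LOOCV for linear least squares: e_k^LOO = e_k / (1 - h_kk),
    where H = Z (Z^T Z)^-1 Z^T is the hat matrix and e_k are ordinary residuals."""
    H = Z @ np.linalg.solve(Z.T @ Z, Z.T)
    residuals = y - H @ y
    loo_residuals = residuals / (1.0 - np.diag(H))
    return np.mean(loo_residuals ** 2)
```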

CV-based Model Selection
We're trying to decide which algorithm to use. We train each machine and make a table…

  i   f_i   TRAINERR   10-FOLD-CV-ERR   Choice
  1   f_1
  2   f_2
  3   f_3                               ✓
  4   f_4
  5   f_5
  6   f_6

CV-based Model Selection
Example: choosing the number of hidden units in a one-hidden-layer neural net.
Step 1: Compute the 10-fold CV error for six different model classes:

  Algorithm        TRAINERR   10-FOLD-CV-ERR   Choice
  0 hidden units
  1 hidden unit
  2 hidden units                               ✓
  3 hidden units
  4 hidden units
  5 hidden units

Step 2: Whichever model class gave the best CV score: train it with all the data, and that's the predictive model you'll use.

CV-based Model Selection
Example: choosing "k" for a k-nearest-neighbor regression.
Step 1: Compute the LOOCV error for six different model classes:

  Algorithm   TRAINERR   LOOCV-ERR   Choice
  K=1
  K=2
  K=3
  K=4                                ✓
  K=5
  K=6

Step 2: Whichever model class gave the best CV score: train it with all the data, and that's the predictive model you'll use.
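A sketch of that recipe using scikit-learn (an assumption on my part; the slides don't prescribe any library), reusing the invented x, y data from the earlier test-set sketch:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X = x.reshape(-1, 1)                 # scikit-learn expects a 2-D feature matrix

# Step 1: LOOCV error for each candidate k.
cv_err = {}
for k in range(1, 7):
    model = KNeighborsRegressor(n_neighbors=k)
    scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
    cv_err[k] = -scores.mean()

# Step 2: refit the winning model class on all the data.
best_k = min(cv_err, key=cv_err.get)
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X, y)
```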