Model Selection and Validation. “All models are wrong; some are useful.” (George E. P. Box) Some slides were taken from: J. C. Spall, Modeling Considerations and Statistical Information; G. Hinton, Preventing Overfitting; Bei Yu, Model Assessment.

13-2 Model Definition. Assume the model z = h(x, θ) + v, where z is the output, h(·) is some function, x is the input, v is noise, and θ is the vector of model parameters. A fundamental goal is to take n data points and estimate θ, forming the estimate θ̂_n.

13-3 Model Error Definition. Given a data set {(x_i, y_i)}, i = 1, …, n, and a model output h(x, θ̂_n), where θ̂_n is taken from some family of parameters, the sum of squared errors (SSE; dividing by n gives the MSE) is Σ_i [y_i − h(x_i, θ̂_n)]², and the likelihood is Π_i P(h(x_i, θ̂_n) | x_i).
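
As a concrete illustration (not from the original slides), here is a minimal Python sketch that computes the SSE/MSE and, assuming i.i.d. Gaussian noise with known variance, the corresponding log-likelihood for a candidate model h. The linear model, noise level, and data are illustrative assumptions.

```python
import numpy as np

def sse(y, y_hat):
    """Sum of squared errors between observations and model predictions."""
    return float(np.sum((y - y_hat) ** 2))

def gaussian_log_likelihood(y, y_hat, noise_var=1.0):
    """Log-likelihood of the data under the model, assuming i.i.d. Gaussian noise."""
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * noise_var) - sse(y, y_hat) / (2 * noise_var)

# Hypothetical example: a linear model h(x, theta) = theta[0] + theta[1] * x
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)
theta = np.array([2.0, 3.0])
y_hat = theta[0] + theta[1] * x
print("SSE:", sse(y, y_hat), "MSE:", sse(y, y_hat) / len(y))
print("log-likelihood:", gaussian_log_likelihood(y, y_hat, noise_var=0.1**2))
```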

13-4 The error surface, as a function of the model parameters, can look like this:

13-5 The error surface can also look like this. Which one is better?

13-6 Properties of the error surfaces. The first surface is rough, so a small change in parameter space can lead to a large change in error. Because of the steepness of the surface, a minimum can be located, although a gradient-descent optimization algorithm can get stuck in local minima. The second surface is very smooth, so a large change in the parameter set does not lead to much change in model error; in other words, performance is expected to carry over, with test-set performance similar to the performance seen during training.

13-7 Parameter stability. A finer point: even when the surface is very smooth, it may be impossible to reach the true minimum, which suggests that selecting models by a smoothness penalty alone may be misleading. Breiman (1992) showed that even in simple problems with simple nonlinear models, the degree of generalization depends strongly on the stability of the parameters.

The goal: gain information about the expected error. Suggested reading: S. Geman, E. Bienenstock, and R. Doursat, "Neural Networks and the Bias/Variance Dilemma," Neural Computation, 4, 1992, 1-58.

13-9 Bias-Variance Decomposition. The MSE of the model at a fixed x can be decomposed as
E{ [h(x, θ̂_n) − E(z|x)]² | x } = E{ [h(x, θ̂_n) − E(h(x, θ̂_n))]² | x } + [E(h(x, θ̂_n)) − E(z|x)]²
= variance at x + (bias at x)²,
where the expectations are computed with respect to the randomness in the estimate θ̂_n (i.e., in the training data). This implies: a model that is too simple gives high bias and low variance; a model that is too complex gives low bias and high variance.
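
To make the decomposition concrete, here is a small Monte Carlo sketch (not from the original slides; the sine target, noise level σ = 0.3, polynomial degrees, and number of replications are illustrative assumptions). It estimates bias² and variance at a fixed point x₀ for fitted models of different complexity; since the noise has zero mean, E(z|x₀) = sin(x₀).

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                      # assumed true regression function E(z|x)
x0, sigma, n, runs = 1.0, 0.3, 30, 500

for degree in (1, 3, 10):       # increasing model complexity
    preds = []
    for _ in range(runs):
        x = rng.uniform(0, 2 * np.pi, n)
        z = f(x) + rng.normal(scale=sigma, size=n)
        coeffs = np.polyfit(x, z, degree)       # fit h(x, theta_hat) to one training set
        preds.append(np.polyval(coeffs, x0))    # prediction at the fixed point x0
    preds = np.array(preds)
    bias2 = (preds.mean() - f(x0)) ** 2         # (bias at x0)^2
    variance = preds.var()                      # variance at x0
    print(f"degree {degree:2d}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```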

13-10 Bias-Variance Tradeoff in Model Selection in a Simple Problem

13-11 Model Selection. The bias-variance tradeoff provides a conceptual framework for determining a good model, but it is not directly useful in practice. Many methods exist for the practical determination of a good model: AIC, Bayesian selection, cross-validation, minimum description length, VC dimension, etc. All of these methods are based on a tradeoff between fitting error (related to bias) and model complexity (related to variance). Cross-validation is one of the most popular model selection methods.

13-12 Cross-Validation. Cross-validation is a simple, general method for comparing candidate models; other, specialized methods may work better in specific problems. Cross-validation uses only the training set of data, although it does not work on some pathological distributions. The method is based on iteratively partitioning the full set of training data into training and test subsets. For each partition, the model is estimated from the training subset and evaluated on the test subset; the model that performs best over all test subsets is selected.

13-13 Division of Data for Cross-Validation with Disjoint Test Subsets

13-14 Typical Steps for Cross-Validation.
Step 0 (initialization): Determine the size of the test subsets and the candidate model. Let i be the counter for the test subset being used.
Step 1 (estimation): For the i-th test subset, let the remaining data be the i-th training subset. Estimate θ from this training subset.
Step 2 (error calculation): Based on the estimate of θ from Step 1 (the i-th training subset), calculate the MSE (or other measure) on the data in the i-th test subset.
Step 3 (new training/test subset): Update i to i + 1 and return to Step 1. When all test subsets have been evaluated, form the mean of the MSE values.
Step 4 (new model): Repeat Steps 1 to 3 for the next candidate model. Choose the model with the lowest mean MSE as best.
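
A minimal Python sketch of Steps 0-3 (the fit/predict interface, the fold count, and the data are assumptions for illustration; this is not code from the original slides):

```python
import numpy as np

def cross_validate(fit, predict, x, y, n_folds=5):
    """Disjoint test subsets; fit on the rest; return the mean test-subset MSE."""
    idx = np.arange(len(x))
    folds = np.array_split(idx, n_folds)          # Step 0: disjoint test subsets
    mses = []
    for test_idx in folds:                        # Steps 1-3: loop over the subsets
        train_idx = np.setdiff1d(idx, test_idx)
        theta = fit(x[train_idx], y[train_idx])   # Step 1: estimate theta
        resid = y[test_idx] - predict(theta, x[test_idx])
        mses.append(np.mean(resid ** 2))          # Step 2: MSE on the i-th test subset
    return np.mean(mses)                          # Step 3: mean MSE over all folds

# Step 4: repeat for each candidate model and keep the one with the lowest mean MSE.
```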

13-15 Numerical Illustration of Cross-Validation (Example 13.4 in ISSO). Consider a true system corresponding to a sine function of the input plus additive normally distributed noise, and three candidate models: a linear (affine) model, a 3rd-order polynomial, and a 10th-order polynomial. Suppose 30 data points are available, divided into 5 disjoint test subsets. Based on the RMS error (equivalent to MSE for ranking purposes) over the test subsets, the 3rd-order polynomial is preferred. See the following plot.

13-16 Numerical Illustration (cont’d): Relative Fits for 3 Models with Low-Noise Observations

13-17 Standard approach to Model Selection. Optimize the likelihood or the mean squared error concurrently with a complexity penalty. Some penalties: the norm of the weight vector, smoothness, the number of terminal leaves (in CART), variance weights, cross-validation error, etc. Most computational time is spent on optimizing the parameter solution via sophisticated gradient-descent methods, or even global-minimum-seeking methods.
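
In symbols, this standard penalized criterion can be sketched as follows (λ and the generic penalty Ω are placeholders, not notation taken from the slides):

```latex
\hat{\theta} \;=\; \arg\min_{\theta}\;
\Big\{ \sum_{i=1}^{n} \big[\, y_i - h(x_i, \theta) \,\big]^2 \;+\; \lambda\, \Omega(\theta) \Big\}
```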

13-18 Alternative approach: MDL-based model selection (discussed later).

13-19 Model Complexity

13-20 Ways to Prevent Overfitting (Hinton) The training data contains information about the regularities in the mapping from input to output. But it also contains noise –The target values may be unreliable. –There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen. When we fit the model, it cannot tell which regularities are real and which are caused by sampling error. –So it fits both kinds of regularity. –If the model is very flexible it can model the sampling error really well. This is a disaster.

13-21 Preventing overfitting Use a model that has the right capacity: –enough to model the true regularities –not enough to also model the spurious regularities (assuming they are weaker). Standard ways to limit the capacity of a neural net: –Limit the number of hidden units. –Limit the size of the weights. –Stop the learning before it has time to overfit.

13-22 Limiting the size of the weights. Weight-decay involves adding an extra term to the cost function that penalizes the squared weights. –This keeps weights small unless they have big error derivatives.
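
A standard way to write the weight-decay cost (λ, the unregularized error E, and this notation are not from the slide itself):

```latex
C \;=\; E \;+\; \frac{\lambda}{2}\sum_i w_i^2,
\qquad
\frac{\partial C}{\partial w_i} \;=\; \frac{\partial E}{\partial w_i} + \lambda\, w_i
```

At a minimum of C, w_i = −(1/λ) ∂E/∂w_i, so a weight stays small unless it has a large error derivative, which is exactly the behaviour described above.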

13-23 The effect of weight-decay. It prevents the network from using weights that it does not need. –This can often improve generalization a lot. –It helps to stop the model from fitting the sampling error. –It makes a smoother model in which the output changes more slowly as the input changes. If the network has two very similar inputs, it prefers to put half the weight on each rather than all the weight on one.

13-24 Model selection. How do we decide which limit to use and how strong to make it? –If we use the test data, we get an unfair estimate of the error rate we would get on new test data. –Suppose we compared a set of models that gave random results; the best one on a particular dataset would do better than chance, but it won't do better than chance on another test set. So we use a separate validation set to do model selection.

13-25 Using a validation set. Divide the total dataset into three subsets: –Training data is used for learning the parameters of the model. –Validation data is not used for learning the parameters but for deciding what type of model and what amount of regularization work best. –Test data is used to get a final, unbiased estimate of how well the network works; we expect this estimate to be worse than on the validation data. We could then re-divide the total dataset to get another unbiased estimate of the true error rate.

13-26 Early stopping. If we have lots of data and a big model, it is very expensive to keep re-training it with different amounts of weight decay. It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse (but don't get fooled by noise!). The capacity of the model is limited because the weights have not had time to grow big.
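
A minimal sketch of an early-stopping loop with patience (the train_step and validation_loss callables, the model's get_state/set_state methods, and the patience value are hypothetical placeholders, not from the slides):

```python
def train_with_early_stopping(model, train_step, validation_loss,
                              max_epochs=1000, patience=10):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_step(model)                      # one pass over the training data
        val_loss = validation_loss(model)      # monitor held-out performance
        if val_loss < best_loss:
            best_loss, best_state = val_loss, model.get_state()
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1    # tolerate noise in the validation curve
            if epochs_without_improvement >= patience:
                break
    model.set_state(best_state)                # roll back to the best weights seen
    return model
```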

13-27 Why early stopping works. When the weights are very small, every hidden unit is in its linear range. –So a net with a large layer of hidden units is linear. –It has no more capacity than a linear net in which the inputs are directly connected to the outputs! As the weights grow, the hidden units start using their non-linear ranges, so the capacity grows.

13-28 Minimum Description Length Approach

13-29 Model Assessment and Selection (Bei Yu). Loss functions and error rates; bias, variance, and model complexity; optimism of the training error; AIC (Akaike Information Criterion); BIC (Bayesian Information Criterion); MDL (Minimum Description Length).

13-30 Model Assessment and Selection Model Selection: –estimating the performance of different models in order to choose the best one. Model Assessment: –having chosen the model, estimating the prediction error on new data.

13-31 Approaches. Data-rich situation: –data split: Train-Validation-Test –typical split: 50%-25%-25% (how to choose the split?) Data-insufficient situation: –analytical approaches: AIC, BIC, MDL, SRM –efficient sample re-use approaches: cross-validation, bootstrapping.
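
A minimal Python sketch of the 50%-25%-25% split (the shuffling, the proportions, and the array names are illustrative assumptions; in practice the split should also respect any grouping or time structure in the data):

```python
import numpy as np

def train_val_test_split(x, y, fractions=(0.5, 0.25, 0.25), seed=0):
    """Randomly split the data into training, validation, and test subsets."""
    assert abs(sum(fractions) - 1.0) < 1e-9
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_train = int(fractions[0] * len(x))
    n_val = int(fractions[1] * len(x))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (x[train], y[train]), (x[val], y[val]), (x[test], y[test])
```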

13-32 Model Complexity

13-33 Loss Functions. Continuous response: squared error, absolute error. Categorical response: 0-1 loss, log-likelihood.
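
The loss functions listed above, written out in their standard forms (the slide showed them as images; these are the usual definitions, with f̂ the fitted model, Ĝ the predicted class, and p̂_k(X) the estimated class probabilities):

```latex
\begin{aligned}
\text{squared error:}\quad & L\big(Y, \hat f(X)\big) = \big(Y - \hat f(X)\big)^2 \\
\text{absolute error:}\quad & L\big(Y, \hat f(X)\big) = \big|\,Y - \hat f(X)\,\big| \\
\text{0-1 loss:}\quad & L\big(G, \hat G(X)\big) = I\big(G \neq \hat G(X)\big) \\
\text{log-likelihood (deviance):}\quad & L\big(G, \hat p(X)\big) = -2 \log \hat p_{G}(X)
\end{aligned}
```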

13-34 Error Functions. Training error: the average loss over the training sample (using the appropriate loss for a continuous or categorical response). Generalization (test) error: the expected prediction error over an independent test sample.
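
Written out (again the standard definitions rather than the slide's own images):

```latex
\overline{\mathrm{err}} \;=\; \frac{1}{N}\sum_{i=1}^{N} L\big(y_i, \hat f(x_i)\big),
\qquad
\mathrm{Err} \;=\; \mathrm{E}\big[\, L\big(Y, \hat f(X)\big) \,\big]
```

where the expectation in Err is taken over a new (X, Y) drawn independently of the training sample.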

13-35 Bias-Variance Decomposition. Assume Y = f(X) + ε with E(ε) = 0 and Var(ε) = σ². The expected prediction error at a point x₀ then decomposes into irreducible error, squared bias, and variance, and the decomposition can be written explicitly for k-nearest neighbours, the linear least-squares fit, and ridge regression.
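
The formulas appeared as images in the original slides; under the assumption above, the standard expressions (as given in Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, ch. 7) are:

```latex
\begin{aligned}
\mathrm{Err}(x_0) &= \sigma_\varepsilon^2
  + \big[\mathrm{E}\hat f(x_0) - f(x_0)\big]^2
  + \mathrm{E}\big[\hat f(x_0) - \mathrm{E}\hat f(x_0)\big]^2 \\[4pt]
\text{$k$-NN:}\quad \mathrm{Err}(x_0) &= \sigma_\varepsilon^2
  + \Big[f(x_0) - \tfrac{1}{k}\sum_{\ell=1}^{k} f\big(x_{(\ell)}\big)\Big]^2
  + \frac{\sigma_\varepsilon^2}{k} \\[4pt]
\text{linear fit } \hat f_p(x_0) = x_0^{\mathsf T}\hat\beta:\quad
\mathrm{Err}(x_0) &= \sigma_\varepsilon^2
  + \big[f(x_0) - \mathrm{E}\hat f_p(x_0)\big]^2
  + \|h(x_0)\|^2\, \sigma_\varepsilon^2
\end{aligned}
```

Here h(x₀) = X(XᵀX)⁻¹x₀ is the vector of weights producing f̂_p(x₀); ridge regression has the same form with h(x₀) replaced by X(XᵀX + αI)⁻¹x₀.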

13-36 Detailed Decomposition for the Linear Model Family. The average squared bias can be further decomposed into an average model bias (the error between the best-fitting linear approximation and the true function) and an average estimation bias. The estimation bias is zero for linear least squares (LLSF) and positive for ridge regression, where it is traded off against a reduction in variance.
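
A hedged reconstruction of the decomposition referred to here (following Hastie, Tibshirani and Friedman; β_* denotes the coefficients of the best-fitting linear approximation to f, and β̂_α the fitted, possibly regularized, coefficients):

```latex
\mathrm{E}_{x_0}\big[\, f(x_0) - \mathrm{E}\hat f_\alpha(x_0) \,\big]^2
\;=\;
\mathrm{E}_{x_0}\big[\, f(x_0) - x_0^{\mathsf T}\beta_* \,\big]^2
\;+\;
\mathrm{E}_{x_0}\big[\, x_0^{\mathsf T}\beta_* - \mathrm{E}\, x_0^{\mathsf T}\hat\beta_\alpha \,\big]^2
```

That is, average squared bias = average [model bias]² + average [estimation bias]²; the second term vanishes for least squares and is positive for ridge regression (α > 0).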

13-37 Bias-Variance Tradeoff

13-38 Key Methods to Estimate Prediction Error. Estimate the optimism of the training error, then add it to the training error rate. AIC: choose the model with the smallest AIC. BIC: choose the model with the smallest BIC.
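
For reference, the standard forms of these criteria (with loglik the maximized log-likelihood, d the number of effective parameters, and N the sample size; these definitions are not spelled out on the slide itself):

```latex
\mathrm{AIC} \;=\; -\frac{2}{N}\,\mathrm{loglik} \;+\; 2\,\frac{d}{N},
\qquad
\mathrm{BIC} \;=\; -2\,\mathrm{loglik} \;+\; (\log N)\, d
```

Both penalize the training fit by a complexity term; BIC's penalty grows with log N, so it favours simpler models than AIC for large samples.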

13-39 Summary. Cross-validation is a practical way to estimate model error. Model estimation should be done with a complexity penalty. Once the best model has been chosen, re-estimate it on the whole data set, or average the models obtained over the cross-validation folds.