Lecture 12 – Model Assessment and Selection Rice ECE697 Farinaz Koushanfar Fall 2006

Summary Bias, variance, and model complexity; optimism of the training error rate; estimates of in-sample prediction error (AIC); the effective number of parameters; the Bayesian approach and BIC; Vapnik-Chervonenkis dimension; cross-validation; the bootstrap method

Model Selection Criteria Loss function: $L(Y, \hat f(X))$, e.g., squared error $(Y - \hat f(X))^2$ or 0-1 loss. Training error: the average loss over the training sample, $\overline{err} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i))$. Generalization (test) error: the expected prediction error over an independent test sample, $Err_{\mathcal T} = E[L(Y, \hat f(X)) \mid \mathcal T]$

Training Error vs. Test Error

Model Selection and Assessment Model selection: –Estimating the performance of different models in order to choose the best one Model assessment: –Having chosen a final model, estimating its prediction error (generalization error) on new data If we were rich in data, we would split it into three parts: a training set, a validation set, and a test set
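A minimal sketch of such a three-way split, assuming the data are held in NumPy arrays; the 50/25/25 fractions and the function name are illustrative choices, not from the slides.

import numpy as np

def train_val_test_split(X, y, frac=(0.5, 0.25, 0.25), seed=0):
    # Randomly partition (X, y) into train / validation / test sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = int(frac[0] * len(y))
    n_val = int(frac[1] * len(y))
    i_tr, i_va, i_te = np.split(idx, [n_train, n_train + n_val])
    return (X[i_tr], y[i_tr]), (X[i_va], y[i_va]), (X[i_te], y[i_te])

# Example: 1000 samples, 20 predictors
X = np.random.randn(1000, 20)
y = np.random.randn(1000)
train, val, test = train_val_test_split(X, y)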

Bias-Variance Decomposition As we have seen before, for squared-error loss at an input point $x_0$, $Err(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0] = \sigma_\epsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2$, i.e., irreducible error + squared bias + variance. The first term is the variance of the target around its true mean $f(x_0)$; the second term is the squared amount by which the average of our estimate differs from the true mean; the last term is the variance of $\hat f(x_0)$. * The more complex we make the model $\hat f$, the lower the bias, but the higher the variance
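A small Monte Carlo sketch of this decomposition at a single point $x_0$; the sine target, noise level, and degree-5 polynomial fit are illustrative assumptions, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)       # true mean function (assumed)
sigma = 0.3                               # noise standard deviation sigma_eps
x0 = 0.25                                 # evaluation point
n, reps, degree = 30, 2000, 5

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)    # fresh training sample
    coef = np.polyfit(x, y, degree)       # fit a degree-5 polynomial
    preds[r] = np.polyval(coef, x0)       # f_hat(x0) for this training sample

bias2 = (preds.mean() - f(x0)) ** 2       # squared bias at x0
var = preds.var()                         # variance of f_hat(x0)
err = sigma**2 + bias2 + var              # expected prediction error at x0
print(f"sigma^2={sigma**2:.3f}  bias^2={bias2:.4f}  var={var:.4f}  Err(x0)={err:.4f}")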

Bias-Variance Decomposition (cont'd) For the k-nearest-neighbor fit: $Err(x_0) = \sigma_\epsilon^2 + \big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^k f(x_{(\ell)})\big]^2 + \frac{\sigma_\epsilon^2}{k}$ For the linear regression fit $\hat f_p(x_0) = x_0^T\hat\beta$: $Err(x_0) = \sigma_\epsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2\,\sigma_\epsilon^2$

Bias-Variance Decomposition (cont'd) For linear regression, $h(x_0)$ is the N-vector of linear weights that produces the fit $\hat f_p(x_0) = x_0^T(X^TX)^{-1}X^Ty$, and hence $Var[\hat f_p(x_0)] = \|h(x_0)\|^2\,\sigma_\epsilon^2$. This variance changes with $x_0$, but its average over the sample values $x_i$ is $(p/N)\,\sigma_\epsilon^2$.

Example 50 observations and 20 predictors, uniformly distributed in the hypercube $[0,1]^{20}$. Left: Y is 0 if $X_1 \le 1/2$ and 1 otherwise, and k-NN is applied. Right: Y is 1 if $\sum_{j=1}^{10} X_j > 5$ and 0 otherwise. [Figures show prediction error, squared bias, and variance as a function of model complexity.]

Example – loss function [Figures show prediction error, squared bias, and variance.]

Optimism of Training Error The training error $\overline{err} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat f(x_i))$ is typically less than the true error $Err_{\mathcal T}$. The in-sample error is $Err_{in} = \frac{1}{N}\sum_{i=1}^N E_{Y^0}\big[L(Y_i^0, \hat f(x_i)) \mid \mathcal T\big]$, and the optimism is $op \equiv Err_{in} - \overline{err}$. For squared error, 0-1, and other loss functions, one can show in general that $\omega \equiv E_y(op) = \frac{2}{N}\sum_{i=1}^N Cov(\hat y_i, y_i)$

Optimism (cont'd) Thus, the amount by which $\overline{err}$ underestimates the true error depends on how strongly $y_i$ affects its own prediction. For a linear fit with d inputs or basis functions and the additive error model $Y = f(X) + \epsilon$, $\sum_{i=1}^N Cov(\hat y_i, y_i) = d\,\sigma_\epsilon^2$, and thus $\omega = 2\,\frac{d}{N}\,\sigma_\epsilon^2$. Optimism increases linearly with the number of inputs or basis functions d, and decreases as the training size N increases.

How to account for optimism? Estimate the optimism and add it to the training error, e.g., AIC, BIC, etc. Cross-validation and the bootstrap, by contrast, are direct estimates of the prediction error itself.

Estimates of In-Sample Prediction Error The general form of the in-sample estimate is $\widehat{Err}_{in} = \overline{err} + \hat\omega$, computed from an estimate of the optimism. The $C_p$ statistic: for an additive error model, when d parameters are fit under squared-error loss, $C_p = \overline{err} + 2\,\frac{d}{N}\,\hat\sigma_\epsilon^2$. Using this criterion, we adjust the training error by a factor proportional to the number of basis functions. The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of $Err_{in}$, used when a log-likelihood loss function is employed
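A sketch of the $C_p$ adjustment for an ordinary least-squares fit with d parameters. Estimating $\hat\sigma_\epsilon^2$ from the residuals of the low-bias (full) model is one common choice and is an assumption here, as are the simulated data.

import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    # C_p = training error + 2*(d/N)*sigma2_hat, for squared-error loss
    N = len(y)
    err_train = np.mean((y - y_hat) ** 2)
    return err_train + 2.0 * d / N * sigma2_hat

# Simulated data; noise variance estimated from the full model's residuals
N, p, d = 100, 10, 4
X = np.random.randn(N, p)
y = X[:, 0] - 2 * X[:, 1] + np.random.randn(N)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = np.sum((y - X @ beta_full) ** 2) / (N - p)

beta_d = np.linalg.lstsq(X[:, :d], y, rcond=None)[0]
print(cp_statistic(y, X[:, :d] @ beta_d, d, sigma2_hat))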

Akaike Information Criterion (AIC) AIC relies on a relationship that holds asymptotically as $N \to \infty$: $-2\,E[\log Pr_{\hat\theta}(Y)] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$. Here $Pr_\theta(Y)$ is a family of densities for Y (containing the "true" density), $\hat\theta$ is the maximum likelihood estimate of $\theta$, and "loglik" is the maximized log-likelihood, $\mathrm{loglik} = \sum_{i=1}^N \log Pr_{\hat\theta}(y_i)$.

AIC (cont'd) For the Gaussian model (with the variance assumed known), AIC is equivalent to $C_p$. For logistic regression, using the binomial log-likelihood, we have $AIC = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}$. Choose the model that produces the smallest AIC. What if we don't know d? What if there are tuning parameters?

AIC (cont'd) Given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, denote by $\overline{err}(\alpha)$ and $d(\alpha)$ the training error and the number of parameters. Then $AIC(\alpha) = \overline{err}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\epsilon^2$ provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it. Note that by choosing the best-fitting model with d inputs, the effective number of parameters fit is more than d.
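A sketch of scanning a tuning parameter and selecting the value that minimizes AIC, in the Gaussian/$C_p$ form above. Polynomial degree as the tuning parameter, the simulated data, and the assumed-known noise variance are all illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
N = 200
x = rng.uniform(-1, 1, N)
y = 1 - 2 * x + 3 * x**2 + rng.normal(0, 0.5, N)   # true model is quadratic
sigma2_hat = 0.25                                  # noise variance assumed known here

def aic_gaussian(err_train, d, N, sigma2):
    # AIC for Gaussian errors with known variance (equivalent to C_p)
    return err_train + 2.0 * d / N * sigma2

scores = {}
for degree in range(1, 10):                # degree plays the role of alpha
    coef = np.polyfit(x, y, degree)
    err_train = np.mean((y - np.polyval(coef, x)) ** 2)
    d = degree + 1                         # number of fitted parameters
    scores[degree] = aic_gaussian(err_train, d, N, sigma2_hat)

best = min(scores, key=scores.get)
print("AIC-selected degree:", best)        # typically recovers degree 2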

AIC – Example: Phoneme recognition

The effective number of parameters We generalize the number of parameters to linear fitting methods with regularization, where the fit can be written as $\hat{\mathbf y} = \mathbf S\,\mathbf y$. The effective number of parameters is $d(S) = \mathrm{trace}(S)$, and the in-sample error estimate becomes $\widehat{Err}_{in} = \overline{err} + 2\,\frac{d(S)}{N}\,\hat\sigma_\epsilon^2$

The effective number of parameters Thus, for a regularized model fit by $\hat{\mathbf y} = \mathbf S\,\mathbf y$ with additive errors, $\sum_{i=1}^N Cov(\hat y_i, y_i) = \mathrm{trace}(S)\,\sigma_\epsilon^2$. Hence the more general definition of the effective number of parameters is $d(\hat{\mathbf y}) = \sum_{i=1}^N Cov(\hat y_i, y_i)/\sigma_\epsilon^2$, and for a linear fit this reduces to $d(S) = \mathrm{trace}(S)$.
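A sketch computing $d(S) = \mathrm{trace}(S)$ for ridge regression, where $S = X(X^TX + \lambda I)^{-1}X^T$ is the assumed form of the smoother matrix (intercept handling and centering are glossed over).

import numpy as np

def ridge_effective_df(X, lam):
    # Effective number of parameters d(S) = trace(S) for the ridge smoother
    # S = X (X^T X + lam*I)^{-1} X^T
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

X = np.random.randn(100, 10)
for lam in [0.0, 1.0, 10.0, 100.0]:
    # d(S) shrinks from p=10 toward 0 as the penalty lambda grows
    print(lam, round(ridge_effective_df(X, lam), 2))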

The Bayesian Approach and BIC The Bayesian information criterion (BIC) is $BIC = -2\,\mathrm{loglik} + (\log N)\,d$. BIC/2 is also known as the Schwarz criterion. BIC is proportional to AIC ($C_p$), with the factor 2 replaced by $\log N$. Since $\log N > 2$ once N exceeds about 7, BIC penalizes complex models more heavily, preferring simpler models.

BIC (cont'd) BIC is asymptotically consistent as a selection criterion: given a family of models including the true one, the probability of selecting the true model approaches 1 as $N \to \infty$. Suppose we have a set of candidate models $\mathcal M_m$, $m = 1, \dots, M$, with corresponding model parameters $\theta_m$, and we wish to choose the best model. Assuming a prior distribution $Pr(\theta_m \mid \mathcal M_m)$ for the parameters of each model $\mathcal M_m$, compute the posterior probability of a given model!

BIC (cont'd) The posterior probability is $Pr(\mathcal M_m \mid Z) \propto Pr(\mathcal M_m)\cdot Pr(Z \mid \mathcal M_m)$, where Z represents the training data. To compare two models $\mathcal M_m$ and $\mathcal M_l$, form the posterior odds $\frac{Pr(\mathcal M_m \mid Z)}{Pr(\mathcal M_l \mid Z)} = \frac{Pr(\mathcal M_m)}{Pr(\mathcal M_l)} \cdot \frac{Pr(Z \mid \mathcal M_m)}{Pr(Z \mid \mathcal M_l)}$. If the posterior odds are greater than one, choose model m; otherwise choose model l.

BIC (cont'd) The Bayes factor $\frac{Pr(Z \mid \mathcal M_m)}{Pr(Z \mid \mathcal M_l)}$ is the rightmost term in the posterior odds. We need to approximate $Pr(Z \mid \mathcal M_m) = \int Pr(Z \mid \theta_m, \mathcal M_m)\,Pr(\theta_m \mid \mathcal M_m)\,d\theta_m$. A Laplace approximation to this integral gives $\log Pr(Z \mid \mathcal M_m) \approx \log Pr(Z \mid \hat\theta_m, \mathcal M_m) - \frac{d_m}{2}\log N + O(1)$, where $\hat\theta_m$ is the maximum likelihood estimate and $d_m$ is the number of free parameters of model $\mathcal M_m$. If the loss function is set to $-2\log Pr(Z \mid \mathcal M_m, \hat\theta_m)$, this is equivalent to the BIC criterion.

BIC (cont'd) Thus, choosing the model with minimum BIC is equivalent to choosing the model with the largest (approximate) posterior probability. If we compute the BIC criterion for a set of M models, obtaining $BIC_m$, $m = 1, \dots, M$, then the posterior probability of each model is estimated as $Pr(\mathcal M_m \mid Z) \approx \frac{e^{-\frac{1}{2}BIC_m}}{\sum_{l=1}^M e^{-\frac{1}{2}BIC_l}}$. Thus, we can estimate not only the best model, but also assess the relative merits of the models considered.
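A sketch turning a set of BIC values into approximate posterior model probabilities via the relation on this slide; the example BIC values are placeholders.

import numpy as np

def bic_posterior(bic_values):
    # Pr(M_m | Z) ~ exp(-BIC_m / 2) / sum_l exp(-BIC_l / 2)
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # subtract the minimum for numerical stability
    return w / w.sum()

print(bic_posterior([210.3, 208.1, 215.7]))   # the lowest-BIC model gets the most weight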

Vapnik-Chervonenkis Dimension It is often difficult to specify the number of parameters. The Vapnik-Chervonenkis (VC) dimension provides a general measure of complexity, with associated bounds on the optimism. Consider a class of functions $\{f(x, \alpha)\}$ indexed by a parameter vector $\alpha$, with $x \in \mathbb R^p$, and assume f is an indicator function, taking values 0 or 1. If $\alpha = (\alpha_0, \alpha_1)$ and f is the linear indicator function $I(\alpha_0 + \alpha_1^T x > 0)$, then it seems reasonable to say that its complexity is the number of parameters, p+1. But how about $f(x, \alpha) = I(\sin(\alpha \cdot x) > 0)$?

VC Dimension (cont’d)

The Vapnik-Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be. The VC dimension of the class $\{f(x, \alpha)\}$ is defined to be the largest number of points (in some configuration) that can be shattered by members of $\{f(x, \alpha)\}$.

VC Dimension (cont'd) A set of points is shattered by a class of functions if, no matter how we assign a binary label to each point, a member of the class can perfectly separate them. Example: the VC dimension of linear indicator functions in 2D is 3, since three points in general position can be shattered but four points cannot.
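A brute-force sketch of the shattering check for linear indicator functions in 2D. It uses a perceptron as a separability test, which is a heuristic assumption (perfect training accuracy is taken as evidence that a labeling is linearly realizable); the point configurations are illustrative.

import itertools
import numpy as np
from sklearn.linear_model import Perceptron

def shattered_by_linear(points):
    # Can every 0/1 labeling of the points be realized by I(a0 + a1^T x > 0)?
    n = len(points)
    for labels in itertools.product([0, 1], repeat=n):
        if len(set(labels)) < 2:
            continue  # constant labelings are trivially realizable
        clf = Perceptron(max_iter=10000, tol=None).fit(points, labels)
        if clf.score(points, labels) < 1.0:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])             # general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # admits an XOR labeling
print(shattered_by_linear(three))  # True: three points can be shattered, so VC dim >= 3
print(shattered_by_linear(four))   # False: the XOR labeling of these four points is not separable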

VC Dimension (cont'd) Using the concept of VC dimension, one can prove results about the optimism of the training error when using a class of functions. E.g., if we fit N data points using a class of functions $\{f(x, \alpha)\}$ having VC dimension h, then with probability at least $1 - \eta$ over training sets: $Err_{\mathcal T} \le \overline{err} + \frac{\epsilon}{2}\Big(1 + \sqrt{1 + \frac{4\,\overline{err}}{\epsilon}}\Big)$ for binary classification, and $Err_{\mathcal T} \le \frac{\overline{err}}{(1 - c\sqrt{\epsilon})_+}$ for regression, where $\epsilon = a_1\,\frac{h[\log(a_2 N/h) + 1] - \log(\eta/4)}{N}$ (Cherkassky and Mulier, 1998). For regression, $a_1 = a_2 = 1$.

VC Dimension (cont'd) These bounds suggest that the optimism increases with h and decreases with N, in qualitative agreement with the AIC correction d/N. The VC results are stronger, however: they give probabilistic upper bounds that hold for all functions $f(x, \alpha)$, and hence allow for searching over the class.

VC Dimension (cont'd) Vapnik's Structural Risk Minimization (SRM) is built around the bounds described above. SRM fits a nested sequence of models of increasing VC dimensions $h_1 < h_2 < \dots$, and then chooses the model with the smallest value of the upper bound. A drawback is the difficulty of computing the VC dimension of a class of functions; a crude upper bound on it may not be adequate.

Example – AIC, BIC, SRM

Cross Validation (CV) Cross-validation is the most widely used method for estimating prediction error. It directly estimates the generalization error by applying the model to held-out test samples. K-fold cross validation: –Use part of the data to build the model and a different part to test it –Do this for k = 1, 2, …, K and average the prediction error obtained when predicting the kth part

CV (cont'd) Let $\kappa: \{1, \dots, N\} \to \{1, \dots, K\}$ be an indexing function that divides the data into K groups, and let $\hat f^{-k}(x)$ denote the function fitted with the kth part of the data removed. The CV estimate of the prediction error is $CV(\hat f) = \frac{1}{N}\sum_{i=1}^N L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big)$. If K = N, this is called leave-one-out CV. Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, let $\hat f^{-k}(x, \alpha)$ denote the $\alpha$th model fit with the kth part removed. For this set of models we have $CV(\hat f, \alpha) = \frac{1}{N}\sum_{i=1}^N L\big(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)\big)$
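A from-scratch sketch of the K-fold procedure above under squared-error loss. The least-squares model, the fold assignment via a random permutation, and the simulated data are illustrative assumptions.

import numpy as np

def k_fold_cv(X, y, fit, predict, K=10, seed=0):
    # CV(f_hat) = (1/N) * sum_i L(y_i, f_hat^{-kappa(i)}(x_i)), squared-error loss
    rng = np.random.default_rng(seed)
    N = len(y)
    kappa = rng.permutation(N) % K            # indexing function: sample -> fold
    losses = np.empty(N)
    for k in range(K):
        test = kappa == k
        model = fit(X[~test], y[~test])       # fit with the kth part removed
        losses[test] = (y[test] - predict(model, X[test])) ** 2
    return losses.mean()

# Usage with an ordinary least-squares fit as the model
X = np.random.randn(200, 5)
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + np.random.randn(200)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
print(k_fold_cv(X, y, fit, predict, K=10))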

CV (cont'd) $CV(\hat f, \alpha)$ should be minimized over $\alpha$. What should we choose for K? With K = N, CV is approximately unbiased for the true prediction error, but it can have high variance, since the N training sets are almost identical to one another, and the computational burden is high.

CV (cont'd) With lower K, CV has lower variance, but bias can become a problem. The most common choices are 5-fold and 10-fold CV.

CV (cont'd) Generalized cross-validation (GCV) provides a convenient approximation to leave-one-out CV for linear fitting under squared-error loss, where $\hat{\mathbf y} = \mathbf S\,\mathbf y$. For linear fits, $\frac{1}{N}\sum_{i=1}^N \big[y_i - \hat f^{-i}(x_i)\big]^2 = \frac{1}{N}\sum_{i=1}^N \Big[\frac{y_i - \hat f(x_i)}{1 - S_{ii}}\Big]^2$, where $S_{ii}$ is the ith diagonal element of S. The GCV approximation is $GCV(\hat f) = \frac{1}{N}\sum_{i=1}^N \Big[\frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/N}\Big]^2$. GCV may be advantageous in settings where the trace of S can be computed more easily than the individual $S_{ii}$'s.
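A sketch of the leave-one-out shortcut and the GCV approximation for a linear fit $\hat y = Sy$, using the ridge smoother matrix from the earlier sketch as the assumed S; the data and penalty value are illustrative.

import numpy as np

def loo_and_gcv(X, y, lam):
    # Leave-one-out CV and GCV for the ridge smoother S = X (X^T X + lam*I)^{-1} X^T
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - S @ y
    loo = np.mean((resid / (1.0 - np.diag(S))) ** 2)            # uses individual S_ii
    gcv = np.mean((resid / (1.0 - np.trace(S) / len(y))) ** 2)  # uses only trace(S)
    return loo, gcv

X = np.random.randn(100, 10)
y = X[:, 0] + 0.5 * np.random.randn(100)
print(loo_and_gcv(X, y, lam=5.0))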

Bootstrap Denote the training set by $Z = (z_1, \dots, z_N)$, where $z_i = (x_i, y_i)$. Randomly draw a dataset of size N with replacement from the training data; this is done B times (e.g., B = 100). Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the B replications. From the bootstrap samples we can estimate any aspect of the distribution of S(Z), where S(Z) can be any quantity computed from the data.

Bootstrap – Schematic [Figure: B bootstrap datasets $Z^{*1}, \dots, Z^{*B}$ drawn from Z, each yielding a replication $S(Z^{*b})$.] For example, the variance of S(Z) can be estimated by $\widehat{Var}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^B \big(S(Z^{*b}) - \bar S^*\big)^2$, where $\bar S^* = \frac{1}{B}\sum_b S(Z^{*b})$.
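A sketch of this variance estimate for an arbitrary statistic; the sample median and the simulated data are illustrative choices.

import numpy as np

def bootstrap_variance(z, stat, B=100, seed=0):
    # Estimate Var[S(Z)] from B bootstrap datasets drawn with replacement from z
    rng = np.random.default_rng(seed)
    N = len(z)
    reps = np.array([stat(z[rng.integers(0, N, N)]) for _ in range(B)])
    return reps.var(ddof=1)   # (1/(B-1)) * sum_b (S(Z*b) - S_bar)^2

z = np.random.randn(50)
print(bootstrap_variance(z, np.median, B=200))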

Bootstrap (Cont'd) Using the bootstrap directly to estimate prediction error, $\widehat{Err}_{boot} = \frac{1}{B}\,\frac{1}{N}\sum_{b=1}^B\sum_{i=1}^N L\big(y_i, \hat f^{*b}(x_i)\big)$, does not provide a good estimate: –Each bootstrap dataset acts as both training and test set, and the two share common observations –The overfit predictions will look unrealistically good By mimicking CV, we obtain better bootstrap estimates: only keep track of predictions from bootstrap samples that do not contain the observation being predicted

Bootstrap (Cont'd) The leave-one-out bootstrap estimate of prediction error is $\widehat{Err}^{(1)} = \frac{1}{N}\sum_{i=1}^N \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat f^{*b}(x_i)\big)$, where $C^{-i}$ is the set of indices of the bootstrap samples b that do not contain observation i. We either have to choose B large enough to ensure that all of the $|C^{-i}|$ are greater than zero, or simply leave out the terms corresponding to $|C^{-i}| = 0$.
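A sketch of the leave-one-out bootstrap estimate under squared-error loss; the least-squares model and the simulated data are illustrative assumptions, and points never left out are dropped, as the slide allows.

import numpy as np

def loo_bootstrap_error(X, y, fit, predict, B=100, seed=0):
    # For each observation i, average the loss of the bootstrap models
    # whose samples do not contain i (the sets C^{-i}).
    rng = np.random.default_rng(seed)
    N = len(y)
    loss_sum = np.zeros(N)
    count = np.zeros(N)
    for b in range(B):
        idx = rng.integers(0, N, N)               # bootstrap sample (with replacement)
        model = fit(X[idx], y[idx])
        out = np.setdiff1d(np.arange(N), idx)     # observations not in this sample
        loss_sum[out] += (y[out] - predict(model, X[out])) ** 2
        count[out] += 1
    keep = count > 0                              # drop terms with |C^{-i}| = 0
    return np.mean(loss_sum[keep] / count[keep])

X = np.random.randn(150, 5)
y = X @ np.array([1.0, -1.0, 0.0, 0.5, 0.0]) + np.random.randn(150)
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta
print(loo_bootstrap_error(X, y, fit, predict, B=100))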

Bootstrap (Cont'd) The leave-one-out bootstrap solves the overfitting problem, but it has a training-size bias. The average number of distinct observations in each bootstrap sample is about 0.632·N. Thus, if the learning curve has a considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the error. There are a number of proposed methods to alleviate this problem, e.g., the .632 estimator and the .632+ estimator, which uses the no-information error rate and the relative overfitting rate.

Bootstrap (Example) Five-fold CV and the .632 estimate for the same problems as before. Any of the measures could be biased, but this does not matter for model selection as long as the relative performance of the models is preserved.