Lecture 12 – Model Assessment and Selection
Rice ECE697, Farinaz Koushanfar, Fall 2006
Summary
– Bias, variance, and model complexity
– Optimism of the training error rate
– Estimates of in-sample prediction error; AIC
– Effective number of parameters
– The Bayesian approach and BIC
– Vapnik–Chervonenkis dimension
– Cross-validation
– The bootstrap method
Model Selection Criteria
Training error, loss function, and generalization error.
Training Error vs. Test Error
Model Selection and Assessment
Model selection: estimating the performance of different models in order to choose the best one.
Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.
If we were rich in data, we would split it into three parts: Train / Validation / Test.
Bias-Variance Decomposition
As we have seen before, for Y = f(X) + ε with E[ε] = 0 and Var(ε) = σ_ε², the expected prediction error at a point x_0 under squared-error loss decomposes as
Err(x_0) = E[(Y − f̂(x_0))² | X = x_0] = σ_ε² + [E f̂(x_0) − f(x_0)]² + E[f̂(x_0) − E f̂(x_0)]².
The first term is the variance of the target around its true mean f(x_0); the second term is the squared amount by which our estimate is off from the true mean on average; the last term is the variance of f̂(x_0).
* The more complex f̂, the lower the bias, but the higher the variance.
Bias-Variance Decomposition (cont'd)
For a k-nearest-neighbor fit:
Err(x_0) = σ_ε² + [f(x_0) − (1/k) Σ_{ℓ=1}^{k} f(x_(ℓ))]² + σ_ε²/k,
where x_(1), …, x_(k) are the k nearest neighbors of x_0.
For linear regression with p inputs:
Err(x_0) = σ_ε² + [f(x_0) − E f̂_p(x_0)]² + ||h(x_0)||² σ_ε².
Bias-Variance Decomposition (cont'd)
For linear regression, h(x_0) is the vector of weights that produces the fit f̂_p(x_0) = x_0^T (X^T X)^{-1} X^T y, and hence Var[f̂_p(x_0)] = ||h(x_0)||² σ_ε².
This variance changes with x_0, but its average over the sample values x_i is (p/N) σ_ε².
Example
50 observations and 20 predictors, uniformly distributed in the hypercube [0,1]^20.
Left: Y is 0 if X_1 ≤ 1/2 and 1 otherwise; k-nearest neighbors is applied.
Right: Y is 1 if Σ_{j=1}^{10} X_j is greater than 5, and 0 otherwise.
(Figure: prediction error, squared bias, and variance versus model complexity.)
Example – Loss Function
(Figure: prediction error, squared bias, and variance.)
Optimism of the Training Error
The training error err_bar = (1/N) Σ_i L(y_i, f̂(x_i)) is typically less than the true error.
In-sample error: Err_in = (1/N) Σ_i E_{Y^0}[L(Y_i^0, f̂(x_i))], the error at the training inputs x_i with new responses Y_i^0.
Optimism: op ≡ Err_in − err_bar.
For squared error, 0–1, and other loss functions, one can show in general that the expected optimism is
ω ≡ E_y(op) = (2/N) Σ_i Cov(ŷ_i, y_i).
Optimism (cont'd)
Thus, the amount by which the training error underestimates the true error depends on how much y_i affects its own prediction.
For a linear fit with d inputs or basis functions: Σ_i Cov(ŷ_i, y_i) = d σ_ε².
For the additive model Y = f(X) + ε, this gives ω = 2 · (d/N) · σ_ε².
Optimism increases linearly with the number of inputs or basis functions d, and decreases as the training size N increases.
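A quick sketch (again my own illustration, with assumed toy data) that estimates ω = (2/N) Σ_i Cov(ŷ_i, y_i) by simulation for a least-squares fit and compares it with 2·d·σ_ε²/N.

```python
# Sketch: simulate the expected optimism of a linear model and compare
# with the closed form 2 * d * sigma^2 / N.
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 80, 10, 1.0
X = rng.normal(size=(N, d))
f_true = X @ rng.normal(size=d)

ys, yhats = [], []
for _ in range(4000):
    y = f_true + sigma * rng.normal(size=N)
    yhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ys.append(y)
    yhats.append(yhat)

ys, yhats = np.array(ys), np.array(yhats)
# per-observation covariance Cov(yhat_i, y_i), summed over i
cov = ((ys - ys.mean(0)) * (yhats - yhats.mean(0))).mean(0).sum()
print(2 * cov / N, 2 * d * sigma**2 / N)   # both close to 0.25
```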
How to Account for Optimism?
Estimate the optimism and add it to the training error; this is the route taken by AIC, BIC, and related criteria.
Cross-validation and the bootstrap, in contrast, are direct estimates of the generalization error.
Estimates of In-Sample Prediction Error
The general form of the in-sample estimate is Êrr_in = err_bar + ω̂, the training error plus an estimate of the optimism.
C_p statistic: for an additive error model, when d parameters are fit under squared-error loss,
C_p = err_bar + 2 · (d/N) · σ̂_ε².
Using this criterion, we adjust the training error by a factor proportional to the number of basis functions used.
The Akaike Information Criterion (AIC) is a similar but more generally applicable estimate of Err_in, used when a log-likelihood loss function is adopted.
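As an illustration (not from the lecture), the sketch below computes C_p for a sequence of nested least-squares models, estimating σ̂_ε² from the residuals of the largest, low-bias model. The data, function names, and nesting scheme are all assumptions.

```python
# Sketch: C_p = err_bar + 2 * (d / N) * sigma_hat^2 for nested OLS models,
# with sigma_hat^2 estimated from the residuals of the largest model.
import numpy as np

def cp_statistic(X, y, d):
    """C_p for the OLS fit that uses the first d columns of X."""
    N, p = X.shape
    # noise variance from the full (low-bias) model
    resid_full = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sigma2_hat = resid_full @ resid_full / (N - p)
    # training error of the d-column sub-model
    Xd = X[:, :d]
    resid = y - Xd @ np.linalg.lstsq(Xd, y, rcond=None)[0]
    err_bar = resid @ resid / N
    return err_bar + 2.0 * d / N * sigma2_hat

# toy usage: choose d minimizing C_p
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 15))
y = X[:, :5] @ np.ones(5) + rng.normal(size=100)   # only 5 columns matter
print(min(range(1, 16), key=lambda d: cp_statistic(X, y, d)))  # typically 5
```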
Akaike Information Criterion (AIC)
AIC relies on a relationship that holds asymptotically as N → ∞:
−2 · E[log Pr_θ̂(Y)] ≈ −(2/N) · E[loglik] + 2 · (d/N).
Here Pr_θ(Y) is a family of densities for Y (containing the "true" density), θ̂ is the maximum-likelihood estimate of θ, and "loglik" is the maximized log-likelihood, loglik = Σ_i log Pr_θ̂(y_i).
AIC (cont'd)
For the Gaussian model (with the variance σ_ε² assumed known), the AIC statistic is equivalent to C_p.
For logistic regression, using the binomial log-likelihood, we have AIC = −(2/N) · loglik + 2 · (d/N).
Choose the model that produces the smallest AIC.
What if we don't know d? What if there are tuning parameters?
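A minimal sketch of the binomial-log-likelihood AIC above, assuming a hand-rolled Newton (IRLS) logistic fit and toy data; none of the names or data come from the lecture.

```python
# Sketch: AIC = -(2/N) * loglik + 2 * d / N for logistic regression,
# fit by a few Newton (IRLS) steps.
import numpy as np

def fit_logistic(X, y, iters=25):
    """Return ML coefficients and the maximized binomial log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p) + 1e-12                    # IRLS weights
        grad = X.T @ (y - p)
        hess = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return beta, loglik

def aic(X, y):
    N, d = X.shape
    _, loglik = fit_logistic(X, y)
    return -2.0 / N * loglik + 2.0 * d / N

rng = np.random.default_rng(3)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 6))])
y = (rng.random(200) < 1 / (1 + np.exp(-(X[:, 1] - X[:, 2])))).astype(float)
# compare models that use the first d columns; smaller AIC is preferred
print({d: round(aic(X[:, :d], y), 3) for d in range(1, 8)})
```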
AIC (cont'd)
Given a set of models f_α(x) indexed by a tuning parameter α, denote by err_bar(α) and d(α) the training error and the number of parameters of each model. Then
AIC(α) = err_bar(α) + 2 · (d(α)/N) · σ̂_ε².
The function AIC(α) provides an estimate of the test error curve, and we choose the tuning parameter α̂ that minimizes it.
Note: by choosing the best-fitting model with d inputs, the effective number of parameters fit is more than d.
AIC – Example: Phoneme Recognition
The Effective Number of Parameters
Generalize the number of parameters to fits with regularization. For a linear fitting method, the vector of fitted values is ŷ = S y, where S is an N×N matrix that depends on the inputs x_i but not on the y_i.
The effective number of parameters is d(S) = trace(S).
The in-sample error estimate is then err_bar + 2 · (trace(S)/N) · σ̂_ε².
The Effective Number of Parameters (cont'd)
Thus, for a regularized model such as ridge regression, the fit is linear in y: ŷ = S y with S = X (X^T X + λI)^{-1} X^T.
Hence Σ_i Cov(ŷ_i, y_i) = trace(S) σ_ε², and the C_p / AIC formulas apply with d replaced by the effective number of parameters trace(S).
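The short sketch below (my example, using ridge regression as the regularized fit) computes d(S) = trace(S_λ) via the singular values of X and shows how it shrinks from p toward 0 as λ grows.

```python
# Sketch: effective number of parameters d(S) = trace(S) for a ridge fit,
# where S_lambda = X (X^T X + lambda * I)^{-1} X^T.
import numpy as np

def ridge_df(X, lam):
    """trace(S_lambda) = sum_j d_j^2 / (d_j^2 + lambda), d_j = singular values of X."""
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 10))
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, round(ridge_df(X, lam), 2))   # lam = 0 gives 10, shrinking with lam
```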
The Bayesian Approach and BIC
The Bayesian information criterion (BIC) is BIC = −2 · loglik + (log N) · d.
BIC/2 is also known as the Schwarz criterion.
BIC is proportional to AIC (and C_p), with the factor 2 replaced by log N. Since log N > 2 for N > 7, BIC penalizes complex models more heavily, preferring simpler models.
BIC (cont'd)
BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true one, the probability of selecting the true model approaches 1 as N → ∞.
Suppose we have a set of candidate models M_m, m = 1, …, M, with corresponding model parameters θ_m, and we wish to choose the best model.
Assuming a prior distribution Pr(θ_m | M_m) for the parameters of each model M_m, compute the posterior probability of each model.
BIC (cont'd)
The posterior probability of model M_m is
Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m),
where Z represents the training data {(x_i, y_i)}_{i=1}^N.
To compare two models M_m and M_ℓ, form the posterior odds
Pr(M_m | Z) / Pr(M_ℓ | Z) = [Pr(M_m) / Pr(M_ℓ)] · [Pr(Z | M_m) / Pr(Z | M_ℓ)].
If the posterior odds are greater than one, choose model m; otherwise choose model ℓ.
BIC (cont'd)
The Bayes factor BF(Z) = Pr(Z | M_m) / Pr(Z | M_ℓ) is the rightmost term in the posterior odds; it captures the contribution of the data.
We need to approximate Pr(Z | M_m) = ∫ Pr(Z | θ_m, M_m) Pr(θ_m | M_m) dθ_m.
A Laplace approximation to this integral gives
log Pr(Z | M_m) ≈ log Pr(Z | θ̂_m, M_m) − (d_m/2) · log N + O(1),
where θ̂_m is the maximum-likelihood estimate and d_m is the number of free parameters of model M_m.
If the loss function is set to −2 · log Pr(Z | θ̂_m, M_m), this is equivalent to the BIC criterion.
BIC (cont'd)
Thus, choosing the model with minimum BIC is equivalent to choosing the model with the largest (approximate) posterior probability.
If we compute the BIC criterion for a set of M models, BIC_m, m = 1, …, M, then the posterior probability of each model is estimated as
e^{−BIC_m/2} / Σ_{ℓ=1}^{M} e^{−BIC_ℓ/2}.
Thus, we can estimate not only the best model, but also assess the relative merits of the models considered.
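A minimal sketch of the posterior-probability formula above, using a log-sum-exp shift for numerical stability; the BIC values are purely illustrative.

```python
# Sketch: approximate posterior model probabilities from BIC values,
#   Pr(M_m | Z) ~= exp(-BIC_m / 2) / sum_l exp(-BIC_l / 2),
# computed stably by subtracting the largest exponent first.
import numpy as np

def bic_posterior(bic_values):
    b = -0.5 * np.asarray(bic_values, dtype=float)
    b -= b.max()                      # shift so the largest term is exp(0)
    w = np.exp(b)
    return w / w.sum()

print(bic_posterior([1002.3, 1000.1, 1005.7]))  # the middle model dominates
```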
Vapnik–Chervonenkis Dimension
It is often difficult to specify the number of parameters of a fitted model.
The Vapnik–Chervonenkis (VC) theory provides a general measure of complexity and associated bounds on the optimism.
Consider a class of functions {f(x, α)} indexed by a parameter vector α, with x ∈ R^p, and assume f is an indicator function taking values 0 or 1.
If α = (α_0, α_1) and f is the linear indicator function I(α_0 + α_1^T x > 0), then it seems reasonable to say that its complexity is the number of parameters, p + 1.
But what about f(x, α) = I(sin(α · x) > 0)?
VC Dimension (cont'd)
The Vapnik–Chervonenkis dimension is a way of measuring the complexity of a class of functions by assessing how wiggly its members can be.
The VC dimension of the class {f(x, α)} is defined to be the largest number of points (in some configuration) that can be shattered by members of {f(x, α)}.
VC Dimension (cont'd)
A set of points is shattered by a class of functions if, no matter how we assign a binary label to each point, some member of the class can separate them perfectly.
Example: the VC dimension of linear indicator functions in the plane is 3: three points in general position can be shattered, but no configuration of four points can be.
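For concreteness, here is a small brute-force check of the 2D example (my own sketch, not from the lecture): it tests every binary labeling of a point set for strict linear separability by solving a feasibility linear program with scipy.

```python
# Sketch: brute-force shattering check for linear indicator functions in 2D.
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility LP: find (w, b) with (2*y_i - 1) * (w.x_i + b) >= 1 for all i."""
    s = 2 * y - 1                                                # labels in {-1, +1}
    A_ub = -s[:, None] * np.hstack([X, np.ones((len(X), 1))])    # -(s_i) * (x_i, 1)
    b_ub = -np.ones(len(X))
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0                                       # feasible => separable

def shattered(X):
    """True if every binary labeling of the rows of X is linearly separable."""
    return all(linearly_separable(X, np.array(y))
               for y in itertools.product([0, 1], repeat=len(X)))

rng = np.random.default_rng(0)
three = rng.random((3, 2))      # 3 generic points: can be shattered
four = rng.random((4, 2))       # 4 points: some labeling is never separable
print(shattered(three), shattered(four))   # True, False
```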
VC Dimension (cont'd)
Using the concept of VC dimension, one can prove results about the optimism of the training error when using a class of functions. For example, if we fit N data points using a class of functions {f(x, α)} having VC dimension h, then with probability at least 1 − η over training sets (Cherkassky and Mulier, 1998):
Err ≤ err_bar + (ε/2) · (1 + sqrt(1 + 4 · err_bar / ε))   (binary classification),
Err ≤ err_bar / (1 − c · sqrt(ε))_+   (regression),
where ε = a_1 · [h (log(a_2 N / h) + 1) − log(η/4)] / N.
For regression, Cherkassky and Mulier suggest a_1 = a_2 = 1 (and c = 1).
VC Dimension (cont'd)
The bounds suggest that the optimism increases with h and decreases with N, in qualitative agreement with the AIC correction d/N.
The VC-dimension results are stronger, however: they give probabilistic upper bounds that hold for all functions f(x, α), and hence allow for searching over the class.
VC Dimension (cont'd)
Vapnik's Structural Risk Minimization (SRM) is built around the bounds described above.
SRM fits a nested sequence of models of increasing VC dimension h_1 < h_2 < …, and then chooses the model with the smallest value of the upper bound.
A drawback is the difficulty of computing the VC dimension of a class of functions; a crude upper bound on it may not be adequate.
Example – AIC, BIC, SRM
Cross-Validation (CV)
The most widely used method for estimating prediction error.
It directly estimates the generalization error by applying the model to held-out test samples.
K-fold cross-validation:
– Use one part of the data to build the model and a different part to test it.
– Do this for k = 1, 2, …, K and compute the prediction error obtained when predicting the k-th part.
CV (cont'd)
Let κ: {1, …, N} → {1, …, K} be an indexing function that divides the data into K groups.
Let f̂^{−k}(x) denote the fitted function computed with the k-th part of the data removed.
The CV estimate of prediction error is
CV = (1/N) Σ_i L(y_i, f̂^{−κ(i)}(x_i)).
If K = N, this is called leave-one-out CV.
Given a set of models f(x, α), let f̂^{−k}(x, α) denote the α-th model fit with the k-th part removed. For this set of models we have
CV(α) = (1/N) Σ_i L(y_i, f̂^{−κ(i)}(x_i, α)).
CV (cont'd)
CV(α) is minimized over α to select the tuning parameter.
What should we choose for K?
With K = N, CV is approximately unbiased for the true prediction error, but it can have high variance because the N training sets are almost identical, and the computational burden is large.
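A hand-rolled K-fold CV sketch (squared-error loss, with ridge regression indexed by a tuning parameter as the model family; all names and data are my assumptions) that evaluates CV(α) on a grid and picks the minimizer.

```python
# Sketch: K-fold cross-validation, CV(alpha) = (1/N) sum_i L(y_i, f^{-kappa(i)}(x_i, alpha)),
# for ridge regression under squared-error loss.
import numpy as np

def ridge_fit(X, y, alpha):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

def kfold_cv(X, y, alpha, K=10, seed=0):
    N = len(y)
    folds = np.random.default_rng(seed).permutation(N) % K   # kappa: {1..N} -> {1..K}
    errs = np.empty(N)
    for k in range(K):
        test = folds == k
        beta = ridge_fit(X[~test], y[~test], alpha)           # fit with k-th part removed
        errs[test] = (y[test] - X[test] @ beta) ** 2          # squared-error loss on part k
    return errs.mean()

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20))
y = X[:, :3].sum(axis=1) + rng.normal(size=100)
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
print(min(alphas, key=lambda a: kfold_cv(X, y, a)))           # CV-chosen alpha
```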
CV (cont'd)
With lower K, CV has lower variance, but bias can be a problem.
The most common choices are 5-fold and 10-fold CV.
CV (cont'd)
Generalized cross-validation (GCV) provides a convenient approximation to leave-one-out CV for linear fitting under squared-error loss, ŷ = S y.
For many linear fits,
(1/N) Σ_i [y_i − f̂^{−i}(x_i)]² = (1/N) Σ_i [(y_i − f̂(x_i)) / (1 − S_ii)]²,
where S_ii is the i-th diagonal element of S. The GCV approximation replaces each S_ii by its average:
GCV = (1/N) Σ_i [(y_i − f̂(x_i)) / (1 − trace(S)/N)]².
GCV can be advantageous in settings where the trace is computed more easily than the individual S_ii.
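A short sketch comparing the leave-one-out identity with the GCV approximation for a linear smoother; using a ridge smoother here is my choice, not the lecture's.

```python
# Sketch: leave-one-out CV via the diagonal of S versus the GCV approximation,
# for a linear smoother y_hat = S y (here a ridge smoother with fixed lambda).
import numpy as np

rng = np.random.default_rng(6)
N, p, lam = 80, 10, 5.0
X = rng.normal(size=(N, p))
y = X[:, 0] - X[:, 1] + rng.normal(size=N)

S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # smoother matrix
y_hat = S @ y
resid = y - y_hat

loo = np.mean((resid / (1 - np.diag(S))) ** 2)            # leave-one-out identity
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)       # GCV approximation
print(loo, gcv)                                           # the two are close
```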
Bootstrap
Denote the training set by Z = (z_1, …, z_N), where z_i = (x_i, y_i).
Randomly draw a dataset of size N with replacement from the training data; this is done B times (e.g., B = 100), producing B bootstrap datasets.
Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the B replications.
From the bootstrap samples we can estimate any aspect of the distribution of S(Z), where S(Z) is any quantity computed from the data.
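A minimal sketch of the basic bootstrap loop: draw B datasets with replacement, recompute a statistic S(Z) on each (the sample median is an arbitrary choice of mine), and use the replications to estimate its sampling variability.

```python
# Sketch: the basic bootstrap loop.  S(Z) here is the sample median; the
# B replications estimate its standard error.
import numpy as np

rng = np.random.default_rng(7)
z = rng.exponential(size=200)          # training data Z = (z_1, ..., z_N)
B = 1000

stats = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(z), size=len(z))   # draw N indices with replacement
    stats[b] = np.median(z[idx])                 # S(Z*b) on the bootstrap sample

print(np.median(z), stats.std(ddof=1))           # statistic and its bootstrap SE
```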
Bootstrap – Schematic
For example, the bootstrap estimate of the variance of S(Z) is
Var̂[S(Z)] = (1/(B − 1)) Σ_{b=1}^{B} (S(Z^{*b}) − S̄^*)²,
where Z^{*1}, …, Z^{*B} are the bootstrap datasets and S̄^* = (1/B) Σ_b S(Z^{*b}).
Bootstrap (cont'd)
Using the bootstrap to estimate the prediction error directly,
Êrr_boot = (1/B)(1/N) Σ_{b=1}^{B} Σ_{i=1}^{N} L(y_i, f̂^{*b}(x_i)),
does not provide a good estimate:
– The bootstrap datasets act as training sets while the original training set acts as the test set, and the two share common observations.
– The resulting overfit predictions look unrealistically good.
Better bootstrap estimates mimic cross-validation: for each observation, only keep track of predictions from bootstrap samples that do not contain that observation.
Bootstrap (cont'd)
The leave-one-out bootstrap estimate of prediction error is
Êrr^(1) = (1/N) Σ_i (1/|C^{−i}|) Σ_{b ∈ C^{−i}} L(y_i, f̂^{*b}(x_i)),
where C^{−i} is the set of indices of the bootstrap samples b that do not contain observation i.
We either choose B large enough to ensure that every |C^{−i}| is greater than zero, or simply leave out the terms corresponding to |C^{−i}| = 0.
Bootstrap (cont'd)
The leave-one-out bootstrap solves the overfitting problem, but it has a training-set-size bias.
The average number of distinct observations in each bootstrap sample is about 0.632 · N.
Thus, if the learning curve has considerable slope at sample size N/2, the leave-one-out bootstrap will be biased upward as an estimate of the error.
A number of methods have been proposed to alleviate this problem, e.g., the .632 estimator and the no-information error rate (used to measure the relative overfitting rate).
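A sketch of the leave-one-out bootstrap error Êrr^(1) and the .632 estimator 0.368·err_bar + 0.632·Êrr^(1), using a 1-nearest-neighbor classifier with 0-1 loss as a stand-in model; the model choice, data, and names are all assumptions.

```python
# Sketch: leave-one-out bootstrap error and the .632 estimator with a
# 1-nearest-neighbour classifier (whose training error is 0, illustrating optimism).
import numpy as np

def one_nn_predict(Xtr, ytr, Xte):
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d.argmin(axis=1)]

rng = np.random.default_rng(8)
N, B = 100, 200
X = rng.normal(size=(N, 2))
y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=N) > 0).astype(int)

err_train = np.mean(one_nn_predict(X, y, X) != y)         # err_bar (0 for 1-NN)

loss_sum = np.zeros(N)                                     # per-observation loss totals
count = np.zeros(N)                                        # |C^{-i}|
for b in range(B):
    idx = rng.integers(0, N, size=N)                       # bootstrap sample b
    out = np.setdiff1d(np.arange(N), idx)                  # observations not in sample b
    if out.size == 0:
        continue
    pred = one_nn_predict(X[idx], y[idx], X[out])
    loss_sum[out] += (pred != y[out])
    count[out] += 1

keep = count > 0                                           # drop i with |C^{-i}| = 0
err_loo_boot = np.mean(loss_sum[keep] / count[keep])       # Err^(1)
err_632 = 0.368 * err_train + 0.632 * err_loo_boot         # .632 estimator
print(err_train, round(err_loo_boot, 3), round(err_632, 3))
```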
Bootstrap (Example)
Five-fold CV and the .632 estimate for the same problems as before.
Any of the measures could be biased, but that need not matter as long as the relative performance of the methods being compared is unaffected.