Model Assessment, Selection and Averaging


Model Assessment, Selection and Averaging Presented by: Bibhas Chakraborty

Performance Assessment: Loss Function Typical choices for a quantitative response Y are squared error, $L(Y, \hat{f}(X)) = (Y - \hat{f}(X))^2$, and absolute error, $L(Y, \hat{f}(X)) = |Y - \hat{f}(X)|$. Typical choices for a categorical response G are 0-1 loss, $L(G, \hat{G}(X)) = I(G \neq \hat{G}(X))$, and log-likelihood (deviance) loss, $L(G, \hat{p}(X)) = -2 \log \hat{p}_G(X)$.
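These losses are easy to write down in code. A minimal NumPy sketch (the function names and the use of integer class labels are illustrative assumptions, not from the slides):

```python
import numpy as np

def squared_error(y, f_hat):
    # L(Y, f(X)) = (Y - f(X))^2, elementwise
    return (y - f_hat) ** 2

def absolute_error(y, f_hat):
    # L(Y, f(X)) = |Y - f(X)|, elementwise
    return np.abs(y - f_hat)

def zero_one_loss(g, g_hat):
    # L(G, G_hat(X)) = I(G != G_hat(X)), elementwise
    return (g != g_hat).astype(float)

def deviance_loss(g, p_hat):
    # -2 * log p_hat_G(X); p_hat is an (N, K) matrix of class probabilities,
    # g holds integer class labels in {0, ..., K-1}
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])
```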

Training Error Training error is the average loss over the training sample. For the quantitative response variable Y: $\overline{err} = \frac{1}{N}\sum_{i=1}^N L(y_i, \hat{f}(x_i))$. For the categorical response variable G: $\overline{err} = \frac{1}{N}\sum_{i=1}^N I(g_i \neq \hat{G}(x_i))$, or $-\frac{2}{N}\sum_{i=1}^N \log \hat{p}_{g_i}(x_i)$ under log-likelihood loss.

Test Error (Generalization Error) Generalization error, or test error, is the expected prediction error over an independent test sample. For a quantitative response Y: $Err = E[L(Y, \hat{f}(X))]$. For a categorical response G: $Err = E[L(G, \hat{G}(X))]$.

Bias, Variance and Model Complexity [Figure: behaviour of training error and test error as model complexity varies; taken from p. 194 of The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.]

What do we see from the preceding figure? There is an optimal model complexity that gives minimum test error. Training error is not a good estimate of the test error. There is a bias-variance tradeoff in choosing the appropriate complexity of the model.

Goals Model Selection: estimating the performance of different models in order to choose the best one. Model Assessment: having chosen a final model, estimating its generalization error on new data. Model Averaging: averaging the predictions from different models to achieve improved performance.

Splitting the data Split the dataset into three parts: Training set: used to fit the models. Validation set: used to estimate prediction error for model selection. Test set: used to assess the generalization error for the final chosen model.
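A minimal sketch of such a three-way split in NumPy (the 50/25/25 proportions and the simulated data are illustrative choices, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = rng.normal(size=(N, 5))
y = X @ rng.normal(size=5) + rng.normal(size=N)   # simulated data, for illustration only

# Shuffle the indices, then carve out 50% training, 25% validation, 25% test.
idx = rng.permutation(N)
n_train, n_val = int(0.5 * N), int(0.25 * N)
X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_val, y_val = X[idx[n_train:n_train + n_val]], y[idx[n_train:n_train + n_val]]
X_test, y_test = X[idx[n_train + n_val:]], y[idx[n_train + n_val:]]
```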

The Bias-Variance Decomposition Assume that $Y = f(X) + \varepsilon$, where $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$. Then the expected squared-error prediction error of a fit $\hat{f}$ at an input point $X = x_0$ is $Err(x_0) = E[(Y - \hat{f}(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0) - E\hat{f}(x_0)]^2$ = Irreducible Error + Bias$^2$ + Variance.
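The decomposition can be checked by simulation: repeatedly draw training sets, refit the model, and accumulate the predictions at a fixed point $x_0$. A sketch (the sine target, the cubic polynomial fit, and the sample sizes are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)   # "true" regression function (an assumption of this sketch)
sigma = 0.3                           # noise standard deviation
x0 = 0.25                             # input point at which the error is decomposed
degree, n, reps = 3, 30, 2000         # polynomial degree, training-set size, Monte Carlo repetitions

preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, degree)        # refit the model on a fresh training set
    preds[r] = np.polyval(coefs, x0)        # record f_hat(x0) for this training set

bias_sq = (preds.mean() - f(x0)) ** 2
variance = preds.var()
print("sigma^2 + Bias^2 + Variance =", sigma ** 2 + bias_sq + variance)

# Direct Monte Carlo estimate of Err(x0) = E[(Y - f_hat(x0))^2 | X = x0] for comparison:
y0 = f(x0) + rng.normal(0, sigma, reps)
print("Direct estimate of Err(x0)  =", np.mean((y0 - preds) ** 2))
```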

In-sample and Extra-sample Error In-sample error is the average prediction error conditioned on the training inputs $x_1, \ldots, x_N$: it is the error obtained when new responses are observed at the same feature values as in the training set. Extra-sample error is the average prediction error when both the features and the responses are new (no conditioning on the training set).

Optimism of the Training Error Rate Typically, the training error rate will be less than the true test error. Define the optimism as the expected difference between the in-sample error $Err_{in}$ and the training error: $\omega \equiv E_y(Err_{in} - \overline{err})$.

Optimism (cont’d) For squared error, 0-1, and other loss functions, it can be shown quite generally that $\omega = \frac{2}{N}\sum_{i=1}^N \mathrm{Cov}(\hat{y}_i, y_i)$. For a linear fit with d inputs (and additive noise with variance $\sigma_\varepsilon^2$), this simplifies to $\omega = \frac{2d}{N}\sigma_\varepsilon^2$.

How to estimate prediction error? One approach is to estimate the optimism and then add it to the training error rate; methods such as AIC and BIC work in this way for a special class of estimates that are linear in their parameters. Estimates of the in-sample error are used for model selection. Methods like cross-validation and the bootstrap are direct estimates of the extra-sample error, can be used with any loss function, and are typically used for model assessment.

Estimates of In-Sample Prediction Error General form: $\widehat{Err}_{in} = \overline{err} + \hat{\omega}$. The Cp statistic (when d parameters are fitted under squared error loss): $C_p = \overline{err} + \frac{2d}{N}\hat{\sigma}_\varepsilon^2$. AIC (Akaike information criterion), a more generally applicable estimate of $Err_{in}$ when a log-likelihood loss function is used: $AIC = -\frac{2}{N}\mathrm{loglik} + \frac{2d}{N}$.
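For an ordinary least-squares fit with Gaussian errors these estimates can be computed directly. A sketch (for brevity the noise variance is estimated from the same fit, whereas a low-bias "full" model is preferable; all names are illustrative):

```python
import numpy as np

def cp_and_aic(X, y):
    # OLS fit with d = X.shape[1] inputs
    N, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    err_bar = np.mean(resid ** 2)                    # training error under squared loss
    sigma2_hat = np.sum(resid ** 2) / (N - d)        # noise-variance estimate (from this fit)
    cp = err_bar + 2.0 * d / N * sigma2_hat          # Cp = err_bar + (2d/N) * sigma_hat^2
    # Gaussian AIC with sigma^2 treated as known; equivalent to Cp up to scaling
    aic = err_bar / sigma2_hat + 2.0 * d / N
    return err_bar, cp, aic
```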

More on AIC Choose the model giving the smallest AIC over the set of models considered. Given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, define $AIC(\alpha) = \overline{err}(\alpha) + \frac{2\,d(\alpha)}{N}\hat{\sigma}_\varepsilon^2$, where $d(\alpha)$ is the effective number of parameters. Find the tuning parameter $\hat{\alpha}$ that minimizes this function; the final chosen model is $f_{\hat{\alpha}}(x)$.

Bayesian Information Criterion (BIC) A model selection tool applicable in settings where the fitting is carried out by maximization of a log-likelihood. It is motivated from a Bayesian point of view. BIC tends to penalize complex models more heavily, giving preference to simpler models in selection. Its generic form is $BIC = -2\,\mathrm{loglik} + (\log N)\, d$.
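A sketch of the generic BIC computation for a Gaussian linear model, dropping the additive constants of the log-likelihood (the function name is illustrative):

```python
import numpy as np

def bic_gaussian(X, y):
    # BIC = -2 * loglik + log(N) * d for an OLS fit, dropping additive constants
    N, d = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    loglik = -0.5 * N * np.log(rss / N)   # profiled Gaussian log-likelihood, up to constants
    return -2.0 * loglik + np.log(N) * d
```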

Bayesian Model Selection Suppose we have candidate models $M_m$, $m = 1, \ldots, M$, with corresponding model parameters $\theta_m$. Prior distribution: $Pr(\theta_m \mid M_m)$. Posterior probability, given training data Z: $Pr(M_m \mid Z) \propto Pr(M_m)\, Pr(Z \mid M_m)$. Compare two models via the posterior odds: $\frac{Pr(M_m \mid Z)}{Pr(M_\ell \mid Z)} = \frac{Pr(M_m)}{Pr(M_\ell)} \cdot \frac{Pr(Z \mid M_m)}{Pr(Z \mid M_\ell)}$. The second factor on the right-hand side is called the Bayes factor and describes the contribution of the data towards the posterior odds.

Bayesian Approach Continued Unless there is strong evidence to the contrary, we typically assume that the prior over models is uniform (a non-informative prior). Using the Laplace approximation, one can establish a simple (but approximate) relationship between the posterior model probability and BIC: a lower BIC implies a higher posterior probability of the model. The use of BIC as a model selection criterion is thus justified.

AIC or BIC? BIC is asymptotically consistent as a selection criterion: given a family of models that includes the true model, the probability that BIC will select the correct one approaches one as the sample size becomes large. AIC does not have this property; instead, it tends to choose models that are too complex as $N \to \infty$. For small or moderate samples, however, BIC often chooses models that are too simple, because of its heavy penalty on complexity.

Cross-Validation The simplest and most widely used method for estimating prediction error. The idea is to directly estimate the extra-sample error $Err = E[L(Y, \hat{f}(X))]$, the error incurred when the method is applied to an independent test sample from the joint distribution of X and Y. In K-fold cross-validation, we split the data into K roughly equal-sized parts. For the k-th part, we fit the model to the other K-1 parts and calculate the prediction error of the fitted model when predicting the k-th part of the data.

Cross-Validation (Cont’d) The cross-validation estimate of prediction error is $CV(\hat{f}, \alpha) = \frac{1}{N}\sum_{i=1}^N L\big(y_i, \hat{f}^{-\kappa(i)}(x_i, \alpha)\big)$, where $\kappa(i)$ is the fold containing observation i and $\hat{f}^{-k}(x, \alpha)$ is the model fit with the k-th part removed. This provides an estimate of the test error curve, and we find the tuning parameter $\hat{\alpha}$ that minimizes it. Our final chosen model is $f(x, \hat{\alpha})$, which we then fit to all of the data.
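A minimal K-fold cross-validation sketch in NumPy, here for ridge regression over a small grid of tuning parameters (the ridge fit, the grid, and the simulated data are illustrative assumptions):

```python
import numpy as np

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def kfold_cv_error(X, y, lam, K=10, seed=0):
    # K-fold CV estimate of squared-error prediction error for one value of lambda
    N = len(y)
    folds = np.random.default_rng(seed).permutation(N) % K   # assign each observation to a fold
    losses = np.empty(N)
    for k in range(K):
        held_out = folds == k
        beta = ridge_fit(X[~held_out], y[~held_out], lam)    # fit on the other K-1 parts
        losses[held_out] = (y[held_out] - X[held_out] @ beta) ** 2
    return losses.mean()

# Pick the tuning parameter that minimizes the CV error, then refit on all of the data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)
grid = [0.01, 0.1, 1.0, 10.0]
lam_hat = min(grid, key=lambda lam: kfold_cv_error(X, y, lam))
beta_final = ridge_fit(X, y, lam_hat)
```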

The Learning Curve [Figure: a hypothetical learning curve, showing how prediction performance improves with the size of the training set; taken from p. 215 of The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.]

Value of K? If K = N (leave-one-out CV), then CV is approximately unbiased, but can have high variance; the computational burden is also high. On the other hand, with K = 5, say, CV has lower variance but more bias. If the learning curve has a considerable slope at the given training-set size, 5-fold or 10-fold CV will overestimate the true prediction error.

Bootstrap Method A general tool for assessing statistical accuracy. Suppose we have a model to fit to the training data $Z = (z_1, \ldots, z_N)$, with $z_i = (x_i, y_i)$. The idea is to draw random samples with replacement of size N from the training data. This process is repeated B times to get B bootstrap datasets. Refit the model to each of the bootstrap datasets and examine the behavior of the fits over the B replications.

Bootstrap (Cont’d) Here $S(Z)$ is any quantity computed from the data Z. From the bootstrap sampling we can estimate any aspect of the distribution of $S(Z)$. For example, its variance is estimated by $\widehat{\mathrm{Var}}[S(Z)] = \frac{1}{B-1}\sum_{b=1}^B \big(S(Z^{*b}) - \bar{S}^*\big)^2$, where $\bar{S}^* = \frac{1}{B}\sum_{b=1}^B S(Z^{*b})$.
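A sketch of this variance estimate, using the sample median as the statistic S (the statistic, B, and the simulated data are illustrative choices):

```python
import numpy as np

def bootstrap_variance(z, stat, B=1000, seed=0):
    # Recompute the statistic on B samples of size N drawn with replacement from z.
    rng = np.random.default_rng(seed)
    N = len(z)
    s_star = np.array([stat(z[rng.integers(0, N, N)]) for _ in range(B)])
    return s_star.var(ddof=1)        # (1/(B-1)) * sum_b (S(Z*b) - S*_bar)^2

z = np.random.default_rng(2).standard_exponential(100)
print(bootstrap_variance(z, np.median))   # bootstrap variance estimate for the sample median
```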

Bootstrap used to estimate prediction error: mimic CV. Fit the model on a set of bootstrap samples, and for each observation keep track of the predictions from bootstrap samples not containing that observation. The leave-one-out bootstrap estimate of prediction error is $\widehat{Err}^{(1)} = \frac{1}{N}\sum_{i=1}^N \frac{1}{|C^{-i}|}\sum_{b \in C^{-i}} L\big(y_i, \hat{f}^{*b}(x_i)\big)$, where $C^{-i}$ is the set of indices of the bootstrap samples that do not contain observation i. This solves the over-fitting problem suffered by the naive bootstrap estimate, but has the training-set-size bias mentioned in the discussion of CV.

The “0.632 Estimator” The average number of distinct observations in each bootstrap sample is approximately $0.632 \times N$, since $Pr(\text{observation } i \in \text{bootstrap sample } b) = 1 - (1 - 1/N)^N \approx 1 - e^{-1} \approx 0.632$. The bias of the leave-one-out bootstrap will therefore roughly behave like that of two-fold cross-validation (biased upwards). The “0.632 estimator”, $\widehat{Err}^{(.632)} = 0.368\,\overline{err} + 0.632\,\widehat{Err}^{(1)}$, is designed to get rid of this bias.
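A sketch of the leave-one-out bootstrap and the 0.632 estimator, with a 1-nearest-neighbour classifier under 0-1 loss as the base learner (the learner, B, and the simulated data are illustrative assumptions):

```python
import numpy as np

def nn1_predict(X_train, y_train, X_test):
    # 1-nearest-neighbour prediction, used here only as a simple base learner
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    return y_train[d2.argmin(axis=1)]

def err_632(X, y, B=100, seed=0):
    rng = np.random.default_rng(seed)
    N = len(y)
    loss_sum = np.zeros(N)   # summed 0-1 loss per observation, over samples not containing it
    loss_cnt = np.zeros(N)   # number of bootstrap samples not containing each observation
    for b in range(B):
        idx = rng.integers(0, N, N)                      # bootstrap sample Z*b
        out = np.setdiff1d(np.arange(N), idx)            # observations left out of Z*b
        if out.size == 0:
            continue
        loss_sum[out] += (nn1_predict(X[idx], y[idx], X[out]) != y[out])
        loss_cnt[out] += 1
    used = loss_cnt > 0
    err1 = np.mean(loss_sum[used] / loss_cnt[used])      # leave-one-out bootstrap estimate
    err_bar = np.mean(nn1_predict(X, y, X) != y)         # training error (0 for 1-NN)
    return 0.368 * err_bar + 0.632 * err1                # the 0.632 estimator

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(err_632(X, y))
```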

Bagging Introduced by Breiman (Machine Learning, 1996). An acronym for ‘bootstrap aggregation’. It averages the prediction over a collection of bootstrap samples, thereby reducing the variance of the prediction.

Bagging (Cont’d) Consider the regression problem with training data $Z = \{(x_1, y_1), \ldots, (x_N, y_N)\}$. Fit a model and get a prediction $\hat{f}(x)$ at the input x. For each bootstrap sample $Z^{*b}$, $b = 1, \ldots, B$, fit the model and get the prediction $\hat{f}^{*b}(x)$. Then the bagging (or bagged) estimate is $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x)$.
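A minimal bagging sketch for regression, using a regression tree as the (unstable) base learner; scikit-learn's DecisionTreeRegressor is assumed to be available, and the function name and B are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_new, B=100, seed=0):
    # Average the predictions of B trees, each fit to a bootstrap sample of the training data.
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = np.zeros((B, len(X_new)))
    for b in range(B):
        idx = rng.integers(0, N, N)                           # bootstrap sample Z*b
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])    # fitted model f*b
        preds[b] = tree.predict(X_new)
    return preds.mean(axis=0)                                 # f_bag(x) = (1/B) * sum_b f*b(x)
```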

Bagging (extended to classification) Let $\hat{G}(x)$ be a classifier for a K-class response. Consider the underlying indicator-vector function $\hat{f}(x) = (0, \ldots, 0, 1, 0, \ldots, 0)$, whose k-th entry is 1 if the prediction at x is the k-th class, so that $\hat{G}(x) = \arg\max_k \hat{f}(x)$. Then the bagged estimate is $\hat{f}_{bag}(x) = \frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x) = (p_1(x), \ldots, p_K(x))$, where $p_k(x)$ is the proportion of base classifiers predicting class k at x. Finally, $\hat{G}_{bag}(x) = \arg\max_k p_k(x)$, a majority vote.

Bagging Example [Figure: bagging trees on a simulated dataset; taken from p. 249 of The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.]

Bayesian Model Averaging Candidate models: $M_m$, $m = 1, \ldots, M$. For a quantity of interest $\zeta$ (for example, a prediction $f(x)$ at a new point x), the posterior distribution and posterior mean are $Pr(\zeta \mid Z) = \sum_{m=1}^M Pr(\zeta \mid M_m, Z)\, Pr(M_m \mid Z)$ and $E(\zeta \mid Z) = \sum_{m=1}^M E(\zeta \mid M_m, Z)\, Pr(M_m \mid Z)$. The Bayesian prediction (posterior mean) is a weighted average of the individual predictions, with weights proportional to the posterior probability of each model. Posterior model probabilities can be estimated by BIC.
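One practical approximation, assuming a uniform prior over models: convert BIC values to approximate posterior model probabilities via $Pr(M_m \mid Z) \approx \exp(-\frac{1}{2} BIC_m) / \sum_\ell \exp(-\frac{1}{2} BIC_\ell)$ and average the predictions with those weights. A sketch (the example numbers are made up):

```python
import numpy as np

def bic_weights(bics):
    # Approximate posterior model probabilities from BIC values (uniform prior over models).
    b = np.asarray(bics, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))     # subtract the minimum BIC for numerical stability
    return w / w.sum()

def bma_prediction(predictions, bics):
    # Weighted average of the models' predictions, weights proportional to exp(-BIC/2).
    return np.tensordot(bic_weights(bics), np.asarray(predictions), axes=1)

# Example: three models' predictions at five test points, with made-up BIC values.
preds = [np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
         np.array([1.1, 2.1, 2.9, 4.2, 4.8]),
         np.array([0.5, 1.5, 3.5, 4.5, 5.5])]
print(bma_prediction(preds, bics=[120.0, 118.5, 130.0]))
```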

Frequentist Model Averaging Given predictions $\hat{f}_1(x), \ldots, \hat{f}_M(x)$, under squared error loss we can seek weights $w = (w_1, \ldots, w_M)$ such that $\hat{w} = \arg\min_w E_P\big[Y - \sum_{m=1}^M w_m \hat{f}_m(x)\big]^2$. The solution is the population linear regression of Y on the vector of predictions $\hat{F}(x) = (\hat{f}_1(x), \ldots, \hat{f}_M(x))^T$. At the population level, combining models in this way never makes things worse. Since the population regression is not available, it is replaced by regression over the training set, which often does not work well.

Stacking Stacked generalization, or stacking, is a way to get around this problem. The stacking weights are given by $\hat{w}^{st} = \arg\min_w \sum_{i=1}^N \big[y_i - \sum_{m=1}^M w_m \hat{f}_m^{-i}(x_i)\big]^2$, where $\hat{f}_m^{-i}(x_i)$ is the prediction at $x_i$ from the m-th model fit without the i-th observation. The final stacking prediction is $\hat{f}^{st}(x) = \sum_{m=1}^M \hat{w}_m^{st} \hat{f}_m(x)$. There is a close connection with leave-one-out cross-validation. Stacking gives better prediction, but less interpretability.
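A sketch of stacking in which K-fold out-of-fold predictions stand in for the leave-one-out fits and the weights are obtained by unconstrained least squares (both simplifications; in practice the weights are often constrained to be nonnegative, and the base learners below are illustrative scikit-learn choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

def out_of_fold_predictions(make_models, X, y, K=10, seed=0):
    # (i, m) entry: model m's prediction at x_i, fit on the data with x_i's fold held out.
    N, M = len(y), len(make_models)
    folds = np.random.default_rng(seed).permutation(N) % K
    P = np.zeros((N, M))
    for k in range(K):
        held_out = folds == k
        for m, make_model in enumerate(make_models):
            fit = make_model().fit(X[~held_out], y[~held_out])
            P[held_out, m] = fit.predict(X[held_out])
    return P

def stacking_weights(make_models, X, y):
    # Least-squares weights minimizing sum_i (y_i - sum_m w_m f_m^{-i}(x_i))^2.
    P = out_of_fold_predictions(make_models, X, y)
    w, *_ = np.linalg.lstsq(P, y, rcond=None)
    return w

# Usage: base learners are passed as factories so every fold gets a fresh model.
make_models = [LinearRegression, lambda: Ridge(alpha=1.0),
               lambda: DecisionTreeRegressor(max_depth=3)]
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(size=200)
w = stacking_weights(make_models, X, y)
full_fits = [make().fit(X, y) for make in make_models]       # refit each model on all the data
x_new = rng.normal(size=(1, 5))
print(sum(w_m * fit.predict(x_new) for w_m, fit in zip(w, full_fits)))
```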

References Hastie, T., Tibshirani, R. and Friedman, J., The Elements of Statistical Learning, Springer (Chapters 7 and 8).