Model Assessment and Selection

Model Assessment and Selection (Chapter 7). Neil Weisenfeld et al.

Introduction. The generalization performance of a learning method is its prediction capability on independent test data. Assessing it guides model selection and gives a quantitative measure of the quality of the chosen model. This chapter presents the key methods for estimating generalization performance and shows how they are used for model selection.

Bias, Variance, and Model Complexity. The bias-variance trade-off again: generalization is about performance on a test sample, not the training sample. Performance on the training data usually improves monotonically with model complexity, which is exactly why training performance is a poor guide to generalization.

Measuring Performance. Target variable Y, vector of inputs X, prediction model f^(X). Typical choices of loss function: squared error L(Y, f^(X)) = (Y - f^(X))^2 and absolute error L(Y, f^(X)) = |Y - f^(X)|.

Generalization Error. Test error, a.k.a. generalization error: Err_T = E[ L(Y, f^(X)) | T ] for a given training set T, and the expected test error Err = E[Err_T]. Note that this outer expectation averages over everything that is random, including the randomness in the training sample that produced f^. The training error, err_bar = (1/N) Σ_i L(y_i, f^(x_i)), is the average loss over the training sample and is not a good estimate of the test error (next slide).

Training Error and Overfitting. The training error is not a good estimate of the test error: it consistently decreases with model complexity and typically drops to zero if the complexity is high enough, while the test error eventually starts to rise.

Categorical Data. The same ideas apply to a categorical response G. Typical choices of loss function: the 0-1 loss L(G, G^(X)) = I(G ≠ G^(X)) and the log-likelihood loss L(G, p^(X)) = -2 log p^_G(X). Test error again: Err_T = E[ L(G, G^(X)) | T ]. Training error again: err_bar = -(2/N) Σ_i log p^_{g_i}(x_i). Log-likelihood = cross-entropy loss = deviance. (A small code sketch follows.)
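
As a concrete illustration, here is a minimal Python sketch of the two losses above for a K-class problem; the array names (y, g_hat, p_hat) are my own, not from the chapter.

```python
import numpy as np

def zero_one_loss(y, g_hat):
    # Fraction of misclassified observations.
    return np.mean(y != g_hat)

def deviance_loss(y, p_hat):
    # p_hat: (N, K) predicted class probabilities; y: integer labels in 0..K-1.
    # Returns -2 * average log-likelihood (cross-entropy / deviance).
    return -2.0 * np.mean(np.log(p_hat[np.arange(len(y)), y]))
```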

Loss Functions for General Densities. For response densities Pr_{θ(X)}(Y) parameterized by a predictor-dependent parameter θ(X), the log-likelihood can be used as a loss function: L(Y, θ(X)) = -2 log Pr_{θ(X)}(Y).

Two separate goals. Model selection: estimating the performance of different models in order to choose the (approximately) best one. Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data. Ideal situation: split the data into three parts for training, validation (estimate prediction error and select the model), and testing (assess the final model); a typical split is 50% / 25% / 25%, as in the sketch below. The remainder of the chapter deals with the data-poor situation, where the validation step is approximated either analytically (AIC, BIC, MDL, SRM) or by efficient sample reuse (cross-validation, the bootstrap).
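
For the data-rich case, a minimal sketch of the 50/25/25 split might look as follows; the function name and the use of NumPy are illustrative assumptions, not the book's code.

```python
import numpy as np

def train_val_test_split(X, y, seed=0, fractions=(0.50, 0.25, 0.25)):
    # Randomly partition the rows into training, validation, and test sets.
    rng = np.random.default_rng(seed)
    n = len(y)
    idx = rng.permutation(n)
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Usage: fit candidate models on the training set, pick the one with the
# smallest validation error, then report that model's error on the test set.
```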

Bias-Variance Decomposition. Assume Y = f(X) + ε with E[ε] = 0 and Var(ε) = σ_ε². Then for an input point X = x0, using squared-error loss and a regression fit f^:
Err(x0) = E[ (Y - f^(x0))² | X = x0 ] = σ_ε² + [E f^(x0) - f(x0)]² + E[ f^(x0) - E f^(x0) ]²
= Irreducible Error + Bias² + Variance.
The irreducible error is the variance of the target around its true mean; the squared bias is the amount by which the average estimate differs from the true mean; the variance is the expected squared deviation of f^(x0) around its own mean. A small simulation sketch follows.
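
A Monte Carlo sketch can make the three terms concrete. It assumes a made-up true function and a k-nearest-neighbor fit; all names here (f_true, fit_predict_knn, x0, ...) are illustrative, not from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    return np.sin(4 * x)

def fit_predict_knn(x_train, y_train, x0, k=10):
    # Predict at x0 by averaging the k nearest training responses.
    idx = np.argsort(np.abs(x_train - x0))[:k]
    return y_train[idx].mean()

sigma, n, x0, reps = 0.3, 100, 0.5, 2000
preds = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 1, n)
    y = f_true(x) + rng.normal(0, sigma, n)
    preds[r] = fit_predict_knn(x, y, x0)

bias2 = (preds.mean() - f_true(x0)) ** 2     # squared bias at x0
variance = preds.var()                        # variance of the fit at x0
err_x0 = sigma**2 + bias2 + variance          # irreducible error + bias^2 + variance
print(bias2, variance, err_x0)
```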

Bias-Variance Decomposition. For a k-nearest-neighbor fit: Err(x0) = σ_ε² + [f(x0) - (1/k) Σ_{l=1..k} f(x_(l))]² + σ_ε²/k, so the variance falls and the squared bias typically rises as k grows. For a linear model fit f^_p(x) with p parameters: Err(x0) = σ_ε² + [f(x0) - E f^_p(x0)]² + ||h(x0)||² σ_ε², where h(x0) = X(XᵀX)⁻¹x0 is the vector of linear weights that produce the fit at x0.

Bias-Variance Decomposition. For the linear model fit, averaging over the training points: (1/N) Σ_i Err(x_i) = σ_ε² + (1/N) Σ_i [f(x_i) - E f^(x_i)]² + (p/N) σ_ε². The in-sample variance term is (p/N) σ_ε², so model complexity is directly related to the number of parameters p. The short check below illustrates this.
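
The variance term can be checked numerically: the sketch below uses a simulated design matrix of my own choosing and verifies that the average in-sample variance of an OLS fit is (p/N) σ_ε².

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 200, 20, 1.5
X = rng.normal(size=(N, p))
H = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix, trace(H) = p
avg_variance = sigma**2 * np.trace(H) / N   # average of Var(f^(x_i)) over training points
print(avg_variance, (p / N) * sigma**2)     # the two numbers agree
```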

Bias-Variance Decomposition. For ridge regression and other restricted linear fits f^_α, the variance term has the same form as before, but with different weights h(x0). Let β_* denote the parameters of the best-fitting linear approximation to f. The bias can be decomposed further into the model bias, f(x0) - x0ᵀβ_*, and the estimation bias, x0ᵀβ_* - E x0ᵀβ^_α. Least squares fits the best linear model, so it has no estimation bias; restricted fits such as ridge regression have a positive estimation bias, traded in favor of a reduced variance.

Bias-Variance Decomposition (figure).

Bias-Variance Decomposition, Example. 50 observations and 20 predictors, uniformly distributed in [0,1]^20. Left panels: k-nearest-neighbor fits; right panels: linear-model fits of size p. A closeup is on the next slide.

Example, continued. For regression with squared-error loss, the decomposition is additive: prediction error = squared bias + variance. For classification with 0-1 loss this additive relationship no longer holds: prediction error ≠ squared bias + variance. Bias and variance interact, and estimation errors that land on the right side of the decision boundary don't hurt at all.

Optimism of the Training Error Rate. Typically the training error rate is less than the true error, because the same data are being used both to fit the method and to assess its error; the training error is therefore overly optimistic.

Optimism of the Training Error Rate. Err is a kind of extra-sample error: the test features need not coincide with the training feature vectors. Focus instead on the in-sample error, Err_in = (1/N) Σ_i E_{Y⁰}[ L(Y⁰_i, f^(x_i)) | T ], where we observe N new response values Y⁰ at each of the training points x_i. Define the optimism op = Err_in - err_bar and its average ω = E_y(op). For squared error, 0-1, and other loss functions: ω = (2/N) Σ_i Cov(y^_i, y_i).

Optimism of the Training Error Rate. Summary: the harder we fit the data, the greater Σ_i Cov(y^_i, y_i) will be, thereby increasing the optimism. For a linear fit with d independent inputs or basis functions, Σ_i Cov(y^_i, y_i) = d σ_ε², so ω = 2 (d/N) σ_ε²: the optimism increases linearly with the number d of basis functions and decreases as the training sample size N grows. A simulation sketch follows.
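
A Monte Carlo sketch of this identity for an ordinary least-squares fit; the simulation setup (fixed design, Gaussian noise, the names below) is my own, not from the chapter.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma, reps = 100, 8, 1.0, 5000
X = rng.normal(size=(N, d))            # fixed design
beta = rng.normal(size=d)
f = X @ beta                           # true mean at the training points

Y = f + rng.normal(0, sigma, size=(reps, N))   # reps independent draws of the responses
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix of the OLS fit
Yhat = Y @ H.T                                 # fitted values for each draw

# Sum over training points of Cov(yhat_i, y_i), estimated across the draws.
cov_sum = sum(np.cov(Yhat[:, i], Y[:, i])[0, 1] for i in range(N))
print(cov_sum, d * sigma**2)                    # should be close to d * sigma^2
print(2 * cov_sum / N, 2 * d * sigma**2 / N)    # estimated vs. exact optimism
```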

Optimism of the Training Error Rate. Ways to estimate prediction error: (1) estimate the optimism and add it to the training error rate; AIC, BIC, and related criteria work this way, for a special class of estimates that are linear in their parameters. (2) Direct estimates of the extra-sample error, via cross-validation or the bootstrap; these can be used with any loss function and with nonlinear, adaptive fitting techniques.

Estimates of In-Sample Prediction Error. General form of the in-sample estimate: Err_in ≈ err_bar + ω^, with ω^ an estimate of the optimism. For a linear fit with d parameters under squared-error loss this gives the C_p statistic: C_p = err_bar + 2 (d/N) σ²_hat, where σ²_hat is an estimate of the noise variance, typically taken from the mean squared error of a low-bias model. A small helper is sketched below.
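
A minimal helper for the C_p estimate might look like this; the function and argument names are mine, and sigma2_hat is assumed to come from a separate low-bias fit.

```python
import numpy as np

def cp_statistic(y, y_hat, d, sigma2_hat):
    # err_bar + 2 * (d / N) * sigma2_hat, the C_p estimate of in-sample error.
    n = len(y)
    err_bar = np.mean((y - y_hat) ** 2)   # training error
    return err_bar + 2.0 * (d / n) * sigma2_hat
```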

Estimates of In-Sample Prediction Error. Similarly, the Akaike Information Criterion (AIC) is a more generally applicable estimate of the in-sample error when a log-likelihood loss is used: AIC = -(2/N) loglik + 2 (d/N), where loglik = Σ_i log Pr_{θ^}(y_i) is the maximized log-likelihood at the maximum-likelihood estimate θ^.

AIC. For example, for the logistic regression model, using the binomial log-likelihood, AIC = -(2/N) loglik + 2 (d/N). To use AIC for model selection, choose the model giving the smallest AIC over the set of models considered; a sketch follows.
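
A sketch of "choose the model with the smallest AIC". To keep it self-contained it uses a Gaussian linear model (polynomial degree selection) rather than the logistic regression example above; the data and names are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x = rng.uniform(-2, 2, N)
y = np.sin(x) + rng.normal(0, 0.3, N)

def aic_for_degree(x, y, degree):
    X = np.vander(x, degree + 1)                 # polynomial basis, d = degree + 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    n, d = len(y), degree + 1
    sigma2 = rss / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)   # maximized Gaussian log-likelihood
    return -2.0 / n * loglik + 2.0 * d / n

degrees = range(1, 11)
best = min(degrees, key=lambda m: aic_for_degree(x, y, m))
print("degree chosen by AIC:", best)
```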

AIC. Given a set of models f_α(x) indexed by a tuning parameter α, the function AIC(α) = err_bar(α) + 2 (d(α)/N) σ²_hat provides an estimate of the test error curve, and we choose the α that minimizes it. If the basis functions are chosen adaptively from among d < p inputs, the identity Σ_i Cov(y^_i, y_i) = d σ_ε² no longer holds: the optimism exceeds 2 (d/N) σ_ε², and the effective number of parameters fit is larger than d.

Using AIC to select the number of basis functions. The input vector is the log-periodogram of the spoken vowel, quantized to 256 uniformly spaced frequencies, and a linear logistic regression model is fit. The coefficient function β(f) = Σ_{m=1..M} h_m(f) θ_m is an expansion in M spline basis functions; for any M, a basis of natural cubic splines is used, with knots chosen uniformly over the range of frequencies, so d(α) = M. AIC(M) approximately minimizes Err(M) for both the entropy and the 0-1 loss.

Using AIC to select the number of basis functions. The approximation underlying AIC does not hold in general for the 0-1 loss, but it still works reasonably well here. (The formula is exact only for linear models with additive errors and squared-error loss.)

Effective Number of Parameters. Let y = (y_1, ..., y_N)ᵀ be the vector of outcomes and y^ the corresponding vector of predictions. A linear fit (e.g. linear regression, quadratic shrinkage such as ridge regression, cubic smoothing splines) can be written y^ = S y, where S is an N×N matrix that depends on the inputs x_i but not on the y_i. The effective number of parameters is df(S) = trace(S); cf. trace(S) is the correct d to use in C_p. A sketch for ridge regression follows.
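
A sketch for ridge regression, where trace(S) can also be computed from the singular values d_j of X as Σ_j d_j²/(d_j² + λ); the matrix sizes and λ below are arbitrary choices of mine.

```python
import numpy as np

def ridge_effective_df(X, lam):
    # Effective degrees of freedom df = trace(S) for the ridge smoother
    # S = X (X^T X + lam * I)^{-1} X^T.
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    return np.trace(S)

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
d = np.linalg.svd(X, compute_uv=False)
lam = 5.0
print(ridge_effective_df(X, lam), np.sum(d**2 / (d**2 + lam)))   # the two agree
```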

Bayesian Approach and BIC. Like AIC, BIC is used when fitting is carried out by maximizing a log-likelihood. The Bayesian Information Criterion is BIC = -2 loglik + (log N) d. BIC is proportional to AIC, with the factor 2 replaced by log N; for N > e² (approximately 7.4), BIC penalizes complex models more heavily than AIC.

BIC Motivation. Given a set of candidate models M_m, m = 1, ..., M, with parameters θ_m, and training data Z, the posterior probability of a given model is Pr(M_m | Z) ∝ Pr(M_m) · Pr(Z | M_m). To compare two models M_m and M_l, form the posterior odds Pr(M_m | Z) / Pr(M_l | Z) = [Pr(M_m) / Pr(M_l)] · [Pr(Z | M_m) / Pr(Z | M_l)]. If the odds exceed 1, choose model m. The prior over models (the left factor) is typically taken to be constant; the right factor, the contribution of the data Z to the posterior odds, is called the Bayes factor BF(Z). We need to approximate Pr(Z | M_m); various chicanery and approximations (p. 207) get us BIC. We can then estimate the posterior from BIC and compare the relative merits of the models.

BIC: How much better is a model? We may want to know how the candidate models stack up relative to one another, not just their ranking. Once we have BIC_m for each model, the estimated posterior probability of model M_m is e^{-BIC_m/2} / Σ_{l=1..M} e^{-BIC_l/2}. The denominator normalizes the result, so we can assess the relative merits of each model; a small helper is sketched below.
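
A small helper for this normalization (the BIC values passed in at the end are made up):

```python
import numpy as np

def bic_posterior_probs(bic_values):
    # exp(-BIC_m / 2), normalized over the candidate models.
    b = np.asarray(bic_values, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # subtract the min for numerical stability
    return w / w.sum()

print(bic_posterior_probs([102.3, 100.1, 105.8]))
```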

Minimum Description Length (MDL). Think of Huffman coding (e.g. for compression): a variable-length code uses the shortest symbol lengths (in bits) for the most frequently occurring targets. A code for English letters might look like (an instantaneous prefix code): E = 1, N = 01, R = 001, etc. The first bit codes "is this an E or not," the second bit codes "is this an N or not," and so on. The entropy gives a lower bound on the average message length (the information content, i.e. the number of bits required to transmit): E[length] ≥ -Σ_i Pr(z_i) log₂ Pr(z_i), so the bit length for z_i should be about -log₂ Pr(z_i). A quick numeric check follows.
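
A quick numeric check of the entropy bound, with made-up symbol probabilities:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # e.g. probabilities for E, N, R, ...
entropy = -np.sum(p * np.log2(p))         # lower bound on average code length (bits)
ideal_lengths = -np.log2(p)               # per-symbol ideal code lengths
print(entropy, ideal_lengths)             # 1.75 bits; lengths [1, 2, 3, 3]
```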

MDL. Now, for our data, the message length for transmitting the outputs given a model M with parameters θ and inputs X is length = -log Pr(y | θ, M, X) - log Pr(θ | M). The second term is the average code length for transmitting the model parameters, and the first term is the average code length for transmitting the discrepancy between the model and the actual target values, so we choose the model that minimizes the total length. Minimizing description length is equivalent to maximizing the posterior probability. The book adds, almost as a non-sequitur: