Model Assessment and Selection

Model Assessment and Selection. Lecture Notes for Comp540, Chapter 7. Jian Li, March 2007.

Goals: Model Selection and Model Assessment.

A Regression Problem: y = f(x) + noise. Can we learn f from this data? Let's consider three methods...

Linear Regression

Quadratic Regression

Joining the dots

Which is best? Why not simply choose the method with the best fit to the training data? The real question is: "How well are you going to predict future data drawn from the same distribution?"

Model Selection and Assessment. Model selection: estimating the performance of different models in order to choose the best one (the one with the minimum test error). Model assessment: having chosen a final model, estimating its prediction error on new data.

Why Study Errors? In a data-rich situation we would simply split the data into three parts: a training set (to fit the models), a validation set (for model selection), and a test set (for model assessment). But that is not usually the case.

Overall Outline: Measurement of errors (loss functions); decomposing test error into bias and variance; estimating the true error, either as in-sample error estimated analytically (AIC, BIC, MDL, SRM with VC dimension) or as extra-sample error estimated by efficient sample reuse (cross-validation and the bootstrap).

Measuring Errors: Loss Functions. Typical regression loss functions: squared error, L(Y, f̂(X)) = (Y - f̂(X))²; absolute error, L(Y, f̂(X)) = |Y - f̂(X)|.

Measuring Errors: Loss Functions. Typical classification loss functions: 0-1 loss, L(G, Ĝ(X)) = I(G ≠ Ĝ(X)); log-likelihood (cross-entropy loss / deviance), L(G, p̂(X)) = -2 log p̂_G(X).
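To make the definitions concrete, here is a minimal Python sketch (NumPy only; the function names are my own, not from the slides) of the four loss functions just listed.

```python
import numpy as np

def squared_error(y, f):          # regression: (y - f(x))^2
    return (y - f) ** 2

def absolute_error(y, f):         # regression: |y - f(x)|
    return np.abs(y - f)

def zero_one_loss(g, g_hat):      # classification: 1 if misclassified, else 0
    return (g != g_hat).astype(float)

def cross_entropy_loss(g, p_hat):
    # classification deviance: -2 * log p_hat_g(x)
    # p_hat is an (N, K) array of class probabilities, g holds class indices 0..K-1
    return -2.0 * np.log(p_hat[np.arange(len(g)), g])
```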

The Goal: Low Test Error. We want to minimize the generalization (test) error, Err = E[L(Y, f̂(X))], the expected loss over an independent test sample. But all we really know is the training error, err = (1/N) Σ_i L(y_i, f̂(x_i)), and this is a bad estimate of test error.

Bias, Variance & Complexity. Training error can always be reduced by increasing model complexity, but this risks over-fitting. Typically, training error decreases monotonically with complexity while test error first decreases and then rises again.

Decomposing Test Error. Model: Y = f(X) + ε, with E[ε] = 0 and Var(ε) = σε². For squared-error loss and additive noise, the expected prediction error at a point x0 decomposes as Err(x0) = σε² + [E f̂(x0) - f(x0)]² + E[f̂(x0) - E f̂(x0)]², i.e. irreducible error + bias² + variance: the irreducible error of the target Y, the squared deviation of the average estimate from the true function, and the expected squared deviation of our estimate around its mean.
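As a sanity check on the decomposition, the following sketch (NumPy only; the true function, noise level, and the deliberately misspecified straight-line fit are my own illustrative assumptions) simulates many training sets and verifies at a single test point x0 that bias² + variance + irreducible error approximately equals the directly estimated prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                      # illustrative "true" function
sigma = 0.3                     # noise standard deviation
x0, n, n_reps = 1.0, 30, 2000   # test point, training size, number of simulated training sets

preds = np.empty(n_reps)
for r in range(n_reps):
    x = rng.uniform(0, 2, n)
    y = f(x) + rng.normal(0, sigma, n)
    beta = np.polyfit(x, y, deg=1)     # fit a simple (misspecified) straight line
    preds[r] = np.polyval(beta, x0)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
irreducible = sigma ** 2

# direct estimate of the expected prediction error at x0 on fresh targets
y0 = f(x0) + rng.normal(0, sigma, n_reps)
err = ((y0 - preds) ** 2).mean()

print(f"bias^2 + var + sigma^2 = {bias2 + variance + irreducible:.4f}")
print(f"direct Err(x0)         = {err:.4f}")
```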

Further Bias Decomposition. For linear models (e.g. ridge regression) the average squared bias can be decomposed further into an average model bias (the error between the best-fitting linear approximation f_*(x) = x^T β_* and the true function) and an average estimation bias (the error between the average estimate E[x^T β̂] and that best linear approximation). For standard least-squares linear regression the estimation bias is zero; for restricted fits such as ridge regression it is positive, and is traded off against a reduction in variance.

Graphical representation of bias and variance (schematic): the truth lies outside the model space of basic linear regression; the closest fit in population (if ε = 0) differs from the truth by the model bias; the regularized model space (ridge regression) restricts the fits further, and the shrunken fit differs from the closest fit by the estimation bias; the scatter of the fitted model across realizations of the data is the estimation variance.

Bias & Variance Decomposition: Examples. For k-nearest-neighbour regression, Err(x0) = σε² + [f(x0) - (1/k) Σ_{ℓ=1..k} f(x_(ℓ))]² + σε²/k, so increasing k typically increases the bias and decreases the variance. For a linear fit f̂(x) = x^T β̂ with p inputs, the prediction is a linear weighting of the responses y; averaging over the training set gives an average variance of (p/N) σε², so the variance grows with the number of parameters p.
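The σε²/k variance term for k-NN is easy to check numerically. This small sketch (my own illustrative setup, not from the slides) fixes the training inputs, redraws the responses many times, and compares the simulated variance of the k-NN prediction at x0 with σ²/k.

```python
import numpy as np

rng = np.random.default_rng(7)
N, k, sigma, x0 = 200, 15, 0.5, 0.0
x = rng.uniform(-1, 1, N)                      # fixed training inputs
nn = np.argsort(np.abs(x - x0))[:k]            # indices of the k nearest neighbours of x0

preds = []
for _ in range(3000):                          # redraw the responses many times
    y = np.sin(3 * x) + rng.normal(0, sigma, N)
    preds.append(y[nn].mean())                 # k-NN regression estimate at x0

print("simulated variance:", np.var(preds))    # approximately sigma^2 / k
print("sigma^2 / k       :", sigma**2 / k)
```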

Simulated Example of the Bias-Variance Decomposition. For regression with squared-error loss, squared bias plus variance adds up to the prediction error. For classification with 0-1 loss it does not: the bias-variance behaviour is different, because estimation errors on the correct side of the decision boundary don't hurt!

Optimism of the Training Error Rate. Typically the training error rate is less than the true error, because the same data is used both to fit the method and to assess its error; the training error is therefore overly optimistic.

Estimating Test Error. Can we estimate the discrepancy between the training error err and the test error Err? Err is an extra-sample error. Define the in-sample error Err_in as the expectation over N new responses observed at each of the training points x_i. The optimism is the difference op = Err_in - err, the adjustment needed to correct the training error for its optimism.

Optimism. Summary: for squared error, 0-1 and other loss functions, the average optimism is ω = (2/N) Σ_i Cov(ŷ_i, y_i). For a linear fit with d independent inputs or basis functions, ω = 2 (d/N) σε²: optimism grows linearly with d and decreases as the training sample size N increases.
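To check the 2dσε²/N formula, this sketch (NumPy only; the fixed design, noise level, and simulation sizes are my own illustrative choices) repeatedly redraws the response vector on a fixed design, fits ordinary least squares, and compares the average optimism Err_in - err with 2dσε²/N.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 50, 5, 1.0
X = rng.normal(size=(N, d))                  # fixed design with d independent inputs
beta_true = rng.normal(size=d)
f_true = X @ beta_true

optimism = []
for _ in range(5000):
    y = f_true + rng.normal(0, sigma, N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat
    train_err = np.mean((y - y_hat) ** 2)
    # in-sample error: expected squared error on NEW responses at the same x_i,
    # E[(Y_new - y_hat)^2] = sigma^2 + (f_true - y_hat)^2
    err_in = np.mean(sigma**2 + (f_true - y_hat) ** 2)
    optimism.append(err_in - train_err)

print("simulated average optimism:", np.mean(optimism))
print("2 d sigma^2 / N           :", 2 * d * sigma**2 / N)
```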

Ways to Estimate Prediction Error. In-sample error estimates: AIC, BIC, MDL, SRM. Extra-sample error estimates: cross-validation (leave-one-out or K-fold) and the bootstrap.

Estimates of In-Sample Prediction Error. General form of the in-sample estimate: Êrr_in = err + ω̂, the training error plus an estimate of the average optimism. For a linear fit with d parameters under squared-error loss this gives the C_p statistic, C_p = err + 2 (d/N) σ̂ε².

AIC & BIC. Similarly, the Akaike Information Criterion uses a log-likelihood loss: AIC = -(2/N) loglik + 2 (d/N). The Bayesian Information Criterion is BIC = -2 loglik + (log N) d; it arises from an approximation to the Bayesian posterior model probability and penalizes complex models more heavily than AIC once N is moderately large.

AIC & BIC
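To make the criteria concrete, here is a hedged Python sketch (NumPy only; the data-generating process, the polynomial model family, and the way σ̂ε² is obtained from a low-bias fit are my own illustrative choices) computing the squared-error forms of AIC (equivalently C_p) and BIC for models of increasing size.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + rng.normal(0, 0.3, N)        # illustrative data

# sigma^2 estimated from a low-bias, high-degree fit (a "roughly working model")
resid = y - np.polyval(np.polyfit(x, y, 10), x)
sigma2 = np.sum(resid ** 2) / (N - 11)

for d in range(1, 9):                            # d = number of polynomial coefficients
    fit = np.polyfit(x, y, d - 1)
    err = np.mean((y - np.polyval(fit, x)) ** 2)             # training error
    aic = err + 2 * d / N * sigma2                           # AIC / C_p form for squared-error loss
    bic = (N / sigma2) * (err + np.log(N) * d / N * sigma2)  # BIC, Gaussian squared-error form
    print(f"d={d}  train_err={err:.4f}  AIC={aic:.4f}  BIC={bic:.1f}")
```

Training error keeps falling with d, while AIC and BIC reach a minimum at an intermediate model size, with BIC favouring the smaller model.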

MDL (Minimum Description Length). Regularity ~ compressibility; learning ~ finding regularities. (Schematic: a learning model maps input samples in R^n to predictions in R^1; the predictions are compared with the real class to give an error, and the question is whether the learned model equals the real model.)

MDL (Minimum Description Length). The description length of the data under a model is the length of transmitting the discrepancy between the data and the model's predictions (under the optimal coding given the model) plus the description of the model itself under optimal coding: length = -log Pr(y | θ, M, X) - log Pr(θ | M). The MDL principle: choose the model with the minimum description length. This is equivalent to maximizing the posterior probability Pr(θ | y, M, X).

SRM with the VC (Vapnik-Chervonenkis) Dimension. Vapnik showed that, with probability 1 - η, the true error is bounded by the training error plus a confidence term that grows with the VC dimension h (and shrinks with the sample size); as h increases the bound becomes looser. Structural risk minimization (SRM) is a method of selecting a class F from a family of nested classes by minimizing this upper bound.

Err_in Estimation: all of these criteria embody a trade-off between the fit to the data and the model complexity.

Estimation of the Extra-Sample Error: Cross-Validation and the Bootstrap.

Cross-Validation (K-fold). Split the data into K roughly equal parts; for each fold, fit the model on the other K-1 parts (train) and evaluate it on the held-out part (test); average the K test errors: CV = (1/N) Σ_i L(y_i, f̂^{-κ(i)}(x_i)), where κ(i) is the fold containing observation i.

How Many Folds? As k increases (from small k-fold toward leave-one-out), the computation increases and the bias of the error estimate decreases, but its variance tends to increase.

Cross-Validation: Choosing K. Popular choices for K are 5, 10, and N (leave-one-out).
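A minimal K-fold cross-validation sketch in Python (NumPy only; the polynomial model family, the K=10 choice, and the data are my own illustrative assumptions). Setting K = len(x) gives leave-one-out CV.

```python
import numpy as np

def k_fold_cv_error(x, y, K, degree, seed=0):
    """Average squared error over K folds for a polynomial fit of the given degree."""
    idx = np.random.default_rng(seed).permutation(len(x))   # shuffle, then split into K folds
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[test] - np.polyval(coef, x[test])) ** 2))
    return np.mean(errs)

# usage: compare model complexities
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)
y = np.sin(3 * x) + rng.normal(0, 0.3, 100)
for degree in range(0, 8):
    print(degree, k_fold_cv_error(x, y, K=10, degree=degree))
```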

Generalized Cross-Validation. LOOCV can be computationally expensive for linear fitting with large N. For a linear fitting method ŷ = S y under squared-error loss, the leave-one-out error has the closed form (1/N) Σ_i [(y_i - f̂(x_i)) / (1 - S_ii)]², and GCV approximates it by replacing each S_ii with its average: GCV = (1/N) Σ_i [(y_i - f̂(x_i)) / (1 - trace(S)/N)]². GCV thus provides a computationally cheaper approximation.
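A short sketch of the LOOCV shortcut and GCV for a linear smoother (here ordinary least squares on a cubic polynomial basis; the hat-matrix construction is the standard formula, everything else is an illustrative assumption).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
x = rng.uniform(-1, 1, N)
y = np.sin(3 * x) + rng.normal(0, 0.3, N)

# linear smoother: y_hat = S y, with S = X (X^T X)^{-1} X^T for a polynomial basis
X = np.vander(x, 4, increasing=True)              # basis 1, x, x^2, x^3
S = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = S @ y
resid = y - y_hat

loocv = np.mean((resid / (1 - np.diag(S))) ** 2)  # exact leave-one-out CV for a linear fit
gcv = np.mean((resid / (1 - np.trace(S) / N)) ** 2)
print("LOOCV:", loocv, " GCV:", gcv)
```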

Bootstrap: Main Concept. "The bootstrap is a computer-based method of statistical inference that can answer many real statistical questions without formulas" (An Introduction to the Bootstrap, Efron and Tibshirani, 1993). Step 1: draw samples of the same size as the data, with replacement. Step 2: calculate the statistic of interest on each resample.

How Does It Work? Consider the sampling distribution of the sample mean. In practice we cannot afford to draw a large number of independent random samples from the population, and only in simple cases does theory tell us the sampling distribution exactly. In the bootstrap, the observed sample stands in for the population, and the distribution of the statistic computed over many resamples stands in for its sampling distribution.
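A minimal sketch of the main concept (my own illustrative example, not from the slides): bootstrapping the standard error of the sample mean and comparing it with the familiar s/√n formula.

```python
import numpy as np

rng = np.random.default_rng(5)
sample = rng.exponential(scale=2.0, size=50)      # the observed sample "stands for the population"

B = 2000
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(sample, size=len(sample), replace=True)  # Step 1: sample with replacement
    boot_means[b] = resample.mean()                                # Step 2: calculate the statistic

# the spread of the bootstrap means estimates the standard error of the sample mean
print("bootstrap SE:          ", boot_means.std(ddof=1))
print("theoretical SE s/sqrt(n):", sample.std(ddof=1) / np.sqrt(len(sample)))
```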

Bootstrap: Error Estimation with Err_boot. The quantity we want depends on the unknown true distribution F. A straightforward application of the bootstrap to error prediction: fit the model on each bootstrap sample and average its loss over the original training observations, Êrr_boot = (1/B)(1/N) Σ_b Σ_i L(y_i, f̂*b(x_i)). Because each bootstrap sample overlaps with the original data, this estimate is optimistic.

Bootstrap: Error Estimation with Err^(1). A CV-inspired improvement on Err_boot: for each observation i, only keep predictions from bootstrap fits whose sample did not contain i (the leave-one-out bootstrap), Êrr^(1) = (1/N) Σ_i (1/|C^{-i}|) Σ_{b in C^{-i}} L(y_i, f̂*b(x_i)), where C^{-i} is the set of bootstrap samples not containing observation i.
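A hedged sketch of the leave-one-out bootstrap error (NumPy only; the polynomial model, B=100, and the data are my own illustrative choices).

```python
import numpy as np

def leave_one_out_bootstrap_error(x, y, degree, B=100, seed=0):
    """Err^(1): each point is only predicted by bootstrap fits whose sample did not contain it."""
    rng = np.random.default_rng(seed)
    N = len(x)
    losses = [[] for _ in range(N)]
    for b in range(B):
        idx = rng.integers(0, N, N)                    # bootstrap sample (with replacement)
        coef = np.polyfit(x[idx], y[idx], degree)
        out_of_bag = np.setdiff1d(np.arange(N), idx)   # points not in this bootstrap sample
        for i in out_of_bag:
            losses[i].append((y[i] - np.polyval(coef, x[i])) ** 2)
    # average per-point losses, skipping any point that appeared in every bootstrap sample
    return np.mean([np.mean(l) for l in losses if l])

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 80)
y = np.sin(3 * x) + rng.normal(0, 0.3, 80)
print(leave_one_out_bootstrap_error(x, y, degree=3))
```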

Bootstrap: Error Estimation with Err^(.632). An improvement on Err^(1) in light-fitting cases: a bootstrap sample contains on average only about 0.632 N distinct observations, so Err^(1) behaves like an error estimate based on a smaller training set and tends to be pessimistic. The .632 estimator pulls it back toward the training error: Êrr^(.632) = 0.368 err + 0.632 Êrr^(1).

Bootstrap: Error Estimation with Err^(.632+). An improvement on Err^(.632) that adaptively accounts for overfitting. Depending on the amount of overfitting, the best error estimate is as small as Err^(.632), as large as Err^(1), or something in between. Err^(.632+) is like Err^(.632) with adaptive weights, with Err^(1) weighted by at least .632; it mixes the training error and the leave-one-out bootstrap error using the relative overfitting rate R.
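A hedged sketch of the .632 and .632+ combinations for squared-error loss (the function name and interface are my own; it expects the training error and Err^(1) computed as in the previous sketches, plus the full-data fit evaluated at the training points for the no-information rate).

```python
import numpy as np

def err_632_plus(y, y_hat_train, train_err, err1):
    """Combine training error and the leave-one-out bootstrap error Err^(1)
    using the relative overfitting rate R (squared-error loss)."""
    # no-information error rate: average loss when targets and predictions are paired at random
    gamma = np.mean((y[:, None] - y_hat_train[None, :]) ** 2)
    err1_c = min(err1, gamma)                      # keep the estimate in a sensible range
    if gamma > train_err and err1_c > train_err:
        R = (err1_c - train_err) / (gamma - train_err)   # relative overfitting rate in [0, 1]
    else:
        R = 0.0
    w = 0.632 / (1 - 0.368 * R)                    # adaptive weight, at least .632
    err632 = 0.368 * train_err + 0.632 * err1_c    # plain .632 estimator
    err632_plus = (1 - w) * train_err + w * err1_c
    return err632, err632_plus
```

With train_err and err1 from the earlier sketches, and y_hat_train the full-data fit evaluated on the training points, this returns both the plain .632 and the adaptive .632+ estimates.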


Cross-Validation & Bootstrap. Why bother with cross-validation and the bootstrap when analytical estimates are available? 1) AIC, BIC, MDL and SRM all require knowledge of the effective number of parameters d, which is difficult to obtain in most situations. 2) The bootstrap and cross-validation give similar results to the above but are also applicable in more complex situations. 3) Estimating the noise variance requires a roughly correct working model; cross-validation and the bootstrap work well even if the model is far from correct.

Conclusion. Test error plays a crucial role in model selection. AIC, BIC and SRM with VC have the advantage that you only need the training error. If the VC dimension is known, SRM is a good method for model selection; it requires much less computation than CV and the bootstrap, but is wildly conservative. Methods like CV and the bootstrap give tighter error estimates, but may have more variance. Asymptotically, AIC and leave-one-out CV should agree, and BIC and a carefully chosen k-fold CV should agree. BIC is what you want if you want the best structure rather than the best predictor. The bootstrap has much wider applicability than just estimating prediction error.