Cross-validation for the selection of statistical models

Presentation transcript:

Cross-validation for the selection of statistical models
Simon J. Mason, Michael K. Tippett (IRI)

The Model Selection Problem
Given: a family of models M_a and a set of observations. Question: which model should be used?
Goals: maximize predictive ability given limited observations, and accurately estimate that predictive ability.
Example: linear regression with n = 50 observations. Candidate predictors are sorted by their correlation with the predictand; model M1 uses the first predictor, M2 the first two predictors, and so on.
With limited observations it may be difficult to calibrate the parameters of the correct model accurately, reducing its predictive ability; a simpler model may have greater predictive skill.
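A minimal sketch of this setup (the synthetic data, predictor count, and use of NumPy/scikit-learn are illustrative assumptions, not part of the original slides):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, n_candidates = 50, 20

# Synthetic data: only the first 3 candidate predictors carry signal.
X = rng.standard_normal((n, n_candidates))
y = X[:, :3] @ np.array([1.0, 0.6, 0.3]) + rng.standard_normal(n)

# Sort candidate predictors by |correlation| with the predictand.
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_candidates)])
order = np.argsort(corr)[::-1]

# Nested family: M1 uses the best predictor, M2 the best two, and so on.
models = {}
for k in range(1, n_candidates + 1):
    cols = order[:k]
    fit = LinearRegression().fit(X[:, cols], y)
    models[k] = (cols, fit)
```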

Estimating predictive ability
Wrong way: calibrate each model with all of the data and choose the model that best fits the data. In-sample fit of nested models necessarily improves as predictors are added, so the best-fitting model is not necessarily the most predictive one.

In-sample skill estimates
Akaike information criterion (AIC): AIC = -2 log(L) + 2p, an asymptotic estimate of the expected out-of-sample error. Minimizing AIC is equivalent to minimizing Mallows' Cp.
Bayesian information criterion (BIC): BIC = -2 log(L) + p log(n). The difference in BIC between two models approximates the Bayes factor.
Here L = likelihood, p = number of parameters, n = number of samples. Both criteria maximize fit while penalizing complexity, rather than simply maximizing the likelihood of the model given the data; both are in-sample methods, well known in the Earth-science literature. The equivalence between AIC and Mallows' Cp holds for normal multiple linear regression models. The Bayes factor measures the likelihood of one model relative to another given the data, assuming equal prior probabilities for the models.
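A minimal sketch of how AIC and BIC might be computed for a Gaussian linear-regression family like the one above (using the maximized Gaussian log-likelihood with the MLE error variance and counting the error variance as a parameter are assumptions of this sketch):

```python
import numpy as np

def gaussian_aic_bic(y, y_hat, n_coeff):
    """AIC and BIC for a linear model with Gaussian errors.

    Uses the maximized Gaussian log-likelihood with sigma^2 = RSS/n,
    and counts the error variance as one extra parameter.
    """
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    # -2 log L for a Gaussian model evaluated at the MLE of sigma^2:
    neg2_loglik = n * np.log(2 * np.pi * rss / n) + n
    p = n_coeff + 1                      # regression coefficients + error variance
    aic = neg2_loglik + 2 * p
    bic = neg2_loglik + p * np.log(n)
    return aic, bic

# Example: score each nested model from the earlier sketch
# (the extra +1 counts the intercept).
# scores = {k: gaussian_aic_bic(y, fit.predict(X[:, cols]), k + 1)
#           for k, (cols, fit) in models.items()}
```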

AIC and BIC
AIC = -2 log(L) + 2p; BIC = -2 log(L) + p log(n). Since log(n) exceeds 2 once n > 7, BIC penalizes complexity more heavily and tends to select simpler models. AIC is asymptotically (many observations) inconsistent, whereas BIC is consistent. When all candidate models have the same number of parameters, both criteria simply pick the best fit; the relevant case is when the models have different dimensions. A large pool of predictors leads to over-fitting.

Out-of-sample skill estimates
Calibrate and validate models using independent data sets. With many observations, split the data into separate calibration and validation sets. With limited data, repeatedly divide the data instead: leave-1-out cross-validation, or more generally leave-k-out cross-validation. What are the properties of cross-validation, and of model selection by cross-validation?
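A minimal sketch of a leave-k-out cross-validated skill estimate for one of the nested models (here leave-k-out is approximated by repeated random k-out splits rather than enumerating every subset; that approximation, and the use of RMSE as the metric, are assumptions of this sketch):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def leave_k_out_rmse(X, y, k, n_splits=500, seed=0):
    """Approximate leave-k-out CV: repeatedly hold out k cases at random."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = []
    for _ in range(n_splits):
        held_out = rng.choice(n, size=k, replace=False)
        train = np.setdiff1d(np.arange(n), held_out)
        fit = LinearRegression().fit(X[train], y[train])
        pred = fit.predict(X[held_out])
        errors.append(np.mean((y[held_out] - pred) ** 2))
    return np.sqrt(np.mean(errors))

# Example (using X, y, order from the first sketch):
# cv_rmse = {m: leave_k_out_rmse(X[:, order[:m]], y, k=5)
#            for m in range(1, 11)}
```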

Leave-k-out CV is biased
Single predictor and predictand: CV underestimates the correlation, and increasing k reduces (increases) the bias for low (high) correlations (Barnston & van den Dool 1993). Multivariate linear regression: CV overestimates the RMS error, with a bias ~ k/[n(n-k)] (Burman 1989). For a given model with significant skill, large k underestimates skill. It is important to remember that the CV skill estimate is a random variable, a function of the noise realization; these results are for expected values.
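A minimal Monte Carlo sketch of this pessimistic bias: compare the expected leave-k-out MSE with the MSE of the full-sample fit on a large independent sample (the data-generating model, sample sizes, and number of trials are assumptions of this sketch):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, k, n_trials = 50, 10, 200
beta = np.array([1.0, 0.5])

cv_mse, indep_mse = [], []
for _ in range(n_trials):
    X = rng.standard_normal((n, 2))
    y = X @ beta + rng.standard_normal(n)
    # Leave-k-out CV estimate of the MSE (repeated random splits).
    errs = []
    for _ in range(100):
        out = rng.choice(n, size=k, replace=False)
        tr = np.setdiff1d(np.arange(n), out)
        fit = LinearRegression().fit(X[tr], y[tr])
        errs.append(np.mean((y[out] - fit.predict(X[out])) ** 2))
    cv_mse.append(np.mean(errs))
    # MSE of the full-sample fit on a large independent sample.
    fit_all = LinearRegression().fit(X, y)
    X_new = rng.standard_normal((10_000, 2))
    y_new = X_new @ beta + rng.standard_normal(10_000)
    indep_mse.append(np.mean((y_new - fit_all.predict(X_new)) ** 2))

print("mean CV MSE:", np.mean(cv_mse), "mean independent MSE:", np.mean(indep_mse))
```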

On the other hand … selection bias
“If one begins with a very large collection of rival models, then we can be fairly sure that the winning model will have an accidentally high maximum likelihood term.” (Forster). The true predictive skill of the winning model is therefore likely to be overestimated, which affects both goals: optimal model choice and an accurate skill estimate. Ideally an independent data set would be used to estimate skill. The subtle point: for a given model, CV is likely to underestimate skill; but if that model was chosen from a large pool of models, its estimated skill is likely to overestimate the true skill. Can the bias of cross-validation and the selection bias offset each other?
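A minimal sketch of selection bias: when the predictand is pure noise and the best of many candidate predictors is chosen by in-sample correlation, the winning correlation is far from zero even though the true skill is zero (the candidate count, sample size, and number of trials are assumptions of this sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_candidates, n_trials = 50, 100, 1000

winning_corr = []
for _ in range(n_trials):
    y = rng.standard_normal(n)                   # predictand: pure noise
    X = rng.standard_normal((n, n_candidates))   # unrelated candidate predictors
    corr = np.abs(np.corrcoef(X.T, y)[:-1, -1])  # |corr| of each candidate with y
    winning_corr.append(corr.max())              # skill of the "winning" model

# True predictive skill is zero, yet the selected correlation is large on average.
print("mean winning |correlation|:", np.mean(winning_corr))
```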

In-sample and CV estimates
Leave-1-out cross-validation is asymptotically equivalent to AIC (and Mallows’ Cp; Stone 1979). Leave-k-out cross-validation is asymptotically equivalent to BIC for well-chosen k. Increasing k tends to select simpler models: CV with large k penalizes complex models by requiring them to estimate many parameters from little data. The asymptotic limit is again when the number of observations is large. There is no useful distinction when all the models have the same dimension.

Leave-k-out cross-validation
Leaving more out tends to select simpler models. The choice of skill metric matters: cross-validated correlation and RMS error are not simply related, and RMS error selects simpler models in numerical experiments.
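A minimal sketch of selecting the model size by leave-k-out CV under the two metrics, pooling the held-out predictions before scoring (the pooling convention and the random-split approximation are assumptions of this sketch; X, y, and order come from the first sketch):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def cv_scores(X, y, k, n_splits=500, seed=0):
    """Leave-k-out CV (random splits): pooled correlation and RMSE of held-out predictions."""
    rng = np.random.default_rng(seed)
    n = len(y)
    obs, preds = [], []
    for _ in range(n_splits):
        out = rng.choice(n, size=k, replace=False)
        tr = np.setdiff1d(np.arange(n), out)
        fit = LinearRegression().fit(X[tr], y[tr])
        obs.append(y[out])
        preds.append(fit.predict(X[out]))
    obs, preds = np.concatenate(obs), np.concatenate(preds)
    corr = np.corrcoef(obs, preds)[0, 1]
    rmse = np.sqrt(np.mean((obs - preds) ** 2))
    return corr, rmse

# Choosing the number of predictors under each metric:
# scores = {m: cv_scores(X[:, order[:m]], y, k=5) for m in range(1, 11)}
# best_by_corr = max(scores, key=lambda m: scores[m][0])
# best_by_rmse = min(scores, key=lambda m: scores[m][1])
```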

Impact on skill estimates Leaving more out reduces skill estimate biases in numerical experiments.

Better model selected?
If the “true” model is simple, leaving more out selects a better model; if the true model is not simple, leaving more out has only a modest impact on model skill. The parameters of a complex model may be hard to estimate accurately, whereas those of a simpler model may be easier to estimate accurately. Leaving more out gives more accurate estimates of skill.

Conclusions
Increasing the pool of predictors increases the chance of over-fitting and of over-estimating skill. AIC and BIC balance data fit against model complexity, with BIC choosing simpler models. Leave-k-out cross-validation also penalizes model complexity (leave-1-out is asymptotically equivalent to AIC). Leaving more out selects simpler models and reduces the bias of the skill estimate.