Model Selection. Agenda: Myung, Pitt, & Kim; Olsson, Wennerholm, & Lyxzen.


Model Selection

Agenda: Myung, Pitt, & Kim; Olsson, Wennerholm, & Lyxzen

What is a Model? p(t) = w_1 exp(-w_2 t), with w_1 = w_2 = 0.156

What is a Model? p(t) = w_1 exp(-w_2 t), with a different choice of w_1 and w_2

What is a Model? p(t) = w_1 exp(-w_2 t), with yet another choice of w_1 and w_2

What is a Model? A model is a parametric family of probability distributions. – Parametric because the distributions depend on the parameters. – Distributions because the models are stochastic, not deterministic.

Likelihood [figure: a sample of observed data points; x = data]

Likelihood [figure: the observed data points with a candidate model overlaid]
L(w_1, w_2 | x) = ∏_i f(x_i | w_1, w_2)
ln L(w_1, w_2 | x) = Σ_i ln f(x_i | w_1, w_2)

Likelihood The w that maximize ln L(w_1, w_2 | x) = Σ_i ln f(x_i | w_1, w_2) are the maximum likelihood parameter estimates, w_1* and w_2*.
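
A minimal numerical sketch of how such estimates could be computed for the exponential model p(t) = w_1 exp(-w_2 t) from the earlier slides; the retention intervals, trial counts, observed counts, and the binomial sampling assumption are all made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: at each retention interval t_i there are n_i trials and k_i successes.
t = np.array([1.0, 2.0, 4.0, 7.0, 12.0, 21.0])   # retention intervals (assumed)
n = np.full(t.shape, 100)                        # trials per interval (assumed)
k = np.array([82, 70, 55, 40, 28, 18])           # observed successes (made-up data)

def neg_log_likelihood(w, t, n, k):
    w1, w2 = w
    p = np.clip(w1 * np.exp(-w2 * t), 1e-9, 1 - 1e-9)   # model: p(t) = w1 * exp(-w2 * t)
    # Binomial log-likelihood: ln L = sum_i [ k_i ln p_i + (n_i - k_i) ln(1 - p_i) ]
    return -np.sum(k * np.log(p) + (n - k) * np.log(1 - p))

# Maximize ln L by minimizing its negative; the argmax gives w1* and w2*.
res = minimize(neg_log_likelihood, x0=[0.9, 0.1], args=(t, n, k),
               bounds=[(1e-6, 1.0), (1e-6, 5.0)])
w1_star, w2_star = res.x
print(f"w1* = {w1_star:.3f}, w2* = {w2_star:.3f}, max ln L = {-res.fun:.2f}")
```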

Falsifiability (Saturation) Old rule of thumb: A model is falsifiable if the number of parameters is less than the number of data points. New rule of thumb: A model is falsifiable iff the rank of the Jacobian is less than the number of data points. – A model is testable if the probability that the model's predictions are right by chance is 0. – Holds under certain smoothness assumptions.
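
A small numeric illustration (not from the slides) of the new rule of thumb, applied to the exponential model from the earlier slides at an arbitrary point in parameter space.

```python
import numpy as np

t = np.array([1.0, 2.0, 4.0, 7.0, 12.0])   # five data points (assumed design)
w1, w2 = 0.8, 0.2                          # an arbitrary point in parameter space

# Jacobian of the predictions p(t) = w1 * exp(-w2 * t) with respect to (w1, w2):
# one row per data point, one column per parameter.
J = np.column_stack([np.exp(-w2 * t),                # dp/dw1
                     -w1 * t * np.exp(-w2 * t)])     # dp/dw2

# The rank is 2, which is less than the 5 data points, so by the rule of thumb
# this model is falsifiable.
print(np.linalg.matrix_rank(J))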

Generalizability Generalizability (as defined by Myung) is the ability to generalize to new data from the same probability distribution. Data are corrupted by random noise, so goodness of fit reflects both the model's ability to capture regularities and its tendency to fit noise.

Generalizability Goodness of fit = Fit to regularity (Generalizability) + Fit to noise (Overfitting)

Generalizability [figure: model fit (poor to good) plotted against model complexity (low to high), with curves for goodness of fit, generalizability, and overfitting]

Generalizability Note that the more complex a model is, the more it overfits. [figure: the same plot of goodness of fit, generalizability, and overfitting against model complexity]

Generalizability
Model M1: y = w_1 x + e
Model M2 (TRUE): y = ln(x + w_1) + e
Model M3: y = w_1 ln(x + w_2) + w_3 + e
Model M4: y = w_1 x + w_2 x^2 + w_3 + e
[table: training BOF and testing BOF for each model]
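
A simulation sketch of the comparison in this table: generate data from the true model M2, fit all four models to a training sample, and compute badness of fit (here taken to be mean squared error) on the training sample and on a fresh test sample. The true parameter value, noise level, sample sizes, and error measure are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(1)

def m1(x, w1):          return w1 * x
def m2(x, w1):          return np.log(x + w1)
def m3(x, w1, w2, w3):  return w1 * np.log(x + w2) + w3
def m4(x, w1, w2, w3):  return w1 * x + w2 * x**2 + w3

def simulate(n_points=20):
    # True model M2 with an assumed w1 = 3 and Gaussian noise (sd 0.2).
    x = rng.uniform(0.5, 10.0, n_points)
    return x, np.log(x + 3.0) + rng.normal(0.0, 0.2, n_points)

x_train, y_train = simulate()
x_test, y_test = simulate()          # new data from the same distribution

models = [("M1", m1, [1.0], (-np.inf, np.inf)),
          ("M2 (true)", m2, [1.0], (0.0, np.inf)),              # keep x + w1 > 0
          ("M3", m3, [1.0, 1.0, 0.0], ([-10, 0.0, -10], [10, 10, 10])),
          ("M4", m4, [0.1, 0.0, 0.0], (-np.inf, np.inf))]

for name, f, p0, bnds in models:
    w, _ = curve_fit(f, x_train, y_train, p0=p0, bounds=bnds)
    train_bof = np.mean((y_train - f(x_train, *w)) ** 2)   # badness of fit on training data
    test_bof = np.mean((y_test - f(x_test, *w)) ** 2)      # badness of fit on new data
    print(f"{name}: training BOF = {train_bof:.3f}, testing BOF = {test_bof:.3f}")
```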

Generalizability A good fit can be achieved simply because a model is more flexible. A good fit is necessary, but not sufficient for capturing underlying processes. A good fit qualifies the model as a candidate for further consideration.

Generalizability The key for many techniques is to find the model that fits future data best, not necessarily the “true” model. – There is rarely enough data to uniquely identify the true model. – Even if there were, the true model would probably not be one of the models under consideration. This is not to say that we don’t want to find the “true” model (if there is one).

Model Selection The quantity of interest is the lack of generalizability of a model. Essentially:
Goodness of fit = Fit to regularity (Generalizability) + Fit to noise (Overfitting)
Generalizability = GOF - Overfitting
Generalizability = GOF - Complexity
So, -Generalizability = -GOF + Complexity

AIC Akaike Information Criterion (AIC). AIC is a measure of lack of generalizability, so larger values are worse.

AIC
AIC = -2 ln L(w* | y) + 2K
– y are the data
– w* are the MLE estimates
– K is the number of model parameters

AIC
AIC = -2 ln L(w* | y) + 2K
– Badness of fit (-2 ln L): decreases as parameters are added.
– Penalty for complexity (2K): increases as parameters are added.
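
A tiny helper that makes the two terms concrete; the maximized log-likelihoods and parameter counts in the example comparison are hypothetical numbers, not taken from the slides.

```python
def aic(max_log_likelihood: float, n_params: int) -> float:
    """AIC = -2 ln L(w*|y) + 2K: badness of fit plus a penalty of 2 per parameter."""
    return -2.0 * max_log_likelihood + 2 * n_params

# Hypothetical comparison: the second model fits a little better (higher ln L)
# but pays a larger complexity penalty, so its AIC ends up worse.
print(aic(max_log_likelihood=-412.3, n_params=2))   # 828.6
print(aic(max_log_likelihood=-411.9, n_params=4))   # 831.8
```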

AIC AIC measures complexity via the number of parameters. Functional form is not considered. [figure: two example models, one with 1 parameter and one with 2 parameters]

Model Distance [figure: candidate models lying closer to or farther from the “True” model]

AIC AIC selects, from a set of candidate models, the model that on average minimizes the distance between the model and the “True” model. AIC does not depend on knowing the true model. Given certain conditions, K corrects a statistical bias in estimating this distance.
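
For reference (background, not taken from the slide): in Akaike's derivation the distance being estimated is the Kullback-Leibler divergence between the true distribution g and a fitted candidate f(· | w),

D_KL(g || f_w) = ∫ g(x) ln [ g(x) / f(x | w) ] dx

AIC estimates the expected value of this distance up to a constant that is identical for every candidate, which is why the true model never has to be known explicitly.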

Cross Validation (2-Sample) [diagram: split the Data into a Calibration sample and a Validation sample; fit the Model to the calibration sample to find w*, then use w* on the validation sample to compute the CVI]

Cross Validation Easy to use. Sensitive to functional form of model. Not as theoretically grounded as other methods such as AIC.
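
A minimal sketch of the two-sample procedure in the diagram above, using the same made-up retention data as in the earlier MLE sketch: fit on a calibration half, then score the held-out validation half with the calibrated parameters. The split, the data, and the use of -ln L as the CVI are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical retention data (same made-up numbers as in the MLE sketch above).
t = np.array([1.0, 2.0, 4.0, 7.0, 12.0, 21.0])
n = np.full(t.shape, 100)
k = np.array([82, 70, 55, 40, 28, 18])

def neg_log_likelihood(w, t, n, k):
    p = np.clip(w[0] * np.exp(-w[1] * t), 1e-9, 1 - 1e-9)
    return -np.sum(k * np.log(p) + (n - k) * np.log(1 - p))

# Split the data into a calibration half and a validation half.
rng = np.random.default_rng(0)
idx = rng.permutation(len(t))
calib, valid = idx[: len(t) // 2], idx[len(t) // 2:]

# Calibration: find w* using only the calibration sample.
res_c = minimize(neg_log_likelihood, x0=[0.9, 0.1],
                 args=(t[calib], n[calib], k[calib]),
                 bounds=[(1e-6, 1.0), (1e-6, 5.0)])

# Validation: apply w* to the held-out sample without refitting; the resulting
# -ln L serves as the cross-validation index (CVI, smaller is better).
cvi = neg_log_likelihood(res_c.x, t[valid], n[valid], k[valid])
print(f"CVI = {cvi:.2f}")
```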

Cross Validation & AIC In single-sample CV, the CVI is estimated from the calibration sample alone, via an expression in which L_s is the likelihood of the saturated model with 0 df.

Minimum Descriptive Length Suppose the following are data described in bits: – the 28-bit string 0001000100010001000100010001 (0001 repeated seven times) – a second bit string with no comparable regularity

Minimum Descriptive Length The data can be coded as: – for i=1:7, disp('0001'), end – disp('…'), i.e. the second string written out verbatim

Minimum Descriptive Length Regularity in data can be used to compress the data. The more regular the data are, relative to a particular coding method, the simpler the “program”. – The choice of coding method doesn’t matter so much.

Minimum Descriptive Length Think of the ‘program’ as a model. The model that best captures the regularities in the data will give the shortest code length. – for i=1:7, disp('0001'), end – disp('…'), the second string written out verbatim
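
A throwaway sketch (not from the slides) of the same point, using zlib as a stand-in for an idealized coding scheme: a highly patterned bit string admits a far shorter description than one with no exploitable regularity. The string lengths and the choice of compressor are arbitrary.

```python
import random
import zlib

random.seed(0)
regular = "0001" * 2500                                         # highly regular, 10,000 symbols
irregular = "".join(random.choice("01") for _ in range(10000))  # no exploitable pattern

for name, s in [("regular", regular), ("irregular", irregular)]:
    code_length = len(zlib.compress(s.encode()))
    print(f"{name}: {len(s)} symbols -> {code_length} bytes after compression")
```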

Minimum Descriptive Length Capturing data regularities will lead to good prediction of future data, i.e. good generalization. By finding the model with the minimum descriptive length, MDL will find the simplest model that predicts the data well.

Minimum Descriptive Length Under certain assumptions, MDL is given by a three-term formula: a badness-of-fit term, a penalty for the number of parameters, and a penalty for functional form that doesn’t depend on n.
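
One commonly cited statement of the criterion (Rissanen's Fisher-information approximation, the form used in Myung and Pitt's work) that matches these labels is

MDL = -ln L(w* | y) + (K/2) ln(n / 2π) + ln ∫ sqrt(det I(w)) dw

where -ln L(w* | y) is the badness of fit, (K/2) ln(n / 2π) penalizes the number of parameters K, and the last term, which does not depend on n, penalizes functional form through the Fisher information matrix I(w).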

Minimum Descriptive Length The functional-form penalty is based on I, the Fisher Information Matrix. This term tells you how well the model can fit different data sets by tweaking its parameters.
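
As a rough, assumed illustration of where I comes from: for binomially distributed observations under the retention model used earlier, the Fisher information matrix has a simple closed form, and the square root of its determinant is the kind of quantity integrated over the parameter space in the penalty term. The design points and parameter values below are made up.

```python
import numpy as np

# Assumed design and parameter values for the retention model p(t) = w1 * exp(-w2 * t),
# with binomial observations (n_trials per retention interval).
t = np.array([1.0, 2.0, 4.0, 7.0, 12.0, 21.0])
n_trials = 100
w1, w2 = 0.85, 0.12

p = w1 * np.exp(-w2 * t)
# Gradient of each predicted probability with respect to (w1, w2).
grad = np.column_stack([np.exp(-w2 * t), -w1 * t * np.exp(-w2 * t)])

# For binomial observations, I(w) = sum_i [ n_i / (p_i (1 - p_i)) ] * grad_i grad_i^T.
weights = n_trials / (p * (1 - p))
fisher = (grad * weights[:, None]).T @ grad

print(fisher)                               # the Fisher information matrix at (w1, w2)
print(np.sqrt(np.linalg.det(fisher)))       # sqrt(det I(w)) at this parameter value
```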

Other Selection Criteria
– Bayesian model selection
– Bayesian Information Criterion (BIC)
– Generalization Criterion
– Monte Carlo techniques
– Etc.