Bayesian Model Comparison and Occam’s Razor

Bayesian Model Comparison and Occam’s Razor Lecture 2

A Picture of Occam’s Razor

Occam’s razor "All things being equal, the simplest solution tends to be the best one," or alternately, "the simplest explanation tends to be the right one." In other words, when multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest hypothetical entities. It is in this sense that Occam's razor is usually understood. (Wikipedia)

Copernican versus Ptolemaic View of the Universe Copernicus proposed a model of the solar system in which the earth revolved around the sun. Ptolemy (some 1,400 years earlier) had proposed a theory of the universe in which planetary bodies revolved around the earth, using ‘epicycles’ to account for their observed motion. Copernicus’s theory ‘won’ because it was a simpler framework from which to explain astronomical motion. Epicycles also ‘explained’ astronomical motion, but employed an unnecessarily complex framework that made poorer predictions of planetary positions.

Occam’s Razor: Choose Simple Models when possible

Boxes behind a tree In the figure, are there one or two boxes behind the tree? The one-box theory does not assume complicated coincidences, such as two boxes that happen to be of identical height, yet it explains the data as we see it. The two-box theory assumes such an unlikely coincidence, but also explains the data as we see it.

Statistical Models Statistical models describe data by postulating that the data X=(x1,...,xn) follow a density f(X|Θ) in a class {f(X|Θ): Θ} (possibly nonparametric). For a given parameter value Θ0, we can compare the likelihood of two data values X1 vs X2 via the ratio f(X1|Θ0)/f(X2|Θ0). If this ratio is >1, the first datum is the more likely; if <1, the second is.
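As a concrete illustration (not from the slides), here is that likelihood ratio for a standard normal density, with the two data values chosen arbitrarily:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density f(x | mu, sigma) of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Likelihood ratio of two data values under the same parameter Theta0 = (0, 1)
x1, x2 = 0.5, 2.0
ratio = normal_pdf(x1) / normal_pdf(x2)
print(ratio)  # > 1, so x1 is the more likely observation under Theta0
```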

Bayesian Model Comparison We evaluate a statistical model M via Bayes’ theorem: P(M|X) = P(X|M) P(M) / P(X). The term P(X|M) is the likelihood of the model; P(M) is the prior; P(X) is the marginal density of the data. When comparing two models M1 and M2, the term P(X) cancels, so we need only look at the posterior odds ratio: P(M1|X)/P(M2|X) = [P(X|M1)/P(X|M2)] × [P(M1)/P(M2)].

Bayesian Model Comparison (continued) In comparing the two models, the term P(X|M) measures how well model M explains the data. In our tree example, both the one-box and two-box theories explain the data well, so the likelihoods do not help us decide between them. But the prior probability P(M1) of the one-box theory is much larger than P(M2), because the two-box theory requires the unlikely coincidence of two boxes of identical height. So we prefer the one-box theory to the two-box theory. Note that likelihood-based quantities such as the MLE express no preference between the two theories.
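In code, the comparison is just a product of two ratios. The likelihoods and priors below are made-up illustrative numbers for the tree example, not figures from the slides:

```python
# Posterior odds for the one-box (M1) vs two-box (M2) theories.
# The likelihoods and priors are illustrative assumptions.
lik_M1, lik_M2 = 0.9, 0.9        # both models explain the observed image about equally well
prior_M1, prior_M2 = 0.99, 0.01  # a single box of arbitrary height is far more probable a priori
posterior_odds = (lik_M1 / lik_M2) * (prior_M1 / prior_M2)
print(posterior_odds)  # 99.0 -> strong preference for the one-box theory
```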

Model Comparison when parameters are present If parameters Θ are present, we use the marginal likelihood P(X|M) = ∫ P(X|Θ,M) P(Θ|M) dΘ. This is the average score of the data over the prior. A Laplace approximation (see the appendix) shows that P(X|M) ≈ P(X|Θ̂,M) P(Θ̂|M) (2π)^(d/2) |Σ|^(1/2), where Θ̂ is the posterior mode, d is the number of parameters, and Σ is the posterior covariance.
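The integral and its Laplace approximation can be compared numerically in a one-parameter conjugate case. The binomial model, flat prior, and counts below are illustrative choices, not from the slides:

```python
import math

# Marginal likelihood of k heads in n coin flips under a flat prior on theta:
# the exact value is the Beta function B(k+1, n-k+1); compare with Laplace.
n, k = 30, 10

# Exact: integral of theta^k (1-theta)^(n-k) dtheta = k! (n-k)! / (n+1)!
log_exact = math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)
exact = math.exp(log_exact)

# Laplace: likelihood at the mode times prior (= 1) times sqrt(2*pi) * sigma,
# where sigma^2 is the inverse of the negative second derivative of log-likelihood.
theta_hat = k / n
curvature = k / theta_hat**2 + (n - k) / (1 - theta_hat)**2   # -d^2 logL / d theta^2
laplace = (theta_hat**k * (1 - theta_hat)**(n - k)
           * math.sqrt(2 * math.pi / curvature))

print(exact, laplace, laplace / exact)  # the approximation is within a few percent
```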

The Occam factor Now, if we have two models M1, M2 which explain the data equally well, but the first provides more certain (posterior) information than the second, we prefer the first model to the second. The likelihood scores are similar for the two models; the ‘Occam factor’ P(Θ̂|M) |Σ|^(1/2) is larger for the first model, because its posterior uncertainty relative to its prior spread is smaller, so it is penalized less.

Example of Model Comparison when Parameters are present Say we want to choose between two regression models for a set of bivariate data. The first is a linear model; the second is a polynomial model involving terms up to the fourth power. The second always fits the data at least as well as the first. But the Occam factor of the second tends to be smaller than that of the first, because the presence of additional parameters adds posterior uncertainty and spreads the prior more thinly. Note that a classical fit criterion such as maximized likelihood always views the second as better than the first.

An example The data: (-8,8), (-2,10), (6,11) (see the following slides). The models: H0: y = β0 + ε; H1: y = β0 + β1x + ε. Parameters have simple Gaussian priors and σe = 1. Score[0] = φ{√3 σY} φ{Ȳ} (1/√3) = 1.5×10^-23; Score[1] = φ{√3 σY √(1-ρ²)} φ{b0} φ{b1} (1/(3σX)) = 0.71×10^-24. Score[1]/Score[0] = 0.71/15 = .05.

Example Explained: H0 Score[0] = φ{√3 σY} φ{Ȳ} (1/√3) = 1.5×10^-23. Ȳ is the average of the Y’s; φ is the standard Gaussian density. φ{√3 σY} is the likelihood under the null model (at the MLE); φ{Ȳ} is the prior under the null model (at the MLE); (1/√3) is the inverse of the square root of the information.

Example Explained: H1 Score[1] = φ{√3 σY √(1-ρ²)} φ{b0} φ{b1} (1/(3σX)). φ{√3 σY √(1-ρ²)} is the likelihood under the alternative model (at the MLE); φ{b0} φ{b1} is the prior under the alternative model (at the MLE); b0, b1 are the usual regression estimates; (1/(3σX)) is the inverse of the square root of the determinant of the information.

Regression Example

Classical Statistics falls short Comparing the likelihoods (at the MLEs) without regard to the Occam factor gives: Classical Null Score = φ{√3 σY} = .012; Classical Alt Score = φ{√3 σY √(1-ρ²)} = .3146. By this criterion the alternative model is preferred. But we can see from the plot that it isn’t much better, and it adds complexity that serves no good purpose.

Stats for the linear model σx = 7.02; σy = 1.53; mean(y) = 9.66; mean(x) = -1.33. Estimates: b0 = 9.9459, b1 = 0.2095. Confidence intervals: b0 in (5.5683, 14.3236); b1 in (-0.5340, 0.9530). Residuals: -0.2703, 0.4730, -0.2027, with intervals (-3.7044, 3.1638), (-5.5367, 6.4827), (-2.7783, 2.3729). Fit statistics: R² = 0.9276, F = 12.8133, p = 0.1734, error variance = 0.3378.
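The two Bayesian scores can be reproduced from these data with a few lines of code. This is a sketch using the sample standard deviations shown above; small rounding differences from the slides’ figures are expected:

```python
import math

def phi(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

x = [-8.0, -2.0, 6.0]
y = [8.0, 10.0, 11.0]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sx = math.sqrt(sum((v - xbar) ** 2 for v in x) / (n - 1))   # ~7.02
sy = math.sqrt(sum((v - ybar) ** 2 for v in y) / (n - 1))   # ~1.53
sxx = sum((a - xbar) ** 2 for a in x)
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
r2 = b1 ** 2 * sxx / sum((b - ybar) ** 2 for b in y)        # ~0.9276

# Score = likelihood x prior x (information)^(-1/2), all evaluated at the MLE
score0 = phi(math.sqrt(3) * sy) * phi(ybar) * (1 / math.sqrt(3))
score1 = (phi(math.sqrt(3) * sy * math.sqrt(1 - r2))
          * phi(b0) * phi(b1) * (1 / (3 * sx)))

print(score0, score1, score1 / score0)  # roughly 1.5e-23, 7.6e-25, 0.05
```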

Dice Example We roll a die 30 times and get face counts [4,4,3,3,7,9] (four 1’s, four 2’s, ..., nine 6’s). Is it a fair die? Would you be willing to gamble using it? H0: p1 = ... = p6 = 1/6; H1: (p1,...,p6) ~ Dirichlet(1,...,1). What does the chi-squared goodness-of-fit test say? The statistic is 6 on 5 degrees of freedom, for a p-value of 31% — we would never reject the null in this case. What does Bayes theory say? The score under H0 is P(X|H0) = [30!/(4! 4! 3! 3! 7! 9!)] (1/6)^30 ≈ 3.2×10^-5.

Dice Example (continued) Under the alternative, P(X|H1) = [30!/(4! 4! 3! 3! 7! 9!)] × [Γ(6)/Γ(36)] × ∏ Γ(ni+1) = 30! 5!/35! ≈ 3×10^-6. In this case the Laplace approximation is slightly off; the exact answer is 3×10^-6. So, roughly, the null is about 10 times as likely as the alternative — in accord with our intuition that these counts are compatible with a fair die.
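Both calculations can be checked exactly with log-gamma functions. This sketch reproduces the chi-squared statistic and the Bayes factor, with the multinomial coefficient included in both marginals (it cancels in the ratio):

```python
import math

counts = [4, 4, 3, 3, 7, 9]   # observed counts for faces 1..6
n = sum(counts)               # 30 rolls

# Chi-squared goodness-of-fit statistic against a fair die (expected 5 per face)
chi2 = sum((c - n / 6) ** 2 / (n / 6) for c in counts)   # = 6.0 on 5 df

# Exact marginal likelihoods, in logs to avoid underflow
log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
log_null = log_coef + n * math.log(1 / 6)
# Dirichlet(1,...,1) prior: marginal = coef * Gamma(6)/Gamma(n+6) * prod Gamma(c_i+1)
log_alt = (log_coef + math.lgamma(6) - math.lgamma(n + 6)
           + sum(math.lgamma(c + 1) for c in counts))

bayes_factor = math.exp(log_null - log_alt)   # >1 favors the null
print(chi2, math.exp(log_alt), bayes_factor)  # 6.0, ~3.1e-6, ~10.3
```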

Possible Project Construct or otherwise obtain bivariate data that are essentially linearly related, plus noise. Assume the linear and higher-power models have equal prior probability. Calculate the average score for the linear and higher-order models, and show that the average score for the linear model is best.

Another Possible Project Generate multinomial data from a distribution with equal p’s. For the generated data determine the chi-squared p-value and compare it to the Bayes factor favoring the null (true) hypothesis – determine how the chi-squared values differ from the Bayes factor counterparts over many simulations.
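A possible starting point for this project (a sketch only; the number of rolls, the simulation count, and the seed are arbitrary choices):

```python
import math
import random

random.seed(0)

def log_marginal_alt(counts):
    """Exact log marginal under a Dirichlet(1,...,1) prior (multinomial coefficient
    dropped, since it cancels in the Bayes factor)."""
    n = sum(counts)
    return (math.lgamma(len(counts)) - math.lgamma(n + len(counts))
            + sum(math.lgamma(c + 1) for c in counts))

def simulate(n_rolls=30, n_sims=1000):
    """For fair-die data, pair each chi-squared statistic with the log Bayes
    factor in favor of the (true) null hypothesis."""
    results = []
    for _ in range(n_sims):
        rolls = [random.randrange(6) for _ in range(n_rolls)]
        counts = [rolls.count(f) for f in range(6)]
        chi2 = sum((c - n_rolls / 6) ** 2 / (n_rolls / 6) for c in counts)
        log_null = n_rolls * math.log(1 / 6)
        log_bf = log_null - log_marginal_alt(counts)   # >0 favors the null
        results.append((chi2, log_bf))
    return results

results = simulate()
favor_null = sum(1 for _, bf in results if bf > 0) / len(results)
print(favor_null)  # fraction of fair-die samples where the Bayes factor favors H0
```

Extending this to record the chi-squared p-value for each sample (e.g., by a further Monte Carlo reference distribution) allows the comparison over many simulations that the project asks for.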

Appendix: Laplace approximation In the usual setting, the marginal density of the data is P(X|M) = ∫ P(X|Θ,M) P(Θ|M) dΘ. Expanding the log integrand to second order about its mode Θ̂ gives P(X|M) ≈ P(X|Θ̂,M) P(Θ̂|M) (2π)^(d/2) |Σ|^(1/2), where d is the dimension of Θ and Σ is the inverse of the negative Hessian of the log posterior at Θ̂.

Possible Project Fill in the mathematical steps in the calculation of the marginal distribution of the data, and compare the result to the Laplace approximation.