Bayesian Inconsistency under Misspecification
Peter Grünwald, CWI, Amsterdam
Extension of joint work with John Langford, TTI Chicago (COLT 2004)

The Setting
Let X be a feature space and let Y = {0,1} (classification setting).
Let M be a set of conditional distributions P(Y | X), and let π be a prior on M.

Bayesian Consistency
Let X be a feature space and let Y = {0,1} (classification setting).
Let M be a set of conditional distributions P(Y | X), and let π be a prior on M.
Let P* be a distribution on X × Y, and let (X_1, Y_1), (X_2, Y_2), ... be i.i.d. according to P*.
If P* ∈ M, then Bayes is consistent under very mild conditions on M and π.
– "consistency" can be defined in a number of ways, e.g. the posterior distribution "concentrates" on "neighborhoods" of P*.

Bayesian Consistency?
If P* ∈ M, then Bayes is consistent under very mild conditions on M and π.
The condition "P* ∈ M" itself, however, is not quite so mild!

Misspecification Inconsistency
We exhibit:
1. a model M,
2. a prior π,
3. a "true" distribution P*,
such that
– M contains a good approximation to P*,
– yet P* ∉ M,
– and Bayes will be inconsistent in various ways.

The Model (Class)
Let M = M_1 ∪ M_2 ∪ ... Each M_k = { p_{k,θ} : θ ∈ [0, 1/2) } is a 1-dimensional parametric model.
Example: p_{k,θ}(y | x) = 1 − θ if y = x_k, and θ if y ≠ x_k, i.e. the k-th model predicts the label with the k-th binary feature of x = (x_1, x_2, ...), flipped with noise rate θ.
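To make the construction concrete, here is a minimal sketch of the model class in code, under the reading that the features are binary and that model M_k predicts the label with the k-th feature, flipped with probability θ. The function names are illustrative, not from the talk.

```python
import numpy as np

def cond_prob(y, x, k, theta):
    """p_{k,theta}(y | x): the label equals feature k with probability 1 - theta.

    x is a 0/1 feature sequence, y a 0/1 label, k a 1-based model index,
    and theta in (0, 1/2) the noise rate of model M_k."""
    return 1.0 - theta if y == x[k - 1] else theta

def log_likelihood(xs, ys, k, theta):
    """Log-likelihood of a sample under p_{k,theta} (keep theta strictly inside (0, 1/2))."""
    errors = sum(int(y != x[k - 1]) for x, y in zip(xs, ys))
    n = len(ys)
    return errors * np.log(theta) + (n - errors) * np.log(1.0 - theta)
```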

The Prior
The prior mass π(k) on the model index k must not go to 0 too quickly: for some small ε > 0 and all large k, π(k) must stay above a slowly decaying lower bound (e.g. π(k) decreasing only polynomially in k qualifies).
The prior density of θ given k must be
– continuous,
– independent of k,
– bounded away from 0, i.e. w(θ | k) ≥ c for some c > 0 (e.g. uniform on [0, 1/2)).
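One concrete prior meeting these requirements, as I read them, might look as follows; the particular choices (π(k) proportional to 1/(k(k+1)) and a uniform density on θ) are illustrative defaults, not necessarily those used in the talk.

```python
def prior_index(k):
    """Prior mass pi(k) on the model index k; decays slowly (illustrative choice)."""
    return 1.0 / (k * (k + 1))            # sums to 1 over k = 1, 2, ...

def prior_density_theta(theta, k=None):
    """Prior density w(theta | k): uniform on [0, 1/2), the same for every k,
    continuous and bounded away from 0 on its support."""
    return 2.0 if 0.0 <= theta < 0.5 else 0.0
```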

The "True" P* - benign version
The marginal of X is uniform (the feature bits are i.i.d. fair coins); the conditional P*(Y | X) is given by p_{1,θ*} for some fixed θ* ∈ (0, 1/2).
Since the conditional equals p_{1,θ} for θ = θ*, P* is in the model: Bayes is consistent.
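A sampler for one reading of this benign distribution: the feature bits are i.i.d. fair coins and the label copies the first feature, flipped with probability θ*, so the conditional is exactly p_{1,θ*} and lies in M. The truncation to d features and the default θ* = 0.1 are only so that the sketch runs; they are not values from the talk.

```python
import numpy as np

def sample_benign(n, d=20, theta_star=0.1, rng=None):
    """Draw n examples (X, Y) from the benign P*: X uniform on {0,1}^d,
    Y equal to X[:, 0] flipped with probability theta_star."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.integers(0, 2, size=(n, d))
    flips = rng.random(n) < theta_star
    Y = np.where(flips, 1 - X[:, 0], X[:, 0])
    return X, Y
```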

Two Natural "Distance" Measures
Kullback-Leibler (KL) divergence: D(P* ‖ p) = E_{(X,Y)~P*}[ log P*(Y | X) − log p(Y | X) ].
Classification performance, measured in 0/1-risk: risk(p) = P*( Y ≠ c_p(X) ), where c_p(x) = arg max_y p(y | x) is the classifier induced by p.
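Both measures can be estimated by straightforward Monte Carlo for any element p_{k,θ} of the model above. The sketch below assumes the benign case (true conditional p_{1,θ*}), since evaluating the KL divergence requires the true conditional probabilities; the sampler interface is the one from the previous sketch.

```python
import numpy as np

def estimate_divergences(sample_fn, k, theta, theta_true=0.1, n=100_000, rng=None):
    """Monte Carlo estimates of the two 'distances' for p_{k,theta}:
      - KL divergence:  E_{P*}[ log P*(Y|X) - log p_{k,theta}(Y|X) ]
      - 0/1-risk:       P*( Y != c_k(X) ),  with  c_k(x) = x[k-1].
    `sample_fn(n, rng=...)` must return (X, Y) i.i.d. from P*; the true
    conditional is assumed to be p_{1, theta_true} (the benign case)."""
    rng = np.random.default_rng() if rng is None else rng
    X, Y = sample_fn(n, rng=rng)
    agree_k = (Y == X[:, k - 1])
    log_model = np.where(agree_k, np.log(1.0 - theta), np.log(theta))
    agree_1 = (Y == X[:, 0])
    log_true = np.where(agree_1, np.log(1.0 - theta_true), np.log(theta_true))
    return float(np.mean(log_true - log_model)), float(np.mean(~agree_k))
```

With the benign sampler from the previous sketch, estimate_divergences(sample_benign, k=2, theta=0.3) should give a 0/1-risk close to 1/2 and a clearly positive KL divergence, while k=1, theta=0.1 recovers the true conditional and gives a KL divergence near 0 and a 0/1-risk near θ*.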

KL and 0/1-behaviour of p_{1,θ*}
For all k ≥ 2 and all θ: D(P* ‖ p_{k,θ}) > D(P* ‖ p_{1,θ*}) = 0.
For all k ≥ 2 and all θ: the 0/1-risk of p_{k,θ} is larger than that of p_{1,θ*}.
p_{1,θ*} is true and, of course, optimal for classification.

The "True" P* - evil version
First throw a coin with bias λ.
If it lands heads, sample (X, Y) from the benign P* as before.
If it lands tails, set every feature equal to the label: X_1 = X_2 = ... = Y.
All classifiers c_k satisfy c_k(X) = Y on such a point, so these are easy examples.
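A sketch of this evil distribution, under the same reading as before: a biased coin decides whether to draw a benign example or an easy one in which every feature equals the label. The parameter names (lam for the probability of a benign example, theta_star for the noise) and their defaults are mine.

```python
import numpy as np

def sample_evil(n, d=20, lam=0.3, theta_star=0.1, rng=None):
    """Draw n examples from the 'evil' P*.

    With probability lam:      a benign example (Y = X[:, 0] flipped w.p. theta_star).
    With probability 1 - lam:  an 'easy' example in which all features equal Y."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.integers(0, 2, size=(n, d))
    Y = np.where(rng.random(n) < theta_star, 1 - X[:, 0], X[:, 0])
    easy = rng.random(n) >= lam          # the coin deciding the branch
    X[easy] = Y[easy][:, None]           # easy rows: every feature copies the label
    return X, Y
```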

p_{1,·} is still optimal
For all k ≥ 2 and all θ: the 0/1-risk of p_{k,θ} is larger than that of the best element of M_1; the first feature is still optimal for classification.
For all k ≥ 2 and all θ: D(P* ‖ p_{k,θ}) > min_{θ'} D(P* ‖ p_{1,θ'}); M_1 still contains the element of M that is optimal in KL divergence.
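Under the reading of the evil distribution above, this claim can be checked with one line of arithmetic per model: the first feature disagrees with the label only on a benign example whose label was flipped, while any other feature is a fair coin on benign examples. A tiny sketch with illustrative parameter values:

```python
lam, theta_star = 0.3, 0.1     # illustrative values, not from the talk
risk_c1 = lam * theta_star     # P*(Y != X_1): wrong only on flipped benign examples
risk_ck = lam * 0.5            # P*(Y != X_k), k >= 2: fair coin on benign examples
print(risk_c1, risk_ck)        # approximately 0.03 and 0.15: c_1 stays 0/1-optimal
```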

Inconsistency
Both from a KL and from a classification perspective, we would hope that the posterior converges on M_1, the models built on the first feature.
But this does not happen!

Inconsistency
Theorem: with P*-probability 1, the posterior puts almost all of its mass on models M_k with ever larger index k, none of which are optimal.
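The theorem can be probed numerically. The sketch below simulates data from the evil distribution, computes the Bayesian marginal likelihood of each M_k up to a finite cut-off K (integrating a uniform prior on θ over a grid), and reports where the posterior mass over k ends up. Everything here, from the prior π(k) proportional to 1/(k(k+1)) to the parameter values and the truncation to K models, is an illustrative assumption; at these small sample sizes the drift towards large k may or may not be visible in a given run.

```python
import numpy as np

def sample_evil(n, d, lam, theta_star, rng):
    """Evil P*: w.p. lam a benign example (Y = X[:, 0] flipped w.p. theta_star),
    w.p. 1 - lam an 'easy' example in which every feature equals the label."""
    X = rng.integers(0, 2, size=(n, d), dtype=np.int8)
    Y = np.where(rng.random(n) < theta_star, 1 - X[:, 0], X[:, 0])
    easy = rng.random(n) >= lam
    X[easy] = Y[easy][:, None]
    return X, Y

def posterior_over_models(X, Y, K, grid_size=200):
    """Posterior over the model index k for the classifier-plus-noise models.

    For each k <= K the classifier is c_k(x) = x[k-1]; the marginal likelihood
    of M_k averages theta^errors * (1-theta)^(n-errors) over an equispaced grid
    on (0, 1/2), which approximates integration against the uniform prior on theta."""
    n, d = X.shape
    assert K <= d
    theta_grid = np.linspace(1e-3, 0.5 - 1e-3, grid_size)
    ks = np.arange(1, K + 1)
    log_prior_k = -np.log(ks * (ks + 1.0))               # pi(k) ~ 1/(k(k+1))
    errors = (X[:, :K] != Y[:, None]).sum(axis=0)        # mistakes of each c_k
    log_lik = (errors[:, None] * np.log(theta_grid)
               + (n - errors)[:, None] * np.log(1.0 - theta_grid))
    m = log_lik.max(axis=1, keepdims=True)
    log_marginal = np.log(np.exp(log_lik - m).mean(axis=1)) + m[:, 0]
    log_post = log_prior_k + log_marginal
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

rng = np.random.default_rng(0)
K = 20_000                        # finite stand-in for the infinitely many models
X, Y = sample_evil(n=400, d=K, lam=0.05, theta_star=0.45, rng=rng)
post = posterior_over_models(X, Y, K)
print("posterior mass on the optimal model M_1:", float(post[0]))
print("model index with the most posterior mass:", int(post.argmax()) + 1)
```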

1. Kullback-Leibler Inconsistency
Bad News: with P*-probability 1, for all large n, the posterior puts almost all of its mass on distributions whose KL divergence to P* is far larger than the minimum achievable within M. The posterior "concentrates" on very bad distributions.
Note: if we restrict the parameter range of each M_k slightly (bounding θ away from the endpoints by some margin γ > 0), then this only holds for all n smaller than some n_γ.

1. Kullback-Leibler Inconsistency
Bad News: with P*-probability 1, for all large n, the posterior "concentrates" on very bad distributions.
Good News: with P*-probability 1, for all large n, the Bayes predictive distribution is close (in KL divergence) to the best approximation of P* within M. The predictive distribution does perform well!

2. 0/1-Risk Inconsistency
Bad News: with P*-probability 1, the posterior "concentrates" on distributions whose 0/1-risk is suboptimal.
More bad news: with P*-probability 1, the 0/1-risk of the Bayes predictive distribution is suboptimal as well. Now the predictive distribution is no good either!
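For completeness, here is how the full-Bayes predictive distribution referred to on this slide combines the models: it mixes the per-model predictions with the posterior weights over the index k. The arrays post (posterior mass per model) and theta_bar (posterior mean of the noise rate within each model) are assumed to come from a posterior computation such as the sketch above; the function names are mine.

```python
import numpy as np

def bayes_predictive(x, post, theta_bar):
    """Full-Bayes predictive probability P(Y = 1 | x, data) for the
    classifier-plus-noise models: a posterior-weighted mixture of the
    per-model predictions (1 - theta_bar[k]) if x[k] == 1, else theta_bar[k].
    x is a 0/1 NumPy feature vector; post and theta_bar have length K."""
    per_model = np.where(x[:len(post)] == 1, 1.0 - theta_bar, theta_bar)
    return float(np.dot(post, per_model))

def bayes_classify(x, post, theta_bar):
    """Threshold the predictive probability at 1/2 to get the induced 0/1 decision."""
    return int(bayes_predictive(x, post, theta_bar) >= 0.5)
```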

Theorem 1: worse 0/1 news
The prior depends on a parameter (how fast π(k) decays).
The true distribution depends on two parameters, λ and θ*:
– with probability 1 − λ, generate an "easy" example;
– with probability λ, generate an example according to the benign P* (first feature plus noise θ*).

Theorem 1: worse 0/1 news Theorem 1: for each desired, we can set such that yet with -probability 1, as Here is the binary entropy

Theorem 1, graphically
[Figure: the x-axis shows the optimal 0/1-risk; one curve gives the maximum 0/1-risk of Bayes MAP, the other the maximum 0/1-risk of full Bayes, with the point of maximum difference marked.]
The maximum difference is achieved with P*-probability 1, for all large n.
Bayes can get worse than random guessing!

Thm 2: full Bayes result is tight
Let M' be an arbitrary countable set of conditional distributions.
Let π be an arbitrary prior on M' with full support.
Let the data be i.i.d. according to an arbitrary P* on X × Y, and let p̃ be the element of M' that minimizes the KL divergence to P*.
Then the 0/1-risk of the Bayes predictive distribution is bounded, in terms of p̃, by the full-Bayes curve (red line) in the previous figure.

KL and Classification
It is trivial to construct a model M such that, for some P*, the element of M closest to P* in KL divergence is bad for classification.
However, here we constructed M such that, for arbitrary P* on X × Y, the minimum KL divergence over M is achieved by an element that also achieves the minimum 0/1-risk over M.
Therefore, the bad classification performance of Bayes is really surprising!

KL… and classification
[Figure: two panels contrasting the KL geometry with the 0/1 geometry, with the Bayes predictive distribution marked in each.]

What's new?
There exist various infamous theorems showing that Bayesian inference can be inconsistent even if P* ∈ M:
– Diaconis and Freedman (1986), Barron (Valencia 6, 1998).
So why is this result interesting? Because we can choose M to be countable.

Bayesian Consistency Results
Doob (1949), Blackwell & Dubins (1962), Barron (1985): suppose M is
– countable, and
– contains the "true" conditional distribution P*(Y | X).
Then with P*-probability 1, the posterior concentrates on P*(Y | X) as n → ∞.

Countability and Consistency
Thus: if the model is well-specified and countable, Bayes must be consistent, and the previous inconsistency results do not apply.
We show that in the misspecified case, one can get inconsistency even if the model is countable.
– Previous results were based on priors that put "very small" mass on neighborhoods of the true distribution.
– In our case, the prior on the element of M achieving the minimum KL divergence to P* can be arbitrarily close to 1.

Discussion
1. "The result is not surprising, because Bayesian inference was never designed for misspecification." I agree it is not too surprising, but it is disturbing, because in practice Bayes is used with misspecified models all the time.
2. "The result is irrelevant for a true (subjective) Bayesian, because a 'true' distribution does not exist anyway." I agree that true distributions don't exist, but the result can be rephrased so that it refers only to realized patterns in the data (not to distributions).
3. "The result is irrelevant if you use a nonparametric model class containing all distributions on X × Y." For small samples, your prior then severely restricts the "effective" model size (and there are versions of our result for small samples).

Discussion - II
One objection remains: the scenario is very unrealistic!
– The goal was to discover the worst possible scenario.
Note, however, that
– a variation of the result still holds for models containing distributions with differentiable densities;
– a variation of the result still holds if the precision of the X-data is finite;
– the priors are not so strange;
– other methods, such as McAllester's PAC-Bayes, do perform better on this type of problem: they are guaranteed to be consistent under misspecification, but often need much more data than Bayesian procedures (see also Clarke (2003), Suzuki (2005)).

Conclusion
The conclusion should not be "Bayes is bad under misspecification", but rather "more work is needed to find out what types of misspecification are problematic for Bayes".

Thank you! Shameless plug: see also The Minimum Description Length Principle, P. Grünwald, MIT Press, 2007