Informatics and Mathematical Modelling / Lars Kai Hansen Adv. Signal Proc. 2006 Variational Bayes 101.

Variational Bayes 101

The Bayes scene
Exact averaging in discrete/small models (Bayes networks)
Approximate averaging:
- Monte Carlo methods
- Ensemble/mean field
- Variational Bayes methods
Resources: Variational-Bayes.org, MLpedia, Wikipedia
ISP Bayes activities: ICA (mean field, Kalman, dynamical systems); NeuroImaging (optimal signal detection); approximate inference; machine learning methods

Bayes' methodology
The minimal error rate is obtained when the detector is based on the posterior probability (Bayes decision theory). The likelihood may contain unknown parameters.

Bayes' methodology
The conventional approach is to use the most probable parameters. However, the averaged model is generalization optimal (Hansen, 1999), i.e. the predictive distribution obtained by averaging over the parameter posterior minimizes the expected generalization error.

The hidden agenda of learning
Typically learning proceeds by generalization from a limited set of samples... but we would also like to identify the model that generated the data: choose the least complex model compatible with the data. (Ockham: "That I figured out in 1386".)

Generalization!
Generalizability is defined as the expected performance on a random new sample; the mean performance of a model on a "fresh" data set is an unbiased estimate of generalization. Typical loss functions: squared error, negative log-likelihood, misclassification rate, etc. Results can be presented as "bias-variance trade-off curves" or "learning curves".

Generalization optimal predictive distribution
"The game of guessing a pdf". Assume a random teacher drawn from P(θ) and a random data set D drawn from P(x|θ). The prediction/generalization error is the expected negative log of the model's predictive distribution under the test sample distribution:
Γ(D) = − ∫ dx P(x|θ) log p(x|D).
(Figure: predictive distribution of model A vs. the test sample distribution.)

Generalization optimal predictive distribution
We define the "generalization functional" (Hansen, NIPS 1999) by averaging over teachers and data sets:
Γ[p] = − ∫ dθ P(θ) ∫ dD P(D|θ) ∫ dx P(x|θ) log p(x|D).
It is minimized by the "Bayesian averaging" predictive distribution
p(x|D) = ∫ dθ p(x|θ) p(θ|D).
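As a numerical sanity check of this claim (not from the slides), the sketch below compares the plug-in predictive N(m, σ²) with the Bayesian averaging predictive N(m, σ² + v) in the conjugate Gaussian case with σ = τ = 1; the trial count and seed are arbitrary choices for the demo.

```python
import numpy as np

# Monte Carlo check of the generalization functional: draw random teachers
# theta ~ P(theta) = N(0,1), data sets D ~ N(theta,1)^n, and fresh test
# points, then compare the plug-in predictive N(m, 1) with the Bayesian
# averaging predictive N(m, 1 + v), where m, v are posterior mean/variance.
rng = np.random.default_rng(0)
n, trials = 2, 50000
theta = rng.normal(size=trials)                      # random teachers
data = rng.normal(theta[:, None], 1.0, (trials, n))  # random data sets
x_test = rng.normal(theta)                           # fresh test samples

m = data.sum(axis=1) / (n + 1)  # posterior mean (sigma = tau = 1)
v = 1.0 / (n + 1)               # posterior variance

def gauss_nll(x, mu, var):
    return 0.5 * np.log(2 * np.pi * var) + (x - mu) ** 2 / (2 * var)

nll_plugin = gauss_nll(x_test, m, 1.0).mean()     # most probable parameters
nll_bayes = gauss_nll(x_test, m, 1.0 + v).mean()  # Bayesian averaging
print(nll_plugin, nll_bayes)  # the averaged model has lower expected loss
```

The averaged predictive is wider than the plug-in, which is exactly what pays off on fresh samples drawn through an uncertain teacher.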

Bias-variance trade-off and averaging
Averaging is good, but can we average "too much"? Define the family of tempered posterior distributions p_β(θ|D) ∝ p(D|θ)^β p(θ), with β = 1/T. Case: univariate normal distribution with unknown mean parameter. High temperature: widened posterior, broad average; low temperature: narrow average.
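A minimal sketch of the tempered-posterior family for the univariate normal case, assuming unit noise variance and a standard normal prior; the function name `tempered_posterior` is introduced here for illustration.

```python
import numpy as np

# Tempered posteriors p_beta(theta|D) ∝ p(D|theta)^beta p(theta) for a
# univariate normal with unknown mean (sigma = 1, prior N(0,1)); beta = 1/T.
rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=20)
n, xbar = len(x), x.mean()

def tempered_posterior(beta):
    """Return (mean, std) of the tempered Gaussian posterior."""
    prec = beta * n + 1.0            # tempered likelihood precision + prior
    mean = beta * n * xbar / prec
    return mean, prec ** -0.5

for T in (4.0, 1.0, 0.25):           # high T widens, low T narrows the average
    m, s = tempered_posterior(1.0 / T)
    print(f"T={T}: mean={m:.3f}, std={s:.3f}")
```

Raising the temperature shrinks the posterior mean toward the prior and widens the posterior, i.e. "more averaging"; lowering it concentrates the average near the maximum-likelihood solution.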

Bayes' model selection, example
Let three models A, B, C be given:
A) x is normal N(0,1)
B) x is normal N(0,σ²), σ² is uniform U(0,∞)
C) x is normal N(μ,σ²), μ and σ² are uniform U(0,∞)

Model A
The likelihood of N samples is given by
p(D|A) = (2π)^(−N/2) exp( −½ Σ_{i=1..N} x_i² ).

Model B
The likelihood of N samples is given by marginalizing over σ²:
p(D|B) = ∫₀^∞ dσ² p(σ²) (2πσ²)^(−N/2) exp( −Σ x_i² / (2σ²) ).

Model C
The likelihood of N samples is given by marginalizing over μ and σ²:
p(D|C) = ∫ dμ ∫₀^∞ dσ² p(μ, σ²) (2πσ²)^(−N/2) exp( −Σ (x_i − μ)² / (2σ²) ).

Model A - maximum likelihood
The likelihood of N samples, evaluated (figure on slide).

Model B (cont.)
The likelihood of N samples, evaluated (figure on slide).

Model C (cont.)
The likelihood of N samples, evaluated (figure on slide).

Bayesian model selection
C (green) is the correct model; what if only A (red) and B (blue) are known?
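To make the model-selection example runnable, the sketch below replaces the slides' improper uniform prior on σ² with a proper conjugate inverse-gamma IG(a, b) (an assumption on my part; the improper prior leaves the evidence defined only up to a constant) and compares the log evidence of models A and B.

```python
import numpy as np
from math import lgamma, log, pi

# Model selection between A: x ~ N(0,1) and B: x ~ N(0,sigma^2), sigma^2
# unknown. For B the marginal likelihood is computed in closed form under a
# conjugate inverse-gamma prior IG(a,b) (a proper stand-in for the slides'
# improper uniform prior).
def log_evidence_A(x):
    return -0.5 * len(x) * log(2 * pi) - 0.5 * np.sum(x ** 2)

def log_evidence_B(x, a=1.0, b=1.0):
    N, S = len(x), np.sum(x ** 2)
    return (-0.5 * N * log(2 * pi) + a * log(b) - lgamma(a)
            + lgamma(a + N / 2) - (a + N / 2) * log(b + S / 2))

rng = np.random.default_rng(2)
z1, z2 = rng.normal(size=100), rng.normal(size=100)
x_unit = z1 / z1.std()        # standardized: sample std exactly 1 (model A true)
x_wide = 3.0 * z2 / z2.std()  # sample std exactly 3 (model B should win)

d_unit = log_evidence_A(x_unit) - log_evidence_B(x_unit)  # > 0: prefers A
d_wide = log_evidence_A(x_wide) - log_evidence_B(x_wide)  # < 0: prefers B
print(d_unit, d_wide)
```

When both models fit equally well, the simpler model A wins by the Occam factor; when the data variance is far from 1, the flexible model B wins decisively.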

Bayesian model selection
A (red) is the correct model.

Bayesian inference
Bayesian averaging caveats:
- Bayes can rarely be implemented exactly.
- Not optimal if the model family is incorrect: "Bayes cannot detect bias".
- However, still asymptotically optimal if the observation model is correct and the prior is "weak" (Hansen, 1999).

Hierarchical Bayes models
Multi-level models in Bayesian averaging.
References:
C.P. Robert: The Bayesian Choice - A Decision-Theoretic Motivation. Springer Texts in Statistics, Springer, New York (1994).
G. Golub, M. Heath, G. Wahba: Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, pp. 215-223 (1979).
K. Friston: A Theory of Cortical Responses. Phil. Trans. R. Soc. B 360 (2005).

Hierarchical Bayes models
"Learning hyperparameters by adjusting prior expectations": empirical Bayes (MacKay, 1992; Hansen et al., Eusipco 2006); cf. Boltzmann learning (Hinton et al., 1983). At the hyperparameter level, posterior ∝ "evidence" × prior, and the hyperparameters are targeted at maximal evidence.

Hyperparameter dynamics
Gaussian prior with adaptive hyperparameter. Discontinuity: the parameter is pruned at low signal-to-noise (Hansen & Rasmussen, Neural Comp 1994; Tipping, "Relevance vector machine", 1999). Here θ_ML² A is a signal-to-noise measure, where θ_ML is the maximum-likelihood optimum.

Hyperparameter dynamics
Dynamically updated hyperparameters imply pruning. Pruning decisions are based on SNR. A mechanism for cognitive selection, attention?
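A single-weight sketch of these hyperparameter dynamics, assuming a fixed known noise precision β and using the MacKay-style evidence re-estimation α ← γ/μ²; the function name `ard` and all constants are illustrative, not from the slides.

```python
import numpy as np

# Evidence-driven hyperparameter dynamics for one weight w with Gaussian
# prior N(0, 1/alpha): alternate the posterior update with MacKay's
# re-estimation alpha <- gamma / mu^2, gamma = 1 - alpha * Sigma. At low
# signal-to-noise the iteration drives alpha -> infinity: the weight is pruned.
def ard(x, y, beta=25.0, alpha=1.0, iters=200, prune_at=1e6):
    for _ in range(iters):
        Sigma = 1.0 / (alpha + beta * (x @ x))  # posterior variance of w
        mu = beta * Sigma * (x @ y)             # posterior mean of w
        gamma = 1.0 - alpha * Sigma             # "well-determined" measure
        alpha = gamma / mu ** 2                 # evidence re-estimation
        if alpha > prune_at:                    # effective pruning (w -> 0)
            return np.inf, 0.0
    return alpha, mu

rng = np.random.default_rng(3)
x = rng.normal(size=50)
y_signal = 2.0 * x + rng.normal(scale=0.2, size=50)  # high SNR: keep w
y_noise = rng.normal(scale=0.02, size=50)            # far below noise floor

alpha_sig, mu_sig = ard(x, y_signal)
alpha_prn, mu_prn = ard(x, y_noise)
print(alpha_sig, mu_sig)  # finite alpha: weight survives, mu near the true 2
print(alpha_prn, mu_prn)  # infinite alpha: weight pruned to 0
```

The discontinuity on the slide is visible here: above the SNR threshold the update has a finite fixed point, below it α diverges and the weight is switched off.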

Hansen & Rasmussen, Neural Comp (1994) (figure on slide).

Approximations needed for posteriors
- Asymptotic expansions (Laplace etc.) [JL]
- Approximation of posteriors using tractable (factorized) pdfs by KL-fitting
- Approximation of products using EP [AH, Wednesday]
- Approximation by MCMC [OWI, Thursday]

Illustration of approximation by a Gaussian pdf (P. Højen-Sørensen, Thesis, 2001).
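A minimal illustration of the Gaussian (Laplace) approximation, applied here to a coin-flip posterior Beta(k+1, n−k+1) under a flat prior; the example is chosen for this note, not taken from the thesis.

```python
# Laplace approximation: fit a Gaussian at the posterior mode with variance
# equal to the inverse negative curvature of the log posterior. Example: a
# coin-flip posterior Beta(k+1, n-k+1) under a flat prior, whose log posterior
# is k*log(theta) + (n-k)*log(1-theta) up to a constant.
def laplace_beta(k, n):
    theta_hat = k / n                                  # posterior mode
    curvature = k / theta_hat ** 2 + (n - k) / (1 - theta_hat) ** 2
    return theta_hat, 1.0 / curvature                  # Gaussian mean, variance

mode, var = laplace_beta(k=7, n=20)
print(mode, var)  # mode 0.35, variance = mode * (1 - mode) / n
```

The recovered variance θ̂(1−θ̂)/n matches the familiar binomial-proportion error bar, showing how the curvature at the mode sets the width of the approximating Gaussian.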

Variational Bayes
Notation: x are observables, s are hidden variables. We analyse the log likelihood of a mixture model.

Variational Bayes
For any distribution q(s, θ), Jensen's inequality gives a lower bound on the log evidence:
log p(x) = log Σ_s ∫ dθ p(x, s, θ) ≥ Σ_s ∫ dθ q(s, θ) log [ p(x, s, θ) / q(s, θ) ],
with the gap equal to KL( q(s, θ) ‖ p(s, θ | x) ).

Variational Bayes:
Restricting q to the factorized form q(s, θ) = q(s) q(θ) and maximizing the bound gives the coupled updates
q(s) ∝ exp ⟨log p(x, s | θ)⟩_q(θ),   q(θ) ∝ p(θ) exp ⟨log p(x, s | θ)⟩_q(s),
which are iterated to convergence.
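The coupled updates can be sketched on the classic univariate Gaussian with unknown mean and precision under conjugate priors (following the standard treatment in Bishop's PRML, sec. 10.1.3); the prior constants and data here are illustrative.

```python
import numpy as np

# Variational Bayes for a univariate Gaussian with unknown mean mu and
# precision tau: factorize q(mu, tau) = q(mu) q(tau) and cycle the coupled
# updates until the variational parameters stabilize.
rng = np.random.default_rng(4)
x = rng.normal(1.0, 1.0, size=200)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3  # broad conjugate priors
E_tau = 1.0                                # initial guess for E_q[tau]
for _ in range(50):
    # q(mu) = N(m, 1/lam)
    m = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam = (lam0 + N) * E_tau
    # q(tau) = Gamma(a, b), using E_q[(x_i - mu)^2] = (x_i - m)^2 + 1/lam
    a = a0 + (N + 1) / 2
    b = b0 + 0.5 * (np.sum((x - m) ** 2) + N / lam
                    + lam0 * ((m - mu0) ** 2 + 1 / lam))
    E_tau = a / b
print(m, E_tau)  # posterior mean near the sample mean, E[tau] near 1/var(x)
```

Each update has the promised form: q(mu) uses the current expectation under q(tau) and vice versa, and both stay in the prior's family because the model is conjugate-exponential.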

Conjugate exponential families
If the complete-data likelihood p(x, s | θ) is in the exponential family and the prior p(θ) is conjugate to it, the variational posteriors keep the functional form of the prior, so each VB iteration only updates natural parameters.

Mini exercise
What are the natural parameters for a Gaussian? What are the natural parameters for a MoG?
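As a worked check of the first question (an editorial example, assuming the standard sufficient statistics T(x) = (x, x²); note a MoG, being a latent-variable model, is exponential-family only in the complete data (x, s)):

```python
import numpy as np

# The univariate Gaussian as an exponential family: with sufficient
# statistics T(x) = (x, x^2), the natural parameters are
# eta = (mu / sigma^2, -1 / (2 * sigma^2)). Round-trip check below.
def to_natural(mu, sigma2):
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def to_mean(eta):
    sigma2 = -1.0 / (2.0 * eta[1])
    return eta[0] * sigma2, sigma2

eta = to_natural(mu=2.0, sigma2=4.0)
print(eta, to_mean(eta))  # [0.5, -0.125] and (2.0, 4.0)
```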

Observation model and "Bayes factor"
GLM observation model: y = Xw + ε, ε ~ N(0, σ²I); the "Bayes factor" compares the evidence of two candidate models.

"Normal inverse gamma" prior - the conjugate prior for the GLM observation model:
p(w, σ²) = N(w | m, σ² V) · IG(σ² | a, b).

The Bayes factor is the ratio between the normalization constants of the NIGs: with Z(V, a, b) = (2π)^(d/2) |V|^(1/2) Γ(a) / b^a, each model's marginal likelihood is proportional to Z_posterior / Z_prior, and the Bayes factor is the ratio of these across models.
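A sketch of this computation for a Gaussian linear model with prior mean zero and V0 = v0·I; the priors, data, and the helper name `log_evidence` are illustrative assumptions, not from the slides.

```python
import numpy as np
from math import lgamma, log, pi

# Evidence of the GLM y = Xw + e, e ~ N(0, sigma^2 I), under the conjugate
# normal inverse-gamma prior w, sigma^2 ~ NIG(0, v0*I, a0, b0). The marginal
# likelihood is a ratio of NIG normalization constants, so the log Bayes
# factor between two designs is the difference of the two log evidences.
def log_evidence(X, y, v0=10.0, a0=1.0, b0=1.0):
    N, d = X.shape
    V0inv = np.eye(d) / v0
    Vn = np.linalg.inv(V0inv + X.T @ X)          # posterior covariance scale
    mn = Vn @ (X.T @ y)                          # posterior mean (m0 = 0)
    an = a0 + N / 2
    bn = b0 + 0.5 * (y @ y - mn @ np.linalg.inv(Vn) @ mn)
    _, logdet_Vn = np.linalg.slogdet(Vn)
    return (-0.5 * N * log(2 * pi) + 0.5 * (logdet_Vn - d * log(v0))
            + a0 * log(b0) - an * log(bn) + lgamma(an) - lgamma(a0))

rng = np.random.default_rng(5)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
y = 1.5 * x1 + rng.normal(scale=0.5, size=100)   # only x1 carries signal
log_bayes_factor = log_evidence(x1[:, None], y) - \
    log_evidence(np.column_stack([x1, x2]), y)
print(log_bayes_factor)  # > 0: evidence favors the model without x2
```

The irrelevant regressor x2 barely improves the fit but costs an Occam factor through the extra normalizer terms, so the Bayes factor favors the smaller model.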

Exercises
- Matthew Beal's Mixture of Factor Analyzers code (code available at variational-bayes.org)
- Code a VB version of the BGML for signal detection (code available for the exact posterior)