Recent Advances in Bayesian Inference Techniques

Slides:

Advertisements

Similar presentations

Pattern Recognition and Machine Learning

Advertisements

Pattern Recognition and Machine Learning

Part 2: Unsupervised Learning

Generative Models Thus far we have essentially considered techniques that perform classification indirectly by modeling the training data, optimizing.

Factorial Mixture of Gaussians and the Marginal Independence Model Ricardo Silva Joint work-in-progress with Zoubin Ghahramani.

Feature Selection as Relevant Information Encoding Naftali Tishby School of Computer Science and Engineering The Hebrew University, Jerusalem, Israel NIPS.

Pattern Recognition and Machine Learning

Biointelligence Laboratory, Seoul National University

Protein Fold Recognition with Relevance Vector Machines Patrick Fernie COMS 6772 Advanced Machine Learning 12/05/2005.

Computer vision: models, learning and inference Chapter 8 Regression.

CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.

Supervised Learning Recap

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Jensen’s Inequality (Special Case) EM Theorem.

Computer vision: models, learning and inference

EE462 MLCV Lecture Introduction of Graphical Models Markov Random Fields Segmentation Tae-Kyun Kim 1.

Industrial Engineering College of Engineering Bayesian Kernel Methods for Binary Classification and Online Learning Problems Theodore Trafalis Workshop.

Visual Recognition Tutorial

Variational Inference and Variational Message Passing

Lecture 17: Supervised Learning Recap Machine Learning April 6, 2010.

Pattern Recognition and Machine Learning

Predictive Automatic Relevance Determination by Expectation Propagation Yuan (Alan) Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani.

Latent Dirichlet Allocation a generative model for text

Collaborative Ordinal Regression Shipeng Yu Joint work with Kai Yu, Volker Tresp and Hans-Peter Kriegel University of Munich, Germany Siemens Corporate.

Visual Recognition Tutorial

Arizona State University DMML Kernel Methods – Gaussian Processes Presented by Shankar Bhargav.

Using ranking and DCE data to value health states on the QALY scale using conventional and Bayesian methods Theresa Cain.

Study of Sparse Online Gaussian Process for Regression EE645 Final Project May 2005 Eric Saint Georges.

CSC2535: 2013 Advanced Machine Learning Lecture 3a: The Origin of Variational Bayes Geoffrey Hinton.

Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.

ECE 8443 – Pattern Recognition LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Bias in ML Estimates Bayesian Estimation Example Resources:

Correlated Topic Models By Blei and Lafferty (NIPS 2005) Presented by Chunping Wang ECE, Duke University August 4 th, 2006.

ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Deterministic vs. Random Maximum A Posteriori Maximum Likelihood Minimum.

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 3: LINEAR MODELS FOR REGRESSION.

ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.

ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Conjugate Priors Multinomial Gaussian MAP Variance Estimation Example.

CS 782 – Machine Learning Lecture 4 Linear Models for Classification  Probabilistic generative models  Probabilistic discriminative models.

Christopher M. Bishop, Pattern Recognition and Machine Learning.

Latent Dirichlet Allocation D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3: , January Jonathan Huang

Overview of the final test for CSC Overview PART A: 7 easy questions –You should answer 5 of them. If you answer more we will select 5 at random.

INTRODUCTION TO Machine Learning 3rd Edition

Some Aspects of Bayesian Approach to Model Selection Vetrov Dmitry Dorodnicyn Computing Centre of RAS, Moscow.

Biointelligence Laboratory, Seoul National University

Lecture 2: Statistical learning primer for biologists

Latent Dirichlet Allocation

Machine Learning 5. Parametric Methods.

1 Chapter 8: Model Inference and Averaging Presented by Hui Fang.

Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.

CS Statistical Machine learning Lecture 25 Yuan (Alan) Qi Purdue CS Nov

SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.

Density Estimation in R Ha Le and Nikolaos Sarafianos COSC 7362 – Advanced Machine Learning Professor: Dr. Christoph F. Eick 1.

Ch 1. Introduction Pattern Recognition and Machine Learning, C. M. Bishop, Updated by J.-H. Eom (2 nd round revision) Summarized by K.-I.

ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.

Learning Deep Generative Models by Ruslan Salakhutdinov

CEE 6410 Water Resources Systems Analysis

LECTURE 11: Advanced Discriminant Analysis

Variational Bayes Model Selection for Mixture Distribution

Alan Qi Thomas P. Minka Rosalind W. Picard Zoubin Ghahramani

Computer vision: models, learning and inference

Multimodal Learning with Deep Boltzmann Machines

Machine Learning Basics

Data Mining Lecture 11.

Latent Variables, Mixture Models and EM

Bayesian Models in Machine Learning

Bayesian Inference for Mixture Language Models

Robust Full Bayesian Learning for Neural Networks

Parametric Methods Berlin Chen, 2005 References:

Biointelligence Laboratory, Seoul National University

Machine Learning: Lecture 6

Ch 3. Linear Models for Regression (2/2) Pattern Recognition and Machine Learning, C. M. Bishop, Previously summarized by Yung-Kyun Noh Updated.

Machine Learning: UNIT-3 CHAPTER-1

Presentation transcript:

Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004

Abstract Bayesian methods offer significant advantages over many conventional techniques such as maximum likelihood. However, their practicality has traditionally been limited by the computational cost of implementing them, which has often been done using Monte Carlo methods. In recent years, however, the applicability of Bayesian methods has been greatly extended through the development of fast analytical techniques such as variational inference. In this talk I will give a tutorial introduction to variational methods and will demonstrate their applicability in both supervised and unsupervised learning domains. I will also discuss techniques for automating variational inference, allowing rapid prototyping and testing of new probabilistic models. Christopher M. Bishop

Overview What is “Bayesian Inference”? Variational methods Example 1: uni-variate Gaussian Example 2: mixture of Gaussians Example 3: sparse kernel machines (RVM) Example 4: latent Dirichlet allocation Automatic variational inference Christopher M. Bishop

Maximum Likelihood Parametric model Data set (i.i.d.) where Likelihood function Maximize (log) likelihood Predictive distribution Christopher M. Bishop

Regularized Maximum Likelihood Prior , posterior MAP (maximum posterior) Predictive distribution For example, if data model is Gaussian with unknown mean and if prior over mean is Gaussian which is least squares with quadratic regularizer Christopher M. Bishop

Bayesian Inference Key idea is to marginalize over unknown parameters, rather than make point estimates avoids severe over-fitting of ML/MAP allows direct model comparison Most interesting probabilistic models also have hidden (latent) variables – we should marginalize over these too Such integrations (summations) are generally intractable Traditional approach: Markov chain Monte Carlo computationally very expensive limited to small data sets Christopher M. Bishop

This Talk Variational methods extend the practicality of Bayesian inference to “medium sized” data sets Christopher M. Bishop

Data Set Size Problem 1: learn the function for from 100 (slightly) noisy examples data set is computationally small but statistically large Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images data set is computationally large but statistically small Bayesian inference computationally more demanding than ML or MAP significant benefit for statistically small data sets Christopher M. Bishop

Model Complexity A central issue in statistical inference is the choice of model complexity: too simple poor predictions too complex poor predictions (and slow on test) Maximum likelihood always favours more complex models – over-fitting It is usual to resort to cross-validation computationally expensive limited to one or two complexity parameters Bayesian inference can determine model complexity from training data – even with many complexity parameters Still a good idea to test final model on independent data Christopher M. Bishop

Variational Inference Goal: approximate posterior by a simpler distribution for which marginalization is tractable Posterior related to joint by marginal likelihood also a key quantity for model comparison Christopher M. Bishop

Variational Inference For an arbitrary we have where Kullback-Leibler divergence satisfies Christopher M. Bishop

Variational Inference Choose to maximize the lower bound Christopher M. Bishop

Variational Inference Free-form optimization over would give the true posterior distribution – but this is intractable by definition One approach would be to consider a parametric family of distributions and choose the best member Here we consider factorized approximations with free-form optimization of the factors A few lines of algebra shows that the optimum factors are These are coupled so we need to iterate Christopher M. Bishop

Christopher M. Bishop

Lower Bound Can also be evaluated Useful for maths + code verification Also useful for model comparison and hence Christopher M. Bishop

Example 1: Simple Gaussian Likelihood function Conjugate priors Factorized variational distribution mean precision (inverse variance) Christopher M. Bishop

Variational Posterior Distribution where Christopher M. Bishop

Initial Configuration Christopher M. Bishop

After Updating Christopher M. Bishop

After Updating Christopher M. Bishop

Converged Solution Christopher M. Bishop

Applications of Variational Inference Hidden Markov models (MacKay) Neural networks (Hinton) Bayesian PCA (Bishop) Independent Component Analysis (Attias) Mixtures of Gaussians (Attias; Ghahramani and Beal) Mixtures of Bayesian PCA (Bishop and Winn) Flexible video sprites (Frey et al.) Audio-video fusion for tracking (Attias et al.) Latent Dirichlet Allocation (Jordan et al.) Relevance Vector Machine (Tipping and Bishop) Object recognition in images (Li et al.) … Christopher M. Bishop

Example 2: Gaussian Mixture Model Linear superposition of Gaussians Conventional maximum likelihood solution using EM: E-step: evaluate responsibilities Christopher M. Bishop

Gaussian Mixture Model M-step: re-estimate parameters Problems: singularities how to choose K? Christopher M. Bishop

Bayesian Mixture of Gaussians Conjugate priors for the parameters: Dirichlet prior for mixing coefficients normal-Wishart prior for means and precisions where the Wishart distribution is given by Christopher M. Bishop

Graphical Representation Parameters and latent variables appear on equal footing Christopher M. Bishop

Variational Mixture of Gaussians Assume factorized posterior distribution No other assumptions! Gives optimal solution in the form where is a Dirichlet, and is a Normal-Wishart; is multinomial Christopher M. Bishop

Sufficient Statistics Similar computational cost to maximum likelihood EM No singularities! Predictive distribution is mixture of Student-t distributions Christopher M. Bishop

Variational Equations for GMM Christopher M. Bishop

Time between eruptions (minutes) Old Faithful Data Set Time between eruptions (minutes) Duration of eruption (minutes) Christopher M. Bishop

Bound vs. K for Old Faithful Data Christopher M. Bishop

Bayesian Model Complexity Christopher M. Bishop

Sparse Gaussian Mixtures Instead of comparing different values of K, start with a large value and prune out excess components Achieved by treating mixing coefficients as parameters, and maximizing marginal likelihood (Corduneanu and Bishop, AI Stats 2001) Gives simple re-estimation equations for the mixing coefficients – interleave with variational updates Christopher M. Bishop

Christopher M. Bishop

Christopher M. Bishop

Example 3: RVM Relevance Vector Machine (Tipping, 1999) Bayesian alternative to support vector machine (SVM) Limitations of the SVM: two classes large number of kernels (in spite of sparsity) kernels must satisfy Mercer criterion cross-validation to set parameters C (and ε) decisions at outputs instead of probabilities Christopher M. Bishop

Relevance Vector Machine Linear model as for SVM Input vectors and targets Regression Classification Christopher M. Bishop

Relevance Vector Machine Gaussian prior for with hyper-parameters Hyper-priors over (and if regression) A high proportion of are driven to large values in the posterior distribution, and corresponding are driven to zero, giving a sparse model Christopher M. Bishop

Relevance Vector Machine Graphical model representation (regression) For classification use sigmoid (or softmax for multi-class) outputs and omit noise node Christopher M. Bishop

Relevance Vector Machine Regression: synthetic data Christopher M. Bishop

Relevance Vector Machine Regression: synthetic data Christopher M. Bishop

Relevance Vector Machine Classification: SVM Christopher M. Bishop

Relevance Vector Machine Classification: VRVM Christopher M. Bishop

Relevance Vector Machine Results on Ripley regression data (Bayes error rate 8%) Results on classification benchmark data Model Error No. kernels SVM 10.6% 38 VRVM 9.2% 4 Errors Kernels SVM GP RVM Pima Indians 67 68 65 109 200 4 U.S.P.S. 4.4% - 5.1% 2540 316 Christopher M. Bishop

Relevance Vector Machine Properties comparable error rates to SVM on new data no cross-validation to set comlexity parameters applicable to wide choice of basis function multi-class classification probabilistic outputs dramatically fewer kernels (by an order of magnitude) but, slower to train than SVM Christopher M. Bishop

Fast RVM Training Tipping and Faul (2003) Analytic treatment of each hyper-parameter in turn Applied to 250,000 image patches (face/non-face) Recently applied to regression with 106 data points Christopher M. Bishop

Face Detection Christopher M. Bishop

Face Detection Christopher M. Bishop

Face Detection Christopher M. Bishop

Face Detection Christopher M. Bishop

Face Detection Christopher M. Bishop

Example 4: Latent Dirichlet Allocation Blei, Jordan and Ng (2003) Generative model of documents (but broadly applicable e.g. collaborative filtering, image retrieval, bioinformatics) Generative model: choose choose topic choose word Christopher M. Bishop

Latent Dirichlet Allocation Variational approximation Data set: 15,000 documents 90,000 terms 2.1 million words Model: 100 factors 9 million parameters MCMC totally infeasible for this problem Christopher M. Bishop

Automatic Variational Inference Currently for each new model we have to derive the variational update equations write application-specific code to find the solution Each can be time consuming and error prone Can we build a general-purpose inference engine which automates these procedures? Christopher M. Bishop

VIBES Variational Inference for Bayesian Networks Bishop and Winn (1999, 2004) A general inference engine using variational methods Analogous to BUGS for MCMC VIBES is available on the web http://vibes.sourceforge.net/index.shtml Christopher M. Bishop

VIBES (cont’d) A key observation is that in the general solution the update for a particular node (or group of nodes) depends only on other nodes in the Markov blanket Permits a local message-passing framework which is independent of the particular graph structure Christopher M. Bishop

VIBES (cont’d) Christopher M. Bishop

VIBES (cont’d) Christopher M. Bishop

VIBES (cont’d) Christopher M. Bishop

New Book Pattern Recognition and Machine Learning Springer (2005) 600 pages, hardback, four colour, maximum $75 Graduate level text book Worked solutions to all 250 exercises Complete lectures on www Companion software text with Ian Nabney Matlab software on www Christopher M. Bishop

Conclusions Variational inference: a broad class of new semi-analytical algorithms for Bayesian inference Applicable to much larger data sets than MCMC Can be automated for rapid prototyping Christopher M. Bishop

Viewgraphs and tutorials available from: research.microsoft.com/~cmbishop