Perceptual and Sensory Augmented Computing Machine Learning, Summer’10 Machine Learning – Lecture 16 Approximate Inference 07.07.2010 Bastian Leibe RWTH.

Presentation transcript:

Machine Learning – Lecture 16: Approximate Inference
Bastian Leibe, RWTH Aachen

Announcements
Exercise 5 is available
• Junction tree algorithm
• s-t mincut
• Graph cuts
• Sampling
• MCMC

Course Outline
Fundamentals (2 weeks)
• Bayes Decision Theory
• Probability Density Estimation
Discriminative Approaches (4 weeks)
• Linear Discriminants, SVMs, Boosting
• Decision Trees, Random Forests, Model Selection
Generative Models (5 weeks)
• Bayesian Networks + Applications
• Markov Random Fields + Applications
• Exact Inference
• Approximate Inference
Unifying Perspective (1 week)

Recap: MRF Structure for Images
Basic structure
[Figure: grid of "true" image content nodes $x_i$, each linked to a noisy observation $y_i$]
Two components
• Observation model
  – How likely is it that node $x_i$ has label $L_i$ given observation $y_i$?
  – This relationship is usually learned from training data.
• Neighborhood relations
  – Simplest case: 4-neighborhood.
  – Serve as smoothing terms: they discourage neighboring pixels from having different labels.
  – These can either be learned or be set to fixed "penalties".

Recap: How to Set the Potentials?
Unary potentials
• E.g. a color model, modeled with a Mixture of Gaussians.
• Learn color distributions for each label.

Recap: How to Set the Potentials?
Pairwise potentials
• Potts model
  – Simplest discontinuity-preserving model.
  – Discontinuities between any pair of labels are penalized equally.
  – Useful when labels are unordered or the number of labels is small.
• Extension: "contrast sensitive Potts model"
  – Discourages label changes except in places where there is also a large change in the observations.

Recap: Graph Cuts for Binary Problems
[Figure: s-t graph showing n-links, t-links, the terminals s and t, and a cut]
• EM-style optimization: the "expected" intensities of object and background can be re-estimated.
[Boykov & Jolly, ICCV'01]
Slide credit: Yuri Boykov

Recap: s-t-Mincut Equivalent to Maxflow
[Figure: flow network with source, sink, and nodes $v_1$, $v_2$; initial flow = 0]
Augmenting path based algorithms
1. Find a path from source to sink with positive capacity.
2. Push the maximum possible flow through this path.
3. Repeat until no path can be found.
The algorithms assume non-negative capacities.
Slide credit: Pushmeet Kohli
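
The three augmenting-path steps above can be sketched in a few lines. This is a minimal illustration using breadth-first search to find each path (the Edmonds-Karp variant), not necessarily the algorithm used in the lecture's experiments. The function name `max_flow` and the edge capacities in the example are hypothetical, since the numbers on the original figure were not preserved.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Augmenting-path max-flow; capacity[u][v] is a non-negative edge capacity."""
    # Residual graph: every edge gets a reverse edge (initially 0 if absent).
    residual = {source: {}, sink: {}}
    for u, neighbors in capacity.items():
        for v, c in neighbors.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    flow = 0
    while True:
        # Step 1: find a source-to-sink path with positive residual capacity (BFS).
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # Step 3: no augmenting path is left.

        # Step 2: push the maximum possible flow (the bottleneck) along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical capacities for a two-node network like the one in the figure.
example = {"s": {"v1": 2, "v2": 9}, "v1": {"v2": 2, "t": 5}, "v2": {"v1": 1, "t": 4}}
print(max_flow(example, "s", "t"))
```

By the max-flow/min-cut theorem, the returned value equals the cost of the cheapest s-t cut, which is why the same routine can minimize a binary MRF energy once the energy has been converted into edge capacities.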

Recap: When Can s-t Graph Cuts Be Applied?
• The energy $E(L)$ consists of a regional term (realized by the t-links) and a boundary term (realized by the n-links).
• s-t graph cuts can only globally minimize binary energies that are submodular. [Boros & Hammer, 2002; Kolmogorov & Zabih, 2004]
• Submodularity is the discrete equivalent of convexity.
  – It implies that every local energy minimum is a global minimum.
  – The solution will be globally optimal.

Recap: Simple Binary Image Denoising Model
MRF structure
[Figure: "true" image content nodes linked by "smoothness constraints", each connected to a noisy observation through the observation process]
• Example: simple energy function ("Potts model")
  $E(x, y) = h \sum_i x_i - \beta \sum_{\{i,j\}} x_i x_j - \eta \sum_i x_i y_i$
  (prior term, smoothness term, observation term)
  – Smoothness term: fixed penalty $\beta$ if neighboring labels disagree.
  – Observation term: fixed penalty $\eta$ if label and observation disagree.
Image source: C. Bishop, 2006
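
To make the three terms concrete, the following sketch evaluates an energy of the form $E(x, y) = h \sum_i x_i - \beta \sum_{\{i,j\}} x_i x_j - \eta \sum_i x_i y_i$ on a small image with labels in $\{-1, +1\}$ and a 4-neighborhood. The function name `denoising_energy` and the default parameter values are illustrative assumptions, not values taken from the slide.

```python
def denoising_energy(x, y, h=0.0, beta=1.0, eta=2.1):
    """Energy of labeling x given observations y; both are lists of rows of +/-1."""
    rows, cols = len(x), len(x[0])
    prior = 0.0       # sum_i x_i
    smoothness = 0.0  # sum over 4-neighborhood pairs of x_i * x_j
    observation = 0.0 # sum_i x_i * y_i
    for r in range(rows):
        for c in range(cols):
            prior += x[r][c]
            observation += x[r][c] * y[r][c]
            if c + 1 < cols:  # right neighbor
                smoothness += x[r][c] * x[r][c + 1]
            if r + 1 < rows:  # lower neighbor
                smoothness += x[r][c] * x[r + 1][c]
    return h * prior - beta * smoothness - eta * observation


# A 3x3 observation whose center pixel has been flipped by noise.
noisy = [[1, 1, 1], [1, -1, 1], [1, 1, 1]]
clean = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
print(denoising_energy(clean, noisy), denoising_energy(noisy, noisy))
```

With these assumed parameters the all-ones labeling has lower energy than the labeling that copies the noisy observation, so the smoothness term outweighs the single disagreeing observation.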

Converting an MRF into an s-t Graph
[Figure: s-t graph with source, sink, and pixel nodes $x_i$, $x_j$]
Conversion: the energy is mapped to the edge weights of an s-t graph.
• Unary potentials are straightforward to set.
  – How?
  – Just insert $x_i = 1$ and $x_i = -1$ into the unary terms of the energy.

Converting an MRF into an s-t Graph
[Figure: s-t graph with source, sink, and pixel nodes $x_i$, $x_j$]
Conversion: the energy is mapped to the edge weights of an s-t graph.
• Unary potentials are straightforward to set.
• Pairwise potentials are trickier, since we don't know $x_i$!
  – Trick: the pairwise energy only has an influence if $x_i \neq x_j$.
  – Only in this case will the cut go through the edge $\{x_i, x_j\}$.

Dealing with Non-Binary Cases
• The limitation to binary energies is often a nuisance.
  – E.g. binary segmentation only…
• We would also like to solve multi-label problems.
  – The bad news: the problem is NP-hard with 3 or more labels!
• There exist approximation algorithms which extend graph cuts to the multi-label case:
  – $\alpha$-Expansion
  – $\alpha\beta$-Swap
• They are no longer guaranteed to return the globally optimal result.
  – But $\alpha$-Expansion has a guaranteed approximation quality (2-approximation) and converges in a few iterations.

$\alpha$-Expansion Move
Basic idea:
• Break the multi-way cut computation into a sequence of binary s-t cuts.
• The result is no longer globally optimal, but it has a guaranteed approximation quality, and the procedure typically converges in a few iterations.
[Figure: label $\alpha$ versus the other labels]
Slide credit: Yuri Boykov

$\alpha$-Expansion Algorithm
1. Start with any initial solution.
2. For each label "$\alpha$" in any (e.g. random) order:
   1. Compute the optimal $\alpha$-expansion move (s-t graph cuts).
   2. Decline the move if there is no energy decrease.
3. Stop when no expansion move would decrease the energy.
Slide credit: Yuri Boykov

Example: Stereo Vision
[Figure: original pair of "stereo" images, computed depth map, and ground-truth depth map]
Slide credit: Yuri Boykov

$\alpha$-Expansion Moves
• In each $\alpha$-expansion, a given label "$\alpha$" grabs space from the other labels.
[Figure: initial solution followed by successive $\alpha$-expansions]
• For each move, we choose the expansion that gives the largest decrease in the energy: a binary optimization problem.
Slide credit: Yuri Boykov

GraphCut Applications: "GrabCut"
Interactive Image Segmentation [Boykov & Jolly, ICCV'01]
• Rough region cues are sufficient.
• The segmentation boundary can be extracted from edges.
Procedure
• The user marks foreground and background regions with a brush.
• This is used to create an initial segmentation, which can then be corrected by additional brush strokes.
[Figure: user segmentation cues and additional segmentation cues]
Slide credit: Matthieu Bray

Iterated Graph Cuts
[Figure: energy after each iteration; segmentation result; foreground and background color models (Mixture of Gaussians) plotted in the R-G color plane]
Slide credit: Carsten Rother

GrabCut: Example Results

Applications: Interactive 3D Segmentation
[Y. Boykov, V. Kolmogorov, ICCV'03]
Slide credit: Yuri Boykov

Topics of This Lecture
Approximate Inference
• Variational methods
• Sampling approaches
Sampling approaches
• Sampling from a distribution
• Ancestral Sampling
• Rejection Sampling
• Importance Sampling
Markov Chain Monte Carlo
• Markov Chains
• Metropolis Algorithm
• Metropolis-Hastings Algorithm
• Gibbs Sampling

Approximate Inference
Exact Bayesian inference is often intractable.
• It is often infeasible to evaluate the posterior distribution or to compute expectations w.r.t. the distribution.
  – E.g. because the dimensionality of the latent space is too high.
  – Or because the posterior distribution has too complex a form.
• Problems with continuous variables
  – Required integrations may not have closed-form solutions.
• Problems with discrete variables
  – Marginalization involves summing over all possible configurations of the hidden variables.
  – There may be exponentially many such states.
We need to resort to approximation schemes.

Two Classes of Approximation Schemes
Deterministic approximations (variational methods)
• Based on analytical approximations to the posterior distribution
  – E.g. by assuming that it factorizes in a certain form
  – Or that it has a certain parametric form (e.g. a Gaussian).
• Can never generate exact results, but often scale to large applications.
Stochastic approximations (sampling methods)
• Given infinite computational resources, they can generate exact results.
• The approximation arises from the use of a finite amount of processor time.
• Enable the use of Bayesian techniques across many domains.
• But: computationally demanding, often limited to small-scale problems.

Sampling Idea
Objective:
• Evaluate the expectation of a function $f(z)$ w.r.t. a probability distribution $p(z)$:
  $\mathbb{E}[f] = \int f(z)\, p(z)\, dz$
Sampling idea
• Draw $L$ independent samples $z^{(l)}$ with $l = 1, \dots, L$ from $p(z)$.
• This allows the expectation to be approximated by a finite sum:
  $\hat{f} = \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})$
• As long as the samples $z^{(l)}$ are drawn independently from $p(z)$, then
  $\mathbb{E}[\hat{f}] = \mathbb{E}[f]$
• Unbiased estimate, independent of the dimension of $z$!
Slide adapted from Bernt Schiele
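
As a small illustration of the finite-sum approximation, the sketch below estimates $\mathbb{E}[z^2]$ under a standard normal distribution, whose exact value is 1. The helper name `monte_carlo_expectation` and the choice of test function are assumptions made for the example.

```python
import random


def monte_carlo_expectation(f, draw_sample, num_samples, rng):
    """Approximate E[f(z)] by the average of f over independent samples from p."""
    total = 0.0
    for _ in range(num_samples):
        total += f(draw_sample(rng))
    return total / num_samples


rng = random.Random(0)  # fixed seed so the run is repeatable
estimate = monte_carlo_expectation(
    f=lambda z: z * z,
    draw_sample=lambda r: r.gauss(0.0, 1.0),  # p(z) = N(0, 1)
    num_samples=20000,
    rng=rng,
)
print(estimate)  # should be close to the exact value 1
```

The accuracy depends on the variance of $f$ under $p$ and on the number of samples, not on the dimension of $z$.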

Sampling – Challenges
Problem 1: Samples might not be independent.
• The effective sample size might be much smaller than the apparent sample size.
Problem 2:
• If $f(z)$ is small in regions where $p(z)$ is large and vice versa, the expectation may be dominated by regions of small probability.
• Large sample sizes are necessary to achieve sufficient accuracy.

Parametric Density Model
Example:
• A simple multivariate ($d$-dimensional) Gaussian model
  $p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$
• This is a "generative" model in the sense that we can generate samples $x$ according to the distribution.
Slide adapted from Bernt Schiele

Sampling from a Gaussian
• Given: a 1-dim. Gaussian pdf (probability density function) $p(x \mid \mu, \sigma^2)$ and the corresponding cumulative distribution:
  $F_{\mu,\sigma^2}(x) = \int_{-\infty}^{x} p(x' \mid \mu, \sigma^2)\, dx'$
• To draw samples from a Gaussian, we can invert the cumulative distribution function:
  $u \sim \text{Uniform}(0, 1) \;\Rightarrow\; F_{\mu,\sigma^2}^{-1}(u) \sim p(x \mid \mu, \sigma^2)$
Slide credit: Bernt Schiele

Sampling from a pdf (Transformation method)
• In general, assume we are given the pdf $p(x)$ and the corresponding cumulative distribution:
  $F(x) = \int_{-\infty}^{x} p(z)\, dz$
• To draw samples from this pdf, we can invert the cumulative distribution function:
  $u \sim \text{Uniform}(0, 1) \;\Rightarrow\; F^{-1}(u) \sim p(x)$
Slide credit: Bernt Schiele
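
A case where the cumulative distribution can be inverted in closed form is the exponential distribution: $F(x) = 1 - e^{-\lambda x}$ gives $F^{-1}(u) = -\ln(1 - u)/\lambda$. The sketch below uses this example, which is an assumption chosen for illustration and does not appear on the slide.

```python
import math
import random


def sample_exponential(rate, rng):
    """Transformation method: push a uniform sample through the inverse CDF."""
    u = rng.random()                   # u ~ Uniform[0, 1)
    return -math.log(1.0 - u) / rate   # F^{-1}(u) for F(x) = 1 - exp(-rate * x)


rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(20000)]
print(sum(samples) / len(samples))  # should be close to the mean 1 / rate = 0.5
```

Using `1.0 - u` rather than `u` keeps the argument of the logarithm strictly positive, because `random()` can return 0 but never 1.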

Ancestral Sampling
Generalization of this idea to directed graphical models.
• The joint probability factorizes into conditional probabilities:
  $p(x) = \prod_{k=1}^{K} p(x_k \mid \text{pa}_k)$
Ancestral sampling
• Assume the variables are ordered such that there are no links from any node to a lower-numbered node.
• Start with the lowest-numbered node and draw a sample from its distribution.
• Cycle through each of the nodes in order and draw samples from the conditional distribution (where the parent variables are set to their sampled values).
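
The procedure can be sketched for a small network of binary variables. The network (rain, sprinkler, wet grass) and all probability values below are hypothetical and serve only to show the ordering requirement: each node is sampled after its parents.

```python
import random

# Each entry: (name, parent names, table mapping parent values to P(node = True)).
# The list is in topological order, so parents always precede their children.
NETWORK = [
    ("rain", (), {(): 0.2}),
    ("sprinkler", ("rain",), {(False,): 0.4, (True,): 0.01}),
    ("wet", ("sprinkler", "rain"), {
        (False, False): 0.0, (False, True): 0.8,
        (True, False): 0.9, (True, True): 0.99,
    }),
]


def ancestral_sample(network, rng):
    """Draw one joint sample by visiting the nodes in topological order."""
    sample = {}
    for name, parents, table in network:
        parent_values = tuple(sample[p] for p in parents)  # already sampled
        sample[name] = rng.random() < table[parent_values]
    return sample


rng = random.Random(0)
draws = [ancestral_sample(NETWORK, rng) for _ in range(20000)]
# Marginal P(sprinkler) = 0.2 * 0.01 + 0.8 * 0.4 = 0.322
print(sum(d["sprinkler"] for d in draws) / len(draws))
```

Each call returns one exact sample from the joint distribution, so marginals can be estimated simply by counting.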

Discussion
Transformation method
• Limited applicability, as we need to invert the indefinite integral of the required distribution $p(z)$.
More general
• Rejection Sampling
• Importance Sampling
Slide credit: Bernt Schiele

Rejection Sampling
Assumptions
• Sampling directly from $p(z)$ is difficult.
• But we can easily evaluate $p(z)$ (up to some normalization factor $Z_p$):
  $p(z) = \frac{1}{Z_p} \tilde{p}(z)$
Idea
• We need some simpler distribution $q(z)$ (called the proposal distribution) from which we can draw samples.
• Choose a constant $k$ such that:
  $\forall z: \; k\, q(z) \geq \tilde{p}(z)$
Slide credit: Bernt Schiele

Rejection Sampling
Sampling procedure
• Generate a number $z_0$ from $q(z)$.
• Generate a number $u_0$ from the uniform distribution over $[0, k\, q(z_0)]$.
• If $u_0 > \tilde{p}(z_0)$, reject the sample, otherwise accept.
  – The sample is rejected if it lies in the grey shaded area of the figure (between $\tilde{p}(z)$ and $k\, q(z)$).
  – The remaining pairs $(u_0, z_0)$ have a uniform distribution under the curve $\tilde{p}(z)$.
Discussion
• Original values of $z$ are generated from the distribution $q(z)$.
• Samples are accepted with probability $\tilde{p}(z) / (k\, q(z))$, so that
  $p(\text{accept}) = \int \frac{\tilde{p}(z)}{k\, q(z)}\, q(z)\, dz = \frac{1}{k} \int \tilde{p}(z)\, dz$
• $k$ should be as small as possible!
Slide credit: Bernt Schiele
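
The sampling procedure can be sketched as follows. The target $\tilde{p}(z) = z (1 - z)^4$ on $[0, 1]$ (an unnormalized Beta(2, 5) density), the uniform proposal, and the constant $k = 0.082$ are assumptions chosen for the example; $k$ only has to satisfy $k\, q(z) \geq \tilde{p}(z)$ everywhere, and here the maximum of $\tilde{p}$ is about 0.0819.

```python
import random


def rejection_sample(p_tilde, draw_q, q_pdf, k, num_samples, rng):
    """Return accepted samples and the observed acceptance rate."""
    samples = []
    proposals = 0
    while len(samples) < num_samples:
        z0 = draw_q(rng)                        # candidate from the proposal q(z)
        u0 = rng.uniform(0.0, k * q_pdf(z0))    # height under the envelope k*q(z0)
        proposals += 1
        if u0 <= p_tilde(z0):                   # keep points under the target curve
            samples.append(z0)
    return samples, num_samples / proposals


rng = random.Random(0)
samples, rate = rejection_sample(
    p_tilde=lambda z: z * (1.0 - z) ** 4,
    draw_q=lambda r: r.random(),   # q(z) = Uniform(0, 1)
    q_pdf=lambda z: 1.0,
    k=0.082,
    num_samples=5000,
    rng=rng,
)
print(sum(samples) / len(samples), rate)  # mean near 2/7, rate near 0.41
```

The observed acceptance rate approximates $(1/k) \int \tilde{p}(z)\, dz$, which is why a tight envelope (small $k$) matters.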

Rejection Sampling – Discussion
Limitation: high-dimensional spaces
• For rejection sampling to be of practical value, we require that $k\, q(z)$ be close to the required distribution, so that the rate of rejection is minimal.
Artificial example
• Assume that $p(z)$ is Gaussian with covariance matrix $\sigma_p^2 I$.
• Assume that $q(z)$ is Gaussian with covariance matrix $\sigma_q^2 I$.
• Obviously: $\sigma_q^2 \geq \sigma_p^2$.
• In $D$ dimensions: $k = (\sigma_q / \sigma_p)^D$.
  – Assume $\sigma_q$ is just 1% larger than $\sigma_p$.
  – $D = 1000 \Rightarrow k = 1.01^D \geq 20{,}000$
  – And $p(\text{accept}) \leq \frac{1}{20{,}000}$
• It is often impractical to find good proposal distributions for high dimensions!
Slide credit: Bernt Schiele
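
The figures in the artificial example can be checked with a few lines of arithmetic. The variable names are illustrative; the acceptance probability $1/k$ assumes that $p(z)$ is normalized.

```python
sigma_ratio = 1.01   # sigma_q is 1% larger than sigma_p
D = 1000             # number of dimensions

k = sigma_ratio ** D          # envelope constant k = (sigma_q / sigma_p)^D
accept_probability = 1.0 / k  # fraction of proposals that are accepted

print(k, accept_probability)  # k is a little above 20,000
```

Even a 1% mismatch per dimension therefore wastes all but roughly one in twenty thousand proposals at $D = 1000$.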

Importance Sampling
Approach
• Approximate expectations directly (but this does not enable us to draw samples from $p(z)$ directly).
• Goal: $\mathbb{E}[f] = \int f(z)\, p(z)\, dz$
Simplistic strategy: grid sampling
• Discretize $z$-space into a uniform grid.
• Evaluate the integrand as a sum of the form
  $\mathbb{E}[f] \simeq \sum_{l=1}^{L} f(z^{(l)})\, p(z^{(l)})$
• But: the number of terms grows exponentially with the number of dimensions!
Slide credit: Bernt Schiele

Importance Sampling
Idea
• Use a proposal distribution $q(z)$ from which it is easy to draw samples.
• Express expectations in the form of a finite sum over samples $\{z^{(l)}\}$ drawn from $q(z)$:
  $\mathbb{E}[f] = \int f(z)\, \frac{p(z)}{q(z)}\, q(z)\, dz \simeq \frac{1}{L} \sum_{l=1}^{L} \frac{p(z^{(l)})}{q(z^{(l)})}\, f(z^{(l)})$
• with importance weights
  $r_l = \frac{p(z^{(l)})}{q(z^{(l)})}$
Slide credit: Bernt Schiele

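Importance Sampling
Typical setting:
• $p(z)$ can only be evaluated up to an unknown normalization constant: $p(z) = \tilde{p}(z) / Z_p$
• $q(z)$ can also be treated in a similar fashion: $q(z) = \tilde{q}(z) / Z_q$
• Then
  $\mathbb{E}[f] = \frac{Z_q}{Z_p} \int f(z)\, \frac{\tilde{p}(z)}{\tilde{q}(z)}\, q(z)\, dz \simeq \frac{Z_q}{Z_p} \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l\, f(z^{(l)})$
• with: $\tilde{r}_l = \frac{\tilde{p}(z^{(l)})}{\tilde{q}(z^{(l)})}$
Slide credit: Bernt Schiele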

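Importance Sampling
• The ratio of normalization constants can be evaluated from the same samples:
  $\frac{Z_p}{Z_q} = \frac{1}{Z_q} \int \tilde{p}(z)\, dz = \int \frac{\tilde{p}(z)}{\tilde{q}(z)}\, q(z)\, dz \simeq \frac{1}{L} \sum_{l=1}^{L} \tilde{r}_l$
• and therefore
  $\mathbb{E}[f] \simeq \sum_{l=1}^{L} w_l\, f(z^{(l)})$
• with
  $w_l = \frac{\tilde{r}_l}{\sum_m \tilde{r}_m}$
Slide credit: Bernt Schiele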
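
A minimal sketch of this self-normalized estimator, where each weight is the unnormalized ratio $\tilde{p}/\tilde{q}$ divided by the sum of all such ratios. The target (an unnormalized Gaussian with mean 1 and unit variance) and the broader zero-mean Gaussian proposal are assumptions chosen so that the exact answer, $\mathbb{E}[z] = 1$, is known.

```python
import math
import random


def importance_expectation(f, p_tilde, q_tilde, draw_q, num_samples, rng):
    """Estimate E_p[f] from samples of q using normalized importance weights."""
    zs = [draw_q(rng) for _ in range(num_samples)]
    ratios = [p_tilde(z) / q_tilde(z) for z in zs]   # unnormalized weights
    total = sum(ratios)
    # Dividing by the sum cancels the unknown constants Z_p and Z_q.
    return sum(r * f(z) for r, z in zip(ratios, zs)) / total


rng = random.Random(0)
estimate = importance_expectation(
    f=lambda z: z,
    p_tilde=lambda z: math.exp(-0.5 * (z - 1.0) ** 2),  # N(1, 1), unnormalized
    q_tilde=lambda z: math.exp(-z * z / 8.0),           # N(0, 2^2), unnormalized
    draw_q=lambda r: r.gauss(0.0, 2.0),
    num_samples=50000,
    rng=rng,
)
print(estimate)  # should be close to 1
```

The proposal is deliberately wider than the target so that it is not small where the target has significant mass.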

Importance Sampling – Discussion
Observations
• The success of importance sampling depends crucially on how well the sampling distribution $q(z)$ matches the desired distribution $p(z)$.
• Often, $p(z) f(z)$ is strongly varying and has a significant proportion of its mass concentrated over small regions of $z$-space.
• The weights $r_l$ may be dominated by a few weights having large values.
• Practical issue: if none of the samples falls in the regions where $p(z) f(z)$ is large…
  – The results may be arbitrarily in error.
  – And there will be no diagnostic indication (no large variance in $r_l$)!
• Key requirement for the sampling distribution $q(z)$:
  – It should not be small or zero in regions where $p(z)$ is significant!
Slide credit: Bernt Schiele

Independent Sampling vs. Markov Chains
So far
• We've considered two methods, Rejection Sampling and Importance Sampling, which were both based on independent samples from $q(z)$.
• However, for many problems of practical interest, it is difficult or impossible to find a $q(z)$ with the necessary properties.
Different approach
• We abandon the idea of independent sampling.
• Instead, we rely on a Markov chain to generate dependent samples from the target distribution.
• Independence would be a nice thing, but it is not necessary for the Monte Carlo estimate to be valid.
Slide credit: Zoubin Ghahramani

MCMC – Markov Chain Monte Carlo
Overview
• Allows sampling from a large class of distributions.
• Scales well with the dimensionality of the sample space.
Idea
• We maintain a record of the current state $z^{(\tau)}$.
• The proposal distribution depends on the current state: $q(z \mid z^{(\tau)})$.
• The sequence of samples forms a Markov chain $z^{(1)}, z^{(2)}, \dots$
Setting
• We can evaluate $p(z)$ (up to some normalizing factor $Z_p$): $p(z) = \tilde{p}(z) / Z_p$
• At each time step, we generate a candidate sample from the proposal distribution and accept the sample according to a criterion.
Slide credit: Bernt Schiele

MCMC – Metropolis Algorithm
Metropolis algorithm [Metropolis et al., 1953]
• The proposal distribution is symmetric: $q(z_A \mid z_B) = q(z_B \mid z_A)$
• The new candidate sample $z^*$ is accepted with probability
  $A(z^*, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^*)}{\tilde{p}(z^{(\tau)})}\right)$
Implementation
• Choose a random number $u$ uniformly from the unit interval $(0, 1)$.
• Accept the sample if $A(z^*, z^{(\tau)}) > u$.
Note
• New candidate samples are always accepted if $\tilde{p}(z^*) \geq \tilde{p}(z^{(\tau)})$.
  – I.e. when the new sample has higher probability than the previous one.
• The algorithm sometimes accepts a state with lower probability.
Slide credit: Bernt Schiele

MCMC – Metropolis Algorithm
Two cases
• If the new sample is accepted: $z^{(\tau+1)} = z^*$
• Otherwise: $z^{(\tau+1)} = z^{(\tau)}$
• This is in contrast to rejection sampling, where rejected samples are simply discarded.
• It leads to multiple copies of the same sample!
Slide credit: Bernt Schiele
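
The two cases can be seen in a short random-walk sketch: a rejected candidate still produces a chain entry, namely a repeat of the current state. The target (an unnormalized standard normal), the Gaussian proposal, and the step size of 1.0 are assumptions chosen for the example.

```python
import math
import random


def metropolis(p_tilde, z_init, step, num_steps, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    z = z_init
    chain = []
    for _ in range(num_steps):
        z_star = z + rng.gauss(0.0, step)                # symmetric proposal
        accept_prob = min(1.0, p_tilde(z_star) / p_tilde(z))
        if rng.random() < accept_prob:
            z = z_star                                   # accepted: move
        chain.append(z)                                  # rejected: repeat old state
    return chain


rng = random.Random(0)
chain = metropolis(lambda z: math.exp(-0.5 * z * z), 0.0, 1.0, 50000, rng)
kept = chain[1000:]                                      # discard a short burn-in
mean = sum(kept) / len(kept)
variance = sum((z - mean) ** 2 for z in kept) / len(kept)
print(mean, variance)  # should be close to 0 and 1
```

Only the ratio of unnormalized densities is needed, so the normalization constant $Z_p$ never has to be computed. The sketch assumes that `p_tilde(z_init)` is strictly positive.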

MCMC – Metropolis Algorithm
Property
• When $q(z_A \mid z_B) > 0$ for all $z$, the distribution of $z^{(\tau)}$ tends to $p(z)$ as $\tau \to \infty$.
Note
• The sequence $z^{(1)}, z^{(2)}, \dots$ is not a set of independent samples from $p(z)$, as successive samples are highly correlated.
• We can obtain (largely) independent samples by just retaining every $M$-th sample.
Example: Sampling from a Gaussian
• Proposal: Gaussian with $\sigma = 0.2$.
• Green: accepted samples
• Red: rejected samples
Slide credit: Bernt Schiele

Markov Chains
Question
• How can we show that $z^{(\tau)}$ tends to $p(z)$ as $\tau \to \infty$?
Markov chains
• First-order Markov chain:
  $p(z^{(m+1)} \mid z^{(1)}, \dots, z^{(m)}) = p(z^{(m+1)} \mid z^{(m)})$
• Marginal probability:
  $p(z^{(m+1)}) = \sum_{z^{(m)}} p(z^{(m+1)} \mid z^{(m)})\, p(z^{(m)})$
Slide credit: Bernt Schiele

Markov Chains – Properties
Invariant distribution
• A distribution is said to be invariant (or stationary) w.r.t. a Markov chain if each step in the chain leaves that distribution invariant.
• Transition probabilities: $T(z^{(m)}, z^{(m+1)}) = p(z^{(m+1)} \mid z^{(m)})$
• Distribution $p^*(z)$ is invariant if: $p^*(z) = \sum_{z'} T(z', z)\, p^*(z')$
Detailed balance
• Sufficient (but not necessary) condition to ensure that a distribution is invariant:
  $p^*(z)\, T(z, z') = p^*(z')\, T(z', z)$
• A Markov chain which respects detailed balance is reversible.
Slide credit: Bernt Schiele
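
The step from detailed balance to invariance is short enough to write out. Using the notation $T(z, z') = p(z' \mid z)$ for the transition probabilities, detailed balance $p^*(z')\, T(z', z) = p^*(z)\, T(z, z')$ is substituted into the sum, and the transition probabilities out of $z$ sum to one:

```latex
\begin{align}
\sum_{z'} p^*(z')\, T(z', z)
  &= \sum_{z'} p^*(z)\, T(z, z') \\
  &= p^*(z) \sum_{z'} p(z' \mid z) \\
  &= p^*(z)
\end{align}
```

The converse does not hold in general, which is why detailed balance is sufficient but not necessary.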

MCMC – Metropolis-Hastings Algorithm
Metropolis-Hastings algorithm
• Generalization: the proposal distribution is not required to be symmetric.
• The new candidate sample $z^*$ is accepted with probability
  $A_k(z^*, z^{(\tau)}) = \min\left(1, \frac{\tilde{p}(z^*)\, q_k(z^{(\tau)} \mid z^*)}{\tilde{p}(z^{(\tau)})\, q_k(z^* \mid z^{(\tau)})}\right)$
• where $k$ labels the members of the set of possible transitions considered.
Note
• When the proposal distributions are symmetric, Metropolis-Hastings reduces to the standard Metropolis algorithm.
Slide credit: Bernt Schiele
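
A sketch with a genuinely asymmetric proposal shows where the correction factor enters. The example is an assumption chosen for illustration: the target is $\tilde{p}(z) = z^2 e^{-z}$ on $z > 0$ (an unnormalized Gamma density with mean 3), and the proposal multiplies the current state by a log-normal factor, for which $q(z \mid z^*) / q(z^* \mid z) = z^* / z$.

```python
import math
import random


def metropolis_hastings_lognormal(p_tilde, z_init, scale, num_steps, rng):
    """Metropolis-Hastings with a multiplicative log-normal proposal on z > 0."""
    z = z_init
    chain = []
    for _ in range(num_steps):
        z_star = z * math.exp(rng.gauss(0.0, scale))     # asymmetric proposal
        # Hastings ratio: target ratio times q(z | z*) / q(z* | z) = z* / z.
        ratio = (p_tilde(z_star) * z_star) / (p_tilde(z) * z)
        if rng.random() < min(1.0, ratio):
            z = z_star
        chain.append(z)
    return chain


rng = random.Random(0)
chain = metropolis_hastings_lognormal(
    lambda z: z * z * math.exp(-z), z_init=1.0, scale=0.8, num_steps=60000, rng=rng
)
kept = chain[5000:]
print(sum(kept) / len(kept))  # should be close to the target mean 3
```

Dropping the factor `z_star / z` would turn this into plain Metropolis with a proposal that is not symmetric, and the chain would no longer have the intended target as its invariant distribution.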

MCMC – Metropolis-Hastings Algorithm
Properties
• We can show that $p(z)$ is an invariant distribution of the Markov chain defined by the Metropolis-Hastings algorithm.
• We show detailed balance:
  $p(z)\, q_k(z' \mid z)\, A_k(z', z) = \min\left( p(z)\, q_k(z' \mid z),\; p(z')\, q_k(z \mid z') \right)$
  $\quad = \min\left( p(z')\, q_k(z \mid z'),\; p(z)\, q_k(z' \mid z) \right)$
  $\quad = p(z')\, q_k(z \mid z')\, A_k(z, z')$
Slide credit: Bernt Schiele

MCMC – Metropolis-Hastings Algorithm
Schematic illustration
• For continuous state spaces, a common choice of proposal distribution is a Gaussian centered on the current state.
• What should be the variance of the proposal distribution?
  – Large variance: the rejection rate will be high for complex problems.
  – The scale $\rho$ of the proposal distribution should be as large as possible without incurring high rejection rates.
• $\rho$ should be of the same order as the smallest length scale $\sigma_{\min}$.
• This causes the system to explore the distribution by means of a random walk.
  – Undesired behavior: the number of steps needed to arrive at a state that is independent of the original state is of order $(\sigma_{\max} / \sigma_{\min})^2$.
  – Strong correlations can slow down the Metropolis algorithm!

Gibbs Sampling
Approach
• MCMC algorithm that is simple and widely applicable.
• May be seen as a special case of Metropolis-Hastings.
Idea
• Sample variable-wise: replace $z_i$ by a value drawn from the distribution $p(z_i \mid z_{\setminus i})$.
  – This means we update one coordinate at a time.
• Repeat the procedure either by cycling through all variables or by choosing the next variable at random.
Slide credit: Bernt Schiele

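Gibbs Sampling
Example
• Assume a distribution $p(z_1, z_2, z_3)$.
• Replace $z_1^{(\tau)}$ with a new value drawn from $z_1^{(\tau+1)} \sim p(z_1 \mid z_2^{(\tau)}, z_3^{(\tau)})$
• Replace $z_2^{(\tau)}$ with a new value drawn from $z_2^{(\tau+1)} \sim p(z_2 \mid z_1^{(\tau+1)}, z_3^{(\tau)})$
• Replace $z_3^{(\tau)}$ with a new value drawn from $z_3^{(\tau+1)} \sim p(z_3 \mid z_1^{(\tau+1)}, z_2^{(\tau+1)})$
• And so on…
Slide credit: Bernt Schiele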

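Gibbs Sampling
Properties
• The factor that determines the acceptance probability in Metropolis-Hastings is given by
  $A(z^*, z) = \frac{p(z^*)\, q_k(z \mid z^*)}{p(z)\, q_k(z^* \mid z)} = \frac{p(z_k^* \mid z_{\setminus k}^*)\, p(z_{\setminus k}^*)\, p(z_k \mid z_{\setminus k}^*)}{p(z_k \mid z_{\setminus k})\, p(z_{\setminus k})\, p(z_k^* \mid z_{\setminus k})} = 1$
  (using $z_{\setminus k}^* = z_{\setminus k}$, since only component $k$ changes).
• I.e. we get an algorithm which always accepts!
• If you can compute (and sample from) the conditionals, you can apply Gibbs sampling.
• The algorithm is completely parameter free.
• It can also be applied to subsets of variables.
Slide credit: Zoubin Ghahramani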

Gibbs Sampling
Example
• 20 iterations of Gibbs sampling on a bivariate Gaussian.
• Note: strong correlations can slow down Gibbs sampling.
Slide credit: Zoubin Ghahramani
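
The bivariate Gaussian case is easy to sketch because both conditionals are one-dimensional Gaussians. The example assumes zero means, unit variances, and a correlation coefficient `rho` (a hypothetical parameter name, unrelated to the proposal scale used for Metropolis-Hastings), so that $p(z_1 \mid z_2) = \mathcal{N}(\rho z_2, 1 - \rho^2)$ and symmetrically for $z_2$.

```python
import math
import random


def gibbs_bivariate_gaussian(rho, num_steps, rng, z1=0.0, z2=0.0):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian."""
    cond_std = math.sqrt(1.0 - rho * rho)   # std of each conditional
    chain = []
    for _ in range(num_steps):
        z1 = rng.gauss(rho * z2, cond_std)  # draw z1 from p(z1 | z2)
        z2 = rng.gauss(rho * z1, cond_std)  # draw z2 from p(z2 | new z1)
        chain.append((z1, z2))
    return chain


rng = random.Random(0)
chain = gibbs_bivariate_gaussian(rho=0.8, num_steps=60000, rng=rng)
kept = chain[1000:]
# With zero means and unit variances, E[z1 * z2] equals the correlation.
print(sum(a * b for a, b in kept) / len(kept))  # should be close to 0.8
```

Every draw is accepted, and there is no step size to tune. As `rho` approaches 1 the conditional standard deviation shrinks, so each update moves only a short distance, which is the slowdown mentioned on the slide.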

Summary: Approximate Inference
Exact Bayesian inference is often intractable.
Rejection and Importance Sampling
• Generate independent samples.
• Impractical in high-dimensional state spaces.
Markov Chain Monte Carlo (MCMC)
• Simple and effective (even though typically computationally expensive).
• Scales well with the dimensionality of the state space.
• Issues of convergence have to be considered carefully.
Gibbs Sampling
• Used extensively in practice.
• Parameter free.
• Requires sampling from conditional distributions.

References and Further Reading
• Sampling methods for approximate inference are described in detail in Chapter 11 of Bishop's book.
• Another good introduction to Monte Carlo methods can be found in Chapter 29 of MacKay's book (also available online).
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
David MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.