BAYESIAN INFERENCE Sampling techniques


BAYESIAN INFERENCE Sampling techniques Andreas Steingötter

Motivation & Background: Exact inference is intractable, so we have to resort to some form of approximation.

Motivation & Background: Variational Bayes is a deterministic approximation and is not exact even in principle. An alternative is to perform approximate inference by numerical sampling, also known as Monte Carlo techniques.

Motivation & Background: The posterior distribution p(z) is required (primarily) for the purpose of evaluating expectations E[f] = ∫ f(z) p(z) dz. For example, f(z) may be the predictions made by a model with parameters z; or, if p(z) is the parameter prior and f(z) = p(y|z) is the likelihood, then E[f] is the marginal likelihood (evidence) for the model.

Motivation & Background: The classical Monte Carlo approximation is E[f] ≈ (1/L) Σ_{l=1}^{L} f(z^(l)), where the z^(l) are random (not necessarily independent) draws from p(z). The estimate converges to the right answer in the limit of a large number of samples, L.
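
As a minimal illustration of this estimator (not from the original slides; target and f chosen arbitrarily), here is the Monte Carlo estimate of E[z²] under a standard normal in R:

```r
set.seed(1)
f <- function(z) z^2     # function whose expectation we want
L <- 10000
z <- rnorm(L)            # L independent draws from p(z) = N(0, 1)
mean(f(z))               # Monte Carlo estimate of E[f]; true value is 1
```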

Motivation & Background: Problems: How do we obtain independent samples from p(z)? The expectation may be dominated by regions of small probability, so large sample sizes may be required to achieve sufficient accuracy. Note that if the z^(l) really are independent draws from p(z), a small number of samples suffices to estimate the expectation; in practice, obtaining such independent samples is the hard part.

How to do sampling? Two families of methods: basic sampling algorithms, which are restricted mainly to one- or two-dimensional problems, and Markov chain Monte Carlo, a very general and powerful framework.

Basic sampling: special cases. For a model defined by a directed graph:
Ancestral sampling: sample each node in order from p(z_i | pa_i), parents before children; this gives an easy sample from the joint distribution (see the sketch below).
Logic sampling (with observed nodes): compare the sampled value for z_i with the observed value at node i. If they do NOT agree, discard the whole sample and start again with the first node.
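
A minimal sketch of ancestral sampling in R, for a hypothetical two-node graph z1 → z2 (the graph and its distributions are invented for illustration):

```r
set.seed(1)
n <- 10000
# Hypothetical directed graph z1 -> z2: sample parents before children
z1 <- rbinom(n, 1, 0.3)                  # root node: p(z1) = Bernoulli(0.3)
z2 <- rnorm(n, mean = 2 * z1, sd = 1)    # child node: p(z2 | z1)
# Each pair (z1, z2) is an exact draw from the joint p(z1) * p(z2 | z1)
```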

Random sampling: Computers can generate only pseudorandom numbers, which may suffer from: correlation of successive values; lack of uniformity of the distribution; poor dimensional distribution of the output sequence; distances between occurrences of particular values distributed differently than in a truly random sequence.

Random sampling from the uniform distribution. Assumption: a good pseudorandom generator for uniformly distributed numbers is available. Alternative: http://www.random.org provides "true" random numbers, with the randomness coming from atmospheric noise.

Random sampling from a standard non-uniform distribution. Goal: sample from a non-uniform distribution p(y) that is a standard distribution, i.e. given in analytical form. Suppose we have uniformly distributed random numbers from (0,1). Solution: transform the random numbers z over (0,1) using the function that is the inverse of the indefinite integral (the CDF) of the desired distribution.

Random sampling from a standard non-uniform distribution.
Step 1: calculate the cumulative distribution function h(y) = ∫_{-∞}^{y} p(ŷ) dŷ.
Step 2: transform samples z ~ U(0,1) by y = h^{-1}(z).
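
A minimal sketch of this inverse-CDF transform in R, using the exponential distribution (chosen for illustration), where h(y) = 1 − exp(−λy) and hence h^{-1}(z) = −log(1 − z)/λ:

```r
set.seed(1)
lambda <- 2
z <- runif(10000)              # uniform samples on (0, 1)
y <- -log(1 - z) / lambda      # y = h^{-1}(z) for Exponential(lambda)
mean(y)                        # approx 1 / lambda = 0.5
```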

Rejection sampling. Suppose: direct sampling from p(z) is difficult, but p(z) can be evaluated for any given value of z up to some normalization constant: p(z) = p̃(z)/Z_p, where Z_p is unknown and p̃(z) can be evaluated. Approach: define a simple proposal distribution q(z) and a constant k such that kq(z) ≥ p̃(z) for all z.

Rejection sampling: simple visual example. The envelope kq(z) must lie everywhere above the unnormalized distribution p̃(z), and the constant k should be as small as possible: the fraction of rejected points depends on the ratio of the area under p̃(z) to the area under the curve kq(z).

Rejection sampling: the rejection sampler.
Generate two random numbers: a number z_0 from the proposal distribution q(z), and a number u_0 from the uniform distribution over [0, kq(z_0)].
If u_0 > p̃(z_0), reject the pair.
The remaining pairs (z_0, u_0) have uniform distribution under the curve p̃(z), so the accepted values z_0 are distributed according to p(z).
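
A minimal sketch in R (not from the slides): sampling the standard normal via its unnormalized density p̃(z) = exp(−z²/2), with a Cauchy proposal; k = 2π·exp(−1/2) is the smallest constant with kq(z) ≥ p̃(z):

```r
set.seed(1)
p_tilde <- function(z) exp(-z^2 / 2)   # unnormalized N(0,1); Z_p = sqrt(2*pi)
q <- function(z) dcauchy(z)            # proposal density
k <- 2 * pi * exp(-0.5)                # smallest k with k*q(z) >= p_tilde(z)

n <- 10000
z0 <- rcauchy(n)                       # draws from q(z)
u0 <- runif(n, 0, k * q(z0))           # uniform over [0, k*q(z0)]
z  <- z0[u0 <= p_tilde(z0)]            # keep only pairs under p_tilde
c(mean(z), var(z))                     # approx 0 and 1
length(z) / n                          # acceptance rate = Z_p / k, approx 0.66
```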

Adaptive rejection sampling. Suppose it is difficult to determine a suitable analytic form for the proposal distribution q(z). Approach: construct the envelope function "on the fly", based on observed values of the distribution p(z). If p(z) is log concave (ln p(z) has non-increasing derivatives), its derivatives can be used to construct the envelope.

Adaptive rejection sampling.
Step 1: at an initial set of grid points z_1, ..., z_M, evaluate the function ln p(z_i) and its gradient, and calculate the tangent lines at each z_i, i = 1, ..., M; their upper envelope bounds ln p(z) from above.
Step 2: sample from the envelope distribution; if the sample is accepted, use it, otherwise add it to the grid and refine the envelope.
Because the envelope is piecewise linear in ln p(z), the envelope distribution is a piecewise exponential distribution, each segment having its own slope and offset.

The problem with rejection sampling: one must find a proposal distribution q(z) that is close to the required distribution to minimize the rejection rate. This becomes exponentially harder as the dimension grows (curse of dimensionality), so rejection sampling is restricted mainly to univariate distributions. However, it remains useful as a potential subroutine within more general methods.

Importance sampling: a framework for approximating expectations E_p[f(z)] directly with respect to p(z); it does NOT provide samples from p(z) itself. Suppose (again): direct sampling from p(z) is difficult, but p̃(z) can be evaluated for any given value of z up to some normalization constant Z.

Importance sampling: as for rejection sampling, apply a proposal distribution q(z) from which it is easy to draw samples, and write
E[f] = ∫ f(z) p(z) dz = ∫ f(z) [p(z)/q(z)] q(z) dz ≈ (1/L) Σ_{l=1}^{L} [p(z^(l))/q(z^(l))] f(z^(l)),
where the z^(l) are drawn from q(z).

Importance sampling: expectation formula for unnormalized distributions. With unnormalized importance weights r_l = p̃(z^(l))/q(z^(l)), the expectation becomes E[f] ≈ Σ_l w_l f(z^(l)), where w_l = r_l / Σ_m r_m.
Key points: The importance weights correct the bias introduced by sampling from the proposal distribution. Performance depends on how well q(z) approximates p(z) (similar to rejection sampling). Choose sample points in the regions of input space where f(z)p(z) is large, or at least where p(z) is large. If p(z) > 0 in some region, then q(z) > 0 is necessary there.
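
A minimal sketch of self-normalized importance sampling in R (target and proposal chosen for illustration): estimating E[z²] = 1 under a standard normal known only up to normalization, using a Cauchy proposal:

```r
set.seed(1)
p_tilde <- function(z) exp(-z^2 / 2)    # unnormalized target
L <- 10000
z <- rcauchy(L)                         # draws from the proposal q(z)
r <- p_tilde(z) / dcauchy(z)            # unnormalized weights r_l
w <- r / sum(r)                         # normalized weights w_l
sum(w * z^2)                            # estimate of E[z^2], approx 1
```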

Importance sampling: attention. Suppose none of the samples falls in the regions where f(z)p(z) is large. In that case, the apparent variances of r_l and r_l f(z^(l)) may be small even though the estimate of the expectation is severely wrong. Hence a major drawback of the importance sampling method is the potential to produce results that are arbitrarily in error, with no diagnostic indication. q(z) should NOT be small where p(z) may be significant!

Markov chain Monte Carlo (MCMC) sampling: MCMC is a general framework for sampling from a large class of distributions, and it scales well with the dimensionality of the sample space. Goal: generate samples from a distribution p(z). Idea: build a machine that uses the current sample to decide which sample to produce next, in such a way that the overall distribution of the samples will be p(z).

Markov chain Monte Carlo (MCMC) sampling: Approach: maintain a record of the current state z^(τ). Generate a candidate sample z* from a proposal distribution q(z | z^(τ)) that depends on the current state and is sufficiently simple to draw samples from directly. Accept or reject the candidate sample z* according to an appropriate criterion. The samples z^(1), z^(2), z^(3), ... form a Markov chain.

MCMC - Metropolis algorithm. Suppose: p̃(z) can be evaluated for any given value of z, with p(z) = p̃(z)/Z known only up to the normalization constant Z.
Step 1: choose a symmetric proposal distribution, q(z_A | z_B) = q(z_B | z_A).
Step 2: the candidate sample z* is accepted with probability A(z*, z^(τ)) = min(1, p̃(z*) / p̃(z^(τ))).

MCMC - Metropolis algorithm (cont.):
Step 2.1: choose a random number u with uniform distribution in (0,1).
Step 2.2: acceptance test: accept if u < p̃(z*) / p̃(z^(τ)).
Step 3: if accepted, update the state, z^(τ+1) = z*; otherwise keep the old state, z^(τ+1) = z^(τ).

Metropolis algorithm, notes: Rejection of a point leads to the previous sample being included again (different from rejection sampling). If q(z_A | z_B) > 0 for all values z_A, z_B, then the distribution of z^(τ) tends to p(z) as τ → ∞. The samples z^(1), z^(2), z^(3), ... are not independent samples from p(z) - there is serial correlation. Instead, retain only every Mth sample.

Examples: Metropolis algorithm. Implementation in R for an elliptical (correlated Gaussian) distribution: at each step, draw u ~ U(0,1); if u < p̃(z*)/p̃(z^(τ)), update the state, z^(τ+1) = z*, otherwise keep the old state, z^(τ+1) = z^(τ).
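
The original R code is not preserved in the transcript; the following is a minimal sketch under the stated setup, with the elliptical target taken to be a 2-D Gaussian with correlation 0.8 (an assumption made here for illustration):

```r
set.seed(1)
rho <- 0.8
# Unnormalized elliptical target: 2-D Gaussian with correlation rho
p_tilde <- function(z)
  exp(-0.5 * (z[1]^2 - 2 * rho * z[1] * z[2] + z[2]^2) / (1 - rho^2))

metropolis <- function(n, step, z0 = runif(2, -2, 2)) {
  samples <- matrix(NA_real_, nrow = n, ncol = 2)
  z <- z0
  for (tau in 1:n) {
    z_star <- z + rnorm(2, 0, step)                 # symmetric random-walk proposal
    if (runif(1) < p_tilde(z_star) / p_tilde(z))    # acceptance test
      z <- z_star                                   # update state
    samples[tau, ] <- z                             # on rejection the old state repeats
  }
  samples
}

out <- metropolis(n = 15000, step = 0.5)
plot(out, pch = ".", asp = 1)   # samples trace out the elliptical target
```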

Examples: Metropolis algorithm, implementation in R. (Figures: sample paths for initialization in [-2,2] with step sizes 0.3, 0.5, and 1, each shown after n = 1,500 and n = 15,000 iterations.)

Validation of MCMC. Properties of Markov chains: a chain z^(1), z^(2), ..., z^(m), z^(m+1), ... is specified by the distribution of the initial state together with the transition probabilities T_m(z^(m), z^(m+1)) = p(z^(m+1) | z^(m)). The chain is called homogeneous if T_m is the same for all m.

Validation of MCMC. Properties of Markov chains (cont.): a distribution p*(z) is invariant (stationary) for a homogeneous chain if each step leaves it unchanged: p*(z) = Σ_{z'} T(z', z) p*(z'). A sufficient condition is detailed balance: p*(z) T(z, z') = p*(z') T(z', z) for all z, z'. A chain whose transitions satisfy detailed balance is called reversible.
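
The one-line argument that detailed balance implies invariance (standard, though not spelled out on the slide): summing detailed balance over z',

```latex
\sum_{z'} p^*(z')\,T(z',z)
  = \sum_{z'} p^*(z)\,T(z,z')   % by detailed balance
  = p^*(z)\sum_{z'} T(z,z')
  = p^*(z)                      % transition probabilities sum to one
```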

Validation of MCMC: ergodicity. Goal: a Markov chain for which the desired distribution p*(z) is invariant and which converges to it, p*(z) = lim_{m→∞} p(z^(m)) for any initial distribution p(z^(0)). This property is called ergodicity, and an ergodic Markov chain has only one equilibrium distribution.

Properties and validation of MCMC. Approach: construct appropriate transition probabilities T(z', z) from a set of simpler base transitions B_1, ..., B_K, either in mixture form, T(z', z) = Σ_k α_k B_k(z', z) with mixing coefficients α_k ≥ 0, Σ_k α_k = 1, or by successive application of the base transitions.

Metropolis-Hastings algorithm: a generalization of the Metropolis algorithm. No symmetric proposal distribution q is required: the candidate z* is accepted with probability A(z*, z^(τ)) = min(1, [p̃(z*) q(z^(τ) | z*)] / [p̃(z^(τ)) q(z* | z^(τ))]). The choice of proposal distribution is critical. If q is symmetric, this reduces to the standard Metropolis algorithm.
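
A minimal sketch in R with a deliberately asymmetric proposal, so the q-correction in the acceptance ratio matters (target and proposal chosen here for illustration): a Gamma(3, 1) target sampled with a log-normal random-walk proposal:

```r
set.seed(1)
p_tilde <- function(z) z^2 * exp(-z)     # unnormalized Gamma(shape = 3, rate = 1)

n <- 10000
z <- numeric(n); z[1] <- 1
for (tau in 1:(n - 1)) {
  z_star <- rlnorm(1, log(z[tau]), 0.5)  # asymmetric proposal q(z* | z)
  a <- (p_tilde(z_star) / p_tilde(z[tau])) *
       (dlnorm(z[tau], log(z_star), 0.5) /   # q(z | z*)
        dlnorm(z_star, log(z[tau]), 0.5))    # q(z* | z)
  z[tau + 1] <- if (runif(1) < a) z_star else z[tau]
}
mean(z)   # approx shape / rate = 3
```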

Metropolis-Hastings algorithm: a common proposal is a Gaussian centered on the current state. Small variance: high acceptance rate, but a slow random walk and highly dependent samples. Large variance: high rejection rate.

Gibbs sampling: a special case of the Metropolis-Hastings algorithm in which the candidate value is always accepted (acceptance probability 1).
Suppose we wish to sample p(z1, z2, z3).
Step 1: choose initial values z1, z2, z3.
Step 2 (repeated): draw z1' ~ p(z1 | z2, z3), then z2' ~ p(z2 | z1', z3), then z3' ~ p(z3 | z1', z2').
The updates are repeated either by cycling through the variables or by randomly choosing the variable to be updated at each step.
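
A minimal sketch of a Gibbs sampler in R for a bivariate Gaussian with correlation ρ (an illustrative target; each full conditional is univariate normal, z1 | z2 ~ N(ρ z2, 1 − ρ²)):

```r
set.seed(1)
rho <- 0.8; n <- 10000
z <- matrix(0, nrow = n, ncol = 2)
for (t in 2:n) {
  z[t, 1] <- rnorm(1, rho * z[t - 1, 2], sqrt(1 - rho^2))  # z1 | z2
  z[t, 2] <- rnorm(1, rho * z[t, 1],     sqrt(1 - rho^2))  # z2 | z1
}
cor(z[, 1], z[, 2])   # approx rho
```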

Gibbs sampling: why it works. At each step the variables z_\i other than z_i are fixed, so the marginal p(z_\i) is invariant (unchanged), and the univariate conditional distribution p(z_i | z_\i) is invariant by definition, because that is exactly the distribution being sampled from. Hence the joint distribution p(z) is invariant under each Gibbs step.

Gibbs sampling: a sufficient condition for ergodicity is that none of the conditional distributions be anywhere zero, i.e. that any point in z space can be reached from any other point in a finite number of steps.

Gibbs sampling: to obtain m (approximately) independent samples, run the MCMC sampler through a "burn-in" period to remove dependence on the initial values, then sample at set time points (e.g. every Mth sample), as sketched below. The Gibbs sequence converges to a stationary (equilibrium) distribution that is independent of the starting values, and by construction this stationary distribution is the target distribution we are trying to simulate.
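
In R, taking the chain z from the Gibbs sketch above (variable names are from that sketch), burn-in and thinning amount to one indexing step:

```r
burn <- 1000; M <- 10
z_kept <- z[seq(burn + 1, nrow(z), by = M), ]  # drop burn-in, keep every Mth sample
```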

Gibbs sampling: practicability depends on the feasibility of drawing samples from the conditional distributions p(z_i | z_\i). Many directed graphical models lead to conditional distributions for Gibbs sampling that are log concave, so adaptive rejection sampling methods can be used to sample from them.