Approximate Inference: Particle-Based Methods


Eran Segal, Weizmann Institute

Inference Complexity Summary
The following problems are NP-hard: exact inference, approximate inference with relative error, and approximate inference with absolute error < 0.5 given evidence. Hopeless? No: we will see many network structures that have provably efficient algorithms, and we will see cases where approximate inference works efficiently with high accuracy.

Approximate Inference
Particle-based methods: create instances that represent part of the probability mass, either by random sampling or by deterministically searching for high-probability assignments.
Global methods: approximate the distribution in its entirety, either by using exact inference on a simpler (but close) network, or by performing inference in the original network while approximating some steps of the process (e.g., ignoring certain computations or approximating some intermediate results).

Particle-Based Methods
Particle generation process: generate particles deterministically, or generate particles by sampling.
Particle definition: full particles are complete assignments to all variables; distributional particles are assignments to only part of the variables.
General framework: generate samples x[1], ..., x[M] and estimate the function by
$\hat{E}_P[f] = \frac{1}{M} \sum_{m=1}^{M} f(x[m])$

Particle-Based Methods Overview
- Full particle methods
  - Sampling methods: forward sampling, importance sampling, Markov chain Monte Carlo
  - Deterministic particle generation
- Distributional particles

Forward Sampling
Generate random samples from P(X), using the Bayesian network itself to generate each sample, and estimate the function by
$\hat{E}_P[f] = \frac{1}{M} \sum_{m=1}^{M} f(x[m])$

Forward Sampling: Example
The "student" network with edges D → G, I → G, I → S, G → L, and CPDs (as on the slides; the two entries lost in transcription are filled in so each row sums to 1):

P(D):  d0 = 0.4,  d1 = 0.6
P(I):  i0 = 0.7,  i1 = 0.3
P(S | I):
  i0:  s0 = 0.95,  s1 = 0.05
  i1:  s0 = 0.2,   s1 = 0.8
P(G | D, I):
  (d0, i0):  g0 = 0.3,   g1 = 0.4,   g2 = 0.3
  (d1, i0):  g0 = 0.05,  g1 = 0.25,  g2 = 0.7
  (d0, i1):  g0 = 0.9,   g1 = 0.08,  g2 = 0.02
  (d1, i1):  g0 = 0.5,   g1 = 0.3,   g2 = 0.2
P(L | G):
  g0:  l0 = 0.1,   l1 = 0.9
  g1:  l0 = 0.4,   l1 = 0.6
  g2:  l0 = 0.99,  l1 = 0.01

Each sample is generated variable by variable in topological order. In the slides' running example: sample D → d1; sample I → i1; sample G from P(G | d1, i1) → g2; sample S from P(S | i1) → s1; sample L from P(L | g2) → l0. The completed sample is (d1, i1, g2, s1, l0).

Forward Sampling
Let X1, ..., Xn be a topological ordering of the variables.
For i = 1, ..., n: sample xi from P(Xi | pai). (Note: since Pai ⊆ {X1, ..., Xi-1}, values have already been assigned to the parents.)
Return x1, ..., xn.
Estimate the function by $\hat{E}_P[f] = \frac{1}{M} \sum_{m=1}^{M} f(x[m])$, and estimate P(y) by $\hat{P}(y) = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{y[m] = y\}$.
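The following Python sketch (not from the slides) implements forward sampling on the student network above; the dictionary encoding of the CPDs and the helper names are illustrative choices.

```python
import random

# CPDs of the student network from the example above
# (topological order D, I, G, S, L).
P_D = {"d0": 0.4, "d1": 0.6}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {("d0", "i0"): {"g0": 0.3,  "g1": 0.4,  "g2": 0.3},
       ("d1", "i0"): {"g0": 0.05, "g1": 0.25, "g2": 0.7},
       ("d0", "i1"): {"g0": 0.9,  "g1": 0.08, "g2": 0.02},
       ("d1", "i1"): {"g0": 0.5,  "g1": 0.3,  "g2": 0.2}}
P_S = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}
P_L = {"g0": {"l0": 0.1,  "l1": 0.9},
       "g1": {"l0": 0.4,  "l1": 0.6},
       "g2": {"l0": 0.99, "l1": 0.01}}

def sample_categorical(dist):
    """Draw one value from a {value: probability} table."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off

def forward_sample():
    """One pass over the variables in topological order."""
    d = sample_categorical(P_D)
    i = sample_categorical(P_I)
    g = sample_categorical(P_G[(d, i)])
    s = sample_categorical(P_S[i])
    l = sample_categorical(P_L[g])
    return {"D": d, "I": i, "G": g, "S": s, "L": l}

# Estimate P(L = l1) from M samples.
M = 10000
print(sum(forward_sample()["L"] == "l1" for _ in range(M)) / M)
```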

Forward Sampling: Cost
Sampling cost:
- Per-variable cost O(log |Val(Xi)|): sample uniformly in [0, 1] and find the corresponding value among the |Val(Xi)| values Xi can take.
- Per-sample cost: O(n log d), where d = max_i |Val(Xi)|.
- Total cost: O(M n log d).
Number of samples needed: to achieve relative error < ε with probability 1 − δ, a Chernoff bound gives $M \ge 3\,\frac{\ln(2/\delta)}{P(y)\,\epsilon^2}$. Note that the number of samples grows inversely with P(y): for small P(y) we need many samples, otherwise we are likely to report P(y) = 0.
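A small helper makes the bound concrete; the constant 3 corresponds to the Chernoff-bound form quoted above, and the numbers in the example call are made up.

```python
import math

def samples_needed(p_y, eps, delta):
    """M >= 3 ln(2/delta) / (P(y) eps^2): samples required for
    relative error < eps with probability 1 - delta."""
    return math.ceil(3 * math.log(2 / delta) / (p_y * eps ** 2))

# A rare event P(y) = 0.01, 10% relative error, 95% confidence:
print(samples_needed(0.01, 0.1, 0.05))   # roughly 110,000 samples
```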

Rejection Sampling
In general we need to compute P(Y | e). We can do so with rejection sampling:
- Generate samples as in forward sampling.
- Reject samples in which E ≠ e.
- Estimate the function from the accepted samples.
Problem: if the evidence is unlikely (e.g., P(e) = 0.001), we generate many rejected samples.
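A minimal sketch of rejection sampling, reusing forward_sample and the CPD tables from the forward-sampling code above; the evidence in the example is the one used on the likelihood-weighting slides below.

```python
def rejection_sample(evidence, M):
    """Forward-sample and keep only samples consistent with the
    evidence; the acceptance rate equals P(e), so rare evidence
    makes this very wasteful."""
    accepted = []
    while len(accepted) < M:
        x = forward_sample()
        if all(x[var] == val for var, val in evidence.items()):
            accepted.append(x)
    return accepted

# Estimate P(L = l1 | S = s1, G = g2):
samples = rejection_sample({"S": "s1", "G": "g2"}, 1000)
print(sum(x["L"] == "l1" for x in samples) / len(samples))
```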

Particle-Based Methods Overview (recap). Next: importance sampling (likelihood weighting).

Likelihood Weighting
Can we ensure that all of our samples satisfy E = e? Solution: when sampling a variable X ∈ E, set X = e. Problem: we are trying to sample from the posterior P(X | e), but this process still samples the non-evidence variables from P(X). Solution: weight each sample by the joint probability of setting each evidence variable to its observed value. In effect, we are sampling from P(X, e), which we can normalize to obtain P(X | e) for a query of interest.

Likelihood Weighting
Let X1, ..., Xn be a topological ordering of the variables. Initialize w = 1.
For i = 1, ..., n:
- If Xi ∉ E: sample xi from P(Xi | pai).
- If Xi ∈ E: set xi = E[Xi] and set w = w · P(E[Xi] | pai).
Return w and x1, ..., xn.
Estimate P(y | e) by $\hat{P}(y \mid e) = \frac{\sum_{m=1}^{M} w[m]\,\mathbf{1}\{y[m] = y\}}{\sum_{m=1}^{M} w[m]}$.
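A hedged sketch of this procedure on the student network, reusing the CPD tables and sample_categorical from the forward-sampling code above; the hard-coded topological order is specific to this toy network.

```python
def lw_sample(evidence):
    """One likelihood-weighted sample: evidence variables are clamped
    and the weight accumulates P(e_i | pa_i) for each of them."""
    x, w = {}, 1.0
    steps = [("D", lambda x: P_D),
             ("I", lambda x: P_I),
             ("G", lambda x: P_G[(x["D"], x["I"])]),
             ("S", lambda x: P_S[x["I"]]),
             ("L", lambda x: P_L[x["G"]])]
    for var, cpd in steps:               # topological order
        dist = cpd(x)
        if var in evidence:
            x[var] = evidence[var]
            w *= dist[x[var]]            # weight instead of sampling
        else:
            x[var] = sample_categorical(dist)
    return x, w

# Estimate P(L = l1 | S = s1, G = g2) as a ratio of weighted counts.
pairs = [lw_sample({"S": "s1", "G": "g2"}) for _ in range(10000)]
print(sum(w for x, w in pairs if x["L"] == "l1") / sum(w for _, w in pairs))
```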

Likelihood Weighting: Example
Evidence E = {S = s1, G = g2} in the same student network. One weighted sample is built as follows: sample D → d1 (w = 1); sample I → i1 (w = 1); G ∈ E, so set G = g2 and update w = w · P(g2 | d1, i1) = 0.2; S ∈ E, so set S = s1 and update w = w · P(s1 | i1) = 0.2 · 0.8 = 0.16; sample L from P(L | g2) → l0. The resulting particle is (d1, i1, g2, s1, l0) with weight 0.16.

Importance Sampling
Idea: to estimate a function relative to P, rather than sampling from the distribution P, sample from another distribution Q. P is called the target distribution; Q is called the proposal (or sampling) distribution. Requirement on Q: P(x) > 0 ⇒ Q(x) > 0, i.e., Q does not "ignore" any non-zero-probability events of P. In practice, performance depends on the similarity between Q and P.

Unnormalized Importance Sampling
Since $E_P[f] = \sum_x P(x) f(x) = \sum_x Q(x)\, f(x) \frac{P(x)}{Q(x)} = E_Q\!\left[f(X)\frac{P(X)}{Q(X)}\right]$, we can estimate $E_P[f]$ by generating samples x[1], ..., x[M] from Q and computing $\hat{E}[f] = \frac{1}{M} \sum_{m=1}^{M} f(x[m]) \frac{P(x[m])}{Q(x[m])}$. One can show that the estimator variance decreases with more samples, and that Q = P is the lowest-variance proposal.
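A toy sketch of unnormalized importance sampling, where the target P is fully known; the three-point distribution and the function f(x) = x² are invented for illustration.

```python
import random

# Target P is fully known; proposal Q is uniform over the same support.
P = {0: 0.7, 1: 0.2, 2: 0.1}
Q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
f = lambda x: x * x                      # function whose mean we want

M = 100000
xs = [random.choice(list(Q)) for _ in range(M)]
est = sum(f(x) * P[x] / Q[x] for x in xs) / M
print(est)   # converges to E_P[f] = 0.7*0 + 0.2*1 + 0.1*4 = 0.6
```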

Normalized Importance Sampling
Unnormalized importance sampling assumes P is known. Usually we know P only up to a constant: P = P′/α, where α = Σ_x P′(x). Example: the posterior distribution P(X | e) = P(X, e)/α, where α = P(e). We can estimate α by $\hat{\alpha} = \frac{1}{M} \sum_{m=1}^{M} \frac{P'(x[m])}{Q(x[m])}$, since $E_Q\!\left[\frac{P'(X)}{Q(X)}\right] = \sum_x Q(x) \frac{P'(x)}{Q(x)} = \alpha$.

Normalized Importance Sampling
Given M samples from Q, normalized importance sampling estimates the function f by
$\hat{E}[f] = \frac{\sum_{m=1}^{M} f(x[m])\, w(x[m])}{\sum_{m=1}^{M} w(x[m])}, \qquad w(x) = \frac{P'(x)}{Q(x)}$
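Continuing the toy example above, now assuming only unnormalized values P′(x) are available (here P′ = 5·P, with the constant unknown to the estimator):

```python
import random

# Unnormalized target values P'(x) = alpha * P(x), alpha = 5 unknown.
P_unnorm = {0: 3.5, 1: 1.0, 2: 0.5}

xs = [random.choice([0, 1, 2]) for _ in range(100000)]
ws = [P_unnorm[x] / (1 / 3) for x in xs]   # weights P'(x)/Q(x)
est = sum(w * (x * x) for w, x in zip(ws, xs)) / sum(ws)
print(est)   # the normalizer cancels; again converges to 0.6
```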

Importance Sampling for Bayes Nets
Define the mutilated network G_{E=e} as follows:
- Nodes X ∈ E have no parents in G_{E=e}.
- Nodes X ∈ E have a CPD that assigns probability 1 to X = E[X] and 0 otherwise.
- The parents and CPDs of all other variables are unchanged.
Example: for E = {S = s1, G = g2} in the student network, G and S become parentless with deterministic CPDs (P(g2) = 1, P(s1) = 1), while D, I, and L keep their original CPDs.
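A sketch of sampling from this mutilated network, reusing the student-network CPDs defined earlier; as the next slide claims, this is exactly the proposal distribution that likelihood weighting samples from.

```python
# Mutilated network G_{S=s1, G=g2}: evidence nodes become parentless
# and deterministic; all other CPDs are unchanged.
MUT_P_G = {"g2": 1.0}                    # P(G = g2) = 1
MUT_P_S = {"s1": 1.0}                    # P(S = s1) = 1

def mutilated_sample():
    d = sample_categorical(P_D)
    i = sample_categorical(P_I)
    g = sample_categorical(MUT_P_G)      # always g2
    s = sample_categorical(MUT_P_S)      # always s1
    l = sample_categorical(P_L[g])
    return {"D": d, "I": i, "G": g, "S": s, "L": l}
```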

Likelihood Weighting as IS
Target distribution: P′(X, e). Proposal distribution: Q(X, e) = P_{G_{E=e}}(X, e), the distribution defined by the mutilated network.
Claim: likelihood weighting is precisely normalized importance sampling with these distributions.
Proof sketch: both the LW estimate and the IS estimate have the form $\frac{\sum_m w[m]\,\mathbf{1}\{y[m]=y\}}{\sum_m w[m]}$, so it suffices to show that the LW weight equals the IS weight P′(x[m])/Q(x[m]). Indeed,
$\frac{P'(x)}{Q(x)} = \frac{\prod_i P(x_i \mid pa_i) \prod_j P(e_j \mid pa_j)}{\prod_i P(x_i \mid pa_i) \cdot 1} = \prod_j P(e_j \mid pa_j) = w$
where i ranges over the non-evidence variables (whose CPDs are unchanged in the mutilated network) and j over the evidence variables (whose mutilated CPDs contribute 1).

Particle-Based Methods Overview (recap). Next: Markov chain Monte Carlo.

Markov Chain Monte Carlo
General idea: define a sampling process that is guaranteed to converge to taking samples from the posterior distribution of interest; generate samples from that process; estimate f(X) from the samples.

Markov Chains
A Markov chain consists of a state space Val(X) and a transition probability T(x → x′) of going from state x to state x′. The distribution over subsequent states is defined by
$P^{(t+1)}(x') = \sum_{x} P^{(t)}(x)\, T(x \to x')$
[Figure: a simple three-state Markov chain over x1, x2, x3, with edge probabilities 0.25, 0.7, 0.5, 0.5, 0.3, 0.75.]

Stationary Distribution
A distribution π(X) is a stationary distribution for a Markov chain T if it satisfies
$\pi(x') = \sum_{x} \pi(x)\, T(x \to x')$
[Figure: the same simple three-state Markov chain.]
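A quick numerical check via power iteration; the transition matrix below is one plausible reading of the slide's three-state diagram (rows sum to one) and is an assumption, not taken verbatim from the figure.

```python
# Power iteration pi <- pi*T converges to the stationary distribution
# of a regular chain.
T = [[0.25, 0.0, 0.75],   # from x1
     [0.0,  0.7, 0.3],    # from x2
     [0.5,  0.5, 0.0]]    # from x3

pi = [1 / 3, 1 / 3, 1 / 3]
for _ in range(1000):
    pi = [sum(pi[i] * T[i][j] for i in range(3)) for j in range(3)]
print(pi)   # satisfies pi(x') = sum_x pi(x) T(x -> x')
```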

Markov Chains & Stationary Dist.
A Markov chain is regular if there is a k such that, for every x, x′ ∈ Val(X), the probability of getting from x to x′ in exactly k steps is greater than zero.
Theorem: a finite-state Markov chain T has a unique stationary distribution if and only if it is regular.
Goal: define a Markov chain whose stationary distribution is P(X | e).

Gibbs Sampling
States: legal assignments to the variables, i.e., assignments consistent with the evidence.
Transition probability: T = T1 ∘ ... ∘ Tk, where each local transition resamples one variable given all the others: Ti((ui, xi) → (ui, x′i)) = P(x′i | ui), with ui the current assignment to the variables other than Xi.
Claim: P(X | e) is a stationary distribution of this chain.

Gibbs Sampling
The Gibbs-sampling Markov chain is regular if:
- Bayesian networks: all CPDs are strictly positive.
- Markov networks: all clique potentials are strictly positive.
Gibbs sampling procedure (one sample):
- Set x[m] = x[m−1].
- For each variable Xi ∈ X − E: set ui = x[m](X − Xi), sample from P(Xi | ui), and save the sampled value as x[m](Xi).
- Return x[m].
Note: P(Xi | ui) is easily computed from the Markov blanket of Xi, as shown on the next slide.

Gibbs Sampling
How do we evaluate P(Xi | x1, ..., xi−1, xi+1, ..., xn)? Let Y1, ..., Yk be the children of Xi. By the definition of the Markov blanket MBi, the parents of each Yj are contained in MBi ∪ {Xi}. It is easy to show that
$P(x_i \mid x_{-i}) = \frac{P(x_i \mid pa_i) \prod_{j=1}^{k} P(y_j \mid pa_{y_j})}{\sum_{x_i'} P(x_i' \mid pa_i) \prod_{j=1}^{k} P(y_j \mid pa_{y_j}[x_i'])}$
so the conditional depends only on the Markov blanket of Xi.
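A sketch of Gibbs sampling on the student network, reusing the CPD tables from the forward-sampling code above; for brevity, the full conditional is computed by renormalizing the joint over Val(Xi), which for a Bayes net coincides with the Markov-blanket formula just given.

```python
import random

VALUES = {"D": ["d0", "d1"], "I": ["i0", "i1"],
          "G": ["g0", "g1", "g2"], "S": ["s0", "s1"], "L": ["l0", "l1"]}

def joint(x):
    """Product of all CPD entries at the full assignment x."""
    return (P_D[x["D"]] * P_I[x["I"]] * P_G[(x["D"], x["I"])][x["G"]]
            * P_S[x["I"]][x["S"]] * P_L[x["G"]][x["L"]])

def gibbs(evidence, n_steps, burn_in):
    """Systematic-scan Gibbs sampling with clamped evidence."""
    x = {v: random.choice(vals) for v, vals in VALUES.items()}
    x.update(evidence)                       # start from a legal state
    samples = []
    for t in range(n_steps):
        for var in VALUES:
            if var in evidence:
                continue                     # evidence stays clamped
            weights = []
            for val in VALUES[var]:          # unnormalized conditional
                x[var] = val
                weights.append(joint(x))
            x[var] = random.choices(VALUES[var], weights)[0]
        if t >= burn_in:                     # discard burn-in steps
            samples.append(dict(x))
    return samples

samples = gibbs({"S": "s1", "G": "g2"}, 5000, 500)
print(sum(s["L"] == "l1" for s in samples) / len(samples))
```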

Gibbs Sampling in Practice
We need to wait until the burn-in time has ended, i.e., the number of steps until the sampling distribution is close to the stationary distribution. It is hard to provide general bounds for this "mixing time". Once the burn-in time has ended, all subsequent samples are from the stationary distribution, and we can evaluate burn-in empirically by comparing two chains. Note that even after burn-in, successive samples are correlated. Since no general theoretical guarantees exist, the application of Markov chains is somewhat of an art.

Sampling Strategy
How do we collect the samples?
Strategy I: run the chain M times, each run for N steps, starting each run from a different state, and return the last state of each run (M chains, one sample each, each with its own "burn-in").
Strategy II: run one chain for a long time and, after some "burn-in" period, take a sample every fixed number of steps (M samples from one chain).

Comparing Strategies
Strategy I: better chance of "covering" the space of points, especially if the chain is slow to reach the stationary distribution, but "burn-in" steps must be performed for each chain.
Strategy II: "burn-in" is performed only once, but samples may be correlated (although only weakly).
Hybrid strategy: run several chains and take a few samples from each, combining the benefits of both strategies.

Particle-Based Methods Overview (recap). Next: deterministic particle generation.

Deterministic Search Methods
Idea: if the distribution is dominated by a small set of instances, it suffices to consider only those instances when approximating the function. For generated instances x[1], ..., x[M], the estimate is
$\hat{E}[f] = \sum_{m=1}^{M} P(x[m])\, f(x[m])$
Note: we can obtain lower and upper bounds by examining the part of the probability mass covered by the x[m] (the uncovered mass $1 - \sum_m P(x[m])$ bounds the error). The key question is how to enumerate highly probable instantiations; note that even the single most probable instantiation (MPE) is itself NP-hard to find.

Particle-Based Methods Overview (recap). Next: distributional particles.

Distributional Particles
Idea: use partial assignments to a subset of the network variables, combined with a closed-form representation of a distribution over the rest.
- Xp: variables whose assignments are defined by particles.
- Xd: variables over which we maintain a distribution.
Distributional particles are also known as Rao-Blackwellized particles. Estimation proceeds as
$\hat{E}[f] = \frac{1}{M} \sum_{m=1}^{M} E_{P(X_d \mid x_p[m], e)}\!\left[ f(x_p[m], X_d) \right]$
We can use any sampling procedure to sample Xp; we assume that the internal expectation over Xd can be computed efficiently.

Distributional Particles
Distributional particles define a continuum: for |Xp| = |X − E| we have full particles and thus full sampling, and for |Xp| = 0 we are performing full exact inference.
Distributional likelihood weighting: sample over only a subset of the variables.
Distributional Gibbs sampling: sample only a subset of the variables; the transition probability is as before, Ti((ui, xi) → (ui, x′i)) = P(x′i | ui), but the computation may require inference.
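A minimal sketch of distributional likelihood weighting on the student network, assuming evidence S = s1; the split Xp = {D, I}, Xd = {G, L} is an illustrative choice, and the inner expectation over Xd is computed in closed form by summing over G. It reuses the CPD tables and sample_categorical from the forward-sampling code above.

```python
def rb_lw_sample():
    """One Rao-Blackwellized particle: sample Xp = {D, I} (clamping
    evidence S = s1, which contributes the weight) and handle
    Xd = {G, L} exactly."""
    d = sample_categorical(P_D)
    i = sample_categorical(P_I)
    w = P_S[i]["s1"]                     # weight from clamping S = s1
    # Exact inner expectation E[1{L = l1} | d, i], summing over G:
    e_l1 = sum(P_G[(d, i)][g] * P_L[g]["l1"] for g in ("g0", "g1", "g2"))
    return w, e_l1

# Estimate P(L = l1 | S = s1) with lower variance than full particles.
pairs = [rb_lw_sample() for _ in range(10000)]
print(sum(w * e for w, e in pairs) / sum(w for w, _ in pairs))
```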