1
Approximate Inference: Particle-Based Methods
Eran Segal, Weizmann Institute
2
Inference Complexity Summary
Exact inference is NP-hard. So is approximate inference with relative error, and so is approximate inference with absolute error < 0.5 given evidence.
Hopeless? No: we will see many network structures that have provably efficient algorithms, and we will see cases where approximate inference works efficiently with high accuracy.
3
Approximate Inference
Particle-based methods: create instances that represent part of the probability mass
- Random sampling
- Deterministic search for high-probability assignments
Global methods: approximate the distribution in its entirety
- Use exact inference on a simpler (but close) network
- Perform inference in the original network but approximate some steps of the process (e.g., ignore certain computations or approximate some intermediate results)
4
Particle-Based Methods
Particle generation process
- Generate particles deterministically
- Generate particles by sampling
Particle definition
- Full particles: complete assignments to all variables
- Distributional particles: assignment to part of the variables
General framework
- Generate samples x[1],...,x[M]
- Estimate the function by: Ê_P[f] = (1/M) Σ_{m=1..M} f(x[m])
5
Particle-Based Methods Overview
Full particle methods
- Sampling methods: forward sampling, importance sampling, Markov chain Monte Carlo
- Deterministic particle generation
Distributional particles
6
Forward Sampling
Generate random samples from P(X), using the Bayesian network to generate each sample.
Estimate the function by: Ê_P[f] = (1/M) Σ_{m=1..M} f(x[m])
7
Forward Sampling: Example
Student network: D → G ← I, I → S, G → L, with CPDs:
P(D):       d0 = 0.4,  d1 = 0.6
P(I):       i0 = 0.7,  i1 = 0.3
P(S | I):   i0: s0 = 0.95, s1 = 0.05
            i1: s0 = 0.20, s1 = 0.80
P(G | D,I): d0,i0: g0 = 0.30, g1 = 0.40, g2 = 0.30
            d1,i0: g0 = 0.05, g1 = 0.25, g2 = 0.70
            d0,i1: g0 = 0.90, g1 = 0.08, g2 = 0.02
            d1,i1: g0 = 0.50, g1 = 0.30, g2 = 0.20
P(L | G):   g0: l0 = 0.10, l1 = 0.90
            g1: l0 = 0.40, l1 = 0.60
            g2: l0 = 0.99, l1 = 0.01
Sampling one particle in the topological order D, I, G, S, L:
D = d1; I = i1; G = g2 (drawn from P(G | d1, i1)); S = s1 (drawn from P(S | i1)); L = l0 (drawn from P(L | g2)).
The resulting sample is (d1, i1, g2, s1, l0).
13
Forward Sampling
Let X1,...,Xn be a topological ordering of the variables.
For i = 1,...,n: sample xi from P(Xi | pai). (Note: since Pai ⊆ {X1,...,Xi-1}, values have already been assigned to all parents.)
Return (x1,...,xn).
Estimate a function by: Ê_P[f] = (1/M) Σ_{m=1..M} f(x[m])
Estimate P(y) by: P̂(y) = (1/M) Σ_{m=1..M} 1{y[m] = y}
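A minimal Python sketch of this procedure on the student network from the example slides. The CPT values are taken from the slides; the function and dictionary names are this sketch's own:

```python
import random

# CPDs of the student network (values from the example slides).
P_D = {"d0": 0.4, "d1": 0.6}
P_I = {"i0": 0.7, "i1": 0.3}
P_S = {"i0": {"s0": 0.95, "s1": 0.05},
       "i1": {"s0": 0.20, "s1": 0.80}}
P_G = {("d0", "i0"): {"g0": 0.30, "g1": 0.40, "g2": 0.30},
       ("d1", "i0"): {"g0": 0.05, "g1": 0.25, "g2": 0.70},
       ("d0", "i1"): {"g0": 0.90, "g1": 0.08, "g2": 0.02},
       ("d1", "i1"): {"g0": 0.50, "g1": 0.30, "g2": 0.20}}
P_L = {"g0": {"l0": 0.10, "l1": 0.90},
       "g1": {"l0": 0.40, "l1": 0.60},
       "g2": {"l0": 0.99, "l1": 0.01}}

def sample_from(dist):
    """Sample a value from a {value: probability} dict: draw u ~ U[0,1]
    and walk the cumulative distribution."""
    u, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if u <= acc:
            return value
    return value  # guard against floating-point round-off

def forward_sample():
    """Sample every variable in the topological order D, I, G, S, L."""
    d = sample_from(P_D)
    i = sample_from(P_I)
    g = sample_from(P_G[(d, i)])
    s = sample_from(P_S[i])
    l = sample_from(P_L[g])
    return {"D": d, "I": i, "G": g, "S": s, "L": l}

# Estimate P(L = l1) from M samples.
M = 10_000
samples = [forward_sample() for _ in range(M)]
print(sum(x["L"] == "l1" for x in samples) / M)
```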
14
Forward Sampling: Cost
Sampling cost
- Per-variable cost: O(log |Val(Xi)|): sample u uniformly in [0,1], then find the corresponding value among the |Val(Xi)| values Xi can take (binary search over the cumulative distribution)
- Per-sample cost: O(n log d), where d = max_i |Val(Xi)|
- Total cost: O(M n log d)
Number of samples needed
- To get relative error < ε with probability 1−δ, the Chernoff bound requires M ≥ 3 ln(2/δ) / (P(y) ε²)
- Note that the number of samples grows inversely with P(y): for small P(y) we need many samples, otherwise we are likely to report P̂(y) = 0
15
Rejection Sampling
In general we need to compute P(Y | e). We can do so with rejection sampling:
- Generate samples as in forward sampling
- Reject samples in which E ≠ e
- Estimate the function from the accepted samples
Problem: if the evidence is unlikely (e.g., P(e) = 0.001), then most generated samples are rejected and wasted.
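A sketch of rejection sampling for queries conditioned on E = {S = s1, G = g2}, reusing forward_sample from the sketch above (rejection_sample is an illustrative name):

```python
def rejection_sample(evidence, M):
    """Keep only forward samples consistent with the evidence dict."""
    accepted = []
    while len(accepted) < M:
        x = forward_sample()
        if all(x[var] == val for var, val in evidence.items()):
            accepted.append(x)
    return accepted

# Estimate P(L = l1 | S = s1, G = g2); note how many draws are wasted
# when the evidence has low probability.
samples = rejection_sample({"S": "s1", "G": "g2"}, M=1_000)
print(sum(x["L"] == "l1" for x in samples) / len(samples))
```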
16
Particle-Based Methods Overview
Full particle methods
- Sampling methods: forward sampling, importance sampling, Markov chain Monte Carlo
- Deterministic particle generation
Distributional particles
17
Likelihood Weighting
Can we ensure that all of our samples satisfy E = e?
Solution: when sampling a variable X ∈ E, set X = e.
Problem: we are trying to sample from the posterior P(X | e), but this sampling process still draws the remaining variables from P(X).
Solution: weight each sample by the joint probability of setting each evidence variable to its observed value.
In effect, we are sampling from P(X, e), which we can normalize to obtain P(X | e) for a query of interest.
18
Likelihood Weighting
Let X1,...,Xn be a topological ordering of the variables. Set w = 1.
For i = 1,...,n:
- If Xi ∉ E: sample xi from P(Xi | pai)
- If Xi ∈ E: set xi = E[Xi] and set w = w · P(E[Xi] | pai)
Return w and (x1,...,xn).
Estimate P(y | e) by: P̂(y | e) = Σ_m w[m] 1{y[m] = y} / Σ_m w[m]
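A sketch of this procedure on the student network, reusing sample_from and the CPTs from the forward-sampling sketch; lw_sample and the order table are this sketch's own constructions:

```python
def lw_sample(evidence):
    """One likelihood-weighted sample: evidence variables are clamped
    to their observed values, and the weight accumulates
    P(observed value | parents)."""
    x, w = {}, 1.0
    # Visit variables in topological order; each entry maps a variable
    # to its local CPD given the parent values already assigned in x.
    order = [("D", lambda x: P_D),
             ("I", lambda x: P_I),
             ("G", lambda x: P_G[(x["D"], x["I"])]),
             ("S", lambda x: P_S[x["I"]]),
             ("L", lambda x: P_L[x["G"]])]
    for var, cpd in order:
        dist = cpd(x)
        if var in evidence:
            x[var] = evidence[var]
            w *= dist[evidence[var]]
        else:
            x[var] = sample_from(dist)
    return x, w

# Estimate P(I = i1 | S = s1, G = g2).
evidence = {"S": "s1", "G": "g2"}
pairs = [lw_sample(evidence) for _ in range(10_000)]
num = sum(w for x, w in pairs if x["I"] == "i1")
den = sum(w for x, w in pairs)
print(num / den)
```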
19
Likelihood Weighting: Example
Evidence: E = {S = s1, G = g2}. CPDs as in the forward-sampling example.
Building one weighted sample (w starts at 1):
D = d1 sampled (w = 1); I = i1 sampled (w = 1); G set to g2, w = w · P(g2 | d1, i1) = 0.2; S set to s1, w = w · P(s1 | i1) = 0.2 · 0.8 = 0.16; L = l0 sampled from P(L | g2).
The resulting particle is (d1, i1, g2, s1, l0) with weight 0.16.
25
Importance Sampling
Idea: to estimate a function relative to P, rather than sampling from the distribution P, sample from another distribution Q.
- P is called the target distribution
- Q is called the proposal (or sampling) distribution
Requirement on Q: P(x) > 0 ⇒ Q(x) > 0, i.e., Q does not 'ignore' any nonzero-probability events of P.
In practice, performance depends on the similarity between Q and P.
26
Unnormalized Importance Sampling
Since E_P[f(X)] = Σ_x P(x) f(x) = Σ_x Q(x) f(x) P(x)/Q(x) = E_Q[f(X) P(X)/Q(X)],
we can estimate E_P[f] by generating samples from Q and computing:
Ê[f] = (1/M) Σ_{m=1..M} f(x[m]) P(x[m]) / Q(x[m])
One can show that the estimator's variance decreases with more samples, and that Q = P gives the lowest-variance estimator.
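A toy sketch of unnormalized importance sampling over three states, reusing sample_from from the forward-sampling sketch; the particular P, Q, and f below are illustrative choices:

```python
# Toy discrete target P and proposal Q over {0, 1, 2}.
P = {0: 0.1, 1: 0.2, 2: 0.7}   # target distribution
Q = {0: 1/3, 1: 1/3, 2: 1/3}   # proposal (uniform)

def f(x):
    return x * x                # function whose expectation we want

M = 100_000
xs = [sample_from(Q) for _ in range(M)]
# Each sample is reweighted by P(x)/Q(x).
est = sum(f(x) * P[x] / Q[x] for x in xs) / M
print(est)  # approaches E_P[f] = 0.1*0 + 0.2*1 + 0.7*4 = 3.0
```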
27
Normalized Importance Sampling
Unnormalized importance sampling assumes P is known. Usually we know P only up to a constant: P = P'/α, where α = Σ_x P'(x).
Example: the posterior distribution P(X | e) = P(X, e)/α, where α = P(e).
We can estimate α by: α = Σ_x P'(x) = Σ_x Q(x) P'(x)/Q(x) = E_Q[P'(X)/Q(X)] ≈ (1/M) Σ_m w[m], where w[m] = P'(x[m])/Q(x[m]).
Thus E_P[f(X)] = E_Q[f(X) P'(X)/Q(X)] / α.
28
Normalized Importance Sampling
Given M samples from Q, normalized importance sampling estimates the function f by:
Ê[f] = Σ_{m=1..M} f(x[m]) w[m] / Σ_{m=1..M} w[m], where w[m] = P'(x[m]) / Q(x[m])
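Continuing the toy sketch above (Q, f, and sample_from reused), now with the target known only up to a constant: P' = 10·P, with α = 10 hidden from the estimator:

```python
# Unnormalized target P'(x) = 10 * P(x); the constant alpha = 10 is
# unknown to the estimator and cancels in the normalized estimate.
P_unnorm = {0: 1.0, 1: 2.0, 2: 7.0}

xs = [sample_from(Q) for _ in range(100_000)]
ws = [P_unnorm[x] / Q[x] for x in xs]                    # weights w[m]
print(sum(f(x) * w for x, w in zip(xs, ws)) / sum(ws))   # ≈ 3.0
print(sum(ws) / len(ws))                                 # ≈ alpha = 10
```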
29
Importance Sampling for BayesNets
Define the mutilated network GE=e as follows:
- Nodes X ∈ E have no parents in GE=e
- Nodes X ∈ E have a CPD that assigns probability 1 to X = E[X] and 0 to all other values
- The parents and CPDs of all other variables are unchanged
Example (E = {S = s1, G = g2}): compared with the original student network, G and S become root nodes with deterministic CPDs, P(g2) = 1 and P(s1) = 1, while D, I, and L keep their original CPDs (e.g., P(L | G) as before).
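A sketch of the mutilated student network for E = {S = s1, G = g2}, reusing sample_from and the CPTs from the forward-sampling sketch; sampling from it is exactly sampling from the proposal used by likelihood weighting:

```python
# Evidence nodes become roots with deterministic CPDs; all other CPDs
# are unchanged.
P_G_mut = {"g0": 0.0, "g1": 0.0, "g2": 1.0}
P_S_mut = {"s0": 0.0, "s1": 1.0}

def mutilated_sample():
    """Forward sampling in the mutilated network GE=e."""
    d = sample_from(P_D)
    i = sample_from(P_I)
    g = sample_from(P_G_mut)   # no longer depends on D and I
    s = sample_from(P_S_mut)   # no longer depends on I
    l = sample_from(P_L[g])
    return {"D": d, "I": i, "G": g, "S": s, "L": l}
```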
30
Likelihood Weighting as IS
Target distribution: P'(X, e). Proposal distribution: Q(X) = PGE=e(X).
Claim: likelihood weighting is precisely normalized importance sampling with the above distributions.
Proof sketch: the LW estimate and the normalized IS estimate have the same form, so it suffices to show that the LW weight w[m] comes out to be exactly P'(x[m]) / Q(x[m]). In that ratio, each factor P(xi | pai) of a non-evidence variable appears in both P'(x[m]) and Q(x[m]) and cancels, while each evidence variable contributes P(ei | pai) to P'(x[m]) and 1 to Q(x[m]). What remains is w[m] = Π_i P(ei | pai), exactly the LW weight.
31
Particle-Based Methods Overview
Full particle methods
- Sampling methods: forward sampling, importance sampling, Markov chain Monte Carlo
- Deterministic particle generation
Distributional particles
32
Markov Chain Monte Carlo
General idea:
- Define a sampling process that is guaranteed to converge to taking samples from the posterior distribution of interest
- Generate samples from the sampling process
- Estimate f(X) from the samples
33
Markov Chains
A Markov chain consists of:
- A state space Val(X)
- Transition probabilities T(x → x') of going from state x to state x'
The distribution over subsequent states is defined as P(t+1)(x') = Σ_x P(t)(x) T(x → x').
[Figure: a simple three-state Markov chain over states x1, x2, x3, with transition probabilities 0.25, 0.7, 0.5, 0.5, 0.3, 0.75 labeling its edges.]
34
Stationary Distribution
A distribution π(X) is a stationary distribution for a Markov chain T if it satisfies π(x') = Σ_x π(x) T(x → x') for every state x'.
[Figure: the simple three-state Markov chain from the previous slide.]
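A sketch that iterates the distribution update until it reaches a fixed point, i.e., the stationary distribution. The transition matrix below is an assumed example (the exact edge labels of the slide's figure are not recoverable from the text); each row sums to 1:

```python
# An assumed 3-state chain; T[x][x'] = T(x -> x').
states = ["x1", "x2", "x3"]
T = {"x1": {"x1": 0.25, "x2": 0.75, "x3": 0.0},
     "x2": {"x1": 0.7,  "x2": 0.0,  "x3": 0.3},
     "x3": {"x1": 0.5,  "x2": 0.0,  "x3": 0.5}}

def step(dist):
    """One step of P(t+1)(x') = sum_x P(t)(x) T(x -> x')."""
    return {x2: sum(dist[x1] * T[x1][x2] for x1 in states)
            for x2 in states}

# Iterating the update converges to the stationary distribution pi,
# which satisfies pi = step(pi).
dist = {"x1": 1.0, "x2": 0.0, "x3": 0.0}
for _ in range(100):
    dist = step(dist)
print(dist)
```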
35
Markov Chains & Stationary Dist.
A Markov chain is regular if there is a k such that, for every x, x' ∈ Val(X), the probability of getting from x to x' in exactly k steps is greater than zero.
Theorem: a finite-state Markov chain T has a unique stationary distribution if and only if it is regular.
Goal: define a Markov chain whose stationary distribution is P(X | e).
36
Gibbs Sampling
States: legal (= consistent with the evidence) assignments to the variables.
Transition probability: T = T1 ∘ ... ∘ Tk, where Ti = T((ui, xi) → (ui, x'i)) = P(x'i | ui) and ui is the current assignment to all variables other than Xi.
Claim: P(X | e) is a stationary distribution of this chain.
37
Gibbs Sampling
The Gibbs-sampling Markov chain is regular if:
- Bayesian networks: all CPDs are strictly positive
- Markov networks: all clique potentials are strictly positive
Gibbs sampling procedure (one sample):
- Set x[m] = x[m-1]
- For each variable Xi ∈ X − E:
  - Set ui = x[m](X − Xi)
  - Sample from P(Xi | ui)
  - Save the sampled value in x[m](Xi)
- Return x[m]
Note: P(Xi | ui) is easily computed from the Markov blanket of Xi (next slide).
38
Gibbs Sampling
How do we evaluate P(Xi | x1,...,xi-1,xi+1,...,xn)?
Let Y1,...,Yk be the children of Xi. By the definition of the Markov blanket MBi, the parents of each Yj are contained in MBi ∪ {Xi}. It is easy to show that
P(xi | x1,...,xi-1,xi+1,...,xn) = P(xi | mbi) ∝ P(xi | pai) Π_{j=1..k} P(yj | pa(Yj))
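A sketch of Gibbs sampling on the student network, reusing random, sample_from, and the CPTs from the forward-sampling sketch. For a network this small we simply evaluate the full joint for each candidate value of the resampled variable; only the Markov-blanket factors actually vary with that value, so after normalization this equals the blanket formula above. The names (VALUES, conditional, gibbs_run) and the burn-in length are this sketch's own choices:

```python
VALUES = {"D": ["d0", "d1"], "I": ["i0", "i1"],
          "G": ["g0", "g1", "g2"], "S": ["s0", "s1"],
          "L": ["l0", "l1"]}

def conditional(var, x):
    """P(var | all other variables), via normalized joint evaluations."""
    scores = {}
    for v in VALUES[var]:
        y = dict(x)
        y[var] = v
        scores[v] = (P_D[y["D"]] * P_I[y["I"]] *
                     P_G[(y["D"], y["I"])][y["G"]] *
                     P_S[y["I"]][y["S"]] *
                     P_L[y["G"]][y["L"]])
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}

def gibbs_run(evidence, n_samples, burn_in=500):
    """Resample each non-evidence variable in turn from its conditional
    given the current values of all the others."""
    x = dict(evidence)
    hidden = [v for v in VALUES if v not in evidence]
    for v in hidden:
        x[v] = random.choice(VALUES[v])   # arbitrary legal initial state
    samples = []
    for t in range(burn_in + n_samples):
        for v in hidden:
            x[v] = sample_from(conditional(v, x))
        if t >= burn_in:
            samples.append(dict(x))
    return samples

samples = gibbs_run({"S": "s1", "G": "g2"}, n_samples=5_000)
print(sum(x["I"] == "i1" for x in samples) / len(samples))  # ≈ P(i1 | s1, g2)
```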
39
Gibbs Sampling in Practice
We need to wait until the burn-in time has ended: the number of steps before we start taking samples from the chain.
- Wait until the sampling distribution is close to the stationary distribution
- It is hard to provide general bounds for this 'mixing time'
Once the burn-in period has ended, all samples are from the stationary distribution.
We can assess the burn-in time empirically by comparing two chains.
Note: even after the burn-in time, successive samples are correlated.
Since no theoretical guarantees exist, the application of Markov chains is somewhat of an art.
40
Sampling Strategy
How do we collect the samples?
Strategy I: run the chain M times, each run for N steps, with each run starting from a different state; return the last state of each run ('M chains').
Strategy II: run one chain for a long time; after some 'burn-in' period, take a sample every fixed number of steps ('M samples from one chain').
41
Comparing Strategies
Strategy I:
- Better chance of 'covering' the space of points, especially if the chain is slow to reach the stationary distribution
- Has to perform burn-in steps for each chain
Strategy II:
- Performs burn-in only once
- Samples might be correlated (although only weakly)
Hybrid strategy: run several chains and take a few samples from each, combining the benefits of both strategies. A sketch of Strategy II follows below.
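A sketch of Strategy II built on the Gibbs sampler above (conditional, VALUES, sample_from, and random reused); the thin parameter, an assumption of this sketch, sets the gap between kept samples:

```python
def gibbs_strategy_ii(evidence, n_samples, burn_in=500, thin=10):
    """One long chain: discard the burn-in prefix, then keep every
    thin-th state to reduce correlation between successive samples."""
    x = dict(evidence)
    hidden = [v for v in VALUES if v not in evidence]
    for v in hidden:
        x[v] = random.choice(VALUES[v])
    samples = []
    for t in range(burn_in + n_samples * thin):
        for v in hidden:
            x[v] = sample_from(conditional(v, x))
        if t >= burn_in and (t - burn_in) % thin == 0:
            samples.append(dict(x))
    return samples
```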
42
Particle-Based Methods Overview
Full particle methods
- Sampling methods: forward sampling, importance sampling, Markov chain Monte Carlo
- Deterministic particle generation
Distributional particles
43
Deterministic Search Methods
Idea: if the distribution is dominated by a small set of instances, it suffices to consider only those instances when approximating the function.
For the instances x[1],...,x[M] that we generate, the estimate is: Ê[f] = Σ_{m=1..M} P(x[m]) f(x[m]).
Note: we can obtain lower and upper bounds by examining the part of the probability mass covered by the enumerated instances, Σ_m P(x[m]).
Key question: how can we enumerate highly probable instantiations?
Note: even the single most probable instantiation is the MPE assignment, which is itself NP-hard to find.
44
Particle-Based Methods Overview
Full particle methods
- Sampling methods: forward sampling, importance sampling, Markov chain Monte Carlo
- Deterministic particle generation
Distributional particles
45
Distributional Particles
Idea: use partial assignments to a subset of the network variables, combined with a closed-form representation of a distribution over the rest.
- Xp: variables whose assignments are defined by the particles
- Xd: variables over which we maintain a distribution
Distributional particles are a.k.a. Rao-Blackwellized particles.
Estimation proceeds as: Ê[f] = (1/M) Σ_{m=1..M} E_{P(Xd | xp[m], e)}[ f(xp[m], Xd) ]
We can use any sampling procedure to sample Xp. We assume that the internal expectation can be computed efficiently (e.g., by exact inference over Xd).
46
Distributional Particles
Distributional particles define a continuum:
- For |Xp| = |X − E| we have full particles, and thus pure sampling
- For |Xp| = 0 we are performing full exact inference
Distributional likelihood weighting: sample over only a subset of the variables.
Distributional Gibbs sampling: sample only a subset of the variables; the transition probability is as before, T((ui, xi) → (ui, x'i)) = P(x'i | ui), but computing it may require inference.
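A sketch of distributional likelihood weighting on the student network, reusing sample_from, the CPTs, and VALUES from the earlier sketches. Here the particles assign D (so Xp = {D}), a closed-form posterior is maintained over I (so Xd = {I}), and L, which the query does not touch, trivially marginalizes out; rb_lw_sample is an illustrative name:

```python
def rb_lw_sample(evidence):
    """One distributional particle: D is sampled from its prior, and the
    posterior over I given (D, evidence) is computed exactly."""
    d = sample_from(P_D)
    # Exact enumeration over I given d and the evidence G, S.
    scores = {i: P_I[i] * P_G[(d, i)][evidence["G"]] * P_S[i][evidence["S"]]
              for i in VALUES["I"]}
    w = sum(scores.values())            # = P(evidence | d), the weight
    post_I = {i: s / w for i, s in scores.items()}
    return d, post_I, w

# Estimate P(I = i1 | S = s1, G = g2): the inner expectation over I is
# exact, so the estimator has lower variance than full-particle LW.
evidence = {"S": "s1", "G": "g2"}
particles = [rb_lw_sample(evidence) for _ in range(5_000)]
num = sum(w * post["i1"] for d, post, w in particles)
den = sum(w for d, post, w in particles)
print(num / den)
```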