Approximate Inference (slides by Nir Friedman)
When can we hope to approximate? Two situations:
- Highly stochastic distributions: the effect of "far" evidence can be discarded
- "Peaked" distributions: improbable values can be ignored
Stochasticity & Approximations
Consider a chain X_1 -> X_2 -> ... -> X_{n+1} with transition probabilities
  P(X_{i+1} = t | X_i = t) = 1 - ε    and    P(X_{i+1} = f | X_i = f) = 1 - ε.
Computing the probability of X_{n+1} given X_1, each of the n transitions either keeps or flips the value, so
  P(X_{n+1} = t | X_1 = t) = Σ_{even k} C(n, k) ε^k (1 - ε)^{n-k}    (even # of flips)
  P(X_{n+1} = f | X_1 = t) = Σ_{odd k} C(n, k) ε^k (1 - ε)^{n-k}    (odd # of flips)
[Figure: plot of P(X_n = t | X_1 = t) for n = 5, 10, 20]
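To make the mixing effect concrete, here is a minimal Python sketch (not part of the original slides) that evaluates the even-number-of-flips sum above for a hypothetical ε = 0.3; as n grows the probability approaches 1/2, which is why "far" evidence carries little information.

    from math import comb

    def prob_same(n, eps):
        # P(X_{n+1} = t | X_1 = t): an even number of flips among the n transitions.
        # (Equivalently, the closed form (1 + (1 - 2*eps)**n) / 2.)
        return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
                   for k in range(0, n + 1, 2))

    for n in (5, 10, 20):
        print(n, prob_same(n, 0.3))   # tends toward 0.5 as n grows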
Stochastic Processes
- This behavior of a chain (a Markov process) is called mixing.
- General Bayesian networks show similar behavior: if probabilities are far from 0 and 1, the effect of "far" evidence vanishes (and so can be discarded in approximations).
"Peaked" distributions
- If the distribution is "peaked", most of the mass lies on a few instances
- If we can focus on these instances, we can ignore the rest
[Figure: probability mass concentrated on a few instances]
Global conditioning
[Figure: a multiply connected network over A, B, C, D, E, I, J, K, L, M; fixing the values of A and B yields one simplified network per assignment (a, b)]
- Fixing values at the beginning of the summation can decrease the size of the tables formed by variable elimination; space is traded for time.
- Special case: fix a set of nodes that "breaks all loops". This method is called cutset conditioning.
Bounded conditioning
[Figure: fixing the values of A and B]
By examining only the probable assignments of A and B, we perform several simple computations instead of one complex one.
Bounded conditioning
- Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset.
- Search for highly probable assignments to A, B:
  - Option 1: select a, b with high P(a, b).
  - Option 2: select a, b with high P(a, b | e).
- We need to search for such high-mass assignments, and that can be hard.
Bounded Conditioning
Advantages:
- Combines exact inference within an approximation
- Continuous: more time can be used to examine more cases
- Bounds: the unexamined mass is used to compute error bars
Possible problems:
- P(a, b) is the prior mass, not the posterior. If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments.
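As an illustration (not from the original slides), here is a hedged Python sketch of the bounds that bounded conditioning provides: only the most probable cutset assignments are examined, and the unexamined prior mass becomes the gap between the lower and upper bound on P(e). The helpers prior_mass and conditional_likelihood are hypothetical placeholders for exact inference on the simplified (cut) network.

    def bounded_conditioning(assignments, prior_mass, conditional_likelihood, budget):
        # assignments: iterable of cutset assignments (a, b)
        # prior_mass(ab): P(a, b)                  (hypothetical exact computation)
        # conditional_likelihood(ab): P(e | a, b)  (hypothetical exact inference on the cut network)
        ranked = sorted(assignments, key=prior_mass, reverse=True)[:budget]
        examined = sum(prior_mass(ab) * conditional_likelihood(ab) for ab in ranked)
        unexamined = 1.0 - sum(prior_mass(ab) for ab in ranked)
        # P(e) is at least the examined mass and at most examined + unexamined,
        # since P(e | a, b) <= 1 for every unexamined assignment.
        return examined, examined + unexamined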
Network Simplifications
- In these approaches, we try to replace the original network with a simpler one
  - the resulting network allows fast exact methods
Network Simplifications
Typical simplifications:
- Remove parts of the network
- Remove edges
- Reduce the number of values (value abstraction)
- Replace a sub-network with a simpler one (model abstraction)
These simplifications are often made w.r.t. the particular evidence and query.
Stochastic Simulation
Suppose our goal is to compute the likelihood of evidence, P(e), where e is an assignment to some of the variables in {X_1, …, X_n}. Assume that we can sample instances according to the distribution P(x_1, …, x_n). What, then, is the probability that a random sample satisfies e? Answer: simply P(e), which is what we wish to compute. Each sample simulates the toss of a biased coin with probability P(e) of "heads".
Stochastic Sampling
Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate
  P(e) ≈ (1/N) Σ_{k=1..N} 1{x[k] satisfies e}
where each indicator term is zero or one. The law of large numbers implies that as N grows, this estimate converges to P(e) with high probability.
- How many samples do we need to get a reliable estimate? We will not discuss this issue here.
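A minimal Python sketch of this estimator, not in the original slides; sample_instance() (drawing a full instance from P(x_1, …, x_n)) and satisfies(x, e) (checking the evidence) are hypothetical helpers.

    def estimate_evidence(sample_instance, satisfies, e, num_samples=10_000):
        # Fraction of sampled instances that are consistent with the evidence e.
        hits = sum(satisfies(sample_instance(), e) for _ in range(num_samples))
        return hits / num_samples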
Sampling a Bayesian Network
If P(X_1, …, X_n) is represented by a Bayesian network, can we efficiently sample from it?
- Idea: sample according to the structure of the network
  - Write the distribution using the chain rule, and then sample each variable given its parents
Logic sampling: worked example
[Figure sequence over six slides: the Burglary / Earthquake / Alarm / Call / Radio network with its CPTs (e.g., P(b) = 0.03). Variables are sampled in an order consistent with the arcs: first B, then E, then A given B and E, then C given A, then R given E, building up one complete sample (b, e, a, c, r).]
Logic Sampling
Let X_1, …, X_n be an order of the variables consistent with arc direction
w for i = 1, …, n do
  sample x_i from P(X_i | pa_i)
  (Note: since Pa_i ⊆ {X_1, …, X_{i-1}}, we have already assigned values to them)
return x_1, …, x_n
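A self-contained Python sketch of logic (forward) sampling, not from the original slides. It assumes a hypothetical representation of the network as a list of (name, parents, cpt) triples in an order consistent with arc direction, where cpt maps a tuple of parent values to P(X = 1 | parents) for binary variables; the CPT numbers below are illustrative.

    import random

    def logic_sample(network):
        # Forward-sample one complete instance: visit variables in topological
        # order and sample each given its already-sampled parents.
        sample = {}
        for name, parents, cpt in network:
            p_true = cpt[tuple(sample[p] for p in parents)]
            sample[name] = 1 if random.random() < p_true else 0
        return sample

    # Illustrative fragment of the alarm network (numbers are made up):
    network = [
        ("B", [], {(): 0.03}),
        ("E", [], {(): 0.001}),
        ("A", ["B", "E"], {(0, 0): 0.01, (0, 1): 0.4, (1, 0): 0.8, (1, 1): 0.9}),
        ("C", ["A"], {(0,): 0.05, (1,): 0.7}),
        ("R", ["E"], {(0,): 0.01, (1,): 0.3}),
    ]
    print(logic_sample(network))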
Logic Sampling
- Sampling a complete instance is linear in the number of variables
  - regardless of the structure of the network
- However, if P(e) is small, we need many samples to get a decent estimate
Can we sample from P(X_i | e)?
- If the evidence e is at the roots of the Bayesian network, easily
- If the evidence is at the leaves of the network, we have a problem:
  - our sampling method proceeds according to the order of the nodes in the network
[Figure: a small network with evidence A = a at a leaf]
Likelihood Weighting
Can we ensure that all of our samples satisfy e?
- One simple (but wrong) solution: when we need to sample a variable Y that is assigned a value by e, use its specified value.
- For example, in the network X -> Y, suppose we know Y = 1:
  - sample X from P(X)
  - then take Y = 1
- Is this a sample from P(X, Y | Y = 1)? NO.
Likelihood Weighting
Problem: these samples of X are from P(X), not from P(X | Y = 1)
- Solution: penalize samples in which P(Y = 1 | X) is small
- We now sample as follows:
  - let x[i] be a sample from P(X)
  - let w[i] = P(Y = 1 | X = x[i])
Likelihood Weighting
Let X_1, …, X_n be an order of the variables consistent with arc direction
w = 1
for i = 1, …, n do
  if X_i = x_i has been observed
    w := w * P(X_i = x_i | pa_i)
  else
    sample x_i from P(X_i | pa_i)
return x_1, …, x_n, and w
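A minimal Python sketch of likelihood weighting, using the same hypothetical (name, parents, cpt) network representation as the logic-sampling sketch above (not part of the original slides).

    import random

    def likelihood_weighted_sample(network, evidence):
        # Observed variables keep their evidence values; each contributes
        # P(observed value | sampled parents) to the weight.
        sample, weight = {}, 1.0
        for name, parents, cpt in network:
            p_true = cpt[tuple(sample[p] for p in parents)]
            if name in evidence:
                sample[name] = evidence[name]
                weight *= p_true if evidence[name] == 1 else 1.0 - p_true
            else:
                sample[name] = 1 if random.random() < p_true else 0
        return sample, weight

Averaging the weights over many samples estimates P(e); the weighted fraction of samples in which X_i = x estimates P(X_i = x | e).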
Likelihood weighting: worked example
[Figure sequence over five slides: the same Burglary / Earthquake / Alarm / Call / Radio network, now run with likelihood weighting. Evidence variables (the slides fix values such as A = a and R = r) keep their observed values instead of being sampled, and the sample's weight accumulates the probability of each observed value given its sampled parents (e.g., 0.6 * 0.3).]
Likelihood Weighting
- Why does this make sense? When N is large, we expect about N·P(X = x) samples with x[i] = x.
- Thus,
  (1/N) Σ_i w[i] = (1/N) Σ_i P(Y = 1 | X = x[i]) ≈ Σ_x P(X = x) P(Y = 1 | X = x) = P(Y = 1)
Summary
Approximate inference is needed for large pedigrees. We have seen a few methods today; some fit genetic linkage analysis and some do not. There are many other approximation algorithms: variational methods, MCMC, and others. In next semester's Bioinformatics project course (236524), we will offer projects that implement some approximation methods and embed them in the superlink software.