Approximate Inference. Edited from slides by Nir Friedman.
Complexity of Inference. Theorem: Computing P(X = x) in a Bayesian network is NP-hard. This is not surprising, since we can simulate Boolean gates.
Proof. We reduce 3-SAT to Bayesian network computation. Assume we are given a 3-SAT problem: let q1, …, qn be propositions and φ1, …, φk be clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1, …, qn, and φ = φ1 ∧ … ∧ φk. We will construct a Bayesian network such that P(X = t) > 0 iff φ is satisfiable.
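[Figure: the constructed network. Root nodes Q1, …, Qn; clause nodes φ1, …, φk, each with the variables of its clause as parents; AND-gate nodes A1, A2, … that combine the clause nodes; output node X.] P(Qi = true) = 0.5. P(φI = true | Qi, Qj, Ql) = 1 iff Qi, Qj, Ql satisfy the clause φI. A1, A2, … are simple binary AND gates.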
It is easy to check that: the number of variables is polynomial; each local probability table can be described by a small table (8 parameters at most); P(X = true) > 0 if and only if there exists a satisfying assignment to Q1, …, Qn. Conclusion: polynomial reduction of 3-SAT.
Note: this construction also shows that computing P(X = t) is harder than NP. Since every assignment to Q1, …, Qn has probability 2^(-n), the quantity 2^n · P(X = t) is the number of satisfying assignments to φ. Thus, computing P(X = t) is #P-hard.
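The counting claim can be checked on a tiny instance. The sketch below is only an illustration: the formula, the signed-integer clause encoding, and the function names are made up for this example. It computes P(X = t) by brute-force enumeration of the root variables rather than by any real inference routine.

```python
from itertools import product

# Hypothetical toy formula over q1..q3: literal k means q_k, -k means NOT q_k.
CLAUSES = [(1, 2, 3), (-1, -2, 3), (1, -2, -3)]
N_VARS = 3

def clause_true(clause, assignment):
    """Deterministic CPD of a clause node: true iff some literal is satisfied."""
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)

def prob_x_true(clauses, n_vars):
    """P(X = t): sum over all root assignments, each with prior 0.5 ** n_vars."""
    total = 0.0
    for assignment in product([False, True], repeat=n_vars):
        # X is the AND of all clause nodes (the chain of AND gates).
        if all(clause_true(c, assignment) for c in clauses):
            total += 0.5 ** n_vars
    return total

p = prob_x_true(CLAUSES, N_VARS)
num_satisfying = round(2 ** N_VARS * p)   # 2^n * P(X = t)
print(p, num_satisfying)
```

The enumeration takes 2^n steps, which is exactly the cost the hardness result says cannot be avoided in general.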
Hardness: Notes. We used deterministic relations in our construction. The same construction works if we use (1 - ε, ε) instead of (1, 0) in each gate, for any ε < 0.5. Hardness does not mean we cannot solve inference. It implies that (unless P = NP) we cannot find a general procedure that works efficiently for all networks. For particular families of networks, we can have provably efficient procedures. We have seen such families in the course: HMMs and evolutionary trees.
Approximation. Until now, we examined exact computation. In many applications, approximations are sufficient. Example: P(X = x | e) = 0.3183098861838. Maybe P(X = x | e) ≈ 0.3 is a good enough approximation, e.g., if we take action only when P(X = x | e) > 0.5. Can we find good approximation algorithms?
Types of Approximations: Absolute error. An estimate q of P(X = x | e) has absolute error ε if P(X = x | e) - ε ≤ q ≤ P(X = x | e) + ε, or equivalently q - ε ≤ P(X = x | e) ≤ q + ε. Absolute error is not always what we want: if P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable; if P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise.
Types of Approximations: Relative error. An estimate q of P(X = x | e) has relative error ε if P(X = x | e)(1 - ε) ≤ q ≤ P(X = x | e)(1 + ε), or equivalently q/(1 + ε) ≤ P(X = x | e) ≤ q/(1 - ε). The sensitivity of the approximation depends on the actual value of the desired result.
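The two error criteria can be compared directly. This is a minimal sketch with hypothetical helper names. The probabilities 0.0001 and 0.001 come from the absolute-error example, while the relative tolerance of 0.5 is an arbitrary choice made for illustration.

```python
def within_absolute_error(q, p, eps):
    """True if p - eps <= q <= p + eps."""
    return p - eps <= q <= p + eps

def within_relative_error(q, p, eps):
    """True if p(1 - eps) <= q <= p(1 + eps)."""
    return p * (1 - eps) <= q <= p * (1 + eps)

# A tiny true probability and an estimate that is ten times too large.
p_small, q_small = 0.0001, 0.001
print(within_absolute_error(q_small, p_small, 0.001))  # passes the absolute test
print(within_relative_error(q_small, p_small, 0.5))    # fails the relative test
```

The same estimate is acceptable under one criterion and useless under the other, which is why the two notions are treated separately in the complexity results.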
Complexity. Exact inference is NP-hard. Is approximate inference any easier? Construction for exact inference: Input: a 3-SAT problem φ. Output: a BN such that P(X = t) > 0 iff φ is satisfiable.
Complexity: Relative Error. Theorem: Given ε, finding an ε-relative error approximation is NP-hard. Proof: Suppose that q is an ε-relative error estimate of P(X = t) = 0. Then 0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0, namely q = 0. Thus, ε-relative error and exact computation coincide for the value 0. Conversely, if P(X = t) > 0 and ε < 1, then q > 0, so a relative error estimate decides whether φ is satisfiable.
Complexity: Absolute error. Theorem: If ε < 0.5, then finding an estimate of P(X = x | e) with absolute error ε is NP-hard.
Proof. Recall our construction. [Figure: the reduction network with root nodes Q1, …, Qn, clause nodes φ1, …, φk, AND gates A1, …, and output node X.]
Proof (cont.) Suppose we can estimate with absolute error ε. Let p1 ≈ P(Q1 = t | X = t). Assign q1 = t if p1 > 0.5, else q1 = f. Let p2 ≈ P(Q2 = t | X = t, Q1 = q1). Assign q2 = t if p2 > 0.5, else q2 = f. … Let pn ≈ P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1). Assign qn = t if pn > 0.5, else qn = f.
Proof (cont.) Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment. Suppose φ is satisfiable. We show by induction on i that there is a satisfying assignment with Q1 = q1, …, Qi = qi. Base case: If Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1 ⇒ p1 ≥ 1 - ε > 0.5 ⇒ q1 = t. If Q1 = f in all satisfying assignments, then q1 = f. Otherwise, the statement holds for any choice of q1.
Proof (cont.) Induction argument: If Qi+1 = t in all satisfying assignments such that Q1 = q1, …, Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1 ⇒ pi+1 ≥ 1 - ε > 0.5 ⇒ qi+1 = t. If Qi+1 = f in all satisfying assignments such that Q1 = q1, …, Qi = qi, then qi+1 = f. Otherwise, the statement holds for any choice of qi+1.
Proof (cont.) We can efficiently check whether q1, …, qn is a satisfying assignment (linear time). If it is, then φ is satisfiable. If it is not, then φ is not satisfiable, since by the claim a satisfiable φ would have produced a satisfying assignment. Thus, given an approximation procedure with absolute error ε < 0.5, we can decide 3-SAT with n procedure calls: we generate an assignment as in the proof and check in linear time whether it satisfies φ. If a satisfying assignment exists, we showed that the procedure finds one; if none exists, the check fails. Thus, approximation is NP-hard.
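The reduction in this proof can be sketched in code. Everything here is illustrative: `noisy_oracle` is a hypothetical stand-in for an approximate inference routine (it computes the exact conditional by brute force and then perturbs it by at most ε), and the clause encoding with signed integers is an assumption of this example.

```python
import random
from itertools import product

def satisfies(clauses, assignment):
    """Linear-time check; literal k means q_k, -k means NOT q_k."""
    return all(any(assignment[abs(l) - 1] == (l > 0) for l in c) for c in clauses)

def noisy_oracle(clauses, n, prefix, eps, rng):
    """Estimate of P(Q_{i+1} = t | X = t, Q_1..Q_i = prefix), off by at most eps."""
    models = [a for a in product([False, True], repeat=n)
              if a[:len(prefix)] == tuple(prefix) and satisfies(clauses, a)]
    if not models:             # conditioning event has probability 0
        return rng.random()    # any answer is allowed in this case
    exact = sum(a[len(prefix)] for a in models) / len(models)
    return exact + rng.uniform(-eps, eps)

def decide_3sat(clauses, n, eps=0.49, seed=0):
    """Build q_1..q_n with n oracle calls, then verify the assignment."""
    rng = random.Random(seed)
    prefix = []
    for _ in range(n):
        prefix.append(noisy_oracle(clauses, n, prefix, eps, rng) > 0.5)
    return satisfies(clauses, prefix), prefix

print(decide_3sat([(1, 2, 3), (-1, -2, 3), (1, -2, -3)], 3))
```

Because the roots are uniform and the gates are deterministic, the exact conditional is just the fraction of consistent satisfying assignments in which the next variable is true, which is what the stand-in oracle computes before adding noise.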
When can we hope to approximate? Two situations: (1) "Peaked" distributions, where improbable values are ignored. (2) Highly stochastic distributions, where "far" evidence is discarded (e.g., far markers in genetic linkage analysis).
Stochastic Simulation. Suppose we can sample instances <x1, …, xn> according to P(X1, …, Xn). What is the probability that a random sample <x1, …, xn> satisfies e? This is exactly P(e). We can view each sample as tossing a biased coin with probability P(e) of "Heads".
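Stochastic Sampling. Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate P(e) ≈ (1/N) Σi 1{x[i] satisfies e}, where 1{x[i] satisfies e} is 1 if x[i] satisfies e and 0 otherwise. The law of large numbers implies that as N grows, our estimate will converge to p = P(e) with high probability.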
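The coin-tossing view can be simulated directly. In this sketch the true value P(e) = 0.3 is a hypothetical number used only to drive the simulated coin, and `estimate_probability` is an illustrative helper name.

```python
import random

def estimate_probability(sample_satisfies_e, num_samples):
    """Fraction of samples that satisfy e (the sample mean of the indicator)."""
    hits = sum(1 for _ in range(num_samples) if sample_satisfies_e())
    return hits / num_samples

rng = random.Random(0)
P_E = 0.3                            # hypothetical true P(e)
coin = lambda: rng.random() < P_E    # one "sample": does it satisfy e?

for n in (100, 10_000, 100_000):
    print(n, estimate_probability(coin, n))
```

The printed estimates should settle near 0.3 as the number of samples grows, in line with the law of large numbers.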
Sampling a Bayesian Network. If P(X1, …, Xn) is represented by a Bayesian network, can we efficiently sample from it? Yes: sample according to the structure of the network, drawing each variable given its sampled parents.
Logic sampling: example. [Figure, animated over five slides: a network with arcs Burglary → Alarm, Earthquake → Alarm, Earthquake → Radio, and Alarm → Call. Tables: P(b) = 0.03; P(e) = 0.001; P(a | B, E) = 0.98, 0.7, 0.4, 0.01 over the four parent configurations; P(r | E) = 0.3, 0.001; P(c | A) = 0.8, 0.05. One sample is built a variable at a time in the order B, E, A, C, R, each value drawn from the table entry selected by its already-sampled parents.]
Logic Sampling
Let X1, …, Xn be an order of the variables consistent with arc direction
for i = 1, …, n do
    sample xi from P(Xi | pai)
    (Note: since Pai ⊆ {X1, …, Xi-1}, we have already assigned values to them)
return x1, …, xn
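A minimal sketch of logic sampling on the burglary/earthquake example network follows. The table values are taken from the example, but the pairing of the four P(a | B, E) entries with parent values (b,e / b,not-e / not-b,e / not-b,not-e) and of the two-entry tables with (true, false) is an assumption, since the negation marks are not legible in the tables. The function name is illustrative.

```python
import random

# CPT values from the example network; the row-to-parent pairing is assumed.
P_B = 0.03
P_E = 0.001
P_A = {(True, True): 0.98, (True, False): 0.7,
       (False, True): 0.4, (False, False): 0.01}   # P(a | B, E)
P_R = {True: 0.3, False: 0.001}                    # P(r | E)
P_C = {True: 0.8, False: 0.05}                     # P(c | A)

def logic_sample(rng):
    """Sample each variable given its already-sampled parents (order B, E, A, C, R)."""
    b = rng.random() < P_B
    e = rng.random() < P_E
    a = rng.random() < P_A[(b, e)]
    c = rng.random() < P_C[a]
    r = rng.random() < P_R[e]
    return {"B": b, "E": e, "A": a, "C": c, "R": r}

rng = random.Random(0)
samples = [logic_sample(rng) for _ in range(100_000)]
p_alarm = sum(s["A"] for s in samples) / len(samples)   # estimate of P(A = true)
print(p_alarm)
```

Each sample costs one table lookup per variable, so the work per sample is linear in the number of variables.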
Logic Sampling. Sampling a complete instance is linear in the number of variables, regardless of the structure of the network. However, if P(e) is small, we need many samples to get a decent estimate: on average, only one sample in 1/P(e) satisfies e.
Can we sample from P(X1, …, Xn | e)? If the evidence is in the roots of the network, we sample as before. If the evidence is in the leaves of the network, we have a problem: our sampling method proceeds according to the order of nodes in the graph, so we need to retain only those samples that match e. This might be a rare event.
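The cost of discarding samples can be seen in a small sketch. It uses only the Earthquake → Radio fragment of the example network with evidence on the leaf (Radio = true). The pairing of the P(r | E) entries with E = true and E = false is an assumption, and `rejection_estimate` is an illustrative name.

```python
import random

P_E = 0.001                        # P(e)
P_R = {True: 0.3, False: 0.001}    # P(r | E); assumed order: e, then not-e

def rejection_estimate(num_samples, seed=0):
    """Estimate P(E = true | R = true), keeping only samples with R = true."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(num_samples):
        e = rng.random() < P_E        # sample the root
        r = rng.random() < P_R[e]     # sample the leaf given its parent
        if r:                         # retain only samples that match the evidence
            kept += 1
            hits += e
    return (hits / kept if kept else None), kept

estimate, kept = rejection_estimate(200_000)
print(estimate, kept)
```

Under these numbers P(R = true) is about 0.0013, so only a small fraction of the generated samples contributes to the estimate.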
Likelihood Weighting. Can we ensure that all of our samples are used? One wrong (but fixable) approach: when we need to sample a variable that is assigned a value by e, use that specified value. For example, in a network X → Y where we know Y = 1: sample X from P(X), then take Y = 1. This is NOT a sample from P(X, Y | Y = 1)!
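Likelihood Weighting. Problem: these samples of X are from P(X), not from P(X | Y = 1). Solution: penalize samples in which P(Y = 1 | X) is small. We now sample as follows: let x[i] be a sample from P(X), and let w[i] = P(Y = 1 | X = x[i]). Then estimate P(X = x | Y = 1) ≈ Σi w[i]·1{x[i] = x} / Σi w[i], where 1{x[i] = x} is 1 or 0.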
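The effect of the weights on the two-node network X → Y can be sketched as follows. All numbers (P(X = 1) = 0.2, P(Y = 1 | X = 1) = 0.9, P(Y = 1 | X = 0) = 0.1) are hypothetical and chosen only for illustration.

```python
import random

P_X1 = 0.2                         # hypothetical P(X = 1)
P_Y1_GIVEN_X = {1: 0.9, 0: 0.1}    # hypothetical P(Y = 1 | X)

rng = random.Random(0)
xs = [1 if rng.random() < P_X1 else 0 for _ in range(100_000)]   # x[i] ~ P(X)
ws = [P_Y1_GIVEN_X[x] for x in xs]                               # w[i] = P(Y = 1 | X = x[i])

# Unweighted: treats the samples as if they came from P(X | Y = 1).
naive = sum(xs) / len(xs)
# Weighted: sum of weights of samples with x[i] = 1 over the total weight.
weighted = sum(w for x, w in zip(xs, ws) if x == 1) / sum(ws)
# Exact posterior by Bayes' rule, for comparison.
exact = P_X1 * 0.9 / (P_X1 * 0.9 + (1 - P_X1) * 0.1)
print(naive, weighted, exact)
```

The unweighted fraction stays near the prior 0.2, while the weighted fraction should land close to the exact posterior of roughly 0.69.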
Likelihood Weighting: example. [Figure, animated over five slides: the Burglary/Earthquake/Alarm/Radio/Call network with tables P(b) = 0.03; P(e) = 0.001; P(a | B, E) = 0.98, 0.7, 0.4, 0.01; P(r | E) = 0.3, 0.001; P(c | A) = 0.8, 0.05, and with A and R observed. B and E are sampled; the observed value of A multiplies the weight by 0.6; C is sampled; the observed value of R multiplies the weight by 0.3, giving the sample weight 0.6 * 0.3.]
Likelihood Weighting
Let X1, …, Xn be an order of the variables consistent with arc direction
w = 1
for i = 1, …, n do
    if Xi = xi has been observed
        w ← w · P(Xi = xi | pai)
    else
        sample xi from P(Xi | pai)
return x1, …, xn, and w
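The procedure can be sketched on the example network as follows. The table values come from the example, but the pairing of table entries with parent values is an assumption, and the evidence used here (A = false, R = true) is inferred from the weights 0.6 and 0.3 in the example rather than stated in it. Function and variable names are illustrative.

```python
import random

# Structure and CPTs of the example network; row-to-parent pairing is assumed.
PARENTS = {"B": (), "E": (), "A": ("B", "E"), "C": ("A",), "R": ("E",)}
P_TRUE = {
    "B": {(): 0.03},
    "E": {(): 0.001},
    "A": {(True, True): 0.98, (True, False): 0.7,
          (False, True): 0.4, (False, False): 0.01},
    "C": {(True,): 0.8, (False,): 0.05},
    "R": {(True,): 0.3, (False,): 0.001},
}
ORDER = ["B", "E", "A", "C", "R"]   # consistent with arc direction

def weighted_sample(evidence, rng):
    """Return one (sample, weight) pair."""
    x, w = {}, 1.0
    for var in ORDER:
        p_true = P_TRUE[var][tuple(x[p] for p in PARENTS[var])]
        if var in evidence:
            x[var] = evidence[var]
            w *= p_true if x[var] else 1.0 - p_true   # w <- w * P(Xi = xi | pai)
        else:
            x[var] = rng.random() < p_true            # sample xi from P(Xi | pai)
    return x, w

def lw_estimate(query_var, evidence, num_samples, seed=0):
    """Weighted fraction of samples in which query_var is true."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(num_samples):
        x, w = weighted_sample(evidence, rng)
        den += w
        if x[query_var]:
            num += w
    return num / den

print(lw_estimate("E", {"A": False, "R": True}, 200_000))
```

Every generated sample contributes to the estimate, but samples whose sampled parents make the evidence unlikely contribute very little weight.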
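Likelihood Weighting. Why does this make sense? When N is large, we expect about N·P(X = x) samples with x[i] = x, each carrying weight P(Y = 1 | X = x). Thus, Σi w[i]·1{x[i] = x} / Σi w[i] ≈ N·P(X = x)·P(Y = 1 | X = x) / Σx' N·P(X = x')·P(Y = 1 | X = x') = P(X = x, Y = 1) / P(Y = 1) = P(X = x | Y = 1).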
Likelihood Weighting. What can we say about the quality of the answer? Intuitively, the weight of a sample reflects its probability given the evidence. We need to collect enough weight mass for the samples to provide an accurate answer. Another factor is the "extremeness" of the CPDs. Theorem (Dagum & Luby AIJ93): If P(Xi | Pai) ∈ [l, u] for all local probability tables, and […], then with probability 1 - δ, the estimate is an ε-relative error approximation.