Approximate Inference


Approximate Inference. Edited from slides by Nir Friedman.

Complexity of Inference. Theorem: computing P(X = x) in a Bayesian network is NP-hard. This is not surprising, since a Bayesian network can simulate Boolean gates.

Proof. We reduce 3-SAT to Bayesian network computation. Assume we are given a 3-SAT instance: let q1,…,qn be propositions and φ1,…,φk be clauses, such that φi = li1 ∨ li2 ∨ li3, where each lij is a literal over q1,…,qn, and φ = φ1 ∧ … ∧ φk. We will construct a Bayesian network s.t. P(X = t) > 0 iff φ is satisfiable.

The network has root variables Q1, Q2, …, Qn; one node per clause, φ1, …, φk, whose parents are the three Q's appearing in that clause; a chain of AND nodes A1, A2, …; and an output node X. P(Qi = true) = 0.5, and P(φI = true | Qi, Qj, Ql) = 1 iff the values of Qi, Qj, Ql satisfy the clause φI. A1, A2, … are simple binary AND gates.

It is easy to check that: there is a polynomial number of variables; each local probability table can be described by a small table (at most 8 parameters); and P(X = true) > 0 if and only if there exists a satisfying assignment to Q1,…,Qn. Conclusion: this is a polynomial reduction from 3-SAT.
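
To make the reduction concrete, here is a minimal Python sketch (not from the slides) that builds the deterministic CPTs for a 3-CNF formula and checks the key property that P(X = true) equals the fraction of satisfying assignments. For brevity it collapses the chain of binary AND gates into a single AND over all clause nodes, and the clause encoding (signed integers, +i for qi and -i for ¬qi) is an illustrative choice.

```python
import itertools

def build_cpts(clauses, n_vars):
    """CPTs for the reduction: each Qi is a fair coin, each clause node is a
    deterministic OR of its literals, and X is a deterministic AND of all
    clause nodes (the slides use a chain of binary AND gates instead)."""
    cpts = {f"Q{i}": (lambda assign: 0.5) for i in range(1, n_vars + 1)}
    for j, clause in enumerate(clauses, start=1):
        cpts[f"Phi{j}"] = (lambda assign, c=clause:
                           1.0 if any(assign[f"Q{abs(l)}"] == (l > 0) for l in c)
                           else 0.0)
    cpts["X"] = (lambda assign, k=len(clauses):
                 1.0 if all(assign[f"Phi{j}"] for j in range(1, k + 1)) else 0.0)
    return cpts

def p_x_true(clauses, n_vars):
    """Brute-force P(X = true) by summing over the 2^n root assignments
    (exponential; only to verify P(X = true) = #satisfying / 2^n)."""
    count = 0
    for bits in itertools.product([False, True], repeat=n_vars):
        assign = {f"Q{i + 1}": b for i, b in enumerate(bits)}
        if all(any(assign[f"Q{abs(l)}"] == (l > 0) for l in c) for c in clauses):
            count += 1
    return count / 2 ** n_vars

# (q1 or q2 or not q3) and (not q1 or q2 or q3): satisfiable, so P(X = true) > 0.
clauses = [(1, 2, -3), (-1, 2, 3)]
cpts = build_cpts(clauses, n_vars=3)
print(p_x_true(clauses, n_vars=3))   # 0.75
```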

Note: this construction also shows that computing P(X = t) is harder than NP: 2^n · P(X = t) is the number of satisfying assignments to φ. Thus, the problem is #P-hard.

Hardness - Notes. We used deterministic relations in our construction; the same construction works if we use (1-ε, ε) instead of (1, 0) in each gate, for any ε < 0.5. Hardness does not mean we cannot solve inference: it implies that we cannot find a general procedure that works efficiently for all networks. For particular families of networks we can have provably efficient procedures, and we have seen such families in the course: HMMs, evolutionary trees.

Approximation. Until now, we examined exact computation. In many applications, approximations are sufficient. Example: P(X = x|e) = 0.3183098861838. Maybe P(X = x|e) ≈ 0.3 is a good enough approximation, e.g., if we take action only when P(X = x|e) > 0.5. Can we find good approximation algorithms?

Types of Approximations: Absolute error. An estimate q of P(X = x | e) has absolute error ε if P(X = x|e) - ε ≤ q ≤ P(X = x|e) + ε, or equivalently q - ε ≤ P(X = x|e) ≤ q + ε. Absolute error is not always what we want: if P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable, while if P(X = x | e) = 0.3, an absolute error of 0.001 is overly precise.

Types of Approximations: Relative error. An estimate q of P(X = x | e) has relative error ε if P(X = x|e)(1 - ε) ≤ q ≤ P(X = x|e)(1 + ε), or equivalently q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 - ε). The sensitivity of the approximation thus depends on the actual value of the desired result.
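
As a quick illustration of the difference (my own toy check, not from the slides), the same estimate can meet an absolute error bound while being useless in relative terms:

```python
def within_absolute(q, p, eps):
    """True if estimate q is within absolute error eps of the true value p."""
    return p - eps <= q <= p + eps

def within_relative(q, p, eps):
    """True if estimate q is within relative error eps of the true value p."""
    return p * (1 - eps) <= q <= p * (1 + eps)

# q = 0.001 as an estimate of p = 0.0001:
print(within_absolute(0.001, 0.0001, eps=0.001))  # True, yet q is 10x too large
print(within_relative(0.001, 0.0001, eps=0.5))    # False: off by a factor of 10
```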

Complexity. Exact inference is NP-hard; is approximate inference any easier? Recall the construction for exact inference: input, a 3-SAT formula φ; output, a BN such that P(X = t) > 0 iff φ is satisfiable.

Complexity: Relative Error. Theorem: given ε, finding an ε-relative error approximation is NP-hard. Suppose that q is an ε-relative error estimate of P(X = t) = 0. Then 0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0, namely q = 0. Thus, ε-relative error approximation and exact computation coincide at the value 0.

Complexity: Absolute Error. Theorem: if ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard.

Proof. Recall our construction: propositions Q1, Q2, …, Qn feed the clause nodes φ1, φ2, …, φk, which are combined by the AND gates A1, … into the output node X.

Proof (cont.). Suppose we can estimate with absolute error ε. Let p1 be such an estimate of P(Q1 = t | X = t); assign q1 = t if p1 > 0.5, else q1 = f. Let p2 be an estimate of P(Q2 = t | X = t, Q1 = q1); assign q2 = t if p2 > 0.5, else q2 = f. … Let pn be an estimate of P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1); assign qn = t if pn > 0.5, else qn = f.

Proof (cont.). Claim: if φ is satisfiable, then q1,…,qn is a satisfying assignment. Suppose φ is satisfiable; by induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi. Base case: if Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1, so p1 ≥ 1 - ε > 0.5, so q1 = t. If Q1 = f in all satisfying assignments, then likewise q1 = f. Otherwise, the statement holds for any choice of q1.

Proof (cont.). Induction step: if Qi+1 = t in all satisfying assignments with Q1 = q1, …, Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1, so pi+1 ≥ 1 - ε > 0.5, so qi+1 = t. If Qi+1 = f in all such satisfying assignments, then qi+1 = f. Otherwise, the statement holds for any choice of qi+1.

Proof (cont.). We can efficiently check whether q1,…,qn is a satisfying assignment (linear time): if it is, then φ is satisfiable; if it is not, then φ is not satisfiable. So suppose we have an approximation procedure with absolute error ε < 0.5. We can decide 3-SAT with n procedure calls: we generate an assignment as in the proof and check satisfiability of the resulting assignment in linear time. If a satisfying assignment exists, we showed that this procedure finds one, and if none exists, it certainly will not find one. Thus, approximation with absolute error is NP-hard.
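
The decoding argument can be written down as a short sketch. The `estimate` oracle below is hypothetical (it stands for the assumed ε-absolute-error approximation procedure); the brute-force stand-in is exponential and exists only so the example runs end to end.

```python
import itertools

def decode_assignment(estimate, n_vars):
    """Greedy decoding from the proof: fix Q1, then Q2, ... using the oracle's
    estimate of P(Qi = t | X = t, Q1 = q1, ..., Q_{i-1} = q_{i-1})."""
    evidence = {}
    for i in range(1, n_vars + 1):
        p_i = estimate(i, dict(evidence))   # assumed absolute error < 0.5
        evidence[i] = p_i > 0.5
    return evidence

def satisfies(assign, clauses):
    """Linear-time check: +i means qi, -i means not qi."""
    return all(any(assign[abs(l)] == (l > 0) for l in clause) for clause in clauses)

def exact_oracle(clauses, n_vars):
    """Brute-force stand-in for the approximation oracle (zero error)."""
    def estimate(var, evidence):
        num = den = 0
        for bits in itertools.product([False, True], repeat=n_vars):
            assign = {i + 1: b for i, b in enumerate(bits)}
            if any(assign[i] != v for i, v in evidence.items()):
                continue
            if not satisfies(assign, clauses):
                continue                    # X = t forces a satisfying assignment
            den += 1
            num += assign[var]
        return num / den if den else 0.0
    return estimate

clauses = [(1, 2, -3), (-1, 2, 3)]
q = decode_assignment(exact_oracle(clauses, 3), 3)
print(q, satisfies(q, clauses))             # a satisfying assignment -> True
```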

When can we hope to approximate? Two situations: "peaked" distributions, where improbable values can be ignored, and highly stochastic distributions, where "far" evidence is discarded (e.g., far markers in genetic linkage analysis).

Stochastic Simulation. Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn). What is the probability that a random sample <x1,…,xn> satisfies e? It is exactly P(e). We can view each sample as tossing a biased coin with probability P(e) of "heads".

Stochastic Sampling. Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate P(e) ≈ (1/N) Σi 1{e holds in x[i]}, where each indicator term is 1 or 0. The law of large numbers implies that as N grows, this estimate converges to P(e) with high probability.
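
A minimal sketch of this estimator (the sampler and the evidence predicate are stand-in callables I am assuming, not the slides' notation):

```python
import random

def estimate_p_e(sample_joint, holds_e, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P(e): average the 1-or-0 indicator of whether
    each sample drawn from the joint satisfies the evidence e."""
    rng = random.Random(seed)
    hits = sum(holds_e(sample_joint(rng)) for _ in range(n_samples))
    return hits / n_samples

# Toy joint: two independent fair coins; e = "both are heads", so P(e) = 0.25.
print(estimate_p_e(sample_joint=lambda rng: (rng.random() < 0.5, rng.random() < 0.5),
                   holds_e=lambda x: x[0] and x[1]))
```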

Sampling a Bayesian Network. If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it? Yes: sample according to the structure of the network, sampling each variable given its already-sampled parents.

Logic sampling: example. Five slides step through the Burglary/Earthquake/Alarm/Call/Radio network (CPTs from the slides: P(b) = 0.03, P(e) = 0.001, P(a | B, E) ∈ {0.98, 0.7, 0.4, 0.01}, P(c | A) ∈ {0.8, 0.05}, P(r | E) ∈ {0.3, 0.001}). A single sample is built root-to-leaf: B is drawn from P(B), then E from P(E), then A from P(A | B, E), then C from P(C | A), and finally R from P(R | E), yielding one complete instantiation (b, e, a, c, r).

Logic Sampling. Let X1, …, Xn be an order of the variables consistent with arc direction. For i = 1, …, n: sample xi from P(Xi | pai). (Note: since Pai ⊆ {X1,…,Xi-1}, values have already been assigned to the parents.) Return x1, …, xn.
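
Here is a runnable sketch of logic (forward) sampling. The dictionary representation of the CPTs is my own illustrative format; the numbers are taken from the tables in the example slides above, although the mapping of rows to parent combinations is a guess from the garbled layout.

```python
import random

def logic_sample(order, parents, cpt, rng):
    """One forward sample: `order` lists the variables parents-first, and
    cpt[X] maps a tuple of parent values to P(X = True | parents)."""
    sample = {}
    for x in order:
        p_true = cpt[x][tuple(sample[u] for u in parents[x])]
        sample[x] = rng.random() < p_true
    return sample

# Burglary/Earthquake/Alarm/Call/Radio example network.
ORDER = ["B", "E", "A", "C", "R"]
PARENTS = {"B": (), "E": (), "A": ("B", "E"), "C": ("A",), "R": ("E",)}
CPT = {
    "B": {(): 0.03},
    "E": {(): 0.001},
    "A": {(True, True): 0.98, (True, False): 0.7,
          (False, True): 0.4, (False, False): 0.01},
    "C": {(True,): 0.8, (False,): 0.05},
    "R": {(True,): 0.3, (False,): 0.001},
}

rng = random.Random(0)
print(logic_sample(ORDER, PARENTS, CPT, rng))
```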

Logic Sampling. Sampling a complete instance is linear in the number of variables, regardless of the structure of the network. However, if P(e) is small, we need many samples to get a decent estimate.

Can we sample from P(X1,…,Xn | e)? If the evidence is in the roots of the network, we can proceed as before, fixing the roots to their observed values. If the evidence is in the leaves of the network, we have a problem: our sampling method proceeds according to the order of the nodes in the graph, so we must retain only those samples that match e, and this may be a rare event.
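
To see the problem concretely, here is rejection sampling built on the `logic_sample`, `ORDER`, `PARENTS`, and `CPT` definitions from the sketch above (again my own illustrative code, not the slides'): it simply discards every sample inconsistent with e, so the number of samples kept collapses when P(e) is small.

```python
import random

def rejection_query(evidence, query, n_samples, rng):
    """Estimate P(query = True | evidence) by keeping only matching samples."""
    kept = hits = 0
    for _ in range(n_samples):
        s = logic_sample(ORDER, PARENTS, CPT, rng)
        if all(s[var] == val for var, val in evidence.items()):
            kept += 1
            hits += s[query]
    return (hits / kept if kept else None), kept

# P(B = true | C = true); `kept` shows how many of the 20,000 samples survived.
est, kept = rejection_query({"C": True}, "B", 20_000, random.Random(1))
print(est, kept)
```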

Likelihood Weighting. Can we ensure that all of our samples are used? One wrong (but fixable) approach: when we need to sample a variable that is assigned a value by e, use that specified value. For example, in a two-node network X → Y where we know Y = 1: sample X from P(X), then take Y = 1. This is NOT a sample from P(X, Y | Y = 1)!

Likelihood Weighting. Problem: these samples of X are from P(X). Solution: penalize samples in which P(Y = 1 | X) is small. We now sample as follows: let x[i] be a sample from P(X), and let w[i] = P(Y = 1 | X = x[i]).

Likelihood weighting: example. Five slides repeat the Burglary/Earthquake network, now with evidence on Alarm and Radio. B and E are sampled from their priors as before; the observed Alarm value is clamped and contributes a factor of 0.6 to the sample's weight (its conditional probability given the sampled B and E); C is sampled from P(C | A); and the observed Radio value contributes a further factor of 0.3, giving a final sample weight of 0.6 × 0.3 = 0.18.

Likelihood Weighting. Let X1, …, Xn be an order of the variables consistent with arc direction. Set w = 1. For i = 1, …, n: if Xi = xi has been observed, set w ← w · P(Xi = xi | pai); else sample xi from P(Xi | pai). Return x1, …, xn and w.
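
A runnable sketch of this procedure, reusing the `ORDER`/`PARENTS`/`CPT` representation from the forward-sampling example above (that representation, and the normalization of the weights into a query estimate, are my additions):

```python
import random

def weighted_sample(evidence, rng):
    """One likelihood-weighted sample: observed variables are clamped to their
    evidence values and multiply P(xi | pa_i) into the weight; the others are
    sampled from their CPTs as in logic sampling."""
    sample, w = {}, 1.0
    for x in ORDER:
        p_true = CPT[x][tuple(sample[u] for u in PARENTS[x])]
        if x in evidence:
            sample[x] = evidence[x]
            w *= p_true if evidence[x] else 1.0 - p_true   # w <- w * P(xi | pa_i)
        else:
            sample[x] = rng.random() < p_true
    return sample, w

def lw_query(evidence, query, n_samples, rng):
    """Estimate P(query = True | evidence) as a weight-normalized average."""
    num = den = 0.0
    for _ in range(n_samples):
        s, w = weighted_sample(evidence, rng)
        den += w
        num += w * s[query]
    return num / den if den else None

# Same query as before, but now every sample contributes.
print(lw_query({"C": True}, "B", 20_000, random.Random(2)))
```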

Likelihood Weighting. Why does this make sense? When N is large, we expect roughly N·P(X = x) samples with x[i] = x. Thus, in the two-node example, Σi w[i]·1{x[i] = x} ≈ N·P(X = x)·P(Y = 1 | X = x) = N·P(X = x, Y = 1); dividing by the total weight Σi w[i] ≈ N·P(Y = 1) gives P(X = x | Y = 1).

Likelihood Weighting. What can we say about the quality of the answer? Intuitively, the weight of a sample reflects its probability given the evidence, and we need to collect enough weight mass for the samples to provide an accurate answer. Another factor is the "extremeness" of the CPDs. Theorem (Dagum & Luby, AIJ 1993): if P(Xi | Pai) ∈ [l, u] for all local probability tables and the number of samples N is sufficiently large (the required N depends on l, u, the desired accuracy ε, and the confidence δ), then with probability 1 - δ the estimate is an ε-relative error approximation.

END