Inference III: Approximate Inference


Inference III: Approximate Inference Slides by Nir Friedman.

Global conditioning Fixing the values of a set of variables (e.g., A and B) at the beginning of the summation can decrease the size of the tables formed by variable elimination; space is saved at the cost of time, since the elimination must be repeated for each assignment. Special case: choose to fix a set of nodes that "breaks all loops"; this method is called cutset conditioning. Alternatively, choose to fix some variables from the largest cliques in a clique tree.

Approximation Until now, we examined exact computation. In many applications, an approximation is sufficient. Example: P(X = x|e) = 0.3183098861838. Maybe P(X = x|e) ≈ 0.3 is a good enough approximation, e.g., if we take action only when P(X = x|e) > 0.5. Can we find good approximation algorithms?

Types of Approximations Absolute error An estimate q of P(X = x | e) has absolute error ε if P(X = x|e) - ε ≤ q ≤ P(X = x|e) + ε, or equivalently q - ε ≤ P(X = x|e) ≤ q + ε. Absolute error is not always what we want: if P(X = x | e) = 0.0001, then an absolute error of 0.001 is unacceptable; if P(X = x | e) = 0.3, then an absolute error of 0.001 is overly precise.

Types of Approximations Relative error An estimate q of P(X = x | e) has relative error ε if P(X = x|e)(1 - ε) ≤ q ≤ P(X = x|e)(1 + ε), or equivalently q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 - ε). The sensitivity of the approximation depends on the actual value of the desired result.
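
As a concrete illustration of the two error notions, here is a minimal Python sketch (the function names are my own, not from the slides) that checks whether an estimate q is an ε-absolute-error or ε-relative-error approximation of a true value p:

```python
def within_absolute_error(p, q, eps):
    """True if q has absolute error eps w.r.t. p: p - eps <= q <= p + eps."""
    return p - eps <= q <= p + eps

def within_relative_error(p, q, eps):
    """True if q has relative error eps w.r.t. p: p(1 - eps) <= q <= p(1 + eps)."""
    return p * (1 - eps) <= q <= p * (1 + eps)

# The slides' point: absolute error ignores the scale of p.
print(within_absolute_error(0.0001, 0.0011, 0.001))  # True, yet q is 11x too large
print(within_relative_error(0.0001, 0.0011, 0.001))  # False: relative error catches it
print(within_absolute_error(0.3, 0.3005, 0.001))     # True, and overly precise here
```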

Complexity Recall that exact inference is NP-hard. Is approximate inference any easier? Construction for exact inference: Input: a 3-SAT formula φ. Output: a BN such that P(X = t) > 0 iff φ is satisfiable.

Complexity: Relative Error Suppose that q is a relative error estimate of P(X = t). If φ is not satisfiable, then P(X = t) = 0. Hence, 0 = P(X = t)(1 - ε) ≤ q ≤ P(X = t)(1 + ε) = 0, namely q = 0. Thus, if q > 0, then φ is satisfiable. An immediate consequence: Thm: Given ε, finding an ε-relative error approximation is NP-hard.

Complexity: Absolute error We can find absolute error approximations to P(X = x) with high probability (via sampling); we will see such algorithms shortly. However, once we have evidence, the problem is harder. Thm: If ε < 0.5, then finding an estimate of P(X = x|e) with ε absolute error is NP-hard.

Proof Recall our construction: [figure: the 3-SAT construction, with variable nodes Q1, …, Qn, clause nodes 1, …, k, intermediate nodes, and the output node X].

Proof (cont.) Suppose we can estimate with ε absolute error. Let p1 ≈ P(Q1 = t | X = t); assign q1 = t if p1 > 0.5, else q1 = f. Let p2 ≈ P(Q2 = t | X = t, Q1 = q1); assign q2 = t if p2 > 0.5, else q2 = f. … Let pn ≈ P(Qn = t | X = t, Q1 = q1, …, Qn-1 = qn-1); assign qn = t if pn > 0.5, else qn = f.

Proof (cont.) Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment. Suppose φ is satisfiable. By induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi. Base case: If Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1, so p1 ≥ 1 - ε > 0.5, and hence q1 = t. If Q1 = f in all satisfying assignments, then q1 = f by the same argument. Otherwise, the statement holds for any choice of q1.

Proof (cont.) Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment. Induction step: If Qi+1 = t in all satisfying assignments with Q1 = q1, …, Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1, so pi+1 ≥ 1 - ε > 0.5, and hence qi+1 = t. If Qi+1 = f in all such satisfying assignments, then qi+1 = f.

Proof (cont.) We can efficiently check whether q1, …, qn is a satisfying assignment (linear time). If it is, then φ is satisfiable; if it is not, then φ is not satisfiable. Thus, given an approximation procedure with ε absolute error, we can decide 3-SAT with n procedure calls, so approximation is NP-hard.

When can we hope to approximate? Two situations: "peaked" distributions, where improbable values can be ignored; and highly stochastic distributions, where "far" evidence can be discarded.

“Peaked” distributions If the distribution is “peaked”, then most of the mass is on a few instances. If we can focus on these instances, we can ignore the rest.

Stochasticity & Approximations Consider a chain X1 → X2 → … → Xn with P(Xi+1 = t | Xi = t) = 1 - ε and P(Xi+1 = f | Xi = f) = 1 - ε. Computing the probability of Xn given X1, we get P(Xn = t | X1 = t) = ½(1 + (1 - 2ε)^(n-1)).

[Plot of P(Xn = t | X1 = t) as a function of ε (from 0.05 to 0.45), for n = 5, 10, 20; as n grows, the curves flatten toward 0.5.]
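
A small sketch (my own, not from the slides) that reproduces the behaviour of the plot by raising the chain's transition matrix to a power; it assumes the symmetric two-state chain defined above.

```python
import numpy as np

def p_xn_given_x1(eps, n):
    """P(Xn = t | X1 = t) for the symmetric chain with flip probability eps."""
    T = np.array([[1 - eps, eps],    # row: current state t
                  [eps, 1 - eps]])   # row: current state f
    Tn = np.linalg.matrix_power(T, n - 1)
    return Tn[0, 0]                  # start in t, end in t after n-1 steps

for n in (5, 10, 20):
    print(n, [round(p_xn_given_x1(e, n), 3) for e in (0.05, 0.15, 0.3, 0.45)])
# As n grows the values approach 0.5: the chain mixes and X1 is "forgotten".
# Closed form: 0.5 + 0.5 * (1 - 2*eps)**(n - 1).
```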

Stochastic Processes This behavior of a chain (a Markov process) is called mixing. We will return to it as a tool in approximation. In general networks there is similar behavior: if probabilities are far from 0 and 1, then the effect of "far" evidence vanishes (and so it can be discarded in approximations).

Bounded conditioning Fixing the values of A & B: by examining only the probable assignments of A & B, we perform several simple computations instead of one complex one.

Bounded conditioning Choose A and B so that P(Y, e | a, b) can be computed easily, e.g., a cycle cutset. Search for highly probable assignments to A, B: Option 1 --- select a, b with high P(a, b). Option 2 --- select a, b with high P(a, b | e). We need to search for such high-mass assignments, and that can be hard.

Bounded Conditioning Advantages: combines exact inference within the approximation; continuous (anytime): more time can be used to examine more cases; bounds: the unexamined mass is used to compute error bars. Possible problems: P(a, b) is prior mass, not posterior. If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments.
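
The following is a minimal sketch of the idea on a toy network of my own ((A, B) is the cutset, Y the query, E the evidence); it only illustrates how restricting attention to high-prior cutset assignments trades accuracy for work, and is not the slides' algorithm in full.

```python
from itertools import product

# Toy network: P(A) P(B) P(Y | A, B) P(E | Y); all variables binary.
P_A = {True: 0.9, False: 0.1}
P_B = {True: 0.8, False: 0.2}
P_Y = {(a, b): (0.95 if a and b else 0.3) for a, b in product([True, False], repeat=2)}
P_E = {True: 0.7, False: 0.1}          # P(e | Y)

def joint_y_e_given_cutset(a, b, y):
    """Exact P(Y = y, e | a, b): easy once the cutset is fixed."""
    py = P_Y[(a, b)] if y else 1 - P_Y[(a, b)]
    return py * P_E[y]

def bounded_conditioning(k):
    """Approximate P(Y = True | e) using only the k most probable cutset assignments."""
    assignments = sorted(product([True, False], repeat=2),
                         key=lambda ab: P_A[ab[0]] * P_B[ab[1]], reverse=True)[:k]
    num = den = 0.0
    for a, b in assignments:
        prior = P_A[a] * P_B[b]
        num += prior * joint_y_e_given_cutset(a, b, True)
        den += prior * sum(joint_y_e_given_cutset(a, b, y) for y in (True, False))
    return num / den

for k in (1, 2, 4):                    # k = 4 enumerates everything: exact answer
    print(k, round(bounded_conditioning(k), 4))
```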

Network Simplifications In these approaches, we try to replace the original network with a simpler one; the resulting network allows fast exact methods.

Network Simplifications Typical simplifications: remove parts of the network; remove edges; reduce the number of values (value abstraction); replace a sub-network with a simpler one (model abstraction). These simplifications are often made w.r.t. the particular evidence and query.

Stochastic Simulation Suppose we can sample instances <x1,…,xn> according to P(X1,…,Xn). What is the probability that a random sample <x1,…,xn> satisfies e? This is exactly P(e). We can view each sample as tossing a biased coin with probability P(e) of "heads".

Stochastic Sampling Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate P(e) by the fraction of samples that satisfy e. The law of large numbers implies that as N grows, this estimate converges to P(e) with high probability.
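
A tiny sketch (my own) of the biased-coin view: each sample either satisfies e or not, and the empirical frequency converges to P(e) by the law of large numbers. Here P(e) is simulated directly as a coin with a known bias so that the convergence is easy to see.

```python
import random

def estimate_p_e(p_e, n_samples, seed=0):
    """Treat each sample as a coin toss that comes up 'heads' (satisfies e) w.p. p_e."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_e for _ in range(n_samples))
    return hits / n_samples

for n in (100, 10_000, 1_000_000):
    print(n, estimate_p_e(0.05, n))   # estimates approach the true value 0.05
```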

Sampling a Bayesian Network If P(X1,…,Xn) is represented by a Bayesian network, can we efficiently sample from it? Idea: sample according to the structure of the network: write the distribution using the chain rule, and then sample each variable given its parents.

Logic sampling example (Burglary network: Burglary, Earthquake, Alarm, Call, Radio): variables are sampled in an order consistent with the arcs. First B is sampled from P(B) (P(b) = 0.03), then E from P(E) (P(e) = 0.001), then A from P(A | B, E), then C from P(C | A), and finally R from P(R | E), yielding one complete sample over (B, E, A, C, R). [The original slides animate this step by step using the network's CPTs.]

Logic Sampling Let X1, …, Xn be an order of the variables consistent with arc direction. For i = 1, …, n: sample xi from P(Xi | pai). (Note: since Pai ⊆ {X1,…,Xi-1}, we have already assigned values to the parents.) Return x1, …, xn.
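
A runnable Python sketch of logic (forward) sampling on the Burglary/Earthquake/Alarm example; the CPT numbers follow the slides where legible, and the mapping of parent configurations is my own assumption.

```python
import random

rng = random.Random(0)

def bern(p):
    return rng.random() < p

def logic_sample():
    """Sample one complete instance in an order consistent with arc direction."""
    b = bern(0.03)                      # P(b)
    e = bern(0.001)                     # P(e)
    p_a = {(True, True): 0.98, (True, False): 0.7,
           (False, True): 0.4, (False, False): 0.01}[(b, e)]   # P(a | B, E)
    a = bern(p_a)
    c = bern(0.8 if a else 0.05)        # P(c | A)
    r = bern(0.3 if e else 0.001)       # P(r | E)
    return dict(B=b, E=e, A=a, C=c, R=r)

# Estimating P(B = true | C = true) by rejection: keep only samples consistent with e.
N = 200_000
kept = [s for s in (logic_sample() for _ in range(N)) if s["C"]]
print(len(kept) / N, sum(s["B"] for s in kept) / len(kept))
```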

Logic Sampling Sampling a complete instance is linear in the number of variables, regardless of the structure of the network. However, if P(e) is small, we need many samples to get a decent estimate.

Can we sample from P(X1,…,Xn | e)? If the evidence is in the roots of the network, easily. If the evidence is in the leaves, we have a problem: our sampling method proceeds according to the order of nodes in the graph. Note that we can use arc reversal to make the evidence nodes roots, but in some networks this will create exponentially large tables...

Likelihood Weighting Can we ensure that all of our samples satisfy e? One simple solution: when we need to sample a variable that is assigned a value by e, use the specified value. For example: we know Y = 1. Sample X from P(X), then take Y = 1. Is this a sample from P(X, Y | Y = 1)?

Likelihood Weighting Problem: these samples of X are from P(X). Solution: penalize samples in which P(Y = 1|X) is small. We now sample as follows: let x[i] be a sample from P(X), and let w[i] be P(Y = 1 | X = x[i]).

Likelihood Weighting Why does this make sense? When N is large, we expect to sample about N·P(X = x) samples with x[i] = x. Thus, the weighted estimate Σi w[i]·1{x[i] = x} / Σi w[i] ≈ P(X = x)P(Y = 1|X = x) / Σj P(X = j)P(Y = 1|X = j) = P(X = x | Y = 1). Using different notation to express the same idea: define pj = P(Y = 1|X = j) and let Nj be the number of samples where X = j; then the estimate is Nx·px / Σj Nj·pj.
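
A numerical check (my own sketch, with made-up CPDs) of this argument on the two-node network X → Y: sample X from P(X), weight each sample by P(Y = 1 | X), and compare the weighted frequency with the exact posterior P(X = x | Y = 1).

```python
import random

rng = random.Random(1)
P_X = [0.2, 0.5, 0.3]            # P(X = j), j in {0, 1, 2}
P_Y1 = [0.9, 0.1, 0.5]           # p_j = P(Y = 1 | X = j)

N = 100_000
num = [0.0, 0.0, 0.0]
for _ in range(N):
    x = rng.choices(range(3), weights=P_X)[0]   # sample x[i] from P(X)
    num[x] += P_Y1[x]                           # accumulate weight w[i] = P(Y=1 | x[i])
total = sum(num)

exact_z = sum(P_X[j] * P_Y1[j] for j in range(3))
for j in range(3):
    # weighted estimate vs. exact P(X = j | Y = 1)
    print(j, round(num[j] / total, 4), round(P_X[j] * P_Y1[j] / exact_z, 4))
```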

Likelihood weighting example on the same Burglary network, with evidence on A and R: the unobserved variables B, E, C are sampled from their CPDs as in logic sampling, while the evidence variables A and R are fixed to their observed values, and the sample's weight is the product of the corresponding CPD entries (in the slides' run, 0.6 × 0.3). [The original slides animate this step by step.]

Likelihood Weighting Let X1, …, Xn be an order of the variables consistent with arc direction. Set w = 1. For i = 1, …, n: if Xi = xi has been observed, set w ← w · P(Xi = xi | pai); else sample xi from P(Xi | pai). Return x1, …, xn and w.
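
A runnable sketch of the likelihood-weighting procedure above, again on the Burglary example; the `ORDER`/`PARENTS`/`cpd` encoding is my own, and the CPT numbers follow the slides where legible.

```python
import random

rng = random.Random(0)

# Topological order and CPDs: P(var = True | parent values).
ORDER = ["B", "E", "A", "C", "R"]
PARENTS = {"B": [], "E": [], "A": ["B", "E"], "C": ["A"], "R": ["E"]}

def cpd(var, parent_vals):
    if var == "B": return 0.03
    if var == "E": return 0.001
    if var == "A": return {(True, True): 0.98, (True, False): 0.7,
                           (False, True): 0.4, (False, False): 0.01}[tuple(parent_vals)]
    if var == "C": return 0.8 if parent_vals[0] else 0.05
    if var == "R": return 0.3 if parent_vals[0] else 0.001

def weighted_sample(evidence):
    x, w = {}, 1.0
    for var in ORDER:
        p_true = cpd(var, [x[p] for p in PARENTS[var]])
        if var in evidence:
            x[var] = evidence[var]                        # clamp to observed value
            w *= p_true if evidence[var] else 1 - p_true  # w <- w * P(xi | pai)
        else:
            x[var] = rng.random() < p_true                # sample xi from P(Xi | pai)
    return x, w

# Estimate P(B = true | A = true, R = true) from weighted samples.
evidence = {"A": True, "R": True}
samples = [weighted_sample(evidence) for _ in range(100_000)]
print(sum(w for x, w in samples if x["B"]) / sum(w for _, w in samples))
```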

Likelihood Weighting What can we say about the quality of the answer? Intuitively, the weights of the samples reflect their probability given the evidence; we need to collect a certain amount of mass. Another factor is the "extremeness" of the CPDs. Thm: If P(Xi | Pai) ∈ [l, u] for all CPDs, and enough samples are taken, then with probability 1 - δ the estimate is an ε relative error approximation.

END