PGM 2002/3 – Tirgul 6: Approximate Inference: Sampling
Approximation
- Until now, we examined exact computation.
- In many applications, an approximation is sufficient.
- Example: maybe an estimate P(X = x|e) ≈ 0.3 is a good enough approximation, e.g., if we take action only when P(X = x|e) > 0.5.
- Can we find good approximation algorithms?
Types of Approximations: Absolute Error
- An estimate q of P(X = x | e) has absolute error ε if
    P(X = x|e) − ε ≤ q ≤ P(X = x|e) + ε,
  equivalently
    q − ε ≤ P(X = x|e) ≤ q + ε.
- Absolute error is not always what we want:
  - If P(X = x | e) is tiny (on the order of ε itself), then an absolute error of ε is unacceptable.
  - If P(X = x | e) = 0.3, then the same absolute error ε is overly precise.
[Figure: the interval q ± ε around q on the number line from 0 to 1.]
Types of Approximations: Relative Error
- An estimate q of P(X = x | e) has relative error ε if
    P(X = x|e)(1 − ε) ≤ q ≤ P(X = x|e)(1 + ε),
  equivalently
    q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 − ε).
- Sensitivity of the approximation depends on the actual value of the desired result.
[Figure: the interval [q/(1 + ε), q/(1 − ε)] around q on the number line from 0 to 1.]
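A minimal sketch (not from the slides) contrasting the two error notions; the helper names are hypothetical:

```python
def within_absolute_error(q: float, p: float, eps: float) -> bool:
    """True iff estimate q is within absolute error eps of p:
    p - eps <= q <= p + eps."""
    return p - eps <= q <= p + eps

def within_relative_error(q: float, p: float, eps: float) -> bool:
    """True iff estimate q is within relative error eps of p:
    p * (1 - eps) <= q <= p * (1 + eps)."""
    return p * (1 - eps) <= q <= p * (1 + eps)

# The same eps behaves very differently at different scales:
print(within_absolute_error(0.001, 0.0001, 0.001))  # True, yet q is 10x the truth
print(within_relative_error(0.001, 0.0001, 0.001))  # False: relative error catches it
```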
Complexity
- Recall, exact inference is NP-hard.
- Is approximate inference any easier?
- Construction for exact inference:
  - Input: a 3-SAT formula φ
  - Output: a BN such that P(X = t) > 0 iff φ is satisfiable
Complexity: Relative Error
- Suppose q is an ε-relative error estimate of P(X = t).
- If φ is not satisfiable, then P(X = t) = 0, so
    0 = P(X = t)(1 − ε) ≤ q ≤ P(X = t)(1 + ε) = 0.
- Thus, if q > 0, then φ is satisfiable.
- An immediate consequence:
  Thm: Given ε, finding an ε-relative error approximation is NP-hard.
Complexity: Absolute Error
- We can find absolute error approximations to P(X = x).
  - We will see such algorithms shortly.
- However, once we have evidence, the problem is harder.
  Thm: If ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard.
Proof
- Recall our construction:
[Figure: the 3-SAT network — query variables Q1, …, Qn feed clause variables φ1, φ2, φ3, …, φk, which are conjoined through deterministic AND variables A1, A2, … down to the final variable X.]
Proof (cont.)
- Suppose we can estimate conditional probabilities with absolute error ε < 0.5.
- Let p1 be such an estimate of P(Q1 = t | X = t).
  Assign q1 = t if p1 > 0.5, else q1 = f.
- Let p2 be such an estimate of P(Q2 = t | X = t, Q1 = q1).
  Assign q2 = t if p2 > 0.5, else q2 = f.
- …
- Let pn be such an estimate of P(Qn = t | X = t, Q1 = q1, …, Qn−1 = qn−1).
  Assign qn = t if pn > 0.5, else qn = f.
Proof (cont.)
Claim: if φ is satisfiable, then q1, …, qn is a satisfying assignment.
- Suppose φ is satisfiable. By induction on i, there is a satisfying assignment with Q1 = q1, …, Qi = qi.
- Base case:
  - If Q1 = t in all satisfying assignments, then P(Q1 = t | X = t) = 1, so p1 ≥ 1 − ε > 0.5, hence q1 = t.
  - If Q1 = f in all satisfying assignments, then symmetrically q1 = f.
  - Otherwise, the statement holds for either choice of q1.
Proof (cont.)
- Induction step:
  - If Qi+1 = t in all satisfying assignments s.t. Q1 = q1, …, Qi = qi, then P(Qi+1 = t | X = t, Q1 = q1, …, Qi = qi) = 1, so pi+1 ≥ 1 − ε > 0.5, hence qi+1 = t.
  - If Qi+1 = f in all such satisfying assignments, then symmetrically qi+1 = f.
  - Otherwise, the statement holds for either choice of qi+1.
Proof (cont.)
- We can efficiently check whether q1, …, qn is a satisfying assignment (linear time).
  - If it is, then φ is satisfiable.
  - If it is not, then by the claim above, φ is not satisfiable.
- So, given an approximation procedure with absolute error ε < 0.5, we can decide 3-SAT with n procedure calls.
- ⇒ ε-absolute error approximation (with evidence) is NP-hard.
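A sketch of the reduction's decision procedure, assuming a hypothetical oracle `approx(i, prefix)` that returns an estimate of P(Qi = t | X = t, Q1 = prefix[0], …) with absolute error ε < 0.5, and a linear-time checker `satisfies(phi, assignment)`:

```python
def decide_3sat(phi, n, approx, satisfies):
    """Decide satisfiability of the 3-SAT formula phi over n variables,
    given an absolute-error (< 0.5) approximation oracle for the BN above."""
    assignment = []
    for i in range(1, n + 1):
        p_i = approx(i, assignment)      # estimate of P(Q_i = t | X = t, prefix)
        assignment.append(p_i > 0.5)     # q_i = t iff the estimate exceeds 0.5
    # By the inductive claim: if phi is satisfiable, `assignment` satisfies it.
    return satisfies(phi, assignment)
```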
Search Algorithms
- Idea: search for high-probability instances.
- Suppose x[1], …, x[N] are instances with high mass. We can approximate:
    P(e) ≈ Σi P(e | x[i]) P(x[i])
- If x[i] is a complete instantiation, then P(e | x[i]) is 0 or 1.
Search Algorithms (cont.)
- Instances that do not satisfy e do not play a role in the approximation.
- We need to focus the search on instances that do satisfy e.
- Clearly, in some cases this is hard (e.g., the construction from our NP-hardness result).
Stochastic Simulation
- Suppose we can sample instances according to P(X1, …, Xn).
- What is the probability that a random sample satisfies e? This is exactly P(e).
- We can view each sample as tossing a biased coin with probability P(e) of “Heads”.
Stochastic Sampling
- Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate
    P(e) ≈ (number of samples satisfying e) / N
- The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability.
- How many samples do we need for a reliable estimate? Use Chernoff bounds for binomial distributions.
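As a concrete sanity check, a small sketch computing the sample size required by the two-sided Chernoff–Hoeffding bound P(|p̂ − P(e)| ≥ ε) ≤ 2e^(−2Nε²):

```python
import math

def samples_needed(eps: float, delta: float) -> int:
    """Smallest N such that the empirical frequency p_hat satisfies
    P(|p_hat - P(e)| >= eps) <= delta, via the Hoeffding/Chernoff bound
    P(|p_hat - p| >= eps) <= 2 * exp(-2 * N * eps**2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(samples_needed(eps=0.01, delta=0.05))  # 18445 samples for ±0.01 at 95% confidence
```

Note that this gives an absolute-error guarantee; for small P(e), a relative-error guarantee needs many more samples.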
Sampling a Bayesian Network
- If P(X1, …, Xn) is represented by a Bayesian network, can we efficiently sample from it?
- Idea: sample according to the structure of the network.
  - Write the distribution using the chain rule, and then sample each variable given its parents.
Logic Sampling: Example
[Slides: a step-by-step logic-sampling walkthrough on the Burglary network (Burglary, Earthquake, Alarm, Call, Radio), with CPTs such as P(b) = 0.03. Variables are sampled in topological order — first B and E from their priors, then A given B and E, then C given A, then R given E — yielding one complete sample (b, e, a, c, r).]
Logic Sampling
- Let X1, …, Xn be an order of the variables consistent with arc direction.
- for i = 1, …, n do
    sample xi from P(Xi | pai)
  (Note: since Pai ⊆ {X1, …, Xi−1}, we have already assigned values to the parents.)
- return x1, …, xn
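A minimal Python sketch of logic sampling for binary variables, under an assumed (hypothetical) network representation: `order` is a topological ordering, `parents[X]` lists X's parents, and `cpt[X]` maps a tuple of parent values to P(X = t | parents):

```python
import random

def logic_sample(order, parents, cpt):
    """Draw one complete instantiation in topological order."""
    x = {}
    for X in order:
        # Parents of X appear earlier in `order`, so they are already sampled.
        p_true = cpt[X][tuple(x[U] for U in parents[X])]
        x[X] = random.random() < p_true   # sample X given its parents
    return x

def estimate_p_of_e(order, parents, cpt, evidence, n_samples=10000):
    """Estimate P(e) as the fraction of samples consistent with the evidence."""
    hits = sum(
        all(s[X] == v for X, v in evidence.items())
        for s in (logic_sample(order, parents, cpt) for _ in range(n_samples))
    )
    return hits / n_samples
```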
Logic Sampling
- Sampling a complete instance is linear in the number of variables,
  regardless of the structure of the network.
- However, if P(e) is small, we need many samples to get a decent estimate.
Can we sample from P(X1, …, Xn | e)?
- If the evidence is in the roots of the network: easily.
- If the evidence is in the leaves of the network, we have a problem.
  - Our sampling method proceeds according to the order of the nodes in the graph.
- Note: we can use arc reversal to make evidence nodes roots.
  - In some networks, however, this will create exponentially large tables...
Likelihood Weighting
- Can we ensure that all of our samples satisfy e?
- One simple solution: when we need to sample a variable that is assigned a value by e, use the specified value.
- For example (network X → Y): we know Y = 1.
  - Sample X from P(X), then take Y = 1.
  - Is this a sample from P(X, Y | Y = 1)?
Likelihood Weighting
- Problem: these samples of X are from P(X), not from P(X | Y = 1).
- Solution: penalize samples in which P(Y = 1|X) is small.
- We now sample as follows (network X → Y):
  - Let x[i] be a sample from P(X).
  - Let w[i] be P(Y = 1 | X = x[i]).
Likelihood Weighting
- Why does this make sense?
- When N is large, we expect to see about N·P(X = x) samples with x[i] = x.
- Thus,
    Σi w[i]·1{x[i] = x} ≈ N·P(X = x)·P(Y = 1 | X = x) = N·P(X = x, Y = 1)
- When we normalize (divide by Σi w[i] ≈ N·P(Y = 1)), we get an approximation of the conditional probability P(X = x | Y = 1).
Likelihood Weighting: Example
[Slides: a step-by-step likelihood-weighting walkthrough on the same Burglary network. Unobserved variables are sampled from their CPTs as in logic sampling; observed variables are fixed to their evidence values, and the sample weight accumulates the corresponding CPT entries (in the walkthrough, w = 0.6 · 0.3 for the two observed variables).]
Likelihood Weighting
- Let X1, …, Xn be an order of the variables consistent with arc direction.
- w = 1
- for i = 1, …, n do
    if Xi = xi has been observed:
      w ← w · P(Xi = xi | pai)
    else:
      sample xi from P(Xi | pai)
- return x1, …, xn, and w
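A matching sketch of likelihood weighting, using the same hypothetical binary-network representation as the logic-sampling sketch above:

```python
import random

def weighted_sample(order, parents, cpt, evidence):
    """Draw one sample consistent with the evidence, plus its likelihood weight."""
    x, w = {}, 1.0
    for X in order:
        p_true = cpt[X][tuple(x[U] for U in parents[X])]
        if X in evidence:
            x[X] = evidence[X]                            # clamp observed variables
            w *= p_true if evidence[X] else 1.0 - p_true  # w *= P(X = e_X | pa_X)
        else:
            x[X] = random.random() < p_true               # sample unobserved variables
    return x, w

def lw_estimate(order, parents, cpt, evidence, query_var, n_samples=10000):
    """Estimate P(query_var = t | evidence) as a ratio of weighted counts."""
    num = den = 0.0
    for _ in range(n_samples):
        x, w = weighted_sample(order, parents, cpt, evidence)
        den += w
        num += w * x[query_var]   # x[query_var] is a bool: True counts as 1
    return num / den
```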
Importance Sampling
- A general method for estimating expectations under P(X) when we cannot sample from P(X).
- Idea: choose an approximating distribution Q(X), sample from it, and weight each sample by
    w(X) = P(X)/Q(X).
- If we could generate samples from P(X):
    E_P[f(X)] ≈ (1/M) Σm f(x[m]),  with x[m] ~ P(X)
- Now that we generate the samples from Q(X):
    E_P[f(X)] = Σx f(x) P(x) = Σx f(x) w(x) Q(x) = E_Q[f(X) w(X)] ≈ (1/M) Σm f(x[m]) w(x[m]),  with x[m] ~ Q(X)
(Unnormalized) Importance Sampling
1. For m = 1, …, M:
   - Sample x[m] from Q(X)
   - Calculate w[m] = P(x[m]) / Q(x[m])
2. Estimate the expectation of f(X) using
    E_P[f(X)] ≈ (1/M) Σm f(x[m]) w[m]
Requirements:
- P(x) > 0 ⇒ Q(x) > 0 (do not ignore possible scenarios)
- It is possible to calculate P(x) and Q(x) for a specific x
- It is possible to sample from Q(X)
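A self-contained sketch of unnormalized importance sampling on a toy continuous example; the target and proposal densities here are illustrative choices, not from the slides:

```python
import math
import random

# Toy setup: target P is Exp(1), proposal Q is Exp(0.5).
# Q(x) > 0 wherever P(x) > 0, and both densities can be evaluated pointwise.
def p_density(x):  return math.exp(-x)              # target P(x)
def q_density(x):  return 0.5 * math.exp(-0.5 * x)  # proposal Q(x)
def q_sample():    return random.expovariate(0.5)   # we can sample from Q

def unnormalized_is(f, M=100000):
    total = 0.0
    for _ in range(M):
        x = q_sample()
        w = p_density(x) / q_density(x)  # importance weight w(x) = P(x)/Q(x)
        total += f(x) * w
    return total / M                     # (1/M) * sum_m f(x[m]) w[m]

print(unnormalized_is(lambda x: x))      # ≈ 1.0 = E_P[X] for Exp(1)
```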
Normalized Importance Sampling
- Assume now that we cannot even evaluate P(X = x), but we can evaluate P’(X = x) = α·P(X = x) for some unknown constant α (for example, in a Bayesian network we can evaluate P(x, e) but not P(x | e)).
- We define w’(X) = P’(X)/Q(X). We can then evaluate:
    E_Q[w’(X)] = Σx Q(x) P’(x)/Q(x) = Σx P’(x) = α
- and then:
    E_P[f(X)] = Σx f(x) P(x) = (1/α) Σx f(x) P’(x) = (1/α) E_Q[f(X) w’(X)] = E_Q[f(X) w’(X)] / E_Q[w’(X)]
  where in the last step we simply replaced α with the expectation above.
Normalized Importance Sampling
- We can now estimate the expectation of f(X) similarly to unnormalized importance sampling, by sampling x[1], …, x[M] from Q(X) and then
    E_P[f(X)] ≈ Σm f(x[m]) w’(x[m]) / Σm w’(x[m])
  (hence the name “normalized”).
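The same toy example as before, but pretending we only know P up to an unknown constant α; the normalized estimator still works because α cancels in the ratio:

```python
import math
import random

# P'(x) = alpha * P(x) with alpha = 5 and P = Exp(1); the estimator never
# needs to know alpha. Proposal Q is Exp(0.5), as in the previous sketch.
def p_unnorm(x):   return 5.0 * math.exp(-x)         # P'(x), constant unknown to us
def q_density(x):  return 0.5 * math.exp(-0.5 * x)
def q_sample():    return random.expovariate(0.5)

def normalized_is(f, M=100000):
    num = den = 0.0
    for _ in range(M):
        x = q_sample()
        w = p_unnorm(x) / q_density(x)  # w'(x) = P'(x)/Q(x)
        num += f(x) * w
        den += w                        # (1/M)*den estimates E_Q[w'(X)] = alpha
    return num / den                    # alpha cancels: estimates E_P[f(X)]

print(normalized_is(lambda x: x))       # ≈ 1.0 despite the unknown constant
```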
Importance Sampling → Likelihood Weighting
We want to compute P(Y = y | e), where X is the set of random variables in the network and Y is some subset we are interested in.
1) Define the mutilated Bayesian network B_{Z=z} to be a network where:
   - all variables in Z are disconnected from their parents and are deterministically set to z
   - all other variables remain unchanged
2) Choose Q to be the distribution defined by B_{E=e}; convince yourself that for samples x consistent with e, with P’(x) = P(x, e), the ratio P’(x)/Q(x) is exactly the product of the evidence CPT entries Πi P(ei | pai) — the likelihood weight.
3) Choose f(x) = 1{x⟨Y⟩ = y}.
4) Plug into the normalized importance sampling formula, and you get exactly likelihood weighting.
Likelihood weighting is correct!!!
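To make the correspondence concrete, a sketch (same hypothetical binary-network representation as the earlier sketches) of the ratio P’(x)/Q(x) for a sample x drawn from the mutilated network B_{E=e}: all non-evidence CPT factors appear in both P’ and Q and cancel, so only the evidence variables' CPT entries survive — exactly the weight computed by `weighted_sample` above.

```python
def is_weight(order, parents, cpt, evidence, x):
    """P'(x)/Q(x) for a sample x from the mutilated network B_{E=e}:
    the non-evidence factors cancel, leaving prod_i P(e_i | pa_i)."""
    w = 1.0
    for X in evidence:                  # only evidence CPT entries survive the ratio
        p_true = cpt[X][tuple(x[U] for U in parents[X])]
        w *= p_true if evidence[X] else 1.0 - p_true
    return w                            # identical to the w from weighted_sample
```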