UIUC CS 497: Section EA Lecture #7 Reasoning in Artificial Intelligence Professor: Eyal Amir Spring Semester 2004 (Based on slides by Gal Elidan (Hebrew U))

Last Time
Probabilistic graphical models
Exact reasoning:
– Variable elimination
– Junction tree algorithm
Applications of Bayes networks:
– Sensor networks, medical diagnosis, computer diagnosis (MS Windows), classification

Approximate Inference
Large treewidth:
– Large, highly connected graphical models
– Treewidth may be large (> 40) even in sparse networks
In many applications, approximations are sufficient:
– Example: maybe P(X = x|e) ≈ 0.3 is a good enough approximation of the exact value
– e.g., we take action only if P(X = x|e) > 0.5

Today: Approximate Reasoning via Sampling
1. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
2. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
3. Applications du jour: ?

Types of Approximations: Absolute Error
An estimate q of P(X = x|e) has absolute error ε if
P(X = x|e) − ε ≤ q ≤ P(X = x|e) + ε,
equivalently q − ε ≤ P(X = x|e) ≤ q + ε.
Not always what we want:
– unacceptable if P(X = x|e) is itself on the order of ε or smaller
– overly precise if P(X = x|e) is large
[Figure: the interval of width 2ε around q on the [0, 1] line]

Types of Approximations: Relative Error
An estimate q of P(X = x|e) has relative error ε if
P(X = x|e)(1 − ε) ≤ q ≤ P(X = x|e)(1 + ε),
equivalently q/(1 + ε) ≤ P(X = x|e) ≤ q/(1 − ε).
Sensitivity of the approximation depends on the actual value of the desired result.
[Figure: the interval [q/(1 + ε), q/(1 − ε)] on the [0, 1] line]

Complexity
Recall: exact inference is NP-hard.
Is approximate inference any easier?
Construction used for exact inference:
– Input: a 3-SAT formula φ
– Output: a BN such that P(X = t) > 0 iff φ is satisfiable

Complexity: Relative Error
Suppose that q is a relative error estimate of P(X = t).
If φ is not satisfiable, then P(X = t) = 0, so
0 = P(X = t)(1 − ε) ≤ q ≤ P(X = t)(1 + ε) = 0.
Thus, if q > 0, then φ is satisfiable.
An immediate consequence:
Thm: Given ε, finding an ε-relative error approximation is NP-hard.

Complexity: Absolute Error
Thm: If ε < 0.5, then finding an estimate of P(X = x|e) with absolute error ε is NP-hard.

Search Algorithms
Idea: search for high-probability instances.
Suppose x[1], …, x[N] are instances with high mass. We can approximate:
P(e) ≈ Σ_i P(e | x[i]) P(x[i])
If x[i] is a complete instantiation, then P(e | x[i]) is 0 or 1.

Search Algorithms (cont.)
Instances that do not satisfy e do not play a role in the approximation.
We need to focus the search on instances that do satisfy e.
Clearly, in some cases this is hard (e.g., the construction from our NP-hardness result).

Stochastic Simulation
Suppose we can sample instances according to P(X_1,…,X_n).
What is the probability that a random sample satisfies e?
– This is exactly P(e).
We can view each sample as tossing a biased coin with probability P(e) of "heads".

Stochastic Sampling
Intuition: given a sufficient number of samples x[1],…,x[N], we can estimate
P(e) ≈ (1/N) Σ_i 1{x[i] satisfies e}.
The law of large numbers implies that as N grows, our estimate converges to P(e) with high probability.
How many samples do we need to get a reliable estimate? Use Chernoff's bound for binomial distributions.
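
A small Python sketch of this sample-size calculation; the multiplicative form of the Chernoff bound used below, and the example numbers, are illustrative assumptions rather than the lecture's exact figures:

    import math

    def chernoff_sample_size(p, eps, delta):
        # Samples needed so that the empirical frequency of an event of
        # probability p is within relative error eps of p with probability
        # at least 1 - delta, using the multiplicative Chernoff bound
        #   P(|p_hat - p| >= eps * p) <= 2 * exp(-N * p * eps**2 / 3).
        return math.ceil(3 * math.log(2 / delta) / (p * eps ** 2))

    # Rare evidence (small p) drives the required sample count up:
    print(chernoff_sample_size(p=0.1,   eps=0.1, delta=0.05))   # ~1.1e4
    print(chernoff_sample_size(p=0.001, eps=0.1, delta=0.05))   # ~1.1e6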

Sampling a Bayesian Network
If P(X_1,…,X_n) is represented by a Bayesian network, can we efficiently sample from it?
Idea: sample according to the structure of the network
– Write the distribution using the chain rule, P(X_1,…,X_n) = Π_i P(X_i | pa_i), and then sample each variable given its parents.

[Figure: logic sampling walked through on the Burglary–Earthquake–Alarm–Call–Radio network: B, E, A, C, and R are sampled in turn from their CPTs, producing one complete sample.]

Logic Sampling
Let X_1, …, X_n be an order of the variables consistent with arc direction.
for i = 1, …, n do
– sample x_i from P(X_i | pa_i)
– (Note: since Pa_i ⊆ {X_1,…,X_{i−1}}, we have already assigned values to them)
return x_1, …, x_n
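
A minimal Python sketch of logic (forward) sampling on a toy version of the alarm network; the CPT numbers are illustrative assumptions, not the slides' exact figures:

    import random

    def sample_burglary_net():
        # Sample in an order consistent with arc direction: parents first.
        b = random.random() < 0.03             # P(B = true)
        e = random.random() < 0.001            # P(E = true)
        p_a = {(True, True): 0.98, (True, False): 0.7,
               (False, True): 0.4, (False, False): 0.01}[(b, e)]
        a = random.random() < p_a              # P(A = true | B, E)
        c = random.random() < (0.8 if a else 0.05)   # P(C = true | A)
        return {"B": b, "E": e, "A": a, "C": c}

    # Estimate P(C = true) from N complete samples:
    N = 100_000
    est = sum(sample_burglary_net()["C"] for _ in range(N)) / N
    print(est)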

Logic Sampling
Sampling a complete instance is linear in the number of variables
– regardless of the structure of the network.
However, if P(e) is small, we need many samples to get a decent estimate.

Can we sample from P(X_1,…,X_n | e)?
If the evidence is in the roots of the network: easily.
If the evidence is in the leaves of the network, we have a problem
– our sampling method proceeds according to the order of nodes in the graph.
Note: we can use arc reversal to make the evidence nodes roots
– in some networks, however, this will create exponentially large tables...

Likelihood Weighting
Can we ensure that all of our samples satisfy e?
One simple solution:
– when we need to sample a variable that is assigned a value by e, use the specified value.
For example: we know Y = 1
– Sample X from P(X)
– Then take Y = 1
Is this a sample from P(X, Y | Y = 1)?
[Figure: two-node network X → Y]

Likelihood Weighting
Problem: these samples of X are drawn from P(X), not from P(X | Y = 1).
Solution:
– penalize samples in which P(Y = 1 | X) is small.
We now sample as follows:
– Let x[i] be a sample from P(X)
– Let w[i] be P(Y = 1 | X = x[i])
[Figure: two-node network X → Y]

Likelihood Weighting
Why does this make sense? When N is large, we expect to see N·P(X = x) samples with x[i] = x. Thus,
Σ_i w[i]·1{x[i] = x} ≈ N·P(X = x)·P(Y = 1 | X = x) = N·P(X = x, Y = 1).
When we normalize by Σ_i w[i] ≈ N·P(Y = 1), we get an approximation of the conditional probability P(X = x | Y = 1).

[Figure: likelihood weighting walked through on the Burglary–Earthquake–Alarm–Call–Radio network: unobserved variables are sampled from their CPTs, observed variables are fixed to their evidence values, and the sample's weight accumulates the corresponding CPT entries.]

Likelihood Weighting
Let X_1, …, X_n be an order of the variables consistent with arc direction.
w = 1
for i = 1, …, n do
– if X_i = x_i has been observed: w ← w · P(X_i = x_i | pa_i)
– else: sample x_i from P(X_i | pa_i)
return x_1, …, x_n, and w
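
A matching Python sketch of likelihood weighting on the same toy network (again with illustrative CPT numbers), estimating P(B = true | C = true):

    import random

    def likelihood_weighted_sample(evidence):
        w, x = 1.0, {}
        # Topological order: B, E, A, C.
        x["B"] = random.random() < 0.03
        x["E"] = random.random() < 0.001
        p_a = {(True, True): 0.98, (True, False): 0.7,
               (False, True): 0.4, (False, False): 0.01}[(x["B"], x["E"])]
        x["A"] = random.random() < p_a
        p_c = 0.8 if x["A"] else 0.05
        if "C" in evidence:                 # observed: fix value, multiply weight
            x["C"] = evidence["C"]
            w *= p_c if x["C"] else 1 - p_c
        else:
            x["C"] = random.random() < p_c
        return x, w

    # Estimate P(B = true | C = true) as a weighted average:
    num = den = 0.0
    for _ in range(100_000):
        x, w = likelihood_weighted_sample({"C": True})
        num += w * x["B"]
        den += w
    print(num / den)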

Importance Sampling
A method for evaluating the expectation of f under P(X):
– Discrete: E_P[f] = Σ_x f(x) P(x)
– Continuous: E_P[f] = ∫ f(x) P(x) dx
If we could sample from P, we would estimate E_P[f] ≈ (1/M) Σ_m f(x[m]).

Importance Sampling
A general method for evaluating expectations under P(X) when we cannot sample from P(X).
Idea: choose an approximating distribution Q(X) and sample from it.
Define the weight W(X) = P(X)/Q(X). Then
E_P[f] = Σ_x f(x) P(x) = Σ_x f(x) W(x) Q(x) = E_Q[f·W],
so instead of generating the samples from P(X), we generate them from Q(X) and reweight.

(Unnormalized) Importance Sampling
1. For m = 1:M
– Sample x[m] from Q(X)
– Calculate w[m] = P(x[m]) / Q(x[m])
2. Estimate the expectation of f(X) using
E_P[f] ≈ (1/M) Σ_m f(x[m]) w[m]
Requirements:
– P(x) > 0 ⇒ Q(x) > 0 (don't ignore possible scenarios)
– Possible to calculate P(x) and Q(x) for a specific X = x
– Possible to sample from Q(X)
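
A minimal Python sketch of unnormalized importance sampling for a single binary variable; the target P, proposal Q, and function f are illustrative stand-ins:

    import random

    P = {True: 0.9, False: 0.1}    # target: can evaluate, choose not to sample
    Q = {True: 0.5, False: 0.5}    # proposal: Q > 0 wherever P > 0

    def f(x):
        return 1.0 if x else 0.0

    M, total = 100_000, 0.0
    for _ in range(M):
        x = random.random() < 0.5          # sample x ~ Q
        total += f(x) * P[x] / Q[x]        # weight w = P(x)/Q(x)
    print(total / M)                       # -> E_P[f] = 0.9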

Normalized Importance Sampling
Assume that we cannot evaluate P(X = x) but can evaluate P'(X = x) = α·P(X = x) (for example, in a Bayesian network we can evaluate P(x, e) but not P(x | e)).
We define w'(X) = P'(X)/Q(X). We can then evaluate α:
E_Q[w'] = Σ_x Q(x) P'(x)/Q(x) = Σ_x P'(x) = α,
and then:
E_P[f] = (1/α) Σ_x f(x) P'(x) = (1/α) E_Q[f·w'] = E_Q[f·w'] / E_Q[w'].
In the last step we simply replaced α with the expression above.

Normalized Importance Sampling
We can now estimate the expectation of f(X), similarly to unnormalized importance sampling, by sampling x[m] from Q(X) and computing
Ê_P[f] = Σ_m f(x[m]) w'(x[m]) / Σ_m w'(x[m])
(hence the name "normalized").
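
The normalized variant in the same illustrative setting, assuming we can only evaluate P'(x) = α·P(x):

    import random

    P_unnorm = {True: 9.0, False: 1.0}   # P'(x); the true P is (0.9, 0.1), alpha = 10
    Q = {True: 0.5, False: 0.5}

    def f(x):
        return 1.0 if x else 0.0

    num = den = 0.0
    for _ in range(100_000):
        x = random.random() < 0.5        # x ~ Q
        w = P_unnorm[x] / Q[x]           # w'(x) = P'(x)/Q(x)
        num += f(x) * w
        den += w                         # denominator estimates alpha
    print(num / den)                     # -> E_P[f] = 0.9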

Importance Sampling Weaknesses
Important to choose a sampling distribution with heavy tails
– so as not to "miss" large values of f.
Many-dimensional importance sampling:
– the "typical set" of P may take a long time to find, unless Q is a good approximation to P
– weights vary by factors exponential in N.
Similar issues arise for likelihood weighting.

Today: Approximate Reasoning via Sampling
1. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
2. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
3. Applications du jour: ?

Stochastic Sampling
Previously: independent samples to estimate P(X = x | e).
Problem: it is difficult to sample from P(X_1, …, X_n | e).
We had to use likelihood weighting
– which introduces bias into the estimation.
In some cases, such as when the evidence is on the leaves, these methods are inefficient.

MCMC Methods
Sampling methods that are based on Markov chains
– Markov Chain Monte Carlo (MCMC) methods.
Key ideas:
– The sampling process is a Markov chain: the next sample depends on the previous one
– Can approximate any posterior distribution.
Next: review the theory of Markov chains.

Markov Chains
Suppose X_1, X_2, … take values in some set
– wlog, these values are 1, 2, ...
A Markov chain is a process that corresponds to the network
[Figure: chain X_1 → X_2 → X_3 → … → X_n]
To quantify the chain, we need to specify:
– Initial probability: P(X_1)
– Transition probability: P(X_{t+1} | X_t)
A Markov chain has stationary transition probabilities: P(X_{t+1} | X_t) is the same for all times t.

Irreducible Chains
A state j is accessible from state i if there is an n such that P(X_n = j | X_1 = i) > 0
– there is a positive probability of reaching j from i after some number of steps.
A chain is irreducible if every state is accessible from every state.

Ergodic Chains
A state i is positively recurrent if the expected time to return to state i after being in state i is finite
– if X has a finite number of states, it suffices that i is accessible from itself.
A chain is ergodic if it is irreducible and every state is positively recurrent.

(A)periodic Chains
A state i is periodic if there is an integer d > 1 such that P(X_n = i | X_1 = i) = 0 whenever n is not divisible by d.
Intuition: state i can occur only every d steps.
A chain is aperiodic if it contains no periodic state.

Stationary Probabilities
Thm: If a chain is ergodic and aperiodic, then the limit
P*(X = j) = lim_{n→∞} P(X_n = j | X_1 = i)
exists and does not depend on i. Moreover, P*(X) is the unique probability distribution satisfying
P*(X = j) = Σ_i P*(X = i) P(X_{t+1} = j | X_t = i).

Stationary Probabilities
The probability P*(X) is the stationary probability of the process.
Regardless of the starting point, the process will converge to this probability.
The rate of convergence depends on properties of the transition probability.

Sampling from the Stationary Probability
This theory suggests how to sample from the stationary probability:
– Set X_1 = i for some random/arbitrary i
– For t = 1, 2, …, n: sample a value x_{t+1} for X_{t+1} from P(X_{t+1} | X_t = x_t)
– Return x_n
If n is large enough, then this is a sample from P*(X).
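
A small Python sketch of this procedure on an illustrative two-state chain; the transition matrix is an assumption chosen so that the stationary distribution is easy to verify by hand:

    import random

    T = {0: [0.9, 0.1],    # P(X_{t+1} | X_t = 0)
         1: [0.5, 0.5]}    # P(X_{t+1} | X_t = 1)

    def run_chain(n, start):
        x = start
        for _ in range(n):
            x = 0 if random.random() < T[x][0] else 1
        return x

    # The stationary distribution solves pi = pi T: here pi = (5/6, 1/6).
    samples = [run_chain(200, start=random.randint(0, 1)) for _ in range(20_000)]
    print(sum(s == 0 for s in samples) / len(samples))  # ~0.833, regardless of start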

Designing Markov Chains
How do we construct the right chain to sample from?
– Ensuring aperiodicity and irreducibility is usually easy.
– The problem is ensuring the desired stationary probability.

Designing Markov Chains
Key tool: if the transition probability satisfies the detailed balance condition
Q(x) P(X_{t+1} = y | X_t = x) = Q(y) P(X_{t+1} = x | X_t = y) for all x, y,
then P*(X) = Q(X).
This gives a local criterion for checking that the chain will have the right stationary distribution.

MCMC Methods
We can use these results to sample from P(X_1,…,X_n | e).
Idea:
– Construct an ergodic and aperiodic Markov chain such that P*(X_1,…,X_n) = P(X_1,…,X_n | e)
– Simulate the chain for n steps to get a sample.

MCMC Methods: Notes
The Markov chain's state is an assignment to all variables that is consistent with the evidence.
For simplicity, we will denote such a state using the vector of variables.

Gibbs Sampler
One of the simplest MCMC methods.
Each transition changes the state of one X_i.
The transition probability is defined by P itself, as a stochastic procedure:
– Input: a state x_1,…,x_n
– Choose i at random (uniform probability)
– Sample x'_i from P(X_i | x_1, …, x_{i−1}, x_{i+1},…, x_n, e)
– Let x'_j = x_j for all j ≠ i
– Return x'_1,…,x'_n

Correctness of Gibbs Sampler How do we show correctness?

Correctness of Gibbs Sampler
By the chain rule,
P(x_1,…,x_{i−1}, x_i, x_{i+1},…,x_n | e) = P(x_1,…,x_{i−1}, x_{i+1},…,x_n | e) · P(x_i | x_1,…,x_{i−1}, x_{i+1},…,x_n, e).
Thus, for a transition from x to x' that differ only in coordinate i,
P(x | e) · P(x'_i | x_{−i}, e) = P(x_{−i} | e) · P(x_i | x_{−i}, e) · P(x'_i | x_{−i}, e),
which is symmetric in x_i and x'_i, so P(x | e)·T(x → x') = P(x' | e)·T(x' → x).
Since we choose i from the same distribution at each stage, the procedure satisfies the ratio (detailed balance) criterion.

Gibbs Sampling for Bayesian Networks
Why is the Gibbs sampler "easy" in BNs?
Recall that the Markov blanket of a variable separates it from the other variables in the network:
P(X_i | X_1,…,X_{i−1},X_{i+1},…,X_n) = P(X_i | Mb_i)
This property allows us to use local computations to perform the sampling at each transition.

Gibbs Sampling in Bayesian Networks
How do we evaluate P(X_i | x_1,…,x_{i−1},x_{i+1},…,x_n)?
Let Y_1, …, Y_k be the children of X_i
– by definition of Mb_i, the parents of each Y_j are in Mb_i ∪ {X_i}.
It is easy to show that
P(x_i | x_{−i}) ∝ P(x_i | pa_i) · Π_j P(y_j | pa_{Y_j}),
normalized over the values of X_i.
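
A minimal Python sketch of the Gibbs sampler on the toy alarm network from before, with evidence C = true. The CPT numbers are illustrative, and for brevity the conditional is computed from the full joint, which is equivalent to the Markov-blanket product since the remaining factors cancel in the normalization:

    import random

    P_B, P_E = 0.03, 0.001
    P_A = {(True, True): 0.98, (True, False): 0.7,
           (False, True): 0.4, (False, False): 0.01}   # P(A = true | B, E)
    P_C = {True: 0.8, False: 0.05}                     # P(C = true | A)

    def joint(x):
        # Full joint P(B, E, A, C) of the toy network.
        p = P_B if x["B"] else 1 - P_B
        p *= P_E if x["E"] else 1 - P_E
        pa = P_A[(x["B"], x["E"])]
        p *= pa if x["A"] else 1 - pa
        pc = P_C[x["A"]]
        p *= pc if x["C"] else 1 - pc
        return p

    def cond_true(var, x):
        # P(var = true | all other variables); only Markov-blanket factors
        # actually matter, the rest cancel below.
        pt = joint({**x, var: True})
        pf = joint({**x, var: False})
        return pt / (pt + pf)

    x = {"B": False, "E": False, "A": True, "C": True}  # C = true is evidence
    hits = total = 0
    for t in range(60_000):
        var = random.choice(["B", "E", "A"])            # a non-evidence variable
        x[var] = random.random() < cond_true(var, x)
        if t >= 5_000:                                  # discard burn-in
            hits += x["B"]; total += 1
    print(hits / total)    # ~ P(B = true | C = true)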

Metropolis-Hastings
More general than Gibbs (Gibbs is a special case of M-H).
Uses an arbitrary proposal distribution q(x'|x) that makes the chain ergodic and aperiodic (e.g., uniform).
A proposed transition to x' is accepted with probability
α(x'|x) = min(1, [P(x') q(x|x')] / [P(x) q(x'|x)]).
Useful when P(x) can only be evaluated up to a normalizing constant, since the constant cancels in the ratio.
Requirement: q(x'|x) = 0 implies P(x') = 0 or q(x|x') = 0.
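
A minimal Python sketch of Metropolis-Hastings with a symmetric (uniform) proposal over three states, targeting an unnormalized distribution; all numbers are illustrative:

    import random

    P_unnorm = [1.0, 3.0, 6.0]     # target proportional to (0.1, 0.3, 0.6)

    def mh_step(x):
        x_new = random.randint(0, 2)                       # q(x'|x) uniform, symmetric
        accept = min(1.0, P_unnorm[x_new] / P_unnorm[x])   # the q terms cancel
        return x_new if random.random() < accept else x

    x, counts = 0, [0, 0, 0]
    for t in range(100_000):
        x = mh_step(x)
        if t >= 1_000:                                     # discard burn-in
            counts[x] += 1
    print([c / sum(counts) for c in counts])               # ~ [0.1, 0.3, 0.6]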

Sampling Strategy
How do we collect the samples?
Strategy I: run the chain M times, each for N steps
– each run starts from a different starting point
– return the last state of each run.
[Figure: M independent chains, one sample taken from each]

Sampling Strategy
Strategy II: run one chain for a long time
– after some "burn-in" period, record a sample every fixed number of steps.
[Figure: M samples taken from a single chain after the burn-in period]

Comparing Strategies
Strategy I:
– Better chance of "covering" the space of points, especially if the chain is slow to reach stationarity
– Have to perform "burn-in" steps for each chain.
Strategy II:
– Perform "burn-in" only once
– Samples might be correlated (although only weakly).
Hybrid strategy:
– Run several chains, and sample a few times from each
– Combines the benefits of both strategies.
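
A small Python sketch of the hybrid strategy on the two-state chain from earlier: several chains, one burn-in per chain, thinned samples from each. All tuning constants are illustrative:

    import random

    T = {0: [0.9, 0.1], 1: [0.5, 0.5]}                 # transition probabilities

    def step(x):
        return 0 if random.random() < T[x][0] else 1

    samples = []
    for chain in range(5):                             # Strategy I: several chains
        x = random.randint(0, 1)                       # different starting point
        for _ in range(1_000):                         # burn-in, once per chain
            x = step(x)
        for _ in range(200):                           # Strategy II: many samples/chain
            for _ in range(50):                        # thinning reduces correlation
                x = step(x)
            samples.append(x)
    print(sum(s == 0 for s in samples) / len(samples)) # ~5/6 for this chain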

Summary: Approximate Inference
Monte Carlo (sampling, with both positive and negative error) methods:
– Pro: simplicity of implementation and theoretical guarantees of convergence
– Con: can be slow to converge, and convergence can be hard to diagnose.
Variational methods – your presentation.
Loopy belief propagation and generalized belief propagation – your presentation.

Next Time Combining Probabilities with Relations and Objects

THE END

Example: Naïve Bayesian Model
A common model in early diagnosis:
– symptoms are conditionally independent given the disease (or fault).
Thus, if
– X_1,…,X_p denote whether each symptom (headache, high fever, etc.) is exhibited by the patient, and
– H denotes the hypothesis about the patient's health,
then P(X_1,…,X_p,H) = P(H)·P(X_1|H)···P(X_p|H).
This naïve Bayesian model allows a compact representation
– though it does embody strong independence assumptions.
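
A minimal Python sketch of inference in this model: the posterior over H is P(H)·Π_i P(x_i|H), normalized. The diseases, symptoms, and numbers here are illustrative:

    P_H = {"healthy": 0.9, "sick": 0.1}
    P_X_given_H = {                      # per-symptom P(symptom present | H)
        "headache":  {"healthy": 0.1,  "sick": 0.7},
        "highfever": {"healthy": 0.01, "sick": 0.6},
    }

    def posterior(symptoms):
        # P(H | x_1, ..., x_p) for observed symptom values (dict of bools).
        scores = {}
        for h, prior in P_H.items():
            p = prior
            for name, present in symptoms.items():
                q = P_X_given_H[name][h]
                p *= q if present else 1 - q
            scores[h] = p
        z = sum(scores.values())
        return {h: p / z for h, p in scores.items()}

    print(posterior({"headache": True, "highfever": True}))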

Elimination on Trees
Formally, for any tree there is an elimination ordering with induced width 1.
Thm: Inference on trees is linear in the number of variables.

Importance Sampling to LW
We want to compute P(Y = y | e) (X is the set of random variables in the network and Y is some subset we are interested in).
1) Define the mutilated Bayesian network B_{Z=z} to be a network where:
– all variables in Z are disconnected from their parents and are deterministically set to z
– all other variables remain unchanged.
2) Choose Q to be B_{E=e}; convince yourself that with P'(X) = P(X, e), the ratio P'(X)/Q(X) is exactly the likelihood weight w (the product of P(e_i | pa_i) over the evidence variables).
3) Choose f(x) to be the indicator 1(Y = y).
4) Plug into the normalized importance sampling formula and you get exactly likelihood weighting.
⇒ Likelihood weighting is correct!!!

A Word of Caution
Deterministic nodes:
– the chain is not ergodic in the simple sense
– M-H cannot be used.