CS 440 / ECE 448 Introduction to Artificial Intelligence
Spring 2010, Lecture #22
Instructor: Eyal Amir
Grad TAs: Wen Pu, Yonatan Bisk
Undergrad TAs: Sam Johnson, Nikhil Johri
Previously
–Bayesian networks
–Exact reasoning: variable elimination
–Machine learning: Naïve Bayes; ID3 decision-tree learning
Today: approximate reasoning
Today’s Agenda
1. Random sampling
   1. Logic sampling
   2. Rejection sampling
   3. Likelihood weighting
   4. Importance sampling
2. Will not cover: Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
Random Sampling
Sampling methods are also called Monte Carlo techniques.
It is sometimes hard to compute the posterior probability exactly.
Approximate reasoning: approximate the posterior by sampling from a distribution.
Accuracy depends on the number of samples.
Example
Estimate the probability of heads for an unbiased coin:
–Generate samples from P(coin) = (0.5, 0.5), like tossing a coin
–Finally: estimate P(head) ≈ (number of heads) / (number of samples)
Generating Samples
Need: generate samples from a probability P("head") = p. How?
–Find a biased coin … or use a random-number generator
–A random-number generator provides a uniform distribution over [0, 1]
–Generate x uniformly from [0, 1] and answer "head" exactly when x < p: the point p splits [0, 1] into a head-interval of length p and a tail-interval of length 1 - p
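A minimal Python sketch of the last two slides (the uniform-threshold trick and the coin estimate); the sample count N = 10000 is an arbitrary choice:

    import random

    def sample_bernoulli(p):
        # Draw x uniformly from [0, 1); answer "head" exactly when x < p.
        return "head" if random.random() < p else "tail"

    # Estimate P(head) for the unbiased coin from N samples.
    N = 10000
    samples = [sample_bernoulli(0.5) for _ in range(N)]
    print(samples.count("head") / N)  # close to 0.5 for large N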
Sampling a Bayesian Network
If P(X1, …, Xn) is represented by a Bayesian network, can we efficiently sample from it?
Idea: sample according to the structure of the network: write the distribution using the chain rule, and then sample each variable given its parents.
Logic Sampling: Worked Example
[Five slides step through sampling one complete instance of the Burglary–Earthquake–Alarm–Radio–Call network from its CPTs, one variable at a time in an order consistent with the arcs: first B (using P(b) = 0.03), then E (0.001), then A given B and E (0.4), then C given A (0.8), then R given E (0.3). The running partial sample grows from (b) to (b, e, a, c, r).]
Logic Sampling
Let X1, …, Xn be an order of the variables consistent with arc direction.
for i = 1, …, n do
–sample xi from P(Xi | pai)
–(Note: since Pai ⊆ {X1, …, Xi-1}, we already assigned values to the parents)
return x1, …, xn
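A sketch of logic sampling on the Burglary network in Python. Only P(b) = 0.03 and P(e) = 0.001 are legible on the slides; the remaining CPT entries below are made-up illustrative numbers:

    import random

    def bernoulli(p):
        return random.random() < p

    def logic_sample():
        # Sample each variable from its CPT given its already-sampled parents,
        # in an order consistent with the arcs: B, E, A, R, C.
        x = {}
        x["B"] = bernoulli(0.03)                                 # Burglary
        x["E"] = bernoulli(0.001)                                # Earthquake
        x["A"] = bernoulli(0.95 if x["B"] or x["E"] else 0.01)   # Alarm | B, E
        x["R"] = bernoulli(0.9 if x["E"] else 0.0001)            # Radio | E
        x["C"] = bernoulli(0.8 if x["A"] else 0.05)              # Call | A
        return x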
Logic Sampling
Running time of sampling a complete instance:
–Linear in the number of variables
–Regardless of the structure of the network
However, it can require a lot of samples to approximate the distribution well.
Can we sample from P(X1, …, Xn | e)?
–If the evidence is in the roots of the network: easily
–If the evidence is in the leaves, we have a problem: our sampling method proceeds according to the order of nodes in the graph
–Note: we can use arc reversal to make the evidence nodes roots; in some networks, however, this will create exponentially large tables
Rejection Sampling
Employing evidence:
–Idea: remove any sample that does not satisfy the evidence
–Problem: this rejects so many samples when P(e) is small that we need very many samples to get a decent estimate
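A sketch of rejection sampling reusing logic_sample from above; the query P(B = true | C = true) is just an illustration:

    def rejection_sample(evidence, n_samples):
        # Generate complete samples and keep only those that satisfy the evidence.
        kept = []
        for _ in range(n_samples):
            x = logic_sample()
            if all(x[v] == val for v, val in evidence.items()):
                kept.append(x)
        return kept

    kept = rejection_sample({"C": True}, 100000)
    if kept:  # when P(e) is small, kept may hold very few samples
        print(sum(x["B"] for x in kept) / len(kept))  # ≈ P(B = true | C = true)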
Likelihood Weighting
Can we ensure that all of our samples satisfy e?
One simple solution: when we need to sample a variable whose value is fixed by e, use the specified value.
For example, in a two-node network X → Y where we know Y = 1:
–Sample X from P(X)
–Then take Y = 1
Is this a sample from P(X, Y | Y = 1)?
Likelihood Weighting
Problem: these samples draw X from P(X), not from P(X | Y = 1).
Solution: penalize samples in which P(Y = 1 | X) is small.
We now sample as follows:
–Let x[i] be a sample from P(X)
–Let w[i] be P(Y = 1 | X = x[i])
(By Bayes rule, P(X = x | Y = 1) ∝ P(Y = 1 | X = x) P(X = x), so the weight corrects the bias.)
Likelihood Weighting
Why does this make sense?
When N is large, we expect about N · P(X = x) samples with x[i] = x. Thus
  Σ over {i : x[i] = x} of w[i] ≈ N · P(X = x) · P(Y = 1 | X = x) = N · P(X = x, Y = 1)
When we normalize the weights, we get an approximation of the conditional probability P(X = x | Y = 1).
Likelihood Weighting: Worked Example
[Five slides repeat the Burglary–Earthquake–Alarm–Radio–Call walk-through with the evidence variables (Alarm and Radio in the slides) fixed to their observed values: the non-evidence variables B, E, C are sampled from their CPTs as before, while each evidence variable contributes its CPT entry to the weight, giving Weight = 0.6 · 0.3 at the end.]
Likelihood Weighting: Generation of Samples
Let X1, …, Xn be an order of the variables consistent with arc direction.
w ← 1
for i = 1, …, n do
–if Xi = xi has been observed: w ← w · P(Xi = xi | pai)
–else: sample xi from P(Xi | pai)
return x1, …, xn, and w
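A sketch of this procedure in Python, again with the illustrative (made-up) CPTs from the logic-sampling sketch:

    import random

    # CPTs as functions returning P(var = true | parents), in topological order.
    CPTS = [
        ("B", lambda x: 0.03),
        ("E", lambda x: 0.001),
        ("A", lambda x: 0.95 if x["B"] or x["E"] else 0.01),
        ("R", lambda x: 0.9 if x["E"] else 0.0001),
        ("C", lambda x: 0.8 if x["A"] else 0.05),
    ]

    def likelihood_weighted_sample(evidence):
        w, x = 1.0, {}
        for var, cpt in CPTS:
            p = cpt(x)
            if var in evidence:
                x[var] = evidence[var]
                w *= p if evidence[var] else 1.0 - p  # weight by the CPT entry
            else:
                x[var] = random.random() < p          # sample as usual
        return x, w

    # Estimate P(B = true | C = true) as a normalized weighted average:
    pairs = [likelihood_weighted_sample({"C": True}) for _ in range(100000)]
    print(sum(w for x, w in pairs if x["B"]) / sum(w for _, w in pairs))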
Sampling Algorithms with Evidence: Another View
Idea: search for high-probability instances.
Suppose x[1], …, x[N] are instances with high mass. We can approximate (Bayes rule):
  P(y | e) ≈ Σm P(y, x[m], e) / Σm P(x[m], e)
If x[m] is a complete instantiation, then P(y | x[m], e) is 0 or 1.
Search Algorithms (cont.)
Instances that do not satisfy e do not play a role in the approximation.
We need to focus the search on instances that do satisfy e.
Clearly, in some cases this is hard (an NP-hardness result).
Stochastic Simulation
Suppose we can sample instances according to P(X1, …, Xn).
What is the probability that a random sample satisfies e? This is exactly P(e).
We can view each sample as tossing a biased coin that lands "heads" with probability P(e).
Stochastic Sampling
Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate
  P(e) ≈ (number of samples satisfying e) / N
The law of large numbers implies that as N grows, this estimate converges to p = P(e).
The number of samples that we need is potentially exponential in the dimension of P.
Importance Sampling
A method for evaluating the expectation of f under P(X):
–Discrete: E_P[f] = Σx f(x) P(x)
–Continuous: E_P[f] = ∫ f(x) P(x) dx
If we could sample from P, we could estimate E_P[f] ≈ (1/M) Σm f(x[m]) with x[m] ~ P.
Importance Sampling
A general method for evaluating expectations under P(X) when we cannot sample from P(X).
Idea: choose an approximating distribution Q(X) and sample from it:
  E_P[f] = Σx f(x) P(x) = Σx f(x) (P(x)/Q(x)) Q(x) = E_Q[f · w],  where w(X) = P(X)/Q(X)
If we could generate samples from P(X), we would average f directly; now that we generate the samples from Q(X), we average f · w instead.
(Unnormalized) Importance Sampling
1. For m = 1 : M
 –Sample x[m] from Q(X)
 –Calculate w(m) = P(x[m]) / Q(x[m])
2. Estimate the expectation of f(X) using
  Ê[f] = (1/M) Σm f(x[m]) w(m)
Requirements:
–P(x) > 0 implies Q(x) > 0 (don't ignore possible scenarios)
–It is possible to calculate P(x) and Q(x) for a specific x
–It is possible to sample from Q(X)
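A minimal generic sketch in Python, with a toy one-dimensional check; the target Exp(1) and proposal Exp(0.5) are arbitrary choices:

    import math
    import random

    def importance_estimate(f, p, q, sample_q, M=100000):
        # Unnormalized importance sampling: average f(x) * p(x)/q(x) over x ~ Q.
        return sum(f(x) * p(x) / q(x) for x in (sample_q() for _ in range(M))) / M

    # Toy check: E[X] = 1 under p = Exp(1), sampling from q = Exp(0.5).
    p = lambda x: math.exp(-x)
    q = lambda x: 0.5 * math.exp(-0.5 * x)
    print(importance_estimate(lambda x: x, p, q, lambda: random.expovariate(0.5)))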
Normalized Importance Sampling
Assume that we cannot evaluate P(X = x), but can evaluate P'(X = x) = α P(X = x) for some unknown constant α (for example, in a Bayesian network we can evaluate the joint P(x, e) but not P(x | e)).
We define w'(X) = P'(X)/Q(X). We can then evaluate:
  E_Q[w'(X)] = Σx Q(x) P'(x)/Q(x) = Σx P'(x) = α
and then:
  E_P[f] = (1/α) Σx f(x) P'(x) = (1/α) E_Q[f(X) w'(X)] = E_Q[f(X) w'(X)] / E_Q[w'(X)]
In the last step we simply replace α with E_Q[w'(X)] using the equation above.
Normalized Importance Sampling
We can now estimate the expectation of f(X), similarly to unnormalized importance sampling, by sampling x[m] from Q(X) and then computing
  Ê[f] = Σm f(x[m]) w'(x[m]) / Σm w'(x[m])
(hence the name "normalized": the weights are normalized by their sum).
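A sketch of the normalized estimator; note that only the unnormalized p'(x) is ever evaluated, since α cancels in the ratio:

    def normalized_importance_estimate(f, p_unnorm, q, sample_q, M=100000):
        # Divide the weighted sum by the sum of weights; the unknown
        # normalizer alpha of p_unnorm cancels out.
        num = den = 0.0
        for _ in range(M):
            x = sample_q()
            w = p_unnorm(x) / q(x)
            num += f(x) * w
            den += w
        return num / den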
Importance Sampling Weaknesses
It is important to choose a sampling distribution with heavy tails, so as not to "miss" large values of f.
Many-dimensional importance sampling:
–The "typical set" of P may take a long time to find, unless Q is a good approximation to P
–Weights vary by factors exponential in N
Similar weaknesses hold for likelihood weighting.
Next Class
1. Approximation guarantees and hardness
2. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
3. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings
Stochastic Sampling
Previously: independent samples to estimate P(X = x | e).
Problem: it is difficult to sample from P(X1, …, Xn | e); we had to use likelihood weighting, which introduces bias in the estimation.
In some cases, such as when the evidence is on leaves, these methods are inefficient:
–Very low weights if e has low prior probability
–Very few samples have high mass (high weight)
MCMC Methods
Sampling methods that are based on a Markov chain: Markov Chain Monte Carlo (MCMC) methods.
Key ideas:
–View the sampling process as a Markov chain: the next sample depends on the previous one
–Can approximate any posterior distribution
Next: review the theory of Markov chains.
MCMC Methods
Notes: the Markov chain variable Y takes as values assignments to all variables that are consistent with the evidence.
For simplicity, we will denote such a state using the vector of variables x1, …, xn.
Gibbs Sampler
One of the simplest MCMC methods.
Each transition changes the state of only one Xi.
The transition probability is defined by P itself, as a stochastic procedure:
–Input: a state x1, …, xn
–Choose i at random (uniform probability)
–Sample x'i from P(Xi | x1, …, xi-1, xi+1, …, xn, e)
–Let x'j = xj for all j ≠ i
–Return x'1, …, x'n
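A generic Gibbs-step sketch for discrete variables, assuming we can evaluate the (possibly unnormalized) joint, e.g. as a product of CPT entries in a Bayesian network; the state, domains, and evidence dictionaries are placeholder structures:

    import random

    def gibbs_step(state, joint, domains, evidence):
        # Pick a non-evidence variable uniformly at random and resample it from
        # its conditional given all other variables, obtained by renormalizing
        # the joint over that variable's domain.
        var = random.choice([v for v in state if v not in evidence])
        weights = [joint({**state, var: val}) for val in domains[var]]
        r = random.random() * sum(weights)
        for val, w in zip(domains[var], weights):
            r -= w
            if r <= 0.0:
                return {**state, var: val}
        return state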
Sampling Strategy
How do we collect the samples?
Strategy I: run the chain M times, each for N steps, with each run starting from a different starting point. Return the last state in each run, giving M samples from M chains.
Sampling Strategy
Strategy II: run one chain for a long time. After some "burn-in" period, record a sample every fixed number of steps, giving M samples from one chain.
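A sketch of Strategy II around any transition function, such as gibbs_step above; burn_in and thin are tuning parameters, not prescribed by the slides:

    def collect_samples(init, step, burn_in=1000, thin=10, M=500):
        # One long chain: discard the burn-in prefix, then record every
        # thin-th state until M samples have been collected.
        state = init
        for _ in range(burn_in):
            state = step(state)
        samples = []
        while len(samples) < M:
            for _ in range(thin):
                state = step(state)
            samples.append(state)
        return samples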