CS 440 / ECE 448 Introduction to Artificial Intelligence
Spring 2010, Lecture #22
Instructor: Eyal Amir
Grad TAs: Wen Pu, Yonatan Bisk
Undergrad TAs: Sam Johnson, Nikhil Johri

Previously:
– Bayesian networks
– Exact reasoning: variable elimination
– Machine learning: Naïve Bayes; ID3 (decision-tree learning)
Today: approximate reasoning

Today's Agenda
1. Random sampling
   1. Logic sampling
   2. Rejection sampling
   3. Likelihood weighting
   4. Importance sampling
2. Will not cover: Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings

Random Sampling
Sampling methods are also called Monte Carlo techniques
It is sometimes hard to compute the posterior probability exactly
Approximate reasoning: approximate the posterior by sampling from a distribution
Accuracy depends on the number of samples

Example
Estimate the probability of heads for an unbiased coin
– Generate samples from P(coin) = (0.5, 0.5), like tossing a coin
– Finally, estimate P(head) ≈ (number of heads) / N

Generating Samples
Need: generate samples with probability P("head") = p
How? Find a biased coin… or use a random-number generator:
– It provides a uniform distribution over [0, 1]
– Draw x uniformly from [0, 1] and answer "head" exactly when x < p (the interval from 0 to p has probability p)
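A minimal sketch of this trick in plain Python (names are illustrative, not from the lecture):

```python
import random

def sample_biased_coin(p):
    """Return "head" with probability p using one uniform draw on [0, 1)."""
    x = random.random()  # uniform on [0, 1)
    return "head" if x < p else "tail"

# Estimate P(head) from N samples, as in the coin example above
N = 10000
p_hat = sum(sample_biased_coin(0.3) == "head" for _ in range(N)) / N
print(p_hat)  # close to 0.3
```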

Sampling a Bayesian Network
If P(X1, …, Xn) is represented by a Bayesian network, can we efficiently sample from it?
Idea: sample according to the structure of the network
– Write the distribution using the chain rule, then sample each variable given its parents

Logic Sampling: Worked Example
[Figure sequence: five slides stepping through one logic sample of the Burglary–Earthquake–Alarm network (Burglary → Alarm ← Earthquake, Earthquake → Radio, Alarm → Call). Variables are sampled in topological order, filling in the sample row B, E, A, C, R one value at a time; the first draw uses P(b) = 0.03.]

Logic Sampling
Let X1, …, Xn be an ordering of the variables consistent with arc direction
for i = 1, …, n do
– sample xi from P(Xi | pai)
– (Note: since Pai ⊆ {X1, …, Xi−1}, their values have already been assigned)
return x1, …, xn
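A Python sketch of this loop, assuming a hypothetical network representation as a topologically ordered list of (name, parents, cpt) triples for binary variables:

```python
import random

def logic_sample(network):
    """One logic (forward) sample. `network` is a list of
    (name, parents, cpt) triples in an order consistent with arc
    direction; cpt maps a tuple of parent values to P(X = True)."""
    sample = {}
    for name, parents, cpt in network:
        p_true = cpt[tuple(sample[p] for p in parents)]
        sample[name] = random.random() < p_true
    return sample

# Tiny Burglary -> Alarm fragment (CPT numbers are illustrative,
# except P(b) = 0.03 from the slides)
network = [
    ("B", (), {(): 0.03}),
    ("A", ("B",), {(True,): 0.9, (False,): 0.1}),
]
print(logic_sample(network))
```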

Logic Sampling
Running time of sampling one complete instance:
– Linear in the number of variables, regardless of the structure of the network
However, approximating the distribution can require a lot of samples

Can we sample from P(X1, …, Xn | e)?
– If the evidence is at the roots of the network: easily
– If the evidence is at the leaves, we have a problem: our sampling method proceeds according to the order of nodes in the graph
– Note: we can use arc reversal to make the evidence nodes roots; in some networks, however, this creates exponentially large tables…

Rejection Sampling
Employing evidence:
– Idea: discard any sample that does not satisfy the evidence
– Problem: if P(e) is small, so many samples are rejected that we need a very large number of samples to get a decent estimate
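A hedged sketch of rejection sampling built on the `logic_sample` helper above; the evidence format (a dict of observed values) is an assumption:

```python
def rejection_query(network, query_var, evidence, n_samples=10000):
    """Estimate P(query_var = True | evidence) by throwing away every
    sample that contradicts the evidence."""
    kept = hits = 0
    for _ in range(n_samples):
        s = logic_sample(network)
        if all(s[v] == val for v, val in evidence.items()):
            kept += 1
            hits += s[query_var]  # True counts as 1
    # When P(e) is small, `kept` is small and the estimate is noisy
    return hits / kept if kept else float("nan")

print(rejection_query(network, "B", {"A": True}))
```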

Likelihood Weighting
Can we ensure that all of our samples satisfy e?
One simple solution:
– When we reach a variable whose value is fixed by e, use the specified value
For example, in the network X → Y, suppose we know Y = 1:
– Sample X from P(X)
– Then set Y = 1
Is this a sample from P(X, Y | Y = 1)?

Likelihood Weighting
Problem: these are samples of X from P(X), not from P(X | Y = 1)
Solution: penalize samples in which P(Y = 1 | X) is small
We now sample as follows:
– Let x[i] be a sample from P(X)
– Let w[i] = P(Y = 1 | X = x[i])
By Bayes' rule, P(X = x | Y = 1) ∝ P(Y = 1 | X = x) P(X = x), so weighting each sample by w[i] corrects the bias.

Likelihood Weighting
Why does this make sense? When N is large, we expect about N · P(X = x) samples with x[i] = x. Thus,
Σ over {i : x[i] = x} of w[i] ≈ N · P(X = x) · P(Y = 1 | X = x) = N · P(X = x, Y = 1)
When we normalize by the total weight Σ w[i] ≈ N · P(Y = 1), we get an approximation of the conditional probability P(X = x | Y = 1)

Likelihood Weighting: Worked Example
[Figure sequence: five slides repeating the alarm-network walkthrough, now with evidence on A and R. The unobserved variables B, E, C are sampled as before; the observed variables are clamped to their evidence values, and the running weight is multiplied by the probability of each observed value given its sampled parents (0.6 and 0.3 here, giving Weight = 0.6 * 0.3).]

Likelihood Weighting: Generation of Samples
Let X1, …, Xn be an ordering of the variables consistent with arc direction
w = 1
for i = 1, …, n do
– if Xi = xi has been observed: w ← w · P(Xi = xi | pai)
– else: sample xi from P(Xi | pai)
return x1, …, xn, and w
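The same loop as a Python sketch, reusing the illustrative (name, parents, cpt) network format and the `network` example from the logic-sampling sketch above:

```python
import random

def weighted_sample(network, evidence):
    """One likelihood-weighted sample: evidence variables are clamped,
    and the weight accumulates their conditional probabilities."""
    sample, w = {}, 1.0
    for name, parents, cpt in network:
        p_true = cpt[tuple(sample[p] for p in parents)]
        if name in evidence:
            sample[name] = evidence[name]
            w *= p_true if evidence[name] else 1.0 - p_true
        else:
            sample[name] = random.random() < p_true
    return sample, w

def lw_query(network, query_var, evidence, n_samples=10000):
    """Estimate P(query_var = True | evidence) from weighted samples."""
    num = den = 0.0
    for _ in range(n_samples):
        s, w = weighted_sample(network, evidence)
        num += w * s[query_var]
        den += w
    return num / den

print(lw_query(network, "B", {"A": True}))
```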

Sampling Algorithms with Evidence: Another View
Idea: search for high-probability instances
Suppose x[1], …, x[N] are instances with high mass. We can then approximate (and apply Bayes' rule to answer conditional queries):
P(e) ≈ Σ over i of P(e | x[i]) P(x[i])
If x[i] is a complete instantiation, then P(e | x[i]) is 0 or 1

Search Algorithms (cont.)
Instances that do not satisfy e play no role in the approximation
We need to focus the search on instances that do satisfy e
Clearly, in some cases this is hard (an NP-hardness result)

Stochastic Simulation
Suppose we can sample instances according to P(X1, …, Xn)
What is the probability that a random sample satisfies e? Exactly P(e)
We can view each sample as tossing a biased coin that lands "heads" with probability P(e)

Stochastic Sampling
Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate
p̂ = (number of samples satisfying e) / N
The law of large numbers implies that as N grows, our estimate converges to p = P(e)
The number of samples we need is potentially exponential in the dimension of P

Importance Sampling
A method for evaluating the expectation of f under P(X):
– Discrete: E_P[f] = Σ over x of f(x) P(x)
– Continuous: E_P[f] = ∫ f(x) P(x) dx
If we could sample from P: E_P[f] ≈ (1/M) Σ over m of f(x[m]), with x[m] drawn from P

Importance Sampling
A general method for evaluating E_P[f] when we cannot sample from P(X).
Idea: choose an approximating distribution Q(X) and sample from it.
Define the weight W(x) = P(x) / Q(x). Then
E_P[f] = Σ over x of f(x) P(x) = Σ over x of f(x) W(x) Q(x) = E_Q[f · W]
If we could generate samples from P(X), we would average f directly; now that we generate the samples from Q(X), we average f · W instead.

(Unnormalized) Importance Sampling
1. For m = 1, …, M: sample x[m] from Q(X) and calculate w[m] = P(x[m]) / Q(x[m])
2. Estimate the expectation of f(X) using
Ê[f] = (1/M) Σ over m of f(x[m]) w[m]
Requirements:
– P(x) > 0 ⇒ Q(x) > 0 (don't ignore possible scenarios)
– It is possible to calculate P(x) and Q(x) for a specific x
– It is possible to sample from Q(X)
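A minimal sketch of the unnormalized estimator; the densities `p` and `q` and the sampler `sample_q` are assumed inputs, and the toy check at the end is illustrative:

```python
import random

def importance_sampling(f, p, q, sample_q, M=10000):
    """Unnormalized importance sampling: estimate E_P[f] from samples
    drawn from Q, weighting each by w = p(x) / q(x)."""
    total = 0.0
    for _ in range(M):
        x = sample_q()
        total += f(x) * p(x) / q(x)  # needs q(x) > 0 wherever p(x) > 0
    return total / M

# Toy check: E_P[X] = 0.5 for P = Uniform(0, 1), proposal Q = Uniform(0, 2)
est = importance_sampling(
    f=lambda x: x,
    p=lambda x: 1.0 if 0.0 <= x <= 1.0 else 0.0,
    q=lambda x: 0.5,
    sample_q=lambda: random.uniform(0, 2),
)
print(est)  # close to 0.5
```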

Normalized Importance Sampling
Assume we cannot evaluate P(X = x) but can evaluate P'(X = x) = α P(X = x) for some unknown constant α (for example, in a Bayesian network we can evaluate P(X) but not P(X | e))
We define w'(x) = P'(x) / Q(x). We can then evaluate α:
E_Q[w'] = Σ over x of Q(x) P'(x)/Q(x) = Σ over x of P'(x) = α
and then
E_P[f] = (1/α) Σ over x of f(x) P'(x) = E_Q[f · w'] / E_Q[w']
In the last step we simply replaced α with the expectation above.

Normalized Importance Sampling
We can now estimate the expectation of f(X) as in unnormalized importance sampling, by sampling x[m] from Q(X) and computing
Ê[f] = (Σ over m of f(x[m]) w'(x[m])) / (Σ over m of w'(x[m]))
(hence the name "normalized")
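The corresponding sketch, where `p_unnorm` is assumed to evaluate the unnormalized density P' pointwise:

```python
def normalized_is(f, p_unnorm, q, sample_q, M=10000):
    """Normalized importance sampling: the weights use the unnormalized
    density P'(x) = alpha * P(x), and dividing by the summed weights
    cancels the unknown constant alpha."""
    num = den = 0.0
    for _ in range(M):
        x = sample_q()
        w = p_unnorm(x) / q(x)  # w'(x) = P'(x) / Q(x)
        num += f(x) * w
        den += w
    return num / den
```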

Importance Sampling Weaknesses
– It is important to choose a sampling distribution with heavy tails, so as not to "miss" large values of f
– Many-dimensional importance sampling:
  – The "typical set" of P may take a long time to find, unless Q is a good approximation to P
  – Weights vary by factors exponential in N
– Similar issues arise for likelihood weighting

Next Class
1. Approximation guarantees and hardness
2. Monte Carlo techniques
   1. Rejection sampling
   2. Likelihood weighting
   3. Importance sampling
3. Markov Chain Monte Carlo (MCMC)
   1. Gibbs sampling
   2. Metropolis-Hastings

Stochastic Sampling
Previously: independent samples to estimate P(X = x | e)
Problem: it is difficult to sample from P(X1, …, Xn | e), so we had to use likelihood weighting, which introduces bias into the estimate
In some cases, such as when the evidence is on leaves, these methods are inefficient:
– Very low weights if e has low prior probability
– Very few samples carry most of the mass (high weight)

MCMC Methods
Sampling methods based on Markov chains: Markov Chain Monte Carlo (MCMC) methods
Key ideas:
– View the sampling process as a Markov chain: the next sample depends on the previous one
– Can approximate any posterior distribution
Next: review the theory of Markov chains

MCMC Methods
Notes:
– The Markov chain variable Y takes as values assignments to all variables that are consistent with the evidence
– For simplicity, we will denote such a state by the vector of variables

Gibbs Sampler
One of the simplest MCMC methods
Each transition changes the state of only one Xi
The transition probability is defined by P itself, as a stochastic procedure:
– Input: a state x1, …, xn
– Choose i at random (uniform probability)
– Sample x'i from P(Xi | x1, …, xi−1, xi+1, …, xn, e)
– Let x'j = xj for all j ≠ i
– Return x'1, …, x'n
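A minimal sketch of one Gibbs transition; `sample_given_rest(i, state)` is an assumed helper that draws Xi from its full conditional P(Xi | all other variables, e):

```python
import random

def gibbs_step(state, sample_given_rest):
    """One Gibbs transition: pick a coordinate i uniformly at random and
    resample it from its full conditional, leaving the rest unchanged."""
    i = random.randrange(len(state))
    new_state = list(state)
    new_state[i] = sample_given_rest(i, state)  # x'_i ~ P(X_i | x_-i, e)
    return new_state
```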

Sampling Strategy
How do we collect the samples?
Strategy I: run the chain M times, each for N steps
– Each run starts from a different starting point
– Return the last state of each run
[Figure: M parallel chains, one sample taken from the end of each]

Sampling Strategy
Strategy II: run one chain for a long time
– After a "burn-in" period, record a sample every fixed number of steps
[Figure: a single chain with the burn-in prefix discarded, yielding M samples from one chain]