CS 188: Artificial Intelligence. Bayes Nets: Approximate Inference. Instructor: Stuart Russell, University of California, Berkeley

Sampling

Sampling is a lot like repeated simulation.

Basic idea:
- Draw N samples from a sampling distribution S
- Compute an approximate posterior probability
- Show this converges to the true probability P

Why sample?
- Often very fast to get a decent approximate answer
- The algorithms are very simple and general (easy to apply to fancy models)
- They require very little memory (O(n))
- They can be applied to large models, whereas exact algorithms blow up

Example  Suppose you have two agent programs A and B for Monopoly  What is the probability that A wins?  Method 1:  Let s be a sequence of dice rolls and Chance and Community Chest cards  Given s, the outcome V(s) is determined (1 for a win, 0 for a loss)  Probability that A wins is  s P(s) V(s)  Problem: infinitely many sequences s !  Method 2:  Sample N (maybe 100) sequences from P(s), play N games  Probability that A wins is roughly 1/N  i V(s i ) i.e., the fraction of wins (e.g., 57/100) in the sample 3

Sampling from a discrete distribution

We need to simulate a biased d-sided die.

Step 1: Get a sample u from the uniform distribution over [0, 1)
- E.g., random() in Python

Step 2: Convert this sample u into an outcome for the given distribution by associating each outcome x with a P(x)-sized sub-interval of [0, 1)

Example:
  C      P(C)
  red    0.6
  green  0.1
  blue   0.3

If random() returns u = 0.83, then our sample is C = blue, since 0.83 falls in blue's sub-interval [0.7, 1.0).
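
A minimal Python sketch of this two-step procedure (the sample_discrete helper is ours, not from the slides); with an insertion-ordered dict the sub-intervals come out as [0, 0.6), [0.6, 0.7), [0.7, 1.0):

import random

def sample_discrete(dist):
    # dist maps each outcome to its probability; sub-intervals of [0, 1)
    # are laid out in the dict's insertion order.
    u = random.random()              # Step 1: u ~ Uniform[0, 1)
    cumulative = 0.0
    for outcome, p in dist.items():  # Step 2: find the interval containing u
        cumulative += p
        if u < cumulative:
            return outcome
    return outcome                   # guard against floating-point round-off

print(sample_discrete({"red": 0.6, "green": 0.1, "blue": 0.3}))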

Sampling in Bayes Nets
- Prior Sampling
- Rejection Sampling
- Likelihood Weighting
- Gibbs Sampling

Prior Sampling

The network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → WetGrass, Rain → WetGrass.

P(C):
  +c  0.5
  -c  0.5

P(S | C):
  +c: +s 0.1, -s 0.9
  -c: +s 0.5, -s 0.5

P(R | C):
  +c: +r 0.8, -r 0.2
  -c: +r 0.2, -r 0.8

P(W | S, R):
  +s, +r: +w 0.99, -w 0.01
  +s, -r: +w 0.90, -w 0.10
  -s, +r: +w 0.90, -w 0.10
  -s, -r: +w 0.01, -w 0.99

Samples:
  +c, -s, +r, +w
  -c, +s, -r, +w
  …

Prior Sampling  For i=1, 2, …, n (in topological order)  Sample X i from P(X i | parents(X i ))  Return (x 1, x 2, …, x n )

Prior Sampling  This process generates samples with probability: …i.e. the BN’s joint probability  Let the number of samples of an event be  Then  I.e., the sampling procedure is consistent

Example  We’ll get a bunch of samples from the BN: c,  s, r, w c, s, r, w  c, s, r, -w c,  s, r, w  c,  s,  r, w  If we want to know P(W)  We have counts  Normalize to get P(W) =  This will get closer to the true distribution with more samples  Can estimate anything else, too  E..g, for query P(C| r, w) use P(C| r, w) = α P(C, r, w) SR W C

Rejection Sampling

Rejection Sampling

A simple modification of prior sampling for conditional probabilities:
- Let's say we want P(C | +r, +w)
- Count the C outcomes, but ignore (reject) samples that don't have R = true, W = true
- This is called rejection sampling
- It is also consistent for conditional probabilities (i.e., correct in the limit)

Samples:
  +c, -s, +r, +w   (keep)
  +c, +s, -r       (reject: -r)
  +c, +s, +r, -w   (reject: -w)
  +c, -s, -r       (reject: -r)
  +c, -s, +r, +w   (keep)

Rejection Sampling  Input: evidence e 1,..,e k  For i=1, 2, …, n  Sample X i from P(X i | parents(X i ))  If x i not consistent with evidence  Reject: Return, and no sample is generated in this cycle  Return (x 1, x 2, …, x n )

Likelihood Weighting

Likelihood Weighting

Problem with rejection sampling:
- If evidence is unlikely, rejects lots of samples
- Evidence is not exploited as you sample
- Consider P(Shape | Color = blue):

  Shape, Color           Shape, Color
  pyramid, green         pyramid, blue
  pyramid, red           sphere,  blue
  sphere,  blue          cube,    blue
  cube,    red           sphere,  blue
  sphere,  green

Idea: fix the evidence variables, sample the rest
- Problem: the sample distribution is not consistent!
- Solution: weight each sample by the probability of the evidence variables given their parents

Likelihood Weighting

(Same network and CPTs as in the prior sampling example; evidence variables are fixed rather than sampled.)

Samples:
  +c, +s, +r, +w
  …

Likelihood Weighting  Input: evidence e 1,..,e k  w = 1.0  for i=1, 2, …, n  if X i is an evidence variable  x i = observed value i for X i  Set w = w * P(x i | Parents(X i ))  else  Sample x i from P(X i | Parents(X i ))  return (x 1, x 2, …, x n ), w

Likelihood Weighting  Sampling distribution if z sampled and e fixed evidence  Now, samples have weights  Together, weighted sampling distribution is consistent Cloudy R C S W

Likelihood Weighting  Likelihood weighting is good  All samples are used  The values of downstream variables are influenced by upstream evidence  Likelihood weighting still has weaknesses  The values of upstream variables are unaffected by downstream evidence  E.g., suppose evidence is a video of a traffic accident  With evidence in k leaf nodes, weights will be O(2 -k )  With high probability, one lucky sample will have much larger weight than the others, dominating the result  We would like each variable to “see” all the evidence!

Gibbs Sampling

Markov Chain Monte Carlo

MCMC (Markov chain Monte Carlo) is a family of randomized algorithms for approximating some quantity of interest over a very large state space.
- Markov chain = a sequence of randomly chosen states (a "random walk"), where each state is chosen conditioned on the previous state
- Monte Carlo = a very expensive city in Monaco with a famous casino
- Monte Carlo = an algorithm (usually based on sampling) that has some probability of producing an incorrect answer
- MCMC = wander around for a bit, average what you see

Gibbs sampling  A particular kind of MCMC  States are complete assignments to all variables  (Cf local search: closely related to simulated annealing!)  Evidence variables remain fixed, other variables change  To generate the next state, pick a variable and sample a value for it conditioned on all the other variables (Cf min-conflicts!)  X i ’ ~ P(X i | x 1,..,x i-1,x i+1,..,x n )  Will tend to move towards states of higher probability, but can go down too  In a Bayes net, P(X i | x 1,..,x i-1,x i+1,..,x n ) = P(X i | markov_blanket(X i ))  Theorem: Gibbs sampling is consistent*  Provided all Gibbs distributions are bounded away from 0 and 1 and variable selection is fair 22

Why would anyone do this?

Samples soon begin to reflect all the evidence in the network. Eventually they are being drawn from the true posterior!

How would anyone do this?

Repeat many times:
- Sample a non-evidence variable X_i from

    P(X_i | x_1, …, x_{i-1}, x_{i+1}, …, x_n) = P(X_i | markov_blanket(X_i))
                                               = α P(X_i | u_1, …, u_m) ∏_j P(y_j | parents(Y_j))

  where u_1, …, u_m are X_i's parents and the Y_j are its children.

Gibbs Sampling Example: P(S | +r)

Step 1: Fix evidence
- R = +r

Step 2: Initialize other variables
- Randomly

Step 3: Repeat (see the sketch below)
- Choose a non-evidence variable X
- Resample X from P(X | markov_blanket(X)):
    Sample S ~ P(S | +c, +r, -w)
    Sample C ~ P(C | +s, +r)
    Sample W ~ P(W | +s, +r)
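
A sketch of the full Gibbs loop for P(S | +r), again reusing the CPT dictionaries from the prior-sampling sketch. Each Markov-blanket conditional is computed by normalizing the product of the relevant CPT entries; for simplicity it sweeps S, C, W in a fixed order and omits burn-in:

import random

def gibbs_estimate_S(num_steps=50000):
    r = True                                   # evidence R = +r stays fixed
    c = random.random() < 0.5                  # Step 2: random initialization
    s = random.random() < 0.5
    w = random.random() < 0.5
    count_s = 0
    for _ in range(num_steps):                 # Step 3: repeat
        # Resample S ~ P(S | c, r, w) ∝ P(S | c) * P(w | S, r)
        pt = P_S[c]       * (P_W[(True, r)]  if w else 1 - P_W[(True, r)])
        pf = (1 - P_S[c]) * (P_W[(False, r)] if w else 1 - P_W[(False, r)])
        s = random.random() < pt / (pt + pf)
        # Resample C ~ P(C | s, r) ∝ P(C) * P(s | C) * P(r | C)
        pt = P_C       * (P_S[True]  if s else 1 - P_S[True])  * P_R[True]
        pf = (1 - P_C) * (P_S[False] if s else 1 - P_S[False]) * P_R[False]
        c = random.random() < pt / (pt + pf)
        # Resample W ~ P(W | s, r): W's Markov blanket is just its parents
        w = random.random() < P_W[(s, r)]
        count_s += s                           # tally +s occurrences
    return count_s / num_steps                 # estimate of P(+s | +r)

print(gibbs_estimate_S())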

Why does it work? (see AIMA for details)

Suppose we run it for a long time and predict the probability of reaching any given state at time t: π_t(x_1, …, x_n), or π_t(x) for short.

Each Gibbs sampling step (pick a variable, resample its value) applied to a state x has a probability q(x' | x) of reaching a next state x'.

So π_{t+1}(x') = Σ_x q(x' | x) π_t(x), or, in matrix/vector form, π_{t+1} = Q π_t.

When the process is in equilibrium, π_{t+1} = π_t, so Q π_t = π_t.

This has a unique* solution π_t = P(x_1, …, x_n | e_1, …, e_k).

So for large enough t, the next sample will be drawn from the true posterior.
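
A small numerical illustration (our own toy example, not from the slides): repeatedly applying a column-stochastic transition matrix Q drives any initial distribution to the fixed point satisfying Q π = π:

import numpy as np

Q = np.array([[0.9, 0.3],            # Q[x', x] = q(x' | x); columns sum to 1
              [0.1, 0.7]])
pi = np.array([1.0, 0.0])            # arbitrary starting distribution pi_0
for _ in range(100):
    pi = Q @ pi                      # pi_{t+1} = Q pi_t
print(pi)                            # converges to the fixed point [0.75, 0.25]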

Bayes Net Sampling Summary
- Prior Sampling: P(Q)
- Rejection Sampling: P(Q | e)
- Likelihood Weighting: P(Q | e)
- Gibbs Sampling: P(Q | e)