Approximate Inference 2: Importance Sampling

(Unnormalized) Importance Sampling

Likelihood weighting is a special case of a general approach called importance sampling.
Let X be a set of variables that takes on values in some space Val(X).
Importance sampling is a way to estimate E_P(X)[f(X)], i.e. the expectation of a function f(X) relative to some distribution P(X), typically called the target distribution.

(Unnormalized) Importance Sampling
Generate samples x[1], …, x[M] from P.
Then estimate:
E_P[f(X)] ≈ (1/M) Σ_{m=1}^{M} f(x[m])

(Unnormalized) Importance Sampling
Sometimes you might want to generate samples from a different distribution Q, called a proposal distribution or sampling distribution.
Why?
– It might be impossible or computationally expensive to sample from P.
The proposal distribution can be (almost) arbitrary:
– We require that Q(x) > 0 whenever P(x) > 0.
– But the computational performance of importance sampling depends strongly on how similar Q is to P.

(Unnormalized) Importance Sampling
How to use the proposal distribution: generate a set of samples D = {x[1], …, x[M]} from Q, then estimate:
E_P[f(X)] ≈ (1/M) Σ_{m=1}^{M} f(x[m]) · P(x[m]) / Q(x[m])
This is the unnormalized importance sampling estimator.
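As a concrete illustration (not taken from the slides), here is a minimal NumPy sketch of the unnormalized estimator. The particular target N(0, 1), proposal N(0, 2) and function f(x) = x² are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Toy choices for the example: target P = N(0, 1), proposal Q = N(0, 2),
# f(x) = x^2, so the true value is E_P[f] = 1.
def f(x):
    return x ** 2

M = 100_000
x = rng.normal(0.0, 2.0, size=M)                       # samples x[1..M] from Q
w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # weights P(x)/Q(x)

estimate = np.mean(f(x) * w)   # (1/M) * sum_m f(x[m]) * P(x[m]) / Q(x[m])
print(estimate)                # should be close to 1.0
```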

(Unnormalized) Importance Sampling
This estimator is unbiased; the standard argument is sketched below.
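The derivation itself appeared only as an image on the original slide; a sketch of the standard argument (written for discrete X) is:

```latex
\mathbb{E}_{Q}\!\left[f(X)\,\frac{P(X)}{Q(X)}\right]
  = \sum_{x} Q(x)\, f(x)\,\frac{P(x)}{Q(x)}
  = \sum_{x} P(x)\, f(x)
  = \mathbb{E}_{P}[f(X)]
```

Each weighted sample therefore has expectation E_P[f(X)], and so does the average of M such samples.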

Normalized Importance Sampling

Frequently, P is known only up to a normalizing constant Z, i.e. P(X) = P̃(X) / Z.
This happens when:
– We know P(X, e) but need P(X | e)
– We have the unnormalized product of clique potentials for a Markov network

Normalized Importance Sampling
Define the importance weight w(X) = P̃(X) / Q(X).
The expected value of w(X) under Q(X) is the normalizing constant Z, as sketched below.
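The corresponding derivation was also an image on the slide; the standard steps (again for discrete X) are:

```latex
\mathbb{E}_{Q}[w(X)]
  = \sum_{x} Q(x)\,\frac{\tilde{P}(x)}{Q(x)}
  = \sum_{x} \tilde{P}(x)
  = Z
```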

Normalized Importance Sampling

With M samples D = {x[1], …, x[M]} from Q, we can estimate:
E_P[f(X)] ≈ ( Σ_{m=1}^{M} f(x[m]) w(x[m]) ) / ( Σ_{m=1}^{M} w(x[m]) )
This is called the normalized importance sampling estimator or weighted importance sampling estimator.

Normalized Importance Sampling
The normalized importance sampling estimator is biased.
But it has smaller variance than the unnormalized estimator.
For this reason the normalized estimator is often used instead of the unnormalized one, even when P is known and can be sampled from effectively.
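In the same spirit as the earlier sketch (again a toy example of my own, not from the slides), here is a minimal version of the normalized estimator for a case where only an unnormalized density P̃ can be evaluated:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy unnormalized target: P~(x) = exp(-x^2 / 2), i.e. N(0, 1) up to the
# unknown constant Z = sqrt(2*pi) ~ 2.5066.
def p_tilde(x):
    return np.exp(-0.5 * x ** 2)

def q_pdf(x):                                   # proposal Q = N(0, 2)
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

def f(x):
    return x ** 2                               # E_P[f] = 1 under the true target

M = 100_000
x = rng.normal(0.0, 2.0, size=M)                # samples from Q
w = p_tilde(x) / q_pdf(x)                       # w(x) = P~(x) / Q(x)

normalized_estimate = np.sum(f(x) * w) / np.sum(w)   # normalized IS estimator
z_estimate = np.mean(w)                              # also estimates Z
print(normalized_estimate, z_estimate)               # ~1.0 and ~2.5066
```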

Importance Sampling for Bayesian Networks

Student Example
Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S) and Letter (L), with CPDs P(D), P(I), P(G|D,I), P(S|I), P(L|G):

P(D):      low 0.6    high 0.4
P(I):      low 0.7    high 0.3
P(S|I):    I=low:   S=low 0.95,  S=high 0.05
           I=high:  S=low 0.2,   S=high 0.8
P(L|G):    G=C:  L=weak 0.99,  L=strong 0.01
           G=B:  L=weak 0.4,   L=strong 0.6
           G=A:  L=weak 0.1,   L=strong 0.9
P(G|D,I):  D=low,  I=low:   C 0.3,   B 0.4,   A 0.3
           D=low,  I=high:  C 0.02,  B 0.08,  A 0.9
           D=high, I=low:   C 0.7,   B 0.25,  A 0.05
           D=high, I=high:  C 0.2,   B 0.3,   A 0.5

Importance Sampling for Bayesian Networks
What proposal distribution do we use?
Suppose we want the event Grade = B, either as a query or as evidence.
– It is easy to sample P(Letter | Grade = B).
– It is difficult to account for Grade = B's influence on Difficulty, Intelligence and SAT.
In general:
– We want to account for the effect of the event on the descendants.
– But we want to avoid accounting for its effects on the nondescendants.

Importance Sampling for Bayesian Networks
Let B be a network, and Z_1 = z_1, …, Z_k = z_k, abbreviated Z = z, an instantiation of variables. We define the mutilated network B_{Z=z} as follows:
– Each node Z_i ∈ Z has no parents in B_{Z=z}.
– The CPD of Z_i in B_{Z=z} gives probability 1 to Z_i = z_i and probability 0 to all other values z_i' ∈ Val(Z_i).
– The parents and CPDs of all other nodes X ∉ Z are unchanged.

Importance Sampling for Bayesian Networks Intelligence GradeSAT Difficulty Letter DP(D) low0.6 high0.4 IP(I) low0 high1 ISP(S|I) low 0.95 lowhigh0.05 highlow0.2 high 0.8 GLP(L|G) Cweak0.99 Cstrong0.01 Bweak0.4 Bstrong0.6 Aweak0.1 Astrong0.9 GP(G|D,I) C0 B1 A0 Mutilated Network:

Importance Sampling for Bayesian Networks
Proposition 12.2: Let ξ be a sample generated by likelihood weighting with evidence Z = z, and let w be its weight. Then the distribution over ξ is the one defined by the mutilated network B_{Z=z}, and
w(ξ) = P_B(ξ) / Q(ξ),
where Q is the distribution induced by B_{Z=z}.
(Informally) Importance sampling using a mutilated network as the proposal distribution is equivalent to likelihood weighting with target P_B(X, z) and proposal distribution Q induced by the mutilated network B_{E=e}.
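To make the equivalence concrete, here is a small self-contained likelihood-weighting sketch on the student network above (illustrative code of my own; only the CPD tables come from the example slide). It samples non-evidence variables exactly as forward sampling in the mutilated network would, and accumulates the evidence CPD values as the weight:

```python
import random

random.seed(0)

# Student network from the example slide: CPDs stored as
# {parent_assignment_tuple: {value: prob}}, parents listed per variable.
parents = {"D": (), "I": (), "G": ("D", "I"), "S": ("I",), "L": ("G",)}
cpds = {
    "D": {(): {"low": 0.6, "high": 0.4}},
    "I": {(): {"low": 0.7, "high": 0.3}},
    "S": {("low",):  {"low": 0.95, "high": 0.05},
          ("high",): {"low": 0.2,  "high": 0.8}},
    "L": {("C",): {"weak": 0.99, "strong": 0.01},
          ("B",): {"weak": 0.4,  "strong": 0.6},
          ("A",): {"weak": 0.1,  "strong": 0.9}},
    "G": {("low", "low"):   {"C": 0.3,  "B": 0.4,  "A": 0.3},
          ("low", "high"):  {"C": 0.02, "B": 0.08, "A": 0.9},
          ("high", "low"):  {"C": 0.7,  "B": 0.25, "A": 0.05},
          ("high", "high"): {"C": 0.2,  "B": 0.3,  "A": 0.5}},
}
order = ["D", "I", "G", "S", "L"]   # a topological ordering

def lw_sample(evidence):
    """One likelihood-weighted sample: evidence nodes are clamped and contribute
    P(z_i | parents) to the weight; all other nodes are sampled from their CPDs."""
    sample, weight = {}, 1.0
    for var in order:
        dist = cpds[var][tuple(sample[p] for p in parents[var])]
        if var in evidence:
            sample[var] = evidence[var]
            weight *= dist[evidence[var]]
        else:
            r, acc = random.random(), 0.0
            for val, p in dist.items():
                acc += p
                if r <= acc:
                    break
            sample[var] = val   # falls back to the last value on rounding error
    return sample, weight

# Normalized LW estimate of P(Intelligence = high | Grade = B).
M, num, den = 50_000, 0.0, 0.0
for _ in range(M):
    s, w = lw_sample({"G": "B"})
    num += w * (s["I"] == "high")
    den += w
print(num / den)   # exact value is 0.0504 / 0.2884 ≈ 0.175
```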

Likelihood Weighting Revisited

Two versions of likelihood weighting:
1. Ratio Likelihood Weighting
2. Normalized Likelihood Weighting

Likelihood Weighting Revisited
Ratio Likelihood Weighting: use unnormalized importance sampling to estimate P(y, e) and P(e) separately and take their ratio (the resulting estimator is written out below):
1. For the numerator: use LW to generate M samples with Y = y, E = e as the event.
2. For the denominator: use LW to generate M' samples with E = e as the event.
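The estimator on this slide was an image; reconstructed from the surrounding text, the standard form is:

```latex
\hat{P}(y \mid e)
  = \frac{\hat{P}(y, e)}{\hat{P}(e)}
  = \frac{\frac{1}{M}\sum_{m=1}^{M} w[m]}
         {\frac{1}{M'}\sum_{m'=1}^{M'} w'[m']}
```

where w[m] are the LW weights of the samples generated with Y = y, E = e as the event, and w'[m'] are the weights of the samples generated with E = e.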

Likelihood Weighting Revisited
Normalized Likelihood Weighting
Ratio likelihood weighting estimates a single query P(y | e) from a set of samples (i.e. it sets Y = y when sampling).
Sometimes we want to evaluate a set of queries P(y | e) for different values y.
In that case, use normalized likelihood weighting with the unnormalized target P̃(X) = P(X, e), and estimate the expectation of a function f as written out below.
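The two formulas on this slide were also images; a standard reconstruction, assuming the unnormalized target P̃(X) = P_B(X, e) and the mutilated-network proposal Q, is:

```latex
w[m] = \frac{\tilde{P}(x[m])}{Q(x[m])},
\qquad
\hat{\mathbb{E}}_{P(\cdot \mid e)}[f]
  = \frac{\sum_{m=1}^{M} f(x[m])\, w[m]}{\sum_{m=1}^{M} w[m]}
```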

Likelihood Weighting Revisited
The quality of importance sampling depends on how close the proposal distribution Q is to the target distribution P. Consider the two extremes:
1. All evidence at the roots:
– The proposal distribution is the posterior.
– The evidence affects the samples all along the way, and all samples have the same weight P(e).

Likelihood Weighting Revisited
2. All evidence at the leaves:
– The proposal distribution is the prior distribution P_B(X).
– The evidence doesn't affect the samples, so the weights have to compensate. LW will only work well if the prior is similar to the posterior.

Likelihood Weighting Revisited
If P(e) is high, then the posterior P(X|e) is close to the prior P(X).
If P(e) is low, then the posterior P(X|e) will likely look very different from the prior P(X).
[Figure: prior vs. posterior distributions]

Likelihood Weighting Revisited
Summary: Ratio Likelihood Weighting
– Sets the values of Y = y, which results in lower variance in the estimator.
– Theoretical analysis allows us to provide bounds (under very strong conditions) on the number of samples needed to obtain a good estimate of P(y, e) and P(e).
– Needs a new set of samples for each query y.

Likelihood Weighting Revisited
Summary: Normalized Likelihood Weighting
– Samples an assignment for Y, which introduces additional variance.
– Allows multiple queries y using the same set of samples (conditioned on evidence e).

Likelihood Weighting Revisited
Problems with likelihood weighting:
If there are many evidence variables P(Y | E_1 = e_1, ..., E_k = e_k):
– Many samples will have near-zero weight.
– The weighted estimate is dominated by the small fraction of samples with non-negligible weight.
If the evidence variables occur at the leaves, the samples drawn will not be affected much by the evidence.