CS B553: Algorithms for Optimization and Learning
Monte Carlo Methods for Probabilistic Inference

Agenda
- Monte Carlo methods: O(1/sqrt(N)) standard deviation; use for Bayesian inference
- Likelihood weighting
- Gibbs sampling

Monte Carlo Integration
Estimate large integrals/sums: I = ∫ f(x) p(x) dx or I = Σ_x f(x) p(x).
Using N i.i.d. samples x^(1),…,x^(N) drawn from p(x): I ≈ (1/N) Σ_i f(x^(i)).
Examples:
- ∫_[a,b] f(x) dx ≈ ((b−a)/N) Σ_i f(x^(i)), with x^(i) drawn uniformly from [a,b]
- E[X] = ∫ x p(x) dx ≈ (1/N) Σ_i x^(i)
- Volume of a set in R^n
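
As a concrete illustration (not from the slides), here is a minimal Python sketch of the estimator above; the function names are illustrative.

```python
import random

def mc_estimate(f, sample_p, n=100_000):
    # Estimate E_p[f(X)] = integral of f(x) p(x) dx by averaging f over
    # n i.i.d. samples drawn from p.
    return sum(f(sample_p()) for _ in range(n)) / n

# Example: integral of x^2 over [0, 1], sampling x uniformly on [a, b].
# The estimate is (b - a)/N * sum_i f(x^(i)).
a, b = 0.0, 1.0
estimate = (b - a) * mc_estimate(lambda x: x * x, lambda: random.uniform(a, b))
print(estimate)  # close to 1/3
```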

Mean & Variance of Estimate
Let I_N be the random variable denoting the estimate of the integral with N samples.
What is the bias (mean error) E[I − I_N]?
E[I − I_N] = I − E[I_N]   (linearity of expectation)
= E[f(x)] − (1/N) Σ_i E[f(x^(i))]   (definition of I and I_N)
= (1/N) Σ_i (E[f(x)] − E[f(x^(i))]) = (1/N) Σ_i 0   (x and each x^(i) are distributed according to p(x))
= 0
So I_N is an unbiased estimator.

What is the variance Var[I_N]?
Var[I_N] = Var[(1/N) Σ_i f(x^(i))]   (definition)
= (1/N²) Var[Σ_i f(x^(i))]   (scaling of variance)
= (1/N²) Σ_i Var[f(x^(i))]   (variance of a sum of independent variables)
= (1/N) Var[f(x)]   (i.i.d. sample)
Standard deviation: O(1/sqrt(N)).
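
A quick empirical check of the O(1/sqrt(N)) behavior (a sketch, not part of the slides): quadrupling N should roughly halve the standard deviation of the estimator.

```python
import random
import statistics

def estimate(n):
    # Monte Carlo estimate of E[X^2] with X ~ Uniform(0, 1) from n samples.
    return sum(random.random() ** 2 for _ in range(n)) / n

for n in (100, 400, 1600):
    runs = [estimate(n) for _ in range(1000)]   # repeat the experiment many times
    print(n, statistics.stdev(runs))            # shrinks roughly as 1/sqrt(n)
```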

Approximate Inference Through Sampling
Unconditional simulation: to estimate the probability of a coin flipping heads, flip it a huge number of times and count the fraction of heads observed.
Conditional simulation: to estimate the probability P(H) that a coin picked out of bucket B flips heads, repeat for i = 1,…,N:
1. Pick a coin C out of a random bucket b^(i) chosen with probability P(B).
2. Flip C to obtain h^(i), according to probability P(H | b^(i)).
3. The sample (h^(i), b^(i)) comes from the distribution P(H, B).
The resulting samples approximate P(H, B).
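
A small sketch of the conditional-simulation recipe above; the bucket and coin probabilities below are made-up illustrative numbers, not from the slides.

```python
import random

P_B = {"bucket1": 0.7, "bucket2": 0.3}            # hypothetical P(B)
P_H_given_B = {"bucket1": 0.5, "bucket2": 0.9}    # hypothetical P(H | B)

def sample_hb():
    b = random.choices(list(P_B), weights=list(P_B.values()))[0]  # bucket ~ P(B)
    h = random.random() < P_H_given_B[b]                          # flip ~ P(H | b)
    return h, b                                                   # a draw from P(H, B)

samples = [sample_hb() for _ in range(100_000)]
print(sum(h for h, _ in samples) / len(samples))   # estimate of P(H)
```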

Monte Carlo Inference in Bayes Nets
Given a BN over variables X, repeat for i = 1,…,N: in top-down (topological) order, generate x^(i) by sampling x_j^(i) ~ P(X_j | pa_Xj^(i)), where the right-hand side is obtained by plugging the sampled parent values into the CPT for X_j.
The samples x^(1),…,x^(N) approximate the distribution over X.
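
Here is a minimal forward (ancestral) sampling sketch for the burglary network used in the following slides. The CPT numbers are the usual textbook values and are an assumption here, since the slides' tables are not reproduced in this transcript.

```python
import random

def sample_joint():
    # Sample each variable in top-down order, conditioning on sampled parents.
    b = random.random() < 0.001                     # P(B=1)  (assumed value)
    e = random.random() < 0.002                     # P(E=1)  (assumed value)
    p_a = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}[(b, e)]
    a = random.random() < p_a                       # P(A=1 | b, e)
    j = random.random() < (0.90 if a else 0.05)     # P(J=1 | a)
    m = random.random() < (0.70 if a else 0.01)     # P(M=1 | a)
    return {"B": b, "E": e, "A": a, "J": j, "M": m}

samples = [sample_joint() for _ in range(100_000)]
print(sum(s["J"] for s in samples) / len(samples))  # e.g., estimate of P(J=1)
```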

Approximate Inference: Monte Carlo Simulation
Sample from the joint distribution of the burglary network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls; the CPT tables for P(B), P(E), P(A|B,E), P(J|A), P(M|A) appear on the slide).
Example sample: B=0, E=0, A=0, J=1, M=0.

Approximate Inference: Monte Carlo Simulation
As more samples are generated, the distribution of the samples approaches the joint distribution:
B=0 E=0 A=0 J=1 M=0
B=0 E=0 A=0 J=0 M=0
B=0 E=0 A=0 J=0 M=0
B=1 E=0 A=1 J=1 M=0

Basic Method for Handling Evidence
Inference: given evidence E = e (e.g., J=1), approximate P(X \ E | E = e).
Remove the samples that conflict with the evidence (here, those with J=0):
B=0 E=0 A=0 J=1 M=0   (kept)
B=0 E=0 A=0 J=0 M=0   (removed)
B=0 E=0 A=0 J=0 M=0   (removed)
B=1 E=0 A=1 J=1 M=0   (kept)
The distribution of the remaining samples approximates the conditional distribution.
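
A rejection-style sketch of this method, reusing sample_joint from the earlier forward-sampling sketch (so it continues that assumed model rather than being a standalone program):

```python
def rejection_query(n, evidence, query_var):
    # Generate joint samples and keep only those consistent with the evidence;
    # the surviving samples approximate the conditional distribution.
    kept = [s for s in (sample_joint() for _ in range(n))
            if all(s[var] == val for var, val in evidence.items())]
    return sum(s[query_var] for s in kept) / len(kept) if kept else None

print(rejection_query(100_000, {"J": True}, "B"))   # estimate of P(B=1 | J=1)
```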

Rare-Event Problem
What if some events are really rare (e.g., burglary & earthquake)? The number of samples must be huge to get a reasonable estimate.
Solution: likelihood weighting. Enforce that each sample agrees with the evidence, and while generating a sample keep track of the ratio
(how likely the sampled values are to occur in the real world) / (how likely you were to generate those values).

Likelihood Weighting (Example)
Suppose the evidence is Alarm and MaryCalls. Sample B and E with probability 0.5 each, keeping a weight w (initially w = 1) that tracks the ratio of the true probability of the sampled values to the probability with which they were generated; evidence variables are set to their observed values and w is multiplied by the likelihood of each.

Sample 1: B=0, E=1 gives w = 0.008. A=1 is enforced, and the weight is updated to reflect the likelihood that this occurs. After setting M=1 and sampling J=1, w = 0.0016.
Sample 2: B=0, E=0 gives w = 3.988; enforcing A=1 gives w = 0.004; after M=1 and J=1, w = 0.0028.
Sample 3: B=1, E=0; enforcing A=1 and setting M=1, then sampling J=1, gives w = 0.0026.
Sample 4: B=1, E=1, A=1, M=1, J=1 gives w ≈ 5e-7 (essentially 0).

With these N = 4 weighted samples, the estimate is P(B | A, M) ≈ 0.371, which can be compared against the result of exact inference.
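
For reference, a sketch of standard likelihood weighting for this network, in which non-evidence variables are sampled from their CPTs (this differs from the slide's variant, which proposes B and E with probability 0.5 and corrects by the full ratio). The CPT values are again the assumed textbook numbers.

```python
import random

CPT_A = {(True, True): 0.95, (True, False): 0.94,
         (False, True): 0.29, (False, False): 0.001}

def weighted_sample(evidence):
    w = 1.0
    sample = {}
    def draw(var, p_true):
        nonlocal w
        if var in evidence:
            v = evidence[var]
            w *= p_true if v else 1.0 - p_true      # weight by likelihood of evidence
        else:
            v = random.random() < p_true            # sample from the CPT
        sample[var] = v
        return v

    b = draw("B", 0.001)
    e = draw("E", 0.002)
    a = draw("A", CPT_A[(b, e)])
    draw("J", 0.90 if a else 0.05)
    draw("M", 0.70 if a else 0.01)
    return sample, w

evidence = {"A": True, "M": True}
pairs = [weighted_sample(evidence) for _ in range(100_000)]
num = sum(w for s, w in pairs if s["B"])
den = sum(w for _, w in pairs)
print(num / den)   # estimate of P(B=1 | A=1, M=1)
```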

Another Rare-Event Problem
Consider a network with parents A_1,…,A_10 and children B_1,…,B_10, where B = b is given as evidence. Each observed b_i is rare under every setting of its parent A_i except one (say, A_i = 1). The chance of sampling all 1's for the A's is very low, so most likelihood weights will be too low.
Problem: the evidence is not being used to sample the A's effectively (i.e., near P(A_i | b)).

Gibbs Sampling
Idea: reduce the computational burden of sampling from a multidimensional distribution P(x) = P(x_1,…,x_n) by doing repeated draws of individual attributes.
Cycle through j = 1,…,n, sampling x_j ~ P(x_j | x_1,…,x_{j−1}, x_{j+1},…,x_n).
Over the long run, the random walk taken by x approaches the true distribution P(x).

Gibbs Sampling in BNs
Each Gibbs sampling step: 1) pick a variable X_i, 2) sample x_i ~ P(X_i | X \ X_i).
Only the values of X_i's "Markov blanket" matter: its parents Pa_Xi, its children Y_1,…,Y_k, and the parents of its children (excluding X_i), Pa_Y1 \ X_i, …, Pa_Yk \ X_i. X_i is independent of the rest of the network given its Markov blanket, so
x_i ~ P(X_i | Pa_Xi, Y_1, Pa_Y1 \ X_i, …, Y_k, Pa_Yk \ X_i) = (1/Z) P(X_i | Pa_Xi) P(Y_1 | Pa_Y1) ··· P(Y_k | Pa_Yk),
the product of X_i's factor and the factors of its children.

Handling Evidence
Simply set each evidence variable to its observed value and don't sample it.
The resulting walk approximates the distribution P(X \ E | E = e).
This uses evidence more efficiently than likelihood weighting.
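
A compact Gibbs sampling sketch for the burglary network with evidence J=1, M=1 (the evidence choice and CPT values are assumptions for illustration). For this small network the code simply multiplies all factors when resampling a variable; only the Markov-blanket factors actually vary with that variable, so the normalized result is the same.

```python
import random

def p_b(b): return 0.001 if b else 0.999
def p_e(e): return 0.002 if e else 0.998
def p_a(a, b, e):
    p = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}[(b, e)]
    return p if a else 1 - p
def p_j(j, a): return (0.90 if a else 0.05) if j else (0.10 if a else 0.95)
def p_m(m, a): return (0.70 if a else 0.01) if m else (0.30 if a else 0.99)

def joint(s):
    return (p_b(s["B"]) * p_e(s["E"]) * p_a(s["A"], s["B"], s["E"]) *
            p_j(s["J"], s["A"]) * p_m(s["M"], s["A"]))

def gibbs(n_steps, burn_in=1000):
    state = {"B": 0, "E": 0, "A": 0, "J": 1, "M": 1}   # evidence J=1, M=1 stays fixed
    hits = 0
    for t in range(n_steps):
        for var in ("B", "E", "A"):                    # resample non-evidence variables
            weights = []
            for v in (0, 1):
                state[var] = v
                weights.append(joint(state))           # unnormalized P(var = v | rest)
            state[var] = random.choices((0, 1), weights=weights)[0]
        if t >= burn_in:
            hits += state["B"]
    return hits / (n_steps - burn_in)                  # estimate of P(B=1 | J=1, M=1)

print(gibbs(50_000))
```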

Gibbs Sampling Issues
Demonstrating correctness and convergence requires analyzing the Markov chain random walk (more later).
Many steps may be needed before the effects of a poor initialization wear off (the mixing time), and it is difficult to tell a priori how many steps are required.
There are numerous variants; these methods are known as Markov chain Monte Carlo (MCMC) techniques.

Next Time
Continuous and hybrid distributions