CS b553: Algorithms for Optimization and Learning

CS b553: Algorithms for Optimization and Learning
Monte Carlo Methods for Probabilistic Inference

Agenda
Monte Carlo methods: O(1/sqrt(N)) standard deviation of the estimate
Monte Carlo methods for Bayesian inference: likelihood weighting, Gibbs sampling

Monte Carlo Integration
Estimate large integrals/sums: I = ∫ f(x) p(x) dx or I = Σx f(x) p(x).
Using a sample of N i.i.d. draws x(1),…,x(N) from p(x): I ≈ (1/N) Σi f(x(i)).
Examples:
∫[a,b] f(x) dx ≈ ((b-a)/N) Σi f(x(i)), with the x(i) drawn uniformly from [a,b]
E[X] = ∫ x p(x) dx ≈ (1/N) Σi x(i)
Volume of a set in R^n
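As a concrete illustration (not from the slides), here is a minimal Python sketch of this estimator; the function names and the choice f(x) = x² with p = N(0,1) are my own, picked only to make the code runnable.

```python
import numpy as np

def mc_estimate(f, sampler, n):
    """Monte Carlo estimate of I = E_p[f(X)] from n i.i.d. samples of p."""
    xs = sampler(n)          # x(1), ..., x(n) ~ p(x)
    return np.mean(f(xs))    # (1/n) * sum_i f(x(i))

rng = np.random.default_rng(0)

# Estimate E[X^2] for X ~ N(0, 1); the exact value is 1.
est = mc_estimate(lambda x: x**2, rng.standard_normal, 100_000)
print(est)   # close to 1.0 for large n
```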

Mean & Variance of the Estimate
Let IN be the random variable denoting the estimate of the integral with N samples.

What is the bias (mean error) E[I - IN]?
E[I - IN] = I - E[IN]  (linearity of expectation)
= E[f(x)] - (1/N) Σi E[f(x(i))]  (definition of I and IN)
= (1/N) Σi (E[f(x)] - E[f(x(i))])
= (1/N) Σi 0 = 0  (x and each x(i) are distributed according to p(x))
So IN is an unbiased estimator.

What is the variance Var[IN]?
Var[IN] = Var[(1/N) Σi f(x(i))]  (definition)
= (1/N²) Var[Σi f(x(i))]  (scaling of variance)
= (1/N²) Σi Var[f(x(i))]  (variance of a sum of independent variables)
= (1/N) Var[f(x)]  (i.i.d. samples)
Standard deviation: O(1/sqrt(N))
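To see the O(1/sqrt(N)) behavior numerically, the sketch below (my own illustration, not part of the slides) repeats the estimator for increasing N and prints the empirical standard deviation of IN, which should shrink roughly by half each time N is quadrupled.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_once(n):
    """One Monte Carlo estimate I_N of E[X^2] for X ~ N(0, 1) (true value 1)."""
    x = rng.standard_normal(n)
    return np.mean(x**2)

for n in [100, 400, 1600, 6400]:
    # Empirical standard deviation of I_N over many independent runs.
    runs = [estimate_once(n) for _ in range(2000)]
    print(n, np.std(runs))   # shrinks roughly like 1/sqrt(n)
```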

Approximate Inference Through Sampling
Unconditional simulation: to estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed.
Conditional simulation: to estimate the probability P(H) that a coin picked out of bucket B flips heads:
Repeat for i = 1,…,N:
  Pick a coin C out of a random bucket b(i) chosen with probability P(B)
  h(i) = result of flipping C, sampled according to P(H | b(i))
Each sample (h(i), b(i)) comes from the distribution P(H, B), so the collection of samples approximates P(H, B).
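A hedged sketch of this conditional simulation; the bucket probabilities and coin biases below are made-up numbers chosen only so the code runs end to end.

```python
import random

# Made-up numbers: two buckets of coins with different biases.
P_BUCKET = {"b1": 0.6, "b2": 0.4}               # P(B)
P_HEADS_GIVEN_BUCKET = {"b1": 0.9, "b2": 0.2}   # P(H = heads | B)

def simulate():
    """Draw one sample (h, b) ~ P(H, B): pick a bucket, then flip a coin from it."""
    b = "b1" if random.random() < P_BUCKET["b1"] else "b2"
    h = random.random() < P_HEADS_GIVEN_BUCKET[b]   # True means heads
    return h, b

samples = [simulate() for _ in range(100_000)]
# The fraction of heads approximates P(H = heads) = 0.6*0.9 + 0.4*0.2 = 0.62.
print(sum(h for h, _ in samples) / len(samples))
```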

Monte Carlo Inference in Bayes Nets
Given a BN over variables X:
Repeat for i = 1,…,N:
  In top-down (topological) order, generate x(i) as follows:
  Sample xj(i) ~ P(Xj | paXj(i))  (the right-hand side is obtained by plugging the parent values already sampled in x(i) into the CPT for Xj)
The samples x(1),…,x(N) approximate the joint distribution over X.

Approximate Inference: Monte-Carlo Simulation
Sample from the joint distribution of the burglary network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls):
P(B=1) = 0.001, P(E=1) = 0.002
P(A=1 | B,E): B=1,E=1: 0.95; B=1,E=0: 0.94; B=0,E=1: 0.29; B=0,E=0: 0.001
P(J=1 | A): A=1: 0.90; A=0: 0.05
P(M=1 | A): A=1: 0.70; A=0: 0.01
Example sample: B=0, E=0, A=0, J=1, M=0
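A minimal Python sketch of this top-down sampling on the burglary network, using the CPT numbers from the slide; the dictionary encoding and function names are my own choices, not part of the course material.

```python
import random

# CPTs from the slide.
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}                                          # P(M=1 | A)

def bern(p):
    return 1 if random.random() < p else 0

def sample_joint():
    """Draw one sample (b, e, a, j, m) in top-down (topological) order."""
    b = bern(P_B)
    e = bern(P_E)
    a = bern(P_A[(b, e)])    # the sampled parent values index A's CPT
    j = bern(P_J[a])
    m = bern(P_M[a])
    return b, e, a, j, m

samples = [sample_joint() for _ in range(100_000)]
# The fraction of samples with A=1 approximates P(A=1), which is about 0.0025.
print(sum(s[2] for s in samples) / len(samples))
```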

Approximate Inference: Monte-Carlo Simulation
As more samples are generated, the distribution of the samples approaches the joint distribution. Example samples:
B=0, E=0, A=0, J=1, M=0
B=0, E=0, A=0, J=0, M=0
B=0, E=0, A=0, J=0, M=0
B=1, E=0, A=1, J=1, M=0

Basic method for Handling Evidence
Inference: given evidence E=e (e.g., J=1), approximate P(X\E | E=e), the distribution over the non-evidence variables. Generate samples as before and remove the samples that conflict with the evidence; e.g., keep
B=0, E=0, A=0, J=1, M=0
B=1, E=0, A=1, J=1, M=0
and discard the two samples with J=0. The distribution of the remaining samples approximates the conditional distribution (this filtering is known as rejection sampling).
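A hedged sketch of this filtering scheme on the same network (the compact encoding mirrors the previous sketch; the evidence J=1 matches the example above). It keeps only samples consistent with the evidence and estimates a conditional probability from the survivors.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.90, 0: 0.05}
P_M = {1: 0.70, 0: 0.01}
bern = lambda p: 1 if random.random() < p else 0

def sample_joint():
    b, e = bern(P_B), bern(P_E)
    a = bern(P_A[(b, e)])
    return b, e, a, bern(P_J[a]), bern(P_M[a])

# Keep only samples consistent with the evidence J=1; discard the rest.
kept = [s for s in (sample_joint() for _ in range(1_000_000)) if s[3] == 1]
# The kept samples approximate P(. | J=1); e.g. estimate P(B=1 | J=1), about 0.016.
print(sum(s[0] for s in kept) / len(kept))
```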

Rare Event
Problem: what if some events are really rare (e.g., burglary & earthquake)? The number of samples must be huge to get a reasonable estimate.
Solution: likelihood weighting. Enforce that each sample agrees with the evidence, and while generating a sample, keep track of the ratio of (how likely the sampled values are to occur in the real world) to (how likely you were to generate those values).

Likelihood weighting: example
Suppose the evidence is Alarm=1 and MaryCalls=1 in the burglary network above, and B and E are each sampled with probability 0.5. Each sampled variable multiplies the weight by (true probability of the sampled value) / 0.5, and each evidence variable is enforced and multiplies the weight by its CPT probability.

Sample 1: start with w = 1. Sampling B=0 and E=1 gives w = (0.999/0.5)(0.002/0.5) ≈ 0.008. A=1 is enforced, and the weight is updated to reflect the likelihood that this occurs: w ≈ 0.008 × 0.29 ≈ 0.0023. M=1 is enforced (w ≈ 0.0023 × 0.70 ≈ 0.0016), and J=1 is sampled from P(J | A=1). Result: B=0, E=1, A=1, J=1, M=1 with w ≈ 0.0016.
Sample 2: B=0, E=0 gives w ≈ 3.988; enforcing A=1 gives w ≈ 0.004; enforcing M=1 (with J=1 sampled) gives w ≈ 0.0028. Result: B=0, E=0, A=1, J=1, M=1.
Sample 3: B=1, E=0 with A=1 enforced gives w ≈ 0.00375; enforcing M=1 (with J=1 sampled) gives w ≈ 0.0026. Result: B=1, E=0, A=1, J=1, M=1.
Sample 4: B=1, E=1 with A=1 and M=1 enforced (and J=1 sampled) gives w ≈ 5e-6, effectively negligible. Result: B=1, E=1, A=1, J=1, M=1.

With these N = 4 weighted samples, the estimate is P(B=1 | A=1, M=1) ≈ 0.371, while exact inference gives P(B=1 | A=1, M=1) ≈ 0.375.
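A hedged Python sketch of likelihood weighting as run in this example: B and E are proposed with probability 0.5 and contribute the ratio (true probability)/0.5 to the weight, while the evidence variables A=1 and M=1 are clamped and contribute their CPT probabilities. The encoding and names are mine; the CPT numbers are those from the slides.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.90, 0: 0.05}
P_M = {1: 0.70, 0: 0.01}
bern = lambda p: 1 if random.random() < p else 0

def weighted_sample():
    """One weighted sample with evidence A=1, M=1, using the slides' scheme:
    B and E are proposed with probability 0.5, and the weight carries the
    ratio true-probability / proposal-probability for each of them."""
    w = 1.0
    b = bern(0.5)
    w *= (P_B if b else 1 - P_B) / 0.5
    e = bern(0.5)
    w *= (P_E if e else 1 - P_E) / 0.5
    a = 1                  # evidence: enforced, weight times P(A=1 | b, e)
    w *= P_A[(b, e)]
    j = bern(P_J[a])       # not evidence: sampled normally, no weight change
    m = 1                  # evidence: enforced, weight times P(M=1 | a)
    w *= P_M[a]
    return (b, e, a, j, m), w

samples = [weighted_sample() for _ in range(100_000)]
num = sum(w for (b, *_), w in samples if b == 1)
den = sum(w for _, w in samples)
print(num / den)   # approximates P(B=1 | A=1, M=1), roughly 0.37
```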

Another Rare-Event Problem
Consider a network with parents A1, A2, …, A10 and children B1, B2, …, B10, with B=b given as evidence. Each bi is improbable under every setting of Ai except one (say, Ai=1), so the chance of sampling all 1's is very low and most likelihood weights will be far too small. Problem: the evidence is not being used to sample the A's effectively (i.e., near P(Ai | b)).

Gibbs Sampling
Idea: reduce the computational burden of sampling from a multidimensional distribution P(x) = P(x1,…,xn) by doing repeated draws of individual variables:
Cycle through j = 1,…,n:
  Sample xj ~ P(xj | x1,…,xj-1, xj+1,…,xn)
Over the long run, the random walk taken by x approaches the true distribution P(x).

Gibbs Sampling in BNs
Each Gibbs sampling step: 1) pick a variable Xi, 2) sample xi ~ P(Xi | X\Xi).
Look at the values of the "Markov blanket" of Xi: its parents PaXi, its children Y1,…,Yk, and the parents of its children excluding Xi (PaY1\Xi, …, PaYk\Xi). Xi is independent of the rest of the network given its Markov blanket, so
Sample xi ~ P(Xi | PaXi, Y1, PaY1\Xi, …, Yk, PaYk\Xi) = (1/Z) P(Xi | PaXi) P(Y1 | PaY1) ⋯ P(Yk | PaYk),
the product of Xi's factor and the factors of its children, normalized by Z.
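A hedged Python sketch of Gibbs sampling on the burglary network with evidence A=1 and M=1 clamped: each step resamples one non-evidence variable from a distribution proportional to its own CPT entry times the CPT entries of its children, i.e. its Markov-blanket conditional. The encoding and helper names are my own.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}                                          # P(M=1 | A)

def gibbs_estimate(n_steps, burn_in=1000):
    """Estimate P(B=1 | A=1, M=1) by Gibbs sampling over B, E, J."""
    b, e, j = 0, 0, 0          # arbitrary initialization of non-evidence variables
    a, m = 1, 1                # evidence is clamped and never resampled
    count_b1 = kept = 0
    for t in range(n_steps):
        # Resample B:  P(B | MB(B)) proportional to P(B) * P(A=1 | B, E=e)
        w1 = P_B * P_A[(1, e)]
        w0 = (1 - P_B) * P_A[(0, e)]
        b = 1 if random.random() < w1 / (w1 + w0) else 0
        # Resample E:  P(E | MB(E)) proportional to P(E) * P(A=1 | B=b, E)
        w1 = P_E * P_A[(b, 1)]
        w0 = (1 - P_E) * P_A[(b, 0)]
        e = 1 if random.random() < w1 / (w1 + w0) else 0
        # Resample J:  P(J | MB(J)) = P(J | A=1)  (J has no children)
        j = 1 if random.random() < P_J[a] else 0
        if t >= burn_in:
            kept += 1
            count_b1 += b
    return count_b1 / kept

print(gibbs_estimate(200_000))   # approximates P(B=1 | A=1, M=1), roughly 0.37
```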

Handling evidence
Simply set each evidence variable to its observed value and don't sample it. The resulting walk approximates the distribution P(X\E | E=e), and it uses the evidence more efficiently than likelihood weighting.

Gibbs sampling issues
Demonstrating correctness and convergence requires examining the Markov chain random walk (more on this later). Many steps may be needed before the effects of a poor initialization wear off (the mixing time), and it is difficult to tell a priori how many steps are required. There are numerous variants; these methods are known collectively as Markov chain Monte Carlo (MCMC) techniques.

Next time Continuous and hybrid distributions