CS b553: Algorithms for Optimization and Learning

CS b553: Algorithms for Optimization and Learning
Monte Carlo Methods for Probabilistic Inference

Agenda
Monte Carlo methods: O(1/sqrt(N)) standard deviation of the estimate
Monte Carlo methods for Bayesian inference: likelihood weighting, Gibbs sampling

Monte Carlo Integration
Estimate large integrals/sums: I = ∫ f(x) p(x) dx or I = Σx f(x) p(x).
Using a sample of N i.i.d. draws x(1),…,x(N) from p(x): I ≈ (1/N) Σi f(x(i)).
Examples:
∫[a,b] f(x) dx ≈ ((b-a)/N) Σi f(x(i)), with the x(i) drawn uniformly from [a,b]
E[X] = ∫ x p(x) dx ≈ (1/N) Σi x(i)
Volume of a set in R^n
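As a concrete illustration (not from the slides), here is a minimal Python sketch of this estimator; the function names and the choice f(x) = x² with p = N(0,1) are my own, picked only to make the code runnable.

```python
import numpy as np

def mc_estimate(f, sampler, n):
    """Monte Carlo estimate of I = E_p[f(X)] from n i.i.d. samples of p."""
    xs = sampler(n)          # x(1), ..., x(n) ~ p(x)
    return np.mean(f(xs))    # (1/n) * sum_i f(x(i))

rng = np.random.default_rng(0)

# Estimate E[X^2] for X ~ N(0, 1); the exact value is 1.
est = mc_estimate(lambda x: x**2, rng.standard_normal, 100_000)
print(est)   # close to 1.0 for large n
```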

Mean & Variance of the Estimate
Let IN be the random variable denoting the estimate of the integral with N samples.

What is the bias (mean error) E[I - IN]?
E[I - IN] = I - E[IN]  (linearity of expectation)
= E[f(x)] - (1/N) Σi E[f(x(i))]  (definition of I and IN)
= (1/N) Σi (E[f(x)] - E[f(x(i))])
= (1/N) Σi 0 = 0  (x and each x(i) are distributed according to p(x))
So IN is an unbiased estimator.

What is the variance Var[IN]?
Var[IN] = Var[(1/N) Σi f(x(i))]  (definition)
= (1/N²) Var[Σi f(x(i))]  (scaling of variance)
= (1/N²) Σi Var[f(x(i))]  (variance of a sum of independent variables)
= (1/N) Var[f(x)]  (i.i.d. samples)
Standard deviation: O(1/sqrt(N))
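To see the O(1/sqrt(N)) behavior numerically, the sketch below (my own illustration, not part of the slides) repeats the estimator for increasing N and prints the empirical standard deviation of IN, which should shrink roughly by half each time N is quadrupled.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_once(n):
    """One Monte Carlo estimate I_N of E[X^2] for X ~ N(0, 1) (true value 1)."""
    x = rng.standard_normal(n)
    return np.mean(x**2)

for n in [100, 400, 1600, 6400]:
    # Empirical standard deviation of I_N over many independent runs.
    runs = [estimate_once(n) for _ in range(2000)]
    print(n, np.std(runs))   # shrinks roughly like 1/sqrt(n)
```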

Approximate Inference Through Sampling
Unconditional simulation: to estimate the probability of a coin flipping heads, I can flip it a huge number of times and count the fraction of heads observed.
Conditional simulation: to estimate the probability P(H) that a coin picked out of bucket B flips heads:
Repeat for i = 1,…,N:
  Pick a coin C out of a random bucket b(i) chosen with probability P(B)
  h(i) = result of flipping C, sampled according to P(H | b(i))
Each sample (h(i), b(i)) comes from the distribution P(H, B), so the collection of samples approximates P(H, B).
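A hedged sketch of this conditional simulation; the bucket probabilities and coin biases below are made-up numbers chosen only so the code runs end to end.

```python
import random

# Made-up numbers: two buckets of coins with different biases.
P_BUCKET = {"b1": 0.6, "b2": 0.4}               # P(B)
P_HEADS_GIVEN_BUCKET = {"b1": 0.9, "b2": 0.2}   # P(H = heads | B)

def simulate():
    """Draw one sample (h, b) ~ P(H, B): pick a bucket, then flip a coin from it."""
    b = "b1" if random.random() < P_BUCKET["b1"] else "b2"
    h = random.random() < P_HEADS_GIVEN_BUCKET[b]   # True means heads
    return h, b

samples = [simulate() for _ in range(100_000)]
# The fraction of heads approximates P(H = heads) = 0.6*0.9 + 0.4*0.2 = 0.62.
print(sum(h for h, _ in samples) / len(samples))
```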

Monte Carlo Inference in Bayes Nets
Given a BN over variables X:
Repeat for i = 1,…,N:
  In top-down (topological) order, generate x(i) as follows:
  Sample xj(i) ~ P(Xj | paXj(i))  (the right-hand side is obtained by plugging the parent values already sampled in x(i) into the CPT for Xj)
The samples x(1),…,x(N) approximate the joint distribution over X.

Approximate Inference: Monte-Carlo Simulation
Sample from the joint distribution of the burglary network (Burglary, Earthquake, Alarm, JohnCalls, MaryCalls):
P(B=1) = 0.001, P(E=1) = 0.002
P(A=1 | B,E): B=1,E=1: 0.95; B=1,E=0: 0.94; B=0,E=1: 0.29; B=0,E=0: 0.001
P(J=1 | A): A=1: 0.90; A=0: 0.05
P(M=1 | A): A=1: 0.70; A=0: 0.01
Example sample: B=0, E=0, A=0, J=1, M=0
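A minimal Python sketch of this top-down sampling on the burglary network, using the CPT numbers from the slide; the dictionary encoding and function names are my own choices, not part of the course material.

```python
import random

# CPTs from the slide.
P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}                                          # P(M=1 | A)

def bern(p):
    return 1 if random.random() < p else 0

def sample_joint():
    """Draw one sample (b, e, a, j, m) in top-down (topological) order."""
    b = bern(P_B)
    e = bern(P_E)
    a = bern(P_A[(b, e)])    # the sampled parent values index A's CPT
    j = bern(P_J[a])
    m = bern(P_M[a])
    return b, e, a, j, m

samples = [sample_joint() for _ in range(100_000)]
# The fraction of samples with A=1 approximates P(A=1), which is about 0.0025.
print(sum(s[2] for s in samples) / len(samples))
```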

Approximate Inference: Monte-Carlo Simulation
As more samples are generated, the distribution of the samples approaches the joint distribution. Example samples:
B=0, E=0, A=0, J=1, M=0
B=0, E=0, A=0, J=0, M=0
B=0, E=0, A=0, J=0, M=0
B=1, E=0, A=1, J=1, M=0

Basic method for Handling Evidence
Inference: given evidence E=e (e.g., J=1), approximate P(X\E | E=e), the distribution over the non-evidence variables. Generate samples as before and remove the samples that conflict with the evidence; e.g., keep
B=0, E=0, A=0, J=1, M=0
B=1, E=0, A=1, J=1, M=0
and discard the two samples with J=0. The distribution of the remaining samples approximates the conditional distribution (this filtering is known as rejection sampling).
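A hedged sketch of this filtering scheme on the same network (the compact encoding mirrors the previous sketch; the evidence J=1 matches the example above). It keeps only samples consistent with the evidence and estimates a conditional probability from the survivors.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.90, 0: 0.05}
P_M = {1: 0.70, 0: 0.01}
bern = lambda p: 1 if random.random() < p else 0

def sample_joint():
    b, e = bern(P_B), bern(P_E)
    a = bern(P_A[(b, e)])
    return b, e, a, bern(P_J[a]), bern(P_M[a])

# Keep only samples consistent with the evidence J=1; discard the rest.
kept = [s for s in (sample_joint() for _ in range(1_000_000)) if s[3] == 1]
# The kept samples approximate P(. | J=1); e.g. estimate P(B=1 | J=1), about 0.016.
print(sum(s[0] for s in kept) / len(kept))
```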

Rare Event
Problem: what if some events are really rare (e.g., burglary & earthquake)? The number of samples must be huge to get a reasonable estimate.
Solution: likelihood weighting. Enforce that each sample agrees with the evidence, and while generating a sample, keep track of the ratio of (how likely the sampled values are to occur in the real world) to (how likely you were to generate those values).

Likelihood weighting: example
Suppose the evidence is Alarm=1 and MaryCalls=1 in the burglary network above, and B and E are each sampled with probability 0.5. Each sampled variable multiplies the weight by (true probability of the sampled value) / 0.5, and each evidence variable is enforced and multiplies the weight by its CPT probability.

Sample 1: start with w = 1. Sampling B=0 and E=1 gives w = (0.999/0.5)(0.002/0.5) ≈ 0.008. A=1 is enforced, and the weight is updated to reflect the likelihood that this occurs: w ≈ 0.008 × 0.29 ≈ 0.0023. M=1 is enforced (w ≈ 0.0023 × 0.70 ≈ 0.0016), and J=1 is sampled from P(J | A=1). Result: B=0, E=1, A=1, J=1, M=1 with w ≈ 0.0016.
Sample 2: B=0, E=0 gives w ≈ 3.988; enforcing A=1 gives w ≈ 0.004; enforcing M=1 (with J=1 sampled) gives w ≈ 0.0028. Result: B=0, E=0, A=1, J=1, M=1.
Sample 3: B=1, E=0 with A=1 enforced gives w ≈ 0.00375; enforcing M=1 (with J=1 sampled) gives w ≈ 0.0026. Result: B=1, E=0, A=1, J=1, M=1.
Sample 4: B=1, E=1 with A=1 and M=1 enforced (and J=1 sampled) gives w ≈ 5e-6, effectively negligible. Result: B=1, E=1, A=1, J=1, M=1.

With these N = 4 weighted samples, the estimate is P(B=1 | A=1, M=1) ≈ 0.371, while exact inference gives P(B=1 | A=1, M=1) ≈ 0.375.
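A hedged Python sketch of likelihood weighting as run in this example: B and E are proposed with probability 0.5 and contribute the ratio (true probability)/0.5 to the weight, while the evidence variables A=1 and M=1 are clamped and contribute their CPT probabilities. The encoding and names are mine; the CPT numbers are those from the slides.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}
P_J = {1: 0.90, 0: 0.05}
P_M = {1: 0.70, 0: 0.01}
bern = lambda p: 1 if random.random() < p else 0

def weighted_sample():
    """One weighted sample with evidence A=1, M=1, using the slides' scheme:
    B and E are proposed with probability 0.5, and the weight carries the
    ratio true-probability / proposal-probability for each of them."""
    w = 1.0
    b = bern(0.5)
    w *= (P_B if b else 1 - P_B) / 0.5
    e = bern(0.5)
    w *= (P_E if e else 1 - P_E) / 0.5
    a = 1                  # evidence: enforced, weight times P(A=1 | b, e)
    w *= P_A[(b, e)]
    j = bern(P_J[a])       # not evidence: sampled normally, no weight change
    m = 1                  # evidence: enforced, weight times P(M=1 | a)
    w *= P_M[a]
    return (b, e, a, j, m), w

samples = [weighted_sample() for _ in range(100_000)]
num = sum(w for (b, *_), w in samples if b == 1)
den = sum(w for _, w in samples)
print(num / den)   # approximates P(B=1 | A=1, M=1), roughly 0.37
```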

Another Rare-Event Problem
Consider a network with parents A1, A2, …, A10 and children B1, B2, …, B10, with B=b given as evidence. Each bi is improbable under every setting of Ai except one (say, Ai=1), so the chance of sampling all 1's is very low and most likelihood weights will be far too small. Problem: the evidence is not being used to sample the A's effectively (i.e., near P(Ai | b)).

Gibbs Sampling
Idea: reduce the computational burden of sampling from a multidimensional distribution P(x) = P(x1,…,xn) by doing repeated draws of individual variables:
Cycle through j = 1,…,n:
  Sample xj ~ P(xj | x1,…,xj-1, xj+1,…,xn)
Over the long run, the random walk taken by x approaches the true distribution P(x).

Gibbs Sampling in BNs
Each Gibbs sampling step: 1) pick a variable Xi, 2) sample xi ~ P(Xi | X\Xi).
Look at the values of the "Markov blanket" of Xi: its parents PaXi, its children Y1,…,Yk, and the parents of its children excluding Xi (PaY1\Xi, …, PaYk\Xi). Xi is independent of the rest of the network given its Markov blanket, so
Sample xi ~ P(Xi | PaXi, Y1, PaY1\Xi, …, Yk, PaYk\Xi) = (1/Z) P(Xi | PaXi) P(Y1 | PaY1) ⋯ P(Yk | PaYk),
the product of Xi's factor and the factors of its children, normalized by Z.
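A hedged Python sketch of Gibbs sampling on the burglary network with evidence A=1 and M=1 clamped: each step resamples one non-evidence variable from a distribution proportional to its own CPT entry times the CPT entries of its children, i.e. its Markov-blanket conditional. The encoding and helper names are my own.

```python
import random

P_B, P_E = 0.001, 0.002
P_A = {(1, 1): 0.95, (1, 0): 0.94, (0, 1): 0.29, (0, 0): 0.001}   # P(A=1 | B, E)
P_J = {1: 0.90, 0: 0.05}                                          # P(J=1 | A)
P_M = {1: 0.70, 0: 0.01}                                          # P(M=1 | A)

def gibbs_estimate(n_steps, burn_in=1000):
    """Estimate P(B=1 | A=1, M=1) by Gibbs sampling over B, E, J."""
    b, e, j = 0, 0, 0          # arbitrary initialization of non-evidence variables
    a, m = 1, 1                # evidence is clamped and never resampled
    count_b1 = kept = 0
    for t in range(n_steps):
        # Resample B:  P(B | MB(B)) proportional to P(B) * P(A=1 | B, E=e)
        w1 = P_B * P_A[(1, e)]
        w0 = (1 - P_B) * P_A[(0, e)]
        b = 1 if random.random() < w1 / (w1 + w0) else 0
        # Resample E:  P(E | MB(E)) proportional to P(E) * P(A=1 | B=b, E)
        w1 = P_E * P_A[(b, 1)]
        w0 = (1 - P_E) * P_A[(b, 0)]
        e = 1 if random.random() < w1 / (w1 + w0) else 0
        # Resample J:  P(J | MB(J)) = P(J | A=1)  (J has no children)
        j = 1 if random.random() < P_J[a] else 0
        if t >= burn_in:
            kept += 1
            count_b1 += b
    return count_b1 / kept

print(gibbs_estimate(200_000))   # approximates P(B=1 | A=1, M=1), roughly 0.37
```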

Handling evidence
Simply set each evidence variable to its observed value and don't sample it. The resulting walk approximates the distribution P(X\E | E=e), and it uses the evidence more efficiently than likelihood weighting.

Gibbs sampling issues
Demonstrating correctness and convergence requires examining the Markov chain random walk (more on this later). Many steps may be needed before the effects of a poor initialization wear off (the mixing time), and it is difficult to tell a priori how many steps are required. There are numerous variants; these methods are known collectively as Markov chain Monte Carlo (MCMC) techniques.

Next time Continuous and hybrid distributions