Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 12


Intelligent Systems (AI-2), Computer Science cpsc422, Lecture 12. Sept 30, 2019. Slide credit: some slides adapted from Stuart Russell (Berkeley). CPSC 422, Lecture 12

Lecture Overview
- Recap of Forward and Rejection Sampling
- Likelihood Weighting
- Markov Chain Monte Carlo (MCMC) – Gibbs Sampling
- An Application Requiring Approximate Reasoning

Sampling
The building block of any sampling algorithm is the generation of samples from a known (or easy-to-compute, as in Gibbs) distribution. We then use these samples to estimate probabilities that are hard to compute exactly. We want consistent sampling methods: the more samples we take, the closer the estimates get to the true probabilities.
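A minimal sketch (not from the slides) of that building block: drawing a sample from a known discrete distribution via the inverse-CDF method. All names here are illustrative.

```python
import random

def sample_discrete(values, probs, rng=random):
    """Return one element of values, drawn with the given probabilities."""
    u = rng.random()          # uniform draw in [0, 1)
    cumulative = 0.0
    for v, p in zip(values, probs):
        cumulative += p
        if u < cumulative:
            return v
    return values[-1]         # guard against floating-point round-off

# Consistency: more samples -> estimates closer to the true distribution.
random.seed(0)
draws = [sample_discrete(['+c', '-c'], [0.5, 0.5]) for _ in range(10000)]
print(draws.count('+c') / len(draws))  # close to 0.5
```

Every sampler in this lecture reduces to repeated calls of this kind, differing only in which distribution each variable is drawn from.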

Prior Sampling
Network: Cloudy → Sprinkler, Cloudy → Rain; Sprinkler, Rain → WetGrass.
CPTs:
P(+c) = 0.5
P(+s | +c) = 0.1, P(+s | -c) = 0.5
P(+r | +c) = 0.8, P(+r | -c) = 0.2
P(+w | +s, +r) = 0.99, P(+w | +s, -r) = 0.90, P(+w | -s, ·) = …
Samples: +c, -s, +r, +w; -c, +s, -r, +w; … CPSC 422, Lecture 12
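A sketch of prior (forward) sampling on this network: sample each variable in topological order from its CPT given the already-sampled parents. The P(W | -s, ·) CPT rows are not legible in the transcript, so the values used for them below are assumptions for illustration.

```python
import random

P_C = 0.5
P_S = {'+c': 0.1, '-c': 0.5}                      # P(+s | C), from the slide
P_R = {'+c': 0.8, '-c': 0.2}                      # P(+r | C), from the slide
P_W = {('+s', '+r'): 0.99, ('+s', '-r'): 0.90,    # P(+w | S, R), from the slide
       ('-s', '+r'): 0.90, ('-s', '-r'): 0.01}    # -s rows: assumed values

def prior_sample(rng=random):
    """Sample each variable in topological order, given its parents."""
    c = '+c' if rng.random() < P_C else '-c'
    s = '+s' if rng.random() < P_S[c] else '-s'
    r = '+r' if rng.random() < P_R[c] else '-r'
    w = '+w' if rng.random() < P_W[(s, r)] else '-w'
    return c, s, r, w

random.seed(1)
samples = [prior_sample() for _ in range(20000)]
# By the CPTs above, P(+r) = 0.5 * 0.8 + 0.5 * 0.2 = 0.5
print(sum(s[2] == '+r' for s in samples) / len(samples))  # close to 0.5
```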

Example
We'll get a bunch of samples from the BN (Cloudy, Sprinkler, Rain, WetGrass):
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
-c, -s, -r, +w
From these samples you can estimate any distribution involving the four variables, e.g., P(W) ≈ <+w: 0.8, -w: 0.2>.

Example
We can estimate anything else from the samples, besides P(W), P(R), etc.:
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
-c, -s, -r, +w
What about P(C | +w)? P(C | +r, +w)? P(C | +r, -w)?
Can we use/generate fewer samples when we want to estimate a probability conditioned on evidence?

Rejection Sampling
Let's say we want P(W | +s): ignore (reject) the samples that don't have S = +s.
+c, -s, +r, +w
+c, +s, +r, +w
-c, +s, +r, -w
-c, -s, -r, +w
This is called rejection sampling. It is also consistent for conditional probabilities (i.e., correct in the limit).
But what happens if +s is rare? The biggest problem with rejection sampling is that it ends up rejecting so many samples! The number of samples consistent with the evidence e drops exponentially as the number of evidence variables grows, making it unusable for complex problems.
If the number of evidence variables grows...
A. Fewer samples will be rejected
B. More samples will be rejected
C. The same number of samples will be rejected
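A sketch of rejection sampling for P(W | +s) on the sprinkler network. CPT values are taken from the slides where legible; the P(W | -s, ·) rows are assumed for illustration (they do not affect this query, since only samples with S = +s survive).

```python
import random

P_C = 0.5
P_S = {'+c': 0.1, '-c': 0.5}   # P(+s | C)
P_R = {'+c': 0.8, '-c': 0.2}   # P(+r | C)
P_W = {('+s', '+r'): 0.99, ('+s', '-r'): 0.90,
       ('-s', '+r'): 0.90, ('-s', '-r'): 0.01}   # -s rows: assumed

def prior_sample(rng):
    c = '+c' if rng.random() < P_C else '-c'
    s = '+s' if rng.random() < P_S[c] else '-s'
    r = '+r' if rng.random() < P_R[c] else '-r'
    w = '+w' if rng.random() < P_W[(s, r)] else '-w'
    return {'C': c, 'S': s, 'R': r, 'W': w}

def rejection_sample(query, evidence, n, rng=random):
    """Estimate P(query = + | evidence), discarding inconsistent samples."""
    kept = [x for x in (prior_sample(rng) for _ in range(n))
            if all(x[v] == val for v, val in evidence.items())]
    if not kept:
        return None, 0     # every sample was rejected
    hits = sum(x[query].startswith('+') for x in kept)
    return hits / len(kept), len(kept)

random.seed(2)
est, kept = rejection_sample('W', {'S': '+s'}, 20000)
print(est, kept)  # est ≈ 0.93; kept ≈ 6000, since P(+s) = 0.3 here
```

Note the waste: roughly 70% of the work is thrown away even with a single, not particularly rare, evidence variable.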

Likelihood Weighting
Problems with rejection sampling: if the evidence is unlikely, you reject a lot of samples, and you don't exploit your evidence as you sample. Consider P(B | +a) in a Burglary → Alarm network: samples like -b, -a and +b, +a are generated, but only those with +a are kept.
Idea: fix the evidence variables and sample the rest.
Problem: the sample distribution is not consistent!
Solution: weight each sample by the probability of the evidence given its parents.
This is an example of importance sampling!

Likelihood Weighting
Same sprinkler network and CPTs as in the Prior Sampling slide; the evidence variables are fixed while the remaining variables are sampled. The initial weight is 1.
Sample: +c, +s, +r, +w; …

Likelihood Weighting
What would be the weight for the sample +c, +s, -r, +w (initial weight 1; same network and CPTs as before)?
A. 0.08
B. 0.02
C. 0.005

Likelihood Weighting
Likelihood weighting is good:
- We have taken the evidence into account as we generate the sample: all our samples reflect the state of the world suggested by the evidence.
- It uses all the samples that it generates, so it is much more efficient than rejection sampling.
Likelihood weighting doesn't solve all our problems:
- Evidence influences the choice of downstream variables, but not upstream ones (C isn't more likely to get a value matching the evidence).
- Performance still degrades as the number of evidence variables increases: each sample gets a small weight. If one does not generate enough samples, the weighted estimate will be dominated by the small fraction of samples with substantial weight.
- We would like to consider evidence when we sample every variable. One possible approach is to sample some of the variables while using exact inference to generate the posterior probability for others.
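A sketch of likelihood weighting for P(R | +s, +w) on the sprinkler network: evidence variables are not sampled but contribute a weight factor P(evidence value | parents), and estimates are weighted averages. CPT values are the ones legible on the slides; the query is illustrative.

```python
import random

P_C = 0.5
P_S = {'+c': 0.1, '-c': 0.5}      # P(+s | C)
P_R = {'+c': 0.8, '-c': 0.2}      # P(+r | C)
P_W_S = {'+r': 0.99, '-r': 0.90}  # P(+w | +s, R)

def weighted_sample(rng):
    """One sample with evidence S=+s, W=+w fixed; returns (R value, weight)."""
    w = 1.0
    c = '+c' if rng.random() < P_C else '-c'
    w *= P_S[c]                    # evidence S = +s: weight, don't sample
    r = '+r' if rng.random() < P_R[c] else '-r'
    w *= P_W_S[r]                  # evidence W = +w: weight, don't sample
    return r, w

def lw_estimate(n, rng=random):
    """Weighted fraction of samples with R = +r, estimating P(+r | +s, +w)."""
    num = den = 0.0
    for _ in range(n):
        r, w = weighted_sample(rng)
        den += w
        if r == '+r':
            num += w
    return num / den

random.seed(3)
print(lw_estimate(50000))  # ≈ 0.32 with these CPTs
```

Note that C is still sampled from its prior: the evidence downstream of C does not influence it, which is exactly the weakness the slide points out.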

Lecture Overview
- Recap of Forward and Rejection Sampling
- Likelihood Weighting
- Markov Chain Monte Carlo (MCMC) – Gibbs Sampling
- An Application Requiring Approximate Reasoning

Markov Chain Monte Carlo
Idea: instead of sampling from scratch, create samples that are each like the last one (only randomly change one variable).
Procedure: resample one variable at a time, conditioned on all the rest, but keep the evidence fixed. E.g., for P(B | +c), with the evidence C = +c held fixed, alternately resample A and B, producing a chain of states such as:
+b, +a, +c → -b, +a, +c → -b, -a, +c → …
To flip a variable in a fully instantiated BN you only need to multiply the probability of the variable given its parents by the probabilities of all its children given their parents; the number of multiplications required equals the number of children.
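A sketch of this procedure as Gibbs sampling for P(R | +s, +w) on the sprinkler network: the non-evidence variables C and R are resampled in turn from their distribution given the current values of everything else. Only CPT entries legible on the slides are used; the query and function name are illustrative.

```python
import random

P_C = 0.5
P_S = {'+c': 0.1, '-c': 0.5}      # P(+s | C)
P_R = {'+c': 0.8, '-c': 0.2}      # P(+r | C)
P_W_S = {'+r': 0.99, '-r': 0.90}  # P(+w | +s, R)

def gibbs_pr_given_sw(n_steps, rng=random):
    """Fraction of Gibbs steps with R = +r: an estimate of P(+r | +s, +w)."""
    c, r = '+c', '+r'              # arbitrary initial state
    count_pr = 0
    for _ in range(n_steps):
        # Resample C given its Markov blanket (its children S and R):
        pr = P_R if r == '+r' else {k: 1 - v for k, v in P_R.items()}
        num_c = P_C * P_S['+c'] * pr['+c']
        num_nc = (1 - P_C) * P_S['-c'] * pr['-c']
        c = '+c' if rng.random() < num_c / (num_c + num_nc) else '-c'
        # Resample R given its Markov blanket (parent C, child W, co-parent S):
        num_r = P_R[c] * P_W_S['+r']
        num_nr = (1 - P_R[c]) * P_W_S['-r']
        r = '+r' if rng.random() < num_r / (num_r + num_nr) else '-r'
        count_pr += (r == '+r')
    return count_pr / n_steps

random.seed(4)
print(gibbs_pr_given_sw(50000))  # ≈ 0.32 for this network
```

Each resampling step touches only a handful of CPT entries (the variable given its parents, and each child given its parents), as the slide notes.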

Markov Chain Monte Carlo
Properties: the samples are not independent (in fact they're nearly identical), but the sample averages are still consistent estimators! And they can be computed efficiently.
What's the point: when you sample a variable conditioned on all the rest, both upstream and downstream variables condition on the evidence.
Open issue: what does it mean to sample a variable conditioned on all the rest?

Sampling X conditioned on all the rest
A node is conditionally independent of all the other nodes in the network given its parents, children, and children's parents (i.e., its Markov blanket).
A. I need to consider all the other nodes
B. I only need to consider its Markov blanket
C. I only need to consider the nodes not in the Markov blanket

Sampling conditioned on all the rest
A node is conditionally independent of all the other nodes in the network given its parents, children, and children's parents (i.e., its Markov blanket). Answer: B.

Rain's Markov Blanket
We want to sample Rain. Rain's Markov blanket is {Cloudy, Sprinkler, WetGrass}: its parent, its child's other parent, and its child.
Note: we need to sample from a different probability distribution for each configuration of the Markov blanket.
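A small worked sketch of one such distribution, for the Markov-blanket configuration C = +c, S = +s, W = +w (chosen for illustration), using the CPT entries legible on the slides: P(r | mb) is proportional to P(r | c) · P(w | s, r).

```python
# Distribution Rain is resampled from, given Markov blanket C=+c, S=+s, W=+w.
p_r_given_c = {'+r': 0.8, '-r': 0.2}     # P(R | +c), from the slide
p_w_given_sr = {'+r': 0.99, '-r': 0.90}  # P(+w | +s, R), from the slide

unnorm = {r: p_r_given_c[r] * p_w_given_sr[r] for r in ('+r', '-r')}
z = sum(unnorm.values())
posterior = {r: v / z for r, v in unnorm.items()}
print(posterior)  # '+r' ≈ 0.815
```

A different configuration of the blanket (e.g. C = -c, or W = -w) picks out different CPT entries and hence a different sampling distribution.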

We want to sample Rain
Relevant CPTs (same network as before):
P(+c) = 0.5
P(+s | +c) = 0.1, P(+s | -c) = 0.5
P(+r | +c) = 0.8, P(+r | -c) = 0.2
P(+w | +s, +r) = 0.99, P(+w | +s, -r) = 0.90, P(+w | -s, ·) = …

MCMC Example
Do it 100 times.

Why it is called Markov Chain Monte Carlo
The states of the chain are the possible samples (fully instantiated Bnets)... given the evidence.

Hoeffding's inequality
Suppose p is the true probability and s is the sample average from n independent samples. Hoeffding's inequality bounds the probability of a large estimation error:

P(|s - p| > ε) ≤ 2e^(-2nε²)   (1)

Here p can be the probability of any event for the random variables X = {X1, …, Xn} described by a Bayesian network.
If you want an infinitely small probability of having an error greater than ε, you need infinitely many samples. But if you settle for something less than infinitely small — say you are willing to have an error greater than ε in a fraction δ of the cases — then you just need to set

2e^(-2nε²) ≤ δ,   i.e.,   n ≥ ln(2/δ) / (2ε²)

So you pick the error ε you can tolerate and the frequency δ with which you can tolerate it, and solve for n, the number of samples that ensures this performance.

Hoeffding's inequality: Examples
You can tolerate an error greater than 0.1 only in 5% of your cases: set ε = 0.1, δ = 0.05; equation (1) gives n > 184.
If you can tolerate the same error (0.1) only in 1% of the cases, you need 265 samples.
If you want an error greater than 0.01 in no more than 5% of the cases, you need 18,445 samples.
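The examples above can be reproduced by solving the bound 2·exp(-2nε²) ≤ δ for n; a minimal sketch (function name is illustrative):

```python
import math

def samples_needed(eps, delta):
    """Smallest integer n with 2 * exp(-2 * n * eps**2) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_needed(0.1, 0.05))   # 185 (the slide states n > 184)
print(samples_needed(0.1, 0.01))   # 265
print(samples_needed(0.01, 0.05))  # 18445
```

Note the 1/ε² scaling: shrinking the tolerable error by 10× costs 100× more samples, while tightening δ costs only logarithmically more.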

Learning Goals for today's class
You can:
- Describe and justify the Likelihood Weighting sampling method
- Describe and justify the Markov Chain Monte Carlo sampling method

TODO for Wed
Readings: follow the instructions on the course web page.
Next research paper: "Using Bayesian Networks to Manage Uncertainty in Student Modeling," Journal of User Modeling and User-Adapted Interaction, 2002 (Dynamic BN). Required only up to page 400; you do not have to answer the question "How was the system evaluated?". A very influential paper (500+ citations).
Keep working on assignment-2 (due Fri, Oct 18).

Not Required for 422
Proof of the formula for computing the probability of a node given its Markov blanket (which is the distribution you sample from in Gibbs).