CS 416 Artificial Intelligence, Lecture 16: Uncertainty (Chapter 14)

Conditional probability
The probability of a given all we know is b: P(a | b). It can be rewritten in terms of unconditional probabilities: P(a | b) = P(a ∧ b) / P(b).

Conditioning
A distribution over Y can be obtained by summing out all the other variables from any joint distribution containing Y: P(Y | e) = α Σ_x P(Y, e, x), where Y is the query event, e is the evidence, and x ranges over all the other variables. We need the full joint distribution to compute this sum.

Bayes network
A Bayes network captures the full joint distribution. For comparison with the full joint table: P(x_1, ..., x_n) = Π_i P(x_i | parents(X_i)).
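The chain rule gives the probability of any complete assignment directly from the CPTs. As a worked instance, using the standard textbook numbers for the burglary network (an assumption here, since the slide's figures are not shown):

```latex
P(j, m, a, \lnot b, \lnot e)
 = P(\lnot b)\,P(\lnot e)\,P(a \mid \lnot b, \lnot e)\,P(j \mid a)\,P(m \mid a)
 = 0.999 \times 0.998 \times 0.001 \times 0.90 \times 0.70 \approx 0.00063
```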

Example
P(B | JohnCalls = true, MaryCalls = true). By enumeration over the hidden variables Earthquake and Alarm:
P(B | j, m) = α Σ_e Σ_a P(B) P(e) P(a | B, e) P(j | a) P(m | a)

Example
P(B | JohnCalls = true, MaryCalls = true). To expedite, move terms that do not depend on a summation variable outside that summation. The old way evaluates the full product inside both sums; the improved form is
P(B | j, m) = α P(B) Σ_e P(e) Σ_a P(a | B, e) P(j | a) P(m | a)
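The enumeration above can be written directly as nested loops. The sketch below is illustrative rather than the course's code; the CPT values are the standard textbook numbers for the burglary network, and the function names are assumptions.

```python
# A minimal sketch of inference by enumeration on the burglary network.
P_B, P_E = 0.001, 0.002                               # P(Burglary), P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}    # P(Alarm=true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls=true | Alarm)
P_M = {True: 0.70, False: 0.01}                       # P(MaryCalls=true | Alarm)

def p(prob_true, value):
    """Probability of a boolean value, given P(var = true)."""
    return prob_true if value else 1.0 - prob_true

def posterior_burglary(j=True, m=True):
    """P(B | j, m) = alpha * P(B) * sum_e P(e) * sum_a P(a|B,e) P(j|a) P(m|a),
    with P(B) moved outside both sums and P(e) outside the inner sum."""
    scores = {}
    for b in (True, False):
        total = 0.0
        for e in (True, False):
            total += p(P_E, e) * sum(
                p(P_A[(b, e)], a) * p(P_J[a], j) * p(P_M[a], m)
                for a in (True, False))
        scores[b] = p(P_B, b) * total
    alpha = 1.0 / (scores[True] + scores[False])
    return {b: alpha * s for b, s in scores.items()}

print(posterior_burglary())   # roughly {True: 0.284, False: 0.716}
```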

Example
Evaluating this nested expression amounts to a depth-first traversal of the corresponding expression tree.

Example
Complexity of a Bayes net:
- A Bayes net reduces space complexity
- A Bayes net does not reduce time complexity in the general case

Time complexity
Note the repeated subexpressions: they can be computed once and reused via dynamic programming.
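Concretely, in the burglary query:

```latex
P(b \mid j, m) \;=\; \alpha\, P(b) \sum_{e} P(e) \sum_{a} P(a \mid b, e)\,
\underbrace{P(j \mid a)\, P(m \mid a)}_{\text{independent of } e}
```

The product P(j | a) P(m | a) is recomputed for every value of e even though it never changes; caching and reusing such subexpressions is the dynamic-programming idea behind variable elimination.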

Time complexity
Dynamic programming:
- Works well for polytrees (where there is at most one undirected path between any two nodes)
- Doesn't work for multiply connected networks
Clustering:
- Try to convert multiply connected networks to polytrees

Approximate inference
It's expensive to work with the full joint distribution, whether as a table or as a Bayes network. Is an approximation good enough? Monte Carlo.

Use samples to approximate the solution
- Simulated annealing used Monte Carlo theory to justify why random guesses, and sometimes going uphill, can lead to optimality
More samples = better approximation
- How many are needed?
- Where should you take the samples?

Prior sampling
The ability to sample from the prior probabilities of a set of random variables.
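Below is a minimal Python sketch of prior sampling on the sprinkler network used in the examples that follow. The NETWORK structure and the helper names (p_true, prior_sample) are illustrative assumptions, and the CPT numbers follow the standard textbook sprinkler example rather than anything shown on the slide.

```python
import random

# Variables in topological order; each CPT maps a tuple of parent values
# to P(var = true | parents).
NETWORK = [
    ('Cloudy',    (),                    {(): 0.5}),
    ('Sprinkler', ('Cloudy',),           {(True,): 0.1, (False,): 0.5}),
    ('Rain',      ('Cloudy',),           {(True,): 0.8, (False,): 0.2}),
    ('WetGrass',  ('Sprinkler', 'Rain'), {(True, True): 0.99, (True, False): 0.90,
                                          (False, True): 0.90, (False, False): 0.0}),
]

def p_true(entry, event):
    """P(var = true | parent values taken from event)."""
    _name, parents, cpt = entry
    return cpt[tuple(event[parent] for parent in parents)]

def prior_sample():
    """Sample every variable in topological order, parents before children."""
    event = {}
    for entry in NETWORK:
        event[entry[0]] = random.random() < p_true(entry, event)
    return event
```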

Approximating the true distribution
With enough samples, the empirical frequencies converge to the true distribution.

Rejection sampling
Compute P(X | e):
- Use PriorSample (S_PS) to create N samples
- Inspect each sample for the truth of e
- Keep a count of how many samples are consistent with e
- P(X | e) can then be estimated from count / N

Example
P(Rain | Sprinkler = true):
- Use the Bayes net to generate 100 samples
- Suppose 73 have Sprinkler = false
- Suppose 27 have Sprinkler = true, of which 8 have Rain = true and 19 have Rain = false
- P(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩
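A sketch of rejection sampling, reusing the NETWORK and prior_sample helpers from the prior-sampling sketch above; the function name and parameters are illustrative assumptions.

```python
def rejection_sample(query_var, evidence, n=10000):
    """Estimate P(query_var | evidence): draw prior samples, discard the
    ones inconsistent with the evidence, and count the rest."""
    counts = {True: 0, False: 0}
    for _ in range(n):
        sample = prior_sample()
        if all(sample[var] == val for var, val in evidence.items()):
            counts[sample[query_var]] += 1
    total = counts[True] + counts[False]
    if total == 0:
        raise ValueError("no samples were consistent with the evidence")
    return {value: count / total for value, count in counts.items()}

# With the CPTs above, the exact answer is about {True: 0.22, False: 0.78};
# the slide's 0.296 reflects sampling noise in its 100-sample illustration.
print(rejection_sample('Rain', {'Sprinkler': True}))
```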

Problems with rejection sampling
- The standard deviation of the error in the estimated probability is proportional to 1/√n, where n is the number of samples consistent with the evidence
- As problems become complex, the number of samples consistent with the evidence becomes small, and it becomes harder to construct accurate estimates

Likelihood weighting
We only want to generate samples that are consistent with the evidence e. We'll sample the Bayes net, but we won't let every random variable be sampled; some will be forced to produce a specific output.

Example
P(Rain | Sprinkler = true, WetGrass = true)

Example
P(Rain | Sprinkler = true, WetGrass = true)
- First, the sample weight w is set to 1.0

Example
Notice that the weight is reduced according to how likely an evidence variable's output is given its parents. So the final probability is a function of what comes from sampling the free variables while constraining the evidence variables.
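A sketch of likelihood weighting under the same assumptions, reusing NETWORK and p_true from the prior-sampling sketch. Evidence variables are never sampled; each one multiplies a factor into the sample's weight.

```python
def weighted_sample(evidence):
    """One weighted sample: evidence variables are fixed rather than sampled,
    and the weight accumulates P(observed value | parents) for each of them."""
    weight, event = 1.0, {}
    for entry in NETWORK:
        name = entry[0]
        p = p_true(entry, event)
        if name in evidence:
            event[name] = evidence[name]
            weight *= p if evidence[name] else 1.0 - p
        else:
            event[name] = random.random() < p
    return event, weight

def likelihood_weighting(query_var, evidence, n=10000):
    """Estimate P(query_var | evidence) from weighted counts."""
    totals = {True: 0.0, False: 0.0}
    for _ in range(n):
        event, weight = weighted_sample(evidence)
        totals[event[query_var]] += weight        # weighted, not raw, count
    z = totals[True] + totals[False]
    return {value: t / z for value, t in totals.items()}

# About {True: 0.32, False: 0.68} with the CPTs above.
print(likelihood_weighting('Rain', {'Sprinkler': True, 'WetGrass': True}))
```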

Comparing techniques
- In likelihood weighting, attention is paid to evidence variables before samples are collected
- In rejection sampling, evidence variables are considered after the sampling
- Likelihood weighting's samples do not reflect the true posterior P(z | e) exactly, because the sampled variables ignore evidence among z's non-ancestors

Likelihood weighting
- Uses all the samples
- As the number of evidence variables increases, it becomes harder to keep the sample weights high, and estimate quality drops

Markov chain Monte Carlo (MCMC)
- Imagine being in a current state: an assignment to all the random variables
- The next state is selected by randomly sampling one of the nonevidence variables X_i, conditioned on the current values of the variables in the current state
- MCMC wanders around state space, flipping one variable at a time while keeping the evidence variables fixed
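A sketch of Gibbs sampling, the MCMC variant described above, again reusing NETWORK and p_true. The resampling distribution for a variable depends only on its Markov blanket (its parents, children, and children's parents); helper names are illustrative, and burn-in is omitted for brevity.

```python
def gibbs_p_true(name, event):
    """P(name = true | all other variables): proportional to
    P(name | parents) times the product of P(child | child's parents)
    over the children of name."""
    scores = {}
    for value in (True, False):
        ev = dict(event)
        ev[name] = value
        score = 1.0
        for entry in NETWORK:
            var, parents = entry[0], entry[1]
            if var == name or name in parents:   # the variable or one of its children
                pt = p_true(entry, ev)
                score *= pt if ev[var] else 1.0 - pt
        scores[value] = score
    return scores[True] / (scores[True] + scores[False])

def gibbs(query_var, evidence, n=20000):
    """MCMC estimate of P(query_var | evidence)."""
    event = {entry[0]: evidence.get(entry[0], random.random() < 0.5)
             for entry in NETWORK}
    nonevidence = [entry[0] for entry in NETWORK if entry[0] not in evidence]
    counts = {True: 0, False: 0}
    for _ in range(n):
        name = random.choice(nonevidence)        # flip one nonevidence variable
        event[name] = random.random() < gibbs_p_true(name, event)
        counts[event[query_var]] += 1
    return {value: c / n for value, c in counts.items()}

print(gibbs('Rain', {'Sprinkler': True, 'WetGrass': True}))   # about {True: 0.32}
```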

Cool method of operation
- The sampling process settles into a "dynamic equilibrium" in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability
- Let q(x → x′) be the probability of transitioning from state x to state x′
- A Markov chain is a sequence of state transitions according to the q(·) function
- π_t(x) measures the probability of being in state x after t steps

Markov chains
π_{t+1}(x) is the probability of being in x after t + 1 steps. If π_t = π_{t+1}, we have reached a stationary distribution.
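In symbols, with the transition probabilities q and state distribution π_t defined on the previous slide:

```latex
\pi_{t+1}(x') = \sum_{x} \pi_t(x)\, q(x \to x'),
\qquad \text{stationarity: } \pi(x') = \sum_{x} \pi(x)\, q(x \to x') \ \text{for all } x'.
```

Gibbs sampling's transition kernel satisfies detailed balance, π(x) q(x → x′) = π(x′) q(x′ → x), with π equal to the posterior P(x | e); summing that identity over x shows the posterior is stationary, which is why the chain's long-run visit frequencies approximate it.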