Inference Algorithms for Bayes Networks


Outline

Bayes Nets are popular representations in AI, and researchers have developed many inference techniques for them. We will consider two types of algorithms:

Exact inference (2 subtypes):
- Enumeration
- Variable elimination
(Other techniques not covered: junction tree, loop cutset conditioning, …)

Approximate inference via sampling (3 subtypes):
- Rejection sampling
- Likelihood weighting
- Gibbs sampling

First: Notation

I'm going to assume all variables are binary. For random variable A, I will write the event that A is true as +a, and -a for the event that A is false. Similarly for the other variables.

[Network diagram: A and B are the parents of C; C is the parent of D and E.]

Technique 1: Enumeration

This is the "brute-force" approach to BN inference. Suppose I want to know P(+a | +b, +e).

Algorithm:
1) If the query is conditional (yes in this case), rewrite it with the definition of conditional probability:
   P(+a | +b, +e) = P(+a, +b, +e) / P(+b, +e)
2) Use marginalization to rewrite each marginal probability in terms of the joint probability, e.g.:
   P(+a, +b, +e) = \sum_c \sum_d P(+a, +b, c, d, +e)
3) Use the Bayes Net equation to write the joint probability as a product of CPT entries:
   \sum_c \sum_d P(+a, +b, c, d, +e) = \sum_c \sum_d P(+a) P(+b) P(c | +a, +b) P(d | c) P(+e | c)
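As a concrete illustration, here is a minimal Python sketch of inference by enumeration on this five-node network. The slides do not give CPT values for this network, so the numbers below are hypothetical placeholders; only the structure (A and B are parents of C, which is the parent of D and E) is taken from the example.

```python
from itertools import product

# Each entry maps an assignment of the parents to P(+x | parents).
# The numbers are made-up placeholders; the structure matches the slide.
cpts = {
    "A": {(): 0.3},                               # P(+a)
    "B": {(): 0.6},                               # P(+b)
    "C": {(True, True): 0.9, (True, False): 0.5,  # P(+c | A, B)
          (False, True): 0.4, (False, False): 0.1},
    "D": {(True,): 0.7, (False,): 0.2},           # P(+d | C)
    "E": {(True,): 0.8, (False,): 0.1},           # P(+e | C)
}
parents = {"A": (), "B": (), "C": ("A", "B"), "D": ("C",), "E": ("C",)}

def joint(assignment):
    """P(a, b, c, d, e) via the Bayes Net equation (product of CPT entries)."""
    p = 1.0
    for var, cpt in cpts.items():
        key = tuple(assignment[par] for par in parents[var])
        p_true = cpt[key]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

def enumerate_marginal(evidence):
    """Sum the joint probability over every variable not fixed by `evidence`."""
    hidden = [v for v in cpts if v not in evidence]
    total = 0.0
    for values in product([True, False], repeat=len(hidden)):
        assignment = dict(evidence, **dict(zip(hidden, values)))
        total += joint(assignment)
    return total

# P(+a | +b, +e) = P(+a, +b, +e) / P(+b, +e)
numer = enumerate_marginal({"A": True, "B": True, "E": True})
denom = enumerate_marginal({"B": True, "E": True})
print(numer / denom)
```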

Speeding up Enumeration

Pulling out terms:
\sum_c \sum_d P(+a) P(+b) P(c | +a, +b) P(d | c) P(+e | c)
  = P(+a) P(+b) \sum_c \sum_d P(c | +a, +b) P(d | c) P(+e | c)
  = P(+a) P(+b) \sum_c P(c | +a, +b) P(+e | c) \sum_d P(d | c)

Each term in the sum is now cheaper to compute. But the total number of terms (things to add up) remains the same. In the worst case, this is still exponential in the number of nodes.

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible. Let's re-create the network on the left (Burglary and Earthquake are parents of Alarm; Alarm is the parent of John calls and Mary calls), but start with the "John calls" node and gradually add more nodes and edges, and see how many edges/dependencies we end up with.

Maximize Independence

If you can, it helps to create the BN so that it has as few edges as possible.

[Animation over several slides: the network is rebuilt one node at a time, in the order John calls, Mary calls, Alarm, Burglary, Earthquake. As each node is added, we ask which of the already-added nodes it must depend on (candidate edges are marked "?") and keep every edge that is needed. The finished network, built in this non-causal order, ends up with more edges than the original causal network on the left.]

Causal Direction

Moral: Bayes Nets tend to be the most compact, and the most efficient, when edges go from causes to effects.

[Diagrams: the causal-direction network (Burglary, Earthquake → Alarm → John calls, Mary calls) next to the non-causal-direction network built in the previous slides.]

Technique 2: Variable Elimination

Suppose I want to know P(+a | +b, +e).

Algorithm:
1) If the query is conditional (yes in this case), rewrite it with the definition of conditional probability:
   P(+a | +b, +e) = P(+a, +b, +e) / P(+b, +e)
2) For each marginal probability, apply variable elimination to compute it. E.g., for P(+a, +b, +e):
   - Join C & D (multiplication)
   - Eliminate D (marginalization)
   - Join C & +e (multiplication)
   - Eliminate C (marginalization)
   - Join +a & +e (multiplication)
   - Join +b & (+a, +e) (multiplication)
   - Done.

The next slides step through these operations; a short code sketch follows the walkthrough.

Joining D & C

[Diagram: nodes C and D are merged into a single factor node "C, D".]

The Bayes Net provides: P(C | +a, +b) and P(D | C).
Joining D & C will compute P(D, C | +a, +b).
For each c and each d, compute: P(d, c | +a, +b) = P(d | c) * P(c | +a, +b)

Eliminating D

[Diagram: the combined node "C, D" is replaced by C.]

The Bayes Net now provides: P(D, C | +a, +b).
Eliminating D will compute P(C | +a, +b).
For each c, compute: P(c | +a, +b) = \sum_d P(d, c | +a, +b)

Joining C and +e

[Diagram: C and E are merged into a single factor node "C, E".]

The Bayes Net now provides: P(C | +a, +b) and P(+e | C).
Joining C and +e will compute P(+e, C | +a, +b).
For each c, compute: P(+e, c | +a, +b) = P(c | +a, +b) * P(+e | c)

Eliminating C

[Diagram: the combined node "C, E" is replaced by E.]

The Bayes Net now provides: P(+e, C | +a, +b).
Eliminating C will compute P(+e | +a, +b).
Compute: P(+e | +a, +b) = \sum_c P(+e, c | +a, +b)

Joining +a, +b, and +e

[Diagram: the nodes A, B, and E are merged into a single node "A, B, E".]

The Bayes Net now provides: P(+e | +a, +b), P(+a), and P(+b).
Joining +a, +b, and +e will compute P(+e, +a, +b).
Compute: P(+e, +a, +b) = P(+e | +a, +b) * P(+a) * P(+b)
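To make the join and eliminate steps concrete, here is a minimal factor-based Python sketch of this walkthrough. As in the enumeration sketch, the CPT numbers are hypothetical placeholders; only the structure and the elimination order come from the slides.

```python
from itertools import product

def make_factor(vars_, table):
    """A factor: an ordered list of variables and a table over their values."""
    return {"vars": vars_, "table": table}

def join(f1, f2):
    """Pointwise product of two factors (the 'join'/multiplication step)."""
    vars_ = list(dict.fromkeys(f1["vars"] + f2["vars"]))
    table = {}
    for values in product([True, False], repeat=len(vars_)):
        assignment = dict(zip(vars_, values))
        k1 = tuple(assignment[v] for v in f1["vars"])
        k2 = tuple(assignment[v] for v in f2["vars"])
        table[values] = f1["table"][k1] * f2["table"][k2]
    return make_factor(vars_, table)

def eliminate(f, var):
    """Sum a factor over one of its variables (the marginalization step)."""
    keep = [v for v in f["vars"] if v != var]
    idx = f["vars"].index(var)
    table = {}
    for values, p in f["table"].items():
        key = tuple(v for i, v in enumerate(values) if i != idx)
        table[key] = table.get(key, 0.0) + p
    return make_factor(keep, table)

# Factors after instantiating the evidence A = +a, B = +b
# (hypothetical CPT numbers, matching the enumeration sketch above):
f_C = make_factor(["C"], {(True,): 0.9, (False,): 0.1})            # P(C | +a, +b)
f_D = make_factor(["D", "C"], {(True, True): 0.7, (False, True): 0.3,
                               (True, False): 0.2, (False, False): 0.8})  # P(D | C)
f_E = make_factor(["C"], {(True,): 0.8, (False,): 0.1})            # P(+e | C)

f = eliminate(join(f_D, f_C), "D")   # join C & D, then eliminate D
f = eliminate(join(f, f_E), "C")     # join with the +e factor, then eliminate C
p_abe = 0.3 * 0.6 * f["table"][()]   # multiply in P(+a) = 0.3 and P(+b) = 0.6
print(p_abe)                         # = P(+a, +b, +e)
```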

Notes on Time Complexity

For graphs that are trees with N nodes, variable elimination can perform inference in time O(N). For general graphs, variable elimination can perform inference in time O(2^w), where w is the "tree-width" of the graph. (However, this depends on the order in which variables are eliminated, and it is hard to figure out the best order.) Intuitively, tree-width is a measure of how close a graph is to an actual tree. In the worst case, this can mean a time complexity that is exponential in the size of the graph. Exact inference in BNs is known to be NP-hard.

Approximate Inference via Sampling

[Animation over several slides: flip a penny and a nickel repeatedly, and keep a running table of the count and estimated probability for each joint outcome (Penny = Heads/Tails, Nickel = Heads/Tails). The table starts empty and is updated one sample at a time.]

Approximate Inference via Sampling

After many flips, the counts might look like this:

Penny   Nickel   Count   Probability
Heads   Heads    53      .2465
Heads   Tails    56      .2605
Tails   Heads    52      .2419
Tails   Tails    54      .2512

As the number of samples increases, our estimates should approach the true joint distribution. Conveniently, we get to decide how long we want to spend to figure out the probabilities.
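A tiny Python sketch of this idea: simulate the two coin flips, tally the joint outcomes, and watch the estimates approach the true value of 0.25 as the sample size grows.

```python
import random
from collections import Counter

for n in (100, 10_000, 1_000_000):
    # Draw n joint samples of (penny, nickel) and tally each outcome.
    counts = Counter((random.choice("HT"), random.choice("HT")) for _ in range(n))
    estimates = {outcome: count / n for outcome, count in counts.items()}
    print(n, estimates)   # each estimate approaches the true value of 0.25
```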

Generating Samples from a BN

[Network: A is the parent of B and C; B and C are the parents of D.]

CPTs:
P(+a) = .6
P(+b | +a) = .7,  P(+b | -a) = .6
P(+c | +a) = .4,  P(+c | -a) = .9
P(+d | +b, +c) = .5,  P(+d | +b, -c) = .6,  P(+d | -b, +c) = .2,  P(+d | -b, -c) = .3

Sample generation algorithm:
For each variable X that has not been assigned, but whose parents have all been assigned:
1. r ← a random number in the range [0, 1]
2. If r < P(+x | parents(X)), then assign X ← +x
3. Else, assign X ← -x

For this example: At first, A is the only variable whose parents have all been assigned (since it has no parents). r ← 0.3. Since 0.3 < P(+a), we assign A ← +a.

Generating Samples from a BN

[Same network, CPTs, and algorithm as above.]

For this example: Current sample: +a. Next, both B and C have all their parents assigned. Let's choose B. r ← 0.9. Since 0.9 >= P(+b | +a), we set B ← -b.

Generating Samples from a BN

[Same network, CPTs, and algorithm as above.]

For this example: Current sample: +a, -b. Quiz: which variable would be assigned next? If r ← 0.4, what would this variable be assigned?

Generating Samples from a BN

[Same network, CPTs, and algorithm as above.]

For this example: Current sample: +a, -b, -c. Now D has all its parents assigned. If r ← 0.2, what would D be assigned?

Generating Samples from a BN

[Same network, CPTs, and algorithm as above.]

For this example: Current sample: +a, -b, -c, +d. That completes this sample. We can now increase the count of (+a, -b, -c, +d) by 1, and move on to the next sample.
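Here is a minimal Python sketch of this sample-generation procedure for the example network (A is the parent of B and C; B and C are the parents of D), using the CPT values from the slides.

```python
import random

parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")}
# Probability that each variable is true (+x), indexed by its parents' values.
p_true = {
    "A": {(): 0.6},
    "B": {(True,): 0.7, (False,): 0.6},
    "C": {(True,): 0.4, (False,): 0.9},
    "D": {(True, True): 0.5, (True, False): 0.6,
          (False, True): 0.2, (False, False): 0.3},
}

def prior_sample():
    """Assign each variable after its parents, using r < P(+x | parents(X))."""
    sample = {}
    for var in ["A", "B", "C", "D"]:          # a topological order
        key = tuple(sample[p] for p in parents[var])
        r = random.random()
        sample[var] = r < p_true[var][key]
    return sample

print(prior_sample())   # e.g. {'A': True, 'B': False, 'C': False, 'D': True}
```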

Quiz: Approximating Queries

Suppose I generate a bunch of samples for a BN with variables A, B, C, and get these counts:

A     B     C     Count
+a    +b    +c    20
            -c    30
      -b          50
-a                80
                  40
Total             300

What are these probabilities?
P(+a, -b, -c)?
P(+a, -c)?
P(-a | -b, -c)?
P(-b | +a)?

Technique 3: Rejection Sampling

Rejection sampling is the fancy name given to the procedure you just used to compute, e.g., P(-a | -b, -c). To compute this, you ignore (or "reject") samples where B = +b or C = +c, since they don't match the evidence in the query.
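A minimal Python sketch of rejection sampling for P(-a | -b, -c) on the same example network and CPTs, reusing the prior-sampling routine sketched earlier.

```python
import random

parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")}
p_true = {
    "A": {(): 0.6},
    "B": {(True,): 0.7, (False,): 0.6},
    "C": {(True,): 0.4, (False,): 0.9},
    "D": {(True, True): 0.5, (True, False): 0.6,
          (False, True): 0.2, (False, False): 0.3},
}

def prior_sample():
    """Generate one sample from the prior, parents before children."""
    sample = {}
    for var in ["A", "B", "C", "D"]:
        key = tuple(sample[p] for p in parents[var])
        sample[var] = random.random() < p_true[var][key]
    return sample

def rejection_sample(query_var, query_val, evidence, n=100_000):
    """Estimate P(query_var = query_val | evidence) from accepted samples."""
    accepted = matching = 0
    for _ in range(n):
        s = prior_sample()
        if any(s[v] != val for v, val in evidence.items()):
            continue                      # reject: sample contradicts evidence
        accepted += 1
        matching += (s[query_var] == query_val)
    return matching / accepted

# P(-a | -b, -c)
print(rejection_sample("A", False, {"B": False, "C": False}))
```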

Consistency Rejection sampling is a consistent approximate inference technique. Consistency means that as the number of samples increases, the estimated value of the probability for a query approaches its true value. In the limit of infinite samples, consistent sampling techniques give the correct probabilities.

Room for Improvement

Efficiency of Rejection Sampling: If you're interested in a query like P(+a | +b, +c), you'll reject 5 out of 6 samples, since only 1 out of 6 samples has the right evidence (+b and +c). So most samples are useless for your query.

Technique 4: Likelihood Weighting

[Same network and CPTs as in the previous examples.]

Query of interest: P(+c | +b, +d)

Sample generation algorithm:
Initialize: sample ← {}, P(sample) ← 1
For each variable X that has not been assigned, but whose parents have all been assigned:
1. If X is an evidence node:
   a. Assign X the value from the query
   b. P(sample) ← P(sample) * P(X | parents(X))
2. Otherwise, assign X as normal; P(sample) is unchanged.

For this example: Sample: {}, P(sample): 1. At first, A is the only variable whose parents have been assigned (since it has no parents). r ← 0.3. Since 0.3 < P(+a), we assign A ← +a.

Likelihood Weighting

Query of interest: P(+c | +b, +d)

[Same network, CPTs, and algorithm as above.]

For this example: Sample: {+a}, P(sample): 1. B and C have their parents assigned. Let's do B next. B is an evidence node, so we choose B ← +b (from the query). Also, P(+b | +a) = .7, so we update P(sample) ← 0.7.

Likelihood Weighting

Query of interest: P(+c | +b, +d)

[Same network, CPTs, and algorithm as above.]

For this example: Sample: {+a, +b}, P(sample): 0.7. C has its parents assigned. It is NOT an evidence node. r ← 0.8. Since 0.8 >= P(+c | +a), C ← -c. P(sample) is NOT updated.

Likelihood Weighting

Query of interest: P(+c | +b, +d)

[Same network, CPTs, and algorithm as above.]

For this example: Sample: {+a, +b, -c}, P(sample): 0.7. D has its parents assigned. How do the sample and P(sample) change?

Likelihood Weighting

Query of interest: P(+c | +b, +d)

[Same network, CPTs, and algorithm as above.]

For this example: D is an evidence node, so we choose D ← +d (from the query) and update P(sample) ← 0.7 * P(+d | +b, -c) = 0.7 * 0.6 = 0.42. Sample: {+a, +b, -c, +d}, P(sample): 0.42.
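Here is a minimal Python sketch of likelihood weighting for the query P(+c | +b, +d) on the same example network and CPTs.

```python
import random

parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")}
p_true = {
    "A": {(): 0.6},
    "B": {(True,): 0.7, (False,): 0.6},
    "C": {(True,): 0.4, (False,): 0.9},
    "D": {(True, True): 0.5, (True, False): 0.6,
          (False, True): 0.2, (False, False): 0.3},
}

def weighted_sample(evidence):
    """Fix evidence variables and weight the sample by their likelihood."""
    sample, weight = {}, 1.0
    for var in ["A", "B", "C", "D"]:           # a topological order
        key = tuple(sample[p] for p in parents[var])
        p = p_true[var][key]
        if var in evidence:
            sample[var] = evidence[var]
            weight *= p if evidence[var] else 1.0 - p
        else:
            sample[var] = random.random() < p
    return sample, weight

def likelihood_weighting(query_var, query_val, evidence, n=100_000):
    """Estimate P(query_var = query_val | evidence) from weighted samples."""
    num = den = 0.0
    for _ in range(n):
        s, w = weighted_sample(evidence)
        den += w
        if s[query_var] == query_val:
            num += w
    return num / den

print(likelihood_weighting("C", True, {"B": True, "D": True}))   # ≈ P(+c | +b, +d)
```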

Likelihood Weighting vs. Rejection Sampling for the query P(+c | -a)

Rejection sampling (raw sample counts):

A     B     C     Count
+a    +b    +c    20
            -c    30
      -b          50
-a                80
                  40
Total             300

Likelihood weighting (probabilistic counts):

A     B     C     Probabilistic Count
-a    +b    +c    23.58
            -c    68.30
      -b    +c    90.60
            -c    40.60
Total             223.08

Likelihood weighting requires fewer samples to get good estimates, but solves just one query at a time. Rejection sampling needs LOTS of samples, but can answer any query. Both are consistent.

Further Room for Improvement

[Same network and CPTs as before.]

Example query of interest: P(+d | +b, +c)

If we generate samples using likelihood weighting, the choice of sample for D takes the evidence into account. However, the choice of sample for A does NOT take the evidence into account. So we may generate lots of samples that are very unlikely and don't contribute much to our overall counts.

Quiz: what is P(+a | +b, +c)? And P(-a | +b, +c)?

Technique 5: Gibbs Sampling

Named after the physicist Josiah Willard Gibbs (you may have heard of Gibbs free energy). This is a special case of a more general algorithm called Metropolis-Hastings, which is itself an instance of Markov chain Monte Carlo (MCMC) estimation.

Gibbs Sampling

Query of interest: P(-d | +b, -c)

[Same network and CPTs as before.]

Sample generation algorithm:
Initialize: sample ← {A ← random, +b, -c, D ← random}
Repeat:
1. Pick a non-evidence variable X
2. Get a random number r in the range [0, 1]
3. If r < P(+x | all other variables), set X ← +x
4. Otherwise, set X ← -x
5. Add 1 to the count for this new sample

For this example: Sample: {-a, +b, -c, +d}. A and D are non-evidence. Randomly choose D to re-sample. r ← 0.7. P(+d | -a, +b, -c) = P(+d | +b, -c) = .6. Since r >= .6, D ← -d.

Gibbs Sampling

Query of interest: P(-d | +b, -c)

[Same network, CPTs, and algorithm as above.]

For this example: Sample: {-a, +b, -c, -d}. A and D are non-evidence. Randomly choose D to re-sample. r ← 0.9. P(+d | -a, +b, -c) = P(+d | +b, -c) = .6. Since r >= .6, D ← -d (no change).

Gibbs Sampling

Query of interest: P(-d | +b, -c)

[Same network, CPTs, and algorithm as above.]

For this example: Sample: {-a, +b, -c, -d}. A and D are non-evidence. Randomly choose A to re-sample. r ← 0.3. P(+a | +b, -c, -d) = P(+a | +b, -c) = ? What is A after this step?

Details of Gibbs Sampling

To compute P(X | all other variables), it is enough to consider only the Markov Blanket of X: X's parents, X's children, and the parents of X's children. Everything else is conditionally independent of X, given its Markov Blanket.

Unlike Rejection Sampling and Likelihood Weighting, samples in Gibbs Sampling are NOT independent. Nevertheless, Gibbs Sampling is consistent.

It is very common to discard the first N (often N ≈ 1000) samples from a Gibbs sampler. These first N samples are called the "burn-in" period.
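Here is a minimal Python sketch of Gibbs sampling for the walkthrough query P(-d | +b, -c), computing P(X | all other variables) from X's Markov blanket as described above.

```python
import random

parents = {"A": (), "B": ("A",), "C": ("A",), "D": ("B", "C")}
children = {"A": ("B", "C"), "B": ("D",), "C": ("D",), "D": ()}
p_true = {
    "A": {(): 0.6},
    "B": {(True,): 0.7, (False,): 0.6},
    "C": {(True,): 0.4, (False,): 0.9},
    "D": {(True, True): 0.5, (True, False): 0.6,
          (False, True): 0.2, (False, False): 0.3},
}

def cond_prob(var, value, state):
    """P(var = value | parents(var)) under the current assignment."""
    key = tuple(state[p] for p in parents[var])
    p = p_true[var][key]
    return p if value else 1.0 - p

def markov_blanket_prob(var, state):
    """P(var = true | everything else), computed from var's Markov blanket."""
    scores = {}
    for value in (True, False):
        s = dict(state, **{var: value})
        score = cond_prob(var, value, s)          # P(var | its parents)
        for child in children[var]:               # times P(child | its parents)
            score *= cond_prob(child, s[child], s)
        scores[value] = score
    return scores[True] / (scores[True] + scores[False])

def gibbs(query_var, query_val, evidence, n=100_000, burn_in=1000):
    nonevidence = [v for v in p_true if v not in evidence]
    state = dict(evidence, **{v: random.random() < 0.5 for v in nonevidence})
    hits = total = 0
    for i in range(n):
        x = random.choice(nonevidence)            # pick a non-evidence variable
        state[x] = random.random() < markov_blanket_prob(x, state)
        if i >= burn_in:                          # discard the burn-in samples
            total += 1
            hits += (state[query_var] == query_val)
    return hits / total

print(gibbs("D", False, {"B": True, "C": False}))   # ≈ P(-d | +b, -c)
```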