Belief propagation with junction trees Presented by Mark Silberstein and Yaniv Hamo
Outline – Part I
● Belief propagation example
● Propagation using message passing
● Clique trees
● Junction trees
● The junction tree algorithm
Simple belief propagation example: “Icy Roads” (from Jensen, “An Introduction to Bayesian Networks”)
P(X_Icy): yes 0.7, no 0.3
P(X_Watson | X_Icy = yes) = (0.8, 0.2),  P(X_Watson | X_Icy = no) = (0.1, 0.9)
P(X_Holmes | X_Icy = yes) = (0.8, 0.2),  P(X_Holmes | X_Icy = no) = (0.1, 0.9)
“Watson has had an accident!” Evidence: P(X_Watson = yes) = 1
By Bayes’ rule: P(X_Icy | X_Watson = yes) = (0.95, 0.05), versus the a priori (0.70, 0.30)
By joint probability + marginalization: P(X_Holmes | X_Watson = yes) = (0.76, 0.24), versus the a priori (0.59, 0.41)
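The two posteriors above can be reproduced in a few lines. The sketch below is not part of the slides; it assumes the conditional tables (0.8, 0.2) / (0.1, 0.9) listed earlier, with array names chosen only for illustration.

```python
import numpy as np

# Icy Roads example (Jensen).  Index 0 = "yes", 1 = "no".
p_icy = np.array([0.7, 0.3])                     # P(Icy)
p_watson_given_icy = np.array([[0.8, 0.2],       # P(Watson | Icy = yes)
                               [0.1, 0.9]])      # P(Watson | Icy = no)
p_holmes_given_icy = p_watson_given_icy          # same table for Holmes

# Evidence: Watson has had an accident, i.e. Watson = yes.
# Bayes' rule: P(Icy | Watson=yes) is proportional to P(Watson=yes | Icy) * P(Icy)
post_icy = p_watson_given_icy[:, 0] * p_icy
post_icy /= post_icy.sum()
print(post_icy)          # ~ [0.95, 0.05]

# Marginalization: P(Holmes | Watson=yes) = sum_I P(Holmes | I) * P(I | Watson=yes)
post_holmes = post_icy @ p_holmes_given_icy
print(post_holmes)       # ~ [0.76, 0.24]
```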
“No, the roads are not icy.” Evidence: P(X_Icy = no) = 1
When instantiating X_Icy, X_Holmes becomes independent of X_Watson: X_Holmes ⊥ X_Watson | X_Icy
Answering probabilistic queries (J. Pearl, [1])
● Joint probability using elimination: the human brain most likely does not do that! Why?
  – It needs to hold the entire network in order to set the elimination order
  – It answers only a single query, rather than all queries at once
  – It creates and computes spurious dependencies among variables conceived as independent
  – It is sequential!
● Our brain probably computes beliefs in parallel
Belief updating as constraint propagation (J. Pearl, [1])
● Local, simple computations
● But is it possible at all?
  – Why would it ever stabilize?
  – Rumour example: you update your neighbour, and after several days you hear the same rumour back from him. Should that increase your belief?
● Analogous example of constraint propagation: graph coloring
Simple example of chain propagation (J. Pearl, [1])
Definitions: a chain X → Y → Z with evidence e at its end. Each link carries a link matrix P(y | x); the messages passed along the chain are vectors, e.g. λ(x) = P(e | x) = Σ_y P(y | x) λ(y).
Bidirectional propagation (J. Pearl, [1])
A chain T → U → X → Y → Z with evidence e+ above it and e− below it: π messages (π(t), π(u), π(x)) are propagated forward and λ messages (λ(x), λ(y), λ(z)) backward; a π message combines the link matrix over its rows, a λ message over its columns.
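A minimal sketch of the bidirectional π/λ propagation on a short chain. The link matrices and evidence vector below are illustrative values, not taken from Pearl; they just instantiate the scheme.

```python
import numpy as np

# Bidirectional propagation on a chain T -> X -> Y with evidence below Y.
# Link matrices M[i, j] = P(child = j | parent = i); values are illustrative.
M_tx = np.array([[0.9, 0.1],
                 [0.2, 0.8]])      # P(X | T)
M_xy = np.array([[0.7, 0.3],
                 [0.4, 0.6]])      # P(Y | X)

prior_t = np.array([0.6, 0.4])     # P(T), plays the role of pi(T)
lam_y   = np.array([1.0, 0.0])     # evidence e-: Y observed in state 0

# pi messages flow forward (prior -> evidence): pi(x) = sum_t P(x|t) pi(t)
pi_x = prior_t @ M_tx              # combines rows of the link matrix
# lambda messages flow backward (evidence -> prior): lambda(x) = sum_y P(y|x) lambda(y)
lam_x = M_xy @ lam_y               # combines columns of the link matrix

# Belief at X given all the evidence: BEL(x) proportional to pi(x) * lambda(x)
bel_x = pi_x * lam_x
bel_x /= bel_x.sum()
print(bel_x)
```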
HMM and the forward-backward algorithm
A hidden Markov model with hidden states H_1, …, H_L and observations X_1, …, X_L:
P(x_1, …, x_L, h_i) = P(x_1, …, x_i, h_i) · P(x_{i+1}, …, x_L | x_1, …, x_i, h_i)
                    = P(x_1, …, x_i, h_i) · P(x_{i+1}, …, x_L | h_i)
                    = f(h_i) · b(h_i)
Belief update: P(h_i | x_1, …, x_L) = (1/K) · P(x_1, …, x_L, h_i), where K = Σ_{h_i} P(x_1, …, x_L, h_i).
In message-passing terms, f(h_i) = P(x_1, …, x_i, h_i) plays the role of π(h_i), and b(h_i) = P(x_{i+1}, …, x_L | h_i) plays the role of λ(h_i).
The forward algorithm
The task: compute f(h_i) = P(x_1, …, x_i, h_i) for i = 1, …, L (namely, considering the evidence up to time slot i).
Basis step:  P(x_1, h_1) = P(h_1) · P(x_1 | h_1)
Step i:      P(x_1, …, x_i, h_i) = Σ_{h_{i-1}} P(x_1, …, x_{i-1}, h_{i-1}) · P(h_i | h_{i-1}) · P(x_i | h_i)
(the factor P(x_1, …, x_{i-1}, h_{i-1}) plays the role of π(h_{i-1}))
The backward algorithm
The task: compute b(h_i) = P(x_{i+1}, …, x_L | h_i) for i = L−1, …, 1 (namely, considering the evidence after time slot i).
Step i:  b(h_i) = P(x_{i+1}, …, x_L | h_i) = Σ_{h_{i+1}} P(h_{i+1} | h_i) · P(x_{i+1} | h_{i+1}) · P(x_{i+2}, …, x_L | h_{i+1}) = Σ_{h_{i+1}} P(h_{i+1} | h_i) · P(x_{i+1} | h_{i+1}) · b(h_{i+1})
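A compact sketch of the two recursions and the belief update. The function and parameter names (forward_backward, init, trans, emit, obs) are my own, and the observations are assumed to be integer state indices.

```python
import numpy as np

def forward_backward(init, trans, emit, obs):
    """Forward-backward on a discrete HMM.
    init[h]      = P(h_1)
    trans[h, h'] = P(h_{i+1} = h' | h_i = h)
    emit[h, x]   = P(x | h)
    obs          = observed sequence x_1..x_L (integer indices)
    Returns f, b, and the posteriors P(h_i | x_1..x_L)."""
    L, n = len(obs), len(init)
    f = np.zeros((L, n))
    b = np.zeros((L, n))

    # Forward: f_i(h) = P(x_1..x_i, h_i = h)
    f[0] = init * emit[:, obs[0]]                       # basis step
    for i in range(1, L):
        f[i] = (f[i - 1] @ trans) * emit[:, obs[i]]     # step i

    # Backward: b_i(h) = P(x_{i+1}..x_L | h_i = h)
    b[L - 1] = 1.0                                      # basis step
    for i in range(L - 2, -1, -1):
        b[i] = trans @ (emit[:, obs[i + 1]] * b[i + 1])

    # Belief update: P(h_i | x_1..x_L) = f_i(h) b_i(h) / K
    post = f * b
    post /= post.sum(axis=1, keepdims=True)
    return f, b, post
```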
Can we generalize this approach to any graph?
● Loops pose a problem
  – We might reach a contradiction or loop indefinitely
● We should apply clustering and create a tree of clusters
● Each vertex in the cluster tree has a potential Ψ, mapping every combination of the cluster’s variables to a non-negative real number (a joint distribution table is a special case)
● Problems:
  – There are many ways to create clusters (e.g. all vertices forming a loop)
  – How do we obtain marginal probabilities from the potentials?
Clique trees
● Yet another representation of the joint probability
● How we build them:
  – For every variable A there should exist a single clique V that contains A together with its parents
  – A clique’s potential is the product of all the tables assigned to it (a table is multiplied in only if it was not already used in another clique)
  – Links are labeled with separators, which consist of the intersection of the adjacent nodes
  – Separator tables are initialized to ones
● Claim: the joint distribution is the product of all cluster tables divided by the product of all separator tables
Example: a graph over A, B, C, D, E, F is triangulated into a chordal graph; its cliques {A,B,C}, {C,D,E}, {D,E,F} form a clique tree with separators {C} and {D,E}.
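A sketch of initializing the clique potentials for this example. It assumes a hypothetical factorization P(A)·P(B|A)·P(C|A,B)·P(D|C)·P(E|C,D)·P(F|D,E), consistent with the chordal graph above but with random placeholder tables, and checks the claim that (with separator tables of ones) the product of the clique tables is the joint distribution.

```python
import numpy as np

# Cliques from the example: {A,B,C}, {C,D,E}, {D,E,F}; separators {C} and {D,E}.
# All variables binary.  Global axis order: A, B, C, D, E, F.
rng = np.random.default_rng(0)

def random_cpt(n_parents):
    """Random P(child | parents): last axis sums to one (placeholder values only)."""
    t = rng.random((2,) * (n_parents + 1))
    return t / t.sum(axis=-1, keepdims=True)

# Hypothetical factorization: P(A) P(B|A) P(C|A,B) P(D|C) P(E|C,D) P(F|D,E)
p_a    = random_cpt(0)    # axes: A
p_b_a  = random_cpt(1)    # axes: A, B
p_c_ab = random_cpt(2)    # axes: A, B, C
p_d_c  = random_cpt(1)    # axes: C, D
p_e_cd = random_cpt(2)    # axes: C, D, E
p_f_de = random_cpt(2)    # axes: D, E, F

# Initialize clique potentials: each table is multiplied into exactly one clique.
psi_abc = p_a[:, None, None] * p_b_a[:, :, None] * p_c_ab   # over A, B, C
psi_cde = p_d_c[:, :, None] * p_e_cd                        # over C, D, E
psi_def = p_f_de                                            # over D, E, F
# Separator tables over {C} and {D,E} start as ones.

# Claim: product of clique tables / product of separator tables = joint P(A..F).
joint_from_tree = (psi_abc[:, :, :, None, None, None]
                   * psi_cde[None, None, :, :, :, None]
                   * psi_def[None, None, None, :, :, :])
print(np.isclose(joint_from_tree.sum(), 1.0))   # a proper joint sums to one
```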
Consistency
● Adjacent nodes V and W with separator S are consistent if their marginals onto S agree: Σ_{V\S} Ψ(V) = Σ_{W\S} Ψ(W)
Absorption
● Absorption passes a message from one node to another. Say W absorbs from V over their separator S:
  – Ψ*(S) = Σ_{V\S} Ψ(V)
  – Ψ*(W) = Ψ(W) · Ψ*(S) / Ψ(S)
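A sketch of a single absorption step for the concrete cliques V = {C,D,E}, W = {D,E,F}, S = {D,E} from the earlier example; the array shapes and the 0/0 convention are assumptions of this sketch.

```python
import numpy as np

# W absorbs from V over separator S.  Concretely: V = {C,D,E}, W = {D,E,F},
# S = {D,E}, all variables binary (same layout as the clique-tree sketch above).
def absorb(psi_v, psi_s, psi_w):
    """Pass a message from V (axes C,D,E) to W (axes D,E,F) over S (axes D,E)."""
    psi_s_new = psi_v.sum(axis=0)                     # psi*(S) = sum_{V\S} psi(V): sum out C
    update = np.divide(psi_s_new, psi_s,              # psi*(S) / psi(S), with 0/0 := 0
                       out=np.zeros_like(psi_s_new), where=psi_s != 0)
    psi_w_new = psi_w * update[:, :, None]            # psi*(W) = psi(W) * psi*(S)/psi(S)
    return psi_s_new, psi_w_new

# Note: the product psi(V) * psi(W) / psi(S) is unchanged by this update; this is
# the invariant behind the correctness of the clique-tree representation.
```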
Absorption (cont.)
● Does absorption ensure consistency?
● The product of the cluster tables divided by the product of the separator tables is invariant under absorption
● This invariant maintains the correctness of the clique tree representation
Rules of message passing in a clique tree
● Node V can send exactly one message to a neighbour W, and only if V has received a message from each of its other neighbours
● We continue until a message has been passed once in each direction along every link
● After all messages have been sent in both directions over every link, the tree is consistent
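One schedule that satisfies the rule above is to collect messages towards an arbitrary root and then distribute them back out. The sketch below only computes the ordering of messages (not the absorptions themselves); the helper name two_pass_schedule and the dictionary representation of the tree are assumptions of this sketch.

```python
def two_pass_schedule(neighbors, root):
    """Return an ordering of directed messages (v, w) obeying the rule:
    v may send to w only after receiving from all its other neighbours.
    neighbors: dict mapping each clique-tree node to its adjacent nodes."""
    # Collect phase: messages flow from the leaves towards the root.
    order, visited = [], {root}
    def collect(v):
        for u in neighbors[v]:
            if u not in visited:
                visited.add(u)
                collect(u)
                order.append((u, v))   # u has now heard from all its other neighbours
    collect(root)
    # Distribute phase: messages flow from the root back out to the leaves.
    order += [(v, u) for (u, v) in reversed(order)]
    return order

# Example on the clique tree ABC - CDE - DEF from before:
tree = {"ABC": ["CDE"], "CDE": ["ABC", "DEF"], "DEF": ["CDE"]}
print(two_pass_schedule(tree, root="CDE"))
# [('ABC', 'CDE'), ('DEF', 'CDE'), ('CDE', 'DEF'), ('CDE', 'ABC')]
```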
Does local consistency ensure global consistency?
● The same old loop problem
● Building a tree breaks the loops
(Figure: a clique tree with nodes ABC, DBC, ED, EA over the variables A, B, C, D, E, where local consistency does not guarantee global consistency.)
Junction tree
● Ensures global consistency
● Definition: a clique tree is a junction tree if, for every pair of nodes V and W, all nodes on the path between them contain V ∩ W
(Figure: a junction tree with cliques ABE, BCF, CDG, EH, FI, FJ, GK; each separator is the intersection of its two adjacent cliques.)
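A sketch that checks the junction tree property directly from the definition. The edge list for the example cliques is one plausible reading of the figure, not taken verbatim from the slides.

```python
from itertools import combinations

def is_junction_tree(cliques, edges):
    """Check the junction tree property: for every pair of nodes V, W,
    each node on the path between them contains V ∩ W.
    cliques: dict name -> set of variables; edges: tree edges as name pairs."""
    adj = {c: [] for c in cliques}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    def path(start, goal):
        # The unique path between two nodes of a tree, found by DFS.
        stack = [(start, [start])]
        while stack:
            node, p = stack.pop()
            if node == goal:
                return p
            stack.extend((n, p + [n]) for n in adj[node] if n not in p)

    return all(cliques[v] & cliques[w] <= cliques[m]
               for v, w in combinations(cliques, 2)
               for m in path(v, w))

# Example cliques from the slide, with an assumed tree structure:
cliques = {c: set(c) for c in ["ABE", "BCF", "CDG", "EH", "FI", "FJ", "GK"]}
edges = [("ABE", "BCF"), ("BCF", "CDG"), ("ABE", "EH"),
         ("BCF", "FI"), ("BCF", "FJ"), ("CDG", "GK")]
print(is_junction_tree(cliques, edges))   # True
```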
Claims on junction trees
● Claim: a consistent junction tree is globally consistent
● Claim: let T be a junction tree over the universe U; then the product of all node potentials divided by the product of all separator potentials equals the joint distribution P(U)
● Claim: after a full round of message passing in T, each node potential holds the marginal of P(U) onto that node (and each separator potential the marginal onto the separator)
● Claim: given evidence e entered at different nodes, after a full round of message passing in T, each node potential holds P(V, e), from which P(V | e) is obtained by normalization
References until now
1. J. Pearl, “Probabilistic Reasoning in Intelligent Systems”
2. Finn V. Jensen, “An Introduction to Bayesian Networks”
3. Presentations by Sam Roweis, Henrik Bengtsson, and David Barber