1
Exact Inference Eran Segal Weizmann Institute
2
Course Outline

Week | Topic | Reading
1 | Introduction, Bayesian network representation | 1-3
2 | Bayesian network representation cont. | 1-3
3 | Local probability models | 4
4 | Undirected graphical models | 5
5 | Exact inference | 7,8
6 | Exact inference cont. | 9
7 | Approximate inference | 10
8 | Approximate inference cont. | 11
9 | Learning: Parameters | 13,14
10 | Learning: Parameters cont. | 14
11 | Learning: Structure | 15
12 | Partially observed data | 16
13 | Learning undirected graphical models | 17
14 | Template models | 18
15 | Dynamic Bayesian networks | 18
3
Inference Markov networks and Bayesian networks represent a joint probability distribution Networks contain information needed to answer any query about the distribution Inference is the process of answering such queries Direction between variables does not restrict queries Inference combines evidence from all network parts
4
Likelihood Queries. Compute the probability (= likelihood) of the evidence. Evidence: a subset of variables E and an assignment e. Task: compute P(E=e). Computation: P(E=e) = Σ_w P(w, e), summing the joint distribution over all assignments w to the remaining variables W = U - E.
5
Conditional Probability Queries. Evidence: a subset of variables E and an assignment e. Query: a subset of variables Y. Task: compute P(Y | E=e). Applications: medical and fault diagnosis, genetic inheritance. Computation: P(Y | E=e) = P(Y, e) / P(e), where both numerator and denominator are obtained by summing out the remaining variables from the joint distribution.
6
Maximum A Posteriori Assignment (MAP). Evidence: a subset of variables E and an assignment e. Query: a subset of variables Y. Task: compute MAP(Y|E=e) = argmax_y P(Y=y | E=e). Note 1: there may be more than one possible solution. Note 2: equivalent to computing argmax_y P(Y=y, E=e), since P(Y=y | E=e) = P(Y=y, E=e) / P(E=e) and the denominator does not depend on y. Computation: argmax_y Σ_z P(Y=y, Z=z, E=e), where Z are the remaining variables.
7
Most Probable Assignment: MPE. Most Probable Explanation (MPE). Evidence: a subset of variables E and an assignment e. Query: all other variables Y (Y = U - E). Task: compute MPE(Y|E=e) = argmax_y P(Y=y | E=e). Note: there may be more than one possible solution. Applications: decoding messages (find the most likely transmitted bits), diagnosis (find a single most likely consistent hypothesis).
8
Most Probable Assignment: MPE. Note: we are searching for the most likely joint assignment to all variables, which may differ from the most likely assignment to each RV separately. Example (network A -> B):
P(A): P(a0) = 0.4, P(a1) = 0.6
P(B|A): P(b0|a0) = 0.1, P(b1|a0) = 0.9; P(b0|a1) = 0.5, P(b1|a1) = 0.5
Joint: P(a0, b0) = 0.04, P(a0, b1) = 0.36, P(a1, b0) = 0.3, P(a1, b1) = 0.3
Since P(a1) > P(a0), MAP(A) = a1, yet MPE(A,B) = {a0, b1}.
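To make the distinction concrete, here is a minimal Python sketch (using only the numbers from this slide) that computes both the single-variable MAP assignment and the joint MPE assignment and shows that they disagree.

```python
# Two-node example A -> B from the slide: MAP over A alone differs from the MPE.
import itertools

P_A = {"a0": 0.4, "a1": 0.6}                 # P(A)
P_B_given_A = {                              # P(B | A)
    "a0": {"b0": 0.1, "b1": 0.9},
    "a1": {"b0": 0.5, "b1": 0.5},
}

# Joint distribution P(A, B)
joint = {(a, b): P_A[a] * P_B_given_A[a][b]
         for a, b in itertools.product(P_A, ["b0", "b1"])}

# MAP over A alone: marginalize B first, then maximize
P_A_marg = {a: sum(joint[(a, b)] for b in ["b0", "b1"]) for a in P_A}
map_A = max(P_A_marg, key=P_A_marg.get)

# MPE: maximize over the full joint assignment
mpe = max(joint, key=joint.get)

print(joint)   # {('a0','b0'): 0.04, ('a0','b1'): 0.36, ('a1','b0'): 0.3, ('a1','b1'): 0.3}
print(map_A)   # 'a1'  (P(a1) = 0.6 > P(a0) = 0.4)
print(mpe)     # ('a0', 'b1')  with probability 0.36
```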
9
Exact Inference in Graphical Models Graphical models can be used to answer Conditional probability queries MAP queries MPE queries Naïve approach Generate joint distribution Depending on query, compute sum/max Exponential blowup Exploit independencies for efficient inference
10
Complexity of Bayesnet Inference. Assume the encoding specifies the DAG structure and that CPDs are represented as table CPDs. Decision problem: given a network G, a variable X and a value x ∈ Val(X), decide whether P_G(X=x) > 0.
11
Complexity of Bayesnet Inference. Theorem: the decision problem is NP-complete. Proof: the decision problem is in NP: for an assignment e to all network variables, check whether X=x in e and P(e) > 0. Reduction from 3-SAT: binary-valued variables Q_1,...,Q_n; clauses C_1,...,C_k, where C_i = L_{i,1} ∨ L_{i,2} ∨ L_{i,3} and each L_{i,j} (i=1,...,k, j=1,2,3) is a literal, i.e., some variable Q_m or its negation ¬Q_m; φ = C_1 ∧ ... ∧ C_k. Decision problem: is there an assignment to Q_1,...,Q_n satisfying φ? Construct a network such that P(X=1) > 0 iff φ is satisfiable.
12
Complexity of Bayesnet Inference. Construction (network over Q_1,...,Q_n, clause nodes C_1,...,C_k, AND nodes A_1, A_2,..., and output X): P(Q_i=1) = 0.5; P(C_i=1 | Pa(C_i)) is deterministic, equal to 1 iff the assignment to Pa(C_i) satisfies clause C_i; the CPDs of A_1,...,A_{k-2}, X form a deterministic AND of the clause nodes. Then P(X=1 | q_1,...,q_n) = 1 iff q_1,...,q_n satisfies φ, so P(X=1) > 0 iff there is a satisfying assignment.
13
Complexity of Bayesnet Inference Easy to check Polynomial number of variables CPDs can be described by a small table (max 8 parameters) P(X = 1) > 0 if and only if there exists a satisfying assignment to Q 1,…,Q n Conclusion: polynomial reduction of 3-SAT Implications Cannot find a general efficient procedure for all networks Can find provably efficient procedures for particular families of networks Exploit network structure and independencies Dynamic programming
14
Approximate Inference. Rather than computing the exact answer, compute an approximation to it. Approximation metrics for computing P(y|e): an estimate p has absolute error ε if |P(y|e) - p| ≤ ε; an estimate p has relative error ε if p/(1+ε) ≤ P(y|e) ≤ p(1+ε). Absolute error is not very useful for probability distributions, since the probabilities involved are often small.
15
Approximate Inference Complexity. Theorem: the following is NP-hard: given a network G over n variables, a variable X and a value x ∈ Val(X), find a number p that has relative error ε(n) for the query P_G(X=x). Proof: based on the hardness result for exact inference. We showed that deciding P_G(X=x) > 0 is hard; an algorithm that returns an estimate p with bounded relative error would return p > 0 iff P_G(X=x) > 0, so approximate inference with relative error is as hard as the exact decision problem.
16
Approximate Inference Complexity. Theorem: the following is NP-hard for ε < 0.5: given a network G over n variables, a variable X and a value x ∈ Val(X), and an observation e ∈ Val(E) for variables E, find a number p that has absolute error ε for P_G(X=x | E=e). Proof: consider the same network construction as above. Strategy of the proof: show that given such an approximation, we can determine satisfiability in polynomial time.
17
Approximate Inference Complexity. Proof cont.: Construction: use the approximate algorithm to compute the query P(Q_1 | X=1); assign Q_1 to the value q with the higher estimated posterior probability; generate a new network without Q_1 and with modified CPDs; repeat this process for all Q_i. Claim: the assignment generated by this process satisfies φ iff φ has a satisfying assignment. Proving the claim: Easy case: if φ does not have a satisfying assignment, then obviously the resulting assignment will not satisfy φ. Harder case: if φ has a satisfying assignment, we show that it has a satisfying assignment with Q_1 = q.
18
Approximate Inference Complexity. Proof cont., proving the claim: Easy case: if φ does not have a satisfying assignment, then obviously the resulting assignment will not satisfy φ. Harder case: if φ has a satisfying assignment, we show that it has one with Q_1 set to the value we chose. If φ is satisfiable with both values of Q_1, we are done. If φ is not satisfiable with Q_1 = q', then P(Q_1=q' | X=1) = 0 and P(Q_1=¬q' | X=1) = 1; since our approximation has absolute error ε < 0.5, we will necessarily choose ¬q', which does admit a satisfying assignment. By induction on all Q variables, the assignment we find must satisfy φ. The construction process is polynomial.
19
Inference Complexity Summary. NP-hard: exact inference; approximate inference with relative error; approximate inference with absolute error ε < 0.5 (given evidence). Hopeless? No: we will see many network structures that have provably efficient algorithms, and we will see cases where approximate inference works efficiently with high accuracy.
20
Exact Inference: Variable Elimination. Inference in a simple chain X_1 -> X_2: computing P(X_2) = Σ_{x_1} P(x_1) P(X_2 | x_1). All the numbers needed for this computation are in the CPDs of the original Bayesian network. O(|Val(X_1)| · |Val(X_2)|) operations.
21
Exact Inference: Variable Elimination. Inference in a simple chain X_1 -> X_2 -> X_3: computing P(X_2) as above, then P(X_3) = Σ_{x_2} P(x_2) P(X_3 | x_2). P(X_3 | X_2) is a given CPD and P(X_2) was computed above. O(|Val(X_1)||Val(X_2)| + |Val(X_2)||Val(X_3)|) operations.
22
Exact Inference: Variable Elimination. Inference in a general chain X_1 -> X_2 -> ... -> X_n: compute P(X_n) by computing each P(X_{i+1}) from P(X_i); k^2 operations per step (assuming |Val(X_i)| = k), so O(nk^2) operations for the whole inference. Compare to the k^n operations required for summing over all entries of the joint distribution over X_1,...,X_n. Inference in a general chain can be done in linear time!
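A small sketch of this forward pass, under the assumption that each CPD is stored as a k x k transition matrix; the chain and the numbers below are illustrative, not taken from the slides.

```python
# Forward inference in a chain X1 -> X2 -> ... -> Xn: O(n k^2) instead of k^n.
import numpy as np

def chain_marginals(p_x1, cpds):
    """p_x1: length-k prior; cpds: list of k x k matrices, cpds[i][x_i, x_{i+1}]."""
    marginals = [np.asarray(p_x1, dtype=float)]
    for cpd in cpds:
        # P(X_{i+1}) = sum_{x_i} P(x_i) P(X_{i+1} | x_i)   -- k^2 operations
        marginals.append(marginals[-1] @ np.asarray(cpd, dtype=float))
    return marginals

# Tiny illustrative binary chain of length 3
p_x1 = [0.6, 0.4]
cpds = [np.array([[0.9, 0.1], [0.3, 0.7]]),
        np.array([[0.5, 0.5], [0.2, 0.8]])]
for i, m in enumerate(chain_marginals(p_x1, cpds), start=1):
    print(f"P(X{i}) =", m)
```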
23
Exact Inference: Variable Elimination. Chain X_1 -> X_2 -> X_3 -> X_4: pushing summations inward = dynamic programming.
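The pushed-in form of the summation for this four-variable chain is presumably the usual one:

P(X_4) = Σ_{x_3} P(X_4 | x_3) Σ_{x_2} P(x_3 | x_2) Σ_{x_1} P(x_2 | x_1) P(x_1)

where each inner sum is computed once and reused, rather than recomputed for every joint assignment.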
24
Inference With a Loop: computing P(X_4) in a four-variable network over X_1, X_2, X_3, X_4 that contains a loop.
25
Efficient Inference in Bayesnets Properties that allow us to avoid exponential blowup in the joint distribution Bayesian network structure – some subexpressions depend on a small number of variables Computing these subexpressions and caching the results avoids generating them exponentially many times
26
Variable Elimination. The inference algorithm is defined in terms of factors. A factor is a function from value assignments of a set of random variables D to the nonnegative real numbers (ℝ⁺); the set of variables D is the scope of the factor. Factors generalize the notion of CPDs, so the algorithm we describe applies both to Bayesian networks and to Markov networks.
27
Variable Elimination: Factors. Let X, Y, Z be three disjoint sets of RVs, and let φ_1(X, Y) and φ_2(Y, Z) be two factors. We define the factor product φ_1 × φ_2 to be the factor ψ: Val(X, Y, Z) -> ℝ⁺ with ψ(X, Y, Z) = φ_1(X, Y) · φ_2(Y, Z).
28
Variable Elimination: Factors. Let X be a set of RVs, Y ∉ X a RV, and φ(X, Y) a factor. We define the factor marginalization of Y in φ to be the factor ψ: Val(X) -> ℝ⁺ with ψ(X) = Σ_Y φ(X, Y). This is also called summing out. In a Bayesian network, summing out all variables gives 1; in a Markov network, summing out all variables gives the partition function.
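A minimal sketch of these two factor operations, with a factor represented as a pair (scope, table); the representation and the variable names are illustrative, not a prescribed implementation.

```python
# Factor product and factor marginalization on a (scope, table) representation.
from itertools import product

def factor_product(f1, f2, domains):
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for assignment in product(*(domains[v] for v in out_vars)):
        a = dict(zip(out_vars, assignment))
        v1 = t1[tuple(a[v] for v in vars1)]
        v2 = t2[tuple(a[v] for v in vars2)]
        table[assignment] = v1 * v2          # psi(X,Y,Z) = phi1(X,Y) * phi2(Y,Z)
    return out_vars, table

def sum_out(f, var, domains):
    vars_, t = f
    idx = vars_.index(var)
    out_vars = vars_[:idx] + vars_[idx + 1:]
    table = {}
    for assignment, value in t.items():
        key = assignment[:idx] + assignment[idx + 1:]
        table[key] = table.get(key, 0.0) + value   # psi(X) = sum_Y phi(X,Y)
    return out_vars, table

# Example: multiply P(A) and P(B|A), then sum out A, recovering P(B)
domains = {"A": ["a0", "a1"], "B": ["b0", "b1"]}
phi_A = (["A"], {("a0",): 0.4, ("a1",): 0.6})
phi_B = (["A", "B"], {("a0", "b0"): 0.1, ("a0", "b1"): 0.9,
                      ("a1", "b0"): 0.5, ("a1", "b1"): 0.5})
joint = factor_product(phi_A, phi_B, domains)
print(sum_out(joint, "A", domains))   # (['B'], {('b0',): 0.34, ('b1',): 0.66})
```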
29
Variable Elimination: Factors. Factor products are commutative: φ_1 × φ_2 = φ_2 × φ_1, and sums commute: Σ_X Σ_Y φ(X, Y) = Σ_Y Σ_X φ(X, Y). Products are associative: (φ_1 × φ_2) × φ_3 = φ_1 × (φ_2 × φ_3). If X ∉ Scope[φ_1] (we used this in the chain elimination above): Σ_X (φ_1 × φ_2) = φ_1 × Σ_X φ_2.
30
Inference in a Chain by Factors. Chain X_1 -> X_2 -> X_3 -> X_4, with one factor φ_{X_i} per CPD: the scopes of φ_{X_3} and φ_{X_4} do not contain X_1, and the scope of φ_{X_4} does not contain X_2, so the corresponding summations can be pushed past those factors.
31
Sum-Product Inference. Let Y be the query RVs and Z be all other RVs. The general inference task is computing Σ_Z Π_{φ ∈ F} φ. Effective computation: since factor scopes are limited, "push in" some of the summations and perform them over the product of only a subset of the factors.
32
Sum-Product Variable Elimination Algorithm Sum out the variables one at a time When summing out a variable multiply all the factors that mention the variable, generating a product factor Sum out the variable from the combined factor, generating a new factor without the variable
33
Sum-Product Variable Elimination. Theorem: let X be a set of RVs, let F be a set of factors such that Scope[φ] ⊆ X for each φ ∈ F, let Y ⊆ X be a set of query RVs, and let Z = X - Y. For any ordering over Z, the above algorithm returns a factor φ*(Y) such that φ*(Y) = Σ_Z Π_{φ ∈ F} φ. Instantiation for a Bayesian network query P_G(Y): F consists of all CPDs in G, each φ_{X_i} = P(X_i | Pa(X_i)), and variable elimination is applied to U - Y.
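A sketch of the resulting algorithm, reusing the factor_product and sum_out helpers from the factor sketch above; the three-variable chain used as the example is illustrative.

```python
# Sum-product variable elimination; relies on factor_product and sum_out above.
def variable_elimination(factors, elim_order, domains):
    """factors: list of (vars, table); sums out every variable in elim_order."""
    factors = list(factors)
    for z in elim_order:
        involved = [f for f in factors if z in f[0]]
        rest = [f for f in factors if z not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = factor_product(prod, f, domains)   # multiply all factors mentioning z
        rest.append(sum_out(prod, z, domains))        # sum z out of the combined factor
        factors = rest
    result = factors[0]                               # combine what remains
    for f in factors[1:]:
        result = factor_product(result, f, domains)
    return result

# Bayesian network query P(C) in the chain A -> B -> C: eliminate U - {C} = {A, B}
domains = {"A": ["a0", "a1"], "B": ["b0", "b1"], "C": ["c0", "c1"]}
cpds = [
    (["A"], {("a0",): 0.4, ("a1",): 0.6}),
    (["A", "B"], {("a0", "b0"): 0.1, ("a0", "b1"): 0.9,
                  ("a1", "b0"): 0.5, ("a1", "b1"): 0.5}),
    (["B", "C"], {("b0", "c0"): 0.7, ("b0", "c1"): 0.3,
                  ("b1", "c0"): 0.2, ("b1", "c1"): 0.8}),
]
print(variable_elimination(cpds, ["A", "B"], domains))
# (['C'], {('c0',): 0.37, ('c1',): 0.63})  -- the marginal P(C)
```

Any elimination ordering gives the same answer; the ordering only affects the size of the intermediate factors, as the next slides illustrate.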
34
A More Complex Network (the student network: C -> D; D, I -> G; I -> S; G -> L; L, S -> J; G, J -> H). Goal: P(J). Eliminate: C, D, I, H, G, S, L.
35
A More Complex Network. Goal: P(J). Eliminate: C, D, I, H, G, S, L. Compute (eliminating C): τ_1(D) = Σ_C P(C) P(D | C).
36
A More Complex Network. Goal: P(J). Eliminate: D, I, H, G, S, L. Compute (eliminating D): τ_2(G, I) = Σ_D P(G | I, D) τ_1(D).
37
A More Complex Network. Goal: P(J). Eliminate: I, H, G, S, L. Compute (eliminating I): τ_3(G, S) = Σ_I P(I) P(S | I) τ_2(G, I).
38
A More Complex Network. Goal: P(J). Eliminate: H, G, S, L. Compute (eliminating H): τ_4(G, J) = Σ_H P(H | G, J).
39
A More Complex Network. Goal: P(J). Eliminate: G, S, L. Compute (eliminating G): τ_5(J, L, S) = Σ_G P(L | G) τ_3(G, S) τ_4(G, J).
40
A More Complex Network. Goal: P(J). Eliminate: S, L. Compute (eliminating S): τ_6(J, L) = Σ_S P(J | L, S) τ_5(J, L, S).
41
A More Complex Network. Goal: P(J). Eliminate: L. Compute (eliminating L): τ_7(J) = Σ_L τ_6(J, L).
42
A More Complex Network. Goal: P(J). Eliminate: G, I, S, L, H, C, D. Note: with this ordering the first elimination step already produces a large intermediate factor, f_1(I, D, L, J, H).
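A small scope-only simulation makes the contrast between the two orderings concrete; the helper below is illustrative and tracks only factor scopes, not their values.

```python
# Compare the largest intermediate factor scope for two elimination orderings
# on the student network (C->D, {D,I}->G, I->S, G->L, {L,S}->J, {G,J}->H).
def largest_factor(scopes, order):
    scopes = [set(s) for s in scopes]
    worst = 0
    for var in order:
        touched = [s for s in scopes if var in s]
        merged = set().union(*touched)               # scope of the product factor
        worst = max(worst, len(merged))
        scopes = [s for s in scopes if var not in s] + [merged - {var}]
    return worst

# One scope per CPD: {X} union Parents(X)
scopes = ["C", "CD", "I", "DIG", "IS", "GL", "LSJ", "GJH"]

print(largest_factor(scopes, "CDIHGSL"))   # 4 variables -> 2^4 entries if binary
print(largest_factor(scopes, "GISLHCD"))   # 6 variables -> 2^6 entries if binary
```

With binary variables, the first ordering never builds a factor over more than 4 variables (16 entries), while the ordering that eliminates G first builds one over 6 variables (64 entries).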
43
Inference With Evidence. Let Y be the query RVs, let E be the evidence RVs and e their assignment, and let Z be all other RVs (Z = U - Y - E). The general inference task is P(Y | E=e) ∝ Σ_Z Π_{φ ∈ F} φ[E=e], i.e., sum out Z from the product of the factors reduced to the evidence, then renormalize.
44
Inference With Evidence. Goal: P(J | H=h, I=i). Eliminate: C, D, G, S, L. This computes the unnormalized factor f(J, H=h, I=i) = Σ_{C,D,G,S,L} Π_φ φ[H=h, I=i]; normalizing over J gives the query.
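One way to realize this in the factor sketches used earlier is to zero out factor entries inconsistent with the evidence before running elimination, and renormalize at the end; the helper names and the commented usage are illustrative.

```python
# Evidence handling for the (scope, table) factor representation used above.
def reduce_factor(f, evidence):
    """Zero out rows inconsistent with the evidence (a dict var -> value)."""
    vars_, table = f
    def consistent(assignment):
        return all(assignment[vars_.index(e)] == v
                   for e, v in evidence.items() if e in vars_)
    return vars_, {a: (val if consistent(a) else 0.0) for a, val in table.items()}

def normalize(f):
    vars_, table = f
    total = sum(table.values())        # equals P(evidence) when all non-query vars are summed out
    return vars_, {a: v / total for a, v in table.items()}

# Sketch of the slide's query P(J | H=h, I=i), assuming hypothetical
# student_cpds and domains built as in the earlier examples:
# reduced = [reduce_factor(f, {"H": "h1", "I": "i1"}) for f in student_cpds]
# answer  = normalize(variable_elimination(reduced, ["C", "D", "G", "S", "L"], domains))
```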
45
Complexity of Variable Elimination. Variable elimination consists of generating the factors that will be summed out, and summing out. Generating the factor f_i = φ_1 × ... × φ_{k_i}: let X_i be the scope of f_i; each entry requires k_i multiplications, so generating f_i is O(k_i |Val(X_i)|). Summing out: addition operations, at most |Val(X_i)|. Per factor: O(kN), where N = max_i |Val(X_i)| and k = max_i k_i.
46
Complexity of Variable Elimination. We start with n factors (n = number of variables) and generate exactly one factor at each iteration, so there are at most 2n factors. Generating factors: at most Σ_i |Val(X_i)| k_i ≤ N Σ_i k_i ≤ 2nN (since each factor is multiplied in exactly once and there are at most 2n factors). Summing out: Σ_i |Val(X_i)| ≤ nN (since we perform n summing-out steps). Total work is therefore linear in N and n. The exponential blowup can hide in N_i = |Val(X_i)|, which for factor i can be v^m if factor i has m variables with v values each.
47
VE as Graph Transformation. At each step we are computing a product of factors and summing out one variable. Plot a graph with an undirected edge X - Y whenever variables X and Y appear in the same factor. Note: this is the Markov network of the distribution over the variables that have not been eliminated yet.
48
VE as Graph Transformation Goal: P(J) Eliminate: C,D,I,H,G,S,L C D I SG L J H D
49
VE as Graph Transformation Goal: P(J) Eliminate: C,D,I,H,G,S,L Compute: C D I SG L J H D
50
VE as Graph Transformation Goal: P(J) Eliminate: D,I,H,G,S,L Compute: C D I SG L J H
51
VE as Graph Transformation Goal: P(J) Eliminate: I,H,G,S,L Compute: C D I SG L J H
52
VE as Graph Transformation Goal: P(J) Eliminate: H,G,S,L Compute: C D I SG L J H
53
VE as Graph Transformation Goal: P(J) Eliminate: G,S,L Compute: C D I SG L J H
54
VE as Graph Transformation Goal: P(J) Eliminate: S,L Compute: C D I SG L J H
55
VE as Graph Transformation Goal: P(J) Eliminate: L Compute: C D I SG L J H
56
The Induced Graph. The induced graph I_{F,α} over factors F and ordering α: an undirected graph in which X_i and X_j are connected if they appear together in some factor generated during the VE algorithm run with α as the ordering. [Figures: the original student-network graph and its induced graph over C, D, I, S, G, L, J, H.]
57
The Induced Graph. The induced graph I_{F,α} over factors F and ordering α: an undirected graph in which X_i and X_j are connected if they appear together in some factor generated during the VE algorithm run with α as the ordering. The width of an induced graph is the number of nodes in its largest clique minus 1. The minimal induced width of a graph K is min_α width(I_{K,α}). The minimal induced width provides a lower bound on the best performance achievable by applying VE to a model that factorizes over K.
58
The Induced Graph. Finding the optimal ordering is NP-hard. Theorem: for a graph H, determining whether there is an elimination ordering achieving induced width ≤ K is NP-complete. Note: this NP-hardness result is distinct from the NP-hardness of inference; even given the best elimination ordering, inference may still be exponential if the induced width is large. Hopeless? No: heuristic techniques can find good elimination orderings.
59
Finding Elimination Orderings. Reduce to finding a triangulation with small cliques. This is valid because of the following. Theorem: every induced graph is chordal. Proof: assume by contradiction that the induced graph has a chordless cycle X_1 - X_2 - ... - X_k - X_1 for k ≥ 4. Let X_i be the first node of the cycle to be eliminated. When X_i is eliminated, no further edges are ever added to it, so the edges X_{i-1} - X_i and X_i - X_{i+1} already exist at that point; eliminating X_i therefore adds the edge X_{i-1} - X_{i+1}, contradicting the assumption that the cycle is chordless. Theorem: every chordal graph admits an elimination ordering that introduces no new fill edges. So we can use graph-theoretic triangulation algorithms, or a greedy search with a heuristic cost function: at each point, eliminate the node with the smallest cost, where possible costs include the number of neighbors in the current graph, the number of neighbors of neighbors, or the number of fill edges introduced.
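A minimal sketch of the greedy min-fill heuristic mentioned in the last bullet, applied to the moralized student network from the earlier slides; the adjacency list below encodes that moralized graph.

```python
# Greedy min-fill: repeatedly eliminate the node whose elimination adds the
# fewest fill edges, connecting its remaining neighbors into a clique.
from itertools import combinations

def min_fill_ordering(adj):
    """adj: dict node -> set of neighbors (the moralized, undirected graph)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    order = []
    while adj:
        def fill_cost(v):
            # fill edges that eliminating v would introduce among its neighbors
            return sum(1 for a, b in combinations(adj[v], 2) if b not in adj[a])
        v = min(adj, key=fill_cost)
        for a, b in combinations(adj[v], 2):   # add the fill edges
            adj[a].add(b)
            adj[b].add(a)
        for n in adj[v]:                       # remove v from the graph
            adj[n].discard(v)
        del adj[v]
        order.append(v)
    return order

# Moralized student network (moralization adds D-I, L-S, and G-J)
adj = {"C": {"D"}, "D": {"C", "G", "I"}, "I": {"D", "G", "S"},
       "G": {"D", "I", "L", "J", "H"}, "S": {"I", "L", "J"},
       "L": {"G", "S", "J"}, "J": {"L", "S", "G", "H"}, "H": {"G", "J"}}
print(min_fill_ordering(adj))   # a low-fill ordering; exact output depends on tie-breaking
```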
60
Elimination On Trees. Tree Bayesian network: each variable has at most one parent, so all factors involve at most two variables. Elimination: repeatedly eliminate leaf variables; this maintains the tree structure, and the induced width is 1. [Figure: successive leaf eliminations in a tree over D, I, S, G, L, H.]
61
Elimination on PolyTrees. Polytree Bayesian network: at most one (undirected) path between any two variables. Theorem: inference in a polytree is linear in the size of the network representation.
62
Inference By Conditioning. General idea: enumerate the possible values of a variable, apply variable elimination in a simplified network for each value, and aggregate the results. [Figure: in the student network, observing I gives Ind(G; S | I); the CPDs of G and S are transformed to eliminate I as a parent.]
63
Inference By Conditioning. Compute P_G(J) by conditioning on I: P_G(J) = Σ_i P_G(I=i) P(J | I=i), where each P(J | I=i) is computed by VE in the simplified network with I instantiated. How do we compute P_G(I)? By inference in G (simple here), or by restricting factors in the undirected view, as on the next slide.
64
Inference By Conditioning. Compute P_G(J) as above. How do we compute P_G(I=i)? Inference in G (simple here), or: restrict the factors of the undirected (moralized) representation to I=i, with one factor per CPD; the partition function of the restricted factor set equals P(I=i), and it is computed by inference.
65
Cutset Conditioning. Select a subset of nodes X ⊆ U. Define the conditional Bayesian network G_{X=x}: it has the same variables and the same structure as G, except that all outgoing edges of nodes in X are deleted, and the CPDs of nodes whose incoming edges were deleted are updated by instantiating those parents to their values in x. X is a cutset in G if G_{X=x} is a polytree. The original P(Y) query is computed by summing the per-assignment results over x (see below), so the cost is exponential in the size of the cutset.
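The summation referred to above is presumably the standard cutset-conditioning identity:

P(Y) = Σ_{x ∈ Val(X)} P(Y, X=x)

where each term P(Y, X=x) is computed by (polytree) variable elimination in G_{X=x}; the number of terms is |Val(X)|, hence the exponential dependence on the cutset size.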
66
Cutset Conditioning Examples (student network over C, S, J, D, G, L, I, H): the original network; conditioning on I, which is not a cutset (a loop remains); conditioning on G, which is a cutset (the result is a polytree).
67
Inference with Structured CPDs Idea: structured CPDs have additional structure which can be exploited for more efficient inference
68
Independence of Causal Influence. Causes: X_1,...,X_n; effect: Y. General case: Y has a complex dependency on X_1,...,X_n. Common case: each X_i influences Y separately, and the influences of X_1,...,X_n are combined into an overall influence on Y.
69
Example 1: Noisy OR. Two independent causes X_1, X_2; Y=y^1 cannot happen unless at least one of X_1, X_2 occurs, and the probability that Y fails to fire factorizes over the causes, e.g. P(Y=y^0 | x_1^1, x_2^1) = P(Y=y^0 | x_1^1, x_2^0) · P(Y=y^0 | x_1^0, x_2^1) = 0.1 × 0.2 = 0.02. CPD P(Y | X_1, X_2):
x_1^0, x_2^0: P(y^0) = 1, P(y^1) = 0
x_1^0, x_2^1: P(y^0) = 0.2, P(y^1) = 0.8
x_1^1, x_2^0: P(y^0) = 0.1, P(y^1) = 0.9
x_1^1, x_2^1: P(y^0) = 0.02, P(y^1) = 0.98
70
Noisy OR: Elaborate Representation. Introduce auxiliary variables X'_1, X'_2 with X_i -> X'_i -> Y, where Y is a deterministic OR of X'_1, X'_2 (P(Y=y^1 | X'_1, X'_2) = 1 unless X'_1 = X'_2 = 0). Noisy CPD 1, P(X'_1 | X_1), with noise parameter λ_{X_1} = 0.9: P(X'_1=1 | x_1^0) = 0, P(X'_1=1 | x_1^1) = 0.9. Noisy CPD 2, P(X'_2 | X_2), with noise parameter λ_{X_2} = 0.8: P(X'_2=1 | x_2^0) = 0, P(X'_2=1 | x_2^1) = 0.8.
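A minimal sketch that builds the noisy-OR CPD directly from the noise parameters (no leak term assumed) and reproduces the table from the previous slide.

```python
# Build a noisy-OR CPD P(Y | X_1,...,X_n) from per-cause noise parameters.
from itertools import product

def noisy_or_cpd(lambdas):
    """Return a dict mapping parent assignments (tuples of 0/1) to (P(y0), P(y1))."""
    cpd = {}
    for xs in product([0, 1], repeat=len(lambdas)):
        p_y0 = 1.0
        for x, lam in zip(xs, lambdas):
            if x == 1:
                p_y0 *= (1.0 - lam)   # each active cause independently fails to trigger Y
        cpd[xs] = (p_y0, 1.0 - p_y0)
    return cpd

print(noisy_or_cpd([0.9, 0.8]))
# Matches the slide's table (up to floating point):
# {(0,0): (1.0, 0.0), (0,1): (0.2, 0.8), (1,0): (0.1, 0.9), (1,1): (0.02, 0.98)}
```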
71
Noisy OR Decomposition Y X1X1 X2X2 X3X3 X4X4 Goal: Compute P(Y) Naïve approach 4 multiplications – P(X 1 ) x P(X 2 ) 8 multiplications – P(X 1,X 2 ) x P(X 3 ) 16 multiplications – P(X 1,X 2,X 3 ) x P(X 4 ) 32 multiplications – P(X 1,X 2,X 3,X 4 ) x P(Y|X 1,X 2,X 3,X 4 ) 30 additions to extract P(Y) from P(Y,X1,X2,X3,X4) 60 multiplications, 30 additions
72
Noisy OR Decomposition. Goal: compute P(Y). [Figures: the original network X_1,...,X_4 -> Y; the elaborated network with noise variables Z_1,...,Z_4; and decomposed networks in which intermediate OR variables O_1, O_2 combine pairs of causes before reaching Y.]
73
Noisy OR Decomposition Goal: Compute P(Y) Y X1X1 X2X2 X3X3 X4X4 O1O1 O2O2 8 multiplications – P(X 1 ) x P(O|X 1,X 2 ) 4 additions to sum out X 1 4 multiplications for f(O,X 2 ) x P(X 2 ) 2 additions to sum out X 2 Similar cost for eliminating X 3,X 4 and then subsequently for O 1 and O 2 Total 3 * (8+4) = 36 multiplications and 3 * (4+2) = 18 additions
74
Noisy OR Decomposition Goal: Compute P(Y) Y X1X1 X2X2 X3X3 X4X4 O1O1 O2O2 4 multiplications and 2 additions to eliminate each X i 8 multiplications and 4 additions to eliminate each O i 4 multiplications for f(O,X 2 ) x P(X 2 ) Total cost is 4*4 + 3*8 = 40 multiplications and 4*2 + 3*4 = 20 additions O1O1 O2O2 O3O3 Y X1X1 X2X2 X3X3 X4X4
75
General Formulation. Let Y be a random variable with parents X_1,...,X_n. The CPD P(Y | X_1,...,X_n) exhibits independence of causal influence (ICI) if it can be described by a network in which each X_i has an individual child Z_i, a variable Z depends deterministically on Z_1,...,Z_n, and Y depends only on Z. Noisy OR: each Z_i has a noise model, Z is an OR of the Z_i, and Y is the identity CPD of Z. Logistic: Z_i = w_i · 1{X_i = 1}, Z = Σ_i Z_i, and Y follows a logistic (sigmoid) CPD of Z.
76
General Decomposition Independence of causal influence Network with variable Y with parents X 1,...X n Decompose Y by introducing n-1 intermediate variables O 1,...O n-1 Variable Y and each of the O i ’s has exactly two parents in Z 1,...Z n,O 1,...O n-1 The CPD of Y and of O i is deterministic on its two parents Each Z i and each O i is a parent of at most one variable in O 1,...O n-1 and Y
77
Context Specific Independence Idea: exploit structure in tree CPD or rule CPD Approach 1: Decompose the CPD in a modified network structure Approach 2: modify the variable elimination algorithm to perform operations on structured factors
78
Context Specific Independence. [Figure: a network with parents X_1,...,X_4 and A of Y, where the CPD of Y is a tree; the decomposition introduces Y_A^1 and Y_A^2, with leaf distributions such as (0.4, 0.6), (0.7, 0.3), (0.9, 0.1), (0.2, 0.8), (0.3, 0.7).] A "selects" Y_A^1 or Y_A^2.
79
General Decomposition. Let Y be a variable, let A be one parent of Y, and let X be the remaining parents. For each a ∈ Val(A), define a new variable Y_a. The parents of Y_a are those variables X ∈ X such that the edge from X to Y is not spurious in the context A=a. The CPD of Y_a is P(Y_a | Pa(Y_a)) = P(Y | a, Pa(Y_a)). Y is a deterministic multiplexer CPD, with A as the selector.
80
Tree CPD Decomposition. [Figure: a tree CPD for Y over parents A, B, C, D, branching on a^0/a^1, b^0/b^1, c^0/c^1, d^0/d^1.] Step 1: add A as a selector, introducing Y_{a^0} and Y_{a^1}.
81
Tree CPD Decomposition (cont.). Step 2: add B as a selector within the a^1 branch, introducing Y_{a^1 b^1} and Y_{a^1 b^0}.
82
Tree CPD Decomposition (cont.). Step 3: add C as a selector within the a^1 b^0 branch, introducing Y_{a^1 b^0 c^0} and Y_{a^1 b^0 c^1}.
83
MPE and MAP Queries. Conditional probability queries: evidence E=e, query a subset Y, task: compute P(Y | E=e); solved by sum-product. Most Probable Explanation (MPE): evidence E=e, query all other variables Y = U - E, task: compute MPE(Y|E=e) = argmax_y P(Y=y | E=e) (there may be more than one solution); solved by max-product. Maximum A Posteriori Assignment (MAP): evidence E=e, query a subset of variables Y, task: compute MAP(Y|E=e) = argmax_y P(Y=y | E=e); solved by max-sum-product.