Inference I: Introduction, Hardness, and Variable Elimination
Slides by Nir Friedman
- In previous lessons we introduced compact representations of probability distributions:
  - Bayesian networks
  - Markov networks
- A network describes a unique probability distribution P. How do we answer queries about P?
- We use inference as a name for the process of computing answers to such queries.
Queries: Likelihood
- There are many types of queries we might ask; most of them involve evidence.
- Evidence e is an assignment of values to a set E of variables in the domain. Without loss of generality, E = {X_{k+1}, …, X_n}.
- Simplest query: compute the probability of the evidence,
  P(e) = Σ_{x_1,…,x_k} P(x_1, …, x_k, e)
- This is often referred to as computing the likelihood of the evidence.
Queries: A posteriori belief
- Often we are interested in the conditional probability of a variable given the evidence:
  P(X | e)
  This is the a posteriori belief in X, given evidence e.
- A related task is computing the term P(X, e), i.e., the likelihood of e and X = x for each value x of X.
  - We can recover the a posteriori belief by normalizing:
    P(X = x | e) = P(X = x, e) / Σ_{x'} P(X = x', e)
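A minimal sketch of this normalization step (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical unnormalized values P(X = x, e) for a 3-valued X.
joint = np.array([0.03, 0.12, 0.05])

likelihood = joint.sum()         # P(e)
posterior = joint / likelihood   # the a posteriori belief P(X | e)
print(posterior)                 # [0.15 0.6  0.25]
```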
A posteriori belief
This query is useful in many cases:
- Prediction: what is the probability of an outcome given the starting condition?
  - The target is a descendant of the evidence.
- Diagnosis: what is the probability of a disease/fault given the symptoms?
  - The target is an ancestor of the evidence.
- As we shall see, the direction of the edges between variables does not restrict the direction of the queries.
  - Probabilistic inference can combine evidence from all parts of the network.
Queries: A posteriori joint
- In this query we are interested in the conditional probability of several variables, given the evidence: P(X, Y, … | e).
- Note that the size of the answer to this query is exponential in the number of variables in the joint.
Queries: MAP
- In this query we want to find the maximum a posteriori (MAP) assignment for some variables of interest (say X_1, …, X_l).
- That is, find x_1, …, x_l maximizing the probability P(x_1, …, x_l | e).
- Note that this is equivalent to maximizing P(x_1, …, x_l, e), since P(e) does not depend on the assignment.
Queries: MAP
We can use MAP for:
- Classification: find the most likely label, given the evidence.
- Explanation: what is the most likely scenario, given the evidence?
Queries: MAP
Cautionary note:
- The MAP depends on the set of variables we maximize over.
- Example: the MAP of X alone can be 1 while the MAP of (X, Y) is (0, 0).
Complexity of Inference
Thm: Computing P(X = x) in a Bayesian network is NP-hard.
This is not surprising, since a Bayesian network can simulate Boolean gates.
Proof
We reduce 3-SAT to Bayesian network computation.
Assume we are given a 3-SAT problem:
- q_1, …, q_n are propositions,
- φ_1, …, φ_k are clauses, such that φ_i = ℓ_{i1} ∨ ℓ_{i2} ∨ ℓ_{i3}, where each ℓ_{ij} is a literal over q_1, …, q_n,
- Φ = φ_1 ∧ … ∧ φ_k.
We will construct a network such that P(X = t) > 0 iff Φ is satisfiable.
The network has root variables Q_1, …, Q_n, one node φ_i per clause, and gate variables A_1, A_2, … feeding a final output X:
- P(Q_i = true) = 0.5 for every proposition.
- P(φ_i = true | Q_{i1}, Q_{i2}, Q_{i3}) = 1 iff the values of Q_{i1}, Q_{i2}, Q_{i3} satisfy the clause φ_i.
- A_1, A_2, … are simple binary AND gates, so X = true iff all clauses are satisfied.
[Figure: roots Q_1, …, Q_n; clause nodes φ_1, …, φ_k; gates A_1, …, A_{k-1}; output X]
- It is easy to check that:
  - the network has a polynomial number of variables;
  - each CPD can be described by a small table (at most 8 parameters);
  - P(X = true) > 0 if and only if there exists a satisfying assignment to Q_1, …, Q_n.
- Conclusion: this is a polynomial reduction of 3-SAT.
Note: this construction also shows that computing P(X = t) is harder than NP:
  2^n · P(X = t) is the number of satisfying assignments of Φ.
Thus, the problem is #P-hard (in fact, it is #P-complete).
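The counting identity is easy to check by brute force. The sketch below uses our own encoding (clauses as triples of signed indices; the example formula is invented) and computes P(X = true) for the construction:

```python
from itertools import product

# Clause (1, -2, 3) stands for (q1 OR NOT q2 OR q3).
clauses = [(1, -2, 3), (-1, 2, 3), (1, 2, -3)]
n = 3  # propositions q_1 .. q_n

def p_x_true(clauses, n):
    """P(X = true) in the reduction: each Q_i is a fair coin, clause
    nodes are deterministic, and X is the AND of all clause nodes."""
    satisfying = 0
    for assignment in product([False, True], repeat=n):
        # A clause is true iff one of its literals is satisfied.
        if all(any(assignment[abs(l) - 1] == (l > 0) for l in clause)
               for clause in clauses):
            satisfying += 1
    return satisfying / 2 ** n

p = p_x_true(clauses, n)
print(p > 0)       # True iff the formula is satisfiable
print(2 ** n * p)  # the number of satisfying assignments
```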
Hardness - Notes
- We used deterministic relations in our construction.
- The same construction works if we use probabilities (1 − ε, ε) instead of (1, 0) in each gate, for any ε < 0.5.
- Hardness does not mean we cannot solve inference:
  - It implies that we cannot find a general procedure that works efficiently for all networks.
  - For particular families of networks, we can have provably efficient procedures.
  - We will characterize such families in the next two classes.
Inference in Simple Chains
[Chain: X_1 → X_2]
How do we compute P(X_2)?
  P(x_2) = Σ_{x_1} P(x_1, x_2) = Σ_{x_1} P(x_1) P(x_2 | x_1)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → X_3]
How do we compute P(X_3)? We already know how to compute P(X_2)...
  P(x_3) = Σ_{x_2} P(x_2) P(x_3 | x_2)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → … → X_n]
How do we compute P(X_n)? Compute P(X_1), P(X_2), P(X_3), … in order:
  P(x_{i+1}) = Σ_{x_i} P(x_i) P(x_{i+1} | x_i)
- We compute each term by using the previous one.
- Complexity: each step costs O(|Val(X_i)| · |Val(X_{i+1})|) operations.
- Compare to naive evaluation, which requires summing over the joint values of n − 1 variables.
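A sketch of this forward pass, assuming each CPT is stored as a matrix with entry [x_i, x_{i+1}] = P(x_{i+1} | x_i); all numbers are invented:

```python
import numpy as np

def chain_marginal(prior, transitions):
    """Forward pass on a chain X_1 -> ... -> X_n: returns P(X_n)."""
    belief = prior
    for cpt in transitions:
        # P(x_{i+1}) = sum_{x_i} P(x_i) P(x_{i+1} | x_i)
        belief = belief @ cpt
    return belief

p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.9, 0.1],
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.7, 0.3],
                          [0.5, 0.5]])
print(chain_marginal(p_x1, [p_x2_given_x1, p_x3_given_x2]))
```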
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2]
Suppose that we observe the value X_2 = x_2. How do we compute P(X_1 | x_2)?
Recall that it suffices to compute P(X_1, x_2):
  P(X_1, x_2) = P(X_1) P(x_2 | X_1),   P(X_1 | x_2) = P(X_1, x_2) / P(x_2)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → X_3]
Suppose that we observe the value X_3 = x_3.
How do we compute P(X_1, x_3)?
  P(X_1, x_3) = P(X_1) P(x_3 | X_1)
How do we compute P(x_3 | x_1)?
  P(x_3 | x_1) = Σ_{x_2} P(x_2 | x_1) P(x_3 | x_2)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → … → X_n]
Suppose that we observe the value X_n = x_n. How do we compute P(X_1, x_n)?
- We compute P(x_n | x_{n-1}), P(x_n | x_{n-2}), … iteratively, going backward:
  P(x_n | x_{i-1}) = Σ_{x_i} P(x_i | x_{i-1}) P(x_n | x_i)
- Then P(X_1, x_n) = P(X_1) P(x_n | X_1).
Inference in Simple Chains (cont.)
[Chain: X_1 → … → X_k → … → X_n]
Suppose that we observe the value X_n = x_n. We want to find P(X_k | x_n).
How do we compute P(X_k, x_n)?
  P(X_k, x_n) = P(X_k) P(x_n | X_k)
- We compute P(X_k) by forward iterations.
- We compute P(x_n | X_k) by backward iterations.
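Combining the two passes answers the conditional query. A sketch in the same matrix representation as above (the function and argument names are ours):

```python
import numpy as np

def chain_posterior(prior, transitions, k, x_n):
    """P(X_k | X_n = x_n) on a chain, via a forward and a backward pass.
    k is 1-based; transitions is the list of CPTs along the chain,
    as in chain_marginal."""
    # Forward pass: P(X_k).
    forward = prior
    for cpt in transitions[:k - 1]:
        forward = forward @ cpt
    # Backward pass: P(x_n | X_k), starting from an indicator on x_n.
    backward = np.zeros(transitions[-1].shape[1])
    backward[x_n] = 1.0
    for cpt in reversed(transitions[k - 1:]):
        # P(x_n | x_{i-1}) = sum_{x_i} P(x_i | x_{i-1}) P(x_n | x_i)
        backward = cpt @ backward
    joint = forward * backward   # P(X_k, x_n)
    return joint / joint.sum()   # P(X_k | x_n)

# P(X_1 | X_3 = 0) for the chain from the previous sketch:
p_x1 = np.array([0.6, 0.4])
ts = [np.array([[0.9, 0.1], [0.2, 0.8]]),
      np.array([[0.7, 0.3], [0.5, 0.5]])]
print(chain_posterior(p_x1, ts, k=1, x_n=0))
```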
Elimination in Chains
- We now try to understand the simple chain example using first principles.
[Chain: A → B → C → D → E]
- Using the definition of probability, we have:
  P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
Elimination in Chains
- By the chain rule decomposition, we get:
  P(e) = Σ_d Σ_c Σ_b Σ_a P(a) P(b | a) P(c | b) P(d | c) P(e | d)
Elimination in Chains
- Rearranging terms...
  P(e) = Σ_d P(e | d) Σ_c P(d | c) Σ_b P(c | b) Σ_a P(b | a) P(a)
Elimination in Chains
- Now we can perform the innermost summation:
  P(e) = Σ_d P(e | d) Σ_c P(d | c) Σ_b P(c | b) f_a(b),   where f_a(b) = Σ_a P(b | a) P(a)
- This summation is exactly the first step in the forward iteration we described before.
Elimination in Chains
- Rearranging and then summing again, we get:
  P(e) = Σ_d P(e | d) Σ_c P(d | c) f_b(c),   where f_b(c) = Σ_b P(c | b) f_a(b)
  and so on, until only the sum over d remains.
Elimination in Chains with Evidence
- Similarly, we can understand the backward pass.
- We write the query in explicit form:
  P(a, e) = P(a) Σ_b P(b | a) Σ_c P(c | b) Σ_d P(d | c) P(e | d)
Elimination in Chains with Evidence
Eliminating d, we get
  P(a, e) = P(a) Σ_b P(b | a) Σ_c P(c | b) f_d(c),   where f_d(c) = Σ_d P(d | c) P(e | d)
Elimination in Chains with Evidence
Eliminating c, we get
  P(a, e) = P(a) Σ_b P(b | a) f_c(b),   where f_c(b) = Σ_c P(c | b) f_d(c)
Elimination in Chains with Evidence
Finally, we eliminate b:
  P(a, e) = P(a) f_b(a),   where f_b(a) = Σ_b P(b | a) f_c(b)
Variable Elimination
General idea:
- Write the query in the form
  P(X_1, e) = Σ_{x_n} ⋯ Σ_{x_3} Σ_{x_2} ∏_i P(x_i | Pa_i)
- Iteratively:
  - Move all irrelevant terms outside of the innermost sum.
  - Perform the innermost sum, getting a new term.
  - Insert the new term into the product.
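A compact sketch of the whole procedure, using a deliberately naive factor representation of our own (a factor is a tuple of variable names plus a table mapping assignments to values); a real implementation would use arrays:

```python
from itertools import product

def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their scopes."""
    fvars, ftab = f
    gvars, gtab = g
    out_vars = fvars + tuple(v for v in gvars if v not in fvars)
    out_tab = {}
    for vals in product(*(domains[v] for v in out_vars)):
        a = dict(zip(out_vars, vals))
        out_tab[vals] = (ftab[tuple(a[v] for v in fvars)]
                         * gtab[tuple(a[v] for v in gvars)])
    return out_vars, out_tab

def eliminate(factors, var, domains):
    """One variable-elimination step: multiply all factors that mention
    var, sum var out, and return the updated factor list."""
    relevant = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    prod = relevant[0]
    for f in relevant[1:]:
        prod = multiply(prod, f, domains)
    pvars, ptab = prod
    new_vars = tuple(v for v in pvars if v != var)
    new_tab = {}
    for vals, p in ptab.items():
        key = tuple(x for x, name in zip(vals, pvars) if name != var)
        new_tab[key] = new_tab.get(key, 0.0) + p
    return rest + [(new_vars, new_tab)]

# Example: sum A out of P(A) P(B | A), with invented numbers.
domains = {'A': [0, 1], 'B': [0, 1]}
p_a = (('A',), {(0,): 0.6, (1,): 0.4})
p_b_given_a = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1,
                            (1, 0): 0.2, (1, 1): 0.8})
print(eliminate([p_a, p_b_given_a], 'A', domains))  # the factor P(B)
```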
A More Complex Example
The “Asia” network:
[Figure: Visit to Asia (V) → Tuberculosis (T); Smoking (S) → Lung Cancer (L) and Bronchitis (B); T, L → Abnormality in Chest (A); A → X-Ray (X); A, B → Dyspnea (D)]
We want to compute P(d). Need to eliminate: v, s, x, t, l, a, b.
Initial factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
We want to compute P(d). Need to eliminate: v, s, x, t, l, a, b.
Eliminate v. Compute:
  f_v(t) = Σ_v P(v) P(t | v)
leaving: f_v(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Note: f_v(t) = P(t).
- In general, the result of elimination is not necessarily a probability term.
We want to compute P(d). Need to eliminate: s, x, t, l, a, b.
Eliminate s. Compute:
  f_s(b, l) = Σ_s P(s) P(l | s) P(b | s)
leaving: f_v(t) f_s(b, l) P(a | t, l) P(x | a) P(d | a, b)
- Summing over s results in a factor with two arguments, f_s(b, l).
- In general, the result of elimination may be a function of several variables.
We want to compute P(d). Need to eliminate: x, t, l, a, b.
Eliminate x. Compute:
  f_x(a) = Σ_x P(x | a)
leaving: f_v(t) f_s(b, l) f_x(a) P(a | t, l) P(d | a, b)
- Note: f_x(a) = 1 for all values of a!
We want to compute P(d). Need to eliminate: t, l, a, b.
Eliminate t. Compute:
  f_t(a, l) = Σ_t f_v(t) P(a | t, l)
leaving: f_s(b, l) f_x(a) f_t(a, l) P(d | a, b)
We want to compute P(d). Need to eliminate: l, a, b.
Eliminate l. Compute:
  f_l(a, b) = Σ_l f_s(b, l) f_t(a, l)
leaving: f_x(a) f_l(a, b) P(d | a, b)
We want to compute P(d). Need to eliminate: a, b.
Eliminate a, then b. Compute:
  f_a(b, d) = Σ_a f_x(a) f_l(a, b) P(d | a, b),   f_b(d) = Σ_b f_a(b, d)
Result: P(d) = f_b(d).
Variable Elimination
- We now understand variable elimination as a sequence of rewriting operations.
- The actual computation is done in the elimination steps.
- Exactly the same computation procedure applies to Markov networks.
- The computation depends on the order of elimination.
  - We will return to this issue in detail.
Dealing with Evidence
- How do we deal with evidence?
- Suppose we get evidence V = t, S = f, D = t.
- We want to compute P(L, V = t, S = f, D = t).
Dealing with Evidence
- We start by writing the factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Since we know that V = t, we don't need to eliminate V. Instead, we can replace the factors P(V) and P(T | V) with
  f_{P(V)} = P(V = t),   f_{P(T|V)}(T) = P(T | V = t)
- These “select” the appropriate parts of the original factors given the evidence.
- Note that f_{P(V)} is a constant, and thus does not appear in the elimination of other variables.
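A sketch of this evidence step, in the same illustrative factor representation used in the variable-elimination sketch above (a factor is a tuple of variable names plus an assignment table):

```python
def restrict(factor, var, value):
    """Restrict a factor to the evidence var = value: keep only the
    rows consistent with the evidence and drop var from the scope."""
    fvars, ftab = factor
    if var not in fvars:
        return factor
    i = fvars.index(var)
    new_vars = fvars[:i] + fvars[i + 1:]
    new_tab = {vals[:i] + vals[i + 1:]: p
               for vals, p in ftab.items() if vals[i] == value}
    return new_vars, new_tab
```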
Dealing with Evidence
Given evidence V = t, S = f, D = t, compute P(L, V = t, S = f, D = t).
- Initial factors, after setting evidence:
  f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) P(x | a) f_{P(d|a,b)}(a, b)
Dealing with Evidence (cont.)
- Eliminating x, we get
  f_x(a) = Σ_x P(x | a)
  leaving: f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) f_x(a) f_{P(d|a,b)}(a, b)
Dealing with Evidence (cont.)
- Eliminating t, we get
  f_t(a, l) = Σ_t f_{P(t|v)}(t) P(a | t, l)
  leaving: f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_{P(b|s)}(b) f_t(a, l) f_x(a) f_{P(d|a,b)}(a, b)
Dealing with Evidence (cont.)
- Eliminating a, we get
  f_a(b, l) = Σ_a f_t(a, l) f_x(a) f_{P(d|a,b)}(a, b)
  leaving: f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_{P(b|s)}(b) f_a(b, l)
Dealing with Evidence (cont.)
- Eliminating b, we get
  f_b(l) = Σ_b f_{P(b|s)}(b) f_a(b, l)
- Result: P(L, V = t, S = f, D = t) = f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_b(l)
Complexity of Variable Elimination
- Suppose that in one elimination step we compute
  f_x(y_1, …, y_k) = Σ_x ∏_{i=1…m} f_i(x, Z_i),   where each Z_i ⊆ {y_1, …, y_k}
- This requires:
  - m · |Val(X)| · ∏_{j=1…k} |Val(Y_j)| multiplications: for each value of x, y_1, …, y_k, we do m multiplications;
  - |Val(X)| · ∏_{j=1…k} |Val(Y_j)| additions: for each value of y_1, …, y_k, we do |Val(X)| additions.
- Complexity is exponential in the number of variables in the intermediate factor.
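For concreteness, a worked instance with invented sizes: all variables binary, m = 3 factors, and k = 4 neighbor variables y_1, …, y_4:

```latex
\[
\underbrace{m\,|\mathrm{Val}(X)|\prod_{j=1}^{k}|\mathrm{Val}(Y_j)|}_{\text{multiplications}}
 = 3 \cdot 2 \cdot 2^{4} = 96,
\qquad
\underbrace{|\mathrm{Val}(X)|\prod_{j=1}^{k}|\mathrm{Val}(Y_j)|}_{\text{additions}}
 = 2 \cdot 2^{4} = 32.
\]
```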
Understanding Variable Elimination
- We want to select “good” elimination orderings that reduce the complexity.
- We start by attempting to understand variable elimination via the graph we are working with.
- This reduces the problem of finding a good ordering to a well-understood graph-theoretic operation.
Undirected Graph Representation
- At each stage of the procedure, we have an algebraic term that we need to evaluate.
- In general this term is of the form
  Σ_{x_1} ⋯ Σ_{x_m} ∏_i f_i(Z_i)
  where the Z_i are sets of variables.
- We now draw a graph with an undirected edge X--Y if X and Y are arguments of some factor, that is, if X and Y appear in some Z_i.
- Note: this is the Markov network that describes the probability over the variables we have not yet eliminated.
Undirected Graph Representation
- Consider the “Asia” example. The initial factors are
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Thus, the undirected graph has the edges V--T, S--L, S--B, T--A, L--A, T--L, A--X, A--D, B--D, A--B.
- At the first step, this graph is just the moralized graph of the network.
Undirected Graph Representation
Now we eliminate t, getting
  f_t(v, a, l) = Σ_t P(t | v) P(a | t, l)
The corresponding change in the graph: T and its edges are removed, and T's neighbors V, A, L are connected, adding the fill-in edges V--A and V--L.
Example
Want to compute P(L, V = t, S = f, D = t).
- Moralizing: connect the parents of each node (T--L, A--B) and drop edge directions.
Example (cont.)
- Moralizing
- Setting evidence: the observed nodes V, S, D are instantiated and removed from the graph.
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x. New factor f_x(a).
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x
- Eliminating a. New factor f_a(b, t, l).
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x
- Eliminating a
- Eliminating b. New factor f_b(t, l).
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x
- Eliminating a
- Eliminating b
- Eliminating t. New factor f_t(l).
Elimination in Undirected Graphs
- Generalizing, we see that we can eliminate a variable X by:
  1. For all Y, Z such that Y--X and Z--X, add an edge Y--Z.
  2. Remove X and all edges adjacent to it.
- This procedure creates a clique that contains all the neighbors of X.
- After step 1 we have a clique that corresponds to the intermediate factor (before marginalization).
- The cost of the step is exponential in the size of this clique.
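A sketch of this graph operation (the adjacency-set representation is our choice):

```python
def eliminate_node(graph, x):
    """One graph-elimination step: `graph` maps each node to a set of
    neighbors. Connects all neighbors of x, then removes x. Returns
    the clique created, i.e. x together with its neighbors."""
    neighbors = graph[x]
    for y in neighbors:
        for z in neighbors:
            if y != z:
                graph[y].add(z)   # fill-in edge Y--Z
    for y in neighbors:
        graph[y].discard(x)
    del graph[x]
    return {x} | neighbors
```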
Undirected Graphs
- The process of eliminating nodes from an undirected graph gives us a clue to the complexity of inference.
- To see this, we will examine the graph that contains all of the edges we added during the elimination. The resulting graph is always chordal.
Example
Want to compute P(d).
- Moralizing: add the edges T--L and A--B, and drop edge directions.
Example (cont.)
- Moralizing
- Eliminating v: multiply to get f'_v(v, t) = P(v) P(t | v); result f_v(t) = Σ_v f'_v(v, t).
Example (cont.)
- Moralizing
- Eliminating v
- Eliminating x: multiply to get f'_x(a, x) = P(x | a); result f_x(a) = Σ_x f'_x(a, x).
Example (cont.)
- Moralizing
- Eliminating v
- Eliminating x
- Eliminating s: multiply to get f'_s(l, b, s) = P(s) P(l | s) P(b | s); result f_s(l, b) = Σ_s f'_s(l, b, s).
Example (cont.)
- Moralizing
- Eliminating v, x, s
- Eliminating t: multiply to get f'_t(a, l, t) = f_v(t) P(a | t, l); result f_t(a, l) = Σ_t f'_t(a, l, t).
Example (cont.)
- Moralizing
- Eliminating v, x, s, t
- Eliminating l: multiply to get f'_l(a, b, l) = f_s(l, b) f_t(a, l); result f_l(a, b) = Σ_l f'_l(a, b, l).
Example (cont.)
- Moralizing
- Eliminating v, x, s, t, l
- Eliminating a, b: multiply to get f'_a(a, b, d) = f_x(a) f_l(a, b) P(d | a, b); result f(d) = Σ_b Σ_a f'_a(a, b, d), and P(d) = f(d).
Expanded Graphs
- The resulting graph is the induced graph (for this particular ordering).
- Main properties:
  - Every maximal clique in the induced graph corresponds to an intermediate factor in the computation.
  - Every factor stored during the process is a subset of some maximal clique in the graph.
- These facts are true for any variable elimination ordering on any network.
Induced Width (Treewidth)
- The size of the largest clique in the induced graph is thus an indicator of the complexity of variable elimination.
- This quantity (minus one) is called the induced width of the graph for the specified ordering; the minimum over all orderings is the treewidth.
- Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph.
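A sketch that computes the induced width of a given ordering by simulating the graph-elimination step above; the graph is the moralized “Asia” network, and the ordering is one of many:

```python
def induced_width(graph, ordering):
    """Induced width of an elimination ordering: the size of the largest
    clique created, minus one. The input graph is left intact."""
    g = {v: set(ns) for v, ns in graph.items()}
    width = 0
    for x in ordering:
        neighbors = g[x]
        width = max(width, len(neighbors))  # clique size minus one
        for y in neighbors:
            g[y] |= neighbors - {y}         # fill-in edges
            g[y].discard(x)
        del g[x]
    return width

# Moralized Asia graph (edges as described earlier).
asia = {'V': {'T'}, 'S': {'L', 'B'}, 'T': {'V', 'L', 'A'},
        'L': {'S', 'T', 'A'}, 'B': {'S', 'A', 'D'},
        'A': {'T', 'L', 'X', 'D', 'B'}, 'X': {'A'},
        'D': {'A', 'B'}}
print(induced_width(asia, ['V', 'X', 'S', 'T', 'L', 'A', 'B', 'D']))  # 2
```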
Consequence: Elimination on Trees
- Suppose we have a tree:
  - a network where each variable has at most one parent.
- All the factors involve at most two variables.
- Thus, the moralized graph is also a tree.
[Figure: a tree network and its moralized graph]
Elimination on Trees
- We can maintain the tree structure by always eliminating extreme variables (leaves) of the tree.
[Figure: the tree after successive leaf eliminations]
Elimination on Trees
- Formally, for any tree there is an elimination ordering with treewidth = 1.
Thm: Inference on trees is linear in the number of variables.
Polytrees
- A polytree is a network where there is at most one path between any two variables.
Thm: Inference in a polytree is linear in the representation size of the network.
  - This assumes a tabular CPT representation.
- Can you see how the argument would work?
General Networks
What do we do when the network is not a polytree?
- If the network has a cycle, the treewidth for any ordering is greater than 1.
Example
- Eliminating A, B, C, D, E, ….
- The resulting graph is chordal with treewidth 2.
[Figure: a network over A, …, H, shown after each elimination step]
Example
- Eliminating H, G, E, C, F, D, B, A.
[Figure: the same network after each elimination step under this ordering]
General Networks
From graph theory:
Thm: Finding an ordering that minimizes the treewidth is NP-hard.
However:
- There are reasonable heuristics for finding “relatively” good orderings.
- There are provable approximations to the best treewidth.
- If the graph has a small treewidth, there are algorithms that find it in polynomial time.
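One standard heuristic of this kind (not named on the slide) is greedy min-fill: repeatedly eliminate the node whose elimination adds the fewest fill-in edges. A sketch in the same adjacency-set representation as above; it is not guaranteed to achieve the true treewidth:

```python
def min_fill_ordering(graph):
    """Greedy min-fill elimination ordering for an undirected graph
    given as a dict mapping each node to a set of neighbors."""
    g = {v: set(ns) for v, ns in graph.items()}
    order = []
    while g:
        def fill_cost(x):
            # Number of fill-in edges eliminating x would add.
            ns = g[x]
            return sum(1 for y in ns for z in ns
                       if y < z and z not in g[y])
        x = min(g, key=fill_cost)
        for y in g[x]:
            g[y] |= g[x] - {y}
            g[y].discard(x)
        order.append(x)
        del g[x]
    return order

# e.g. min_fill_ordering(asia) on the moralized Asia graph above.
```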