Inference I: Introduction, Hardness, and Variable Elimination
Slides by Nir Friedman
- In previous lessons we introduced compact representations of probability distributions:
  - Bayesian networks
  - Markov networks
- A network describes a unique probability distribution P. How do we answer queries about P?
- We use inference as a name for the process of computing answers to such queries.
Queries: Likelihood
- There are many types of queries we might ask; most of them involve evidence.
- Evidence e is an assignment of values to a set E of variables in the domain. Without loss of generality, E = {X_{k+1}, …, X_n}.
- Simplest query: compute the probability of the evidence,
  P(e) = Σ_{x_1,…,x_k} P(x_1, …, x_k, e)
- This is often referred to as computing the likelihood of the evidence.
Queries: A posteriori belief
- Often we are interested in the conditional probability of a variable given the evidence:
  P(X | e)
  This is the a posteriori belief in X, given evidence e.
- A related task is computing the term P(X, e), i.e., the likelihood of e and X = x for each value x of X.
  - We can recover the a posteriori belief by normalizing:
    P(X = x | e) = P(X = x, e) / Σ_{x'} P(X = x', e)
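A minimal sketch of this normalization step (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical unnormalized values P(X = x, e) for a 3-valued X.
joint = np.array([0.03, 0.12, 0.05])

likelihood = joint.sum()         # P(e)
posterior = joint / likelihood   # the a posteriori belief P(X | e)
print(posterior)                 # [0.15 0.6  0.25]
```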
A posteriori belief
This query is useful in many cases:
- Prediction: what is the probability of an outcome given the starting condition?
  - The target is a descendant of the evidence.
- Diagnosis: what is the probability of a disease/fault given the symptoms?
  - The target is an ancestor of the evidence.
- As we shall see, the direction of the edges between variables does not restrict the direction of the queries.
  - Probabilistic inference can combine evidence from all parts of the network.
Queries: A posteriori joint
- In this query we are interested in the conditional probability of several variables, given the evidence: P(X, Y, … | e).
- Note that the size of the answer to this query is exponential in the number of variables in the joint.
Queries: MAP
- In this query we want to find the maximum a posteriori (MAP) assignment for some variables of interest (say X_1, …, X_l).
- That is, find x_1, …, x_l maximizing the probability P(x_1, …, x_l | e).
- Note that this is equivalent to maximizing P(x_1, …, x_l, e), since P(e) does not depend on the assignment.
Queries: MAP
We can use MAP for:
- Classification: find the most likely label, given the evidence.
- Explanation: what is the most likely scenario, given the evidence?
Queries: MAP
Cautionary note:
- The MAP depends on the set of variables we maximize over.
- Example: the MAP of X alone can be 1 while the MAP of (X, Y) is (0, 0).
Complexity of Inference
Thm: Computing P(X = x) in a Bayesian network is NP-hard.
This is not surprising, since a Bayesian network can simulate Boolean gates.
Proof
We reduce 3-SAT to Bayesian network computation.
Assume we are given a 3-SAT problem:
- q_1, …, q_n are propositions,
- φ_1, …, φ_k are clauses, such that φ_i = ℓ_{i1} ∨ ℓ_{i2} ∨ ℓ_{i3}, where each ℓ_{ij} is a literal over q_1, …, q_n,
- Φ = φ_1 ∧ … ∧ φ_k.
We will construct a network such that P(X = t) > 0 iff Φ is satisfiable.
The network has root variables Q_1, …, Q_n, one node φ_i per clause, and gate variables A_1, A_2, … feeding a final output X:
- P(Q_i = true) = 0.5 for every proposition.
- P(φ_i = true | Q_{i1}, Q_{i2}, Q_{i3}) = 1 iff the values of Q_{i1}, Q_{i2}, Q_{i3} satisfy the clause φ_i.
- A_1, A_2, … are simple binary AND gates, so X = true iff all clauses are satisfied.
[Figure: roots Q_1, …, Q_n; clause nodes φ_1, …, φ_k; gates A_1, …, A_{k-1}; output X]
- It is easy to check that:
  - the network has a polynomial number of variables;
  - each CPD can be described by a small table (at most 8 parameters);
  - P(X = true) > 0 if and only if there exists a satisfying assignment to Q_1, …, Q_n.
- Conclusion: this is a polynomial reduction of 3-SAT.
Note: this construction also shows that computing P(X = t) is harder than NP:
  2^n · P(X = t) is the number of satisfying assignments of Φ.
Thus, the problem is #P-hard (in fact, it is #P-complete).
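The counting identity is easy to check by brute force. The sketch below uses our own encoding (clauses as triples of signed indices; the example formula is invented) and computes P(X = true) for the construction:

```python
from itertools import product

# Clause (1, -2, 3) stands for (q1 OR NOT q2 OR q3).
clauses = [(1, -2, 3), (-1, 2, 3), (1, 2, -3)]
n = 3  # propositions q_1 .. q_n

def p_x_true(clauses, n):
    """P(X = true) in the reduction: each Q_i is a fair coin, clause
    nodes are deterministic, and X is the AND of all clause nodes."""
    satisfying = 0
    for assignment in product([False, True], repeat=n):
        # A clause is true iff one of its literals is satisfied.
        if all(any(assignment[abs(l) - 1] == (l > 0) for l in clause)
               for clause in clauses):
            satisfying += 1
    return satisfying / 2 ** n

p = p_x_true(clauses, n)
print(p > 0)       # True iff the formula is satisfiable
print(2 ** n * p)  # the number of satisfying assignments
```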
Hardness - Notes
- We used deterministic relations in our construction.
- The same construction works if we use probabilities (1 − ε, ε) instead of (1, 0) in each gate, for any ε < 0.5.
- Hardness does not mean we cannot solve inference:
  - It implies that we cannot find a general procedure that works efficiently for all networks.
  - For particular families of networks, we can have provably efficient procedures.
  - We will characterize such families in the next two classes.
Inference in Simple Chains
[Chain: X_1 → X_2]
How do we compute P(X_2)?
  P(x_2) = Σ_{x_1} P(x_1, x_2) = Σ_{x_1} P(x_1) P(x_2 | x_1)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → X_3]
How do we compute P(X_3)? We already know how to compute P(X_2)...
  P(x_3) = Σ_{x_2} P(x_2) P(x_3 | x_2)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → … → X_n]
How do we compute P(X_n)? Compute P(X_1), P(X_2), P(X_3), … in order:
  P(x_{i+1}) = Σ_{x_i} P(x_i) P(x_{i+1} | x_i)
- We compute each term by using the previous one.
- Complexity: each step costs O(|Val(X_i)| · |Val(X_{i+1})|) operations.
- Compare to naive evaluation, which requires summing over the joint values of n − 1 variables.
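A sketch of this forward pass, assuming each CPT is stored as a matrix with entry [x_i, x_{i+1}] = P(x_{i+1} | x_i); all numbers are invented:

```python
import numpy as np

def chain_marginal(prior, transitions):
    """Forward pass on a chain X_1 -> ... -> X_n: returns P(X_n)."""
    belief = prior
    for cpt in transitions:
        # P(x_{i+1}) = sum_{x_i} P(x_i) P(x_{i+1} | x_i)
        belief = belief @ cpt
    return belief

p_x1 = np.array([0.6, 0.4])
p_x2_given_x1 = np.array([[0.9, 0.1],
                          [0.2, 0.8]])
p_x3_given_x2 = np.array([[0.7, 0.3],
                          [0.5, 0.5]])
print(chain_marginal(p_x1, [p_x2_given_x1, p_x3_given_x2]))
```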
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2]
Suppose that we observe the value X_2 = x_2. How do we compute P(X_1 | x_2)?
Recall that it suffices to compute P(X_1, x_2):
  P(X_1, x_2) = P(X_1) P(x_2 | X_1),   P(X_1 | x_2) = P(X_1, x_2) / P(x_2)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → X_3]
Suppose that we observe the value X_3 = x_3.
How do we compute P(X_1, x_3)?
  P(X_1, x_3) = P(X_1) P(x_3 | X_1)
How do we compute P(x_3 | x_1)?
  P(x_3 | x_1) = Σ_{x_2} P(x_2 | x_1) P(x_3 | x_2)
Inference in Simple Chains (cont.)
[Chain: X_1 → X_2 → … → X_n]
Suppose that we observe the value X_n = x_n. How do we compute P(X_1, x_n)?
- We compute P(x_n | x_{n-1}), P(x_n | x_{n-2}), … iteratively, going backward:
  P(x_n | x_{i-1}) = Σ_{x_i} P(x_i | x_{i-1}) P(x_n | x_i)
- Then P(X_1, x_n) = P(X_1) P(x_n | X_1).
Inference in Simple Chains (cont.)
[Chain: X_1 → … → X_k → … → X_n]
Suppose that we observe the value X_n = x_n. We want to find P(X_k | x_n).
How do we compute P(X_k, x_n)?
  P(X_k, x_n) = P(X_k) P(x_n | X_k)
- We compute P(X_k) by forward iterations.
- We compute P(x_n | X_k) by backward iterations.
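Combining the two passes answers the conditional query. A sketch in the same matrix representation as above (the function and argument names are ours):

```python
import numpy as np

def chain_posterior(prior, transitions, k, x_n):
    """P(X_k | X_n = x_n) on a chain, via a forward and a backward pass.
    k is 1-based; transitions is the list of CPTs along the chain,
    as in chain_marginal."""
    # Forward pass: P(X_k).
    forward = prior
    for cpt in transitions[:k - 1]:
        forward = forward @ cpt
    # Backward pass: P(x_n | X_k), starting from an indicator on x_n.
    backward = np.zeros(transitions[-1].shape[1])
    backward[x_n] = 1.0
    for cpt in reversed(transitions[k - 1:]):
        # P(x_n | x_{i-1}) = sum_{x_i} P(x_i | x_{i-1}) P(x_n | x_i)
        backward = cpt @ backward
    joint = forward * backward   # P(X_k, x_n)
    return joint / joint.sum()   # P(X_k | x_n)

# P(X_1 | X_3 = 0) for the chain from the previous sketch:
p_x1 = np.array([0.6, 0.4])
ts = [np.array([[0.9, 0.1], [0.2, 0.8]]),
      np.array([[0.7, 0.3], [0.5, 0.5]])]
print(chain_posterior(p_x1, ts, k=1, x_n=0))
```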
Elimination in Chains
- We now try to understand the simple chain example using first principles.
[Chain: A → B → C → D → E]
- Using the definition of probability, we have:
  P(e) = Σ_d Σ_c Σ_b Σ_a P(a, b, c, d, e)
Elimination in Chains
- By the chain rule decomposition, we get:
  P(e) = Σ_d Σ_c Σ_b Σ_a P(a) P(b | a) P(c | b) P(d | c) P(e | d)
Elimination in Chains
- Rearranging terms...
  P(e) = Σ_d P(e | d) Σ_c P(d | c) Σ_b P(c | b) Σ_a P(b | a) P(a)
Elimination in Chains
- Now we can perform the innermost summation:
  P(e) = Σ_d P(e | d) Σ_c P(d | c) Σ_b P(c | b) f_a(b),   where f_a(b) = Σ_a P(b | a) P(a)
- This summation is exactly the first step in the forward iteration we described before.
Elimination in Chains
- Rearranging and then summing again, we get:
  P(e) = Σ_d P(e | d) Σ_c P(d | c) f_b(c),   where f_b(c) = Σ_b P(c | b) f_a(b)
  and so on, until only the sum over d remains.
Elimination in Chains with Evidence
- Similarly, we can understand the backward pass.
- We write the query in explicit form:
  P(a, e) = P(a) Σ_b P(b | a) Σ_c P(c | b) Σ_d P(d | c) P(e | d)
Elimination in Chains with Evidence
Eliminating d, we get
  P(a, e) = P(a) Σ_b P(b | a) Σ_c P(c | b) f_d(c),   where f_d(c) = Σ_d P(d | c) P(e | d)
Elimination in Chains with Evidence
Eliminating c, we get
  P(a, e) = P(a) Σ_b P(b | a) f_c(b),   where f_c(b) = Σ_c P(c | b) f_d(c)
Elimination in Chains with Evidence
Finally, we eliminate b:
  P(a, e) = P(a) f_b(a),   where f_b(a) = Σ_b P(b | a) f_c(b)
Variable Elimination
General idea:
- Write the query in the form
  P(X_1, e) = Σ_{x_n} ⋯ Σ_{x_3} Σ_{x_2} ∏_i P(x_i | Pa_i)
- Iteratively:
  - Move all irrelevant terms outside of the innermost sum.
  - Perform the innermost sum, getting a new term.
  - Insert the new term into the product.
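A compact sketch of the whole procedure, using a deliberately naive factor representation of our own (a factor is a tuple of variable names plus a table mapping assignments to values); a real implementation would use arrays:

```python
from itertools import product

def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their scopes."""
    fvars, ftab = f
    gvars, gtab = g
    out_vars = fvars + tuple(v for v in gvars if v not in fvars)
    out_tab = {}
    for vals in product(*(domains[v] for v in out_vars)):
        a = dict(zip(out_vars, vals))
        out_tab[vals] = (ftab[tuple(a[v] for v in fvars)]
                         * gtab[tuple(a[v] for v in gvars)])
    return out_vars, out_tab

def eliminate(factors, var, domains):
    """One variable-elimination step: multiply all factors that mention
    var, sum var out, and return the updated factor list."""
    relevant = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    prod = relevant[0]
    for f in relevant[1:]:
        prod = multiply(prod, f, domains)
    pvars, ptab = prod
    new_vars = tuple(v for v in pvars if v != var)
    new_tab = {}
    for vals, p in ptab.items():
        key = tuple(x for x, name in zip(vals, pvars) if name != var)
        new_tab[key] = new_tab.get(key, 0.0) + p
    return rest + [(new_vars, new_tab)]

# Example: sum A out of P(A) P(B | A), with invented numbers.
domains = {'A': [0, 1], 'B': [0, 1]}
p_a = (('A',), {(0,): 0.6, (1,): 0.4})
p_b_given_a = (('A', 'B'), {(0, 0): 0.9, (0, 1): 0.1,
                            (1, 0): 0.2, (1, 1): 0.8})
print(eliminate([p_a, p_b_given_a], 'A', domains))  # the factor P(B)
```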
A More Complex Example
The “Asia” network:
[Figure: Visit to Asia (V) → Tuberculosis (T); Smoking (S) → Lung Cancer (L) and Bronchitis (B); T, L → Abnormality in Chest (A); A → X-Ray (X); A, B → Dyspnea (D)]
We want to compute P(d). Need to eliminate: v, s, x, t, l, a, b.
Initial factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
We want to compute P(d). Need to eliminate: v, s, x, t, l, a, b.
Eliminate v. Compute:
  f_v(t) = Σ_v P(v) P(t | v)
leaving: f_v(t) P(s) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Note: f_v(t) = P(t).
- In general, the result of elimination is not necessarily a probability term.
We want to compute P(d). Need to eliminate: s, x, t, l, a, b.
Eliminate s. Compute:
  f_s(b, l) = Σ_s P(s) P(l | s) P(b | s)
leaving: f_v(t) f_s(b, l) P(a | t, l) P(x | a) P(d | a, b)
- Summing over s results in a factor with two arguments, f_s(b, l).
- In general, the result of elimination may be a function of several variables.
We want to compute P(d). Need to eliminate: x, t, l, a, b.
Eliminate x. Compute:
  f_x(a) = Σ_x P(x | a)
leaving: f_v(t) f_s(b, l) f_x(a) P(a | t, l) P(d | a, b)
- Note: f_x(a) = 1 for all values of a!
We want to compute P(d). Need to eliminate: t, l, a, b.
Eliminate t. Compute:
  f_t(a, l) = Σ_t f_v(t) P(a | t, l)
leaving: f_s(b, l) f_x(a) f_t(a, l) P(d | a, b)
We want to compute P(d). Need to eliminate: l, a, b.
Eliminate l. Compute:
  f_l(a, b) = Σ_l f_s(b, l) f_t(a, l)
leaving: f_x(a) f_l(a, b) P(d | a, b)
We want to compute P(d). Need to eliminate: a, b.
Eliminate a, then b. Compute:
  f_a(b, d) = Σ_a f_x(a) f_l(a, b) P(d | a, b),   f_b(d) = Σ_b f_a(b, d)
Result: P(d) = f_b(d).
Variable Elimination
- We now understand variable elimination as a sequence of rewriting operations.
- The actual computation is done in the elimination steps.
- Exactly the same computation procedure applies to Markov networks.
- The computation depends on the order of elimination.
  - We will return to this issue in detail.
Dealing with Evidence
- How do we deal with evidence?
- Suppose we get evidence V = t, S = f, D = t.
- We want to compute P(L, V = t, S = f, D = t).
Dealing with Evidence
- We start by writing the factors:
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Since we know that V = t, we don't need to eliminate V. Instead, we can replace the factors P(V) and P(T | V) with
  f_{P(V)} = P(V = t),   f_{P(T|V)}(T) = P(T | V = t)
- These “select” the appropriate parts of the original factors given the evidence.
- Note that f_{P(V)} is a constant, and thus does not appear in the elimination of other variables.
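A sketch of this evidence step, in the same illustrative factor representation used in the variable-elimination sketch above (a factor is a tuple of variable names plus an assignment table):

```python
def restrict(factor, var, value):
    """Restrict a factor to the evidence var = value: keep only the
    rows consistent with the evidence and drop var from the scope."""
    fvars, ftab = factor
    if var not in fvars:
        return factor
    i = fvars.index(var)
    new_vars = fvars[:i] + fvars[i + 1:]
    new_tab = {vals[:i] + vals[i + 1:]: p
               for vals, p in ftab.items() if vals[i] == value}
    return new_vars, new_tab
```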
Dealing with Evidence
Given evidence V = t, S = f, D = t, compute P(L, V = t, S = f, D = t).
- Initial factors, after setting evidence:
  f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) P(x | a) f_{P(d|a,b)}(a, b)
Dealing with Evidence (cont.)
- Eliminating x, we get
  f_x(a) = Σ_x P(x | a)
  leaving: f_{P(v)} f_{P(s)} f_{P(t|v)}(t) f_{P(l|s)}(l) f_{P(b|s)}(b) P(a | t, l) f_x(a) f_{P(d|a,b)}(a, b)
Dealing with Evidence (cont.)
- Eliminating t, we get
  f_t(a, l) = Σ_t f_{P(t|v)}(t) P(a | t, l)
  leaving: f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_{P(b|s)}(b) f_t(a, l) f_x(a) f_{P(d|a,b)}(a, b)
Dealing with Evidence (cont.)
- Eliminating a, we get
  f_a(b, l) = Σ_a f_t(a, l) f_x(a) f_{P(d|a,b)}(a, b)
  leaving: f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_{P(b|s)}(b) f_a(b, l)
Dealing with Evidence (cont.)
- Eliminating b, we get
  f_b(l) = Σ_b f_{P(b|s)}(b) f_a(b, l)
- Result: P(L, V = t, S = f, D = t) = f_{P(v)} f_{P(s)} f_{P(l|s)}(l) f_b(l)
Complexity of Variable Elimination
- Suppose that in one elimination step we compute
  f_x(y_1, …, y_k) = Σ_x ∏_{i=1…m} f_i(x, Z_i),   where each Z_i ⊆ {y_1, …, y_k}
- This requires:
  - m · |Val(X)| · ∏_{j=1…k} |Val(Y_j)| multiplications: for each value of x, y_1, …, y_k, we do m multiplications;
  - |Val(X)| · ∏_{j=1…k} |Val(Y_j)| additions: for each value of y_1, …, y_k, we do |Val(X)| additions.
- Complexity is exponential in the number of variables in the intermediate factor.
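For concreteness, a worked instance with invented sizes: all variables binary, m = 3 factors, and k = 4 neighbor variables y_1, …, y_4:

```latex
\[
\underbrace{m\,|\mathrm{Val}(X)|\prod_{j=1}^{k}|\mathrm{Val}(Y_j)|}_{\text{multiplications}}
 = 3 \cdot 2 \cdot 2^{4} = 96,
\qquad
\underbrace{|\mathrm{Val}(X)|\prod_{j=1}^{k}|\mathrm{Val}(Y_j)|}_{\text{additions}}
 = 2 \cdot 2^{4} = 32.
\]
```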
Understanding Variable Elimination
- We want to select “good” elimination orderings that reduce the complexity.
- We start by attempting to understand variable elimination via the graph we are working with.
- This reduces the problem of finding a good ordering to a well-understood graph-theoretic operation.
Undirected Graph Representation
- At each stage of the procedure, we have an algebraic term that we need to evaluate.
- In general this term is of the form
  Σ_{x_1} ⋯ Σ_{x_m} ∏_i f_i(Z_i)
  where the Z_i are sets of variables.
- We now draw a graph with an undirected edge X--Y if X and Y are arguments of some factor, that is, if X and Y appear in some Z_i.
- Note: this is the Markov network that describes the probability over the variables we have not yet eliminated.
Undirected Graph Representation
- Consider the “Asia” example. The initial factors are
  P(v) P(s) P(t | v) P(l | s) P(b | s) P(a | t, l) P(x | a) P(d | a, b)
- Thus, the undirected graph has the edges V--T, S--L, S--B, T--A, L--A, T--L, A--X, A--D, B--D, A--B.
- At the first step, this graph is just the moralized graph of the network.
Undirected Graph Representation
Now we eliminate t, getting
  f_t(v, a, l) = Σ_t P(t | v) P(a | t, l)
The corresponding change in the graph: T and its edges are removed, and T's neighbors V, A, L are connected, adding the fill-in edges V--A and V--L.
Example
Want to compute P(L, V = t, S = f, D = t).
- Moralizing: connect the parents of each node (T--L, A--B) and drop edge directions.
Example (cont.)
- Moralizing
- Setting evidence: the observed nodes V, S, D are instantiated and removed from the graph.
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x. New factor f_x(a).
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x
- Eliminating a. New factor f_a(b, t, l).
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x
- Eliminating a
- Eliminating b. New factor f_b(t, l).
Example (cont.)
- Moralizing
- Setting evidence
- Eliminating x
- Eliminating a
- Eliminating b
- Eliminating t. New factor f_t(l).
Elimination in Undirected Graphs
- Generalizing, we see that we can eliminate a variable X by:
  1. For all Y, Z such that Y--X and Z--X, add an edge Y--Z.
  2. Remove X and all edges adjacent to it.
- This procedure creates a clique that contains all the neighbors of X.
- After step 1 we have a clique that corresponds to the intermediate factor (before marginalization).
- The cost of the step is exponential in the size of this clique.
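A sketch of this graph operation (the adjacency-set representation is our choice):

```python
def eliminate_node(graph, x):
    """One graph-elimination step: `graph` maps each node to a set of
    neighbors. Connects all neighbors of x, then removes x. Returns
    the clique created, i.e. x together with its neighbors."""
    neighbors = graph[x]
    for y in neighbors:
        for z in neighbors:
            if y != z:
                graph[y].add(z)   # fill-in edge Y--Z
    for y in neighbors:
        graph[y].discard(x)
    del graph[x]
    return {x} | neighbors
```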
Undirected Graphs
- The process of eliminating nodes from an undirected graph gives us a clue to the complexity of inference.
- To see this, we will examine the graph that contains all of the edges we added during the elimination. The resulting graph is always chordal.
Example
Want to compute P(d).
- Moralizing: add the edges T--L and A--B, and drop edge directions.
Example (cont.)
- Moralizing
- Eliminating v: multiply to get f'_v(v, t) = P(v) P(t | v); result f_v(t) = Σ_v f'_v(v, t).
Example (cont.)
- Moralizing
- Eliminating v
- Eliminating x: multiply to get f'_x(a, x) = P(x | a); result f_x(a) = Σ_x f'_x(a, x).
Example (cont.)
- Moralizing
- Eliminating v
- Eliminating x
- Eliminating s: multiply to get f'_s(l, b, s) = P(s) P(l | s) P(b | s); result f_s(l, b) = Σ_s f'_s(l, b, s).
Example (cont.)
- Moralizing
- Eliminating v, x, s
- Eliminating t: multiply to get f'_t(a, l, t) = f_v(t) P(a | t, l); result f_t(a, l) = Σ_t f'_t(a, l, t).
Example (cont.)
- Moralizing
- Eliminating v, x, s, t
- Eliminating l: multiply to get f'_l(a, b, l) = f_s(l, b) f_t(a, l); result f_l(a, b) = Σ_l f'_l(a, b, l).
Example (cont.)
- Moralizing
- Eliminating v, x, s, t, l
- Eliminating a, b: multiply to get f'_a(a, b, d) = f_x(a) f_l(a, b) P(d | a, b); result f(d) = Σ_b Σ_a f'_a(a, b, d), and P(d) = f(d).
Expanded Graphs
- The resulting graph is the induced graph (for this particular ordering).
- Main properties:
  - Every maximal clique in the induced graph corresponds to an intermediate factor in the computation.
  - Every factor stored during the process is a subset of some maximal clique in the graph.
- These facts are true for any variable elimination ordering on any network.
Induced Width (Treewidth)
- The size of the largest clique in the induced graph is thus an indicator of the complexity of variable elimination.
- This quantity (minus one) is called the induced width of the graph for the specified ordering; the minimum over all orderings is the treewidth.
- Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph.
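A sketch that computes the induced width of a given ordering by simulating the graph-elimination step above; the graph is the moralized “Asia” network, and the ordering is one of many:

```python
def induced_width(graph, ordering):
    """Induced width of an elimination ordering: the size of the largest
    clique created, minus one. The input graph is left intact."""
    g = {v: set(ns) for v, ns in graph.items()}
    width = 0
    for x in ordering:
        neighbors = g[x]
        width = max(width, len(neighbors))  # clique size minus one
        for y in neighbors:
            g[y] |= neighbors - {y}         # fill-in edges
            g[y].discard(x)
        del g[x]
    return width

# Moralized Asia graph (edges as described earlier).
asia = {'V': {'T'}, 'S': {'L', 'B'}, 'T': {'V', 'L', 'A'},
        'L': {'S', 'T', 'A'}, 'B': {'S', 'A', 'D'},
        'A': {'T', 'L', 'X', 'D', 'B'}, 'X': {'A'},
        'D': {'A', 'B'}}
print(induced_width(asia, ['V', 'X', 'S', 'T', 'L', 'A', 'B', 'D']))  # 2
```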
Consequence: Elimination on Trees
- Suppose we have a tree:
  - a network where each variable has at most one parent.
- All the factors involve at most two variables.
- Thus, the moralized graph is also a tree.
[Figure: a tree network and its moralized graph]
Elimination on Trees
- We can maintain the tree structure by always eliminating extreme variables (leaves) of the tree.
[Figure: the tree after successive leaf eliminations]
Elimination on Trees
- Formally, for any tree there is an elimination ordering with treewidth = 1.
Thm: Inference on trees is linear in the number of variables.
Polytrees
- A polytree is a network where there is at most one path between any two variables.
Thm: Inference in a polytree is linear in the representation size of the network.
  - This assumes a tabular CPT representation.
- Can you see how the argument would work?
General Networks
What do we do when the network is not a polytree?
- If the network has a cycle, the treewidth for any ordering is greater than 1.
Example
- Eliminating A, B, C, D, E, ….
- The resulting graph is chordal with treewidth 2.
[Figure: a network over A, …, H, shown after each elimination step]
Example
- Eliminating H, G, E, C, F, D, B, A.
[Figure: the same network after each elimination step under this ordering]
General Networks
From graph theory:
Thm: Finding an ordering that minimizes the treewidth is NP-hard.
However:
- There are reasonable heuristics for finding “relatively” good orderings.
- There are provable approximations to the best treewidth.
- If the graph has a small treewidth, there are algorithms that find it in polynomial time.
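One standard heuristic of this kind (not named on the slide) is greedy min-fill: repeatedly eliminate the node whose elimination adds the fewest fill-in edges. A sketch in the same adjacency-set representation as above; it is not guaranteed to achieve the true treewidth:

```python
def min_fill_ordering(graph):
    """Greedy min-fill elimination ordering for an undirected graph
    given as a dict mapping each node to a set of neighbors."""
    g = {v: set(ns) for v, ns in graph.items()}
    order = []
    while g:
        def fill_cost(x):
            # Number of fill-in edges eliminating x would add.
            ns = g[x]
            return sum(1 for y in ns for z in ns
                       if y < z and z not in g[y])
        x = min(g, key=fill_cost)
        for y in g[x]:
            g[y] |= g[x] - {y}
            g[y].discard(x)
        order.append(x)
        del g[x]
    return order

# e.g. min_fill_ordering(asia) on the moralized Asia graph above.
```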