Advanced Machine Learning: Exact Inference. Eric Xing, Lecture 11, August 12, 2009. © Eric Xing @ CMU, 2006-2009. Reading:
Inference and Learning. We now have compact representations of probability distributions: GMs. A BN M describes a unique probability distribution P. Typical tasks: Task 1: How do we answer queries about P? We use "inference" as the name for the process of computing answers to such queries. Task 2: How do we estimate a plausible model M from data D? We use "learning" as the name for the process of obtaining a point estimate of M. A Bayesian, however, seeks p(M | D), which is itself an inference problem. When not all variables are observable, even computing a point estimate of M requires inference to impute the missing data.
Inferential Query 1: Likelihood. Most of the queries one may ask involve evidence. Evidence x_V is an assignment of values to a set X_V of nodes in the GM over the variable set X = {X_1, X_2, …, X_n}. Without loss of generality, X_V = {X_{k+1}, …, X_n}. Write X_H = X \ X_V for the set of hidden variables; X_H can be ∅ or X. Simplest query: compute the probability of the evidence. This is often referred to as computing the likelihood of x_V.
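The likelihood formula itself did not survive extraction; the standard form, summing the joint over the hidden variables, is:

    P(x_V) = \sum_{x_H} P(x_H, x_V)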
Inferential Query 2: Conditional Probability. Often we are interested in the conditional probability distribution of a variable given the evidence; this is the a posteriori belief in X_H given evidence x_V. We usually query a subset Y of all hidden variables, X_H = {Y, Z}, and "don't care" about the remaining variables Z: the process of summing out the "don't care" variables z is called marginalization, and the resulting P(Y | x_V) is called a marginal probability.
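The formulas on this slide are missing; the standard definitions of the posterior and of the marginal over the query variables Y are:

    P(X_H | x_V) = P(X_H, x_V) / \sum_{x_H} P(x_H, x_V)
    P(Y | x_V) = \sum_{z} P(Y, Z = z | x_V)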
Applications of a posteriori Belief. Prediction: what is the probability of an outcome given the starting condition? (The query node is a descendant of the evidence.) Diagnosis: what is the probability of a disease/fault given symptoms? (The query node is an ancestor of the evidence.) Learning under partial observation: fill in the unobserved values under an "EM" setting (more later). The directionality of information flow between variables is not restricted by the directionality of the edges in a GM: probabilistic inference can combine evidence from all parts of the network. (Figure: two small networks over A, B, C illustrating prediction and diagnosis.)
Inferential Query 3: Most Probable Assignment. In this query we want to find the most probable joint assignment (MPA) for some variables of interest. Such reasoning is usually performed under some given evidence x_V, ignoring (the values of) the other variables Z. This is the maximum a posteriori configuration of Y.
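The defining equation was lost in extraction; the usual statement of the MPA query is:

    MPA(Y | x_V) = \arg\max_{y} P(y | x_V) = \arg\max_{y} \sum_{z} P(y, z | x_V)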
Complexity of Inference. Thm: computing P(X_H = x_H | x_V) in an arbitrary GM is NP-hard. Hardness does not mean we cannot solve inference; it implies that we cannot find a general procedure that works efficiently for arbitrary GMs. For particular families of GMs, we can have provably efficient procedures.
Approaches to Inference. Exact inference algorithms: the elimination algorithm, belief propagation, the junction tree algorithm (not covered in detail here). Approximate inference techniques: variational algorithms, stochastic simulation / sampling methods, Markov chain Monte Carlo methods.
Inference on a General BN via Variable Elimination. General idea: write the query in the form of nested sums over the factored joint; this suggests an "elimination order" in which the latent variables are marginalized out. Iteratively: move all irrelevant terms outside of the innermost sum; perform the innermost sum, getting a new term; insert the new term into the product. Finally, wrap up.
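To make this loop concrete, here is a minimal Python sketch of sum-product variable elimination over discrete factors; the factor representation, function names, and the tiny two-variable example at the end are my own illustrative choices, not code from the lecture.

    # Minimal sum-product variable elimination over discrete factors.
    # A factor is a pair (scope, table): scope is a tuple of variable names and
    # table maps assignments (value tuples, in scope order) to non-negative reals.
    from itertools import product

    def multiply(f1, f2, domains):
        (s1, t1), (s2, t2) = f1, f2
        scope = tuple(dict.fromkeys(s1 + s2))          # union of scopes, order preserving
        table = {}
        for assign in product(*(domains[v] for v in scope)):
            a = dict(zip(scope, assign))
            table[assign] = t1[tuple(a[v] for v in s1)] * t2[tuple(a[v] for v in s2)]
        return (scope, table)

    def sum_out(f, var):
        scope, table = f
        new_scope = tuple(v for v in scope if v != var)
        new_table = {}
        for assign, val in table.items():
            key = tuple(a for v, a in zip(scope, assign) if v != var)
            new_table[key] = new_table.get(key, 0.0) + val
        return (new_scope, new_table)

    def eliminate(factors, order, domains):
        factors = list(factors)
        for var in order:                               # follow the elimination order
            involved = [f for f in factors if var in f[0]]
            factors = [f for f in factors if var not in f[0]]
            if not involved:
                continue
            prod = involved[0]
            for f in involved[1:]:                      # move relevant terms into the product
                prod = multiply(prod, f, domains)
            factors.append(sum_out(prod, var))          # innermost sum -> new term
        return factors

    # Toy usage: p(a) p(b|a); eliminating a leaves the marginal over b (0.62, 0.38).
    domains = {"a": [0, 1], "b": [0, 1]}
    p_a = (("a",), {(0,): 0.6, (1,): 0.4})
    p_b_given_a = (("a", "b"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
    print(eliminate([p_a, p_b_given_a], ["a"], domains))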
Hidden Markov Model. (Figure: chain y_1 -> y_2 -> … -> y_T with emissions x_1, …, x_T.) Joint probability: p(x, y) = p(x_1, …, x_T, y_1, …, y_T) = p(y_1) p(x_1 | y_1) p(y_2 | y_1) p(x_2 | y_2) … p(y_T | y_{T-1}) p(x_T | y_T).
Hidden Markov Model. (Figure: the same chain.) Conditional probability:
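The formula itself is missing on this slide; it was presumably the posterior over the hidden chain, obtained by normalizing the joint from the previous slide:

    p(y_1, …, y_T | x_1, …, x_T) = p(x_1, …, x_T, y_1, …, y_T) / \sum_{y_1, …, y_T} p(x_1, …, x_T, y_1, …, y_T)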
A Bayesian Network: a food web. (Figure: directed graph over nodes A, B, C, D, E, F, G, H.) What is the probability that hawks are leaving given that the grass condition is poor?
Example: Variable Elimination. Query: P(A | h). Need to eliminate: B, C, D, E, F, G, H. Initial factors: the local conditional probabilities of the network. Choose an elimination order: H, G, F, E, D, C, B. Step 1: Conditioning (fix the evidence node, i.e. H, at its observed value). This step is isomorphic to a marginalization step. (Figure: the graph before and after removing H.)
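The factor list and the conditioning formula did not survive extraction. Assuming the factorization usually used for this food-web example, p(a) p(b) p(c|b) p(d|a) p(e|c,d) p(f|a) p(g|e) p(h|e,f), conditioning on the observed value h̃ of H replaces p(h|e,f) by a factor over (e, f) only, which can also be written as a sum against an evidence indicator:

    m_h(e, f) = p(\tilde{h} | e, f) = \sum_h p(h | e, f) \delta(h, \tilde{h})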
Example: Variable Elimination. Query: P(A | h). Need to eliminate: B, C, D, E, F, G. Initial factors: Step 2: Eliminate G; compute the resulting factor (given below). (Figure: the graph after removing G.)
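Under the factorization assumed above, this step would compute:

    m_g(e) = \sum_g p(g | e)

(which equals 1, but is carried along as a factor over e).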
Example: Variable Elimination. Query: P(A | h). Need to eliminate: B, C, D, E, F. Initial factors: Step 3: Eliminate F; compute the resulting factor (given below). (Figure: the graph after removing F.)
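With the same assumed factorization, eliminating F combines the factors that mention F:

    m_f(e, a) = \sum_f p(f | a) m_h(e, f)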
Example: Variable Elimination. Query: P(A | h). Need to eliminate: B, C, D, E. Initial factors: Step 4: Eliminate E; compute the resulting factor (given below). (Figure: the graph after removing E.)
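Continuing under the same assumptions, eliminating E gives:

    m_e(a, c, d) = \sum_e p(e | c, d) m_g(e) m_f(e, a)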
Example: Variable Elimination. Query: P(A | h). Need to eliminate: B, C, D. Initial factors: Step 5: Eliminate D; compute the resulting factor (given below). (Figure: the graph after removing D.)
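Again under the assumed factorization, eliminating D gives:

    m_d(a, c) = \sum_d p(d | a) m_e(a, c, d)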
Example: Variable Elimination. Query: P(A | h). Need to eliminate: B, C. Initial factors: Step 6: Eliminate C; compute the resulting factor (given below). (Figure: the graph after removing C.)
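Under the same assumptions, eliminating C gives:

    m_c(a, b) = \sum_c p(c | b) m_d(a, c)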
Example: Variable Elimination. Query: P(A | h). Need to eliminate: B. Initial factors: Step 7: Eliminate B; compute the resulting factor (given below). (Figure: only node A remains.)
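Under the same assumptions, eliminating B gives:

    m_b(a) = \sum_b p(b) m_c(a, b)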
Example: Variable Elimination. Query: P(A | h). Nothing left to eliminate. Initial factors: Step 8: Wrap-up: normalize to obtain the conditional (given below).
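The wrap-up formulas are missing; under the assumed factorization, with h̃ the observed value of H, they would be:

    p(a, \tilde{h}) = p(a) m_b(a)
    p(a | \tilde{h}) = p(a) m_b(a) / \sum_a p(a) m_b(a)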
Complexity of Variable Elimination. Suppose in one elimination step we compute an intermediate factor of the form shown below, summing a product of k factors over X. This requires k · |Val(X)| · ∏_j |Val(Y_j)| multiplications (for each value of x, y_1, …, y_k, we do k multiplications) and |Val(X)| · ∏_j |Val(Y_j)| additions (for each value of y_1, …, y_k, we do |Val(X)| additions). Complexity is exponential in the number of variables in the intermediate factor.
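The form of the intermediate factor did not survive extraction; in standard notation it would be, with y_{c_i} denoting the subset of {y_1, …, y_k} appearing in the i-th incoming factor:

    m_x(y_1, …, y_k) = \sum_x \prod_{i=1}^{k} m_i(x, y_{c_i})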
Elimination Cliques. (Figure: the sequence of graphs produced by eliminating H, G, F, E, D, C, B in turn from the food-web example, with the clique created at each step highlighted.)
Understanding Variable Elimination. It is a graph elimination algorithm: intermediate terms correspond to the cliques resulting from elimination. "Good" elimination orderings lead to small cliques and hence reduce complexity (what would happen if we eliminated E first in the above graph?). Finding the optimum ordering is NP-hard, but for many graphs an optimum or near-optimum ordering can often be found heuristically. The same view applies to undirected GMs. (Figure: moralization followed by graph elimination.)
From Elimination to Belief Propagation. Recall that the induced dependency created during marginalization is captured in elimination cliques: summation <-> elimination; intermediate term <-> elimination clique. Can this lead to a generic inference algorithm? (Figure: the food-web graph.)
Tree GMs. Directed tree: all nodes except the root have exactly one parent. Undirected tree: there is a unique path between any pair of nodes. Polytree: nodes can have multiple parents. We will come back to this later.
Equivalence of Directed and Undirected Trees. Any undirected tree can be converted to a directed tree by choosing a root node and directing all edges away from it. A directed tree and the corresponding undirected tree make the same conditional independence assertions, and their parameterizations are essentially the same. Undirected tree: Directed tree: Equivalence: Evidence: ?
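The parameterizations these labels refer to are missing; a standard reconstruction (assuming root r and parent function π(i) in the directed version) is:

    Undirected tree:  p(x) = (1/Z) \prod_{i \in V} \psi(x_i) \prod_{(i,j) \in E} \psi(x_i, x_j)
    Directed tree:    p(x) = p(x_r) \prod_{i \neq r} p(x_i | x_{\pi(i)})
    Equivalence:      \psi(x_r) = p(x_r),  \psi(x_i, x_j) = p(x_j | x_i) for each edge i -> j,  all other \psi(x_i) = 1,  Z = 1
    Evidence:         an observation x̃_i can be absorbed by multiplying \psi(x_i) by the indicator \delta(x_i, x̃_i)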
From Elimination to Message Passing. Recall the ELIMINATION algorithm: choose an ordering Z in which the query node f is the final node; place all potentials on an active list; eliminate node i by removing all potentials containing i, taking the sum of their product over x_i, and placing the resultant factor back on the list. For a TREE graph: choose the query node f as the root of the tree; view the tree as a directed tree with edges pointing towards f; the elimination ordering is based on a depth-first traversal. Elimination of each node can then be viewed as message passing (or belief propagation) directly along tree branches, rather than on some transformed graph. Thus, we can use the tree itself as a data structure to do general inference!
Message Passing for Trees: a mathematical summary of the elimination algorithm. Let m_{ji}(x_i) denote the factor resulting from eliminating the variables below j up to i; it is a function of x_i. This is reminiscent of a message sent from j to i. m_{ji}(x_i) represents a "belief" about x_i coming from x_j! (Figure: a subtree with node i, its child j, and j's children k and l.)
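The update itself is missing here; the standard sum-product message from j to i on a tree, with ψ^E(x_j) the local evidence potential at j and N(j) the neighbors of j, is:

    m_{ji}(x_i) = \sum_{x_j} \psi^{E}(x_j) \psi(x_i, x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j)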
Elimination on trees is equivalent to message passing along tree branches! (Figure: a tree rooted at the query node f, with messages flowing from the leaves k and l through j and i toward f.)
The message passing protocol: a node can send a message to its neighbors when (and only when) it has received messages from all its other neighbors. Computing node marginals, naïve approach: consider each node as the root and execute the message passing algorithm. (Figure: computing P(X_1) on a 4-node star with center X_2: messages m_32(x_2) and m_42(x_2), then m_21(x_1).)
The message passing protocol: a node can send a message to its neighbors when (and only when) it has received messages from all its other neighbors. Computing node marginals, naïve approach: consider each node as the root and execute the message passing algorithm. (Figure: computing P(X_2) on the same tree: messages m_12(x_2), m_32(x_2), m_42(x_2).)
The message passing protocol: a node can send a message to its neighbors when (and only when) it has received messages from all its other neighbors. Computing node marginals, naïve approach: consider each node as the root and execute the message passing algorithm. (Figure: computing P(X_3) on the same tree: messages m_12(x_2) and m_42(x_2), then m_23(x_3).)
Computing Node Marginals. Naïve approach complexity: N · C, where N is the number of nodes and C is the complexity of a complete message passing run. Alternative dynamic programming approach: the 2-pass algorithm (next slide), with complexity 2C!
The message passing protocol: a two-pass algorithm. (Figure: the 4-node tree from the previous slides, with an upward pass collecting messages toward a root followed by a downward pass distributing them back out.)
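For concreteness, here is a minimal Python sketch of the two-pass sum-product algorithm on a tree of discrete variables; the data structures, function names, and the toy 4-node example (X_2 connected to X_1, X_3, X_4, as in the figures above) are illustrative assumptions, not the lecture's code.

    # Two-pass sum-product on a tree. node_pot[i][xi] are singleton potentials,
    # edge_pot[(i, j)][xi][xj] pairwise potentials (stored once per undirected edge),
    # adj is the adjacency list. Messages m[(j, i)][xi] are computed leaves-to-root,
    # then root-to-leaves; marginals are normalized products of incoming messages.

    def edge(i, j, edge_pot, xi, xj):
        # Look up the pairwise potential regardless of storage direction.
        return edge_pot[(i, j)][xi][xj] if (i, j) in edge_pot else edge_pot[(j, i)][xj][xi]

    def send(j, i, node_pot, edge_pot, adj, m):
        # Message j -> i: sum over x_j of psi(x_j) psi(x_i, x_j) * incoming messages.
        msg = []
        for xi in range(len(node_pot[i])):
            total = 0.0
            for xj in range(len(node_pot[j])):
                prod = node_pot[j][xj] * edge(i, j, edge_pot, xi, xj)
                for k in adj[j]:
                    if k != i:
                        prod *= m[(k, j)][xj]
                total += prod
            msg.append(total)
        m[(j, i)] = msg

    def incoming(i, x, adj, m):
        p = 1.0
        for k in adj[i]:
            p *= m[(k, i)][x]
        return p

    def sum_product(root, node_pot, edge_pot, adj):
        m, order = {}, []
        def collect(i, parent):                    # post-order DFS from the root
            for j in adj[i]:
                if j != parent:
                    collect(j, i)
            order.append((i, parent))
        collect(root, None)
        for i, parent in order:                    # upward pass (leaves -> root)
            if parent is not None:
                send(i, parent, node_pot, edge_pot, adj, m)
        for i, parent in reversed(order):          # downward pass (root -> leaves)
            if parent is not None:
                send(parent, i, node_pot, edge_pot, adj, m)
        marg = {}
        for i in adj:
            b = [node_pot[i][x] * incoming(i, x, adj, m) for x in range(len(node_pot[i]))]
            z = sum(b)
            marg[i] = [v / z for v in b]
        return marg

    # Toy usage: the 4-node tree from the slides, with made-up binary potentials.
    adj = {1: [2], 2: [1, 3, 4], 3: [2], 4: [2]}
    node_pot = {i: [1.0, 1.0] for i in adj}
    edge_pot = {(1, 2): [[1.2, 0.8], [0.8, 1.2]],
                (2, 3): [[1.5, 0.5], [0.5, 1.5]],
                (2, 4): [[1.0, 1.0], [1.0, 1.0]]}
    print(sum_product(root=2, node_pot=node_pot, edge_pot=edge_pot, adj=adj))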
Belief Propagation (SP-algorithm): Sequential implementation
Belief Propagation (SP-algorithm): Parallel synchronous implementation. For a node of degree d, whenever messages have arrived on any subset of d-1 of its edges, compute the message for the remaining edge and send it! Upon termination, a pair of messages has been computed for each edge, one in each direction, and all incoming messages have been computed for each node.
Correctness of BP on Trees. Corollary: the synchronous implementation is "non-blocking". Thm: message passing guarantees obtaining all marginals in the tree. What about non-trees?
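The equation accompanying the theorem is missing; the standard statement is that once both messages on every edge have been computed, each node marginal is the normalized product of the local evidence potential and all incoming messages:

    p(x_i) \propto \psi^{E}(x_i) \prod_{k \in N(i)} m_{ki}(x_i)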
Inference on General GMs. Now, what if the GM is not a tree-like graph? Can we still directly run the message-passing protocol along its edges? For non-trees, we do not have the guarantee that message passing will be consistent! Then what? Construct a graph data structure from P that has a tree structure, and run message passing on it! This is the junction tree algorithm.
Elimination Cliques. Recall that the induced dependency created during marginalization is captured in elimination cliques: summation <-> elimination; intermediate term <-> elimination clique. Can this lead to a generic inference algorithm? (Figure: the food-web graph.)
A Clique Tree. (Figure: the clique tree built from the elimination cliques of the food-web example, e.g. {E,G}, {E,F,H}, {A,E,F}, {A,C,D,E}, {A,C,D}, {A,B,C}, {A,B}, {A}, connected through their shared variables.)
From Elimination to Message Passing. Elimination ≡ message passing on a clique tree; messages can be reused. (Figure: the food-web graph next to its clique tree, with the elimination messages placed on the clique-tree edges.)
From Elimination to Message Passing. Elimination ≡ message passing on a clique tree. For another query, the messages m_f and m_h are reused; the others need to be recomputed. (Figure: the clique tree with the reused and recomputed messages indicated.)
The Shafer-Shenoy Algorithm. Message from clique i to clique j: Clique marginal:
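The two formulas are missing from this slide; in standard notation, with C_i the cliques, S_{ij} = C_i ∩ C_j the separators, and ψ_{C_i} the clique potentials, the Shafer-Shenoy message and clique marginal would read:

    m_{i \to j}(S_{ij}) = \sum_{C_i \setminus S_{ij}} \psi_{C_i}(x_{C_i}) \prod_{k \in N(i) \setminus j} m_{k \to i}(S_{ki})
    p(x_{C_i}) \propto \psi_{C_i}(x_{C_i}) \prod_{k \in N(i)} m_{k \to i}(S_{ki})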
A Sketch of the Junction Tree Algorithm. The algorithm: construction of a junction tree (a special clique tree); propagation of probabilities (a message-passing protocol); results in marginal probabilities of all cliques, solving all queries in a single run. A generic exact inference algorithm for any GM. Complexity: exponential in the size of the maximal clique; a good elimination order often leads to a small maximal clique, and hence a good (i.e., thin) JT. Many well-known algorithms are special cases of JT: forward-backward, Kalman filtering, peeling, sum-product, ...
The Junction Tree Algorithm for HMMs. (Figure: the HMM chain transformed into its junction tree of cliques over consecutive hidden states and state-emission pairs.) The rightward pass is exactly the forward algorithm! The leftward pass is exactly the backward algorithm!
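The recursions computed by the two passes are not shown; a standard way to write them (a reconstruction, not copied from the slide) is, with α the rightward and β the leftward messages:

    \alpha_t(y_t) = p(x_t | y_t) \sum_{y_{t-1}} p(y_t | y_{t-1}) \alpha_{t-1}(y_{t-1}),   with   \alpha_1(y_1) = p(y_1) p(x_1 | y_1)
    \beta_t(y_t) = \sum_{y_{t+1}} p(y_{t+1} | y_t) p(x_{t+1} | y_{t+1}) \beta_{t+1}(y_{t+1}),   with   \beta_T(y_T) = 1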
Summary. The simple Eliminate algorithm captures the key algorithmic operation underlying probabilistic inference: taking a sum over a product of potential functions. The computational complexity of the Eliminate algorithm can be reduced to purely graph-theoretic considerations. This graph interpretation also provides hints about how to design improved inference algorithms. What can we say about the overall computational complexity of the algorithm? In particular, how can we control the "size" of the summands that appear in the sequence of summation operations?