Machine Learning – Lecture 18


Machine Learning – Lecture 18 Exact Inference & Belief Propagation 13.01.2014 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Many slides adapted from C. Bishop, Z. Ghahramani

Announcements
Exercise 5 now on the web: Bayes Nets (Conditional Independence, D-Separation, Bayes Ball). Due date: Thursday in TWO weeks (14.07.). → Submit your solutions by Monday evening!
Next week: Exercise 4 on Tuesday.
Today: Factor graphs, Sum-Product algorithm, Junction tree algorithm.
Next lecture: Markov random fields, Graph Cuts.
B. Leibe

Course Outline
Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation
Discriminative Approaches (5 weeks): Linear Discriminant Functions, Statistical Learning Theory & SVMs, Ensemble Methods & Boosting, Decision Trees & Randomized Trees
Generative Models (4 weeks): Bayesian Networks, Markov Random Fields, Exact Inference
B. Leibe

Recap: Undirected Graphical Models Undirected graphical models (“Markov Random Fields”) Given by an undirected graph. Conditional independence for undirected graphs: If every path from any node in set A to any node in set B passes through at least one node in set C, then A ⊥⊥ B | C. Simple Markov blanket: the Markov blanket of a node is just the set of its neighbors in the graph. B. Leibe Image source: C. Bishop, 2006

Recap: Factorization in MRFs Joint distribution Written as a product of potential functions over the maximal cliques in the graph. The normalization constant Z is called the partition function. Remarks BNs are automatically normalized, but for MRFs we have to perform the normalization explicitly. The presence of the normalization constant is a major limitation! Evaluation of Z involves summing over O(K^M) terms for M discrete nodes with K states each! B. Leibe
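Written out (standard MRF notation as in Bishop, Ch. 8), the factorization and partition function summarized above are
\[
p(\mathbf{x}) \;=\; \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C),
\qquad
Z \;=\; \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C),
\]
where C ranges over the maximal cliques of the graph and the clique potentials satisfy ψ_C(x_C) ≥ 0.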

Recap: Factorization in MRFs Role of the potential functions General interpretation No restriction to potential functions that have a specific probabilistic interpretation as marginals or conditional distributions. Convenient to express them as exponential functions (“Boltzmann distribution”) with an energy function E. Why is this convenient? Joint distribution is the product of potentials → sum of energies. We can take the log and simply work with the sums… B. Leibe
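Concretely (Bishop, Sec. 8.3.2), the exponential parameterization is
\[
\psi_C(\mathbf{x}_C) \;=\; \exp\{-E(\mathbf{x}_C)\}
\quad\Longrightarrow\quad
p(\mathbf{x}) \;=\; \frac{1}{Z}\exp\Big\{-\sum_{C} E(\mathbf{x}_C)\Big\},
\]
so taking the logarithm turns the product of potentials into a (negative) sum of clique energies.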

Recap: Converting Directed to Undirected Graphs Problematic case: multiple parents Need to introduce additional links (“marry the parents”). → This process is called moralization; it results in the moral graph. Fully connected, no conditional independencies! Need a clique of x1,…,x4 to represent this factor! Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Recap: Conversion Algorithm General procedure to convert directed → undirected Add undirected links to marry the parents of each node. Drop the arrows on the original links → moral graph. Find maximal cliques for each node and initialize all clique potentials to 1. Take each conditional distribution factor of the original directed graph and multiply it into one clique potential. Restriction Conditional independence properties are often lost! Moralization results in additional connections and larger cliques. Slide adapted from Chris Bishop B. Leibe
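As a small illustration of the first two steps, here is a sketch of moralization (not code from the lecture; the dict-of-parents input format is just an assumed convenience):

```python
from itertools import combinations

def moralize(parents):
    """Moralize a DAG given as {node: [parents]}: marry the parents of each
    node and drop edge directions. Returns a set of undirected edges."""
    undirected = set()
    for child, pa in parents.items():
        # drop arrow directions: keep an undirected parent--child edge
        for p in pa:
            undirected.add(frozenset((p, child)))
        # "marry the parents": connect all co-parents of the same child
        for p, q in combinations(pa, 2):
            undirected.add(frozenset((p, q)))
    return undirected

# Example: x4 has parents x1, x2, x3 -> moral graph is fully connected on {x1,...,x4}
print(moralize({"x4": ["x1", "x2", "x3"]}))
```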

Computing Marginals How do we apply graphical models? Given some observed variables, we want to compute distributions of the unobserved variables. In particular, we want to compute marginal distributions, for example p(x4). How can we compute marginals? Classical technique: sum-product algorithm by Judea Pearl. In the context of (loopy) undirected models, this is also called (loopy) belief propagation [Weiss, 1997]. Basic idea: message-passing. Slide credit: Bernt Schiele, Stefan Roth B. Leibe

Inference on a Chain Chain graph Joint probability Marginalization Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Inference on a Chain Idea: Split the computation into two parts (“messages”). Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Inference on a Chain We can define the messages recursively… Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Inference on a Chain Until we reach the leaf nodes… Interpretation We pass messages from the two ends towards the query node xn. We still need the normalization constant Z. This can be easily obtained from the marginals: B. Leibe Image source: C. Bishop, 2006

Recap: Message Passing on a Chain Idea Pass messages from the two ends towards the query node xn. Define the messages recursively: Compute the normalization constant Z at any node xm. Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006
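For a chain \(x_1 - x_2 - \cdots - x_N\) with joint distribution \(p(\mathbf{x}) = \frac{1}{Z}\,\psi_{1,2}(x_1,x_2)\,\psi_{2,3}(x_2,x_3)\cdots\psi_{N-1,N}(x_{N-1},x_N)\), the recursions summarized above read (cf. Bishop, Sec. 8.4.1):
\[
\mu_\alpha(x_n) \;=\; \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\,\mu_\alpha(x_{n-1}),
\qquad
\mu_\beta(x_n) \;=\; \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\,\mu_\beta(x_{n+1}),
\]
initialized with \(\mu_\alpha(x_1) = 1\) and \(\mu_\beta(x_N) = 1\) at the two ends. The marginal and normalization constant are then
\[
p(x_n) \;=\; \frac{1}{Z}\,\mu_\alpha(x_n)\,\mu_\beta(x_n),
\qquad
Z \;=\; \sum_{x_n} \mu_\alpha(x_n)\,\mu_\beta(x_n).
\]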

Summary: Inference on a Chain To compute local marginals: Compute and store all forward messages μ_α(x_n). Compute and store all backward messages μ_β(x_n). Compute Z at any node x_m. Compute p(x_n) = (1/Z) μ_α(x_n) μ_β(x_n) for all variables required. Inference through message passing We have thus seen a first message passing algorithm. How can we generalize this? Slide adapted from Chris Bishop B. Leibe
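A compact sketch of this forward–backward message passing for a chain of discrete variables (the NumPy list-of-matrices interface is only an illustrative choice, not from the lecture):

```python
import numpy as np

def chain_marginals(potentials):
    """Exact marginals on a chain x_1 -- x_2 -- ... -- x_N.
    potentials[i] is a (K_i x K_{i+1}) matrix psi_{i,i+1}(x_i, x_{i+1})."""
    # forward messages mu_alpha (left -> right)
    alpha = [np.ones(potentials[0].shape[0])]
    for psi in potentials:
        alpha.append(alpha[-1] @ psi)
    # backward messages mu_beta (right -> left)
    beta = [np.ones(potentials[-1].shape[1])]
    for psi in reversed(potentials):
        beta.insert(0, psi @ beta[0])
    # p(x_n) = (1/Z) mu_alpha(x_n) mu_beta(x_n); Z is the same at every node
    marginals = []
    for a, b in zip(alpha, beta):
        m = a * b
        marginals.append(m / m.sum())
    return marginals

# Example: 3 binary nodes with a potential favouring equal neighbours
psi = np.array([[2.0, 1.0], [1.0, 2.0]])
print(chain_marginals([psi, psi]))   # uniform marginals, by symmetry
```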

Inference on Trees Let’s next assume a tree graph. Example: We are given the following joint distribution: Assume we want to know the marginal p(E)… Slide credit: Bernt Schiele, Stefan Roth B. Leibe

Inference on Trees Strategy Marginalize out all other variables by summing over them. Then rearrange terms: Slide credit: Bernt Schiele, Stefan Roth B. Leibe

Marginalization with Messages Use messages to express the marginalization: Slide credit: Bernt Schiele, Stefan Roth B. Leibe


Inference on Trees We can generalize this for all tree graphs. Root the tree at the variable that we want to compute the marginal of. Start computing messages at the leaves. Compute the messages for all nodes for which all incoming messages have already been computed. Repeat until we reach the root. If we want to compute the marginals for all possible nodes (roots), we can reuse some of the messages. Computational expense linear in the number of nodes. Slide credit: Bernt Schiele, Stefan Roth B. Leibe

Recap: Message Passing on Trees General procedure for all tree graphs. Root the tree at the variable that we want to compute the marginal of. Start computing messages at the leaves. Compute the messages for all nodes for which all incoming messages have already been computed. Repeat until we reach the root. If we want to compute the marginals for all possible nodes (roots), we can reuse some of the messages. Computational expense linear in the number of nodes. We already motivated message passing for inference. How can we formalize this into a general algorithm? Slide credit: Bernt Schiele, Stefan Roth B. Leibe

How Can We Generalize This? Message passing algorithm motivated for trees. Now: generalize this to directed polytrees. We do this by introducing a common representation → factor graphs Undirected Tree Directed Tree Polytree B. Leibe Image source: C. Bishop, 2006

Topics of This Lecture Factor graphs Construction Properties Sum-Product Algorithm for computing marginals Key ideas Derivation Example Max-Sum Algorithm for finding most probable value Algorithms for loopy graphs Junction Tree algorithm Loopy Belief Propagation B. Leibe

Factor Graphs Motivation Joint probabilities on both directed and undirected graphs can be expressed as a product of factors over subsets of variables. Factor graphs make this decomposition explicit by introducing separate nodes for the factors. (Figure: the joint probability drawn with regular variable nodes and separate factor nodes.) Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006
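The factorization a factor graph represents is
\[
p(\mathbf{x}) \;=\; \prod_{s} f_s(\mathbf{x}_s),
\]
where each factor f_s depends only on the subset of variables x_s it is connected to. For a directed graph, the f_s are the conditional distributions (so the product is already normalized); for an undirected graph, they are the clique potentials together with the constant 1/Z.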

Factor Graphs from Directed Graphs Conversion procedure Take variable nodes from directed graph. Create factor nodes corresponding to conditional distributions. Add the appropriate links. → Different factor graphs possible for same directed graph. Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Factor Graphs from Undirected Graphs Some factor graphs for the same undirected graph: → The factor graph keeps the factors explicit and can thus convey more detailed information about the underlying factorization! Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Factor Graphs – Why Are They Needed? Converting a directed or undirected tree to factor graph The result will again be a tree. Converting a directed polytree Conversion to undirected tree creates loops due to moralization! Conversion to a factor graph again results in a tree. B. Leibe Image source: C. Bishop, 2006

Topics of This Lecture Factor graphs Construction Properties Sum-Product Algorithm for computing marginals Key ideas Derivation Example Max-Sum Algorithm for finding most probable value Algorithms for loopy graphs Junction Tree algorithm Loopy Belief Propagation B. Leibe

Sum-Product Algorithm Objectives Efficient, exact inference algorithm for finding marginals. In situations where several marginals are required, allow computations to be shared efficiently. General form of message-passing idea Applicable to tree-structured factor graphs. → Original graph can be undirected tree or directed tree/polytree. Key idea: Distributive Law → Exchange summations and products exploiting the tree structure of the factor graph. Let’s assume first that all nodes are hidden (no observations). Slide adapted from Chris Bishop B. Leibe

Sum-Product Algorithm ne(x): neighbors of x. Goal: Compute marginal for x: Tree structure of graph allows us to partition the joint distrib. into groups associated with each neighboring factor node: Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm ne(x): neighbors of x. Marginal: Exchanging products and sums: Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm Marginal: Exchanging products and sums: This defines a first type of message, μ_{fs→x}(x): B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm First message type: Evaluating the messages: Each factor Fs(x,Xs) is again described by a factor (sub-)graph. → Can itself be factorized: Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm First message type: Evaluating the messages: Thus, we can write Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm First message type: Evaluating the messages: Thus, we can write Second message type: B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm Recursive definition: Recursive message evaluation: Exchanging sum and product, we again get Each term Gm(xm, Xsm) is again given by a product B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm – Summary Two kinds of messages Message from factor node to variable nodes: Sum of factor contributions Message from variable node to factor node: Product of incoming messages → Simple propagation scheme. B. Leibe
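In formulas (Bishop, Sec. 8.4.4), the two message types and the resulting marginal are
\[
\mu_{f \to x}(x) \;=\; \sum_{x_1} \cdots \sum_{x_M} f(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f) \setminus \{x\}} \mu_{x_m \to f}(x_m),
\]
\[
\mu_{x \to f}(x) \;=\; \prod_{l \in \mathrm{ne}(x) \setminus \{f\}} \mu_{f_l \to x}(x),
\qquad
p(x) \;\propto\; \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x),
\]
with leaf initializations \(\mu_{x \to f}(x) = 1\) for a leaf variable node and \(\mu_{f \to x}(x) = f(x)\) for a leaf factor node.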

Sum-Product Algorithm Initialization Start the recursion by sending out messages from the leaf nodes Propagation procedure A node can send out a message once it has received incoming messages from all other neighboring nodes. Once a variable node has received all messages from its neighboring factor nodes, we can compute its marginal by multiplying all messages and renormalizing: B. Leibe Image source: C. Bishop, 2006

Sum-Product Algorithm – Summary To compute local marginals: Pick an arbitrary node as root. Compute and propagate messages from the leaf nodes to the root, storing received messages at every node. Compute and propagate messages from the root to the leaf nodes, storing received messages at every node. Compute the product of received messages at each node for which the marginal is required, and normalize if necessary. Computational effort Total number of messages = 2 × number of links in the graph. Maximal parallel runtime = 2 × tree height. Slide adapted from Chris Bishop B. Leibe

Sum-Product: Example We want to compute the values of all marginals… Picking x3 as root… → x1 and x4 are leaves. Unnormalized joint distribution: Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Sum-Product: Example Message definitions: Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006


Sum-Product: Example Message definitions: Verify that marginal is correct: Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006
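A numerical sketch of this example, assuming the factor graph from Bishop's book with factors f_a(x1, x2), f_b(x2, x3), f_c(x2, x4) over binary variables; the potential tables are made up for illustration. The message-passing result for p(x2) is checked against brute-force marginalization:

```python
import numpy as np
from itertools import product

# Illustrative (made-up) binary potential tables for the example factor graph
fa = np.array([[1.0, 3.0], [2.0, 1.0]])   # fa[x1, x2]
fb = np.array([[2.0, 1.0], [1.0, 4.0]])   # fb[x2, x3]
fc = np.array([[1.0, 2.0], [5.0, 1.0]])   # fc[x2, x4]

# Sum-product messages into variable node x2 (leaf variable nodes send 1)
mu_fa_x2 = fa.sum(axis=0)                 # sum over x1 of fa(x1, x2)
mu_fb_x2 = fb.sum(axis=1)                 # sum over x3 of fb(x2, x3)
mu_fc_x2 = fc.sum(axis=1)                 # sum over x4 of fc(x2, x4)
p_tilde_x2 = mu_fa_x2 * mu_fb_x2 * mu_fc_x2

# Verify against brute-force marginalization of the unnormalized joint
brute = np.zeros(2)
for x1, x2, x3, x4 in product(range(2), repeat=4):
    brute[x2] += fa[x1, x2] * fb[x2, x3] * fc[x2, x4]

print(p_tilde_x2 / p_tilde_x2.sum())      # normalized marginal p(x2)
assert np.allclose(p_tilde_x2, brute)     # messages reproduce the brute-force result exactly
```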

Sum-Product Algorithm – Extensions Dealing with observed nodes Until now we had assumed that all nodes were hidden… Observed nodes can easily be incorporated: Partition x into hidden variables h and observed variables v. Simply multiply the joint distribution p(x) by an indicator function that clamps each observed variable to its observed value (written out below). → Any summation over variables in v collapses into a single term. Further generalizations So far, assumption that we are dealing with discrete variables. But the sum-product algorithm can also be generalized to simple continuous variable distributions, e.g. linear-Gaussian variables. B. Leibe
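The clamping step referred to above (standard construction, cf. Bishop, Sec. 8.4.4):
\[
\hat p(\mathbf{x}) \;=\; p(\mathbf{x}) \prod_{i} I(v_i, \hat v_i),
\qquad
I(v_i, \hat v_i) \;=\;
\begin{cases}
1 & \text{if } v_i = \hat v_i,\\
0 & \text{otherwise.}
\end{cases}
\]
Running sum-product on \(\hat p\) then yields \(p(\mathbf{h}, \mathbf{v} = \hat{\mathbf{v}})\), which is proportional to the posterior \(p(\mathbf{h} \mid \mathbf{v} = \hat{\mathbf{v}})\).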

Topics of This Lecture Factor graphs Construction Properties Sum-Product Algorithm for computing marginals Key ideas Derivation Example Max-Sum Algorithm for finding most probable value Algorithms for loopy graphs Junction Tree algorithm Loopy Belief Propagation B. Leibe

Max-Sum Algorithm Objective: an efficient algorithm for finding the value xmax that maximises p(x); the value of p(xmax). → Application of dynamic programming in graphical models. In general, maximum of the marginals ≠ joint maximum. Example: Slide adapted from Chris Bishop B. Leibe

Max-Sum Algorithm – Key Ideas Key idea 1: Distributive Law (again) → Exchange products/summations and max operations exploiting the tree structure of the factor graph. Key idea 2: Max-Product → Max-Sum We are interested in the maximum value of the joint distribution → Maximize the product p(x). For numerical reasons, use the logarithm. → Maximize the sum (of log-probabilities). B. Leibe

Max-Sum Algorithm Maximizing over a chain (max-product) Exchange max and product operators Generalizes to tree-structured factor graph Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006

Max-Sum Algorithm Initialization (leaf nodes) Recursion Messages For each node, keep a record of which values of the variables gave rise to the maximum state: Slide adapted from Chris Bishop B. Leibe
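The corresponding message equations (Bishop, Sec. 8.4.5): initialize at leaf nodes with \(\mu_{x \to f}(x) = 0\) and \(\mu_{f \to x}(x) = \ln f(x)\), then recurse via
\[
\mu_{f \to x}(x) \;=\; \max_{x_1, \ldots, x_M}\Big[\ln f(x, x_1, \ldots, x_M) + \sum_{m \in \mathrm{ne}(f) \setminus \{x\}} \mu_{x_m \to f}(x_m)\Big],
\qquad
\mu_{x \to f}(x) \;=\; \sum_{l \in \mathrm{ne}(x) \setminus \{f\}} \mu_{f_l \to x}(x),
\]
and at each step store \(\phi(x) = \arg\max\) of the bracketed term for back-tracking.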

Max-Sum Algorithm Termination (root node) Score of maximal configuration Value of root node variable giving rise to that maximum Back-track to get the remaining variable values Slide adapted from Chris Bishop B. Leibe

Visualization of the Back-Tracking Procedure Example: Markov chain → Same idea as in the Viterbi algorithm for HMMs… (Figure: trellis of states over the chain variables, with the stored back-tracking links.) Slide adapted from Chris Bishop B. Leibe Image source: C. Bishop, 2006
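A minimal Viterbi-style sketch of max-sum with back-tracking on a chain, assuming the pairwise log-potentials are given as matrices (illustrative interface only):

```python
import numpy as np

def max_sum_chain(log_psi):
    """Most probable configuration on a chain with pairwise log-potentials.
    log_psi[n] is a (K x K) matrix of log psi_{n,n+1}(x_n, x_{n+1})."""
    msg = np.zeros(log_psi[0].shape[0])      # message into the first node
    backptr = []
    for lp in log_psi:                       # forward pass: max-sum messages
        scores = msg[:, None] + lp           # scores[x_n, x_{n+1}]
        backptr.append(scores.argmax(axis=0))  # stored links phi(x_{n+1})
        msg = scores.max(axis=0)
    best_last = int(msg.argmax())            # maximizing state of the last node
    path = [best_last]
    for phi in reversed(backptr):            # back-track the stored links
        path.append(int(phi[path[-1]]))
    return msg.max(), path[::-1]

# Example: 3 nodes, potentials that prefer equal neighbouring states
lp = np.log(np.array([[2.0, 1.0], [1.0, 2.0]]))
print(max_sum_chain([lp, lp]))               # -> (log 4, [0, 0, 0])
```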

Topics of This Lecture Factor graphs Construction Properties Sum-Product Algorithm for computing marginals Key ideas Derivation Example Max-Sum Algorithm for finding most probable value Algorithms for loopy graphs Junction Tree algorithm Loopy Belief Propagation B. Leibe

Junction Tree Algorithm Motivation Exact inference on general graphs. Works by turning the initial graph into a junction tree and then running a sum-product-like algorithm. Intractable on graphs with large cliques. Main steps If starting from directed graph, first convert it to an undirected graph by moralization. Introduce additional links by triangulation in order to reduce the size of cycles. Find cliques of the moralized, triangulated graph. Construct a new graph from the maximal cliques. Remove minimal links to break cycles and get a junction tree. → Apply regular message passing to perform inference. B. Leibe

Junction Tree Algorithm Starting from a directed graph… Slide adapted from Zoubin Ghahramani B. Leibe Image source: Z. Ghahramani

Junction Tree Algorithm Convert to an undirected graph through moralization. Marry the parents of each node. Remove edge directions. Slide adapted from Zoubin Ghahramani B. Leibe Image source: Z. Ghahramani

Junction Tree Algorithm Triangulate Such that there is no loop of length > 3 without a chord. This is necessary so that the final junction tree satisfies the “running intersection” property (explained later). Slide adapted from Zoubin Ghahramani B. Leibe Image source: Z. Ghahramani

Junction Tree Algorithm Find cliques of the moralized, triangulated graph. Slide adapted from Zoubin Ghahramani B. Leibe Image source: Z. Ghahramani

Junction Tree Algorithm Construct a new junction graph from maximal cliques. Create a node from each clique. Each link carries a list of all variables in the intersection, drawn in a “separator” box. B. Leibe Image source: Z. Ghahramani

Junction Tree Algorithm Remove links to break cycles → junction tree. For each cycle, remove the link(s) with the minimal number of shared nodes until all cycles are broken. The result is a maximal spanning tree, the junction tree (see the sketch below). B. Leibe Image source: Z. Ghahramani
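This last step is equivalent to keeping a maximum-weight spanning tree of the junction graph, with edge weights given by separator sizes. A small sketch of that construction (the clique list at the bottom is made up for illustration):

```python
def junction_tree(cliques):
    """Maximum-weight spanning tree over cliques, weights = |intersection|.
    Returns a list of (clique_i, clique_j, separator) edges."""
    # candidate links between every pair of cliques that share variables
    edges = []
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            sep = cliques[i] & cliques[j]
            if sep:
                edges.append((len(sep), i, j, sep))
    edges.sort(reverse=True)              # heaviest separators first (Kruskal)

    parent = list(range(len(cliques)))    # union-find to avoid closing cycles
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    tree = []
    for _, i, j, sep in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # keep the link only if it breaks no cycle
            parent[ri] = rj
            tree.append((cliques[i], cliques[j], sep))
    return tree

# Illustrative cliques of a moralized, triangulated graph
cliques = [frozenset("ABC"), frozenset("BCD"), frozenset("CDE")]
print(junction_tree(cliques))
```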

Junction Tree – Properties Running intersection property “If a variable appears in more than one clique, it also appears in all intermediate cliques in the tree.” This ensures that neighboring cliques have consistent probability distributions. Local consistency → global consistency Slide adapted from Zoubin Ghahramani B. Leibe Image source: Z. Ghahramani

Junction Tree: Example 1 Algorithm Moralization Triangulation (not necessary here) B. Leibe Image source: J. Pearl, 1988

Junction Tree: Example 1 Algorithm Moralization Triangulation (not necessary here) Find cliques Construct junction graph B. Leibe Image source: J. Pearl, 1988

Junction Tree: Example 1 Algorithm Moralization Triangulation (not necessary here) Find cliques Construct junction graph Break links to get junction tree B. Leibe Image source: J. Pearl, 1988

Junction Tree: Example 2 Without triangulation step The final graph will contain cycles that we cannot break without losing the running intersection property! B. Leibe Image source: J. Pearl, 1988

Junction Tree: Example 2 When applying the triangulation Only small cycles remain that are easy to break. Running intersection property is maintained. B. Leibe Image source: J. Pearl, 1988

Junction Tree Algorithm Good news The junction tree algorithm is efficient in the sense that for a given graph there does not exist a computationally cheaper approach. Bad news This may still be too costly. Effort determined by the number of variables in the largest clique. Grows exponentially with this number (for discrete variables). → Algorithm becomes impractical if the graph contains large cliques! B. Leibe

Loopy Belief Propagation Alternative algorithm for loopy graphs Sum-Product on general graphs. Strategy: simply ignore the problem. Initial unit messages passed across all links, after which messages are passed around until convergence. Convergence is not guaranteed! Typically break off after a fixed number of iterations. Approximate but tractable for large graphs. Sometimes works well, sometimes not at all. B. Leibe

References and Further Reading A thorough introduction to Graphical Models in general and Bayesian Networks in particular can be found in Chapter 8 of Bishop’s book. Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 2006 B. Leibe