Section 3 Appendix: BP as an Optimization Algorithm


This Appendix provides a more in-depth study of BP as an optimization algorithm. Our focus is on the Bethe Free Energy and its relation to the KL divergence, the Gibbs Free Energy, and the Helmholtz Free Energy. We also discuss the convergence properties of sum-product and max-product BP.

KL and Free Energies

Kullback–Leibler (KL) divergence: KL(b ‖ p) = Σ_x b(x) log( b(x) / p(x) )

Gibbs Free Energy, for a factorized model p(x) = (1/Z) ∏_α ψ_α(x_α):
  G(b) = Σ_x b(x) log b(x) − Σ_x b(x) Σ_α log ψ_α(x_α)

Helmholtz Free Energy: F_H = −log Z

These are related by KL(b ‖ p) = G(b) − F_H, so minimizing G(b) over b is the same as minimizing KL(b ‖ p), since F_H does not depend on b.
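The following minimal NumPy sketch (not from the slides; the potential table and the trial distribution b are made-up values) verifies the identity KL(b ‖ p) = G(b) − F_H on a tiny two-variable model:

```python
import numpy as np

# Tiny model over two binary variables: p(x, y) = psi(x, y) / Z.
psi = np.array([[4.0, 1.0],
                [1.0, 2.0]])
Z = psi.sum()
p = psi / Z

# An arbitrary trial distribution b over the same space (made-up values).
b = np.array([[0.40, 0.15],
              [0.15, 0.30]])

kl = np.sum(b * np.log(b / p))                            # KL(b || p)
gibbs = np.sum(b * np.log(b)) - np.sum(b * np.log(psi))   # G(b)
helmholtz = -np.log(Z)                                    # F_H = -log Z

# The identity KL(b || p) = G(b) - F_H holds to machine precision.
assert np.isclose(kl, gibbs - helmholtz)
```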

Minimizing KL Divergence

If we find the distribution b that minimizes the KL divergence, then b = p. The same b also minimizes the Gibbs Free Energy, since the two objectives differ only by the constant F_H. But what if b is not (necessarily) a probability distribution?

BP on a 2-Variable Chain

Factor graph: X --ψ1-- Y, with true distribution p(x, y) = (1/Z) ψ1(x, y).

Beliefs at the end of BP match the true marginals exactly: b_X(x) = p(x), b_Y(y) = p(y), and b_1(x, y) = p(x, y). We successfully minimized the KL divergence: KL(b ‖ p) = 0. (*In the slide's formulas, U(x) denotes the uniform distribution.)
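Here is a short sketch (not from the slides) that runs sum-product on this two-variable chain and checks that the resulting beliefs are the exact marginals, i.e. that the KL divergence is driven to zero; ψ1 is a made-up table:

```python
import numpy as np

psi1 = np.array([[4.0, 1.0],
                 [1.0, 2.0]])          # psi1(x, y), made-up values
p = psi1 / psi1.sum()                  # true joint p(x, y)

# Sum-product on X --psi1-- Y: leaf variables send the all-ones message.
msg_x_to_f = np.ones(2)
msg_y_to_f = np.ones(2)
msg_f_to_y = psi1.T @ msg_x_to_f       # sum over x of psi1(x, y)
msg_f_to_x = psi1 @ msg_y_to_f         # sum over y of psi1(x, y)

b_x = msg_f_to_x / msg_f_to_x.sum()    # variable beliefs, normalized
b_y = msg_f_to_y / msg_f_to_y.sum()
b_1 = psi1 * np.outer(msg_x_to_f, msg_y_to_f)
b_1 /= b_1.sum()                       # factor belief

assert np.allclose(b_x, p.sum(axis=1))  # b_X(x) = p(x)
assert np.allclose(b_y, p.sum(axis=0))  # b_Y(y) = p(y)
assert np.allclose(b_1, p)              # b_1(x, y) = p(x, y)  =>  KL = 0
```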

BP on a 3-Variable Chain

Factor graph: W --ψ1-- X --ψ2-- Y, with true distribution p(w, x, y) = (1/Z) ψ1(w, x) ψ2(x, y).

The true distribution can be expressed in terms of its marginals:
  p(w, x, y) = p(w, x) p(x, y) / p(x)
Define the joint belief to have the same form:
  b(w, x, y) = b_1(w, x) b_2(x, y) / b_X(x)
Then the KL divergence decomposes over the marginals:
  KL(b ‖ p) = KL(b_1 ‖ p_1) + KL(b_2 ‖ p_2) − KL(b_X ‖ p_X)
and so does the Gibbs Free Energy:
  G(b) = Σ_{w,x} b_1(w,x) log( b_1(w,x) / ψ1(w,x) ) + Σ_{x,y} b_2(x,y) log( b_2(x,y) / ψ2(x,y) ) − Σ_x b_X(x) log b_X(x)
A numeric check of both identities is sketched below.
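This sketch (not from the slides; all tables are made-up values, and the pairwise beliefs are chosen to share the marginal b_X) verifies the chain factorization and the KL decomposition numerically:

```python
import numpy as np

psi1 = np.array([[2.0, 1.0], [1.0, 3.0]])   # psi1(w, x)
psi2 = np.array([[1.0, 4.0], [2.0, 1.0]])   # psi2(x, y)

# True joint p(w, x, y) and its marginals.
p = np.einsum('wx,xy->wxy', psi1, psi2)
p /= p.sum()
p_wx, p_xy, p_x = p.sum(axis=2), p.sum(axis=0), p.sum(axis=(0, 2))

# The chain factorization p(w, x, y) = p(w, x) p(x, y) / p(x):
recon = np.einsum('wx,xy->wxy', p_wx, p_xy) / p_x[None, :, None]
assert np.allclose(p, recon)

# A joint belief of the same form, built from locally consistent
# pairwise beliefs: b_wx.sum(axis=0) == b_xy.sum(axis=1) == b_x.
b_wx = np.array([[0.30, 0.20], [0.10, 0.40]])
b_xy = np.array([[0.10, 0.30], [0.25, 0.35]])
b_x = b_wx.sum(axis=0)
b = np.einsum('wx,xy->wxy', b_wx, b_xy) / b_x[None, :, None]

# KL decomposes over the marginals (note the -1 for the shared variable X).
kl = lambda q, r: np.sum(q * np.log(q / r))
assert np.isclose(kl(b, p),
                  kl(b_wx, p_wx) + kl(b_xy, p_xy) - kl(b_x, p_x))
```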

BP on an Acyclic Graph

[Figure: the "time flies like an arrow" factor graph, a tree over variables X1, ..., X9 with factors ψ1, ψ3, ψ5, ψ7, ψ9 and ψ10, ..., ψ14.]

The true distribution of any acyclic graph can be expressed in terms of its marginals:
  p(x) = ∏_α p_α(x_α) ∏_i p_i(x_i)^{1 − d_i}
where α ranges over factors, i over variables, and d_i is the number of factors adjacent to variable i. Define the joint belief to have the same form:
  b(x) = ∏_α b_α(x_α) ∏_i b_i(x_i)^{1 − d_i}
Exactly as on the chain, both the KL divergence and the Gibbs Free Energy decompose over the marginals:
  KL(b ‖ p) = Σ_α KL(b_α ‖ p_α) + Σ_i (1 − d_i) KL(b_i ‖ p_i)
  G(b) = Σ_α Σ_{x_α} b_α(x_α) log( b_α(x_α) / ψ_α(x_α) ) + Σ_i (1 − d_i) Σ_{x_i} b_i(x_i) log b_i(x_i)

BP on a Loopy Graph

[Figure: the same "time flies like an arrow" factor graph, now with additional factors ψ2, ψ4, ψ6, ψ8 that create cycles.]

Construct the joint belief as before:
  b(x) = ∏_α b_α(x_α) ∏_i b_i(x_i)^{1 − d_i}
On a loopy graph this might not be a distribution! KL(b ‖ p) is then no longer well defined. So add constraints:
  1. The beliefs are distributions: non-negative and sum to one.
  2. The beliefs are locally consistent: Σ_{x_α \ x_i} b_α(x_α) = b_i(x_i) for every factor α and every variable i adjacent to α.
But we can still optimize the same objective as before, subject to these belief constraints. The resulting objective is called the Bethe Free Energy, and it decomposes over the marginals; a checker for the constraints is sketched below.
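As an illustration (not from the slides; the function name and the example table are my own), a small NumPy checker for the two constraint families on a single pairwise factor:

```python
import numpy as np

def beliefs_ok(b_pair, b_row, b_col, tol=1e-8):
    """Check the belief constraints for one pairwise factor b(x_i, x_j):
    (1) each belief is a distribution, and (2) the factor belief
    marginalizes to the variable beliefs (local consistency)."""
    ok = b_pair.min() >= -tol and np.isclose(b_pair.sum(), 1.0, atol=tol)
    ok = ok and np.isclose(b_row.sum(), 1.0) and np.isclose(b_col.sum(), 1.0)
    ok = ok and np.allclose(b_pair.sum(axis=1), b_row, atol=tol)  # sum out x_j
    ok = ok and np.allclose(b_pair.sum(axis=0), b_col, atol=tol)  # sum out x_i
    return ok

# Consistent beliefs pass; a mismatched variable belief fails.
b_pair = np.array([[0.10, 0.30], [0.25, 0.35]])
assert beliefs_ok(b_pair, b_pair.sum(axis=1), b_pair.sum(axis=0))
assert not beliefs_ok(b_pair, np.array([0.5, 0.5]), b_pair.sum(axis=0))
```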

BP as an Optimization Algorithm

The Bethe Free Energy, a function of the beliefs:
  F_Bethe(b) = Σ_α Σ_{x_α} b_α(x_α) log( b_α(x_α) / ψ_α(x_α) ) + Σ_i (1 − d_i) Σ_{x_i} b_i(x_i) log b_i(x_i)

BP minimizes a constrained version of the Bethe Free Energy:
– BP is just one local optimization algorithm: fast, but not guaranteed to converge.
– If BP converges, the resulting beliefs are called fixed points.
– The stationary points of a function are the points where its gradient is zero.

The fixed points of BP are stationary points of the Bethe Free Energy (Yedidia, Freeman, & Weiss, 2000). More strongly, the stable fixed points of BP are local minima of the Bethe Free Energy (Heskes, 2003).
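A direct transcription of this formula into NumPy might look as follows (a sketch, not the tutorial's code; the function name and argument layout are my own):

```python
import numpy as np

def bethe_free_energy(factor_beliefs, var_beliefs, factors, degrees):
    """Bethe Free Energy of a discrete factor graph.
    factor_beliefs[a] and factors[a] are arrays over the variables of
    factor a; var_beliefs[i] is the belief over variable i; degrees[i]
    is the number of factors adjacent to variable i."""
    f = 0.0
    for b_a, psi_a in zip(factor_beliefs, factors):
        f += np.sum(b_a * np.log(b_a / psi_a))        # factor terms
    for b_i, d_i in zip(var_beliefs, degrees):
        f += (1 - d_i) * np.sum(b_i * np.log(b_i))    # variable entropy terms
    return f
```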

For graphs with no cycles:
– The minimizing beliefs = the true marginals.
– BP finds the global minimum of the Bethe Free Energy.
– This global minimum is −log Z (the "Helmholtz Free Energy").
For graphs with cycles:
– The minimizing beliefs only approximate the true marginals.
– Attempting to minimize may get stuck at a local minimum or other critical point.
– Even the global minimum only approximates −log Z.
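For instance (a sketch on the same made-up 3-variable chain as before), plugging the true marginals into the Bethe Free Energy of an acyclic graph recovers −log Z exactly:

```python
import numpy as np

# Acyclic 3-variable chain W --psi1-- X --psi2-- Y, as above.
psi1 = np.array([[2.0, 1.0], [1.0, 3.0]])
psi2 = np.array([[1.0, 4.0], [2.0, 1.0]])
p_tilde = np.einsum('wx,xy->wxy', psi1, psi2)
Z = p_tilde.sum()
p = p_tilde / Z

# Plug the TRUE marginals into F_Bethe. Degrees: d_W = 1, d_X = 2, d_Y = 1,
# so only X contributes a (1 - d_i) entropy term.
p_wx, p_xy, p_x = p.sum(axis=2), p.sum(axis=0), p.sum(axis=(0, 2))
f_bethe = (np.sum(p_wx * np.log(p_wx / psi1))
           + np.sum(p_xy * np.log(p_xy / psi2))
           + (1 - 2) * np.sum(p_x * np.log(p_x)))

# On an acyclic graph the global minimum is exactly -log Z.
assert np.isclose(f_bethe, -np.log(Z))
```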

Convergence of Sum-Product BP

The fixed-point beliefs:
– Do not necessarily correspond to the marginals of any joint distribution over all the variables (MacKay, Yedidia, Freeman, & Weiss, 2001; Yedidia, Freeman, & Weiss, 2005).
– Conversely, the true marginals of many joint distributions cannot be reached by BP; such marginals are dubbed "unbelievable probabilities" (Pitkow, Ahmadian, & Miller, 2011).

[Figure adapted from (Pitkow, Ahmadian, & Miller, 2011): a two-dimensional slice of the Bethe Free Energy G_Bethe(b), plotted against b_1(x_1) and b_2(x_2), for a binary graphical model with pairwise interactions.]

Convergence of Max-Product BP

If the max-marginals b_i(x_i) are a fixed point of BP, and x* is the corresponding assignment (assumed unique), then p(x*) > p(x) for every x ≠ x* in a rather large neighborhood around x* (Weiss & Freeman, 2001).

That neighborhood is constructed as follows: for any set S of variables that induces only disconnected trees and single loops, set the variables in S to arbitrary values and keep the remaining variables at x*.

Informally: if you take the fixed-point solution x* and arbitrarily change the values of the dark nodes in the figure (from Weiss & Freeman, 2001), the overall probability of the configuration will decrease.
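The guarantee above is local; on an acyclic graph it becomes global, and max-product recovers the exact MAP assignment. A minimal sketch of that special case (not from the slides; the potential tables are made-up), on the 3-variable chain used earlier:

```python
import numpy as np

psi1 = np.array([[2.0, 1.0], [1.0, 3.0]])   # psi1(w, x)
psi2 = np.array([[1.0, 4.0], [2.0, 1.0]])   # psi2(x, y)
p = np.einsum('wx,xy->wxy', psi1, psi2)
p /= p.sum()

# Brute-force MAP over all 2**3 configurations.
map_bf = tuple(int(i) for i in np.unravel_index(p.argmax(), p.shape))

# Max-product: backward messages, then decode left to right.
m_y = psi2.max(axis=1)                      # max over y of psi2(x, y)
m_x = (psi1 * m_y[None, :]).max(axis=1)     # max over x of psi1(w, x) m_y(x)
w_star = int(m_x.argmax())
x_star = int((psi1[w_star] * m_y).argmax())
y_star = int(psi2[x_star].argmax())

assert (w_star, x_star, y_star) == map_bf   # exact on acyclic graphs
```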

Convergence of Max-product BP 17 If the max-marginals b i (x i ) are a fixed point of BP, and x * is the corresponding assignment (assumed unique), then p(x * ) > p(x) for every x ≠ x * in a rather large neighborhood around x * (Weiss & Freeman, 2001). Figure from (Weiss & Freeman, 2001) Informally: If you take the fixed-point solution x * and arbitrarily change the values of the dark nodes in the figure, the overall probability of the configuration will decrease. The neighbors of x * are constructed as follows: For any set of vars S of disconnected trees and single loops, set the variables in S to arbitrary values, and the rest to x *.

Convergence of Max-product BP 18 If the max-marginals b i (x i ) are a fixed point of BP, and x * is the corresponding assignment (assumed unique), then p(x * ) > p(x) for every x ≠ x * in a rather large neighborhood around x * (Weiss & Freeman, 2001). Figure from (Weiss & Freeman, 2001) Informally: If you take the fixed-point solution x * and arbitrarily change the values of the dark nodes in the figure, the overall probability of the configuration will decrease. The neighbors of x * are constructed as follows: For any set of vars S of disconnected trees and single loops, set the variables in S to arbitrary values, and the rest to x *.