Algorithms and Theory of Approximate Inference


Algorithms and Theory of Approximate Inference
Eric Xing
Lecture 15, August 15, 2010
Reading:

Inference Problems
Compute the likelihood of observed data.
Compute the marginal distribution over a particular subset of nodes.
Compute the conditional distribution for disjoint subsets A and B.
Compute a mode of the density.
Brute-force methods are intractable. Recursive message passing (sum-product / max-product, junction tree) is efficient for simple graphical models, such as trees.

Inference in GMs
[Figure: an HMM over hidden states x1, ..., xT with observations y1, ..., yT, and a general Bayesian network over nodes A, B, C, D.]

Inference Problems (cont.)
The same problems as above: likelihood of observed data, marginals over subsets of nodes, conditionals for disjoint subsets A and B, modes of the density.
Methods we have:
Message passing (forward-backward, max-product / BP, junction tree), which shares intermediate terms.
Brute force and elimination, where the individual computations are independent.
Brute-force methods are intractable; recursive message passing is efficient for simple graphical models, such as trees.

Recall forward-backward on an HMM
Forward algorithm; backward algorithm.
[Figure: HMM chain x1, ..., xT with observations y1, ..., yT.]
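The forward and backward recursions on this slide were rendered as figures; as an illustrative stand-in (not taken from the slides), here is a minimal forward-backward sketch for a discrete HMM. The transition matrix A, emission matrix B, initial distribution pi, and the toy observation sequence are hypothetical example values, and numerical scaling is omitted for brevity.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Minimal forward-backward for a discrete HMM (illustrative sketch).

    A:   (K, K) transition matrix, A[i, j] = P(x_{t+1}=j | x_t=i)
    B:   (K, M) emission matrix,   B[i, o] = P(y_t=o | x_t=i)
    pi:  (K,)   initial state distribution
    obs: length-T sequence of observation indices
    Returns the per-step posterior marginals P(x_t | y_1..T).
    """
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))            # forward messages
    beta = np.ones((T, K))              # backward messages

    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy example with hypothetical parameters.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi = np.array([0.5, 0.5])
print(forward_backward(A, B, pi, [0, 1, 1, 0]))
```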

Message passing for trees
Let m_ji(x_i) denote the factor resulting from eliminating the variables below j, up to i; it is a function of x_i:
    m_ji(x_i) = Σ_{x_j} ψ(x_j) ψ(x_i, x_j) Π_{k ∈ N(j)\i} m_kj(x_j)
This is reminiscent of a message sent from j to i, and gives a mathematical summary of the elimination algorithm: m_ji(x_i) represents a "belief" about x_i coming from x_j.

The General Sum-Product Algorithm
Message passing on tree-structured GMs: on trees, the updates converge to a unique fixed point after a finite number of iterations, yielding the exact marginals.
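To make the tree case concrete, here is a small sum-product sketch for a pairwise MRF, assuming the usual unary/pairwise potential factorization; the three-node chain and its potentials at the bottom are hypothetical toy values, and the synchronous update schedule is just one convenient choice.

```python
import numpy as np

def sum_product_tree(neighbors, unary, pairwise, n_iters):
    """Synchronous sum-product on a pairwise MRF (illustrative sketch).

    neighbors: dict node -> list of neighbor nodes
    unary:     dict node -> (K,) unary potential psi_i(x_i)
    pairwise:  dict (i, j) -> (K, K) potential psi_ij(x_i, x_j)
    On a tree this converges to the exact marginals.
    """
    def psi(i, j):
        # Return psi_ij oriented so rows index x_i and columns index x_j.
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    K = len(next(iter(unary.values())))
    # msgs[(s, r)] is the message from sender s to receiver r.
    msgs = {(s, r): np.ones(K) for s in neighbors for r in neighbors[s]}

    for _ in range(n_iters):
        new = {}
        for (j, i) in msgs:                       # message j -> i
            prod = unary[j].copy()
            for k in neighbors[j]:
                if k != i:
                    prod *= msgs[(k, j)]
            m = psi(i, j) @ prod                  # sum over x_j
            new[(j, i)] = m / m.sum()
        msgs = new

    beliefs = {}
    for i in neighbors:
        b = unary[i].copy()
        for k in neighbors[i]:
            b *= msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs

# Three-node chain 0 - 1 - 2 with hypothetical potentials.
nbrs = {0: [1], 1: [0, 2], 2: [1]}
unary = {i: np.array([1.0, 2.0]) for i in nbrs}
pairwise = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
            (1, 2): np.array([[2.0, 1.0], [1.0, 2.0]])}
print(sum_product_tree(nbrs, unary, pairwise, n_iters=5))
```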

Junction Tree Revisited
A general algorithm for graphs with cycles. Steps: triangulation => construct junction trees => message passing on clique trees.

Local Consistency
Given a set of functions associated with the cliques and separator sets, they are locally consistent if marginalizing each clique function onto an adjacent separator recovers the separator function:
    Σ_{x_{C \ S}} τ_C(x_C) = τ_S(x_S)   for every clique C and adjacent separator S.
For junction trees, local consistency is equivalent to global consistency!

An Ising model on a 2-D image
Nodes encode hidden information (patch identity) and receive local information from the image (brightness, color). Edges encode 'compatibility' between nodes, and information is propagated through the graph over its edges. (Example query: is this patch air or water?)

Why Approximate Inference?
Why can't we just run the junction tree algorithm on this graph? For an N x N grid the treewidth is at least N, and N can be huge (images have thousands of pixels). A clique over ~N binary variables has on the order of 2^N entries, which is astronomically large.

Solution 1: Belief Propagation on loopy graphs
BP message-update rules, with local evidence φ_k(x_k, y_k) as the external evidence and φ_ki(x_k, x_i) as the compatibilities (interactions):
    M_ki(x_i) ∝ Σ_{x_k} φ_ki(x_k, x_i) φ_k(x_k, y_k) Π_{j ∈ N(k)\i} M_jk(x_k)
    b_i(x_i) ∝ φ_i(x_i, y_i) Π_{k ∈ N(i)} M_ki(x_i)
On loopy graphs, BP may not converge, or may converge to a wrong solution.
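A minimal sketch of loopy BP under the same message-update rules, with damping added for stability; the four-node cycle and its potentials are hypothetical toy values, and on loopy graphs the resulting beliefs are only approximate marginals.

```python
import numpy as np

def loopy_bp(neighbors, unary, pairwise, n_iters=100, damping=0.5):
    """Loopy sum-product BP on a pairwise MRF (sketch; not exact on cyclic graphs)."""
    def psi(i, j):
        # psi_ij oriented so rows index x_i and columns index x_j.
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    K = len(next(iter(unary.values())))
    msgs = {(s, r): np.ones(K) / K for s in neighbors for r in neighbors[s]}

    for _ in range(n_iters):
        new = {}
        for (j, i) in msgs:                       # message j -> i
            prod = unary[j].copy()
            for k in neighbors[j]:
                if k != i:
                    prod *= msgs[(k, j)]
            m = psi(i, j) @ prod
            m /= m.sum()
            new[(j, i)] = damping * msgs[(j, i)] + (1 - damping) * m
        msgs = new

    beliefs = {}
    for i in neighbors:
        b = unary[i].copy()
        for k in neighbors[i]:
            b *= msgs[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs

# 4-node cycle (a loopy graph) with hypothetical attractive potentials.
nbrs = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
unary = {0: np.array([2.0, 1.0]), 1: np.array([1.0, 1.0]),
         2: np.array([1.0, 1.0]), 3: np.array([1.0, 2.0])}
attract = np.array([[2.0, 1.0], [1.0, 2.0]])
pairwise = {(0, 1): attract, (1, 2): attract, (2, 3): attract, (3, 0): attract}
print(loopy_bp(nbrs, unary, pairwise))
```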

Recall BP on trees
The same message-update rules (external evidence and compatibilities), but on trees BP always converges to the exact marginals.

Solution 2: The naive mean field approximation
Approximate p(X) by a fully factorized q(X) = Π_i q_i(X_i).
For the Boltzmann distribution p(X) = exp{ Σ_{i<j} θ_ij X_i X_j + θ_i0 X_i } / Z, the mean field equations are
    q_i(x_i) ∝ exp{ θ_i0 x_i + x_i Σ_{j ∈ N_i} θ_ij ⟨X_j⟩_{q_j} }.
⟨X_j⟩_{q_j} resembles a "message" sent from node j to i, and { ⟨X_j⟩_{q_j} : j ∈ N_i } forms the "mean field" applied to X_i from its neighborhood.
[Speaker notes] To draw an analogy between our algorithm and the naive mean field method, let's briefly review the classical mean field approximation to the Boltzmann distribution, which depends on pairwise interactions between random variables. In a Gibbs sampling algorithm that asymptotically converges to the true distribution, we iteratively sample each variable from a predictive distribution conditioned on the previously sampled values of its neighboring variables. The naive mean field method approximates the true distribution with a fully factored distribution, a product of singleton marginals, one per node. Simple algebra shows that each singleton marginal can be expressed as a conditional distribution of the relevant node given the expectations of all its neighbors, and this distribution reuses the coupling weights of the original model. This is very similar to the Gibbs sampler scheme, except that the sampled values are replaced by expectations. We can view the expectation of a neighboring node as a message, and the messages from all neighbors as the mean field.

Recall Gibbs sampling
For the Boltzmann distribution p(X) = exp{ Σ_{i<j} θ_ij X_i X_j + θ_i0 X_i } / Z, the Gibbs predictive distribution is
    p(x_i | x_{N_i}) ∝ exp{ θ_i0 x_i + x_i Σ_{j ∈ N_i} θ_ij x_j },
i.e., each variable is resampled conditioned on the current values of its neighbors.
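For comparison with the mean field equations above, here is a sketch of a Gibbs sampler for the Boltzmann distribution with x_i ∈ {−1, +1}; the coupling and bias parameters of the three-node toy model are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_boltzmann(theta_pair, theta_unary, n_sweeps=5000, burn_in=1000):
    """Gibbs sampler for a Boltzmann distribution over x_i in {-1, +1} (sketch).

    p(x) ∝ exp( sum_{i<j} theta_pair[i, j] x_i x_j + sum_i theta_unary[i] x_i )
    Returns Monte Carlo estimates of the means <x_i>.
    """
    n = len(theta_unary)
    W = theta_pair + theta_pair.T            # symmetric coupling, zero diagonal assumed
    x = rng.choice([-1, 1], size=n)
    running = np.zeros(n)
    for sweep in range(n_sweeps):
        for i in range(n):
            field = theta_unary[i] + W[i] @ x - W[i, i] * x[i]
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))   # P(x_i = +1 | rest)
            x[i] = 1 if rng.random() < p_plus else -1
        if sweep >= burn_in:
            running += x
    return running / (n_sweeps - burn_in)

# Hypothetical 3-node model.
theta_pair = np.zeros((3, 3))
theta_pair[0, 1] = theta_pair[1, 2] = 0.5
theta_unary = np.array([0.3, 0.0, -0.3])
print(gibbs_boltzmann(theta_pair, theta_unary))
```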

Summary So Far
Exact inference methods are limited to tree-structured graphs; junction tree methods are exponentially expensive in the treewidth. Message passing can be applied to loopy graphs, but lacks analysis; mean field is convergent, but can converge to a local optimum.
Where do these two algorithms come from? Do they make sense?

Next Step …
Develop a general theory of variational inference, introduce some approximate inference methods, and provide a deeper understanding of some popular methods.

Exponential Family GMs
Canonical parameterization: p_θ(x) = exp{ ⟨θ, φ(x)⟩ − A(θ) }, with canonical parameters θ, sufficient statistics φ(x), and log-normalization (log-partition) function A(θ). Effective canonical parameters.
Regular family: the set of canonical parameters with A(θ) < ∞ is open.
Minimal representation: there does not exist a nonzero vector a such that ⟨a, φ(x)⟩ is a constant.

Examples
Ising model (binary r.v.s, x_i ∈ {−1, +1}): p_θ(x) = exp{ Σ_{i ∈ V} θ_i x_i + Σ_{(i,j) ∈ E} θ_ij x_i x_j − A(θ) }.
Gaussian MRF: sufficient statistics are the first and second moments {x_i, x_i x_j}; the canonical parameters are the potential vector and (minus one half of) the precision matrix.
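Since later slides repeatedly refer to the Ising model's log-partition function and mean parameters, here is a brute-force sketch (feasible only for tiny models) that computes both directly from the exponential-family form above; the three-node chain parameters are hypothetical.

```python
import numpy as np
from itertools import product

def ising_brute_force(theta_node, theta_edge, edges):
    """Brute-force forward mapping for a small Ising model (sketch).

    Sufficient statistics: {x_i} and {x_i x_j for (i, j) in edges}, x_i in {-1, +1}.
    Returns the log-partition A(theta) and the mean parameters
    mu_i = E[x_i], mu_ij = E[x_i x_j].
    """
    n = len(theta_node)
    configs = np.array(list(product([-1, 1], repeat=n)))
    energies = configs @ theta_node
    for (i, j), w in zip(edges, theta_edge):
        energies = energies + w * configs[:, i] * configs[:, j]
    A = np.logaddexp.reduce(energies)               # log Z
    probs = np.exp(energies - A)
    mu_node = probs @ configs
    mu_edge = np.array([probs @ (configs[:, i] * configs[:, j]) for i, j in edges])
    return A, mu_node, mu_edge

# Hypothetical 3-node chain.
edges = [(0, 1), (1, 2)]
A, mu_node, mu_edge = ising_brute_force(np.array([0.2, 0.0, -0.2]),
                                        np.array([0.5, 0.5]), edges)
print(A, mu_node, mu_edge)
```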

Mean Parameterization
The mean parameter associated with a sufficient statistic φ_α is defined as μ_α = E_p[ φ_α(X) ].
Realizable mean parameter set: M = { μ : there exists a distribution p with E_p[ φ(X) ] = μ }, a convex subset of R^d.
For discrete variables, M is the convex hull of the sufficient-statistic vectors, and a convex polytope when the state space is finite.

Convex Polytope
A polytope has a convex hull representation and a half-plane based representation. Minkowski-Weyl Theorem: any polytope can be characterized by a finite collection of linear inequality constraints.

Example: Two-node Ising Model
With binary 0/1 variables and sufficient statistics (x_1, x_2, x_1 x_2), the convex hull representation is the hull of the four vertices (0,0,0), (0,1,0), (1,0,0), (1,1,1).
Half-plane representation: μ_12 ≥ 0, μ_1 − μ_12 ≥ 0, μ_2 − μ_12 ≥ 0, 1 + μ_12 − μ_1 − μ_2 ≥ 0.
Probability theory gives the same constraints: each of these quantities is the probability of one joint configuration.
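A small check of this example, assuming the 0/1 convention used above: enumerate the four sufficient-statistic vertices and verify the half-plane constraints; the two test points at the end are arbitrary.

```python
import numpy as np
from itertools import product

# Two-node Ising model with 0/1 variables and sufficient statistics (x1, x2, x1*x2).
# The marginal polytope is the convex hull of the statistics of the four configurations;
# each vertex should satisfy the four half-plane constraints listed on the slide.
vertices = np.array([[x1, x2, x1 * x2] for x1, x2 in product([0, 1], repeat=2)])

def in_half_planes(mu):
    m1, m2, m12 = mu
    return (m12 >= 0) and (m1 - m12 >= 0) and (m2 - m12 >= 0) and (1 + m12 - m1 - m2 >= 0)

print(vertices)
print([in_half_planes(v) for v in vertices])        # all True

# Arbitrary test points: an interior point satisfies all constraints,
# while (0.1, 0.1, 0.5) violates m1 - m12 >= 0.
print(in_half_planes(np.array([0.5, 0.5, 0.25])), in_half_planes(np.array([0.1, 0.1, 0.5])))
```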

Marginal Polytope
For a discrete MRF in its canonical (indicator-based) parameterization, the mean parameterization collects the marginal distributions over nodes and edges: μ_i(x_i) = P(X_i = x_i) and μ_ij(x_i, x_j) = P(X_i = x_i, X_j = x_j). The resulting realizable set is called the marginal polytope.

Conjugate Duality
Duality between MLE and max-entropy: the conjugate dual of the log-partition function is A*(μ) = sup_θ { ⟨θ, μ⟩ − A(θ) }, the negative entropy for realizable μ.
For all μ in the interior of M, there is a unique canonical parameter θ(μ) satisfying the moment-matching condition E_{θ(μ)}[ φ(X) ] = μ.
The log-partition function has the variational form A(θ) = sup_{μ ∈ M} { ⟨θ, μ⟩ − A*(μ) }   (*).
For all θ, the supremum in (*) is attained uniquely at the μ specified by the moment-matching condition μ = E_θ[ φ(X) ].
For a minimal exponential family, this correspondence is a bijection.

Roles of Mean Parameters
Forward mapping: from the canonical parameters θ to the mean parameters μ — a fundamental class of inference problems in exponential family models.
Backward mapping: from μ back to θ — parameter estimation to learn the unknown θ.

Example: Bernoulli
Sufficient statistic φ(x) = x and log-partition A(θ) = log(1 + e^θ), so the forward mapping is μ = ∇A(θ) = e^θ / (1 + e^θ). Unique!
Reverse mapping: if μ ∈ (0, 1), then θ = log( μ / (1 − μ) ). Unique!
If μ lies on the boundary (μ = 0 or 1), there is no gradient stationary point in the optimization problem (**).
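A quick numerical sketch of the Bernoulli forward and reverse mappings, including a finite-difference check that the mean parameter equals the gradient of A; the value of θ is arbitrary.

```python
import numpy as np

# Forward and reverse mappings for the Bernoulli family (sketch of the slide's example).
A = lambda theta: np.log1p(np.exp(theta))               # log-partition function
forward = lambda theta: 1.0 / (1.0 + np.exp(-theta))    # mu = dA/dtheta (sigmoid)
reverse = lambda mu: np.log(mu / (1.0 - mu))            # theta = logit(mu), mu in (0, 1)

theta = 1.3                                             # arbitrary test value
mu = forward(theta)
print(mu, reverse(mu))                                  # reverse(forward(theta)) == theta

# The gradient of A matches the mean parameter (finite-difference check).
eps = 1e-6
print((A(theta + eps) - A(theta - eps)) / (2 * eps), mu)
```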

Variational Inference In General
An umbrella term that refers to various mathematical tools for optimization-based formulations of problems, as well as associated techniques for their solution.
General idea: express a quantity of interest as the solution of an optimization problem; the optimization problem can then be relaxed in various ways, by approximating the function to be optimized or approximating the set over which the optimization takes place.
It goes in parallel with MCMC.

A Tree-Based Outer Bound: the Locally Consistent (Pseudo-)Marginal Polytope
Both the convex hull and the intersection-of-half-spaces representations of M are hard to manipulate in general. Instead, impose only normalization and marginalization constraints:
    L(G) = { τ ≥ 0 : Σ_{x_i} τ_i(x_i) = 1,  Σ_{x_j} τ_ij(x_i, x_j) = τ_i(x_i) }.
Relation to M: M(G) ⊆ L(G) holds for any graph; M(G) = L(G) holds for tree-structured graphs.

An Example
A three-node graph over binary r.v.s (nodes 1, 2, 3). For any graph we have M(G) ⊆ L(G); for this graph the inclusion is strict — constructing locally consistent pseudo-marginals that are not realizable by any joint distribution is left as an exercise.

Bethe Entropy Approximation
Approximate the negative entropy A*(μ), which does not have a closed form for general graphs.
Entropy on a tree (in terms of marginals): recall H(p) = − Σ_x p(x) log p(x); on a tree it decomposes as
    H = Σ_i H_i(μ_i) − Σ_{(i,j) ∈ E} I_ij(μ_ij),
the sum of singleton entropies minus the edgewise mutual informations.
Bethe entropy approximation (in terms of pseudo-marginals): apply the same expression on an arbitrary graph,
    H_Bethe(τ) = Σ_i H_i(τ_i) − Σ_{(i,j) ∈ E} I_ij(τ_ij).
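A sketch of the Bethe entropy computed from (pseudo-)marginals as singleton entropies minus edgewise mutual informations; the three-node chain pseudo-marginals below are hypothetical but locally consistent.

```python
import numpy as np

def bethe_entropy(node_marg, edge_marg, edges):
    """Bethe entropy approximation from (pseudo-)marginals (illustrative sketch).

    H_Bethe = sum_i H(tau_i) - sum_(i,j) I(tau_ij),
    where I is the mutual information of the pairwise pseudo-marginal.
    node_marg: dict i -> (K,) array; edge_marg: dict (i, j) -> (K, K) array.
    """
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    total = sum(H(node_marg[i]) for i in node_marg)
    for (i, j) in edges:
        tau = edge_marg[(i, j)]
        mutual_info = H(node_marg[i]) + H(node_marg[j]) - H(tau.ravel())
        total -= mutual_info
    return total

# Toy pseudo-marginals on a 3-node chain (hypothetical numbers, locally consistent).
node_marg = {0: np.array([0.6, 0.4]), 1: np.array([0.5, 0.5]), 2: np.array([0.5, 0.5])}
edge_marg = {(0, 1): np.array([[0.4, 0.2], [0.1, 0.3]]),
             (1, 2): np.array([[0.3, 0.2], [0.2, 0.3]])}
print(bethe_entropy(node_marg, edge_marg, edges=[(0, 1), (1, 2)]))
```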

Bethe Variational Problem (BVP)
We already have a convex (polyhedral) outer bound L(G) and the Bethe approximate entropy H_Bethe. Combining the two ingredients gives a simple structured problem (differentiable, with a simple polytope as the constraint set):
    max_{τ ∈ L(G)} { ⟨θ, τ⟩ + H_Bethe(τ) }.
The sum-product algorithm is the solver! (The approximation is named after Hans Bethe, Nobel Prize in Physics, 1967.)

Connection to the Sum-Product Algorithm
Apply the Lagrangian method to the BVP: attaching multipliers λ to the normalization and marginalization constraints and setting the gradient to zero recovers the sum-product message updates.
Sum-product and Bethe variational (Yedidia et al., 2002): for any graph G, any fixed point of the sum-product updates specifies a pair (τ*, λ*) that is a stationary point of the BVP.
For a tree-structured MRF, the solution is unique, τ* corresponds to the exact singleton and pairwise marginal distributions of the MRF, and the optimal value of the BVP equals the log-partition function A(θ).

Proof: by working out the Lagrangian stationarity conditions of the BVP (see Yedidia et al., 2002).

Discussions
The connection provides a principled basis for applying the sum-product algorithm to loopy graphs.
However, the connection provides no guarantees on the convergence of sum-product on loopy graphs, and the Bethe variational problem is usually non-convex, so there are no guarantees of reaching the global optimum. In general there is also no guarantee that the Bethe optimal value is a lower bound of the true log-partition function A(θ).
However, the connection and understanding suggest a number of avenues for improving upon the ordinary sum-product algorithm, via progressively better approximations to the entropy function and tighter outer bounds on the marginal polytope!

Inexactness of Bethe and Sum-Product
Two sources of error: the Bethe entropy approximation (H_Bethe differs from the true entropy on graphs with cycles), and the pseudo-marginal outer bound (the inclusion M(G) ⊂ L(G) is strict). Example: a four-node cycle (nodes 1-2-3-4) admits locally consistent pseudo-marginals that are not globally realizable.

Summary of LBP
Variational methods in general turn inference into an optimization problem; however, both the objective function and the constraint set are hard to deal with.
The Bethe variational approximation is a tree-based approximation to both the objective function and the marginal polytope.
Belief propagation is a Lagrangian-based solver for the BVP; generalized BP extends BP to solve the generalized, hyper-tree based variational approximation problem.

Tractable Subgraph
Given a GM with graph G, a subgraph F is tractable if we can perform exact inference on it. Examples: the fully disconnected subgraph (no edges), or an embedded tree.

Mean Parameterization
For an exponential family GM defined by graph G and sufficient statistics φ, the realizable mean parameter set is M(G) = { μ : there exists p with E_p[ φ(X) ] = μ }.
For a given tractable subgraph F, the subset of mean parameters realizable by distributions that respect F, M(F) ⊆ M(G), is of interest: an inner approximation.

Optimizing a Lower Bound
Any realizable mean parameter μ yields a lower bound on the log-partition function:
    A(θ) ≥ ⟨θ, μ⟩ − A*(μ).
Moreover, equality holds iff θ and μ are dually coupled, i.e. μ = E_θ[ φ(X) ]. Proof idea: Jensen's inequality.
Optimizing the lower bound gives max_{μ ∈ M} { ⟨θ, μ⟩ − A*(μ) }, attained at the dually coupled μ — this is an inference!

Mean Field Methods In General
However, the lower bound can't be evaluated explicitly in general, because the dual function A*(μ) typically lacks an explicit form.
Mean field methods approximate the lower bound by approximating the realizable mean parameter set: restrict to the tractable subset M(F), on which the dual has an explicit form A*_F.
The MF optimization problem: max_{μ ∈ M(F)} { ⟨θ, μ⟩ − A*_F(μ) }.
Is it still a lower bound? Yes: restricting the maximization to an inner approximation can only decrease the value.

KL-divergence
Kullback-Leibler divergence for two exponential family distributions with the same sufficient statistics:
    Primal form: D(θ_1 || θ_2) = A(θ_2) − A(θ_1) − ⟨∇A(θ_1), θ_2 − θ_1⟩
    Mixed form: D(μ_1 || θ_2) = A(θ_2) + A*(μ_1) − ⟨μ_1, θ_2⟩
    Dual form: D(μ_1 || μ_2) = A*(μ_1) − A*(μ_2) − ⟨∇A*(μ_2), μ_1 − μ_2⟩
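A numerical sanity check, for the Bernoulli family, that the primal, mixed, and dual forms all agree with the directly computed KL divergence; the parameter values θ_1, θ_2 are arbitrary.

```python
import numpy as np

# Check the three equivalent forms of the KL divergence between two Bernoulli
# (exponential family) distributions; the specific parameter values are arbitrary.
A = lambda th: np.log1p(np.exp(th))                     # log-partition
grad_A = lambda th: 1.0 / (1.0 + np.exp(-th))           # mean parameter mu(theta)
A_star = lambda mu: mu * np.log(mu) + (1 - mu) * np.log(1 - mu)   # negative entropy

th1, th2 = 0.7, -0.4
mu1, mu2 = grad_A(th1), grad_A(th2)

direct = mu1 * np.log(mu1 / mu2) + (1 - mu1) * np.log((1 - mu1) / (1 - mu2))
primal = A(th2) - A(th1) - mu1 * (th2 - th1)
mixed = A(th2) + A_star(mu1) - mu1 * th2
dual = A_star(mu1) - A_star(mu2) - th2 * (mu1 - mu2)    # grad A*(mu2) = theta2
print(direct, primal, mixed, dual)                      # all four agree
```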

Mean Field and KL-divergence
Optimizing the lower bound max_{μ ∈ M(F)} { ⟨θ, μ⟩ − A*_F(μ) } is equivalent to minimizing a KL divergence, namely KL(q || p) between the tractable approximation q and the target distribution p. Therefore, mean field is doing KL(q || p) minimization.

Naïve Mean Field
Fully factorized variational distribution: q(x) = Π_{i ∈ V} q_i(x_i).

Naïve Mean Field for the Ising Model
(Take x_i ∈ {0, 1}.) Sufficient statistics {x_i, x_i x_j} and mean parameters μ_i = E[x_i], μ_ij = E[x_i x_j].
Naïve mean field uses the edgeless subgraph, so μ_ij = μ_i μ_j; the realizable mean parameter subset is M(F) = { μ : 0 ≤ μ_i ≤ 1, μ_ij = μ_i μ_j }.
Entropy of the factorized distribution: H(q) = − Σ_i [ μ_i log μ_i + (1 − μ_i) log(1 − μ_i) ].
Optimization problem: max_{μ ∈ [0,1]^n} { Σ_i θ_i μ_i + Σ_{(i,j) ∈ E} θ_ij μ_i μ_j + H(q) }.

Naïve Mean Field for the Ising Model (cont.)
Coordinate ascent on the optimization problem above yields the update rule
    μ_i ← σ( θ_i + Σ_{j ∈ N(i)} θ_ij μ_j ),   where σ(z) = 1 / (1 + e^{−z}).
μ_j resembles a "message" sent from node j to i, and { μ_j : j ∈ N(i) } forms the "mean field" applied to node i from its neighborhood.
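A sketch of this coordinate-ascent update for a tiny Ising model with x_i ∈ {0, 1}, compared against exact marginals obtained by enumeration; the chain parameters are hypothetical, and mean field is only an approximation.

```python
import numpy as np
from itertools import product

def naive_mean_field_ising(theta_node, theta_edge, n_sweeps=200):
    """Coordinate-ascent naive mean field for an Ising model with x_i in {0, 1} (sketch).

    Update rule: mu_i <- sigmoid( theta_i + sum_{j in N(i)} theta_ij * mu_j ).
    theta_edge is a symmetric (n, n) coupling matrix with zero diagonal.
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    n = len(theta_node)
    mu = np.full(n, 0.5)
    for _ in range(n_sweeps):
        for i in range(n):
            mu[i] = sigmoid(theta_node[i] + theta_edge[i] @ mu)
    return mu

# Hypothetical 3-node chain; compare against exact marginals by enumeration.
theta_node = np.array([0.5, -0.2, 0.1])
theta_edge = np.zeros((3, 3))
theta_edge[0, 1] = theta_edge[1, 0] = 1.0
theta_edge[1, 2] = theta_edge[2, 1] = 1.0

configs = np.array(list(product([0, 1], repeat=3)))
scores = configs @ theta_node + 0.5 * np.einsum('ci,ij,cj->c', configs, theta_edge, configs)
probs = np.exp(scores - np.logaddexp.reduce(scores))
print("exact marginals:", probs @ configs)
print("mean field     :", naive_mean_field_ising(theta_node, theta_edge))
```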

Non-Convexity of Mean Field
Mean field optimization is always non-convex for any exponential family in which the state space is finite. The finite convex hull M contains all of its extreme points, and these extreme points are realized by degenerate (fully factorized) distributions, so they all lie in M(F). If M(F) were a convex set, it would therefore contain all of M, which it does not. Mean field has nevertheless been used successfully in practice.

Structured Mean Field
Mean field theory applies to any tractable subgraph. Naïve mean field is based on the fully unconnected subgraph; variants based on structured subgraphs can be derived.

Topic models
[Figure: a topic model with topics β, per-document latent vector g (with parameters μ, Σ), topic assignments z, and words w.]
Approximate the intractable integral and the posterior with a variational distribution parameterized by μ*, Σ*, φ*_{1:n}, then solve the resulting optimization problem.

Variational Inference With No Tears [Ahmed and Xing, 2006; Xing et al., 2003]
[Figure: the model over β, g, z, e with parameters μ, Σ.]
Use a fully factored distribution and iterate the resulting fixed-point equations, together with a Laplace approximation.

Summary of GMF
Message-passing algorithms (e.g., belief propagation, mean field) are solving approximate versions of the exact variational principle in exponential families.
There are two distinct components to the approximations: one can use either inner or outer bounds to the marginal polytope M, and various approximations to the entropy function.
BP: polyhedral outer bound and non-convex Bethe entropy approximation.
MF: non-convex inner bound and exact form of the entropy.
Kikuchi: tighter polyhedral outer bound and better entropy approximation.