Synthesis of MCMC and Belief Propagation


1 Synthesis of MCMC and Belief Propagation
Sungsoo Ahn (speaker)1, Michael Chertkov2, Jinwoo Shin1. 1Korea Advanced Institute of Science and Technology (KAIST), 2Los Alamos National Laboratory (LANL). Neural Information Processing Systems (NIPS), December 6th, 2016.
Hi. I'm Sungsoo Ahn, and I'll talk about "Synthesis of MCMC and Belief Propagation". This is joint work with Michael Chertkov and Jinwoo Shin.

2 Graphical Model: expressing distributions by graph
Probabilistic model, expressing probability distributions by a graph. Applied in machine learning [Pearl, 1982], statistical physics [Ising, 1920], theoretical computer science [Erdös, 1976], information theory [Gallager, 1963], and more. [Image: Barcelona, Eixample]
Hi. First, I will begin with some preliminary material. Our work is about graphical models: a graphical model is a probabilistic model that expresses probability distributions through a graph. It is widely used in areas such as machine learning, statistical physics, theoretical computer science, and information theory. As an example, a graphical model can be used to analyze the behavior of an epidemic in a city. If we formulate this problem, we get a graph where nodes represent the people living in the city and edges represent the relationships between them. Moreover, for this example, the result is a grid graph if we further assume that the epidemic is likely to spread between neighbors.

3 Graphical Model: expressing distributions by graph
Binary random variables on the graph, with a node factor for each vertex and an edge factor for each edge.
Need the partition function for normalization: essential for inference, but very hard to compute! NP-hard, or #P-hard even for approximation.
Instead, use approximation algorithms like:
Markov chain Monte Carlo (MCMC): a randomized algorithm based on sampling from a Markov chain.
Belief Propagation (BP): a message-passing algorithm for performing inference in graphical models.
The two algorithms have their own pros and cons.
Next, we associate random variables to the graph; in our work we deal with binary random variables. In a graphical model, the probability distribution must have a particular form: it should be proportional to a product of non-negative functions called factors, in this case factors attached to nodes and factors attached to edges. However, this alone cannot define a probability distribution, because a distribution must sum to one. To enforce this, each term is divided by the overall summation, and this overall summation is called the partition function. Naturally, computing the partition function is essential for inference tasks in graphical models. However, it is very hard to compute and usually becomes a bottleneck for large-scale applications; formally, it is NP-hard, or #P-hard even to approximate. Therefore, people use approximation algorithms in practice, and among them Markov chain Monte Carlo and Belief Propagation are the most popular. Briefly, Markov chain Monte Carlo, or MCMC, is a randomized algorithm based on sampling from a Markov chain, and Belief Propagation, or BP, is a message-passing algorithm for performing inference in graphical models. These two algorithms have their own pros and cons, which I will discuss in the next slide. (A small brute-force sketch of the partition function appears below.)
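To make the normalization concrete, here is a minimal sketch, not from the talk, that computes the partition function of a tiny binary pairwise model by brute-force enumeration; the graph, factor tables, and values are illustrative assumptions.

```python
# Minimal sketch (not from the talk; graph and factor values are illustrative assumptions):
# brute-force partition function of a tiny binary pairwise graphical model,
# p(x) proportional to prod_v psi_v(x_v) * prod_(u,v) psi_uv(x_u, x_v), with Z the normalizer.
import itertools
import math

nodes = [0, 1, 2, 3]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]              # a 4-cycle
node_factor = {v: {0: 1.0, 1: 1.5} for v in nodes}    # psi_v(x_v)
edge_factor = {e: {(0, 0): 2.0, (0, 1): 0.5,
                   (1, 0): 0.5, (1, 1): 2.0} for e in edges}   # psi_uv(x_u, x_v)

def weight(x):
    """Unnormalized probability of a configuration x (tuple of 0/1 values)."""
    w = 1.0
    for v in nodes:
        w *= node_factor[v][x[v]]
    for (u, v) in edges:
        w *= edge_factor[(u, v)][(x[u], x[v])]
    return w

# The partition function Z sums the weights over all 2^|V| configurations; dividing by Z
# turns the product of factors into a probability distribution.
Z = sum(weight(x) for x in itertools.product([0, 1], repeat=len(nodes)))
print("Z =", Z, " log Z =", math.log(Z))
```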

4 MCMC and BP: Popular algorithms for approximating partition function Z
Two algorithms with orthogonal characteristics.
MCMC. Pros: exact (given enough time). Cons: suffers from slow mixing time.
BP. Pros: empirically fast, efficient. Cons: lacks control over approximation quality.
[Figure: approximation quality over iterations; BP converges quickly but leaves a gap, while MCMC keeps improving.]
Our Approach: We synthesize MCMC and BP to utilize both advantages.
First, I want to point out that these two algorithms have orthogonal characteristics. Say we run the two algorithms at the same time to approximate the partition function. The first observation is that BP converges much faster than MCMC, and this is because MCMC suffers from slow mixing time. However, as we run more iterations, MCMC eventually outperforms BP, because MCMC is exact given infinite time. On the other hand, BP has some approximation error, a gap that it cannot close no matter how much time is given. Inspired by this, our approach is to synthesize MCMC and BP in order to utilize their orthogonal advantages.

5 Our Approach: Estimating BP error using MCMC
Algorithm at a high level:
1. Run BP.
2. Use MCMC to estimate the BP error.
2*. Use MCMC to estimate the Loop Series (= BP error).
The BP error equals the Loop Series [Chertkov et al. 2006]. A generalized loop is a subgraph with all degrees ≥ 2.
[Figure: approximation quality; the gap between BP and the exact answer is the Loop Series, with examples of generalized loops.]
Now we describe our algorithm at a high level. First, we run BP to approximate the partition function. As mentioned before, there is some gap that BP cannot close, and we run MCMC to fill this gap. For this purpose, we utilize the closed-form expression of the BP error called the Loop Series, which was proposed by Chertkov et al. In the Loop Series, the BP error is expressed as a summation of a weight function w defined on subgraphs called generalized loops. Here, a generalized loop is a constrained subgraph in which every vertex has degree greater than or equal to two; the slide shows some examples of generalized loops. The second step is then to use MCMC to estimate the Loop Series, which is equivalent to the BP error. Naturally, we hope for a provably efficient MCMC. (A small sketch that enumerates generalized loops on a toy graph appears below.)
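As a small illustration of the definition, the following sketch enumerates the generalized loops of a toy graph by checking that every vertex touched by the edge subset has degree at least two; the example graph is an assumption, not one from the paper.

```python
# Minimal sketch (illustrative, not from the paper): enumerate the generalized loops of a
# toy graph by checking that every vertex touched by the edge subset has degree >= 2.
import itertools

edges = [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]   # example graph (assumption)

def is_generalized_loop(edge_subset):
    if not edge_subset:
        return False
    degree = {}
    for (u, v) in edge_subset:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return all(d >= 2 for d in degree.values())

generalized_loops = [
    set(subset)
    for r in range(1, len(edges) + 1)
    for subset in itertools.combinations(edges, r)
    if is_generalized_loop(subset)
]
print(len(generalized_loops), "generalized loops, e.g.", generalized_loops[0])
```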

6 Our Approach: Estimating BP error using MCMC
However, designing a provably efficient MCMC for the Loop Series is hard!
Our Main Contribution: We develop two algorithms for approximating the Loop Series.
1. MCMC for estimating the 2-regular loop series: a polynomial-time mixing MCMC for estimating a truncated version of the Loop Series (≈ BP error).
2. MCMC for estimating the full loop series: an empirically efficient MCMC for estimating the exact Loop Series (= BP error).
However, designing a provably efficient MCMC for the Loop Series is hard. Consequently, we have developed two algorithms for approximating the Loop Series. First, we develop an MCMC for estimating the 2-regular loop series; this is a polynomial-time mixing MCMC for estimating a truncated version of the Loop Series called the 2-regular loop series. In other words, we estimate an approximate version of the BP error. Second, we develop an MCMC for estimating the full loop series; this is an empirically efficient MCMC for estimating the exact Loop Series. I will begin by explaining the first algorithm, the MCMC for estimating the 2-regular loop series.

7 MCMC for 2-regular Loop Series: polynomial-time algorithm for approximating the truncated loop series (≈ BP error)
The 2-regular loop series is a truncated version of the full loop series [Chertkov et al. 2008], [Gomez et al. 2010]. A 2-regular loop (a disjoint set of cycles) is a subgraph with all degrees = 2.
It often provides nice approximation quality, e.g., it is exact in the Ising model with no external field.
It is computable in polynomial time by matrix determinants in planar graphs [Chertkov et al. 2008].
We design a polynomial-time approximation scheme for general graphs.
As mentioned before, the 2-regular loop series is a truncated version of the full loop series. It consists of 2-regular loops, which are subgraphs in which every vertex has degree exactly two; equivalently, a 2-regular loop is a union of disjoint cycles. The concept of the 2-regular loop series has been studied before and has been observed to provide nice approximation quality for the full loop series. Specifically, it is exactly the full loop series in Ising models without an external field. Moreover, it is an easier objective, and it has been shown to be computable in polynomial time by matrix determinants in planar graphs. In fact, our work can be viewed as an extension of this result, as we offer a polynomial-time approximation scheme for the 2-regular loop series in general graphs. (A minimal check of the 2-regularity condition is sketched below.)
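The degree-exactly-two condition is easy to check programmatically; here is a minimal sketch with illustrative example subgraphs (assumptions, not from the paper).

```python
# Minimal sketch (illustrative): check whether an edge subset is a 2-regular loop, i.e.
# every vertex it touches has degree exactly 2 (equivalently, a disjoint union of cycles).
def is_two_regular_loop(edge_subset):
    degree = {}
    for (u, v) in edge_subset:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return bool(edge_subset) and all(d == 2 for d in degree.values())

print(is_two_regular_loop({(0, 1), (1, 2), (0, 2)}))   # True: a triangle
print(is_two_regular_loop({(0, 1), (1, 2)}))           # False: a path has degree-1 endpoints
```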

8 MCMC for 2-regular Loop Series: polynomial-time algorithm for approximating the truncated loop series (≈ BP error)
We combine a Markov chain for 2-regular loops with simulated annealing [Khachaturyan et al. 1979].
MC description, based on the worm algorithm [Prokofiev and Svistunov, 2001]:
State space: the power set of edges (a sample = a subgraph).
Stationary distribution: proportional to the loop-series weight of the subgraph.
MC transition: 1. Add or remove (i.e., flip) an edge of the subgraph. 2. Constrain the number of odd-degree vertices to be ≤ 2, so some flips are not allowed.
Rejection scheme: if the sampled subgraph is not 2-regular, reject and try again.
[Figure: edges being added and removed; some flips are blocked by the degree constraint, and non-2-regular samples are rejected.]
Theorem [Ahn, Chertkov and Shin, 2016]: The proposed MCMC takes polynomial time to estimate the 2-regular loop series.
Now I'll describe the algorithm. At a high level, the MCMC for the 2-regular loop series is a combination of a Markov chain for sampling 2-regular loops and a simulated annealing strategy. Here, simulated annealing is a popular technique for estimating partition functions using MCMC. In this talk, I will describe only our main contribution, the Markov chain for sampling 2-regular loops. First of all, this Markov chain is based on the worm algorithm developed by Prokofiev and Svistunov. Its state space is the power set of edges; in other words, each sample from the Markov chain corresponds to a subgraph. Moreover, we design the stationary distribution to be proportional to the weight function in the Loop Series. For the transition, each step adds or removes an edge of the subgraph while constraining the number of odd-degree vertices to be at most two. For example, we can add edges one by one, or remove them repeatedly, but some edges cannot be added or removed because of the degree constraint. Having this Markov chain, we draw a subgraph after some number of iterations. If we sample a 2-regular loop, we are fine; however, we may sample a subgraph that is not 2-regular. To resolve this issue, we introduce a rejection scheme, which rejects non-2-regular subgraphs until a 2-regular subgraph is obtained. As a result, we obtain an MCMC that takes polynomial time to estimate the 2-regular loop series. Next, I will describe the MCMC for the full loop series. (A sketch of such a single-edge-flip chain with rejection appears below.)
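The sketch below is an illustration rather than the paper's implementation: a worm-style single-edge-flip Metropolis chain with the odd-degree constraint and the rejection step. The `weight` function is a placeholder for the loop-series weights, the example graph is an assumption, and the simulated annealing layer is omitted.

```python
# Minimal sketch (assumptions throughout): a worm-style single-edge-flip Metropolis chain
# over edge subsets, restricted to subgraphs with at most two odd-degree vertices, plus
# the rejection step for non-2-regular samples. The `weight` function is a placeholder
# for the loop-series weights, and the simulated annealing layer is omitted.
import random

def degrees(edge_subset):
    deg = {}
    for (u, v) in edge_subset:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def num_odd(edge_subset):
    return sum(d % 2 for d in degrees(edge_subset).values())

def is_two_regular(edge_subset):
    return bool(edge_subset) and all(d == 2 for d in degrees(edge_subset).values())

def worm_step(state, edges, weight, rng=random):
    e = rng.choice(edges)                       # propose flipping a single edge
    proposal = set(state) ^ {e}                 # add it if absent, remove it if present
    if num_odd(proposal) > 2:                   # keep at most two odd-degree vertices
        return state
    accept = min(1.0, weight(proposal) / max(weight(state), 1e-300))
    return proposal if rng.random() < accept else state

def sample_two_regular(edges, weight, steps=1000, rng=random):
    """Run the chain, then reject and rerun until the current state is 2-regular."""
    state = set()
    while True:
        for _ in range(steps):
            state = worm_step(state, edges, weight, rng)
        if is_two_regular(state):
            return state

# Illustrative usage on a tiny graph (a triangle plus one extra edge), with a
# placeholder weight that favors larger subgraphs:
edges = [(0, 1), (1, 2), (2, 0), (0, 3)]
print(sample_two_regular(edges, weight=lambda s: 2.0 ** len(s)))
```

Because flipping an edge is its own inverse, the proposal is symmetric, so the plain Metropolis acceptance ratio above suffices in this sketch.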

9 MCMC for Full Loop Series: Empirically efficient algorithm for exact loop series (= BP error)
We combine a Markov chain for generalized loops (subgraphs with all degrees ≥ 2) with simulated annealing.
MC description:
State space: the power set of edges.
Stationary distribution: proportional to the loop-series weight of the subgraph.
We utilize the concepts of a cycle basis and an all-pair path set (a collection of cycles and paths).
Cycle basis: a minimal set of cycles, expressing every Eulerian subgraph by symmetric differences.
All-pair path set: a set of paths, having a path for every possible combination of endpoints.
Lemma [Ahn, Chertkov and Shin, 2016]: Any generalized loop can be expressed by applying symmetric differences with a subset of cycle basis ∪ all-pair path set.
Again, our MCMC is a combination of a Markov chain for sampling generalized loops and a simulated annealing strategy. Like before, this Markov chain has state space equal to the power set of edges and a stationary distribution proportional to the weight function in the Loop Series. For the transition, we introduce the concepts of a cycle basis and an all-pair path set. Here, a cycle basis is a minimal set of cycles expressing every Eulerian subgraph by symmetric differences: any Eulerian subgraph can be expressed as a symmetric difference of cycles in the cycle basis. An all-pair path set is a collection of paths having a path for every combination of endpoints, so whichever pair of vertices we pick, there is a path between them in the set. In combination, we prove that any generalized loop can be expressed by applying symmetric differences with a subset of the cycle basis and the all-pair path set.

10 MCMC for Full Loop Series: Empirically efficient algorithm for exact loop series (= BP error)
We combine a Markov chain for generalized loops (subgraphs with all degrees ≥ 2) with simulated annealing.
MC description:
State space: the power set of edges.
Stationary distribution: proportional to the loop-series weight of the subgraph.
We utilize the concepts of a cycle basis and an all-pair path set (a collection of cycles and paths).
MC transition: pick an element from cycle basis ∪ all-pair path set and apply the symmetric difference.
Lemma [Ahn, Chertkov and Shin, 2016]: Any generalized loop can be expressed by applying symmetric differences with a subset of cycle basis ∪ all-pair path set.
The Markov chain then follows quite naturally: we pick an element from the cycle basis or the all-pair path set, apply the symmetric difference, and so on; a sketch of such a transition is given below. Next, I will demonstrate our algorithms' performance.
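As an illustration of this transition rule, the following sketch applies Metropolis moves by symmetric difference with elements of a hand-written cycle basis and all-pair path set for a toy graph; the basis, path set, and weight function are assumptions, not those used in the paper.

```python
# Minimal sketch (assumptions throughout): Metropolis moves by symmetric difference with
# an element of (cycle basis ∪ all-pair path set), on a toy graph (a 4-cycle with the
# chord (0, 2)). The basis, path set, and weight function are illustrative placeholders.
import random

cycle_basis = [
    frozenset({(0, 1), (1, 2), (0, 2)}),    # triangle 0-1-2
    frozenset({(0, 2), (2, 3), (0, 3)}),    # triangle 0-2-3
]
all_pair_paths = [                           # one path for every pair of endpoints
    frozenset({(0, 1)}), frozenset({(1, 2)}), frozenset({(0, 2)}),
    frozenset({(0, 3)}), frozenset({(2, 3)}), frozenset({(0, 1), (0, 3)}),  # path 1-0-3
]
moves = cycle_basis + all_pair_paths

def mh_step(state, weight, rng=random):
    move = rng.choice(moves)
    proposal = state ^ move                  # symmetric difference with the chosen element
    accept = min(1.0, weight(proposal) / max(weight(state), 1e-300))
    return proposal if rng.random() < accept else state

# Illustrative usage with a placeholder weight (the real chain targets loop-series weights):
weight = lambda s: 1.5 ** len(s)
state = frozenset()
for _ in range(10000):
    state = mh_step(state, weight)
print(sorted(state))
```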

11 Experiment: Comparison with BP and MCMC based on Gibbs sampler
Ising model. Experiments on 4x4 (left) and 10x10 (right) grid graphs. Interaction strengths are drawn from a Gaussian distribution with varying average strength. We measure the log-partition approximation error. Here, the 2-regular loop series = the full loop series.
[Figure: log-partition approximation ratio vs. average interaction strength; 4x4 grid graph (left), 10x10 grid graph (right).]
In the experiments, we compare our algorithms to BP and to an MCMC based on the Gibbs sampler. First, we experiment on the Ising model on 4x4 and 10x10 grid graphs. Here, the interaction strengths are drawn from a Gaussian distribution with varying average strength. As the performance measure, we use the log-partition approximation error. As mentioned before, the 2-regular loop series is exactly the full loop series in this model. (A small sketch of this setup appears below.)
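For a rough picture of the setup, here is a sketch under assumed conventions: a zero-field Ising weight exp(J_uv x_u x_v), Gaussian couplings with an assumed mean of 0.5, and an assumed error definition. It builds a 4x4 Ising model and computes the exact log-partition function by enumeration, against which an estimate could be compared.

```python
# Minimal sketch (assumed conventions: zero-field Ising weight exp(J_uv * x_u * x_v),
# Gaussian couplings with an assumed mean of 0.5, and an assumed error definition):
# build a 4x4 Ising model and compute the exact log-partition function by enumeration.
import itertools
import math
import random

n = 4
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]
J = {e: random.gauss(0.5, 0.5) for e in edges}   # Gaussian interaction strengths (assumption)

def log_weight(spins):
    """Unnormalized log-probability of a spin configuration (dict: node -> +/-1)."""
    return sum(J[(u, v)] * spins[u] * spins[v] for (u, v) in edges)

# Exact log Z by enumerating all 2^16 configurations (feasible only for small grids).
logZ = math.log(sum(
    math.exp(log_weight(dict(zip(nodes, s))))
    for s in itertools.product([-1, +1], repeat=len(nodes))
))
print("exact log Z =", logZ)
# An approximation error could then be reported as, e.g., |logZ_hat - logZ| (definition assumed).
```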

12 Experiment: Comparison with BP and MCMC based on Gibbs sampler
Ising model. The MCMC for the 2-regular loop series outperforms the others.
In the 4x4 grid, the MCMC for the full loop series outperforms BP and MCMC-Gibbs.
In the 10x10 grid, the MCMC for the full loop series outperforms BP, and outperforms MCMC-Gibbs in the extreme regimes (both MCMCs are slow there, but ours wins by benefiting from BP).
MCMC-Gibbs is expected to get worse as the graph grows.
[Figure: log-partition approximation ratio vs. average interaction strength; 4x4 grid graph (left), 10x10 grid graph (right); extreme regimes marked.]
Looking at the figures, the red line, the MCMC for the 2-regular loop series, outperforms the other algorithms. Next, in the 4x4 grid graph, the green line, the MCMC for the full loop series, outperforms bare BP and the MCMC based on the Gibbs sampler. In the 10x10 grid graph, the MCMC for the full loop series outperforms BP, and it outperforms the Gibbs-sampler MCMC in the extreme regimes where the average interaction strength is either very high or very low. Even though both MCMCs are slow in the extreme regimes, our algorithm benefits from the performance of BP. As the graph size grows, MCMC is expected to get worse, and our algorithm is likely to benefit even more from BP.

13 Experiment: Comparison with BP and MCMC based on Gibbs sampler
Ising model. The MCMC for the 2-regular loop series outperforms the others.
In the 4x4 grid, the MCMC for the full loop series outperforms BP and MCMC-Gibbs.
In the 10x10 grid, the MCMC for the full loop series outperforms BP, and outperforms MCMC-Gibbs in the extreme regimes (both MCMCs are slow there, but ours wins by benefiting from BP).
MCMC-Gibbs is expected to get worse as the graph grows.
4x4 grid graph: MCMC-2regular > MCMC-full > MCMC-Gibbs > BP.
As the graph grows large: MCMC-2regular > MCMC-full > BP > MCMC-Gibbs.
Summing up, in the 4x4 grid graph, the MCMC for the 2-regular loop series performs best, followed by the MCMC for the full loop series, then the MCMC based on the Gibbs sampler, and then BP. However, as the graph gets larger, the Gibbs-sampler MCMC is eventually outperformed by BP, while our algorithms still outperform bare BP by correcting its error.

14 Experiment: Comparison with BP and MCMC based on Gibbs sampler
Ising model with external fields. Experiment on the 4x4 grid graph. Interaction strengths and external fields are drawn from Gaussian distributions.
In 10x10 (or larger) grid graphs, exact computation of the partition function is no longer possible due to the external fields.
The MCMC for the 2-regular loop series is inexact, and does not perform well.
The MCMC for the full loop series performs similarly to BP and outperforms MCMC-Gibbs.
Overall: MCMC-full ≈ BP, then MCMC-Gibbs, then MCMC-2regular.
[Figure (log scale): log-partition approximation error vs. average interaction strength; the BP error is too small to be estimated with a small number of samples.]
Next, we add external fields to the Ising model. We experiment on the 4x4 grid graph, with interaction strengths and external fields drawn from Gaussian distributions. Note that in 10x10 or larger grid graphs, exact computation of the partition function is no longer possible due to the external fields. Looking at the figure, the red line, the MCMC for the 2-regular loop series, is no longer exact and naturally does not perform well. Next, the MCMC for the full loop series performs similarly to BP and outperforms the MCMC based on the Gibbs sampler. Therefore, the overall result is that the MCMC for the full loop series is similar to BP and outperforms both the Gibbs-sampler MCMC and the MCMC for the 2-regular loop series.

15 Experiment: Comparison with BP and MCMC based on Gibbs sampler
Hard-core model, i.e., the independent set model: a distribution defined on independent sets (in an independent set, no two vertices are adjacent).
Experiment on the 4x4 grid graph. We control the parameter λ, called the fugacity, which defines the distribution.
The MCMC for the full loop series outperforms MCMC-Gibbs significantly, even when BP is worse.
Overall: MCMC-full > MCMC-Gibbs > BP > MCMC-2regular.
[Figure (log scale): log-partition approximation error vs. fugacity.]
Next, we experiment on the hard-core model, which can be seen as a special instance of the Ising model with external fields. The hard-core model is a distribution on independent sets, which are sets of vertices in which no pair of vertices is adjacent. We experiment on the 4x4 grid graph, and we do not experiment on the 10x10 grid graph for the same reason as in the previous experiment. Here, we control the parameter lambda, called the fugacity, which is needed for the definition of the distribution. The MCMC for the full loop series outperforms the MCMC based on the Gibbs sampler quite significantly, even when BP is worse than the Gibbs-sampler MCMC. So the overall result is the MCMC for the full loop series performing best, followed by the MCMC based on the Gibbs sampler, BP, and the MCMC for the 2-regular loop series. (A brief sketch of the hard-core model appears below.)
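For completeness, a small sketch of the hard-core model follows, using the standard fugacity weight λ^|I| over independent sets (an assumption about the slide's exact formula); it computes the partition function of a 3x3 grid by enumeration.

```python
# Minimal sketch (illustrative; the weight lambda^|I| is the standard hard-core
# parametrization and is an assumption about the slide's exact formula): partition
# function of the hard-core (independent-set) model on a small grid at a given fugacity.
import itertools

n = 3
nodes = [(i, j) for i in range(n) for j in range(n)]
edges = [((i, j), (i, j + 1)) for i in range(n) for j in range(n - 1)] + \
        [((i, j), (i + 1, j)) for i in range(n - 1) for j in range(n)]

def is_independent_set(occupied):
    """No two occupied vertices may be adjacent."""
    return all(not (u in occupied and v in occupied) for (u, v) in edges)

def hardcore_Z(fugacity):
    Z = 0.0
    for bits in itertools.product([0, 1], repeat=len(nodes)):
        occupied = {v for v, b in zip(nodes, bits) if b}
        if is_independent_set(occupied):
            Z += fugacity ** len(occupied)
    return Z

print(hardcore_Z(1.0))   # at fugacity 1 this counts the independent sets of the 3x3 grid
```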

16 Conclusion
In summary, we have proposed:
A. A polynomial-time MCMC for the truncated, 2-regular loop series (≈ BP error).
B. An empirically effective MCMC for the full loop series (= BP error).
and in experiments,
A and B always outperform BP by correcting its error.
A or B outperforms standard MCMC by benefiting from BP's performance.
Final words: graphical models have great expressive power! However, the inference task is too expensive for large-scale applications, and our work might provide a new angle for tackling the issue.
For additional information, visit our poster at #177!
In summary, we have proposed two algorithms: A, a polynomial-time MCMC for approximating the truncated 2-regular loop series, and B, an empirically effective MCMC for approximating the full loop series. In the experiments, both of our algorithms outperform BP by correcting its error, and at least one of our algorithms outperforms standard MCMC by benefiting from BP's performance. To wrap up, I would like to mention that even though graphical models have great expressive power, their inference tasks have been too expensive for large-scale applications, and our work might provide a new angle for tackling this issue. For additional information, please visit our poster today at number 177.

