From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models
Chenguang Lu lcguang@foxmail.com 2018-11-10


1 From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models
Chenguang Lu
Homepage: This ppt may be downloaded from

2 Mixture Models: Iterations
Sampling distribution: P(X) = ∑j P*(yj)P(X|θj*).
Predicted distribution Pθ(X), produced by θ=(μ,σ) and P(Y): Pθ(X) = ∑j P(yj)P(X|θj).
The goal is to make the relative entropy (KL divergence) H(P||Pθ) approach 0.
Iterations start with Pθ(X) ≠ P(X) and end with Pθ(X) ≈ P(X).
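To make this concrete, here is a minimal numpy sketch of the setup (not from the slides): the true parameters, the starting guess, and the discretization grid are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Gaussian density N(mu, sigma^2) evaluated on grid x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Discretize X so that distributions become probability vectors.
x = np.linspace(-10, 10, 1001)

# Sampling distribution P(X) = sum_j P*(yj) P(X|θj*): the true mixture.
p_star_y = np.array([0.5, 0.5])                 # true P*(Y), assumed
mu_star, sigma_star = [-2.0, 2.0], [1.0, 1.0]   # true θ*, assumed
P = sum(p_star_y[j] * gaussian(x, mu_star[j], sigma_star[j]) for j in range(2))
P /= P.sum()                                    # normalize on the grid

# Predicted distribution Pθ(X) from a guessed θ = (μ, σ) and P(Y).
p_y = np.array([0.5, 0.5])
mu, sigma = [-1.0, 1.0], [2.0, 2.0]             # starting guess, assumed
P_theta = sum(p_y[j] * gaussian(x, mu[j], sigma[j]) for j in range(2))
P_theta /= P_theta.sum()

# Relative entropy (KL divergence) H(P||Pθ); iteration ends when it nears 0.
H = np.sum(P * np.log(P / P_theta))
print(f"H(P||Ptheta) = {H:.4f} nats")
```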

3 The EM Algorithm for Mixture Models
The popular EM algorithm and its convergence proof. The log-likelihood is a negative general entropy; the complete-data log-likelihood is a negative general joint entropy; in short, a negative entropy.
E-step: put P(yj|xi, θ) into Q.
M-step: maximize Q.
The popular convergence proof (via Jensen's inequality): 1) increasing Q increases logP(X|θ); 2) Q increases in every M-step and does not decrease in every E-step.
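As a reference point for the comparisons that follow, here is a minimal sketch of this E-step/M-step cycle for a two-component 1-D Gaussian mixture; the sample and the initialization are assumptions.

```python
import numpy as np

def em_step(x, p_y, mu, sigma):
    """One EM iteration for a 1-D Gaussian mixture.
    E-step: responsibilities P(yj|xi, θ); M-step: maximize Q."""
    # E-step: P(yj|xi,θ) ∝ P(yj) P(xi|θj)
    dens = np.stack([p_y[j] / (sigma[j] * np.sqrt(2 * np.pi))
                     * np.exp(-0.5 * ((x - mu[j]) / sigma[j]) ** 2)
                     for j in range(len(p_y))])
    resp = dens / dens.sum(axis=0)            # shape (k, n)
    # M-step: closed-form maximizers of Q for Gaussian components
    nk = resp.sum(axis=1)
    p_y = nk / len(x)
    mu = (resp @ x) / nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return p_y, mu, sigma

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)])  # assumed sample
p_y, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    p_y, mu, sigma = em_step(x, p_y, mu, sigma)
print(p_y, mu, sigma)
```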

4 The First Problem with the Convergence Proof of the EM Algorithm: Q May Be Greater than Q*
Assume P(y1)=P(y2)=0.5; µ1=µ1*, µ2=µ2*; σ1=σ2=σ. Then Q (for logP(X^N,Y|θ)) can reach -6.75N while the target is Q*=-6.89N, so increasing Q does not always lead toward the true parameters.
[1] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977).
[2] Wu, C. F. J.: On the Convergence Properties of the EM Algorithm. Annals of Statistics 11, 95–103 (1983).
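The slide's specific parameters are not reproduced here, but the claim can be checked numerically: the sketch below evaluates Q per sample point (Q/N), with responsibilities taken from the same θ, under the stated assumption P(y1)=P(y2)=0.5, µ=µ*, σ1=σ2=σ. The grid and the value σ*=1 are assumptions.

```python
import numpy as np

def q_per_sample(x, P, p_y, mu, sigma):
    """Q/N = sum_i P(xi) sum_j P(yj|xi,θ) log[P(yj) P(xi|θj)],
    with responsibilities taken from the same θ."""
    dens = np.stack([p_y[j] / (sigma[j] * np.sqrt(2 * np.pi))
                     * np.exp(-0.5 * ((x - mu[j]) / sigma[j]) ** 2)
                     for j in range(2)])       # P(yj) P(x|θj)
    resp = dens / dens.sum(axis=0)             # P(yj|x, θ)
    return np.sum(P * (resp * np.log(dens)).sum(axis=0))

x = np.linspace(-15, 15, 2001)
mu_true = np.array([-3.0, 3.0])                # assumed µ* = µ
P = 0.5 * sum(np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi) for m in mu_true)
P /= P.sum()

# Scan σ1 = σ2 = σ with P(y1)=P(y2)=0.5 and µ fixed at the true means:
for s in (0.6, 0.8, 1.0, 1.2):
    q = q_per_sample(x, P, [0.5, 0.5], mu_true, [s, s])
    print(f"sigma={s}: Q/N = {q:.3f}")
```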

5 The Second Problem with the EM Algorithm's Convergence Proof
P(Y|X) from the E-step is not a proper Shannon channel, because the new mixture component it implies, P+1(X|θj) = P(X)P(X|θj)/Pθ(X), is not normalized. For example, it is possible that ∑i P+1(xi|θ1) > 1.6 while ∑i P+1(xi|θ0) < 0.4.
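A quick numeric check of this non-normalization (a sketch; the true mixture and the mismatched guess θ are assumed values):

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10, 10, 2001)

# True mixture P(X) and a mismatched guess θ (assumed values).
P = 0.5 * gaussian(x, -2, 1) + 0.5 * gaussian(x, 2, 1)
P /= P.sum()
p_y = np.array([0.5, 0.5])
comp = np.stack([gaussian(x, -4, 1), gaussian(x, 0.5, 1)])
comp /= comp.sum(axis=1, keepdims=True)        # each P(X|θj) sums to 1
P_theta = p_y @ comp                           # predicted Pθ(X)

# Implied new components P+1(X|θj) = P(X) P(X|θj) / Pθ(X):
new_comp = P * comp / P_theta
print(new_comp.sum(axis=1))                    # the sums differ from 1
```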

6 The CM-EM Algorithm for Mixture Models
The basic idea of the CM-EM algorithm is to minimize R - G = I(X;Y) - I(X;θ), since minimizing R - G is equivalent to minimizing H(P||Pθ).
E1-step: the same as the E-step of EM.
E2-step: modify P(Y) by replacing it with P+1(Y), where P+1(yj) = ∑i P(xi)P(yj|xi), repeating until P+1(Y) ≈ P(Y).
MG-step: maximize the semantic mutual information G = I(X;θ); for Gaussian components this yields the usual mean and variance updates.
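A minimal sketch of one CM-EM iteration on a discretized P(X), assuming Gaussian components; the inner-loop tolerance and all numeric values are illustrative assumptions.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def cm_em_iteration(x, P, p_y, mu, sigma, inner=20):
    # Components P(X|θj) are fixed during the E1/E2 steps.
    comp = np.stack([gaussian(x, mu[j], sigma[j]) for j in range(len(p_y))])
    for _ in range(inner):
        # E1-step (same as EM's E-step): P(yj|xi) = P(yj)P(xi|θj)/Pθ(xi)
        joint = p_y[:, None] * comp
        resp = joint / joint.sum(axis=0)
        # E2-step: replace P(Y) with P+1(yj) = Σi P(xi)P(yj|xi),
        # repeating until P+1(Y) ≈ P(Y)
        p_y_new = resp @ P
        if np.allclose(p_y_new, p_y, atol=1e-9):
            break
        p_y = p_y_new
    # MG-step: maximize G = I(X;θ); for Gaussian components this gives
    # the mean and standard deviation weighted by P(xi)P(yj|xi)
    w = resp * P
    nk = w.sum(axis=1)
    mu = (w @ x) / nk
    sigma = np.sqrt((w * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return p_y, mu, sigma

x = np.linspace(-10, 10, 1001)
P = 0.5 * gaussian(x, -2, 1) + 0.5 * gaussian(x, 2, 1); P /= P.sum()
p_y, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.5, 1.5])
for _ in range(20):
    p_y, mu, sigma = cm_em_iteration(x, P, p_y, mu, sigma)
print(p_y, mu, sigma)
```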

7 Comparing the CM-EM and EM Algorithms for Mixture Models
Writing the Q of EM in cross-entropies: maximizing Q = minimizing H(X|θ) and Hθ(Y). The CM-EM does not minimize Hθ(Y); instead it modifies P(Y) so that P+1(Y)=P(Y), i.e., H(Y+1||Y)=0.
Relationship: the E-step of EM = the E1-step of CM-EM; the M-step of EM corresponds to the E2-step + MG-step of CM-EM.

8 Comparing the CM-EM and MM Algorithms
Neal and Hinton define F = Q + NH(Y) = -NH(X,Y|θ) + NH(Y) ≈ -NH(X|θ), and then maximize F in both the M-step and the E-step. CM-EM maximizes G = H(X) - H(X|θ) in the MG-step, so the MG-step is similar to the M-step of the MM algorithm. Maximizing F is similar to minimizing H(X|θ) or maximizing G. If we replace H(Y) with Hθ(Y) in F, then the M-step of MM is the same as the MG-step. However, the E2-step does not maximize G; it minimizes H(Y+1||Y).

9 An Iterative Example of Mixture Models with R<R* or Q<Q*
The number of iterations is 5. Both the G of CM-EM and the Q of EM increase monotonically, and H(P||Pθ) = R(G) - G → 0.

10 A Counterexample with R>R* or Q>Q* Against the EM Convergence Proof
True, starting, and ending parameters: The number of iterations is 5. Excel demo files can be downloaded from:

11 Illustrating the Convergence of the CM-EM Algorithm for R>R* and R<R*
The central idea of the CM algorithm is to find the point where G≈R on the two-dimensional R-G plane, while also driving R→R* (the EM algorithm neglects R→R*). Minimizing H(P||Pθ) = R(G) - G is similar to a min-max method.
Two examples: one starting with R<R* or Q<Q*, and one starting with R>R* or Q>Q*; the latter is a counterexample against the EM proof, in which Q decreases toward the target.

12 Comparing the Iteration Numbers of the CM-EM, EM, and MM Algorithms
For the same example used by Neal and Hinton: the EM algorithm needs 36 iterations; the MM algorithm (Neal and Hinton) needs 18 iterations; the CM-EM algorithm needs only 9 iterations.
References:
1. Lu, Chenguang: From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models.
2. Neal, Radford; Hinton, Geoffrey: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf

13 Fundamentals for the Convergence Proof 1: Semantic Information Is Defined with Log-normalized-likelihood
Semantic information conveyed by yj about xi: I(xi;θj) = log[P(xi|θj)/P(xi)].
Averaging I(xi;θj) gives the semantic Kullback-Leibler information: I(X;θj) = ∑i P(xi|yj) log[P(xi|θj)/P(xi)].
Averaging I(X;θj) gives the semantic mutual information: I(X;θ) = ∑j P(yj) ∑i P(xi|yj) log[P(xi|θj)/P(xi)] = H(X) - H(X|θ).

14 From Shannon's Channel to Semantic Channel
The Shannon channel consists of transition probability functions P(yj|X); the semantic channel consists of truth functions T(θj|X), in which yj is fixed while X varies. With the semantic mutual information formula, we may fix one channel and optimize the other, alternately.

15 Fundamentals for the Convergence Proof 2: From the R(D) Function to the R(G) Function
Shannon's information rate-distortion function R(D) gives the minimum R for a given distortion D. Replacing D with the semantic information G, we obtain the R(G) function: the minimum R for a given G. All R(G) functions are bowl-like, and the matching point is where R=G.

16 Fundamentals for the Convergence Proof 2: Two Kinds of Mutual Matching
1. For maximum mutual information classifications: mutual matching for maximum R and G.
2. For mixture models: mutual matching for minimum R - G.

17 Semantic Channel Matches Shannon's Channel
Optimize the truth function and hence the semantic channel. When the sample is large enough, the optimized truth function is proportional to the transition probability function: T*(θj|X) ∝ P(yj|X).
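In code this matching is a single normalization (a sketch; the 2×3 channel below is an assumed example, and scaling each row so its maximum is 1 follows the convention that truth functions peak at 1):

```python
import numpy as np

# Rows: yj; columns: xi. An assumed Shannon channel P(yj|xi).
trans = np.array([[0.8, 0.5, 0.1],
                  [0.2, 0.5, 0.9]])

# Optimized truth functions: T*(θj|X) ∝ P(yj|X), scaled so each max is 1.
truth = trans / trans.max(axis=1, keepdims=True)
print(truth)
```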

18 Shannon’s Channel Matches Semantic Channel
For Maximum Mutual Information Classifications Using classifier For mixture models Using E1-step and E2-step of CM-EM Repeat Until
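A sketch of such a classifier, using the semantic information measure I(xi;θj) = log[P(xi|θj)/P(xi)] from slide 13; the grid and the components are assumptions.

```python
import numpy as np

def classify(x, P_x, comp):
    """Assign each xi to the yj that conveys the most semantic
    information: yj = argmax_j log[P(xi|θj)/P(xi)]."""
    info = np.log(comp / P_x)          # I(xi;θj), shape (k, n)
    return info.argmax(axis=0)

x = np.linspace(-6, 6, 13)
comp = np.stack([np.exp(-0.5 * (x + 2) ** 2), np.exp(-0.5 * (x - 2) ** 2)])
comp /= comp.sum(axis=1, keepdims=True)
P_x = 0.5 * comp[0] + 0.5 * comp[1]
print(classify(x, P_x, comp))          # labels 0/1 split around x = 0
```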

19 The Convergence Proof of CM-EM I: Basic Formulas
Semantic mutual information: G = I(X;θ) = ∑j ∑i P(xi)P(yj|xi) log[P(xi|θj)/P(xi)].
Shannon mutual information: R = I(X;Y) = ∑j ∑i P(xi)P(yj|xi) log[P(yj|xi)/P(yj)], where P(yj|xi) = P(yj)P(xi|θj)/Pθ(xi).
Main formula for mixture models: P+1(yj) = ∑i P(xi)P(yj|xi).
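These quantities are easy to compute on a grid; below is a sketch (inputs assumed: P is P(X), p_y is P(Y), comp holds the P(X|θj)). Note that after the E2-step has made P(Y)=P+1(Y), R - G equals H(P||Pθ).

```python
import numpy as np

def r_g_py1(P, p_y, comp):
    """Shannon mutual information R = I(X;Y), semantic mutual
    information G = I(X;θ), and P+1(Y) for a mixture model."""
    joint = p_y[:, None] * comp                # P(yj) P(xi|θj)
    P_theta = joint.sum(axis=0)                # predicted Pθ(X)
    resp = joint / P_theta                     # P(yj|xi), the Shannon channel
    w = resp * P                               # P(xi) P(yj|xi)
    p_y1 = w.sum(axis=1)                       # P+1(yj) = Σi P(xi)P(yj|xi)
    R = np.sum(w * np.log(resp / p_y1[:, None]))   # I(X;Y) with P(Y)=P+1(Y)
    G = np.sum(w * np.log(comp / P))               # I(X;θ) = H(X) - H(X|θ)
    return R, G, p_y1
```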

20 The Convergence Proof of CM-EM II: Using the Variational Method
Proving that Pθ(X) converges to P(X) is equivalent to proving that H(P||Pθ) converges to 0. We only need to prove that every step after the start step minimizes R-G. The E2-step makes R=R'' and H(Y+1||Y)=0, and the MG-step maximizes G without changing R; the remaining work is to prove that the E1-step and E2-step minimize R-G. Fortunately, we can strictly prove this by the variational and iterative methods that Shannon (1959) and others (Berger, 1971; Zhou, 1983) used for analyzing the rate-distortion function R(D).

21 The CM Algorithm: Using Optimized Mixture Models for Maximum Mutual Information Classifications
The task is to find the best dividing points. First assume a partition z' to obtain P(zj|Y).
Matching I: obtain T*(θzj|Y) and the information lines I(Y;θzj|X).
Matching II: update the partition with the classifier.
If H(P||Pθ) < 0.001, then end; else go to Matching I.

22 Illustrating the Convergence of the CM Algorithm for Maximum Mutual Information Classifications with the R(G) Function
Iterative steps and convergence reasons: 1) for each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood; 2) for a given P(X) and semantic channel, we can find a better Shannon channel; 3) repeating the two steps obtains the Shannon channel that maximizes the Shannon mutual information and the average log-likelihood. An R(G) function serves as a ladder letting R climb up and find a better semantic channel, and hence a better ladder.

23 An Example Showing the Reliability of the CM Algorithm
A 3×3 Shannon channel demonstrates reliable convergence: even if a pair of bad starting points is used, the algorithm still converges. With good starting points, the number of iterations is 4; with very bad starting points, the number of iterations is 11.

24 Summary
The CM algorithm is a new tool for statistical learning. To show its power, we use the CM-EM algorithm to resolve the problems with mixture models. In real applications, X may be multi-dimensional; however, the convergence reasons and reliability should be the same.
——End——
Thank you for listening! Criticism is welcome!
Reported at ICIS2017 (the Second International Conference on Intelligence Science, Shanghai); revised for a better convergence proof. More papers on the author's semantic information theory:

