From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models
Chenguang Lu (鲁晨光), lcguang@foxmail.com, 2018-11-10
Homepage: http://survivor99.com/ and http://www.survivor99.com/lcg/english/
This ppt may be downloaded from http://survivor99.com/lcg/CM/CM4mix.ppt

Mixture Models and Iterations
Sampling distribution: P(X) = ∑_j P*(y_j) P(X|θ_j*).
Predicted distribution: Pθ(X), produced by the parameters θ = (μ, σ) and the guessed P(Y).
The goal of the iterations is to make the relative entropy (KL divergence) between P(X) and Pθ(X) approach 0.
At the start of the iterations, Pθ(X) ≠ P(X); at the end, Pθ(X) ≈ P(X).
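To make the setup concrete, here is a minimal numerical sketch (not from the original slides): it discretizes X on an assumed grid, builds a sampling distribution P(X) and a predicted distribution Pθ(X) as two-component Gaussian mixtures with illustrative parameters, and measures the relative entropy H(P||Pθ) that the iterations are meant to drive to zero.

```python
import numpy as np

x = np.linspace(-10, 20, 301)  # assumed discretization of the X axis

def mixture(x, means, sigmas, priors):
    """Mixture distribution sum_j P(y_j) P(X|theta_j), normalized on the grid."""
    pdf = sum(p * np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
              for m, s, p in zip(means, sigmas, priors))
    return pdf / pdf.sum()

# Sampling distribution P(X) from assumed "true" parameters P*(y_j), theta_j*
P = mixture(x, means=[0.0, 10.0], sigmas=[3.0, 3.0], priors=[0.5, 0.5])
# Predicted distribution P_theta(X) from guessed theta = (mu, sigma) and P(Y)
Pt = mixture(x, means=[2.0, 8.0], sigmas=[4.0, 4.0], priors=[0.3, 0.7])

# Relative entropy (KL divergence) H(P||P_theta); iteration should drive it toward 0
print(f"H(P||P_theta) = {np.sum(P * np.log(P / Pt)):.4f} nats")
```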

The EM Algorithm for Mixture Models
The popular EM algorithm and its convergence proof. The log-likelihood is a negative generalized entropy; the complete-data log-likelihood Q is a negative generalized joint entropy, in short, a negative entropy.
E-step: compute P(y_j|x_i, θ) and put it into Q.
M-step: maximize Q.
Popular convergence proof (based on Jensen's inequality): 1) increasing Q increases log P(X|θ); 2) Q increases in every M-step and does not decrease in every E-step.
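For comparison with the CM-EM steps introduced later, here is a generic textbook EM sketch for a one-dimensional two-component Gaussian mixture; it is not the authors' code, and the synthetic data, starting values, and iteration count are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample from an assumed true mixture: 0.5*N(0, 3^2) + 0.5*N(10, 3^2)
data = np.concatenate([rng.normal(0.0, 3.0, 5000), rng.normal(10.0, 3.0, 5000)])

mu, sigma, prior = np.array([1.0, 6.0]), np.array([5.0, 5.0]), np.array([0.5, 0.5])

for step in range(50):
    # E-step: responsibilities P(y_j | x_i, theta), put into Q
    dens = np.stack([prior[j] * np.exp(-(data - mu[j]) ** 2 / (2 * sigma[j] ** 2))
                     / (np.sqrt(2 * np.pi) * sigma[j]) for j in range(2)])
    resp = dens / dens.sum(axis=0)
    # M-step: maximize Q by re-estimating P(Y), mu, sigma from the responsibilities
    Nj = resp.sum(axis=1)
    prior = Nj / len(data)
    mu = (resp * data).sum(axis=1) / Nj
    sigma = np.sqrt((resp * (data - mu[:, None]) ** 2).sum(axis=1) / Nj)

print("P(Y) =", prior.round(3), "mu =", mu.round(3), "sigma =", sigma.round(3))
```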

The First Problem with the Convergence Proof of the EM Algorithm: Q may be greater than Q*
Assume P(y_1) = P(y_2) = 0.5; μ_1 = μ_1*, μ_2 = μ_2*; σ_1 = σ_2 = σ.
In this example, the complete-data log-likelihood Q (the quantity log P(X^N, Y|θ) in the figure) reaches −6.75N, whereas the target value is Q* = −6.89N, so Q can exceed the target Q*.
[1] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B 39, 1–38 (1977).
[2] Wu, C. F. J.: On the Convergence Properties of the EM Algorithm. Annals of Statistics 11, 95–103 (1983).

The Second Problem with the Convergence Proof of the EM Algorithm
P(Y|X) from the E-step is not a proper Shannon channel, because the new mixture component
P(X|θ_j^{+1}) = P(X)P(X|θ_j)/Pθ(X)
is not normalized. For example, it is possible that ∑_i P(x_i|θ_1^{+1}) > 1.6 while ∑_i P(x_i|θ_2^{+1}) < 0.4.
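This can be checked numerically. The sketch below (with an assumed grid and illustrative parameters) evaluates ∑_i P(x_i)P(x_i|θ_j)/Pθ(x_i) for two components; when Pθ(X) differs from P(X), the two sums generally deviate from 1 in opposite directions, which is the point of this slide.

```python
import numpy as np

x = np.linspace(-10, 20, 301)
gauss = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
norm = lambda p: p / p.sum()

# Sampling distribution P(X) from assumed true parameters
P = norm(0.5 * gauss(x, 0.0, 3.0) + 0.5 * gauss(x, 10.0, 3.0))

# Component likelihoods P(X|theta_j) and predicted mixture P_theta(X) from guessed theta
lik1, lik2 = norm(gauss(x, 5.0, 3.0)), norm(gauss(x, 15.0, 3.0))
Pt = 0.5 * lik1 + 0.5 * lik2

# "New" component distributions implied by the E-step: P(X)P(X|theta_j)/P_theta(X)
for j, lik in enumerate([lik1, lik2], start=1):
    print(f"sum_i P(x_i|theta_{j}^+1) = {np.sum(P * lik / Pt):.3f}")  # generally != 1
```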

The CM-EM Algorithm for Mixture Models
The basic idea of the CM-EM algorithm is to minimize R − G = I(X;Y) − I(X;θ), since minimizing R − G is equivalent to minimizing H(P||Pθ).
E1-step: the same as the E-step of EM.
E2-step: modify P(Y) by replacing it with P^{+1}(Y), where P^{+1}(y_j) = ∑_i P(x_i)P(y_j|x_i); repeat until P^{+1}(Y) ≈ P(Y).
MG-step: maximize the semantic mutual information G = I(X;θ). For Gaussian distributions, this refits (μ_j, σ_j) as the weighted means and standard deviations under the weights P(x_i)P(y_j|x_i).
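The following is a minimal sketch of the CM-EM loop described on this slide, on a discretized X axis: the E1-step is the EM E-step, the E2-step repeatedly replaces P(Y) by P^{+1}(Y) until it stabilizes, and the MG-step refits the Gaussian parameters by weighted means and standard deviations. The grid, starting values, thresholds, and the weighted-MLE form of the MG-step are assumptions of this sketch, not the authors' code.

```python
import numpy as np

x = np.linspace(-10, 20, 301)
gauss = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
norm = lambda p: p / p.sum()

# Sampling distribution P(X) from assumed true parameters
P = norm(0.5 * gauss(x, 0.0, 3.0) + 0.5 * gauss(x, 10.0, 3.0))

mu, sigma, Py = np.array([2.0, 8.0]), np.array([5.0, 5.0]), np.array([0.4, 0.6])

for it in range(200):
    lik = np.stack([norm(gauss(x, mu[j], sigma[j])) for j in range(2)])  # P(X|theta_j)
    # E1-step (= EM E-step) and E2-step: replace P(Y) by P+1(Y) until P+1(Y) ~= P(Y)
    for _ in range(30):
        Pt = Py @ lik                       # P_theta(X) = sum_j P(y_j) P(X|theta_j)
        Pyx = Py[:, None] * lik / Pt        # P(y_j|x_i)
        Py_new = Pyx @ P                    # P+1(y_j) = sum_i P(x_i) P(y_j|x_i)
        if np.max(np.abs(Py_new - Py)) < 1e-6:
            break
        Py = Py_new
    if np.sum(P * np.log(P / Pt)) < 1e-6:   # H(P||P_theta) ~= 0: converged
        break
    # MG-step: maximize G = I(X;theta) over (mu, sigma) with P(y_j|x_i) fixed
    w = Pyx * P                             # weights P(x_i) P(y_j|x_i)
    Nj = w.sum(axis=1)
    mu = (w @ x) / Nj
    sigma = np.sqrt((w * (x - mu[:, None]) ** 2).sum(axis=1) / Nj)

print("P(Y) =", Py.round(3), "mu =", mu.round(3), "sigma =", sigma.round(3))
```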

Comparing the CM-EM and EM Algorithms for Mixture Models
Writing the Q of EM in terms of cross-entropies: maximizing Q = minimizing H(X|θ) and Hθ(Y).
The CM-EM algorithm does not minimize Hθ(Y); instead it modifies P(Y) so that P^{+1}(Y) = P(Y), which makes H(Y^{+1}||Y) = 0.
Relationship: the E-step of EM = the E1-step of CM-EM; the M-step of EM corresponds to the (E2-step + MG-step) of CM-EM.

Comparing the CM-EM and MM Algorithms
Neal and Hinton define F = Q + NH(Y) = −NH(X,Y|θ) + NH(Y) ≈ −NH(X|θ), and then maximize F in both the M-step and the E-step.
CM-EM maximizes G = H(X) − H(X|θ) in the MG-step, so the MG-step is similar to the M-step of the MM algorithm: maximizing F is similar to minimizing H(X|θ) or maximizing G.
If we replace H(Y) with Hθ(Y) in F, then the M-step of MM becomes the same as the MG-step. However, the E2-step does not maximize G; it minimizes H(Y^{+1}||Y).

An Iterative Example of Mixture Models with R < R* or Q < Q*
The number of iterations is 5. In the figure, both the CM curve (G) and the EM curve (Q) are monotonically increasing, and H(Q||P) = R(G) − G → 0.

A Counterexample with R > R* or Q > Q* against the EM Convergence Proof
The true, starting, and ending parameters are shown in the original slide. Excel demo files can be downloaded from http://survivor99.com/lcg/cc-iteration.zip. The number of iterations is 5.

Illustrating the Convergence of the CM-EM Algorithm for R > R* and for R < R*
The central idea of the CM algorithm is to find the point where G ≈ R on the two-dimensional R-G plane, while also driving R toward R* (the EM algorithm neglects R → R*); that is, to minimize H(Q||P) = R(G) − G (similar to a min-max method).
Two examples: starting with R < R* (or Q < Q*), and starting with R > R* (or Q > Q*). The latter is a counterexample against the EM convergence proof, in which Q is decreasing toward the target.

Comparing the Iteration Numbers of the CM-EM, EM, and MM Algorithms
For the same example used by Neal and Hinton:
the EM algorithm needs 36 iterations;
the MM algorithm (Neal and Hinton) needs 18 iterations;
the CM-EM algorithm needs only 9 iterations.
References:
1. Lu, Chenguang: From the EM Algorithm to the CM-EM Algorithm for Global Convergence of Mixture Models, http://arxiv.org/a/lu_c_3.
2. Neal, Radford; Hinton, Geoffrey: A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants, ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf.

Fundamentals for the Convergence Proof 1: Semantic Information Is Defined with Log-normalized-likelihood
Semantic information conveyed by y_j about x_i: I(x_i; θ_j) = log [P(x_i|θ_j) / P(x_i)] (the log-normalized-likelihood).
Averaging I(x_i; θ_j) over P(X|y_j) gives the semantic Kullback-Leibler information: I(X; θ_j) = ∑_i P(x_i|y_j) log [P(x_i|θ_j) / P(x_i)].
Averaging I(X; θ_j) over P(Y) gives the semantic mutual information: I(X; θ) = ∑_j P(y_j) ∑_i P(x_i|y_j) log [P(x_i|θ_j) / P(x_i)].
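A small numerical sketch of these three definitions (the grid, parameters, and mixture are illustrative assumptions); it computes I(x_i;θ_j), the semantic Kullback-Leibler information I(X;θ_j), and the semantic mutual information G = I(X;θ) on a discretized X axis.

```python
import numpy as np

x = np.linspace(-10, 20, 301)
gauss = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
norm = lambda p: p / p.sum()

P = norm(0.5 * gauss(x, 0.0, 3.0) + 0.5 * gauss(x, 10.0, 3.0))            # P(X)
lik = np.stack([norm(gauss(x, 0.0, 3.0)), norm(gauss(x, 10.0, 3.0))])     # P(X|theta_j)
Py = np.array([0.5, 0.5])
Pyx = Py[:, None] * lik / (Py @ lik)                                      # P(y_j|X)

I_pair = np.log(lik / P)            # I(x_i;theta_j) = log[P(x_i|theta_j)/P(x_i)]
Pj = Pyx @ P                        # P(y_j) = sum_i P(x_i) P(y_j|x_i)
Pxy = Pyx * P / Pj[:, None]         # P(x_i|y_j)
I_KL = (Pxy * I_pair).sum(axis=1)   # semantic Kullback-Leibler information I(X;theta_j)
G = (Pj * I_KL).sum()               # semantic mutual information I(X;theta)
print("I(X;theta_j) =", I_KL.round(4), " G =", round(G, 4))
```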

From Shannon's Channel to the Semantic Channel
The Shannon channel consists of the transition probability functions P(y_j|X); the semantic channel consists of the truth functions T(θ_j|X) (in which y_j is fixed while X varies).
The semantic mutual information formula is I(X;θ), as defined above.
We may fix one channel and optimize the other, alternately.

Fundamentals for the Convergence Proof 2: From the R(D) Function to the R(G) Function
Shannon's rate-distortion function R(D) gives the minimum rate R for a given distortion D.
Replacing the distortion D with the semantic mutual information G, we obtain the R(G) function: the minimum R for a given G.
All R(G) functions are bowl-like; the figure marks the matching point, where G = R.

Fundamentals for the Convergence Proof 3: Two Kinds of Mutual Matching
1. For maximum mutual information classifications: match the two channels for maximum R and G.
2. For mixture models: match the two channels for minimum R − G.

The Semantic Channel Matches Shannon's Channel
Optimize the truth function and the semantic channel. When the sample is large enough, the optimized truth function is proportional to the transition probability function: T*(θ_j|X) ∝ P(y_j|X).
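A sketch of this matching step; the normalization used here (scaling each truth function so its largest value is 1) is an assumption of the sketch, the slide itself only states proportionality.

```python
import numpy as np

def match_semantic_channel(Pyx):
    """Semantic channel matches the Shannon channel:
    T*(theta_j|X) proportional to P(y_j|X), scaled so that max_X T*(theta_j|X) = 1."""
    return Pyx / Pyx.max(axis=1, keepdims=True)

# An illustrative 2x5 Shannon channel P(y_j|x_i)
Pyx = np.array([[0.9, 0.7, 0.5, 0.2, 0.1],
                [0.1, 0.3, 0.5, 0.8, 0.9]])
print(match_semantic_channel(Pyx).round(2))
```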

Shannon's Channel Matches the Semantic Channel
For maximum mutual information classifications: use the classifier y_j* = f(x_i) = argmax_j I(x_i; θ_j).
For mixture models: use the E1-step and E2-step of CM-EM; repeat the E2-step until P^{+1}(Y) ≈ P(Y).
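For the classification case, a minimal sketch of this matching step, written with the I(x_i;θ_j) = log[P(x_i|θ_j)/P(x_i)] form reconstructed earlier; the grid and parameters are illustrative assumptions.

```python
import numpy as np

def classify(lik, Px):
    """Shannon channel matches the semantic channel: for each x_i choose
    y_j* = argmax_j I(x_i;theta_j) = argmax_j log[P(x_i|theta_j)/P(x_i)]."""
    return np.log(lik / Px).argmax(axis=0)

x = np.linspace(-10, 20, 301)
gauss = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
norm = lambda p: p / p.sum()

Px = norm(0.5 * gauss(x, 0.0, 3.0) + 0.5 * gauss(x, 10.0, 3.0))
lik = np.stack([norm(gauss(x, 0.0, 3.0)), norm(gauss(x, 10.0, 3.0))])
labels = classify(lik, Px)
print("label switches near x =", x[np.argmax(labels)])  # dividing point of the classifier
```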

The Convergence Proof of CM-EM I: Basic Formulas
Semantic mutual information: G = I(X;θ) = ∑_j ∑_i P(x_i) P(y_j|x_i) log [P(x_i|θ_j) / P(x_i)].
Shannon mutual information: R = I(X;Y) = ∑_j ∑_i P(x_i) P(y_j|x_i) log [P(y_j|x_i) / P(y_j)], where P(y_j) = ∑_i P(x_i) P(y_j|x_i).
Main formula for mixture models: Pθ(X) = ∑_j P(y_j) P(X|θ_j).
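The relation between these formulas stated earlier in the slides, R − G = H(P||Pθ), which holds once the E2-step has made P^{+1}(Y) = P(Y), can be checked numerically; the sketch below uses an assumed grid and illustrative parameters.

```python
import numpy as np

x = np.linspace(-10, 20, 301)
gauss = lambda x, m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)
norm = lambda p: p / p.sum()

P = norm(0.5 * gauss(x, 0.0, 3.0) + 0.5 * gauss(x, 10.0, 3.0))            # P(X)
lik = np.stack([norm(gauss(x, 2.0, 4.0)), norm(gauss(x, 8.0, 4.0))])      # P(X|theta_j)
Py = np.array([0.4, 0.6])

# E2-step: iterate P+1(y_j) = sum_i P(x_i) P(y_j|x_i) until P+1(Y) ~= P(Y)
for _ in range(300):
    Pt = Py @ lik
    Pyx = Py[:, None] * lik / Pt
    Py = Pyx @ P

R = np.sum(Pyx * P * np.log(Pyx / Py[:, None]))   # Shannon mutual information I(X;Y)
G = np.sum(Pyx * P * np.log(lik / P))             # semantic mutual information I(X;theta)
H = np.sum(P * np.log(P / Pt))                    # relative entropy H(P||P_theta)
print(f"R - G = {R - G:.6f},  H(P||P_theta) = {H:.6f}")
```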

The Convergence Proof of CM-EM II: Using the Variational Method
Proving that Pθ(X) converges to P(X) is equivalent to proving that H(P||Pθ) converges to 0. Since the E2-step makes R = R'' and H(Y^{+1}||Y) = 0, we only need to prove that every step after the start step minimizes R − G. The MG-step maximizes G without changing R, so the remaining work is to prove that the E1-step and E2-step minimize R − G. Fortunately, we can strictly prove this by the variational method and the iterative method that Shannon (1959) and others (Berger, 1971; Zhou, 1983) used for analyzing the rate-distortion function R(D).

The CM Algorithm: Using Optimized Mixture Models for Maximum Mutual Information Classifications
The task is to find the best dividing points. First assume a partition z' to obtain P(z_j|Y).
Matching I: obtain the optimized truth functions T*(θ_zj|Y) and the information lines I(Y; θ_zj|X).
Matching II: use the classifier to update the partition. If H(P||Pθ) < 0.001, then end; else go to Matching I.

Illustrating the Convergence of the CM Algorithm for Maximum Mutual Information Classifications with the R(G) Function
Iterative steps and convergence reasons:
1) For each Shannon channel, there is a matched semantic channel that maximizes the average log-likelihood;
2) For a given P(X) and semantic channel, we can find a better Shannon channel;
3) Repeating the two steps yields the Shannon channel that maximizes both the Shannon mutual information and the average log-likelihood.
An R(G) function serves as a ladder that lets R climb up; we then find a better semantic channel, which provides a better ladder.

An Example Showing the Reliability of the CM Algorithm
A 3×3 Shannon channel is used to show reliable convergence. Even if a pair of bad starting points is used, convergence is still reliable: with good starting points, the number of iterations is 4; with very bad starting points, the number of iterations is 11. (The figure shows the channel at the beginning and after convergence.)

Summary
The CM algorithm is a new tool for statistical learning. To show its power, we use the CM-EM algorithm to resolve the problems with mixture models. In real applications, X may be multi-dimensional; however, the convergence reasons and reliability should be the same.
Thank you for listening! Criticism is welcome!
2017-8-26: reported at ICIS2017 (the 2nd International Conference on Intelligence Science, Shanghai).
2018-11-9: revised for a better convergence proof.
More papers on the author's semantic information theory: http://survivor99.com/lcg/books/GIT/index.htm