Clustering (2) & EM algorithm. Outline: Model-based clustering; EM algorithm. References: Data Clustering, by Gan et al.; Machine Learning: A Probabilistic Perspective; The Expectation Maximization Algorithm: A Short Tutorial, by Borman.
Model-based clustering Impose certain model assumptions on potential clusters; try to optimize the fit between data and model. The data is viewed as coming from a mixture of probability distributions; each of the distributions represents a cluster.
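Model-based clustering For example, if we believe the data come from a mixture of several Gaussian densities, the likelihood that data point $x_i$ is from cluster $j$ is the Gaussian density $f_j(x_i \mid \mu_j, \Sigma_j) = (2\pi)^{-p/2} |\Sigma_j|^{-1/2} \exp\{-\tfrac{1}{2}(x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j)\}$, where $p$ is the dimension of the data and $\mu_j$, $\Sigma_j$ are the mean and covariance of cluster $j$. Classification likelihood approach: find cluster assignments $\gamma_1, \dots, \gamma_n$ and parameters that maximize $L_C = \prod_{i=1}^{n} f_{\gamma_i}(x_i \mid \mu_{\gamma_i}, \Sigma_{\gamma_i})$.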
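Model-based clustering Mixture likelihood approach: find mixing proportions $\tau_j$ (with $\tau_j \ge 0$ and $\sum_{j=1}^{k} \tau_j = 1$) and parameters that maximize $L_M = \prod_{i=1}^{n} \sum_{j=1}^{k} \tau_j f_j(x_i \mid \mu_j, \Sigma_j)$, where $f_j$ is the density of cluster $j$. The most commonly used method is the EM algorithm. It iterates between soft cluster assignment and parameter estimation.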
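The difference between the classification and mixture likelihoods can be made concrete with a small numerical sketch. This is an illustration under stated assumptions, not the slides' own code: the 1D data, the labels, and the names gaussian_pdf, classification_loglik and mixture_loglik are all hypothetical choices made for this example.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def classification_loglik(x, labels, means, variances):
    """Sum of log f_{label_i}(x_i): each point is scored by its assigned cluster only."""
    return float(np.sum(np.log(gaussian_pdf(x, means[labels], variances[labels]))))

def mixture_loglik(x, weights, means, variances):
    """Sum of log sum_j tau_j f_j(x_i): each point is scored by the whole mixture."""
    dens = gaussian_pdf(x[:, None], means[None, :], variances[None, :])  # shape (n, k)
    return float(np.sum(np.log(dens @ weights)))

# Hypothetical toy data: two well-separated groups with hard labels.
x = np.array([-2.1, -1.9, -2.0, 3.0, 3.2, 2.8])
labels = np.array([0, 0, 0, 1, 1, 1])
means = np.array([-2.0, 3.0])
variances = np.array([0.5, 0.5])
weights = np.array([0.5, 0.5])

lc = classification_loglik(x, labels, means, variances)
lm = mixture_loglik(x, weights, means, variances)
print(lc, lm)
```

With well-separated clusters, the mixture log-likelihood is close to the classification log-likelihood plus the sum of the log mixing proportions, because almost all of each point's mixture density comes from its own component.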
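EM algorithm In maximum likelihood estimation, the (log-)likelihood function is a function of the parameter θ given the data X: $L(\theta) = \ln P(X \mid \theta)$. The EM algorithm is an iterative procedure for maximizing L(θ). After the nth iteration, the current estimate for θ is $\theta_n$. We want an update $\theta_{n+1}$ that maximizes the improvement $L(\theta) - L(\theta_n) = \ln P(X \mid \theta) - \ln P(X \mid \theta_n)$. In many problems there are unobserved variables, collected in a hidden random vector Z. Then $P(X \mid \theta) = \sum_z P(X \mid z, \theta) P(z \mid \theta)$. In clustering, z is the unobserved cluster label, and its conditional distribution given the data is the soft cluster assignment.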
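EM algorithm Because $-\ln(\cdot)$ is convex (equivalently, $\ln$ is concave), Jensen's inequality gives $\ln \sum_i \lambda_i x_i \ge \sum_i \lambda_i \ln x_i$ for weights $\lambda_i \ge 0$ with $\sum_i \lambda_i = 1$. With $L(\theta) = \ln P(X \mid \theta)$ and weights $P(z \mid X, \theta_n)$, this yields the lower bound $L(\theta) - L(\theta_n) \ge \sum_z P(z \mid X, \theta_n) \ln \frac{P(X \mid z, \theta) P(z \mid \theta)}{P(z \mid X, \theta_n) P(X \mid \theta_n)} = \Delta(\theta \mid \theta_n)$, with $\Delta(\theta_n \mid \theta_n) = 0$.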
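EM algorithm Dropping the terms of the lower bound that do not depend on θ, the update is $\theta_{n+1} = \arg\max_\theta \sum_z P(z \mid X, \theta_n) \ln P(X, z \mid \theta)$. That is, $\theta_{n+1}$ maximizes the expectation of $\ln P(X, z \mid \theta)$ over the distribution of $z \mid X, \theta_n$.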
EM algorithm Thus, at every $\theta_n$, we find the conditional distribution of the hidden variables z given X and $\theta_n$ and take the expectation of $\ln P(X, z \mid \theta)$ over this distribution (E step), then find the $\theta_{n+1}$ that maximizes this expectation (M step).
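EM algorithm Convergence of the EM algorithm. At every step, $\theta_{n+1}$ is the maximizer of the lower bound $\Delta(\theta \mid \theta_n)$ on $L(\theta) - L(\theta_n)$ obtained from Jensen's inequality. So $\Delta(\theta_{n+1} \mid \theta_n) \ge \Delta(\theta_n \mid \theta_n) = 0$, and therefore $L(\theta_{n+1}) \ge L(\theta_n)$. Thus the likelihood L(θ) is non-decreasing from one iteration to the next. Most of the time, EM will converge to a local maximum, but this is not necessarily the local maximum closest to the starting point, nor the global maximum.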
EM algorithm Nature Biotechnology volume 26, pages 897–899 (2008)
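EM algorithm Example: the two-coin problem. Scenario 1: no missing values. The coin used for each set of tosses is known, so the maximum likelihood estimate of each coin's heads probability is simply its observed fraction of heads.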
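EM algorithm Scenario 2: the identity of the coin tossed in each set is missing. The coin identity is treated as the hidden variable z: the E step computes the probability that each set came from each coin under the current estimates, and the M step re-estimates each coin's heads probability from the correspondingly weighted counts.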
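Scenario 2 can be sketched in a few lines of Python. This is a minimal sketch under assumptions that are not in the slides: the head counts are illustrative values chosen for this example, both coins are taken to be equally likely a priori, and the function name em_two_coins and the starting values 0.6 and 0.5 are hypothetical.

```python
import numpy as np

# Illustrative data (an assumption of this sketch): heads observed in 5 sets of 10 tosses.
heads = np.array([5, 9, 8, 4, 7])
n_tosses = 10

def em_two_coins(heads, n_tosses, theta_a, theta_b, n_iter=20):
    """EM for two coins with unknown identity per set; equal prior on the coins."""
    tails = n_tosses - heads
    history = []
    for _ in range(n_iter):
        # Likelihood of each set under each coin (binomial coefficient omitted,
        # since it is the same for both coins and cancels in the E step).
        lik_a = theta_a ** heads * (1.0 - theta_a) ** tails
        lik_b = theta_b ** heads * (1.0 - theta_b) ** tails
        # Observed-data log-likelihood at the current estimates (up to a constant).
        history.append(float(np.sum(np.log(0.5 * lik_a + 0.5 * lik_b))))
        # E step: posterior probability that each set came from coin A or coin B.
        w_a = lik_a / (lik_a + lik_b)
        w_b = 1.0 - w_a
        # M step: weighted fraction of heads attributed to each coin.
        theta_a = np.sum(w_a * heads) / np.sum(w_a * n_tosses)
        theta_b = np.sum(w_b * heads) / np.sum(w_b * n_tosses)
    return float(theta_a), float(theta_b), history

theta_a, theta_b, loglik = em_two_coins(heads, n_tosses, theta_a=0.6, theta_b=0.5)
print(theta_a, theta_b)
```

The returned history makes it possible to check that the log-likelihood does not decrease between iterations, which is the convergence property stated earlier.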
Model-based clustering EM algorithm in the simplest case: a two-component Gaussian mixture in 1D
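The 1D two-component case is small enough to write out in full. The following is a sketch under assumptions, not the slides' derivation: the synthetic data, the crude initialization (means at the data minimum and maximum), the fixed iteration count, and the names normal_pdf and em_gmm_1d are all hypothetical choices for illustration.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of a 1D Gaussian N(mean, var) evaluated at x."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1D Gaussian mixture with free means and variances."""
    # Crude initialization (an assumption of this sketch).
    pi = 0.5
    mu1, mu2 = float(np.min(x)), float(np.max(x))
    var1 = var2 = float(np.var(x))
    loglik = []
    for _ in range(n_iter):
        # Weighted component densities at the current parameters.
        p1 = pi * normal_pdf(x, mu1, var1)
        p2 = (1.0 - pi) * normal_pdf(x, mu2, var2)
        loglik.append(float(np.sum(np.log(p1 + p2))))
        # E step: soft assignment of each point to component 1 and component 2.
        r1 = p1 / (p1 + p2)
        r2 = 1.0 - r1
        # M step: weighted proportion, means and variances.
        pi = float(np.mean(r1))
        mu1 = float(np.sum(r1 * x) / np.sum(r1))
        mu2 = float(np.sum(r2 * x) / np.sum(r2))
        var1 = float(np.sum(r1 * (x - mu1) ** 2) / np.sum(r1))
        var2 = float(np.sum(r2 * (x - mu2) ** 2) / np.sum(r2))
    return pi, (mu1, mu2), (var1, var2), loglik

# Synthetic data (hypothetical): 300 points around -2 and 200 points around 3.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])
pi, mus, variances, loglik = em_gmm_1d(x)
print(pi, mus, variances)
```

A production implementation would normally work with log densities and guard against a variance collapsing to zero; both are omitted here to keep the E and M steps easy to read.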
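Model-based clustering Gaussian cluster models. E step: given the current parameters, compute the soft assignments $\hat{z}_{ij} = \tau_j f_j(x_i \mid \mu_j, \Sigma_j) / \sum_{l=1}^{k} \tau_l f_l(x_i \mid \mu_l, \Sigma_l)$, where $\tau_j$ is the mixing proportion and $f_j$ the Gaussian density of cluster $j$. M step: with $n_j = \sum_{i=1}^{n} \hat{z}_{ij}$, update $\tau_j = n_j / n$, $\mu_j = \frac{1}{n_j} \sum_{i=1}^{n} \hat{z}_{ij} x_i$, and, for unconstrained covariances, $\Sigma_j = \frac{1}{n_j} \sum_{i=1}^{n} \hat{z}_{ij} (x_i - \mu_j)(x_i - \mu_j)^T$. The covariance update changes when the covariance matrices are constrained.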
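The E and M steps for multivariate Gaussian clusters can be sketched with NumPy alone. This is an illustrative sketch assuming unconstrained covariance matrices; the function names gaussian_density, e_step and m_step, the synthetic 2D data, and the initialization from two data points are hypothetical choices, not part of the slides.

```python
import numpy as np

def gaussian_density(X, mean, cov):
    """Multivariate Gaussian density N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    # Squared Mahalanobis distance of each row, without forming the inverse explicitly.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def e_step(X, weights, means, covs):
    """Soft assignments: posterior probability of each cluster for each point."""
    dens = np.column_stack(
        [w * gaussian_density(X, m, c) for w, m, c in zip(weights, means, covs)]
    )
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(X, resp):
    """Weighted updates of mixing proportions, means and unconstrained covariances."""
    n_j = resp.sum(axis=0)
    weights = n_j / X.shape[0]
    means = (resp.T @ X) / n_j[:, None]
    covs = []
    for j in range(resp.shape[1]):
        diff = X - means[j]
        covs.append((resp[:, j, None] * diff).T @ diff / n_j[j])
    return weights, means, np.array(covs)

# Synthetic 2D data (hypothetical): two clusters of 150 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (150, 2)), rng.normal([5, 5], 1.0, (150, 2))])

# Hypothetical initialization: equal weights, means at the first and last data
# points, identity covariances.
weights = np.array([0.5, 0.5])
means = X[[0, -1]].astype(float)
covs = np.array([np.eye(2), np.eye(2)])

for _ in range(50):
    resp = e_step(X, weights, means, covs)
    weights, means, covs = m_step(X, resp)
print(weights, means)
```

Constrained covariance models would change only the covariance update in m_step; the E step and the updates of the proportions and means stay the same.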
Model-based clustering Common assumptions: From 1 to 4, the model becomes more flexible, but more parameters need to be estimated, so the fit may become less stable.
Model-based clustering Example: Mixture of multinoullis