
Latent Variables, Mixture Models and EM


1 Latent Variables, Mixture Models and EM
Christopher M. Bishop
Microsoft Research, Cambridge
BCS Summer School, Exeter, 2003

2 Overview
K-means clustering
Gaussian mixtures
Maximum likelihood and EM
Latent variables: EM revisited
Bayesian mixtures of Gaussians

3 Old Faithful

4 Old Faithful Data Set [scatter plot: duration of eruption (minutes) vs. time between eruptions (minutes)]

5 K-means Algorithm
Goal: represent a data set in terms of K clusters, each of which is summarized by a prototype
Initialize the prototypes, then iterate between two phases:
E-step: assign each data point to its nearest prototype
M-step: update each prototype to be the mean of the points assigned to it
The simplest version is based on Euclidean distance; re-scale the Old Faithful data before clustering
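
A minimal NumPy sketch of this two-phase iteration; the function name, the random initialization, and the fixed iteration count are illustrative choices rather than anything from the slides:

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    """Two-phase K-means with Euclidean distance (illustrative sketch)."""
    X = np.asarray(X, dtype=float)                 # (N, D) data
    rng = np.random.default_rng(seed)
    # Initialize prototypes as K randomly chosen data points.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # E-step: assign each data point to its nearest prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # M-step: move each prototype to the mean of its assigned points.
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return mu, z

# As the slide suggests, re-scale (standardize) the data first, e.g.
#   X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
#   prototypes, assignments = kmeans(X, K=2)
```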

6-14 [Figures: successive E-step and M-step iterations of K-means on the re-scaled Old Faithful data]

15 Responsibilities
Responsibilities $r_{nk}$ assign data points to clusters such that $r_{nk} \in \{0, 1\}$ and $\sum_{k} r_{nk} = 1$ (each data point belongs to exactly one cluster)
Example: 5 data points and 3 clusters give a $5 \times 3$ matrix of responsibilities with a single 1 in each row

16 K-means Cost Function
$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2$
where $\mathbf{x}_n$ are the data, $\boldsymbol{\mu}_k$ the prototypes, and $r_{nk}$ the responsibilities

17 Minimizing the Cost Function
E-step: minimize $J$ w.r.t. the responsibilities $r_{nk}$, which assigns each data point to its nearest prototype
M-step: minimize $J$ w.r.t. the prototypes $\boldsymbol{\mu}_k$, which sets each prototype to the mean of the points in its cluster (both updates are written out below)
Convergence is guaranteed since there is only a finite number of possible settings for the responsibilities and neither step can increase $J$
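
Written out explicitly, the two minimizers are the standard K-means updates (notation as in the cost function above):

```latex
% E-step: minimizing J over the responsibilities with the prototypes fixed
\[
r_{nk} =
\begin{cases}
1 & \text{if } k = \arg\min_{j} \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\
0 & \text{otherwise}
\end{cases}
\]
% M-step: minimizing J over the prototypes with the responsibilities fixed
\[
\boldsymbol{\mu}_k = \frac{\sum_n r_{nk}\, \mathbf{x}_n}{\sum_n r_{nk}}
\]
```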

18 [figure]

19 Limitations of K-means
Hard assignments of data points to clusters: a small shift of a data point can flip it to a different cluster
Not clear how to choose the value of K
Solution: replace the 'hard' clustering of K-means with 'soft' probabilistic assignments
This represents the probability distribution of the data as a Gaussian mixture model

20 The Gaussian Distribution
Multivariate Gaussian:
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}$
with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$; define the precision to be the inverse of the covariance, $\boldsymbol{\Lambda} = \boldsymbol{\Sigma}^{-1}$
In 1 dimension: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$

21 Likelihood Function
Data set $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$
Assume the observed data points are generated independently:
$p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$
Viewed as a function of the parameters, this is known as the likelihood function

22 Maximum Likelihood
Set the parameters by maximizing the likelihood function
Equivalently, maximize the log likelihood
$\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$

23 Maximum Likelihood Solution
Maximizing w.r.t. the mean gives the sample mean
$\boldsymbol{\mu}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n$
Maximizing w.r.t. the covariance gives the sample covariance
$\boldsymbol{\Sigma}_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})(\mathbf{x}_n - \boldsymbol{\mu}_{\mathrm{ML}})^{\mathrm{T}}$
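
A short NumPy check of these closed-form estimates on synthetic data (the data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.6], [0.6, 1.0]], size=5000)   # (N, D) synthetic data

# Maximum-likelihood estimates: sample mean and (biased) sample covariance.
mu_ml = X.mean(axis=0)
diff = X - mu_ml
Sigma_ml = diff.T @ diff / len(X)        # divide by N (not N - 1) for the ML estimate

print(mu_ml)      # close to [1, -2]
print(Sigma_ml)   # close to [[2, 0.6], [0.6, 1]]
```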

24 Gaussian Mixtures
Linear super-position of Gaussians:
$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
Normalization and positivity require $\sum_{k=1}^{K} \pi_k = 1$ and $0 \le \pi_k \le 1$
Can interpret the mixing coefficients $\pi_k$ as prior probabilities
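
A minimal sketch of evaluating such a mixture density, assuming SciPy is available; the specific component parameters are invented for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Invented 2-D mixture with K = 3 components (for illustration only).
pis = np.array([0.5, 0.3, 0.2])                                  # mixing coefficients, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([0.0, 4.0])]
Sigmas = [np.eye(2), np.diag([0.5, 1.5]), 0.8 * np.eye(2)]

def mixture_pdf(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal(mu, Sigma).pdf(x)
               for pi, mu, Sigma in zip(pis, mus, Sigmas))

print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))
```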

25 Example: Mixture of 3 Gaussians

26 Contours of Probability Distribution

27 Surface Plot

28 Sampling from the Gaussian Mixture
To generate a data point: first pick one of the components with probability $\pi_k$, then draw a sample from that component
Repeat these two steps for each new data point
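
A minimal sketch of this two-step (ancestral) sampling procedure, again with invented parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
pis = np.array([0.5, 0.3, 0.2])                                  # invented mixing coefficients
mus = np.array([[0.0, 0.0], [3.0, 3.0], [0.0, 4.0]])             # invented means
Sigmas = np.array([np.eye(2), np.diag([0.5, 1.5]), 0.8 * np.eye(2)])

def sample_mixture(n, pis, mus, Sigmas):
    # Step 1: pick a component for each point with probability pi_k.
    ks = rng.choice(len(pis), size=n, p=pis)
    # Step 2: draw each point from the chosen Gaussian component.
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

X, labels = sample_mixture(500, pis, mus, Sigmas)   # data and their (latent) component labels
```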

29 Synthetic Data Set

30 Fitting the Gaussian Mixture
We wish to invert this process: given the data set, find the corresponding parameters (mixing coefficients, means, covariances)
If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
Problem: the data set is unlabelled
We shall refer to the labels as latent (= hidden) variables

31 Synthetic Data Set Without Labels

32 Posterior Probabilities
We can think of the mixing coefficients as prior probabilities for the components
For a given value of $\mathbf{x}$ we can evaluate the corresponding posterior probabilities, called responsibilities
These are given by Bayes' theorem as
$\gamma_k(\mathbf{x}) \equiv p(k \mid \mathbf{x}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
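
A sketch of evaluating these responsibilities for a whole data set at once; the function name and the vectorized (N, K) layout are illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Posterior probability of each component for each data point.

    Returns an (N, K) array gamma whose rows sum to 1.
    """
    # Numerator of Bayes' theorem: pi_k * N(x_n | mu_k, Sigma_k).
    weighted = np.column_stack([
        pi * multivariate_normal(mu, Sigma).pdf(X)
        for pi, mu, Sigma in zip(pis, mus, Sigmas)
    ])
    # Normalize over components (the denominator is p(x_n)).
    return weighted / weighted.sum(axis=1, keepdims=True)
```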

33 Posterior Probabilities (colour coded)

34 Posterior Probability Map

35 Maximum Likelihood for the GMM
The log likelihood function takes the form
$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
Note: the sum over components appears inside the log
There is no closed-form solution for maximum likelihood

36 Problems and Solutions
How to maximize the log likelihood: solved by the expectation-maximization (EM) algorithm
How to avoid singularities in the likelihood function: solved by a Bayesian treatment
How to choose the number K of components: also solved by a Bayesian treatment

37 EM Algorithm – Informal Derivation
Let us proceed by simply differentiating the log likelihood
Setting the derivative with respect to $\boldsymbol{\mu}_k$ equal to zero gives
$\boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, \mathbf{x}_n, \qquad N_k = \sum_{n=1}^{N} \gamma_{nk}$
which is simply the weighted mean of the data, with weights given by the responsibilities $\gamma_{nk} = \gamma_k(\mathbf{x}_n)$

38 EM Algorithm – Informal Derivation
Similarly for the covariances:
$\boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^{\mathrm{T}}$
For the mixing coefficients, use a Lagrange multiplier to enforce $\sum_k \pi_k = 1$, giving
$\pi_k = \frac{N_k}{N}$

39 EM Algorithm – Informal Derivation
The solutions are not closed form since they are coupled (the responsibilities depend on the parameters and vice versa)
This suggests an iterative scheme for solving them:
Make initial guesses for the parameters
Alternate between the following two stages:
E-step: evaluate the responsibilities
M-step: update the parameters using the ML results
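
Putting the two stages together, a compact NumPy/SciPy sketch of this iterative scheme (the initialization, the fixed iteration count, and the function name em_gmm are illustrative assumptions; no safeguard is included against the likelihood singularities mentioned on slide 36):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture model (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # Initial guesses: uniform mixing coefficients, random means, identity covariances.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.eye(D) for _ in range(K)])

    for _ in range(n_iters):
        # E-step: evaluate responsibilities with the current parameters.
        gamma = np.column_stack([
            pis[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X) for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)      # (N, K)

        # M-step: re-estimate the parameters using the ML-style results.
        Nk = gamma.sum(axis=0)                         # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N

    return pis, mus, Sigmas
```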

40-45 [Figures: successive iterations of EM fitting a Gaussian mixture to the re-scaled Old Faithful data]

46 EM – Latent Variable Viewpoint
Binary latent variables $z_{nk} \in \{0, 1\}$ describe which component generated each data point (exactly one $z_{nk} = 1$ for each $n$)
Conditional distribution of the observed variable: $p(\mathbf{x}_n \mid z_{nk} = 1) = \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
Prior distribution of the latent variables: $p(z_{nk} = 1) = \pi_k$
Marginalizing over the latent variables we obtain
$p(\mathbf{x}_n) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$

47 Expected Value of Latent Variable
From Bayes' theorem, the expected value of $z_{nk}$ is the responsibility:
$\mathbb{E}[z_{nk}] = p(z_{nk} = 1 \mid \mathbf{x}_n) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} = \gamma_{nk}$

48 Complete and Incomplete Data

49 Latent Variable View of EM
If we knew the values of the latent variables, we would maximize the complete-data log likelihood
$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
We don't know the values of the latent variables
However, for given parameter values we can compute the expected values of the latent variables

50 Expected Complete-Data Log Likelihood
Suppose we make a guess $\boldsymbol{\theta}^{\text{old}}$ for the parameter values (means, covariances and mixing coefficients)
Use these to evaluate the responsibilities $\gamma_{nk}$
Consider the expected complete-data log likelihood
$\mathcal{Q}(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \left\{ \ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}$
where the responsibilities are computed using $\boldsymbol{\theta}^{\text{old}}$
We are implicitly 'filling in' the latent variables with our best guess
Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results

51 EM in General
Consider an arbitrary distribution $q(\mathbf{Z})$ over the latent variables
The following decomposition always holds:
$\ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)$
where
$\mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}, \qquad \mathrm{KL}(q \,\|\, p) = -\sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}$

52 Decomposition

53 Optimizing the Bound
E-step: maximize $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to $q(\mathbf{Z})$
equivalent to minimizing the KL divergence; sets $q(\mathbf{Z})$ equal to the posterior distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$
M-step: maximize the bound $\mathcal{L}(q, \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$
equivalent to maximizing the expected complete-data log likelihood
Each EM cycle must increase the incomplete-data likelihood unless it is already at a (local) maximum
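
Written out explicitly, these two stages are the standard general-EM updates (notation as in the decomposition on slide 51):

```latex
% E-step: with the parameters held fixed, the bound is maximized by
\[
q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}}),
\qquad \text{which drives } \mathrm{KL}(q \,\Vert\, p) \text{ to zero.}
\]
% M-step: with q held fixed, maximizing the bound over the parameters is
% equivalent to maximizing the expected complete-data log likelihood
\[
\boldsymbol{\theta}^{\text{new}}
= \arg\max_{\boldsymbol{\theta}} \sum_{\mathbf{Z}}
  p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})
  \ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})
\]
```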

54 E-step

55 M-step

