1
CSE 446: Expectation Maximization (EM) Winter 2012 Daniel Weld Slides adapted from Carlos Guestrin, Dan Klein & Luke Zettlemoyer
2
Machine Learning: Supervised Learning / Unsupervised Learning / Reinforcement Learning; Parametric / Non-parametric. Fri: K-means & Agglomerative Clustering. Mon: Expectation Maximization (EM). Wed: Principal Component Analysis (PCA)
3
K-Means An iterative clustering algorithm –Pick K random points as cluster centers (means) –Alternate: Assign data instances to closest mean Assign each mean to the average of its assigned points –Stop when no points’ assignments change
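A minimal Python/NumPy sketch of this loop (our own illustrative code; the function name kmeans and all variable names are ours, not from the slides):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    X = np.asarray(X, dtype=float)             # (n, d) data points
    rng = np.random.default_rng(seed)
    # Pick K random data points as the initial cluster centers (means).
    means = X[rng.choice(len(X), size=k, replace=False)]
    assign = None
    for _ in range(iters):
        # Assignment step: send each point to its closest mean.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                               # no assignments changed
        assign = new_assign
        # Update step: move each mean to the average of its assigned points.
        for i in range(k):
            members = X[assign == i]
            if len(members) > 0:
                means[i] = members.mean(axis=0)
    return means, assign
```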
4
K-Means as Optimization Consider the total distance to the means, Φ(a, c) = Σ_i dist(x_i, c_{a_i}), over the points x_i, the assignments a_i, and the means c_k. Two stages each iteration: – Update assignments: fix means c, change assignments a – Update means: fix assignments a, change means c. This is coordinate descent on Φ. Will it converge? – Yes! The change from either update can only decrease Φ.
5
Phase I: Update Assignments (Expectation) For each point, re-assign to the closest mean: can only decrease the total distance Φ!
6
Phase II: Update Means (Maximization) Move each mean to the average of its assigned points: Also can only decrease total distance… (Why?) Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
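A one-line check of that fact, using the derivative of the total squared distance (our addition, not on the slide):

```latex
\frac{\partial}{\partial y} \sum_i \|x_i - y\|^2 \;=\; -2\sum_i (x_i - y) \;=\; 0
\quad\Longrightarrow\quad
y \;=\; \frac{1}{n}\sum_i x_i .
```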
7
Preview: EM, another iterative clustering algorithm. Pick K random cluster models. Alternate: – Assign data instances proportionately to the different models – Revise each cluster model based on its (proportionately) assigned points. Stop when no changes.
8
K-Means Getting Stuck A local optimum: Why doesn’t this work out like the earlier example, with the purple taking over half the blue?
9
Preference for Equally Sized Clusters
10
The Evils of “Hard Assignments”? Clusters may overlap Some clusters may be “wider” than others Distances can be deceiving!
11
Probabilistic Clustering Try a probabilistic model! Allows overlaps, clusters of different size, etc. Can tell a generative story for the data – P(X|Y) P(Y) is common. Challenge: we need to estimate the model parameters without labeled Ys.

 Y  | X1   | X2
 ?? |  0.1 |  2.1
 ?? |  0.5 | -1.1
 ?? |  0.0 |  3.0
 ?? | -0.1 | -2.0
 ?? |  0.2 |  1.5
 …  |  …   |  …
12
The General GMM assumption P(Y): There are k components. P(X|Y): Each component generates data from a multivariate Gaussian with mean μ_i and covariance matrix Σ_i. Each data point is sampled from a generative process: 1. Choose component i with probability P(y=i) 2. Generate datapoint x ~ N(μ_i, Σ_i)
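Written out, the implied mixture density is (our reconstruction of the slide's figure, in standard notation):

```latex
y \sim \mathrm{Categorical}(\pi_1,\dots,\pi_k),\qquad
x \mid y = i \;\sim\; \mathcal{N}(\mu_i, \Sigma_i),\qquad
p(x) \;=\; \sum_{i=1}^{k} \pi_i\, \mathcal{N}(x \mid \mu_i, \Sigma_i),
```

with π_i = P(y = i).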
13
What Model Should We Use? Depends on X! Here, maybe Gaussian Naïve Bayes? – Multinomial over clusters Y – Gaussian over each X_i given Y

 Y  | X1   | X2
 ?? |  0.1 |  2.1
 ?? |  0.5 | -1.1
 ?? |  0.0 |  3.0
 ?? | -0.1 | -2.0
 ?? |  0.2 |  1.5
 …  |  …   |  …
14
Could we make fewer assumptions? What if the X_i co-vary? What if there are multiple peaks? Gaussian Mixture Models! – P(Y) still multinomial – P(X|Y) is a multivariate Gaussian dist'n
15
The General GMM assumption 1.What’s a Multivariate Gaussian? 2.What’s a Mixture Model?
16
Review: Gaussians
17
Learning Gaussian Parameters (given fully-observable data)
18
Multivariate Gaussians P(X = x_j) = (2π)^(-d/2) |Σ|^(-1/2) exp( -½ (x_j - μ)^T Σ^(-1) (x_j - μ) ). The covariance matrix Σ captures the degree to which the x_i vary together; its eigenvalues λ give the spread along the principal axes.
19
Multivariate Gaussians Σ = identity matrix: spherical contours, equal variance in every direction.
20
Multivariate Gaussians Σ = diagonal matrix: the X_i are independent, à la Gaussian Naïve Bayes (axis-aligned contours).
21
Multivariate Gaussians Σ = arbitrary (positive semidefinite) matrix: the eigenvectors specify the rotation (change of basis) and the eigenvalues specify the relative elongation along each axis.
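In symbols (a standard identity, added here for reference rather than taken from the slide), the covariance factors as

```latex
\Sigma \;=\; U \Lambda U^{\top} \;=\; \sum_{d} \lambda_d\, u_d u_d^{\top},
```

where the eigenvectors u_d (columns of U) give the rotation and the eigenvalues λ_d give the elongation along each principal axis.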
22
The General GMM assumption 1.What’s a Multivariate Gaussian? 2.What’s a Mixture Model?
23
Mixtures of Gaussians (1) Old Faithful Data Set (time to eruption vs. duration of last eruption)
24
Mixtures of Gaussians (1) Old Faithful Data Set: Single Gaussian vs. Mixture of two Gaussians
25
Mixtures of Gaussians (2) Combine simple models into a complex model: p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k), where N(x | μ_k, Σ_k) is the k-th component and π_k its mixing coefficient (here K=3).
26
Mixtures of Gaussians (3)
27
Eliminating Hard Assignments to Clusters Model data as a mixture of multivariate Gaussians, where π_i = the probability that a point was generated from the i-th Gaussian.
32
Detour/Review: Supervised MLE for GMM How do we estimate parameters for Gaussian Mixtures with fully supervised data? Have to define objective and solve optimization problem. For example, MLE estimate has closed form solution:
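The closed-form solution itself appears as an image on the slide; the standard fully supervised MLE (our reconstruction) is:

```latex
\hat{P}(y=i) = \frac{n_i}{n},\qquad
\hat{\mu}_i = \frac{1}{n_i}\sum_{j:\,y_j=i} x_j,\qquad
\hat{\Sigma}_i = \frac{1}{n_i}\sum_{j:\,y_j=i} (x_j-\hat{\mu}_i)(x_j-\hat{\mu}_i)^{\top},
```

where n_i is the number of training points labeled y = i.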
33
Compare: Univariate Gaussian vs. Mixture of Multivariate Gaussians
34
That was easy! But what if some data is unobserved? MLE: – argmax_θ ∏_j P(y_j, x_j) – θ: all model parameters, e.g., class probabilities, means, and variances for naïve Bayes. But we don't know the y_j's!!! Maximize the marginal likelihood instead: – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1}^k P(y_j = i, x_j)
35
How do we optimize? Closed Form? Maximize the marginal likelihood: – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1}^k P(y_j = i, x_j) Almost always a hard problem! – Usually no closed-form solution – Even when P(X,Y) is convex, P(X) generally isn't… – For all but the simplest P(X), we will have to do gradient ascent, in a big messy space with lots of local optima…
36
Simple example: learn means only! Consider: 1D data; a mixture of k=2 Gaussians; variances fixed to σ=1; dist'n over classes is uniform. Just estimate μ_1 and μ_2.
37
Marginal Likelihood for Mixture of two Gaussians Graph of log P(x_1, x_2, …, x_n | μ_1, μ_2) against μ_1 and μ_2. Max likelihood = (μ_1 = -2.13, μ_2 = 1.668). A second local optimum, very close to the global one, lies at (μ_1 = 2.085, μ_2 = -1.257)*. (* corresponds to switching y_1 with y_2.)
38
Learning general mixtures of Gaussians Marginal likelihood: need to differentiate and solve for μ_i, Σ_i, and P(Y=i) for i=1..k. There will be no closed-form solution, the gradient is complex, and there are lots of local optima. Wouldn't it be nice if there were a better way!?!
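The marginal likelihood referred to above is shown as an image on the slide; a standard reconstruction is:

```latex
\ell(\theta) \;=\; \sum_{j=1}^{n} \log P(x_j \mid \theta)
\;=\; \sum_{j=1}^{n} \log \sum_{i=1}^{k} P(y_j = i)\, \mathcal{N}(x_j \mid \mu_i, \Sigma_i).
```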
39
Expectation Maximization
40
The EM Algorithm A clever method for maximizing the marginal likelihood: – argmax_θ ∏_j P(x_j) = argmax_θ ∏_j Σ_{i=1}^k P(y_j = i, x_j) – A type of gradient ascent that can be easy to implement (e.g., no line search, learning rates, etc.) Alternate between two steps: – Compute an expectation – Compute a maximization. Not magic: still optimizing a non-convex function with lots of local optima – the computations are just easier (often, significantly so!)
41
EM: Two Easy Steps Objective: argmax_θ ∏_j Σ_{i=1}^k P(y_j = i, x_j | θ) = argmax_θ Σ_j log Σ_{i=1}^k P(y_j = i, x_j | θ). Data: {x_j | j = 1..n}. E-step: Compute expectations to "fill in" missing y values according to the current parameters θ^(t) – for all examples j and values i for y, compute P(y_j = i | x_j, θ^(t)). M-step: Re-estimate the parameters with "weighted" MLE estimates – set θ^(t+1) = argmax_θ Σ_j Σ_{i=1}^k P(y_j = i | x_j, θ^(t)) log P(y_j = i, x_j | θ). Especially useful when the E and M steps have closed-form solutions!!!
42
Simple example: learn means only! Consider: 1D data; a mixture of k=2 Gaussians; variances fixed to σ=1; dist'n over classes is uniform. Just need to estimate μ_1 and μ_2.
43
EM for GMMs: only learning means Iterate: on the t'th iteration let our estimates be θ_t = { μ_1^(t), μ_2^(t), …, μ_k^(t) }. E-step: compute the "expected" classes of all datapoints. M-step: compute the most likely new μ's given the class expectations.
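The update equations are images on the slide; for this means-only case (known σ, fixed class prior) the standard E- and M-step formulas are (our reconstruction):

```latex
\text{E-step:}\quad
P(y_j = i \mid x_j, \theta_t) \;=\;
\frac{P(y=i)\,\exp\!\big(-\tfrac{1}{2\sigma^2}(x_j - \mu_i^{(t)})^2\big)}
     {\sum_{i'} P(y=i')\,\exp\!\big(-\tfrac{1}{2\sigma^2}(x_j - \mu_{i'}^{(t)})^2\big)}
\qquad
\text{M-step:}\quad
\mu_i^{(t+1)} \;=\;
\frac{\sum_j P(y_j = i \mid x_j, \theta_t)\, x_j}
     {\sum_j P(y_j = i \mid x_j, \theta_t)} .
```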
44
E.M. for General GMMs Iterate: on the t'th iteration let our estimates be θ_t = { μ_1^(t), μ_2^(t), …, μ_k^(t), Σ_1^(t), Σ_2^(t), …, Σ_k^(t), p_1^(t), p_2^(t), …, p_k^(t) }. E-step: compute the "expected" classes of all datapoints for each class; p_i^(t) is shorthand for the estimate of P(y=i) on the t'th iteration, and the class-conditional term is just a Gaussian evaluated at x_j. M-step: compute the weighted MLE for μ (and likewise for Σ and p) given the expected classes above (m = #training examples).
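For concreteness, a short Python/NumPy sketch of these E and M updates (our own illustrative code, not from the slides; it uses SciPy's multivariate_normal for the Gaussian density and adds a small ridge to keep the covariances positive definite):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=50, seed=0):
    """EM for a general Gaussian mixture. X is an (m, d) array of points."""
    X = np.asarray(X, dtype=float)
    m, d = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(m, size=k, replace=False)]              # means mu_i
    sigmas = np.stack([np.cov(X.T) + 1e-6 * np.eye(d)] * k)    # covariances Sigma_i
    ps = np.full(k, 1.0 / k)                                   # class priors p_i
    for _ in range(iters):
        # E-step: responsibilities r[j, i] = P(y_j = i | x_j, theta_t)
        r = np.zeros((m, k))
        for i in range(k):
            r[:, i] = ps[i] * multivariate_normal.pdf(X, mean=mus[i], cov=sigmas[i])
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted MLE for the means, covariances, and priors
        for i in range(k):
            w = r[:, i]
            total = w.sum()
            mus[i] = (w[:, None] * X).sum(axis=0) / total
            diff = X - mus[i]
            sigmas[i] = (w[:, None, None] * np.einsum('jd,je->jde', diff, diff)).sum(axis=0) / total
            sigmas[i] += 1e-6 * np.eye(d)                      # keep Sigma_i positive definite
            ps[i] = total / m
    return mus, sigmas, ps
```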
45
Gaussian Mixture Example: Start
46
After first iteration
47
After 2nd iteration
48
After 3rd iteration
49
After 4th iteration
50
After 5th iteration
51
After 6th iteration
52
After 20th iteration
53
Some Bio Assay data
54
GMM clustering of the assay data
55
Resulting Density Estimator
56
Three classes of assay (each learned with its own mixture model)
57
What if we do hard assignments? Iterate: on the t'th iteration let our estimates be θ_t = { μ_1^(t), μ_2^(t), …, μ_k^(t) }. E-step: compute the "expected" classes of all datapoints. M-step: compute the most likely new μ's given the class expectations. Here δ represents a hard assignment to the "most likely" or nearest cluster. Equivalent to the k-means clustering algorithm!!!
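In symbols (our reconstruction of the slide's figure), the hard assignment and the resulting mean update are:

```latex
\delta(y_j = i) \;=\;
\begin{cases}
1 & \text{if } i = \arg\max_{i'} P(y_j = i' \mid x_j, \theta_t)\\
0 & \text{otherwise}
\end{cases}
\qquad
\mu_i^{(t+1)} \;=\; \frac{\sum_j \delta(y_j = i)\, x_j}{\sum_j \delta(y_j = i)},
```

which is exactly the k-means assign-and-average loop.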
58
Let's look at the math behind the magic! We will argue that EM: Optimizes a bound on the likelihood; Is a type of coordinate ascent; Is guaranteed to converge to an (often local) optimum.
59
The general learning problem with missing data Marginal likelihood: x is observed, z (e.g., the class labels y) is missing. Objective: Find argmax_θ ℓ(θ : Data)
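The marginal log likelihood itself appears as an image on the slide; its standard form is:

```latex
\ell(\theta : \mathcal{D})
\;=\; \log \prod_{j=1}^{m} P(x_j \mid \theta)
\;=\; \sum_{j=1}^{m} \log \sum_{z} P(x_j, z \mid \theta).
```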
60
Skipping Gnarly Math EM Converges – the E-step doesn't decrease F(θ, Q) – the M-step doesn't either. EM is Coordinate Ascent.
61
A Key Computation: E-step x is observed, z is missing. Compute the probability of the missing data given the current choice of θ: Q(z|x_j) for each x_j, e.g., the probability computed during the classification step; corresponds to the "classification step" in K-means.
62
Jensen's inequality Theorem: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z). E.g., in the binary case, log( p·f(z_1) + (1-p)·f(z_2) ) ≥ p·log f(z_1) + (1-p)·log f(z_2). Actually, this holds for any concave function (such as log) applied to an expectation (with the inequality reversed for convex functions)!
63
Applying Jensen's inequality Use: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)
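Applying it with P(z) = Q^(t+1)(z|x_j) and f(z) = P(x_j, z|θ) / Q^(t+1)(z|x_j) gives the lower bound used in the following slides (our reconstruction of the slide's figure):

```latex
\ell(\theta : \mathcal{D})
\;=\; \sum_j \log \sum_z Q^{(t+1)}(z \mid x_j)\,\frac{P(x_j, z \mid \theta)}{Q^{(t+1)}(z \mid x_j)}
\;\ge\; \sum_j \sum_z Q^{(t+1)}(z \mid x_j)\,\log \frac{P(x_j, z \mid \theta)}{Q^{(t+1)}(z \mid x_j)}
\;=\; F(\theta, Q^{(t+1)}).
```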
64
The M-step Maximization step: maximize the lower bound F(θ, Q^(t+1)) over θ. We are optimizing a lower bound! Use expected counts to do weighted learning: – If learning requires Count(x,z) – Use E_{Q^(t+1)}[Count(x,z)] – Looks a bit like boosting!!!
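Concretely, the M-step maximizes the bound (our reconstruction; the entropy term -Σ Q log Q does not depend on θ, so it drops out of the argmax):

```latex
\theta^{(t+1)}
\;=\; \arg\max_{\theta} F(\theta, Q^{(t+1)})
\;=\; \arg\max_{\theta} \sum_j \sum_z Q^{(t+1)}(z \mid x_j)\,\log P(x_j, z \mid \theta).
```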
65
Convergence of EM Define: potential function F(θ, Q) – the lower bound from Jensen's inequality. EM is coordinate ascent on F! – Thus, it maximizes a lower bound on the marginal log likelihood.
66
M-step can’t decrease F(θ,Q): by definition! We are maximizing F directly, by ignoring a constant!
67
E-step: more work to show that F(θ, Q) doesn't decrease. KL divergence: measures the distance between distributions; KL = zero if and only if Q = P.
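The identity behind this argument (shown as a figure on the slide; reconstructed here) splits F into the log likelihood minus a KL term:

```latex
F(\theta, Q)
\;=\; \sum_j \sum_z Q(z \mid x_j)\,\log\frac{P(x_j, z \mid \theta)}{Q(z \mid x_j)}
\;=\; \sum_j \Big[\log P(x_j \mid \theta)
      \;-\; \mathrm{KL}\big(Q(\cdot \mid x_j)\,\big\|\,P(\cdot \mid x_j, \theta)\big)\Big].
```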
68
E-step also doesn't decrease F: Step 1 Fix θ to θ^(t), take a max over Q:
69
E-step also doesn't decrease F: Step 2 Fixing θ to θ^(t), the max over Q yields: – Q(z|x_j) = P(z|x_j, θ^(t)) – Why? The likelihood term is a constant; the KL term is zero iff the arguments are the same distribution!! – So, the E-step is actually a maximization / tightening of the bound. It ensures that F(θ^(t), Q^(t+1)) = ℓ(θ^(t) : Data).
70
EM is coordinate ascent M-step: Fix Q, maximize F over θ (a lower bound on ℓ(θ)). E-step: Fix θ, maximize F over Q – "realigns" F with the likelihood: F(θ^(t), Q^(t+1)) = ℓ(θ^(t) : Data).
71
What you should know K-means for clustering: – algorithm – converges because it's coordinate descent on the total distance Φ. Know what agglomerative clustering is. EM for mixtures of Gaussians: – Also coordinate ascent – How to "learn" maximum likelihood parameters (locally max. like.) in the case of unlabeled data – Relation to K-means: hard / soft clustering, probabilistic model. Remember, E.M. can get stuck in local optima – and empirically it DOES.
72
Acknowledgements K-means & Gaussian mixture models presentation contains material from an excellent tutorial by Andrew Moore: – http://www.autonlab.org/tutorials/ K-means Applet: – http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/AppletKM.html Gaussian mixture models Applet: – http://www.neurosci.aist.go.jp/%7Eakaho/MixtureEM.html
73
Solution to #3 - Learning Given x_1 … x_N, how do we learn θ (e.g., the transition and emission probabilities) to maximize P(x)? Unfortunately, there is no known way to analytically find a global maximum θ*, i.e., θ* = arg max_θ P(x | θ). But it is possible to find a local maximum: given an initial model θ, we can always find a model θ' such that P(x | θ') ≥ P(x | θ).
74
Chicken & Egg Problem If we knew the actual sequence of states – It would be easy to learn the transition and emission probabilities – But we can't observe the states, so we don't! If we knew the transition & emission probabilities – Then it'd be easy to estimate the sequence of states (Viterbi) – But we don't know them! Slide by Daniel S. Weld
75
Simplest Version Mixture of two distributions. Know: the form of the distributions & their variance. Just need the mean of each distribution. Slide by Daniel S. Weld
76
Input Looks Like Slide by Daniel S. Weld
77
We Want to Predict Slide by Daniel S. Weld
78
Chicken & Egg Note that coloring the instances would be easy if we knew the Gaussians…. Slide by Daniel S. Weld
79
Chicken & Egg And finding the Gaussians would be easy if we knew the coloring. Slide by Daniel S. Weld
80
Expectation Maximization (EM) Pretend we do know the parameters – Initialize randomly: set μ_1 = ?; μ_2 = ? Slide by Daniel S. Weld
81
Expectation Maximization (EM) Pretend we do know the parameters – Initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. Slide by Daniel S. Weld
82
Expectation Maximization (EM) Pretend we do know the parameters – Initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. Slide by Daniel S. Weld
83
Expectation Maximization (EM) Pretend we do know the parameters – Initialize randomly. [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld
84
ML Mean of Single Gaussian u_ML = argmin_u Σ_i (x_i - u)² Slide by Daniel S. Weld
85
Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld
86
Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable. Slide by Daniel S. Weld
87
Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld
88
Expectation Maximization (EM) [E step] Compute the probability of each instance having each possible value of the hidden variable. [M step] Treating each instance as fractionally having both values, compute the new parameter values. Slide by Daniel S. Weld
89
EM for HMMs [E step] Compute the probability of each instance having each possible value of the hidden variable – Compute the forward and backward probabilities for the given model parameters and our observations. [M step] Treating each instance as fractionally having both values, compute the new parameter values – Re-estimate the model parameters – simple counting.
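For concreteness, the Baum-Welch re-estimates (not spelled out on the slide) are ratios of expected counts; for example, for a transition probability a_{kl}:

```latex
\hat{a}_{k\ell}
\;=\; \frac{\mathbb{E}[\#\,\text{transitions } k \to \ell]}{\mathbb{E}[\#\,\text{transitions out of } k]}
\;=\; \frac{\sum_{t=1}^{T-1} P(z_t = k,\, z_{t+1} = \ell \mid x, \theta)}{\sum_{t=1}^{T-1} P(z_t = k \mid x, \theta)},
```

with the posteriors computed from the forward and backward probabilities.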