1
The EM algorithm LING 572 Fei Xia 03/01/07
2
What is EM? EM stands for "expectation maximization". It is a parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE). The general form was given by Dempster, Laird, and Rubin (1977), although the essence of the algorithm had appeared previously in various forms.
3
Outline
MLE
EM
– Basic concepts
– Main ideas of EM
– EM for PM models
– An example: the forward-backward algorithm
4
MLE
5
What is MLE? Given
– a sample X = {X_1, …, X_n}
– a vector of parameters θ
we define
– the likelihood of the data: P(X | θ)
– the log-likelihood of the data: L(θ) = log P(X | θ)
Given X, find θ_ML = argmax_θ L(θ).
6
MLE (cont) Often we assume that the X_i's are independent and identically distributed (i.i.d.). Depending on the form of P(X | θ), solving this optimization problem can be easy or hard.
7
An easy case. Assume
– a coin has probability p of coming up heads and 1 − p of coming up tails;
– observation: we toss the coin N times, and the result is a sequence of Hs and Ts containing m Hs.
What is the value of p based on MLE, given this observation?
8
An easy case (cont): p = m/N
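This result follows from setting the derivative of the log-likelihood to zero; a short worked derivation (not shown on the slide):

```latex
% log-likelihood of m heads in N tosses, maximized by setting the derivative to zero
\begin{aligned}
L(p) &= \log P(X \mid p) = m \log p + (N - m) \log (1 - p) \\
\frac{dL}{dp} &= \frac{m}{p} - \frac{N - m}{1 - p} = 0
  \;\Rightarrow\; m(1 - p) = (N - m)\,p
  \;\Rightarrow\; p = \frac{m}{N}
\end{aligned}
```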
9
EM: basic concepts
10
Basic setting in EM
X is a set of data points: the observed data.
θ is a parameter vector.
EM is a method to find θ_ML = argmax_θ P(X | θ).
Calculating P(X | θ) directly is hard; calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).
11
The basic EM strategy: Z = (X, Y)
– Z: complete data ("augmented data")
– X: observed data ("incomplete" data)
– Y: hidden data ("missing" data)
12
The "missing" data Y: Y need not be missing in the practical sense of the word. It may just be a conceptually convenient technical device that simplifies the calculation of P(X | θ). There can be many possible choices of Y.
13
Examples of EM
– HMM: X (observed) = sentences; Y (hidden) = state sequences; θ = a_ij, b_ijk; algorithm = forward-backward
– PCFG: X = sentences; Y = parse trees; θ = P(A → BC); algorithm = inside-outside
– MT: X = parallel data; Y = word alignments; θ = t(f | e), d(a_j | j, l, m), …; algorithm = IBM Models
– Coin toss: X = head-tail sequences; Y = coin id sequences; θ = p1, p2, λ; algorithm = N/A
14
The EM algorithm
– Consider a set of starting parameters.
– Use these to "estimate" the missing data.
– Use the "complete" data to update the parameters.
– Repeat until convergence.
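As a concrete illustration of this loop, here is a minimal Python sketch of EM for the coin-toss column of the table above: two coins with head probabilities p1 and p2, mixed with weight λ, where the coin identity is the hidden Y. The data, initial values, and stopping rule are illustrative assumptions, not taken from the slides.

```python
# Two-coin EM: theta = (p1, p2, lam), where lam is the probability of picking coin 1.
import math

# Each observation: (number of heads, number of tosses); the coin id is hidden.
data = [(9, 10), (4, 10), (8, 10), (3, 10), (7, 10)]

def binom_ll(h, n, p):
    """Log-likelihood of h heads in n tosses under head probability p."""
    return h * math.log(p) + (n - h) * math.log(1 - p)

p1, p2, lam = 0.6, 0.4, 0.5          # initial estimate theta_0
for it in range(100):
    # E-step: "estimate" the missing data, i.e. P(coin 1 | observation, theta_t)
    post1 = []
    for h, n in data:
        w1 = lam * math.exp(binom_ll(h, n, p1))
        w2 = (1 - lam) * math.exp(binom_ll(h, n, p2))
        post1.append(w1 / (w1 + w2))
    # M-step: update the parameters from the expected (fractional) counts
    heads1 = sum(g * h for g, (h, n) in zip(post1, data))
    total1 = sum(g * n for g, (h, n) in zip(post1, data))
    heads2 = sum((1 - g) * h for g, (h, n) in zip(post1, data))
    total2 = sum((1 - g) * n for g, (h, n) in zip(post1, data))
    new = (heads1 / total1, heads2 / total2, sum(post1) / len(data))
    if max(abs(a - b) for a, b in zip(new, (p1, p2, lam))) < 1e-6:
        break                         # converged
    p1, p2, lam = new

print(p1, p2, lam)
```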
15
Highlights
– A general algorithm for missing-data problems.
– Requires "specialization" to the problem at hand.
– Examples of EM:
  – the forward-backward algorithm for HMMs
  – the inside-outside algorithm for PCFGs
  – EM in the IBM MT Models
16
EM: main ideas
17
Idea #1: find θ that maximizes the likelihood of training data
18
Idea #2: find a sequence θ_0, θ_1, …, θ_t, …
There is no analytical solution, so take an iterative approach: find θ_{t+1} such that L(θ_{t+1}) ≥ L(θ_t).
19
Idea #3: find θ_{t+1} that maximizes a tight lower bound on L(θ) − L(θ_t):
L(θ) − L(θ_t) ≥ Σ_y P(y | x, θ_t) log [ P(x, y | θ) / P(x, y | θ_t) ]
20
Idea #4: find θ_{t+1} that maximizes the Q function. Maximizing the lower bound from Idea #3 over θ is equivalent to maximizing the Q function
Q(θ | θ_t) = Σ_y P(y | x, θ_t) log P(x, y | θ).
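The step from Idea #3 to Idea #4 can be spelled out with the standard Jensen's-inequality derivation; a sketch in the notation used above (x observed, y hidden):

```latex
\begin{aligned}
L(\theta) - L(\theta_t)
  &= \log \sum_y P(x, y \mid \theta) \;-\; \log P(x \mid \theta_t) \\
  &= \log \sum_y P(y \mid x, \theta_t)\,
          \frac{P(x, y \mid \theta)}{P(y \mid x, \theta_t)}
     \;-\; \log P(x \mid \theta_t) \\
  &\ge \sum_y P(y \mid x, \theta_t)\,
       \log \frac{P(x, y \mid \theta)}{P(y \mid x, \theta_t)\, P(x \mid \theta_t)}
   \;=\; \sum_y P(y \mid x, \theta_t)\,
       \log \frac{P(x, y \mid \theta)}{P(x, y \mid \theta_t)}
\end{aligned}
```

Since P(x, y | θ_t) does not depend on θ, maximizing this lower bound over θ is the same as maximizing Q(θ | θ_t) = Σ_y P(y | x, θ_t) log P(x, y | θ).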
21
The EM algorithm
Start with an initial estimate θ_0.
Repeat until convergence:
– E-step: calculate Q(θ | θ_t) = Σ_y P(y | x, θ_t) log P(x, y | θ)
– M-step: find θ_{t+1} = argmax_θ Q(θ | θ_t)
22
Important classes of EM problems
– products of multinomials (PM) models
– exponential families
– Gaussian mixtures
– …
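For the Gaussian-mixture case, here is a minimal Python sketch of the E- and M-steps for a two-component 1-D mixture; the data and initial values are illustrative assumptions.

```python
# EM for a two-component 1-D Gaussian mixture (mixing weights, means, variances).
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

xs = [1.2, 0.8, 1.1, 4.9, 5.3, 5.0, 4.7, 1.0]
w, mu, var = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]   # initial estimate theta_0

for _ in range(50):
    # E-step: responsibility of each component for each point
    resp = []
    for x in xs:
        p = [w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: re-estimate weights, means, and variances from the responsibilities
    for k in range(2):
        nk = sum(r[k] for r in resp)
        w[k] = nk / len(xs)
        mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk

print(w, mu, var)
```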
23
The EM algorithm for PM models
24
PM models: the complete-data likelihood is a product of multinomials,
P(x, y | θ) = ∏_j ∏_k θ_jk^{c_jk(x, y)},
where {θ_j1, θ_j2, …} is a partition of all the parameters, and for any j, Σ_k θ_jk = 1.
25
HMM is a PM: the complete-data likelihood P(x, y | θ) = π_{y_1} ∏_t a_{y_t y_{t+1}} b_{y_t y_{t+1} x_t} is a product of the multinomial parameters π, a, and b, each group of which sums to one.
26
PCFG: each sample point is a pair (x, y):
– x is a sentence
– y is a possible parse tree for that sentence
27
PCFG is a PM: the complete-data likelihood P(x, y | θ) = ∏_{A→β} P(A → β)^{c(A→β, y)} is a product of multinomials, one per left-hand-side nonterminal A.
28
Q-function for PM models: plugging the PM form into the Q function gives
Q(θ | θ_t) = Σ_i Σ_y P(y | x_i, θ_t) Σ_j Σ_k c_jk(x_i, y) log θ_jk.
29
Maximizing the Q function: maximize the Q function above subject to the constraint Σ_k θ_jk = 1 for every j, using Lagrange multipliers.
30
Optimal solution: each parameter is its expected count divided by a normalization factor,
θ_jk = E[c_jk] / Σ_{k'} E[c_jk'].
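A sketch of the Lagrange-multiplier derivation behind this solution, writing the expected count of θ_jk as c̄_jk (an assumed shorthand, not notation from the slides):

```latex
\begin{aligned}
&\text{maximize } \sum_j \sum_k \bar{c}_{jk} \log \theta_{jk}
 \quad \text{s.t. } \sum_k \theta_{jk} = 1 \text{ for every } j \\
&\frac{\partial}{\partial \theta_{jk}}
 \Big[ \sum_{j'} \sum_{k'} \bar{c}_{j'k'} \log \theta_{j'k'}
       - \sum_{j'} \lambda_{j'} \big( \textstyle\sum_{k'} \theta_{j'k'} - 1 \big) \Big]
 = \frac{\bar{c}_{jk}}{\theta_{jk}} - \lambda_j = 0 \\
&\Rightarrow\; \theta_{jk} = \frac{\bar{c}_{jk}}{\lambda_j}
 = \frac{\bar{c}_{jk}}{\sum_{k'} \bar{c}_{jk'}}
 \qquad \text{(expected count over normalization factor)}
\end{aligned}
```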
31
PCFG example:
– Calculate the expected counts of each rule.
– Update the parameters by normalizing the expected counts.
32
EM: An example
33
Forward-backward algorithm
An HMM is a PM (product of multinomials) model, and the forward-backward algorithm is a special case of the EM algorithm for PM models.
– X (observed data): each data point is an output sequence O_1^T.
– Y (hidden data): a state sequence X_1^T.
– θ (parameters): a_ij, b_ijk, π_i.
34
Expected counts
35
Expected counts (cont)
36
The inner loop of the forward-backward algorithm. Given an input sequence and the current parameter estimates:
1. Calculate the forward probabilities (base case, then recursive case).
2. Calculate the backward probabilities (base case, then recursive case).
3. Calculate the expected counts.
4. Update the parameters.
The formulas for each step are given on the following slides.
37
HMM
An HMM is a tuple (S, Σ, π, A, B):
– a set of states S = {s_1, s_2, …, s_N}
– a set of output symbols Σ = {w_1, …, w_M}
– initial state probabilities π = {π_i}
– state transition probabilities A = {a_ij}
– symbol emission probabilities B = {b_ijk}
State sequence: X_1 … X_{T+1}
Output sequence: o_1 … o_T
38
Forward probability: the probability of producing o_1 … o_{t−1} while ending up in state s_i:
α_i(t) = P(o_1 … o_{t−1}, X_t = i | θ)
39
Calculating the forward probability
– Initialization: α_i(1) = π_i
– Induction: α_j(t + 1) = Σ_i α_i(t) a_ij b_ij(o_t), where b_ij(o_t) is the probability of emitting o_t on the arc from s_i to s_j
40
Backward probability: the probability of producing the sequence o_t … o_T, given that at time t we are in state s_i:
β_i(t) = P(o_t … o_T | X_t = i, θ)
41
Calculating the backward probability
– Initialization: β_i(T + 1) = 1
– Induction: β_i(t) = Σ_j a_ij b_ij(o_t) β_j(t + 1)
42
Calculating the probability of the observation:
P(O | θ) = Σ_i α_i(T + 1) = Σ_i π_i β_i(1) = Σ_i α_i(t) β_i(t) for any t
43
Estimating parameters. The probability of traversing the arc from s_i to s_j at time t, given O (denoted p_t(i, j) in M&S):
p_t(i, j) = α_i(t) a_ij b_ij(o_t) β_j(t + 1) / P(O | θ)
44
The probability of being in state i at time t, given O:
γ_i(t) = Σ_j p_t(i, j) = α_i(t) β_i(t) / P(O | θ)
45
Expected counts: sum over the time index.
– Expected # of transitions from state i to state j in O: Σ_t p_t(i, j)
– Expected # of transitions from state i in O: Σ_t γ_i(t) = Σ_t Σ_j p_t(i, j)
46
Update the parameters:
– a_ij ← Σ_t p_t(i, j) / Σ_t γ_i(t)
– b_ijk ← Σ_{t: o_t = k} p_t(i, j) / Σ_t p_t(i, j)
– π_i ← γ_i(1)
47
Final formulae
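Putting the previous slides together, here is a minimal NumPy sketch of one pass of the inner loop (forward pass, backward pass, expected counts p_t(i, j), parameter updates) for the arc-emission HMM defined above. The function name and the toy model in the usage example are illustrative assumptions; underflow on long sequences and unreachable states are not handled.

```python
import numpy as np

def forward_backward_step(pi, A, B, obs):
    """One E-step + M-step over a single observation sequence obs (symbol ids)."""
    N, T = len(pi), len(obs)

    # 1. Forward probabilities: alpha[t, i] = P(o_1 .. o_t, X_{t+1} = i)  (0-indexed t)
    alpha = np.zeros((T + 1, N))
    alpha[0] = pi                                   # base case: alpha_i(1) = pi_i
    for t in range(T):
        alpha[t + 1] = alpha[t] @ (A * B[:, :, obs[t]])

    # 2. Backward probabilities: base case beta_i(T+1) = 1
    beta = np.zeros((T + 1, N))
    beta[T] = 1.0
    for t in range(T - 1, -1, -1):
        beta[t] = (A * B[:, :, obs[t]]) @ beta[t + 1]

    prob_obs = alpha[T].sum()                       # P(O | theta)

    # 3. Expected counts: p_t(i, j), the prob of traversing arc i -> j at time t
    p = np.zeros((T, N, N))
    for t in range(T):
        p[t] = alpha[t][:, None] * A * B[:, :, obs[t]] * beta[t + 1][None, :] / prob_obs

    # 4. Update the parameters from the expected counts
    from_i = p.sum(axis=(0, 2))                     # expected # of transitions out of i
    A_new = p.sum(axis=0) / from_i[:, None]
    B_new = np.zeros_like(B)
    for t in range(T):
        B_new[:, :, obs[t]] += p[t]                 # only times where o_t = k contribute
    arc_total = p.sum(axis=0)
    B_new /= np.where(arc_total > 0, arc_total, 1.0)[:, :, None]
    pi_new = p[0].sum(axis=1)                       # gamma_i(1)
    return pi_new, A_new, B_new, prob_obs

# Illustrative usage with a made-up 2-state, 2-symbol model:
pi = np.array([0.7, 0.3])
A = np.array([[0.6, 0.4], [0.5, 0.5]])
B = np.full((2, 2, 2), 0.5)                         # b_ijk: prob of symbol k on arc i -> j
obs = [0, 1, 1, 0]
pi, A, B, p_obs = forward_backward_step(pi, A, B, obs)
```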
48
Summary
49
The EM algorithm
– an iterative approach
– L(θ) is non-decreasing at each iteration
– an optimal solution in the M-step exists for many classes of problems
The EM algorithm for PM models
– simpler formulae
– three special cases: the inside-outside algorithm, the forward-backward algorithm, and the IBM Models for MT
50
Relations among the algorithms: the generalized EM algorithm includes the EM algorithm; the EM algorithm includes the PM models and Gaussian mixtures; the PM models include the inside-outside algorithm, the forward-backward algorithm, and the IBM models.
51
Strengths of EM
– Numerical stability: every iteration of the EM algorithm increases the likelihood of the observed data.
– EM handles parameter constraints gracefully.
52
Problems with EM
– Convergence can be very slow on some problems and is intimately related to the amount of missing information.
– It is guaranteed to improve the probability of the training corpus, which is different from reducing errors directly.
– It is not guaranteed to reach a global maximum (it can get stuck at local maxima, saddle points, etc.).
– The initial estimate is important.
53
Additional slides
54
The EM algorithm for PM models (pseudocode outline)
// for each iteration
//   for each training example x_i
//     for each possible y: accumulate the expected counts
//   for each parameter: renormalize the expected counts
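A Python rendering of this loop structure for a generic PM model. Here `possible_ys`, `joint_prob`, and `count` are hypothetical placeholders that a concrete model (HMM, PCFG, ...) would have to supply, and `theta` is assumed to be a dict keyed by (j, k); none of these names come from the slides.

```python
from collections import defaultdict

def em_for_pm(data, theta, possible_ys, joint_prob, count, n_iters=20):
    for _ in range(n_iters):                           # for each iteration
        expected = defaultdict(float)
        for x in data:                                 # for each training example x_i
            ys = possible_ys(x)
            z = sum(joint_prob(x, y, theta) for y in ys)
            for y in ys:                               # for each possible y
                post = joint_prob(x, y, theta) / z     # P(y | x_i, theta_t)
                for jk, c in count(x, y).items():      # counts c_jk(x_i, y)
                    expected[jk] += post * c           # accumulate expected counts
        for j, k in theta:                             # for each parameter theta_jk
            norm = sum(v for (j2, _), v in expected.items() if j2 == j)
            theta[(j, k)] = expected[(j, k)] / norm if norm else theta[(j, k)]
    return theta
```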