
1 EM algorithm LING 572 Fei Xia 03/02/06

2 Outline
The EM algorithm
EM for PM models
Three special cases
– Inside-outside algorithm
– Forward-backward algorithm
– IBM models for MT

3 The EM algorithm

4 Basic setting in EM
X is a set of data points: the observed data.
θ is a parameter vector.
EM is a method to find θ_ML = argmax_θ P(X | θ).
Calculating P(X | θ) directly is hard.
Calculating P(X, Y | θ) is much simpler, where Y is "hidden" data (or "missing" data).

5 The basic EM strategy
Z = (X, Y)
– Z: complete data ("augmented data")
– X: observed data ("incomplete" data)
– Y: hidden data ("missing" data)
Given a fixed x, there could be many possible y's.
– Ex: given a sentence x, there could be many state sequences in an HMM that generate x.

6 Examples of EM

               HMM                 PCFG              MT                          Coin toss
X (observed)   sentences           sentences         parallel data               head-tail sequences
Y (hidden)     state sequences     parse trees       word alignments             coin id sequences
θ              a_ij, b_ijk         P(A → BC)         t(f|e), d(a_j | j, l, m), …  p1, p2, λ
Algorithm      forward-backward    inside-outside    IBM models                  N/A

7 The log-likelihood function
L is a function of θ, with X held constant:
L(θ) = log P(X | θ) = Σ_i log P(x_i | θ) = Σ_i log Σ_y P(x_i, y | θ)

8 The iterative approach for MLE
In many cases, we cannot find the solution directly.
An alternative is to find a sequence θ^0, θ^1, …, θ^t, … such that L(θ^0) ≤ L(θ^1) ≤ … ≤ L(θ^t) ≤ …

9 Jensen's inequality
For a concave function f and weights λ_i ≥ 0 with Σ_i λ_i = 1:
f(Σ_i λ_i x_i) ≥ Σ_i λ_i f(x_i)

10 log is a concave function
Therefore log(Σ_i λ_i x_i) ≥ Σ_i λ_i log x_i whenever λ_i ≥ 0 and Σ_i λ_i = 1.

11 Maximizing the lower bound
By Jensen's inequality, using P(y | X, θ^t) as the weights:
L(θ) = log Σ_y P(X, y | θ) = log Σ_y P(y | X, θ^t) · P(X, y | θ) / P(y | X, θ^t) ≥ Σ_y P(y | X, θ^t) log [ P(X, y | θ) / P(y | X, θ^t) ]
Maximizing this lower bound over θ is the same as maximizing Σ_y P(y | X, θ^t) log P(X, y | θ), which is the Q function.

12 The Q-function
Define the Q function (a function of θ):
Q(θ; θ^t) = E[ log P(X, Y | θ) | X, θ^t ] = Σ_y P(y | X, θ^t) log P(X, y | θ)
– Y is a random vector.
– X = (x_1, x_2, …, x_n) is a constant (vector).
– θ^t is the current parameter estimate and is a constant (vector).
– θ is the variable (vector) that we wish to adjust.
The Q function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.

13 The inner loop of the EM algorithm
E-step: calculate Q(θ; θ^t)
M-step: find θ^{t+1} = argmax_θ Q(θ; θ^t)

14 L(θ) is non-decreasing at each iteration
The EM algorithm produces a sequence θ^0, θ^1, …, θ^t, …
It can be proved that L(θ^0) ≤ L(θ^1) ≤ … ≤ L(θ^t) ≤ L(θ^{t+1}) ≤ …

15 The inner loop of the Generalized EM algorithm (GEM)
E-step: calculate Q(θ; θ^t)
M-step: find θ^{t+1} such that Q(θ^{t+1}; θ^t) ≥ Q(θ^t; θ^t), i.e., improve the Q function rather than maximize it.

16 Recap of the EM algorithm

17 Idea #1: find θ that maximizes the likelihood of the training data: θ_ML = argmax_θ L(θ) = argmax_θ log P(X | θ)

18 Idea #2: find the θ^t sequence
There is no analytical solution, so use an iterative approach: find θ^0, θ^1, …, θ^t, … such that L(θ^t) ≤ L(θ^{t+1}) for every t.

19 Idea #3: find θ^{t+1} that maximizes a tight lower bound of L(θ):
L(θ) ≥ L(θ^t) + Σ_y P(y | X, θ^t) log [ P(X, y | θ) / P(X, y | θ^t) ]
The bound is tight because it holds with equality at θ = θ^t.

20 Idea #4: find θ^{t+1} that maximizes the Q function
Maximizing the lower bound above is equivalent to maximizing the Q function, since the remaining terms do not depend on θ:
θ^{t+1} = argmax_θ Q(θ; θ^t) = argmax_θ Σ_y P(y | X, θ^t) log P(X, y | θ)

21 The EM algorithm
Start with an initial estimate θ^0.
Repeat until convergence:
– E-step: calculate Q(θ; θ^t)
– M-step: find θ^{t+1} = argmax_θ Q(θ; θ^t)
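To make the loop concrete, here is a minimal sketch (not from the slides) of EM for the coin-toss example on slide 6: λ chooses between two coins with head probabilities p1 and p2, and the chosen coin generates an entire head-tail sequence. All function and variable names are illustrative.

```python
# A minimal EM sketch for the two-coin mixture: lam selects coin 1 or coin 2,
# and each observed sequence is generated entirely by the selected coin.

def em_two_coins(sequences, lam=0.5, p1=0.6, p2=0.4, iterations=50):
    """sequences: list of (num_heads, num_tosses) pairs (the observed data X)."""
    for _ in range(iterations):
        # E-step: posterior probability that coin 1 generated each sequence,
        # playing the role of P(y | x_i, theta_t).
        posts = []
        for h, n in sequences:
            like1 = lam * (p1 ** h) * ((1 - p1) ** (n - h))
            like2 = (1 - lam) * (p2 ** h) * ((1 - p2) ** (n - h))
            posts.append(like1 / (like1 + like2))
        # M-step: re-estimate the parameters from expected counts.
        lam = sum(posts) / len(posts)
        p1 = (sum(w * h for w, (h, n) in zip(posts, sequences)) /
              sum(w * n for w, (h, n) in zip(posts, sequences)))
        p2 = (sum((1 - w) * h for w, (h, n) in zip(posts, sequences)) /
              sum((1 - w) * n for w, (h, n) in zip(posts, sequences)))
    return lam, p1, p2

print(em_two_coins([(8, 10), (9, 10), (2, 10), (1, 10)]))
```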

22 Important classes of EM problems
Products of multinomial (PM) models
Exponential families
Gaussian mixtures
…

23 The EM algorithm for PM models

24 PM models
P(x, y | θ) = ∏_r θ_r^{Count(x, y, r)}
where the parameters are partitioned into multinomial distributions θ_1, …, θ_J, and for any j the parameters in θ_j sum to one: Σ_{r ∈ θ_j} θ_r = 1.

25 HMM is a PM
P(O, X | θ) is a product of the parameters π_i, a_ij, and b_ijk, each raised to the number of times it is used; the initial probabilities, the transitions out of each state, and the emissions on each arc each form a multinomial, so the HMM is a PM model.

26 PCFG
In a PCFG, each sample point is a pair (x, y):
– x is a sentence
– y is a possible parse tree for that sentence.

27 PCFG is a PM
P(x, y | θ) = ∏_{A→β} P(A → β)^{Count(y, A→β)}, and for each nonterminal A the rule probabilities form a multinomial (Σ_β P(A → β) = 1), so the PCFG is a PM model.

28 Q-function for PM
Q(θ; θ^t) = Σ_y P(y | X, θ^t) log P(X, y | θ) = Σ_y P(y | X, θ^t) Σ_r Count(X, y, r) log θ_r

29 Maximizing the Q function
Maximize Q(θ; θ^t) = Σ_y P(y | X, θ^t) Σ_r Count(X, y, r) log θ_r
subject to the constraint Σ_{r ∈ θ_j} θ_r = 1 for each multinomial θ_j.
Use Lagrange multipliers.
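The step implied here is the standard Lagrange-multiplier argument. A sketch, writing E[c_r] for the expected count Σ_y P(y | X, θ^t) Count(X, y, r) (this shorthand is mine, not the slides'), for one multinomial θ_j:

```latex
\[
\begin{aligned}
\Lambda &= \sum_{r \in \theta_j} E[c_r] \log \theta_r
           - \lambda \Big( \sum_{r \in \theta_j} \theta_r - 1 \Big) \\
\frac{\partial \Lambda}{\partial \theta_r} &= \frac{E[c_r]}{\theta_r} - \lambda = 0
  \;\Longrightarrow\; \theta_r = \frac{E[c_r]}{\lambda} \\
\sum_{r \in \theta_j} \theta_r = 1
  &\;\Longrightarrow\; \lambda = \sum_{r' \in \theta_j} E[c_{r'}]
  \;\Longrightarrow\; \theta_r^{t+1} = \frac{E[c_r]}{\sum_{r' \in \theta_j} E[c_{r'}]}
\end{aligned}
\]
```

This gives exactly the expected-count-over-normalization-factor solution on the next slide.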

30 Optimal solution
θ_r^{t+1} = E[Count(r)] / Σ_{r' ∈ θ_j} E[Count(r')]
where the expected count is E[Count(r)] = Σ_y P(y | X, θ^t) Count(X, y, r), and the denominator is the normalization factor for the multinomial θ_j containing θ_r.

31 PM models
θ_r is the r-th parameter in the model. Each parameter is a member of some multinomial distribution.
Count(x, y, r) is the number of times that θ_r appears in the expression for P(x, y | θ).

32 The EM algorithm for PM models
Calculate expected counts: E[Count(r)] = Σ_i Σ_y P(y | x_i, θ^t) Count(x_i, y, r)
Update parameters: θ_r^{t+1} = E[Count(r)] / Σ_{r' ∈ θ_j} E[Count(r')]

33 PCFG example
Calculate expected counts: E[Count(A → β)] = Σ_i Σ_y P(y | x_i, θ^t) Count(y, A → β)
Update parameters: P^{t+1}(A → β) = E[Count(A → β)] / Σ_{β'} E[Count(A → β')]

34 The EM algorithm for PM models
for each iteration:
  for each training example x_i:
    for each possible y: compute P(y | x_i, θ^t)
      for each parameter θ_r: add P(y | x_i, θ^t) · Count(x_i, y, r) to the expected count E[Count(r)]
  normalize the expected counts within each multinomial to obtain θ^{t+1}
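The outline above can be fleshed out as a schematic sketch. It assumes the hidden values y for each example can be enumerated exhaustively, which is feasible only for small problems (the three special cases below instead use dynamic programming), and the helper functions it takes as arguments are hypothetical placeholders, not part of the slides.

```python
# A schematic sketch of one EM iteration for a generic PM model.
# The caller supplies:
#   possible_ys(x)          -> iterable of candidate hidden values y
#   joint_prob(x, y, theta) -> P(x, y | theta)
#   counts(x, y)            -> dict mapping parameter id r -> Count(x, y, r)
#   groups                  -> dict mapping multinomial id j -> list of parameter ids

def em_pm_iteration(data, theta, possible_ys, joint_prob, counts, groups):
    expected = {r: 0.0 for r in theta}             # E[Count(r)]
    for x in data:                                  # for each training example x_i
        ys = list(possible_ys(x))
        joints = [joint_prob(x, y, theta) for y in ys]
        z = sum(joints)                             # P(x | theta)
        for y, p_xy in zip(ys, joints):             # for each possible y
            post = p_xy / z                         # P(y | x, theta)
            for r, c in counts(x, y).items():       # for each parameter
                expected[r] += post * c
    new_theta = dict(theta)
    for j, members in groups.items():               # normalize within each multinomial
        total = sum(expected[r] for r in members)
        if total > 0:
            for r in members:
                new_theta[r] = expected[r] / total
    return new_theta
```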

35 Inside-outside algorithm

36 Inner loop of the Inside-outside algorithm
Given an input sentence w_1 … w_m and the current rule probabilities (grammar in Chomsky normal form):
1. Calculate the inside probabilities inside(A, p, q) = P(A derives w_p … w_q):
   Base case: inside(A, k, k) = P(A → w_k)
   Recursive case: inside(A, p, q) = Σ_{A→BC} Σ_{p ≤ d < q} P(A → BC) · inside(B, p, d) · inside(C, d+1, q)
2. Calculate the outside probabilities outside(A, p, q):
   Base case: outside(S, 1, m) = 1, and outside(A, 1, m) = 0 for A ≠ S
   Recursive case: outside(A, p, q) = Σ_{B→AC} Σ_{d > q} P(B → AC) · inside(C, q+1, d) · outside(B, p, d) + Σ_{B→CA} Σ_{d < p} P(B → CA) · inside(C, d, p-1) · outside(B, d, q)
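A sketch of step 1 (the inside probabilities) for a grammar in Chomsky normal form. The rule dictionaries `binary` and `lexical` are a hypothetical representation chosen for this example, not the slides' notation.

```python
# Inside probabilities for a CNF PCFG.
# binary[(A, B, C)]  = P(A -> B C)
# lexical[(A, w)]    = P(A -> w)
# inside[(A, i, j)]  = P(A derives words i..j), 1-based and inclusive.

from collections import defaultdict

def inside_probabilities(words, binary, lexical):
    m = len(words)
    inside = defaultdict(float)
    for i, w in enumerate(words, start=1):            # base case: A -> w_i
        for (A, word), p in lexical.items():
            if word == w:
                inside[(A, i, i)] += p
    for span in range(2, m + 1):                      # recursive case
        for i in range(1, m - span + 2):
            j = i + span - 1
            for (A, B, C), p in binary.items():
                for d in range(i, j):                 # split point
                    inside[(A, i, j)] += p * inside[(B, i, d)] * inside[(C, d + 1, j)]
    return inside   # inside[(S, 1, m)] is P(w_1..w_m | G) for start symbol S
```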

37 Inside-outside algorithm (cont)
3. Collect the expected counts for each rule.
4. Normalize and update the parameters.

38 Expected counts for PCFG rules
E[Count(A → BC)] = (1 / P(w_1m)) Σ_{p ≤ d < q} outside(A, p, q) · P(A → BC) · inside(B, p, d) · inside(C, d+1, q)
where P(w_1m) = inside(S, 1, m).
This is the formula if we have only one sentence. Add an outer sum over sentences if X contains multiple sentences.
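Continuing the earlier sketch, the expected count of a binary rule can be read straight off the inside and outside tables, assuming `inside` and `outside` are defaultdicts keyed by (nonterminal, p, q) as above (the `outside` table is computed analogously and is not shown here).

```python
# Expected count of one binary rule A -> B C in a single sentence of length m,
# implementing the formula on slide 38.

def expected_binary_rule_count(A, B, C, rule_prob, inside, outside, m, start="S"):
    sent_prob = inside[(start, 1, m)]                 # P(w_1..w_m | G)
    total = 0.0
    for p in range(1, m + 1):
        for q in range(p + 1, m + 1):
            for d in range(p, q):                     # split point
                total += (outside[(A, p, q)] * rule_prob *
                          inside[(B, p, d)] * inside[(C, d + 1, q)])
    return total / sent_prob
```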

39 Expected counts (cont)
Similarly, for a lexical rule: E[Count(A → w)] = (1 / P(w_1m)) Σ_{k: w_k = w} outside(A, k, k) · P(A → w)

40 Relation to EM
PCFG is a PM model.
The inside-outside algorithm is a special case of the EM algorithm for PM models.
X (observed data): each data point is a sentence w_1m.
Y (hidden data): the parse tree Tr.
θ (parameters): the rule probabilities P(A → β).

41 Forward-backward algorithm

42 The inner loop for the forward-backward algorithm
Given an input sequence O_1 … O_T and the current parameters (π_i, a_ij, b_ijk):
1. Calculate the forward probabilities α_i(t):
   Base case: α_i(1) = π_i
   Recursive case: α_j(t+1) = Σ_i α_i(t) · a_ij · b_{ij O_t}
2. Calculate the backward probabilities β_i(t):
   Base case: β_i(T+1) = 1
   Recursive case: β_i(t) = Σ_j a_ij · b_{ij O_t} · β_j(t+1)
3. Calculate expected counts, e.g. the expected count of taking the arc from i to j at time t while emitting O_t:
   γ_t(i, j) = α_i(t) · a_ij · b_{ij O_t} · β_j(t+1) / P(O | θ), where P(O | θ) = Σ_i α_i(T+1)
4. Update the parameters:
   a_ij = Σ_t γ_t(i, j) / Σ_t Σ_{j'} γ_t(i, j'), b_ijk = Σ_{t: O_t = k} γ_t(i, j) / Σ_t γ_t(i, j), π_i = Σ_j γ_1(i, j)
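A sketch of one forward-backward iteration on a single observation sequence. For simplicity it uses a state-emission HMM, where the emission probability b[j][o] depends only on the current state, rather than the arc-emission b_ijk of the slides; the E-step/M-step structure is the same. All names are illustrative.

```python
# One forward-backward re-estimation step for a state-emission HMM.
# pi[i] and a[i][j] are nested lists, b[i][o] is a list of symbol-keyed dicts,
# obs is a list of observed symbols with at least two elements.

def forward_backward_step(obs, pi, a, b):
    n, T = len(pi), len(obs)
    # forward probabilities alpha[t][i] = P(o_1..o_t, X_t = i)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for t in range(1, T):
        alpha.append([sum(alpha[t-1][i] * a[i][j] for i in range(n)) * b[j][obs[t]]
                      for j in range(n)])
    # backward probabilities beta[t][i] = P(o_{t+1}..o_T | X_t = i)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(a[i][j] * b[j][obs[t+1]] * beta[t+1][j] for j in range(n))
    prob = sum(alpha[T-1][i] for i in range(n))       # P(O | theta)
    # expected counts
    gamma = [[alpha[t][i] * beta[t][i] / prob for i in range(n)] for t in range(T)]
    xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t+1]] * beta[t+1][j] / prob
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: re-estimated parameters
    new_a = [[sum(xi[t][i][j] for t in range(T-1)) /
              sum(gamma[t][i] for t in range(T-1)) for j in range(n)] for i in range(n)]
    new_b = [{o: sum(g[i] for t, g in enumerate(gamma) if obs[t] == o) /
                 sum(g[i] for g in gamma) for o in set(obs)} for i in range(n)]
    new_pi = gamma[0]
    return new_pi, new_a, new_b, prob
```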

43 Expected counts

44 Expected counts (cont)

45 Relation to EM
HMM is a PM model.
The forward-backward algorithm is a special case of the EM algorithm for PM models.
X (observed data): each data point is an observation sequence O_1T.
Y (hidden data): the state sequence X_1T.
θ (parameters): a_ij, b_ijk, π_i.

46 IBM models for MT

47 Expected counts for (f, e) pairs
Let Ct(f, e) be the fractional count of the (f, e) pair in the training data:
Ct(f, e) = Σ_a P(a | E, F) · Count(f, e, a)
i.e., the alignment probability times the actual count of times e and f are linked in (E, F) by alignment a, summed over alignments.
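For IBM Model 1 the sum over alignments factorizes, so the fractional counts Ct(f, e) can be collected word by word without enumerating alignments. A minimal sketch of one iteration, ignoring the NULL word; `bitext` and `t` are illustrative names, with `t` a dict of translation probabilities t(f|e) initialized, for example, uniformly over co-occurring pairs.

```python
# One EM iteration for IBM Model 1 translation probabilities t(f|e).
# bitext is a list of (F, E) sentence pairs, each a list of tokens.

from collections import defaultdict

def ibm1_em_iteration(bitext, t):
    count = defaultdict(float)          # Ct(f, e): fractional link counts
    total = defaultdict(float)          # sum of Ct(f, e) over f, for each e
    for F, E in bitext:
        for f in F:
            norm = sum(t[(f, e)] for e in E)         # normalizer for this f
            for e in E:
                c = t[(f, e)] / norm                 # posterior that f links to e
                count[(f, e)] += c
                total[e] += c
    # M-step: re-normalize the translation probabilities
    return {(f, e): count[(f, e)] / total[e] for (f, e) in count}
```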

48 Relation to EM
IBM models are PM models.
The EM algorithm used in the IBM models is a special case of the EM algorithm for PM models.
X (observed data): each data point is a sentence pair (F, E).
Y (hidden data): the word alignment a.
θ (parameters): t(f|e), d(i | j, m, n), etc.

49 Summary
The EM algorithm
– An iterative approach
– L(θ) is non-decreasing at each iteration
– An optimal solution in the M-step exists for many classes of problems
The EM algorithm for PM models
– Simpler formulae
– Three special cases: the inside-outside algorithm, the forward-backward algorithm, and the IBM models for MT

50 Relations among the algorithms
The generalized EM includes the EM algorithm as a special case; the EM algorithm in turn covers the PM models and Gaussian mixtures; and the inside-outside algorithm, the forward-backward algorithm, and the IBM models are special cases of EM for PM models.

51 Strengths of EM
Numerical stability: in every iteration, EM increases (or at least does not decrease) the likelihood of the observed data.
EM handles parameter constraints gracefully.

52 Problems with EM
Convergence can be very slow on some problems and is intimately related to the amount of missing information.
It is guaranteed to improve the probability of the training corpus, which is different from reducing errors directly.
It cannot guarantee reaching a global maximum (it can get stuck at local maxima, saddle points, etc.), so the initial estimate is important.

53 Additional slides

54 Lower bound lemma
If Q(θ; θ^t) ≥ Q(θ^t; θ^t), then L(θ) ≥ L(θ^t).
Proof:
L(θ) - L(θ^t) = log P(X | θ) - log P(X | θ^t)
= log Σ_y P(y | X, θ^t) · P(X, y | θ) / P(X, y | θ^t)
≥ Σ_y P(y | X, θ^t) log [ P(X, y | θ) / P(X, y | θ^t) ]   (Jensen's inequality)
= Q(θ; θ^t) - Q(θ^t; θ^t) ≥ 0

55 L(θ) is non-decreasing
Let θ^{t+1} = argmax_θ Q(θ; θ^t). Then Q(θ^{t+1}; θ^t) ≥ Q(θ^t; θ^t), and therefore L(θ^{t+1}) ≥ L(θ^t) (by the lower bound lemma).

