Expectation-Maximization Algorithm
M. B. Chandak
Principle: The EM algorithm performs maximum likelihood estimation (MLE) from data. It operates on a parallel corpus, for example an English-Hindi aligned parallel corpus. The algorithm aims to find the MLE of the probability that one word translates to another, for use in Machine Translation. In the following example, English and Hindi are used as the source and target languages. Let Es represent the English corpus and Hs the Hindi corpus.
Implementation: It is an iterative algorithm with two steps.
E-step: compute the probability of each word alignment and generate the expected counts of the aligned word pairs.
M-step: re-estimate the translation probabilities t(e|h) from these expected counts.
Initially: every t(e|h) is given the same uniform value, so all alignments are equally probable.
Example: sentence pairs (English-Hindi)
  Green House : हरा घर
  The House : यह घर

Uniform probability table (three English words: Green, House, The)
  हरा: t(Green|हरा) = 1/3, t(House|हरा) = 1/3, t(The|हरा) = 1/3
  घर: t(Green|घर) = 1/3, t(House|घर) = 1/3, t(The|घर) = 1/3
  यह: t(Green|यह) = 1/3, t(House|यह) = 1/3, t(The|यह) = 1/3
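The uniform initialization can be sketched in a few lines of Python. This is only an illustration: the names `corpus`, `english_vocab`, `hindi_vocab` and `t` are chosen here and do not come from the slides, and exact fractions are used so the values match the table.

```python
from fractions import Fraction

# Toy parallel corpus from the example: (English sentence, Hindi sentence).
corpus = [
    (["Green", "House"], ["हरा", "घर"]),
    (["The", "House"], ["यह", "घर"]),
]

english_vocab = sorted({e for es, _ in corpus for e in es})
hindi_vocab = sorted({h for _, hs in corpus for h in hs})

# t[(e, h)] holds t(e|h); every English word is equally likely at the start.
uniform = Fraction(1, len(english_vocab))
t = {(e, h): uniform for e in english_vocab for h in hindi_vocab}

print(t[("Green", "हरा")])  # 1/3
```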
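Example: compute P(a, e|h) for each alignment a by multiplying all its "t" probabilities.
  Green House / हरा घर, Green-हरा and House-घर: 1/3 * 1/3 = 1/9
  Green House / हरा घर, Green-घर and House-हरा: 1/3 * 1/3 = 1/9
  The House / यह घर, The-यह and House-घर: 1/3 * 1/3 = 1/9
  The House / यह घर, The-घर and House-यह: 1/3 * 1/3 = 1/9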
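As a rough sketch of this step, the snippet below enumerates the alignments of each sentence pair and multiplies the "t" values along each one. It assumes, as in the example, that only one-to-one alignments are considered (every permutation of the Hindi words); the function name `alignment_probs` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

corpus = [
    (["Green", "House"], ["हरा", "घर"]),
    (["The", "House"], ["यह", "घर"]),
]

# Uniform start: t(e|h) = 1/3 for every pair (three English words).
english_vocab = {e for es, _ in corpus for e in es}
hindi_vocab = {h for _, hs in corpus for h in hs}
t = {(e, h): Fraction(1, 3) for e in english_vocab for h in hindi_vocab}

def alignment_probs(es, hs, t):
    """Return {alignment: P(a, e|h)} over all one-to-one alignments.

    An alignment is a tuple of (english_word, hindi_word) links.
    """
    probs = {}
    for perm in permutations(hs):
        links = tuple(zip(es, perm))
        p = Fraction(1)
        for e, h in links:
            p *= t[(e, h)]  # multiply the "t" value of every link
        probs[links] = p
    return probs

for es, hs in corpus:
    for links, p in alignment_probs(es, hs, t).items():
        print(links, p)
```

With the uniform table every one of the four alignments gets 1/3 * 1/3 = 1/9.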
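Re-calculating values: normalize P(a, e|h) over the alignments of each sentence pair, P(a|e, h) = (1/9) / (1/9 + 1/9) = 1/2.
  Green House / हरा घर, Green-हरा and House-घर: 1/2
  Green House / हरा घर, Green-घर and House-हरा: 1/2
  The House / यह घर, The-यह and House-घर: 1/2
  The House / यह घर, The-घर and House-यह: 1/2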
Calculate "tcounts" = tc (each link of an alignment adds that alignment's probability to its word pair)
  हरा: tc(Green|हरा) = 1/2, tc(House|हरा) = 1/2, tc(The|हरा) = 0, TOTAL = 1
  घर: tc(Green|घर) = 1/2, tc(House|घर) = [1/2 + 1/2] = 1, tc(The|घर) = 1/2, TOTAL = 2
  यह: tc(Green|यह) = 0, tc(House|यह) = 1/2, tc(The|यह) = 1/2, TOTAL = 1
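The count collection can be sketched as follows. The sketch assumes one-to-one alignments and a uniform starting table, as in the example; `expected_counts`, `tc` and `total` are illustrative names. Each alignment is first normalized within its sentence pair, and that normalized probability is then added to the count of every word pair it links.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

corpus = [
    (["Green", "House"], ["हरा", "घर"]),
    (["The", "House"], ["यह", "घर"]),
]
t = defaultdict(lambda: Fraction(1, 3))  # uniform start: t(e|h) = 1/3

def expected_counts(corpus, t):
    tc = defaultdict(Fraction)     # tc[(e, h)]: expected count of e aligned to h
    total = defaultdict(Fraction)  # total[h]: sum of tc[(e, h)] over all e
    for es, hs in corpus:
        # P(a, e|h) for each one-to-one alignment of this sentence pair.
        joint = {}
        for perm in permutations(hs):
            links = tuple(zip(es, perm))
            p = Fraction(1)
            for e, h in links:
                p *= t[(e, h)]
            joint[links] = p
        z = sum(joint.values())
        for links, p in joint.items():
            posterior = p / z  # normalized alignment probability
            for e, h in links:
                tc[(e, h)] += posterior
                total[h] += posterior
    return tc, total

tc, total = expected_counts(corpus, t)
print(tc[("House", "घर")], total["घर"])  # 1 2
```

The pair House-घर is the only one supported by both sentence pairs, which is why its count (1) is twice that of the other linked pairs.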
M-step: divide each tcount by the TOTAL of its Hindi word, t(e|h) = tc(e|h) / TOTAL(h).
  हरा (TOTAL = 1): t(Green|हरा) = (1/2)/1, t(House|हरा) = (1/2)/1, t(The|हरा) = 0/1
  घर (TOTAL = 2): t(Green|घर) = (1/2)/2, t(House|घर) = [1/2 + 1/2]/2, t(The|घर) = (1/2)/2
  यह (TOTAL = 1): t(Green|यह) = 0/1, t(House|यह) = (1/2)/1, t(The|यह) = (1/2)/1

Revised probability table
  हरा: t(Green|हरा) = 1/2, t(House|हरा) = 1/2, t(The|हरा) = 0
  घर: t(Green|घर) = 1/4, t(House|घर) = 1/2, t(The|घर) = 1/4
  यह: t(Green|यह) = 0, t(House|यह) = 1/2, t(The|यह) = 1/2
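The normalization in the M-step is small enough to check by hand, but a short sketch makes the rule explicit. The counts below are the tcounts of the example typed in directly; `m_step` is a hypothetical helper name.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Expected counts tc(e|h) from the first E-step, keyed by (english, hindi).
tc = {
    ("Green", "हरा"): half, ("House", "हरा"): half, ("The", "हरा"): Fraction(0),
    ("Green", "घर"): half, ("House", "घर"): Fraction(1), ("The", "घर"): half,
    ("Green", "यह"): Fraction(0), ("House", "यह"): half, ("The", "यह"): half,
}

def m_step(tc):
    """Divide every count by the total count of its Hindi word."""
    total = {}
    for (e, h), c in tc.items():
        total[h] = total.get(h, Fraction(0)) + c
    return {(e, h): c / total[h] for (e, h), c in tc.items()}

t = m_step(tc)
print(t[("Green", "घर")], t[("House", "घर")], t[("Green", "यह")])  # 1/4 1/2 0
```

After normalization the probabilities for each Hindi word sum to 1, which is a quick sanity check on any revised table.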
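E-step, part 2: identifying the higher-probability alignment. Compute P(a, e|h) by multiplying all the revised "t" probabilities.
  Green House / हरा घर, Green-हरा and House-घर: 1/2 * 1/2 = 1/4
  The House / यह घर, The-यह and House-घर: 1/2 * 1/2 = 1/4
  Green House / हरा घर, Green-घर and House-हरा: 1/4 * 1/2 = 1/8
  The House / यह घर, The-घर and House-यह: 1/4 * 1/2 = 1/8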
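Further: The process continues to iterate, with each E-step followed by an M-step. After the first iteration, the probability of the alignments that link House to घर has changed from 1/9 to 1/4, and the probability of the alignments that link House to हरा or यह has changed from 1/9 to 1/8, so the alignments with House-घर are now the more probable ones.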
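Putting the steps together, one possible sketch of the whole iteration is shown below. It keeps the simplifications of the worked example (one-to-one alignments only, exact fractions) and is not a full word-alignment model; `joint_probs`, `em_iteration` and `history` are names chosen for this illustration.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations

corpus = [
    (["Green", "House"], ["हरा", "घर"]),
    (["The", "House"], ["यह", "घर"]),
]

def joint_probs(es, hs, t):
    """P(a, e|h) for every one-to-one alignment a, as {links: probability}."""
    out = {}
    for perm in permutations(hs):
        links = tuple(zip(es, perm))
        p = Fraction(1)
        for e, h in links:
            p *= t[(e, h)]
        out[links] = p
    return out

def em_iteration(corpus, t):
    tc = defaultdict(Fraction)     # expected counts per (english, hindi) pair
    total = defaultdict(Fraction)  # expected counts per hindi word
    for es, hs in corpus:          # E-step: normalize and collect counts
        joint = joint_probs(es, hs, t)
        z = sum(joint.values())
        for links, p in joint.items():
            for e, h in links:
                tc[(e, h)] += p / z
                total[h] += p / z
    # M-step: re-estimate t(e|h) by normalizing the counts.
    return {(e, h): tc[(e, h)] / total[h] for (e, h) in t}

english = {e for es, _ in corpus for e in es}
hindi = {h for _, hs in corpus for h in hs}
t = {(e, h): Fraction(1, len(english)) for e in english for h in hindi}

history = []
for step in range(1, 4):
    t = em_iteration(corpus, t)
    probs = joint_probs(*corpus[0], t)
    straight = probs[(("Green", "हरा"), ("House", "घर"))]
    crossed = probs[(("Green", "घर"), ("House", "हरा"))]
    history.append((straight, crossed))
    print(step, straight, crossed)  # first line: 1 1/4 1/8
```

The first printed line reproduces the 1/4 and 1/8 of the example. On later iterations the gap between the two alignments of "Green House" keeps widening in favour of Green-हरा, House-घर.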