Bayesian Word Alignment for Statistical Machine Translation
Authors: Coskun Mermer, Murat Saraclar
Presented by Jun Lang, I2R SMT Reading Group
Paper info
Bayesian Word Alignment for Statistical Machine Translation
ACL 2011 (short paper), with source code in Perl (379 lines)
Authors
–Coskun Mermer
–Murat Saraclar
Core Idea
Proposes a Gibbs sampler for fully Bayesian inference in IBM Model 1
Results
–Outperforms classical EM by up to 2.99 BLEU points
–Effectively addresses the rare-word problem
–Produces a much smaller phrase table than EM
Mathematics
(E, F): parallel corpus
e_i (f_j): the i-th (j-th) word of source (target) sentence e (f), which has I (J) words; E (F) denotes the whole corpus side
e_0: the "null" word, prepended to every source sentence
V_E (V_F): size of the source (target) vocabulary
a (A): alignment of a sentence (of the corpus)
a_j: target word f_j is aligned to source word e_{a_j}
T: table of word-translation parameters, of size V_E x V_F, with t_{e,f} = P(f|e)
IBM Model 1
Treats the translation table T as a random variable (rather than a point estimate, as in EM)
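For reference, the standard IBM Model 1 likelihood of a target sentence f and alignment a given source sentence e, written in the notation defined above (this form is not spelled out on the slide):

P(f, a | e; T) = \frac{1}{(I+1)^J} \prod_{j=1}^{J} t_{e_{a_j}, f_j}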
Dirichlet Distribution
T = {t_{e,f}} is a collection of multinomial distributions, an exponential-family model
We therefore choose the conjugate prior, which in this case is the Dirichlet distribution, for computational convenience
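Concretely, the Dirichlet prior places the following density over each row t_e of T (standard form, with hyperparameters θ_{e,f}; spelled out here only as a reference):

p(t_e | \theta_e) = \frac{\Gamma(\sum_f \theta_{e,f})}{\prod_f \Gamma(\theta_{e,f})} \prod_{f=1}^{V_F} t_{e,f}^{\theta_{e,f} - 1}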
Dirichlet Distribution
Each source word type e gets its own distribution t_e over the target vocabulary, with a Dirichlet prior
Small (sparse) hyperparameters keep rare words from acting as "garbage collectors"
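The "computational convenience" is conjugacy: given the link counts N_{e,f} implied by the current alignment A, the posterior over each t_e is again Dirichlet (a standard result, added here as a bridging step):

t_e | A, E, F \sim \mathrm{Dirichlet}(\theta_{e,1} + N_{e,1}, \ldots, \theta_{e,V_F} + N_{e,V_F})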
Dirichlet Distribution
Gibbs sampling: sample the unknowns A and T in turn, each conditioned on the current value of the other
¬j denotes exclusion of the current value of a_j from the counts
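If T is integrated out, which is what the count-based ¬j notation points to, each a_j is resampled from a simple ratio of counts; this conditional is my reconstruction from the definitions above:

P(a_j = i | E, F, A^{¬j}; \theta) \propto \frac{N^{¬j}_{e_i, f_j} + \theta_{e_i, f_j}}{\sum_{f=1}^{V_F} \left( N^{¬j}_{e_i, f} + \theta_{e_i, f} \right)}

where N^{¬j}_{e,f} counts the e–f alignment links in the rest of the corpus.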
Algorithm
The initial alignment A can be arbitrary, but initializing from the output of classical EM works better
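Below is a minimal Perl sketch of such a collapsed Gibbs sampler on a toy corpus. It is a hypothetical illustration, not the paper's bayesalign.pl: the symmetric hyperparameter value, iteration count, and data are assumed for the example.

    #!/usr/bin/perl
    # Minimal sketch of a collapsed Gibbs sampler for IBM Model 1 alignment.
    # Hypothetical illustration only -- not the paper's bayesalign.pl.
    use strict;
    use warnings;

    my $theta = 0.0001;   # symmetric Dirichlet hyperparameter (assumed value)
    my $iters = 100;      # number of Gibbs sweeps (assumed value)

    # Toy parallel corpus; element 0 of every source sentence is the null word e_0.
    my @E = ( [ 'NULL', 'das', 'haus' ], [ 'NULL', 'das', 'buch' ] );
    my @F = ( [ 'the', 'house' ],        [ 'the', 'book'  ] );

    # Target vocabulary size V_F.
    my %vf; $vf{$_}++ for map { @$_ } @F;
    my $VF = scalar keys %vf;

    # Initialize alignments at random and collect link counts N(e,f).
    my (@A, %N, %Ntot);
    for my $s (0 .. $#F) {
        for my $j (0 .. $#{ $F[$s] }) {
            my $i = int rand( scalar @{ $E[$s] } );
            $A[$s][$j] = $i;
            $N{ $E[$s][$i] }{ $F[$s][$j] }++;
            $Ntot{ $E[$s][$i] }++;
        }
    }

    for (1 .. $iters) {
        for my $s (0 .. $#F) {
            for my $j (0 .. $#{ $F[$s] }) {
                my $f   = $F[$s][$j];
                my $old = $E[$s][ $A[$s][$j] ];
                # Exclude the current link a_j from the counts (the "¬j" step).
                $N{$old}{$f}--;
                $Ntot{$old}--;
                # P(a_j = i | rest) proportional to (N(e_i,f)+theta) / (N(e_i)+V_F*theta).
                my @p;
                my $z = 0;
                for my $i (0 .. $#{ $E[$s] }) {
                    my $e = $E[$s][$i];
                    $p[$i] = ( ($N{$e}{$f} // 0) + $theta )
                           / ( ($Ntot{$e}  // 0) + $VF * $theta );
                    $z += $p[$i];
                }
                # Draw the new a_j from the normalized distribution.
                my $r = rand($z);
                my $new = 0;
                while ($new < $#p && $r > $p[$new]) { $r -= $p[$new]; $new++; }
                $A[$s][$j] = $new;
                $N{ $E[$s][$new] }{$f}++;
                $Ntot{ $E[$s][$new] }++;
            }
        }
    }

    # Print the final alignment sample as target-word/source-position pairs.
    for my $s (0 .. $#F) {
        print join(' ', map { "$F[$s][$_]-$A[$s][$_]" } 0 .. $#{ $F[$s] }), "\n";
    }

Each sweep removes a link from the counts, rescores every candidate source position (including the null word at index 0), and draws a replacement; in practice one would initialize @A from EM output, as the slide suggests, and aggregate over many samples rather than keep only the last one.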
Results
Code
View bayesalign.pl
Conclusions
Outperforms classical EM by up to 2.99 BLEU points
Effectively addresses the rare-word problem
Produces a much smaller phrase table than EM
Shortcomings
–Too slow: 100 sentence pairs take 18 minutes
–Could perhaps be sped up by parallel computing