Improving Word-Alignments for Machine Translation Using Phrase-Based Techniques
Mike Rodgers, Sarah Spikes, Ilya Sherman
IBM Model 2: Recap
Alignments are word-to-word. The model considers two factors: the words themselves, and their positions within the source and target sentences. Formally, the probability that the i-th word of sentence S aligns with the j-th word of sentence T depends on what S[i] and T[j] are, and on a function f(i, j, length(S), length(T)) of the two positions and the two sentence lengths.
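The two factors can be sketched as a product of a lexical term and a positional term. This is only an illustration: the names `model2_score`, `best_alignment`, the table `t`, and the function `a` are hypothetical, and the slides do not say how the two factors are combined or normalized.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed shapes: t maps (source word, target word) to a lexical probability,
# and a(i, j, len_s, len_t) plays the role of f(i, j, length(S), length(T)).
Lexical = Dict[Tuple[str, str], float]
Positional = Callable[[int, int, int, int], float]

def model2_score(S: Sequence[str], T: Sequence[str], i: int, j: int,
                 t: Lexical, a: Positional) -> float:
    """Unnormalized score for aligning S[i] with T[j]: word factor times position factor."""
    return t.get((S[i], T[j]), 0.0) * a(i, j, len(S), len(T))

def best_alignment(S: Sequence[str], T: Sequence[str],
                   t: Lexical, a: Positional) -> List[int]:
    """For each source position, pick the target position with the highest score."""
    return [max(range(len(T)), key=lambda j: model2_score(S, T, i, j, t, a))
            for i in range(len(S))]
```

Because each source word is scored independently of the others, the best alignment is found one word at a time, which is what makes EM training efficient for this model.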
Introducing Phrases
Groups of words tend to translate as a unit (i.e., a “phrase”). IBM Model 2 has no notion of this. We began with a working IBM Model 2 word aligner (from PA2) and looked at three ways to extend this model using the notion of phrases.
Technique 1: Nearby Neighbors
Ideal: instead of measuring displacement relative to the diagonal, measure displacement relative to the previous alignment. This is hard: to be efficient, EM assumes that all alignments are independent, so referring to “the previous alignment” has no meaning. We get around this by means of a weaker dependency. For the likelihood of aligning S[i] to T[j], we don’t ask whether S[i-1] is aligned to T[j-1]; we ask whether aligning S[i-1] to T[j-1] would be a good alignment.
Technique 1: Nearby Neighbors
Suppose we have a function P(S, T, i, j) that returns the probability that S[i] aligns with T[j]. Define P’(S, T, i, j) = λ_1 · P(S, T, i, j) + λ_2 · P(S, T, i - 1, j - 1), with λ_1 = 0.95 and λ_2 = 0.05. We use this distribution both in the EM phase and in computing the final results. We also tried a variety of similar models.
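The interpolation can be sketched as a wrapper around any base model P. The name `nearby_neighbors` is hypothetical, and the boundary handling is an assumption: the slides do not say what happens when i or j is the first position, so this sketch lets the neighbor term contribute nothing there.

```python
from typing import Callable, Sequence

AlignProb = Callable[[Sequence[str], Sequence[str], int, int], float]

def nearby_neighbors(p: AlignProb, lam1: float = 0.95, lam2: float = 0.05) -> AlignProb:
    """Wrap a base model P so that a good (i-1, j-1) pairing boosts the (i, j) pairing."""
    def p_prime(S: Sequence[str], T: Sequence[str], i: int, j: int) -> float:
        base = p(S, T, i, j)
        # Assumption: at the start of either sentence there is no previous
        # pair, so the neighbor term is taken to be 0.
        neighbor = p(S, T, i - 1, j - 1) if i > 0 and j > 0 else 0.0
        return lam1 * base + lam2 * neighbor
    return p_prime
```

Since P’ only queries the base model at a fixed neighboring cell, each (i, j) score can still be computed independently, which is what keeps the EM updates tractable.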
Technique 1: Nearby Neighbors Results
When added to Model 2, the technique provided only a slight improvement in the quality of the final results, but a massive speedup in EM convergence, because it pre-encodes information that would otherwise have to be learned. When added to Model 1, it provided a notable improvement in the quality of the results. The technique adds information, but most of that information is already captured by Model 2.
Technique 2: Beam Search
The IBM models had a slightly different solution. IBM Model 2 penalizes alignments of S[i] to T[j] that have a high displacement d(S[i], T[j]) from the diagonal. Since the words of a phrase tend to move together, each word in a displaced phrase incurs the penalty. IBM Model 4 instead penalizes alignments of S[i] to T[j] that have a high displacement relative to the alignment of S[i − 1]. Thus, only the first word in each phrase is penalized.
Technique 2: Beam Search
To know where the previous source word was aligned, however, we need to keep track of each partial alignment for the sentence. We cannot afford to evaluate every possible alignment, since their number is exponential in the length of the sentence. Instead, we maintain a beam of the n best alignments for the previous word.
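Keeping the beam amounts to remembering, for each source word, only the n highest-scoring target positions. A minimal sketch, assuming the per-position scores for one source word are already available as a list (the name `n_best_alignments` is hypothetical):

```python
import heapq
from typing import List, Sequence

def n_best_alignments(scores: Sequence[float], n: int = 2) -> List[int]:
    """Return the target positions of the n highest-scoring alignments for one source word.

    scores[j] is the score for aligning the current source word to T[j].
    The result is ordered from best to worst.
    """
    return heapq.nlargest(n, range(len(scores)), key=lambda j: scores[j])
```

With a beam of fixed width, the work per source word stays proportional to the target sentence length rather than growing with the number of partial alignments.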
Technique 2: Beam Search
To assess a penalty for aligning S[i] to T[j], we compute d’(S[i], T[j]) as the minimal displacement, measured either absolutely from the diagonal or relative to one of the previous n best alignments. The two cases represent a new phrase and an old phrase, respectively. Formally, d’(S[i], T[j]) = min( d(S[i], T[j]), min over 1 ≤ m ≤ n of ( d(S[i − 1], T[j]) − d(S[i], T[k_m]) ) ), where T[k_m] is the m-th best alignment for S[i − 1].
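The idea of taking the smaller of an absolute and a relative displacement can be sketched as follows. This follows the prose description rather than transcribing the formula term by term, and both displacement measures are assumptions: the slides do not define d, so the sketch uses the distance from the length-scaled diagonal, and it measures relative displacement as the distance from the slot immediately after a previous alignment. The function names are hypothetical.

```python
from typing import Sequence

def diagonal_displacement(i: int, j: int, len_s: int, len_t: int) -> float:
    """Assumed d: distance of target position j from the diagonal position of source position i."""
    return abs(j - i * len_t / len_s)

def beam_displacement(i: int, j: int, len_s: int, len_t: int,
                      prev_best: Sequence[int]) -> float:
    """Smaller of the absolute displacement and the displacement relative to the beam.

    prev_best holds the target positions k_1..k_n of the n best alignments
    for the previous source word S[i-1].
    """
    absolute = diagonal_displacement(i, j, len_s, len_t)
    # Assumed relative measure: a word continuing a phrase is expected to
    # land right after the word that preceded it.
    relative = [abs(j - (k + 1)) for k in prev_best]
    return min([absolute] + relative)
```

With an empty beam (for example, at the first source word) the function falls back to the plain diagonal displacement, which corresponds to the "new phrase" case.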
Technique 2: Beam Search Results
In practice, n = 2 worked best: it gives just enough context without blurring the distinctions between phrases. Beam search resulted in more than a 20% improvement in alignment error rate (AER). Combined with the nearby neighbors approach, it gives a massive speedup as well.
Technique 3: Phrase Pre-chunking
Another idea was to find common phrases in each language and store them as a set. Then, whenever we see one of these common phrases in a sentence, we treat it as a single word when running Model 2. Ideally, we would find phrases of any length, taking the most probable phrases over the sentence as our chunks.
Technique 3: Phrase Pre-chunking Implementation Issues
We began by using only bigrams as our phrases, for simplicity. However, we found that this did not work well with our pre-existing Model 2 code. The function that returns the best word alignment expects alignments based on the original sentences’ indices, yet we need to pre-chunk the sentences to get any meaningful results from our training. Pre-chunking destroys the original indices, so we have to either store the old sentence or reconstruct the indices as we go.
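One way to avoid losing the original indices is to record, for every chunk, which original word positions it covers, and to expand chunk-level alignments back to word-level pairs afterwards. This is a sketch of that bookkeeping and not a description of the actual implementation; `chunk_bigrams`, `expand_alignment`, and the underscore-joined chunk tokens are all hypothetical.

```python
from typing import List, Set, Tuple

def chunk_bigrams(words: List[str],
                  phrases: Set[Tuple[str, str]]) -> Tuple[List[str], List[List[int]]]:
    """Greedily merge known bigrams left to right, remembering the original indices."""
    chunks: List[str] = []
    spans: List[List[int]] = []   # spans[c] = original word indices covered by chunk c
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in phrases:
            chunks.append(words[i] + "_" + words[i + 1])
            spans.append([i, i + 1])
            i += 2
        else:
            chunks.append(words[i])
            spans.append([i])
            i += 1
    return chunks, spans

def expand_alignment(chunk_pairs: List[Tuple[int, int]],
                     src_spans: List[List[int]],
                     tgt_spans: List[List[int]]) -> List[Tuple[int, int]]:
    """Map chunk-level alignment pairs back to word-level pairs over the original indices."""
    return [(i, j)
            for ci, cj in chunk_pairs
            for i in src_spans[ci]
            for j in tgt_spans[cj]]
```

Expanding a chunk pair to every word pair it covers is one possible convention; an evaluation against word-level gold alignments might call for a different one.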
Technique 3: Phrase Pre-chunking Ideas for Improvement
One improvement is expanding from bigrams to N-grams. Another is finding the best bigrams or N-grams, rather than taking the first one we see that is in our “good enough” set. Once we had a bigram, we tried checking whether its second word and the following word made a “better” bigram, and if so, used that one instead. This could potentially be improved upon with better techniques, though it would be more complicated with longer N-grams.
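The one-word lookahead can be sketched as follows, assuming each candidate bigram carries a numeric score in which higher means "better". The slides do not say how bigrams were scored, so the `scores` table and the name `chunk_with_lookahead` are hypothetical.

```python
from typing import Dict, List, Tuple

def chunk_with_lookahead(words: List[str],
                         scores: Dict[Tuple[str, str], float]) -> List[Tuple[str, ...]]:
    """Chunk a sentence into bigrams and single words, preferring a better overlapping bigram."""
    chunks: List[Tuple[str, ...]] = []
    i = 0
    while i < len(words):
        here = tuple(words[i:i + 2])       # bigram starting at i
        nxt = tuple(words[i + 1:i + 3])    # overlapping bigram starting at i + 1
        if len(here) == 2 and here in scores:
            if len(nxt) == 2 and scores.get(nxt, float("-inf")) > scores[here]:
                # The overlapping bigram scores higher: leave this word on its
                # own so the next iteration can consider the better bigram.
                chunks.append((words[i],))
                i += 1
            else:
                chunks.append(here)
                i += 2
        else:
            chunks.append((words[i],))
            i += 1
    return chunks
```

The same check runs again at the next position, so a chain of increasingly better overlapping bigrams is followed until the scores stop improving. Extending this to longer N-grams would require comparing many more overlapping candidates.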
Summary
Nearby neighbors approach: massive speed-up in EM convergence.
Beam search: more than 20% improvement in AER.
Combined neighbors and beam: both improvements (speed and AER) were maintained.
Phrase pre-chunking: a good idea for further exploration.