Introduction to Natural Language Processing
Statistical Translation: Alignment and Parameter Estimation

Dr. Jan Hajič
CS Dept., Johns Hopkins Univ.
Alignment

Available corpus assumed:
–parallel text (translation E ↔ F)
–no alignment present (day marks only)!

Sentence alignment:
–sentence detection
–sentence alignment

Word alignment:
–tokenization
–word alignment (with restrictions)
Sentence Boundary Detection

Rules, lists:
–sentence breaks: paragraphs (if marked)
–certain characters: ?, !, ; (...almost sure)

The problem: the period “.”
–could be the end of a sentence (... left yesterday. He was heading to...)
–decimal point: 3.6 (three-point-six)
–thousands separator: 3.200 (three-thousand-two-hundred)
–abbreviations never at the end of a sentence: cf., e.g., Calif., Mt., Mr.
–ellipsis: ...
–other languages: ordinal number indication (2nd ~ 2.)
–initials: A. B. Smith

Statistical methods: e.g., Maximum Entropy
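A minimal Python sketch of the rule-and-list approach above (the abbreviation list, the regular expressions, and the lowercase-next-word heuristic are illustrative assumptions; a statistical method such as Maximum Entropy would replace these hand-written rules):

import re

# Illustrative abbreviation list (assumption); a real system needs a much larger one.
ABBREV = {"cf.", "e.g.", "calif.", "mt.", "mr."}

def split_sentences(text):
    """Very rough rule-based sentence splitter (a sketch, not a full solution)."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if tok.endswith(("?", "!", ";")):            # almost surely a sentence break
            sentences.append(" ".join(current)); current = []
        elif tok.endswith("."):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if tok.lower() in ABBREV:                 # abbreviation: never a break here
                continue
            if re.fullmatch(r"\d[\d.]*", tok):        # decimal point / thousands separator
                continue
            if re.fullmatch(r"[A-Z]\.", tok):         # initials, as in "A. B. Smith"
                continue
            if nxt and nxt[0].islower():              # next word lowercase: probably no break
                continue
            sentences.append(" ".join(current)); current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("He left yesterday. He was heading to Calif. to see Mr. A. B. Smith."))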
Sentence Alignment

The problem: sentence boundaries are detected in E and F separately; no correspondence between them is known yet.

Desired output:
–a segmentation with an equal number of segments in E and F, spanning the whole text continuously
–original sentence boundaries kept
–alignments obtained (in the example): 2-1, 1-1, 1-1, 2-2, 2-1, 0-1

New segments are called “sentences” from now on.
Alignment Methods

Several methods (probabilistic and not):
–character-length based
–word-length based
–“cognates” (word identity used)
  –using an existing dictionary (F: prendre ~ E: make, take)
  –using word “distance” (similarity): names, numbers, borrowed words, Latin-origin words, ...

Best performing:
–statistical, word- or character-length based (with some words perhaps)
Length-based Alignment

First, define the problem probabilistically:

  argmax_A P(A|E,F) = argmax_A P(A,E,F)   (E,F fixed)

Define a “bead”: a short aligned pair of segments, one from E and one from F (e.g., a 2:2 bead covers two E sentences and two F sentences).

Approximate:

  P(A,E,F) ≅ ∏_{i=1..n} P(B_i)

where B_i is a bead; P(B_i) does not depend on the rest of E,F.
The Alignment Task

Given the model definition, P(A,E,F) ≅ ∏_{i=1..n} P(B_i), find the partitioning of (E,F) into n beads B_{i=1..n} that maximizes P(A,E,F) over the training data.

Define B_i = p:q_i, where p:q ∈ {0:1, 1:0, 1:1, 1:2, 2:1, 2:2}
–describes the type of alignment

Want to use some sort of dynamic programming:
Define Pref(i,j) ... the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j).
Recursive Definition

Initialize: Pref(0,0) = 1 (the empty prefix; probabilities are multiplied along the way).

  Pref(i,j) = max ( Pref(i,j-1)   × P(0:1_k),
                    Pref(i-1,j)   × P(1:0_k),
                    Pref(i-1,j-1) × P(1:1_k),
                    Pref(i-1,j-2) × P(1:2_k),
                    Pref(i-2,j-1) × P(2:1_k),
                    Pref(i-2,j-2) × P(2:2_k) )

This is enough for a Viterbi-like search.

[Figure: the six predecessor cells of (i,j) in the E × F sentence grid, one per bead type.]
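A minimal Python sketch of this Viterbi-like search (the function and variable names are illustrative; the bead score is passed in as a function, e.g. the length-based one sketched after the next slide). The search works in log space, so Pref(0,0) = 1 becomes log 1 = 0:

import math

# Bead types: (number of E sentences, number of F sentences) consumed.
BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

def align(e_sents, f_sents, bead_logprob):
    """Compute Pref(i,j) left-to-right and return the best bead sequence.
    bead_logprob(e_seg, f_seg, p, q) must return log P(p:q_k)."""
    I, J = len(e_sents), len(f_sents)
    pref = {(0, 0): 0.0}                    # log Pref; log 1 = 0 for the empty prefix
    back = {}                               # backpointers: (i, j) -> chosen (p, q)
    for i in range(I + 1):
        for j in range(J + 1):
            if (i, j) == (0, 0):
                continue
            best, best_pq = -math.inf, None
            for p, q in BEAD_TYPES:
                if i - p < 0 or j - q < 0:
                    continue
                score = pref[(i - p, j - q)] + bead_logprob(
                    e_sents[i - p:i], f_sents[j - q:j], p, q)
                if score > best:
                    best, best_pq = score, (p, q)
            pref[(i, j)], back[(i, j)] = best, best_pq
    # Follow backpointers from (I, J) back to (0, 0) to recover the bead sequence.
    beads, i, j = [], I, J
    while (i, j) != (0, 0):
        p, q = back[(i, j)]
        beads.append((e_sents[i - p:i], f_sents[j - q:j]))
        i, j = i - p, j - q
    return list(reversed(beads))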
Probability of a Bead

Remains to define P(p:q_k) (the bead factor in the recursion above):
–k refers to the “next” bead, with segments of p and q sentences, of lengths l_{k,e} and l_{k,f}.

Use a normal distribution for the length variation:

  P(p:q_k) = P(δ(l_{k,e}, l_{k,f}, μ, σ²), p:q) ≅ P(δ(l_{k,e}, l_{k,f}, μ, σ²)) P(p:q)

  δ(l_{k,e}, l_{k,f}, μ, σ²) = (l_{k,f} − μ l_{k,e}) / √(l_{k,e} σ²)

Estimate P(p:q) from a small amount of data, or even guess and re-estimate after aligning some data.

Words etc. might be used as better clues in the definition of P(p:q_k).
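A sketch of a length-based bead score that could plug into the align() sketch above. The P(p:q) priors, the mean μ, and the variance σ² below are guessed placeholder values (as the slide suggests, they would be estimated or re-estimated from data); character counts serve as segment lengths:

import math

# Guessed P(p:q) priors (to be re-estimated after aligning some data).
BEAD_PRIOR = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01,
              (2, 1): 0.04, (1, 2): 0.04, (2, 2): 0.01}

MU, SIGMA2 = 1.0, 6.8        # assumed mean and variance of the F-vs-E length relation

def bead_logprob(e_seg, f_seg, p, q, mu=MU, sigma2=SIGMA2):
    """log P(p:q_k) ~ log N(delta; 0, 1) + log P(p:q), with delta as defined above."""
    l_e = sum(len(s) for s in e_seg) or 1    # avoid division by zero for 0:1 beads
    l_f = sum(len(s) for s in f_seg)
    delta = (l_f - mu * l_e) / math.sqrt(l_e * sigma2)
    log_normal = -0.5 * delta * delta - 0.5 * math.log(2 * math.pi)
    return log_normal + math.log(BEAD_PRIOR[(p, q)])

Calling align(e_sents, f_sents, bead_logprob) with lists of sentence strings then produces the bead sequence described above.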
Saving Time

For long texts (> 10⁴ sentences), even Viterbi (in the version needed here) is not effective (O(S²) time).

Go paragraph by paragraph if they are aligned 1:1. What if not? Apply the same method first to paragraphs!
–identify paragraphs roughly in both languages
–run the algorithm to get aligned paragraph-like segments
–then run on sentences within the aligned paragraphs

Performs well if there are not many consecutive 1:0 or 0:1 beads.
Word Alignment

Length alone does not help anymore:
–mainly because words can be swapped, and mutual translations often have vastly different lengths
–...but at least we have “sentences” (sentence-like segments) aligned; that will be exploited heavily

Idea:
–Assume some (simple) translation model (such as Model 1).
–Find its parameters by considering virtually all alignments.
–After we have the parameters, find the best alignment given those parameters.
Word Alignment Algorithm

Start with a sentence-aligned corpus. Let (E,F) be a pair of sentences (actually, a bead).

Initialize p(f|e) randomly (e.g., uniformly), for f ∈ F, e ∈ E.

Compute expected counts over the corpus:

  c(f,e) = ∑_{(E,F); e∈E, f∈F} p(f|e)

  (for every aligned pair (E,F), check whether e is in E and f is in F; if yes, add p(f|e))

Reestimate: p(f|e) = c(f,e) / c(e)   [where c(e) = ∑_f c(f,e)]

Iterate until the change in p(f|e) is small.
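A minimal Python sketch of this estimation loop (names are illustrative). One detail worth flagging: the sketch follows the standard Model 1 expected-count step, which first normalizes p(f|e) over the words e of the current sentence E (plus an assumed empty/NULL word) before adding it to c(f,e); the summary above omits that normalization. A fixed number of iterations stands in for “iterate until the change is small”:

from collections import defaultdict

NULL = "<null>"                      # the empty word of Model 1 (an assumption here)

def train_model1(corpus, iterations=10):
    """corpus: list of (E, F) pairs, each a list of word tokens. Returns p(f|e)."""
    p = defaultdict(lambda: 1.0)     # start uniform: every p(f|e) equal before normalization
    for _ in range(iterations):
        count = defaultdict(float)   # c(f,e)
        total = defaultdict(float)   # c(e)
        for E, F in corpus:
            E_ext = E + [NULL]
            for f in F:
                norm = sum(p[(f, e)] for e in E_ext)      # normalize over e in E (+ NULL)
                for e in E_ext:
                    frac = p[(f, e)] / norm
                    count[(f, e)] += frac                 # expected count c(f,e)
                    total[e] += frac                      # c(e) = sum_f c(f,e)
        p = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return p

corpus = [(["the", "house"], ["la", "maison"]),
          (["the", "book"],  ["le", "livre"]),
          (["a", "house"],   ["une", "maison"])]
p = train_model1(corpus)
print(p[("maison", "house")])        # mass concentrates on genuine translation pairs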
Best Alignment

Select, for each (E,F):

  A = argmax_A P(A|F,E) = argmax_A P(F,A|E)/P(F) = argmax_A P(F,A|E)
    = argmax_A ( ε/(l+1)^m ∏_{j=1..m} p(f_j|e_{a_j}) ) = argmax_A ∏_{j=1..m} p(f_j|e_{a_j})

Again, use dynamic programming, a Viterbi-like algorithm.

Recompute p(f|e) based on the best alignment (only if you are inclined to do so; the “original” summed-over-all distribution might perform better).

Note: we have also got all the Model 1 parameters.
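For Model 1 in particular, the product above decomposes over the positions j, so the Viterbi-like search reduces to choosing, for each f_j independently, the e_i (or the empty word) with maximal p(f_j|e_i). A minimal sketch reusing the p(f|e) table from the previous sketch:

def best_alignment(E, F, p, null="<null>"):
    """Return a_1..a_m: for each word f_j of F, the index of the E word it aligns to
    (0 stands for the empty word), maximizing prod_j p(f_j | e_{a_j})."""
    E_ext = [null] + E               # position 0 is the empty (NULL) word
    return [max(range(len(E_ext)), key=lambda i: p[(f, E_ext[i])]) for f in F]

# Example, assuming p was trained as in the previous sketch:
# best_alignment(["the", "house"], ["la", "maison"], p)   ->  e.g. [1, 2]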