Introduction to Natural Language Processing (600.465)
Statistical Machine Translation
Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.
hajic@cs.jhu.edu, www.cs.jhu.edu/~hajic
12/07/1999, JHU CS 600.465

The Main Idea
Treat translation as a noisy channel problem:
Input (source) E: English words... → the channel (adds “noise”) → “noisy” output (target) F: Les mots Anglais...
The model: $P(E|F) = P(F|E)\,P(E) / P(F)$
We are interested in rediscovering E given F. After the usual simplification ($P(F)$ is fixed):
$\arg\max_E P(E|F) = \arg\max_E P(F|E)\,P(E)$
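The argmax above can be illustrated with a minimal sketch that scores an explicit list of candidate English sentences. The tables `LM` and `TM`, their numbers, and the function name `decode` are hypothetical and serve only to show how the two factors combine; a real system cannot enumerate candidates like this.

```python
import math

# Hypothetical toy distributions; the numbers are invented for illustration.
LM = {"the program": 0.6, "program the": 0.1}              # P(E)
TM = {("le programme", "the program"): 0.5,
      ("le programme", "program the"): 0.5}                # P(F|E)

def decode(f, candidates):
    """Return argmax_E P(F|E) * P(E) over an explicit candidate list."""
    def score(e):
        # Work in log space; unseen events get a tiny floor probability.
        return math.log(TM.get((f, e), 1e-12)) + math.log(LM.get(e, 1e-12))
    return max(candidates, key=score)
```

Here the translation model cannot tell the two word orders apart, so the language model decides, which is the division of labor the noisy channel formulation relies on.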

The Necessities
Language Model (LM): $P(E)$
Translation Model (TM), target given source: $P(F|E)$
Search procedure:
 – Given F, find the best E using the LM and TM distributions.
Usual problem: sparse data
 – We cannot create a “sentence dictionary” E ↔ F.
 – Typically, we do not see a sentence even twice!

The Language Model
Any LM will do:
 – 3-gram LM
 – 3-gram class-based LM
 – decision tree LM with hierarchical classes
It does not necessarily operate on word forms:
 – cf. the “analysis” and “generation” procedures later
 – for simplicity, imagine for now that it does operate on word forms

The Translation Models
We do not care about correct strings of English words (that is the task of the LM).
Therefore, we can make more independence assumptions:
 – for a start, use the “tagging” approach: 1 English word (“tag”) ~ 1 French word (“word”)
 – not realistic: even the number of words is rarely the same in both sentences (let alone a 1:1 correspondence!)
→ use “alignment”.

The Alignment
English (position word): 0 e0 | 1 And | 2 the | 3 program | 4 has | 5 been | 6 implemented
French (position word): 0 f0 | 1 Le | 2 programme | 3 a | 4 été | 5 mis | 6 en | 7 application
Linear notation (each word is followed by the position(s) it links to in the other sentence):
 – f0(1) Le(2) programme(3) a(4) été(5) mis(6) en(6) application(6)
 – e0 And(0) the(1) program(2) has(3) been(4) implemented(5,6,7)
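As a small sketch of the data structure behind this notation, the alignment can be stored as one English position per French position and both linear notations printed from it. The variable and function names are illustrative assumptions; the link values are the ones given on the slide.

```python
english = ["e0", "And", "the", "program", "has", "been", "implemented"]
french = ["f0", "Le", "programme", "a", "été", "mis", "en", "application"]

# a[j] = i: the French word at position j links to the English word at position i.
a = [1, 2, 3, 4, 5, 6, 6, 6]

def french_side(french, a):
    """Each French word followed by the English position it links to."""
    return " ".join(f"{w}({i})" for w, i in zip(french, a))

def english_side(english, a):
    """Each English word followed by the French positions linking to it (if any)."""
    parts = []
    for i, w in enumerate(english):
        links = [str(j) for j, target in enumerate(a) if target == i]
        parts.append(f"{w}({','.join(links)})" if links else w)
    return " ".join(parts)
```

Storing only `a` is enough because every French word has exactly one link; the English-side view is derived from it.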

Alignment Mapping
In general:
 – $|F| = m$, $|E| = l$ (sentence lengths): $l \cdot m$ possible connections (each French word to any English word), hence $2^{lm}$ different alignments for any pair (E,F) (any subset of connections)
In practice, going from English to French:
 – each English word has 1-n connections (n: empirical maximum fertility?)
 – each French word has exactly 1 connection
 – therefore, “only” $(l+1)^m$ alignments ($\ll 2^{lm}$)
 – $a_j = i$ (the link from the j-th French word goes to the i-th English word)
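To get a feeling for the sizes involved, the three counts can be computed directly. This is plain arithmetic on the formulas above; the function name is an illustrative assumption, and the example lengths l = 6, m = 7 are those of the "And the program has been implemented" sentence pair without the empty words.

```python
def alignment_counts(l, m):
    """Return (connections, unrestricted alignments, restricted alignments)."""
    connections = l * m              # every French word to every English word
    unrestricted = 2 ** (l * m)      # any subset of the connections
    restricted = (l + 1) ** m        # one link per French word, incl. the empty word e0
    return connections, unrestricted, restricted

print(alignment_counts(6, 7))  # (42, 4398046511104, 823543)
```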

Elements of Translation Model(s)
Basic distribution: $P(F,A,E)$, the joint distribution of the English sentence, the alignment, and the French sentence (of length m).
We are also interested in marginal distributions:
$P(F,E) = \sum_A P(F,A,E)$
$P(F|E) = P(F,E) / P(E) = \sum_A P(F,A,E) / \sum_{A,F} P(F,A,E) = \sum_A P(F,A|E)$
Useful decomposition [one of several possible decompositions]:
$P(F,A|E) = P(m|E) \prod_{j=1}^{m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E)\, P(f_j | a_1^{j}, f_1^{j-1}, m, E)$

Decomposition
Decomposition formula again:
$P(F,A|E) = P(m|E) \prod_{j=1}^{m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E)\, P(f_j | a_1^{j}, f_1^{j-1}, m, E)$
 – $m$: length of the French sentence
 – $a_j$: the alignment (single connection) going from the j-th French word
 – $f_j$: the j-th French word of F
 – $a_1^{j-1}$: sequence of alignments $a_i$ up to the word preceding $f_j$
 – $a_1^{j}$: sequence of alignments $a_i$ up to and including the word $f_j$
 – $f_1^{j-1}$: sequence of French words up to the word preceding $f_j$
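For readers who want the missing step, the decomposition is the chain rule applied once for the length and then position by position. The following derivation is a sketch using only the symbols defined above; no approximation is involved at this point.

```latex
\begin{align*}
P(F,A \mid E)
  &= P(m \mid E)\; P(f_1^{m}, a_1^{m} \mid m, E) \\
  &= P(m \mid E) \prod_{j=1}^{m} P(a_j, f_j \mid a_1^{j-1}, f_1^{j-1}, m, E) \\
  &= P(m \mid E) \prod_{j=1}^{m}
     P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, E)\;
     P(f_j \mid a_1^{j}, f_1^{j-1}, m, E)
\end{align*}
```

The last line splits each joint factor into "choose the link first, then the word", which is what fixes the order of the generative story.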

Decomposition and the Generative Model
...and again:
$P(F,A|E) = P(m|E) \prod_{j=1}^{m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, E)\, P(f_j | a_1^{j}, f_1^{j-1}, m, E)$
Generate:
 – first, the length of the French sentence given the English words E;
 – then, the link from the first position in F (not knowing the actual word yet) → now we know the English word;
 – then, given the link (and thus the English word), generate the French word at the current position;
 – then, move to the next position in F, until all m positions are filled.

Approximations
Still too many parameters:
 – similar to the situation in an n-gram model with “unlimited” n
 – impossible to estimate reliably
Use 5 models, from the simplest to the most complex (i.e. from heavy independence assumptions to light ones).
Parameter estimation: estimate the parameters of Model 1; use them as an initial estimate for estimating the Model 2 parameters; etc.

Model 1
Approximations:
 – French length distribution $P(m|E)$ is constant (a small $\varepsilon$)
 – Alignment link distribution $P(a_j | a_1^{j-1}, f_1^{j-1}, m, E)$ depends on the English length l only ($= 1/(l+1)$)
 – French word distribution depends only on the English and French words connected by the link $a_j$
→ Model 1 distribution:
$P(F,A|E) = \frac{\varepsilon}{(l+1)^m} \prod_{j=1}^{m} p(f_j | e_{a_j})$
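The Model 1 distribution is simple enough to evaluate in a few lines. The sketch below assumes the translation table is a dictionary `t` keyed by `(french_word, english_word)`, that `english[0]` is the empty word e0, and that `epsilon` defaults to 1; all of these names and conventions are illustrative choices.

```python
def model1_prob(french, english, alignment, t, epsilon=1.0):
    """P(F,A|E) = epsilon / (l+1)^m * prod_j t[(f_j, e_{a_j})].

    english[0] is the empty word e0; alignment[j] is an index into english.
    """
    l = len(english) - 1          # English length without the empty word
    m = len(french)               # French length
    prob = epsilon / (l + 1) ** m
    for f, i in zip(french, alignment):
        prob *= t.get((f, english[i]), 0.0)   # unseen pairs get probability 0
    return prob
```

With two English words, two French words, and link probabilities 0.5 and 0.8, the result is 0.4 / 9, i.e. the word product divided by the number of possible alignments.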

Models 2-5
Model 2
 – adds more detail to $P(a_j | ...)$: more “vertical” links are preferred
Model 3
 – adds “fertility” (the number of links for a given English word is explicitly modeled: $P(n|e_i)$)
 – “distortion” replaces the alignment probabilities from Model 2
Model 4
 – the notion of “distortion” is extended to chunks of words
Model 5
 – is Model 4, but not deficient (does not waste probability on non-strings)

The Search Procedure
“Decoder”:
 – given the “output” (French), discover the “input” (English)
The translation model goes in the opposite direction: p(f|e) = ...
Naive methods do not work.
Possible solution (roughly):
 – generate English words one by one, keeping only an n-best list (with variable n); also, account for the different lengths of the English sentence candidates!

Analysis - Translation - Generation (ATG)
Word forms: too sparse
Use four basic analysis and generation steps:
 – tagging
 – lemmatization
 – word-sense disambiguation
 – noun-phrase “chunks” (non-compositional translations)
Translation proper:
 – use chunks as “words”

Training vs. Test with ATG
Training:
 – analyze both languages using all four analysis steps
 – train the TM(s) on the result (i.e. on chunks, tags, etc.)
 – train the LM on the analyzed source (English)
Runtime/Test:
 – analyze the given-language sentence (French) using the same tools as in training
 – translate using the trained translation and language model(s)
 – generate the source (English), reversing the analysis process

Analysis: Tagging and Morphology
Replace word forms by morphologically processed text:
 – lemmas
 – tags
 – original approach: mix them into the text, call them “words”
 – e.g. She bought two books. → she buy VBP two book NNS.
Tagging: yes
 – but in reversed order: tag first, then lemmatize [NB: does not work for inflective languages]
 – technically easy
Hand-written deterministic rules for tag+form → lemma

Word Sense Disambiguation, Word Chunking
Sets of senses for each E, F word:
 – e.g. book-1, book-2, ..., book-n
 – prepositions (de-1, de-2, de-3, ...), many others
Senses are derived automatically using the TM:
 – translation probabilities are measured on senses: p(de-3|from-5)
Result:
 – a statistical model for assigning senses monolingually, based on context (a MaxEnt model is also used here for each word)
Chunks: group words for non-compositional translation

Generation
Inverse of analysis
Much simpler:
 – Chunks → words (lemmas) with senses (trivial)
 – Words (lemmas) with senses → words (lemmas) (trivial)
 – Words (lemmas) + tags → word forms
Additional step:
 – Source-language ambiguity: electric vs. electrical, hath vs. has, you vs. thou. Each pair is treated as a single unit in translation proper, but must be disambiguated at the end of the generation phase, using an additional pure LM on word forms.

Introduction to Natural Language Processing (600.465)
Statistical Translation: Alignment and Parameter Estimation

Alignment
Available corpus assumed:
 – parallel text (translation E ↔ F)
No alignment present (day marks only)!
Sentence alignment
 – sentence detection
 – sentence alignment
Word alignment
 – tokenization
 – word alignment (with restrictions)

Sentence Boundary Detection
Rules, lists:
 – Sentence breaks: paragraphs (if marked); certain characters: ?, !, ; (...almost sure)
The Problem: the period.
 – could be the end of a sentence (... left yesterday. He was heading to...)
 – decimal point: 3.6 (three-point-six)
 – thousands separator: 3.200 (three-thousand-two-hundred)
 – abbreviation never at the end of a sentence: cf., e.g., Calif., Mt., Mr.
 – ellipsis: ...
 – other languages: ordinal number indication (2nd ~ 2.)
 – initials: A. B. Smith
Statistical methods: e.g., Maximum Entropy
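The rule-and-list approach can be sketched roughly as follows. This is only a minimal illustration: the abbreviation list is the sample from the slide, the "next token is capitalized" test is an added heuristic not stated on the slide, and the function name is an assumption. The statistical (Maximum Entropy) alternative is not shown.

```python
import re

# Sample abbreviation list taken from the slide (lowercased for lookup).
ABBREVIATIONS = {"cf.", "e.g.", "calif.", "mt.", "mr."}

def split_sentences(text):
    """Split whitespace-tokenized text into sentences using simple rules."""
    sentences, current = [], []
    tokens = text.split()
    for k, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[k + 1] if k + 1 < len(tokens) else None
        if tok.endswith(("?", "!", ";")):
            is_break = True                              # "almost sure" break characters
        elif tok.endswith("."):
            is_break = not (
                tok.lower() in ABBREVIATIONS             # known abbreviation
                or re.fullmatch(r"[A-Z]\.", tok)         # initial, as in "A. B. Smith"
                or tok.endswith("...")                   # ellipsis
                or (nxt is not None and not nxt[0].isupper())  # assumed heuristic
            )
        else:
            is_break = False                             # e.g. "3.6": period is inside the token
        if is_break or nxt is None:
            sentences.append(" ".join(current))
            current = []
    return sentences
```

Decimal points and thousands separators never trigger a break here because the period is not token-final; ordinals such as "2." in other languages would need an extra rule.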

Sentence Alignment
The Problem: only sentences have been detected so far.
[Figure: the E and F texts as two rows of sentence segments]
Desired output: a segmentation with an equal number of segments in both texts, spanning the whole text continuously. Original sentence boundaries are kept.
[Figure: the E and F rows grouped into aligned segments]
Alignments obtained: 2-1, 1-1, 1-1, 2-2, 2-1, 0-1
The new segments are called “sentences” from now on.

Alignment Methods
Several methods (probabilistic and non-probabilistic):
 – character-length based
 – word-length based
 – “cognates” (word identity used)
 – using an existing dictionary (F: prendre ~ E: make, take)
 – using word “distance” (similarity): names, numbers, borrowed words, words of Latin origin, ...
Best performing:
 – statistical, word- or character-length based (perhaps with some words)

Length-based Alignment
First, define the problem probabilistically:
$\arg\max_A P(A|E,F) = \arg\max_A P(A,E,F)$ (E, F fixed)
Define a “bead”:
[Figure: the E and F rows with one aligned group of segments marked as a “bead” (2:2 in this case)]
Approximate:
$P(A,E,F) \cong \prod_{i=1}^{n} P(B_i)$,
where $B_i$ is a bead; $P(B_i)$ does not depend on the rest of E, F.

The Alignment Task
Given the model definition, $P(A,E,F) \cong \prod_{i=1}^{n} P(B_i)$, find the partitioning of (E,F) into n beads $B_{i=1..n}$ that maximizes $P(A,E,F)$ over the training data.
Define $B_i = p{:}q\ \alpha_i$, where $p{:}q \in \{0{:}1, 1{:}0, 1{:}1, 1{:}2, 2{:}1, 2{:}2\}$
 – p:q describes the type of alignment
We want to use some sort of dynamic programming:
Define Pref(i,j): the probability of the best alignment from the start of the (E,F) data (1,1) up to (i,j)

Recursive Definition
Initialize: $\mathit{Pref}(0,0) = 1$ (or 0 when working with log-probabilities, where the products below become sums).
$$\mathit{Pref}(i,j) = \max \begin{cases}
\mathit{Pref}(i,j-1)\, P(0{:}1\ \alpha_k), \\
\mathit{Pref}(i-1,j)\, P(1{:}0\ \alpha_k), \\
\mathit{Pref}(i-1,j-1)\, P(1{:}1\ \alpha_k), \\
\mathit{Pref}(i-1,j-2)\, P(1{:}2\ \alpha_k), \\
\mathit{Pref}(i-2,j-1)\, P(2{:}1\ \alpha_k), \\
\mathit{Pref}(i-2,j-2)\, P(2{:}2\ \alpha_k)
\end{cases}$$
This is enough for a Viterbi-like search.
[Figure: the E and F rows with position (i,j) marked and the six possible predecessor beads drawn as arcs into it]
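The recursion translates almost directly into a dynamic program. The sketch below works in log space, so the initial value is 0 and products become sums. The bead scorer is passed in as a function; `prior_only` and the numbers in `PRIOR` are hypothetical stand-ins that ignore sentence lengths and exist only to make the example runnable.

```python
import math

BEAD_TYPES = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]

# Hypothetical bead-type priors P(p:q); invented numbers that sum to 1.
PRIOR = {(1, 1): 0.89, (1, 0): 0.005, (0, 1): 0.005,
         (2, 1): 0.045, (1, 2): 0.045, (2, 2): 0.01}

def prior_only(i, j, p, q):
    """Toy scorer: log P(p:q), ignoring the sentences that end at (i, j)."""
    return math.log(PRIOR[(p, q)])

def align(n_e, n_f, bead_logprob):
    """Best bead sequence covering n_e English and n_f French sentences."""
    neg_inf = float("-inf")
    pref = [[neg_inf] * (n_f + 1) for _ in range(n_e + 1)]
    back = [[None] * (n_f + 1) for _ in range(n_e + 1)]
    pref[0][0] = 0.0                                  # log 1
    for i in range(n_e + 1):
        for j in range(n_f + 1):
            for p, q in BEAD_TYPES:
                if i < p or j < q or pref[i - p][j - q] == neg_inf:
                    continue
                cand = pref[i - p][j - q] + bead_logprob(i, j, p, q)
                if cand > pref[i][j]:
                    pref[i][j], back[i][j] = cand, (p, q)
    if pref[n_e][n_f] == neg_inf:
        raise ValueError("no alignment with non-zero probability")
    beads, i, j = [], n_e, n_f
    while (i, j) != (0, 0):                           # follow back-pointers
        p, q = back[i][j]
        beads.append((p, q))
        i, j = i - p, j - q
    return beads[::-1], pref[n_e][n_f]
```

For example, `align(2, 2, prior_only)` prefers two 1:1 beads over a single 2:2 bead, because 0.89 × 0.89 exceeds 0.01.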

Probability of a Bead
It remains to define $P(p{:}q\ \alpha_k)$ (the part shown in red on the slide):
 – k refers to the “next” bead, with segments of p and q sentences and lengths $l_{k,e}$ and $l_{k,f}$.
Use a normal distribution for the length variation:
$P(p{:}q\ \alpha_k) = P(\delta(l_{k,e}, l_{k,f}, \mu, \sigma^2),\ p{:}q) \cong P(\delta(l_{k,e}, l_{k,f}, \mu, \sigma^2))\, P(p{:}q)$
$\delta(l_{k,e}, l_{k,f}, \mu, \sigma^2) = (l_{k,f} - \mu\, l_{k,e}) / \sqrt{l_{k,e}\, \sigma^2}$
Estimate $P(p{:}q)$ from a small amount of data, or even guess it and re-estimate after aligning some data.
Words etc. might be used as better clues in the definition of $P(p{:}q\ \alpha_k)$.
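A possible way to turn the length statistic into a bead score is sketched below. The slide only says that a normal distribution is used; taking the two-sided tail probability of a standard normal is one common choice and is an assumption here, as are the function names and the default values of `mu` and `sigma2`. The formula is undefined for an empty English segment, so the sketch rejects that case rather than guessing a fix.

```python
import math

def delta(len_e, len_f, mu, sigma2):
    """Normalized length difference (l_f - mu * l_e) / sqrt(l_e * sigma^2)."""
    if len_e <= 0:
        raise ValueError("delta is undefined for an empty English segment")
    return (len_f - mu * len_e) / math.sqrt(len_e * sigma2)

def length_prob(d):
    """Two-sided tail probability of a standard normal at |d| (assumed choice)."""
    return math.erfc(abs(d) / math.sqrt(2))

def bead_prob(len_e, len_f, prior, mu=1.0, sigma2=6.8):
    """P(delta) * P(p:q); mu and sigma2 are placeholder values to be estimated."""
    return length_prob(delta(len_e, len_f, mu, sigma2)) * prior
```

Segments whose lengths match the expected ratio get a delta of 0 and therefore the full prior; the score drops as the lengths diverge.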

Saving time
For long texts ($> 10^4$ sentences), even Viterbi (in the version needed) is too slow ($O(S^2)$ time for S sentences).
Go paragraph by paragraph if they are aligned 1:1.
What if not? Apply the same method to paragraphs first!
 – identify paragraphs roughly in both languages
 – run the algorithm to get aligned paragraph-like segments
 – then, run it on the sentences within paragraphs
Performs well if there are not many consecutive 1:0 or 0:1 beads.

Word alignment
Length alone does not help anymore.
 – mainly because words can be swapped, and mutual translations often have vastly different lengths.
...but at least we have “sentences” (sentence-like segments) aligned; that will be exploited heavily.
Idea:
 – Assume some (simple) translation model (such as Model 1).
 – Find its parameters by considering virtually all alignments.
 – Once we have the parameters, find the best alignment given those parameters.

Word Alignment Algorithm
Start with a sentence-aligned corpus.
Let (E,F) be a pair of sentences (actually, a bead).
Initialize $p(f|e)$ randomly (e.g., uniformly), $f \in F$, $e \in E$.
Compute expected counts over the corpus:
$c(f,e) = \sum_{(E,F);\ e \in E,\ f \in F} p(f|e)$
 – $\forall$ aligned pair (E,F), check whether e is in E and f is in F; if yes, add $p(f|e)$.
Re-estimate: $p(f|e) = c(f,e) / c(e)$, where $c(e) = \sum_f c(f,e)$
Iterate until the change in $p(f|e)$ is small.
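The loop above can be sketched as a short EM routine. One deliberate difference from the slide's summary: the sketch divides each contribution by the sum of $p(f|e')$ over the English words of the same sentence, which is the normalization used in the standard IBM Model 1 EM update and is not spelled out on the slide. The function name, the fixed iteration count, and the toy corpus in the usage note are assumptions; the empty word e0 is left out for brevity.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (english_words, french_words) pairs (beads).

    Returns a dict-like table t with t[(f, e)] approximating p(f|e).
    """
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))       # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                    # expected counts c(f, e)
        total = defaultdict(float)                    # c(e) = sum_f c(f, e)
        for es, fs in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)     # per-sentence normalization
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # Re-estimate p(f|e) = c(f, e) / c(e); unseen pairs default to 0.
        t = defaultdict(float, {(f, e): c / total[e]
                                for (f, e), c in count.items()})
    return t
```

On a toy corpus such as `[(["the", "house"], ["la", "maison"]), (["the", "book"], ["le", "livre"]), (["a", "book"], ["un", "livre"])]`, the table shifts probability mass toward pairs like ("livre", "book") that co-occur consistently.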

Best Alignment
Select, for each (E,F):
$A = \arg\max_A P(A|F,E) = \arg\max_A P(F,A|E) / P(F|E) = \arg\max_A P(F,A|E)$
$= \arg\max_A \left( \frac{\varepsilon}{(l+1)^m} \prod_{j=1}^{m} p(f_j | e_{a_j}) \right) = \arg\max_A \prod_{j=1}^{m} p(f_j | e_{a_j})$ (IBM Model 1)
Again, use dynamic programming, a Viterbi-like algorithm.
Recompute $p(f|e)$ based on the best alignment (only if you are inclined to do so; the “original” summed-over-all distribution might perform better).
Note: we have also obtained all Model 1 parameters.
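For Model 1 specifically, the product has one independent factor per French position, so the maximization can be done position by position. The sketch below relies on that property and would not carry over to the later models. The table format `t[(f, e)]` and the function name are assumptions.

```python
def best_alignment(french, english, t):
    """Return a with a[j] = argmax_i t[(f_j, e_i)] for each French position j.

    english[0] may be the empty word e0; unseen pairs score 0.
    """
    return [max(range(len(english)),
                key=lambda i: t.get((f, english[i]), 0.0))
            for f in french]
```

When several English words tie for a French word, `max` keeps the first one, so ties resolve toward the lowest English position.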
