Jan 2009 Statistical MT

Approaching MT
There are many different ways of approaching the problem of MT. The choice of approach is complex and depends upon:
–Task requirements
–Human resources
–Linguistic resources

Criterial Issues
Do we want a translation system for one language pair or for many language pairs?
Can we assume a constrained vocabulary or do we need to deal with arbitrary text?
What resources exist for the languages that we are dealing with?
How long will it take us to develop the resources, and what human resources will that require?

Parallel Data
Lots of translated text available: 100s of millions of words of translated text for some language pairs
–a book has a few 100,000s of words
–an educated person may read 10,000 words a day
–3.5 million words a year
–300 million a lifetime
Computers can see more translated text than humans read in a lifetime.
Machines can learn how to translate foreign languages. [Koehn 2006]

Statistical Translation
Robust
Domain independent
Extensible
Does not require language specialists
Does require parallel texts
Uses noisy channel model of translation

Noisy Channel Model
Sentence Translation (Brown et al. 1990)
[Figure: noisy channel diagram relating the source sentence and the target sentence]

Statistical Modelling
Learn P(f|e) from a parallel corpus
Not sufficient data to estimate P(f|e) directly
[from Koehn 2006]

The Problem of Translation
Given a sentence T of the target language, seek the source sentence S from which a translator produced T, i.e. find the S that maximises P(S|T).
By Bayes' theorem,
P(S|T) = P(S) x P(T|S) / P(T)
whose denominator is independent of S. Hence it suffices to maximise P(S) x P(T|S).

The Three Components of a Statistical MT Model
1.Method for computing language model probabilities (P(S))
2.Method for computing translation probabilities (P(T|S))
3.Method for searching amongst source sentences for one that maximises P(S) * P(T|S)

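The three components can be sketched in a few lines. This is only a toy illustration, not the method of any real system: the two dictionaries stand in for the language model and the translation model, their probabilities are invented for the example, and the "search" is a plain maximum over a fixed list of candidates. The names `language_model`, `translation_model` and `decode` are hypothetical.

```python
# Hypothetical toy probabilities, invented purely for illustration.
language_model = {                 # component 1: P(S)
    "the cat sleeps": 0.02,
    "cat the sleeps": 0.0001,
}
translation_model = {              # component 2: P(T|S), keyed by (T, S)
    ("le chat dort", "the cat sleeps"): 0.3,
    ("le chat dort", "cat the sleeps"): 0.3,
}


def decode(target, candidates):
    """Component 3: return the candidate S that maximises P(S) * P(T|S)."""
    return max(
        candidates,
        key=lambda s: language_model[s] * translation_model[(target, s)],
    )


best = decode("le chat dort", list(language_model))
print(best)
```

Here the translation model cannot tell the two candidates apart, so the language model decides in favour of the well-formed word order. A real decoder cannot enumerate all candidate sentences and has to search heuristically.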
A Statistical MT System
[Diagram: the source language model supplies P(S) and the translation model supplies P(T|S). The decoder takes T and searches for the S that maximises P(S) * P(T|S), which is proportional to P(S|T).]

Three Kinds of Model
Statistical Models
Language Models
–Word Based
–Phrase Based
–Syntax Based
Translation Models
–Word Based
–Phrase Based
–Syntax Based

Language Models based on N-Grams of Words
General: P(s1 s2 ... sn) = P(s1) * P(s2|s1) * ... * P(sn|s1 ... s(n-1))
Trigram: P(s1 s2 ... sn) ≈ P(s1) * P(s2|s1) * P(s3|s1,s2) * ... * P(sn|s(n-2),s(n-1))
Bigram: P(s1 s2 ... sn) ≈ P(s1) * P(s2|s1) * ... * P(sn|s(n-1))

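As a rough illustration of the bigram case, the sketch below estimates bigram probabilities by relative frequency from a tiny corpus and multiplies them along a sentence. Several things here are assumptions rather than part of the slides: the start marker `<s>` (used so that P(s1) becomes P(s1|<s>)), the function names, the three-sentence corpus, and the absence of any smoothing, so unseen bigrams get probability 0.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts; '<s>' marks the sentence start."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        contexts.update(words[:-1])              # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams


def bigram_prob(sentence, contexts, bigrams):
    """P(s1 ... sn) under the bigram approximation, unsmoothed."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        if contexts[prev] == 0:                  # context never seen in training
            return 0.0
        p *= bigrams[(prev, cur)] / contexts[prev]
    return p


corpus = ["the cat sleeps", "the dog sleeps", "the horse eats"]
contexts, bigrams = train_bigram(corpus)
print(bigram_prob("the cat sleeps", contexts, bigrams))
print(bigram_prob("the horse sleeps", contexts, bigrams))
```

The second sentence is perfectly grammatical but contains the unseen bigram "horse sleeps", which is why practical language models add smoothing.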
Syntax Based Language Models
Good syntax tree – good English
Allows for long distance constraints
Left sentence preferred by syntax based model

Word-Based Translation Models
Translation process is decomposed into smaller steps
Each is tied to words
Based on IBM Models [Brown et al., 1993]
[from Koehn 2006]

Word TM derived from Bitext
ENGLISH              FRENCH
the cat sleeps       le chat dort
the dog sleeps       le chien dort
the horse eats       le cheval mange

le chat dort / the cat sleeps
Co-occurrence tallies (columns: the, cat, dog, horse, sleeps, eats)
le: III
chat: III
chien:
cheval:
dort: III
mange:

le chien dort / the dog sleeps
Co-occurrence tallies (columns: the, cat, dog, horse, sleeps, eats)
le: IIII
chat: III
chien: III
cheval:
dort: IIII
mange:

le cheval mange / the horse eats
Co-occurrence tallies (columns: the, cat, dog, horse, sleeps, eats)
le: IIIIIIIII
chat: III
chien: III
cheval: III
dort: IIII
mange: III
P(t|s): 1/9, 3/9, 1/9, 2/9

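The tallying can be reproduced with a short script. This is a sketch under one assumption: that the P(t|s) values on the slide are relative co-occurrence counts, i.e. the number of times t appears in a sentence paired with one containing s, divided by all co-occurrences of s. Under that reading the row for "le" has 9 tallies, giving 3/9 for "the", 2/9 for "sleeps" and 1/9 for each of the others. The names `cooc` and `p` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

bitext = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le cheval mange", "the horse eats"),
]

# One tally for every (French word, English word) pair in the same sentence pair.
cooc = Counter()
for french, english in bitext:
    for s in french.split():
        for t in english.split():
            cooc[(s, t)] += 1


def p(t, s):
    """Relative co-occurrence frequency of English word t given French word s."""
    total = sum(count for (s2, _), count in cooc.items() if s2 == s)
    return Fraction(cooc[(s, t)], total)


for t in ["the", "cat", "dog", "horse", "sleeps", "eats"]:
    print(t, p(t, "le"))
```

Raw co-occurrence is only a starting point: it cannot tell whether "chat" corresponds to "the", "cat" or "sleeps", which is what the EM re-estimation is for.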
Parameter Estimation
Based on counting occurrences within monolingual and bilingual data.
For the language model, we need only source language text.
For the translation model, we need pairs of sentences that are translations of each other.
Use the EM (Expectation Maximisation) Algorithm (Baum 1972) to optimize model parameters.

EM Algorithm
Word alignments for the sentence pair ("a b c", "x y z") are formed from arbitrary pairings of words from the two sentences and include (a.x,b.y,c.z), (a.z,b.y,c.x), etc.
There is a large number of possible alignments, since we also allow, e.g., (ab.x,0.y,c.z).

EM Algorithm
1.Make an initial estimate of the parameters. This can be used to compute the probability of any possible word alignment.
2.Re-estimate the parameters, weighting each possible alignment by its probability according to the initial guess.
3.Repeated iterations assign ever greater probability to the set of sentences actually observed.
The algorithm leads to a local maximum of the probability of the observed sentence pairs as a function of the model parameters.

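The loop can be sketched for the simplest case. The code below is an illustration only, assuming a model with word translation probabilities alone (in the spirit of IBM Model 1), with no fertility or distortion parameters and no empty word; it is not the full model of Brown et al. The toy bitext and the names `train_model1` and `best_translation` are hypothetical. Because each English word is assumed to be generated by exactly one French word, the sum over alignments factorises and no explicit enumeration of alignments is needed.

```python
from collections import defaultdict

bitext = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le cheval mange", "the horse eats"),
]


def train_model1(bitext, iterations=10):
    """Estimate t[(e, f)] = P(e | f) by EM, translation probabilities only."""
    pairs = [(f.split(), e.split()) for f, e in bitext]
    e_vocab = {e for _, es in pairs for e in es}
    # Step 1: initial estimate, every English word equally likely for every f.
    t = defaultdict(lambda: 1.0 / len(e_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # Step 2 (E-step): expected counts, weighted by current probabilities.
        for fs, es in pairs:
            for e in es:
                z = sum(t[(e, f)] for f in fs)
                for f in fs:
                    c = t[(e, f)] / z
                    count[(e, f)] += c
                    total[f] += c
        # M-step: renormalise the expected counts into probabilities.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t


def best_translation(t, f):
    """Most probable English word for French word f under t."""
    candidates = {e: prob for (e, f2), prob in t.items() if f2 == f}
    return max(candidates, key=candidates.get)


t = train_model1(bitext, iterations=20)
for f in ["le", "chat", "dort"]:
    print(f, best_translation(t, f))
```

On this toy data a single iteration from the uniform start reproduces the relative co-occurrence counts, and later iterations sharpen them. The pair "cheval mange" / "horse eats" stays ambiguous because nothing in the data separates the two words.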
Parameters for IBM Translation Model
Word Translation Probability, P(t|s): probability that source word s is translated as target word t.
Fertility, P(n|s): probability that source word s is translated by n target words (0 ≤ n ≤ 25).
Distortion, P(i|j,l): probability that the source word at position j is translated by the target word at position i in a target sentence of length l.

Experiment 1 (Brown et al. 1990)
Hansard. 40,000 pairs of sentences = approx. 800,000 words in each language.
Considered the 9,000 most common words in each language.
Assumptions (initial parameter values)
–each of the 9,000 target words equally likely as translations of each of the source words
–each of the fertilities from 0 to 25 equally likely for each of the 9,000 source words
–each target position equally likely given each source position and target length

English: the
French        Probability
le            .610
la            .178
l’            .083
les           .023
ce            .013
il            .012
de            .009
à             .007
que           .007
Fertility     Probability

English: not
French        Probability
pas           .469
ne            .460
non           .024
pas du tout   .003
faux          .003
plus          .002
ce            .002
que           .002
jamais        .002
Fertility     Probability

English: hear
French        Probability
bravo         .992
entendre      .005
entendu       .002
entends       .001
Fertility     Probability

Sentence Translation Probability
Given a translation model for words, we can compute the translation probability of a sentence taking the parameters into account.
P(Jean aime Marie|John loves Mary) =
P(Jean|John) * P(1|John) * P(1|1,3) *
P(aime|loves) * P(1|loves) * P(2|2,3) *
P(Marie|Mary) * P(1|Mary) * P(3|3,3)

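The product can be written out directly. In the sketch below every number is a hypothetical placeholder chosen for the example, not an estimate from the Hansard data, and the table and function names are invented. It assumes the simple case shown on the slide, where each source word has fertility 1 and the alignment is given.

```python
# All probabilities below are hypothetical placeholders.
translation = {("Jean", "John"): 0.9, ("aime", "loves"): 0.8, ("Marie", "Mary"): 0.9}
fertility = {(1, "John"): 0.95, (1, "loves"): 0.9, (1, "Mary"): 0.95}
distortion = {(1, 1, 3): 0.7, (2, 2, 3): 0.7, (3, 3, 3): 0.7}   # keyed by (i, j, l)


def sentence_probability(alignment, target_length):
    """Multiply translation, fertility and distortion terms for a 1:1 alignment.

    alignment: list of (target_word, i, source_word, j) with 1-based positions.
    """
    p = 1.0
    for t_word, i, s_word, j in alignment:
        p *= translation[(t_word, s_word)]        # P(t|s)
        p *= fertility[(1, s_word)]               # P(n=1|s)
        p *= distortion[(i, j, target_length)]    # P(i|j,l)
    return p


alignment = [("Jean", 1, "John", 1), ("aime", 2, "loves", 2), ("Marie", 3, "Mary", 3)]
prob = sentence_probability(alignment, target_length=3)
print(prob)
```

Nine factors are multiplied, three per word, in the same order as in the formula above.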
Flaws in Word-Based Translation
Model handles many:one P(ttt|s) but not one:many P(t|sss) translations, e.g.
–Zeitmangel erschwert das Problem.
–lack of time makes more difficult the problem.
Correct translation: Lack of time makes the problem more difficult.
MT output: Time makes the problem.
[from Koehn 2006]

Flaws in Word-Based Translation (2)
Phrasal Translation: P(ttt|ssss), e.g. erübrigt sich / there is no point in
–Eine Diskussion erübrigt sich demnach.
–a discussion is made unnecessary itself therefore.
Correct translation: Therefore, there is no point in a discussion.
MT output: A debate turned therefore.
[from Koehn 2006]

Flaws in Word-Based Translation (3)
Syntactic transformations
Example: object/subject reordering
–Den Vorschlag lehnt die Kommission ab
–the proposal rejects the commission off
Correct translation: The commission rejects the proposal.
MT output: The proposal rejects the commission.
[from Koehn 2006]

Phrase Based Translation Models
Foreign input is segmented into phrases.
Phrases are any sequence of words, not necessarily linguistically motivated.
Each phrase is translated into English.
Phrases are reordered.
[from Koehn 2006]

Syntax Based Translation Models

Word Based Decoding: searching for the best translation (Brown 1990)
Maintain list of hypotheses.
Initial hypothesis: (Jean aime Marie | *)
Search proceeds iteratively. At each iteration we extend the most promising hypotheses with additional words:
Jean aime Marie | John(1) *
Jean aime Marie | * loves(2) *
Jean aime Marie | * Mary(3) *
Jean aime Marie | Jean(1) *
Parenthesised numbers indicate the corresponding position in the target sentence.

Phrase-Based Decoding
Build translation left to right
–select foreign word(s) to be translated
–find English phrase translation
–add English phrase to end of partial translation
[Koehn 2006]

Decoding Process
one to many translation
[Koehn 2006]

Decoding Process
many to one translation
[Koehn 2006]

Decoding Process
translation finished
[Koehn 2006]

Hypothesis Expansion
Start with empty hypothesis
–e: no English words
–f: no foreign words covered
–p: probability 1
[Koehn 2006]

Hypothesis Expansion

Hypothesis Expansion
further hypothesis expansion
[Koehn 2006]

Decoding Process
adding more hypotheses leads to explosion of search space.
[Koehn 2006]

Hypothesis Recombination
Sometimes different choices of hypothesis lead to the same translation result.
Such paths can be combined.
[Koehn 2006]

Hypothesis Recombination
Drop weaker path
Keep pointer from weaker path
[Koehn 2006]

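A minimal sketch of recombination follows. It assumes that two hypotheses count as "the same translation result" when they cover the same foreign words and have produced the same English words, and it simply drops the weaker path without keeping a pointer to it. The `Hypothesis` type and the `recombine` function are hypothetical names, not part of any decoder's API.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Hypothesis(NamedTuple):
    covered: FrozenSet[int]      # indices of foreign words translated so far
    english: Tuple[str, ...]     # English words produced so far
    prob: float                  # probability of this path


def recombine(hypotheses):
    """Keep only the most probable hypothesis for each (coverage, output) state."""
    best = {}
    for h in hypotheses:
        key = (h.covered, h.english)
        if key not in best or h.prob > best[key].prob:
            best[key] = h        # the weaker path is dropped
    return list(best.values())


paths = [
    Hypothesis(frozenset({0, 1}), ("this", "is"), 0.2),   # two one-word steps
    Hypothesis(frozenset({0, 1}), ("this", "is"), 0.3),   # one two-word phrase
    Hypothesis(frozenset({0}), ("this",), 0.5),
]
survivors = recombine(paths)
print(survivors)
```

Keeping a back-pointer from the dropped path, as the slide suggests, would matter if alternative translations had to be recovered later. This sketch omits it.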
Pruning
Hypothesis recombination is not sufficient
–Heuristically discard weak hypotheses early
Organize hypotheses in stacks, e.g. by
–same foreign words covered
–same number of foreign words covered (Pharaoh does this)
–same number of English words produced
Compare hypotheses in stacks, discard bad ones
–histogram pruning: keep top n hypotheses in each stack (e.g., n=100)
–threshold pruning: keep hypotheses that are at most α times the cost of best hypothesis in stack (e.g., α = 0.001)

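The two pruning rules can be sketched as follows. The sketch assumes each stack entry is a (label, probability) pair with higher probability meaning better, and it reads the threshold rule as "keep a hypothesis if its probability is at least a factor alpha of the best one in the stack". The function names and the representation are hypothetical, and this is not Pharaoh's implementation.

```python
def histogram_prune(stack, n=100):
    """Keep the n most probable hypotheses in the stack."""
    return sorted(stack, key=lambda item: item[1], reverse=True)[:n]


def threshold_prune(stack, alpha=0.001):
    """Keep hypotheses whose probability is at least alpha times the best one."""
    if not stack:
        return []
    best = max(prob for _, prob in stack)
    return [(label, prob) for label, prob in stack if prob >= alpha * best]


stack = [("a", 0.5), ("b", 0.1), ("c", 0.0001)]
print(histogram_prune(stack, n=2))
print(threshold_prune(stack, alpha=0.001))
```

Histogram pruning bounds the stack size regardless of score, while threshold pruning adapts to how close the competitors are. The two are often applied together.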
Hypothesis Stacks
Organization of hypotheses into stacks
–here: based on number of foreign words translated
–during translation all hypotheses from one stack are expanded
–expanded hypotheses are placed into stacks
–one to many translation
[Koehn 2006]

Comparing Hypotheses covering Same Number of Foreign Words
Hypothesis that covers easy part of sentence is preferred
Need to consider future cost of uncovered parts
Should take account of one to many translation
[Koehn 2006]

Future Cost Estimation
Use future cost estimates when pruning hypotheses
For each uncovered contiguous span:
–look up future costs for each maximal contiguous uncovered span
–add to actually accumulated cost for translation option for pruning
[Koehn 2006]

Pharaoh
A beam search decoder for phrase-based models
–works with various phrase-based models
–beam search algorithm
–time complexity roughly linear with input length
–good quality takes about 1 second per sentence
–Very good performance in DARPA/NIST Evaluation
–Freely available for researchers
Coming soon: open source version of Pharaoh

Pharaoh Demo
    % echo 'das ist ein kleines haus' | pharaoh -f pharaoh.ini > out
    Pharaoh v1.2.9, written by Philipp Koehn
    a beam search decoder for phrase-based statistical machine translation models
    (c) University of Southern California
    (c) 2004 Massachusetts Institute of Technology
    (c) 2005 University of Edinburgh, Scotland
    loading language model from europarl.srilm
    loading phrase translation table from phrase-table, stored 21, pruned 0, kept 21
    loaded data structures in 2 seconds
    reading input sentences
    translating 1 sentences.
    translated 1 sentences in 0 seconds
    % cat out
    this is a small house

Brown Experiment 2
Perform translation using the 1,000 most frequent words in the English corpus.
1,700 most frequently used French words in translations of sentences completely covered by the 1,000 word English vocabulary.
117,000 pairs of sentences completely covered by both vocabularies.
Parameters of the English language model estimated from 570,000 sentences in the English part.

Experiment 2 contd
73 French sentences tested from elsewhere in the corpus.
Results were classified as
–Exact – same as actual translation
–Alternate – same meaning
–Different – legitimate translation but different meaning
–Wrong – could not be interpreted as a translation
–Ungrammatical – grammatically deficient
Corrections to the last three categories were made and keystrokes were counted.

Results
Category        # sentences   percent
Exact           4             5
Alternate       18            25
Different       13            18
Wrong           11            15
Ungrammatical   27            37
Total           73

Results - Discussion
According to Brown et al., the system performed successfully 48% of the time (first three categories).
776 keystrokes were needed to repair the output, compared with 1,916 keystrokes to generate all 73 translations from scratch.
According to the authors, the system therefore reduces work by about 60%.

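Both percentages can be checked with a few lines of arithmetic. The sketch assumes the category counts from the results table (4 exact, 18 alternate and 13 different out of 73) and the two keystroke totals quoted above; the variable names are only illustrative.

```python
# Success rate: sentences in the first three categories out of all 73.
successful = 4 + 18 + 13
success_rate = successful / 73

# Work saved: keystrokes avoided relative to typing everything from scratch.
repair_keystrokes = 776
scratch_keystrokes = 1916
reduction = 1 - repair_keystrokes / scratch_keystrokes

print(f"success rate: {success_rate:.1%}")
print(f"reduction in keystrokes: {reduction:.1%}")
```

The success rate rounds to 48%, and the keystroke saving comes out just under 60%, which matches the figure the authors report after rounding.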
Issues
Automatic evaluation methods
–can computers decide what are good translations?
Phrase-based models
–what are atomic units of translation?
–how are they discovered?
–the best method in statistical machine translation
Discriminative training
–what are the methods that directly optimize translation performance?

The Speculative (Koehn 2006)
Syntax-based transfer models
–how can we build models that take advantage of syntax?
–how can we ensure that the output is grammatical?
Factored translation models
–how can we integrate different levels of abstraction?

Bibliography
Statistical MT
Brown et al., A Statistical Approach to MT, Computational Linguistics 16.2, 1990, pp79-85 (search “ACL Anthology”)
Koehn tutorial (see