Direct Translation Approaches: Statistical Machine Translation
Stephan Vogel, Alicia Tribble
Interactive Systems Lab, Carnegie Mellon University & University of Karlsruhe
Speech-to-Speech Translation Workshop, ESSLLI 2002, Trento, Italy
Overview
  Translation Approaches
  Statistical Machine Translation
  Translating with Cascaded Transducers
  Experiments on Nespole Data
16 July 2002, Speech-to-Speech Translation Workshop, ESSLLI, Trento, Italy
Translation Approaches
  Interlingua based
  Transfer based
  Direct
  Example based
  Statistical
Statistical Machine Translation
  Based on Bayes' decision rule:
  $$\hat{e} = \arg\max_{e} \, p(e \mid f) = \arg\max_{e} \, p(e)\, p(f \mid e)$$
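The decision rule can be illustrated with a minimal Python sketch. The candidate list, the toy probabilities, and the name `best_translation` are assumptions made for illustration; a real decoder searches a far larger hypothesis space.

```python
import math

def best_translation(candidates, lm_logprob, tm_logprob, source):
    """Return the candidate e that maximizes log p(e) + log p(f | e)."""
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(source, e))

# Toy model scores, invented for this example only.
LM = {"the house": math.log(0.6), "house the": math.log(0.1)}
TM = {
    ("das haus", "the house"): math.log(0.5),
    ("das haus", "house the"): math.log(0.5),
}

choice = best_translation(
    ["the house", "house the"],
    lambda e: LM[e],            # language model: log p(e)
    lambda f, e: TM[(f, e)],    # translation model: log p(f | e)
    "das haus",
)
print(choice)
```

Both candidates get the same translation model score here, so the language model alone decides the word order.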
Tasks in SMT
  Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
  Training: train the translation model on a bilingual corpus, train the language model on a monolingual corpus
  Decoding: find the best translation for new sentences according to the models
Alignment Example
Translation Models
  IBM1: lexical probabilities only
  IBM2: lexicon plus absolute position
  HMM: lexicon plus relative position
  IBM3: plus fertilities
  IBM4: inverted relative position alignment
  IBM5: non-deficient version of Model 4
  [Brown et al. 93; Vogel et al. 96]
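As an illustration of the simplest model in this list, the following is a minimal sketch of EM training for IBM Model 1 lexical probabilities p(f | e). It is not the authors' implementation: the toy corpus and the name `train_ibm1` are assumptions, and the empty (NULL) word is omitted for brevity.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate p(f | e) with EM. pairs: list of (source_words, target_words)."""
    src_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)           # uniform initialisation
    for _ in range(iterations):
        count = defaultdict(float)             # expected count of (f, e)
        total = defaultdict(float)             # expected count of e
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normaliser for this source word
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalise the expected counts per target word.
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t

# Toy parallel corpus (invented).
pairs = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_ibm1(pairs)
print(round(t[("das", "the")], 3), round(t[("haus", "the")], 3))
```

Because "das" co-occurs with "the" in two sentence pairs and "haus" in only one, EM shifts probability mass towards the pair (das, the) over the iterations.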
HMM Alignment Model
  $$
  \begin{aligned}
  p(f \mid e) &= \sum_{a_1^J} p(f_1^J, a_1^J \mid e_1^I) \\
  &= \sum_{a_1^J} \prod_{j=1}^{J} p(f_j, a_j \mid f_1^{j-1}, a_1^{j-1}, e_1^I) \\
  &= \sum_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1})\, p(f_j \mid e_{a_j}) \\
  &\approx \max_{a_1^J} \prod_{j=1}^{J} p(a_j \mid a_{j-1})\, p(f_j \mid e_{a_j})
  \end{aligned}
  $$
  The alignment a_j of the current word f_j depends on the alignment a_{j-1} of the previous word f_{j-1}.
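The maximum approximation in the last line is a Viterbi search over alignments. The following sketch shows that search under stated assumptions: a uniform start distribution, a toy lexicon with a small floor probability, and a toy jump distribution that prefers a step of +1. None of these values come from the slides.

```python
import math

def viterbi_alignment(src, tgt, lex, jump):
    """Return a with a[j] = index of the target word aligned to src[j].

    lex(f, e) = p(f | e); jump(i_prev, i) = p(a_j = i | a_{j-1} = i_prev).
    """
    I = len(tgt)
    # Assumption: uniform distribution over the first alignment position.
    score = [math.log(1.0 / I) + math.log(lex(src[0], e)) for e in tgt]
    backpointers = []
    for f in src[1:]:
        new_score, ptr = [], []
        for i, e in enumerate(tgt):
            prev = max(range(I), key=lambda k: score[k] + math.log(jump(k, i)))
            new_score.append(score[prev] + math.log(jump(prev, i)) + math.log(lex(f, e)))
            ptr.append(prev)
        score = new_score
        backpointers.append(ptr)
    # Trace back from the best final state.
    i = max(range(I), key=lambda k: score[k])
    path = [i]
    for ptr in reversed(backpointers):
        i = ptr[i]
        path.append(i)
    return path[::-1]

src = "das haus ist klein".split()
tgt = "the house is small".split()
LEX = {("das", "the"): 0.9, ("haus", "house"): 0.9,
       ("ist", "is"): 0.9, ("klein", "small"): 0.9}

def lex(f, e):
    return LEX.get((f, e), 0.01)   # toy floor for unseen word pairs

def jump(prev_i, i, I=len(tgt)):
    weights = [2.0 if k - prev_i == 1 else 1.0 for k in range(I)]
    return weights[i] / sum(weights)

alignment = viterbi_alignment(src, tgt, lex, jump)
print(alignment)
```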
Phrase Translation
  Why?
    To capture context
    Local word reordering
  How?
    Train alignment model
    Extract phrase-to-phrase translations from Viterbi path
  Notes:
    Often better results when training target to source for extraction of phrase translations
    Phrases are not fully integrated into the alignment model; they are extracted only after training is completed
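One way to read "extract phrase-to-phrase translations from the Viterbi path" is sketched below. The extraction criterion is an assumption, since the slides do not spell it out: a source span is paired with the smallest target span covering its aligned words, and the pair is kept only if no source word outside the span aligns into that target span.

```python
def extract_phrases(src, tgt, alignment, max_len=3):
    """alignment[j] = index of the target word aligned to src[j] (Viterbi path)."""
    pairs = []
    J = len(src)
    for j1 in range(J):
        for j2 in range(j1, min(j1 + max_len, J)):
            lo = min(alignment[j1:j2 + 1])
            hi = max(alignment[j1:j2 + 1])
            # Reject the span if a source word outside it aligns into [lo, hi].
            outside = alignment[:j1] + alignment[j2 + 1:]
            if any(lo <= i <= hi for i in outside):
                continue
            pairs.append((" ".join(src[j1:j2 + 1]), " ".join(tgt[lo:hi + 1])))
    return pairs

# Toy example with local reordering (invented).
src = ["heute", "fahre", "ich"]
tgt = ["I", "drive", "today"]
phrase_pairs = extract_phrases(src, tgt, [2, 1, 0])
print(phrase_pairs)
```

The pair ("fahre ich", "I drive") shows how a phrase entry stores a local reordering that a word-by-word lexicon cannot express.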
Translation with Transducers
  Finite state machine
    Read sequence of words, write sequence of words
    Output vocabulary can be different from input vocabulary
  Transducer used in current implementation:
    Tree transducer, i.e. prefix tree over input strings
    Output from final states
    Used to encode lexicon, phrase translations, bilingual word classes and grammars
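A prefix tree over input strings with output attached to final states could look like the following sketch. The class name `TreeTransducer`, its methods, and the toy entries are hypothetical and only illustrate the data structure, not the actual implementation.

```python
class TreeTransducer:
    """Prefix tree over source word sequences; translations sit on final states."""

    def __init__(self):
        self.root = {"next": {}, "out": []}

    def add(self, source_words, target_words, prob=1.0):
        node = self.root
        for w in source_words:
            node = node["next"].setdefault(w, {"next": {}, "out": []})
        node["out"].append((tuple(target_words), prob))

    def matches(self, words, start):
        """Yield (end, target_words, prob) for every entry matching words[start:end]."""
        node = self.root
        for end in range(start, len(words)):
            node = node["next"].get(words[end])
            if node is None:
                return
            for tgt, p in node["out"]:
                yield end + 1, tgt, p

# Toy entries (invented): single words and one phrase share the same tree.
lexicon = TreeTransducer()
lexicon.add(["guten"], ["good"], 0.5)
lexicon.add(["guten", "morgen"], ["good", "morning"], 0.9)
lexicon.add(["morgen"], ["tomorrow"], 0.6)

found = list(lexicon.matches("guten morgen".split(), 0))
print(found)
```

A single pass from one start position returns both the word entry and the longer phrase entry, which is what makes the prefix tree convenient for building a translation lattice.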
Cascaded Transducers
  Generalization through cascaded transducers:
  Replace words by category labels and have a transducer for each category
  [Vogel, Ney 2000]
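The idea of replacing words by category labels can be sketched with two toy stages. The category `@DAY`, the dictionary contents, and the function name are invented for illustration; the real system cascades tree transducers rather than plain dictionaries.

```python
# Stage 1: category transducer, maps words of one category to translations.
DAY = {"montag": "monday", "dienstag": "tuesday"}

# Stage 2: top-level transducer, operates on words and category labels.
TOP = {("am", "@DAY"): ("on", "@DAY")}

def translate_cascade(words):
    """Translate via category labels; return None if no top-level rule matches."""
    labels, fillers = [], []
    for w in words:
        if w in DAY:
            labels.append("@DAY")      # replace the word by its category label
            fillers.append(DAY[w])     # remember the category-level translation
        else:
            labels.append(w)
    out = TOP.get(tuple(labels))
    if out is None:
        return None
    fill = iter(fillers)
    # Re-insert the category translations where the label appears in the output.
    return [next(fill) if tok == "@DAY" else tok for tok in out]

print(translate_cascade(["am", "dienstag"]))
```

One top-level entry now covers every day name in the category, which is the generalization the slide refers to.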
Language Model
  Standard n-gram model:
  $$
  \begin{aligned}
  p(w_1 \dots w_n) &= \prod_{i=1}^{n} p(w_i \mid w_1 \dots w_{i-1}) \\
  &\approx \prod_{i=1}^{n} p(w_i \mid w_{i-2}\, w_{i-1}) && \text{(trigram)} \\
  &\approx \prod_{i=1}^{n} p(w_i \mid w_{i-1}) && \text{(bigram)}
  \end{aligned}
  $$
  Many events are not seen in training, so smoothing is required
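A bigram model with smoothing can be sketched as follows. Add-one smoothing is used here only because it is the shortest to write down; the slides do not say which smoothing method was used, and the toy sentences are invented.

```python
from collections import Counter

def train_bigram(sentences):
    """Collect history counts, bigram counts, and the prediction vocabulary."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s + ["</s>"]
        histories.update(toks[:-1])             # every token that has a successor
        bigrams.update(zip(toks, toks[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return histories, bigrams, vocab

def bigram_prob(w_prev, w, histories, bigrams, vocab):
    """Add-one smoothed estimate of p(w | w_prev)."""
    return (bigrams[(w_prev, w)] + 1) / (histories[w_prev] + len(vocab))

sentences = [["i", "want", "a", "room"], ["i", "want", "a", "flight"]]
histories, bigrams, vocab = train_bigram(sentences)

seen = bigram_prob("a", "room", histories, bigrams, vocab)
unseen = bigram_prob("a", "i", histories, bigrams, vocab)
print(seen, unseen)
```

The unseen bigram ("a", "i") still receives a non-zero probability, so a hypothesis containing it is penalized rather than ruled out.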
Decoding Strategies
  Sequential construction of target sentence
    Extend partial translation by words which are translations of words in the source sentence
    Language model can be applied immediately
    Mechanism to ensure proper coverage of source sentence required
  Left to right over source sentence
    Find translations for sequences of words
    Construct translation lattice
    Apply language model and select best path
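The second strategy (left to right over the source sentence, lattice construction, language model, best path) can be sketched as a small dynamic program. The edge format, the bigram language model interface, and all scores below are assumptions for illustration only; edges are assumed to span at least one source word.

```python
import math

def decode(n_src, edges, lm_logprob):
    """Best path through a translation lattice over source positions 0..n_src.

    edges: list of (start, end, target_words, tm_logprob) with end > start.
    lm_logprob(prev_word, word): bigram log-probability.
    Returns (score, words) or None if no path covers the source sentence.
    """
    best = {(0, "<s>"): (0.0, [])}     # state = (source position, last target word)
    for pos in range(n_src):
        states = [(k, v) for k, v in best.items() if k[0] == pos]
        for (_, last), (score, words) in states:
            for start, end, tgt, tm in edges:
                if start != pos:
                    continue
                s, prev = score + tm, last
                for w in tgt:               # apply the language model word by word
                    s += lm_logprob(prev, w)
                    prev = w
                key = (end, prev)
                if key not in best or s > best[key][0]:
                    best[key] = (s, words + list(tgt))
    finals = [v for k, v in best.items() if k[0] == n_src]
    return max(finals, key=lambda v: v[0]) if finals else None

# Toy lattice for the source sentence "guten morgen" (scores invented).
edges = [
    (0, 1, ("good",), math.log(0.5)),
    (1, 2, ("tomorrow",), math.log(0.6)),
    (1, 2, ("morning",), math.log(0.4)),
    (0, 2, ("good", "morning"), math.log(0.3)),
]
LM = {("good", "morning"): 0.5, ("good", "tomorrow"): 0.01}

def lm_logprob(prev, w):
    return math.log(LM.get((prev, w), 0.1))

result = decode(2, edges, lm_logprob)
print(result[1])
```

Although "tomorrow" has the higher lexicon probability for "morgen", the language model prefers "good morning", so that path wins.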
Translation Graph
Speech Recognition and Translation
  Search for the best string in the target language given the acoustic signal x in the source language:
  $$
  \begin{aligned}
  \hat{e} &= \arg\max_{e} \, p(e)\, p(x \mid e) \\
  &= \arg\max_{e} \, p(e) \sum_{f} p(f, x \mid e) \\
  &= \arg\max_{e} \, p(e) \sum_{f} p(f \mid e)\, p(x \mid f, e) \\
  &= \arg\max_{e} \, p(e) \sum_{f} p(f \mid e)\, p(x \mid f)
  \end{aligned}
  $$
  The last step assumes that x depends on e only through f.
  I.e. the recognizer language model p(f) is not needed!? [Ney, 2001]
Coupling Recognition and Translation
  Sequential: first recognition, then translation
    First-best recognition hypothesis
    N-best list: translate n times
    Word lattice: translate all paths in the lattice, reuse results from partial paths
  Integrated: recognition and translation in combined search
    Subsequential transducer approach uses this
  Note: in the EuTrans project, the best results were obtained when translating the first-best hypothesis
Example-Based Machine Translation
  Re-use translations to create new translations:
    Store bilingual corpus with (partial) alignment
    Find partial matches, i.e. sequences of words in the stored corpus that cover a new sentence
    Extract translation(s) and build translation lattice
    Apply language model to find best path, i.e. best translation
Nespole Experiments
  Application of direct translation techniques to dialogue data collected in Nespole!
  Testing the effect of phrase translation
  Experiments with additional knowledge sources
    Preexisting: monolingual data for the LM and publicly available lexica
    Engineered: handwritten rules for fixed expressions and knowledge extracted from semantic grammars
Nespole Project Data
  CMU database of dialogues in the travel domain
  German, English (Italian, French)
  Speech recognizer hypotheses and human transcriptions both available
  Segmented into SDUs (Speech Dialogue Units)
  Notes:
    Transition slide only; show it briefly.
    This part describes a set of experiments in speech translation, specifically on Nespole!, using a direct approach, i.e. the statistical approach with cascaded transducers described by Stephan earlier.
Nespole Corpus: Training
  3182 parallel SDUs

  Language    English  German
  Tokens        15572   14992
  Vocabulary     1032    1338
  Singletons      404     620

  Notes:
    Nearly half of the German vocabulary (620 of 1338 types) and about 40% of the English vocabulary (404 of 1032 types) are singletons. This indicates that we are far from the edge of our Zipfian type-token curve, i.e. far from saturation, and we can expect a lot of unknown words in the test data.
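The three statistics reported on this slide (tokens, vocabulary, singletons) can be computed as in the following sketch. The function name and the toy input are assumptions; the two ratios at the end use the figures given on the slide.

```python
from collections import Counter

def corpus_stats(sentences):
    """Token count, vocabulary size, and number of types seen exactly once."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {
        "tokens": sum(counts.values()),
        "vocabulary": len(counts),
        "singletons": sum(1 for c in counts.values() if c == 1),
    }

# Toy input (invented): "a" occurs twice, "b" and "c" once each.
stats = corpus_stats([["a", "b", "a"], ["c"]])
print(stats)

# Singleton share of the vocabulary, from the figures on the slide.
german_share = 620 / 1338
english_share = 404 / 1032
print(round(german_share, 2), round(english_share, 2))
```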
Nespole Corpus: Testing
  70 parallel SDUs

              German        Reference A  Reference B
  Tokens      437           610          607
  Vocabulary  183 (45 OOV)  165          160

  Notes:
    The German statistics are for the transcribed data; the speech-recognizer data is very similar in these statistics.
Corpus Challenges: Sentence Length
  [Figure panels: Training Data; Testing Data]
  Notes:
    These are averages for the two reference translators; their average lengths agreed very closely.
    However, we did not tune for this length difference. This is something of a disadvantage to our system, since the training data gave us no indication that we should.
Evaluation
  Human scoring
    Good, Okay, Bad (cf. Nespole evaluation)
    Collapsed into a "human score" on [0,1]
  Bleu score
    Average of n-gram precisions for n = 1..N, typically N = 3 or 4
    Penalty for short translations to substitute for a recall measure
    [Papineni et al. 2001]
  Notes:
    Automatic evaluation is good because it is cheap and reproducible; human evaluation is good only if it is consistent.
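The two ingredients of the Bleu score named above (n-gram precisions and a penalty for short output) can be sketched as follows. This is a simplified sentence-level version following the general definition in Papineni et al., with a geometric mean of clipped precisions and a brevity penalty; the official metric is computed over a whole test corpus, and the evaluation script actually used for these experiments is not shown in the slides.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, refs, max_n=3):
    """Sentence-level Bleu: geometric mean of clipped precisions times brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        max_ref = Counter()
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)   # clip against the best reference
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n
    # Brevity penalty against the reference closest in length to the hypothesis.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(log_prec)

ref_a = "i want a room".split()
ref_b = "i would like a room".split()
print(bleu("i want a room".split(), [ref_a, ref_b]))
print(bleu("i want".split(), [ref_a], max_n=1))
```

The second call has perfect unigram precision, so its score is determined by the brevity penalty alone.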
Phrase Translation
  Unequal sentence lengths mean that training can be improved directionally: S -> T or T -> S
  German compounds are better for 1-to-many alignments with English multiword phrases, so direction is important

  Statistical lexicon alone                           0.1903
  Statistical lexicon, phrases from S -> T training   0.2350
  Statistical lexicon, phrases from bidir. training   0.2654

  Notes:
    As Stephan pointed out earlier, training in different directions can affect the phrase translation quality.
    The alignment model is restricted to 1-to-many but not many-to-1 alignments in training, so we need to use these morphological differences wisely. (Q: what about translating from English to German? Then we can also use the German-to-English model, and we do not have to go to the additional trouble of flipping the transducer.)
    Unequal sentence lengths are a product, or at least an indicator, of a mismatch between the amount of morphology in the two languages.
Language Model
  Monolingual text available from Verbmobil
  500,000 words (32x the size of the original English corpus)
  Helps to choose among translation hypotheses but will not generate new ones

  Stat. lexicon, phrases, fixed expression rules, gen. lexicon, and small LM   0.2613
  Stat. lexicon, phrases, fixed expression rules, gen. lexicon, and large LM   0.3172
General-Purpose Lexicon
  Statistical lexicon, phrases, and fixed expressions with small LM          0.2654
  Adding general-purpose lexicon as a transducer                             0.2522
  Using large instead of small LM                                            0.3141
  General-purpose lexicon as training data instead of separate transducer    0.3275

  Lexicon size is 160,000 words, ~15x the original English training data.
  Notes:
    Why do results fall with additional knowledge? They do not always; even adding the lexicon as a transducer gave small improvements in some configurations.
    Adding the lexicon to the training data allows the common words in the non-statistical lexicon to be given probabilities commensurate with their frequency in the corpus. Adding it as a separate transducer throws these weights off for the alignment model, and translation quality falls as a result.
    The lexicon gives better results in combination with the large LM than with the small one, and better when combined with the training data than when not, in both cases because of the probabilities as described above.
Fixed Expression Rules
  Transducer rules are human readable and can be added by hand
  Fixed expressions for times and dates are re-usable, require less time to build than domain-specific rules, and improve coverage of some semi-idiomatic constructions

  Statistical lexicon with small LM                                    0.1893
  Statistical lexicon and fixed-expression transducer with small LM    0.1903

  Notes:
    The format of a transducer rule is similar to a rule in a probabilistic context-free grammar: "@LABEL # LHS # RHS # prob" corresponds to LHS -> RHS with probability prob.
    There is a tradeoff of human effort vs. reusability. These rules are not like interlingua grammar rules because they are language-pair specific, but they are domain-generic (probably, by design).
    The improvement in vocabulary coverage is not great, but the rules help with word reordering.
    They can be weighted over the phrase or statistical lexicon transducers so that these expressions show up in the final translation output.
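Reading a rule in the "@LABEL # LHS # RHS # prob" format could look like the sketch below. Only the four-field layout comes from the slide; the example rule, the label `@TIME`, and the whitespace tokenization of both sides are assumptions.

```python
def parse_rule(line):
    """Split '@LABEL # LHS # RHS # prob' into its four fields."""
    label, lhs, rhs, prob = (field.strip() for field in line.split("#"))
    return {
        "label": label,
        "lhs": lhs.split(),     # source-side word sequence
        "rhs": rhs.split(),     # target-side word sequence
        "prob": float(prob),
    }

# Hypothetical fixed-expression rule for a German time expression.
rule = parse_rule("@TIME # halb drei # half past two # 0.9")
print(rule)
```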
Knowledge from Existing Grammars
  Could help in domain portability but not language portability
  Benefit mostly in additional vocabulary

  Statistical lexicon, fixed expressions, phrases, and general lexicon with large LM                 0.3141
  Statistical lexicon, fixed expressions, phrases, general lexicon and I-transducer with large LM    0.3172

  *** experimental and requires more testing/development ***
  Notes:
    Portability: domain portability as opposed to language portability, because this has to be specific to a given language pair, like all parts of the statistical system. The direct systems have to be retrained for new languages anyway.
    Vocabulary expansion: we treat the grammars as another domain-specific language source, but they are not exactly parallel. The best we can do is take a guess at what means the same thing and add entries to our lexicon this way. (Q: why didn't we try adding these entries to the training data as we did with the additional lexicon? We already argued that this works better. We were in the mindset that we could eventually get a deeper structure out of the grammars, so we were reserving a transducer for that.)
    This allows us to take advantage of generalization on the part of the grammar writers: have they introduced new vocabulary (synonyms, etc.) based on the forms in the training data? Not a lot. We checked the additional vocabulary coverage we get on our test data by adding the grammar vocabulary, and it was only 8 types. But this is out of 45 unknowns, so nearly a fifth of the unknowns are recovered for this test set.
Comparative Evaluation Results

                Good  Okay  Bad  Score  Bleu
  Text    IF      77   104  227   0.32  0.068
          SMT    127    80  205   0.40  0.333
  Speech  IF      64   101  243   0.28  0.059
          SMT     95    83        0.34  0.262

  Notes:
    Automatic evaluation is good for measuring the effect of these small changes, but we need human evaluation to determine how good the translation actually is, and as a reality check on comparative results.
    Human evaluation here shows that the Bleu scores were too unequal, but the direct approach did receive higher scores in both text and speech. Inter-coder agreement was very good.
    What about the fact that we included grammar knowledge and then compared to grammars alone? One could consider the final SMT system a sort of hybrid, but the scores for adding grammar knowledge show that this did not improve the SMT system that much. Also, we are not including the grammars themselves but extracting translation pairs from them.
    In the future we would like to see whether we could extract better structure from grammars. This is applicable when a new system is being built and old ("legacy") grammars are around without the infrastructure to support them. The work that went into building them would then still be usable by simply extracting useful information and adding it to a direct system.
Selected References
  Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer. The Mathematics of Statistical Machine Translation: Parameter Estimation. Computational Linguistics 19(2), pp. 263-311, 1993.
  Stephan Vogel, Hermann Ney, Christoph Tillmann. HMM-Based Word Alignment in Statistical Translation. Int. Conf. on Computational Linguistics, Copenhagen, Denmark, pp. 836-841, August 1996.
  Stephan Vogel, Hermann Ney. Translation with Cascaded Finite State Transducers. 38th Annual Meeting of the Association for Computational Linguistics, pp. 23-30, Hong Kong, China, October 2000.
  Stephan Vogel, Alicia Tribble. Improving Statistical Machine Translation for a Speech-to-Speech Translation Task. To appear in ICSLP 2002.
  H. Ney. The Statistical Approach to Spoken Language Translation. Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Trento, Italy, 8 pages, CD-ROM, IEEE Catalog No. 01EX544, December 2001.
  Kishore Papineni, Salim Roukos, Todd Ward, Wei-Jing Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176 (W0109-022), September 17, 2001.