Machine Translation Diana Trandabăţ Academic Year
Course overview Approaches to MT Language Model Translation model Statistical modeling and IBM Models EM algorithm Word alignment Phrase-based translation Syntax-based translation Reordering Decoding Evaluation
Prerequisites WILL TO LEARN!!!
Minimum expectations LEARN something Adequately use the machine translation terminology Create language models Develop and/or implement translation models Better presentation skills DO something Assignments Project TEACH me something
Evaluation Laboratory – 100 points – Attendance (10%) – Homework (90%) Project – 100 points Exam – 100 points – Midterm – Final exam
Homework ~ Weekly In-class delivery – 50% of the points for delivery in class; 50% for submitted homework Late delivery for submissions – 100% of the points for delivery on time, 80% of the points for delivery 1 day late, 60% of the points for delivery 2 days late, … Naming convention: MT_HomeworkNO_StudentName_ProgrammingLanguage Each implementation task is submitted with short documentation (max. 1 page) covering implementation details, challenges, methods/solutions, errors, problems, etc.
Projects We’ll get to that later…
What I expect you to know after today What is machine translation What is statistical machine translation Problems of machine translation We are not alone in the universe!?
How do humans translate?
Spend years learning a new language – memorizing words – learning syntactic patterns – exercising – … Use dictionaries and detailed world knowledge to: – Identify meaning – Find proper words to use in new language – Produce a syntactically correct text – Preserve meaning ….
What is machine translation? Translation performed using a machine/computer
How do machines translate? Flowers are lovely!
How do machines translate? Using available resources: Electronic bilingual dictionary; Templates, transfer rules; Thesaurus, WordNet, FrameNet, …; Parallel data, comparable data. Using available NLP tools: tokenizer, morphological analyzer, syntactic parser, … More resources exist for major languages, fewer for “minor” languages.
Statistical machine translation
very large data set of good translations automatically infer a statistical model of translation apply the translation model to new texts to guess a reasonable translation
Noisy channel
Language Model P(e): takes care of fluency in the target language. Data: corpora in the target language. Translation Model P(f|e): faithful lexical correspondence between languages. Data: aligned corpora in source and target languages. argmax: search done by the decoder.
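The three components on this slide fit together through Bayes' rule. As a short derivation, writing f for the source (foreign) sentence and e for the target sentence (the usual convention, assumed here because the slide does not define the symbols):

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        = \arg\max_{e} P(e)\, P(f \mid e)
```

P(f) can be dropped in the last step because it is the same for every candidate e once the input sentence is fixed.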
Accurate vs. Fluent Often impossible to have a true translation; one that is both: – Faithful to the source language, and – Fluent in the target language Japanese: “fukaku hansei shite orimasu” Fluent translation: “we apologize” Faithful translation: “we are deeply reflecting (on our past behaviour, and what we did wrong, and how to avoid the problem next time)” Need to compromise between faithfulness & fluency
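The compromise can be made concrete with a toy scorer. The sketch below is only an illustration: the probabilities are invented for the example and not estimated from any corpus, and the weight `lm_weight` is an added assumption (equal weights rank candidates the same way as the plain product P(e) · P(f|e)). It ranks the two candidate translations by a weighted sum of log P(e) (fluency) and log P(f|e) (faithfulness).

```python
import math

# Hypothetical probabilities, made up for illustration only.
# lm = P(e): how fluent the English candidate is.
# tm = P(f|e): how well the candidate accounts for the Japanese source.
CANDIDATES = {
    "we apologize": {"lm": 1e-2, "tm": 1e-3},
    "we are deeply reflecting": {"lm": 1e-4, "tm": 5e-2},
}


def score(probs, lm_weight=0.5):
    """Weighted log-score; lm_weight=0.5 ranks like P(e) * P(f|e)."""
    return (lm_weight * math.log(probs["lm"])
            + (1 - lm_weight) * math.log(probs["tm"]))


def decode(candidates, lm_weight=0.5):
    """Return the candidate with the highest score (the argmax step)."""
    return max(candidates, key=lambda e: score(candidates[e], lm_weight))


print(decode(CANDIDATES))                 # equal weights
print(decode(CANDIDATES, lm_weight=0.2))  # faithfulness weighted more heavily
```

With these invented numbers, equal weights select the fluent candidate, while shifting weight toward the translation model selects the faithful one.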
Question What is your input on clients which sell pharmaceuticals in Europe?
Group activity
Centauri / Arcturan
Ok-voon ororok sprok. / At-voon bichat dat.
Ok-drubel ok-voon anok plok sprok. / At-drubel at-voon pippat rrat dat.
Erok sprok izok hihok ghirok. / Totat dat arrat vat hilat.
Ok-voon anok drok brok jok. / At-voon krat pippat sat lat.
Wiwok farok izok stok. / Totat jjat quat cat.
Lalok sprok izok jok stok. / Wat dat krat quat cat.
Lalok farok ororok lalok sprok izok enemok. / Wat jjat bichat wat dat vat eneat.
Lalok brok anok plok nok. / Iat lat pippat rrat nnat.
Wiwok nok izok kantok ok-yurp. / Totat nnat quat oloat at-yurp.
Lalok mok nok yorok ghirok clok. / Wat nnat gat mat bat hilat.
Lalok nok crrrok hihok yorok zanzanok. / Wat nnat arrat mat zanzanat.
Lalok rarok nok izok hihok mok. / Wat nnat forat arrat vat gat.
What we’ve learned Direct (word-by-word) translation Reordering Different word alignments: 1:1, 0:1, 1:0, etc. Translation model
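One way to automate what the group activity does by hand is IBM Model 1 trained with EM, both of which appear in the course overview. The following is a minimal sketch under simplifying assumptions, not a reference implementation: there is no NULL word, the sentences are lowercased and split on whitespace, and the names `train_ibm1` and `best_translation` are made up for the example. It estimates t(a|c), the probability of an Arcturan word a given a Centauri word c, from the twelve sentence pairs of the activity.

```python
from collections import defaultdict

# The twelve Centauri / Arcturan sentence pairs, lowercased, punctuation removed.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def train_ibm1(pairs, iterations=10):
    """Estimate t[(a, c)] = P(a | c) with EM, starting from a uniform table."""
    pairs = [(c.split(), a.split()) for c, a in pairs]
    arcturan_vocab = {w for _, a_words in pairs for w in a_words}
    uniform = 1.0 / len(arcturan_vocab)
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (a, c) links
        total = defaultdict(float)  # expected count of links leaving c
        # E-step: spread each Arcturan word over the Centauri words
        # of the same sentence pair, in proportion to the current t.
        for c_words, a_words in pairs:
            for a in a_words:
                norm = sum(t[(a, c)] for c in c_words)
                for c in c_words:
                    frac = t[(a, c)] / norm
                    count[(a, c)] += frac
                    total[c] += frac
        # M-step: renormalize the expected counts per Centauri word.
        t = defaultdict(float, {(a, c): count[(a, c)] / total[c] for (a, c) in count})
    return t


def best_translation(t, c_word):
    """Most probable Arcturan word for one Centauri word."""
    candidates = {a: p for (a, c), p in t.items() if c == c_word}
    return max(candidates, key=candidates.get)


table = train_ibm1(CORPUS)
for word in ("ok-voon", "sprok", "farok"):
    print(word, "->", best_translation(table, word))
```

The table only covers 1:1 lexical correspondences; reordering and the 0:1 and 1:0 alignment cases listed above are outside what this sketch models.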
Question What is your input on clients which sell pharmaceuticals in Europe?
“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ” Warren Weaver (1947)
See you next time!