
1 Machine Translation Dr. Nizar Habash Center for Computational Learning Systems Columbia University COMS 4705: Natural Language Processing Fall 2010

2 Why (Machine) Translation? Languages in the world: 6,800 living languages, 600 with a written tradition, 95% of the world population speaks 100 languages. Translation market: $26 billion global market (2010), doubling every five years (Donald Barabé, invited talk, MT Summit 2003)

4 Machine Translation Science Fiction Star Trek Universal Translator an "extremely sophisticated computer program" which functions by "analyzing the patterns" of an unknown foreign language, starting from a speech sample of two or more speakers in conversation. The more extensive the conversational sample, the more accurate and reliable is the "translation matrix"….

5 Machine Translation Reality http://www.medialocate.com/

6 Machine Translation Reality

7 Currently, Google offers translations between the following languages → over 3,000 pairs: Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yiddish

8 “BBC found similar support”!!!

9 Why Machine Translation? Full Translation – domain specific, e.g., weather reports; Machine-aided Translation – requires post-editing; Cross-lingual NLP applications – cross-language IR, cross-language summarization; Testing grounds – extrinsic evaluation of NLP tools, e.g., parsers, POS taggers, tokenizers, etc.

10 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

11 Multilingual Challenges Orthographic Variations – ambiguous spelling: كتب الاولاد اشعارا vs. كَتَبَ الأوْلادُ اشعَاراً (the boys wrote poems) – ambiguous word boundaries. Lexical Ambiguity – Bank → بنك (financial) vs. ضفة (river); Eat → essen (human) vs. fressen (animal)

12 Multilingual Challenges Morphological Variations – affixational (prefix/suffix) vs. templatic (root+pattern): write → written, كتب → مكتوب; kill → killed, قتل → مقتول; do → done, فعل → مفعول. Tokenization (aka segmentation + normalization): And the cars والسيارات → w+ Al+ SyArAt (conj + article + plural noun); Et les voitures → et le voitures
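For the clitic-splitting step above, a toy sketch in Python (purely illustrative: the proclitic list and the length check are assumptions, and real Arabic tokenizers rely on morphological analysis rather than greedy prefix stripping):

    # Toy clitic segmenter for Buckwalter-transliterated Arabic.
    PROCLITICS = ["w", "f", "b", "l", "Al"]  # and, then, by/with, to/for, the

    def segment(token):
        """Greedily strip known proclitics from the front of a token."""
        parts = []
        stripped = True
        while stripped:
            stripped = False
            for clitic in PROCLITICS:
                rest = token[len(clitic):]
                if token.startswith(clitic) and len(rest) > 2:
                    parts.append(clitic + "+")
                    token = rest
                    stripped = True
                    break
        return parts + [token]

    print(segment("wAlSyArAt"))  # ['w+', 'Al+', 'SyArAt']  ("and the cars")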

13 Multilingual Challenges Syntactic Variations
Arabic: يقرأ الطالب المجتهد كتابا عن الصين في الصف (read the-student the-diligent a-book about china in the-classroom)
English: the diligent student is reading a book about china in the classroom
Chinese: 这位勤奋的学生在教室读一本关于中国的书 (this quant diligent de student in classroom read one quant about china de book)
Word order (Arabic / English / Chinese):
Subj-Verb: V Subj / Subj V / Subj … V
Verb-PP: V … PP / V PP / PP V
Adjectives: N Adj / Adj N / Adj de N
Possessives: N Poss / N of Poss, Poss 's N / Poss de N
Relatives: N Rel / N Rel / Rel de N

14 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

15 MT Approaches MT Pyramid [pyramid diagram: Source word → Source syntax → Source meaning on the analysis side; Target meaning → Target syntax → Target word on the generation side; the direct word-to-word path is Gisting]

16 MT Approaches Gisting Example Source (Spanish): Sobre la base de dichas experiencias se estableció en 1988 una metodología. Word-for-word gist: Envelope her basis out speak experiences them settle at 1988 one methodology. Fluent translation: On the basis of these experiences, a methodology was arrived at in 1988.
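A minimal sketch of what such word-for-word gisting amounts to (the mini-dictionary simply reproduces the glosses on this slide, modulo capitalization; a real gisting system would use a full bilingual dictionary):

    GLOSSARY = {
        "sobre": "envelope", "la": "her", "base": "basis", "de": "out",
        "dichas": "speak", "experiencias": "experiences", "se": "them",
        "estableció": "settle", "en": "at", "1988": "1988",
        "una": "one", "metodología": "methodology",
    }

    def gist(sentence):
        # One gloss per word, no reordering, no context.
        return " ".join(GLOSSARY.get(w.lower().strip("."), w) for w in sentence.split())

    print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
    # envelope her basis out speak experiences them settle at 1988 one methodology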

17 MT Approaches MT Pyramid [pyramid diagram: Source word/syntax/meaning, Target meaning/syntax/word, Analysis/Generation; paths marked: Gisting, Transfer]

18 MT Approaches Transfer Example Transfer Lexicon – map SL structure to TL structure: poner X mantequilla en Y (:subj X, :obj mantequilla, :mod en Y) ↔ butter X Y (:subj X, :obj Y); e.g., X puso mantequilla en Y → X buttered Y
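A rough sketch of how such a transfer rule can be represented and applied (the dictionary-based rule format and field names here are illustrative assumptions, not an actual transfer-lexicon format):

    # One structural transfer rule: Spanish "poner mantequilla en Y" -> English "butter Y".
    RULE = {
        "src": {"head": "poner", "obj": "mantequilla", "prep": "en"},
        "tgt_head": "butter",
    }

    def transfer(tree):
        """Map a source dependency tree to a target tree; the PP object becomes the direct object."""
        s = RULE["src"]
        if tree["head"] == s["head"] and tree.get("obj") == s["obj"] and tree.get("prep") == s["prep"]:
            return {"head": RULE["tgt_head"], "subj": tree["subj"], "obj": tree["pobj"]}
        return tree  # no rule matched: fall through to default word-by-word transfer

    src = {"head": "poner", "subj": "X", "obj": "mantequilla", "prep": "en", "pobj": "Y"}
    print(transfer(src))  # {'head': 'butter', 'subj': 'X', 'obj': 'Y'}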

19 MT Approaches MT Pyramid [pyramid diagram: Source word/syntax/meaning, Target meaning/syntax/word, Analysis/Generation; paths marked: Gisting, Transfer, Interlingua]

20 MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

21 MT Approaches MT Pyramid [pyramid diagram: Source word/syntax/meaning, Target meaning/syntax/word, Analysis/Generation; paths marked: Interlingua, Gisting, Transfer]

22 MT Approaches MT Pyramid [pyramid diagram: Analysis/Generation over Source word/syntax/meaning and Target meaning/syntax/word; resources by level: Interlingual Lexicons (meaning), Transfer Lexicons (syntax), Dictionaries/Parallel Corpora (word)]

23 MT Approaches Statistical vs. Rule-based [pyramid diagram: Source word/syntax/meaning, Target meaning/syntax/word, Analysis/Generation]

24 Statistical MT Noisy Channel Model Portions from http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
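The noisy-channel equation behind this slide appears only in the linked figure, so for reference (standard SMT formulation): decoding searches for the English sentence e that maximizes P(e | f), which Bayes' rule factors into a translation model estimated from parallel text and a language model estimated from monolingual text:

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(f \mid e)\, P(e)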

25 Statistical MT Automatic Word Alignment GIZA++ –A statistical machine translation toolkit used to train word alignments. –Uses Expectation-Maximization with various constraints to bootstrap alignments Slide based on Kevin Knight’s http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt Mary did not slap the green witch Maria no dio una bofetada a la bruja verde

26 Statistical MT IBM Model (Word-based Model) http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
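As a concrete illustration of the EM bootstrapping mentioned on the GIZA++ slide, here is a minimal IBM Model 1 sketch in Python (toy three-pair corpus, no NULL word, fertility, or distortion, so it is far simpler than the model cascade GIZA++ actually trains):

    from collections import defaultdict

    corpus = [
        ("la casa".split(), "the house".split()),
        ("la bruja".split(), "the witch".split()),
        ("una bruja".split(), "a witch".split()),
    ]

    t = defaultdict(lambda: 1.0)            # t(e|f), effectively uniform at start

    for _ in range(20):                     # EM iterations
        count = defaultdict(float)          # expected counts c(e, f)
        total = defaultdict(float)          # expected counts c(f)
        for f_sent, e_sent in corpus:
            for e in e_sent:                # E-step: fractional alignment counts
                z = sum(t[(e, f)] for f in f_sent)
                for f in f_sent:
                    count[(e, f)] += t[(e, f)] / z
                    total[f] += t[(e, f)] / z
        for (e, f), c in count.items():     # M-step: renormalize t(e|f)
            t[(e, f)] = c / total[f]

    print(round(t[("witch", "bruja")], 2))  # approaches 1.0
    print(round(t[("the", "la")], 2))       # approaches 1.0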

27 Phrase-Based Statistical MT Foreign input segmented into phrases – a "phrase" is any sequence of words. Each phrase is probabilistically translated into English – P(to the conference | zur Konferenz) – P(into the meeting | zur Konferenz). Phrases are probabilistically re-ordered. See [Koehn et al, 2003] for an intro. This is state-of-the-art! Example: Morgen fliege ich nach Kanada zur Konferenz → Tomorrow I will fly to the conference in Canada. Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt
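For reference, the model sketched on this slide decomposes roughly as in Koehn et al. (2003): segment the foreign sentence into phrases, translate each with a phrase-table probability φ, penalize reordering with a distortion term d, and score fluency with a language model (notation follows the cited paper, not the slide itself):

    e^{*} = \arg\max_{e} \; p_{LM}(e) \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\; d(\mathrm{start}_i - \mathrm{end}_{i-1} - 1)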

28 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

29 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

30 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

31 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … Word Alignment Induced Phrases Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

32 Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch) Word Alignment Induced Phrases Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt
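To make the extraction illustrated on slides 28–32 concrete, here is a rough sketch of consistent phrase-pair extraction from a word alignment (the alignment points below are my reading of the slide's picture, and the sketch omits the usual extension over unaligned words, so its output is a subset of the list above rather than an exact reproduction):

    # Extract all phrase pairs consistent with a word alignment (simplified).
    def extract_phrases(f_words, e_words, alignment, max_len=7):
        pairs = set()
        for f_start in range(len(f_words)):
            for f_end in range(f_start, min(f_start + max_len, len(f_words))):
                # English positions aligned to anything inside [f_start, f_end]
                e_points = [e for (f, e) in alignment if f_start <= f <= f_end]
                if not e_points:
                    continue
                e_start, e_end = min(e_points), max(e_points)
                # Consistency: nothing inside the English span may align outside the foreign span
                if any(e_start <= e <= e_end and not (f_start <= f <= f_end)
                       for (f, e) in alignment):
                    continue
                pairs.add((" ".join(f_words[f_start:f_end + 1]),
                           " ".join(e_words[e_start:e_end + 1])))
        return pairs

    f = "Maria no dió una bofetada a la bruja verde".split()
    e = "Mary did not slap the green witch".split()
    a = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3), (6, 4), (7, 6), (8, 5)]
    for pair in sorted(extract_phrases(f, e, a)):
        print(pair)   # e.g. ('Maria', 'Mary'), ('no', 'did not'), ('bruja verde', 'green witch'), ...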

33 Advantages of Phrase-Based SMT Many-to-many mappings can handle non-compositional phrases. Local context is very useful for disambiguating – "Interest rate" → …, "Interest in" → …. The more data, the longer the learned phrases – sometimes whole sentences. Slide courtesy of Kevin Knight http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/mt-lecture.ppt

34 MT Approaches Statistical vs. Rule-based vs. Hybrid [pyramid diagram: Source word/syntax/meaning, Target meaning/syntax/word, Analysis/Generation]

35 MT Approaches Practical Considerations Resource availability – Parsers and generators (input/output compatibility) – Translation lexicons (word-based vs. transfer/interlingua) – Parallel corpora (domain of interest; bigger is better). Time availability – statistical training, resource building

36 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

37 More art than science. Wide range of metrics/techniques – interface, …, scalability, …, faithfulness, …, space/time complexity, etc. Automatic vs. human-based – dumb machines vs. slow humans

38 Human-based Evaluation Example Accuracy Criteria

39 Human-based Evaluation Example Fluency Criteria

40 Fluency vs. Accuracy [chart plotting MT use cases along fluency and accuracy axes: FAHQ MT, Prof. MT, Info. MT, conMT]

41 Automatic Evaluation Example Bleu Metric (Papineni et al 2001) Bleu –BiLingual Evaluation Understudy –Modified n-gram precision with length penalty –Quick, inexpensive and language independent –Correlates highly with human evaluation –Bias against synonyms and inflectional variations

42 Test Sentence colorless green ideas sleep furiously Gold Standard References all dull jade ideas sleep irately drab emerald concepts sleep furiously colorless immature thoughts nap angrily Automatic Evaluation Example Bleu Metric

43 Test Sentence colorless green ideas sleep furiously Gold Standard References all dull jade ideas sleep irately drab emerald concepts sleep furiously colorless immature thoughts nap angrily Unigram precision = 4/5 Automatic Evaluation Example Bleu Metric

44 Test Sentence colorless green ideas sleep furiously Gold Standard References all dull jade ideas sleep irately drab emerald concepts sleep furiously colorless immature thoughts nap angrily Unigram precision = 4 / 5 = 0.8 Bigram precision = 2 / 4 = 0.5 Bleu Score = (a1 × a2 × … × an)^(1/n) = (0.8 × 0.5)^(1/2) = 0.6325 → 63.25 Automatic Evaluation Example Bleu Metric
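A minimal BLEU-style sketch that reproduces the numbers in this example (whitespace tokenization assumed; real BLEU from Papineni et al. also applies a brevity penalty and typically uses n-grams up to length 4):

    import math
    from collections import Counter

    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def modified_precision(candidate, references, n):
        cand = ngrams(candidate, n)
        # Clip each candidate n-gram count by its maximum count in any single reference
        max_ref = Counter()
        for ref in references:
            for ng, c in ngrams(ref, n).items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in cand.items())
        return clipped / max(1, sum(cand.values()))

    cand = "colorless green ideas sleep furiously".split()
    refs = ["all dull jade ideas sleep irately".split(),
            "drab emerald concepts sleep furiously".split(),
            "colorless immature thoughts nap angrily".split()]

    p1 = modified_precision(cand, refs, 1)   # 4/5 = 0.8
    p2 = modified_precision(cand, refs, 2)   # 2/4 = 0.5
    print(math.sqrt(p1 * p2))                # 0.6325 -> reported as 63.25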

45 Metrics MATR Workshop Workshop in AMTA conference 2008 –Association for Machine Translation in the Americas Evaluating evaluation metrics Compared 39 metrics –7 baselines and 32 new metrics –Various measures of correlation with human judgment –Different conditions: text genre, source language, number of references, etc.

46 Interested in MT?? Contact me (habash@cs.columbia.edu) Research courses, projects Languages of interest: –English, Arabic, Hebrew, Chinese, Urdu, Spanish, Russian, …. Topics –Statistical, Hybrid MT Phrase-based MT with linguistic extensions Component improvements or full-system improvements –MT Evaluation –Multilingual computing

47 Thank You

