
Machine Translation Dr. Nizar Habash Research Scientist Center for Computational Learning Systems Columbia University COMS E6998: Topics in Computer Science.


1 Machine Translation Dr. Nizar Habash Research Scientist Center for Computational Learning Systems Columbia University COMS E6998: Topics in Computer Science Spring 2013

2 Session #1 Introductions Syllabus Explanation Lecture –Why Machine Translation –Multilingual Challenges for MT –MT Approaches –MT Evaluation

3 Why (Machine) Translation? Languages in the world 6,800 living languages 600 with written tradition 100 languages are spoken by 95% of world population Translation Market $26 Billion Global Market (2010) Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)

4 Multilingualism Tower of Babel Genesis 11:1-9 1 And the whole earth was of one language, and of one speech.... 9 Therefore is the name of it called Babel; because the Lord did there confound the language of all the earth: and from thence did the Lord scatter them abroad upon the face of all the earth. Foremost symbol of multilingualism as a problem

5 Multilingualism Language Families

6 Multilingualism Rosetta Stone Ancient Egyptian stele (196 BCE) Key to modern understanding of Egyptian hieroglyphs Trilingual document: –ancient Egyptian hieroglyphs –Egyptian demotic script –ancient Greek Common symbol of parallel corpora and translation solutions

7 Modern Rosetta Stones?

8 Multilingual Challenges nai you duo shi means buttered toast naiyou means butter duoshi means toast duo means many shi can mean private (as in the army rank)

9 Shatt Al-Arab Fresh Fish

10 Why (Machine) Translation? Languages in the world 6,800 living languages 600 with written tradition 100 languages are spoken by 95% of world population Translation Market $26 Billion Global Market (2010) Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)

11 Machine Translation Science Fiction Star Trek Universal Translator an "extremely sophisticated computer program" which functions by "analyzing the patterns" of an unknown foreign language, starting from a speech sample of two or more speakers in conversation. The more extensive the conversational sample, the more accurate and reliable is the "translation matrix"….

12 Machine Translation Science Fiction Futurama Universal Translator Dr. Farnsworth: “This is my Universal Translator, although it only translates into an incomprehensible dead language” Cubert: “Hello!” Machine: “Bonjour!” Dr. Farnsworth: “Incomprehensible gibberish”

13 Machine Translation Science Fiction The Babel Fish "The Hitch Hiker's Guide to the Galaxy" (Douglas Adams): the Babel fish "is small, yellow and leech-like,... if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language…"

14 Machine Translation Reality http://www.medialocate.com/

15 Machine Translation Reality

16 Currently, Google offers translations between the following languages (over 3,000 pairs): Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yiddish

17 “BBC found similar support”!!!

18 Why Machine Translation? Full Translation –Domain specific, e.g., weather reports Machine-aided Translation –Requires post-editing Cross-lingual NLP applications –Cross-language IR –Cross-language Summarization Testing grounds –Extrinsic evaluation of NLP tools, e.g., parsers, POS taggers, tokenizers, etc.

19 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

20 Multilingual Challenges Orthographic Variations –Ambiguous spelling: كتب الاولاد اشعارا vs. كَتَبَ الأوْلادُ اشعَاراً –Ambiguous word boundaries Lexical Ambiguity –Bank → بنك (financial) vs. ضفة (river) –Eat → essen (human) vs. fressen (animal)

21 Multilingual Challenges Morphological Variations –Affixational (prefix/suffix) vs. Templatic (Root+Pattern): write → written, كتب → مكتوب; kill → killed, قتل → مقتول; do → done, فعل → مفعول –Tokenization (aka segmentation+normalization): And the cars → and the cars; والسيارات → w Al SyArAt (parts labelled conj, article, noun, plural); Et les voitures → et le voitures
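The tokenization example on this slide (والسيارات → w Al SyArAt) can be illustrated with a toy clitic segmenter. This is only a sketch under loud assumptions: it works on a romanized string written the way the slide writes it, the prefix inventory (`CONJUNCTIONS`, `ARTICLE`) and the `min_stem` length check are made up for illustration, and it does no morphological disambiguation, so it will wrongly split any word whose stem happens to begin with one of these letters.

```python
# Toy clitic segmenter over romanized Arabic (slide-style transliteration).
# The prefix inventory and the minimum stem length are illustrative
# assumptions, not a real Arabic tokenization scheme.
CONJUNCTIONS = ("w", "f")  # "and", "so"
ARTICLE = "Al"             # definite article


def segment(word: str, min_stem: int = 3) -> list[str]:
    """Split off at most one conjunction and the article, in that order."""
    tokens = []
    for conj in CONJUNCTIONS:
        # Only strip the conjunction if a plausible stem remains.
        if word.startswith(conj) and len(word) - len(conj) >= min_stem:
            tokens.append(conj)
            word = word[len(conj):]
            break
    if word.startswith(ARTICLE) and len(word) - len(ARTICLE) >= min_stem:
        tokens.append(ARTICLE)
        word = word[len(ARTICLE):]
    tokens.append(word)
    return tokens


print(" ".join(segment("wAlSyArAt")))  # w Al SyArAt
```

The length check is a crude stand-in for the real problem: a leading w is sometimes a conjunction and sometimes part of the stem, which is why practical tokenizers rely on morphological analysis rather than string matching.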

22 Morphology Arabic: very rich morphology: number, gender, case, person, aspect, voice, several clitics, etc. –Arabic tokenization English: simple morphology Chinese: no morphology – quantifiers & verbal aspects يقرأ الطالب المجتهد كتابا عن الصين في الصف read the-student the-diligent a-book about china in the-classroom the diligent student is reading a book about china in the classroom 这位勤奋的学生在教室读一本关于中国的书 this quant diligent de student in classroom read one quant about china de book

23 Syntax
Construction | Arabic | English | Chinese
Subj-Verb | V Subj | Subj V | Subj … V
Verb-PP | V … PP | V PP | PP V
Adjectives | N Adj | Adj N | Adj de N
Possessives | N Poss | N of Poss, Poss ’s N | Poss de N
Relatives | N Rel | N Rel | Rel de N
Arabic: يقرأ الطالب المجتهد كتابا عن الصين في الصف (read the-student the-diligent a-book about china in the-classroom)
English: the diligent student is reading a book about china in the classroom
Chinese: 这位勤奋的学生在教室读一本关于中国的书 (this quant diligent de student in classroom read one quant about china de book)

24 Syntax
Arabic: يقرأ الطالب المجتهد كتابا عن الصين في الصف (read the-student the-diligent a-book about china in the-classroom)
English: the diligent student is reading a book about china in the classroom
Chinese: 这位勤奋的学生在教室读一本关于中国的书 (this quant diligent de student in classroom read one quant about china de book)
Construction | Arabic | English | Chinese
Subj-Verb | V Subj | Subj V | Subj … V
Verb-PP | V … PP | V PP | PP V
Adjectives | N Adj | Adj N | Adj de N
Possessives | N Poss | N of Poss, Poss ’s N | Poss de N
Relatives | N Rel | N Rel | Rel de N

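25 Translation Divergences: conflation. Arabic: لست هنا (I-am-not here). English: I am not here. French: Je ne suis pas ici (I not am not here). [Figure: dependency trees headed by لست (over هنا), am (over I, not, here) and suis (over Je, ne, pas, ici).]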

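26 Translation Divergences: categorial, thematic and structural. Arabic: انا بردان (I cold). English: I am cold. Hebrew: קר לי (cold for-me). Spanish: tengo frio (I-have cold). [Figure: dependency trees for the four sentences, e.g. be over I and cold; tener over Yo and frio.]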

27 Translation Divergences: head swap and categorial. English: I swam across the river quickly. Arabic: اسرعت عبور النهر سباحة (I-sped crossing the-river swimming). [Figure: dependency trees with nodes swim, I, quickly, across, river (English) and اسرع, انا, سباحة, عبور, نهر (Arabic).]

28 Translation Divergences: head swap and categorial. English: I swam across the river quickly. Hebrew: חציתי את הנהר בשחיה במהירות (I-crossed obj river in-swim speedily). [Figure: dependency trees with nodes swim, I, quickly, across, river (English) and חצה, אני, את, נהר, ב, שחיה, ב, מהירות (Hebrew).]

29 Translation Divergences: head swap and categorial. [Figure: the Hebrew (headed by חצה), Arabic (headed by اسرع) and English (headed by swim) dependency trees side by side, with nodes labelled by part of speech: noun, prep, verb, adverb.]

30 Translation Divergences Orthography+Morphology+Syntax. Chinese: 妈妈的车 (mama de che). English: mom’s car. Arabic: سيارة ماما (sayyArat mama). French: la voiture de maman. [Figure: tree with nodes car, mom, possessed-by.]

31 Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

32 MT Strategies (1954-2004) Slide courtesy of Laurie Gerber [Figure: chart with two axes. Knowledge Acquisition Strategy, from all manual to fully automated: hand-built by experts, hand-built by non-experts, learn from annotated data, learn from un-annotated data. Knowledge Representation Strategy, from shallow/simple to deep/complex: word-based only, phrase tables, syntactic constituent structure, semantic analysis, interlingua. Plotted items: electronic dictionaries, original direct approach, typical transfer system, classic interlingual system, example-based MT, original statistical MT. Callout: "New Research Goes Here!"]

33 MT Approaches MT Pyramid [Figure: Analysis side: source word, source syntax, source meaning. Generation side: target meaning, target syntax, target word. Highlighted: Gisting.]

34 MT Approaches Gisting Example. Source: Sobre la base de dichas experiencias se estableció en 1988 una metodología. Gist: Envelope her basis out speak experiences them settle at 1988 one methodology. Reference translation: On the basis of these experiences, a methodology was arrived at in 1988.
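One way to read the gisting example is as word-for-word dictionary lookup with no disambiguation and no reordering. The sketch below reproduces the slide's gloss under that assumption. `LEXICON` is a hypothetical one-sense-per-word dictionary whose entries are copied from the slide's output; it is not a real bilingual lexicon, and `gist` is an illustrative name.

```python
import re

# Hypothetical toy dictionary: one sense per source word, chosen to mirror
# the slide's gloss (for example "sobre" as the noun "envelope" rather than
# the preposition "on").
LEXICON = {
    "sobre": "envelope", "la": "her", "base": "basis", "de": "out",
    "dichas": "speak", "experiencias": "experiences", "se": "them",
    "estableció": "settle", "en": "at", "una": "one",
    "metodología": "methodology",
}


def gist(sentence: str) -> str:
    """Replace each word by its single dictionary gloss, keeping word order."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)  # words and punctuation
    out = []
    for tok in tokens:
        gloss = LEXICON.get(tok.lower(), tok)  # unknown tokens pass through
        if out and not re.match(r"\w", gloss):
            out[-1] += gloss  # attach punctuation to the previous word
        else:
            out.append(gloss)
    return " ".join(out)


print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
```

Because each word is looked up in isolation, the wrong senses and the Spanish word order survive into the output, which is what makes the result a gist rather than a translation.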

35 MT Approaches MT Pyramid [Figure: Analysis side: source word, source syntax, source meaning. Generation side: target meaning, target syntax, target word. Highlighted: Gisting, Transfer.]

36 MT Approaches Transfer Example Transfer Lexicon –Map SL structure to TL structure: poner (:subj X, :obj mantequilla, :mod en (:obj Y)) → butter (:subj X, :obj Y) –X puso mantequilla en Y → X buttered Y
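The transfer-lexicon entry on this slide maps a Spanish structure headed by poner to an English structure headed by butter. A minimal sketch of one such rule follows. The `Node` class, the relation names stored in `deps`, and the function name are assumptions made for illustration; real transfer systems use richer tree representations and many rules, and X and Y would themselves need translating.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dependency-tree node: a lemma plus labelled dependents (hypothetical)."""
    lemma: str
    deps: dict[str, "Node"] = field(default_factory=dict)


def transfer_poner_mantequilla(src: Node) -> Node:
    """poner(:subj X, :obj mantequilla, :mod en(:obj Y)) -> butter(:subj X, :obj Y)."""
    obj = src.deps.get("obj")
    mod = src.deps.get("mod")
    matches = (
        src.lemma == "poner"
        and "subj" in src.deps
        and obj is not None and obj.lemma == "mantequilla"
        and mod is not None and mod.lemma == "en" and "obj" in mod.deps
    )
    if not matches:
        return src  # rule does not apply; leave the structure unchanged
    # The object "mantequilla" is absorbed into the target verb "butter",
    # and the object of the preposition becomes the direct object.
    return Node("butter", {"subj": src.deps["subj"], "obj": mod.deps["obj"]})


source = Node("poner", {
    "subj": Node("X"),
    "obj": Node("mantequilla"),
    "mod": Node("en", {"obj": Node("Y")}),
})
target = transfer_poner_mantequilla(source)
print(target.lemma, sorted(target.deps))
```

Writing the rule over labelled dependents rather than word positions is what lets it apply regardless of surface word order, which is the point of moving up from the word level to the syntax level.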

37 MT Approaches MT Pyramid [Figure: Analysis side: source word, source syntax, source meaning. Generation side: target meaning, target syntax, target word. Highlighted: Gisting, Transfer, Interlingua.]

38 MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

39 MT Approaches MT Pyramid [Figure: Analysis side: source word, source syntax, source meaning. Generation side: target meaning, target syntax, target word. Highlighted: Interlingua, Gisting, Transfer.]

40 MT Approaches MT Pyramid [Figure: Analysis side: source word, source syntax, source meaning. Generation side: target meaning, target syntax, target word. Resources shown: Interlingual Lexicons, Dictionaries/Parallel Corpora, Transfer Lexicons.]

41 MT Approaches MT Pyramid

42 [Figure: MT pyramid. Analysis side: source word, source syntax, source meaning. Generation side: target meaning, target syntax, target word. Resources shown: Interlingual Lexicons, Dictionaries/Parallel Corpora, Transfer Lexicons.]

43 MT Approaches Statistical vs. Rule-based [Figure: MT pyramid. Analysis side: source word, source syntax, source meaning. Generation side: target meaning, target syntax, target word.]

44 To be continued …

