Machine Translation Dr. Nizar Habash Research Scientist Center for Computational Learning Systems Columbia University COMS E6998: Topics in Computer Science.

Presentation transcript:

Machine Translation Dr. Nizar Habash Research Scientist Center for Computational Learning Systems Columbia University COMS E6998: Topics in Computer Science Spring 2013

Session #1
Introductions
Syllabus Explanation
Lecture
– Why Machine Translation
– Multilingual Challenges for MT
– MT Approaches
– MT Evaluation

Why (Machine) Translation?
Languages in the world
– 6,800 living languages
– 600 with written tradition
– 100 languages are spoken by 95% of the world population
Translation Market
– $26 Billion Global Market (2010)
– Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)

Multilingualism: Tower of Babel
Genesis 11:1-9
"1 And the whole earth was of one language, and of one speech. … 9 Therefore is the name of it called Babel; because the Lord did there confound the language of all the earth: and from thence did the Lord scatter them abroad upon the face of all the earth."
Foremost symbol of multilingualism as a problem

Multilingualism Language Families

Multilingualism: Rosetta Stone
Ancient Egyptian stele (196 BCE)
Key to modern understanding of Egyptian hieroglyphs
Trilingual document:
– ancient Egyptian hieroglyphs
– Egyptian demotic script
– ancient Greek
Common symbol of parallel corpora and translation solutions

Modern Rosetta Stones?

Multilingual Challenges
nai you duo shi means buttered toast
– naiyou means butter
– duoshi means toast
– duo means many
– shi can mean private (as in the army rank)

Shatt Al-Arab Fresh Fish

Machine Translation: Science Fiction
Star Trek Universal Translator
An "extremely sophisticated computer program" which functions by "analyzing the patterns" of an unknown foreign language, starting from a speech sample of two or more speakers in conversation. The more extensive the conversational sample, the more accurate and reliable is the "translation matrix"…

Machine Translation: Science Fiction
Futurama Universal Translator
Dr. Farnsworth: "This is my Universal Translator, although it only translates into an incomprehensible dead language"
Cubert: "Hello!"
Machine: "Bonjour!"
Dr. Farnsworth: "Incomprehensible gibberish"

Machine Translation: Science Fiction
The Babel Fish, "The Hitch Hiker's Guide to the Galaxy" (Douglas Adams)
"is small, yellow and leech-like, ... if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language…"

Machine Translation Reality

Currently, Google offers translations between the following languages (over 3,000 pairs):
Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh, Yiddish
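
The "over 3,000 pairs" figure can be checked with a little arithmetic. This is a minimal sketch, assuming that a "pair" means an ordered (source, target) direction between two different languages from the list above; the names LANGUAGES and directed_pairs are illustrative only.

```python
# Language list as it appears on the slide ("Haitian Creole" is one entry).
LANGUAGES = [name.strip() for name in """
Afrikaans, Albanian, Arabic, Armenian, Azerbaijani, Basque, Belarusian,
Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English,
Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek,
Haitian Creole, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Irish,
Italian, Japanese, Korean, Latvian, Lithuanian, Macedonian, Malay, Maltese,
Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian,
Spanish, Swahili, Swedish, Thai, Turkish, Ukrainian, Urdu, Vietnamese, Welsh,
Yiddish
""".replace("\n", " ").split(",")]


def directed_pairs(n: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n * (n - 1)


if __name__ == "__main__":
    n = len(LANGUAGES)
    print(n, "languages ->", directed_pairs(n), "translation directions")
```

With the 56 languages listed, this gives 56 × 55 = 3,080 directions, which is consistent with "over 3,000 pairs" under the stated assumption.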

“BBC found similar support”!!!

Why Machine Translation?
Full Translation
– Domain specific, e.g., weather reports
Machine-aided Translation
– Requires post-editing
Cross-lingual NLP applications
– Cross-language IR
– Cross-language Summarization
Testing grounds
– Extrinsic evaluation of NLP tools, e.g., parsers, POS taggers, tokenizers, etc.

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

Multilingual Challenges
Orthographic Variations
– Ambiguous spelling: كتب الاولاد اشعارا vs. كَتَبَ الأوْلادُ اشعَاراً
– Ambiguous word boundaries
Lexical Ambiguity
– Bank → بنك (financial) vs. ضفة (river)
– Eat → essen (human) vs. fressen (animal)

Multilingual Challenges
Morphological Variations
– Affixational (prefix/suffix) vs. Templatic (Root+Pattern)
– write → written, كتب → مكتوب
– kill → killed, قتل → مقتول
– do → done, فعل → مفعول
Tokenization (aka segmentation + normalization)
– And the cars → and the cars
– والسيارات → w Al SyArAt
– Et les voitures → et le voitures
Morpheme labels in the figure: conj, article, noun, plural
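
The tokenization examples above can be sketched in a few lines. This is a toy illustration only, assuming the transliterated form "wAlSyArAt" for والسيارات and a hypothetical two-entry proclitic list; real Arabic tokenizers need morphological disambiguation, because not every word-initial "w" or "Al" is a clitic. The function names segment and normalize_french are made up for this sketch.

```python
def segment(word, proclitics=("w", "Al")):
    """Greedily split known proclitics off the front of a transliterated word.

    Proclitics are tried once each, in the given order (conjunction, then
    article). A clitic is only split off if at least two characters remain.
    """
    tokens = []
    rest = word
    for clitic in proclitics:
        if rest.startswith(clitic) and len(rest) - len(clitic) >= 2:
            tokens.append(clitic)
            rest = rest[len(clitic):]
    tokens.append(rest)
    return tokens


# Hypothetical normalization table: map the plural article to a base form.
FRENCH_NORMALIZATION = {"les": "le"}


def normalize_french(sentence):
    """Lowercase the sentence and normalize articles, as in the slide's example."""
    words = sentence.lower().split()
    return " ".join(FRENCH_NORMALIZATION.get(w, w) for w in words)


if __name__ == "__main__":
    print(" ".join(segment("wAlSyArAt")))       # segmentation
    print(normalize_french("Et les voitures"))  # normalization
```

Segmentation separates the conjunction and the article from the noun, while normalization collapses surface variants; both reduce sparsity in the vocabulary an MT system has to learn.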

Morphology
Arabic: very rich morphology: number, gender, case, person, aspect, voice, several clitics, etc.
– Arabic tokenization
English: simple morphology
Chinese: no morphology
– quantifiers & verbal aspects
Arabic: يقرأ الطالب المجتهد كتابا عن الصين في الصف
Gloss: read the-student the-diligent a-book about china in the-classroom
English: The diligent student is reading a book about China in the classroom
Chinese: 这位勤奋的学生在教室读一本关于中国的书
Gloss: this quant diligent de student in classroom read one quant about china de book

Syntax
              Arabic      English                 Chinese
Subj-Verb     V Subj      Subj V                  Subj … V
Verb-PP       V … PP      V PP                    PP V
Adjectives    N Adj       Adj N                   Adj de N
Possessives   N Poss      N of Poss, Poss 's N    Poss de N
Relatives     N Rel       N Rel                   Rel de N
Arabic: يقرأ الطالب المجتهد كتابا عن الصين في الصف
Gloss: read the-student the-diligent a-book about china in the-classroom
English: The diligent student is reading a book about China in the classroom
Chinese: 这位勤奋的学生在教室读一本关于中国的书
Gloss: this quant diligent de student in classroom read one quant about china de book
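
The Adjectives row of the word-order comparison can be written down as data. The sketch below is illustrative only: the dictionary ADJ_NOUN_ORDER and the function order_adjective_phrase are hypothetical names, and the words are the English glosses from the example sentence rather than real Arabic or Chinese tokens.

```python
# Word-order patterns for adjective + noun, taken from the Adjectives row.
ADJ_NOUN_ORDER = {
    "Arabic": ["N", "Adj"],
    "English": ["Adj", "N"],
    "Chinese": ["Adj", "de", "N"],
}


def order_adjective_phrase(language, noun, adjective):
    """Return the gloss words of an adjective-noun phrase in the given order."""
    slots = {"N": noun, "Adj": adjective, "de": "de"}
    return [slots[slot] for slot in ADJ_NOUN_ORDER[language]]


if __name__ == "__main__":
    for language in ADJ_NOUN_ORDER:
        print(language, order_adjective_phrase(language, "student", "diligent"))
```

The output mirrors the glosses in the example: "student diligent" for Arabic, "diligent student" for English, and "diligent de student" for Chinese.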

Translation Divergences: conflation
Arabic: لست هنا (I-am-not here)
English: I am not here
French: Je ne suis pas ici (I not am not here)
Trees in the figure: لست (هنا); am (I, not, here); suis (Je, ne, pas, ici)

Translation Divergences: categorial, thematic and structural
English: I am cold; tree: be (I, cold)
Arabic: انا بردان (I cold); tree: * (انا, بردان)
Hebrew: קר לי (cold for-me); tree: * (קר, ל (אני))
Spanish: tengo frio (I-have cold); tree: tener (Yo, frio)

Translation Divergences: head swap and categorial
English: I swam across the river quickly; tree: swim (I, quickly, across (river))
Arabic: اسرعت عبور النهر سباحة (I-sped crossing the-river swimming); tree: اسرع (انا, سباحة, عبور (نهر))

Translation Divergences: head swap and categorial
English: I swam across the river quickly; tree: swim (I, quickly, across (river))
Hebrew: חציתי את הנהר בשחיה במהירות (I-crossed obj river in-swim speedily); tree: חצה (אני, את (נהר), ב (שחיה), ב (מהירות))

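Translation Divergences: head swap and categorial
Hebrew tree: חצה (אני, את (נהר), ב (שחיה), ב (מהירות))
Arabic tree: اسرع (انا, سباحة, عبور (نهر))
English tree: swim (I, quickly, across (river))
Part-of-speech labels in the figure: noun, prep, verb, noun, adverb, verb, noun, verb, noun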

Translation Divergences: Orthography + Morphology + Syntax
Chinese: 妈妈的车 (mama de che)
English: mom's car
Arabic: سيارة ماما (sayyArat mama)
French: la voiture de maman
Gloss labels in the figure: car, mom, possessed-by

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

MT Strategies ( )
[Chart] Axes: Knowledge Acquisition Strategy (All manual to Fully automated) vs. Knowledge Representation Strategy (Shallow/Simple to Deep/Complex)
Chart labels: Hand-built by experts; Hand-built by non-experts; Learn from annotated data; Learn from un-annotated data; Word-based only; Electronic dictionaries; Phrase tables; Syntactic Constituent Structure; Semantic analysis; Interlingua
Systems plotted: Original direct approach; Typical transfer system; Classic interlingual system; Example-based MT; Original statistical MT; "New Research Goes Here!"
Slide courtesy of Laurie Gerber

MT Approaches: MT Pyramid
Analysis: Source word → Source syntax → Source meaning
Generation: Target meaning → Target syntax → Target word
Labeled paths: Gisting

MT Approaches: Gisting Example
Source: Sobre la base de dichas experiencias se estableció en 1988 una metodología.
Gist: Envelope her basis out speak experiences them settle at 1988 one methodology.
Reference: On the basis of these experiences, a methodology was arrived at in 1988.
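
The gisting output above is what word-for-word dictionary lookup produces when each source word is replaced by a single translation regardless of context. The sketch below is a toy reconstruction: the lexicon GIST_LEXICON is hypothetical and was read off the example by aligning the two sentences position by position; it is not a real dictionary.

```python
# Hypothetical one-translation-per-word lexicon inferred from the example.
GIST_LEXICON = {
    "sobre": "envelope",       # "sobre" as the noun, not the preposition "on"
    "la": "her",               # "la" as the pronoun, not the article "the"
    "base": "basis",
    "de": "out",
    "dichas": "speak",
    "experiencias": "experiences",
    "se": "them",
    "estableció": "settle",
    "en": "at",
    "una": "one",
    "metodología": "methodology",
}


def gist(sentence):
    """Translate word by word, leaving unknown tokens (e.g. numbers) unchanged."""
    words = sentence.rstrip(".").split()
    translated = [GIST_LEXICON.get(w.lower(), w) for w in words]
    text = " ".join(translated)
    return text[0].upper() + text[1:] + "."


if __name__ == "__main__":
    print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
```

The lookup ignores word sense, morphology, and word order, which is why the result is only a rough gist.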

MT Approaches: MT Pyramid
Analysis: Source word → Source syntax → Source meaning
Generation: Target meaning → Target syntax → Target word
Labeled paths: Gisting, Transfer

MT Approaches: Transfer Example
Transfer Lexicon
– Map SL structure to TL structure
– poner (:subj X, :obj mantequilla, :mod en (:obj Y)) → butter (:subj X, :obj Y)
– X puso mantequilla en Y → X buttered Y
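
The transfer lexicon entry above can be sketched as a structure-to-structure rewrite. This is a minimal illustration, assuming a simple dependency-tree representation; the Node class and the function transfer_poner_mantequilla are hypothetical names and do not come from any particular MT system.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A dependency node: a lemma plus children keyed by relation name."""
    lemma: str
    children: dict = field(default_factory=dict)


def transfer_poner_mantequilla(src):
    """Apply the rule poner(:subj X, :obj mantequilla, :mod en(:obj Y))
    -> butter(:subj X, :obj Y). Return None if the pattern does not match."""
    subj = src.children.get("subj")
    obj = src.children.get("obj")
    mod = src.children.get("mod")
    matches = (
        src.lemma == "poner"
        and subj is not None
        and obj is not None and obj.lemma == "mantequilla"
        and mod is not None and mod.lemma == "en"
        and "obj" in mod.children
    )
    if not matches:
        return None
    # X stays the subject; Y is promoted from inside the PP to direct object.
    return Node("butter", {"subj": subj, "obj": mod.children["obj"]})


if __name__ == "__main__":
    source = Node("poner", {
        "subj": Node("X"),
        "obj": Node("mantequilla"),
        "mod": Node("en", {"obj": Node("Y")}),
    })
    target = transfer_poner_mantequilla(source)
    print(target.lemma, target.children["subj"].lemma, target.children["obj"].lemma)
```

The rule conflates the Spanish verb and its object into a single English verb and restructures the arguments, which a word-for-word lookup cannot do.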

MT Approaches: MT Pyramid
Analysis: Source word → Source syntax → Source meaning
Generation: Target meaning → Target syntax → Target word
Labeled paths: Gisting, Transfer, Interlingua

MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

MT Approaches: MT Pyramid
Analysis: Source word → Source syntax → Source meaning
Generation: Target meaning → Target syntax → Target word
Labeled paths: Interlingua, Transfer, Gisting

MT Approaches: MT Pyramid
Analysis: Source word → Source syntax → Source meaning
Generation: Target meaning → Target syntax → Target word
Labeled resources: Interlingual Lexicons, Transfer Lexicons, Dictionaries/Parallel Corpora

MT Approaches: Statistical vs. Rule-based
Analysis: Source word → Source syntax → Source meaning
Generation: Target meaning → Target syntax → Target word

To be continued …