Machine Translation: Challenges and Approaches. Nizar Habash, Associate Research Scientist, Center for Computational Learning Systems, Columbia University.


Machine Translation: Challenges and Approaches. Nizar Habash, Associate Research Scientist, Center for Computational Learning Systems, Columbia University. Invited Lecture, Introduction to Natural Language Processing, Fall 2008.

Currently, Google offers translations between the following languages: Arabic Bulgarian Catalan Chinese Croatian Czech Danish Dutch Filipino Finnish French German Greek Hebrew Hindi Indonesian Italian Japanese Korean Latvian Lithuanian Norwegian Polish Portuguese Romanian Russian Serbian Slovak Slovenian Spanish Swedish Ukrainian Vietnamese

Thank you for your attention! Questions?

“BBC found similar support”!!!

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

Multilingual Challenges. Orthographic Variations – Ambiguous spelling: كتب الاولاد اشعارا vs. كَتَبَ الأوْلادُ اشعَاراً – Ambiguous word boundaries. Lexical Ambiguity – Bank → بنك (financial) vs. ضفة (river) – Eat → essen (human) vs. fressen (animal).

Multilingual Challenges. Morphological Variations – Affixation vs. Root+Pattern: write → written vs. كتب → مكتوب; kill → killed vs. قتل → مقتول; do → done vs. فعل → مفعول. Tokenization – and the cars → والسيارات → w Al SyArAt (conj + article + noun + plural); et les voitures → et le voitures.

Translation Divergences – conflation. English: I am not here. Arabic: لست هنا (gloss: I-am-not here). French: Je ne suis pas ici (gloss: I not am not here).

Translation Divergences – head swap and categorial. English: John swam across the river quickly. Spanish: Juan cruzó rápidamente el río nadando (gloss: John crossed fast the river swimming). Arabic: اسرع جون عبور النهر سباحة (gloss: sped John crossing the-river swimming). Chinese: 约翰 快速 地 游 过 这 条 河 (gloss: John quickly (DE) swam cross the (Quantifier) river). Russian: Джон быстро переплыл реку (gloss: John quickly cross-swam river).

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

MT Approaches – MT Pyramid. Source word → Source syntax → Source meaning; Target meaning → Target syntax → Target word. Analysis / Generation. Gisting.

MT Approaches – Gisting Example. Source (Spanish): Sobre la base de dichas experiencias se estableció en 1988 una metodología. Gist (word-for-word): Envelope her basis out speak experiences them settle at 1988 one methodology. Human translation: On the basis of these experiences, a methodology was arrived at in 1988.

MT Approaches – MT Pyramid. Source word → Source syntax → Source meaning; Target meaning → Target syntax → Target word. Analysis / Generation. Gisting, Transfer.

MT Approaches – Transfer Example. Transfer Lexicon – map SL structure to TL structure: poner (:subj X, :obj mantequilla, :mod en Y) ↔ butter (:subj X, :obj Y). Example: X puso mantequilla en Y → X buttered Y.

MT Approaches – MT Pyramid. Source word → Source syntax → Source meaning; Target meaning → Target syntax → Target word. Analysis / Generation. Gisting, Transfer, Interlingua.

MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

MT Approaches – MT Pyramid. Source word → Source syntax → Source meaning; Target meaning → Target syntax → Target word. Analysis / Generation. Interlingua, Gisting, Transfer.

MT Approaches – MT Pyramid. Source word → Source syntax → Source meaning; Target meaning → Target syntax → Target word. Analysis / Generation. Interlingual Lexicons; Dictionaries/Parallel Corpora; Transfer Lexicons.

MT Approaches – Statistical vs. Rule-based. Source word → Source syntax → Source meaning; Target meaning → Target syntax → Target word. Analysis / Generation.

Statistical MT Noisy Channel Model Portions from
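The noisy-channel model picks the English sentence ê = argmax over e of P(e) · P(f | e): the language model P(e) rewards fluency and the translation model P(f | e) rewards faithfulness to the foreign sentence f. A minimal sketch of that decision rule, with invented probabilities (all numbers and candidate sentences are illustrative, not from the lecture):

```python
import math

# Noisy-channel scoring: choose the English candidate e maximizing
# P(e) * P(f | e), i.e. language model * translation model.
# All probabilities below are made-up toy numbers.
def best_translation(candidates):
    # candidates: list of (english, lm_prob, tm_prob)
    return max(candidates,
               key=lambda c: math.log(c[1]) + math.log(c[2]))[0]

candidates = [
    ("the house is small", 0.20, 0.10),   # fluent and faithful
    ("small the is house", 0.001, 0.10),  # faithful but disfluent
    ("the home is tiny",   0.15, 0.02),   # fluent, less faithful
]
print(best_translation(candidates))  # -> the house is small
```

The product rewards candidates that score well on both models at once, which is exactly the balance the pyramid slides describe between analysis and generation.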

Statistical MT – Automatic Word Alignment. GIZA++ – a statistical machine translation toolkit used to train word alignments – uses Expectation-Maximization with various constraints to bootstrap alignments. Example alignment: Maria no dio una bofetada a la bruja verde ↔ Mary did not slap the green witch. Slide based on Kevin Knight’s.
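The simplest model GIZA++ bootstraps from, IBM Model 1, fits on a slide as a few lines of EM. This toy sketch (the two-sentence corpus and all variable names are my own, not GIZA++ code) learns that Spanish "la" aligns to English "the":

```python
from collections import defaultdict

# Minimal IBM Model 1 EM sketch on a toy parallel corpus.
# t[(f, e)] approximates P(f | e) after a few iterations.
corpus = [
    (["la", "casa"], ["the", "house"]),
    (["la", "bruja"], ["the", "witch"]),
]

# Uniform initialization over the foreign vocabulary
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(10):
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)      # E-step: normalize over e
            for e in es:
                count[(f, e)] += t[(f, e)] / z  # expected counts
                total[e] += t[(f, e)] / z
    for (f, e), c in count.items():             # M-step: re-estimate t
        t[(f, e)] = c / total[e]

print(round(t[("la", "the")], 2))  # close to 1.0
```

Because "la" co-occurs with "the" in both sentence pairs while the content words vary, EM concentrates the probability mass on the correct links, the same bootstrapping idea the slide attributes to GIZA++.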

Statistical MT IBM Model (Word-based Model)

Phrase-Based Statistical MT. Foreign input segmented into phrases – a “phrase” is any sequence of words. Each phrase is probabilistically translated into English – P(to the conference | zur Konferenz) – P(into the meeting | zur Konferenz). Phrases are probabilistically re-ordered. See [Koehn et al, 2003] for an intro. This is state-of-the-art! Example: Morgen fliege ich nach Kanada zur Konferenz → Tomorrow I will fly to the conference in Canada. Slide courtesy of Kevin Knight.

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde Word Alignment Induced Phrases (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … Word Alignment Induced Phrases Slide courtesy of Kevin Knight

Mary did not slap the green witch Maria no dió una bofetada a la bruja verde (Maria, Mary) (no, did not) (slap, dió una bofetada) (la, the) (bruja, witch) (verde, green) (a la, the) (dió una bofetada a, slap the) (Maria no, Mary did not) (no dió una bofetada, did not slap), (dió una bofetada a la, slap the) (bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap) (a la bruja verde, the green witch) … (Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch) Word Alignment Induced Phrases Slide courtesy of Kevin Knight
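The phrase pairs built up on these slides follow the standard consistency criterion: a source span and target span form a phrase pair if every alignment link touching either span stays inside the box around both. A simplified sketch (the function name is mine, and it ignores the extension over unaligned boundary words that the full Koehn et al. procedure performs):

```python
# Consistent phrase-pair extraction from a word alignment,
# simplified: no handling of unaligned boundary words.
def extract_phrases(src, tgt, alignment, max_len=7):
    """alignment: set of (src_idx, tgt_idx) links."""
    phrases = set()
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # target positions linked to the source span [s1, s2]
            ts = [t for s, t in alignment if s1 <= s <= s2]
            if not ts:
                continue
            t1, t2 = min(ts), max(ts)
            # consistency check: no link may leave the box
            if all(s1 <= s <= s2 for s, t in alignment if t1 <= t <= t2):
                phrases.add((" ".join(src[s1:s2 + 1]),
                             " ".join(tgt[t1:t2 + 1])))
    return phrases

src = "Maria no dió una bofetada".split()
tgt = "Mary did not slap".split()
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3)}
pairs = extract_phrases(src, tgt, links)
print(("no", "did not") in pairs)  # -> True
```

Note how (dió, slap) alone is rejected: "slap" is also linked to "una" and "bofetada", so the box is inconsistent, exactly why the slides only list (slap, dió una bofetada) as a unit.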

Advantages of Phrase-Based SMT. Many-to-many mappings can handle non-compositional phrases. Local context is very useful for disambiguating – “Interest rate” → … – “Interest in” → … The more data, the longer the learned phrases – sometimes whole sentences. Slide courtesy of Kevin Knight.

MT Approaches – Statistical vs. Rule-based vs. Hybrid. Source word → Source syntax → Source meaning; Target meaning → Target syntax → Target word. Analysis / Generation.

MT Approaches – Practical Considerations. Resource availability – Parsers and generators: input/output compatibility – Translation lexicons: word-based vs. transfer/interlingua – Parallel corpora: domain of interest; bigger is better. Time availability – statistical training, resource building.

Road Map Multilingual Challenges for MT MT Approaches MT Evaluation

MT Evaluation. More art than science. Wide range of metrics/techniques – interface, …, scalability, …, faithfulness, …, space/time complexity, … etc. Automatic vs. human-based – dumb machines vs. slow humans.

Human-based Evaluation Example Accuracy Criteria

Human-based Evaluation Example Fluency Criteria

Automatic Evaluation Example – Bleu Metric (Papineni et al 2001). Bleu – BiLingual Evaluation Understudy – modified n-gram precision with length penalty – quick, inexpensive and language independent – correlates highly with human evaluation – bias against synonyms and inflectional variations.

Automatic Evaluation Example – Bleu Metric. Test sentence: colorless green ideas sleep furiously. Gold standard references: all dull jade ideas sleep irately; drab emerald concepts sleep furiously; colorless immature thoughts nap angrily.

Automatic Evaluation Example – Bleu Metric. Test sentence: colorless green ideas sleep furiously. Gold standard references: all dull jade ideas sleep irately; drab emerald concepts sleep furiously; colorless immature thoughts nap angrily. Unigram precision = 4/5.

Automatic Evaluation Example – Bleu Metric. Test sentence: colorless green ideas sleep furiously. Gold standard references: all dull jade ideas sleep irately; drab emerald concepts sleep furiously; colorless immature thoughts nap angrily. Unigram precision = 4/5 = 0.8. Bigram precision = 2/4 = 0.5. Bleu score = (a1 a2 … an)^(1/n) = (0.8 × 0.5)^(1/2) ≈ 0.632.
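The slide's numbers can be reproduced with modified (clipped) n-gram precision and a geometric mean. This sketch covers only 1- and 2-grams, matching the slide, and omits Bleu's brevity penalty, which is 1 here anyway since the test sentence is no shorter than its closest reference:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(test, refs, n):
    test_counts = ngrams(test, n)
    # clip each n-gram count by its maximum count in any single reference
    max_ref = Counter()
    for ref in refs:
        for g, c in ngrams(ref, n).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in test_counts.items())
    return clipped / sum(test_counts.values())

test = "colorless green ideas sleep furiously".split()
refs = ["all dull jade ideas sleep irately".split(),
        "drab emerald concepts sleep furiously".split(),
        "colorless immature thoughts nap angrily".split()]

p1 = modified_precision(test, refs, 1)  # 4/5: only "green" is unmatched
p2 = modified_precision(test, refs, 2)  # 2/4: "ideas sleep", "sleep furiously"
score = math.exp((math.log(p1) + math.log(p2)) / 2)  # geometric mean
print(round(score, 3))  # -> 0.632
```

Clipping is what makes the precision "modified": a test n-gram cannot be credited more times than it appears in any one reference, blocking degenerate outputs that repeat a matched word.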

Automatic Evaluation Example – METEOR (Lavie and Agarwal 2007). Metric for Evaluation of Translation with Explicit word Ordering. Extended matching between translation and reference – Porter stems, WordNet synsets. Unigram precision, recall, parameterized F-measure. Reordering penalty. Parameters can be tuned to optimize correlation with human judgments. Not biased against “non-statistical” MT systems.
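METEOR combines the parameterized F-measure F = P·R / (α·P + (1−α)·R) with a fragmentation penalty γ·(chunks/matches)^β. A sketch of just this scoring arithmetic; the matcher itself (exact, stem, and synonym matching, and chunk computation) is assumed done elsewhere, and the default parameter values below are my assumption of the commonly cited tuned settings:

```python
# METEOR-style scoring arithmetic: parameterized F-measure plus
# fragmentation penalty. Inputs are assumed to come from a separate
# matching stage; alpha/beta/gamma defaults are an assumption.
def meteor_score(matches, test_len, ref_len, chunks,
                 alpha=0.9, beta=3.0, gamma=0.5):
    precision = matches / test_len
    recall = matches / ref_len
    # recall-weighted harmonic mean of precision and recall
    f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
    # fewer, longer contiguous chunks -> smaller reordering penalty
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)

# e.g. 4 of 5 test words match 4 of 6 reference words, in 2 chunks
print(round(meteor_score(4, 5, 6, 2), 3))  # -> 0.636
```

With α near 1 the F-measure leans heavily on recall, one reason the slide notes METEOR is not biased against systems that produce fuller, less clipped output.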

Metrics MATR Workshop. Workshop at the 2008 AMTA conference – Association for Machine Translation in the Americas. Evaluating evaluation metrics. Compared 39 metrics – 7 baselines and 32 new metrics – various measures of correlation with human judgment – different conditions: text genre, source language, number of references, etc.

Automatic Evaluation Example – SEPIA (Habash and ElKholy 2008). A syntactically-aware evaluation metric (cf. Liu and Gildea, 2005; Owczarzak et al., 2007; Giménez and Màrquez, 2007). Uses a dependency representation – MICA parser (Nasr & Rambow 2006) – 77% of all structural bigrams are surface n-grams of size 2, 3, 4. Includes dependency surface span as a factor in the score – long-distance dependencies should receive a greater weight than short-distance dependencies. Higher degree of grammaticality?

Interested in MT? Contact me. Research courses, projects. Languages of interest: English, Arabic, Hebrew, Chinese, Urdu, Spanish, Russian, … Topics: statistical, hybrid MT – phrase-based MT with linguistic extensions – component improvements or full-system improvements – MT evaluation – multilingual computing.

Thank You