Machine Translation: Challenges and Approaches Nizar Habash Post-doctoral Fellow Center for Computational Learning Systems Columbia University Invited.

Presentation transcript:

Machine Translation: Challenges and Approaches Nizar Habash Post-doctoral Fellow Center for Computational Learning Systems Columbia University Invited Lecture CS 4705: Introduction to Natural Language Processing Fall 2004

Sounds like Faulkner?
"It lay on the table a candle burning at each corner upon the envelope tied in a soiled pink garter two artificial flowers. Not hit a man in glasses." (William Faulkner, "The Sound and the Fury")
"It was once a shade, which was in all beautiful weather under a tree and varied like the branches in the wind." (Translated by Systran from: "Es war einmal ein Schatten, der lag bei jedem schönen Wetter unter einem Baum und schwankte wie die Zweige im Wind." Helmut Wördemann, "Der unzufriedene Schatten")
For each passage: Faulkner or Machine Translation?

Progress in MT
Statistical MT example, from a talk by Charles Wayne, DARPA
MT output (earlier system): insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.
MT output (later system): Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
Human Translation: Egypt Air May Resume its Flights to Libya Tomorrow Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya.

Road Map Why Machine Translation (MT)? Multilingual Challenges for MT MT Approaches MT Evaluation

Why (Machine) Translation? Languages in the world 6,800 living languages 600 with written tradition 95% of world population speaks 100 languages Translation Market $8 Billion Global Market Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)

Why Machine Translation? Full Translation –Domain specific Weather reports Machine-aided Translation –Translation dictionaries –Translation memories –Requires post-editing Cross-lingual NLP applications –Cross-language IR –Cross-language Summarization

Road Map Why Machine Translation (MT)? Multilingual Challenges for MT –Orthographic variations –Lexical ambiguity –Morphological variations –Translation divergences MT Paradigms MT Evaluation

Multilingual Challenges
Orthographic Variations
– Ambiguous spelling: كتب الاولاد اشعارا (without diacritics) vs. كَتَبَ الأوْلادُ اشعَاراً (with diacritics)
– Ambiguous word boundaries
Lexical Ambiguity
– Bank → بنك (financial) vs. ضفة (river)
– Eat → essen (human) vs. fressen (animal)

Multilingual Challenges
Morphological Variations
– Affixation vs. Root+Pattern
  write → written; كتب → مكتوب
  kill → killed; قتل → مقتول
  do → done; فعل → مفعول
– Tokenization (conjunction, article, noun, plural)
  And the cars → and the cars
  والسيارات → w Al SyArAt
  Et les voitures → et le voitures
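As a rough illustration of the tokenization example, the sketch below splits the transliterated word wAlSyArAt into conjunction, article and stem. The prefix inventory, the minimum stem length and the function name are assumptions made for this example. Greedy prefix stripping is much cruder than a real morphological analyzer: it would wrongly split any stem that merely begins with "w" or "Al".

```python
# Hypothetical clitic inventory (Buckwalter-style transliteration), listed in
# the order the clitics attach: conjunction first, then the definite article.
PREFIX_SLOTS = [("w",), ("Al",)]

def split_proclitics(word):
    """Greedily peel at most one clitic per slot off the front of `word`."""
    tokens = []
    rest = word
    for slot in PREFIX_SLOTS:
        for clitic in slot:
            # Keep at least two characters of stem so short words stay whole.
            if rest.startswith(clitic) and len(rest) - len(clitic) >= 2:
                tokens.append(clitic)
                rest = rest[len(clitic):]
                break
    tokens.append(rest)
    return tokens

print(split_proclitics("wAlSyArAt"))  # ['w', 'Al', 'SyArAt']
```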

Multilingual Challenges
Translation Divergences
How languages map semantics to syntax
35% of sentences in TREC El Norte Corpus (Dorr et al 2002)
Divergence Types
– Categorial (X tener hambre → X be hungry) [98%]
– Conflational (X dar puñaladas a Z → X stab Z) [83%]
– Structural (X entrar en Y → X enter Y) [35%]
– Head Swapping (X cruzar Y nadando → X swim across Y) [8%]
– Thematic (X gustar a Y → Y like X) [6%]

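Translation Divergences: conflation
– Arabic: لست هنا (gloss: I-am-not here); tree nodes, head first: ليس, انا, هنا
– English: I am not here; tree nodes, head first: be, I, not, here
– French: Je ne suis pas ici (gloss: I not be not here); tree nodes, head first: etre, Je, ne, pas, ici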

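Translation Divergences: categorial, thematic and structural
– English: I am cold; tree nodes, head first: be, I, cold
– Arabic: انا بردان (gloss: I cold); tree nodes, head first: *, انا, بردان
– Hebrew: קר לי (gloss: cold for-me); tree nodes, head first: *, קר, ל, אני
– Spanish: tengo frio (gloss: I-have cold); tree nodes, head first: tener, Yo, frio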

Translation Divergences: head swap and categorial
– English: I swam across the river quickly; tree nodes, head first: swim, I, quickly, across, river
– Arabic: اسرعت عبور النهر سباحة (gloss: I-sped crossing the-river swimming); tree nodes, head first: اسرع, انا, سباحة, عبور, نهر

Translation Divergences: head swap and categorial
– English: I swam across the river quickly; tree nodes, head first: swim, I, quickly, across, river
– Hebrew: חציתי את הנהר בשחיה במהירות (gloss: I-crossed obj river in-swim speedily); tree nodes, head first: חצה, אני, ב, את, נהר, ב, שחיה, מהירות

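Translation Divergences: head swap and categorial
Tree nodes, head first, for the three languages side by side:
– Hebrew: חצה, אני, ב, את, נהר, ב, שחיה, מהירות
– Arabic: اسرع, انا, سباحة, عبور, نهر
– English: swim, I, quickly, across, river
Part-of-speech labels on the nodes (in extraction order): noun, prep, verb, noun, adverb, verb, noun, verb, noun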

Translation Divergences: Orthography+Morphology+Syntax
– Chinese: 妈妈的车 (mama de che; gloss: mom possessed-by car)
– English: mom’s car
– Arabic: سيارة ماما (sayyArat mama)
– French: la voiture de maman

Road Map Why Machine Translation (MT)? Multilingual Challenges for MT MT Approaches –Gisting / Transfer / Interlingua –Statistical / Symbolic / Hybrid –Practical Considerations MT Evaluation

MT Approaches: MT Pyramid
Analysis (source side): Source word, Source syntax, Source meaning
Generation (target side): Target meaning, Target syntax, Target word
Level shown: Gisting

MT Approaches: Gisting Example
Source: Sobre la base de dichas experiencias se estableció en 1988 una metodología.
Gisting output: Envelope her basis out speak experiences them settle at 1988 one methodology.
Human translation: On the basis of these experiences, a methodology was arrived at in 1988.
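The gisting output in this example lines up word for word with the Spanish source, which is what a plain dictionary lookup would produce. The sketch below reproduces it with a toy dictionary built only from the word pairs visible in the slide. The dictionary, the function name and the pass-through handling of unknown words are assumptions for illustration, not a description of any particular gisting system.

```python
# Toy bilingual dictionary taken from the word pairs in the slide's example.
# A real gisting system would use a full translation dictionary.
GIST_DICT = {
    "sobre": "envelope", "la": "her", "base": "basis", "de": "out",
    "dichas": "speak", "experiencias": "experiences", "se": "them",
    "estableció": "settle", "en": "at", "una": "one",
    "metodología": "methodology",
}

def gist(sentence):
    """Translate word by word in source order; unknown words pass through."""
    words = sentence.rstrip(".").split()
    text = " ".join(GIST_DICT.get(w.lower(), w) for w in words)
    # Upper-case only the first character, leaving the rest untouched.
    return text[:1].upper() + text[1:] + "."

print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
```

Because each word is translated in isolation, the wrong sense is often chosen ("sobre" as the noun "envelope" rather than the preposition "on") and source word order is kept, which is why the result is barely readable.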

MT Approaches: MT Pyramid
Analysis (source side): Source word, Source syntax, Source meaning
Generation (target side): Target meaning, Target syntax, Target word
Levels shown: Gisting, Transfer

MT Approaches: Transfer Example
Transfer Lexicon
– Map SL structure to TL structure
– poner (:subj X, :obj mantequilla, :mod en (:obj Y)) → butter (:subj X, :obj Y)
– X puso mantequilla en Y → X buttered Y
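The transfer-lexicon entry in this example can be read as a tree rewrite: match a source dependency structure, then build the target structure from the matched parts. The sketch below hard-codes that single rule. The dictionary-based tree representation and the function names are assumptions for illustration only.

```python
def node(lex, **deps):
    """Build a dependency node: a lexeme plus labelled dependents."""
    return {"lex": lex, "deps": deps}

def transfer_poner_mantequilla(tree):
    """Apply the one rule from the slide; return None if it does not match.

    poner(:subj X, :obj mantequilla, :mod en(:obj Y)) -> butter(:subj X, :obj Y)
    """
    deps = tree["deps"]
    obj = deps.get("obj")
    mod = deps.get("mod")
    if (tree["lex"] == "poner" and "subj" in deps
            and obj is not None and obj["lex"] == "mantequilla"
            and mod is not None and mod["lex"] == "en"
            and "obj" in mod["deps"]):
        # X stays the subject; Y moves from inside the modifier to object.
        return node("butter", subj=deps["subj"], obj=mod["deps"]["obj"])
    return None

source = node("poner",
              subj=node("X"),
              obj=node("mantequilla"),
              mod=node("en", obj=node("Y")))
target = transfer_poner_mantequilla(source)
print(target["lex"], target["deps"]["subj"]["lex"], target["deps"]["obj"]["lex"])
```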

MT Approaches: MT Pyramid
Analysis (source side): Source word, Source syntax, Source meaning
Generation (target side): Target meaning, Target syntax, Target word
Levels shown: Gisting, Transfer, Interlingua

MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)

MT Approaches: MT Pyramid
Analysis (source side): Source word, Source syntax, Source meaning
Generation (target side): Target meaning, Target syntax, Target word
Levels shown: Gisting, Transfer, Interlingua

MT Approaches: MT Pyramid
Analysis (source side): Source word, Source syntax, Source meaning
Generation (target side): Target meaning, Target syntax, Target word
Resources shown: Dictionaries/Parallel Corpora, Transfer Lexicons, Interlingual Lexicons

MT Approaches: Statistical vs. Symbolic
Analysis (source side): Source word, Source syntax, Source meaning
Generation (target side): Target meaning, Target syntax, Target word

MT Approaches Noisy Channel Model Portions from
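The figure for this slide is not preserved in the transcript. For reference, the usual noisy channel formulation of statistical MT, stated here from the general literature rather than recovered from the slide, chooses the target sentence that is most probable given the source sentence. The symbols $e$ (target sentence) and $f$ (source sentence) are notational assumptions.

```latex
\hat{e} = \arg\max_{e} P(e \mid f)
        = \arg\max_{e} \frac{P(e)\, P(f \mid e)}{P(f)}
        = \arg\max_{e} P(e)\, P(f \mid e)
```

$P(e)$ is the language model, $P(f \mid e)$ is the translation model, and $P(f)$ drops out because it does not depend on $e$.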

MT Approaches IBM Model (Word-based Model)
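The slide does not say which IBM model it shows. As a minimal sketch of a word-based model, the code below trains IBM Model 1, the simplest of the family, with EM on a three-sentence toy corpus. The corpus, the function name and the omission of the NULL word are assumptions made to keep the example short.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM; `pairs` holds (foreign, english) token lists."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Uniform initialisation of the translation table.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (f, e) link
        total = defaultdict(float)  # expected count of links to each e
        for fs, es in pairs:
            for f in fs:
                # E-step: share one unit of f among the English words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_ibm_model1(corpus)
for f, e in [("das", "the"), ("haus", "the"), ("haus", "house"), ("buch", "book")]:
    print(f"t({f} | {e}) = {t[(f, e)]:.3f}")
```

No alignments are given as input. The co-occurrence pattern alone pushes probability toward "das"/"the" and "buch"/"book", and the remaining words fall into place from there.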

MT Approaches: Statistical vs. Symbolic vs. Hybrid
Analysis (source side): Source word, Source syntax, Source meaning
Generation (target side): Target meaning, Target syntax, Target word

MT Approaches: Hybrid Example: GHMT
Generation-Heavy Hybrid Machine Translation
– Lexical transfer but NO structural transfer
– Source sentence: Maria puso la mantequilla en el pan.
– Source tree: poner (:subj Maria, :obj mantequilla, :mod en (:obj pan))
– After lexical transfer: {lay, locate, place, put, render, set, stand} (:subj Maria, :obj {butter, bilberry}, :mod {on, in, into, at} (:obj {bread, loaf}))

MT Approaches: Hybrid Example: GHMT
LCS-driven Expansion: Conflation Example
– PUT (V): Agent MARIA, Theme BUTTER (N), Goal BREAD
– BUTTER (V): Agent MARIA, Goal BREAD
– Labels on the figure: [CAUSE GO], Categorial Variation

MT Approaches: Hybrid Example: GHMT
Structural Overgeneration (tree nodes, head first)
– put Maria butter on bread
– lay Maria butter at loaf
– render Maria butter into loaf
– butter Maria bread
– Maria butter …

MT Approaches: Hybrid Example: GHMT
Target Statistical Resources
Structural N-gram Model
– Long-distance
– Lexemes
– Example tree nodes, head first: buy, John, car, a, red
Surface N-gram Model
– Local
– Surface-forms
– Example: John bought a red car

MT Approaches: Hybrid Example: GHMT
Linearization & Ranking
– Maria buttered the bread
– Maria butters the bread
– Maria breaded the butter
– Maria breads the butter
– Maria buttered the loaf
– Maria butters the loaf
– Maria put the butter on bread

MT Approaches: Practical Considerations
Resources Availability
– Parsers and Generators: Input/Output compatibility
– Translation Lexicons: Word-based vs. Transfer/Interlingua
– Parallel Corpora: Domain of interest; Bigger is better
Time Availability
– Statistical training, resource building

MT Approaches: Resource Poverty
No Parser? No Translation Dictionary?
– Parallel Corpus
– Align with rich language
– Extract dictionary
– Parse rich side
– Infer parses
– Build a statistical parser

Road Map Why Machine Translation (MT)? Multilingual Challenges for MT MT Approaches MT Evaluation

More art than science Wide range of Metrics/Techniques –interface, …, scalability, …, faithfulness,... space/time complexity, … etc. Automatic vs. Human-based –Dumb Machines vs. Slow Humans

MT Evaluation Metrics System-based Metrics Count internal resources: size of lexicon, number of grammar rules, etc. –easy to measure –not comparable across systems –not necessarily related to utility (Church and Hovy 1993)

MT Evaluation Metrics Text-based Metrics –Sentence-based Metrics Quality: Accuracy, Fluency, Coherence, etc. 3-point scale to 100-point scale –Comprehensibility Metrics Comprehension, Informativeness, x-point scales, questionnaires most related to utility hard to measure

MT Evaluation Metrics Text-based Metrics (cont’d) –Amount of Post-Editing number of keystrokes per page not necessarily related to utility Cost-based Metrics –Cost per page –Time per page

Human-based Evaluation Example Accuracy Criteria

Human-based Evaluation Example Fluency Criteria

Fluency vs. Accuracy
[Figure: plot with axes Accuracy and Fluency; labelled points: conMT, FAHQ MT, Prof. MT, Info. MT]

Automatic Evaluation Example Bleu Metric Bleu –BiLingual Evaluation Understudy (Papineni et al 2001) –Modified n-gram precision with length penalty –Quick, inexpensive and language independent –Correlates highly with human evaluation –Bias against synonyms and inflectional variations

Automatic Evaluation Example: Bleu Metric
Test Sentence: colorless green ideas sleep furiously
Gold Standard References:
– all dull jade ideas sleep irately
– drab emerald concepts sleep furiously
– colorless immature thoughts nap angrily
Unigram precision = 4 / 5 = 0.8
Bigram precision = 2 / 4 = 0.5
Bleu Score = $(a_1 a_2 \cdots a_n)^{1/n} = (0.8 \times 0.5)^{1/2} \approx 0.6325$
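The arithmetic in this example can be checked with a short script. The sketch below computes clipped n-gram precision and the geometric mean for the test sentence against the three references, using unigrams and bigrams only, as on the slide. The function names are assumptions. The brevity penalty follows the usual closest-reference-length convention and equals 1 here because the test sentence is as long as the closest reference. Standard Bleu normally goes up to 4-grams.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped precision: a candidate n-gram is credited at most as many
    times as it occurs in the reference where it is most frequent."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=2):
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty against the reference closest in length.
    c = len(candidate)
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    bp = 1.0 if c >= r else math.exp(1 - r / c)
    return bp * geo_mean

test = "colorless green ideas sleep furiously".split()
refs = [
    "all dull jade ideas sleep irately".split(),
    "drab emerald concepts sleep furiously".split(),
    "colorless immature thoughts nap angrily".split(),
]
print(modified_precision(test, refs, 1))  # 0.8
print(modified_precision(test, refs, 2))  # 0.5
print(round(bleu(test, refs), 4))         # 0.6325
```

"green" is the only unmatched unigram, and only "ideas sleep" and "sleep furiously" match as bigrams, which illustrates the bias against synonyms: none of the references' synonyms for "green" earn any credit.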