Machine Translation Overview Alon Lavie Language Technologies Institute Carnegie Mellon University August 25, 2004.


LTI Immigration Course

Machine Translation: History
- MT started in the 1940s, as one of the first conceived applications of computers
- Promising “toy” demonstrations in the 1950s failed miserably to scale up to “real” systems
- ALPAC Report: MT recognized as an extremely difficult, “AI-complete” problem in the mid-1960s
- MT revival started in earnest in the 1980s (US, Japan)
- Field dominated by rule-based approaches, requiring 100s of K-years of manual development
- Economic incentive for developing MT systems for a small number of language pairs (mostly European languages)

Machine Translation: Where are we today?
- Age of Internet and globalization: great demand for MT
  - Multiple official languages of UN, EU, Canada, etc.
  - Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar)
- Economic incentive is still primarily within a small number of language pairs
- Some fairly good commercial products in the market for these language pairs
  - Primarily a product of rule-based systems after many years of development
- Pervasive MT between most language pairs is still non-existent and not on the immediate horizon

Best Current General-purpose MT
PAHO’s Spanam system:

Mediante petición recibida por la Comisión Interamericana de Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.

Through petition received by the `Inter-American Commission on Human Rights` (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter “the petitioner”) denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to // equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the `American Convention on Human Rights` (hereinafter …”), as a consequence of judgments initiated against it.

Core Challenges of MT
- Ambiguity:
  - Human languages are highly ambiguous, and differently so in different languages
  - Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms
- Amount of required knowledge:
  - At least several 100k words, about as many phrases, plus syntactic knowledge (i.e. translation rules). How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

How to Tackle the Core Challenges
- Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran’s RBMT systems.
- Lots of Parallel Data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.
- Learning Approaches: learn translation rules automatically from small amounts of human translated and word-aligned data. Example: AVENUE’s XFER approach.
- Simplify the Problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!

State-of-the-Art in MT
- What users want:
  - General purpose (GP): any text
  - High quality (HQ): human level
  - Fully automatic (FA): no user intervention
- We can meet any 2 of these 3 goals today, but not all three at once:
  - FA + HQ: Knowledge-Based MT (KBMT)
  - FA + GP: Corpus-Based (Example-Based) MT
  - GP + HQ: Human-in-the-loop (efficiency tool)

Types of MT Applications
- Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)
- Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)
- Communication: lower quality may be okay, but input is degraded and real-time operation is required.

Approaches to MT: Vauquois MT Triangle
- Direct: Mi chiamo Alon Lavie -> My name is Alon Lavie
- Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] -> [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
- Interlingua: Give-information+personal-data (name=alon_lavie)
- Analysis leads up the source side of the triangle; Generation leads down the target side

Knowledge-based Interlingual MT
- The “obvious” deep Artificial Intelligence approach:
  - Analyze the source language into a detailed symbolic representation of its meaning
  - Generate this meaning in the target language
- “Interlingua”: one single meaning representation for all languages
  - Nice in theory, but extremely difficult in practice

The Interlingua KBMT approach
- With interlingua, need only N parsers/generators instead of N^2 transfer systems
[Diagram: languages L1-L6 linked pairwise by transfer systems, versus L1-L6 each linked to a central interlingua]
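The N versus N^2 comparison can be checked with a little arithmetic. The sketch below is an illustration only: it assumes one analyzer and one generator per language for the interlingua design, and one directional transfer system per ordered language pair, so the exact counts are 2N and N(N-1) rather than the rounded N and N^2 on the slide. The function names are hypothetical.

```python
def interlingua_components(n: int) -> int:
    """One analyzer (parser) plus one generator for each of the n languages."""
    return 2 * n


def transfer_systems(n: int) -> int:
    """One directional transfer system per ordered pair of distinct languages."""
    return n * (n - 1)


if __name__ == "__main__":
    # Compare how the two designs grow with the number of languages.
    for n in (3, 6, 20):
        print(f"{n} languages: interlingua={interlingua_components(n)}, "
              f"transfer={transfer_systems(n)}")
```

For the six languages in the diagram this gives 12 interlingua components against 30 transfer systems, and the gap widens quadratically as languages are added.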

Statistical MT (SMT)
- Proposed by IBM in the early 1990s: a direct, purely statistical, model for MT
- Statistical translation models are trained on a sentence-aligned translation corpus
- Attractive: completely automatic, no manual rules, much reduced manual labor
- Main drawbacks:
  - Effective only with huge volumes (several mega-words) of parallel text
  - Very domain-sensitive
  - Still viable only for a small number of language pairs!
- Impressive progress in the last 3-4 years due to large DARPA funding program (TIDES)
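To make "trained on a sentence-aligned translation corpus" concrete, here is a minimal sketch of the simplest word-based model in the IBM family (commonly called IBM Model 1), trained with EM. It is not taken from the slides: the three-sentence German-English corpus is a made-up toy, the NULL word is omitted, and the function names are hypothetical.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] = p(target_word | source_word) with EM."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    # Start from a uniform distribution over target words.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected co-occurrence counts under the current model.
        for src, tgt in pairs:
            for f in tgt:
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Toy sentence-aligned corpus (an assumption for illustration only).
CORPUS = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

if __name__ == "__main__":
    model = train_ibm_model1(CORPUS)
    for word in ("das", "haus", "buch", "ein"):
        print(word, "->", best_translation(model, word))
```

Even on three sentences the model learns word correspondences from co-occurrence alone, but the slide's caveat applies: usable estimates need far larger volumes of parallel text.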

EBMT Paradigm
- New Sentence (Source): Yesterday, 200 delegates met with President Clinton.
- Matches to Source Found:
  - Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  - Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…
- Alignment (Sub-sentential)
- Translated Sentence (Target): Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.
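The match, align and recombine steps can be sketched in a few lines. This is a toy under stated assumptions, not the PANGLOSS or GEBMT implementation: the fragment table stands in for the result of sub-sentential alignment on the two matched examples, punctuation stays attached to tokens, and matching is greedy longest-first. `FRAGMENTS` and `translate_by_example` are hypothetical names.

```python
# Hypothetical aligned fragments, as if extracted from the two matched examples.
FRAGMENTS = {
    ("yesterday,", "200", "delegates", "met"): "Gestern trafen sich 200 Abgeordnete",
    ("with", "president", "clinton."): "mit Praesident Clinton.",
}


def translate_by_example(sentence, fragments):
    """Cover the input left to right with the longest known source fragments."""
    tokens = sentence.lower().split()
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span starting at position i first.
        for j in range(len(tokens), i, -1):
            piece = tuple(tokens[i:j])
            if piece in fragments:
                output.append(fragments[piece])
                i = j
                break
        else:
            # No example covers this token: pass it through (lowercased).
            output.append(tokens[i])
            i += 1
    return " ".join(output)


if __name__ == "__main__":
    print(translate_by_example(
        "Yesterday, 200 delegates met with President Clinton.", FRAGMENTS))
```

A real system would index a large example base and score competing fragment combinations rather than take the first greedy cover.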

GEBMT vs. Statistical MT
- Generalized-EBMT (GEBMT) uses examples at run time, rather than training a parameterized model. Thus:
  - GEBMT can work with a smaller parallel corpus than Stat MT
  - Large target language corpus still useful for generating target language model
  - Much faster to “train” (index examples) than Stat MT; until recently was much faster at run time as well
  - Generalizes in a different way than Stat MT (whether this is better or worse depends on the match between the statistical model and reality):
    - Stat MT can fail on a training sentence, while GEBMT never will
    - GEBMT generalizations are based on linguistic knowledge, rather than statistical model design

Multi-Engine MT
- Apply several MT engines to each input; use a statistical language modeller to select the best combination of outputs.
- Goal is to combine strengths and avoid weaknesses.
- Along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
- Used in various projects

MEMT chart example
Source: lideres politicos rusos firman pacto de paz civil
Chart entries by engine, with scores:
- KBMT: Russian leaders signed (0.8)
- EBMT: compact of peace (0.65), political leaders (0.9), compact of (0.7), of peace (1.0), civil peace (0.9)
- GLOSS: civilian (1.0), pact (1.0), civil (1.0), of (1.0)
- DICT (all 1.0): tactful, expedients, bargain, for, political, Russians, subscribe, pact, quiet, civilian, leaders, politic, Russian, sign, compact, of, peace, civil
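Selecting a combination of outputs from such a chart is a best-path search, which a short dynamic program can illustrate. The sketch below makes several loud assumptions: the source spans attached to each entry are guessed (the flattened slide does not preserve them), only a subset of entries is used, output is monotone with no reordering, and a fixed per-segment penalty stands in for the statistical language model that the real MEMT system uses. All names are hypothetical.

```python
import math

SOURCE = "lideres politicos rusos firman pacto de paz civil".split()

# (start, end, translation, engine, score); the spans are assumptions.
EDGES = [
    (0, 1, "leaders", "DICT", 1.0),
    (1, 2, "political", "DICT", 1.0),
    (2, 3, "Russian", "DICT", 1.0),
    (3, 4, "sign", "DICT", 1.0),
    (4, 5, "pact", "DICT", 1.0),
    (5, 6, "of", "DICT", 1.0),
    (6, 7, "peace", "DICT", 1.0),
    (7, 8, "civil", "DICT", 1.0),
    (0, 2, "political leaders", "EBMT", 0.9),
    (4, 6, "compact of", "EBMT", 0.7),
    (5, 7, "of peace", "EBMT", 1.0),
    (6, 8, "civil peace", "EBMT", 0.9),
    (4, 7, "compact of peace", "EBMT", 0.65),
]


def best_path(edges, n_tokens, segment_penalty=0.5):
    """Return the highest-scoring sequence of edges covering tokens 0..n_tokens."""
    best = [-math.inf] * (n_tokens + 1)
    back = [None] * (n_tokens + 1)
    best[0] = 0.0
    for end in range(1, n_tokens + 1):
        for edge in edges:
            start, edge_end, _, _, score = edge
            if edge_end != end or best[start] == -math.inf:
                continue
            # Log engine score minus a penalty that favors fewer, longer segments.
            candidate = best[start] + math.log(score) - segment_penalty
            if candidate > best[end]:
                best[end], back[end] = candidate, edge
    if back[n_tokens] is None:
        raise ValueError("no sequence of edges covers the whole input")
    # Walk the back-pointers from the end of the input to recover the path.
    path, pos = [], n_tokens
    while pos > 0:
        path.append(back[pos])
        pos = back[pos][0]
    return list(reversed(path))


if __name__ == "__main__":
    chosen = best_path(EDGES, len(SOURCE))
    print(" ".join(edge[2] for edge in chosen))
    print([edge[3] for edge in chosen])
```

The selected path mixes EBMT and DICT segments, which is the point of the approach; replacing the penalty with real language model scores over the target words is what would let the system prefer fluent combinations.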

Speech-to-Speech MT
- Speech just makes MT (much) more difficult:
  - Spoken language is messier
    - False starts, filled pauses, repetitions, out-of-vocabulary words
    - Lack of punctuation and explicit sentence boundaries
  - Current speech technology is far from perfect
- Need for speech recognition and synthesis in foreign languages
- Robustness: MT quality degradation should be proportional to speech recognition (SR) quality
- Tight integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?

MT at the LTI
- LTI originated as the Center for Machine Translation (CMT) in 1985
- MT continues to be a prominent sub-discipline of research within the LTI
  - More MT faculty than any of the other areas
  - More MT faculty than anywhere else
- Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT
- Leader in the area of speech-to-speech MT

KBMT: KANT, KANTOO, CATALYST
- Deep knowledge-based framework, with symbolic interlingua as intermediate representation
  - Syntactic and semantic analysis into an unambiguous detailed symbolic representation of meaning using unification grammars and transformation mappers
  - Generation into the target language using unification grammars and transformation mappers
- First large-scale multi-lingual interlingua-based MT system deployed commercially:
  - CATALYST at Caterpillar: high quality translation of documentation manuals for heavy equipment
- Limited domains and controlled English input
- Minor amounts of post-editing
- Active follow-on projects
- Contact Faculty: Eric Nyberg and Teruko Mitamura

EBMT
- Developed originally for the PANGLOSS system in the early 1990s
  - Translation between English and Spanish
- Generalized EBMT under development for the past several years
- Currently one of the two MT approaches developed at CMU for the DARPA/TIDES program
  - Chinese-to-English, large and very large amounts of sentence-aligned parallel data
- Active research work on improving alignment and indexing, decoding from a lattice
- Contact Faculty: Ralf Brown and Jaime Carbonell

Statistical MT
- Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model
- Extracted and trained from very large amounts of sentence-aligned parallel text
  - Word alignment algorithms
  - Phrase detection algorithms
  - Translation model probability estimation
- Main approach pursued in CMU systems in the DARPA/TIDES program:
  - Chinese-to-English and Arabic-to-English
- Most active work is on phrase detection and on advanced lattice decoding
- Contact Faculty: Stephan Vogel and Alex Waibel
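One common way to assign probabilities to extracted phrase pairs is relative-frequency estimation, sketched below. The slides do not say which estimator the CMU systems use, so treat this as a generic illustration; the extracted pairs are made-up Spanish-English toy data and the function name is hypothetical.

```python
from collections import Counter


def phrase_translation_probs(phrase_pairs):
    """Relative frequency: p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _ in phrase_pairs)
    return {
        (source, target): count / source_counts[source]
        for (source, target), count in pair_counts.items()
    }


# Hypothetical phrase pairs, as if produced by word alignment and phrase detection.
EXTRACTED = [
    ("pacto", "pact"),
    ("pacto", "pact"),
    ("pacto", "compact"),
    ("de paz", "of peace"),
]

if __name__ == "__main__":
    for pair, prob in sorted(phrase_translation_probs(EXTRACTED).items()):
        print(pair, round(prob, 3))
```

With so few observations the estimates are unreliable, which is one reason the approach depends on very large amounts of parallel text.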

Speech-to-Speech MT
- Evolution from JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON
  - Early 1990s: first prototype system that fully performed speech-to-speech translation (very limited domain)
  - Interlingua-based, but with shallow task-oriented representations:
    “we have single and double rooms available”
    [give-information+availability] (room-type={single, double})
  - Semantic Grammars for analysis and generation
  - Multiple languages: English, German, French, Italian, Japanese, Korean, and others
  - Most active work on portable speech translation on small devices: Arabic/English and Thai/English
  - Contact Faculty: Alan Black, Tanja Schultz and Alex Waibel (also Alon Lavie and Lori Levin)

AVENUE: Transfer-based MT
- Develop new approaches for automatically acquiring syntactic MT transfer rules from small amounts of elicited translated and word-aligned data
  - Specifically designed to bootstrap MT for languages for which only limited amounts of electronic resources are available (particularly indigenous minority languages)
  - Use machine learning techniques to generalize transfer rules from specific translated examples
  - Combine with decoding techniques from SMT for producing the best translation of new input from a lattice of translation segments
- Languages: Hebrew, Hindi, Mapudungun, Quechua
- Most active work on designing a typologically comprehensive elicitation corpus, advanced techniques for automatic rule learning, improved decoding, and rule refinement via user interaction
- Contact Faculty: Alon Lavie, Lori Levin and Jaime Carbonell

Transfer with Strong Decoding
Components shown in the diagram:
- Word-aligned elicited data
- Learning Module
- Transfer Rules, for example:
  {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1) (X1::Y2))
- Translation Lexicon
- Run Time Transfer System
- Lattice Decoder
- English Language Model
- Word-to-Word Translation Probabilities
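The example rule rewrites a source postpositional phrase [NP POSTP] as a target prepositional phrase [PREP NP]. The sketch below shows one reading of the alignment notation, assuming that X2::Y1 means "source constituent 2 fills target constituent 1" with 1-based indices. The `TransferRule` class, `apply_rule` function and the sample phrase are hypothetical and are not the AVENUE transfer engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    src_type: str
    tgt_type: str
    src_rhs: tuple
    tgt_rhs: tuple
    alignments: tuple  # (source index, target index) pairs, 1-based as in X2::Y1


# PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1) (X1::Y2))
PP_RULE = TransferRule(
    src_type="PP",
    tgt_type="PP",
    src_rhs=("NP", "POSTP"),
    tgt_rhs=("PREP", "NP"),
    alignments=((2, 1), (1, 2)),
)


def apply_rule(rule, translated_children):
    """Reorder already-translated source constituents into target order."""
    if len(translated_children) != len(rule.src_rhs):
        raise ValueError("expected one translation per source constituent")
    target = [None] * len(rule.tgt_rhs)
    for x, y in rule.alignments:
        target[y - 1] = translated_children[x - 1]
    # Unaligned target slots are simply skipped in this sketch.
    return " ".join(part for part in target if part is not None)


if __name__ == "__main__":
    # Assumed input: a noun phrase followed by a postposition, each already
    # translated word-for-word into English.
    print(apply_rule(PP_RULE, ["the house", "in"]))
```

In the architecture on this slide, many such rule applications would produce competing translation segments, and the lattice decoder would choose among them using the English language model.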

MT for Minority and Indigenous Languages: Challenges
- Minimal amount of parallel text
- Possibly competing standards for orthography/spelling
- Often relatively few trained linguists
- Access to native informants possible
- Need to minimize development time and cost

Learning Transfer-Rules for Languages with Limited Resources
- Rationale:
  - Large bilingual corpora not available
  - Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using an elicitation tool
  - Elicitation corpus designed to be typologically comprehensive and compositional
  - Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data

English-Hindi Example

Questions…