Machine Translation Overview Alon Lavie Language Technologies Institute Carnegie Mellon University August 25, 2004.


1 Machine Translation Overview
Alon Lavie
Language Technologies Institute, Carnegie Mellon University
August 25, 2004

2 LTI Immigration Course - Machine Translation: History
MT started in the 1940s, one of the first conceived applications of computers
Promising “toy” demonstrations in the 1950s failed miserably to scale up to “real” systems
ALPAC Report: MT recognized as an extremely difficult, “AI-complete” problem in the early 1960s
MT revival started in earnest in the 1980s (US, Japan)
Field dominated by rule-based approaches, requiring 100s of K-years of manual development
Economic incentive for developing MT systems for a small number of language pairs (mostly European languages)

3 Machine Translation: Where are we today?
Age of Internet and globalization: great demand for MT
  – Multiple official languages of UN, EU, Canada, etc.
  – Documentation dissemination for large manufacturers (Microsoft, IBM, Caterpillar)
Economic incentive is still primarily within a small number of language pairs
Some fairly good commercial products on the market for these language pairs
  – Primarily a product of rule-based systems after many years of development
Pervasive MT between most language pairs still non-existent and not on the immediate horizon

4 Best Current General-purpose MT
PAHO’s Spanam system:
Mediante petición recibida por la Comisión Interamericana de Derechos Humanos (en adelante …) el 6 de octubre de 1997, el señor Lino César Oviedo (en adelante …) denunció que la República del Paraguay (en adelante …) violó en su perjuicio los derechos a las garantías judiciales … en su contra.
Through petition received by the `Inter-American Commission on Human Rights` (hereinafter …) on 6 October 1997, Mr. Linen César Oviedo (hereinafter “the petitioner”) denounced that the Republic of Paraguay (hereinafter …) violated to his detriment the rights to the judicial guarantees, to the political participation, to // equal protection and to the honor and dignity consecrated in articles 8, 23, 24 and 11, respectively, of the `American Convention on Human Rights` (hereinafter …”), as a consequence of judgments initiated against it.

5 Core Challenges of MT
Ambiguity:
  – Human languages are highly ambiguous, and differently in different languages
  – Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms
Amount of required knowledge:
  – At least several 100k words, about as many phrases, plus syntactic knowledge (i.e. translation rules). How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

6 How to Tackle the Core Challenges
Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran’s RBMT systems.
Lots of Parallel Data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.
Learning Approaches: learn translation rules automatically from small amounts of human translated and word-aligned data. Example: AVENUE’s XFER approach.
Simplify the Problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!

7 State-of-the-Art in MT
What users want:
  – General purpose (any text)
  – High quality (human level)
  – Fully automatic (no user intervention)
We can meet any 2 of these 3 goals today, but not all three at once:
  – FA HQ: Knowledge-Based MT (KBMT)
  – FA GP: Corpus-Based (Example-Based) MT
  – GP HQ: Human-in-the-loop (efficiency tool)

8 Types of MT Applications
Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)
Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)
Communication: lower quality may be okay, but the input is degraded and real-time operation is required.

9 Approaches to MT: Vauquois MT Triangle
[Figure: triangle with Analysis rising on the source side and Generation descending on the target side; three levels from base to apex]
Direct: Mi chiamo Alon Lavie -> My name is Alon Lavie
Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] -> [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
Interlingua: Give-information+personal-data (name=alon_lavie)

10 Knowledge-based Interlingual MT
The “obvious” deep Artificial Intelligence approach:
  – Analyze the source language into a detailed symbolic representation of its meaning
  – Generate this meaning in the target language
“Interlingua”: one single meaning representation for all languages
  – Nice in theory, but extremely difficult in practice

11 The Interlingua KBMT approach
With interlingua, need only N parsers/generators instead of N^2 transfer systems
[Figure: languages L1 to L6 connected through a central interlingua]
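
The counting argument on this slide can be checked with a few lines of Python. This is only an illustrative sketch under two assumptions that the slide does not spell out: "N parsers/generators" is read as one analyzer plus one generator per language, and "N^2 transfer systems" is read as one directed transfer system per ordered language pair, which is N(N-1) exactly and grows on the order of N^2. The function names are made up for this example.

```python
def interlingua_components(n_languages: int) -> int:
    """Components needed with an interlingua (assumed reading of the slide):
    one analyzer into the interlingua and one generator out of it per language."""
    return 2 * n_languages


def transfer_systems(n_languages: int) -> int:
    """Directed transfer systems needed without an interlingua:
    one per ordered (source, target) pair of distinct languages."""
    return n_languages * (n_languages - 1)


if __name__ == "__main__":
    # Compare growth for a few language counts, including the six on the slide.
    for n in (2, 6, 20):
        print(n, interlingua_components(n), transfer_systems(n))
```

The gap widens quickly: the interlingua count grows linearly with the number of languages, the pairwise transfer count quadratically.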

12 Statistical MT (SMT)
Proposed by IBM in early 1990s: a direct, purely statistical model for MT
Statistical translation models are trained on a sentence-aligned translation corpus
Attractive: completely automatic, no manual rules, much reduced manual labor
Main drawbacks:
  – Effective only with huge volumes (several mega-words) of parallel text
  – Very domain-sensitive
  – Still viable only for small number of language pairs!
Impressive progress in last 3-4 years due to large DARPA funding program (TIDES)

13 EBMT Paradigm
New Sentence (Source): Yesterday, 200 delegates met with President Clinton.
Matches to Source Found:
  – Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  – Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…
Alignment (Sub-sentential)
Translated Sentence (Target): Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.
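
The EBMT paradigm on this slide can be sketched as a greedy longest-match over aligned fragments. This is a simplified illustration, not the PANGLOSS or GEBMT implementation: the `FRAGMENTS` table is a hypothetical stand-in for the output of sub-sentential alignment over the two matched examples, and the function names are invented for this sketch.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase and keep word-like tokens only (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())


# Hypothetical aligned fragments, as sub-sentential alignment of the two
# matched examples on the slide might produce them (an assumption).
FRAGMENTS = {
    tuple(tokenize("Yesterday, 200 delegates met")): "Gestern trafen sich 200 Abgeordnete",
    tuple(tokenize("with President Clinton")): "mit Praesident Clinton",
}


def translate_by_examples(sentence: str, fragments: dict) -> str:
    """Cover the input left to right with the longest known source fragment
    at each position; tokens with no match are passed through unchanged."""
    tokens = tokenize(sentence)
    output = []
    i = 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):  # try the longest span first
            key = tuple(tokens[i:j])
            if key in fragments:
                output.append(fragments[key])
                i = j
                break
        else:
            output.append(tokens[i])  # no example covers this token
            i += 1
    return " ".join(output)


if __name__ == "__main__":
    print(translate_by_examples(
        "Yesterday, 200 delegates met with President Clinton.", FRAGMENTS))
```

A real system would score competing fragment combinations rather than commit greedily, but the sketch shows how a new sentence is assembled from pieces of previously seen translations.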

14 GEBMT vs. Statistical MT
Generalized-EBMT (GEBMT) uses examples at run time, rather than training a parameterized model. Thus:
  – GEBMT can work with a smaller parallel corpus than Stat MT
  – Large target language corpus still useful for generating target language model
  – Much faster to “train” (index examples) than Stat MT; until recently was much faster at run time as well
  – Generalizes in a different way than Stat MT (whether this is better or worse depends on match between statistical model and reality):
      Stat MT can fail on a training sentence, while GEBMT never will
      GEBMT generalizations based on linguistic knowledge, rather than statistical model design

15 Multi-Engine MT
Apply several MT engines to each input; use a statistical language modeller to select the best combination of outputs.
Goal is to combine strengths and avoid weaknesses
  – Along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
Used in various projects

16 MEMT chart example
Source: lideres politicos rusos firman pacto de paz civil
Chart entries (translation, engine, score), in the order they appear on the slide; span positions are not recoverable from the transcript:
Russian leaders signed    KBMT (0.8)
compact of peace          EBMT (0.65)
political leaders         EBMT (0.9)
compact of                EBMT (0.7)
civilian                  GLOSS (1.0)
tactful                   DICT (1.0)
pact                      GLOSS (1.0)
of peace                  EBMT (1.0)
civil                     GLOSS (1.0)
expedients                DICT (1.0)
bargain                   DICT (1.0)
for                       DICT (1.0)
civil peace               EBMT (0.9)
political                 DICT (1.0)
Russians                  DICT (1.0)
subscribe                 DICT (1.0)
pact                      DICT (1.0)
of                        GLOSS (1.0)
quiet                     DICT (1.0)
civilian                  DICT (1.0)
leaders                   DICT (1.0)
politic                   DICT (1.0)
Russian                   DICT (1.0)
sign                      DICT (1.0)
compact                   DICT (1.0)
of                        DICT (1.0)
peace                     DICT (1.0)
civil                     DICT (1.0)
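
Selecting one path through a multi-engine chart like this one can be sketched as a shortest-path search over spans. Several things here are assumptions made purely for illustration: the span positions attached to each entry (the transcript does not preserve them), the subset of entries used, and the cost function, where a fixed per-edge penalty stands in for the statistical language model that the actual MEMT approach uses to prefer fluent combinations.

```python
# Hypothetical chart edges: (start, end, translation, engine, score).
# Spans index the source tokens:
#   lideres politicos rusos firman pacto de paz civil
SOURCE_LENGTH = 8
CHART = [
    (0, 1, "leaders", "DICT", 1.0),
    (1, 2, "political", "DICT", 1.0),
    (0, 2, "political leaders", "EBMT", 0.9),
    (2, 3, "Russian", "DICT", 1.0),
    (3, 4, "sign", "DICT", 1.0),
    (4, 5, "pact", "DICT", 1.0),
    (5, 6, "of", "DICT", 1.0),
    (6, 7, "peace", "DICT", 1.0),
    (5, 7, "of peace", "EBMT", 1.0),
    (4, 7, "compact of peace", "EBMT", 0.65),
    (7, 8, "civil", "DICT", 1.0),
    (6, 8, "civil peace", "EBMT", 0.9),
]

# Assumed stand-in for a language model: every edge pays a fixed penalty,
# so covers built from fewer, longer segments are preferred.
EDGE_PENALTY = 0.5


def edge_cost(start: int, end: int, score: float) -> float:
    """Cost grows with low engine confidence and with the number of edges."""
    return (1.0 - score) * (end - start) + EDGE_PENALTY


def best_cover(chart, length):
    """Dynamic program over end positions: best[i] is the cheapest way to
    cover source tokens [0, i) with non-overlapping chart edges."""
    best = [float("inf")] * (length + 1)
    back = [None] * (length + 1)
    best[0] = 0.0
    for end in range(1, length + 1):
        for edge in chart:
            start, edge_end, _text, _engine, score = edge
            if edge_end != end or best[start] == float("inf"):
                continue
            cost = best[start] + edge_cost(start, edge_end, score)
            if cost < best[end]:
                best[end] = cost
                back[end] = edge
    if back[length] is None:
        raise ValueError("chart does not cover the whole input")
    path = []
    position = length
    while position > 0:  # follow back-pointers from the end of the input
        edge = back[position]
        path.append(edge)
        position = edge[0]
    path.reverse()
    return path


if __name__ == "__main__":
    for start, end, text, engine, score in best_cover(CHART, SOURCE_LENGTH):
        print(f"[{start},{end}) {text!r} from {engine} ({score})")
```

With a real language model the edge penalty would be replaced by a fluency score over the concatenated output, which is what lets the combination beat any single engine.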

17 Speech-to-Speech MT
Speech just makes MT (much) more difficult:
  – Spoken language is messier
      False starts, filled pauses, repetitions, out-of-vocabulary words
      Lack of punctuation and explicit sentence boundaries
  – Current speech technology is far from perfect
Need for speech recognition and synthesis in foreign languages
Robustness: MT quality degradation should be proportional to SR quality
Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?

18 MT at the LTI
LTI originated as the Center for Machine Translation (CMT) in 1985
MT continues to be a prominent sub-discipline of research within the LTI
  – More MT faculty than any of the other areas
  – More MT faculty than anywhere else
Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT
Leader in the area of speech-to-speech MT

19 KBMT: KANT, KANTOO, CATALYST
Deep knowledge-based framework, with symbolic interlingua as intermediate representation
  – Syntactic and semantic analysis into an unambiguous detailed symbolic representation of meaning using unification grammars and transformation mappers
  – Generation into the target language using unification grammars and transformation mappers
First large-scale multi-lingual interlingua-based MT system deployed commercially:
  – CATALYST at Caterpillar: high quality translation of documentation manuals for heavy equipment
Limited domains and controlled English input
Minor amounts of post-editing
Active follow-on projects
Contact Faculty: Eric Nyberg and Teruko Mitamura

20 EBMT
Developed originally for the PANGLOSS system in the early 1990s
  – Translation between English and Spanish
Generalized EBMT under development for the past several years
Currently one of the two MT approaches developed at CMU for the DARPA/TIDES program
  – Chinese-to-English, large and very large amounts of sentence-aligned parallel data
Active research work on improving alignment and indexing, decoding from a lattice
Contact Faculty: Ralf Brown and Jaime Carbonell

21 Statistical MT
Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model
Extracted and trained from very large amounts of sentence-aligned parallel text
  – Word alignment algorithms
  – Phrase detection algorithms
  – Translation model probability estimation
Main approach pursued in CMU systems in the DARPA/TIDES program:
  – Chinese-to-English and Arabic-to-English
Most active work is on phrase detection and on advanced lattice decoding
Contact Faculty: Stephan Vogel and Alex Waibel
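
The "translation model probability estimation" step listed on this slide can be illustrated with relative-frequency estimation over extracted phrase pairs. This is a minimal sketch of one common, simple estimator; the slide does not say which estimator the CMU systems use, and the phrase pairs below are invented toy data.

```python
from collections import Counter, defaultdict


def estimate_phrase_table(phrase_pairs):
    """Relative-frequency estimate p(target | source) =
    count(source, target) / count(source), over extracted phrase pairs."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _target in phrase_pairs)
    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        table[source][target] = count / source_counts[source]
    return table


# Toy phrase pairs, as word alignment and phrase detection might extract
# them from a sentence-aligned corpus (hypothetical data).
PHRASE_PAIRS = [
    ("pacto", "pact"),
    ("pacto", "pact"),
    ("pacto", "compact"),
    ("de paz", "of peace"),
]

if __name__ == "__main__":
    for source, targets in estimate_phrase_table(PHRASE_PAIRS).items():
        for target, probability in targets.items():
            print(f"p({target!r} | {source!r}) = {probability:.3f}")
```

Estimates like these are only as good as the counts behind them, which is one reason the approach needs very large amounts of parallel text.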

22 Speech-to-Speech MT
Evolution from JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON
  – Early 1990s: first prototype system that fully performed sp-to-sp (very limited domain)
  – Interlingua-based, but with shallow task-oriented representations:
      “we have single and double rooms available”
      [give-information+availability] (room-type={single, double})
  – Semantic Grammars for analysis and generation
  – Multiple languages: English, German, French, Italian, Japanese, Korean, and others
  – Most active work on portable speech translation on small devices: Arabic/English and Thai/English
  – Contact Faculty: Alan Black, Tanja Schultz and Alex Waibel (also Alon Lavie and Lori Levin)

23 AVENUE: Transfer-based MT
Develop new approaches for automatically acquiring syntactic MT transfer rules from small amounts of elicited translated and word-aligned data
  – Specifically designed to bootstrap MT for languages for which only limited amounts of electronic resources are available (particularly indigenous minority languages)
  – Use machine learning techniques to generalize transfer rules from specific translated examples
  – Combine with decoding techniques from SMT for producing the best translation of new input from a lattice of translation segments
Languages: Hebrew, Hindi, Mapudungun, Quechua
Most active work on designing a typologically comprehensive elicitation corpus, advanced techniques for automatic rule learning, improved decoding, and rule refinement via user interaction
Contact Faculty: Alon Lavie, Lori Levin and Jaime Carbonell

24 Transfer with Strong Decoding
[Figure: system diagram with the components Word-aligned elicited data, Learning Module, Transfer Rules, Translation Lexicon, Run Time Transfer System, Lattice Decoder, English Language Model, Word-to-Word Translation Probabilities]
Example transfer rule shown in the diagram:
{PP,4894} ;;Score:0.0470
PP::PP [NP POSTP] -> [PREP NP]
((X2::Y1) (X1::Y2))
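
The reordering expressed by the PP rule on this slide can be sketched in a few lines. This is an illustrative reading of the rule notation, not the AVENUE transfer engine: it assumes that (X2::Y1) means the second source constituent fills the first target slot, that the children have already been translated lexically, and the class, function, and example phrase are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    source_rhs: tuple   # source-side constituent categories
    target_rhs: tuple   # target-side constituent categories
    alignments: tuple   # (x, y) pairs, 1-based: source slot x fills target slot y


# PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1) (X1::Y2))
PP_RULE = TransferRule(
    source_rhs=("NP", "POSTP"),
    target_rhs=("PREP", "NP"),
    alignments=((2, 1), (1, 2)),
)


def apply_rule(rule, children):
    """Reorder already-translated source children into target order.

    `children` is a list of (category, translated_text) pairs matching the
    rule's source side; the result uses the target-side categories.
    Target slots with no alignment are left as None.
    """
    categories = tuple(category for category, _text in children)
    if categories != rule.source_rhs:
        raise ValueError(f"rule expects {rule.source_rhs}, got {categories}")
    target = [None] * len(rule.target_rhs)
    for x, y in rule.alignments:
        _category, text = children[x - 1]
        target[y - 1] = (rule.target_rhs[y - 1], text)
    return target


if __name__ == "__main__":
    # Hypothetical postpositional phrase, children already glossed in English.
    print(apply_rule(PP_RULE, [("NP", "the house"), ("POSTP", "in")]))
```

In the full system many such rules apply at once and produce a lattice of alternatives, which the decoder then searches using the language model.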

25 MT for Minority and Indigenous Languages: Challenges
Minimal amount of parallel text
Possibly competing standards for orthography/spelling
Often relatively few trained linguists
Access to native informants possible
Need to minimize development time and cost

26 Learning Transfer-Rules for Languages with Limited Resources
Rationale:
  – Large bilingual corpora not available
  – Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using elicitation tool
  – Elicitation corpus designed to be typologically comprehensive and compositional
  – Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data

27 English-Hindi Example

28 Questions…

