Machine Translation Overview Alon Lavie Language Technologies Institute Carnegie Mellon University LTI Immigration Course August 16, 2010.

Presentation transcript:

Machine Translation Overview Alon Lavie Language Technologies Institute Carnegie Mellon University LTI Immigration Course August 16, 2010

Machine Translation: History
1946: MT is one of the first conceived applications of modern computers (A.D. Booth, Alan Turing)
1954: The “Georgetown Experiment”. Promising “toy” demonstrations of Russian-English MT
Late 1950s and early 1960s: MT fails to scale up to “real” systems
1966: ALPAC Report: MT recognized as an extremely difficult, “AI-complete” problem. Funding disappears
1968: SYSTRAN founded
1985: CMU “Center for Machine Translation” (CMT) founded
Late 1980s and early 1990s: Field dominated by rule-based approaches – KBMT, KANT, Eurotra, etc.
1992: “Noisy Channel” Statistical MT models invented by IBM researchers (Brown, Della Pietra, et al.). CANDIDE
Mid 1990s: First major DARPA MT Program. PANGLOSS
Late 1990s: Major Speech-to-Speech MT demonstrations: C-STAR
1999: JHU Summer Workshop results in GIZA
2000s: Large DARPA Funding Programs – TIDES and GALE
2003: Och et al. introduce Phrase-based SMT. PHARAOH
2006: Google Translate is launched
2007: Koehn et al. release MOSES

Machine Translation: Where are we today?
Age of Internet and Globalization – great demand for translation services and MT:
  – Multiple official languages of UN, EU, Canada, etc.
  – Documentation dissemination for large manufacturers (Microsoft, IBM, Intel, Apple, Caterpillar, US Steel, ALCOA, etc.)
  – Language and translation services business sector estimated at $15 Billion worldwide in 2008 and growing at a healthy pace
Economic incentive is still primarily within a small number of language pairs
Some fairly decent commercial products in the market for these language pairs
  – Primarily a product of rule-based systems after many years of development
  – New generation of data-driven “statistical” MT: Google, Language Weaver
Web-based (mostly free) MT services: Google, Babelfish, others…
Pervasive MT between many language pairs still non-existent, but Google is trying to change that!

How Does MT Work?
All modern MT approaches are based on building translations for complete sentences by putting together smaller pieces of translation
Core Questions:
  – What are these smaller pieces of translation? Where do they come from?
  – How does MT put these pieces together?
  – How does the MT system pick the correct (or best) translation among many options?

Core Challenges of MT
Ambiguity and Language Divergences:
  – Human languages are highly ambiguous, and differently in different languages
  – Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific constructions and idioms
Amount of required knowledge:
  – Translation equivalencies for vast vocabularies (several 100k words and phrases)
  – Syntactic knowledge (how to map syntax of one language to another), plus more complex language divergences (semantic differences, constructions and idioms, etc.)
  – How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

Rule-based vs. Data-driven Approaches to MT
What are the pieces of translation? Where do they come from?
  – Rule-based: large-scale “clean” word translation lexicons, manually constructed over time by experts
  – Data-driven: broad-coverage word and multi-word translation lexicons, learned automatically from available sentence-parallel corpora
How does MT put these pieces together?
  – Rule-based: large collections of rules, manually developed over time by human experts, that map structures from the source to the target language
  – Data-driven: a computer algorithm that explores millions of possible ways of putting the small pieces together, looking for the translation that statistically looks best

Rule-based vs. Data-driven Approaches to MT
How does the MT system pick the correct (or best) translation among many options?
  – Rule-based: human experts encode preferences among the rules, designed to prefer creation of better translations
  – Data-driven: a variety of fitness and preference scores, many of which can be learned from available training data, are used to model a total score for each of the millions of possible translation candidates; the algorithm then selects and outputs the best-scoring translation
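The data-driven selection step can be pictured as a weighted sum of feature scores followed by an argmax. The sketch below is only an illustration under stated assumptions: the candidate sentences, the feature names (translation_model, language_model, length_penalty), the feature values and the weights are all made up, and a real system would learn the weights from development data.

```python
# Hypothetical candidates with made-up log-domain feature scores.
candidates = {
    "the house is small": {"translation_model": -1.2, "language_model": -2.0, "length_penalty": -0.1},
    "the home is little": {"translation_model": -1.5, "language_model": -3.5, "length_penalty": -0.1},
    "small is the house": {"translation_model": -1.2, "language_model": -5.0, "length_penalty": -0.1},
}

# Hypothetical feature weights (in practice tuned on development data).
weights = {"translation_model": 1.0, "language_model": 0.8, "length_penalty": 0.5}


def total_score(features, weights):
    """Weighted sum of the feature scores of one candidate."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Return the candidate with the highest total score."""
    return max(candidates, key=lambda c: total_score(candidates[c], weights))


print(best_translation(candidates, weights))
```

With only three candidates the argmax is trivial; the hard part in a real system is that the candidate set is far too large to enumerate, which is why the search has to be organized by a decoder.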

Rule-based vs. Data-driven Approaches to MT
Why have the data-driven approaches become so popular?
  – We can now do this!
      Increasing amounts of sentence-parallel data are constantly being created on the web
      Advances in machine learning algorithms
      Computational power of today’s computers can train systems on these massive amounts of data and can perform these massive search-based translation computations when translating new texts
  – Building and maintaining rule-based systems is too difficult, expensive and time-consuming
  – In many scenarios, it actually works better!

Statistical MT (SMT)
Data-driven, most dominant approach in current MT research
Proposed by IBM in early 1990s: a direct, purely statistical, model for MT
Evolved from word-level translation to phrase-based translation
Main Ideas:
  – Training: statistical “models” of word and phrase translation equivalence are learned automatically from bilingual parallel sentences, creating a bilingual “database” of translations
  – Decoding: new sentences are translated by a program (the decoder), which matches the source words and phrases with the database of translations, and searches the “space” of all possible translation combinations.
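The decoding idea can be sketched with a toy phrase table and a dynamic program over source positions. This is a deliberately simplified illustration, not how any particular decoder is implemented: the phrase table entries and probabilities are invented, translation is strictly left to right (no reordering), and there is no language model.

```python
import math

# Hypothetical toy phrase table: source phrase -> list of (target phrase, probability).
PHRASE_TABLE = {
    ("das",): [("the", 0.7), ("that", 0.3)],
    ("haus",): [("house", 0.8), ("home", 0.2)],
    ("das", "haus"): [("the house", 0.9)],
    ("ist",): [("is", 1.0)],
    ("klein",): [("small", 0.6), ("little", 0.4)],
}


def decode_monotone(source_words, phrase_table, max_phrase_len=3):
    """Find the highest-probability monotone segmentation and translation.

    best[i] holds (log probability, target phrases) for the best way to
    translate the first i source words, or None if they cannot be covered.
    """
    n = len(source_words)
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_phrase_len), end):
            if best[start] is None:
                continue
            source_phrase = tuple(source_words[start:end])
            for target_phrase, prob in phrase_table.get(source_phrase, []):
                score = best[start][0] + math.log(prob)
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + [target_phrase])
    if best[n] is None:
        return None  # some source word has no entry in the phrase table
    return " ".join(best[n][1])


print(decode_monotone("das haus ist klein".split(), PHRASE_TABLE))
```

Here the two-word entry for "das haus" (0.9) beats combining the single-word entries (0.7 × 0.8 = 0.56), which is the basic reason phrase-level units help over word-level ones.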

Statistical MT (SMT)
Main steps in training phrase-based statistical MT:
  – Create a sentence-aligned parallel corpus
  – Word Alignment: train word-level alignment models (GIZA++)
  – Phrase Extraction: extract phrase-to-phrase translation correspondences using heuristics (Moses)
  – Minimum Error Rate Training (MERT): optimize translation system parameters on development data to achieve best translation performance
Attractive: completely automatic, no manual rules, much reduced manual labor
Main drawbacks:
  – Translation accuracy levels vary widely
  – Effective only with large volumes (several mega-words) of parallel text
  – Broad domain, but domain-sensitive
  – Viable only for limited number of language pairs!
Impressive progress in last 5-10 years!
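To give a feel for the word alignment step, here is a minimal sketch of EM training for IBM Model 1, the simplest of the word-based IBM models. It is an illustration only: the three-sentence German-English corpus is a toy, there is no NULL word, and tools such as GIZA++ train considerably richer models than this.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=20):
    """Estimate t[(target_word, source_word)] with EM (IBM Model 1, no NULL word).

    corpus is a list of (source_words, target_words) sentence pairs.
    """
    # Start every co-occurring pair at the same value; the first E-step
    # normalizes, so any positive constant gives a uniform start.
    t = {}
    for src, tgt in corpus:
        for f in src:
            for e in tgt:
                t[(e, f)] = 1.0

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (e, f) links
        total = defaultdict(float)  # expected count of links from f
        # E-step: distribute each target word over the source words.
        for src, tgt in corpus:
            for e in tgt:
                norm = sum(t[(e, f)] for f in src)
                for f in src:
                    fraction = t[(e, f)] / norm
                    count[(e, f)] += fraction
                    total[f] += fraction
        # M-step: re-estimate translation probabilities.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


def most_likely_translation(t, source_word):
    """Target word with the highest probability for one source word."""
    options = {e: p for (e, f), p in t.items() if f == source_word}
    return max(options, key=options.get)


corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
t = train_ibm_model1(corpus)
for word in ("das", "haus", "buch", "ein"):
    print(word, "->", most_likely_translation(t, word))
```

No word-level links are given in the input; the model works them out from co-occurrence alone, because "das" appears with "the" in two sentence pairs while "haus" appears in only one.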

Statistical MT: Major Challenges
Current approaches are too naïve and “direct”:
  – Good at learning word-to-word and phrase-to-phrase correspondences from data
  – Not good enough at learning how to combine these pieces and reorder them properly during translation
  – Learning general rules requires much more complicated algorithms and computer processing of the data
  – The space of translations that is “searched” often doesn’t contain a perfect translation
  – The fitness scores that are used aren’t good enough to always assign better scores to the better translations → we don’t always find the best translation even when it’s there!
  – MERT is brittle, problematic and metric-dependent!
Solutions:
  – Google solution: more and more data!
  – Research solution: “smarter” algorithms and learning methods

Rule-based vs. Data-driven MT
Rule-based: We thank all participants of the whole world for their comical and creative drawings; to choose the victors was not easy task! Click here to see work of winning European of these two months, and use it to look at what the winning of USA sent us.
Data-driven: We thank all the participants from around the world for their designs cocasses and creative; selecting winners was not easy! Click here to see the artwork of winners European of these two months, and disclosure to look at what the winners of the US have been sending.

Representative Example: Google Translate

Major Sources of Translation Problems
Lexical Differences:
  – Multiple possible translations for SL word, or difficulties expressing SL word meaning in a single TL word
Structural Differences:
  – Syntax of SL is different than syntax of the TL: word order, sentence and constituent structure
Differences in Mappings of Syntax to Semantics:
  – Meaning in TL is conveyed using a different syntactic structure than in the SL
Idioms and Constructions

How to Tackle the Core Challenges
Manual Labor: 1000s of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran’s RBMT systems.
Lots of Parallel Data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: Statistical MT systems.
Learning Approaches: learn translation rules automatically from small amounts of human translated and word-aligned data. Example: AVENUE’s Statistical XFER approach.
Simplify the Problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!.

State-of-the-Art in MT
What users want:
  – General purpose (any text)
  – High quality (human level)
  – Fully automatic (no user intervention)
We can meet any 2 of these 3 goals today, but not all three at once:
  – FA HQ: Knowledge-Based MT (KBMT)
  – FA GP: Corpus-Based (Example-Based) MT
  – GP HQ: Human-in-the-loop (Post-editing)

Types of MT Applications:
Assimilation: multiple source languages, uncontrolled style/topic. General purpose MT, no semantic analysis. (GP FA or GP HQ)
Dissemination: one source language, controlled style, single topic/domain. Special purpose MT, full semantic analysis. (FA HQ)
Communication: Lower quality may be okay, but system robustness, real-time required.

Approaches to MT: Vauquois MT Triangle
Interlingua: Give-information+personal-data (name=alon_lavie)
Transfer: [s [vp accusative_pronoun “chiamare” proper_name]] → [s [np [possessive_pronoun “name”]] [vp “be” proper_name]]
Direct: Mi chiamo Alon Lavie → My name is Alon Lavie
Analysis runs up the source side of the triangle; Generation runs down the target side

Knowledge-based Interlingual MT
The classic “deep” Artificial Intelligence approach:
  – Analyze the source language into a detailed symbolic representation of its meaning
  – Generate this meaning in the target language
“Interlingua”: one single meaning representation for all languages
  – Nice in theory, but extremely difficult in practice:
      What kind of representation?
      What is the appropriate level of detail to represent?
      How to ensure that the interlingua is in fact universal?

Interlingua versus Transfer
With interlingua, need only N parsers/generators instead of N² transfer systems
[Figure: languages L1 to L6 connected pairwise (transfer) versus through a central interlingua]
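The scaling argument can be made concrete with a small count. The slide's N and N² are order-of-magnitude figures; the exact counts below rest on two assumptions that are not in the slide: each language needs one analyzer plus one generator under the interlingua design, and the transfer design needs one system per ordered language pair.

```python
def transfer_systems(n):
    """Transfer systems for n languages: one per ordered (source, target) pair."""
    return n * (n - 1)


def interlingua_components(n):
    """Interlingua design: one analyzer (parser) and one generator per language."""
    return 2 * n


for n in (3, 6, 20):
    print(f"{n} languages: {transfer_systems(n)} transfer systems "
          f"vs {interlingua_components(n)} analyzers/generators")
```

For the six languages in the figure this gives 30 transfer systems against 12 interlingua components, and the gap widens quadratically as languages are added.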

Multi-Engine MT
Apply several MT engines to each input in parallel
Create a combined translation from the individual translations
Goal is to combine strengths, and avoid weaknesses, along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
Various approaches to the problem

Speech-to-Speech MT
Speech just makes MT (much) more difficult:
  – Spoken language is messier
      False starts, filled pauses, repetitions, out-of-vocabulary words
      Lack of punctuation and explicit sentence boundaries
  – Current Speech technology is far from perfect
Need for speech recognition and synthesis in foreign languages
Robustness: MT quality degradation should be proportional to SR quality
Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?

MT at the LTI
LTI originated as the Center for Machine Translation (CMT) in 1985
MT continues to be a prominent sub-discipline of research within the LTI
  – More MT faculty than any of the other areas
  – More MT faculty than anywhere else
Active research on all main approaches to MT: Interlingua, Transfer, EBMT, SMT
Leader in the area of speech-to-speech MT
Multi-Engine MT (MEMT)
MT Evaluation (METEOR)

MT Faculty at LTI
Alon Lavie, Stephan Vogel, Ralf Brown, Jaime Carbonell, Lori Levin, Noah Smith, Alan Black, Florian Metze, Alex Waibel, Teruko Mitamura, Eric Nyberg

Phrase-based Statistical MT
Word-to-word and phrase-to-phrase translation pairs are acquired automatically from data and assigned probabilities based on a statistical model
Extracted and trained from very large amounts of sentence-aligned parallel text
  – Word alignment algorithms
  – Phrase detection algorithms
  – Translation model probability estimation
Main approach pursued in CMU systems in the DARPA/TIDES program and now in GALE
  – Chinese-to-English and Arabic-to-English
Most active work is on improved word alignment, phrase extraction and advanced decoding techniques
Contact Faculty: Stephan Vogel

CMU Statistical Transfer (Stat-XFER) MT Approach
Integrate the major strengths of rule-based and statistical MT within a common syntax-based statistically-driven framework:
  – Linguistically rich formalism that can express complex and abstract compositional transfer rules
  – Rules can be written by human experts and also acquired automatically from data
  – Easy integration of morphological analyzers and generators
  – Word and syntactic-phrase correspondences can be automatically acquired from parallel text
  – Search-based decoding from statistical MT adapted to find the best translation within syntax-driven search space: multi-feature scoring, beam-search, parameter optimization, etc.
  – Framework suitable for both resource-rich and resource-poor language scenarios
Most active work on phrase and rule acquisition from parallel data, efficient decoding, joint decoding with non-syntactic phrases, effective syntactic modeling, MT for low-resource languages
Contact Faculty: Alon Lavie, Lori Levin, Bob Frederking and Jaime Carbonell

EBMT
Developed originally for the PANGLOSS system in the early 1990s
  – Translation between English and Spanish
Generalized EBMT under development for the past several years
Used in a variety of projects in recent years
  – DARPA TIDES and GALE programs
  – DIPLOMAT and TONGUES
Active research work on improving alignment and indexing, decoding from a lattice
Contact Faculty: Ralf Brown and Jaime Carbonell

Speech-to-Speech MT
Evolution from JANUS/C-STAR systems to NESPOLE!, LingWear, BABYLON, TRANSTAC
  – Early 1990s: first prototype system that fully performed sp-to-sp (very limited domains)
  – Interlingua-based, but with shallow task-oriented representations:
      “we have single and double rooms available”
      [give-information+availability] (room-type={single, double})
  – Semantic Grammars for analysis and generation
  – Multiple languages: English, German, French, Italian, Japanese, Korean, and others
  – Phrase-based SMT applied in Speech-to-Speech scenarios
  – Most active work on portable speech translation on small devices: Iraqi-Arabic/English and Thai/English
  – Contact Faculty: Alan Black, Stephan Vogel, Florian Metze and Alex Waibel

KBMT: KANT, KANTOO, CATALYST
Deep knowledge-based framework, with symbolic interlingua as intermediate representation
  – Syntactic and semantic analysis into an unambiguous detailed symbolic representation of meaning using unification grammars and transformation mappers
  – Generation into the target language using unification grammars and transformation mappers
First large-scale multi-lingual interlingua-based MT system deployed commercially:
  – CATALYST at Caterpillar: high quality translation of documentation manuals for heavy equipment
Limited domains and controlled English input
Minor amounts of post-editing
Some active follow-on projects
Contact Faculty: Eric Nyberg and Teruko Mitamura

Multi-Engine MT
Decoding-based approach developed in recent years under DoD and DARPA funding (used in GALE)
Main ideas:
  – Treat original engines as “black boxes”
  – Align the word and phrase correspondences between the translations
  – Build a collection of synthetic combinations based on the aligned words and phrases
  – Score the synthetic combinations based on a variety of features, including n-gram support and Language Model
  – Parameter Tuning: Learn optimal weights using MERT
  – Select the top-scoring synthetic combination
Architecture Issues: integrating “workflows” that produce multiple translations and then combine them with MEMT
  – IBM’s UIMA architecture
Contact Faculty: Alon Lavie

Automated MT Evaluation
METEOR: Automated metric developed at CMU
Improves upon BLEU metric developed by IBM and used extensively in recent years
Main ideas:
  – Assess the similarity between a machine-produced translation and (several) human reference translations
  – Similarity is based on word-to-word matching that matches:
      Identical words
      Morphological variants of same word (stemming)
      Synonyms and paraphrases
  – Address fluency/grammaticality via a direct penalty: how well-ordered is the matching of the MT output with the reference?
  – Tunable Weights: Weights for Precision, Recall, Fragmentation are tuned for optimal correlation with human judgments
Outcome: Much improved levels of correlation!
Contact Faculty: Alon Lavie

Safaba Translation Solutions
Recent CMU commercial spin-off company in the MT area
Mission: Develop and deliver advanced translation automation software solutions for the commercial translation business sector
Target Clients: Language Services Providers (LSPs) and their enterprise clients
Primary Service:
  – Software-as-a-Service customized MT
Technology: Develop specialized highly-scalable software for delivering high-quality client-customized Machine Translation (MT) based on a low-cost SaaS model
Other Related Services:
  – Consulting Services: Analyze LSP/client translation processes and technologies and advise clients on effective solutions for increasing their translation automation
  – Software Implementation Services: Design and implement custom translation automation solutions for LSPs and/or enterprise clients
Contact Faculty: Alon Lavie

Summary
Main challenges for current state-of-the-art MT approaches - Coverage and Accuracy:
  – Acquiring broad-coverage high-accuracy translation lexicons (for words and phrases)
  – Learning structural mappings between languages from parallel word-aligned data
  – Overcoming syntax-to-semantics differences and dealing with constructions
  – Stronger Target Language Modeling
  – Novel algorithms for model acquisition and decoding

Questions…

Lexical Differences
SL word has several different meanings, that translate differently into TL
  – Ex: financial bank vs. river bank
Lexical Gaps: SL word reflects a unique meaning that cannot be expressed by a single word in TL
  – Ex: English snub doesn’t have a corresponding verb in French or German
TL has finer distinctions than SL → SL word should be translated differently in different contexts
  – Ex: English wall can be German Wand (internal), Mauer (external)

Structural Differences
Syntax of SL is different than syntax of the TL:
  – Word order within constituents:
      English NPs: art adj n (the big boy)
      Hebrew NPs: art n art adj (ha yeled ha gadol)
  – Constituent structure:
      English is SVO: Subj Verb Obj (I saw the man)
      Modern Arabic is VSO: Verb Subj Obj
  – Different verb syntax:
      Verb complexes in English vs. in German: I can eat the apple / Ich kann den Apfel essen
  – Case marking and free constituent order:
      German and other languages that mark case: den Apfel esse ich (gloss: the (acc) apple eat I (nom))

Syntax-to-Semantics Differences
Meaning in TL is conveyed using a different syntactic structure than in the SL
  – Changes in verb and its arguments
  – Passive constructions
  – Motion verbs and state verbs
  – Case creation and case absorption
Main Distinction from Structural Differences:
  – Structural differences are mostly independent of lexical choices and their semantic meaning → can be addressed by transfer rules that are syntactic in nature
  – Syntax-to-semantic mapping differences are meaning-specific: require the presence of specific words (and meanings) in the SL

Syntax-to-Semantics Differences
Structure-change example: I like swimming / “Ich schwimme gern” (lit: “I swim gladly”)
Verb-argument example: Jones likes the film. / “Le film plaît à Jones.” (lit: “the film pleases to Jones”)
Passive Constructions
  – Example: French reflexive passives: Ces livres se lisent facilement
      *“These books read themselves easily”
      These books are easily read

Idioms and Constructions
Main Distinction: meaning of whole is not directly compositional from meaning of its sub-parts → no compositional translation
Examples:
  – George is a bull in a china shop
  – He kicked the bucket
  – Can you please open the window?

Formulaic Utterances
Good night. / tisbaH cala xEr (lit: “waking up on good”)
Romanization of Arabic from CallHome Egypt

Analysis and Generation Main Steps
Analysis:
  – Morphological analysis (word-level) and POS tagging
  – Syntactic analysis and disambiguation (produce syntactic parse-tree)
  – Semantic analysis and disambiguation (produce symbolic frames or logical form representation)
  – Map to language-independent Interlingua
Generation:
  – Generate semantic representation in TL
  – Sentence Planning: generate syntactic structure and lexical selections for concepts
  – Surface-form realization: generate correct forms of words

Direct Approaches
No intermediate stage in the translation
First MT systems developed in the 1950s-60s (assembly code programs)
  – Morphology, bi-lingual dictionary lookup, local reordering rules
  – “Word-for-word, with some local word-order adjustments”
Modern Approaches:
  – Phrase-based Statistical MT (SMT)
  – Example-based MT (EBMT)

Statistical MT (SMT)
Main steps in training phrase-based statistical MT:
  – Create a sentence-aligned parallel corpus
  – Word Alignment: train word-level alignment models (GIZA++)
  – Phrase Extraction: extract phrase-to-phrase translation correspondences using heuristics (Pharaoh)
  – Minimum Error Rate Training (MERT): optimize translation system parameters on development data to achieve best translation performance
Attractive: completely automatic, no manual rules, much reduced manual labor
Main drawbacks:
  – Translation accuracy levels vary
  – Effective only with large volumes (several mega-words) of parallel text
  – Broad domain, but domain-sensitive
  – Still viable only for small number of language pairs!
Impressive progress in last 5 years

EBMT Paradigm
New Sentence (Source): Yesterday, 200 delegates met with President Clinton.
Matches to Source Found:
  – Yesterday, 200 delegates met behind closed doors… / Gestern trafen sich 200 Abgeordnete hinter verschlossenen…
  – Difficulties with President Clinton over… / Schwierigkeiten mit Praesident Clinton…
Alignment (Sub-sentential)
Translated Sentence (Target): Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.

Transfer Approaches
Syntactic Transfer:
  – Analyze SL input sentence to its syntactic structure (parse tree)
  – Transfer SL parse-tree to TL parse-tree (various formalisms for specifying mappings)
  – Generate TL sentence from the TL parse-tree
Semantic Transfer:
  – Analyze SL input to a language-specific semantic representation (e.g., Case Frames, Logical Form)
  – Transfer SL semantic representation to TL semantic representation
  – Generate syntactic structure and then surface sentence in the TL
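The three syntactic transfer steps (analyze, transfer, generate) can be sketched on the English/Hebrew noun phrase example from the Structural Differences slide, where English "art adj n" maps to Hebrew "art n art adj". The data structures below (LEXICON, POS, NP_RULE) are hypothetical and chosen for brevity; real transfer formalisms operate on full parse trees and are far more expressive.

```python
# Word translations and tags for the single example phrase (toy data).
LEXICON = {"the": "ha", "big": "gadol", "boy": "yeled"}
POS = {"the": "art", "big": "adj", "boy": "n"}

# Hypothetical rule format: English NP [art adj n] -> Hebrew NP [art n art adj].
# target_order lists, for each target position, the index of the source
# constituent to place there (the article is used twice).
NP_RULE = {"source": ["art", "adj", "n"], "target_order": [0, 2, 0, 1]}


def analyze(words):
    """Analysis: tag each source word (stands in for a real parser)."""
    return [(POS[word], word) for word in words]


def transfer(parse, rule):
    """Transfer: reorder source constituents into target-language order."""
    tags = [tag for tag, _ in parse]
    if tags != rule["source"]:
        raise ValueError("rule does not match input structure: %s" % tags)
    return [parse[i] for i in rule["target_order"]]


def generate(parse):
    """Generation: look up target words and produce the surface string."""
    return " ".join(LEXICON[word] for _, word in parse)


def translate_np(phrase):
    return generate(transfer(analyze(phrase.split()), NP_RULE))


print(translate_np("the big boy"))
```

The rule itself mentions no particular words, only categories, which is why purely structural differences of this kind can be handled by syntactic rules.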

Transfer Approaches
Main Advantages and Disadvantages:
Syntactic Transfer:
  – No need for semantic analysis and generation
  – Syntactic structures are general, not domain specific → less domain dependent, can handle open domains
  – Requires word translation lexicon
Semantic Transfer:
  – Requires deeper analysis and generation, symbolic representation of concepts and predicates → difficult to construct for open or unlimited domains
  – Can better handle non-compositional meaning structures → can be more accurate
  – No word translation lexicon – generate in TL from symbolic concepts

The METEOR Metric
Example:
  – Reference: “the Iraqi weapons are to be handed over to the army within two weeks”
  – MT output: “in two weeks Iraq’s weapons will give army”
Matching:
  Ref: Iraqi weapons army two weeks
  MT: two weeks Iraq’s weapons army
P = 5/8 = 0.625
R = 5/14 = 0.3571
Fmean = 10*P*R/(9P+R) = 0.3731
Fragmentation: 3 frags of 5 words = (3-1)/(5-1) = 0.50
Discounting factor: DF = 0.5 * (frag**3) = 0.0625
Final score: Fmean * (1 - DF) = 0.3731 * 0.9375 = 0.3498
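The arithmetic of this example can be checked with a few lines of code. The function below follows the formulas exactly as written on the slide, with the match and fragment counts supplied by hand; the word matching itself (exact, stem and synonym matching) is not implemented, and the handling of zero or one match is an assumption added only to avoid division by zero.

```python
def meteor_score(matches, mt_length, ref_length, fragments):
    """Score one MT output from pre-computed match statistics.

    matches:    number of matched words
    mt_length:  number of words in the MT output
    ref_length: number of words in the reference
    fragments:  number of contiguous matched fragments in the MT output
    """
    if matches == 0:
        return 0.0  # assumption: no matches means a score of zero
    precision = matches / mt_length
    recall = matches / ref_length
    fmean = 10 * precision * recall / (9 * precision + recall)
    # Assumption: a single matched word counts as perfectly ordered.
    fragmentation = (fragments - 1) / (matches - 1) if matches > 1 else 0.0
    discount = 0.5 * fragmentation ** 3
    return fmean * (1 - discount)


# Slide example: 5 matched words, 8-word MT output, 14-word reference, 3 fragments.
print(round(meteor_score(matches=5, mt_length=8, ref_length=14, fragments=3), 4))
```

Because recall is weighted nine times as heavily as precision in Fmean, the short MT output is penalized mainly for the reference words it fails to cover.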

Synthetic Combination MEMT
Two Stage Approach:
1. Align: Identify common words and phrases across the translations provided by the engines
2. Decode: search the space of synthetic combinations of words/phrases and select the highest scoring combined translation
Example:
1. announced afghan authorities on saturday reconstituted four intergovernmental committees
2. The Afghan authorities on Saturday the formation of the four committees of government
MEMT: the afghan authorities announced on Saturday the formation of four intergovernmental committees