

AVENUE/LETRAS: Learning-based MT Approaches for Languages with Limited Resources
Alon Lavie, Jaime Carbonell, Lori Levin, Bob Frederking
Joint work with: Erik Peterson, Christian Monson, Ariadna Font-Llitjos, Alison Alvarez, Roberto Aranovich

Why Machine Translation for Languages with Limited Resources?
- We are in the age of information explosion
  - The internet + web + Google -> anyone can get the information they want anytime…
- But what about the text in all those other languages?
  - How do they read all this English stuff?
  - How do we read all the stuff that they put online?
- MT for these languages would enable:
  - Better government access to native indigenous and minority communities
  - Better minority and native community participation in information-rich activities (health care, education, government) without giving up their languages
  - Civilian and military applications (disaster relief)
  - Language preservation
Sep 22, 2006 Learning-based MT with Limited Resources

AVENUE/LETRAS Funding
- Started in 2000 with a small amount of DARPA/TIDES funding (NICE)
- AVENUE project funded by 5-year NSF ITR grant (2001-2006)
- Follow-on LETRAS project funded by NSF HLC Program grant (2006-2009)
- Collaboration funding sources:
  - Mapudungun (MINEDUC, Chile)
  - Hebrew (ISF, Israel)
  - Brazilian Portuguese & Native Langs. (Brazilian Gov.)
  - Inupiaq (NSF, Polar Programs)

CMU’s AVENUE Approach
- Elicitation: use bilingual native informants to create a small high-quality word-aligned bilingual corpus of translated phrases and sentences
  - Building Elicitation corpora from feature structures
  - Feature Detection and Navigation
- Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
  - Learn from major language to minor language
  - Translate from minor language to major language
- XFER + Decoder:
  - XFER engine produces a lattice of possible transferred structures at all levels
  - Decoder searches and selects the best scoring combination
- Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
- Morphology Learning
- Word and Phrase bilingual lexicon acquisition

AVENUE MT Approach
[Diagram: MT pyramid from Source (e.g. Quechua) to Target (e.g. English). Labels: Direct: SMT, EBMT; Syntactic Parsing, Transfer Rules, Text Generation; Semantic Analysis, Interlingua, Sentence Planning. AVENUE: Automate Rule Learning]

AVENUE Architecture
[Diagram: stages labeled Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphology Analyzer, Learning Module, Learned Transfer Rules, Handcrafted rules, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module. INPUT TEXT and OUTPUT TEXT mark the ends of the run-time path]

Transfer Rule Formalism
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Parts of the rule labeled on the slide:
- Type information
- Part-of-speech/constituent information
- Alignments
- x-side constraints
- y-side constraints
- xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

Transfer Rule Formalism (II)
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Kinds of constraint labeled on the slide:
- Value constraints
- Agreement constraints
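To make the alignment part of the formalism concrete, here is a minimal Python sketch of how such a rule might be stored and applied to the "the old man" example. The `TransferRule` class, the `apply_rule` function, and the toy lexicon are assumptions made for illustration; they are not the AVENUE transfer engine, and the value and agreement constraints are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    rule_type: str     # e.g. "NP::NP"
    source: tuple      # x-side constituent sequence
    target: tuple      # y-side constituent sequence
    alignments: tuple  # (x_index, y_index) pairs, 1-based as on the slide


def apply_rule(rule, source_words, lexicon):
    """Fill each y-side slot with the translation of its aligned x-side word."""
    if len(source_words) != len(rule.source):
        raise ValueError("source length does not match the rule's x-side")
    target = [None] * len(rule.target)
    for x, y in rule.alignments:
        target[y - 1] = lexicon[source_words[x - 1]]
    return target


NP_RULE = TransferRule(
    rule_type="NP::NP",
    source=("DET", "ADJ", "N"),
    target=("DET", "N", "DET", "ADJ"),
    alignments=((1, 1), (1, 3), (2, 4), (3, 2)),
)

# Hypothetical three-word lexicon, just enough for the slide's example.
TOY_LEXICON = {"the": "ha-", "old": "zaqen", "man": "ish"}

hebrew = apply_rule(NP_RULE, ["the", "old", "man"], TOY_LEXICON)
# Attach the prefix "ha-" to the following word for display.
print("".join(w if w.endswith("-") else w + " " for w in hebrew).strip())
```

Because X1 is aligned to both Y1 and Y3, the determiner is copied to both target positions, which is how the single English article ends up on both the noun and the adjective.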

Transfer Rules -> Transfer Trees
[Diagram: source tree for "jIvana ke eka aXyAya" (NP over PP and NP1; PP over NP and P "ke"; NP1 over Adj "eka" and N "aXyAya") and target tree for "one chapter of life" (NP over NP1 and PP; NP1 over Adj "one" and N "chapter"; PP over P "of" and NP "life")]
; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
;     ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
(X1::Y2) (X2::Y1)
; ((x2 lexwx) = 'kA')
)
{NP,13}
NP::NP : [NP1] -> [NP1]
(X1::Y1)
{PP,12}
PP::PP : [NP Postp] -> [Prep NP]

Rule Learning - Overview
- Goal: Acquire Syntactic Transfer Rules
- Use available knowledge from the source side (grammatical structure)
- Three steps:
  1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
  2. Compositionality Learning: use previously learned rules to learn hierarchical structure
  3. Constraint Learning: refine rules by learning appropriate feature constraints

Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

Compositionality Learning
Initial Flat Rules:
S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))
Generated Compositional Rule:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))

Constraint Learning
Input: Rules and their Example Sets
S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))
NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))
Output: Rules with Feature Constraints:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
(X1 NUM = X2 NUM)
(Y1 NUM = Y2 NUM)
(X1 NUM = Y1 NUM))

AVENUE Prototypes
- General XFER framework under development for past three years
- Prototype systems so far:
  - German-to-English, Dutch-to-English
  - Chinese-to-English
  - Hindi-to-English
  - Hebrew-to-English
  - Portuguese-to-English
- In progress or planned:
  - Mapudungun-to-Spanish
  - Quechua-to-Spanish
  - Inupiaq-to-English
  - Native-Brazilian languages to Brazilian Portuguese

Mapudungun
- Indigenous Language of Chile and Argentina
- ~1 Million Mapuche Speakers

Collaboration
- Mapuche Language Experts: Universidad de la Frontera (UFRO), Instituto de Estudios Indígenas (IEI, Institute for Indigenous Studies)
  - Eliseo Cañulef
  - Rosendo Huisca
  - Hugo Carrasco
  - Hector Painequeo
  - Flor Caniupil
  - Luis Caniupil Huaiquiñir
  - Marcela Collio Calfunao
  - Cristian Carrillan Anton
  - Salvador Cañulef
- Chilean Funding: Chilean Ministry of Education (Mineduc), Bilingual and Multicultural Education Program
  - Carolina Huenchullan Arrúe
  - Claudio Millacura Salas

Accomplishments: Corpora Collection
- Spoken Corpus
  - Collected: Luis Caniupil Huaiquiñir
  - Medical Domain
  - 3 of 4 Mapudungun Dialects
    - 120 hours of Nguluche
    - 30 hours of Lafkenche
    - 20 hours of Pwenche
  - Transcribed in Mapudungun
  - Translated into Spanish
- Written Corpus
  - ~200,000 words
  - Bilingual Mapudungun-Spanish
  - Historical and newspaper text
Transcript sample:
nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!

Accomplishments: Developed at UFRO
- Bilingual Dictionary with Examples
  - 1,926 entries
- Spelling Corrected Mapudungun Word List
  - 117,003 fully-inflected word forms
- Segmented Word List
  - 15,120 forms
  - Stems translated into Spanish

Accomplishments: Developed at LTI using Mapudungun language resources from UFRO
- Spelling Checker
  - Integrated into OpenOffice
- Hand-built Morphological Analyzer
- Prototype Machine Translation Systems
  - Rule-Based
  - Example-Based
- Website: LenguasAmerindias.org

Challenges for Hebrew MT
- Paucity in existing language resources for Hebrew
  - No publicly available broad coverage morphological analyzer
  - No publicly available bilingual lexicons or dictionaries
  - No POS-tagged corpus or parse tree-bank corpus for Hebrew
  - No large Hebrew/English parallel corpus
- Scenario well suited for CMU transfer-based MT framework for languages with limited resources

Hebrew-to-English MT Prototype
- Initial prototype developed within a two month intensive effort
- Accomplished:
  - Adapted available morphological analyzer
  - Constructed a preliminary translation lexicon
  - Translated and aligned Elicitation Corpus
  - Learned XFER rules
  - Developed (small) manual XFER grammar as a point of comparison
  - System debugging and development
  - Evaluated performance on unseen test data using automatic evaluation metrics

[Diagram: Hebrew-to-English translation pipeline. Boxes: Source Input, Preprocessing, Morphology, Transfer Engine (with Transfer Rules and Translation Lexicon), Translation Output Lattice, Decoder (with English Language Model), English Output]
Source Input: בשורה הבאה
Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1)
(X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))
Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((Y0 lex) = "LINE"))
Translation Output Lattice:
(0 1 "IN" @PREP)
(1 1 "THE" @DET)
(2 2 "LINE" @N)
(1 2 "THE LINE" @NP)
(0 2 "IN LINE" @PP)
(0 4 "IN THE NEXT LINE" @PP)
English Output: in the next line

Morphology Example
Input word: B$WRH
0     1     2     3     4
|--------B$WRH--------|
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|

Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Example Translation
Input: לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה
Gloss: After debates many decided the government to hold referendum in issue the withdrawal
Output: AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL

Sample Output (dev-data)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Challenges and Future Directions
- Automatic Transfer Rule Learning:
  - Learning mappings for non-compositional structures
  - Effective models for rule scoring for:
    - Decoding: using scores at runtime
    - Pruning the large collections of learned rules
  - Learning Unification Constraints
  - In the absence of morphology or POS annotated lexica
- Integrated Xfer Engine and Decoder:
  - Improved models for scoring tree-to-tree mappings, integration with LM and other knowledge sources in the course of the search

Challenges and Future Directions
Our approach for learning transfer rules is applicable to the large parallel data scenario, subject to solutions for several big challenges:
- No elicitation corpus -> break down parallel sentences into reasonable learning examples
- Working with less reliable automatic word alignments rather than manual alignments
- Effective use of reliable parse structures for ONE language (i.e. English) and automatic word alignments in order to decompose the translation of a sentence into several compositional rules
- Effective scoring of resulting very large transfer grammars, and scaled up transfer + decoding

Future Research Directions
- Automatic Rule Refinement
- Morphology Learning
- Feature Detection and Corpus Navigation
- …

Implications for MT with Vast Amounts of Parallel Data
- Phrase-to-phrase MT is ill suited for long-range reorderings -> ungrammatical output
- Recent work on hierarchical Stat-MT [Chiang, 2005] and parsing-based MT [Melamed et al, 2005] [Knight et al]
- Learning general tree-to-tree syntactic mappings is equally problematic:
  - Meaning is a hybrid of complex, non-compositional phrases embedded within a syntactic structure
  - Some constituents can be translated in isolation, others require contextual mappings

Evaluation Results
Test set of 62 sentences from Haaretz newspaper, 2 reference translations

System    BLEU    NIST    P       R       METEOR
No Gram   0.0616  3.4109  0.4090  0.4427  0.3298
Learned   0.0774  3.5451  0.4189  0.4488  0.3478
Manual    0.1026  3.7789  0.4334  0.4474  0.3617

Hebrew-English: Test Suite Evaluation

Grammar            BLEU    METEOR
Baseline (NoGram)  0.0996  0.4916
Learned Grammar    0.1608  0.5525
Manual Grammar     0.1642  0.5320

Quechua -> Spanish MT
- V-Unit: funded Summer project in Cusco (Peru), June-August 2005 [preparations and data collection started earlier]
- Intensive Quechua course in Centro Bartolome de las Casas (CBC)
- Worked together with two native Quechua speakers and one non-native speaker on developing infrastructure (correcting elicited translations, segmenting and translating list of most frequent words)

Quechua -> Spanish Prototype MT System
- Stem Lexicon (semi-automatically generated): 753 lexical entries
- Suffix lexicon: 21 suffixes (150 Cusihuaman)
- Quechua morphology analyzer
- 25 translation rules
- Spanish morphology generation module
- User-Studies: 10 sentences, 3 users (2 native, 1 non-native)

The Transfer Engine
- Analysis: source text is parsed into its grammatical structure. Determines transfer application ordering.
  Example: 他 看 书。(he read book)
  [Tree: S over NP and VP; NP over N (他); VP over V (看) and NP (书)]
- Transfer: a target language tree is created by reordering, insertion, and deletion.
  [Tree: he, read, DET (a), N (book)]
  Article “a” is inserted into object NP. Source words translated with transfer lexicon.
- Generation: target language constraints are checked and final translation produced. E.g. “reads” is chosen over “read” to agree with “he”.
  Final translation: “He reads a book”

The Transfer Engine
Some Unique Features:
- Works with either learned or manually-developed transfer grammars
- Handles rules with or without unification constraints
- Supports interfacing with servers for Morphological analysis and generation
- Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures

The Lattice Decoder
- Simple Stack Decoder, similar in principle to SMT/EBMT decoders
- Searches for best-scoring path of non-overlapping lattice arcs
- Scoring based on log-linear combination of scoring components (no MER training yet)
- Scoring components:
  - Standard trigram LM
  - Fragmentation: how many arcs to cover the entire translation?
  - Length Penalty
  - Rule Scores (not fully integrated yet)
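The search for a best-scoring path of non-overlapping arcs can be sketched in a few lines of Python. This is not the stack decoder described on the slide: it swaps in a simple left-to-right dynamic program, leaves out the trigram LM and length penalty, and uses made-up arcs, scores, and weights (`TOY_ARCS`, `WEIGHTS`). All names here are assumptions for illustration only.

```python
def best_path(arcs, length, weights):
    """Return (score, texts) for the best arc sequence covering positions 0..length.

    arcs: list of (start, end, text, features), with end > start and
          features a dict of log-scores keyed by component name.
    weights: log-linear weights per component, plus a "fragmentation"
             weight charged once per arc used.
    """
    best = {0: (0.0, [])}
    for pos in range(length):
        if pos not in best:
            continue  # no arc sequence reaches this position
        score, path = best[pos]
        for start, end, text, feats in arcs:
            if start != pos:
                continue
            arc_score = sum(weights[name] * value for name, value in feats.items())
            arc_score -= weights["fragmentation"]
            candidate = (score + arc_score, path + [text])
            if end not in best or candidate[0] > best[end][0]:
                best[end] = candidate
    return best.get(length)


# Hypothetical lattice over a four-position input; scores are invented.
TOY_ARCS = [
    (0, 1, "in", {"rule": -0.2}),
    (1, 2, "the", {"rule": -0.1}),
    (2, 3, "line", {"rule": -0.3}),
    (3, 4, "next", {"rule": -0.3}),
    (0, 4, "in the next line", {"rule": -1.5}),
]
WEIGHTS = {"rule": 1.0, "fragmentation": 0.5}

score, words = best_path(TOY_ARCS, 4, WEIGHTS)
print(score, words)
```

With these toy numbers the single long arc wins even though its own rule score is lower than the sum of the word-level arcs, because every extra arc pays the fragmentation penalty.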

Outline
- Rationale for learning-based MT
- Roadmap for learning-based MT
- Framework overview
- Elicitation
- Learning transfer Rules
- Automatic rule refinement
- Example prototypes
- Implications for MT with vast parallel data
- Conclusions and future directions

Data Elicitation for Languages with Limited Resources
- Rationale:
  - Large volumes of parallel text not available -> create a small maximally-diverse parallel corpus that directly supports the learning task
  - Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using elicitation tool
  - Elicitation corpus designed to be typologically and structurally comprehensive and compositional
  - Transfer-rule engine and new learning approach support acquisition of generalized transfer-rules from the data

Elicitation Tool: English-Chinese Example
[Screenshot]

Elicitation Tool: English-Hindi Example
[Screenshot]

Elicitation Tool: English-Arabic Example
[Screenshot]

Elicitation Tool: Spanish-Mapudungun Example
[Screenshot]

Designing Elicitation Corpora
- What do we want to elicit?
  - Diversity of linguistic phenomena and constructions
  - Syntactic structural diversity
- How do we construct an elicitation corpus?
  - Typological Elicitation Corpus based on elicitation and documentation work of field linguists (e.g. Comrie 1977, Bouquiaux 1992): initial corpus size ~1000 examples
  - Structural Elicitation Corpus based on representative sample of English phrase structures: ~120 examples
- Organized compositionally: elicit simple structures first, then use them as building blocks
- Goal: minimize size, maximize linguistic coverage

Typological Elicitation Corpus
- Feature Detection
  - Discover what features exist in the language and where/how they are marked
    - Example: does the language mark gender of nouns? How and where are these marked?
  - Method: compare translations of minimal pairs, i.e. sentences that differ in only ONE feature
- Elicit translations/alignments for detected features and their combinations
- Dynamic corpus navigation based on feature detection: no need to elicit for combinations involving non-existent features
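The minimal-pair comparison can be illustrated with a small Python sketch. The function names and the Spanish example pair are assumptions chosen for illustration (they do not come from the elicitation corpus), and a real detector would work from word alignments rather than raw token positions.

```python
def feature_is_marked(translation_a, translation_b):
    """A feature counts as marked if the two translations of a minimal pair differ."""
    return translation_a.split() != translation_b.split()


def differing_positions(translation_a, translation_b):
    """Token positions where the translations differ, or None if lengths differ."""
    a, b = translation_a.split(), translation_b.split()
    if len(a) != len(b):
        return None  # position-by-position comparison is not meaningful
    return [i for i, (wa, wb) in enumerate(zip(a, b)) if wa != wb]


# Hypothetical minimal pair for gender: "The boy is tall" / "The girl is tall".
masculine = "el niño es alto"
feminine = "la niña es alta"

print(feature_is_marked(masculine, feminine))    # True
print(differing_positions(masculine, feminine))  # [0, 1, 3]
```

In this toy pair the differences fall on the article, the noun, and the adjective, which is the kind of "where is it marked" information the slide describes; the verb is unchanged.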

Typological Elicitation Corpus
- Initial typological corpus of about 1000 sentences was manually constructed
- New construction methodology for building an elicitation corpus using:
  - A feature specification: lists inventory of available features and their values
  - A definition of the set of desired feature structures
    - Schemas define sets of desired combinations of features and values
    - Multiplier algorithm generates the comprehensive set of feature structures
  - A generation grammar and lexicon: NLG generator generates NL sentences from the feature structures
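One plausible reading of the multiplier step is a Cartesian product over the feature values a schema allows. The Python sketch below takes that reading as an assumption; the slides do not spell out the algorithm, and the feature names and values in `SCHEMA` are invented for the example.

```python
from itertools import product


def multiply(schema):
    """Expand a schema {feature: [values]} into every feature structure it allows."""
    names = list(schema)
    return [
        dict(zip(names, values))
        for values in product(*(schema[name] for name in names))
    ]


# Hypothetical schema; real feature specifications would be much larger.
SCHEMA = {
    "np-number": ["sg", "pl"],
    "np-gender": ["masc", "fem"],
    "np-definiteness": ["def", "indef"],
}

structures = multiply(SCHEMA)
print(len(structures))  # 8
print(structures[0])
```

Each resulting dictionary would then be handed to a generator that produces a sentence to elicit. Dropping a feature from the schema once it is found to be unmarked in the language shrinks the set multiplicatively.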

Structural Elicitation Corpus
- Goal: create a compact diverse sample corpus of syntactic phrase structures in English in order to elicit how these map into the elicited language
- Methodology:
  - Extracted all CFG “rules” from Brown section of Penn TreeBank (122K sentences)
  - Simplified POS tag set
  - Constructed frequency histogram of extracted rules
  - Pulled out simplest phrases for most frequent rules for NPs, PPs, ADJPs, ADVPs, SBARs and Sentences
  - Some manual inspection and refinement
- Resulting corpus of about 120 phrases/sentences representing common structures
- See [Probst and Lavie, 2004]
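The rule extraction and frequency histogram steps can be sketched as follows. Trees are written as nested Python tuples and the two toy trees are invented; reading the actual Penn TreeBank files and simplifying the tag set are left out, so this is an illustration under those assumptions rather than the procedure used for the corpus.

```python
from collections import Counter


def cfg_rules(tree):
    """Yield (parent, (child labels...)) for every non-lexical node of a tree.

    A tree is a tuple (label, child, child, ...); a preterminal looks like
    ("DET", "the") and contributes no rule.
    """
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return  # preterminal: its only child is a word
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from cfg_rules(child)


# Two hypothetical parsed sentences.
TREES = [
    ("S", ("NP", ("DET", "the"), ("N", "man")), ("VP", ("V", "slept"))),
    ("S", ("NP", ("DET", "a"), ("N", "dog")),
          ("VP", ("V", "saw"), ("NP", ("DET", "the"), ("N", "cat")))),
]

histogram = Counter(rule for tree in TREES for rule in cfg_rules(tree))
for (parent, children), count in histogram.most_common():
    print(count, parent, "->", " ".join(children))
```

Sorting the histogram by frequency puts the most common structures first, which is where the slide's method looks for the simplest example phrases.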


Flat Seed Rule Generation
- Create a “flat” transfer rule specific to the sentence pair, partially abstracted to POS
  - Words that are aligned word-to-word and have the same POS in both languages are generalized to their POS
  - Words that have complex alignments (or not the same POS) remain lexicalized
- One seed rule for each translation example
- No feature constraints associated with seed rules (but mark the example(s) from which it was learned)
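A minimal Python sketch of seed generation, applied to the "the big apple" example from the earlier slide. The function name and the dictionary layout are assumptions. The sketch also relaxes the slide's word-to-word condition: a word is generalized whenever all of its aligned partners share its POS, so that the one-to-two alignment of the article still abstracts to ART as in the slide's generated rule.

```python
def flat_seed_rule(rule_type, src, tgt, alignments):
    """Build a flat seed rule from a POS-tagged, word-aligned phrase pair.

    src, tgt: lists of (word, pos); alignments: set of 1-based (x, y) pairs.
    """
    def abstract(tokens, other, partners_of):
        side = []
        for index, (word, pos) in enumerate(tokens, start=1):
            partners = partners_of(index)
            same_pos = bool(partners) and all(
                other[p - 1][1] == pos for p in partners
            )
            # Generalize to POS when the partners agree, else stay lexicalized.
            side.append(pos if same_pos else f'"{word}"')
        return side

    x_side = abstract(src, tgt, lambda i: [y for x, y in alignments if x == i])
    y_side = abstract(tgt, src, lambda j: [x for x, y in alignments if y == j])
    return {"type": rule_type, "x": x_side, "y": y_side,
            "alignments": sorted(alignments)}


seed = flat_seed_rule(
    "NP::NP",
    src=[("the", "ART"), ("big", "ADJ"), ("apple", "N")],
    tgt=[("ha-", "ART"), ("tapuax", "N"), ("ha-", "ART"), ("gadol", "ADJ")],
    alignments={(1, 1), (1, 3), (2, 4), (3, 2)},
)
print(seed["type"], seed["x"], "->", seed["y"])
print(seed["alignments"])
```

Unaligned words, and words whose partners carry a different POS, stay in the rule as quoted literals.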

Compositionality Learning
- Detection: traverse the c-structure of the English sentence, add compositional structure for translatable chunks
- Generalization: adjust constituent sequences and alignments
- Two implemented variants:
  - Safe Compositionality: there exists a transfer rule that correctly translates the sub-constituent
  - Maximal Compositionality: Generalize the rule if supported by the alignments, even in the absence of an existing transfer rule for the sub-constituent
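The generalization step can be sketched in Python in the spirit of the "safe" variant: a chunk of the flat rule is replaced by a constituent label only when an already-learned rule matches it on both sides with the same internal alignments. The sketch scans the flat x-side left to right instead of traversing a parse, does not handle nested constituents, and uses an assumed dictionary layout for rules. The input is the S rule and the two NP rules from the earlier Compositionality Learning example.

```python
def compose(flat, sub_rules):
    """Replace chunks of a flat rule that are covered by existing sub-rules."""
    x, y = list(flat["x"]), list(flat["y"])
    align = set(flat["alignments"])
    i = 0
    while i < len(x):
        for sub in sorted(sub_rules, key=lambda r: -len(r["x"])):
            n = len(sub["x"])
            if x[i:i + n] != sub["x"]:
                continue
            xs = range(i + 1, i + n + 1)  # 1-based x positions of the chunk
            ys = sorted({b for a, b in align if a in xs})
            if not ys:
                continue
            lo, hi = ys[0], ys[-1]
            local = {(a - i, b - lo + 1) for a, b in align if a in xs}
            crossing = any(lo <= b <= hi and a not in xs for a, b in align)
            if (y[lo - 1:hi] != sub["y"]
                    or local != set(sub["alignments"]) or crossing):
                continue
            x_label, y_label = sub["type"].split("::")
            x[i:i + n] = [x_label]
            y[lo - 1:hi] = [y_label]
            # Renumber alignments outside the chunk, then align the new labels.
            shifted = {(a - (n - 1) if a > i + n else a,
                        b - (hi - lo) if b > hi else b)
                       for a, b in align if a not in xs}
            align = shifted | {(i + 1, lo)}
            break
        i += 1
    return {"type": flat["type"], "x": x, "y": y, "alignments": sorted(align)}


FLAT_S = {"type": "S::S",
          "x": ["ART", "ADJ", "N", "V", "ART", "N"],
          "y": ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"],
          "alignments": [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 7), (6, 8)]}
NP_RULES = [
    {"type": "NP::NP", "x": ["ART", "ADJ", "N"], "y": ["ART", "N", "ART", "ADJ"],
     "alignments": [(1, 1), (1, 3), (2, 4), (3, 2)]},
    {"type": "NP::NP", "x": ["ART", "N"], "y": ["ART", "N"],
     "alignments": [(1, 1), (2, 2)]},
]

rule = compose(FLAT_S, NP_RULES)
print(rule["x"], "->", rule["y"], rule["alignments"])
```

On this input the result matches the compositional rule shown on the example slide: `[NP V NP] -> [NP V P NP]` with alignments X1::Y1, X2::Y2, X3::Y4. The unaligned target word P is left in place.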

Constraint Learning
- Goal: add appropriate feature constraints to the acquired rules
- Methodology:
  - Preserve general structural transfer
  - Learn specific feature constraints from example set
- Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)
- Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary
- The seed rules in a group form the specific boundary of a version space
- The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints

Constraint Learning: Generalization
- The partial order of the version space:
  - Definition: A transfer rule tr1 is strictly more general than another transfer rule tr2 if all f-structures that are satisfied by tr2 are also satisfied by tr1.
- Generalize rules by merging them:
  - Deletion of constraint
  - Raising two value constraints to an agreement constraint, e.g. ((x1 num) = *pl), ((x3 num) = *pl) -> ((x1 num) = (x3 num))
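The second merge operation, raising value constraints to an agreement constraint, can be sketched in Python as follows. The tuple encoding of constraints and the function name are assumptions for illustration; the version-space search that decides when a raise is justified by the example set is not shown.

```python
def raise_to_agreement(constraints):
    """Turn value constraints that share a feature and value into agreement constraints.

    constraints: list of ((variable, feature), value), e.g. (("x1", "num"), "*pl").
    An agreement constraint is returned as ((var_a, feature), (var_b, feature)).
    """
    by_feature_value = {}
    for (var, feat), value in constraints:
        by_feature_value.setdefault((feat, value), []).append(var)

    result = []
    for (feat, value), variables in by_feature_value.items():
        if len(variables) >= 2:
            first = variables[0]
            result.extend(((first, feat), (other, feat)) for other in variables[1:])
        else:
            result.append(((variables[0], feat), value))
    return result


value_constraints = [(("x1", "num"), "*pl"), (("x3", "num"), "*pl")]
print(raise_to_agreement(value_constraints))
# [(('x1', 'num'), ('x3', 'num'))]
```

The raised rule is more general in the slide's sense: it still accepts the plural examples, and it also accepts inputs where both positions are singular.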

Automated Rule Refinement
- Bilingual informants can identify translation errors and pinpoint the errors
- A sophisticated trace of the translation path can identify likely sources for the error and do “Blame Assignment”
- Rule Refinement operators can be developed to modify the underlying translation grammar (and lexicon) based on characteristics of the error source:
  - Add or delete feature constraints from a rule
  - Bifurcate a rule into two rules (general and specific)
  - Add or correct lexical entries
- See [Font-Llitjos, Carbonell & Lavie, 2005]


Outline
- Rationale for learning-based MT
- Roadmap for learning-based MT
- Framework overview
- Elicitation
- Learning transfer rules
- Automatic rule refinement
- Learning morphology
- Example prototypes
- Implications for MT with vast parallel data
- Conclusions and future directions

Implications for MT with Vast Amounts of Parallel Data
Example:
  Chinese:     他 经常 与 江泽民 总统 通 电话
  Gloss:       He freq with J Zemin Pres via phone
  Translation: He freq talked with President J Zemin over the phone
[Figure: the labels NP1, NP2, NP3 mark corresponding noun phrases on the Chinese and English sides]

Conclusions
- There is hope yet for widespread MT between many of the world's language pairs
- MT offers a fertile yet extremely challenging ground for learning-based approaches that leverage diverse sources of information:
  - Syntactic structure of one or both languages
  - Word-to-word correspondences
  - Decomposable units of translation
  - Statistical language models
- AVENUE's XFER approach provides a feasible solution to MT for languages with limited resources
- It is also a promising approach for addressing the fundamental weaknesses in current corpus-based MT for languages with vast resources


Mapudungun-to-Spanish Example
English:    I didn't see Maria
Mapudungun: pelafiñ Maria
            pe  -la  -fi    -ñ                  Maria
            see -neg -3.obj -1.subj.indicative  Maria
Spanish:    No vi a María
            No  vi                            a   María
            neg see.1.subj.past.indicative    acc Maria

pe-la-fi-ñ Maria: analysis tree, built up one node per slide
- V → pe
- VSuff → la: negation = +
- VSuffG → VSuff: pass all features up
- VSuff → fi: object person = 3
- VSuffG → VSuffG VSuff: pass all features up from both children
- VSuff → ñ: person = 1, number = sg, mood = ind
- VSuffG → VSuffG VSuff: pass all features up from both children
- V → V VSuffG: pass all features up from both children; check that 1) negation = + and 2) tense is undefined
- NP → N, N → Maria: person = 3, number = sg, human = +
- S → VP, VP → V NP: check that NP is human = +; pass features up from V
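The feature passing in this analysis can be approximated with plain dictionaries. The sketch below is an assumption-laden illustration: the `LEXICON` layout, the `unify` helper, the `stem` feature, and the choice to store the noun's features under an `object` key are all hypothetical, and only the feature values themselves are taken from the slides.

```python
# Hypothetical lexicon: morpheme -> (category, features from the slides).
LEXICON = {
    "pe": ("V", {"stem": "pe"}),
    "la": ("VSuff", {"negation": "+"}),
    "fi": ("VSuff", {"object_person": 3}),
    "ñ": ("VSuff", {"person": 1, "number": "sg", "mood": "ind"}),
    "Maria": ("N", {"person": 3, "number": "sg", "human": "+"}),
}


def unify(left, right):
    """Merge two flat feature sets; return None on a value clash."""
    merged = dict(left)
    for feature, value in right.items():
        if merged.get(feature, value) != value:
            return None
        merged[feature] = value
    return merged


def analyze_verb(morphemes):
    """Pass suffix features up through the suffix group to V, then apply
    the two checks shown for this V rule (negation = +, tense undefined)."""
    stem, *suffixes = morphemes
    category, verb = LEXICON[stem]
    if category != "V":
        return None
    group = {}
    for suffix in suffixes:
        category, features = LEXICON[suffix]
        if category != "VSuff":
            return None
        group = unify(group, features)
        if group is None:
            return None
    verb = unify(verb, group)
    if verb is None or verb.get("negation") != "+" or "tense" in verb:
        return None
    return verb


def analyze_sentence(morphemes, noun):
    """Sentence level: check that the NP is human, pass V features up."""
    verb = analyze_verb(morphemes)
    category, np_features = LEXICON[noun]
    if verb is None or category != "N" or np_features.get("human") != "+":
        return None
    return {**verb, "object": dict(np_features)}
```

Calling `analyze_verb(["pe", "la", "fi", "ñ"])` collects the features of all three suffixes on the verb, while dropping the negation suffix makes this particular rule fail its first check.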

Transfer to Spanish: Top-Down
- S → S: pass all features to the Spanish side
- S → VP: pass all features down
- VP → V "a" NP: pass object features down
- The accusative marker "a" on objects is introduced because human = +

Transfer to Spanish: Top-Down

VP::VP [VBar NP] -> [VBar "a" NP]
(
  (X1::Y1)
  (X2::Y3)
  ((X2 type) = (*NOT* personal))
  ((X2 human) =c +)
  (X0 = X1)
  ((X0 object) = X2)
  (Y0 = X0)
  ((Y0 object) = (X0 object))
  (Y1 = Y0)
  (Y3 = (Y0 object))
  ((Y1 objmarker person) = (Y3 person))
  ((Y1 objmarker number) = (Y3 number))
  ((Y1 objmarker gender) = (Y3 gender))
)
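One way to read this rule is as a function over feature dictionaries. The Python below is a rough sketch, not the XFER engine: the function name and return format are hypothetical, `=c` is read as "the feature must already be present with this value", and `*NOT*` is read as "the feature must not have this value" (so an undefined `type` passes).

```python
def apply_vp_rule(vbar, np):
    """Sketch of VP::VP [VBar NP] -> [VBar "a" NP] on feature dicts.

    Returns the target-side constituents as (label, features) pairs,
    or None when a source-side constraint fails.
    """
    # ((X2 type) = (*NOT* personal))
    if np.get("type") == "personal":
        return None
    # ((X2 human) =c +): the value must be present and equal to "+"
    if np.get("human") != "+":
        return None

    # (X0 = X1), ((X0 object) = X2)
    x0 = dict(vbar)
    x0["object"] = dict(np)

    # (Y0 = X0); ((Y0 object) = (X0 object)) then holds by construction
    y0 = dict(x0)
    # (Y3 = (Y0 object))
    y3 = dict(y0["object"])
    # (Y1 = Y0), plus the objmarker agreement features copied from Y3
    y1 = dict(y0)
    y1["objmarker"] = {
        feature: y3[feature]
        for feature in ("person", "number", "gender")
        if feature in y3
    }

    # Target side: VBar, the literal "a", then the NP
    return [("VBar", y1), ("a", None), ("NP", y3)]
```

For the running example, the NP for Maria carries human = +, so the rule fires and the literal "a" is placed between the verb and the object; for a non-human object the function returns None and some other VP rule would have to apply.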

Transfer to Spanish: Top-Down (continued)
- Pass person, number, and mood features to the Spanish verb; assign tense = past
- "no" is introduced because negation = +
- The verb stem is transferred: pe → ver
- The inflected form is generated: ver → vi (person = 1, number = sg, mood = indicative, tense = past)
- Pass features over to the Spanish side for the noun: Maria → María
- Result: No vi a María (I didn't see Maria)
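The last transfer and generation steps can be put together in a toy Python sketch. Everything structural here is an assumption for illustration: the `LEXICAL_TRANSFER` and `PARADIGM` tables and both function names are hypothetical, and the paradigm holds only a few preterite forms of "ver".

```python
# Hypothetical lexical transfer table (Mapudungun -> Spanish).
LEXICAL_TRANSFER = {"pe": "ver", "Maria": "María"}

# Tiny generation table: (lemma, person, number, mood, tense) -> form.
PARADIGM = {
    ("ver", 1, "sg", "ind", "past"): "vi",
    ("ver", 3, "sg", "ind", "past"): "vio",
    ("ver", 1, "pl", "ind", "past"): "vimos",
}


def transfer_verb(stem, features):
    """Copy person, number, and mood to the Spanish verb, assign
    tense = past, and add "no" when negation = +."""
    spanish = {key: features[key] for key in ("person", "number", "mood")}
    spanish["tense"] = "past"
    lemma = LEXICAL_TRANSFER[stem]
    form = PARADIGM[(lemma, spanish["person"], spanish["number"],
                     spanish["mood"], spanish["tense"])]
    prefix = ["no"] if features.get("negation") == "+" else []
    return prefix + [form]


def translate(stem, verb_features, noun, noun_features):
    """Assemble the Spanish sentence for a verb plus object noun."""
    words = transfer_verb(stem, verb_features)
    if noun_features.get("human") == "+":
        words.append("a")  # accusative marker for human objects
    words.append(LEXICAL_TRANSFER[noun])
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:]
```

With the features collected during analysis (person = 1, number = sg, mood = ind, negation = +) and a human object, `translate("pe", ..., "Maria", ...)` produces the sentence from the slides, "No vi a María".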
