An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh,

Presentation transcript:

An Overview of the AVENUE Project Presented by Lori Levin Language Technologies Institute School of Computer Science Carnegie Mellon University Pittsburgh, PA USA

AVENUE Project
Dr. Jaime Carbonell, PI; Dr. Alon Lavie, Co-PI; Dr. Lori Levin, Co-PI; Dr. Robert Frederking; Dr. Ralf Brown; Dr. Rodolfo Vega
Mapudungun
–Dr. Eliseo Cañulef
–Rosendo Huisca
–and others
Erik Peterson; Christian Monson; Ariadna Font Llitjós; Alison Alvarez; Roberto Aranovich; Dr. Jeff Good; Dr. Katharina Probst
Hebrew
–Dr. Shuly Wintner
–student
This research was funded in part by NSF grant number IIS

MT Approaches
[Figure: MT triangle for Source "Mi chiamo Lori" and Target "My name is Lori". Direct: SMT, EBMT. Transfer Rules: Syntactic Parsing (Pronoun-acc-1-sg chiamare-1sg N) to Text Generation ([np poss-1sg “name”] BE-pres N). Interlingua (introduce-self): Semantic Analysis to Sentence Planning. AVENUE: Automate Rule Learning.]

Approaches to MT
Direct
–Works best with large parallel corpora (millions of words)
–Can be done without linguistic resources
Interlingua
–Useful when you are translating between more than two languages
–Requires linguistic knowledge
Transfer
–Requires linguistic knowledge

Useful Resources for MT
Parallel corpus
Monolingual corpus
Lexicon
Morphological Analyzer (lemmatizer)
Human Linguist
Human non-linguist

Low Resource Situations
Indigenous languages
–May lack large corpora
–May lack a computational linguist
“Strategic” Languages
–Aside from standard written Arabic and Chinese
Resource-rich language: limited domain
–Most of the large parallel corpora are newspaper, parliamentary proceedings, or broadcast news
–Fewer resources for conversation related to humanitarian aid

Why Machine Translation for Languages with Limited Resources?
We are in the age of information explosion
–The internet + web + Google -> anyone can get the information they want anytime…
But what about the text in all those other languages?
–How do they read all this English stuff?
–How do we read all the stuff that they put online?
MT for these languages would enable:
–Better government access to native indigenous and minority communities
–Better minority and native community participation in information-rich activities (health care, education, government) without giving up their languages
–Civilian and military applications (disaster relief)
–Language preservation

Mixed Resource Situations Some resources are available and others aren’t.

Omnivorous MT Eat whatever resources are available Eat large or small amounts of data

AVENUE’s Inventory
Resources
–Parallel corpus
–Monolingual corpus
–Lexicon
–Morphological Analyzer (lemmatizer)
–Human Linguist
–Human non-linguist
Techniques
–Rule based transfer system
–Example Based MT
–Morphology Learning
–Rule Learning
–Interactive Rule Refinement
–Multi-Engine MT

The Avenue Low Resource Scenario
[Figure: AVENUE architecture diagram. Stages: Elicitation, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Learning Module (two instances), Learned Transfer Rules, Handcrafted rules, Morphology Analyzer, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module, INPUT TEXT, OUTPUT TEXT.]

AVENUE
Rules can be written by hand or learned automatically.
Hybrid
–Rule-based transfer
–Statistical decoder
–Multi-engine combinations with SMT and EBMT

AVENUE systems (small and experimental, but tested on unseen data)
Hebrew-to-English
–Alon Lavie, Shuly Wintner, Katharina Probst
–Hand-written and automatically learned
–Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules
Hindi-to-English
–Lavie, Peterson, Probst, Levin, Font, Cohen, Monson
–Automatically learned
–Performs better than SMT when training data is limited to 50K words

AVENUE systems (small and experimental, but tested on unseen data)
English-to-Spanish
–Ariadna Font Llitjós
–Hand-written, automatically corrected
Mapudungun-to-Spanish
–Roberto Aranovich and Christian Monson
–Hand-written
Dutch-to-English
–Simon Zwarts
–Hand-written

Elicitation
Get data from someone who is
–Bilingual
–Literate (with consistent spelling)
–Not experienced with linguistics

English-Hindi Example Elicitation Tool: Erik Peterson

English-Chinese Example Note: Translator has to insert spaces between words in Chinese.

English-Arabic Example

Purpose of Elicitation
Provide a small but highly targeted corpus of hand aligned data
–To support machine learning from a small data set
–To discover basic word order
–To discover how syntactic dependencies are expressed
–To discover which grammatical meanings are reflected in the morphology or syntax of the language
srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell
srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
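Records in this format can be read mechanically. The following is a minimal Python sketch, assuming the `aligned:` field always has the form shown in the records (1-based word indices, with spaces separating several indices on one side). The function names `parse_alignment` and `aligned_words` are hypothetical and are not part of the AVENUE tools.

```python
import re


def parse_alignment(aligned):
    """Parse '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner group is "(source indices,target indices)".
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", aligned):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs


def aligned_words(srcsent, tgtsent, aligned):
    """Return (source phrase, target phrase) pairs for one elicitation record."""
    src_words, tgt_words = srcsent.split(), tgtsent.split()
    result = []
    for src_idx, tgt_idx in parse_alignment(aligned):
        # Indices in the record are 1-based.
        src_phrase = " ".join(src_words[i - 1] for i in src_idx)
        tgt_phrase = " ".join(tgt_words[j - 1] for j in tgt_idx)
        result.append((src_phrase, tgt_phrase))
    return result


pairs = aligned_words("Tú estás cayendo", "eymi petu ütünagimi", "((1,1),(2 3,2 3))")
print(pairs)
```

For the second record, this pairs "Tú" with "eymi" and the two-word phrase "estás cayendo" with "petu ütünagimi".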

Languages
The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program.
Translated (by LDC) into:
–Thai
–Bengali
Plans to translate into:
–Seven “strategic” languages per year for five years, as one small part of a language pack (BLARK) for each language

Languages
Spanish version in progress at New Mexico State University (Helmreich and Cowie)
–Plans to translate into Guarani
Portuguese version in progress in Brazil (Marcello Modesto)
–Plans to translate into Karitiana (200 speakers)
Plans to translate into Inupiaq (Kaplan and MacLean)

Previous Elicitation Work
Pilot corpus
–Around 900 sentences
–No feature structures
Mapudungun
–Two partial translations
Quechua
–Three translations
Aymara
–Seven translations
Hebrew
Hindi
–Several translations
Dutch

AVENUE Machine Translation System
Transfer rule components: type information; synchronous context-free rules; alignments; x-side constraints; y-side constraints; xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst

Rule Learning - Overview
Goal: Acquire Syntactic Transfer Rules
Use available knowledge from the major-language side (grammatical structure)
Three steps:
1. Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
2. Compositionality Learning: use previously learned rules to learn hierarchical structure
3. Constraint Learning: refine rules by learning appropriate feature constraints

Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))

Flat Seed Rule Generation
Create a “flat” transfer rule specific to the sentence pair, partially abstracted to POS
–Words that are aligned word-to-word and have the same POS in both languages are generalized to their POS
–Words that have complex alignments (or not the same POS) remain lexicalized
One seed rule for each translation example
No feature constraints associated with seed rules (but mark the example(s) from which it was learned)
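The seed-generation step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AVENUE learner: the function name `flat_seed_rule` and the input layout are hypothetical, and a word is generalized to its POS whenever every word it is linked to carries the same POS (an assumption made so that "the", which links to both occurrences of "ha", comes out as ART as in the NP example). Unaligned or mismatched words stay lexicalized.

```python
def flat_seed_rule(cat, src, tgt, links):
    """Build a flat seed rule string.

    src, tgt: lists of (word, pos) pairs.
    links: 1-based (source index, target index) alignment pairs.
    """
    def side(tokens, own, other_tokens, other):
        out = []
        for i, (word, pos) in enumerate(tokens, start=1):
            partners = [link[other] for link in links if link[own] == i]
            same_pos = bool(partners) and all(
                other_tokens[p - 1][1] == pos for p in partners)
            # Generalize to the POS only when all aligned partners share it.
            out.append(pos if same_pos else f'"{word}"')
        return out

    x_side = side(src, 0, tgt, 1)
    y_side = side(tgt, 1, src, 0)
    alignments = " ".join(f"(X{i}::Y{j})" for i, j in links)
    return f"{cat}::{cat} [{' '.join(x_side)}] -> [{' '.join(y_side)}] ({alignments})"


rule = flat_seed_rule(
    "NP",
    [("the", "ART"), ("big", "ADJ"), ("apple", "N")],
    [("ha", "ART"), ("tapuax", "N"), ("ha", "ART"), ("gadol", "ADJ")],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
)
print(rule)
```

With the "the big apple" example as input, the sketch reproduces the seed rule NP::NP [ART ADJ N] -> [ART N ART ADJ] with the four alignment links.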

Compositionality Learning
Initial Flat Rules:
S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))
Generated Compositional Rule:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))

Compositionality Learning
Detection: traverse the c-structure of the English sentence, add compositional structure for translatable chunks
Generalization: adjust constituent sequences and alignments
Two implemented variants:
–Safe Compositionality: there exists a transfer rule that correctly translates the sub-constituent
–Maximal Compositionality: generalize the rule if supported by the alignments, even in the absence of an existing transfer rule for the sub-constituent
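A rough Python sketch of the "safe" variant follows. It is a simplification for illustration, not the AVENUE implementation: instead of traversing a c-structure, it scans the flat rule for a source sub-sequence that matches an existing lower-level rule and whose aligned target words form exactly that rule's target side, then collapses both sides to the constituent label and renumbers the alignments. The names `collapse_once` and `compose` are hypothetical.

```python
def collapse_once(x, y, links, sub_rules):
    """Replace one aligned sub-sequence with its constituent label, if possible."""
    for cat, sx, sy in sub_rules:
        for start in range(len(x) - len(sx) + 1):
            end = start + len(sx)  # source span covers 1-based positions start+1..end
            if x[start:end] != sx:
                continue
            tgt = {j for i, j in links if start < i <= end}
            if not tgt:
                continue
            lo, hi = min(tgt), max(tgt)
            # Reject if a source word outside the span links into the target span.
            crossing = any(lo <= j <= hi and not start < i <= end for i, j in links)
            if y[lo - 1:hi] != sy or crossing:
                continue
            new_links = set()
            for i, j in links:
                ni = start + 1 if start < i <= end else i - (len(sx) - 1) if i > end else i
                nj = lo if lo <= j <= hi else j - (hi - lo) if j > hi else j
                new_links.add((ni, nj))
            return x[:start] + [cat] + x[end:], y[:lo - 1] + [cat] + y[hi:], new_links
    return None


def compose(x, y, links, sub_rules):
    """Collapse sub-constituents repeatedly until no lower-level rule applies."""
    state = (list(x), list(y), set(links))
    while (nxt := collapse_once(*state, sub_rules)) is not None:
        state = nxt
    return state[0], state[1], sorted(state[2])


np_rules = [
    ("NP", ["ART", "ADJ", "N"], ["ART", "N", "ART", "ADJ"]),
    ("NP", ["ART", "N"], ["ART", "N"]),
]
result = compose(
    ["ART", "ADJ", "N", "V", "ART", "N"],
    ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"],
    [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 7), (6, 8)],
    np_rules,
)
print(result)
```

On the flat S rule and the two NP rules from the compositionality example, the sketch yields [NP V NP] -> [NP V P NP] with links (X1::Y1) (X2::Y2) (X3::Y4). The sketch assumes no lower-level rule contains its own label on the source side; otherwise the loop would not terminate.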

Constraint Learning
Input: Rules and their Example Sets
S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))
NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))
Output: Rules with Feature Constraints:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4) (X1 NUM = X2 NUM) (Y1 NUM = Y2 NUM) (X1 NUM = Y1 NUM))

Constraint Learning
Goal: add appropriate feature constraints to the acquired rules
Methodology:
–Preserve general structural transfer
–Learn specific feature constraints from example set
Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)
Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary
The seed rules in a group form the specific boundary of a version space
The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints
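The idea of learning feature constraints from an example set can be illustrated with a much simpler procedure than seeded version space learning. The Python sketch below is a stand-in, not the algorithm described on the slide: it keeps an agreement constraint between two constituents only if the feature values match in every example of the cluster. The function name `agreement_constraints`, the data layout, and the feature values are all hypothetical.

```python
from itertools import combinations


def agreement_constraints(examples, feature):
    """Return agreement constraints that hold in every example.

    examples: list of dicts mapping a constituent slot (e.g. "X1") to its
    feature dictionary for one translation example.
    """
    slots = sorted(examples[0])
    kept = []
    for a, b in combinations(slots, 2):
        holds = all(
            ex.get(a, {}).get(feature) is not None
            and ex.get(a, {}).get(feature) == ex.get(b, {}).get(feature)
            for ex in examples
        )
        if holds:
            kept.append(f"({a} {feature} = {b} {feature})")
    return kept


# Illustrative values for S::S [NP V NP]: X1 = subject NP, X2 = V, X3 = object NP.
cluster = [
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "pl"}},
    {"X1": {"NUM": "pl"}, "X2": {"NUM": "pl"}, "X3": {"NUM": "pl"}},
]
print(agreement_constraints(cluster, "NUM"))
```

Here the subject and verb agree in number in both examples, so (X1 NUM = X2 NUM) is kept, while the object's number varies independently and produces no constraint.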

Transfer and Decoding
[Figure: AVENUE architecture diagram, as on the slide "The Avenue Low Resource Scenario"]

The Transfer Engine
Analysis: Source text is parsed into its grammatical structure. Determines transfer application ordering.
Example: ראיתי את האיש הזקן
(I) saw *acc the man the old
[Figure: source parse tree with nodes S, VP, V, P, NP, D, N, D, Adj over ראיתי את האיש הזקן]
Transfer: A target language tree is created by reordering, insertion, and deletion. Source words translated with transfer lexicon.
[Figure: target tree with nodes S, NP, VP, N, V, NP, DET, Adj, N over "I saw the old man"]
Generation: Target language constraints are checked, target morphology applied, and final translation produced. E.g. “saw” in past tense selected.
Final translation: “I saw the old man”

Symbolic Decoder
System rarely finds a full parse/transfer for complete input sentence
XFER engine produces comprehensive lattice of segment translations
Decoder selects best combination of translation segments
Search for optimal scoring path of partial translations, based on multiple features:
–Target Language Model scores
–XFER Rule Scores
–Path Fragmentation
–Other features…
Symbolic decoding essential for scenarios where there is insufficient data for training large target LM
–Effective Rule Scoring is crucial
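The search for the best-scoring path through a lattice of segment translations can be sketched as a small dynamic program. This is a minimal illustration, not the AVENUE decoder: each segment carries a single combined score standing in for the language model and rule scores, path fragmentation is modeled as a fixed penalty per segment, and the function name `decode` and all scores are made up for the example.

```python
def decode(n, segments, fragment_penalty=1.0):
    """Find the best-scoring sequence of segments covering positions 0..n.

    segments: (start, end, text, score) tuples, each translating source
    positions [start, end) with end > start. Higher scores are better.
    Returns (total score, list of texts), or None if no full path exists.
    """
    best = {0: (0.0, [])}
    for pos in range(n):
        if pos not in best:
            continue
        base_score, base_path = best[pos]
        for start, end, text, score in segments:
            if start != pos:
                continue
            # Every extra segment pays the fragmentation penalty.
            candidate = base_score + score - fragment_penalty
            if end not in best or candidate > best[end][0]:
                best[end] = (candidate, base_path + [text])
    return best.get(n)


# Hypothetical lattice for a four-word source sentence; the scores are invented.
lattice = [
    (0, 1, "I saw", -1.0),
    (1, 2, "*acc", -3.0),
    (2, 3, "the man", -1.0),
    (3, 4, "the old", -1.0),
    (1, 4, "the old man", -2.0),
]
print(decode(4, lattice))
```

The two-segment path wins over the four-segment path both because its segment scores are higher and because it pays the fragmentation penalty fewer times.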

Rule Refinement
[Figure: AVENUE architecture diagram, as on the slide "The Avenue Low Resource Scenario"]

Interactive and Automatic Refinement of Translation Rules
Problem: Improve Machine Translation quality.
Proposed Solution: Put bilingual speakers back into the loop; use their corrections to detect the source of the error and automatically improve the lexicon and the grammar.
Approach: Automate post-editing efforts by feeding them back into the MT system.
-> Automatic refinement of translation rules that caused an error, beyond post-editing.
Goal: Improve MT coverage and overall quality.

Technical Challenges
Elicit minimal MT information from non-expert users
Automatically refine and expand translation rules minimally
–Manually written
–Automatically learned
Automatic evaluation of refinement process

Error Typology for Automatic Rule Refinement (simplified)
Missing word
Extra word
Wrong word order
–Local vs Long distance
–Word vs. phrase
–+ Word change
Incorrect word
–Sense
–Form
–Selectional restrictions
–Idiom
Wrong agreement
–Missing constraint
–Extra constraint

TCTool (Demo)
Interactive elicitation of error information
Actions:
–Add a word
–Delete a word
–Modify a word
–Change word order
                      precision  recall
error detection       90%        89%
error classification  72%        71%

Types of Refinement Operations: Automatic Rule Adaptation
1. Refine a translation rule: R0 -> R1 (change R0 to make it more specific or more general)
R0: NP [DET ADJ N] -> NP [DET N ADJ]
a nice house -> una casa bonito
R1: NP [DET ADJ N] -> NP [DET N ADJ], with N gender = ADJ gender
a nice house -> una casa bonita

Types of Refinement Operations: Automatic Rule Adaptation
2. Bifurcate a translation rule: R0 -> R0 (same, general rule) -> R1 (add a new more specific rule)
R0: NP [DET ADJ N] -> NP [DET N ADJ]
a nice house -> una casa bonita
R1: NP [DET ADJ N] -> NP [DET ADJ N], with ADJ type: pre-nominal
a great artist -> un gran artista
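The two refinement operations differ in what happens to the original rule: refining replaces R0 with a modified R1, while bifurcating keeps R0 and adds R1 alongside it. The Python sketch below shows that difference only; the `Rule` class, the function names, and the constraint strings are hypothetical and do not reflect the rule refinement module's actual data structures.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rule:
    name: str
    source: tuple
    target: tuple
    constraints: tuple = ()


def refine(rule, new_name, constraint):
    """R0 -> R1: return a modified copy that replaces the original rule."""
    return replace(rule, name=new_name,
                   constraints=rule.constraints + (constraint,))


def bifurcate(rule, new_name, target, constraint):
    """R0 -> R0 + R1: keep the general rule and add a more specific one."""
    specific = replace(rule, name=new_name, target=target,
                       constraints=rule.constraints + (constraint,))
    return [rule, specific]


r0 = Rule("R0", ("DET", "ADJ", "N"), ("DET", "N", "ADJ"))

# Refinement: add gender agreement so that "bonito" becomes "bonita".
r1 = refine(r0, "R1", "N gender = ADJ gender")

# Bifurcation: keep R0, add a rule for pre-nominal adjectives ("un gran artista").
grammar = bifurcate(r0, "R1", ("DET", "ADJ", "N"), "ADJ type: pre-nominal")
print(r1)
print(grammar)
```

After refinement the grammar still has one NP rule; after bifurcation it has two, and the more specific one applies only to adjectives marked as pre-nominal.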

Error Information Elicitation, Refinement Operation Typology, Automatic Rule Adaptation: A concrete example
Change word order
SL: Gaudí was a great artist
MT system output, TL: Gaudí era un artista grande
Ucorrection: *Gaudí era un artista grande -> Gaudí era un gran artista
[Slide callouts: clue word, error, correction]

Mapudungun Indigenous Language of Chile and Argentina ~ 1 Million Mapuche Speakers

Mapudungun Language
900,000 Mapuche people
At least speakers of Mapudungun
Polysynthetic
sl: pe-rke-fi-ñ Maria
ver-REPORT-3pO-1pSgS/IND
tl: DICEN QUE LA VI A MARÍA
(They say that) I saw Maria.

AVENUE Mapudungun Joint project between Carnegie Mellon University, the Chilean Ministry of Education, and Universidad de la Frontera.

Mapudungun to Spanish Resources
Initially:
–Large team of native speakers at Universidad de la Frontera, Temuco, Chile (some knowledge of linguistics; no knowledge of computational linguistics)
–No corpus
–A few short word lists
–No morphological analyzer
Later: Computational Linguists with non-native knowledge of Mapudungun
Other considerations:
–Produce something that is useful to the community, especially for bilingual education
–Experimental MT systems are not useful

Mapudungun
[Figure: AVENUE architecture diagram, as on the slide "The Avenue Low Resource Scenario", annotated with the Mapudungun resources: Corpus: 170 hours of spoken Mapudungun; Example Based MT; Spelling checker; Spanish Morphology from UPC, Barcelona]

Mapudungun Products
–Click: traductor mapudungún
–Dictionary lookup (Mapudungun to Spanish)
–Morphological analysis
–Example Based MT (Mapudungun to Spanish)

I Didn’t see Maria
[Figure: Mapudungun source tree (S; V pe; VSuff la, VSuff fi, VSuff ñ, each under a VSuffG; NP, N Maria) and Spanish target tree (S; VP; V “no” vi; “a”; NP, N María)]

Transfer to Spanish: Top-Down
[Figure: Mapudungun source tree (S; V pe; VSuff la, VSuff fi, VSuff ñ, each under a VSuffG; NP, N Maria) and partial Spanish target tree (S; VP; V; “a”; NP)]
VP::VP [VBar NP] -> [VBar "a" NP]
(
(X1::Y1) (X2::Y3)
((X2 type) = (*NOT* personal))
((X2 human) =c +)
(X0 = X1)
((X0 object) = X2)
(Y0 = X0)
((Y0 object) = (X0 object))
(Y1 = Y0)
(Y3 = (Y0 object))
((Y1 objmarker person) = (Y3 person))
((Y1 objmarker number) = (Y3 number))
((Y1 objmarker gender) = (Y3 gender))
)

Collaboration
Mapuche Language Experts
–Universidad de la Frontera (UFRO), Instituto de Estudios Indígenas (IEI)
–Institute for Indigenous Studies
Chilean Funding
–Chilean Ministry of Education (Mineduc), Bilingual and Multicultural Education Program
Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, Salvador Cañulef, Carolina Huenchullan Arrúe, Claudio Millacura Salas

Accomplishments
Corpora Collection
–Spoken Corpus
  Collected: Luis Caniupil Huaiquiñir
  Medical Domain
  3 of 4 Mapudungun Dialects: 120 hours of Nguluche, 30 hours of Lafkenche, 20 hours of Pwenche
  Transcribed in Mapudungun
  Translated into Spanish
–Written Corpus
  ~ 200,000 words
  Bilingual Mapudungun – Spanish
  Historical and newspaper text
nmlch-nmjm1_x_0405_nmjm_00:
M: no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!

Accomplishments
Developed at UFRO
–Bilingual Dictionary with Examples: 1,926 entries
–Spelling Corrected Mapudungun Word List: 117,003 fully-inflected word forms
–Segmented Word List: 15,120 forms; stems translated into Spanish

Accomplishments
Developed at LTI using Mapudungun language resources from UFRO
–Spelling Checker (integrated into OpenOffice)
–Hand-built Morphological Analyzer
–Prototype Machine Translation Systems (Rule-Based, Example-Based)
–Website: LenguasAmerindias.org

AVENUE Hebrew Joint project of Carnegie Mellon University and University of Haifa

Hebrew Language
Native language of about 3-4 Million in Israel
Semitic language, closely related to Arabic and with similar linguistic properties
–Root+Pattern word formation system
–Rich verb and noun morphology
–Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers ($,K$)…
Unique alphabet and Writing System
–22 letters represent (mostly) consonants
–Vowels represented (mostly) by diacritics
–Modern texts omit the diacritic vowels, thus additional level of ambiguity: “bare” word -> word
–Example: MHGR -> mehager, m+hagar, m+h+ger
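The ambiguity created by prefixed particles can be illustrated by enumerating every way of peeling particles off a written word. The Python sketch below is an illustration only, not the morphological analyzer used in the project: the particle set is taken from the list above, the three-entry lexicon is a toy assumption, and no grammatical restrictions on particle order are applied.

```python
# Prefix particles from the list above, in the slide's transliteration.
PREFIXES = {"H", "B", "K", "L", "M", "W", "$", "K$"}


def segmentations(word, lexicon):
    """Return every split of word into prefix particles plus a lexicon entry."""
    results = []
    if word in lexicon:
        results.append([word])
    for prefix in sorted(PREFIXES):
        if word.startswith(prefix) and len(word) > len(prefix):
            for rest in segmentations(word[len(prefix):], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (an assumption): just enough entries to cover the MHGR example.
toy_lexicon = {"MHGR", "HGR", "GR"}
for analysis in segmentations("MHGR", toy_lexicon):
    print("+".join(analysis))
```

With this toy lexicon the sketch prints three analyses, MHGR, M+HGR and M+H+GR, matching the three readings listed for MHGR.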

Hebrew Resources
Morphological analyzer developed at Technion
Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary
Human Computational Linguists
Native Speakers

Hebrew
[Figure: AVENUE architecture diagram, as on the slide "The Avenue Low Resource Scenario"]

Challenges for Hebrew MT
Paucity of existing language resources for Hebrew
–No publicly available broad coverage morphological analyzer
–No publicly available bilingual lexicons or dictionaries
–No POS-tagged corpus or parse tree-bank corpus for Hebrew
–No large Hebrew/English parallel corpus
Scenario well suited for CMU transfer-based MT framework for languages with limited resources

Hebrew Morphology Example
Input word: B$WRH
|        B$WRH        |
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|

Hebrew Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))

Sample Output (dev-data) maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money

Quechua -> Spanish MT
V-Unit: funded summer project in Cusco (Peru), June-August 2005 [preparations and data collection started earlier]
Intensive Quechua course at Centro Bartolome de las Casas (CBC)
Worked together with two native and one non-native Quechua speakers on developing infrastructure (correcting elicited translations, segmenting and translating a list of the most frequent words)

Quechua -> Spanish Prototype MT System
Stem lexicon (semi-automatically generated): 753 lexical entries
Suffix lexicon: 21 suffixes
– (150 Cusihuaman)
Quechua morphology analyzer
25 translation rules
Spanish morphology generation module
User studies: 10 sentences, 3 users (2 native, 1 non-native)

Quechua facts
Agglutinative language
A stem can often have 10 to 12 suffixes, but it can have up to 28 suffixes
Supposedly clear-cut boundaries, but in reality several suffixes change when followed by certain other suffixes
No irregular verbs, nouns or adjectives
Does not mark for gender
No adjective agreement
No definite or indefinite articles (‘topic’ and ‘focus’ markers perform a task similar to that of articles and intonation in English or Spanish)

Quechua examples
– taki+ni (also written takiniy) sing 1sg (I sing) -> canto
– taki+sha+ni (takishaniy) sing progr 1sg (I am singing) -> estoy cantando
– taki+pa+ku+q+chu?
  taki sing
  -pa+ku to join a group to do something
  -q agentive
  -chu interrogative
  -> (para) cantar con la gente (del pueblo)? (to sing with the people (of the village)?)
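
The glosses in these examples can be reproduced with a small lookup over the `+`-separated segments. This is a minimal sketch that assumes the word is already segmented and that -pa+ku is treated as a single two-part unit, as on the slide; the `gloss` function and its table are hypothetical helpers, not the project's morphology analyzer.

```python
# Gloss table built only from the morphemes shown on the slide.
GLOSSES = {
    ("taki",): "sing",
    ("ni",): "1sg",
    ("sha",): "progr",
    ("pa", "ku"): "to join a group to do something",
    ("q",): "agentive",
    ("chu",): "interrogative",
}

def gloss(segmented):
    """Gloss a '+'-segmented form, preferring two-part units over single ones."""
    parts = segmented.rstrip("?").split("+")
    out, i = [], 0
    while i < len(parts):
        for width in (2, 1):
            key = tuple(parts[i:i + width])
            if key in GLOSSES:
                out.append(GLOSSES[key])
                i += len(key)
                break
        else:
            out.append("?")  # morpheme not in the toy table
            i += 1
    return out

print(gloss("taki+sha+ni"))  # ['sing', 'progr', '1sg']
print(gloss("taki+pa+ku+q+chu?"))
```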

Quechua Resources
A few native speakers, not linguists
A computational linguist learning Quechua
Two fluent, but non-native linguists

Quechua
[Architecture diagram. Stages: Elicitation, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Learning Module, Learned Transfer Rules, Handcrafted rules, Morphology Analyzer, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module, INPUT TEXT, OUTPUT TEXT. Annotation: Parallel Corpus: OCR with correction]

Grammar rules
;taki+sha+ni -> estoy cantando (I am singing)
{VBar,3}
VBar::VBar : [V VSuff VSuff] -> [V V]
(
(X1::Y2)
((x0 person) = (x3 person))
((x0 number) = (x3 number))
((x2 mood) =c ger)
((y2 mood) = (x2 mood))
((y1 form) =c estar)
((y1 person) = (x3 person))
((y1 number) = (x3 number))
((y1 tense) = (x3 tense))
((x0 tense) = (x3 tense))
((y1 mood) = (x3 mood))
((x3 inflected) =c +)
((x0 inflected) = +)
)
lex = cantar, mood = ger
lex = estar, person = 1, number = sg, tense = pres, mood = ind
Spanish Morphology Generation: estoy cantando
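
The last step on this slide, turning the two feature bundles into "estoy cantando", can be sketched as a tiny generator. It is only an illustration: it covers the present indicative of estar and regular gerunds, and the `generate` function is a hypothetical stand-in for the Spanish morphology generation module.

```python
# Present indicative forms of "estar", keyed by (person, number).
ESTAR_PRES_IND = {
    (1, "sg"): "estoy", (2, "sg"): "estás", (3, "sg"): "está",
    (1, "pl"): "estamos", (2, "pl"): "estáis", (3, "pl"): "están",
}

def generate(fs):
    """Inflect one feature bundle; handles only the cases needed here."""
    lex = fs["lex"]
    if fs.get("mood") == "ger":
        # Regular gerunds only: -ar verbs take -ando, -er/-ir verbs take -iendo.
        return lex[:-2] + ("ando" if lex.endswith("ar") else "iendo")
    if lex == "estar" and fs.get("tense") == "pres" and fs.get("mood") == "ind":
        return ESTAR_PRES_IND[(fs["person"], fs["number"])]
    raise ValueError(f"no generation rule for {fs}")

# Feature bundles as they appear on the slide.
aux = {"lex": "estar", "person": 1, "number": "sg", "tense": "pres", "mood": "ind"}
main = {"lex": "cantar", "mood": "ger"}

print(generate(aux), generate(main))  # estoy cantando
```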

Hindi Resources
Large statistical lexicon from the Linguistic Data Consortium (LDC)
Parallel corpus from LDC
Morphological analyzer-generator from LDC
Lots of native speakers
Computational linguists with little or no knowledge of Hindi
Experimented with the size of the parallel corpus
– Miserly and large scenarios

Hindi
[Architecture diagram. Stages: Elicitation, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Learning Module, Learned Transfer Rules, Handcrafted rules, Morphology Analyzer, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module, INPUT TEXT, OUTPUT TEXT. Annotations: 15,000 Noun Phrases from Penn TreeBank; Parallel Corpus; EBMT; SMT]
Supported by DARPA TIDES

Manual Transfer Rules: Example
; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
;     life of (one) chapter
; ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
(X1::Y2)
(X2::Y1)
; ((x2 lexwx) = 'kA')
)
{NP,13}
NP::NP : [NP1] -> [NP1]
(
(X1::Y1)
)
{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
(
(X1::Y2)
(X2::Y1)
)
[Tree diagrams: Hindi source tree for "jIvana ke eka aXyAya" and English target tree for "one chapter of life"]
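
The reordering that rules {NP,12} and {PP,12} perform on the example can be sketched as a recursive walk over a small source tree. The tree encoding, the `REORDER` table, and the `transfer` function are hypothetical simplifications; the word glosses come from the slide, and the real transfer engine also handles feature constraints that are ignored here.

```python
# Word-level glosses taken from the slide's example.
LEXICON = {"jIvana": "life", "ke": "of", "eka": "one", "aXyAya": "chapter"}

# (parent label, source child labels) -> source child index per target position.
REORDER = {
    ("NP", ("PP", "NP1")): [1, 0],    # {NP,12}: (X1::Y2) (X2::Y1)
    ("PP", ("NP", "Postp")): [1, 0],  # {PP,12}: (X1::Y2) (X2::Y1)
}

def transfer(node):
    """Return the target-language words for a source tree node."""
    if isinstance(node, str):
        return [LEXICON.get(node, node)]
    label, children = node
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    # Keep source order when no reordering rule matches.
    order = REORDER.get((label, child_labels), range(len(children)))
    words = []
    for i in order:
        words.extend(transfer(children[i]))
    return words

# Source tree for "jIvana ke eka aXyAya".
TREE = ("NP", [
    ("PP", [("NP", ["jIvana"]), ("Postp", ["ke"])]),
    ("NP1", ["eka", "aXyAya"]),
])

print(" ".join(transfer(TREE)))  # one chapter of life
```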

Hindi-English
Columns: System, BLEU, M-BLEU, NIST
Systems: EBMT; SMT; XFER (naïve) man grammar; XFER (strong) no grammar; XFER (strong) learned grammar; XFER (strong) man grammar; XFER+SMT
Very miserly training data
Seven combinations of components
Strong decoder allows re-ordering
Three automatic scoring metrics