AVENUE: Machine Translation for Resource-Poor Languages NSF ITR 2001-2005.



Project Members: Automated Rule Learning
Faculty
 – Jaime Carbonell
 – Ralf Brown
 – Alon Lavie
 – Lori Levin
Coordinator of Latin American Projects
 – Rodolfo Vega
Graduate Students
 – Ariadna Font Llitjos
 – Katharina Probst
 – Christian Monson
 – Erik Peterson

Resource-Poor Languages
Not enough linguists to write a human-engineered system.
Not enough corpora to build a corpus-based system.
No standard orthography.
May be spoken by hundreds of thousands of people (Mapudungun, Chile) or by only a few elderly people (Siona, Colombia).

AVENUE languages
AVENUE is currently working with:
 – Mapudungun [Chile]
 – Inupiaq [Alaska]
 – Aymara, Quechua and Aguaruna [Peru]
 – Siona [Colombia]

Mapudungun for the Mapuche
Chile
Official Language: Spanish
Population: ~15 million
~1/2 million Mapuche people
Language: Mapudungun

Where can AVENUE make a difference for indigenous communities?
To contribute to the development of the indigenous people at the local and national level

There are two possible ways to do this:
A traditional way, from experts on development
 – Outcome: To translate government policy documents on health care, law, agriculture, etc.
An alternative way, from local experts, grounded in the community’s experience and needs
 – Outcome: To contribute to language education in the form of literacy and second language acquisition

Inter- and multi-cultural bilingual education
An educational strategy contributing to the development of the indigenous culture beyond the point of subsistence.
Helping each individual and their communities to achieve excellence in a multicultural national and global context.
Increasing the use of information and communication technologies, in a life-long learning environment.

In exchange for the language data, we agree to contribute to the creation of the following products:
 – Plug-in orthographic corrector for word processors
 – Electronic dictionary
 – Web-based translator
 – Intelligent tutor for literacy and second language acquisition

Our last meeting in Temuco, May 2002

Automatic Learning of a Transfer-based MTS
[Diagram. Components shown: elicitation corpus, SVS algorithm, tentative transfer rules, transfer module, rule refinement module, SL sentences, (tentative) TL sentences, morphology learning, morphological analyzer. Names shown: Kathrin Probst, Erik Peterson, Ariadna Font, Christian Monson.]

Morphology Analyzer for Rule Based Machine Translation

Example and Motivation

Results
Language: English
Corpus: Brown Corpus
Set Accuracy: 88.3%
Example Clusters:
 – NULL:s    navigator, discourse, peptide, …
 – NULL:’s   smith, china, cook, …
 – NULL:ed   slim, reappeared, munch, …
 – NULL:ing  reappear, respond, grunt, …
 – NULL:ly   peaceful, remote, superb, …
 – …
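
The cluster labels above pair the bare stem (NULL) with a suffix. The slides do not describe how the clusters were learned, so the following Python sketch is only an illustration of what a "NULL:s" cluster contains: stems that occur in a vocabulary both bare and with the suffix attached. The function name and the toy vocabulary are hypothetical.

```python
def null_suffix_cluster(vocabulary, suffix):
    """Return stems that occur both bare (NULL) and with `suffix` attached."""
    vocab = set(vocabulary)
    return sorted(word for word in vocab if word + suffix in vocab)

# Toy vocabulary invented for illustration; not taken from the Brown Corpus.
toy_vocab = [
    "navigator", "navigators", "discourse", "discourses",
    "respond", "responding", "remote", "remotely", "peptide",
]

# One cluster per candidate suffix, labeled the way the slide labels them.
clusters = {
    f"NULL:{suffix}": null_suffix_cluster(toy_vocab, suffix)
    for suffix in ("s", "ing", "ly")
}
```

In this toy vocabulary "peptide" lands in no cluster because "peptides" is absent, which hints at why sparse corpora make this kind of grouping hard.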

Future Directions
More languages
 – Spanish
 – Mapudungun
More types of morphology
 – Prefixes
 – Infixes
Employ a human informant
 – Small amount of knowledge might help a lot

AVENUE Transfer Engine
Written specifically for automatically learned rules
 – Integrated with rule learner
 – Can also be augmented with hand-written rules
Currently researching constructions
 – Constructions are non-compositional structures
 – Many translation problems are associated with constructions

Translation Example
Source: 总统会辞职吗?
Gloss: president will resign QUEST
Transfer
English Output: Will the president resign?
During translation:
 – Question particle 吗 is deleted
 – Auxiliary “will” is reordered before the subject
 – “the” is added before “president”
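
The three operations listed for this example (delete the question particle, move the auxiliary before the subject, add a determiner) can be sketched in Python as below. This is a toy illustration over glossed tokens, not the AVENUE transfer engine. The tag names ("n", "aux", "v", "qpart") and the function name are assumptions made for this sketch.

```python
def transfer(glossed_tokens):
    """Toy transfer over (english_gloss, tag) pairs for the slide's example."""
    is_question = any(tag == "qpart" for _, tag in glossed_tokens)

    # 1. Delete the question particle.
    words = [(gloss, tag) for gloss, tag in glossed_tokens if tag != "qpart"]

    # 2. In a question, reorder the auxiliary before the subject.
    if is_question:
        aux = [w for w in words if w[1] == "aux"]
        rest = [w for w in words if w[1] != "aux"]
        words = aux + rest

    # 3. Add "the" before each noun (a hard-coded assumption for this example).
    output = []
    for gloss, tag in words:
        if tag == "n":
            output.append("the")
        output.append(gloss)

    sentence = " ".join(output)
    return sentence[0].upper() + sentence[1:] + ("?" if is_question else ".")

# Glosses for 总统 会 辞职 吗, as given on the slide.
source = [("president", "n"), ("will", "aux"), ("resign", "v"), ("QUEST", "qpart")]
english = transfer(source)
```

A real transfer rule would state these operations declaratively, as alignments and constraints, rather than as hard-coded steps.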

New approach to MT
Fully automatic (no human intervention)
Very little electronic data available: elicitation corpus
Machine learning techniques
 – Seeded version space algorithm to automatically learn transfer rules
 – Interactive and automatic refinement of transfer rules

Elicitation Tool

Rule Learning – Overview
Goal: learn transfer rules for a language pair where one language is resource-rich, the other is resource-poor
Learning proceeds in three steps:
1. Flat Seed Generation: “informed guessing” of transfer rules
2. Compositionality: adding structure to rules, using previously learned rules
3. Seeded Version Space Learning: generalizing rules to make them scale to more unseen examples

Flat Seed Generation - Example
The highly qualified applicant did not accept the offer.
Der äußerst qualifizierte Bewerber nahm das Angebot nicht an.
((1,1),(2,2),(3,3),(4,4),(6,8),(7,5),(7,9),(8,6),(9,7))
S::S [det adv adj n aux neg v det n] → [det adv adj n v det n neg vpart]
(;;alignments: (x1::y1)(x2::y2)(x3::y3)(x4::y4)(x6::y8)(x7::y5)(x7::y9)(x8::y6)(x9::y7)
 ;;constraints: ((x1 def) = *+) ((x4 agr) = *3-sing) ((x5 tense) = *past) ….
 ((y1 def) = *+) ((y3 case) = *nom) ((y4 agr) = *3-sing) …. )
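
As a rough illustration of flat seed generation, the sketch below builds the structural part of a rule (the POS sequences and the alignments) from a word-aligned, POS-tagged sentence pair. It is a minimal Python sketch and not the project's implementation: the function name is hypothetical, the POS tags are assumed to be given, and feature constraints are omitted.

```python
def flat_seed_rule(sl_pos, tl_pos, alignments, category="S"):
    """Format a flat seed rule from POS sequences and 1-indexed alignments."""
    lhs = " ".join(sl_pos)
    rhs = " ".join(tl_pos)
    align = "".join(f"(x{i}::y{j})" for i, j in alignments)
    return f"{category}::{category} [{lhs}] -> [{rhs}]\n(;;alignments: {align})"

# "The highly qualified applicant did not accept the offer."
sl_pos = ["det", "adv", "adj", "n", "aux", "neg", "v", "det", "n"]
# "Der äußerst qualifizierte Bewerber nahm das Angebot nicht an."
tl_pos = ["det", "adv", "adj", "n", "v", "det", "n", "neg", "vpart"]
alignments = [(1, 1), (2, 2), (3, 3), (4, 4), (6, 8), (7, 5), (7, 9), (8, 6), (9, 7)]

rule = flat_seed_rule(sl_pos, tl_pos, alignments)
```

The unaligned auxiliary "did" (x5) simply has no entry in the alignment list, and "accept" (x7) aligns to both "nahm" (y5) and "an" (y9).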

Compositionality - Example
Flat seed rule:
S::S [det adv adj n aux neg v det n] → [det adv adj n v det n neg vpart]
(;;alignments: (x1::y1)(x2::y2)(x3::y3)(x4::y4)(x6::y8)(x7::y5)(x7::y9)(x8::y6)(x9::y7)
 ;;constraints: ((x1 def) = *+) ((x4 agr) = *3-sing) ((x5 tense) = *past) ….
 ((y1 def) = *+) ((y3 case) = *nom) ((y4 agr) = *3-sing) …. )
Previously learned rule:
NP::NP [det ADJP n] → [det ADJP n]
((x1::y1)… ((y3 agr) = *3-sing) ((x3 agr) = *3-sing) ….)
Resulting compositional rule:
S::S [NP aux neg v det n] → [NP v det n neg vpart]
(;;alignments: (x1::y1)(x3::y5)(x4::y2)(x4::y6)(x5::y3)(x6::y4)
 ;;constraints: ((x2 tense) = *past) …. ((y1 def) = *+) ((y1 case) = *nom) …. )
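
The structural step in this example, replacing the span covered by the NP rule with a single NP constituent and renumbering the alignments, can be sketched in Python as follows. This is an illustrative sketch under simplifying assumptions: the spans are supplied by hand, the function name is hypothetical, and constraints are not carried over.

```python
def collapse_span(sl, tl, alignments, sl_span, tl_span, label):
    """Replace 1-indexed inclusive spans on both sides with one constituent."""
    (s_lo, s_hi), (t_lo, t_hi) = sl_span, tl_span

    # The two spans may only align to each other, never across the boundary.
    for x, y in alignments:
        if (s_lo <= x <= s_hi) != (t_lo <= y <= t_hi):
            raise ValueError(f"alignment ({x},{y}) crosses the span boundary")

    def remap(index, lo, hi):
        if index < lo:
            return index
        if index <= hi:
            return lo                  # everything inside the span becomes one slot
        return index - (hi - lo)       # later positions shift left

    new_sl = sl[:s_lo - 1] + [label] + sl[s_hi:]
    new_tl = tl[:t_lo - 1] + [label] + tl[t_hi:]
    new_alignments = sorted(
        {(remap(x, s_lo, s_hi), remap(y, t_lo, t_hi)) for x, y in alignments}
    )
    return new_sl, new_tl, new_alignments

sl = ["det", "adv", "adj", "n", "aux", "neg", "v", "det", "n"]
tl = ["det", "adv", "adj", "n", "v", "det", "n", "neg", "vpart"]
flat = [(1, 1), (2, 2), (3, 3), (4, 4), (6, 8), (7, 5), (7, 9), (8, 6), (9, 7)]

new_sl, new_tl, new_alignments = collapse_span(sl, tl, flat, (1, 4), (1, 4), "NP")
```

Applied to the flat seed rule, this reproduces the constituent sequences and alignments of the compositional rule shown on the slide.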

Seeded Version Space Learning - Example
Rule learned from a singular example:
S::S [NP aux neg v det n] → [NP v det n neg vpart]
(;;alignments: (x1::y1)(x3::y5)(x4::y2)(x4::y6)(x5::y3)(x6::y4)
 ;;constraints: ((x2 tense) = *past) …. ((y1 def) = *+) ((y1 case) = *nom) ((y1 agr) = *3-sing) …
 ((y3 agr) = *3-sing) ((y4 agr) = *3-sing) … )
Rule learned from a plural example:
S::S [NP aux neg v det n] → [NP v det n neg vpart]
(;;alignments: (x1::y1)(x3::y5)(x4::y2)(x4::y6)(x5::y3)(x6::y4)
 ;;constraints: ((x2 tense) = *past) … ((y1 def) = *+) ((y1 case) = *nom) ((y1 agr) = *3-plu) …
 ((y3 agr) = *3-plu) ((y4 agr) = *3-plu) … )
Generalized rule:
S::S [NP aux neg v det n] → [NP v det n neg vpart]
(;;alignments: (x1::y1)(x3::y5)(x4::y2)(x4::y6)(x5::y3)(x6::y4)
 ;;constraints: ((x2 tense) = *past) … ((y1 def) = *+) ((y1 case) = *nom) ((y4 agr) = (y3 agr)) … )
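
The generalization in this example turns value constraints that differ between two otherwise identical rules into agreement constraints. The Python sketch below shows one simple way to do that merge. It is a simplification for illustration and should not be read as the seeded version space algorithm itself. The dictionary representation and function name are assumptions, and only the constraints visible on the slide are included.

```python
def generalize(constraints_a, constraints_b):
    """Merge two rules' value constraints.

    Returns (shared, equalities): constraints with the same value in both
    rules, plus agreement constraints for variables whose values differ
    between the rules but co-vary with each other.
    """
    shared = {
        key: value
        for key, value in constraints_a.items()
        if constraints_b.get(key) == value
    }

    # Group variables by feature and by the pair of values they take.
    groups = {}
    for (var, feat), value_a in constraints_a.items():
        value_b = constraints_b.get((var, feat))
        if value_b is not None and value_b != value_a:
            groups.setdefault((feat, value_a, value_b), []).append(var)

    equalities = []
    for (feat, _, _), variables in groups.items():
        variables = sorted(variables)
        for earlier, later in zip(variables, variables[1:]):
            equalities.append(((later, feat), (earlier, feat)))
    return shared, equalities

singular = {
    ("x2", "tense"): "past", ("y1", "def"): "+", ("y1", "case"): "nom",
    ("y1", "agr"): "3-sing", ("y3", "agr"): "3-sing", ("y4", "agr"): "3-sing",
}
plural = {
    ("x2", "tense"): "past", ("y1", "def"): "+", ("y1", "case"): "nom",
    ("y1", "agr"): "3-plu", ("y3", "agr"): "3-plu", ("y4", "agr"): "3-plu",
}

shared, equalities = generalize(singular, plural)
```

On these inputs the result includes the slide's ((y4 agr) = (y3 agr)) constraint, along with an analogous one linking y3 to y1.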

Remaining Research Issues
Improvement of existing algorithms
Reversal of translation direction
Learning with less information on the resource-poor language
Learning from an unstructured corpus

Interactive and Automatic rule refinement
1. Given an MTS, translate sentences and present them to the users for minimal correction (interface design, MT error classification)
2. Determine blame assignment
3. Structure learning, as opposed to binary feedback, to automatically refine the existing rules

Interactive Learning
Translation Correction Tool, web application
Bilingual informants (no knowledge of linguistics assumed)
User-friendly and intuitive interface
Can naïve users reliably pinpoint the source of errors?
Is MT error classification realistic?
Need for user studies:
 – Spanish - English
 – English - Spanish
 – English - Chinese

Structure learning
Given user feedback (correction + error classification) and blame assignment, modify the appropriate transfer rule(s) to obtain the correct translation
Need to evaluate based on cross-validation and on the number of sentences it can translate correctly (elicitation corpus)
Learn mapping between incorrect structures and correct structures:
She saw high woman → She saw the tall woman

A simple example
Spanish SLS: Ella vio a la mujer alta
English TLS: She saw high woman
Corrected TLS: She saw the tall woman
MT error classification: missing determiner + wrong lexical selection
Blame assignment (NP rule that generated the direct object + selectional restrictions)
Rule refinement: the Noun Phrase (NP) rule that generated the error,
NP -> Adj N
needs to be refined into 2 different cases:
NP -> Det Adj N[sg] (the tall woman)
NP -> (Det) Adj N[pl] ((the)? tall women)
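
To make the two refined cases concrete, the Python sketch below encodes them as data and lists the surface forms each one licenses. It only illustrates the end result of the refinement; how the system would derive these rules from user corrections is not shown. The rule representation and the function name are hypothetical.

```python
# The two refined NP rules from the slide, encoded as plain dictionaries.
REFINED_NP_RULES = [
    {"rhs": ["Det", "Adj", "N"], "number": "sg", "det_optional": False},
    {"rhs": ["Det", "Adj", "N"], "number": "pl", "det_optional": True},
]

def np_variants(adj, noun, number, det="the"):
    """List the noun phrases licensed by the refined rules for this number."""
    variants = []
    for rule in REFINED_NP_RULES:
        if rule["number"] != number:
            continue
        variants.append(f"{det} {adj} {noun}")
        if rule["det_optional"]:
            variants.append(f"{adj} {noun}")
    return variants

singular_forms = np_variants("tall", "woman", "sg")
plural_forms = np_variants("tall", "women", "pl")
```

The original rule NP -> Adj N would have produced a bare singular such as "tall woman", which neither refined case allows.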

Remaining research issues
Refine MT error classification
Blame assignment
Structure Learning algorithm
Expand elicitation corpus with more verb subcategorization patterns