Data Collection and Language Technologies for Mapudungun Lori Levin, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie Language Technologies Institute.


Data Collection and Language Technologies for Mapudungun
Lori Levin, Rodolfo Vega, Jaime Carbonell, Ralf Brown, Alon Lavie, Language Technologies Institute, Carnegie Mellon University
Eliseo Cañulef, Instituto de Estudios Indígenas, Universidad de la Frontera
Carolina Huenchullan, Ministerio de Educación, Chile
Presented by Ariadna Font-Llitjos, Language Technologies Institute, Carnegie Mellon University

Overview
Chile’s programs in bilingual and multicultural education
The AVENUE project at Carnegie Mellon University
The Mapudungun corpus
Plans for Example-Based Machine Translation
Plans for Rule-Based Machine Translation

Bilingual and Multicultural Education in Chile
Eight ethnic groups: Mapuche, Aymara, Rapa Nui (Pascuense), Likay Antai, Quechua, Colla, Kawashkar (Alacalufe), Yamana (Yagan).
Make education culturally and linguistically relevant.
Languages of instruction are the native language and a second language (Spanish).
Community involvement in curriculum design.

AVENUE: Automatic Voice Enabled Natural language Understanding Environment
Affordable machine translation for languages with scarce resources.
–No large corpus in electronic form
–Few or no native speakers trained in computational linguistics

AVENUE: Omnivorous MT
AVENUE can consume whatever resources are available
–EBMT: if a parallel corpus is available
–Human-Engineered MT: if a human computational linguist is available
–Seeded Version Space Learning for automatic acquisition of transfer rules: if no corpus or computational linguist is available
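
The "omnivorous" selection described on this slide can be sketched as a small decision function. This is only an illustration of the stated conditions, not the AVENUE implementation; the names `Resources` and `select_engines` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Resources available for a language pair (hypothetical structure)."""
    has_parallel_corpus: bool = False
    has_computational_linguist: bool = False


def select_engines(res: Resources) -> list[str]:
    """Pick MT approaches following the three conditions on the slide."""
    engines = []
    if res.has_parallel_corpus:
        engines.append("EBMT")
    if res.has_computational_linguist:
        engines.append("human-engineered MT")
    if not engines:
        # Neither a corpus nor a computational linguist is available.
        engines.append("seeded version space learning")
    return engines


print(select_engines(Resources(has_parallel_corpus=True)))
print(select_engines(Resources()))
```

The sketch treats the approaches as combinable when several resources exist, which is an assumption; the slide only states when each approach applies.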

Mapudungun
Language of the Mapuche
–Over 900,000 Mapuche in Chile and Argentina
Words contain several morphemes, including multiple open-class items.
Still spoken by a majority of Mapuche
Still spoken as a first language
Competing orthographies
Some vocabulary loss
Some written literature and newsletters

The Mapudungun Corpora
First step toward:
–Corpus-based machine translation
–Authentic corpus for instructional purposes
Written corpus
Spoken corpus

The Written Mapudungun Corpus
Existing texts were entered in electronic form and translated into Spanish:
–Memorias de Pascual Coña: the life story of a Mapuche leader, written by Ernesto Wilhelm de Moesbach.
–Las Últimas Familias by Tomás Guevara.
–Nuestros Pueblos, a newspaper published by Corporación Nacional de Desarrollo Indígena (CONADI).
Total of around 200,000 words

The Spoken Mapudungun Corpus
Recorded with Sony DAT recorder and digital stereo microphone.
Downloaded with CoolEdit
Transcribed with TransEdit
–Alignment of audio and transcript for speech recognition

The Spoken Mapudungun Corpus
All sessions were scheduled and recorded by a native-speaker interviewer
Subject matter: primary and preventive health
–Limited domain for higher quality machine translation
–People were asked to describe their experiences with an illness and how it was treated by modern or traditional medicine

The Spoken Mapudungun Corpus
Speakers:
–21-75 years old; most
–Fully native speakers
–Some auxiliary nurses for rural areas in the Chilean public health system
–Some machi: did not reveal specialized knowledge

The Mapudungun Spoken Corpus
Dialects:
–Lafkenche, Nguluche, Pewenche
–Williche will be recorded at a later stage of the project; it has more morpho-syntactic differences from the other dialects

The Mapudungun Spoken Corpus
Orthography:
–Pan-dialectal: 32 phones; some are dialectal variants of each other
–Supra-dialectal: 28 letters covering the 32 phones
–Typable on a Spanish keyboard, with some diacritics such as apostrophes
–Use Spanish letters for phonemes that sound like Spanish phonemes

Plans for Machine Translation
Example-Based MT
Seeded Version Space Learning for automated acquisition of transfer rules

Example-Based MT Insert one of Ralf’s slides

Automated Acquisition of Transfer Rules
Elicitation Tool
Seeded Version Space Learning
Run-time transfer system for MT

Chinese-English Transfer Rule for Yes-No Questions
S::S : [NP VP MA] -> [AUX NP VP]
(
  (x1::y2)                 ; set alignments
  (x2::y3)
  ((x0 subj) = x1)         ; create Chinese f-structure
  ((x0 subj case) = nom)   ; Chinese has no case, so add it
  ((x0 act) = quest)       ; set speech act to question
  (x0 = x2)                ; create Chinese f-structure
  ((y1 form) = do)         ; set base form of AUX to "do"
                           ; proper form will be selected based on subj-verb agreement
  ((y3 vform) =c inf)      ; verb must be infinitive
  ((y1 agr) = (y2 agr))    ; subject and "do" must agree
)
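
To make the reordering in this rule concrete, here is a minimal Python sketch that applies only its constituent alignments and the inserted auxiliary. The dictionary layout and the function `apply_rule` are assumptions made for illustration; they are not the format or code of the AVENUE transfer system, and the feature constraints (case, agreement, verb form) are deliberately left out.

```python
# Hypothetical encoding of the rule S::S : [NP VP MA] -> [AUX NP VP].
RULE = {
    "source": ["NP", "VP", "MA"],
    "target": ["AUX", "NP", "VP"],
    "alignments": {1: 2, 2: 3},  # x1::y2 and x2::y3 (1-based, as on the slide)
    "inserted": {1: "do"},       # y1 has no source counterpart; base form "do"
}


def apply_rule(rule, constituents):
    """Reorder already-translated constituents according to the rule.

    constituents: list of (label, target_text) pairs for the source side.
    Returns the target-side sequence, or None if the rule does not match.
    """
    labels = [label for label, _ in constituents]
    if labels != rule["source"]:
        return None
    output = [None] * len(rule["target"])
    for x, y in rule["alignments"].items():
        output[y - 1] = constituents[x - 1][1]
    for y, form in rule["inserted"].items():
        output[y - 1] = form
    return output


# The question particle (MA) is unaligned, so it is dropped on the target side.
print(apply_rule(RULE, [("NP", "you"), ("VP", "like tea"), ("MA", "ma")]))
```

A full system would still need to select the correct inflected form of "do" from the agreement constraint; the sketch only emits the base form.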

Example of Seed Rule and Generalization
Pair 1: the man::der mann
Pair 2: the woman::die frau

Seed Rule 1   Seed Rule 2   Generalization
Det N -> Det N
X1::Y1
X2::Y2
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X2 AGR) = *3-SING)
((X2 COUNT) = +)
((Y1 AGR) = *3-SING)
((Y1 CASE) = *NOM)   ((Y1 CASE) = (*NOT* *GEN *DAT))
((Y1 DEF) = *DEF)
((Y2 GENDER) = *M)   ((Y2 GENDER) = *F)
((Y2 AGR) = *3-SING)
((Y2 CASE) = *NOM)
((Y2 GENDER) = *M)   ((Y2 GENDER) = *F)   ((Y2 GENDER) = (Y1 GENDER))
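
The step from two seed rules to a generalization can be illustrated with a small Python sketch. It is not the Seeded Version Space Learning algorithm itself: it only keeps constraints that both seed rules share and replaces value constraints that differ in parallel (here, gender) with an agreement constraint. The constraint encoding and the function `generalize` are hypothetical, the seed values are abbreviated, and the relaxation of CASE to a negated set is not modeled.

```python
def generalize(rule_a, rule_b):
    """Merge two seed rules given as {(variable, feature): value} dicts.

    Returns (shared, agreements): the constraints common to both rules, and
    (var1, var2, feature) triples where both rules disagree on the value but
    each rule assigns the same value to var1 and var2.
    """
    shared = {k: v for k, v in rule_a.items() if rule_b.get(k) == v}
    differing = [k for k in rule_a if k in rule_b and rule_a[k] != rule_b[k]]
    agreements = set()
    for i, (var1, feat1) in enumerate(differing):
        for var2, feat2 in differing[i + 1:]:
            same_feature = feat1 == feat2
            agree_in_a = rule_a[(var1, feat1)] == rule_a[(var2, feat2)]
            agree_in_b = rule_b[(var1, feat1)] == rule_b[(var2, feat2)]
            if same_feature and agree_in_a and agree_in_b:
                agreements.add((var1, var2, feat1))
    return shared, agreements


# Abbreviated, illustrative seed rules for "der mann" and "die frau".
seed_1 = {("X1", "DEF"): "DEF", ("Y1", "CASE"): "NOM",
          ("Y1", "GENDER"): "M", ("Y2", "GENDER"): "M"}
seed_2 = {("X1", "DEF"): "DEF", ("Y1", "CASE"): "NOM",
          ("Y1", "GENDER"): "F", ("Y2", "GENDER"): "F"}

shared, agreements = generalize(seed_1, seed_2)
print(shared)
print(agreements)
```

The resulting agreement triple corresponds to the constraint ((Y2 GENDER) = (Y1 GENDER)) in the generalization column.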

Elicitation Tool

Elicitation Process
Bilingual informant
–Literate in the elicitation language and the elicited language
–Translate sentences
–Align words

Elicitation Corpus: Excerpt
English prompt: He has sold both of his cars.
Spanish prompt: El ha vendido sus dos automóviles
Mapudungun provided by informant: fey weluiñi epu awtu
English prompt: He can move both of his thumbs.
Spanish prompt: El puede mover sus dos pulgares
Mapudungun provided by informant: fey pepi newüleliñi epu fütrarumechangüll
English prompt: He loves both of his sisters.
Spanish prompt: El ama a sus dos hermanas
Mapudungun provided by informant: fey poyeyñi epu deya
English prompt: He loves both of his brothers.
Spanish prompt: El ama a sus dos hermanos
Mapudungun provided by informant: fey poyeyñi epu peñi

Elicitation Corpus
Compositional:
–Small phrases are elicited first and then are combined into larger phrases
–For learnability
Minimal Pairs:
–Sentences that differ in only one feature (e.g., number of the subject)
–For automatic feature detection
If the minimal pair differs only in the number of the subject, and the verbs are different in the two sentences, the language may have agreement in number between subjects and verbs.
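
The minimal-pair test described on this slide can be sketched in Python as follows. The `Elicited` record, the feature names, and the placeholder verb forms are all hypothetical; in practice the verb form would come from the informant's word alignment, and a positive result is only a hint, as the slide's "may have agreement" indicates.

```python
from dataclasses import dataclass


@dataclass
class Elicited:
    """One elicited sentence (hypothetical structure)."""
    features: dict  # features of the prompt, e.g. {"subj_num": "sg"}
    verb: str       # verb form in the informant's translation


def suggests_agreement(a: Elicited, b: Elicited, feature: str) -> bool:
    """True if (a, b) is a minimal pair for `feature` and the verbs differ."""
    keys = a.features.keys() | b.features.keys()
    differing = {k for k in keys if a.features.get(k) != b.features.get(k)}
    if differing != {feature}:
        return False  # not a minimal pair for this feature
    return a.verb != b.verb


# Placeholder forms, not real Mapudungun data.
singular = Elicited({"subj_num": "sg", "tense": "past"}, verb="VERB-sg")
plural = Elicited({"subj_num": "pl", "tense": "past"}, verb="VERB-pl")
print(suggests_agreement(singular, plural, "subj_num"))
```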

Elicitation Corpus: Current Coverage
864 sentences (pilot corpus)
Transitive and intransitive sentences
Animate and inanimate subjects and objects
Definite and indefinite subjects and objects
Present/ongoing and past/completed
Singular, plural, and dual nouns
Simple noun phrases with definiteness, modifiers
Possessive noun phrases

Elicitation Corpus: Future Work
Probst and Levin (2002)
–Pitfalls of automated elicitation
Automatic branching and skipping:
–Automatically skip parts of the corpus depending on what features have been detected
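
One possible reading of "branching and skipping" is a filter over corpus sections keyed on detected features. The sketch below assumes that reading; the section names, feature names, and the function `plan_elicitation` are hypothetical and do not describe the planned AVENUE tool.

```python
def plan_elicitation(sections, detected):
    """Return the names of corpus sections still worth eliciting.

    sections: list of (name, required_feature) pairs; required_feature is
              None for sections that are always elicited.
    detected: {feature: bool} results from earlier feature detection.
              Features not yet tested are treated as possibly present.
    """
    return [
        name
        for name, required in sections
        if required is None or detected.get(required, True)
    ]


sections = [
    ("basic transitive sentences", None),
    ("dual nouns", "has_dual"),
    ("subject-verb number agreement", "has_number_agreement"),
]
detected = {"has_dual": False, "has_number_agreement": True}
print(plan_elicitation(sections, detected))
```

Treating untested features as possibly present keeps the filter conservative: a section is skipped only after detection has ruled its feature out.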

Status of Automated Rule Learning
Preliminary results
–Learned some compositional rules for German
Current work:
–Interaction of compositional rules
–Seed rule generation
–Generalization and verification of seed rule hypotheses

Status of Transfer Rule System
Preliminary experiments on Chinese-English MT
Integrated into a multi-engine system with Example-Based MT

Tools for Field Linguists?
Can feature detection and automatically learned rules help alert a field worker to potentially interesting data?
Can automated elicitation with branching and skipping be helpful?