NICE: Native Language Interpretation and Communication Environment
Lori Levin, Jaime Carbonell, Alon Lavie, Ralf Brown, Erik Peterson, Katharina Probst, Rodolfo Vega, Hal Daume
Language Technologies Institute, Carnegie Mellon University
April 12, 2001

NICE
Rapid development of machine translation for low and very low density languages

Classification of MT by Language Density
High density pairs (E-F, E-S, E-J, …)
–Statistical or traditional MT approaches are O.K.
Medium density (E-Czech, E-Croatian, …)
–Example-based MT (success with Croatian, Korean)
–JHU: initial success with stat-MT (Czech)
Low density (S-Mapudungun, E-Iñupiaq, …)
–10,000 to 1 million speakers
–Insufficient bilingual corpora for SMT, EBMT
–Partial corpus-based resources
–Insufficient trained computational linguists

Machine Translation of Very Low Density Languages
No text in electronic form
–Can't apply current methods for statistical MT
No standard spelling or orthography
Few literate native speakers
Few linguists familiar with the language
–Nobody is available to do rule-based MT
Not enough money or time for years of linguistic information gathering/analysis
E.g., Siona (Colombia)

Motivation for LDMT
Methods developed for languages with very scarce resources will generalize to all MT.
Policy makers can get input from indigenous people.
–E.g., has there been an epidemic or a crop failure?
Indigenous people can participate in government, education, and the internet without losing their language.
First MT of polysynthetic languages.

New Ideas
MT without large amounts of text and without trained linguists
Machine learning of rule-based MT
Multi-Engine architecture can flexibly take advantage of whatever resources are available
Research partnerships with indigenous communities
(Future: Exponential models for data-miserly SMT)

History of NICE
Arose from a series of joint workshops of NSF and OAS-CICAD.
Workshop recommendations:
–Create multinational projects using information technology to:
provide immediate benefits to governments and citizens
develop critical infrastructure for communication and collaborative research
–training researchers and engineers
–advancing science and technology

Approach
Machine learning
–Uncontrolled corpus (Generalized Example-Based MT)
–Controlled corpus elicited from native speakers (Version Space Learning)
Multi-Engine MT
–Flexibly adapt to whatever resources are available
–Take advantage of the strengths of different MT approaches

Evaluation Objective
To achieve a given level of translation quality for a series of languages L1 to Ln:
–Reduce the amount of training data required
–Reduce the amount of language-specific development time after language-independent software has been developed

Evaluation Baseline From Previous Work (Generalized EBMT)
High density languages (French, Spanish)
–1MW parallel corpora (e.g., subset of Hansards)
–Consistent spelling, grammatically correct
–High coverage, gisting-quality translation

Evaluation Baseline: GEBMT, French Hansards
[Chart: coverage (in percent) as a function of corpus size (in millions of words)]
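For readers who want to reproduce the shape of this coverage curve, a minimal sketch of an n-gram coverage measurement follows. It is an approximation for illustration only; the actual GEBMT indexing and matching are more elaborate, and the corpus variable names in the usage comment are hypothetical.

```python
# Illustrative sketch: n-gram coverage of a test set against a training corpus,
# roughly the quantity plotted in the coverage-vs-corpus-size curve above.

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def coverage(train_sentences, test_sentences, n=3):
    """Fraction of test-set n-grams that also occur somewhere in the training data."""
    train_index = set()
    for sent in train_sentences:
        train_index |= ngrams(sent.split(), n)
    test_grams = set()
    for sent in test_sentences:
        test_grams |= ngrams(sent.split(), n)
    if not test_grams:
        return 0.0
    return len(test_grams & train_index) / len(test_grams)

# Hypothetical usage: measure coverage for increasing corpus sizes to trace the curve.
# for size in (10_000, 100_000, 1_000_000):
#     print(size, coverage(hansard_sentences[:size], test_sentences))
```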

Long-Term Target: Reduction in Linguistic and Human Resources

Work Completed

Establishing Partnerships

NICE Partners (Language / Country / Institutions)
–Mapudungun (in place): Chile. Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education
–Iñupiaq (advanced discussion): US (Alaska). Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center
–Siona (discussion): Colombia. OAS-CICAD, Plante, Department of the Interior

NICE/Mapudungun: Current Products
Writing conventions (Grafemario)
Mapudungun/Spanish glossary
Bilingual newspaper, 4 issues
Ultimas Familias (memoirs)
Memorias de Pascual Coña
6 hours of transcribed speech
40 hours of recorded speech

Instructible Knowledge-Based MT

iRBMT: Instructible Rule Based MT

Elicitation Process
Purpose: controlled elicitation of data that will be input to machine learning of translation rules

Elicitation Interface Example

Elicitation Interface
Native informant sees a source language sentence (in English or Spanish)
Native informant types in the translation, then uses the mouse to add word alignments
The informant is:
–Literate
–Bilingual
–Not an expert in linguistics or computation
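For concreteness, here is one plausible shape for the record a single elicitation step produces. The field names and structure are assumptions for illustration, not the actual NICE data format; the Hebrew example anticipates the learning instance on the next slide.

```python
# Hypothetical record produced by one elicitation step (illustrative only).
pair = {
    "source": ["the", "big", "boy"],                  # English or Spanish prompt, tokenized
    "target": ["ha-yeled", "ha-gadol"],               # informant's typed translation, tokenized
    "alignments": [(0, 0), (0, 1), (1, 1), (2, 0)],   # (source index, target index) pairs from mouse clicks
}

def aligned_to(pair, src_index):
    """Target words the informant linked to a given source word."""
    return [pair["target"][j] for i, j in pair["alignments"] if i == src_index]

print(aligned_to(pair, 0))   # ['ha-yeled', 'ha-gadol']: "the" maps to both definite-marked forms
```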

The Learning Process
Learning Instance:
English: the big boy
Hebrew: ha-yeled ha-gadol
Acquired Transfer Rule:
Hebrew: NP: N ADJ
English: NP: the ADJ N
where:
(Hebrew:N English:N)
(Hebrew:ADJ English:ADJ)
(Hebrew:N has ((def +)))
(Hebrew:ADJ has ((def +)))
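A rough sketch of how a rule skeleton could be derived from one such aligned instance. This is an illustration under assumptions, not the project's rule learner or formalism: the POS tags are taken as given, and the feature constraints shown in the slide (such as ((def +))) are left out.

```python
# Illustrative sketch only: deriving a flat transfer-rule skeleton from one
# POS-tagged, word-aligned learning instance.

def seed_rule(src_tags, tgt_tags, alignments):
    """Build a seed rule: source POS pattern, target POS pattern, and the
    category correspondences implied by the word alignments."""
    return {
        "source_pattern": src_tags,       # e.g. ["DET", "ADJ", "N"]  (English)
        "target_pattern": tgt_tags,       # e.g. ["N", "ADJ"]         (Hebrew)
        "correspondences": sorted(set(
            (src_tags[i], tgt_tags[j]) for i, j in alignments
        )),
    }

# From "the big boy" / "ha-yeled ha-gadol", aligning the content words gives
# the N<->N and ADJ<->ADJ correspondences; the ((def +)) constraints on the
# Hebrew side would be added from the definite article observed in English.
rule = seed_rule(["DET", "ADJ", "N"], ["N", "ADJ"], [(1, 1), (2, 0)])
print(rule["correspondences"])   # [('ADJ', 'ADJ'), ('N', 'N')]
```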

Seeded Version Space Learning
SVS is based on Mitchell-style inductive version-space learning, but instead of keeping full S and G boundaries for each concept, it starts from a seeded rule and grows by generalization, specialization, and rule bifurcation with incrementally acquired data.
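The growth strategy described above can be sketched as a small update loop. This is an illustrative reading of the slide, not the actual SVS implementation; the helper functions (covers, correct, generalize, specialize, seed_from) are assumed to be supplied by the learner.

```python
# Sketch of the seeded version-space idea: keep rules that handle an instance
# correctly, specialize and split (bifurcate) a rule that mis-applies, and try
# to generalize an existing rule before seeding a new one.

def svs_update(rules, instance, covers, correct, generalize, specialize, seed_from):
    """One incremental update of the rule set with a new learning instance."""
    for i, rule in enumerate(rules):
        if covers(rule, instance):
            if correct(rule, instance):
                return rules                          # already handled correctly
            rules[i] = specialize(rule, instance)     # stop the wrong application
            rules.append(seed_from(instance))         # rule bifurcation
            return rules
    for i, rule in enumerate(rules):
        wider = generalize(rule, instance)            # try widening an existing rule
        if wider is not None and correct(wider, instance):
            rules[i] = wider
            return rules
    rules.append(seed_from(instance))                 # otherwise seed a new rule
    return rules
```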

Version Space Abstraction Lattice

The Elicitation Corpus
List of sentences in a major language
–English
–Spanish
Dynamically adaptable
–Different sentences are presented depending on what was previously elicited
Compositional
–Joe, Joe's brother, I saw Joe's brother, I told you that I saw Joe's brother, etc.
Aim for typological completeness
–Cover all types of languages
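The "dynamically adaptable" point above can be made concrete with a small sketch. This is an illustration only, not the NICE elicitation tool: the corpus format, the "probes" field, and the settled-feature bookkeeping are all assumptions.

```python
# Sketch: skip elicitation items whose diagnostic is already settled by
# earlier answers (e.g. a language with no dual needs no dual paradigm).
corpus = [
    {"id": 1, "text": "the two boys sleep", "probes": "dual"},
    {"id": 2, "text": "the boys sleep",     "probes": "plural"},
    {"id": 3, "text": "a few boys sleep",   "probes": "paucal"},
]

def next_sentences(corpus, settled):
    """Return only the sentences whose diagnostic feature is still open."""
    return [item for item in corpus if item["probes"] not in settled]

# If earlier answers showed the language marks no dual, its items are skipped.
print(next_sentences(corpus, settled={"dual"}))
```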

Pilot Version of Elicitation Corpus
Approximately 800 sentences; tested on Swahili
Vocabulary
–Include a variety of semantic classes, e.g., animate, inanimate, man-made objects, natural objects, etc.
Noun phrases
–Detect number, gender, types of possessives, classifiers, etc.
Basic sentences
–Detect agreement between verb and subject and/or object, basic word order, problems with indefinite or inanimate subjects, etc.
Complex constructions
–Currently relative clauses; later, comparatives, questions, embedded clauses, etc.

Detection of Grammatical Features
Each language uses a different inventory of grammatical features: tense, number, person, agreement.
Swahili:
The hunter kill-ed the animal
Mwindaji a-li-mu-ua mnyama
(a = class-one subject, li = past tense, mu = class-one object, ua = kill)
Fox (Algonquian):
Ne-waapam-aa-wa  "I-see-direct-him"
Ne-waapam-ek-wa  "me-see-indirect-he"
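A minimal sketch of the kind of diagnostic this slide describes. It is not the NICE feature-detection module, and the present-tense Swahili rendering is a hypothetical informant answer used only for illustration; the idea is simply to compare translations of a minimal pair that differs in exactly one feature.

```python
# Crude diagnostic sketch: if the two translations of a minimal pair differ,
# the language marks the contrasting feature somewhere in the sentence.

def feature_marked(translation_a, translation_b):
    """Do the informant's translations of a minimal pair differ?"""
    return translation_a.strip() != translation_b.strip()

past    = "Mwindaji alimuua mnyama"    # 'The hunter killed the animal' (from the slide)
present = "Mwindaji anamuua mnyama"    # hypothetical present-tense rendering
print(feature_marked(past, present))   # True: tense is overtly marked on the verb
```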

Organization of Tests
[Table of diagnostic tests omitted; rows include Plural, Dual, Paucal, and Subj-V Agr]

Demo of Elicitation Interface and Feature Detection

Data Collection

Mapudungun Data
Spanish-Mapudungun parallel corpora
–Total words: 223,366
Spanish-Mapudungun glossary
–About 5,500 entries
40 hours of speech recorded
6 hours of speech transcribed
Speech data will be translated into Spanish

Progress and Plans

Summary of Year 1: Partnerships
Establishment of a partnership with the Institute for Indigenous Studies at the Universidad de la Frontera (UFRO) in Chile.
Establishment of a partnership with the Chilean Ministry of Education.
Identified partners in Alaska and Colombia. Details of the partnerships are being discussed.

Summary of Year 1: Data
Spanish-Mapudungun parallel corpus: over 200,000 words.
Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.
Training for spoken language collection: In January 2001 native speakers of Mapudungun were trained in the recording and transcription of spoken data.
Mapudungun spoken language corpus: 40 hours recorded, 6 hours transcribed (as of end of February).

Summary of Year 1: iKBMT
Preliminary design of transfer rule formalism for machine translation.
Design and pilot testing of prototype elicitation corpus.
First prototype of feature detection.
Morphological processing in PC-KIMMO covering about 40 Mapudungun morphemes.
Preliminary version of new parser for run-time translation component.

Goals for Year 2: Data
Continue collection, transcription, and translation of Mapudungun data.
Take inventory of existing Inupiaq data available from the Alaska Native Language Center and the Inupiaq community.
–Focus on the North Slope dialect and other dialects that are easily intelligible to North Slope speakers.
Type and record additional Inupiaq data as needed.
Plans for Siona data collection will be discussed at a meeting in Bogota in May.

Goals for Year 2: Elicitation Corpus
Extend the elicitation corpus with more complex constructions (such as causatives and comparatives) and add diagnostics for complex features such as the tense and aspect system.
Refine the elicitation interface based on preliminary experiments.
Preliminary user studies with the corpus and interface using at least two languages.
Refine the linguistic corpus so as to accelerate learning of the more common and useful structures first.

Goals for Year 2: EBMT
Baseline EBMT systems for Mapudungun and Inupiaq.
Extend baseline systems with preliminary version of linguistic generalization.

Goals for Year 2: MT Run-time System
Develop learnable transfer-rule structure and interpreter.
–Unlike existing hand-coded transfer systems for machine translation, a learnable structure requires full compositionality and component-wise generalizability/specializability for data-driven inductive learning.
Develop morphological processors and part-of-speech taggers for Mapudungun and Spanish.
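As an illustration of the compositionality requirement above, here is a toy transfer-rule representation and interpreter. It is a sketch under assumptions, not the NICE formalism or run-time system: it ignores feature constraints and assumes each category label occurs at most once per rule.

```python
# Toy transfer rule and interpreter: reorder source constituents according to
# the rule's target pattern and transfer each word with a small lexicon.
from dataclasses import dataclass

@dataclass
class TransferRule:
    lhs: str            # constituent label, e.g. "NP"
    source: list        # source-side pattern, e.g. ["N", "ADJ"]
    target: list        # target-side pattern, e.g. ["the", "ADJ", "N"]

def apply_rule(rule, source_constituents, lexical_transfer):
    """Emit target-pattern slots in order: category labels are filled from the
    aligned source constituents, anything else is emitted as a literal."""
    by_label = dict(zip(rule.source, source_constituents))
    output = []
    for slot in rule.target:
        if slot in by_label:
            output.append(lexical_transfer(by_label[slot]))
        else:
            output.append(slot)          # literal item, e.g. English "the"
    return output

lexicon = {"ha-yeled": "boy", "ha-gadol": "big"}
rule = TransferRule("NP", ["N", "ADJ"], ["the", "ADJ", "N"])
print(apply_rule(rule, ["ha-yeled", "ha-gadol"], lexicon.get))   # ['the', 'big', 'boy']
```

A real interpreter would apply such rules recursively over a parse and check the feature constraints shown in the learning-process slide.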

Goals for Year 2: Version Space Learning
Develop a baseline Seeded-Version-Space (SVS) inductive learning method.
Extend the elicitation interface to enable the SVS system to generate questions for the native informant, so as to speed up the transfer-rule learning process.

Future Projects Discussion

Appendix

The IEI Team
Coordinator (leader of a bilingual and multicultural education project)
Distinguished native speaker
Linguists (one native speaker, one near-native)
Typists/Transcribers
Recording assistants
Translators
Native speaker linguistic informants

Agreement Between LTI and the Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile
Contributions of IEI:
–Socio-linguistic knowledge
–Linguistic knowledge
–Experience in multicultural bilingual education
–The use of IEI facilities, faculty/researchers, and staff for the project
–Electronic network support and computer technical support

Agreement Between LTI and the Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile (continued)
Contributions of LTI:
–Equipment: four computers and four DAT recorders
–Payment of consulting fees, pending funding from the Chilean Ministry of Education
–Expertise in language technologies

LTI/IEI Agreement
Cooperate in expanding the project to convergent areas, such as bilingual education, as well as in pursuing additional funding.

MINEDUC/IEI Agreement: Highlights
–Based on the LTI/IEI agreement, the Chilean Ministry of Education got involved in funding the data collection and processing team for the year.
–This agreement will be renewed each year, as needed.

MINEDUC/IEI Agreement: Objectives
–To evaluate the NICE/Mapudungun proposal for orthography and spelling
–To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health care, both traditional and Western.

MINEDUC/IEI Agreement: Deliverables
–An oral corpus of 800 recorded hours, proportional to the demography of each currently spoken dialect
–120 hours transcribed and translated from Mapudungun to Spanish
–A refined proposal for writing Mapudungun

Mapudungun Morphology
kudu.le.me.we.la.n
lay_down.st.Hh.rem.neg.ind.1S
"I am not going to lay down there any more"
illku.faluw.kUle.n
get_angry.SIM.ST.IND.1s
"I am pretending to be angry"
antU.kUdaw.kiaw.ke.rke.fu.y
day.work.CIRC.CF.REP.IPD.IND.3s
"he used to work here and there as a day laborer, I am told"
wisa.ka.dungu.fe.nge.y.mi
bad.VERB.FAC.speak.NOM.VERB.IND.2s
"you are someone who always does and says nasty things"