NICE: Native Language Interpretation and Communication Environment
Lori Levin, Jaime Carbonell, Alon Lavie, Ralf Brown, Erik Peterson, Katharina Probst, Rodolfo Vega, Hal Daume
Language Technologies Institute
Carnegie Mellon University
April 12, 2001
NICE Rapid development of machine translation for low and very low density languages
Classification of MT by Language Density
High density pairs (E-F, E-S, E-J, …)
–Statistical or traditional MT approaches are O.K.
Medium density (E-Czech, E-Croatian, …)
–Example-based MT (success with Croatian, Korean)
–JHU: initial success with stat-MT (Czech)
Low density (S-Mapudungun, E-Iñupiaq, …)
–10,000 to 1 million speakers
–Insufficient bilingual corpora for SMT, EBMT
–Partial corpus-based resources
–Insufficient trained computational linguists
Machine Translation of Very Low Density Languages
No text in electronic form
–Can’t apply current methods for statistical MT
No standard spelling or orthography
Few literate native speakers
Few linguists familiar with the language
–Nobody is available to do rule-based MT
Not enough money or time for years of linguistic information gathering/analysis
E.g., Siona (Colombia)
Motivation for LDMT
Methods developed for languages with very scarce resources will generalize to all MT.
Policy makers can get input from indigenous people.
–E.g., Has there been an epidemic or a crop failure?
Indigenous people can participate in government, education, and the internet without losing their language.
First MT of polysynthetic languages
New Ideas
MT without large amounts of text and without trained linguists
Machine learning of rule-based MT
Multi-Engine architecture can flexibly take advantage of whatever resources are available.
Research partnerships with indigenous communities
(Future: Exponential models for data-miserly SMT)
History of NICE
Arose from a series of joint workshops of NSF and OAS-CICAD.
Workshop recommendations:
–Create multinational projects using information technology to:
provide immediate benefits to governments and citizens
develop critical infrastructure for communication and collaborative research
–training researchers and engineers
–advancing science and technology
Approach
Machine learning
–Uncontrolled corpus (Generalized Example-Based MT)
–Controlled corpus elicited from native speakers (Version Space Learning)
Multi-Engine MT
–Flexibly adapt to whatever resources are available
–Take advantage of the strengths of different MT approaches
Evaluation Objective
To achieve a given level of translation quality for a series of languages L1 to Ln
–Reduce the amount of training data required
–Reduce the amount of language-specific development time after language-independent software has been developed
Evaluation Baseline From Previous Work (Generalized EBMT)
High density languages (French, Spanish)
–1MW parallel corpora (e.g., subset of Hansards)
Consistent spelling, grammatically correct
High coverage, gisting-quality translation
Evaluation Baseline GEBMT French Hansards Coverage (in percent) as a function of corpus size (in millions of words)
Long-Term Target: Reduction in Linguistic and Human Resources
Work Completed
Establishing Partnerships
NICE Partners
Language | Country | Institutions
Mapudungun (in place) | Chile | Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education
Iñupiaq (advanced discussion) | US (Alaska) | Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center
Siona (discussion) | Colombia | OAS-CICAD, Plante, Department of the Interior
NICE/Mapudungun: Current Products
Writing conventions (Grafemario)
Glossary Mapudungun/Spanish
Bilingual newspaper, 4 issues
Ultimas Familias
–memoirs
Memorias de Pascual Coña
6 hours transcribed speech
40 hours recorded speech
Instructible Knowledge-Based MT
iRBMT: Instructible Rule Based MT
Elicitation Process Purpose: controlled elicitation of data that will be input to machine learning of translation rules
Elicitation Interface Example
Elicitation Interface
Native informant sees source language sentence (in English or Spanish)
Native informant types in translation, then uses mouse to add word alignments
Informant is
–Literate
–Bilingual
–Not an expert in linguistics or computation
The Learning Process
Learning Instance:
English: the big boy
Hebrew: ha-yeled ha-gadol
Acquired Transfer Rule:
Hebrew: NP: N ADJ
English: NP: the ADJ N
where:
(Hebrew:N English:N)
(Hebrew:ADJ English:ADJ)
(Hebrew:N has ((def +)))
(Hebrew:ADJ has ((def +)))
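To make the acquired transfer rule concrete, the following minimal sketch applies it to the slide's example. It is an illustration under stated assumptions, not the NICE transfer-rule formalism: the `Token` class, the two-word `LEXICON`, the part-of-speech tags passed in by hand, and the treatment of "ha-" as the definiteness marker are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    stem: str       # word without the definite prefix
    pos: str        # part of speech, "N" or "ADJ"
    definite: bool  # True when the word carried the "ha-" prefix


# Toy lexicon (hypothetical): only the two words from the slide's example.
LEXICON = {"yeled": "boy", "gadol": "big"}


def parse_hebrew(phrase: str, pos_tags: List[str]) -> List[Token]:
    """Split a transliterated phrase and record definiteness per word."""
    tokens = []
    for word, pos in zip(phrase.split(), pos_tags):
        definite = word.startswith("ha-")
        stem = word[3:] if definite else word
        tokens.append(Token(stem, pos, definite))
    return tokens


def apply_np_rule(tokens: List[Token]) -> Optional[str]:
    """Hebrew NP: N ADJ -> English NP: the ADJ N, when N and ADJ are both def +.

    Returns None when the rule's pattern or constraints do not match.
    """
    if len(tokens) != 2:
        return None
    noun, adj = tokens
    if (noun.pos, adj.pos) != ("N", "ADJ"):
        return None
    if not (noun.definite and adj.definite):
        return None
    return f"the {LEXICON[adj.stem]} {LEXICON[noun.stem]}"


print(apply_np_rule(parse_hebrew("ha-yeled ha-gadol", ["N", "ADJ"])))
```

The rule reorders N ADJ into ADJ N and inserts "the" only when both Hebrew words are marked definite; an indefinite phrase does not match this rule and would need a separate one.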
Seeded Version Space Learning
–SVS is based on Mitchell-style inductive version-space learning, but instead of keeping full S and G boundaries for each concept, it starts from a seeded rule and grows by generalization, specialization, and rule bifurcation with incrementally acquired data.
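The generalization step alone can be sketched as follows. This is a simplified, Find-S-style illustration and not the SVS algorithm itself: specialization and rule bifurcation are omitted, and the feature names (`n_def`, `adj_def`, `n_num`) are hypothetical.

```python
from typing import Dict

Rule = Dict[str, str]      # feature name -> required value
Instance = Dict[str, str]  # feature name -> observed value


def seed_rule(first_example: Instance) -> Rule:
    """Start from the most specific rule: every feature of the first example."""
    return dict(first_example)


def covers(rule: Rule, instance: Instance) -> bool:
    """A rule covers an instance when all of its constraints are satisfied."""
    return all(instance.get(feat) == val for feat, val in rule.items())


def generalize(rule: Rule, positive: Instance) -> Rule:
    """Drop every constraint that a new positive example contradicts."""
    return {feat: val for feat, val in rule.items() if positive.get(feat) == val}


# Seed from one definite singular noun phrase (hypothetical feature values).
rule = seed_rule({"n_def": "+", "adj_def": "+", "n_num": "sg"})

# A later positive example is plural, so the number constraint is dropped.
plural_example = {"n_def": "+", "adj_def": "+", "n_num": "pl"}
if not covers(rule, plural_example):
    rule = generalize(rule, plural_example)

print(rule)
```

Starting from a single seeded rule keeps only one hypothesis in memory, which is the contrast with maintaining full S and G boundary sets.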
Version Space Abstraction Lattice
The Elicitation Corpus
List of sentences in a major language
–English
–Spanish
Dynamically adaptable
–Different sentences are presented depending on what was previously elicited
Compositional
–Joe, Joe’s brother, I saw Joe’s brother, I told you that I saw Joe’s brother, etc.
Aim for typological completeness
–Cover all types of languages
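One way to picture a corpus that is both compositional and dynamically adaptable is to give each sentence a set of prerequisite sentences. The sketch below is only an assumption about how such ordering could be modeled: the `Item` class, the `requires` field, and the specific prerequisite links are hypothetical and not taken from the NICE system.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Item:
    sentence: str
    # Sentences that must already be elicited before this one is shown.
    requires: Set[str] = field(default_factory=set)


# The compositional chain from the slide; the links are assumed for illustration.
CORPUS = [
    Item("Joe"),
    Item("Joe's brother", {"Joe"}),
    Item("I saw Joe's brother", {"Joe's brother"}),
    Item("I told you that I saw Joe's brother", {"I saw Joe's brother"}),
]


def next_items(corpus: List[Item], elicited: Set[str]) -> List[str]:
    """Return sentences not yet elicited whose prerequisites are all satisfied."""
    return [
        item.sentence
        for item in corpus
        if item.sentence not in elicited and item.requires <= elicited
    ]


print(next_items(CORPUS, set()))
print(next_items(CORPUS, {"Joe"}))
```

Each call presents only the sentences whose building blocks the informant has already translated, so larger constructions are elicited after their parts.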
Pilot Version of Elicitation Corpus
Approximately 800 sentences
Tested on Swahili
Vocabulary
–Include a variety of semantic classes, e.g., animate, inanimate, man-made objects, natural objects, etc.
Noun phrases
–Detect number, gender, types of possessives, classifiers, etc.
Basic sentences
–Detect agreement between verb and subject and/or object, basic word order, problems with indefinite or inanimate subjects, etc.
Complex constructions
–Currently relative clauses. Later, comparatives, questions, embedded clauses, etc.
Detection of Grammatical Features
Each language uses a different inventory of grammatical features: tense, number, person, agreement.
Swahili
The hunter kill-ed the animal
Mwindaji a-li-mu-ua mnyama
a – class-one subject
li – past tense
mu – class-one object
ua – kill
Fox (Algonquian)
Ne-waapam-aa-wa
I-see-direct-him
Ne-waapam-ek-wa
me-see-indirect-he
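The Swahili example can be used to show what a feature inventory might look like once a verb has been segmented and glossed. This is a minimal sketch, not the NICE feature detector: the gloss table holds only the four morphemes from the slide, and the keyword-based `marked_features` function is a hypothetical simplification.

```python
from typing import Dict, List

# Morpheme glosses copied from the slide's Swahili example.
SWAHILI_GLOSSES = {
    "a": "class-one subject",
    "li": "past tense",
    "mu": "class-one object",
    "ua": "kill",
}


def gloss(segmented: str, table: Dict[str, str]) -> List[str]:
    """Look up each hyphen-separated morpheme in the gloss table."""
    return [table[morpheme] for morpheme in segmented.split("-")]


def marked_features(glosses: List[str]) -> List[str]:
    """List which grammatical features the glosses show to be marked.

    The keyword-to-feature mapping is an assumption for illustration.
    """
    keywords = {
        "subject": "subject agreement",
        "object": "object agreement",
        "tense": "tense",
    }
    found = {
        label
        for g in glosses
        for key, label in keywords.items()
        if key in g
    }
    return sorted(found)


verb_glosses = gloss("a-li-mu-ua", SWAHILI_GLOSSES)
print(verb_glosses)
print(marked_features(verb_glosses))
```

For this verb the sketch reports tense plus agreement with both subject and object, which is the kind of inventory the diagnostic sentences are meant to uncover.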
Organization of Tests
[Diagram: diagnostic tests, including Plural, Dual, Paucal, Subj-V Agr, …]
Demo of Elicitation Interface and Feature Detection
Data Collection
Mapudungun Data
Spanish-Mapudungun parallel corpora
–Total words: 223,366
Spanish-Mapudungun glossary
–About 5500 entries
40 hours of speech recorded
6 hours of speech transcribed
Speech data will be translated into Spanish
Progress and Plans
Summary of Year 1: Partnerships
Establishment of a partnership with the Institute for Indigenous Studies at the Universidad de la Frontera (UFRO) in Chile.
Establishment of a partnership with the Chilean Ministry of Education.
Identified partners in Alaska and Colombia. Details of these partnerships are being discussed.
Summary of Year 1: Data
Spanish-Mapudungun parallel corpus: over 200,000 words
Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE.
Training for spoken language collection: In January 2001 native speakers of Mapudungun were trained in the recording and transcription of spoken data.
Mapudungun spoken language corpus: 40 hours recorded, 6 hours transcribed (as of end of February).
Summary of Year 1: iKBMT
Preliminary design of transfer rule formalism for machine translation.
Design and pilot testing of prototype elicitation corpus.
First prototype of feature detection.
Morphological processing in PC-KIMMO covering about 40 Mapudungun morphemes.
Preliminary version of new parser for run-time translation component.
Goals for Year 2: Data
Continue collection, transcription, and translation of Mapudungun data.
Take inventory of existing Iñupiaq data available from the Alaska Native Language Center and the Iñupiaq community.
–Focus on the North Slope dialect and other dialects that are easily intelligible to North Slope speakers.
Type and record additional Iñupiaq data as needed.
Plans for Siona data collection will be discussed at a meeting in Bogota in May.
Goals for Year 2: Elicitation Corpus
Extend the elicitation corpus with more complex constructions (such as causatives and comparatives) and add diagnostics for complex features such as the tense and aspect system.
Refine elicitation interface based on preliminary experiments.
Preliminary user studies with the corpus and interface using at least two languages.
Refine the linguistic corpus so as to accelerate learning of the more common and useful structures first.
Goals for Year 2: EBMT
Baseline EBMT systems for Mapudungun and Iñupiaq.
Extend baseline systems with preliminary version of linguistic generalization.
Goals for Year 2: MT Run-time System
Develop learnable transfer-rule structure and interpreter.
–Unlike existing hand-coded transfer systems for machine translation, a learnable structure requires full compositionality and component-wise generalizability/specializability for data-driven inductive learning.
Develop morphological processors and part-of-speech taggers for Mapudungun and Spanish.
Goals for Year 2: Version Space Learning
Develop baseline Seeded-Version-Space (SVS) inductive learning method.
Extend the elicitation interface to enable the SVS system to generate questions for the native informant, so as to speed the transfer-rule learning process.
Future Projects Discussion
Appendix
The IEI Team
Coordinator (leader of a bilingual and multicultural education project)
Distinguished native speaker
Linguists (one native speaker, one near-native)
Typists/Transcribers
Recording assistants
Translators
Native speaker linguistic informants
Agreement Between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile
Contributions of IEI
–Socio-linguistic knowledge
–Linguistic knowledge
–Experience in multicultural bilingual education
–The use of IEI facilities, faculty/researchers and staff for the project
–Electronic network support and computer technical support
Agreement Between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile
Contributions of LTI
–Equipment: four computers and four DAT recorders
–Payment of consulting fees pending funding from the Chilean Ministry of Education
–Expertise in language technologies
LTI/IEI Agreement
Cooperate in expanding the project to convergent areas, such as bilingual education, as well as in pursuing additional funding
MINEDUC/IEI Agreement: Highlights
Based on the LTI/IEI agreement, the Chilean Ministry of Education became involved in funding the data collection and processing team for the year.
This agreement will be renewed each year, as needed.
MINEDUC/IEI Agreement: Objectives
To evaluate the NICE/Mapudungun proposal for orthography and spelling
To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health, both traditional and Western.
MINEDUC/IEI Agreement: Deliverables
An oral corpus of 800 hours recorded, proportional to the demography of each currently spoken dialect
120 hours transcribed and translated from Mapudungun to Spanish
A refined proposal for writing Mapudungun
Mapudungun Morphology

kudu.le.me.we.la.n
lay_down.st.Hh.rem.neg.ind.1S
I am not going to lay down there any more

illku.faluw.kUle.n
get_angry.SIM.ST.IND.1s
I am pretending to be angry

antU.kUdaw.kiaw.ke.rke.fu.y
day.work.CIRC.CF.REP.IPD.IND.3s
he used to work here and there as a day laborer, I am told

wisa.ka.dungu.fe.nge.y.mi
bad.VERB.FAC.speak.NOM.VERB.IND.2s
you are someone who always does and says nasty things