Developing affordable technologies for resource-poor languages

Ariadna Font Llitjós
Language Technologies Institute, Carnegie Mellon University
September 22, 2004
[World map of languages: each dot = one language]
Motivation

Resource-poor scenarios:
- Indigenous communities have difficult access to crucial information that directly affects their lives (such as land laws, health warnings, etc.)
- Formalize a potentially endangered language

Affordable technologies, such as:
- spell-checkers
- on-line dictionaries
- Machine Translation (MT) systems
- computer-assisted tutoring
AVENUE Partners

Language               Country      Institutions
Mapudungun (in place)  Chile        Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education
Quechua (started)      Peru         Ministry of Education
Iñupiaq (discussion)   US (Alaska)  Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center
Siona (discussion)     Colombia     OAS-CICAD, Plante, Department of the Interior
Mapudungun for the Mapuche

Chile
- Official language: Spanish
- Population: ~15 million
- ~1/2 million Mapuche people
- Language: Mapudungun
What's Machine Translation (MT)?

[Diagram: a Japanese sentence fed into an MT system, producing a Swahili sentence]
Speech-to-Speech MT
Why Machine Translation for resource-poor (indigenous) languages?

Commercial MT is economically feasible for only a handful of major languages with large resources (corpora, human developers).

Benefits include:
- Better government access to indigenous communities (epidemics, crop failures, etc.)
- Better participation of indigenous communities in information-rich activities (health care, education, government) without giving up their languages
- Language preservation
- Civilian and military applications (disaster relief)
MT for resource-poor languages: Challenges

- Minimal amount of parallel text (oral tradition)
- Possibly competing standards for orthography/spelling
- Often relatively few trained linguists
- Access to native informants possible
- Need to minimize development time and cost
Machine Translation Pyramid

[Vauquois pyramid: corpus-based (direct) methods at the base, transfer rules in the middle, interlingua at the apex; analysis and interpretation rise on the source side, generation descends on the target side. Running example: "I saw you" -> "Yo vi tú"]
AVENUE MT system overview

\spa Una mujer se quedó en casa
\map Kiñe domo mülewey ruka mew
\eng One woman stayed at home.

{VP,3}
VP::VP : [VP NP] -> [VP NP]
((X1::Y1) (X2::Y2)
 ((x2 case) = acc)
 ((x0 obj) = x2)
 ((x0 agr) = (x1 agr))
 (y2 == (y0 obj))
 ((y0 tense) = (x0 tense))
 ((y0 agr) = (y1 agr)))

V::V |: [stayed] -> [quedó]
((X1::Y1)
 ((x0 form) = stay)
 ((x0 actform) = stayed)
 ((x0 tense) = past-pp)
 ((y0 agr pers) = 3)
 ((y0 agr num) = sg))
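To make the rule formalism concrete, here is a minimal Python sketch of how the feature constraints in the {VP,3} rule might be checked and propagated during transfer. It is illustrative only: the dict-based feature structures, the name apply_vp_rule, and the tense percolation from the head are assumptions, not the actual AVENUE runtime.

# Illustrative sketch of AVENUE-style constraint application (assumed
# representation: feature structures as plain dicts).

def apply_vp_rule(x1, x2, y1, y2):
    """Apply {VP,3}: [VP NP] -> [VP NP].

    x1/x2 are source-side feature structures (VP, NP); y1/y2 are their
    transferred target-side counterparts ((X1::Y1), (X2::Y2)).
    Returns (x0, y0) or None if a constraint fails.
    """
    # ((x2 case) = acc): the source NP must be accusative
    if x2.get("case") != "acc":
        return None

    # ((x0 obj) = x2), ((x0 agr) = (x1 agr));
    # tense assumed to percolate from the head VP (x1)
    x0 = {"obj": x2, "agr": x1.get("agr"), "tense": x1.get("tense")}

    # (y2 == (y0 obj)), ((y0 tense) = (x0 tense)), ((y0 agr) = (y1 agr))
    y0 = {"obj": y2, "tense": x0["tense"], "agr": y1.get("agr")}
    return x0, y0

# Lexical entry for "stayed" -> "quedó" (cf. the V::V rule above):
x_v = {"form": "stay", "tense": "past-pp", "agr": {"pers": 3, "num": "sg"}}
y_v = {"agr": {"pers": 3, "num": "sg"}}
x_np = {"case": "acc"}
y_np = {"lex": "casa"}

result = apply_vp_rule(x_v, x_np, y_v, y_np)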
AVENUE MT system overview

[Architecture diagram: the Elicitation Tool collects an elicitation corpus from bilingual informants, yielding a word-aligned parallel corpus; the Rule Learning module, together with handcrafted rules and a morphological analyzer, produces transfer rules and lexical resources; the Run-Time Transfer System outputs a translation lattice; the Translation Correction Tool feeds user corrections to the Rule Refinement Module, which adjusts the rules.]
AVENUE overview: my research

[Same architecture diagram, with the Translation Correction Tool and the Rule Refinement Module marked as the focus of this work.]
Interactive and Automatic Refinement of Translation Rules Or: How to recycle corrections of MT output back into the MT system by adjusting and adapting the grammar and lexical rules
Error correction by non-expert bilingual users
Interactive elicitation of MT errors

Assumption: non-expert bilingual users can reliably detect and minimally correct MT errors, given:
- the SL sentence (I saw you)
- the TL sentence (Yo vi tú)
- word-to-word alignments (I-yo, saw-vi, you-tú)
- (context)

using an online GUI: the Translation Correction Tool (TCTool).

Goal: simplify the MT correction task as much as possible. A sketch of the correction record such a session yields follows below.
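The following Python sketch shows one plausible shape for the data a TCTool correction session produces; the field names and CorrectionRecord class are assumptions for exposition, not the tool's actual schema.

# Assumed shape of a TCTool correction record (illustrative only).
from dataclasses import dataclass

@dataclass
class CorrectionRecord:
    sl: list[str]                      # source-language sentence
    tl: list[str]                      # MT output shown to the user
    tl_corrected: list[str]            # user's minimal correction
    alignments: list[tuple[int, int]]  # (sl_index, tl_index) word pairs

record = CorrectionRecord(
    sl=["I", "saw", "you"],
    tl=["Yo", "vi", "tú"],            # the erroneous running example
    tl_corrected=["Yo", "te", "vi"],  # a minimal grammatical fix
    alignments=[(0, 0), (1, 1), (2, 2)],
)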
Translation Correction Tool

[Screenshots of the TCTool in action:
- the list of available correction actions;
- the SL sentence with the best TL candidate picked by the user;
- changing word order;
- changing "grande" into "gran".]
Automatic Rule Refinement Framework

Find the best rule-refinement (RR) operations given:
- a grammar (G) and lexicon (L),
- a (set of) source-language sentence(s) (SL),
- a (set of) target-language sentence(s) (TL) with its parse tree (P),
- and a minimal correction of TL (TL')

such that translation quality improves: TQ2 > TQ1.

Which can also be expressed as:

  RR* = argmax_RR TQ(TL | TL', P, SL, RR(G, L))
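One way to realize this argmax is a greedy loop over candidate refinement operations. The Python sketch below is an assumption-laden illustration: candidate_ops, translate, and tq are hypothetical interfaces standing in for AVENUE components, not the actual implementation.

# Greedy search over rule-refinement operations (illustrative sketch;
# all interfaces are assumed, not AVENUE's real ones).

def refine(grammar, lexicon, sl, tl_corrected, candidate_ops, translate, tq):
    """Pick the refinement operation that most improves TQ, if any."""
    best_op = None
    best_score = tq(translate(grammar, lexicon, sl), tl_corrected)  # TQ1
    for op in candidate_ops(grammar, lexicon, sl, tl_corrected):
        g2, l2 = op.apply(grammar, lexicon)            # RR(G, L)
        score = tq(translate(g2, l2, sl), tl_corrected)
        if score > best_score:                         # TQ2 > TQ1
            best_op, best_score = op, score
    return best_op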
Types of RR operations

Grammar:
- R0 -> R0 + R1 [= R0' + constr]                             Cov[R0] ⊆ Cov[R0,R1]
- R0 -> R1 [= R0 + constr]                                   Cov[R0] ⊇ Cov[R1]
- R0 -> R1 [= R0 + constr = -] + R2 [= R0' + constr = c+]    Cov[R0] ⊇ Cov[R1,R2]

Lexicon:
- Lex0 -> Lex0 + Lex1 [= Lex0 + constr]
- Lex0 -> Lex1 [= Lex0 + constr]
- Lex0 -> Lex1 [= Lex0 + TLword]
- ∅ -> Lex1 (adding a lexical item)

The first grammar operation is sketched in code below.
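As an illustration of the first grammar operation (R0 -> R0 + R1), this hedged Python sketch keeps the original rule and adds a constrained copy; the rule representation and the name bifurcate_add are assumptions for exposition only.

# Sketch of R0 -> R0 + R1, where R1 = R0' + constr (assumed rule format).
import copy

def bifurcate_add(grammar, r0_id, extra_constraint):
    """Keep R0 and add R1, a copy of R0 with one more constraint.

    Coverage can only grow: Cov[R0] ⊆ Cov[R0,R1].
    """
    r1 = copy.deepcopy(grammar[r0_id])
    r1["constraints"].append(extra_constraint)
    grammar[r0_id + "'"] = r1
    return grammar

# e.g. add a noun-adjective gender agreement constraint to an NP rule
# (hypothetical rule and constraint, in the spirit of "grande" -> "gran"
# and "roja" -> "rojo" corrections):
grammar = {"NP,2": {"lhs": "NP", "rhs": ["N", "ADJ"],
                    "constraints": [("y0 agr", "=", "y1 agr")]}}
grammar = bifurcate_add(grammar, "NP,2", ("y2 agr gen", "=", "y1 agr gen"))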
Questions & Discussion

Thanks!
Formalizing Error Information

W_i  = error word
W_i' = correction
W_c  = clue word

Example:
  SL:  the red car
  TL:  *el auto roja
  TL': el auto rojo

  W_i = roja,  W_i' = rojo,  W_c = auto
Finding Triggering Features

Once we have the user's correction (W_i'), we can compare it with W_i at the feature level and find the triggering feature. If the resulting set is empty, we need to postulate a new binary feature.

Delta function:

  Δ(W_i, W_i') = { f : (W_i f) ≠ (W_i' f) }
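A minimal Python sketch of that feature delta follows, assuming flat dictionary feature structures (an assumed representation, not AVENUE's actual one).

# Collect the features on which the error word and its correction
# disagree; these are the triggering features.

def delta(w_i: dict, w_i_prime: dict) -> set[str]:
    """Return the features whose values differ between W_i and W_i'."""
    shared = w_i.keys() & w_i_prime.keys()
    return {f for f in shared if w_i[f] != w_i_prime[f]}

# "el auto roja" corrected to "el auto rojo": gender is the trigger.
w_i       = {"pos": "adj", "gen": "fem",  "num": "sg"}  # roja
w_i_prime = {"pos": "adj", "gen": "masc", "num": "sg"}  # rojo
print(delta(w_i, w_i_prime))  # {'gen'}
# If the delta is empty, postulate a new binary feature (per the slide).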