Bridging the Gap: Machine Translation for Lesser Resourced Languages Christian Monson, Ariadna Font Llitjós, Lori Levin, Alon Lavie, Alison Alvarez, Roberto Aranovich, Jaime Carbonell, Robert Frederking, Erik Peterson, Kathrin Probst
Inupiaq Katrina Quechua Mapudungun 100’s of Speakers Katrina 100’s of Speakers Quechua 6 Million Speakers Mapudungun 900,000 Speakers
Machine Translation (MT) Source Language -> Target Language
Machine Translation (MT) Direct approaches: Statistical MT, Example Based MT. + Short development time. - Requires large bilingual corpus.
Machine Translation (MT) Deep-analysis approaches: Transfer Rule Based MT (Morphological Analysis, Syntactic Parsing, Text Generation) and Interlingua (Semantic Analysis, Sentence Planning). + High quality. - Expertise intensive development cycle.
Machine Translation (MT) Our Approach: Transfer Rule Based MT. Automate the development of deep-analysis MT.
Our Position Linguistic Structure and Bilingual Informants help automate the development of deep-analysis machine translation systems
Sub-Problems Morphology Induction Syntax Refinement
Morphology Induction 1. Linguistic Structure 2. Bilingual Informants
Paradigms Organize Morphology. Mapudungun verbal suffix slots: Loc / Asp (pa, tu, pu, ka, Ø); Hab (ke); Mode (pe); Report ((ü)rke); Pol / Mood (la, ki, nu); Tense (a, fu, afu); Obj Agr (fi, Ø); Subj Agr / Mood ((ü)n, li, chi, yu, …)
Paradigm Discovery in 3 Steps
1. Search out partial paradigms in a network of candidates
2. Cluster overlapping partial paradigms
3. Filter the clusters, keeping the largest clusters most likely to model true paradigms
A portion of a Spanish paradigm candidate network (suffix set, number of stems: example stems)
e.er.erá.ido.ieron.ió 28: deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió 28: asist, dirig, exig, ocurr, sufr, ...
e.erá.ido.ieron.ió 28: deb, escog, ...
e.er.ido.ieron.ió 46: deb, parec, recog, ...
e.ido.ieron.irá.ió 28: asist, dirig, ...
e.ido.ieron.ir.ió 39: asist, bat, sal, ...
e.er.erá.ieron.ió 32: deb, padec, romp, ...
e.ido.ieron.ió 86: asist, deb, hund, ...
e.erá.ieron.ió 32: deb, padec, ...
er.ido.ieron.ió 58: ascend, ejerc, recog, ...
ido.ieron.ir.ió 44: interrump, sal, ...
azar.e.ido.ieron.ir.ió 1: sal
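As a rough illustration of the search step, the sketch below builds candidates of the same shape as the network entries: a set of suffixes paired with every stem that occurs with all of them. This is a toy, not the system presented in the talk. The function names, the size limits, and the six-word vocabulary are assumptions made for the example, and the clustering and filtering steps are not modelled.

```python
from collections import defaultdict
from itertools import combinations


def candidate_paradigms(words, min_size=2, max_size=3):
    """Map each suffix set to the stems that occur with every suffix in it."""
    # For every possible stem, collect the suffixes seen after it.
    suffixes_of = defaultdict(set)
    for word in words:
        for cut in range(1, len(word) + 1):
            stem, suffix = word[:cut], word[cut:]
            suffixes_of[stem].add(suffix or "Ø")  # Ø marks the empty suffix

    # A candidate pairs a suffix set with all stems that license the whole set.
    candidates = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        largest = min(max_size, len(suffixes))
        for size in range(min_size, largest + 1):
            for combo in combinations(sorted(suffixes), size):
                candidates[combo].add(stem)
    return candidates


def label(suffix_set, stems):
    """Format a candidate like the network entries, e.g. 'e.ido.ió 2: asist, deb'."""
    return f"{'.'.join(suffix_set)} {len(stems)}: {', '.join(sorted(stems))}"


# Hypothetical toy vocabulary (a real run would use a large corpus).
WORDS = ["debe", "debido", "debió", "asiste", "asistido", "asistió"]
CANDIDATES = candidate_paradigms(WORDS)

for suffix_set, stems in CANDIDATES.items():
    if len(stems) > 1:
        print(label(suffix_set, stems))
```

Even on this tiny vocabulary, wrong splits such as stem "debi" with suffixes "do" and "ó" show up next to the intended ones, which is why the later clustering and filtering steps are needed.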
Morpho Challenge 2007: Unsupervised Morphology Induction Competition. English: 3rd Place Overall, Bested the Strong Baseline Morfessor (Creutz, 2006). German: 1st Place when Combined with Morfessor. No Mapudungun yet: Agglutinative sequences of suffixes coming soon.
Our Machine Translation Architecture: INPUT TEXT -> Morphology Analysis (using the Morphology Analysis Lexicon) -> Machine Translation System (using the Grammar & Lexicon) -> Morphology Generation (using the Morphology Generation Lexicon) -> OUTPUT TEXT
Speaker notes: Finishing the feedback loop. Given an arbitrarily small set of linguistic resources, for example a small grammar and a small lexicon, adding a Rule Refinement (RR) component at the end of the translation process lets us use bilingual speaker feedback to augment and improve the initial resources (grammar and lexicon), propagating corrections to the underlying representations that produce the translations. The approach can be generalized to any rule-based system; we chose to implement our work on this system, developed at CMU.
Sub-Problems Morphology Induction Syntax Refinement
Syntax Refinement 1. Linguistic Structure 2. Bilingual Informants
Linguistic Structure: Syntax
English: I didn’t see Maria
Mapudungun: pelafiñ Maria
pe -la -fi -ñ Maria
see -neg -3.obj -1.subj.indicative Maria
Spanish: No vi a María
No vi a María
neg see.1.subj.past.indicative acc Maria
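The segmentation in the gloss above can be mimicked with a very small sketch. The two dictionaries contain only the morphemes shown on the slide, the function names are invented for the example, and the matcher ignores suffix order constraints, so it should be read as an illustration of what morphological analysis produces rather than as the analyzer used in the system.

```python
# Morphemes and glosses taken from the slide; everything else is hypothetical.
ROOTS = {"pe": "see"}
SUFFIXES = {"la": "neg", "fi": "3.obj", "ñ": "1.subj.indicative"}


def analyze_suffixes(rest):
    """Split the remainder of a word into known suffixes, or return None."""
    if not rest:
        return []
    for suffix, gloss in SUFFIXES.items():
        if rest.startswith(suffix):
            tail = analyze_suffixes(rest[len(suffix):])
            if tail is not None:
                return [(suffix, gloss)] + tail
    return None


def segment(word):
    """Return [(morpheme, gloss), ...] for a root plus suffixes, or None."""
    for root, gloss in ROOTS.items():
        if word.startswith(root):
            suffixes = analyze_suffixes(word[len(root):])
            if suffixes is not None:
                return [(root, gloss)] + suffixes
    return None


ANALYSIS = segment("pelafiñ")
print("-".join(morpheme for morpheme, _ in ANALYSIS))
print(" -".join(gloss for _, gloss in ANALYSIS))
```

A word with an unknown root or an unparseable remainder yields `None`, which is where an induced morphology would be expected to supply missing entries.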
pe-la-fi-ñ Maria
1. V -> pe
2. VSuff -> la: negation = +
3. VSuffG -> VSuff: Pass all features up
4. VSuff -> fi: object person = 3
5. VSuffG -> VSuffG VSuff: Pass all features up from both children
6. VSuff -> ñ: person = 1, number = sg, mood = ind
7. VSuffG -> VSuffG VSuff: Pass all features up from both children
8. V -> V VSuffG: Pass all features up from both children. Check that: 1) negation = + 2) tense is undefined
9. NP -> N (Maria): person = 3, number = sg, human = +
10. VP -> V NP: Check that NP is human = +. Pass features up from V
11. S -> VP
Resulting tree: [S [VP [V [V pe] [VSuffG [VSuffG [VSuffG [VSuff la]] [VSuff fi]] [VSuff ñ]]] [NP [N Maria]]]]
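The feature passing in this parse can be sketched as merging feature dictionaries bottom-up and then applying the two checks. The feature values come from the slides; the dictionary encoding, the `unify` helper, and the choice to store the NP under an `object` key are assumptions made for the illustration, not the system's internal representation.

```python
def unify(*feature_sets):
    """Merge feature dicts; return None if two of them disagree on a value."""
    merged = {}
    for features in feature_sets:
        for name, value in features.items():
            if merged.get(name, value) != value:
                return None
            merged[name] = value
    return merged


# Leaf features as annotated on the slides.
ROOT_PE = {"lemma": "pe"}
SUFFIX_LA = {"negation": "+"}
SUFFIX_FI = {"object_person": 3}
SUFFIX_NY = {"person": 1, "number": "sg", "mood": "ind"}
NOUN_MARIA = {"lemma": "Maria", "person": 3, "number": "sg", "human": "+"}

# VSuffG nodes pass all features up from their children.
SUFFIX_GROUP = unify(SUFFIX_LA, SUFFIX_FI, SUFFIX_NY)

# V -> V VSuffG: pass everything up, then check negation and tense.
VERB = unify(ROOT_PE, SUFFIX_GROUP)
VERB_OK = VERB is not None and VERB.get("negation") == "+" and "tense" not in VERB

# VP -> V NP: check that the NP is human, then pass the verb's features up.
VP = dict(VERB, object=NOUN_MARIA) if VERB_OK and NOUN_MARIA.get("human") == "+" else None

print(VP)
```

If any check fails, the rule simply does not apply (`VP` stays `None`), which is how feature constraints prune analyses in a unification-style grammar.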
Transfer to Spanish: Top-Down
1. S -> S: Pass all features to Spanish side
2. VP -> VP: Pass all features down. [V NP] becomes [V “a” NP]. Pass object features down. Accusative marker on objects is introduced because human = +
The transfer rule for this step:
VP::VP [VBar NP] -> [VBar "a" NP]
(
 (X1::Y1)
 (X2::Y3)
 ((X2 type) = (*NOT* personal))
 ((X2 human) =c +)
 (X0 = X1)
 ((X0 object) = X2)
 (Y0 = X0)
 ((Y0 object) = (X0 object))
 (Y1 = Y0)
 (Y3 = (Y0 object))
 ((Y1 objmarker person) = (Y3 person))
 ((Y1 objmarker number) = (Y3 number))
 ((Y1 objmarker gender) = (Y3 gender))
)
3. V -> “no” V: Pass person, number, and mood features to Spanish Verb. Assign tense = past. “no” is introduced because negation = +
4. pe -> ver. With person = 1, number = sg, mood = indicative, tense = past, ver is generated as vi
5. N -> N: Pass features over to Spanish side. Maria -> María
Result for “I didn’t see Maria”:
Mapudungun: [S [VP [V [V pe] [VSuffG [VSuffG [VSuffG [VSuff la]] [VSuff fi]] [VSuff ñ]]] [NP [N Maria]]]]
Spanish: [S [VP [V “no” [V vi]] “a” [NP [N María]]]]
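The top-down steps can be condensed into a small sketch that turns the analysed Mapudungun structure into Spanish words. It is only meant to show where "no", the past tense, and the accusative "a" come from. The dictionary layout, the function name, and the one-entry generation table are assumptions for the example and do not reproduce the transfer formalism shown in the rule.

```python
# Bilingual lexicon entries taken from the slides.
VERB_LEXICON = {"pe": "ver"}
NOUN_LEXICON = {"Maria": "María"}

# Hypothetical one-entry morphology generation table:
# (lemma, person, number, mood, tense) -> inflected form.
VERB_FORMS = {("ver", 1, "sg", "indicative", "past"): "vi"}


def transfer(vp):
    """Produce a Spanish sentence from an analysed Mapudungun verb phrase."""
    verb, obj = vp["verb"], vp["object"]
    tokens = []

    # "no" is introduced because negation = +.
    if verb.get("negation") == "+":
        tokens.append("no")

    # Person, number and mood are passed over; tense defaults to past
    # when the source side leaves it undefined.
    tense = verb.get("tense", "past")
    key = (VERB_LEXICON[verb["lemma"]], verb["person"], verb["number"], verb["mood"], tense)
    tokens.append(VERB_FORMS[key])

    # The accusative marker "a" is introduced for human objects.
    if obj.get("human") == "+":
        tokens.append("a")
    tokens.append(NOUN_LEXICON[obj["lemma"]])

    sentence = " ".join(tokens)
    return sentence[0].upper() + sentence[1:]


EXAMPLE_VP = {
    "verb": {"lemma": "pe", "negation": "+", "person": 1, "number": "sg", "mood": "indicative"},
    "object": {"lemma": "Maria", "person": 3, "number": "sg", "human": "+"},
}

print(transfer(EXAMPLE_VP))
```

Keeping the decisions in separate, feature-driven steps mirrors the slides: each inserted Spanish word is traceable to one feature of the source analysis.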
Syntax Refinement 1. Linguistic Structure 2. Bilingual Informants
Syntax Refinement Architecture: INPUT TEXT -> Morphology Analysis (using the Morphology Analysis Lexicon) -> Run-Time MT System (using the Grammar & Lexicon) -> Morphology Generation (using the Morphology Generation Lexicon) -> OUTPUT TEXT -> Online Translation Correction Tool -> Rule Refinement -> Grammar & Lexicon
Children played a game Translation Correction Tool (TCTool): online GUI to elicit correction of MT output from non-expert bilingual speakers
The children played a game
Refining the Grammar. Parse tree of the MT output “niños jugaron un juego” (node labels: S, NP, VP, PolP, V, Det, N). After refinement, the determiner “los” is added before “niños”: “los niños jugaron un juego”.
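Before any rule can be refined, the informant's correction has to be located in the MT output. The sketch below does only that step, using Python's standard `difflib` to align the two word sequences and a tiny part-of-speech table to describe what was changed. The function name, the table, and the shape of the returned records are assumptions for the example, and how the actual system maps such a finding back to a grammar rule is not shown here.

```python
from difflib import SequenceMatcher


def diagnose(mt_output, corrected, pos_tags):
    """Describe each difference between the MT output and the corrected sentence."""
    mt_words, fixed_words = mt_output.split(), corrected.split()
    matcher = SequenceMatcher(a=mt_words, b=fixed_words, autojunk=False)
    findings = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        findings.append({
            "operation": tag,  # 'insert', 'delete' or 'replace'
            "mt_words": mt_words[i1:i2],
            "corrected_words": fixed_words[j1:j2],
            "corrected_tags": [pos_tags.get(w, "?") for w in fixed_words[j1:j2]],
            "next_word": fixed_words[j2] if j2 < len(fixed_words) else None,
        })
    return findings


# Hypothetical part-of-speech table covering the example sentence.
POS_TAGS = {"los": "Det", "un": "Det", "niños": "N", "juego": "N", "jugaron": "V"}

FINDINGS = diagnose("niños jugaron un juego", "los niños jugaron un juego", POS_TAGS)
print(FINDINGS)
```

For the example on the slides this reports a single inserted determiner before "niños", which is the kind of evidence a refinement step could use to decide that the rule producing that noun phrase needs a determiner.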
Syntax Refinement Summary. Increases translation quality on unseen data: English-Spanish experiments (Font Llitjós et al., 2007, MT Summit). Generalizes to a Mapudungun-Spanish machine translation system. Speaker notes: the example shown today is grammar expansion, but the ARR can also automatically augment the lexicon (see paper).
Overall Summary Linguistic Structure and Bilingual Informants help automate the development of deep-analysis machine translation systems: Morphology Induction Syntax Refinement
Thank You!