Bridging the Gap: Machine Translation for Lesser Resourced Languages

Christian Monson, Ariadna Font Llitjós, Lori Levin, Alon Lavie, Alison Alvarez, Roberto Aranovich, Jaime Carbonell, Robert Frederking, Erik Peterson, Kathrin Probst

Inupiaq: 100’s of Speakers
Katrina: 100’s of Speakers
Quechua: 6 Million Speakers
Mapudungun: 900,000 Speakers

Machine Translation (MT)
Source Language Target Language

Source Language; Target Language; Direct: Statistical MT, Example Based MT

Transfer Rule Based MT: Morphological Analysis, Syntactic Parsing, Text Generation; Source Language; Target Language; Direct: Statistical MT, Example Based MT

Interlingua: Semantic Analysis, Sentence Planning; Transfer Rule Based MT: Morphological Analysis, Syntactic Parsing, Text Generation; Source Language; Target Language; Direct: Statistical MT, Example Based MT

Interlingua: Semantic Analysis; Transfer Rule Based MT: Morphological Analysis, Syntactic Parsing, Text Generation; Source Language; Target Language; Direct: Statistical MT, Example Based MT
+ High quality
- Expertise intensive development cycle

Interlingua: Semantic Analysis; Transfer Rule Based MT: Morphological Analysis, Syntactic Parsing, Text Generation; Source Language; Target Language; Direct: Statistical MT, Example Based MT
+ Short development time
- Requires large bilingual corpus

Interlingua: Semantic Analysis; Our Approach; Transfer Rule Based MT: Morphological Analysis, Syntactic Parsing, Text Generation; Source Language; Target Language; Direct: Statistical MT, Example Based MT


Interlingua: Semantic Analysis; Morphological Analysis; Syntactic Parsing; Text Generation; Source Language; Target Language
+ High quality
- Expertise intensive development cycle
+ Automate the development of deep-analysis MT

Our Position
Linguistic Structure and Bilingual Informants help automate the development of deep-analysis machine translation systems

Sub-Problems Morphology Induction Syntax Refinement

Morphology Induction 1. Linguistic Structure 2. Bilingual Informants

Paradigms Organize Morphology
Mapudungun
Loc, Asp: pa, tu, pu, ka, Ø
Hab, Mode, Report, Pol / Mood, Tense, Obj Agr: ke, pe, (ü)rke, la, a, fi, ki, fu, Ø, nu, afu
Subj Agr / Mood: (ü)n, li, chi, yu, …

Paradigm Discovery in 3 Steps
1. Search out partial paradigms in a network of candidates
2. Cluster overlapping partial paradigms
3. Filter the clusters, keeping the largest clusters most likely to model true paradigms

A portion of a Spanish paradigm candidate network (suffix set, number of stems: example stems)
e.er.erá.ido.ieron.ió 28: deb, escog, ofrec, reconoc, vend, ...
e.ido.ieron.ir.irá.ió 28: asist, dirig, exig, ocurr, sufr, ...
e.erá.ido.ieron.ió 28: deb, escog, ...
e.er.ido.ieron.ió 46: deb, parec, recog, ...
e.ido.ieron.irá.ió 28: asist, dirig, ...
e.ido.ieron.ir.ió 39: asist, bat, sal, ...
e.er.erá.ieron.ió 32: deb, padec, romp, ...
e.ido.ieron.ió 86: asist, deb, hund, ...
e.erá.ieron.ió 32: deb, padec, ...
er.ido.ieron.ió 58: ascend, ejerc, recog, ...
ido.ieron.ir.ió 44: interrump, sal, ...
azar.e.ido.ieron.ir.ió 1: sal
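The search step can be illustrated with a minimal Python sketch. It is not the authors' implementation. It only assumes that a candidate partial paradigm is a set of suffixes together with the stems that occur with every suffix in that set. The toy word list and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def stem_suffix_table(words):
    """Map each candidate stem to the set of suffixes it occurs with.

    Every split of a word into a non-empty stem and a non-empty suffix counts.
    """
    table = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            table[word[:i]].add(word[i:])
    return table

def candidate_paradigms(words, size):
    """Return {suffix set: stems that occur with every suffix in the set}."""
    table = stem_suffix_table(words)
    candidates = defaultdict(set)
    for stem, suffixes in table.items():
        for combo in combinations(sorted(suffixes), size):
            candidates[frozenset(combo)].add(stem)
    return candidates

# Toy vocabulary (hypothetical), built from stems shown in the network.
words = ["debe", "debido", "debieron", "debió",
         "asiste", "asistido", "asistieron", "asistió",
         "salir", "salido"]

cands = candidate_paradigms(words, size=4)
key = frozenset({"e", "ido", "ieron", "ió"})
print(sorted(cands[key]))  # ['asist', 'deb']
```

A real system would still have to cluster overlapping candidates and filter out the many spurious ones (here, for example, the stem "de" with suffixes "be", "bido", "bieron", "bió").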

Morpho Challenge 2007: Unsupervised Morphology Induction Competition
English: 3rd Place Overall
Bested the Strong Baseline Morfessor (Creutz, 2006)
German: 1st Place when Combined with Morfessor

Morpho Challenge 2007: Unsupervised Morphology Induction Competition
English: 3rd Place Overall
Bested the Strong Baseline Morfessor (Creutz, 2006)
German: 1st Place when Combined with Morfessor
No Mapudungun yet
Agglutinative sequences of suffixes coming soon

Our Machine Translation Architecture
INPUT TEXT
Finish feedback loop. Given an arbitrary small set of linguistic resources, for example a small grammar and a small lexicon, if we add a rule refinement (RR) component at the end of our translation process, we can use bilingual speaker feedback to augment and improve the initial resources (the grammar, G, and the lexicon, L). The approach I am proposing can be generalized to any rule-based system. We chose to implement our work on this system developed at CMU. Propagate corrections to the underlying representations that produce translations.

INPUT TEXT; Morphology Analysis Lexicon; Morphology Analysis

INPUT TEXT; Morphology Analysis Lexicon; Grammar & Lexicon; Morphology Analysis; Machine Translation System

INPUT TEXT; Morphology Analysis Lexicon; Grammar & Lexicon; Morphology Analysis; Machine Translation System; Morphology Generation Lexicon; Morphology Generation

INPUT TEXT; Morphology Analysis Lexicon; Grammar & Lexicon; Morphology Analysis; Machine Translation System; Morphology Generation Lexicon; Morphology Generation; OUTPUT TEXT

Sub-Problems Morphology Induction Syntax Refinement

Syntax Refinement 1. Linguistic Structure 2. Bilingual Informants

Linguistic Structure: Syntax
English: I didn’t see Maria
Mapudungun: pelafiñ Maria
Spanish: No vi a María

Linguistic Structure: Syntax
English: I didn’t see Maria
Mapudungun: pelafiñ Maria
  pe -la -fi -ñ Maria
  see -neg -3.obj -1.subj.indicative Maria
Spanish: No vi a María
  No vi a María
  neg see.1.subj.past.indicative acc Maria
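The Mapudungun gloss can be reproduced with a small segmentation sketch in Python. This is only an illustration under simple assumptions: the lexicon holds just the four morphemes glossed on the slide, and the function names are hypothetical, not part of the system described in the talk.

```python
# Tiny lexicon holding only the morphemes glossed on the slide.
ROOTS = {"pe": "see"}
SUFFIXES = {"la": "neg", "fi": "3.obj", "ñ": "1.subj.indicative"}

def _segment_suffixes(rest):
    """Match suffixes left to right, backtracking when a path fails."""
    if not rest:
        return []
    for suffix, gloss in SUFFIXES.items():
        if rest.startswith(suffix):
            tail = _segment_suffixes(rest[len(suffix):])
            if tail is not None:
                return [(suffix, gloss)] + tail
    return None

def segment(word):
    """Split a verb into a known root followed by known suffixes.

    Returns a list of (morpheme, gloss) pairs, or None when the word
    cannot be fully segmented with this lexicon.
    """
    for root, root_gloss in ROOTS.items():
        if word.startswith(root):
            rest = _segment_suffixes(word[len(root):])
            if rest is not None:
                return [(root, root_gloss)] + rest
    return None

analysis = segment("pelafiñ")
print("-".join(morpheme for morpheme, _ in analysis))  # pe-la-fi-ñ
print("-".join(gloss for _, gloss in analysis))
```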

pe-la-fi-ñ Maria
V pe
VSuff la: negation = +
VSuffG over VSuff la: pass all features up
VSuff fi: object person = 3
VSuffG over VSuffG and VSuff fi: pass all features up from both children
VSuff ñ: person = 1, number = sg, mood = ind
VSuffG over VSuffG and VSuff ñ: pass all features up from both children
V over V pe and VSuffG: pass all features up from both children; check that 1) negation = + 2) tense is undefined
N Maria: person = 3, number = sg, human = +
VP over V and NP, S over VP: check that NP is human = +; pass features up from V
Tree: [S [VP [V [V pe] [VSuffG [VSuffG [VSuffG [VSuff la]] [VSuff fi]] [VSuff ñ]]] [NP [N Maria]]]]
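The feature passing in this analysis can be sketched in Python with flat dictionaries. The sketch treats "pass all features up from both children" as a merge that fails when two children disagree on a value, which is a simplification of real unification. The feature values come from the slides; the function and variable names are hypothetical.

```python
def unify(*feature_sets):
    """Merge flat feature dicts; return None if two values conflict."""
    merged = {}
    for features in feature_sets:
        for name, value in features.items():
            if name in merged and merged[name] != value:
                return None
            merged[name] = value
    return merged

# Features contributed by each morpheme, as shown on the slides.
V_PE = {"root": "pe"}
SUFF_LA = {"negation": "+"}
SUFF_FI = {"object person": 3}
SUFF_N = {"person": 1, "number": "sg", "mood": "ind"}

# Each VSuffG passes up the features of its children.
suffix_group = unify(SUFF_LA, SUFF_FI, SUFF_N)

# V over V and VSuffG: pass features up, then apply the two checks.
verb = unify(V_PE, suffix_group)
assert verb["negation"] == "+"
assert "tense" not in verb  # tense is undefined

# The object NP must be human = +.
noun_phrase = {"person": 3, "number": "sg", "human": "+"}
assert noun_phrase["human"] == "+"

print(verb)
```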

Transfer to Spanish: Top-Down
S to S: pass all features to Spanish side
VP [V NP] to VP [V “a” NP]: pass all features down
Pass object features down
Accusative marker “a” on objects is introduced because human = +

VP::VP [VBar NP] -> [VBar "a" NP]
(
  (X1::Y1)
  (X2::Y3)
  ((X2 type) = (*NOT* personal))
  ((X2 human) =c +)
  (X0 = X1)
  ((X0 object) = X2)
  (Y0 = X0)
  ((Y0 object) = (X0 object))
  (Y1 = Y0)
  (Y3 = (Y0 object))
  ((Y1 objmarker person) = (Y3 person))
  ((Y1 objmarker number) = (Y3 number))
  ((Y1 objmarker gender) = (Y3 gender))
)

Pass person, number, and mood features to Spanish Verb; assign tense = past
“no” is introduced because negation = +
V pe transfers to Spanish V ver
ver with person = 1, number = sg, mood = indicative, tense = past is generated as vi
Pass features over to Spanish side: N Maria transfers to N María
I didn’t see Maria: pe-la-fi-ñ Maria is translated as “no” vi “a” María
Spanish tree: [S [VP [V “no” [V vi]] “a” [NP [N María]]]]
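The last transfer steps can be sketched in Python as a function from source-side features to a Spanish word sequence. This is a hedged illustration, not the transfer engine itself. The two lookup tables are hypothetical and contain only the entries needed for this one sentence.

```python
# Hypothetical lexicons; they hold only the entries used in the example.
ROOT_TRANSFER = {"pe": "ver"}
SPANISH_FORMS = {("ver", 1, "sg", "indicative", "past"): "vi"}

def transfer_sentence(verb, obj):
    """Build the Spanish word sequence from source-side feature dicts."""
    lemma = ROOT_TRANSFER[verb["root"]]
    # Tense is undefined on the Mapudungun side; the rule assigns past.
    tense = verb.get("tense", "past")
    form = SPANISH_FORMS[(lemma, verb["person"], verb["number"],
                          verb["mood"], tense)]
    words = []
    if verb.get("negation") == "+":
        words.append("no")  # introduced because negation = +
    words.append(form)
    if obj.get("human") == "+":
        words.append("a")   # accusative marker for human objects
    words.append(obj["spanish"])
    return " ".join(words)

verb = {"root": "pe", "negation": "+", "person": 1,
        "number": "sg", "mood": "indicative"}
obj = {"spanish": "María", "person": 3, "number": "sg", "human": "+"}
print(transfer_sentence(verb, obj))  # no vi a María
```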

Syntax Refinement 1. Linguistic Structure 2. Bilingual Informants

Syntax Refinement Architecture
INPUT TEXT; Morphology Analysis Lexicon; Grammar & Lexicon; Morphology Analysis; Run-Time MT System; Morphology Generation Lexicon; Morphology Generation; OUTPUT TEXT

INPUT TEXT; Rule Refinement; Grammar & Lexicon; Morphology Analysis; Online Translation Correction Tool; Run-Time MT System; Morphology Generation; OUTPUT TEXT

INPUT TEXT; Rule Refinement; Grammar & Lexicon; Morphology Analysis; Online Translation Correction Tool; Run-Time MT System

Translation Correction Tool (TCTool): online GUI to elicit correction of MT output from non-expert bilingual speakers
Children played a game

The children played a game

Refining the Grammar
Parse tree for “niños jugaron un juego” (node labels: S, NP, VP, PolP, N, V, Det)

Refining the Grammar
Parse tree for “los niños jugaron un juego”, with “los” added (node labels: S, NP, VP, PolP, N, V, Det)
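One simple way to locate what a bilingual speaker changed is a word-level diff between the MT output and the corrected sentence. The Python sketch below uses the standard-library `difflib.SequenceMatcher` for that. It is an illustration only, not the TCTool or the refinement algorithm, and it assumes the uncorrected output was "niños jugaron un juego". The function name is hypothetical.

```python
from difflib import SequenceMatcher

def correction_actions(mt_output, corrected):
    """List word-level edits that turn the MT output into the correction.

    Each action is (operation, removed words, added words, position in
    the MT output).
    """
    source = mt_output.split()
    target = corrected.split()
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    actions = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            actions.append((tag, source[i1:i2], target[j1:j2], i1))
    return actions

actions = correction_actions("niños jugaron un juego",
                             "los niños jugaron un juego")
print(actions)  # [('insert', [], ['los'], 0)]
```

An insertion directly before the subject noun is the kind of evidence that could motivate adding a determiner to the NP rule, as in the refined tree.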

Syntax Refinement Summary
Increases translation quality on unseen data
English-Spanish experiments (Font Llitjós et al., 2007, MT Summit)
Generalizes to a Mapudungun-Spanish machine translation system
Today I’ve shown you an example of grammar expansion, but the ARR can also automatically augment the lexicon (see paper).

Overall Summary
Linguistic Structure and Bilingual Informants help automate the development of deep-analysis machine translation systems:
Morphology Induction
Syntax Refinement

Thank You!
