Published by José Antonio Castillo Torregrosa. Modified over 6 years ago.
1
AVENUE/LETRAS: Learning-based MT Approaches for Languages with Limited Resources
Alon Lavie, Jaime Carbonell, Lori Levin, Bob Frederking
Joint work with: Erik Peterson, Christian Monson, Ariadna Font-Llitjos, Alison Alvarez, Roberto Aranovich
2
Why Machine Translation for Languages with Limited Resources?
We are in the age of information explosion.
The internet + web + Google: anyone can get the information they want, anytime…
But what about the text in all those other languages?
- How do they read all this English stuff?
- How do we read all the stuff that they put online?
MT for these languages would enable:
- Better government access to native indigenous and minority communities
- Better minority and native community participation in information-rich activities (health care, education, government) without giving up their languages
- Civilian and military applications (disaster relief)
- Language preservation
Sep 22, 2006 Learning-based MT with Limited Resources
3
AVENUE/LETRAS Funding
- Started in 2000 with a small amount of DARPA/TIDES funding (NICE)
- AVENUE project funded by 5-year NSF ITR grant ( )
- Follow-on LETRAS project funded by NSF HLC Program grant ( )
- Collaboration funding sources:
  - Mapudungun (MINEDUC, Chile)
  - Hebrew (ISF, Israel)
  - Brazilian Portuguese & Native Langs. (Brazilian Gov.)
  - Inupiaq (NSF, Polar Programs)
4
CMU’s AVENUE Approach
Elicitation: use bilingual native informants to create a small, high-quality, word-aligned bilingual corpus of translated phrases and sentences
- Building elicitation corpora from feature structures
- Feature detection and navigation
Transfer-rule Learning: apply ML-based methods to automatically acquire syntactic transfer rules for translation between the two languages
- Learn from major language to minor language
- Translate from minor language to major language
XFER + Decoder:
- XFER engine produces a lattice of possible transferred structures at all levels
- Decoder searches and selects the best-scoring combination
Rule Refinement: refine the acquired rules via a process of interaction with bilingual informants
Morphology Learning
Word and Phrase bilingual lexicon acquisition
5
AVENUE MT Approach
[Figure: diagram of MT approaches between Source (e.g. Quechua) and Target (e.g. English). Labels: Interlingua; Semantic Analysis; Sentence Planning; Syntactic Parsing; Transfer Rules; Text Generation; Direct: SMT, EBMT. Callout: "AVENUE: Automate Rule Learning".]
6
AVENUE Architecture
[Figure: architecture diagram. Stages: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphology Analyzer, Learning Module, Learned Transfer Rules, Handcrafted rules, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module. Endpoints: INPUT TEXT, OUTPUT TEXT.]
7
Transfer Rule Formalism
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Parts of a rule:
- Type information: NP::NP
- Part-of-speech/constituent information: [DET ADJ N] -> [DET N DET ADJ]
- Alignments: (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
- x-side constraints, e.g. ((X1 AGR) = *3-SING)
- y-side constraints, e.g. ((Y1 DEF) = *DEF)
- xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
8
Transfer Rule Formalism (II)
;SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
Kinds of constraints:
- Value constraints, e.g. ((X1 AGR) = *3-SING)
- Agreement constraints, e.g. ((Y2 GENDER) = (Y4 GENDER))
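The parts of the rule named on these two slides can be pictured as a small data structure. The following is only a sketch in Python: the class `TransferRule` and the helper `classify` are hypothetical names chosen for illustration and are not part of the AVENUE software. The sketch labels each constraint by side (x, y, or xy) and by kind (value or agreement).

```python
import re
from dataclasses import dataclass, field

@dataclass
class TransferRule:
    rule_type: str      # type information, e.g. "NP::NP"
    source: list        # x-side constituent sequence
    target: list        # y-side constituent sequence
    alignments: list    # 1-based (x_index, y_index) pairs
    constraints: list = field(default_factory=list)  # (lhs, rhs) strings

def classify(constraint):
    """Label a constraint by side ("x", "y", "xy") and kind ("value", "agreement")."""
    lhs, rhs = constraint
    sides = set(re.findall(r"\b([XY])\d+\b", lhs + " " + rhs))
    side = "xy" if sides == {"X", "Y"} else sides.pop().lower()
    # An agreement constraint equates two feature paths; a value constraint
    # equates a feature path with an atomic value such as *DEF.
    kind = "agreement" if rhs.startswith("(") else "value"
    return side, kind

rule = TransferRule(
    rule_type="NP::NP",
    source=["DET", "ADJ", "N"],
    target=["DET", "N", "DET", "ADJ"],
    alignments=[(1, 1), (1, 3), (2, 4), (3, 2)],
    constraints=[
        ("(X1 AGR)", "*3-SING"),
        ("(X1 DEF)", "*DEF"),
        ("(X3 AGR)", "*3-SING"),
        ("(X3 COUNT)", "+"),
        ("(Y1 DEF)", "*DEF"),
        ("(Y3 DEF)", "*DEF"),
        ("(Y2 AGR)", "*3-SING"),
        ("(Y2 GENDER)", "(Y4 GENDER)"),
    ],
)

labels = [classify(c) for c in rule.constraints]
```

In this encoding the last constraint comes out as a y-side agreement constraint, and the slide's xy example ((Y1 AGR) = (X1 AGR)) would be labeled ("xy", "agreement").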
9
Transfer Rules Transfer Trees
[Figure: Hindi source tree for "jIvana ke eka aXyAya" and English target tree for "one chapter of life", with node labels NP, NP1, PP, P, Adj, N, N1.]
; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
; life of (one) chapter
; ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
(X1::Y2)
(X2::Y1)
; ((x2 lexwx) = 'kA')
)
{NP,13}
NP::NP : [NP1] -> [NP1]
(X1::Y1)
{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
10
Rule Learning - Overview
Goal: Acquire Syntactic Transfer Rules
Use available knowledge from the source side (grammatical structure)
Three steps:
- Flat Seed Generation: first guesses at transfer rules; flat syntactic structure
- Compositionality Learning: use previously learned rules to learn hierarchical structure
- Constraint Learning: refine rules by learning appropriate feature constraints
11
Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
12
Compositionality Learning
Initial Flat Rules:
S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))
Generated Compositional Rule:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
13
Constraint Learning
Input: Rules and their Example Sets
S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))
NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))
Output: Rules with Feature Constraints:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
(X1 NUM = X2 NUM)
(Y1 NUM = Y2 NUM)
(X1 NUM = Y1 NUM))
14
AVENUE Prototypes
General XFER framework under development for the past three years
Prototype systems so far:
- German-to-English, Dutch-to-English
- Chinese-to-English
- Hindi-to-English
- Hebrew-to-English
- Portuguese-to-English
In progress or planned:
- Mapudungun-to-Spanish
- Quechua-to-Spanish
- Inupiaq-to-English
- Native-Brazilian languages to Brazilian Portuguese
15
Mapudungun
- Indigenous Language of Chile and Argentina
- ~ 1 Million Mapuche Speakers
16
Collaboration
Eliseo Cañulef, Rosendo Huisca, Hugo Carrasco, Hector Painequeo, Flor Caniupil, Luis Caniupil Huaiquiñir, Marcela Collio Calfunao, Cristian Carrillan Anton, Salvador Cañulef
Mapuche Language Experts
- Universidad de la Frontera (UFRO)
- Instituto de Estudios Indígenas (IEI), Institute for Indigenous Studies
Chilean Funding
- Chilean Ministry of Education (Mineduc)
- Bilingual and Multicultural Education Program
Carolina Huenchullan Arrúe, Claudio Millacura Salas
17
Accomplishments
Corpora Collection
Spoken Corpus
- Collected: Luis Caniupil Huaiquiñir
- Medical Domain
- 3 of 4 Mapudungun Dialects
  - 120 hours of Nguluche
  - 30 hours of Lafkenche
  - 20 hours of Pwenche
- Transcribed in Mapudungun
- Translated into Spanish
Written Corpus
- ~ 200,000 words
- Bilingual Mapudungun – Spanish
- Historical and newspaper text
Sample from the spoken corpus:
nmlch-nmjm1_x_0405_nmjm_00:
M: <SPA>no pütokovilu kay ko
C: no, si me lo tomaba con agua
M: chumgechi pütokoki femuechi pütokon pu <Noise>
C: como se debe tomar, me lo tomé pués
nmlch-nmjm1_x_0406_nmlch_00:
M: Chengewerkelafuymiürke
C: Ya no estabas como gente entonces!
18
Accomplishments
Developed at UFRO
- Bilingual Dictionary with Examples: 1,926 entries
- Spelling Corrected Mapudungun Word List: 117,003 fully-inflected word forms
- Segmented Word List: 15,120 forms
  - Stems translated into Spanish
19
Accomplishments
Developed at LTI using Mapudungun language resources from UFRO
- Spelling Checker
  - Integrated into OpenOffice
- Hand-built Morphological Analyzer
- Prototype Machine Translation Systems
  - Rule-Based
  - Example-Based
- Website: LenguasAmerindias.org
20
Challenges for Hebrew MT
Paucity of existing language resources for Hebrew:
- No publicly available broad coverage morphological analyzer
- No publicly available bilingual lexicons or dictionaries
- No POS-tagged corpus or parse tree-bank corpus for Hebrew
- No large Hebrew/English parallel corpus
Scenario well suited for CMU transfer-based MT framework for languages with limited resources
21
Hebrew-to-English MT Prototype
Initial prototype developed within a two-month intensive effort
Accomplished:
- Adapted available morphological analyzer
- Constructed a preliminary translation lexicon
- Translated and aligned Elicitation Corpus
- Learned XFER rules
- Developed (small) manual XFER grammar as a point of comparison
- System debugging and development
- Evaluated performance on unseen test data using automatic evaluation metrics
22
[Figure: run-time pipeline. Source Input, Preprocessing, Morphology, Transfer Engine (with Transfer Rules and Translation Lexicon), Translation Output Lattice, Decoder (with English Language Model), English Output.]
Source Input: בשורה הבאה
Transfer Rules:
{NP1,3}
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
((X3::Y1) (X1::Y2)
((X1 def) = +)
((X1 status) =c absolute)
((X1 num) = (X3 num))
((X1 gen) = (X3 gen))
(X0 = X1))
Translation Lexicon:
N::N |: ["$WR"] -> ["BULL"]
((X1::Y1) ((X0 NUM) = s) ((Y0 lex) = "BULL"))
N::N |: ["$WRH"] -> ["LINE"]
((Y0 lex) = "LINE"))
Translation Output Lattice (truncated in the source):
(0 1 (1 1 (2 2 (1 2 "THE (0 2 "IN (0 4 "IN THE NEXT
English Output: in the next line
23
Morphology Example
Input word: B$WRH
|        B$WRH        |
|-----B-----|$WR|--H--|
|--B--|-H--|--$WRH---|
24
Morphology Example
Y0: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y1: ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
Y2: ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
Y3: ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
Y4: ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
Y5: ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
Y6: ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
Y7: ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
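The analyses Y0 to Y7 form a lattice over the positions of the input word. As a rough illustration of what that lattice encodes, the Python sketch below enumerates every sequence of analyses that covers the word from position 0 to position 4. The span values are copied from the slide; the function name `segmentations` and the dictionary encoding are assumptions made for this example, not the engine's actual interface.

```python
def segmentations(arcs, start, end):
    """Return every sequence of arcs that covers [start, end) with no gaps or overlaps."""
    if start == end:
        return [[]]
    paths = []
    for arc in arcs:
        if arc["start"] == start:
            for rest in segmentations(arcs, arc["end"], end):
                paths.append([arc] + rest)
    return paths

# Spans, lexemes and POS tags as listed on the slide (other features omitted).
ARCS = [
    {"id": "Y0", "start": 0, "end": 4, "lex": "B$WRH", "pos": "N"},
    {"id": "Y1", "start": 0, "end": 2, "lex": "B", "pos": "PREP"},
    {"id": "Y2", "start": 1, "end": 3, "lex": "$WR", "pos": "N"},
    {"id": "Y3", "start": 3, "end": 4, "lex": "$LH", "pos": "POSS"},
    {"id": "Y4", "start": 0, "end": 1, "lex": "B", "pos": "PREP"},
    {"id": "Y5", "start": 1, "end": 2, "lex": "H", "pos": "DET"},
    {"id": "Y6", "start": 2, "end": 4, "lex": "$WRH", "pos": "N"},
    {"id": "Y7", "start": 0, "end": 4, "lex": "B$WRH", "pos": "LEX"},
]

paths = segmentations(ARCS, 0, 4)
readings = [" + ".join(f'{a["lex"]}/{a["pos"]}' for a in p) for p in paths]
for reading in readings:
    print(reading)
```

Among the readings is B/PREP + H/DET + $WRH/N, the segmentation behind the output "in the next line"; choosing among such competing readings is left to the transfer engine and decoder.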
25
Example Translation
Input: לאחר דיונים רבים החליטה הממשלה לערוך משאל עם בנושא הנסיגה
Gloss: After debates many decided the government to hold referendum in issue the withdrawal
Output: AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL
26
Sample Output (dev-data)
maxwell anurpung comes from ghana for israel four years ago and since worked in cleaning in hotels in eilat a few weeks ago announced if management club hotel that for him to leave israel according to the government instructions and immigration police in a letter in broken english which spread among the foreign workers thanks to them hotel for their hard work and announced that will purchase for hm flight tickets for their countries from their money
27
Challenges and Future Directions
Automatic Transfer Rule Learning:
- Learning mappings for non-compositional structures
- Effective models for rule scoring for
  - Decoding: using scores at runtime
  - Pruning the large collections of learned rules
- Learning Unification Constraints
- In the absence of morphology or POS annotated lexica
Integrated Xfer Engine and Decoder:
- Improved models for scoring tree-to-tree mappings, integration with LM and other knowledge sources in the course of the search
28
Challenges and Future Directions
Our approach for learning transfer rules is applicable to the large parallel data scenario, subject to solutions for several big challenges:
- No elicitation corpus: break down parallel sentences into reasonable learning examples
- Working with less reliable automatic word alignments rather than manual alignments
- Effective use of reliable parse structures for ONE language (i.e. English) and automatic word alignments in order to decompose the translation of a sentence into several compositional rules
- Effective scoring of the resulting very large transfer grammars, and scaled-up transfer + decoding
29
Future Research Directions
- Automatic Rule Refinement
- Morphology Learning
- Feature Detection and Corpus Navigation
- …
30
Implications for MT with Vast Amounts of Parallel Data
Phrase-to-phrase MT is ill suited for long-range reorderings, resulting in ungrammatical output
Recent work on hierarchical Stat-MT [Chiang, 2005] and parsing-based MT [Melamed et al, 2005] [Knight et al]
Learning general tree-to-tree syntactic mappings is equally problematic:
- Meaning is a hybrid of complex, non-compositional phrases embedded within a syntactic structure
- Some constituents can be translated in isolation, others require contextual mappings
31
Evaluation Results
Test set of 62 sentences from Haaretz newspaper, 2 reference translations
System    BLEU    NIST    P       R       METEOR
No Gram   0.0616  3.4109  0.4090  0.4427  0.3298
Learned   0.0774  3.5451  0.4189  0.4488  0.3478
Manual    0.1026  3.7789  0.4334  0.4474  0.3617
32
Hebrew-English: Test Suite Evaluation
Grammar            BLEU    METEOR
Baseline (NoGram)  0.0996  0.4916
Learned Grammar    0.1608  0.5525
Manual Grammar     0.1642  0.5320
33
Quechua-to-Spanish MT
- V-Unit: funded Summer project in Cusco (Peru), June-August 2005 [preparations and data collection started earlier]
- Intensive Quechua course in Centro Bartolome de las Casas (CBC)
- Worked together with two native and one non-native Quechua speakers on developing infrastructure (correcting elicited translations, segmenting and translating list of most frequent words)
34
Quechua-to-Spanish Prototype MT System
- Stem Lexicon (semi-automatically generated): 753 lexical entries
- Suffix lexicon: 21 suffixes (150 Cusihuaman)
- Quechua morphology analyzer
- 25 translation rules
- Spanish morphology generation module
- User-Studies: 10 sentences, 3 users (2 native, 1 non-native)
35
The Transfer Engine
Analysis: Source text is parsed into its grammatical structure. Determines transfer application ordering.
Example: 他 看 书。(he read book)
[Source tree: S over NP and VP; NP over N 他; VP over V 看 and NP 书]
Transfer: A target language tree is created by reordering, insertion, and deletion.
[Target tree leaves: he, read, DET a, N book]
Article “a” is inserted into object NP. Source words translated with transfer lexicon.
Generation: Target language constraints are checked and final translation produced. E.g. “reads” is chosen over “read” to agree with “he”.
Final translation: “He reads a book”
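The three stages on this slide can be walked through on the same sentence with a toy program. This is a deliberately hard-coded sketch and not the XFER engine: the lexicon, the single article-insertion step, and the agreement check are all assumptions made so that the example stays self-contained.

```python
# Toy transfer lexicon for the slide's example sentence.
LEXICON = {"他": "he", "看": "read", "书": "book"}

def analyze(tokens):
    """Analysis: treat the sentence as S -> NP VP with VP -> V NP (hard-coded)."""
    subject, verb, obj = tokens
    return {"subj": subject, "verb": verb, "obj": obj}

def transfer(source_tree):
    """Transfer: translate leaves with the lexicon and insert "a" into the object NP."""
    return {
        "subj": LEXICON[source_tree["subj"]],
        "verb": LEXICON[source_tree["verb"]],
        "obj": ["a", LEXICON[source_tree["obj"]]],
    }

def generate(target_tree):
    """Generation: enforce subject-verb agreement, then linearize."""
    verb = target_tree["verb"]
    if target_tree["subj"] in {"he", "she", "it"}:
        verb += "s"  # third person singular agreement
    words = [target_tree["subj"], verb] + target_tree["obj"]
    sentence = " ".join(words)
    return sentence[0].upper() + sentence[1:]

translation = generate(transfer(analyze(["他", "看", "书"])))
print(translation)
```

In the real engine each step is driven by transfer rules and unification constraints rather than by fixed code; the sketch only shows how the work is divided between the stages.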
36
The Transfer Engine
Some Unique Features:
- Works with either learned or manually-developed transfer grammars
- Handles rules with or without unification constraints
- Supports interfacing with servers for morphological analysis and generation
- Can handle ambiguous source-word analyses and/or SL segmentations represented in the form of lattice structures
37
The Lattice Decoder
- Simple Stack Decoder, similar in principle to SMT/EBMT decoders
- Searches for best-scoring path of non-overlapping lattice arcs
- Scoring based on log-linear combination of scoring components (no MER training yet)
- Scoring components:
  - Standard trigram LM
  - Fragmentation: how many arcs to cover the entire translation?
  - Length Penalty
  - Rule Scores (not fully integrated yet)
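The search problem described here, finding the best-scoring path of non-overlapping arcs under a weighted sum of components, can be sketched in a few lines. Note the substitutions: the sketch uses plain dynamic programming over lattice positions instead of a stack decoder, the `lm` values are invented stand-ins for trigram log-probabilities, and the arc list and weights are hypothetical.

```python
def decode(arcs, n, weights):
    """Return (score, texts) for the best sequence of adjacent arcs covering [0, n)."""
    best = {0: (0.0, [])}
    for pos in range(n):  # arcs only move forward, so left-to-right order is safe
        if pos not in best:
            continue
        score, path = best[pos]
        for arc in arcs:
            if arc["start"] != pos:
                continue
            feats = dict(arc["feats"],
                         fragmentation=1.0,                     # one more arc used
                         length=float(len(arc["text"].split())))
            total = score + sum(weights[k] * v for k, v in feats.items())
            if arc["end"] not in best or total > best[arc["end"]][0]:
                best[arc["end"]] = (total, path + [arc["text"]])
    return best.get(n)

# Hypothetical lattice over a 4-position input; "lm" values are made up.
ARCS = [
    {"start": 0, "end": 4, "text": "IN THE NEXT LINE", "feats": {"lm": -5.2}},
    {"start": 0, "end": 1, "text": "IN", "feats": {"lm": -1.5}},
    {"start": 1, "end": 2, "text": "THE", "feats": {"lm": -1.0}},
    {"start": 2, "end": 4, "text": "NEXT LINE", "feats": {"lm": -2.5}},
    {"start": 2, "end": 4, "text": "NEXT BULL", "feats": {"lm": -6.0}},
]
WEIGHTS = {"lm": 1.0, "fragmentation": -0.5, "length": 0.1}

with_penalty = decode(ARCS, 4, WEIGHTS)
without_penalty = decode(ARCS, 4, dict(WEIGHTS, fragmentation=0.0))
```

With the fragmentation weight switched on, the single long arc wins even though the three-arc path has a slightly better language-model total; with the weight set to zero, the fragmented path wins. That trade-off is what the fragmentation component is meant to control.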
38
Outline
- Rationale for learning-based MT
- Roadmap for learning-based MT
- Framework overview
- Elicitation
- Learning transfer Rules
- Automatic rule refinement
- Example prototypes
- Implications for MT with vast parallel data
- Conclusions and future directions
39
Data Elicitation for Languages with Limited Resources
Rationale:
- Large volumes of parallel text are not available, so create a small, maximally diverse parallel corpus that directly supports the learning task
- Bilingual native informant(s) can translate and align a small pre-designed elicitation corpus, using the elicitation tool
- Elicitation corpus designed to be typologically and structurally comprehensive and compositional
- Transfer-rule engine and new learning approach support acquisition of generalized transfer rules from the data
40
Elicitation Tool: English-Chinese Example
41
Elicitation Tool: English-Chinese Example
42
Elicitation Tool: English-Hindi Example
43
Elicitation Tool: English-Arabic Example
44
Elicitation Tool: Spanish-Mapudungun Example
45
Designing Elicitation Corpora
What do we want to elicit?
- Diversity of linguistic phenomena and constructions
- Syntactic structural diversity
How do we construct an elicitation corpus?
- Typological Elicitation Corpus based on elicitation and documentation work of field linguists (e.g. Comrie 1977, Bouquiaux 1992): initial corpus size ~1000 examples
- Structural Elicitation Corpus based on representative sample of English phrase structures: ~120 examples
Organized compositionally: elicit simple structures first, then use them as building blocks
Goal: minimize size, maximize linguistic coverage
46
Typological Elicitation Corpus
Feature Detection
- Discover what features exist in the language and where/how they are marked
  - Example: does the language mark gender of nouns? How and where are these marked?
- Method: compare translations of minimal pairs, i.e. sentences that differ in only ONE feature
- Elicit translations/alignments for detected features and their combinations
- Dynamic corpus navigation based on feature detection: no need to elicit for combinations involving non-existent features
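The minimal-pair method and the navigation step can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the token-by-token comparison, and the data. The definiteness pair reuses the Hebrew phrase from the seed-rule example ("tapuax gadol" for "a big apple" versus "ha-tapuax ha-gadol" for "the big apple"); the "dual" pair uses invented placeholder tokens for a language that does not mark the feature.

```python
def marked_positions(trans_a, trans_b):
    """Indices where two equal-length translations differ; None if lengths differ."""
    if len(trans_a) != len(trans_b):
        return None
    return [i for i, (a, b) in enumerate(zip(trans_a, trans_b)) if a != b]

def detect(minimal_pairs):
    """A feature counts as marked if some minimal pair is translated differently."""
    found = {}
    for feature, trans_a, trans_b in minimal_pairs:
        found[feature] = found.get(feature, False) or trans_a != trans_b
    return found

def navigate(combinations, found):
    """Keep only feature combinations whose features were all detected."""
    return [c for c in combinations if all(found.get(f, False) for f in c)]

PAIRS = [
    ("definiteness", ["tapuax", "gadol"], ["ha-tapuax", "ha-gadol"]),
    ("dual", ["w1", "w2"], ["w1", "w2"]),  # invented: identical translations
]

found = detect(PAIRS)
where = marked_positions(PAIRS[0][1], PAIRS[0][2])
to_elicit = navigate([("definiteness",), ("definiteness", "dual")], found)
```

Here definiteness is detected and is marked on both the noun and the adjective, while the combination involving the undetected feature is dropped from further elicitation.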
47
Typological Elicitation Corpus
Initial typological corpus of about 1000 sentences was manually constructed
New construction methodology for building an elicitation corpus using:
- A feature specification: lists inventory of available features and their values
- A definition of the set of desired feature structures
  - Schemas define sets of desired combinations of features and values
  - Multiplier algorithm generates the comprehensive set of feature structures
- A generation grammar and lexicon: NLG generator generates NL sentences from the feature structures
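The slide does not say how the Multiplier algorithm works internally. One simple reading, offered purely as an assumption, is a cross product over the value sets of the features a schema names. The feature inventory below is invented for the example.

```python
from itertools import product

# Hypothetical feature specification: feature name -> allowed values.
SPEC = {
    "number": ["sg", "pl"],
    "person": [1, 2, 3],
    "definiteness": ["def", "indef"],
}

def multiply(spec, schema):
    """Expand a schema (a list of feature names) into all feature structures."""
    return [dict(zip(schema, values))
            for values in product(*(spec[name] for name in schema))]

structures = multiply(SPEC, ["number", "person"])
for fs in structures:
    print(fs)
```

Each resulting feature structure would then be handed to the generation grammar to produce a sentence for the informant to translate.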
48
Structural Elicitation Corpus
Goal: create a compact, diverse sample corpus of syntactic phrase structures in English in order to elicit how these map into the elicited language
Methodology:
- Extracted all CFG “rules” from Brown section of Penn TreeBank (122K sentences)
- Simplified POS tag set
- Constructed frequency histogram of extracted rules
- Pulled out simplest phrases for most frequent rules for NPs, PPs, ADJPs, ADVPs, SBARs and Sentences
- Some manual inspection and refinement
Resulting corpus of about 120 phrases/sentences representing common structures
See [Probst and Lavie, 2004]
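The first and third steps of the methodology, extracting CFG rules from bracketed parses and counting them, can be sketched without any treebank library. The two example trees are invented, and the tokenizer assumes plain Penn-style brackets with no traces or function tags.

```python
import re
from collections import Counter

def parse(tokens):
    """Parse one bracketed tree from a token list into (label, children)."""
    tokens.pop(0)                 # consume "("
    label = tokens.pop(0)
    children = []
    while tokens[0] != ")":
        if tokens[0] == "(":
            children.append(parse(tokens))
        else:
            children.append(tokens.pop(0))   # a word under a POS tag
    tokens.pop(0)                 # consume ")"
    return (label, children)

def count_rules(tree, counts):
    """Count one rule per node that has constituent children (skips POS -> word)."""
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:
        counts[(label, tuple(k[0] for k in kids))] += 1
        for kid in kids:
            count_rules(kid, counts)

def rule_histogram(bracketed_trees):
    counts = Counter()
    for text in bracketed_trees:
        count_rules(parse(re.findall(r"\(|\)|[^\s()]+", text)), counts)
    return counts

TREES = [
    "(S (NP (DT the) (NN man)) (VP (VBD saw) (NP (DT a) (NN dog))))",
    "(S (NP (DT the) (JJ old) (NN man)) (VP (VBD slept)))",
]
histogram = rule_histogram(TREES)
for (lhs, rhs), n in histogram.most_common(3):
    print(n, lhs, "->", " ".join(rhs))
```

Picking the simplest phrase for each frequent rule, the next step on the slide, would start from the top of this histogram.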
50
Flat Seed Rule Generation
- Create a “flat” transfer rule specific to the sentence pair, partially abstracted to POS
  - Words that are aligned word-to-word and have the same POS in both languages are generalized to their POS
  - Words that have complex alignments (or not the same POS) remain lexicalized
- One seed rule for each translation example
- No feature constraints associated with seed rules (but mark the example(s) from which it was learned)
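The generalization criterion on this slide can be written down directly. The sketch below follows the slide's wording literally: a word is abstracted to its POS only when it is aligned one-to-one to a word with the same POS. The function name and input encoding are assumptions. Note that the NP example shown earlier in the deck abstracts the article even though it aligns to two target articles, so the actual system may apply a more permissive test than this one.

```python
from collections import Counter

def seed_rule(src, tgt, alignments, rule_type):
    """src, tgt: lists of (word, pos); alignments: 1-based (x, y) index pairs."""
    x_count = Counter(x for x, _ in alignments)
    y_count = Counter(y for _, y in alignments)
    # Start fully lexicalized, then abstract the words that qualify.
    src_side = [f'"{word}"' for word, _ in src]
    tgt_side = [f'"{word}"' for word, _ in tgt]
    for x, y in alignments:
        one_to_one = x_count[x] == 1 and y_count[y] == 1
        if one_to_one and src[x - 1][1] == tgt[y - 1][1]:
            src_side[x - 1] = src[x - 1][1]
            tgt_side[y - 1] = tgt[y - 1][1]
    aligned = " ".join(f"(X{x}::Y{y})" for x, y in alignments)
    return f"{rule_type} [{' '.join(src_side)}] -> [{' '.join(tgt_side)}] ({aligned})"

simple = seed_rule(
    [("the", "ART"), ("apple", "N")],
    [("ha", "ART"), ("tapuax", "N")],
    [(1, 1), (2, 2)],
    "NP::NP",
)
complex_case = seed_rule(
    [("the", "ART"), ("big", "ADJ"), ("apple", "N")],
    [("ha", "ART"), ("tapuax", "N"), ("ha", "ART"), ("gadol", "ADJ")],
    [(1, 1), (1, 3), (2, 4), (3, 2)],
    "NP::NP",
)
print(simple)
print(complex_case)
```

The first call yields the fully abstracted rule NP::NP [ART N] -> [ART N]; in the second, the one-to-many aligned article stays lexicalized under this strict reading.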
51
Compositionality Learning
- Detection: traverse the c-structure of the English sentence, add compositional structure for translatable chunks
- Generalization: adjust constituent sequences and alignments
- Two implemented variants:
  - Safe Compositionality: there exists a transfer rule that correctly translates the sub-constituent
  - Maximal Compositionality: generalize the rule if supported by the alignments, even in the absence of an existing transfer rule for the sub-constituent
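The generalization step, adjusting constituent sequences and alignments once translatable chunks are known, can be sketched as follows. Chunk detection and the safe/maximal check are assumed to have happened already; the function `compose` and its input format are hypothetical. The data reproduces the S rule from the earlier compositionality example.

```python
def compose(src, tgt, alignments, chunks):
    """Collapse chunks into constituents and remap the alignments.

    chunks: (label, (x_start, x_end), (y_start, y_end)), 1-based inclusive spans.
    """
    def collapse(seq, spans):
        out, index_map, i = [], {}, 1
        while i <= len(seq):
            span = next((s for s in spans if s[1] == i), None)
            if span:
                label, start, end = span
                out.append(label)
                for j in range(start, end + 1):
                    index_map[j] = len(out)   # old index -> new 1-based index
                i = end + 1
            else:
                out.append(seq[i - 1])
                index_map[i] = len(out)
                i += 1
        return out, index_map

    new_src, x_map = collapse(src, [(l, xs, xe) for l, (xs, xe), _ in chunks])
    new_tgt, y_map = collapse(tgt, [(l, ys, ye) for l, _, (ys, ye) in chunks])
    new_alignments = sorted({(x_map[x], y_map[y]) for x, y in alignments})
    return new_src, new_tgt, new_alignments

flat_src = ["ART", "ADJ", "N", "V", "ART", "N"]
flat_tgt = ["ART", "N", "ART", "ADJ", "V", "P", "ART", "N"]
flat_align = [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 7), (6, 8)]
chunks = [("NP", (1, 3), (1, 4)), ("NP", (5, 6), (7, 8))]

new_src, new_tgt, new_align = compose(flat_src, flat_tgt, flat_align, chunks)
```

The result is [NP V NP] -> [NP V P NP] with alignments (X1::Y1) (X2::Y2) (X3::Y4), matching the compositional rule shown in the earlier example.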
52
Constraint Learning
Goal: add appropriate feature constraints to the acquired rules
Methodology:
- Preserve general structural transfer
- Learn specific feature constraints from example set
Seed rules are grouped into clusters of similar transfer structure (type, constituent sequences, alignments)
Each cluster forms a version space: a partially ordered hypothesis space with a specific and a general boundary
- The seed rules in a group form the specific boundary of a version space
- The general boundary is the (implicit) transfer rule with the same type, constituent sequences, and alignments, but no feature constraints
53
Constraint Learning: Generalization
The partial order of the version space:
Definition: A transfer rule tr1 is strictly more general than another transfer rule tr2 if all f-structures that are satisfied by tr2 are also satisfied by tr1.
Generalize rules by merging them:
- Deletion of constraint
- Raising two value constraints to an agreement constraint, e.g.
  ((x1 num) = *pl), ((x3 num) = *pl) -> ((x1 num) = (x3 num))
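The raising operation and the generality ordering can be checked on a small example. In this sketch constraints are tagged tuples and f-structures are nested dictionaries; both encodings, and the function names, are assumptions made for illustration.

```python
def satisfies(fs, constraints):
    """True if the f-structure fs meets every constraint."""
    for c in constraints:
        if c[0] == "value":
            _, node, feat, val = c
            if fs[node].get(feat) != val:
                return False
        else:  # "agree"
            _, (n1, f1), (n2, f2) = c
            if fs[n1].get(f1) != fs[n2].get(f2):
                return False
    return True

def raise_to_agreement(constraints):
    """Replace value constraints that share a feature and value with agreement constraints."""
    groups = {}
    for c in constraints:
        if c[0] == "value":
            groups.setdefault((c[2], c[3]), []).append(c)
    out = [c for c in constraints
           if c[0] != "value" or len(groups[(c[2], c[3])]) < 2]
    for (feat, _), members in groups.items():
        if len(members) >= 2:
            first = members[0]
            out += [("agree", (first[1], feat), (m[1], feat)) for m in members[1:]]
    return out

specific = [("value", "x1", "num", "pl"), ("value", "x3", "num", "pl")]
general = raise_to_agreement(specific)

plural = {"x1": {"num": "pl"}, "x3": {"num": "pl"}}
singular = {"x1": {"num": "sg"}, "x3": {"num": "sg"}}
```

Every f-structure that satisfies `specific` also satisfies `general`, but not the other way round: the all-singular structure passes only the agreement version, which is what makes the raised rule strictly more general in the sense of the definition.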
54
Automated Rule Refinement
- Bilingual informants can identify translation errors and pinpoint where they occur
- A sophisticated trace of the translation path can identify likely sources for the error and do “Blame Assignment”
- Rule Refinement operators can be developed to modify the underlying translation grammar (and lexicon) based on characteristics of the error source:
  - Add or delete feature constraints from a rule
  - Bifurcate a rule into two rules (general and specific)
  - Add or correct lexical entries
- See [Font-Llitjos, Carbonell & Lavie, 2005]
56
Outline
- Rationale for learning-based MT
- Roadmap for learning-based MT
- Framework overview
- Elicitation
- Learning transfer Rules
- Automatic rule refinement
- Learning Morphology
- Example prototypes
- Implications for MT with vast parallel data
- Conclusions and future directions
57
Implications for MT with Vast Amounts of Parallel Data
Example:
他 经常 与 江泽民 总统 通 电话
He freq with J Zemin Pres via phone
He freq talked with President J Zemin over the phone
58
Implications for MT with Vast Amounts of Parallel Data
Example:
他 经常 与 江泽民 总统 通 电话
He freq with J Zemin Pres via phone
He freq talked with President J Zemin over the phone
[Figure: constituents NP1, NP2, NP3 are marked on the source sentence and on its translation.]
59
Conclusions
- There is hope yet for widespread MT between many of the world’s language pairs
- MT offers a fertile yet extremely challenging ground for learning-based approaches that leverage diverse sources of information:
  - Syntactic structure of one or both languages
  - Word-to-word correspondences
  - Decomposable units of translation
  - Statistical Language Models
- AVENUE’s XFER approach provides a feasible solution to MT for languages with limited resources
- Promising approach for addressing the fundamental weaknesses in current corpus-based MT for languages with vast resources
61
Mapudungun-to-Spanish Example
English: I didn’t see Maria
Mapudungun: pelafiñ Maria
Spanish: No vi a María
62
Mapudungun-to-Spanish Example
English: I didn’t see Maria
Mapudungun: pelafiñ Maria
  pe -la -fi -ñ Maria
  see -neg -3.obj -1.subj.indicative Maria
Spanish: No vi a María
  No vi a María
  neg see.1.subj.past.indicative acc Maria
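The segmentation in this gloss can be reproduced with a tiny stem-plus-suffix analyzer. This is only a sketch: the lexicons contain just the four morphemes from the example, the feature names are informal renderings of the gloss, and the left-to-right suffix matching is an assumption, not a description of the hand-built Mapudungun analyzer.

```python
STEMS = {"pe": "see"}
SUFFIXES = {
    "la": {"negation": "+"},
    "fi": {"object_person": 3},
    "ñ": {"person": 1, "number": "sg", "mood": "ind"},
}

def analyze(word):
    """Split a verb into stem + suffixes and collect the suffix features."""
    for stem in STEMS:
        if not word.startswith(stem):
            continue
        rest, morphs, feats = word[len(stem):], [stem], {}
        while rest:
            suffix = next((s for s in SUFFIXES if rest.startswith(s)), None)
            if suffix is None:
                break                      # unknown material: give up on this stem
            morphs.append(suffix)
            feats.update(SUFFIXES[suffix])
            rest = rest[len(suffix):]
        if not rest:
            return morphs, feats
    return None

morphs, feats = analyze("pelafiñ")
print("-".join(morphs), feats)
```

The collected features (negation, object person, subject person and number, mood) are the ones the parse and transfer steps for this sentence rely on.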
63
pe-la-fi-ñ Maria
[Tree figure: V over the stem "pe".]
64
pe-la-fi-ñ Maria
[Tree figure: adds VSuff over "la". Annotation: Negation = +]
65
pe-la-fi-ñ Maria
[Tree figure: adds VSuffG over the VSuff for "la". Annotation: Pass all features up]
66
pe-la-fi-ñ Maria
[Tree figure: adds VSuff over "fi". Annotation: object person = 3]
67
pe-la-fi-ñ Maria
[Tree figure: adds VSuffG over the previous VSuffG and the VSuff for "fi". Annotation: Pass all features up from both children]
68
pe-la-fi-ñ Maria
[Tree figure: adds VSuff over "ñ". Annotation: person = 1, number = sg, mood = ind]
69
pe-la-fi-ñ Maria
[Tree figure: adds VSuffG over the previous VSuffG and the VSuff for "ñ". Annotation: Pass all features up from both children]
70
pe-la-fi-ñ Maria
[Tree figure: adds V over the V for "pe" and the top VSuffG. Annotation: Pass all features up from both children. Check that: 1) negation = + 2) tense is undefined]
71
pe-la-fi-ñ Maria
[Tree figure: adds NP over N over "Maria". Annotation: person = 3, number = sg, human = +]
72
pe-la-fi-ñ Maria
[Tree figure: adds S over VP over V and NP. Annotation: Check that NP is human = +. Pass features up from V]
73
Transfer to Spanish: Top-Down
[Tree figure: the Mapudungun tree next to the Spanish tree, which starts with S over VP.]
74
Transfer to Spanish: Top-Down
[Tree figure: the Spanish VP expands to V “a” NP. Annotation: Pass all features to Spanish side]
75
Transfer to Spanish: Top-Down
[Tree figure: same trees. Annotation: Pass all features down]
76
Transfer to Spanish: Top-Down
[Tree figure: same trees. Annotation: Pass object features down]
77
Transfer to Spanish: Top-Down
[Tree figure: same trees. Annotation: Accusative marker on objects is introduced because human = +]
78
Transfer to Spanish: Top-Down
VP::VP [VBar NP] -> [VBar "a" NP]
(
(X1::Y1) (X2::Y3)
((X2 type) = (*NOT* personal))
((X2 human) =c +)
(X0 = X1)
((X0 object) = X2)
(Y0 = X0)
((Y0 object) = (X0 object))
(Y1 = Y0)
(Y3 = (Y0 object))
((Y1 objmarker person) = (Y3 person))
((Y1 objmarker number) = (Y3 number))
((Y1 objmarker gender) = (Y3 gender))
)
[Tree figure: Mapudungun and Spanish trees, with the Spanish VP expanded to V “a” NP.]
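The effect of this rule can be approximated in ordinary code. The sketch below is a loose paraphrase, not the engine's unification semantics: `=c` is read here as "the feature must already be present with this value", feature structures are plain dictionaries, and the function name is hypothetical.

```python
def apply_vp_rule(vbar, np):
    """Return the target sequence [VBar, "a", NP], or None if the rule does not apply."""
    if np.get("type") == "personal":   # ((X2 type) = (*NOT* personal))
        return None
    if np.get("human") != "+":         # ((X2 human) =c +)
        return None
    # Copy the object's agreement features onto the verb's object marker,
    # mirroring the three (Y1 objmarker ...) = (Y3 ...) constraints.
    marker = {key: np.get(key) for key in ("person", "number", "gender")}
    target_vbar = dict(vbar, object=np, objmarker=marker)
    return [target_vbar, "a", np]

verb = {"lex": "ver", "person": 1, "number": "sg", "mood": "ind"}
maria = {"lex": "María", "person": 3, "number": "sg", "human": "+"}
book = {"lex": "libro", "person": 3, "number": "sg"}

with_marker = apply_vp_rule(verb, maria)
no_match = apply_vp_rule(verb, book)
```

For the human object the accusative marker "a" is inserted; for the non-human object this rule does not fire, and some other VP rule would have to apply.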
79
Transfer to Spanish: Top-Down
[Tree figure: the Spanish V expands to “no” V. Annotations: Pass person, number, and mood features to Spanish Verb. Assign tense = past]
80
Transfer to Spanish: Top-Down
[Tree figure: same trees. Annotation on “no”: Introduced because negation = +]
81
Transfer to Spanish: Top-Down
[Tree figure: the Spanish V is filled with the verb "ver".]
82
Transfer to Spanish: Top-Down
[Tree figure: "ver" is inflected as "vi". Annotation: person = 1, number = sg, mood = indicative, tense = past]
83
Transfer to Spanish: Top-Down
[Tree figure: the Spanish NP is filled with N "María". Annotation: Pass features over to Spanish side]
84
I didn’t see Maria
[Tree figure: completed Mapudungun tree for "pe-la-fi-ñ Maria" next to the completed Spanish tree with leaves “no”, vi, “a”, María.]