Published by Ashlie Logan. Modified over 9 years ago.
1
Building NLP Systems for Two Resource Scarce Indigenous Languages: Mapudungun and Quechua, and some other languages
Christian Monson, Ariadna Font Llitjós, Roberto Aranovich, Lori Levin, Ralf Brown, Erik Peterson, Jaime Carbonell, and Alon Lavie
2
Omnivorous MT
Eat whatever resources are available
Eat large or small amounts of data
Mapusaurus Roseae
–Mapu = land
–Mapuche = land people
–Mapudungun = land speech
3
AVENUE’s Inventory
Resources
–Parallel corpus
–Monolingual corpus
–Lexicon
–Morphological Analyzer (lemmatizer)
–Human Linguist
–Human non-linguist
Techniques
–Rule based transfer system
–Example Based MT
–Morphology Learning
–Rule Learning
–Interactive Rule Refinement
–Multi-Engine MT
This research was funded in part by NSF grant number IIS-0121-631.
4
Startup without corpus or linguist
Requires someone who is bilingual and literate
5
The Elicitation Tool has been used with these languages
Mapudungun, Hindi, Hebrew, Quechua, Aymara, Thai, Japanese, Chinese, Dutch, Arabic
6
Purpose of Elicitation
Provide a small but highly targeted corpus of hand aligned data
–To support machine learning from a small data set
–To discover basic word order
–To discover how syntactic dependencies are expressed
–To discover which grammatical meanings are reflected in the morphology or syntax of the language

srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste
tgtsent: eymi ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
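The elicitation records above follow a simple "field: value" layout, so they are easy to load for downstream learning. The sketch below is only an illustration of that idea, not the AVENUE tooling: the function names (parse_alignment, parse_record, aligned_words) are hypothetical, and it assumes one field per line and 1-based word indices, as the records on the slide suggest.

```python
import re

FIELDS = ("srcsent", "tgtsent", "aligned", "context", "comment")

# One record copied from the slide (spelling kept as it appears there).
SAMPLE = """
srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
"""

def parse_alignment(text):
    """Turn '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    # Each inner pair is '(source indices,target indices)', indices separated by spaces.
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", text):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs

def parse_record(block):
    """Read one elicitation record into a dict; 'aligned' becomes a list of index pairs."""
    record = {}
    for line in block.strip().splitlines():
        key, _, value = line.partition(":")   # split on the first colon only
        if key.strip() in FIELDS:
            record[key.strip()] = value.strip()
    record["aligned"] = parse_alignment(record.get("aligned", ""))
    return record

def aligned_words(record):
    """Resolve the 1-based alignment indices to the actual word groups."""
    src = record["srcsent"].split()
    tgt = record["tgtsent"].split()
    return [(" ".join(src[i - 1] for i in s), " ".join(tgt[j - 1] for j in t))
            for s, t in record["aligned"]]

if __name__ == "__main__":
    for pair in aligned_words(parse_record(SAMPLE)):
        print(pair)
```

Resolving indices to word groups makes many-to-many links such as (2 3,2 3) visible as phrase pairs rather than single words.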
7
Feature Structures
srcsent: Mary was not a leader.
context: Translate this as though it were spoken to a peer co-worker;
((actor ((np-function fn-actor)(np-animacy anim-human)(np-biological-gender bio-gender-female)
         (np-general-type proper-noun-type)(np-identifiability identifiable)(np-specificity specific)…))
 (pred ((np-function fn-predicate-nominal)(np-animacy anim-human)(np-biological-gender bio-gender-female)
        (np-general-type common-noun-type)(np-specificity specificity-neutral)…))
 (c-v-lexical-aspect state)(c-copula-type copula-role)(c-secondary-type secondary-copula)
 (c-solidarity solidarity-neutral)(c-v-grammatical-aspect gram-aspect-neutral)(c-v-absolute-tense past)
 (c-v-phase-aspect phase-aspect-neutral)(c-general-type declarative-clause)(c-polarity polarity-negative)
 (c-my-causer-intentionality intentionality-n/a)(c-comparison-type comparison-n/a)
 (c-relative-tense relative-n/a)(c-our-boundary boundary-n/a)…)
8
Current Work
Search space:
–Elements of meaning that might be expressed by syntax or morphology: tense, aspect, person, number, gender, causation, evidentiality, etc.
–Syntactic dependencies: subject, object
–Interactions of features:
  Tense and person
  Tense and interrogative mood
  Etc.
9
Current Work
For a new language
–For each item of the search space
  Eliminate it as irrelevant, or
  Explore it
–Using as few sentences as possible
10
Tools for Creating Elicitation Corpora (Mar 1, 2006)
[Diagram, shown on four consecutive slides. Components: List of semantic features and values; Feature Specification; Feature Maps (which combinations of features and values are of interest: Clause-Level, Noun-Phrase, Tense & Aspect, Modality, …); Feature Structure Sets; Reverse Annotated Feature Structure Sets (add English sentences); The Corpus; Sampling; Smaller Corpus. Tools labelled on successive slides: XML Schema and XSLT Script; Combination Formalism; Feature Structure Viewer.]
14
Outline
Two ideas
–Omnivorous MT
–Startup for low resource situation
Four Languages
–Mapudungun
–Quechua
–Hindi
–Hebrew
15
The Avenue Low Resource Scenario
[Architecture diagram. Stage labels: Elicitation, Rule Learning, Run-Time System, Rule Refinement. Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Learning Module, Handcrafted rules, Learned Transfer Rules, Morphology Analyzer, Learning Module, Lexical Resources, Run Time Transfer System, Decoder, Translation Correction Tool, Rule Refinement Module. Flow: INPUT TEXT to OUTPUT TEXT.]
19
Mapudungun Language
900,000 Mapuche people
At least 300,000 speakers of Mapudungun
Polysynthetic
sl: pe-rke-fi-ñ Maria
    ver-REPORT-3pO-1pSgS/IND
tl: DICEN QUE LA VI A MARÍA
    (They say that) I saw Maria.
20
AVENUE Mapudungun Joint project between Carnegie Mellon University, the Chilean Ministry of Education, and Universidad de la Frontera.
21
Mapudungun to Spanish Resources
Initially:
–Large team of native speakers at Universidad de la Frontera, Temuco, Chile
  Some knowledge of linguistics
  No knowledge of computational linguistics
–No corpus
–A few short word lists
–No morphological analyzer
Later: Computational Linguists with non-native knowledge of Mapudungun
Other considerations:
–Produce something that is useful to the community, especially for bilingual education
–Experimental MT systems are not useful
22
Mapudungun
[Architecture diagram repeated from "The Avenue Low Resource Scenario", with these additions for Mapudungun:]
–Corpus: 170 hours of spoken Mapudungun
–Example Based MT
–Spelling checker
–Spanish Morphology from UPC, Barcelona
23
Mapudungun Products
http://www.lenguasamerindias.org/
–Click: traductor mapudungún
–Dictionary lookup (Mapudungun to Spanish)
–Morphological analysis
–Example Based MT (Mapudungun to Spanish)
24
I Didn’t see Maria
[Tree diagrams. Mapudungun source: verb pe (V) with suffixes la, fi, ñ (VSuff, grouped as VSuffG), followed by the noun phrase Maria. Spanish target: “no” vi “a” María.]
25
Transfer to Spanish: Top-Down
[Tree diagram: Mapudungun source tree for pe-la-fi-ñ Maria; partial Spanish target tree with “a” before the object NP.]
VP::VP [VBar NP] -> [VBar "a" NP]
(
 (X1::Y1)
 (X2::Y3)
 ((X2 type) = (*NOT* personal))
 ((X2 human) =c +)
 (X0 = X1)
 ((X0 object) = X2)
 (Y0 = X0)
 ((Y0 object) = (X0 object))
 (Y1 = Y0)
 (Y3 = (Y0 object))
 ((Y1 objmarker person) = (Y3 person))
 ((Y1 objmarker number) = (Y3 number))
 ((Y1 objmarker gender) = (Y3 gender))
)
26
AVENUE Hebrew Joint project of Carnegie Mellon University and University of Haifa
27
Hebrew Language
Native language of about 3-4 million people in Israel
Semitic language, closely related to Arabic and with similar linguistic properties
–Root+Pattern word formation system
–Rich verb and noun morphology
–Particles attach as prefixes to the following word: definite article (H), prepositions (B,K,L,M), coordinating conjunction (W), relativizers ($,K$)…
Unique alphabet and Writing System
–22 letters represent (mostly) consonants
–Vowels represented (mostly) by diacritics
–Modern texts omit the diacritic vowels, which adds a further level of ambiguity: one “bare” written word can correspond to several different words
–Example: MHGR can be read as mehager, m+hagar, or m+h+ger
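The MHGR example can be made concrete by enumerating every way of peeling prefix particles off a bare word. The following is a minimal sketch under stated assumptions, not the Technion analyzer mentioned later in the deck: the prefix inventory is the one listed on the slide, the stem lexicon TOY_STEMS is a hypothetical three-entry set, and no constraints on prefix order are modelled.

```python
# Prefix particles in the transliteration used on the slide.
PREFIXES = ("K$", "H", "B", "K", "L", "M", "W", "$")

# Hypothetical toy lexicon of bare (unvocalized) stems, for illustration only.
TOY_STEMS = {"MHGR", "HGR", "GR"}

def segmentations(word, stems):
    """Return every split of `word` into zero or more prefixes plus a known stem."""
    results = []
    if word in stems:
        results.append([word])                 # the whole word is itself a stem
    for prefix in PREFIXES:
        # Only strip a prefix if something is left over to analyze.
        if word.startswith(prefix) and len(word) > len(prefix):
            for rest in segmentations(word[len(prefix):], stems):
                results.append([prefix] + rest)
    return results

if __name__ == "__main__":
    for analysis in segmentations("MHGR", TOY_STEMS):
        print("+".join(analysis))
```

With the toy lexicon, the three analyses mirror the three readings on the slide (the whole word, M+HGR, and M+H+GR); a real analyzer would also have to rule out prefix sequences the grammar does not allow.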
28
Hebrew Resources
–Morphological analyzer developed at Technion
–Constructed our own Hebrew-to-English lexicon, based primarily on existing “Dahan” H-to-E and E-to-H dictionary
–Human Computational Linguists
–Native Speakers
29
Hebrew
[Architecture diagram repeated from "The Avenue Low Resource Scenario".]
30
Flat Seed Rule Generation
Learning Example: NP
Eng: the big apple
Heb: ha-tapuax ha-gadol
Generated Seed Rule:
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
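A flat seed rule is just the two part-of-speech sequences plus the word alignment, written in transfer-rule notation. The sketch below shows that mapping for the example on this slide. It is an assumption-laden illustration rather than the AVENUE learner: seed_rule is a hypothetical helper, the POS tags and alignment are supplied by hand, and the "->" separator follows the notation of the hand-written rules elsewhere in the deck.

```python
def seed_rule(category, src_pos, tgt_pos, alignment):
    """Format a flat transfer rule from POS sequences and 1-based (source, target) links."""
    links = " ".join(f"(X{i}::Y{j})" for i, j in sorted(alignment))
    return (f"{category}::{category} "
            f"[{' '.join(src_pos)}] -> [{' '.join(tgt_pos)}] "
            f"({links})")

if __name__ == "__main__":
    # the big apple  /  ha-tapuax ha-gadol
    rule = seed_rule(
        "NP",
        ["ART", "ADJ", "N"],            # the, big, apple
        ["ART", "N", "ART", "ADJ"],     # ha, tapuax, ha, gadol
        [(1, 1), (1, 3), (2, 4), (3, 2)],
    )
    print(rule)
```

Because "the" links to both occurrences of "ha", X1 appears in two alignment pairs, which is how the flat rule records that Hebrew repeats the article on the adjective.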
31
Compositionality Learning
Initial Flat Rules:
S::S [ART ADJ N V ART N] -> [ART N ART ADJ V P ART N]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2) (X4::Y5) (X5::Y7) (X6::Y8))
NP::NP [ART ADJ N] -> [ART N ART ADJ]
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N]
((X1::Y1) (X2::Y2))
Generated Compositional Rule:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4))
32
Constraint Learning
Input: Rules and their Example Sets
S::S [NP V NP] -> [NP V P NP] {ex1,ex12,ex17,ex26}
((X1::Y1) (X2::Y2) (X3::Y4))
NP::NP [ART ADJ N] -> [ART N ART ADJ] {ex2,ex3,ex13}
((X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2))
NP::NP [ART N] -> [ART N] {ex4,ex5,ex6,ex8,ex10,ex11}
((X1::Y1) (X2::Y2))
Output: Rules with Feature Constraints:
S::S [NP V NP] -> [NP V P NP]
((X1::Y1) (X2::Y2) (X3::Y4)
 (X1 NUM = X2 NUM)
 (Y1 NUM = Y2 NUM)
 (X1 NUM = Y1 NUM))
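One simple way to picture how an agreement constraint such as (X1 NUM = X2 NUM) could be proposed from a rule's example set is to keep only the constituent pairs whose feature values match in every example. The sketch below illustrates that idea only. The slide does not describe the actual learning procedure, so the function agreement_constraints and the three example feature assignments are hypothetical.

```python
from itertools import combinations

def agreement_constraints(examples, feature):
    """Propose '(A feature = B feature)' for each pair that agrees in all examples."""
    names = sorted(examples[0])
    constraints = []
    for a, b in combinations(names, 2):
        # Both constituents must carry the feature and agree on it in every example.
        if all(feature in ex[a] and ex[a].get(feature) == ex[b].get(feature)
               for ex in examples):
            constraints.append(f"({a} {feature} = {b} {feature})")
    return constraints

# Hypothetical source-side feature values for S -> [NP V NP]:
# X1 = subject NP, X2 = verb, X3 = object NP.
EXAMPLES = [
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "pl"}},
    {"X1": {"NUM": "pl"}, "X2": {"NUM": "pl"}, "X3": {"NUM": "pl"}},
    {"X1": {"NUM": "sg"}, "X2": {"NUM": "sg"}, "X3": {"NUM": "sg"}},
]

if __name__ == "__main__":
    print(agreement_constraints(EXAMPLES, "NUM"))
```

In this toy data the subject and verb always agree in number while the object varies freely, so only the subject-verb constraint survives.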
33
Quechua facts
–Agglutinative language
–A stem can often have 10 to 12 suffixes, but it can have up to 28 suffixes
–Supposedly clear cut boundaries, but in reality several suffixes change when followed by certain other suffixes
–No irregular verbs, nouns or adjectives
–Does not mark for gender
–No adjective agreement
–No definite or indefinite articles (‘topic’ and ‘focus’ markers perform a task similar to that of articles and intonation in English or Spanish)
34
Quechua examples
–taki+ni (also written takiniy)
  sing 1sg
  (I sing) canto
–taki+sha+ni (takishaniy)
  sing progr 1sg
  (I am singing) estoy cantando
–taki+pa+ku+q+chu?
  taki: sing
  -pa+ku: to join a group to do something
  -q: agentive
  -chu: interrogative
  (para) cantar con la gente (del pueblo)?
  (to sing with the people (of the village)?)
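The glossed examples above can be reproduced with a small lookup over the "+"-separated morphemes. This is only a sketch: it assumes the word has already been segmented, the gloss tables contain just the morphemes from this slide, and the function name gloss is hypothetical. It does not model the suffix changes mentioned under "Quechua facts".

```python
# Glosses taken from the slide; the tables are deliberately tiny.
GLOSSES = {"taki": "sing", "sha": "progr", "ni": "1sg",
           "q": "agentive", "chu": "interrogative"}
# Suffix sequences that the slide glosses as a single unit.
COMPOUND_GLOSSES = {("pa", "ku"): "to join a group to do something"}

def gloss(segmented):
    """Gloss a '+'-segmented word, e.g. 'taki+sha+ni' -> [('taki', 'sing'), ...]."""
    morphs = segmented.rstrip("?").split("+")
    glossed = []
    i = 0
    while i < len(morphs):
        pair = tuple(morphs[i:i + 2])
        if pair in COMPOUND_GLOSSES:
            glossed.append(("+".join(pair), COMPOUND_GLOSSES[pair]))
            i += 2                       # consumed two morphemes at once
        else:
            glossed.append((morphs[i], GLOSSES.get(morphs[i], "?")))
            i += 1
    return glossed

if __name__ == "__main__":
    for word in ("taki+ni", "taki+sha+ni", "taki+pa+ku+q+chu?"):
        print(word, gloss(word))
```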
35
Quechua Resources
–A few native speakers, not linguists
–A computational linguist learning Quechua
–Two fluent, but non-native linguists
36
Quechua
[Architecture diagram repeated from "The Avenue Low Resource Scenario", with this addition for Quechua:]
–Parallel Corpus: OCR with correction
37
Grammar rules
;taki+sha+ni -> estoy cantando (I am singing)
{VBar,3}
VBar::VBar : [V VSuff VSuff] -> [V V]
(
 (X1::Y2)
 ((x0 person) = (x3 person))
 ((x0 number) = (x3 number))
 ((x2 mood) =c ger)
 ((y2 mood) = (x2 mood))
 ((y1 form) =c estar)
 ((y1 person) = (x3 person))
 ((y1 number) = (x3 number))
 ((y1 tense) = (x3 tense))
 ((x0 tense) = (x3 tense))
 ((y1 mood) = (x3 mood))
 ((x3 inflected) =c +)
 ((x0 inflected) = +)
)
lex = cantar, mood = ger
lex = estar, person = 1, number = sg, tense = pres, mood = ind
Spanish Morphology Generation
estoy cantando
38
Hindi Resources
–Large statistical lexicon from the Linguistic Data Consortium (LDC)
–Parallel Corpus from LDC
–Morphological Analyzer-Generator from LDC
–Lots of native speakers
–Computational linguists with little or no knowledge of Hindi
–Experimented with the size of the parallel corpus
  Miserly and large scenarios
39
Hindi
[Architecture diagram repeated from "The Avenue Low Resource Scenario", with these additions for Hindi:]
–15,000 Noun Phrases from Penn TreeBank
–Parallel Corpus
–EBMT
–SMT
–Supported by DARPA TIDES
40
Manual Transfer Rules: Example
; NP1 ke NP2 -> NP2 of NP1
; Ex: jIvana ke eka aXyAya
; life of (one) chapter
; ==> a chapter of life
;
{NP,12}
NP::NP : [PP NP1] -> [NP1 PP]
(
 (X1::Y2)
 (X2::Y1)
 ; ((x2 lexwx) = 'kA')
)
{NP,13}
NP::NP : [NP1] -> [NP1]
(
 (X1::Y1)
)
{PP,12}
PP::PP : [NP Postp] -> [Prep NP]
(
 (X1::Y2)
 (X2::Y1)
)
[Tree diagrams: Hindi source tree for "jIvana ke eka aXyAya" and English target tree for "one chapter of life".]
41
Hindi-English
System                           BLEU    M-BLEU   NIST
EBMT                             0.058   0.165    4.22
SMT                              0.093   0.191    4.64
XFER (naïve) man grammar         0.055   0.177    4.46
XFER (strong) no grammar         0.109   0.224    5.29
XFER (strong) learned grammar    0.116   0.231    5.37
XFER (strong) man grammar        0.135   0.243    5.59
XFER+SMT                         0.136   0.243    5.65
–Very miserly training data
–Seven combinations of components
–Strong decoder allows re-ordering
–Three automatic scoring metrics
42
Extra Slides
44
Feature Specification
–Defines features and their values
–Sets default values for features
–Specifies feature requirements and restrictions
–Written in XML
45
Feature Specification
Feature: c-copula-type (a copula is a verb like “be”; some languages do not have copulas)
Values
copula-n/a
  Restrictions: 1. ~(c-secondary-type secondary-copula)
  Notes:
copula-role
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. A role is something like a job or a function. "He is a teacher" "This is a vegetable peeler"
copula-identity
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. "Clark Kent is Superman" "Sam is the teacher"
copula-location
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. "The book is on the table" There is a long list of locative relations later in the feature specification.
copula-description
  Restrictions: 1. (c-secondary-type secondary-copula)
  Notes: 1. A description is an attribute. "The children are happy." "The books are long."
46
Feature Maps
Some features interact in the grammar
–English –s reflects person and number of the subject and tense of the verb.
–In the English present progressive, the auxiliary verb occupies a different position in a question than in a statement: He is running. Is he running?
We need to check many, but not all, combinations of features and values.
Using unlimited feature combinations leads to an unmanageable number of sentences.
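The growth in the number of sentences is easy to see by multiplying out the value counts. The sketch below contrasts the full cross-product with a restricted feature map. It is an illustration under assumptions: the value lists are copied from the deck's Evidentiality Map slide and treated here as one flat inventory, and TENSE_POLARITY_MAP and both helper functions are hypothetical names, not part of the AVENUE tools.

```python
from itertools import product
from math import prod

# Value lists from the Evidentiality Map slide, flattened for illustration.
FEATURES = {
    "polarity": ["positive", "negative"],
    "assertiveness": ["asserted", "neutral"],
    "tense": ["past", "present", "future"],
    "gram_aspect": ["perfective", "progressive", "habitual", "neutral"],
    "source": ["hearsay", "quotative", "inferred", "assumption",
               "visual", "auditory", "non-visual-or-auditory"],
}

def count_unrestricted(features):
    """Number of sentences needed if every value combination were elicited."""
    return prod(len(values) for values in features.values())

def expand_map(feature_map):
    """Enumerate only the combinations named by a (smaller) feature map."""
    names = list(feature_map)
    return [dict(zip(names, combo)) for combo in product(*feature_map.values())]

# Hypothetical map that explores only the interaction of tense and polarity.
TENSE_POLARITY_MAP = {"tense": FEATURES["tense"], "polarity": FEATURES["polarity"]}

if __name__ == "__main__":
    print("all combinations:", count_unrestricted(FEATURES))
    print("tense x polarity:", len(expand_map(TENSE_POLARITY_MAP)))
```

Five features already give 2 × 2 × 3 × 4 × 7 = 336 combinations, while the two-feature map needs only 6, which is the motivation for checking selected interactions rather than the full product.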
48
Evidentiality Map
Lexical Aspect: activity-accomplishment
Assertiveness: Assertiveness-asserted, Assertiveness-neutral
Polarity: Polarity-positive, Polarity-negative
Source | Tense | Gram. Aspect
Hearsay, quotative, inferred, assumption | Past | Perfective, progressive, habitual, neutral
Hearsay, quotative, inferred, assumption | Present, Future | habitual, neutral, progressive
Visual, Auditory, non-visual-or-auditory | Past | Perfective, progressive, habitual, neutral
Visual, Auditory, non-visual-or-auditory | Present | habitual, neutral, progressive
49
Current Work
Navigation
–Start: large search space of all possible feature combinations
–Finish: each feature has been eliminated as irrelevant or has been explored
–Goal: dynamically find the most efficient path through the search space for each language.
50
Current Work
Feature Detection
–Which features have an effect on morphosyntax?
–What is the effect?
–Drives the Navigation process
51
Feature Detection: Spanish
The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo
A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo
I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo
I saw a red book.
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi un libro rojo
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: no
Marked-on-dependent: yes
Marked-on-governor: no
Marked-on-other: no
Add/delete-word: no
Change-in-alignment: no
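The core of feature detection on a minimal pair is a comparison of the two target sentences: which words changed, and whether a word was added or deleted. The sketch below shows that comparison with Python's standard difflib. It is a simplified illustration, not the AVENUE feature detector: compare_minimal_pair is a hypothetical helper, it tokenizes on whitespace, and it ignores the alignment-based properties (Marked-on-head, Change-in-alignment) listed on the slide.

```python
from difflib import SequenceMatcher

def compare_minimal_pair(target_a, target_b):
    """Report word-level differences between two translations of a minimal pair."""
    a, b = target_a.split(), target_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Keep only the spans that differ between the two sentences.
    changes = [(" ".join(a[i1:i2]), " ".join(b[j1:j2]))
               for tag, i1, i2, j1, j2 in matcher.get_opcodes()
               if tag != "equal"]
    return {
        "add_delete_word": len(a) != len(b),   # sentence length changed
        "changes": changes,
    }

if __name__ == "__main__":
    # Definite vs. indefinite subject, then definite vs. indefinite object.
    print(compare_minimal_pair("La niña vió un libro rojo",
                               "Una niña vió un libro rojo"))
    print(compare_minimal_pair("Yo vi el libro rojo",
                               "Yo vi un libro rojo"))
```

For both Spanish pairs only the article changes and the sentence length stays the same, which matches the slide's summary that definiteness is marked on a dependent with no added or deleted word.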
52
Feature Detection: Chinese
A girl saw a red book.
((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
有 一个 女人 看见 了 一本 红色 的 书 。
The girl saw a red book.
((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
女人 看见 了 一本 红色的 书
Feature: definiteness
Values: definite, indefinite
Function-of-*: subject
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: no
53
Feature Detection: Chinese
I saw the red book
((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
红色的 书, 我 看见 了
I saw a red book.
((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
我 看见 了 一本 红色的 书 。
Feature: definiteness
Values: definite, indefinite
Function-of-*: object
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: yes
54
Feature Detection: Hebrew
A girl saw a red book.
((2,1) (3,2)(5,4)(6,3))
ילדה ראתה ספר אדום
The girl saw a red book
((1,1)(2,1)(3,2)(5,4)(6,3))
הילדה ראתה ספר אדום
I saw a red book.
((2,1)(4,3)(5,2))
ראיתי ספר אדום
I saw the red book.
((2,1)(3,3)(3,4)(4,4)(5,3))
ראיתי את הספר האדום
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: yes
Marked-on-dependent: yes
Marked-on-governor: no
Add-word: no
Change-in-alignment: no
55
Feature Detection Feeds into…
Corpus Navigation: which minimal pairs to pursue next.
–Don’t pursue gender in Mapudungun
–Do pursue definiteness in Hebrew
Morphology Learning:
–Morphological learner identifies the forms of the morphemes
–Feature detection identifies the functions
Rule learning:
–Rule learner will have to learn a constraint for each morpho-syntactic marker that is discovered
  E.g., Adjectives and nouns agree in gender, number, and definiteness in Hebrew.