Published by Noah Ryan. Modified over 9 years ago.
1
The AVENUE Project
Data Elicitation System
Lori Levin
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
2
Joint work with Dr. Jeff Good, Dr. Robert Frederking, and Alison Alvarez
3
Outline The AVENUE MT project –Including a list of languages we have worked on The elicitation tool –Including which kinds of fonts it works for The elicitation corpus –Including which languages it has been translated into Tools for building and revising elicitation corpora
4
MT Approaches
[Diagram: translation pyramid from source sentence to target sentence]
Source: Mi chiamo Lori
Target: My name is Lori
Interlingua: introduce-self
Analysis steps: Syntactic Parsing, Semantic Analysis
Generation steps: Sentence Planning, Text Generation
Transfer Rules: Pronoun-acc-1-sg chiamare-1sg N -> [np poss-1sg “name”] BE-pres N
Direct: SMT, EBMT
AVENUE: Automate Rule Learning
5
AVENUE Machine Translation System
Jaime Carbonell (PI), Alon Lavie (Co-PI), Lori Levin (Co-PI)
Rule learning: Katharina Probst
Rule components:
–Type information
–Synchronous context-free rules
–Alignments
–x-side constraints
–y-side constraints
–xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
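The alignment part of the NP rule can be illustrated with a small sketch. It only shows how the X::Y alignments reorder and duplicate constituents; it ignores the agreement and definiteness constraints, and the toy lexicon and the function name apply_alignments are assumptions made for illustration, not part of the AVENUE system.

```python
# Alignments from the slide's NP rule, written as (x_index, y_index), 1-based.
# [DET ADJ N] -> [DET N DET ADJ]
ALIGNMENTS = [(1, 1), (1, 3), (2, 4), (3, 2)]

# Toy word-level lexicon (an assumption for illustration only).
LEXICON = {"the": "ha", "old": "zaqen", "man": "ish"}


def apply_alignments(source_words, alignments, lexicon):
    """Fill each target slot with the translation of its aligned source word."""
    target_len = max(y for _, y in alignments)
    target = [None] * target_len
    for x, y in alignments:
        target[y - 1] = lexicon[source_words[x - 1]]
    return target


result = apply_alignments(["the", "old", "man"], ALIGNMENTS, LEXICON)
print(result)  # ['ha', 'ish', 'ha', 'zaqen']
```

The source determiner is aligned to two target slots, which is how the rule produces the doubled definite marker in ha-ish ha-zaqen.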
6
AVENUE Rules can be written by hand or learned automatically. Hybrid –Rule-based transfer –Statistical decoder –Multi-engine combinations with SMT and EBMT
7
AVENUE systems (Small and experimental, but tested on unseen data) Hebrew-to-English –Alon Lavie, Shuly Wintner, Katharina Probst –Hand-written and automatically learned –Automatic rules trained on 120 sentences perform slightly better than about 20 hand-written rules. Hindi-to-English –Lavie, Peterson, Probst, Levin, Font, Cohen, Monson –Automatically learned –Performs better than SMT when training data is limited to 50K words
8
AVENUE systems (Small and experimental, but tested on unseen data) English-to-Spanish –Ariadna Font Llitjos –Hand-written, automatically corrected Mapudungun-to-Spanish –Roberto Aranovich and Christian Monson –Hand-written Dutch-to-English –Simon Zwarts –Hand-written
9
Outline The AVENUE MT project The elicitation tool The questionnaire Tools for building questionnaires
10
Elicitation Get data from someone who is –Bilingual –Literate With consistent spelling –Not experienced with linguistics
11
English-Hindi Example Elicitation Tool: Erik Peterson
12
English-Chinese Example Note: Translator has to insert spaces between words in Chinese.
13
English-Arabic Example
14
Outline The AVENUE MT project The elicitation tool The elicitation corpus Tools for building elicitation corpora
15
Size of Questionnaire Around 3200 sentences 20K words
16
EC Sample: clause level Mary is writing a book for John. Who let him eat the sandwich? Who had the machine crush the car? They did not make the policeman run. Mary had not blinked. The policewoman was willing to chase the boy. Our brothers did not destroy files. He said that there is not a manual. The teacher who wrote a textbook left. The policeman chased the man who was a thief. Mary began to work. Tense, aspect, transitivity, animacy Questions, causation and permission Interaction of lexical and grammatical aspect Volitionality Embedded clauses and sequence of tense Relative clauses Phase aspect
17
EC Sample: noun phrase level
The man quit in November.
The man works in the afternoon.
The balloon floated over the library.
The man walked over the platform.
The man came out from among the group of boys.
The long weekly meeting ended.
The large bus to the post office broke down.
The second man laughed.
All five boys laughed.
Temporal and locative meanings
Quantifiers
Numbers
Combinations of different types of modifiers
–My book: possession, definiteness
–A book of mine: possession, indefiniteness
18
Organization into Minimal Pairs

srcsent: Tú caíste.
tgtsent: Eymi ütrünagimi.
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

srcsent: Tú estás cayendo.
tgtsent: Eymi petu ütrünagimi.
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

srcsent: Tú caíste.
tgtsent: Eymi ütrünagimi.
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
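A minimal sketch of how such a record might be read into a program, assuming one "key: value" field per line. The function names parse_record and parse_alignment are hypothetical; the actual file handling of the elicitation tool is not described on the slides.

```python
import re

RECORD = """srcsent: Tú estás cayendo.
tgtsent: Eymi petu ütrünagimi.
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling"""


def parse_record(text):
    """Split each 'key: value' line at its first colon."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def parse_alignment(aligned):
    """Turn '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", aligned):
        pairs.append(([int(i) for i in src.split()],
                      [int(j) for j in tgt.split()]))
    return pairs


fields = parse_record(RECORD)
alignment = parse_alignment(fields["aligned"])
print(fields["tgtsent"], alignment)
```

Each alignment pair holds a list of source word positions and a list of target word positions, so many-to-many links such as (2 3,2 3) are kept as one unit.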
19
Feature Detection: Spanish
The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo
A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo
I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo
I saw a red book.
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi un libro rojo
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: no
Marked-on-dependent: yes
Marked-on-governor: no
Marked-on-other: no
Add/delete-word: no
Change-in-alignment: no
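A rough illustration of the comparison behind the "Add/delete-word" line, using the Spanish pairs from this slide and one Chinese pair from the deck. This is not the project's feature detection method, which the slides do not spell out; the function compare_pair and its output keys are assumptions. It only checks whether the two translations of a minimal pair differ in word count and, if not, at which word positions they differ.

```python
def compare_pair(tgt_a, tgt_b):
    """Compare two whitespace-tokenized translations of a minimal pair."""
    a, b = tgt_a.split(), tgt_b.split()
    add_delete = len(a) != len(b)
    # Position-by-position comparison only makes sense for equal lengths.
    changed = [] if add_delete else [
        i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y
    ]
    return {"add_delete_word": add_delete, "changed_positions": changed}


subject_pair = compare_pair("La niña vió un libro rojo",
                            "Una niña vió un libro rojo")
object_pair = compare_pair("Yo vi el libro rojo", "Yo vi un libro rojo")
chinese_pair = compare_pair("有 一个 女人 看见 了 一本 红色 的 书 。",
                            "女人 看见 了 一本 红色的 书")
print(subject_pair, object_pair, chinese_pair)
```

For the Spanish pairs only the determiner position changes, which fits "Marked-on-dependent: yes" and "Add/delete-word: no"; for the Chinese pair the word count changes, which fits "Add/delete-word: yes".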
20
Feature Detection: Chinese
A girl saw a red book.
((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
有 一个 女人 看见 了 一本 红色 的 书 。
The girl saw a red book.
((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
女人 看见 了 一本 红色的 书
Feature: definiteness
Values: definite, indefinite
Function-of-*: subject
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: no
21
Feature Detection: Chinese
I saw the red book
((1, 3)(2, 4)(2, 5)(4, 1)(5, 2))
红色的 书, 我 看见 了
I saw a red book.
((1,1)(2,2)(2,3)(2, 4)(4,5)(5,6))
我 看见 了 一本 红色的 书 。
Feature: definiteness
Values: definite, indefinite
Function-of-*: object
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: yes
22
Feature Detection: Hebrew
A girl saw a red book.
((2,1) (3,2)(5,4)(6,3))
ילדה ראתה ספר אדום
The girl saw a red book
((1,1)(2,1)(3,2)(5,4)(6,3))
הילדה ראתה ספר אדום
I saw a red book.
((2,1)(4,3)(5,2))
ראיתי ספר אדום
I saw the red book.
((2,1)(3,3)(3,4)(4,4)(5,3))
ראיתי את הספר האדום
Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: yes
Marked-on-dependent: yes
Marked-on-governor: no
Add-word: no
Change-in-alignment: no
23
Feature Detection Feeds into… Corpus Navigation: which minimal pairs to pursue next. –Don’t pursue gender in Mapudungun –Do pursue definiteness in Hebrew Morphology Learning: –Morphological learner identifies the forms of the morphemes –Feature detection identifies the functions Rule learning: –Rule learner will have to learn a constraint for each morpho- syntactic marker that is discovered E.g., Adjectives and nouns agree in gender, number, and definiteness in Hebrew.
24
Languages The set of feature structures with English sentences has been delivered to the Linguistic Data Consortium as part of the Reflex program. Translated (by LDC) into: –Thai –Bengali Plans to translate into: –Seven “strategic” languages per year for five years. As one small part of a language pack (BLARK) for each language.
25
Languages
Spanish version in progress at New Mexico State University (Helmreich and Cowie)
–Plans to translate into Guarani
Portuguese version in progress in Brazil (Marcello Modesto)
–Plans to translate into Karitiana (200 speakers)
Plans to translate into Inupiaq (Kaplan and MacLean)
26
Previous Elicitation Work Pilot corpus –Around 900 sentences –No feature structures Mapudungun –Two partial translations Quechua –Three translations Aymara –Seven translations Hebrew Hindi –Several translations Dutch
27
Feature Structures The EC is actually a corpus of feature structures that happen to have English or Spanish sentences attached to them.
28
Bengali example with feature structure
srcsent: The large bus to the post office broke down.
context:
tgtsent:
((actor ((modifier ((mod-role mod-descriptor) (mod-role role-loc-general-to)))
(np-identifiability identifiable)(np-specificity specific)
(np-biological-gender bio-gender-n/a)(np-animacy anim-inanimate)
(np-person person-third)(np-function fn-actor)
(np-general-type common-noun-type)(np-number num-sg)
(np-pronoun-exclusivity inclusivity-n/a)(np-pronoun-antecedent antecedent-n/a)
(np-distance distance-neutral)))
(c-general-type declarative-clause)(c-my-causer-intentionality intentionality-n/a)
(c-comparison-type comparison-n/a)(c-relative-tense relative-n/a)
(c-our-boundary boundary-n/a)(c-comparator-function comparator-n/a)
(c-causee-control control-n/a)(c-our-situations situations-n/a)
(c-comparand-type comparand-n/a)(c-causation-directness directness-n/a)
(c-source source-neutral)(c-causee-volitionality volition-n/a)
(c-assertiveness assertiveness-neutral)(c-solidarity solidarity-neutral)
(c-polarity polarity-positive)(c-v-grammatical-aspect gram-aspect-neutral)
(c-adjunct-clause-type adjunct-clause-type-n/a)(c-v-phase-aspect phase-aspect-neutral)
(c-v-lexical-aspect activity-accomplishment)(c-secondary-type secondary-neutral)
(c-event-modality event-modality-none)(c-function fn-main-clause)
(c-minor-type minor-n/a)(c-copula-type copula-n/a)
(c-v-absolute-tense past)(c-power-relationship power-peer)
(c-our-shared-subject shared-subject-n/a)(c-question-gap gap-n/a))
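The parenthesized notation can be read with a very small recursive parser. The sketch below is an illustration under assumptions: the functions tokenize and parse are hypothetical names, the input is a shortened feature structure built from feature names on this slide, and nothing here reflects the project's own tools.

```python
import re


def tokenize(text):
    """Split into '(' , ')' and atoms such as 'bio-gender-n/a'."""
    return re.findall(r"\(|\)|[^\s()]+", text)


def parse(tokens):
    """Consume tokens and return a nested list, or a string for an atom."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # drop the closing ')'
    return items


# Shortened example using feature names from the slide.
fs_text = ("((actor ((np-number num-sg)(np-person person-third)))"
           "(c-polarity polarity-positive)(c-v-absolute-tense past))")
tree = parse(tokenize(fs_text))
print(tree)
```

Each feature becomes a two-element list of name and value, and a role such as actor has a nested list of noun phrase features as its value.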
29
Why feature structures? Decide what grammatical meaning to elicit. Represent it in a feature structure. Formulate an English or Spanish sentence that expresses that meaning. –We can use the same corpus of feature structures for several elicitation languages Have the informant translate it.
30
Grammatical meanings vs syntactic categories Features and values are based on a collection of grammatical meanings –Many of which are similar to the grammatemes of the Prague Treebanks
31
Grammatical Meanings
YES:
Semantic Roles
Identifiability
Specificity
Time
–Before, after, or during time of speech
Modality
NO:
Case
Voice
Determiners
Auxiliary verbs
32
Grammatical Meanings
YES:
How is identifiability expressed?
–Determiner
–Word order
–Optional case marker
–Optional verb agreement
How is specificity expressed?
How are generics expressed?
How are predicate nominals marked?
NO:
How are English determiners translated?
–The boy cried.
–The lion is a fierce beast.
–I ate a sandwich.
–He is a soldier. Il est soldat.
33
Argument Roles Actor Undergoer Predicate and predicatee –The woman is the manager. Recipient –I gave a book to the students. Beneficiary –I made a phone call for Sam.
34
Why not subject and object? Languages use their voice systems for different purposes. Mapudungun obligatorily uses an inverse marked verb when third person acts on first or second person. –Verb agrees with undergoer –Undergoer exhibits other subjecthood properties –Actor may be object. Yes: How are actor and undergoer encoded in combination with other semantic features like adversity (Japanese) and person (Mapudungun)? No: How is English voice translated into another language?
35
Argument Roles Accompaniment –With someone –With pleasure Material –(out) of wood About 20 more roles –From the Lingua checklist; Comrie & Smith (1977) –Many also found in tectogrammatical representations in the Prague Treebanks Around 80 locative relations –From Lingua checklist Many temporal relations
36
Noun Phrase Features Person Number Biological gender Animacy Distance (for deictics) Identifiability Specificity Possession Other semantic roles –Accompaniment, material, location, time, etc. Type –Proper, common, pronoun Cardinals Ordinals Quantifiers Given and new information –Not used yet because of limited context in the elicitation tool.
37
Clause level features Tense Aspect –Lexical, grammatical, phase Type –Declarative, open-q, yes-no-q Function –Main, argument, adjunct, relative Source –Hearsay, first-hand, sensory, assumed Assertedness –Asserted, presupposed, wanted Modality –Permission, obligation –Internal, external
38
Other clause types (Constructions) Causative –Make/let/have someone do something Predication –May be expressed with or without an overt copula. Existential –There is a problem. Impersonal –One doesn’t smoke in restaurants in the US. Lament –If only I had read the paper. Conditional Comparative Etc.
39
Outline The AVENUE MT project The elicitation tool The elicitation corpus Tools for elicitation corpora
40
Mar 1, 2006
Tools for Creating Elicitation Corpora
[Pipeline diagram; labels as they appear on the slide]
Feature Specification: list of semantic features and values
Feature Maps: which combinations of features and values are of interest (Clause-Level, Noun-Phrase, Tense & Aspect, Modality, …)
Feature Structure Sets
Reverse Annotated Feature Structure Sets: add English sentences
The Corpus
Sampling
Smaller Corpus
Tools shown on this slide: XML Schema, XSLT Script
41
Tools for Creating Elicitation Corpora
[Same pipeline diagram as slide 40]
Tool shown on this slide: Combination Formalism
42
Tools for Creating Elicitation Corpora
[Same pipeline diagram as slide 40]
Tool shown on this slide: Feature Structure Viewer
43
Tools for Creating Elicitation Corpora
[Same pipeline diagram as slide 40, with no tool label]
44
Feature Specification Defines Features and their values Sets default values for features Specifies feature requirements and restrictions Written in XML
45
Feature Specification
Feature: c-copula-type (a copula is a verb like “be”; some languages do not have copulas)
Values:
copula-n/a
–Restrictions: 1. ~(c-secondary-type secondary-copula)
–Notes:
copula-role
–Restrictions: 1. (c-secondary-type secondary-copula)
–Notes: 1. A role is something like a job or a function. "He is a teacher" "This is a vegetable peeler"
copula-identity
–Restrictions: 1. (c-secondary-type secondary-copula)
–Notes: 1. "Clark Kent is Superman" "Sam is the teacher"
copula-location
–Restrictions: 1. (c-secondary-type secondary-copula)
–Notes: 1. "The book is on the table" There is a long list of locative relations later in the feature specification.
copula-description
–Restrictions: 1. (c-secondary-type secondary-copula)
–Notes: 1. A description is an attribute. "The children are happy." "The books are long."
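The restrictions on c-copula-type can be checked mechanically. The sketch below encodes them in a Python dictionary purely for illustration; the real specification is written in XML, and the table layout and the function satisfies_restriction are assumptions, not the project's format.

```python
# value of c-copula-type -> (feature, value, must_hold)
# must_hold=False encodes the negated restriction written with "~" on the slide.
RESTRICTIONS = {
    "copula-n/a": ("c-secondary-type", "secondary-copula", False),
    "copula-role": ("c-secondary-type", "secondary-copula", True),
    "copula-identity": ("c-secondary-type", "secondary-copula", True),
    "copula-location": ("c-secondary-type", "secondary-copula", True),
    "copula-description": ("c-secondary-type", "secondary-copula", True),
}


def satisfies_restriction(feature_structure):
    """Check the restriction attached to this structure's c-copula-type."""
    feature, value, must_hold = RESTRICTIONS[feature_structure["c-copula-type"]]
    return (feature_structure.get(feature) == value) == must_hold


ok_copula = satisfies_restriction(
    {"c-copula-type": "copula-role", "c-secondary-type": "secondary-copula"})
bad_copula = satisfies_restriction(
    {"c-copula-type": "copula-n/a", "c-secondary-type": "secondary-copula"})
ok_non_copula = satisfies_restriction(
    {"c-copula-type": "copula-n/a", "c-secondary-type": "secondary-neutral"})
print(ok_copula, bad_copula, ok_non_copula)
```

A check of this kind lets generated feature structures that violate a restriction be discarded before English sentences are written for them.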
46
Feature Maps Some features interact in the grammar –English –s reflects person and number of the subject and tense of the verb. –In expressing the English present progressive tense, the auxiliary verb is in a different place in a question and a statement: He is running. Is he running? We need to check many, but not all combinations of features and values. Using unlimited feature combinations leads to an unmanageable number of sentences
47
Feature Combination Template
((predicatee ((np-general-type pronoun-type common-noun-type)
(np-person person-first person-second person-third)
(np-number num-sg num-pl)
(np-biological-gender bio-gender-male bio-gender-female)))
{[(predicate ((np-general-type common-noun-type) (np-person person-third))) (c-copula-type role)]
[(predicate ((adj-general-type quality-type) (c-copula-type attributive)))]
[(predicate ((np-general-type common-noun-type) (np-person person-third) (c-copula-type identity)))]}
(c-secondary-type secondary-copula)
(c-polarity #all)
(c-general-type declarative)
(c-speech-act sp-act-state)
(c-v-grammatical-aspect gram-aspect-neutral)
(c-v-lexical-aspect state)
(c-v-absolute-tense past present future)
(c-v-phase-aspect durative))
Summarizes 288 feature structures, which are automatically generated.
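The expansion of such a template is a Cartesian product over the listed values. The slide does not say how the count of 288 arises; the sketch below shows one reading that reproduces it, offered only as an assumption: common nouns are restricted to third person (as a feature restriction might require), and "#all" for polarity expands to two values. The data layout and function names are hypothetical.

```python
from itertools import product

PREDICATEE_OPTIONS = {
    "np-general-type": ["pronoun-type", "common-noun-type"],
    "np-person": ["person-first", "person-second", "person-third"],
    "np-number": ["num-sg", "num-pl"],
    "np-biological-gender": ["bio-gender-male", "bio-gender-female"],
}


def expand(options):
    """Yield one dict per combination of the listed values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


def allowed(np):
    # Assumed restriction: only pronouns may be first or second person.
    return (np["np-general-type"] == "pronoun-type"
            or np["np-person"] == "person-third")


predicatees = [np for np in expand(PREDICATEE_OPTIONS) if allowed(np)]
copula_types = ["role", "attributive", "identity"]       # the three {...} alternatives
polarities = ["polarity-positive", "polarity-negative"]  # assumed meaning of #all
tenses = ["past", "present", "future"]

feature_structures = [
    {"predicatee": np, "c-copula-type": cop,
     "c-polarity": pol, "c-v-absolute-tense": tense}
    for np, cop, pol, tense in product(predicatees, copula_types,
                                       polarities, tenses)
]
print(len(predicatees), len(feature_structures))  # 16 288
```

Without the assumed restriction the same template would expand to 24 x 3 x 2 x 3 = 432 structures, which shows how quickly unrestricted combinations grow.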
48
Adding Sentences to Feature Structures
srcsent: Mary was not a leader.
context: Translate this as though it were spoken to a peer co-worker;
((actor ((np-function fn-actor)(np-animacy anim-human)(np-biological-gender bio-gender-female)
(np-general-type proper-noun-type)(np-identifiability identifiable)(np-specificity specific)…))
(pred ((np-function fn-predicate-nominal)(np-animacy anim-human)(np-biological-gender bio-gender-female)
(np-general-type common-noun-type)(np-specificity specificity-neutral)…))
(c-v-lexical-aspect state)(c-copula-type copula-role)(c-secondary-type secondary-copula)
(c-solidarity solidarity-neutral)(c-v-grammatical-aspect gram-aspect-neutral)
(c-v-absolute-tense past)(c-v-phase-aspect phase-aspect-neutral)
(c-general-type declarative-clause)(c-polarity polarity-negative)
(c-my-causer-intentionality intentionality-n/a)(c-comparison-type comparison-n/a)
(c-relative-tense relative-n/a)(c-our-boundary boundary-n/a)…)
49
Difficult Issues in Adding Sentences Have to remember that the grammatical meanings don’t correspond exactly to English morphemes. –Identifiability and specificity vs the and a –Modality, tense, aspect vs auxiliary verbs The meaning has to be clear to a translator. –If English is going to be the source language for translation, the clearest way to say something may not be the most common way it is said in real text or conversation.
50
Hard Problems Expressing meanings that are not grammaticalized in English. –Evidentiality: He stole the bread. Context: Translate this as if you do not have first hand knowledge. In English, we might say, “They say that he stole the bread” or “I hear that he stole the bread.”
51
Hard Problems Reverse annotating things that can be said in several ways in English. –Impersonals: One doesn’t smoke here. You don’t smoke here. They don’t smoke here. There’s no smoking here. Credit cards aren’t accepted. –Problem in the Reflex corpus because space was limited.
52
Evaluation Current funding has not covered evaluation of the questionnaire. –Except for informal observations as it was translated into several languages. Does it elicit the meanings it was intended to elicit? –Informal observation: usually Is it useful for machine translation?
53
Navigation Currently, feature combinations are specified by a human. Plan to work in active learning mode. –Build seed questionnaire –Translate some data –Do some learning –Identify most valuable pieces of information to get next –Generate an RTB for those pieces of information –Translate more –Learn more –Generate more, etc.
54
Summary Feature Specification: –lists features and values –Grammatical meanings Feature Combinations Set of Feature Structures Add English or Spanish Sentences Get a translation and word alignment from a bilingual, literate informant