1
Eliciting a corpus of word-aligned phrases for MT
Lori Levin, Alon Lavie, Jaime Carbonell, Erik Peterson, Alison Alvarez
Language Technologies Institute, Carnegie Mellon University
2
Introduction
Problem: Building Machine Translation systems for languages with scarce resources:
  Not enough data for Statistical MT and Example-Based MT
  Not enough human linguistic expertise for writing rules
Approach:
  Elicit high-quality, word-aligned data from bilingual speakers
  Learn transfer rules from the elicited data
3
Modules of the AVENUE/MilliRADD rule learning system and MT system
Components shown in the diagram:
  Word-aligned elicited data
  Learning Module
  Transfer Rules
  Translation Lexicon
  Run Time Transfer System
  Lattice Decoder
  English Language Model
  Word-to-Word Translation Probabilities
Example transfer rule:
  {PP,4894}
  ;;Score:
  PP::PP [NP POSTP] -> [PREP NP]
  ((X2::Y1) (X1::Y2))
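The example transfer rule can be read as follows: a source-side PP consisting of an NP followed by a postposition maps to a target-side PP consisting of a preposition followed by an NP, and the constituent alignments (X2::Y1) and (X1::Y2) say which source constituent fills which target slot. The sketch below illustrates only that reordering step. The function `apply_alignment` and the toy constituents are hypothetical and are not taken from the AVENUE system.

```python
# Hypothetical illustration of the constituent alignments in the rule
#   PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1) (X1::Y2))
# This is not the AVENUE implementation; names and data are invented here.

def apply_alignment(source_constituents, alignments, target_length):
    """Place each source constituent X_i into its aligned target slot Y_j.

    `alignments` is a list of (i, j) pairs using the rule's 1-based indices.
    """
    target = [None] * target_length
    for x_index, y_index in alignments:
        target[y_index - 1] = source_constituents[x_index - 1]
    return target

# (X2::Y1) (X1::Y2): the adposition moves to the front, the NP follows it.
PP_RULE_ALIGNMENTS = [(2, 1), (1, 2)]

# Toy constituents in source order (NP, then postposition), assumed to be
# already translated into the target language.
source = ["the table", "on"]
reordered = apply_alignment(source, PP_RULE_ALIGNMENTS, target_length=2)
print(" ".join(reordered))
```

A full transfer system would also translate the constituents and check feature constraints; this sketch covers only the reordering that the alignment pairs express.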
4
Outline
  Demo of elicitation interface
  Description of elicitation corpus
5
Demo of Elicitation Tool
Speaker needs to be bilingual and literate: no other knowledge necessary
Mappings between words and phrases:
  Many-to-many, one-to-none, many-to-none, etc.
  Create phrasal mappings
Fonts and character sets:
  Including Hindi, Chinese, and Arabic
Add morpheme boundaries to target language
Add alternate translations
Notes and context
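One simple way to picture the word and phrase mappings listed above is as a set of (source index, target index) links. The sketch below is illustrative only: the link set is an assumed alignment for the pair "I am falling" / "Tranmeken" from the minimal-pairs slide, and the helper names are hypothetical. It is not the elicitation tool's actual storage format.

```python
# Illustrative only: the links below are an assumed alignment, and this is
# not the elicitation tool's file format.

def targets_by_source(links, n_source):
    """Map each source word index to the sorted target indices it links to."""
    mapping = {i: [] for i in range(n_source)}
    for s, t in sorted(links):
        mapping[s].append(t)
    return mapping

def unaligned_source(links, n_source):
    """Return source word indices that have no link at all (one-to-none)."""
    aligned = {s for s, _ in links}
    return [i for i in range(n_source) if i not in aligned]

source = ["I", "am", "falling"]
target = ["Tranmeken"]

# Many-to-one: all three English words link to the single Mapudungun word.
links = {(0, 0), (1, 0), (2, 0)}
print(targets_by_source(links, len(source)))
print(unaligned_source(links, len(source)))

# One-to-none: if the informant leaves "am" unlinked, it is reported here.
partial_links = {(0, 0), (2, 0)}
print(unaligned_source(partial_links, len(source)))
```

Storing links as index pairs keeps many-to-many and unaligned cases in one structure, since a word may appear in any number of pairs, including none.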
6
English-Chinese Example
7
English-Hindi Example
8
Spanish-Mapudungun Example
9
English-Arabic Example
10
Testing of Elicitation Tool
DARPA Hindi Surprise Language Exercise
  Around 10 Hindi speakers
  Around 17,000 phrases translated and aligned
  Elicitation corpus: NPs and PPs from the treebanked Brown Corpus
11
Elicitation Corpus: Basic Principles
Minimal pairs
Syntactic compositionality
Special semantic/pragmatic constructions
Navigation based on language typology and universals
Challenges
12
Elicitation Corpus: Minimal Pairs
Eng: I fell.
Sp: Caí
M: Tranün

Eng: You (John) fell.
Sp: Tú (Juan) caíste
M: Eymi tranimi (Kuan)

Eng: You (Mary) fell.
Sp: Tú (María) caíste
M: Eymi tranimi (Maria)

Eng: I am falling.
Sp: Estoy cayendo
M: Tranmeken

Eng: You (John) are falling.
Sp: Tú (Juan) estás cayendo
M: Eimi (Kuan) tranmekeymi

Mapudungun: Spoken by around one million people in Chile and Argentina.
13
Using feature vectors to detect minimal pairs
np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien
cl1:(subj np1).intr-ag.past.complete
Eng: You (John) fell.
Sp: Tú (Juan) caíste
M: Eymi tranimi (Kuan)

np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien
Eng: You (Mary) fell.
Sp: Tú (María) caíste
M: Eymi tranimi (Maria)

Inventory of features is based on fieldwork checklists: Comrie and Smith; Bouquiaux and Thomas.
Feature vectors can be extracted from the output of a parser for English or Spanish. (Except for features that English and Spanish do not have…)
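The idea of detecting minimal pairs from feature vectors can be sketched as a comparison that looks for vectors differing in exactly one feature. In the sketch below, the feature values come from the two vectors on the slide, but the key names (such as `np1.gender`) and the helper functions are assumptions made for illustration; the slides do not name the feature slots.

```python
# Sketch of minimal-pair detection. Feature values are from the slide;
# the key names and helper functions are hypothetical.

def differing_features(a, b):
    """Return the sorted feature names on which two vectors disagree."""
    keys = sorted(set(a) | set(b))
    return [k for k in keys if a.get(k) != b.get(k)]

def is_minimal_pair(a, b):
    """Two vectors form a minimal pair if exactly one feature differs."""
    return len(differing_features(a, b)) == 1

you_john_fell = {
    "np1.type": "pro-pers",
    "np1.animacy": "hum",
    "np1.person": "2",
    "np1.number": "sg",
    "np1.gender": "masc",
    "np1.clusivity": "no-clusn",
    "np1.definiteness": "no-def",
    "np1.alienability": "no-alien",
    "cl1.verb-class": "intr-ag",
    "cl1.tense": "past",
    "cl1.aspect": "complete",
}

# "You (Mary) fell." differs only in the gender of the subject pronoun.
you_mary_fell = {**you_john_fell, "np1.gender": "fem"}

print(differing_features(you_john_fell, you_mary_fell))
print(is_minimal_pair(you_john_fell, you_mary_fell))
```

If the two target-language translations of such a pair are identical, as in the Mapudungun examples, that suggests the differing feature is not marked in that construction.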
14
Syntactic Compositionality
The tree
The tree fell.
I think that the tree fell.
We learn rules for smaller phrases
  E.g., NP
Their root nodes become non-terminals in the rules for larger phrases.
  E.g., S containing an NP
Meaning of a phrase is predictable from the meanings of the parts.
15
Special Semantic and Pragmatic Constructions
Meaning may not be compositional
  Not predictable from the meanings of the parts
May not follow normal rules of grammar.
  Suggestion: Why not go?
Word-for-word translation may not work.
Tend to be sources of MT mismatches
  Comparative:
    English:  Hotel A is [closer than Hotel B]
    Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]
    Gloss:    Hotel A TOP Hotel B than close is
  “Closer than Hotel B” is a constituent in English, but “Hoteru B yori tikai” is not a constituent in Japanese.
16
Examples of Semantic/Pragmatic Categories
Speech Acts: requests, suggestions, etc.
Comparatives and Equatives
Modality: possibility, probability, ability, obligation, uncertainty, evidentiality
Correlatives (the more the merrier)
Causatives
Etc.
17
A Challenge: Combinatorics
Person (1, 2, 3, 4)
Number (sg, pl, du, paucal)
Gender/Noun Class (?)
Animacy (animate/inanimate)
Definiteness (definite/indefinite)
Proximity (near, far, very far, etc.)
Inclusion/exclusion
Multiply with: tenses and aspects (complete, incomplete, real, unreal, iterative, habitual, present, past, recent past, future, recent future, non-past, non-future, etc.)
Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.
(Case marking and agreement may vary with verb tense, verb class, animacy, definiteness, and whether or not object outranks subject in person or animacy.)
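To give a rough sense of scale, the sketch below multiplies out only the values that the slide spells out. It makes several simplifying assumptions: Gender/Noun Class and Inclusion/exclusion are left out because their values are not enumerated, every "etc." is ignored, and the tense and aspect labels are treated as a single flat dimension. The result is therefore an illustrative lower bound, not a figure from the presentation.

```python
# Rough lower bound on the number of feature combinations, using only the
# values listed on the slide. Treating tense/aspect as one flat dimension
# and skipping the unenumerated features are assumptions of this sketch.
from math import prod

DIMENSIONS = {
    "person": ["1", "2", "3", "4"],
    "number": ["sg", "pl", "du", "paucal"],
    "animacy": ["animate", "inanimate"],
    "definiteness": ["definite", "indefinite"],
    "proximity": ["near", "far", "very far"],
    "tense_aspect": [
        "complete", "incomplete", "real", "unreal", "iterative", "habitual",
        "present", "past", "recent past", "future", "recent future",
        "non-past", "non-future",
    ],
    "verb_class": [
        "agentive intransitive", "non-agentive intransitive",
        "transitive", "ditransitive",
    ],
}

total = prod(len(values) for values in DIMENSIONS.values())
print(total)  # 4 * 4 * 2 * 2 * 3 * 13 * 4 = 9984
```

Even this truncated inventory yields thousands of combinations for a single noun phrase in a single clause, which is why the corpus cannot simply enumerate everything.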
18
Solutions to Combinatorics
Generate paradigms of feature vectors, and then automatically generate sentences to match each feature vector.
Use known universals to eliminate features:
  E.g., languages without plurals don’t have duals.
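A minimal sketch of paradigm generation with a universal used as a filter, assuming a toy two-feature inventory. The function names, the reduced inventory, and the way the universal is encoded are all hypothetical; the step that generates a sentence for each feature vector is not shown.

```python
# Sketch: enumerate feature vectors, then drop those ruled out by a universal.
# The inventory and function names are invented for illustration.
from itertools import product

FEATURES = {
    "person": ["1", "2", "3"],
    "number": ["sg", "pl", "du"],
}

def generate_paradigm(features, allowed):
    """Yield every combination of feature values that passes `allowed`."""
    names = list(features)
    for combo in product(*(features[name] for name in names)):
        vector = dict(zip(names, combo))
        if allowed(vector):
            yield vector

def number_filter(language_has_plural):
    """Encode: a language without plurals has neither plurals nor duals."""
    def allowed(vector):
        if not language_has_plural and vector["number"] in ("pl", "du"):
            return False
        return True
    return allowed

with_plural = list(generate_paradigm(FEATURES, number_filter(True)))
without_plural = list(generate_paradigm(FEATURES, number_filter(False)))
print(len(with_plural), len(without_plural))
```

In this toy inventory the filter cuts the paradigm from nine vectors to three, which shows how a single universal can prune a whole slice of the combinatorial space.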
19
Navigation through the corpus
Initial diagnostics:
  Does the language mark number on nouns or in agreement with verbs?
Sentence selection:
  Based on initial diagnostics
  Based on principles and universals
    E.g., languages that don’t have plurals don’t have duals
  So that the informant sees very few sentences that are not relevant for his/her language
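Sentence selection driven by diagnostics could look roughly like the sketch below, in which each corpus entry lists the distinctions it is meant to probe and is shown only if the diagnostics found all of them to be marked. The corpus entries other than "The tree fell.", the `requires` tags, and the function name are hypothetical.

```python
# Sketch of diagnostic-driven navigation. Entries, tags, and names are
# invented for illustration; only "The tree fell." appears in the slides.

CORPUS = [
    {"text": "The tree fell.", "requires": set()},
    {"text": "The trees fell.", "requires": {"plural"}},
    {"text": "The two trees fell.", "requires": {"plural", "dual"}},
]

def select_sentences(corpus, marked_features):
    """Keep sentences whose required distinctions the language marks."""
    return [
        entry["text"]
        for entry in corpus
        if entry["requires"] <= marked_features
    ]

# Diagnostics found no number marking: only the number-neutral sentence stays.
print(select_sentences(CORPUS, set()))

# Diagnostics found plural but not dual marking.
print(select_sentences(CORPUS, {"plural"}))
```

Tagging the dual sentence with both "plural" and "dual" encodes the universal directly in the data, so a language without plurals never reaches the dual sentences.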
20
Other Challenges of Computer Based Elicitation
Inconsistency of human translation and alignment
Bias toward word order of the elicitation language
Need to provide discourse context for given and new information
How to elicit things that aren’t grammaticalized in the elicitation language:
  Evidential: I see that it is raining/Apparently it is raining/It must be raining.
  Context: You are inside the house. Your friend comes in wet.