Eliciting Features from Minor Languages

Alison Alvarez, Lori Levin, Robert Frederking, Erik Peterson
Language Technologies Institute, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA

Jeff Good
Max Planck Institute for Evolutionary Anthropology (MPI Leipzig)
Deutscher Platz, Leipzig

This research is part of the AVENUE Machine Translation Project. AVENUE is supported by the US National Science Foundation, NSF grant number IIS

Overview

In the field of machine translation, fully aligned and tagged translation corpora are considered one of the most valuable resources for automatically training translation systems. However, such resources are hard to find for minority languages. This obstacle can be overcome with techniques inspired by field linguistics, that is, by drawing on bilingual informants to translate and align given sentences.

Field linguists have relied on questionnaires that have remained relatively static over a number of years. We want the flexibility to change the questionnaire to reflect different semantic domains, different goals for machine translation systems, different levels of detail, etc. We also want the questionnaire to be available in multiple languages. For example, we would want a Spanish version of the questionnaire for use by Latin American minority language speakers. We also want flexibility in lexical selection in order to avoid cultural bias and to choose appropriate lexical items for the major language.

This paper looks at methods for specifying the scope and depth of an elicitation corpus as well as methods for quick design and implementation of elicitation corpora. The results can also be used as a test suite to explore existing machine translation systems or to design far-reaching corpora for studying low-resource languages.

The Elicitation Tool

The elicitation tool provides a simple interface that lets bilingual informants with no linguistic training and limited computer skills translate and word-align a corpus in some source language. Its output is a text file containing triplets of eliciting sentence, elicited sentence, and alignment. The tool can also produce bilingual glossaries based on the aligned corpus, and it has a simple "auto-align" option that adds alignments for unambiguous word pairs in the same file.

Feature Specification

    ((subj ((np-my-general-type pronoun-type common-noun-type)
            (np-my-person person-first person-second person-third)
            (np-my-number num-sg num-pl)
            (np-my-biological-gender bio-gender-male bio-gender-female)
            (np-my-function fn-predicatee)))
     {[(predicate ((np-my-general-type common-noun-type)
                   (np-my-definiteness definiteness-minus)
                   (np-my-person person-third)
                   (np-my-function predicate)))
       (c-my-copula-type role)]
      [(predicate ((adj-my-general-type quality-type)))
       (c-my-copula-type attributive)]
      [(predicate ((np-my-general-type common-noun-type)
                   (np-my-person person-third)
                   (np-my-definiteness definiteness-plus)
                   (np-my-function predicate)))
       (c-my-copula-type identity)]}
     (c-my-secondary-type secondary-copula)
     (c-my-polarity #all)
     (c-my-function fn-main-clause)
     (c-my-general-type declarative)
     (c-my-speech-act sp-act-state)
     (c-v-my-grammatical-aspect gram-aspect-neutral)
     (c-v-my-lexical-aspect state)
     (c-v-my-absolute-tense past present future)
     (c-v-my-phase-aspect durative))

Callouts on the specification:
    (c-my-polarity #all): “Use all values of polarity”
    Features listing several values, such as (np-my-number num-sg num-pl): “Multiply out by these lists of values”
    { [...] [...] [...] }: Disjoint set of copula types and their predicates

Feature Structure Design

A control language is used to define the size and scope of the set of feature structures that will be used by GenKit to generate the corpus.

    np-my-number: num-sg, num-pl, num-dual
    Notes for analysis of data: CS, page 38, seem to imply that some combinations of numbers are more expected than others.
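The callouts on the feature specification above ("Multiply out by these lists of values", "Use all values of polarity", and the disjoint set of copula types) describe what amounts to a Cartesian-product expansion. The following is a minimal sketch in Python under a simplified dictionary encoding. The function names `multiply_out` and `expand`, the dictionary layout, the reduced feature set, and the value `polarity-negative` are illustrative assumptions; they are not the control language's actual syntax or GenKit's API.

```python
from itertools import product

# Hypothetical, reduced encoding of the specification: each feature maps to
# the list of values to multiply out, or to "#all" for every known value.
SUBJECT = {
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
}
CLAUSE = {
    "c-my-polarity": "#all",
    "c-v-my-absolute-tense": ["past", "present", "future"],
}
# Disjoint alternatives: each copula type is tied to its own predicate type,
# so these are listed one by one instead of being multiplied out.
ALTERNATIVES = [
    {"c-my-copula-type": "role", "predicate": "indefinite common noun"},
    {"c-my-copula-type": "attributive", "predicate": "quality adjective"},
    {"c-my-copula-type": "identity", "predicate": "definite common noun"},
]
# Assumed inventory used to resolve "#all" (polarity-negative is an assumption).
ALL_VALUES = {"c-my-polarity": ["polarity-positive", "polarity-negative"]}


def multiply_out(features, all_values):
    """Return one dict per combination of the listed feature values."""
    names = list(features)
    value_lists = [
        all_values[name] if features[name] == "#all" else features[name]
        for name in names
    ]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


def expand(subject, clause, alternatives, all_values):
    """Combine subject variants, clause variants, and disjoint alternatives."""
    structures = []
    for subj in multiply_out(subject, all_values):
        for clause_features in multiply_out(clause, all_values):
            for alternative in alternatives:
                structures.append({"subj": subj, **clause_features, **alternative})
    return structures


structures = expand(SUBJECT, CLAUSE, ALTERNATIVES, ALL_VALUES)
print(len(structures))  # 6 subject variants x 6 clause variants x 3 alternatives
```

Even this reduced specification yields 108 feature structures, which illustrates why the size and scope of the set need to be controlled explicitly.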
Feature Structures

Feature structures are multi-level sets of feature-value pairs that are used to reflect the grammatical structures intended for elicitation.

    ((subj ((np-my-general-type pronoun-type)
            (np-my-person person-third)
            (np-my-number num-sg)
            (np-my-biological-gender bio-gender-male)
            (np-my-function fn-predicatee)
            (np-my-animacy anim-human)
            (np-my-info-function info-neutral)
            (np-d-my-distance-from-speaker distance-neutral)
            (np-pronoun-reflexivity reflexivity-n/a)
            (np-my-emphasis emph-no-emph)
            (np-my-semantic-class NEED_VALUES)
            (np-pronoun-exclusivity exclusivity-n/a)
            (np-pronoun-antecedent-function antecedent-n/a)))
     (predicate ((np-my-general-type common-noun-type)
                 (np-my-person person-third)
                 (np-my-function predicate)
                 (np-my-animacy anim-human)
                 (np-my-info-function info-neutral)
                 (np-d-my-distance-from-speaker distance-neutral)
                 (np-pronoun-reflexivity reflexivity-n/a)
                 (np-my-emphasis emph-no-emph)
                 (np-my-number num-sg)
                 (np-my-semantic-class NEED_VALUES)
                 (np-pronoun-exclusivity exclusivity-n/a)
                 (np-pronoun-antecedent-function antecedent-n/a)))
     (c-my-copula-type role)
     (c-my-secondary-type secondary-copula)
     (c-my-polarity polarity-positive)
     (c-my-function fn-main-clause)
     (c-my-general-type declarative)
     (c-my-speech-act sp-act-state)
     (c-v-my-grammatical-aspect gram-aspect-neutral)
     (c-v-my-lexical-aspect state)
     (c-v-my-absolute-tense past)
     (c-v-my-phase-aspect durative)
     (c-my-imperative-degree imp-degree-n/a)
     (c-my-ynq-type ynq-n/a)
     (c-my-actor's-sem-role actor-sem-role-neutral)
     (c-my-minor-type minor-n/a)
     (c-my-headedness-rc rc-head-n/a)
     (c-my-answer-type ans-n/a)
     (c-my-restrictivess-rc rc-restrictive-n/a)
     (c-my-focus-rc focus-n/a)
     (c-my-actor's-status actor-neutral)
     (c-my-gaps-function gap-n/a)
     (c-my-relative-tense relative-n/a))

When paired with an English grammar and lexicon, the above feature structure will generate ‘He was a teacher.’

Our Goals

1. Tools for semi-automated corpus design:
   Test suite for MT
   Structured corpus for input to machine learning
2. A user interface for producing high-quality, word-aligned parallel corpora (Elicitation Tool)
3. Automated learning of morpho-syntax for low-resource languages

Feature Detection

[Figure: feature detection pipeline. Panel titles: Sentence Selection, Translation/Alignment, Mapping, Minimal Pair Linking, Difference Detection.]

Example feature structure repeated in the figure, shown whole, by constituent, and as individual features:

    ((Subj ((person first) (num sg) (animacy human) (head-token-1 I)))
     (Obj ((person third) (animacy human) (identifiability -) …))
     (tense past) …)

    (Subj ((person first) (num sg) (animacy human) …))
    (Obj ((person third) (animacy human) (identifiability -) …))
    (tense past)

    (num sg) (animacy human) (person first)
    (person third) (animacy human) (identifiability -)

Sentence Selection, Translation/Alignment, Mapping

    I was a teacher
    watashi wa sensei deshita

Minimal Pair Linking

The two linked feature structures are identical except for tense:

    (tense past)       “I was a teacher”    Watashi wa sensei deshita
    (tense present)    “I am a teacher”     Watashi wa sensei desu

Difference Detection

    watashi   wa   sensei   deshita
       =      =      =         ≠
    watashi   wa   sensei   desu

Substitution mismatch
Difference is found on ME
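The minimal pair linking and difference detection panels can be illustrated with a short Python sketch. It compares the two linked feature structures to find the feature that differs, then compares the two elicited sentences token by token to find where the surface forms differ. The nested-dictionary encoding, the function names, and the position-by-position token comparison are assumptions made for illustration; the poster does not specify how the comparison is implemented, and a real system would presumably use the informant's word alignments instead of raw token positions.

```python
from itertools import zip_longest


def flatten(fs, prefix=()):
    """Flatten a nested feature structure into {path tuple: value}."""
    flat = {}
    for key, value in fs.items():
        if isinstance(value, dict):
            flat.update(flatten(value, prefix + (key,)))
        else:
            flat[prefix + (key,)] = value
    return flat


def feature_differences(fs_a, fs_b):
    """Return {path: (value_a, value_b)} for every feature that differs."""
    a, b = flatten(fs_a), flatten(fs_b)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(set(a) | set(b))
        if a.get(path) != b.get(path)
    }


def token_differences(sentence_a, sentence_b):
    """Compare two sentences position by position (assumed heuristic)."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    return [
        (index, tok_a, tok_b)
        for index, (tok_a, tok_b) in enumerate(zip_longest(tokens_a, tokens_b))
        if tok_a != tok_b
    ]


# Simplified versions of the two feature structures from the minimal pair.
PAST = {
    "subj": {"person": "first", "num": "sg", "animacy": "human"},
    "obj": {"person": "third", "animacy": "human", "identifiability": "-"},
    "tense": "past",
}
PRESENT = {**PAST, "tense": "present"}

feature_diff = feature_differences(PAST, PRESENT)
token_diff = token_differences("watashi wa sensei deshita", "watashi wa sensei desu")

print(feature_diff)  # {('tense',): ('past', 'present')}
print(token_diff)    # [(3, 'deshita', 'desu')]
```

In this sketch the only differing feature is tense and the only differing token is the final one, so the pair suggests that the deshita/desu contrast expresses the past/present distinction.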