Data Elicitation for AVENUE By: Alison Alvarez Lori Levin Bob Frederking Jeff Good (MPI Leipzig) Erik Peterson.



Avenue System Diagram
Stages: Elicitation, Morphology, Rule Learning, Run-Time System, Rule Refinement
Components: Elicitation Tool, Elicitation Corpus, Word-Aligned Parallel Corpus, Morphological analyzer, Learning Module, Handcrafted rules, Transfer Rules, Lexical Resources, Run Time Transfer System, Lattice, Translation Correction Tool, Rule Refinement Module

Goals for Corpus Creation and Elicitation
- Parallel corpus with high-quality word alignment
- For a language with few or no digitized language resources
- Use a bilingual informant with no linguistic expertise

Outline
- Elicitation
- Feature Detection
- The Functional-Typological Corpus
- Corpus Creation and Elicitation
- Corpus Navigation

The Elicitation Tool

Input to the Elicitation Tool

Eliciting from Spanish
# 1,2,3 {Sg,pl} person pronouns
newpair
srcsent: Canto
context:
comment:
newpair
srcsent: Canté
context:
comment:
newpair
srcsent: Estoy cantando
context:
comment:
newpair
srcsent: Cantaste
context:
comment:

Eliciting from English
# 1,2,3 {Sg,pl} person pronouns
newpair
srcsent: I sing
context:
comment:
newpair
srcsent: I sang
context:
comment:
newpair
srcsent: I am singing
context:
comment:
newpair
srcsent: You sang
context:
comment:

Output of the elicitation process

newpair
srcsent: Tú caíste
tgtsent: eymi ütrünagimi
aligned: ((1,1),(2,2))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) fell

newpair
srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling

newpair
srcsent: Tú caíste
tgtsent: eymi, ütrunagimi
aligned: ((1,1),(2,2))
context: tú = María [femenino, 2a persona del singular]
comment: You (Mary) fell
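The record layout above can be read back with a few lines of Python. This is only a sketch: the function names parse_records and parse_alignment are hypothetical, and it assumes that field labels such as "srcsent:" never occur inside a field value.

```python
import re

# Field labels used in the elicitation records shown above.
FIELDS = ("srcsent", "tgtsent", "aligned", "context", "comment")


def parse_records(text):
    """Split elicitation output into one dict per 'newpair' record."""
    label_pattern = r"(%s):" % "|".join(FIELDS)
    records = []
    for chunk in text.split("newpair")[1:]:
        # re.split with a capturing group yields
        # [prefix, label1, value1, label2, value2, ...]
        parts = re.split(label_pattern, chunk)
        record = {}
        for label, value in zip(parts[1::2], parts[2::2]):
            record[label] = value.strip()
        records.append(record)
    return records


def parse_alignment(aligned):
    """Turn '((1,1),(2 3,2 3))' into [([1], [1]), ([2, 3], [2, 3])]."""
    pairs = []
    for src, tgt in re.findall(r"\(([\d ]+),([\d ]+)\)", aligned):
        pairs.append(([int(i) for i in src.split()],
                      [int(i) for i in tgt.split()]))
    return pairs


sample = """newpair
srcsent: Tú estás cayendo
tgtsent: eymi petu ütünagimi
aligned: ((1,1),(2 3,2 3))
context: tú = Juan [masculino, 2a persona del singular]
comment: You (John) are falling
"""

for record in parse_records(sample):
    print(record["srcsent"], "->", record["tgtsent"],
          parse_alignment(record["aligned"]))
```

Each alignment pair keeps a list of source word positions and a list of target word positions, so many-to-many alignments such as (2 3,2 3) survive intact.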

Elicitation Corpus
"Elicitation Corpus" refers to the list of sentences in the major language.
- Not yet translated or aligned
Field workers call it a questionnaire.

Feature Detection
Identify meaning components that have morpho-syntactic consequences in the language that is being elicited.
- The gender of the subject is marked on the verb in Hebrew.
- The gender of the subject has no morpho-syntactic realization in Mapudungun.

Feature detection feeds into
Corpus Navigation: which minimal pairs to pursue next.
- Don’t pursue gender in Mapudungun
- Do pursue definiteness in Hebrew
Morphology Learning:
- Morphological rule learner identifies the forms of the morphemes
- Feature detection identifies the functions
Rule learning:
- Rule learner will have to learn constraints corresponding to fact records, e.g., adjectives and nouns agree in gender, number, and definiteness in Hebrew.

Other uses of Feature Detection
A human-readable reference grammar can be generated from fact records.
- A human analyst knows Northern Ostyak and has to translate a document in Eastern Ostyak. The only reference grammar of Eastern Ostyak is written in Hungarian, which the analyst does not speak. An Eastern Ostyak consultant who speaks Russian translates the Elicitation Corpus from Russian to Eastern Ostyak. The analyst learns about Eastern Ostyak from the automatically generated fact records.

Note on the Ostyak example: I’m not really sure whether the only grammar of Eastern Ostyak is written in Hungarian. There is one reference grammar of Northern Ostyak written in English. All other Ostyak materials are in Hungarian, Russian, and German. The Ostyaks are subsistence hunters, and Eastern Ostyak is nearly extinct, so there is no real need for government translators. Other Siberian and Central Asian languages with similar scarcity of resources may be important.

Other uses of Feature Detection
Help a field worker
- Instead of “Elicit by day; analyze by night” (in order to know what to elicit the next day), go to sleep and look at the fact records in the morning.
- We have been working with people at EMELD and MPI Leipzig.

Feature Detection: Spanish

The girl saw a red book.
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
La niña vió un libro rojo

A girl saw a red book
((1,1)(2,2)(3,3)(4,4)(5,6)(6,5))
Una niña vió un libro rojo

I saw the red book
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi el libro rojo

I saw a red book.
((1,1)(2,2)(3,3)(4,5)(5,4))
Yo vi un libro rojo

Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: no
Marked-on-dependent: yes
Marked-on-governor: no
Marked-on-other: no
Add/delete-word: no
Change-in-alignment: no
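A minimal sketch of how one minimal pair could feed a fact record like the one above, assuming plain whitespace tokenization. The function name compare_minimal_pair and the dictionary keys are hypothetical, not part of the AVENUE system, and real feature detection would also consult the word alignments.

```python
def compare_minimal_pair(tgt_a, tgt_b):
    """Report surface differences between two target sentences whose
    source sentences differ in exactly one feature value."""
    tokens_a, tokens_b = tgt_a.split(), tgt_b.split()
    same_length = len(tokens_a) == len(tokens_b)
    changed = []
    if same_length:
        # 1-based positions of the words that differ, to match the
        # numbering used in the alignments.
        changed = [i + 1
                   for i, (a, b) in enumerate(zip(tokens_a, tokens_b))
                   if a != b]
    return {
        "feature-is-marked": tokens_a != tokens_b,
        "add/delete-word": not same_length,
        "changed-positions": changed,
    }


# Definite vs. indefinite subject in the Spanish examples.
print(compare_minimal_pair("La niña vió un libro rojo",
                           "Una niña vió un libro rojo"))
```

For this pair only word 1 (the determiner) changes and no word is added or deleted, which is consistent with the "Add/delete-word: no" entry in the Spanish record.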

Feature Detection: Chinese

A girl saw a red book.
((1,2)(2,2)(3,3)(3,4)(4,5)(5,6)(5,7)(6,8))
有 一个 女人 看见 了 一本 红色 的 书 。

The girl saw a red book.
((1,1)(2,1)(3,3)(3,4)(4,5)(5,6)(6,7))
女人 看见 了 一本 红色的 书

Feature: definiteness
Values: definite, indefinite
Function-of-*: subject
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: no

Feature Detection: Chinese

I saw the red book
((1,3)(2,4)(2,5)(4,1)(5,2))
红色的 书, 我 看见 了

I saw a red book.
((1,1)(2,2)(2,3)(2,4)(4,5)(5,6))
我 看见 了 一本 红色的 书 。

Feature: definiteness
Values: definite, indefinite
Function-of-*: object
Marked-on-head-of-*: no
Marked-on-dependent: no
Marked-on-governor: no
Add/delete-word: yes
Change-in-alignment: yes

Feature Detection: Hebrew

A girl saw a red book.
((2,1)(3,2)(5,4)(6,3))
ילדה ראתה ספר אדום

The girl saw a red book
((1,1)(2,1)(3,2)(5,4)(6,3))
הילדה ראתה ספר אדום

I saw a red book.
((2,1)(4,3)(5,2))
ראיתי ספר אדום

I saw the red book.
((2,1)(3,3)(3,4)(4,4)(5,3))
ראיתי את הספר האדום

Feature: definiteness
Values: definite, indefinite
Function-of-*: subj, obj
Marked-on-head-of-*: yes
Marked-on-dependent: yes
Marked-on-governor: no
Add-word: no
Change-in-alignment: no

AVENUE Elicitation Corpora
The Functional-Typological Corpus
- Based on microtheories of meanings that may have morpho-syntactic realization
The Structural Elicitation Corpus
- Based on sentence structures from the Penn TreeBank

The Functional Typological Corpus
Feature entry: c-my-polarity
Values: polarity-positive, polarity-negative
Note: Stick to the two obvious values of polarity for now.

Formatted for reading:
Feature Name: c-my-polarity
Values: positive, negative
Note: Stick to the two obvious values of polarity for now.

Functional Typological Corpus
- In XML
- XSLT scripts can format it into human-readable text or into data structures.
- Currently contains around 50 features and a few hundred values.
- Still under development.

Functional Typological Corpus: Representation of “Who is at the meeting”

((subj ((np-my-general-type pronoun-type)
        (np-my-person person-unk)
        (np-my-number num-sg)
        (np-my-animacy anim-human)
        (np-my-function fn-predicatee)
        (np-d-my-distance-from-speaker distance-neutral)
        (np-my-emphasis emph-no-emph)
        (np-my-info-function info-neutral)
        (np-pronoun-exclusivity exclusivity-n/a)
        (np-pronoun-antecedent-function antecedent-n/a)
        (np-pronoun-reflexivity reflexivity-n/a)))
 (predicate ((loc-roles loc-general-at)))

Continued on next slide

Continued: “Who is at the meeting”

 (c-my-copula-type locative)
 (c-my-secondary-type secondary-copula)
 (c-my-polarity polarity-positive)
 (c-my-function fn-main-clause)
 (c-my-general-type open-question)
 (gap-function gap-copula-subject)
 (c-my-sp-act sp-act-request-information)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-absolute-tense present)
 (c-v-my-phase-aspect durative)
 (c-my-headedness-rc rc-head-n/a)
 (c-my-minor-type minor-n/a)
 (c-my-restrictivess-rc rc-restrictive-n/a)
 (c-my-answer-type ans-n/a)
 (c-my-imperative-degree imp-degree-n/a)
 (c-my-actor's-status actor-neutral)
 (c-my-focus-rc focus-n/a)
 (c-my-gaps-function gap-n/a)
 (c-my-relative-tense relative-n/a)
 (c-my-ynq-type ynq-n/a)
 (c-my-actor's-sem-role actor-sem-role-neutral)
 (c-v-my-lexical-aspect state))
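For readers who want to inspect such a feature structure programmatically, the parenthesized notation can be loaded into nested dictionaries. This is a sketch only: the helper names tokenize, parse and to_dict are hypothetical, and it handles single-valued (name value) pairs as in this example, not the multi-valued pairs used in a Multiply.

```python
def tokenize(text):
    """Split a parenthesized feature structure into tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one parenthesized expression into nested lists."""
    token = tokens.pop(0)
    if token != "(":
        return token
    items = []
    while tokens[0] != ")":
        items.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return items


def to_dict(pairs):
    """Convert [[name, value], ...] into a dict, recursing into
    nested feature structures such as subj and predicate."""
    result = {}
    for name, value in pairs:
        result[name] = to_dict(value) if isinstance(value, list) else value
    return result


# A shortened version of the structure for "Who is at the meeting".
text = """((subj ((np-my-person person-unk) (np-my-number num-sg)))
           (predicate ((loc-roles loc-general-at)))
           (c-my-polarity polarity-positive))"""

fs = to_dict(parse(tokenize(text)))
print(fs["subj"]["np-my-number"])   # num-sg
print(fs["c-my-polarity"])          # polarity-positive
```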

Why is the corpus represented as a set of feature structures?
Multiple elicitation languages
- Generate the English and Spanish elicitation corpora from the same internal representation
- Easy to add a new elicitation language: write a GenKit grammar to generate sentences from the same internal representation

Why is the corpus represented as a set of feature structures?
Feature structure represents things that are not expressed in the major language
- These things show up as comments in the elicitation corpus: “I am singing” (comment: female)
- May eventually use pictures and discourse context
- We actually want to elicit the meaning associated with the feature structure. English and Spanish are just vehicles for getting at the meaning.

Corpus Creation Tools The elicitation corpus can be changed and new corpora can be created.

Motivation for Corpus Creation Tools
Make new corpora easily
- Add a new tense (e.g., remote past) and automatically get all the combinations with other features
- Make a specialized corpus for a limited semantic domain or a specific language family

Motivation for Corpus Creation Tools
Combinatorics
- For example, all combinations of person, number, gender, tense, etc.
- Too much bookkeeping for a human corpus creator, and too time-consuming

Where do the feature structures come from?
- A linguist formulates a Multiply
- The Multiply specifies a set of feature structures

A Multiply

((subj ((np-my-general-type pronoun-type common-noun-type)
        (np-my-person person-first person-second person-third)
        (np-my-number num-sg num-pl)
        (np-my-biological-gender bio-gender-male bio-gender-female)
        (np-my-function fn-predicatee)))
 {[(predicate ((np-my-general-type common-noun-type)
               (np-my-definiteness definiteness-minus)
               (np-my-person person-third)
               (np-my-function predicate)))
   (c-my-copula-type role)]
  [(predicate ((adj-my-general-type quality-type)))
   (c-my-copula-type attributive)]
  [(predicate ((np-my-general-type common-noun-type)
               (np-my-person person-third)
               (np-my-definiteness definiteness-plus)
               (np-my-function predicate)))
   (c-my-copula-type identity)]}
 (c-my-secondary-type secondary-copula)
 (c-my-polarity #all)
 (c-my-function fn-main-clause)
 (c-my-general-type declarative)
 (c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense past present future)
 (c-v-my-phase-aspect durative))

This multiply expands to 288 feature structures.
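The expansion can be sketched as a Cartesian product over the multi-valued features, followed by a filter for exclusions. The plain product gives 432 combinations; the count of 288 is reached here under one assumed exclusion (common nouns occur only in the third person). That exclusion is a guess that happens to reproduce the number on the slide, not a documented AVENUE rule, and the names expand and allowed are hypothetical. Single-valued features are left out because they do not affect the count.

```python
from itertools import product

# Multi-valued features of the subject NP in the Multiply above.
subject = {
    "np-my-general-type": ["pronoun-type", "common-noun-type"],
    "np-my-person": ["person-first", "person-second", "person-third"],
    "np-my-number": ["num-sg", "num-pl"],
    "np-my-biological-gender": ["bio-gender-male", "bio-gender-female"],
}

# Multi-valued clause-level features.
clause = {
    # One value per bracketed [...] alternative inside {...}.
    "c-my-copula-type": ["role", "attributive", "identity"],
    # "#all" stands for every value of the feature.
    "c-my-polarity": ["polarity-positive", "polarity-negative"],
    "c-v-my-absolute-tense": ["past", "present", "future"],
}


def expand(multiply):
    """Yield one feature structure (a dict) per combination of values."""
    names = list(multiply)
    for values in product(*(multiply[name] for name in names)):
        yield dict(zip(names, values))


def allowed(fs):
    """ASSUMED exclusion: common nouns are third person only."""
    return not (fs["np-my-general-type"] == "common-noun-type"
                and fs["np-my-person"] != "person-third")


all_combinations = list(expand({**subject, **clause}))
feature_structures = [fs for fs in all_combinations if allowed(fs)]
print(len(all_combinations), len(feature_structures))  # 432 288
```

Adding a new value to any list (for example a remote past tense) multiplies the output automatically, which is the bookkeeping the corpus creation tools are meant to take over from the linguist.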

There is a GUI for making Multiplies
Demo available on request

GenKit Grammar
Use GenKit for generation

;;declarative
( ==> ( )
 (((x0 c-my-general-type) =c declarative)
  ((x2 verb-form) = fin)
  ((x3 c-my-copula-type) = (x0 c-my-copula-type))
  ((x4 d-speaker-gender) = (x0 d-speaker-gender))
  ((x4 d-hearer-gender) = (x0 d-hearer-gender))
  ((x4 d-my-formality) = (x0 d-my-formality))
  ((x3 np-my-number) = (x0 np-my-number))
  ((x3 np-my-animacy) = (x0 np-my-animacy))
  ((x3 np-my-biological-gender) = (x0 np-my-biological-gender))
  (x3 = (x0 predicate))
  (x1 = (x0 subj))
  (x2 = x0)))

GenKit Lexicon

;;Pronouns
(word ((cat n) (root you) (pred pro) (np-my-person person-second) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root I) (pred pro) (np-my-person person-first) (np-my-number num-sg) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root we) (pred pro) (np-my-person person-first) (np-my-number num-pl) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root we) (pred pro) (np-my-person person-first) (np-my-number num-dual) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))
(word ((cat n) (root she) (pred pro) (np-my-person person-third) (np-my-number num-sg) (np-my-biological-gender bio-gender-female) (np-my-animacy anim-human) (np-my-general-type pronoun-type)))

Comments are also generated
I & one female & sang
Use comments for things that are not expressed in English.

Convert to Elicitation Format (input to Elicitation Tool)

original: WHO & IS AT THE BOX &
full comment:
Sentence: WHO IS AT THE BOX

original: I &ONE-WOMAN & AM PN_FEMALE &ONE-WOMAN & &
full comment: NP1: ONE-WOMAN
Sentence: I AM PN_FEMALE

original: WILL I &ONE-WOMAN & BE THE TEACHER &
full comment: NP1: ONE-WOMAN
Sentence: WILL I BE THE TEACHER

Eight Basic Steps for Corpus Creation
1. Write FVD and format into data structure
2. Gather Exclusions (restrictions on co-occurrence of features)
3. Design the Multiply
4. Get a full set of Feature Structures
5. Design Grammar and Comments
6. Design Lexicon
7. Generate Sentences from Feature Structures
8. Convert to Elicitation Format

Can make other types of corpora The Elicitation Corpus does not have to be functional-typological

Alternative Corpora: The Medical Corpus

Feature: Body-Parts
Values:
  part-hand    Restrictions:
  part-finger  Restrictions:
  part-tooth   Restrictions: symptom_redness symptom_scratch symptom_numbness symptom_cut symptom_lump symptom_rash symptom_puncture symptom_bruise symptom_frozen
  part-eye     Restrictions: symptom_rash
  part-arm     Restrictions: …

((subj ((body-parts #all)
        (Poss ((np-my-general-type pronoun-type)
               (np-my-person #all)
               (np-my-number num-sg num-pl)
               (np-my-animacy anim-human)
               (np-my-use possessive)))
 (Pred ((symptoms #all))
 (c-my-general-type declarative)
 (c-my-speech-act sp-act-state)
 (c-v-my-grammatical-aspect gram-aspect-neutral)
 (c-v-my-lexical-aspect state)
 (c-v-my-absolute-tense present));

The Result:
YOUR ARM IS RED
YOUR ARM IS SCRATCHED
YOUR ARM IS NUMB
YOUR ARM IS NIL
YOUR ARM HAS A/N INFECTION…

Corpus Navigation
While the Elicitation Corpus for any one target language (TL) can be kept to a reasonable size, the universal Elicitation Corpus must check for all phenomena that might occur in any language. Since the universal corpus cannot be kept to a reasonable size, Corpus Navigation is necessary. Facts discovered about a particular TL early in the process constrain what needs to be looked for later in the process for that TL. Thus this is a dynamic process, different for each TL.

Corpus Navigation: search
- Corpus Navigation is a search process with the informant in the inner loop: the informant expands the search states he/she is given as source-language (SL) sentences by supplying the corresponding TL sentences and alignments.
- Analogously to game search, there is an "opening book" of moves (SL sentences to check for all languages), used until enough information has been gathered to make intelligent search choices.
- The heuristic function driving the search process is Relative Info Gain: RIG(Y|X) = [H(Y) - H(Y|X)] / H(Y). The system reduces the remaining entropy in its knowledge of the language as much as possible.
- There should also be a cost factor, estimating the human effort required to expand the node.
- To make the process efficient enough, we will create "decision graphs", similar to RETE networks, that cache information so only the information that changes needs to be recomputed.
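The Relative Info Gain formula can be worked through on a small table of observations. This sketch estimates the entropies from raw counts of (x, y) pairs. How AVENUE actually defines X and Y, and how it estimates the probabilities, is not stated on the slides, so the function names and the data layout here are assumptions.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total)
                for c in counts.values() if c)


def relative_info_gain(pairs):
    """RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y) for a list of (x, y) pairs."""
    h_y = entropy(Counter(y for _, y in pairs))
    if h_y == 0:
        return 0.0  # nothing left to learn about Y
    x_counts = Counter(x for x, _ in pairs)
    h_y_given_x = 0.0
    for x, n in x_counts.items():
        # Entropy of Y within the observations that share this x,
        # weighted by how often that x occurs.
        conditional = Counter(y for xx, y in pairs if xx == x)
        h_y_given_x += (n / len(pairs)) * entropy(conditional)
    return (h_y - h_y_given_x) / h_y


# X fully determines Y: all uncertainty about Y is removed.
determined = [("a", 0), ("a", 0), ("b", 1), ("b", 1)]
# X tells us nothing about Y.
independent = [("a", 0), ("a", 1), ("b", 0), ("b", 1)]

print(relative_info_gain(determined))
print(relative_info_gain(independent))
```

A value near 1 means that asking about X removes almost all remaining uncertainty about Y, and a value near 0 means the question is not worth the informant's time. The cost factor mentioned above could then be applied by dividing this score by an estimate of the effort required.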