Eliciting a corpus of word-aligned phrases for MT

Eliciting a corpus of word-aligned phrases for MT
Lori Levin, Alon Lavie, Jaime Carbonell, Erik Peterson, Alison Alvarez
Language Technologies Institute, Carnegie Mellon University

Introduction
Problem: Building machine translation systems for languages with scarce resources
  Not enough data for statistical MT and example-based MT
  Not enough human linguistic expertise for writing rules
Approach:
  Elicit high-quality, word-aligned data from bilingual speakers
  Learn transfer rules from the elicited data

Modules of the AVENUE/MilliRADD rule learning system and MT system
Learning Module
Transfer Rules
  {PP,4894}
  ;;Score:0.0470
  PP::PP [NP POSTP] -> [PREP NP]
  ((X2::Y1) (X1::Y2))
Translation Lexicon
Run Time Transfer System
Lattice Decoder
English Language Model
Word-to-Word Translation Probabilities
Word-aligned elicited data
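The transfer rule on this slide maps a source-side PP of the form [NP POSTP] to a target-side PP of the form [PREP NP], with the alignments X2::Y1 and X1::Y2 saying which source constituent fills which target slot. A minimal sketch of that reordering step follows. The `TransferRule` class, the `apply_rule` function, and the example strings are illustrative assumptions, not the AVENUE implementation, and lexical translation of the constituents is assumed to have happened already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    """Hypothetical container for one rule in the slide's notation."""
    source_type: str    # left of "::", e.g. "PP"
    target_type: str    # right of "::", e.g. "PP"
    source_rhs: tuple   # e.g. ("NP", "POSTP")
    target_rhs: tuple   # e.g. ("PREP", "NP")
    alignments: tuple   # 1-based (x, y) pairs, e.g. (2, 1) for X2::Y1


def apply_rule(rule, translated_children):
    """Place already-translated source constituents into target order.

    translated_children[i] is the translation of source constituent X(i+1).
    """
    if len(translated_children) != len(rule.source_rhs):
        raise ValueError("constituent count does not match the rule's source side")
    target = [None] * len(rule.target_rhs)
    for x, y in rule.alignments:
        target[y - 1] = translated_children[x - 1]
    return target


# The PP rule from the slide: PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1) (X1::Y2))
PP_RULE = TransferRule("PP", "PP", ("NP", "POSTP"), ("PREP", "NP"), ((2, 1), (1, 2)))

# Hypothetical input: an NP translated as "the house" followed by a
# postposition translated as "in".
print(apply_rule(PP_RULE, ["the house", "in"]))
```

The score attached to the rule (0.0470 on the slide) is not used here; in the architecture shown, choosing among competing rule applications is left to the decoder.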

Outline
Demo of elicitation interface
Description of elicitation corpus

Demo of Elicitation Tool
Speaker needs to be bilingual and literate; no other knowledge is necessary
Mappings between words and phrases
  Many-to-many, one-to-none, many-to-none, etc.
  Create phrasal mappings
Fonts and character sets
  Including Hindi, Chinese, and Arabic
Add morpheme boundaries to target language
Add alternate translations
Notes and context
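One way to picture the word and phrase mappings the tool records is as a set of index links between source and target tokens. The sketch below is an assumed data structure, not the tool's actual storage format; the class name `AlignedPair` and its methods are hypothetical, and the alignment chosen for the example sentence is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedPair:
    """Hypothetical record for one elicited, word-aligned sentence pair."""
    source: list                             # source-language tokens
    target: list                             # target-language tokens
    links: set = field(default_factory=set)  # 0-based (source_idx, target_idx)

    def add_link(self, s, t):
        if not (0 <= s < len(self.source) and 0 <= t < len(self.target)):
            raise IndexError("link points outside the sentence")
        self.links.add((s, t))

    def targets_of(self, s):
        """Target positions linked to source position s (may be empty)."""
        return sorted(t for s2, t in self.links if s2 == s)

    def unaligned_source(self):
        """Source positions with no link: the one-to-none case."""
        linked = {s for s, _ in self.links}
        return [i for i in range(len(self.source)) if i not in linked]


# "I am falling." / "Estoy cayendo" (sentence pair from the corpus examples);
# the links below are an assumed alignment.
pair = AlignedPair(["I", "am", "falling"], ["Estoy", "cayendo"])
pair.add_link(0, 0)   # I       -> Estoy (many-to-one together with "am")
pair.add_link(1, 0)   # am      -> Estoy
pair.add_link(2, 1)   # falling -> cayendo
print(pair.targets_of(0), pair.unaligned_source())
```

Storing links as a set of index pairs handles many-to-many mappings without special cases: a phrasal mapping is just every source index in the phrase linked to every target index in the phrase.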

English-Chinese Example

English-Hindi Example

Spanish-Mapudungun Example

English-Arabic Example

Testing of Elicitation Tool
DARPA Hindi Surprise Language Exercise
  Around 10 Hindi speakers
  Around 17,000 phrases translated and aligned
Elicitation corpus
  NPs and PPs from Treebanked Brown Corpus

Elicitation Corpus: Basic Principles
Minimal pairs
Syntactic compositionality
Special semantic/pragmatic constructions
Navigation based on language typology and universals
Challenges

Elicitation Corpus: Minimal Pairs
Eng: I fell.  Sp: Caí  M: Tranün
Eng: You (John) fell.  Sp: Tú (Juan) caíste  M: Eymi tranimi (Kuan)
Eng: You (Mary) fell.  Sp: Tú (María) caíste  M: Eymi tranimi (Maria)
Eng: I am falling.  Sp: Estoy cayendo  M: Tranmeken
Eng: You (John) are falling.  Sp: Tú (Juan) estás cayendo  M: Eimi (Kuan) tranmekeymi
Mapudungun: Spoken by around one million people in Chile and Argentina.

Using feature vectors to detect minimal pairs
np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien
cl1:(subj np1).intr-ag.past.complete
  Eng: You (John) fell.  Sp: Tú (Juan) caíste  M: Eymi tranimi (Kuan)
np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien
  Eng: You (Mary) fell.  Sp: Tú (María) caíste  M: Eymi tranimi (Maria)
Inventory of features is based on fieldwork checklists: Comrie and Smith; Bouquiaux and Thomas.
Feature vectors can be extracted from the output of a parser for English or Spanish (except for features that English and Spanish do not have).
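Two sentences form a minimal pair when their feature vectors differ in exactly one feature, as with masc versus fem in the vectors above. The following sketch shows that comparison. The dictionary keys (`np1.gender`, `cl1.tense`, and so on) are assumed names for the dot-separated slots on the slide, and the third vector is made up to show a non-minimal case.

```python
def differing_features(a, b):
    """Return the set of feature names whose values differ between two vectors."""
    return {name for name in set(a) | set(b) if a.get(name) != b.get(name)}


def is_minimal_pair(a, b):
    """True when the two vectors differ in exactly one feature."""
    return len(differing_features(a, b)) == 1


# Feature vector for "You (John) fell." with assumed slot names.
YOU_JOHN_FELL = {
    "np1.type": "pro-pers",
    "np1.animacy": "hum",
    "np1.person": "2",
    "np1.number": "sg",
    "np1.gender": "masc",
    "np1.clusivity": "no-clusn",
    "np1.definiteness": "no-def",
    "np1.alienability": "no-alien",
    "cl1.verb_class": "intr-ag",
    "cl1.tense": "past",
    "cl1.aspect": "complete",
}

# "You (Mary) fell." differs only in gender.
YOU_MARY_FELL = {**YOU_JOHN_FELL, "np1.gender": "fem"}

# Hypothetical vector that differs in two features, so it is not a minimal pair.
TWO_CHANGES = {**YOU_JOHN_FELL, "np1.gender": "fem", "cl1.tense": "present"}

print(differing_features(YOU_JOHN_FELL, YOU_MARY_FELL))
print(is_minimal_pair(YOU_JOHN_FELL, TWO_CHANGES))
```

If the elicited translations of a minimal pair differ, as they do not for the Mapudungun pair above, the difference can be attributed to the single feature that changed.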

Syntactic Compositionality
The tree
The tree fell.
I think that the tree fell.
We learn rules for smaller phrases
  E.g., NP
Their root nodes become non-terminals in the rules for larger phrases
  E.g., S containing an NP
Meaning of a phrase is predictable from the meanings of the parts.

Special Semantic and Pragmatic Constructions
Meaning may not be compositional
  Not predictable from the meanings of the parts
May not follow normal rules of grammar
  Suggestion: Why not go?
Word-for-word translation may not work
Tend to be sources of MT mismatches
Comparative:
  English:  Hotel A is [closer than Hotel B]
  Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]
  Gloss:    Hotel A TOP Hotel B than close is
“Closer than Hotel B” is a constituent in English, but “Hoteru B yori tikai” is not a constituent in Japanese.

Examples of Semantic/Pragmatic Categories
Speech acts: requests, suggestions, etc.
Comparatives and equatives
Modality: possibility, probability, ability, obligation, uncertainty, evidentiality
Correlatives (the more the merrier)
Causatives
Etc.

A Challenge: Combinatorics
Person (1, 2, 3, 4)
Number (sg, pl, du, paucal)
Gender/Noun Class (?)
Animacy (animate/inanimate)
Definiteness (definite/indefinite)
Proximity (near, far, very far, etc.)
Inclusion/exclusion
Multiply with tenses and aspects: complete, incomplete, real, unreal, iterative, habitual, present, past, recent past, future, recent future, non-past, non-future, etc.
Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.
(Case marking and agreement may vary with verb tense, verb class, animacy, definiteness, and whether or not object outranks subject in person or animacy.)
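To get a feel for the size of the problem, the feature values that the slide lists explicitly can simply be multiplied out. The sketch below does that under two loudly simplifying assumptions: the tense and aspect values are treated as a single 13-way feature (in real languages they combine with each other), and features whose value sets are open-ended on the slide (gender/noun class, proximity, inclusion/exclusion) are left out. The result is therefore only a rough lower bound.

```python
from math import prod

# Only features whose values are enumerated on the slide; names are assumed.
FEATURE_VALUES = {
    "person": ["1", "2", "3", "4"],
    "number": ["sg", "pl", "du", "paucal"],
    "animacy": ["animate", "inanimate"],
    "definiteness": ["definite", "indefinite"],
    "tense_aspect": [
        "complete", "incomplete", "real", "unreal", "iterative", "habitual",
        "present", "past", "recent past", "future", "recent future",
        "non-past", "non-future",
    ],
    "verb_class": [
        "agentive intransitive", "non-agentive intransitive",
        "transitive", "ditransitive",
    ],
}


def paradigm_size(feature_values):
    """Number of distinct feature vectors if every value combines with every other."""
    return prod(len(values) for values in feature_values.values())


# 4 * 4 * 2 * 2 * 13 * 4
print(paradigm_size(FEATURE_VALUES))  # 3328
```

Even this truncated inventory gives thousands of combinations for a single noun phrase and clause, which is more than an informant can reasonably be asked to translate one by one.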

Solutions to Combinatorics
Generate paradigms of feature vectors, and then automatically generate sentences to match each feature vector.
Use known universals to eliminate features
  E.g., languages without plurals don’t have duals.
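A small sketch of the first half of this idea: enumerate a paradigm of feature vectors as a Cartesian product, after pruning values that a universal rules out. The feature inventory is deliberately tiny, the function names are hypothetical, and the step that generates a sentence for each vector is not shown.

```python
from itertools import product

# Toy inventory for illustration; not the project's feature set.
FEATURES = {
    "person": ["1", "2", "3"],
    "number": ["sg", "pl", "du"],
    "tense": ["past", "present"],
}


def prune_by_universals(features, has_plural):
    """Drop number values the language cannot have.

    Encodes one universal from the slide: a language without plurals
    does not have duals either.
    """
    pruned = {name: list(values) for name, values in features.items()}
    if not has_plural:
        pruned["number"] = [v for v in pruned["number"] if v not in ("pl", "du")]
    return pruned


def generate_paradigm(features):
    """Return every combination of feature values as a list of dicts."""
    names = list(features)
    return [dict(zip(names, combo))
            for combo in product(*(features[name] for name in names))]


full = generate_paradigm(FEATURES)
reduced = generate_paradigm(prune_by_universals(FEATURES, has_plural=False))
print(len(full), len(reduced))  # 18 6
```

Pruning before taking the product matters: removing two of three number values cuts the paradigm to a third of its size, and the saving compounds with every additional feature.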

Navigation through the corpus
Initial diagnostics
  Does the language mark number on nouns or in agreement with verbs?
Sentence selection
  Based on initial diagnostics
  Based on principles and universals
    E.g., languages that don’t have plurals don’t have duals
  So that the informant sees very few sentences that are not relevant for his/her language
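Sentence selection can be sketched as a filter: each elicitation sentence is tagged with the features it probes, the initial diagnostics rule some features out, and universals rule out anything that depends on them. Everything below (the tags, the sentences, the `UNIVERSALS` table, the function names) is a hypothetical illustration rather than the project's navigation logic. Features that no diagnostic has asked about are assumed to remain possible.

```python
ALL_FEATURES = {"plural", "dual"}

# (dependent, prerequisite): the dependent feature implies the prerequisite.
UNIVERSALS = [("dual", "plural")]

# Made-up elicitation sentences, each tagged with the features it probes.
SENTENCES = [
    {"text": "The tree fell.", "requires": set()},
    {"text": "The trees fell.", "requires": {"plural"}},
    {"text": "The two trees fell.", "requires": {"plural", "dual"}},
]


def possible_features(diagnostics):
    """Features still worth eliciting, given yes/no diagnostic answers."""
    possible = {f for f in ALL_FEATURES if diagnostics.get(f, True)}
    changed = True
    while changed:  # repeat so that chains of universals are followed
        changed = False
        for dependent, prerequisite in UNIVERSALS:
            if dependent in possible and prerequisite not in possible:
                possible.discard(dependent)
                changed = True
    return possible


def select_sentences(sentences, diagnostics):
    """Keep only sentences whose probed features are all still possible."""
    possible = possible_features(diagnostics)
    return [s["text"] for s in sentences if s["requires"] <= possible]


# An informant whose language does not mark plural sees a single sentence.
print(select_sentences(SENTENCES, {"plural": False}))
```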

Other Challenges of Computer-Based Elicitation
Inconsistency of human translation and alignment
Bias toward word order of the elicitation language
Need to provide discourse context for given and new information
How to elicit things that aren’t grammaticalized in the elicitation language
  Evidential: I see that it is raining / Apparently it is raining / It must be raining.
  Context: You are inside the house. Your friend comes in wet.