Eliciting a corpus of word-aligned phrases for MT
Lori Levin, Alon Lavie, Erik Peterson
Language Technologies Institute, Carnegie Mellon University



Introduction
Problem: Building Machine Translation systems for languages with scarce resources:
  – Not enough data for Statistical MT and Example-Based MT
  – Not enough human linguistic expertise for writing rules
Approach:
  – Elicit high-quality, word-aligned data from bilingual speakers
  – Learn transfer rules from the elicited data

Modules of the AVENUE/MilliRADD rule learning system and MT system
[Diagram; component labels:]
Learning Module
Transfer Rules
    {PP,4894}
    ;;Score:
    PP::PP [NP POSTP] -> [PREP NP]
    ((X2::Y1) (X1::Y2))
Translation Lexicon
Run Time Transfer System
Lattice
Decoder
English Language Model
Word-to-Word Translation Probabilities
Word-aligned elicited data
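The PP::PP rule on this slide says that a source postpositional phrase [NP POSTP] becomes a target prepositional phrase [PREP NP], with source constituent X2 aligned to target Y1 and X1 to Y2. A minimal sketch of applying such a rule is given below. The class name, the function name, and the two-word toy lexicon (romanized Hindi) are illustrative assumptions, not part of the AVENUE system, and a real transfer engine would translate each constituent recursively with other rules rather than by direct lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferRule:
    """Hypothetical container mirroring the slide's rule notation."""
    source_type: str
    target_type: str
    source_rhs: tuple   # e.g. ("NP", "POSTP")
    target_rhs: tuple   # e.g. ("PREP", "NP")
    alignments: tuple   # 1-based (x, y) pairs, as in X2::Y1


# PP::PP [NP POSTP] -> [PREP NP] ((X2::Y1) (X1::Y2))
PP_RULE = TransferRule(
    source_type="PP",
    target_type="PP",
    source_rhs=("NP", "POSTP"),
    target_rhs=("PREP", "NP"),
    alignments=((2, 1), (1, 2)),
)

# Toy lexicon, assumed for illustration only.
TOY_LEXICON = {"ghar": "the house", "mein": "in"}


def apply_rule(rule, constituents, translate):
    """Reorder (category, text) constituents according to the rule's alignments."""
    categories = tuple(category for category, _ in constituents)
    if categories != rule.source_rhs:
        raise ValueError(f"expected {rule.source_rhs}, got {categories}")
    target = [None] * len(rule.target_rhs)
    for x, y in rule.alignments:
        _, text = constituents[x - 1]
        target[y - 1] = (rule.target_rhs[y - 1], translate(text))
    return target


result = apply_rule(PP_RULE, [("NP", "ghar"), ("POSTP", "mein")], TOY_LEXICON.get)
print(" ".join(text for _, text in result))
```

Storing the alignments as index pairs keeps the reordering separate from lexical choice, which is the division the slide's diagram suggests between transfer rules and the translation lexicon.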

Outline
Demo of elicitation interface
Description of elicitation corpus
Overview of automated rule learning

Demo of Elicitation Tool
Speaker needs to be bilingual and literate: no other knowledge necessary
Mappings between words and phrases: many-to-many, one-to-none, many-to-none, etc.
Create phrasal mappings
Fonts and character sets:
  – Including Hindi, Chinese, and Arabic
Add morpheme boundaries to target language
Add alternate translations
Notes and context
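One way to store the many-to-many and to-none mappings listed on this slide is as a list of links, each pairing a set of source word positions with a (possibly empty) set of target positions. The sketch below is an assumption about a reasonable representation, not the elicitation tool's actual data format; the class and method names are hypothetical, and the alignment chosen for the example sentence is just one that an informant might plausibly give.

```python
from dataclasses import dataclass, field


@dataclass
class AlignedPhrase:
    """Hypothetical record for one elicited, word-aligned phrase."""
    source: list
    target: list
    links: list = field(default_factory=list)  # (source positions, target positions)

    def add_link(self, source_idx, target_idx):
        """Link source positions to target positions; an empty target means 'to-none'."""
        if any(i not in range(len(self.source)) for i in source_idx):
            raise IndexError("source index out of range")
        if any(j not in range(len(self.target)) for j in target_idx):
            raise IndexError("target index out of range")
        self.links.append((frozenset(source_idx), frozenset(target_idx)))

    def phrase_pairs(self):
        """Return each link as a (source words, target words) pair."""
        pairs = []
        for src, tgt in self.links:
            src_words = " ".join(self.source[i] for i in sorted(src))
            tgt_words = " ".join(self.target[j] for j in sorted(tgt))
            pairs.append((src_words, tgt_words))
        return pairs

    def unaligned_source(self):
        """Source words that map to nothing in the target."""
        covered = set()
        for src, tgt in self.links:
            if tgt:
                covered |= src
        return [w for i, w in enumerate(self.source) if i not in covered]


example = AlignedPhrase(["I", "am", "falling"], ["Estoy", "cayendo"])
example.add_link([0, 1], [0])   # many-to-one: "I am" -> "Estoy"
example.add_link([2], [1])      # one-to-one: "falling" -> "cayendo"
print(example.phrase_pairs())
```

Using sets of positions rather than single word-to-word links lets the same structure cover phrasal mappings without a separate phrase table.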

Testing of Elicitation Tool
DARPA Hindi Surprise Language Exercise
Around 10 Hindi speakers
Around 17,000 phrases translated and aligned
  – Elicitation corpus
  – NPs and PPs from Treebanked Brown Corpus

Elicitation Corpus: Basic Principles
Minimal pairs
Syntactic compositionality
Special semantic/pragmatic constructions
Navigation based on language typology and universals
Challenges

Elicitation Corpus: Minimal Pairs
Eng: I fell.                  Sp: Caí                         M: Tranün
Eng: You (John) fell.         Sp: Tú (Juan) caiste            M: Eymi tranimi (Kuan)
Eng: You (Mary) fell.         Sp: Tú (María) caiste           M: Eymi tranimi (Maria)
Eng: I am falling.            Sp: Estoy cayendo               M: Tranmeken
Eng: You (John) are falling.  Sp: Tú (Juan) estás cayendo     M: Eimi (Kuan) tranmekeymi
Mapudungun (M): Spoken by around one million people in Chile and Argentina.

Using feature vectors to detect minimal pairs
np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien
cl1:(subj np1).intr-ag.past.complete
  – Eng: You (John) fell. Sp: Tú (Juan) caiste M: Eymi tranimi (Kuan)
np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien
cl1:(subj np1).intr-ag.past.complete
  – Eng: You (Mary) fell. Sp: Tú (María) caiste M: Eymi tranimi (Maria)
Feature vectors can be extracted from the output of a parser for English or Spanish. (Except for features that English and Spanish do not have…)
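The two feature vectors on this slide differ only in the gender value of np1 (masc versus fem), which is what makes the John/Mary sentences a minimal pair. A rough sketch of detecting this automatically follows. The parsing regular expression and the function names are assumptions made for illustration, and the vector strings are written without the stray spaces that appear in the transcript.

```python
import re

# One element looks like: name:(role).value.value...
ELEMENT = re.compile(r"(\w+):(\([^)]*\))((?:\.[\w-]+)*)")


def parse_vector(text):
    """Parse the slide's notation into {element name: [role, value, value, ...]}."""
    vector = {}
    for name, role, tail in ELEMENT.findall(text):
        values = tail[1:].split(".") if tail else []
        vector[name] = [role] + values
    return vector


def differences(a, b):
    """List (element, position, value_a, value_b) wherever two vectors disagree."""
    diffs = []
    for name in sorted(set(a) | set(b)):
        va, vb = a.get(name, []), b.get(name, [])
        for i in range(max(len(va), len(vb))):
            x = va[i] if i < len(va) else None
            y = vb[i] if i < len(vb) else None
            if x != y:
                diffs.append((name, i, x, y))
    return diffs


def is_minimal_pair(a, b):
    """Two vectors form a minimal pair if they differ in exactly one value."""
    return len(differences(a, b)) == 1


JOHN = parse_vector(
    "np1:(subj-of cl1).pro-pers.hum.2.sg.masc.no-clusn.no-def.no-alien "
    "cl1:(subj np1).intr-ag.past.complete"
)
MARY = parse_vector(
    "np1:(subj-of cl1).pro-pers.hum.2.sg.fem.no-clusn.no-def.no-alien "
    "cl1:(subj np1).intr-ag.past.complete"
)

print(differences(JOHN, MARY))
print(is_minimal_pair(JOHN, MARY))
```

If the elicited Mapudungun translations of a minimal pair are identical, as they are for these two sentences apart from the name, that suggests the differing feature (here, gender in the second person) is not marked on the verb in the target language.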

Syntactic Compositionality
  – The tree
  – The tree fell.
  – I think that the tree fell.
We learn rules for smaller phrases
  – E.g., NP
Their root nodes become non-terminals in the rules for larger phrases.
  – E.g., S containing an NP
Meaning of a phrase is predictable from the meanings of the parts.

Special Semantic and Pragmatic Constructions
Meaning may not be compositional
  – Not predictable from the meanings of the parts
May not follow normal rules of grammar.
  – Suggestion: Why not go?
Word-for-word translation may not work.
Tend to be sources of MT mismatches
  – Comparative:
    English:  Hotel A is [closer than Hotel B]
    Japanese: Hoteru A wa [Hoteru B yori] [tikai desu]
    Gloss:    Hotel A TOP Hotel B than close is
“Closer than Hotel B” is a constituent in English, but “Hoteru B yori tikai” is not a constituent in Japanese.

Examples of Semantic/Pragmatic Categories
Speech Acts: requests, suggestions, etc.
Comparatives and Equatives
Modality: possibility, probability, ability, obligation, uncertainty, evidentiality
The more the merrier (can’t think of the name of this construction)
Causatives
Etc.

A Challenge: Combinatorics
  – Person (1, 2, 3, 4)
  – Number (sg, pl, du, paucal)
  – Gender/Noun Class (?)
  – Animacy (animate/inanimate)
  – Definiteness (definite/indefinite)
  – Proximity (near, far, very far, etc.)
  – Inclusion/exclusion
Multiply with: tenses and aspects (complete, incomplete, real, unreal, iterative, habitual, present, past, recent past, future, recent future, non-past, non-future, etc.)
Multiply with verb class: agentive intransitive, non-agentive intransitive, transitive, ditransitive, etc.
(Case marking and agreement may vary with verb tense, verb class, animacy, definiteness, and whether or not object outranks subject in person or animacy.)

Solutions to Combinatorics
Generate paradigms of feature vectors, and then automatically generate sentences to match each feature vector.
Use known universals to eliminate features: e.g., languages without plurals don’t have duals.
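Both ideas on this slide can be sketched in a few lines: enumerate every combination of feature values as a paradigm, and shrink the value lists first using an implicational universal. The feature inventory below is a small subset taken from the combinatorics slide, while the data layout and function names are assumptions for illustration. Generating a sentence for each vector is not attempted here.

```python
from itertools import product

# Subset of the features listed on the combinatorics slide.
FEATURES = {
    "person": ["1", "2", "3", "4"],
    "number": ["sg", "pl", "du", "paucal"],
    "animacy": ["animate", "inanimate"],
    "definiteness": ["definite", "indefinite"],
}

# (feature, absent value, implied absent value):
# languages without plurals don't have duals.
UNIVERSALS = [("number", "pl", "du")]


def prune(features, absent):
    """Drop values known to be absent, plus any values the universals rule out."""
    absent = set(absent)
    for feature, missing, implied in UNIVERSALS:
        if (feature, missing) in absent:
            absent.add((feature, implied))
    return {
        name: [v for v in values if (name, v) not in absent]
        for name, values in features.items()
    }


def paradigm(features):
    """Enumerate every feature vector as a dict of feature -> value."""
    names = list(features)
    return [dict(zip(names, combo)) for combo in product(*(features[n] for n in names))]


full = paradigm(FEATURES)
reduced = paradigm(prune(FEATURES, {("number", "pl")}))
print(len(full), len(reduced))
```

With these four features the full paradigm has 4 × 4 × 2 × 2 = 64 vectors. Learning early in elicitation that the language has no plural removes both pl and du, leaving 4 × 2 × 2 × 2 = 32.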

Other Challenges of Computer-Based Elicitation
Inconsistency of human translation and alignment
Bias toward word order of the elicitation language
  – Need to provide discourse context for given and new information
How to elicit things that aren’t grammaticalized in the elicitation language:
  – Evidential: I see that it is raining/Apparently it is raining/It must be raining.
    Context: You are inside the house. Your friend comes in wet.