Learning to Generate Complex Morphology for Machine Translation Einat Minkov†, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research †Carnegie Mellon University

Presentation transcript:

Learning to Generate Complex Morphology for Machine Translation
Einat Minkov†, Kristina Toutanova* and Hisami Suzuki*
*Microsoft Research, †Carnegie Mellon University

Motivation
I would like to meet this nice woman.
اود ان مواجهه هذا جيد امراه.
Gender labels on the Arabic words: fem, masc. Gloss as displayed under the right-to-left text: woman nice this

Motivation
Correct vs. system guess (Quirk et al., 05) [example not preserved in the transcript]

SMT challenges for English → morphology-rich language
Information ‘missing’ on source side
Data sparsity
Morphological agreement in the target language

Related work
Translation from morphology-rich languages to English
–Preprocessing of the inputs, to improve alignments: Arabic (Lee, 04), German (Koehn and Knight, 03; Nießen and Ney, 04; Popović and Ney, 04; Collins et al., 05), Czech (Goldwater and McClosky, 05)
Translation from English to morphology-rich languages
–Preprocessing and postprocessing: Turkish (El-Kahlout and Oflazer, 06), Spanish and Catalan (Ueffing and Ney, 03)
Our approach
–Extension of Japanese case marker prediction (Suzuki and Toutanova, 06)

Morphology Prediction
Morphology generation as classification: classify each stem into an inflected form
Source: Eliminate a primary key constraint
Possible inflections:
–eliminare, elimino, elimini, eliminiamo, …
–un, una
–vincolo, vincoli
–di, del, dei, della, …
–chiave, chiavi
–primario, primaria, primari, primarie
System guess: eliminare un vincolo di chiave primario
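The classification view on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the candidate lists are copied from the slide (without the "…" entries), while `candidate_sets`, `predict`, and the placeholder scorer `identity_score` are hypothetical names that do not come from the talk.

```python
from math import prod

# Toy inflection lexicon copied from the slide (the "…" entries are left out).
INFLECTIONS = {
    "eliminare": ["eliminare", "elimino", "elimini", "eliminiamo"],
    "un": ["un", "una"],
    "vincolo": ["vincolo", "vincoli"],
    "di": ["di", "del", "dei", "della"],
    "chiave": ["chiave", "chiavi"],
    "primario": ["primario", "primaria", "primari", "primarie"],
}

def candidate_sets(stems, lexicon=INFLECTIONS):
    """Return, for each stem, the inflected forms a classifier may choose from."""
    # Assumption: a stem missing from the lexicon is passed through unchanged.
    return [lexicon.get(stem, [stem]) for stem in stems]

def predict(stems, score, lexicon=INFLECTIONS):
    """Classify each stem independently by picking its highest-scoring candidate."""
    return [
        max(cands, key=lambda form: score(stem, form))
        for stem, cands in zip(stems, candidate_sets(stems, lexicon))
    ]

stems = ["eliminare", "un", "vincolo", "di", "chiave", "primario"]

# Size of the joint output space if every word is chosen freely.
n_sequences = prod(len(cands) for cands in candidate_sets(stems))

# Placeholder scorer (hypothetical): prefers the form identical to the stem.
identity_score = lambda stem, form: 1.0 if form == stem else 0.0
guess = predict(stems, identity_score)
```

Even this six-word example has 512 possible inflected sequences, which is why the choice for each word benefits from a trained scorer rather than a fixed rule.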

Outline
Morphology
–Russian, Arabic
–Lexicon operations
The task of inflection prediction
A log-linear model
Features
–Lexical, Syntactic and Morphological
Experiments

Russian Morphology
3 genders, 2 numbers, 6 cases (nom, acc, location …)
Nouns have gender, and inflect for number and case
Adjectives agree with nouns in number, gender, and case
Verbs agree with the subject in person and number (past tense agrees in gender and number)
Example: У меня есть синий карандаш
–У (at)
–меня (me): Pers1, Gen
–есть (is): Pres
–синий (blue): Nom, Masc, Sing
–карандаш (pencil): Nom, Masc, Sing

Arabic morphology
Arabic: inflection + clitics
–Prefixes: Conj/Prep/Det (in strict order)
–Suffixes: Object pronouns/Possessive pronouns
Agreement:
–In person, number, gender and definiteness
وللمكتبات /walilmaktabāt/ = و + ل + ال + مكتبة + ات = wa+li+al+maktabāt = and+for+the+libraries: "and for the libraries"
فقلناها /faqulnāhā/ = ف + قال + نا + ها = fa+qul+na+hā = so+said+we+it: "so we said it"
(from Bar-Haim et al) (from Nizar Habash)
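The strict prefix order (Conj, then Prep, then Det) can be illustrated with a small Python helper that builds the "+"-segmented transliteration shown on the slide. It is a sketch only: `segmented_form` is a hypothetical name, and the function does not model the orthographic changes that produce the surface word (for example li+al surfacing as lil).

```python
def segmented_form(stem, conj=None, prep=None, det=None, suffixes=()):
    """Join clitics and stem with '+', always in the order Conj, Prep, Det, stem, suffixes."""
    # The tuple below fixes the prefix order regardless of keyword order at the call site.
    prefixes = [p for p in (conj, prep, det) if p]
    return "+".join(prefixes + [stem] + list(suffixes))

# The two examples from the slide, in transliteration.
libraries = segmented_form("maktabāt", det="al", prep="li", conj="wa")
said_it = segmented_form("qul", conj="fa", suffixes=("na", "hā"))
```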

Lexicon Operations
Surface word: то
Stemming: set of possible lemmas: то, тот
Inflection: set of possible morphological variants: того, тому, тем, том, те, тех, теми, то
Analysis: set of possible morphological analyses:
–то: тот+PronAdj+DemPron+Neut+Sg+NomAcc (that)
–то: то+Pron+Neut+Inanim+Sg+NomAcc (it)
–то: то+Conj (then)
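The three lexicon operations can be mimicked with a toy in-memory lexicon. The sketch below assumes a plain dictionary lookup; the names `Analysis`, `analyze`, `stem`, and `inflect` are hypothetical, and the data is limited to the single entry shown on the slide (the slide does not say which variant belongs to which lemma, so the variants are stored per surface word).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Analysis:
    lemma: str
    tags: tuple

# Analyses of the surface word то, copied from the slide.
ANALYSES = {
    "то": [
        Analysis("тот", ("PronAdj", "DemPron", "Neut", "Sg", "NomAcc")),  # that
        Analysis("то", ("Pron", "Neut", "Inanim", "Sg", "NomAcc")),       # it
        Analysis("то", ("Conj",)),                                        # then
    ],
}

# Morphological variants listed on the slide for the surface word то.
VARIANTS = {"то": ["того", "тому", "тем", "том", "те", "тех", "теми", "то"]}

def analyze(word):
    """Analysis: all morphological analyses of a surface word."""
    return ANALYSES.get(word, [])

def stem(word):
    """Stemming: the set of possible lemmas, derived from the analyses."""
    return sorted({a.lemma for a in analyze(word)})

def inflect(word):
    """Inflection: the set of possible morphological variants."""
    # Assumption: an unknown word has only itself as a variant.
    return VARIANTS.get(word, [word])
```

All three operations return sets because a surface form such as то is ambiguous; the prediction model then chooses among the variants.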

Inflection Prediction Model
Given a sentence, predict the inflection of each word.
Conditional Markov Model over the inflections y_1, y_2, y_3, y_4
Sentence processed left-to-right (can be applied top-down)
Features: pairs of target and context predicates
Can model agreement: POS(y_{i-2})=DT & Number(y_{i-1})=sg & Number(y_i)=sg
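A minimal sketch of such a model follows, assuming a second-order conditional Markov model in which each local distribution p(y_i | y_{i-1}, y_{i-2}, x) is log-linear. The weights, the predicate strings, and the greedy search are illustrative assumptions: the slides do not say which search procedure is used, and the labels here are bare number tags rather than full inflections.

```python
import math

def local_probs(weights, candidates, context):
    """Log-linear local model: p(y | context) is proportional to exp(sum of active weights)."""
    scores = []
    for y in candidates:
        # Each feature pairs one context predicate with the target label y.
        scores.append(sum(weights.get((c, y), 0.0) for c in context))
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {y: e / z for y, e in zip(candidates, exps)}

def greedy_decode(weights, candidate_sets, source_context):
    """Process the sentence left to right, conditioning on the two previous predictions."""
    history = ["<s>", "<s>"]
    output = []
    for cands, src in zip(candidate_sets, source_context):
        context = [f"prev1={history[-1]}", f"prev2={history[-2]}"] + list(src)
        probs = local_probs(weights, cands, context)
        best = max(cands, key=probs.get)
        output.append(best)
        history.append(best)
    return output

# Hypothetical weights: a plural aligned source word favours "pl",
# and agreement with the previous label is rewarded.
weights = {
    ("aligned_number=Plur", "pl"): 2.0,
    ("prev1=pl", "pl"): 1.5,
    ("prev1=sg", "sg"): 1.5,
}
candidate_sets = [["sg", "pl"], ["sg", "pl"]]
source_context = [["aligned_number=Plur"], []]
decoded = greedy_decode(weights, candidate_sets, source_context)
```

The second word has no source evidence in this toy input, so its label is decided by the agreement feature on the previous prediction, which is the behaviour the slide's agreement example describes.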

Linguistic annotations
Annotations used in Quirk et al (05) system:
–Source dependency tree
–Projected dependency tree
–Surface features
–POS & morphological features

Features
Lexical. Monolingual: stem, left stem, right stem, y_{i-1}, y_{i-2}; Bilingual: aligned words a_i
Syntax. Monolingual: parent stem, …; Bilingual: parent(a_i), left sister(a_i), right sister(a_i)
Morph. Monolingual: POS(y_{i-1}), number(y_{i-1}), person(y_{i-1}), tense(y_{i-1}), …; Bilingual: POS(a_i), number(a_i), person(a_i), tense(a_i), det*(a_i), prep*(a_i), pron*(a_i), …
Inflection: inflection(y_i), POS(y_i), tense(y_i), number(y_i), …

Arabic:
[Prev.Stem=qam~-u_qam~, Prep_Inflection=bi]
[Aligned_Number=Plur, Number_Inflection=pl]
[AlignedWords=and, Conj_Inflection=true]
[PrevStem=fiy_y, Prep_Inflection=none]
[AlignedWords=applications, Gender_Inflection=fem]
Russian:
[PrevStem=X, Case_Inflection=y]
[AlignedWords=will,Tense_Inflection=future]
[AlignedWords=been,Tense_Inflection=past]
[AlignedWords=click,Tense_Inflection=imperative]
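Features of this shape can be generated by pairing every context predicate with every target predicate. The Python sketch below assumes that reading; `predicate_pairs` is a hypothetical helper and the bracketed string format simply imitates the examples above.

```python
from itertools import product

def predicate_pairs(context, target):
    """Build one feature string per (context predicate, target predicate) pair."""
    ctx = [f"{name}={value}" for name, value in context.items()]
    tgt = [f"{name}={value}" for name, value in target.items()]
    return [f"[{c},{t}]" for c, t in product(ctx, tgt)]

# Context predicates describe the word's surroundings; target predicates
# describe properties of the candidate inflection being scored.
features = predicate_pairs(
    {"AlignedWords": "will", "PrevStem": "X"},
    {"Tense_Inflection": "future"},
)
```

Because the target side uses properties of the inflection (tense, number, gender) rather than the whole label, one learned weight can be shared by every candidate that has that property.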

Reference Experiments
Baselines
–Random baseline (pick a label at random)
–Word-trigram language model baseline, trained using the CMU toolkit on the same training dataset
Models
–Monolingual word / all, Bilingual word / all
Lexicons:
–Russian: dictionary; Arabic: Buckwalter analyzer
–Evaluated only on words in the lexicon
Data (Eng-Russian / Eng-Arabic)
–Training: 1M / 470K
–Dev: 1K
–Test: 1K
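The random baseline and the evaluation restricted to in-lexicon words can be sketched as follows. This is an assumption-laden illustration: `random_baseline` and `accuracy` are hypothetical names, the toy data is invented for the example, and the trigram language model baseline built with the CMU toolkit is not reproduced here.

```python
import random

def random_baseline(candidate_sets, rng):
    """Pick one label uniformly at random from each word's candidate set."""
    return [rng.choice(cands) for cands in candidate_sets]

def accuracy(predicted, gold, in_lexicon):
    """Fraction of correct predictions, counting only words found in the lexicon."""
    scored = [(p, g) for p, g, keep in zip(predicted, gold, in_lexicon) if keep]
    if not scored:
        return 0.0
    return sum(p == g for p, g in scored) / len(scored)

# Toy data: the third token is outside the lexicon and is ignored by the metric.
candidate_sets = [["un", "una"], ["vincolo", "vincoli"], ["XYZ"]]
gold = ["un", "vincolo", "XYZ"]
in_lexicon = [True, True, False]

rng = random.Random(0)  # fixed seed so a run is repeatable
sampled = random_baseline(candidate_sets, rng)
score = accuracy(sampled, gold, in_lexicon)
```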

Russian inflection prediction: accuracy
The suggested model performs better than a language model
Syntactic and morphological features are informative

Arabic inflection prediction: accuracy

Accuracy vs. training data size

Error Analysis
Russian
–Gender of pronoun (it ~ he/she/it)
–Case/Gender in coordinate construction
–Morphological analysis ambiguity
Arabic
–Gender/Number of pronoun
–Definiteness in noun phrases

Summary
Proposed a general framework for improving SMT into morphology-rich languages
Showed that morpho-syntactic features and source sentence information, derived from aligned sentence pairs and a lexicon, are effective
Achieved good results even with little training data

Future Directions
Integration with the MT system
–Initial results for Russian: 1.7 BLEU improvement
Improvements to the model and features
–Morphological disambiguation
–Semantic role labeling
–Longer distance agreements (e.g. pronoun coreference)
More languages

Thanks! Questions?