Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng.

Presentation transcript:

Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng

Overview

ACL’2011 : Preslav Nakov & Hwee Tou Ng
Overview
- Statistical machine translation (SMT) systems
  - typically assume that the word is the basic token-unit of translation
- Problem
  - data sparseness for languages with rich morphology
- Our solution
  - a paraphrase-based approach to translating morphological variants

Introduction

Morphology in Statistical Machine Translation (SMT)
- Traditionally, the word has been the basic token-unit of translation
- The earliest SMT models (the IBM models) were proposed for French and English, which have little morphology
- Most subsequent models remain word-atomic
  - phrase-based
  - hierarchical
  - treelet
  - syntactic

Morphology in Statistical Machine Translation (SMT): the word as an atomic token-unit of translation
- Fine for languages with little morphology:
  - English, French, Spanish
  - Chinese (almost no morphology)
- Inadequate for morphologically rich languages:
  - Arabic, Turkish, Finnish
    - word inflections
    - word-attached clitics
  - German
    - compounds

The Case of Malay
- The Malay language has
  - rich derivational morphology
  - but is poor in
    - word inflections (unlike Arabic, Turkish, Finnish)
    - word-attached clitics (unlike Arabic, Turkish, Finnish)
    - concatenated compounds (unlike German, Finnish)
- Problem: classic methods do not work for Malay
- Solution: paraphrasing techniques
  - word-level
  - phrase-level
  - sentence-level

Related Work

Related Work
Two general lines of research:
1. Inflected forms of the same word are used as equivalence classes or as possible alternatives in translation
  - stemming (Yang and Kirchhoff, 2006)
  - lemmatization (Al-Onaizan et al., 1999; Goldwater and McClosky, 2005; Dyer, 2007)
  - direct clustering (Talbot and Osborne, 2006)
  - factored models (Koehn and Hoang, 2007)
2. Word segmentation
  - compound words (Koehn and Knight, 2003; Yang and Kirchhoff, 2006)
  - clitics attached to the preceding word (Habash and Sadat, 2006)
  - morpheme sequence representations (Lee, 2004; Dyer et al., 2008; Dyer, 2009)
These approaches do not work well for Malay:
- it has very little inflectional morphology, if any
- compounds are not concatenated
- clitics are rare

Malay Morphology

The Malay Language
- Malay
  - Austronesian language
  - ~180M speakers
  - official in Malaysia, Indonesia, Singapore, and Brunei
  - two major standard versions (mutually intelligible)
    - Bahasa Malaysia (lit. ‘language of Malaysia’)
    - Bahasa Indonesia (lit. ‘language of Indonesia’)

The Malay Language
- Malay is an agglutinative language
  - very rich derivational morphology
  - but nearly non-existent inflectional morphology
- Inflectionally, Malay is like Chinese:
  - no grammatical gender, number or tense
  - verbs are not marked for person, etc.

Malay Morphology
- New word formation processes
  - affixation
  - compounding
  - reduplication
- Other morphological processes
  - clitic attachment

New Word Formation Processes in Malay
- Affixation: attaching affixes, which are not words, to a word
  - prefixes (e.g., ajar/‘teach’ → pelajar/‘student’)
  - suffixes (e.g., ajar → ajaran/‘teachings’)
  - circumfixes (e.g., ajar → pengajaran/‘lesson’)
  - infixes (e.g., gigi/‘teeth’ → gerigi/‘toothed blade’)
- Compounding: putting two or more existing words together
  - e.g., kereta/‘car’ + api/‘fire’ → keretapi or kereta api
  - typically not concatenated
- Reduplication: word repetition
  - e.g., pelajar-pelajar/‘students’

Clitics in Malay
- Examples
  - duduk/‘sit down’ + lah → duduklah/‘please, sit down’
  - kereta + nya → keretanya/‘his car’
- Notes:
  - Clitics are not affixes.
  - Clitic attachment is NOT
    - a word inflection process
    - a word derivation process

Translating Malay Morphology: A Paraphrase-Based Approach to Translating from Malay

Paraphrase-based Approach to Morphology
- Given a complex Malay word, we generate
  - morphologically simpler words from which it can be derived
  - alternative word segmentations
- We treat these forms as potential paraphrases of the original word.
- We use paraphrasing techniques at three levels:
  - word-level
  - phrase-level
  - sentence-level

Generating Simpler Morphological Variants
Given a complex Malay word, we generate
1. words obtainable by affix stripping
  - e.g., pelajaran → pelajar, ajaran, ajar
2. words that are part of a compound word
  - e.g., kerjasama → kerja, sama
3. words appearing on either side of a dash
  - e.g., adik-beradik → adik, beradik
4. words without clitics
  - e.g., keretanya → kereta
5. clitic-segmented word sequences
  - e.g., keretanya → kereta nya
6. dash-segmented wordforms
  - e.g., aceh-nias → aceh – nias
7. combinations of the above.
Examples:
adik-beradiknya → adik-beradiknya adik-beradik beradiknya beradik adik nya adik
berpelajaran → berpelajaran pelajaran pelajar ajaran ajar
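The variant types listed above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix and clitic inventories, the toy lexicon, and the function names `strip_affixes` and `variants` are all hypothetical, and the rules are deliberately crude (they over-generate on real text).

```python
# Hypothetical, deliberately tiny inventories; NOT the authors' actual lists.
PREFIXES = ("ber", "pel", "pe", "mem")
SUFFIXES = ("an",)
CLITICS = ("nya", "lah")
# Toy lexicon used to decide which stripped forms are real words (assumption).
LEXICON = {"pelajaran", "pelajar", "ajaran", "ajar", "kereta", "kerja", "sama"}


def strip_affixes(word, lexicon):
    """Return lexicon words reachable from `word` by removing affixes one at a time."""
    seen, stack, found = {word}, [word], set()
    while stack:
        w = stack.pop()
        candidates = [w[len(p):] for p in PREFIXES if w.startswith(p)]
        candidates += [w[:-len(s)] for s in SUFFIXES if w.endswith(s)]
        for c in candidates:
            if len(c) >= 3 and c not in seen:
                seen.add(c)
                stack.append(c)          # keep exploring even if c is not a word
                if c in lexicon:
                    found.add(c)
    return found


def variants(word, lexicon):
    """Collect simpler variants of `word` (items 1 to 6 of the slide, no combinations)."""
    out = set(strip_affixes(word, lexicon))              # 1. affix stripping
    for i in range(3, len(word) - 2):                    # 2. compound parts
        left, right = word[:i], word[i:]
        if left in lexicon and right in lexicon:
            out.update((left, right))
    if "-" in word:
        parts = word.split("-")
        out.update(parts)                                # 3. words around the dash
        out.add(" - ".join(parts))                       # 6. dash-segmented form
    for c in CLITICS:
        if word.endswith(c) and len(word) > len(c) + 2:
            base = word[:-len(c)]
            out.add(base)                                # 4. word without clitic
            out.add(f"{base} {c}")                       # 5. clitic-segmented sequence
    return out
```

For instance, `strip_affixes("pelajaran", LEXICON)` yields pelajar, ajaran and ajar, as in item 1. Checking stripped forms against a lexicon is one design choice for filtering out non-words; the slide does not say how candidates were validated.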

Word-Level Paraphrases
Given a dev/test sentence:
1. We generate a list of variants {w’} for each Malay word w.
2. We add them to the sentence, thus forming a lattice.
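As a rough illustration of how such a lattice might be assembled, the Python sketch below turns a tokenized sentence and a per-word variant list into a set of weighted arcs. The arc representation, the function name `build_lattice`, and the choice to put a multi-token variant's weight on its first arc are assumptions made for this example, not details from the slides.

```python
def build_lattice(words, variants, weight):
    """Return arcs (start, end, token, weight) for a word lattice.

    Nodes 0..len(words) follow the original sentence. A multi-token variant
    gets fresh intermediate nodes so that it still spans exactly one word.
    Node ids are therefore not guaranteed to be in topological order.
    """
    arcs = []
    next_node = len(words) + 1                    # first unused node id
    for i, w in enumerate(words):
        arcs.append((i, i + 1, w, 1.0))           # original word keeps weight 1.0
        for v in sorted(variants.get(w, ())):
            tokens = v.split()
            p = weight(v, w)                      # caller-supplied paraphrase weight
            start = i
            for k, tok in enumerate(tokens):
                last = k == len(tokens) - 1
                end = i + 1 if last else next_node
                if not last:
                    next_node += 1
                # Assumption: the whole paraphrase weight sits on the first arc.
                arcs.append((start, end, tok, p if k == 0 else 1.0))
                start = end
    return arcs


sentence = ["dia", "mahu", "membeli", "keretanya"]
toy_variants = {"membeli": {"beli"}, "keretanya": {"kereta", "kereta nya"}}
lattice = build_lattice(sentence, toy_variants, lambda v, w: 0.5)  # 0.5 is a placeholder
```

Here the two-token variant "kereta nya" becomes two arcs through a new intermediate node, so every path through the lattice still reads as a complete sentence.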

Word-Level Paraphrases (cont.)
- The lattice requires a weight for each arc.
  - We set 1.0 for the original word w.
  - For each paraphrase w’ of w, we use the probability Pr(w’|w), estimated using word-level pivoting over English.

Word-Level Paraphrases (cont.)
- Estimating the probability Pr(w’|w)
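One common way to estimate Pr(w’|w) by pivoting over English, assumed here rather than taken from the slide, is Pr(w’|w) = sum over English words e of Pr(e|w) · Pr(w’|e), with both factors estimated by relative frequency from word-alignment links. The Python sketch below follows that assumption; the function name `pivot_probs` and the toy alignment links are illustrative only.

```python
from collections import defaultdict


def pivot_probs(aligned_pairs):
    """Build Pr(w'|w) from (malay_word, english_word) alignment links.

    Assumed formulation: Pr(w'|w) = sum_e Pr(e|w) * Pr(w'|e),
    with relative-frequency estimates for both factors.
    """
    count = defaultdict(float)
    malay_total = defaultdict(float)
    english_total = defaultdict(float)
    for m, e in aligned_pairs:
        count[(m, e)] += 1
        malay_total[m] += 1
        english_total[e] += 1

    def prob(w_prime, w):
        total = 0.0
        for (m, e), c in count.items():
            if m == w:                                   # e is a pivot for w
                p_e_given_w = c / malay_total[w]
                p_wp_given_e = count.get((w_prime, e), 0.0) / english_total[e]
                total += p_e_given_w * p_wp_given_e
        return total

    return prob


# Toy alignment links (invented for illustration).
links = [("keretanya", "car"), ("keretanya", "car"), ("keretanya", "his"),
         ("kereta", "car"), ("kereta", "car"), ("nya", "his")]
pr = pivot_probs(links)
```

With these toy links, the probabilities of all Malay words given "keretanya" sum to one, which is a quick sanity check on the estimate.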

Sentence-Level Paraphrases
- Word-level paraphrases in the dev/test input need matching phrases in the phrase table
- Paraphrase the training data at the sentence level:
  - For each paraphrasable word w and for each of its paraphrases w’, we create a version of the sentence with w substituted by w’.
  - Pair each paraphrased sentence with the original target
Paraphrased bi-text:
dia mahu membeli keretanya. || she wants to buy his car.
dia mahu beli keretanya. || she wants to buy his car.
dia mahu membeli kereta. || she wants to buy his car.
dia mahu membeli kereta nya. || she wants to buy his car.
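The substitution procedure can be sketched in a few lines of Python. This is an illustrative reading of the slide, assuming whitespace-tokenized source sentences and one substitution per generated sentence; `paraphrase_bitext` and the variant dictionary are hypothetical names and data.

```python
def paraphrase_bitext(bitext, variants):
    """Expand (source, target) pairs with one-word-at-a-time paraphrases.

    Each paraphrased source sentence is paired with the unchanged target.
    """
    out = []
    for src, tgt in bitext:
        words = src.split()
        out.append((src, tgt))                            # keep the original pair
        for i, w in enumerate(words):
            for v in sorted(variants.get(w, ())):
                new_src = " ".join(words[:i] + [v] + words[i + 1:])
                out.append((new_src, tgt))
    return out


bitext = [("dia mahu membeli keretanya .", "she wants to buy his car .")]
toy_variants = {"membeli": {"beli"}, "keretanya": {"kereta", "kereta nya"}}
paraphrased = paraphrase_bitext(bitext, toy_variants)
```

On this input the function returns the original pair plus three paraphrased pairs, matching the four lines of the paraphrased bi-text example.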

Sentence-Level Paraphrases (cont.)
- We build two phrase tables
  - T_orig from the original training bi-text
  - T_par from the paraphrased bi-text
- We merge these tables
  1. Keep all entries from T_orig.
  2. Add those phrase pairs from T_par that are not in T_orig.
  3. Add extra features:
    - F1: 1 if the entry came from T_orig, 0.5 otherwise.
    - F2: 1 if the entry came from T_par, 0.5 otherwise.
    - F3: 1 if the entry was in both tables, 0.5 otherwise.
The feature weights are set using MERT, and the number of features is optimized on the development set.
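A minimal Python sketch of the merge follows. It assumes each phrase table is a dictionary from (source phrase, target phrase) to a list of scores, and it reads F2 as "the pair occurs in T_par" (including pairs found in both tables); both the data structure and that reading are assumptions of this example.

```python
def merge_tables(t_orig, t_par):
    """Merge two phrase tables and append the indicator features F1, F2, F3.

    Assumption: F2 is 1 whenever the pair occurs in T_par, even if the
    entry itself (and its scores) is kept from T_orig.
    """
    merged = {}
    for pair, scores in t_orig.items():                   # step 1: keep all of T_orig
        in_both = pair in t_par
        f2 = 1.0 if in_both else 0.5
        f3 = 1.0 if in_both else 0.5
        merged[pair] = list(scores) + [1.0, f2, f3]
    for pair, scores in t_par.items():                    # step 2: add new pairs only
        if pair not in t_orig:
            merged[pair] = list(scores) + [0.5, 1.0, 0.5]
    return merged


# Toy tables with a single invented score per entry.
t_orig = {("kereta", "car"): [0.7], ("beli", "buy"): [0.6]}
t_par = {("kereta", "car"): [0.4], ("kereta nya", "his car"): [0.3]}
merged = merge_tables(t_orig, t_par)
```

Using 0.5 rather than 0 for the "off" value follows the slide; in a log-linear model this keeps the logarithm of the feature finite.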

Phrase-Level Paraphrases
- We further augment the phrase table with an extra feature, which is calculated using phrase-level pivoting:
  - 1, for phrase pairs coming from T_orig
  - max_p Pr(p’|p), for phrase pairs coming from T_par
  - where p’ is a paraphrase of some original Malay phrase p
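The extra feature can be written as a small function. The sketch below assumes a lookup from each paraphrased phrase p’ to the original phrases p it was derived from, plus a caller-supplied pivoting probability; the names `phrase_feature`, `originals` and `pivot_prob`, the toy probabilities, and the fallback value of 0.0 when no original phrase is known are all hypothetical.

```python
def phrase_feature(source_phrase, from_orig, originals, pivot_prob):
    """Extra phrase-table feature based on phrase-level pivoting.

    Returns 1.0 for pairs from T_orig, else max over originals p of Pr(p'|p).
    """
    if from_orig:
        return 1.0
    candidates = originals.get(source_phrase, ())
    # default=0.0 is an assumption for phrases with no recorded original.
    return max((pivot_prob(source_phrase, p) for p in candidates), default=0.0)


# Toy data: the paraphrased phrase maps back to two original phrases.
originals = {"beli kereta": ["membeli kereta", "membeli keretanya"]}
toy_probs = {("beli kereta", "membeli kereta"): 0.4,
             ("beli kereta", "membeli keretanya"): 0.25}
score = phrase_feature("beli kereta", False, originals,
                       lambda pp, p: toy_probs.get((pp, p), 0.0))
```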

Experiments and Evaluation

Data
- Training
  - bi-text: 350K sentence pairs
  - English: 10.4M words
  - Malay: 9.7M words
- Development
  - bi-text: 2,000 sentence pairs
  - English: 63.4K words
  - Malay: 58.5K words
- Testing
  - bi-text: 1,420 sentences
  - Malay: 28.8K words
  - English: 32.8K, 32.4K, and 32.9K words (3 reference translations)
- LM
  - 49.8M English words

Evaluation Results: BLEU

Detailed BLEU
- Improvement for all n-grams used in BLEU

Evaluation With 5 Measures
- Consistent improvement for 5 measures

Example Translations

Conclusion

Conclusion
- Presented a novel approach to translating from a morphologically complex language
  - uses paraphrases at three levels of translation
    - word-level
    - phrase-level
    - sentence-level
- Demonstrated the potential of the approach on Malay
  - derivationally rich
  - but almost no inflectional morphology

Future Work
- Improve the paraphrasing models
  - use a richer sense similarity model that combines monolingual and bilingual similarity (Chen et al., 2010)
- Try phrase table paraphrasing
  - instead of sentence-level paraphrasing (Nakov, 2008)
- Try other
  - morphologically complex languages
  - SMT models
The presented work is supported by research grant POD