Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng
Overview
ACL’2011: Preslav Nakov & Hwee Tou Ng
- Statistical machine translation (SMT) systems typically assume that the word is the basic token-unit of translation.
- Problem: data sparseness for languages with rich morphology.
- Our solution: a paraphrase-based approach to translating morphological variants.
Introduction
Morphology in Statistical Machine Translation (SMT)
- Traditionally, the word has been the basic token-unit of translation.
- The earliest SMT models (the IBM models) were proposed for French and English, which have little morphology.
- Most subsequent models remain word-atomic:
  - phrase-based
  - hierarchical
  - treelet
  - syntactic
Morphology in Statistical Machine Translation (SMT)
Word as an atomic token-unit of translation
- Fine for languages with little morphology:
  - English, French, Spanish
  - Chinese (almost no morphology)
- Inadequate for morphologically rich languages:
  - Arabic, Turkish, Finnish
    - word inflections
    - word-attached clitics
  - German
    - compounds
The Case of Malay
- The Malay language has rich derivational morphology, but is poor in:
  - word inflections (unlike Arabic, Turkish, Finnish)
  - word-attached clitics (unlike Arabic, Turkish, Finnish)
  - concatenated compounds (unlike German, Finnish)
- Problem: classic methods do not work for Malay.
- Solution: paraphrasing techniques at three levels:
  - word-level
  - phrase-level
  - sentence-level
Related Work
Two general lines of research:
1. Inflected forms of the same word are used as equivalence classes or as possible alternatives in translation:
   - stemming (Yang and Kirchhoff, 2006)
   - lemmatization (Al-Onaizan et al., 1999; Goldwater and McClosky, 2005; Dyer, 2007)
   - direct clustering (Talbot and Osborne, 2006)
   - factored models (Koehn and Hoang, 2007)
2. Word segmentation:
   - compound words (Koehn and Knight, 2003; Yang and Kirchhoff, 2006)
   - clitics attached to the preceding word (Habash and Sadat, 2006)
   - morpheme sequence representations (Lee, 2004; Dyer et al., 2008; Dyer, 2009)
These approaches do not work well for Malay:
- it has very little inflectional morphology, if any
- its compounds are not concatenated
- its clitics are rare
Malay Morphology
The Malay Language
- Malay is an Austronesian language with ~180M speakers.
- It is official in Malaysia, Indonesia, Singapore, and Brunei.
- It has two major standard versions, which are mutually intelligible:
  - Bahasa Malaysia (lit. ‘language of Malaysia’)
  - Bahasa Indonesia (lit. ‘language of Indonesia’)
The Malay Language
- Malay is an agglutinative language with very rich derivational morphology but nearly non-existent inflectional morphology.
- Inflectionally, Malay is like Chinese: there is no grammatical gender, number, or tense, verbs are not marked for person, etc.
Malay Morphology
- New word formation processes:
  - affixation
  - compounding
  - reduplication
- Other morphological processes:
  - clitic attachment
New Word Formation Processes in Malay
- Affixation: attaching affixes, which are not words, to a word
  - prefixes (e.g., ajar/‘teach’ → pelajar/‘student’)
  - suffixes (e.g., ajar → ajaran/‘teachings’)
  - circumfixes (e.g., ajar → pengajaran/‘lesson’)
  - infixes (e.g., gigi/‘teeth’ → gerigi/‘toothed blade’)
- Compounding: putting two or more existing words together
  - e.g., kereta/‘car’ + api/‘fire’ → keretapi or kereta api
  - typically not concatenated
- Reduplication: word repetition
  - e.g., pelajar-pelajar/‘students’
Clitics in Malay
- Examples:
  - duduk/‘sit down’ + lah → duduklah/‘please, sit down’
  - kereta + nya → keretanya/‘his car’
- Notes:
  - Clitics are not affixes.
  - Clitic attachment is NOT a word inflection process, nor a word derivation process.
Translating Malay Morphology: A Paraphrase-based Approach to Translating from Malay
Paraphrase-based Approach to Morphology
- Given a complex Malay word, we generate:
  - morphologically simpler words from which it can be derived
  - alternative word segmentations
- We treat these forms as potential paraphrases of the original word.
- We use paraphrasing techniques at three levels:
  - word-level
  - phrase-level
  - sentence-level
Generating Simpler Morphological Variants
Given a complex Malay word, we generate:
1. words obtainable by affix stripping, e.g., pelajaran → pelajar, ajaran, ajar
2. words that are part of a compound word, e.g., kerjasama → kerja, sama
3. words appearing on either side of a dash, e.g., adik-beradik → adik, beradik
4. words without clitics, e.g., keretanya → kereta
5. clitic-segmented word sequences, e.g., keretanya → kereta nya
6. dash-segmented wordforms, e.g., aceh-nias → aceh - nias
7. combinations of the above.
[Figure: variants generated for two example words. adik-beradiknya: adik-beradiknya, adik-beradik, beradiknya, beradik, adik nya, adik. berpelajaran: berpelajaran, pelajaran, pelajar, ajaran, ajar.]
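The generation rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the affix and clitic inventories are tiny hypothetical lists, the `lexicon` check used to reject implausible strippings is an assumption, and compound splitting (rule 2) is left out for brevity.

```python
# Hypothetical, deliberately tiny inventories (assumptions for illustration,
# not the authors' actual affix or clitic lists).
PREFIXES = ("ber", "pel", "pe", "me")
SUFFIXES = ("an", "kan", "i")
CLITICS = ("nya", "lah", "kah")


def simpler_variants(word, lexicon):
    """Return candidate simpler forms of `word`, excluding `word` itself.

    `lexicon` is a set of known Malay wordforms; a stripped form is kept
    only if it appears there (an assumed filter).
    """
    found = set()
    frontier = [word]
    while frontier:
        current = frontier.pop()
        candidates = []
        # Rules 3 and 6: words on either side of a dash, and the
        # dash-segmented sequence itself.
        if "-" in current:
            parts = current.split("-")
            candidates.extend(parts)
            found.add(" - ".join(parts))
        # Rules 4 and 5: the word without its clitic, and the
        # clitic-segmented sequence.
        for clitic in CLITICS:
            base = current[: -len(clitic)]
            if current.endswith(clitic) and base in lexicon:
                candidates.append(base)
                found.add(f"{base} {clitic}")
        # Rule 1: affix stripping.
        for prefix in PREFIXES:
            if current.startswith(prefix):
                candidates.append(current[len(prefix):])
        for suffix in SUFFIXES:
            if current.endswith(suffix):
                candidates.append(current[: -len(suffix)])
        # Rule 7: recurse on every accepted form to get combinations.
        for cand in candidates:
            if cand in lexicon and cand != word and cand not in found:
                found.add(cand)
                frontier.append(cand)
    return found


lexicon = {"pelajar", "ajaran", "ajar", "kereta"}
print(sorted(simpler_variants("pelajaran", lexicon)))
print(sorted(simpler_variants("keretanya", lexicon)))
```

With this toy lexicon, the two calls reproduce the slide's examples for pelajaran and keretanya. A realistic system would need a full affix inventory and handling of the sound changes that accompany some Malay prefixes.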
Word-Level Paraphrases
Given a dev/test sentence:
1. We generate a list of variants {w’} for each Malay word w.
2. We add them to the sentence, thus forming a lattice.
Word-Level Paraphrases (cont.)
- The lattice requires a weight for each arc.
- We set the weight to 1.0 for the original word w.
- For each paraphrase w’ of w, we use the probability Pr(w’|w), estimated using word-level pivoting over English: [formula not preserved]
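The lattice construction and arc weighting can be sketched as follows. This is a simplified illustration under stated assumptions: the lattice is represented as one list of alternative arcs per source position (closer to a confusion network), so multi-token variants such as "kereta nya" would need a true lattice with intermediate nodes. The function name, the data layout, and the probability values are all hypothetical.

```python
def build_lattice(sentence, variants, paraphrase_prob):
    """Return one list of (token, weight) arcs per source position.

    variants:         maps a word w to its list of variants w'
    paraphrase_prob:  maps (w', w) to Pr(w'|w)
    """
    lattice = []
    for word in sentence.split():
        arcs = [(word, 1.0)]  # the original word always gets weight 1.0
        for variant in variants.get(word, []):
            arcs.append((variant, paraphrase_prob[(variant, word)]))
        lattice.append(arcs)
    return lattice


# Illustrative values only; these are not figures from the paper.
variants = {"membeli": ["beli"], "keretanya": ["kereta"]}
probs = {("beli", "membeli"): 0.4, ("kereta", "keretanya"): 0.3}
for position in build_lattice("dia mahu membeli keretanya", variants, probs):
    print(position)
```

Words with no variants contribute a single arc, so the decoder sees the original sentence as one path through the lattice.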
Word-Level Paraphrases (cont.)
Estimating the probability Pr(w’|w): [formula not preserved]
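The formula on this slide did not survive extraction. A minimal sketch, assuming the standard pivoting formulation in which Pr(w'|w) is the sum over English words e of Pr(e|w) · Pr(w'|e), might look like this. The two lexical translation tables are assumed to come from word alignments; their names and the toy values are hypothetical.

```python
from collections import defaultdict


def pivot_paraphrase_probs(p_e_given_w, p_w_given_e, word):
    """Estimate Pr(w'|w) by pivoting over English words.

    Assumed form: Pr(w'|w) = sum over e of Pr(e|w) * Pr(w'|e).
    p_e_given_w: maps a Malay word w to {english e: Pr(e|w)}
    p_w_given_e: maps an English word e to {malay w': Pr(w'|e)}
    """
    scores = defaultdict(float)
    for english, p_e in p_e_given_w.get(word, {}).items():
        for other, p_w in p_w_given_e.get(english, {}).items():
            if other != word:  # skip the trivial paraphrase w -> w
                scores[other] += p_e * p_w
    return dict(scores)


# Toy tables; the probabilities are illustrative, not from the paper.
p_e_given_w = {"keretanya": {"car": 0.5, "his": 0.5}}
p_w_given_e = {
    "car": {"kereta": 0.5, "keretanya": 0.5},
    "his": {"nya": 1.0},
}
print(pivot_paraphrase_probs(p_e_given_w, p_w_given_e, "keretanya"))
```

In practice the result would be restricted to the morphological variants generated for w, so that unrelated words sharing an English translation are not proposed as paraphrases.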
Sentence-Level Paraphrases
- The dev/test word-level paraphrases need matching phrases in the phrase table.
- We therefore paraphrase the training data at the sentence level:
  - For each paraphrasable word w and for each of its paraphrases w’, we create a version of the sentence with w substituted by w’.
  - We pair each paraphrased sentence with the original target.
Paraphrased bi-text:
  dia mahu membeli keretanya. || she wants to buy his car.
  dia mahu beli keretanya. || she wants to buy his car.
  dia mahu membeli kereta. || she wants to buy his car.
  dia mahu membeli kereta nya. || she wants to buy his car.
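The substitution procedure can be sketched as below. This is an illustrative sketch, assuming whitespace-tokenized sentences and a precomputed `variants` map (both hypothetical names); each output sentence differs from the original in exactly one word, as on the slide.

```python
def paraphrase_bitext(bitext, variants):
    """Expand a bi-text with one-word-substituted source sentences.

    bitext:   list of (source, target) sentence pairs
    variants: maps a source word w to its list of paraphrases w'
    Every paraphrased source is paired with the unchanged target.
    """
    expanded = []
    for source, target in bitext:
        tokens = source.split()
        expanded.append((source, target))  # keep the original pair
        for i, token in enumerate(tokens):
            for variant in variants.get(token, []):
                new_tokens = tokens[:i] + [variant] + tokens[i + 1:]
                expanded.append((" ".join(new_tokens), target))
    return expanded


bitext = [("dia mahu membeli keretanya .", "she wants to buy his car .")]
variants = {"membeli": ["beli"], "keretanya": ["kereta", "kereta nya"]}
for src, tgt in paraphrase_bitext(bitext, variants):
    print(f"{src} || {tgt}")
```

Substituting one word at a time keeps the expansion linear in the number of variants per sentence, rather than exponential as it would be if all combinations were generated.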
Sentence-Level Paraphrases (cont.)
- We build two phrase tables:
  - T_orig, from the original training bi-text
  - T_par, from the paraphrased bi-text
- We merge these tables:
  1. Keep all entries from T_orig.
  2. Add those phrase pairs from T_par that are not in T_orig.
  3. Add extra features:
     - F1: 1 if the entry came from T_orig, 0.5 otherwise.
     - F2: 1 if the entry came from T_par, 0.5 otherwise.
     - F3: 1 if the entry was in both tables, 0.5 otherwise.
- The feature weights are set using MERT, and the number of features is optimized on the development set.
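The merging steps can be sketched as follows. This is a minimal illustration in which a phrase table is a dictionary from (source phrase, target phrase) to a list of scores; that layout is an assumption, as is the reading that an entry present in both tables counts as having come from each of them (so F1 = F2 = F3 = 1).

```python
def merge_phrase_tables(t_orig, t_par):
    """Merge two phrase tables and append the features F1, F2, F3.

    Each table maps (source_phrase, target_phrase) to a list of scores.
    Scores for entries present in both tables are taken from t_orig.
    """
    merged = {}
    # Step 1: keep all entries from T_orig.
    for pair, scores in t_orig.items():
        in_both = pair in t_par
        f1 = 1.0
        f2 = 1.0 if in_both else 0.5
        f3 = 1.0 if in_both else 0.5
        merged[pair] = list(scores) + [f1, f2, f3]
    # Step 2: add the T_par entries that are missing from T_orig.
    for pair, scores in t_par.items():
        if pair not in t_orig:
            merged[pair] = list(scores) + [0.5, 1.0, 0.5]
    return merged


# Toy tables with a single illustrative score per entry.
t_orig = {("keretanya", "his car"): [0.6]}
t_par = {("keretanya", "his car"): [0.5], ("kereta", "his car"): [0.2]}
for pair, scores in merge_phrase_tables(t_orig, t_par).items():
    print(pair, scores)
```

Using 0.5 rather than 0 as the "off" value keeps the features strictly positive, which matters when a decoder takes their logarithm.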
Phrase-Level Paraphrases
We further augment the phrase table with an extra feature, which is calculated using phrase-level pivoting:
- 1, for phrase pairs coming from T_orig
- max_p Pr(p’|p), for phrase pairs coming from T_par
where p’ is a paraphrase of some original Malay phrase p.
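The feature definition above can be sketched as a small function. The names and the data layout are hypothetical, and the fallback value of 0.0 for a paraphrased phrase with no known source phrase is an assumption that the slide does not address.

```python
def phrase_pivot_feature(source_phrase, from_orig, paraphrase_prob):
    """Extra phrase-table feature based on phrase-level pivoting.

    paraphrase_prob maps a paraphrased phrase p' to {p: Pr(p'|p)},
    where each p is an original Malay phrase that p' paraphrases.
    """
    if from_orig:
        return 1.0  # phrase pairs coming from T_orig
    candidates = paraphrase_prob.get(source_phrase, {})
    # Phrase pairs coming from T_par: the best-scoring original phrase p.
    return max(candidates.values(), default=0.0)


# Illustrative probabilities only.
paraphrase_prob = {"membeli kereta": {"membeli keretanya": 0.3,
                                      "membeli keretaku": 0.1}}
print(phrase_pivot_feature("membeli keretanya", True, paraphrase_prob))
print(phrase_pivot_feature("membeli kereta", False, paraphrase_prob))
```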
Experiments and Evaluation
Data
- Training bi-text: 350K sentence pairs
  - English: 10.4M words
  - Malay: 9.7M words
- Development bi-text: 2,000 sentence pairs
  - English: 63.4K words
  - Malay: 58.5K words
- Testing bi-text: 1,420 sentences
  - Malay: 28.8K words
  - English: 32.8K, 32.4K, and 32.9K words (3 reference translations)
- Language model (LM): 49.8M English words
Evaluation Results: BLEU
Detailed BLEU
Improvement for all n-grams used in BLEU.
Evaluation With 5 Measures
Consistent improvement for 5 measures.
Example Translations
Conclusion
- We presented a novel approach to translating from a morphologically complex language. It uses paraphrases at three levels of translation:
  - word-level
  - phrase-level
  - sentence-level
- We demonstrated the potential of the approach on Malay, which is derivationally rich but has almost no inflectional morphology.
Future Work
- Improve the paraphrasing models:
  - use a richer sense similarity model that combines monolingual and bilingual similarity (Chen et al., 2010)
- Try phrase table paraphrasing instead of sentence-level paraphrasing (Nakov, 2008).
- Try other:
  - morphologically complex languages
  - SMT models
The presented work is supported by research grant POD