Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng.


1 Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng

2 Overview

3 Overview (ACL’2011: Preslav Nakov & Hwee Tou Ng)
- Statistical machine translation (SMT) systems
  - typically assume that the word is the basic token unit of translation
- Problem
  - data sparseness for languages with rich morphology
- Our solution
  - a paraphrase-based approach to translating morphological variants

4 Introduction

5 Morphology in Statistical Machine Translation (SMT)
- Traditionally, the word has been the basic token unit of translation.
- The earliest SMT models (the IBM models) were proposed for French and English, which have little morphology.
- Most subsequent models remain word-atomic:
  - phrase-based
  - hierarchical
  - treelet
  - syntactic

6 Morphology in Statistical Machine Translation (SMT)
Word as an atomic token unit of translation
- Fine for languages with little morphology:
  - English, French, Spanish
  - Chinese (almost no morphology)
- Inadequate for morphologically rich languages:
  - Arabic, Turkish, Finnish
    - word inflections
    - word-attached clitics
  - German
    - compounds

7 The Case of Malay
- Malay language
  - rich derivational morphology
  - but poor in
    - word inflections (unlike Arabic, Turkish, Finnish)
    - word-attached clitics (unlike Arabic, Turkish, Finnish)
    - concatenated compounds (unlike German, Finnish)
- Problem: classic methods do not work for Malay
- Solution: paraphrasing techniques
  - word-level
  - phrase-level
  - sentence-level

8 Related Work

9 Related Work
Two general lines of research:
1. Inflected forms of the same word are used as equivalence classes or as possible alternatives in translation
   - stemming (Yang and Kirchhoff, 2006)
   - lemmatization (Al-Onaizan et al., 1999; Goldwater and McClosky, 2005; Dyer, 2007)
   - direct clustering (Talbot and Osborne, 2006)
   - factored models (Koehn and Hoang, 2007)
2. Word segmentation
   - compound words (Koehn and Knight, 2003; Yang and Kirchhoff, 2006)
   - clitics attached to the preceding word (Habash and Sadat, 2006)
   - morpheme sequence representations (Lee, 2004; Dyer et al., 2008; Dyer, 2009)
These approaches do not work well for Malay:
- it has very little inflectional morphology, if any
- compounds are not concatenated
- clitics are rare

10 Malay Morphology

11 The Malay Language
- Malay
  - Austronesian language
  - ~180M speakers
  - official in Malaysia, Indonesia, Singapore, and Brunei
  - two major standard versions (mutually intelligible)
    - Bahasa Malaysia (lit. ‘language of Malaysia’)
    - Bahasa Indonesia (lit. ‘language of Indonesia’)

12 The Malay Language
- Malay: an agglutinative language
  - very rich derivational morphology
  - but nearly non-existent inflectional morphology
- Inflectionally, Malay is like Chinese:
  - no grammatical gender, number, or tense
  - verbs are not marked for person, etc.

13 Malay Morphology
- New word formation processes
  - affixation
  - compounding
  - reduplication
- Other morphological processes
  - clitic attachment

14 New Word Formation Processes in Malay
- Affixation: attaching affixes, which are not words, to a word
  - prefixes (e.g., ajar/‘teach’ → pelajar/‘student’)
  - suffixes (e.g., ajar → ajaran/‘teachings’)
  - circumfixes (e.g., ajar → pengajaran/‘lesson’)
  - infixes (e.g., gigi/‘teeth’ → gerigi/‘toothed blade’)
- Compounding: putting two or more existing words together
  - e.g., kereta/‘car’ + api/‘fire’ → keretapi or kereta api
  - typically not concatenated
- Reduplication: word repetition
  - e.g., pelajar-pelajar/‘students’

15 Clitics in Malay
- Examples
  - duduk/‘sit down’ + lah → duduklah/‘please, sit down’
  - kereta + nya → keretanya/‘his car’
- Notes:
  - Clitics are not affixes.
  - Clitic attachment is NOT
    - a word inflection process
    - a word derivation process

16 Translating Malay Morphology: A Paraphrase-based Approach to Translating from Malay

17 Paraphrase-based Approach to Morphology
- Given a complex Malay word, we generate
  - morphologically simpler words from which it can be derived
  - alternative word segmentations
- We treat these forms as potential paraphrases of the original word.
- We use paraphrasing techniques at three levels:
  - word-level
  - phrase-level
  - sentence-level

18 Generating Simpler Morphological Variants
- Given a complex Malay word, we generate
  1. words obtainable by affix stripping
     - e.g., pelajaran → pelajar, ajaran, ajar
  2. words that are part of a compound word
     - e.g., kerjasama → kerja, sama
  3. words appearing on either side of a dash
     - e.g., adik-beradik → adik, beradik
  4. words without clitics
     - e.g., keretanya → kereta
  5. clitic-segmented word sequences
     - e.g., keretanya → kereta nya
  6. dash-segmented wordforms
     - e.g., aceh-nias → aceh - nias
  7. combinations of the above
- Examples:
  - adik-beradiknya → adik-beradiknya adik-beradik beradiknya beradik adik nya adik
  - berpelajaran → berpelajaran pelajaran pelajar ajaran ajar
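The variant types listed on this slide can be illustrated with a minimal Python sketch. It is not the authors' implementation: the affix and clitic inventories, the tiny `LEXICON`, and the function names `strip_affixes` and `variants` are all hypothetical, and the sketch uses a lexicon lookup to discard non-words, which the slide does not describe. Combinations of variant types (item 7) are left out.

```python
# Illustrative inventories only (assumptions, far from complete for Malay).
PREFIXES = ("ber", "pel", "pe", "me")
SUFFIXES = ("an", "kan", "i")
CLITICS = ("nya", "lah", "kah")

# Hypothetical toy lexicon used to filter out non-words.
LEXICON = {"ajar", "ajaran", "pelajar", "pelajaran",
           "kerja", "sama", "kereta", "adik", "beradik"}


def strip_affixes(word, min_stem=3):
    """Return every form reachable by repeatedly stripping one affix."""
    found, frontier = set(), [word]
    while frontier:
        current = frontier.pop()
        candidates = [current[len(p):] for p in PREFIXES if current.startswith(p)]
        candidates += [current[:-len(s)] for s in SUFFIXES if current.endswith(s)]
        for cand in candidates:
            if len(cand) >= min_stem and cand not in found:
                found.add(cand)
                frontier.append(cand)
    return found


def variants(word, lexicon=LEXICON):
    """Generate simpler variants of a word (variant types 1 to 6)."""
    out = set()
    # 1. Affix stripping, keeping only forms found in the lexicon.
    out.update(form for form in strip_affixes(word) if form in lexicon)
    # 2. Parts of a concatenated compound, if both halves are known words.
    for i in range(1, len(word)):
        if word[:i] in lexicon and word[i:] in lexicon:
            out.update((word[:i], word[i:]))
    # 3 and 6. Words on either side of a dash, and the dash-segmented form.
    if "-" in word:
        parts = word.split("-")
        out.update(parts)
        out.add(" - ".join(parts))
    # 4 and 5. The word without its clitic, and the clitic-segmented form.
    for clitic in CLITICS:
        if word.endswith(clitic) and len(word) > len(clitic):
            base = word[:-len(clitic)]
            out.add(base)
            out.add(f"{base} {clitic}")
    return out
```

For instance, `variants("keretanya")` yields the clitic-free and clitic-segmented forms from the slide, and `variants("berpelajaran")` yields the four simpler forms in the slide's second example.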

19 Word-Level Paraphrases
- Given a dev/test sentence:
  1. We generate a list of variants {w’} for each Malay word w.
  2. We add them to the sentence, thus forming a lattice.

20 Word-Level Paraphrases (cont.)
- The lattice requires a weight for each arc.
  - We set 1.0 for the original word w.
  - For each paraphrase w’ of w, we use the probability Pr(w’|w), estimated using word-level pivoting over English.
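A word lattice of this kind could be represented as a list of weighted arcs between numbered nodes. The sketch below is only an illustration under stated assumptions: the function name `build_lattice`, the edge tuple layout, and the probabilities in the usage note are hypothetical, and placing the paraphrase probability on the first arc of a multi-token variant (with 1.0 on the remaining arcs) is a choice made here, not something the slides specify.

```python
def build_lattice(words, paraphrases):
    """Build a word lattice as a list of (src, dst, token, weight) arcs.

    words:       the tokens of one dev/test sentence
    paraphrases: dict mapping a word w to {variant w': Pr(w'|w)}
    """
    edges = []
    # Nodes 0..len(words) form the backbone; extra nodes are numbered after it.
    next_free = len(words) + 1
    for i, word in enumerate(words):
        # The original word always gets weight 1.0.
        edges.append((i, i + 1, word, 1.0))
        for variant, prob in paraphrases.get(word, {}).items():
            tokens = variant.split()
            src = i
            for k, token in enumerate(tokens):
                if k == len(tokens) - 1:
                    dst = i + 1            # rejoin the backbone
                else:
                    dst = next_free        # intermediate node for a multi-token variant
                    next_free += 1
                # Assumption: the paraphrase probability sits on the first arc only.
                edges.append((src, dst, token, prob if k == 0 else 1.0))
                src = dst
    return edges
```

With the made-up probabilities `{"keretanya": {"kereta": 0.3, "kereta nya": 0.2}}`, the arc for keretanya is joined by a parallel arc for kereta and by a two-arc path kereta, nya through one extra node.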

21 Word-Level Paraphrases (cont.)
- Estimating the probability Pr(w’|w) (equation not preserved in this transcript)
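The transcript does not preserve the slide's equation. As an illustration only, the sketch below assumes the commonly used pivoting formulation, in which Pr(w’|w) is the sum over English words e of Pr(e|w) · Pr(w’|e); whether the slide used exactly this form is not confirmed here. The function name `pivot_probability` and the nested-dict table layout are hypothetical.

```python
def pivot_probability(w, w_prime, p_e_given_w, p_w_given_e):
    """Estimate Pr(w'|w) by pivoting over English words.

    Assumed formulation: Pr(w'|w) = sum over e of Pr(e|w) * Pr(w'|e).

    p_e_given_w: dict mapping a Malay word to {English word: Pr(e|w)}
    p_w_given_e: dict mapping an English word to {Malay word: Pr(w'|e)}
    """
    total = 0.0
    for e, p_e in p_e_given_w.get(w, {}).items():
        total += p_e * p_w_given_e.get(e, {}).get(w_prime, 0.0)
    return total
```

In practice the two lexical translation tables would be estimated from the word alignments of the training bi-text; here they are simply passed in as dictionaries.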

22 Sentence-Level Paraphrases
- Word-level paraphrases of the dev/test data need matching phrases in the phrase table.
- We therefore paraphrase the training data at the sentence level:
  - For each paraphrasable word w and each of its paraphrases w’, we create a version of the sentence with w substituted by w’.
  - We pair each paraphrased sentence with the original target sentence.
- Paraphrased bi-text:
  - dia mahu membeli keretanya. || she wants to buy his car.
  - dia mahu beli keretanya. || she wants to buy his car.
  - dia mahu membeli kereta. || she wants to buy his car.
  - dia mahu membeli kereta nya. || she wants to buy his car.
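The construction of the paraphrased bi-text can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `paraphrase_bitext` is hypothetical, sentences are assumed to be whitespace-tokenized (so the final period is a separate token), and exactly one word is substituted per generated sentence, as in the slide's example.

```python
def paraphrase_bitext(bitext, paraphrases):
    """Expand a bi-text with sentence-level paraphrases of the source side.

    bitext:      list of (source sentence, target sentence) pairs
    paraphrases: dict mapping a source word w to a list of paraphrases w'
    """
    expanded = []
    for source, target in bitext:
        expanded.append((source, target))  # keep the original pair
        words = source.split()
        for i, word in enumerate(words):
            for variant in paraphrases.get(word, ()):
                # Substitute one word at a time; the target side is unchanged.
                new_source = " ".join(words[:i] + [variant] + words[i + 1:])
                expanded.append((new_source, target))
    return expanded
```

Given the pair "dia mahu membeli keretanya ." / "she wants to buy his car ." and the paraphrases membeli → beli and keretanya → kereta, kereta nya, the function returns the four sentence pairs shown on the slide.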

23 Sentence-Level Paraphrases (cont.)
- We build two phrase tables:
  - T_orig from the original training bi-text
  - T_par from the paraphrased bi-text
- We merge these tables:
  1. Keep all entries from T_orig.
  2. Add those phrase pairs from T_par that are not in T_orig.
  3. Add extra features:
     - F1: 1 if the entry came from T_orig, 0.5 otherwise.
     - F2: 1 if the entry came from T_par, 0.5 otherwise.
     - F3: 1 if the entry was in both tables, 0.5 otherwise.
- The feature weights are set using MERT, and the number of features is optimized on the development set.
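The merging procedure can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: phrase tables are modelled as plain dictionaries from (source, target) pairs to lists of feature scores, the name `merge_phrase_tables` is hypothetical, and an entry found in both tables is treated as having come from T_par for the purposes of F2, which is one possible reading of the slide.

```python
def merge_phrase_tables(t_orig, t_par):
    """Merge two phrase tables and append the indicator features F1, F2, F3.

    Each table maps (source phrase, target phrase) -> list of feature scores.
    """
    merged = {}
    # Step 1: keep every entry from T_orig, with its original scores.
    for pair, scores in t_orig.items():
        in_both = pair in t_par
        f1 = 1.0                           # came from T_orig
        f2 = 1.0 if in_both else 0.5       # assumption: counts as from T_par if in both
        f3 = 1.0 if in_both else 0.5       # present in both tables
        merged[pair] = scores + [f1, f2, f3]
    # Step 2: add the T_par entries that T_orig does not have.
    for pair, scores in t_par.items():
        if pair not in merged:
            merged[pair] = scores + [0.5, 1.0, 0.5]
    return merged
```

Using 0.5 rather than 0 for the "otherwise" case follows the slide.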

24 Phrase-Level Paraphrases
- We further augment the phrase table with an extra feature, which is calculated using phrase-level pivoting:
  - 1, for phrase pairs coming from T_orig
  - max_p Pr(p’|p), for phrase pairs coming from T_par, where p’ is a paraphrase of some original Malay phrase p
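A sketch of how this extra feature might be computed is given below. The function name `pivot_feature`, the dictionary layout of `paraphrase_prob`, and the fallback value of 0.0 for a source phrase with no recorded paraphrase probability are all assumptions made for illustration; the slide does not say how that last case is handled.

```python
def pivot_feature(pair, t_orig, paraphrase_prob):
    """Extra phrase-table feature based on phrase-level pivoting.

    pair:            (source phrase, target phrase)
    t_orig:          collection of phrase pairs from the original table
    paraphrase_prob: dict mapping a paraphrased source phrase p' to
                     {original phrase p: Pr(p'|p)}
    """
    if pair in t_orig:
        return 1.0
    source, _target = pair
    # max over original phrases p of Pr(p'|p); 0.0 if none is known (assumption).
    return max(paraphrase_prob.get(source, {}).values(), default=0.0)
```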

25 Experiments and Evaluation

26 Data
- Training
  - bi-text: 350K sentence pairs
  - English: 10.4M words
  - Malay: 9.7M words
- Development
  - bi-text: 2,000 sentence pairs
  - English: 63.4K words
  - Malay: 58.5K words
- Testing
  - bi-text: 1,420 sentences
  - Malay: 28.8K words
  - English: 32.8K, 32.4K, and 32.9K words (3 reference translations)
- LM
  - 49.8M English words

27 Evaluation Results: BLEU

28 Detailed BLEU
- Improvement for all n-grams used in BLEU

29 Evaluation With 5 Measures
- Consistent improvement for 5 measures

30 Example Translations

31 Conclusion

32 Conclusion
- Presented a novel approach to translating from a morphologically complex language
  - uses paraphrases at three levels of translation
    - word-level
    - phrase-level
    - sentence-level
- Demonstrated the potential of the approach for Malay
  - derivationally rich
  - but almost no inflectional morphology

33 Future Work
- Improve the paraphrasing models
  - use a richer sense similarity model that combines monolingual and bilingual similarity (Chen et al., 2010)
- Try phrase table paraphrasing
  - instead of sentence-level paraphrasing (Nakov, 2008)
- Try other
  - morphologically complex languages
  - SMT models
The presented work is supported by research grant POD0713875.

