Source Language Adaptation for Resource-Poor Machine Translation
Pidong Wang, National University of Singapore
Preslav Nakov, QCRI, Qatar Foundation
Hwee Tou Ng, National University of Singapore

Introduction

EMNLP-CoNLL 2012, July 12, 2012, Jeju, Korea

Overview
- Statistical machine translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).
- Problem: such training bi-texts do not exist for most languages.
- Idea: adapt a bi-text for a related resource-rich language.

Idea & Motivation
- Idea: reuse bi-texts from related resource-rich languages to improve resource-poor SMT.
- Related languages have
  - overlapping vocabulary (cognates), e.g., casa (‘house’) in Spanish and Portuguese
  - similar word order
  - similar syntax

Resource-rich vs. Resource-poor Languages
- Related EU – non-EU languages
  - Swedish – Norwegian
  - Bulgarian – Macedonian
- Related EU languages
  - Spanish – Catalan
  - Czech – Slovak
  - Irish – Scottish Gaelic
  - Standard German – Swiss German
- Related languages outside Europe
  - MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
  - Hindi – Urdu
  - Turkish – Azerbaijani
  - Russian – Ukrainian
  - Malay – Indonesian
We will explore Malay – Indonesian and Bulgarian – Macedonian.

Our Main Focus: Improving Indonesian-English SMT Using Malay-English

Malay vs. Indonesian
(from Article 1 of the Universal Declaration of Human Rights)
- Malay
  - Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak.
  - Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
- Indonesian
  - Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
  - Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.
~50% exact word overlap
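The slide does not say how the "exact word overlap" percentage was computed. The following is a minimal sketch under one possible definition: the fraction of tokens in one text that also occur in the other, counted as a multiset. The function names, the whitespace tokenization, and the punctuation stripping are all assumptions made for illustration, so the numbers it produces need not match the slide.

```python
from collections import Counter
import string


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def exact_word_overlap(source, target):
    """Fraction of source tokens that also occur in target (multiset match).

    This definition is an assumption; the slides do not state the metric.
    """
    src = Counter(tokenize(source))
    tgt = Counter(tokenize(target))
    total = sum(src.values())
    if total == 0:
        return 0.0
    shared = sum((src & tgt).values())  # multiset intersection
    return shared / total


# Two of the four source tokens ("semua", "dilahirkan") appear in the target.
print(exact_word_overlap("Semua manusia dilahirkan bebas",
                         "Semua orang dilahirkan merdeka"))  # 0.5
```

Note that this measure is asymmetric: swapping the two arguments changes the denominator.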

Malay Can Look “More Indonesian”…
(from Article 1 of the Universal Declaration of Human Rights)
- Malay
  - Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak.
  - Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
- Malay post-edited to look “Indonesian” (by an Indonesian speaker)
  - Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama.
  - Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.
~75% exact word overlap
We attempt to do this automatically: adapt Malay to look Indonesian, then use it to improve SMT.

Method at a Glance
- Starting point: a small (resource-poor) Indonesian-English bi-text and a large (resource-rich) Malay-English bi-text.
- Step 1: Adaptation. Adapt the Malay side of the Malay-English bi-text to obtain an “Indonesian”-English bi-text.
- Step 2: Combination. Combine the Indonesian-English and “Indonesian”-English bi-texts.
Note that we have no Malay-Indonesian bi-text!

Step 1: Adapting Malay-English to “Indonesian”-English

Word-Level Bi-text Adaptation: Overview
Given a Malay-English sentence pair:
1. Adapt the Malay sentence to “Indonesian” using
   - word-level paraphrases
   - phrase-level paraphrases
   - cross-lingual morphology
2. Pair the adapted “Indonesian” sentence with the English side of the Malay-English sentence pair.
Thus, we generate a new “Indonesian”-English sentence pair.

Word-Level Bi-text Adaptation: Overview
- Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010.
- Decode using a large Indonesian LM.
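To make the decoding step concrete, here is a minimal sketch of picking the best path through a confusion network (one slot per Malay word, each slot holding weighted alternatives) under a bigram language model. Everything specific in it is an assumption for illustration: the function names, the toy alternatives and weights, and the tiny bigram table standing in for a large Indonesian LM. It returns only the single best path, whereas the talk uses n-best output.

```python
import math


def decode_confusion_network(network, bigram_logprob, lm_weight=1.0):
    """Return the highest-scoring word sequence through a confusion network.

    network: list of slots; each slot is a list of (word, probability) options.
    bigram_logprob: function (previous_word, word) -> log-probability.
    """
    # best[word] = (score, path) for the best path ending in `word` so far.
    best = {"<s>": (0.0, [])}
    for slot in network:
        new_best = {}
        for word, prob in slot:
            candidates = [
                (score + math.log(prob) + lm_weight * bigram_logprob(prev, word),
                 path + [word])
                for prev, (score, path) in best.items()
            ]
            new_best[word] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])[1]


# Toy stand-in for an Indonesian LM: a few favoured bigrams, a flat penalty otherwise.
TOY_BIGRAMS = {
    ("<s>", "pdb"): -0.5,
    ("pdb", "malaysia"): -0.5,
    ("malaysia", "diperkirakan"): -0.5,
}


def toy_lm(prev, word):
    return TOY_BIGRAMS.get((prev, word), -5.0)


# Invented alternatives and weights for the first three words of the example.
toy_network = [
    [("kdnk", 0.6), ("pdb", 0.4)],
    [("malaysia", 1.0)],
    [("dijangka", 0.5), ("diperkirakan", 0.5)],
]

print(decode_confusion_network(toy_network, toy_lm))
# ['pdb', 'malaysia', 'diperkirakan']
```

In this toy setup the language model outweighs the slightly higher paraphrase weight of "kdnk", which is the effect the slide relies on: the LM steers the choice among alternatives toward fluent Indonesian.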

Word-Level Bi-text Adaptation: Overview
- English: Malaysia’s GDP is expected to reach 8 per cent in 2010.
- Pair each adapted “Indonesian” sentence with the English counterpart.
Thus, we generate a new “Indonesian”-English bi-text.
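The pairing step can be sketched as follows. This is an illustrative outline only: `adapt_nbest` is a hypothetical callable standing in for the adaptation decoder, and the default of three candidates per sentence is arbitrary.

```python
def build_adapted_bitext(malay_english_pairs, adapt_nbest, n=3):
    """Pair every n-best adapted sentence with the original English side.

    malay_english_pairs: iterable of (malay_sentence, english_sentence).
    adapt_nbest: hypothetical function (malay_sentence, n) -> list of up to n
        adapted "Indonesian" sentences, best first.
    """
    adapted_bitext = []
    for malay, english in malay_english_pairs:
        for indonesian_like in adapt_nbest(malay, n):
            adapted_bitext.append((indonesian_like, english))
    return adapted_bitext


def fake_adapter(sentence, n):
    """Stand-in adapter for demonstration: tags the sentence n times."""
    return [f"{sentence} <variant {i}>" for i in range(1, n + 1)]


pairs = [("KDNK Malaysia dijangka cecah 8 peratus pada tahun 2010 .",
          "Malaysia's GDP is expected to reach 8 per cent in 2010 .")]
for source, target in build_adapted_bitext(pairs, fake_adapter, n=2):
    print(source, "|||", target)
```

Each Malay-English pair therefore contributes up to n pairs to the new bi-text, all sharing the same English side.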

Word-Level Adaptation: Extracting Paraphrases
- Indonesian translations for Malay: pivoting over English
- Weights (formula not preserved in this transcript)
- ML-EN bi-text: Malay sentence ML1 ML2 ML3 ML4 ML5, aligned to English sentence EN1 EN2 EN3 EN4
- IN-EN bi-text: English sentence EN11 EN3 EN12, aligned to Indonesian sentence IN1 IN2 IN3 IN4
- The two English sentences share EN3, which serves as the pivot.
Note: we have no Malay-Indonesian bi-text, so we pivot.
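The weighting formula from this slide did not survive in the transcript. A common way to weight pivoted paraphrases, assumed here rather than taken from the slide, is to marginalize over the English pivot: p(in | ml) = sum over en of p(en | ml) * p(in | en). The sketch below implements that assumption; the function name and the toy probabilities are illustrative.

```python
from collections import defaultdict


def pivot_paraphrases(p_en_given_ml, p_in_given_en):
    """Induce p(in | ml) by summing p(en | ml) * p(in | en) over English pivots.

    Both arguments are nested dicts: {source_word: {target_word: probability}}.
    """
    p_in_given_ml = defaultdict(lambda: defaultdict(float))
    for ml_word, en_probs in p_en_given_ml.items():
        for en_word, p_en in en_probs.items():
            for in_word, p_in in p_in_given_en.get(en_word, {}).items():
                p_in_given_ml[ml_word][in_word] += p_en * p_in
    return {ml: dict(options) for ml, options in p_in_given_ml.items()}


# Toy lexical probabilities, invented for illustration.
ml_to_en = {"peratus": {"percent": 0.8, "per": 0.2}}
en_to_in = {"percent": {"persen": 0.9, "peratus": 0.1}, "per": {"per": 1.0}}

for in_word, prob in sorted(pivot_paraphrases(ml_to_en, en_to_in)["peratus"].items()):
    print(in_word, round(prob, 2))
```

With these toy numbers, "persen" receives most of the probability mass for Malay "peratus" because both pass through the English pivot "percent".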

Word-Level Adaptation: Issue 1
The IN-EN bi-text is small, thus:
- Unreliable IN-EN word alignments lead to bad ML-IN paraphrases.
- Solution: improve the IN-EN alignments using the ML-EN bi-text
  - concatenate: IN-EN * k + ML-EN, with k ≈ |ML-EN| / |IN-EN|
  - run word alignment on the concatenation
  - keep the alignments for one copy of IN-EN only
Works because of cognates between Malay and Indonesian.
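The concatenation recipe can be sketched in a few lines. This is a simplified illustration: bi-texts are plain Python lists of sentence pairs, sizes are counted in sentence pairs (the slide does not say whether |·| counts sentences or tokens), and the word aligner itself is left out. The function names are hypothetical.

```python
def balanced_concatenation(small_bitext, large_bitext):
    """Repeat the small bi-text k times, then append the large one.

    k is |large| / |small| rounded to the nearest integer, and at least 1.
    Returns the concatenated bi-text and the k that was used.
    """
    if not small_bitext:
        raise ValueError("small bi-text must not be empty")
    k = max(1, round(len(large_bitext) / len(small_bitext)))
    return small_bitext * k + large_bitext, k


def keep_one_copy(alignments, small_size):
    """Keep the alignments for the first copy of the small bi-text only."""
    return alignments[:small_size]


in_en = [("in1", "en1"), ("in2", "en2")]
ml_en = [(f"ml{i}", f"en{i}") for i in range(10)]

combined, k = balanced_concatenation(in_en, ml_en)
print(k, len(combined))  # 5 20

# After running a word aligner over `combined` (not shown), only the
# alignments of the first len(in_en) sentence pairs would be kept.
```

Repeating the small bi-text keeps it from being swamped by the large one during alignment, while the large one still contributes statistics for shared words.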

Word-Level Adaptation: Issue 2
The IN-EN bi-text is small, thus:
- Small IN vocabulary for the ML-IN paraphrases
- Solution: add cross-lingual morphological variants
  - Given ML word: seperminuman
  - Find ML lemma: minum
  - Propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum
Note: The IN variants are from a larger monolingual IN text.
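The lookup described above amounts to indexing the Indonesian vocabulary by lemma. The sketch below assumes a `lemmatize` function is available; the one shown is a toy lookup table invented for the example, not a real Malay or Indonesian lemmatizer, and the function names are hypothetical.

```python
from collections import defaultdict


def build_lemma_index(indonesian_words, lemmatize):
    """Group Indonesian word forms by their lemma."""
    index = defaultdict(set)
    for word in indonesian_words:
        index[lemmatize(word)].add(word)
    return index


def cross_lingual_variants(malay_word, lemmatize, lemma_index):
    """Return all known Indonesian forms sharing the Malay word's lemma."""
    return sorted(lemma_index.get(lemmatize(malay_word), set()))


# Toy lemmatizer: a lookup table covering only the words used below.
TOY_LEMMAS = {
    "seperminuman": "minum",
    "diminum": "minum",
    "minuman": "minum",
    "peminum": "minum",
    "makanan": "makan",
}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


index = build_lemma_index(["diminum", "minuman", "peminum", "makanan"], toy_lemmatize)
print(cross_lingual_variants("seperminuman", toy_lemmatize, index))
# ['diminum', 'minuman', 'peminum']
```

Because the index is built from monolingual Indonesian text, it can propose forms that never occur in the small IN-EN bi-text.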

Word-Level Adaptation: Issue 3
- Word-level pivoting
  - ignores context, and relies on the LM
  - cannot drop/insert/merge/split/reorder words
- Solution: phrase-level pivoting
  - Build ML-EN and EN-IN phrase tables
  - Induce an ML-IN phrase table (pivoting over EN)
  - Adapt the ML side of ML-EN to get an “IN”-EN bi-text, using the Indonesian LM and n-best “IN” as before
  - Also, use cross-lingual morphological variants
- Phrase-level pivoting models context better: not only the Indonesian LM, but also phrases.
- It allows many word operations, e.g., insertion, deletion.

Step 2: Combining IN-EN + “IN”-EN

Combining IN-EN and “IN”-EN bi-texts
- Simple concatenation: IN-EN + “IN”-EN
- Balanced concatenation: IN-EN * k + “IN”-EN
- Sophisticated phrase table combination (Nakov and Ng, EMNLP 2009), (Nakov and Ng, JAIR 2012)
  - Improved word alignments for IN-EN
  - Phrase table combination with extra features
Reference: Preslav Nakov and Hwee Tou Ng. Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. EMNLP 2009.
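As a rough illustration of what "phrase table combination with extra features" can look like, the sketch below merges two phrase tables, prefers the scores from the first table when both contain the same phrase pair, and appends two indicator features recording where each pair came from. This is a simplified scheme assumed for the example; it is not claimed to match the exact features used in the cited papers, and the function name and toy scores are hypothetical.

```python
def combine_phrase_tables(primary, secondary):
    """Merge two phrase tables, preferring `primary` scores on conflict.

    Each table maps (source_phrase, target_phrase) -> list of feature scores.
    Two indicator features are appended to every entry:
    1.0/0.0 for "present in primary" and 1.0/0.0 for "present in secondary".
    """
    combined = {}
    for pair in primary.keys() | secondary.keys():
        scores = primary[pair] if pair in primary else secondary[pair]
        combined[pair] = list(scores) + [float(pair in primary),
                                         float(pair in secondary)]
    return combined


# Toy tables with a single translation score per phrase pair.
in_en_table = {("persen", "percent"): [0.9]}
adapted_table = {("persen", "percent"): [0.7], ("pdb", "gdp"): [0.8]}

merged = combine_phrase_tables(in_en_table, adapted_table)
for pair in sorted(merged):
    print(pair, merged[pair])
```

The indicator features let the tuning step learn how much to trust phrase pairs that come only from the adapted bi-text.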

Experiments & Evaluation

Data
- Translation data (for IN-EN)
  - IN2EN-train: 0.9M
  - IN2EN-dev: 37K
  - IN2EN-test: 37K
  - EN-monoling.: 5M
- Adaptation data (for ML-EN → “IN”-EN)
  - ML2EN: 8.6M
  - IN-monoling.: 20M (tokens)

Isolated Experiments: Training on “IN”-EN only
(Figure: BLEU scores.)
System combination using MEMT (Heafield and Lavie, 2010)

Combined Experiments: Training on IN-EN + “IN”-EN
(Figure: BLEU scores.)

Experiments: Improvements
(Figure: BLEU scores.)

Application to Other Languages & Domains
- Improve Macedonian-English SMT by adapting a Bulgarian-English bi-text
- Adapt BG-EN (11.5M words) to “MK”-EN (1.2M words)
- OPUS movie subtitles
(Figure: BLEU scores.)

Conclusion

Conclusion & Future Work
- Adapt bi-texts for related resource-rich languages, using
  - confusion networks
  - word-level & phrase-level paraphrasing
  - cross-lingual morphological analysis
- Achieved:
  - +6.7 BLEU over ML2EN
  - +2.6 BLEU over IN2EN
  - +1.5-3.0 BLEU over comb(IN2EN,ML2EN)
- Future work
  - add split/merge as word operations
  - better integrate word-level and phrase-level methods
  - apply our methods to other languages & NLP problems
Thank you!
Supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

Further Analysis

Paraphrasing Non-Indonesian Malay Words Only
So, we do need to paraphrase all words.

Human Judgments
- Question: Is the adapted sentence better Indonesian than the original Malay sentence?
- 100 random sentences
- Morphology yields worse top-3 adaptations but better phrase tables, due to coverage.

Reverse Adaptation
- Idea: adapt dev/test Indonesian input to “Malay”, then translate with a Malay-English system
- Input to SMT:
  - “Malay” lattice
  - 1-best “Malay” sentence from the lattice
- Adapting dev/test is worse than adapting the training bi-text. So, we need both n-best and LM.

Related Work

Related Work (1)
- Machine translation between related languages, e.g.,
  - Cantonese–Mandarin (Zhang, 1998)
  - Czech–Slovak (Hajic & al., 2000)
  - Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
  - Irish–Scottish Gaelic (Scannell, 2006)
  - Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
- We do not translate (no training data), we “adapt”.

Related Work (2)
- Adapting dialects to standard language (e.g., Arabic) (Bakr & al., 2008; Sawaf, 2010; Salloum & Habash, 2011)
  - manual rules
- Normalizing Tweets and SMS (Aw & al., 2006; Han & Baldwin, 2011)
  - informal text: spelling, abbreviations, slang
  - same language

Related Work (3)
- Adapt Brazilian to European Portuguese (Marujo & al., 2011)
  - rule-based, language-dependent
  - tiny improvements for SMT
- Reuse bi-texts between related languages (Nakov & Ng, 2009)
  - no language adaptation (just transliteration)
