Source Language Adaptation for Resource-Poor Machine Translation. Pidong Wang, National University of Singapore; Preslav Nakov, QCRI, Qatar Foundation; Hwee Tou Ng, National University of Singapore.

Presentation transcript:

Source Language Adaptation for Resource-Poor Machine Translation Pidong Wang, National University of Singapore Preslav Nakov, QCRI, Qatar Foundation Hwee Tou Ng, National University of Singapore

Introduction

EMNLP-CoNLL 2012, July 12, 2012, Jeju, Korea

Overview
- Statistical Machine Translation (SMT) systems need large sentence-aligned bilingual corpora (bi-texts).
- Problem: such training bi-texts do not exist for most languages.
- Idea: adapt a bi-text of a related resource-rich language.

Idea & Motivation
- Reuse bi-texts from related resource-rich languages to improve resource-poor SMT.
- Related languages have:
  - overlapping vocabulary (cognates), e.g., casa ('house') in Spanish and Portuguese
  - similar word order and syntax

Resource-rich vs. Resource-poor Languages
- Related EU – non-EU languages:
  - Swedish – Norwegian
  - Bulgarian – Macedonian
- Related EU languages:
  - Spanish – Catalan
  - Czech – Slovak
  - Irish – Scottish Gaelic
  - Standard German – Swiss German
- Related languages outside Europe:
  - MSA – Dialectal Arabic (e.g., Egyptian, Gulf, Levantine, Iraqi)
  - Hindi – Urdu
  - Turkish – Azerbaijani
  - Russian – Ukrainian
  - Malay – Indonesian
We will explore these pairs.

Our main focus: Improving Indonesian-English SMT Using Malay-English

Malay vs. Indonesian
- Malay:
  - Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak.
  - Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
- Indonesian:
  - Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
  - Mereka dikaruniai akal dan hati nurani dan hendaknya bergaul satu sama lain dalam semangat persaudaraan.
(From Article 1 of the Universal Declaration of Human Rights; ~50% exact word overlap.)

Malay Can Look "More Indonesian"…
- Malay:
  - Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak.
  - Mereka mempunyai pemikiran dan perasaan hati dan hendaklah bertindak di antara satu sama lain dengan semangat persaudaraan.
- Malay post-edited to look "Indonesian" (by an Indonesian speaker):
  - Semua manusia dilahirkan bebas dan mempunyai martabat dan hak-hak yang sama.
  - Mereka mempunyai pemikiran dan perasaan dan hendaklah bergaul satu sama lain dalam semangat persaudaraan.
(From Article 1 of the Universal Declaration of Human Rights; ~75% exact word overlap after post-editing.)
We attempt to do this automatically: adapt Malay to look Indonesian, then use it to improve SMT.

Method at a Glance
- Step 1 (Adaptation): adapt the Malay side of the large Malay-English bi-text into "Indonesian", yielding an "Indonesian"-English bi-text (Malay-English is resource-rich, Indonesian-English resource-poor).
- Step 2 (Combination): combine the small Indonesian-English bi-text with the adapted "Indonesian"-English bi-text.
Note that we have no Malay-Indonesian bi-text!

Step 1: Adapting Malay-English to "Indonesian"-English

Word-Level Bi-text Adaptation: Overview
Given a Malay-English sentence pair:
1. Adapt the Malay sentence to "Indonesian", using:
   - word-level paraphrases
   - phrase-level paraphrases
   - cross-lingual morphology
2. Pair the adapted "Indonesian" sentence with the English side of the Malay-English sentence pair.
Thus, we generate a new "Indonesian"-English sentence pair.
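The word-level adaptation step can be sketched as follows. This is a minimal illustration, not the authors' implementation (which decodes a confusion network of paraphrase candidates with a large Indonesian language model): every Malay token is either kept or replaced by one of its Indonesian paraphrase candidates, and the hypothesis the LM scores highest wins. The `paraphrases` dictionary and `lm_score` function are stand-ins for the real paraphrase table and LM.

```python
from itertools import product

def adapt_sentence(ml_tokens, paraphrases, lm_score):
    """Word-level adaptation sketch: each Malay token is either kept
    or replaced by one of its Indonesian paraphrase candidates; the
    combination preferred by the Indonesian LM wins.  (A real system
    would decode a confusion network with beam search rather than
    enumerating all combinations.)"""
    options = [[tok] + paraphrases.get(tok, []) for tok in ml_tokens]
    best, best_score = None, float("-inf")
    for combo in product(*options):
        score = lm_score(combo)
        if score > best_score:
            best, best_score = list(combo), score
    return best
```

Exhaustive enumeration is exponential in sentence length; it is used here only to keep the sketch short.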

Word-Level Bi-text Adaptation: Overview
- Malay: KDNK Malaysia dijangka cecah 8 peratus pada tahun
- Decode using a large Indonesian LM.

Word-Level Bi-text Adaptation: Overview
- English: Malaysia's GDP is expected to reach 8 per cent in
- Pair each adapted sentence with its English counterpart.
- Thus, we generate a new "Indonesian"-English bi-text.

Word-Level Adaptation: Extracting Paraphrases
- Indonesian translations for Malay words: pivoting over English.
- A Malay sentence is word-aligned to its English translation in the ML-EN bi-text, and an Indonesian sentence is word-aligned to its English translation in the IN-EN bi-text; Malay and Indonesian words aligned to the same English words become paraphrase candidates, with weights accumulated over the two bi-texts.
- Note: we have no Malay-Indonesian bi-text, so we pivot.
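The pivoting step above can be sketched as follows; this is a toy illustration under the standard pivoting assumption p(in | ml) = Σ_en p(in | en) · p(en | ml), with made-up lexical probabilities, not figures from the paper.

```python
from collections import defaultdict

def pivot_paraphrases(p_en_given_ml, p_in_given_en):
    """Induce Malay -> Indonesian paraphrase probabilities by pivoting
    over English: p(in | ml) = sum over en of p(in | en) * p(en | ml).
    Both inputs map a source word to a {target word: probability} dict,
    as would be estimated from word alignments of the two bi-texts."""
    p_in_given_ml = defaultdict(lambda: defaultdict(float))
    for ml, en_dist in p_en_given_ml.items():
        for en, p_em in en_dist.items():
            for in_w, p_ie in p_in_given_en.get(en, {}).items():
                p_in_given_ml[ml][in_w] += p_em * p_ie
    return {ml: dict(d) for ml, d in p_in_given_ml.items()}
```

For example, if Malay "manusia" translates to English "human" (0.7) or "man" (0.3), and those English words translate into Indonesian "manusia"/"orang", the induced distribution sums the products along each English pivot.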

Word-Level Adaptation: Issue 1
- The IN-EN bi-text is small, thus the IN-EN word alignments are unreliable, which yields bad ML-IN paraphrases.
- Solution: improve the IN-EN alignments using the ML-EN bi-text:
  - concatenate k copies of IN-EN with ML-EN, where k ≈ |ML-EN| / |IN-EN|
  - run word alignment
  - keep the alignments for one copy of IN-EN only
- This works because of the cognates between Malay and Indonesian.
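The balanced-concatenation trick can be sketched as below; a minimal sketch assuming corpora are lists of sentence pairs and that size is measured in sentence pairs (the slide's |·| could equally be token counts).

```python
def concat_for_alignment(in_en, ml_en):
    """Alignment-improvement trick from the slide: repeat the small
    IN-EN bi-text k times (k ~ |ML-EN| / |IN-EN|) and append the large
    ML-EN bi-text, so the word aligner sees the two language pairs in
    balance.  After aligning the concatenation, only the alignments of
    the first IN-EN copy are kept."""
    k = max(1, round(len(ml_en) / max(1, len(in_en))))
    corpus = in_en * k + ml_en
    return corpus, k
```

The output would then be fed to a word aligner such as GIZA++, after which all but the first |IN-EN| alignment rows are discarded.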

Word-Level Adaptation: Issue 2
- The IN-EN bi-text is small, thus the IN vocabulary available for the ML-IN paraphrases is small.
- Solution: add cross-lingual morphological variants:
  - Given an ML word: seperminuman
  - Find its ML lemma: minum
  - Propose all known IN words sharing the same lemma: diminum, diminumkan, diminumnya, makan-minum, makananminuman, meminum, meminumkan, meminumnya, meminum-minuman, minum, minum-minum, minum-minuman, minuman, minumanku, minumannya, peminum, peminumnya, perminum, terminum
- Note: the IN variants come from a larger monolingual IN text.
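The lemma-sharing lookup can be sketched as follows; a minimal sketch in which `ml_lemma_of` stands in for a real Malay lemmatizer and `in_words_by_lemma` for an index built from monolingual Indonesian text (both toy dictionaries here, with only a few of the variants from the slide).

```python
def cross_lingual_variants(ml_word, ml_lemma_of, in_words_by_lemma):
    """Cross-lingual morphology sketch: map a Malay word to its lemma,
    then propose every Indonesian word that shares that lemma.  Returns
    a sorted list of candidate Indonesian variants (empty if the word
    or lemma is unknown)."""
    lemma = ml_lemma_of.get(ml_word)
    return sorted(in_words_by_lemma.get(lemma, []))
```

In the full system these variants widen the Indonesian side of the paraphrase table beyond what the small IN-EN bi-text alone can offer.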

Word-Level Adaptation: Issue 3
- Word-level pivoting ignores context (relying only on the LM) and cannot drop, insert, merge, split, or reorder words.
- Solution: phrase-level pivoting:
  - build ML-EN and EN-IN phrase tables
  - induce an ML-IN phrase table (pivoting over EN)
  - adapt the ML side of ML-EN to get an "IN"-EN bi-text, using the Indonesian LM and n-best "IN" as before
  - also use cross-lingual morphological variants
- This models context better (not only the Indonesian LM, but also phrases) and allows many word operations, e.g., insertion and deletion.
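Phrase-table induction by pivoting can be sketched the same way as the word-level case, now over phrases and with pruning; a toy sketch with single probabilities per entry (a real Moses-style phrase table carries several feature scores), and `top_k` is an illustrative pruning parameter, not a value from the paper.

```python
from collections import defaultdict

def pivot_phrase_table(ml_en_table, en_in_table, top_k=2):
    """Induce an ML -> IN phrase table by pivoting over English
    phrases, keeping only the top_k Indonesian options per Malay
    phrase.  Each table maps a source phrase to a
    {target phrase: probability} dict."""
    induced = defaultdict(lambda: defaultdict(float))
    for ml, en_dist in ml_en_table.items():
        for en, p1 in en_dist.items():
            for in_p, p2 in en_in_table.get(en, {}).items():
                induced[ml][in_p] += p1 * p2
    return {ml: sorted(d.items(), key=lambda kv: -kv[1])[:top_k]
            for ml, d in induced.items()}
```

Because entries are phrases rather than single words, the induced table can realize insertions, deletions, and local reorderings that word-level pivoting cannot.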

Step 2: Combining IN-EN + "IN"-EN

Combining IN-EN and "IN"-EN Bi-texts
- Simple concatenation: IN-EN + "IN"-EN
- Balanced concatenation: IN-EN * k + "IN"-EN
- Sophisticated phrase table combination (Nakov & Ng, EMNLP 2009; Nakov & Ng, JAIR 2012):
  - improved word alignments for IN-EN
  - phrase table combination with extra features
(Nakov & Ng, EMNLP 2009: "Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages")
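The "extra features" idea can be sketched as below; a minimal sketch in the spirit of Nakov & Ng (2009), not their implementation: the two phrase tables are merged, real scores win on conflicts, and each entry gains two 0/1 provenance indicators that the downstream tuner (e.g., MERT) can weight. The dictionary layout is an assumption for illustration.

```python
def combine_phrase_tables(pt_real, pt_adapted):
    """Merge a real IN-EN phrase table with an adapted "IN"-EN one.
    Keys are (source, target) phrase pairs; values are tuples of
    feature scores.  On a conflict the real table's scores are kept,
    and every entry records which table(s) it came from as two extra
    indicator features."""
    combined = {}
    for pair in set(pt_real) | set(pt_adapted):
        scores = pt_real.get(pair, pt_adapted.get(pair))
        combined[pair] = {
            "scores": scores,
            "from_real": float(pair in pt_real),
            "from_adapted": float(pair in pt_adapted),
        }
    return combined
```

Simple and balanced concatenation, by contrast, merge the bi-texts before training, so no table-level bookkeeping is needed.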

Experiments & Evaluation

Data
- Translation data (for IN-EN):
  - IN2EN-train: 0.9M
  - IN2EN-dev: 37K
  - IN2EN-test: 37K
  - EN-monolingual: 5M
- Adaptation data (for ML-EN → "IN"-EN):
  - ML2EN: 8.6M
  - IN-monolingual: 20M (tokens)

Isolated Experiments: Training on "IN"-EN Only
- (BLEU results shown as a figure in the original slides.)
- System combination using MEMT (Heafield and Lavie, 2010).

Combined Experiments: Training on IN-EN + "IN"-EN
- (BLEU results shown as a figure in the original slides.)

Experiments: Improvements
- (BLEU results shown as a figure in the original slides.)

Application to Other Languages & Domains
- Improve Macedonian-English SMT by adapting a Bulgarian-English bi-text.
- Adapt BG-EN (11.5M words) to "MK"-EN (1.2M words).
- Data: OPUS movie subtitles.
- (BLEU results shown as a figure in the original slides.)

Conclusion

Conclusion & Future Work
- We adapt bi-texts of related resource-rich languages, using:
  - confusion networks
  - word-level & phrase-level paraphrasing
  - cross-lingual morphological analysis
- Achieved:
  - +6.7 BLEU over ML2EN
  - +2.6 BLEU over IN2EN
  - a further BLEU gain over comb(IN2EN, ML2EN)
- Future work:
  - add split/merge as word operations
  - better integrate word-level and phrase-level methods
  - apply our methods to other languages & NLP problems
Thank you!
Supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

Further Analysis

Paraphrasing Non-Indonesian Malay Words Only
- (Results shown as a figure in the original slides.)
- Conclusion: we do need to paraphrase all words.

Human Judgments
- Question: is the adapted sentence better Indonesian than the original Malay sentence? (100 random sentences)
- Finding: morphology yields worse top-3 adaptations but better phrase tables, due to coverage.

Reverse Adaptation
- Idea: adapt the dev/test Indonesian input to "Malay", then translate with a Malay-English system.
- Input to SMT:
  - a "Malay" lattice
  - the 1-best "Malay" sentence from the lattice
- Finding: adapting dev/test is worse than adapting the training bi-text, so we need both the n-best list and the LM.

Related Work

Related Work (1)
- Machine translation between related languages, e.g.:
  - Cantonese–Mandarin (Zhang, 1998)
  - Czech–Slovak (Hajic et al., 2000)
  - Turkish–Crimean Tatar (Altintas & Cicekli, 2002)
  - Irish–Scottish Gaelic (Scannell, 2006)
  - Bulgarian–Macedonian (Nakov & Tiedemann, 2012)
- We do not translate (there is no training data); we "adapt".

Related Work (2)
- Adapting dialects to the standard language, e.g., Arabic (Bakr et al., 2008; Sawaf, 2010; Salloum & Habash, 2011):
  - manual rules
- Normalizing tweets and SMS (Aw et al., 2006; Han & Baldwin, 2011):
  - informal text: spelling, abbreviations, slang
  - same language

Related Work (3)
- Adapting Brazilian to European Portuguese (Marujo et al., 2011):
  - rule-based, language-dependent
  - tiny improvements for SMT
- Reusing bi-texts between related languages (Nakov & Ng, 2009):
  - no language adaptation (just transliteration)