Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages. Preslav Nakov and Hwee Tou Ng, Dept. of Computer Science, National University of Singapore.

Presentation transcript:

Improved Statistical Machine Translation for Resource-Poor Languages Using Related Resource-Rich Languages
Preslav Nakov and Hwee Tou Ng, Dept. of Computer Science, National University of Singapore
CHIME seminar, National University of Singapore, July 7, 2009

Overview

- Statistical Machine Translation (SMT) systems rely on large sentence-aligned bilingual corpora (bi-texts).
- Problem: such large bi-texts do not exist for most languages, and building them is hard.
- Our (partial) solution: use bi-texts for a resource-rich language to build a better SMT system for a related resource-poor language.

Introduction

Building an SMT System for a New Language Pair
- In theory: only requires a few hours/days.
- In practice: large bi-texts are needed.
- Such large bi-texts are available for
  - Arabic
  - Chinese
  - the official languages of the EU
  - some other languages
- However, most of the 6,500+ world languages remain resource-poor from an SMT viewpoint.

Building a Bi-text for SMT
- Small bi-text: relatively easy to build.
- Large bi-text: hard to get, e.g., because of copyright.
- Sources: parliament debates and legislation
  - national: Canada, Hong Kong
  - international: United Nations; European Union (Europarl, Acquis)
- Becoming an official language of the EU is an easy recipe for getting rich in bi-texts quickly.
- Not all languages are so “lucky”, but many can still benefit.

Idea: Use Related Languages
- Idea: use bi-texts for a resource-rich language to build a better SMT system for a related resource-poor language.
- Related languages have
  - overlapping vocabulary (cognates), e.g., casa (‘house’) in Spanish and Portuguese
  - similar word order
  - similar syntax

Example (1): Malay–Indonesian
- Malay: Semua manusia dilahirkan bebas dan samarata dari segi kemuliaan dan hak-hak.
- Indonesian: Semua orang dilahirkan merdeka dan mempunyai martabat dan hak-hak yang sama.
- English: All human beings are born free and equal in dignity and rights.
(from Article 1 of the Universal Declaration of Human Rights)

Example (2): Spanish–Portuguese
- Spanish: Señora Presidenta, estimados colegas, lo que está sucediendo en Oriente Medio es una tragedia.
- Portuguese: Senhora Presidente, caros colegas, o que está a acontecer no Médio Oriente é uma tragédia.
- English: Madam President, ladies and gentlemen, the events in the Middle East are a real tragedy.

Some Languages That Could Benefit
- Related non-EU–EU language pairs
  - Norwegian – Swedish
  - Moldavian [1] – Romanian
  - Macedonian [2] – Bulgarian
  - (future) Serbian [3], Bosnian [3], Montenegrin [3] (related to Croatian [3])
- Related EU languages
  - Czech – Slovak
  - Spanish – Portuguese
- Related languages outside Europe
  - Malay – Indonesian
Notes:
[1] Moldavian is not recognized by Romania.
[2] Macedonian is not recognized by Bulgaria and Greece.
[3] Serbian, Bosnian, Montenegrin and Croatian were Serbo-Croatian until 1991; Croatia is due to join the EU soon.
We will explore two of these pairs: Spanish–Portuguese and Malay–Indonesian.

How Phrase-Based SMT Systems Work: A Very Brief Overview

Phrase-based SMT
1. The sentence is segmented into phrases.
2. Each phrase is translated in isolation.
3. The phrases are reordered.

Phrases: Learnt from a Bi-text

Sample Phrase Table (source ||| target; scores not shown)
" -- as in ||| " - como en el caso de
" -- as in ||| " - como en el caso
" -- as in ||| " - como en el
" -- as in ||| " - como en
is more ||| es mucho más
is more ||| es más un club de
is more ||| es más un club
is more ||| es más un
is more ||| es más

Training and Tuning a Baseline System
1. M:1 and 1:M word alignments: IBM models.
2. M:M word alignments: intersect + grow (Och & Ney ’03).
3. Phrase pairs: alignment template (Och & Ney ’04).
4. Minimum error rate training (MERT) (Och ’03), tuning the weights of:
   - forward/backward translation probability
   - forward/backward lexical weight probability
   - language model probability
   - phrase penalty
   - word penalty
   - distortion cost
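Phrase-based systems usually combine features like those listed above in a log-linear model, and MERT searches for the weights. The following is a minimal sketch of that scoring step only; the feature names, values, and weights are hypothetical and do not come from the slides.

```python
import math


def loglinear_score(feature_values, weights):
    """Return the weighted sum of (log-domain) feature values for one hypothesis."""
    return sum(weights[name] * feature_values[name] for name in weights)


# Hypothetical log-domain feature values for one translation hypothesis.
hypothesis = {
    "tm_forward": math.log(0.5),   # forward translation probability
    "lm": math.log(0.25),          # language model probability
    "word_penalty": 3.0,           # number of target words
}

# Hypothetical weights; in a real system MERT would set these.
weights = {"tm_forward": 0.5, "lm": 1.0, "word_penalty": -0.5}

score = loglinear_score(hypothesis, weights)
```

A decoder would compute such a score for every candidate translation and keep the highest-scoring one.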

Using an Additional Related Language: Simple Bi-text Combination Strategies

Problem Definition
- We want improved SMT
  - from a resource-poor source language X1
  - into a resource-rich target language Y
- Given
  - a small bi-text for X1-Y
  - a much larger bi-text for X2-Y, for a resource-rich language X2 closely related to X1
- Notation
  - source languages: X1 (resource-poor), X2 (resource-rich)
  - target language: Y

Basic Bi-text Combination Strategies
- Merge bi-texts
- Build and use separate phrase tables
- Extension: transliteration

Merging Bi-texts
- Summary: concatenate X1-Y and X2-Y.
- Advantages:
  - improved word alignments, e.g., for rare words
  - more translation options
  - fewer unknown words
  - useful non-compositional phrases (improved fluency)
  - phrases with words for language X2 that do not exist in X1 are effectively ignored at translation time
- Disadvantages:
  - the additional bi-text X2-Y will dominate, since it is larger
  - phrases from X1-Y and X2-Y cannot be distinguished

Separate Phrase Tables
- Summary: build two separate phrase tables, then
  - (a) use them together: as alternative decoding paths
  - (b) merge them: using features to indicate the bi-text each phrase entry came from
  - (c) interpolate them: e.g., using a linear interpolation
- Advantages:
  - phrases from X1-Y and X2-Y can be distinguished
  - the larger bi-text X2-Y does not dominate X1-Y
  - more translation options
  - probabilities are combined in a principled manner
- Disadvantages:
  - improved word alignments are not possible

Extension: Transliteration
- Both strategies rely on cognates.
- Linguistics
  - Def: words derived from a common root, e.g., Latin tu (‘2nd person singular’), Old English thou, Spanish tú, German du, Greek sú
  - Orthography/phonetics/semantics: ignored.
- Computational linguistics
  - Def: words in different languages that are mutual translations and have a similar orthography, e.g., evolution vs. evolución vs. evolução
  - Orthography & semantics: important.
  - Origin: ignored.
- Wordforms can differ:
  - night vs. nacht vs. nuit vs. noite vs. noch
  - star vs. estrella vs. stella vs. étoile
  - arbeit vs. rabota vs. robota (‘work’)
  - father vs. père
  - head vs. chef

Spelling Differences Between Cognates
- Systematic spelling differences (Spanish – Portuguese)
  - different spelling:
    - -nh- vs. -ñ- (senhor vs. señor)
  - phonetic:
    - -ción vs. -ção (evolución vs. evolução)
    - -é vs. -ei (1st sing. past) (visité vs. visitei)
    - -ó vs. -ou (3rd sing. past) (visitó vs. visitou)
- Occasional differences
  - Spanish – Portuguese
    - decir vs. dizer (‘to say’)
    - Mario vs. Mário
    - María vs. Maria
  - Malay – Indonesian
    - kerana vs. karena (‘because’)
    - Inggeris vs. Inggris (‘English’)
    - mahu vs. mau (‘want’)
- Many of these differences can be learned automatically.

Experiments

Language Pairs
- Use the following language pairs:
  - Indonesian→English (resource-poor), using Malay→English (resource-rich)
  - Spanish→English (resource-poor), using Portuguese→English (resource-rich)
- We just pretend that Spanish is resource-poor.

Datasets
- Indonesian-English (in-en): 28,383 sentence pairs (0.8M, 0.9M words); monolingual English en_in: 5.1M words.
- Malay-English (ml-en): 190,503 sentence pairs (5.4M, 5.8M words); monolingual English en_ml: 27.9M words.
- Spanish-English (es-en): 1,240,518 sentence pairs (35.7M, 34.6M words); monolingual English en_es:pt: 45.3M words (same for pt-en).
- Portuguese-English (pt-en): 1,230,038 sentence pairs (35.9M, 34.6M words); monolingual English en_es:pt: 45.3M words (same for es-en).

Using an Additional Bi-text (1)
- Two-tables: build two separate phrase tables and use them as alternative decoding paths (Birch et al., 2007).

Using an Additional Bi-text (2)
- Interpolation: build two separate phrase tables, T_orig and T_extra, and combine them using linear interpolation:
  Pr(e|s) = α Pr_orig(e|s) + (1 − α) Pr_extra(e|s).
- The value of α is optimized on the development dataset, trying the following values: 0.5, 0.6, 0.7, 0.8, and 0.9.
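The interpolation formula can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: phrase tables are modelled as plain dictionaries from (source, target) pairs to a single probability, and a pair missing from one table is assumed to have probability 0 there. Both choices, and all names, are assumptions.

```python
def interpolate_tables(t_orig, t_extra, alpha):
    """Linearly interpolate two phrase tables.

    Implements Pr(e|s) = alpha * Pr_orig(e|s) + (1 - alpha) * Pr_extra(e|s),
    treating a pair that is absent from a table as having probability 0 there.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    combined = {}
    for pair in set(t_orig) | set(t_extra):
        p_orig = t_orig.get(pair, 0.0)
        p_extra = t_extra.get(pair, 0.0)
        combined[pair] = alpha * p_orig + (1.0 - alpha) * p_extra
    return combined


# Toy tables with made-up probabilities.
t_orig = {("casa", "house"): 0.8}
t_extra = {("casa", "house"): 0.4, ("casa", "home"): 0.6}

# The slides try alpha in {0.5, 0.6, 0.7, 0.8, 0.9} on the development set.
candidates = {a: interpolate_tables(t_orig, t_extra, a) for a in (0.5, 0.6, 0.7, 0.8, 0.9)}
```

In practice each candidate table would be plugged into the SMT system and the α with the best development-set score kept.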

Using an Additional Bi-text (3)
- Merge:
  1. Build separate phrase tables: T_orig and T_extra.
  2. Keep all entries from T_orig.
  3. Add those phrase pairs from T_extra that are not in T_orig.
  4. Add extra features:
     - F1: 1 if the entry came from T_orig, and 0.5 otherwise.
     - F2: 1 if the entry came from T_extra, and 0.5 otherwise.
     - F3: 1 if the entry was in both tables, and 0.5 otherwise.
- The feature weights are set using MERT, and the number of features is optimized on the development set.
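The Merge steps above can be sketched as follows. This is only an illustration under stated assumptions: tables are dictionaries from (source, target) pairs to a list of scores, and an entry found in both tables is treated as having come from both (so F1 = F2 = F3 = 1). The slide does not spell out that last case, so it is an assumption, as are all function and key names.

```python
def merge_tables(t_orig, t_extra):
    """Merge two phrase tables, preferring T_orig entries and adding origin features."""
    merged = {}
    # Step 2: keep all entries from T_orig (with their T_orig scores).
    for pair, scores in t_orig.items():
        in_both = pair in t_extra
        merged[pair] = {
            "scores": scores,
            "F1": 1.0,                        # came from T_orig
            "F2": 1.0 if in_both else 0.5,    # assumption: also counts as from T_extra
            "F3": 1.0 if in_both else 0.5,    # present in both tables
        }
    # Step 3: add phrase pairs from T_extra that are not in T_orig.
    for pair, scores in t_extra.items():
        if pair not in merged:
            merged[pair] = {"scores": scores, "F1": 0.5, "F2": 1.0, "F3": 0.5}
    return merged


# Toy tables with made-up scores.
t_orig = {("a", "x"): [0.9], ("b", "y"): [0.5]}
t_extra = {("a", "x"): [0.1], ("c", "z"): [0.7]}
merged = merge_tables(t_orig, t_extra)
```

Using 0.5 rather than 0 for the "off" value keeps every feature strictly positive, which matters if the decoder takes logarithms of feature values.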

Using an Additional Bi-text (4)
- Concat×k: concatenate k copies of the original and one copy of the additional training bi-text.
- Concat×k:align: concatenate k copies of the original and one copy of the additional bi-text. Generate word alignments. Truncate them, keeping only those for one copy of the original bi-text. Build a phrase table, and tune an SMT system.
- The value of k is optimized on the development dataset.
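A minimal sketch of the two data-preparation steps, assuming a bi-text is a list of (source, target) sentence pairs, that the copies of the original bi-text are placed first, and that the aligner returns one alignment per sentence pair in the same order. The function names and the toy data are hypothetical.

```python
def concat_k(orig_bitext, extra_bitext, k):
    """Return k copies of the original bi-text followed by one copy of the extra one."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return list(orig_bitext) * k + list(extra_bitext)


def keep_one_original_copy(alignments, n_orig):
    """Truncate per-sentence alignments to the first copy of the original bi-text."""
    return alignments[:n_orig]


# Placeholder sentence pairs standing in for real data.
orig = [("x1 sentence 1", "y sentence 1"), ("x1 sentence 2", "y sentence 2")]
extra = [("x2 sentence 1", "y sentence 3")]

training = concat_k(orig, extra, k=3)

# Stand-in for the output of a word aligner: one item per sentence pair.
fake_alignments = list(range(len(training)))
kept = keep_one_original_copy(fake_alignments, n_orig=len(orig))
```

Repeating the original bi-text k times is a simple way to stop the larger additional bi-text from dominating the alignment statistics.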

Our Method
- Use Merge to combine the phrase tables for concat×k:align (as T_orig) and for concat×1 (as T_extra).
- Two parameters to tune:
  - the number of repetitions k
  - the extra features to use with Merge: (a) F1 only; (b) F1 and F2; (c) F1, F2 and F3
- Benefits:
  - improved word alignments
  - improved lexical coverage
  - phrases are distinguished by source table

Automatic Transliteration (1)
- Portuguese-Spanish (pt-es) transliteration:
  1. Build IBM Model 4 alignments for pt-en, en-es.
  2. Extract pairs of likely pt-es cognates using English (en) as a pivot.
  3. Train and tune a character-level SMT system.
  4. Transliterate the Portuguese side of pt-en.
- Transliteration did not help much for Malay-Indonesian.

Automatic Transliteration (2)
- Extract pt-es cognates using English (en):
  1. Induce pt-es word translation probabilities.
  2. Filter out pairs by probability.
  3. Filter out pairs by orthographic similarity, based on the longest common subsequence.
- The thresholds are constants proposed in the literature.
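The two filtering steps can be illustrated with a small sketch. The slide's exact formulas are not preserved in this transcript, so everything specific here is an assumption: similarity is computed as the longest common subsequence length divided by the length of the longer word, and `min_prob` and `min_similarity` are hypothetical thresholds with illustrative values.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """LCS length normalised by the longer word's length (0.0 for empty input)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def filter_cognates(candidates, min_prob, min_similarity):
    """Keep (w1, w2) pairs whose probability and similarity both reach the thresholds."""
    return [
        (w1, w2)
        for w1, w2, prob in candidates
        if prob >= min_prob and lcs_ratio(w1, w2) >= min_similarity
    ]


# Candidate pairs with made-up translation probabilities.
candidates = [("night", "nacht", 0.9), ("head", "chef", 0.9), ("star", "stella", 0.1)]
likely_cognates = filter_cognates(candidates, min_prob=0.5, min_similarity=0.55)
```

With these illustrative thresholds, a pair survives only if it is both a probable translation and orthographically close, which is the intent of steps 2 and 3.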

Automatic Transliteration (3)
Train & tune a monotone character-level SMT system
- Representation
- Data
  - 28,725 pt-es cognate pairs (total)
  - 9,201 (32%) had spelling differences
  - train/tune split: 26,725 / 2,000 pairs
  - language model: 34.6M Spanish words
- Tuning BLEU: 95.22% (baseline: 87.63%)
We use this model to transliterate the Portuguese side of pt-en.
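The slide's own representation is not preserved in this transcript. One common way to feed words to a character-level SMT system is to write each word as a space-separated sequence of characters, so that a standard phrase-based toolkit treats characters as tokens. The sketch below shows that conversion only; it is an assumption about the setup, and the function names are hypothetical.

```python
def to_char_tokens(word):
    """Turn a word into space-separated characters, e.g. for a character-level model."""
    return " ".join(word)


def from_char_tokens(tokens):
    """Rebuild a word from its space-separated characters."""
    return "".join(tokens.split(" "))


# A cognate pair from the slides, prepared as one training example.
source_side = to_char_tokens("visitou")   # Portuguese
target_side = to_char_tokens("visitó")    # Spanish
```

After decoding, `from_char_tokens` would turn the system's character output back into a word.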

Evaluation Results

Cross-lingual SMT Experiments (figure): train on Malay, test on Malay vs. train on Malay, test on Indonesian

Cross-lingual SMT Experiments (figure): train on Portuguese, test on Portuguese vs. train on Portuguese, test on Spanish

Indonesian→English (using Malay) (results table): original bi-text constant, extra bi-text changing; marked entries are statistically significant over the baseline; the second-best method is also marked.

Number of Concatenations (figure)

Spanish→English (using Portuguese) (results table): original bi-text varied; marked entries are statistically significant over the baseline.

Comparison to the Pivoting Technique of Callison-Burch et al. (2006) (figure)
Series shown:
- +160K pairs, no transliteration
- +1,230K pairs, no transliteration
- +160K pairs, with transliteration
- +1,230K pairs, with transliteration
Compared settings: +8 languages (pivoting) vs. +1 language (our method)

Related Work
1. Using Cognates
2. Paraphrasing with a Pivot Language

Using Cognates
- Al-Onaizan et al. (1999) used likely cognates to improve Czech-English word alignments
  - (a) by seeding the parameters of IBM model 1
  - (b) by constraining word co-occurrences for IBM models 1-4
  - (c) by using the cognate pairs as additional “sentence pairs”
- Kondrak et al. (2003) improved SMT for nine European languages using the “sentence pairs” approach.

Al-Onaizan et al. vs. Our Approach
Al-Onaizan et al.:
- Use cognates between the source and the target languages
- Extract cognates explicitly
- Do not use context
- Use single words only
Our approach:
- Uses cognates between the source and some non-target language
- Does not extract cognates (except for transliteration)
- Leaves cognates in their sentence contexts
- Can use multi-word cognate phrases

Paraphrasing Using a Pivot Language
- Paraphrasing with Bilingual Parallel Corpora (Bannard & Callison-Burch ’05)
- Improved Statistical Machine Translation Using Paraphrases (Callison-Burch et al. ’06)
  - e.g., English→Spanish (using German)
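For orientation, pivot-based paraphrasing in the style of Bannard & Callison-Burch is usually stated as a sum over pivot-language phrases. The sketch below uses notation chosen here, not taken from the slides: e_1 is a phrase, e_2 a candidate paraphrase in the same language, and f ranges over the pivot-language phrases aligned to both.

```latex
% Paraphrase probability obtained by marginalising over pivot phrases f
p(e_2 \mid e_1) \;=\; \sum_{f} p(f \mid e_1)\, p(e_2 \mid f)
```

Both factors are ordinary phrase translation probabilities estimated from a bi-text between the paraphrased language and the pivot language.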

Improved MT Using a Pivot Language
- Use many pivot languages
- New source phrases are added to the phrase table (paired with the original English)
- A new feature is added to each table entry:

Pivoting vs. Our Approach
Pivoting:
- (–) can only improve source-language lexical coverage
- (–) ignores context entirely
- (–) needs two parallel corpora: for X1-Z and for Z-Y
- (+) the additional language does not have to be related to the source
Our approach:
- (+) augments both the source- and the target-language sides
- (+) takes context into account
- (+) only needs one additional parallel corpus: for X2-Y
- (–) requires that the additional language be related to the source
The two approaches are orthogonal and thus can be combined.

Conclusion and Future Work

Overall
- We have presented an approach that uses a bi-text for a resource-rich language pair to build a better SMT system for a related resource-poor language.
- We have achieved
  - up to 3.37 BLEU points absolute improvement for Spanish-English (using Portuguese)
  - up to 1.35 BLEU points for Indonesian-English (using Malay)
- The approach could be used for many resource-poor languages.

Future Work
- Exploit multiple auxiliary languages.
- Try auxiliary languages related to the target.
- Extend the approach to a multi-lingual corpus, e.g., Spanish-Portuguese-English.

Acknowledgments
- This research was supported by research grant POD

Thank You. Any questions?