Language Models for Machine Translation: Original vs. Translated Texts. Gennadi Lembersky, Noam Ordan, Shuly Wintner. MTML, 2011.

Presentation transcript:

Language Models for Machine Translation: Original vs. Translated Texts. Gennadi Lembersky, Noam Ordan, Shuly Wintner. MTML, 2011


Background: Original vs. Translated Texts
Wanderer's Night Song: "Up there all summits are still. In all the tree-tops you will feel but the dew. The birds in the forest stopped talking. Soon, done with walking, you shall rest, too." (~50 translations into Hebrew)
Wandrers Nachtlied: "Über allen Gipfeln ist Ruh, in allen Wipfeln spürest du kaum einen Hauch; die Vögelein schweigen im Walde, warte nur, balde ruhest du auch!" (26 tokens)
[Diagram: Source Text → TM → LM → Target Text]

Background: Is sex/translation dirty?
[Diagram: Source Text → TM → LM → Target Text]

Background: Original vs. Translated Texts
Given this simplified model, two points are made with regard to the "intermediate component" (the TM and the LM):
1. The TM is blind to the direction of translation (but see Kurokawa et al., 2009).
2. LMs are based on originally written texts.
[Diagram: Source Text → TM → LM → Target Text]
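For reference (the slides show this only as a diagram), the standard noisy-channel decision rule makes the division of labor explicit: the TM is a conditional model over both sides, while the LM is a model of the target language alone, which is exactly where the choice of original versus translated training text enters.

```latex
% Noisy-channel formulation of SMT: the translation model P(f|e) scores how
% well candidate e explains the source f; the language model P(e) scores the
% fluency of e alone, so the LM only ever sees target-language text.
\[
  e^{*} \;=\; \operatorname*{arg\,max}_{e} \, P(e \mid f)
        \;=\; \operatorname*{arg\,max}_{e} \, P(f \mid e)\, P(e)
\]
```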

Background: Original vs. Translated Texts
LMs are based on originally written texts for two possible reasons:
1. They are more readily available;
2. Perhaps the question of whether or not the texts are translated is considered irrelevant for language modeling.

Background: Original vs. Translated Texts
Translated texts are ontologically different from non-translated texts; they generally exhibit:
1. Simplification of the message, the grammar, or both (Al-Shabab, 1996; Laviosa, 1998);
2. Explicitation, the tendency to spell out implicit utterances that occur in the source text (Blum-Kulka, 1986).

Background: Original vs. Translated Texts
Translated texts can be distinguished from non-translated texts with high accuracy (87% and more):
- For Italian (Baroni & Bernardini, 2006);
- For Spanish (Ilisei et al., 2010);
- For English (Koppel & Ordan, forthcoming).
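The cited studies train supervised classifiers over stylistic features (Baroni & Bernardini, for instance, use SVMs). As a minimal sketch of the idea rather than a reimplementation of any of them, the snippet below trains a scikit-learn classifier on function-word frequencies; the function-word list, the chunking of the corpora, and the resulting accuracy are all assumptions.

```python
# Minimal illustration of a translationese classifier (not the cited systems):
# represent each text chunk by counts of common function words and train a
# linear classifier to separate originals from translations.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is",
                  "was", "for", "on", "with", "as", "but", "not", "this"]

def train_translationese_classifier(original_chunks, translated_chunks):
    """original_chunks / translated_chunks: lists of raw text strings."""
    texts = original_chunks + translated_chunks
    labels = [0] * len(original_chunks) + [1] * len(translated_chunks)

    # Restrict the vocabulary to function words so the classifier cannot
    # rely on topical content, only on stylistic signal.
    vectorizer = CountVectorizer(vocabulary=FUNCTION_WORDS)
    features = vectorizer.fit_transform(texts)

    classifier = LogisticRegression(max_iter=1000)
    scores = cross_val_score(classifier, features, labels, cv=5)
    print("5-fold accuracy: %.1f%%" % (100 * scores.mean()))

    classifier.fit(features, labels)
    return vectorizer, classifier
```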


Our Hypotheses
We investigate the following three hypotheses:
1. Translated texts differ from original texts.
2. Texts translated from one language differ from texts translated from other languages.
3. LMs compiled from translated texts are better for MT than LMs compiled from original texts.

Testing Hypotheses 1+2

Testing Hypothesis 3

Identifying the Source Language
For the most part, we rely on the LANGUAGE attribute of the SPEAKER tag.
- BUT: it is rarely used with British MEPs.
To identify original English speakers we use the ID attribute, which we match against the list of British Members of the European Parliament.
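The slides do not show the extraction code; a minimal sketch of this fallback logic, assuming Europarl-style `<SPEAKER ID=... LANGUAGE="..">` tags and a hypothetical set of British MEP IDs, might look as follows.

```python
import re

# Hypothetical set of speaker IDs belonging to British MEPs, compiled from
# an external membership list (the slide matches against such a list).
BRITISH_MEP_IDS = {"1234", "5678"}

def source_language(speaker_tag):
    """Guess the original language of a Europarl speech turn from its
    <SPEAKER ...> tag.

    Prefer the LANGUAGE attribute; when it is missing (it is rarely used
    for British MEPs), fall back to matching the speaker ID against the
    list of British MEPs and assume original English.
    """
    lang = re.search(r'LANGUAGE="([A-Z]{2})"', speaker_tag)
    if lang:
        return lang.group(1)
    speaker_id = re.search(r'ID="?([^">\s]+)"?', speaker_tag)
    if speaker_id and speaker_id.group(1) in BRITISH_MEP_IDS:
        return "EN"
    return None  # origin unknown; such turns can be discarded

# Example (hypothetical tag in Europarl-style markup):
print(source_language('<SPEAKER ID=42 LANGUAGE="DE" NAME="...">'))  # -> DE
```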


Resources
4 European language pairs taken from Europarl:
- German – English
- Dutch – English
- French – English
- Italian – English

Language Models Stats

German – English
  Orig. Lang   Sent's     Tokens
  Mix          82,700     …,325,261
  O-EN         91,100     …,324,745
  T-DE         87,900     …,322,973
  T-NL         94,000     …,323,646
  T-FR         77,550     …,325,183
  T-IT         65,199     …,325,996

Dutch – English
  Orig. Lang   Sent's     Tokens
  Mix          90,500     …,508,265
  O-EN         97,000     …,475,652
  T-DE         94,200     …,503,354
  T-NL         101,950    …,513,769
  T-FR         86,600     …,523,055
  T-IT         73,541     …,518,196

Language Models Stats

Italian – English
  Orig. Lang   Sent's     Tokens
  Mix          87,040     …,534,793
  O-EN         93,520     …,534,892
  T-DE         90,550     …,534,867
  T-NL         96,850     …,535,053
  T-FR         82,930     …,534,930
  T-IT         69,270     …,535,225

French – English
  Orig. Lang   Sent's     Tokens
  Mix          90,700     …,546,274
  O-EN         99,300     …,545,891
  T-DE         94,900     …,546,124
  T-NL         103,350    …,545,645
  T-FR         85,750     …,546,085
  T-IT         72,008     …,546,984

SMT Training Data
  Lang's   Side   Sent's    Tokens
  DE-EN    DE     92,901    …,439,370
           EN     92,901    …,602,376
  NL-EN    NL     84,811    …,327,601
           EN     84,811    …,303,846
  FR-EN    FR     93,162    …,610,551
           EN     93,162    …,869,328
  IT-EN    IT     85,485    …,531,925
           EN     85,485    …,517,128

Reference Sets
  Lang's   Side   Sent's   Tokens
  DE-EN    DE     6,675    …,889
           EN     6,675    …,984
  NL-EN    NL     4,593    …,272
           EN     4,593    …,083
  FR-EN    FR     8,494    …,198
           EN     8,494    …,536
  IT-EN    IT     2,269    …,261
           EN     2,269    …,258

Hypotheses 1+2 Results
[Table: perplexity (PP) and unigram counts of the Mix, O-EN, T-DE, T-NL, T-FR and T-IT language models on the German–English and Dutch–English reference sets]

Hypotheses 1+2 Results
[Table: perplexity (PP) and unigram counts of the Mix, O-EN, T-DE, T-NL, T-FR and T-IT language models on the French–English and Italian–English reference sets]

Hypotheses 1+2 Results
Corpora statistics and LM perplexity results support the hypotheses:
- Translated and original texts are different.
- Texts translated from one language are different from texts translated from another language.
For every source language L:
- The LM trained on texts translated from L has the lowest (best) perplexity.
- The MIX LMs are second best, and the LMs trained on texts translated from related languages (German and Dutch; French and Italian) come next.
- The LMs trained on original English texts are the worst.
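The slides report only the resulting perplexities; as a self-contained illustration of the comparison itself (not the toolkit actually used), the sketch below trains two add-one-smoothed unigram models and scores a reference set with both. The file names are hypothetical, and a real setup would use higher-order smoothed models from an n-gram toolkit.

```python
import math
from collections import Counter

def train_unigram(tokens, k=1.0):
    """Add-k smoothed unigram model: returns (counts, total, k)."""
    counts = Counter(tokens)
    return counts, len(tokens), k

def perplexity(model, tokens):
    """Perplexity of a token sequence under an add-k unigram model."""
    counts, total, k = model
    vocab = len(counts) + 1                      # +1 bucket for unseen words
    log_prob = 0.0
    for tok in tokens:
        p = (counts[tok] + k) / (total + k * vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

def tokens_of(path):
    with open(path, encoding="utf-8") as fh:
        return fh.read().split()

# Hypothetical corpus files: English originally written vs. translated from German.
lm_original   = train_unigram(tokens_of("o-en.tok"))
lm_translated = train_unigram(tokens_of("t-de.tok"))
reference     = tokens_of("reference.de-en.tok")

print("O-EN perplexity:", perplexity(lm_original, reference))
print("T-DE perplexity:", perplexity(lm_translated, reference))
```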

Hypothesis 3 (MT) Results
[Table: BLEU scores of the German–English, Dutch–English, French–English and Italian–English MT systems, each built with the Mix, O-EN, T-DE, T-NL, T-FR and T-IT language models]

Hypothesis 3 (MT) Results / 2
The results support the hypothesis:
- For every source language L, the MT system that uses the LM trained on texts translated from L produces the best translations.
- Systems that use the O-EN LMs obtain the lowest BLEU scores.
Statistical significance (bootstrap resampling):
- The best-performing system is statistically better than all other systems (p < 0.05).
- The best-performing system is statistically better than the O-EN system (p < 0.01).
- The MIX systems are statistically better than the O-EN systems (p < 0.01).
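Bootstrap resampling here refers to paired bootstrap over the test set (Koehn, 2004). The slides give no implementation details, so the following is only a sketch with a pluggable corpus-level metric; the `metric` callable and the 1,000-sample setting are assumptions.

```python
import random

def paired_bootstrap(hyps_a, hyps_b, refs, metric, samples=1000, seed=0):
    """Estimate how often system A beats system B under bootstrap resampling.

    hyps_a, hyps_b, refs: parallel lists of sentences (same length).
    metric: callable(hypotheses, references) -> float, higher is better
            (e.g. a corpus-level BLEU scorer).
    Returns the fraction of resampled test sets on which A outscores B;
    1 minus that fraction is the one-sided p-value for "A is better than B".
    """
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]   # resample with replacement
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if metric(sample_a, sample_r) > metric(sample_b, sample_r):
            wins += 1
    return wins / samples
```

With 1,000 samples, a win rate of 0.95 or higher corresponds to significance at p < 0.05 in this one-sided setting.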


Hebrew-English MT System
- MOSES PB-SMT, factored translation model (surface | lemma), trained on ~65,000 parallel sentences.
- Fully segmented source (Hebrew):
  - Morphological analyzer (from the "MILA" knowledge center) and Roy Bar-Haim's disambiguator.
- Lemma-based alignment + "trgtosrc" alignment.
- Performance:
  - ~23 BLEU on 1,000 sentences with 1 reference translation;
  - ~32 BLEU on 300 sentences with 4 reference translations.
- Demo:

Language Model Resources
Two English corpora for the language models:
- Original English (O-EN): "International Herald Tribune" articles collected over a period of 7 months (January to July 2009).
- Translated from Hebrew (T-HE): articles from the Israeli newspaper "HaAretz", published in Hebrew and translated into English, collected over the same period of time.
Each corpus comprises 4 topics: news, business, opinion and arts.
- Both corpora have approximately the same number of tokens in each topic.

Language Model Resources

Hebrew – English
  Orig. Lang   Len    Sent's    Tokens
  O-EN         –      135,228   …,561,559
  T-HE         24.2   147,227   3,561,556

Parallel Resources
SMT training data:
- Hebrew-English parallel corpus (Tsvetkov and Wintner, 2010).
  - Genres: news, literature and subtitles.
  - Original Hebrew (54%).
  - Original English (46%), mostly subtitles.
Reference set:
- Translated from Hebrew to English.
- Literature (88.6%) and news (11.4%).

Parallel Resources

SMT Training Data (HE-EN)
  Side   Sent's    Tokens
  HE     95,912    …,512
  EN     95,912    …,830

Reference Set (HE-EN)
  Side   Sent's   Tokens
  HE     7,546    …,085
  EN     7,546    …,183

Hypothesis 1 Results
Problem: what if the different perplexity results are due to a content bias between the T-HE corpus and the reference set?
- We conducted more experiments in which we gradually abstract away from the specific contents.
[Table: perplexity (PP) and unigram counts of the O-EN and T-HE language models on the Hebrew–English reference set]

Abstraction Experiments
4 abstraction levels:
1. We remove all punctuation.
2. We replace named entities with an "NE" token.
   - We use the Stanford Named Entity Recognizer.
   - We train 5-gram LMs.
3. We replace all nouns with their POS tag.
   - We use the Stanford POS Tagger.
   - We train 5-gram LMs.
4. We replace all tokens with their POS tags.
   - We train 8-gram LMs.
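The slides use the Stanford NER and POS tagger; as a rough stand-in sketch of the four abstraction levels, the snippet below uses spaCy instead. The model name `en_core_web_sm`, the coarse POS tag set, and the cumulative treatment of the levels are assumptions, not the original setup.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in for the Stanford tools

def abstract(sentence, level):
    """Rewrite a sentence at one of the four abstraction levels:
    1 = punctuation removed, 2 = named entities -> "NE",
    3 = nouns -> their POS tag, 4 = every token -> its POS tag.
    Levels are treated cumulatively (each includes the previous ones)."""
    doc = nlp(sentence)
    out = []
    for tok in doc:
        if tok.is_punct:
            continue                                   # level 1 and above
        if level >= 2 and tok.ent_type_:
            out.append("NE")                           # collapse named entities
        elif level >= 4:
            out.append(tok.pos_)                       # all tokens -> POS
        elif level >= 3 and tok.pos_ in ("NOUN", "PROPN"):
            out.append(tok.pos_)                       # nouns -> POS
        else:
            out.append(tok.text)
    return " ".join(out)

print(abstract("The birds in the forest stopped talking.", 3))
```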

Abstraction Experiments
T-HE fits the reference consistently better than O-EN.
PP difference between O-EN and T-HE by abstraction level:
  Abstraction           PP diff.
  PP (no abstraction)   –
  No punctuation        19.2%
  NE abstraction        17.3%
  Noun abstraction      12.4%
  POS abstraction       6.2%

Hypothesis 3 (MT) Results
The T-HE system produces slightly better results; the gain is statistically significant (p < 0.05).
[Table: BLEU scores of the Hebrew–English MT systems built with the O-EN and T-HE language models]


Discussion
The results consistently support our hypotheses:
1. Translated texts differ from original texts.
2. Texts translated from one language differ from texts translated from other languages.
3. LMs compiled from translated texts are better for MT than LMs compiled from original texts.

Discussion
Practical outcome:
- Use LMs trained on texts translated from the source language.
- If such texts are not available, use a mixture of translated texts.
  - Texts translated from languages closely related to the source language are, for the most part, better than other translated texts.

Discussion
Why did it work? Two hypotheses:
1. Since translations simplify the originals, the error potential gets smaller and LMs better predict translated language;
2. Recurrent multiword expressions in the source language converge to a set of high-quality translations in the target language.

Discussion
When machine translation meets translation studies:
1. MT results improve;
2. Pending hypotheses in translation studies are tested experimentally, in a more rigorous way.
We call for further cooperation.
