2010 Failures in Czech-English Phrase-Based MT
Ondřej Bojar, Kamil Kos; Charles University in Prague, MFF, ÚFAL
Full text, acknowledgement and the list of references appear in the proceedings of the ACL 2010 Workshop on Statistical Machine Translation (WMT10).

Overview

Two training corpora were used:
- Large: 7.5M sentence pairs
- Small: 126.1k sentence pairs

                     Large     Small
Sentences            7.5M      126.1k
Czech tokens         79.2M     2.6M
English tokens       89.1M     2.9M
Czech vocabulary     923.1k    138.7k
English vocabulary   646.3k    64.7k
Czech lemmas         553.5k    60.3k
English lemmas       611.4k    53.8k

Problems of English-to-Czech MT:
- large target-side vocabulary,
- higher out-of-vocabulary rate,
- low reachability of human translations.

Possible solutions examined: Two-Step Translation and a coarser optimisation metric, SemPOS (Kos and Bojar, 2009).

Rich Morphology

Large Vocabulary

Although the training data contain roughly the same number of Czech and English tokens, the Czech vocabulary has many more entries. The Czech vocabulary is also approximately double the size of the set of Czech lemmas.

Higher Out-of-Vocabulary Rates

The out-of-vocabulary (OOV) rate of unigrams and bigrams increases significantly in the Small data setting. If only a limited amount of data is available, the OOV rate can be reduced by lemmatizing word forms. Almost 12% of the Czech devset tokens are not available in the extracted phrase tables.

Lower Reachability of Reference Translations

The table below gives the percentage of reference translations reachable by exhaustive search (Schwartz, 2008). The exhaustive search is influenced by the distortion limit (dl, default: 6) and the number of translation options per span (Topts, default: 50):

              dl=3    dl=6    dl=30
Czech
  Topts=1        0.2%–0.3%
  Topts=50    1.2%    1.5%    1.7%
  Topts=100   1.2%    1.5%    1.7%
English
  Topts=1          0.4%
  Topts=50    4.9%    6.7%    8.6%
  Topts=100   5.3%    7.6%    9.4%

It is much harder to reach the reference translation in Czech than in English.

Optimisation towards a Coarser Metric: SemPOS

SemPOS (Kos and Bojar, 2009):
- operates on lemmas of content words,
- ignores word order,
- matches reference lemmas only if their semantic parts of speech (Hajič et al., 2004) agree,
- supports Czech and English so far.

Parameters tuned towards SemPOS with MERT vary widely in final quality (6.96–9.46 BLEU for the Small data). The tables below evaluate linear combinations of BLEU and SemPOS as the optimisation objective.

Small data:

Weights   BLEU           SemPOS
1:0       10.42 ± 0.38   29.91
1:1       10.15 ± 0.39   29.81
1:1        9.42 ± 0.37   29.30
2:1       10.37 ± 0.38   29.95
3:1       10.30 ± 0.39   30.03
10:1      10.17 ± 0.40   29.58
1:2       10.11 ± 0.38   29.80
1:10       9.44 ± 0.40   29.74

Large data:

Weights   BLEU           SemPOS
1:0       14.08 ± 0.50   32.44
1:1       13.79 ± 0.55   33.17

Two-Step Translation

To overcome target-side data sparseness, Moses is run twice in a row:
1. Translate English into lemmatized Czech, augmented to preserve important semantic properties known from the source phrase.
2. Generate fully inflected Czech.

Example of the two-step route:

Src:    after a sharp drop
Mid:    po+6  ASA1.prudký  NSA-.pokles
Gloss:  after+loc  adj+sg…sharp  noun+sg…drop
Out:    po prudkém poklesu

Simple vs. Two-Step translation:

Parallel  Mono   Simple                Two-Step              Change
                 BLEU          SemPOS  BLEU          SemPOS  BLEU   SemPOS
Small     Small  10.28 ± 0.40  29.92   10.38 ± 0.38  30.01   +0.10  +0.09
Small     Large  12.50 ± 0.44  31.01   12.29 ± 0.47  31.40   -0.21  +0.39
Large     Large  14.17 ± 0.51  33.07   14.06 ± 0.49  32.57   -0.11  -0.50

There is a minor gain in the Small-Small setting and a minor loss in Large-Large. The mixed result in Small-Large is interesting: it indicates that a large LM can improve the BLEU score without addressing the cross-lingual data sparseness (which is tackled by the Two-Step model and appreciated by SemPOS). Note that the large monolingual data were used also as the LM in the first step.

Manual Annotation

150 sentences (Small-Large setting) were manually annotated by two annotators, marking each sentence as Two-Step better, both fine, both wrong, or Simple better. Each annotator mildly prefers the Two-Step model. The result is equal (23) when limited to sentences where the annotators agree.
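Mechanically, the two-step setup is just two chained decoder runs. The following is a minimal sketch of that chaining; the model directories, file names, and the plain `moses -f` invocation are illustrative assumptions, not the actual WMT10 scripts.

```python
# Minimal sketch of two-step translation as two chained Moses runs.
# Assumed (not from the poster): model paths, file names, and that a
# plain `moses -f <ini>` call on stdin/stdout suffices for each step.
import subprocess

def decode(moses_ini: str, src_path: str, out_path: str) -> None:
    """Run the Moses decoder on src_path, writing 1-best output to out_path."""
    with open(src_path) as src, open(out_path, "w") as out:
        subprocess.run(["moses", "-f", moses_ini],
                       stdin=src, stdout=out, check=True)

# Step 1: English -> lemmatized Czech augmented with tags that preserve
# semantic properties known from the source phrase.
decode("step1-en-cslemma/moses.ini", "test.en", "test.cslemma")

# Step 2: augmented Czech lemmas -> fully inflected Czech forms.
decode("step2-cslemma-cs/moses.ini", "test.cslemma", "test.cs")
```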
Our WMT10 System Configuration

English-to-Czech:
- Standard GIZA++ word alignment based on both source and target lemmas.
- Two alternative decoding paths, with forms always truecased: form+tag → form and form → form. The first path is more specific and helps to preserve core syntactic elements in the sentence: without the tag, ambiguous English words could often all translate as e.g. nouns, leading to no verb in the Czech sentence. The default path serves as a back-off.
- Significance filtering of the phrase tables (Johnson et al., 2007), as implemented for Moses by Chris Dyer; default settings of filter value a+e and cut-off 30.
- Lexicalized reordering (or-bi-fe) based on forms.
- Two separate 5-gram Czech LMs of truecased forms, each of which interpolates models trained on the following datasets; the interpolation weights were set automatically using SRILM (Stolcke, 2002) based on the target side of the development set:
  - interpolated CzEng domains: news, web, fiction (we prefer prose-like texts for LM estimation, and not e.g. technical documentation, while wanting as much parallel data as possible);
  - interpolated monolingual corpora: WMT09 monolingual, WMT10 monolingual, and the Czech National Corpus (Kocek et al., 2000), sections SYN and PUB.
- Standard Moses MERT towards BLEU.

Czech-to-English (far fewer configurations were tested; this is the final one):
- Two alternative decoding paths, with forms always truecased: form → form and lemma → form.
- Significance filtering as in English-to-Czech.
- 5-gram English LM based on the CzEng English side only. Using the Gigaword LM as compiled by Chris Callison-Burch caused a significant loss in quality, probably due to different tokenization rules.
- Lexicalized reordering (or-bi-fe) based on forms.
- Standard Moses MERT towards BLEU.
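The significance filter used in both directions scores each phrase pair by a one-sided Fisher's exact test on its co-occurrence counts; the a+e setting discards pairs that are no more significant than a pair whose phrases co-occur exactly once and occur once each. Here is a minimal sketch of that test for a single phrase pair, with made-up counts (only the corpus size echoes the Small setting):

```python
# Sketch of Johnson et al. (2007)-style significance filtering for one
# phrase pair. The counts below are illustrative, not from the system.
from scipy.stats import fisher_exact

def pair_p_value(c_st: int, c_s: int, c_t: int, n: int) -> float:
    """One-sided Fisher p-value that source phrase s and target phrase t
    co-occur in c_st of n sentence pairs by chance."""
    table = [[c_st,        c_s - c_st],
             [c_t - c_st,  n - c_s - c_t + c_st]]
    _, p = fisher_exact(table, alternative="greater")
    return p

N = 126_100  # roughly the Small corpus size
# "alpha" is the p-value of a 1-1-1 pair (seen together once, each side
# seen once overall); the a+e threshold discards pairs that are not
# strictly more significant than such a pair.
alpha = pair_p_value(1, 1, 1, N)
p = pair_p_value(c_st=3, c_s=5, c_t=4, n=N)
keep = p < alpha   # True: this pair co-occurs more often than chance
```

The cut-off of 30 mentioned above then, as we understand the setting, limits how many translations are kept per source phrase after filtering.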