
Automatic Post-editing (pilot) Task
Rajen Chatterjee, Matteo Negri and Marco Turchi
Fondazione Bruno Kessler
{ chatterjee | negri | turchi }@fbk.eu

Automatic post-editing WMT15

Task
– Automatically correct errors in a machine-translated text

Impact
– Cope with systematic errors of an MT system whose decoding process is not accessible
– Provide professional translators with improved MT output quality to reduce (human) post-editing effort
– Adapt the output of a general-purpose MT system to the lexicon/style requested in specific domains

Objectives of the pilot
– Define a sound evaluation framework for future rounds
– Identify critical aspects of data acquisition and system evaluation
– Make an inventory of current approaches and evaluate the state of the art

Evaluation setting: data

Data (provided by Unbabel)
– English-Spanish, news domain

Training: 11,272 (src, tgt, pe) triplets
– src: tokenized EN sentence
– tgt: tokenized ES translation by an unknown MT system
– pe: crowdsourced human post-edition of tgt

Development: 1,000 triplets
Test: 1,817 (src, tgt) pairs
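As a concrete reference, here is a minimal sketch of how such line-aligned triplets could be loaded; the file names (train.src, train.mt, train.pe) are hypothetical stand-ins, not the official release names.

```python
# Minimal sketch of loading the task triplets, assuming the usual WMT
# release layout of three line-aligned plain-text files (file names here
# are hypothetical, not the official ones).
def load_triplets(src_path, tgt_path, pe_path):
    with open(src_path, encoding="utf-8") as s, \
         open(tgt_path, encoding="utf-8") as t, \
         open(pe_path, encoding="utf-8") as p:
        for src, tgt, pe in zip(s, t, p):
            yield src.strip(), tgt.strip(), pe.strip()

train = list(load_triplets("train.src", "train.mt", "train.pe"))
print(len(train))  # expected: 11,272 for the pilot's training set
```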

Evaluation setting: metric and baseline

Metric
– Average TER between automatic and human post-edits (the lower the better)
– Two modes: case sensitive/insensitive

Baseline(s)
– Official: average TER between tgt and the human post-edits, i.e. a system that leaves the tgt test instances unmodified
– Additional: a re-implementation of the statistical post-editing method of Simard et al. (2007), "monolingual translation" with a phrase-based Moses system trained on (tgt, pe) "parallel" data
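A sketch of the scoring protocol, using sacrebleu's TER reimplementation as a stand-in for the official tercom-based scorer (an assumption; tokenization details may differ from the official evaluation):

```python
from sacrebleu.metrics import TER

def avg_ter(hypotheses, human_pes, case_sensitive=False):
    """Corpus-level TER (in %) of hypotheses against the human post-edits."""
    return TER(case_sensitive=case_sensitive).corpus_score(
        hypotheses, [human_pes]).score

human_pes  = ["Another key step for the Balkans"]
mt_outputs = ["Yet a key step in the Balkans"]         # tgt, left untouched
ape_run    = ["Another crucial step for the Balkans"]  # a system's output

# Official baseline: score the unmodified tgt against the human PEs.
print("baseline TER:", avg_ter(mt_outputs, human_pes))
print("system   TER:", avg_ter(ape_run, human_pes))
```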

Participants and results

Participants (4) and submitted runs (7)

Abu-MaTran (2 runs)
– Statistical post-editing, Moses-based
– QE classifiers to choose between MT and APE: an SVM-based HTER predictor, and an RNN-based classifier labelling each word as good or bad

FBK (2 runs)
– Statistical post-editing: the basic method of Simard et al. (2007), f' ||| f, and the "context-aware" variant of Béchara et al. (2011), f'#e ||| f
– Phrase table pruning based on rules' usefulness
– Dense features capturing rules' reliability

LIMSI (2 runs)
– Statistical post-editing
– Sieves-based approach: PE rules for casing, punctuation and verbal endings

USAAR (1 run)
– Statistical post-editing
– Hybrid word alignment combining multiple aligners
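To make the context-aware FBK variant above concrete, here is a minimal sketch of the representation of Béchara et al. (2011): each MT token f' is annotated with its aligned source token e as f'#e before training the monolingual (tgt, pe) Moses system. Word alignments are assumed to come from an external aligner (e.g. GIZA++); the example alignment is invented.

```python
def contextualize(tgt_tokens, src_tokens, alignment):
    """Turn each aligned MT token f' into f'#e; unaligned tokens stay as-is."""
    tgt2src = dict(alignment)  # tgt position -> src position
    return [
        f"{t}#{src_tokens[tgt2src[i]]}" if i in tgt2src else t
        for i, t in enumerate(tgt_tokens)
    ]

src = "the house is red".split()
tgt = "la casa es roja".split()
alignment = [(0, 0), (1, 1), (2, 2), (3, 3)]
print(contextualize(tgt, src, alignment))
# ['la#the', 'casa#house', 'es#is', 'roja#red']
```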

Results (average TER, the lower the better)

[Table: per-run TER scores, case insensitive and case sensitive; the numeric values did not survive the transcript]

– None of the submitted runs improved over the baseline
– Similar performance difference under the case sensitive and case insensitive modes
– Close results reflect the same underlying statistical APE approach
– Improvements over the common backbone indicate some progress

Discussion

Discussion: the role of data

Experiments with the Autodesk Post-Editing Data corpus
– Same languages (EN-ES)
– Same amount of target words for training, dev and test
– Same data quality (~ same TER)
– Different domain: software manuals (vs news)
– Different origin: professional translators (vs crowd)

[Table: type/token ratio and repetition rate of SRC, TGT and PE for the APE task data vs the Autodesk data; most values lost in the transcript. Repetition rate of PE: 3.1 (APE task) vs 8.5 (Autodesk)]

The Autodesk data is more repetitive. Easier?
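For reference, a sketch of the two statistics. The repetition rate below is a simplified whole-corpus variant (the measure of Cettolo et al. (2014) counts non-singleton n-gram types over sliding sub-corpora), so the exact figures would differ.

```python
from collections import Counter

def type_token_ratio(sentences):
    tokens = [tok for s in sentences for tok in s.split()]
    return len(set(tokens)) / len(tokens)

def repetition_rate(sentences, max_n=4):
    # Geometric mean, over n = 1..4, of the share of n-gram occurrences
    # covered by n-gram types that appear more than once.
    rate = 1.0
    for n in range(1, max_n + 1):
        counts = Counter(
            tuple(toks[i:i + n])
            for s in sentences
            for toks in (s.split(),)
            for i in range(len(toks) - n + 1)
        )
        total = sum(counts.values())
        repeated = sum(c for c in counts.values() if c > 1)
        rate *= repeated / total if total else 0.0
    return rate ** (1.0 / max_n)

docs = ["the house is red", "the house is big", "the car is red"]
print(type_token_ratio(docs), repetition_rate(docs))
```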

Discussion: the role of data

Repetitiveness of the learned correction patterns
– Train two basic statistical APE systems
– Count how often each translation option is found in the training pairs (more singletons = higher sparsity)

[Table: percentage of phrase pairs by phrase pair count, APE task data vs Autodesk data. Total entries: 1,066,344 (APE task) vs 703,944 (Autodesk)]

The Autodesk phrase table is more compact: fewer singletons, more repeated translation options. Easier?
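A sketch of the singleton analysis: count how often each phrase pair occurs among the extracted training pairs; a high share of singleton entries signals sparse, rarely repeated correction patterns. The example pairs are invented.

```python
from collections import Counter

def singleton_share(phrase_pairs):
    """Share of distinct phrase-table entries observed only once."""
    counts = Counter(phrase_pairs)
    return sum(1 for c in counts.values() if c == 1) / len(counts)

pairs = [("la casa", "la casa"), ("es roja", "es roja"),
         ("la casa", "la casa"), ("azul", "roja")]
print(f"{singleton_share(pairs):.0%} of the entries are singletons")
```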

Discussion: professional vs. crowdsourced PEs

Professional translators
– Necessary corrections to maximize productivity
– Consistent translation/correction criteria

Crowdsourced workers
– No specific time/consistency constraints

Analysis of 221 test instances post-edited by professional translators
[Table: TER of the MT output against professional PEs and against crowdsourced PEs; values lost in the transcript]
– The crowd corrects more
– The crowd corrects differently

Discussion: impact on performance

Evaluation on the respective test sets (average TER)
– Simard et al. (2007) system: 23.83 (+0.92 over the do-nothing baseline) on the APE task data vs 20.02 (-3.55) on the Autodesk data

More difficult task with WMT data
– Same do-nothing baseline, but significant TER differences
– -1.43 points with only 25% of the Autodesk training instances

Repetitiveness and homogeneity help!

Discussion: systems' behavior

Few modified sentences (22% on average)
Best results achieved by conservative runs
– A consequence of data sparsity?
– An evaluation problem: good corrections can harm TER
– A problem of statistical APE: correct words should not be touched

Summary

✔ Define a sound evaluation framework
– No need of radical changes in future rounds
✔ Identify critical aspects for data acquisition
– Domain: specific vs general
– Post-editors: professional translators vs crowd
✔ Evaluate the state of the art
– Same underlying approach
– Some progress due to slight variations
– But the baseline is unbeaten
– Problem: how to avoid unnecessary corrections?

Thanks! Questions?

The "aggressiveness" problem

MT: translation of the entire source sentence
– Translate everything!
SAPE: "translation" of the errors
– Don't correct everything! Mimic the human!

SRC: 巴尔干的另一个关键步骤
TGT: Yet a key step in the Balkans
Human PE: Another key step for the Balkans
APE output: Another crucial step for the Balkans

Changing correct terms ("key" → "crucial") will be penalized by the TER-based evaluation against human post-edits.
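Scoring the example makes the penalty visible; again, sacrebleu's TER stands in for the official tercom scorer (an assumption):

```python
from sacrebleu.metrics import TER

ter = TER()
human_pe = ["Another key step for the Balkans"]

# Unmodified MT output: TER counts every real error.
print(ter.corpus_score(["Yet a key step in the Balkans"], [human_pe]).score)
# Perfect, human-like post-edit: TER = 0.
print(ter.corpus_score(["Another key step for the Balkans"], [human_pe]).score)
# Aggressive post-edit: the real error is fixed, but the needless
# "key" -> "crucial" paraphrase costs one extra substitution.
print(ter.corpus_score(["Another crucial step for the Balkans"], [human_pe]).score)
```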