Machine translation markers in post-edited machine translation output

Machine translation markers in post-edited machine translation output
Translating and the Computer 40, London, UK #TC18

Translation vs. Post-Edited MT
Some authors say people prefer translated texts (Fiederer and O’Brien, 2009; Bowker and Buitrago Ciro, 2015)
Others say people are not able to tell the difference between human translation (HT) and post-edited machine translation (PEMT) (Daems, De Clercq and Macken, 2017)

No difference, really?
Given that:
Post-editors tend to leave acceptable solutions unedited
Machine translation tends to choose one of the solutions most frequently chosen by translators

No difference, really?
Then:
The statistically most frequent solutions in human translation will occur with a higher than natural frequency in PEMT
These over-represented solutions are the MT markers
MT markers may be used to design tests to tell HT and PEMT apart

Preliminary experiment
51 postgraduate university students
Extracts from Wikipedia entries on Venice (153 words) and Verona (168 words)
Half did unaided human translations from English into Italian
Half did full post-editing of machine translation output
Microsoft Translator, both statistical and neural versions

Preliminary experiment
Compare human translations with source text to find turns of phrase and expressions (n-grams) that have been translated in a wide variety of different ways
41 n-grams identified
Compare this variety to the number of ways in which the same n-grams have been rendered in post-edited MT
Excluding translation errors

Example (there are)
Columns: N-gram, HT group, PESMT group, PENMT group, Combined PEMT group
ci sono: 10, 7, 11, 18
Other solutions, with the counts that survive in this transcript: sono 4, ospita 1, vanta, presenta 2, sono presenti, vi sono, si possono trovare, si possono visitare, è possibile visitare, possiamo trovare, offre, è famosa per, ha
TOTAL: 26 12 24
È famosa per was rated as a debatable translation solution (see Method above for a definition) and therefore omitted from the calculations (there are numerous attractions in Venice vs. Venice is famous for numerous attractions).
Fisher's exact two-tailed test
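The slide names Fisher's exact two-tailed test but does not show the calculation. The following is a minimal pure-Python sketch of that test for a 2x2 table, applied to the ci sono row. The function name is illustrative, and the group totals of 26 (HT) and 24 (combined PEMT) are an assumption read off the TOTAL row, so the resulting p-value is only indicative.

```python
from math import comb


def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of every table with the same margins that is
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(x):
        # Unnormalised hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(n, col1)


# ci sono chosen vs. any other solution.
# ASSUMPTION: 26 HT translators and 24 post-editors in total.
ht_ci_sono, ht_total = 10, 26
pemt_ci_sono, pemt_total = 18, 24

p_ci_sono = fisher_exact_two_tailed(
    ht_ci_sono, ht_total - ht_ci_sono,
    pemt_ci_sono, pemt_total - pemt_ci_sono,
)
print(f"two-tailed p = {p_ci_sono:.4f}")
```

Working with integer combination counts until the final division avoids floating-point ties when deciding which tables count as at least as extreme as the observed one.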

Translation errors
                         HT     PEMT
Debatable choices        18     12
Mistranslation           35     42
Total                    53     54
Errors per translator    2.04   2.25
Errors were only counted for the n-grams analysed (75/153 words = 49% of text)
The difference between the two groups is not statistically significant
The quality is comparable if we evaluate quality purely in terms of translation errors
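The per-translator figures can be checked with a few lines of arithmetic. This sketch assumes group sizes of 26 (HT) and 24 (PEMT); the slide does not state them here, but they are the sizes that reproduce the reported 2.04 and 2.25.

```python
# Error totals as reported on the slide.
errors = {"HT": 53, "PEMT": 54}

# ASSUMPTION: group sizes inferred from the reported per-translator rates.
group_size = {"HT": 26, "PEMT": 24}

errors_per_translator = {
    group: round(errors[group] / group_size[group], 2) for group in errors
}

# Share of the source text covered by the analysed n-grams (75 of 153 words).
coverage_pct = round(100 * 75 / 153)

print(errors_per_translator, coverage_pct)
```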

Variety (NTS/S) was:
Higher in HT group: 22 cases (88%)
Virtually the same: 1 case (4%)
Clearly higher in PEMT group
Difficult to calculate: 1 case (4%), where numerous post-editing errors cause highly uneven group sizes
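The slides do not spell out how NTS/S is computed. The sketch below assumes one plausible reading, the number of distinct translation solutions divided by the number of subjects who produced them, which keeps groups of different sizes comparable. The function name and the sample renderings are hypothetical, not data from the experiment.

```python
def variety(renderings):
    """Distinct solutions per translator (ASSUMED reading of NTS/S)."""
    return len(set(renderings)) / len(renderings)


# Hypothetical renderings of one n-gram, one entry per translator.
ht = ["ci sono", "ospita", "vanta", "ci sono", "offre", "presenta"]
pemt = ["ci sono", "ci sono", "ci sono", "vi sono", "ci sono", "ci sono"]

# How many times greater the variety is in HT than in PEMT.
variety_ratio = variety(ht) / variety(pemt)
print(f"HT variety is {variety_ratio:.1f} x PEMT variety")
```

A ratio above 1 corresponds to the "higher in HT group" cases on the slide, and a ratio below 1 to the reverse.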

22 cases of greater variety
5 x greater variety: 1
4 x greater variety: 2
3 x greater variety
2 x greater variety: 4
Reverse

Conclusion
Much greater variety of translation solutions in the HT group than in the combined PEMT group.

Translation in raw MT
Top human choice (THC): 14 cases (56%)
Second to top human choice: 3 cases (12%)
Different inflection of the THC: 1 case (4%)
Mistranslation: 2 cases (8%)
Unappealing solution: 1 case (all except one post-editor chose to change it)
Not rated among top human choices: 4 cases (16%)

Conclusion
The raw MT outputs more often than not propose the most commonly chosen translation solutions found in human translation

THC frequency in PEMT
There are two predominant cases when there was a statistically significant difference in THC frequency:
When the raw MT output contained the THC, in which case it was significantly higher
When the raw output contained the second to top human choice, in which case it was significantly lower

Conclusion
If a post-editor finds a highly appealing translation solution, they tend to leave it and not waste time looking for alternatives.

MT markers
Ideal candidate MT markers:
THC found in MT output
THC occurs more often in PEMT than in HT, with a very or extremely statistically significant difference
Variety in HT is at least twice that in PEMT
Four n-grams satisfied these conditions
"There are" was chosen for its ubiquity, which makes it easily repeatable in a relatively short text without it seeming artificial
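The three conditions can be read as a simple filter over candidate n-grams. The sketch below is one way to express it; the `Candidate` fields, the sample values and the 0.01 cut-off for "very significant" are all assumptions made for illustration, since the slides give no numeric threshold.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    ngram: str
    thc_in_raw_mt: bool      # raw MT output contains the top human choice
    thc_more_in_pemt: bool   # THC is more frequent in PEMT than in HT
    p_value: float           # significance of that frequency difference
    variety_ratio: float     # HT variety divided by PEMT variety


# ASSUMPTION: "very or extremely significant" is taken to mean p < 0.01.
VERY_SIGNIFICANT = 0.01


def is_ideal_marker(c: Candidate) -> bool:
    return (
        c.thc_in_raw_mt
        and c.thc_more_in_pemt
        and c.p_value < VERY_SIGNIFICANT
        and c.variety_ratio >= 2.0
    )


# Hypothetical candidates; the numbers are not taken from the study.
candidates = [
    Candidate("example n-gram A", True, True, 0.004, 2.5),
    Candidate("example n-gram B", True, True, 0.030, 3.0),
    Candidate("example n-gram C", False, True, 0.001, 4.0),
]
markers = [c.ngram for c in candidates if is_ideal_marker(c)]
print(markers)
```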

There are test
A text (273 words / 4 paragraphs) containing 5 occurrences of "there are" was:
given to three volunteer professional translators for translation
Google-translated (neural) and given to another three for full post-editing
The raw MT output contained the same proposed solution (ci sono) for each of the five occurrences.
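Scoring such a test comes down to counting how often the marker survives in each target text. A minimal sketch follows, assuming simple case-insensitive whole-phrase matching; the helper name and the Italian sample sentences are made up for illustration.

```python
import re


def count_marker(text: str, marker: str = "ci sono") -> int:
    """Count whole-phrase, case-insensitive occurrences of the marker."""
    pattern = r"\b" + re.escape(marker) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical target texts, not taken from the experiment.
post_edited = "A Venezia ci sono numerose attrazioni. Ci sono anche molti ponti."
translated = "Venezia ospita numerose attrazioni e vanta molti ponti."

print(count_marker(post_edited), count_marker(translated))
```

A text that keeps all five raw MT solutions would score 5, while a text that varies its solutions would score lower.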

There are test
Columns: translator, professional experience (years), time (minutes), number of occurrences of ci sono, number of different solutions chosen, HT/PEMT. Some cells are missing from this transcript; the values below are listed as they survive.
HT: SC 8 51 5; LZ 11 32 4; MLD 25 64 3
PEMT: CP 16 47 1; PV 28 45; DG 26 2
Surprise result
Post-editors in the there are test came up with a comparably wide range of solutions
The preliminary experiment says nothing about the variety of solutions adopted by an individual in the same job, only about the variety of solutions chosen by a group or, by extrapolation, the community
A different factor may have come into play. Italians are taught that good writers should avoid unnecessary lexical repetition. Five occurrences of the same expression in four paragraphs may have triggered a repetitiveness alarm, turning an otherwise correct solution into an unacceptable one.
Alternatively, it may more simply be argued that the scale of this second experiment may not be big enough to give reliable results.
This deserves further investigation

Discussion
Variety and inventiveness are not always desirable features
There are also various kinds of text where lexical uniformity is a negative quality factor
In these cases, counting errors and measuring fluency and adequacy are not sufficient to judge translation quality

Discussion
Preliminary experiment shows apparent normalization and homogenization of the choices made by post-editors as a whole
Failure to remedy this normalization and homogenization may eventually lead to lexical impoverishment
One solution might be to program NMT engines to sometimes randomly pick the second or third best fit translated sentence vectors
Particularly in cultures where English has become the primary language in which new written material is created
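The idea of occasionally picking a lower-ranked candidate can be sketched independently of any particular engine. The snippet below assumes the engine exposes a list of hypotheses ranked best first; the function, its parameters and the sample hypotheses are hypothetical and do not describe an existing NMT API.

```python
import random


def pick_hypothesis(ranked, explore_prob=0.2, max_rank=3, rng=random):
    """Return the best hypothesis, or occasionally the 2nd or 3rd best.

    ranked: hypotheses ordered best first (ASSUMED to come from the engine).
    explore_prob: chance of deliberately skipping the top hypothesis.
    max_rank: lowest rank that may be picked (3 means up to third best).
    """
    if len(ranked) > 1 and rng.random() < explore_prob:
        upper = min(max_rank, len(ranked))
        # Choose uniformly among ranks 2..upper (list indices 1..upper-1).
        return ranked[rng.randrange(1, upper)]
    return ranked[0]


# Hypothetical ranked output for one source sentence.
hypotheses = ["ci sono molte attrazioni", "vi sono molte attrazioni",
              "sono presenti molte attrazioni", "offre molte attrazioni"]
choice = pick_hypothesis(hypotheses, rng=random.Random(0))
print(choice)
```

Keeping the probability low preserves the time savings of post-editing in most segments while still injecting some lexical variety.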

Discussion
It is possible to train post-editors to add originality and inventiveness
But this defeats the object of post-editing (time and cost saving)
As MT systems improve, homogenization and normalization will probably be exacerbated

Discussion
On account of the findings reported herein, the use of PEMT for texts where variety, originality and inventiveness are quality factors would appear to be unadvisable with the MT technology currently available

The End