1
Statistical vs. Neural Machine Translation: a Comparison of MTH and DeepL at Swiss Post's Language Service
Lise Volkart – Pierrette Bouillon – Sabrina Girletti
University of Geneva – Translation Technology Department (TIM)
lise.volkart ¦ pierrette.bouillon ¦
AsLing, Translating and the Computer 40, London, November 2018
2
Introduction
Context
- Microsoft Translator Hub (trained with 288,211 segments and 76 terms from Swiss Post data)
- DeepL (generic neural machine translation system)
- Language pair: German-to-French
- Test set: Swiss Post's annual report
Research questions
- Can a generic neural system compete with a customised statistical MT system?
- Is BLEU a suitable metric for the evaluation of NMT?
3
Comparison of MTH and DeepL
3 types of evaluation:
- Automatic evaluation (BLEU)
- Human evaluation I: post-editing productivity test
- Human evaluation II: comparative evaluation of the post-edited output
4
Automatic evaluation
Results (corpus of 1,718 segments)
Very similar scores; BLEU is slightly better for DeepL.

System   BLEU
DeepL    25.23
MTH      23.46
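For illustration, a minimal sketch of how a corpus-level BLEU score such as those in the table could be computed with the sacrebleu library. The file names are hypothetical placeholders; the slides do not specify the exact tooling used in the study.

```python
# Minimal sketch: corpus-level BLEU over a sentence-aligned test set.
# File names are hypothetical; the study's actual tooling is not specified.
import sacrebleu

with open("mt_output.fr", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]   # one MT hypothesis per line
with open("reference.fr", encoding="utf-8") as f:
    references = [line.strip() for line in f]   # one reference per line

# sacrebleu expects a list of reference streams (here, a single one).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```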
5
Human evaluation I
Post-editing productivity test
- 2 participants (one in-house translator, one freelancer)
- 250 segments
- Full post-editing
- Measures: time and HTER (Human-targeted Translation Edit Rate)
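As a rough illustration of the HTER measure, here is a simplified, word-level sketch: the edit distance between the raw MT output and its post-edited version, divided by the length of the post-edited version. Full TER/HTER also counts block shifts, so this is only an approximation, not the metric implementation used in the study.

```python
# Simplified HTER-style score: word-level edit distance between the raw MT
# output and its post-edited version, normalised by the post-edited length.
# Full TER/HTER also counts block shifts; this sketch omits them.
def word_edit_distance(a, b):
    # Classic Levenshtein distance over word tokens.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(a)][len(b)]

def hter(mt_segment, post_edited_segment):
    mt = mt_segment.split()
    pe = post_edited_segment.split()
    return word_edit_distance(mt, pe) / max(len(pe), 1)

# Example: one substitution over a 6-word post-edited segment -> HTER of about 0.17
print(hter("la poste suisse publie son rapport",
           "la poste suisse présente son rapport"))
```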
6
Human evaluation I (continued)
7
Human evaluation I (continued)
Results (continued)
- Post-editing: 53.6% faster for DeepL
- HTER: 75.1% lower for DeepL
8
Human evaluation II
Comparative evaluation of the post-edited output
- Goal: to check that lower PE time and lower HTER do not come with lower final quality
- 3 evaluators (MA translation students)
- Post-edited output from MTH vs. DeepL
9
Human evaluation II (continued)
Results
10
BLEU score’s reliability for NMT evaluation
Motivations
- Low correlation between automatic and human evaluations
- Previous studies: BLEU tends to underestimate the quality of NMT
Methodology
- Calculating the underestimation rate (Shterionov et al., 2017): the number of segments that are better according to human evaluation but have a lower BLEU score, divided by the number of segments that are better according to human evaluation
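A minimal sketch of that calculation, assuming per-segment human judgements and segment-level BLEU scores for both systems are available; the data structures below are hypothetical, not those of the study.

```python
# Underestimation rate (after Shterionov et al., 2017), as described above:
# among segments that humans judge better for a system, the share that
# nevertheless receives a lower segment-level BLEU score.
def underestimation_rate(segments):
    """segments: list of (human_prefers_system, bleu_system, bleu_other)."""
    better_for_human = [s for s in segments if s[0]]
    if not better_for_human:
        return 0.0
    underestimated = [s for s in better_for_human if s[1] < s[2]]
    return len(underestimated) / len(better_for_human)

# Hypothetical example: 3 segments judged better by humans, 2 of them with a
# lower BLEU score -> underestimation rate of about 0.67.
example = [(True, 0.20, 0.35), (True, 0.40, 0.30),
           (True, 0.15, 0.25), (False, 0.10, 0.50)]
print(underestimation_rate(example))
```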
11
BLEU score’s reliability for NMT evaluation
Results
12
Summary of the results
- DeepL obtains a slightly better BLEU score than MTH
- DeepL's output requires less PE effort
- Final quality seems to be better when using DeepL
- BLEU seems to underestimate the quality of DeepL's output