Published by Benedikte Gjerde. Modified over 6 years ago.
1
On the Impact of Various Types of Noise on Neural Machine Translation
Huda Khayrallah & Philipp Koehn
This talk was presented at WNMT at ACL 2018. It is based on the paper of the same title.
3
More data is better!
[Plot: BLEU vs. corpus size (English words), English-Spanish; Koehn & Knowles 2017]
Khayrallah & Koehn
4
De-En translation (BLEU)
                 NMT            SMT
WMT17            27.2           24.0
+ noisy corpus   17.3 (-9.9)    25.2 (+1.2)
We ran into a situation where more data hurt NMT performance and wanted to analyze what happened there. The noisy web corpus is ParaCrawl.
5
Noisy Corpus
[Delta plot: SMT in blue, NMT in green]
Speaker note: explain the delta plot.
6
Manual Analysis
Short Segments (≤2 words): 1%
7
Noise Types
Misaligned Sentences
Misordered Words
Wrong Language
Untranslated Sentences
Short Segments
It is not practical to annotate all of ParaCrawl for noise types. We generate artificial noise that mimics each of these situations and analyze how MT reacts to it. I'm going to go through them all one by one. In general, most types hurt NMT more than SMT, but one was catastrophic for NMT.
8
Misaligned Sentences
9
Misaligned Sentences
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
10
Misaligned Sentences
Die Koalas sind süß        The kangaroos jump
Die Kängurus springen      The koala is soft
Der Koala ist weich        The kangaroo is fast
Das Känguru ist schnell    The koalas are cute
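A minimal sketch of how misaligned pairs like the ones on this slide could be generated. It rotates the target side by one position, which reproduces the example above; the slides do not say how the authors generated the misalignment, so the function name `misalign` and the rotation strategy are assumptions for illustration.

```python
def misalign(pairs, shift=1):
    """Pair each source sentence with the target of a different pair.

    Rotating the target side by `shift` positions is an assumed strategy;
    it guarantees that no source keeps its own target position.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two sentence pairs to misalign")
    shift %= len(pairs)
    if shift == 0:
        raise ValueError("shift must not be a multiple of the corpus size")
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    rotated = targets[shift:] + targets[:shift]
    return list(zip(sources, rotated))


corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

noisy = misalign(corpus)
for src, tgt in noisy:
    print(f"{src}\t{tgt}")
```

A rotation is deterministic, which makes the noisy corpus easy to reproduce; a random shuffle of the target side would serve the same purpose but can leave some pairs aligned by chance.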
11
Misaligned Sentences
[Plot: NMT and SMT]
12
Misordered Words
This can be the product of machine translation, poor human translation, or heavily specialized language use, such as bullet points in product descriptions.
13
Misordered Words (source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
14
Misordered Words (source)
Koalas Die sind süß        The koalas are cute
Kängurus springen Die      The kangaroos jump
ist Der weich Koala        The koala is soft
schnell Känguru ist Das    The kangaroo is fast
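A minimal sketch of word-order noise like the example on this slide: the words of one side of each pair are shuffled while the other side is left untouched. The function name `misorder_words`, the `side` parameter, and the fixed seed are assumptions for illustration, not details taken from the talk.

```python
import random


def misorder_words(pairs, side="source", seed=0):
    """Shuffle the words on one side of every sentence pair."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    rng = random.Random(seed)  # seeded so the noisy corpus is reproducible
    noisy = []
    for src, tgt in pairs:
        words = (src if side == "source" else tgt).split()
        rng.shuffle(words)  # in-place shuffle of this sentence's words
        shuffled = " ".join(words)
        noisy.append((shuffled, tgt) if side == "source" else (src, shuffled))
    return noisy


corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

noisy = misorder_words(corpus, side="source")
for src, tgt in noisy:
    print(f"{src}\t{tgt}")
```

Calling the same function with `side="target"` gives the target-side variant. A random shuffle can occasionally return a sentence in its original order, which matters mostly for very short sentences.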
15
Misordered Words (source)
[Plot: NMT and SMT]
16
Misordered Words (target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
17
Misordered Words (target)
Die Koalas sind süß        koalas cute are The
Die Kängurus springen      kangaroos The jump
Der Koala ist weich        is The soft koala
Das Känguru ist schnell    fast The is kangaroo
18
Misordered Words (target)
NMT is hurt the same; SMT is hurt more.
[Plot: NMT and SMT]
19
Wrong Language
Language ID does not work well on single sentences.
20
Wrong Language (French source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
21
Wrong Language (French source)
Les koalas sont mignons    The koalas are cute
Les kangourous sautent     The kangaroos jump
Le koala est doux          The koala is soft
Le kangourou est rapide    The kangaroo is fast
22
Wrong Language (French source)
Multilingual MT works.
[Plot: NMT and SMT]
23
Wrong Language (French target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
24
Wrong Language (French target)
Die Koalas sind süß        Les koalas sont mignons
Die Kängurus springen      Les kangourous sautent
Der Koala ist weich        Le koala est doux
Das Känguru ist schnell    Le kangourou est rapide
25
Wrong Language (French target)
This didn't hurt as much as we expected, perhaps because there was a domain shift between the two corpora. We would have expected output in the wrong language half the time; the systems may be picking up on the domain shift.
[Plot: NMT and SMT]
26
Untranslated
27
Untranslated (English source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
28
Untranslated (English source)
The koalas are cute
The kangaroos jump
The koala is soft
The kangaroo is fast
29
Untranslated (English source)
Multi-task: copy & translate
[Plot: NMT and SMT]
30
Untranslated (German target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
31
Untranslated (German target)
Die Koalas sind süß
Die Kängurus springen
Der Koala ist weich
Das Känguru ist schnell
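A minimal sketch of untranslated-sentence noise: one side of each pair is copied onto the other, so source and target are identical. The function name `make_untranslated` and the `copy_from` parameter are assumptions for illustration.

```python
def make_untranslated(pairs, copy_from="source"):
    """Return pairs in which both sides hold the same sentence.

    copy_from="source": the target becomes a copy of the German source
                        (the "German target" case).
    copy_from="target": the source becomes a copy of the English target
                        (the "English source" case).
    """
    if copy_from == "source":
        return [(src, src) for src, _ in pairs]
    if copy_from == "target":
        return [(tgt, tgt) for _, tgt in pairs]
    raise ValueError("copy_from must be 'source' or 'target'")


corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
]

german_target = make_untranslated(corpus, copy_from="source")
english_source = make_untranslated(corpus, copy_from="target")
```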
32
Untranslated (German target)
More analysis on the poster.
The model learns to copy.
We can't *just* filter exact copies.
[Plot: NMT and SMT]
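The slide notes that filtering exact copies alone is not enough, but does not say what to use instead. As one possible illustration, the sketch below drops pairs whose two sides share most of their tokens, which also catches near-copies. The function names `copy_ratio` and `drop_copies`, the overlap measure, and the 0.6 threshold are all hypothetical choices, not the authors' method.

```python
def copy_ratio(src, tgt):
    """Share of distinct tokens common to both sides, relative to the shorter side."""
    src_tokens = set(src.lower().split())
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens or not tgt_tokens:
        return 0.0
    shared = src_tokens & tgt_tokens
    return len(shared) / min(len(src_tokens), len(tgt_tokens))


def drop_copies(pairs, max_ratio=0.6):
    """Keep only pairs whose token overlap stays at or below max_ratio (assumed threshold)."""
    return [(s, t) for s, t in pairs if copy_ratio(s, t) <= max_ratio]


corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),    # real translation
    ("Die Koalas sind süß", "Die Koalas sind süß"),    # exact copy
    ("Die Koalas sind süß .", "Die Koalas sind süß"),  # near copy
]

kept = drop_copies(corpus)
```

A token-overlap heuristic like this will also discard legitimate pairs that share many names or numbers, so the threshold would need tuning on real data.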
33
Short Segments
≤2 words, 3-5 words
Practical in low-resource settings; specialized terms.
Recently some papers have come out on dictionary training in NMT, but prior to that there had been concern about this affecting the LM component of NMT.
34
Short Segments
Die        The
süß        cute
Känguru    Kangaroo
schnell    fast
35
Short Segments
[Plots: ≤2 words; 3-5 words]
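A minimal sketch of how the two length buckets on this slide (≤2 words and 3-5 words) could be selected from a parallel corpus. Counting whitespace-separated words and requiring both sides to fall inside the bucket are assumptions; the slides do not specify how the short segments were obtained.

```python
def short_segments(pairs, min_len=1, max_len=2):
    """Keep pairs whose source and target both have min_len..max_len words."""
    def in_bucket(sentence):
        return min_len <= len(sentence.split()) <= max_len

    return [(src, tgt) for src, tgt in pairs if in_bucket(src) and in_bucket(tgt)]


corpus = [
    ("Die", "The"),
    ("süß", "cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

up_to_two = short_segments(corpus, min_len=1, max_len=2)
three_to_five = short_segments(corpus, min_len=3, max_len=5)
```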
36
We do some further analysis
Come to the poster to talk more!
37
Existing filtering methods
BiCleaner [Esplà-Gomis & Forcada 2009]
Zipporah [Xu & Koehn 2017]
One option with ParaCrawl is to use a filtered version, but we want to see what kinds of noise exist in the data in order to improve filtering and to motivate future filtering work.
38
Existing filtering methods
BiCleaner [Esplà-Gomis & Forcada 2009]
Zipporah [Xu & Koehn 2017]
WMT shared task
Shared task: did well, beat Zipporah
39
Questions?
44
De-En translation
NMT (Marian NMT): RNN encoder-decoder with attention; BPE; dropout; (shallow!); …
SMT (Moses): 5-gram LM; hierarchical lexicalized reordering; operation sequence model; MBR decoding; cube pruning; …
No monolingual data for either.
45
Let’s go get more data!
I focus on web-crawled data in this talk.
46
Existing filtering methods
47
De-En translation (BLEU)
                 NMT            SMT
WMT17            27.2           24.0
+ noisy corpus   17.3 (-9.9)    25.2 (+1.2)
This work came about because we wanted to add ParaCrawl data to WMT.