On the Impact of Various Types of Noise on Neural Machine Translation

1 On the Impact of Various Types of Noise on Neural Machine Translation
Huda Khayrallah & Philipp Koehn {huda,
This talk was presented at WNMT at ACL 2018.
It is based on this paper: bib:

3 More data is better!
[Figure: BLEU vs. corpus size (English words), English-Spanish; Koehn & Knowles 2017]
Khayrallah & Koehn

4 De-En translation
                  NMT            SMT
WMT17             27.2           24.0
+ noisy corpus    17.3 (-9.9)    25.2 (+1.2)
We ran into a situation where more data hurt NMT performance and wanted to analyze what happened there.
Noisy web corpus: ParaCrawl.
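The parenthesized changes in the table can be checked with a few lines of arithmetic. This is only a sketch: the dictionary names and the `deltas` helper are made up for illustration, and the scores are copied from the slide.

```python
# Scores copied from the slide (system -> score).
baseline = {"NMT": 27.2, "SMT": 24.0}     # WMT17 only
with_noise = {"NMT": 17.3, "SMT": 25.2}   # WMT17 + noisy corpus


def deltas(base, new):
    """Return the per-system change, rounded to one decimal place."""
    return {system: round(new[system] - base[system], 1) for system in base}


if __name__ == "__main__":
    for system, delta in deltas(baseline, with_noise).items():
        print(f"{system}: {delta:+.1f}")
```

Adding the noisy corpus moves the two systems in opposite directions: NMT drops by 9.9 while SMT gains 1.2.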

5 Noisy Corpus
[Delta plot: SMT in blue, NMT in green]
Note: explain the delta plot.

6 Manual Analysis
Short Segments (<=2 words): 1%

7 Noise Types
Misaligned Sentences
Misordered Words
Wrong Language
Untranslated Sentences
Short Segments
It is not practical to annotate all of ParaCrawl for noise types. We generate artificial noise in an attempt to mimic these situations and analyze how MT reacts to each. I'm going to go through them all one by one. In general, most impact NMT more than SMT, but one was catastrophic for NMT.

8 Misaligned Sentences

9 Misaligned Sentences
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast

10 Misaligned Sentences
Die Koalas sind süß        The kangaroos jump
Die Kängurus springen      The koala is soft
Der Koala ist weich        The kangaroo is fast
Das Känguru ist schnell    The koalas are cute
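The misalignment in this example can be reproduced by rotating the target side by one position. The sketch below assumes that rotation; the slides do not say how the pairing was actually perturbed, and a random shuffle of the target side would be another option. The `misalign` function and its `shift` parameter are hypothetical names.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def misalign(pairs: List[Pair], shift: int = 1) -> List[Pair]:
    """Pair each source with the target that sits `shift` rows further down.

    Assumption: a simple rotation, chosen because it reproduces the slide.
    """
    if not pairs:
        return []
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    k = shift % len(targets)
    rotated = targets[k:] + targets[:k]
    return list(zip(sources, rotated))


clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

if __name__ == "__main__":
    for src, tgt in misalign(clean):
        print(f"{src}\t{tgt}")
```

Every sentence is still fluent on its own; only the pairing is wrong, which is what distinguishes this noise type from the others.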

11 Misaligned Sentences
[Figure: NMT vs. SMT]

12 Misordered Words
These can be the product of machine translation, poor human translation, or heavily specialized language use, such as bullet points in product descriptions.

13 Misordered Words (source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast

14 Misordered Words (source)
Koalas Die sind süß        The koalas are cute
Kängurus springen Die      The kangaroos jump
ist Der weich Koala        The koala is soft
schnell Känguru ist Das    The kangaroo is fast

15 Misordered Words (source)
[Figure: NMT vs. SMT]

16 Misordered Words (target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast

17 Misordered Words (target)
Die Koalas sind süß        koalas cute are The
Die Kängurus springen      kangaroos The jump
Der Koala ist weich        is The soft koala
Das Känguru ist schnell    fast The is kangaroo
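Misordered-word noise on either side can be sketched as a per-sentence shuffle of whitespace-separated tokens. The function names, the `side` argument, and the fixed `seed` are illustrative assumptions, not details taken from the talk.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its whitespace-separated words shuffled."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def misorder(pairs: List[Pair], side: str, seed: int = 0) -> List[Pair]:
    """Shuffle the words on one side ("source" or "target") of every pair."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    rng = random.Random(seed)  # fixed seed so the noise is reproducible
    noisy = []
    for src, tgt in pairs:
        if side == "source":
            noisy.append((shuffle_words(src, rng), tgt))
        else:
            noisy.append((src, shuffle_words(tgt, rng)))
    return noisy


if __name__ == "__main__":
    clean = [("Die Koalas sind süß", "The koalas are cute")]
    print(misorder(clean, side="source"))
    print(misorder(clean, side="target"))
```

A shuffle can leave a short sentence in its original order by chance, so a stricter version might reshuffle until the order changes.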

18 Misordered Words (target)
NMT is hurt the same; SMT is hurt more.
[Figure: NMT vs. SMT]

19 Wrong Language
Language ID does not work well on single sentences.

20 Wrong Language (French source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast

21 Wrong Language (French source)
Les koalas sont mignons    The koalas are cute
Les kangourous sautent     The kangaroos jump
Le koala est doux          The koala is soft
Le kangourou est rapide    The kangaroo is fast

22 Wrong Language (French source)
Multilingual MT works.
[Figure: NMT vs. SMT]

23 Wrong Language (French target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast

24 Wrong Language (French target)
Die Koalas sind süß        Les koalas sont mignons
Die Kängurus springen      Les kangourous sautent
Der Koala ist weich        Le koala est doux
Das Känguru ist schnell    Le kangourou est rapide

25 Wrong Language (French target)
This didn't hurt as much as we expected, perhaps because there was a domain shift between the two corpora. We should have been in the wrong language half the time… the systems may be picking up on the domain shift.
[Figure: NMT vs. SMT]

26 Untranslated

27 Untranslated (English source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast

28 Untranslated (English source)
The koalas are cute
The kangaroos jump
The koala is soft
The kangaroo is fast

29 Untranslated (English source)
Multi-task: copy & translate.
[Figure: NMT vs. SMT]

30 Untranslated (German target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast

31 Untranslated (German target)
Die Koalas sind süß
Die Kängurus springen
Der Koala ist weich
Das Känguru ist schnell
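Both untranslated variants amount to copying one side of each pair onto the other. A minimal sketch follows, with a hypothetical `untranslated` function and `copy_from` argument.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def untranslated(pairs: List[Pair], copy_from: str) -> List[Pair]:
    """Make both sides of every pair identical.

    copy_from="source": the target becomes a copy of the German source
                        (the "German target" case).
    copy_from="target": the source becomes a copy of the English target
                        (the "English source" case).
    """
    if copy_from == "source":
        return [(src, src) for src, _ in pairs]
    if copy_from == "target":
        return [(tgt, tgt) for _, tgt in pairs]
    raise ValueError("copy_from must be 'source' or 'target'")


if __name__ == "__main__":
    clean = [("Die Koalas sind süß", "The koalas are cute")]
    print(untranslated(clean, copy_from="source"))  # German on both sides
    print(untranslated(clean, copy_from="target"))  # English on both sides
```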

32 Untranslated (German target)
More analysis on the poster.
The system learns to copy.
We can't *just* filter exact copies.
[Figure: NMT vs. SMT]
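For reference, an exact-copy filter could look like the sketch below. The `drop_exact_copies` helper is hypothetical. The near-copy case is only an illustration of one way such a filter can fall short; it is an assumption here and not a result reported in the talk.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def drop_exact_copies(pairs: List[Pair]) -> List[Pair]:
    """Keep only pairs whose two sides differ after trimming whitespace."""
    return [(src, tgt) for src, tgt in pairs if src.strip() != tgt.strip()]


corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),   # real translation
    ("Der Koala ist weich", "Der Koala ist weich"),   # exact copy: removed
    ("Der Koala ist weich", "Der Koala ist weich ."), # near copy: kept
]

if __name__ == "__main__":
    for src, tgt in drop_exact_copies(corpus):
        print(f"{src}\t{tgt}")
```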

33 Short Segments
Segment lengths: <=2 words, 3-5 words
Practical in low-resource settings
Specialized terms
Recently some papers have come out on dictionary training in NMT, but prior to that there had been concern about this affecting the LM component of NMT.

34 Short Segments
Die        The
süß        cute
Känguru    Kangaroo
schnell    fast
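The two length buckets (<=2 words and 3-5 words) can be sketched as a word-count filter. This assumes that length is counted in whitespace-separated tokens and that both sides must fall in the bucket; the slides do not state either, and the function names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def in_bucket(sentence: str, lo: int, hi: int) -> bool:
    """True if the sentence has between lo and hi whitespace tokens, inclusive."""
    return lo <= len(sentence.split()) <= hi


def short_segments(pairs: List[Pair], lo: int, hi: int) -> List[Pair]:
    """Keep pairs where both sides fall inside the length bucket (assumption)."""
    return [(s, t) for s, t in pairs if in_bucket(s, lo, hi) and in_bucket(t, lo, hi)]


segments = [
    ("Die", "The"),
    ("süß", "cute"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Die Kängurus springen", "The kangaroos jump"),
]

if __name__ == "__main__":
    print(short_segments(segments, 1, 2))  # the <=2 bucket
    print(short_segments(segments, 3, 5))  # the 3-5 bucket
```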

35 Short Segments
[Figure: panels for ≤ 2 words and 3-5 words]

36 We do some further analysis
Come to the poster to talk more!

37 Existing filtering methods
BiCleaner [Espla-Gomis & Forcada 2009]
Zipporah [Xu & Koehn 2017]
One option with ParaCrawl is to use a filtered version, but we want to see what kind of noise exists in the data in order to improve filtering and to motivate future filtering work.

38 Existing filtering methods
BiCleaner [Espla-Gomis & Forcada 2009]
Zipporah [Xu & Koehn 2017]
WMT shared task
Shared task: did well, beat Zipporah.

39 Questions?

44 DeEn translation MARIANNMT Moses RNN encoder-decoder With attention
BPE Dropout (Shallow!) Moses 5gram LM hierarchical lexicalized reordering Operation sequence model MBR decoding Cube pruning No monolingual data for either Khayrallah & Koehn

45 Let’s go get more data!
I focus on web-crawled data in this talk.

46 Existing filtering methods

47 DeEn translation NMT SMT WMT17 27.2 24.0 + noisy corpus 17.3 (-9.9)
25.2 (+1.2) This work came about because we wanted to add paracrawl data to WMT Khayrallah & Koehn
