On the Impact of Various Types of Noise on Neural Machine Translation
Huda Khayrallah & Philipp Koehn
{huda, phi}@jhu.edu
This talk was presented at WNMT at ACL 2018.
It is based on this paper: http://aclweb.org/anthology/W18-2705
bib: http://aclweb.org/anthology/W18-2705.bib
More data is better!
[Figure: BLEU vs. corpus size (English words), English-Spanish; from Koehn & Knowles 2017]
De-En translation; the noisy web corpus is ParaCrawl.

BLEU              NMT            SMT
WMT17             27.2           24.0
+ noisy corpus    17.3 (-9.9)    25.2 (+1.2)

We ran into a situation where more data hurt NMT performance and wanted to analyze what happened there.
Noisy Corpus
[Figure: delta plot for the noisy corpus; SMT in blue, NMT in green]
Manual Analysis
[Figure: manual analysis results; recoverable label: Short Segments (≤ 2 words), 1%]
Noise Types
Misaligned Sentences
Misordered Words
Wrong Language
Untranslated Sentences
Short Segments
It is not practical to annotate all of ParaCrawl for noise types, so we generate artificial noise that mimics each situation and analyze how MT reacts to it. I’m going to go through them one by one. In general, most types hurt NMT more than SMT, but one was catastrophic for NMT.
Misaligned Sentences
Misaligned Sentences
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
Misaligned Sentences
Die Koalas sind süß        The kangaroos jump
Die Kängurus springen      The koala is soft
Der Koala ist weich        The kangaroo is fast
Das Känguru ist schnell    The koalas are cute
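The misalignment on this slide can be reproduced with a short sketch. The function name `misalign` and its `shift` parameter are illustrative assumptions, not the authors' code. The slide pairs each source with the next sentence's target, so the sketch rotates the target side by one; randomly shuffling one side would be another way to produce the same kind of noise.

```python
def misalign(pairs, shift=1):
    """Pair each source with the target `shift` positions later, wrapping around.

    Hypothetical helper for illustration. `pairs` is a list of
    (source, target) strings. If `shift` is a multiple of len(pairs),
    the corpus comes back unchanged.
    """
    if not pairs:
        return []
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    n = len(targets)
    rotated = [targets[(i + shift) % n] for i in range(n)]
    return list(zip(sources, rotated))


# Toy corpus taken from the slide.
clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]
noisy = misalign(clean)
for src, tgt in noisy:
    print(f"{src}\t{tgt}")
```

Both sides stay fluent sentences in the right language; only the correspondence between them is broken.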
Misaligned Sentences
[Figure: results for NMT and SMT]
Misordered Words
Misordered words can be the product of machine translation, poor human translation, or heavily specialized language use, such as bullet points in product descriptions.
Misordered Words (source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
Misordered Words (source)
Koalas Die sind süß        The koalas are cute
Kängurus springen Die      The kangaroos jump
ist Der weich Koala        The koala is soft
schnell Känguru ist Das    The kangaroo is fast
Misordered Words (source)
[Figure: results for NMT and SMT]
Misordered Words (target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
Misordered Words (target)
Die Koalas sind süß        koalas cute are The
Die Kängurus springen      kangaroos The jump
Der Koala ist weich        is The soft koala
Das Känguru ist schnell    fast The is kangaroo
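Word-order noise of this kind, on either side, can be sketched as follows. The helper `misorder` and its `side` and `seed` parameters are hypothetical names for illustration, and whitespace tokenization is an assumption; the slides do not specify how the words were shuffled.

```python
import random


def misorder(pairs, side="source", seed=0):
    """Shuffle the word order within each sentence on one side of the corpus.

    Hypothetical helper for illustration. Words are split on whitespace.
    A shuffle can return the original order by chance, especially for
    very short sentences.
    """
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    rng = random.Random(seed)  # seeded so the noise is reproducible
    noisy = []
    for src, tgt in pairs:
        words = (src if side == "source" else tgt).split()
        rng.shuffle(words)
        shuffled = " ".join(words)
        noisy.append((shuffled, tgt) if side == "source" else (src, shuffled))
    return noisy


# Toy corpus taken from the slides.
clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]
noisy_source = misorder(clean, side="source")
noisy_target = misorder(clean, side="target")
```

Each noisy sentence keeps exactly the same words as the clean one, so only the ordering information is lost.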
Misordered Words (target)
[Figure: results for NMT and SMT]
NMT is hurt about the same as with source-side misordering; SMT is hurt more.
Wrong Language
Language ID does not work well on single sentences.
Wrong Language (French source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
Wrong Language (French source)
Les koalas sont mignons    The koalas are cute
Les kangourous sautent     The kangaroos jump
Le koala est doux          The koala is soft
Le kangourou est rapide    The kangaroo is fast
Wrong Language (French source)
[Figure: results for NMT and SMT]
Multilingual MT works.
Wrong Language (French target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
Wrong Language (French target)
Die Koalas sind süß        Les koalas sont mignons
Die Kängurus springen      Les kangourous sautent
Der Koala ist weich        Le koala est doux
Das Känguru ist schnell    Le kangourou est rapide
Wrong Language (French target)
[Figure: results for NMT and SMT]
This did not hurt as much as we expected, perhaps because there was a domain shift between the two corpora. The output should have been in the wrong language half the time, but the systems appear to be picking up on the domain shift.
Untranslated
Untranslated (English source)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
Untranslated (English source)
The koalas are cute
The kangaroos jump
The koala is soft
The kangaroo is fast
Untranslated (English source)
[Figure: results for NMT and SMT]
Multi-task: copy and translate.
Untranslated (German target)
Die Koalas sind süß        The koalas are cute
Die Kängurus springen      The kangaroos jump
Der Koala ist weich        The koala is soft
Das Känguru ist schnell    The kangaroo is fast
Untranslated (German target)
Die Koalas sind süß
Die Kängurus springen
Der Koala ist weich
Das Känguru ist schnell
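A minimal sketch of untranslated noise, assuming it means that the same sentence appears on both sides of a pair. The helper name `untranslated` and the `copy_from` parameter are hypothetical. For De-En, copying the source gives the "German target" variant and copying the target gives the "English source" variant.

```python
def untranslated(pairs, copy_from="source"):
    """Replace one side of each pair with a copy of the other side.

    Hypothetical helper for illustration, assuming "untranslated" means
    an identical sentence on both sides.
    copy_from="source": the target becomes a copy of the source.
    copy_from="target": the source becomes a copy of the target.
    """
    if copy_from == "source":
        return [(src, src) for src, _ in pairs]
    if copy_from == "target":
        return [(tgt, tgt) for _, tgt in pairs]
    raise ValueError("copy_from must be 'source' or 'target'")


# Toy corpus taken from the slides (German source, English target).
clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]
german_target = untranslated(clean, copy_from="source")
english_source = untranslated(clean, copy_from="target")
```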
Untranslated (German target)
[Figure: results for NMT and SMT]
More analysis is on the poster. The system learns to copy. We can't *just* filter exact copies.
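For reference, a naive exact-copy filter might look like the sketch below. The name `drop_exact_copies`, the whitespace and case normalization, and the near-copy example pair are all assumptions made for illustration. By construction, this filter keeps any pair whose two sides differ at all.

```python
def drop_exact_copies(pairs):
    """Drop pairs whose source and target are identical after light normalization.

    Hypothetical helper for illustration. Normalization here is only
    whitespace collapsing and lowercasing.
    """
    def norm(text):
        return " ".join(text.split()).lower()

    return [(src, tgt) for src, tgt in pairs if norm(src) != norm(tgt)]


# Illustrative toy corpus; the second and third pairs are made up for this sketch.
corpus = [
    ("Die Koalas sind süß", "The koalas are cute"),        # proper translation
    ("Die Kängurus springen", "Die Kängurus springen"),    # exact copy
    ("Der Koala ist weich", "Der Koala ist weich ."),      # near copy
]
kept = drop_exact_copies(corpus)
for src, tgt in kept:
    print(f"{src}\t{tgt}")
```

The exact copy is removed, but the near copy passes through untouched.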
Short Segments
Segment lengths: ≤ 2 words and 3-5 words.
Practical in low-resource settings and for specialized terms. Recently some papers have come out on dictionary training in NMT, but before that there had been concern that short segments would affect the LM component of NMT.
Short Segments
Die        The
süß        cute
Känguru    Kangaroo
schnell    fast
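The slides do not say how the short segments were obtained. One plausible way to get such data, sketched here as an assumption, is to keep only the pairs from a parallel corpus whose two sides are at most a given number of words long. The helper `short_segments` and its `max_len` parameter are hypothetical names.

```python
def short_segments(pairs, max_len=2):
    """Keep only pairs where both sides have at most `max_len` whitespace tokens.

    Hypothetical helper for illustration; the length thresholds 2 and 5
    mirror the "≤ 2 words" and "3-5 words" settings named in the talk.
    """
    return [
        (src, tgt)
        for src, tgt in pairs
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len
    ]


# Toy corpus mixing one-word pairs and full sentences from the slides.
corpus = [
    ("Die", "The"),
    ("süß", "cute"),
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]
up_to_two = short_segments(corpus, max_len=2)
up_to_five = short_segments(corpus, max_len=5)
```

With this toy corpus, `up_to_two` keeps only the two one-word pairs, while `up_to_five` keeps all four pairs.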
Short Segments
[Figure: results for NMT and SMT; panels: ≤ 2 words, 3-5 words]
We do some further analysis. Come to the poster to talk more!
Existing filtering methods
BiCleaner [Esplà-Gomis & Forcada 2009]
Zipporah [Xu & Koehn 2017]
One option with ParaCrawl is to use a filtered version, but we want to see what kinds of noise exist in the data in order to improve filtering and to motivate future filtering work.
Existing filtering methods
BiCleaner [Esplà-Gomis & Forcada 2009]
Zipporah [Xu & Koehn 2017]
WMT shared task
Shared task: did well, beat Zipporah.
Questions?
De-En translation
NMT (Marian NMT): RNN encoder-decoder with attention, BPE, dropout, (shallow!), …
SMT (Moses): 5-gram LM, hierarchical lexicalized reordering, operation sequence model, MBR decoding, cube pruning, …
No monolingual data for either.
Let’s go get more data!
I focus on web-crawled data in this talk.
Existing filtering methods
De-En translation

BLEU              NMT            SMT
WMT17             27.2           24.0
+ noisy corpus    17.3 (-9.9)    25.2 (+1.2)

This work came about because we wanted to add ParaCrawl data to WMT.