On the Impact of Various Types of Noise on Neural Machine Translation

Presentation transcript:

On the Impact of Various Types of Noise on Neural Machine Translation
Huda Khayrallah & Philipp Koehn
{huda, phi}@jhu.edu
This talk was presented at WNMT at ACL 2018. It is based on this paper: http://aclweb.org/anthology/W18-2705 (bib: http://aclweb.org/anthology/W18-2705.bib)

More data is better!
[Figure: BLEU vs. corpus size (English words), English-Spanish. Source: Koehn & Knowles 2017]
Khayrallah & Koehn

De-En translation; noisy web corpus: Paracrawl

                  NMT            SMT
WMT17             27.2           24.0
+ noisy corpus    17.3 (-9.9)    25.2 (+1.2)

We ran into a situation where more data hurt NMT performance, and we wanted to analyze what happened there.

Noisy Corpus
[Plot: NMT and SMT]
Speaker note: explain the delta plot (SMT in blue, NMT in green).

Manual Analysis
Short Segments (≤2 words): 1%

Noise Types
Misaligned Sentences
Misordered Words
Wrong Language
Untranslated Sentences
Short Segments

It is not practical to annotate all of Paracrawl for noise types. We generate artificial noise in an attempt to mimic these situations and analyze how MT reacts to each. I'm going to go through them all one by one. In general, most noise types impact NMT more than SMT, but one was catastrophic for NMT.

Misaligned Sentences

Misaligned Sentences
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Misaligned Sentences
Die Koalas sind süß | The kangaroos jump
Die Kängurus springen | The koala is soft
Der Koala ist weich | The kangaroo is fast
Das Känguru ist schnell | The koalas are cute
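The misalignment shown on this slide can be mimicked in a few lines of Python. This is only a sketch under stated assumptions, not the authors' script: the function name `misalign` and its `shift` parameter are hypothetical, and it rotates the target side by one position, which reproduces the slide's example but is not necessarily the procedure used in the paper.

```python
def misalign(pairs, shift=1):
    """Return sentence pairs whose target side is rotated by `shift` positions.

    Hypothetical helper for illustration: every source keeps its place while
    the targets move, so source i is paired with target (i + shift) % n.
    """
    if not pairs:
        return []
    sources = [src for src, _ in pairs]
    targets = [tgt for _, tgt in pairs]
    k = shift % len(targets)
    rotated = targets[k:] + targets[:k]
    return list(zip(sources, rotated))


clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

noisy = misalign(clean)
for src, tgt in noisy:
    print(f"{src} | {tgt}")
```

A rotation leaves no pair in its original alignment as long as the shift is not a multiple of the corpus size and the targets are distinct; a random shuffle of the target side would be another option, at the cost of occasionally leaving a pair aligned.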

Misaligned Sentences
[Plot: NMT and SMT]

Misordered Words
Misordered words can be a product of machine translation, poor human translation, or heavily specialized language use, such as bullet points in product descriptions.

Misordered Words (source)
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Misordered Words (source)
Koalas Die sind süß | The koalas are cute
Kängurus springen Die | The kangaroos jump
ist Der weich Koala | The koala is soft
schnell Känguru ist Das | The kangaroo is fast

Misordered Words (source)
[Plot: NMT and SMT]

Misordered Words (target)
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Misordered Words (target)
Die Koalas sind süß | koalas cute are The
Die Kängurus springen | kangaroos The jump
Der Koala ist weich | is The soft koala
Das Känguru ist schnell | fast The is kangaroo
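Word-order noise of the kind shown on the source-side and target-side slides can be sketched as a per-sentence shuffle. The helper names `shuffle_words` and `misorder`, the `side` and `seed` parameters, and the whitespace tokenization are all assumptions made for illustration; this is not taken from the paper's code.

```python
import random


def shuffle_words(sentence, rng):
    """Return `sentence` with its whitespace-separated tokens in random order."""
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


def misorder(pairs, side="source", seed=0):
    """Shuffle the words on one side of every pair (hypothetical helper)."""
    if side not in ("source", "target"):
        raise ValueError("side must be 'source' or 'target'")
    rng = random.Random(seed)  # fixed seed so the noise is reproducible
    noisy = []
    for src, tgt in pairs:
        if side == "source":
            noisy.append((shuffle_words(src, rng), tgt))
        else:
            noisy.append((src, shuffle_words(tgt, rng)))
    return noisy


clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

noisy_src = misorder(clean, side="source")
noisy_tgt = misorder(clean, side="target")
```

A random shuffle can occasionally return a sentence in its original order, especially for very short sentences, so a stricter version might reshuffle until the order changes.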

Misordered Words (target)
NMT is hurt the same; SMT is hurt more.
[Plot: NMT and SMT]

Wrong Language
Language ID does not work well on single sentences.

Wrong Language (French source)
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Wrong Language (French source)
Les koalas sont mignons | The koalas are cute
Les kangourous sautent | The kangaroos jump
Le koala est doux | The koala is soft
Le kangourou est rapide | The kangaroo is fast

Wrong Language (French source)
Multilingual MT works.
[Plot: NMT and SMT]

Wrong Language (French target)
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Wrong Language (French target)
Die Koalas sind süß | Les koalas sont mignons
Die Kängurus springen | Les kangourous sautent
Der Koala ist weich | Le koala est doux
Das Känguru ist schnell | Le kangourou est rapide

Wrong Language (French target)
This didn't hurt as much as we expected, perhaps because there was a domain shift between the two corpora. We should have been in the wrong language half the time… the systems may be picking up on the domain shift.
[Plot: NMT and SMT]

Untranslated

Untranslated (English source)
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Untranslated (English source)
The koalas are cute
The kangaroos jump
The koala is soft
The kangaroo is fast

Untranslated (English source)
Multi-task: copy & translate.
[Plot: NMT and SMT]

Untranslated (German target)
Die Koalas sind süß | The koalas are cute
Die Kängurus springen | The kangaroos jump
Der Koala ist weich | The koala is soft
Das Känguru ist schnell | The kangaroo is fast

Untranslated (German target)
Die Koalas sind süß
Die Kängurus springen
Der Koala ist weich
Das Känguru ist schnell
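Untranslated noise amounts to putting the same sentence on both sides of a pair. The sketch below is a hedged illustration: the function name `untranslated` and its `copy` parameter are hypothetical and are not drawn from the authors' code.

```python
def untranslated(pairs, copy="source"):
    """Build pairs in which one sentence appears on both sides.

    copy="source": the source sentence is copied to the target side
                   (for De-En data: German on both sides, "German target").
    copy="target": the target sentence is copied to the source side
                   (English on both sides, "English source").
    Hypothetical helper for illustration only.
    """
    if copy == "source":
        return [(src, src) for src, _ in pairs]
    if copy == "target":
        return [(tgt, tgt) for _, tgt in pairs]
    raise ValueError("copy must be 'source' or 'target'")


clean = [
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Die Kängurus springen", "The kangaroos jump"),
    ("Der Koala ist weich", "The koala is soft"),
    ("Das Känguru ist schnell", "The kangaroo is fast"),
]

german_target = untranslated(clean, copy="source")
english_source = untranslated(clean, copy="target")
```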

Untranslated (German target)
More analysis is on the poster.
The system learns to copy.
We can't *just* filter exact copies.
[Plot: NMT and SMT]
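One possible reading of the remark about exact copies is that pairs differing only slightly, for example in punctuation, would slip past an exact-match filter. The sketch below shows a similarity-based alternative using Python's standard `difflib`. It is an assumption-laden illustration, not a method from the talk: the names `copy_ratio` and `drop_near_copies` and the threshold of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher


def copy_ratio(src, tgt):
    """Similarity in [0, 1] between the lowercased token sequences."""
    a = src.lower().split()
    b = tgt.lower().split()
    return SequenceMatcher(None, a, b).ratio()


def drop_near_copies(pairs, threshold=0.8):
    """Keep only pairs whose two sides are less similar than `threshold`.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return [(s, t) for s, t in pairs if copy_ratio(s, t) < threshold]


pairs = [
    ("Die Koalas sind süß", "The koalas are cute"),    # real translation
    ("Die Koalas sind süß", "Die Koalas sind süß"),    # exact copy
    ("Die Koalas sind süß", "Die Koalas sind süß ."),  # near copy
]

kept = drop_near_copies(pairs)
```

A token-overlap filter like this would also discard legitimate pairs that share many tokens, such as sentences made up mostly of names or numbers, so the threshold would need checking against real data.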

Short Segments
Lengths: ≤2 words, 3-5 words
Practical for low-resource settings and specialized terms. Recently some papers have come out on dictionary training in NMT, but prior to that there had been concern about this affecting the LM component of NMT.

Short Segments
Die | The
süß | cute
Känguru | Kangaroo
schnell | fast

Short Segments
[Plots: ≤ 2 words; 3-5 words]
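The two length groups on this slide can be sketched as a simple bucketing function. This is an illustration under assumptions: the name `length_bucket` is hypothetical, tokens are split on whitespace, and length is measured as the longer of the two sides, which the slides do not specify.

```python
def length_bucket(src, tgt):
    """Return "<=2", "3-5", or None for a sentence pair.

    Assumption: a pair's length is the larger whitespace-token count of its
    two sides. Empty pairs and pairs longer than 5 tokens fall in no bucket.
    """
    n = max(len(src.split()), len(tgt.split()))
    if n == 0:
        return None
    if n <= 2:
        return "<=2"
    if n <= 5:
        return "3-5"
    return None


pairs = [
    ("Die", "The"),
    ("süß", "cute"),
    ("Die Koalas sind süß", "The koalas are cute"),
    ("Das Känguru ist heute sehr schnell", "The kangaroo is very fast today"),
]

buckets = [length_bucket(src, tgt) for src, tgt in pairs]
print(buckets)
```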

We do some further analysis. Come to the poster to talk more!

Existing filtering methods
BiCleaner [Miquel Espla-Gomis and M Forcada 2009]
Zipporah [Xu & Koehn 2017]
One option with Paracrawl is to use a filtered version, but we want to see what kinds of noise exist in the data, both to improve the filtering and to motivate future filtering work.

Existing filtering methods
BiCleaner [Miquel Espla-Gomis and M Forcada 2009]
Zipporah [Xu & Koehn 2017]
WMT shared task
Shared task: did well, beat Zipporah.

Questions?

DeEn translation MARIANNMT Moses RNN encoder-decoder With attention BPE Dropout (Shallow!) … Moses 5gram LM hierarchical lexicalized reordering Operation sequence model MBR decoding Cube pruning … No monolingual data for either Khayrallah & Koehn

Let’s go get more data!
I focus on web-crawled data in this talk.

Existing filtering methods

DeEn translation NMT SMT WMT17 27.2 24.0 + noisy corpus 17.3 (-9.9) 25.2 (+1.2) This work came about because we wanted to add paracrawl data to WMT Khayrallah & Koehn