Neural Machine Translation by Jointly Learning to Align and Translate

Presentation transcript:

Neural Machine Translation by Jointly Learning to Align and Translate Presented by: Minhao Cheng, Pan Xu, Md Rizwan Parvez

Outline: Problem Setting; Seq2Seq Model (RNN/LSTM/GRU, Autoencoder); Attention Mechanism; Pipeline and Model Architecture; Experiment Results; Extended Method: Self-Attention (Transformer)

Problem Setting. Input: a sentence (word sequence) in the source language. Output: a sentence (word sequence) in the target language. Model: seq2seq.

History of Machine Translation

History of Machine Translation. Rule-based: used mostly in the creation of dictionaries and grammar programs. Example-based: based on the idea of analogy. Statistical: uses statistical methods based on bilingual text corpora. Neural: deep learning.

Neural Machine Translation (NMT). Why RNN/LSTM/GRU? The input and output lengths vary, and the words are order-dependent. Model: Seq2Seq.

Recurrent Neural Network. RNN unit structure (diagram on the slide); a minimal sketch in code follows below.
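
The RNN unit itself was shown as a figure on the slide. As a stand-in, here is a minimal NumPy sketch of a vanilla RNN step, h_t = tanh(x_t W_x + h_{t-1} W_h + b); the function name `rnn_step` and the toy dimensions are illustrative choices of this sketch, not taken from the paper.

```python
# Minimal sketch of a vanilla RNN cell: h_t = tanh(x_t W_x + h_{t-1} W_h + b),
# applied once per time step so the hidden state carries the history.
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrence step: combine the current input with the previous hidden state."""
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Toy dimensions (illustrative only)
input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.default_rng(0)
W_x = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for t in range(seq_len):                      # unroll over the sequence
    x_t = rng.normal(size=input_dim)          # stand-in for a word embedding
    h = rnn_step(x_t, h, W_x, W_h, b)
```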

Vanilla RNN Problem: vanishing gradient
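
A small, hedged illustration of why this happens: the gradient of the last hidden state with respect to the first is a product of per-step Jacobians, and with tanh saturation and modest recurrent weights its norm shrinks roughly exponentially with sequence length. The weight scale and sequence length below are arbitrary choices for the demonstration.

```python
# Vanishing-gradient sketch: the gradient that flows through T recurrence steps
# contains a product of T Jacobians, which shrinks when the recurrent weights are small.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, T = 16, 50
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1   # small recurrent weights

grad = np.eye(hidden_dim)
h = np.zeros(hidden_dim)
for t in range(T):
    x_t = rng.normal(size=hidden_dim)
    h = np.tanh(h @ W_h + x_t)
    jac = W_h * (1.0 - h ** 2)      # Jacobian of one tanh step: columns of W_h scaled by (1 - h_t^2)
    grad = grad @ jac               # chain the per-step Jacobians
    if t % 10 == 9:
        print(f"step {t + 1}: ||dh_T/dh_0|| ~ {np.linalg.norm(grad):.2e}")
```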

Long Short Term Memory (LSTM): forget gate, input gate, updating the cell state, updating the hidden state (the slide steps through these four stages one at a time).
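
For concreteness, a sketch of the standard LSTM step matching the four bullets above (forget gate, input gate, cell-state update, hidden-state update); the packed weight layout and the name `lstm_step` are implementation choices of this sketch, not the paper's.

```python
# Standard LSTM step: f, i, o gates plus candidate cell content g.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b, hidden_dim):
    """W: (input_dim, 4*hidden), U: (hidden, 4*hidden), b: (4*hidden,) packing f, i, o, g."""
    z = x_t @ W + h_prev @ U + b
    f = sigmoid(z[0:hidden_dim])                      # forget gate: what to drop from the cell
    i = sigmoid(z[hidden_dim:2 * hidden_dim])         # input gate: what new information to write
    o = sigmoid(z[2 * hidden_dim:3 * hidden_dim])     # output gate: what to expose as h_t
    g = np.tanh(z[3 * hidden_dim:])                   # candidate cell content
    c_t = f * c_prev + i * g                          # updating the cell state
    h_t = o * np.tanh(c_t)                            # updating the hidden state
    return h_t, c_t

# Toy usage with illustrative sizes
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W = rng.normal(size=(input_dim, 4 * hidden_dim)) * 0.1
U = rng.normal(size=(hidden_dim, 4 * hidden_dim)) * 0.1
b = np.zeros(4 * hidden_dim)
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, c = lstm_step(rng.normal(size=input_dim), h, c, W, U, b, hidden_dim)
```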

Gated Recurrent Unit (GRU)
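
A corresponding sketch of the GRU step (reset gate, update gate, candidate state, interpolation between old and candidate states). Sign conventions for the update gate differ between papers; this sketch uses one common variant, and the packed weight layout is again a choice of the sketch rather than the paper's notation.

```python
# GRU step sketch: reset gate r, update gate z, candidate state h~,
# then a convex combination of the old and candidate states.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(x_t, h_prev, W, U, b, hidden_dim):
    """W: (input_dim, 3*hidden), U: (hidden, 3*hidden), b: (3*hidden,) packing r, z, candidate."""
    zx = x_t @ W + b
    r = sigmoid(zx[:hidden_dim] + h_prev @ U[:, :hidden_dim])                          # reset gate
    z = sigmoid(zx[hidden_dim:2 * hidden_dim] + h_prev @ U[:, hidden_dim:2 * hidden_dim])  # update gate
    h_tilde = np.tanh(zx[2 * hidden_dim:] + (r * h_prev) @ U[:, 2 * hidden_dim:])      # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                                            # interpolate old and new

# Toy usage with illustrative sizes
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
W = rng.normal(size=(input_dim, 3 * hidden_dim)) * 0.1
U = rng.normal(size=(hidden_dim, 3 * hidden_dim)) * 0.1
b = np.zeros(3 * hidden_dim)
h = gru_step(rng.normal(size=input_dim), np.zeros(hidden_dim), W, U, b, hidden_dim)
```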

Using RNNs. Text classification: output dimension = 1. Machine translation: the output is itself a sequence (output dimension ≠ 1), handled with an encoder-decoder (autoencoder-style) setup.

Autoencoder

Seq2Seq model (diagram): the encoder reads the input sequence x1, x2, ... into hidden states h1, h2, h3 and a context vector c; the decoder states s1, s2, s3, s4 generate the target sequence y1, y2, y3, y4, with each generated word fed back as the next decoder input.

Pipeline: Seq2Seq. The encoder (a nonlinear function) compresses the source sentence into a fixed-length representation, and the decoder (another nonlinear function) generates the target sentence from it. This fixed-length representation is a potential bottleneck for long sentences. Image: https://courses.engr.illinois.edu/cs546/sp2018/Slides/Mar15_Bahdanau.pdf
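
A minimal PyTorch sketch (not the authors' code) of this plain encoder-decoder pipeline, to make the bottleneck explicit: the entire source sentence is squeezed into one fixed-length vector before decoding. All class names and layer sizes are illustrative.

```python
# Plain encoder-decoder: the whole source sentence becomes ONE fixed-length vector
# `context`, which is the bottleneck the attention mechanism later removes.
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, context = self.encoder(self.src_emb(src_ids))              # (1, batch, hidden): fixed-length summary
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), context)  # decode conditioned only on `context`
        return self.out(dec_states)                                   # per-step scores over the target vocabulary

# Toy usage: batch of 2 sentences, source length 7, target length 5
model = EncoderDecoder(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1200, (2, 5))
logits = model(src, tgt)            # shape (2, 5, 1200)
```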

Attention. While generating y_t, the decoder searches in x = (x_1, ..., x_T) for the positions where the most relevant information is concentrated. An alignment model scores each encoder annotation h_j, the context vector is c_t = Σ_j α_{t,j} h_j, and the output distribution is p(y_i) = g(y_{i-1}, s_i, c_i).

Alignment. The decoder state is updated as s_i = f(y_{i-1}, s_{i-1}, c_i). The weight α_{ij} expresses how well the inputs around position j match the output at position i: a simple feedforward network computes a score e_{ij} from s_{i-1} and h_j, the scores are normalized into the relative alignment weights α_{ij}, and the context vector is c_i = Σ_j α_{ij} h_j. Instead of a single global context vector, the model learns a different context vector for each y_i, capturing the relevance of the input words and focusing on the most useful ones. Image: https://courses.engr.illinois.edu/cs546/sp2018/Slides/Mar15_Bahdanau.pdf
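
The alignment computation can be written down in a few lines. This NumPy sketch implements additive scoring e_{ij} = v^T tanh(W_a s_{i-1} + U_a h_j) followed by a softmax and the weighted sum c_i = Σ_j α_{ij} h_j; the matrix names (W_a, U_a, v) and all dimensions are illustrative.

```python
# Additive (Bahdanau-style) alignment for ONE decoder position i.
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v):
    """s_prev: (dec_dim,) previous decoder state; H: (T, enc_dim) encoder annotations."""
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v        # (T,) unnormalized alignment scores e_ij
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                           # softmax over source positions -> alpha_ij
    c = alpha @ H                                  # (enc_dim,) context vector c_i
    return c, alpha

# Toy shapes: T source words, annotations of size 6, decoder state of size 5
rng = np.random.default_rng(0)
T, enc_dim, dec_dim, align_dim = 4, 6, 5, 8
H = rng.normal(size=(T, enc_dim))
s_prev = rng.normal(size=dec_dim)
W_a = rng.normal(size=(dec_dim, align_dim))
U_a = rng.normal(size=(enc_dim, align_dim))
v = rng.normal(size=align_dim)

c_i, alpha_i = attention_context(s_prev, H, W_a, U_a, v)
print(alpha_i.round(3), alpha_i.sum())   # weights over the 4 source words, summing to 1
```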

The Full Pipeline. Word embeddings of the example sentence "I love Sundays". Now that we have introduced the RNN architecture and the attention scheme, combining these two structures gives the full pipeline proposed in this paper.

The Full Pipeline (diagram): input words, annotation hidden states, alignment, decoder hidden states, output ("I love Sundays" translated). With a unidirectional encoder, the annotation of each word only summarizes the information of its preceding words.

The Full Pipeline (diagram): bidirectional RNNs are used for the annotation hidden states; the true annotation of each word is obtained by concatenating its forward and backward annotations.
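
As a quick illustration (PyTorch, not the authors' code), a bidirectional GRU already returns, for every source word, the concatenation of its forward and backward hidden states, which is exactly the annotation described above. Sizes are illustrative.

```python
# Bidirectional encoder: each word's annotation is [forward state ; backward state].
import torch
import torch.nn as nn

emb_dim, hidden_dim = 64, 128
birnn = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)

src_embeddings = torch.randn(2, 7, emb_dim)     # batch of 2 sentences, 7 words each
annotations, _ = birnn(src_embeddings)          # (2, 7, 2*hidden_dim): concatenated directions
print(annotations.shape)                        # torch.Size([2, 7, 256])
```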

Experiments. Dataset: ACL WMT '14 English-French parallel corpora, 850M words in total, tokenized with the Moses machine translation package; only the 30,000 most frequent words are kept, and all others are mapped to a special [UNK] token. Models: the baselines RNNenc-30 and RNNenc-50 (the RNN Encoder-Decoder of Cho et al., 2014a, trained on sentences of length up to 30 and 50 words) are compared against RNNsearch-30 and RNNsearch-50 (the model proposed in this paper) and RNNsearch-50*, which is trained until performance on the development set stops improving. Each encoder and decoder has 1000 hidden units (for the bidirectional RNN, each direction has 1000 hidden units). Training: random initialization (orthogonal matrices for the weights), stochastic gradient descent with adaptive learning rates (Adadelta), minibatch size 80; the log-likelihood objective is, by the chain rule, the sum of the per-step log-likelihoods.
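
A hedged sketch of the optimization setup described on this slide (Adadelta, minibatch size 80, log-likelihood summed over the target positions). The tiny linear `model` is only a placeholder for the real encoder-decoder, and all tensor shapes are illustrative.

```python
# Optimization setup sketch: Adadelta, minibatch of 80, summed per-step log-likelihood.
import torch
import torch.nn as nn

vocab_size, tgt_len, batch_size = 1200, 5, 80           # minibatch = 80, as on the slide
model = nn.Linear(32, vocab_size)                        # placeholder for the real encoder-decoder
optimizer = torch.optim.Adadelta(model.parameters())     # SGD with adaptive learning rates (Adadelta)

dec_states = torch.randn(batch_size, tgt_len, 32)        # stand-in for decoder hidden states
targets = torch.randint(0, vocab_size, (batch_size, tgt_len))

logits = model(dec_states)                               # (batch, tgt_len, vocab)
log_probs = torch.log_softmax(logits, dim=-1)
# chain rule: log p(y) = sum_t log p(y_t | y_<t, x); negate to get the training loss
loss = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```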

Inference: Beam Search (beam size = 2). Start with 2 partial hypotheses ("I", "My"), expand each hypothesis ("I decided", "I thought", "I tried", "My decision", "My thinking", "My direction"), sort the expansions, and prune back to the 2 best new partial hypotheses ("I decided", "My decision"); repeat until the end of the sentence. After the model is trained, beam search keeps a small set of candidate words at each step, and at the end we choose the sentence with the highest joint probability.
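
A compact sketch of beam search with beam size 2, mirroring the slide's "I"/"My" example; `log_prob_next` is a hypothetical stand-in for the trained decoder's next-word distribution, and the toy probability table exists only to reproduce that example.

```python
# Beam search sketch: expand, sort by joint log-probability, prune to the beam size.
def beam_search(log_prob_next, start_token, end_token, beam_size=2, max_len=20):
    """log_prob_next(prefix) -> dict mapping next word -> log-probability."""
    beams = [([start_token], 0.0)]                     # (partial hypothesis, joint log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == end_token:                # finished hypotheses are kept as-is
                candidates.append((prefix, score))
                continue
            for word, lp in log_prob_next(prefix).items():   # expand hypotheses
                candidates.append((prefix + [word], score + lp))
        # sort by joint log-probability and prune to the best `beam_size`
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == end_token for p, _ in beams):
            break
    return max(beams, key=lambda c: c[1])              # sentence with highest joint probability

# Toy next-word model just for demonstration (not a trained NMT model)
def toy_model(prefix):
    table = {"<s>": {"I": -0.5, "My": -1.0},
             "I": {"decided": -0.4, "thought": -0.9, "tried": -1.2},
             "My": {"decision": -0.3, "thinking": -1.1, "direction": -1.4}}
    return table.get(prefix[-1], {"</s>": 0.0})

print(beam_search(toy_model, "<s>", "</s>"))   # -> (['<s>', 'I', 'decided', '</s>'], -0.9)
```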

Experiment Results. Sample alignments computed by RNNsearch-50: the x-axis and y-axis correspond to the words in the source sentence (English) and the generated translation (French), respectively, and each pixel shows the weight α_{ij} of the annotation of the j-th source word for the i-th target word.

Experiment Results. BLEU score: measures how close the candidate translation is to the reference translations, BLEU = BP · exp(Σ_n w_n log p_n) with brevity penalty BP = min(1, e^{1 − r/c}), where c is the length of the candidate translation, r the length of the reference translation, p_n the n-gram precisions, and w_n the weight parameters. Table: BLEU scores computed on the test set; (◦) marks scores computed only on sentences without [UNK] tokens. When we only consider frequent words, the performance of RNNsearch is as high as that of the conventional phrase-based translation system (Moses). RNNsearch-50* is the same as RNNsearch-50 but trained longer, until no further improvement on the validation set.
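
For reference, the BLEU formula above in code, assuming the modified n-gram precisions p_n have already been computed; the function name and the example numbers are illustrative only.

```python
# BLEU = BP * exp(sum_n w_n * log p_n), with brevity penalty BP = min(1, exp(1 - r/c)).
import math

def bleu(precisions, weights, c, r):
    """precisions: modified n-gram precisions p_n; weights: w_n (usually 1/N each);
    c: candidate length; r: reference length."""
    if min(precisions) == 0:
        return 0.0                                        # any zero precision gives BLEU = 0
    bp = min(1.0, math.exp(1.0 - r / c))                  # brevity penalty
    return bp * math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Example: 4-gram BLEU with uniform weights
print(bleu([0.7, 0.5, 0.4, 0.3], [0.25] * 4, c=18, r=20))
```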

References:
https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://arxiv.org/pdf/1409.0473.pdf
https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
https://docs.google.com/presentation/d/1quIMxEEPEf5EkRHc2USQaoJRC4QNX6_KomdZTBMBWjk/edit#slide=id.g1f9e4ca2dd_0_23
https://pdfs.semanticscholar.org/8873/7a86b0ddcc54389bf3aa1aaf62030deec9e6.pdf
https://www.freecodecamp.org/news/a-history-of-machine-translation-from-the-cold-war-to-deep-learning-f1d335ce8b5/
https://courses.engr.illinois.edu/cs546/sp2018/Slides/Mar15_Bahdanau.pdf

Self Attention [Vaswani et al. 2017] (slides adapted from https://people.cs.umass.edu/~strubell/doc/lisa-final.key)
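
A brief sketch of the core operation presented in those slides, scaled dot-product self-attention from "Attention Is All You Need": Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where Q, K, V are linear projections of the same sequence. Shapes and the random weight initializations below are illustrative.

```python
# Scaled dot-product self-attention: every position attends to every position of the same sequence.
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (T, d_model) token representations; returns (T, d_v) attended outputs."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise compatibility scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
T, d_model, d_k = 6, 16, 8
X = rng.normal(size=(T, d_model))
out = self_attention(X, rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)))
print(out.shape)     # (6, 8)
```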

Thank You!!!