1
Neural Machine Translation by Jointly Learning to Align and Translate
Presented by: Minhao Cheng, Pan Xu, Md Rizwan Parvez
2
Outline
Problem setting
Seq2seq model: RNN/LSTM/GRU, autoencoder
Attention mechanism
Pipeline and model architecture
Experiment results
Extended method: self-attention (Transformer)
3
Problem Setting Input: sentence (word sequence) in the source language
Output: sentence (word sequence) in the target language
Model: seq2seq
4
History of Machine Translation
5
History of Machine Translation
Rule-based: used mostly in the creation of dictionaries and grammar programs.
Example-based: based on the idea of analogy.
Statistical: uses statistical methods based on bilingual text corpora.
Neural: deep learning.
6
Neural machine translation (NMT)
Building blocks: RNN/LSTM/GRU and the Seq2Seq model.
Why recurrent models? The input and output lengths vary, and the sequences are order-dependent.
7
Recurrent Neural Network
RNN unit structure:
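The RNN unit diagram does not survive the transcript, so here is a minimal NumPy sketch of the recurrence it illustrates: the hidden state is updated as h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h). The parameter names and toy dimensions are illustrative, not taken from the slides.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # Vanilla RNN update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Toy dimensions (hypothetical): 4-dim inputs, 3-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(3, 4))
W_hh = rng.normal(size=(3, 3))
b_h = np.zeros(3)

h = np.zeros(3)                       # initial hidden state
for x_t in rng.normal(size=(5, 4)):   # a length-5 input sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
```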
8
Vanilla RNN Problem: vanishing gradient
9
Long Short Term Memory (LSTM)
Forget gate; input gate; updating the cell state; updating the hidden state.
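A minimal NumPy sketch of one LSTM step covering the four items above (forget gate, input gate, cell-state update, hidden-state update); the parameter layout and toy sizes are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])  # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])  # input gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])  # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])  # candidate cell update
    c_t = f * c_prev + i * g                              # update the cell state
    h_t = o * np.tanh(c_t)                                # update the hidden state
    return h_t, c_t

# Toy parameters (hypothetical sizes): 4-dim input, 3-dim hidden/cell state.
rng = np.random.default_rng(0)
d_x, d_h = 4, 3
W = {k: rng.normal(size=(d_h, d_x)) for k in 'fiog'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'fiog'}
b = {k: np.zeros(d_h) for k in 'fiog'}

h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, W, U, b)
```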
13
Gated Recurrent Unit (GRU)
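Since the GRU slide is figure-only in the transcript, here is a hedged NumPy sketch of the GRU update (update gate, reset gate, candidate state); the parameter names and toy sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])              # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])              # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])  # candidate state
    return (1 - z) * h_prev + z * h_tilde                             # no separate cell state

# Toy usage (hypothetical sizes).
rng = np.random.default_rng(0)
d_x, d_h = 4, 3
W = {k: rng.normal(size=(d_h, d_x)) for k in 'zrh'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'zrh'}
b = {k: np.zeros(d_h) for k in 'zrh'}
h = gru_step(rng.normal(size=d_x), np.zeros(d_h), W, U, b)
```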
14
Using RNNs:
Text classification: output dim = 1.
Machine translation: output dim ≠ 1 (autoencoder / seq2seq).
15
Autoencoder
16
Seq2Seq model (figure): encoder hidden states h1, h2, h3 read the input sequence x1, x2, …; a single vector c summarizes it; decoder states s1, …, s4 generate the target sequence y1, …, y4, each step conditioned on the previously emitted word.
17
Pipeline: Seq2Seq (figure): a nonlinear encoder maps the source sentence to a fixed-length representation, and a nonlinear decoder generates the translation from that representation.
The fixed-length representation is a potential bottleneck for long sentences.
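To make the bottleneck concrete, here is a toy sketch (not the paper's exact encoder; names and sizes are illustrative) in which the entire source sentence is compressed into one fixed-length vector c:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h, T = 4, 3, 6                      # toy sizes: input dim, hidden dim, source length
W_xh = rng.normal(size=(d_h, d_x))
W_hh = rng.normal(size=(d_h, d_h))

def encode(xs):
    # Plain encoder RNN: read the source left to right and keep only the final state.
    h = np.zeros(d_h)
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
    return h                               # c = h_T: one fixed-length vector, whatever the length

c = encode(rng.normal(size=(T, d_x)))
# The decoder must generate the whole translation from this single vector c,
# so its constant capacity becomes a bottleneck as sentences get longer.
```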
18
Attention (figure): encoder, alignment model, attention decoder.
While generating yt, the model searches in x = (x1, …, xT) for where the most relevant information is concentrated.
Context vector: ci = ∑j αij hj.
Output distribution: p(yi) = g(yi−1, si, ci).
19
Decoder state: si = f(yi−1, si−1, ci).
αij: how well the inputs around position j match the output at position i.
A simple feedforward NN computes eij from si−1 and hj; the relative alignments αij are obtained by normalizing the eij (softmax), and the context vector is ci = ∑j αij hj.
Instead of one global context vector, the model learns a different context vector for each yi, which captures the relevance of the input words and gives the decoder more focus.
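A minimal NumPy sketch of this alignment step; the parameter names (W_a, U_a, v_a) and toy sizes are illustrative choices for the feedforward alignment model described above.

```python
import numpy as np

def softmax(e):
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def attend(s_prev, H, W_a, U_a, v_a):
    # Feedforward alignment model: e_ij = v_a . tanh(W_a s_{i-1} + U_a h_j)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])
    alpha = softmax(e)   # alpha_ij: how well input position j matches output position i
    c = alpha @ H        # context vector c_i = sum_j alpha_ij h_j
    return c, alpha

# Toy sizes (hypothetical): 3-dim decoder state, 4-dim annotations, source length 5.
rng = np.random.default_rng(0)
d_s, d_h, T, d_a = 3, 4, 5, 6
W_a, U_a, v_a = rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)), rng.normal(size=d_a)
c_i, alpha_i = attend(rng.normal(size=d_s), rng.normal(size=(T, d_h)), W_a, U_a, v_a)
```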
20
The Full Pipeline Word embedding: I love Sundays
Now that we have introduced the RNN architecture and the attention scheme, combining the two gives the full pipeline proposed in this paper.
21
The Full Pipeline (figure): input words, encoder hidden states (annotations), alignment, decoder hidden states, output words.
With annotations from a forward-only RNN, each word's annotation summarizes only the information of its preceding words.
22
The Full Pipeline (figure): input, forward and backward encoder states, alignment, decoder hidden states, output.
Bidirectional RNNs are used for the annotation hidden states: the true annotations are obtained by concatenating the forward and backward annotations.
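A small sketch of how such annotations could be assembled from a forward and a backward pass (toy tanh RNNs; weight names and sizes are illustrative):

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    # Simple tanh RNN that returns the hidden state at every position.
    h, states = np.zeros(W_hh.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_x, d_h, T = 4, 3, 6
Wf_xh, Wf_hh = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))  # forward RNN
Wb_xh, Wb_hh = rng.normal(size=(d_h, d_x)), rng.normal(size=(d_h, d_h))  # backward RNN

xs = rng.normal(size=(T, d_x))
h_fwd = run_rnn(xs, Wf_xh, Wf_hh)               # summarizes words 1..j
h_bwd = run_rnn(xs[::-1], Wb_xh, Wb_hh)[::-1]   # summarizes words j..T
H = np.concatenate([h_fwd, h_bwd], axis=1)      # annotation h_j = [forward_j ; backward_j]
```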
23
Experiments
Dataset: ACL WMT '14 English–French parallel corpora (850M words in total), tokenized with the Moses machine translation package; only the 30,000 most frequent words are kept, and all others are mapped to a special [UNK] token.
Baselines: RNNenc-30 and RNNenc-50 (the RNN Encoder–Decoder of Cho et al., 2014a), trained on sentences of length up to 30 and 50 words respectively.
Proposed: RNNsearch-30 and RNNsearch-50 (the model proposed in this paper), plus RNNsearch-50*, trained until performance on the development set stopped improving.
Training: random initialization (orthogonal matrices for the weights); stochastic gradient descent (SGD) with adaptive learning rates (Adadelta); minibatch size 80. The objective is the log-likelihood, which by the chain rule is the sum of next-step log-likelihoods.
Architecture: 1000 hidden units for each encoder and decoder; in the bidirectional RNN, each direction has 1000 hidden units.
24
Inference: beam search (beam size = 2).
Example (figure): start from 2 partial hypotheses ("I", "My"); expand each into new partial hypotheses ("I decided", "My decision", "I thought", "I tried", "My thinking", "My direction"); sort the candidates and prune back to the top 2 ("I decided", "My decision"); repeat.
After the model is trained, beam search keeps a small set of candidate hypotheses at each step; at the end, we choose the sentence with the highest joint probability.
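A model-agnostic sketch of the beam-search loop above; step_logprobs is a hypothetical stand-in for the trained decoder, and the toy "model" below is just a fixed random table.

```python
import numpy as np

def beam_search(step_logprobs, beam_size=2, max_len=6, eos=0):
    beams = [((), 0.0)]                        # (token prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:   # finished hypotheses are carried over unchanged
                candidates.append((prefix, score))
                continue
            for tok, lp in enumerate(step_logprobs(prefix)):  # expand every hypothesis
                candidates.append((prefix + (tok,), score + lp))
        # sort by joint log-probability and prune back down to the beam size
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0]                            # hypothesis with the highest joint probability

# Toy usage: a fixed random distribution over a 5-token vocabulary at each step.
rng = np.random.default_rng(0)
table = np.log(rng.dirichlet(np.ones(5), size=10))
print(beam_search(lambda prefix: table[len(prefix)], beam_size=2))
```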
25
Experiment Results Sample alignments calculated based on RNNsearch-50
The x-axis and y-axis correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight αij of the annotation of the j-th source word for the i-th target word.
26
Experiment Results
BLEU score: how close the candidate translation is to the reference translations.
BLEU = BP · exp(∑n wn log pn), where BP is the brevity penalty (1 if c > r, otherwise exp(1 − r/c)), c is the length of the candidate translation, r the length of the reference translation, pn the n-gram precisions, and wn the weight parameters.
Table: BLEU scores computed on the test set. (◦) Only sentences without [UNK] tokens.
When only frequent words are considered, the performance of RNNsearch is as high as that of the conventional phrase-based translation system (Moses).
RNNsearch-50* is the same as RNNsearch-50 but trained longer, until there was no further improvement in validation-set performance.
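A sentence-level sketch of the BLEU formula above (real BLEU is computed at the corpus level, possibly with multiple references; the tiny smoothing constant is only there to keep the toy example finite):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped matches
        p_n = max(overlap, 1e-9) / max(sum(cand.values()), 1)           # n-gram precision
        log_p += math.log(p_n) / max_n                                  # uniform weights w_n = 1/max_n
    bp = 1.0 if c > r else math.exp(1 - r / c)                          # brevity penalty
    return bp * math.exp(log_p)

print(bleu("the cat sat on the mat".split(), "the cat is on the mat".split()))
```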
28
Self Attention [Vaswani et al. 2017]
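The self-attention slides are figure-only in this transcript, so here is a minimal single-head scaled dot-product self-attention sketch in the spirit of Vaswani et al. (2017); the toy sizes and weight names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Each position builds a query, key and value from the SAME sequence X,
    # then attends over all positions: softmax(Q K^T / sqrt(d_k)) V.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # attention weights, one row per position
    return A @ V

# Toy sizes (hypothetical): sequence length 5, model dim 8, head dim 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)   # shape (5, 4)
```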
33
Thank You!!!