1
Neural Machine Translation - Encoder-Decoder Architecture and Attention Mechanism
Anmol Popli CSE 291G
2
RNN Encoder-Decoder
An architecture that learns to encode a variable-length sequence into a fixed-length vector representation (the summary) and to decode a given fixed-length vector representation back into a variable-length sequence. It is a general method for learning the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence.
This encoder-decoder idea is the basic principle behind neural machine translation: the relationship between source and target is complicated and non-monotonic, so the model encodes the meaning of the source and decodes the target from that representation.
It exploits two properties of RNNs: they capture history, and they can probabilistically model a sequence (by being trained to predict the next symbol in the sequence). The decoder is basically a language model, except that it is conditioned on the input sequence.
Using two different RNNs for the encoder and decoder increases the number of model parameters at negligible computational cost and makes it natural to train the model on multiple language pairs simultaneously.
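A minimal PyTorch sketch of the overall idea (module and variable names are illustrative, not from the paper): the encoder compresses the variable-length source into a fixed-length summary c, which conditions the decoder, a language model over the target sequence.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hid_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, c = self.encoder(self.src_emb(src_ids))            # fixed-length summary: (1, B, H)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), c)   # decoder conditioned on c
        return self.out(dec_out)                              # next-word logits: (B, T, V)

model = EncoderDecoder(src_vocab=15000, tgt_vocab=15000)
logits = model(torch.randint(0, 15000, (2, 12)), torch.randint(0, 15000, (2, 9)))
```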
3
RNN Encoder-Decoder
The two components are jointly trained to maximize the conditional log-likelihood of the target sequence given the source; in terms of the loss, this is the sum of per-token softmax cross-entropy terms. Translations are produced by finding the most probable translation using a left-to-right beam search decoder (approximate, but simple to implement).
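Written out, this training objective is the factorized conditional log-likelihood (a standard formulation reconstructed here; N is the number of sentence pairs):

```latex
\max_{\theta}\ \frac{1}{N}\sum_{n=1}^{N} \log p_{\theta}(\mathbf{y}_n \mid \mathbf{x}_n)
\;=\;
\max_{\theta}\ \frac{1}{N}\sum_{n=1}^{N} \sum_{t=1}^{T_{y_n}} \log p_{\theta}\!\left(y_{n,t} \mid y_{n,<t},\, \mathbf{x}_n\right)
```

Each inner term is exactly a softmax cross-entropy loss at one target position, which is where the "sum of (softmax + CEL)s" comes from.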
4
Dataset
Evaluated on the task of English-to-French translation.
Bilingual, parallel corpora provided by ACL WMT '14. WMT '14 contains the following English-French parallel corpora: Europarl (61M words), news commentary (5.5M), UN (421M), and two crawled corpora of 90M and 272.5M words respectively, totaling 850M words.
news-test-2012 and news-test-2013 were concatenated to form a development (validation) set, and the models are evaluated on the WMT '14 test set (news-test-2014).
5
An Encoder-Decoder Method (Predecessor to Bengio’s Attention paper)
6
Network Architecture: Proposed Gated Recurrent Unit (GRU)
Single-layer, uni-directional GRU.
Both the output y_t and the hidden state h_t are also conditioned on y_{t-1} and on the summary c of the input sequence.
Input and output vocabularies: 15,000 words each.
Feeding c into every decoder step acts like a skip/residual connection around the recurrence: the translation at every step can make better use of c, and the errors at all output positions directly influence the gradient at the summary vector.
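A minimal PyTorch sketch of one such decoder step, assuming this conditioning scheme (class and variable names are illustrative, not from the paper's code):

```python
import torch
import torch.nn as nn

class CondDecoderStep(nn.Module):
    # One decoder step in which both the state update and the output
    # are conditioned on the previous target word and the summary c.
    def __init__(self, emb_dim, hid_dim, vocab_size):
        super().__init__()
        self.cell = nn.GRUCell(emb_dim + hid_dim, hid_dim)             # input: [y_{t-1}; c]
        self.out = nn.Linear(hid_dim + emb_dim + hid_dim, vocab_size)  # uses h_t, y_{t-1}, c

    def forward(self, y_prev_emb, h_prev, c):
        h = self.cell(torch.cat([y_prev_emb, c], dim=-1), h_prev)
        # c also feeds the output layer directly, so the error at every
        # output position flows straight back to the summary vector.
        logits = self.out(torch.cat([h, y_prev_emb, c], dim=-1))
        return logits, h
```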
7
Quantitative Performance (or lack thereof)
Model                   BLEU (validation)   BLEU (test)
RNNenc                  13.15               13.92
Moses SMT (baseline)    30.64               33.30
Moses + RNNenc          31.48               34.64

When integrated into the statistical model, the RNN encoder-decoder is used to score phrase pairs and rerank them.
8
What could be some ways to overcome these?
Question: What are the issues with the simple encoder-decoder model that you trained in your assignment? What could be some ways to overcome these?
Shortcomings:
Not well equipped to handle long sentences.
The fixed-length vector representation does not have enough capacity to encode a long sentence with complicated structure and meaning.
In order to encode a variable-length sequence, a neural network may "sacrifice" some of the important topics in the input sentence in order to remember others.
9
The Trouble with the Simple Encoder-Decoder Architecture
The context vector must contain every single detail of the source sentence, so the true function approximated by the encoder has to be extremely nonlinear and complicated, and the dimensionality of the context vector must be large enough that a sentence of any length can be compressed into it.
Insight: the representational power of the encoder needs to be large, i.e., the model must be large in order to cope with long sentences.
Google's Sequence to Sequence Learning paper reports the corresponding key observation: translation quality dramatically degrades as the length of the source sentence increases when the encoder-decoder model is small.
10
Google’s Sequence to Sequence
11
Key Technical Contributions
Ensemble of deep LSTMs with 4 layers, 1000 cells at each layer, and 1000-dimensional word embeddings; parallelized on an 8-GPU machine.
Input vocabulary of 160,000 words and output vocabulary of 80,000 words (larger than in Bengio's papers). Performance improved with the addition of each layer.
Reversing the source sentences (see the sketch below): instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM was asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. Since a is translated as α, b as β, and so on, reversal introduces many short-term dependencies: although the average distance between corresponding words remains unchanged, the "minimal time lag" is greatly reduced. Backpropagation has an easier time "establishing communication" between source and target, the optimization problem becomes simpler, and the reversed models performed much better on long sentences; reversing results in LSTMs with better memory utilization.
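A small sketch of the reversal trick as a preprocessing step (only the source side is reversed; the target order is unchanged):

```python
def reverse_source(src_tokens, tgt_tokens):
    # Reverse only the source sentence; the target stays in order.
    return src_tokens[::-1], tgt_tokens

src, tgt = reverse_source(["a", "b", "c"], ["alpha", "beta", "gamma"])
print(src, tgt)  # ['c', 'b', 'a'] ['alpha', 'beta', 'gamma']
# After reversal, the last source token the encoder reads ("a") is adjacent
# to its translation ("alpha"), which is what shrinks the minimal time lag.
```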
12
Experimental Results
Outperformed the baseline SMT system.
Did well on long sentences
13
What does the summary vector look like?
A 2-dimensional PCA projection of the LSTM hidden states obtained after processing the phrases in the figures captures the underlying structure, including semantics and syntax; in other words, similar sentences are close together in summary-vector space.
The representations are sensible: sensitive to word order, yet fairly invariant to swapping the active voice for the passive voice.
However, a larger model implies higher computation and memory requirements. It is possible to overcome this issue by distributing a single model across multiple GPUs (here, 8 GPUs and 10 days of training).
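A sketch of how such a visualization could be produced (the phrases are illustrative and `states` is a random stand-in for the real summary vectors a trained encoder would produce):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

phrases = ["John admires Mary", "Mary admires John", "Mary is loved by John"]
states = np.random.randn(len(phrases), 1000)    # stand-in for real LSTM summary vectors

xy = PCA(n_components=2).fit_transform(states)  # project the states down to 2-D
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), phrase in zip(xy, phrases):
    plt.annotate(phrase, (x, y))
plt.show()
```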
14
Can we do better than the simple encoder-decoder based model?
Let's assume for now that, due to space, power, and other physical constraints, we only have access to a single machine with a single GPU. Can we do better than the simple encoder-decoder based model?
15
Soft Attention Mechanism for Neural Machine Translation
Words in a target sentence are determined by certain words and their contexts in the source sentence.
Key idea: let the decoder learn how to retrieve this information, allowing the model to automatically search for parts of a source sentence that are relevant to predicting a target word. The model learns to align and translate jointly.
(Figure: sample translations made by the neural machine translation model with the soft-attention mechanism; edge thicknesses represent the attention weights found by the attention model.)
16
Bidirectional Encoder for Annotating Sequences
The biggest issue with the simple encoder-decoder architecture is that a sentence of any length needs to be compressed into a fixed-size vector. Instead, we store the source sentence as a variable-length representation: a memory with as many banks as there are source words, each bank being a context-dependent word representation.
The forward RNN state h_j summarizes the source sentence up to the j-th word beginning from the first word, and the backward RNN state summarizes it up to the j-th word beginning from the last word; concatenating the two gives the annotation h_j. This summary at the position of each word is not a perfect summary of the whole input sentence: due to its sequential nature, a recurrent neural network tends to remember recent symbols better, so each annotation h_j contains information about the whole input sequence with a strong focus on the parts surrounding the j-th word.
The sequence of annotations is used by the decoder and the alignment model to compute the context vector. There is no need to encode all the information in the source sentence into a fixed-length vector; it is spread throughout the sequence of annotations. Earlier we used the last state as a fixed-dimensional summary vector; now we will compute the context vector adaptively from these annotations. A sketch of the bidirectional annotator follows.
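A minimal PyTorch sketch of the bidirectional annotator (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

vocab, emb_dim, hid_dim = 30000, 620, 1000
embed = nn.Embedding(vocab, emb_dim)
birnn = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

src_ids = torch.randint(0, vocab, (1, 7))   # one source sentence of 7 words
annotations, _ = birnn(embed(src_ids))      # (1, 7, 2 * hid_dim)
# annotations[0, j] is h_j: the forward state (words 1..j) concatenated with
# the backward state (words j..7), i.e. one memory bank per source word
# instead of a single fixed-length summary vector.
```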
17
Decoder: Attention Mechanism
When predicting the target word at the i-th index, the decoder now needs to be able to selectively focus on one or more of the context-dependent word representations, i.e., the annotation vectors, for each target word. So, which annotation vectors should the decoder focus on each time?
18
Decoder: Attention Mechanism
Alignment model: a small neural network that scores how well the inputs around position j and the output at position i match. The score is a measure of the importance of annotation h_j in deciding the next decoder state z_i and generating the output u_i (a sketch follows).
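A sketch of such an alignment model as a small feed-forward network using additive scoring (the names W, U, v and the sizes are illustrative):

```python
import torch
import torch.nn as nn

class AlignmentModel(nn.Module):
    def __init__(self, dec_dim=1000, enc_dim=2000, align_dim=1000):
        super().__init__()
        self.W = nn.Linear(dec_dim, align_dim, bias=False)
        self.U = nn.Linear(enc_dim, align_dim, bias=False)
        self.v = nn.Linear(align_dim, 1, bias=False)

    def forward(self, z_prev, annotations):
        # z_prev: previous decoder state (B, dec_dim)
        # annotations: h_1 .. h_T of the source sentence (B, T_src, enc_dim)
        scores = self.v(torch.tanh(self.W(z_prev).unsqueeze(1) + self.U(annotations)))
        return scores.squeeze(-1)   # e_{ij} for every source position j: (B, T_src)
```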
19
Decoder: Attention Mechanism
From this probabilistic perspective, one can think of the attention weight α_{ij} as the probability that the target word u_i is aligned to the source word x_j. Then, we can compute the expected context-dependent word representation under this distribution (defined by the attention weights α_{ij}), as written out below. The vector c_i summarizes the information about the whole source sentence, but with different emphasis on different locations/words of the source sentence. This implements a mechanism of attention in the decoder: the decoder decides which parts of the source sentence to pay attention to.
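The corresponding formulas, reconstructed from the description above (T_x is the source sentence length):

```latex
e_{ij} = a(z_{i-1}, h_j), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
c_i = \sum_{j=1}^{T_x} \alpha_{ij}\, h_j
```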
20
Decoder: Attention Mechanism
By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector. With this new approach the information can be spread throughout the sequence of annotations, which can be selectively retrieved by the decoder accordingly.
21
Network Architecture: Decoder GRU, Deep Output, Alignment Model
The alignment computation is expensive: the scores must be evaluated for every source position at every decoder step. The part of the score that depends only on the annotations can, however, be precomputed once per source sentence (see the sketch below).
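A sketch of that precomputation, reusing the hypothetical AlignmentModel from the earlier sketch: the term U h_j does not depend on the decoder state, so it is computed once and reused at every decoding step.

```python
import torch

align = AlignmentModel(dec_dim=1000, enc_dim=2000)   # from the earlier sketch
annotations = torch.randn(1, 7, 2000)                # stand-in encoder annotations
z_prev = torch.zeros(1, 1000)                        # stand-in decoder state

Uh = align.U(annotations)                            # precomputed once: (1, 7, 1000)
for _ in range(10):                                  # schematic decoding loop
    e = align.v(torch.tanh(align.W(z_prev).unsqueeze(1) + Uh)).squeeze(-1)
    alpha = torch.softmax(e, dim=-1)                 # attention weights alpha_{ij}
    c = (alpha.unsqueeze(-1) * annotations).sum(dim=1)   # context vector c_i
    # ... the decoder GRU would consume c and the previous target word here,
    # producing the next state z_prev ...
```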
22
Experiment Settings and Results
Two types of models were trained: RNNencdec and RNNsearch. In both, the encoder and decoder are single-layer GRUs with 1000 hidden units.
To minimize wasted computation, sentence pairs were sorted according to their lengths and subsequently split into minibatches (see the sketch below).
Vocabulary: a shortlist of the 30,000 most frequent words in each language.
Each model was trained with sentences of length up to 30 words and, in a second variant, up to 50 words.
BLEU scores were computed on the test set, both over all sentences and when considering only sentences consisting of known words. The performance of RNNencdec deteriorates, and RNNsearch-30 even outperforms RNNencdec-50.
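A small sketch of the length-based batching trick (hypothetical helper; sorting keeps similarly long sentences together so little computation is wasted on padding):

```python
def make_minibatches(pairs, batch_size):
    # pairs: list of (src_tokens, tgt_tokens) tuples
    pairs = sorted(pairs, key=lambda p: len(p[0]))   # sort by source length
    return [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
```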
23
Performance on Long Sentences
Consider the source sentence shown on the slide and its translation by RNNencdec-50: the model replaced [based on his status as a health care worker at a hospital] in the source sentence with [en fonction de son état de santé] ("based on his state of health"). RNNsearch-50 produced the correct translation: RNNsearch does not require encoding a long sentence into a fixed-length vector perfectly, but only accurately encoding the parts of the input sentence that surround a particular word.
24
Sample Alignments found by RNNsearch-50
The alignments are largely monotonic, running along the diagonal, with exceptions such as [European economic area] -> [zone économique européenne] and [the man] -> [l' homme] (translating 'the' requires considering the word that follows it).
The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight α_{ij} of the annotation of the j-th source word for the i-th target word, in grayscale (0: black, 1: white).
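A sketch of how such an alignment matrix could be rendered (the words and weights below are illustrative stand-ins mimicking the non-monotonic "European economic area" case, not values from the paper):

```python
import numpy as np
import matplotlib.pyplot as plt

src = ["European", "economic", "area"]
tgt = ["zone", "économique", "européenne"]
alpha = np.array([[0.1, 0.1, 0.8],    # "zone"       attends mostly to "area"
                  [0.1, 0.8, 0.1],    # "économique" attends to "economic"
                  [0.8, 0.1, 0.1]])   # "européenne" attends to "European"

plt.imshow(alpha, cmap="gray", vmin=0.0, vmax=1.0)   # 0: black, 1: white
plt.xticks(range(len(src)), src, rotation=45)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("source (English)")
plt.ylabel("target (French)")
plt.show()
```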
25
-> Google’s Neural Machine Translation System
Key Insights
Attention extracts relevant information from the accurately captured annotations of the source sentence.
To have the best possible context at each point in the encoder network, it makes sense to use a bi-directional RNN for the encoder.
To achieve good accuracy, both the encoder and decoder RNNs have to be deep enough to capture subtle irregularities in the source and target languages.
These insights lead to Google's Neural Machine Translation System.