Attention for translation
Learn to encode multiple pieces of information and use them selectively for the output.
Encode the input sentence into a sequence of vectors.
Choose a subset of these adaptively while decoding (translating) – choose the vectors most relevant to the current output.
I.e., learn to jointly align and translate.
Question: How can we learn and use a vector to decide where to focus attention? How can we make that differentiable so it works with gradient descent?
Bahdanau et al., 2015
Soft attention
Use a probability distribution over all inputs.
Classification assigns a probability to every possible output; attention uses a probability distribution to weight every possible input – it learns to weight the more relevant parts more heavily.
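A minimal numpy sketch of this parallel, with illustrative scores and inputs rather than values from any real model: the same softmax that turns classification scores into a distribution over outputs turns relevance scores into weights over inputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Classification: a distribution over possible outputs (classes).
class_scores = np.array([2.0, 0.5, -1.0])
p_classes = softmax(class_scores)          # roughly [0.79, 0.18, 0.04], sums to 1

# Soft attention: a distribution over inputs, used to weight them.
inputs = np.array([[1.0, 0.0],             # three input vectors
                   [0.0, 1.0],
                   [1.0, 1.0]])
relevance = np.array([0.1, 2.0, -0.5])     # relevance scores (assumed given here)
attn = softmax(relevance)                  # probability over the three inputs
weighted_input = attn @ inputs             # more relevant inputs weigh more heavily
```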
Attention for translation
The input x and the output y are sequences. The encoder has a hidden state h_i associated with each x_i. These states come from a bidirectional RNN, so the encoding carries information from both sides of the input.
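A minimal numpy sketch of the bidirectional encoder, with randomly initialised weights standing in for a trained network: each h_i concatenates a left-to-right pass and a right-to-left pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 5                    # input dim, hidden dim, sequence length
x = rng.normal(size=(T, d_in))            # input sequence x_1 .. x_T

W_f, U_f = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W_b, U_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

h_fwd = np.zeros((T, d_h))
h_bwd = np.zeros((T, d_h))
prev = np.zeros(d_h)
for i in range(T):                        # left-to-right pass
    prev = np.tanh(W_f @ x[i] + U_f @ prev)
    h_fwd[i] = prev
prev = np.zeros(d_h)
for i in reversed(range(T)):              # right-to-left pass
    prev = np.tanh(W_b @ x[i] + U_b @ prev)
    h_bwd[i] = prev

H = np.concatenate([h_fwd, h_bwd], axis=1)  # h_i has context from both sides
```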
Attention for translation
Decoder: each output y_i is predicted using
  the previous output y_{i-1}
  the decoder hidden state s_i
  the context vector c_i
The decoder hidden state depends on the previous state s_{i-1}.
Attention is embedded in the context vector via learned weights.
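A simplified numpy sketch of one decoder step, just to make the dependencies explicit: Bahdanau et al. use a GRU cell here, while a plain tanh cell and assumed weight matrices are used below.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c, W_s, W_o):
    # New state depends on the previous state, previous output, and the context.
    s = np.tanh(W_s @ np.concatenate([y_prev, s_prev, c]))
    # Output distribution over the target vocabulary.
    p_y = softmax(W_o @ np.concatenate([y_prev, s, c]))
    return s, p_y

rng = np.random.default_rng(1)
d_y, d_s, d_c, vocab = 6, 8, 8, 10
W_s = rng.normal(size=(d_s, d_y + d_s + d_c))
W_o = rng.normal(size=(vocab, d_y + d_s + d_c))
s, p_y = decoder_step(rng.normal(size=d_y), np.zeros(d_s), rng.normal(size=d_c), W_s, W_o)
```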
Context vector
The encoder hidden states are combined in a weighted average to form a context vector c_t for the t-th output. This can capture features from each part of the input.
Attention for translation
The weights v_a and W_a are learned using a feedforward network trained jointly with the rest of the network.
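A minimal numpy sketch of this alignment model, with random placeholder values for the encoder states, the previous decoder state, W_a, and v_a. Bahdanau et al. project s and h separately; a single matrix over their concatenation is used here for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
T, d_h, d_s, d_a = 5, 8, 8, 16            # source length, encoder/decoder/attention dims
H = rng.normal(size=(T, d_h))             # encoder states h_1 .. h_T
s_prev = rng.normal(size=d_s)             # previous decoder state s_{t-1}
W_a = rng.normal(size=(d_a, d_s + d_h))   # learned jointly with the network
v_a = rng.normal(size=d_a)

# Small feedforward net scores each encoder state against the decoder state.
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h])) for h in H])
alpha = softmax(scores)                   # attention weights over source positions
c_t = alpha @ H                           # context vector: weighted average of the h_i
```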
Attention for translation
Key ideas:
Implement attention as a probability distribution over inputs/features.
Extend the encoder/decoder pair to include context information relevant to the current decoding task.
Attention with images
A CNN can be combined with an RNN using attention.
The CNN extracts high-level features. The RNN generates a description, using attention to focus on the relevant parts of the image.
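A minimal numpy sketch with a placeholder feature map and scores: the CNN feature map is flattened into location vectors that play the same role the encoder states play in translation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
feature_map = rng.normal(size=(14, 14, 512))   # stand-in for CNN output (H x W x channels)
locations = feature_map.reshape(-1, 512)       # 196 location vectors, like encoder states
scores = rng.normal(size=locations.shape[0])   # would come from the decoder state
attn = softmax(scores)                         # where to look for the next word
context = attn @ locations                     # weighted image summary fed to the RNN
```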
Self-attention
Previously: focus attention on the input while working on the output.
Self-attention: focus on other parts of the input while processing the input.
Self-attention
Use each input vector to produce a query, key, and value for that input: each is a matrix multiplication of the input's embedding with one of the learned matrices W_Q, W_K, W_V.
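A minimal numpy sketch with random embeddings and projection matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_model, d_k = 4, 16, 8                # sequence length, embedding dim, head dim
X = rng.normal(size=(T, d_model))         # input embeddings, one row per token

W_Q = rng.normal(size=(d_model, d_k))     # learned projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # query, key, and value for every input
```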
Self-attention
Similarity is determined by the dot product of the query of one input with the keys of all the inputs. E.g., for input 1, this gives a vector of dot products (q_1 · k_1), (q_1 · k_2), ...
Apply scaling and a softmax to get a distribution over the input vectors. This gives a distribution p_11, p_12, ... that is the attention of input 1 over all inputs.
Use the attention vector to take a weighted sum over the value vectors of the inputs: z_1 = p_11 v_1 + p_12 v_2 + ...
This is the output of the self-attention for input 1.
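The same computation for all inputs at once, as a minimal numpy sketch with random stand-ins for the projected queries, keys, and values:

```python
import numpy as np

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
T, d_k = 4, 8
Q = rng.normal(size=(T, d_k))             # stand-ins for the projected queries,
K = rng.normal(size=(T, d_k))             # keys,
V = rng.normal(size=(T, d_k))             # and values

scores = Q @ K.T / np.sqrt(d_k)           # scores[i, j] = q_i . k_j / sqrt(d_k)
P = row_softmax(scores)                   # row i: distribution p_i1, p_i2, ... over inputs
Z = P @ V                                 # z_i = p_i1 v_1 + p_i2 v_2 + ...
```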
Self-attention
Multi-headed attention: run several copies in parallel and concatenate the outputs for the next layer.
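A minimal numpy sketch, with each head given its own randomly initialised projections; the usual final output projection after concatenation is omitted.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    # Scaled dot-product attention for one head, as in the previous sketch.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P = e / e.sum(axis=-1, keepdims=True)
    return P @ V

rng = np.random.default_rng(6)
T, d_model, d_k, n_heads = 4, 16, 8, 4
X = rng.normal(size=(T, d_model))

heads = [self_attention(X,
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)))
         for _ in range(n_heads)]
Z = np.concatenate(heads, axis=-1)        # (T, n_heads * d_k), input to the next layer
```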
Related work
Neural Turing Machines – combine an RNN with an external memory.
Neural Turing Machines
Use attention to do weighted reads/writes at every location. Content-based attention can be combined with location-based attention to take advantage of both.
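A minimal numpy sketch of the weighted read and write: instead of addressing one slot, a weighting over all slots touches every location a little, which keeps memory access differentiable. The weighting and the erase/add vectors are taken as given; in the NTM they come from the controller.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(7)
N, d = 8, 4                               # memory slots, vector size per slot
M = rng.normal(size=(N, d))               # external memory matrix
w = softmax(rng.normal(size=N))           # attention weighting over slots

read = w @ M                              # weighted read across all locations

erase = rng.uniform(size=d)               # erase vector in [0, 1]
add = rng.normal(size=d)                  # add vector
M = M * (1 - np.outer(w, erase)) + np.outer(w, add)   # weighted write at every location
```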
Related work
Adaptive computation time for RNNs
Include a probability distribution over the number of steps taken for a single input. The final output is a weighted sum over the steps for that input.
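A minimal numpy sketch of the idea, with placeholder step outputs and halting scores; a softmax stands in here for adaptive computation time's actual sigmoid-and-remainder halting mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(8)
n_steps, d = 3, 5
step_outputs = rng.normal(size=(n_steps, d))   # one output per internal step
halt_scores = rng.normal(size=n_steps)         # would come from the network

p_steps = softmax(halt_scores)                 # distribution over the number of steps
final_output = p_steps @ step_outputs          # weighted sum over the steps
```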
Related work
Neural programmer
Determine a sequence of operations to solve some problem. Use a probability distribution to combine multiple possible sequences.
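A minimal numpy sketch of soft selection over operations, with toy data and placeholder scores: rather than picking one operation, a softmax gives a distribution over all operations and the result is their weighted combination, which keeps the choice differentiable. The real Neural Programmer also selects over table columns and runs for several steps.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

data = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
operations = [np.sum, np.max, np.min]          # candidate operations
op_scores = np.array([2.0, 0.1, -1.0])         # would come from the controller
p_ops = softmax(op_scores)                     # distribution over operations

results = np.array([op(data) for op in operations])
soft_result = p_ops @ results                  # probability-weighted combination
```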
Attention: summary
Attention uses a probability distribution to allow the network to learn and use the inputs relevant to the RNN's output.
This can be used in multiple ways to augment RNNs:
  better use of the input to the encoder
  external memory
  program control (adaptive computation)
  neural programming