Attention for translation
Learn to encode multiple pieces of information and use them selectively for the output.
Encode the input sentence into a sequence of vectors.
Choose a subset of these adaptively while decoding (translating) – choose the vectors most relevant to the current output.
I.e., learn to jointly align and translate.
Question: How can we learn and use a vector to decide where to focus attention? How can we make that differentiable so it works with gradient descent?
Bahdanau et al., 2015
Soft attention
Use a probability distribution over all inputs.
Classification assigns a probability to every possible output; attention uses a probability distribution to weight every possible input – it learns to weight the more relevant parts more heavily.
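A minimal numpy sketch of this parallel, with illustrative scores and inputs rather than values from any real model: the same softmax that turns classification scores into a distribution over outputs turns relevance scores into weights over inputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Classification: a distribution over possible outputs (classes).
class_scores = np.array([2.0, 0.5, -1.0])
p_classes = softmax(class_scores)          # roughly [0.79, 0.18, 0.04], sums to 1

# Soft attention: a distribution over inputs, used to weight them.
inputs = np.array([[1.0, 0.0],             # three input vectors
                   [0.0, 1.0],
                   [1.0, 1.0]])
relevance = np.array([0.1, 2.0, -0.5])     # relevance scores (assumed given here)
attn = softmax(relevance)                  # probability over the three inputs
weighted_input = attn @ inputs             # more relevant inputs weigh more heavily
```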
Attention for translation
The input x and the output y are sequences. The encoder has a hidden state h_i associated with each x_i. These states come from a bidirectional RNN, so the encoding carries information from both sides of the input.
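A minimal numpy sketch of the bidirectional encoder, with randomly initialised weights standing in for a trained network: each h_i concatenates a left-to-right pass and a right-to-left pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, T = 4, 8, 5                    # input dim, hidden dim, sequence length
x = rng.normal(size=(T, d_in))            # input sequence x_1 .. x_T

W_f, U_f = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W_b, U_b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

h_fwd = np.zeros((T, d_h))
h_bwd = np.zeros((T, d_h))
prev = np.zeros(d_h)
for i in range(T):                        # left-to-right pass
    prev = np.tanh(W_f @ x[i] + U_f @ prev)
    h_fwd[i] = prev
prev = np.zeros(d_h)
for i in reversed(range(T)):              # right-to-left pass
    prev = np.tanh(W_b @ x[i] + U_b @ prev)
    h_bwd[i] = prev

H = np.concatenate([h_fwd, h_bwd], axis=1)  # h_i has context from both sides
```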
Attention for translation
Decoder: each output y_i is predicted using
  the previous output y_{i-1}
  the decoder hidden state s_i
  the context vector c_i
The decoder hidden state depends on the previous state s_{i-1}.
Attention is embedded in the context vector via learned weights.
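A simplified numpy sketch of one decoder step, just to make the dependencies explicit: Bahdanau et al. use a GRU cell here, while a plain tanh cell and assumed weight matrices are used below.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c, W_s, W_o):
    # New state depends on the previous state, previous output, and the context.
    s = np.tanh(W_s @ np.concatenate([y_prev, s_prev, c]))
    # Output distribution over the target vocabulary.
    p_y = softmax(W_o @ np.concatenate([y_prev, s, c]))
    return s, p_y

rng = np.random.default_rng(1)
d_y, d_s, d_c, vocab = 6, 8, 8, 10
W_s = rng.normal(size=(d_s, d_y + d_s + d_c))
W_o = rng.normal(size=(vocab, d_y + d_s + d_c))
s, p_y = decoder_step(rng.normal(size=d_y), np.zeros(d_s), rng.normal(size=d_c), W_s, W_o)
```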
Context vector
The encoder hidden states are combined in a weighted average to form a context vector c_t for the t-th output. This can capture features from each part of the input.
Attention for translation
The weights v_a and W_a are learned using a feedforward network trained jointly with the rest of the network.
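A minimal numpy sketch of this alignment model, with random placeholder values for the encoder states, the previous decoder state, W_a, and v_a. Bahdanau et al. project s and h separately; a single matrix over their concatenation is used here for brevity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
T, d_h, d_s, d_a = 5, 8, 8, 16            # source length, encoder/decoder/attention dims
H = rng.normal(size=(T, d_h))             # encoder states h_1 .. h_T
s_prev = rng.normal(size=d_s)             # previous decoder state s_{t-1}
W_a = rng.normal(size=(d_a, d_s + d_h))   # learned jointly with the network
v_a = rng.normal(size=d_a)

# Small feedforward net scores each encoder state against the decoder state.
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h])) for h in H])
alpha = softmax(scores)                   # attention weights over source positions
c_t = alpha @ H                           # context vector: weighted average of the h_i
```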
Attention for translation
Key ideas:
Implement attention as a probability distribution over inputs/features.
Extend the encoder/decoder pair to include context information relevant to the current decoding task.
Attention with images
A CNN can be combined with an RNN using attention.
The CNN extracts high-level features. The RNN generates a description, using attention to focus on the relevant parts of the image.
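A minimal numpy sketch with a placeholder feature map and scores: the CNN feature map is flattened into location vectors that play the same role the encoder states play in translation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
feature_map = rng.normal(size=(14, 14, 512))   # stand-in for CNN output (H x W x channels)
locations = feature_map.reshape(-1, 512)       # 196 location vectors, like encoder states
scores = rng.normal(size=locations.shape[0])   # would come from the decoder state
attn = softmax(scores)                         # where to look for the next word
context = attn @ locations                     # weighted image summary fed to the RNN
```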
Self-attention
Previously: focus attention on the input while working on the output.
Self-attention: focus on other parts of the input while processing the input.
Self-attention
Use each input vector to produce a query, key, and value for that input: each is a matrix multiplication of the input's embedding with one of the learned matrices W_Q, W_K, W_V.
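A minimal numpy sketch with random embeddings and projection matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d_model, d_k = 4, 16, 8                # sequence length, embedding dim, head dim
X = rng.normal(size=(T, d_model))         # input embeddings, one row per token

W_Q = rng.normal(size=(d_model, d_k))     # learned projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V       # query, key, and value for every input
```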
Self-attention
Similarity is determined by the dot product of the query of one input with the keys of all the inputs. E.g., for input 1, this gives a vector of dot products (q_1 · k_1), (q_1 · k_2), ...
Apply scaling and a softmax to get a distribution over the input vectors. This gives a distribution p_11, p_12, ... that is the attention of input 1 over all inputs.
Use the attention vector to take a weighted sum over the value vectors of the inputs: z_1 = p_11 v_1 + p_12 v_2 + ...
This is the output of the self-attention for input 1.
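The same computation for all inputs at once, as a minimal numpy sketch with random stand-ins for the projected queries, keys, and values:

```python
import numpy as np

def row_softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
T, d_k = 4, 8
Q = rng.normal(size=(T, d_k))             # stand-ins for the projected queries,
K = rng.normal(size=(T, d_k))             # keys,
V = rng.normal(size=(T, d_k))             # and values

scores = Q @ K.T / np.sqrt(d_k)           # scores[i, j] = q_i . k_j / sqrt(d_k)
P = row_softmax(scores)                   # row i: distribution p_i1, p_i2, ... over inputs
Z = P @ V                                 # z_i = p_i1 v_1 + p_i2 v_2 + ...
```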
Self-attention
Multi-headed attention: run several copies in parallel and concatenate the outputs for the next layer.
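A minimal numpy sketch, with each head given its own randomly initialised projections; the usual final output projection after concatenation is omitted.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    # Scaled dot-product attention for one head, as in the previous sketch.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P = e / e.sum(axis=-1, keepdims=True)
    return P @ V

rng = np.random.default_rng(6)
T, d_model, d_k, n_heads = 4, 16, 8, 4
X = rng.normal(size=(T, d_model))

heads = [self_attention(X,
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)),
                        rng.normal(size=(d_model, d_k)))
         for _ in range(n_heads)]
Z = np.concatenate(heads, axis=-1)        # (T, n_heads * d_k), input to the next layer
```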
Related work
Neural Turing Machines – combine an RNN with an external memory.
Neural Turing Machines
Use attention to do weighted reads/writes at every location. Content-based attention can be combined with location-based attention to take advantage of both.
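A minimal numpy sketch of the weighted read and write: instead of addressing one slot, a weighting over all slots touches every location a little, which keeps memory access differentiable. The weighting and the erase/add vectors are taken as given; in the NTM they come from the controller.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(7)
N, d = 8, 4                               # memory slots, vector size per slot
M = rng.normal(size=(N, d))               # external memory matrix
w = softmax(rng.normal(size=N))           # attention weighting over slots

read = w @ M                              # weighted read across all locations

erase = rng.uniform(size=d)               # erase vector in [0, 1]
add = rng.normal(size=d)                  # add vector
M = M * (1 - np.outer(w, erase)) + np.outer(w, add)   # weighted write at every location
```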
Related work
Adaptive computation time for RNNs
Include a probability distribution over the number of steps taken for a single input. The final output is a weighted sum over the steps for that input.
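A minimal numpy sketch of the idea, with placeholder step outputs and halting scores; a softmax stands in here for adaptive computation time's actual sigmoid-and-remainder halting mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(8)
n_steps, d = 3, 5
step_outputs = rng.normal(size=(n_steps, d))   # one output per internal step
halt_scores = rng.normal(size=n_steps)         # would come from the network

p_steps = softmax(halt_scores)                 # distribution over the number of steps
final_output = p_steps @ step_outputs          # weighted sum over the steps
```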
Related work
Neural programmer
Determine a sequence of operations to solve some problem. Use a probability distribution to combine multiple possible sequences.
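A minimal numpy sketch of soft selection over operations, with toy data and placeholder scores: rather than picking one operation, a softmax gives a distribution over all operations and the result is their weighted combination, which keeps the choice differentiable. The real Neural Programmer also selects over table columns and runs for several steps.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

data = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
operations = [np.sum, np.max, np.min]          # candidate operations
op_scores = np.array([2.0, 0.1, -1.0])         # would come from the controller
p_ops = softmax(op_scores)                     # distribution over operations

results = np.array([op(data) for op in operations])
soft_result = p_ops @ results                  # probability-weighted combination
```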
Attention: summary
Attention uses a probability distribution to allow the network to learn and use the inputs relevant to the RNN's output.
This can be used in multiple ways to augment RNNs:
  better use of the input to the encoder
  external memory
  program control (adaptive computation)
  neural programming