
1 Attention

2 Recurrent neural networks (RNNs)
Idea: the input is processed sequentially, repeating the same operation at every step. LSTM and GRU cells add gating variables that modulate dependence on earlier inputs (forgetting) and make backpropagation easier (credit assignment).
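To make the gating idea concrete, here is a minimal sketch of a GRU-style step in NumPy; the weight names, shapes, and the run_rnn helper are illustrative assumptions, not the slide's notation.

```python
# Minimal sketch of GRU-style gating (illustrative; weight names are assumptions).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One recurrent step: gates modulate how much of the past state is kept."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate: how much to overwrite
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate: how much of the past to use
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate new state
    return (1.0 - z) * h_prev + z * h_tilde        # gated blend of old and new state

def run_rnn(xs, h0, params):
    """Sequential processing: the same operation is repeated at every position."""
    h = h0
    for x in xs:
        h = gru_step(x, h, params)
    return h
```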

3 seq2seq Combine two RNNs to map one sequence to another (e.g., for sentence translation). The encoder RNN compresses the input into a context vector (a generalization of a word embedding). The decoder RNN takes the context vector and expands it to produce the output sequence.
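As a rough illustration of the encoder-decoder setup, here is a minimal seq2seq sketch in PyTorch; the class name, layer sizes, and toy data are assumptions made for the example, not the slide's model.

```python
# Minimal encoder-decoder (seq2seq) sketch; sizes and names are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder compresses the whole input into one context vector (its final hidden state).
        _, context = self.encoder(self.src_emb(src_ids))
        # Decoder expands that context vector into the output sequence (teacher forcing here).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), context)
        return self.out(dec_out)               # per-step scores over the target vocabulary

# Toy usage: batch of 2 sentences, source length 7, target length 5.
model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))
tgt = torch.randint(0, 1200, (2, 5))
logits = model(src, tgt)                       # shape (2, 5, 1200)
```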

4 seq2seq: problems Difficult to make long-range associations.
Sequential input makes the strongest associations between nearest neighbors. The entire content is compressed into a single, usually fixed-length vector. Question: How can we learn to attend to the relevant parts of the input when we need them for the output? This would help address both problems above.

5 Attention for translation
Learn to encode multiple pieces of information and use them selectively for the output. Encode the input sentence into a sequence of vectors. Choose a subset of these adaptively while decoding (translating), i.e., choose the vectors most relevant to the current output. In other words, learn to jointly align and translate (Bahdanau et al., 2015). Question: How can we learn and use a vector to decide where to focus attention? How can we make that differentiable so it works with gradient descent?

6 Soft attention Use a probability distribution over all inputs.
Classification assigns a probability to every possible output; attention uses probabilities to weight every possible input, learning to weight the more relevant parts more heavily.

7 Soft attention Content-based attention
Each position has a vector encoding the content there. Take the dot product of each of these vectors with a query vector, then apply a softmax to get attention weights (see the sketch below).
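A minimal sketch of content-based (dot-product) soft attention in NumPy; the function names and toy sizes are assumptions for illustration.

```python
# Minimal sketch of content-based (dot-product) soft attention; names are illustrative.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_attention(query, keys, values):
    """Weight every input position by how well its content matches the query."""
    scores = keys @ query             # dot product of the query with each position's vector
    weights = softmax(scores)         # probability distribution over input positions
    return weights @ values, weights  # weighted sum of the inputs, plus the weights

# Example: 6 input positions, each encoded as a 4-dimensional vector.
rng = np.random.default_rng(0)
encodings = rng.normal(size=(6, 4))
query = rng.normal(size=4)
context, weights = soft_attention(query, encodings, encodings)
```

Because the whole computation is only matrix products and a softmax, the attention weights are differentiable with respect to the encodings and the query, which is the point made on the next slide.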

8 Soft attention Content-based attention allows the network to associate the relevant parts of the input with the current part of the output. Using dot products and a softmax makes this attention framework differentiable, so it fits into the usual SGD learning mechanism.

