Attention
Recurrent neural networks (RNNs)
Idea: the input is processed sequentially, repeating the same operation at each step. LSTM and GRU cells use gating variables that modulate dependence on past inputs (forgetting) and make backpropagation easier (credit assignment). https://distill.pub/2016/augmented-rnns/
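A minimal NumPy sketch of one GRU step (biases omitted; sizes, weights, and variable names are illustrative, not from the slides), showing how the update and reset gates modulate how much of the previous state is kept:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, params):
    """One GRU step: gates modulate how much of the previous state is kept."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x + Uz @ h_prev)             # update gate: how much to overwrite
    r = sigmoid(Wr @ x + Ur @ h_prev)             # reset gate: how much history to use
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate new state
    return (1 - z) * h_prev + z * h_cand          # gated blend of old and new state

# Tiny usage example with random weights (illustrative sizes only).
d_in, d_h = 4, 3
rng = np.random.default_rng(0)
params = [rng.standard_normal((d_h, d_in)) if i % 2 == 0 else rng.standard_normal((d_h, d_h))
          for i in range(6)]
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):          # process a length-5 sequence, one step at a time
    h = gru_step(x, h, params)
```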
seq2seq Combine 2 RNNs to map from one sequence to another (e.g., for sentence translation). Encoder RNN compresses the input into a context vector (generalization of a word embedding). Decoder RNN takes the embedding and expands to produce the output. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
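A minimal PyTorch sketch of the encoder-decoder pipeline, assuming GRU encoder and decoder and teacher-forced decoder inputs; the sizes and the toy 100-word target vocabulary are illustrative only:

```python
import torch
import torch.nn as nn

d_src, d_tgt, d_h = 8, 8, 16           # source/target embedding sizes, hidden size (illustrative)

encoder = nn.GRU(d_src, d_h, batch_first=True)
decoder = nn.GRU(d_tgt, d_h, batch_first=True)
readout = nn.Linear(d_h, 100)          # project decoder states onto a toy target vocabulary

src = torch.randn(1, 10, d_src)        # a length-10 "source sentence" of embeddings
_, context = encoder(src)              # context: (1, 1, d_h), the single compressed vector

tgt_in = torch.randn(1, 7, d_tgt)      # decoder inputs (previous target embeddings)
dec_out, _ = decoder(tgt_in, context)  # decoder starts from the context vector
logits = readout(dec_out)              # (1, 7, 100): one output distribution per position
```

Note that everything the decoder knows about the source sentence has to pass through the single context vector, which is the bottleneck the next slide describes.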
seq2seq: problems
Long-range associations are difficult: sequential input makes the strongest associations between nearest neighbors. The entire input is compressed into a single, usually fixed-length vector. Question: how can we learn to attend to the relevant parts of the input when we need them for the output? This would help address both problems.
Attention for translation
Learn to encode multiple pieces of information and use them selectively for the output. Encode the input sentence into a sequence of vectors, then choose among these adaptively while decoding (translating), favoring the vectors most relevant to the current output; i.e., learn to jointly align and translate (see the sketch below). Questions: how can we learn and use a vector to decide where to focus attention, and how can we make that differentiable so it works with gradient descent? Bahdanau et al., 2015: https://arxiv.org/pdf/1409.0473.pdf
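A NumPy sketch of Bahdanau-style (additive) alignment, assuming the usual scoring form e_ij = v^T tanh(Wa s_{i-1} + Ua h_j); matrix names and sizes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, H, Wa, Ua, v):
    """Score each encoder annotation h_j against the previous decoder state
    s_prev, then return the softmax-weighted sum (the context for this step)."""
    scores = np.array([v @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])  # alignment scores e_ij
    alpha = softmax(scores)                 # alignment weights over input positions
    context = alpha @ H                     # c_i = sum_j alpha_ij * h_j
    return context, alpha

# Illustrative sizes: 6 encoder annotations of dim 4, decoder state of dim 5, attention dim 3.
rng = np.random.default_rng(1)
H = rng.standard_normal((6, 4))             # encoder annotations h_1..h_6
s_prev = rng.standard_normal(5)             # previous decoder hidden state
Wa, Ua, v = rng.standard_normal((3, 5)), rng.standard_normal((3, 4)), rng.standard_normal(3)
context, alpha = additive_attention(s_prev, H, Wa, Ua, v)
```

A fresh context vector is computed at every decoding step, so the model can look back at different parts of the input for different output words.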
Soft attention
Use a probability distribution over all inputs. Just as classification assigns a probability to every possible output, attention assigns a weight to every possible input, learning to weight the more relevant parts more heavily. https://distill.pub/2016/augmented-rnns/
Soft attention: content-based attention
Each position has a vector encoding its content. Take the dot product of each with a query vector, then apply a softmax to obtain the attention weights, as in the sketch below. https://distill.pub/2016/augmented-rnns/
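A minimal NumPy sketch of content-based soft attention as just described (sizes illustrative; here each position's content vector serves as both the thing scored and the thing returned):

```python
import numpy as np

def content_attention(query, keys, values):
    """Dot each content vector with the query, softmax the scores into a
    probability distribution, and return the weighted sum of the values."""
    scores = keys @ query                        # one similarity score per input position
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()            # softmax: a distribution over inputs
    return weights @ values, weights

# Illustrative example: 5 input positions, each with a content vector of dim 4.
rng = np.random.default_rng(2)
keys = values = rng.standard_normal((5, 4))      # content vectors used for both scoring and output
query = rng.standard_normal(4)
attended, weights = content_attention(query, keys, values)
```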
Soft attention
Content-based attention allows the network to associate relevant parts of the input with the current part of the output. Using dot products and a softmax makes the whole attention mechanism differentiable, so it fits into the usual SGD learning procedure.
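A short autograd check of that last point (a sketch, not part of the original slides): because the attention output is built only from dot products, a softmax, and a weighted sum, gradients flow back to both the query and the inputs.

```python
import torch

keys = torch.randn(5, 4, requires_grad=True)    # content vectors at 5 positions
query = torch.randn(4, requires_grad=True)

weights = torch.softmax(keys @ query, dim=0)    # soft attention distribution
attended = weights @ keys                       # weighted sum of the inputs
loss = attended.sum()                           # stand-in for a downstream loss
loss.backward()

print(query.grad is not None, keys.grad is not None)  # True True: trainable by SGD
```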