Attention Is All You Need
Presenter: Haotian Xu
Agenda
- Background: neural machine translation, attention mechanism, Google vs. Facebook
- Motivation
- Model
- Experimental results
- Discussion
Neural machine translation (NMT)
- A human translator reads the entire sentence, understands its meaning, and then produces a translation.
- NMT mimics that process, e.g. English -> French.
Neural machine translation (NMT)
- Conventional encoder/decoder choice: RNN / LSTM / GRU.
Attention mechanism
- Establishes direct shortcuts between the target and the source.
Attention mechanism
- Alignments between source and target sentences, e.g. French -> English.
Google vs. Facebook
- Google seq2seq: Sequence to Sequence Learning with Neural Networks, 2014 (LSTM).
- Facebook ConvS2S: Convolutional Sequence to Sequence Learning, 05/2017 (CNN).
- Google Transformer: Attention Is All You Need, 06/2017 (based solely on attention mechanisms).
Motivation
- Drawbacks of RNNs: they preclude parallelization and struggle with long-range dependencies.
- Maximum path length between distant positions: O(N) for RNN, O(N/k) for CNN with kernel size k, O(1) for the Transformer.
Model
- Encoder: N=6 blocks, each with multi-head attention and a position-wise feed-forward network.
- Decoder: N=6 blocks as in the encoder, plus masked multi-head attention to prevent positions from attending to subsequent positions.
- Mask: ensures that the predictions for position i can depend only on the known outputs at positions less than i.
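A minimal sketch of the decoder's causal mask in PyTorch (my own illustration, not the presenter's code): future positions are hidden by setting their attention logits to -inf before the softmax.

import torch

def causal_mask(size):
    # Upper-triangular True entries mark "future" positions that must be hidden.
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

# Example: apply the mask to a (size x size) matrix of attention logits.
logits = torch.randn(5, 5)
masked = logits.masked_fill(causal_mask(5), float('-inf'))
weights = torch.softmax(masked, dim=-1)  # row i puts zero weight on positions > i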
Positional encoding
- Injects position information about the tokens in the sequence; with no recurrence, the model otherwise cannot make use of the order of the sequence.
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position and i is the dimension.
- The positional encodings have the same dimension d_model as the embeddings.
- For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
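A short PyTorch sketch of the sinusoidal encoding above (illustrative only; assumes an even d_model, and the sequence length 50 is arbitrary):

import torch

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)      # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32).unsqueeze(0)  # (1, d_model/2)
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=512)  # same dimension d_model as the embeddings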
Attention
- Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
- Multi-head attention: linearly project Q, K, V h times with different, learned projections; h = 8 parallel heads are employed.
- Optional mask (used in the decoder).
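A compact sketch of scaled dot-product attention (my illustration, assuming tensors shaped (..., seq, d_k) and an optional boolean mask):

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))   # e.g. the decoder's causal mask
    weights = torch.softmax(scores, dim=-1)
    return weights @ v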
Position-wise feed-forward layer
- Applied to each position separately and identically: FFN(x) = max(0, xW1 + b1)W2 + b2.
- The linear transformations are the same across different positions, but they use different parameters from layer to layer.
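A sketch of the position-wise feed-forward layer (the dimensions d_model=512 and d_ff=2048 follow the paper; the module itself is my illustration):

import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position.
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        return self.w2(torch.relu(self.w1(x)))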
Experimental results
- Training cost measured in floating point operations (FLOPs).
Discussion
- RNN vs. ConvS2S vs. Transformer.
- Long-term dependencies: Transformer > ConvS2S; the Transformer behaves like a convolution with very wide filters.
- Representation ability: RNN + attention leads to information duplication.