1
Attention Is All You Need
Presenter: Haotian Xu
2
Agenda
Background: neural machine translation, attention mechanism, Google vs. Facebook
Motivation
Model
Experimental results
Discussion
3
Neural machine translation (NMT)
Humans read the entire sentence, understand its meaning, and produce a translation. NMT mimics that (example: English->French).
4
Neural machine translation (NMT)
Conventional encoder/decoder choice: RNN/LSTM/GRU
5
Attention mechanism: establishes direct shortcuts between the target and the source
6
Attention mechanism Alignments between source and target sentences
French->English
7
Google vs. Facebook
Google seq2seq: Sequence to Sequence Learning with Neural Networks, 2014 (LSTM)
Facebook ConvS2S: Convolutional Sequence to Sequence Learning, 05/2017 (CNN)
Google Transformer: Attention Is All You Need, 06/2017 (based solely on attention mechanisms)
8
Motivation
Drawbacks of RNN:
Long-range dependency*
Precludes parallelization
High computation complexity
* Path length between two distant positions: O(N) for RNN, O(N/k) for CNN with kernel size k, O(1) for Transformer
9
Model
Encoder: N=6 blocks, each with multi-head attention and a position-wise feed-forward network
Decoder: same as the encoder, plus masked multi-head attention, which prevents positions from attending to subsequent positions
Mask: ensures that the predictions for position i can depend only on the known outputs at positions less than i
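The decoder mask can be pictured as a lower-triangular table. The sketch below is illustrative only: the function name causal_mask is hypothetical, and it assumes the usual setup from the paper in which the decoder input is shifted right by one position, so letting position i look at input positions up to and including i exposes only outputs at positions less than i.

```python
def causal_mask(n):
    """Return an n x n table; entry [i][j] is True if position i may attend to position j."""
    # Position i may look at itself and at earlier positions, never at later ones.
    return [[j <= i for j in range(n)] for i in range(n)]


mask = causal_mask(4)
for row in mask:
    # "1" marks an allowed position, "." a blocked (future) position.
    print("".join("1" if allowed else "." for allowed in row))
# prints:
# 1...
# 11..
# 111.
# 1111
```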
10
Positional encoding
Injects relative position info of the tokens in the sequence, so the model can make use of the order of the sequence even though it has no recurrence
PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
pos is the position and i is the dimension
The positional encodings have the same dimension d_model as the embeddings
PE(pos+k) can be represented as a linear function of PE(pos)
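A minimal sketch of the sinusoidal encoding used in the Attention Is All You Need paper (sine on even dimensions, cosine on odd ones), written in plain Python. The function names positional_encoding and shift are hypothetical, and d_model is assumed to be even. The second function illustrates the "linear function" remark: each sine/cosine pair at position pos + k is a rotation of the pair at position pos, with a rotation that depends only on k.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal encodings (d_model assumed even)."""
    table = []
    for pos in range(max_len):
        row = []
        for i in range(d_model // 2):
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table


def shift(pe_row, k, d_model):
    """Compute the encoding of position pos + k from the encoding of pos alone."""
    out = []
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        s, c = pe_row[2 * i], pe_row[2 * i + 1]
        # Angle-addition identities: a fixed linear map (a rotation) for a given k.
        out.append(s * math.cos(k * w) + c * math.sin(k * w))
        out.append(c * math.cos(k * w) - s * math.sin(k * w))
    return out


pe = positional_encoding(max_len=50, d_model=8)
shifted = shift(pe[3], k=5, d_model=8)
# Largest gap between the shifted row and the directly computed row for position 8.
print(max(abs(a - b) for a, b in zip(shifted, pe[8])))
```

The printed gap should be at the level of floating-point rounding error.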
11
Attention
Scaled dot-product attention
Multi-head attention: linearly project Q, K, V h times with different, learned projections; 8 parallel heads employed
Optional mask
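The two attention variants can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: matrices are lists of lists, the helper names (softmax, matmul, scaled_dot_product_attention, multi_head_attention) are hypothetical, and the projection matrices are passed in by the caller rather than learned. The scaled dot-product step follows the paper's formula softmax(Q K^T / sqrt(d_k)) V.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v.

    mask[i][j] == False blocks key j for query i. Each mask row must allow
    at least one key, otherwise the softmax is undefined.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return out


def multi_head_attention(Q, K, V, heads, W_o, mask=None):
    """heads: list of (W_q, W_k, W_v) projection triples, one per head; W_o: output projection."""
    per_head = [
        scaled_dot_product_attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v), mask)
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs position by position, then project.
    concat = [sum((head[i] for head in per_head), []) for i in range(len(Q))]
    return matmul(concat, W_o)


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))                         # blend of both value rows
print(scaled_dot_product_attention(Q, K, V, mask=[[True, False]]))   # only the first value row
```

The paper uses h = 8 heads; the sketch accepts any number of heads so that small cases are easy to check by hand.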
12
Position-wise feed-forward layer
Applied to each position separately and identically. The linear transformations are the same across different positions, but they use different parameters from layer to layer.
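A minimal sketch of the position-wise layer, assuming the form given in the paper, FFN(x) = max(0, x W1 + b1) W2 + b2. The function names and the tiny weight matrices are hypothetical, chosen so the result can be checked by hand. The point to notice is that one set of parameters is reused for every position, so identical input vectors yield identical outputs regardless of where they sit in the sequence.

```python
def feed_forward(x, W1, b1, W2, b2):
    """Apply FFN(x) = max(0, x W1 + b1) W2 + b2 to a single position vector x."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)  # ReLU(x W1 + b1)
        for col, b in zip(zip(*W1), b1)
    ]
    return [sum(h * w for h, w in zip(hidden, col)) + b for col, b in zip(zip(*W2), b2)]


def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same feed_forward, with the same parameters, to every position in X."""
    return [feed_forward(x, W1, b1, W2, b2) for x in X]


# Toy sizes (assumptions for illustration): model dimension 2, inner dimension 3.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.5, 0.5]

X = [[1.0, -1.0], [1.0, -1.0], [2.0, 0.0]]  # positions 0 and 1 hold the same vector
print(position_wise_ffn(X, W1, b1, W2, b2))
```

A second layer would call position_wise_ffn with its own W1, b1, W2, b2, which is what "different parameters from layer to layer" refers to.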
13
Experimental results floating point operations
14
Discussion
RNN vs. ConvS2S vs. Transformer
Long-term dependencies: Transformer > ConvS2S; Transformer = Conv with very wide filters
Representation ability: RNN+attention <- info duplication