1
Attention Is All You Need
Shiyue Zhang
2
Introduction
Existing NMT architectures: RNN + attention, CNN + attention
This work: the Transformer, with no RNN and no CNN, only attention
3
Model Architecture: Attention
Attention in RNNsearch (additive attention):
Query Q: s_{t-1}; Keys K: [h_1, h_2, ..., h_T]; Values V: [h_1, h_2, ..., h_T]
Attention(Q, K, V) = softmax(v^T tanh(W_q Q + W_k K)) V
Scaled Dot-Product Attention (dot-product attention):
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Multi-Head Attention: d_model = 512 is split into h = 8 heads with d_k = d_v = 512/8 = 64; the 8 heads of size 64 are concatenated (8*64 = 512) and projected back through a 512*512 output matrix
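To make the formulas concrete, here is a minimal NumPy sketch of scaled dot-product attention and the multi-head shapes above (d_model = 512, h = 8, d_k = d_v = 64). It is an illustration only; the weight names and random inputs are invented for the example, not taken from the paper's code.

```python
# Illustrative NumPy sketch of scaled dot-product and multi-head attention.
# Shapes follow the slide: d_model = 512, h = 8 heads, d_k = d_v = 512/8 = 64,
# and a 512x512 output projection. Weight names (W_Q, W_K, W_V, W_O) are
# placeholders for this example, not the paper's code.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, batched over heads."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # softmax over keys
    return weights @ V                                      # (h, T, d_v)

d_model, h, T = 512, 8, 10
d_k = d_model // h                                          # 64
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))                           # one "sentence"

# Per-head projections (d_model x d_k each) and the output projection W_O.
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))                   # 512 x 512

Q = np.einsum('td,hdk->htk', X, W_Q)
K = np.einsum('td,hdk->htk', X, W_K)
V = np.einsum('td,hdk->htk', X, W_V)
heads = scaled_dot_product_attention(Q, K, V)               # (8, 10, 64)
out = heads.transpose(1, 0, 2).reshape(T, h * d_k) @ W_O    # concat + project
print(out.shape)                                            # (10, 512)
```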
4
Model Architecture: Encoder
Two sub-layers (×6):
Multi-head attention (self-attention): h_1 = LayerNorm(attention(x_1, X, X) + x_1)
Feed forward: FFN(h_1) = max(0, h_1 W_1 + b_1) W_2 + b_2, applied as LayerNorm(FFN(h_1) + h_1)
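A minimal sketch of one encoder layer, assuming a `self_attn` callable that maps a (T, d_model) sequence to (T, d_model) (e.g. the multi-head attention sketched above); the simplified LayerNorm drops the learned gain and bias, and d_ff = 2048 is the paper's inner feed-forward size.

```python
# Sketch of one encoder layer: self-attention sub-layer, then position-wise
# feed-forward, each wrapped in a residual connection and LayerNorm.
# `self_attn` is assumed to map (T, d_model) -> (T, d_model); the LayerNorm
# here omits the learned gain/bias for brevity. d_ff = 2048 follows the paper.
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(X, self_attn, W1, b1, W2, b2):
    # Sub-layer 1: multi-head self-attention + residual + LayerNorm.
    H = layer_norm(self_attn(X, X, X) + X)
    # Sub-layer 2: FFN(h) = max(0, h W1 + b1) W2 + b2, + residual + LayerNorm.
    F = np.maximum(0.0, H @ W1 + b1) @ W2 + b2
    return layer_norm(F + H)

d_model, d_ff, T = 512, 2048, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
identity_attn = lambda q, k, v: v        # stand-in for real multi-head attention
print(encoder_layer(X, identity_attn, W1, b1, W2, b2).shape)   # (10, 512)
```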
5
Model Architecture: Decoder
Three sub-layers (×6):
Masked multi-head attention (self-attention over previous outputs): h_1 = LayerNorm(attention(p_1, Y_prev, Y_prev) + p_1), where the mask blocks attention to future positions
Multi-head attention over the encoder outputs
Feed forward
Output: linear + softmax
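The "masked" part can be sketched as a causal mask added to the attention scores before the softmax; the single-head `masked_attention` below is an illustration of that idea, not the paper's implementation.

```python
# Sketch of the decoder's masked ("causal") self-attention: position i may only
# attend to positions <= i, so training can run in parallel without letting the
# model peek at future target tokens. Single-head here for clarity.
import numpy as np

def causal_mask(T):
    # -inf above the diagonal -> zero attention weight after the softmax.
    return np.triu(np.full((T, T), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

T, d = 5, 64
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(T, d))
out = masked_attention(Q, K, V)
print(np.allclose(out[0], V[0]))   # True: position 0 can only see itself
```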
6
Model Architecture
7
Model Architecture: Position Embedding (fixed)
Use sine and cosine functions of different frequencies.
"We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos)."
For each frequency, the pair (sin(pos+k), cos(pos+k)) is a rotation of (sin(pos), cos(pos)):
sin(pos+k) = sin(pos)cos(k) + cos(pos)sin(k)
cos(pos+k) = cos(pos)cos(k) - sin(pos)sin(k)
i.e. [sin(pos+k), cos(pos+k)]^T = [[cos k, sin k], [-sin k, cos k]] [sin(pos), cos(pos)]^T
Why sines with different frequencies? How does this make it easy to attend by relative positions?
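A sketch of these embeddings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), with a numerical check of the relative-position claim; `positional_encoding` is a hypothetical helper name, and the check verifies that, frequency by frequency, PE(pos + k) is a rotation (hence a linear function) of PE(pos).

```python
# Sketch of the fixed sinusoidal position embeddings,
#   PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),
# plus a numerical check that PE(pos + k) is obtained from PE(pos) by a
# per-frequency 2x2 rotation, i.e. a linear function of PE(pos).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, two_i / d_model)
    PE = np.zeros((max_len, d_model))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

d_model = 512
PE = positional_encoding(100, d_model)
pos, k = 10, 7
freq = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
s, c = np.sin(k * freq), np.cos(k * freq)

# [sin((pos+k)f), cos((pos+k)f)] = rotation by k*f applied to [sin(pos*f), cos(pos*f)]
pred_sin = PE[pos, 0::2] * c + PE[pos, 1::2] * s
pred_cos = PE[pos, 1::2] * c - PE[pos, 0::2] * s
print(np.allclose(pred_sin, PE[pos + k, 0::2]))    # True
print(np.allclose(pred_cos, PE[pos + k, 1::2]))    # True
```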
8
Why self-attention?
Advantages:
Lower per-layer computational complexity
Highly parallelizable (no sequential recurrence), so faster to train
Shorter paths between any two positions, so distant dependencies are easier to learn
9
Experiments: Translation (WMT14 En-De, En-Fr)
10
Experiments: English constituency parsing
11
Thanks! Q&A