1
Attention Is All You Need
Shiyue Zhang
2
Introduction
Existing NMT architectures: RNN + attention, CNN + attention
This work: the Transformer, with no RNN and no CNN, only attention
3
Model Architecture: Attention
Attention in RNNsearch (additive attention):
Query Q: s_{t-1}; Keys K: [h_1, h_2, ..., h_T]; Values V: [h_1, h_2, ..., h_T]
Attention(Q, K, V) = softmax(v^T tanh(W_q Q + W_k K)) V
Scaled Dot-Product Attention (dot-product attention):
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Multi-Head Attention: d_model = 512 is split into h = 8 heads with d_k = d_v = 512/8 = 64; the 8 heads of size 64 are concatenated (8*64 = 512) and projected back through a 512*512 output matrix
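To make the formulas concrete, here is a minimal NumPy sketch of scaled dot-product attention and the multi-head shapes above (d_model = 512, h = 8, d_k = d_v = 64). It is an illustration only; the weight names and random inputs are invented for the example, not taken from the paper's code.

```python
# Illustrative NumPy sketch of scaled dot-product and multi-head attention.
# Shapes follow the slide: d_model = 512, h = 8 heads, d_k = d_v = 512/8 = 64,
# and a 512x512 output projection. Weight names (W_Q, W_K, W_V, W_O) are
# placeholders for this example, not the paper's code.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, batched over heads."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)              # softmax over keys
    return weights @ V                                      # (h, T, d_v)

d_model, h, T = 512, 8, 10
d_k = d_model // h                                          # 64
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))                           # one "sentence"

# Per-head projections (d_model x d_k each) and the output projection W_O.
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))                   # 512 x 512

Q = np.einsum('td,hdk->htk', X, W_Q)
K = np.einsum('td,hdk->htk', X, W_K)
V = np.einsum('td,hdk->htk', X, W_V)
heads = scaled_dot_product_attention(Q, K, V)               # (8, 10, 64)
out = heads.transpose(1, 0, 2).reshape(T, h * d_k) @ W_O    # concat + project
print(out.shape)                                            # (10, 512)
```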
4
Model Architecture: Encoder
Two sub-layers (×6):
Multi-head attention (self-attention): h_1 = LayerNorm(attention(x_1, X, X) + x_1)
Feed forward: FFN(h_1) = max(0, h_1 W_1 + b_1) W_2 + b_2, applied as LayerNorm(FFN(h_1) + h_1)
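A minimal sketch of one encoder layer, assuming a `self_attn` callable that maps a (T, d_model) sequence to (T, d_model) (e.g. the multi-head attention sketched above); the simplified LayerNorm drops the learned gain and bias, and d_ff = 2048 is the paper's inner feed-forward size.

```python
# Sketch of one encoder layer: self-attention sub-layer, then position-wise
# feed-forward, each wrapped in a residual connection and LayerNorm.
# `self_attn` is assumed to map (T, d_model) -> (T, d_model); the LayerNorm
# here omits the learned gain/bias for brevity. d_ff = 2048 follows the paper.
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(X, self_attn, W1, b1, W2, b2):
    # Sub-layer 1: multi-head self-attention + residual + LayerNorm.
    H = layer_norm(self_attn(X, X, X) + X)
    # Sub-layer 2: FFN(h) = max(0, h W1 + b1) W2 + b2, + residual + LayerNorm.
    F = np.maximum(0.0, H @ W1 + b1) @ W2 + b2
    return layer_norm(F + H)

d_model, d_ff, T = 512, 2048, 10
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W1, b1 = 0.02 * rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = 0.02 * rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
identity_attn = lambda q, k, v: v        # stand-in for real multi-head attention
print(encoder_layer(X, identity_attn, W1, b1, W2, b2).shape)   # (10, 512)
```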
5
Model Architecture: Decoder
Three sub-layers (×6):
Masked multi-head attention (self-attention over previous outputs): h_1 = LayerNorm(attention(p_1, Y_prev, Y_prev) + p_1), where the mask blocks attention to future positions
Multi-head attention over the encoder outputs
Feed forward
Output: linear + softmax
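The "masked" part can be sketched as a causal mask added to the attention scores before the softmax; the single-head `masked_attention` below is an illustration of that idea, not the paper's implementation.

```python
# Sketch of the decoder's masked ("causal") self-attention: position i may only
# attend to positions <= i, so training can run in parallel without letting the
# model peek at future target tokens. Single-head here for clarity.
import numpy as np

def causal_mask(T):
    # -inf above the diagonal -> zero attention weight after the softmax.
    return np.triu(np.full((T, T), -np.inf), k=1)

def masked_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ V

T, d = 5, 64
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(T, d))
out = masked_attention(Q, K, V)
print(np.allclose(out[0], V[0]))   # True: position 0 can only see itself
```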
6
Model Architecture
7
Model Architecture: Position Embedding (fixed)
Use sine and cosine functions of different frequencies.
"We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos)."
For each frequency, the pair (sin(pos+k), cos(pos+k)) is a rotation of (sin(pos), cos(pos)):
sin(pos+k) = sin(pos)cos(k) + cos(pos)sin(k)
cos(pos+k) = cos(pos)cos(k) - sin(pos)sin(k)
i.e. [sin(pos+k), cos(pos+k)]^T = [[cos k, sin k], [-sin k, cos k]] [sin(pos), cos(pos)]^T
Why sines with different frequencies? How does this make it easy to attend by relative positions?
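A sketch of these embeddings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), with a numerical check of the relative-position claim; `positional_encoding` is a hypothetical helper name, and the check verifies that, frequency by frequency, PE(pos + k) is a rotation (hence a linear function) of PE(pos).

```python
# Sketch of the fixed sinusoidal position embeddings,
#   PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
#   PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)),
# plus a numerical check that PE(pos + k) is obtained from PE(pos) by a
# per-frequency 2x2 rotation, i.e. a linear function of PE(pos).
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, two_i / d_model)
    PE = np.zeros((max_len, d_model))
    PE[:, 0::2] = np.sin(angles)
    PE[:, 1::2] = np.cos(angles)
    return PE

d_model = 512
PE = positional_encoding(100, d_model)
pos, k = 10, 7
freq = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
s, c = np.sin(k * freq), np.cos(k * freq)

# [sin((pos+k)f), cos((pos+k)f)] = rotation by k*f applied to [sin(pos*f), cos(pos*f)]
pred_sin = PE[pos, 0::2] * c + PE[pos, 1::2] * s
pred_cos = PE[pos, 1::2] * c - PE[pos, 0::2] * s
print(np.allclose(pred_sin, PE[pos + k, 0::2]))    # True
print(np.allclose(pred_cos, PE[pos + k, 1::2]))    # True
```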
8
Why self-attention?
Advantages:
Lower per-layer computational complexity
Highly parallelizable (no sequential recurrence), so faster to train
Shorter paths between any two positions, so distant dependencies are easier to learn
9
Experiments: Translation (WMT14 En-De, En-Fr)
10
Experiments: English constituency parsing
11
Thanks! Q&A