Attention Is All You Need


Attention Is All You Need
Group presentation, Wang Yue, 2017.09.18

Part 0 Ten English sentences in this paper
1. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
2. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
3. As side benefit, self-attention could yield more interpretable models.
4. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks.
5. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
6. Our model establishes a new single-model state-of-the-art BLEU score of 41.0.
7. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.
8. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
9. Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
10. The fundamental constraint of sequential computation, however, remains.

Outline
1 Overview
2 Transformer
3 Why self-attention
4 Experiment

Part 1 Overview (NIPS 2017)

Part 1 Overview
There has been a running joke in the NLP community that an LSTM with attention will yield state-of-the-art performance on any task.
Example: Neural Machine Translation with Attention
Observation: attention is built on top of an RNN.
This paper breaks that pattern!

Part 1 Overview
Contributions:
Propose the Transformer: a new, simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely
Establish a new state-of-the-art BLEU score on machine translation tasks
High efficiency: lower training cost

Part 2 Transformer
Components:
Scaled Dot-Product Attention
Multi-Head Attention
Position-wise Feed-Forward Networks
Embeddings and Softmax
Positional Encoding

Part 2 Transformer: Scaled Dot-Product Attention
Q: queries, K: keys, V: values
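The attention formula itself was on the slide image and did not survive the transcript; from the paper it is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch of that computation (the shapes below are illustrative, not taken from the slide):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

# Example: 4 queries attending over 6 key/value pairs of dimension 64
Q = np.random.randn(4, 64)
K = np.random.randn(6, 64)
V = np.random.randn(6, 64)
out = scaled_dot_product_attention(Q, K, V)                    # shape (4, 64)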

Part 2 Transformer: Multi-Head Attention
The total computational cost is similar to that of single-head attention with full dimensionality.
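Why the cost stays comparable: each of the h heads projects down to d_k = d_v = d_model / h, so h smaller attentions together do roughly the work of one attention over the full dimension. A self-contained sketch, where random matrices stand in for the learned projections W^Q, W^K, W^V, W^O:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h                               # per-head dimension, e.g. 512 / 8 = 64
    heads = []
    for _ in range(h):
        # random stand-ins for the learned per-head projections
        Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    Wo = np.random.randn(h * d_k, d_model)           # output projection W^O
    return np.concatenate(heads, axis=-1) @ Wo

X = np.random.randn(10, 512)                         # 10 positions, d_model = 512
out = multi_head_attention(X, X, X, h=8)             # self-attention: Q, K, V all come from X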

Part 2 Transformer: Position-wise Feed-Forward Networks
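In the paper, the position-wise feed-forward network is two linear transformations with a ReLU in between, FFN(x) = max(0, xW1 + b1)W2 + b2, applied identically at every position, with inner dimension d_ff = 2048. A minimal sketch (random matrices stand in for the learned parameters):

import numpy as np

def position_wise_ffn(x, d_ff=2048):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, same transform at every position
    d_model = x.shape[-1]
    W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(10, 512)      # 10 positions, d_model = 512
y = position_wise_ffn(x)          # shape (10, 512)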

Part 2 Transformer: Embeddings and Softmax
Share the embedding weights with the pre-softmax linear transformation (refer to arXiv:1608.05859).
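The sharing means the token embedding table and the pre-softmax projection use one and the same weight matrix (Press & Wolf, arXiv:1608.05859). A minimal sketch of the idea with toy sizes; E here is a random stand-in for the learned embedding table:

import numpy as np

vocab_size, d_model = 1000, 512                 # toy sizes for illustration
E = np.random.randn(vocab_size, d_model)        # one shared weight matrix

def embed(token_ids):
    # the paper scales the embedding weights by sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def output_logits(hidden):
    # the same matrix serves as the pre-softmax linear transformation
    return hidden @ E.T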

Part 2 Transformer: Positional Encoding
Reason: there is no RNN to model the sequence positions.
Two types: learned positional embeddings (arXiv:1705.03122v2) and sinusoids (formula below).
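The sinusoid formula, which the transcript dropped along with the slide image, is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short sketch that builds this encoding table:

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]                 # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(50, 512)             # added to the input embeddings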

Part 3 Why self-attention
Observation:
Self-attention has an O(1) maximum path length, so it captures long-range dependencies easily.
When n < d, self-attention also has lower per-layer complexity than a recurrent layer (see the comparison below).
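For reference, the per-layer comparison behind these two observations, as reported in the paper (n = sequence length, d = representation dimension, k = convolution kernel width):

Layer type       Complexity per layer   Sequential operations   Maximum path length
Self-Attention   O(n^2 * d)             O(1)                    O(1)
Recurrent        O(n * d^2)             O(n)                    O(n)
Convolutional    O(k * n * d^2)         O(1)                    O(log_k(n))

So when n < d, O(n^2 * d) is smaller than O(n * d^2), which is the complexity claim above.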

Part 4 Experiment
English-to-German translation: a new state of the art (an improvement of more than 2 BLEU)
English-to-French translation: a new single-model state of the art (BLEU score of 41.0)
Lower training cost

Part 4 Experiment
Some results I got:

source: Aber ich habe es nicht hingekriegt
expected: But I didn't handle it
got: But I didn't <UNK> it

source: Wir könnten zum Mars fliegen wenn wir wollen
expected: We could go to Mars if we want
got: We could fly to Mars when we want

source: Dies ist nicht meine Meinung Das sind Fakten
expected: This is not my opinion These are the facts
got: This is not my opinion These are facts

source: Wie würde eine solche Zukunft aussehen
expected: What would such a future look like
got: What would a future like this

There are two implementations:
(simple) https://github.com/Kyubyong/transformer
(official) https://github.com/tensorflow/tensor2tensor

Attention Is All You Need
Yue Wang, 1155085636, CSE, CUHK
Thank you!

[Closing slide diagram: attention with Keys = Values and separate Queries; the RNN is removed in the Transformer.]