Attention Is All You Need


Attention Is All You Need
Group presentation, Wang Yue, 2017.09.18

Part 0 Ten English sentences in this paper
1. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
2. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures.
3. As side benefit, self-attention could yield more interpretable models.
4. Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks.
5. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
6. Our model establishes a new single-model state-of-the-art BLEU score of 41.0.
7. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention.
8. At each step the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next.
9. Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.
10. The fundamental constraint of sequential computation, however, remains.

Outline
1 Overview
2 Transformer
3 Why self-attention
4 Experiment

Part 1 Overview (NIPS 2017)

Part 1 Overview
There has been a running joke in the NLP community that an LSTM with attention will yield state-of-the-art performance on any task.
Example: Neural Machine Translation with Attention
Observation: attention is built on top of an RNN.
This paper breaks that pattern!

Part 1 Overview
Contributions:
Propose the Transformer: a new, simple network architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely
Establish a new state-of-the-art BLEU score on machine translation tasks
High efficiency: lower training cost

Part 2 Transformer
Components:
Scaled Dot-Product Attention
Multi-Head Attention
Position-wise Feed-Forward Networks
Embeddings and Softmax
Positional Encoding

Part 2 Transformer: Scaled Dot-Product Attention
Q: queries, K: keys, V: values
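The attention formula itself was on the slide image and did not survive the transcript; from the paper it is Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal NumPy sketch of that computation (the shapes below are illustrative, not taken from the slide):

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # compatibility of each query with each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                         # weighted sum of the values

# Example: 4 queries attending over 6 key/value pairs of dimension 64
Q = np.random.randn(4, 64)
K = np.random.randn(6, 64)
V = np.random.randn(6, 64)
out = scaled_dot_product_attention(Q, K, V)                    # shape (4, 64)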

Part 2 Transformer: Multi-Head Attention
The total computational cost is similar to that of single-head attention with full dimensionality.
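Why the cost stays comparable: each of the h heads projects down to d_k = d_v = d_model / h, so h smaller attentions together do roughly the work of one attention over the full dimension. A self-contained sketch, where random matrices stand in for the learned projections W^Q, W^K, W^V, W^O:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, h=8):
    d_model = Q.shape[-1]
    d_k = d_model // h                               # per-head dimension, e.g. 512 / 8 = 64
    heads = []
    for _ in range(h):
        # random stand-ins for the learned per-head projections
        Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    Wo = np.random.randn(h * d_k, d_model)           # output projection W^O
    return np.concatenate(heads, axis=-1) @ Wo

X = np.random.randn(10, 512)                         # 10 positions, d_model = 512
out = multi_head_attention(X, X, X, h=8)             # self-attention: Q, K, V all come from X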

Part 2 Transformer: Position-wise Feed-Forward Networks
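In the paper, the position-wise feed-forward network is two linear transformations with a ReLU in between, FFN(x) = max(0, xW1 + b1)W2 + b2, applied identically at every position, with inner dimension d_ff = 2048. A minimal sketch (random matrices stand in for the learned parameters):

import numpy as np

def position_wise_ffn(x, d_ff=2048):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, same transform at every position
    d_model = x.shape[-1]
    W1, b1 = np.random.randn(d_model, d_ff), np.zeros(d_ff)
    W2, b2 = np.random.randn(d_ff, d_model), np.zeros(d_model)
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = np.random.randn(10, 512)      # 10 positions, d_model = 512
y = position_wise_ffn(x)          # shape (10, 512)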

Part 2 Transformer: Embeddings and Softmax
Share the embedding weights with the pre-softmax linear transformation (refer to arXiv:1608.05859).
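The sharing means the token embedding table and the pre-softmax projection use one and the same weight matrix (Press & Wolf, arXiv:1608.05859). A minimal sketch of the idea with toy sizes; E here is a random stand-in for the learned embedding table:

import numpy as np

vocab_size, d_model = 1000, 512                 # toy sizes for illustration
E = np.random.randn(vocab_size, d_model)        # one shared weight matrix

def embed(token_ids):
    # the paper scales the embedding weights by sqrt(d_model)
    return E[token_ids] * np.sqrt(d_model)

def output_logits(hidden):
    # the same matrix serves as the pre-softmax linear transformation
    return hidden @ E.T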

Part 2 Transformer: Positional Encoding
Reason: there is no RNN to model the sequence positions.
Two types: learned positional embeddings (arXiv:1705.03122v2) and sinusoids (formula below).
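The sinusoid formula, which the transcript dropped along with the slide image, is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A short sketch that builds this encoding table:

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                    # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]                 # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(50, 512)             # added to the input embeddings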

Part 3 Why self-attention
Observation:
Self-attention has an O(1) maximum path length, so it captures long-range dependencies easily.
When n < d, self-attention also has lower per-layer complexity than a recurrent layer (see the comparison below).
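For reference, the per-layer comparison behind these two observations, as reported in the paper (n = sequence length, d = representation dimension, k = convolution kernel width):

Layer type       Complexity per layer   Sequential operations   Maximum path length
Self-Attention   O(n^2 * d)             O(1)                    O(1)
Recurrent        O(n * d^2)             O(n)                    O(n)
Convolutional    O(k * n * d^2)         O(1)                    O(log_k(n))

So when n < d, O(n^2 * d) is smaller than O(n * d^2), which is the complexity claim above.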

Part 4 Experiment
English-to-German translation: a new state of the art (an improvement of more than 2 BLEU)
English-to-French translation: a new single-model state of the art (BLEU score of 41.0)
Lower training cost

Part 4 Experiment
Some results I got:

source: Aber ich habe es nicht hingekriegt
expected: But I didn't handle it
got: But I didn't <UNK> it

source: Wir könnten zum Mars fliegen wenn wir wollen
expected: We could go to Mars if we want
got: We could fly to Mars when we want

source: Dies ist nicht meine Meinung Das sind Fakten
expected: This is not my opinion These are the facts
got: This is not my opinion These are facts

source: Wie würde eine solche Zukunft aussehen
expected: What would such a future look like
got: What would a future like this

There are two implementations:
(simple) https://github.com/Kyubyong/transformer
(official) https://github.com/tensorflow/tensor2tensor

Attention Is All You Need
Yue Wang, 1155085636, CSE, CUHK
Thank you!

[Closing slide diagram: attention with Keys = Values and separate Queries; the RNN is removed in the Transformer.]