Attention Is All You Need

Presentation transcript:

Attention Is All You Need
Presenter: Haotian Xu

Agenda
- Background: neural machine translation, attention mechanism, Google vs. Facebook
- Motivation
- Model
- Experimental results
- Discussion

Neural machine translation (NMT): Humans read the entire sentence, understand its meaning, and then produce a translation. NMT mimics this process (e.g., English -> French).

Neural machine translation (NMT): The conventional encoder/decoder choice is an RNN, LSTM, or GRU.

Attention mechanism: establishes direct shortcuts between the target and the source.

Attention mechanism: alignments between source and target sentences (French -> English).
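As a rough illustration (not from the slides), the sketch below computes attention weights with a simple dot-product score between a decoder state and the encoder states; the alignment model of Bahdanau et al. uses a small feed-forward network for the scores instead, but the idea of a softmax-normalized alignment and a weighted context vector is the same.

```python
import numpy as np

def softmax(x):
    x = x - x.max()               # for numerical stability
    e = np.exp(x)
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Dot-product attention: a direct shortcut from the target to the source.

    decoder_state:  (d,)          current target-side hidden state
    encoder_states: (src_len, d)  source-side hidden states
    Returns the alignment weights over source tokens and the context vector.
    """
    scores = encoder_states @ decoder_state   # (src_len,) similarity scores
    weights = softmax(scores)                 # soft alignment, sums to 1
    context = weights @ encoder_states        # (d,) weighted sum of source states
    return weights, context

# Toy example: 5 source tokens, hidden size 8
rng = np.random.default_rng(0)
weights, context = attend(rng.normal(size=8), rng.normal(size=(5, 8)))
print(weights.round(3), context.shape)
```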

Google vs. Facebook
- Google seq2seq: Sequence to Sequence Learning with Neural Networks, 2014 (LSTM)
- Facebook ConvS2S: Convolutional Sequence to Sequence Learning, 05/2017 (CNN)
- Google Transformer: Attention Is All You Need, 06/2017 (based solely on attention mechanisms)

Motivation: drawbacks of RNNs
- Preclude parallelization within a sequence
- Long-range dependencies* are expensive to learn: O(N) for RNNs, O(N/k) for CNNs with kernel size k, and O(1) for the Transformer (see the comparison below)
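One way to read these figures, following Table 1 of the paper, is as the maximum path length a signal must travel between two positions that are N tokens apart:

```latex
% Maximum path length between two positions N tokens apart
\begin{align*}
\text{RNN:}                  &\quad O(N)   && \text{the signal passes through every intermediate hidden state} \\
\text{CNN, kernel size $k$:} &\quad O(N/k) && \text{a stack of convolution layers is needed to connect distant positions} \\
\text{Self-attention:}       &\quad O(1)   && \text{every position attends to every other position directly}
\end{align*}
```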

Model
- Encoder: N=6 blocks, each with multi-head self-attention and a position-wise feed-forward network
- Decoder: N=6 blocks, same as the encoder, plus masked multi-head self-attention that prevents positions from attending to subsequent positions
- Mask: ensures that the prediction for position i can depend only on the known outputs at positions less than i (a sketch of such a mask follows below)
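A minimal NumPy sketch of such a causal mask (an illustration, not the authors' code): positions a query must not see receive a score of negative infinity before the softmax, so they get zero attention weight.

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: True marks positions a query may NOT attend to.

    Row i corresponds to target position i; masking columns j > i ensures the
    prediction for position i only depends on positions less than i.
    """
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def apply_mask(scores, mask):
    """Set masked attention scores to -inf so the softmax assigns them zero weight."""
    return np.where(mask, -np.inf, scores)

print(causal_mask(4).astype(int))
# [[0 1 1 1]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```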

Positional encoding
- Injects relative position information for the tokens, so the model can make use of the order of the sequence despite having no recurrence
- pos is the position and i is the dimension (see the sketch below for the formula)
- The positional encodings have the same dimension d_model as the embeddings
- PE(pos+k) can be represented as a linear function of PE(pos)
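The formula referenced on the slide is the sinusoidal encoding from the paper: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A small NumPy sketch, assuming an even d_model:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2) -- even dimensions
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512) -- same dimension d_model as the token embeddings
```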

Attention
- Scaled dot-product attention, with an optional mask
- Multi-head attention: linearly project Q, K, V h times with different, learned projections; h=8 parallel heads are employed
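A compact NumPy sketch of both pieces (an illustration with toy shapes, not the reference implementation): scaled dot-product attention with an optional mask, and multi-head attention that projects Q, K, V h times, attends in parallel, and concatenates the heads.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with an optional mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (len_q, len_k)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)      # masked positions get zero weight
    return softmax(scores) @ V

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo, h=8):
    """Project Q, K, V h times with different learned projections, attend in
    parallel, concatenate the heads, and apply a final output projection."""
    heads = []
    for i in range(h):
        q, k, v = Q @ Wq[i], K @ Wk[i], V @ Wv[i]     # (len, d_k) per head
        heads.append(scaled_dot_product_attention(q, k, v))
    return np.concatenate(heads, axis=-1) @ Wo        # (len_q, d_model)

# Toy shapes: sequence length 5, d_model 64, h=8 heads of size 8
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 64))
Wq = Wk = Wv = rng.normal(size=(8, 64, 8)) * 0.1
Wo = rng.normal(size=(64, 64)) * 0.1
print(multi_head_attention(x, x, x, Wq, Wk, Wv, Wo).shape)  # (5, 64)
```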

Position-wise feed-forward layer
- Applied to each position separately and identically
- The linear transformations are the same across positions but use different parameters from layer to layer
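In formula form this is FFN(x) = max(0, xW1 + b1)W2 + b2, with d_model = 512 and inner dimension d_ff = 2048 in the paper; a minimal NumPy sketch:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position.

    x: (seq_len, d_model). The same W1, b1, W2, b2 are shared across positions
    within a layer; different layers have their own parameters.
    """
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

# Paper-sized dimensions: d_model = 512, inner layer d_ff = 2048
d_model, d_ff, seq_len = 512, 2048, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model)
print(position_wise_ffn(x, W1, b1, W2, b2).shape)  # (5, 512)
```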

Experimental results (training cost measured in floating point operations)

Discussion: RNN vs. ConvS2S vs. Transformer
- Long-term dependencies: Transformer > ConvS2S; the Transformer behaves like a convolution with very wide filters
- Representation ability: RNN with attention suffers from information duplication