Attention Is All You Need
Presenter: Haotian Xu
Agenda
- Background: neural machine translation, attention mechanism, Google vs. Facebook
- Motivation
- Model
- Experimental results
- Discussion
Neural machine translation (NMT)
- A human translator reads the entire sentence, understands its meaning, and then produces a translation.
- NMT mimics that process, e.g. English -> French.
Neural machine translation (NMT)
- Conventional encoder/decoder choice: RNN / LSTM / GRU.
Attention mechanism
- Establishes direct shortcuts between the target and the source.
Attention mechanism
- Alignments between source and target sentences, e.g. French -> English.
Google vs. Facebook
- Google seq2seq: Sequence to Sequence Learning with Neural Networks, 2014 (LSTM).
- Facebook ConvS2S: Convolutional Sequence to Sequence Learning, 05/2017 (CNN).
- Google Transformer: Attention Is All You Need, 06/2017 (based solely on attention mechanisms).
Motivation
- Drawbacks of RNNs: they preclude parallelization and struggle with long-range dependencies.
- Maximum path length between distant positions: O(N) for RNN, O(N/k) for CNN with kernel size k, O(1) for the Transformer.
Model
- Encoder: N=6 blocks, each with multi-head attention and a position-wise feed-forward network.
- Decoder: N=6 blocks as in the encoder, plus masked multi-head attention to prevent positions from attending to subsequent positions.
- Mask: ensures that the predictions for position i can depend only on the known outputs at positions less than i.
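A minimal sketch of the decoder's causal mask in PyTorch (my own illustration, not the presenter's code): future positions are hidden by setting their attention logits to -inf before the softmax.

import torch

def causal_mask(size):
    # Upper-triangular True entries mark "future" positions that must be hidden.
    return torch.triu(torch.ones(size, size), diagonal=1).bool()

# Example: apply the mask to a (size x size) matrix of attention logits.
logits = torch.randn(5, 5)
masked = logits.masked_fill(causal_mask(5), float('-inf'))
weights = torch.softmax(masked, dim=-1)  # row i puts zero weight on positions > i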
Positional encoding
- Injects position information about the tokens in the sequence; with no recurrence, the model otherwise cannot make use of the order of the sequence.
- PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position and i is the dimension.
- The positional encodings have the same dimension d_model as the embeddings.
- For any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
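A short PyTorch sketch of the sinusoidal encoding above (illustrative only; assumes an even d_model, and the sequence length 50 is arbitrary):

import torch

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i/d_model))
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)      # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32).unsqueeze(0)  # (1, d_model/2)
    angles = pos / torch.pow(torch.tensor(10000.0), i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = positional_encoding(max_len=50, d_model=512)  # same dimension d_model as the embeddings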
Attention
- Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
- Multi-head attention: linearly project Q, K, V h times with different, learned projections; h = 8 parallel heads are employed.
- Optional mask (used in the decoder).
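A compact sketch of scaled dot-product attention (my illustration, assuming tensors shaped (..., seq, d_k) and an optional boolean mask):

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)      # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask, float('-inf'))   # e.g. the decoder's causal mask
    weights = torch.softmax(scores, dim=-1)
    return weights @ v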
Position-wise feed-forward layer
- Applied to each position separately and identically: FFN(x) = max(0, xW1 + b1)W2 + b2.
- The linear transformations are the same across different positions, but they use different parameters from layer to layer.
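A sketch of the position-wise feed-forward layer (the dimensions d_model=512 and d_ff=2048 follow the paper; the module itself is my illustration):

import torch
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position.
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):  # x: (batch, seq, d_model)
        return self.w2(torch.relu(self.w1(x)))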
Experimental results
- Training cost measured in floating point operations (FLOPs).
Discussion
- RNN vs. ConvS2S vs. Transformer.
- Long-term dependencies: Transformer > ConvS2S; the Transformer behaves like a convolution with very wide filters.
- Representation ability: RNN + attention leads to information duplication.