Convolutional Sequence to Sequence Learning

Presentation transcript:

Convolutional Sequence to Sequence Learning Facebook AI Research

Background The prevalent approach to seq2seq learning maps an input sequence to a variable-length output sequence via RNNs. This paper proposes a CNN-based seq2seq model: computations over all elements can be fully parallelized during training; optimization is easier since the number of non-linearities is fixed and independent of the input length; gated linear units ease gradient propagation; and each decoder layer is equipped with a separate attention module (multi-hop attention).

RNN Seq2seq The encoder RNN processes an input sequence x = (x_1, …, x_m) of m elements and returns state representations z = (z_1, …, z_m). The decoder RNN takes z and generates the output sequence y = (y_1, …, y_n) left to right, one element at a time. To generate output y_{i+1}, the decoder computes a new hidden state h_{i+1} from the previous state h_i, an embedding g_i of the previous target-language word y_i, and a conditional input c_i derived from the encoder output z.
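As a rough illustration (not the paper's exact parameterization), here is a minimal PyTorch sketch of one such decoder step, assuming a GRU cell and illustrative sizes:

```python
import torch
import torch.nn as nn

# Minimal sketch of one RNN decoder step; a GRU cell stands in for the
# generic "decoder RNN", and all names/sizes are illustrative.
emb_dim, hid_dim, vocab = 256, 512, 10000
embed = nn.Embedding(vocab, emb_dim)            # target-word embeddings g_i
cell = nn.GRUCell(emb_dim + hid_dim, hid_dim)   # consumes [g_i; c_i]

def decoder_step(h_i, y_i, c_i):
    """h_i: previous state (B, hid); y_i: previous word ids (B,);
    c_i: conditional input from the encoder (B, hid)."""
    g_i = embed(y_i)                            # embedding of previous target word
    h_next = cell(torch.cat([g_i, c_i], dim=-1), h_i)
    return h_next                               # new hidden state h_{i+1}
```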

RNN Seq2seq Models without attention consider only the final encoder state z_m, either by setting c_i = z_m for all i or by initializing the first decoder state with z_m, in which case c_i is not used. Architectures with attention instead compute c_i as a weighted sum of (z_1, …, z_m) at each time step. The weights of the sum are referred to as attention scores and allow the network to focus on different parts of the input sequence as it generates the output sequence. Attention scores are computed by essentially comparing each encoder state z_j to a combination of the previous decoder state h_i and the last prediction y_i; the result is normalized to be a distribution over input elements.
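A minimal sketch of this attention step, assuming a simple dot-product parameterization with an illustrative projection W (these are assumptions, not the exact formulation used in RNN seq2seq papers):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of RNN-style attention; shapes and the projection W are
# illustrative assumptions.
def rnn_attention(z, h_i, g_i, W):
    """z: encoder states (B, m, d); h_i: previous decoder state (B, d);
    g_i: embedding of the last prediction y_i (B, d); W: (2d, d)."""
    query = torch.cat([h_i, g_i], dim=-1) @ W                 # combine h_i and y_i
    scores = torch.bmm(z, query.unsqueeze(-1)).squeeze(-1)    # compare to each z_j
    alpha = F.softmax(scores, dim=-1)                         # distribution over inputs
    c_i = torch.bmm(alpha.unsqueeze(1), z).squeeze(1)         # weighted sum of z
    return c_i, alpha
```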

Comparison with RNN Seq2seq (1) Convolutional networks do not depend on the computations of the previous time step and therefore allow parallelization over every element in a sequence. This contrasts with RNNs, which maintain a hidden state of the entire past, preventing parallel computation within a sequence.

Comparison with RNN Seq2seq (2) Multi-layer convolutional neural networks create hierarchical representations over the input sequence in which nearby input elements interact at lower layers while distant elements interact at higher layers. This hierarchical structure provides a shorter path for capturing long-range dependencies than the chain structure modeled by RNNs: a feature representation covering a window of n words can be obtained by applying only O(n/k) convolutional operations with a kernel of width k, compared to a linear number O(n) of operations for an RNN.
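A quick back-of-the-envelope check of this claim, assuming stride-1 convolutions whose receptive field grows by k - 1 per layer:

```python
import math

# Rough check of the O(n/k) claim: covering an n-word window takes about
# ceil((n - 1) / (k - 1)) stacked convolutions, versus n - 1 recurrent steps.
def conv_layers_needed(n, k):
    return math.ceil((n - 1) / (k - 1))

print(conv_layers_needed(25, 5))   # 6 convolutional operations
print(25 - 1)                      # 24 sequential RNN steps for the same span
```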

Comparison with RNN Seq2seq (3) Inputs to a convolutional network are fed through a constant number of kernels and non-linearities, whereas recurrent networks apply up to n operations and non-linearities to the first word and only a single set of operations to the last word. Also, fixing the number of non-linearities applied to the inputs eases learning.

CNN Seq2seq architecture: position embeddings, convolutional block structure, multi-step attention, normalization strategy, and initialization.

Position Embedding The input element representations fed into the encoder are combinations of word embeddings and position embeddings of the input elements. The output elements already generated by the decoder network are processed similarly to yield output element representations that are fed back into the decoder. Position embeddings are useful in this architecture because they give the model a sense of which portion of the input or output sequence it is currently dealing with.
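A minimal sketch of this input representation, assuming learned absolute-position embeddings and illustrative sizes:

```python
import torch
import torch.nn as nn

# Minimal sketch of the input representation: each element is the sum of a
# word embedding and an absolute-position embedding (sizes are illustrative).
vocab_size, max_len, dim = 10000, 1024, 512
word_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)

def embed_sequence(tokens):                       # tokens: LongTensor (B, m)
    positions = torch.arange(tokens.size(1), device=tokens.device)
    return word_emb(tokens) + pos_emb(positions)  # (B, m, dim)
```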

Convolutional Block Structure Both the encoder and decoder networks share a simple block structure that computes intermediate states based on a fixed number of input elements. The output of the l-th block (layer) is denoted z^l = (z^l_1, …, z^l_m) for the encoder network and h^l = (h^l_1, …, h^l_n) for the decoder network.

Convolutional Block Structure For a decoder network with a single block and kernel width k, each resulting state h^1_i contains information over k input elements. Stacking several blocks on top of each other increases the number of input elements represented in a state. For instance, stacking 6 blocks with k = 5 results in an input field of 25 elements, i.e. each output depends on 25 inputs.
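As a quick check, with stride-1 convolutions the input field of L stacked blocks of kernel width k is 1 + L·(k - 1); for L = 6 and k = 5 this gives 1 + 6·4 = 25.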

Convolutional Block Structure Non-linearities allow the networks to exploit the full input field, or to focus on fewer elements if needed. Gated linear units are chosen as the non-linearity; they implement a simple gating mechanism over the convolution output Y = [A B] ∈ R^{2d}: v([A B]) = A ⊗ σ(B), where A, B ∈ R^d, σ is the sigmoid function, and ⊗ is point-wise multiplication.
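A minimal sketch of the gating, with illustrative shapes (equivalent to torch.nn.functional.glu):

```python
import torch

# Minimal sketch of a gated linear unit: the convolution emits 2d channels,
# split into A and B, and the output is A * sigmoid(B).
def glu(Y, dim=-1):                  # Y has size 2d along `dim`
    A, B = Y.chunk(2, dim=dim)
    return A * torch.sigmoid(B)
```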

Convolutional Block Structure To enable deep convolutional networks, residual connections are added from the input of each convolution to the output of the block. The input of each layer is padded so that the output of the convolutional layers matches the input length. Linear mappings are used to project between layers of different sizes (refer to the paper for more details).
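A minimal sketch of one such block in PyTorch, assuming an encoder-side block with same-length padding and illustrative sizes (this is a sketch, not the paper's full implementation; the decoder instead pads only on the left so a state never sees future outputs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of one encoder-side convolutional block with a GLU
# non-linearity and a residual connection (sizes are illustrative).
class ConvBlock(nn.Module):
    def __init__(self, d=512, k=3):
        super().__init__()
        self.conv = nn.Conv1d(d, 2 * d, kernel_size=k, padding=k // 2)

    def forward(self, x):                    # x: (B, d, m)
        y = F.glu(self.conv(x), dim=1)       # gated linear unit over channels
        return y + x                         # residual connection to block input
```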

Convolutional Block Structure Finally, a distribution over the T possible next target elements y_{i+1} is computed by transforming the top decoder output h^L_i via a linear layer with weights W_o and bias b_o: p(y_{i+1} | y_1, …, y_i, x) = softmax(W_o h^L_i + b_o) ∈ R^T.
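A minimal sketch of this output layer, with illustrative sizes:

```python
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the output layer: a linear map (weights W_o, bias b_o)
# followed by a softmax over the T candidate target words (sizes illustrative).
d, T = 512, 10000
proj = nn.Linear(d, T)

def next_word_distribution(h_L_i):           # h_L_i: top decoder output (B, d)
    return F.softmax(proj(h_L_i), dim=-1)    # p(y_{i+1} | y_1..y_i, x), (B, T)
```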

Multi-step attention A separate attention mechanism is introduced for each decoder layer. To compute the attention, the current decoder state h^l_i is combined with an embedding of the previous target element g_i: d^l_i = W^l_d h^l_i + b^l_d + g_i. For decoder layer l, the attention a^l_{ij} of state i and source element j is computed as a dot product between the decoder state summary d^l_i and each output z^u_j of the last encoder block u: a^l_{ij} = exp(d^l_i · z^u_j) / Σ_{t=1..m} exp(d^l_i · z^u_t).

Multi-step attention The conditional input c^l_i to the current decoder layer is a weighted sum of the encoder outputs as well as the input element embeddings e_j: c^l_i = Σ_{j=1..m} a^l_{ij} (z^u_j + e_j).
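A minimal sketch of one decoder layer's attention, with illustrative shapes and the layer's projection parameters passed in as W_d and b_d:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of multi-step (per-layer) attention; shapes are illustrative.
def multistep_attention(h_l, g, z_u, e, W_d, b_d):
    """h_l: decoder states (B, n, d); g: previous-target embeddings (B, n, d);
    z_u: last encoder block outputs (B, m, d); e: input embeddings (B, m, d);
    W_d: (d, d); b_d: (d,)."""
    d_l = h_l @ W_d + b_d + g                     # decoder state summary d^l_i
    scores = torch.bmm(d_l, z_u.transpose(1, 2))  # dot products with each z^u_j
    a = F.softmax(scores, dim=-1)                 # attention weights a^l_ij
    c_l = torch.bmm(a, z_u + e)                   # conditional input c^l_i
    return c_l, a
```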

Two more tricks: normalization strategy and initialization.