Attention for translation

Presentation transcript:

Attention for translation Learn to encode multiple pieces of information and use them selectively for the output. Encode the input sentence into a sequence of vectors. Choose a subset of these adaptively while decoding (translating) – choose the vectors most relevant to the current output. I.e., learn to jointly align and translate. Question: How can we learn and use a vector to decide where to focus attention? How can we make that differentiable so it works with gradient descent? Bahdanau et al., 2015: https://arxiv.org/pdf/1409.0473.pdf

Soft attention Use a probability distribution over all inputs. Classification assigns a probability to every possible output; attention uses a probability to weight every possible input – learn to weight the more relevant parts more heavily. https://distill.pub/2016/augmented-rnns/
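
A minimal numpy sketch of that weighting idea (the scores and sizes below are made up for illustration): a softmax turns arbitrary relevance scores into a probability distribution, which then weights the inputs in a differentiable way.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
inputs = rng.standard_normal((3, 4))   # three input vectors of dimension 4
scores = np.array([2.0, 0.5, -1.0])    # made-up relevance scores for one output step

weights = softmax(scores)              # probability distribution over the inputs
attended = weights @ inputs            # weighted average of the inputs, shape (4,)
```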

Attention for translation Input, x, and output, y, are sequences. The encoder has a hidden state h_i associated with each x_i. These states come from a bidirectional RNN, so each h_i carries information from both sides of the input. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
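
A rough sketch of how such bidirectional states could be produced (a toy tanh RNN stands in for the GRU/LSTM cells used in practice, and the backward direction reuses the forward weights for brevity; a real model learns separate parameters):

```python
import numpy as np

def rnn_states(xs, W_x, W_h):
    # Simple tanh RNN; returns the hidden state at every position.
    h = np.zeros(W_h.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

rng = np.random.default_rng(0)
xs = [rng.standard_normal(8) for _ in range(5)]   # 5 input word embeddings
W_x, W_h = rng.standard_normal((16, 8)), rng.standard_normal((16, 16))

forward = rnn_states(xs, W_x, W_h)                # left-to-right states
backward = rnn_states(xs[::-1], W_x, W_h)[::-1]   # right-to-left, re-aligned to x_i
h = [np.concatenate([f, b]) for f, b in zip(forward, backward)]  # each h_i has dim 32
```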

Attention for translation Decoder: each output y_i is predicted using the previous output y_{i-1}, the decoder hidden state s_i, and a context vector c_i. The decoder hidden state in turn depends on the previous state s_{i-1}. Attention is embedded in the context vector via learned weights. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Context vector The encoder hidden states are combined in a weighted average to form a context vector c_t for the t-th output. This can capture features from each part of the input. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Attention for translation The weights v_a and W_a are the parameters of a small feed-forward alignment network, trained jointly with the rest of the network. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
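
A minimal numpy sketch putting the last three slides together for a single decoding step t (the weight matrices here are random stand-ins for learned parameters, and the shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
enc_states = rng.standard_normal((5, 32))   # encoder states h_1 .. h_5
s_prev = rng.standard_normal(32)            # decoder state s_{t-1}

# Parameters of the small feed-forward alignment network.
W_a = rng.standard_normal((32, 64))
v_a = rng.standard_normal(32)

# Alignment score for each encoder state: v_a . tanh(W_a [s_{t-1}; h_i])
scores = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, h]))
                   for h in enc_states])

alpha = softmax(scores)      # attention weights over the input positions
c_t = alpha @ enc_states     # context vector c_t: weighted average of the h_i
# c_t, s_{t-1}, and the previous output y_{t-1} then drive the prediction of y_t.
```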

Attention for translation Key ideas: Implement attention as a probability distribution over inputs/features. Extend encoder/decoder pair to include context information relevant to the current decoding task. https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

Attention with images Can combine a CNN with an RNN using attention. CNN extracts high-level features. RNN generates a description, using attention to focus on relevant parts of the image. https://distill.pub/2016/augmented-rnns/

Self-attention Previously: focus attention on the input while working on the output. Self-attention: focus on other parts of the input while processing the input. http://jalammar.github.io/illustrated-transformer/

Self-attention Use each input vector to produce a query, key, and value for that input: each is obtained by multiplying the input's embedding by one of the matrices W_Q, W_K, W_V. http://jalammar.github.io/illustrated-transformer/
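
A minimal sketch of these three projections (the dimensions are illustrative, e.g. 512-dimensional embeddings and 64-dimensional queries/keys/values; the random matrices stand in for the learned W_Q, W_K, W_V):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 512))   # 4 input embeddings of dimension 512

d_k = 64                            # query/key/value dimension
W_Q = rng.standard_normal((512, d_k))
W_K = rng.standard_normal((512, d_k))
W_V = rng.standard_normal((512, d_k))

Q = X @ W_Q                         # one query per input, shape (4, 64)
K = X @ W_K                         # one key per input
V = X @ W_V                         # one value per input
```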

Self-attention Similarity is determined by the dot product of the query of one input with the keys of all the inputs. E.g., for input 1, compute the dot products q_1·k_1, q_1·k_2, … Apply a scaling and a softmax to get a distribution over the input vectors. This gives a distribution p_11, p_12, … that is the attention of input 1 over all inputs. Use this attention vector to take a weighted sum over the value vectors of the inputs: z_1 = p_11 v_1 + p_12 v_2 + … This is the self-attention output for input 1. http://jalammar.github.io/illustrated-transformer/
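
Under the same assumptions, a sketch of the whole computation for all inputs at once (the scaling divides by the square root of the key dimension, as in the Transformer; the random Q, K, V stand in for the projections from the previous sketch):

```python
import numpy as np

def softmax_rows(x):
    # Softmax applied independently to each row.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # q_i . k_j for every pair, then scale
    P = softmax_rows(scores)          # row i is the distribution p_i1, p_i2, ...
    return P @ V                      # z_i = p_i1*v_1 + p_i2*v_2 + ...

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))
Z = self_attention(Q, K, V)           # one output vector per input, shape (4, 64)
```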

Self-attention Multi-headed attention: run several copies of self-attention in parallel and concatenate their outputs for the next layer.
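
A sketch of the multi-headed version under the same assumptions: each head has its own projection matrices, and the head outputs are concatenated (a full Transformer also applies a final output projection, omitted here):

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(Q, K, V):
    return softmax_rows(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X, heads):
    # heads: list of (W_Q, W_K, W_V) triples, one parameter set per head
    outs = [self_attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outs, axis=-1)       # concatenated for the next layer

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 512))              # 4 token embeddings
heads = [tuple(rng.standard_normal((512, 64)) for _ in range(3)) for _ in range(8)]
Z = multi_head(X, heads)                       # shape (4, 8 * 64)
```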

Related work Neural Turing Machines – combine an RNN with an external memory. https://distill.pub/2016/augmented-rnns/

Neural Turing Machines Use attention to do weighted reads and writes at every memory location. Can combine content-based attention with location-based attention to take advantage of both. https://distill.pub/2016/augmented-rnns/
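
A toy sketch of the content-based part of such a soft read (not the full NTM addressing scheme; the names and sizes are illustrative): attention weights over memory locations make the read differentiable.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
memory = rng.standard_normal((10, 20))   # 10 memory slots of width 20
key = rng.standard_normal(20)            # content the controller is looking for

w = softmax(memory @ key)                # content-based weights over all slots
read = w @ memory                        # soft read: weighted sum over every slot
# A soft write similarly blends new content into every slot in proportion to w;
# location-based shifts of w are omitted in this toy version.
```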

Related work Adaptive computation time for RNNs: include a probability distribution over the number of computation steps for a single input; the final output is a weighted sum of the step outputs for that input. https://distill.pub/2016/augmented-rnns/
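
A toy illustration of that weighted-sum idea (not the actual halting mechanism from the adaptive computation time paper; all values here are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
step_outputs = rng.standard_normal((4, 8))   # outputs after 1, 2, 3, 4 steps
halt_scores = rng.standard_normal(4)         # produced by the network, one per step

p = softmax(halt_scores)                     # distribution over the number of steps
final = p @ step_outputs                     # weighted sum of the step outputs
```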

Related work Neural programmer: determine a sequence of operations to solve some problem; use a probability distribution to combine multiple possible sequences. https://distill.pub/2016/augmented-rnns/

Attention: summary Attention uses a probability distribution to allow the learning and use of relevant inputs for RNN output. This can be used in multiple ways to augment RNNs: better use of the input to the encoder, external memory, program control (adaptive computation), and neural programming.