Attention for translation
Learn to encode multiple pieces of information and use them selectively for the output. Encode the input sentence into a sequence of vectors, then choose a subset of these adaptively while decoding (translating): choose the vectors most relevant to the current output. I.e., learn to jointly align and translate.
Question: How can we learn and use a vector to decide where to focus attention? How can we make that differentiable so it works with gradient descent?
Bahdanau et al., 2015: https://arxiv.org/pdf/1409.0473.pdf
Soft attention
Use a probability distribution over all inputs. Classification assigns a probability to every possible output; attention uses a probability distribution to weight all possible inputs, learning to weight the more relevant parts more heavily.
https://distill.pub/2016/augmented-rnns/
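A minimal NumPy sketch of the idea (names and shapes are illustrative, not taken from any particular paper): a softmax turns arbitrary relevance scores into a probability distribution over the inputs, and the weighted sum keeps the whole operation differentiable for gradient descent.

    import numpy as np

    def soft_attention(scores, values):
        # scores: one relevance score per input, shape (n,)
        # values: the input vectors themselves, shape (n, d)
        weights = np.exp(scores - scores.max())
        weights = weights / weights.sum()   # softmax: a probability distribution over inputs
        return weights @ values             # weighted combination of all inputs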
Attention for translation
The input x and output y are sequences. The encoder has a hidden state h_i associated with each input x_i. These states come from a bidirectional RNN, so the encoding of each position carries information from both directions.
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
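A sketch of such an encoder in PyTorch (the sizes are arbitrary; Bahdanau et al. use a bidirectional GRU, and any bidirectional RNN yields one annotation per input position):

    import torch
    import torch.nn as nn

    embed_dim, hidden_dim, seq_len = 32, 64, 10   # illustrative sizes
    encoder = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)

    x = torch.randn(1, seq_len, embed_dim)        # one embedded input sentence
    H, _ = encoder(x)                             # H: (1, seq_len, 2 * hidden_dim)
    # H[0, i] concatenates the forward and backward states for word x_i,
    # so each h_i summarizes the sentence from both directions.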
Attention for translation
Decoder: each output y_i is predicted using
- the previous output y_{i-1},
- the decoder hidden state s_i,
- the context vector c_i.
The decoder hidden state s_i is in turn updated from the previous state s_{i-1} (together with y_{i-1} and c_i). Attention is embedded in the context vector via learned weights.
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Context vector
The encoder hidden states are combined in a weighted average to form a context vector c_t for the t-th output. This can capture features from each part of the input.
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
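Concretely, in the notation of Bahdanau et al. and the linked post:

    c_t = sum_i alpha_{t,i} h_i
    alpha_{t,i} = exp(e_{t,i}) / sum_j exp(e_{t,j})    (a softmax over the alignment scores)
    e_{t,i} = score(s_{t-1}, h_i)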
Attention for translation
The alignment scores are produced by a small feed-forward network whose weights v_a and W_a are learned jointly with the rest of the network.
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
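A NumPy sketch of this additive ("Bahdanau-style") scoring network; the shapes and the exact way s_{t-1} and h_i are combined under one matrix W_a are illustrative assumptions:

    import numpy as np

    def additive_attention(s_prev, H, W_a, v_a):
        # s_prev: previous decoder state, shape (d_s,)
        # H:      encoder hidden states,  shape (n, d_h)
        # W_a:    shape (d_att, d_s + d_h);  v_a: shape (d_att,)
        n = H.shape[0]
        concat = np.concatenate([np.tile(s_prev, (n, 1)), H], axis=1)  # (n, d_s + d_h)
        e = np.tanh(concat @ W_a.T) @ v_a                              # alignment scores (n,)
        alpha = np.exp(e - e.max()); alpha /= alpha.sum()              # attention weights
        return alpha @ H, alpha                                        # context vector c_t, weights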
Attention for translation
Key ideas:
- Implement attention as a probability distribution over inputs/features.
- Extend the encoder/decoder pair to include context information relevant to the current decoding task.
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Attention with images
A CNN can be combined with an RNN using attention: the CNN extracts high-level features, and the RNN generates a description, using attention to focus on relevant parts of the image.
https://distill.pub/2016/augmented-rnns/
Self-attention
Previously: focus attention on the input while working on the output. Self-attention: focus on other parts of the input while processing the input itself.
http://jalammar.github.io/illustrated-transformer/
Self-attention
Use each input vector to produce a query, a key, and a value for that input: each is obtained by multiplying the input's embedding by one of the learned matrices W_Q, W_K, W_V.
http://jalammar.github.io/illustrated-transformer/
Self-attention
Similarity is determined by the dot product of the query of one input with the keys of all the inputs. E.g., for input 1, get the vector of dot products q_1·k_1, q_1·k_2, ...
Scale these and apply a softmax to get a distribution over the input vectors. This gives a distribution p_11, p_12, ... that is the attention of input 1 over all inputs.
Use the attention vector to form a weighted sum over the value vectors of the inputs: z_1 = p_11 v_1 + p_12 v_2 + ... This is the output of the self-attention for input 1.
http://jalammar.github.io/illustrated-transformer/
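A NumPy sketch of this computation for all inputs at once (matrix names follow the post; shapes are illustrative):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(X, W_Q, W_K, W_V):
        # X: input embeddings, shape (n, d_model); W_*: learned projection matrices
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(K.shape[-1])   # q_i . k_j for every pair, scaled
        P = softmax(scores)                       # row i = attention of input i over all inputs
        return P @ V                              # z_i = sum_j p_ij v_j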
Self-attention
Multi-headed attention: run several copies of self-attention in parallel and concatenate their outputs for the next layer.
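A sketch of the multi-headed version under the same assumptions (a real Transformer additionally applies a learned output projection to the concatenation, which is omitted here):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def multi_head_attention(X, heads):
        # heads: a list of (W_Q, W_K, W_V) triples, one per head
        outputs = []
        for W_Q, W_K, W_V in heads:
            Q, K, V = X @ W_Q, X @ W_K, X @ W_V
            outputs.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
        return np.concatenate(outputs, axis=-1)   # concatenated head outputs go to the next layer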
Related work
Neural Turing Machines: combine an RNN with an external memory.
https://distill.pub/2016/augmented-rnns/
Neural Turing Machines
Use attention to perform weighted reads and writes at every memory location. Content-based attention can be combined with location-based attention to take advantage of both.
https://distill.pub/2016/augmented-rnns/
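A minimal sketch of attention-weighted memory access in the NTM's read/erase/add style; variable names are illustrative:

    import numpy as np

    def ntm_read(M, w):
        # M: memory matrix (N locations x d);  w: attention weights over locations, sums to 1
        return w @ M                              # "blurry" read: weighted sum over every location

    def ntm_write(M, w, erase, add):
        # erase, add: (d,) vectors produced by the controller
        M = M * (1 - np.outer(w, erase))          # erase more where attention is high
        return M + np.outer(w, add)               # then add new content at the same locations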
Related work
Adaptive computation time for RNNs: include a probability distribution over the number of computation steps for a single input; the final output is a weighted sum over those steps.
https://distill.pub/2016/augmented-rnns/
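A sketch of just the output-combination step (the full halting mechanism involves more bookkeeping than shown here):

    import numpy as np

    def act_combine(step_outputs, halting_probs):
        # step_outputs:  (T, d) outputs from T pondering steps on one input
        # halting_probs: (T,) probability of stopping at each step (sums to 1)
        return halting_probs @ step_outputs       # final output = probability-weighted mix of steps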
Related work
Neural Programmer: determine a sequence of operations to solve some problem. Use a probability distribution to combine multiple possible sequences.
https://distill.pub/2016/augmented-rnns/
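One way to picture the soft combination, as a simplified sketch rather than the full Neural Programmer architecture: instead of committing to a single operation at each step, take a probability-weighted mix of all candidate operations' results.

    import numpy as np

    def soft_select(x, ops, probs):
        # ops: candidate operations (callables); probs: a distribution over them
        return sum(p * op(x) for p, op in zip(probs, ops))   # differentiable "choice" of operation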
Attention: summary
Attention uses a probability distribution to allow the learning and use of the most relevant inputs when producing RNN output. This can be used in multiple ways to augment RNNs:
- better use of the input to the encoder
- external memory
- program control (adaptive computation)
- neural programming