Please, Pay Attention, Neural Attention

Presentation transcript:

Please, Pay Attention, Neural Attention Alexander G. Ororbia II, C. Lee Giles IST597: Foundations of Deep Learning The Pennsylvania State University Thanks to John Canny

Origins/History
Origins in vision:
1) Larochelle and Hinton, 2010, “Learning to combine foveal glimpses with a third-order Boltzmann machine”
2) Misha Denil et al, 2011, “Learning where to Attend with Deep Architectures for Image Tracking”
Rise in neural machine translation:
- Devlin et al, ACL 2014
- Cho et al, EMNLP 2014
- Bahdanau, Cho & Bengio, arXiv Sept. 2014
- Jean, Cho, Memisevic & Bengio, arXiv Dec. 2014
- Sutskever et al, NIPS 2014

Other Applications: Rise in Speech and Text Problems
- Ba et al 2014, Visual attention for recognition
- Mnih et al 2014, Visual attention for recognition
- Chorowski et al 2014, Speech recognition
- Graves et al 2014, Neural Turing Machines
- Yao et al 2015, Video description generation
- Vinyals et al 2015, Conversational agents
- Xu et al 2015, Image caption generation
- Xu et al 2015, Visual question answering

What problem does Attention solve? A view from Neural Machine Translation (NMT). Traditional machine translation typically relies on sophisticated feature engineering based on statistical properties of text; such systems are difficult to build and require extensive expertise. In NMT we instead 1) map the meaning of a sentence to a fixed-length vector representation and 2) generate a translation based on that vector. NMT does NOT rely on n-gram counts, focuses on capturing the higher-level meaning of a text, generalizes to new sentences better than many other approaches, and is easier to build and train. Most NMT systems work by encoding the source sentence (e.g. a German sentence) into a vector using an RNN and then decoding an English sentence from that vector, also using an RNN.

Thinking about NMT. In the encoder–decoder picture, the words “Echt”, “Dicke” and “Kiste” are fed into the encoder, and after a special signal the decoder starts producing a translated sentence. The decoder keeps generating words until a special end-of-sentence token is produced. The final h vectors represent the internal state of the encoder: the vector h3 (at the end of the encoder) must encode everything we need to know about the source sentence. It must fully capture the sentence’s meaning, i.e. be a “sentence embedding” (semantically similar phrases end up close to each other). Problem: it is unreasonable to assume that we can encode all information about a potentially very long sentence into a single vector and then hope that the decoder can produce a good translation based on only that. If the source sentence is 50 words long, the first word of the English translation is probably highly correlated with the first word of the source sentence, so the decoder has to consider information from 50 steps ago, and that information needs to be somehow encoded in the vector. RNNs have trouble with such long-range dependencies; we can use LSTMs, or hacks like reversing the source sentence (feeding it backwards into the encoder shortens the path from the decoder to the relevant parts of the encoder).
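To make the fixed-vector bottleneck concrete, here is a minimal sketch (NumPy, toy sizes, randomly initialized and untrained weights, all names hypothetical) of a vanilla encoder–decoder: the encoder squashes the whole source sentence into one vector h, and the decoder must generate every output word from that single vector alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid, vocab = 8, 16, 20                    # toy sizes (hypothetical)

# Plain RNN cell: h' = tanh(W_x x + W_h h)
W_xe, W_he = rng.normal(0, 0.1, (d_hid, d_emb)), rng.normal(0, 0.1, (d_hid, d_hid))
W_xd, W_hd = rng.normal(0, 0.1, (d_hid, d_emb)), rng.normal(0, 0.1, (d_hid, d_hid))
W_out = rng.normal(0, 0.1, (vocab, d_hid))
embed = rng.normal(0, 0.1, (vocab, d_emb))

def encode(src_ids):
    """Squash the whole source sentence into ONE fixed-length vector h."""
    h = np.zeros(d_hid)
    for tok in src_ids:
        h = np.tanh(W_xe @ embed[tok] + W_he @ h)
    return h                                       # "h3" in the slide: the sentence embedding

def decode(h, max_len=10, eos=0):
    """Generate greedily, conditioned ONLY on that single vector."""
    out, prev = [], eos
    for _ in range(max_len):
        h = np.tanh(W_xd @ embed[prev] + W_hd @ h)
        prev = int(np.argmax(W_out @ h))
        if prev == eos:
            break
        out.append(prev)
    return out

print(decode(encode([3, 7, 5])))                   # e.g. "Echt Dicke Kiste" as token ids
```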

Solution: Use Attention! With an attention mechanism, we no longer try to encode the full source sentence into a fixed-length vector. Instead, we allow the decoder to “attend” to different parts of the source sentence at each step of output generation. The model learns what to attend to based on the input sentence and what it has produced so far. In well-aligned languages (like English and German) the decoder would probably choose to attend to things sequentially (attending to the first source word when producing the first English word, etc.). One key paper: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

Interpreting Attention. The y’s are the translated words produced by the decoder, and the x’s are the source sentence words. Each decoder output word y (at time t) now depends on a weighted combination of all input states, not just the last state. The weights (“alpha”) define how much each input state should be considered for each output. So, if α_{3,2} is a large number, the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence. The weights are typically normalized to sum to 1 (so they form a distribution over input states). A big advantage is that this gives us the ability to interpret and visualize what the model is doing.
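A small sketch of this weighted combination (names and sizes are illustrative, not from the slides): for one output step, the decoder receives a context vector that blends all encoder states according to the alphas.

```python
import numpy as np

def attention_context(alphas, encoder_states):
    """Context for one output step: a weighted sum of ALL input states.

    alphas         : (T_in,)   attention weights for this output step (sum to 1)
    encoder_states : (T_in, d) one hidden state per source word
    """
    assert np.isclose(alphas.sum(), 1.0)
    return alphas @ encoder_states              # (d,)

# Toy example: while producing output word 3, the decoder puts most of its
# weight (alpha_{3,2} = 0.7) on the second source state.
h = np.random.randn(4, 6)                       # 4 source words, 6-dim encoder states
alpha_3 = np.array([0.1, 0.7, 0.1, 0.1])
c_3 = attention_context(alpha_3, h)
print(c_3.shape)                                # (6,)
```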

By visualizing the attention weight matrix α as a sentence is translated, we can understand how the model is translating! Here we see that while translating from French to English, the network attends sequentially to each input state, but sometimes it attends to two words at a time while producing an output, as when translating “la Syrie” to “Syria”, for example.
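A quick way to produce such a visualization, assuming you already have the attention weight matrix (here a small hand-made toy matrix, purely for illustration), is a simple heatmap: bright cells show which source words each generated word attended to.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical attention matrix: rows = generated (English) words,
# columns = source (French) words; each row sums to 1.
alpha = np.array([[0.45, 0.45, 0.10],   # "Syria" attends to both "la" and "Syrie"
                  [0.05, 0.05, 0.90]])  # "." attends to "."
src = ["la", "Syrie", "."]
tgt = ["Syria", "."]

plt.imshow(alpha, cmap="gray")
plt.xticks(range(len(src)), src)
plt.yticks(range(len(tgt)), tgt)
plt.xlabel("source words")
plt.ylabel("generated words")
plt.colorbar(label="attention weight")
plt.show()
```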

The Mechanism is Counterintuitive, however… Human attention is something that’s supposed to save computational resources: by focusing on one thing, we can neglect many other things. Here we’re essentially looking at everything in detail before deciding what to focus on. It is equivalent to outputting a translated word, and then going back through all of your internal memory of the text in order to decide which word to produce next. That seems like a waste and not at all what humans are doing (it is more akin to a memory access). Nonetheless, this hasn’t stopped attention mechanisms from becoming quite popular and performing well on many tasks!

The same attention mechanism can be applied to any recurrent model. In Show, Attend and Tell the authors apply attention mechanisms to the problem of generating image descriptions. They use a Convolutional Neural Network to “encode” the image, and a Recurrent Neural Network with attention mechanisms to generate a description. By visualizing the attention weights (just like in the translation example), we can interpret what the model is looking at while generating each word:

In Grammar as a Foreign Language, the authors use a Recurrent Neural Network with an attention mechanism to generate sentence parse trees. In Teaching Machines to Read and Comprehend, the authors use an RNN for machine reading: the network reads a text, reads a (synthetically generated) question, and then produces an answer. By visualizing the attention matrix we can see where the network “looks” while it tries to find the answer to the question!

Attention = (Fuzzy) Memory? The basic problem that the attention mechanism solves is that it allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector. This gives the network access to its internal memory, which is the set of hidden states of the encoder; the network chooses what to retrieve from that memory. Unlike typical memory, the memory access here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location. Making the memory access soft has the benefit that we can easily train the network end-to-end using backpropagation. The trend towards more complex memory structures is continuing: End-to-End Memory Networks allow the network to read the same input sequence multiple times before producing an output, updating the memory contents at each step, for example answering a question by making multiple reasoning steps over an input story. Neural Turing Machines use a similar form of memory mechanism, but with a more sophisticated type of addressing that uses both content-based (as here) and location-based addressing, allowing the network to learn addressing patterns in order to execute simple computer programs, like sorting algorithms. In fact, when the network’s parameter weights are tied in a certain way, the memory mechanism in End-to-End Memory Networks is identical to the attention mechanism presented here, except that it makes multiple hops over the memory (because it tries to integrate information from multiple sentences).
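A minimal sketch of such a soft, content-based memory read (a single “hop”; all names and shapes are illustrative): the query is scored against every memory slot, the scores become a distribution, and the read result is a weighted blend of all slots, which keeps everything differentiable.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_memory_read(query, memory):
    """Soft (differentiable) memory access, as in attention / memory networks.

    query  : (d,)   what the controller is 'looking for'
    memory : (N, d) N stored vectors (e.g. encoder hidden states or sentences)

    Instead of fetching ONE slot, we score every slot against the query
    (content-based addressing), turn the scores into a distribution, and
    return a weighted blend of all slots -- trainable with backpropagation.
    """
    scores = memory @ query            # (N,)  dot-product similarity
    weights = softmax(scores)          # (N,)  "where to look", sums to 1
    return weights @ memory, weights   # blended read vector, soft address

memory = np.random.randn(5, 8)         # 5 memory slots, 8-dim each
query = np.random.randn(8)
read, addr = soft_memory_read(query, memory)
print(addr.round(2), read.shape)
```

Multiple “hops”, as in End-to-End Memory Networks, would simply re-query the memory using a function of the read result.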

Soft vs Hard Attention Models. Hard attention: attend to a single input location; can’t use gradient descent, so we need reinforcement learning. Soft attention: compute a weighted combination (attention) over some inputs using an attention network; can use backpropagation to train end-to-end.

Attention for Recognition (Ba et al 2014) RNN-based model. Hard attention. Required reinforcement learning. li = location xi = image

Attention for Recognition (Mnih et al 2014) Glimpses are retinal (graded resolution) images li = location ai = action (classification)

Attention for Recognition (Mnih et al 2014) Glimpse trace on some digit images: Green line shows trajectory, other images are the glimpses themselves.

Soft Attention for Translation “I love coffee” -> “Me gusta el café” Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Soft Attention for Translation Distribution over input words “I love coffee” -> “Me gusta el café” Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Soft Attention for Translation: a decoder RNN attends, through an attention model, over a bidirectional encoder RNN. From Y. Bengio, CVPR 2015 Tutorial.

Soft Attention for Translation. Context vector (input to the decoder): c_i = Σ_j a_ij h_j. Mixture weights: a_ij = exp(e_ij) / Σ_k exp(e_ik). Alignment score (how well do input words near j match output words at position i): e_ij = f_att(s_{i-1}, h_j). Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
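A sketch of one decoder step of this additive attention, directly following the equations above (the toy shapes and parameter names W_a, U_a, v_a are assumptions, not taken from the paper’s code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(s_prev, H, W_a, U_a, v_a):
    """One decoder step of additive ('concat') attention.

    s_prev : (d_dec,)       previous decoder state s_{i-1}
    H      : (T_in, d_enc)  encoder states h_j (bidirectional annotations)

    Alignment scores  e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
    Mixture weights   a_ij = softmax_j(e_ij)
    Context vector    c_i  = sum_j a_ij h_j
    """
    e = np.tanh(s_prev @ W_a + H @ U_a) @ v_a    # (T_in,)
    a = softmax(e)                               # (T_in,)
    c = a @ H                                    # (d_enc,)
    return c, a

T_in, d_enc, d_dec, d_att = 6, 10, 12, 8
H = np.random.randn(T_in, d_enc)
s_prev = np.random.randn(d_dec)
W_a, U_a, v_a = (np.random.randn(d_dec, d_att),
                 np.random.randn(d_enc, d_att),
                 np.random.randn(d_att))
c, a = bahdanau_attention(s_prev, H, W_a, U_a, v_a)
print(a.round(2), c.shape)
```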

Soft Attention for Translation Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015

Soft Attention for Translation Reached State of the art in one year: Yoshua Bengio, NIPS RAM workshop 2015

Soft Attention for Translation Luong, Pham and Manning’s Translation System (2015): Translation Error Rate vs Human Luong and Manning IWSLT 2015

Luong, Pham and Manning 2015 Stacked LSTM (c.f. bidirectional flat encoder in Bahdanau et al): Effective Approaches to Attention-based Neural Machine Translation Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 15

Global Attention Model. The global attention model is similar to, but simpler than, Bahdanau’s; different word-matching (score) functions were tried. Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
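The three score (“word matching”) functions from Luong et al. — dot, general, and concat — can be sketched as follows (toy shapes; the parameter names W_a, W_c, v_a are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_score(h_t, H_src, kind="general", W_a=None, W_c=None, v_a=None):
    """Score each source state against the current decoder state h_t.

    h_t   : (d,)      current decoder state
    H_src : (T_in, d) source-side (encoder) states
    """
    if kind == "dot":                      # score = h_t . h_s
        return H_src @ h_t
    if kind == "general":                  # score = h_t^T W_a h_s
        return H_src @ (W_a @ h_t)
    if kind == "concat":                   # score = v_a^T tanh(W_c [h_t; h_s])
        both = np.concatenate([np.tile(h_t, (len(H_src), 1)), H_src], axis=1)
        return np.tanh(both @ W_c) @ v_a
    raise ValueError(kind)

d, T_in = 8, 5
h_t, H_src = np.random.randn(d), np.random.randn(T_in, d)
scores = luong_score(h_t, H_src, "general", W_a=np.random.randn(d, d))
a_t = softmax(scores)                      # global alignment over ALL source words
context = a_t @ H_src
print(a_t.round(2), context.shape)
```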

Local Attention Model. First compute a best-aligned position p_t, then compute a context vector centered at that position. Effective Approaches to Attention-based Neural Machine Translation, Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
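A rough sketch of this local attention under stated assumptions (toy shapes; the window radius D and the position-prediction parameters W_p, v_p are illustrative names): predict p_t from the decoder state, score only a window around it, and damp the weights with a Gaussian centred at p_t.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def local_attention(h_t, H_src, W_p, v_p, D=2):
    """Local attention: predict an aligned position, attend in a window there.

    h_t   : (d,)      current decoder state
    H_src : (S, d)    source-side (encoder) states
    """
    S = len(H_src)
    p_t = S * (1.0 / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t)))))  # predicted position in [0, S]
    lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
    window = H_src[lo:hi]
    a = softmax(window @ h_t)                                      # dot scores inside the window
    s = np.arange(lo, hi)
    a = a * np.exp(-((s - p_t) ** 2) / (2 * (D / 2.0) ** 2))       # Gaussian falloff around p_t
    return a @ window, p_t, a

d, S, d_p = 8, 10, 6
h_t, H_src = np.random.randn(d), np.random.randn(S, d)
W_p, v_p = np.random.randn(d_p, d), np.random.randn(d_p)
ctx, p_t, a = local_attention(h_t, H_src, W_p, v_p)
print(round(float(p_t), 2), a.round(2), ctx.shape)
```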

Results Local and global models Effective Approaches to Attention-based Neural Machine Translation Minh-Thang Luong Hieu Pham Christopher D. Manning, EMNLP 15

Recall: RNN for Captioning. A CNN encodes the image (H x W x 3) into a single feature vector (size D), which initializes the RNN hidden state h0 (size H). At each timestep the RNN takes the previous word (y1 = first word, y2 = second word, …) and produces a distribution over the vocabulary (d1, d2, …). The RNN only looks at the whole image, once. What if the RNN looks at different parts of the image at each timestep?

Soft Attention for Captioning. The CNN now produces a grid of features (L x D) from the image (H x W x 3) instead of a single vector. At each timestep the hidden state produces a distribution over the L locations (a1, a2, a3, …); the features are combined using those weights into a weighted feature vector z (size D), a weighted combination of features. Each RNN step (h1, h2, …) takes the previous word (y1, y2, …) together with the weighted features z and produces both the next distribution over the vocabulary (d1, d2, …) and the next distribution over locations. Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
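A minimal sketch of one such attention step over the L x D feature grid (dimensions and parameter names are illustrative, loosely following the soft version of Show, Attend and Tell):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def caption_attention_step(h, feats, W_f, W_h, v):
    """One timestep of soft attention over a CNN feature grid.

    h     : (d_hid,)  current RNN hidden state
    feats : (L, D)    CNN feature grid (L locations, D-dim each)
    Returns a distribution a over the L locations and the weighted feature z.
    """
    scores = np.tanh(feats @ W_f + W_h @ h) @ v   # (L,)  one score per location
    a = softmax(scores)                           # (L,)  "where to look"
    z = a @ feats                                 # (D,)  weighted combination of features
    return a, z

L, D, d_hid, d_att = 14 * 14, 512, 256, 128       # conv-grid-like sizes (hypothetical)
feats = np.random.randn(L, D)
h = np.random.randn(d_hid)
W_f, W_h, v = (np.random.randn(D, d_att),
               np.random.randn(d_att, d_hid),
               np.random.randn(d_att))
a, z = caption_attention_step(h, feats, W_f, W_h, v)
print(a.shape, z.shape)                           # (196,) (512,)
```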

Soft vs Hard Attention. The CNN produces a grid of features a, b, c, d (each D-dimensional) from the image (H x W x 3), and the RNN outputs a distribution over grid locations: pa + pb + pc + pd = 1. The context vector z is D-dimensional. Soft attention: summarize ALL locations, z = pa·a + pb·b + pc·c + pd·d; the derivative dz/dp is nice, so we can train with gradient descent. Hard attention: sample ONE location according to p and set z to that feature vector; with argmax, dz/dp is zero almost everywhere, so we can’t use gradient descent and need reinforcement learning. Xu et al, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML 2015
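The soft/hard distinction in a few lines of NumPy (toy feature grid; the distribution p would normally come from the RNN):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))             # grid features a, b, c, d (each D = 8)
p = np.array([0.1, 0.6, 0.2, 0.1])          # distribution over the 4 grid locations

# Soft attention: blend ALL locations. z is a smooth function of p, so
# dz/dp exists and the whole model trains with plain backpropagation.
z_soft = p @ feats                          # pa*a + pb*b + pc*c + pd*d

# Hard attention: sample ONE location and take that feature vector.
# The sampling step is not differentiable (with argmax, dz/dp is zero
# almost everywhere), so training needs reinforcement learning.
idx = rng.choice(4, p=p)
z_hard = feats[idx]

print(z_soft.shape, idx, z_hard.shape)
```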

Soft Attention for Captioning: soft attention vs hard attention examples (figure).

Soft Attention for Captioning

Soft Attention for Video “Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.

Soft Attention for Video The attention model: “Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.

Soft Attention for Captioning Attention constrained to fixed grid! We’ll come back to this ….

Attending to arbitrary regions? The attention mechanism from Show, Attend and Tell only lets us softly attend to fixed grid positions (the L x D feature grid computed from the H x W x 3 image) … can we do better?

Attending to Arbitrary Regions: read text and generate handwriting using an RNN, attending to arbitrary regions of the output by predicting the parameters of a mixture model. Which samples are real and which are generated? (REAL vs GENERATED examples shown.) Graves, “Generating Sequences with Recurrent Neural Networks”, arXiv 2013
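A sketch of the mixture-of-Gaussians window from Graves (2013) that implements this idea of attending to arbitrary positions along the text (shapes and variable names are illustrative):

```python
import numpy as np

def graves_window(kappa_prev, rnn_out, char_onehots, W):
    """Soft window over the character sequence, Graves (2013) style.

    The RNN output predicts, for K Gaussian components, parameters
    (alpha, beta, kappa): importance, width, and a monotonically
    increasing position along the text. The window weight on character u is
        phi(u) = sum_k alpha_k * exp(-beta_k * (kappa_k - u)^2)
    and the window vector is w = sum_u phi(u) * c_u.
    """
    K = W.shape[0] // 3
    params = np.exp(W @ rnn_out)                 # positivity via exp
    alpha, beta, d_kappa = params[:K], params[K:2 * K], params[2 * K:]
    kappa = kappa_prev + d_kappa                 # positions only move forward
    u = np.arange(len(char_onehots))[None, :]    # (1, U)
    phi = (alpha[:, None] * np.exp(-beta[:, None] * (kappa[:, None] - u) ** 2)).sum(0)
    return phi @ char_onehots, kappa, phi        # window vector, new positions, weights

K, d_rnn, U, n_chars = 3, 16, 12, 30
W = np.random.randn(3 * K, d_rnn) * 0.1
rnn_out = np.random.randn(d_rnn)
chars = np.eye(n_chars)[np.random.randint(0, n_chars, U)]   # one-hot text c_u
w, kappa, phi = graves_window(np.zeros(K), rnn_out, chars, W)
print(w.shape, phi.round(2))
```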

Attending to Arbitrary Regions: DRAW. Classify images by attending to arbitrary regions of the input; generate images by attending to arbitrary regions of the output. Gregor et al, “DRAW: A Recurrent Neural Network For Image Generation”, ICML 2015

Attention Takeaways. Performance: attention models can improve accuracy and reduce computation at the same time. Complexity: there are many design choices, and those choices have a big effect on performance; ensembling has unusually large benefits; simplify where possible!

Attention Takeaways. Explainability: attention models encode explanations; both the locus and the trajectory of attention help us understand what’s going on. Hard vs. Soft: soft models are easier to train, while hard models require reinforcement learning; they can be combined, as in Luong et al.

What did we learn today? Attention can be quite useful in a variety of applications. Attention is a form of memory access and can help with capturing long-term dependencies!

Questions? Deep robots! Deep questions?!