1
Please, Pay Attention, Neural Attention
Alexander G. Ororbia II, C. Lee Giles
IST597: Foundations of Deep Learning, The Pennsylvania State University
Thanks to John Canny
3
Origins/History
Origins in vision:
1) Larochelle and Hinton, 2010, "Learning to combine foveal glimpses with a third-order Boltzmann machine"
2) Misha Denil et al., 2011, "Learning where to Attend with Deep Architectures for Image Tracking"
Rise in neural machine translation:
Devlin et al., ACL 2014
Cho et al., EMNLP 2014
Bahdanau, Cho & Bengio, arXiv, Sept. 2014
Jean, Cho, Memisevic & Bengio, arXiv, Dec. 2014
Sutskever et al., NIPS 2014
4
Other Applications: Rise in Speech and Text Problems
Ba et al., 2014: visual attention for recognition
Mnih et al., 2014: visual attention for recognition
Chorowski et al., 2014: speech recognition
Graves et al., 2014: Neural Turing Machines
Yao et al., 2015: video description generation
Vinyals et al., 2015: conversational agents
Xu et al., 2015: image caption generation
Xu et al., 2015: visual question answering
5
What problem does Attention solve? A View from Neural Machine Translation (NMT)
Traditional machine translation typically relies on sophisticated feature engineering based on statistical properties of text. This is difficult to build and requires extensive expertise.
In NMT: 1) map the meaning of a sentence to a fixed-length vector representation, 2) generate a translation based on that vector.
NMT does NOT rely on n-gram counts; it focuses on capturing the higher-level meaning of a text, generalizes to new sentences better than many other approaches, and is easier to build and train.
Most NMT systems work by encoding the source sentence (e.g. a German sentence) into a vector using an RNN and then decoding an English sentence based on that vector, also using an RNN.
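To make the encode-then-decode pipeline concrete, here is a minimal sketch assuming PyTorch; the module names, sizes, and the choice of GRUs are illustrative, not details from the slides:

```python
# Minimal sketch of the encode-then-decode idea described above (no attention yet).
# Assumes PyTorch; all names and hyperparameters here are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence into a single fixed-length vector h_n.
        _, h_n = self.encoder(self.src_emb(src_ids))          # h_n: (1, B, hidden)
        # Decode conditioned only on that single vector (the bottleneck discussed next).
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h_n)
        return self.out(dec_states)                           # (B, T_tgt, tgt_vocab) logits

# Usage (teacher forcing with random token ids):
# logits = Seq2Seq(1000, 1000)(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
```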
7
Thinking about NMT
In the picture above, the words "Echt", "Dicke" and "Kiste" are fed into the encoder, and after a special signal the decoder starts producing a translated sentence. The decoder keeps generating words until a special end-of-sentence token is produced.
The final h vectors represent the internal state of the encoder. The vector h3 (at the end of the encoder) must encode everything we need to know about the source sentence; it must fully capture the sentence meaning, i.e. be a "sentence embedding" (semantically similar phrases end up close to each other).
Problem: it is unreasonable to assume that we can encode all information about a potentially very long sentence into a single vector and hope that the decoder can produce a good translation based on only that. If the source sentence is 50 words long, the first word of the English translation is probably highly correlated with the first word of the source sentence, so the decoder has to consider information from 50 steps ago, and that information needs to be encoded somewhere in the vector.
RNNs have problems dealing with such long-range dependencies. One could use LSTMs, or hacks like reversing the source sentence (feeding it backwards into the encoder shortens the path from the decoder to the relevant early parts of the encoder).
8
Solution: Use Attention!
With an attention mechanism we no longer try to encode the full source sentence into a fixed-length vector. Instead, we allow the decoder to "attend" to different parts of the source sentence at each step of the output generation.
The model learns what to attend to based on the input sentence and what it has produced so far.
In well-aligned languages (like English and German) the decoder would probably choose to attend to things sequentially (attending to the first source word when producing the first English word, and so on).
One key paper: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint, 2014.
10
Interpreting Attention
The y's are the translated words produced by the decoder, and the x's are the source sentence words. Each decoder output word y_t now depends on a weighted combination of all input states, not just the last state.
The weights ("alphas") define how much each input state should be considered for each output. So, if alpha_{3,2} is a large number, this means the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence.
The weights are typically normalized to sum to 1 (so they form a distribution over input states).
Big advantage: this gives us the ability to interpret and visualize what the model is doing.
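A tiny numerical illustration of this weighted combination (made-up numbers, NumPy):

```python
# The decoder output at step t uses a context vector sum_j alpha_{t,j} * h_j,
# with the alphas summing to 1. Values below are invented for illustration only.
import numpy as np

h = np.array([[0.1, 0.9],    # encoder state for source word 1
              [0.8, 0.2],    # encoder state for source word 2
              [0.4, 0.4]])   # encoder state for source word 3

scores = np.array([0.2, 2.5, 0.1])             # unnormalized relevance of each input state at step t=3
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax -> a distribution over input states
context = alpha @ h                            # weighted combination of ALL encoder states

print(alpha)    # alpha[1] is large: the decoder "pays a lot of attention" to the second source state
print(context)  # the context vector used to produce the third target word
```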
11
By visualizing the attention weight matrix alpha when a sentence is translated, we can understand how the model is translating! Here we see that while translating from French to English, the network attends sequentially to each input state, but sometimes it attends to two words at a time while producing an output, as when translating "la Syrie" to "Syria", for example.
12
The Mechanism is Counterintuitive, however…
Human attention is something that's supposed to save computational resources: by focusing on one thing, we can neglect many other things.
Here, however, we're essentially looking at everything in detail before deciding what to focus on. It is equivalent to outputting a translated word and then going back through all of your internal memory of the text in order to decide which word to produce next.
That seems like a waste and not at all what humans are doing (it is more akin to a memory access).
Nonetheless, this hasn't stopped attention mechanisms from becoming quite popular and performing well on many tasks!
13
Same attention mechanism can be applied to any recurrent model
In Show, Attend and Tell, the authors apply attention mechanisms to the problem of generating image descriptions. They use a Convolutional Neural Network to "encode" the image and a Recurrent Neural Network with an attention mechanism to generate a description. By visualizing the attention weights (just like in the translation example), we can interpret what the model is looking at while generating a word:
15
In Grammar as a Foreign Language, the authors use a Recurrent Neural Network with an attention mechanism to generate sentence parse trees.
In Teaching Machines to Read and Comprehend, the authors use an RNN for machine reading: the RNN reads a text, reads a (synthetically generated) question, and then produces an answer. By visualizing the attention matrix we can see where the network "looks" while it tries to find the answer to the question!
16
Attention = (Fuzzy) Memory?
The basic problem that the attention mechanism solves is that it allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector.
This gives the network access to its internal memory, which is the hidden state of the encoder; the network chooses what to retrieve from memory. Unlike typical memory, the memory access mechanism here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location. Making the memory access soft has the benefit that we can easily train the network end-to-end using backpropagation.
The trend towards more complex memory structures is continuing. End-to-End Memory Networks allow the network to read the same input sequence multiple times before producing an output, updating the memory contents at each step; for example, answering a question by making multiple reasoning steps over an input story. When the network's parameter weights are tied in a certain way, the memory mechanism in End-to-End Memory Networks is identical to the attention mechanism presented here, except that it makes multiple hops over the memory (because it tries to integrate information from multiple sentences).
Neural Turing Machines use a similar form of memory mechanism, but with a more sophisticated type of addressing that uses both content-based (like here) and location-based addressing, allowing the network to learn addressing patterns and execute simple computer programs, like sorting algorithms.
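A tiny sketch of the "soft memory read" idea described above, in NumPy; the single-hop form below follows the general attention / End-to-End Memory Network recipe, with all names chosen for illustration rather than taken from any particular implementation:

```python
# One soft read ("hop") over a memory of vectors: content-based addressing followed by
# a weighted combination of ALL slots, which keeps everything differentiable.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_memory_read(query, memory):
    # memory: (num_slots, d) -- e.g. encoder hidden states or embedded sentences
    # query:  (d,)           -- e.g. the current decoder state or a question embedding
    p = softmax(memory @ query)   # a weight per memory slot (sums to 1)
    return p @ memory, p          # weighted combination of all slots, not one discrete slot

memory = np.random.randn(5, 16)
query = np.random.randn(16)
read_vec, weights = soft_memory_read(query, memory)
# Multiple hops = repeat the read with a query updated by the previous read.
```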
17
Soft vs Hard Attention Models
Hard attention: Attend to a single input location. Can't use gradient descent; need reinforcement learning.
Soft attention: Compute a weighted combination (attention) over some inputs using an attention network. Can use backpropagation to train end-to-end.
18
Attention for Recognition (Ba et al 2014)
RNN-based model. Hard attention. Requires reinforcement learning.
l_i = location, x_i = image
19
Attention for Recognition (Mnih et al 2014)
Glimpses are retinal (graded-resolution) images.
l_i = location, a_i = action (classification)
20
Attention for Recognition (Mnih et al 2014)
Glimpse trace on some digit images: the green line shows the trajectory; the other images are the glimpses themselves.
21
Soft Attention for Translation
Distribution over input words: "I love coffee" -> "Me gusta el café"
Bahdanau et al, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
26
Soft Attention for Translation
Encoder-decoder diagram (bidirectional encoder RNN, attention model, decoder RNN). From Y. Bengio CVPR 2015 Tutorial
30
Soft Attention for Translation
Context vector (input to decoder), mixture weights, and alignment score (how well do input words near position j match the output word at position i) -- formulas reproduced below.
Bahdanau et al, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
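The formulas on the original slide were shown as images; for reference, the standard formulation from the Bahdanau et al. paper (with s_{i-1} the previous decoder state and h_j the encoder annotation of source word j) is:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
e_{ij} = a(s_{i-1}, h_j)
```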
31
Soft Attention for Translation
Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
32
Soft Attention for Translation
Reached state of the art in one year (Yoshua Bengio, NIPS RAM workshop 2015).
33
Soft Attention for Translation
Luong, Pham and Manning's translation system (2015): translation error rate vs. human.
Luong and Manning, IWSLT 2015
34
Luong, Pham and Manning 2015: stacked LSTM (cf. the bidirectional flat encoder in Bahdanau et al.)
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
35
Global Attention Model
The global attention model is similar to, but simpler than, Bahdanau's; different word matching functions were used (see the score functions below).
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
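As a reference for the "different word matching functions": in Luong et al., the alignment score between the current target hidden state h_t and a source hidden state h̄_s comes in dot, general, and concat variants (the original slide showed these as an image; they are reproduced here from the paper):

```latex
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{(dot)} \\
h_t^\top W_a \bar{h}_s & \text{(general)} \\
v_a^\top \tanh\!\left(W_a [h_t ; \bar{h}_s]\right) & \text{(concat)}
\end{cases}
```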
36
Local Attention Model
Compute a best aligned position p_t first, then compute a context vector centered at that position (see the formulas below).
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
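For reference, the predictive local attention from Luong et al., sketched from the paper (treat exact symbols as a paraphrase): the model predicts an aligned position p_t in the source sentence of length S, then weights the alignment scores with a Gaussian window of half-width D centered at p_t:

```latex
p_t = S \cdot \mathrm{sigmoid}\!\left(v_p^\top \tanh(W_p h_t)\right), \qquad
a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\,
         \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right), \quad \sigma = \frac{D}{2}
```

with the context vector computed only over the window [p_t - D, p_t + D].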
37
Results: local and global models
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
38
Recall: RNN for Captioning
A CNN encodes the image (H x W x 3) into a single feature vector (dimension D), which initializes the RNN hidden state h0 (dimension H). At each timestep the RNN consumes the previous word (y1 = first word, y2 = second word, ...) and emits a distribution over the vocabulary (d1, d2, ...).
The RNN only looks at the whole image, once.
What if the RNN looks at different parts of the image at each timestep?
45
Soft Attention for Captioning
The CNN now produces a grid of features (L locations x D dimensions) from the image (H x W x 3), rather than a single vector. At each timestep, the hidden state produces a distribution a over the L locations; a weighted combination of the features gives a D-dimensional context vector z, which is fed into the next RNN step together with the previous word y. Each step emits both a distribution over the vocabulary (d1, d2, ...) and the attention distribution for the next step (a1, a2, ...).
Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
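A minimal per-timestep sketch of this mechanism, assuming PyTorch; the scoring network, the use of a GRU cell, and all names are illustrative simplifications rather than Xu et al.'s exact model:

```python
# One decoding step of soft attention over an L x D feature grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttendStep(nn.Module):
    def __init__(self, D, H, vocab):
        super().__init__()
        self.score = nn.Linear(D + H, 1)   # scores each feature location given the hidden state
        self.rnn = nn.GRUCell(D + H, H)    # consumes [context z ; word embedding] -- simplified
        self.emb = nn.Embedding(vocab, H)
        self.out = nn.Linear(H, vocab)

    def forward(self, feats, h, prev_word):
        # feats: (B, L, D), h: (B, H), prev_word: (B,) token ids
        L = feats.size(1)
        h_rep = h.unsqueeze(1).expand(-1, L, -1)               # (B, L, H)
        e = self.score(torch.cat([feats, h_rep], dim=-1))      # (B, L, 1) unnormalized scores
        a = F.softmax(e.squeeze(-1), dim=-1)                   # distribution over the L locations
        z = torch.bmm(a.unsqueeze(1), feats).squeeze(1)        # (B, D) weighted combination of features
        h_next = self.rnn(torch.cat([z, self.emb(prev_word)], dim=-1), h)
        d = F.log_softmax(self.out(h_next), dim=-1)            # distribution over the vocabulary
        return d, a, h_next
```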
54
Soft vs Hard Attention
A CNN produces a grid of features a, b, c, d (each D-dimensional) from the image (H x W x 3). From the RNN we get a distribution over grid locations: pa + pb + pc + pd = 1.
Soft attention: summarize ALL locations, z = pa*a + pb*b + pc*c + pd*d. The derivative dz/dp is nice! Train with gradient descent.
Hard attention: sample ONE location according to p and set z to that feature vector. With argmax, dz/dp is zero almost everywhere, so we can't use gradient descent; we need reinforcement learning.
Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
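A small numerical sketch of this distinction, using NumPy with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 8))        # features a, b, c, d: 4 grid locations, each D=8 dims
p = np.array([0.1, 0.6, 0.2, 0.1])    # distribution from the RNN, pa + pb + pc + pd = 1

# Soft attention: expectation over ALL locations -- differentiable in p, trainable by backprop.
z_soft = p @ grid                     # z = pa*a + pb*b + pc*c + pd*d

# Hard attention: sample ONE location -- the sampling step blocks gradients w.r.t. p,
# which is why hard attention is trained with reinforcement learning (e.g. REINFORCE).
idx = rng.choice(4, p=p)
z_hard = grid[idx]
```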
58
Soft Attention for Captioning
Hard attention
59
Soft Attention for Captioning
60
Soft Attention for Video
“Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
61
Soft Attention for Video
The attention model: “Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
62
Soft Attention for Video
“Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
63
Soft Attention for Captioning
Attention constrained to a fixed grid! We'll come back to this…
64
Attending to arbitrary regions?
Features: L x D, Image: H x W x 3
The attention mechanism from Show, Attend and Tell only lets us softly attend to fixed grid positions… can we do better?
65
Attending to Arbitrary Regions
- Read text, generate handwriting using an RNN
- Attend to arbitrary regions of the output by predicting the parameters of a mixture model
Which are real and which are generated? (The figure shows both REAL and GENERATED samples.)
Graves, "Generating Sequences with Recurrent Neural Networks", arXiv 2013
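For reference, Graves' handwriting model implements this with a soft window over the character sequence built from a mixture of K Gaussians whose parameters the RNN predicts at every step (sketched here from the paper; treat the exact symbols as a paraphrase):

```latex
\phi(t, u) = \sum_{k=1}^{K} \alpha_t^k \exp\!\left(-\beta_t^k \left(\kappa_t^k - u\right)^2\right), \qquad
w_t = \sum_{u=1}^{U} \phi(t, u)\, c_u
```

where c_u is the (one-hot) character at position u, and the window centers kappa only move forward, so the attention slides monotonically along the text.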
68
Attending to Arbitrary Regions: DRAW
Classify images by attending to arbitrary regions of the input. Generate images by attending to arbitrary regions of the output.
Gregor et al, "DRAW: A Recurrent Neural Network For Image Generation", ICML 2015
71
Attention Takeaways
Performance: Attention models can improve accuracy and reduce computation at the same time.
Complexity: There are many design choices, and those choices have a big effect on performance. Ensembling has unusually large benefits. Simplify where possible!
72
Attention Takeaways
Explainability: Attention models encode explanations; both the locus and the trajectory of attention help us understand what's going on.
Hard vs. Soft: Soft models are easier to train; hard models require reinforcement learning. They can be combined, as in Luong et al.
73
What did we learn today?
Attention can be quite useful in a variety of applications.
Attention is a form of memory access and can help with capturing long-term dependencies!
74
Questions? Deep robots! Deep questions?!