1
Please, Pay Attention, Neural Attention
Alexander G. Ororbia II, C. Lee Giles
IST597: Foundations of Deep Learning, The Pennsylvania State University
Thanks to John Canny
3
Origins/History
Origins in vision:
1) Larochelle and Hinton, 2010, "Learning to combine foveal glimpses with a third-order Boltzmann machine"
2) Misha Denil et al., 2011, "Learning where to Attend with Deep Architectures for Image Tracking"
Rise in neural machine translation:
Devlin et al., ACL 2014
Cho et al., EMNLP 2014
Bahdanau, Cho & Bengio, arXiv, Sept. 2014
Jean, Cho, Memisevic & Bengio, arXiv, Dec. 2014
Sutskever et al., NIPS 2014
4
Other Applications: Rise in Speech and Text Problems
Ba et al., 2014: visual attention for recognition
Mnih et al., 2014: visual attention for recognition
Chorowski et al., 2014: speech recognition
Graves et al., 2014: Neural Turing Machines
Yao et al., 2015: video description generation
Vinyals et al., 2015: conversational agents
Xu et al., 2015: image caption generation
Xu et al., 2015: visual question answering
5
What problem does Attention solve? A View from Neural Machine Translation (NMT)
Traditional machine translation typically relies on sophisticated feature engineering based on statistical properties of text. This is difficult to build and requires extensive expertise.
In NMT: 1) map the meaning of a sentence to a fixed-length vector representation, 2) generate a translation based on that vector.
NMT does NOT rely on n-gram counts; it focuses on capturing the higher-level meaning of a text, generalizes to new sentences better than many other approaches, and is easier to build and train.
Most NMT systems work by encoding the source sentence (e.g. a German sentence) into a vector using an RNN and then decoding an English sentence based on that vector, also using an RNN.
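To make the encode-then-decode pipeline concrete, here is a minimal sketch assuming PyTorch; the module names, sizes, and the choice of GRUs are illustrative, not details from the slides:

```python
# Minimal sketch of the encode-then-decode idea described above (no attention yet).
# Assumes PyTorch; all names and hyperparameters here are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the whole source sentence into a single fixed-length vector h_n.
        _, h_n = self.encoder(self.src_emb(src_ids))          # h_n: (1, B, hidden)
        # Decode conditioned only on that single vector (the bottleneck discussed next).
        dec_states, _ = self.decoder(self.tgt_emb(tgt_ids), h_n)
        return self.out(dec_states)                           # (B, T_tgt, tgt_vocab) logits

# Usage (teacher forcing with random token ids):
# logits = Seq2Seq(1000, 1000)(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1000, (2, 5)))
```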
7
Thinking about NMT
In the picture above, the words "Echt", "Dicke" and "Kiste" are fed into the encoder, and after a special signal the decoder starts producing a translated sentence. The decoder keeps generating words until a special end-of-sentence token is produced.
The final h vectors represent the internal state of the encoder. The vector h3 (at the end of the encoder) must encode everything we need to know about the source sentence; it must fully capture the sentence meaning, i.e. be a "sentence embedding" (semantically similar phrases end up close to each other).
Problem: it is unreasonable to assume that we can encode all information about a potentially very long sentence into a single vector and hope that the decoder can produce a good translation based on only that. If the source sentence is 50 words long, the first word of the English translation is probably highly correlated with the first word of the source sentence, so the decoder has to consider information from 50 steps ago, and that information needs to be encoded somewhere in the vector.
RNNs have problems dealing with such long-range dependencies. One could use LSTMs, or hacks like reversing the source sentence (feeding it backwards into the encoder shortens the path from the decoder to the relevant early parts of the encoder).
8
Solution: Use Attention!
With an attention mechanism we no longer try to encode the full source sentence into a fixed-length vector. Instead, we allow the decoder to "attend" to different parts of the source sentence at each step of the output generation.
The model learns what to attend to based on the input sentence and what it has produced so far.
In well-aligned languages (like English and German) the decoder would probably choose to attend to things sequentially (attending to the first source word when producing the first English word, and so on).
One key paper: Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint, 2014.
10
Interpreting Attention
The y's are the translated words produced by the decoder, and the x's are the source sentence words. Each decoder output word y_t now depends on a weighted combination of all input states, not just the last state.
The weights ("alphas") define how much each input state should be considered for each output. So, if alpha_{3,2} is a large number, this means the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence.
The weights are typically normalized to sum to 1 (so they form a distribution over input states).
Big advantage: this gives us the ability to interpret and visualize what the model is doing.
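A tiny numerical illustration of this weighted combination (made-up numbers, NumPy):

```python
# The decoder output at step t uses a context vector sum_j alpha_{t,j} * h_j,
# with the alphas summing to 1. Values below are invented for illustration only.
import numpy as np

h = np.array([[0.1, 0.9],    # encoder state for source word 1
              [0.8, 0.2],    # encoder state for source word 2
              [0.4, 0.4]])   # encoder state for source word 3

scores = np.array([0.2, 2.5, 0.1])             # unnormalized relevance of each input state at step t=3
alpha = np.exp(scores) / np.exp(scores).sum()  # softmax -> a distribution over input states
context = alpha @ h                            # weighted combination of ALL encoder states

print(alpha)    # alpha[1] is large: the decoder "pays a lot of attention" to the second source state
print(context)  # the context vector used to produce the third target word
```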
11
By visualizing the attention weight matrix alpha when a sentence is translated, we can understand how the model is translating! Here we see that while translating from French to English, the network attends sequentially to each input state, but sometimes it attends to two words at a time while producing an output, as when translating "la Syrie" to "Syria", for example.
12
The Mechanism is Counterintuitive, however…
Human attention is something that's supposed to save computational resources: by focusing on one thing, we can neglect many other things.
Here, however, we're essentially looking at everything in detail before deciding what to focus on. It is equivalent to outputting a translated word and then going back through all of your internal memory of the text in order to decide which word to produce next.
That seems like a waste and not at all what humans are doing (it is more akin to a memory access).
Nonetheless, this hasn't stopped attention mechanisms from becoming quite popular and performing well on many tasks!
13
Same attention mechanism can be applied to any recurrent model
In Show, Attend and Tell, the authors apply attention mechanisms to the problem of generating image descriptions. They use a Convolutional Neural Network to "encode" the image and a Recurrent Neural Network with an attention mechanism to generate a description. By visualizing the attention weights (just like in the translation example), we can interpret what the model is looking at while generating a word:
15
In Grammar as a Foreign Language, the authors use a Recurrent Neural Network with an attention mechanism to generate sentence parse trees.
In Teaching Machines to Read and Comprehend, the authors use an RNN for machine reading: the RNN reads a text, reads a (synthetically generated) question, and then produces an answer. By visualizing the attention matrix we can see where the network "looks" while it tries to find the answer to the question!
16
Attention = (Fuzzy) Memory?
The basic problem that the attention mechanism solves is that it allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector.
This gives the network access to its internal memory, which is the hidden state of the encoder; the network chooses what to retrieve from memory. Unlike typical memory, the memory access mechanism here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location. Making the memory access soft has the benefit that we can easily train the network end-to-end using backpropagation.
The trend towards more complex memory structures is continuing. End-to-End Memory Networks allow the network to read the same input sequence multiple times before producing an output, updating the memory contents at each step; for example, answering a question by making multiple reasoning steps over an input story. When the network's parameter weights are tied in a certain way, the memory mechanism in End-to-End Memory Networks is identical to the attention mechanism presented here, except that it makes multiple hops over the memory (because it tries to integrate information from multiple sentences).
Neural Turing Machines use a similar form of memory mechanism, but with a more sophisticated type of addressing that uses both content-based (like here) and location-based addressing, allowing the network to learn addressing patterns and execute simple computer programs, like sorting algorithms.
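A tiny sketch of the "soft memory read" idea described above, in NumPy; the single-hop form below follows the general attention / End-to-End Memory Network recipe, with all names chosen for illustration rather than taken from any particular implementation:

```python
# One soft read ("hop") over a memory of vectors: content-based addressing followed by
# a weighted combination of ALL slots, which keeps everything differentiable.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def soft_memory_read(query, memory):
    # memory: (num_slots, d) -- e.g. encoder hidden states or embedded sentences
    # query:  (d,)           -- e.g. the current decoder state or a question embedding
    p = softmax(memory @ query)   # a weight per memory slot (sums to 1)
    return p @ memory, p          # weighted combination of all slots, not one discrete slot

memory = np.random.randn(5, 16)
query = np.random.randn(16)
read_vec, weights = soft_memory_read(query, memory)
# Multiple hops = repeat the read with a query updated by the previous read.
```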
17
Soft vs Hard Attention Models
Hard attention: Attend to a single input location. Can't use gradient descent; need reinforcement learning.
Soft attention: Compute a weighted combination (attention) over some inputs using an attention network. Can use backpropagation to train end-to-end.
18
Attention for Recognition (Ba et al 2014)
RNN-based model. Hard attention. Requires reinforcement learning.
l_i = location, x_i = image
19
Attention for Recognition (Mnih et al 2014)
Glimpses are retinal (graded-resolution) images.
l_i = location, a_i = action (classification)
20
Attention for Recognition (Mnih et al 2014)
Glimpse trace on some digit images: the green line shows the trajectory; the other images are the glimpses themselves.
21
Soft Attention for Translation
Distribution over input words: "I love coffee" -> "Me gusta el café"
Bahdanau et al, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
26
Soft Attention for Translation
Encoder-decoder diagram (bidirectional encoder RNN, attention model, decoder RNN). From Y. Bengio CVPR 2015 Tutorial
30
Soft Attention for Translation
Context vector (input to decoder), mixture weights, and alignment score (how well do input words near position j match the output word at position i) -- formulas reproduced below.
Bahdanau et al, "Neural Machine Translation by Jointly Learning to Align and Translate", ICLR 2015
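The formulas on the original slide were shown as images; for reference, the standard formulation from the Bahdanau et al. paper (with s_{i-1} the previous decoder state and h_j the encoder annotation of source word j) is:

```latex
c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j, \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
e_{ij} = a(s_{i-1}, h_j)
```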
31
Soft Attention for Translation
Bahdanau et al, “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015
32
Soft Attention for Translation
Reached state of the art in one year (Yoshua Bengio, NIPS RAM workshop 2015).
33
Soft Attention for Translation
Luong, Pham and Manning's translation system (2015): translation error rate vs. human.
Luong and Manning, IWSLT 2015
34
Luong, Pham and Manning 2015: stacked LSTM (cf. the bidirectional flat encoder in Bahdanau et al.)
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
35
Global Attention Model
The global attention model is similar to, but simpler than, Bahdanau's; different word matching functions were used (see the score functions below).
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
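As a reference for the "different word matching functions": in Luong et al., the alignment score between the current target hidden state h_t and a source hidden state h̄_s comes in dot, general, and concat variants (the original slide showed these as an image; they are reproduced here from the paper):

```latex
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{(dot)} \\
h_t^\top W_a \bar{h}_s & \text{(general)} \\
v_a^\top \tanh\!\left(W_a [h_t ; \bar{h}_s]\right) & \text{(concat)}
\end{cases}
```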
36
Local Attention Model
Compute a best aligned position p_t first, then compute a context vector centered at that position (see the formulas below).
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
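For reference, the predictive local attention from Luong et al., sketched from the paper (treat exact symbols as a paraphrase): the model predicts an aligned position p_t in the source sentence of length S, then weights the alignment scores with a Gaussian window of half-width D centered at p_t:

```latex
p_t = S \cdot \mathrm{sigmoid}\!\left(v_p^\top \tanh(W_p h_t)\right), \qquad
a_t(s) = \mathrm{align}(h_t, \bar{h}_s)\,
         \exp\!\left(-\frac{(s - p_t)^2}{2\sigma^2}\right), \quad \sigma = \frac{D}{2}
```

with the context vector computed only over the window [p_t - D, p_t + D].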
37
Results: local and global models
"Effective Approaches to Attention-based Neural Machine Translation", Minh-Thang Luong, Hieu Pham, Christopher D. Manning, EMNLP 2015
38
Recall: RNN for Captioning
A CNN encodes the image (H x W x 3) into a single feature vector (dimension D), which initializes the RNN hidden state h0 (dimension H). At each timestep the RNN consumes the previous word (y1 = first word, y2 = second word, ...) and emits a distribution over the vocabulary (d1, d2, ...).
The RNN only looks at the whole image, once.
What if the RNN looks at different parts of the image at each timestep?
45
Soft Attention for Captioning
The CNN now produces a grid of features (L locations x D dimensions) from the image (H x W x 3), rather than a single vector. At each timestep, the hidden state produces a distribution a over the L locations; a weighted combination of the features gives a D-dimensional context vector z, which is fed into the next RNN step together with the previous word y. Each step emits both a distribution over the vocabulary (d1, d2, ...) and the attention distribution for the next step (a1, a2, ...).
Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
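A minimal per-timestep sketch of this mechanism, assuming PyTorch; the scoring network, the use of a GRU cell, and all names are illustrative simplifications rather than Xu et al.'s exact model:

```python
# One decoding step of soft attention over an L x D feature grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttendStep(nn.Module):
    def __init__(self, D, H, vocab):
        super().__init__()
        self.score = nn.Linear(D + H, 1)   # scores each feature location given the hidden state
        self.rnn = nn.GRUCell(D + H, H)    # consumes [context z ; word embedding] -- simplified
        self.emb = nn.Embedding(vocab, H)
        self.out = nn.Linear(H, vocab)

    def forward(self, feats, h, prev_word):
        # feats: (B, L, D), h: (B, H), prev_word: (B,) token ids
        L = feats.size(1)
        h_rep = h.unsqueeze(1).expand(-1, L, -1)               # (B, L, H)
        e = self.score(torch.cat([feats, h_rep], dim=-1))      # (B, L, 1) unnormalized scores
        a = F.softmax(e.squeeze(-1), dim=-1)                   # distribution over the L locations
        z = torch.bmm(a.unsqueeze(1), feats).squeeze(1)        # (B, D) weighted combination of features
        h_next = self.rnn(torch.cat([z, self.emb(prev_word)], dim=-1), h)
        d = F.log_softmax(self.out(h_next), dim=-1)            # distribution over the vocabulary
        return d, a, h_next
```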
54
Soft vs Hard Attention
A CNN produces a grid of features a, b, c, d (each D-dimensional) from the image (H x W x 3). From the RNN we get a distribution over grid locations: pa + pb + pc + pd = 1.
Soft attention: summarize ALL locations, z = pa*a + pb*b + pc*c + pd*d. The derivative dz/dp is nice! Train with gradient descent.
Hard attention: sample ONE location according to p and set z to that feature vector. With argmax, dz/dp is zero almost everywhere, so we can't use gradient descent; we need reinforcement learning.
Xu et al, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
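A small numerical sketch of this distinction, using NumPy with made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
grid = rng.normal(size=(4, 8))        # features a, b, c, d: 4 grid locations, each D=8 dims
p = np.array([0.1, 0.6, 0.2, 0.1])    # distribution from the RNN, pa + pb + pc + pd = 1

# Soft attention: expectation over ALL locations -- differentiable in p, trainable by backprop.
z_soft = p @ grid                     # z = pa*a + pb*b + pc*c + pd*d

# Hard attention: sample ONE location -- the sampling step blocks gradients w.r.t. p,
# which is why hard attention is trained with reinforcement learning (e.g. REINFORCE).
idx = rng.choice(4, p=p)
z_hard = grid[idx]
```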
58
Soft Attention for Captioning
Hard attention
59
Soft Attention for Captioning
60
Soft Attention for Video
“Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
61
Soft Attention for Video
The attention model: “Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
62
Soft Attention for Video
“Describing Videos by Exploiting Temporal Structure,” Li Yao et al, arXiv 2015.
63
Soft Attention for Captioning
Attention constrained to a fixed grid! We'll come back to this…
64
Attending to arbitrary regions?
Features: L x D, Image: H x W x 3
The attention mechanism from Show, Attend and Tell only lets us softly attend to fixed grid positions… can we do better?
65
Attending to Arbitrary Regions
- Read text, generate handwriting using an RNN
- Attend to arbitrary regions of the output by predicting the parameters of a mixture model
Which are real and which are generated? (The figure shows both REAL and GENERATED samples.)
Graves, "Generating Sequences with Recurrent Neural Networks", arXiv 2013
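For reference, Graves' handwriting model implements this with a soft window over the character sequence built from a mixture of K Gaussians whose parameters the RNN predicts at every step (sketched here from the paper; treat the exact symbols as a paraphrase):

```latex
\phi(t, u) = \sum_{k=1}^{K} \alpha_t^k \exp\!\left(-\beta_t^k \left(\kappa_t^k - u\right)^2\right), \qquad
w_t = \sum_{u=1}^{U} \phi(t, u)\, c_u
```

where c_u is the (one-hot) character at position u, and the window centers kappa only move forward, so the attention slides monotonically along the text.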
68
Attending to Arbitrary Regions: DRAW
Classify images by attending to arbitrary regions of the input. Generate images by attending to arbitrary regions of the output.
Gregor et al, "DRAW: A Recurrent Neural Network For Image Generation", ICML 2015
71
Attention Takeaways
Performance: Attention models can improve accuracy and reduce computation at the same time.
Complexity: There are many design choices, and those choices have a big effect on performance. Ensembling has unusually large benefits. Simplify where possible!
72
Attention Takeaways
Explainability: Attention models encode explanations; both the locus and the trajectory of attention help us understand what's going on.
Hard vs. Soft: Soft models are easier to train; hard models require reinforcement learning. They can be combined, as in Luong et al.
73
What did we learn today?
Attention can be quite useful in a variety of applications.
Attention is a form of memory access and can help with capturing long-term dependencies!
74
Questions? Deep robots! Deep questions?!