Published by Kathlyn Cordelia Jordan. Modified over 6 years ago.
1
Hierarchical Question-Image Co-Attention for Visual Question Answering
Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh
2
How to design a model for VQA?
VQA and image captioning Difference with Image captioning.
3
How to design a model for VQA?
Recall the image captioning model VQA and image captioning Difference with Image captioning.
4
How to design a model for VQA?
Recall the image captioning model Task Image caption: Image -> Caption VQA: Image + Question -> Answer VQA and image captioning Difference with Image captioning.
5
How to design a model for VQA?
Recall the image captioning model Task Image caption: Image -> Caption VQA: Image + Question -> Answer Can we develop a VQA model? VQA and image captioning Difference with Image captioning.
8
2-Channel VQA Model
Image Embedding: Image -> Convolution Layer + Non-Linearity -> Pooling Layer -> Fully-Connected MLP -> 4096-dim
Question Embedding: Question "How many horses are in this image?" -> Embedding -> 1024-dim
Image Embedding + Question Embedding -> Neural Network -> Softmax over top K answers
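The two-channel pipeline on this slide can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the slide only says "Neural Network", so the tanh projections, the element-wise product fusion, the hidden size, and every name below (`two_channel_vqa`, `params`, `TOP_K`) are assumptions, and the random vectors stand in for real 4096-dim image and 1024-dim question embeddings.

```python
import math
import random

def linear(x, weights, bias):
    """Return weights @ x + bias for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rng, rows, cols, scale=0.1):
    """Small random weight matrix (stand-in for learned weights)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

def two_channel_vqa(image_feat, question_feat, params):
    """Fuse the two embeddings and return probabilities over K answers."""
    # Project both channels into a shared space (assumed design choice).
    img = [math.tanh(v)
           for v in linear(image_feat, params["W_img"], params["b_img"])]
    qst = [math.tanh(v)
           for v in linear(question_feat, params["W_q"], params["b_q"])]
    # Element-wise product fusion (assumption, not stated on the slide).
    fused = [i * q for i, q in zip(img, qst)]
    # Classify over a fixed list of the K most frequent answers.
    return softmax(linear(fused, params["W_out"], params["b_out"]))

rng = random.Random(0)
# Toy sizes standing in for 4096-dim image and 1024-dim question features.
IMG_DIM, Q_DIM, HIDDEN, TOP_K = 16, 8, 6, 4
params = {
    "W_img": random_matrix(rng, HIDDEN, IMG_DIM), "b_img": [0.0] * HIDDEN,
    "W_q": random_matrix(rng, HIDDEN, Q_DIM), "b_q": [0.0] * HIDDEN,
    "W_out": random_matrix(rng, TOP_K, HIDDEN), "b_out": [0.0] * TOP_K,
}
image_feat = [rng.random() for _ in range(IMG_DIM)]
question_feat = [rng.random() for _ in range(Q_DIM)]

answer_probs = two_channel_vqa(image_feat, question_feat, params)
predicted_answer = max(range(TOP_K), key=lambda k: answer_probs[k])
```

Treating VQA as classification over the top K answers is what lets the final layer be a plain softmax. Note that the image is compressed into a single vector before the question is seen, which is the limitation the attention slides address.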
10
Human Attention (EMNLP 2016)
What is the name of the cafe? - bagdad
What number of cat is laying on bed? - 2
Image credit: Human Attention in Visual Question Answering:
11
Recall: Attention Mechanism (Soft)
[Figure labels: CNN feature map (7, 512); summarize; Query; LSTM chain unrolled over <start>, w1, w2, w3]
12
Attention Mechanism (Soft)
[Figure labels: CNN feature map (7, 512); summarize; Query; LSTM chain unrolled over the question "What is in the image?"]
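A minimal sketch of soft attention as drawn on these slides: a query vector scores every image region, the scores are normalized with a softmax, and the regions are summarized as a weighted average. The scaled dot-product scoring is an assumption made for brevity (the slides do not specify the scoring function), and the three-dimensional toy vectors stand in for what are presumably 512-dim features on a 7 x 7 grid.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(regions, query):
    """Return (weights, context) for dot-product soft attention.

    regions: list of region feature vectors (one per grid cell)
    query:   vector of the same dimension (e.g. an LSTM state)
    """
    scale = math.sqrt(len(query))
    # One relevance score per region (assumed scoring: scaled dot product).
    scores = [sum(r * q for r, q in zip(region, query)) / scale
              for region in regions]
    weights = softmax(scores)
    # "Summarize": the context is the attention-weighted average of regions.
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return weights, context

# Four toy regions; the query is aligned with the second feature dimension.
regions = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [1.0, 1.0, 0.0]]
query = [0.0, 4.0, 0.0]
weights, context = soft_attention(regions, query)
```

Because every weight is strictly positive, the operation stays differentiable end to end, which is what distinguishes soft attention from selecting a single region.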
16
Q: What if the algorithm attends to the wrong place in the first place?
Hint: Let's do another round!
What are sitting in the basket on a bicycle?
Image credit: Stacked Attention Networks for Image Question Answering
17
Stacked Attention Network (CVPR 2016)
Image credit: Stacked Attention Networks for Image Question Answering
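The "another round" idea can be sketched as repeated attention hops in which the attended image summary is added back to the query before the next hop. The additive query update follows the Stacked Attention Networks paper; the plain dot-product scoring is a simplification assumed here (the paper uses a small learned network), and the function names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, query):
    """One attention hop: return (weights, attended summary)."""
    # Simplified scoring (assumption): dot product of region and query.
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    dim = len(regions[0])
    context = [sum(w * region[d] for w, region in zip(weights, regions))
               for d in range(dim)]
    return weights, context

def stacked_attention(regions, query, hops=2):
    """Run several hops, refining the query after each one."""
    history = []
    for _ in range(hops):
        weights, context = attend(regions, query)
        history.append(weights)
        # Refined query = previous query + attended image summary.
        query = [q + c for q, c in zip(query, context)]
    return history, query

# Three toy regions and a question vector.
regions = [[2.0, 0.0],
           [0.0, 2.0],
           [1.5, 1.5]]
question = [1.0, 0.2]
history, refined_query = stacked_attention(regions, question)
```

Each entry of `history` is the attention map of one hop, so comparing `history[0]` with `history[1]` shows how the second round redistributes weight once the query carries visual evidence from the first.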
21
Q: What about the question?
How many horses are in the image?
How many horses can you see in this image?
Hint: Can we do attention on the question as well?
25
Q: What about the question?
What are sitting in the basket on a bicycle?
Hint: Can we use the compositionality of the question?
26
HieCoAttn (NIPS 2016)
Q: What is the color of the bird?
Answer: white
[Figure labels: CNN; question "What is the color of the bird?" broken into words and phrases such as "the color of"]
In this paper, we propose a novel mechanism that jointly reasons about visual attention and question attention. We build a hierarchical architecture that co-attends to the image and question at three levels:
At the word level, we embed the words into a vector space through an embedding matrix.
At the phrase level, 1-dimensional convolutional neural networks are used to capture the information contained in unigrams, bigrams and trigrams.
At the question level, we use recurrent neural networks to encode the entire question.
At each level, we construct the co-attention maps, which are then combined recursively to predict the answers.
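The phrase level described on this slide can be illustrated with a small sketch: three 1-D convolutions with window sizes 1, 2 and 3 run over the word embeddings, and at each word position the maximum across the three n-gram responses is kept. The random embeddings and filters, the right-side zero padding, and all names below are assumptions for illustration only.

```python
import math
import random

def ngram_features(words, filters, size):
    """Apply one 1-D convolution of the given window size at every position."""
    dim = len(words[0])
    # Zero-pad on the right so the output keeps one vector per word
    # (the padding side is an assumption).
    padded = words + [[0.0] * dim] * (size - 1)
    outputs = []
    for t in range(len(words)):
        # Concatenate the embeddings inside the window starting at t.
        window = [v for word in padded[t:t + size] for v in word]
        outputs.append([math.tanh(sum(w * x for w, x in zip(row, window)))
                        for row in filters])
    return outputs

def phrase_level(words, filters_by_size):
    """Max-pool across unigram, bigram and trigram features per position."""
    per_size = [ngram_features(words, filters, size)
                for size, filters in sorted(filters_by_size.items())]
    out_dim = len(per_size[0][0])
    return [[max(feats[t][d] for feats in per_size) for d in range(out_dim)]
            for t in range(len(words))]

rng = random.Random(0)
DIM = 4
tokens = "what is the color of the bird".split()
# Word level: an embedding table maps each distinct word to a vector.
table = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in set(tokens)}
word_features = [table[w] for w in tokens]
# One DIM x (size * DIM) filter bank per n-gram size (random stand-ins).
filters_by_size = {
    size: [[rng.uniform(-0.5, 0.5) for _ in range(size * DIM)]
           for _ in range(DIM)]
    for size in (1, 2, 3)
}
phrase_features = phrase_level(word_features, filters_by_size)
```

The max across window sizes lets each position pick whichever n-gram is most informative there, while the output keeps the same length as the question. In the slide's description, the question level would then run a recurrent network over `phrase_features`.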
28
Hierarchical Co-Attention (NIPS 2016)
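One way to picture co-attention is through an affinity matrix between question words and image regions, from which both attention maps are derived. The sketch below uses a deliberately simplified variant: the affinity is a tanh of a plain dot product with no learned weight matrix, and each attention map is obtained by maximizing the affinity over the other modality. These simplifications, the function names, and the toy vectors are assumptions, not the model's trained formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_sum(weights, vectors):
    """Attention-weighted average of a list of vectors."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(dim)]

def co_attention(question, image):
    """question: T word vectors; image: N region vectors (same dimension)."""
    # affinity[t][n]: similarity of word t and region n (simplified form).
    affinity = [[math.tanh(sum(q * v for q, v in zip(word, region)))
                 for region in image]
                for word in question]
    # Image attention: how strongly each region matches its best word.
    image_weights = softmax([max(row[n] for row in affinity)
                             for n in range(len(image))])
    # Question attention: how strongly each word matches its best region.
    question_weights = softmax([max(row) for row in affinity])
    return (image_weights, question_weights,
            weighted_sum(image_weights, image),
            weighted_sum(question_weights, question))

# Toy inputs: two content words, one weak word; two informative regions
# and one empty region.
question = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
image = [[0.0, 2.0], [2.0, 0.0], [0.0, 0.0]]
image_weights, question_weights, attended_image, attended_question = \
    co_attention(question, image)
```

The same routine could be applied at the word, phrase and question levels, giving one pair of attended vectors per level to combine when predicting the answer.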
29
Base Model, Simple Model for VQA
Aishwarya, Lu, Antol et al. 2015
Zhou et al. 2015
Jabri et al. 2016
In this talk, I'll go over the major recent VQA method papers and my thoughts about those methods.
30
Attention-based VQA model (where to look?)
SAN (Yang et al. 2016)
DMN (Xiong et al. 2016)
HieCoAtt (Lu et al. 2016)
31
Multi-modality Feature Learning (how to fuse features?)
MCB (Fukui et al. 2016)
MLB (Kim et al. 2016)
MUTAN (Ben-younes et al. 2017)
32
Modular Network / Programmer (compositionality)
Modular Network (Andreas et al. 2015)
Johnson et al. 2017
33
Others
(Bottom-up attention) Anderson et al. 2017
(Early Fusion) de Vries, Strub et al. 2017
34
Thanks! Q&A Result of show and tell.