Hierarchical Question-Image Co-Attention for Visual Question Answering

Presentation transcript:

Hierarchical Question-Image Co-Attention for Visual Question Answering Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

How to design a model for VQA? Recall the image captioning model. The tasks differ: image captioning maps Image -> Caption, while VQA maps Image + Question -> Answer. Can we develop a VQA model along the same lines?

2-Channel VQA Model. Image channel: convolution layers + non-linearities and pooling layers, followed by a fully-connected MLP, giving a 4096-dim image embedding. Question channel: embed the question ("How many horses are in this image?") into a 1024-dim vector. A neural network fuses the two embeddings, followed by a softmax over the top K answers.
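A minimal PyTorch sketch of this 2-channel baseline, assuming a precomputed 4096-dim CNN feature (e.g. a VGG fc7 vector) and an LSTM question encoder; the element-wise fusion and the layer sizes here are illustrative choices, not the exact architecture of any one paper:

```python
import torch
import torch.nn as nn

class TwoChannelVQA(nn.Module):
    """Sketch of the 2-channel VQA baseline: fuse an image embedding and
    a question embedding, then classify over the top-K answers."""
    def __init__(self, vocab_size, num_answers, img_dim=4096, q_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)         # word embeddings
        self.lstm = nn.LSTM(300, q_dim, batch_first=True)  # question encoder
        self.img_fc = nn.Linear(img_dim, q_dim)            # project image feature
        self.classifier = nn.Linear(q_dim, num_answers)    # logits over top-K answers

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, 4096) CNN feature, e.g. a VGG fc7 vector (assumed input)
        # question_tokens: (B, T) word indices
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                      # (B, 1024) question embedding
        fused = torch.tanh(self.img_fc(img_feat)) * q  # element-wise fusion
        return self.classifier(fused)                  # softmax is applied in the loss
```

At inference, `logits.softmax(-1).argmax(-1)` picks the most likely of the K candidate answers.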

Human Attention (EMNLP 2016). Example question-answer pairs: "What is the name of the cafe?" - bagdad; "What number of cat is laying on bed?" - 2. Image credit: Human Attention in Visual Question Answering.

Recall: Attention Mechanism (Soft). [Diagram: a CNN produces a 7 x 7 x 512 feature map; an LSTM processes the question ("What is in the image?") word by word to produce a query vector; the query summarizes the feature map into a weighted average.]
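A hedged sketch of that soft-attention step, assuming the query is the final LSTM state over the question and the image grid is the 7x7x512 map from the slide, flattened to 49 regions; the projection sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention: score each of the 7x7 = 49 image regions against a
    query vector, then return the attention-weighted average feature."""
    def __init__(self, feat_dim=512, query_dim=1024, hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden)   # project image regions
        self.proj_q = nn.Linear(query_dim, hidden)  # project the query
        self.score = nn.Linear(hidden, 1)           # one scalar score per region

    def forward(self, feats, query):
        # feats: (B, 49, 512) flattened 7x7 CNN grid; query: (B, query_dim)
        h = torch.tanh(self.proj_v(feats) + self.proj_q(query).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)     # (B, 49, 1) attention map
        summary = (alpha * feats).sum(dim=1)        # (B, 512) weighted average
        return summary, alpha
```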

Q: What if the algorithm attends to the wrong place the first time? Hint: let's do another round! Example: "What are sitting in the basket on a bicycle?" Image credit: Stacked Attention Networks for Image Question Answering.

Stacked Attention Network (CVPR 2016) Image credit: Stacked Attention Networks for Image Question Answering
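A sketch of the stacking idea, reusing the SoftAttention module from the previous snippet. The additive query refinement (each hop adds the attended image vector to the query) follows the SAN formulation; the dimensions are again illustrative:

```python
import torch.nn as nn

class StackedAttention(nn.Module):
    """SAN-style stacking: each hop attends to the image, then refines the
    query with the attended vector (reuses SoftAttention from above)."""
    def __init__(self, feat_dim=512, hops=2):
        super().__init__()
        self.attn_hops = nn.ModuleList(
            [SoftAttention(feat_dim, query_dim=feat_dim) for _ in range(hops)])

    def forward(self, feats, question_vec):
        u = question_vec                  # (B, 512) initial query from the question
        for attn in self.attn_hops:
            summary, _ = attn(feats, u)   # attend to the 49 regions
            u = u + summary               # refined query for the next round
        return u                          # final joint representation
```

A second hop lets the model recover when the first attention map lands on the wrong region, which is exactly the failure mode the slide asks about.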

Q: What about the question? The same intent can be phrased in different ways: "how many horses are in the image?" vs. "how many horses can you see in this image?" In both, the informative words are "how many horses". Hint: can we do attention on the question as well?

Q: What about the question? "What are sitting in the basket on a bicycle?" Hint: can we use the compositionality of the question (words, phrases, the whole sentence)?

HieCoAttn (NIPS 2016). Q: what is the color of the bird? Answer: white. [Diagram: the question "what is the color of the bird?" is encoded at word, phrase, and question level; a CNN encodes the image; co-attention is computed at each level.] Speaker notes: In this paper, we propose a novel mechanism that jointly reasons about visual attention and question attention. We build a hierarchical architecture that co-attends to the image and question at three levels. At the word level, we embed the words into a vector space through an embedding matrix. At the phrase level, 1-dimensional convolutional neural networks capture the information contained in unigrams, bigrams and trigrams. At the question level, we use recurrent neural networks to encode the entire question. At each level, we construct co-attention maps, which are then combined recursively to predict the answer.
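A hedged sketch of that three-level question encoding (word embeddings, 1-D convolutions pooled over unigram/bigram/trigram windows, and an LSTM for the question level). It follows the description above, but the activation choices and sizes are simplified guesses:

```python
import torch
import torch.nn as nn

class HierarchicalQuestion(nn.Module):
    """Encode a question at three levels: word (embedding), phrase (1-D
    convs over 1/2/3-gram windows, max-pooled), question (LSTM)."""
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList(             # unigram, bigram, trigram filters
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in (1, 2, 3)])
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):
        w = self.embed(tokens)                      # (B, T, D) word level
        x = w.transpose(1, 2)                       # (B, D, T) for Conv1d
        T = x.size(2)
        grams = [torch.tanh(conv(x))[:, :, :T] for conv in self.convs]
        p = torch.stack(grams, -1).max(-1).values   # max across n-gram sizes
        p = p.transpose(1, 2)                       # (B, T, D) phrase level
        q, _ = self.lstm(p)                         # (B, T, D) question level
        return w, p, q
```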

Hierarchical Co-Attention (NIPS 2016)
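And a simplified sketch of the co-attention step applied at each level: an affinity matrix between question and image features drives attention in both directions. The paper adds further learned transforms before scoring; pooling the affinity matrix directly, as here, is a common simplification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Simplified parallel co-attention: a question-image affinity matrix
    yields an image attention and a question attention simultaneously."""
    def __init__(self, dim=512):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # learned affinity transform

    def forward(self, Q, V):
        # Q: (B, T, D) question features at one level; V: (B, N, D) image regions
        C = torch.tanh(self.W(Q) @ V.transpose(1, 2))   # (B, T, N) affinity
        a_v = F.softmax(C.max(dim=1).values, dim=-1)    # (B, N) image attention
        a_q = F.softmax(C.max(dim=2).values, dim=-1)    # (B, T) question attention
        v_hat = (a_v.unsqueeze(-1) * V).sum(dim=1)      # attended image vector
        q_hat = (a_q.unsqueeze(-1) * Q).sum(dim=1)      # attended question vector
        return q_hat, v_hat
```

Per the speaker notes above, the attended question and image vectors from the word, phrase, and question levels are then combined recursively by an MLP to predict the answer.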

Base model: simple models for VQA. Agrawal, Lu, Antol et al. 2015; Zhou et al. 2015; Jabri et al. 2016. (Speaker notes: in this talk, I'll go over the major recent VQA method papers and my thoughts about those methods.)

Attention-based VQA models (where to look?): SAN (Yang et al. 2016), DMN (Xiong et al. 2016), HieCoAtt (Lu et al. 2016).

Multi-modal feature learning (how to fuse features?): MCB (Fukui et al. 2016), MLB (Kim et al. 2016), MUTAN (Ben-Younes et al. 2017).

Modular networks / programs (compositionality): Neural Module Networks (Andreas et al. 2015), Johnson et al. 2017.

Others: bottom-up attention (Anderson et al. 2017); early fusion (de Vries, Strub et al. 2017).

Thanks! Q&A