Hierarchical Question-Image Co-Attention for Visual Question Answering

Presentation transcript:

Hierarchical Question-Image Co-Attention for Visual Question Answering Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

How to design a model for VQA? Recall the image captioning model. The tasks differ: image captioning maps Image -> Caption, while VQA maps Image + Question -> Answer. Can we develop a VQA model along the same lines?

2-Channel VQA Model. Image channel: convolution layers + non-linearities and pooling layers, followed by a fully-connected MLP, giving a 4096-dim image embedding. Question channel: embed the question ("How many horses are in this image?") into a 1024-dim vector. A neural network fuses the two embeddings, followed by a softmax over the top K answers.
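A minimal PyTorch sketch of this 2-channel baseline, assuming a precomputed 4096-dim CNN feature (e.g. a VGG fc7 vector) and an LSTM question encoder; the element-wise fusion and the layer sizes here are illustrative choices, not the exact architecture of any one paper:

```python
import torch
import torch.nn as nn

class TwoChannelVQA(nn.Module):
    """Sketch of the 2-channel VQA baseline: fuse an image embedding and
    a question embedding, then classify over the top-K answers."""
    def __init__(self, vocab_size, num_answers, img_dim=4096, q_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)         # word embeddings
        self.lstm = nn.LSTM(300, q_dim, batch_first=True)  # question encoder
        self.img_fc = nn.Linear(img_dim, q_dim)            # project image feature
        self.classifier = nn.Linear(q_dim, num_answers)    # logits over top-K answers

    def forward(self, img_feat, question_tokens):
        # img_feat: (B, 4096) CNN feature, e.g. a VGG fc7 vector (assumed input)
        # question_tokens: (B, T) word indices
        _, (h, _) = self.lstm(self.embed(question_tokens))
        q = h[-1]                                      # (B, 1024) question embedding
        fused = torch.tanh(self.img_fc(img_feat)) * q  # element-wise fusion
        return self.classifier(fused)                  # softmax is applied in the loss
```

At inference, `logits.softmax(-1).argmax(-1)` picks the most likely of the K candidate answers.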

Human Attention (EMNLP 2016). Example question-answer pairs: "What is the name of the cafe?" - bagdad; "What number of cat is laying on bed?" - 2. Image credit: Human Attention in Visual Question Answering.

Recall: Attention Mechanism (Soft). [Diagram: a CNN produces a 7 x 7 x 512 feature map; an LSTM processes the question ("What is in the image?") word by word to produce a query vector; the query summarizes the feature map into a weighted average.]
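A hedged sketch of that soft-attention step, assuming the query is the final LSTM state over the question and the image grid is the 7x7x512 map from the slide, flattened to 49 regions; the projection sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Soft attention: score each of the 7x7 = 49 image regions against a
    query vector, then return the attention-weighted average feature."""
    def __init__(self, feat_dim=512, query_dim=1024, hidden=512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden)   # project image regions
        self.proj_q = nn.Linear(query_dim, hidden)  # project the query
        self.score = nn.Linear(hidden, 1)           # one scalar score per region

    def forward(self, feats, query):
        # feats: (B, 49, 512) flattened 7x7 CNN grid; query: (B, query_dim)
        h = torch.tanh(self.proj_v(feats) + self.proj_q(query).unsqueeze(1))
        alpha = F.softmax(self.score(h), dim=1)     # (B, 49, 1) attention map
        summary = (alpha * feats).sum(dim=1)        # (B, 512) weighted average
        return summary, alpha
```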

Q: What if the algorithm attends to the wrong place the first time? Hint: let's do another round! Example: "What are sitting in the basket on a bicycle?" Image credit: Stacked Attention Networks for Image Question Answering.

Stacked Attention Network (CVPR 2016) Image credit: Stacked Attention Networks for Image Question Answering
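A sketch of the stacking idea, reusing the SoftAttention module from the previous snippet. The additive query refinement (each hop adds the attended image vector to the query) follows the SAN formulation; the dimensions are again illustrative:

```python
import torch.nn as nn

class StackedAttention(nn.Module):
    """SAN-style stacking: each hop attends to the image, then refines the
    query with the attended vector (reuses SoftAttention from above)."""
    def __init__(self, feat_dim=512, hops=2):
        super().__init__()
        self.attn_hops = nn.ModuleList(
            [SoftAttention(feat_dim, query_dim=feat_dim) for _ in range(hops)])

    def forward(self, feats, question_vec):
        u = question_vec                  # (B, 512) initial query from the question
        for attn in self.attn_hops:
            summary, _ = attn(feats, u)   # attend to the 49 regions
            u = u + summary               # refined query for the next round
        return u                          # final joint representation
```

A second hop lets the model recover when the first attention map lands on the wrong region, which is exactly the failure mode the slide asks about.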

Q: What about the question? The same intent can be phrased in different ways: "how many horses are in the image?" vs. "how many horses can you see in this image?" In both, the informative words are "how many horses". Hint: can we do attention on the question as well?

Q: What about the question? "What are sitting in the basket on a bicycle?" Hint: can we use the compositionality of the question (words, phrases, the whole sentence)?

HieCoAttn (NIPS 2016). Q: what is the color of the bird? Answer: white. [Diagram: the question "what is the color of the bird?" is encoded at word, phrase, and question level; a CNN encodes the image; co-attention is computed at each level.] Speaker notes: In this paper, we propose a novel mechanism that jointly reasons about visual attention and question attention. We build a hierarchical architecture that co-attends to the image and question at three levels. At the word level, we embed the words into a vector space through an embedding matrix. At the phrase level, 1-dimensional convolutional neural networks capture the information contained in unigrams, bigrams and trigrams. At the question level, we use recurrent neural networks to encode the entire question. At each level, we construct co-attention maps, which are then combined recursively to predict the answer.
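A hedged sketch of that three-level question encoding (word embeddings, 1-D convolutions pooled over unigram/bigram/trigram windows, and an LSTM for the question level). It follows the description above, but the activation choices and sizes are simplified guesses:

```python
import torch
import torch.nn as nn

class HierarchicalQuestion(nn.Module):
    """Encode a question at three levels: word (embedding), phrase (1-D
    convs over 1/2/3-gram windows, max-pooled), question (LSTM)."""
    def __init__(self, vocab_size, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.convs = nn.ModuleList(             # unigram, bigram, trigram filters
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in (1, 2, 3)])
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, tokens):
        w = self.embed(tokens)                      # (B, T, D) word level
        x = w.transpose(1, 2)                       # (B, D, T) for Conv1d
        T = x.size(2)
        grams = [torch.tanh(conv(x))[:, :, :T] for conv in self.convs]
        p = torch.stack(grams, -1).max(-1).values   # max across n-gram sizes
        p = p.transpose(1, 2)                       # (B, T, D) phrase level
        q, _ = self.lstm(p)                         # (B, T, D) question level
        return w, p, q
```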

Hierarchical Co-Attention (NIPS 2016)
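And a simplified sketch of the co-attention step applied at each level: an affinity matrix between question and image features drives attention in both directions. The paper adds further learned transforms before scoring; pooling the affinity matrix directly, as here, is a common simplification:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    """Simplified parallel co-attention: a question-image affinity matrix
    yields an image attention and a question attention simultaneously."""
    def __init__(self, dim=512):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # learned affinity transform

    def forward(self, Q, V):
        # Q: (B, T, D) question features at one level; V: (B, N, D) image regions
        C = torch.tanh(self.W(Q) @ V.transpose(1, 2))   # (B, T, N) affinity
        a_v = F.softmax(C.max(dim=1).values, dim=-1)    # (B, N) image attention
        a_q = F.softmax(C.max(dim=2).values, dim=-1)    # (B, T) question attention
        v_hat = (a_v.unsqueeze(-1) * V).sum(dim=1)      # attended image vector
        q_hat = (a_q.unsqueeze(-1) * Q).sum(dim=1)      # attended question vector
        return q_hat, v_hat
```

Per the speaker notes above, the attended question and image vectors from the word, phrase, and question levels are then combined recursively by an MLP to predict the answer.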

Base model: simple models for VQA. Agrawal, Lu, Antol et al. 2015; Zhou et al. 2015; Jabri et al. 2016. (Speaker notes: in this talk, I'll go over the major recent VQA method papers and my thoughts about those methods.)

Attention-based VQA models (where to look?): SAN (Yang et al. 2016), DMN (Xiong et al. 2016), HieCoAtt (Lu et al. 2016).

Multi-modal feature learning (how to fuse features?): MCB (Fukui et al. 2016), MLB (Kim et al. 2016), MUTAN (Ben-Younes et al. 2017).

Modular networks / programs (compositionality): Neural Module Networks (Andreas et al. 2015), Johnson et al. 2017.

Others: bottom-up attention (Anderson et al. 2017); early fusion (de Vries, Strub et al. 2017).

Thanks! Q&A