Image Question Answering

Image Question Answering
Computer Vision Lab, Paul Hongsuck Seo

Taxonomy of Sub-problems (by Hyeonwoo)
- Classification with Complex Setting
  - Multi-domain classification
  - Classification with input/output connection
  - Zero-shot learning
- Novel Computer Vision Tasks
  - Reference problem
  - Spatial relation problem
  - Visual semantic role labeling [6]
  - Weakly-supervised learning to count
- Data Efficiency Problem
  - Operation compositionality
  - Image QA task compositionality
- Natural Language Understanding
  - Extracting the operation and its input from the question
- Multi-modal Information Mergence
  - Merging multi-modal information better

Reference Problem
Different answers are correct depending on the target of the question:
- What color is the cup?
- What color is the teapot?
- What color is the spoon?

Spatial Relation Understanding
Targets can be specified implicitly by their spatial relation to other objects:
- What is behind the horse?
- What is in front of the bed?
- What is beside the cat?

General Idea – Attention Model
Attention model in caption generation: attend first, then say. At each step the model attends to a region of the image and then emits the next caption word (e.g. <BOS> "man is standing" <EOS>).
[Figure: attention-based caption generation diagram]

General Idea – Attention Model
Reverse approach for image QA: listen first, then attend. The model first reads the whole question ("is man standing?") and only then attends to the image to produce the answer ("yes").
[Figure: attention-based question answering diagram]
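
As a concrete illustration of the "listen first, then attend" idea, here is a minimal PyTorch sketch (not from the slides; the module, dimensions, and the tanh/softmax attention form are all assumptions): the question encoding is combined with every spatial position of the CNN feature map to produce attention weights, and the attended feature is used to classify the answer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ListenThenAttend(nn.Module):
    """Soft attention over CNN features conditioned on a question encoding (sketch)."""
    def __init__(self, feat_dim=512, q_dim=256, hidden=256, n_answers=1000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.v_proj = nn.Conv2d(feat_dim, hidden, kernel_size=1)
        self.att = nn.Conv2d(hidden, 1, kernel_size=1)
        self.classifier = nn.Linear(feat_dim + q_dim, n_answers)

    def forward(self, feat_map, q_enc):
        # feat_map: (B, feat_dim, H, W); q_enc: (B, q_dim), e.g. the last LSTM state of the question
        B, C, H, W = feat_map.shape
        q = self.q_proj(q_enc).unsqueeze(-1).unsqueeze(-1)                 # (B, hidden, 1, 1)
        joint = torch.tanh(self.v_proj(feat_map) + q)                      # (B, hidden, H, W)
        alpha = F.softmax(self.att(joint).view(B, -1), dim=1)              # (B, H*W) attention weights
        attended = (feat_map.view(B, C, -1) * alpha.unsqueeze(1)).sum(2)   # (B, feat_dim)
        return self.classifier(torch.cat([attended, q_enc], dim=1))        # answer logits
```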

MNIST-Reference Dataset
- A synthetic dataset for the reference problem, built from MNIST
- Task: predict the color of a given digit in the image
- Input: the image plus one-hot information indicating the digit of interest
- Only partial information is available from the answers
- The objective function does not explicitly model the use of the extra information
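
The slides do not give construction details, but a dataset along these lines could be generated as in the following sketch (the color palette, canvas size, number of digits, and sampling scheme are all assumptions; duplicate digits that would make the answer ambiguous are not handled):

```python
import random
import numpy as np
from torchvision import datasets

COLORS = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1), "yellow": (1, 1, 0)}  # assumed palette

def make_reference_sample(mnist, n_digits=3, canvas=64):
    """Return (rgb image, one-hot target digit, color name of that digit)."""
    img = np.zeros((canvas, canvas, 3), dtype=np.float32)
    placed = []
    for _ in range(n_digits):
        digit_img, label = mnist[random.randrange(len(mnist))]
        digit = np.array(digit_img, dtype=np.float32) / 255.0          # (28, 28) grayscale stroke
        color_name, rgb = random.choice(list(COLORS.items()))
        y, x = random.randint(0, canvas - 28), random.randint(0, canvas - 28)
        img[y:y+28, x:x+28] = np.maximum(img[y:y+28, x:x+28], digit[..., None] * rgb)
        placed.append((label, color_name))
    target_label, answer_color = random.choice(placed)                 # ask about one of the placed digits
    one_hot = np.eye(10, dtype=np.float32)[target_label]
    return img, one_hot, answer_color

mnist = datasets.MNIST(root="data", train=True, download=True)
image, question, answer = make_reference_sample(mnist)
```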

Multi-modal Information Mergence
In some tasks the answer classes carry no information about the task itself, e.g. yes/no questions.
Example task: answer whether a given number appears in the image.
- image + number: 3 = yes
- image + number: 6 = no

Models
- concat: image -> conv (20, 3, 3) -> pool (2, 2) -> flatten (); one_hot (10); concat () -> linear (200) -> softmax (2)
- emb: image -> conv (20, 3, 3) -> pool (2, 2) -> fc (200); one_hot (10) -> emb (200); concat () -> linear (256) -> softmax (2)
- gating: image -> conv (20, 3, 3) -> pool (2, 2) -> fc (200); one_hot (10) -> emb (200) -> nonlinearity (); el-mul () -> linear (256) -> softmax (2)
- cgating: image -> conv (20, 3, 3) -> pool (2, 2); one_hot (10) -> emb (20) -> nonlinearity (); cgating () -> fc (200) -> linear (256) -> softmax (2)
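
A compact PyTorch sketch of the gating variant (layer sizes follow the slide; the input channel count, the ReLU/tanh choices, and the hidden ReLU are assumptions):

```python
import torch
import torch.nn as nn

class GatingModel(nn.Module):
    """Image feature (CNN fc output) element-wise gated by an embedded one-hot query."""
    def __init__(self, in_channels=3, n_classes=2):              # input channels assumed (colored canvas)
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 20, kernel_size=3)    # conv (20, 3, 3)
        self.pool = nn.MaxPool2d(2, 2)                           # pool (2, 2)
        self.fc = nn.LazyLinear(200)                             # fc (200) on the flattened conv features
        self.emb = nn.Linear(10, 200)                            # emb (200) of the one_hot (10) query
        self.head = nn.Sequential(nn.Linear(200, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, image, one_hot):
        v = self.fc(self.pool(torch.relu(self.conv(image))).flatten(1))   # image feature, shape (B, 200)
        q = torch.tanh(self.emb(one_hot))                                  # query feature + nonlinearity ()
        return self.head(v * q)   # el-mul () gating, then linear (256) -> logits; softmax (2) lives in the loss
```

The concat and emb variants replace the element-wise product with concatenation of the two vectors; the cgating variant is sketched after the next slide.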

Convolutional Gating
- Individual pixel-wise gating of the channels in a 2D feature map
- Spatial feature selection
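
One reading of this slide (an assumption, since the exact formulation is not given): the embedded query produces one gate per channel, and the same gates are applied independently at every spatial position of the conv feature map, so whole channels, and with them the spatial patterns they respond to, can be switched on or off.

```python
import torch
import torch.nn as nn

class ConvGating(nn.Module):
    """Gate the channels of a conv feature map at every spatial position (cgating sketch)."""
    def __init__(self, n_channels=20, q_dim=10, gate_fn=torch.sigmoid):
        super().__init__()
        self.emb = nn.Linear(q_dim, n_channels)   # emb (20): one gate value per conv channel
        self.gate_fn = gate_fn                    # linear / tanh / sigmoid are compared in the results

    def forward(self, feat_map, one_hot):
        # feat_map: (B, C, H, W); the same C gates are broadcast over all H x W positions
        gates = self.gate_fn(self.emb(one_hot)).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return feat_map * gates                   # pixel-wise channel gating -> spatial feature selection
```

For the linear setting, the gate function is simply the identity, e.g. gate_fn=lambda x: x.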

Results (rows: training set size)

training size | concat | emb   | gating: linear | gating: tanh | gating: sigmoid | cgating: linear | cgating: tanh | cgating: sigmoid
60000         | 58.52  | 58.75 | 64.24          | 63.56        | 60.30           | 62.44           | 64.03         | 61.66
200000        | 79.19  | 77.21 | 85.73          | 85.91        | 81.15           | 90.80           | 89.30         | 81.69

gating (sigmoid), by variance of the initial values:
training size | small var | large var
60000         | 58.49     | 59.98
200000        | 58.13     | 81.15

cgating (sigmoid), by variance of the initial values:
training size | small var | large var
60000         | 58.40     | 61.66
200000        | 58.33     | 81.69

Discussion
- Gating (element-wise multiplication) aligns the semantic arrangement of the two modalities: the values in each dimension of the two vectors become semantically aligned.
- Multiplying the original feature representation performed better than multiplying the output of a nonlinear function: linear multiplication outperformed in both the gating and cgating settings.
- The gating models showed a strong dependency on the variance of the initial values: different variances performed better on different problems, gating types, and dataset sizes.

More Discussion
- Sigmoid gating performed better on the real dataset (experiments by Hyeonwoo), which contains no yes/no questions; still to test: a set with only yes/no questions.
- The task might be used for pre-training a spatial transformer network.
- Examples: image + number: 3 = yes; image + number: 6 = no; image + "What is the animal?" = elephant

Back to the Reference Problem: Attention Model
The previous models are connected as the localization network of the attention model (CNN + localization network).
Example: image + number: 3 -> attend to the digit 3 -> yellow

Spatial Transformer Network
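
For reference, a minimal spatial transformer step in PyTorch using the standard affine_grid/grid_sample pair (the toy localization network, its layer sizes, and the identity initialization are assumptions, not the network from the slide):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Predict a 2x3 affine transform from the input and resample the image with it."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.loc_feat = nn.Sequential(                       # toy localization network (assumed)
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 32), nn.ReLU(),
        )
        self.loc_head = nn.Linear(32, 6)                     # the 6 parameters of the affine matrix
        self.loc_head.weight.data.zero_()                    # start from the identity transform
        self.loc_head.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc_head(self.loc_feat(x)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)   # sampling grid from the transform
        return F.grid_sample(x, grid, align_corners=False)           # differentiable warp of the input
```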

Results (blank cells are still to be filled in; see To Do)

training size | baseline (concat) | concat | emb   | gating: linear | gating: tanh | gating: sigmoid | cgating: linear | cgating: tanh | cgating: sigmoid
60000         | 43.13             | 43.15  | 43.77 | 43.96          | 43.57        | 43.71           | 43.73           |               |
200000        | 65.25             |        | 91.13 | 92.78          | 81.50        | 90.12           | 87.10           |               |

To Do
- Fill in the blanks in the results
- Compare different parameters for the gating models
- Compare on larger networks (for the cgating model)
- Apply to the recurrent attention reasoning model with an LSTM (or GRU)