Image Question Answering

Image Question Answering
Computer Vision Lab, Paul Hongsuck Seo

Taxonomy of Sub-problems (by Hyeonwoo)
- Classification with Complex Setting
  - Multi-domain classification
  - Classification with input/output connection
  - Zero-shot learning
- Novel Computer Vision Tasks
  - Reference problem
  - Spatial relation problem
  - Visual semantic role labeling [6]
  - Weakly-supervised learning to count
- Data Efficiency Problem
  - Operation compositionality
  - Image QA task compositionality
- Natural Language Understanding
  - Extracting the operation and its input from the question
- Multi-modal Information Mergence
  - Merging multi-modal information better

Reference Problem
Different answers are correct depending on the target of the question:
- What color is the cup?
- What color is the teapot?
- What color is the spoon?

Spatial Relation Understanding
Targets can be specified implicitly by their spatial relation to other objects:
- What is behind the horse?
- What is in front of the bed?
- What is beside the cat?

General Idea – Attention Model
Attention model in caption generation: attend first, then say. At each step the model attends to a region of the image and then emits the next caption word (e.g. <BOS> "man is standing" <EOS>).
[Figure: attention-based caption generation diagram]

General Idea – Attention Model
Reverse approach for image QA: listen first, then attend. The model first reads the whole question ("is man standing?") and only then attends to the image to produce the answer ("yes").
[Figure: attention-based question answering diagram]
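
As a concrete illustration of the "listen first, then attend" idea, here is a minimal PyTorch sketch (not from the slides; the module, dimensions, and the tanh/softmax attention form are all assumptions): the question encoding is combined with every spatial position of the CNN feature map to produce attention weights, and the attended feature is used to classify the answer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ListenThenAttend(nn.Module):
    """Soft attention over CNN features conditioned on a question encoding (sketch)."""
    def __init__(self, feat_dim=512, q_dim=256, hidden=256, n_answers=1000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, hidden)
        self.v_proj = nn.Conv2d(feat_dim, hidden, kernel_size=1)
        self.att = nn.Conv2d(hidden, 1, kernel_size=1)
        self.classifier = nn.Linear(feat_dim + q_dim, n_answers)

    def forward(self, feat_map, q_enc):
        # feat_map: (B, feat_dim, H, W); q_enc: (B, q_dim), e.g. the last LSTM state of the question
        B, C, H, W = feat_map.shape
        q = self.q_proj(q_enc).unsqueeze(-1).unsqueeze(-1)                 # (B, hidden, 1, 1)
        joint = torch.tanh(self.v_proj(feat_map) + q)                      # (B, hidden, H, W)
        alpha = F.softmax(self.att(joint).view(B, -1), dim=1)              # (B, H*W) attention weights
        attended = (feat_map.view(B, C, -1) * alpha.unsqueeze(1)).sum(2)   # (B, feat_dim)
        return self.classifier(torch.cat([attended, q_enc], dim=1))        # answer logits
```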

MNIST-Reference Dataset
- A synthetic dataset for the reference problem, built from MNIST
- Task: predict the color of a given digit in the image
- Input: the image plus one-hot information indicating the digit of interest
- Only partial information is available from the answers
- The objective function does not explicitly model the use of the extra information
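
The slides do not give construction details, but a dataset along these lines could be generated as in the following sketch (the color palette, canvas size, number of digits, and sampling scheme are all assumptions; duplicate digits that would make the answer ambiguous are not handled):

```python
import random
import numpy as np
from torchvision import datasets

COLORS = {"red": (1, 0, 0), "green": (0, 1, 0), "blue": (0, 0, 1), "yellow": (1, 1, 0)}  # assumed palette

def make_reference_sample(mnist, n_digits=3, canvas=64):
    """Return (rgb image, one-hot target digit, color name of that digit)."""
    img = np.zeros((canvas, canvas, 3), dtype=np.float32)
    placed = []
    for _ in range(n_digits):
        digit_img, label = mnist[random.randrange(len(mnist))]
        digit = np.array(digit_img, dtype=np.float32) / 255.0          # (28, 28) grayscale stroke
        color_name, rgb = random.choice(list(COLORS.items()))
        y, x = random.randint(0, canvas - 28), random.randint(0, canvas - 28)
        img[y:y+28, x:x+28] = np.maximum(img[y:y+28, x:x+28], digit[..., None] * rgb)
        placed.append((label, color_name))
    target_label, answer_color = random.choice(placed)                 # ask about one of the placed digits
    one_hot = np.eye(10, dtype=np.float32)[target_label]
    return img, one_hot, answer_color

mnist = datasets.MNIST(root="data", train=True, download=True)
image, question, answer = make_reference_sample(mnist)
```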

Multi-modal Information Mergence
In some tasks the answer classes carry no information about the task itself, e.g. yes/no questions.
Example task: answer whether a given number appears in the image.
- image + number: 3 = yes
- image + number: 6 = no

Models
- concat: image -> conv (20, 3, 3) -> pool (2, 2) -> flatten (); one_hot (10); concat () -> linear (200) -> softmax (2)
- emb: image -> conv (20, 3, 3) -> pool (2, 2) -> fc (200); one_hot (10) -> emb (200); concat () -> linear (256) -> softmax (2)
- gating: image -> conv (20, 3, 3) -> pool (2, 2) -> fc (200); one_hot (10) -> emb (200) -> nonlinearity (); el-mul () -> linear (256) -> softmax (2)
- cgating: image -> conv (20, 3, 3) -> pool (2, 2); one_hot (10) -> emb (20) -> nonlinearity (); cgating () -> fc (200) -> linear (256) -> softmax (2)
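
A compact PyTorch sketch of the gating variant (layer sizes follow the slide; the input channel count, the ReLU/tanh choices, and the hidden ReLU are assumptions):

```python
import torch
import torch.nn as nn

class GatingModel(nn.Module):
    """Image feature (CNN fc output) element-wise gated by an embedded one-hot query."""
    def __init__(self, in_channels=3, n_classes=2):              # input channels assumed (colored canvas)
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 20, kernel_size=3)    # conv (20, 3, 3)
        self.pool = nn.MaxPool2d(2, 2)                           # pool (2, 2)
        self.fc = nn.LazyLinear(200)                             # fc (200) on the flattened conv features
        self.emb = nn.Linear(10, 200)                            # emb (200) of the one_hot (10) query
        self.head = nn.Sequential(nn.Linear(200, 256), nn.ReLU(), nn.Linear(256, n_classes))

    def forward(self, image, one_hot):
        v = self.fc(self.pool(torch.relu(self.conv(image))).flatten(1))   # image feature, shape (B, 200)
        q = torch.tanh(self.emb(one_hot))                                  # query feature + nonlinearity ()
        return self.head(v * q)   # el-mul () gating, then linear (256) -> logits; softmax (2) lives in the loss
```

The concat and emb variants replace the element-wise product with concatenation of the two vectors; the cgating variant is sketched after the next slide.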

Convolutional Gating
- Individual pixel-wise gating of the channels in a 2D feature map
- Spatial feature selection
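
One reading of this slide (an assumption, since the exact formulation is not given): the embedded query produces one gate per channel, and the same gates are applied independently at every spatial position of the conv feature map, so whole channels, and with them the spatial patterns they respond to, can be switched on or off.

```python
import torch
import torch.nn as nn

class ConvGating(nn.Module):
    """Gate the channels of a conv feature map at every spatial position (cgating sketch)."""
    def __init__(self, n_channels=20, q_dim=10, gate_fn=torch.sigmoid):
        super().__init__()
        self.emb = nn.Linear(q_dim, n_channels)   # emb (20): one gate value per conv channel
        self.gate_fn = gate_fn                    # linear / tanh / sigmoid are compared in the results

    def forward(self, feat_map, one_hot):
        # feat_map: (B, C, H, W); the same C gates are broadcast over all H x W positions
        gates = self.gate_fn(self.emb(one_hot)).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return feat_map * gates                   # pixel-wise channel gating -> spatial feature selection
```

For the linear setting, the gate function is simply the identity, e.g. gate_fn=lambda x: x.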

Results (rows: training set size)

training size | concat | emb   | gating: linear | gating: tanh | gating: sigmoid | cgating: linear | cgating: tanh | cgating: sigmoid
60000         | 58.52  | 58.75 | 64.24          | 63.56        | 60.30           | 62.44           | 64.03         | 61.66
200000        | 79.19  | 77.21 | 85.73          | 85.91        | 81.15           | 90.80           | 89.30         | 81.69

gating (sigmoid), by variance of the initial values:
training size | small var | large var
60000         | 58.49     | 59.98
200000        | 58.13     | 81.15

cgating (sigmoid), by variance of the initial values:
training size | small var | large var
60000         | 58.40     | 61.66
200000        | 58.33     | 81.69

Discussion
- Gating (element-wise multiplication) aligns the semantic arrangement of the two modalities: the values in each dimension of the two vectors become semantically aligned.
- Multiplying the original feature representation performed better than multiplying the output of a nonlinear function: linear multiplication outperformed in both the gating and cgating settings.
- The gating models showed a strong dependency on the variance of the initial values: different variances performed better on different problems, gating types, and dataset sizes.

More Discussion
- Sigmoid gating performed better on the real dataset (experiments by Hyeonwoo), which contains no yes/no questions; still to test: a set with only yes/no questions.
- The task might be used for pre-training a spatial transformer network.
- Examples: image + number: 3 = yes; image + number: 6 = no; image + "What is the animal?" = elephant

Back to the Reference Problem: Attention Model
The previous models are connected as the localization network of the attention model (CNN + localization network).
Example: image + number: 3 -> attend to the digit 3 -> yellow

Spatial Transformer Network
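
For reference, a minimal spatial transformer step in PyTorch using the standard affine_grid/grid_sample pair (the toy localization network, its layer sizes, and the identity initialization are assumptions, not the network from the slide):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Predict a 2x3 affine transform from the input and resample the image with it."""
    def __init__(self, in_channels=3):
        super().__init__()
        self.loc_feat = nn.Sequential(                       # toy localization network (assumed)
            nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(8 * 4 * 4, 32), nn.ReLU(),
        )
        self.loc_head = nn.Linear(32, 6)                     # the 6 parameters of the affine matrix
        self.loc_head.weight.data.zero_()                    # start from the identity transform
        self.loc_head.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc_head(self.loc_feat(x)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)   # sampling grid from the transform
        return F.grid_sample(x, grid, align_corners=False)           # differentiable warp of the input
```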

Results (blank cells are still to be filled in; see To Do)

training size | baseline (concat) | concat | emb   | gating: linear | gating: tanh | gating: sigmoid | cgating: linear | cgating: tanh | cgating: sigmoid
60000         | 43.13             | 43.15  | 43.77 | 43.96          | 43.57        | 43.71           | 43.73           |               |
200000        | 65.25             |        | 91.13 | 92.78          | 81.50        | 90.12           | 87.10           |               |

To Do
- Fill in the blanks in the results
- Compare different parameters for the gating models
- Compare on larger networks (for the cgating model)
- Apply to the recurrent attention reasoning model with an LSTM (or GRU)