Image Question Answering

Computer Vision Lab. Paul Hongsuck Seo

Taxonomy of Sub-problems (by hyeonwoo)
Classification with complex setting: multi-domain classification; classification with input/output connection; zero-shot learning
Novel computer vision tasks: reference problem; spatial relation problem; visual semantic role labeling [6]; weakly-supervised learning to count
Data efficiency problem: operation compositionality; image QA task compositionality
Natural language understanding: extracting operation and input from question
Multi-modal information mergence: merging multi-modal information better

Reference Problem
Different answers are correct for different question targets.
What color is the cup? What color is the teapot? What color is the spoon?

Spatial Relation Understanding
Targets can be specified implicitly by their spatial relation to other objects.
What is behind the horse? What is in front of the bed? What is beside the cat?

General Idea – Attention Model
Attention model in caption generation: attend first, then say.
[Figure: starting from <INIT>, the model attends to the image at each step and emits the caption tokens <BOS> man is standing <EOS>.]

General Idea – Attention Model
Reverse approach for IQA: listen first, then attend.
[Figure: starting from <INIT>, the model reads the question tokens "is man standing <EOS>", applies attention to the image at each step, and outputs the answer "yes".]
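The "listen first, then attend" idea can be illustrated with a generic soft-attention step. This is a minimal sketch under stated assumptions, not the model from the slides: the question is assumed to be already encoded as a vector, the image as a list of region feature vectors, and dot-product scoring is a swapped-in, commonly used choice. The names `attend`, `question_vec`, and `region_feats` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question_vec, region_feats):
    """Score each image region against the question, then average the regions.

    Assumption: dot-product scoring; the slides do not specify the scoring function.
    """
    scores = [sum(q * f for q, f in zip(question_vec, feat)) for feat in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    context = [sum(w * feat[d] for w, feat in zip(weights, region_feats))
               for d in range(dim)]
    return weights, context


# Toy example: the question vector matches the first region most strongly.
question_vec = [1.0, 0.0]
region_feats = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
weights, context = attend(question_vec, region_feats)
```

The attended `context` vector would then feed an answer classifier; reading the question before attending is what lets the weights depend on what was asked.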

MNIST-Reference Dataset
A synthetic dataset for the reference problem, built from the MNIST dataset.
Task: predict the color of a given number in an image.
Input: an image together with one-hot information about the target number.
The answers provide only partial information.
The objective function does not explicitly model the use of the extra information.
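How one sample of such a dataset might be assembled can be sketched as follows. The slides do not give the construction details, so everything here is an assumption for illustration: the palette `COLORS`, the number of digits per image, and the helper `make_reference_sample` are hypothetical, and rendering the colored MNIST digits onto a canvas is omitted.

```python
import random

COLORS = ["red", "green", "blue", "yellow"]  # hypothetical palette


def one_hot(index, size=10):
    """One-hot vector identifying the queried digit."""
    return [1 if i == index else 0 for i in range(size)]


def make_reference_sample(rng, num_digits=3):
    """Pick distinct digits, give each a color, and query the color of one of them."""
    digits = rng.sample(range(10), num_digits)      # distinct digits in the scene
    colors = [rng.choice(COLORS) for _ in digits]   # colors may repeat
    scene = list(zip(digits, colors))
    target_digit, target_color = rng.choice(scene)
    return {
        "scene": scene,                       # stands in for the rendered image
        "query": one_hot(target_digit),       # extra input: which digit is asked about
        "answer": COLORS.index(target_color)  # class label: the color
    }


sample = make_reference_sample(random.Random(0))
```

The label only names a color, so the same scene yields different correct answers for different queries, which is the reference problem in miniature.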

Multi-modal Information Mergence
In some tasks, the answer classes do not include any information about what is asked.
Example: yes/no questions, answering whether a given number exists in an image.
[image] + "number: 3" = yes
[image] + "number: 6" = no
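The yes/no existence task can be written down as a labeling rule. This is a small sketch; the function name `existence_label` and the example digit set are hypothetical.

```python
def existence_label(digits_in_image, query_digit):
    """Return 1 ("yes") if the queried digit appears in the image, else 0 ("no")."""
    return 1 if query_digit in digits_in_image else 0


# The same image gets different labels depending only on the query.
digits_in_image = {3, 5, 8}
label_for_3 = existence_label(digits_in_image, 3)
label_for_6 = existence_label(digits_in_image, 6)
```

Because the two output classes say nothing about which digit was queried, a model can only solve the task by merging the query with the image features.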

Models
concat: CNN (flatten): pool (2, 2), conv (20, 3, 3), flatten (); one_hot (10); concat (); linear (200); softmax (2)
emb: CNN (fc out): pool (2, 2), conv (20, 3, 3), fc (200); one_hot (10), emb (200); concat (); linear (256); softmax (2)
gating: CNN (fc out): pool (2, 2), conv (20, 3, 3), fc (200); one_hot (10), emb (200), nonlinearity (); el-mul (); linear (256); softmax (2)
cgating: CNN: pool (2, 2), conv (20, 3, 3); one_hot (10), emb (20), nonlinearity (); fc (200); cgating (); linear (256); softmax (2)
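The three vector-level merging schemes (concat, emb, gating) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' code: the image feature is assumed to be the 200-dimensional fc output, the query a 10-dimensional one-hot vector, and the helper names (`linear`, `init_weights`, `merge_concat`, `merge_emb`, `merge_gating`) are hypothetical. For simplicity the concat sketch also uses the fc output, whereas the slide's concat model uses the flattened CNN features.

```python
import math
import random


def linear(x, weights):
    """Matrix-vector product; weights is a list of rows, each of length len(x)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init_weights(rng, out_dim, in_dim, std=0.1):
    """Gaussian initialization; std is a hypothetical value."""
    return [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(out_dim)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def merge_concat(img_feat, query_one_hot):
    """concat: append the raw one-hot query to the image feature."""
    return img_feat + query_one_hot


def merge_emb(img_feat, query_one_hot, emb_w):
    """emb: embed the query first, then concatenate."""
    return img_feat + linear(query_one_hot, emb_w)


def merge_gating(img_feat, query_one_hot, emb_w, nonlinearity=lambda v: v):
    """gating: element-wise multiply the image feature by the (optionally squashed) query embedding."""
    gate = [nonlinearity(v) for v in linear(query_one_hot, emb_w)]
    return [g * f for g, f in zip(gate, img_feat)]


rng = random.Random(0)
img_feat = [rng.random() for _ in range(200)]   # stands in for the CNN fc output
query = [0] * 10
query[3] = 1                                    # "number: 3"
emb_w = init_weights(rng, 200, 10)

merged_concat = merge_concat(img_feat, query)                         # 210 values
merged_emb = merge_emb(img_feat, query, emb_w)                        # 400 values
merged_linear_gate = merge_gating(img_feat, query, emb_w)             # 200 values
merged_sigmoid_gate = merge_gating(img_feat, query, emb_w, sigmoid)   # 200 values
```

Passing `math.tanh` or `sigmoid` as `nonlinearity` gives the tanh and sigmoid variants; the default identity corresponds to the linear variant.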

Convolutional Gating
Individual pixel-wise gating on the channels of a 2D feature map.
Acts as spatial feature selection.
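One reading of convolutional gating is that the same per-channel gate vector is applied at every pixel of the convolutional feature map. The sketch below follows that reading and should be taken as an assumption; the function name `cgating` and the toy sizes are hypothetical (the slides use 20 channels, matching emb (20)).

```python
def cgating(feature_map, channel_gate):
    """Scale every pixel of channel c by channel_gate[c].

    feature_map is indexed as feature_map[channel][row][col].
    """
    return [
        [[channel_gate[c] * value for value in row] for row in feature_map[c]]
        for c in range(len(feature_map))
    ]


# Toy 2-channel, 2x2 feature map; the gate keeps channel 0 and suppresses channel 1.
feature_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
gated = cgating(feature_map, [1.0, 0.0])
```

Since the spatial layout is preserved, locations whose activations live in suppressed channels fade out, which is one way the gate can act as spatial feature selection.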

Results
Dataset size   concat   emb     gating                     cgating
                                linear   tanh    sigmoid   linear   tanh    sigmoid
60000          58.52    58.75   64.24    63.56   60.30     62.44    64.03   61.66
200000         79.19    77.21   85.73    85.91   81.15     90.80    89.30   81.69

gating_sigmoid    small var   large var
60000             58.49       59.98
200000            58.13       81.15

cgating_sigmoid   small var   large var
60000             58.40       61.66
200000            58.33       81.69

Discussion
Gating (element-wise multiplication) aligns the semantic arrangement of the two modalities: the values in each dimension of the two vectors are semantically aligned.
Multiplying by the original feature representation performed better than multiplying by the output of nonlinear functions; linear multiplication outperformed in both the gating and cgating settings.
Performance with gating depended strongly on the variance of the initial values.
Different variances performed better on different problems, different gatings, and different dataset sizes.
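One possible reason for the sensitivity to initial variance can be illustrated numerically. This is an interpretation offered as an assumption, not a claim from the slides: with a sigmoid gate, a small-variance initialization puts every gate value close to 0.5, so the gate barely distinguishes dimensions at the start of training. The standard deviations 0.01 and 1.0 are hypothetical stand-ins for "small var" and "large var".

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gate_spread(std, dim=200, seed=0):
    """Standard deviation of sigmoid gate values for Gaussian pre-activations."""
    rng = random.Random(seed)
    gates = [sigmoid(rng.gauss(0.0, std)) for _ in range(dim)]
    mean = sum(gates) / dim
    return math.sqrt(sum((g - mean) ** 2 for g in gates) / dim)


small_spread = gate_spread(0.01)  # gates nearly all equal to 0.5
large_spread = gate_spread(1.0)   # gates vary noticeably across dimensions
```

Comparing `small_spread` with `large_spread` shows how much more the gate differentiates feature dimensions at initialization under the larger variance.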

More Discussion
Sigmoid performed better on the real set (done by hyeonwoo), without yes/no questions.
To test on the set with only yes/no questions.
The task might be used for pre-training a spatial transformer network.
[image] + "number: 3" = yes
[image] + "number: 6" = no
[image] + "What is the animal?" = elephant

Back to Reference Problem
Attention model: the previous models were connected as the localization network of the attention model.
[Figure: CNN + localization network; image + "number: 3" = attended digit 3, answer "yellow".]

Spatial Transformer Network
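The core of a spatial transformer is a grid generator plus a bilinear sampler, driven by a 2x3 affine matrix `theta` that a localization network predicts. The sketch below follows the standard formulation with normalized coordinates in [-1, 1]; it is not the deck's implementation, `theta` is set by hand rather than predicted, and the function names are hypothetical.

```python
import math


def affine_grid(theta, out_h, out_w):
    """For each output pixel, compute source coordinates in normalized [-1, 1] space."""
    grid = []
    for i in range(out_h):
        y_t = -1.0 + 2.0 * i / (out_h - 1) if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x_t = -1.0 + 2.0 * j / (out_w - 1) if out_w > 1 else 0.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            row.append((x_s, y_s))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample a single-channel image at the grid locations; out-of-range pixels read as 0."""
    h, w = len(image), len(image[0])

    def pixel(y, x):
        return image[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    out = []
    for row in grid:
        out_row = []
        for x_s, y_s in row:
            # Convert normalized coordinates to pixel coordinates.
            x = (x_s + 1.0) * (w - 1) / 2.0
            y = (y_s + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(x), math.floor(y)
            dx, dy = x - x0, y - y0
            out_row.append(
                pixel(y0, x0) * (1 - dx) * (1 - dy)
                + pixel(y0, x0 + 1) * dx * (1 - dy)
                + pixel(y0 + 1, x0) * (1 - dx) * dy
                + pixel(y0 + 1, x0 + 1) * dx * dy
            )
        out.append(out_row)
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
zoom_bottom_right = [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]  # hand-set attention window

same = bilinear_sample(image, affine_grid(identity, 3, 3))
crop = bilinear_sample(image, affine_grid(zoom_bottom_right, 2, 2))
```

In the setting described in the deck, the query-conditioned merging models would play the role of the localization network, so the predicted `theta` could move the window to the digit named in the question.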

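Results
Columns: baseline (concat), concat, emb, gating (linear, tanh, sigmoid), cgating (linear, tanh, sigmoid)
60000: 43.13 43.15 43.77 43.96 43.57 43.71 43.73
200000: 65.25 91.13 92.78 81.50 90.12 87.10
Some cells of this table are blank, so the values are listed in their original order without column assignment.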

To Do
Fill the blanks in the results.
Compare different parameters for the gating models.
Compare on larger networks (for the cgating model).
Apply to the recurrent attention reasoning model with LSTM (or GRU).
