Learning to Answer Questions from Image Using Convolutional Neural Network
Lin Ma, Zhengdong Lu, and Hang Li
Huawei Noah’s Ark Lab, Hong Kong
http://www.ee.cuhk.edu.hk/~lma/

Multiple Modalities
Multiple modalities often accompany each other: video, audio, and text (movies, TV programs); image and text (Instagram, news); image, text, and metadata (YouTube).
Mining the relationships between modalities: associating different modalities; interpreting one modality using another (image/video captioning, visual question answering).
We focus on image question answering.

Image Question Answering
Input: an image and a natural-language question.
Output: an answer.
A general AI problem.

Image Question Answering: Challenges
Image contents are complicated.
Questions tend to be specific.

Image Question Answering
Multimodal learning between image and language: image and sentence association, automatic image captioning, image question answering.
[Figure labels: Image, Text, Question, Answer]
Image question answering requires more interaction between human and computer and deserves more attention.

Related Work
Visual Turing test [1]: binary questions for a given test image.
Neural image question answering: LSTM (long short-term memory) for multimodal learning [Ren et al.; Malinowski et al.]; multimodal recurrent neural network [Gao et al.].
Image question answering resembles the visual Turing test.
[1] D. Geman, S. Geman, N. Hallonquist, and L. Younes, Visual Turing test for computer vision systems, PNAS, Mar. 2015.

Problem Formulation
Image question answering: predict the answer given the question and the related image.
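
The formulation above can be written compactly as follows. The notation is an assumption made here for illustration and does not appear on the slide: $I$ is the image, $q$ the question, $a$ a candidate answer, and $\theta$ the model parameters.

```latex
a^{*} = \arg\max_{a} \; p(a \mid q, I; \theta)
```

Read this way, training amounts to fitting $\theta$ so that the correct answer receives high probability for each (image, question) pair.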

CNN for Image Question Answering
The softmax layer can be realized as a multi-class logistic regression model, and it can handle multiple-word answers.
[Figure: CNN model for image question answering]
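
A minimal sketch of a softmax layer used as multi-class logistic regression over a fixed set of candidate answers. The answer vocabulary, weights, and the two-dimensional joint representation are toy values invented for illustration; the function names are hypothetical and not taken from the talk.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_answer(joint, weights, biases, answers):
    """Multi-class logistic regression: one linear score per candidate answer.

    joint   : joint image-question representation (list of floats)
    weights : one weight vector per candidate answer
    biases  : one bias per candidate answer
    answers : candidate answer strings (hypothetical vocabulary)
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, joint)) + b
              for w, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda k: probs[k])
    return answers[best], probs

# Toy values, chosen only for illustration.
answers = ["chair", "table", "red"]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
answer, probs = predict_answer([0.2, 1.5], weights, biases, answers)
print(answer)  # table
```

The second candidate wins here because its linear score (1.5) is the largest of the three.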

CNN for Image Question Answering
Image CNN: generates the image representation.
Sentence CNN: composes words into a sentence representation.
Multimodal convolution: lets the image and question interact to learn their joint representation.
Softmax: predicts the answer from the joint representation.

Image CNN: image representation
VGG encodes the image content.
The dimension is reduced from 4096 to d.
The total number of parameters is greatly reduced.
The image is mapped to a new space for fusing with the question representation.
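
The dimension reduction on this slide can be sketched as a single linear map followed by a nonlinearity. This is an illustration under assumptions: the random vector stands in for a real VGG output, `D = 64` is a hypothetical value for d, and ReLU is used because the deck names it as the activation throughout the CNN.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def project_image(features, W, b):
    """Map a long feature vector to a shorter one: relu(W @ features + b)."""
    return [relu(sum(w * f for w, f in zip(row, features)) + b_j)
            for row, b_j in zip(W, b)]

random.seed(0)
IN_DIM = 4096  # from the slide
D = 64         # hypothetical choice of d, for illustration only

features = [random.random() for _ in range(IN_DIM)]  # stand-in for a VGG output
W = [[random.gauss(0.0, 0.01) for _ in range(IN_DIM)] for _ in range(D)]
b = [0.0] * D

v_im = project_image(features, W, b)
print(len(v_im))  # 64
```

Layers that consume `v_im` then work with d inputs instead of 4096, which is one way to read the slide's remark about the parameter count.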

Sentence CNN: question representation
Convolution: composes the words.
Max-pooling: removes unreliable compositions.
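
A small sketch of the two operations named above, applied to a toy question. The window of 3 words, the pooling size of 2, the embeddings, and the filters are all assumptions chosen for illustration; the slide does not state them.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def conv1d(seq, filters, biases, window=3):
    """Slide a window over consecutive vectors; each filter yields one feature."""
    out = []
    for i in range(len(seq) - window + 1):
        # Concatenate the vectors inside the window into one flat vector.
        flat = [x for vec in seq[i:i + window] for x in vec]
        out.append([relu(sum(w * x for w, x in zip(f, flat)) + b)
                    for f, b in zip(filters, biases)])
    return out

def max_pool(seq, size=2):
    """Keep the element-wise maximum over each group of `size` adjacent units."""
    pooled = []
    for i in range(0, len(seq) - size + 1, size):
        group = seq[i:i + size]
        pooled.append([max(col) for col in zip(*group)])
    return pooled

# Toy question of 6 words with 2-dimensional embeddings (made-up numbers).
words = [[0.1, 0.2], [0.0, 1.0], [0.5, -0.3], [1.0, 0.0], [-0.2, 0.4], [0.3, 0.3]]
# Two filters, each spanning 3 words * 2 dimensions = 6 inputs.
filters = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]]
biases = [0.0, 0.0]

composed = conv1d(words, filters, biases)  # one unit per window position
pooled = max_pool(composed)                # strongest of each adjacent pair
print(len(composed), len(pooled))  # 4 2
```

Pooling keeps, for every feature, the stronger of two neighbouring compositions, which matches the slide's description of discarding the less reliable ones.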

Multimodal Convolution
Lets the image interact with the question.
Fuses the image and question representations to generate their joint representation.
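
One plausible way to let the image and question interact in a convolution is to place the image vector between two adjacent question units and convolve over the concatenation. The layout, the function name, and all numbers below are assumptions for illustration, not a confirmed description of the authors' layer.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def multimodal_conv(question_units, image_vec, filters, biases):
    """Convolve over [question unit n, image vector, question unit n+1]."""
    joint = []
    for n in range(len(question_units) - 1):
        # Assumed layout: the image sits between two neighbouring question units.
        flat = question_units[n] + image_vec + question_units[n + 1]
        joint.append([relu(sum(w * x for w, x in zip(f, flat)) + b)
                      for f, b in zip(filters, biases)])
    return joint

# Toy inputs (made-up numbers): three question units and one image vector.
question_units = [[1.5, 0.9], [1.3, 0.7], [0.2, 0.1]]
image_vec = [0.5, -1.0]
# Each filter sees 2 + 2 + 2 = 6 inputs.
filters = [[1, 1, 1, 1, 1, 1], [1, 0, -1, 0, 0, 1]]
biases = [0.0, 0.0]

joint = multimodal_conv(question_units, image_vec, filters, biases)
print(len(joint), len(joint[0]))  # 2 2
```

Because every output unit sees both modalities at once, the filters can learn image-question interactions directly rather than combining two separately finished representations.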

LSTM vs. CNN
LSTM: the image representation is appended to the beginning or end of the question; the effect of the image fades with each LSTM time step; the LSTM tends to learn an abstract representation of the sentence.
CNN: adaptively captures the relations between image and question.

Experimental Results
Activation function throughout the CNN: ReLU.
Stochastic gradient descent (SGD): minibatch, learning rate.
Handling overfitting: dropout.
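
For readers unfamiliar with the training ingredients listed above, here is a minimal sketch of a dropout mask and a single SGD update. The slide gives no minibatch size, learning rate, or dropout rate, so every number below is a placeholder; inverted dropout is shown as a common variant, not necessarily the one used in the experiments.

```python
import random

def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def sgd_step(params, grads, lr):
    """One SGD update: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]

rng = random.Random(0)

# Placeholder activations and rate, for illustration only.
hidden = [1.0] * 10
dropped = dropout(hidden, rate=0.5, rng=rng)

# Placeholder parameters, gradients, and learning rate.
params = sgd_step([0.5, -0.2], [0.1, -0.4], lr=0.1)
print(dropped)
print(params)
```

In minibatch SGD the gradient passed to `sgd_step` would be averaged over a small batch of training examples rather than computed on the full dataset.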

Experimental Results
Databases:
DAQUAR-All: 6,795 training and 5,673 testing samples
DAQUAR-Reduced: 3,876 training and 297 testing samples
COCO-QA: 79,100 training and 39,171 testing samples
Evaluation measurements:
Accuracy
Wu-Palmer similarity (WUPS): similarity between two words based on their common subsequence in the taxonomy tree.
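
The two measurements can be illustrated on a toy example. The taxonomy below is hypothetical and stands in for a real lexical hierarchy; the similarity uses the standard Wu-Palmer formula, 2 * depth(common ancestor) / (depth(a) + depth(b)). Any thresholding or aggregation applied on top of it in the reported WUPS scores is omitted here.

```python
# Hypothetical toy taxonomy (child -> parent), for illustration only.
PARENT = {
    "entity": None,
    "furniture": "entity",
    "chair": "furniture",
    "table": "furniture",
    "animal": "entity",
    "cat": "animal",
}

def path_to_root(word):
    """Return the chain of nodes from `word` up to the root."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path

def depth(word):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(path_to_root(word))

def wu_palmer(a, b):
    """Wu-Palmer similarity based on the deepest common ancestor of a and b."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first shared node is the deepest common ancestor.
    common = next(w for w in path_to_root(b) if w in ancestors_a)
    return 2.0 * depth(common) / (depth(a) + depth(b))

def accuracy(predictions, truths):
    """Fraction of predictions that match the ground truth exactly."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)

print(accuracy(["chair", "cat"], ["chair", "table"]))  # 0.5
print(wu_palmer("chair", "chair"))                     # 1.0
print(wu_palmer("chair", "table") > wu_palmer("chair", "cat"))  # True
```

Unlike plain accuracy, the similarity gives partial credit: answering "table" when the truth is "chair" scores higher than answering "cat".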

Experimental Results: DAQUAR-All
More than 20% improvement in accuracy over the Neural-Image-QA method.

Experimental Results: DAQUAR-Reduced
20% improvement in accuracy over 2-VIS+BLSTM.

Experimental Results: COCO-QA
Outperforms all competitor models, including the “Full” model.

Experimental Results
Influence of the multimodal convolution layer
Influence of the image CNN
Effectiveness of the sentence CNN

Databases
DAQUAR: https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/visual-turing-challenge/
COCO-QA: http://www.cs.utoronto.ca/~mren/imageqa/data/cocoqa/

Conclusion
A convolutional neural network (CNN) is employed to learn the interactions and relationships between image and text.
The proposed CNN models achieve state-of-the-art performance on the task of image question answering.

Thanks! Q&A