Learning to Answer Questions from Image Using Convolutional Neural Network
Lin Ma, Zhengdong Lu, and Hang Li
Huawei Noah’s Ark Lab, Hong Kong
http://www.ee.cuhk.edu.hk/~lma/

Multiple Modalities
Multiple modalities usually accompany each other: video, audio, and text (movies, TV programs); image and text (Instagram, news); image, text, and metadata (YouTube).
Mining the relationships between modalities means associating different modalities and interpreting one modality using another, as in image/video captioning and visual question answering.
We focus on image question answering.

Image Question Answering
Input: an image and a natural-language question.
Output: an answer.
This is a general AI problem.

Image Question Answering: Challenges
Image content is complicated.
Questions tend to be specific.

Image Question Answering
Multimodal learning between image and language covers image and sentence association, automatic image captioning, and image question answering.
[Diagram labels: Image, Text, Question, Answer]
Image question answering requires more interaction between human and computer, and deserves more attention.

Related Work
Visual Turing test [1]: binary questions for a given test image.
Neural image question answering: LSTM (long short-term memory) for multimodal learning [Ren et al.; Malinowski et al.]; multimodal recurrent neural network [Gao et al.].
Image question answering resembles the visual Turing test.
[1] D. Geman, S. Geman, N. Hallonquist, and L. Younes, Visual Turing test for computer vision systems, PNAS, Mar. 2015.

Problem Formulation
Image question answering: the problem is to predict the answer given the question and the related image.

CNN for Image Question Answering
The softmax layer can be realized as a multi-class logistic regression model, and it can handle multiple-word answers.
Figure: CNN model for image question answering.
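
The softmax step can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the answer set, weights, and input vector are made up, and treating a multiple-word answer such as "coffee table" as a single class is an assumption about how such answers are handled.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_answer(joint, weights, biases, answers):
    """Multi-class logistic regression over a fixed answer set."""
    scores = [sum(w * x for w, x in zip(row, joint)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(scores)
    best = max(range(len(answers)), key=lambda i: probs[i])
    return answers[best], probs

# Hypothetical answer set; "coffee table" is one class even though it has two words.
ANSWERS = ["chair", "red", "coffee table"]
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up weights, one row per answer
BIASES = [0.0, 0.0, 0.0]

# The 2-dimensional "joint representation" is a stand-in for the real one.
answer, probs = predict_answer([0.5, 2.0], WEIGHTS, BIASES, ANSWERS)
```

With these toy numbers the highest score belongs to the third class, so the predicted answer is the two-word class.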

CNN for Image Question Answering
Image CNN: generates the image representation.
Sentence CNN: composes words into a sentence representation.
Multimodal convolution: lets the image and question interact with each other to learn the joint representation.
Softmax: predicts the answer based on the joint representation.

Image CNN: image representation
VGG encodes the image content, and the resulting representation is reduced from 4096 dimensions to d.
This greatly reduces the total number of parameters and maps the image into a new space where it can be fused with the question representation.
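
A minimal sketch of the dimension reduction, assuming it is a learned linear map followed by ReLU (the slide does not state the exact form). The random vector stands in for a real VGG feature, and d = 8 is a hypothetical value chosen only for the demo.

```python
import random

def project_image(features, weights, biases):
    """Map a long image feature vector to d dimensions: relu(W x + b)."""
    return [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(weights, biases)]

random.seed(0)
IN_DIM = 4096   # size of the VGG feature vector, per the slide
D = 8           # hypothetical value of d, chosen only for the demo

vgg_features = [random.random() for _ in range(IN_DIM)]  # stand-in for real VGG output
W = [[random.gauss(0.0, 0.01) for _ in range(IN_DIM)] for _ in range(D)]
b = [0.0] * D

image_repr = project_image(vgg_features, W, b)
```

Any later layer that reads `image_repr` needs weights for D inputs rather than 4096, which is one way to read the slide's remark about the reduced parameter count.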

Sentence CNN: question representation
Convolution: composes the words.
Max-pooling: removes unreliable compositions.
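
The two operations can be illustrated on toy data. The window of three words, the pooling size of two, the embeddings, and the filters below are all assumptions made for the sketch; the slide gives no sizes.

```python
def relu(x):
    return max(0.0, x)

def conv_words(embeddings, filters, window=3):
    """Slide a window over the word sequence; each filter composes the words in it."""
    out = []
    for start in range(len(embeddings) - window + 1):
        # Concatenate the embeddings of the words inside the window.
        patch = [v for emb in embeddings[start:start + window] for v in emb]
        out.append([relu(sum(w * x for w, x in zip(f, patch))) for f in filters])
    return out

def max_pool(feature_rows, size=2):
    """Keep the strongest response per feature within each group of neighbours."""
    pooled = []
    for start in range(0, len(feature_rows) - size + 1, size):
        group = feature_rows[start:start + size]
        pooled.append([max(col) for col in zip(*group)])
    return pooled

# Toy 2-dimensional embeddings for a 6-word question (made-up values).
question = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 2.0], [1.0, 0.0]]
# Two made-up filters, each spanning 3 words x 2 dimensions = 6 weights.
filters = [[1.0, 0.0, 1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0, 0.0, -1.0]]

composed = conv_words(question, filters)   # one row per window position
pooled = max_pool(composed)                # weaker neighbouring responses are dropped
```

Six words give four window positions, and pooling in pairs keeps two rows, each holding the stronger of two neighbouring compositions per filter.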

Multimodal Convolution
Lets the image interact with the question.
Fuses the image and question representations to generate their joint representation.
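
One plausible way to realize this fusion is to convolve over a patch that places the image vector between two neighbouring question units. That layout is an assumption for illustration, as are the function name `multimodal_conv` and all values below.

```python
def relu(x):
    return max(0.0, x)

def multimodal_conv(question_feats, image_repr, filters):
    """Convolve over [q_n, image, q_{n+1}] so each output mixes both modalities."""
    joint = []
    for n in range(len(question_feats) - 1):
        # Place the image representation between two neighbouring question units.
        patch = question_feats[n] + image_repr + question_feats[n + 1]
        joint.append([relu(sum(w * x for w, x in zip(f, patch))) for f in filters])
    return joint

# Made-up inputs: three question units and one image vector, all 2-dimensional.
question_feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
image_repr = [0.5, 0.5]
# One made-up filter over the 6-value patch (2 + 2 + 2).
filters = [[1.0, 1.0, 2.0, 2.0, 1.0, 1.0]]

joint = multimodal_conv(question_feats, image_repr, filters)
```

Because the image vector appears in every patch, each output position depends on both the image and the local question content.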

LSTM vs. CNN
LSTM: the image representation is appended at the beginning or end of the question, so the effect of the image fades with each LSTM time step. An LSTM also tends to learn an abstract representation of the sentence.
CNN: adaptively captures the relations between image and question.

Experimental Results
Activation function throughout the CNN: ReLU
Optimization: stochastic gradient descent (SGD)
Hyperparameters: minibatch size, learning rate
Handling overfitting: dropout
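
The three training ingredients named on this slide can be sketched as follows. The dropout rate, learning rate, and gradients are hypothetical values, since the slide does not preserve the actual settings, and inverted dropout is one common variant rather than a confirmed detail of this work.

```python
import random

def relu(x):
    return max(0.0, x)

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

def sgd_step(params, grads, learning_rate):
    """One SGD update: move each parameter against its minibatch gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

rng = random.Random(0)
hidden = [relu(x) for x in [-1.0, 0.5, 2.0, -0.3]]   # negative inputs become 0
dropped = dropout(hidden, rate=0.5, rng=rng)          # hypothetical rate
params = sgd_step([1.0, -2.0], grads=[0.5, -1.0], learning_rate=0.1)  # hypothetical values
```

At test time, `dropout(..., training=False)` returns the activations unchanged, which is why the surviving units are rescaled during training.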

Experimental Results
Databases:
DAQUAR-All: 6,795 training and 5,673 testing samples
DAQUAR-Reduced: 3,876 training and 297 testing samples
COCO-QA: 79,100 training and 39,171 testing samples
Evaluation measurements:
Accuracy
Wu-Palmer similarity (WUPS): similarity between two words based on their common subsequence in the taxonomy tree.
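
Both measurements can be illustrated on a toy taxonomy. The tree below is hypothetical (real evaluations typically use WordNet), and the thresholded form, where scores below the threshold are scaled by 0.1, is the commonly used WUPS definition for single-word answers and is assumed here.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "furniture": "entity",
    "chair": "furniture",
    "table": "furniture",
    "animal": "entity",
    "dog": "animal",
}

def path_to_root(word):
    """Return [word, parent, ..., root]."""
    path = []
    while word is not None:
        path.append(word)
        word = PARENT[word]
    return path

def wu_palmer(a, b):
    """2 * depth(deepest common ancestor) / (depth(a) + depth(b)), root depth = 1."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    ancestors_b = set(path_b)
    common = next(node for node in path_a if node in ancestors_b)
    depth_common = len(path_to_root(common))
    return 2.0 * depth_common / (len(path_a) + len(path_b))

def wups(predicted, truth, threshold=0.9):
    """Average thresholded similarity; scores below the threshold are scaled by 0.1."""
    total = 0.0
    for p, t in zip(predicted, truth):
        s = wu_palmer(p, t)
        total += s if s >= threshold else 0.1 * s
    return total / len(truth)

def accuracy(predicted, truth):
    """Fraction of answers that match the ground truth exactly."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

predicted = ["chair", "table", "dog"]
truth = ["chair", "chair", "chair"]
acc = accuracy(predicted, truth)
strict = wups(predicted, truth, threshold=0.9)
lenient = wups(predicted, truth, threshold=0.0)
```

Accuracy gives no credit for "table" when the truth is "chair", while WUPS with a low threshold gives partial credit because the two words share the parent "furniture".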

Experimental Results: DAQUAR-All
More than 20% improvement in accuracy over the Neural-Image-QA method.

Experimental Results: DAQUAR-Reduced
20% improvement in accuracy over 2-VIS+BLSTM.

Experimental Results: COCO-QA
Outperforms all competing models, including the “Full” model.

Experimental Results
Influence of the multimodal convolution layer
Influence of the image CNN
Effectiveness of the sentence CNN

Databases
DAQUAR: https://www.mpi-inf.mpg.de/departments/computer-vision-and-multimodal-computing/research/vision-and-language/visual-turing-challenge/
COCO-QA: http://www.cs.utoronto.ca/~mren/imageqa/data/cocoqa/

Conclusion
A convolutional neural network (CNN) is employed to learn the interactions and relationships between image and text.
The proposed CNN model achieves state-of-the-art performance on the task of image question answering.

Thanks! Q&A