Exploring Models and Data for Image Question Answering
Mengye Ren, Ryan Kiros, Richard S. Zemel

Overview
1. Address the problem of image-based question answering with new models and a new dataset.
2. Assume that the answers consist of only a single word, so the problem can be treated as a classification problem.
3. A generic end-to-end QA model.
4. An automatic question generation algorithm that converts description sentences into questions.
5. A new dataset (COCO-QA) generated by the above algorithm.
6. Comparisons to other models and baselines.

Datasets
Previous dataset (DAQUAR): Malinowski and Fritz released a dataset with images and question-answer pairs, the DAtaset for QUestion Answering on Real-world images (DAQUAR), containing about 1,500 images. All images are indoor scenes. Most questions in this dataset can be divided into three types: object type, object color, and number of objects. All these types of questions can be answered with one word.
COCO-QA: contains about 78,736 training QA pairs and 38,948 test QA pairs.
Difference: the generated dataset has more images and a more evenly distributed set of answers.

Datasets: how to get the dataset
- Use the Stanford parser to obtain the syntactic structure of the original image description.
- Convert compound sentences to simple sentences.
- Change indefinite determiners "a(n)" to definite determiners "the".
- Apply wh-movement constraints.
Example: A man is riding a horse. -> What is the man riding?
Question generation: object questions, number questions, color questions, location questions.
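For illustration, here is a toy sketch of the object-question rule from the example above. The paper relies on the Stanford parser for the syntactic structure; this sketch only handles the simple "subject is verb-ing a(n) object" pattern with a regex, so the function name and pattern are illustrative assumptions, not the authors' algorithm.

```python
import re

def object_question(caption):
    # Toy object-question rule: "A man is riding a horse." -> "What is the man riding?"
    # The paper uses full parse trees; a regex stands in for the parser here.
    m = re.match(r"(?:a|an|the)\s+(\w+)\s+is\s+(\w+ing)\s+(?:a|an|the)\s+(\w+)\s*\.",
                 caption, re.IGNORECASE)
    if m is None:
        return None
    subject, verb, obj = m.groups()
    # Wh-movement: the object is replaced by "what" and fronted; the indefinite
    # determiner on the subject becomes the definite "the". The object is the answer.
    question = f"What is the {subject.lower()} {verb.lower()}?"
    return question, obj.lower()

print(object_question("A man is riding a horse."))  # ('What is the man riding?', 'horse')
```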

Datasets: how to get the dataset
Post-processing:
- Answers that appear less often than a frequency threshold are discarded.
- QA pairs are enrolled one at a time. The probability of enrolling the next QA pair (q, a) decreases as the number of already-enrolled pairs with answer a grows, governed by two constants with K <= K2; K = 100 and K2 = 200 in this paper.
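A minimal sketch of this post-processing step. The acceptance formula is not reproduced on the slide, so the exponential-decay form exp(-(count(a) - K) / K2) once an answer's count exceeds K, as well as the minimum-frequency value below, are assumptions for illustration.

```python
import math
import random
from collections import Counter

def subsample(qa_pairs, K=100, K2=200, min_freq=2, seed=0):
    """Discard rare answers, then enroll QA pairs one at a time so that very
    common answers are accepted with decaying probability (assumed form)."""
    rng = random.Random(seed)
    freq = Counter(answer for _, answer in qa_pairs)
    enrolled, count = [], Counter()
    for question, answer in qa_pairs:
        if freq[answer] < min_freq:          # frequency-threshold filter
            continue
        p = 1.0 if count[answer] <= K else math.exp(-(count[answer] - K) / K2)
        if rng.random() < p:                 # enroll this pair with probability p
            enrolled.append((question, answer))
            count[answer] += 1
    return enrolled
```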

Models: VIS+LSTM model
- Use the last hidden layer of the 19-layer Oxford VGG ConvNet, trained on the ImageNet 2014 Challenge, as the visual embedding.
- Use skip-gram word embeddings for the question words.
- Treat the image as one word of the question.
- The LSTM outputs are fed into a softmax layer at the last step to generate answers.
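A minimal PyTorch sketch of the VIS+LSTM idea (not the authors' implementation): the CNN feature is projected into the word-embedding space, prepended to the question as its first "word", and the LSTM output at the last step is classified into single-word answers. Dimensions and names are illustrative.

```python
import torch
import torch.nn as nn

class VisLSTM(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=500, img_feat_dim=4096):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)  # could be initialized from skip-gram
        self.img_proj = nn.Linear(img_feat_dim, embed_dim)      # image feature -> "word" embedding
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.classifier = nn.Linear(embed_dim, num_answers)     # softmax over one-word answers

    def forward(self, img_feat, question_ids):
        # img_feat: (B, img_feat_dim) from the VGG-19 last hidden layer
        # question_ids: (B, T) word indices of the question
        img_word = self.img_proj(img_feat).unsqueeze(1)          # (B, 1, embed_dim)
        q_words = self.word_embed(question_ids)                  # (B, T, embed_dim)
        seq = torch.cat([img_word, q_words], dim=1)              # image as the first word
        out, _ = self.lstm(seq)
        return self.classifier(out[:, -1])                       # logits at the last step
```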

Word embeddings: skip-gram word embedding
- Represent a word as a vector.
- Dimension: typically 300-500 dimensions; each dimension may capture a certain feature of the word.
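For illustration, a toy gensim example of training skip-gram vectors; the paper uses pre-trained skip-gram embeddings, and the corpus and hyperparameters here are made up (gensim >= 4.0 API).

```python
from gensim.models import Word2Vec

# Tiny toy corpus; real skip-gram embeddings are trained on billions of words.
sentences = [
    ["a", "man", "is", "riding", "a", "horse"],
    ["what", "is", "the", "man", "riding"],
]
model = Word2Vec(sentences, vector_size=300, sg=1, window=5, min_count=1)  # sg=1 -> skip-gram

vec = model.wv["horse"]                                   # 300-dimensional vector for "horse"
print(vec.shape, model.wv.most_similar("man", topn=3))
```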

Word embeddings: skip-gram word embedding (figure)

Models: RNNs (recurrent neural networks) and LSTMs (long short-term memory)
Problem: long-range dependencies, e.g. "I grew up in France... I speak fluent French." (Bengio et al., 1994)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Models: RNNs (recurrent neural networks) and LSTMs (long short-term memory)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
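As a concrete companion to the diagrams from the linked post, here is a minimal sketch of one LSTM step with the standard gate equations; parameter shapes and names are illustrative, not the paper's code.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters for the input, forget, output gates and the
    # candidate cell state, concatenated along the first dimension.
    gates = x_t @ W.T + h_prev @ U.T + b
    i, f, o, g = gates.chunk(4, dim=-1)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates in (0, 1)
    g = torch.tanh(g)                                               # candidate cell state
    c_t = f * c_prev + i * g        # forget part of the old memory, write new memory
    h_t = o * torch.tanh(c_t)       # exposed hidden state
    return h_t, c_t
```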

Experimental results: comparisons
Performance metrics: plain answer accuracy as well as the Wu-Palmer similarity (WUPS) measure.
Example: "hatchback" vs. "compact" scores higher than "hatchback" vs. "truck".
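A hedged illustration of the Wu-Palmer similarity behind WUPS, using NLTK's WordNet interface (requires nltk.download('wordnet')); the paper's WUPS score additionally thresholds and averages these similarities over the whole test set, which this snippet does not do.

```python
from nltk.corpus import wordnet as wn

def max_wup(word_a, word_b):
    # WUPS takes the best Wu-Palmer similarity over all noun senses of the two words.
    sims = [s1.wup_similarity(s2)
            for s1 in wn.synsets(word_a, pos=wn.NOUN)
            for s2 in wn.synsets(word_b, pos=wn.NOUN)]
    sims = [s for s in sims if s is not None]
    return max(sims) if sims else 0.0

print(max_wup("hatchback", "compact"))  # relatively high: both are car types
print(max_wup("hatchback", "truck"))    # lower: more distant in the WordNet hierarchy
```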

Experimental results: comparisons

Limitations
- The answers only have one word.
- The types of questions are limited.
- Accuracy is still limited.

Questions? Weiqiang Jin