Jointly Generating Captions to Aid Visual Question Answering

Presentation transcript:

Jointly Generating Captions to Aid Visual Question Answering Raymond Mooney Department of Computer Science University of Texas at Austin with Jialin Wu

VQA Image credits to VQA website

VQA Architectures Most systems are DNNs using both CNNs and RNNs.

VQA with BUTD We use a recent state-of-the-art VQA system, BUTD (Bottom-Up-Top-Down) (Anderson et al., 2018). BUTD first detects a wide range of objects and attributes using a detector trained on Visual Genome data, and then attends to them when computing an answer.

Using Visual Segmentations We use recent methods that exploit detailed image segmentations for VQA (VQS, Gan et al., 2017). Segmentations provide more precise visual information than BUTD's bounding boxes.

High-Level VQA Architecture

How can captions help VQA? Captions + Detections as inputs Captions can provide useful information for the VQA model

Multitask VQA and Image Captioning There are lots of datasets with image captions. The COCO data used in VQA comes with captions. Captioning and VQA both need knowledge of image content and language, so they should benefit from multitask learning (Caruana, 1997).

Question relevant captions For a particular question, some of the captions are relevant and some are not.

How to generate question-relevant captions Input feature side: We need to bias the features to encode the information necessary for the question. We used the VQA joint representation for simplicity. Supervision side: We need relevant captions in order to train the model to generate relevant captions.

How to obtain relevant training captions Directly collecting captions for each question? There are over 1.1 million questions in the dataset (not scalable), and the caption has to be in line with the VQA reasoning process. Choosing the most relevant caption from the existing dataset? How do we measure relevance? What if there is no relevant caption for an image-question pair?

Quantifying the relevance Intuition: Generating relevant captions should share its optimization goal with answering the visual question, so the two objectives should share some descent directions. Relevance is measured using the inner product of the gradients of the caption generation loss and the VQA answer prediction loss. A positive inner product means the two objective functions share some descent directions in the optimization process, and therefore indicates that the corresponding caption helps the VQA process.
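
The step from "positive inner product" to "shared descent direction" can be made explicit with a first-order Taylor expansion. This is only a sketch: the symbols below are assumptions, since the slide does not say with respect to which quantity the gradients are taken. Here $\theta$ stands for whatever the two losses share, $\eta > 0$ is a small step size, $c$ is a candidate caption, and $L_{cap}$ and $L_{vqa}$ are the caption generation and answer prediction losses.

```latex
% One gradient step that lowers the caption loss for caption c
\theta' = \theta - \eta \, \nabla_\theta L_{cap}(c)

% First-order effect of that step on the VQA loss
L_{vqa}(\theta') \approx L_{vqa}(\theta)
  - \eta \, \big\langle \nabla_\theta L_{vqa}, \; \nabla_\theta L_{cap}(c) \big\rangle

% Relevance score of caption c: positive means the step also lowers L_vqa
\mathrm{rel}(c) = \big\langle \nabla_\theta L_{vqa}, \; \nabla_\theta L_{cap}(c) \big\rangle
```

Under this approximation, a caption with a positive score is one whose training signal also pushes the VQA loss down, which is the sense in which it "helps".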

Quantifying the relevance Selecting the most relevant human caption
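
A minimal sketch of the selection step, assuming the gradients have already been computed by an autodiff framework and flattened into plain lists of floats. The function names are hypothetical, and returning `None` when no caption has a positive score is an assumption made for illustration; the slides do not say how that case is handled.

```python
from typing import List, Optional, Sequence


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equally long gradient vectors."""
    if len(u) != len(v):
        raise ValueError("gradient vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def most_relevant_caption(
    vqa_grad: Sequence[float],
    caption_grads: List[Sequence[float]],
) -> Optional[int]:
    """Return the index of the caption whose loss gradient has the largest
    positive inner product with the VQA loss gradient.

    Returns None if no caption has a positive score (hypothetical choice).
    """
    best_index: Optional[int] = None
    best_score = 0.0
    for index, cap_grad in enumerate(caption_grads):
        score = inner_product(vqa_grad, cap_grad)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Toy gradients for one question and three candidate human captions.
vqa_grad = [1.0, -2.0, 0.5]
caption_grads = [
    [0.5, 0.1, 0.0],    # weakly aligned with the VQA gradient
    [1.0, -1.0, 1.0],   # strongly aligned
    [-1.0, 2.0, 0.0],   # opposed
]
selected = most_relevant_caption(vqa_grad, caption_grads)
```

In the toy example the second caption (index 1) is selected, because its gradient is the most closely aligned with the VQA gradient.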

How to use the captions A Word GRU identifies the words that are important for the question and the image, and a Caption GRU encodes the sequential information from the attended words.

Joint VQA/Captioning model

Examples

VQA 2.0 Data Splits: Training, Validation, Test. Training: 443,757 questions, 82,783 images. All images come with 5 human-generated captions.

Experimental Results Compare with the state-of-the-art

Experimental Results Comparing different types of captions Generated relevant captions help VQA more than the question-agnostic captions from BUTD.

Improving Image Captioning Using an Image-Conditioned Auto-Encoder

Aiding Training by Using an Easier Task We use an easier task that first encodes the human captions and the image, and then generates the caption back. C1: several doughnuts are in a cardboard box. C2: a box holds four pairs of mini doughnuts. C3: a variety of doughnuts sit in a box. C4: several different donuts are placed in the box. C5: a fresh box of twelve assorted glazed pastries. [Figure: encoder (ENC) feeding a decoder (DEC) through the initial hidden state $h_0^d$]

Model Overview

Training for Image Captioning Maximum likelihood principle REINFORCE algorithms

Hidden State Supervision Both of these training approaches supervise only the output word probabilities, so the hidden states receive no direct supervision. Supervising the hidden states requires oracle hidden states that contain richer information. An easier task, which first encodes the human captions and the image and then generates the caption back, can provide them. Hidden state loss for time (t)

Training with Maximum Likelihood Jointly optimizes the log-likelihood and the hidden states loss at each time step (t)
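
The transcript does not preserve the loss formulas, so the following is only a sketch under stated assumptions: the hidden state loss is taken to be a squared L2 distance between the captioner's hidden state and the oracle auto-encoder's hidden state at the same time step, and the joint objective adds it to the negative log-likelihood with a hypothetical weight `weight`. All names are illustrative.

```python
import math
from typing import Sequence


def hidden_state_loss(h_captioner: Sequence[float],
                      h_oracle: Sequence[float]) -> float:
    """Squared L2 distance between two hidden-state vectors (assumed form)."""
    if len(h_captioner) != len(h_oracle):
        raise ValueError("hidden states must have the same size")
    return sum((a - b) ** 2 for a, b in zip(h_captioner, h_oracle))


def joint_step_loss(p_target_word: float,
                    h_captioner: Sequence[float],
                    h_oracle: Sequence[float],
                    weight: float = 1.0) -> float:
    """Loss at one time step: negative log-likelihood of the ground-truth
    word plus the weighted hidden state loss (weight is hypothetical)."""
    nll = -math.log(p_target_word)
    return nll + weight * hidden_state_loss(h_captioner, h_oracle)


# One time step: the captioner gives the correct word probability 0.5 and
# its hidden state differs from the oracle's in a single coordinate.
step_loss = joint_step_loss(0.5, [1.0, 0.0], [0.0, 0.0], weight=2.0)
```

Summing `joint_step_loss` over all time steps of a caption would give the sequence-level training loss under these assumptions.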

Training with REINFORCE Objectives Gradients Problem: Every word receives the same amount of reward no matter how appropriate it is.
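
The slide's equations did not survive in the transcript. For reference, the standard REINFORCE formulation for captioning is sketched below; the notation is an assumption rather than the slide's own. Here $w^s = (w^s_1, \dots, w^s_T)$ is a caption sampled from the model $p_\theta$, $I$ is the image, $r(w^s)$ is a sequence-level reward such as CIDEr, and $b$ is a baseline (in self-critical sequence training, the reward of the greedily decoded caption).

```latex
% Objective: minimize the negative expected reward of sampled captions
L(\theta) = - \mathbb{E}_{w^s \sim p_\theta} \big[ r(w^s) \big]

% Single-sample gradient estimate with a baseline b
\nabla_\theta L(\theta) \approx
  - \big( r(w^s) - b \big)
  \sum_{t=1}^{T} \nabla_\theta \log p_\theta \big( w^s_t \mid w^s_{1:t-1}, I \big)
```

The scalar $r(w^s) - b$ multiplies every term of the sum, so each word in the caption is credited with exactly the same reward.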

Hidden State Loss as a Reward Bias Motivation: A word should receive more reward when its hidden state matches that of a high-performance oracle encoder. Reward bias
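
The exact form of the reward bias is not preserved in the transcript. The sketch below shows one plausible way to turn per-word hidden state losses into per-word rewards: each word gets the sequence-level reward plus a bias that is positive when its hidden state loss is below the caption's average. The centering on the mean and the `scale` parameter are assumptions for illustration, not the authors' formula.

```python
from typing import List, Sequence


def biased_rewards(sequence_reward: float,
                   hidden_losses: Sequence[float],
                   scale: float = 1.0) -> List[float]:
    """Per-word rewards: the shared sequence reward plus a bias that grows
    as a word's hidden state loss falls below the caption's mean loss.

    The biases sum to zero, so the average per-word reward stays equal to
    the sequence reward (a design choice of this sketch).
    """
    if not hidden_losses:
        raise ValueError("need at least one word")
    mean_loss = sum(hidden_losses) / len(hidden_losses)
    return [sequence_reward + scale * (mean_loss - loss)
            for loss in hidden_losses]


# Three-word caption with sequence reward 1.0. The first word's hidden
# state matches the oracle exactly, the last one is far from it.
rewards = biased_rewards(1.0, [0.0, 2.0, 4.0], scale=0.5)
```

With these toy numbers the first word's reward rises to 2.0, the middle word keeps 1.0, and the last word drops to 0.0, so words are no longer rewarded uniformly.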

Experimental Data COCO (Chen et al., 2015) Each image with 5 human captions “Karpathy split”: 110,000 training images 5,000 validation images 5,000 test images

Baseline Systems FC (Rennie et al., 2017) With and without “self critical sequence training” Up-Down (aka BUTD) (Anderson et al., 2018)

Evaluation Metrics BLEU-4 (B-4) METEOR (M) ROUGE-L (R-L) CIDEr (C) SPICE (S)

Experimental Results for Max Likelihood

Experimental Results for REINFORCE Training with different reward metrics

Conclusions Jointly generating “question relevant” captions can improve Visual Question Answering. First training an image-conditioned caption auto-encoder can help supervise a captioner to create better hidden state representations that improve final captioning performance.