What Convnets Make for Image Captioning?


What Convnets Make for Image Captioning? 23rd International Conference on MultiMedia Modeling (MMM 2017). Yu Liu*, Yanming Guo*, and Michael S. Lew, Leiden Institute of Advanced Computer Science, Leiden University. Presenter: Yanming Guo. Hello, everyone. Today I am going to talk about our paper, titled "What Convnets Make for Image Captioning?". As the title suggests, the main task of our paper is image captioning. So, what is image captioning?

Image Captioning Describe an image with meaningful and sensible sentence-level captions, covering objects, actions, descriptive words, relations, and more. Image captioning is an emerging and important task in vision-to-language research. The caption should include many kinds of information. Take this image as an example: the ground-truth caption is "A large bus sitting next to a very tall building".

Image Captioning Retrieval approaches ---- Map images to pre-defined sentences Generative approaches ---- Estimate novel sentences Image captioning approaches can be divided into two types: retrieval approaches and generative approaches. Retrieval approaches map images to pre-defined sentences, while generative approaches estimate novel sentences. This is the main difference between the two, and also one of the advantages of generative approaches, since we should not expect every sentence for a new image to have been seen before. Example captions: "A white dog and a brown dog run along side each other at the beach"; "A dog running on a wet suit on the beach" (From Toronto University, Demo)

Image Captioning Retrieval approaches ---- Map images to pre-defined sentences Generative approaches ---- Estimate novel sentences Advantages of generative approaches: The caption does not have to be previously seen A good language model More intelligent Better performance

General Structure CNN → high-level image features → RNN → generate a sentence of words: START → "White" → "Cup" → … → END. The general structure of the generative approaches is as follows: given an image, a CNN first extracts high-level image features, which are then fed into an RNN that generates the caption word by word. Our paper focuses on the transition between these two parts, the CNN and the RNN, and aims to fully investigate the effects of different Convnets on image captioning.
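The CNN-to-RNN handoff can be sketched in a few lines. The snippet below is a toy illustration, not the paper's model: the weights are random, the vocabulary is four tokens, and `cnn_features` is a stand-in for a real Convnet. It only shows how the image feature initializes the recurrent state and how greedy decoding emits words until END.

```python
import numpy as np

np.random.seed(0)

VOCAB = ["<START>", "white", "cup", "<END>"]
FEAT_DIM, HID_DIM, V = 8, 8, len(VOCAB)

# Hypothetical "pretrained" weights; random here for illustration.
W_xh = np.random.randn(FEAT_DIM, HID_DIM) * 0.1   # input -> hidden
W_hh = np.random.randn(HID_DIM, HID_DIM) * 0.1    # hidden -> hidden
W_hy = np.random.randn(HID_DIM, V) * 0.1          # hidden -> vocab scores
embed = np.random.randn(V, FEAT_DIM) * 0.1        # word embeddings

def cnn_features(image):
    """Stand-in for a CNN: return a high-level feature vector (toy pooling)."""
    return image.mean(axis=(0, 1))

def generate_caption(image, max_len=10):
    # Initialize the recurrent state from the image feature.
    h = np.tanh(cnn_features(image) @ W_xh)
    word = 0  # <START>
    caption = []
    for _ in range(max_len):
        h = np.tanh(embed[word] @ W_xh + h @ W_hh)  # one RNN step
        word = int(np.argmax(h @ W_hy))             # greedy decoding
        if VOCAB[word] == "<END>":
            break
        caption.append(VOCAB[word])
    return caption

print(generate_caption(np.random.rand(224, 224, FEAT_DIM)))
```

Real systems replace the plain RNN with an LSTM and train the whole pipeline on paired image-caption data; the control flow, however, is the same.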

General Structure CNN → RNN: START → "White" → "Cup" → … → END. What Convnets make for image captioning? Our aim in this paper is to fully investigate the effects of different Convnets on image captioning.

Three types of Convnets Single-label Convnet: generic representation ---- Convnet pre-trained on the ImageNet dataset, e.g. AlexNet, VGG Multi-label Convnet: salient objects ---- Fine-tune the Convnet on the 80 object categories of MS COCO Multi-attribute Convnet: salient objects, actions, relations, … ---- Fine-tune the Convnet on attributes of MS COCO (e.g. 300 attributes) Correspondingly, we exploit three kinds of Convnets: single-label, multi-label and multi-attribute.
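The difference between the single-label and multi-label objectives can be illustrated with the loss alone. The sketch below is a toy, not the paper's training code (label indices 5 and 11 are hypothetical): instead of one softmax over 1000 ImageNet classes, fine-tuning on the 80 MS COCO categories treats each category as an independent binary prediction under a sigmoid cross-entropy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Element-wise binary cross-entropy: each of the 80 COCO categories
    is an independent yes/no prediction, unlike the single 1000-way
    softmax used for the ImageNet single-label objective."""
    p = sigmoid(logits)
    eps = 1e-12
    return -np.mean(targets * np.log(p + eps)
                    + (1 - targets) * np.log(1 - p + eps))

# Toy example: an image containing two objects among 80 labels.
logits = np.full(80, -4.0)     # network initially predicts "absent" everywhere
targets = np.zeros(80)
targets[[5, 11]] = 1.0         # hypothetical indices of the present objects
loss_before = multilabel_bce(logits, targets)

logits[[5, 11]] = 4.0          # after fine-tuning, present labels score high
loss_after = multilabel_bce(logits, targets)
assert loss_after < loss_before
```

The multi-attribute Convnet uses the same style of objective, just with a larger label set (e.g. 300 attributes covering actions and relations as well as objects).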

Three types of Convnets Input image → Single-label Convnet / Multi-label Convnet / Multi-attribute Convnet. We visualize the most activated feature map in the last convolutional layer (conv5_3) for each Convnet.

Multi-Convnet Aggregation The single-label, multi-label and multi-attribute features are combined into an aggregation feature ag(x). At every time step, the LSTM receives the word input x_t together with ag(x) and predicts the next-word probability: inputs x_0, x_1, …, x_{T-1} yield predictions p_1, p_2, …, p_T.
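One plausible way to form the aggregation feature ag(x) is simple concatenation of the three Convnet outputs; the paper may combine them differently, so treat this as a sketch. The dimensions (1000, 80, 300) mirror the ImageNet class count, the COCO category count and the attribute count mentioned above.

```python
import numpy as np

def aggregate(single_label_feat, multi_label_feat, multi_attr_feat):
    """Form ag(x) by concatenating the three Convnet features into one
    vector that conditions the LSTM at every time step (an assumption;
    other combination schemes are possible)."""
    return np.concatenate([single_label_feat, multi_label_feat, multi_attr_feat])

sl = np.random.rand(1000)   # e.g. ImageNet class scores
ml = np.random.rand(80)     # MS COCO object category scores
ma = np.random.rand(300)    # attribute scores
ag = aggregate(sl, ml, ma)
assert ag.shape == (1380,)
```

Concatenation keeps every Convnet's signal intact and lets the LSTM's input weights learn how to weight the three sources.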

Multi-Scale Testing The 224×224 input goes through the standard CNN, while larger scales (256, 320) go through its FCN counterpart, whose weights are transferred from the CNN. The features from all scales are averaged into x_t, which is passed to the LSTM for caption generation.
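The averaging step can be sketched as follows. `extract_feat` here is a toy stand-in (global average pooling over a crop of the raw image); in the actual pipeline the 256 and 320 inputs would pass through the FCN version of the Convnet, but the principle, one feature per scale averaged into a single x_t, is the same.

```python
import numpy as np

def extract_feat(image, size):
    """Toy stand-in for the (F)CNN at a given input scale: crop to
    `size` and global-average-pool per channel. A real FCN would slide
    the convnet over the larger input and pool its feature map."""
    im = image[:size, :size] if image.shape[0] >= size else image
    return im.mean(axis=(0, 1))

def multiscale_feature(image, scales=(224, 256, 320)):
    """Average the per-scale features into a single test-time feature."""
    feats = [extract_feat(image, s) for s in scales]
    return np.mean(feats, axis=0)

image = np.random.rand(320, 320, 3)
x = multiscale_feature(image)
assert x.shape == (3,)
```

Because the caption generator runs only once on the averaged feature, this augmentation costs a few extra forward passes through the Convnet rather than multiple LSTM decodings.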

Experiments Evaluation metrics: BLEU: measures the precision of n-grams between the generated and reference sentences (B-1, B-2, B-3, B-4 for n = 1…4). METEOR: computed from an alignment between the words in the generated and reference sentences. ROUGE-L: focuses on the longest sequence of words that appears in the same order in both sentences (longest common subsequence). CIDEr: uses tf-idf weights when computing each n-gram's contribution.
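As a concrete example of the BLEU idea, the snippet below computes the modified (clipped) n-gram precision that underlies B-1 to B-4, using two captions from the qualitative results later in the talk. It omits BLEU's brevity penalty and the geometric mean over n, so it is only the core ingredient, not full BLEU.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram's count is
    clipped by its count in the reference before dividing by the
    number of candidate n-grams."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return clipped / max(len(cand), 1)

gen = "a man and a dog on a boat".split()
gt = "a man and a dog on a small yellow boat".split()
print(ngram_precision(gen, gt, n=1))   # unigram precision (B-1)
print(ngram_precision(gen, gt, n=2))   # bigram precision (B-2)
```

Higher-order n-grams reward correct word order, which is why B-4 is usually the headline number.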

Experiments Multi-scale: considerable improvement SL-Net: largest dimension & worst performance ML-Net: smallest dimension & considerable improvement MA-Net: medium dimension & significant improvement

Experiments Multi-scale testing using the FCN is always better; the aggregation of different Convnets can enhance performance.

Experiments Single-label Convnet: A man is sitting on the water with a surfboard. Multi-label Convnet: A man sitting on a boat in front of a boat. Multi-attribute Convnet: A man and a dog on a boat. Multi-Convnet aggregation: A man and a dog on a small boat. Ground truth: A man and a dog on a small yellow boat.

Experiments

Experiments Ours: A man riding a wave in the ocean. Ours: A living room with a lot of furniture. Ours: A man riding a horse at a horse. Ours: A close up of an elephant with an elephant GT: A man riding a wave on a surfboard in the ocean. GT: Living room with furniture with garage door at one end. GT: A horse that threw a man off a horse. GT: A man getting a kiss on the neck from an elephant's trunk

Conclusion The multi-attribute Convnet performs better for image captioning The aggregation of different Convnets delivers slightly better performance than each individual Convnet Efficient multi-scale test augmentation using FCNs Results comparable with the state of the art

Thanks for your attention! Any questions are welcome!