What Convnets Make for Image Captioning?


What Convnets Make for Image Captioning? 23rd International Conference on MultiMedia Modeling (MMM 2017). Yu Liu*, Yanming Guo*, and Michael S. Lew, Leiden Institute of Advanced Computer Science, Leiden University. Presenter: Yanming Guo. Hello, everyone. Today I am going to talk about our paper, titled "What Convnets Make for Image Captioning?". As the title suggests, the main task of our paper is image captioning. So, what is image captioning?

Image Captioning Describe an image with meaningful and sensible sentence-level captions, covering objects, actions, descriptive words, relations, and more. Image captioning is an emerging and important task in vision-to-language research. The caption should include many kinds of information. Take this image as an example: the ground-truth caption is "A large bus sitting next to a very tall building".

Image Captioning Retrieval approaches ---- Map images to pre-defined sentences Generative approaches ---- Estimate novel sentences Image captioning approaches can be divided into two types: retrieval approaches and generative approaches. Retrieval approaches map images to pre-defined sentences, while generative approaches estimate novel sentences. This is the main difference between the two, and also one of the advantages of generative approaches, since we should not expect every sentence for a new image to have been seen before. Example captions: "A white dog and a brown dog run along side each other at the beach"; "A dog running on a wet suit on the beach" (From Toronto University, Demo)

Image Captioning Retrieval approaches ---- Map images to pre-defined sentences Generative approaches ---- Estimate novel sentences Advantages of generative approaches: The caption does not have to be previously seen A good language model More intelligent Better performance

General Structure CNN → high-level image features → RNN → generate a sentence of words: START → "White" → "Cup" → … → END. The general structure of the generative approaches is as follows: given an image, a CNN first extracts high-level image features, which are then fed into an RNN that generates the caption word by word. Our paper focuses on the transition between these two parts, the CNN and the RNN, and aims to fully investigate the effects of different Convnets on image captioning.
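The CNN-to-RNN handoff can be sketched in a few lines. The snippet below is a toy illustration, not the paper's model: the weights are random, the vocabulary is four tokens, and `cnn_features` is a stand-in for a real Convnet. It only shows how the image feature initializes the recurrent state and how greedy decoding emits words until END.

```python
import numpy as np

np.random.seed(0)

VOCAB = ["<START>", "white", "cup", "<END>"]
FEAT_DIM, HID_DIM, V = 8, 8, len(VOCAB)

# Hypothetical "pretrained" weights; random here for illustration.
W_xh = np.random.randn(FEAT_DIM, HID_DIM) * 0.1   # input -> hidden
W_hh = np.random.randn(HID_DIM, HID_DIM) * 0.1    # hidden -> hidden
W_hy = np.random.randn(HID_DIM, V) * 0.1          # hidden -> vocab scores
embed = np.random.randn(V, FEAT_DIM) * 0.1        # word embeddings

def cnn_features(image):
    """Stand-in for a CNN: return a high-level feature vector (toy pooling)."""
    return image.mean(axis=(0, 1))

def generate_caption(image, max_len=10):
    # Initialize the recurrent state from the image feature.
    h = np.tanh(cnn_features(image) @ W_xh)
    word = 0  # <START>
    caption = []
    for _ in range(max_len):
        h = np.tanh(embed[word] @ W_xh + h @ W_hh)  # one RNN step
        word = int(np.argmax(h @ W_hy))             # greedy decoding
        if VOCAB[word] == "<END>":
            break
        caption.append(VOCAB[word])
    return caption

print(generate_caption(np.random.rand(224, 224, FEAT_DIM)))
```

Real systems replace the plain RNN with an LSTM and train the whole pipeline on paired image-caption data; the control flow, however, is the same.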

General Structure CNN → RNN: START → "White" → "Cup" → … → END. What Convnets make for image captioning? Our aim in this paper is to fully investigate the effects of different Convnets on image captioning.

Three types of Convnets Single-label Convnet: generic representation ---- Convnet pre-trained on the ImageNet dataset, e.g. AlexNet, VGG Multi-label Convnet: salient objects ---- Fine-tune the Convnet on the 80 object categories of MS COCO Multi-attribute Convnet: salient objects, actions, relations, … ---- Fine-tune the Convnet on attributes of MS COCO (e.g. 300 attributes) Correspondingly, we exploit three kinds of Convnets: single-label, multi-label and multi-attribute.
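The difference between the single-label and multi-label objectives can be illustrated with the loss alone. The sketch below is a toy, not the paper's training code (label indices 5 and 11 are hypothetical): instead of one softmax over 1000 ImageNet classes, fine-tuning on the 80 MS COCO categories treats each category as an independent binary prediction under a sigmoid cross-entropy.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Element-wise binary cross-entropy: each of the 80 COCO categories
    is an independent yes/no prediction, unlike the single 1000-way
    softmax used for the ImageNet single-label objective."""
    p = sigmoid(logits)
    eps = 1e-12
    return -np.mean(targets * np.log(p + eps)
                    + (1 - targets) * np.log(1 - p + eps))

# Toy example: an image containing two objects among 80 labels.
logits = np.full(80, -4.0)     # network initially predicts "absent" everywhere
targets = np.zeros(80)
targets[[5, 11]] = 1.0         # hypothetical indices of the present objects
loss_before = multilabel_bce(logits, targets)

logits[[5, 11]] = 4.0          # after fine-tuning, present labels score high
loss_after = multilabel_bce(logits, targets)
assert loss_after < loss_before
```

The multi-attribute Convnet uses the same style of objective, just with a larger label set (e.g. 300 attributes covering actions and relations as well as objects).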

Three types of Convnets Input image → Single-label Convnet / Multi-label Convnet / Multi-attribute Convnet. We visualize the most activated feature map in the last convolutional layer (conv5_3) for each Convnet.

Multi-Convnet Aggregation The single-label, multi-label and multi-attribute features are combined into an aggregation feature ag(x). At every time step, the LSTM receives the word input x_t together with ag(x) and predicts the next-word probability: inputs x_0, x_1, …, x_{T-1} yield predictions p_1, p_2, …, p_T.
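One plausible way to form the aggregation feature ag(x) is simple concatenation of the three Convnet outputs; the paper may combine them differently, so treat this as a sketch. The dimensions (1000, 80, 300) mirror the ImageNet class count, the COCO category count and the attribute count mentioned above.

```python
import numpy as np

def aggregate(single_label_feat, multi_label_feat, multi_attr_feat):
    """Form ag(x) by concatenating the three Convnet features into one
    vector that conditions the LSTM at every time step (an assumption;
    other combination schemes are possible)."""
    return np.concatenate([single_label_feat, multi_label_feat, multi_attr_feat])

sl = np.random.rand(1000)   # e.g. ImageNet class scores
ml = np.random.rand(80)     # MS COCO object category scores
ma = np.random.rand(300)    # attribute scores
ag = aggregate(sl, ml, ma)
assert ag.shape == (1380,)
```

Concatenation keeps every Convnet's signal intact and lets the LSTM's input weights learn how to weight the three sources.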

Multi-Scale Testing The 224×224 input goes through the standard CNN, while larger scales (256, 320) go through its FCN counterpart, whose weights are transferred from the CNN. The features from all scales are averaged into x_t, which is passed to the LSTM for caption generation.
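The averaging step can be sketched as follows. `extract_feat` here is a toy stand-in (global average pooling over a crop of the raw image); in the actual pipeline the 256 and 320 inputs would pass through the FCN version of the Convnet, but the principle, one feature per scale averaged into a single x_t, is the same.

```python
import numpy as np

def extract_feat(image, size):
    """Toy stand-in for the (F)CNN at a given input scale: crop to
    `size` and global-average-pool per channel. A real FCN would slide
    the convnet over the larger input and pool its feature map."""
    im = image[:size, :size] if image.shape[0] >= size else image
    return im.mean(axis=(0, 1))

def multiscale_feature(image, scales=(224, 256, 320)):
    """Average the per-scale features into a single test-time feature."""
    feats = [extract_feat(image, s) for s in scales]
    return np.mean(feats, axis=0)

image = np.random.rand(320, 320, 3)
x = multiscale_feature(image)
assert x.shape == (3,)
```

Because the caption generator runs only once on the averaged feature, this augmentation costs a few extra forward passes through the Convnet rather than multiple LSTM decodings.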

Experiments Evaluation metrics: BLEU: measures the precision of n-grams between the generated and reference sentences (B-1, B-2, B-3, B-4 for n = 1…4). METEOR: computed from an alignment between the words in the generated and reference sentences. ROUGE-L: focuses on the longest sequence of words that appears in the same order in both sentences (longest common subsequence). CIDEr: uses tf-idf weights when computing each n-gram's contribution.
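As a concrete example of the BLEU idea, the snippet below computes the modified (clipped) n-gram precision that underlies B-1 to B-4, using two captions from the qualitative results later in the talk. It omits BLEU's brevity penalty and the geometric mean over n, so it is only the core ingredient, not full BLEU.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Modified n-gram precision: each candidate n-gram's count is
    clipped by its count in the reference before dividing by the
    number of candidate n-grams."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in Counter(cand).items())
    return clipped / max(len(cand), 1)

gen = "a man and a dog on a boat".split()
gt = "a man and a dog on a small yellow boat".split()
print(ngram_precision(gen, gt, n=1))   # unigram precision (B-1)
print(ngram_precision(gen, gt, n=2))   # bigram precision (B-2)
```

Higher-order n-grams reward correct word order, which is why B-4 is usually the headline number.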

Experiments Multi-scale: considerable improvement SL-Net: largest dimension & worst performance ML-Net: smallest dimension & considerable improvement MA-Net: medium dimension & significant improvement

Experiments Multi-scale testing using the FCN is always better; the aggregation of different Convnets can enhance performance.

Experiments Single-label Convnet: A man is sitting on the water with a surfboard. Multi-label Convnet: A man sitting on a boat in front of a boat. Multi-attribute Convnet: A man and a dog on a boat. Multi-Convnet aggregation: A man and a dog on a small boat. Ground truth: A man and a dog on a small yellow boat.

Experiments

Experiments Ours: A man riding a wave in the ocean. Ours: A living room with a lot of furniture. Ours: A man riding a horse at a horse. Ours: A close up of an elephant with an elephant GT: A man riding a wave on a surfboard in the ocean. GT: Living room with furniture with garage door at one end. GT: A horse that threw a man off a horse. GT: A man getting a kiss on the neck from an elephant's trunk

Conclusion The multi-attribute Convnet performs better for image captioning The aggregation of different Convnets delivers slightly better performance than each individual Convnet Efficient multi-scale test augmentation using FCNs Results comparable with the state of the art

Thanks for your attention! Any questions are welcome!