1
What Convnets Make for Image Captioning?
23rd International Conference on MultiMedia Modeling (MMM 2017)
What Convnets Make for Image Captioning?
Yu Liu*, Yanming Guo*, and Michael S. Lew
Leiden Institute of Advanced Computer Science, Leiden University
Presenter: Yanming Guo
Hello, everyone. Today I am going to talk about our paper, titled "What Convnets Make for Image Captioning?". From the title, we can see that the main task of our paper is image captioning. So, what is image captioning?
2
Image Captioning
Describe an image with meaningful and sensible sentence-level captions: objects, actions, descriptive words, relations, …
Image captioning is a newly emerging and important task in vision-to-language research. It tries to describe an image with a meaningful and sensible sentence-level caption. The caption should include many kinds of information. Take this image as an example; the ground-truth caption is "A large bus sitting next to a very tall building".
3
Image Captioning
Retrieval approaches ---- Map images to pre-defined sentences
Generative approaches ---- Estimate novel sentences
"A white dog and a brown dog run along side each other at the beach"; "A dog running on a wet suit on the beach" (from the University of Toronto demo)
The image captioning approaches can be divided into two types: retrieval approaches and generative approaches. Retrieval approaches map images to pre-defined sentences, while generative approaches estimate novel sentences. This is the main difference between the two approaches, and also one of the advantages of the generative approaches, since we should not expect that all the sentences for new images have been seen before.
4
Image Captioning
Retrieval approaches ---- Map images to pre-defined sentences
Generative approaches ---- Estimate novel sentences
Advantages of generative approaches:
The caption does not have to be previously seen
A good language model
More intelligent
Better performance
5
General Structure
[Diagram: an image is fed to a CNN, which produces high-level image features; an RNN then generates a sentence of words: START, "White", "Cup", …, END]
The general structure of the generative approaches is like this: given an image, a CNN first extracts high-level image features, and an RNN then takes these features and generates a sentence word by word. Our paper focuses on the transition between the CNN part and the RNN part, and aims to fully investigate the effects of different Convnets on image captioning.
6
General Structure
[Diagram: the same CNN-to-RNN pipeline, with the CNN marked "?", posing the question: what Convnets make for image captioning?]
Our aim in this paper is to fully investigate the effects of different Convnets on image captioning.
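To make this pipeline concrete, here is a minimal PyTorch sketch of a CNN-feature-to-LSTM decoder with greedy decoding. It only illustrates the general structure described above; the class and parameter names (`CaptionDecoder`, `feature_dim`, `start_id`, `end_id`) are our own, not from the paper.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Illustrative sketch: high-level image feature -> LSTM -> word sequence."""
    def __init__(self, feature_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image feature seeds the state
        self.init_c = nn.Linear(feature_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def generate(self, feature, start_id, end_id, max_len=20):
        # feature: (1, feature_dim) high-level image feature from the Convnet
        h, c = self.init_h(feature), self.init_c(feature)
        word = torch.tensor([start_id])
        caption = []
        for _ in range(max_len):
            h, c = self.lstm(self.embed(word), (h, c))
            word = self.fc(h).argmax(dim=1)  # greedy choice of the next word
            if word.item() == end_id:        # stop at the END token
                break
            caption.append(word.item())
        return caption
```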
7
Three types of Convnets
Single-label Convnet ---- Generic representation: Convnet pre-trained on the ImageNet dataset, e.g. AlexNet, VGG, …
Multi-label Convnet ---- Salient objects: fine-tune the Convnet on the 80 object categories of MS COCO
Multi-attribute Convnet ---- Salient objects, actions, relations, …: fine-tune the Convnet on attributes of MS COCO (e.g. 300 attributes)
Correspondingly, we exploit three kinds of Convnets: single-label, multi-label, and multi-attribute.
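As a concrete illustration of the second Convnet, the sketch below shows one plausible way to set up the multi-label fine-tuning: take an ImageNet-pretrained VGG-16, swap its 1000-way softmax classifier for an 80-way head matching the MS COCO object categories, and train with a per-category sigmoid loss. The exact layers, loss, and hyperparameters used in the paper may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

# Single-label Convnet: VGG-16 pre-trained on ImageNet (generic representation).
vgg = models.vgg16(weights="IMAGENET1K_V1")

# Multi-label fine-tuning: replace the 1000-way classifier with an 80-way head
# for the MS COCO object categories (multi-label, so no softmax over classes).
vgg.classifier[6] = nn.Linear(4096, 80)

criterion = nn.BCEWithLogitsLoss()  # one independent sigmoid per category
optimizer = torch.optim.SGD(vgg.parameters(), lr=1e-3, momentum=0.9)

def train_step(images, labels):
    """images: (B, 3, 224, 224); labels: (B, 80) multi-hot object vector."""
    logits = vgg(images)
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The multi-attribute Convnet would be fine-tuned the same way, only with the 80 object categories replaced by the larger attribute vocabulary (e.g. 300 attributes).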
8
Three types of Convnets
[Figure: for the same input image, the most activated feature map in the last convolutional layer (conv5_3) of the single-label, multi-label, and multi-attribute Convnets.]
We visualize the most activated feature map in the last convolutional layer, conv5_3.
9
Multi-Convnet Aggregation
[Diagram: the single-label, multi-label, and multi-attribute features are combined into an aggregation feature ag(x); at each time step t = 0, …, T−1, the LSTM takes the word embedding x_t together with ag(x) and outputs the next-word probability p_{t+1}.]
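The diagram shows the aggregation feature being injected at every LSTM step alongside the current word embedding. The sketch below assumes plain concatenation as the aggregation of the three features; treat that choice, and all the names, as illustrative rather than as the paper's exact formulation.

```python
import torch
import torch.nn as nn

class AggregationDecoder(nn.Module):
    """LSTM decoder that sees the aggregated Convnet feature ag(x) at every
    step (illustrative; aggregation here is plain concatenation)."""
    def __init__(self, sl_dim, ml_dim, ma_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim + sl_dim + ml_dim + ma_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, sl_feat, ml_feat, ma_feat, words):
        # words: (B, T) token ids x_0 .. x_{T-1}
        ag = torch.cat([sl_feat, ml_feat, ma_feat], dim=1)  # aggregation feature ag(x)
        B = words.size(0)
        h = sl_feat.new_zeros(B, self.hidden_dim)
        c = sl_feat.new_zeros(B, self.hidden_dim)
        logits = []
        for t in range(words.size(1)):
            step_in = torch.cat([self.embed(words[:, t]), ag], dim=1)
            h, c = self.lstm(step_in, (h, c))
            logits.append(self.fc(h))      # scores for p_1 .. p_T
        return torch.stack(logits, dim=1)  # (B, T, vocab_size)
```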
10
Multi-Scale Testing
[Diagram: the image is tested at scale 224 through the CNN and at scales 256 and 320 through an FCN; the per-scale features are averaged and transferred as the input x_t to the LSTM for caption generation.]
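A hedged sketch of the multi-scale testing idea: because the convolutional body of a VGG is size-agnostic, the image can be evaluated at 224, 256, and 320, and the per-scale features averaged before being handed to the LSTM. Note this simplifies the paper's FCN construction; here we global-pool the conv feature instead of reshaping the fully-connected layers into convolutions.

```python
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1")
conv_body = vgg.features  # fully convolutional, so any input size works

def multiscale_feature(image, scales=(224, 256, 320)):
    """image: (1, 3, H, W). Returns a (1, 512) feature averaged over scales."""
    feats = []
    with torch.no_grad():
        for s in scales:
            x = F.interpolate(image, size=(s, s), mode="bilinear",
                              align_corners=False)
            fmap = conv_body(x)                  # (1, 512, s/32, s/32)
            feats.append(fmap.mean(dim=(2, 3)))  # global average pool
    return torch.stack(feats).mean(dim=0)        # average across scales
```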
11
Experiments
BLEU: measures the precision of n-grams between the generated and reference sentences (e.g. B-1, B-2, B-3, B-4).
METEOR: computed from an alignment between the words of the generated and reference sentences.
ROUGE-L: focuses on the longest sequence of words that appears in the same order in both sentences.
CIDEr: uses TF-IDF weights when scoring each n-gram.
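To make the n-gram precision behind the B-1 to B-4 scores concrete, here is a small BLEU example using NLTK. NLTK is just our choice for illustration; COCO captioning results are normally computed with the official coco-caption toolkit, which also implements METEOR, ROUGE-L, and CIDEr.

```python
# Compute B-1 .. B-4 for one generated caption against one reference.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a man and a dog on a small yellow boat".split()]
candidate = "a man and a dog on a small boat".split()

smooth = SmoothingFunction().method1  # avoid zero scores for rare n-grams
for n in range(1, 5):
    # Uniform weights over 1..n-grams give the BLEU-n score.
    score = sentence_bleu(reference, candidate, weights=(1.0 / n,) * n,
                          smoothing_function=smooth)
    print(f"B-{n}: {score:.3f}")
```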
12
Experiments
SL-Net: largest feature dimension & worst performance
ML-Net: smallest dimension & considerable improvement
MA-Net: medium dimension & significant improvement
Multi-scale testing: considerable further improvement
13
Experiments
Multi-scale testing using the FCN is always better;
the aggregation of different Convnets can further enhance the performance.
14
Experiments
Single-label Convnet: A man is sitting on the water with a surfboard.
Multi-label Convnet: A man sitting on a boat in front of a boat.
Multi-attribute Convnet: A man and a dog on a boat.
Multi-Convnet aggregation: A man and a dog on a small boat.
Ground truth: A man and a dog on a small yellow boat.
15
Experiments
16
Experiments
Ours: A man riding a wave in the ocean. / GT: A man riding a wave on a surfboard in the ocean.
Ours: A living room with a lot of furniture. / GT: Living room with furniture with garage door at one end.
Ours: A man riding a horse at a horse. / GT: A horse that threw a man off a horse.
Ours: A close up of an elephant with an elephant. / GT: A man getting a kiss on the neck from an elephant's trunk.
17
Conclusion
The multi-attribute Convnet performs better for image captioning.
The aggregation of different Convnets can deliver slightly better performance than each individual Convnet.
Efficient multi-scale augmented testing using FCNs.
Results comparable with the state of the art.
18
Thanks for your attention! Any questions are welcome!