Image Captioning Approaches


Image Captioning Approaches Azam Moosavi

Overview Tasks involving sequences: image and video description. Papers covered: Long-term recurrent convolutional networks for visual recognition and description; From Captions to Visual Concepts and Back.

Long-term recurrent convolutional networks for visual recognition and description Jeff Donahue et al., UC Berkeley

LRCN LRCN is a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs.

Sequential inputs/outputs. Image credit: main paper

Activity Recognition T individual frames are input to T convolutional networks, which are then connected to a single-layer LSTM with 256 hidden units. The CNN base of the LRCN is a hybrid of the Caffe reference model (a minor variant of AlexNet) and the network used by Zeiler & Fergus, pre-trained on the 1.2M-image ILSVRC-2012 classification training subset of the ImageNet dataset.

Activity recognition [Diagram: each frame is passed through a CNN and then an LSTM step; the per-frame predictions (running, sitting, jumping, jumping) are averaged into the final label "jumping".]
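As a rough illustration of the pipeline above, here is a minimal PyTorch-style sketch (not the authors' code); the tiny frame encoder, feature dimension, and pooling choices are placeholder assumptions standing in for the pre-trained AlexNet-style base.

```python
import torch
import torch.nn as nn

class LRCNActivity(nn.Module):
    """Sketch of an LRCN-style activity recognizer: a shared CNN encodes
    each frame, a single-layer LSTM (256 hidden units, as in the slide)
    processes the per-frame features, and the per-step class scores are
    averaged over time."""
    def __init__(self, num_classes=101, feat_dim=4096, hidden_size=256):
        super().__init__()
        self.cnn = nn.Sequential(                      # placeholder frame encoder
            nn.Conv2d(3, 64, 7, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d((6, 6)), nn.Flatten(),
            nn.Linear(64 * 6 * 6, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, frames):                         # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))         # same CNN applied to every frame
        out, _ = self.lstm(feats.view(B, T, -1))       # one LSTM step per frame
        return self.classifier(out).mean(dim=1)        # average per-frame class scores

# Usage on a dummy clip of 16 frames:
scores = LRCNActivity()(torch.randn(2, 16, 3, 224, 224))   # -> (2, 101)
```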

Activity Recognition Evaluation

Activity Recognition Evaluation We evaluate our architecture on the UCF-101 dataset, which consists of over 12,000 videos categorized into 101 human action classes.

Image description [Diagram: the CNN image feature feeds a single-layer LSTM that generates "<BOS> a dog is jumping <EOS>" word by word.]

Image description [Diagram: variant with two stacked LSTM layers generating "<BOS> a dog is jumping <EOS>".]

Image description Two-layered, factored variant [Diagram: the CNN image feature and the previous word feed a two-layer factored LSTM stack that generates "<BOS> a dog is jumping <EOS>".]

Image Description In contrast to activity recognition, the static image description task only requires a single convolutional network. Both the image features and the previous word are provided as inputs to the sequential model; in this case a stack of LSTMs (each with 1000 hidden units) is used.
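A minimal sketch of the decoder just described (again PyTorch-style and not the authors' code): the image feature is concatenated with the previous word's embedding at every step and fed to stacked LSTMs with 1000 hidden units; the vocabulary size, embedding size, and feature size are placeholder assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Sketch of an LRCN-style image-description decoder: at each time
    step the embedding of the previous word is concatenated with the CNN
    image feature and passed through stacked LSTMs (1000 hidden units),
    which predict a distribution over the next word."""
    def __init__(self, vocab_size=10000, embed_dim=300,
                 feat_dim=4096, hidden_size=1000, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + feat_dim, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_feat, prev_words):
        # image_feat: (B, feat_dim); prev_words: (B, T) token ids starting with <BOS>
        T = prev_words.size(1)
        words = self.embed(prev_words)                       # (B, T, embed_dim)
        feats = image_feat.unsqueeze(1).expand(-1, T, -1)    # repeat the image feature each step
        hidden, _ = self.lstm(torch.cat([words, feats], dim=-1))
        return self.out(hidden)                              # (B, T, vocab_size) next-word logits
```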

Image Description Evaluation


Video description [Diagram: CNN features from the video frames are average-pooled into a single vector that conditions the LSTM decoder generating "<BOS> a dog is jumping <EOS>".]

Video description [Diagram: encoder-decoder variant; the LSTM reads CNN features over N video-frame time steps, then produces the description over M time steps, generating "<BOS> a dog is jumping <EOS>".]

Video description [Diagram: predictions from a pre-trained detector, rather than raw CNN features, are fed to the LSTM decoder generating "<BOS> a dog is jumping <EOS>".]

Video Description Figure credit: main paper

Video Description Evaluation Evaluated on the TACoS multilevel dataset, which has 44,762 video/sentence pairs (about 40,000 used for training/validation).

From Captions to Visual Concepts and Back Saurabh Gupta (UC Berkeley; work done at Microsoft Research), with Hao Cheng, Li Deng, Jacob Devlin, Piotr Dollár, Hao Fang, Jianfeng Gao, Xiaodong He, Forrest Iandola, Margaret Mitchell, John C. Platt, Rupesh Srivastava, C. Lawrence Zitnick, Geoffrey Zweig

1. Word Detection: woman, crowd, cat, camera, holding, purple. Slide credit: Saurabh Gupta

1. Word Detection: woman, crowd, cat, camera, holding, purple. 2. Sentence Generation: A purple camera with a woman. / A woman holding a camera in a crowd. / ... / A woman holding a cat. 3. Sentence Re-Ranking: #1 A woman holding a camera in a crowd. Slide credit: Saurabh Gupta


1. Word Detection: woman, crowd, cat, camera, holding, purple. Slide credit: Saurabh Gupta

Caption generation, visual detectors (MIL): detect a set of words that are likely to be part of the image caption. [Diagram: image → CNN with FC6, FC7, FC8 run as fully convolutional layers → spatial class probability maps → Multiple Instance Learning → per-class (word) probability.] Slide credit: Saurabh Gupta
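A small sketch of the "FC layers as fully convolutional layers" idea, assuming an AlexNet-style backbone whose pool5 output has 256 channels of size 6x6 (the channel sizes and the vocabulary size NUM_WORDS are placeholder assumptions, not values from the paper):

```python
import torch.nn as nn

NUM_WORDS = 1000   # placeholder caption-word vocabulary size

# fc6 normally maps a flattened 256x6x6 pool5 map to 4096 units; reusing its
# weights as a 6x6 convolution (and fc7/fc8 as 1x1 convolutions) lets the same
# network run on larger images, producing a coarse spatial map of per-word
# probabilities instead of a single vector -- the maps that MIL aggregates.
fc6_as_conv = nn.Conv2d(256, 4096, kernel_size=6)
fc7_as_conv = nn.Conv2d(4096, 4096, kernel_size=1)
fc8_as_conv = nn.Conv2d(4096, NUM_WORDS, kernel_size=1)
word_probability_maps = nn.Sequential(fc6_as_conv, nn.ReLU(),
                                      fc7_as_conv, nn.ReLU(),
                                      fc8_as_conv, nn.Sigmoid())
```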

Multiple Instance Learning (MIL) In MIL, instead of giving the learner labels for individual examples, the trainer only labels collections of examples, which are called bags. [Diagram: a spectrum of learning settings, from unsupervised (k-means, PCA, mixture models, ...: no labels required, recovers "latent" structure but not a classifier, with the hope that the structure is "useful") through multi-instance learning, transductive inference and co-training, to supervised (neural nets, perceptrons, SVMs, ...: requires labels, trains a classifier, success = low test error).]

Multiple Instance Learning (MIL) A bag is labeled positive if it contains at least one positive example, and negative if all the examples in it are negative. [Figure: negative bags (Bi−) vs. positive bags (Bi+).]

Multiple Instance Learning (MIL) We want to predict the target class of an image from its visual content. For instance, the target class might be "beach", where the image contains both "sand" and "water". In MIL terms, the image is described as a bag X = {x_1, ..., x_N}, where each x_i is the feature vector (called an instance) extracted from the i-th region of the image, and N is the total number of regions (instances) partitioning the image. The bag is labeled positive ("beach") if it contains both "sand" region instances and "water" region instances. Image credit: Multiple Instance Classification: Review, Taxonomy and Comparative Study, Jaume Amores (2013)

Noisy-OR for Estimating the Density It is assumed that the event can only happen if at least one of its causes occurred, and that the probability of any one cause failing to trigger the event is independent of the other causes.
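Under these two assumptions the bag-level word probability takes the usual noisy-OR form; the notation below (per-region probabilities p^w_{ij} for word w in region j of image i) is a plausible rendering of the combination rule, not copied from the paper.

```latex
% An image (bag) b_i is predicted to contain word w unless every region j
% independently fails to trigger it:
P\bigl(w \mid b_i\bigr) \;=\; 1 - \prod_{j \in b_i} \bigl(1 - p^{\,w}_{ij}\bigr)
```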

Caption generation, visual detectors (MIL): detect a set of words that are likely to be part of the image caption. [Diagram: image → CNN with FC6, FC7, FC8 run as fully convolutional layers → spatial class probability maps → Multiple Instance Learning → per-class (word) probability.] Slide credit: Saurabh Gupta

Visual detectors Slide credit: Saurabh Gupta

1. Word Detection: woman, crowd, cat, camera, holding, purple. 2. Sentence Generation: A purple camera with a woman. / A woman holding a camera in a crowd. / ... / A woman holding a cat. 3. Sentence Re-Ranking: #1 A woman holding a camera in a crowd. Slide credit: Saurabh Gupta

Probabilistic language modeling •  Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1,w2,w3,w4,w5…wn) •  Related task: probability of an upcoming word: P(w5|w1,w2,w3,w4) •  A model that computes either of these: P(W) or P(wn|w1,w2…wn-1) is called a language model.
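The two quantities are tied together by the chain rule, which is why estimating next-word probabilities is enough to score whole sentences:

```latex
P(W) \;=\; P(w_1, w_2, \dots, w_n) \;=\; \prod_{i=1}^{n} P\bigl(w_i \mid w_1, \dots, w_{i-1}\bigr)
```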

Markov chains Markov assumption: only the previous history matters; limited memory: only the last k words are included in the history (older words are less relevant) → kth-order Markov model. For instance, a 2-gram language model: p(w1, w2, w3, ..., wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn−1). What is conditioned on (here wi−1) is called the history.
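As a toy illustration of the 2-gram case (not the model used in the paper), the conditional probabilities can be estimated from counts:

```python
from collections import Counter, defaultdict

def train_bigram_lm(sentences):
    """Minimal 2-gram (first-order Markov) language model:
    estimate p(w_i | w_{i-1}) by maximum-likelihood counts."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for sent in sentences:
        tokens = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])                 # count each history word
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[prev][cur] += 1
    return lambda prev, cur: bigrams[prev][cur] / unigrams[prev] if unigrams[prev] else 0.0

# Example: p("holding" | "woman") estimated from a toy corpus.
p = train_bigram_lm(["a woman holding a camera", "a woman holding a cat"])
print(p("woman", "holding"))   # 1.0 in this toy corpus
```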

Probabilistic language modeling [Illustration, built up over several slides: the caption "A woman holding a camera in a crowd." is generated word by word, choosing among candidate words such as holding, cat, purple, camera, crowd.] Slide credit: Saurabh Gupta

Maximum entropy language model The ME LM estimates the probability of a word conditioned on the preceding words and on the set of words in the dictionary that have yet to be mentioned in the sentence. To train the ME LM, the objective function is the log-likelihood of the captions conditioned on the corresponding sets of detected objects.
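Written out, the training objective plausibly takes the following form (a reconstruction from the description above, not a verbatim copy): maximize over the parameters Θ the log-likelihood of each caption word given the preceding words and the set of detected attributes not yet mentioned at step t.

```latex
% S captions, caption s of length T_s; \bar{\mathcal{V}}_t^{(s)} is the set of
% detected words for image s that have not yet been mentioned by step t.
\mathcal{L}(\Theta) \;=\; \sum_{s=1}^{S} \sum_{t=1}^{T_s}
\log \Pr\!\bigl(w_t^{(s)} \,\bigm|\, w_{t-1}^{(s)}, \dots, w_1^{(s)},\; \bar{\mathcal{V}}_t^{(s)};\, \Theta\bigr)
```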


During generation, we perform a left-to-right beam search. This maintains a stack of length-l partial hypotheses. At each step in the search, every path on the stack is extended with a set of likely words, and the resulting length l + 1 paths are stored. The top k length l + 1 paths are retained and the others are pruned away. We define the possible extensions to be the end-of-sentence token </s>, the 100 most frequent words, the set of attribute words that remain to be mentioned, and all the words in the training data that have been observed to follow the last word in the hypothesis. Pruning is based on the likelihood of the partial path. When </s> is generated, the full path to </s> is removed from the stack and set aside as a completed sentence. The process continues until a maximum sentence length L is reached. After obtaining the set of completed sentences C, we form an M-best list as follows. Given a target number T of image attributes to be mentioned, the sequences in C covering at least T objects are added to the M-best list, sorted in descending order by log-likelihood. If fewer than M sequences covering at least T objects are found in C, we reduce T by 1 until M sequences are found.
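The search itself can be sketched as follows (a toy version: lm_score and candidates are hypothetical placeholders for the ME LM scores and the extension set described above, and the M-best attribute-coverage step is omitted):

```python
def beam_search(lm_score, candidates, beam_size=10, max_len=20, eos="</s>"):
    """Toy left-to-right beam search. `lm_score(prefix, word)` is assumed
    to return log p(word | prefix); `candidates(prefix)` returns the words
    considered as extensions (frequent words, unmentioned attributes,
    observed continuations). Both are placeholders, not the paper's code."""
    beams = [([], 0.0)]                         # (partial sentence, log-likelihood)
    completed = []
    for _ in range(max_len):
        expanded = []
        for prefix, score in beams:
            for word in candidates(prefix):
                expanded.append((prefix + [word], score + lm_score(prefix, word)))
        # keep the top `beam_size` partial hypotheses, set finished ones aside
        expanded.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for prefix, score in expanded[:beam_size]:
            (completed if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return sorted(completed, key=lambda x: x[1], reverse=True)
```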

Language model Slide credit: Saurabh Gupta


1. Word Detection: woman, crowd, cat, camera, holding, purple. 2. Sentence Generation: A purple camera with a woman. / A woman holding a camera in a crowd. / ... / A woman holding a cat. 3. Sentence Re-Ranking: #1 A woman holding a camera in a crowd. Slide credit: Saurabh Gupta

Re-rank hypotheses globally [Diagram: the candidate captions (1. A purple camera with a woman. 2. A woman holding a camera in a crowd. 3. A woman holding a cat. 4. ... 5. ...) are re-ranked using sentence- and image-level features; the top result is "A woman holding a camera in a crowd." DMSM: an embedding trained to maximize similarity between an image and its corresponding caption.] Slide credit: Saurabh Gupta

Re-rank hypotheses globally Deep Multimodal Similarity Model (DMSM): we measure similarity between an image and a candidate caption as the cosine similarity between their corresponding vectors, the image vector y_Q and the text vector y_D. This cosine similarity score is used by MERT to re-rank the sentences, and from it we can compute the posterior probability of the text being relevant to the image. Slide credit: Saurabh Gupta
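The formula referenced on the slide is plausibly the standard DSSM-style softmax over cosine similarities (a reconstruction, with γ a smoothing hyperparameter; the sum runs over the candidate captions D' being re-ranked):

```latex
R(Q, D) \;=\; \frac{y_Q^{\top} y_D}{\lVert y_Q \rVert\,\lVert y_D \rVert},
\qquad
P(D \mid Q) \;=\; \frac{\exp\bigl(\gamma\, R(Q, D)\bigr)}{\sum_{D'} \exp\bigl(\gamma\, R(Q, D')\bigr)}
```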

Results Caption generation performance for seven variants of our system on the Microsoft COCO dataset. Slide credit: Saurabh Gupta

Thank You