CSCI 5922 Neural Networks and Deep Learning: Image Captioning Mike Mozer Department of Computer Science and Institute of Cognitive Science University of Colorado at Boulder
Fads
2014: Seven papers, almost simultaneously, focused on processing images and textual descriptions
2015-2017: Tons of papers about visual question answering
Image-Sentence Tasks
Sentence retrieval: find the best matching sentence for an image
Sentence generation: given an image, produce a textual description
Image retrieval: given a textual description, find the best matching image
Image-sentence correspondence: match images to sentences
Question answering: given an image and a question, produce an answer
From Karpathy blog
Adding Captions to Images
A cow is standing in the field.
This is a close up of a brown cow.
There is a brown cow with long hair and two horns.
There are two trees and a cloud in the background of a field with a large cow in it.
What’s a good caption?
From Devi Parikh
Three Papers I’ll Talk About
Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy & Fei-Fei (Stanford), CVPR 2015
Show and Tell: A Neural Image Caption Generator. Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
Deep Captioning with Multimodal Recurrent Nets. Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
Other Papers
Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
Chen, Xinlei and Zitnick, C. Lawrence. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
Data Sets
Flickr8k: 8,000 images, each annotated with 5 sentences via AMT; 1,000 each for validation and testing
Flickr30k: 30,000 images; 1,000 for validation, 1,000 for testing
MSCOCO: 123,000 images; 5,000 each for validation and testing
These seem really small for what they have to learn… but folks take pics of certain objects (e.g., cats)
Approach (Karpathy & Fei Fei, 2015) Sentences act as weak labels contiguous sequences of words correspond to some particular (unknown) location in image Task 1 Find alignment between snippets of text and image locations Task 2 Build RNN that maps image patches to snippets of text
Representing Images
Process regions of the image with a “Region Convolutional Neural Net,” pretrained on ImageNet and further fine-tuned
Map activations from the fully connected layer before the classification layer to an h-dimensional embedding v; h ranges from 1000 to 1600
Pick the 19 locations with the strongest classification response, plus the whole image
The image is represented by a set of 20 h-dimensional embedding vectors:
v = W_m CNN_{θ_c}(I_b) + b_m
where CNN_{θ_c}(I_b) is the CNN output for image patch I_b, and W_m and b_m are learned weights
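As a concrete picture of this step, here is a minimal sketch (not the authors' code) of the learned linear map v = W_m CNN(I_b) + b_m applied to the 19 region features plus the whole-image feature. The class name, the 4096-dimensional CNN feature size, and h = 1000 are illustrative assumptions.

```python
# Sketch: project CNN features for 20 image fragments into an h-dim embedding space.
import torch
import torch.nn as nn

class ImageFragmentEmbedder(nn.Module):
    def __init__(self, cnn_dim=4096, h=1000):
        super().__init__()
        self.proj = nn.Linear(cnn_dim, h)   # W_m and b_m are learned

    def forward(self, cnn_features):
        # cnn_features: (20, cnn_dim) -- one row per image fragment
        return self.proj(cnn_features)      # (20, h) fragment embeddings v_i

embedder = ImageFragmentEmbedder()
fake_cnn_features = torch.randn(20, 4096)   # stand-in for Region CNN outputs
v = embedder(fake_cnn_features)             # 20 h-dimensional vectors
```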
Representing Sentences
Bidirectional RNN; input representation is based on word2vec
Produces an embedding s_t at each position t of the sentence
The embedding also has dimensionality h [don’t confuse this with hidden activity h!]
Key claim: s_t represents the concept of the words near position t
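A minimal sketch of this component, assuming a generic bidirectional RNN over word2vec-style inputs whose forward and backward states are merged into one h-dimensional s_t per position; the paper's exact weights and nonlinearities differ, and the 300-dimensional word vectors and h = 1000 are assumptions.

```python
# Sketch: bidirectional RNN producing one embedding s_t per word position.
import torch
import torch.nn as nn

class SentenceEmbedder(nn.Module):
    def __init__(self, word_dim=300, h=1000):
        super().__init__()
        self.brnn = nn.RNN(word_dim, h, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * h, h)      # merge forward/backward states

    def forward(self, word_vectors):
        # word_vectors: (1, T, word_dim) for a sentence of T words
        states, _ = self.brnn(word_vectors)  # (1, T, 2h)
        return torch.relu(self.proj(states)) # (1, T, h): one s_t per word

sentence = torch.randn(1, 12, 300)           # a 12-word sentence (word2vec stand-in)
s = SentenceEmbedder()(sentence)             # s[0, t] is s_t
```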
Alignment
Image fragment i is mapped to an h-dimensional embedding v_i
Sentence fragment t is mapped to an h-dimensional embedding s_t
Define the similarity between image and sentence fragments as the dot product v_i^T s_t
Match between image k and sentence l:
S_kl = Σ_{t ∈ g_l} max_{i ∈ g_k} v_i^T s_t
where g_k and g_l are the sets of image and sentence fragments
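A small numerical sketch of the fragment similarities and an image-sentence score of the form S_kl = Σ_t max_i v_i^T s_t, as described above; the array shapes are illustrative.

```python
# Sketch: fragment-level dot products and the image-sentence score S_kl.
import numpy as np

def image_sentence_score(v, s):
    # v: (M, h) image fragment embeddings for image k
    # s: (T, h) word embeddings for sentence l
    sims = s @ v.T                      # (T, M): v_i^T s_t for all pairs
    return sims.max(axis=1).sum()       # each word picks its best fragment

v = np.random.randn(20, 1000)
s = np.random.randn(12, 1000)
print(image_sentence_score(v, s))
```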
Example Alignments
Alignment II
Train the model to minimize a margin-based loss: aligned image-sentence pairs should have a higher score than misaligned pairs, by a margin
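To make the margin idea concrete, here is a minimal sketch of a ranking loss over a batch of matched image-sentence pairs, assuming a precomputed score matrix S with aligned pairs on the diagonal and an illustrative margin of 1.0; the exact form in the paper may differ.

```python
# Sketch: margin-based ranking loss over a batch of matched pairs.
import torch

def ranking_loss(S, margin=1.0):
    # S: (B, B) scores; S[k, l] scores image k against sentence l,
    # so the diagonal holds the aligned pairs.
    diag = S.diag()
    cost_s = torch.clamp(margin + S - diag.unsqueeze(1), min=0)  # rank sentences per image
    cost_i = torch.clamp(margin + S - diag.unsqueeze(0), min=0)  # rank images per sentence
    off_diag = 1 - torch.eye(S.size(0))                          # ignore the aligned pairs
    return ((cost_s + cost_i) * off_diag).sum()

S = torch.randn(8, 8)          # scores for a batch of 8 matched pairs
loss = ranking_loss(S)
```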
Alignment III
We have a measure of alignment between words and image fragments: v_i^T s_t
We want an assignment of phrases (multiple words) to image fragments
Constraint satisfaction approach: search through a space of assignments of words to image fragments
encourage adjacent words to have the same assignment
larger v_i^T s_t -> higher probability that word t will be associated with image fragment i
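This step can be pictured as decoding a chain model. Below is a minimal sketch, assuming unary scores v_i^T s_t and a pairwise bonus beta when adjacent words keep the same fragment, decoded with Viterbi; it is a simplified stand-in for the MRF formulation summarized later, not the authors' implementation.

```python
# Sketch: assign each word to an image fragment, smoothing assignments over adjacent words.
import numpy as np

def align_words_to_fragments(sims, beta=1.0):
    # sims: (T, M) matrix of v_i^T s_t scores; returns one fragment index per word
    T, M = sims.shape
    score = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)
    score[0] = sims[0]
    for t in range(1, T):
        # trans[j, i] = score of being at fragment i for word t-1, plus a bonus if i == j
        trans = score[t - 1][None, :] + beta * np.eye(M)
        back[t] = trans.argmax(axis=1)
        score[t] = sims[t] + trans.max(axis=1)
    # trace back the best assignment
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

sims = np.random.randn(12, 20)          # 12 words, 20 image fragments
print(align_words_to_fragments(sims))
```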
Training a Generative Text Model
For each image fragment associated with a sentence fragment, train an RNN to synthesize the sentence fragment
The RNN can be used for fragments or whole images
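The sketch below shows one plausible conditioning scheme (not necessarily the one in any of the three papers): the image or fragment embedding is appended to the word input at every step, and the network is trained with teacher forcing to predict the next word. CaptionRNN, the vocabulary size, and all dimensions are illustrative assumptions.

```python
# Sketch: an RNN language model conditioned on an image/fragment embedding.
import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    def __init__(self, vocab=10000, word_dim=300, img_dim=1000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.rnn = nn.RNN(word_dim + img_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words, img):
        # words: (B, T) token ids; img: (B, img_dim) fragment or image embedding
        x = self.embed(words)                                  # (B, T, word_dim)
        img_rep = img.unsqueeze(1).expand(-1, x.size(1), -1)   # repeat per time step
        h, _ = self.rnn(torch.cat([x, img_rep], dim=-1))
        return self.out(h)                                     # (B, T, vocab) logits

model = CaptionRNN()
words = torch.randint(0, 10000, (4, 15))                       # a batch of token sequences
logits = model(words, torch.randn(4, 1000))
# teacher forcing: predict words[:, 1:] from words[:, :-1]
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 10000), words[:, 1:].reshape(-1))
```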
Region Predictions from Generative Model
Summary (Karpathy & Fei-Fei, 2015) Obtain image embedding from Region CNN Obtain sentence embedding from forward- backward RNN Train embeddings to achieve a good alignment of image pates to words in corresponding sentence Use MRF to parse 𝑁 words of sentence into phrases that correspond to the 𝑀 selected image fragments RNN sequence generator maps image fragments to phrases
Example Sentences demo
Formal Evaluation
R@K: recall rate of the ground truth sentence among the top K candidates
Med r: median rank of the first retrieved ground truth sentence
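As a concrete reading of these metrics, here is a minimal sketch that computes R@K and Med r from a score matrix in which row k scores query k against all candidates and the ground-truth item sits at index k (an assumed layout).

```python
# Sketch: recall@K and median rank for retrieval evaluation.
import numpy as np

def retrieval_metrics(S, ks=(1, 5, 10)):
    # S: (N, N) scores; higher is better; ground truth on the diagonal
    order = np.argsort(-S, axis=1)                     # candidates, best first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(S))])
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in ks}
    return recall_at_k, float(np.median(ranks))        # R@K and Med r

S = np.random.randn(100, 100)
print(retrieval_metrics(S))
```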
Image Search demo
Examples in Mao et al. (2015)
Failures (Mao et al., 2015)
Common Themes Among Models
Joint embedding space for images and words
Use of ImageNet to produce image embeddings
Generic components: RNNs, softmax for word selection on output, start and stop words
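The shared read-out machinery can be sketched as a simple decoding loop: start from a START token, pick a word from the softmax at each step, feed it back in, and halt at the STOP token. The loop below assumes a model with the same interface as the hypothetical CaptionRNN above; the token ids and maximum length are arbitrary choices.

```python
# Sketch: greedy decoding with start/stop tokens and a softmax read-out.
import torch

def greedy_decode(model, img, start_id=1, stop_id=2, max_len=20):
    words = torch.tensor([[start_id]])            # (1, 1) running token sequence
    for _ in range(max_len):
        logits = model(words, img)                # (1, T, vocab)
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_id = int(probs.argmax())             # or sample from probs
        if next_id == stop_id:
            break
        words = torch.cat([words, torch.tensor([[next_id]])], dim=1)
    return words[0, 1:].tolist()                  # drop the START token

# caption_ids = greedy_decode(model, torch.randn(1, 1000))
```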
Differences Among Models
Image processing: decomposition of the image (Karpathy) versus processing of the whole image (all others)
What input does the image provide: initial input to the recurrent net (Vinyals), or input at every time step, either in the recurrent layer (Karpathy) or after the recurrent layer (Mao)
How much is built in? Semantic representations? Localization of objects in images?
Type of recurrence: fully connected ReLU (Karpathy, Mao) vs. LSTM (Vinyals)
Read out: beam search (Vinyals) vs. not (Mao, Karpathy)
Local-to-distributed word embeddings: one layer (Vinyals, Karpathy) vs. two layers (Mao)
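For the beam-search read-out, here is a minimal sketch kept generic over any model that maps (token ids, image embedding) to logits, e.g. the hypothetical CaptionRNN above; the beam width, token ids, and length cap are assumptions, and real implementations typically add refinements such as length normalization.

```python
# Sketch: beam-search decoding, tracking the top sequences by total log-probability.
import torch

def beam_search(model, img, beam=3, start_id=1, stop_id=2, max_len=20):
    beams = [([start_id], 0.0)]                    # (token sequence, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            logits = model(torch.tensor([seq]), img)          # (1, T, vocab)
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top = torch.topk(log_probs, beam)
            for lp, tok in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((seq + [tok], logp + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam]:
            (finished if seq[-1] == stop_id else beams).append((seq, logp))
        if not beams:
            break
    best_seq = max(finished + beams, key=lambda c: c[1])[0][1:]   # drop START
    return best_seq[:-1] if best_seq and best_seq[-1] == stop_id else best_seq

# caption_ids = beam_search(model, torch.randn(1, 1000))
```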