CSCI 5922 Neural Networks and Deep Learning: Image Captioning


1 CSCI 5922 Neural Networks and Deep Learning: Image Captioning
Mike Mozer Department of Computer Science and Institute of Cognitive Science University of Colorado at Boulder

2 Fads
2014: Seven papers, almost simultaneously, focused on processing images and textual descriptions. Tons of papers about visual question answering followed.

3 Image-Sentence Tasks
Sentence retrieval: find the best-matching sentence for an image
Sentence generation: given an image, produce a textual description
Image retrieval: given a textual description, find the best-matching image
Image-sentence correspondence: match images to sentences
Question answering: given an image and a question, produce an answer
From Karpathy blog

4 Adding Captions to Images
What's a good caption?
A cow is standing in the field.
This is a close up of a brown cow.
There is a brown cow with long hair and two horns.
There are two trees and a cloud in the background of a field with a large cow in it.
From Devi Parikh

5 Three Papers I'll Talk About
Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy & Fei-Fei (Stanford), CVPR 2015
Show and Tell: A Neural Image Caption Generator. Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
Deep Captioning with Multimodal Recurrent Nets. Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015

6 Other Papers
Kiros, R., Salakhutdinov, R., & Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint, 2014.
Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint, 2014.
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J., et al. From captions to visual concepts and back. arXiv preprint, 2014.
Chen, X., & Zitnick, C. L. Learning a recurrent visual representation for image caption generation. arXiv preprint, 2014.

7 Data Sets
Flickr8k: 8,000 images, each annotated with 5 sentences via AMT; 1,000 images each for validation and testing
Flickr30k: 30,000 images; 1,000 each for validation and testing
MSCOCO: 123,000 images; 5,000 each for validation and testing
These seem really small for what the models have to learn, but folks tend to photograph certain objects (e.g., cats), so the effective variety is narrower than it might appear.

8 Approach (Karpathy & Fei-Fei, 2015)
Sentences act as weak labels: contiguous sequences of words correspond to some particular (unknown) location in the image.
Task 1: find an alignment between snippets of text and image locations.
Task 2: build an RNN that maps image patches to snippets of text.

9 Representing Images
Process regions of the image with a Region Convolutional Neural Network, pretrained on ImageNet and further fine-tuned.
Map activations from the fully connected layer before the classification layer to an h-dimensional embedding v; h ranges from 1000 to 1600.
Pick the 19 locations with the strongest classification response, plus the whole image, so the image is represented by a set of 20 h-dimensional embedding vectors.
v = W_m CNN_{theta_c}(I_b) + b_m, where CNN_{theta_c}(I_b) is the CNN output for image patch I_b, and W_m and b_m are learned weights.
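A minimal sketch of this projection step (not the authors' code): the CNN's fully connected activations for each of the 20 regions are mapped into the h-dimensional embedding space by a learned affine transform. The dimensions and the random stand-in activations below are illustrative assumptions.

```python
# Sketch: project CNN fully connected activations for 20 image regions
# into an h-dimensional embedding space (v = W_m CNN(I_b) + b_m).
import numpy as np

cnn_dim = 4096    # size of the fully connected layer before classification
h = 1000          # embedding dimensionality (the paper uses 1000-1600)
num_regions = 20  # 19 strongest-responding regions + the whole image

W_m = np.random.randn(h, cnn_dim) * 0.01   # learned projection
b_m = np.zeros(h)                          # learned bias

# stand-in for CNN_{theta_c}(I_b): one activation vector per region
cnn_activations = np.random.randn(num_regions, cnn_dim)

v = cnn_activations @ W_m.T + b_m          # shape (20, h): one v_i per region
```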

10 Representing Sentences
Bidirectional RNN whose input representation is based on word2vec.
Produces an embedding s_t at each position t of the sentence.
The embedding also has dimensionality h (don't confuse this h with the RNN's hidden activity).
Key claim: s_t represents the concept of the words near position t.
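A minimal sketch of the bidirectional pass, assuming word2vec-style 300-dimensional inputs and a simple ReLU recurrence; the exact way the two directions are combined here is an assumption, not the paper's precise formulation.

```python
# Sketch: a bidirectional RNN that produces an h-dimensional embedding s_t
# at every word position, mixing context from both directions.
import numpy as np

h, d_word, T = 1000, 300, 7                # embedding dim, word-vector dim, sentence length
x = np.random.randn(T, d_word)             # word2vec inputs, one per word

W_f = np.random.randn(h, d_word) * 0.01    # forward input weights
W_b = np.random.randn(h, d_word) * 0.01    # backward input weights
U_f = np.random.randn(h, h) * 0.01         # forward recurrent weights
U_b = np.random.randn(h, h) * 0.01         # backward recurrent weights
relu = lambda z: np.maximum(0, z)

hf = np.zeros((T, h))                      # left-to-right states
for t in range(T):
    prev = hf[t - 1] if t > 0 else np.zeros(h)
    hf[t] = relu(W_f @ x[t] + U_f @ prev)

hb = np.zeros((T, h))                      # right-to-left states
for t in reversed(range(T)):
    nxt = hb[t + 1] if t < T - 1 else np.zeros(h)
    hb[t] = relu(W_b @ x[t] + U_b @ nxt)

s = relu(hf + hb)                          # s_t: (T, h), one embedding per word
```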

11 Alignment
Image fragment i is mapped to an h-dimensional embedding v_i.
Sentence fragment t is mapped to an h-dimensional embedding s_t.
Similarity between image and sentence fragments is the dot product v_i^T s_t.
The match score between image k and sentence l aggregates these similarities over g_k and g_l, the sets of image and sentence fragments.
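The slide's score formula did not survive the transcript; the sketch below follows the simplified score described in Karpathy & Fei-Fei (2015), in which each word embedding s_t contributes the dot product with its best-matching region v_i.

```python
# Sketch: image-sentence match score. Each word picks its best-matching
# region; the sentence-level score sums these best matches.
import numpy as np

def match_score(v, s):
    """v: (num_regions, h) image fragments; s: (T, h) sentence fragments."""
    sims = s @ v.T                  # (T, num_regions) matrix of v_i^T s_t
    return sims.max(axis=1).sum()   # sum over words of the best region match

v = np.random.randn(20, 1000)       # fragments of image k
s = np.random.randn(7, 1000)        # fragments of sentence l
S_kl = match_score(v, s)
```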

12 Example Alignments

13 Alignment II
Train the model to minimize a margin-based ranking loss: aligned image-sentence pairs should have a higher score than misaligned pairs, by a margin.
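A sketch of that objective, assuming a score matrix S in which S[k, l] is the match between image k and sentence l and ground-truth pairs lie on the diagonal; the margin value is illustrative.

```python
# Sketch: margin-based ranking loss. Each aligned pair's score S[k, k]
# should beat every misaligned score in its row and column by a margin.
import numpy as np

def ranking_loss(S, margin=1.0):
    """S: (N, N) scores; S[k, l] scores image k against sentence l."""
    diag = np.diag(S)
    # rank sentences given an image (rows) and images given a sentence (columns)
    row_viol = np.maximum(0, margin + S - diag[:, None])
    col_viol = np.maximum(0, margin + S - diag[None, :])
    # aligned pairs on the diagonal incur no penalty
    np.fill_diagonal(row_viol, 0)
    np.fill_diagonal(col_viol, 0)
    return row_viol.sum() + col_viol.sum()

S = np.random.randn(5, 5)
print(ranking_loss(S))
```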

14 Alignment III
We have a measure of alignment between individual words and image fragments: v_i^T s_t.
We want an assignment of phrases (multiple words) to image fragments.
Constraint-satisfaction approach: search through a space of assignments of words to image fragments, encouraging adjacent words to take the same assignment; the larger v_i^T s_t is, the higher the probability that word t is associated with image fragment i.
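A sketch of this assignment as chain decoding: unary terms favor large v_i^T s_t, a pairwise penalty beta discourages adjacent words from switching regions, and the best chain is found with dynamic programming. Treating the inference as exact Viterbi decoding, and the value of beta, are assumptions for illustration.

```python
# Sketch: assign each word t a region a_t so that adjacent words tend to
# share the same region and each word matches its region well.
import numpy as np

def align_words_to_regions(v, s, beta=1.0):
    """v: (R, h) regions; s: (T, h) words. Returns one region index per word."""
    sims = s @ v.T                            # (T, R) unary scores v_i^T s_t
    T, R = sims.shape
    score = sims[0].copy()                    # best score ending at each region
    backptr = np.zeros((T, R), dtype=int)
    for t in range(1, T):
        # staying in the same region is free; switching regions costs beta
        trans = score[None, :] - beta * (1 - np.eye(R))
        backptr[t] = trans.argmax(axis=1)
        score = sims[t] + trans.max(axis=1)
    a = np.zeros(T, dtype=int)                # trace back the best assignment
    a[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        a[t - 1] = backptr[t, a[t]]
    return a

assignments = align_words_to_regions(np.random.randn(20, 1000),
                                     np.random.randn(7, 1000))
```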

15 Training a Generative Text Model
For each image fragment associated with a sentence fragment, train an RNN to synthesize the sentence fragment. The same RNN can be used for fragments or for whole images.
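A minimal sketch of such a generator, assuming the fragment embedding v biases only the first hidden state, a ReLU recurrence, greedy (argmax) word selection, and special START/STOP tokens; the weight shapes and random initialization below are purely illustrative.

```python
# Sketch: RNN conditioned on an image (or fragment) embedding generates
# words one at a time until it emits a STOP token.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(v, W_hx, W_hh, W_hi, W_oh, embed, start_id=0, stop_id=1, max_len=20):
    h = np.zeros(W_hh.shape[0])
    word, out = start_id, []
    for t in range(max_len):
        x = embed[word]                                    # previous word's vector
        h = np.maximum(0, W_hx @ x + W_hh @ h + (W_hi @ v if t == 0 else 0))
        word = int(np.argmax(softmax(W_oh @ h)))           # greedy word choice
        if word == stop_id:
            break
        out.append(word)
    return out

V, d_word, h_dim, h_img = 1000, 300, 512, 1000
rng = np.random.default_rng(0)
caption_ids = generate(rng.standard_normal(h_img),
                       rng.standard_normal((h_dim, d_word)) * 0.01,
                       rng.standard_normal((h_dim, h_dim)) * 0.01,
                       rng.standard_normal((h_dim, h_img)) * 0.01,
                       rng.standard_normal((V, h_dim)) * 0.01,
                       rng.standard_normal((V, d_word)))
```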

16 Region Predictions from Generative Model

17 Summary (Karpathy & Fei-Fei, 2015)
Obtain image embeddings from a Region CNN.
Obtain sentence embeddings from a forward-backward (bidirectional) RNN.
Train the embeddings to achieve a good alignment of image patches to words in the corresponding sentence.
Use an MRF to parse the N words of a sentence into phrases that correspond to the M selected image fragments.
An RNN sequence generator maps image fragments to phrases.

18 Example Sentences demo

19 Formal Evaluation
R@K: recall rate of the ground-truth sentence among the top K retrieved candidates.
Med r: median rank of the first retrieved ground-truth sentence.
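A sketch of these two retrieval metrics, assuming a score matrix whose diagonal holds the ground-truth image-sentence pairs.

```python
# Sketch: recall@K and median rank from a score matrix S, where S[i, i]
# is the ground-truth pair for query i.
import numpy as np

def rank_of_truth(S):
    order = np.argsort(-S, axis=1)            # best candidates first
    truth = np.arange(S.shape[0])
    return np.array([np.where(order[i] == truth[i])[0][0] + 1
                     for i in range(S.shape[0])])

def recall_at_k(S, k):
    return float(np.mean(rank_of_truth(S) <= k))

def median_rank(S):
    return float(np.median(rank_of_truth(S)))

S = np.random.randn(100, 100)
print(recall_at_k(S, 5), median_rank(S))
```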

20 Image Search demo

21 Examples in Mao et al. (2015)

22

23

24 Failures (Mao et al., 2015)

25 Common Themes Among Models
Joint embedding space for images and words
Use of ImageNet-pretrained networks to produce image embeddings
Generic components: RNNs, softmax for word selection on output, start and stop words

26 Differences Among Models
Image processing: decomposition of the image (Karpathy) versus processing of the whole image (all others)
What input does the image provide? Initial input to the recurrent net (Vinyals); input at every time step, either within the recurrent layer (Karpathy) or after the recurrent layer (Mao)
How much is built in? Semantic representations? Localization of objects in images?
Type of recurrence: fully connected ReLU (Karpathy, Mao) versus LSTM (Vinyals)
Read out: beam search (Vinyals) versus not (Mao, Karpathy); see the sketch after this list
Local-to-distributed word embeddings: one layer (Vinyals, Karpathy) versus two layers (Mao)
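A sketch of beam-search readout as used for caption generation: keep the B most probable partial captions at each step and extend each of them. The step_logprobs hook, which returns next-word log-probabilities for a partial caption given the image, is a hypothetical stand-in for a trained decoder.

```python
# Sketch: beam search over captions, keeping the beam_size best partial
# sequences by total log-probability until they emit STOP or hit max_len.
import heapq
import numpy as np

def beam_search(step_logprobs, start_id, stop_id, beam_size=5, max_len=20):
    beams = [(0.0, [start_id])]                      # (log-prob, word ids)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == stop_id:
                finished.append((logp, seq))         # already complete
                continue
            next_lp = step_logprobs(seq)             # (vocab,) log-probs
            for w in np.argsort(-next_lp)[:beam_size]:
                candidates.append((logp + next_lp[w], seq + [int(w)]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    finished.extend(beams)
    return max(finished, key=lambda c: c[0])[1]      # best caption's word ids

# Toy example with a dummy scoring hook that always prefers lower word ids:
probs = np.arange(50, 0, -1, dtype=float)
probs /= probs.sum()
dummy = lambda seq: np.log(probs)
print(beam_search(dummy, start_id=0, stop_id=1, beam_size=3))
```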

