CSCI 5922 Neural Networks and Deep Learning: Image Captioning


CSCI 5922 Neural Networks and Deep Learning: Image Captioning Mike Mozer Department of Computer Science and Institute of Cognitive Science University of Colorado at Boulder

Fads
2014: seven papers, almost simultaneously, focused on processing images and textual descriptions
2015-2017: tons of papers about visual question answering

Image-Sentence Tasks
- Sentence retrieval: find the best-matching sentence for an image
- Sentence generation: given an image, produce a textual description
- Image retrieval: given a textual description, find the best-matching image
- Image-sentence correspondence: match images to sentences
- Question answering: given an image and a question, produce an answer
From Karpathy blog

Adding Captions to Images
- A cow is standing in the field.
- This is a close up of a brown cow.
- There is a brown cow with long hair and two horns.
- There are two trees and a cloud in the background of a field with a large cow in it.
What's a good caption? From Devi Parikh

Three Papers I'll Talk About
- Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy & Fei-Fei (Stanford), CVPR 2015
- Show and Tell: A Neural Image Caption Generator. Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
- Deep Captioning with Multimodal Recurrent Nets. Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015

Other Papers
- Kiros, R., Salakhutdinov, R., & Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv:1411.2539, 2014.
- Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389, 2014.
- Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollár, P., Gao, J., He, X., Mitchell, M., Platt, J., et al. From captions to visual concepts and back. arXiv:1411.4952, 2014.
- Chen, X., & Zitnick, C. L. Learning a recurrent visual representation for image caption generation. arXiv:1411.5654, 2014.

Data Sets
- Flickr8k: 8000 images, each annotated with 5 sentences via AMT; 1000 images each for validation and testing
- Flickr30k: 30k images; 1000 for validation, 1000 for testing
- MSCOCO: 123,000 images; 5000 each for validation and testing
These seem really small for what they have to learn... but folks take pics of certain objects (e.g., cats)

Approach (Karpathy & Fei-Fei, 2015)
Sentences act as weak labels: contiguous sequences of words correspond to some particular (unknown) location in the image
- Task 1: find an alignment between snippets of text and image locations
- Task 2: build an RNN that maps image patches to snippets of text

Representing Images
- Process regions of the image with a Region Convolutional Neural Net (RCNN), pretrained on ImageNet and further fine-tuned
- Map activations from the fully connected layer before the classification layer to an $h$-dimensional embedding $v$, where $h$ ranges from 1000 to 1600:
$$v = W_m\, \mathrm{CNN}_{\theta_c}(I_b) + b_m$$
where $\mathrm{CNN}_{\theta_c}(I_b)$ is the output of the CNN for image patch $I_b$, and $W_m$ and $b_m$ are learned weights
- Pick the 19 locations with the strongest classification response, plus the whole image
- The image is thus represented by a set of 20 $h$-dimensional embedding vectors
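As a concrete numpy sketch of this affine mapping: the RCNN is treated as a black box, and the names here (cnn_features, Wm, bm) are invented for illustration, not the paper's code.

import numpy as np

h = 1000          # embedding dimensionality (paper uses 1000-1600)
cnn_dim = 4096    # size of the fully connected layer before classification

def cnn_features(image_regions):
    # Hypothetical stand-in for the pretrained RCNN feature extractor:
    # one cnn_dim vector per region (19 strongest regions + whole image).
    return np.random.randn(len(image_regions), cnn_dim)

Wm = np.random.randn(h, cnn_dim) * 0.01   # learned projection
bm = np.zeros(h)                          # learned bias

def embed_image(image_regions):
    # Map each region's CNN activations to an h-dim embedding:
    # v = Wm CNN(I_b) + bm
    feats = cnn_features(image_regions)   # (20, cnn_dim)
    return feats @ Wm.T + bm              # (20, h)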

Representing Sentences
- Bidirectional RNN; input representation is based on word2vec
- Produces an embedding $s_t$ at each position $t$ of the sentence
- The embedding also has dimensionality $h$ [don't confuse this $h$ with the hidden activity!]
- Key claim: $s_t$ represents the concept of the words near position $t$
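A simplified numpy sketch of a bidirectional RNN that emits one $h$-dimensional embedding per word position; the weights and the exact parameterization are placeholders, not the paper's.

import numpy as np

h, d = 1000, 300                     # embedding dim, word2vec dim
Wf = np.random.randn(h, h) * 0.01    # forward recurrent weights
Wb = np.random.randn(h, h) * 0.01    # backward recurrent weights
Wx = np.random.randn(h, d) * 0.01    # input projection
Wd = np.random.randn(h, h) * 0.01    # projection to s_t

def relu(x):
    return np.maximum(0, x)

def sentence_embeddings(word_vecs):      # word_vecs: (T, d) word2vec rows
    T = len(word_vecs)
    hf = np.zeros((T, h))
    hb = np.zeros((T, h))
    for t in range(T):                   # forward pass over the sentence
        prev = hf[t-1] if t > 0 else np.zeros(h)
        hf[t] = relu(Wx @ word_vecs[t] + Wf @ prev)
    for t in reversed(range(T)):         # backward pass
        nxt = hb[t+1] if t < T-1 else np.zeros(h)
        hb[t] = relu(Wx @ word_vecs[t] + Wb @ nxt)
    # s_t combines both directions, so it reflects context around position t
    return relu((hf + hb) @ Wd.T)        # (T, h)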

Alignment
- Image fragment $i$ is mapped to an $h$-dimensional embedding $v_i$
- Sentence fragment $t$ is mapped to an $h$-dimensional embedding $s_t$
- Define the similarity between image and sentence fragments as the dot product $v_i^\top s_t$
- Match between image $k$ and sentence $l$:
$$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^\top s_t$$
where $g_k$ and $g_l$ are the sets of image and sentence fragments
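A numpy sketch of this score, assuming the paper's simplified form in which each word is matched to its single best image fragment:

import numpy as np

def image_sentence_score(V, S):
    # V: (num_fragments, h) image fragment embeddings for image k
    # S: (num_words, h)     word embeddings for sentence l
    # Returns S_kl = sum over words of the best fragment dot product
    sims = S @ V.T                 # (num_words, num_fragments)
    return sims.max(axis=1).sum()  # best fragment per word, summed over words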

Example Alignments

Alignment II
Train the model to minimize a margin-based loss: aligned image-sentence pairs should score higher than misaligned pairs, by a margin
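A sketch of one plausible max-margin ranking loss of this form (margin fixed at 1; the paper's exact loss may differ in details):

import numpy as np

def ranking_loss(scores, margin=1.0):
    # scores[k, l] = S_kl for image k vs. sentence l; diagonal entries
    # are the aligned pairs. Penalize any misaligned pair that comes
    # within `margin` of the aligned score, in both directions.
    n = scores.shape[0]
    diag = np.diag(scores)
    cost_s = np.maximum(0, scores - diag[:, None] + margin)  # rank sentences per image
    cost_i = np.maximum(0, scores - diag[None, :] + margin)  # rank images per sentence
    cost_s[np.diag_indices(n)] = 0   # aligned pairs incur no cost
    cost_i[np.diag_indices(n)] = 0
    return cost_s.sum() + cost_i.sum()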

Alignment III
- We have a measure of alignment between words and image fragments: $v_i^\top s_t$
- We want an assignment of phrases (multiple words) to image fragments
- Constraint satisfaction approach: search through a space of assignments of words to image fragments
- Encourage adjacent words to have the same assignment
- Larger $v_i^\top s_t$ → higher probability that word $t$ is associated with image fragment $i$
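The paper resolves this with a Markov random field solved by dynamic programming; below is a Viterbi-style sketch under that interpretation, with the smoothness weight beta a hypothetical hyperparameter rather than the paper's.

import numpy as np

def align_words_to_fragments(sims, beta=1.0):
    # sims[t, i] = v_i . s_t for word t and fragment i.
    # Find assignments a_t maximizing sum_t sims[t, a_t] minus a
    # beta penalty each time a_t != a_{t+1}, so adjacent words
    # prefer the same fragment.
    T, M = sims.shape
    score = sims[0].copy()                  # best score ending in fragment i
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        switch = score.max() - beta         # best score if we change fragment
        back[t] = np.where(score >= switch, np.arange(M), score.argmax())
        score = np.maximum(score, switch) + sims[t]
    a = np.zeros(T, dtype=int)              # trace back the best assignment
    a[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        a[t - 1] = back[t, a[t]]
    return a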

Training a Generative Text Model
For each image fragment associated with a sentence fragment, train an RNN to synthesize the sentence fragment
The RNN can be used for fragments or for whole images
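A minimal numpy sketch of such a conditioned generator: a ReLU RNN whose first step is biased by the image embedding, emitting a softmax over the vocabulary at each step. All weights and the choice to inject the image only at the first step are illustrative assumptions; the three papers differ on where and how often the image enters (see the Differences slide).

import numpy as np

h, vocab = 512, 10000
Whx = np.random.randn(h, vocab) * 0.01   # input (one-hot word) weights
Whh = np.random.randn(h, h) * 0.01       # recurrent weights
Whi = np.random.randn(h, h) * 0.01       # image conditioning weights
Woh = np.random.randn(vocab, h) * 0.01   # output weights
START, STOP = 0, 1                       # special token indices (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate(v, max_len=20):
    # Greedy generation: inject image embedding v (h-dim) at the
    # first step, then feed each chosen word back in.
    hidden = np.zeros(h)
    word, out = START, []
    for t in range(max_len):
        x = np.zeros(vocab)
        x[word] = 1.0
        img = Whi @ v if t == 0 else 0.0     # image only at first step
        hidden = np.maximum(0, Whx @ x + Whh @ hidden + img)
        word = int(softmax(Woh @ hidden).argmax())
        if word == STOP:
            break
        out.append(word)
    return out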

Region Predictions from Generative Model

Summary (Karpathy & Fei-Fei, 2015)
- Obtain image embeddings from the Region CNN
- Obtain sentence embeddings from a forward-backward RNN
- Train embeddings to achieve a good alignment of image patches to words in the corresponding sentence
- Use an MRF to parse the $N$ words of a sentence into phrases that correspond to the $M$ selected image fragments
- An RNN sequence generator maps image fragments to phrases

Example Sentences (demo)

Formal Evaluation
- R@K: recall rate of the ground-truth sentence among the top K candidates
- Med r: median rank of the first retrieved ground-truth sentence
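A sketch of computing both metrics from a score matrix; placing each image's ground-truth sentence on the diagonal is this sketch's convention, not a requirement of the metrics.

import numpy as np

def retrieval_metrics(scores, K=(1, 5, 10)):
    # scores[k, l]: score of sentence l for query image k; the
    # ground-truth sentence for image k is at l = k.
    n = scores.shape[0]
    order = np.argsort(-scores, axis=1)     # best candidate first
    # 1-based rank of the ground truth among all candidates
    ranks = np.array([np.where(order[k] == k)[0][0] + 1 for k in range(n)])
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in K}
    med_r = float(np.median(ranks))
    return recall_at_k, med_r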

Image Search (demo)

Examples in Mao et al. (2015)

Failures (Mao et al., 2015)

Common Themes Among Models
- Joint embedding space for images and words
- Use of ImageNet to produce image embeddings
- Generic components: RNNs, softmax for word selection on output, start and stop words

Differences Among Models
- Image processing: decomposition of the image (Karpathy) versus processing of the whole image (all others)
- What input does the image provide? Initial input to the recurrent net (Vinyals); input at every time step, either in the recurrent layer (Karpathy) or after the recurrent layer (Mao)
- How much is built in? Semantic representations? Localization of objects in images?
- Type of recurrence: fully connected ReLU (Karpathy, Mao) vs. LSTM (Vinyals)
- Read out: beam search (Vinyals) vs. not (Mao, Karpathy)
- Local-to-distributed word embeddings: one layer (Vinyals, Karpathy) vs. two layers (Mao)
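For the read-out difference, a generic beam search sketch: keep the B highest-scoring partial captions at each step instead of committing greedily. The log_prob_fn interface standing in for the trained decoder is an assumption of this sketch.

import numpy as np

def beam_search(log_prob_fn, start, stop, beam=5, max_len=20):
    # log_prob_fn(seq) -> log-probabilities over the vocabulary for the
    # next word given the partial caption `seq` (placeholder interface).
    beams = [([start], 0.0)]                 # (sequence, total log-prob)
    done = []
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            logp = log_prob_fn(seq)          # (vocab,)
            for w in np.argsort(-logp)[:beam]:   # expand top words only
                cand = (seq + [int(w)], lp + float(logp[w]))
                (done if w == stop else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam]            # keep the B best partials
    done.extend(beams)                       # include unfinished captions
    return max(done, key=lambda c: c[1])[0]  # best-scoring caption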