CSCI 5922 Neural Networks and Deep Learning: Image Captioning Mike Mozer Department of Computer Science and Institute of Cognitive Science University of Colorado at Boulder
Fads
2014: Seven papers, almost simultaneously, focused on processing images and textual descriptions
2015-2017: Tons of papers about visual question answering
Image-Sentence Tasks
Sentence retrieval: find the best matching sentence for an image
Sentence generation: given an image, produce a textual description
Image retrieval: given a textual description, find the best matching image
Image-sentence correspondence: match images to sentences
Question answering: given an image and a question, produce an answer
From Karpathy blog
Adding Captions to Images
A cow is standing in the field.
This is a close up of a brown cow.
There is a brown cow with long hair and two horns.
There are two trees and a cloud in the background of a field with a large cow in it.
What’s a good caption?
From Devi Parikh
Three Papers I’ll Talk About
Deep Visual-Semantic Alignments for Generating Image Descriptions. Karpathy & Fei-Fei (Stanford), CVPR 2015
Show and Tell: A Neural Image Caption Generator. Vinyals, Toshev, Bengio, Erhan (Google), CVPR 2015
Deep Captioning with Multimodal Recurrent Nets. Mao, Xu, Yang, Wang, Yuille (UCLA, Baidu), ICLR 2015
Other Papers
Kiros, Ryan, Salakhutdinov, Ruslan, and Zemel, Richard S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
Donahue, Jeff, Hendricks, Lisa Anne, Guadarrama, Sergio, Rohrbach, Marcus, Venugopalan, Subhashini, Saenko, Kate, and Darrell, Trevor. Long-term recurrent convolutional networks for visual recognition and description. arXiv preprint arXiv:1411.4389, 2014.
Fang, Hao, Gupta, Saurabh, Iandola, Forrest, Srivastava, Rupesh, Deng, Li, Dollár, Piotr, Gao, Jianfeng, He, Xiaodong, Mitchell, Margaret, Platt, John, et al. From captions to visual concepts and back. arXiv preprint arXiv:1411.4952, 2014.
Chen, Xinlei and Zitnick, C. Lawrence. Learning a recurrent visual representation for image caption generation. arXiv preprint arXiv:1411.5654, 2014.
Data Sets
Flickr8k: 8,000 images, each annotated with 5 sentences via AMT; 1,000 each for validation and testing
Flickr30k: 30,000 images; 1,000 for validation, 1,000 for testing
MSCOCO: 123,000 images; 5,000 each for validation and testing
These seem really small for what they have to learn… but folks take pics of certain objects (e.g., cats)
Approach (Karpathy & Fei Fei, 2015) Sentences act as weak labels contiguous sequences of words correspond to some particular (unknown) location in image Task 1 Find alignment between snippets of text and image locations Task 2 Build RNN that maps image patches to snippets of text
Representing Images
Process regions of the image with a “Region Convolutional Neural Net,” pretrained on ImageNet and further fine-tuned
Map activations from the fully connected layer before the classification layer to an h-dimensional embedding v; h ranges from 1000 to 1600
Pick the 19 locations with the strongest classification response, plus the whole image
The image is represented by a set of 20 h-dimensional embedding vectors:
v = W_m CNN_{θ_c}(I_b) + b_m
where CNN_{θ_c}(I_b) is the CNN output for image patch I_b, and W_m and b_m are learned weights
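As a concrete picture of this step, here is a minimal sketch (not the authors' code) of the learned linear map v = W_m CNN(I_b) + b_m applied to the 19 region features plus the whole-image feature. The class name, the 4096-dimensional CNN feature size, and h = 1000 are illustrative assumptions.

```python
# Sketch: project CNN features for 20 image fragments into an h-dim embedding space.
import torch
import torch.nn as nn

class ImageFragmentEmbedder(nn.Module):
    def __init__(self, cnn_dim=4096, h=1000):
        super().__init__()
        self.proj = nn.Linear(cnn_dim, h)   # W_m and b_m are learned

    def forward(self, cnn_features):
        # cnn_features: (20, cnn_dim) -- one row per image fragment
        return self.proj(cnn_features)      # (20, h) fragment embeddings v_i

embedder = ImageFragmentEmbedder()
fake_cnn_features = torch.randn(20, 4096)   # stand-in for Region CNN outputs
v = embedder(fake_cnn_features)             # 20 h-dimensional vectors
```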
Representing Sentences
Bidirectional RNN; input representation is based on word2vec
Produces an embedding s_t at each position t of the sentence
The embedding also has dimensionality h [don’t confuse this with hidden activity h!]
Key claim: s_t represents the concept of the words near position t
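A minimal sketch of this component, assuming a generic bidirectional RNN over word2vec-style inputs whose forward and backward states are merged into one h-dimensional s_t per position; the paper's exact weights and nonlinearities differ, and the 300-dimensional word vectors and h = 1000 are assumptions.

```python
# Sketch: bidirectional RNN producing one embedding s_t per word position.
import torch
import torch.nn as nn

class SentenceEmbedder(nn.Module):
    def __init__(self, word_dim=300, h=1000):
        super().__init__()
        self.brnn = nn.RNN(word_dim, h, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * h, h)      # merge forward/backward states

    def forward(self, word_vectors):
        # word_vectors: (1, T, word_dim) for a sentence of T words
        states, _ = self.brnn(word_vectors)  # (1, T, 2h)
        return torch.relu(self.proj(states)) # (1, T, h): one s_t per word

sentence = torch.randn(1, 12, 300)           # a 12-word sentence (word2vec stand-in)
s = SentenceEmbedder()(sentence)             # s[0, t] is s_t
```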
Alignment
Image fragment i is mapped to an h-dimensional embedding v_i
Sentence fragment t is mapped to an h-dimensional embedding s_t
Define the similarity between image and sentence fragments as the dot product v_i^T s_t
Match between image k and sentence l:
S_kl = Σ_{t ∈ g_l} max_{i ∈ g_k} v_i^T s_t
where g_k and g_l are the sets of image and sentence fragments
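A small numerical sketch of the fragment similarities and an image-sentence score of the form S_kl = Σ_t max_i v_i^T s_t, as described above; the array shapes are illustrative.

```python
# Sketch: fragment-level dot products and the image-sentence score S_kl.
import numpy as np

def image_sentence_score(v, s):
    # v: (M, h) image fragment embeddings for image k
    # s: (T, h) word embeddings for sentence l
    sims = s @ v.T                      # (T, M): v_i^T s_t for all pairs
    return sims.max(axis=1).sum()       # each word picks its best fragment

v = np.random.randn(20, 1000)
s = np.random.randn(12, 1000)
print(image_sentence_score(v, s))
```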
Example Alignments
Alignment II
Train the model to minimize a margin-based loss: aligned image-sentence pairs should have a higher score than misaligned pairs, by a margin
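To make the margin idea concrete, here is a minimal sketch of a ranking loss over a batch of matched image-sentence pairs, assuming a precomputed score matrix S with aligned pairs on the diagonal and an illustrative margin of 1.0; the exact form in the paper may differ.

```python
# Sketch: margin-based ranking loss over a batch of matched pairs.
import torch

def ranking_loss(S, margin=1.0):
    # S: (B, B) scores; S[k, l] scores image k against sentence l,
    # so the diagonal holds the aligned pairs.
    diag = S.diag()
    cost_s = torch.clamp(margin + S - diag.unsqueeze(1), min=0)  # rank sentences per image
    cost_i = torch.clamp(margin + S - diag.unsqueeze(0), min=0)  # rank images per sentence
    off_diag = 1 - torch.eye(S.size(0))                          # ignore the aligned pairs
    return ((cost_s + cost_i) * off_diag).sum()

S = torch.randn(8, 8)          # scores for a batch of 8 matched pairs
loss = ranking_loss(S)
```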
Alignment III
We have a measure of alignment between words and image fragments: v_i^T s_t
We want an assignment of phrases (multiple words) to image fragments
Constraint satisfaction approach: search through a space of assignments of words to image fragments
encourage adjacent words to have the same assignment
larger v_i^T s_t -> higher probability that word t will be associated with image fragment i
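This step can be pictured as decoding a chain model. Below is a minimal sketch, assuming unary scores v_i^T s_t and a pairwise bonus beta when adjacent words keep the same fragment, decoded with Viterbi; it is a simplified stand-in for the MRF formulation summarized later, not the authors' implementation.

```python
# Sketch: assign each word to an image fragment, smoothing assignments over adjacent words.
import numpy as np

def align_words_to_fragments(sims, beta=1.0):
    # sims: (T, M) matrix of v_i^T s_t scores; returns one fragment index per word
    T, M = sims.shape
    score = np.zeros((T, M))
    back = np.zeros((T, M), dtype=int)
    score[0] = sims[0]
    for t in range(1, T):
        # trans[j, i] = score of being at fragment i for word t-1, plus a bonus if i == j
        trans = score[t - 1][None, :] + beta * np.eye(M)
        back[t] = trans.argmax(axis=1)
        score[t] = sims[t] + trans.max(axis=1)
    # trace back the best assignment
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

sims = np.random.randn(12, 20)          # 12 words, 20 image fragments
print(align_words_to_fragments(sims))
```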
Training a Generative Text Model
For each image fragment associated with a sentence fragment, train an RNN to synthesize the sentence fragment
The RNN can be used for fragments or whole images
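The sketch below shows one plausible conditioning scheme (not necessarily the one in any of the three papers): the image or fragment embedding is appended to the word input at every step, and the network is trained with teacher forcing to predict the next word. CaptionRNN, the vocabulary size, and all dimensions are illustrative assumptions.

```python
# Sketch: an RNN language model conditioned on an image/fragment embedding.
import torch
import torch.nn as nn

class CaptionRNN(nn.Module):
    def __init__(self, vocab=10000, word_dim=300, img_dim=1000, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.rnn = nn.RNN(word_dim + img_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, words, img):
        # words: (B, T) token ids; img: (B, img_dim) fragment or image embedding
        x = self.embed(words)                                  # (B, T, word_dim)
        img_rep = img.unsqueeze(1).expand(-1, x.size(1), -1)   # repeat per time step
        h, _ = self.rnn(torch.cat([x, img_rep], dim=-1))
        return self.out(h)                                     # (B, T, vocab) logits

model = CaptionRNN()
words = torch.randint(0, 10000, (4, 15))                       # a batch of token sequences
logits = model(words, torch.randn(4, 1000))
# teacher forcing: predict words[:, 1:] from words[:, :-1]
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 10000), words[:, 1:].reshape(-1))
```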
Region Predictions from Generative Model
Summary (Karpathy & Fei-Fei, 2015) Obtain image embedding from Region CNN Obtain sentence embedding from forward- backward RNN Train embeddings to achieve a good alignment of image pates to words in corresponding sentence Use MRF to parse 𝑁 words of sentence into phrases that correspond to the 𝑀 selected image fragments RNN sequence generator maps image fragments to phrases
Example Sentences demo
Formal Evaluation
R@K: recall rate of the ground truth sentence among the top K candidates
Med r: median rank of the first retrieved ground truth sentence
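As a concrete reading of these metrics, here is a minimal sketch that computes R@K and Med r from a score matrix in which row k scores query k against all candidates and the ground-truth item sits at index k (an assumed layout).

```python
# Sketch: recall@K and median rank for retrieval evaluation.
import numpy as np

def retrieval_metrics(S, ks=(1, 5, 10)):
    # S: (N, N) scores; higher is better; ground truth on the diagonal
    order = np.argsort(-S, axis=1)                     # candidates, best first
    ranks = np.array([np.where(order[i] == i)[0][0] + 1 for i in range(len(S))])
    recall_at_k = {k: float(np.mean(ranks <= k)) for k in ks}
    return recall_at_k, float(np.median(ranks))        # R@K and Med r

S = np.random.randn(100, 100)
print(retrieval_metrics(S))
```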
Image Search demo
Examples in Mao et al. (2015)
Failures (Mao et al., 2015)
Common Themes Among Models
Joint embedding space for images and words
Use of ImageNet to produce image embeddings
Generic components: RNNs, softmax for word selection on output, start and stop words
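The shared read-out machinery can be sketched as a simple decoding loop: start from a START token, pick a word from the softmax at each step, feed it back in, and halt at the STOP token. The loop below assumes a model with the same interface as the hypothetical CaptionRNN above; the token ids and maximum length are arbitrary choices.

```python
# Sketch: greedy decoding with start/stop tokens and a softmax read-out.
import torch

def greedy_decode(model, img, start_id=1, stop_id=2, max_len=20):
    words = torch.tensor([[start_id]])            # (1, 1) running token sequence
    for _ in range(max_len):
        logits = model(words, img)                # (1, T, vocab)
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_id = int(probs.argmax())             # or sample from probs
        if next_id == stop_id:
            break
        words = torch.cat([words, torch.tensor([[next_id]])], dim=1)
    return words[0, 1:].tolist()                  # drop the START token

# caption_ids = greedy_decode(model, torch.randn(1, 1000))
```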
Differences Among Models
Image processing: decomposition of the image (Karpathy) versus processing of the whole image (all others)
What input does the image provide: initial input to the recurrent net (Vinyals), or input at every time step, either in the recurrent layer (Karpathy) or after the recurrent layer (Mao)
How much is built in? Semantic representations? Localization of objects in images?
Type of recurrence: fully connected ReLU (Karpathy, Mao) vs. LSTM (Vinyals)
Read out: beam search (Vinyals) vs. not (Mao, Karpathy)
Local-to-distributed word embeddings: one layer (Vinyals, Karpathy) vs. two layers (Mao)
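For the beam-search read-out, here is a minimal sketch kept generic over any model that maps (token ids, image embedding) to logits, e.g. the hypothetical CaptionRNN above; the beam width, token ids, and length cap are assumptions, and real implementations typically add refinements such as length normalization.

```python
# Sketch: beam-search decoding, tracking the top sequences by total log-probability.
import torch

def beam_search(model, img, beam=3, start_id=1, stop_id=2, max_len=20):
    beams = [([start_id], 0.0)]                    # (token sequence, log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            logits = model(torch.tensor([seq]), img)          # (1, T, vocab)
            log_probs = torch.log_softmax(logits[0, -1], dim=-1)
            top = torch.topk(log_probs, beam)
            for lp, tok in zip(top.values.tolist(), top.indices.tolist()):
                candidates.append((seq + [tok], logp + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, logp in candidates[:beam]:
            (finished if seq[-1] == stop_id else beams).append((seq, logp))
        if not beams:
            break
    best_seq = max(finished + beams, key=lambda c: c[1])[0][1:]   # drop START
    return best_seq[:-1] if best_seq and best_seq[-1] == stop_id else best_seq

# caption_ids = beam_search(model, torch.randn(1, 1000))
```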