CS 1674: Intro to Computer Vision Recurrent Neural Networks


CS 1674: Intro to Computer Vision Recurrent Neural Networks Prof. Adriana Kovashka University of Pittsburgh December 5, 2016

Announcements. Next time: review for the final exam. By Tuesday at noon, send me three topics you want me to review (for participation credit!). Please do OMETs! (Thanks!) Grades before final: see CourseWeb, “Overall” column (I won’t need to curve).

Plan for today: Vision and language, image captioning (motivation/history); Recurrent neural networks (tools); Recent problem: visual question answering; Some approaches (tools)

Vision and Language. Humans don’t use their visual processing abilities or their speaking/listening abilities in isolation; they use them together. While computer vision and natural language processing are separate fields, there has been increased interest in combining them. A popular task is image captioning: given an image, automatically generate a caption that agrees well with human-generated captions.

Descriptive Text: “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” Scarlett O’Hara described in Gone with the Wind. People sometimes produce very vivid and richly informative descriptions about the visual world. For example, here the writer says “it was an arresting face, pointed of chin, …” Berg, Attributes Tutorial CVPR13

More Nuance than Traditional Recognition… [Image labels: person, shoe, car] You’ll notice that this human output of recognition is quite different from traditional computer vision recognition outputs, which might recognize this picture as a person, this one as a shoe, or this one as a car. Berg, Attributes Tutorial CVPR13

Toward Complex Structured Outputs: “car”. A lot of research in visual recognition has focused on producing categorical labels for items. Berg, Attributes Tutorial CVPR13

Toward Complex Structured Outputs: “pink car” (attributes of objects). Today we’ve been talking about attributes, which are a first step toward producing more complex structured recognition outputs. Berg, Attributes Tutorial CVPR13

Toward Complex Structured Outputs: “car on road” (relationships between objects). We can also think about recognizing the context of where objects are located with respect to the overall scene or relative to other objects, for example recognizing that this is a car on a road. Berg, Attributes Tutorial CVPR13

Toward Complex Structured Outputs: “Little pink smart car parked on the side of a road in a London shopping district. …” (telling the “story of an image”). Ultimately we might like our recognition systems to produce more complete predictions about the objects, their appearance, their relationships, actions, and context, perhaps even going so far as to produce a short description of the image that tells the “story behind the image.” Berg, Attributes Tutorial CVPR13

Some good results: “This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.” “Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.” “This is a picture of two dogs. The first dog is near the second furry dog.” Kulkarni et al, CVPR11

Some bad results. Missed detections: False detections: Here we see one potted plant. Missed detections: This is a picture of one dog. False detections: There are one road and one cat. The furry road is in the furry cat. This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road. This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass. Incorrect attributes: This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass. Of course it doesn’t always work! Some common mistakes are missed detections, false detections, and incorrectly predicted attributes. Kulkarni et al, CVPR11

Results with Recurrent Neural Networks Karpathy and Fei-Fei, CVPR 2015

Recurrent Networks offer a lot of flexibility: Vanilla Neural Networks Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. Image Captioning image -> sequence of words Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. Sentiment Classification sequence of words -> sentiment Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. Machine Translation seq of words -> seq of words Andrej Karpathy

Recurrent Networks offer a lot of flexibility: e.g. Video classification on frame level Andrej Karpathy

Recurrent Neural Network [Diagram: input x feeding an RNN block] Andrej Karpathy

Recurrent Neural Network. We usually want to output a prediction y at some time steps. [Diagram: x -> RNN -> y] Adapted from Andrej Karpathy

Recurrent Neural Network. We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t), where h_t is the new state, h_{t-1} is the old state, x_t is the input vector at some time step, and f_W is some function with parameters W. [Diagram: x -> RNN -> y] Andrej Karpathy

Recurrent Neural Network. We can process a sequence of vectors x by applying a recurrence formula at every time step: h_t = f_W(h_{t-1}, x_t). Notice: the same function and the same set of parameters are used at every time step. [Diagram: x -> RNN -> y] Andrej Karpathy

(Vanilla) Recurrent Neural Network. The state consists of a single “hidden” vector h: h_t = tanh(Wxh * x_t + Whh * h_{t-1}). [Diagram: x -> RNN -> y] Andrej Karpathy
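
The vanilla recurrence can be sketched in a few lines of plain Python. This is a minimal sketch, assuming the update h = tanh(Wxh * x + Whh * h) quoted later in the image captioning slides and a linear readout y = Why * h. The readout, the helper names (matvec, rnn_step, run_rnn, random_matrix), the random initialization, and the toy sizes are illustrative assumptions, not taken from the slides.

```python
import math
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(params, h_prev, x):
    """One vanilla RNN step: h = tanh(Wxh * x + Whh * h_prev), y = Why * h."""
    W_xh, W_hh, W_hy = params
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    h = [math.tanh(p) for p in pre]
    y = matvec(W_hy, h)  # assumed linear readout
    return h, y

def run_rnn(params, xs, hidden_size):
    """Apply the same function with the same parameters at every time step."""
    h = [0.0] * hidden_size
    ys = []
    for x in xs:
        h, y = rnn_step(params, h, x)
        ys.append(y)
    return h, ys

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the scale is an arbitrary choice for the demo."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
input_size, hidden_size, output_size = 4, 3, 4  # toy sizes
params = (
    random_matrix(hidden_size, input_size, rng),   # Wxh
    random_matrix(hidden_size, hidden_size, rng),  # Whh
    random_matrix(output_size, hidden_size, rng),  # Why
)
xs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # two input vectors
h_final, ys = run_rnn(params, xs, hidden_size)
```

Note that run_rnn never creates new weights inside the loop: the parameter tuple is shared across all time steps, which is the point the slides emphasize.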

Character-level language model example. Vocabulary: [h,e,l,o]. Example training sequence: “hello”. [Diagram: x -> RNN -> y] Andrej Karpathy
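
The data side of the “hello” example can be sketched as follows. This is a minimal sketch, assuming the usual setup in which each character is fed in as a one-hot vector and the target at each step is the next character, scored with a softmax cross-entropy loss. The loss choice and the helper names (one_hot, softmax, cross_entropy) are assumptions for illustration.

```python
import math

vocab = ["h", "e", "l", "o"]
char_to_idx = {c: i for i, c in enumerate(vocab)}

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_idx):
    """Negative log-probability assigned to the correct next character."""
    return -math.log(softmax(scores)[target_idx])

text = "hello"
# Inputs are the characters "h", "e", "l", "l"; targets are the next characters.
inputs = [one_hot(char_to_idx[c], len(vocab)) for c in text[:-1]]
targets = [char_to_idx[c] for c in text[1:]]

# With uninformative (all-equal) scores, the loss is log(vocabulary size).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], targets[0])
```

During training, the scores at each step would come from the RNN output y rather than from a fixed list.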

Image Captioning. Explain Images with Multimodal Recurrent Neural Networks, Mao et al.; Deep Visual-Semantic Alignments for Generating Image Descriptions, Karpathy and Fei-Fei; Show and Tell: A Neural Image Caption Generator, Vinyals et al.; Long-term Recurrent Convolutional Networks for Visual Recognition and Description, Donahue et al.; Learning a Recurrent Visual Representation for Image Caption Generation, Chen and Zitnick. Andrej Karpathy

Image Captioning [Diagram: Convolutional Neural Network and Recurrent Neural Network] Andrej Karpathy

Image Captioning test image Andrej Karpathy

Image Captioning [Diagram: test image; first input x0 = <START>] Andrej Karpathy

Image Captioning. The vector v computed from the test image enters the recurrence through an extra weight matrix Wih. Before: h = tanh(Wxh * x + Whh * h). Now: h = tanh(Wxh * x + Whh * h + Wih * v). [Diagram: test image -> v; x0 = <START> -> h0 -> y0] Andrej Karpathy
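
The “before” and “now” updates differ only by the extra Wih * v term, which a short Python sketch makes concrete. This is a minimal sketch under stated assumptions: the function name captioning_step, the toy weights, and the choice to make v optional (so the same function covers steps with and without the image term) are illustrative and not taken from the slides, which do not say at which time steps v is added.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def captioning_step(W_xh, W_hh, W_ih, x, h_prev, v=None):
    """h = tanh(Wxh * x + Whh * h_prev [+ Wih * v when an image vector is given])."""
    pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h_prev))]
    if v is not None:
        pre = [p + q for p, q in zip(pre, matvec(W_ih, v))]
    return [math.tanh(p) for p in pre]

# Toy sizes: hidden state of 2, word input of 3, image vector of 2.
W_xh = [[0.1, 0.0, -0.1], [0.0, 0.2, 0.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_ih = [[1.0, 0.0], [0.0, 1.0]]
x0 = [1.0, 0.0, 0.0]      # one-hot vector standing in for <START>
h_init = [0.0, 0.0]
v = [0.3, -0.3]           # stand-in for the image vector

h_before = captioning_step(W_xh, W_hh, W_ih, x0, h_init)     # no image term
h_now = captioning_step(W_xh, W_hh, W_ih, x0, h_init, v)     # with Wih * v
```

Comparing h_before and h_now shows how the image shifts the hidden state before any word has been generated.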

Image Captioning: sample! [Diagram: test image; input x0 = <START> -> h0 -> y0; sampled word: straw] Andrej Karpathy

Image Captioning [Diagram: test image; inputs <START>, straw -> h0, h1 -> y0, y1] Andrej Karpathy

Image Captioning: sample! [Diagram: test image; inputs <START>, straw -> h0, h1 -> y0, y1; sampled word: hat] Andrej Karpathy

Image Captioning [Diagram: test image; inputs <START>, straw, hat -> h0, h1, h2 -> y0, y1, y2] Andrej Karpathy

Image Captioning: sample the <END> token => finish. Caption generated: “straw hat”. [Diagram: test image; inputs <START>, straw, hat -> h0, h1, h2 -> y0, y1, y2] Adapted from Andrej Karpathy
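
The generation procedure walked through above (feed <START>, sample a word, feed it back in, stop at <END>) can be sketched as a loop. This is a minimal sketch: toy_next_word_probs is a hypothetical stand-in for the trained CNN + RNN + softmax, hard-wired so that it reproduces the “straw hat” example, and the names generate_caption and max_len are assumptions for illustration.

```python
import random

START, END = "<START>", "<END>"
vocab = [START, "straw", "hat", END]

def toy_next_word_probs(prev_word, state):
    """Hypothetical model: returns a distribution over vocab and a new state.

    A real model would update a hidden state from the previous word (and the
    image); this stub puts all probability on a fixed next word.
    """
    table = {START: "straw", "straw": "hat", "hat": END}
    next_word = table[prev_word]
    probs = [1.0 if w == next_word else 0.0 for w in vocab]
    return probs, state

def generate_caption(next_word_probs, rng, max_len=20):
    """Sample one word at a time, feeding each sample back in as the next input."""
    words = []
    word, state = START, None
    for _ in range(max_len):
        probs, state = next_word_probs(word, state)
        word = rng.choices(vocab, weights=probs, k=1)[0]
        if word == END:  # sampling <END> finishes the caption
            break
        words.append(word)
    return " ".join(words)

caption = generate_caption(toy_next_word_probs, random.Random(0))
```

The max_len cap is a practical safeguard in case <END> is never sampled; it is not part of the slides.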

Image Sentence Datasets. Microsoft COCO [Tsung-Yi Lin et al. 2014], mscoco.org. Currently: ~120K images, ~5 sentences each. Andrej Karpathy

Some Results Andrej Karpathy

Visual Question Answering (VQA) Task: Given an image and a natural language open-ended question, generate a natural language answer. Aishwarya Agrawal

VQA Dataset Aishwarya Agrawal

Applications of VQA: an aid to the visually impaired. “Is it safe to cross the street now?” Aishwarya Agrawal

Applications of VQA: surveillance. “What kind of car did the man in the red shirt leave in?” Aishwarya Agrawal

Applications of VQA: interacting with a robot. “Is my laptop in my bedroom upstairs?” Aishwarya Agrawal

2-Channel VQA Model. Image channel: Convolution Layer + Non-Linearity, Pooling Layer, Fully-Connected MLP -> 4096-dim Image Embedding. Question channel: “How many horses are in this image?” -> 1024-dim Question Embedding. Both embeddings feed a Neural Network with a Softmax over the top K answers. Aishwarya Agrawal
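
The two-channel idea can be sketched with toy dimensions. This is a minimal sketch, not the model from the slide: the slide does not say how the two embeddings are combined, so the element-wise product used here is an assumption, as are the tanh projection of the image feature, the function name vqa_answer, the weights, and the tiny sizes (4 and 2 instead of 4096 and 1024).

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def vqa_answer(image_feat, question_emb, W_img, W_out, answers):
    """Project the image feature, fuse with the question, score K answers."""
    # Project the image feature to the same size as the question embedding.
    img_emb = [math.tanh(a) for a in matvec(W_img, image_feat)]
    # Assumed fusion: element-wise product of the two embeddings.
    fused = [i * q for i, q in zip(img_emb, question_emb)]
    # Softmax over the K candidate answers.
    probs = softmax(matvec(W_out, fused))
    best = max(range(len(answers)), key=lambda k: probs[k])
    return answers[best], probs

answers = ["1", "2", "3"]               # K = 3 candidate answers
image_feat = [0.5, -1.0, 2.0, 0.0]      # stand-in for the CNN image feature
question_emb = [1.0, -0.5]              # stand-in for the question embedding
W_img = [[0.2, 0.0, 0.1, 0.0], [0.0, 0.3, 0.0, 0.1]]
W_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

answer, probs = vqa_answer(image_feat, question_emb, W_img, W_out, answers)
```

Treating VQA as classification over a fixed list of the K most frequent answers is what makes the final softmax possible; open-ended generation would need a decoder instead.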

Incorporating Knowledge Wu et al., CVPR 2016

Incorporating Attention Shih et al., CVPR 2016

Visual Question Answering Demo Aishwarya Agrawal