Unsupervised Visual Representation Learning by Context Prediction


Unsupervised Visual Representation Learning by Context Prediction
Carl Doersch, joint work with Alexei A. Efros & Abhinav Gupta

ImageNet + Deep Learning
[Slide: a beagle image classified as "Beagle"; transfer tasks listed: Image Retrieval, Detection (R-CNN), Segmentation (FCN), Depth Estimation, …]
- By now, you're probably familiar with the idea that big datasets of labeled images plus deep learning have ushered in a new era of computer vision.
- What's even more remarkable, however, is that the trained models seem to transfer to many other problems.

ImageNet + Deep Learning
[Slide: the same beagle, annotated with Pose? Boundaries? Geometry? Parts? Materials? Do we need this task? Do we even need semantic labels?]
- One way to explain this phenomenon is that the network begins by extracting some general-purpose information in order to solve classification.
- For example, to classify the breed of this dog, it makes sense to first extract its pose, parts, etc.
- For those of us who care about other problems besides classification, this classification problem is just a "pretext" (an excuse) for extracting this more general-purpose information.
- But if the features we're learning are general, then does it have to be *this* task that the network trains on?
- Semantic labels from humans are expensive.
- Do we need semantic labels in order to learn a useful representation? Or is there some other pretext task that will learn something similar?

Context as Supervision [Collobert & Weston 2008; Mikolov et al. 2013]
- There's already evidence that context can serve as a pretext task in the text domain.
- However, doing this with pixels is hard because each pixel has so little semantic information.

Context Prediction for Images
- Say I give you a pair of patches from one image. Can you say where they go relative to one another?

Semantics from a non-semantic task
- This is hard if you don't know what a bus is, but easy if you do.
- We hope that if we give this task to a machine learner, it will also solve it by recognizing semantics.

Relative Position Task
[Slide: randomly sample a patch, sample a second patch from one of 8 possible locations, pass both through a CNN and classify the relative position.]
- Let's ask a machine learner to solve this. Given an unlabeled image, we sample one patch uniformly at random, and then a second patch from one of these 8 possible locations relative to the first.
- We then pass both patches to a deep network which is trained to predict which of the 8 classes was sampled.
- Note the structure of this network: we have a relatively complex embedding that happens in parallel for both patches, followed by a relatively simple classifier that receives input from both patches. We expect that this will encourage the network to extract semantics, because that's what makes life easier for the simple classifier.
- This embedding function can be transferred to other problems.
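A minimal numpy sketch of how a training example for this task might be sampled. The patch size, gap, and sampling details below are illustrative assumptions, not the exact values used in the talk:

```python
import numpy as np

def sample_patch_pair(image, patch_size=96, gap=48, rng=np.random):
    """Sample a (patch1, patch2, label) triple for the relative-position task.

    label is one of 8 classes: where patch2 sits relative to patch1
    (the 8 neighbors of a 3x3 grid, skipping the center).
    Assumes an HWC image large enough to fit the whole 3x3 neighborhood.
    """
    h, w, _ = image.shape
    stride = patch_size + gap  # center-to-center distance between patches

    # The 8 neighbor offsets in (row, col) grid units, excluding (0, 0).
    offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]

    # Sample the first patch so that every neighbor still fits in the image.
    y1 = rng.randint(stride, h - patch_size - stride + 1)
    x1 = rng.randint(stride, w - patch_size - stride + 1)

    label = rng.randint(8)
    dy, dx = offsets[label]
    y2, x2 = y1 + dy * stride, x1 + dx * stride

    patch1 = image[y1:y1 + patch_size, x1:x1 + patch_size]
    patch2 = image[y2:y2 + patch_size, x2:x2 + patch_size]
    return patch1, patch2, label
```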

Note: connects across instances!
[Slide: an input patch, its CNN patch embedding, and its nearest neighbors.]
- Create an embedding where semantically similar patches are close together.
- Indeed, for this patch, the most similar other patches are other cats.
- Note what this means: we've connected across instances of cats despite using an objective function that operates within a single image.
- We never told the algorithm that these were the same in any way.
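The nearest-neighbor retrieval behind these visualizations is a few lines of numpy once per-patch features have been extracted from the embedding tower; a small sketch, assuming the features are already stacked as rows of a matrix:

```python
import numpy as np

def nearest_neighbors(query_feat, all_feats, k=5):
    """Return indices of the k patches whose L2-normalized embeddings are
    closest to the query under cosine similarity."""
    q = query_feat / np.linalg.norm(query_feat)
    f = all_feats / np.linalg.norm(all_feats, axis=1, keepdims=True)
    sims = f @ q
    return np.argsort(-sims)[:k]
```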

Architecture
[Diagram: two AlexNet-style towers (convolution, LRN, max pooling, and fully connected layers) with tied weights, one for Patch 1 and one for Patch 2, joined by shared fully connected layers and a softmax loss.]
Training requires Batch Normalization [Ioffe et al. 2015], but no other tricks.
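A hedged PyTorch sketch of the two-tower, tied-weight structure; the layer sizes here are placeholders rather than the AlexNet-style configuration in the diagram:

```python
import torch
import torch.nn as nn

class RelativePositionNet(nn.Module):
    """Two towers with tied weights: each patch goes through the same
    embedding trunk; a small classifier on the concatenated features
    predicts one of the 8 relative positions."""

    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(            # shared between both patches
            nn.Conv2d(3, 64, 5, stride=2), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(64, 128, 3), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 256), nn.ReLU(),    # per-patch fully connected layer
        )
        self.classifier = nn.Sequential(       # sees both patches
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 8),                 # 8 relative positions
        )

    def forward(self, patch1, patch2):
        f1, f2 = self.trunk(patch1), self.trunk(patch2)   # tied weights
        return self.classifier(torch.cat([f1, f2], dim=1))
```

Training pairs this output with a standard 8-way cross-entropy loss, i.e. the softmax loss at the top of the diagram.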

Avoiding Trivial Shortcuts
Include a gap. Jitter the patch locations.
- The way that we extract the patches is important. The problem is that there are trivial shortcuts: ways that the network can solve the problem without really extracting the semantics that we're after, like the line in this pair of patches.
- The gap makes it less likely that low-level properties cross both patches.
- The jittering makes it harder to match straight lines between two patches.
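A tiny sketch of the jitter, applied on top of the gapped grid positions from the sampler above; the +/- 7-pixel magnitude is an assumption here:

```python
import numpy as np

def jitter_position(y, x, max_jitter=7, rng=np.random):
    """Independently perturb a patch's top-left corner by up to +/- max_jitter
    pixels in each direction, so straight lines and other low-level cues
    cannot be matched exactly across the gap."""
    return (y + rng.randint(-max_jitter, max_jitter + 1),
            x + rng.randint(-max_jitter, max_jitter + 1))
```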

A Not-So "Trivial" Shortcut
- After training for a while, another "trivial" shortcut started to take hold, which was difficult to track down. We first noticed pairs that the net could predict much better than people.
- We found that the nearest neighbors of such a patch were other textures with little else in common.
- We then looked at the distribution of where those nearest-neighbor patches were occurring in the original image frame; it turned out they were all from the same corner of the image.
- Was the network figuring out where the patches were on the image frame? To find out, we trained a separate network to predict the absolute (x, y) coordinates of individual patches.
- For some images (not all), the net predicted very well, even for patches with no semantics in them.
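To make the diagnostic concrete, here is a hypothetical sketch of such a network: reuse a patch-embedding trunk and train a small head to regress each patch's normalized (x, y) position in the image. None of the names or sizes below come from the talk:

```python
import torch
import torch.nn as nn

# Hypothetical diagnostic: given a trunk mapping a patch to a 256-d feature,
# regress the patch's normalized (x, y) position in the original image.
# If this trains well even on patches with no semantic content, the trunk is
# relying on low-level positional cues rather than semantics.
position_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))
mse = nn.MSELoss()

def diagnostic_loss(trunk, patches, xy_normalized):
    """patches: (B, 3, H, W); xy_normalized: (B, 2) with coordinates in [0, 1]."""
    return mse(position_head(trunk(patches)), xy_normalized)
```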

Chromatic Aberration
- It turns out that the images where this worked were those affected by chromatic aberration.
- Chromatic aberration happens when a lens bends different wavelengths by different amounts.
- For common lenses (specifically, the achromatic doublet), the green color channel is shrunk a little bit toward the image center relative to red and blue.
- Deep nets can detect this subtle shift, which tells the net where a patch is with respect to the lens, and gives away the answer to the relative-position task.

Chromatic Aberration
- There are various ways to fix this; for example, removing color (for our experiments, we randomly dropped 2 of the 3 color channels).
- But there's a more important lesson: deep nets are kind of lazy. If there's a way to solve a problem without learning semantics, they may learn to do that instead.
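A minimal numpy sketch of the color-dropping fix. The slide only says that 2 of the 3 channels are dropped; replacing the dropped channels with noise, and the channel-last float layout, are assumptions here:

```python
import numpy as np

def drop_color_channels(patch, rng=np.random):
    """Keep one randomly chosen color channel and replace the other two, so
    the network cannot exploit the relative shift between color channels
    caused by chromatic aberration. Assumes a float HWC image."""
    out = rng.normal(loc=patch.mean(), scale=patch.std() + 1e-8,
                     size=patch.shape)
    keep = rng.randint(3)               # index of the channel to preserve
    out[..., keep] = patch[..., keep]
    return out.astype(patch.dtype)
```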

What is learned?
[Nearest-neighbor comparison: Input / Ours / Random Initialization / ImageNet AlexNet]
- Now let's look at some results. We first did a sanity check, sampling 1000 patches from Pascal and finding nearest neighbors. I've selected a few I think are interesting.

Still don't capture everything
[Nearest-neighbor comparison: Input / Ours / Random Initialization / ImageNet AlexNet]

You don't always need to learn!
[Nearest-neighbor comparison: Input / Ours / Random Initialization / ImageNet AlexNet]

Visual Data Mining
Via geometric verification, simplified from [Chum et al. 2007]
- We can perform visual data mining via our nearest neighbors, but the net seems to capture textures as well as objects.
- Objects tend to be made up of parts that fit a particular geometric configuration.
- Hence, we can pick out objects by geometric verification: finding sets of patches that occur together in a consistent configuration.

Mined from Pascal VOC2011

Pre-Training for R-CNN
- If we can mine objects, can we detect them? We adopt the R-CNN pipeline [Girshick et al. 2014] on VOC 2007.
- Pre-train on the relative-position task, without labels.

VOC 2007 Performance (pretraining for R-CNN)
[Bar chart: % average precision for No Pretraining, Ours, and ImageNet Labels, each with no rescaling and with the rescaling of Krähenbühl et al. 2015, plus VGG-based variants (VGG + Krähenbühl et al.); values shown: 68.6, 61.7, 56.8, 54.2, 51.1, 46.3, 45.6, 42.4, 40.7.]
[Krähenbühl, Doersch, Donahue & Darrell, "Data-dependent Initializations of CNNs", 2015]

VOC 2007 Performance (pretraining for R-CNN)
[Bar chart: average precision for No Pretraining, Ours, and ImageNet Labels, with no rescaling and with the rescaling of Krähenbühl et al. 2015; values shown: 56.8%, 54.2%, 51.1%, 46.3%, 45.6%, 40.7%.]
[Krähenbühl, Doersch, Donahue & Darrell, "Data-dependent Initializations of CNNs", 2015]

Capturing Geometry?
- We're doing better at object understanding, but what about other tasks? Looking back at our nearest neighbors, it seems like the network is capturing geometry as well.

Surface-normal Estimation

Method             Error (lower better)    % Good Pixels (higher better)
                   Mean     Median         11.25°    22.5°    30.0°
No Pretraining     38.6     26.5           33.1      46.8     52.5
Ours               33.2     21.3           36.0      51.2     57.8
ImageNet Labels    33.3     20.8           36.7      51.7     58.1

So, do we need semantic labels?
- While the question remains open, we believe the answer might be no.
- And ours isn't the only recent work suggesting that improvements can be had without them.

"Self-Supervision" and the Future
- Ego-Motion [Agrawal et al. 2015; Jayaraman et al. 2015]
- Video [Wang et al. 2015; Srivastava et al. 2015; …]
- Context [Doersch et al. 2014; Pathak et al. 2015; Isola et al. 2015]
- There are a number of recent works on so-called "self-supervised" learning: that is, algorithms which start with data that has no hand-labeled, semantic annotations, but create a seemingly non-semantic task which still requires semantic information to solve. This includes egomotion, video cues like temporal continuity and prediction in time, as well as other works that use context.

Thank you!

Visual Data Mining?
- How does the geometric verification work? First, note that the relative predictions are not quite enough: the network is quite capable of predicting the relative positions of pairs of patches that don't contain the same object, just because there are strong scene-layout cues like the ground.

Geometric Verification
Like [Chum et al. 2007], but simpler
- What makes an object an object? Say we had four patches on the same object, like this dog. If we found nearest neighbors for all four of them, there would be images that contribute nearest neighbors to all four.
- This is also true for textures. But note one big difference: the resulting nearest neighbors for the dog in the retrieved image are in essentially the same configuration as the original. But for the water, they're all over the place.

Geometric Verification
Like [Chum et al. 2007], but simpler
- In detail, we begin by sampling sets of four patches in a 2x2 grid, which we call a constellation.
- For each constellation, we find the top 100 images which have strong matches for all four (we initially measure how well an image matches by simply summing up the similarity scores for the individual patches).
- Out of those 100, we discard those images where the set of matches is not in a roughly square pattern, similar to the one we started with. We count up how many are left (e.g., 84/100 vs. 15/100 vs. 7/100 in the examples shown), which serves as the score of the constellation overall. We simply rank the constellations based on this score.
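A rough sketch of this ranking step, assuming we already have, for each of the 100 candidate images, the matched locations of the four constellation patches and their similarity scores. The squareness test below is a simplified stand-in for the geometric check, and all names and thresholds are assumptions:

```python
import numpy as np

def constellation_score(matches, sim_threshold=0.0):
    """matches: list of (locations, sims), one entry per candidate image.
    locations is a (4, 2) array of matched patch centers in 2x2 order
    (top-left, top-right, bottom-left, bottom-right); sims has 4 similarity
    scores. Returns how many candidates both match well and keep roughly the
    original square configuration (e.g. 84/100 for an object, 7/100 for a
    texture, as on the slide)."""
    def roughly_square(loc, tol=0.35):
        tl, tr, bl, br = loc
        width = np.linalg.norm(tr - tl)
        height = np.linalg.norm(bl - tl)
        diag = np.linalg.norm(br - tl)
        if min(width, height) < 1e-6:
            return False
        # sides of similar length, and a diagonal consistent with a square
        return (abs(width - height) / max(width, height) < tol and
                abs(diag - np.sqrt(2) * width) / diag < tol)

    kept = 0
    for loc, sims in matches:
        if sims.sum() > sim_threshold and roughly_square(np.asarray(loc, float)):
            kept += 1
    return kept
```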