Unsupervised Visual Representation Learning by Context Prediction Carl Doersch Joint work with Alexei A. Efros & Abhinav Gupta
ImageNet + Deep Learning Beagle - Image Retrieval - Detection (RCNN) - Segmentation (FCN) - Depth Estimation - … - By now, you’re probably familiar with the idea that big datasets of labeled images plus deep learning have ushered in a new era of computer vision. - What’s even more remarkable, however, is that the trained models seem to transfer to many other problems.
ImageNet + Deep Learning Pose? Boundaries? Geometry? Parts? Materials? Beagle Do we need this task? Do we even need semantic labels? - One way to explain this phenomenon is that the network begins by extracting some general-purpose information in order to solve classification. - For example, to classify the breed of this dog, it makes sense to first extract its pose, parts, etc. - For those of us who care about other problems besides classification, this classification problem is just a “pretext”—an excuse—for extracting this more general purpose information. - But if the features we’re learning are general, then does it have to be *this* task that the network trains on? - Semantic labels from humans are expensive. - Do we need semantic labels in order to learn a useful representation? Or is there some other pretext task that will learn something similar?
Context as Supervision [Collobert & Weston 2008; Mikolov et al. 2013] Deep Net - There’s already evidence that context can serve as a pretext task in the text domain. - However, doing this with pixels is hard because each pixel carries so little semantic information.
Context Prediction for Images ? A B - Say I give you a pair of patches from one image. Can you say where they go relative to one another?
Semantics from a non-semantic task - This is hard if you don’t know what a bus is, but easy if you do. - We hope that if we give this task to a machine learner, it will also solve it by recognizing semantics.
Relative Position Task CNN Classifier 8 possible locations - Let’s ask a machine learner to solve this. Given an unlabeled image, we sample one patch uniformly at random, and then a second patch from one of these 8 possible locations relative to the first. - We then pass both patches to a deep network which is trained to predict which of the 8 classes was sampled. - Note the structure of this network: we have a relatively complex embedding that happens in parallel for both patches, followed by a relatively simple classifier that receives input from both patches. We expect that this will encourage the network to extract semantics, because that’s what makes life easier for the simple classifier. - This embedding function can be transferred to other problems Randomly Sample Patch Sample Second Patch
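The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_patch_pair`, the 96-pixel patch size, and the row-major ordering of the 8 class labels are all assumptions made for the example, and it deliberately leaves out any gap or jitter between patches.

```python
import random

# (row, col) grid steps for the 8 neighbours of the centre cell, indexed by
# class label 0..7. The label ordering is an arbitrary choice for this sketch.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                    (0, -1),           (0, 1),
                    (1, -1),  (1, 0),  (1, 1)]


def sample_patch_pair(height, width, patch=96, rng=random):
    """Return (first_box, second_box, label) for one training pair.

    Boxes are (top, left, size) in pixels. The first patch is placed uniformly
    at random among positions that leave room for all 8 neighbours, so the
    second patch always lies inside the image.
    """
    if height < 3 * patch or width < 3 * patch:
        raise ValueError("image too small for a 3x3 grid of patches")
    top = rng.randint(patch, height - 2 * patch)
    left = rng.randint(patch, width - 2 * patch)
    label = rng.randrange(8)  # which of the 8 locations was sampled
    d_row, d_col = NEIGHBOR_OFFSETS[label]
    second = (top + d_row * patch, left + d_col * patch, patch)
    return (top, left, patch), second, label
```

The label returned here is the only supervision the network sees, and it comes for free from the sampling procedure rather than from a human annotator.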
Note: connects across instances! CNN Classifier Patch Embedding Input Nearest Neighbors CNN - The goal is to create an embedding where semantically similar patches are close together. - Indeed, for this patch, the most similar other patches are other cats. - Note what this means: we’ve connected across instances of cats despite using an objective function that operates within a single image. - We never told the algorithm that these were the same in any way.
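To make "nearest neighbors in the embedding" concrete, here is a small sketch of retrieval over embedding vectors. The choice of cosine similarity and the function names are assumptions for illustration; the slides do not say which similarity measure was used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbors(query, database, k=3):
    """Return the indices of the k database embeddings most similar to query."""
    ranked = sorted(range(len(database)),
                    key=lambda i: cosine_similarity(query, database[i]),
                    reverse=True)
    return ranked[:k]
```

In the setting of the talk, `query` would be the embedding of the input patch and `database` the embeddings of patches drawn from many other images.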
Architecture Softmax loss Fully connected Max Pooling LRN Fully connected Convolution Fully connected Max Pooling LRN Convolution Tied Weights Training requires Batch Normalization [Ioffe et al. 2015], but no other tricks Patch 1 Patch 2
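The "tied weights" idea can be illustrated with a toy two-stream model. This is only a sketch under strong simplifications: a single linear layer with ReLU stands in for the whole convolutional stack, the layer sizes are arbitrary, and all function names are hypothetical.

```python
import math
import random


def embed(patch_vec, weights):
    """Shared embedding: one linear layer plus ReLU (a stand-in for the CNN stack)."""
    return [max(0.0, sum(w * x for w, x in zip(row, patch_vec))) for row in weights]


def predict_position(patch1, patch2, embed_w, clf_w):
    """Embed both patches with the SAME weights, then classify them jointly."""
    joint = embed(patch1, embed_w) + embed(patch2, embed_w)  # list concatenation
    logits = [sum(w * f for w, f in zip(row, joint)) for row in clf_w]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small Gaussian-initialised weight matrix as a list of rows."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
```

Because `embed_w` is passed to both calls of `embed`, any gradient update to it would affect both streams identically, which is what tying the weights means.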
Avoiding Trivial Shortcuts - The way that we extract the patches is important. The problem is that there are trivial shortcuts: ways that the network can solve the problem without really extracting the semantics that we’re after, like the line in this pair of patches. - The gap makes it less likely that low-level properties cross both patches - The jittering makes it harder to match straight lines between two patches Include a gap Jitter the patch locations
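A minimal sketch of how a gap and jitter might enter the patch offset. The specific numbers (a 48-pixel gap and up to 7 pixels of jitter for a 96-pixel patch) and the function name are illustrative assumptions; the slides do not state the values used.

```python
import random


def second_patch_offset(direction, patch=96, gap=48, max_jitter=7, rng=random):
    """Pixel offset (d_row, d_col) of the second patch relative to the first.

    direction is a (row, col) grid step with entries in {-1, 0, 1}, not both zero.
    """
    d_row, d_col = direction
    if (d_row, d_col) == (0, 0) or not all(d in (-1, 0, 1) for d in direction):
        raise ValueError("direction must be one of the 8 grid neighbours")
    stride = patch + gap  # the gap keeps the two patches from touching
    # Independent jitter per axis breaks exact alignment of lines across patches.
    jitter_row = rng.randint(-max_jitter, max_jitter)
    jitter_col = rng.randint(-max_jitter, max_jitter)
    return d_row * stride + jitter_row, d_col * stride + jitter_col
```

With these settings, a patch sampled "to the right" starts between 137 and 151 pixels from the first patch's left edge, so the two patches never share a boundary.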
A Not-So “Trivial” Shortcut Position in Image CNN - After training for a while, another “trivial” shortcut started to take hold, which was difficult to track down. We first noticed pairs that the net could predict much better than people. - We found its nearest neighbors were other textures with little else in common - We then looked at the distribution of where those nearest-neighbor patches were occurring in the original image frame—turned out they were all from the same corner of the image - Was the network figuring out where the patches were on the image frame? To find out, we trained a separate network to predict the absolute (x,y) coordinates of individual patches. - For some images (not all), the net predicted very well, even for patches with no semantics in them.
Chromatic Aberration - It turns out that the images where this worked were those affected by chromatic aberration. - Chromatic aberration happens when a lens bends different wavelengths by different amounts. - For common lenses (specifically, the achromatic doublet), the green color channel is shrunk a little bit toward the image center relative to red and blue. - Deep nets can detect this subtle shift, which tells the net where a patch is with respect to the lens and gives away the answer to the relative-position task.
Chromatic Aberration CNN There are various ways to fix this, for example removing color (for our experiments, we randomly dropped 2 of the 3 color channels). But there’s a more important lesson: deep nets are kind of lazy. If there’s a way to solve a problem without learning semantics, they may learn to do that instead.
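Dropping 2 of the 3 color channels could look roughly like the sketch below. What the dropped channels are replaced with is an assumption here (low-variance Gaussian noise); the function name and the nested-list image format are also just for illustration.

```python
import random


def drop_color_channels(pixel_rows, rng=random, noise_std=0.01):
    """Keep one randomly chosen channel; overwrite the other two with noise.

    pixel_rows is a list of rows, each a list of (r, g, b) tuples in [0, 1].
    Returns the new image and the index of the channel that was kept.
    """
    keep = rng.randrange(3)
    out = []
    for row in pixel_rows:
        new_row = []
        for pixel in row:
            # Assumed fill: small Gaussian noise in the dropped channels.
            new_pixel = [rng.gauss(0.0, noise_std) for _ in range(3)]
            new_pixel[keep] = pixel[keep]
            new_row.append(tuple(new_pixel))
        out.append(new_row)
    return out, keep
```

With only one real channel left, the network can no longer compare channels against each other, so the cross-channel shift caused by chromatic aberration is not available as a cue.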
What is learned? [Figure: nearest-neighbor results; columns: Input, Ours, Random Initialization, ImageNet AlexNet] Now let’s look at some results. We first did a sanity check, sampling 1000 patches from Pascal and finding nearest neighbors. I’ve selected a few I think are interesting.
Still don’t capture everything [Figure: nearest-neighbor results; columns: Input, Ours, Random Initialization, ImageNet AlexNet]
You don’t always need to learn! [Figure: nearest-neighbor results; columns: Input, Ours, Random Initialization, ImageNet AlexNet]
Visual Data Mining Via Geometric Verification Simplified from [Chum et al. 2007] - We can perform visual data mining via our nearest neighbors, but the net seems to capture textures as well as objects. - Objects tend to be made up of parts that fit a particular geometric configuration. - Hence, we can pick out objects by geometric verification: finding sets of patches that occur together in a consistent configuration.
Mined from Pascal VOC2011
Pre-Training for R-CNN - If we can mine objects, can we detect them? We adopt the R-CNN pipeline on VOC 2007. Pre-train on relative-position task, w/o labels [Girshick et al. 2014]
VOC 2007 Performance (pretraining for R-CNN) [Bar chart: % Average Precision for ImageNet Labels, Ours, and No Pretraining, under "No Rescaling" and "Krähenbühl et al. 2015" initialization, plus "VGG + Krähenbühl et al."; values shown: 68.6, 61.7, 56.8, 54.2, 51.1, 46.3, 45.6, 42.4, 40.7] [Krähenbühl, Doersch, Donahue & Darrell, “Data-dependent Initializations of CNNs”, 2015]
Capturing Geometry? We’re doing better at object understanding, but what about other tasks? Looking back at our nearest neighbors, it seems like the network is capturing geometry as well.
Surface-normal Estimation
                   Error (Lower Better)    % Good Pixels (Higher Better)
Method             Mean     Median         11.25°    22.5°    30.0°
No Pretraining     38.6     26.5           33.1      46.8     52.5
Ours               33.2     21.3           36.0      51.2     57.8
ImageNet Labels    33.3     20.8           36.7      51.7     58.1
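For readers unfamiliar with these columns, here is a small sketch of how such summary numbers are typically computed from per-pixel angular errors in degrees. The function name, the dictionary keys, and the use of "less than or equal" at the thresholds are assumptions for illustration.

```python
import statistics


def normal_error_summary(angular_errors_deg, thresholds=(11.25, 22.5, 30.0)):
    """Summarise per-pixel angular errors (degrees) between predicted and true normals."""
    errors = list(angular_errors_deg)
    summary = {
        "mean": statistics.mean(errors),      # lower is better
        "median": statistics.median(errors),  # lower is better
    }
    for t in thresholds:
        # Percentage of pixels whose error is within t degrees (higher is better).
        summary[f"within_{t}"] = 100.0 * sum(e <= t for e in errors) / len(errors)
    return summary
```

For example, `normal_error_summary([5.0, 10.0, 20.0, 40.0])` gives a mean of 18.75, a median of 15.0, and 50% of pixels within 11.25°.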
So, do we need semantic labels? - While the question remains open, we believe the answer might be no. - And we aren’t the only recent work suggesting that improvements can be had.
“Self-Supervision” and the Future Ego-Motion [Agrawal et al. 2015; Jayaraman et al. 2015] Similar [Wang et al. 2015; Srivastava et al. 2015; …] Video [Doersch et al. 2014; Pathak et al. 2015; Isola et al. 2015] Context CNN There are a number of recent works on so-called “self-supervised” learning: that is, algorithms that start with data that has no hand-labeled semantic annotations, but create a seemingly non-semantic task that still requires semantic information to solve. This includes ego-motion, video cues like temporal continuity and prediction in time, as well as other works that use context.
Thank you!
Visual Data Mining? - How does the geometric verification work? First, note that the relative predictions are not quite enough: the network is quite capable of predicting the relative positions of pairs of patches that don’t contain the same object, just because there are strong scene layout cues like the ground.
Geometric Verification Like [Chum et al. 2007], but simpler What makes an object an object? Say we had four patches on the same object, like this dog. If we found nearest neighbors for all four of them, there would be images that contribute nearest neighbors to all four. This is also true for textures. But note one big difference: the resulting nearest neighbors for the dog in the retrieved image are in essentially the same configuration as the original. But for the water, they’re all over the place.
Geometric Verification Like [Chum et al. 2007], but simpler [Figure: example constellations, each matched against its top 100 images, with scores 15/100, 84/100, 7/100] - In detail, we begin by sampling sets of four patches in a 2x2 grid, which we call a constellation. - For each constellation, we find the top 100 images that have strong matches for all four patches (we initially measure how well an image matches by simply summing the similarity scores for the individual patches). - Out of those 100, we discard the images where the set of matches is not in a roughly square pattern similar to the one we started with. We count how many are left, which serves as the overall score of the constellation, and we rank the constellations by this score.
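The scoring step can be sketched as below. The "roughly square" test used here (all four sides within a relative tolerance of their mean, and edges close to axis-aligned) is an assumption chosen for simplicity, as are the function names and the 25% tolerance; the slides do not specify the exact criterion.

```python
def is_roughly_square(points, tolerance=0.25):
    """Check that four matched patch centres keep a 2x2 layout.

    points are (row, col) centres ordered top-left, top-right, bottom-left,
    bottom-right, following the order of the query constellation.
    """
    (r0, c0), (r1, c1), (r2, c2), (r3, c3) = points
    widths = [c1 - c0, c3 - c2]    # top and bottom edges
    heights = [r2 - r0, r3 - r1]   # left and right edges
    sides = widths + heights
    if min(sides) <= 0:            # matches are out of order: layout not preserved
        return False
    mean_side = sum(sides) / 4.0
    if any(abs(s - mean_side) > tolerance * mean_side for s in sides):
        return False
    # Edges should also be close to axis-aligned.
    skew = [abs(r1 - r0), abs(r3 - r2), abs(c2 - c0), abs(c3 - c1)]
    return all(s <= tolerance * mean_side for s in skew)


def constellation_score(matches_per_image, tolerance=0.25):
    """Count the retrieved images whose four matches pass the square test."""
    return sum(is_roughly_square(m, tolerance) for m in matches_per_image)
```

Here `matches_per_image` would hold one four-point match set for each of the top 100 retrieved images, so the score is a count out of 100, and constellations are then sorted by it.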