Convolutional Neural Nets Advanced Vision Seminar April 19, 2015
Overview The Revolution(2012) : ImageNet Classification with Deep Conv. Nets Large performance gap Possible Explanations What makes convnets tick? Visualizing and Understanding Deep Conv. Nets Some Limits of conv. Nets Useful Resources I will talk about two subjects: The revolution in vision brought by deep conv. Neural networks, and explaining how it happened An attempt to explain what goes on insight deep nets
ImageNet Classification with Deep Conv. Neural Networks Until 2012: Leading methods used hand-crafted features + encoding methods (e.g, SIFT+Bag-of-Words+SVM) NIPS 2012, Alex Krizhevsky et. Al Significant improvement w.r.t other methods: ImageNet performance 2010: ~28% (pre-convnet) 2012: 16% (Krizhevsky) 2013: 12% 2014: 6.7% Up until 2012, the leading methods in computer vision benchmarks did not include deep neural nets. They included extraction of many local hand crafted features and an encoding of their distribution over the image. 2012 Marked a big leap in performance as Deep conv. Nets returned into play. Since then, we have seen a steady increase in performance. We shall try to see how this happened.
Causes for Performance Gap: Deep Nets have been around for a long time, why the sudden gap? Combination of multiple factors: Network Design Scale of Data ReLu faster convergence Dropout – less overfitting SGD GPU computations Deep neural nets have been around for tens of years and deep conv. Nets for more than 20. What has caused this large performance gap? I shall try to explain it
Network Design Basic layer types: Krizhevsky, 2012 Lecun, 1989 Basic layer types: Convolution Nonlinearity Pooling : max, avg Local Normalization 8 Layers (deeper,wider than 1989) connected can also be viewed as convolution, receptive field is entire layer Lets start with network design. The design of the network is similar in nature to Lecun’s networks starting 1989, just deeper and wider.
Network Design Krizhevsky, 2012 (Layer 0): Input : 224x224x3 mean-subtracted Layer 1: 96 kernels of 11x11x3, stride of 4 pixels, max pool and locally normalize Layer 2: 256 kernels of 5x5x96, max pool Layers 3-5: more convolutions, similar to 1,2 Layers 6,7 : fully connected, 4096 hidden units each Layer 8: Soft-max over 1000 classes ….
Num Samples vs. Num. Parameters The network has ~60,000,000 parameters. To avoid overfitting, a lot of data is needed Imagenet is indeed very large: > millions images Training set: 1.2 mil. Images, 1000 obj. classes Additional samples are generated via data augmentation: simple geometric and color transformations Dropout
Optimization - definitions Loss on one sample (softmax-loss) D – Data/Batch size : “Momentum Variable” (update history) Loss on batch : Regularization term : : Learning rate Momentum improves convergence stability and speed Regularization term crucial for performance, according to authors
Optimization Set =.9, = 0.0005 , = .001 (initially) D=128 (batch size) Num. Epochs: 90 Update: This is called SGD+Momentum
Relu: Faster Convergence Krizhevsky, 2012 Nonlinearity: tanh-> Relu (rectilinear unit) Easier to differentiate Avoids saturation In practice, much faster convergence Tanh Relu
Stronger Machines Modern GPU architectures enable massively parallel computations of the sort required by deep conv. nets Training with two strong GPU’s, this took “only” 6 days – a x50 speedup w.r.t to CPU training
Imagenet Results LSVRC : Large Scale Visual Recognition Challenge Imagenet (2014): 1.4 mil. Images, 1000 obj. classes Compare: Pascal : 22,000 images, 20 obj. classes)
Results - 2012 Agaric Here are some qualitative results. Besides good quantitative performance, we can see that in many cases the results is semantically reasonable, and where mistakes are made, they seem to make sense. This arguably shows some nice generalization capabilities of the network.
Generic Use in Vision Using the output of the fully connected layers as a generic feature extractor has proven to very strong Beating state of the art in many datasets/benchmarks unrelated to ImageNet This is now standard in object detection, scene classification, Scene parsing, Segmentation, and many more “Machine crafted” vs. hand crafted features
Generic Use in Vision a computer vision scientist: How long does it take to train these generic features on ImageNet? Hossein: 2 weeks Ali: almost 3 weeks depending on the hardware the computer vision scientist: hmmmm... Stefan: Well, you have to compare the three weeks to the last 40 years of computer vision *quote from from http://www.csc.kth.se/cvap/cvg/DL/ots/ A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson "CNN features off-the-shelf: An astounding baseline for recognition", CVPR 2014, DeepVision workshop
Network Design ?? How to determine “hyper-parameters”: No. layers? Kernel Size/Num. of kernels? Training rate? Number of training epochs? … From own experience, either: Start with existing network & tweak / finetune Incrementally increase network complexity – start with a few layers, see what works Domain knowledge : convolutions are especially suited for images, not always the right choice One of the most puzzling issues that remains for me is how to determine network architecture and others of the many “hyperparameters” . This was not rigorously discussed in any of the paper’s I’ve read so far.
Network Structure (To my knowledge) no rigorous analysis of effect of network structure, either theoretical or empirical Example: systematically check many different network structures / configurations See what works well, what doesn’t and explain why Guessing: Probably done by successful architectures, but “bad” results not published
Visualizing & Understanding Conv. Nets What Makes Convnets “Tick”? What happens in hidden units? Layer 1: easy to visualize Deeper layers: just a bunch of numbers? Or something more meaningful? Do convnets use context or actually model target classes The next part of the presentation is about an attempt to find out what goes on inside the “black” boxes of the networks, suggest by Zeiler and Fergus in 2013. It tried to link back each activation back to the original image
Introducing: Visualizing & Understanding Conv. Nets Zeiler & Fergus, 2013 Goal: Try to visualize the “black box” hidden units, gain insights Hope: Use conclusions to improve performance Idea: “Deconvolutional” neural net The next part of the presentation is about an attempt to find out what goes on inside the “black” boxes of the networks, suggest by Zeiler and Fergus in 2013. It tried to link back each activation back to the original image
Deconvolutional Nets Originally suggested for unsupervised feature learning : construct a convolutional net, cost function is image reconstruction error Used here to find what stimuli causes strongest responses in hidden units Run many images through net find strongest unit activations in each layer visualize by “reversing” net operation
Reversing a convent Each layer in the deconvolutional net is build according to the corresponding layer in the convolutional net, and does the “reverse” operation.
“Unpooling”
Deconvolution* Want to visualize a strong activation in feature map P from layer L+1 down to layer L. As in Deep-Belief Nets: LP*F’, where F’ is the original kernel flipped in both dimensions Intuition: can show, gradient of F w.r.t input is F’, this is back-propping error of strongest activations *Note: Not really “deconvolution”, this is not an attempt to recover original signal
Layer 1:
Hidden Layer Visualizations: layer 2
Hidden Layer Visualizations: Layer 3
Hidden Layer Visualizations: Layer 4
Hidden Layer Visualizations: Layer 5
It’s Nice to Watch, But is it Useful? Authors observed aliasing effects caused by large stride in lower layers (e.g, loss of fine texture) Reducing filter size and stride increased performance, also reporting qualitatively “cleaner” looking filters
Is the net using context? Let’s test if the network really focuses on relevant features Systematically occlude different parts of image Check output confidence for true class (This doesn’t really have to do with the visualization)
Following Network paths B.Zhou et al , Object Detectors Emerge in Deep CNNs, ICLR 2015
Going Deeper with convolutions Recently, even deeper models have been proposed: GoogLeNet – 22 layers : 6.7 top 5 error 16 and 19 layer architectures from VGG , similar performance
Limits : Easy Classes : natural, highly textured, fine-grained Lets see some of the limits of deep conv. Nets. Here are some image classification results. Man made objects are harder than natural ones. Highly textured objects are easier. The simplest of objects seem to be the hardest. It seems like overkill to try to represent these objects with the deep conv nets.
Limits : Difficult Classes Man-Made, simple, non-textured, functional Lets see some of the limits of deep conv. Nets. Here are some image classification results. Man made objects are harder than natural ones. Highly textured objects are easier. The simplest of objects seem to be the hardest. It seems like overkill to try to represent these objects with the deep conv nets. ?
Useful Tools For starters: MatConvNet : More advanced: Caffe : Matlab Simple, straightforward Pre-trained popular models Windows/Linux compatible More advanced: Caffe : Powerful ,open-source framework for training and testing convnets, with c++/python/matlab interfaces Mostly Linux (some old windows ports) “Model-Zoo” : updates often with state-of-the-art models Many more deeplearning.net/software_links/
Thank You References: Lecun et al, Backpropagation Applied to Handwritten Zip Code Recognition (MIT press, 1989) Zeiler et al, Visualizing and understanding convolutional networks, ECCV14 Rob Fergus. Deep Learning for Computer Vision (Tutorial). NIPS, 2013 (including several imgs from slides) Russakovsky et al, ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575, 2014 A. S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson "CNN features off-the-shelf: An astounding baseline for recognition", CVPR 2014, DeepVision workshop