Deep Learning RUSSIR 2017 – Day 2 Efstratios Gavves
Recap from Day 2 Convolutional Networks are optimal for images: parameter sharing, much cheaper, much faster, better local invariances. Several Convnet architectures are possible: AlexNet/VGGNet, ResNet, Google Inception V1-4.
Deep Nets in Practice Data pre-processing Optimization Regularization Hyper-parameters
Back-propagation
Step 1. Compute forward propagations for all layers recursively: $a_l = h_l(x_l)$ and $x_{l+1} = a_l$.
Step 2. Once done with forward propagation, follow the reverse path. Start from the last layer. For each new layer compute the gradients $\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}$ and $\frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$. Cache computations when possible to avoid redundant operations.
Step 3. Use the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$ with Stochastic Gradient Descent to train.
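The three steps can be sketched on a toy network. This is a minimal illustration, assuming a two-layer network with a tanh hidden layer and a squared-error loss; the function name, the shapes and the loss are choices made for the example, not taken from the slides.

```python
import numpy as np

def forward_backward(x, y, W1, W2):
    """Two-layer network a2 = W2 @ tanh(W1 @ x) with squared-error loss."""
    # Step 1: forward pass, caching the activations needed later.
    a1 = np.tanh(W1 @ x)
    a2 = W2 @ a1
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # Step 2: backward pass, starting from the last layer.
    d_a2 = a2 - y                      # dL/da2
    dW2 = np.outer(d_a2, a1)           # dL/dW2
    d_a1 = W2.T @ d_a2                 # dL/da1 = (da2/dx2)^T . dL/da2
    d_z1 = d_a1 * (1.0 - a1 ** 2)      # through tanh, reusing the cached a1
    dW1 = np.outer(d_z1, x)            # dL/dW1
    return loss, dW1, dW2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
loss, dW1, dW2 = forward_backward(x, y, W1, W2)
```

The returned gradients are what Step 3 would feed into the SGD update; comparing one entry against a finite-difference estimate is a cheap way to check such code.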
Stochastic Gradient Descent
Stochastically sample “mini-batches” from the dataset $D$: $B_j = \mathrm{sample}(D)$, $\theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{|B_j|} \sum_{i \in B_j} \nabla_\theta \mathcal{L}_i$
A mini-batch $B_j$ can contain even just 1 sample. Much faster than Gradient Descent. Results are often better. Also suitable for datasets that change over time. Variance of gradients increases when batch size decreases.
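A minimal sketch of the mini-batch update, assuming a linear model with a squared-error loss on synthetic, noise-free data. The function name, learning rate, batch size and step count are illustrative choices only.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.1, batch_size=8, steps=500, seed=0):
    """Fit y ~ X @ theta with mini-batch SGD on the squared error."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # B_j = sample(D): draw a random mini-batch of indices.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Average of the per-sample gradients of 0.5 * (x . theta - y)^2.
        grad = Xb.T @ (Xb @ theta - yb) / batch_size
        theta = theta - lr * grad
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta                      # noise-free targets
theta_hat = sgd_linear_regression(X, y)
```

Each step touches only `batch_size` samples instead of the full dataset, which is where the speed advantage over full Gradient Descent comes from.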
SGD often more accurate. No guarantee that this is what is always going to happen, but the noisy SGD gradients can sometimes help escape local optima. [Figure: loss surface showing the current solution, the noisy SGD gradient, the full GD gradient, the new GD solution, the best GD solution and the best SGD solution]
Data pre-processing Center data to be roughly 0. Activation functions are usually “centered” around 0. Convergence is usually faster. Otherwise bias on the gradient direction might slow down learning. [Figure: plots of ReLU, $\tanh(x)$ and $\sigma(x)$]
Data pre-processing Scale input variables to have similar diagonal covariances $c_i = \sum_j (x_i^{(j)})^2$. Similar covariances give a more balanced rate of learning for different weights. Rescaling to 1 is a good choice, unless some dimensions are less important. If $x_1, x_2, x_3$ have much different covariances, the generated gradients $\frac{d\mathcal{L}}{d\theta}$ for $x_1, x_2, x_3$ are also much different, and the gradient update becomes harder: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \begin{bmatrix} d\mathcal{L}/d\theta_1 \\ d\mathcal{L}/d\theta_2 \\ d\mathcal{L}/d\theta_3 \end{bmatrix}$
Data pre-processing Input variables should be as decorrelated as possible. Input variables are then “more independent”, and the network is forced to find non-trivial correlations between inputs. Decorrelated inputs give better optimization. Obviously not the case when inputs are by definition correlated (sequences). Extreme case: extreme correlation (linear dependency) might cause problems [CAUTION]
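The pre-processing steps (centering, rescaling, decorrelating) can be sketched as follows. This is one possible implementation under stated assumptions: the slides do not name a decorrelation method, so PCA whitening is used here as an example, and the function names and `eps` values are illustrative.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Center each input dimension at 0 and rescale it to unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps)

def pca_whiten(X, eps=1e-5):
    """Decorrelate the input dimensions and rescale them to unit variance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov is symmetric
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
# Two correlated inputs with different scales and a non-zero mean.
Z = rng.normal(size=(500, 2))
X = Z @ np.array([[2.0, 0.5], [0.5, 1.0]]) + np.array([10.0, -3.0])
X_std = standardize(X)
X_white = pca_whiten(X)
```

In practice the mean, standard deviation and eigenvectors would be estimated on the training set only and then reused for validation and test data.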
Batch normalization (intuitively) [Figure: backpropagation and the layer input $x_l$ at (t), at (t+0.5) and at (t+1)]
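Batch normalization (algorithm)
$\mu_\mathcal{B} \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$ [compute mini-batch mean]
$\sigma_\mathcal{B}^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ [compute mini-batch variance]
$\hat{x}_i \leftarrow \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \varepsilon}}$ [normalize input]
$y_i \leftarrow \gamma \hat{x}_i + \beta$ [scale and shift input]
Trainable parameters: $\gamma$, $\beta$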
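The four steps of the algorithm map directly onto a few array operations. The sketch below covers only the training-time forward pass on a (batch, features) array; the backward pass and the statistics used at test time are left out, and the function name is an assumption of this example.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization of a (m, features) mini-batch."""
    mu = x.mean(axis=0)                       # mini-batch mean
    var = x.var(axis=0)                       # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize input
    return gamma * x_hat + beta               # scale and shift input

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))
y = batch_norm_forward(x, gamma=np.full(4, 2.0), beta=np.full(4, 1.0))
```

With `gamma = 1` and `beta = 0` the output has roughly zero mean and unit variance per feature; the trainable parameters let the network undo the normalization if that helps.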
Batch normalization (benefits) Stronger gradients, so higher learning rates and faster training. Otherwise maybe exploding or vanishing gradients, or getting stuck in local minima. Neurons get activated in a near optimal “regime”. Better model regularization: neuron activations are not deterministic, they depend on the batch, so the model cannot be overconfident.
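$\ell_2$-regularization
Most important (or most popular) regularization: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y) \subseteq (X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L})) + \frac{\lambda}{2} \sum_l \|\theta_l\|^2$
The $\ell_2$-regularization can pass inside the gradient descent update rule: $\theta^{(t+1)} = \theta^{(t)} - \eta_t (\nabla_\theta \mathcal{L} + \lambda \theta^{(t)}) \Longrightarrow \theta^{(t+1)} = (1 - \lambda \eta_t)\, \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
$\lambda$ is usually about $10^{-1}$, $10^{-2}$. “Weight decay”, because weights get smaller.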
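The equivalence between adding the regularizer to the gradient and shrinking the weights can be checked numerically. A small sketch with illustrative function names and arbitrary test values:

```python
import numpy as np

def step_l2_gradient(theta, grad_loss, lr, lam):
    """Gradient step on the loss plus (lam / 2) * ||theta||^2."""
    return theta - lr * (grad_loss + lam * theta)

def step_weight_decay(theta, grad_loss, lr, lam):
    """Same step written as a shrinkage of theta followed by a plain step."""
    return (1.0 - lam * lr) * theta - lr * grad_loss

theta = np.array([0.5, -1.0, 2.0])
grad_loss = np.array([0.1, 0.3, -0.2])
a = step_l2_gradient(theta, grad_loss, lr=0.1, lam=0.01)
b = step_weight_decay(theta, grad_loss, lr=0.1, lam=0.01)
```

Both functions return the same parameters, which is why the regularizer is commonly implemented as a multiplicative decay of the weights.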
Early stopping Monitor performance on a separate validation set. Training the network will decrease the training error, as well as the validation error (although usually at a slower rate). Stop when the validation error starts increasing: this quite likely means the network starts to overfit.
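A minimal sketch of the stopping rule, assuming the validation error is recorded once per epoch. The `patience` parameter (how many non-improving epochs to tolerate before stopping) is an assumption added for robustness against noisy validation curves; it is not mentioned on the slide, and the error values are made up.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the epoch whose weights should be kept."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break  # validation error has not improved for `patience` epochs
    return best_epoch

# Hypothetical validation errors: decreasing first, then increasing.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best = early_stopping_epoch(val_errors)
```

In a real training loop the model weights would be checkpointed whenever the validation error improves, so that the best epoch can be restored after stopping.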
Dropout [Srivastava2014] During training, set activations randomly to 0: neurons are sampled at random from a Bernoulli distribution with $p=0.5$. At test time all neurons are used, with neuron activations reweighted by $p$. Benefits: reduces complex co-adaptations or co-dependencies between neurons; no “free-rider” neurons that rely on others; every neuron becomes more robust; significantly decreases overfitting; significantly improves training speed.
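The train/test asymmetry can be sketched as below. The code follows the variant described on the slide (drop at training time, rescale by $p$ at test time) and treats $p$ as the probability of keeping a neuron; function names are illustrative.

```python
import numpy as np

def dropout_train(a, p=0.5, rng=None):
    """Keep each activation with probability p, set the rest to 0."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(a.shape) < p      # Bernoulli(p) keep-mask
    return a * mask

def dropout_test(a, p=0.5):
    """Use all neurons, reweighting activations by p."""
    return a * p

rng = np.random.default_rng(0)
a = np.ones(100000)
train_out = dropout_train(a, p=0.5, rng=rng)
test_out = dropout_test(a, p=0.5)
```

Reweighting by $p$ at test time makes the expected activation match what the next layer saw during training. Many libraries instead divide by $p$ during training ("inverted dropout"), which leaves the test-time pass unchanged.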
Dropout Effectively, a different architecture at every training epoch. Similar to model ensembles. [Figure: the original model, then the network at Epoch 1 and at Epoch 2]
Learning rate The right learning rate $\eta_t$ is very important for fast convergence. Too strong: gradients overshoot and bounce. Too weak: too small gradients, slow training. A learning rate per weight is often advantageous: some weights are near convergence, others not. Rule of thumb: learning rate of (shared) weights proportional to the square root of the number of connections sharing the weight. Adaptive learning rates are also possible, based on the errors observed [Sompolinsky1995]
Weight initialization There are a few contradictory requirements. Weights need to be small enough, around the origin ($\mathbf{0}$), for symmetric functions (tanh, sigmoid): when training starts it is better to stimulate activation functions near their linear regime, which gives larger gradients and faster training. Weights also need to be large enough: otherwise the signal is too weak for any serious learning.
Weight initialization: Similar input/output variance Weights must be initialized to preserve the variance of the activations during the forward and backward computations Especially for deep learning All neurons operate in their full capacity
Quiz: Why similar input/output variance? The output of one module is the input to another. Similar variance ensures smooth learning for nonlinearities. [Figure: Layer 1 (input, output) feeding Layer 2 (input)]
Weight initialization: Good practices Good practice: initialize weights to be asymmetric. Don’t give the same values to all weights (like all $\mathbf{0}$): in that case all neurons generate the same gradient, so there is no learning. Generally speaking, initialization depends on the non-linearities and the data normalization.
[He2015] initialization for ReLUs Unlike sigmoids, ReLUs ground the linear activations to 0 half the time. Double the weight variance to compensate for the zero flat area, so that input and output maintain the same variance. Very similar to Xavier initialization. Draw random weights from $w \sim N(0, 2/d)$, where $d$ is the number of neurons in the input.
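A small experiment can illustrate the variance-preservation claim. The sketch assumes fully connected ReLU layers without biases and measures the mean squared activation before and after a few layers; the width, depth and batch size are arbitrary choices for the example.

```python
import numpy as np

def he_init(d_in, d_out, rng):
    """Draw a (d_out, d_in) weight matrix from N(0, 2 / d_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_out, d_in))

rng = np.random.default_rng(0)
d = 512
a = rng.normal(size=(d, 256))          # 256 random input vectors
input_scale = np.mean(a ** 2)
for _ in range(5):                     # five ReLU layers of width d
    W = he_init(d, d, rng)
    a = np.maximum(0.0, W @ a)
output_scale = np.mean(a ** 2)
```

With this initialization `output_scale` is expected to stay on the same order as `input_scale`; halving the variance to `1 / d` would instead shrink the signal by roughly a factor of two per layer.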
Optimizers: Momentum Don’t switch gradients all the time. Maintain “momentum” from previous parameter updates. More robust gradients and learning, so faster convergence. Nice “physics”-based interpretation: instead of updating the position of the “ball”, we update the velocity, which updates the position. $u_\theta^{(t+1)} = \gamma u_\theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$, $\theta^{(t+1)} = \theta^{(t)} + u_\theta^{(t+1)}$
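The velocity-then-position update can be sketched on a toy quadratic loss. The loss, the curvature values, the learning rate and $\gamma = 0.9$ are assumptions made for the example.

```python
import numpy as np

def momentum_step(theta, u, grad, lr, gamma=0.9):
    """Update the velocity u first, then move the parameters by u."""
    u = gamma * u - lr * grad
    theta = theta + u
    return theta, u

# Toy loss 0.5 * sum(A * theta^2) with very different curvatures per axis.
A = np.array([1.0, 10.0])
theta = np.array([1.0, 1.0])
u = np.zeros_like(theta)
for _ in range(300):
    grad = A * theta
    theta, u = momentum_step(theta, u, grad, lr=0.05)
```

The velocity `u` accumulates consistent gradient directions and averages out directions that keep flipping sign.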
Optimizers: Momentum [Figure: loss surface comparing the plain gradient path with the gradient + momentum path]
Optimizers: RMSprop Schedule: $m^{(t)} = \alpha\, m^{(t-1)} + (1-\alpha)\, (\nabla_\theta \mathcal{L})^2 \Longrightarrow \theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{\sqrt{m^{(t)}} + \varepsilon} \nabla_\theta \mathcal{L}$, where $m$ is a moving average of the squared gradients (compared to Adagrad, which accumulates their sum). Large gradients, e.g. too “noisy” loss surface: updates are tamed. Small gradients, e.g. stuck in a flat loss surface ravine: updates become more aggressive.
Optimizers: RMSprop Square rooting boosts small values while suppressing large values
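A sketch of one RMSprop step, written in the common moving-average form (the decay factor `alpha`, the learning rate and `eps` are illustrative defaults, not values from the slides). The example feeds in two gradients that differ by four orders of magnitude to show how the normalization evens out the step sizes.

```python
import numpy as np

def rmsprop_step(theta, m, grad, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSprop update with a moving average m of squared gradients."""
    m = alpha * m + (1.0 - alpha) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(m) + eps)
    return theta, m

theta = np.zeros(2)
m = np.zeros(2)
grad = np.array([100.0, 0.01])          # one huge and one tiny gradient
new_theta, m = rmsprop_step(theta, m, grad)
step = np.abs(new_theta - theta)
```

Both coordinates end up moving by nearly the same amount: the large gradient is tamed and the small one is boosted, as described above.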
Visual overview
Standard vs Structured Inference
Standard inference N-way classification Dog? Cat? Bike? Car? Plane?
Standard inference N-way classification Regression How popular will this movie be in IMDB?
Standard inference N-way classification Regression Ranking … Who is older?
Quiz: What is common? N-way classification Regression Ranking …
Quiz: What is common? They all make “single value” predictions Do all our machine learning tasks boil down to “single value” predictions?
Beyond “single value” predictions? Do all our machine learning tasks boil down to “single value” predictions? Are there tasks where outputs are somehow correlated? Is there some structure in these output correlations? How can we predict such structures?
Example: Object detection Predict a box around an object. Images: spatial location, b(ounding) box. Videos: spatio-temporal location, bbox@t, bbox@t+1, …
Quiz: Other examples?
Object segmentation
Optical flow & motion estimation
Depth estimation Godard et al., Unsupervised Monocular Depth Estimation with Left-Right Consistency, 2016
Normals and reflectance estimation
Structured prediction Prediction goes beyond asking for “single values”. Outputs are complex and output dimensions are correlated. Output dimensions have latent structure. Can we make deep networks return structured predictions?
Convnets for structured prediction
Sliding window on feature maps Selective Search Object Proposals [Uijlings2013] SPPnet [He2014] Fast R-CNN [Girshick2015]
Fast R-CNN: Steps Process the whole image up to conv5 (Conv 1, Conv 2, Conv 3, Conv 4, Conv 5) to obtain the conv5 feature map. Compute possible locations for objects: some correct, most wrong. Given a single location, the ROI pooling module extracts a fixed-length feature (always 4x4 in the figure, no matter the size of the candidate location). Outputs: the class (car/dog/bicycle) and new box coordinates.
Region-Of-Interest Pooling Divide the feature map in $T \times T$ cells. The cell size changes depending on the size of the candidate location, so the output is always $T \times T$ (3x3 in the figure) no matter the size of the candidate location.
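The cell division can be sketched for a single-channel feature map. This is a simplified illustration assuming max pooling within each cell, integer cell boundaries and a box given as `(y0, x0, y1, x1)` with exclusive end coordinates; these conventions and the function name are assumptions of the example, not the reference implementation.

```python
import numpy as np

def roi_max_pool(feature_map, box, T=3):
    """Max-pool the region `box` of a 2-D feature map into a T x T grid."""
    y0, x0, y1, x1 = box
    # Cell boundaries adapt to the box size, so the output size stays fixed.
    ys = np.linspace(y0, y1, T + 1).astype(int)
    xs = np.linspace(x0, x1, T + 1).astype(int)
    out = np.empty((T, T), dtype=feature_map.dtype)
    for i in range(T):
        for j in range(T):
            # Make sure every cell covers at least one element.
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            out[i, j] = feature_map[ya:yb, xa:xb].max()
    return out

fmap = np.arange(64).reshape(8, 8)
small = roi_max_pool(fmap, (0, 0, 6, 6))    # 6x6 candidate location
large = roi_max_pool(fmap, (1, 2, 8, 7))    # 7x5 candidate location
```

Both calls return a 3x3 array even though the candidate locations differ in size, which is what lets a fixed-size classifier sit on top of arbitrary boxes.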
Some results
Going Fully Convolutional [LongCVPR2014] Image larger than network input: slide the network. Is this pixel a camel? Yes! No! [Figure: the network (Conv 1, Conv 2, Conv 3, Conv 4, Conv 5, fc1, fc2) slid over the image]
Fully Convolutional Networks [LongCVPR2014] Connect intermediate layers to output
Fully Convolutional Networks Output is too coarse: image size 500x500, AlexNet input size 227x227, output 10x10. How to obtain dense predictions? Upconvolution. Other names: deconvolution, transposed convolution, fractionally-strided convolution.
Deconvolutional modules [Figure: output and image for a convolution (no padding, no strides), an upconvolution (no padding, no strides) and an upconvolution (padding, strides)] https://github.com/vdumoulin/conv_arithmetic
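To make the upsampling effect concrete, here is a 1-D upconvolution (transposed convolution) without padding. It is a sketch for intuition only: each input value "stamps" a scaled copy of the kernel into the output, `stride` positions apart, and the function name is illustrative.

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride=1):
    """1-D transposed convolution without padding."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        start = i * stride
        out[start:start + len(kernel)] += value * kernel
    return out

up = conv_transpose_1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=2)
overlap = conv_transpose_1d([1.0, 2.0, 3.0], kernel=[1.0, 1.0], stride=1)
```

With stride 2 and a length-2 kernel of ones, the three inputs become six outputs (each value repeated), i.e. a 2x upsampling; with stride 1 neighbouring stamps overlap and are summed. In a network the kernel values are learned.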
[Figure: coarse-to-fine output through two 2x upconvolutions (7x7, 14x14, 224x224). Pixel label probabilities (0.8, 0.1, 0.9) are compared with the ground truth pixel labels (1): a large loss is generated for some pixels (probability much higher than ground truth) and a small loss for others]
Structured losses
Deep ConvNets with CRF loss [Chen, Papandreou 2016] Segmentation map is good but not pixel-precise. Details around boundaries are lost. Cast fully convolutional outputs as unary potentials. Consider pairwise potentials between output dimensions. Include a Fully Connected CRF loss to refine the segmentation.
Total loss = unary loss + pairwise loss: $E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j)$
$\theta_{ij}(x_i, x_j) \sim w_1 \exp\left(-\alpha \|p_i - p_j\|^2 - \beta \|I_i - I_j\|^2\right) + w_2 \exp\left(-\gamma \|p_i - p_j\|^2\right)$
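The pairwise term can be evaluated directly for a pair of pixels. In this sketch $p$ is read as the pixel position and $I$ as the pixel colour, which is an interpretation of the symbols in the formula; the weights and the $\alpha$, $\beta$, $\gamma$ values are arbitrary placeholders, and the label-compatibility factor that decides when the term applies is omitted.

```python
import numpy as np

def pairwise_potential(p_i, p_j, I_i, I_j,
                       w1=1.0, w2=1.0, alpha=0.01, beta=0.01, gamma=0.1):
    """Kernel part of the pairwise CRF term for two pixels."""
    d_pos = np.sum((np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)) ** 2)
    d_col = np.sum((np.asarray(I_i, dtype=float) - np.asarray(I_j, dtype=float)) ** 2)
    appearance = w1 * np.exp(-alpha * d_pos - beta * d_col)   # nearby and similar colour
    smoothness = w2 * np.exp(-gamma * d_pos)                  # nearby only
    return appearance + smoothness

red, blue = (255, 0, 0), (0, 0, 255)
same = pairwise_potential((10, 10), (10, 10), red, red)
near_similar = pairwise_potential((10, 10), (11, 10), red, red)
near_different = pairwise_potential((10, 10), (11, 10), red, blue)
far_similar = pairwise_potential((10, 10), (60, 10), red, red)
```

Pixels that are close together and similar in colour interact most strongly, which is what pulls the segmentation boundary towards image edges.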
Examples
Multi-task learning The total loss is the summation of the per-task losses. The per-task loss relies on the common weights (VGGnet) and the weights specialized for the task: $\mathcal{L}_{total} = \sum_{task} \mathcal{L}_{task}(\theta_{common}, \theta_{task}) + \mathcal{R}(\theta_{task})$. One training image might contain annotations for specific tasks only. Only those particular tasks are “run” for that image: gradients per image are computed only for the tasks available for the image.
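The handling of missing annotations can be sketched with a shared trunk and one linear head per task. Everything here is a toy assumption: a single ReLU layer stands in for the common network, the per-task loss is a squared error, the regularizer is an L2 penalty on the head weights, and the task names are just labels.

```python
import numpy as np

def multitask_loss(x, targets, W_common, heads, lam=0.0):
    """Sum the losses of the tasks that have an annotation for this input."""
    h = np.maximum(0.0, W_common @ x)          # features from the common weights
    total = 0.0
    for task, y in targets.items():
        if y is None:                          # no annotation: task is not run
            continue
        pred = heads[task] @ h                 # task-specific weights
        total += 0.5 * np.sum((pred - y) ** 2)
        total += 0.5 * lam * np.sum(heads[task] ** 2)
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W_common = rng.normal(size=(4, 5))
heads = {"segmentation": rng.normal(size=(3, 4)), "depth": rng.normal(size=(1, 4))}
y_seg, y_depth = rng.normal(size=3), rng.normal(size=1)

both = multitask_loss(x, {"segmentation": y_seg, "depth": y_depth}, W_common, heads)
seg_only = multitask_loss(x, {"segmentation": y_seg, "depth": None}, W_common, heads)
depth_only = multitask_loss(x, {"segmentation": None, "depth": y_depth}, W_common, heads)
```

Because a skipped task contributes nothing to the total, it also contributes no gradient, while the common weights still receive gradients from every task that is annotated.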
Ubernet [Kokkinos2016]
One datum Several tasks Assume our data are images. Per image we can predict boundaries, segmentation, detection, … Why separately? Solve multiple tasks simultaneously: one task might help learn another better, and one task might have more annotations. In real applications we don’t want 7 VGGnets: 1 for boundaries, 1 for normals, 1 for saliency, …
One image Several tasks
Take away message Deep Learning is good not only for classifying things. Structured prediction is also possible. Multi-task structured prediction allows for unified networks and for missing data. Discovering structure in data is also possible.
Thank you!