1
Deep Learning RUSSIR 2017 – Day 2
Efstratios Gavves
2
Recap from Day 2 Convolutional Networks are optimal for images
Parameter sharing: much cheaper, much faster, better local invariances. Several Convnet architectures possible: AlexNet/VGGNet, ResNet, Google Inception V1-4.
3
Deep Nets in Practice Data pre-processing Optimization Regularization Hyper-parameters
4
Back-propagation
Step 1. Compute the forward propagation for all layers recursively: $a_l = h_l(x_l)$ and $x_{l+1} = a_l$.
Step 2. Once done with the forward propagation, follow the reverse path. Start from the last layer and for each new layer compute the gradients, caching computations when possible to avoid redundant operations:
$$\frac{\partial \mathcal{L}}{\partial a_l} = \left(\frac{\partial a_{l+1}}{\partial x_{l+1}}\right)^T \cdot \frac{\partial \mathcal{L}}{\partial a_{l+1}}, \qquad \frac{\partial \mathcal{L}}{\partial \theta_l} = \frac{\partial a_l}{\partial \theta_l} \cdot \left(\frac{\partial \mathcal{L}}{\partial a_l}\right)^T$$
Step 3. Use the gradients $\frac{\partial \mathcal{L}}{\partial \theta_l}$ with Stochastic Gradient Descent to train.
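As a concrete illustration of these three steps, here is a minimal NumPy sketch for a small fully-connected network with sigmoid activations and a squared-error loss; the network shape, the loss, and all function names are illustrative choices, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    """Step 1: forward propagation, caching (x_l, a_l) per layer."""
    cache = []
    a = x
    for (W, b) in params:
        x_l = a                      # input to this layer
        a = sigmoid(W @ x_l + b)     # a_l = h_l(x_l)
        cache.append((x_l, a))
    return a, cache

def backward(y, a_L, cache, params):
    """Step 2: reverse pass, computing dL/da_l and dL/dtheta_l per layer."""
    grads = []
    dL_da = a_L - y                  # squared-error loss: L = 0.5*||a_L - y||^2
    for (W, b), (x_l, a_l) in zip(reversed(params), reversed(cache)):
        dL_dz = dL_da * a_l * (1 - a_l)   # sigmoid derivative
        dW = np.outer(dL_dz, x_l)         # dL/dtheta_l for this layer
        db = dL_dz
        dL_da = W.T @ dL_dz               # propagate gradient to the previous layer
        grads.append((dW, db))
    return list(reversed(grads))          # Step 3 would feed these to SGD

# toy usage
rng = np.random.default_rng(0)
params = [(rng.normal(size=(4, 3)), np.zeros(4)),
          (rng.normal(size=(2, 4)), np.zeros(2))]
x, y = rng.normal(size=3), np.array([0.0, 1.0])
a_L, cache = forward(x, params)
grads = backward(y, a_L, cache, params)
```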
5
Stochastic Gradient Descent
Stochastically sample "mini-batches" from the dataset $D$: $B_j = \text{sample}(D)$, then update
$$\theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{|B_j|} \sum_{i \in B_j} \nabla_\theta \mathcal{L}_i$$
The size of $B_j$ can be as small as a single sample. Much faster than Gradient Descent, and the results are often better. Also suitable for datasets that change over time. The variance of the gradients increases as the batch size decreases.
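A minimal sketch of this sampling-and-update loop, assuming a least-squares regression loss as a stand-in; the learning rate, batch size and synthetic data are illustrative.

```python
import numpy as np

def sgd(X, Y, lr=0.1, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD for least squares: L_i = 0.5 * (x_i . theta - y_i)^2."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for _ in range(len(X) // batch_size):
            B = rng.choice(len(X), size=batch_size, replace=False)  # B_j = sample(D)
            err = X[B] @ theta - Y[B]
            grad = X[B].T @ err / batch_size     # (1/|B_j|) * sum of per-sample gradients
            theta -= lr * grad                   # theta_(t+1) = theta_(t) - eta_t * grad
    return theta

# toy usage: recover a known linear model from noisy data
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
true_theta = np.arange(1.0, 6.0)
Y = X @ true_theta + 0.01 * rng.normal(size=1000)
print(sgd(X, Y))
```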
6
SGD often more accurate
[Figure: loss surface with the current solution, the noisy SGD gradient vs. the full GD gradient, the new GD solution, and the best GD and SGD solutions.]
There is no guarantee that this is what will always happen, but the noisy SGD gradients can sometimes help escape local optima.
7
Data pre-processing
Center the data to be roughly 0. Activation functions ($\tanh(x)$, $\sigma(x)$, ReLU) are usually "centered" around 0, so convergence is usually faster. Otherwise, a bias in the gradient direction might slow down learning.
8
Data pre-processing
Scale the input variables to have similar diagonal covariances $c_i = \sum_j (x_i^{(j)})^2$. Similar covariances lead to a more balanced rate of learning for the different weights. Rescaling to 1 is a good choice, unless some dimensions are less important. If $x_1, x_2, x_3$ have very different covariances, the generated gradients $d\mathcal{L}/d\theta_1, d\mathcal{L}/d\theta_2, d\mathcal{L}/d\theta_3$ are also very different, and the gradient update becomes harder:
$$\theta^{(t+1)} = \theta^{(t)} - \eta_t \begin{bmatrix} d\mathcal{L}/d\theta_1 \\ d\mathcal{L}/d\theta_2 \\ d\mathcal{L}/d\theta_3 \end{bmatrix}$$
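A minimal sketch of the centering and rescaling described on the last two slides (per-dimension zero mean, unit variance); the function name and epsilon are illustrative.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Center each input dimension to zero mean and rescale to unit variance.

    The statistics are computed on the training set and should be reused at test time.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

# toy usage: three dimensions with very different scales
X = np.random.default_rng(0).normal(loc=5.0, scale=[0.1, 10.0, 3.0], size=(100, 3))
X_std, mean, std = standardize(X)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))
```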
9
Data pre-processing
The input variables should be as decorrelated as possible. Decorrelated inputs are "more independent", and the network is forced to find non-trivial correlations between the inputs; decorrelated inputs usually mean better optimization. Obviously this does not apply when the inputs are by definition correlated (e.g. sequences). Caution: in the extreme case, extreme correlation (linear dependency) might cause problems.
10
Batch normalization (intuitively)
[Figure: during backpropagation, the distribution of the layer input $x_l$ shifts between updates at time (t), (t+0.5) and (t+1).]
11
Batch normalization (algorithm)
$\mu_\mathcal{B} \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_i$ [compute the mini-batch mean]
$\sigma_\mathcal{B}^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_\mathcal{B})^2$ [compute the mini-batch variance]
$\hat{x}_i \leftarrow \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \varepsilon}}$ [normalize the input]
$y_i \leftarrow \gamma \hat{x}_i + \beta$ [scale and shift the input; $\gamma$ and $\beta$ are trainable parameters]
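A minimal NumPy sketch of the four steps above for the training-time forward pass; running statistics for test time are left out, and the shapes are illustrative.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch x of shape (m, d).

    gamma and beta are the trainable per-feature scale and shift parameters.
    """
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift

# toy usage: output has roughly zero mean and unit variance per feature
x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(16, 4))
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(4), y.std(axis=0).round(4))
```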
12
Batch normalization (benefits)
Stronger gradients allow higher learning rates and faster training; otherwise gradients may explode or vanish, or training may get stuck in local minima. Neurons get activated in a near-optimal "regime". Better model regularization: the neuron activations are not deterministic but depend on the batch, so the model cannot become overconfident.
13
ℓ2-regularization The most important (or most popular) regularization
$$\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\in(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L})) + \frac{\lambda}{2}\sum_l \|\theta_l\|^2$$
The ℓ2-regularization term can pass inside the gradient descent update rule:
$$\theta^{(t+1)} = \theta^{(t)} - \eta_t\left(\nabla_\theta\mathcal{L} + \lambda\theta_l\right) \implies \theta^{(t+1)} = (1 - \lambda\eta_t)\,\theta^{(t)} - \eta_t\nabla_\theta\mathcal{L}$$
$\lambda$ is usually about $10^{-1}$ to $10^{-2}$. Also called "weight decay", because the weights get smaller.
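A minimal sketch of the decoupled "weight decay" form of the update above; the learning rate and λ values are illustrative.

```python
import numpy as np

def l2_sgd_step(theta, grad, lr=0.1, lam=0.01):
    """One SGD step with l2-regularization ("weight decay").

    Equivalent forms: theta - lr*(grad + lam*theta)  ==  (1 - lam*lr)*theta - lr*grad
    """
    return (1.0 - lam * lr) * theta - lr * grad

# toy usage: with a zero loss gradient, the weights simply decay towards 0
theta = np.array([1.0, -2.0, 3.0])
for _ in range(5):
    theta = l2_sgd_step(theta, grad=np.zeros_like(theta))
print(theta)
```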
14
Early stopping Monitor performance on a separate validation set
Training the network will decrease the training error as well as the validation error (although usually at a slower rate). Stop when the validation error starts increasing: this quite likely means the network is starting to overfit.
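A minimal sketch of monitoring a validation error with a patience counter; `train_one_epoch` and `validate` are hypothetical callables you would supply, and checkpointing is only hinted at in a comment.

```python
def train_with_early_stopping(train_one_epoch, validate, max_epochs=100, patience=5):
    """Stop training once the validation error has not improved for `patience` epochs.

    train_one_epoch() runs one epoch of SGD; validate() returns the current validation error.
    """
    best_err, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_err = validate()
        if val_err < best_err:
            best_err, best_epoch = val_err, epoch   # new best model: ideally checkpoint it here
        elif epoch - best_epoch >= patience:
            break                                   # validation error keeps increasing: stop
    return best_epoch, best_err
```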
15
Dropout [Srivastava2014] During training, set activations randomly to 0: neurons are sampled at random from a Bernoulli distribution with $p=0.5$. At test time all neurons are used, and the neuron activations are reweighted by $p$. Benefits: reduces complex co-adaptations or co-dependencies between neurons; no "free-rider" neurons that rely on others; every neuron becomes more robust; significantly decreases overfitting; significantly improves training speed.
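A minimal sketch of the train/test behaviour described above, following the slide's formulation (drop at training time, reweight by $p$ at test time); all names are illustrative.

```python
import numpy as np

def dropout(a, p=0.5, train=True, rng=np.random.default_rng()):
    """Dropout on a layer's activations `a`.

    Train: keep each neuron with probability p (Bernoulli mask), zero out the rest.
    Test:  use all neurons, but reweight the activations by p so the expected
           input to the next layer matches training.
    """
    if train:
        mask = rng.random(a.shape) < p      # Bernoulli(p) keep-mask
        return a * mask
    return a * p

# toy usage
a = np.ones(10)
print(dropout(a, train=True))    # roughly half the activations zeroed
print(dropout(a, train=False))   # all activations scaled by p
```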
16
Dropout Effectively, a different architecture at every training epoch
Similar to model ensembles Original model
17
Dropout Effectively, a different architecture at every training epoch
Similar to model ensembles Epoch 1
19
Dropout Effectively, a different architecture at every training epoch
Similar to model ensembles Epoch 2
21
Learning rate The right learning rate $\eta_t$ is very important for fast convergence. Too strong: the gradients overshoot and bounce. Too weak: the gradients are too small and training is slow. A learning rate per weight is often advantageous, since some weights are near convergence and others are not. Rule of thumb: the learning rate of (shared) weights should be proportional to the square root of the number of shared weight connections. Adaptive learning rates are also possible, based on the observed errors [Sompolinsky1995].
22
Weight initialization
There are a few contradictory requirements. Weights need to be small enough: around the origin (0) for symmetric activation functions (tanh, sigmoid), so that when training starts the activation functions are stimulated near their linear regime, giving larger gradients and faster training. Weights need to be large enough: otherwise the signal is too weak for any serious learning.
23
Weight initialization: Similar input/output variance
Weights must be initialized to preserve the variance of the activations during the forward and backward computations. This is especially important for deep learning, so that all neurons operate at their full capacity.
24
Quiz: Why similar input/output variance?
Weights must be initialized to preserve the variance of the activations during the forward and backward computations. This is especially important for deep learning, so that all neurons operate at their full capacity.
25
Quiz: Why similar input/output variance?
Weights must be initialized to preserve the variance of the activations during the forward and backward computations. This is especially important for deep learning, so that all neurons operate at their full capacity. The output of one module is the input to another; similar variance ensures smooth learning for the nonlinearities.
26
Weight initialization: Good practices
Good practice: initialize the weights to be asymmetric. Don't give the same value to all weights (like all 0): in that case all neurons generate the same gradient and there is no learning. Generally speaking, the initialization depends on the non-linearities and the data normalization.
27
[He2015] initialization for ReLUs
Unlike sigmoids, ReLUs ground the linear activations to 0 half the time, so the weight variance is doubled to compensate for the zero flat-area; the input and output then maintain the same variance. Very similar to Xavier initialization. Draw random weights from $w \sim N(0, 2/d)$, where $d$ is the number of neurons in the input.
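A minimal sketch of the sampling rule above for one fully-connected layer; the fan-in/fan-out names and shapes are illustrative.

```python
import numpy as np

def he_init(fan_in, fan_out, rng=np.random.default_rng()):
    """He initialization for ReLU layers: w ~ N(0, 2 / fan_in).

    fan_in is d, the number of input neurons to the layer.
    """
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

# toy usage: the empirical variance of the weights should be close to 2 / fan_in
W = he_init(fan_in=512, fan_out=256)
print(W.var(), 2.0 / 512)
```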
28
Optimizers: Momentum Don’t switch gradients all the time
Maintain a "momentum" from the previous parameter updates: more robust gradients and learning, and faster convergence. Nice "physics"-based interpretation: instead of updating the position of the "ball", we update the velocity, which updates the position:
$$u_\theta^{(t+1)} = \gamma\, u_\theta^{(t)} - \eta_t \nabla_\theta\mathcal{L}, \qquad \theta^{(t+1)} = \theta^{(t)} + u_\theta^{(t+1)}$$
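A minimal sketch of the velocity-based update above; the γ and learning-rate values and the toy quadratic objective are illustrative.

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.01, gamma=0.9):
    """One momentum update: the velocity accumulates past gradients, then moves the parameters."""
    velocity = gamma * velocity - lr * grad
    theta = theta + velocity
    return theta, velocity

# toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta, velocity = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, velocity = momentum_step(theta, velocity, grad=theta)
print(theta)   # approaches the minimum at 0
```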
29
Optimizers: Momentum
[Figure: loss surface with the plain gradient path vs. the gradient + momentum path.]
30
Optimizers: RMSprop
Keep a moving average of the squared gradients and divide the update by its square root:
$$m_j^{(t)} = \alpha\, m_j^{(t-1)} + (1-\alpha)\left(\nabla_\theta\mathcal{L}_j\right)^2, \qquad \theta^{(t+1)} = \theta^{(t)} - \eta_t \frac{\nabla_\theta\mathcal{L}}{\sqrt{m} + \varepsilon}$$
Compared to Adagrad: with large gradients (e.g. a too "noisy" loss surface) the updates are tamed; with small gradients (e.g. stuck in a flat loss-surface ravine) the updates become more aggressive.
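A minimal sketch of the update above; the α, learning-rate and ε values and the toy quadratic objective are illustrative.

```python
import numpy as np

def rmsprop_step(theta, m, grad, lr=0.01, alpha=0.9, eps=1e-8):
    """One RMSprop update: a moving average of squared gradients scales the step per parameter."""
    m = alpha * m + (1.0 - alpha) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(m) + eps)
    return theta, m

# toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta
theta, m = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(1000):
    theta, m = rmsprop_step(theta, m, grad=theta)
print(theta)   # oscillates near the minimum at 0
```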
31
Optimizers: RMSprop Square rooting boosts small values while suppressing large values
32
Visual overview
33
Standard vs Structured Inference
34
Standard inference N-way classification Dog? Cat? Bike? Car? Plane?
35
Standard inference N-way classification Regression
How popular will this movie be on IMDB?
36
Standard inference N-way classification Regression Ranking …
Who is older?
37
Quiz: What is common? N-way classification Regression Ranking …
38
Quiz: What is common? They all make “single value” predictions Do all our machine learning tasks boil down to “single value” predictions?
39
Beyond “single value” predictions?
Do all our machine learning tasks boil down to "single value" predictions? Are there tasks where the outputs are somehow correlated? Is there some structure in these output correlations? How can we predict such structures?
40
Example: Object detection
Predict a box around an object. Images: spatial location, i.e. a b(ounding) box. Videos: spatio-temporal location. …
41
Quiz: Other examples?
42
Object segmentation
43
Optical flow & motion estimation
44
Depth estimation Godard et al., Unsupervised Monocular Depth Estimation with Left-Right Consistency, 2016
45
Normals and reflectance estimation
46
Structured prediction
Prediction goes beyond asking for "single values". The outputs are complex and the output dimensions are correlated. The output dimensions have latent structure. Can we make deep networks return structured predictions?
48
Convnets for structured prediction
49
Sliding window on feature maps
Selective Search Object Proposals [Uijlings2013] SPPnet [He2014] Fast R-CNN [Girshick2015]
50
Fast R-CNN: Steps Process the whole image up to conv5, producing the conv5 feature map.
51
Fast R-CNN: Steps Process the whole image up to conv5 (conv1-conv5), producing the conv5 feature map. Compute possible locations for objects.
52
Fast R-CNN: Steps Process the whole image up to conv5
Compute possible locations for objects (some correct, most wrong). Given a single location, the ROI pooling module extracts a fixed-length feature from the conv5 feature map (always 4x4 no matter the size of the candidate location), from which the network predicts new box coordinates and the class (car/dog/bicycle).
53
Region-Of-Interest Pooling
Divide the candidate region of the feature map into $T \times T$ cells. The cell size changes depending on the size of the candidate location, so the pooled output is always $T \times T$ (here 3x3) no matter the size of the candidate location.
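A minimal NumPy sketch of the idea: max-pool a candidate box into a fixed $T \times T$ grid of cells whose size depends on the box; the cell-splitting scheme and names are illustrative, not the exact Fast R-CNN implementation.

```python
import numpy as np

def roi_pool(feature_map, box, T=3):
    """Pool a candidate box on an (H, W) feature map into a fixed TxT grid.

    box = (y0, x0, y1, x1) in feature-map coordinates; each output cell is the
    max over its (variable-sized) region, so any box size maps to TxT values.
    The box is assumed to span at least T positions in each dimension.
    """
    y0, x0, y1, x1 = box
    roi = feature_map[y0:y1, x0:x1]
    h_edges = np.linspace(0, roi.shape[0], T + 1).astype(int)
    w_edges = np.linspace(0, roi.shape[1], T + 1).astype(int)
    out = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            cell = roi[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            out[i, j] = cell.max()
    return out

# toy usage: two boxes of different sizes both give a 3x3 output
fm = np.random.default_rng(0).normal(size=(32, 32))
print(roi_pool(fm, (2, 2, 20, 14)).shape, roi_pool(fm, (5, 5, 11, 29)).shape)
```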
54
Some results
55
Going Fully Convolutional
[LongCVPR2014] If the image is larger than the network input, slide the network over it: at each position the network (conv1-conv5, fc1, fc2) answers "Is this pixel a camel?" with yes/no.
60
Fully Convolutional Networks
[LongCVPR2014] Connect intermediate layers to output
61
Fully Convolutional Networks
The output is too coarse: for an image of size 500x500, with an AlexNet input size of 227x227, the output is 10x10. How to obtain dense predictions? Upconvolution (other names: deconvolution, transposed convolution, fractionally-strided convolution).
62
Deconvolutional modules
[Figure: input/output feature maps for convolution (no padding, no strides), upconvolution (no padding, no strides), and upconvolution (padding, strides).]
63
[Figure: coarse-to-fine output via stacked 2x upconvolutions (7x7 to 14x14 to 224x224). The pixel label probabilities are compared against the ground-truth pixel labels: a large loss is generated where the predicted probability is much higher than the ground truth, and a small loss where they agree.]
64
Structured losses
65
Deep ConvNets with CRF loss
[Chen, Papandreou 2016] The segmentation map is good but not pixel-precise: details around the boundaries are lost. Cast the fully convolutional outputs as unary potentials and consider pairwise potentials between the output dimensions.
66
Deep ConvNets with CRF loss
[Chen, Papandreou 2016]
67
Deep ConvNets with CRF loss
[Chen, Papandreou 2016] The segmentation map is good but not pixel-precise: details around the boundaries are lost. Cast the fully convolutional outputs as unary potentials, consider pairwise potentials between the output dimensions, and include a fully connected CRF loss to refine the segmentation:
$$E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j)$$
(total loss = unary loss + pairwise loss), with the pairwise term
$$\theta_{ij}(x_i, x_j) \sim w_1 \exp\!\left(-\alpha\|p_i - p_j\|^2 - \beta\|I_i - I_j\|^2\right) + w_2 \exp\!\left(-\gamma\|p_i - p_j\|^2\right)$$
where $p_i$ are pixel positions and $I_i$ pixel intensities.
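A minimal sketch of evaluating the pairwise kernel above for a single pair of pixels; the weights $w_1, w_2$ and bandwidths α, β, γ are illustrative hyper-parameters, and real dense-CRF implementations evaluate this kernel densely and efficiently rather than pixel pair by pixel pair.

```python
import numpy as np

def pairwise_potential(p_i, p_j, I_i, I_j,
                       w1=10.0, w2=3.0, alpha=0.01, beta=0.05, gamma=0.05):
    """Pairwise kernel between pixels i and j.

    p_*: 2D pixel positions, I_*: RGB intensities. The first (bilateral) term is
    large for nearby pixels with similar color, the second (smoothness) term is
    large for nearby pixels regardless of color.
    """
    d_pos = np.sum((p_i - p_j) ** 2)
    d_col = np.sum((I_i - I_j) ** 2)
    return w1 * np.exp(-alpha * d_pos - beta * d_col) + w2 * np.exp(-gamma * d_pos)

# toy usage: nearby, similar-colored pixels get a much larger potential than distant ones
print(pairwise_potential(np.array([10, 10]), np.array([11, 10]),
                         np.array([0.5, 0.4, 0.3]), np.array([0.52, 0.41, 0.3])))
print(pairwise_potential(np.array([10, 10]), np.array([200, 180]),
                         np.array([0.5, 0.4, 0.3]), np.array([0.9, 0.1, 0.1])))
```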
68
Examples
69
Multi-task learning The total loss is the summation of the per-task losses. Each per-task loss relies on the common weights (the VGGnet) and on the weights specialized for that task:
$$\mathcal{L}_{total} = \sum_{task} \mathcal{L}_{task}(\theta_{common}, \theta_{task}) + \mathcal{R}(\theta_{task})$$
One training image might contain annotations for only specific tasks. Only those tasks are "run" for that image, and the gradients per image are computed only for the tasks available for that image.
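A minimal sketch of summing per-task losses while skipping tasks whose annotation is missing for a given image; the task names, heads and loss functions are hypothetical placeholders.

```python
import numpy as np

def multitask_loss(shared_features, heads, labels):
    """Sum per-task losses, skipping tasks with no annotation for this image.

    shared_features: output of the common trunk for one image.
    heads:  dict task_name -> (weights, loss_fn) for the task-specific parts.
    labels: dict task_name -> ground truth, or None when the annotation is missing.
    """
    total = 0.0
    for task, (w_task, loss_fn) in heads.items():
        if labels.get(task) is None:
            continue                               # no annotation: task not "run" for this image
        prediction = shared_features @ w_task      # stand-in for the task head
        total += loss_fn(prediction, labels[task])
    return total

# toy usage: this image only has a "segmentation" annotation
rng = np.random.default_rng(0)
feats = rng.normal(size=8)
heads = {"segmentation": (rng.normal(size=(8, 4)), lambda p, y: np.mean((p - y) ** 2)),
         "boundaries":   (rng.normal(size=(8, 4)), lambda p, y: np.mean((p - y) ** 2))}
labels = {"segmentation": np.zeros(4), "boundaries": None}
print(multitask_loss(feats, heads, labels))
```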
70
Ubernet [Kokkinos2016]
71
One datum Several tasks
Assume our data are images. Per image we can predict boundaries, segmentation, detection, … Why separately? Solve multiple tasks simultaneously: one task might help learn another better, and one task might have more annotations. Also, in real applications we don't want 7 VGGnets: 1 for boundaries, 1 for normals, 1 for saliency, …
72
One image Several tasks
73
Take away message Deep Learning is good not only for classifying things. Structured prediction is also possible. Multi-task structured prediction allows for unified networks and for missing data. Discovering structure in data is also possible.
74
Thank you!