Deep Learning RUSSIR 2017 – Day 2


Deep Learning RUSSIR 2017 – Day 2 Efstratios Gavves

Recap from Day 2
Convolutional Networks are optimal for images: parameter sharing makes them much cheaper and faster, with better local invariances
Several ConvNet architectures are possible: AlexNet/VGGNet, ResNet, Google Inception V1-4

Deep Nets in Practice
Data pre-processing
Optimization
Regularization
Hyper-parameters

Back-propagation
Step 1. Compute the forward propagation for all layers recursively: a_l = h_l(x_l) and x_{l+1} = a_l
Step 2. Once done with the forward propagation, follow the reverse path, starting from the last layer. For each layer compute the gradients
∂ℒ/∂a_l = (∂a_{l+1}/∂x_{l+1})^T · ∂ℒ/∂a_{l+1}
∂ℒ/∂θ_l = ∂a_l/∂θ_l · (∂ℒ/∂a_l)^T
Cache computations when possible to avoid redundant operations
Step 3. Use the gradients ∂ℒ/∂θ_l with Stochastic Gradient Descent to train
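
A minimal NumPy sketch of these three steps for a toy two-layer network with a squared-error loss; the layer sizes, variable names and loss choice are illustrative assumptions, not something specified in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and a 2-layer network (assumed sizes, not from the slides)
x, y = rng.normal(size=(4, 1)), rng.normal(size=(2, 1))
W1, W2 = rng.normal(size=(3, 4)) * 0.1, rng.normal(size=(2, 3)) * 0.1

# Step 1: forward propagation, caching the intermediate activations
a1 = np.tanh(W1 @ x)                  # a_1 = h_1(x_1)
a2 = W2 @ a1                          # x_2 = a_1, linear output layer
loss = 0.5 * np.sum((a2 - y) ** 2)
print("loss before the update:", loss)

# Step 2: reverse path, reusing the cached forward values
dL_da2 = a2 - y                       # dL/da_2 for the squared-error loss
dL_dW2 = dL_da2 @ a1.T                # dL/dtheta_2
dL_da1 = W2.T @ dL_da2                # chain rule through the linear layer
dL_dW1 = (dL_da1 * (1 - a1 ** 2)) @ x.T   # through tanh: h' = 1 - tanh^2

# Step 3: one SGD step using the gradients
eta = 0.1
W1 -= eta * dL_dW1
W2 -= eta * dL_dW2
```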

Stochastic Gradient Descent
Stochastically sample “mini-batches” from the dataset D: B_j = sample(D)
θ^(t+1) = θ^(t) − (η_t / |B_j|) Σ_{i ∈ B_j} ∇_θ ℒ_i
B_j can contain even just 1 sample
Much faster than Gradient Descent, and the results are often better
Also suitable for datasets that change over time
The variance of the gradients increases when the batch size decreases
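
A minimal sketch of the mini-batch sampling and update rule above, assuming a per-sample gradient callback `grad_loss` (a hypothetical name, not from the slides):

```python
import numpy as np

def sgd(theta, data, grad_loss, eta=0.01, batch_size=32, epochs=10, seed=0):
    """Plain mini-batch SGD following the update rule above.

    grad_loss(theta, sample) is an assumed callback that returns the
    per-sample gradient; it is not something defined in the slides.
    """
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        # B_j = sample(D): shuffle and cut the dataset into mini-batches
        idx = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            grad = np.mean([grad_loss(theta, s) for s in batch], axis=0)
            theta = theta - eta * grad   # theta_(t+1) = theta_(t) - (eta/|B_j|) * sum of gradients
    return theta
```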

SGD is often more accurate
[Figure: loss surface showing the current solution, a noisy SGD gradient vs. the full GD gradient, and the best SGD solution vs. the best GD solution]
There is no guarantee that this is always going to happen, but the noisy SGD gradients can sometimes help escape local optima

Data pre-processing
Center the data to be roughly 0
Activation functions (tanh(x), σ(x), ReLU) are usually “centered” around 0, so convergence is usually faster
Otherwise the bias on the gradient direction might slow down learning
[Figure: plots of tanh(x), σ(x) and ReLU]

Data pre-processing
Scale input variables to have similar diagonal covariances c_i = Σ_j (x_i^(j))^2
Similar covariances → a more balanced rate of learning for different weights
Rescaling to 1 is a good choice, unless some dimensions are less important
If x_1, x_2, x_3 have very different covariances, the generated gradients dℒ/dθ are also very different, and the gradient update
θ^(t+1) = θ^(t) − η_t [dℒ/dθ_1, dℒ/dθ_2, dℒ/dθ_3]^T
becomes harder

Data pre-processing
Input variables should be as decorrelated as possible
Input variables are “more independent”, and the network is forced to find non-trivial correlations between inputs
Decorrelated inputs → better optimization
Obviously this is not the case when the inputs are by definition correlated (sequences)
[CAUTION] In the extreme case, extreme correlation (linear dependency) might cause problems
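
A minimal sketch, under assumed variable names, of the pre-processing steps from the last few slides: centering, scaling each dimension to comparable variance, and optionally decorrelating the inputs with a PCA rotation.

```python
import numpy as np

def preprocess(X, decorrelate=False, eps=1e-8):
    """X: (n_samples, n_features). Center, scale, optionally decorrelate."""
    X = X - X.mean(axis=0)                 # center data around 0
    X = X / (X.std(axis=0) + eps)          # similar (unit) variance per dimension
    if decorrelate:
        # PCA rotation: the covariance becomes diagonal, i.e. inputs are decorrelated
        cov = np.cov(X, rowvar=False)
        _, eigvecs = np.linalg.eigh(cov)
        X = X @ eigvecs
    return X
```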

Batch normalization (intuitively)
[Figure: the distribution of the layer input x_l shifting during backpropagation, shown at steps (t), (t+0.5) and (t+1)]

Batch normalization (algorithm)
μ_B ← (1/m) Σ_{i=1..m} x_i   [compute mini-batch mean]
σ_B^2 ← (1/m) Σ_{i=1..m} (x_i − μ_B)^2   [compute mini-batch variance]
x̂_i ← (x_i − μ_B) / √(σ_B^2 + ε)   [normalize input]
y_i ← γ x̂_i + β   [scale and shift input; γ, β are trainable parameters]
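
The same algorithm as a small NumPy function (forward pass only; the running statistics used at test time are omitted for brevity):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features); gamma, beta: trainable (features,) parameters."""
    mu = x.mean(axis=0)                        # mini-batch mean
    var = x.var(axis=0)                        # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # normalize input
    return gamma * x_hat + beta                # scale and shift
```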

Batch normalization (benefits)
Stronger gradients → higher learning rates → faster training
Otherwise gradients may explode or vanish, or training may get stuck in local minima
Neurons get activated in a near-optimal “regime”
Better model regularization: neuron activations are not deterministic but depend on the batch, so the model cannot be overconfident

ℓ2-regularization
The most important (or most popular) regularization
θ* ← arg min_θ Σ_{(x,y)⊆(X,Y)} ℒ(y, a_L(x; θ_{1,…,L})) + (λ/2) Σ_l ‖θ_l‖^2
The ℓ2-regularization can be folded into the gradient descent update rule:
θ^(t+1) = θ^(t) − η_t (∇_θ ℒ + λ θ^(t)) ⟹ θ^(t+1) = (1 − λ η_t) θ^(t) − η_t ∇_θ ℒ
λ is usually about 10^−1, 10^−2
Called “weight decay”, because the weights get smaller
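
The folded-in update above, written as a one-line step; the function name and default values are illustrative:

```python
def sgd_step_weight_decay(theta, grad, eta=0.01, lam=1e-2):
    """One SGD step with l2-regularization folded in ("weight decay"):
    theta <- (1 - lam * eta) * theta - eta * grad."""
    return (1.0 - lam * eta) * theta - eta * grad
```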

Early stopping
Monitor performance on a separate validation set
Training the network will decrease the training error, as well as the validation error (although usually at a slower rate)
Stop when the validation error starts increasing
This quite likely means the network is starting to overfit

Dropout [Srivastava2014]
During training, set activations randomly to 0
Neurons are sampled at random from a Bernoulli distribution with p = 0.5
At test time all neurons are used, with neuron activations reweighted by p
Benefits
Reduces complex co-adaptations or co-dependencies between neurons
No “free-rider” neurons that rely on others: every neuron becomes more robust
Significantly decreases overfitting
Significantly improves training speed
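
A minimal sketch of this train/test behaviour (a Bernoulli mask with keep probability p during training, reweighting by p at test time); the function and argument names are assumptions:

```python
import numpy as np

def dropout(a, p=0.5, train=True, rng=None):
    """a: layer activations; p: probability of keeping a neuron active."""
    rng = rng or np.random.default_rng(0)
    if train:
        mask = rng.binomial(1, p, size=a.shape)   # Bernoulli(p) sample per neuron
        return a * mask                           # dropped activations are set to 0
    return a * p                                  # test time: reweight activations by p
```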

Dropout Effectively, a different architecture at every training epoch Similar to model ensembles Original model

Dropout Effectively, a different architecture at every training epoch Similar to model ensembles Epoch 1

Dropout Effectively, a different architecture at every training epoch Similar to model ensembles Epoch 2

Learning rate
The right learning rate η_t is very important for fast convergence
Too strong → gradients overshoot and bounce
Too weak → too small gradients → slow training
A learning rate per weight is often advantageous: some weights are near convergence, others are not
Rule of thumb: the learning rate of (shared) weights is proportional to the square root of the number of shared weight connections
Adaptive learning rates are also possible, based on the errors observed [Sompolinsky1995]

Weight initialization
There are a few contradictory requirements
Weights need to be small enough: around the origin (0) the symmetric activation functions (tanh, sigmoid) are near their linear regime, so when training starts they give larger gradients → faster training
Weights need to be large enough: otherwise the signal is too weak for any serious learning

Weight initialization: Similar input/output variance Weights must be initialized to preserve the variance of the activations during the forward and backward computations Especially for deep learning All neurons operate in their full capacity

Quiz: Why similar input/output variance? Weights must be initialized to preserve the variance of the activations during the forward and backward computations Especially for deep learning All neurons operate in their full capacity

Quiz: Why similar input/output variance?
Weights must be initialized to preserve the variance of the activations during the forward and backward computations
Especially for deep learning
All neurons operate in their full capacity
The output of one module is the input to another; similar variance ensures smooth learning for the nonlinearities
[Figure: Input → Layer 1 → Output/Input → Layer 2]

Weight initialization: Good practices
Good practice: initialize weights to be asymmetric
Don’t give the same value to all weights (like all 0); in that case all neurons generate the same gradient → no learning
Generally speaking, initialization depends on the non-linearities and the data normalization

[He2015] initialization for ReLUs
Unlike sigmoids, ReLUs ground the linear activations to 0 half the time
Double the weight variance to compensate for the zero flat-area → input and output maintain the same variance
Very similar to Xavier initialization
Draw random weights from w ~ N(0, 2/d), where d is the number of neurons in the input
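
A one-function sketch of this initialization; `fan_in`/`fan_out` are assumed names for the input/output dimensions of a layer:

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """Draw weights from N(0, 2/d), d = number of input neurons (for ReLU layers)."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
```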

Optimizers: Momentum
Don’t switch gradients all the time: maintain “momentum” from the previous updates
More robust gradients and learning → faster convergence
Nice “physics”-based interpretation: instead of updating the position of the “ball”, we update the velocity, which updates the position
u_θ^(t+1) = γ u_θ^(t) − η_t ∇_θ ℒ
θ^(t+1) = θ^(t) + u_θ^(t+1)
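
A minimal sketch of the velocity/position update above; the default γ = 0.9 is a common choice, not a value given in the slides:

```python
def momentum_step(theta, u, grad, eta=0.01, gamma=0.9):
    """Update the velocity u, then use it to update the position theta."""
    u = gamma * u - eta * grad       # velocity keeps a memory of past gradients
    return theta + u, u
```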

Optimizers: Momentum
[Figure: loss surface comparing the plain gradient trajectory with the gradient + momentum trajectory]

Optimizers: RMSprop
Keep a moving average of the squared gradients:
m_j^(t) = α m_j^(t−1) + (1 − α) (∇_θ ℒ_j)^2 ⟹ θ^(t+1) = θ^(t) − η_t ∇_θ ℒ / (√m^(t) + ε)
Compared to Adagrad:
For large gradients, e.g. a too “noisy” loss surface, the updates are tamed
For small gradients, e.g. when stuck in a flat loss-surface ravine, the updates become more aggressive
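
A minimal sketch of the update above; the default α, η and ε values are illustrative:

```python
import numpy as np

def rmsprop_step(theta, m, grad, eta=0.001, alpha=0.9, eps=1e-8):
    """m is the running (moving) average of the squared gradients, per parameter."""
    m = alpha * m + (1 - alpha) * grad ** 2
    theta = theta - eta * grad / (np.sqrt(m) + eps)   # the square root tames large updates
    return theta, m
```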

Optimizers: RMSprop
Square rooting boosts small values while suppressing large values

Visual overview

Standard vs Structured Inference

Standard inference N-way classification Dog? Cat? Bike? Car? Plane?

Standard inference N-way classification Regression How popular will this movie be in IMDB?

Standard inference N-way classification Regression Ranking … Who is older?

Quiz: What is common? N-way classification Regression Ranking …

Quiz: What is common? They all make “single value” predictions Do all our machine learning tasks boil down to “single value” predictions?

Beyond “single value” predictions?
Do all our machine learning tasks boil down to “single value” predictions?
Are there tasks where the outputs are somehow correlated?
Is there some structure in these output correlations?
How can we predict such structures?

Example: Object detection
Predict a box around an object
Images → spatial location: b(ounding) box
Videos → spatio-temporal location: bbox@t, bbox@t+1, …

Quiz: Other examples?

Object segmentation

Optical flow & motion estimation

Depth estimation Godard et al., Unsupervised Monocular Depth Estimation with Left-Right Consistency, 2016

Normals and reflectance estimation

Structured prediction
Prediction goes beyond asking for “single values”
Outputs are complex and the output dimensions are correlated
Output dimensions have latent structure
Can we make deep networks return structured predictions?

Convnets for structured prediction

Sliding window on feature maps Selective Search Object Proposals [Uijlings2013] SPPnet [He2014] Fast R-CNN [Girshick2015]

Fast R-CNN: Steps
Process the whole image up to conv5
[Figure: conv5 feature map]

Fast R-CNN: Steps
Process the whole image up to conv5
Compute possible locations for objects
[Figure: Conv 1 → Conv 2 → Conv 3 → Conv 4 → Conv 5 feature map]

Fast R-CNN: Steps
Process the whole image up to conv5
Compute possible locations for objects: some correct, most wrong
Given a single location → the ROI pooling module extracts a fixed-length feature
[Figure: Conv 1–Conv 5 feature map → ROI Pooling Module (always 4x4 no matter the size of the candidate location) → new box coordinates and class scores (car/dog/bicycle)]

Region-Of-Interest Pooling
Divide the feature map into T×T cells
The cell size changes depending on the size of the candidate location
[Figure: the pooled output is always 3x3 no matter the size of the candidate location]
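
A minimal single-channel NumPy sketch of ROI max-pooling into a fixed T×T grid; the function name, the (y0, x0, y1, x1) box convention and T = 3 are assumptions for illustration:

```python
import numpy as np

def roi_max_pool(feature_map, roi, T=3):
    """feature_map: (H, W) single channel; roi: (y0, x0, y1, x1) on the feature map.

    Divides the ROI into a TxT grid and max-pools each cell, so the output
    is always TxT no matter the size of the candidate location.
    """
    y0, x0, y1, x1 = roi
    ys = np.linspace(y0, y1, T + 1).astype(int)   # cell boundaries adapt to the ROI size
    xs = np.linspace(x0, x1, T + 1).astype(int)
    out = np.empty((T, T))
    for i in range(T):
        for j in range(T):
            cell = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                               xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()                # one max value per grid cell
    return out
```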

Some results

Going Fully Convolutional [LongCVPR2014]
If the image is larger than the network input → slide the network over the image (“Is this pixel a camel? Yes! / No!”)
[Figure: Conv 1 → Conv 2 → Conv 3 → Conv 4 → Conv 5 → fc1 → fc2, applied at several image positions]

Fully Convolutional Networks [LongCVPR2014] Connect intermediate layers to output

Fully Convolutional Networks
Output is too coarse: image size 500x500, AlexNet input size 227x227 → output 10x10
How to obtain dense predictions? Upconvolution
Other names: deconvolution, transposed convolution, fractionally-strided convolution

Deconvolutional modules
[Figure (from https://github.com/vdumoulin/conv_arithmetic): convolution with no padding and no strides; upconvolution with no padding and no strides; upconvolution with padding and strides]
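
A minimal 1D NumPy illustration of upconvolution viewed as zero-insertion followed by an ordinary convolution (one common way to realize a fractionally-strided convolution); the 1D setting, kernel and names are assumptions for brevity:

```python
import numpy as np

def upconv1d(x, kernel, stride=2):
    """Transposed ("up") convolution in 1D: insert stride-1 zeros between
    input samples, then run a normal full convolution with the kernel."""
    upsampled = np.zeros(len(x) * stride - (stride - 1))
    upsampled[::stride] = x                       # zero-insertion upsampling
    return np.convolve(upsampled, kernel, mode="full")

coarse = np.array([1.0, 2.0, 3.0])
print(upconv1d(coarse, kernel=np.array([0.5, 1.0, 0.5])))  # output is denser than the input
```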

Coarse → Fine Output
[Figure: 7x7 → 14x14 (2x upconvolution) → 224x224 (2x upconvolution); the predicted pixel label probabilities (e.g. 0.8, 0.1, 0.9) are compared with the ground-truth pixel labels (1): a large loss is generated where the predicted probability is much higher than the ground truth, a small loss otherwise]

Structured losses

Deep ConvNets with CRF loss [Chen, Papandreou 2016] Segmentation map is good but not pixel-precise Details around boundaries are lost Cast fully convolutional outputs as unary potentials Consider pairwise potentials between output dimensions

Deep ConvNets with CRF loss [Chen, Papandreou 2016]

Deep ConvNets with CRF loss [Chen, Papandreou 2016]
The segmentation map is good but not pixel-precise; details around boundaries are lost
Cast the fully convolutional outputs as unary potentials
Consider pairwise potentials between output dimensions
Include a fully connected CRF loss to refine the segmentation:
E(x) = Σ_i θ_i(x_i) + Σ_{ij} θ_ij(x_i, x_j)   [total loss = unary loss + pairwise loss]
θ_ij(x_i, x_j) ~ w_1 exp(−α‖p_i − p_j‖^2 − β‖I_i − I_j‖^2) + w_2 exp(−γ‖p_i − p_j‖^2)

Examples

Multi-task learning
The total loss is the summation of the per-task losses
Each per-task loss relies on the common weights (VGGnet) and on the weights specialized for that task:
ℒ_total = Σ_task ℒ_task(θ_common, θ_task) + ℛ(θ_task)
One training image might contain annotations for only specific tasks
Only those tasks are “run” for that image
Gradients per image are computed only for the tasks available for that image
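
A minimal sketch of this per-image, per-task loss; `predict`, `task_losses` and the dictionary layout are hypothetical names for illustration, not part of the slides or of Ubernet:

```python
def multitask_loss(image, annotations, predict, task_losses):
    """annotations: dict task -> ground truth, containing only the tasks
    labelled for this image; predict(image, task) and task_losses[task]
    are assumed callbacks for the shared network and the per-task heads."""
    total = 0.0
    for task, target in annotations.items():      # only tasks annotated for this image
        total += task_losses[task](predict(image, task), target)
    return total                                   # gradients flow only through these tasks
```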

Ubernet [Kokkinos2016]

One datum → several tasks
Assume our data are images
Per image we can predict boundaries, segmentation, detection, …
Why separately? Solve multiple tasks simultaneously
One task might help learn another better
One task might have more annotations
In real applications we don’t want 7 VGGnets: 1 for boundaries, 1 for normals, 1 for saliency, …

One image → several tasks

Take-away message
Deep Learning is good not only for classifying things
Structured prediction is also possible
Multi-task structured prediction allows for unified networks and for missing data
Discovering structure in data is also possible

Thank you!