Deep Learning RUSSIR 2017 – Day 1


Deep Learning RUSSIR 2017 – Day 1 Efstratios Gavves

Who am I? Assistant Professor at the University of Amsterdam, QUVA Lab, with a focus on Deep Learning and Computer Vision. www.egavves.com https://ivi.fnwi.uva.nl/quva/ Also teaching Deep Learning for the MSc in AI at UvA. Slides and code available at https://deeplearningamsterdam.github.io/ Computer Vision focus: universal spatiotemporal representations. Machine Learning focus: learning to learn.

Overview
Day 1: Intro to Deep Learning. Multi-Layer Perceptrons, Convolutional Neural Networks, Deep Network properties
Day 2: Deeper into Deep Learning. Optimization, Structured-output Deep Neural Networks, Multi-task Learning
Day 3: Sequential Data. Recurrent Networks, Encoder-Decoder Embeddings, Memory Networks
Day 4: Generative Models. Variational Autoencoders, Generative Adversarial Networks

Neural Networks

A Neural Network perspective: Data → Model → Prediction → Loss
Model: e.g. 5 convolutional layers + 2 fully connected layers, or 100 convolutional layers + 50 batch normalization layers
Prediction: e.g. Softmax, Sigmoid, Linear, etc.
Loss: e.g. Cross entropy, Euclidean, Contrastive

Or simpler … In neural networks there is no real separation between layers. The output layer is just another layer, and the loss layer is just another layer with one more input (the targets). The whole model is a recursion over the layers k = 1, …, K: the data serve as the initialization of the recursion and the prediction is its final output.

$f(x) = V \cdot \exp(W * x)$ as a NN
A neural network is organized by layers: 1 layer → 1 module → a single and simple operation. Building the network up module by module:
Input: $x_0$
Convolutional: $x_1 = W * x_0$
Nonlinearity: $x_2 = \exp(x_1)$
Linear: $x_3 = V \cdot x_2$
Loss: $\mathcal{L} = \|y^* - x_3\|^2$

Module/Layer types?
Linear: $x_{i+1} = W \cdot x_i$ [parametric → learnable]
Convolutional: $x_{i+1} = W * x_i$ [parametric → learnable]
Nonlinearity: $x_{i+1} = h(x_i)$ [non-parametric → defined]
Pooling: $x_{i+1} = \mathrm{downsample}(x_i)$ [non-parametric → defined]
Normalization, e.g.: $x_{i+1} = \ell_2(x_i)$ [non-parametric → defined]
Regularization, e.g.: $x_{i+1} = \mathrm{dropout}(x_i)$ [non-parametric → defined]
Practically, any function that is first-order differentiable [almost everywhere] can be a module
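
To make the "one layer, one module" view concrete, here is a minimal NumPy sketch (the module names such as ConvModule and LinearModule are made up for illustration, not taken from the course code) that chains the modules of $f(x) = V \cdot \exp(W * x)$, using a 1-D "valid" convolution for $W * x$:

```python
import numpy as np

rng = np.random.default_rng(0)

class ConvModule:            # x1 = W * x0 (1-D, "valid" convolution)
    def __init__(self, size):
        self.W = rng.standard_normal(size)
    def forward(self, x):
        return np.convolve(x, self.W, mode="valid")

class ExpModule:             # x2 = exp(x1), a parameter-free nonlinearity
    def forward(self, x):
        return np.exp(x)

class LinearModule:          # x3 = V . x2
    def __init__(self, n_in, n_out):
        self.V = rng.standard_normal((n_out, n_in)) * 0.1
    def forward(self, x):
        return self.V @ x

class EuclideanLoss:         # L = ||y* - x3||^2, just another module with one extra input
    def forward(self, x, y):
        return np.sum((y - x) ** 2)

# Chain the modules: each one is a single, simple operation.
x0 = rng.standard_normal(10)
modules = [ConvModule(3), ExpModule(), LinearModule(8, 2)]
x = x0
for m in modules:
    x = m.forward(x)
loss = EuclideanLoss().forward(x, np.array([1.0, 0.0]))
print("prediction:", x, "loss:", loss)
```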

Training Neural Networks
1. The neural network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1) \dots, \theta_{L-1}), \theta_L)$
2. Learning by minimizing the empirical error: $\theta^* \leftarrow \arg\min_{\theta} \sum_{(x,y) \subseteq (X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$
3. Optimizing with gradient-descent-based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
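
As a minimal sketch of step 3, here is the gradient-descent update applied to a toy one-layer model with a Euclidean loss, so the gradient has a closed form (the data, learning-rate schedule and iteration count are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: y = X w_true + noise
X = rng.standard_normal((100, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.01 * rng.standard_normal(100)

theta = np.zeros(3)                      # theta^(0)
for t in range(200):
    pred = X @ theta                     # a_L(x; theta) for a single linear "layer"
    grad = X.T @ (pred - y) / len(y)     # gradient of 0.5 * mean squared error
    eta_t = 0.1 / (1 + 0.01 * t)         # decaying learning rate eta_t
    theta = theta - eta_t * grad         # theta^(t+1) = theta^(t) - eta_t * grad L
print(theta)                             # close to w_true
```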

Convolutional Neural Networks Or just Convnets/CNNs

Convnets vs NNs
Question: Spatial structure? NNs: don't care. Convnets: convolutional filters.
Question: Huge input dimensionalities? NNs: don't care (in theory). Convnets: parameter sharing.
Question: Local variances? NNs: don't care (much). Convnets: pooling.

Preserving spatial structure

Quiz: What does spatial mean?

What does spatial mean? One pixel alone does not carry much information; many pixels in the right order, though, carry tons of information. I.e., neighboring variables are correlated, and these variable correlations are the visual structure we want to learn.

Filters have width/height/depth. A grayscale image has a single channel, an RGB image has 3, and in general the input can have D channels; a filter extends over the full depth of its input. How many parameters per filter? $\#params = H \times W \times D$

Local connectivity. The weight connections are surface-wise local! The weight connections are depth-wise global. For standard neurons there is no local connectivity: everything is connected to everything.

Parameter Sharing. Natural images are stationary: visual features are common for different parts of one or multiple images. If features are local and similar across locations, why not reuse filters? Local parameter sharing → convolutions.

Convolutional filters Original image

Convolutional filters: the original image and a small convolutional filter (both shown as numeric grids on the slide).

Convolutional filters: convolving the image
Slide the filter over the image; at every position the output value is the inner product of the filter with the image patch beneath it:
$I(x,y) * h = \sum_{i=-a}^{a} \sum_{j=-b}^{b} I(x-i, y-j) \cdot h(i,j)$
(The slides animate this computation step by step on a small binary image.)
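
A direct NumPy sketch of the sliding-window computation above: the filter is flipped (true convolution rather than cross-correlation) and an inner product is taken at every valid position. The small binary image and 3×3 filter below are illustrative stand-ins for the grids drawn on the slides.

```python
import numpy as np

def conv2d_valid(I, h):
    """2-D convolution over the 'valid' region: flip the filter, slide it, take inner products."""
    hf = np.flip(h)                        # reversing h turns cross-correlation into convolution
    H, W = I.shape
    kh, kw = hf.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(I[y:y + kh, x:x + kw] * hf)   # inner product with the patch
    return out

I = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]], dtype=float)
h = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]], dtype=float)
print(conv2d_valid(I, h))                  # 3x3 map of filter responses
```

In practice most deep learning libraries implement cross-correlation (no filter flip) and still call it convolution; for learned filters the distinction does not matter.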

Convnets in Language [1]. Convnets can be parallelized more easily: 9x faster than recurrent neural systems, so more and better training is possible. Convnets are organized hierarchically: more complex relationships between the data are captured. [1] Convolutional Sequence to Sequence Learning, Gehring et al., 2017

Why call them convolutions? Definition: the convolution of two functions $f$ and $g$, denoted by $*$, is the integral of the product of the two functions after one is reversed and shifted: $(f * g)(t) \triangleq \int_{-\infty}^{\infty} f(\tau)\, g(t-\tau)\, d\tau = \int_{-\infty}^{\infty} f(t-\tau)\, g(\tau)\, d\tau$

Pooling. Often we want to summarize the local information into a single code vector: feature aggregation ≡ pooling. The pooled feature is invariant to small local transformations; with max pooling only the strongest activation is retained. Reduced output dimensions → faster computations. Keeps the most salient information. Inputs of different dimensionality can now be compared. Popular pooling methods: average, max, bilinear, convolutional.
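
A minimal sketch of 2×2 max pooling with stride 2 in NumPy (assuming the input height and width are divisible by the pool size; the input values are illustrative): only the strongest activation in each window is retained.

```python
import numpy as np

def max_pool2d(x, k=2):
    """Max pooling with a k x k window and stride k; assumes H and W are divisible by k."""
    H, W = x.shape
    x = x.reshape(H // k, k, W // k, k)    # split into non-overlapping k x k blocks
    return x.max(axis=(1, 3))              # keep the strongest activation per block

x = np.array([[1, 4, 3, 6],
              [2, 1, 9, 4],
              [9, 2, 2, 7],
              [7, 5, 7, 5]], dtype=float)
print(max_pool2d(x))                       # [[4. 9.]
                                           #  [9. 7.]]
```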

Implementation details
Stride: every how many pixels a convolution is computed; equivalent to a sampling coefficient, influences the output size
Padding: add 0s (or another value) around the layer input; prevents the output from getting smaller and smaller
Dilation: atrous (dilated) convolutions

Good practice
Resize the image to have a size that is a power of 2
Use stride $s = 1$
A filter of $[h_f \times w_f] = [3 \times 3]$ works quite well with deep architectures
Add 1 layer of zero padding
Avoid combinations of hyper-parameters that do not click, e.g. $s = 2$, $[h_f \times w_f] = [3 \times 3]$ and image size $[h_{in} \times w_{in}] = [6 \times 6]$ give $[h_{out} \times w_{out}] = [2.5 \times 2.5]$: programmatically awkward, and worse accuracy because borders are ignored
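
A quick sketch of the usual output-size arithmetic, $h_{out} = (h_{in} + 2p - h_f)/s + 1$, which can be used to check whether a combination of hyper-parameters clicks (i.e. yields an integer):

```python
def conv_output_size(h_in, h_f, stride=1, pad=0):
    """Output height/width of a convolution; a non-integer means the hyper-parameters do not click."""
    return (h_in + 2 * pad - h_f) / stride + 1

print(conv_output_size(6, 3, stride=2, pad=0))   # 2.5 -> does not click
print(conv_output_size(6, 3, stride=1, pad=1))   # 6.0 -> output size preserved
```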

Nonlinearities. If we only had $N$ linear layers, we could replace them all with a single layer, since $W_1 \cdot W_2 \cdots W_N = W$. Nonlinearities allow for deeper networks. Any nonlinear function can work, although some are preferable to others.

Sigmoid
Activation function: $a = \sigma(x) = \frac{1}{1 + e^{-x}}$
Gradient w.r.t. the input: $\frac{\partial a}{\partial x} = \sigma(x)(1 - \sigma(x))$
When the input $x_1^{in} \approx 0$ [normalized inputs], the output $x_2^{in} = x_1^{out} \approx 0.5$: this introduces a bias to the hidden-layer neurons and makes the module unfriendly to recursive stacking. Gradients are always < 1 and flat in the tails, so updates stay small and deep networks have problems.

Tanh
Activation function: $a = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
Gradient with respect to the input: $\frac{\partial a}{\partial x} = 1 - \tanh^2(x)$
Similar to the sigmoid, but with a different output range: $[-1, +1]$ instead of $[0, +1]$. Stronger gradients, because the data in the subsequent module are centered around 0 (not 0.5). Less bias to the hidden-layer neurons: outputs can be positive or negative and are likely to have zero mean.

ReLU
Activation function: $a = h(x) = \max(0, x)$
Very popular in computer vision and speech recognition.
Gradient w.r.t. the input: $\frac{\partial a}{\partial x} = \begin{cases} 0, & \text{if } x \le 0 \\ 1, & \text{if } x > 0 \end{cases}$

ReLU
Pros: much faster computations and gradients; sparse activations; no vanishing/exploding gradients (no saturation); people claim biological plausibility :/
Cons: non-symmetric; non-differentiable at 0; a large gradient during training can cause a neuron to "die". Lower learning rates mitigate the problem.
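
A compact NumPy sketch of the three activation functions above and the gradients quoted on the slides:

```python
import numpy as np

def sigmoid(x):       return 1.0 / (1.0 + np.exp(-x))
def sigmoid_grad(x):  return sigmoid(x) * (1.0 - sigmoid(x))      # always < 1, flat in the tails

def tanh(x):          return np.tanh(x)
def tanh_grad(x):     return 1.0 - np.tanh(x) ** 2                # zero-centered outputs, stronger gradients

def relu(x):          return np.maximum(0.0, x)
def relu_grad(x):     return (x > 0).astype(float)                # 0 for x <= 0, 1 for x > 0

x = np.linspace(-5, 5, 5)
print(sigmoid(x), sigmoid_grad(x))
print(tanh(x), tanh_grad(x))
print(relu(x), relu_grad(x))
```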

Softmax
Activation function: $a^{(k)} = \mathrm{softmax}(x^{(k)}) = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$
Outputs a probability distribution: $\sum_{k=1}^{K} a^{(k)} = 1$ for $K$ classes. Typically used as the prediction layer.
Because $e^{a+b} = e^a e^b$, we usually compute $a^{(k)} = \frac{e^{x^{(k)} - \mu}}{\sum_j e^{x^{(j)} - \mu}}$ with $\mu = \max_k x^{(k)}$, since $\frac{e^{x^{(k)} - \mu}}{\sum_j e^{x^{(j)} - \mu}} = \frac{e^{-\mu} e^{x^{(k)}}}{e^{-\mu} \sum_j e^{x^{(j)}}} = \frac{e^{x^{(k)}}}{\sum_j e^{x^{(j)}}}$.
Avoiding exponentiating large numbers → better numerical stability.
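
A minimal NumPy sketch of the max-subtraction trick: the naive and the stable version are mathematically identical, but only the stable one survives large logits (the example values are made up):

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)                        # overflows for large x
    return e / e.sum()

def softmax_stable(x):
    mu = x.max()                         # mu = max_k x^(k)
    e = np.exp(x - mu)                   # shifting by mu cancels out in the ratio
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_stable(x))                 # [0.09003057 0.24472847 0.66524096]
# softmax_naive(x) would return nan: exp(1000) overflows to inf
```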

Euclidean Loss
Activation function: $a(x) = 0.5\,\|y - x\|^2$
Mostly used to measure the loss in regression tasks.
Gradient with respect to the input: $\frac{\partial a}{\partial x} = x - y$

Cross-entropy loss
Activation function: $a(x) = -\sum_{k=1}^{K} y^{(k)} \log x^{(k)}$, with $y^{(k)} \in \{0, 1\}$
Gradient with respect to the input: $\frac{\partial a}{\partial x^{(k)}} = -\frac{y^{(k)}}{x^{(k)}}$
The cross-entropy is the most popular classification loss for classifiers that output probabilities (not the SVM). It couples well with the softmax/sigmoid module; often the modules are combined and the joint gradient is computed. It generalizes logistic regression to more than 2 outputs.
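
A sketch of the combined module mentioned above: when softmax and cross-entropy are fused, the joint gradient with respect to the logits simplifies to $p - y$, which avoids dividing by small probabilities (a standard result; the logits below are made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_xent(z, y):
    """Fused softmax + cross-entropy: returns the loss and its gradient w.r.t. the logits z."""
    p = softmax(z)
    loss = -np.sum(y * np.log(p))
    grad_z = p - y                        # joint gradient of the combined module
    return loss, grad_z

z = np.array([2.0, -1.0, 0.5])            # logits (illustrative)
y = np.array([0.0, 0.0, 1.0])             # one-hot target
print(softmax_xent(z, y))
```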

Case studies
Alexnet, or the modern version of it, VGGnet
ResNet: from 14 to 1000 layers
Google Inception: networks as Directed Acyclic Graphs (DAGs)

Alexnet

Architectural details

Removing layer 7

Removing layer 6, 7

Removing layer 3, 4

Removing layer 3, 4, 6, 7

Quiz: Translation invariance? Credit: R. Fergus slides in Deep Learning Summer School 2016

Translation invariance Credit: R. Fergus slides in Deep Learning Summer School 2016

Quiz: Scale invariance? Credit: R. Fergus slides in Deep Learning Summer School 2016

Scale invariance Credit: R. Fergus slides in Deep Learning Summer School 2016

Quiz: Rotation invariance? Credit: R. Fergus slides in Deep Learning Summer School 2016

Rotation invariance Credit: R. Fergus slides in Deep Learning Summer School 2016

More Depth? VGGnet

ResNet

Hypothesis
Hypothesis: Is it possible to have a very deep network that is at least as accurate as a moderately deep network?
Thought experiment: Assume two Convnets A and B. They are almost identical: B is the same as A, but with extra "identity" layers. Since identity layers pass the information unchanged, the errors of the two networks should be similar. Thus, there is a Convnet B which is at least as good as Convnet A w.r.t. the training error.

Testing the hypothesis. Adding identity layers increases the training error: $\mathcal{L}_A^{train} < \mathcal{L}_B^{train}$!! Training error, not testing error. Not all networks are equally easy to optimize. The performance degradation is not caused by overfitting; the task is just harder.

Quiz: What looks weird?

Testing the hypothesis. Adding identity layers increases the training error: $\mathcal{L}_A^{train} < \mathcal{L}_B^{train}$!! Training error, not testing error. The performance degradation is not caused by overfitting; the optimization task is just harder. Assuming the optimizers are doing their job fine, it appears that not all networks are equally easy to optimize.

ResNet: Main idea. The layer models the residual $F(x) = H(x) - x$ instead of $H(x)$ itself. If anything, the optimizer can simply set the weights to 0; this assumes that the identity mapping is indeed the optimal one. Adding identity layers should then lead to larger networks that have at least as low a training error. Shortcut/skip connections.
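
A minimal sketch of the idea as a small fully connected residual block in NumPy (real ResNet blocks use convolutions and batch normalization; the class name and sizes here are made up): the block computes a residual $F(x)$ and adds the input back through the skip connection, so with zero weights it reduces exactly to the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

class ResidualBlock:
    """H(x) = x + F(x); with W1 = W2 = 0 the block is exactly the identity mapping."""
    def __init__(self, dim, scale=0.0):
        self.W1 = scale * rng.standard_normal((dim, dim))
        self.W2 = scale * rng.standard_normal((dim, dim))
    def forward(self, x):
        F = self.W2 @ np.maximum(0.0, self.W1 @ x)    # the residual branch F(x)
        return x + F                                  # skip connection adds the input back

x = rng.standard_normal(4)
identity_block = ResidualBlock(4, scale=0.0)          # weights set to 0 -> pure identity
print(np.allclose(identity_block.forward(x), x))      # True
```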

Smooth propagation
Additive relation between $x_l$ and $x_L$: $x_{l+1} = x_l + F(x_l)$, $x_{l+2} = x_{l+1} + F(x_{l+1})$, ..., so $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
Traditional NNs have a multiplicative relation: $x_L = \prod_{i=l}^{L-1} W_i \, x_l$
Smooth backprop: $\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \frac{\partial x_L}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$
The gradient closest to the output, $\frac{\partial \mathcal{L}}{\partial x_L}$, is always present in the gradients of the earlier layers.

But why the residual? Not a very persuasive answer yet. It conditions the solution around the identity → from there on the optimizer finds some solution in the neighborhood? It reminds the layer output of the layer input, so the intermediate function must learn something different?

But why the residual? My best shot: if you take the Discrete Fourier Transform, $x_l = \frac{1}{L} \sum_{k=0}^{L-1} X_k \exp(i 2\pi k l / L) = \frac{X_0}{L} + \frac{1}{L} \sum_{k=1}^{L-1} X_k \exp(i 2\pi k l / L)$, where $X_0 \sim E_l[x_l]$, so $x_{l+1} - E[x_l] = \frac{1}{L} \sum_{k=1}^{L-1} X_k \exp(i 2\pi k l / L)$. ResNets are basically like high-pass filters that remove the DC bias and model only the AC components of the signal. An electrical circuit has limited capacity → removing the DC bias avoids clipping; a NN has limited modelling capacity → avoid clipping the corners of the feature space. ResNet: start from 1, not 0.

No degradation anymore Without residual connections deeper networks are untrainable

ResNet vs Highway Nets
ResNet: $y = H(x) - x$
ResNet ⊆ Highway Nets; ResNet ≡ a Highway Net with $T(x) \sim \mathrm{Binomial}$ and $E[T(x)] = 0.5$
ResNet is data independent: a curse or a blessing, depending on the point of view, but definitely simpler.

ResNet breaks records. Ridiculously low error on ImageNet. ResNets with up to 1000 layers have been trained; the previous deepest networks had ~30-40 layers, on simple datasets.

Google Inception V1 Instead of having convolutions (e.g. 3×3) directly, first reduce features by 1×1 convolutions E.g., assume we have 256 features in the previous layer Convolve with 256×64×1×1 Then convolve with 64×64×3×3 Then convolve with 64×256×1×1 Credit: https://culurciello.github.io/tech/2016/06/04/nets.html
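
A quick back-of-the-envelope sketch of why the 1×1 bottleneck helps, counting bias-free weights for the 256-feature example above versus a direct 3×3 convolution on 256 channels (illustrative arithmetic, not exact Inception bookkeeping):

```python
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k           # weights of a k x k convolution, ignoring biases

direct = conv_params(256, 256, 3)                     # 3x3 straight on 256 channels
bottleneck = (conv_params(256, 64, 1)                 # 256 -> 64 with 1x1
              + conv_params(64, 64, 3)                # 3x3 on the reduced features
              + conv_params(64, 256, 1))              # 64 -> 256 with 1x1
print(direct, bottleneck)                 # 589824 vs 69632, roughly an 8.5x reduction
```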

Google Inception V2-V3. Decompose square filters into flattened (asymmetric) convolutions: fewer operations, $O(n)$ vs $O(n^2)$. Use 3×3 filters, as a 5×5 filter can be written as a composition of 3×3 filters; inspired by VGGNet. Balance depth and width. Credit: https://culurciello.github.io/tech/2016/06/04/nets.html

Google Inception V4 Similar to Inception V3 plus shortcut connections Inspired by ResNet Credit: https://culurciello.github.io/tech/2016/06/04/nets.html

State-of-the-art Credit: https://culurciello.github.io/tech/2016/06/04/nets.html

Take-away message. Deep nets are highly accurate, but also highly capricious. Good normalization and regularization are needed. More is not always better, but usually helps. Be careful with the learning rate, the number of layers, and the dataset size. Very deep architectures improve accuracy significantly.

Thank you!