
1 Machine Learning – Neural Networks David Fenyő
Contact:

2 Example: Skin Cancer Diagnosis
2 Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks”, Nature. 2017

3 Example: Histopathological Diagnosis
3 Litjens et al., “Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis”, Scientific Reports 2016

4 Example: The Digital Mammography DREAM Challenge
4 The Digital Mammography DREAM Challenge will attempt to improve the predictive accuracy of digital mammography for the early detection of breast cancer. The primary benefit of this Challenge will be to establish new quantitative tools - machine learning, deep learning or other - that can help decrease the recall rate of screening mammography, with a potential impact on shifting the balance of routine breast cancer screening towards more benefit and less harm. Participating teams will be asked to submit predictive models based on over 640,000 de-identified digital mammography images from over subjects, with corresponding clinical variables.

5 Architecture
A single neuron computes $a = f\big(\sum_i w_i x_i + b\big)$, where $x_1, \dots, x_n$ are the inputs, $w_1, \dots, w_n$ are the weights, $b$ is a bias, and $f$ is the activation function. [Figure: a network with input, hidden, and output layers]
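As a complement, a minimal NumPy sketch of this single-neuron formula (the sigmoid activation and the example numbers are my own choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Single neuron: a = f(sum_i w_i x_i + b)."""
    z = np.dot(w, x) + b
    return f(z)

# Example: three inputs with arbitrary weights and bias.
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron(x, w, b=0.05))
```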

6 Activation Functions
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Hyperbolic tangent: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$

7 Activation Function Derivatives
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, with $\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$, with $\frac{d\,\mathrm{ReLU}(z)}{dz} = 0$ if $z < 0$ and $1$ if $z > 0$
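A sketch of these activations and their derivatives in NumPy (one small assumption of mine: the ReLU derivative at exactly z = 0 is set to 0, a common convention the slide does not specify):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma(z) * (1 - sigma(z))

def relu(z):
    return np.maximum(0.0, z)

def drelu(z):
    return (z > 0).astype(float)  # 0 for z < 0, 1 for z > 0 (0 at z == 0 by convention)

z = np.linspace(-5, 5, 11)
print(sigmoid(z), dsigmoid(z), relu(z), drelu(z), sep="\n")
# Check the identity tanh(z) = 2*sigma(2z) - 1 from the previous slide:
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))
```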

8 Faster Learning with ReLU
8 A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012 (pp ).

9 Activation Functions for Output Layer
Binary classification – sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Multi-class classification – softmax: $\mathrm{softmax}(\boldsymbol{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$
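A minimal softmax sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick that the slide does not mention:

```python
import numpy as np

def softmax(z):
    """softmax(z)_k = exp(z_k) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)          # shifting z does not change the result, avoids overflow
    e = np.exp(z)
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # class probabilities summing to 1
```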

10 Gradient Descent – Sigmoid Activation Function
$\operatorname*{argmin}_{\boldsymbol{w}} L(\boldsymbol{w})$, with updates $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta\,\boldsymbol{\nabla} L(\boldsymbol{w}_n)$.
With a sigmoid output there is an analytical expression for the derivative of the likelihood (cross-entropy) loss:
$L = -\frac{1}{n}\sum_j \big( y_j \log \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) + (1 - y_j)\log(1 - \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})) \big)$
$\boldsymbol{\nabla}_{\boldsymbol{w}} L = -\frac{1}{n}\sum_j \Big( \frac{y_j}{\sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})} - \frac{1 - y_j}{1 - \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})} \Big)\, \boldsymbol{\nabla}_{\boldsymbol{w}}\, \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) = \frac{1}{n}\sum_j \boldsymbol{x}_j \big( \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) - y_j \big)$
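A sketch of gradient descent with this closed-form gradient (the synthetic data, learning rate, and number of steps are my own illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, steps=1000):
    """Minimize the cross-entropy loss with w_{n+1} = w_n - eta * grad L(w_n),
    where grad_w L = (1/n) * sum_j x_j (sigma(x_j . w) - y_j)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / n
        w -= eta * grad
    return w

# Toy example: two Gaussian blobs, with a constant column appended for the bias.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
X = np.hstack([X, np.ones((100, 1))])
y = np.array([0] * 50 + [1] * 50)
w = fit_logistic(X, y)
print("accuracy:", np.mean((sigmoid(X @ w) > 0.5) == y))
```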

11 Gradient Descent – Momentum and Friction
Plain gradient descent: $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta\,\boldsymbol{\nabla} L(\boldsymbol{w}_n)$
Momentum (partially remembering previous gradients): $\boldsymbol{v}_n = \gamma\,\boldsymbol{v}_{n-1} + \eta\,\boldsymbol{\nabla} L(\boldsymbol{w}_n)$, $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \boldsymbol{v}_n$
Nesterov accelerated gradient: $\boldsymbol{v}_n = \gamma\,\boldsymbol{v}_{n-1} + \eta\,\boldsymbol{\nabla} L(\boldsymbol{w}_n - \gamma\,\boldsymbol{v}_{n-1})$, $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \boldsymbol{v}_n$
Adagrad: decreases the learning rate monotonically based on the sum of squared past gradients.
Adadelta & RMSprop: extensions of Adagrad that slowly forget past gradients.
Adaptive Moment Estimation (Adam): adaptive learning rates combined with momentum.
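A sketch of the momentum and Nesterov updates applied to a generic gradient function (the quadratic test loss and the values of $\gamma$ and $\eta$ are my own choices):

```python
import numpy as np

def momentum_step(w, v, grad, eta=0.1, gamma=0.9):
    """v_n = gamma*v_{n-1} + eta*grad L(w_n);  w_{n+1} = w_n - v_n."""
    v = gamma * v + eta * grad(w)
    return w - v, v

def nesterov_step(w, v, grad, eta=0.1, gamma=0.9):
    """v_n = gamma*v_{n-1} + eta*grad L(w_n - gamma*v_{n-1});  w_{n+1} = w_n - v_n."""
    v = gamma * v + eta * grad(w - gamma * v)
    return w - v, v

# Toy loss L(w) = 0.5*||w||^2, so grad L(w) = w; the minimum is at the origin.
grad = lambda w: w
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(50):
    w, v = nesterov_step(w, v, grad)
print(w)   # close to zero
```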

12 Mini-Batch Gradient Descent
Mini-batch gradient descent: Uses a subset of the training set to calculate the gradient for each step.
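A sketch of the mini-batch loop, assuming a `grad(w, X, y)` function that returns the gradient on a batch (the shuffling scheme and batch size are my own choices):

```python
import numpy as np

def minibatch_gd(w, X, y, grad, eta=0.1, batch_size=32, epochs=10, seed=0):
    """Each update uses only a subset (mini-batch) of the training set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one mini-batch of indices
            w = w - eta * grad(w, X[idx], y[idx])
    return w

# e.g. grad(w, X, y) could return the logistic-regression gradient from the earlier sketch.
```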

13 Backpropagation
A neuron computes $a = f(z) = f\big(\sum_i w_i x_i + b\big)$. Given the gradient $\frac{\partial L}{\partial a}$ arriving from the layer above, the chain rule gives $\frac{\partial L}{\partial w_i} = \frac{\partial a}{\partial w_i}\,\frac{\partial L}{\partial a}$.

14 Backpropagation: adding a constant
$a = y + c$: $\frac{\partial a}{\partial y} = \frac{\partial(y + c)}{\partial y} = \frac{\partial y}{\partial y} + \frac{\partial c}{\partial y} = 1$, so $\frac{\partial L}{\partial y} = \frac{\partial a}{\partial y}\,\frac{\partial L}{\partial a} = \frac{\partial L}{\partial a}$.

15 Backpropagation: the + node
$a = y_1 + y_2$: $\frac{\partial a}{\partial y_i} = 1$, so $\frac{\partial L}{\partial y_i} = \frac{\partial L}{\partial a}$ for both inputs.

16 Backpropagation: multiplying by a constant
$a = c\,y$: $\frac{\partial a}{\partial y} = c$, so $\frac{\partial L}{\partial y} = c\,\frac{\partial L}{\partial a}$.

17 Backpropagation: a weighted sum
Computation graph: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$.
Local derivatives: $\frac{\partial a_{11}}{\partial w_1} = x_1$, $\frac{\partial a_{12}}{\partial w_2} = x_2$, $\frac{\partial a_{21}}{\partial a_{11}} = \frac{\partial a_{21}}{\partial a_{12}} = 1$.
Chain rule: $\frac{\partial L}{\partial w_1} = \frac{\partial a_{11}}{\partial w_1}\frac{\partial a_{21}}{\partial a_{11}}\frac{\partial L}{\partial a_{21}} = x_1 (1)\frac{\partial L}{\partial a_{21}}$ and $\frac{\partial L}{\partial w_2} = \frac{\partial a_{12}}{\partial w_2}\frac{\partial a_{21}}{\partial a_{12}}\frac{\partial L}{\partial a_{21}} = x_2 (1)\frac{\partial L}{\partial a_{21}}$.

18 Backpropagation: numeric example
With $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$, the forward pass gives $a_{11} = 3$, $a_{12} = -8$, $a_{21} = -5$. In the backward pass the + node passes the upstream gradient $\frac{\partial L}{\partial a}$ through unchanged, and each * node multiplies it by the other factor: $\frac{\partial L}{\partial w_1} = 3\,\frac{\partial L}{\partial a}$, $\frac{\partial L}{\partial x_1} = 1\cdot\frac{\partial L}{\partial a}$, $\frac{\partial L}{\partial w_2} = -4\,\frac{\partial L}{\partial a}$, $\frac{\partial L}{\partial x_2} = 2\,\frac{\partial L}{\partial a}$.

19 Backpropagation: the max node
$a = \max(y_1, y_2)$: $\frac{\partial a}{\partial y_1} = 1$, $\frac{\partial a}{\partial y_2} = 0$ if $y_1 > y_2$; $\frac{\partial a}{\partial y_1} = 0$, $\frac{\partial a}{\partial y_2} = 1$ if $y_1 < y_2$. The gradient is routed only to the larger input.

20 Backpropagation: the 1/y node
$a = 1/y$: $\frac{\partial a}{\partial y} = -\frac{1}{y^2}$, so $\frac{\partial L}{\partial y} = -\frac{1}{y^2}\,\frac{\partial L}{\partial a}$.

21 Backpropagation: the exp node
$a = e^{y}$: $\frac{\partial a}{\partial y} = e^{y}$, so $\frac{\partial L}{\partial y} = e^{y}\,\frac{\partial L}{\partial a}$.

22 Backpropagation: a sigmoid built from elementary nodes
Computation graph: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$, $a_{31} = -a_{21}$, $a_{41} = e^{a_{31}}$, $a_{51} = 1 + a_{41}$, $a_{61} = 1/a_{51}$, i.e. $a_{61} = \sigma(w_1 x_1 + w_2 x_2)$.
Local derivatives: $\frac{\partial a_{11}}{\partial w_1} = x_1$, $\frac{\partial a_{12}}{\partial w_2} = x_2$, $\frac{\partial a_{21}}{\partial a_{11}} = \frac{\partial a_{21}}{\partial a_{12}} = 1$, $\frac{\partial a_{31}}{\partial a_{21}} = -1$, $\frac{\partial a_{41}}{\partial a_{31}} = e^{a_{31}}$, $\frac{\partial a_{51}}{\partial a_{41}} = 1$, $\frac{\partial a_{61}}{\partial a_{51}} = -\frac{1}{a_{51}^2}$.
Chain rule: $\frac{\partial L}{\partial w_1} = \frac{\partial a_{11}}{\partial w_1}\frac{\partial a_{21}}{\partial a_{11}}\frac{\partial a_{31}}{\partial a_{21}}\frac{\partial a_{41}}{\partial a_{31}}\frac{\partial a_{51}}{\partial a_{41}}\frac{\partial a_{61}}{\partial a_{51}}\frac{\partial L}{\partial a_{61}} = x_1 (1)(-1)\,e^{a_{31}}\,(1)\big(-\frac{1}{a_{51}^2}\big)\frac{\partial L}{\partial a_{61}}$, and similarly $\frac{\partial L}{\partial w_2} = x_2 (1)(-1)\,e^{a_{31}}\,(1)\big(-\frac{1}{a_{51}^2}\big)\frac{\partial L}{\partial a_{61}}$.

23 Backpropagation: numeric example
Forward pass with $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$: $a_{11} = 3$, $a_{12} = -8$, $a_{21} = -5$, $a_{31} = 5$, $a_{41} = e^{5} = 148.4$, $a_{51} = 149.4$, $a_{61} = 1/149.4 = 0.0067$.
Backward pass (taking $\frac{\partial L}{\partial a_{61}} = 1$): $\frac{\partial L}{\partial a_{51}} = -\frac{1}{149.4^2} = -4.5\times 10^{-5}$, $\frac{\partial L}{\partial a_{41}} = -4.5\times 10^{-5}$, $\frac{\partial L}{\partial a_{31}} = -6.6\times 10^{-3}$, $\frac{\partial L}{\partial a_{21}} = \frac{\partial L}{\partial a_{11}} = \frac{\partial L}{\partial a_{12}} = 6.6\times 10^{-3}$, so $\frac{\partial L}{\partial w_1} = 3 \times 6.6\times 10^{-3} = 0.02$ and $\frac{\partial L}{\partial w_2} = -4 \times 6.6\times 10^{-3} = -0.027$.
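The same forward and backward pass written out in Python; with the upstream gradient taken as 1, the printed values reproduce the numbers on the slide:

```python
import numpy as np

x1, w1, x2, w2 = 3.0, 1.0, -4.0, 2.0

# Forward pass through the elementary nodes.
a11 = w1 * x1            #  3
a12 = w2 * x2            # -8
a21 = a11 + a12          # -5
a31 = -a21               #  5
a41 = np.exp(a31)        #  148.4
a51 = 1.0 + a41          #  149.4
a61 = 1.0 / a51          #  0.0067  (= sigma(-5))

# Backward pass: multiply the local derivatives along the chain.
dL_da61 = 1.0
dL_da51 = (-1.0 / a51**2) * dL_da61     # -4.5e-5
dL_da41 = 1.0 * dL_da51                 # -4.5e-5
dL_da31 = np.exp(a31) * dL_da41         # -6.6e-3
dL_da21 = -1.0 * dL_da31                #  6.6e-3
dL_da11 = dL_da12 = dL_da21             #  6.6e-3 (the + node copies the gradient)
dL_dw1 = x1 * dL_da11                   #  0.02
dL_dw2 = x2 * dL_da12                   # -0.027
print(dL_dw1, dL_dw2)

# The collapsed sigmoid node of the next slides gives the same result,
# using sigma'(z) = sigma(z) * (1 - sigma(z)).
sig = a61
print(x1 * sig * (1 - sig), x2 * sig * (1 - sig))
```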

24 Backpropagation
[Figure: the same computation graph, with the chain of −, exp, 1+, and 1/x nodes grouped into a single σ node]

25 Backpropagation: the sigmoid as a single node
Computation graph: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$, $a_{31} = \sigma(a_{21})$.
Local derivatives: $\frac{\partial a_{11}}{\partial w_1} = x_1$, $\frac{\partial a_{12}}{\partial w_2} = x_2$, $\frac{\partial a_{21}}{\partial a_{11}} = \frac{\partial a_{21}}{\partial a_{12}} = 1$, $\frac{\partial a_{31}}{\partial a_{21}} = \sigma(a_{21})\,(1 - \sigma(a_{21}))$.

26 Backpropagation: the sigmoid as a single node, numeric example
Forward pass with $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$: $a_{21} = -5$, $a_{31} = \sigma(-5) = 0.0067$. Backward pass (with $\frac{\partial L}{\partial a_{31}} = 1$): $\frac{\partial L}{\partial a_{21}} = \sigma(-5)(1 - \sigma(-5)) = 6.6\times 10^{-3}$, so $\frac{\partial L}{\partial w_1} = 3 \times 6.6\times 10^{-3} = 0.02$ and $\frac{\partial L}{\partial w_2} = -4 \times 6.6\times 10^{-3} = -0.027$, the same result as with the elementary nodes.

27 Backpropagation
When a value feeds more than one downstream node, the gradients arriving from the different branches are added: the total gradient is $\frac{\partial L}{\partial a_1} + \frac{\partial L}{\partial a_2}$.

28 Backpropagation: notation
[Figure: a three-layer network illustrating the weight $w^{2}_{74}$, bias $b^{2}_{7}$, and activation $a^{2}_{7}$]
Activation of neuron $j$ in layer $l$: $a^{l}_{j} = f\big(\sum_k w^{l}_{jk} a^{l-1}_{k} + b^{l}_{j}\big) = f(z^{l}_{j})$, or in matrix form $\boldsymbol{a}^{l} = f(\boldsymbol{w}^{l}\boldsymbol{a}^{l-1} + \boldsymbol{b}^{l}) = f(\boldsymbol{z}^{l})$.
Error: $\delta^{l}_{j} = \frac{\partial L}{\partial z^{l}_{j}}$, i.e. $\boldsymbol{\delta}^{l} = \boldsymbol{\nabla}_{\boldsymbol{z}} L$.

29 Backpropagation: error for the output layer (l = N)
$\delta^{N}_{j} = \frac{\partial L}{\partial z^{N}_{j}}$. By the chain rule, $\delta^{N}_{j} = \sum_k \frac{\partial L}{\partial a^{N}_{k}}\frac{\partial a^{N}_{k}}{\partial z^{N}_{j}}$. Since $\frac{\partial a^{N}_{k}}{\partial z^{N}_{j}} = 0$ for $k \neq j$ and $a^{N}_{j} = f(z^{N}_{j})$, this reduces to $\delta^{N}_{j} = \frac{\partial L}{\partial a^{N}_{j}}\frac{\partial f(z^{N}_{j})}{\partial z^{N}_{j}} = \frac{\partial L}{\partial a^{N}_{j}} f'(z^{N}_{j})$, or in vector form $\boldsymbol{\delta}^{N} = \boldsymbol{\nabla}_{\boldsymbol{a}} L \odot f'(\boldsymbol{z}^{N})$, where $\odot$ denotes element-wise multiplication of vectors.

30 Backpropagation: error as a function of the error in the next layer
$\delta^{l}_{j} = \frac{\partial L}{\partial z^{l}_{j}} = \sum_k \frac{\partial L}{\partial z^{l+1}_{k}}\frac{\partial z^{l+1}_{k}}{\partial z^{l}_{j}} = \sum_k \delta^{l+1}_{k}\frac{\partial z^{l+1}_{k}}{\partial z^{l}_{j}}$, where $\delta^{l+1}_{k} = \frac{\partial L}{\partial z^{l+1}_{k}}$. With $z^{l+1}_{k} = \sum_i w^{l+1}_{ki} a^{l}_{i} + b^{l+1}_{k} = \sum_i w^{l+1}_{ki} f(z^{l}_{i}) + b^{l+1}_{k}$, this gives $\delta^{l}_{j} = \sum_k \delta^{l+1}_{k} w^{l+1}_{kj} f'(z^{l}_{j})$, or in vector form $\boldsymbol{\delta}^{l} = \big((\boldsymbol{w}^{l+1})^{T}\boldsymbol{\delta}^{l+1}\big) \odot f'(\boldsymbol{z}^{l})$.

31 Backpropagation: gradient with respect to the biases
$\frac{\partial L}{\partial b^{l}_{j}} = \sum_k \frac{\partial L}{\partial z^{l}_{k}}\frac{\partial z^{l}_{k}}{\partial b^{l}_{j}} = \sum_k \delta^{l}_{k}\frac{\partial z^{l}_{k}}{\partial b^{l}_{j}}$. Since $z^{l}_{k} = \sum_i w^{l}_{ki} a^{l-1}_{i} + b^{l}_{k}$, only the term with $k = j$ contributes, so $\frac{\partial L}{\partial b^{l}_{j}} = \delta^{l}_{j}$, i.e. $\boldsymbol{\nabla}_{\boldsymbol{b}} L = \boldsymbol{\delta}^{l}$.

32 Backpropagation: gradient with respect to the weights
$\frac{\partial L}{\partial w^{l}_{jk}} = \sum_i \frac{\partial L}{\partial z^{l}_{i}}\frac{\partial z^{l}_{i}}{\partial w^{l}_{jk}} = \sum_i \delta^{l}_{i}\frac{\partial z^{l}_{i}}{\partial w^{l}_{jk}}$. Since $z^{l}_{i} = \sum_m w^{l}_{im} a^{l-1}_{m} + b^{l}_{i}$, only the term with $i = j$, $m = k$ contributes, so $\frac{\partial L}{\partial w^{l}_{jk}} = a^{l-1}_{k}\delta^{l}_{j}$, i.e. $\boldsymbol{\nabla}_{\boldsymbol{w}} L = \boldsymbol{\delta}^{l}(\boldsymbol{a}^{l-1})^{T}$.

33 Backpropagation: the algorithm
For each training example $x$ in a mini-batch of size $m$:
Forward pass: $\boldsymbol{a}^{x,l} = f(\boldsymbol{z}^{x,l}) = f(\boldsymbol{w}^{l}\boldsymbol{a}^{x,l-1} + \boldsymbol{b}^{l})$
Output error: $\boldsymbol{\delta}^{x,N} = \boldsymbol{\nabla}_{\boldsymbol{a}} L \odot f'(\boldsymbol{z}^{x,N})$
Error backpropagation: $\boldsymbol{\delta}^{x,l} = \big((\boldsymbol{w}^{l+1})^{T}\boldsymbol{\delta}^{x,l+1}\big) \odot f'(\boldsymbol{z}^{x,l})$
Gradient descent: $\boldsymbol{b}^{l} \to \boldsymbol{b}^{l} - \frac{\eta}{m}\sum_x \boldsymbol{\delta}^{x,l}$, $\boldsymbol{w}^{l} \to \boldsymbol{w}^{l} - \frac{\eta}{m}\sum_x \boldsymbol{\delta}^{x,l}(\boldsymbol{a}^{x,l-1})^{T}$
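A compact NumPy sketch of this procedure for a small fully connected network with sigmoid activations and a quadratic loss; the architecture, the loss, the 1/√(fan-in) weight scaling, and the XOR toy data are my own illustrative choices, while the update equations follow the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

class Network:
    def __init__(self, sizes, seed=0):
        rng = np.random.default_rng(seed)
        # Weights scaled by 1/sqrt(fan-in), one common initialization choice.
        self.W = [rng.normal(0, 1 / np.sqrt(m), (n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
        self.b = [np.zeros((n, 1)) for n in sizes[1:]]

    def forward(self, a):
        """Forward pass; keep the z's and a's for the backward pass."""
        zs, acts = [], [a]
        for W, b in zip(self.W, self.b):
            z = W @ a + b
            a = sigmoid(z)
            zs.append(z); acts.append(a)
        return zs, acts

    def backprop(self, x, y):
        zs, acts = self.forward(x)
        # Output error for L = 0.5*||a - y||^2: dL/da = a - y.
        delta = (acts[-1] - y) * dsigmoid(zs[-1])
        gW = [None] * len(self.W); gb = [None] * len(self.b)
        gW[-1], gb[-1] = delta @ acts[-2].T, delta
        # Propagate backwards: delta^l = (W^{l+1}.T delta^{l+1}) * f'(z^l).
        for l in range(2, len(self.W) + 1):
            delta = (self.W[-l + 1].T @ delta) * dsigmoid(zs[-l])
            gW[-l], gb[-l] = delta @ acts[-l - 1].T, delta
        return gW, gb

    def train_minibatch(self, batch, eta):
        m = len(batch)
        sum_W = [np.zeros_like(W) for W in self.W]
        sum_b = [np.zeros_like(b) for b in self.b]
        for x, y in batch:
            gW, gb = self.backprop(x, y)
            sum_W = [s + g for s, g in zip(sum_W, gW)]
            sum_b = [s + g for s, g in zip(sum_b, gb)]
        self.W = [W - (eta / m) * s for W, s in zip(self.W, sum_W)]
        self.b = [b - (eta / m) * s for b, s in zip(self.b, sum_b)]

# Tiny usage example: learn XOR.
net = Network([2, 4, 1])
data = [(np.array([[a], [b]]), np.array([[a ^ b]])) for a in (0, 1) for b in (0, 1)]
for _ in range(5000):
    net.train_minibatch(data, eta=1.0)
# Outputs should approach 0, 1, 1, 0 (more epochs may be needed depending on the seed).
print([net.forward(x)[1][-1].item() for x, _ in data])
```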

34 Initialization
With inputs $x_i$, $i = 1..N$, normally distributed ($\mu = 0$, $\sigma = 1$) and $N = 100$: if the weights $w_i$ are initialized normally distributed with $\mu = 0$, $\sigma = 1$, then $z = \sum_i w_i x_i$ has a large spread, $\sigma(z)$ saturates near 0 or 1, and $\sigma'(z)$ is close to zero. Initializing instead with $\mu = 0$, $\sigma = 1/N$ keeps $z$ small, so $\sigma(z)$ stays away from saturation and $\sigma'(z)$ stays near its maximum. [Figure: histograms of $z = \sum_i w_i x_i$, $\sigma(z)$, and $\sigma'(z)$ for the two initializations]
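A small sketch reproducing this comparison numerically (N = 100 as on the slide; the number of trials is my own choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
N, trials = 100, 10000
x = rng.normal(0, 1, (trials, N))              # x_i ~ N(0, 1)

for w_sigma in (1.0, 1.0 / N):                 # the two initializations on the slide
    w = rng.normal(0, w_sigma, (trials, N))
    z = np.sum(w * x, axis=1)                  # z = sum_i w_i x_i
    s = sigmoid(z)
    print(f"sigma_w = {w_sigma:<6}: std(z) = {z.std():6.2f}, "
          f"mean sigma'(z) = {np.mean(s * (1 - s)):.3f}")
# With sigma_w = 1 the pre-activations are large, the sigmoid saturates, and
# sigma'(z) is mostly near zero; with sigma_w = 1/N it stays near its maximum of 0.25.
```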

35 Vanishing Gradient
For a chain of sigmoid layers with one unit per layer, $\delta^{l} = \delta^{l+1} w^{l+1}\sigma'(z^{l})$ and $\delta^{N} = \frac{\partial L}{\partial a^{N}}\sigma'(z^{N})$, so $\delta^{l} = \frac{\partial L}{\partial a^{N}}\sigma'(z^{N})\prod_{j=l}^{N-1} w^{j+1}\sigma'(z^{j})$ and $\frac{\delta^{l}}{\delta^{l+k}} = \prod_{j=l}^{l+k-1} w^{j+1}\sigma'(z^{j})$. If $|w^{j+1}| < 1$ and $\sigma'(z^{j}) \le \sigma'(0) = \frac{1}{4}$, then $\big|\frac{\delta^{l}}{\delta^{l+k}}\big| < \big(\frac{1}{4}\big)^{k}$: the gradient in the early layers shrinks exponentially with depth.

36 Dying ReLUs
$\mathrm{ReLU}(z) = \max(0, z)$, with $\frac{d\,\mathrm{ReLU}(z)}{dz} = 0$ if $z < 0$ and $1$ if $z > 0$.
For a ReLU unit $a = \max(0, wx_1 + b)$: if $wx_1 + b > 0$, then $\frac{\partial a}{\partial w} = x_1$ and $\frac{\partial a}{\partial b} = 1$; but if $wx_1 + b < 0$, then $\frac{\partial a}{\partial w} = \frac{\partial a}{\partial b} = 0$. A unit whose pre-activation is negative for all inputs receives no gradient and stops learning.

37 Universal Function Approximation
“…arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity.” Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS) Dec 1;2(4):
[Figure: a sigmoid; the subtraction of two sigmoids with a small shift forms a bump; a target function; approximations of the function with sums of sigmoids]
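A sketch of the construction hinted at in the figure: differences of shifted, steep sigmoids form "bumps", and a weighted sum of such bumps (a single hidden layer of sigmoid units) approximates a target function. The target function, number of bumps, and steepness are my own choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, steep=200.0):
    """Difference of two shifted sigmoids: ~1 on [left, right], ~0 elsewhere."""
    return sigmoid(steep * (x - left)) - sigmoid(steep * (x - right))

def approximate(f, x, n_bumps=50):
    """One hidden layer of 2*n_bumps sigmoid units approximating f on [0, 1]."""
    edges = np.linspace(0, 1, n_bumps + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return sum(f(c) * bump(x, lo, hi) for c, lo, hi in zip(centers, edges[:-1], edges[1:]))

x = np.linspace(0, 1, 1000)
target = lambda t: np.sin(2 * np.pi * t) + 0.5 * t
approx = approximate(target, x)
print("max abs error:", np.max(np.abs(approx - target(x))))
# The error decreases as the number of bumps (and their steepness) grows.
```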

38 Convolutional neural networks (CNNs)
Key ideas: local receptive fields; convolution (which describes the response of a linear, time-invariant system to an input signal, and equals the inverse Fourier transform of the pointwise product in frequency space); shared weights and biases. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE Nov;86(11):
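A minimal sketch of the core operation these ideas describe: a 2-D convolution in which one small filter (the shared weights) is slid over the image, so each output pixel depends only on a local receptive field. The "valid" convolution without padding and the edge-detector filter are my own example choices:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D cross-correlation: each output pixel is a weighted sum over a
    local receptive field, with the same (shared) weights at every position."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.zeros((8, 8))
image[:, 4:] = 1.0                          # a vertical edge
kernel = np.array([[1.0, 0.0, -1.0]] * 3)   # responds to vertical edges
print(conv2d(image, kernel))                # large magnitudes only near the edge
```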

39 Convolutional neural networks (CNNs)
LeNet-5 LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE Nov;86(11):

40 Convolutional neural networks (CNNs)
AlexNet Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012 (pp ).

41 Convolutional neural networks (CNNs)
VGG Net: deep, with small convolution filters.

42 Convolutional neural networks (CNNs)
GoogLeNet: Inception modules (parallel layers with multi-scale processing) increase the depth and width of the network while keeping the computational budget constant. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015 (pp. 1-9).

43 Convolutional neural networks (CNNs)
ResNet Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition, arXiv: [cs.CV], 2015

44 Transfer Learning If your data set is limited in size:
Use a pre-trained network, remove only the last fully connected layer, and train a linear classifier on your own data set; or fine-tune part of the network using backpropagation. Example: “An image of a skin lesion (for example, melanoma) is sequentially warped into a probability distribution over clinical classes of skin disease using Google Inception v3 CNN architecture pretrained on the ImageNet dataset (1.28 million images over 1,000 generic object classes) and fine-tuned on our own dataset of 129,450 skin lesions comprising 2,032 different diseases.” Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks”, Nature. 2017
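This is not the pipeline used by Esteva et al.; as a rough sketch of the two options above, here is one way they might look in TensorFlow/Keras with its bundled Inception v3 ImageNet weights (NUM_CLASSES and the training data are placeholders):

```python
import tensorflow as tf

NUM_CLASSES = 10   # hypothetical: the number of classes in your own data set

# Pre-trained Inception v3 without its final (ImageNet) classification layer.
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(299, 299, 3))
base.trainable = False                       # option 1: train only a new classifier

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)

# Option 2: afterwards, unfreeze the base network and fine-tune it with a
# small learning rate.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```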

45 Data Augmentation
Translations, rotations, reflections, changes in intensity and color of illumination, deformations.
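A sketch of a few of these transformations applied on the fly to an image stored as a NumPy array (the random ranges are my own choices):

```python
import numpy as np

def augment(image, rng):
    """Randomly reflect, rotate (by multiples of 90 degrees), translate, and
    rescale the intensity of an H x W (or H x W x C) image."""
    if rng.random() < 0.5:
        image = np.fliplr(image)                       # reflection
    image = np.rot90(image, k=rng.integers(0, 4))      # rotation
    shift = rng.integers(-3, 4, size=2)
    image = np.roll(image, shift, axis=(0, 1))         # translation (wrap-around)
    image = image * rng.uniform(0.8, 1.2)              # illumination intensity
    return image

rng = np.random.default_rng(0)
batch = [augment(np.ones((28, 28)), rng) for _ in range(8)]
```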

46 Batch Normalization
For each mini-batch $B$, normalize between layers: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$, then scale and shift with learned parameters: $y_i = \gamma \hat{x}_i + \beta$. Provides regularization and faster learning. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv: Feb 11.
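A sketch of this normalization for one mini-batch at training time (the running averages used at inference time, and the backward pass, are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: mini-batch of shape (batch_size, features).
    Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance per feature
    return gamma * x_hat + beta              # learned scale gamma and shift beta

x = np.random.default_rng(0).normal(5.0, 3.0, (32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1
```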

47 Regularization - Dropout
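The slide illustrates dropout graphically; as a complement, a minimal sketch of "inverted" dropout, which randomly silences activations during training and rescales so that nothing changes at test time (the keep probability is my own choice):

```python
import numpy as np

def dropout(a, keep_prob=0.5, training=True, rng=np.random.default_rng()):
    """Randomly zero a fraction (1 - keep_prob) of the activations during
    training; divide by keep_prob so the expected activation is unchanged."""
    if not training:
        return a                              # dropout is disabled at test time
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

a = np.ones((2, 8))
print(dropout(a))                  # roughly half the entries zeroed, the rest scaled to 2.0
print(dropout(a, training=False))  # unchanged
```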

48 Recurrent neural networks (RNNs)
At each time step $t$: $\boldsymbol{h}_t = \tanh(\boldsymbol{W}_{hh}\boldsymbol{h}_{t-1} + \boldsymbol{W}_{xh}\boldsymbol{x}_t)$ and $\boldsymbol{y}_t = \boldsymbol{W}_{hy}\boldsymbol{h}_t$. [Figure: the recurrence unrolled in time, with inputs $x_{t-1}, x_t, x_{t+1}$, hidden states $h_{t-1}, h_t, h_{t+1}$, and outputs $y_{t-1}, y_t, y_{t+1}$]
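A sketch of this recurrence in NumPy (the dimensions and random weights are my own choices):

```python
import numpy as np

def rnn_forward(xs, Whh, Wxh, Why, h0):
    """h_t = tanh(Whh @ h_{t-1} + Wxh @ x_t);  y_t = Why @ h_t."""
    h, ys = h0, []
    for x in xs:                       # xs: sequence of input vectors
        h = np.tanh(Whh @ h + Wxh @ x)
        ys.append(Why @ h)
    return ys, h

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 2, 4
Whh = rng.normal(0, 0.1, (n_hidden, n_hidden))
Wxh = rng.normal(0, 0.1, (n_hidden, n_in))
Why = rng.normal(0, 0.1, (n_out, n_hidden))
xs = [rng.normal(size=n_in) for _ in range(T)]
ys, h = rnn_forward(xs, Whh, Wxh, Why, np.zeros(n_hidden))
print(len(ys), ys[0].shape)   # one output vector per time step
```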

49 Recurrent neural networks (RNNs)
Input/output configurations and example tasks: one-to-one (e.g. a CNN image classifier), one-to-many (image captioning), many-to-one (sentiment classification), many-to-many (translation; classification of each frame of a video).
Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks,

50 Long Short-Term Memory (LSTM)
Plain RNN: $\boldsymbol{h}_t = \tanh(\boldsymbol{W}_h[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t])$, $\boldsymbol{y}_t = \boldsymbol{W}_y\boldsymbol{h}_t$
Forget gate: $\boldsymbol{f}_t = \sigma(\boldsymbol{W}_f[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_f)$
Input gate: $\boldsymbol{i}_t = \sigma(\boldsymbol{W}_i[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_i)$
Output gate: $\boldsymbol{o}_t = \sigma(\boldsymbol{W}_o[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_o)$
Cell state update: $\boldsymbol{C}_t = \boldsymbol{f}_t \odot \boldsymbol{C}_{t-1} + \boldsymbol{i}_t \odot \tanh(\boldsymbol{W}_c[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_c)$
Hidden state and output: $\boldsymbol{h}_t = \boldsymbol{o}_t \odot \tanh(\boldsymbol{C}_t)$, $\boldsymbol{y}_t = \boldsymbol{W}_y\boldsymbol{h}_t$
Hochreiter S, Schmidhuber J. Long short-term memory. Neural computation Nov 15;9(8):
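A sketch of one LSTM step following these equations, with $[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t]$ implemented as vector concatenation (the dimensions and random weights are my own choices; the output projection $\boldsymbol{y}_t = \boldsymbol{W}_y \boldsymbol{h}_t$ is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, C_prev, W, b):
    """One LSTM time step. W and b hold the parameters of the forget (f),
    input (i), output (o), and candidate-cell (c) transforms, each acting on
    the concatenation [h_{t-1}, x_t]."""
    hx = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ hx + b["f"])                    # forget gate
    i = sigmoid(W["i"] @ hx + b["i"])                    # input gate
    o = sigmoid(W["o"] @ hx + b["o"])                    # output gate
    C = f * C_prev + i * np.tanh(W["c"] @ hx + b["c"])   # cell state update
    h = o * np.tanh(C)                                   # new hidden state
    return h, C

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W = {k: rng.normal(0, 0.1, (n_hidden, n_hidden + n_in)) for k in "fioc"}
b = {k: np.zeros(n_hidden) for k in "fioc"}
h, C = np.zeros(n_hidden), np.zeros(n_hidden)
for t in range(4):
    h, C = lstm_step(rng.normal(size=n_in), h, C, W, b)
print(h.shape, C.shape)
```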

51 Image Captioning – Combining CNNs and RNNs
51 Karpathy, Andrej & Fei-Fei, Li, "Deep visual-semantic alignments for generating image descriptions", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)

52 Autoencoders – Unsupervised Learning
Autoencoders learn a lower dimensionality representation where output ≈ input 52 Lower Dimensional Representation Input Output
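A minimal sketch of a fully connected autoencoder in Keras; the layer sizes, the 32-dimensional bottleneck, and the MSE loss are my own choices:

```python
import tensorflow as tf

INPUT_DIM, CODE_DIM = 784, 32    # e.g. flattened 28x28 images -> 32-dimensional code

autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(INPUT_DIM,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(CODE_DIM, activation="relu"),    # lower-dimensional representation
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(INPUT_DIM, activation="sigmoid"),
])
autoencoder.compile(optimizer="adam", loss="mse")
# Unsupervised: the input is also the target, so output ~ input.
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)
```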

53 Generative Adversarial Networks
53 Nguyen et al., “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space”,

54 Deep Dream
Google DeepDream applied to Hieronymus Bosch's The Garden of Earthly Delights. [Figure: the original painting and the DeepDream rendering]

55 Artistic Style
L.A. Gatys, A.S. Ecker, M. Bethge, “A Neural Algorithm of Artistic Style”,

56 Adversarial Fooling Examples
[Figure: an original, correctly classified image; an imperceptible perturbation; and the perturbed image, classified as an ostrich]
Szegedy et al., “Intriguing properties of neural networks”,


