Machine Learning – Neural Networks
David Fenyő
Contact:
Example: Skin Cancer Diagnosis
Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature, 2017.
Logistic Regression
A single unit computes a weighted sum of the inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, adds a bias $b$, and applies the sigmoid:
$$f\left(\sum_i w_i x_i + b\right), \qquad f(z) = \frac{1}{1+e^{-z}}$$
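As a minimal illustration of this unit, the sigmoid of a weighted sum can be computed directly in NumPy; the weights, bias, and inputs below are arbitrary example values, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through the sigmoid."""
    return sigmoid(np.dot(w, x) + b)

# Example values (arbitrary, for illustration only)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(logistic_unit(x, w, b))  # a number between 0 and 1
```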
Architecture
A feedforward network arranges such units in layers: an input layer, one or more hidden layers, and an output layer. Each unit computes $f\left(\sum_i w_i x_i + b\right)$ on the outputs of the previous layer.
Activation Functions
Sigmoid: $\sigma(z) = \dfrac{1}{1+e^{-z}}$
Hyperbolic tangent: $\tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
Activation Function Derivatives
Sigmoid: $\sigma(z) = \dfrac{1}{1+e^{-z}}$, $\quad \dfrac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$, $\quad \dfrac{d\,\mathrm{ReLU}(z)}{dz} = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \end{cases}$
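These activations and derivatives can be written as small NumPy functions for checking values; a sketch, not a library implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # sigma(z) * (1 - sigma(z))

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)   # 0 for z < 0, 1 for z > 0

z = np.linspace(-5, 5, 11)
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True: tanh(z) = 2*sigma(2z) - 1
```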
Faster Learning with ReLU
A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line).
Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.
Activation Functions for Output Layer
Binary classification: sigmoid, $\sigma(z) = \dfrac{1}{1+e^{-z}}$
Multi-class classification: softmax, $\mathrm{softmax}(\mathbf{z})_k = \dfrac{e^{z_k}}{\sum_j e^{z_j}}$
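A common way to compute the softmax in practice (not specific to these slides) is to subtract the maximum logit first, which leaves the result unchanged but avoids overflow in the exponentials; a short sketch:

```python
import numpy as np

def softmax(z):
    """softmax(z)_k = exp(z_k) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)              # shifting by the max does not change the ratio
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, 0.1])  # example values
p = softmax(logits)
print(p, p.sum())                   # probabilities that sum to 1
```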
Backpropagation
A unit computes $a = f(z) = f\left(\sum_i w_i x_i + b\right)$. Given the gradient of the loss with respect to the unit's output, $\dfrac{\partial L}{\partial a}$, the chain rule gives the gradient with respect to each weight:
$$\frac{\partial L}{\partial w_i} = \frac{\partial a}{\partial w_i}\,\frac{\partial L}{\partial a}$$
Backpropagation – addition of a constant
$a = y + c$: since $\dfrac{\partial a}{\partial y} = \dfrac{\partial (y + c)}{\partial y} = \dfrac{\partial y}{\partial y} + \dfrac{\partial c}{\partial y} = 1$, the upstream gradient passes through unchanged: $\dfrac{\partial L}{\partial y} = \dfrac{\partial a}{\partial y}\dfrac{\partial L}{\partial a} = \dfrac{\partial L}{\partial a}$.
Backpropagation – addition
$a = y_1 + y_2$: $\dfrac{\partial a}{\partial y_i} = 1$, so $\dfrac{\partial L}{\partial y_i} = \dfrac{\partial L}{\partial a}$ for both inputs.
Backpropagation – multiplication by a constant
$a = c\,y$: $\dfrac{\partial a}{\partial y} = c$, so $\dfrac{\partial L}{\partial y} = c\,\dfrac{\partial L}{\partial a}$.
Backpropagation – weighted sum
Computational graph: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$. Local derivatives: $\dfrac{\partial a_{11}}{\partial w_1} = x_1$, $\dfrac{\partial a_{12}}{\partial w_2} = x_2$, $\dfrac{\partial a_{21}}{\partial a_{11}} = \dfrac{\partial a_{21}}{\partial a_{12}} = 1$. The chain rule gives
$$\frac{\partial L}{\partial w_1} = \frac{\partial a_{11}}{\partial w_1}\frac{\partial a_{21}}{\partial a_{11}}\frac{\partial L}{\partial a_{21}} = x_1 (1) \frac{\partial L}{\partial a_{21}}, \qquad \frac{\partial L}{\partial w_2} = \frac{\partial a_{12}}{\partial w_2}\frac{\partial a_{21}}{\partial a_{12}}\frac{\partial L}{\partial a_{21}} = x_2 (1) \frac{\partial L}{\partial a_{21}}$$
Backpropagation – numerical example (weighted sum)
With $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$, the forward pass gives $a_{11} = 3$, $a_{12} = -8$, $a_{21} = -5$. Backpropagating an upstream gradient $\dfrac{\partial L}{\partial a} = 1$: $\dfrac{\partial L}{\partial w_1} = x_1 = 3$, $\dfrac{\partial L}{\partial x_1} = w_1 = 1$, $\dfrac{\partial L}{\partial w_2} = x_2 = -4$, $\dfrac{\partial L}{\partial x_2} = w_2 = 2$.
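These numbers can be checked with a few lines of Python; the forward pass builds the weighted sum and the backward pass applies the local derivatives from the previous slide, taking the upstream gradient to be 1 as in the example:

```python
x1, w1 = 3.0, 1.0
x2, w2 = -4.0, 2.0

# Forward pass
a11 = w1 * x1          # 3
a12 = w2 * x2          # -8
a = a11 + a12          # -5

# Backward pass with upstream gradient dL/da = 1
dL_da = 1.0
dL_da11 = dL_da * 1.0  # addition passes the gradient through
dL_da12 = dL_da * 1.0
dL_dw1, dL_dx1 = x1 * dL_da11, w1 * dL_da11   # 3, 1
dL_dw2, dL_dx2 = x2 * dL_da12, w2 * dL_da12   # -4, 2
print(a, dL_dw1, dL_dx1, dL_dw2, dL_dx2)
```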
Backpropagation – max
$a = \max(y_1, y_2)$: $\dfrac{\partial a}{\partial y_1} = 1,\ \dfrac{\partial a}{\partial y_2} = 0$ if $y_1 > y_2$, and $\dfrac{\partial a}{\partial y_1} = 0,\ \dfrac{\partial a}{\partial y_2} = 1$ if $y_1 < y_2$: the upstream gradient is routed to the larger input.
Backpropagation – reciprocal
$a = 1/y$: $\dfrac{\partial a}{\partial y} = -\dfrac{1}{y^2}$, so $\dfrac{\partial L}{\partial y} = -\dfrac{1}{y^2}\,\dfrac{\partial L}{\partial a}$.
Backpropagation – exponential
$a = e^{y}$: $\dfrac{\partial a}{\partial y} = e^{y}$, so $\dfrac{\partial L}{\partial y} = e^{y}\,\dfrac{\partial L}{\partial a}$.
Backpropagation – sigmoid built from elementary operations
The sigmoid of a weighted sum decomposes into elementary operations:
$$a_{11} = w_1 x_1,\quad a_{12} = w_2 x_2,\quad a_{21} = a_{11} + a_{12},\quad a_{31} = -a_{21},\quad a_{41} = e^{a_{31}},\quad a_{51} = 1 + a_{41},\quad a_{61} = \frac{1}{a_{51}}$$
Local derivatives: $\dfrac{\partial a_{11}}{\partial w_1} = x_1$, $\dfrac{\partial a_{12}}{\partial w_2} = x_2$, $\dfrac{\partial a_{21}}{\partial a_{11}} = \dfrac{\partial a_{21}}{\partial a_{12}} = 1$, $\dfrac{\partial a_{31}}{\partial a_{21}} = -1$, $\dfrac{\partial a_{41}}{\partial a_{31}} = e^{a_{31}}$, $\dfrac{\partial a_{51}}{\partial a_{41}} = 1$, $\dfrac{\partial a_{61}}{\partial a_{51}} = -\dfrac{1}{a_{51}^{2}}$. The chain rule gives
$$\frac{\partial L}{\partial w_1} = \frac{\partial a_{11}}{\partial w_1}\frac{\partial a_{21}}{\partial a_{11}}\frac{\partial a_{31}}{\partial a_{21}}\frac{\partial a_{41}}{\partial a_{31}}\frac{\partial a_{51}}{\partial a_{41}}\frac{\partial a_{61}}{\partial a_{51}}\frac{\partial L}{\partial a_{61}} = x_1(1)(-1)\,e^{a_{31}}(1)\left(-\frac{1}{a_{51}^{2}}\right)\frac{\partial L}{\partial a_{61}}$$
and likewise $\dfrac{\partial L}{\partial w_2} = x_2(1)(-1)\,e^{a_{31}}(1)\left(-\dfrac{1}{a_{51}^{2}}\right)\dfrac{\partial L}{\partial a_{61}}$.
Backpropagation – numerical example (elementary operations)
With $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$, the forward pass gives $a_{11} = 3$, $a_{12} = -8$, $a_{21} = -5$, $a_{31} = 5$, $a_{41} = e^{5} \approx 148.4$, $a_{51} \approx 149.4$, $a_{61} \approx 0.0067$. Backpropagating $\dfrac{\partial L}{\partial a_{61}} = 1$: $\dfrac{\partial L}{\partial a_{51}} \approx -4.5\times10^{-5}$, $\dfrac{\partial L}{\partial a_{31}} \approx -6.6\times10^{-3}$, $\dfrac{\partial L}{\partial a_{21}} \approx 6.6\times10^{-3}$, and therefore $\dfrac{\partial L}{\partial w_1} \approx 0.02$, $\dfrac{\partial L}{\partial x_1} \approx 6.6\times10^{-3}$, $\dfrac{\partial L}{\partial w_2} \approx -0.027$.
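The same check for the sigmoid graph: the forward pass evaluates the elementary operations and the backward pass multiplies the local derivatives in reverse order (again assuming an upstream gradient of 1):

```python
import numpy as np

x1, w1, x2, w2 = 3.0, 1.0, -4.0, 2.0

# Forward pass through the elementary operations
a11, a12 = w1 * x1, w2 * x2        # 3, -8
a21 = a11 + a12                    # -5
a31 = -a21                         # 5
a41 = np.exp(a31)                  # ~148.4
a51 = 1.0 + a41                    # ~149.4
a61 = 1.0 / a51                    # ~0.0067 (= sigmoid(-5))

# Backward pass, multiplying the local derivatives
dL_da61 = 1.0
dL_da51 = -1.0 / a51**2 * dL_da61  # ~ -4.5e-5
dL_da41 = 1.0 * dL_da51
dL_da31 = np.exp(a31) * dL_da41    # ~ -6.6e-3
dL_da21 = -1.0 * dL_da31           # ~  6.6e-3
dL_dw1, dL_dw2 = x1 * dL_da21, x2 * dL_da21   # ~0.02, ~-0.027
print(a61, dL_dw1, dL_dw2)
```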
Backpropagation – sigmoid as a single node
The chain of elementary operations (negate, exponentiate, add 1, take the reciprocal) applied to the weighted sum is exactly the sigmoid, so it can be collapsed into a single σ node in the computational graph.
Backpropagation – sigmoid node
Computational graph: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$, $a_{31} = \sigma(a_{21})$. Local derivatives: $\dfrac{\partial a_{11}}{\partial w_1} = x_1$, $\dfrac{\partial a_{12}}{\partial w_2} = x_2$, $\dfrac{\partial a_{21}}{\partial a_{11}} = \dfrac{\partial a_{21}}{\partial a_{12}} = 1$, and for the sigmoid node $\dfrac{\partial a_{31}}{\partial a_{21}} = \sigma(a_{21})\left(1 - \sigma(a_{21})\right)$.
Backpropagation – numerical example (sigmoid node)
With the same values ($x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$) and upstream gradient 1: $\sigma(-5) \approx 0.0067$, so $\dfrac{\partial a_{31}}{\partial a_{21}} = \sigma(-5)\left(1 - \sigma(-5)\right) \approx 6.6\times10^{-3}$ and the gradients are the same as before: $\dfrac{\partial L}{\partial w_1} \approx 0.02$, $\dfrac{\partial L}{\partial x_1} \approx 6.6\times10^{-3}$, $\dfrac{\partial L}{\partial w_2} \approx -0.027$.
Backpropagation – branching
When the output of a node feeds several downstream nodes, the gradients arriving from the branches add, e.g. $\dfrac{\partial L}{\partial a_1} + \dfrac{\partial L}{\partial a_2}$ for a node whose output enters two branches.
Backpropagation – notation
Layers are indexed $l = 1, 2, 3, \dots$; $w^l_{jk}$ is the weight from unit $k$ in layer $l-1$ to unit $j$ in layer $l$ (e.g. $w^2_{74}$), and $b^l_j$ and $a^l_j$ are the bias and activation of unit $j$ in layer $l$ (e.g. $b^2_7$, $a^2_7$):
$$a^l_j = f\left(\sum_k w^l_{jk}\, a^{l-1}_k + b^l_j\right) = f\left(z^l_j\right), \qquad \mathbf{a}^l = f\left(\mathbf{w}^l \mathbf{a}^{l-1} + \mathbf{b}^l\right) = f\left(\mathbf{z}^l\right)$$
Error: $\delta^l_j = \dfrac{\partial L}{\partial z^l_j}$, i.e. $\boldsymbol{\delta}^l = \nabla_{\mathbf{z}} L$.
Backpropagation – error for the output layer ($l = N$)
$$\delta^N_j = \frac{\partial L}{\partial z^N_j} \;\overset{\text{chain rule}}{=}\; \sum_k \frac{\partial L}{\partial a^N_k}\frac{\partial a^N_k}{\partial z^N_j}$$
Since $\dfrac{\partial a^N_k}{\partial z^N_j} = 0$ if $k \neq j$, and $a^N_j = f(z^N_j)$:
$$\delta^N_j = \frac{\partial L}{\partial a^N_j}\frac{\partial a^N_j}{\partial z^N_j} = \frac{\partial L}{\partial a^N_j}\, f'\!\left(z^N_j\right), \qquad \boldsymbol{\delta}^N = \nabla_{\mathbf{a}} L \odot f'\!\left(\mathbf{z}^N\right)$$
($\odot$ denotes element-wise multiplication of vectors.)
Backpropagation – error as a function of the error in the next layer
$$\delta^l_j = \frac{\partial L}{\partial z^l_j} \;\overset{\text{chain rule}}{=}\; \sum_k \frac{\partial L}{\partial z^{l+1}_k}\frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$
With $z^{l+1}_k = \sum_i w^{l+1}_{ki} a^l_i + b^{l+1}_k = \sum_i w^{l+1}_{ki} f(z^l_i) + b^{l+1}_k$:
$$\delta^l_j = \sum_k \delta^{l+1}_k w^{l+1}_{kj}\, f'\!\left(z^l_j\right), \qquad \boldsymbol{\delta}^l = \left(\left(\mathbf{w}^{l+1}\right)^{T} \boldsymbol{\delta}^{l+1}\right) \odot f'\!\left(\mathbf{z}^l\right)$$
Backpropagation – gradient with respect to the biases
$$\frac{\partial L}{\partial b^l_j} \;\overset{\text{chain rule}}{=}\; \sum_k \frac{\partial L}{\partial z^l_k}\frac{\partial z^l_k}{\partial b^l_j} = \sum_k \delta^l_k \frac{\partial z^l_k}{\partial b^l_j}$$
With $z^l_k = \sum_i w^l_{ki} a^{l-1}_i + b^l_k$, only the $k = j$ term contributes:
$$\frac{\partial L}{\partial b^l_j} = \delta^l_j, \qquad \nabla_{\mathbf{b}} L = \boldsymbol{\delta}^l$$
Backpropagation – gradient with respect to the weights
$$\frac{\partial L}{\partial w^l_{jk}} \;\overset{\text{chain rule}}{=}\; \sum_i \frac{\partial L}{\partial z^l_i}\frac{\partial z^l_i}{\partial w^l_{jk}} = \sum_i \delta^l_i \frac{\partial z^l_i}{\partial w^l_{jk}}$$
With $z^l_i = \sum_m w^l_{im} a^{l-1}_m + b^l_i$, only the $i = j$ term contributes:
$$\frac{\partial L}{\partial w^l_{jk}} = a^{l-1}_k\, \delta^l_j, \qquad \nabla_{\mathbf{w}} L = \boldsymbol{\delta}^l \left(\mathbf{a}^{l-1}\right)^{T}$$
Backpropagation – algorithm
For each training example $x$ in a mini-batch of size $m$:
Forward pass: $\mathbf{a}^{x,l} = f\left(\mathbf{z}^{x,l}\right) = f\left(\mathbf{w}^l \mathbf{a}^{x,l-1} + \mathbf{b}^l\right)$
Output error: $\boldsymbol{\delta}^{x,N} = \nabla_{\mathbf{a}} L \odot f'\!\left(\mathbf{z}^{x,N}\right)$
Error backpropagation: $\boldsymbol{\delta}^{x,l} = \left(\left(\mathbf{w}^{l+1}\right)^{T} \boldsymbol{\delta}^{x,l+1}\right) \odot f'\!\left(\mathbf{z}^{x,l}\right)$
Gradient descent: $\mathbf{b}^l \to \mathbf{b}^l - \dfrac{\eta}{m}\sum_x \boldsymbol{\delta}^{x,l}$, $\qquad \mathbf{w}^l \to \mathbf{w}^l - \dfrac{\eta}{m}\sum_x \boldsymbol{\delta}^{x,l}\left(\mathbf{a}^{x,l-1}\right)^{T}$
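A minimal NumPy sketch of this procedure for a two-layer sigmoid network with a squared-error loss; the layer sizes, learning rate, number of epochs, and the XOR toy data are arbitrary illustrative choices, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: learn XOR with a 2-3-1 network (columns are examples)
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)

sizes = [2, 3, 1]
w = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
b = [np.zeros((sizes[l + 1], 1)) for l in range(2)]
eta, m = 1.0, X.shape[1]

for epoch in range(5000):
    # Forward pass: a^l = f(w^l a^{l-1} + b^l)
    a = [X]
    for l in range(2):
        a.append(sigmoid(w[l] @ a[-1] + b[l]))
    # Output error: delta^N = dL/da * f'(z^N), with L = 1/2 ||a - y||^2
    delta = (a[-1] - Y) * a[-1] * (1 - a[-1])
    # Backpropagate the error and take a gradient-descent step
    for l in (1, 0):
        grad_w = delta @ a[l].T / m
        grad_b = delta.sum(axis=1, keepdims=True) / m
        if l > 0:
            delta = (w[l].T @ delta) * a[l] * (1 - a[l])
        w[l] -= eta * grad_w
        b[l] -= eta * grad_b

print(np.round(a[-1], 2))  # should approach [0, 1, 1, 0] (may need more epochs for some seeds)
```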
Vanishing Gradient
With sigmoid units, the error in one layer is related to the error in the next by $\delta^l = \delta^{l+1} w^{l+1} \sigma'(z^l)$, so
$$\delta^N = \frac{\partial L}{\partial a^N}\,\sigma'(z^N), \qquad \delta^l = \frac{\partial L}{\partial a^N}\,\sigma'(z^N)\prod_{j=l}^{N-1} w^{j+1}\sigma'(z^j), \qquad \frac{\delta^l}{\delta^{l+k}} = \prod_{j=l}^{l+k-1} w^{j+1}\sigma'(z^j)$$
If $\left|w^{j+1}\right| < 1$ and $\sigma'(z^j) \le \sigma'(0) = \dfrac{1}{4}$, then
$$\left|\frac{\delta^l}{\delta^{l+k}}\right| < \left(\frac{1}{4}\right)^{k}$$
so the gradient shrinks exponentially as it is propagated toward the earlier layers.
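A quick numerical illustration of this bound, assuming (as on the slide) weights with magnitude below 1: the product of $w\,\sigma'(z)$ factors shrinks roughly geometrically with the number of layers crossed.

```python
import numpy as np

rng = np.random.default_rng(1)

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)            # at most 1/4, attained at z = 0

k = 20                               # number of layers the gradient crosses
w = rng.uniform(-1.0, 1.0, size=k)   # |w| < 1 (assumption from the slide)
z = rng.standard_normal(k)

ratio = np.prod(w * d_sigmoid(z))    # delta^l / delta^(l+k)
print(abs(ratio), (1 / 4) ** k)      # the measured ratio is below (1/4)^k
```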
Dying ReLUs
$\mathrm{ReLU}(z) = \max(0, z)$ with $\dfrac{d\,\mathrm{ReLU}(z)}{dz} = 0$ if $z < 0$ and $1$ if $z > 0$. For a ReLU unit $a = \max(0,\, w x_1 + b)$:
$$\frac{\partial a}{\partial w} = x_1,\ \frac{\partial a}{\partial b} = 1 \quad \text{if } w x_1 + b > 0, \qquad \frac{\partial a}{\partial w} = 0,\ \frac{\partial a}{\partial b} = 0 \quad \text{if } w x_1 + b < 0$$
If $w x_1 + b < 0$ for all training inputs, the gradients with respect to $w$ and $b$ are zero and the unit stops learning.
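A tiny sketch of the effect: if $w x_1 + b$ is negative for every training input, the gradients with respect to $w$ and $b$ are zero and gradient descent never moves the unit out of that regime (the values below are arbitrary example values):

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 3.0])   # all-positive inputs (example values)
w, b = -2.0, -1.0                    # puts w*x + b < 0 for every input

pre = w * x + b                      # all negative -> ReLU output is 0 everywhere
grad_w = np.where(pre > 0, x, 0.0).sum()
grad_b = np.where(pre > 0, 1.0, 0.0).sum()
print(pre, grad_w, grad_b)           # gradients are 0: the unit cannot recover
```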
Universal Function Approximation
"…arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity."
Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems. 1989;2(4):303–314.
[Figure: a sigmoid; the difference of two sigmoids with a small shift; a function; approximations of the function with sigmoids.]
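The figure's construction can be reproduced in a few lines: subtracting two slightly shifted, steep sigmoids gives a localized "bump", and a weighted sum of such bumps approximates a target function. The target function, grid, and steepness below are arbitrary illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, left, right, steep=50.0):
    """Difference of two shifted sigmoids: ~1 on [left, right], ~0 elsewhere."""
    return sigmoid(steep * (x - left)) - sigmoid(steep * (x - right))

x = np.linspace(0, 2 * np.pi, 500)
target = np.sin(x)                        # example target function

edges = np.linspace(0, 2 * np.pi, 41)     # 40 narrow intervals
approx = sum(np.sin(0.5 * (a + b)) * bump(x, a, b)
             for a, b in zip(edges[:-1], edges[1:]))

print(np.max(np.abs(target - approx)))    # small maximum error (on the order of the interval width)
```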
Convolutional Neural Networks (CNNs)
Local receptive fields: each unit sees only a small patch of the previous layer. Convolution describes the response of a linear, time-invariant system to an input signal; it is the inverse Fourier transform of the pointwise product in frequency space. Shared weights and biases: the same filter is applied at every position.
LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
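Both ideas, local receptive fields and shared weights, are visible in a direct NumPy implementation of a 1-D convolution (strictly a cross-correlation, as CNN libraries compute it); a sketch with an example kernel, not an optimized routine:

```python
import numpy as np

def conv1d(signal, kernel, bias=0.0):
    """Each output uses only a local window of the input (local receptive
    field) and the same kernel and bias at every position (shared weights)."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel) + bias
                     for i in range(len(signal) - k + 1)])

x = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
edge_filter = np.array([1.0, 0.0, -1.0])   # example kernel
print(conv1d(x, edge_filter))
```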
Convolutional Neural Networks (CNNs): LeNet-5
LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998;86(11):2278–2324.
Convolutional Neural Networks (CNNs): AlexNet
Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.
Convolutional Neural Networks (CNNs): VGG Net
A deep architecture built from small convolution filters.
Convolutional Neural Networks (CNNs): GoogLeNet
Inception modules (parallel layers with multi-scale processing) increase the depth and width of the network while keeping the computational budget constant.
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
Convolutional Neural Networks (CNNs): ResNet
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. arXiv:1512.03385 [cs.CV], 2015.
Mini-Batch Gradient Descent
Mini-batch gradient descent uses a subset of the training set, rather than the full training set, to calculate the gradient for each step.
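In code this usually means shuffling the training indices each epoch and stepping through them in slices; a small sketch of the batching itself (the gradient step per batch is as in the backpropagation algorithm above):

```python
import numpy as np

def minibatches(n_examples, batch_size, rng):
    """Yield index arrays covering one shuffled pass over the training set."""
    idx = rng.permutation(n_examples)
    for start in range(0, n_examples, batch_size):
        yield idx[start:start + batch_size]

rng = np.random.default_rng(0)
for batch in minibatches(n_examples=10, batch_size=4, rng=rng):
    print(batch)   # e.g. three batches of sizes 4, 4, 2
```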
Initialization
Take inputs $x_i$, $i = 1 \dots N$ ($N = 100$), normally distributed ($\mu = 0$, $\sigma = 1$), and $z = \sum_i w_i x_i$. If the weights $w_i$ are initialized normally distributed with $\mu = 0$, $\sigma = 1$, the spread of $z$ is large, so $\sigma(z)$ saturates near 0 or 1 and $\sigma'(z)$ is close to zero. If the $w_i$ are instead initialized with $\mu = 0$, $\sigma = 1/N$, the spread of $z$ is small, $\sigma(z)$ stays in its steep central region, and $\sigma'(z)$ remains large, so learning is faster.
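The effect is easy to see numerically: with $\sigma = 1$ weights, $z$ has a spread of about $\sqrt{N}$ and the sigmoid saturates, while the scaled-down initialization keeps $z$ small and $\sigma'(z)$ well away from zero. A small sampling sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 100, 10000

def d_sigmoid(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

x = rng.standard_normal((trials, N))          # inputs ~ N(0, 1)

for scale in (1.0, 1.0 / N):                  # the two initializations on the slide
    w = rng.normal(0.0, scale, size=(trials, N))
    z = np.sum(w * x, axis=1)
    print(f"sigma_w={scale:<6} std(z)={z.std():6.2f} "
          f"mean sigma'(z)={d_sigmoid(z).mean():.3f}")
```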
Transfer Learning
If your data set is limited in size:
- Use a pre-trained network, remove only the last fully connected layer, and train a linear classifier on your data set (a sketch follows below).
- Fine-tune part of the network using backpropagation.
Example: "An image of a skin lesion (for example, melanoma) is sequentially warped into a probability distribution over clinical classes of skin disease using Google Inception v3 CNN architecture pretrained on the ImageNet dataset (1.28 million images over 1,000 generic object classes) and fine-tuned on our own dataset of 129,450 skin lesions comprising 2,032 different diseases." Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature, 2017.
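A sketch of the first recipe with Keras, assuming TensorFlow is installed and that your images and labels are already loaded into `x_train` and `y_train` (hypothetical arrays); the pretrained Inception v3 base is frozen and only a new classification layer is trained:

```python
import tensorflow as tf

num_classes = 5                      # assumption: the number of classes in your data set

# Pretrained Inception v3 without its final fully connected layer
base = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")
base.trainable = False               # freeze the base: train only the new classifier

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(x_train, y_train, epochs=5)   # x_train shaped (n, 299, 299, 3)
# To fine-tune part of the network, set base.trainable = True and recompile
# with a small learning rate.
```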
Data Augmentation
- Translations
- Rotations
- Reflections
- Intensity and color of illumination
- Deformation
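Some of these transformations take only a line or two of NumPy on an image array; a sketch with a random stand-in image (rotations and elastic deformations would typically use an image library):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                 # stand-in for a training image

flipped = img[:, ::-1, :]                     # horizontal reflection
shifted = np.roll(img, shift=(2, -3), axis=(0, 1))            # small translation (wraps around)
brighter = np.clip(img * 1.2, 0.0, 1.0)       # change intensity of illumination
tinted = np.clip(img * np.array([1.1, 1.0, 0.9]), 0.0, 1.0)   # shift color balance
```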
Batch Normalization
For each mini-batch $B$, normalize between each layer:
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$
where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance and $\gamma$, $\beta$ are learned parameters. Provides regularization and faster learning.
Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
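A sketch of the training-time forward pass of these two equations over a mini-batch, with γ and β as learnable per-feature parameters (the batch below is random example data):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch,
    then scale and shift with the learned gamma and beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(64, 10))      # example mini-batch
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```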
Regularization - Dropout
During training, dropout randomly sets a fraction of the unit activations to zero at each step, which acts as a regularizer; at test time the full network is used.
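A sketch of "inverted" dropout, the form most libraries use: a random mask zeroes a fraction of the activations during training and the survivors are scaled up so no change is needed at test time (the activations below are random example values):

```python
import numpy as np

def dropout(a, p_drop, rng, training=True):
    """Randomly zero a fraction p_drop of the activations during training."""
    if not training or p_drop == 0.0:
        return a
    keep = 1.0 - p_drop
    mask = rng.random(a.shape) < keep
    return a * mask / keep            # scale so the expected value is unchanged

rng = np.random.default_rng(0)
a = rng.random((4, 8))                # example activations
print(dropout(a, p_drop=0.5, rng=rng))
```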
Recurrent Neural Networks (RNNs)
At each time step $t$, the hidden state is updated from the previous hidden state and the current input, and an output is read out:
$$\mathbf{h}_t = \tanh\left(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t\right), \qquad \mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t$$
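The two recurrence equations map directly onto a few lines of NumPy; a sketch of one step and a loop over a short input sequence (all sizes and inputs are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 4, 8, 3, 5        # example sizes and sequence length

W_xh = rng.standard_normal((n_hidden, n_in)) * 0.1
W_hh = rng.standard_normal((n_hidden, n_hidden)) * 0.1
W_hy = rng.standard_normal((n_out, n_hidden)) * 0.1

def rnn_step(h_prev, x_t):
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
    y_t = W_hy @ h_t                            # y_t = W_hy h_t
    return h_t, y_t

h = np.zeros(n_hidden)
for t in range(T):
    x_t = rng.standard_normal(n_in)             # stand-in for the input at time t
    h, y = rnn_step(h, x_t)
    print(t, y.round(3))
```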
Recurrent Neural Networks (RNNs) – architectures
One-to-one (e.g. a CNN image classifier); one-to-many: image captioning; many-to-one: sentiment classification; many-to-many: translation; many-to-many (synchronized): classification of each frame of a video.
Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Long Short-Term Memory (LSTM)
Vanilla RNN: $\mathbf{h}_t = \tanh\left(\mathbf{W}_h\left[\mathbf{h}_{t-1}, \mathbf{x}_t\right]\right)$, $\quad \mathbf{y}_t = \mathbf{W}_y\mathbf{h}_t$
LSTM:
Forget gate: $\mathbf{f}_t = \sigma\left(\mathbf{W}_f\left[\mathbf{h}_{t-1}, \mathbf{x}_t\right] + \mathbf{b}_f\right)$
Input gate: $\mathbf{i}_t = \sigma\left(\mathbf{W}_i\left[\mathbf{h}_{t-1}, \mathbf{x}_t\right] + \mathbf{b}_i\right)$
Output gate: $\mathbf{o}_t = \sigma\left(\mathbf{W}_o\left[\mathbf{h}_{t-1}, \mathbf{x}_t\right] + \mathbf{b}_o\right)$
Cell state update: $\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tanh\left(\mathbf{W}_c\left[\mathbf{h}_{t-1}, \mathbf{x}_t\right] + \mathbf{b}_c\right)$, $\quad \mathbf{h}_t = \mathbf{o}_t \odot \tanh\left(\mathbf{C}_t\right)$, $\quad \mathbf{y}_t = \mathbf{W}_y\mathbf{h}_t$
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–1780.
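A sketch of a single LSTM step following the gate equations above, with $[\mathbf{h}_{t-1}, \mathbf{x}_t]$ implemented as a concatenation; sizes and inputs are arbitrary example values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8
concat = n_hidden + n_in

# One weight matrix and bias per gate, acting on [h_{t-1}, x_t]
W_f, W_i, W_o, W_c = (rng.standard_normal((n_hidden, concat)) * 0.1 for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(n_hidden) for _ in range(4))

def lstm_step(h_prev, C_prev, x_t):
    hx = np.concatenate([h_prev, x_t])               # [h_{t-1}, x_t]
    f = sigmoid(W_f @ hx + b_f)                      # forget gate
    i = sigmoid(W_i @ hx + b_i)                      # input gate
    o = sigmoid(W_o @ hx + b_o)                      # output gate
    C = f * C_prev + i * np.tanh(W_c @ hx + b_c)     # cell state update
    h = o * np.tanh(C)                               # new hidden state
    return h, C

h, C = np.zeros(n_hidden), np.zeros(n_hidden)
h, C = lstm_step(h, C, rng.standard_normal(n_in))
print(h.round(3))
```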
Image Captioning – Combining CNNs and RNNs
Karpathy A, Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
Autoencoders – Unsupervised Learning
Autoencoders learn a lower-dimensional representation of the input: an encoder maps the input to the lower-dimensional representation, a decoder maps it back, and the network is trained so that output ≈ input.
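A sketch of a small dense autoencoder in Keras (assuming TensorFlow is installed; the 784-dimensional input and 32-dimensional bottleneck are example choices): the model is trained with the input as its own target.

```python
import tensorflow as tf

input_dim, code_dim = 784, 32          # example sizes (e.g. flattened 28x28 images)

encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(input_dim,)),
    tf.keras.layers.Dense(code_dim, activation="relu"),     # lower-dimensional code
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(code_dim,)),
    tf.keras.layers.Dense(input_dim, activation="sigmoid"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# autoencoder.fit(x_train, x_train, epochs=10)   # output is trained to match the input
# codes = encoder.predict(x_train)               # the learned low-dimensional representation
```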
Generative Adversarial Networks
Nguyen et al., "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space".
Deep Dream
[Images: Google DeepDream rendering of The Garden of Earthly Delights; Hieronymus Bosch, The Garden of Earthly Delights.]
52
Artistic Style 52 LA. Gatys, A.S. Ecker, M. Bethge, “A Neural Algorithm of Artistic Style”,
Adversarial Fooling Examples
[Figure: an original, correctly classified image plus a small perturbation is classified as an ostrich.]
Szegedy et al., "Intriguing properties of neural networks", 2013.
Adversarial Fooling Examples
Wieland Brendel