Machine Learning – Neural Networks
David Fenyő
Contact: David@FenyoLab.org
Example: Skin Cancer Diagnosis

Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature, 2017.
Example: Histopathological Diagnosis

Litjens et al., "Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis", Scientific Reports, 2016.
Example: The Digital Mammography DREAM Challenge

The Digital Mammography DREAM Challenge will attempt to improve the predictive accuracy of digital mammography for the early detection of breast cancer. The primary benefit of this Challenge will be to establish new quantitative tools - machine learning, deep learning or other - that can help decrease the recall rate of screening mammography, with a potential impact on shifting the balance of routine breast cancer screening towards more benefit and less harm. Participating teams will be asked to submit predictive models based on over 640,000 de-identified digital mammography images from over 86,000 subjects, with corresponding clinical variables. https://www.synapse.org/#!Synapse:syn4224222/wiki/401743
Architecture

Each neuron computes $a = f\left(\sum_i w_i x_i + b\right)$, where $x_1, \dots, x_n$ are the inputs, $w_1, \dots, w_n$ the weights, $b$ the bias, and $f$ the activation function. Neurons are arranged in an input layer, one or more hidden layers, and an output layer.
Activation Functions

Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

Hyperbolic tangent: $\tanh(z) = \dfrac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1$

ReLU: $\mathrm{ReLU}(z) = \max(0, z)$
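The three activation functions, and the identity relating tanh to the sigmoid, can be checked directly (a minimal Python sketch; the function names are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def relu(z):
    return max(0.0, z)

# The identity tanh(z) = 2*sigmoid(2z) - 1 holds for any z:
for z in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    assert abs(tanh(z) - (2 * sigmoid(2 * z) - 1)) < 1e-12
```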
Activation Function Derivatives

Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$, $\quad \dfrac{d\sigma(z)}{dz} = \sigma(z)(1 - \sigma(z))$

ReLU: $\mathrm{ReLU}(z) = \max(0, z)$, $\quad \dfrac{d\,\mathrm{ReLU}(z)}{dz} = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \end{cases}$
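The analytic sigmoid derivative can be verified against a numerical finite-difference estimate (a small Python sketch; the helper name is illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def numeric_derivative(f, z, h=1e-6):
    # Central finite-difference approximation of df/dz.
    return (f(z + h) - f(z - h)) / (2 * h)

# Check the analytic derivative sigma'(z) = sigma(z) * (1 - sigma(z)):
for z in [-3.0, 0.0, 1.5]:
    analytic = sigmoid(z) * (1 - sigmoid(z))
    assert abs(numeric_derivative(sigmoid, z) - analytic) < 1e-8
```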
Faster Learning with ReLU

A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line).

Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012 (pp. 1097-1105).
Activation Functions for Output Layer

Binary classification – Sigmoid: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$

Multi-class classification – Softmax: $\mathrm{softmax}(\boldsymbol{z})_k = \dfrac{e^{z_k}}{\sum_j e^{z_j}}$
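A softmax implementation is a one-liner; subtracting the maximum before exponentiating is a standard numerical-stability trick (a sketch, not part of the slides):

```python
import math

def softmax(zs):
    # Subtracting max(zs) avoids overflow without changing the result,
    # since e^(z - m) / sum e^(z_j - m) = e^z / sum e^(z_j).
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-12   # a valid probability distribution
assert probs[0] > probs[1] > probs[2]  # order follows the logits
```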
Gradient Descent with Sigmoid Activation Function

$\operatorname*{argmin}_{\boldsymbol{w}} L(\boldsymbol{w})$, $\quad \boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta \nabla L(\boldsymbol{w}_n)$

The negative log-likelihood loss has an analytical expression for its derivative:

$L = -\dfrac{1}{n}\sum_j \left( y_j \log \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) + (1 - y_j)\log(1 - \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})) \right)$

$\nabla_{\boldsymbol{w}} L = -\dfrac{1}{n}\sum_j \left( \dfrac{y_j}{\sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})} - \dfrac{1 - y_j}{1 - \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})} \right) \nabla_{\boldsymbol{w}} \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) = \dfrac{1}{n}\sum_j \boldsymbol{x}_j \left( \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) - y_j \right)$
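The final gradient expression can be used directly for gradient descent on a toy 1-D logistic-regression problem (the data set and learning rate below are illustrative, not from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: points below 0 labeled 0, points above 0 labeled 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w, eta = 0.0, 0.5
for step in range(200):
    # Gradient of the loss: (1/n) * sum_j x_j * (sigma(w x_j) - y_j)
    grad = sum(x * (sigmoid(w * x) - y) for x, y in zip(xs, ys)) / len(xs)
    w -= eta * grad

# After training, the model separates the two classes confidently.
assert sigmoid(w * 2.0) > 0.9
assert sigmoid(w * -2.0) < 0.1
```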
Gradient Descent – Momentum and Friction

Plain gradient descent: $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta \nabla L(\boldsymbol{w}_n)$

Partially remembering previous gradients (momentum):
$\boldsymbol{v}_n = \gamma \boldsymbol{v}_{n-1} + \eta \nabla L(\boldsymbol{w}_n)$, $\quad \boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \boldsymbol{v}_n$

Nesterov accelerated gradient:
$\boldsymbol{v}_n = \gamma \boldsymbol{v}_{n-1} + \eta \nabla L(\boldsymbol{w}_n - \gamma \boldsymbol{v}_{n-1})$, $\quad \boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \boldsymbol{v}_n$

Adagrad: decreases the learning rate monotonically based on the sum of squared past gradients.
Adadelta & RMSprop: extensions of Adagrad that slowly forget past gradients.
Adaptive Moment Estimation (Adam): adaptive learning rates and momentum.
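The momentum update can be sketched on a one-dimensional quadratic loss (the loss function and hyperparameters below are illustrative choices):

```python
# Minimize L(w) = 0.5 * w^2 (so dL/dw = w) with and without momentum.
def descend(eta=0.1, gamma=0.9, steps=300, momentum=True):
    w, v = 5.0, 0.0
    for _ in range(steps):
        grad = w                          # dL/dw for L = 0.5 * w^2
        if momentum:
            v = gamma * v + eta * grad    # remember a fraction of past gradients
            w -= v
        else:
            w -= eta * grad               # plain gradient descent
    return w

# Both variants reach the minimum at w = 0 on this simple loss.
assert abs(descend(momentum=True)) < 1e-3
assert abs(descend(momentum=False)) < 1e-3
```

On ill-conditioned losses, the velocity term damps oscillation across steep directions while accelerating along shallow ones, which is the practical motivation for momentum.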
Mini-Batch Gradient Descent

Mini-batch gradient descent uses a subset of the training set to calculate the gradient for each step.
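A typical way to form mini-batches is to shuffle the training indices once per epoch and slice off consecutive chunks (a minimal sketch; the function name and batch size are illustrative):

```python
import random

def minibatches(data, batch_size, seed=0):
    # Shuffle once per epoch, then yield successive slices of the data.
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

batches = list(minibatches(list(range(10)), 3))
assert len(batches) == 4                          # batches of 3, 3, 3, 1 examples
assert sorted(sum(batches, [])) == list(range(10))  # every example used once
```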
Backpropagation

A neuron computes $a = f(z) = f\left(\sum_i w_i x_i + b\right)$. By the chain rule, the gradient of the loss with respect to each weight is

$\dfrac{\partial L}{\partial w_i} = \dfrac{\partial a}{\partial w_i}\,\dfrac{\partial L}{\partial a}$
Backpropagation – Addition of a Constant

$a = y + c$: $\quad \dfrac{\partial a}{\partial y} = \dfrac{\partial (y + c)}{\partial y} = \dfrac{\partial y}{\partial y} + \dfrac{\partial c}{\partial y} = 1$, so $\dfrac{\partial L}{\partial y} = \dfrac{\partial a}{\partial y}\,\dfrac{\partial L}{\partial a} = \dfrac{\partial L}{\partial a}$
Backpropagation – Addition

$a = y_1 + y_2$: $\quad \dfrac{\partial a}{\partial y_i} = 1$, so $\dfrac{\partial L}{\partial y_i} = \dfrac{\partial L}{\partial a}$ for both inputs: the addition gate passes the upstream gradient through unchanged.
Backpropagation – Multiplication by a Constant

$a = cy$: $\quad \dfrac{\partial a}{\partial y} = c$, so $\dfrac{\partial L}{\partial y} = c\,\dfrac{\partial L}{\partial a}$
Backpropagation – Multiply and Add

Forward pass: $a_{11} = w_1 x_1$, $\quad a_{12} = w_2 x_2$, $\quad a_{21} = a_{11} + a_{12}$

Local derivatives: $\dfrac{\partial a_{11}}{\partial w_1} = x_1$, $\quad \dfrac{\partial a_{12}}{\partial w_2} = x_2$, $\quad \dfrac{\partial a_{21}}{\partial a_{11}} = \dfrac{\partial a_{21}}{\partial a_{12}} = 1$

Backward pass:
$\dfrac{\partial L}{\partial w_1} = \dfrac{\partial a_{11}}{\partial w_1}\,\dfrac{\partial a_{21}}{\partial a_{11}}\,\dfrac{\partial L}{\partial a_{21}} = x_1 (1)\,\dfrac{\partial L}{\partial a_{21}}$
$\dfrac{\partial L}{\partial w_2} = \dfrac{\partial a_{12}}{\partial w_2}\,\dfrac{\partial a_{21}}{\partial a_{12}}\,\dfrac{\partial L}{\partial a_{21}} = x_2 (1)\,\dfrac{\partial L}{\partial a_{21}}$
Backpropagation – Numerical Example

With $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$: the forward pass gives $a_{11} = 3$, $a_{12} = -8$, $a_{21} = -5$. Taking $\partial L / \partial a_{21} = 1$, the backward pass gives $\partial L / \partial w_1 = x_1 = 3$, $\partial L / \partial x_1 = w_1 = 1$, $\partial L / \partial w_2 = x_2 = -4$, and $\partial L / \partial x_2 = w_2 = 2$.
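The numbers on this slide can be reproduced in a few lines of Python (variable names mirror the slide's notation):

```python
# Forward pass through a21 = w1*x1 + w2*x2 with the slide's numbers.
x1, w1, x2, w2 = 3.0, 1.0, -4.0, 2.0
a11 = w1 * x1          # 3
a12 = w2 * x2          # -8
a21 = a11 + a12        # -5

# Backward pass, taking the upstream gradient dL/da21 = 1.
dL_da21 = 1.0
dL_dw1 = x1 * dL_da21  # the add gate passes the gradient through unchanged
dL_dw2 = x2 * dL_da21
dL_dx1 = w1 * dL_da21
dL_dx2 = w2 * dL_da21

assert a21 == -5.0
assert (dL_dw1, dL_dw2, dL_dx1, dL_dx2) == (3.0, -4.0, 1.0, 2.0)
```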
Backpropagation – Max Gate

$a = \max(y_1, y_2)$:

$\dfrac{\partial a}{\partial y_1} = 1,\ \dfrac{\partial a}{\partial y_2} = 0$ if $y_1 > y_2$; $\quad \dfrac{\partial a}{\partial y_1} = 0,\ \dfrac{\partial a}{\partial y_2} = 1$ if $y_1 < y_2$

The max gate routes the full gradient to the larger input.
Backpropagation – Reciprocal

$a = 1/y$: $\quad \dfrac{\partial a}{\partial y} = -\dfrac{1}{y^2}$, so $\dfrac{\partial L}{\partial y} = -\dfrac{1}{y^2}\,\dfrac{\partial L}{\partial a}$
Backpropagation – Exponential

$a = e^y$: $\quad \dfrac{\partial a}{\partial y} = e^y$, so $\dfrac{\partial L}{\partial y} = e^y\,\dfrac{\partial L}{\partial a}$
Backpropagation – Sigmoid Built from Elementary Gates

Forward pass: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$, $a_{31} = -a_{21}$, $a_{41} = e^{a_{31}}$, $a_{51} = 1 + a_{41}$, $a_{61} = \dfrac{1}{a_{51}}$

Local derivatives: $\dfrac{\partial a_{11}}{\partial w_1} = x_1$, $\dfrac{\partial a_{12}}{\partial w_2} = x_2$, $\dfrac{\partial a_{21}}{\partial a_{11}} = \dfrac{\partial a_{21}}{\partial a_{12}} = 1$, $\dfrac{\partial a_{31}}{\partial a_{21}} = -1$, $\dfrac{\partial a_{41}}{\partial a_{31}} = e^{a_{31}}$, $\dfrac{\partial a_{51}}{\partial a_{41}} = 1$, $\dfrac{\partial a_{61}}{\partial a_{51}} = -\dfrac{1}{a_{51}^2}$

Backward pass:
$\dfrac{\partial L}{\partial w_1} = \dfrac{\partial a_{11}}{\partial w_1}\,\dfrac{\partial a_{21}}{\partial a_{11}}\,\dfrac{\partial a_{31}}{\partial a_{21}}\,\dfrac{\partial a_{41}}{\partial a_{31}}\,\dfrac{\partial a_{51}}{\partial a_{41}}\,\dfrac{\partial a_{61}}{\partial a_{51}}\,\dfrac{\partial L}{\partial a_{61}} = x_1 (1)(-1)\,e^{a_{31}} (1)\left(-\dfrac{1}{a_{51}^2}\right)\dfrac{\partial L}{\partial a_{61}}$
$\dfrac{\partial L}{\partial w_2} = \dfrac{\partial a_{12}}{\partial w_2}\,\dfrac{\partial a_{21}}{\partial a_{12}}\,\dfrac{\partial a_{31}}{\partial a_{21}}\,\dfrac{\partial a_{41}}{\partial a_{31}}\,\dfrac{\partial a_{51}}{\partial a_{41}}\,\dfrac{\partial a_{61}}{\partial a_{51}}\,\dfrac{\partial L}{\partial a_{61}} = x_2 (1)(-1)\,e^{a_{31}} (1)\left(-\dfrac{1}{a_{51}^2}\right)\dfrac{\partial L}{\partial a_{61}}$
Backpropagation – Sigmoid Circuit, Numerical Example

With $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$: the forward pass gives $a_{21} = -5$, $a_{31} = 5$, $a_{41} = e^5 \approx 148.4$, $a_{51} \approx 149.4$, $a_{61} = \sigma(-5) \approx 0.0067$. Taking $\partial L / \partial a_{61} = 1$, the backward pass gives $\partial a_{61}/\partial a_{51} \approx -4.5\times10^{-5}$, then through the add, exp, and negation gates $\partial L / \partial a_{21} \approx 6.6\times10^{-3}$, so $\partial L / \partial w_1 \approx 0.02$ and $\partial L / \partial w_2 \approx -0.027$.
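This gate-by-gate computation can be coded directly, and the backward product through the circuit indeed equals $\sigma'(z) = \sigma(z)(1-\sigma(z))$ at $z = -5$ (variable names mirror the slide):

```python
import math

# Forward pass through the decomposed sigmoid circuit with the slide's numbers.
x1, w1, x2, w2 = 3.0, 1.0, -4.0, 2.0
a21 = w1 * x1 + w2 * x2      # -5
a31 = -a21                   # 5
a41 = math.exp(a31)          # ~148.4
a51 = 1.0 + a41              # ~149.4
a61 = 1.0 / a51              # sigma(-5) ~ 0.0067

# Backward pass with upstream gradient dL/da61 = 1.
d51 = -1.0 / a51**2          # gate a61 = 1 / a51
d41 = d51 * 1.0              # gate a51 = 1 + a41
d31 = d41 * math.exp(a31)    # gate a41 = exp(a31)
d21 = d31 * -1.0             # gate a31 = -a21

# The chained local gradients reproduce sigma'(z) = sigma(z)(1 - sigma(z)).
assert abs(d21 - a61 * (1 - a61)) < 1e-12
assert abs(a61 - 0.00669285) < 1e-6
```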
Backpropagation – Sigmoid as a Single Gate

The chain of negation, exponential, addition, and reciprocal gates can be collapsed into a single sigmoid gate $\sigma$.
With a single sigmoid gate, the forward pass is $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$, $a_{31} = \sigma(a_{21})$, and the only new local derivative needed is $\dfrac{\partial a_{31}}{\partial a_{21}} = \sigma(a_{21})(1 - \sigma(a_{21}))$.
Backpropagation – Single Sigmoid Gate, Numerical Example

The same numbers ($x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$, $\partial L/\partial a = 1$) give $\sigma(-5) \approx 0.0067$, $\partial L / \partial a_{21} = \sigma(-5)(1 - \sigma(-5)) \approx 6.6\times10^{-3}$, $\partial L / \partial w_1 \approx 0.02$, and $\partial L / \partial w_2 \approx -0.027$, matching the gate-by-gate computation.
Backpropagation – Branching

When a value feeds into more than one downstream node, the gradients from the branches add: $\dfrac{\partial L}{\partial a} = \dfrac{\partial L}{\partial a_1} + \dfrac{\partial L}{\partial a_2}$
Backpropagation – Notation

Layer $l$ computes $a_j^l = f\left(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\right) = f(z_j^l)$, or in vector form $\boldsymbol{a}^l = f(\boldsymbol{w}^l \boldsymbol{a}^{l-1} + \boldsymbol{b}^l) = f(\boldsymbol{z}^l)$.

Error: $\delta_j^l = \dfrac{\partial L}{\partial z_j^l}$, $\quad \boldsymbol{\delta}^l = \nabla_{\boldsymbol{z}} L$
Backpropagation – Error for the Output Layer ($l = N$)

$\delta_j^N = \dfrac{\partial L}{\partial z_j^N} \;\overset{\text{chain rule}}{=}\; \sum_k \dfrac{\partial L}{\partial a_k^N}\,\dfrac{\partial a_k^N}{\partial z_j^N}$

Since $\dfrac{\partial a_k^N}{\partial z_j^N} = 0$ if $k \neq j$, and $a_j^N = f(z_j^N)$:

$\delta_j^N = \dfrac{\partial L}{\partial a_j^N}\,\dfrac{\partial a_j^N}{\partial z_j^N} = \dfrac{\partial L}{\partial a_j^N}\,f'(z_j^N)$

In vector form: $\boldsymbol{\delta}^N = \nabla_{\boldsymbol{a}} L \odot f'(\boldsymbol{z}^N)$, where $\odot$ denotes element-wise multiplication of vectors.
Backpropagation – Error as a Function of the Error in the Next Layer

$\delta_j^l = \dfrac{\partial L}{\partial z_j^l} \;\overset{\text{chain rule}}{=}\; \sum_k \dfrac{\partial L}{\partial z_k^{l+1}}\,\dfrac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta_k^{l+1}\,\dfrac{\partial z_k^{l+1}}{\partial z_j^l}$

Since $z_k^{l+1} = \sum_i w_{ki}^{l+1} a_i^l + b_k^{l+1} = \sum_i w_{ki}^{l+1} f(z_i^l) + b_k^{l+1}$:

$\delta_j^l = \sum_k \delta_k^{l+1} w_{kj}^{l+1} f'(z_j^l)$

In vector form: $\boldsymbol{\delta}^l = \left( (\boldsymbol{w}^{l+1})^T \boldsymbol{\delta}^{l+1} \right) \odot f'(\boldsymbol{z}^l)$
Backpropagation – Gradient with Respect to Bias

$\dfrac{\partial L}{\partial b_j^l} \;\overset{\text{chain rule}}{=}\; \sum_k \dfrac{\partial L}{\partial z_k^l}\,\dfrac{\partial z_k^l}{\partial b_j^l} = \sum_k \delta_k^l\,\dfrac{\partial z_k^l}{\partial b_j^l}$

Since $z_k^l = \sum_i w_{ki}^l a_i^{l-1} + b_k^l$, only the $k = j$ term survives:

$\dfrac{\partial L}{\partial b_j^l} = \delta_j^l$, $\quad \nabla_{\boldsymbol{b}} L = \boldsymbol{\delta}^l$
Backpropagation – Gradient with Respect to Weights

$\dfrac{\partial L}{\partial w_{jk}^l} \;\overset{\text{chain rule}}{=}\; \sum_i \dfrac{\partial L}{\partial z_i^l}\,\dfrac{\partial z_i^l}{\partial w_{jk}^l} = \sum_i \delta_i^l\,\dfrac{\partial z_i^l}{\partial w_{jk}^l}$

Since $z_i^l = \sum_m w_{im}^l a_m^{l-1} + b_i^l$:

$\dfrac{\partial L}{\partial w_{jk}^l} = a_k^{l-1} \delta_j^l$, $\quad \nabla_{\boldsymbol{w}} L = \boldsymbol{\delta}^l (\boldsymbol{a}^{l-1})^T$
Backpropagation – Algorithm

For each training example $x$ in a mini-batch of size $m$:

Forward pass: $\boldsymbol{a}^{x,l} = f(\boldsymbol{z}^{x,l}) = f(\boldsymbol{w}^l \boldsymbol{a}^{x,l-1} + \boldsymbol{b}^l)$

Output error: $\boldsymbol{\delta}^{x,N} = \nabla_{\boldsymbol{a}} L \odot f'(\boldsymbol{z}^{x,N})$

Error backpropagation: $\boldsymbol{\delta}^{x,l} = \left( (\boldsymbol{w}^{l+1})^T \boldsymbol{\delta}^{x,l+1} \right) \odot f'(\boldsymbol{z}^{x,l})$

Gradient descent: $\boldsymbol{b}^l \to \boldsymbol{b}^l - \dfrac{\eta}{m} \sum_x \boldsymbol{\delta}^{x,l}$, $\quad \boldsymbol{w}^l \to \boldsymbol{w}^l - \dfrac{\eta}{m} \sum_x \boldsymbol{\delta}^{x,l} (\boldsymbol{a}^{x,l-1})^T$
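The four steps above can be combined into a complete training loop. A minimal NumPy sketch, training a two-layer sigmoid network; the XOR task, network size, learning rate, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(a, y):
    # Mean cross-entropy over the mini-batch.
    return float(-np.mean(y * np.log(a) + (1 - y) * np.log(1 - a)))

# XOR training set, one example per column.
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)

W1, b1 = rng.normal(size=(4, 2)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))
eta, m = 1.0, X.shape[1]

history = []
for _ in range(5000):
    # Forward pass: a^l = f(w^l a^{l-1} + b^l)
    z1 = W1 @ X + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
    history.append(loss(a2, Y))
    # Output error (cross-entropy with sigmoid output): delta^N = a - y
    d2 = a2 - Y
    # Error backpropagation: delta^l = ((w^{l+1})^T delta^{l+1}) * f'(z^l)
    d1 = (W2.T @ d2) * a1 * (1 - a1)
    # Gradient-descent step, averaged over the mini-batch
    W2 -= eta / m * d2 @ a1.T; b2 -= eta / m * d2.sum(axis=1, keepdims=True)
    W1 -= eta / m * d1 @ X.T;  b1 -= eta / m * d1.sum(axis=1, keepdims=True)

assert history[-1] < history[0]   # training reduces the loss
```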
Initialization

With inputs $x_i$, $i = 1..N$ ($N = 100$), normally distributed ($\mu = 0$, $\sigma = 1$): if the weights $w_i$ are also initialized as normally distributed ($\mu = 0$, $\sigma = 1$), the pre-activation $z = \sum_i w_i x_i$ is spread over a wide range where $\sigma(z)$ saturates and $\sigma'(z) \approx 0$. Initializing the weights with a smaller spread ($\mu = 0$, $\sigma = 1/N$) keeps $z$ in the range where the sigmoid has a useful gradient.
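The effect of the weight scale on the spread of $z$ can be simulated directly (a sketch; trial counts are illustrative):

```python
import random, statistics

random.seed(0)
N = 100

def sample_z_std(weight_std, trials=1000):
    # Spread of z = sum_i w_i x_i with x_i ~ N(0,1) and w_i ~ N(0, weight_std^2).
    zs = []
    for _ in range(trials):
        z = sum(random.gauss(0, weight_std) * random.gauss(0, 1) for _ in range(N))
        zs.append(z)
    return statistics.stdev(zs)

# With unit-variance weights, z has standard deviation ~sqrt(N) = 10, deep in
# the saturated tails of the sigmoid; shrinking the weights keeps z small.
assert sample_z_std(1.0) > 5 * sample_z_std(1.0 / N)
```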
Vanishing Gradient

With sigmoid activations, each layer multiplies the error by $w^{l+1}\sigma'(z^l)$:

$\delta^l = \delta^{l+1} w^{l+1} \sigma'(z^l) \;\Rightarrow\; \delta^l = \dfrac{\partial L}{\partial a^N}\,\sigma'(z^N) \prod_{j=l}^{N-1} w^{j+1} \sigma'(z^j) \;\Rightarrow\; \dfrac{\delta^l}{\delta^{l+k}} = \prod_{j=l}^{l+k-1} w^{j+1} \sigma'(z^j)$

If $\left| w^{j+1} \right| < 1$ and $\sigma'(z^j) \le \sigma'(0) = \dfrac{1}{4}$, then $\left|\dfrac{\delta^l}{\delta^{l+k}}\right| < \left(\dfrac{1}{4}\right)^k$: the gradient shrinks exponentially with depth.
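The $(1/4)^k$ bound is easy to check numerically (a small sketch; the weight value 0.9 is an illustrative choice):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    return sigmoid(z) * (1 - sigmoid(z))

# Each layer contributes a factor |w * sigma'(z)|; with |w| < 1 and
# sigma'(z) <= 1/4 the product shrinks exponentially with depth.
w, z = 0.9, 0.0
factor = abs(w) * dsigmoid(z)          # 0.9 * 0.25 = 0.225 per layer
ratio_10_layers = factor ** 10

assert dsigmoid(0.0) == 0.25           # sigma'(0) is the maximum of sigma'
assert ratio_10_layers < 0.25 ** 10    # below the (1/4)^k bound
```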
Dying ReLUs

$\mathrm{ReLU}(z) = \max(0, z)$, $\quad \dfrac{d\,\mathrm{ReLU}(z)}{dz} = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z > 0 \end{cases}$

For a ReLU unit $a = \max(0, w x_1 + b)$:

$\dfrac{\partial a}{\partial w} = x_1$, $\dfrac{\partial a}{\partial b} = 1$ if $w x_1 + b > 0$; $\quad \dfrac{\partial a}{\partial w} = 0$, $\dfrac{\partial a}{\partial b} = 0$ if $w x_1 + b < 0$

If the pre-activation is negative for every input, the unit receives no gradient and gradient descent can never revive it.
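A short sketch of a "dead" unit (the weights and inputs below are illustrative):

```python
# A ReLU unit whose pre-activation is negative for every input gets zero
# gradient for both w and b, so gradient descent can never revive it.
def relu_weight_grads(w, b, xs, upstream=1.0):
    grads = []
    for x in xs:
        z = w * x + b
        # da/dw = x if z > 0, else 0 (the b gradient vanishes the same way)
        grads.append(upstream * x if z > 0 else 0.0)
    return grads

xs = [0.5, 1.0, 2.0]
assert relu_weight_grads(w=1.0, b=0.0, xs=xs) == [0.5, 1.0, 2.0]   # live unit
assert relu_weight_grads(w=-1.0, b=-1.0, xs=xs) == [0.0, 0.0, 0.0]  # dead unit
```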
Universal Function Approximation

"…arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity."

Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS). 1989 Dec 1;2(4):303-14.

[Figures: a sigmoid; the difference of two slightly shifted sigmoids (a localized bump); a target function; approximations of the function with sums of sigmoids.]
Convolutional neural networks (CNNs)

Key ideas: local receptive fields, and shared weights and biases.

Convolution (http://en.wikipedia.org/wiki/Convolution) describes the response of a linear, time-invariant system to an input signal; it is the inverse Fourier transform of the pointwise product in frequency space.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998 Nov;86(11):2278-324.
Convolutional neural networks (CNNs) LeNet-5 LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998 Nov;86(11):2278-324.
Convolutional neural networks (CNNs) AlexNet Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012 (pp. 1097-1105).
Convolutional neural networks (CNNs) VGG Net: deep with small convolution filters.
Convolutional neural networks (CNNs) GoogLeNet Inception modules (parallel layers with multi-scale processing) increases the depth and width of the network while keeping the computational budget constant. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015 (pp. 1-9).
Convolutional neural networks (CNNs) ResNet Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385 [cs.CV], 2015
Transfer Learning

If your data set is limited in size: use a pre-trained network, remove only the last fully connected layer, and train a linear classifier on your data set; or fine-tune part of the network using backpropagation.

Example: "An image of a skin lesion (for example, melanoma) is sequentially warped into a probability distribution over clinical classes of skin disease using Google Inception v3 CNN architecture pretrained on the ImageNet dataset (1.28 million images over 1,000 generic object classes) and fine-tuned on our own dataset of 129,450 skin lesions comprising 2,032 different diseases." Esteva et al., "Dermatologist-level classification of skin cancer with deep neural networks", Nature, 2017.
Data Augmentation

Translations, rotations, reflections, changes in intensity and color of illumination, and deformations.
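Two of these augmentations can be sketched on a tiny array (a toy example; real pipelines operate on image tensors):

```python
# Simple augmentations of a tiny 2x3 "image" stored as nested lists.
img = [[1, 2, 3],
       [4, 5, 6]]

def reflect_horizontal(image):
    # Mirror each row left-to-right.
    return [row[::-1] for row in image]

def translate_right(image, fill=0):
    # Shift every row one pixel to the right, padding with a fill value.
    return [[fill] + row[:-1] for row in image]

assert reflect_horizontal(img) == [[3, 2, 1], [6, 5, 4]]
assert translate_right(img) == [[0, 1, 2], [0, 4, 5]]
```

Each transform yields a new labeled training example at essentially no cost, which is why augmentation is a cheap regularizer.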
Batch Normalization

For each mini-batch $B$, normalize between layers:

$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$, $\quad y_i = \gamma \hat{x}_i + \beta$

Provides regularization and faster learning.

Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. 2015 Feb 11.
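The normalize-then-scale-and-shift step can be sketched for a single feature across a mini-batch (training-time statistics only; running averages for inference are omitted):

```python
import statistics

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize a mini-batch to zero mean / unit variance, then scale and shift.
    mu = statistics.fmean(xs)
    var = statistics.fmean((x - mu) ** 2 for x in xs)
    return [gamma * (x - mu) / (var + eps) ** 0.5 + beta for x in xs]

ys = batch_norm([1.0, 2.0, 3.0, 4.0])
assert abs(statistics.fmean(ys)) < 1e-9          # zero mean
assert abs(statistics.pstdev(ys) - 1.0) < 1e-2   # ~unit standard deviation
```

The learnable $\gamma$ and $\beta$ let the network undo the normalization if that is what minimizes the loss.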
Regularization - Dropout
Recurrent neural networks (RNNs)

$\boldsymbol{h}_t = \tanh(\boldsymbol{W}_{hh} \boldsymbol{h}_{t-1} + \boldsymbol{W}_{xh} \boldsymbol{x}_t)$, $\quad \boldsymbol{y}_t = \boldsymbol{W}_{hy} \boldsymbol{h}_t$

The same weights $\boldsymbol{W}_{hh}$, $\boldsymbol{W}_{xh}$, and $\boldsymbol{W}_{hy}$ are applied at every time step.
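The recurrence can be sketched with scalar weights for clarity (the weight values and input sequence are illustrative):

```python
import math

def rnn_step(h_prev, x, W_hh, W_xh):
    # h_t = tanh(W_hh * h_{t-1} + W_xh * x_t), scalar version for clarity.
    return math.tanh(W_hh * h_prev + W_xh * x)

# The same weights are reused at every time step.
h = 0.0
for x in [1.0, -0.5, 0.25]:
    h = rnn_step(h, x, W_hh=0.5, W_xh=1.0)

assert -1.0 < h < 1.0   # tanh keeps the hidden state bounded
```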
Recurrent neural networks (RNNs)

Sequence tasks: image captioning, sentiment classification, translation, and classification of each frame of a video.

Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Long Short-Term Memory (LSTM)

RNN: $\boldsymbol{h}_t = \tanh(\boldsymbol{W}_h [\boldsymbol{h}_{t-1}, \boldsymbol{x}_t])$, $\quad \boldsymbol{y}_t = \boldsymbol{W}_y \boldsymbol{h}_t$

LSTM gates:
Forget gate: $\boldsymbol{f}_t = \sigma(\boldsymbol{W}_f [\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_f)$
Input gate: $\boldsymbol{i}_t = \sigma(\boldsymbol{W}_i [\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_i)$
Output gate: $\boldsymbol{o}_t = \sigma(\boldsymbol{W}_o [\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_o)$

Cell state update: $\boldsymbol{C}_t = \boldsymbol{f}_t \odot \boldsymbol{C}_{t-1} + \boldsymbol{i}_t \odot \tanh(\boldsymbol{W}_c [\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_c)$

$\boldsymbol{h}_t = \boldsymbol{o}_t \odot \tanh(\boldsymbol{C}_t)$, $\quad \boldsymbol{y}_t = \boldsymbol{W}_y \boldsymbol{h}_t$

Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997 Nov 15;9(8):1735-80.
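A scalar sketch of one LSTM step makes the gate equations concrete (all weight values below are illustrative; the concatenation $[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t]$ becomes a sum of two scalar products):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(c_prev, h_prev, x, Wf, Wi, Wo, Wc):
    # Each W* is a pair (weight for h_prev, weight for x); biases omitted.
    f = sigmoid(Wf[0] * h_prev + Wf[1] * x)   # forget gate
    i = sigmoid(Wi[0] * h_prev + Wi[1] * x)   # input gate
    o = sigmoid(Wo[0] * h_prev + Wo[1] * x)   # output gate
    c_tilde = math.tanh(Wc[0] * h_prev + Wc[1] * x)
    c = f * c_prev + i * c_tilde              # cell state update
    h = o * math.tanh(c)                      # new hidden state
    return c, h

c, h = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:
    c, h = lstm_step(c, h, x, Wf=(0.1, 0.2), Wi=(0.3, 0.4),
                     Wo=(0.5, 0.6), Wc=(0.7, 0.8))

assert -1.0 < h < 1.0   # h = o * tanh(c) is bounded
```

The additive cell-state update $c = f \cdot c_{\text{prev}} + i \cdot \tilde{c}$ is what lets gradients flow across many time steps without vanishing.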
Image Captioning – Combining CNNs and RNNs

Karpathy A, Fei-Fei L. Deep visual-semantic alignments for generating image descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 3128-3137.
Autoencoders – Unsupervised Learning

Autoencoders learn a lower-dimensional representation through which the input is passed, such that output ≈ input.
Generative Adversarial Networks

Nguyen et al., "Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space", https://arxiv.org/abs/1612.00005.
Deep Dream

Google DeepDream applied to Hieronymus Bosch's The Garden of Earthly Delights.
Artistic Style

L.A. Gatys, A.S. Ecker, M. Bethge, "A Neural Algorithm of Artistic Style", https://arxiv.org/pdf/1508.06576v1.pdf
Adversarial Fooling Examples

A correctly classified image, combined with a carefully chosen but imperceptible perturbation, is confidently classified as an ostrich.

Szegedy et al., "Intriguing properties of neural networks", https://arxiv.org/abs/1312.6199