Machine Learning – Neural Networks David Fenyő


Machine Learning – Neural Networks David Fenyő Contact: David@FenyoLab.org

Example: Skin Cancer Diagnosis. Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks”, Nature, 2017.

Example: Histopathological Diagnosis. Litjens et al., “Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis”, Scientific Reports, 2016.

Example: The Digital Mammography DREAM Challenge. The Digital Mammography DREAM Challenge will attempt to improve the predictive accuracy of digital mammography for the early detection of breast cancer. The primary benefit of this Challenge will be to establish new quantitative tools - machine learning, deep learning or other - that can help decrease the recall rate of screening mammography, with a potential impact on shifting the balance of routine breast cancer screening towards more benefit and less harm. Participating teams will be asked to submit predictive models based on over 640,000 de-identified digital mammography images from over 86,000 subjects, with corresponding clinical variables. https://www.synapse.org/#!Synapse:syn4224222/wiki/401743

Architecture. Each neuron computes $a = f\left(\sum_i w_i x_i + b\right)$ from its inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, and bias $b$. The neurons are organized into an input layer, one or more hidden layers, and an output layer.
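
A minimal NumPy sketch of this forward computation, for a single neuron and for a small input–hidden–output stack (the layer sizes and the choice of the sigmoid for $f$ are illustrative assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_forward(x, w, b, f=sigmoid):
    """Single neuron: a = f(sum_i w_i * x_i + b)."""
    return f(np.dot(w, x) + b)

def layer_forward(x, W, b, f=sigmoid):
    """A whole layer sharing the same input x: a = f(W x + b)."""
    return f(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                              # inputs x1..xn
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)       # input -> hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)       # hidden -> output
hidden = layer_forward(x, W1, b1)
output = layer_forward(hidden, W2, b2)
print(hidden, output)
```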

Activation Functions
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Hyperbolic tangent: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$

Activation Function Derivatives
Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, with $\frac{d\sigma(z)}{dz} = \sigma(z)\,(1 - \sigma(z))$
ReLU: $\mathrm{ReLU}(z) = \max(0, z)$, with $\frac{d\,\mathrm{ReLU}(z)}{dz} = 0$ if $z < 0$ and $1$ if $z > 0$
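
A small NumPy sketch of these activation functions and their derivatives, with a finite-difference check of the sigmoid derivative (the test points and tolerance are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def drelu(z):
    return (z > 0).astype(float)

z = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
# Finite-difference check of dsigma/dz = sigma(z)(1 - sigma(z))
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(np.max(np.abs(numeric - dsigmoid(z))))        # ~1e-10 or smaller
# tanh(z) equals 2*sigmoid(2z) - 1
print(np.max(np.abs(np.tanh(z) - (2 * sigmoid(2 * z) - 1))))
```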

Faster Learning with ReLU. A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012 (pp. 1097-1105).

Activation Functions for the Output Layer
Binary classification – sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$
Multi-class classification – softmax: $\mathrm{softmax}(\boldsymbol{z})_k = \frac{e^{z_k}}{\sum_j e^{z_j}}$
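
A sketch of these output activations in NumPy; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """softmax_k = exp(z_k) / sum_j exp(z_j), computed stably."""
    e = np.exp(z - np.max(z))       # subtracting max(z) avoids overflow
    return e / np.sum(e)

print(sigmoid(0.5))                              # binary classification score
logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits), softmax(logits).sum())    # class probabilities summing to 1
```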

Gradient Descent with a Sigmoid Activation Function: $\operatorname*{argmin}_{\boldsymbol{w}} L(\boldsymbol{w})$, updated as $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta\,\nabla L(\boldsymbol{w}_n)$. With a sigmoid activation, the likelihood loss function has an analytical expression for its derivative:
$L = -\frac{1}{n}\sum_j \left( y_j \log \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) + (1 - y_j)\log\left(1 - \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})\right) \right)$
$\nabla_{\boldsymbol{w}} L = -\frac{1}{n}\sum_j \left( \frac{y_j}{\sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})} - \frac{1 - y_j}{1 - \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w})} \right) \nabla_{\boldsymbol{w}}\,\sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) = \frac{1}{n}\sum_j \boldsymbol{x}_j \left( \sigma(\boldsymbol{x}_j \cdot \boldsymbol{w}) - y_j \right)$
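
A minimal sketch of gradient descent with this analytical gradient on synthetic, linearly separable data (the learning rate, number of steps, and toy data are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -3.0])
y = (X @ w_true > 0).astype(float)      # synthetic, linearly separable labels

w = np.zeros(d)
eta = 0.5
for step in range(500):
    p = sigmoid(X @ w)                  # sigma(x_j . w)
    grad = X.T @ (p - y) / n            # (1/n) sum_j x_j (sigma(x_j . w) - y_j)
    w = w - eta * grad                  # w_{n+1} = w_n - eta * grad L(w_n)
print(w)                                # roughly proportional to w_true
```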

Gradient Descent – Momentum and Friction
Plain gradient descent: $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta\,\nabla L(\boldsymbol{w}_n)$
Momentum (partially remembering previous gradients): $\boldsymbol{v}_n = \gamma \boldsymbol{v}_{n-1} + \eta\,\nabla L(\boldsymbol{w}_n)$, $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \boldsymbol{v}_n$
Nesterov accelerated gradient: $\boldsymbol{v}_n = \gamma \boldsymbol{v}_{n-1} + \eta\,\nabla L(\boldsymbol{w}_n - \gamma \boldsymbol{v}_{n-1})$, $\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \boldsymbol{v}_n$
Adagrad: decreases the learning rate monotonically based on the sum of squared past gradients.
Adadelta & RMSprop: extensions of Adagrad that slowly forget past gradients.
Adaptive Moment Estimation (Adam): adaptive learning rates and momentum.
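
A sketch of the momentum and Nesterov updates for a generic gradient function (the values of γ and η and the quadratic test loss are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad_L, eta=0.1, gamma=0.9):
    """v_n = gamma v_{n-1} + eta grad L(w_n);  w_{n+1} = w_n - v_n."""
    v = gamma * v + eta * grad_L(w)
    return w - v, v

def nesterov_step(w, v, grad_L, eta=0.1, gamma=0.9):
    """Same, but the gradient is evaluated at the look-ahead point w_n - gamma v_{n-1}."""
    v = gamma * v + eta * grad_L(w - gamma * v)
    return w - v, v

# Toy example: minimize L(w) = 0.5 ||w||^2, whose gradient is grad L(w) = w
grad_L = lambda w: w
w, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    w, v = nesterov_step(w, v, grad_L)
print(w)    # close to the minimum at the origin
```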

Mini-Batch Gradient Descent: uses a subset of the training set to calculate the gradient for each step.
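
A sketch of the mini-batch loop, here reusing the logistic-regression gradient from above (the batch size, learning rate, and per-epoch reshuffling are assumptions):

```python
import numpy as np

def minibatch_sgd(X, y, grad, w0, eta=0.1, batch_size=32, epochs=10, seed=0):
    """Each update uses the gradient computed on one mini-batch only."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - eta * grad(w, X[idx], y[idx])   # gradient on the subset
    return w

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad = lambda w, Xb, yb: Xb.T @ (sigmoid(Xb @ w) - yb) / len(yb)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X @ np.array([1.0, -2.0]) > 0).astype(float)
print(minibatch_sgd(X, y, grad, w0=np.zeros(2)))    # direction roughly [1, -2]
```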

Backpropagation. For a neuron $a = f(z) = f\left(\sum_i w_i x_i + b\right)$, the chain rule gives $\frac{\partial L}{\partial w_i} = \frac{\partial a}{\partial w_i}\,\frac{\partial L}{\partial a}$.

Backpropagation through an addition node, $a = y + c$: $\frac{\partial a}{\partial y} = \frac{\partial (y + c)}{\partial y} = \frac{\partial y}{\partial y} + \frac{\partial c}{\partial y} = 1$, so $\frac{\partial L}{\partial y} = \frac{\partial a}{\partial y}\,\frac{\partial L}{\partial a} = \frac{\partial L}{\partial a}$.

Backpropagation through an addition node with two inputs, $a = y_1 + y_2$: $\frac{\partial a}{\partial y_i} = 1$, so $\frac{\partial L}{\partial y_i} = \frac{\partial L}{\partial a}$ for both inputs.

Backpropagation through multiplication by a constant, $a = c\,y$: $\frac{\partial a}{\partial y} = c$, so $\frac{\partial L}{\partial y} = c\,\frac{\partial L}{\partial a}$.

Backpropagation through the graph $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$: with $\frac{\partial a_{11}}{\partial w_1} = x_1$ and $\frac{\partial a_{21}}{\partial a_{11}} = 1$, the chain rule gives $\frac{\partial L}{\partial w_1} = \frac{\partial a_{11}}{\partial w_1}\,\frac{\partial a_{21}}{\partial a_{11}}\,\frac{\partial L}{\partial a_{21}} = x_1 (1) \frac{\partial L}{\partial a_{21}}$, and similarly $\frac{\partial L}{\partial w_2} = \frac{\partial a_{12}}{\partial w_2}\,\frac{\partial a_{21}}{\partial a_{12}}\,\frac{\partial L}{\partial a_{21}} = x_2 (1) \frac{\partial L}{\partial a_{21}}$.

Backpropagation, numerical example with $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$: the forward pass gives $a_{11} = 3$, $a_{12} = -8$, $a_{21} = -5$; the backward pass multiplies the local gradients by $\frac{\partial L}{\partial a}$, giving $\frac{\partial L}{\partial w_1} = 3\,\frac{\partial L}{\partial a}$, $\frac{\partial L}{\partial w_2} = -4\,\frac{\partial L}{\partial a}$, $\frac{\partial L}{\partial x_1} = 1\,\frac{\partial L}{\partial a}$, and $\frac{\partial L}{\partial x_2} = 2\,\frac{\partial L}{\partial a}$.
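
A quick check of these numbers, comparing the chain-rule gradients with finite differences; the downstream gradient ∂L/∂a is taken to be 1 for illustration:

```python
x1, w1, x2, w2 = 3.0, 1.0, -4.0, 2.0

def forward(w1, w2):
    return w1 * x1 + w2 * x2              # a21 = a11 + a12 = -5 for these values

dL_da = 1.0                               # assumed downstream gradient dL/da21
dL_dw1 = x1 * 1.0 * dL_da                 # = 3
dL_dw2 = x2 * 1.0 * dL_da                 # = -4

eps = 1e-6                                # finite-difference check of both gradients
print(dL_dw1, (forward(w1 + eps, w2) - forward(w1 - eps, w2)) / (2 * eps))
print(dL_dw2, (forward(w1, w2 + eps) - forward(w1, w2 - eps)) / (2 * eps))
```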

Backpropagation through a max node, $a = \max(y_1, y_2)$: $\frac{\partial a}{\partial y_1} = 1$ and $\frac{\partial a}{\partial y_2} = 0$ if $y_1 > y_2$; $\frac{\partial a}{\partial y_1} = 0$ and $\frac{\partial a}{\partial y_2} = 1$ if $y_1 < y_2$.

Backpropagation through a reciprocal node, $a = 1/y$: $\frac{\partial a}{\partial y} = -\frac{1}{y^2}$, so $\frac{\partial L}{\partial y} = -\frac{1}{y^2}\,\frac{\partial L}{\partial a}$.

Backpropagation through an exponential node, $a = e^{y}$: $\frac{\partial a}{\partial y} = e^{y}$, so $\frac{\partial L}{\partial y} = e^{y}\,\frac{\partial L}{\partial a}$.

Backpropagation through the sigmoid built from elementary operations: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$, $a_{31} = -a_{21}$, $a_{41} = e^{a_{31}}$, $a_{51} = 1 + a_{41}$, $a_{61} = 1/a_{51}$. The local gradients are $\frac{\partial a_{11}}{\partial w_1} = x_1$, $\frac{\partial a_{12}}{\partial w_2} = x_2$, $\frac{\partial a_{21}}{\partial a_{11}} = \frac{\partial a_{21}}{\partial a_{12}} = 1$, $\frac{\partial a_{31}}{\partial a_{21}} = -1$, $\frac{\partial a_{41}}{\partial a_{31}} = e^{a_{31}}$, $\frac{\partial a_{51}}{\partial a_{41}} = 1$, $\frac{\partial a_{61}}{\partial a_{51}} = -\frac{1}{a_{51}^2}$, so
$\frac{\partial L}{\partial w_1} = \frac{\partial a_{11}}{\partial w_1}\frac{\partial a_{21}}{\partial a_{11}}\frac{\partial a_{31}}{\partial a_{21}}\frac{\partial a_{41}}{\partial a_{31}}\frac{\partial a_{51}}{\partial a_{41}}\frac{\partial a_{61}}{\partial a_{51}}\frac{\partial L}{\partial a_{61}} = x_1 (1)(-1)\,e^{a_{31}}(1)\left(-\frac{1}{a_{51}^2}\right)\frac{\partial L}{\partial a_{61}}$, and $\frac{\partial L}{\partial w_2}$ is the same with $x_2$ in place of $x_1$.

Backpropagation, numerical example for the same graph with $x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$ (taking $\frac{\partial L}{\partial a_{61}} = 1$): the forward pass gives $a_{21} = -5$, $a_{31} = 5$, $a_{41} = e^{5} \approx 148.4$, $a_{51} \approx 149.4$, $a_{61} \approx 0.0067$; the backward pass gives $\frac{\partial L}{\partial a_{51}} \approx -4.5\times 10^{-5}$, $\frac{\partial L}{\partial a_{31}} \approx -6.6\times 10^{-3}$, $\frac{\partial L}{\partial a_{21}} \approx 6.6\times 10^{-3}$, $\frac{\partial L}{\partial w_1} \approx 0.02$, and $\frac{\partial L}{\partial w_2} \approx -0.027$.
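
A sketch reproducing this forward and backward pass step by step (again taking ∂L/∂a61 = 1):

```python
import numpy as np

x1, w1, x2, w2 = 3.0, 1.0, -4.0, 2.0

# Forward pass through the elementary operations
a11 = w1 * x1          #  3
a12 = w2 * x2          # -8
a21 = a11 + a12        # -5
a31 = -a21             #  5
a41 = np.exp(a31)      #  148.4
a51 = 1.0 + a41        #  149.4
a61 = 1.0 / a51        #  0.0067 = sigmoid(-5)

# Backward pass, multiplying the local gradients (dL/da61 assumed to be 1)
d61 = 1.0
d51 = (-1.0 / a51**2) * d61     # -4.5e-5
d41 = 1.0 * d51
d31 = np.exp(a31) * d41         # -6.6e-3
d21 = -1.0 * d31                #  6.6e-3
dw1 = x1 * d21                  #  0.02
dw2 = x2 * d21                  # -0.027
print(a61, d21, dw1, dw2)
```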

Backpropagation: the sub-chain following the weighted sum (negate, exponentiate, add 1, take the reciprocal) can be grouped into a single sigmoid node σ.

Backpropagation with the sigmoid as a single node: $a_{11} = w_1 x_1$, $a_{12} = w_2 x_2$, $a_{21} = a_{11} + a_{12}$, $a_{31} = \sigma(a_{21})$. The local gradients are $\frac{\partial a_{11}}{\partial w_1} = x_1$, $\frac{\partial a_{12}}{\partial w_2} = x_2$, $\frac{\partial a_{21}}{\partial a_{11}} = \frac{\partial a_{21}}{\partial a_{12}} = 1$, and $\frac{\partial a_{31}}{\partial a_{21}} = \sigma(a_{21})\left(1 - \sigma(a_{21})\right)$.

Backpropagation, numerical example with the sigmoid node ($x_1 = 3$, $w_1 = 1$, $x_2 = -4$, $w_2 = 2$): the same gradients are obtained in fewer steps, $\frac{\partial L}{\partial a_{21}} \approx 6.6\times 10^{-3}$, $\frac{\partial L}{\partial w_1} \approx 0.02$, and $\frac{\partial L}{\partial w_2} \approx -0.027$.

Backpropagation: when a node's output feeds several downstream branches, the gradients from the branches are added, $\frac{\partial L}{\partial a} = \frac{\partial L}{\partial a_1} + \frac{\partial L}{\partial a_2}$.

Backpropagation notation (Layer 1, Layer 2, Layer 3, ...): the activation of neuron $j$ in layer $l$ is $a_j^l = f\left(\sum_k w_{jk}^l a_k^{l-1} + b_j^l\right) = f(z_j^l)$, or in vector form $\boldsymbol{a}^l = f(\boldsymbol{w}^l \boldsymbol{a}^{l-1} + \boldsymbol{b}^l) = f(\boldsymbol{z}^l)$; for example, $w_{74}^2$ is the weight into neuron 7 of layer 2 from neuron 4 of the previous layer, with bias $b_7^2$ and activation $a_7^2$. The error is defined as $\delta_j^l = \frac{\partial L}{\partial z_j^l}$, i.e. $\boldsymbol{\delta}^l = \nabla_{\boldsymbol{z}} L$.

Backpropagation – error for the output layer ($l = N$): $\delta_j^N = \frac{\partial L}{\partial z_j^N}$. By the chain rule, $\delta_j^N = \sum_k \frac{\partial L}{\partial a_k^N}\frac{\partial a_k^N}{\partial z_j^N}$; since $\frac{\partial a_k^N}{\partial z_j^N} = 0$ for $k \neq j$ and $a_j^N = f(z_j^N)$, this reduces to $\delta_j^N = \frac{\partial L}{\partial a_j^N}\,f'(z_j^N)$, or in vector form $\boldsymbol{\delta}^N = \nabla_{\boldsymbol{a}} L \odot f'(\boldsymbol{z}^N)$, where $\odot$ denotes element-wise multiplication of vectors.

Backpropagation – error as a function of the error in the next layer: $\delta_j^l = \frac{\partial L}{\partial z_j^l} = \sum_k \frac{\partial L}{\partial z_k^{l+1}}\frac{\partial z_k^{l+1}}{\partial z_j^l} = \sum_k \delta_k^{l+1}\frac{\partial z_k^{l+1}}{\partial z_j^l}$. Using $z_k^{l+1} = \sum_i w_{ki}^{l+1} a_i^l + b_k^{l+1} = \sum_i w_{ki}^{l+1} f(z_i^l) + b_k^{l+1}$, this gives $\delta_j^l = \sum_k \delta_k^{l+1} w_{kj}^{l+1} f'(z_j^l)$, or in vector form $\boldsymbol{\delta}^l = \left((\boldsymbol{w}^{l+1})^T \boldsymbol{\delta}^{l+1}\right) \odot f'(\boldsymbol{z}^l)$.

Backpropagation – gradient with respect to the biases: $\frac{\partial L}{\partial b_j^l} = \sum_k \frac{\partial L}{\partial z_k^l}\frac{\partial z_k^l}{\partial b_j^l} = \sum_k \delta_k^l \frac{\partial z_k^l}{\partial b_j^l}$. Since $z_k^l = \sum_i w_{ki}^l a_i^{l-1} + b_k^l$, only the $k = j$ term survives, so $\frac{\partial L}{\partial b_j^l} = \delta_j^l$, or in vector form $\nabla_{\boldsymbol{b}} L = \boldsymbol{\delta}^l$.

Backpropagation – gradient with respect to the weights: $\frac{\partial L}{\partial w_{jk}^l} = \sum_i \frac{\partial L}{\partial z_i^l}\frac{\partial z_i^l}{\partial w_{jk}^l} = \sum_i \delta_i^l \frac{\partial z_i^l}{\partial w_{jk}^l}$. Since $z_i^l = \sum_m w_{im}^l a_m^{l-1} + b_i^l$, this gives $\frac{\partial L}{\partial w_{jk}^l} = a_k^{l-1}\delta_j^l$, or in vector form $\nabla_{\boldsymbol{w}} L = \boldsymbol{\delta}^l (\boldsymbol{a}^{l-1})^T$.

Backpropagation algorithm. For each training example $x$ in a mini-batch of size $m$:
Forward pass: $\boldsymbol{a}^{x,l} = f(\boldsymbol{z}^{x,l}) = f(\boldsymbol{w}^l \boldsymbol{a}^{x,l-1} + \boldsymbol{b}^l)$
Output error: $\boldsymbol{\delta}^{x,N} = \nabla_{\boldsymbol{a}} L \odot f'(\boldsymbol{z}^{x,N})$
Error backpropagation: $\boldsymbol{\delta}^{x,l} = \left((\boldsymbol{w}^{l+1})^T \boldsymbol{\delta}^{x,l+1}\right) \odot f'(\boldsymbol{z}^{x,l})$
Gradient descent: $\boldsymbol{b}^l \to \boldsymbol{b}^l - \frac{\eta}{m}\sum_x \boldsymbol{\delta}^{x,l}$, $\boldsymbol{w}^l \to \boldsymbol{w}^l - \frac{\eta}{m}\sum_x \boldsymbol{\delta}^{x,l} (\boldsymbol{a}^{x,l-1})^T$
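
A minimal NumPy sketch of this algorithm for a small fully connected network with sigmoid activations and a squared-error loss; the architecture, loss, toy XOR data, and hyperparameters are my own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [2, 8, 1]                                   # layer widths (input, hidden, output)
Ws = [rng.normal(0, 1.0 / np.sqrt(m), size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros((n, 1)) for n in sizes[1:]]

def train_step(X, Y, eta=2.0):
    """X: (features, m), Y: (outputs, m). One gradient-descent step on a mini-batch of size m."""
    m = X.shape[1]
    # Forward pass: a^l = f(w^l a^{l-1} + b^l)
    a, zs, activations = X, [], [X]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)
    # Output error: delta^N = grad_a L (.) f'(z^N), here for a squared-error loss
    delta = (activations[-1] - Y) * dsigmoid(zs[-1])
    for l in range(len(Ws) - 1, -1, -1):
        grad_W = delta @ activations[l].T           # sum over the batch of delta^l (a^{l-1})^T
        grad_b = delta.sum(axis=1, keepdims=True)
        if l > 0:                                   # delta^l = ((w^{l+1})^T delta^{l+1}) (.) f'(z^l)
            delta = (Ws[l].T @ delta) * dsigmoid(zs[l - 1])
        Ws[l] -= (eta / m) * grad_W                 # gradient-descent update, averaged over the batch
        bs[l] -= (eta / m) * grad_b

# Toy problem: XOR
X = np.array([[0, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
Y = np.array([[0, 1, 1, 0]], dtype=float)
for _ in range(10000):
    train_step(X, Y)
a = X
for W, b in zip(Ws, bs):
    a = sigmoid(W @ a + b)
print(a.round(2))   # should approach [[0, 1, 1, 0]]
```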

Initialization. With inputs $x_i$, $i = 1..N$ ($N = 100$), normally distributed ($\mu = 0$, $\sigma = 1$), compare the distributions of $z = \sum_i w_i x_i$, $\sigma(z)$, and $\sigma'(z)$ when the weights $w_i$ are initialized as normally distributed with ($\mu = 0$, $\sigma = 1$) versus ($\mu = 0$, $\sigma = 1/N$): with unit variance the pre-activations $z$ spread far into the saturated regions of the sigmoid where $\sigma'(z) \approx 0$, whereas the scaled initialization keeps $z$ near zero where the gradient is largest.
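
A sketch of this comparison: N = 100 standard-normal inputs, with weights drawn using σ = 1 versus σ = 1/N:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
N, trials = 100, 10000
x = rng.normal(0.0, 1.0, size=(trials, N))             # inputs: mu=0, sigma=1

for w_sigma in (1.0, 1.0 / N):
    w = rng.normal(0.0, w_sigma, size=(trials, N))     # weights: mu=0, sigma=w_sigma
    z = np.sum(w * x, axis=1)                          # z = sum_i w_i x_i
    print(w_sigma, z.std().round(3), dsigmoid(z).mean().round(4))
# With sigma=1 the z values spread over roughly +-10, where sigma'(z) is ~0 (saturation);
# with sigma=1/N they stay near 0, where sigma'(z) is close to its maximum of 1/4.
```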

Vanishing Gradient. For a chain of sigmoid units, $\delta^l = \delta^{l+1} w^{l+1} \sigma'(z^l)$, so $\delta^N = \frac{\partial L}{\partial a^N}\sigma'(z^N)$ and $\delta^l = \frac{\partial L}{\partial a^N}\sigma'(z^N)\prod_{j=l}^{N-1} w^{j+1}\sigma'(z^j)$, hence $\frac{\delta^l}{\delta^{l+k}} = \prod_{j=l}^{l+k-1} w^{j+1}\sigma'(z^j)$. If $\left|w^{j+1}\right| < 1$ and $\sigma'(z^j) \leq \sigma'(0) = \frac{1}{4}$, then $\left|\frac{\delta^l}{\delta^{l+k}}\right| < \left(\frac{1}{4}\right)^k$: the gradient shrinks exponentially with depth.
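
A small numerical illustration of this bound, backpropagating through a chain of single sigmoid units with |w| < 1 (the depth and the weight values are illustrative):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)
depth = 20
w = rng.uniform(-1.0, 1.0, size=depth)        # |w| < 1 in every layer

# Forward pass through a chain of single sigmoid units
z = np.empty(depth)
a = 1.0
for l in range(depth):
    z[l] = w[l] * a
    a = sigmoid(z[l])

# Backward pass: delta^l = delta^{l+1} w^{l+1} sigma'(z^l), with dL/da^N = 1
delta = dsigmoid(z[depth - 1])
print(depth - 1, abs(delta))
for l in range(depth - 2, -1, -1):
    delta = delta * w[l + 1] * dsigmoid(z[l])
    print(l, abs(delta))                      # shrinks at least as fast as (1/4)^k
```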

Dying ReLUs. For a ReLU unit $a = \max(0, w x_1 + b)$, with $\frac{d\,\mathrm{ReLU}(z)}{dz} = 0$ if $z < 0$ and $1$ if $z > 0$: when $w x_1 + b > 0$, $\frac{\partial a}{\partial w} = x_1$ and $\frac{\partial a}{\partial b} = 1$; but when $w x_1 + b < 0$, $\frac{\partial a}{\partial w} = 0$ and $\frac{\partial a}{\partial b} = 0$, so no gradient flows through the unit and it can stop learning.

Universal Function Approximation. “…arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity.” Cybenko G. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems (MCSS). 1989 Dec 1;2(4):303-14. Figures: a sigmoid; the subtraction of two sigmoids with a small shift; a target function; approximations of the function with sigmoids.
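
A sketch of the construction hinted at by the figures: subtracting two slightly shifted, steep sigmoids gives a localized bump, and a weighted sum of such bumps approximates a function (the steepness, grid, and target function are my own choices):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def bump(x, center, width, steep=50.0):
    """Difference of two shifted sigmoids: roughly 1 inside the interval, roughly 0 outside."""
    return (sigmoid(steep * (x - (center - width / 2)))
            - sigmoid(steep * (x - (center + width / 2))))

target = lambda x: np.sin(2 * np.pi * x)          # function to approximate
x = np.linspace(0.0, 1.0, 500)
centers = np.linspace(0.0, 1.0, 30)
width = centers[1] - centers[0]
approx = sum(target(c) * bump(x, c, width) for c in centers)
print(np.max(np.abs(approx - target(x))))         # error shrinks with more, steeper sigmoids
```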

Convolutional neural networks (CNNs): local receptive fields and shared weights and biases. Convolution (http://en.wikipedia.org/wiki/Convolution) describes the response of a linear, time-invariant system to an input signal; it is the inverse Fourier transform of the pointwise product in frequency space. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998 Nov;86(11):2278-324.
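
A minimal sketch of the shared-weight, local-receptive-field operation at the core of a convolutional layer (a 'valid' 2D cross-correlation, as convolutional layers are typically implemented):

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide one shared kernel over the image; every output position reuses the same weights."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]     # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)    # a simple vertical-edge filter
print(conv2d_valid(image, edge_kernel).shape)     # (6, 6) feature map
```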

Convolutional neural networks (CNNs): LeNet-5. LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proceedings of the IEEE. 1998 Nov;86(11):2278-324.

Convolutional neural networks (CNNs): AlexNet. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 2012 (pp. 1097-1105).

Convolutional neural networks (CNNs): VGG Net – deep, with small convolution filters.

Convolutional neural networks (CNNs): GoogLeNet. Inception modules (parallel layers with multi-scale processing) increase the depth and width of the network while keeping the computational budget constant. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2015 (pp. 1-9).

Convolutional neural networks (CNNs): ResNet. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition, arXiv:1512.03385 [cs.CV], 2015.

Transfer Learning. If your data set is limited in size: use a pre-trained network, remove only the last fully connected layer, and train a linear classifier on your data set; or fine-tune part of the network using backpropagation. Example: “An image of a skin lesion (for example, melanoma) is sequentially warped into a probability distribution over clinical classes of skin disease using Google Inception v3 CNN architecture pretrained on the ImageNet dataset (1.28 million images over 1,000 generic object classes) and fine-tuned on our own dataset of 129,450 skin lesions comprising 2,032 different diseases.” Esteva et al., “Dermatologist-level classification of skin cancer with deep neural networks”, Nature, 2017.
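
A hedged PyTorch sketch of this recipe; the slide's example uses Inception v3, while here a torchvision ResNet-18 stands in, and num_classes and the training data are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 2   # placeholder for the number of classes in your own data set

# Load a network pre-trained on ImageNet (the argument name varies across torchvision versions)
model = models.resnet18(pretrained=True)

# Option 1: freeze everything and train only a new final fully connected classifier
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, num_classes)   # replaces the last FC layer

# Option 2 (fine-tuning): additionally unfreeze some of the later layers
for param in model.layer4.parameters():
    param.requires_grad = True

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ...then run the usual training loop (forward, loss, backward, optimizer.step())
# on your own, typically small, labeled data set.
```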

Data Augmentation: translations, rotations, reflections, changes in the intensity and color of illumination, deformations.
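
A small NumPy sketch of a few of these augmentations applied to an image array (the angles, shifts, and intensity factors are arbitrary examples; real pipelines usually rely on a library):

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                        # H x W x C image with values in [0, 1]

def augment(img, rng):
    out = img
    if rng.random() < 0.5:                             # reflection
        out = np.flip(out, axis=1)
    out = np.rot90(out, k=int(rng.integers(0, 4)))     # rotation by a multiple of 90 degrees
    out = np.roll(out, int(rng.integers(-3, 4)), axis=0)   # small translation (with wrap-around)
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)   # illumination intensity change
    return out

batch = np.stack([augment(image, rng) for _ in range(8)])
print(batch.shape)
```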

Batch Normalization. For each mini-batch $B$, normalize between layers: $\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}$, $y_i = \gamma \hat{x}_i + \beta$. This provides regularization and faster learning. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. 2015 Feb 11.
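
A sketch of the batch-normalization forward pass for one layer (the choices of γ, β, and ε are conventional defaults, not from the slide):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch B, then scale and shift."""
    mu = x.mean(axis=0)                       # mu_B
    var = x.var(axis=0)                       # sigma_B^2
    x_hat = (x - mu) / np.sqrt(var + eps)     # x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2 + eps)
    return gamma * x_hat + beta               # y_i = gamma * x_hat_i + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(64, 10))          # a mini-batch of 64 examples
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))     # ~0 mean and ~1 std per feature
```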

Regularization - Dropout

Recurrent neural networks (RNNs): the hidden state is updated as $\boldsymbol{h}_t = \tanh(\boldsymbol{W}_{hh}\boldsymbol{h}_{t-1} + \boldsymbol{W}_{xh}\boldsymbol{x}_t)$ and the output is $\boldsymbol{y}_t = \boldsymbol{W}_{hy}\boldsymbol{h}_t$, with the same weights $\boldsymbol{W}_{xh}$, $\boldsymbol{W}_{hh}$, $\boldsymbol{W}_{hy}$ applied at every time step.
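
A sketch of these recurrences in NumPy (the dimensions and weight scales are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim, y_dim = 4, 8, 3
Wxh = rng.normal(0, 0.1, size=(h_dim, x_dim))
Whh = rng.normal(0, 0.1, size=(h_dim, h_dim))
Why = rng.normal(0, 0.1, size=(y_dim, h_dim))

def rnn_step(x_t, h_prev):
    """h_t = tanh(Whh h_{t-1} + Wxh x_t);  y_t = Why h_t."""
    h_t = np.tanh(Whh @ h_prev + Wxh @ x_t)
    return h_t, Why @ h_t

h = np.zeros(h_dim)
xs = rng.normal(size=(5, x_dim))         # a length-5 input sequence
for x_t in xs:                           # the same weights are reused at every time step
    h, y = rnn_step(x_t, h)
print(h.shape, y.shape)
```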

Recurrent neural networks (RNNs): applications include image captioning, sentiment classification, translation, and classification of each frame of a video. Andrej Karpathy, The Unreasonable Effectiveness of Recurrent Neural Networks, http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Long Short-Term Memory (LSTM) RNN (compare the plain RNN update $\boldsymbol{h}_t = \tanh(\boldsymbol{W}_h[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t])$):
Forget gate: $\boldsymbol{f}_t = \sigma(\boldsymbol{W}_f[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_f)$
Input gate: $\boldsymbol{i}_t = \sigma(\boldsymbol{W}_i[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_i)$
Output gate: $\boldsymbol{o}_t = \sigma(\boldsymbol{W}_o[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_o)$
Cell state update: $\boldsymbol{C}_t = \boldsymbol{f}_t \odot \boldsymbol{C}_{t-1} + \boldsymbol{i}_t \odot \tanh(\boldsymbol{W}_c[\boldsymbol{h}_{t-1}, \boldsymbol{x}_t] + \boldsymbol{b}_c)$
Hidden state: $\boldsymbol{h}_t = \boldsymbol{o}_t \odot \tanh(\boldsymbol{C}_t)$
Output: $\boldsymbol{y}_t = \boldsymbol{W}_y \boldsymbol{h}_t$
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997 Nov 15;9(8):1735-80.
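
A sketch of one LSTM step following these equations, with [h_{t-1}, x_t] as the concatenated input to each gate (the dimensions and weight scales are placeholders):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x_dim, h_dim, y_dim = 4, 8, 3
def gate_weights():
    return rng.normal(0, 0.1, size=(h_dim, h_dim + x_dim))
Wf, Wi, Wo, Wc = (gate_weights() for _ in range(4))
bf, bi, bo, bc = (np.zeros(h_dim) for _ in range(4))
Wy = rng.normal(0, 0.1, size=(y_dim, h_dim))

def lstm_step(x_t, h_prev, C_prev):
    hx = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f = sigmoid(Wf @ hx + bf)                     # forget gate
    i = sigmoid(Wi @ hx + bi)                     # input gate
    o = sigmoid(Wo @ hx + bo)                     # output gate
    C = f * C_prev + i * np.tanh(Wc @ hx + bc)    # cell state update
    h = o * np.tanh(C)                            # hidden state
    return h, C, Wy @ h                           # y_t = Wy h_t

h, C = np.zeros(h_dim), np.zeros(h_dim)
for x_t in rng.normal(size=(5, x_dim)):
    h, C, y = lstm_step(x_t, h, C)
print(h.shape, C.shape, y.shape)
```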

Image Captioning – Combining CNNs and RNNs. Karpathy, Andrej & Fei-Fei, Li, "Deep visual-semantic alignments for generating image descriptions", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) 3128-3137.

Autoencoders – Unsupervised Learning. Autoencoders learn a lower-dimensional representation such that output ≈ input (input → lower-dimensional representation → output).

Generative Adversarial Networks. Nguyen et al., “Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space”, https://arxiv.org/abs/1612.00005.

Deep Dream. Google DeepDream: The Garden of Earthly Delights. Hieronymus Bosch: The Garden of Earthly Delights.

Artistic Style. L.A. Gatys, A.S. Ecker, M. Bethge, “A Neural Algorithm of Artistic Style”, https://arxiv.org/pdf/1508.06576v1.pdf

Adversarial Fooling Examples: an original, correctly classified image plus a small perturbation is classified as an ostrich. Szegedy et al., “Intriguing properties of neural networks”, https://arxiv.org/abs/1312.6199