1 Deep Learning
Shiyan Hu, Michigan Technological University

2 Neural Network
A neural network is built from "neurons". Different connection patterns lead to different network structures.
Network parameters θ: all the weights and biases in the "neurons".

3 Fully Connected Feedforward Network
[Diagram: with input (1, -1), the first layer computes weighted sums 4 and -2, which the sigmoid function maps to 0.98 and 0.12.]

4 Fully Connected Feedforward Network
[Diagram: the same network extended to three layers; with input (1, -1), the activations propagate layer by layer to the final outputs 0.62 and 0.83.]

5 Fully Connected Feedforward Network
[Diagram: the same network with input (0, 0); the outputs become 0.51 and 0.85. The network defines a function from inputs to outputs.]

6 Fully Connected Feedforward Network
[Diagram: the general structure. The input layer feeds hidden layers 1 through L, and the output layer produces y1, ..., yM. Every neuron in one layer connects to every neuron in the next.]

7 Deep = Many Hidden Layers
Error rates: AlexNet (2012), 8 layers: 16.4%; VGG (2014), 19 layers: 7.3%; GoogleNet (2014): 6.7%.

8 Deep = Many Hidden Layers
Error rates: AlexNet (2012): 16.4%; VGG (2014): 7.3%; GoogleNet (2014): 6.7%; Residual Net (2015): 3.57%.

9 Matrix Operation
The first layer of the earlier example can be written as one matrix operation:
$\sigma\!\left(\begin{bmatrix}1 & -2\\ -1 & 1\end{bmatrix}\begin{bmatrix}1\\ -1\end{bmatrix}+\begin{bmatrix}1\\ 0\end{bmatrix}\right)=\sigma\!\left(\begin{bmatrix}4\\ -2\end{bmatrix}\right)=\begin{bmatrix}0.98\\ 0.12\end{bmatrix}$
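
A minimal numpy sketch of this layer computation; the weight matrix, bias, and input are the values from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])   # weights of the first layer
b = np.array([1.0, 0.0])      # biases
x = np.array([1.0, -1.0])     # input vector

a = sigmoid(W @ x + b)        # -> approximately [0.98, 0.12]
print(a)
```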

10 Neural Network
With weight matrices $W^1, \ldots, W^L$ and bias vectors $b^1, \ldots, b^L$, each layer is one matrix operation:
$a^1 = \sigma(W^1 x + b^1), \quad a^2 = \sigma(W^2 a^1 + b^2), \quad \ldots, \quad y = \sigma(W^L a^{L-1} + b^L)$

11 Neural Network
The whole network is a function y = f(x), a chain of matrix operations:
$y = f(x) = \sigma\big(W^L \cdots \sigma\big(W^2\,\sigma(W^1 x + b^1) + b^2\big) \cdots + b^L\big)$
Because everything is matrix multiplication, parallel computing techniques can be used to speed up the computation.
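
A minimal sketch of this chained computation, assuming the weights and biases are stored as lists of numpy arrays (the second layer's values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute y = f(x) by applying sigma(W a + b) layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

weights = [np.array([[1.0, -2.0], [-1.0, 1.0]]),   # first layer from slide 9
           np.array([[1.0, -1.0], [0.5, 0.5]])]    # made-up second layer
biases  = [np.array([1.0, 0.0]), np.array([0.0, 0.0])]
print(forward(np.array([1.0, -1.0]), weights, biases))
```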

12 Output Layer
The hidden layers act as automatic feature engineering; the output layer is a multi-class classifier, usually with a softmax activation.
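
A minimal sketch of the softmax computation, assuming the usual definition $y_i = e^{z_i} / \sum_j e^{z_j}$:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -2.0])))  # approximately [0.88, 0.12, 0.006]
```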

13 Example Application
Handwriting digit recognition. The input is a 16 x 16 image flattened into 256 values (ink → 1, no ink → 0). The output is a 10-dimensional vector y1, ..., y10, where each dimension represents the confidence of a digit. For example, y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), ..., y10 = 0.2 ("is 0"), so the image is read as "2".

14 Example Application
Handwriting digit recognition: what is needed is a function (the neural network) that maps the 256-dimensional input vector to a 10-dimensional output vector, so that the machine reads the image as "2".

15 Example Application
The network structure (input layer, hidden layers 1 through L, output layer) defines a function set containing the candidate functions for handwriting digit recognition. You need to learn a good function in this function set, i.e., one that minimizes the classification error.

16 Classification Error
Given a set of parameters, compare the softmax output y with the target ŷ (for the digit "1", ŷ1 = 1 and all other components are 0). The loss is the cross entropy:
$C(y,\hat{y}) = -\sum_{i=1}^{10} \hat{y}_i \ln y_i$
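
A minimal sketch of this loss, assuming `y` is the softmax output and `y_hat` is the one-hot target:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i)."""
    return -np.sum(y_hat * np.log(y + 1e-12))  # small epsilon avoids log(0)

y_hat = np.zeros(10); y_hat[0] = 1.0   # one-hot target for the digit "1" (y_hat_1 = 1)
y = np.full(10, 0.02); y[0] = 0.82     # hypothetical network output
print(cross_entropy(y, y_hat))          # small when y matches the target
```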

17 Total Error
For all N training examples x1, ..., xN, the total loss is
$L = \sum_{n=1}^{N} C^n$
Find a function in the function set, i.e., the network parameters $\theta^*$, that minimizes the total loss L.

18 Gradient Descent
The gradient collects the partial derivatives of L with respect to every parameter in θ:
$\nabla L = \left[\frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial w_2},\ \ldots,\ \frac{\partial L}{\partial b_1},\ \ldots\right]^{\top}$
Each parameter is updated by subtracting the gradient scaled by the learning rate η, e.g. $w_1: 0.2 \rightarrow 0.2 - \eta\,\partial L/\partial w_1 = 0.15$, $w_2: -0.1 \rightarrow 0.05$, $b_1: 0.3 \rightarrow 0.2$.
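
A minimal sketch of one gradient descent update; the parameter and gradient values are chosen so that one step with η = 0.05 reproduces the numbers on the slide:

```python
def gradient_descent_step(theta, grad, eta=0.05):
    """One update: theta <- theta - eta * (gradient of L at theta)."""
    return {name: value - eta * grad[name] for name, value in theta.items()}

theta = {'w1': 0.2, 'w2': -0.1, 'b1': 0.3}
grad  = {'w1': 1.0, 'w2': -3.0, 'b1': 2.0}   # illustrative gradient values
print(gradient_descent_step(theta, grad))     # approximately {'w1': 0.15, 'w2': 0.05, 'b1': 0.2}
```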

19 Gradient Descent
The update is repeated: recompute the gradient at the new parameters and subtract η times it again, e.g. $w_1: 0.2 \rightarrow 0.15 \rightarrow 0.09$, $w_2: -0.1 \rightarrow 0.05 \rightarrow 0.15$, $b_1: 0.3 \rightarrow 0.2 \rightarrow 0.10$.

20 Gradient Descent
This is the "learning" in deep learning; even AlphaGo uses this approach. People imagine something elaborate; actually, it is just gradient descent.

21 Backpropagation
Backpropagation: an efficient way to compute $\partial L / \partial w$ in a neural network.

22 Gradient Descent
A network has millions of parameters. Starting from initial parameters, gradient descent needs the gradient with respect to every one of them; to compute these gradients efficiently, we use backpropagation.

23 Backpropagation
Since the total loss is a sum over training examples,
$L(\theta) = \sum_{n=1}^{N} C^n(\theta), \qquad \frac{\partial L(\theta)}{\partial w} = \sum_{n=1}^{N} \frac{\partial C^n(\theta)}{\partial w}$
so it suffices to compute $\partial C^n / \partial w$ for one example $x^n$ at a time.

24 Backpropagation
Consider a neuron whose input is $z = x_1 w_1 + x_2 w_2 + b$. By the chain rule,
$\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w}\,\frac{\partial C}{\partial z}$
Forward pass: compute $\partial z / \partial w$ for all parameters.
Backward pass: compute $\partial C / \partial z$ for all activation function inputs z.

25 Backpropagation – Forward Pass
Compute $\partial z / \partial w$ for all parameters. Since $z = x_1 w_1 + x_2 w_2 + b$, we have $\partial z / \partial w_1 = x_1$ and $\partial z / \partial w_2 = x_2$: the derivative is simply the value of the input connected to that weight.

26 Backpropagation – Forward Pass
[Diagram: in the three-layer example, the forward pass already produces every $\partial z / \partial w$, because each one equals the activation feeding that weight, e.g. $\partial z / \partial w = -1$ at the input, 0.12 after the first layer, and 0.11 after the second.]

27 Backpropagation – Backward Pass
Compute $\partial C / \partial z$ for all activation function inputs z. With $a = \sigma(z)$ feeding into $z'$ and $z''$ through weights $w_3$ and $w_4$,
$\frac{\partial C}{\partial z} = \frac{\partial a}{\partial z}\,\frac{\partial C}{\partial a} = \sigma'(z)\,\frac{\partial C}{\partial a}$

28 Backpropagation – Backward Pass
[Diagram: the sigmoid $\sigma(z)$ and its derivative $\sigma'(z)$; the factor $\partial a / \partial z = \sigma'(z)$ is fixed once z is known from the forward pass.]

29 Backpropagation – Backward Pass
Since $z' = a w_3 + \cdots$ and $z'' = a w_4 + \cdots$, the chain rule gives
$\frac{\partial C}{\partial a} = \frac{\partial z'}{\partial a}\,\frac{\partial C}{\partial z'} + \frac{\partial z''}{\partial a}\,\frac{\partial C}{\partial z''} = w_3\,\frac{\partial C}{\partial z'} + w_4\,\frac{\partial C}{\partial z''}$

30 Backpropagation – Backward Pass
Putting the two steps together:
$\frac{\partial C}{\partial z} = \sigma'(z)\left[w_3\,\frac{\partial C}{\partial z'} + w_4\,\frac{\partial C}{\partial z''}\right]$

31 Backpropagation – Backward Pass
This can be viewed as a "reverse" neuron: $\partial C / \partial z'$ and $\partial C / \partial z''$ flow backward through the weights $w_3$ and $w_4$ and are scaled by $\sigma'(z)$, which is a constant since z is already determined in the forward pass:
$\frac{\partial C}{\partial z} = \sigma'(z)\left[w_3\,\frac{\partial C}{\partial z'} + w_4\,\frac{\partial C}{\partial z''}\right]$

32 Backpropagation – Backward Pass
Case 1: z' and z'' feed the output layer. Then the derivatives can be computed directly from the network outputs and the cost:
$\frac{\partial C}{\partial z'} = \frac{\partial y_1}{\partial z'}\,\frac{\partial C}{\partial y_1}, \qquad \frac{\partial C}{\partial z''} = \frac{\partial y_2}{\partial z''}\,\frac{\partial C}{\partial y_2}$

33 Backpropagation – Backward Pass
Case 2: z' and z'' are not in the output layer. Then $\partial C / \partial z'$ and $\partial C / \partial z''$ themselves depend on derivatives further toward the output.

34 Backpropagation – Backward Pass
Case 2 (continued): apply the same rule recursively. For example, $\partial C / \partial z'$ is obtained from $\partial C / \partial z_a$ and $\partial C / \partial z_b$ in the next layer (through weights $w_5$ and $w_6$), and so on, until we reach the output layer.

35 Backpropagation – Backward Pass
Instead of recursing from the input side, compute $\partial C / \partial z$ starting from the output layer: in a network with activation inputs $z_1, \ldots, z_6$, compute $\partial C / \partial z_5$ and $\partial C / \partial z_6$ first, then $\partial C / \partial z_3$ and $\partial C / \partial z_4$, and finally $\partial C / \partial z_1$ and $\partial C / \partial z_2$.

36 Backpropagation – Backward Pass
This is equivalent to running a "reverse" network from the outputs back to the inputs, where each backward neuron multiplies by $\sigma'(z_i)$; one backward sweep yields every $\partial C / \partial z_i$.

37 Backpropagation – Summary
For every weight w, combine the two passes: the forward pass gives $\partial z / \partial w = a$ (the activation feeding the weight), the backward pass gives $\partial C / \partial z$, and
$\frac{\partial C}{\partial w} = a \cdot \frac{\partial C}{\partial z}$
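
As a sketch, here is backpropagation for a small fully connected sigmoid network, assuming for simplicity a squared-error cost rather than the cross entropy used above (only the output delta would change); `Ws` and `bs` are lists of weight matrices and bias vectors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y_hat, Ws, bs):
    """Gradients dC/dW and dC/db for a fully connected sigmoid network.

    Sketch only: cost is C = 0.5 * ||y - y_hat||^2 (assumption, not the
    lecture's cross entropy).
    """
    # Forward pass: keep every activation a (a[0] is the input x).
    a = [x]
    for W, b in zip(Ws, bs):
        a.append(sigmoid(W @ a[-1] + b))

    # Backward pass: delta = dC/dz, starting from the output layer.
    delta = (a[-1] - y_hat) * a[-1] * (1 - a[-1])   # sigma'(z) = a * (1 - a)
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, a[l]))    # dC/dW = delta * a (forward value)
        grads_b.insert(0, delta)                    # dC/db = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * a[l] * (1 - a[l])  # push dC/dz one layer back
    return grads_W, grads_b
```

Gradient descent then uses these gradients to update every W and b.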

38 Example Application
Handwriting digit recognition: the machine should read the image as "1". MNIST is the "Hello world" data set of deep learning, and Keras provides a loading function for it.
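
A hedged sketch of that loading step (modern import path, which may differ slightly from the version shown in the slides):

```python
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)  # (60000, 28, 28) grayscale images
print(y_train.shape)  # (60000,) digit labels 0-9
```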

39 Keras
Model structure: the 28x28 image is flattened into a 784-dimensional input, followed by two fully connected hidden layers of 500 neurons each and a 10-way softmax output (y1, ..., y10). Available activation functions include softplus, softsign, relu, tanh, hard_sigmoid, and linear.
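
A hedged sketch of such a model in Keras; the layer sizes follow the slide, while the activation choice is illustrative:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(500, activation='sigmoid', input_dim=28 * 28))  # hidden layer 1
model.add(Dense(500, activation='sigmoid'))                     # hidden layer 2
model.add(Dense(10, activation='softmax'))                      # one output per digit
```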

40 Keras

41 Keras
Step 3.1: Configuration: choose the optimizer, e.g. SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, or Nadam.
Step 3.2: Find the optimal network parameters using the training data (images) and labels (digits). (Details to be discussed.)
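
A hedged sketch of these two steps with the model above; the loss, optimizer, batch size, and epoch count are illustrative choices, not values from the slides:

```python
from keras.utils import to_categorical

# Prepare the MNIST arrays loaded earlier: flatten, scale, one-hot encode.
x_train_flat = x_train.reshape(60000, 784).astype('float32') / 255
y_train_onehot = to_categorical(y_train, 10)

# Step 3.1: configuration
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Step 3.2: find the network parameters from (images, labels)
model.fit(x_train_flat, y_train_onehot, batch_size=100, epochs=20)
```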

42 Keras
Save and load models. How to use the neural network at test time: case 1, score the model on a labeled test set; case 2, predict the outputs for new inputs.
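
A hedged sketch of saving, loading, and the two testing cases; the file name is arbitrary, and the test arrays are assumed to be preprocessed the same way as the training data:

```python
from keras.models import load_model

model.save('mnist_model.h5')          # save architecture + weights to a file
model = load_model('mnist_model.h5')  # restore it later

# Case 1: labeled test data -> overall loss and accuracy
score = model.evaluate(x_test_flat, y_test_onehot)
print('Test accuracy:', score[1])

# Case 2: unlabeled inputs -> predicted class probabilities
probs = model.predict(x_test_flat[:5])
print(probs.argmax(axis=1))           # most likely digit for each image
```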

43 Mini-batch
We do not really minimize the total loss. Instead: randomly initialize the network parameters; pick the 1st mini-batch (e.g. examples x1, x31, ...), compute $L' = C^1 + C^{31} + \cdots$, and update the parameters once; pick the 2nd mini-batch (x2, x16, ...), compute $L'' = C^2 + C^{16} + \cdots$, and update once; continue until all mini-batches have been picked. That is one epoch. Then repeat the above process.

44 Mini-batch
Batch size influences both speed and performance; you need to tune it. With 20 epochs, the whole pick-a-batch-and-update cycle is repeated 20 times.

45 Shuffle
Shuffle the training examples for each epoch, so the mini-batches differ from epoch to epoch (e.g. x1 and x31 share a batch in epoch 1, but x1 and x17 in epoch 2). Don't worry: this is the default behavior of Keras.
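
A minimal sketch of mini-batch training with per-epoch shuffling, assuming generic `grad_on_batch` and `update` helpers (both hypothetical):

```python
import numpy as np

def train(params, x, y, grad_on_batch, update, batch_size=100, epochs=20):
    """Shuffle each epoch, then update the parameters once per mini-batch."""
    n = len(x)
    for epoch in range(epochs):
        order = np.random.permutation(n)           # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of one mini-batch
            grads = grad_on_batch(params, x[idx], y[idx])  # gradient of C summed over the batch
            params = update(params, grads)         # one gradient descent step
    return params
```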

46 The Power of Deep?
Results on the training data show that simply going deeper does not always give better performance.

47 Vanishing Gradient Problem
With sigmoid activations, a large change at the input produces only a small change at the output, so the layers near the input receive smaller gradients. An intuitive way to see this is to estimate the derivative as a finite difference:
$\frac{\partial C}{\partial w} \approx \frac{\Delta C}{\Delta w}$
A perturbation $+\Delta w$ in an early layer is squashed by every sigmoid it passes through, so the resulting $\Delta C$ is small.

48 Vanishing Gradient Problem
Layers near the input have smaller gradients and learn very slowly, so they stay almost random; layers near the output have larger gradients and learn very fast, so they converge quickly, but they converge based on the nearly random features below them.

49 ReLU
Rectified Linear Unit (ReLU): a = z when z > 0, and a = 0 when z ≤ 0. Reasons to prefer it over the sigmoid σ(z): 1. it is fast to compute; 2. it alleviates the vanishing gradient problem. [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]

50 ReLU
[Diagram: the ReLU activation, a = z for z > 0 and a = 0 otherwise, used in each neuron of the network.]

51 ReLU
The neurons whose output is 0 can be removed, leaving a thinner, linear network whose early layers do not have smaller gradients. With different input data, different neurons are active, so overall the network is a piecewise linear approximation of a nonlinear function. A sketch of the activation follows below.
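
A one-line sketch of the activation and its gradient (1 on the active side, 0 otherwise):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)      # a = z if z > 0 else 0

def relu_grad(z):
    return (z > 0).astype(float)   # gradient is 1 where the unit is active
```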

52 Maxout
ReLU is a special case of Maxout, a learnable activation function [Ian J. Goodfellow, ICML'13]. The linear outputs of a layer are grouped, and each "neuron" outputs the maximum of its group, e.g. max(5, 7) = 7, max(-1, 1) = 1, max(1, 2) = 2, max(4, 3) = 4.

53 Maxout
ReLU is a special case of Maxout: a ReLU unit computes $a = \max(z, 0)$ with $z = wx + b$, which is exactly a Maxout unit $a = \max(z_1, z_2)$ whose two elements are $z_1 = wx + b$ and $z_2 = 0$.

54 Maxout – Learnable Activation Function
Maxout can do more than ReLU: with $z_1 = wx + b$ and $z_2 = w'x + b'$, the output $a = \max(z_1, z_2)$ is a piecewise linear activation whose shape is determined by the learnable parameters w, b, w', b'.
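
A minimal sketch of a maxout layer, assuming each group element comes from its own linear map (group size 2 to match the slides; the random weights are illustrative):

```python
import numpy as np

def maxout(x, Ws, bs):
    """One maxout layer: each output unit takes the max over its group.

    Ws, bs hold one (W, b) pair per group element; ReLU is the special
    case where the second pair is fixed to zeros.
    """
    zs = np.stack([W @ x + b for W, b in zip(Ws, bs)])  # shape (pieces, units)
    return zs.max(axis=0)                               # elementwise max over pieces

x = np.array([1.0, -1.0])
Ws = [np.random.randn(3, 2), np.random.randn(3, 2)]     # 2 pieces, 3 output units
bs = [np.zeros(3), np.zeros(3)]
print(maxout(x, Ws, bs))
```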

55 Dropout
Training: each time before updating the parameters, each neuron has a p% chance to be dropped out.

56 Dropout
Training: after dropping neurons, the structure of the network changes and becomes thinner; this new, thinner network is the one used for that update.

57 Dropout
Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1 - p%). For example, if the dropout rate is 50% and training produced a weight w = 1, set w = 0.5 for testing.

58 Dropout – Intuitive Reason
Why multiply the weights by (1 - p%)? Assume the dropout rate is 50%. During training, about half of the inputs through w1, ..., w4 are dropped. At test time nothing is dropped, so keeping the raw training weights would give z' ≈ 2z; multiplying every weight by 0.5 restores z' ≈ z.
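
A minimal sketch of this train/test convention (the "inverted dropout" scaling commonly used by libraries is equivalent):

```python
import numpy as np

def dropout_train(a, p=0.5):
    """Training: each activation is dropped (set to 0) with probability p."""
    mask = (np.random.rand(*a.shape) >= p).astype(float)
    return a * mask

def dropout_test_weights(W, p=0.5):
    """Testing: keep all neurons, but scale the weights by (1 - p)."""
    return W * (1.0 - p)
```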

59 Dropout – Intuitive Reason
When people team up, if everyone expects the partners to do the work, nothing gets done. If you know your partner may drop out, you work harder. At test time no one actually drops out, so the whole team performs better than expected and good results are obtained.

60 Why Deep?
Layers x Size    Word Error Rate (%)
1 x 2k           24.2
2 x 2k           20.4
3 x 2k           18.4
4 x 2k           17.8
5 x 2k           17.2
7 x 2k           17.1
1 x 16k          22.1
Adding layers at fixed width keeps lowering the word error rate, while a single very wide layer (1 x 16k) does not match the deep networks.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

61 Fat + Short vs. Thin + Tall
Given the same number of parameters, which is better: a shallow, wide ("fat + short") network or a deep, narrow ("thin + tall") one?

62 Modularization
Deep → modularization. Just as in programming: don't put everything in your main function.

63 Modularization
Suppose we train four image classifiers directly: Classifier 1 for girls with long hair, Classifier 2 for boys with long hair, Classifier 3 for girls with short hair, Classifier 4 for boys with short hair. Some of these classes have few examples, so their classifiers are weak.

64 Modularization
Instead, first train basic classifiers on the image: "boy or girl?" and "long or short hair?". Each basic classifier has plenty of data and performs well. The four final classifiers are then built on top of these basic modules, so they can work well even with few examples of their own.

65 Modularization → Less Training Data?
Deep → modularization: the first layer learns the most basic classifiers, the second layer uses the first layer as modules to build more complex classifiers, and so on. This modularization is learned automatically from data, which suggests deep networks may need less training data.

66 Universality Theorem
Any continuous function f can be realized by a network with one hidden layer, given enough hidden neurons. So yes, a shallow network can represent any function; however, using a deep structure is more effective.

67 Analogy
Logic circuits consist of gates; two layers of logic gates can represent any Boolean function, but using multiple layers of gates to build some functions is much simpler (fewer gates needed).
Neural networks consist of neurons; a single-hidden-layer network can represent any continuous function, but using multiple layers of neurons to represent some functions is much simpler (fewer parameters, and possibly less data).

68 Analogy
Example: parity check. For an input sequence of d bits, the circuit outputs 1 if the number of 1s is even and 0 if it is odd. A two-layer circuit needs O(2^d) gates, but with multiple layers (a chain of XNOR gates), we need only O(d) gates.
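
A small sketch of the O(d) construction; the slide draws it with XNOR gates, while the logically equivalent XOR-then-invert chain is used below:

```python
def parity_even(bits):
    """1 if the number of 1 bits is even, else 0.

    A chain of d-1 two-input gates, i.e. O(d) gates, versus the O(2^d)
    terms a flat two-layer circuit would need.
    """
    odd = 0
    for b in bits:
        odd ^= b            # running parity of the bits seen so far
    return 1 - odd          # even count of 1s -> output 1

print(parity_even([1, 1, 0, 1]))  # three 1s -> odd -> 0
print(parity_even([1, 1, 0, 0]))  # two 1s -> even -> 1
```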


