1 Deep Learning
Shiyan Hu, Michigan Technological University

2 Neural Network
A neural network is built from "neurons". Different connection patterns lead to different network structures.
Network parameters θ: all the weights and biases in the "neurons".

3 Fully Connected Feedforward Network
[Diagram: with input (1, -1), the first layer computes weighted sums 4 and -2, which the sigmoid function maps to 0.98 and 0.12.]

4 Fully Connected Feedforward Network
[Diagram: the same network extended to three layers; with input (1, -1), the activations propagate layer by layer to the final outputs 0.62 and 0.83.]

5 Fully Connected Feedforward Network
[Diagram: the same network with input (0, 0); the outputs become 0.51 and 0.85. The network defines a function from inputs to outputs.]

6 Fully Connected Feedforward Network
[Diagram: the general structure. The input layer feeds hidden layers 1 through L, and the output layer produces y1, ..., yM. Every neuron in one layer connects to every neuron in the next.]

7 Deep = Many Hidden Layers
Error rates: AlexNet (2012), 8 layers: 16.4%; VGG (2014), 19 layers: 7.3%; GoogleNet (2014): 6.7%.

8 Deep = Many Hidden Layers
Error rates: AlexNet (2012): 16.4%; VGG (2014): 7.3%; GoogleNet (2014): 6.7%; Residual Net (2015): 3.57%.

9 Matrix Operation
The first layer of the earlier example can be written as one matrix operation:
$\sigma\!\left(\begin{bmatrix}1 & -2\\ -1 & 1\end{bmatrix}\begin{bmatrix}1\\ -1\end{bmatrix}+\begin{bmatrix}1\\ 0\end{bmatrix}\right)=\sigma\!\left(\begin{bmatrix}4\\ -2\end{bmatrix}\right)=\begin{bmatrix}0.98\\ 0.12\end{bmatrix}$
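
A minimal numpy sketch of this layer computation; the weight matrix, bias, and input are the values from the slide:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [-1.0, 1.0]])   # weights of the first layer
b = np.array([1.0, 0.0])      # biases
x = np.array([1.0, -1.0])     # input vector

a = sigmoid(W @ x + b)        # -> approximately [0.98, 0.12]
print(a)
```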

10 Neural Network
With weight matrices $W^1, \ldots, W^L$ and bias vectors $b^1, \ldots, b^L$, each layer is one matrix operation:
$a^1 = \sigma(W^1 x + b^1), \quad a^2 = \sigma(W^2 a^1 + b^2), \quad \ldots, \quad y = \sigma(W^L a^{L-1} + b^L)$

11 Neural Network
The whole network is a function y = f(x), a chain of matrix operations:
$y = f(x) = \sigma\big(W^L \cdots \sigma\big(W^2\,\sigma(W^1 x + b^1) + b^2\big) \cdots + b^L\big)$
Because everything is matrix multiplication, parallel computing techniques can be used to speed up the computation.
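
A minimal sketch of this chained computation, assuming the weights and biases are stored as lists of numpy arrays (the second layer's values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute y = f(x) by applying sigma(W a + b) layer by layer."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

weights = [np.array([[1.0, -2.0], [-1.0, 1.0]]),   # first layer from slide 9
           np.array([[1.0, -1.0], [0.5, 0.5]])]    # made-up second layer
biases  = [np.array([1.0, 0.0]), np.array([0.0, 0.0])]
print(forward(np.array([1.0, -1.0]), weights, biases))
```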

12 Output Layer
The hidden layers act as automatic feature engineering; the output layer is a multi-class classifier, usually with a softmax activation.
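
A minimal sketch of the softmax computation, assuming the usual definition $y_i = e^{z_i} / \sum_j e^{z_j}$:

```python
import numpy as np

def softmax(z):
    """Exponentiate and normalize so the outputs sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -2.0])))  # approximately [0.88, 0.12, 0.006]
```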

13 Example Application
Handwriting digit recognition. The input is a 16 x 16 image flattened into 256 values (ink → 1, no ink → 0). The output is a 10-dimensional vector y1, ..., y10, where each dimension represents the confidence of a digit. For example, y1 = 0.1 ("is 1"), y2 = 0.7 ("is 2"), ..., y10 = 0.2 ("is 0"), so the image is read as "2".

14 Example Application
Handwriting digit recognition: what is needed is a function (the neural network) that maps the 256-dimensional input vector to a 10-dimensional output vector, so that the machine reads the image as "2".

15 Example Application
The network structure (input layer, hidden layers 1 through L, output layer) defines a function set containing the candidate functions for handwriting digit recognition. You need to learn a good function in this function set, i.e., one that minimizes the classification error.

16 Classification Error
Given a set of parameters, compare the softmax output y with the target ŷ (for the digit "1", ŷ1 = 1 and all other components are 0). The loss is the cross entropy:
$C(y,\hat{y}) = -\sum_{i=1}^{10} \hat{y}_i \ln y_i$
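
A minimal sketch of this loss, assuming `y` is the softmax output and `y_hat` is the one-hot target:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """C(y, y_hat) = -sum_i y_hat_i * ln(y_i)."""
    return -np.sum(y_hat * np.log(y + 1e-12))  # small epsilon avoids log(0)

y_hat = np.zeros(10); y_hat[0] = 1.0   # one-hot target for the digit "1" (y_hat_1 = 1)
y = np.full(10, 0.02); y[0] = 0.82     # hypothetical network output
print(cross_entropy(y, y_hat))          # small when y matches the target
```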

17 Total Error
For all N training examples x1, ..., xN, the total loss is
$L = \sum_{n=1}^{N} C^n$
Find a function in the function set, i.e., the network parameters $\theta^*$, that minimizes the total loss L.

18 Gradient Descent
The gradient collects the partial derivatives of L with respect to every parameter in θ:
$\nabla L = \left[\frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial w_2},\ \ldots,\ \frac{\partial L}{\partial b_1},\ \ldots\right]^{\top}$
Each parameter is updated by subtracting the gradient scaled by the learning rate η, e.g. $w_1: 0.2 \rightarrow 0.2 - \eta\,\partial L/\partial w_1 = 0.15$, $w_2: -0.1 \rightarrow 0.05$, $b_1: 0.3 \rightarrow 0.2$.
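
A minimal sketch of one gradient descent update; the parameter and gradient values are chosen so that one step with η = 0.05 reproduces the numbers on the slide:

```python
def gradient_descent_step(theta, grad, eta=0.05):
    """One update: theta <- theta - eta * (gradient of L at theta)."""
    return {name: value - eta * grad[name] for name, value in theta.items()}

theta = {'w1': 0.2, 'w2': -0.1, 'b1': 0.3}
grad  = {'w1': 1.0, 'w2': -3.0, 'b1': 2.0}   # illustrative gradient values
print(gradient_descent_step(theta, grad))     # approximately {'w1': 0.15, 'w2': 0.05, 'b1': 0.2}
```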

19 Gradient Descent
The update is repeated: recompute the gradient at the new parameters and subtract η times it again, e.g. $w_1: 0.2 \rightarrow 0.15 \rightarrow 0.09$, $w_2: -0.1 \rightarrow 0.05 \rightarrow 0.15$, $b_1: 0.3 \rightarrow 0.2 \rightarrow 0.10$.

20 Gradient Descent
This is the "learning" in deep learning; even AlphaGo uses this approach. People imagine something elaborate; actually, it is just gradient descent.

21 Backpropagation
Backpropagation: an efficient way to compute $\partial L / \partial w$ in a neural network.

22 Gradient Descent
A network has millions of parameters. Starting from initial parameters, gradient descent needs the gradient with respect to every one of them; to compute these gradients efficiently, we use backpropagation.

23 Backpropagation
Since the total loss is a sum over training examples,
$L(\theta) = \sum_{n=1}^{N} C^n(\theta), \qquad \frac{\partial L(\theta)}{\partial w} = \sum_{n=1}^{N} \frac{\partial C^n(\theta)}{\partial w}$
so it suffices to compute $\partial C^n / \partial w$ for one example $x^n$ at a time.

24 Backpropagation
Consider a neuron whose input is $z = x_1 w_1 + x_2 w_2 + b$. By the chain rule,
$\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w}\,\frac{\partial C}{\partial z}$
Forward pass: compute $\partial z / \partial w$ for all parameters.
Backward pass: compute $\partial C / \partial z$ for all activation function inputs z.

25 Backpropagation – Forward Pass
Compute $\partial z / \partial w$ for all parameters. Since $z = x_1 w_1 + x_2 w_2 + b$, we have $\partial z / \partial w_1 = x_1$ and $\partial z / \partial w_2 = x_2$: the derivative is simply the value of the input connected to that weight.

26 Backpropagation – Forward Pass
[Diagram: in the three-layer example, the forward pass already produces every $\partial z / \partial w$, because each one equals the activation feeding that weight, e.g. $\partial z / \partial w = -1$ at the input, 0.12 after the first layer, and 0.11 after the second.]

27 Backpropagation – Backward Pass
Compute $\partial C / \partial z$ for all activation function inputs z. With $a = \sigma(z)$ feeding into $z'$ and $z''$ through weights $w_3$ and $w_4$,
$\frac{\partial C}{\partial z} = \frac{\partial a}{\partial z}\,\frac{\partial C}{\partial a} = \sigma'(z)\,\frac{\partial C}{\partial a}$

28 Backpropagation – Backward Pass
[Diagram: the sigmoid $\sigma(z)$ and its derivative $\sigma'(z)$; the factor $\partial a / \partial z = \sigma'(z)$ is fixed once z is known from the forward pass.]

29 Backpropagation – Backward Pass
Since $z' = a w_3 + \cdots$ and $z'' = a w_4 + \cdots$, the chain rule gives
$\frac{\partial C}{\partial a} = \frac{\partial z'}{\partial a}\,\frac{\partial C}{\partial z'} + \frac{\partial z''}{\partial a}\,\frac{\partial C}{\partial z''} = w_3\,\frac{\partial C}{\partial z'} + w_4\,\frac{\partial C}{\partial z''}$

30 Backpropagation – Backward Pass
Putting the two steps together:
$\frac{\partial C}{\partial z} = \sigma'(z)\left[w_3\,\frac{\partial C}{\partial z'} + w_4\,\frac{\partial C}{\partial z''}\right]$

31 Backpropagation – Backward Pass
This can be viewed as a "reverse" neuron: $\partial C / \partial z'$ and $\partial C / \partial z''$ flow backward through the weights $w_3$ and $w_4$ and are scaled by $\sigma'(z)$, which is a constant since z is already determined in the forward pass:
$\frac{\partial C}{\partial z} = \sigma'(z)\left[w_3\,\frac{\partial C}{\partial z'} + w_4\,\frac{\partial C}{\partial z''}\right]$

32 Backpropagation – Backward Pass
Case 1: z' and z'' feed the output layer. Then the derivatives can be computed directly from the network outputs and the cost:
$\frac{\partial C}{\partial z'} = \frac{\partial y_1}{\partial z'}\,\frac{\partial C}{\partial y_1}, \qquad \frac{\partial C}{\partial z''} = \frac{\partial y_2}{\partial z''}\,\frac{\partial C}{\partial y_2}$

33 Backpropagation – Backward Pass
Case 2: z' and z'' are not in the output layer. Then $\partial C / \partial z'$ and $\partial C / \partial z''$ themselves depend on derivatives further toward the output.

34 Backpropagation – Backward Pass
Case 2 (continued): apply the same rule recursively. For example, $\partial C / \partial z'$ is obtained from $\partial C / \partial z_a$ and $\partial C / \partial z_b$ in the next layer (through weights $w_5$ and $w_6$), and so on, until we reach the output layer.

35 Backpropagation – Backward Pass
Instead of recursing from the input side, compute $\partial C / \partial z$ starting from the output layer: in a network with activation inputs $z_1, \ldots, z_6$, compute $\partial C / \partial z_5$ and $\partial C / \partial z_6$ first, then $\partial C / \partial z_3$ and $\partial C / \partial z_4$, and finally $\partial C / \partial z_1$ and $\partial C / \partial z_2$.

36 Backpropagation – Backward Pass
This is equivalent to running a "reverse" network from the outputs back to the inputs, where each backward neuron multiplies by $\sigma'(z_i)$; one backward sweep yields every $\partial C / \partial z_i$.

37 Backpropagation – Summary
For every weight w, combine the two passes: the forward pass gives $\partial z / \partial w = a$ (the activation feeding the weight), the backward pass gives $\partial C / \partial z$, and
$\frac{\partial C}{\partial w} = a \cdot \frac{\partial C}{\partial z}$
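
As a sketch, here is backpropagation for a small fully connected sigmoid network, assuming for simplicity a squared-error cost rather than the cross entropy used above (only the output delta would change); `Ws` and `bs` are lists of weight matrices and bias vectors:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y_hat, Ws, bs):
    """Gradients dC/dW and dC/db for a fully connected sigmoid network.

    Sketch only: cost is C = 0.5 * ||y - y_hat||^2 (assumption, not the
    lecture's cross entropy).
    """
    # Forward pass: keep every activation a (a[0] is the input x).
    a = [x]
    for W, b in zip(Ws, bs):
        a.append(sigmoid(W @ a[-1] + b))

    # Backward pass: delta = dC/dz, starting from the output layer.
    delta = (a[-1] - y_hat) * a[-1] * (1 - a[-1])   # sigma'(z) = a * (1 - a)
    grads_W, grads_b = [], []
    for l in reversed(range(len(Ws))):
        grads_W.insert(0, np.outer(delta, a[l]))    # dC/dW = delta * a (forward value)
        grads_b.insert(0, delta)                    # dC/db = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * a[l] * (1 - a[l])  # push dC/dz one layer back
    return grads_W, grads_b
```

Gradient descent then uses these gradients to update every W and b.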

38 Example Application
Handwriting digit recognition: the machine should read the image as "1". MNIST is the "Hello world" data set of deep learning, and Keras provides a loading function for it.
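
A hedged sketch of that loading step (modern import path, which may differ slightly from the version shown in the slides):

```python
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)  # (60000, 28, 28) grayscale images
print(y_train.shape)  # (60000,) digit labels 0-9
```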

39 Keras
Model structure: the 28x28 image is flattened into a 784-dimensional input, followed by two fully connected hidden layers of 500 neurons each and a 10-way softmax output (y1, ..., y10). Available activation functions include softplus, softsign, relu, tanh, hard_sigmoid, and linear.
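
A hedged sketch of such a model in Keras; the layer sizes follow the slide, while the activation choice is illustrative:

```python
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(500, activation='sigmoid', input_dim=28 * 28))  # hidden layer 1
model.add(Dense(500, activation='sigmoid'))                     # hidden layer 2
model.add(Dense(10, activation='softmax'))                      # one output per digit
```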

40 Keras

41 Keras
Step 3.1: Configuration: choose the optimizer, e.g. SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, or Nadam.
Step 3.2: Find the optimal network parameters using the training data (images) and labels (digits). (Details to be discussed.)
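
A hedged sketch of these two steps with the model above; the loss, optimizer, batch size, and epoch count are illustrative choices, not values from the slides:

```python
from keras.utils import to_categorical

# Prepare the MNIST arrays loaded earlier: flatten, scale, one-hot encode.
x_train_flat = x_train.reshape(60000, 784).astype('float32') / 255
y_train_onehot = to_categorical(y_train, 10)

# Step 3.1: configuration
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Step 3.2: find the network parameters from (images, labels)
model.fit(x_train_flat, y_train_onehot, batch_size=100, epochs=20)
```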

42 Keras
Save and load models. How to use the neural network at test time: case 1, score the model on a labeled test set; case 2, predict the outputs for new inputs.
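
A hedged sketch of saving, loading, and the two testing cases; the file name is arbitrary, and the test arrays are assumed to be preprocessed the same way as the training data:

```python
from keras.models import load_model

model.save('mnist_model.h5')          # save architecture + weights to a file
model = load_model('mnist_model.h5')  # restore it later

# Case 1: labeled test data -> overall loss and accuracy
score = model.evaluate(x_test_flat, y_test_onehot)
print('Test accuracy:', score[1])

# Case 2: unlabeled inputs -> predicted class probabilities
probs = model.predict(x_test_flat[:5])
print(probs.argmax(axis=1))           # most likely digit for each image
```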

43 Mini-batch
We do not really minimize the total loss. Instead: randomly initialize the network parameters; pick the 1st mini-batch (e.g. examples x1, x31, ...), compute $L' = C^1 + C^{31} + \cdots$, and update the parameters once; pick the 2nd mini-batch (x2, x16, ...), compute $L'' = C^2 + C^{16} + \cdots$, and update once; continue until all mini-batches have been picked. That is one epoch. Then repeat the above process.

44 Mini-batch
Batch size influences both speed and performance; you need to tune it. With 20 epochs, the whole pick-a-batch-and-update cycle is repeated 20 times.

45 Shuffle
Shuffle the training examples for each epoch, so the mini-batches differ from epoch to epoch (e.g. x1 and x31 share a batch in epoch 1, but x1 and x17 in epoch 2). Don't worry: this is the default behavior of Keras.
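
A minimal sketch of mini-batch training with per-epoch shuffling, assuming generic `grad_on_batch` and `update` helpers (both hypothetical):

```python
import numpy as np

def train(params, x, y, grad_on_batch, update, batch_size=100, epochs=20):
    """Shuffle each epoch, then update the parameters once per mini-batch."""
    n = len(x)
    for epoch in range(epochs):
        order = np.random.permutation(n)           # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of one mini-batch
            grads = grad_on_batch(params, x[idx], y[idx])  # gradient of C summed over the batch
            params = update(params, grads)         # one gradient descent step
    return params
```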

46 The Power of Deep?
Results on the training data show that simply going deeper does not always give better performance.

47 Vanishing Gradient Problem
With sigmoid activations, a large change at the input produces only a small change at the output, so the layers near the input receive smaller gradients. An intuitive way to see this is to estimate the derivative as a finite difference:
$\frac{\partial C}{\partial w} \approx \frac{\Delta C}{\Delta w}$
A perturbation $+\Delta w$ in an early layer is squashed by every sigmoid it passes through, so the resulting $\Delta C$ is small.

48 Vanishing Gradient Problem
Layers near the input have smaller gradients and learn very slowly, so they stay almost random; layers near the output have larger gradients and learn very fast, so they converge quickly, but they converge based on the nearly random features below them.

49 ReLU
Rectified Linear Unit (ReLU): a = z when z > 0, and a = 0 when z ≤ 0. Reasons to prefer it over the sigmoid σ(z): 1. it is fast to compute; 2. it alleviates the vanishing gradient problem. [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]

50 ReLU
[Diagram: the ReLU activation, a = z for z > 0 and a = 0 otherwise, used in each neuron of the network.]

51 ReLU
The neurons whose output is 0 can be removed, leaving a thinner, linear network whose early layers do not have smaller gradients. With different input data, different neurons are active, so overall the network is a piecewise linear approximation of a nonlinear function. A sketch of the activation follows below.
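
A one-line sketch of the activation and its gradient (1 on the active side, 0 otherwise):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)      # a = z if z > 0 else 0

def relu_grad(z):
    return (z > 0).astype(float)   # gradient is 1 where the unit is active
```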

52 Maxout
ReLU is a special case of Maxout, a learnable activation function [Ian J. Goodfellow, ICML'13]. The linear outputs of a layer are grouped, and each "neuron" outputs the maximum of its group, e.g. max(5, 7) = 7, max(-1, 1) = 1, max(1, 2) = 2, max(4, 3) = 4.

53 Maxout
ReLU is a special case of Maxout: a ReLU unit computes $a = \max(z, 0)$ with $z = wx + b$, which is exactly a Maxout unit $a = \max(z_1, z_2)$ whose two elements are $z_1 = wx + b$ and $z_2 = 0$.

54 Maxout – Learnable Activation Function
Maxout can do more than ReLU: with $z_1 = wx + b$ and $z_2 = w'x + b'$, the output $a = \max(z_1, z_2)$ is a piecewise linear activation whose shape is determined by the learnable parameters w, b, w', b'.
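
A minimal sketch of a maxout layer, assuming each group element comes from its own linear map (group size 2 to match the slides; the random weights are illustrative):

```python
import numpy as np

def maxout(x, Ws, bs):
    """One maxout layer: each output unit takes the max over its group.

    Ws, bs hold one (W, b) pair per group element; ReLU is the special
    case where the second pair is fixed to zeros.
    """
    zs = np.stack([W @ x + b for W, b in zip(Ws, bs)])  # shape (pieces, units)
    return zs.max(axis=0)                               # elementwise max over pieces

x = np.array([1.0, -1.0])
Ws = [np.random.randn(3, 2), np.random.randn(3, 2)]     # 2 pieces, 3 output units
bs = [np.zeros(3), np.zeros(3)]
print(maxout(x, Ws, bs))
```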

55 Dropout
Training: each time before updating the parameters, each neuron has a p% chance to be dropped out.

56 Dropout
Training: after dropping neurons, the structure of the network changes and becomes thinner; this new, thinner network is the one used for that update.

57 Dropout
Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1 - p%). For example, if the dropout rate is 50% and training produced a weight w = 1, set w = 0.5 for testing.

58 Dropout – Intuitive Reason
Why multiply the weights by (1 - p%)? Assume the dropout rate is 50%. During training, about half of the inputs through w1, ..., w4 are dropped. At test time nothing is dropped, so keeping the raw training weights would give z' ≈ 2z; multiplying every weight by 0.5 restores z' ≈ z.
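
A minimal sketch of this train/test convention (the "inverted dropout" scaling commonly used by libraries is equivalent):

```python
import numpy as np

def dropout_train(a, p=0.5):
    """Training: each activation is dropped (set to 0) with probability p."""
    mask = (np.random.rand(*a.shape) >= p).astype(float)
    return a * mask

def dropout_test_weights(W, p=0.5):
    """Testing: keep all neurons, but scale the weights by (1 - p)."""
    return W * (1.0 - p)
```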

59 Dropout – Intuitive Reason
When people team up, if everyone expects the partners to do the work, nothing gets done. If you know your partner may drop out, you work harder. At test time no one actually drops out, so the whole team performs better than expected and good results are obtained.

60 Why Deep?
Layers x Size    Word Error Rate (%)
1 x 2k           24.2
2 x 2k           20.4
3 x 2k           18.4
4 x 2k           17.8
5 x 2k           17.2
7 x 2k           17.1
1 x 16k          22.1
Adding layers at fixed width keeps lowering the word error rate, while a single very wide layer (1 x 16k) does not match the deep networks.
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech 2011.

61 Fat + Short vs. Thin + Tall
Given the same number of parameters, which is better: a shallow, wide ("fat + short") network or a deep, narrow ("thin + tall") one?

62 Modularization
Deep → modularization. Just as in programming: don't put everything in your main function.

63 Modularization
Suppose we train four image classifiers directly: Classifier 1 for girls with long hair, Classifier 2 for boys with long hair, Classifier 3 for girls with short hair, Classifier 4 for boys with short hair. Some of these classes have few examples, so their classifiers are weak.

64 Modularization
Instead, first train basic classifiers on the image: "boy or girl?" and "long or short hair?". Each basic classifier has plenty of data and performs well. The four final classifiers are then built on top of these basic modules, so they can work well even with few examples of their own.

65 Modularization → Less Training Data?
Deep → modularization: the first layer learns the most basic classifiers, the second layer uses the first layer as modules to build more complex classifiers, and so on. This modularization is learned automatically from data, which suggests deep networks may need less training data.

66 Universality Theorem
Any continuous function f can be realized by a network with one hidden layer, given enough hidden neurons. So yes, a shallow network can represent any function; however, using a deep structure is more effective.

67 Analogy
Logic circuits consist of gates; two layers of logic gates can represent any Boolean function, but using multiple layers of gates to build some functions is much simpler (fewer gates needed).
Neural networks consist of neurons; a single-hidden-layer network can represent any continuous function, but using multiple layers of neurons to represent some functions is much simpler (fewer parameters, and possibly less data).

68 Analogy
Example: parity check. For an input sequence of d bits, the circuit outputs 1 if the number of 1s is even and 0 if it is odd. A two-layer circuit needs O(2^d) gates, but with multiple layers (a chain of XNOR gates), we need only O(d) gates.
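
A small sketch of the O(d) construction; the slide draws it with XNOR gates, while the logically equivalent XOR-then-invert chain is used below:

```python
def parity_even(bits):
    """1 if the number of 1 bits is even, else 0.

    A chain of d-1 two-input gates, i.e. O(d) gates, versus the O(2^d)
    terms a flat two-layer circuit would need.
    """
    odd = 0
    for b in bits:
        odd ^= b            # running parity of the bits seen so far
    return 1 - odd          # even count of 1s -> output 1

print(parity_even([1, 1, 0, 1]))  # three 1s -> odd -> 0
print(parity_even([1, 1, 0, 0]))  # two 1s -> even -> 1
```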


