Backpropagation
Disclaimer: These slides are adapted from Dr. Hung-yi Lee's course materials: http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17.html
Gradient Descent
Network parameters: $\theta = \{w_1, w_2, \dots, b_1, b_2, \dots\}$. Starting from initial parameters $\theta^0$, gradient descent repeatedly applies the update $\theta^{t+1} = \theta^t - \eta \nabla C(\theta^t)$.
A network can have millions of parameters. To compute the gradients efficiently, we use backpropagation.
Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
What is a suitable value for the learning rate $\eta$? There is no universal answer; it depends on the cost function $C(\theta)$.
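To make the update rule concrete, here is a minimal sketch of gradient descent on a toy two-parameter cost. The cost function, the helper names `gradient_descent` and `toy_grad`, and the values of `eta` and `steps` are all assumptions for illustration; a real network would obtain the gradient from backpropagation instead of a hand-written formula.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(steps):
        g = grad(theta)
        theta = [t - eta * gi for t, gi in zip(theta, g)]
    return theta


# Hypothetical toy cost C(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2,
# which is minimised at theta = (3, -1).
def toy_grad(theta):
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


theta_star = gradient_descent(toy_grad, theta0=[0.0, 0.0], eta=0.1, steps=200)
```

For this particular cost, each step multiplies the distance to the minimum by |1 - 2 * eta|, so any `eta` above 1.0 diverges. That is one way to see why a suitable learning rate depends on the cost function.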
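Chain Rule
First, review the chain rule for derivatives.
Case 1: if $y = g(x)$ and $z = h(y)$, then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$.
Case 2: if $x = g(s)$, $y = h(s)$, and $z = k(x, y)$, then $\frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$.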
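The two-path form of the chain rule (one variable s influencing z through both x and y) is the one backpropagation relies on, so a quick numerical check may help. The functions chosen here (x = s^2, y = sin s, z = x * y) are arbitrary assumptions for illustration only.

```python
import math


def z_of_s(s):
    x = s ** 2        # x = g(s)
    y = math.sin(s)   # y = h(s)
    return x * y      # z = k(x, y)


def dz_ds_chain(s):
    """dz/ds = (dz/dx)(dx/ds) + (dz/dy)(dy/ds), summing over both paths."""
    x, y = s ** 2, math.sin(s)
    dz_dx, dz_dy = y, x
    dx_ds, dy_ds = 2 * s, math.cos(s)
    return dz_dx * dx_ds + dz_dy * dy_ds


def dz_ds_numeric(s, h=1e-6):
    """Central finite difference, used only as an independent check."""
    return (z_of_s(s + h) - z_of_s(s - h)) / (2 * h)
```

Evaluating `dz_ds_chain(0.7)` and `dz_ds_numeric(0.7)` should give values that agree to several decimal places.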
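Review: Total Loss
For all training data, each input $x^n$ is fed through the network (NN) to produce an output $y^n$, which is compared with its target $\hat{y}^n$ to give a cost $C^n$, for $n = 1, \dots, N$.
Total loss: $L = \sum_{n=1}^{N} C^n$
Find a function in the function set that minimizes the total loss $L$; equivalently, find the network parameters $\theta^*$ that minimize $L$.
Note: Instead of using all $N$ examples for every update, stochastic gradient descent updates on one randomly picked example at a time. Both approaches move the parameters in broadly the same direction, but the stochastic version is faster per update.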
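Backpropagation
$L(\theta) = \sum_{n=1}^{N} C^n(\theta)$, so $\frac{\partial L(\theta)}{\partial w} = \sum_{n=1}^{N} \frac{\partial C^n(\theta)}{\partial w}$.
For each example, the input $x^n$ passes through the network with parameters $\theta$ to give $y^n$, which is compared with the target $\hat{y}^n$ to give $C^n$. It is therefore enough to compute $\frac{\partial C^n}{\partial w}$ for one example and then sum over all examples.
We start with the first neuron, which takes the inputs $x_1$ and $x_2$.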
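Because the total loss is a sum of per-example costs, its gradient is the sum of per-example gradients. The sketch below illustrates this with a deliberately tiny stand-in model, a single weight with y = w * x and a squared-error cost. Both the model and the data are assumptions made for illustration, not the network from the slides.

```python
def per_example_grad(w, x, target):
    """dC^n/dw for the assumed cost C^n = (w * x - target)^2."""
    return 2.0 * (w * x - target) * x


def total_loss(w, data):
    """L(w) = sum of C^n over all (x, target) pairs."""
    return sum((w * x - t) ** 2 for x, t in data)


def total_grad(w, data):
    """dL/dw computed as the sum of the per-example gradients."""
    return sum(per_example_grad(w, x, t) for x, t in data)


# Hypothetical training pairs (x^n, target^n).
data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]
```

Comparing `total_grad(0.5, data)` with a finite-difference estimate of `total_loss` around w = 0.5 is a simple way to confirm the two agree.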
Backpropagation
Consider a neuron with inputs $x_1$, $x_2$, weights $w_1$, $w_2$, and bias $b$: $z = x_1 w_1 + x_2 w_2 + b$.
For simplicity, we write $C$ for the cost of a single example in the rest of these slides.
By the chain rule, $\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w}\frac{\partial C}{\partial z}$.
Forward pass: compute $\frac{\partial z}{\partial w}$ for all parameters. (Speaker note: the values further along are large.)
Backward pass: compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
Backpropagation – Forward pass
Compute $\frac{\partial z}{\partial w}$ for all parameters.
With $z = x_1 w_1 + x_2 w_2 + b$:
$\frac{\partial z}{\partial w_1} = x_1$, $\frac{\partial z}{\partial w_2} = x_2$
The derivative is the value of the input connected by the weight.
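Backpropagation – Forward pass
Compute $\frac{\partial z}{\partial w}$ for all parameters.
[Figure: a worked example of a small network with numeric inputs, weights, and intermediate values; the values shown include 0.98, 0.12, 0.86, and 0.11.]
The derivative is the value of the input connected by the weight. For the three weights highlighted in the figure, $\frac{\partial z}{\partial w} = -1$, $\frac{\partial z}{\partial w} = 0.12$, and $\frac{\partial z}{\partial w} = 0.11$.
That's it. We have done the forward pass.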
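The values 0.98, 0.12, 0.86, and 0.11 on this slide can be reproduced with a small sigmoid network. The input, weights, and biases below are assumptions, not taken from the text; they were picked because they yield those four values after rounding. The sketch shows that the forward pass already provides every $\partial z / \partial w$, since each one is just the value feeding that weight.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Return the activations of one fully connected sigmoid layer.

    weights[j][i] is the weight from input i to neuron j.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Assumed input and parameters (hypothetical, chosen to match the slide's values).
x = [1.0, -1.0]
W1, b1 = [[1.0, -2.0], [-1.0, 1.0]], [1.0, 0.0]
W2, b2 = [[2.0, -1.0], [-2.0, -1.0]], [0.0, 0.0]

a1 = layer_forward(x, W1, b1)   # first-layer activations
a2 = layer_forward(a1, W2, b2)  # second-layer activations

# dz/dw for a weight equals the value entering that weight:
dz_dw_from_x2 = x[1]    # a first-layer weight attached to the input -1
dz_dw_from_a1 = a1[1]   # a second-layer weight attached to the second first-layer output
```

Under these assumptions, `a1` rounds to 0.98 and 0.12, and `a2` rounds to 0.86 and 0.11.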
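Backpropagation – Backward pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
The neuron output is $a = \sigma(z)$, where $\sigma$ is the activation function (e.g. the sigmoid function).
$\frac{\partial C}{\partial z} = \frac{\partial a}{\partial z}\frac{\partial C}{\partial a}$, where $\frac{\partial a}{\partial z} = \sigma'(z)$.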
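For the sigmoid activation mentioned as an example, the derivative has the closed form $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. The sketch below uses that form for the step from $\partial C / \partial a$ to $\partial C / \partial z$. The function names are illustrative, and another activation function would need its own derivative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z):
    """Closed form for the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def dC_dz(z, dC_da):
    """Chain rule through the activation: dC/dz = sigma'(z) * dC/da."""
    return sigmoid_prime(z) * dC_da
```

At z = 0 the sigmoid derivative is 0.25, so `dC_dz(0.0, 2.0)` evaluates to 0.5.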
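Backpropagation – Backward pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
The output $a = \sigma(z)$ feeds the next layer through the weights $w_3$ and $w_4$: $z' = a w_3 + \cdots$ and $z'' = a w_4 + \cdots$.
$\frac{\partial C}{\partial z} = \frac{\partial a}{\partial z}\frac{\partial C}{\partial a}$
$\frac{\partial C}{\partial a} = \frac{\partial z'}{\partial a}\frac{\partial C}{\partial z'} + \frac{\partial z''}{\partial a}\frac{\partial C}{\partial z''}$ (chain rule)
Here $\frac{\partial z'}{\partial a} = w_3$ and $\frac{\partial z''}{\partial a} = w_4$. The terms $\frac{\partial C}{\partial z'}$ and $\frac{\partial C}{\partial z''}$ are not yet known; assume for now that they are.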
Backpropagation – Backward pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
Assuming $\frac{\partial C}{\partial z'}$ and $\frac{\partial C}{\partial z''}$ are known:
$\frac{\partial C}{\partial z} = \sigma'(z)\left[ w_3 \frac{\partial C}{\partial z'} + w_4 \frac{\partial C}{\partial z''} \right]$
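The formula for $\partial C / \partial z$ in terms of $\partial C / \partial z'$ and $\partial C / \partial z''$ is a weighted sum followed by a scaling. A minimal sketch, assuming a sigmoid activation and using made-up numbers, might look like this:

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def dC_dz_hidden(z, w3, w4, dC_dz1, dC_dz2):
    """dC/dz = sigma'(z) * (w3 * dC/dz' + w4 * dC/dz'').

    dC_dz1 and dC_dz2 stand for dC/dz' and dC/dz'', assumed already known.
    """
    return sigmoid_prime(z) * (w3 * dC_dz1 + w4 * dC_dz2)


# Hypothetical numbers, chosen only to keep the arithmetic easy to follow:
# sigma'(0) = 0.25 and 2 * 0.5 + 4 * 1.0 = 5, so the result is 1.25.
example = dC_dz_hidden(z=0.0, w3=2.0, w4=4.0, dC_dz1=0.5, dC_dz2=1.0)
```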
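Backpropagation – Backward pass
$\frac{\partial C}{\partial z} = \sigma'(z)\left[ w_3 \frac{\partial C}{\partial z'} + w_4 \frac{\partial C}{\partial z''} \right]$
In the diagram, $\frac{\partial C}{\partial z'}$ and $\frac{\partial C}{\partial z''}$ are weighted by $w_3$ and $w_4$, summed, and scaled by $\sigma'(z)$ to give $\frac{\partial C}{\partial z}$. Here $\sigma'(z)$ is a constant, because $z$ is already determined in the forward pass.
How do we calculate $\frac{\partial C}{\partial z'}$ and $\frac{\partial C}{\partial z''}$? There are two cases.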
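Backpropagation – Backward pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
Case 1. Output layer: $z'$ and $z''$ are the inputs of the output neurons, which produce $y_1$ and $y_2$.
$\frac{\partial C}{\partial z'} = \frac{\partial y_1}{\partial z'}\frac{\partial C}{\partial y_1}$, $\frac{\partial C}{\partial z''} = \frac{\partial y_2}{\partial z''}\frac{\partial C}{\partial y_2}$
Done!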
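The output-layer case depends on which output activation and cost are used, and the slides leave both open. As one illustrative assumption, the sketch below takes a sigmoid output and a squared-error cost $C = \frac{1}{2}\sum_k (y_k - \hat{y}_k)^2$. Other choices change the two factors but not the structure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_delta(z_out, target):
    """dC/dz' = (dy/dz') * (dC/dy) for one output neuron.

    Assumes y = sigmoid(z') and C = 0.5 * (y - target)^2 (both are assumptions).
    """
    y = sigmoid(z_out)
    dy_dz = y * (1.0 - y)   # derivative of the assumed sigmoid output
    dC_dy = y - target      # derivative of the assumed squared-error cost
    return dy_dz * dC_dy
```

With `z_out = 0.0` and `target = 1.0`, the output is 0.5, so the result is 0.25 * (-0.5) = -0.125.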
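Backpropagation – Backward pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
Case 2. Not output layer: $z'$ and $z''$ feed further layers, so $\frac{\partial C}{\partial z'}$ and $\frac{\partial C}{\partial z''}$ cannot be read off directly from the cost.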
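Backpropagation – Backward pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
Case 2. Not output layer: the output $a' = \sigma(z')$ feeds the next layer through the weights $w_5$ and $w_6$, giving $z_a$ and $z_b$. If $\frac{\partial C}{\partial z_a}$ and $\frac{\partial C}{\partial z_b}$ are known, $\frac{\partial C}{\partial z'}$ can be computed from them in the same way as before. The same holds for $\frac{\partial C}{\partial z''}$.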
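Backpropagation – Backward pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
Case 2. Not output layer: compute $\frac{\partial C}{\partial z}$ recursively.
$\frac{\partial C}{\partial z'} = \sigma'(z')\left[ w_5 \frac{\partial C}{\partial z_a} + w_6 \frac{\partial C}{\partial z_b} \right]$
The same applies to $\frac{\partial C}{\partial z''}$, and so on, until we reach the output layer.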
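Backpropagation – Backward Pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
For example, take a network with inputs $x_1$, $x_2$, activation inputs $z_1$, $z_2$ (first layer), $z_3$, $z_4$ (second layer), $z_5$, $z_6$ (output layer), and outputs $y_1$, $y_2$.
Compute $\frac{\partial C}{\partial z}$ starting from the output layer: first $\frac{\partial C}{\partial z_5}$ and $\frac{\partial C}{\partial z_6}$.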
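Backpropagation – Backward Pass
Compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$, starting from the output layer.
From $\frac{\partial C}{\partial z_5}$ and $\frac{\partial C}{\partial z_6}$, compute $\frac{\partial C}{\partial z_3}$ and $\frac{\partial C}{\partial z_4}$, scaling by $\sigma'(z_3)$ and $\sigma'(z_4)$.
From those, compute $\frac{\partial C}{\partial z_1}$ and $\frac{\partial C}{\partial z_2}$, scaling by $\sigma'(z_1)$ and $\sigma'(z_2)$.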
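Working from the output layer backwards means every $\partial C / \partial z$ is computed exactly once. The sketch below is one possible way to organise that sweep, assuming sigmoid activations, fully connected layers, and $z$ values saved during the forward pass. The data layout (`zs`, `weights`, `delta_out`) is a hypothetical choice for illustration.

```python
import math


def sigmoid_prime(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def backward_sweep(zs, weights, delta_out):
    """Return dC/dz for every layer, computed from the output layer backwards.

    zs[l]      : activation inputs z of layer l, saved in the forward pass
    weights[l] : weights[l][j][i] connects neuron i of layer l
                 to neuron j of layer l + 1
    delta_out  : dC/dz for the output layer (the output-layer case)
    """
    deltas = [None] * len(zs)
    deltas[-1] = list(delta_out)
    for l in range(len(zs) - 2, -1, -1):
        nxt = deltas[l + 1]
        deltas[l] = [
            sigmoid_prime(z_i)
            * sum(weights[l][j][i] * nxt[j] for j in range(len(nxt)))
            for i, z_i in enumerate(zs[l])
        ]
    return deltas


# Hypothetical two-layer example with all z = 0, so sigma'(z) = 0.25.
deltas = backward_sweep(
    zs=[[0.0, 0.0], [0.0, 0.0]],
    weights=[[[1.0, 2.0], [3.0, 4.0]]],
    delta_out=[1.0, 1.0],
)
```

In this example the first-layer results are 0.25 * (1 + 3) = 1.0 and 0.25 * (2 + 4) = 1.5.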
Review: Backpropagation: Motivation
For a neuron with $z = x_1 w_1 + x_2 w_2 + b$, the chain rule gives $\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w}\frac{\partial C}{\partial z}$.
Forward pass: compute $\frac{\partial z}{\partial w}$ for all parameters.
Backward pass: compute $\frac{\partial C}{\partial z}$ for all activation function inputs $z$.
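Backpropagation – Summary
Forward pass: for every weight $w$, $\frac{\partial z}{\partial w} = a$, the value of the input connected by that weight.
Backward pass: compute $\frac{\partial C}{\partial z}$ for every activation function input $z$.
Multiplying the two gives the gradient: $\frac{\partial z}{\partial w} \times \frac{\partial C}{\partial z} = \frac{\partial C}{\partial w}$ for all $w$.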
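Putting the two passes together, the sketch below computes $\partial C / \partial w$ for a deliberately tiny network (one hidden neuron, one output neuron) and compares one gradient with a finite-difference estimate. The architecture, the sigmoid activations, the squared-error cost, and all numeric values are assumptions made to keep the example short.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, b, v, c):
    """Hidden neuron with two inputs, followed by one output neuron."""
    z = w[0] * x[0] + w[1] * x[1] + b
    a = sigmoid(z)
    z_out = v * a + c
    y = sigmoid(z_out)
    return z, a, z_out, y


def cost(y, target):
    return 0.5 * (y - target) ** 2


def backprop(x, w, b, v, c, target):
    z, a, z_out, y = forward(x, w, b, v, c)
    # Backward pass: output layer first, then the hidden neuron.
    dC_dz_out = y * (1.0 - y) * (y - target)
    dC_dz = a * (1.0 - a) * v * dC_dz_out
    # dC/dw = (value of the input connected by the weight) * dC/dz
    return {"w0": x[0] * dC_dz, "w1": x[1] * dC_dz, "v": a * dC_dz_out}


def numeric_dC_dw0(x, w, b, v, c, target, h=1e-6):
    """Finite-difference estimate of dC/dw0, used only as a check."""
    up = cost(forward(x, [w[0] + h, w[1]], b, v, c)[3], target)
    down = cost(forward(x, [w[0] - h, w[1]], b, v, c)[3], target)
    return (up - down) / (2 * h)


# Hypothetical input, parameters, and target.
x, w, b, v, c, target = [1.0, -1.0], [1.0, -2.0], 1.0, 2.0, 0.0, 1.0
grads = backprop(x, w, b, v, c, target)
```

A finite-difference check like `numeric_dC_dw0` needs one extra forward pass per parameter, which is why backpropagation is preferred when a network has millions of them.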