Computing Gradient Hung-yi Lee 李宏毅




1 Computing Gradient Hung-yi Lee 李宏毅

2 Introduction
Backpropagation: an efficient way to compute the gradient.
Prerequisite:
Backpropagation for feedforward net (simple version): DNN%20backprop.ecm.mp4/index.html
Backpropagation through time for RNN: %20training%20(v6).ecm.mp4/index.html
This lecture: understanding backpropagation through computational graphs, the formulation used by TensorFlow, Theano, CNTK, etc.

3 Computational Graph

4 Computational Graph
A "language" for describing a function $f$.
Node: a variable (scalar, vector, tensor, ...).
Edge: an operation (a simple function).
Example: $a = f(b, c)$ is drawn as nodes $b$ and $c$ feeding node $a$ through the operation $f$.
Example: $y = f(g(h(x)))$ is decomposed as $u = h(x)$, $v = g(u)$, $y = f(v)$, giving the chain $x \to u \to v \to y$.

5 Computational Graph
Example: $e = (a+b) \times (b+1)$, decomposed as
$c = a + b$, $d = b + 1$, $e = c \times d$.
With $a = 2$, $b = 1$: $c = 3$, $d = 2$, $e = 6$.

6 Review: Chain Rule
Case 1: $y = g(x)$, $z = h(y)$. Then $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$.
Case 2: $x = g(s)$, $y = h(s)$, $z = k(x, y)$. Then $\frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$.

7 Computational Graph
Example: $e = (a+b) \times (b+1)$. Compute $\frac{\partial e}{\partial b}$.
Sum over all paths from $b$ to $e$:
$\frac{\partial e}{\partial b} = \frac{\partial e}{\partial c}\frac{\partial c}{\partial b} + \frac{\partial e}{\partial d}\frac{\partial d}{\partial b}$
where $\frac{\partial e}{\partial c} = d = b+1$, $\frac{\partial e}{\partial d} = c = a+b$, and $\frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} = \frac{\partial d}{\partial b} = 1$.
So $\frac{\partial e}{\partial b} = 1 \times (b+1) + 1 \times (a+b)$.

8 Computational Graph
Example: $e = (a+b) \times (b+1)$. Compute $\frac{\partial e}{\partial b}$ with $a = 3$, $b = 2$.
Then $c = 5$, $d = 3$, $e = 15$, and the edge derivatives are $\frac{\partial e}{\partial c} = d = 3$, $\frac{\partial e}{\partial d} = c = 5$, $\frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} = \frac{\partial d}{\partial b} = 1$.

9 Computational Graph
Example: $e = (a+b) \times (b+1)$. Compute $\frac{\partial e}{\partial b}$ with $a = 3$, $b = 2$.
Start with 1 at node $b$ and propagate towards the output, multiplying by the edge derivatives and summing where paths merge: $c$ and $d$ each receive $1$, and $e$ receives $3 \times 1 + 5 \times 1 = 8$. So $\frac{\partial e}{\partial b} = 8$.

10 Computational Graph
Example: $e = (a+b) \times (b+1)$. Compute $\frac{\partial e}{\partial a}$ with $a = 3$, $b = 2$.
Start with 1 at node $a$: only the path $a \to c \to e$ exists, so $\frac{\partial e}{\partial a} = \frac{\partial e}{\partial c}\frac{\partial c}{\partial a} = 3 \times 1 = 3$.

11 Computational Graph
Example: $e = (a+b) \times (b+1)$. Reverse mode: what is the benefit?
Compute $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$ together with $a = 3$, $b = 2$: start with 1 at the output $e$ and propagate backwards. Node $c$ receives $\frac{\partial e}{\partial c} = d = 3$, node $d$ receives $\frac{\partial e}{\partial d} = c = 5$, node $a$ receives $3$, and node $b$ receives $3 + 5 = 8$. One backward pass yields the derivative of $e$ with respect to every node.
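To make the reverse-mode bookkeeping concrete, here is a minimal sketch in plain Python (not from the slides; the Node class and helper names are invented for illustration). It builds the graph for $e = (a+b) \times (b+1)$, seeds the output with 1, and propagates gradients backwards, recovering $\partial e/\partial a = 3$ and $\partial e/\partial b = 8$ in a single pass.

```python
# Minimal reverse-mode automatic differentiation sketch for e = (a+b)*(b+1).
# The Node class and helper names are illustrative, not from any particular library.

class Node:
    def __init__(self, value, parents=()):
        self.value = value        # forward value
        self.parents = parents    # tuples of (parent_node, local_derivative)
        self.grad = 0.0           # accumulated d(output)/d(this node)

def add(x, y):
    return Node(x.value + y.value, ((x, 1.0), (y, 1.0)))

def mul(x, y):
    return Node(x.value * y.value, ((x, y.value), (y, x.value)))

def backward(output):
    # Visit nodes in reverse topological order so each node's gradient is
    # complete before it is propagated to its parents.
    order, seen = [], set()
    def topo(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                topo(parent)
            order.append(node)
    topo(output)
    output.grad = 1.0             # start with 1 at the output node
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local    # chain rule, summed over paths

a, b = Node(3.0), Node(2.0)
c = add(a, b)                # c = a + b = 5
d = add(b, Node(1.0))        # d = b + 1 = 3
e = mul(c, d)                # e = c * d = 15
backward(e)
print(e.value, a.grad, b.grad)   # 15.0 3.0 8.0
```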

12 Computational Graph
Parameter sharing: the same parameter appears in different nodes of the graph.
Example: $y = x e^{x^2}$, built with $u = x \cdot x$, $v = e^u$, $y = x \cdot v$, so $x$ appears three times in the graph.
Treat each occurrence of $x$ as a separate node and sum their contributions:
direct edge $x \to y$: contribution $v = e^{x^2}$;
each of the two occurrences inside $u$: contribution $x \cdot e^{x^2} \cdot x$.
Total: $\frac{\partial y}{\partial x} = e^{x^2} + x \cdot e^{x^2} \cdot 2x$.
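A quick numerical sanity check of the parameter-sharing result (illustrative, not from the slides): compare the summed-contribution formula with a central finite difference.

```python
import math

# Check the parameter-sharing derivative of y = x * exp(x**2) numerically.

def y(x):
    return x * math.exp(x ** 2)

def dy_dx(x):
    # Sum of the contributions of every occurrence of x in the graph:
    # the direct edge plus the two copies inside u = x * x.
    return math.exp(x ** 2) + x * math.exp(x ** 2) * 2 * x

x0, eps = 0.7, 1e-6
numeric = (y(x0 + eps) - y(x0 - eps)) / (2 * eps)   # central difference
print(dy_dx(x0), numeric)   # the two values agree closely
```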

13 Computational Graph for Feedforward Net

14 Review: Backpropagation
(Figure: forward pass through a layer computing the activations, and backward pass propagating the error signal.)

15 Review: Backpropagation
(Figure: the error signal is propagated backwards from layer L to layer L-1; softmax is not used at the output here.)

16 Feedforward Network
$y = f(x) = \sigma(W^L \cdots \sigma(W^2 \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$
As a computational graph (two layers shown):
$z^1 = W^1 x + b^1$, $a^1 = \sigma(z^1)$, $z^2 = W^2 a^1 + b^2$, $y = \sigma(z^2)$
Graph: $x \to z^1 \to a^1 \to z^2 \to y$, with $W^1, b^1, W^2, b^2$ feeding the corresponding nodes.
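A small NumPy sketch of this two-layer forward pass (layer sizes and random parameters are made up for illustration):

```python
import numpy as np

# Forward pass of the two-layer network on the slide:
# z1 = W1 x + b1, a1 = sigma(z1), z2 = W2 a1 + b2, y = sigma(z2).
# Layer sizes are arbitrary; sigma is the logistic sigmoid.

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 3, 2
x  = rng.normal(size=n_in)
W1, b1 = rng.normal(size=(n_hid, n_in)),  np.zeros(n_hid)
W2, b2 = rng.normal(size=(n_out, n_hid)), np.zeros(n_out)

z1 = W1 @ x + b1
a1 = sigma(z1)
z2 = W2 @ a1 + b2
y  = sigma(z2)
print(y.shape)   # (2,)
```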

17 Loss Function of Feedforward Network
Append the loss to the graph: $C = L(y, \hat{y})$, where $\hat{y}$ is the target.
$z^1 = W^1 x + b^1$, $a^1 = \sigma(z^1)$, $z^2 = W^2 a^1 + b^2$, $y = \sigma(z^2)$, $C = L(y, \hat{y})$.

18 Gradient of Cost Function
$C = L(y, \hat{y})$. To compute the gradient:
Compute the partial derivative on every edge of the graph, then combine them using reverse mode (the output $C$ is always a scalar).
The targets are $\frac{\partial C}{\partial W^1}$, $\frac{\partial C}{\partial W^2}$, $\frac{\partial C}{\partial b^1}$, $\frac{\partial C}{\partial b^2}$. Edges such as $\frac{\partial a^1}{\partial z^1}$ connect vectors to vectors — what are these derivatives?

19 Jacobian Matrix
$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$, $y = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}$, $y = f(x)$
$\frac{\partial y}{\partial x} = \begin{bmatrix} \partial y_1/\partial x_1 & \partial y_1/\partial x_2 & \partial y_1/\partial x_3 \\ \partial y_2/\partial x_1 & \partial y_2/\partial x_2 & \partial y_2/\partial x_3 \end{bmatrix}$
(number of rows = size of $y$, number of columns = size of $x$)
Example: $f\!\left(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\right) = \begin{bmatrix} x_1 + x_2 x_3 \\ 2 x_3 \end{bmatrix}$, so $\frac{\partial y}{\partial x} = \begin{bmatrix} 1 & x_3 & x_2 \\ 0 & 0 & 2 \end{bmatrix}$.
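A short finite-difference check of the example Jacobian (illustrative; the helper jacobian_fd is not from the slides):

```python
import numpy as np

# Numerically check the Jacobian of the slide's example
# f(x) = [x1 + x2*x3, 2*x3] at an arbitrary point.

def f(x):
    x1, x2, x3 = x
    return np.array([x1 + x2 * x3, 2.0 * x3])

def jacobian_fd(f, x, eps=1e-6):
    # Finite-difference Jacobian: rows index outputs, columns index inputs.
    y0 = f(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * eps)
    return J

x = np.array([1.0, 2.0, 3.0])
print(jacobian_fd(f, x))
# analytic Jacobian: [[1, x3, x2], [0, 0, 2]] = [[1, 3, 2], [0, 0, 2]]
```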

20 Gradient of Cost Function
$C = L(y, \hat{y})$, last layer: compute $\frac{\partial C}{\partial y}$.
The target $\hat{y} = [0\ 0 \cdots 1 \cdots 0]^T$ is one-hot: only dimension $r$ is 1, all others are 0.
Cross entropy: $C = -\log y_r$.
$i = r$: $\frac{\partial C}{\partial y_i} = -1/y_r$;  $i \neq r$: $\frac{\partial C}{\partial y_i} = 0$.
So $\frac{\partial C}{\partial y} = [0 \cdots -1/y_r \cdots 0]$. (Do we have to consider the other dimensions?)
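A minimal sketch of this last-layer gradient (the example output vector is made up):

```python
import numpy as np

# Gradient of the cross-entropy cost C = -log(y_r) with respect to y,
# for a one-hot target with the 1 in dimension r.

def cross_entropy_grad(y, r):
    grad = np.zeros_like(y)
    grad[r] = -1.0 / y[r]     # only the target dimension is non-zero
    return grad

y = np.array([0.1, 0.7, 0.2])        # network output (e.g. after softmax)
print(cross_entropy_grad(y, r=1))    # [ 0. -1.42857143  0. ]
```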

21 Gradient of Cost Function
$\frac{\partial y}{\partial z^2}$ is a (square) Jacobian matrix; its $i$-th row, $j$-th column entry is $\frac{\partial y_i}{\partial z^2_j}$.
Since $y_i = \sigma(z^2_i)$ acts element-wise:
$i = j$: $\frac{\partial y_i}{\partial z^2_i} = \sigma'(z^2_i)$;  $i \neq j$: $\frac{\partial y_i}{\partial z^2_j} = 0$.
So $\frac{\partial y}{\partial z^2}$ is a diagonal matrix. (How about softmax?)
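A sketch contrasting the two Jacobians (illustrative, not from the slides): the element-wise sigmoid gives a diagonal matrix, while softmax, which mixes all components, gives a dense one, $\mathrm{diag}(y) - y y^T$.

```python
import numpy as np

# Jacobian of the element-wise sigmoid y = sigma(z): diagonal with sigma'(z_i).
# For contrast, the softmax Jacobian is dense: diag(y) - y y^T.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_jacobian(z):
    y = sigmoid(z)
    return np.diag(y * (1.0 - y))        # sigma'(z) on the diagonal

def softmax_jacobian(z):
    y = np.exp(z - z.max())
    y /= y.sum()
    return np.diag(y) - np.outer(y, y)   # off-diagonal entries are non-zero

z = np.array([0.5, -1.0, 2.0])
print(sigmoid_jacobian(z))
print(softmax_jacobian(z))
```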

22
$\frac{\partial z^2}{\partial a^1}$ is a Jacobian matrix; its $i$-th row, $j$-th column entry is $\frac{\partial z^2_i}{\partial a^1_j}$.
Since $z^2 = W^2 a^1$, i.e. $z^2_i = w^2_{i1} a^1_1 + w^2_{i2} a^1_2 + \cdots + w^2_{in} a^1_n$,
$\frac{\partial z^2_i}{\partial a^1_j} = W^2_{ij}$, and therefore $\frac{\partial z^2}{\partial a^1} = W^2$.

23
$\frac{\partial z^2}{\partial W^2}$: treating the $m \times n$ matrix $W^2$ as a vector of length $mn$ (entry $W^2_{jk}$ maps to position $(j-1) \times n + k$), $\frac{\partial z^2}{\partial W^2}$ is an $m \times mn$ matrix.
Since $z^2_i = w^2_{i1} a^1_1 + w^2_{i2} a^1_2 + \cdots + w^2_{in} a^1_n$:
$i \neq j$: $\frac{\partial z^2_i}{\partial W^2_{jk}} = 0$;  $i = j$: $\frac{\partial z^2_i}{\partial W^2_{ik}} = a^1_k$.
(Considering $\frac{\partial z^2}{\partial W^2}$ as a tensor makes things easier.)

24
Written out, the $m \times mn$ matrix $\frac{\partial z^2}{\partial W^2}$ is block-sparse: row $i$ contains $(a^1)^T$ in the $n$ columns belonging to block $j = i$ (the columns for row $i$ of $W^2$) and zeros in every other block.
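A sketch that builds this block-sparse Jacobian with NumPy and checks it against a first-order perturbation of $z^2 = W^2 a^1$ (sizes are made up; the helper name dz_dW is invented):

```python
import numpy as np

# Build the m x (m*n) Jacobian dz2/dW2 for z2 = W2 @ a1, with W2 flattened
# row-major (entry W2[j,k] -> column j*n + k, i.e. (j-1)*n + k in the
# 1-based indexing of the slide). Row i holds a1^T in block j = i.

def dz_dW(a1, m):
    n = a1.size
    J = np.zeros((m, m * n))
    for i in range(m):
        J[i, i * n:(i + 1) * n] = a1   # the only non-zero block in row i
    return J

rng = np.random.default_rng(0)
m, n = 2, 3
W2 = rng.normal(size=(m, n))
a1 = rng.normal(size=n)

J = dz_dW(a1, m)
# Sanity check: J @ vec(dW) equals the change in z2 = W2 @ a1.
dW = 1e-6 * rng.normal(size=(m, n))
lhs = ((W2 + dW) @ a1) - (W2 @ a1)
rhs = J @ dW.reshape(-1)
print(np.allclose(lhs, rhs))   # True (z2 is linear in W2)
```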

25
Chaining the Jacobians:
$\frac{\partial C}{\partial W^1} = \frac{\partial C}{\partial y}\,\frac{\partial y}{\partial z^2}\,\frac{\partial z^2}{\partial a^1}\,\frac{\partial a^1}{\partial z^1}\,\frac{\partial z^1}{\partial W^1} = [\;\cdots\; \frac{\partial C}{\partial W^1_{ij}} \;\cdots\;]$
with $\frac{\partial z^2}{\partial a^1} = W^2$; similarly, $\frac{\partial C}{\partial W^2} = \frac{\partial C}{\partial y}\,\frac{\partial y}{\partial z^2}\,\frac{\partial z^2}{\partial W^2}$.
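Putting the pieces together in code: the sketch below (illustrative; it assumes the sigmoid output layer and cross-entropy cost $C = -\log y_r$ from the earlier slides, with made-up sizes) multiplies the Jacobians along the chain and checks one entry of $\frac{\partial C}{\partial W^1}$ by finite differences.

```python
import numpy as np

# Compose the Jacobians on the slide to get dC/dW1, then verify against
# finite differences. Concrete sizes and parameters are made up.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, b1, W2, b2, x, r):
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; y = sigmoid(z2)
    return z1, a1, z2, y, -np.log(y[r])

def dz_dW(v, m):
    # m x (m*n) Jacobian of z = W v with respect to row-major flattened W.
    n = v.size
    J = np.zeros((m, m * n))
    for i in range(m):
        J[i, i * n:(i + 1) * n] = v
    return J

rng = np.random.default_rng(1)
n_in, n_hid, n_out, r = 4, 3, 2, 1
x = rng.normal(size=n_in)
W1, b1 = rng.normal(size=(n_hid, n_in)), rng.normal(size=n_hid)
W2, b2 = rng.normal(size=(n_out, n_hid)), rng.normal(size=n_out)

z1, a1, z2, y, C = forward(W1, b1, W2, b2, x, r)

dC_dy   = np.zeros(n_out); dC_dy[r] = -1.0 / y[r]     # last-layer gradient
dy_dz2  = np.diag(sigmoid(z2) * (1 - sigmoid(z2)))    # diagonal Jacobian
dz2_da1 = W2                                          # Jacobian of z2 = W2 a1
da1_dz1 = np.diag(sigmoid(z1) * (1 - sigmoid(z1)))
dz1_dW1 = dz_dW(x, n_hid)

# Row vector of length n_hid * n_in (W1 flattened row-major).
dC_dW1 = dC_dy @ dy_dz2 @ dz2_da1 @ da1_dz1 @ dz1_dW1

# Finite-difference check of one entry of dC/dW1.
eps, (i, j) = 1e-6, (2, 1)
Wp = W1.copy(); Wp[i, j] += eps
Wm = W1.copy(); Wm[i, j] -= eps
numeric = (forward(Wp, b1, W2, b2, x, r)[-1] - forward(Wm, b1, W2, b2, x, r)[-1]) / (2 * eps)
print(dC_dW1[i * n_in + j], numeric)   # the two values agree
```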

26 Question Q: Is there only a backward pass for computational graphs?
Q: Do we get the same results from the two different approaches?

27 Computational Graph for Recurrent Network

28 Recurrent Network
Graph: $h^0 \xrightarrow{f} h^1 \xrightarrow{f} h^2 \xrightarrow{f} h^3 \cdots$, where step $t$ reads $x^t$ and emits $y^t$.
$y^t, h^t = f(x^t, h^{t-1}; W^i, W^h, W^o)$
$h^t = \sigma(W^i x^t + W^h h^{t-1})$
$y^t = \mathrm{softmax}(W^o h^t)$
(biases are ignored here)
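A small NumPy sketch of this recurrent step, unrolled over three inputs (sizes and inputs are made up):

```python
import numpy as np

# Forward pass of the recurrent step on the slide:
# h_t = sigma(Wi x_t + Wh h_{t-1}),  y_t = softmax(Wo h_t), biases ignored.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(x_t, h_prev, Wi, Wh, Wo):
    h_t = sigmoid(Wi @ x_t + Wh @ h_prev)
    y_t = softmax(Wo @ h_t)
    return y_t, h_t

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 3
Wi = rng.normal(size=(n_hid, n_in))
Wh = rng.normal(size=(n_hid, n_hid))
Wo = rng.normal(size=(n_out, n_hid))

h = np.zeros(n_hid)                      # h0
xs = [rng.normal(size=n_in) for _ in range(T)]
for t, x in enumerate(xs, 1):            # unroll over x1, x2, x3
    y, h = step(x, h, Wi, Wh, Wo)
    print(f"y{t} =", y)
```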

29 Recurrent Network
$y^t, h^t = f(x^t, h^{t-1}; W^i, W^h, W^o)$
$h^t = \sigma(W^i x^t + W^h h^{t-1})$, $y^t = \mathrm{softmax}(W^o h^t)$
One step as a computational graph (for $t = 1$): $n^1 = W^i x^1$, $m^1 = W^h h^0$, $z^1 = n^1 + m^1$, $h^1 = \sigma(z^1)$, $o^1 = W^o h^1$, $y^1 = \mathrm{softmax}(o^1)$.

30 Recurrent Network
The same step, abstracted as one block: it takes $h^0$, $x^1$ and the shared parameters $W^i$, $W^h$, $W^o$ and produces $h^1$ and $y^1$:
$h^t = \sigma(W^i x^t + W^h h^{t-1})$, $y^t = \mathrm{softmax}(W^o h^t)$.

31 Recurrent Network
Unroll the network over three time steps and sum the per-step costs: $C = C^1 + C^2 + C^3$, where $C^t$ compares $y^t$ with the target $\hat{y}^t$.
Graph: $h^0 \to h^1 \to h^2 \to h^3$ via $W^h$, each step reading $x^t$ through $W^i$ and producing $y^t$ through $W^o$.

32 Recurrent Network
$C = C^1 + C^2 + C^3$. Label the edges of the unrolled graph with their derivatives:
$\frac{\partial C^t}{\partial y^t}$, $\frac{\partial y^t}{\partial h^t}$, $\frac{\partial h^t}{\partial h^{t-1}}$, and $\frac{\partial h^t}{\partial W^h}$ for $t = 1, 2, 3$.
$W^h$ is shared across all three steps, so $\frac{\partial C}{\partial W^h}$ sums the contribution of every occurrence of $W^h$.

33
Contribution of each occurrence of $W^h$ (reverse mode on the unrolled graph):
$\left(\frac{\partial C}{\partial W^h}\right)_1 = \left(\frac{\partial C^1}{\partial y^1}\frac{\partial y^1}{\partial h^1} + \frac{\partial C^2}{\partial y^2}\frac{\partial y^2}{\partial h^2}\frac{\partial h^2}{\partial h^1} + \frac{\partial C^3}{\partial y^3}\frac{\partial y^3}{\partial h^3}\frac{\partial h^3}{\partial h^2}\frac{\partial h^2}{\partial h^1}\right)\frac{\partial h^1}{\partial W^h}$
$\left(\frac{\partial C}{\partial W^h}\right)_2 = \cdots$,  $\left(\frac{\partial C}{\partial W^h}\right)_3 = \cdots$
$\frac{\partial C}{\partial W^h} = \left(\frac{\partial C}{\partial W^h}\right)_1 + \left(\frac{\partial C}{\partial W^h}\right)_2 + \left(\frac{\partial C}{\partial W^h}\right)_3$
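A sketch of backpropagation through time for this model (illustrative; the cross-entropy targets and sizes are made up): the gradient with respect to the shared $W^h$ is accumulated once per time step, one term per occurrence of $W^h$, matching the sum above, and one entry is checked by finite differences.

```python
import numpy as np

# Backpropagation through time on the unrolled graph. The model follows the
# slides: h_t = sigma(Wi x_t + Wh h_{t-1}), y_t = softmax(Wo h_t),
# C = sum_t -log(y_t[r_t]) with made-up targets r_t.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(Wi, Wh, Wo, xs, targets):
    h = np.zeros(Wh.shape[0])            # h0
    hs, ys, C = [h], [], 0.0
    for x, r in zip(xs, targets):
        h = sigmoid(Wi @ x + Wh @ h)
        y = softmax(Wo @ h)
        hs.append(h)
        ys.append(y)
        C += -np.log(y[r])
    return hs, ys, C

def bptt_dWh(Wi, Wh, Wo, xs, targets):
    hs, ys, _ = forward(Wi, Wh, Wo, xs, targets)
    dWh, dh_next = np.zeros_like(Wh), np.zeros(Wh.shape[0])
    for t in reversed(range(len(xs))):
        do = ys[t].copy(); do[targets[t]] -= 1.0   # d(-log y_r)/d(Wo h_t)
        dh = Wo.T @ do + dh_next                   # from y_t and from h_{t+1}
        dz = dh * hs[t + 1] * (1 - hs[t + 1])      # through the sigmoid
        dWh += np.outer(dz, hs[t])                 # contribution of this copy of Wh
        dh_next = Wh.T @ dz
    return dWh

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 4, 2, 3
Wi = rng.normal(size=(n_hid, n_in)) * 0.5
Wh = rng.normal(size=(n_hid, n_hid)) * 0.5
Wo = rng.normal(size=(n_out, n_hid)) * 0.5
xs = [rng.normal(size=n_in) for _ in range(T)]
targets = [1, 0, 1]

dWh = bptt_dWh(Wi, Wh, Wo, xs, targets)

# Finite-difference check of one entry of dC/dWh.
eps, (i, j) = 1e-6, (2, 3)
Wp = Wh.copy(); Wp[i, j] += eps
Wm = Wh.copy(); Wm[i, j] -= eps
numeric = (forward(Wi, Wp, Wo, xs, targets)[2] - forward(Wi, Wm, Wo, xs, targets)[2]) / (2 * eps)
print(dWh[i, j], numeric)   # the two values agree
```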

34 Reference
Textbook: Deep Learning, Chapter 6.5
Calculus on Computational Graphs: Backpropagation Backprop/
On chain rule, computational graphs, and backpropagation

35 Acknowledgement Thanks to 翁丞世 for finding errors in the slides.

