Computing Gradient Hung-yi Lee 李宏毅

Presentation transcript:

Computing Gradient Hung-yi Lee 李宏毅 https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9#.n6l28c26g

Introduction Backpropagation: an efficient way to compute the gradient. Prerequisites: Backpropagation for feedforward nets: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html (simple version: https://www.youtube.com/watch?v=ibJpTrp5mcE); Backpropagation through time for RNNs: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html. This lecture: understanding backpropagation through computational graphs, the view taken by TensorFlow, Theano, CNTK, etc.

Computational Graph

Computational Graph A “language” describing a function. Node: variable (scalar, vector, tensor, ...). Edge: operation (simple function). Example: $a = f(b, c)$ is drawn as nodes $b$ and $c$ connected to node $a$ by an edge labelled $f$. Example: $y = f(g(h(x)))$ becomes the chain $x \to u \to v \to y$ with $u = h(x)$, $v = g(u)$, $y = f(v)$.
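
To make the “language” concrete, here is a toy sketch (the functions h, g, f below are my own picks, not from the slides) of the chained example $y = f(g(h(x)))$: each edge applies one simple function to the value of the previous node.

```python
import math

# Toy sketch: y = f(g(h(x))) evaluated node by node along the chain x -> u -> v -> y.
h = lambda x: x ** 2        # u = h(x)
g = lambda u: math.exp(u)   # v = g(u)
f = lambda v: 3.0 * v       # y = f(v)

x = 0.5
u = h(x)
v = g(u)
y = f(v)
print(u, v, y)              # 0.25  ~1.284  ~3.852
```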

Computational Graph Example: $e = (a+b)\times(b+1)$. Introduce intermediate nodes $c = a + b$ and $d = b + 1$, so $e = c \times d$. With $a = 2$, $b = 1$: $c = 3$, $d = 2$, $e = 6$.
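
As a quick check of the numbers on this slide, a minimal sketch (plain Python, my own code) that evaluates the graph in topological order:

```python
# Evaluate e = (a+b)*(b+1) node by node; with a=2, b=1 we recover c=3, d=2, e=6.
a, b = 2.0, 1.0
c = a + b        # node c = a + b
d = b + 1.0      # node d = b + 1
e = c * d        # node e = c * d
print(c, d, e)   # 3.0 2.0 6.0
```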

Review: Chain Rule Case 1: $y = g(x)$, $z = h(y)$, so $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$. Case 2: $x = g(s)$, $y = h(s)$, $z = k(x, y)$, so $\frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial b}$ by summing over all paths from $b$ to $e$. Edge derivatives: $\frac{\partial e}{\partial c} = d = b+1$, $\frac{\partial e}{\partial d} = c = a+b$, $\frac{\partial c}{\partial a} = 1$, $\frac{\partial c}{\partial b} = 1$, $\frac{\partial d}{\partial b} = 1$. Hence $\frac{\partial e}{\partial b} = 1\times(b+1) + 1\times(a+b)$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial b}$ at $a = 3$, $b = 2$: the forward pass gives $c = 5$, $d = 3$, $e = 15$, so the edge derivatives evaluate to $\frac{\partial e}{\partial c} = d = 3$, $\frac{\partial e}{\partial d} = c = 5$, and $\frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} = \frac{\partial d}{\partial b} = 1$.
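
A hedged numerical check of this slide (my own code, plain Python): the path-sum formula for $\partial e/\partial b$ agrees with a central finite difference at $a = 3$, $b = 2$.

```python
# Path-sum formula vs. finite difference for de/db of e = (a+b)*(b+1).
def e(a, b):
    return (a + b) * (b + 1)

a, b, h = 3.0, 2.0, 1e-6
analytic = 1 * (b + 1) + 1 * (a + b)              # 3 + 5 = 8
numeric = (e(a, b + h) - e(a, b - h)) / (2 * h)   # central difference, ~8
print(analytic, numeric)
```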

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial b} = 8$ at $a = 3$, $b = 2$: start with 1 at node $b$ and move toward the output, multiplying by each edge derivative along the way. The accumulated values are 1 at $c$ ($= \partial c/\partial b$) and 1 at $d$ ($= \partial d/\partial b$), and at $e$ the two paths sum to $1\times3 + 1\times5 = 8$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial a} = 3$ at $a = 3$, $b = 2$ the same way: start with 1 at node $a$; the only path to $e$ goes through $c$, so $\frac{\partial e}{\partial a} = 1\times\frac{\partial e}{\partial c} = 1\times3 = 3$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Reverse mode: compute $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$ together at $a = 3$, $b = 2$ by starting with 1 at the output $e$ and multiplying by the edge derivatives while moving backward: the value reaching $c$ is $1\times3 = 3$, the value reaching $d$ is $1\times5 = 5$, so $\frac{\partial e}{\partial a} = 3$ and $\frac{\partial e}{\partial b} = 3\times1 + 5\times1 = 8$. What is the benefit? One backward pass yields the derivative of the output with respect to every input, instead of one forward pass per input.
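
Below is a minimal reverse-mode sketch (my own illustration, not the lecture's code) for this graph: every node keeps a gradient accumulator, the backward pass starts with 1 at $e$, and one pass yields both $\partial e/\partial a = 3$ and $\partial e/\partial b = 8$.

```python
# Minimal reverse-mode automatic differentiation for e = (a+b)*(b+1).
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local derivative w.r.t. that parent)
        self.grad = 0.0

def add(x, y):
    return Node(x.value + y.value, [(x, 1.0), (y, 1.0)])

def mul(x, y):
    return Node(x.value * y.value, [(x, y.value), (y, x.value)])

def backward(out):
    # Start with 1 at the output and push gradients along every edge.
    # This simple traversal suffices here because the shared node (b) is a leaf;
    # a general implementation would visit nodes in reverse topological order.
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local in node.parents:
            parent.grad += node.grad * local
            stack.append(parent)

a, b = Node(3.0), Node(2.0)
c = add(a, b)            # c = a + b
d = add(b, Node(1.0))    # d = b + 1
e = mul(c, d)            # e = c * d
backward(e)
print(a.grad, b.grad)    # 3.0  8.0
```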

Computational Graph Parameter sharing: the same parameter appears in different nodes. Example: $y = x e^{x^2}$, built as $u = x \times x$, $v = \exp(u)$, $y = x \times v$, so $x$ feeds into three nodes. Summing over the three paths from $x$ to $y$: the path through the final multiplication contributes $\frac{\partial y}{\partial x} = v = e^{x^2}$, and each of the two $x$ inputs of the squaring node contributes $\frac{\partial y}{\partial x} = x \cdot e^{x^2} \cdot x$. Total: $\frac{\partial y}{\partial x} = e^{x^2} + x \cdot e^{x^2} \cdot 2x$.
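
A small check of the parameter-sharing result (my own code, assuming numpy): summing the three path contributions matches a finite-difference estimate of $dy/dx$.

```python
import numpy as np

# y = x * exp(x^2): one path through the final multiplication, two through the squaring node.
x = 0.7
path_final_mul = np.exp(x ** 2)                 # dy/dx through y = x * v
paths_square   = 2 * (x * np.exp(x ** 2) * x)   # the two x inputs of u = x * x
analytic = path_final_mul + paths_square        # e^{x^2} + 2 x^2 e^{x^2}

f = lambda t: t * np.exp(t ** 2)
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(analytic, numeric)                        # both ~3.23 at x = 0.7
```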

Computational Graph for Feedforward Net

Review: Backpropagation (figure) Forward pass: compute the activations layer by layer. Backward pass: propagate the error signal from the output back through the layers.

Review: Backpropagation (figure) The error signal at layer L-1 is obtained from the error signal at layer L during the backward pass (we do not use softmax here).

Feedforward Network $y = f(x) = \sigma(W^L \cdots \sigma(W^2 \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$. As a computational graph (showing two layers): $z^1 = W^1 x + b^1$, $a^1 = \sigma(z^1)$, $z^2 = W^2 a^1 + b^2$, $y = \sigma(z^2)$.
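
A minimal forward pass for this two-layer graph (a sketch assuming numpy, a sigmoid activation, and illustrative layer sizes of my own choosing):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden, n_out = 4, 3, 2            # illustrative sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(n_hidden, n_in)),  rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)

x  = rng.normal(size=n_in)
z1 = W1 @ x + b1       # z1 = W1 x + b1
a1 = sigmoid(z1)       # a1 = sigma(z1)
z2 = W2 @ a1 + b2      # z2 = W2 a1 + b2
y  = sigmoid(z2)       # y  = sigma(z2)
print(y)
```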

Loss Function of Feedforward Network Append the cost node to the graph: $C = L(y, \hat{y})$, where $\hat{y}$ is the target. The graph is $x \to z^1 \to a^1 \to z^2 \to y \to C$ with $z^1 = W^1 x + b^1$ and $z^2 = W^2 a^1 + b^2$.

Gradient of Cost Function $C = L(y, \hat{y})$. To compute the gradient: compute the partial derivative on every edge of the graph, then use reverse mode starting from $C$ (the output $C$ is always a scalar). The quantities we want are $\frac{\partial C}{\partial W^1}$, $\frac{\partial C}{\partial W^2}$, $\frac{\partial C}{\partial b^1}$, $\frac{\partial C}{\partial b^2}$. But what is an edge derivative such as $\frac{\partial a^1}{\partial z^1}$ when both $a^1$ and $z^1$ are vectors?

Jacobian Matrix For $y = f(x)$ with $x = [x_1\ x_2\ x_3]^T$ and $y = [y_1\ y_2]^T$: $\frac{\partial y}{\partial x} = \begin{bmatrix} \partial y_1/\partial x_1 & \partial y_1/\partial x_2 & \partial y_1/\partial x_3 \\ \partial y_2/\partial x_1 & \partial y_2/\partial x_2 & \partial y_2/\partial x_3 \end{bmatrix}$ (number of rows = size of $y$, number of columns = size of $x$). Example: $f\!\left(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\right) = \begin{bmatrix} x_1 + x_2 x_3 \\ 2 x_3 \end{bmatrix}$, so $\frac{\partial y}{\partial x} = \begin{bmatrix} 1 & x_3 & x_2 \\ 0 & 0 & 2 \end{bmatrix}$.
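
A sketch (my own code, assuming numpy) that checks the slide's Jacobian example against central finite differences:

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return np.array([x1 + x2 * x3, 2.0 * x3])

def numerical_jacobian(f, x, h=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))           # rows: size of y, columns: size of x
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
analytic = np.array([[1.0, x[2], x[1]],
                     [0.0, 0.0,  2.0]])
print(np.allclose(analytic, numerical_jacobian(f, x)))   # True
```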

Gradient of Cost Function Last layer: $\frac{\partial C}{\partial y}$. The target $\hat{y}$ is one-hot: only dimension $r$ is 1, all other dimensions are 0. With cross entropy, $C = -\log y_r$, so $\frac{\partial C}{\partial y_i} = -1/y_r$ for $i = r$ and $\frac{\partial C}{\partial y_i} = 0$ for $i \neq r$; we do not have to consider the other dimensions. Hence $\frac{\partial C}{\partial y} = [0 \cdots -1/y_r \cdots 0]$.
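
In code, the last-layer gradient is one line (a sketch assuming numpy; the output values and the index r are made up for illustration):

```python
import numpy as np

y = np.array([0.1, 0.7, 0.2])     # network output (illustrative)
r = 1                             # index where the one-hot target is 1
dC_dy = np.zeros_like(y)
dC_dy[r] = -1.0 / y[r]            # C = -log(y_r)  =>  dC/dy_r = -1/y_r, others 0
print(dC_dy)                      # -1/0.7 at index r, zeros elsewhere
```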

Gradient of Cost Function $\frac{\partial y}{\partial z^2}$ is a Jacobian matrix whose $i$-th row, $j$-th column entry is $\frac{\partial y_i}{\partial z_j^2}$. Since $y_i = \sigma(z_i^2)$ elementwise: for $i \neq j$, $\frac{\partial y_i}{\partial z_j^2} = 0$; for $i = j$, $\frac{\partial y_i}{\partial z_i^2} = \sigma'(z_i^2)$. So $\frac{\partial y}{\partial z^2}$ is a diagonal matrix. How about softmax?
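
The diagonal structure is easy to see in code, and the same snippet hints at the softmax question: the softmax Jacobian, $\partial y_i/\partial z_j = y_i(\delta_{ij} - y_j)$, is dense rather than diagonal (a sketch assuming numpy; the values of $z^2$ are made up).

```python
import numpy as np

z2 = np.array([0.5, -1.0, 2.0])               # illustrative pre-activations

sig = 1.0 / (1.0 + np.exp(-z2))
J_sigmoid = np.diag(sig * (1.0 - sig))        # elementwise sigmoid: diagonal Jacobian

y = np.exp(z2) / np.exp(z2).sum()             # softmax
J_softmax = np.diag(y) - np.outer(y, y)       # y_i * (delta_ij - y_j): off-diagonals nonzero
print(J_sigmoid)
print(J_softmax)
```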

Gradient of Cost Function $\frac{\partial z^2}{\partial a^1}$ is a Jacobian matrix. Since $z_i^2 = w_{i1}^2 a_1^1 + w_{i2}^2 a_2^1 + \cdots + w_{in}^2 a_n^1$ (from $z^2 = W^2 a^1$), its $i$-th row, $j$-th column entry is $\frac{\partial z_i^2}{\partial a_j^1} = W_{ij}^2$, i.e. $\frac{\partial z^2}{\partial a^1} = W^2$.

Gradient of Cost Function $\frac{\partial z^2}{\partial W^2}$: here $W^2$ is an $m \times n$ matrix, so consider it as a vector of length $m \times n$; then $\frac{\partial z^2}{\partial W^2}$ is an $m \times (m \times n)$ matrix whose entry in row $i$ and column $(j-1) \times n + k$ is $\frac{\partial z_i^2}{\partial W_{jk}^2}$. From $z_i^2 = w_{i1}^2 a_1^1 + w_{i2}^2 a_2^1 + \cdots + w_{in}^2 a_n^1$: $\frac{\partial z_i^2}{\partial W_{jk}^2} = 0$ for $i \neq j$, and $\frac{\partial z_i^2}{\partial W_{ik}^2} = a_k^1$ for $i = j$. (Considering $\frac{\partial z^2}{\partial W^2}$ as a tensor makes things easier.)

Gradient of Cost Function Arranged as a matrix, row $i$ of $\frac{\partial z^2}{\partial W^2}$ contains $(a^1)^T$ in the block of columns with $j = i$ and zeros in all the other blocks.

Gradient of Cost Function Putting the pieces together: $\frac{\partial C}{\partial W^1} = \frac{\partial C}{\partial y} \frac{\partial y}{\partial z^2} \frac{\partial z^2}{\partial a^1} \frac{\partial a^1}{\partial z^1} \frac{\partial z^1}{\partial W^1} = [\cdots\ \frac{\partial C}{\partial W_{ij}^1}\ \cdots]$, where $\frac{\partial z^2}{\partial a^1} = W^2$. Similarly, $\frac{\partial C}{\partial W^2} = \frac{\partial C}{\partial y} \frac{\partial y}{\partial z^2} \frac{\partial z^2}{\partial W^2}$.
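
Putting the whole chain into code: a sketch of reverse mode on this graph (my own numpy implementation with illustrative shapes, not the lecture's code), multiplying the Jacobians from the cost backwards and checking one entry of $\partial C/\partial W^1$ by a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x, r = rng.normal(size=4), 1                     # r: index of the 1 in the one-hot target

def forward(W1):
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; y = sigmoid(z2)
    return z1, a1, z2, y, -np.log(y[r])          # C = -log(y_r)

z1, a1, z2, y, C = forward(W1)

dC_dy  = np.zeros_like(y); dC_dy[r] = -1.0 / y[r]   # last layer
dC_dz2 = dC_dy * (y * (1 - y))                      # times diag(sigma'(z2))
dC_dW2 = np.outer(dC_dz2, a1)                       # dz2_i/dW2_ik = a1_k
dC_da1 = W2.T @ dC_dz2                              # dz2/da1 = W2
dC_dz1 = dC_da1 * (a1 * (1 - a1))                   # times diag(sigma'(z1))
dC_dW1 = np.outer(dC_dz1, x)                        # dz1_i/dW1_ik = x_k

# Finite-difference check of one entry of dC/dW1.
h = 1e-6
W1p = W1.copy(); W1p[0, 0] += h
print(dC_dW1[0, 0], (forward(W1p)[-1] - C) / h)     # should be close
```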

Question Q: Does the computational-graph approach use only a backward pass (where is the forward pass of classic backpropagation)? Q: Do we get the same results from the two different approaches?

Computational Graph for Recurrent Network

Recurrent Network The same function $f$ is applied at every time step: $[y^t, h^t] = f(x^t, h^{t-1}; W^i, W^h, W^o)$ with $h^t = \sigma(W^i x^t + W^h h^{t-1})$ and $y^t = \mathrm{softmax}(W^o h^t)$ (biases are ignored here). Unrolled: $h^0 \to h^1 \to h^2 \to h^3 \to \cdots$ with inputs $x^1, x^2, x^3, \ldots$ and outputs $y^1, y^2, y^3, \ldots$

Recurrent Network One time step as a computational graph: the two products $W^i x^1$ and $W^h h^0$ are computed as intermediate nodes and summed to give $z^1$; then $h^1 = \sigma(z^1)$, $o^1 = W^o h^1$, and $y^1 = \mathrm{softmax}(o^1)$.

Recurrent Network Drawing the step more compactly, the parameters $W^i$, $W^h$, $W^o$ appear as nodes feeding into the step: $h^1 = \sigma(W^i x^1 + W^h h^0)$ and $y^1 = \mathrm{softmax}(W^o h^1)$.

Recurrent Network Unrolling three time steps: each output $y^t$ is compared with its target $\hat{y}^t$ to give a cost $C^t$, and the total cost is $C = C^1 + C^2 + C^3$. The same parameters $W^i$, $W^h$, $W^o$ are shared by every step.

Recurrent Network To compute $\frac{\partial C}{\partial W^h}$, label every edge of the unrolled graph with its partial derivative: $\frac{\partial C^t}{\partial y^t}$, $\frac{\partial y^t}{\partial h^t}$, $\frac{\partial h^2}{\partial h^1}$, $\frac{\partial h^3}{\partial h^2}$, and $\frac{\partial h^t}{\partial W^h}$ for $t = 1, 2, 3$. Because $W^h$ appears at every time step, $\frac{\partial C}{\partial W^h}$ is the sum of the contributions from its three appearances.

$\frac{\partial C}{\partial W^h} = \left(\frac{\partial C}{\partial W^h}\right)_1 + \left(\frac{\partial C}{\partial W^h}\right)_2 + \left(\frac{\partial C}{\partial W^h}\right)_3$, where the contribution of the first appearance is $\left(\frac{\partial C}{\partial W^h}\right)_1 = \left(\frac{\partial C^1}{\partial y^1}\frac{\partial y^1}{\partial h^1} + \frac{\partial C^2}{\partial y^2}\frac{\partial y^2}{\partial h^2}\frac{\partial h^2}{\partial h^1} + \frac{\partial C^3}{\partial y^3}\frac{\partial y^3}{\partial h^3}\frac{\partial h^3}{\partial h^2}\frac{\partial h^2}{\partial h^1}\right)\frac{\partial h^1}{\partial W^h}$, and the second and third contributions are obtained in the same way through $\frac{\partial h^2}{\partial W^h}$ and $\frac{\partial h^3}{\partial W^h}$.
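
A scalar sketch of this formula (my own simplification: 1-D states, a linear output, and squared loss instead of softmax/cross-entropy, so every edge derivative is just a number). It accumulates the three contributions exactly as above and checks the result with a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

wi, wh, wo = 0.5, 0.8, 1.2                     # illustrative scalar parameters
xs, targets, h0 = [1.0, -0.5, 0.3], [0.2, 0.1, -0.4], 0.0

def run(wh):
    h, hs, C = h0, [], 0.0
    for x, t in zip(xs, targets):
        h_prev = h
        h = sigmoid(wi * x + wh * h_prev)      # h_t = sigma(Wi x_t + Wh h_{t-1})
        hs.append((h_prev, h))
        C += 0.5 * (wo * h - t) ** 2           # C_t with y_t = wo * h_t and squared loss
    return hs, C

hs, C = run(wh)

dC_dh     = [wo * (wo * h - t) for (_, h), t in zip(hs, targets)]  # dC_t/dh_t
dh_dhprev = [wh * h * (1 - h) for (_, h) in hs]                    # dh_t/dh_{t-1}
dh_dwh    = [h_prev * h * (1 - h) for (h_prev, h) in hs]           # dh_t/dWh (direct)

grad = 0.0
for t in range(3):                             # contribution of the t-th appearance of Wh
    acc = 0.0
    for s in range(t, 3):                      # every later cost C_s reached from h_t
        path = dC_dh[s]
        for k in range(s, t, -1):              # dh_s/dh_{s-1} ... dh_{t+1}/dh_t
            path *= dh_dhprev[k]
        acc += path
    grad += acc * dh_dwh[t]

eps = 1e-6
print(grad, (run(wh + eps)[1] - C) / eps)      # analytic vs. finite difference
```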

Reference Textbook: Deep Learning, Chapter 6.5. Calculus on Computational Graphs: Backpropagation, https://colah.github.io/posts/2015-08-Backprop/. On chain rule, computational graphs, and backpropagation, http://outlace.com/Computational-Graph/

Acknowledgement Thanks to 翁丞世 for finding errors in the slides.