Computing Gradient Hung-yi Lee 李宏毅

Presentation transcript:

Computing Gradient Hung-yi Lee 李宏毅 https://medium.com/@aidangomez/let-s-do-this-f9b699de31d9#.n6l28c26g

Introduction Backpropagation: an efficient way to compute the gradient. Prerequisites: Backpropagation for feedforward nets: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html (simple version: https://www.youtube.com/watch?v=ibJpTrp5mcE); Backpropagation through time for RNNs: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html. This lecture: understanding backpropagation through computational graphs, the view taken by TensorFlow, Theano, CNTK, etc.

Computational Graph

Computational Graph A “language” describing a function. Node: variable (scalar, vector, tensor, ...). Edge: operation (simple function). Example: $a = f(b, c)$ is drawn as nodes $b$ and $c$ connected to node $a$ by an edge labelled $f$. Example: $y = f(g(h(x)))$ becomes the chain $x \to u \to v \to y$ with $u = h(x)$, $v = g(u)$, $y = f(v)$.
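
To make the “language” concrete, here is a toy sketch (the functions h, g, f below are my own picks, not from the slides) of the chained example $y = f(g(h(x)))$: each edge applies one simple function to the value of the previous node.

```python
import math

# Toy sketch: y = f(g(h(x))) evaluated node by node along the chain x -> u -> v -> y.
h = lambda x: x ** 2        # u = h(x)
g = lambda u: math.exp(u)   # v = g(u)
f = lambda v: 3.0 * v       # y = f(v)

x = 0.5
u = h(x)
v = g(u)
y = f(v)
print(u, v, y)              # 0.25  ~1.284  ~3.852
```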

Computational Graph Example: $e = (a+b)\times(b+1)$. Introduce intermediate nodes $c = a + b$ and $d = b + 1$, so $e = c \times d$. With $a = 2$, $b = 1$: $c = 3$, $d = 2$, $e = 6$.
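
As a quick check of the numbers on this slide, a minimal sketch (plain Python, my own code) that evaluates the graph in topological order:

```python
# Evaluate e = (a+b)*(b+1) node by node; with a=2, b=1 we recover c=3, d=2, e=6.
a, b = 2.0, 1.0
c = a + b        # node c = a + b
d = b + 1.0      # node d = b + 1
e = c * d        # node e = c * d
print(c, d, e)   # 3.0 2.0 6.0
```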

Review: Chain Rule Case 1: $y = g(x)$, $z = h(y)$, so $\frac{dz}{dx} = \frac{dz}{dy}\frac{dy}{dx}$. Case 2: $x = g(s)$, $y = h(s)$, $z = k(x, y)$, so $\frac{dz}{ds} = \frac{\partial z}{\partial x}\frac{dx}{ds} + \frac{\partial z}{\partial y}\frac{dy}{ds}$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial b}$ by summing over all paths from $b$ to $e$. Edge derivatives: $\frac{\partial e}{\partial c} = d = b+1$, $\frac{\partial e}{\partial d} = c = a+b$, $\frac{\partial c}{\partial a} = 1$, $\frac{\partial c}{\partial b} = 1$, $\frac{\partial d}{\partial b} = 1$. Hence $\frac{\partial e}{\partial b} = 1\times(b+1) + 1\times(a+b)$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial b}$ at $a = 3$, $b = 2$: the forward pass gives $c = 5$, $d = 3$, $e = 15$, so the edge derivatives evaluate to $\frac{\partial e}{\partial c} = d = 3$, $\frac{\partial e}{\partial d} = c = 5$, and $\frac{\partial c}{\partial a} = \frac{\partial c}{\partial b} = \frac{\partial d}{\partial b} = 1$.
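
A hedged numerical check of this slide (my own code, plain Python): the path-sum formula for $\partial e/\partial b$ agrees with a central finite difference at $a = 3$, $b = 2$.

```python
# Path-sum formula vs. finite difference for de/db of e = (a+b)*(b+1).
def e(a, b):
    return (a + b) * (b + 1)

a, b, h = 3.0, 2.0, 1e-6
analytic = 1 * (b + 1) + 1 * (a + b)              # 3 + 5 = 8
numeric = (e(a, b + h) - e(a, b - h)) / (2 * h)   # central difference, ~8
print(analytic, numeric)
```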

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial b} = 8$ at $a = 3$, $b = 2$: start with 1 at node $b$ and move toward the output, multiplying by each edge derivative along the way. The accumulated values are 1 at $c$ ($= \partial c/\partial b$) and 1 at $d$ ($= \partial d/\partial b$), and at $e$ the two paths sum to $1\times3 + 1\times5 = 8$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Compute $\frac{\partial e}{\partial a} = 3$ at $a = 3$, $b = 2$ the same way: start with 1 at node $a$; the only path to $e$ goes through $c$, so $\frac{\partial e}{\partial a} = 1\times\frac{\partial e}{\partial c} = 1\times3 = 3$.

Computational Graph Example: $e = (a+b)\times(b+1)$. Reverse mode: compute $\frac{\partial e}{\partial a}$ and $\frac{\partial e}{\partial b}$ together at $a = 3$, $b = 2$ by starting with 1 at the output $e$ and multiplying by the edge derivatives while moving backward: the value reaching $c$ is $1\times3 = 3$, the value reaching $d$ is $1\times5 = 5$, so $\frac{\partial e}{\partial a} = 3$ and $\frac{\partial e}{\partial b} = 3\times1 + 5\times1 = 8$. What is the benefit? One backward pass yields the derivative of the output with respect to every input, instead of one forward pass per input.
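
Below is a minimal reverse-mode sketch (my own illustration, not the lecture's code) for this graph: every node keeps a gradient accumulator, the backward pass starts with 1 at $e$, and one pass yields both $\partial e/\partial a = 3$ and $\partial e/\partial b = 8$.

```python
# Minimal reverse-mode automatic differentiation for e = (a+b)*(b+1).
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local derivative w.r.t. that parent)
        self.grad = 0.0

def add(x, y):
    return Node(x.value + y.value, [(x, 1.0), (y, 1.0)])

def mul(x, y):
    return Node(x.value * y.value, [(x, y.value), (y, x.value)])

def backward(out):
    # Start with 1 at the output and push gradients along every edge.
    # This simple traversal suffices here because the shared node (b) is a leaf;
    # a general implementation would visit nodes in reverse topological order.
    out.grad = 1.0
    stack = [out]
    while stack:
        node = stack.pop()
        for parent, local in node.parents:
            parent.grad += node.grad * local
            stack.append(parent)

a, b = Node(3.0), Node(2.0)
c = add(a, b)            # c = a + b
d = add(b, Node(1.0))    # d = b + 1
e = mul(c, d)            # e = c * d
backward(e)
print(a.grad, b.grad)    # 3.0  8.0
```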

Computational Graph Parameter sharing: the same parameter appears in different nodes. Example: $y = x e^{x^2}$, built as $u = x \times x$, $v = \exp(u)$, $y = x \times v$, so $x$ feeds into three nodes. Summing over the three paths from $x$ to $y$: the path through the final multiplication contributes $\frac{\partial y}{\partial x} = v = e^{x^2}$, and each of the two $x$ inputs of the squaring node contributes $\frac{\partial y}{\partial x} = x \cdot e^{x^2} \cdot x$. Total: $\frac{\partial y}{\partial x} = e^{x^2} + x \cdot e^{x^2} \cdot 2x$.
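
A small check of the parameter-sharing result (my own code, assuming numpy): summing the three path contributions matches a finite-difference estimate of $dy/dx$.

```python
import numpy as np

# y = x * exp(x^2): one path through the final multiplication, two through the squaring node.
x = 0.7
path_final_mul = np.exp(x ** 2)                 # dy/dx through y = x * v
paths_square   = 2 * (x * np.exp(x ** 2) * x)   # the two x inputs of u = x * x
analytic = path_final_mul + paths_square        # e^{x^2} + 2 x^2 e^{x^2}

f = lambda t: t * np.exp(t ** 2)
h = 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
print(analytic, numeric)                        # both ~3.23 at x = 0.7
```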

Computational Graph for Feedforward Net

Review: Backpropagation (figure) Forward pass: compute the activations layer by layer. Backward pass: propagate the error signal from the output back through the layers.

Review: Backpropagation (figure) The error signal at layer L-1 is obtained from the error signal at layer L during the backward pass (we do not use softmax here).

Feedforward Network $y = f(x) = \sigma(W^L \cdots \sigma(W^2 \sigma(W^1 x + b^1) + b^2) \cdots + b^L)$. As a computational graph (showing two layers): $z^1 = W^1 x + b^1$, $a^1 = \sigma(z^1)$, $z^2 = W^2 a^1 + b^2$, $y = \sigma(z^2)$.
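
A minimal forward pass for this two-layer graph (a sketch assuming numpy, a sigmoid activation, and illustrative layer sizes of my own choosing):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden, n_out = 4, 3, 2            # illustrative sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(n_hidden, n_in)),  rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)

x  = rng.normal(size=n_in)
z1 = W1 @ x + b1       # z1 = W1 x + b1
a1 = sigmoid(z1)       # a1 = sigma(z1)
z2 = W2 @ a1 + b2      # z2 = W2 a1 + b2
y  = sigmoid(z2)       # y  = sigma(z2)
print(y)
```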

Loss Function of Feedforward Network Append the cost node to the graph: $C = L(y, \hat{y})$, where $\hat{y}$ is the target. The graph is $x \to z^1 \to a^1 \to z^2 \to y \to C$ with $z^1 = W^1 x + b^1$ and $z^2 = W^2 a^1 + b^2$.

Gradient of Cost Function $C = L(y, \hat{y})$. To compute the gradient: compute the partial derivative on every edge of the graph, then use reverse mode starting from $C$ (the output $C$ is always a scalar). The quantities we want are $\frac{\partial C}{\partial W^1}$, $\frac{\partial C}{\partial W^2}$, $\frac{\partial C}{\partial b^1}$, $\frac{\partial C}{\partial b^2}$. But what is an edge derivative such as $\frac{\partial a^1}{\partial z^1}$ when both $a^1$ and $z^1$ are vectors?

Jacobian Matrix For $y = f(x)$ with $x = [x_1\ x_2\ x_3]^T$ and $y = [y_1\ y_2]^T$: $\frac{\partial y}{\partial x} = \begin{bmatrix} \partial y_1/\partial x_1 & \partial y_1/\partial x_2 & \partial y_1/\partial x_3 \\ \partial y_2/\partial x_1 & \partial y_2/\partial x_2 & \partial y_2/\partial x_3 \end{bmatrix}$ (number of rows = size of $y$, number of columns = size of $x$). Example: $f\!\left(\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}\right) = \begin{bmatrix} x_1 + x_2 x_3 \\ 2 x_3 \end{bmatrix}$, so $\frac{\partial y}{\partial x} = \begin{bmatrix} 1 & x_3 & x_2 \\ 0 & 0 & 2 \end{bmatrix}$.
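
A sketch (my own code, assuming numpy) that checks the slide's Jacobian example against central finite differences:

```python
import numpy as np

def f(x):
    x1, x2, x3 = x
    return np.array([x1 + x2 * x3, 2.0 * x3])

def numerical_jacobian(f, x, h=1e-6):
    y = f(x)
    J = np.zeros((y.size, x.size))           # rows: size of y, columns: size of x
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = h
        J[:, j] = (f(x + dx) - f(x - dx)) / (2 * h)
    return J

x = np.array([1.0, 2.0, 3.0])
analytic = np.array([[1.0, x[2], x[1]],
                     [0.0, 0.0,  2.0]])
print(np.allclose(analytic, numerical_jacobian(f, x)))   # True
```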

Gradient of Cost Function Last layer: $\frac{\partial C}{\partial y}$. The target $\hat{y}$ is one-hot: only dimension $r$ is 1, all other dimensions are 0. With cross entropy, $C = -\log y_r$, so $\frac{\partial C}{\partial y_i} = -1/y_r$ for $i = r$ and $\frac{\partial C}{\partial y_i} = 0$ for $i \neq r$; we do not have to consider the other dimensions. Hence $\frac{\partial C}{\partial y} = [0 \cdots -1/y_r \cdots 0]$.
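
In code, the last-layer gradient is one line (a sketch assuming numpy; the output values and the index r are made up for illustration):

```python
import numpy as np

y = np.array([0.1, 0.7, 0.2])     # network output (illustrative)
r = 1                             # index where the one-hot target is 1
dC_dy = np.zeros_like(y)
dC_dy[r] = -1.0 / y[r]            # C = -log(y_r)  =>  dC/dy_r = -1/y_r, others 0
print(dC_dy)                      # -1/0.7 at index r, zeros elsewhere
```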

Gradient of Cost Function $\frac{\partial y}{\partial z^2}$ is a Jacobian matrix whose $i$-th row, $j$-th column entry is $\frac{\partial y_i}{\partial z_j^2}$. Since $y_i = \sigma(z_i^2)$ elementwise: for $i \neq j$, $\frac{\partial y_i}{\partial z_j^2} = 0$; for $i = j$, $\frac{\partial y_i}{\partial z_i^2} = \sigma'(z_i^2)$. So $\frac{\partial y}{\partial z^2}$ is a diagonal matrix. How about softmax?
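
The diagonal structure is easy to see in code, and the same snippet hints at the softmax question: the softmax Jacobian, $\partial y_i/\partial z_j = y_i(\delta_{ij} - y_j)$, is dense rather than diagonal (a sketch assuming numpy; the values of $z^2$ are made up).

```python
import numpy as np

z2 = np.array([0.5, -1.0, 2.0])               # illustrative pre-activations

sig = 1.0 / (1.0 + np.exp(-z2))
J_sigmoid = np.diag(sig * (1.0 - sig))        # elementwise sigmoid: diagonal Jacobian

y = np.exp(z2) / np.exp(z2).sum()             # softmax
J_softmax = np.diag(y) - np.outer(y, y)       # y_i * (delta_ij - y_j): off-diagonals nonzero
print(J_sigmoid)
print(J_softmax)
```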

Gradient of Cost Function $\frac{\partial z^2}{\partial a^1}$ is a Jacobian matrix. Since $z_i^2 = w_{i1}^2 a_1^1 + w_{i2}^2 a_2^1 + \cdots + w_{in}^2 a_n^1$ (from $z^2 = W^2 a^1$), its $i$-th row, $j$-th column entry is $\frac{\partial z_i^2}{\partial a_j^1} = W_{ij}^2$, i.e. $\frac{\partial z^2}{\partial a^1} = W^2$.

Gradient of Cost Function $\frac{\partial z^2}{\partial W^2}$: here $W^2$ is an $m \times n$ matrix, so consider it as a vector of length $m \times n$; then $\frac{\partial z^2}{\partial W^2}$ is an $m \times (m \times n)$ matrix whose entry in row $i$ and column $(j-1) \times n + k$ is $\frac{\partial z_i^2}{\partial W_{jk}^2}$. From $z_i^2 = w_{i1}^2 a_1^1 + w_{i2}^2 a_2^1 + \cdots + w_{in}^2 a_n^1$: $\frac{\partial z_i^2}{\partial W_{jk}^2} = 0$ for $i \neq j$, and $\frac{\partial z_i^2}{\partial W_{ik}^2} = a_k^1$ for $i = j$. (Considering $\frac{\partial z^2}{\partial W^2}$ as a tensor makes things easier.)

Gradient of Cost Function Arranged as a matrix, row $i$ of $\frac{\partial z^2}{\partial W^2}$ contains $(a^1)^T$ in the block of columns with $j = i$ and zeros in all the other blocks.

Gradient of Cost Function Putting the pieces together: $\frac{\partial C}{\partial W^1} = \frac{\partial C}{\partial y} \frac{\partial y}{\partial z^2} \frac{\partial z^2}{\partial a^1} \frac{\partial a^1}{\partial z^1} \frac{\partial z^1}{\partial W^1} = [\cdots\ \frac{\partial C}{\partial W_{ij}^1}\ \cdots]$, where $\frac{\partial z^2}{\partial a^1} = W^2$. Similarly, $\frac{\partial C}{\partial W^2} = \frac{\partial C}{\partial y} \frac{\partial y}{\partial z^2} \frac{\partial z^2}{\partial W^2}$.
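
Putting the whole chain into code: a sketch of reverse mode on this graph (my own numpy implementation with illustrative shapes, not the lecture's code), multiplying the Jacobians from the cost backwards and checking one entry of $\partial C/\partial W^1$ by a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
x, r = rng.normal(size=4), 1                     # r: index of the 1 in the one-hot target

def forward(W1):
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; y = sigmoid(z2)
    return z1, a1, z2, y, -np.log(y[r])          # C = -log(y_r)

z1, a1, z2, y, C = forward(W1)

dC_dy  = np.zeros_like(y); dC_dy[r] = -1.0 / y[r]   # last layer
dC_dz2 = dC_dy * (y * (1 - y))                      # times diag(sigma'(z2))
dC_dW2 = np.outer(dC_dz2, a1)                       # dz2_i/dW2_ik = a1_k
dC_da1 = W2.T @ dC_dz2                              # dz2/da1 = W2
dC_dz1 = dC_da1 * (a1 * (1 - a1))                   # times diag(sigma'(z1))
dC_dW1 = np.outer(dC_dz1, x)                        # dz1_i/dW1_ik = x_k

# Finite-difference check of one entry of dC/dW1.
h = 1e-6
W1p = W1.copy(); W1p[0, 0] += h
print(dC_dW1[0, 0], (forward(W1p)[-1] - C) / h)     # should be close
```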

Question Q: Does the computational-graph approach use only a backward pass (where is the forward pass of classic backpropagation)? Q: Do we get the same results from the two different approaches?

Computational Graph for Recurrent Network

Recurrent Network The same function $f$ is applied at every time step: $[y^t, h^t] = f(x^t, h^{t-1}; W^i, W^h, W^o)$ with $h^t = \sigma(W^i x^t + W^h h^{t-1})$ and $y^t = \mathrm{softmax}(W^o h^t)$ (biases are ignored here). Unrolled: $h^0 \to h^1 \to h^2 \to h^3 \to \cdots$ with inputs $x^1, x^2, x^3, \ldots$ and outputs $y^1, y^2, y^3, \ldots$

Recurrent Network One time step as a computational graph: the two products $W^i x^1$ and $W^h h^0$ are computed as intermediate nodes and summed to give $z^1$; then $h^1 = \sigma(z^1)$, $o^1 = W^o h^1$, and $y^1 = \mathrm{softmax}(o^1)$.

Recurrent Network Drawing the step more compactly, the parameters $W^i$, $W^h$, $W^o$ appear as nodes feeding into the step: $h^1 = \sigma(W^i x^1 + W^h h^0)$ and $y^1 = \mathrm{softmax}(W^o h^1)$.

Recurrent Network Unrolling three time steps: each output $y^t$ is compared with its target $\hat{y}^t$ to give a cost $C^t$, and the total cost is $C = C^1 + C^2 + C^3$. The same parameters $W^i$, $W^h$, $W^o$ are shared by every step.

Recurrent Network To compute $\frac{\partial C}{\partial W^h}$, label every edge of the unrolled graph with its partial derivative: $\frac{\partial C^t}{\partial y^t}$, $\frac{\partial y^t}{\partial h^t}$, $\frac{\partial h^2}{\partial h^1}$, $\frac{\partial h^3}{\partial h^2}$, and $\frac{\partial h^t}{\partial W^h}$ for $t = 1, 2, 3$. Because $W^h$ appears at every time step, $\frac{\partial C}{\partial W^h}$ is the sum of the contributions from its three appearances.

$\frac{\partial C}{\partial W^h} = \left(\frac{\partial C}{\partial W^h}\right)_1 + \left(\frac{\partial C}{\partial W^h}\right)_2 + \left(\frac{\partial C}{\partial W^h}\right)_3$, where the contribution of the first appearance is $\left(\frac{\partial C}{\partial W^h}\right)_1 = \left(\frac{\partial C^1}{\partial y^1}\frac{\partial y^1}{\partial h^1} + \frac{\partial C^2}{\partial y^2}\frac{\partial y^2}{\partial h^2}\frac{\partial h^2}{\partial h^1} + \frac{\partial C^3}{\partial y^3}\frac{\partial y^3}{\partial h^3}\frac{\partial h^3}{\partial h^2}\frac{\partial h^2}{\partial h^1}\right)\frac{\partial h^1}{\partial W^h}$, and the second and third contributions are obtained in the same way through $\frac{\partial h^2}{\partial W^h}$ and $\frac{\partial h^3}{\partial W^h}$.
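
A scalar sketch of this formula (my own simplification: 1-D states, a linear output, and squared loss instead of softmax/cross-entropy, so every edge derivative is just a number). It accumulates the three contributions exactly as above and checks the result with a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

wi, wh, wo = 0.5, 0.8, 1.2                     # illustrative scalar parameters
xs, targets, h0 = [1.0, -0.5, 0.3], [0.2, 0.1, -0.4], 0.0

def run(wh):
    h, hs, C = h0, [], 0.0
    for x, t in zip(xs, targets):
        h_prev = h
        h = sigmoid(wi * x + wh * h_prev)      # h_t = sigma(Wi x_t + Wh h_{t-1})
        hs.append((h_prev, h))
        C += 0.5 * (wo * h - t) ** 2           # C_t with y_t = wo * h_t and squared loss
    return hs, C

hs, C = run(wh)

dC_dh     = [wo * (wo * h - t) for (_, h), t in zip(hs, targets)]  # dC_t/dh_t
dh_dhprev = [wh * h * (1 - h) for (_, h) in hs]                    # dh_t/dh_{t-1}
dh_dwh    = [h_prev * h * (1 - h) for (h_prev, h) in hs]           # dh_t/dWh (direct)

grad = 0.0
for t in range(3):                             # contribution of the t-th appearance of Wh
    acc = 0.0
    for s in range(t, 3):                      # every later cost C_s reached from h_t
        path = dC_dh[s]
        for k in range(s, t, -1):              # dh_s/dh_{s-1} ... dh_{t+1}/dh_t
            path *= dh_dhprev[k]
        acc += path
    grad += acc * dh_dwh[t]

eps = 1e-6
print(grad, (run(wh + eps)[1] - C) / eps)      # analytic vs. finite difference
```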

Reference Textbook: Deep Learning, Chapter 6.5. Calculus on Computational Graphs: Backpropagation, https://colah.github.io/posts/2015-08-Backprop/. On chain rule, computational graphs, and backpropagation, http://outlace.com/Computational-Graph/

Acknowledgement Thanks to 翁丞世 for finding errors in the slides.