Deep Neural Network Toolkits: What’s Under the Hood?

Deep Neural Network Toolkits: What’s Under the Hood? How can we generalize BackProp to other ANNs? How can we automate BackProp for other ANNs?

Recap: Wordcount in GuineaPig. Class variables in the planner are data structures.

Recap: Wordcount in GuineaPig. The general idea: embed something that looks like code but that, when executed, builds a data structure. The data structure defines a computation you want to do (a "computation graph"). Then you use the data structure to do the computation (stream-and-sort, streaming Hadoop, …). We're going to re-use the same idea, but now the graph supports both computation of a function and differentiation of that computation.
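
A minimal sketch of that embedding trick, using a hypothetical Plan class rather than GuineaPig's real API: the chained calls only build a data structure, and nothing runs until execute() is called.

```python
# Hypothetical sketch (not GuineaPig's actual API): calls build a data
# structure describing the computation; execute() runs it later.
class Plan:
    def __init__(self, steps=()):
        self.steps = list(steps)            # the "computation graph"
    def then(self, fn):
        return Plan(self.steps + [fn])      # building, not computing
    def execute(self, data):
        for fn in self.steps:               # only now is work actually done
            data = fn(data)
        return data

wordcount = (Plan()
             .then(lambda text: text.split())
             .then(lambda words: {w: words.count(w) for w in set(words)}))
print(wordcount.execute("to be or not to be"))
```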

Autodiff for AP Calculus Problems

computation graph

Generalizing backprop (https://justindomke.wordpress.com/). Starting point: a function of n variables. Step 1 (forward): code your function as a series of assignments, a "Wengert list", and return the final value. Step 2 (backprop): back propagate by going through the list in reverse order, starting with … and using the chain rule; each step reuses quantities computed in the previous step.
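
As a hedged illustration in Python (the function here, f(x1, x2) = sin(x1*x2) + x1, is a toy example and not one from the slides): Step 1 writes f as a Wengert list, and Step 2 walks that list in reverse, applying the chain rule.

```python
import math

def forward(x1, x2):
    # Step 1: the function written as a Wengert list of assignments
    z1 = x1 * x2
    z2 = math.sin(z1)
    f  = z2 + x1
    return f, (z1, z2)

def backward(x1, x2, z1, z2):
    # Step 2: walk the list in reverse, applying the chain rule;
    # every step reuses a derivative computed in a previous step.
    df_df  = 1.0
    df_dz2 = df_df * 1.0              # from f  = z2 + x1
    df_dx1 = df_df * 1.0              # ...and the x1 term of that same line
    df_dz1 = df_dz2 * math.cos(z1)    # from z2 = sin(z1)
    df_dx1 += df_dz1 * x2             # from z1 = x1 * x2 (second contribution)
    df_dx2  = df_dz1 * x1
    return df_dx1, df_dx2

f, (z1, z2) = forward(3.0, 7.0)
print(f, backward(3.0, 7.0, z1, z2))
```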

Autodiff for a NN

Recap: logistic regression with SGD. Let X be a matrix with k examples (one example x_j per row), and let w_i be the input weights for the i-th hidden unit. Then Z = XW is the output (pre-sigmoid) for all m units and all k examples: entry (j, i) of XW is x_j · w_i. There are a lot of chances to do this in parallel, with parallel matrix multiplication.
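
A small numpy sketch of that matrix view (the sizes here are made up for illustration):

```python
import numpy as np

k, d, m = 4, 5, 3                 # k examples, d features, m hidden units
X = np.random.randn(k, d)         # one example per row
W = np.random.randn(d, m)         # column i holds w_i, the weights of unit i

Z = X @ W                         # Z[j, i] = x_j . w_i, all units and examples at once
A = 1.0 / (1.0 + np.exp(-Z))      # element-wise sigmoid of the pre-activations
print(Z.shape)                    # (k, m)
```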

Example: 2-layer neural network. Step 1: forward. (Target Y; N examples; K outs; D feats; H hidden.)
Inputs: X, W1, B1, W2, B2
Z1a = mul(X, W1)     // matrix mult
Z1b = add*(Z1a, B1)  // add bias vec
A1 = tanh(Z1b)       // element-wise
Z2a = mul(A1, W2)
Z2b = add*(Z2a, B2)
A2 = tanh(Z2b)       // element-wise
P = softMax(A2)      // vec to vec
C = crossEntY(P)     // cost function

Example: 2-layer neural network, with shapes. X is N*D, W1 is D*H, B1 is 1*H, W2 is H*K, …; Z1a is N*H, Z1b is N*H, A1 is N*H, Z2a is N*K, Z2b is N*K, A2 is N*K, P is N*K, and C is a scalar. Softmax: p_i = exp(a_i) / sum_j exp(a_j). (Figure: the computation graph for the first six steps, ops mul, add*, tanh, mul, add*, tanh.) Forward list: Inputs: X, W1, B1, W2, B2; Z1a = mul(X, W1) // matrix mult; Z1b = add*(Z1a, B1) // add bias vec; A1 = tanh(Z1b) // element-wise; Z2a = mul(A1, W2); Z2b = add*(Z2a, B2); A2 = tanh(Z2b) // element-wise; P = softMax(A2) // vec to vec; C = crossEntY(P) // cost function. Target Y; N examples; K outs; D feats; H hidden.

Example: 2-layer neural network as a computation graph. (Figure: graph with nodes X, W1, Z1a, B1, Z1b, A1, W2, Z2a, B2, Z2b, A2, P, Y, C and ops mul, add*, tanh, mul, add*, tanh, softmax, crossEnt.) X is N*D, W1 is D*H, B1 is 1*H, W2 is H*K, …; Z1a, Z1b, A1 are N*H; Z2a, Z2b, A2, P are N*K; C is a scalar; p_i = exp(a_i) / sum_j exp(a_j). Forward list: Inputs: X, W1, B1, W2, B2; Z1a = mul(X, W1) // matrix mult; Z1b = add*(Z1a, B1) // add bias vec; A1 = tanh(Z1b) // element-wise; Z2a = mul(A1, W2); Z2b = add*(Z2a, B2); A2 = tanh(Z2b) // element-wise; P = softMax(A2) // vec to vec; C = crossEntY(P) // cost function. Target Y; N examples; K outs; D feats; H hidden.
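
A hedged numpy sketch of this forward list, with small made-up shapes; the variable names follow the slide:

```python
import numpy as np

N, D, H, K = 8, 10, 6, 3                       # examples, feats, hidden, outs
X = np.random.randn(N, D)
W1, B1 = np.random.randn(D, H), np.zeros((1, H))
W2, B2 = np.random.randn(H, K), np.zeros((1, K))
Y = np.eye(K)[np.random.randint(K, size=N)]    # one-hot targets

Z1a = X @ W1                                   # mul: N*H
Z1b = Z1a + B1                                 # add*: broadcast the bias row
A1  = np.tanh(Z1b)                             # element-wise
Z2a = A1 @ W2                                  # N*K
Z2b = Z2a + B2
A2  = np.tanh(Z2b)
P   = np.exp(A2) / np.exp(A2).sum(axis=1, keepdims=True)   # softMax, row-wise
C   = -np.sum(Y * np.log(P)) / N               # crossEntY, a scalar
```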

Example: 2-layer neural network return inputs: Step 1: forward Step 1: backprop dC/dC = 1 dC/dP = dC/dC * dCrossEntY/dP dC/dA2 = dC/dP * dsoftmax/dA2 dC/Z2b = dC/dA2 * dtanh/dZ2b dC/dZ2a = dC/dZ2b * dadd*/dZ2a dC/dB2 = dC/dZ2b * dadd*/dB2 dC/dA1 = … Inputs: X,W1,B1,W2,B2 Z1a = mul(X,W1) // matrix mult Z1b = add*(Z1a,B1) // add bias vec A1 = tanh(Z1b) //element-wise Z2a = mul(A1,W2) Z2b = add*(Z2a,B2) A2 = tanh(Z2b) // element-wise P = softMax(A2) // vec to vec C = crossEntY(P) // cost function Target Y; N rows; K outs; D feats, H hidden

Example: 2-layer neural network. Step 2: backprop, shown against the computation graph (figure: nodes X, W1, Z1a, B1, Z1b, A1, W2, Z2a, B2, Z2b, A2, P, Y, C; ops mul, add*, tanh, mul, add*, tanh, softmax, crossEnt). dC/dP and dC/dA2 are N*K. Backprop so far: dC/dC = 1; dC/dP = dC/dC * dcrossEntY/dP; dC/dA2 = dC/dP * dsoftmax/dA2; dC/dZ2b = dC/dA2 * dtanh/dZ2b; dC/dZ2a = dC/dZ2b * dadd*/dZ2a; dC/dB2 = dC/dZ2b * dadd*/dB2; dC/dA1 = … Forward list: Inputs: X, W1, B1, W2, B2; Z1a = mul(X, W1); Z1b = add*(Z1a, B1); A1 = tanh(Z1b); Z2a = mul(A1, W2); Z2b = add*(Z2a, B2); A2 = tanh(Z2b); P = softMax(A2); C = crossEntY(P). Target Y; N rows; K outs; D feats; H hidden.

Example: 2-layer neural network: the full backprop list. (Figure: the computation graph, nodes X, W1, Z1a, B1, Z1b, A1, W2, Z2a, B2, Z2b, A2, P, C; ops mul, add, tanh, mul, add, tanh, softmax, crossEntY.)
dC/dC = 1
dC/dP = dC/dC * dcrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd*/dZ2a
dC/dB2 = dC/dZ2b * dadd*/dB2
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = dC/dA1 * dtanh/dZ1b
dC/dZ1a = dC/dZ1b * dadd*/dZ1a
dC/dB1 = dC/dZ1b * dadd*/dB1
dC/dX = dC/dZ1a * dmul/dX
dC/dW1 = dC/dZ1a * dmul/dW1
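
A hedged numpy sketch of the backward pass, continuing the forward sketch above. It uses the standard softmax-plus-cross-entropy shortcut dC/dA2 = (P - Y)/N instead of multiplying the two Jacobians separately as the slide's list does.

```python
# Backward pass: reverse order of the Wengert list in the forward sketch.
dA2  = (P - Y) / N                      # dC/dA2, softmax + crossEntY combined
dZ2b = dA2 * (1.0 - A2 ** 2)            # dtanh(z) = 1 - tanh(z)^2
dZ2a = dZ2b                             # add*: gradient passes straight through...
dB2  = dZ2b.sum(axis=0, keepdims=True)  # ...and sums over the broadcast axis
dA1  = dZ2a @ W2.T                      # mul, w.r.t. its first argument
dW2  = A1.T @ dZ2a                      # mul, w.r.t. its second argument
dZ1b = dA1 * (1.0 - A1 ** 2)
dZ1a = dZ1b
dB1  = dZ1b.sum(axis=0, keepdims=True)
dX   = dZ1a @ W1.T
dW1  = X.T @ dZ1a
```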

Example: 2-layer neural network with "tied parameters": the same bias vector B is used in both layers. (Figure: the computation graph, now with a single node B feeding both add ops; nodes X, W1, Z1a, B, Z1b, A1, W2, Z2a, Z2b, A2, P, C.)
dC/dC = 1
dC/dP = dC/dC * dcrossEntY/dP
dC/dA2 = dC/dP * dsoftmax/dA2
dC/dZ2b = dC/dA2 * dtanh/dZ2b
dC/dZ2a = dC/dZ2b * dadd*/dZ2a
dC/dB2 = dC/dZ2b * dadd*/dB
dC/dA1 = dC/dZ2a * dmul/dA1
dC/dW2 = dC/dZ2a * dmul/dW2
dC/dZ1b = dC/dA1 * dtanh/dZ1b
dC/dZ1a = dC/dZ1b * dadd*/dZ1a
dC/dB1 = dC/dZ1b * dadd*/dB
dC/dX = dC/dZ1a * dmul/dX
dC/dW1 = dC/dZ1a * dmul/dW1
(The gradient for the shared B is the sum of these two contributions.)

Example: 2-layer neural network. We need a backward form for each matrix operation used in forward, with respect to each argument. (Target Y; N rows; K outs; D feats; H hidden.)
Step 1: forward. Inputs: X, W1, B1, W2, B2; Z1a = mul(X, W1) // matrix mult; Z1b = add(Z1a, B1) // add bias vec; A1 = tanh(Z1b) // element-wise; Z2a = mul(A1, W2); Z2b = add(Z2a, B2); A2 = tanh(Z2b) // element-wise; P = softMax(A2) // vec to vec; C = crossEntY(P) // cost function.
Step 2: backprop. dC/dC = 1; dC/dP = dC/dC * dcrossEntY/dP; dC/dA2 = dC/dP * dsoftmax/dA2; dC/dZ2b = dC/dA2 * dtanh/dZ2b; dC/dZ2a = dC/dZ2b * dadd/dZ2a; dC/dB2 = dC/dZ2b * dadd/dB2; dC/dA1 = …

Example: 2-layer neural network. An autodiff package usually includes: a collection of matrix-oriented operations (mul, add*, …), and for each operation a forward implementation and a backward implementation for each argument. We need a backward form for each matrix operation used in forward, with respect to each argument. It's still a little awkward to program with a list of assignments, so….
(Forward list, for reference: Inputs: X, W1, B1, W2, B2; Z1a = mul(X, W1); Z1b = add(Z1a, B1); A1 = tanh(Z1b); Z2a = mul(A1, W2); Z2b = add(Z2a, B2); A2 = tanh(Z2b); P = softMax(A2); C = crossEntY(P). Target Y; N rows; K outs; D feats; H hidden.)
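
One way to picture that package structure (a hypothetical registry, not any particular library's API): a table mapping each op name to a forward function plus one backward function per argument.

```python
import numpy as np

# Hypothetical registry: a forward impl plus one backward impl per argument.
# Each backward takes (upstream gradient, inputs) and returns the gradient
# with respect to that particular argument.
OPS = {
    "mul": {
        "forward":  lambda a, b: a @ b,
        "backward": [lambda g, a, b: g @ b.T,       # d/d(first arg)
                     lambda g, a, b: a.T @ g],      # d/d(second arg)
    },
    "add*": {
        "forward":  lambda a, b: a + b,             # b is a 1*H bias row
        "backward": [lambda g, a, b: g,
                     lambda g, a, b: g.sum(axis=0, keepdims=True)],
    },
    "tanh": {
        "forward":  lambda a: np.tanh(a),
        "backward": [lambda g, a: g * (1.0 - np.tanh(a) ** 2)],
    },
}

g = np.ones((2, 3)); a = np.random.randn(2, 4); b = np.random.randn(4, 3)
print(OPS["mul"]["backward"][1](g, a, b).shape)     # (4, 3): gradient w.r.t. b
```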

What’s Going On Here?

Differentiating a Wengert list: a simple case. In high school: symbolic differentiation, computing a symbolic form of the derivative of f. Now: automatic differentiation, finding an algorithm to compute f'(a) at any point a.

Differentiating a Wengert list: a simple case. Now: automatic differentiation, finding an algorithm to compute f'(a) at any point a. Notation: a_i is the i-th output on input a.

Differentiating a Wengert list: a simple case. What did Leibniz mean with this notation? It holds for all a. Notation: a_i is the i-th output on input a.

Differentiating a Wengert list: a simple case (for all a).

Differentiating a Wengert list: a simple case (for all a). The backprop routine computes these derivatives in the reverse of the compute order.

Differentiating a Wengert list

Generalizing backprop (https://justindomke.wordpress.com/). Starting point: a function of n variables. Step 1: code your function as a series of assignments (a Wengert list). Better plan: overload your matrix operators so that when you use them in-line they build an expression graph, and convert the expression graph to a Wengert list when necessary.
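
A hedged sketch of that "better plan": the Expr class and to_wengert function below are hypothetical, but they show how overloaded operators can build an expression graph that is then flattened into a Wengert list.

```python
class Expr:
    """Expression-graph nodes, built as a side effect of ordinary-looking code."""
    def __init__(self, name=None, fun=None, args=()):
        self.name, self.fun, self.args = name, fun, args
    def __mul__(self, other):  return Expr(fun="mul", args=(self, other))
    def __add__(self, other):  return Expr(fun="add", args=(self, other))

def to_wengert(expr, steps=None):
    """Flatten an expression graph into a Wengert list of (target, fun, args)."""
    if steps is None:
        steps = []
    if expr.fun is None:                        # a leaf: input or parameter
        return expr.name, steps
    args = [to_wengert(a, steps)[0] for a in expr.args]
    target = f"z{len(steps) + 1}"               # invent a register name
    steps.append((target, expr.fun, args))
    return target, steps

h, w = Expr(name="h"), Expr(name="w")
_, wlist = to_wengert(h * w + h)                # looks like math, builds a graph
print(wlist)   # [('z1', 'mul', ['h', 'w']), ('z2', 'add', ['z1', 'h'])]
```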

Systems that implement Automatic Differentiation

Example of an Autodiff System: Theano … # generate some training data

Example: Theano
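
A hedged sketch in the style of the classic Theano logistic-regression tutorial (the data here is random and purely illustrative): symbolic variables build the graph, T.grad derives gradients, and theano.function compiles the training step.

```python
import numpy as np
import theano
import theano.tensor as T

# generate some training data (tiny and random, purely illustrative)
N, D = 400, 10
X_train = np.random.randn(N, D)
y_train = (np.random.randn(N) > 0).astype(X_train.dtype)

x = T.dmatrix("x")                            # symbolic inputs: no values yet
y = T.dvector("y")
w = theano.shared(np.zeros(D), name="w")      # parameters: shared variables
b = theano.shared(0.0, name="b")

p = T.nnet.sigmoid(T.dot(x, w) + b)           # builds an expression graph
cost = T.nnet.binary_crossentropy(p, y).mean()
gw, gb = T.grad(cost, [w, b])                 # symbolic gradients via autodiff

train = theano.function(inputs=[x, y], outputs=cost,
                        updates=[(w, w - 0.1 * gw), (b, b - 0.1 * gb)])
for step in range(10):
    print(train(X_train, y_train))            # compiles once, then just runs
```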

Example: 2-layer neural network. An autodiff package usually includes: a collection of matrix-oriented operations (mul, add*, …); for each operation, a forward implementation and a backward implementation for each argument; a way of composing operations into expressions (often using operator overloading) which evaluate to expression trees; and expression simplification/compilation.
(Forward list, for reference: Inputs: X, W1, B1, W2, B2; Z1a = mul(X, W1); Z1b = add(Z1a, B1); A1 = tanh(Z1b); Z2a = mul(A1, W2); Z2b = add(Z2a, B2); A2 = tanh(Z2b); P = softMax(A2); C = crossEntY(P).)

Some tools that use autodiff:
Theano: Univ. Montreal system, Python-based; first version 2007, now v0.8, integrated with GPUs. Many libraries are built over Theano (Lasagne, Keras, Blocks, …).
Torch: Collobert et al.; used heavily at Facebook; based on Lua. Similar performance-wise to Theano.
PyTorch: similar API to Torch, but in Python.
TensorFlow: Google's system. Also supported by Keras.
…

More About Systems That Do Autodiff

Inputs and parameters. Our abstract reverse-mode autodiff treats all nodes in the graph the same, but they are different: some are functions of other nodes, some are "inputs" (like X, Y), and some are "parameters" (like W). Inputs and parameters are both autodiff leaves, but what happens to them is different; in the HW they have different names.

Inputs and parameters in TensorFlow. Inputs are "placeholders": they have a type and shape (for the matrix) but not a value; you bind the placeholders to values (e.g., numpy arrays) when you execute the computation graph. Parameters are graph nodes, usually with an explicit assign operator added, e.g. w_update = w.assign(w - rate*grad); then the system knows to reuse memory. That's called an "update" step, and a TensorFlow optimizer will output an update.
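
A hedged TF1-style sketch of those pieces (under TensorFlow 2 the same calls live in tf.compat.v1 with eager execution disabled); the model and data are made up for illustration.

```python
import numpy as np
import tensorflow as tf   # assumes the TF1 graph API

D = 10
x = tf.placeholder(tf.float32, shape=[None, D])   # input: type + shape, no value
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([D, 1]))                 # parameter: a graph node

p = tf.sigmoid(tf.matmul(x, w))
loss = tf.reduce_mean(tf.square(p - y))
grad = tf.gradients(loss, [w])[0]
w_update = w.assign(w - 0.1 * grad)               # an explicit "update" node

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    X, Y = np.random.randn(5, D), np.random.randn(5, 1)
    # bind the placeholders to numpy arrays when executing the graph
    print(sess.run([loss, w_update], feed_dict={x: X, y: Y})[0])
```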

More detail: there are lots of tutorials. One nice walkthrough of the same task in TensorFlow, PyTorch, and Theano: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture8.pdf

Autodiff In More Detail

Walkthru of an Autodiff Program. Matrix and vector operations are very useful… but let's start with some simple scalar operations and their Python encoding.

The Interesting Part: Evaluation and Differentiation of a Wengert list To compute f(3,7):

The Interesting Part: Evaluation and Differentiation of a Wengert list. To compute f(3,7), the register values are populated with a call to "eval".
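
A hedged sketch of what that might look like, assuming the lecture's running h, w, area example (area = half(h*w)) and a hypothetical (target, op, args) record format: eval walks the list once and populates a dict of register values.

```python
# Hypothetical Wengert list for area = half(h * w):
# each step is (output register, op name, argument registers).
WENGERT = [("z1", "mul", ["h", "w"]),
           ("area", "half", ["z1"])]
OPS = {"mul": lambda a, b: a * b,
       "half": lambda a: 0.5 * a}

def eval_list(wengert, **inputs):
    vals = dict(inputs)                      # registers populated by eval
    for out, fun, args in wengert:
        vals[out] = OPS[fun](*(vals[a] for a in args))
    return vals

print(eval_list(WENGERT, h=3, w=7))   # {'h': 3, 'w': 7, 'z1': 21, 'area': 10.5}
```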

The Tricky Part: Constructing a List. Goal: write code like this: add your problem-specific functions (but just the syntax, not the G and DG definitions), define the function to optimize with gradient descent, and use your function.

Python code to create a Wengert list: h, w, and area are registers; there is also an automatically-named register z1 to hold h*w.

Python code to create a Wengert list. Some Python hacking can use the class variable name 'area' and insert it into the 'name' field of a register, and invent names for the others at setup() time. Register records: {name: "h", role: "input"}, {name: "w", role: "input"}, {name: "z1" (invented at setup), role: "opOut", defAs: O1}, {name: "area", role: "opOut", defAs: O2}.

Python code to create a Wengert list. "half" is an operator, and so is "mul": h*w becomes mul(h, w). Each instance of an operator points to the registers that are its arguments and outputs.

Python code to create a Wengert list. h and w are registers; mul() generates an operator. Records: {name: "h", role: "input"}, {name: "w", role: "input"}, {name: "z1", role: "opOut", defAs: O1} where O1 = {fun: "mul", args: [A1, A2], output: R3}, and {name: "area", role: "opOut", defAs: O2} where O2 = {fun: "half", args: [A1], output: R4}.

Python code to create a Wengert list. Record for the final register: {name: "area", role: "opOut", defAs: O2}, where O2 = {fun: "half", args: [A1], output: R4}.
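
A hedged sketch of how such records might look as Python objects; the field names come from the slides, but the Register and Operator classes themselves are hypothetical, not xman.py's actual code.

```python
class Register:
    """A named slot in the Wengert list: an input or an operator output."""
    def __init__(self, name=None, role="input", defAs=None):
        self.name, self.role, self.defAs = name, role, defAs

class Operator:
    """Points at the registers that are its arguments and its output."""
    def __init__(self, fun, args, output):
        self.fun, self.args, self.output = fun, args, output

def mul(a, b):
    """Calling mul(h, w) builds structure instead of multiplying numbers."""
    out = Register(role="opOut")              # name invented later, at setup()
    out.defAs = Operator("mul", [a, b], out)
    return out

def half(a):
    out = Register(role="opOut")
    out.defAs = Operator("half", [a], out)
    return out

h, w = Register("h"), Register("w")
area = half(mul(h, w))                        # an unnamed z1, then "area"
print(area.defAs.fun, area.defAs.args[0].defAs.fun)   # half mul
```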

Autodiff In EVEN MORE Detail

More detail: xman.py

More detail

… … Record: {name: "area", role: "opOut", defAs: O2}, where O2 = {fun: "half", args: [A1], output: R4}.

Using xman.py

Using xman.py …

Using xman.py – like Guinea Pig

An Xman instance and a loop