Backpropagation and Neural Nets


Backpropagation and Neural Nets EECS 442 – Prof. David Fouhey Winter 2019, University of Michigan http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/

So Far: Linear Models
$L(\mathbf{w}) = \lambda \|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$
Example: find w minimizing squared error over the data. Each datapoint is represented by some vector x. Can find the optimal w with a ~10-line derivation.

Last Class
$L(\mathbf{w}) = \lambda \|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i; \mathbf{w}))$
What about an arbitrary loss function L? What about an arbitrary parametric function f? Solution: take the gradient and do gradient descent: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha \nabla_{\mathbf{w}} L(f(\mathbf{w}_i))$. What if L(f(w)) is complicated? Today!

Taking the Gradient – Review
$f(x) = (-x+3)^2$. Decompose it: $f = q^2$, $q = r+3$, $r = -x$, so $\partial f/\partial q = 2q$, $\partial q/\partial r = 1$, $\partial r/\partial x = -1$.
Chain rule: $\dfrac{\partial f}{\partial x} = \dfrac{\partial f}{\partial q}\dfrac{\partial q}{\partial r}\dfrac{\partial r}{\partial x} = 2q \cdot 1 \cdot (-1) = -2(-x+3) = 2x-6$.
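To make the arithmetic concrete, here is a minimal sketch (my own illustration, not from the slides) that checks the chain-rule result 2x - 6 against a finite-difference estimate:

```python
# Check that d/dx (-x+3)^2 = 2x - 6 using a finite-difference estimate.
def f(x):
    return (-x + 3) ** 2

def analytic_grad(x):
    return 2 * x - 6

def numeric_grad(x, eps=1e-6):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in [-2.0, 0.0, 1.5, 4.0]:
    print(x, analytic_grad(x), numeric_grad(x))  # the two gradients should agree
```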

Supplemental Reading: Lectures can only introduce you to a topic. You will solidify your knowledge by doing. I highly recommend working through everything in the Stanford CS231n resources: http://cs231n.github.io/optimization-2/. These slides follow the general examples with a few modifications. The primary difference is that I define local variables n, m per-block.

Let’s Do This Another Way
Suppose we have a box representing a function f. This box does two things:
Forward: given forward input n, compute f(n).
Backwards: given backwards input g, return g·∂f/∂n.
[Diagram: block f with forward input n and output f(n); backward input g and output g(∂f/∂n).]
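As a sketch of the box idea (the class and method names below are my own choices, not an API from the lecture), a block remembers what it saw on the forward pass and multiplies the incoming backward signal g by its local derivative:

```python
class Square:
    """The n^2 block: forward computes n**2, backward returns g * 2n."""
    def forward(self, n):
        self.n = n          # remember the forward input for the backward pass
        return n ** 2
    def backward(self, g):
        return g * 2 * self.n

class Negate:
    """The -n block: forward computes -n, backward flips the sign of g."""
    def forward(self, n):
        return -n
    def backward(self, g):
        return -g
```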

Let’s Do This Another Way
$f(x) = (-x+3)^2$ as a chain of blocks: x → [-n] → -x → [n+3] → -x+3 → [n²] → (-x+3)².
Backward pass, seeding the output with gradient 1:
n² block: ∂/∂n n² = 2n = 2(-x+3) = -2x+6; it passes back (-2x+6)·1.
n+3 block: ∂/∂n (n+3) = 1; it passes back 1·(-2x+6) = -2x+6.
-n block: ∂/∂n (-n) = -1; it passes back -1·(-2x+6) = 2x-6.
So the gradient arriving at x is 2x-6, matching the chain-rule derivation.
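Continuing the sketch (my own worked translation of the diagram), chaining the three blocks and seeding the backward pass with 1 reproduces 2x - 6:

```python
# f(x) = (-x+3)^2 as a chain of blocks: negate -> add 3 -> square.
def backprop_example(x):
    # Forward pass, saving intermediate values.
    r = -x          # -n block
    q = r + 3       # n+3 block
    f = q ** 2      # n^2 block

    # Backward pass: seed with 1, multiply by each local gradient in reverse.
    g = 1.0
    g = g * 2 * q   # through n^2: local gradient 2n at n = q
    g = g * 1       # through n+3: local gradient 1
    g = g * -1      # through -n: local gradient -1
    return f, g

print(backprop_example(5.0))  # gradient should equal 2*5 - 6 = 4
```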

Two Inputs
Given two inputs, just have two input/output wires. Forward: the same. Backward: the same, but send gradients with respect to each variable: for inputs n and m, return g(∂f/∂n) and g(∂f/∂m).
[Diagram: block f with forward inputs n, m and output f(n,m); backward input g and outputs g(∂f/∂n), g(∂f/∂m).]

f(x,y,z) = (x+y)z
[Diagram: x and y feed an n+m block producing x+y; that result and z feed an n*m block producing (x+y)z.]
Example Credit: Karpathy and Fei-Fei

f(x,y,z) = (x+y)z
Multiplication swaps inputs and multiplies the gradient: for the n*m block, ∂/∂n (nm) = m → z·1 and ∂/∂m (nm) = n → (x+y)·1. With upstream gradient 1 at the output, the n*m block passes back z·1 along the x+y wire and (x+y)·1 along the z wire.
Example Credit: Karpathy and Fei-Fei

f(x,y,z) = (x+y)z
Addition sends the gradient through unchanged: for the n+m block, ∂/∂n (n+m) = 1 and ∂/∂m (n+m) = 1, so x and y each receive 1·z·1.
Example Credit: Karpathy and Fei-Fei

f(x,y,z) = (x+y)z
Putting it together: ∂f/∂x = z, ∂f/∂y = z, ∂f/∂z = x+y.
Example Credit: Karpathy and Fei-Fei

Once More, With Numbers!

f(x,y,z) = (x+y)z
Forward pass with x=1, y=4, z=10: x+y = 5, (x+y)z = 50.
Example Credit: Karpathy and Fei-Fei

f(x,y,z) = (x+y)z
Backward through the n*m block with upstream gradient 1: ∂/∂n (nm) = m → 10·1 = 10 flows back along the x+y wire; ∂/∂m (nm) = n → 5·1 = 5 flows back along the z wire.
Example Credit: Karpathy and Fei-Fei

f(x,y,z) = (x+y)z
Backward through the n+m block: ∂/∂n (n+m) = 1 and ∂/∂m (n+m) = 1, so x and y each receive 1·10·1 = 10; z keeps its gradient of 5.
Example Credit: Karpathy and Fei-Fei
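A short sketch of the same numeric example (variable names are mine):

```python
# f(x, y, z) = (x + y) * z with x=1, y=4, z=10.
x, y, z = 1.0, 4.0, 10.0

# Forward pass.
s = x + y          # n+m block: 5
f = s * z          # n*m block: 50

# Backward pass, seeded with df/df = 1.
df = 1.0
ds = z * df        # multiply swaps inputs -> 10
dz = s * df        # multiply swaps inputs -> 5
dx = 1.0 * ds      # addition passes the gradient through -> 10
dy = 1.0 * ds      # addition passes the gradient through -> 10

print(f, dx, dy, dz)  # 50.0 10.0 10.0 5.0
```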

Think You’ve Got It?
$L(w) = (w-6)^2$
We want to fit a model w that will just equal 6. World’s most basic linear model / neural net: no inputs, just a constant output.

I’ll Need a Few Volunteers
$L(w) = (w-6)^2$ as two blocks: w → [n-6] → [n²] → L.
Job #1 (n-6): Forward: compute n-6. Backwards: multiply the incoming gradient by 1.
Job #2 (n²): Forward: compute n². Backwards: multiply the incoming gradient by 2n.
Job #3: Backwards: write down 1 (the seed gradient at the output).
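Here is a minimal gradient-descent sketch for this toy problem (the starting point, step size, and iteration count are my own choices): the backward pass is exactly the three jobs above, and repeating the update drives w toward 6.

```python
# Minimize L(w) = (w - 6)^2 by gradient descent.
w = 0.0            # arbitrary starting point
alpha = 0.1        # step size (my choice)

for step in range(50):
    # Forward: q = w - 6, L = q^2.
    q = w - 6.0
    L = q ** 2
    # Backward: seed 1, multiply by 2q (the n^2 job), then by 1 (the n-6 job).
    dw = 1.0 * 2 * q * 1.0
    # Gradient descent update.
    w = w - alpha * dw

print(w)  # approaches 6
```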

Preemptively: the diagrams look complex, but that’s because we’re covering the details together.

Something More Complex
$f(\mathbf{w},\mathbf{x}) = \dfrac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
[Diagram: w0 and x0 feed an n*m block; w1 and x1 feed a second n*m block; the two products and w2 feed n+m blocks; the sum then passes through n*-1, e^n, n+1, and 1/n blocks.]
Example Credit: Karpathy and Fei-Fei

$f(\mathbf{w},\mathbf{x}) = \dfrac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
Forward pass with w0=2, x0=-1, w1=-3, x1=-2, w2=-3: w0·x0 = -2, w1·x1 = 6, their sum is 4, adding w2 gives 1; then *-1 → -1, e^n → 0.37, n+1 → 1.37, 1/n → 0.73.

$f(\mathbf{w},\mathbf{x}) = \dfrac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
Local gradients we will need: ∂/∂n (m+n) = 1, ∂/∂n (mn) = m, ∂/∂n e^n = e^n, ∂/∂n n⁻¹ = -n⁻², ∂/∂n (an) = a, ∂/∂n (c+n) = 1.
[Diagram: the forward values from before: -2, 6, 4, 1, -1, 0.37, 1.37, 0.73.]
Example Credit: Karpathy and Fei-Fei

Start the backward pass by writing down a gradient of 1 at the output (0.73).
Example Credit: Karpathy and Fei-Fei

Backward through the 1/n block: $-(1.37)^{-2} \cdot 1 = -0.53$. Where does 1.37 come from? It is the forward input to that block, saved during the forward pass.
Example Credit: Karpathy and Fei-Fei

Backward through the n+1 block: local gradient 1, so $1 \cdot (-0.53) = -0.53$.
Example Credit: Karpathy and Fei-Fei

Backward through the e^n block: the local gradient is e^n evaluated at the forward input n = -1, so $e^{-1} \cdot (-0.53) = -0.2$.
Example Credit: Karpathy and Fei-Fei

Backward through the *-1 block: $-1 \cdot (-0.2) = 0.2$.
Example Credit: Karpathy and Fei-Fei

Backward through the + block that adds w2: local gradient 1, so $1 \cdot 0.2 = 0.2$, and it gets sent back in both directions.
Example Credit: Karpathy and Fei-Fei

So w2 receives a gradient of 0.2, and 0.2 also continues back toward the first + block.
Example Credit: Karpathy and Fei-Fei

The first + block again passes 0.2 to both product wires. Backward through the w0·x0 multiply (multiplication swaps the inputs): w0 receives $-1 \cdot 0.2 = -0.2$ and x0 receives $2 \cdot 0.2 = 0.4$.
Example Credit: Karpathy and Fei-Fei

Backward through the w1·x1 multiply: w1 receives $-2 \cdot 0.2 = -0.4$ and x1 receives $-3 \cdot 0.2 = -0.6$.
Example Credit: Karpathy and Fei-Fei

PHEW! The complete set of gradients: ∂f/∂w0 = -0.2, ∂f/∂x0 = 0.4, ∂f/∂w1 = -0.4, ∂f/∂x1 = -0.6, ∂f/∂w2 = 0.2.
Example Credit: Karpathy and Fei-Fei
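To confirm the arithmetic, here is a sketch that runs the same forward and backward pass in code (my own translation of the diagram, with my variable names); the printed gradients should match -0.2, 0.4, -0.4, -0.6, 0.2 up to rounding.

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass through the blocks.
a = w0 * x0            # -2
b = w1 * x1            # 6
c = a + b              # 4
d = c + w2             # 1
e = -d                 # -1 (the *-1 block)
ex = math.exp(e)       # ~0.37
denom = ex + 1.0       # ~1.37
f = 1.0 / denom        # ~0.73

# Backward pass, seeded with 1.
ddenom = -(denom ** -2) * 1.0   # 1/n block: local gradient -n^-2
dex = 1.0 * ddenom              # n+1 block
de = ex * dex                   # e^n block: local gradient e^n
dd = -1.0 * de                  # *-1 block
dc = 1.0 * dd                   # addition passes through
dw2 = 1.0 * dd
da = 1.0 * dc
db = 1.0 * dc
dw0 = x0 * da                   # multiply swaps inputs
dx0 = w0 * da
dw1 = x1 * db
dx1 = w1 * db

# Matches the slide values up to the slide's rounding of 0.2.
print(dw0, dx0, dw1, dx1, dw2)
```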

Summary
Each block computes, for each input x_i, the backwards signal g times its local gradient ∂f/∂x_i evaluated at the forward-pass values.
[Diagram: block f with inputs x_1, …, x_n and output f(x_1, …, x_n); backward input g and outputs g(∂f/∂x_1), …, g(∂f/∂x_n).]

Multiple Outputs Flowing Back
Gradients arriving from different backwards passes sum up: if an input x_i feeds K downstream blocks f_1, …, f_K with backward signals g_1, …, g_K, it receives $\sum_{j=1}^{K} g_j \dfrac{\partial f_j}{\partial x_i}$.

Multiple Outputs Flowing Back
$f(x) = (-x+3)^2$ written as a product: x feeds two copies of the chain [-n] → [n+3], and the two -x+3 values feed an m*n block producing (-x+3)². Backward with seed 1, each branch sends back -x+3 through its n+3 block and then x-3 through its -n block.

Multiple Outputs Flowing Back
The two branches meet at x, and their gradients sum: $\dfrac{\partial f}{\partial x} = (x-3) + (x-3) = 2x-6$, the same answer as before.
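A small sketch (my own) of the same point: when x feeds two branches, the gradients flowing back along each branch are added.

```python
# f(x) = (-x + 3) * (-x + 3), treating the two factors as separate branches.
def grad_via_two_branches(x):
    m = -x + 3               # branch 1 forward
    n = -x + 3               # branch 2 forward
    f = m * n

    # Backward, seeded with 1: the multiply sends n back to branch 1 and m to branch 2.
    dm = n * 1.0
    dn = m * 1.0
    # Each branch is (n+3) then (-n): multiply by 1, then by -1.
    dx_branch1 = -1.0 * dm   # = x - 3
    dx_branch2 = -1.0 * dn   # = x - 3
    # Gradients arriving at x from the two branches sum up.
    return f, dx_branch1 + dx_branch2

print(grad_via_two_branches(5.0))  # gradient should equal 2*5 - 6 = 4
```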

Does It Have To Be So Painful?
$f(\mathbf{w},\mathbf{x}) = \dfrac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
Notice that the chain of blocks *-1, e^n, n+1, 1/n is exactly the sigmoid $\sigma(n) = \dfrac{1}{1+e^{-n}}$.
Example Credit: Karpathy and Fei-Fei

Does It Have To Be So Painful?
$\sigma(n) = \dfrac{1}{1+e^{-n}}$
Line 1 (chain rule through 1/x, 1+x, e^x, and -x): $\dfrac{\partial}{\partial n}\sigma(n) = \dfrac{-1}{(1+e^{-n})^2} \cdot 1 \cdot e^{-n} \cdot (-1)$
Line 2: $= \dfrac{e^{-n}}{(1+e^{-n})^2} = \left(\dfrac{1+e^{-n}-1}{1+e^{-n}}\right)\left(\dfrac{1}{1+e^{-n}}\right) = (1-\sigma(n))\,\sigma(n)$
Line 1 to 2: for the curious.
Example Credit: Karpathy and Fei-Fei

Does It Have To Be So Painful?
$f(\mathbf{w},\mathbf{x}) = \dfrac{1}{1 + e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$
Replace the four blocks with a single σ(n) block: w0·x0 and w1·x1 are summed with w2, then passed through σ(n), where $\sigma(n) = \dfrac{1}{1+e^{-n}}$ and $\dfrac{\partial\sigma(n)}{\partial n} = (1-\sigma(n))\,\sigma(n)$.
Example Credit: Karpathy and Fei-Fei
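A sketch of the fused block (the class name is mine): because σ'(n) = (1 - σ(n))σ(n), the backward pass only needs the forward output.

```python
import math

class Sigmoid:
    """Sigmoid block: forward computes 1/(1+e^-n); backward uses (1-s)*s."""
    def forward(self, n):
        self.s = 1.0 / (1.0 + math.exp(-n))
        return self.s
    def backward(self, g):
        return g * (1.0 - self.s) * self.s

sig = Sigmoid()
print(sig.forward(1.0))    # ~0.73, matching the earlier example
print(sig.backward(1.0))   # ~0.2, the gradient we got from four separate blocks
```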

Does It Have To Be So Painful? You can compute the gradient for any function. Pick your functions carefully: existing code is usually structured into sensible blocks.

Building Blocks
A biological neuron takes signals from other cells, processes them, and sends output to other cells.
Neuron diagram credit: Karpathy and Fei-Fei

Artificial Neuron
A weighted sum of other neuron outputs passed through an activation function: compute $\sum_i w_i x_i + b$, then apply the activation to get $f\!\left(\sum_i w_i x_i + b\right)$.

Artificial Neuron
We can differentiate the whole thing, e.g., dNeuron/dx1. What can we now do?
[Diagram: w1·x1, w2·x2, w3·x3, and b feed a ∑ block, which feeds the activation f.]

Artificial Neuron
Each artificial neuron is a linear model + an activation function f. We can find the w, b that minimize a loss function with gradient descent.
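As a hedged sketch (the toy data, learning rate, and squared-error loss are my own choices, not from the lecture): a single sigmoid neuron trained with gradient descent, using the block gradients from this lecture.

```python
import math, random

# Toy data: the neuron should learn to output 1 when x1 + x2 > 1, else 0.
data = [((0.0, 0.0), 0.0), ((1.0, 1.0), 1.0), ((0.0, 1.0), 0.0), ((1.5, 0.5), 1.0)]

w = [random.uniform(-1, 1), random.uniform(-1, 1)]
b = 0.0
alpha = 0.5

for epoch in range(2000):
    for (x1, x2), y in data:
        # Forward: linear model + sigmoid activation.
        z = w[0] * x1 + w[1] * x2 + b
        s = 1.0 / (1.0 + math.exp(-z))
        # Backward through L = (s - y)^2, then the sigmoid, then the weighted sum.
        dL_ds = 2 * (s - y)
        dL_dz = dL_ds * (1 - s) * s
        # Gradient descent update.
        w[0] -= alpha * dL_dz * x1
        w[1] -= alpha * dL_dz * x2
        b    -= alpha * dL_dz

print(w, b)
```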

Artificial Neurons
Connect neurons to make a more complex function; use backprop to compute the gradient.
[Diagram: several ∑/f neurons, each with its own w, b, wired together.]

What’s The Activation Function?
Sigmoid: $s(x) = \dfrac{1}{1+e^{-x}}$. Nice interpretation; squashes things to (0,1); but gradients are near zero if the neuron’s output is very high or very low.

What’s The Activation Function?
ReLU (Rectified Linear Unit): $\max(0, x)$. Constant gradient when active; converges ~6x faster in practice; but if the neuron’s input is negative, the gradient is zero. Be careful!

What’s The Activation Function?
Leaky ReLU: $f(x) = x$ for $x \ge 0$ and $0.01x$ for $x < 0$. Like ReLU, but allows a small gradient for negative values.
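A sketch of the three activations and their local gradients (function names are mine), in the form a backward pass would use them:

```python
import math

def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s, (1.0 - s) * s              # value, local gradient

def relu(x):
    return max(0.0, x), (1.0 if x > 0 else 0.0)

def leaky_relu(x, slope=0.01):
    if x >= 0:
        return x, 1.0
    return slope * x, slope

for x in (-2.0, 0.5):
    print(sigmoid(x), relu(x), leaky_relu(x))
```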

Setting Up A Neural Net
[Diagram: input layer x1, x2; hidden layer h1–h4; output layer y1–y3.]

Setting Up A Neural Net
[Diagram: input layer x1, x2; hidden layer 1 a1–a4; hidden layer 2 h1–h4; output layer y1–y3.]

Fully Connected Network
Each neuron connects to each neuron in the previous layer.
[Diagram: x1, x2 → a1–a4 → h1–h4, with all pairwise connections between adjacent layers.]

Fully Connected Network
Let $\mathbf{a}$ be all of layer a’s values, $\mathbf{w}_i, b_i$ neuron i’s weights and bias, and $f$ the activation function. Then $h_i = f(\mathbf{w}_i^T \mathbf{a} + b_i)$. How do we do all the neurons at once?

Fully Connected Network
$\mathbf{h} = f(\mathbf{W}\mathbf{a} + \mathbf{b})$: stack the $\mathbf{w}_i^T$ as the rows of $\mathbf{W}$ and the $b_i$ into $\mathbf{b}$, so
$\begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \end{bmatrix} = f\!\left( \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \mathbf{w}_3^T \\ \mathbf{w}_4^T \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} \right)$
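The matrix form maps directly to a couple of lines of numpy; a sketch (shapes chosen to match the slide’s 4-neuron layer, and the helper name is mine):

```python
import numpy as np

def layer_forward(W, b, a, f):
    """h = f(W a + b): every neuron in the layer computed at once."""
    return f(W @ a + b)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = np.random.randn(4, 4)   # row i holds neuron i's weights w_i^T
b = np.random.randn(4)      # b_i for each neuron
a = np.random.randn(4)      # previous layer's values

h = layer_forward(W, b, a, sigmoid)
print(h.shape)              # (4,)
```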

Fully Connected Network
Define a new block, the “Linear Layer” (OK, technically it’s affine): $L(\mathbf{n}) = \mathbf{W}\mathbf{n} + \mathbf{b}$. You can get the gradient with respect to all of its inputs (do this on your own; useful trick: the shapes have to line up so the matrix multiplies are possible).
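The slide leaves the linear-layer gradients as an exercise; the sketch below is one standard answer, not reproduced from the slides: with upstream gradient g, the gradients are g nᵀ for W, g for b, and Wᵀ g for the input n.

```python
import numpy as np

class Linear:
    """L(n) = W n + b, with a backward pass for W, b, and the input n."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, n):
        self.n = n                      # save the input for the backward pass
        return self.W @ n + self.b
    def backward(self, g):
        dW = np.outer(g, self.n)        # gradient w.r.t. W
        db = g                          # gradient w.r.t. b
        dn = self.W.T @ g               # gradient passed back to the input
        return dW, db, dn

layer = Linear(np.random.randn(4, 3), np.random.randn(4))
out = layer.forward(np.random.randn(3))
dW, db, dn = layer.backward(np.ones(4))
print(dW.shape, db.shape, dn.shape)     # (4, 3) (4,) (3,)
```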

Fully Connected Network
[Diagram: the same network drawn as blocks: x → L(W, b) → f(n) → L(W, b) → f(n) → L(W, b) → f(n).]

Fully Connected Network
What happens if we remove the activation functions?
[Diagram: x → L(W, b) → L(W, b) → L(W, b), with no f(n) blocks in between.]
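A quick sketch (my own) of why removing the activations loses expressive power: stacking affine layers collapses to a single affine map.

```python
import numpy as np

W1, b1 = np.random.randn(4, 2), np.random.randn(4)
W2, b2 = np.random.randn(3, 4), np.random.randn(3)
x = np.random.randn(2)

# Two linear layers with no activation in between...
h = W1 @ x + b1
y = W2 @ h + b2

# ...equal one linear layer with W = W2 W1 and b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2
print(np.allclose(y, W @ x + b))   # True: the extra layer added no expressive power
```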

Demo Time https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html