Backpropagation and Neural Nets EECS 442 – Prof. David Fouhey Winter 2019, University of Michigan http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/
So Far: Linear Models. $L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$. Example: find w minimizing squared error over data. Each datapoint is represented by some vector x. Can find the optimal w with a ~10 line derivation.
Last Class. $L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i; \mathbf{w}))$. What about an arbitrary loss function L? What about an arbitrary parametric function f? Solution: take the gradient, do gradient descent: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha \nabla_{\mathbf{w}} L(f(\mathbf{w}_i))$. What if L(f(w)) is complicated? Today!
Taking the Gradient – Review. $f(x) = (-x+3)^2$. Decompose: $f = q^2$, $q = r+3$, $r = -x$. Local derivatives: $\partial f/\partial q = 2q$, $\partial q/\partial r = 1$, $\partial r/\partial x = -1$. Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial r}\frac{\partial r}{\partial x} = 2q \cdot 1 \cdot (-1) = -2(-x+3) = 2x-6$.
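A quick way to sanity-check the chain-rule result above is a finite-difference estimate; here is a minimal Python sketch (my own check, not from the slides):

```python
# Numerically check d/dx (-x+3)^2 = 2x - 6 with a central finite difference.
def f(x):
    return (-x + 3) ** 2

def numeric_grad(fn, x, eps=1e-5):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x = 4.0
print(numeric_grad(f, x))  # ~2.0, finite-difference estimate
print(2 * x - 6)           # 2.0, analytic chain-rule result
```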
Supplemental Reading. Lectures can only introduce you to a topic. You will solidify your knowledge by doing. I highly recommend working through everything in the Stanford CS231N resources: http://cs231n.github.io/optimization-2/ These slides follow the general examples with a few modifications. The primary difference is that I define local variables n, m per block.
Let's Do This Another Way. Suppose we have a box representing a function f. This box does two things. Forward: given forward input n, compute f(n). Backward: given backward input g, return $g \cdot \frac{\partial f}{\partial n}$. [Diagram: n → box f → f(n) going forward; $g \cdot \frac{\partial f}{\partial n}$ ← box f ← g going backward.]
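In code, a box is just an object with a forward and a backward method; here is a minimal sketch of the three boxes used in the next example (my own toy classes, not course code):

```python
# Toy forward/backward boxes. Each backward multiplies the incoming gradient
# by the box's local gradient, evaluated at the cached forward input.
class Negate:                # f(n) = -n
    def forward(self, n):
        return -n
    def backward(self, g):
        return g * -1.0

class AddConst:              # f(n) = n + c
    def __init__(self, c):
        self.c = c
    def forward(self, n):
        return n + self.c
    def backward(self, g):
        return g * 1.0

class Square:                # f(n) = n^2
    def forward(self, n):
        self.n = n           # cache the input; the local gradient 2n needs it
        return n ** 2
    def backward(self, g):
        return g * 2 * self.n

sq = Square()
print(sq.forward(3.0), sq.backward(1.0))  # 9.0, 6.0
```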
Let's Do This Another Way. $f(x) = (-x+3)^2$ as three chained boxes: x → [-n] → -x → [n+3] → -x+3 → [n^2] → (-x+3)^2.
Backward, step 1: write 1 at the output. The n^2 box multiplies by its local gradient $\frac{\partial}{\partial n} n^2 = 2n = 2(-x+3)$, sending back $(-2x+6) \cdot 1 = -2x+6$.
Backward, step 2: the n+3 box has local gradient $\frac{\partial}{\partial n}(n+3) = 1$, so $-2x+6$ passes through unchanged.
Backward, step 3: the -n box has local gradient $\frac{\partial}{\partial n}(-n) = -1$, so it sends back $-1 \cdot (-2x+6) = 2x-6$: the derivative of f with respect to x.
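Written out in Python, the same forward and backward passes look like this (a sketch in the spirit of the boxes above; the value x = 4 is my choice):

```python
# Forward and backward through the three boxes of f(x) = (-x+3)^2.
x = 4.0

# forward: x -> -x -> -x+3 -> (-x+3)^2
r = -x
q = r + 3.0
f = q ** 2           # 1.0

# backward: start from 1 at the output, multiply by each local gradient
g = 1.0
g = g * 2 * q        # n^2 box: 2n = -2x+6
g = g * 1.0          # n+3 box: gradient unchanged
g = g * -1.0         # -n box: flip sign -> 2x-6
print(f, g)          # 1.0, 2.0
```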
Two Inputs. Given two inputs, just have two input/output wires. Forward: the same, compute f(n, m). Backward: the same – send gradients with respect to each variable: the box receives g and returns $g \cdot \frac{\partial f}{\partial n}$ on the n wire and $g \cdot \frac{\partial f}{\partial m}$ on the m wire.
f(x,y,z) = (x+y)z. Built from two boxes: an n+m box computes x+y, then an n*m box multiplies by z to give (x+y)z. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n*m box, starting from gradient 1 at the output: $\frac{\partial}{\partial n} nm = m$, so the x+y wire receives z·1; $\frac{\partial}{\partial m} nm = n$, so the z wire receives (x+y)·1. Multiplication swaps the inputs and multiplies by the incoming gradient. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n+m box: $\frac{\partial}{\partial n}(n+m) = 1$ and $\frac{\partial}{\partial m}(n+m) = 1$, so x and y each receive 1·z·1. Addition sends the gradient through unchanged. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Putting it together: $\frac{\partial (x+y)z}{\partial x} = z$, $\frac{\partial (x+y)z}{\partial y} = z$, $\frac{\partial (x+y)z}{\partial z} = x+y$. Example Credit: Karpathy and Fei-Fei
Once More, With Numbers!
f(x,y,z) = (x+y)z with x = 1, y = 4, z = 10. Forward: the n+m box gives 5, the n*m box gives 50. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n*m box, starting from gradient 1: $\frac{\partial}{\partial n} nm = m$ → 10·1 = 10 back to the x+y wire; $\frac{\partial}{\partial m} nm = n$ → 5·1 = 5 back to z. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n+m box: $\frac{\partial}{\partial n}(n+m) = 1$ and $\frac{\partial}{\partial m}(n+m) = 1$, so x and y each receive 1·10·1 = 10, and z keeps its gradient of 5. Example Credit: Karpathy and Fei-Fei
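The same numbers fall out of a few lines of Python; a sketch assuming the values x = 1, y = 4, z = 10 read off the slide:

```python
# Check the worked example: forward, then backward from gradient 1.
x, y, z = 1.0, 4.0, 10.0

# forward
s = x + y                         # n+m box: 5
f = s * z                         # n*m box: 50

# backward
g_f = 1.0
g_s, g_z = g_f * z, g_f * s       # multiply box swaps inputs: 10, 5
g_x, g_y = g_s * 1.0, g_s * 1.0   # add box passes the gradient through: 10, 10

print(f, g_x, g_y, g_z)           # 50.0 10.0 10.0 5.0
```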
Think You've Got It? $L(w) = (w-6)^2$. We want to fit a model w that will just equal 6: the world's most basic linear model / neural net, with no inputs, just a constant output.
I'll Need a Few Volunteers. $L(w) = (w-6)^2$ as two boxes: an n-6 box feeding an n^2 box. Job #1 (n-6): Forward: compute n-6. Backward: multiply the incoming gradient by 1. Job #2 (n^2): Forward: compute n^2. Backward: multiply the incoming gradient by 2n. Job #3: Backward: write down 1 at the output.
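Running those three jobs inside a gradient descent loop drives w to 6; a minimal sketch (the step size 0.1 and starting point 0 are my choices, not from the slides):

```python
# Gradient descent on L(w) = (w - 6)^2, computed box by box.
w, lr = 0.0, 0.1
for _ in range(100):
    n = w - 6.0        # Job #1 forward
    L = n ** 2         # Job #2 forward
    g = 1.0            # Job #3: write down 1 at the output
    g = g * 2 * n      # Job #2 backward: multiply by 2n
    g = g * 1.0        # Job #1 backward: multiply by 1
    w -= lr * g        # gradient descent update
print(w)               # ~6.0
```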
Preemptively: the diagrams look complex, but that's because we're covering the details together.
Something More Complex. $f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. Built from boxes: two n*m boxes ($w_0 x_0$ and $w_1 x_1$), two n+m boxes (their sum, then adding $w_2$), followed by n*(-1), $e^n$, n+1, and 1/n. Example Credit: Karpathy and Fei-Fei
$f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. Forward pass with numbers: $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $w_2 = -3$. Then $w_0 x_0 = -2$, $w_1 x_1 = 6$, their sum is 4, adding $w_2$ gives 1; negating gives -1, $e^{-1} = 0.37$, adding 1 gives 1.37, and $1/1.37 = 0.73$.
$f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. Local gradients we will need (the letters a–f label the boxes in the diagram): $\frac{\partial}{\partial n}(m+n) = 1$; $\frac{\partial}{\partial n}(mn) = m$; $\frac{\partial}{\partial n} e^n = e^n$; $\frac{\partial}{\partial n} n^{-1} = -n^{-2}$; $\frac{\partial}{\partial n}(an) = a$; $\frac{\partial}{\partial n}(c+n) = 1$. Forward values as before: -2, 6, 4, 1, -1, 0.37, 1.37, 0.73. Example Credit: Karpathy and Fei-Fei
Backward, step 1: write down 1 at the output.
Step 2, the 1/n box: $-1.37^{-2} \cdot 1 = -0.53$. Where does 1.37 come from? It is the value that entered the 1/n box on the forward pass.
Step 3, the n+1 box: $1 \cdot -0.53 = -0.53$.
Step 4, the $e^n$ box: $e^{-1} \cdot -0.53 = -0.2$.
Step 5, the *(-1) box: $-1 \cdot -0.2 = 0.2$.
Step 6, the + $w_2$ box: $1 \cdot 0.2 = 0.2$, and the gradient gets sent back both directions, so $w_2$ receives 0.2 and so does the other sum.
Step 7, the other + box: 0.2 again gets sent back both directions, to each n*m box.
Step 8, the $w_0 x_0$ multiply box (swap inputs): $w_0$ receives $-1 \cdot 0.2 = -0.2$ and $x_0$ receives $2 \cdot 0.2 = 0.4$.
Step 9, the $w_1 x_1$ multiply box: $w_1$ receives $-2 \cdot 0.2 = -0.4$ and $x_1$ receives $-3 \cdot 0.2 = -0.6$.
PHEW! Final gradients: $w_0$: -0.2, $x_0$: 0.4, $w_1$: -0.4, $x_1$: -0.6, $w_2$: 0.2. Example Credit: Karpathy and Fei-Fei
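The same backward pass, written box by box in Python as a check (my own sketch, reusing the forward values above):

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward, box by box
a, b = w0 * x0, w1 * x1       # -2, 6
s = a + b + w2                # 1
neg = -s                      # -1
e = math.exp(neg)             # ~0.37
denom = 1.0 + e               # ~1.37
f = 1.0 / denom               # ~0.73

# backward, box by box, starting from 1 at the output
g = 1.0
g = g * (-1.0 / denom ** 2)   # 1/n box: ~ -0.53
g = g * 1.0                   # n+1 box: unchanged
g = g * e                     # e^n box: ~ -0.2
g = g * -1.0                  # *(-1) box: ~ 0.2
dw2 = g                       # + boxes pass the gradient through
dw0, dx0 = g * x0, g * w0     # multiply boxes swap inputs
dw1, dx1 = g * x1, g * w1
print(dw0, dx0, dw1, dx1, dw2)  # approximately -0.2, 0.4, -0.4, -0.6, 0.2
```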
Summary. Each block computes the backward input g times the local gradient $\frac{\partial f}{\partial x_i}$, evaluated at the forward-pass point: given inputs $x_1, \dots, x_n$ and output $f(x_1, \dots, x_n)$, the backward pass sends $g \cdot \frac{\partial f}{\partial x_i}$ down each input wire.
Multiple Outputs Flowing Back. If a value feeds multiple outputs $f_1, \dots, f_K$, the gradients from the different backward passes sum up: input $x_i$ receives $\sum_{j=1}^{K} g_j \frac{\partial f_j}{\partial x_i}$.
Multiple Outputs Flowing Back. $f(x) = (-x+3)^2$ computed a second way: x fans out into two copies of the -n → n+3 chain, and an m*n box multiplies the two copies of -x+3. Starting from 1 at the output, the multiply box sends -x+3 back along each branch, and each -n → n+3 chain turns that into x-3 at x.
Multiple Outputs Flowing Back. The two branches sum at x: $\frac{\partial f}{\partial x} = (x-3) + (x-3) = 2x-6$, the same answer as before.
Does It Have To Be So Painful? $f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. The n*(-1) → $e^n$ → n+1 → 1/n chain is just the sigmoid $\sigma(n) = \frac{1}{1+e^{-n}}$, so we can treat it as a single box. Example Credit: Karpathy and Fei-Fei
Does It Have To Be So Painful? $\sigma(n) = \frac{1}{1+e^{-n}}$. Chain rule (derivative of 1/x, then 1+x, then $e^x$, then -x): $\frac{\partial}{\partial n}\sigma(n) = \frac{-1}{(1+e^{-n})^2} \cdot 1 \cdot e^{-n} \cdot (-1) = \frac{e^{-n}}{(1+e^{-n})^2}$. Line 1 to 2, for the curious: $\frac{e^{-n}}{(1+e^{-n})^2} = \frac{(1+e^{-n}) - 1}{1+e^{-n}} \cdot \frac{1}{1+e^{-n}} = (1-\sigma(n))\,\sigma(n)$. Example Credit: Karpathy and Fei-Fei
Does It Have To Be So Painful? $f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$, now with the last four boxes collapsed into a single σ(n) box after the two n*m and two n+m boxes, using $\sigma(n) = \frac{1}{1+e^{-n}}$ and $\frac{\partial\sigma(n)}{\partial n} = (1-\sigma(n))\,\sigma(n)$. Example Credit: Karpathy and Fei-Fei
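As a box, that is one forward line and one backward line; a sketch in the same style as the earlier toy classes:

```python
# Sigmoid box: forward caches sigma(n), backward uses (1 - sigma) * sigma.
import math

class Sigmoid:
    def forward(self, n):
        self.s = 1.0 / (1.0 + math.exp(-n))
        return self.s
    def backward(self, g):
        return g * (1.0 - self.s) * self.s

sig = Sigmoid()
print(sig.forward(1.0))   # ~0.73, as in the worked example
print(sig.backward(1.0))  # ~0.20
```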
Does It Have To Be So Painful? Can compute the gradient for any function. Pick your functions carefully: existing code is usually structured into sensible blocks.
Building Blocks. A neuron takes signals from other cells, processes them, and sends its output on: input from other cells, output to other cells. Neuron diagram credit: Karpathy and Fei-Fei
Artificial Neuron. A weighted sum of other neurons' outputs, passed through an activation function: $\sum_i w_i x_i + b$, then $f\left(\sum_i w_i x_i + b\right)$.
Artificial Neuron. As boxes: $w_1 \cdot x_1$, $w_2 \cdot x_2$, $w_3 \cdot x_3$, and b feed a sum ∑, which feeds the activation f. Can differentiate the whole thing, e.g. ∂Neuron/∂x_1. What can we now do?
Artificial Neuron. Each artificial neuron is a linear model plus an activation function f. Can find w, b that minimize a loss function with gradient descent.
Artificial Neurons. Connect neurons – each one a ∑ followed by f with its own w, b, reading the same input x – to make a more complex function; use backprop to compute the gradient.
What's The Activation Function? Sigmoid: $s(x) = \frac{1}{1+e^{-x}}$. Nice interpretation; squashes things to (0,1); gradients are near zero if the neuron's output is very high or very low.
What's The Activation Function? ReLU (Rectified Linear Unit): $\max(0, x)$. Constant gradient when active; converges ~6x faster. If the neuron's input is negative, the gradient is zero. Be careful!
What's The Activation Function? Leaky ReLU: $f(x) = x$ for $x \ge 0$ and $0.01x$ for $x < 0$. ReLU, but allows some small gradient for negative values.
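For reference, the three activations and their local gradients in a few lines of numpy (a sketch; the 0.01 leak matches the slide, the test values are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # near zero when |x| is large

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)         # exactly zero for negative inputs

def leaky_relu(x):
    return np.where(x >= 0, x, 0.01 * x)

def leaky_relu_grad(x):
    return np.where(x >= 0, 1.0, 0.01)   # a small gradient survives

x = np.array([-2.0, 0.5, 3.0])
print(sigmoid_grad(x), relu_grad(x), leaky_relu_grad(x))
```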
Setting Up A Neural Net. Input layer (x1, x2) → hidden layer (h1–h4) → output layer (y1–y3).
Setting Up A Neural Net. Input layer (x1, x2) → hidden layer 1 (a1–a4) → hidden layer 2 (h1–h4) → output layer (y1–y3).
Fully Connected Network. Each neuron connects to each neuron in the previous layer (x1, x2 → a1–a4 → h1–h4).
Fully Connected Network. Notation: $\mathbf{a}$ is the vector of all layer-a values, $\mathbf{w}_i, b_i$ are neuron i's weights and bias, and $f$ is the activation function. Then $h_i = f(\mathbf{w}_i^T \mathbf{a} + b_i)$. How do we do all the neurons all at once?
Fully Connected Network. Stack the weight vectors as rows of a matrix: $\mathbf{h} = f(\mathbf{W}\mathbf{a} + \mathbf{b})$, i.e. $\begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \end{bmatrix} = f\!\left( \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \mathbf{w}_3^T \\ \mathbf{w}_4^T \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} \right)$
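In numpy this is one matrix-vector product per layer; a sketch with shapes chosen to match the diagram (4 neurons, a 4-dimensional input) and ReLU standing in for f:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # one row of weights per neuron
b = rng.standard_normal(4)
a = rng.standard_normal(4)        # previous layer's values

h = np.maximum(0.0, W @ a + b)    # h = f(Wa + b), with f = ReLU here
print(h)
```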
Fully Connected Network. Define a new block, the "Linear Layer" (OK, technically it's affine): $L(\mathbf{n}) = \mathbf{W}\mathbf{n} + \mathbf{b}$, with inputs n, W, and b. Can get the gradient with respect to all the inputs (do it on your own; useful trick: the shapes have to work out so you can do the matrix multiply).
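The slide leaves those gradients as an exercise; one standard answer for a single input vector, written as a forward/backward box (my sketch, assuming g is the upstream gradient with the same shape as the output):

```python
import numpy as np

class Linear:
    """L(n) = W n + b, as a forward/backward block."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, n):
        self.n = n                     # cache the input for backward
        return self.W @ n + self.b
    def backward(self, g):
        self.dW = np.outer(g, self.n)  # gradient w.r.t. W
        self.db = g                    # gradient w.r.t. b
        return self.W.T @ g            # gradient w.r.t. n, sent to the previous block
```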
Fully Connected Network. The whole network is just blocks chained together: x → [Linear: W, b] → f(n) → [Linear: W, b] → f(n) → [Linear: W, b] → f(n).
Fully Connected Network. What happens if we remove the activation functions, leaving x → [Linear: W, b] → [Linear: W, b] → [Linear: W, b]?
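The slide poses this as a question; a quick numerical check of the usual answer (without activations, stacked linear layers collapse into a single affine map; shapes and values here are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal(4)
x = rng.standard_normal(2)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # a single equivalent affine layer
print(np.allclose(two_layers, one_layer))   # True
```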
Demo Time https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html