Backpropagation and Neural Nets EECS 442 – Prof. David Fouhey Winter 2019, University of Michigan http://web.eecs.umich.edu/~fouhey/teaching/EECS442_W19/
So Far: Linear Models. $L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} (y_i - \mathbf{w}^T\mathbf{x}_i)^2$. Example: find w minimizing squared error over data. Each datapoint is represented by some vector x. Can find the optimal w with a ~10 line derivation.
Last Class. $L(\mathbf{w}) = \lambda\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i; \mathbf{w}))$. What about an arbitrary loss function L? What about an arbitrary parametric function f? Solution: take the gradient, do gradient descent: $\mathbf{w}_{i+1} = \mathbf{w}_i - \alpha \nabla_{\mathbf{w}} L(f(\mathbf{w}_i))$. What if L(f(w)) is complicated? Today!
Taking the Gradient – Review. $f(x) = (-x+3)^2$. Decompose: $f = q^2$, $q = r+3$, $r = -x$. Local derivatives: $\partial f/\partial q = 2q$, $\partial q/\partial r = 1$, $\partial r/\partial x = -1$. Chain rule: $\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q}\frac{\partial q}{\partial r}\frac{\partial r}{\partial x} = 2q \cdot 1 \cdot (-1) = -2(-x+3) = 2x-6$.
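A quick way to sanity-check the chain-rule result above is a finite-difference estimate; here is a minimal Python sketch (my own check, not from the slides):

```python
# Numerically check d/dx (-x+3)^2 = 2x - 6 with a central finite difference.
def f(x):
    return (-x + 3) ** 2

def numeric_grad(fn, x, eps=1e-5):
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

x = 4.0
print(numeric_grad(f, x))  # ~2.0, finite-difference estimate
print(2 * x - 6)           # 2.0, analytic chain-rule result
```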
Supplemental Reading. Lectures can only introduce you to a topic. You will solidify your knowledge by doing. I highly recommend working through everything in the Stanford CS231N resources: http://cs231n.github.io/optimization-2/ These slides follow the general examples with a few modifications. The primary difference is that I define local variables n, m per block.
Let's Do This Another Way. Suppose we have a box representing a function f. This box does two things. Forward: given forward input n, compute f(n). Backward: given backward input g, return $g \cdot \frac{\partial f}{\partial n}$. [Diagram: n → box f → f(n) going forward; $g \cdot \frac{\partial f}{\partial n}$ ← box f ← g going backward.]
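In code, a box is just an object with a forward and a backward method; here is a minimal sketch of the three boxes used in the next example (my own toy classes, not course code):

```python
# Toy forward/backward boxes. Each backward multiplies the incoming gradient
# by the box's local gradient, evaluated at the cached forward input.
class Negate:                # f(n) = -n
    def forward(self, n):
        return -n
    def backward(self, g):
        return g * -1.0

class AddConst:              # f(n) = n + c
    def __init__(self, c):
        self.c = c
    def forward(self, n):
        return n + self.c
    def backward(self, g):
        return g * 1.0

class Square:                # f(n) = n^2
    def forward(self, n):
        self.n = n           # cache the input; the local gradient 2n needs it
        return n ** 2
    def backward(self, g):
        return g * 2 * self.n

sq = Square()
print(sq.forward(3.0), sq.backward(1.0))  # 9.0, 6.0
```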
Let's Do This Another Way. $f(x) = (-x+3)^2$ as three chained boxes: x → [-n] → -x → [n+3] → -x+3 → [n^2] → (-x+3)^2.
Backward, step 1: write 1 at the output. The n^2 box multiplies by its local gradient $\frac{\partial}{\partial n} n^2 = 2n = 2(-x+3)$, sending back $(-2x+6) \cdot 1 = -2x+6$.
Backward, step 2: the n+3 box has local gradient $\frac{\partial}{\partial n}(n+3) = 1$, so $-2x+6$ passes through unchanged.
Backward, step 3: the -n box has local gradient $\frac{\partial}{\partial n}(-n) = -1$, so it sends back $-1 \cdot (-2x+6) = 2x-6$: the derivative of f with respect to x.
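Written out in Python, the same forward and backward passes look like this (a sketch in the spirit of the boxes above; the value x = 4 is my choice):

```python
# Forward and backward through the three boxes of f(x) = (-x+3)^2.
x = 4.0

# forward: x -> -x -> -x+3 -> (-x+3)^2
r = -x
q = r + 3.0
f = q ** 2           # 1.0

# backward: start from 1 at the output, multiply by each local gradient
g = 1.0
g = g * 2 * q        # n^2 box: 2n = -2x+6
g = g * 1.0          # n+3 box: gradient unchanged
g = g * -1.0         # -n box: flip sign -> 2x-6
print(f, g)          # 1.0, 2.0
```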
Two Inputs. Given two inputs, just have two input/output wires. Forward: the same, compute f(n, m). Backward: the same – send gradients with respect to each variable: the box receives g and returns $g \cdot \frac{\partial f}{\partial n}$ on the n wire and $g \cdot \frac{\partial f}{\partial m}$ on the m wire.
f(x,y,z) = (x+y)z. Built from two boxes: an n+m box computes x+y, then an n*m box multiplies by z to give (x+y)z. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n*m box, starting from gradient 1 at the output: $\frac{\partial}{\partial n} nm = m$, so the x+y wire receives z·1; $\frac{\partial}{\partial m} nm = n$, so the z wire receives (x+y)·1. Multiplication swaps the inputs and multiplies by the incoming gradient. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n+m box: $\frac{\partial}{\partial n}(n+m) = 1$ and $\frac{\partial}{\partial m}(n+m) = 1$, so x and y each receive 1·z·1. Addition sends the gradient through unchanged. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Putting it together: $\frac{\partial (x+y)z}{\partial x} = z$, $\frac{\partial (x+y)z}{\partial y} = z$, $\frac{\partial (x+y)z}{\partial z} = x+y$. Example Credit: Karpathy and Fei-Fei
Once More, With Numbers!
f(x,y,z) = (x+y)z with x = 1, y = 4, z = 10. Forward: the n+m box gives 5, the n*m box gives 50. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n*m box, starting from gradient 1: $\frac{\partial}{\partial n} nm = m$ → 10·1 = 10 back to the x+y wire; $\frac{\partial}{\partial m} nm = n$ → 5·1 = 5 back to z. Example Credit: Karpathy and Fei-Fei
f(x,y,z) = (x+y)z. Backward through the n+m box: $\frac{\partial}{\partial n}(n+m) = 1$ and $\frac{\partial}{\partial m}(n+m) = 1$, so x and y each receive 1·10·1 = 10, and z keeps its gradient of 5. Example Credit: Karpathy and Fei-Fei
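The same numbers fall out of a few lines of Python; a sketch assuming the values x = 1, y = 4, z = 10 read off the slide:

```python
# Check the worked example: forward, then backward from gradient 1.
x, y, z = 1.0, 4.0, 10.0

# forward
s = x + y                         # n+m box: 5
f = s * z                         # n*m box: 50

# backward
g_f = 1.0
g_s, g_z = g_f * z, g_f * s       # multiply box swaps inputs: 10, 5
g_x, g_y = g_s * 1.0, g_s * 1.0   # add box passes the gradient through: 10, 10

print(f, g_x, g_y, g_z)           # 50.0 10.0 10.0 5.0
```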
Think You've Got It? $L(w) = (w-6)^2$. We want to fit a model w that will just equal 6: the world's most basic linear model / neural net, with no inputs, just a constant output.
I'll Need a Few Volunteers. $L(w) = (w-6)^2$ as two boxes: an n-6 box feeding an n^2 box. Job #1 (n-6): Forward: compute n-6. Backward: multiply the incoming gradient by 1. Job #2 (n^2): Forward: compute n^2. Backward: multiply the incoming gradient by 2n. Job #3: Backward: write down 1 at the output.
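Running those three jobs inside a gradient descent loop drives w to 6; a minimal sketch (the step size 0.1 and starting point 0 are my choices, not from the slides):

```python
# Gradient descent on L(w) = (w - 6)^2, computed box by box.
w, lr = 0.0, 0.1
for _ in range(100):
    n = w - 6.0        # Job #1 forward
    L = n ** 2         # Job #2 forward
    g = 1.0            # Job #3: write down 1 at the output
    g = g * 2 * n      # Job #2 backward: multiply by 2n
    g = g * 1.0        # Job #1 backward: multiply by 1
    w -= lr * g        # gradient descent update
print(w)               # ~6.0
```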
Preemptively: the diagrams look complex, but that's because we're covering the details together.
Something More Complex. $f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. Built from boxes: two n*m boxes ($w_0 x_0$ and $w_1 x_1$), two n+m boxes (their sum, then adding $w_2$), followed by n*(-1), $e^n$, n+1, and 1/n. Example Credit: Karpathy and Fei-Fei
$f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. Forward pass with numbers: $w_0 = 2$, $x_0 = -1$, $w_1 = -3$, $x_1 = -2$, $w_2 = -3$. Then $w_0 x_0 = -2$, $w_1 x_1 = 6$, their sum is 4, adding $w_2$ gives 1; negating gives -1, $e^{-1} = 0.37$, adding 1 gives 1.37, and $1/1.37 = 0.73$.
$f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. Local gradients we will need (the letters a–f label the boxes in the diagram): $\frac{\partial}{\partial n}(m+n) = 1$; $\frac{\partial}{\partial n}(mn) = m$; $\frac{\partial}{\partial n} e^n = e^n$; $\frac{\partial}{\partial n} n^{-1} = -n^{-2}$; $\frac{\partial}{\partial n}(an) = a$; $\frac{\partial}{\partial n}(c+n) = 1$. Forward values as before: -2, 6, 4, 1, -1, 0.37, 1.37, 0.73. Example Credit: Karpathy and Fei-Fei
Backward, step 1: write down 1 at the output.
Step 2, the 1/n box: $-1.37^{-2} \cdot 1 = -0.53$. Where does 1.37 come from? It is the value that entered the 1/n box on the forward pass.
Step 3, the n+1 box: $1 \cdot -0.53 = -0.53$.
Step 4, the $e^n$ box: $e^{-1} \cdot -0.53 = -0.2$.
Step 5, the *(-1) box: $-1 \cdot -0.2 = 0.2$.
Step 6, the + $w_2$ box: $1 \cdot 0.2 = 0.2$, and the gradient gets sent back both directions, so $w_2$ receives 0.2 and so does the other sum.
Step 7, the other + box: 0.2 again gets sent back both directions, to each n*m box.
Step 8, the $w_0 x_0$ multiply box (swap inputs): $w_0$ receives $-1 \cdot 0.2 = -0.2$ and $x_0$ receives $2 \cdot 0.2 = 0.4$.
Step 9, the $w_1 x_1$ multiply box: $w_1$ receives $-2 \cdot 0.2 = -0.4$ and $x_1$ receives $-3 \cdot 0.2 = -0.6$.
PHEW! Final gradients: $w_0$: -0.2, $x_0$: 0.4, $w_1$: -0.4, $x_1$: -0.6, $w_2$: 0.2. Example Credit: Karpathy and Fei-Fei
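The same backward pass, written box by box in Python as a check (my own sketch, reusing the forward values above):

```python
import math

w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# forward, box by box
a, b = w0 * x0, w1 * x1       # -2, 6
s = a + b + w2                # 1
neg = -s                      # -1
e = math.exp(neg)             # ~0.37
denom = 1.0 + e               # ~1.37
f = 1.0 / denom               # ~0.73

# backward, box by box, starting from 1 at the output
g = 1.0
g = g * (-1.0 / denom ** 2)   # 1/n box: ~ -0.53
g = g * 1.0                   # n+1 box: unchanged
g = g * e                     # e^n box: ~ -0.2
g = g * -1.0                  # *(-1) box: ~ 0.2
dw2 = g                       # + boxes pass the gradient through
dw0, dx0 = g * x0, g * w0     # multiply boxes swap inputs
dw1, dx1 = g * x1, g * w1
print(dw0, dx0, dw1, dx1, dw2)  # approximately -0.2, 0.4, -0.4, -0.6, 0.2
```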
Summary. Each block computes the backward input g times the local gradient $\frac{\partial f}{\partial x_i}$, evaluated at the forward-pass point: given inputs $x_1, \dots, x_n$ and output $f(x_1, \dots, x_n)$, the backward pass sends $g \cdot \frac{\partial f}{\partial x_i}$ down each input wire.
Multiple Outputs Flowing Back. If a value feeds multiple outputs $f_1, \dots, f_K$, the gradients from the different backward passes sum up: input $x_i$ receives $\sum_{j=1}^{K} g_j \frac{\partial f_j}{\partial x_i}$.
Multiple Outputs Flowing Back. $f(x) = (-x+3)^2$ computed a second way: x fans out into two copies of the -n → n+3 chain, and an m*n box multiplies the two copies of -x+3. Starting from 1 at the output, the multiply box sends -x+3 back along each branch, and each -n → n+3 chain turns that into x-3 at x.
Multiple Outputs Flowing Back. The two branches sum at x: $\frac{\partial f}{\partial x} = (x-3) + (x-3) = 2x-6$, the same answer as before.
Does It Have To Be So Painful? $f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$. The n*(-1) → $e^n$ → n+1 → 1/n chain is just the sigmoid $\sigma(n) = \frac{1}{1+e^{-n}}$, so we can treat it as a single box. Example Credit: Karpathy and Fei-Fei
Does It Have To Be So Painful? $\sigma(n) = \frac{1}{1+e^{-n}}$. Chain rule (derivative of 1/x, then 1+x, then $e^x$, then -x): $\frac{\partial}{\partial n}\sigma(n) = \frac{-1}{(1+e^{-n})^2} \cdot 1 \cdot e^{-n} \cdot (-1) = \frac{e^{-n}}{(1+e^{-n})^2}$. Line 1 to 2, for the curious: $\frac{e^{-n}}{(1+e^{-n})^2} = \frac{(1+e^{-n}) - 1}{1+e^{-n}} \cdot \frac{1}{1+e^{-n}} = (1-\sigma(n))\,\sigma(n)$. Example Credit: Karpathy and Fei-Fei
Does It Have To Be So Painful? $f(\mathbf{w},\mathbf{x}) = \frac{1}{1+e^{-(w_0 x_0 + w_1 x_1 + w_2)}}$, now with the last four boxes collapsed into a single σ(n) box after the two n*m and two n+m boxes, using $\sigma(n) = \frac{1}{1+e^{-n}}$ and $\frac{\partial\sigma(n)}{\partial n} = (1-\sigma(n))\,\sigma(n)$. Example Credit: Karpathy and Fei-Fei
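As a box, that is one forward line and one backward line; a sketch in the same style as the earlier toy classes:

```python
# Sigmoid box: forward caches sigma(n), backward uses (1 - sigma) * sigma.
import math

class Sigmoid:
    def forward(self, n):
        self.s = 1.0 / (1.0 + math.exp(-n))
        return self.s
    def backward(self, g):
        return g * (1.0 - self.s) * self.s

sig = Sigmoid()
print(sig.forward(1.0))   # ~0.73, as in the worked example
print(sig.backward(1.0))  # ~0.20
```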
Does It Have To Be So Painful? Can compute the gradient for any function. Pick your functions carefully: existing code is usually structured into sensible blocks.
Building Blocks. A neuron takes signals from other cells, processes them, and sends its output on: input from other cells, output to other cells. Neuron diagram credit: Karpathy and Fei-Fei
Artificial Neuron. A weighted sum of other neurons' outputs, passed through an activation function: $\sum_i w_i x_i + b$, then $f\left(\sum_i w_i x_i + b\right)$.
Artificial Neuron. As boxes: $w_1 \cdot x_1$, $w_2 \cdot x_2$, $w_3 \cdot x_3$, and b feed a sum ∑, which feeds the activation f. Can differentiate the whole thing, e.g. ∂Neuron/∂x_1. What can we now do?
Artificial Neuron. Each artificial neuron is a linear model plus an activation function f. Can find w, b that minimize a loss function with gradient descent.
Artificial Neurons. Connect neurons – each one a ∑ followed by f with its own w, b, reading the same input x – to make a more complex function; use backprop to compute the gradient.
What's The Activation Function? Sigmoid: $s(x) = \frac{1}{1+e^{-x}}$. Nice interpretation; squashes things to (0,1); gradients are near zero if the neuron's output is very high or very low.
What's The Activation Function? ReLU (Rectified Linear Unit): $\max(0, x)$. Constant gradient when active; converges ~6x faster. If the neuron's input is negative, the gradient is zero. Be careful!
What's The Activation Function? Leaky ReLU: $f(x) = x$ for $x \ge 0$ and $0.01x$ for $x < 0$. ReLU, but allows some small gradient for negative values.
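For reference, the three activations and their local gradients in a few lines of numpy (a sketch; the 0.01 leak matches the slide, the test values are my own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)                 # near zero when |x| is large

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)         # exactly zero for negative inputs

def leaky_relu(x):
    return np.where(x >= 0, x, 0.01 * x)

def leaky_relu_grad(x):
    return np.where(x >= 0, 1.0, 0.01)   # a small gradient survives

x = np.array([-2.0, 0.5, 3.0])
print(sigmoid_grad(x), relu_grad(x), leaky_relu_grad(x))
```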
Setting Up A Neural Net. Input layer (x1, x2) → hidden layer (h1–h4) → output layer (y1–y3).
Setting Up A Neural Net. Input layer (x1, x2) → hidden layer 1 (a1–a4) → hidden layer 2 (h1–h4) → output layer (y1–y3).
Fully Connected Network. Each neuron connects to each neuron in the previous layer (x1, x2 → a1–a4 → h1–h4).
Fully Connected Network. Notation: $\mathbf{a}$ is the vector of all layer-a values, $\mathbf{w}_i, b_i$ are neuron i's weights and bias, and $f$ is the activation function. Then $h_i = f(\mathbf{w}_i^T \mathbf{a} + b_i)$. How do we do all the neurons all at once?
Fully Connected Network. Stack the weight vectors as rows of a matrix: $\mathbf{h} = f(\mathbf{W}\mathbf{a} + \mathbf{b})$, i.e. $\begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \end{bmatrix} = f\!\left( \begin{bmatrix} \mathbf{w}_1^T \\ \mathbf{w}_2^T \\ \mathbf{w}_3^T \\ \mathbf{w}_4^T \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix} \right)$
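In numpy this is one matrix-vector product per layer; a sketch with shapes chosen to match the diagram (4 neurons, a 4-dimensional input) and ReLU standing in for f:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # one row of weights per neuron
b = rng.standard_normal(4)
a = rng.standard_normal(4)        # previous layer's values

h = np.maximum(0.0, W @ a + b)    # h = f(Wa + b), with f = ReLU here
print(h)
```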
Fully Connected Network. Define a new block, the "Linear Layer" (OK, technically it's affine): $L(\mathbf{n}) = \mathbf{W}\mathbf{n} + \mathbf{b}$, with inputs n, W, and b. Can get the gradient with respect to all the inputs (do it on your own; useful trick: the shapes have to work out so you can do the matrix multiply).
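The slide leaves those gradients as an exercise; one standard answer for a single input vector, written as a forward/backward box (my sketch, assuming g is the upstream gradient with the same shape as the output):

```python
import numpy as np

class Linear:
    """L(n) = W n + b, as a forward/backward block."""
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, n):
        self.n = n                     # cache the input for backward
        return self.W @ n + self.b
    def backward(self, g):
        self.dW = np.outer(g, self.n)  # gradient w.r.t. W
        self.db = g                    # gradient w.r.t. b
        return self.W.T @ g            # gradient w.r.t. n, sent to the previous block
```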
Fully Connected Network. The whole network is just blocks chained together: x → [Linear: W, b] → f(n) → [Linear: W, b] → f(n) → [Linear: W, b] → f(n).
Fully Connected Network. What happens if we remove the activation functions, leaving x → [Linear: W, b] → [Linear: W, b] → [Linear: W, b]?
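The slide poses this as a question; a quick numerical check of the usual answer (without activations, stacked linear layers collapse into a single affine map; shapes and values here are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal(4)
x = rng.standard_normal(2)

two_layers = W2 @ (W1 @ x + b1) + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)  # a single equivalent affine layer
print(np.allclose(two_layers, one_layer))   # True
```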
Demo Time https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html