cs540 - Fall 2016 (Shavlik©), Lecture 18, Week 10


1 Today's Topics
Weight Space (for ANNs)
Gradient Descent and Local Minima
Stochastic Gradient Descent
Backpropagation
The Need to Train the Biases and a Simple Algebraic Trick
Perceptron Training Rule and a Worked Example
Case Analysis of Delta Rule

2 WARNING! Some Calculus Ahead

3 No Calculus Experience?
Derivatives generalize the idea of SLOPE.
How to calc the SLOPE of a line:
d(mx + b) / dx = m    // 'mx + b' is the algebraic form of a line
                      // 'm' is the slope
                      // 'b' is the y intercept (value of y when x = 0)
Two (distinct) points define a line.
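A quick numeric check (my own illustration, not from the slides): the finite-difference slope of the line y = mx + b comes out to m no matter where it is measured. The helper name slope and the values of m and b are made up for the demo.

def slope(f, x, h=1e-6):
    # approximate df/dx at x with a symmetric finite difference
    return (f(x + h) - f(x - h)) / (2 * h)

m, b = 3.0, -1.0                             # an arbitrary line y = m*x + b
line = lambda x: m * x + b
print(slope(line, 0.0), slope(line, 10.0))   # both ~3.0, ie, the slope m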

4 Weight Space
Given a neural-network layout, the weights and biases are free parameters that define a space.
Each point in this Weight Space specifies a network; weight space is a continuous space we search.
Associated with each point is an error rate, E, over the training data.
Backprop performs gradient descent in weight space.

5 Gradient Descent in Weight Space
[Figure: the total error on the training set plotted over weight space; from the current weight settings (W1, W2), the gradient ∂E/∂W points uphill, so we step to new weight settings by moving ΔW1, ΔW2 in the downhill direction.]

6 Backprop Seeks LOCAL Minima (in a continuous space)
[Figure: the error on the train set plotted over weight space, with several local minima.]
Note: a local min might overfit the training data, so 'early stopping' is often used (covered later).

7 Local Min are Good Enough for Us!
ANNs, including Deep Networks, make accurate predictions even though we are likely only finding local minima.
The world could have been like this: [figure: an error surface over weight space in which most minima are poor and a good minimum is hard to find], ie, most minima poor, hard to find a good one.
Note: ensembles of ANNs work well (they often find different local minima).

8 The Gradient-Descent Rule
The 'gradient': ∇E(w) ≡ [ ∂E/∂w0, ∂E/∂w1, ∂E/∂w2, … , ∂E/∂wN ]
This is an N+1 dimensional vector (ie, the 'slope' in weight space).
Since we want to reduce errors, we want to go 'down hill'.
We'll take a finite step in weight space:
Δw = - η ∇E(w)    or    Δwi = - η ∂E/∂wi
where Δ ('delta') is the change to w and η is the step size.
[Figure: a downhill step on the error surface E over weights W1 and W2.]
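Below is a small sketch (mine, not the lecture's code) of the finite step Δw = -η ∇E(w), applied to a made-up quadratic error whose gradient can be written by hand; the target vector, η value, and step count are arbitrary choices for the demo.

import numpy as np

target = np.array([2.0, -1.0])                       # hypothetical 'best' weights for this toy error
def error(w):    return np.sum((w - target) ** 2)    # toy E(w)
def gradient(w): return 2 * (w - target)             # [∂E/∂w0, ∂E/∂w1] for this toy E

w = np.array([0.0, 0.0])          # current point in weight space
eta = 0.1                         # step size
for _ in range(50):
    w = w - eta * gradient(w)     # Δw = -η ∇E(w)
print(w, error(w))                # w approaches the target, E approaches 0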

9 ‘On Line’ vs. ‘Batch’ Backprop
Technically, we should look at the error gradient for the entire training set before taking a step in weight space ('batch' backprop).
However, in practice we take a step after each example ('on-line' backprop):
Much faster convergence (learn after each example)
Called 'stochastic' gradient descent
Stochastic gradient descent is quite popular at Google, Facebook, Microsoft, etc, due to easy parallelism.
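The sketch below (not course code) contrasts the two schedules on a simple linear unit, so each example's gradient step fits on one line (the (T - o) x form is derived a few slides later); the two training examples and η are made-up values. The same batch-vs-per-example distinction applies to full backprop.

import numpy as np

examples = [(np.array([3.0, -2.0, -1.0]), 1.0),   # (inputs incl. a -1 bias input, teacher)
            (np.array([6.0,  1.0, -1.0]), 0.0)]
eta = 0.1

def batch_epoch(w):
    total = np.zeros_like(w)
    for x, t in examples:                 # sum the ΔW vectors over ALL examples ...
        total += eta * (t - w @ x) * x
    return w + total                      # ... then take one step

def online_epoch(w):
    for x, t in examples:                 # take a step after EACH example
        w = w + eta * (t - w @ x) * x     # (stochastic gradient descent)
    return w

w0 = np.zeros(3)
print(batch_epoch(w0))    # one batch step
print(online_epoch(w0))   # two on-line steps; final weights generally differ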

10 ‘On Line’ vs. ‘Batch’ BP (continued)
* Note wi,BATCH  wi, ON-LINE, for i > 1 ‘On Line’ vs. ‘Batch’ BP (continued) E w BATCH – add w vectors for every training example, then ‘move’ in weight space ON-LINE – ‘move’ after each example (aka, stochastic gradient descent) wi wex1 wex3 wex2 E wex1 wex2 Vector from BATCH wex3 w * Final locations in space need not be the same for BATCH and ON-LINE w 11/10/16 cs540 - Fall 2016 (Shavlik©), Lecture 18, Week 10

11 BP Calculations (note: text uses ‘loss’ instead of ‘error’)
Assume one layer of hidden units (std. non-deep topology); index i over output units, j over hidden units, k over input units.
Error ≡ ½ Σ ( Teacheri – Outputi )²                                       (1)
      = ½ Σ ( Teacheri – F( Σj Wi,j × Outputj ) )²                        (2)
      = ½ Σ ( Teacheri – F( Σj Wi,j × F( Σk Wj,k × Outputk ) ) )²         (3)
Determine ∂Error/∂Wi,j (use equation 2) and ∂Error/∂Wj,k (use equation 3).
Recall Δwx,y = - η (∂E / ∂wx,y).
See the text's section on backprop for the specific calculations.
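The sketch below (my own illustration, not the text's calculations) computes those two gradients for one hidden layer of logistic units; the function names, matrix shapes, and demo values are assumptions, and the biases are omitted here since the slides fold them in as weights on a constant -1 input (see later slides).

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(x, teacher, W_jk, W_ij):
    # forward pass: inputs (k) -> hidden (j) -> outputs (i)
    hidden = sigmoid(W_jk @ x)           # out_j = F( Σk Wj,k × out_k )
    output = sigmoid(W_ij @ hidden)      # out_i = F( Σj Wi,j × out_j )
    # backward pass for Error = ½ Σ (Teacher_i - Output_i)²
    delta_i = -(teacher - output) * output * (1 - output)    # ∂E/∂net_i
    delta_j = (W_ij.T @ delta_i) * hidden * (1 - hidden)     # ∂E/∂net_j
    dE_dWij = np.outer(delta_i, hidden)  # ∂E/∂Wi,j  (from equation 2)
    dE_dWjk = np.outer(delta_j, x)       # ∂E/∂Wj,k  (from equation 3)
    return dE_dWij, dE_dWjk

rng = np.random.default_rng(0)
W_jk = rng.normal(size=(3, 2))           # 3 hidden units, 2 inputs
W_ij = rng.normal(size=(1, 3))           # 1 output unit
g_ij, g_jk = backprop_gradients(np.array([0.5, -1.0]), np.array([1.0]), W_jk, W_ij)
print(g_ij.shape, g_jk.shape)            # (1, 3) and (3, 2): one gradient per weight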

12 Differentiating the Logistic Function (‘soft’ step-function)
out i = 1 / (1 + e^-( Σj wj,i × outj – Θi ))
F'(wgt'ed in) = out i ( 1 – out i )
[Figure: F(wgt'ed in) rises from 0 to 1, passing 1/2 at the threshold; F'(wgt'ed in) peaks at 1/4 there and falls toward 0 in the tails.]
Notice that even if the unit is totally wrong, there is no (or very little) change in the weights when the weighted input is far from the threshold, since F' ≈ 0 there.
Note: Differentiating RLUs is easy! (use F' = 0 when input = bias)
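A quick check (illustration only) that the logistic function's derivative really is out × (1 - out), that it peaks at 1/4 at the threshold, and that it vanishes in the tails, which is why a saturated, totally-wrong unit barely changes its weights. The sample z values are arbitrary.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])    # weighted input minus the bias
out = logistic(z)
analytic = out * (1 - out)                     # F'(wgt'ed in) = out (1 - out)
numeric = (logistic(z + 1e-6) - logistic(z - 1e-6)) / 2e-6
print(analytic)                                # ~[0.00005, 0.105, 0.25, 0.105, 0.00005]
print(np.allclose(analytic, numeric))          # True: the formula matches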

13 Gradient Descent for the Perceptron (for the simple case of linear output units)
Error ≡ ½ Σ ( T – o )²
where o is the network's output and T is the teacher's answer (a constant wrt the weights).
∂E/∂Wk = (T – o) × ∂(T – o)/∂Wk = - (T – o) × ∂o/∂Wk

14 Continuation of Derivation
Stick in the formula for the output (a linear unit: o = Σj wj × xj):
∂E/∂Wk = - (T – o) × ∂( Σj wj × xj )/∂Wk = - (T – o) × xk
Recall ΔWk ≡ - η ∂E/∂Wk
So ΔWk = η (T – o) xk    ← The Perceptron Rule
Also known as the delta rule (and other names, with some variation in the calculation).
We'll use it for both LINEAR and STEP-FUNCTION activation.
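As a concrete sketch (the helper names are mine), the rule becomes a one-line weight update; this version uses the step-function activation and folds the bias in as a weight on a trailing -1 input, as the next slides do.

def step(weighted_sum):
    return 1 if weighted_sum >= 0 else 0

def perceptron_update(weights, x, teacher, eta=0.1):
    # One application of ΔWk = η (T - o) xk; x includes the trailing -1 bias input.
    out = step(sum(w * xi for w, xi in zip(weights, x)))
    return [w + eta * (teacher - out) * xi for w, xi in zip(weights, x)]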

15 Node Biases
Recall: a node's output is a weighted function of its inputs and a 'bias' term.
[Figure: a node with its inputs, a bias, and an output.]
These biases also need to be learned!

16 Training Biases ( Θ's )
A node's output (assume a 'step function' for simplicity):
1 if W1 X1 + W2 X2 + … + Wn Xn ≥ Θ
0 otherwise
Rewriting:
W1 X1 + W2 X2 + … + Wn Xn – Θ ≥ 0
W1 X1 + W2 X2 + … + Wn Xn + Θ × (-1) ≥ 0
Here the -1 plays the role of an extra input's 'activation' and Θ becomes its weight.

17 Training Biases (cont.)
Hence, add another input unit whose activation is always -1.
The bias is then just another weight!
Eg: [figure: a node with threshold Θ is redrawn with an extra input fixed at -1 whose weight is Θ].
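A tiny sketch of the trick (helper names and the specific numbers are mine): append a constant -1 to every input vector and treat Θ as the weight on that input, so 'w·x ≥ Θ' becomes an ordinary weighted sum compared against 0 and Θ is learned like any other weight.

def augment(x):
    return list(x) + [-1]                  # x' = (x1, ..., xn, -1)

def fires(weights, x, theta):
    return sum(w * xi for w, xi in zip(weights, x)) >= theta

def fires_augmented(weights_plus_theta, x_aug):
    return sum(w * xi for w, xi in zip(weights_plus_theta, x_aug)) >= 0

w, theta = [1.0, -3.0], 2.0
x = [3.0, -2.0]
print(fires(w, x, theta), fires_augmented(w + [theta], augment(x)))   # same answer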

18 Perceptron Example (assume step function and use η = 0.1)
Train Set:
  X1   X2   Correct Output
   3   -2   1
   6    1   0
   5   -3   …
Perceptron Learning Rule: ΔWk = η (T – o) xk
Network: weight 1 on X1, weight -3 on X2, and weight 2 on a constant -1 input (the bias).
First example: Out = StepFunction(3 × 1 + (-2) × (-3) - 1 × 2) = 1
No wgt changes, since correct.

19 Perceptron Example (assume step function and use η = 0.1)
Train Set:
  X1   X2   Correct Output
   3   -2   1
   6    1   0
   5   -3   …
Perceptron Learning Rule: ΔWk = η (T – o) xk
Weights are still 1 on X1, -3 on X2, and 2 on the constant -1 input.
Second example: Out = StepFunction(6 × 1 + 1 × (-3) - 1 × 2) = 1 // but Teacher = 0, so need to update weights

20 Perceptron Example (assume step function and use η = 0.1)
Train Set:
  X1   X2   Correct Output
   3   -2   1
   6    1   0
   5   -3   …
Perceptron Learning Rule: ΔWk = η (T – o) xk
Out = StepFunction(6 × 1 + 1 × (-3) - 1 × 2) = 1, but Teacher = 0, so update each weight:
  Weight on X1:            1 + 0.1 × (0 - 1) × 6    = 1 - 0.6  = 0.4
  Weight on X2:           -3 + 0.1 × (0 - 1) × 1    = -3 - 0.1 = -3.1
  Weight on the -1 input:  2 + 0.1 × (0 - 1) × (-1) = 2 + 0.1  = 2.1
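The short script below (mine, not the course's) replays the two updates above; the starting weights 1, -3, 2 are my reading of the slide's network figure, so treat them as an assumption.

def step(s):
    return 1 if s >= 0 else 0

def train_one(weights, x, teacher, eta=0.1):
    out = step(sum(w * xi for w, xi in zip(weights, x)))
    return [w + eta * (teacher - out) * xi for w, xi in zip(weights, x)]

w = [1.0, -3.0, 2.0]                       # weights on X1, X2, and the constant -1 input
w = train_one(w, [3, -2, -1], teacher=1)   # output 1, correct: no change
w = train_one(w, [6,  1, -1], teacher=0)   # output 1, wrong: update
print(w)                                   # [0.4, -3.1, 2.1]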

21 Pushing the Weights and Bias in the Correct Direction when Wrong
Assume TEACHER = 1 and ANN = 0, so some combination of:
(a) the wgts on some positively valued inputs are too small
(b) the wgts on some negatively valued inputs are too large
(c) the 'bias' is too large
[Figure: the wgt'ed sum must be pushed up past the bias before the output fires.]
Opposite movement when TEACHER = 0 and ANN = 1.

22 Case Analysis: Assume Teach = 1, out = 0, η = 1
ΔWk = η (T – o) xk, so with Teach = 1, out = 0, and η = 1 we get ΔWk = xk.
Note: 'bigger' means closer to +infinity and 'smaller' means closer to -infinity.
Four cases (pos/neg input × pos/neg weight), plus the case for the BIAS input:
  Input Vector:   1,    -1,     1,   '-1'      // the last is the BIAS input
  Weights:        2,    -4,    -3,    5        // the last is the BIAS weight
  New Wgts:      2+1,  -4-1,  -3+1,  5-1       // bigger, smaller, bigger, smaller
  Old vs New Input × Wgt:  2 vs 3,  4 vs 5,  -3 vs -2,  -5 vs -4
So the weighted sum will be LARGER (-2 vs 2) and the BIAS will be SMALLER.
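A quick check of the case analysis: with η = 1, T = 1, out = 0, the update is ΔW = x, so the new weighted sum is the old one plus x·x, ie, it always moves toward firing. The specific weights below (2, -4, -3, 5) are my reading of the garbled slide, so treat them as an assumption; the conclusion holds for any input vector.

x = [1, -1, 1, -1]          # inputs; the final -1 is the BIAS input
w = [2, -4, -3, 5]          # weights; the final 5 plays the role of the bias

old_sum = sum(wi * xi for wi, xi in zip(w, x))
w_new = [wi + xi for wi, xi in zip(w, x)]              # ΔWk = xk
new_sum = sum(wi * xi for wi, xi in zip(w_new, x))
print(old_sum, new_sum)                                # -2 then 2: moved toward firing
print(new_sum - old_sum == sum(xi * xi for xi in x))   # True: grows by |x|²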

23 HW4
Train a perceptron on WINE.
For N Boolean-valued features, have N+1 input units (the +1 is due to the '-1' input) and 1 output unit.
The learned model is represented by a vector of N+1 doubles.
Your code should handle any dataset that meets the HW0 spec.
(Maybe also train an ANN with 100 HUs - but not required.)
Employ 'early stopping' (covered later).
Compare the perceptron's testset results to random forests.

24 Wrapup of Basics of ANN Training
We differentiate (in the calculus sense) all the free parameters in an ANN with a fixed structure ('topology').
If all else is held constant ('partial derivatives'), what is the impact of changing weightk? That is, compute ∂Error/∂Wk.
Simultaneously move each weight a small amount in the direction that reduces error.
Process example-by-example, many times.
This seeks a local minimum, ie, a point where all the derivatives = 0.

