cs540 - Fall 2016 (Shavlik©), Lecture 18, Week 10

Today's Topics
  Weight Space (for ANNs)
  Gradient Descent and Local Minima
  Stochastic Gradient Descent
  Backpropagation
  The Need to Train the Biases and a Simple Algebraic Trick
  Perceptron Training Rule and a Worked Example
  Case Analysis of the Delta Rule
WARNING! Some Calculus Ahead
No Calculus Experience?
  Derivatives generalize the idea of SLOPE.
  How to calculate the SLOPE of a line:
    d(mx + b)/dx = m    // 'mx + b' is the algebraic form of a line
                        // 'm' is the slope
                        // 'b' is the y intercept (value of y when x = 0)
  Two (distinct) points define a line.
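A tiny numeric sketch of this idea in Python (the values m = 2 and b = 5 are arbitrary, not from the slides): the finite-difference slope of the line y = mx + b comes out to m no matter where you measure it.

```python
# Slope of the line y = m*x + b estimated by a finite difference; it equals m everywhere.
m, b = 2.0, 5.0
line = lambda x: m * x + b
h = 1e-6
for x in [-3.0, 0.0, 7.0]:
    print((line(x + h) - line(x)) / h)   # ~2.0 at every x
```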
Weight Space
  Given a neural-network layout, the weights and biases are free parameters that define a space.
  Each point in this weight space specifies a network; weight space is a continuous space we search.
  Associated with each point is an error rate, E, over the training data.
  Backprop performs gradient descent in weight space.
Gradient Descent in Weight Space
  [Figure: the total error on the training set plotted over weights W1 and W2; from the current weight settings, a step along the negative gradient ∂E/∂W leads to new weight settings with lower error.]
Backprop Seeks LOCAL Minima (in a continuous space)
  [Figure: error on the train set plotted over weight space, showing several local minima.]
  Note: a local min might overfit the training data, so 'early stopping' is often used (covered later).
Local Minima are Good Enough for Us!
  ANNs, including deep networks, make accurate predictions even though we are likely only finding local minima.
  The world could have been like this: most minima poor, and a good minimum hard to find.
  [Figure: error on the train set over weight space for that hypothetical world, with many poor local minima.]
  Note: ensembles of ANNs work well (they often find different local minima).
The Gradient-Descent Rule
  The 'gradient':  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ∂E/∂w2, …, ∂E/∂wN ]
  This is an (N+1)-dimensional vector (ie, the 'slope' in weight space).
  Since we want to reduce errors, we want to go 'downhill'.
  We'll take a finite step in weight space:
    Δw = -η ∇E(w)      or, per weight,      Δwi = -η ∂E/∂wi
  where Δ ('delta') is the change to w and η is the step size.
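As a concrete sketch (not from the lecture; the toy error function E and the step size eta are made up for illustration), the Python below takes repeated finite steps downhill using a finite-difference estimate of the gradient. Backprop instead computes the gradient analytically, but the stepping rule is the same.

```python
import numpy as np

def numeric_gradient(E, w, h=1e-5):
    """Estimate dE/dw_i by finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        bumped = w.copy()
        bumped[i] += h
        grad[i] = (E(bumped) - E(w)) / h
    return grad

def gradient_descent_step(E, w, eta=0.1):
    """Take a finite step 'downhill': w_new = w - eta * gradient of E at w."""
    return w - eta * numeric_gradient(E, w)

# Toy error surface (assumed for illustration): a bowl with its minimum at (1, -2).
E = lambda w: (w[0] - 1.0)**2 + (w[1] + 2.0)**2
w = np.array([3.0, 4.0])
for _ in range(50):
    w = gradient_descent_step(E, w)
print(w)   # close to [1, -2]
```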
'On Line' vs. 'Batch' Backprop
  Technically, we should look at the error gradient for the entire training set before taking a step in weight space ('batch' backprop).
  However, in practice we take a step after each example ('on-line' backprop):
    Much faster convergence (we learn after each example)
    Called 'stochastic' gradient descent
  Stochastic gradient descent is quite popular at Google, Facebook, Microsoft, etc, due to easy parallelism.
'On Line' vs. 'Batch' BP (continued)
  BATCH – add the Δw vectors for every training example, then 'move' in weight space.
  ON-LINE – 'move' after each example (aka stochastic gradient descent).
  [Figure: two paths through weight space; the BATCH step is the sum of the per-example vectors Δw_ex1, Δw_ex2, Δw_ex3, while ON-LINE moves after each example in turn.]
  * Note: Δwi,BATCH ≠ Δwi,ON-LINE for i > 1.
  * Final locations in weight space need not be the same for BATCH and ON-LINE.
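A minimal sketch of the two update schedules for a single linear output unit, assuming the delta rule derived later in this lecture; the training data and names (X, T, eta) here are illustrative only. After one pass over the data the two schedules generally land at different points in weight space.

```python
import numpy as np

def batch_epoch(w, X, T, eta=0.1):
    """Add up the per-example delta-w vectors, then take ONE step in weight space."""
    delta = np.zeros_like(w)
    for x, t in zip(X, T):
        o = np.dot(w, x)                       # linear output unit
        delta += eta * (t - o) * x             # delta-rule contribution for this example
    return w + delta

def online_epoch(w, X, T, eta=0.1):
    """Take a step after EACH example (stochastic gradient descent)."""
    for x, t in zip(X, T):
        o = np.dot(w, x)
        w = w + eta * (t - o) * x
    return w

X = np.array([[3.0, -2.0], [6.0, 1.0]])        # illustrative examples
T = np.array([1.0, 0.0])
w0 = np.array([1.0, -3.0])
print(batch_epoch(w0.copy(), X, T))            # the two epochs end at
print(online_epoch(w0.copy(), X, T))           # different points in weight space
```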
BP Calculations (note: the text uses 'loss' instead of 'error')
  Assume one layer of hidden units (std. non-deep topology); layers indexed k (input), j (hidden), i (output).
  Error ≡ ½ (Teacher_i – Output_i)²                                        (1)
        = ½ (Teacher_i – F( Σ_j [ W_i,j × Output_j ] ))²                   (2)
        = ½ (Teacher_i – F( Σ_j [ W_i,j × F( Σ_k W_j,k × Output_k ) ] ))²  (3)
  Determine  ∂Error/∂W_i,j  (use equation 2)  and  ∂Error/∂W_j,k  (use equation 3)
  Recall  ΔW_x,y = -η (∂E/∂W_x,y)
  See the text's Section 18.7.4 for the specific calculations.
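A minimal sketch of one 'on-line' backprop step for this k → j → i topology, assuming logistic units and squared error, and ignoring biases (handled later via the -1-input trick). The weight-matrix names W_jk, W_ij and eta are illustrative; the delta terms come from differentiating the error above with the chain rule (see the text for the full derivation).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W_jk, W_ij, x, teacher, eta=0.1):
    """One gradient-descent step for a k -> j -> i network on one example."""
    # Forward pass
    out_j = sigmoid(W_jk @ x)          # hidden-layer outputs
    out_i = sigmoid(W_ij @ out_j)      # output-layer outputs

    # Backward pass: error signal at the output layer, then at the hidden
    # layer via the chain rule (using F' = out * (1 - out) for the logistic).
    delta_i = (teacher - out_i) * out_i * (1.0 - out_i)
    delta_j = (W_ij.T @ delta_i) * out_j * (1.0 - out_j)

    # Weight updates: Delta W = eta * (error signal) * (input feeding that weight)
    W_ij = W_ij + eta * np.outer(delta_i, out_j)
    W_jk = W_jk + eta * np.outer(delta_j, x)
    return W_jk, W_ij

# Example usage with made-up sizes: 3 inputs, 2 hidden units, 1 output unit.
rng = np.random.default_rng(0)
W_jk, W_ij = rng.normal(size=(2, 3)), rng.normal(size=(1, 2))
W_jk, W_ij = backprop_step(W_jk, W_ij, x=np.array([1.0, 0.0, -1.0]), teacher=np.array([1.0]))
```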
Differentiating the Logistic Function ('soft' step-function)
  out_i = 1 / (1 + e^-(Σ_j w_j,i × out_j - Θ_i))
  F'(wgt'ed in) = out_i (1 - out_i)
  [Figure: F(wgt'ed in) rises from 0 to 1, passing through 1/2 at the threshold; F'(wgt'ed in) is bell-shaped with its maximum of 1/4 at the threshold.]
  Notice that when the weighted input is far from the threshold, even if the output is totally wrong there is no (or very little) change in the weights.
  Note: Differentiating ReLUs is easy! (use F' = 0 when input = bias)
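A quick numeric check of this identity in Python (finite differences, values of z chosen arbitrarily). It also shows why saturated units barely move: far from the threshold the slope is essentially zero.

```python
import numpy as np

F = lambda z: 1.0 / (1.0 + np.exp(-z))          # logistic ('soft' step) function
h = 1e-6
for z in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    numeric  = (F(z + h) - F(z)) / h            # slope of the logistic at z
    identity = F(z) * (1.0 - F(z))              # out * (1 - out)
    print(z, round(numeric, 4), round(identity, 4))
# At z = 0 both give 0.25 (the maximum); far from 0 the slope is ~0,
# so the weight changes are tiny even when the output is wrong.
```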
Gradient Descent for the Perceptron (for the simple case of linear output units)
  Error ≡ ½ (T – o)²      // T = teacher's answer (a constant wrt the weights), o = network's output
  ∂E/∂Wk = (T – o) ∂(T – o)/∂Wk = -(T – o) ∂o/∂Wk
Continuation of Derivation
  Stick in the formula for the output (o = Σ_j w_j x_j):
    ∂E/∂Wk = -(T – o) ∂(Σ_j w_j x_j)/∂Wk = -(T – o) x_k
  Recall ΔWk ≡ -η ∂E/∂Wk
  So ΔWk = η (T – o) x_k      // The Perceptron Rule
  Also known as the delta rule (and other names, with some variation in the calculation).
  We'll use it for both LINEAR and STEP-FUNCTION activation.
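A sketch of this rule in Python, assuming step-function activation and the constant -1 bias input introduced on the next slides; the function names are illustrative.

```python
import numpy as np

def step(z):
    """Step-function activation: fire (1) iff the weighted sum is >= 0."""
    return 1.0 if z >= 0.0 else 0.0

def perceptron_update(w, x, teacher, eta=0.1):
    """Apply Delta w_k = eta * (T - o) * x_k to every weight.
    Here x is assumed to already include the constant -1 bias input."""
    o = step(np.dot(w, x))
    return w + eta * (teacher - o) * x
```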
Node Biases
  Recall: a node's output is a weighted function of its inputs and a 'bias' term.
  [Figure: a node with several inputs, a bias, and one output.]
  These biases also need to be learned!
Training Biases (Θ's)
  A node's output (assume a 'step function' for simplicity):
    1 if W1 X1 + W2 X2 + … + Wn Xn ≥ Θ
    0 otherwise
  Rewriting:
    W1 X1 + W2 X2 + … + Wn Xn – Θ ≥ 0
    W1 X1 + W2 X2 + … + Wn Xn + Θ × (-1) ≥ 0
  Here Θ plays the role of a weight and -1 is its 'activation'.
Training Biases (cont.)
  Hence, add another input unit whose activation is always -1.
  The bias is then just another weight!
  [Figure: the same node drawn two ways - with the threshold Θ inside the node, and with Θ as the weight on an extra input that is always -1.]
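A sketch of this trick in Python; the weights and inputs used here are the ones from the worked example on the next slides.

```python
import numpy as np

def add_bias_input(x):
    """Return [x1, ..., xn, -1] so that w . x >= 0 encodes sum(w_i * x_i) >= Theta."""
    return np.append(x, -1.0)

x = np.array([3.0, -2.0])
w = np.array([1.0, -3.0, 2.0])        # the last weight plays the role of Theta
print(np.dot(w, add_bias_input(x)))   # 1*3 + (-3)*(-2) + 2*(-1) = 7, so the unit outputs 1
```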
Perceptron Example (assume step function and use η = 0.1)
  Perceptron Learning Rule:  ΔWk = η (T – o) xk
  Train Set (X1, X2 → Correct Output):  (3, -2) → 1;  (6, 1) → 0;  (5, -3) → …
  Network: weight 1 on X1, weight -3 on X2, and weight 2 on the bias input (-1).
  First example:  Out = StepFunction(3·1 + (-2)·(-3) + (-1)·2) = StepFunction(7) = 1
  No weight changes, since correct.
Perceptron Example (cont.)
  Second example:  Out = StepFunction(6·1 + 1·(-3) + (-1)·2) = StepFunction(1) = 1
  The correct output is 0, so we need to update the weights using ΔWk = η (T – o) xk.
Perceptron Example (cont.)
  Weight updates for the second example (T = 0, o = 1, η = 0.1):
    Weight on X1:   1 - 0.1·6    = 0.4
    Weight on X2:  -3 - 0.1·1    = -3.1
    Bias weight:    2 - 0.1·(-1) = 2.1
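A short check of these numbers in Python, assuming the delta rule with a step function, η = 0.1, and weights [1, -3, 2] on inputs [X1, X2, -1].

```python
import numpy as np

def step(z):
    return 1.0 if z >= 0.0 else 0.0

eta = 0.1
w = np.array([1.0, -3.0, 2.0])                 # weights on [X1, X2, -1]
for x, teacher in [(np.array([3.0, -2.0, -1.0]), 1.0),
                   (np.array([6.0,  1.0, -1.0]), 0.0)]:
    o = step(np.dot(w, x))
    w = w + eta * (teacher - o) * x            # Delta w_k = eta (T - o) x_k
    print(o, w)
# First example: output 1, correct, weights unchanged.
# Second example: output 1, teacher 0, new weights [0.4, -3.1, 2.1].
```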
Pushing the Weights and Bias in the Correct Direction when Wrong
  Assume TEACHER = 1 and ANN = 0, so some combination of:
    (a) weights on some positively valued inputs are too small
    (b) weights on some negatively valued inputs are too large
    (c) the 'bias' is too large
  [Figure: step-function output vs weighted sum, with the threshold set by the bias.]
  Opposite movement when TEACHER = 0 and ANN = 1.
Case Analysis: Assume Teach = 1, out = 0, η = 1, so ΔWk = η (T – o) xk = xk
  Note: 'bigger' means closer to +infinity and 'smaller' means closer to -infinity.
  Four cases (pos/neg input × pos/neg weight), plus cases for the BIAS (input '-1', weight 6 or -6):

    Input:              1       -1        1       -1      -1        -1
    Old weight:         2       -4       -3        5       6        -6    // last two: the BIAS
    New weight:        2+1     -4-1     -3+1      5-1     6-1      -6-1
                      bigger  smaller  bigger  smaller  smaller  smaller
    Old vs new input·wgt:  2 vs 3,  4 vs 5,  -3 vs -2,  -5 vs -4

  So the weighted sum will be LARGER (-2 vs 2), and the BIAS will be SMALLER.
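A numeric check of this case analysis in Python, using the inputs, weights, and bias from the table above (teacher = 1, out = 0, η = 1).

```python
import numpy as np

inputs  = np.array([1.0, -1.0, 1.0, -1.0])
weights = np.array([2.0, -4.0, -3.0, 5.0])
bias_w, bias_in = 6.0, -1.0                           # try also bias_w = -6.0

new_weights = weights + 1.0 * (1.0 - 0.0) * inputs    # Delta w_k = eta (T - o) x_k
new_bias_w  = bias_w  + 1.0 * (1.0 - 0.0) * bias_in

print(np.dot(weights, inputs), np.dot(new_weights, inputs))  # weighted sum: -2.0 -> 2.0 (LARGER)
print(bias_w, new_bias_w)                                    # bias weight:   6.0 -> 5.0 (SMALLER)
```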
HW4
  Train a perceptron on WINE.
  For N Boolean-valued features, have N+1 input units (+1 due to the "-1" input) and 1 output unit.
  The learned model is represented by a vector of N+1 doubles.
  Your code should handle any dataset that meets the HW0 spec.
  (Maybe also train an ANN with 100 HUs - but not required.)
  Employ 'early stopping' (covered later).
  Compare perceptron test-set results to random forests.
Wrapup of Basics of ANN Training
  We differentiate (in the calculus sense) all the free parameters in an ANN with a fixed structure ('topology').
  If all else is held constant ('partial derivatives'), what is the impact of changing weight_k?  Ie, compute ∂Error/∂Wk.
  Simultaneously move each weight a small amount in the direction that reduces error.
  Process example-by-example, many times.
  This seeks a local minimum, ie, where all derivatives = 0.