Today’s Topics
Midterm class mean: 83.5
HW3 Due Thursday and HW4 Out Thursday
Turn in Your BN Nannon Player (in Separate, ‘Dummy’ Assignment) until a Week from Thursday
Weight Space (for ANNs)
Gradient Descent and Local Minima
Stochastic Gradient Descent
Backpropagation
The Need to Train the Biases and a Simple Algebraic Trick
Perceptron Training Rule and a Worked Example
Case Analysis of Delta Rule
Neural ‘Word Vectors’
11/3/15 CS Fall 2015 (Shavlik©), Lecture 19, Week 9
Back to Prob Reasoning for Two Slides: Base-Rate Fallacy
Assume Disease A is rare (one in 1 million, say; the slide’s picture is not to scale)
Assume population is 10B = $10^{10}$, so $10^4$ people have it
Assume testForA is 99.99% accurate
You test positive. What is the prob you have Disease A?
Someone (not in cs540) might naively think prob = 0.9999
People for whom testForA = true: 9999 people that actually have Disease A (99.99% of $10^4$) and about $10^6$ people that do NOT have Disease A (0.01% of roughly $10^{10}$)
Prob(A | testForA) = 9999 / (9999 + $10^6$) ≈ 0.01
This same issue arises in ML when we have many more neg than pos ex’s: false pos overwhelm true pos
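The arithmetic on this slide can be checked with Bayes' rule. The sketch below is only an illustration: the function and parameter names are made up here, and it assumes '99.99% accurate' means the test is right 99.99% of the time both for people who have Disease A and for people who do not.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Prob(A | testForA = true) via Bayes' rule."""
    true_pos = prior * sensitivity                    # have A and test positive
    false_pos = (1.0 - prior) * false_positive_rate   # do not have A, still test positive
    return true_pos / (true_pos + false_pos)

# Assumption: 99.99% accuracy applies to both the sick and the healthy.
p = posterior_given_positive(prior=1e-6, sensitivity=0.9999, false_positive_rate=0.0001)
print(f"Prob(A | testForA) = {p:.4f}")
```

Raising the prior (a less rare disease) while keeping the same test quickly pushes the answer toward the naive one, which is why the base rate matters.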
A Major Weakness of BN’s
(I also copied this and prev slide to an earlier lecture, for future cs540’s)
If there are many ‘hidden’ random vars (N binary vars, say), then the marginalization formula leads to many calls to a BN ($2^N$ in our example; for N = 20, $2^N$ = 1,048,576)
Using uniform-random sampling to estimate the result is too inaccurate since most of the probability might be concentrated in only a few ‘complete world states’
Hence, much research (beyond cs540’s scope) on scaling up inference in BNs and other graphical models, eg via more sophisticated sampling (eg, MCMC)
WARNING! Some Calculus Ahead
No Calculus Experience? For HWs and the Final …
Derivatives generalize the idea of SLOPE
You only need to know how to calc the SLOPE of a line:
$\frac{d(mx + b)}{dx} = m$
// ‘mx + b’ is the algebraic form of a line
// ‘m’ is the slope
// ‘b’ is the y intercept (value of y when x = 0)
Two (distinct) points define a line
Weight Space
Given a neural-network layout, the weights and biases are free parameters that define a space
Each point in this Weight Space specifies a network; weight space is a continuous space we search
Associated with each point is an error rate, E, over the training data
Backprop performs gradient descent in weight space
Gradient Descent in Weight Space
[Figure: Total Error on Training Set plotted over weights W1 and W2; labels mark the ERROR with Current Wgt Settings, the Current Wgt Settings, the New Wgt Settings, and $\partial E / \partial W$]
Backprop Seeks LOCAL Minima (in a continuous space)
[Figure: Error on Train Set plotted over Weight Space]
Note: a local min might overfit the training data, so ‘early stopping’ is often used (later)
Local Min are Good Enough for Us!
ANNs, including Deep Networks, make accurate predictions even though we likely are only finding local min
The world could have been like this:
[Figure: Error on Train Set plotted over Weight Space; ie, most min poor, hard to find a good min]
Note: ensembles of ANNs work well (often find different local minima)
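The Gradient-Descent Rule
The ‘gradient’: $\nabla E(\mathbf{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_N} \right]$
This is an N+1 dimensional vector (ie, the ‘slope’ in weight space)
Since we want to reduce errors, we want to go ‘down hill’
We’ll take a finite step in weight space:
$\Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})$ or $\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}$
‘delta’ = the change to w
[Figure: error surface E over weights W1 and W2, showing the step $\Delta \mathbf{w}$]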
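The gradient-descent rule can be sketched in a few lines of Python. The error function below is a made-up two-weight bowl (not a neural network's error), chosen only because its gradient is easy to write down; `eta` plays the role of the step size η, and all names are illustrative assumptions.

```python
def error(w):
    # Hypothetical error surface with its minimum at w = (3, -1).
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def gradient(w):
    # Partial derivatives of the hypothetical error above.
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]

def gradient_descent(w, eta=0.1, steps=100):
    for _ in range(steps):
        g = gradient(w)
        # Delta w_i = -eta * dE/dw_i, applied to every weight at once.
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

w_start = [0.0, 0.0]
w_final = gradient_descent(w_start)
print(w_final, error(w_final))
```

With a single bowl there is only one minimum, so this toy surface says nothing about local minima; it only shows the update step itself.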
‘On Line’ vs. ‘Batch’ Backprop
Technically, we should look at the error gradient for the entire training set before taking a step in weight space (‘batch’ backprop)
However, in practice we take a step after each example (‘on-line’ backprop)
–Much faster convergence (learn after each example)
–Called ‘stochastic’ gradient descent
–Stochastic gradient descent is very popular at Google, etc, due to easy parallelism
‘On Line’ vs. ‘Batch’ BP (continued)
BATCH – add $\Delta \mathbf{w}$ vectors for every training example, then ‘move’ in weight space
ON-LINE – ‘move’ after each example (aka, stochastic gradient descent)
[Figure: in weight space, the steps $\Delta \mathbf{w}_{ex1}$, $\Delta \mathbf{w}_{ex2}$, $\Delta \mathbf{w}_{ex3}$ taken one at a time (ON-LINE) versus their sum (the vector from BATCH)]
* Final locations in space need not be the same for BATCH and ON-LINE
* Note $\Delta \mathbf{w}_{i,\text{BATCH}} \neq \Delta \mathbf{w}_{i,\text{ON-LINE}}$, for i > 1
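A small Python sketch can make the BATCH versus ON-LINE difference concrete. It assumes a single linear output unit trained with the update η (T – o) x; the three training examples and the step size are made up for illustration.

```python
def output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def delta(w, x, t, eta):
    # Per-example weight change for a linear unit: eta * (T - o) * x.
    o = output(w, x)
    return [eta * (t - o) * xi for xi in x]

def batch_step(w, examples, eta):
    total = [0.0] * len(w)
    for x, t in examples:
        d = delta(w, x, t, eta)          # every delta uses the SAME starting weights
        total = [a + b for a, b in zip(total, d)]
    return [wi + ti for wi, ti in zip(w, total)]

def online_pass(w, examples, eta):
    for x, t in examples:
        d = delta(w, x, t, eta)          # uses weights already moved by earlier examples
        w = [wi + di for wi, di in zip(w, d)]
    return w

# Hypothetical training set: (input vector, teacher output).
examples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0), ([0.0, 1.0], 1.0)]
w0 = [0.0, 0.0]
w_batch = batch_step(w0, examples, eta=0.5)
w_online = online_pass(w0, examples, eta=0.5)
print(w_batch, w_online)
```

The first example produces the same change in both modes; from the second example on, ON-LINE computes its change from already-updated weights, so the two end up at different points.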
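Recall BP Calculations (note: text uses ‘loss’ instead of ‘error’)
Assume one layer of hidden units (std. non-deep topology), with input units k, hidden units j, and output units i
1. $\text{Error} \equiv \frac{1}{2} \sum_i (\text{Teacher}_i - \text{Output}_i)^2$
2. $= \frac{1}{2} \sum_i \left( \text{Teacher}_i - F\left( \sum_j [W_{i,j} \times \text{Output}_j] \right) \right)^2$
3. $= \frac{1}{2} \sum_i \left( \text{Teacher}_i - F\left( \sum_j \left[ W_{i,j} \times F\left( \sum_k W_{j,k} \times \text{Output}_k \right) \right] \right) \right)^2$
Determine
$\frac{\partial \text{Error}}{\partial W_{i,j}}$ = (use equation 2)
$\frac{\partial \text{Error}}{\partial W_{j,k}}$ = (use equation 3)
See Sec and Fig in textbook for results (I won’t ask you to derive on final)
Recall $\Delta w_{x,y} = -\eta \, (\partial E / \partial w_{x,y})$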
Differentiating the Logistic Function (‘soft’ step-function)
$out_i = \frac{1}{1 + e^{-\left( \sum_j w_{j,i} \times out_j - \Theta_i \right)}}$
$F'(\text{wgt’ed in}) = out_i \, (1 - out_i)$
[Figure: F(wgt’ed in) versus the wgt’ed input $\sum_j W_j \times out_j$, with 1/2 marked; F '(wgt’ed in) versus the wgt’ed input, with 1/4 marked]
Notice that even if totally wrong, no (or very little) change in weights, since F ' is close to 0 when the wgt’ed input is far from the bias
Note: Differentiating RLU’s easy! (use F’ = 0 when input = bias)
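The identity F' = out (1 – out) can be checked numerically. The sketch below folds the weighted sum and bias into a single argument `z` (an assumption made for brevity) and compares the formula against a central finite difference.

```python
import math

def logistic(z):
    # z stands for the weighted input minus the bias.
    return 1.0 / (1.0 + math.exp(-z))

def logistic_derivative(z):
    out = logistic(z)
    return out * (1.0 - out)

def numeric_derivative(f, z, h=1e-6):
    # Central finite difference, used only as a sanity check.
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(z, logistic_derivative(z), numeric_derivative(logistic, z))
```

The derivative peaks at 1/4 when z = 0 and is nearly 0 at z = ±10, which is the flat region where a badly wrong unit barely changes its weights.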
Gradient Descent for the Perceptron (for the simple case of linear output units)
$\text{Error} \equiv \frac{1}{2} (T - o)^2$, where o is the network’s output and T is the teacher’s answer (a constant wrt the weights)
$\frac{\partial E}{\partial W_a} = (T - o) \, \frac{\partial (T - o)}{\partial W_a} = -(T - o) \, \frac{\partial o}{\partial W_a}$
Continuation of Derivation
Stick in formula for output:
$\frac{\partial E}{\partial W_a} = -(T - o) \, \frac{\partial \left( \sum_k W_k x_k \right)}{\partial W_a} = -(T - o) \, x_a$
Recall $\Delta W_a \equiv -\eta \, \frac{\partial E}{\partial W_a}$
So $\Delta W_a = \eta \, (T - o) \, x_a$ (The Perceptron Rule)
Also known as the delta rule and other names (with some variation in the calc)
We’ll use it for both LINEAR and STEP-FUNCTION activation
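The result ∂E/∂W_a = –(T – o) x_a can be sanity-checked numerically for a linear output unit. The weights, inputs, and teacher value below are arbitrary illustrative numbers, and the function names are assumptions of this sketch.

```python
def linear_output(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))

def error(w, x, t):
    return 0.5 * (t - linear_output(w, x)) ** 2

def analytic_gradient(w, x, t):
    # dE/dW_a = -(T - o) * x_a, as derived above.
    o = linear_output(w, x)
    return [-(t - o) * xa for xa in x]

def numeric_gradient(w, x, t, h=1e-6):
    grads = []
    for a in range(len(w)):
        up = list(w)
        up[a] += h
        down = list(w)
        down[a] -= h
        grads.append((error(up, x, t) - error(down, x, t)) / (2.0 * h))
    return grads

w = [0.5, -1.0, 2.0]   # hypothetical weights
x = [1.0, 2.0, -1.0]   # hypothetical inputs
t = 1.0                # hypothetical teacher output
print(analytic_gradient(w, x, t))
print(numeric_gradient(w, x, t))
```

Multiplying the analytic gradient by –η gives the weight change η (T – o) x_a.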
Node Biases
Recall: A node’s output is a weighted function of its inputs and a ‘bias’ term
[Figure: labels Input, bias, 1, Output]
These biases also need to be learned!
Training Biases ( Θ’s )
A node’s output (assume ‘step function’ for simplicity):
1 if $W_1 X_1 + W_2 X_2 + \ldots + W_n X_n \geq \Theta$
0 otherwise
Rewriting:
$W_1 X_1 + W_2 X_2 + \ldots + W_n X_n - \Theta \geq 0$
$W_1 X_1 + W_2 X_2 + \ldots + W_n X_n + \Theta \, (-1) \geq 0$
In the last term, Θ acts as a weight and (-1) as an ‘activation’
Training Biases (cont.)
Hence, add another unit whose activation is always -1
The bias is then just another weight!
[Figure: eg, a node with bias Θ redrawn with Θ as the weight on an extra input fixed at -1]
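The algebraic trick can be checked in Python: a step-function node with an explicit threshold Θ behaves the same as a node that has an extra input fixed at -1 whose weight is Θ. The weights, threshold, and test inputs below are made-up values for illustration.

```python
def step_with_threshold(weights, inputs, theta):
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= theta else 0

def step_with_bias_weight(weights_plus_theta, inputs):
    # Append the always -1 activation; theta is now just the last weight.
    extended = list(inputs) + [-1.0]
    total = sum(w * x for w, x in zip(weights_plus_theta, extended))
    return 1 if total >= 0 else 0

weights = [2.0, -1.0]   # hypothetical weights
theta = 0.5             # hypothetical bias
test_inputs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.25, 0.0]]
for x in test_inputs:
    print(x, step_with_threshold(weights, x, theta),
          step_with_bias_weight(weights + [theta], x))
```

Because Θ is now an ordinary weight, the same learning rule that updates the other weights also trains the bias.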
Perceptron Example (assume step function and use η = 0.1)
Perceptron Learning Rule: $\Delta W_a = \eta \, (T - o) \, x_a$
[Table: Train Set with columns X1, X2, Correct Output] [Figure: perceptron with inputs X1 and X2]
Out = StepFunction(3 (-3) - 1 2) = 1
No wgt changes, since correct
Perceptron Example (assume step function and use η = 0.1)
Perceptron Learning Rule: $\Delta W_a = \eta \, (T - o) \, x_a$
[Table: Train Set with columns X1, X2, Correct Output] [Figure: perceptron with inputs X1 and X2]
Out = StepFunction(6 (-3) - 1 2) = 1
// So need to update weights
Perceptron Example (assume step function and use η = 0.1)
Perceptron Learning Rule: $\Delta W_a = \eta \, (T - o) \, x_a$
[Table: Train Set with columns X1, X2, Correct Output] [Figure: perceptron with updated weights; surviving labels: 6 =, 1 =, (-1) = 2.1]
Out = StepFunction(6 (-3) - 1 2) = 1
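One application of the perceptron learning rule can be sketched in Python. The weights, inputs, and teacher value below are hypothetical (illustrative numbers, not necessarily the slide's train set), and the bias is treated as the weight on an extra input fixed at -1.

```python
def step_function(z):
    return 1 if z >= 0 else 0

def perceptron_update(weights, inputs, teacher, eta=0.1):
    """Return (output, new weights) after one use of Delta W_a = eta * (T - o) * x_a."""
    out = step_function(sum(w * x for w, x in zip(weights, inputs)))
    new_weights = [w + eta * (teacher - out) * x for w, x in zip(weights, inputs)]
    return out, new_weights

# Hypothetical values; the last weight is the bias, the last input is always -1.
weights = [1.0, -3.0, 2.0]
inputs = [6.0, 1.0, -1.0]

out, new_weights = perceptron_update(weights, inputs, teacher=0)
print(out, new_weights)   # output is 1 but the teacher says 0, so every weight moves

out_ok, same_weights = perceptron_update(weights, inputs, teacher=1)
print(out_ok, same_weights)   # output matches the teacher, so nothing changes
```

When the output is already correct, (T – o) is 0 and the rule leaves the weights alone.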
Pushing the Weights and Bias in the Correct Direction when Wrong
[Figure: Output versus Wgt’ed Sum, with the ‘bias’ marked]
Assume TEACHER = 1 and ANN = 0, so some combo of
(a) wgts on some positively valued inputs too small
(b) wgts on some negatively valued inputs too large
(c) ‘bias’ too large
Opposite movement when TEACHER = 0 and ANN = 1
Case Analysis: Assume Teach = 1, out = 0, η = 1
$\Delta W_k = \eta \, (T - o) \, x_k$
Four Cases (Pos/Neg Input, Pos/Neg Weight), plus the Cases for the BIAS (input ‘-1’):

Input Vector:             1        -1        1         -1        ‘-1’ (BIAS)
Weights:                  2        -4        -3        5         6 or -6
New Wgts:                 2+1      -4-1      -3+1      5-1       6-1 or -6-1
Change:                   bigger   smaller   bigger    smaller   smaller (both cases)
Old vs New Input × Wgt:   2 vs 3   4 vs 5    -3 vs -2  -5 vs -4

So weighted sum will be LARGER (-2 vs 2)
And BIAS will be SMALLER
Note: ‘bigger’ means closer to +infinity and ‘smaller’ means closer to -infinity
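The case analysis can be replayed in Python using the slide's input vector and weights. The function name is an assumption of this sketch; the two bias values (6 and -6) are updated separately, each with the bias input -1.

```python
def delta_rule_update(weights, inputs, teacher, out, eta=1.0):
    # Delta W_k = eta * (T - o) * x_k for every weight.
    return [w + eta * (teacher - out) * x for w, x in zip(weights, inputs)]

inputs = [1, -1, 1, -1]
weights = [2, -4, -3, 5]

new_weights = delta_rule_update(weights, inputs, teacher=1, out=0)
old_sum = sum(w * x for w, x in zip(weights, inputs))
new_sum = sum(w * x for w, x in zip(new_weights, inputs))
print(new_weights, old_sum, new_sum)

# Both bias cases: the bias acts as a weight on an input that is always -1.
new_biases = delta_rule_update([6, -6], [-1, -1], teacher=1, out=0)
print(new_biases)
```

In all four input/weight cases the product input × weight grows, and in both bias cases the bias shrinks, so the node is pushed toward outputting 1.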
Neural Word Vectors – Current Hot Topic (see or
Distributional Hypothesis: words can be characterized by the words that appear nearby in a large text corpus (matrix algebra also used for this task, eg singular-value decomposition, SVD)
[Figure: Two Possible Designs (CBOW = Continuous Bag of Words)]
Initially assign each word a random k-long vector of random #’s in [0,1] or [-1,1] (k is something like 100 or 300), as opposed to traditional ‘1-of-N’ encoding
Recall 1-of-N:
aardvark = 1,0,0,…,0 // N is 50,000 or more!
zzzz = 0,0,0,…,1 // And nothing ‘shared’ by related words
Compute $\partial \text{Error} / \partial \text{Input}_i$ to change the input vector(s), ie, find good word vectors so it is easy to learn to predict the I/O pairs in the fig above
Neural Word Vectors
Surprisingly, one can do ‘simple algebra’ with these word vectors!
$\text{vector}_{France} - \text{vector}_{Paris} = \text{vector}_{Italy} - X$
Subtract vector for Paris from vector for France, then subtract vector for Italy. Negate, then find closest word vectors in one’s word ‘library’
web page suggests X = $\text{vector}_{Rome}$ though I got $\text{vector}_{Milan}$ (which is reasonable; $\text{vector}_{Rome}$ was 2nd)
king – man = queen – X
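The analogy recipe can be sketched in Python with a tiny hand-made word ‘library’. The 3-number vectors below are invented for illustration (real word vectors are learned and have 100 or more dimensions), and cosine similarity is assumed here as the measure of ‘closest’.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, hand-made so the analogy works out exactly.
library = {
    "France": [1.0, 0.0, 0.0],
    "Paris":  [1.0, 1.0, 0.0],
    "Italy":  [0.0, 0.0, 1.0],
    "Rome":   [0.0, 1.0, 1.0],
    "Milan":  [0.0, 0.4, 1.0],
}

def solve_analogy(a, b, c):
    """Solve vector_a - vector_b = vector_c - X, ranking library words by closeness to X."""
    # Subtract b from a, then subtract c, then negate: X = -(a - b - c).
    x = [-(va - vb - vc) for va, vb, vc in zip(library[a], library[b], library[c])]
    candidates = [word for word in library if word not in (a, b, c)]
    return sorted(candidates, key=lambda word: cosine(library[word], x), reverse=True)

ranked = solve_analogy("France", "Paris", "Italy")
print(ranked)
```

With learned vectors the top match is not guaranteed to be the expected word; a near neighbor can rank first, as the slide's own experiment shows.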
Wrapup of Basics of ANN Training
We differentiate (in the calculus sense) the error with respect to all the free parameters in an ANN with a fixed structure (‘topology’): $\frac{\partial \text{Error}}{\partial W_a}$
–If all else is held constant (‘partial derivatives’), what is the impact of changing weight a?
–Simultaneously move each weight a small amount in the direction that reduces error
–Process example-by-example, many times
Seeks local minimum, ie, where all derivatives = 0