cs540 - Fall 2016 (Shavlik©), Lecture 18, Week 10


Today's Topics
- Weight Space (for ANNs)
- Gradient Descent and Local Minima
- Stochastic Gradient Descent
- Backpropagation
- The Need to Train the Biases and a Simple Algebraic Trick
- Perceptron Training Rule and a Worked Example
- Case Analysis of the Delta Rule

WARNING! Some Calculus Ahead

No Calculus Experience?
Derivatives generalize the idea of SLOPE
How to calc the SLOPE of a line:
  d(mx + b) / dx = m    // 'mx + b' is the algebraic form of a line
                        // 'm' is the slope
                        // 'b' is the y intercept (value of y when x = 0)
Two (distinct) points define a line
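For example, the line through the (distinct) points (1, 3) and (4, 9) has slope m = (9 - 3) / (4 - 1) = 2, so d(2x + 1)/dx = 2; the derivative is just that slope, and for curves it gives the slope at each individual point.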

Weight Space
- Given a neural-network layout, the weights and biases are free parameters that define a space
- Each point in this Weight Space specifies a network; weight space is a continuous space we search
- Associated with each point is an error rate, E, over the training data
- Backprop performs gradient descent in weight space

Gradient Descent in Weight Space
[Figure: the total error on the training set plotted as a surface over weights W1 and W2; from the current wgt settings we follow -∂E/∂W, moving by ΔW1 and ΔW2 to the new wgt settings.]

Backprop Seeks LOCAL Minima (in a continuous space)
[Figure: error on the train set plotted over weight space, with several local minima.]
Note: a local min might overfit the training data, so 'early stopping' is often used (later)

Local Min are Good Enough for Us!
- ANNs, including Deep Networks, make accurate predictions even though we likely are only finding local min
- The world could have been like this (ie, most min poor, hard to find a good min):
[Figure: a hypothetical error-on-train-set curve over weight space in which most minima are poor.]
- Note: ensembles of ANNs work well (often find different local minima)

The Gradient-Descent Rule
The 'gradient':
  ∇E(w) ≡ [ ∂E/∂w0, ∂E/∂w1, ∂E/∂w2, … , ∂E/∂wN ]
This is an N+1 dimensional vector (ie, the 'slope' in weight space)
Since we want to reduce errors, we want to go 'down hill'
We'll take a finite step in weight space:
  Δw = - η ∇E(w)    or    Δwi = - η ∂E/∂wi
('Δ' = the change to w; η is the step size, ie, the learning rate)
[Figure: a small step Δw taken in the W1-W2 plane of weight space.]
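As a minimal sketch (mine, not from the slides), one such step in Python; error_gradient is a hypothetical function that returns ∂E/∂wi for every weight:

import numpy as np

def gradient_descent_step(weights, error_gradient, eta=0.1):
    # Take one finite step 'down hill' in weight space: Δw = -η ∇E(w)
    grad = error_gradient(weights)   # the N+1 dimensional gradient vector
    return weights - eta * grad      # move each weight against its slope

# Example: if E(w) = w0² + w1², then ∇E(w) = [2 w0, 2 w1]
w = np.array([1.0, -2.0])
w = gradient_descent_step(w, lambda w: 2 * w)   # w is now [0.8, -1.6], and E(w) went down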

'On Line' vs. 'Batch' Backprop
- Technically, we should look at the error gradient for the entire training set before taking a step in weight space ('batch' backprop)
- However, in practice we take a step after each example ('on-line' backprop)
  - Much faster convergence (learn after each example)
  - Called 'stochastic' gradient descent
- Stochastic gradient descent is quite popular at Google, Facebook, Microsoft, etc due to easy parallelism

'On Line' vs. 'Batch' BP (continued)
- BATCH – add the Δw vectors for every training example, then 'move' in weight space
- ON-LINE – 'move' after each example (aka, stochastic gradient descent)
[Figure: in weight space, the BATCH step is the single vector sum of Δwex1, Δwex2, and Δwex3, while ON-LINE takes those three steps one after another.]
* Note Δwi,BATCH ≠ Δwi,ON-LINE, for i > 1
* Final locations in weight space need not be the same for BATCH and ON-LINE
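A schematic contrast in Python (a sketch under my own naming; per_example_gradient is a hypothetical function returning ∂E/∂w for a single (x, y) pair, and weights is a numpy array):

def batch_step(weights, examples, per_example_gradient, eta=0.1):
    # BATCH: add up the gradient over every training example, then 'move' once
    total = sum(per_example_gradient(weights, x, y) for x, y in examples)
    return weights - eta * total

def online_epoch(weights, examples, per_example_gradient, eta=0.1):
    # ON-LINE (stochastic): 'move' immediately after each example
    for x, y in examples:
        weights = weights - eta * per_example_gradient(weights, x, y)
    return weights

Run on the same examples, the two versions generally end up at different points in weight space, matching the note above.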

BP Calculations (note: text uses 'loss' instead of 'error')
Assume one layer of hidden units (std. non-deep topology)
[Figure: a network with input layer k, hidden layer j, and output layer i.]
(1) Error ≡ ½ Σi ( Teacheri – Outputi )²
(2)       = ½ Σi ( Teacheri – F( Σj Wi,j × Outputj ) )²
(3)       = ½ Σi ( Teacheri – F( Σj Wi,j × F( Σk Wj,k × Outputk ) ) )²
Determine ∂Error/∂Wi,j (use equation 2) and ∂Error/∂Wj,k (use equation 3)
Recall Δwx,y = - η (∂E / ∂wx,y)
See text's Section 18.7.4 for the specific calculations
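To make the chain of derivatives concrete, here is a sketch of one on-line backprop step in Python (my illustration with sigmoid units and the biases omitted, not the exact derivation of Section 18.7.4; see the '-1' trick below for biases):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_one_example(x, teacher, W_jk, W_ij, eta=0.1):
    # Net with one hidden layer: inputs (layer k) -> hidden (layer j) -> outputs (layer i)
    out_j = sigmoid(W_jk @ x)            # hidden-unit outputs
    out_i = sigmoid(W_ij @ out_j)        # network outputs
    # Error = ½ Σ (Teacher - Output)²; push its derivative back through the net
    delta_i = (teacher - out_i) * out_i * (1 - out_i)    # output-layer error signal
    delta_j = (W_ij.T @ delta_i) * out_j * (1 - out_j)   # hidden-layer error signal
    # ΔW = -η ∂E/∂W for every weight
    W_ij = W_ij + eta * np.outer(delta_i, out_j)
    W_jk = W_jk + eta * np.outer(delta_j, x)
    return W_jk, W_ij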

Differentiating the Logistic Function ('soft' step-function)
  outi = 1 / (1 + e^-(Σj wj,i × outj - Θi))
  F'(wgt'ed in) = outi ( 1 - outi )
[Figure: F(wgt'ed in) rises from 0 toward 1, crossing 1/2 at the threshold; F'(wgt'ed in) peaks there at 1/4 and approaches 0 in both tails.]
Notice that even if totally wrong, no (or very little) change in weights
Note: Differentiating RLUs is easy! (use F' = 0 when input = bias)
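A quick numeric check of these two formulas in Python (my sketch):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_derivative(z):
    out = logistic(z)
    return out * (1.0 - out)                       # F'(z) = out (1 - out)

print(logistic(0.0), logistic_derivative(0.0))     # 0.5 and the maximum slope, 0.25
print(logistic_derivative(10.0))                   # ~0.00005: a saturated unit, so (almost) no weight change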

Gradient Descent for the Perceptron (for the simple case of linear output units)
  Error ≡ ½ Σ ( T – o )²      // T = teacher's answer (a constant wrt the weights), o = network's output
  ∂E/∂Wk = (T – o) × ∂(T – o)/∂Wk = - (T – o) ∂o/∂Wk

Continuation of Derivation
Stick in formula for output:
  ∂E/∂Wk = - (T – o) × ∂( Σj wj xj )/∂Wk = - (T – o) xk
Recall ΔWk ≡ - η ∂E/∂Wk
So ΔWk = η (T – o) xk      // The Perceptron Rule
Also known as the delta rule and other names (with some variation in the calculation)
We'll use it for both LINEAR and STEP-FUNCTION activation
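A minimal sketch of this rule in Python (mine; here x already includes the constant '-1' input for the bias, a trick shown a few slides below):

import numpy as np

def perceptron_update(weights, x, teacher, eta=0.1, step_function=True):
    # One delta-rule update: ΔWk = η (T - o) xk
    net = np.dot(weights, x)
    o = (1 if net >= 0 else 0) if step_function else net   # STEP-FUNCTION or LINEAR activation
    return weights + eta * (teacher - o) * x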

Node Biases
Recall: a node's output is a weighted function of its inputs and a 'bias' term
[Figure: a node with several inputs plus a bias feeding one output.]
These biases also need to be learned!

Training Biases ( Θ's )
A node's output (assume 'step function' for simplicity):
  1 if W1 X1 + W2 X2 + … + Wn Xn ≥ Θ
  0 otherwise
Rewriting:
  W1 X1 + W2 X2 + … + Wn Xn – Θ ≥ 0
  W1 X1 + W2 X2 + … + Wn Xn + Θ × (-1) ≥ 0     // Θ plays the role of a weight whose 'activation' is -1

Training Biases (cont.)
Hence, add another unit whose activation is always -1
The bias is then just another weight!
[Figure: eg, a node with internal bias Θ redrawn as the same node with an extra, always -1 input whose weight is Θ.]
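In code the trick is just one extra input column (a sketch assuming numpy arrays and my own function name):

import numpy as np

def add_bias_input(X):
    # Append a constant '-1' input to every example, so the bias Θ becomes
    # the last weight in the weight vector and is trained like any other weight
    return np.hstack([X, -np.ones((X.shape[0], 1))])

X = np.array([[3.0, -2.0], [6.0, 1.0]])   # the two examples worked on the next slides
print(add_bias_input(X))                  # [[ 3. -2. -1.]  [ 6.  1. -1.]]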

Perceptron Example (assume step function and use η = 0.1)

Train Set                      Perceptron Learning Rule
X1   X2   Correct Output       ΔWk = η (T – o) xk
 3   -2        1
 6    1        0
 5   -3

[Figure: a perceptron with weight 1 on X1, weight -3 on X2, and weight 2 on the constant -1 (bias) input.]

First example:  Out = StepFunction(3 × 1 - 2 × (-3) - 1 × 2) = StepFunction(7) = 1
No wgt changes, since correct

Perceptron Example, continued (same train set and network, η = 0.1)

Second example:  Out = StepFunction(6 × 1 + 1 × (-3) - 1 × 2) = StepFunction(1) = 1
// Correct output is 0, so need to update weights

Applying ΔWk = η (T – o) xk with T = 0 and o = 1 gives the new weights:
  new wgt on X1:          1 - 0.1 × 6    = 0.4
  new wgt on X2:         -3 - 0.1 × 1    = -3.1
  new wgt on bias input:  2 - 0.1 × (-1) = 2.1
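The same worked example as a self-contained Python check (my sketch; inputs are ordered X1, X2, then the constant -1):

import numpy as np

eta = 0.1
w = np.array([1.0, -3.0, 2.0])                    # wgts on X1, X2, and the -1 (bias) input

examples = [(np.array([3.0, -2.0, -1.0]), 1),     # first example: already correct, no change
            (np.array([6.0,  1.0, -1.0]), 0)]     # second example: out = 1 but teacher = 0

for x, teacher in examples:
    out = 1 if np.dot(w, x) >= 0 else 0           # step-function activation
    w = w + eta * (teacher - out) * x             # ΔWk = η (T - o) xk

print(w)                                          # [ 0.4 -3.1  2.1 ]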

Pushing the Weights and Bias in the Correct Direction when Wrong
Assume TEACHER = 1 and ANN = 0, so some combo of
  (a) wgts on some positively valued inputs too small
  (b) wgts on some negatively valued inputs too large
  (c) 'bias' too large
[Figure: the step-function output vs the wgt'ed sum, with the bias as the threshold.]
Opposite movement when TEACHER = 0 and ANN = 1

Case Analysis: Assume Teach = 1, out = 0, η = 1
ΔWk = η (T – o) xk
Note: 'bigger' means closer to +infinity and 'smaller' means closer to -infinity
Four cases (pos/neg input × pos/neg weight), plus the two cases for the BIAS:

                      Four Cases                          Cases for the BIAS
Input vector:     1,      -1,      1,      -1             '-1'       '-1'
Weights:          2,      -4,     -3,       5              6    or    -6
New wgts:        2+1,    -4-1,   -3+1,     5-1            6-1        -6-1
                 bigger  smaller  bigger  smaller         smaller    smaller
Old vs new
input × wgt:     2 vs 3  4 vs 5  -3 vs -2  -5 vs -4

So the weighted sum will be LARGER (-2 vs 2)
And the BIAS will be SMALLER
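A quick numeric check of this slide in Python (my sketch, showing the '+6' bias case):

import numpy as np

eta, teacher, out = 1, 1, 0
x = np.array([1, -1, 1, -1])                 # the four input cases
w = np.array([2, -4, -3, 5])                 # their weights
bias_wgt, bias_input = 6, -1                 # the BIAS as a weight on a constant -1 input

new_w = w + eta * (teacher - out) * x                       # delta rule: each wgt moves by +xk
new_bias = bias_wgt + eta * (teacher - out) * bias_input

print(np.dot(w, x), np.dot(new_w, x))        # -2 then 2: the weighted sum gets LARGER
print(bias_wgt, new_bias)                    # 6 then 5: the BIAS gets SMALLER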

HW4
- Train a perceptron on WINE (a minimal sketch of the core training loop follows below)
- For N Boolean-valued features, have N+1 input units (+1 due to the "-1" input) and 1 output unit
- Learned model rep'ed by a vector of N+1 doubles
- Your code should handle any dataset that meets the HW0 spec
- (Maybe also train an ANN with 100 HUs - but not required)
- Employ 'early stopping' (later)
- Compare perceptron testset results to random forests
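Below is a minimal sketch of that training loop in Python (my illustration only; reading WINE in the HW0 file format, the tuning-set bookkeeping for early stopping, and the random-forest comparison are not shown):

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    # X: examples over N Boolean (0/1) features; y: 0/1 labels
    # The learned model is a vector of N+1 doubles; the last entry is the bias weight
    X = np.hstack([X, -np.ones((X.shape[0], 1))])   # append the constant -1 input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                         # stop early based on a tune set ('later')
        for x, teacher in zip(X, y):
            out = 1 if np.dot(w, x) >= 0 else 0
            w += eta * (teacher - out) * x          # perceptron training rule
    return w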

Wrapup of Basics of ANN Training
- We differentiate (in the calculus sense) all the free parameters in an ANN with a fixed structure ('topology')
- If all else is held constant ('partial derivatives'), what is the impact of changing weightk, ie, what is ∂Error/∂Wk?
- Simultaneously move each weight a small amount in the direction that reduces error
- Process example-by-example, many times
- Seeks a local minimum, ie, where all derivatives = 0