Today’s Topics
– Midterm class mean: 83.5
– HW3 Due Thursday and HW4 Out Thursday
– Turn in Your BN Nannon Player (in Separate, ‘Dummy’ Assignment) Any Time Until a Week from Thursday
– Weight Space (for ANNs)
– Gradient Descent and Local Minima
– Stochastic Gradient Descent
– Backpropagation
– The Need to Train the Biases and a Simple Algebraic Trick
– Perceptron Training Rule and a Worked Example
– Case Analysis of Delta Rule
– Neural ‘Word Vectors’

Back to Prob Reasoning for Two Slides: Base-Rate Fallacy
– Assume Disease A is rare (one in 1 million, say – so picture not to scale)
– Assume the population is 10B = 10^10, so 10^4 people have it
– Assume testForA is 99.99% accurate
– You test positive. What is the prob you have Disease A? Someone (not in cs540) might naively think prob = 0.9999
– People for whom testForA = true: 9999 people that actually have Disease A (99.99% of the 10^4 with it) plus 10^6 people that do NOT have Disease A (0.01% of the ~10^10 without it), so Prob(A | testForA) ≈ 0.01
– This same issue arises in ML when we have many more neg than pos ex’s: false pos overwhelm true pos
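As a check of the slide’s numbers (not on the original slide), Bayes’ rule gives the same ≈ 0.01:

```latex
P(A \mid +) = \frac{P(+ \mid A)\,P(A)}{P(+ \mid A)\,P(A) + P(+ \mid \neg A)\,P(\neg A)}
            = \frac{0.9999 \times 10^{-6}}{0.9999 \times 10^{-6} + 10^{-4} \times (1 - 10^{-6})}
            \approx 0.0099
```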

A Major Weakness of BNs (I also copied this and the prev slide to an earlier lecture, for future cs540’s)
– If there are many ‘hidden’ random vars (N binary vars, say), then the marginalization formula leads to many calls to a BN (2^N in our example; for N = 20, 2^N = 1,048,576)
– Using uniform-random sampling to estimate the result is too inaccurate, since most of the probability might be concentrated in only a few ‘complete world states’
– Hence, much research (beyond cs540’s scope) on scaling up inference in BNs and other graphical models, eg via more sophisticated sampling (eg, MCMC)

WARNING! Some Calculus Ahead

No Calculus Experience? For HWs and the Final …
– Derivatives generalize the idea of SLOPE
– You only need to know how to calc the SLOPE of a line:
    d(mx + b) / dx = m
    // ‘mx + b’ is the algebraic form of a line
    // ‘m’ is the slope
    // ‘b’ is the y intercept (value of y when x = 0)
– Two (distinct) points define a line

Weight Space
– Given a neural-network layout, the weights and biases are free parameters that define a space
– Each point in this weight space specifies a network; weight space is a continuous space we search
– Associated with each point is an error rate, E, over the training data
– Backprop performs gradient descent in weight space

Gradient Descent in Weight Space
[Figure: the total error on the training set plotted as a surface over weights W1 and W2; the gradient ∂E/∂W points uphill from the current wgt settings, and a step of (ΔW1, ΔW2) moves from the current wgt settings to new wgt settings with lower error]

Backprop Seeks LOCAL Minima (in a continuous space)
[Figure: a bumpy curve of error on the train set plotted over weight space, with several local minima]
Note: a local min might over fit the training data, so often ‘early stopping’ used (later)

Local Min are Good Enough for Us!
– ANNs, including Deep Networks, make accurate predictions even though we likely are only finding local min
– The world could have been like this:
[Figure: an error-on-train-set curve over weight space where most min are poor and it is hard to find a good min]
– Note: ensembles of ANNs work well (often find different local minima)

The Gradient-Descent Rule
The ‘gradient’:
    ∇E(w) ≡ [ ∂E/∂w_0, ∂E/∂w_1, ∂E/∂w_2, …, ∂E/∂w_N ]
This is an N+1 dimensional vector (ie, the ‘slope’ in weight space)
Since we want to reduce errors, we want to go ‘down hill’
We’ll take a finite step in weight space:
    Δw = - η ∇E(w)     or     Δw_i = - η ∂E/∂w_i
(‘delta’ Δw = the change to w; η is the step size)
[Figure: the error surface E over weights W1 and W2, with the downhill step shown]
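A minimal sketch of this update rule in Python (not from the slides; the toy error function E(w) = Σ w² and η = 0.1 are assumptions for illustration):

```python
import numpy as np

def gradient_descent_step(w, grad_E, eta=0.1):
    """One step of the rule: delta_w = -eta * gradient of E at w."""
    return w - eta * grad_E(w)

# Toy example: E(w) = sum(w**2) has gradient 2*w and its minimum at w = 0
grad_E = lambda w: 2 * w

w = np.array([3.0, -2.0])           # current point in weight space
for _ in range(50):                 # take 50 finite steps 'down hill'
    w = gradient_descent_step(w, grad_E)
print(w)                            # close to [0, 0]
```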

‘On Line’ vs. ‘Batch’ Backprop
– Technically, we should look at the error gradient for the entire training set before taking a step in weight space (‘batch’ backprop)
– However, in practice we take a step after each example (‘on-line’ backprop)
  – Much faster convergence (learn after each example)
  – Called ‘stochastic’ gradient descent
  – Stochastic gradient descent very popular at Google, etc, due to easy parallelism

‘On Line’ vs. ‘Batch’ BP (continued)
– BATCH – add the Δw vectors for every training example, then ‘move’ in weight space
– ON-LINE – ‘move’ after each example (aka, stochastic gradient descent)
[Figure: two paths on the error surface E over w; BATCH takes the single summed vector Δw_ex1 + Δw_ex2 + Δw_ex3, while ON-LINE takes the steps Δw_ex1, Δw_ex2, Δw_ex3 one after another]
– Note Δw_i,BATCH ≠ Δw_i,ON-LINE, for i > 1
– Final locations in space need not be the same for BATCH and ON-LINE
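A hedged sketch contrasting the two schedules (the linear model, squared error, data, and η below are illustrative assumptions, not from the slides):

```python
import numpy as np

# Toy data: 1-D inputs plus a constant feature; targets satisfy y = 2*x - 1
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])
y = np.array([1.0, 3.0, 5.0])
eta = 0.1

def grad_on_example(w, x, t):
    """Gradient of the per-example error 1/2 * (t - w.x)**2 wrt w."""
    return -(t - w @ x) * x

# BATCH: sum the per-example delta-w vectors, then take ONE step per pass
w_batch = np.zeros(2)
for _ in range(200):
    total_grad = sum(grad_on_example(w_batch, x, t) for x, t in zip(X, y))
    w_batch -= eta * total_grad

# ON-LINE (stochastic): 'move' after EVERY example
w_online = np.zeros(2)
for _ in range(200):
    for x, t in zip(X, y):
        w_online -= eta * grad_on_example(w_online, x, t)

print(w_batch, w_online)   # both approach [2, -1]; the paths through weight space differ
```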

BP Calculations (note: text uses ‘loss’ instead of ‘error’)
Assume one layer of hidden units (std. non-deep topology), with output units i, hidden units j, and input units k:
  1. Error ≡ ½ Σ (Teacher_i – Output_i)²
  2.       = ½ Σ (Teacher_i – F(Σ_j W_i,j × Output_j))²
  3.       = ½ Σ (Teacher_i – F(Σ_j W_i,j × F(Σ_k W_j,k × Output_k)))²
Recall Δw_x,y = - η (∂E / ∂w_x,y)
Determine
  ∂Error / ∂W_i,j   (use equation 2)
  ∂Error / ∂W_j,k   (use equation 3)
See the Sec and Fig in the textbook for the results (I won’t ask you to derive them on the final)
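A minimal one-hidden-layer backprop step under the slide’s setup, assuming sigmoid hidden and output units and the squared error above (an illustrative sketch, not code from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, teacher, W_jk, W_ij, eta=0.5):
    """One on-line backprop step for a one-hidden-layer net with sigmoid units F.

    x: input activations (layer k), teacher: target outputs,
    W_jk: hidden-layer weights (j x k), W_ij: output-layer weights (i x j).
    """
    # Forward pass (equations 2 and 3 on the slide)
    out_j = sigmoid(W_jk @ x)            # hidden-unit outputs
    out_i = sigmoid(W_ij @ out_j)        # network outputs

    # Backward pass: error signals from the chain rule, using F' = out * (1 - out)
    delta_i = (teacher - out_i) * out_i * (1 - out_i)
    delta_j = (W_ij.T @ delta_i) * out_j * (1 - out_j)

    # Gradient-descent updates: delta_w = -eta * dError/dw
    W_ij += eta * np.outer(delta_i, out_j)
    W_jk += eta * np.outer(delta_j, x)
    return W_jk, W_ij
```

Biases are omitted here; the ‘always -1 input’ trick from the Training Biases slides below folds them in as ordinary weights.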

Differentiating the Logistic Function (‘soft’ step-function)
  out_i = 1 / (1 + e^-(Σ_j W_j,i × out_j - Θ_i))
  F ′(wgt’ed in) = out_i × (1 - out_i)
[Figure: F(wgt’ed in) is the S-shaped logistic curve, with value ½ when the wgt’ed input Σ W_j × out_j equals the bias; F ′(wgt’ed in) is bell-shaped, with maximum ¼ at that same point]
Notice that even if totally wrong, no (or very little) change in weights
Note: Differentiating RLUs is easy! (use F ′ = 0 when input = bias)
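For reference (a standard result, writing z for the weighted input with the bias folded in), the derivative on the slide follows from:

```latex
F(z) = \frac{1}{1 + e^{-z}}, \qquad
F'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = F(z)\,\bigl(1 - F(z)\bigr) = \mathrm{out}_i\,(1 - \mathrm{out}_i)
```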

Gradient Descent for the Perceptron (for the simple case of linear output units)
  Error ≡ ½ Σ (T – o)²    // T = teacher’s answer (a constant wrt the weights), o = network’s output
  ∂E / ∂W_a = (T – o) ∂(T – o) / ∂W_a = - (T – o) ∂o / ∂W_a

Continuation of Derivation
Stick in the formula for the output (o = Σ_k w_k × x_k):
  ∂E / ∂W_a = - (T – o) ∂(Σ_k w_k × x_k) / ∂W_a = - (T – o) x_a
Recall ΔW_a ≡ - η ∂E / ∂W_a, so
  ΔW_a = η (T – o) x_a     The Perceptron Rule
Also known as the delta rule (and other names, with some variation in the calc)
We’ll use it for both LINEAR and STEP-FUNCTION activation
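A minimal sketch of the rule ΔW_a = η (T – o) x_a with a step-function output (the AND-gate training data and η = 0.1 below are assumptions for illustration, not from the slides):

```python
import numpy as np

def perceptron_update(w, x, teacher, eta=0.1):
    """One application of the perceptron rule: w_a += eta * (T - o) * x_a."""
    out = 1 if w @ x >= 0 else 0          # step-function activation
    return w + eta * (teacher - out) * x

# Learn a 2-input AND gate; the third input is the constant -1 used for the bias
examples = [(np.array([0.0, 0.0, -1.0]), 0),
            (np.array([0.0, 1.0, -1.0]), 0),
            (np.array([1.0, 0.0, -1.0]), 0),
            (np.array([1.0, 1.0, -1.0]), 1)]

w = np.zeros(3)
for _ in range(25):                       # a few passes over the train set
    for x, t in examples:
        w = perceptron_update(w, x, t)
print(w)                                  # a weight vector (incl. bias) that separates AND
```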

Node Biases
Recall: A node’s output is a weighted function of its inputs and a ‘bias’ term
[Figure: a node with several inputs plus a bias, producing one output]
These biases also need to be learned!

Training Biases (Θ’s)
A node’s output (assume ‘step function’ for simplicity):
  1 if W_1 X_1 + W_2 X_2 + … + W_n X_n ≥ Θ
  0 otherwise
Rewriting:
  W_1 X_1 + W_2 X_2 + … + W_n X_n – Θ ≥ 0
  W_1 X_1 + W_2 X_2 + … + W_n X_n + Θ × (-1) ≥ 0
(so Θ acts as just another weight, on an input whose ‘activation’ is always -1)

Training Biases (cont.)
– Hence, add another input unit whose activation is always -1
– The bias is then just another weight!
[Figure: eg, a node with bias Θ redrawn as the same node with an extra always -1 input whose weight is Θ]
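One way to apply this trick in code (a sketch; the function name and example values are made up for illustration):

```python
import numpy as np

def add_bias_input(x):
    """Append the constant -1 'activation' so the bias Θ becomes just another weight."""
    return np.append(x, -1.0)

x = np.array([3.0, 2.0])
print(add_bias_input(x))      # [ 3.  2. -1.]  -- now learn [W1, W2, Θ] together
```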

Perceptron Example (assume step function and use η = 0.1)
Perceptron Learning Rule: ΔW_a = η (T – o) x_a
[Figure: the train set (columns X1, X2, Correct Output) and a two-input perceptron whose weights include -3 on X2 and 2 on the always -1 bias input]
First example (X1 = 3):
  Out = StepFunction(3 ∙ W_1 + X_2 ∙ (-3) - 1 ∙ 2) = 1
  No wgt changes, since correct

Perceptron Example (assume step function and use η = 0.1)
Perceptron Learning Rule: ΔW_a = η (T – o) x_a
[Figure: the same train set and perceptron as the previous slide]
Second example (X1 = 6, X2 = 1, correct output = 0):
  Out = StepFunction(6 ∙ W_1 + 1 ∙ (-3) - 1 ∙ 2) = 1   // So need to update weights

Perceptron Example (assume step function and use η = 0.1)
Perceptron Learning Rule: ΔW_a = η (T – o) x_a
Out = StepFunction(6 ∙ W_1 + 1 ∙ (-3) - 1 ∙ 2) = 1, but T = 0, so each weight moves by η (T – o) ∙ (its input) = -0.1 ∙ (its input):
  W_1 ← W_1 + 0.1 ∙ (0 – 1) ∙ 6          // decreases by 0.6
  W_2 ← -3 + 0.1 ∙ (0 – 1) ∙ 1 = -3.1
  Θ   ←  2 + 0.1 ∙ (0 – 1) ∙ (-1) = 2.1
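A quick check of this update in code (W1 = 4 is a made-up placeholder, since the actual value is in the slide’s figure; the other numbers come from the slide):

```python
import numpy as np

eta = 0.1
x = np.array([6.0, 1.0, -1.0])     # X1, X2, and the always -1 bias input
w = np.array([4.0, -3.0, 2.0])     # [W1, W2, Θ]; W1 = 4 is a made-up placeholder
teacher = 0.0

out = 1.0 if w @ x >= 0 else 0.0   # step function fires, but the teacher says 0
w += eta * (teacher - out) * x     # ΔW_a = η (T – o) x_a
print(out, w)                      # 1.0 [ 3.4 -3.1  2.1]  (W2 → -3.1, bias weight → 2.1)
```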

Pushing the Weights and Bias in the Correct Direction when Wrong
[Figure: the step-function output vs the wgt’ed sum, with the bias marking the threshold]
Assume TEACHER = 1 and ANN = 0, so some combo of
  (a) wgts on some positively valued inputs too small
  (b) wgts on some negatively valued inputs too large
  (c) ‘bias’ too large
Opposite movement when TEACHER = 0 and ANN = 1

Case Analysis: Assume Teach = 1, out = 0, η = 1    (so ΔW_k = η (T – o) x_k = x_k)
Four cases: pos/neg input × pos/neg weight, plus the two cases for the BIAS weight:

  Input vector:     1        -1        1        -1      ‘-1’ (bias input)
  Weights:          2        -4       -3         5       6 or -6   // the BIAS
  New wgts:        2+1      -4-1     -3+1       5-1      6-1 or -6-1
                  bigger   smaller   bigger   smaller    smaller
  Old vs new
  input × wgt:    2 vs 3    4 vs 5   -3 vs -2  -5 vs -4

So the weighted sum will be LARGER (-2 vs 2), and the BIAS will be SMALLER
Note: ‘bigger’ means closer to +infinity and ‘smaller’ means closer to -infinity
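The slide’s arithmetic, checked in a short script (all numbers come from the slide; the bias column uses the +6 case):

```python
import numpy as np

eta, teacher, out = 1.0, 1.0, 0.0
x = np.array([1.0, -1.0, 1.0, -1.0, -1.0])   # last entry is the always -1 bias input
w = np.array([2.0, -4.0, -3.0, 5.0, 6.0])    # last entry is the bias weight (the +6 case)

w_new = w + eta * (teacher - out) * x        # ΔW_k = η (T – o) x_k
print(w_new)                                 # [ 3. -5. -2.  4.  5.]
print((x[:4] * w[:4]).sum(), (x[:4] * w_new[:4]).sum())   # weighted sum (excl. bias): -2.0 → 2.0
```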

Neural Word Vectors – Current Hot Topic
– Distributional Hypothesis: words can be characterized by the words that appear nearby in a large text corpus (matrix algebra also used for this task, eg singular-value decomposition, SVD)
– [Figure: Two Possible Designs (CBOW = Continuous Bag of Words) for learning to predict words from nearby words]
– Initially assign each word a k-long vector of random #’s in [0,1] or [-1,1] (k is something like 100 or 300) – as opposed to the traditional ‘1-of-N’ encoding
– Recall 1-of-N: aardvark = 1,0,0,…,0 and zzzz = 0,0,0,…,1   // N is 50,000 or more! And nothing ‘shared’ by related words
– Compute ∂Error / ∂Input_i to change the input vector(s) – ie, find good word vectors so it is easy to learn to predict the I/O pairs in the fig above

Neural Word Vectors
Surprisingly, one can do ‘simple algebra’ with these word vectors!
  vector_France – vector_Paris = vector_Italy – X
  king – man = queen – X
Subtract the vector for Paris from the vector for France, then subtract the vector for Italy. Negate, then find the closest word vectors in one’s word ‘library’
The web page suggests X = vector_Rome, though I got vector_Milan (which is reasonable; vector_Rome was 2nd)
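A toy illustration of this algebra (the 3-D vectors and tiny ‘library’ below are made up for the example; real word vectors have k ≈ 100-300 dimensions and come from training):

```python
import numpy as np

# Made-up toy vectors, NOT real learned word vectors
library = {
    "France": np.array([0.9, 0.1, 0.3]),
    "Paris":  np.array([0.8, 0.1, 0.9]),
    "Italy":  np.array([0.2, 0.8, 0.3]),
    "Rome":   np.array([0.1, 0.8, 0.9]),
    "Milan":  np.array([0.2, 0.7, 0.8]),
}

# Solve  France - Paris = Italy - X  for X, ie  X = Italy - France + Paris
target = library["Italy"] - library["France"] + library["Paris"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

best = max((word for word in library if word not in ("Italy", "France", "Paris")),
           key=lambda word: cosine(library[word], target))
print(best)    # 'Rome' with these toy numbers
```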

Wrapup of Basics of ANN Training
– We differentiate (in the calculus sense) all the free parameters in an ANN with a fixed structure (‘topology’)
  – If all else is held constant (‘partial derivatives’), what is the impact of changing weight a? Ie, compute ∂Error / ∂W_a
  – Simultaneously move each weight a small amount in the direction that reduces error
  – Process example-by-example, many times
– Seeks a local minimum, ie, where all derivatives = 0