2806 Neural Computation: Multilayer Neural Networks, Lecture 4, 2005, Ari Visa.

Agenda
- Some historical notes
- Multilayer neural networks
- Backpropagation
- Error surfaces and feature mapping
- Speed-ups
- Conclusions

Some historical notes
- Rosenblatt's perceptron (1958): a single neuron with adjustable synaptic weights and bias.
- Perceptron convergence theorem (Rosenblatt, 1962): if the patterns used to train the perceptron are drawn from two linearly separable classes, the algorithm converges and positions the decision surface as a hyperplane between the two classes.
- The limitations of networks implementing linear discriminants were well known in the 1950s and 1960s.

Some historical notes
- A single neuron -> the class of solutions that can be obtained is not general enough -> multilayer perceptron.
- Widrow & Hoff (1960): the least-mean-square (LMS) algorithm, i.e. the delta rule; three-layer networks designed by hand (fixed input-to-hidden weights, trained hidden-to-output weights).
- The development of backpropagation: Kalman filtering (Kalman 1960; Bryson, Denham, Dreyfus 1963, 1969).

Some historical notes
- Werbos: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Harvard University, 1974.
- Parker, 1982, 1985.
- Rumelhart, Hinton, Williams: Learning representations by back-propagating errors, Nature 323, pp. 533-536, 1986.

Multilayer Neural Networks
- Input layer
- Hidden layer
- Output layer
- Bias unit = neuron
- net_j = w_j^T x
- y_j = f(net_j)
- f(net) = sgn(net)
- z_k = f(net_k)
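To make the notation concrete, here is a minimal sketch (not from the original slides) of the forward pass through such a three-layer network; the weight-matrix shapes, the bias handling, and the use of tanh as a smooth stand-in for sgn are illustrative assumptions.

```python
import numpy as np

def forward(x, W_hidden, W_output, f=np.tanh):
    """Forward pass of a three-layer (input-hidden-output) network.

    x: input vector of shape (d,)
    W_hidden: hidden-layer weights of shape (n_H, d + 1), last column = bias
    W_output: output-layer weights of shape (c, n_H + 1), last column = bias
    f: the unit nonlinearity (tanh here as a smooth stand-in for sgn)
    """
    x_aug = np.append(x, 1.0)      # append the constant bias input
    net_j = W_hidden @ x_aug       # net_j = w_j^T x
    y = f(net_j)                   # hidden-unit outputs y_j = f(net_j)
    y_aug = np.append(y, 1.0)
    net_k = W_output @ y_aug
    z = f(net_k)                   # output-unit outputs z_k = f(net_k)
    return y, z
```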

Multilayer Neural Networks
- The XOR problem: x_1 XOR x_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2)
- y_1: decision boundary x_1 + x_2 + 0.5 = 0; if x_1 + x_2 + 0.5 > 0 then y_1 = 1, otherwise -1
- y_2: decision boundary x_1 + x_2 - 1.5 = 0; if x_1 + x_2 - 1.5 > 0 then y_2 = 1, otherwise -1
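A small check (mine, not from the slides) that such a network computes XOR on inputs in {-1, +1}: y_1 acts as OR, y_2 as AND, and the output unit implements y_1 AND NOT y_2. The output-unit weights (1, -1) and bias -0.5 are an illustrative choice.

```python
import numpy as np

def sgn(net):
    return np.where(net > 0, 1, -1)

def xor_net(x1, x2):
    y1 = sgn(x1 + x2 + 0.5)   # fires unless both inputs are -1 (acts as OR)
    y2 = sgn(x1 + x2 - 1.5)   # fires only if both inputs are +1 (acts as AND)
    # Output unit: z = y1 AND NOT y2; weights (1, -1) and bias -0.5 are
    # an illustrative choice, not taken from the slides.
    return sgn(1.0 * y1 - 1.0 * y2 - 0.5)

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # +1 exactly when x1 != x2
```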

Multilayer Neural Networks
- Any desired continuous function can be implemented by a three-layer network, given a sufficient number of hidden units, proper nonlinearities, and weights (Kolmogorov).
- Feedforward operation
- Learning
- (Demo, Chapter 11)

Multilayer Neural Networks
- Nonlinear multilayer networks
- How to set the weights of a three-layer neural network based on training patterns and the desired outputs?

Backpropagation Algorithm
- The LMS algorithm exists for linear systems.
- Training error
- Learning rule

Backpropagation Algorithm
- Learning rule (sketched below)
- Hidden-to-output weights
- Input-to-hidden weights
- Note that the weights are initialized with random values.
- Demo, Chapter 11
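The update equations themselves appeared as figures on the original slides; the following is a minimal sketch of the standard backpropagation step for a three-layer network with squared-error criterion J(w) = ½ Σ_k (t_k - z_k)², assuming tanh units and the weight layout used in the earlier forward-pass sketch.

```python
import numpy as np

def backprop_step(x, t, W_hidden, W_output, eta=0.1):
    """One stochastic backpropagation step for a three-layer network.

    Squared-error criterion J(w) = 0.5 * sum_k (t_k - z_k)^2, tanh units.
    Returns the updated weight matrices.
    """
    f = np.tanh
    df = lambda net: 1.0 - np.tanh(net) ** 2

    # Forward pass (bias units appended as a constant 1 input)
    x_aug = np.append(x, 1.0)
    net_j = W_hidden @ x_aug
    y = f(net_j)
    y_aug = np.append(y, 1.0)
    net_k = W_output @ y_aug
    z = f(net_k)

    # Sensitivities: delta_k = (t_k - z_k) f'(net_k),
    #                delta_j = f'(net_j) * sum_k w_kj delta_k
    delta_k = (t - z) * df(net_k)
    delta_j = df(net_j) * (W_output[:, :-1].T @ delta_k)

    # Hidden-to-output update: dw_kj = eta * delta_k * y_j
    W_output = W_output + eta * np.outer(delta_k, y_aug)
    # Input-to-hidden update:  dw_ji = eta * delta_j * x_i
    W_hidden = W_hidden + eta * np.outer(delta_j, x_aug)

    return W_hidden, W_output
```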

Backpropagation Algorithm
- Compare with LMS algorithms.
- 1) Method of Steepest Descent
- The direction of steepest descent is opposite to the gradient vector g = ∇E(w).
- w(n+1) = w(n) - η g(n)
- η is the step size or learning-rate parameter.

Backpropagation Algorithm
- Training set: a set of patterns with known labels.
- Stochastic training: patterns are chosen randomly from the training set (see the sketch below).
- Batch training: all patterns are presented to the network before learning takes place.
- On-line protocol: each pattern is presented once and only once; no memory is used.
- Epoch: a single presentation of all patterns in the training set. The number of epochs is an indication of the relative amount of learning.
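A minimal sketch (not from the slides) contrasting a stochastic epoch with a batch epoch; grad is a placeholder callable returning the error gradient for one pattern.

```python
import numpy as np

def stochastic_epoch(patterns, labels, w, grad, eta=0.1):
    """One epoch of stochastic training: the weights are updated after
    every randomly chosen pattern."""
    for i in np.random.permutation(len(patterns)):
        w = w - eta * grad(w, patterns[i], labels[i])
    return w

def batch_epoch(patterns, labels, w, grad, eta=0.1):
    """One epoch of batch training: the gradient is accumulated over all
    patterns, then a single weight update is made."""
    g = sum(grad(w, x, t) for x, t in zip(patterns, labels))
    return w - eta * g
```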

Backpropagation Algorithm
- Stopping criterion

Backpropagation Algorithm
- Learning set, validation set, test set
- Stopping criterion
- Learning curve: the average error per pattern
- Cross-validation

Error Surfaces and Feature Mapping
- Note: error backpropagation is based on gradient descent in a criterion function J(w).

Error Surfaces and Feature Mapping
- The total training error is minimized.
- It usually decreases monotonically, even though this is not the case for the error on each individual pattern.

Error Surfaces and Feature Mapping
- Hidden-to-output weights ~ a linear discriminant
- Input-to-hidden weights ~ a "matched filter"

Practical Techniques for Improving Backpropagation
- How to improve convergence, performance, and results?
- Neuron: the sigmoid activation function should be centered at zero and antisymmetric.

Practical Techniques for Improving Backpropagation
- Scaling input variables: the input patterns should be shifted so that the average of each feature over the training set is zero, and the full data set should be scaled so that each feature component has the same variance (see the sketch below).
- Note: this standardization can only be done for stochastic and batch learning!
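A minimal sketch of this standardization, assuming unit variance as the common scale; the function name and the guard for constant features are mine.

```python
import numpy as np

def standardize(X):
    """Shift each feature to zero mean and scale it to unit variance.

    X: training data of shape (n_patterns, n_features). Returns the scaled
    data together with (mean, std) so the same transform can be applied to
    validation and test patterns.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0.0] = 1.0          # guard against constant features
    return (X - mean) / std, (mean, std)
```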

Practical Techniques for Improving Backpropagation
- When the training set is small, one can generate surrogate training patterns.
- In the absence of problem-specific information, the surrogate patterns should be made by adding d-dimensional Gaussian noise to the true training points; the category label should be left unchanged.
- If we know the source of variation among patterns, we can manufacture training data accordingly.
- The number of hidden units should be less than the total number of training points n, say roughly n/10.
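A minimal sketch (not from the slides) of surrogate-pattern generation by adding d-dimensional Gaussian noise while leaving the labels unchanged; the noise scale sigma and the number of copies are illustrative choices.

```python
import numpy as np

def surrogate_patterns(X, labels, n_copies=5, sigma=0.1, rng=np.random.default_rng(0)):
    """Augment a small training set with noisy copies of the true patterns.

    X: true training points of shape (n, d); labels: their category labels.
    Each point is replicated n_copies times with d-dimensional Gaussian
    noise added; the category labels are left unchanged.
    """
    X_rep = np.repeat(X, n_copies, axis=0)
    noise = rng.normal(scale=sigma, size=X_rep.shape)
    return X_rep + noise, np.repeat(labels, n_copies)
```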

Practical Techniques for Improving Backpropagation
- We cannot initialize the weights to 0.
- Initializing weights for uniform learning: choose weights randomly from a single distribution (see the sketch below).
- Input-to-hidden weights: -1/√d < w_ji < +1/√d, where d is the number of input units.
- Hidden-to-output weights: -1/√n_H < w_kj < +1/√n_H, where n_H is the number of hidden units.
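A minimal sketch of this initialization; drawing the bias weights from the same ranges is an assumption not stated on the slide.

```python
import numpy as np

def init_weights(d, n_H, c, rng=np.random.default_rng(0)):
    """Uniform random initialization in the ranges given above.

    d: number of input units, n_H: number of hidden units, c: number of
    output units. A trailing column holds the bias weights, drawn here
    from the same range (an assumption).
    """
    W_hidden = rng.uniform(-1.0 / np.sqrt(d), 1.0 / np.sqrt(d), size=(n_H, d + 1))
    W_output = rng.uniform(-1.0 / np.sqrt(n_H), 1.0 / np.sqrt(n_H), size=(c, n_H + 1))
    return W_hidden, W_output
```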

Practical Techniques for Improving Backpropagation
- Learning rates
- Demo, Chapter 9
- The optimal rate

Practical Techniques for Improving Backpropagation
- Momentum: allows the network to learn more quickly when plateaus in the error surface exist. Demo, Chapter 12. Typically α ≈ 0.9.
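A minimal sketch of the momentum update in its usual form Δw(n) = -η g(n) + α Δw(n-1); the slide itself only gives the typical value α ≈ 0.9.

```python
def momentum_step(w, grad_w, prev_delta, eta=0.1, alpha=0.9):
    """One gradient step with momentum: a fraction of the previous weight
    change is re-applied, which smooths the trajectory across plateaus."""
    delta = -eta * grad_w + alpha * prev_delta
    return w + delta, delta
```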

Practical Techniques for Improving Backpropagation
- Weight decay, to avoid overfitting: start with a network with too many weights and decay all the weights during training, w_new = w_old (1 - ε), where 0 < ε < 1 (sketched below).
- The weights that are not needed for reducing the error function become smaller and smaller and are effectively eliminated.
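A minimal sketch of the decay step; the value of ε (eps in the code) is an illustrative choice.

```python
def decay_weights(W_hidden, W_output, eps=1e-4):
    """Apply one round of weight decay after a training step:
    w_new = w_old * (1 - eps), with 0 < eps < 1."""
    return W_hidden * (1.0 - eps), W_output * (1.0 - eps)
```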

Practical Techniques for Improving Backpropagation
- Suppose we have insufficient training data for the desired classification accuracy.
- Learning with hints: add output units for addressing an ancillary problem, one different from but related to the specific classification problem at hand.

Practical Techniques for Improving Backpropagation
- Stopped training
- Number of hidden layers: typically 2-3.
- Criterion function: cross-entropy, Minkowski error

Practical Techniques for Improving Backpropagation
- Second-order methods to speed up learning:
- Newton's method (demo, Chapter 9)
- Quickprop
- Conjugate gradient descent (requires batch training; demo, Chapter 9)

Practical Techniques for Improving Backpropagation
- 2) Newton's method
- The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n), using a second-order Taylor series expansion of the cost function around w(n):
- ΔE(w(n)) ≈ g^T(n) Δw(n) + ½ Δw^T(n) H(n) Δw(n)
- g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian matrix of E(w) (second derivatives), H = ∇²E(w).

Practical Techniques for Improving Backpropagation
- H = ∇²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w.
- Differentiating ΔE(w(n)) ≈ g^T(n) Δw(n) + ½ Δw^T(n) H(n) Δw(n) with respect to Δw, the change ΔE(n) is minimized when g(n) + H(n) Δw(n) = 0 -> Δw(n) = -H^(-1)(n) g(n).
- w(n+1) = w(n) + Δw(n) = w(n) - H^(-1)(n) g(n), where H^(-1)(n) is the inverse of the Hessian of E(w).
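A minimal sketch of the Newton step w(n+1) = w(n) - H^(-1)(n) g(n); grad and hess are placeholder callables, and solving the linear system stands in for forming the inverse explicitly.

```python
import numpy as np

def newton_step(w, grad, hess):
    """One Newton step: w(n+1) = w(n) - H^-1(n) g(n).

    grad(w) returns the gradient vector g(n); hess(w) returns the Hessian
    matrix H(n), assumed positive definite.
    """
    g = grad(w)
    H = hess(w)
    return w - np.linalg.solve(H, g)
```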

Practical Techniques for Improving Backpropagation
- Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior of steepest descent.
- Newton's method requires the Hessian H(n) to be a positive definite matrix for all n!

Practical Techniques for Improving Backpropagation
- 3) Gauss-Newton method
- It is applicable to a cost function that is expressed as a sum of error squares: E(w) = ½ Σ_{i=1..n} e²(i). Note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n.
- The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing e'(i, w) = e(i) + [∂e(i)/∂w]^T |_{w=w(n)} (w - w(n)), i = 1, 2, ..., n.

Practical Techniques for Improving Backpropagation
- In matrix form, e'(n, w) = e(n) + J(n)(w - w(n)), where e(n) is the error vector e(n) = [e(1), e(2), ..., e(n)]^T and J(n) is the n-by-m Jacobian matrix of e(n). (The Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n), where ∇e(n) = [∇e(1), ∇e(2), ..., ∇e(n)].)
- w(n+1) = arg min_w {½ ||e'(n, w)||²}, where ½ ||e'(n, w)||² = ½ ||e(n)||² + e^T(n) J(n)(w - w(n)) + ½ (w - w(n))^T J^T(n) J(n) (w - w(n)).
- Differentiate this expression with respect to w and set the result equal to zero.

Practical Techniques for Improving Backpropagation
- J^T(n) e(n) + J^T(n) J(n)(w - w(n)) = 0 -> w(n+1) = w(n) - (J^T(n) J(n))^(-1) J^T(n) e(n)
- The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n).
- For the Gauss-Newton iteration to be computable, the matrix product J^T(n) J(n) must be nonsingular. J^T(n) J(n) is always nonnegative definite, but to ensure that it is nonsingular, the Jacobian J(n) must have row rank n. -> Add the diagonal matrix δI to the matrix J^T(n) J(n), where the parameter δ is a small positive constant.

Practical Techniques for Improving Backpropagation
- J^T(n) J(n) + δI is then positive definite for all n.
- -> The Gauss-Newton method is implemented in the following form (sketched below): w(n+1) = w(n) - (J^T(n) J(n) + δI)^(-1) J^T(n) e(n).
- This is the solution to the modified cost function E(w) = ½ {δ ||w - w(0)||² + Σ_{i=1..n} e²(i)}, where w(0) is the initial value of w.
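A minimal sketch of the regularized Gauss-Newton step; errors and jacobian are placeholder callables, and the value of δ (delta in the code) is an illustrative choice.

```python
import numpy as np

def gauss_newton_step(w, errors, jacobian, delta=1e-3):
    """One regularized Gauss-Newton step:
    w(n+1) = w(n) - (J^T J + delta * I)^-1 J^T e.

    errors(w) returns the error vector e(n) of shape (n,); jacobian(w)
    returns the n-by-m Jacobian J(n) of e with respect to w. The small
    positive constant delta keeps the linear system nonsingular.
    """
    e = errors(w)
    J = jacobian(w)
    A = J.T @ J + delta * np.eye(J.shape[1])
    return w - np.linalg.solve(A, J.T @ e)
```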

Regularization, Complexity Adjustment and Pruning
- If we have too many weights and train too long -> overfitting.
- Wald statistic: we can estimate the importance of a parameter in a model.
- Optimal Brain Damage, Optimal Brain Surgeon: eliminate the parameter having the least importance.

Summary
- Multilayer nonlinear neural networks trained by gradient descent methods such as backpropagation perform maximum-likelihood estimation of the weight values in the model defined by the network topology.
- The nonlinearity f(net) at the hidden units allows the networks to form an arbitrary decision boundary, so long as there are sufficiently many hidden units.