Multilayer Perceptrons

Multilayer Perceptrons
CS/CMPE 537 – Neural Networks (Sp 2006-2007) – Asim Karim @ LUMS

Introduction
- Multilayer perceptrons (MLPs) are multilayer feedforward networks with continuously differentiable nonlinear activation functions.
- MLPs can be considered an extension of the simple perceptron to multiple layers of neurons (hence the name multilayer perceptron).
- MLPs are trained with the popular supervised learning algorithm known as the error back-propagation (BP) algorithm.
- Training an MLP involves both forward and backward information flow (hence the name back-propagation).

Historical Note
- The idea of having multiple layers of computation nodes dates from the early days of neural networks (the 1960s). However, at that time no algorithm was available to train such networks.
- The emergence of the BP algorithm in the mid 1980s made it possible to train MLPs, which opened the way for their widespread use.
- Since the mid 1980s there has been great interest in neural networks in general and MLPs in particular, with many contributions made to the theory and applications of neural networks.

Distinguishing Characteristics of MLPs
- Smooth nonlinear activation function
  - Biological basis
  - Ensures that the input-output mapping of the network is nonlinear
  - The smoothness is essential for the BP algorithm
- One or more hidden layers
  - These layers enhance the mapping capability of the network
  - Each additional layer can be viewed as adding more complexity to the feature-detection capability of the network
- High degree of connectivity
  - Highly distributed information processing
  - Fault tolerance

An MLP (figure slide)

Information Flow
- Function signals: input signals that propagate forward through the network, layer by layer
- Error signals: error signals that originate at the output layer and propagate backward, layer by layer

The Back-Propagation (BP) Algorithm
- The BP algorithm is a generalization of the LMS algorithm. While the LMS algorithm applies corrections to a single layer of weights, the BP algorithm provides a mechanism for applying corrections to multiple layers of weights.
- The BP algorithm apportions the error at the output layer to errors at the hidden layers.
- Once the error at a neuron has been determined, the correction is distributed according to the delta rule (as in the LMS algorithm).

Some Notations
- Indices i, j, k each identify a neuron, such that the layer containing neuron k follows the layer containing neuron j, which in turn follows the layer containing neuron i.
- w_ji: weight associated with the link connecting neuron (or source node) i to neuron j
- y_j(n), d_j(n), e_j(n), v_j(n): actual response, target response, error, and net activity level of neuron j when pattern n is presented
- x_i(n), o_i(n): ith element of the input vector and of the network output vector, respectively, when input vector x(n) is presented
- Neuron j has a bias input equal to +1 with weight w_j0 = b_j


BP Algorithm (1)
- Error of output neuron j at iteration n (presentation of the nth training pattern): e_j(n) = d_j(n) − y_j(n)
- Instantaneous sum of squared errors of the network: ξ(n) = ½ Σ_{j∈C} e_j²(n), where C is the set of all neurons in the output layer
- For N training patterns, the cost function is ξ_av = (1/N) Σ_{n=1}^{N} ξ(n)
- ξ_av is a function of the free parameters (weights and thresholds). The goal is to find the weights/thresholds that minimize ξ_av.
- Weights are updated on a pattern-by-pattern basis.

BP Algorithm (2)
- Consider neuron j in the output layer. Its net activity level and output are
  v_j(n) = Σ_{i=0}^{p} w_ji(n) y_i(n)   and   y_j(n) = φ_j(v_j(n))
- Instantaneous error-correction learning: w_ji(n+1) = w_ji(n) − η ∂ξ(n)/∂w_ji(n)
- Using the chain rule, the gradient can be written as
  ∂ξ(n)/∂w_ji(n) = [∂ξ(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] [∂v_j(n)/∂w_ji(n)]

BP Algorithm (3)
- Computing the partial derivatives:
  ∂ξ(n)/∂e_j(n) = e_j(n)
  ∂e_j(n)/∂y_j(n) = −1
  ∂y_j(n)/∂v_j(n) = φ'_j(v_j(n))
  ∂v_j(n)/∂w_ji(n) = y_i(n)
- The gradient at iteration n with respect to w_ji:
  ∂ξ(n)/∂w_ji(n) = −e_j(n) φ'_j(v_j(n)) y_i(n)
- The weight change at iteration n:
  Δw_ji(n) = −η ∂ξ(n)/∂w_ji(n) = η e_j(n) φ'_j(v_j(n)) y_i(n) = η δ_j(n) y_i(n)
- This is known as the delta rule (a small numeric sketch follows below).
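To make the delta rule concrete, here is a minimal sketch (not from the slides) of a single delta-rule update for one logistic output neuron; the input values, initial weights, target, and learning rate are arbitrary illustrative choices.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative values for one output neuron j (y_0 = +1 is the bias input).
y_in = np.array([1.0, 0.5, -0.3])     # inputs y_i to neuron j
w_j = np.array([0.1, 0.4, -0.2])      # weights w_ji (w_j0 is the bias weight b_j)
d_j = 1.0                             # target response
eta = 0.5                             # learning rate

v_j = w_j @ y_in                      # net activity level
y_j = logistic(v_j)                   # actual response
e_j = d_j - y_j                       # error signal
delta_j = e_j * y_j * (1.0 - y_j)     # local gradient e_j * phi'_j(v_j), with phi' = y(1 - y)

w_j = w_j + eta * delta_j * y_in      # delta rule: w_ji(n+1) = w_ji(n) + eta * delta_j * y_i
print(w_j)
```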

BP Algorithm (4)
- Local gradient δ_j(n):
  δ_j(n) = −[∂ξ(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] = e_j(n) φ'_j(v_j(n))
- Credit-assignment problem
  - How do we compute e_j(n)?
  - How do we penalize and reward neuron j (the weights associated with neuron j) for e_j(n)?
  - For the output layer? For the hidden layer(s)?

Output Layer
- If neuron j lies in the output layer, the desired response d_j(n) is known and the error e_j(n) can be computed.
- Hence the local gradient δ_j(n) can be computed, and the weights updated using the delta rule (as given by the equations on the preceding slides).

Hidden Layer (1) (figure slide)

Hidden Layer (2)
- If neuron j lies in a hidden layer, there is no desired response with which to compute its error signal.
- The local gradient must instead be computed recursively, in terms of all the neurons to which neuron j feeds:
  δ_j(n) = −[∂ξ(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] = −[∂ξ(n)/∂y_j(n)] φ'_j(v_j(n))
- We therefore need to calculate the partial derivative ∂ξ(n)/∂y_j(n).

Hidden Layer (3)
- Cost function: ξ(n) = ½ Σ_{k∈C} e_k²(n)
- Partial derivatives:
  ∂ξ(n)/∂y_j(n) = Σ_k e_k(n) [∂e_k(n)/∂v_k(n)] [∂v_k(n)/∂y_j(n)]
  ∂e_k(n)/∂v_k(n) = ∂[d_k(n) − y_k(n)]/∂v_k(n) = ∂[d_k(n) − φ_k(v_k(n))]/∂v_k(n) = −φ'_k(v_k(n))
  ∂v_k(n)/∂y_j(n) = ∂[Σ_{j=0}^{q} w_kj(n) y_j(n)]/∂y_j(n) = w_kj(n)

Hidden Layer (4)
- Thus we have
  ∂ξ(n)/∂y_j(n) = −Σ_k e_k(n) φ'_k(v_k(n)) w_kj(n) = −Σ_k δ_k(n) w_kj(n)
- and therefore
  δ_j(n) = φ'_j(v_j(n)) Σ_k δ_k(n) w_kj(n)

BP Algorithm Summary
- Delta rule: w_ji(n+1) = w_ji(n) + Δw_ji(n), where Δw_ji(n) = η δ_j(n) y_i(n)
- The local gradient δ_j(n) is given by:
  - If neuron j lies in the output layer: δ_j(n) = φ'_j(v_j(n)) e_j(n)
  - If neuron j lies in a hidden layer: δ_j(n) = φ'_j(v_j(n)) Σ_k δ_k(n) w_kj(n)
- A complete one-hidden-layer sketch of these updates follows below.
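The summary above translates directly into code. Below is a minimal sketch (not the course's Matlab implementation) of pattern-mode BP for a one-hidden-layer MLP with logistic activations; the XOR data, layer sizes, learning rate, epoch count, and initialization range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

# Illustrative data: the XOR problem (inputs X and targets D).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])

p, q, m = 2, 4, 1          # input, hidden, and output dimensions
eta = 0.5                  # learning rate

# Small random initial weights; column 0 holds the bias weight (input +1).
W1 = rng.uniform(-0.5, 0.5, (q, p + 1))   # hidden-layer weights
W2 = rng.uniform(-0.5, 0.5, (m, q + 1))   # output-layer weights

for epoch in range(5000):
    for x, d in zip(X, D):                # pattern-by-pattern updates
        # Forward pass
        x1 = np.concatenate(([1.0], x))   # prepend the +1 bias input
        y1 = logistic(W1 @ x1)            # hidden-layer outputs
        x2 = np.concatenate(([1.0], y1))
        y2 = logistic(W2 @ x2)            # output-layer outputs

        # Backward pass: local gradients per the summary slide
        e = d - y2
        delta2 = e * y2 * (1.0 - y2)                       # output layer
        delta1 = y1 * (1.0 - y1) * (W2[:, 1:].T @ delta2)  # hidden layer

        # Delta-rule weight updates
        W2 += eta * np.outer(delta2, x2)
        W1 += eta * np.outer(delta1, x1)

x_test = np.concatenate(([1.0], [0.0, 1.0]))
h = np.concatenate(([1.0], logistic(W1 @ x_test)))
print(logistic(W2 @ h))    # XOR(0, 1): output should be close to 1
```

The final print evaluates the trained network on the input (0, 1); with these settings the output typically ends up close to the target value 1.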

Sigmoidal Nonlinearity
- For the BP algorithm to work, the activation function φ must be continuous and differentiable.
- For the logistic function
  y_j(n) = φ_j(v_j(n)) = [1 + exp(−v_j(n))]⁻¹
  the derivative is
  ∂y_j(n)/∂v_j(n) = φ'_j(v_j(n)) = exp(−v_j(n))/[1 + exp(−v_j(n))]² = y_j(n)[1 − y_j(n)]
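As a quick numerical check (not part of the slides), the identity φ'(v) = y(1 − y) can be compared against a central-difference approximation of the logistic function's derivative:

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

v = np.linspace(-5, 5, 11)
y = logistic(v)

analytic = y * (1.0 - y)                                   # phi'(v) = y(1 - y)
h = 1e-6
numeric = (logistic(v + h) - logistic(v - h)) / (2 * h)    # central difference

print(np.max(np.abs(analytic - numeric)))                  # tiny, ~1e-11
```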

Learning Rate
- The learning-rate parameter η controls the rate and stability of convergence of the BP algorithm:
  - smaller η → slower, smoother convergence
  - larger η → faster, possibly unstable (oscillatory) convergence
- Momentum: Δw_ji(n) = α Δw_ji(n−1) + η δ_j(n) y_i(n)
  - α is a constant, usually positive, called the momentum term
  - It improves the stability of convergence by adding "momentum" from the previous change in weights
  - It may prevent convergence to a shallow (local) minimum
  - This update rule is known as the generalized delta rule (see the sketch below)
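A minimal sketch (illustrative shapes and values) of the generalized delta rule: the previous weight change is stored and blended into the current update via the momentum term α.

```python
import numpy as np

eta, alpha = 0.1, 0.9          # learning rate and momentum term (illustrative)
W = np.zeros((3, 4))           # weights of one layer
dW_prev = np.zeros_like(W)     # previous weight change, Delta w(n-1)

def generalized_delta_update(W, dW_prev, delta, y_in):
    """Apply Delta w(n) = alpha * Delta w(n-1) + eta * delta(n) * y(n)."""
    dW = alpha * dW_prev + eta * np.outer(delta, y_in)
    return W + dW, dW

# One illustrative update with arbitrary local gradients and inputs.
delta = np.array([0.2, -0.1, 0.05])
y_in = np.array([1.0, 0.3, -0.7, 0.4])
W, dW_prev = generalized_delta_update(W, dW_prev, delta, y_in)
print(W)
```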

Modes of Training
- N = number of training examples (input-output patterns in the training set)
- One complete presentation of the training set is called an epoch
- It is good practice to randomize the order in which patterns are presented from one epoch to the next, so as to enhance the search in weight space

Pattern Mode of Training
- Weights are updated after the presentation of each pattern
- This is the manner in which the BP algorithm was derived in the preceding slides
- Average weight change after N updates:
  Δw̄_ji = (1/N) Σ_{n=1}^{N} Δw_ji(n) = −(η/N) Σ_{n=1}^{N} ∂ξ(n)/∂w_ji(n) = −(η/N) Σ_{n=1}^{N} e_j(n) ∂e_j(n)/∂w_ji(n)

Batch Mode of Training
- Weights are updated only after the presentation of all patterns in the epoch
- Average cost function: ξ_av = (1/2N) Σ_{n=1}^{N} Σ_{j∈C} e_j²(n)
- Weight change: Δw_ji = −η ∂ξ_av/∂w_ji = −(η/N) Σ_{n=1}^{N} e_j(n) ∂e_j(n)/∂w_ji
- Comparison (contrasted in the sketch below)
  - The weight update after a complete epoch differs between the two modes
  - The pattern mode is a (stochastic) estimate of the batch mode
  - The pattern mode is suitable for on-line implementation, requires less storage, and searches the error surface better (because it is stochastic)
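The following sketch (a single logistic layer on random illustrative data, bias omitted for brevity) contrasts the two schedules; the forward pass and local gradient are identical, and only when the update is applied differs.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                      # 8 patterns, 3 inputs
D = rng.integers(0, 2, size=(8, 1)).astype(float)
eta = 0.1

def forward(W, x):
    return 1.0 / (1.0 + np.exp(-(W @ x)))

def delta_out(d, y):
    return (d - y) * y * (1.0 - y)               # local gradient of an output neuron

# Pattern mode: update after every pattern.
W_pat = np.zeros((1, 3))
for x, d in zip(X, D):
    y = forward(W_pat, x)
    W_pat += eta * np.outer(delta_out(d, y), x)

# Batch mode: accumulate the gradient over the whole epoch, then update once.
W_bat = np.zeros((1, 3))
grad = np.zeros_like(W_bat)
for x, d in zip(X, D):
    y = forward(W_bat, x)
    grad += np.outer(delta_out(d, y), x)
W_bat += eta * grad / len(X)                     # average over the N patterns

print(W_pat, W_bat, sep="\n")
```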

Stopping Criteria
- The BP algorithm may be considered to have converged when the Euclidean norm of the gradient vector falls below a sufficiently small threshold: ||g(w)|| < t
- Alternatively, when the absolute rate of change of the average squared error per epoch is sufficiently small: |ξ_av(w(n+1)) − ξ_av(w(n))| / ξ_av(w(n)) < t
- The algorithm may also be terminated at the weight vector w_final when ||g(w_final)|| < t₁, or when ξ_av(w_final) < t₂
- Training may also be stopped when the network's generalization performance is adequate
- (A sketch of the first two checks follows below.)
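A sketch (threshold values are arbitrary) of checking the gradient-norm and relative-error-change criteria at the end of an epoch:

```python
import numpy as np

def converged(grad, cost_prev, cost_curr, grad_tol=1e-4, rel_tol=1e-4):
    """Stop when ||g(w)|| is small or the relative change in xi_av per epoch is small."""
    small_gradient = np.linalg.norm(grad) < grad_tol
    small_change = abs(cost_curr - cost_prev) / max(cost_prev, 1e-12) < rel_tol
    return small_gradient or small_change

# Illustrative values for one epoch-to-epoch check.
grad = np.array([1e-5, -2e-5, 3e-6])
print(converged(grad, cost_prev=0.1234, cost_curr=0.1233))
```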

Initialization
- The initial values assigned to the weights affect the performance of the network
- If prior information is available, it should be used to assign appropriate initial values to the weights
- However, prior information is usually not available. Moreover, even when prior knowledge of the problem exists, it is difficult to translate it into weights, since the behavior of an MLP is complex and not completely understood.
- If prior information is not available, the weights are initialized to uniform random values within a range such as [0, 1]
- Poor initialization can lead to premature saturation (next slide)

Premature Saturation
- When the output of a neuron approaches the limits of the sigmoidal function, little change occurs in its weights; the neuron is saturated, and learning and adaptation are hampered.
- How to avoid premature saturation?
  - Initialize the weights from a uniform distribution within a small range
  - Use as few hidden neurons as possible
- Premature saturation is least likely when neurons operate in the linear range (the middle of the sigmoidal function)

The XOR Problem (1) (figure slide)

The XOR Problem (2) (figure slide)

Hyperbolic Tangent Activation Function
- The hyperbolic tangent is an antisymmetric (odd) sigmoidal function
- Experience has indicated that using an antisymmetric activation function can speed up learning (i.e. it requires fewer training iterations)
- The hyperbolic tangent varies from −a to +a (as opposed to [0, a] for the logistic function), i.e. from −1 to +1 for a = 1:
  φ(v) = a tanh(bv) = a[1 − exp(−2bv)][1 + exp(−2bv)]⁻¹ = 2a[1 + exp(−2bv)]⁻¹ − a
- Suggested values for a and b: a = 1.716, b = 2/3 (used in the sketch below)
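A small sketch verifying numerically that a·tanh(bv) equals the logistic-style form given above, using the suggested constants a = 1.716 and b = 2/3:

```python
import numpy as np

a, b = 1.716, 2.0 / 3.0

def tanh_act(v):
    return a * np.tanh(b * v)

def tanh_act_logistic_form(v):
    # 2a / (1 + exp(-2bv)) - a
    return 2.0 * a / (1.0 + np.exp(-2.0 * b * v)) - a

v = np.linspace(-4, 4, 9)
print(np.max(np.abs(tanh_act(v) - tanh_act_logistic_form(v))))   # ~0 (machine precision)
print(tanh_act(np.array([-10.0, 0.0, 10.0])))                    # approaches -a, 0, +a
```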

Some Implementation Tips (1)
- Normalize the desired (target) responses d_j to lie within the limits of the activation function. For example, if the activation function ranges from −1.716 to 1.716, the desired responses can be limited to [−1, +1].
- The weights and thresholds should be uniformly distributed within a small range to prevent saturation of the neurons.
- All neurons should ideally learn at the same rate. To achieve this, the learning-rate parameter can be set larger for layers farther from the output layer.
- (A sketch of the first two tips follows below.)
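A sketch of the first two tips; the helper names and scaling ranges are illustrative assumptions, with targets mapped into [−1, +1] and weights drawn from a small uniform range.

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_targets(d, lo=-1.0, hi=1.0):
    """Linearly rescale targets into [lo, hi], inside the tanh range [-1.716, 1.716]."""
    d_min, d_max = d.min(), d.max()
    return lo + (hi - lo) * (d - d_min) / (d_max - d_min)

def init_weights(n_out, n_in, scale=0.1):
    """Small uniform weights (bias column included) to keep neurons away from saturation."""
    return rng.uniform(-scale, scale, (n_out, n_in + 1))

d = np.array([10.0, 35.0, 20.0, 5.0])   # arbitrary raw targets
print(scale_targets(d))
print(init_weights(4, 2))
```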

Some Implementation Tips (2)
- The order in which the training examples are presented to the network should be randomized (shuffled) from one epoch to the next. This enhances the search for a better local minimum on the error surface.
- Whenever prior information is available, include it in the learning process.

Pattern Classification
- Since the outputs of an MLP are continuous, we need to define decision rules for classification
- In general, classification into M classes requires M output neurons
- What decision rule should be used?
  - A pattern x is classified to class k if output neuron k 'fires' (i.e. its output is greater than a threshold). The problem with this rule is that it is ambiguous: more than one neuron may 'fire'.
  - A pattern x is classified to class k if the output of neuron k is greater than that of all other neurons: y_k > y_j for all j ≠ k (see the sketch below)
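A sketch (the output values are made up) of the second decision rule, which in practice amounts to an argmax over the output neurons:

```python
import numpy as np

def classify(y_out):
    """Assign the pattern to the class k with the largest output, i.e. y_k > y_j for all j != k."""
    return int(np.argmax(y_out))

# Outputs of an M = 3 class network for two illustrative patterns.
print(classify(np.array([0.12, 0.81, 0.30])))   # -> class 1
print(classify(np.array([0.65, 0.10, 0.40])))   # -> class 0
```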

Example (1)
- Classify between two 'overlapping' two-dimensional, Gaussian-distributed pattern classes
- Conditional probability density functions for the two classes:
  f(x | C₁) = 1/(2πσ₁²) exp[−||x − μ₁||²/(2σ₁²)], with mean μ₁ = [0 0]ᵀ and variance σ₁² = 1
  f(x | C₂) = 1/(2πσ₂²) exp[−||x − μ₂||²/(2σ₂²)], with mean μ₂ = [2 0]ᵀ and variance σ₂² = 4
- x = [x₁ x₂]ᵀ is the two-dimensional input; C₁ and C₂ are the class labels
- (A data-generation sketch follows below.)
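A sketch (sample size and seed are arbitrary) of drawing patterns from the two class-conditional Gaussian densities defined above, which is how the training and testing sets for this example are described as being generated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                   # samples per class (illustrative)

mu1, var1 = np.array([0.0, 0.0]), 1.0     # class C1
mu2, var2 = np.array([2.0, 0.0]), 4.0     # class C2

# Isotropic Gaussians: covariance = variance * identity.
X1 = rng.multivariate_normal(mu1, var1 * np.eye(2), n)
X2 = rng.multivariate_normal(mu2, var2 * np.eye(2), n)

X = np.vstack([X1, X2])
labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
print(X.shape, labels.shape)
```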

Example (2) (figure slide)

Example (3) (figure slide)

Example (4)
- Consider an MLP with two inputs, four hidden neurons, and two outputs
- Decision rule: an input x is classified to C₁ if y₁ > y₂
- The training set is generated from the probability density functions
- The network is trained with the BP algorithm for minimum mean squared error
- The testing set is generated from the probability density functions
- The trained network is tested for correct classification
- For other implementation details, see the Matlab code

Example (5) (figure slide)

Experimental Design
- Number of hidden layers
  - Use the minimum number of hidden layers that gives the best performance (lowest mean squared error or best generalization)
  - In general, more than two hidden layers are rarely necessary
- Number of hidden-layer neurons
  - Use the minimum number of hidden-layer neurons (≥ 2) that gives the best performance
- Learning-rate and momentum parameters; possible selection criteria:
  - the parameters that, on average, yield convergence to a local minimum in the fewest epochs
  - the parameters that, on average or in the worst case, yield convergence to the global minimum in the fewest epochs
  - the parameters that, on average, yield the network with the best generalization

Generalization (1)
- A neural network is trained to learn the input-output patterns presented to it by minimizing an error function (e.g. mean squared error)
- In other words, the neural network tries to learn the given input-output mapping or association as accurately as possible
- But can it generalize properly? And what is generalization?
- A network is said to generalize well when the input-output relationship computed by the network is correct (or nearly so) for input-output patterns (test data) never used in creating and training the network
- In other words, a network generalizes well when it learns the input-output mapping of the system from which the training data were obtained

Generalization (2)
Properly fitted data: good generalization (figure slide)

Generalization (3)
Overfitted data: bad generalization (figure slide)

Generalization (4)
- How do we achieve good generalization?
- In general, good generalization implies a smooth nonlinear input-output mapping; a rigorous mathematical criterion was presented by Poggio and Girosi (1990)
- In general, the simplest function that maps the input-output patterns will be the smoothest
- In a neural network:
  - use the simplest architecture possible, with as few hidden neurons as needed for the mapping
  - use a training set that is consistent with the complexity of the architecture (i.e. more patterns for more complex networks)

Universal Approximator
- A neural network can be thought of as a mapping from a p-dimensional Euclidean space to a q-dimensional Euclidean space; in other words, a neural network can learn a function s: Rᵖ → Rᵠ
- A multilayer feedforward network with nonlinear, bounded, monotone-increasing activation functions is a universal approximator
- Universal approximation: the ability to learn a continuous nonlinear mapping to any degree of accuracy
- This is an existence theorem: it says nothing about practical optimality (complexity, computation time, etc.)

Cross-Validation
- The design of a neural network is experimental: we select the "best" network parameters based on a criterion
- Cross-validation, borrowed from statistics, provides a systematic way of experimenting:
  - Randomly partition the data into training and testing samples
  - Further partition the training sample into an estimation sample and an evaluation (cross-validation) sample
  - Find the "best" model by training on the estimation sample and validating on the evaluation sample
  - Once the model has been found, train it on the entire training sample and test its generalization using the testing sample
- (A partitioning sketch follows below.)
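A sketch (split fractions are arbitrary) of the partitioning described above: the testing sample is held out first, and the remaining training sample is split into estimation and evaluation subsets.

```python
import numpy as np

rng = np.random.default_rng(0)

def split(n_samples, test_frac=0.2, eval_frac=0.25):
    """Return index arrays for the estimation, evaluation, and testing samples."""
    idx = rng.permutation(n_samples)
    n_test = int(test_frac * n_samples)
    test, train = idx[:n_test], idx[n_test:]
    n_eval = int(eval_frac * len(train))
    evaluation, estimation = train[:n_eval], train[n_eval:]
    return estimation, evaluation, test

est, ev, test = split(1000)
print(len(est), len(ev), len(test))    # 600 200 200
```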

MLP and BP: Remarks (1)
- The BP algorithm is the most popular algorithm for supervised training of MLPs. It is popular because:
  - it is simple to compute locally
  - it performs stochastic gradient descent in weight space (in the pattern-by-pattern mode of training)
- The BP algorithm has no biological basis; many of its operations are not found in biological neural networks. Nevertheless, it is of great engineering importance.
- The hidden layers act as feature extractors/detectors
- An MLP can be trained to learn an identity mapping (i.e. to map a pattern to itself), in which case the hidden layers act as feature extractors

MLP and BP: Remarks (2)
- An MLP trained with BP is a universal approximator in the sense that it can approximate any continuous multivariate function to any desired degree of accuracy, provided that sufficiently many hidden neurons are available
- The BP algorithm is a first-order approximation of the method of steepest descent; consequently, it converges slowly
- The BP algorithm is a hill-climbing technique and therefore suffers from the possibility of getting trapped in a local minimum of the cost surface
- MLPs scale poorly because of their full connectivity

Accelerating Convergence
Four heuristics (a simplified sketch follows below):
- Every adjustable network parameter of the cost function should have its own individual learning-rate parameter
- Every learning-rate parameter should be allowed to vary from one iteration to the next
- When the derivative of the cost function with respect to a synaptic weight has the same algebraic sign for several consecutive iterations of the algorithm, the learning-rate parameter for that weight should be increased
- When the algebraic sign of the derivative of the cost function with respect to a particular synaptic weight alternates for several consecutive iterations of the algorithm, the learning-rate parameter for that weight should be decreased
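One simple way to realize these heuristics, sketched below under the simplifying assumption that a single sign comparison (rather than several consecutive iterations) triggers the change, is a per-weight learning-rate table that grows when the gradient sign persists and shrinks when it alternates; the growth and decay factors are arbitrary.

```python
import numpy as np

def adapt_rates(rates, grad, grad_prev, up=1.05, down=0.7):
    """Grow a weight's learning rate when its gradient keeps the same sign, shrink it otherwise."""
    same_sign = np.sign(grad) == np.sign(grad_prev)
    return np.where(same_sign, rates * up, rates * down)

# Illustrative per-weight rates and gradients for one layer.
rates = np.full((2, 3), 0.1)
grad_prev = np.array([[0.2, -0.1, 0.3], [-0.4, 0.1, -0.2]])
grad = np.array([[0.1, 0.2, 0.3], [0.4, 0.1, -0.1]])
print(adapt_rates(rates, grad, grad_prev))
```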