1
Artificial neural networks – Lecture 4
Prof Kang Li
2
Last lecture: analysis of the LMS learning algorithm; batch vs sequential training; the MLP as classifier and approximator. This lecture: sequential BP; some issues in MLP training; advanced MLP learning algorithms.
3
MLP learning algorithm
In biological neural networks (the brain) learning is thought to occur through the adaptation of synapses over time in response to repeated activation. In ANNs learning is performed through adaptation of the weights in response to training data in a manner which improves the network fit to a given mapping. This training process can be posed as an unconstrained optimisation problem in which the objective is to minimise (or maximise) a performance index with respect to the network parameters.
4
gradient descent method
BP = gradient descent method + multilayer networks (Rumelhart, Hinton and Williams, 1986)
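Before adding the multilayer part, plain gradient descent itself can be sketched in a few lines. This is my own minimal illustration on a one-dimensional quadratic cost, not an example from the slides; the cost, learning rate and step count are arbitrary choices.

```python
# Minimal gradient-descent sketch (illustrative, not from the slides):
# minimise E(w) = (w - 3)^2 by repeatedly stepping against dE/dw = 2(w - 3).
def gradient_descent(w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)   # gradient of the cost at the current w
        w = w - lr * grad        # move downhill, scaled by the learning rate
    return w

w_star = gradient_descent(0.0)   # approaches the minimiser w = 3
```

BP applies exactly this update, but to every weight of a multilayer network at once, with the gradient supplied by the backward pass described below.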
5
The back-propagation learning algorithm is called 'BP' for short.
BP has two phases:
Forward pass phase: computes the 'functional signal', the feed-forward propagation of input pattern signals through the network.
Backward pass phase: computes the 'error signal', the propagation of the error (the difference between actual and desired output values) backwards through the network, starting at the output units.
6
BP for MLP with one hidden layer
Input layer Hidden layer Output layer
7
Forward pass (sequential mode)
Weights are fixed during forward pass at time t 1. Compute values for hidden nodes 2. Compute values for output nodes
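The two forward-pass steps can be sketched as follows. The sigmoid activation, the layer sizes and the names `W_hid`, `W_out` are my assumptions for illustration; the slides do not fix them.

```python
import numpy as np

# Hypothetical forward pass for an MLP with one hidden layer.
# Weights are held fixed during the pass, as stated above.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W_hid, W_out):
    h = sigmoid(W_hid @ x)   # 1. compute values for hidden nodes
    y = sigmoid(W_out @ h)   # 2. compute values for output nodes
    return h, y

rng = np.random.default_rng(0)
W_hid = rng.uniform(-0.5, 0.5, size=(3, 2))   # 2 inputs -> 3 hidden nodes
W_out = rng.uniform(-0.5, 0.5, size=(1, 3))   # 3 hidden -> 1 output node
h, y = forward(np.array([0.2, -0.4]), W_hid, W_out)
```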
8
Backward Pass (sequential mode)
Recall that the sum-squared error measure for each pattern is E(t) = (1/2) Σ_k (d_k(t) − y_k(t))². Again recall the Delta rule. We now want to know how to modify the weights in order to decrease E, i.e. Δw_ij(t) ∝ −∂E(t)/∂w_ij, both for hidden nodes and output nodes. Using the chain rule, this derivative can be rewritten as a product of two terms: ∂E(t)/∂w_ij = (∂E(t)/∂net_i(t)) · (∂net_i(t)/∂w_ij).
9
Both for hidden units and output units:
Part A: the error signal, δ_i(t) = −∂E(t)/∂net_i(t).
Part B: ∂net_i(t)/∂w_ij = x_j(t), the input signal on the connection.
10
Combining A and B gives ∂E(t)/∂w_ij = −δ_i(t) x_j(t). So to achieve gradient descent in E we should change the weights by Δw_ij(t) = η δ_i(t) x_j(t).
11
Now we need to find the error signal for each node: δ_i(t) for output nodes and Δ_i(t) for hidden nodes. Output nodes: δ_i(t) = (d_i(t) − y_i(t)) f′(net_i(t)), where f is the sigmoid activation, so f′(net_i) = y_i(1 − y_i). Therefore δ_i(t) = (d_i(t) − y_i(t)) y_i(t) (1 − y_i(t)).
12
Hidden nodes: the error signal is propagated back from the output layer, Δ_j(t) = f′(net_j(t)) Σ_k δ_k(t) w_kj = h_j(t) (1 − h_j(t)) Σ_k δ_k(t) w_kj, where the sum runs over the output nodes k fed by hidden node j.
13
Sequential BP Summary
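Pulling the forward and backward passes together, one sequential-BP update can be sketched as below. The sigmoid activations, variable names and learning rate are my assumptions; this is a minimal illustration of the update rules, not code from the lecture.

```python
import numpy as np

# One sequential-BP weight update for a one-hidden-layer sigmoid MLP.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bp_step(x, d, W_hid, W_out, eta=0.5):
    # forward pass (weights fixed)
    h = sigmoid(W_hid @ x)                        # hidden functional signals
    y = sigmoid(W_out @ h)                        # output values
    # backward pass: output error signals, then hidden error signals
    delta_out = (d - y) * y * (1.0 - y)           # (d - y) f'(net) at outputs
    delta_hid = (W_out.T @ delta_out) * h * (1.0 - h)
    # gradient-descent weight changes, delta * input signal
    W_out = W_out + eta * np.outer(delta_out, h)
    W_hid = W_hid + eta * np.outer(delta_hid, x)
    return W_hid, W_out
```

Repeating `bp_step` over the training patterns, epoch after epoch, is the sequential (pattern-by-pattern) training mode described above.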
14
More issues on BP and MLP learning
Selecting initial weight values
The choice of initial weight values is important, as it decides the starting position in weight space, that is, how far it is from the global minimum. The aim is to select weight values which produce midrange function signals. Normally, weight values are selected randomly from a uniform probability distribution, and normalised so that the number of weighted connections per unit produces a midrange function signal.
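One common way to realise this heuristic is to scale the uniform range by the fan-in of each unit; the 1/√(fan-in) bound below is a standard choice I am assuming, not a formula from the slides.

```python
import numpy as np

# Draw weights uniformly at random, scaled by the number of incoming
# connections (fan-in), so the weighted sum feeding each unit stays in
# the sigmoid's midrange rather than saturating it.
def init_weights(n_out, n_in, rng):
    bound = 1.0 / np.sqrt(n_in)   # shrink the range as fan-in grows
    return rng.uniform(-bound, bound, size=(n_out, n_in))

rng = np.random.default_rng(42)
W = init_weights(10, 100, rng)    # every entry lies in (-0.1, 0.1)
```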
15
Randomisation and validation
Randomisation: for pattern training it is good practice to randomise the order of presentation of the training examples between epochs.
Validation: in order to validate the process of learning, the available data is randomly partitioned into a training set, which is used for training, and a test set, which is used for validation of the obtained model.
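Both practices can be sketched in a few lines; the 80/20 split ratio and the data shapes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(100).reshape(50, 2).astype(float)   # 50 dummy patterns

# Random partition into a training set and a held-out test set (80/20).
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]

# Randomise the order of presentation between epochs.
for epoch in range(3):
    order = rng.permutation(train_idx)
    for i in order:
        pass   # present pattern X[i] to the network here
```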
16
The generalisation problem
Refers to the ability of the trained network to predict the correct output for unseen data. The network should interpolate smoothly between the data points in the training set. The generalisation error is usually estimated as the mean-squared error on a separate validation (test) set of data. Ideally this validation set should be independent of, and uncorrelated with, the data used for training in order to get an unbiased estimate of generalisation performance. Using fewer neurons may lead to better generalisation performance, however, too few neurons may lead to poor approximation accuracy. This is also considered as the bias-variance trade-off problem.
17
The generalisation problem (cont)
Two models g(x) are used to fit data generated by y = h(x) + e. Model A: underfitting (high bias). Model B: overfitting (high variance).
18
The generalisation problem (cont)
The bias-variance trade-off is most likely to become a problem if we have relatively few data points. In the opposite case, where we have essentially an infinite number of data points (as in continuous online learning), we are not usually in danger of overfitting the data, as the noise associated with any single data point plays a vanishingly small role in our overall fit.
19
Preventing overfitting and improving generalisation
Early stopping
Split the data into one training set and one validation set; stop training when the validation error starts to rise instead of decreasing.
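A minimal early-stopping rule might look like this; the validation-error sequence is a stand-in for values a real training run would produce, and the patience parameter is my own addition to make the rule robust to small fluctuations.

```python
# Stop when the validation error starts rising instead of decreasing.
def early_stop(val_errors, patience=1):
    best, since_best = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, since_best = err, 0     # still improving: keep going
        else:
            since_best += 1
            if since_best > patience:     # rose for too many epochs
                return epoch              # stop training here
    return len(val_errors) - 1

stop_at = early_stop([0.9, 0.5, 0.3, 0.31, 0.35, 0.4])
```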
20
Regularisation - also called weight decay
Add a penalty term to the standard training cost function which directly penalises poor generalisation (it penalises large weights and therefore reduces unnecessary curvature): E′ = E + λΩ, where λ is a scalar that determines the influence of the added term and Ω normally takes the form Ω = (1/2) Σ_i w_i². The extra gradient term is ∂Ω/∂w_i = w_i, so each weight decays in proportion to its own size. Therefore, in the absence of a data gradient, Δw_i = −ηλw_i and each weight decays exponentially towards zero.
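The modified update can be sketched as follows; the values of η and λ are illustrative, and the zero data gradient in the usage line just isolates the decay effect.

```python
import numpy as np

# Gradient-descent update with the weight-decay penalty (lambda/2)*sum(w^2):
# the penalty contributes an extra lam*w to the gradient, shrinking every
# weight toward zero in proportion to its size.
def decay_update(w, grad_E, eta=0.1, lam=0.01):
    return w - eta * (grad_E + lam * w)

w = np.array([2.0, -1.0])
w_new = decay_update(w, grad_E=np.zeros(2))
# with a zero data gradient, w simply shrinks: w -> (1 - eta*lam) * w
```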
21
Training with noise
Add an extra small amount of noise (a small random value with mean zero) to each training input: each time a specific input pattern x is presented, we add a different random value e and use x + e instead. If we have a finite training set, another way of introducing noise into the training process is to use online training, that is, updating the weights after every pattern presentation and randomly reordering the patterns at the end of each training epoch. In this manner, each weight update is based on a noisy estimate of the true gradient.
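The input-noise idea can be sketched in two lines; Gaussian noise with σ = 0.01 is my assumption for the "small random value with mean zero".

```python
import numpy as np

rng = np.random.default_rng(1)

# Each presentation of pattern x sees a freshly perturbed copy x + e,
# where e is small zero-mean noise.
def noisy(x, sigma=0.01):
    return x + rng.normal(0.0, sigma, size=x.shape)

x = np.array([0.5, -0.2])
x1, x2 = noisy(x), noisy(x)   # two presentations, two different inputs
```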
22
Some advanced issues in MLP training
Zig-zag search directions and highly correlated step sizes represent untapped curvature information and reflect the overall inefficiency of traditional BP. Various heuristic extensions to BP have been proposed (momentum, Quickprop, delta-bar-delta, conjugate gradient, etc.). Much better results are obtained by exploiting advanced second-order algorithms from the field of unconstrained optimisation. They are classed as second order because they directly or indirectly exploit second-derivative (curvature) information to generate their search directions. They have a strong theoretical basis and are vastly superior to gradient descent. However, the improved performance is obtained at the expense of additional computational complexity and memory requirements.
23
Adaptive learning rate
For batch mode, one way of increasing the convergence speed, that is, of moving faster downhill to the minimum of the mean-squared error E(W), is to adaptively vary the learning rate parameter η:
• If E is decreasing consistently, increase η.
• If the error has increased, decrease η.
In general, as the learning rate grows, learning tends to become unstable, so it is important to be able to reduce η quickly.
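This "bold driver" style rule can be sketched as below; the growth and shrink factors 1.05 and 0.5 are common choices I am assuming, not values from the slides.

```python
# Adaptive learning rate for batch mode: grow eta gently while the batch
# error keeps falling, cut it sharply as soon as the error rises.
def adapt_eta(eta, E_new, E_old, up=1.05, down=0.5):
    if E_new < E_old:
        return eta * up    # consistent decrease: speed up
    return eta * down      # error increased: quickly reduce eta

eta1 = adapt_eta(0.1, E_new=0.8, E_old=1.0)   # error fell, eta grows
eta2 = adapt_eta(0.1, E_new=1.2, E_old=1.0)   # error rose, eta is halved
```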
24
Second derivative methods
The problem of minimising a function of many variables, E(w), has been researched since the 17th century, and its principles were formulated by people such as Kepler, Fermat, Newton, Leibniz and Gauss. Consider the values of the weight vector at two consecutive steps, w(t) and w(t+1) = w(t) + Δw(t), with corresponding cost functions E(w(t)) and E(w(t+1)). A Taylor series expansion gives E(w(t+1)) ≈ E(w(t)) + g(t)ᵀ Δw(t) + (1/2) Δw(t)ᵀ H(t) Δw(t).
25
Here g(t) = ∂E/∂w is the gradient vector of the cost function and H(t) = ∂²E/∂w² is the Hessian matrix of second derivatives. Therefore the change in cost is E(w(t+1)) − E(w(t)) ≈ g(t)ᵀ Δw(t) + (1/2) Δw(t)ᵀ H(t) Δw(t). The purpose of iterative training is to ensure that the error reduction proceeds along a fast and stable direction, making use of the first- and second-derivative information. BP and derived algorithms use only first-derivative information; advanced algorithms such as LM use second-derivative information.
26
Newton’s method
To minimise E(w(t+1)), ignore the higher-order terms in the Taylor expansion, calculate the gradient with respect to Δw(t) and equate it to zero, which gives Δw(t) = −H(t)⁻¹ g(t). Newton’s method uses this equation to update the weights. The Hessian matrix provides additional information about the shape of the performance-index surface in the neighbourhood of w(t), so the method is typically faster than BP and its derivatives, which use only first-derivative information. However, in Newton’s method the computation of the inverse of H is complex and expensive. Newton’s method uses batch mode.
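On a quadratic cost, a single Newton step lands exactly on the minimiser, which makes it easy to verify; the matrix A and vector b below are illustrative choices.

```python
import numpy as np

# Newton step on the quadratic cost E(w) = 0.5 * w^T A w - b^T w,
# whose gradient is A w - b and whose Hessian is A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian H (positive definite)
b = np.array([1.0, 1.0])

w = np.zeros(2)
g = A @ w - b                        # gradient at the current w
w_new = w - np.linalg.solve(A, g)    # Newton update; solve rather than invert
# for a quadratic, w_new satisfies A w_new = b: the exact minimiser
```

Solving the linear system instead of forming H⁻¹ explicitly is the standard way to sidestep part of the inversion cost mentioned above, though for a network with many weights even this remains expensive.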
27
Gauss-Newton method and the Jacobian matrix
The Gauss-Newton method uses the online learning mode. Recall the LMS method, with all weights collected in a vector w and the error for each pattern k written e_k(w). The Jacobian matrix J collects the first derivatives of the errors, with entries J_ki = ∂e_k/∂w_i.
28
Then, for the sum-of-squares cost E(w) = (1/2) Σ_k e_k(w)², the gradient is g = Jᵀe and the Hessian is approximated by H ≈ JᵀJ, where e is the vector of pattern errors. This gives the Gauss-Newton update Δw = −(JᵀJ)⁻¹ Jᵀ e.
29
Levenberg-Marquardt algorithm
The problem with the Gauss-Newton method is that the approximated Hessian matrix JᵀJ may not be invertible. Therefore, the Levenberg-Marquardt algorithm uses H(w) ≈ J(w)ᵀJ(w) + µI. Here H(w) is the batch Hessian matrix, J(w) is the batch Jacobian matrix, I is an identity matrix and µ is a small constant.
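The damping effect of the µI term can be sketched as follows; the Jacobian, error vector and µ value are illustrative, with J chosen rank-deficient so that plain Gauss-Newton would fail.

```python
import numpy as np

# Levenberg-Marquardt step: regularise the Gauss-Newton matrix J^T J
# with mu * I so the linear system is always solvable.
def lm_step(w, J, e, mu=0.01):
    H_approx = J.T @ J + mu * np.eye(len(w))   # damped approximate Hessian
    g = J.T @ e                                # gradient of 0.5 * ||e||^2
    return w - np.linalg.solve(H_approx, g)

J = np.array([[1.0, 0.0], [0.0, 0.0]])   # rank-deficient: J^T J is singular
e = np.array([0.5, 0.2])
w_new = lm_step(np.zeros(2), J, e)       # still well-defined thanks to mu*I
```

As µ grows the step approaches small-step gradient descent, and as µ shrinks it approaches the Gauss-Newton step, which is why µ is typically adapted during training.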