Optimizing number of hidden neurons in neural networks


1 Optimizing number of hidden neurons in neural networks
IASTED International Conference on Artificial Intelligence and Applications Innsbruck, Austria Feb 2007 Neural networks can be used as unknown function approximators. Thus, in deciding on the neural network organization, one has to choose the network complexity, expressed by the number of neurons. In this paper we use a novel quantitative measure to optimize the number of hidden neurons of NNs. Janusz A. Starzyk School of Electrical Engineering and Computer Science Ohio University Athens, Ohio, U.S.A.

2 Outline Neural networks – multi-layer perceptron Overfitting problem
Signal-to-noise ratio figure (SNRF) Optimization using signal-to-noise ratio figure Experimental results Conclusions My talk is organized as follows: I will briefly describe the operation of the multilayer perceptron and illustrate the problem of overfitting. Then I will derive a new measure for estimating fit quality, called the signal-to-noise ratio figure (SNRF). I will show NN optimization using the SNRF and illustrate it with application results.

3 Neural networks – multi-layer perceptron (MLP)
Inputs x Outputs z A typical 2-layer perceptron contains an input layer, a hidden layer and an output layer. The input data go through the connections between the input layer and the hidden layer. Each hidden neuron receives a weighted sum of the inputs and applies a nonlinear transformation to it. Data from the hidden neurons go through the connections between the hidden layer and the output layer, and the weighted sum of the hidden-neuron outputs is the network output. The interconnection weights are trainable to implement desired functions.
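The forward computation described above can be sketched in NumPy. This is a generic illustration, not the authors' code; the weights here are random placeholders, not trained values:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """2-layer perceptron: inputs -> sigmoid hidden layer -> linear output."""
    h = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))  # nonlinear transform of the weighted sum
    return h @ W2 + b2                        # output = weighted sum of hidden activations

# Example shapes: 2 inputs, 5 hidden neurons, 1 output (untrained random weights)
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((2, 5)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 1)), np.zeros(1)
z = mlp_forward(np.array([[0.5, -1.0]]), W1, b1, W2, b2)
```

Training would adjust W1, b1, W2, b2 (e.g. by back-propagation) to implement the desired function.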

4 Neural networks – multi-layer perceptron (MLP)
inputs outputs Efficient mapping from inputs to outputs Powerful universal function approximation Number of inputs and outputs determined by the data Number of hidden neurons: determines the fitting accuracy  critical MLP Thus the weights are trained to map the relations between input and output data. The MLP is a powerful tool for function approximation in the sense that it finds the weights to combine sigmoidal basis functions. The number of basis functions used affects the learning quality of the function approximator. The numbers of input and output neurons are determined by the dimensions of the input and output data, so only the number of hidden neurons is left for the user to decide. Since the number of hidden neurons affects the fitting accuracy, it is a critical parameter for NNs.

5 Overfitting problem Generalization:
Overfitting: overestimates the function complexity, degrades generalization capability Bias/variance dilemma Excessive hidden neurons  overfitting Training data (x, y) Model MLP training new data (x’) y’ The important reason to do function approximation is to generalize a model from existing data to unseen data. Using an MLP, an approximating model for the training data is generated, which we can then use to predict the function value for new data. The training data are usually produced by measurements, so they come with noise. If the model fits the existing data too closely, it overestimates the problem complexity and greatly degrades generalization capability, leading to significant deviations in predictions; in this case we say the model overfits the data. Using an excessive number of hidden neurons causes overfitting. Choosing the number of hidden neurons without a pre-set accuracy target is one of the major challenges for neural networks, usually referred to as the bias/variance dilemma.

6 Overfitting problem Avoid overfitting: cross-validation & early stopping training data (x, y) MLP training Training error etrain All available training data (x, y) testing data (x’, y’) MLP testing Testing error etest Common methods to determine the optimal number of hidden neurons are cross-validation and early stopping. In these methods, the available data are divided into two independent sets: a training set and a validation (testing) set. Only the training set participates in the neural network learning; the testing set is used to compute the testing error, which approximates the generalization error. The performance of a function approximation is measured by the training error and the testing error. Once the testing performance stops improving as the number of hidden neurons increases, overfitting has likely occurred. Therefore, the stopping criterion is set so that when the testing error starts to increase, or equivalently when the training and testing errors start to diverge, the optimal number of hidden neurons has been reached. Fitting error etest Stopping criterion: etest starts to increase, or etrain and etest start to diverge etrain Number of hidden neurons Optimum number
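The early-stopping scan over hidden-layer sizes can be sketched as follows. This is a generic illustration of the procedure, not the paper's code; `train_fn` is a hypothetical training routine supplied by the caller:

```python
import numpy as np

def early_stopping_scan(train_fn, x, y, max_hidden=50, test_frac=0.3, seed=0):
    """Grow the hidden layer; stop when the held-out testing error starts to rise.
    train_fn(x_tr, y_tr, n_hidden) is assumed to return a fitted predictor callable."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_test = int(len(x) * test_frac)
    te, tr = idx[:n_test], idx[n_test:]      # testing data removed from training data
    best_n, best_err = 1, np.inf
    for n in range(1, max_hidden + 1):
        model = train_fn(x[tr], y[tr], n)
        err = np.mean((model(x[te]) - y[te]) ** 2)
        if err > best_err:                   # testing error starts to increase -> stop
            break
        best_n, best_err = n, err
    return best_n
```

Note the drawback the talk goes on to discuss: the `te` samples never participate in training, so part of the data is wasted.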

7 Number of hidden neurons
Overfitting problem How to divide available data? When to stop? Fitting error training data (x, y) All available training data (x, y) etest testing data (x’, y’) However, in cross-validation and early stopping, the use of a stopping criterion based on the testing error is not straightforward. For example, how does one determine the sizes of the training and testing sets when predicting the generalization error from the testing error? Although the testing error provides an estimate of the generalization error, does the testing error necessarily increase as soon as the generalization error begins to increase? These methods also require removing the testing set from the training data, which wastes precious available data. etrain data wasted Number of hidden neurons Optimum number Can the testing error catch the generalization error?

8 Overfitting problem Desired:
The testing error varies with the number of hidden neurons and usually has many local minima; which of these local minima indicates the occurrence of overfitting? It is desirable to have a quantitative measure of overfitting based on the training error alone, so that we can decide whether learning can be improved by adding hidden neurons. Desired: Quantitative measure of unlearned useful information from etrain Automatic recognition of overfitting

9 Signal-to-noise ratio figure (SNRF)
Sampled data: function value + noise Error signal: approximation error component + noise component Useful signal Should be reduced Noise part Should not be learned It is assumed that the training data come with white Gaussian noise (WGN) at an unknown level. The error signal is the difference between the approximating function value and the training data. It contains two components: the approximation error, due to inaccuracy in the approximation, and the WGN in the training data. The question of overfitting becomes: is there still useful signal left in the error signal, or does noise dominate it? Assuming that the approximated function is continuous and that the noise is WGN, we can estimate the signal and noise levels in the error signal. The ratio of the signal energy level to the noise energy level is defined as the SNRF. The SNRF can be pre-calculated for WGN. If the SNRF of the error signal is comparable with that of WGN, there is little useful information left in the error signal, and the approximation error cannot be reduced any more. Assumption: continuous function & WGN as noise Signal-to-noise ratio figure (SNRF): signal energy/noise energy Compare SNRFe and SNRFWGN Should learning stop? Continue if there is useful signal left unlearned; stop if noise dominates the error signal.

10 Signal-to-noise ratio figure (SNRF) – one-dimensional case
Training data and approximating function Error signal The method to obtain the SNRF is first explained in the one-dimensional case. The figure on the left shows the training data and their fit with a quadratic polynomial. The error signal is shown on the right. Obviously, the error signal does not look like WGN: it contains a useful signal left unlearned and a noise component, with the noise level unknown. How can we measure the level of these two components? approximation error component + noise component

11 Signal-to-noise ratio figure (SNRF) – one-dimensional case
High correlation between neighboring samples of signals The energy of the error signal e is composed of the signal and noise components, and it can be estimated using the autocorrelation function. There is a high level of correlation between neighboring samples of the signal component. Due to the nature of WGN, the noise at one sample is independent of the noise at neighboring samples, so the correlation of the noise with its shifted copy is 0. Therefore, the correlation between the error signal ei and its shifted copy ei-1 approximates the signal energy, and the noise energy is estimated as the difference between the total error energy and this signal estimate.
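The lag-1 autocorrelation estimate above can be sketched numerically. This is an illustration of the idea as stated on the slide; the paper's exact normalization may differ:

```python
import numpy as np

def snrf_1d(e):
    """Estimate the signal-to-noise ratio figure of a 1-D error signal.
    Signal energy ~ correlation of e with its shifted copy (smooth signals
    correlate with their neighbors); WGN contributes energy but ~zero lag-1
    correlation, so the remainder estimates the noise energy."""
    e = np.asarray(e, dtype=float)
    signal = np.sum(e[1:] * e[:-1])   # lag-1 autocorrelation: signal energy estimate
    energy = np.sum(e * e)            # total energy of the error signal
    noise = energy - signal           # what the correlation does not explain
    return signal / noise
```

A smooth residual yields a large SNRF (useful signal left unlearned), while a white-noise residual yields a value near zero.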

12 Signal-to-noise ratio figure (SNRF) – one-dimensional case
The ratio of the signal level to the noise level, defined as the SNRF of the error signal, is thus obtained. To detect the existence of useful signal in e, the SNRF of e has to be compared with the SNRF of WGN estimated using the same number of samples. The average value and standard deviation of SNRFWGN can be estimated using Monte Carlo simulation.

13 Signal-to-noise ratio figure (SNRF) – one-dimensional case
The histogram of the SNRF of WGN, based on the Monte Carlo simulation, is shown in this figure. The stopping criterion can now be determined by testing the hypothesis that SNRFe and SNRFWGN come from the same population. The value of SNRFe at which the hypothesis is rejected constitutes a threshold below which one must stop increasing the number of hidden neurons to avoid overfitting. Statistical simulation shows that the 5% significance level can be approximated by the average value plus 1.7 standard deviations. Hypothesis test: 5% significance level
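The Monte Carlo threshold estimate can be sketched as follows, reusing the lag-1 SNRF estimate for pure WGN (an illustration under the slide's 1-D formulation; the factor k=1.7 mimics the quoted 5% significance level):

```python
import numpy as np

def snrf_wgn_threshold(n_samples, trials=2000, k=1.7, seed=0):
    """Monte Carlo estimate of mean + k*std of the SNRF of pure WGN,
    computed for the same number of samples as the error signal."""
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        e = rng.standard_normal(n_samples)   # one WGN realization
        c = np.sum(e[1:] * e[:-1])           # lag-1 correlation (signal estimate)
        vals.append(c / (np.sum(e * e) - c)) # its SNRF
    vals = np.asarray(vals)
    return vals.mean() + k * vals.std()
```

If the error signal's SNRF falls below this threshold, the hypothesis that it is indistinguishable from WGN cannot be rejected, and learning should stop.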

14 Signal-to-noise ratio figure (SNRF) – multi-dimensional case
Signal and noise level: estimated within neighborhood sample p M neighbors A similar correlation-based estimate of the signal and noise levels may be obtained in the multi-dimensional case. Signal values within a neighborhood are correlated, while the values of WGN are uncorrelated. The signal level at ep is computed using a weighted combination of the products of ep with each of its M nearest neighbors, with weights based on the Euclidean distances between the neighboring samples.
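The neighborhood-based signal estimate can be sketched as below. The inverse-distance weighting is an assumption for illustration; the paper defines its own weighting scheme:

```python
import numpy as np

def signal_level_md(x, e, M=4):
    """Estimate the overall signal level of error signal e sampled at points x
    (shape (N, d)) from weighted products with the M nearest neighbors."""
    x, e = np.asarray(x, dtype=float), np.asarray(e, dtype=float)
    total = 0.0
    for p in range(len(x)):
        d = np.linalg.norm(x - x[p], axis=1)
        d[p] = np.inf                         # exclude the sample itself
        nbrs = np.argsort(d)[:M]              # M nearest neighbors of sample p
        w = 1.0 / d[nbrs]                     # assumed inverse-distance weights
        w /= w.sum()
        total += e[p] * np.sum(w * e[nbrs])   # weighted neighbor products
    return total
```

For a smooth error signal the neighbor products reinforce each other; for WGN they average out toward zero.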

15 Signal-to-noise ratio figure (SNRF) – multi-dimensional case
All samples The overall signal level of e is then calculated as the sum of the signal levels at all samples. Finally, the SNRFe in an M-dimensional input space is computed from the overall signal and noise energies. Notice that for M=1 this result is identical to the SNRFe derived for the one-dimensional case.

16 Signal-to-noise ratio figure (SNRF) – multi-dimensional case
The average value and the standard deviation of SNRFWGN can be found from statistical simulation. The threshold that determines when overfitting is about to occur can be approximated by the average value plus 1.2 standard deviations. Note that this threshold is very close to the one used in the approximation of one-dimensional functions: for M=1, the multi-dimensional threshold ≈ the one-dimensional threshold.

17 Optimization using SNRF
SNRFe < threshold SNRFWGN Start with small network Train the MLP  etrain Compare SNRFe & SNRFWGN Add hidden neurons Noise dominates in the error signal, little information left unlearned, learning should stop Using the SNRF, we can quantitatively determine the amount of useful signal information left in the error signal. The SNRF of noise serves as a reference for the stopping criterion: when SNRFe is smaller than the threshold set by SNRFWGN, noise dominates the error signal and the learning process can be stopped. To optimize the number of hidden neurons in a NN, one starts with a network with a small number of hidden neurons, examines the training error signal and obtains its SNRFe, and compares SNRFe with the threshold. If it is higher than the threshold, more hidden neurons are added, until SNRFe indicates overfitting. Stopping criterion: SNRFe < threshold SNRFWGN
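The growing procedure just described can be sketched as a loop. This is an illustration of the slide's procedure; `train_fn` and `snrf_fn` are hypothetical callables (a training routine and an SNRF estimator) supplied by the user:

```python
import numpy as np

def grow_hidden_layer(train_fn, x, y, snrf_fn, threshold, max_hidden=100):
    """Add hidden neurons until the training error's SNRF falls below the
    WGN-derived threshold. Works from the training error only: no test set."""
    for n in range(1, max_hidden + 1):
        model = train_fn(x, y, n)
        e = y - model(x)               # training error signal
        if snrf_fn(e) < threshold:     # noise dominates: stop before overfitting
            return n
    return max_hidden
```

The key contrast with early stopping is that the whole dataset is used for training, and the stopping decision comes from the residual's statistics alone.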

18 Optimization using SNRF
Applied in optimizing the number of iterations in back-propagation training to avoid overfitting (overtraining) Set the structure of the MLP Train the MLP with back-propagation iterations  etrain Compare SNRFe & SNRFWGN Keep training with more iterations A similar process may be used to optimize other NN design parameters, e.g., the number of iterations in back-propagation training.

19 Experimental results noise-corrupted 0.4sin(x)+0.5

Optimizing number of iterations The proposed SNRF-based criterion for optimizing the number of iterations is tested on learning a noise-corrupted sine wave, 0.4sin(x)+0.5. The stopping criterion indicates that 10 iterations are enough for BP training. The figure shows that the MLP after 10 iterations of training approximates the function very well. Using too many iterations results in overtraining, as shown in the lower figure: the approximating function is affected by the noise in the data.

20 Optimization using SNRF
Optimizing order of polynomial This figure illustrates the proposed SNRF-based criterion in optimizing the number of basis functions. The SNRF stopping criterion was met at a polynomial of order 5, while the minimum of the early-stopping criterion was at a polynomial of order 12. The true generalization error was smaller at order 5 than at order 12. In addition, the testing error was not able to detect the large growth of the generalization error for polynomials of order higher than 18.

21 Experimental results Optimizing number of hidden neurons
two-dimensional function The proposed SNRF-based criterion for optimizing the number of hidden neurons is tested on learning a two-dimensional function. As the number of hidden neurons increases, SNRFe is expected to decrease. When SNRFe becomes lower than the threshold as more neurons are added, overfitting starts to occur, and at this point one should stop increasing the size of the hidden layer. After that point, the training error and testing error start to diverge from each other.

22 Experimental results Using 25 hidden neurons in the MLP, the function is optimally approximated: the difference between the desired function and the approximating function is small, with values around zero. With 35 hidden neurons, the MLP does not approximate the function as well as with 25 neurons. The difference between the desired function and the approximating function is larger, which means the network gives poor predictions on unseen testing data.

23 Experimental results Mackey-Glass database

Every 7 consecutive samples  the following sample MLP The proposed SNRF-based criterion for optimizing the number of hidden neurons was tested on two benchmark datasets for NN learning. The Mackey-Glass data is a time series with an unknown level of noise. The MLP is used to predict each 8th sample based on the prior 7 samples. The target function and the error signal e are continuous one-dimensional time-domain functions. The one-dimensional threshold measure shows that SNRFe becomes lower than the threshold once the number of hidden neurons reaches 4. The testing error likewise starts to increase and diverge from the training error after the number of hidden neurons exceeds 4.
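The "every 7 consecutive samples predict the following sample" setup can be sketched as a sliding-window transform (a generic illustration of the data preparation, not the paper's code):

```python
import numpy as np

def make_windows(series, width=7):
    """Build (inputs, target) pairs from a time series: each window of `width`
    consecutive samples is paired with the sample that follows it."""
    series = np.asarray(series, dtype=float)
    X = np.array([series[i:i + width] for i in range(len(series) - width)])
    y = series[width:]                       # the 8th sample for each 7-sample window
    return X, y
```

The resulting X rows feed the MLP's 7 inputs and y supplies its single output target.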

24 Experimental results WGN characteristic
Using 4 hidden neurons in the MLP, the approximated sequence almost overlaps the desired target function, and the obtained error signal is almost Gaussian noise; its autocorrelation shows the characteristic of WGN. This shows that the MLP with 4 hidden neurons approximates the function without overfitting.

25 Experimental results Puma robot arm dynamics database
8 inputs (positions, velocities, torques) angular acceleration MLP The next benchmark dataset is a multidimensional one, generated from a simulation of the dynamics of a Puma robot arm. The 8-dimensional data include the angular positions, angular velocities and torques of the robot arm. The task is to predict the angular acceleration of the robot arm's links from these inputs. The dataset is subject to an unknown level of noise. SNRFe indicates that overfitting starts to occur when the number of neurons reaches 40. Note that the testing error has many local minima, so using a local minimum as a stopping criterion would be ambiguous. Using a 6th-order polynomial fit to the testing error, we can see that the testing error starts to diverge from the training error at 40 neurons, which is very close to the prediction of the SNRF-based criterion.

26 Conclusions Quantitative criterion based on SNRF to optimize the number of hidden neurons in an MLP Detect overfitting using the training error only No separate test set required Criterion: simple, easy to apply, efficient and effective Optimization of other parameters of neural networks or fitting problems In this work, a method is proposed to optimize the number of hidden neurons in a NN to avoid overfitting in function approximation. The method uses a quantitative criterion based on the SNRF to detect overfitting automatically, using the training error only; it does not require a separate validation or testing set. The criterion is easy to apply and suitable for practical applications. The proposed SNRF-based criterion was verified on one-dimensional and multi-dimensional datasets. The same principle applies to the optimization of other parameters of neural networks, such as the number of iterations in back-propagation training or the number of hidden layers, and it can be applied to parametric optimization or model selection for other function approximation problems as well.

