Going Backwards In The Procedure and Recapitulation of System Identification By Ali Pekcan 65570B.


1 Going Backwards In The Procedure and Recapitulation of System Identification
By Ali Pekcan 65570B

2 T-61.6050 Neural Networks for Modelling and Control of Dynamic Systems
Outline:
- Going Backwards in the Procedure
- Training the Network Again
- Finding the Optimal Network Architecture
- Optimal Brain Surgeon
- Optimal Brain Damage
- Section Summary
- Recapitulation of System Identification
12/11/2018 T-61.6050 Neural Networks for Modelling and Control of Dynamic Systems

3 Going Backwards in the Procedure
It is necessary to go back in the procedure ("stepping back") if a model is not accepted immediately: dissatisfaction with the network model suggests that a better model may be possible.

4 Training the Network Again
Retrain the network with a different initialization of the weights, or retrain it according to another criterion. The generalization ability of a model structure depends on the weight decay and on reaching the minimum that is optimal for generalization. To determine the optimal weight decay and find the global minimum: if a test set is available, trial and error is perhaps the best way; if not, it is useful to evaluate the average generalization error of the trained networks in terms of the FPE and LULOO estimates.
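As a minimal sketch of the FPE estimate mentioned above, Akaike's classical (unregularized) formula inflates the training error by a factor depending on the number of samples N and the number of weights d; the regularized variant replaces d with the effective number of weights:

```python
def fpe_estimate(train_mse, n_samples, n_weights):
    """Akaike's Final Prediction Error (unregularized case).

    Inflates the training error to estimate the average generalization
    error: FPE = (N + d) / (N - d) * V_N.  In the regularized case,
    n_weights would be replaced by the effective number of weights.
    """
    assert n_samples > n_weights, "FPE needs more samples than weights"
    return (n_samples + n_weights) / (n_samples - n_weights) * train_mse
```

With 100 samples and 20 weights, a training MSE of 1.0 is inflated to 120/80 = 1.5, reflecting the optimism of the training error as a generalization estimate.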

5 Finding the Optimal Network Architecture (1)
Two fundamental choices: the contents of the regression vector, and which weights to include in the network architecture. The goodness of a structure can be ranked by its average generalization error. Common procedure: select a particular structure for the regression vector (NNARX, NNARMAX, NNOE, ...), then try to determine the best possible network architecture for this choice of regressors. The more training data available, the less important the architecture determination becomes.

6 Finding the Optimal Network Architecture (2)
For a large amount of training data, fully connected networks can be used, and architecture selection reduces to choosing the number of hidden units; gradually increasing the number of hidden units while evaluating the test error is a good way to choose. If the training set is very limited, the most essential weights should be selected with pruning algorithms. Principle of pruning algorithms: starting from an architecture large enough to describe the system, eliminate the weights one at a time until the optimal architecture is reached.

7 T-61.6050 Neural Networks for Modelling and Control of Dynamic Systems
Figure: comparison of a fully connected NNARX(2,2,1) model structure with 91 weights to a model structure obtained by pruning the network until it contained 19 weights.

8 Finding the Optimal Network Architecture (3)
Eliminating the weights (ranking): minimization of the unregularized criterion gives an asymptotically Gaussian distributed estimate of the weights, with a covariance matrix estimated from the data. Obvious candidates for elimination are naturally those weights for which zero lies within two to three standard deviations of the estimate (Hassibi and Stork, 1993). The candidates are ranked by comparing each weight to its standard deviation: if the ratio is small, the variance of the weight is large relative to its value, and removing it gives the maximum decrease in the average generalization error.
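The ranking rule above can be sketched as follows, assuming a weight vector and an estimated covariance matrix of the weight estimates are given (the function name and threshold parameter are illustrative, not from the slides):

```python
import numpy as np

def rank_pruning_candidates(weights, covariance, n_std=2.0):
    """Rank weights for elimination by comparing each estimate
    to its standard deviation.

    A weight whose magnitude lies within `n_std` standard deviations
    of zero is a pruning candidate; the smaller the ratio
    |w_j| / std_j, the larger the weight's variance relative to its
    value, and the larger the expected decrease in the average
    generalization error when it is removed.
    """
    std = np.sqrt(np.diag(covariance))
    ratio = np.abs(weights) / std
    candidates = np.flatnonzero(ratio < n_std)
    # Best candidate (smallest ratio) first.
    return candidates[np.argsort(ratio[candidates])]
```

For example, with weights [5.0, 0.1, 1.0] and unit variances, only the last two weights fall within two standard deviations of zero, and the 0.1 weight is ranked first for elimination.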

9 Optimal Brain Surgeon (OBS) -(1)
OBS was first derived for unregularized mean-square-error criteria and was not directly motivated by generalization-error considerations (Hassibi and Stork, 1993). The extension to the regularized case, and the connection to the average generalization error, are due to Pedersen and Hansen (1995). For a single-output network model trained to its minimum, the FPE estimate of the average generalization error follows from the unregularized criterion; if a regularization term is included, little difference in accuracy is expected between the two estimates.

10 Optimal Brain Surgeon (OBS) -(2)
The idea of OBS is to eliminate the weight that gives the maximum decrease in the FPE estimate; each weight affects the FPE. Let Mj specify the model structure with the jth weight removed. The saliency for weight j is the change in the training error when the weight is eliminated, and the generalization-based saliency for weight j is the corresponding change in the average generalization error. OBS eliminates the weight with the smallest generalization-based saliency.
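The equations on this slide did not survive transcription. For the unregularized case, the classical OBS quantities of Hassibi and Stork (1993) can be written as follows, with H the (Gauss-Newton) Hessian of the criterion, theta-hat the trained weights, and e_j the jth unit vector (a hedged reconstruction in standard notation, not necessarily the slide's exact symbols):

```latex
% Saliency: change in the training error when weight j is set to zero
\zeta_j \;=\; \frac{\hat{\theta}_j^{\,2}}{2\,[H^{-1}]_{jj}} ,
\qquad
% Adjustment of the remaining weights after eliminating weight j
\delta\theta \;=\; -\,\frac{\hat{\theta}_j}{[H^{-1}]_{jj}}\, H^{-1} e_j .
```

The generalization-based saliency then adds the change in the FPE penalty term caused by the reduced number of (effective) weights.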

11 Optimal Brain Surgeon (OBS) -(3)
In the regularized case it is necessary to find the minimizer of the regularized criterion for the reduced structure Mj, where the FPE estimate involves the effective number of weights in Mj, and to compute the saliency from this minimizer.

12 Optimal Brain Surgeon (OBS) -(4)
(Derivation of the generalization-based saliency.)

13 Optimal Brain Surgeon (OBS) -(5)
Without weight decay the first two terms vanish, and the saliency reduces to an expression analogous to the ranking ratio introduced earlier.

14 Optimal Brain Surgeon (OBS) -(6)
Problem: in over-parameterized networks the Hessian of the unregularized criterion, R, is generally ill-conditioned, so the inverse Hessian should be approximated with the recursion used in the recursive Gauss-Newton algorithm. Weight decay improves the robustness of the OBS procedure. The effective number of weights can then be approximated accordingly, and the inverse Hessian obtained as the Schur complement of Q.
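The approximation formula on this slide was also lost in transcription. One standard expression for the effective number of weights under a weight-decay term with strength alpha, written in terms of the eigenvalues lambda_i of the Gauss-Newton Hessian R, is (a reconstruction of the general idea, not necessarily the slide's exact formula):

```latex
d_{\mathrm{eff}}
\;=\; \operatorname{tr}\!\left[ R \,(R + \alpha I)^{-1} \right]
\;=\; \sum_{i} \frac{\lambda_i}{\lambda_i + \alpha}
\;\le\; d .
```

Each weight counts fully when its eigenvalue satisfies lambda_i >> alpha and contributes almost nothing when lambda_i << alpha, which is why weight decay both regularizes the fit and stabilizes the Hessian inversion needed by OBS.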

15 Optimal Brain Surgeon (OBS) -(7)
For a two-layer network, only input-to-hidden-layer weights are eliminated in the beginning. When a hidden unit has only one weight leading to it, the saliency for the entire unit is determined, and if this unit saliency is smaller than any other saliency, the whole unit is eliminated. It is incorrect to calculate the unit saliency from the two weights separately; instead, let J = (j1, j2) contain the locations of the two weights connected to the unit, and compute the change in FPE obtained by fixing both weights to zero jointly.

16 Complete procedure for pruning by (OBS)
1. Select a sufficiently large model structure and train the network with a small weight decay.
2. Compute the Gauss-Newton Hessian and invert it.
3. Evaluate the FPE estimate, and the test error if a test set is available. If both quantities have passed their respective minima, go to Step 7.
4. Compute the saliency for each weight and the effective number of parameters in the reduced network, and combine them into generalization-based saliencies.
5. Search for the smallest generalization-based saliency and eliminate the concerned weight(s) from the parameter vector. Determine the remaining weights.
6. Retrain (Levenberg-Marquardt iterations), store the weights, and go to Step 2.
7. Re-establish the architecture that was optimal as assessed by the test error and FPE estimate, and retrain the network without weight decay.
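The core of one elimination step (Steps 4-5 above, unregularized case) can be sketched on a linear least-squares model, where the Gauss-Newton Hessian is exact; for a network the same formulas apply with the network's Gauss-Newton Hessian. Function and variable names are illustrative:

```python
import numpy as np

def obs_prune_step(theta, H_inv, active):
    """One OBS elimination step (unregularized sketch).

    Among the active weights, find the one with the smallest saliency
    s_j = theta_j^2 / (2 * H_inv[j, j]), set it to zero, and adjust
    the remaining weights by delta = -theta_j / H_inv[j, j] * H_inv[:, j].
    """
    saliency = theta**2 / (2.0 * np.diag(H_inv))
    saliency[~active] = np.inf            # never re-prune a removed weight
    j = int(np.argmin(saliency))
    theta_new = theta - theta[j] / H_inv[j, j] * H_inv[:, j]
    theta_new[j] = 0.0                    # exact zero despite rounding
    active = active.copy()
    active[j] = False
    return theta_new, active, j
```

On a noise-free linear model y = X @ [1, 0, -2], the step correctly removes the second (irrelevant) weight and leaves the other two essentially unchanged; in the full procedure the step is followed by retraining and a new FPE evaluation.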

17 Optimal Brain Damage (OBD)
OBD is based on the assumption that all off-diagonal elements of the Hessian matrix can be neglected. Unlike OBS, OBD only provides guidelines for which weights to prune, not for how to re-estimate the remaining ones: when OBD removes a weight, it leaves the remaining weights untouched instead of modifying them as OBS does. The saliency of weight j follows from the diagonal Hessian approximation, and removing more than one weight at a time is allowed. OBD requires less computation than OBS and often performs similarly. The most demanding part is finding an initial model structure that is large enough to describe the system. The generalization-based saliency of weight j is obtained by adding the corresponding change in the FPE penalty term.
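Under the diagonal-Hessian assumption, the OBD saliency has the classical form s_j = H_jj * theta_j^2 / 2 (LeCun et al., 1990), which can be sketched as follows (function names are illustrative):

```python
import numpy as np

def obd_saliencies(theta, H_diag):
    """OBD saliency under the diagonal-Hessian assumption:
    s_j = 0.5 * H_jj * theta_j^2.  Off-diagonal Hessian terms are
    neglected, so pruning simply zeroes weights without adjusting
    the remaining ones."""
    return 0.5 * H_diag * theta**2

def obd_prune(theta, H_diag, n_remove):
    """Zero out the n_remove weights with the smallest saliencies
    (OBD allows removing several weights per step)."""
    idx = np.argsort(obd_saliencies(theta, H_diag))[:n_remove]
    pruned = theta.copy()
    pruned[idx] = 0.0
    return pruned, idx
```

Note the contrast with OBS: here only a Hessian diagonal is needed and no weight update is computed, which is what makes OBD cheaper.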

18 T-61.6050 Neural Networks for Modelling and Control of Dynamic Systems
Section Summary
Retraining the network: the minimum found can be quite far in value from the global minimum, so the network should be retrained with different initializations of the weights; it is not necessary to validate first.
Selecting the optimal model structure: OBS and OBD both prune the network weights one at a time until the optimal architecture is reached, with one or more average generalization error estimates (FPE, LULOO) as the stopping criterion. OBD only prunes weights; OBS prunes weights and also recommends how to change the values of the remaining weights in the network.
Continuing the experiment: if it is not possible to describe the system satisfactorily with the available training data, it may be necessary to redo the experiment.

19 Recapitulation of System Identification
Experiment collect as much data as possible, with visual inspection Model structure selection Small amount of data fully connected NNARX models are recommended Medium amount of data (bias and variance error is found) regularization by weight decay and pruning methods are discussed. Large amount of data (without regularization term fully connected nets) Estimate a model Select a criterion specify how the weights should be determined from data. Choose an iterative search method for minimizing the criterion. (training algorithm) Ex: Levenberg Marquardt, Quasi Newton methods 12/11/2018 T Neural Networks for Modelling and Control of Dynamic Systems

20 Recapitulation of System Identification
Validation:
- Correlation tests to investigate whether the residuals are white and independent of past information
- One-step and k-step-ahead predictions, estimation of prediction intervals
- Estimation of the average generalization error

21 Recapitulation of System Identification
Going backwards in the procedure:
Retrain the network: networks usually have more than one local minimum, so retrain 5-7 times with different initializations of the weights.
Determine the optimal weight decay for regularization: use trial and error, or a combination of FPE, LULOO, and test error.
Determine the optimal model structure by pruning: choose a weight decay somewhat smaller than the optimal one, then prune one weight at a time, each elimination followed by retraining.

22 -End-

