Neural Network Training
Classical Optimization Techniques
There is a strong similarity between NN training and classical optimization. Gradient descent is not a popular method within the optimization discipline; many other algorithms and mathematical methods are available to calculate the weights.
Objective Function
Called the "cost function" in optimization work. In neural networks, this is equivalent to the error term.
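Assuming the usual sum-of-squared-errors form (the slides do not give the expression explicitly), the error term minimized during training is

$$E(\mathbf{w}) \;=\; \tfrac{1}{2}\sum_{p}\sum_{k}\bigl(y_{pk} - \hat{y}_{pk}(\mathbf{w})\bigr)^{2}$$

where the sums run over training patterns p and output nodes k, y is the target, and ŷ is the network output.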
Classical Techniques Correlation Analysis
Order: refers to the order of the derivative of the error with respect to the weights. First order - back propagation is considered a gradient descent algorithm. Second order - makes use of the Hessian.
Correlation Analysis
The analysis of the relationship between data points x1, x2, x3 … xn and y1, y2, y3 … yn, where y = f(x). If a relationship exists, the points are called correlated. The sample correlation coefficient is frequently reported as R squared: 1.0 is perfectly correlated, 0.0 is not correlated. R relates the covariance of the samples to the product of their sample standard deviations.
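As a minimal sketch of the idea, the sample correlation coefficient (and R squared) can be computed directly from two data vectors; the arrays x and y below are illustrative, not data from the slides.

import numpy as np

def r_squared(x, y):
    """Squared sample correlation coefficient (R^2) between x and y.

    R is the covariance of x and y divided by the product of their
    sample standard deviations; R^2 near 1.0 means strongly correlated,
    near 0.0 means uncorrelated.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    r = cov / (x.std() * y.std())
    return r ** 2

# Example: a noisy linear relationship y = f(x)
x = np.linspace(0.0, 1.0, 50)
y = 3.0 * x + 0.1 * np.random.randn(50)
print(r_squared(x, y))   # close to 1.0 -> correlated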
Autocorrelation
Regression techniques assume the variables are random and normally distributed. The residual is the term that is left over: e = y - ŷ. The residuals should be normally distributed; if not, the variables are autocorrelated.
When the correlation follows a natural sequential order, it is called autocorrelation. Example: in hole drilling, tool wear means there is a probability that the diameter of the next hole drilled will be slightly smaller than the previous one; these are autocorrelated variables. Neural network training algorithms should correct for autocorrelated variables, and autocorrelated variables should be avoided as inputs to a NN. Example: humidity and wet-bulb temperature are correlated.
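A rough way to check for autocorrelated residuals is the lag-1 correlation of the residual sequence (a formal test such as Durbin-Watson would normally be used); the drifting series below is a made-up stand-in for the tool-wear example.

import numpy as np

def lag1_autocorrelation(residuals):
    """Correlation between e[t] and e[t-1]; near 0 for random residuals."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    return np.sum(e[1:] * e[:-1]) / np.sum(e * e)

t = np.arange(100)
drifting = 0.01 * t + 0.05 * np.random.randn(100)   # hypothetical drilled-hole errors
random_e = 0.05 * np.random.randn(100)
print(lag1_autocorrelation(drifting))   # large -> autocorrelated
print(lag1_autocorrelation(random_e))   # near zero -> not autocorrelated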
Hessian Matrix
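For an error function E(w) over weights w1 … wn, the Hessian is the matrix of second derivatives of the error with respect to pairs of weights:

$$H_{ij} \;=\; \frac{\partial^{2} E}{\partial w_i\,\partial w_j},
\qquad
H \;=\;
\begin{bmatrix}
\dfrac{\partial^{2} E}{\partial w_1^{2}} & \cdots & \dfrac{\partial^{2} E}{\partial w_1\,\partial w_n}\\[1.5ex]
\vdots & \ddots & \vdots\\[1ex]
\dfrac{\partial^{2} E}{\partial w_n\,\partial w_1} & \cdots & \dfrac{\partial^{2} E}{\partial w_n^{2}}
\end{bmatrix}$$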
Hessian
Why use the Hessian? Second-order information gives a better estimate of the search direction. The Hessian is also used in pruning algorithms. At a minimum, H is positive definite, which can be used to check whether the search has actually settled at a minimum.
Hessian can Improve Learning Rate
How do you select the learning rate? The stable range is 0 < η < 2/λmax, where λmax is the largest eigenvalue of the Hessian H.
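A minimal numpy sketch of this bound, assuming the Hessian H is available as a symmetric array (the 2x2 matrix here is purely illustrative):

import numpy as np

def max_stable_learning_rate(H):
    """Upper bound on the learning rate: eta must satisfy 0 < eta < 2/lambda_max(H)."""
    eigenvalues = np.linalg.eigvalsh(H)     # H is symmetric, so use eigvalsh
    return 2.0 / eigenvalues.max()

H = np.array([[4.0, 1.0],
              [1.0, 3.0]])
print(max_stable_learning_rate(H))          # choose eta below this value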
Gradient Descent
A small step size gives slow convergence; there is no momentum term.
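The plain gradient-descent update, with no momentum term, is

$$\mathbf{w}^{(k+1)} \;=\; \mathbf{w}^{(k)} - \eta\,\nabla E\!\left(\mathbf{w}^{(k)}\right)$$

(a momentum variant would add a term proportional to $\mathbf{w}^{(k)} - \mathbf{w}^{(k-1)}$, which this basic form omits).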
Newton's Method, a Second-Order Method
For a quadratic function, Newton's method converges in one iteration.
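In its standard form, the Newton update is

$$\Delta\mathbf{w} \;=\; -H^{-1}\,\nabla E(\mathbf{w})$$

For a quadratic error $E(\mathbf{w}) = \tfrac{1}{2}\mathbf{w}^{\mathsf T} A\,\mathbf{w} - \mathbf{b}^{\mathsf T}\mathbf{w}$ the gradient is $A\mathbf{w} - \mathbf{b}$ and the Hessian is $A$, so a single step lands exactly on the optimum $A^{-1}\mathbf{b}$, which is why one iteration suffices.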
Levenberg-Marquardt
A second-order method. The search direction is a linear combination of gradient descent and Newton's method: Newton's method close to the solution, gradient descent everywhere else. Lambda is adjusted based on the eigenvalues of the Hessian.
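A minimal sketch of the blended update (the rule for adjusting lambda from the Hessian eigenvalues, as described above, is not shown here and is left to the caller):

import numpy as np

def lm_step(gradient, H, lam):
    """One Levenberg-Marquardt weight update: dw = -(H + lam*I)^-1 g.

    Large lam  -> behaves like gradient descent with a small step.
    Small lam  -> behaves like Newton's method (used close to the solution).
    """
    n = len(gradient)
    return -np.linalg.solve(H + lam * np.eye(n), gradient)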
Conjugate Gradient Descent
A complex conjugate: the conjugate of a complex number is that number with the sign of its imaginary part reversed. The conjugate of a + jb is a - jb.
Conjugate Gradient Descent
A first-order method that is as fast as second-order methods. It is a frequently used method for problems with many variables. DeltaV uses conjugate gradient.
Conjugate Gradient Descent constructs a series of line searches across the error surface: it locates the direction of steepest descent, projects a straight line in that direction, and then locates a minimum along this line. The directions of the line searches (the conjugate directions) are chosen to try to ensure that the directions that have already been minimized stay minimized. The conjugate directions are calculated on the assumption that the error surface is quadratic. If the algorithm discovers that the current line-search direction isn't actually downhill, it simply calculates the line of steepest descent and restarts the search in that direction. Once a point close to a minimum is found, the quadratic assumption holds and the minimum can be located very quickly.
Conjugate Gradient Descent
The error gradient is calculated as the sum of the error gradients on each training case. The initial search direction is the direction of steepest descent, and the search direction is then updated using the Polak-Ribière formula (both expressions are written out below).
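In their standard form, the initial direction and the Polak-Ribière update are

$$\mathbf{d}_0 = -\mathbf{g}_0,
\qquad
\beta_k = \frac{\mathbf{g}_{k+1}^{\mathsf T}\left(\mathbf{g}_{k+1} - \mathbf{g}_k\right)}{\mathbf{g}_k^{\mathsf T}\mathbf{g}_k},
\qquad
\mathbf{d}_{k+1} = -\mathbf{g}_{k+1} + \beta_k\,\mathbf{d}_k$$

where $\mathbf{g}_k = \nabla E(\mathbf{w}_k)$ is the error gradient and $\mathbf{d}_k$ the search direction at iteration k.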
Conjugate Gradient Descent
If the search direction is not downhill, the algorithm restarts using the line of steepest descent. It restarts anyway after W directions (where W is the number of weights), as at that point the conjugacy has been exhausted.
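A compact sketch of the whole procedure under these assumptions (the crude backtracking line search stands in for the line searches on the earlier slides; error and grad are user-supplied functions):

import numpy as np

def conjugate_gradient_descent(error, grad, w, n_iters=200):
    """Polak-Ribiere conjugate gradient with restarts (illustrative sketch)."""
    def line_search(w, d, alpha=1.0):
        # Shrink the step until the error actually decreases along d.
        while error(w + alpha * d) > error(w) and alpha > 1e-12:
            alpha *= 0.5
        return alpha

    g = grad(w)
    d = -g                                   # initial direction: steepest descent
    W = len(w)                               # restart period = number of weights
    for k in range(n_iters):
        if g @ d >= 0 or (k > 0 and k % W == 0):
            d = -g                           # not downhill, or conjugacy exhausted: restart
        w = w + line_search(w, d) * d
        g_new = grad(w)
        if np.linalg.norm(g_new) < 1e-10:    # converged
            break
        beta = g_new @ (g_new - g) / (g @ g)     # Polak-Ribiere coefficient
        d = -g_new + beta * d
        g = g_new
    return w

# Example on a simple quadratic bowl: the minimum is at the origin
A = np.array([[3.0, 0.5], [0.5, 1.0]])
error = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w
print(conjugate_gradient_descent(error, grad, np.array([2.0, -1.5])))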
Conjugate Gradient Descent
Locates the optimum of a quadratic function in two iterations; for a quadratic in n variables, conjugate gradient converges in at most n line searches.
Annealing
An optimization technique based on a physical analogy: cooling a liquid slowly results in a larger crystal. It relies on the Boltzmann statistic, which is based on temperature; "temperature" is frequently used in neural network terminology.
Annealing
The search will not get caught in a local minimum: adding noise to the error function assures a good probability that the search will jump out of a local minimum and locate the global minimum. It is a very slow, stochastic method.
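A minimal sketch of the annealing idea applied to the weights (error is a user-supplied function; the cooling rate, step size, and iteration count are illustrative choices, not values from the slides):

import numpy as np

def anneal_weights(error, w, T0=1.0, cooling=0.99, n_iters=5000, step=0.1):
    """Simulated-annealing-style weight search (illustrative sketch).

    Downhill moves are always accepted; uphill moves are accepted with
    Boltzmann probability exp(-dE / T), so the search can jump out of
    local minima. T is lowered slowly, mirroring the slow cooling of the
    physical analogy.
    """
    rng = np.random.default_rng(0)
    T = T0
    e = error(w)
    best_w, best_e = w.copy(), e
    for _ in range(n_iters):
        candidate = w + step * rng.standard_normal(w.shape)
        e_new = error(candidate)
        dE = e_new - e
        if dE < 0 or rng.random() < np.exp(-dE / T):
            w, e = candidate, e_new
            if e < best_e:
                best_w, best_e = w.copy(), e
        T *= cooling                       # slow cooling schedule
    return best_w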
Annealing; White Noise
Random noise is added to each weight. This noise makes the network weights constantly "shake" as training progresses; its purpose is to help the network jump out of a gradient direction that leads to a local minimum in the weight surface.
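A one-line sketch of the weight-jitter idea described above (decaying noise_scale over training is an assumption, not something specified on the slide):

import numpy as np

def noisy_gradient_step(w, gradient, eta, noise_scale, rng):
    """Gradient step plus white noise on every weight, so the weights keep
    'shaking' and can escape directions that lead only to a local minimum."""
    return w - eta * gradient + noise_scale * rng.standard_normal(w.shape)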
Pruning Algorithms
It is best to use as small a network as possible to avoid overfitting the data: train an over-specified network, then selectively remove weights and retrain.
ANN should be trained with no unnecessary weights in the network
One method to remove unwanted weights is to train an over-specified network and then "prune" weights. How can you prune? The Optimal Brain Surgeon (OBS) method uses the Hessian and defines a "saliency" for each weight (the standard expression is written out below). It seeks the weight adjustment dw that sets one weight to zero while causing the least change in error.
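In Hassibi and Stork's formulation, the saliency of weight q and the corresponding adjustment are

$$L_q \;=\; \frac{w_q^{2}}{2\,[H^{-1}]_{qq}},
\qquad
\delta\mathbf{w} \;=\; -\frac{w_q}{[H^{-1}]_{qq}}\,H^{-1}\mathbf{e}_q$$

where $\mathbf{e}_q$ is the unit vector selecting weight q; this $\delta\mathbf{w}$ zeroes weight q with the least increase in error.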
OBS begins by locating the weight index q with the smallest saliency L, deletes that weight, adjusts the remaining weights by dw, and retrains. Hassibi, B., and Stork, D. G., "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon," Advances in Neural Information Processing Systems 5, eds. S. J. Hanson et al., Morgan Kaufmann, San Mateo, CA, 1993.
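A sketch of one pruning pass under those formulas (w is the flattened weight vector and H its Hessian; retraining afterwards is assumed):

import numpy as np

def obs_prune_one_weight(w, H):
    """One Optimal Brain Surgeon pass: zero the least-salient weight and
    adjust the rest (sketch of the Hassibi-Stork procedure)."""
    H_inv = np.linalg.inv(H)
    saliency = w ** 2 / (2.0 * np.diag(H_inv))
    q = np.argmin(saliency)                        # weight with smallest L_q
    dw = -(w[q] / H_inv[q, q]) * H_inv[:, q]       # adjustment for all weights
    w_pruned = w + dw
    w_pruned[q] = 0.0                              # exactly zero after the update
    return w_pruned, q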