Chapter 6: Multilayer Neural Networks (Sections 6.1-6.5, 6.8)
Outline:
Introduction
Feedforward Operation and Classification
Backpropagation Algorithm
Practical Techniques for Improving Backpropagation
Introduction
Goal: classify objects by learning a nonlinear mapping from the training data.
There are many problems for which linear discriminants are not sufficient to achieve minimum error.
In earlier methods, the central difficulty was the choice of the appropriate nonlinear functions.
A brute-force approach might be to select a complete basis set, such as all polynomials; such a classifier would require too many parameters to be determined from a limited number of training samples.
There is no automatic method for determining the nonlinearities when no prior information is provided to the classifier.
With multilayer neural networks, the form of the nonlinearity is learned from the training data.
Multilayer neural networks can, in principle, provide the optimal solution to an arbitrary classification problem.
Feedforward Operation and Classification
A three-layer neural network consists of an input layer, a hidden layer, and an output layer, interconnected by modifiable weights represented by links between the layers.
A single “bias unit” is connected to each unit other than the input units
Net activation at hidden unit j:
net_j = Σ_{i=1..d} x_i w_ji + w_j0 = Σ_{i=0..d} x_i w_ji ≡ w_j^t x,
where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes an input-to-hidden weight at hidden unit j. (In neurobiology, such weights or connections are called "synapses".)
Each hidden unit emits an output that is a nonlinear function of its activation: y_j = f(net_j).
Figure 6.1 shows a simple threshold function as the nonlinearity.
The function f(·) is also called the activation function or "nonlinearity" of a unit. There are more general activation functions with desirable properties.
Each output unit similarly computes its net activation based on the hidden-unit signals:
net_k = Σ_{j=1..nH} y_j w_kj + w_k0 = Σ_{j=0..nH} y_j w_kj ≡ w_k^t y,
where the subscript k indexes units in the output layer and nH denotes the number of hidden units.
The three-layer network with the weights listed in fig. 6.1 solves the XOR problem
When there is more than one output, the outputs are referred to as z_k. An output unit computes the nonlinear function of its net activation, emitting z_k = f(net_k).
In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x) and classify the input x according to the largest discriminant function g_k(x), k = 1, …, c.
The hidden unit y1 computes the boundary: x1 + x2 + 0.5 = 0; if x1 + x2 + 0.5 ≥ 0 then y1 = +1, otherwise y1 = -1.
The hidden unit y2 computes the boundary: x1 + x2 - 1.5 = 0; if x1 + x2 - 1.5 ≥ 0 then y2 = +1, otherwise y2 = -1.
The final output unit emits z1 = +1 if and only if y1 = +1 and y2 = -1, that is,
z1 = y1 AND NOT y2 = (x1 OR x2) AND NOT (x1 AND x2) = x1 XOR x2,
which provides the nonlinear decision of fig. 6.1 (a numerical sketch of this network appears below).
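A minimal numerical sketch of this XOR network, assuming inputs and unit outputs in {-1, +1}, a sign (threshold) activation, and illustrative weights consistent with the two boundaries above; the output-unit weights (0.7, -0.4, bias -1) are one choice that realizes "y1 AND NOT y2", not necessarily the exact values printed in fig. 6.1.

```python
import numpy as np

def sign(net):
    # threshold activation: +1 if net >= 0, else -1
    return np.where(net >= 0, 1.0, -1.0)

# hidden units: boundaries x1 + x2 + 0.5 = 0 and x1 + x2 - 1.5 = 0
W_hidden = np.array([[1.0, 1.0],    # weights into y1
                     [1.0, 1.0]])   # weights into y2
b_hidden = np.array([0.5, -1.5])

# output unit: fires (+1) only when y1 = +1 and y2 = -1
w_out = np.array([0.7, -0.4])
b_out = -1.0

def xor_net(x):
    y = sign(W_hidden @ x + b_hidden)   # hidden layer
    z = sign(w_out @ y + b_out)         # output layer
    return z

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))   # -1, +1, +1, -1
```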
General Feedforward Operation – case of c output units
Combining the hidden- and output-layer expressions, the network computes (equation 1):
g_k(x) ≡ z_k = f( Σ_{j=1..nH} w_kj f( Σ_{i=1..d} w_ji x_i + w_j0 ) + w_k0 ),  k = 1, …, c
Hidden units enable us to express more complicated nonlinear functions and thus extend the range of classifications.
The activation function does not have to be a sign function; it is often required to be continuous and differentiable.
We can allow the activation function in the output layer to differ from that in the hidden layer, or use a different activation for each individual unit. For now we assume that all activation functions are identical.
Expressive Power of Multilayer Networks (Universal Function Approximation Theorem)
Question: can every decision boundary be implemented by a three-layer network described by equation (1)?
Answer: yes. "Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units nH, proper nonlinearities, and weights."
Therefore any decision boundary can be implemented by a three-layer neural network.
These results are of greater theoretical than practical interest, since they do not provide an algorithm for finding suitable values of the number of hidden units or of the weights, nor do they imply that a three-layer network is optimal for approximating a given function.
Learning Nonlinear Functions
Given a training set, we must determine:
the network structure – the number of hidden nodes or, more generally, the network topology;
for a given structure, the weights that minimize the error.
For now, we focus on the latter.
Backpropagation Algorithm
Our goal now is to set the weights based on the training patterns and the desired outputs.
In a three-layer network, it is straightforward to understand how the output, and thus the error, depends on the hidden-to-output weights.
The desired outputs for nodes in the hidden layer, however, are not known.
Backpropagation algorithm: a generalized delta rule that provides an effective error for each hidden unit.
Networks have two modes of operation:
Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (feeding) the signals through the network in order to obtain the values at the output units (no cycles!).
Learning: supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output.
Network Learning
Let t_k be the k-th target (desired) output and z_k the k-th computed output, k = 1, …, c, and let w represent all the weights of the network.
The per-pattern training error:
J(w) = 1/2 Σ_{k=1..c} (t_k - z_k)^2 = 1/2 ||t - z||^2
The backpropagation learning rule is based on gradient descent. The weights are initialized with random values and are changed in the direction that reduces the error:
Δw = -η ∂J/∂w
Here η is the learning rate, which sets the relative size of the change in the weights, and after presenting the m-th pattern the weights are updated as
w(m + 1) = w(m) + Δw(m)
Hidden-to-output weights: by the chain rule,
∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj) = -δ_k (∂net_k/∂w_kj),
where the sensitivity of output unit k is defined as
δ_k ≡ -∂J/∂net_k = (t_k - z_k) f'(net_k)
and describes how the overall error changes with the unit's net activation.
Since net_k = w_k^t y, we have ∂net_k/∂w_kj = y_j.
Conclusion: the weight update (learning rule) for the hidden-to-output weights is
Δw_kj = η δ_k y_j = η (t_k - z_k) f'(net_k) y_j
Input-to-hidden weights
For the input-to-hidden weights, the chain rule gives
∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji)
Similarly to the preceding case, we define the sensitivity of a hidden unit:
δ_j ≡ f'(net_j) Σ_{k=1..c} w_kj δ_k,
which means that "the sensitivity at a hidden unit is simply the sum of the individual sensitivities at the output units weighted by the hidden-to-output weights w_kj, all multiplied by f'(net_j)".
Conclusion: the learning rule for the input-to-hidden weights is
Δw_ji = η δ_j x_i = η [Σ_{k=1..c} w_kj δ_k] f'(net_j) x_i
Both updates are implemented in the sketch below.
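A compact sketch of these two update rules for a single pattern, assuming a sigmoid-like activation f with derivative fprime and weight matrices that include a bias column; the names W_hidden, W_out, eta and the helper itself are illustrative, not from the slides.

```python
import numpy as np

f = np.tanh                                    # activation function f(net)
fprime = lambda net: 1.0 - np.tanh(net) ** 2   # its derivative f'(net)

def backprop_single_pattern(x, t, W_hidden, W_out, eta=0.1):
    """One gradient-descent step on the per-pattern error J = 1/2 ||t - z||^2.

    x: input vector with a leading 1 for the bias unit, shape (d+1,)
    t: target vector, shape (c,)
    W_hidden: input-to-hidden weights w_ji, shape (nH, d+1)
    W_out:    hidden-to-output weights w_kj, shape (c, nH+1)
    """
    # feedforward
    net_j = W_hidden @ x                       # hidden net activations
    y = np.concatenate(([1.0], f(net_j)))      # hidden outputs, bias unit prepended
    net_k = W_out @ y                          # output net activations
    z = f(net_k)                               # network outputs

    # sensitivities
    delta_k = (t - z) * fprime(net_k)                      # output units
    delta_j = fprime(net_j) * (W_out[:, 1:].T @ delta_k)   # hidden units (bias column excluded)

    # weight updates: Δw_kj = η δ_k y_j,  Δw_ji = η δ_j x_i
    W_out += eta * np.outer(delta_k, y)
    W_hidden += eta * np.outer(delta_j, x)
    return W_hidden, W_out
```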
The batch backpropagation algorithm
So far we have considered the error on a single pattern, but we want an error defined over the entire training set.
The total training error is the sum of the errors over the n individual patterns: J = Σ_{m=1..n} J_m.
In batch training, all the training patterns are presented first and their corresponding weight updates summed; only then are the actual weights in the network updated.
Epoch: one epoch corresponds to a single presentation of all patterns in the training set.
The "backpropagation" algorithm is so called because during training an error must be propagated from the output layer back to the hidden layer in order to learn the input-to-hidden weights.
If the weights w_kj to the output layer were ever all zero, the backpropagated error would also be zero and the input-to-hidden weights would never change!
Although our analysis was done for a three-layer network, the backpropagation algorithm generalizes directly to feedforward networks in which:
input units are connected directly to output units as well as to hidden units;
there are more than three layers of units;
different layers use different nonlinearities.
The stochastic backpropagation algorithm
begin initialize nH, w, stopping criterion θ, learning rate η, m ← 0
  do m ← m + 1
     x^m ← randomly chosen pattern
     w_ji ← w_ji + η δ_j x_i;  w_kj ← w_kj + η δ_k y_j
  until ||∇J(w)|| < θ
  return w
end
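A rough Python rendering of this stochastic loop, reusing the hypothetical backprop_single_pattern helper sketched earlier; eta and theta play the roles of η and θ, and the ||∇J(w)|| < θ test is approximated here by the size of the most recent weight change, which is an assumption for the sake of a short sketch.

```python
import numpy as np

def stochastic_backprop(X, T, nH, eta=0.1, theta=1e-4, max_iter=100000, rng=None):
    """Stochastic backpropagation: one randomly chosen pattern per step.

    X: training inputs, shape (n, d); T: targets, shape (n, c).
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    c = T.shape[1]
    # small random initial weights (bias column included)
    W_hidden = rng.uniform(-1, 1, size=(nH, d + 1)) / np.sqrt(d + 1)
    W_out = rng.uniform(-1, 1, size=(c, nH + 1)) / np.sqrt(nH + 1)

    for m in range(max_iter):
        i = rng.integers(n)                       # x^m <- randomly chosen pattern
        x = np.concatenate(([1.0], X[i]))         # prepend the bias input
        old = (W_hidden.copy(), W_out.copy())
        W_hidden, W_out = backprop_single_pattern(x, T[i], W_hidden, W_out, eta)
        change = np.sqrt(np.sum((W_hidden - old[0]) ** 2) + np.sum((W_out - old[1]) ** 2))
        if change < theta:                        # stand-in for ||grad J(w)|| < theta
            break
    return W_hidden, W_out
```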
Stopping criterion; learning curves
The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value. There are other stopping criteria that lead to better performance than this one.
Learning curves: before training starts, the error on the training set is high; as learning proceeds, the error tends to decrease monotonically.
Excessive training can lead to poor generalization, as the network implements a complex decision boundary "tuned" to the specific training data rather than to the general properties of the underlying distributions (overfitting).
Stopping criterion (continued)
The average error on an independent test set can decrease as well as increase during training.
A validation set is used to decide when to stop training; we do not want to overfit the network and reduce its generalization power.
We stop training at a minimum of the error on the validation set.
m-fold cross-validation (to choose the number of training epochs):
The training set is randomly divided into m disjoint sets of equal size n/m.
The training procedure is run m times, each time with a different set held out as a validation set. On each run, the number of epochs that yields the best performance on the validation set is recorded.
The mean k of these estimates is computed, and a final run of the backpropagation algorithm is performed on all n examples for k epochs.
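A sketch of this epoch-selection procedure, assuming hypothetical helpers train_for_epochs(X, T, epochs) and validation_error(model, X, T) that wrap the backpropagation training and evaluation described above.

```python
import numpy as np

def choose_epochs_by_cross_validation(X, T, m=5, candidate_epochs=range(1, 201), rng=None):
    """Estimate a good number of training epochs by m-fold cross-validation."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    folds = np.array_split(rng.permutation(n), m)        # m disjoint index sets

    best_epochs = []
    for held_out in folds:
        train_idx = np.setdiff1d(np.arange(n), held_out)
        errors = []
        for e in candidate_epochs:
            model = train_for_epochs(X[train_idx], T[train_idx], epochs=e)
            errors.append(validation_error(model, X[held_out], T[held_out]))
        best_epochs.append(candidate_epochs[int(np.argmin(errors))])

    k = int(round(np.mean(best_epochs)))                 # mean of the m estimates
    final_model = train_for_epochs(X, T, epochs=k)       # final run on all n examples
    return k, final_model
```

In practice one would train once per fold while recording the validation error after every epoch, rather than retraining from scratch for each candidate epoch count; the sketch favors clarity over efficiency.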
Time complexity of backpropagation
Let W be the total number of weights.
A single evaluation of the error function for a given input pattern requires O(W) operations (the number of weights is typically much larger than the number of units).
Backpropagation allows all W derivatives to be evaluated in O(W) operations.
Error Surfaces
The error surface of a multilayer network may contain many local minima and plateaus (regions where the error varies only slightly).
In practical applications the problem of local minima has not been found to be as severe as one might fear: the high-dimensional weight space may afford more ways for gradient descent to "get around" local minima. Error surfaces are still poorly understood.
Heuristics to overcome local minima: stochastic gradient descent, training multiple times with reinitialized weights.
Practical Techniques for Improving Backpropagation
Desired properties of the activation function:
Nonlinear.
Continuous and differentiable.
Saturating: having maximum and minimum output values keeps the weights and activations bounded and thus keeps training time limited.
Monotonic: if the activation is not monotonic and has multiple local maxima, additional local extrema may be introduced into the error surface.
Approximately linear for small values of net, enabling the system to implement a linear model when appropriate.
Practical Techniques: activation function
The class of sigmoid functions is the most widely used; it has all the desired properties.
The logistic sigmoid: f(net) = 1 / (1 + e^(-net)).
It is best to use a function that is centered on zero and anti-symmetric (an odd function), that is, f(-net) = -f(net).
Practical Techniques: activation function (continued)
For faster learning, use a sigmoid of hyperbolic-tangent form:
f(net) = a tanh(b net), with a = 1.716 and b = 2/3.
With these values, f(net) is nearly linear in the range -1 < net < +1; see Figure 6.14.
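A small sketch of this recommended activation and its derivative; the constants a = 1.716 and b = 2/3 follow the text above, while the function names are mine.

```python
import numpy as np

A, B = 1.716, 2.0 / 3.0

def f(net):
    # sigmoid of hyperbolic-tangent form: odd, saturating at roughly +/-1.716,
    # and nearly linear for -1 < net < +1
    return A * np.tanh(B * net)

def f_prime(net):
    # derivative, needed by the backpropagation updates
    return A * B / np.cosh(B * net) ** 2

print(f(1.0), f(-1.0))   # approximately +1 and -1, matching target values of +/-1
```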
Practical Techniques: scaling input
The values of one feature may be orders of magnitude larger than those of another.
Standardize the training patterns: shift and scale each feature so that, over the training set, it has zero mean and unit variance.
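A one-line sketch of this standardization; the variable names are illustrative and eps guards against a zero-variance feature.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Shift and scale each feature (column) to zero mean and unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std   # keep mean/std to apply to new data
```

The same mean and standard deviation computed on the training set should also be applied to validation and test patterns.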
Practical Techniques: target values for classification
Use a one-of-c representation for the targets.
With the suggested sigmoid output units, set target values of +1 for the target category and -1 for all the others (rather than the saturation values +1.716 and -1.716).
Practical Techniques: introducing noise
Generate virtual training patterns: add Gaussian noise to the true training points, keeping the category label unchanged.
E.g., for standardized inputs: noise ~ N(0, 0.01).
This improves generalization and helps avoid overfitting.
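A sketch of this virtual-pattern generation, assuming standardized inputs and treating the 0.01 in N(0, 0.01) as the noise variance (an assumption on my part); names are illustrative.

```python
import numpy as np

def add_noise_patterns(X, labels, copies=1, variance=0.01, rng=None):
    """Create virtual training patterns by adding Gaussian noise; labels unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = [X + rng.normal(0.0, np.sqrt(variance), size=X.shape) for _ in range(copies)]
    X_aug = np.concatenate([X] + noisy)
    labels_aug = np.concatenate([labels] * (copies + 1))
    return X_aug, labels_aug
```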
Practical Techniques: manufacturing training data
Use knowledge such as geometrical invariances among patterns.
Manufacture additional patterns by translating or rotating the given training patterns when translation- and rotation-invariant recognition is desired.
Practical Techniques: number of hidden units
The number of hidden units governs the expressive power of the net, and thus the complexity of the decision boundary.
A convenient rule of thumb is to choose the number of hidden units such that the total number of weights in the net is roughly n/10, where n is the number of training samples.
A more principled approach is to adjust the complexity of the network in response to the training data.
Practical Techniques: initializing weights
We seek initial weights that give fast and uniform learning, i.e., all weights reach their final equilibrium values at about the same time.
Initialize the weights to small random values drawn uniformly from -w < w < +w, chosen so that the units operate in the linear range -1 < net < +1.
For standardized inputs, choosing the input-to-hidden weights uniformly in -1/√d < w_ji < +1/√d (and the hidden-to-output weights in -1/√nH < w_kj < +1/√nH) makes the net activation distributed roughly between -1 and +1.
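A sketch of this initialization under the stated assumption of standardized inputs; the 1/√d and 1/√nH ranges follow the text above, and the array shapes include a bias column.

```python
import numpy as np

def init_weights(d, nH, c, rng=None):
    """Uniform initialization placing units in their linear range -1 < net < +1."""
    rng = np.random.default_rng() if rng is None else rng
    w_in = 1.0 / np.sqrt(d)
    w_hid = 1.0 / np.sqrt(nH)
    W_hidden = rng.uniform(-w_in, w_in, size=(nH, d + 1))   # input-to-hidden (with bias)
    W_out = rng.uniform(-w_hid, w_hid, size=(c, nH + 1))    # hidden-to-output (with bias)
    return W_hidden, W_out
```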
Practical Techniques: learning with momentum
Speeds up convergence.
Helps overcome local minima and plateaus (a sketch of the update follows).
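The slides do not give the update formula; a common form, which I assume here, blends the current gradient step with the previous weight change using a momentum coefficient alpha (for example 0.9).

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One gradient-descent step with momentum.

    w:          current weights (any-shaped array)
    grad:       gradient of J with respect to w for the current pattern or batch
    prev_delta: weight change applied at the previous step (start with zeros)
    """
    delta = -eta * grad + alpha * prev_delta   # momentum term smooths the trajectory
    return w + delta, delta                    # new weights and the new delta
```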
Practical Techniques: weight decay
A technique for simplifying the network and avoiding overfitting.
Start with a network that has "too many" weights and "decay" all weights during training.
After each weight update, every weight is decayed according to w_new = w_old (1 - ε), for some small ε.
Weights that are not needed become smaller and smaller, possibly to such a small value that they can be eliminated altogether.
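A sketch of this decay step using the w_new = w_old (1 - ε) rule above; the value of eps is illustrative.

```python
def decay_weights(weights, eps=0.001):
    """Shrink every weight toward zero after each weight update."""
    return [W * (1.0 - eps) for W in weights]   # works on a list of weight arrays

# example: W_hidden, W_out = decay_weights([W_hidden, W_out])
```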
Practical Techniques: learning with hints
Add information or constraints to improve the network.
Add output units that address an ancillary problem, one different from but related to the classification problem at hand; the expanded network is trained on both the classification problem of interest and the ancillary one.
E.g., add vowel and consonant units when classifying phonemes.
During classification, the hint units are not used.
Practical Techniques: stochastic or batch learning?
For most applications, especially those with large, redundant training sets, stochastic learning is typically faster than batch learning.
More than one hidden layer?
Not unless required by the problem: empirically, such networks are more prone to getting caught in undesirable local minima. The backpropagation algorithm itself is easily generalized to more layers.
Feature Mapping
The novel computational power of multilayer neural nets can be attributed to the nonlinear mapping from the input to the representation at the hidden units, since the hidden-to-output mapping amounts to a linear discriminant.
One interesting property of multilayer nets is that they may automatically discover feature groupings at the hidden layer that are useful for the classification task.
An Example: Face Recognition
From Tom Mitchell, "Machine Learning".
Data: camera images of the faces of various people in various poses; 20 people, 32 images each, with varying expressions, looking directions, backgrounds, etc.; 640 grayscale images at resolution 120x128, with grayscale values from 0 (black) to 255 (white).
Task: learn the direction in which the person is facing (left, right, straight ahead, upward).
Design Choices: input encoding, output encoding, network structure
Input encoding: encode the image as 30x32 pixel intensity values, each calculated as the mean of the corresponding high-resolution pixels; the pixel intensities 0 to 255 are linearly scaled to the range 0 to 1.
Output encoding: 1-of-c encoding with target values (0.9, 0.1, 0.1, 0.1).
Network structure: one hidden layer with three hidden units; logistic sigmoid units in both the hidden and output layers (a configuration sketch follows).
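A sketch of this configuration; the dimensions follow the text (30x32 = 960 inputs, 3 hidden units, 4 outputs), while the function and variable names are mine and the block-mean downsampling is an assumed reading of "mean of the corresponding high-resolution values".

```python
import numpy as np

sigmoid = lambda net: 1.0 / (1.0 + np.exp(-net))     # logistic units, outputs in (0, 1)

D_IN, N_HIDDEN, N_OUT = 30 * 32, 3, 4                # 960 inputs, 3 hidden, 4 directions

def encode_image(hi_res):
    """Downsample a 120x128 grayscale image to 30x32 by 4x4 block means, scale to [0, 1]."""
    blocks = hi_res.reshape(30, 4, 32, 4).mean(axis=(1, 3))   # mean of each 4x4 block
    return (blocks / 255.0).ravel()                           # flatten to 960 values

def target_vector(direction_index):
    """1-of-c targets: 0.9 for the true direction, 0.1 for the others."""
    t = np.full(N_OUT, 0.1)
    t[direction_index] = 0.9
    return t
```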
Learning algorithm parameters
Batch algorithm; learning rate 0.3, momentum 0.3.
Initial weights: input-to-hidden weights set to 0, hidden-to-output weights set to small random values.
Stopping criterion: the validation-set method.
Results: after training on 260 images, the network achieves an accuracy of 90% on a separate test set.
Learned hidden representations.