Neural Networks
AI Lectures by Engr. Q. Zia
Contents
1. Multi-Layer Neural Networks
   1.1 Back-propagation in M-layer NN
   1.2 Improving Performance of Back-propagation
2. Recurrent Networks
   2.1 Hopfield Networks
   2.2 Bidirectional Associative Memories (BAMs)
3. Unsupervised Learning Networks
   3.1 Kohonen Maps
   3.2 Kohonen Map Example
   3.3 Hebbian Learning
4. Evolving Neural Networks
Multilayer Neural Networks
Most real-world problems are not linearly separable, so a single perceptron is not a suitable option for solving them. Neural networks consist of a number of neurons connected together, usually arranged in layers. Multilayer perceptrons are capable of modeling more complex functions, including ones that are not linearly separable, such as the exclusive-OR function. A typical architecture for a multilayer neural network is shown in Figure 11.4.
The network shown in Figure 11.4 is a feed-forward network consisting of three layers. The first layer is the input layer. Each node (or neuron) in this layer receives a single input signal. In fact, the nodes in this layer are usually not neurons; they simply pass input signals on to the nodes in the next layer, which in this case is a hidden layer. Each input signal is passed to each of the nodes in the hidden layer, and the output of each node in this layer is passed to each node in the final layer, which is the output layer. The output layer carries out the final stage of processing and sends out output signals.
M-Layer: Backpropagation
Multilayer neural networks learn in much the same way as single perceptrons. Each neuron has weights associated with its inputs, so there is a far greater number of weights to be adjusted when an error occurs on a piece of training data. The problem is how to assign blame (or credit) to the various weights. One method that is commonly used is backpropagation. Multilayer backpropagation networks usually use the sigmoid function.
The sigmoid function is defined as follows:

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

This function is easy to differentiate, because

$\dfrac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$
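As a quick sketch in Python (NumPy assumed; the function names are our own):

```python
import numpy as np

def sigmoid(x):
    """The sigmoid activation function: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative of the sigmoid, using sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```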
The weights are initialized to small values in the range from -0.5 to 0.5. Alternatively, the weights can be distributed over the range from -2.4/n to 2.4/n, where n is the number of inputs to the input layer. Each iteration of the algorithm involves first feeding data through the network from the inputs to the outputs. The next phase, which gives the algorithm its name, involves feeding errors back from the outputs to the inputs. These error values feed back through the network, making changes to the weights of nodes along the way. The algorithm repeats in this way until the outputs produced for the training data are sufficiently close to the desired values; in other words, until the error values are sufficiently small.
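A minimal sketch of these two initialization options (NumPy assumed; both are written as uniform draws over the stated ranges):

```python
import numpy as np

rng = np.random.default_rng()

def init_uniform(n_in, n_out):
    """Small random weights drawn from the range [-0.5, 0.5]."""
    return rng.uniform(-0.5, 0.5, size=(n_in, n_out))

def init_scaled(n_in, n_out):
    """Weights drawn from [-2.4/n, 2.4/n], where n is the number of inputs."""
    bound = 2.4 / n_in
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```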
Because the sigmoid function cannot actually reach 0 or 1, it is usual to accept a value such as 0.9 as representing 1 and 0.1 as representing 0. Now we shall see the formulae that are used to adjust the weights in the backpropagation algorithm. We will consider a network of three layers and will use i to represent nodes in the input layer, j to represent nodes in the hidden layer, and k to represent nodes in the output layer. Hence, for example, $w_{ij}$ refers to the weight of a connection between a node in the input layer and a node in the hidden layer.
The function that is used to derive the output value for a node j in the network is as follows:

$y_j = \dfrac{1}{1 + e^{-X_j}}, \qquad X_j = \sum_{i=1}^{n} x_i w_{ij} - \theta_j$

where n is the number of inputs to node j; $w_{ij}$ is the weight of the connection between each node i and node j; $\theta_j$ is the threshold value being used for node j, which is set to a random value between 0 and 1; $x_i$ is the input value for input node i; and $y_j$ is the output value produced by node j.
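As a sketch in Python (NumPy assumed; the names are illustrative):

```python
import numpy as np

def node_output(x, w, theta):
    """Output of node j: sigmoid of the weighted input sum minus threshold."""
    X_j = np.dot(x, w) - theta  # weighted sum of inputs, less the threshold
    return 1.0 / (1.0 + np.exp(-X_j))
```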
Once the inputs have been fed through the network to produce outputs, an error gradient is calculated for each node k in the output layer. The error signal for k is defined as the difference between the desired value and the actual value for that node:

$e_k = d_k - y_k$

where $d_k$ is the desired value for node k, and $y_k$ is the actual value in this iteration. The error gradient for output node k is defined as the error value for this node multiplied by the derivative of the activation function:

$\delta_k = \dfrac{\partial y_k}{\partial x_k} \cdot e_k$

where $x_k$ is the weighted sum of the input values to node k.
Because y is defined as a sigmoid function of x, we can use the formula that was given above for the derivative of the sigmoid function to obtain the following formula for the error gradient:

$\delta_k = y_k (1 - y_k)(d_k - y_k)$

Similarly, we calculate an error gradient for each node j in the hidden layer, as follows:

$\delta_j = y_j (1 - y_j) \sum_{k=1}^{n} w_{jk} \delta_k$

where n is the number of nodes in the output layer, and thus the number of outputs from each node in the hidden layer.
Now each weight in the network, $w_{ij}$ or $w_{jk}$, is updated according to the following formula:

$w_{ij} \leftarrow w_{ij} + \alpha\, x_i\, \delta_j, \qquad w_{jk} \leftarrow w_{jk} + \alpha\, y_j\, \delta_k$

where $x_i$ is the input value to input node i, and $\alpha$ is the learning rate, which is a positive number below 1 and which should not be too high.
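Putting these formulae together, here is a hedged sketch of a single training step for the three-layer network described above (NumPy assumed; all variable names are our own, and the thresholds are passed in as fixed vectors for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, d, W_ij, W_jk, theta_j, theta_k, alpha=0.1):
    """One forward pass and one weight update, following the formulae above."""
    # Forward pass: input layer -> hidden layer -> output layer.
    y_j = sigmoid(x @ W_ij - theta_j)    # hidden-layer outputs
    y_k = sigmoid(y_j @ W_jk - theta_k)  # output-layer outputs

    # Error gradients: delta_k for output nodes, delta_j for hidden nodes.
    delta_k = y_k * (1 - y_k) * (d - y_k)
    delta_j = y_j * (1 - y_j) * (W_jk @ delta_k)

    # Weight updates: w <- w + alpha * input_value * error_gradient.
    W_jk += alpha * np.outer(y_j, delta_k)
    W_ij += alpha * np.outer(x, delta_j)
    return y_k
```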
This method is known as gradient descent because it involves following the steepest path down the surface that represents the error function, in an attempt to find the minimum in the error space, which represents the set of weights that provides the best performance of the network. In practice, the iteration of the backpropagation algorithm is usually terminated when the sum of the squares of the errors of the output values for all training data in an epoch is less than some threshold, such as 0.001.
Backpropagation does not appear to occur in the human brain. Additionally, it is rather inefficient and tends to be too slow for use in solving real-world problems. With some simple problems it can take hundreds or even thousands of epochs to reach a satisfactorily low level of error.
Improving the Performance of Backpropagation
A common method used to improve the performance of backpropagation is to include momentum in the formula that is used to modify the weights. The momentum takes into account the extent to which a particular weight was changed on the previous iteration. We shall use t to represent the current iteration, and t - 1 to represent the previous iteration. Hence, we can write our learning rules as follows:

$\Delta w_{ij}(t) = \alpha\, x_i\, \delta_j + \beta\, \Delta w_{ij}(t-1)$

where $\beta$ is the momentum coefficient, a positive constant below 1. This rule, including the momentum value, is known as the generalized delta rule. The inclusion of the momentum value has the benefit of enabling the backpropagation method to avoid local minima and also to move more quickly through areas where the error space is not changing.
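A sketch of the generalized delta rule in Python (NumPy assumed; the names and default constants are illustrative):

```python
import numpy as np

def momentum_update(W, x, delta, prev_dW, alpha=0.1, beta=0.9):
    """Generalized delta rule: the current gradient step plus a fraction
    (beta) of the weight change made on the previous iteration (t - 1)."""
    dW = alpha * np.outer(x, delta) + beta * prev_dW
    W += dW
    return dW  # becomes prev_dW on the next iteration
```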
An alternative method of speeding up backpropagation is to use the hyperbolic tangent function, tanh, instead of the sigmoid function, which tends to enable the network to converge on a solution in fewer iterations. The tanh function is defined as:

$\tanh x = \dfrac{2a}{1 + e^{-bx}} - a$

where a and b are constants, such as a = 1.7 and b = 0.7.
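A minimal sketch of this activation, following the form reconstructed above (NumPy assumed):

```python
import numpy as np

def tanh_activation(x, a=1.7, b=0.7):
    """Scaled hyperbolic tangent activation: 2a / (1 + e^(-bx)) - a.
    Outputs range over (-a, a) rather than (0, 1)."""
    return 2.0 * a / (1.0 + np.exp(-b * x)) - a
```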
Changing the Learning Rate α to Improve the Performance of Backpropagation
A final way to improve the performance of backpropagation is to vary the value of the learning rate, α, during the course of training the network. Two heuristics proposed by R. A. Jacobs (1988) use the direction of change (increase or decrease) of the sum of the squares of the errors from one epoch to the next to determine the change in learning rate:
1. If for several epochs the sum of the squares of the errors changes in the same direction, increase the learning rate.
2. If the sum of the squares of the errors alternates its direction of change over several epochs, decrease the learning rate.
By using these heuristics in combination with the generalized delta rule (see the sketch below), the performance of the backpropagation algorithm can be significantly improved.
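A hedged sketch of these heuristics in Python (the window size and scaling factors are illustrative assumptions, not values prescribed by Jacobs):

```python
def adjust_learning_rate(alpha, sse_history, window=4, up=1.05, down=0.7):
    """Adapt the learning rate from the direction of change of the sum of
    squared errors (SSE) over the last few epochs."""
    if len(sse_history) < window + 1:
        return alpha
    recent = sse_history[-(window + 1):]
    # +1 means the SSE decreased between two epochs, -1 means it increased.
    directions = [1 if b < a else -1 for a, b in zip(recent, recent[1:])]
    if all(d == directions[0] for d in directions):
        return alpha * up    # consistent direction: increase learning rate
    if all(d != e for d, e in zip(directions, directions[1:])):
        return alpha * down  # alternating direction: decrease learning rate
    return alpha
```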
Recurrent Networks
The neural networks we have been studying so far are feed-forward networks. A feed-forward network is acyclic, in the sense that there are no cycles in the network: data passes from the inputs to the outputs, and not vice versa. Once a feed-forward network has been trained, its state is fixed and does not alter as new input data is presented to it. In other words, it does not have a memory.
A recurrent network can have connections that go backward from output nodes to input nodes and, in fact, can have arbitrary connections between any nodes. In this way, a recurrent network's internal state can alter as sets of input data are presented to it, and it can be said to have a memory. This is particularly useful in solving problems where the solution depends not just on the current inputs but on all previous inputs. For example, a recurrent network could be used to predict the stock market price of a particular stock based on all previous values, or to predict what the weather will be like tomorrow based on what the weather has been.
Clearly, due to their lack of memory, feed-forward networks are not able to solve such tasks. When learning, the recurrent network feeds its inputs through the network, including feeding data back from outputs to inputs, and repeats this process until the values of the outputs do not change. At this point, the network is said to be in a state of equilibrium, or stability. For this reason, recurrent networks are also known as attractor networks, because they are attracted to certain output values. The stable values of the network, which are also known as fundamental memories, are the output values used as the response to the inputs the network received.
Hence, a recurrent network can be considered to be a memory that is able to learn a set of states: those that act as attractors for it. Once such a network has been trained, for any given input it will output the attractor that is closest to that input. For example, a recurrent network can be used as an error-correcting network: if only a few possible inputs are considered "valid," the network can correct all other inputs to the closest valid input. It is not always the case that a recurrent network will reach a stable state: some networks are unstable, which means they oscillate between different output values.
Hopfield Networks
In the 1980s, John Hopfield invented a form of recurrent network that has come to be known as a Hopfield network. The activation function used by most Hopfield networks is the sign activation function, which is defined as:

$\mathrm{Sign}(x) = \begin{cases} +1 & \text{if } x > 0 \\ -1 & \text{if } x < 0 \end{cases}$

Note that this definition does not provide a value for Sign(0). This is because when a neuron that uses the sign activation function receives an input of 0, it stays in the same state: it continues to output 1 if it was outputting 1 in the previous iteration, and continues to output -1 if it was outputting -1.
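For example, a sketch in Python:

```python
def sign_activation(x, previous_output):
    """Sign activation: +1 for positive input, -1 for negative input.
    An input of exactly 0 leaves the neuron in its previous state."""
    if x > 0:
        return 1
    if x < 0:
        return -1
    return previous_output
```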
When considering the operation of a Hopfield network, it is usual to use matrix arithmetic. The weights of the network are represented by a matrix, W, which is calculated as follows:

$W = \sum_{i=1}^{N} X_i X_i^t - N I$

where each $X_i$ is an input vector, representing the m input values to the network; $X_i^t$ is the matrix transposition of $X_i$; I is the m x m identity matrix; and N is the number of states ($X_i$) that are to be learned.
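A minimal sketch of this construction in Python (NumPy assumed; the function name is our own):

```python
import numpy as np

def hopfield_weights(states):
    """Build W = sum_i (X_i X_i^t) - N * I from a list of state vectors,
    each a sequence of +1/-1 values of length m."""
    m = len(states[0])
    W = np.zeros((m, m))
    for x in states:
        x = np.asarray(x, dtype=float).reshape(-1, 1)  # column vector
        W += x @ x.T
    return W - len(states) * np.eye(m)
```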
The transposition of a matrix is simply one where the rows and columns are swapped. For example, the transpose of the row vector $(1 \;\; -1 \;\; 1)$ is the column vector containing the same three values.
The identity matrix, I, is a matrix with zeros in every row and column, but with 1s along the leading diagonal. For example, the 3 x 3 identity matrix is

$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
Now let us examine an example. We will imagine a single-layer Hopfield network with five nodes and three training inputs that are to be learned by the network. We thus have three states (vectors) to be learned, each of which consists of five input values. The inputs can be either 1 or -1; similarly, the output values can be either 1 or -1, and so each output can be represented as a similar vector of five values, each of which is either 1 or -1.
The weight matrix is calculated by applying the formula above with N = 3:

$W = X_1 X_1^t + X_2 X_2^t + X_3 X_3^t - 3I$
Note that the weight matrix has zeros along its leading diagonal. This means that each node in the network is not connected to itself (i.e., $w_{ii} = 0$ for all i). A further property of a Hopfield network is that the two connections between a pair of nodes have the same weight: $w_{ij} = w_{ji}$ for any nodes i and j. The three training states used to produce the weight matrix will be stable states for the network. We can test this by determining the output vector for each of them:

$Y_i = \mathrm{Sign}(W X_i - \theta)$

where $\theta$ is the threshold matrix, which contains the thresholds for each of the five inputs. We will assume that the thresholds are all set at zero.
Hence, the first input state is a stable state for the network: $Y_1 = X_1$. Similarly, we can show that $Y_2 = X_2$ and that $Y_3 = X_3$. Now let us see how the network treats an input that is different from the training data. We will use an input vector that differs from $X_1$ in just one value, so we would expect the network to converge on $X_1$ when presented with this input.
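A hedged sketch of this recall process in Python (NumPy assumed; for simplicity all neurons are updated at once here, whereas classical Hopfield analysis updates one neuron at a time):

```python
import numpy as np

def hopfield_recall(W, x, theta=None, max_iter=100):
    """Repeatedly apply Y = Sign(W Y - theta) until the output stops
    changing, i.e., the network settles into a stable state."""
    y = np.asarray(x, dtype=float)
    theta = np.zeros_like(y) if theta is None else theta
    for _ in range(max_iter):
        field = W @ y - theta
        # Sign activation: a field of exactly 0 leaves the neuron unchanged.
        new_y = np.where(field > 0, 1.0, np.where(field < 0, -1.0, y))
        if np.array_equal(new_y, y):
            break
        y = new_y
    return y
```

Given the weight matrix built earlier, calling this function on the slightly corrupted input should return the nearby attractor $X_1$.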
The use of the Hopfield network involves three stages. In the first stage, the network is trained to learn the set of attractor states. This can be thought of as a storage, or memorization, stage; it is done by setting the weights of the network according to the values given by the weight matrix, W, which is calculated as described above. The second stage involves testing the network, by providing the attractor states as inputs and checking that the outputs are identical. The final stage involves using the network: acting as a memory, the network is required to retrieve data from its memory.
In each case, the network will retrieve the attractor closest to the input that it is given; in this case, the nearest attractor is $X_1$, which differs from the input in just two elements. The measure of distance that is usually used for such vectors is the Hamming distance, which measures the number of elements in which two vectors differ. The Hamming distance between two vectors, X and Y, is written ||X, Y||.
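For example, a minimal Python helper (the name is our own):

```python
def hamming_distance(x, y):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))
```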
Hence, the Hopfield network is a memory that usually maps an input vector to the memorized vector whose Hamming distance from the input vector is least.
In fact, although a Hopfield network always converges on a stable state, it does not always converge on the state closest to the original input. No method has yet been found for ensuring that a Hopfield network will always converge on the closest state. A Hopfield network is considered to be an autoassociative memory, which means that it is able to remember an item itself, or a similar item that might have been modified slightly, but it cannot use one piece of data to remember another. The human brain is fully associative, or heteroassociative, which means one item is able to cause the brain to recall an entirely different item. A piece of music or a smell will often cause us to remember an old memory: this is using the associative nature of memory. A Hopfield network is not capable of making such associations.
Bidirectional Associative Memories (BAMs)
A Bidirectional Associative Memory, or BAM, is a neural network first discussed by Bart Kosko (1988) that is similar in structure to the Hopfield network and that can be used to associate items from one set with items in another set. The network consists of two layers of nodes, where each node in one layer is connected to every node in the other layer; in other words, the layers are fully connected. This is in contrast to the Hopfield network, which consists of just a single layer of neurons: in the Hopfield network, each neuron is connected to every other neuron within the same layer, whereas in the BAM, each neuron is connected only to neurons in the other layer, not to neurons in its own layer.
As with Hopfield networks, the weight matrix is calculated from the items that are to be learned. In this case, two sets of data are to be learned, so that when an item from set X is presented to the network, it will recall a corresponding item from set Y. The weight matrix W is defined as:

$W = \sum_{i=1}^{N} X_i Y_i^t$

where each $(X_i, Y_i)$ is a pair of vectors to be associated, and N is the number of such pairs.
The BAM uses a neuron with a sign activation function, as also used by the Hopfield network. When the network is given a vector $X_i$ as an input, it will recall the corresponding vector $Y_i$; similarly, when presented with $Y_i$, the network will recall $X_i$. Let us examine a simple example.
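A hedged sketch in Python (NumPy assumed; the orientation of W follows the formula above, so forward recall uses the transpose, and ties at zero are resolved to +1 for brevity rather than holding the previous state):

```python
import numpy as np

def bam_weights(pairs):
    """W = sum_i (X_i Y_i^t), built from the (X, Y) vector pairs to learn."""
    W = 0
    for x, y in pairs:
        W = W + np.outer(x, y)
    return W

def bam_recall_y(W, x):
    """Recall the Y vector associated with input X (forward direction)."""
    return np.where(W.T @ np.asarray(x, dtype=float) >= 0, 1.0, -1.0)

def bam_recall_x(W, y):
    """Recall the X vector associated with input Y (backward direction)."""
    return np.where(W @ np.asarray(y, dtype=float) >= 0, 1.0, -1.0)
```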
We are using our network to learn two sets of vectors. The network has two layers: the input layer has two neurons, and the output layer has three neurons. The weight matrix is calculated using the formula above.
Now we will test the network. When presented with input $X_1$, the network produces an output vector; if the network is functioning correctly, this should be equal to $Y_1$.
So the network has correctly recalled $Y_1$ when presented with $X_1$. Similarly, the association should work in reverse: when presented with $Y_1$, the network should recall $X_1$. Note that in this case we are using the output layer as if it were an input layer, and vice versa; hence, the network is bidirectional.
Like a Hopfield network, the BAM is guaranteed to produce a stable output for any given inputs and for any training data. In fact, a Hopfield network is a type of BAM, with the additional requirement that the weight matrix be square and that each neuron not have a connection to itself (or to its corresponding neuron in the other layer). BAMs are extremely useful neural networks, although their capabilities (and limitations) are not yet fully understood.
Unsupervised Learning Networks
The networks we have studied so far in this chapter use supervised learning: they are presented with pre-classified training data before being asked to classify unseen data. We will now look at a number of methods that are used to enable neural networks to learn in an unsupervised manner.
Kohonen Maps
A Kohonen map, or self-organizing feature map, is a form of neural network invented by Teuvo Kohonen in the 1980s. The Kohonen map uses the winner-take-all algorithm, which leads to a form of unsupervised learning known as competitive learning. The winner-take-all algorithm uses the principle that only one neuron provides the output of the network in response to a given input: the neuron that has the highest activation level. During learning, only connections to this neuron have their weights altered.
The purpose of a Kohonen map is to organize input data into a number of clusters.
For example, a Kohonen map could be used to cluster news stories into subject categories. A Kohonen map is not told what the categories are: it determines the most useful segmentation itself. Hence, a Kohonen map is particularly useful for clustering data where the clusters are not known in advance.
A Kohonen map has two layers: an input layer and a cluster layer, which serves as the output layer. Each input node is connected to every node in the cluster layer, and typically the nodes in the cluster layer are arranged in a grid formation, although this is not essential. The method used to train a Kohonen map is as follows: initially, all weights are set to small random values. The learning rate, α, is also set, usually to a small positive value.
An input vector is presented to the input layer of the map. This layer feeds the input data to the cluster layer. The neuron in the cluster layer that most closely matches the input data is declared the winner. This neuron provides the output classification of the map and also has its weights updated.
To determine which neuron wins, its weights are treated as a vector, and this vector is compared with the input vector. The neuron whose weight vector is closest to the input vector is the winner. The Euclidean distance $d_i$ of a neuron with weight vector $w_i$ from the input vector x is calculated as follows:

$d_i = \sqrt{\sum_{j=1}^{n} (x_j - w_{ij})^2}$

where n is the number of neurons in the input layer and hence the number of elements in the input vector.
For example, let us calculate the distance between two vectors such as (1, 2, 3) and (1, 2, 7):

$d = \sqrt{(1-1)^2 + (2-2)^2 + (3-7)^2} = \sqrt{16} = 4$

So the Euclidean distance between these two vectors is 4.
The neuron for which $d_i$ is the smallest is the winner, and this neuron has its weight vector updated as follows:

$w_i(t+1) = w_i(t) + \alpha\,(x - w_i(t))$

This adjustment moves the weight vector of the winning neuron closer to the input vector that caused it to win.
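A minimal sketch of this winner-take-all update in Python (NumPy assumed; weights is a matrix with one row per cluster neuron):

```python
import numpy as np

def kohonen_step(weights, x, alpha):
    """One winner-take-all step: find the cluster neuron whose weight
    vector is closest (Euclidean distance) to input x, and move it
    toward x by a fraction alpha of the difference."""
    distances = np.linalg.norm(weights - x, axis=1)  # d_i for every neuron
    winner = int(np.argmin(distances))               # smallest d_i wins
    weights[winner] += alpha * (x - weights[winner])
    return winner
```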
In fact, rather than just the winning neuron having its weights updated, a neighborhood of neurons around the winner is usually updated. The neighborhood is usually defined as a radius within the two-dimensional grid of neurons around the winning neuron.
Typically, the radius decreases over time as the training data are examined, ending up fixed at a small value. Similarly, the learning rate is often reduced during the training phase. This training phase usually terminates when the modification of weights becomes very small for all the cluster neurons. At this point, the network has extracted from the training data a set of clusters, where similar items are contained within the same cluster, and similar clusters are near to each other.
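A hedged end-to-end training sketch under these conventions (NumPy assumed; data is an array of input vectors, and the grid shape, constants, and linear decay schedule are all illustrative assumptions):

```python
import numpy as np

def train_kohonen(data, grid_shape=(10, 10), epochs=20,
                  alpha=0.5, radius=5.0):
    """Train a Kohonen map on a 2-D grid of cluster neurons. Both the
    learning rate and the neighborhood radius shrink as training proceeds."""
    rows, cols = grid_shape
    n = data.shape[1]
    rng = np.random.default_rng(0)
    weights = rng.uniform(-0.5, 0.5, size=(rows * cols, n))
    # Each cluster neuron's (row, col) position on the grid.
    positions = np.array([(r, c) for r in range(rows) for c in range(cols)])

    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr, rad = alpha * decay, max(radius * decay, 1.0)
        for x in data:
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Update every neuron within the current radius of the winner.
            grid_dist = np.linalg.norm(positions - positions[winner], axis=1)
            for i in np.where(grid_dist <= rad)[0]:
                weights[i] += lr * (x - weights[i])
    return weights
```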
Kohonen Map Example