Published by Barrie Cameron. Modified over 9 years ago.
1 1/11 Design and Training of Neural Networks Slide from Dr. M. Pomplun
2 In supervised learning: We train an ANN with a set of vector pairs, so-called exemplars. Each pair (x, y) consists of an input vector x and a corresponding output vector y. Whenever the network receives input x, we would like it to provide output y. The exemplars thus describe the function that we want to “teach” our network. Besides learning the exemplars, we would like our network to generalize, that is, to give plausible output for inputs that it was not trained on.
3 Classification
Neural networks have been used successfully in a large number of practical classification tasks, such as the following:
- Recognizing printed or handwritten characters
- Classifying loan applications into credit-worthy and non-credit-worthy groups
- Analyzing sonar and radar data to determine the nature of the source of a signal
4 Function approximation
There is a tradeoff between a network’s ability to precisely learn the given exemplars and its ability to generalize (i.e., to interpolate and extrapolate). This problem is similar to fitting a function to a given set of data points. Let us assume that you want to find a fitting function f: R → R for a set of three data points. You try to do this with polynomials of degree one (a straight line), two, and nine.
[Figure: f(x) plotted against x, showing the fitted polynomials of degree 1, degree 2, and degree 9]
Of the three, the polynomial of degree 2 provides the most plausible fit.
5 The same principle applies to ANNs:
- If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function.
- If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then generalizes poorly.
Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.
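6 Evaluation of networks
Basic idea: define an error function and measure the error for untrained data (the testing set).
Typical: $E = \sum_i (d_i - o_i)^2$, where d is the desired output and o is the actual output.
Root mean square error: $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (d_i - o_i)^2}$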
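The root mean square error can be sketched in a few lines. The sketch below assumes the standard definition (the square root of the mean squared difference between desired outputs d and actual outputs o); the function name `rmse` is chosen for illustration only.

```python
import math


def rmse(desired, actual):
    """Root mean square error between desired outputs d and actual outputs o."""
    if len(desired) != len(actual) or len(desired) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(d - o) ** 2 for d, o in zip(desired, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Measured on the testing set rather than the training set, this value indicates how well the network handles inputs it was not trained on.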
7 Data Representation
All networks process one of two types of signal components: analog (continuously variable) signals or discrete (quantized) signals. In both cases, signals have a finite amplitude; their amplitude has a minimum and a maximum value.
[Figure: an analog signal and a discrete signal, each bounded by the min and max amplitude]
8 Data Representation
The main question is: How can we appropriately capture these signals and represent them as pattern vectors that we can feed into the network? We should aim for a data representation scheme that maximizes the ability of the network to detect (and respond to) relevant features in the input pattern. Relevant features are those that enable the network to generate the desired output pattern. Similarly, we also need to define a set of desired outputs that the network can actually produce. We are going to consider internal representation and external interpretation issues as well as specific methods for creating appropriate representations.
9 Internal Representation Issues
As we said before, in all network types, the amplitude of input signals and internal signals is limited:
- analog networks: values usually between 0 and 1
- binary networks: only values 0 and 1 allowed
- bipolar networks: only values -1 and 1 allowed
Without this limitation, patterns with large amplitudes would dominate the network’s behavior. A disproportionately large input signal can activate a neuron even if the relevant connection weight is very small.
10 External Interpretation Issues
Without any interpretation, we can only use standard methods to define the difference (or similarity) between signals. For example, for binary patterns x and y, we could:
- treat them as binary numbers and compute their difference as |x - y|
- treat them as vectors and use the cosine of the angle between them as a measure of similarity
- count the number of digits that we would have to flip in order to transform x into y (Hamming distance)
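The three measures can be compared on the same pair of patterns. This is a minimal sketch that assumes patterns are given as strings of '0' and '1' characters; the function names are illustrative.

```python
import math


def binary_number_difference(x, y):
    """Treat both patterns as binary numbers and return |x - y|."""
    return abs(int(x, 2) - int(y, 2))


def cosine_similarity(x, y):
    """Cosine of the angle between the two patterns seen as vectors."""
    a = [int(c) for c in x]
    b = [int(c) for c in y]
    norm = math.sqrt(sum(v * v for v in a)) * math.sqrt(sum(v * v for v in b))
    if norm == 0:
        raise ValueError("cosine is undefined for an all-zero pattern")
    return sum(p * q for p, q in zip(a, b)) / norm


def hamming_distance(x, y):
    """Number of positions in which the two patterns differ."""
    if len(x) != len(y):
        raise ValueError("patterns must have equal length")
    return sum(cx != cy for cx, cy in zip(x, y))
```

For x = "0111" and y = "1000", the numeric difference is 1 (7 versus 8), yet the Hamming distance is 4 and the cosine similarity is 0. Which measure is meaningful depends on how the patterns are interpreted externally.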
11 Creating Data Representation
The patterns that can be represented by an ANN most easily are binary patterns. Even analog networks “like” to receive and produce binary patterns; we can simply round values < 0.5 to 0 and values ≥ 0.5 to 1. To create a binary input vector, we can simply list all features that are relevant to the current task. Each component of our binary vector indicates whether one particular feature is present (1) or absent (0).
12 Creating Data Representation
With regard to output patterns, most binary-data applications perform classification of their inputs. The output of such a network indicates to which class of patterns the current input belongs. Usually, each output neuron is associated with one class of patterns. As you already know, for any input, only one output neuron should be active (1) and the others inactive (0), indicating the class of the current input.
13 Creating Data Representation
In other cases, classes are not mutually exclusive, and more than one output neuron can be active at the same time. Another variant would be the use of binary input patterns and analog output patterns for “classification”. In that case, again, each output neuron corresponds to one particular class, and its activation indicates the probability (between 0 and 1) that the current input belongs to that class.
14 Creating Data Representation
Ternary (and n-ary) patterns can cause more problems than binary patterns when we want to format them for an ANN. For example, imagine the tic-tac-toe game. Each square of the board is in one of three different states: occupied by an X, occupied by an O, or empty. Let us now assume that we want to develop a network that plays tic-tac-toe. This network is supposed to receive the current game configuration as its input. Its output is the position where the network wants to place its next symbol (X or O). Obviously, it is impossible to represent the state of each square by a single binary value.
15 Creating Data Representation
Possible solution:
- Use multiple binary inputs to represent non-binary states.
- Treat each feature in the pattern as an individual subpattern.
- Represent each subpattern with as many positions (units) in the pattern vector as there are possible states for the feature.
- Then concatenate all subpatterns into one long pattern vector.
16 Creating Data Representation
Example:
- X is represented by the subpattern 100
- O is represented by the subpattern 010
- An empty square is represented by the subpattern 001
The squares of the game board are enumerated as follows:
1 2 3
4 5 6
7 8 9
17 Creating Data Representation
Then consider the following board configuration (_ marks an empty square):
X X _
O O X
_ _ O
It would be represented by the following binary string:
100 100 001 010 010 100 001 001 010
Consequently, our network would need a layer of 27 input units.
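The encoding can be sketched as follows. The sketch assumes the board is passed as a 9-character string in the enumeration order of the squares (1 to 9), with '_' standing for an empty square; both the function name and the '_' convention are illustrative choices.

```python
# One subpattern per square state: one unit per possible state.
SUBPATTERN = {"X": "100", "O": "010", "_": "001"}


def encode_board(squares):
    """Concatenate the 3-bit subpatterns of all nine squares into one vector."""
    if len(squares) != 9:
        raise ValueError("a tic-tac-toe board has exactly nine squares")
    return [int(bit) for state in squares for bit in SUBPATTERN[state]]


board = "XX_OOX__O"  # rows: X X _ / O O X / _ _ O
vector = encode_board(board)
```

The resulting vector has 27 components, one per input unit, and reproduces the binary string 100 100 001 010 010 100 001 001 010.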
18 Creating Data Representation
And what would the output layer look like? Well, applying the same principle as for the input, we would use nine units to represent the 9-ary output possibilities. Considering the same enumeration scheme:
1 2 3
4 5 6
7 8 9
Our output layer would have nine neurons, one for each position. To place a symbol in a particular square, the corresponding neuron, and no other neuron, would fire (1).
19 Creating Data Representation
But would it not lead to a smaller, simpler network if we used a shorter encoding of the non-binary states? We do not need 3-digit strings such as 100, 010, and 001 to represent X, O, and the empty square, respectively. We can achieve a unique representation with 2-digit strings such as 10, 01, and 00. Similarly, instead of nine output units, four would suffice, using the following output patterns to indicate a square:
0000 0001 0010
0100 0101 0110
1000 1001 1010
20 Creating Data Representation
The problem with such representations is that the meaning of the output of one neuron depends on the output of other neurons. This means that each neuron does not represent (detect) a certain feature, but groups of neurons do. In general, such functions are much more difficult to learn. Such networks usually need more hidden neurons and longer training, and their ability to generalize is weaker than for the one-neuron-per-feature-value networks.
21 Creating Data Representation
On the other hand, sets of orthogonal vectors (such as 100, 010, 001) can be processed by the network more easily. This becomes clear when we consider that a neuron’s net input signal is computed as the inner product of the input and weight vectors. The geometric interpretation of these vectors shows that orthogonal vectors are especially easy to discriminate for a single neuron.
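A quick way to see the difference is to compute pairwise inner products. The sketch below compares the one-per-state codes with the nine compact 4-bit square codes; the helper names are illustrative, and codes are assumed to be strings of '0' and '1'.

```python
from itertools import combinations


def inner_product(u, v):
    """Inner product of two binary codes given as strings."""
    return sum(int(a) * int(b) for a, b in zip(u, v))


def overlapping_pairs(codes):
    """Return all pairs of codes that are not orthogonal to each other."""
    return [(u, v) for u, v in combinations(codes, 2) if inner_product(u, v) != 0]


one_per_state = ["100", "010", "001"]
compact_squares = ["0000", "0001", "0010",
                   "0100", "0101", "0110",
                   "1000", "1001", "1010"]
```

`overlapping_pairs(one_per_state)` is empty, so a neuron whose weight vector matches one code receives zero net input from every other code. Among the compact codes, pairs such as 0100 and 0101 overlap, and 0000 produces zero net input for any weight vector, so no single neuron can pick out one square in this simple way.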
22 Creating Data Representation
Another way of representing n-ary data in a neural network is using one neuron per feature, but scaling the (analog) value to indicate the degree to which a feature is present.
Good examples:
- the brightness of a pixel in an input image
- the distance between a robot and an obstacle
Poor examples:
- the letter (1 – 26) of a word
- the type (1 – 6) of a chess piece
23 Creating Data Representation
This can be explained as follows: The way NNs work (both biological and artificial ones) is that each neuron represents the presence/absence of a particular feature. Activations 0 and 1 indicate absence or presence of that feature, respectively, and in analog networks, intermediate values indicate the extent to which a feature is present. Consequently, a small change in one input value leads to only a small change in the network’s activation pattern.
24 Creating Data Representation
Therefore, it is appropriate to represent a non-binary feature by a single analog input value only if this value is scaled, i.e., it represents the degree to which a feature is present. This is the case for the brightness of a pixel or the output of a distance sensor (feature = obstacle proximity). It is not the case for letters or chess pieces. For example, assigning values to individual letters (a = 0, b = 0.04, c = 0.08, …, z = 1) implies that a and b are in some way more similar to each other than are a and z. In most contexts, this is not a reasonable assumption.
25 Creating Data Representation
If you wanted to represent the state of each square on the tic-tac-toe board by one analog value, which would be the better way to do this?
- empty = 0, X = 0.5, O = 1: Not a good scale! Goes from “neutral” to “friendly” and then “hostile”.
- X = 0, empty = 0.5, O = 1: More natural scale! Goes from “friendly” to “neutral” and then “hostile”.
26 Exemplar Analysis
When building a neural network application, we must make sure that we choose an appropriate set of exemplars (training data):
- The entire problem space must be covered.
- There must be no inconsistencies (contradictions) in the data.
We must be able to correct such problems without compromising the effectiveness of the network.
27 Ensuring Coverage
For many applications, we do not just want our network to classify any kind of possible input. Instead, we want our network to recognize whether an input belongs to any of the given classes or is “garbage” that cannot be classified. To achieve this, we train our network with both “classifiable” and “garbage” data (null patterns). For the null patterns, the network is supposed to produce a zero output, or a designated “null neuron” is activated.
28 We have to make sure that all of these exemplars taken together cover the entire input space. If it is certain that the network will never be presented with “garbage” data, then we do not need to use null patterns for training.
Sometimes there may be conflicting exemplars in our training set. A conflict occurs when two or more identical input patterns are associated with different outputs. Why is this problematic?
29 Ensuring Consistency
Assume a BPN with a training set including the exemplars (a, b) and (a, c). Whenever the exemplar (a, b) is chosen, the network adjusts its weights to present an output for ‘a’ that is closer to ‘b’. Whenever (a, c) is chosen, the network changes its weights for an output closer to ‘c’, thereby “unlearning” the adaptation for (a, b). In the end, the network will associate input ‘a’ with an output that is “between” ‘b’ and ‘c’ but is neither exactly ‘b’ nor exactly ‘c’, so the network error caused by these exemplars will not decrease. For many applications, this is undesirable.
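The effect can be reproduced with a deliberately simplified stand-in for a BPN: a single linear unit with one weight, trained with the delta rule on two exemplars that share the same input but demand different outputs. Everything in this sketch (the unit, the learning rate of 0.1, the targets 0 and 1) is an illustrative assumption, not the network discussed in the slides.

```python
def train_on_conflict(targets=(0.0, 1.0), rate=0.1, epochs=200):
    """Train one linear unit on exemplars (x=1, d) for each conflicting target d."""
    weight = 0.0
    x = 1.0  # identical input for both exemplars
    for _ in range(epochs):
        for d in targets:
            output = weight * x
            weight += rate * (d - output) * x  # delta rule update
    return weight


final_weight = train_on_conflict()
residual_errors = [(d - final_weight) ** 2 for d in (0.0, 1.0)]
```

The weight settles near 0.5, between the two targets, and the squared error for each exemplar stays above 0.2 no matter how long training continues.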
30 Uncertainty
Uncertainty is often treated as a single uniform concept that simply represents the absence of precise information. The uncertainty sometimes results from a random process; it sometimes results only from a lack of information that induces some ‘belief’ (instead of some ‘knowledge’). Data error is considered to be a well-defined and measurable part of uncertainty in reasoning systems. Important distinctions have been made between different varieties of uncertainty and the different conditions that produce them. One of these distinctions is between instances of uncertainty that are ‘vague’ and those that are ‘ambiguous’.
31 Uncertainty
Vague uncertainty exists when there is a general lack of information regarding a judgment or a particular target. In terms of classification, a vague target would be one where there is only weak evidence for membership in any specific class. In contrast, ambiguous uncertainty exists when there is an abundance of conflicting information regarding a possible judgment or a particular target. In terms of classification, an ambiguous target would be one where there is strong evidence for membership in two or more mutually exclusive categories.
32 Ensuring Consistency
To identify such conflicts, we can apply a search algorithm to our set of exemplars. How can we resolve an identified conflict? Of course, the easiest way is to eliminate the conflicting exemplars from the training set. However, this reduces the amount of training data that is given to the network. Eliminating exemplars is the best way to go if it is found that these exemplars represent invalid data, for example, inaccurate measurements. In general, however, other methods of conflict resolution are preferable.
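One simple search is to group the exemplars by input pattern and report every input that maps to more than one distinct output. This is a sketch under the assumption that exemplars are (input, output) pairs of hashable values such as bit strings; the function name is illustrative.

```python
from collections import defaultdict


def find_conflicts(exemplars):
    """Return {input: sorted outputs} for inputs paired with differing outputs."""
    outputs_by_input = defaultdict(set)
    for x, y in exemplars:
        outputs_by_input[x].add(y)
    return {x: sorted(ys) for x, ys in outputs_by_input.items() if len(ys) > 1}


training_set = [("0011", "0101"), ("0011", "0010"), ("1100", "0001")]
conflicts = find_conflicts(training_set)
```

The grouping takes a single pass over the data, so it stays cheap even for large training sets.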
33 Ensuring Consistency
Another method combines the conflicting patterns. For example, if we have exemplars (0011, 0101), (0011, 0010), we can replace them with the following single exemplar: (0011, 0111). The way we compute the output vector of the new exemplar based on the two original output vectors depends on the current task. It should be the value that is most “similar” (in terms of the external interpretation) to the original two values.
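In the example above, the combined output 0111 happens to be the bitwise OR of 0101 and 0010. The sketch below implements that one combination rule; it is only one task-dependent choice, suitable when an active output means "this class applies", and the function name is illustrative.

```python
def merge_outputs_or(a, b):
    """Combine two binary output patterns by activating every unit active in either."""
    if len(a) != len(b):
        raise ValueError("output patterns must have equal length")
    return "".join("1" if "1" in (p, q) else "0" for p, q in zip(a, b))


merged = ("0011", merge_outputs_or("0101", "0010"))
```

For outputs that encode numeric values, a different rule (for example, an average) may be closer to both originals under the external interpretation.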
34 Ensuring Consistency
Alternatively, we can alter the representation scheme. Let us assume that the conflicting measurements were taken at different times or places. In that case, we can just expand all the input vectors, and the additional values specify the time or place of measurement. For example, the exemplars (0011, 0101), (0011, 0010) could be replaced by the following ones: (100011, 0101), (010011, 0010).
35 Training and Performance Evaluation
How many samples should be used for training?
Heuristic: At least 5-10 times as many samples as there are weights in the network.
Formula (Baum & Haussler, 1989): $P = \frac{|W|}{1 - a}$
P is the number of samples, |W| is the number of weights to be trained, and ‘a’ is the desired accuracy (e.g., proportion of correctly classified samples).
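Both rules of thumb are easy to evaluate. The sketch assumes the commonly quoted form of the Baum & Haussler estimate, P = |W| / (1 - a); treat that form, and the function names, as assumptions made for illustration.

```python
def samples_by_formula(num_weights, accuracy):
    """Estimate P = |W| / (1 - a) for a desired accuracy a in [0, 1)."""
    if not 0 <= accuracy < 1:
        raise ValueError("accuracy must be in [0, 1)")
    return num_weights / (1.0 - accuracy)


def samples_by_heuristic(num_weights, factor=10):
    """Heuristic: 5 to 10 times as many samples as weights."""
    return factor * num_weights
```

For a desired accuracy of 0.9, the assumed formula coincides with the upper end of the heuristic: ten samples per weight.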
36 Training and Performance Evaluation
What learning rate η should we choose? The problems that arise when η is too small or too big are similar to those for the Adaline. Unfortunately, the optimal value of η entirely depends on the application. Values between 0.1 and 0.9 are typical for most applications. Often, η is initially set to a large value and is decreased during the learning process. This leads to better convergence of learning and also decreases the likelihood of “getting stuck” in a local error minimum at an early learning stage.
37 Training and Performance Evaluation
When training a BPN, what is the acceptable error, i.e., when do we stop the training? The minimum error that can be achieved depends not only on the network parameters, but also on the specific training set. Thus, for some applications the minimum error will be higher than for others.
38 Training and Performance Evaluation
An insightful way of performance evaluation is partial-set training. The idea is to split the available data into two sets: the training set and the test set. The network’s performance on the second set indicates how well the network has actually learned the desired mapping. We should expect the network to interpolate, but not extrapolate. Therefore, this test also evaluates our choice of training samples.
39 Training and Performance Evaluation
If the test set only contains one exemplar, this type of training is called “hold-one-out” training. It is to be performed sequentially for every individual exemplar. This, of course, is a very time-consuming process. For example, if we have 1,000 exemplars and want to perform 100 epochs of training, this procedure involves 1,000 × 999 × 100 = 99,900,000 training steps. Partial-set training with a 700-300 split would only require 700 × 100 = 70,000 training steps. On the positive side, the advantage of hold-one-out training is that all available exemplars (except one) are used for training, which might lead to better network performance.
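The two step counts follow directly from the set sizes. A minimal sketch, with illustrative function names, counting one training step per exemplar presentation:

```python
def hold_one_out_steps(num_exemplars, epochs):
    """One full training run per held-out exemplar, each on the remaining ones."""
    return num_exemplars * (num_exemplars - 1) * epochs


def partial_set_steps(training_set_size, epochs):
    """A single training run on the training portion of the split."""
    return training_set_size * epochs
```

With 1,000 exemplars and 100 epochs, `hold_one_out_steps(1000, 100)` gives 99,900,000 steps, while `partial_set_steps(700, 100)` gives 70,000.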
40 Some examples: Predicting the Weather
Let us study an interesting neural network application. Its purpose is to predict the local weather based on a set of current weather data:
- temperature (degrees Celsius)
- atmospheric pressure (inches of mercury)
- relative humidity (percentage of saturation)
- wind speed (kilometers per hour)
- wind direction (N, NE, E, SE, S, SW, W, or NW)
- cloud cover (0 = clear … 9 = total overcast)
- weather condition (rain, hail, thunderstorm, …)
41 We assume that we have access to the same data from several surrounding weather stations. There are 8 such stations that surround our position. How should we format the input patterns? We need to represent the current weather conditions by an input vector whose elements range in magnitude between zero and one. When we inspect the raw data, we find that there are two types of data that we have to account for:
- Scaled, continuously variable values
- n-ary representations of category values
42 The following data can be scaled:
- temperature (-10… 40 degrees Celsius)
- atmospheric pressure (26… 34 inches of mercury)
- relative humidity (0… 100 percent)
- wind speed (0… 250 km/h)
- cloud cover (0… 9)
We can just scale each of these values so that its lower limit is mapped to some small value ε and its upper limit is mapped to (1 - ε). These numbers will be the components of the input vector.
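A linear scaling of this kind might look as follows. The margin value 0.1 and the clipping of out-of-range readings are assumptions made for this sketch; the slides do not specify either.

```python
def scale(value, low, high, margin=0.1):
    """Map [low, high] linearly onto [margin, 1 - margin]."""
    if not low < high:
        raise ValueError("low must be smaller than high")
    if not 0 <= margin < 0.5:
        raise ValueError("margin must be in [0, 0.5)")
    clipped = min(max(value, low), high)  # assumption: clip out-of-range readings
    return margin + (1 - 2 * margin) * (clipped - low) / (high - low)


temperature_input = scale(15.0, -10.0, 40.0)  # midpoint of the range
```

Keeping the inputs away from exactly 0 and 1 leaves headroom at both ends of the unit interval.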
43 Usually, wind speeds vary between 0 and 40 km/h. By scaling wind speed between 0 and 250 km/h, we can account for all possible wind speeds, but usually only make use of a small fraction of the scale. Therefore, only the most extreme wind speeds will exert a substantial effect on the weather prediction. Consequently, we will use two scaled input values:
- wind speed ranging from 0 to 40 km/h
- wind speed ranging from 40 to 250 km/h
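One possible reading of the two-input scheme is sketched below: each input covers one band, and a speed outside a band is clipped to that band's nearest limit. The clipping behavior and the margin of 0.1 are assumptions of this sketch.

```python
def _scale(value, low, high, margin):
    """Map [low, high] linearly onto [margin, 1 - margin], clipping outside values."""
    clipped = min(max(value, low), high)
    return margin + (1 - 2 * margin) * (clipped - low) / (high - low)


def wind_speed_inputs(speed_kmh, margin=0.1):
    """Return (low-band input for 0-40 km/h, high-band input for 40-250 km/h)."""
    low_band = _scale(speed_kmh, 0.0, 40.0, margin)
    high_band = _scale(speed_kmh, 40.0, 250.0, margin)
    return low_band, high_band
```

An everyday speed of 20 km/h now sits in the middle of the first input's range, while the second input stays at its minimum until the speed exceeds 40 km/h.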
44 How about the non-scalable weather data? Wind direction is represented by an eight-component vector, where only one element (or possibly two adjacent ones) is active, indicating one out of eight wind directions. The subjective weather condition is represented by a nine-component vector with at least one, and possibly more, active elements. With this scheme, we can encode the current conditions at a given weather station with 23 vector components:
- one for each of the four scaled parameters
- two for wind speed
- eight for wind direction
- nine for the subjective weather condition
45 Since the input does not only include our station, but also the eight surrounding ones, the input layer of the network looks like this:
[Figure: input layer divided into blocks for our station, north, …, northwest]
The network has 207 input neurons, which accept 207-component input vectors.
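The input size can be checked by adding up the components per station. The constant and function names below are illustrative; the single-active-element direction encoding covers only the simple case of one wind direction.

```python
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
SCALED_PARAMETERS = 4   # temperature, pressure, humidity, cloud cover
WIND_SPEED_INPUTS = 2   # low band and high band
WEATHER_CONDITIONS = 9  # subjective weather condition vector
STATIONS = 9            # our station plus the eight surrounding ones


def inputs_per_station():
    return SCALED_PARAMETERS + WIND_SPEED_INPUTS + len(DIRECTIONS) + WEATHER_CONDITIONS


def total_inputs():
    return STATIONS * inputs_per_station()


def encode_direction(direction):
    """Eight-component vector with exactly one active element."""
    if direction not in DIRECTIONS:
        raise ValueError("unknown wind direction: " + str(direction))
    return [1 if d == direction else 0 for d in DIRECTIONS]
```

Each station contributes 23 components, and nine stations give 9 × 23 = 207 inputs.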
46 What should the output patterns look like? We want the network to produce a set of indicators that we can interpret as a prediction of the weather 24 hours from now. In analogy to the weather forecast on the evening news, we decide to demand the following four indicators:
- a temperature prediction
- a prediction of the chance of precipitation occurring
- an indication of the expected cloud cover
- a storm indicator (extreme conditions warning)
47 Each of these four indicators can be represented by one scaled output value:
- temperature (-10… 40 degrees Celsius)
- chance of precipitation (0%… 100%)
- cloud cover (0… 9)
- storm warning, with two possibilities: either 0 = no storm warning and 1 = storm warning, or the probability of a serious storm (0%… 100%)
Of course, the actual network outputs range from ε to (1 - ε), and after their computation, if necessary, they are scaled to match the ranges specified above.
48 We decide (or experimentally determine) to use a hidden layer with 42 sigmoidal neurons. In summary, our network has
- 207 input neurons
- 42 hidden neurons
- 4 output neurons
Because of the small output vectors, 42 hidden units may suffice for this application.
49 The next thing we need to do is collect the training exemplars. First we have to specify what our network is supposed to do: In production mode, the network is fed with the current weather conditions, and its output will be interpreted as the weather forecast for tomorrow. Therefore, in training mode, we have to present the network with exemplars that associate known past weather conditions at time t - 24 hrs (the input) with the conditions at time t (the desired output). So we have to collect a set of historical exemplars with known correct output for every input.
50 Obviously, if such data are unavailable, we have to start collecting them. The selection of exemplars that we need depends, among other factors, on how much the weather changes at our location. And how about the granularity of our exemplar data, i.e., the frequency of measurement? Using one sample per day would be a natural choice, but it would neglect rapid changes in weather. If we use hourly instantaneous samples, however, we increase the likelihood of conflicts.
51 Therefore, we decide to do the following: We will collect input data every hour, but the corresponding output pattern will be the average of the instantaneous patterns over a 12-hour period. This way we reduce the possibility of errors while increasing the amount of training data. Now we have to train our network. If we use samples in one-hour intervals for one year, we have 8,760 exemplars. Our network has 207 × 42 + 42 × 4 = 8,862 weights, which means that data from ten years, i.e., 87,600 exemplars, would be desirable (rule of thumb).
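The arithmetic can be reproduced for any fully connected layered network. As in the calculation above, the sketch counts only connection weights between adjacent layers and ignores bias weights; the function names are illustrative.

```python
def weight_count(layer_sizes):
    """Number of connection weights between adjacent fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def hourly_exemplars(years):
    """Exemplars collected at one-hour intervals, using 365-day years."""
    return 24 * 365 * years


weights = weight_count([207, 42, 4])
ten_years = hourly_exemplars(10)
```

Ten years of hourly data (87,600 exemplars) come close to ten exemplars per weight (88,620), which is the upper end of the rule of thumb.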
52 Since with a large number of samples the hold-one-out training method is very time-consuming, we decide to use partial-set training instead. The best way to do this would be to acquire a test set (control set), that is, another set of input-output pairs measured on random days and at random times. After training the network with the 87,600 exemplars, we could then use the test set to evaluate the performance of our network.
53 Neural network troubleshooting:
- Plot the global error as a function of the training epoch. The error should decrease after every epoch. If it oscillates, do the following tests.
- Try reducing the size of the training set. If the network then converges, a conflict may exist in the exemplars.
- If the network still does not converge, continue pruning the training set until it does converge. Then add exemplars back gradually, thereby detecting the ones that cause conflicts.
- If this still does not work, look for saturated neurons (extreme weights) in the hidden layer. If you find those, add more hidden-layer neurons, possibly an extra 20%.
- If there are no saturated units and the problems still exist, try lowering the learning parameter and training longer.
54 If the network converges but does not accurately learn the desired function, evaluate the coverage of the training set. If the coverage is adequate and the network still does not learn the function precisely, you could refine the pattern representation. For example, you could add a season indicator to the input, helping the network to discriminate between similar inputs that produce very different outputs.