Neural Networks – Lecture 5
B. Macukow
The Perceptron
In 1962 Frank Rosenblatt introduced the idea of the perceptron. The general idea: a neuron learns from its mistakes. If the element's output signal is wrong, the weights are changed so as to minimize the chance that the same mistake is repeated. If the element's output signal is correct, nothing is changed.
The one-layer perceptron is based on the McCulloch & Pitts threshold element. The simplest perceptron, Mark 1, is composed of four types of elements:
- a layer of input elements (a square grid of 400 receptors): elements of type S, receiving stimuli from the environment and transforming those stimuli into electrical signals;
- associative elements of type A: threshold summing elements with excitatory and inhibitory inputs;
- an output layer of elements of type R, the reacting elements, randomly connected with the A elements; a set of A elements corresponds to each R element, and R passes to state 1 when its total input signal is greater than zero;
- control units.
Phase 1 – learning: at the beginning, e.g. presentation of representatives of the first class. Phase 2 – verification of the learning results. Then learning of the second class, etc.
Mark 1: 400 threshold elements of type S; if they are sufficiently excited, they produce the signal +1 at output one and the signal −1 at output two. An associative element A has 20 inputs, randomly (or not) connected with the outputs of the S elements (excitatory or inhibitory). Mark 1 had 512 elements of type A. The A elements are randomly connected with the elements of type R. Mark 1 had 8 elements of type R.
A block diagram of a perceptron: the picture of the letter K is projected onto the receptive layer. As a result, the region corresponding to the letter K (in black) is activated in the reacting layer.
Each element A obtains a weighted sum of its input signals. When the number of excitatory signals is greater than the number of inhibitory signals, the +1 signal is generated at the output; otherwise no signal is generated. The elements R react to the summed input from the elements A: when the input is greater than the threshold, the +1 signal is generated, otherwise the signal 0. Learning means changes in the weights of the active elements A.
Simplified version: two layers, input and output; only the second layer is active. Input signals are equal to 0 or +1. Such a structure is called a one-layer perceptron. The elements (possibly only one) of the output layer obtain at their input the weighted signal from the input layer. If this signal is greater than the defined threshold value, the signal +1 is generated, otherwise the signal 0. The learning method is based on the correction of the weights connecting the input layer with the elements of the output layer. Only the weights of the active elements of the input layer are subject to correction.
Weight modification rule:
w_iA(new) = w_iA(old) − input_i
w_iB(new) = w_iB(old) + input_i
for input_i = 1.
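Below is a minimal Python sketch of this rule (not part of the original lecture). It assumes the situation in which an object of class B was wrongly classified: element A fired although it should not have and element B did not fire although it should have, so the weights feeding A are decreased and the weights feeding B are increased, but only for the active inputs (input_i = 1).

```python
def update_weights(w_A, w_B, inputs):
    """w_A, w_B: weight lists of output elements A and B; inputs: 0/1 input signals."""
    for i, x in enumerate(inputs):
        if x == 1:            # only active inputs are corrected
            w_A[i] -= x       # decrease the weights feeding element A
            w_B[i] += x       # increase the weights feeding element B
    return w_A, w_B
```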
An example:
A one-layer perceptron with a single output element: inputs 1, 2, 3, …, N with weights w_1A, w_2A, w_3A, …, w_NA feed output element A. Desired output: class A → 1, class B → 0.
The one-layer, two-element perceptron: inputs 1, 2, 3, …, N feed two output elements A and B through weights w_iA and w_iB. Desired outputs: class A → (output A, output B) = (1, 0); class B → (0, 1).
Perceptron learning: an object of class A is presented; the actual outputs are (output A, output B) = (1, 1).
The output of element A is correct: we do not change the weights w_iA coming into element A.
The output of element B is incorrect (1 instead of 0): the output signal of B is ≥ the threshold value, so it is necessary to decrease the weights w_iB coming into element B.
Weight modification rule. Assuming
ζ = output(desired) − output(actual),
the weights are changed according to
w_iB(new) = w_iB(old) + Δw_iB,
for example
w_iB(new) = w_iB(old) + ζ·input_i
(so Δw_iB = 0 for input_i = 0 and Δw_iB = ζ for input_i = +1).
An object of class A is presented; the actual outputs are (output A, output B) = (1, 0).
The output of element A is correct: we do not change the weights w_iA coming into element A.
The output of element B is correct: we do not change the weights w_iB coming into element B.
An object of class B is presented; the actual outputs are (output A, output B) = (1, 0).
The output of element A is incorrect (1 instead of 0): the output signal of A is ≥ the threshold value, so it is necessary to decrease the weights w_iA coming into element A.
The output of element B is incorrect (0 instead of 1): the output signal of B is below the threshold value, so it is necessary to increase the weights w_iB coming into element B.
The perceptron learning algorithm
It can be proved that: "… given it is possible to classify a series of inputs, … then a perceptron network will find this classification." In other words, "a perceptron will learn the solution, if there is a solution to be found." Unfortunately, such a solution does not always exist!
It is important to distinguish between representation and learning. Representation refers to the ability of a perceptron (or any other network) to simulate a specified function. Learning requires the existence of a systematic procedure for adjusting the network weights to produce that function.
These weaknesses of the perceptron were exposed by Minsky and Papert in 1969: they showed that some perceptrons were impractical or inadequate for solving many problems, and stated that there was no underlying mathematical theory of perceptrons.
Bernard Widrow recalls: "… my impression was that Minsky and Papert defined the perceptron narrowly enough that it couldn't do anything interesting. You can easily design something to overcome many of the things that they proved couldn't be done. It looked like an attempt to show that the perceptron was no good. It wasn't fair."
XOR Problem
One of Minsky and Papert's more discouraging results shows that a single-layer perceptron cannot simulate a simple but very important function: the exclusive-or (XOR).
XOR truth table:
x  y  output  point
0  0    0      A0
1  0    1      B0
0  1    1      B1
1  1    0      A1
The function F is the simple threshold function producing the output signal 0 when the signal s is below the threshold, and the output signal 1 when s is greater than or equal to it.
The separating line is x·w1 + y·w2 = θ. There is no system of values w1 and w2 such that the points A0 and A1 lie on one side of this straight line and the points B0 and B1 on the other side.
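The claim can be checked directly. Below is a small brute-force sketch (not from the lecture): it scans a grid of candidate weights w1, w2 and thresholds θ and verifies that no single threshold unit x·w1 + y·w2 ≥ θ reproduces the XOR truth table. The grid of values is arbitrary.

```python
import itertools

xor_table = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}
grid = [i / 4 for i in range(-8, 9)]          # candidate values -2.0 ... 2.0

def realizes_xor(w1, w2, theta):
    # True if the threshold unit reproduces the XOR output for all four inputs
    return all(int(x * w1 + y * w2 >= theta) == out
               for (x, y), out in xor_table.items())

found = any(realizes_xor(w1, w2, theta)
            for w1, w2, theta in itertools.product(grid, grid, grid))
print("XOR realizable by a single threshold unit on this grid:", found)   # prints False
```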
Finally, what is the perceptron really?
The question is: is it possible to realize every logical function by means of a single neuronal element with properly selected parameters? Is it possible to build every digital system by means of such neuronal elements? Unfortunately, there exist functions for which it is necessary to use two or more elements. It is easy to demonstrate that it is impossible to realize every function of N variables by means of a single neuronal element.
The geometrical interpretation of the equation
∑_i w_i(t)·x_i(t) = θ
is a plane (surface) whose orientation depends on the weights. The plane should be oriented in such a way that all vertices where the output = 1 are located on the same side, i.e. the inequality
∑_i w_i(t)·x_i(t) ≥ θ
is fulfilled.
From the figure above it is easy to understand why a realization of XOR is impossible: there is no single plane (for N = 2, a straight line) separating the points of different colors.
These figures illustrate the difficulties with realizations demanding negative threshold values (n.b. the biological interpretation is sometimes doubtful).
Linear separability
The problem of linear separability imposes limitations on the use of one-layer neural nets, so it is very important to be aware of this property. The problem of linear separability can be solved by increasing the number of network layers.
The convergence of the learning procedure
The input patterns are assumed to come from a space which has two classes: F+ and F−. We want the perceptron to respond with 1 if the input comes from F+, and −1 if it comes from F−. We treat the set of input values X_i as a vector X in n-dimensional space, and the set of weights W_i as another vector W in the same space. Increasing the weights is performed by W + X, and decreasing by W − X.
start: Choose any value for W
test: Choose an X from F+ or F−
  If X ∈ F+ and W·X > 0 → go to test
  If X ∈ F+ and W·X ≤ 0 → go to add
  If X ∈ F− and W·X < 0 → go to test
  If X ∈ F− and W·X ≥ 0 → go to subtract
add: Replace W by W + X; go to test
subtract: Replace W by W − X; go to test
Notice that we go to subtract when X ∈ F−, and going to subtract is the same as going to add with X replaced by −X.
start: Choose any value for W
test: Choose an X from F+ or F−
  If X ∈ F−, change the sign of X
  If W·X > 0 → go to test, otherwise go to add
add: Replace W by W + X; go to test
We can simplify the algorithm still further if we define F to be F+ ∪ (−F−), i.e. F+ together with the negatives of F−.
start: Choose any value for W
test: Choose any X from F
  If W·X > 0 → go to test, otherwise go to add
add: Replace W by W + X; go to test
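The following Python sketch is a direct transcription of the final start/test/add procedure (it is not part of the original lecture). F is assumed to already contain the vectors of F+ together with the negated vectors of F−, as defined above; the step limit is an arbitrary safeguard.

```python
import random
import numpy as np

def perceptron_convergence(F, n_steps=10_000):
    W = np.zeros(len(F[0]))        # start: choose any value for W (here: zeros)
    for _ in range(n_steps):
        X = random.choice(F)       # test: choose any X from F
        if np.dot(W, X) > 0:       # W·X > 0 -> go back to test
            continue
        W = W + X                  # add: replace W by W + X
    return W
```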
The Perceptron – theorem and proof
Convergence theorem: the program will go to add only a finite number of times.
Proof: Assume that there is a unit vector W*, which partitions the space, and a small positive fixed number δ, such that W*·X > δ for every X ∈ F. Define G(W) = W*·W / |W| and note that G(W) is the cosine of the angle between W and W*.
Since |W*| = 1, we can say that G(W) ≤ 1. Consider the behavior of G(W) through add. The numerator:
W*·W_{t+1} = W*·(W_t + X) = W*·W_t + W*·X ≥ W*·W_t + δ, since W*·X > δ.
Hence, after the m-th application of add we have
W*·W_m ≥ mδ.   (1)
The denominator: since W_t·X must be non-positive (only then is the add operation performed), we have
|W_{t+1}|² = W_{t+1}·W_{t+1} = (W_t + X)·(W_t + X) = |W_t|² + 2·W_t·X + |X|².
Moreover |X| = 1, so |W_{t+1}|² ≤ |W_t|² + 1, and after the m-th application of add
|W_m|² ≤ m.   (2)
Combining (1) and (2) gives a lower bound on G(W_m) that grows with m; because G(W) ≤ 1, this bounds the number m of applications of add, as made explicit below.
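The missing algebra on this slide can be reconstructed from (1) and (2) as follows (a sketch of the standard argument, using the slides' assumptions |W*| = 1 and |X| = 1):

```latex
% Combining (1) and (2):
\[
  G(W_m) \;=\; \frac{W^{*}\cdot W_m}{\lvert W_m\rvert}
         \;\ge\; \frac{m\delta}{\sqrt{m}}
         \;=\; \delta\sqrt{m},
  \qquad\text{and since } G(W_m)\le 1,\qquad
  m \;\le\; \frac{1}{\delta^{2}}. \tag{3}
\]
```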
What does it mean? Inequality (3) is our proof. In the perceptron algorithm, we only go to test if W·X > 0. We have chosen a small fixed number δ such that W*·X > δ. Inequality (3) then says that we can make δ as small as we like, but the number of times m that we go to add will still be finite, and will be at most 1/δ². In other words, the perceptron will learn a weight vector W that partitions the space successfully, so that patterns from F+ are responded to with a positive output and patterns from F− produce a negative output.
The perceptron learning algorithm
Step 1 – initialize weights and threshold: define w_i(t), (i = 0, 1, …, n), to be the weight from input i at time t, and θ to be the threshold value of the output node. Set the w_i(0) to small random numbers.
Step 2 – present input and desired output: present the input X = [x_1, x_2, …, x_n], x_i ∈ {0, 1}, and the desired output d(t) to the comparison block.
Step 3 – calculate the actual output:
y(t) = 1 if ∑_i w_i(t)·x_i(t) ≥ θ, otherwise y(t) = 0.
Step 4 – adapt the weights:
Step 4 (cont.):
if y(t) = d(t): w_i(t+1) = w_i(t)
if y(t) = 0 and d(t) = 1: w_i(t+1) = w_i(t) + x_i(t)
if y(t) = 1 and d(t) = 0: w_i(t+1) = w_i(t) − x_i(t)
Algorithm modification – step 4 (cont.):
if y(t) = d(t): w_i(t+1) = w_i(t)
if y(t) = 0 and d(t) = 1: w_i(t+1) = w_i(t) + η·x_i(t)
if y(t) = 1 and d(t) = 0: w_i(t+1) = w_i(t) − η·x_i(t)
where 0 ≤ η ≤ 1 is a positive gain term that controls the adaptation rate.
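Steps 1–4 can be put together into a complete training loop. The Python sketch below is illustrative only (it is not the lecture's code): the threshold θ is handled as an extra bias weight with a constant input of +1, and the dataset and all names are assumptions.

```python
import numpy as np

def train_perceptron(samples, targets, eta=0.25, epochs=20):
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.05, 0.05, size=samples.shape[1] + 1)   # step 1: small random weights (+ bias)
    for _ in range(epochs):
        for x, d in zip(samples, targets):                    # step 2: present input and desired output
            x = np.append(x, 1.0)                             # constant input standing in for the threshold
            y = 1 if np.dot(w, x) >= 0 else 0                 # step 3: actual output (hard threshold)
            w += eta * (d - y) * x                            # step 4: adapt weights (covers all three cases)
    return w

# Example: the logical AND function, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])
w = train_perceptron(X, d)
print([1 if np.dot(w, np.append(x, 1.0)) >= 0 else 0 for x in X])   # expected [0, 0, 0, 1]
```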
Widrow and Hoff modification – step 4 (cont.):
if y(t) = d(t): w_i(t+1) = w_i(t)
if y(t) ≠ d(t): w_i(t+1) = w_i(t) + η·Δ·x_i(t)
where 0 ≤ η ≤ 1 is a positive gain term that controls the adaptation rate, and Δ = d(t) − y(t).
The Widrow–Hoff delta rule calculates the difference between the weighted sum and the required output, and calls that the error. This means that during the learning process the output from the unit is not passed through the step function; the actual classification, however, is still effected by using the step function to produce the +1 or 0. Neuron units using this learning algorithm were called ADALINEs (ADAptive LInear NEurons); they can also be connected into a Many-ADALINE, or MADALINE, structure.
The ADALINE model
Widrow and Hoff model. The structure of ADALINE, and the way it performs a weighted sum of its inputs, is similar to the single perceptron unit, and it has similar limitations (for example the XOR problem).
Whereas in a perceptron the decision about changing the weights is taken on the basis of the output signal, ADALINE uses the signal from the summing unit (marked Σ).
A system of two ADALINE-type elements can realize the logical AND function.
Similarly to other multilayer nets (e.g. the perceptron), from basic ADALINE elements one can create a whole network, called ADALINE or MADALINE. The complicated structure of the net makes it difficult to define an effective learning algorithm. The most widely used is the LMS (Least-Mean-Square) algorithm, but the LMS method requires knowing the input and output values of every hidden layer, and this information is not accessible.
A three-layer net composed of ADALINE elements creates the MADALINE net.
The neuron operation can be described by the formula (assuming the threshold = 0)
y = Wᵀ·X
where W = [w_1, w_2, …, w_n] is the weight vector and X = [x_1, x_2, …, x_n] is the input signal (vector).
From the properties of the inner product we know that the output signal will be larger when the direction of the input vector X in the n-dimensional space of input signals coincides with the direction of the weight vector W in the n-dimensional space of weights. The neuron reacts more strongly to input signals more "similar" to its weight vector. Assuming that the vectors X and W are normalized (i.e. ‖W‖ = 1 and ‖X‖ = 1), one gets
y = cos Φ
where Φ is the angle between the vectors X and W.
For an m-element layer of neurons (processing elements) we get
Y = W·X
where the rows of the matrix W (1, 2, …, m) correspond to the weights coming to particular processing elements from the input nodes, and Y = [y_1, y_2, …, y_m].
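A small numpy sketch of the two formulas above (illustrative only; the vectors are made up): a single neuron with normalized weight and input vectors produces the cosine of the angle between them, so it responds most strongly to the input most "similar" to its weights, and a layer of m neurons is a single matrix–vector product.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

# Single neuron: with normalized weight and input vectors, y = W·X = cos(phi).
w = normalize(np.array([1.0, 2.0, 2.0]))
inputs = [normalize(np.array(v)) for v in ([1.0, 2.0, 2.0],    # same direction
                                           [2.0, 1.0, 2.0],    # close direction
                                           [-1.0, 0.0, 1.0])]  # far direction
print([round(float(w @ x), 3) for x in inputs])   # decreasing values, maximum 1.0

# A layer of m = 2 processing elements: rows of W are the neurons' weight vectors.
W = np.array([[0.5, -1.0, 2.0],
              [1.5,  0.0, 0.3]])
Y = W @ inputs[1]                                  # Y = [y_1, y_2]
print(Y)
```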
The net maps the input space X into R^m: X → R^m. Of course this mapping is completely arbitrary; one can say that the net performs a filtering. The operation of the net is defined by the elements of the matrix W, i.e. the weights are an equivalent of the program in numerical calculation. An a priori definition of the weights is difficult, and in multilayer nets practically impossible.
The one-step process of determining the weights can be replaced by a multi-step process – the learning process. It is necessary to expand the system by adding an element able to determine the error of the output signal and an element able to control the weight adaptation. The method of operating the ADALINE is based on the algorithm called DELTA, introduced by Widrow and Hoff. The general idea: each input signal X is associated with a signal d, the correct output signal.
The actual output signal y is compared with d and the error e is calculated. On the basis of this error signal and the input signal X, the weight vector W is corrected. The new weight vector W′ is calculated from the formula
W′ = W + η·e·X
where η is the learning speed coefficient.
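A minimal numpy sketch of this DELTA correction (illustrative only; the toy data and names are assumptions). The error is computed on the linear sum, as in ADALINE, while classification would still threshold the sum.

```python
import numpy as np

def train_adaline(X, d, eta=0.1, epochs=50):
    """X: matrix of input vectors (one per row), d: desired outputs."""
    rng = np.random.default_rng(1)
    w = rng.uniform(-0.1, 0.1, size=X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = w @ x                 # linear sum, no step function during learning
            e = target - y            # error e = d - y
            w = w + eta * e * x       # W' = W + eta * e * X
    return w

# Toy example: inputs and targets in {-1, +1}, as assumed later in the lecture.
X = np.array([[+1, +1], [+1, -1], [-1, +1], [-1, -1]], dtype=float)
d = np.array([+1, +1, -1, -1], dtype=float)           # class = sign of the first input
w = train_adaline(X, d)
print(np.sign(X @ w))                                  # expected: [ 1.  1. -1. -1.]
```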
The idea is identical with the perceptron learning rule. When d > y it means that the output signal was too small – the angle between the vectors X and W was too big. To the vector W it is necessary to add the vector X multiplied by a constant (0 < η·e < 1); this condition prevents too fast "rotations" of the vector W. The correction of the vector W is bigger when the error is bigger – the correction should be stronger for a big error and more precise for a small one.
The rule assures that the i-th component of the vector W is changed the more, the bigger the corresponding component of the learned X was. When the components of X can be both positive and negative, the sign of the error e determines the increase or decrease of W.
Let us assume that the element has to learn many different input signals. For simplicity, for the k-th input signal X_k we have x_i^k ∈ {−1, 1}, y_k ∈ {−1, 1} and d_k ∈ {−1, 1}. The error for the k-th input signal is
e_k = d_k − ∑_i w_i·x_i^k
and the coefficient η is constant and equal to 1/n.
This procedure is repeated for all m input signals X_k, and the output signal y_k should become equal to d_k for each X_k. Of course, usually a weight vector W fulfilling this exactly does not exist; we then look for a vector W* able to minimize the errors e_k.
Let us denote the weights minimizing the error by w*_1, w*_2, …, w*_n, and by δ the vector determining the difference between the weight vector W and the optimal vector W*:
δ = W* − W.
The necessary condition for a minimum is
∂/∂w_j ( ∑_k e_k² ) = 0 for j = 1, 2, 3, …, n,
which, after differentiating, yields a system of n linear equations for the optimal weights w*_1, …, w*_n.
The optimal weight values can be obtained by solving the above-mentioned system of equations. It can be shown that these conditions are also sufficient to minimize the mean square error.
Convergence of the learning process. The difference equation
w_i(t+1) = w_i(t) + δ(t+1)·x_i(t+1)   (1)
describes the learning process in the (t+1)-th iteration, with
δ(t+1) = e(t+1)/n.
where
e(t+1) = d(t+1) − ∑_i w_i(t)·x_i(t+1).   (2)
The set of n weights represents a point in n-dimensional space. The square of the distance L(t) between the optimal point in this space and the point defined by the weights w_i(t) is
L(t) = ∑_i [w*_i − w_i(t)]².
In the learning process the value of L(t) changes to L(t+1), hence
ΔL(t) = L(t) − L(t+1)
      = ∑_i { [w*_i − w_i(t)]² − [w*_i − w_i(t+1)]² }
      = ∑_i [w_i(t+1) − w_i(t)]·[2·w*_i − 2·w_i(t) − δ(t+1)·x_i(t+1)]
      = ∑_i δ(t+1)·x_i(t+1)·[2·w*_i − 2·w_i(t) − δ(t+1)·x_i(t+1)]
where
δ(t+1) = e(t+1)/n = [ d(t+1) − ∑_i w_i(t)·x_i(t+1) ] / n.
There are two possible cases:
1) The optimal weights give the proper output for every input signal, i.e. ∑_i w*_i·x_i(t+1) = d(t+1); then (using ∑_i x_i²(t+1) = n)
ΔL(t) = δ(t+1)·{ 2·∑_i w*_i·x_i(t+1) − 2·∑_i w_i(t)·x_i(t+1) − n·δ(t+1) }
      = δ(t+1)·{ 2·d(t+1) − 2·[d(t+1) − e(t+1)] − n·δ(t+1) }
      = δ(t+1)·[ 2·e(t+1) − e(t+1) ] = n·δ²(t+1) = e²(t+1)/n.
Because the right-hand side is positive, L(t) is a decreasing function of t. Because L(t) is non-negative, the infimum exists, the sequence is convergent, and of course δ(t+1) has the limit zero – thus the error e(t+1) also tends to zero.
2) The optimal weights do not produce the proper output signal for every input signal; then
ΔL(t) = δ(t+1)·{ 2·∑_i w*_i·x_i(t+1) − 2·∑_i w_i(t)·x_i(t+1) − ∑_i x_i²(t+1)·δ(t+1) }
      = δ(t+1)·{ 2·[d(t+1) − Δd(t+1)] − 2·∑_i w_i(t)·x_i(t+1) − n·δ(t+1) }
where ∑_i w_i(t)·x_i(t+1) = d(t+1) − e(t+1) and δ(t+1) = e(t+1)/n.
      = δ(t+1)·{ 2·[d(t+1) − Δd(t+1)] − 2·[d(t+1) − e(t+1)] − e(t+1) }
      = e(t+1)/n · [ e(t+1) − 2·Δd(t+1) ]
      = δ(t+1)·[ e(t+1) − 2·Δd(t+1) ]
where Δd(t+1) = d(t+1) − ∑_i w*_i·x_i(t+1).
It can also be shown that L(t) decreases monotonically when |e(t+1)| > 2·|Δd(t+1)|. Let us assume that the learning procedure ends when the error satisfies |e(k)| < 2·max|Δd(k)|; this means that the error modulus for every input signal is smaller than 2·max|Δd(k)|.
The Delta Rule
The one-layer network.
The perceptron learning rule is also a delta rule:
if y(t) = d(t): w_i(t+1) = w_i(t)
if y(t) ≠ d(t): w_i(t+1) = w_i(t) + η·Δ·x_i(t)
where 0 ≤ η ≤ 1 is the learning coefficient and Δ = d(t) − y(t).
The basic difference is in the definition of the error – discrete in the perceptron and continuous in the ADALINE model.
Let δ_k, the error term, be defined as the difference between the desired response d_k of the k-th element of the output layer and its actual (real) response y_k:
δ_k = d_k − y_k.
Let us define the error function E to be equal to the square of the difference between the actual and desired output, summed over all elements of the output layer:
E = ∑_k (d_k − y_k)².
Because
y_k = ∑_i w_ki·x_i,
thus
E = ∑_k ( d_k − ∑_i w_ki·x_i )².
The error function E is a function of all the weights. It is a quadratic function with respect to each weight, so it has exactly one minimum with respect to each of the weights. To find this minimum we use the gradient descent method. The gradient of E is the vector consisting of the partial derivatives of E with respect to each variable. This vector gives the direction of the most rapid increase of the function; the opposite direction gives the direction of its most rapid decrease.
So the weight change is proportional to the partial derivative of the error function with respect to this weight, taken with the minus sign:
Δw_ki = −η · ∂E/∂w_ki
where η is the learning rate.
Each weight can be adjusted this way. Let us calculate the partial derivative of E:
∂E/∂w_ki = −2·(d_k − y_k)·x_i = −2·δ_k·x_i,
thus
Δw_ki = η·δ_k·x_i
(the constant factor 2 is absorbed into the learning rate η).
The DELTA RULE changes the weights in a net proportionally to the output error (the difference between the real and the desired output signal) and to the value of the input signal.
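A short numpy sketch of this rule for a one-layer linear network (illustrative only; the data, shapes and names are assumptions): the weight matrix is updated by η·δ_k·x_i for every output element at once.

```python
import numpy as np

def delta_rule_step(W, x, d, eta=0.05):
    """One delta-rule update for a one-layer linear net.
    W: (m, n) weight matrix, x: (n,) input, d: (m,) desired output."""
    y = W @ x                       # actual (continuous) outputs of the layer
    delta = d - y                   # error term for every output element
    W += eta * np.outer(delta, x)   # w_ki <- w_ki + eta * delta_k * x_i
    return W

# Toy usage: learn a fixed linear target mapping from random inputs.
rng = np.random.default_rng(0)
target = np.array([[1.0, -2.0, 0.5],
                   [0.0,  1.0, 1.0]])        # the mapping to be learned (assumed)
W = np.zeros_like(target)
for _ in range(2000):
    x = rng.normal(size=3)
    W = delta_rule_step(W, x, target @ x)
print(np.round(W, 2))                         # approaches the target matrix
```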
The multilayer Perceptron
The idea of the multilayer perceptron was introduced many years ago: typically three layers – input, hidden and output. In 1986 Rumelhart, Hinton and Williams described a new learning rule – the backpropagation learning rule.