1 Previous Lecture
Perceptron: w(t+1) = w(t) + η(t) [ d(t) − sign( w(t) · x ) ] x
Adaline: w(t+1) = w(t) + η(t) [ d(t) − f( w(t) · x ) ] f' x
Gradient descent method

2 5 Multi-Layer Perceptron (MLP, 1) Purpose: to introduce the MLP and the techniques developed for training such networks, and to understand MLP learning.

3 5 Multi-Layer Perceptron (MLP, 1) Topics: the XOR problem, the credit assignment problem, and the back-propagation algorithm (one of the central topics of the course).

4 XOR problem The XOR (exclusive OR) function: 0 ⊕ 0 = 0, 1 ⊕ 1 = 0 (2 mod 2), 1 ⊕ 0 = 1, 0 ⊕ 1 = 1. A single-layer perceptron does not work here, because the two classes are not linearly separable.

5 Credit assignment problem Minsky & Papert (1969) offered a solution to the XOR problem by combining perceptron unit responses using a second layer of units. [Diagram: perceptron units 1 and 2 and a +1 input feeding output unit 3.]

6 Credit assignment problem [Diagram: the two-layer network of units 1, 2 and 3.]

7 [Plot] This is a linearly separable problem!

8 For the four points { (−1,1), (−1,−1), (1,1), (1,−1) }: the problem is always linearly separable if three of the points are to be in one class.
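As an illustration of how a second layer of units solves XOR (a sketch that is not part of the slides), the following Python snippet wires up threshold units with hand-picked, hypothetical weights: unit 1 acts as OR, unit 2 as NAND, and output unit 3 combines them as AND, which yields XOR.

```python
import numpy as np

def step(x):
    """Threshold (perceptron) activation: 1 if the net input is positive, else 0."""
    return (x > 0).astype(int)

# Hand-picked weights for illustration only; each row is [w_x1, w_x2, bias],
# with the bias realised as a weight on a constant +1 input.
hidden_w = np.array([[ 1.0,  1.0, -0.5],   # unit 1: OR(x1, x2)
                     [-1.0, -1.0,  1.5]])  # unit 2: NAND(x1, x2)
output_w = np.array([ 1.0,  1.0, -1.5])    # unit 3: AND(unit 1, unit 2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2, 1.0])             # input with the +1 bias appended
    h = step(hidden_w @ x)                  # responses of units 1 and 2
    o = step(output_w @ np.append(h, 1.0))  # unit 3 combines the hidden responses
    print(f"XOR({x1},{x2}) = {o}")
```

Each individual unit only draws a linear decision boundary; it is the combination of the two hidden units that makes the XOR classes separable for unit 3.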

9 Four-layer networks [Network diagram: inputs x1, x2, …, xn; hidden layer; output layer.]

10-13 Properties of architecture
- No connections within a layer
- No direct connections between input and output layers
- Fully connected between layers
- Often more than 3 layers
- Number of output units need not equal number of input units
- Number of hidden units per layer can be more or less than input or output units
- Each unit is a perceptron
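A minimal sketch (with assumed layer sizes, not taken from the slides) of how these properties map onto weight matrices: each layer is fully connected to the next, there are no intra-layer or input-to-output skip connections, and every unit applies the same perceptron-style activation to its weighted input plus bias.

```python
import numpy as np

rng = np.random.default_rng(0)

layer_sizes = [3, 4, 2]   # assumed: 3 inputs, 4 hidden units, 2 outputs

# One weight matrix per pair of adjacent layers (fully connected between layers);
# the extra column holds the bias, treated as a weight on a constant +1 input.
weights = [rng.normal(scale=0.1, size=(n_out, n_in + 1))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def f(x):
    """Activation applied by every unit (sigmoid chosen for illustration)."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights):
    """Each layer sees only the previous layer's outputs: no connections within
    a layer and no direct input-to-output connections."""
    a = np.asarray(x, dtype=float)
    for W in weights:
        a = f(W @ np.append(a, 1.0))
    return a

print(forward([0.5, -1.0, 2.0], weights))
```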

14-15 [Figure slides]

16 But how are the weights for units 1 and 2 found when the error is only computed for output unit 3? There is no direct error signal for units 1 and 2! Credit assignment problem: the problem of assigning 'credit' or 'blame' to the individual elements (the hidden units) involved in forming the overall response of a learning system. In neural networks, the problem is to decide which weights should be altered, by how much, and in which direction.

17-18 Backpropagation learning algorithm 'BP' The solution to the credit assignment problem in the MLP (Rumelhart, Hinton and Williams, 1986). BP has two phases:
Forward pass phase: computes the 'functional signal', the feedforward propagation of the input pattern signals through the network.
Backward pass phase: computes the 'error signal', the propagation of the error (the difference between actual and desired output values) backwards through the network, starting at the output units.

19 [Diagram: I → w(t) → y → W(t) → o] I'll work out the trivial case; the general case is similar. Task: given data {I, d}, minimize the error function at the output unit
E = (d − o)² / 2 = [ d − f( W(t) y(t) ) ]² / 2 = [ d − f( W(t) f( w(t) I ) ) ]² / 2,
where y = f( w(t) I ) is the output of the hidden unit. The weights at time t are w(t) and W(t); we intend to find the weights w and W at time t+1.

20 Forward pass phase Suppose that we have w(t) and W(t) at time t. For a given input I we can calculate y = f( w(t) I ) and o = f( W(t) y ) = f( W(t) f( w(t) I ) ). The error function of the output unit is E = (d − o)² / 2. [Diagram: I → w(t) → y → W(t) → o]

21-23 Backward pass phase [Diagram: I → w(t) → y → W(t) → o]
For the output weight, gradient descent on E gives
W(t+1) = W(t) − η ∂E/∂W(t) = W(t) + η ( d − o ) f'( W(t) y ) y = W(t) + η Δ y,
where Δ = ( d − o ) f', with f' evaluated at the output unit's net input W(t) y.
For the hidden weight, applying the chain rule through y = f( w(t) I ) gives
w(t+1) = w(t) − η ∂E/∂w(t) = w(t) + η Δ W(t) f'( w(t) I ) I.
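The trivial one-input, one-hidden-unit, one-output case above can be checked numerically. The sketch below assumes a sigmoid for f and picks a learning rate and training pair arbitrarily; only the update rules themselves come from the derivation.

```python
import numpy as np

def f(x):          # sigmoid activation (assumed choice of f)
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):    # its derivative, f'(x) = f(x) (1 - f(x))
    s = f(x)
    return s * (1.0 - s)

I, d = 1.0, 0.9    # a single training pair {I, d} (assumed values)
w, W = 0.1, 0.2    # initial hidden and output weights (assumed)
eta = 0.5          # learning rate (assumed)

for t in range(5000):
    # Forward pass: functional signal
    y = f(w * I)                              # hidden unit output
    o = f(W * y)                              # network output
    # Backward pass: error signal
    Delta = (d - o) * f_prime(W * y)          # Delta = (d - o) f'
    delta = Delta * W * f_prime(w * I)        # hidden delta via the chain rule
    # Gradient-descent updates
    W = W + eta * Delta * y
    w = w + eta * delta * I

print(f"output after training: {f(W * f(w * I)):.3f}  (target {d})")
```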

24 Work out the learning rule for the three-layer case by yourself.

25 We will concentrate on the three-layer case, but it could easily be generalized: I are the inputs, O the outputs, w the connections between input and hidden units, W the connections between hidden units and outputs, and y the activity of a hidden unit. At time t,
y_i(t) = f( Σ_j w_ij(t) I_j(t) ) = f( net_i(t) )
O_i(t) = f( Σ_j W_ij(t) y_j(t) ) = f( Net_i(t) )
where net(t) is the network input to the unit at time t.

26 Forward pass Weights are fixed during the forward and backward passes at time t.
1. Compute values for the hidden units: y_j(t) = f( Σ_i w_ji(t) I_i(t) )
2. Compute values for the output units: O_k(t) = f( Σ_j W_kj(t) y_j(t) )
[Diagram: inputs I_i → weights w_ji(t) → hidden units y_j → weights W_kj(t) → outputs O_k]
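A minimal sketch of this forward pass with assumed sizes (n inputs, m hidden units, p outputs) and a sigmoid for f; the two numbered steps correspond to the two numbered lines below.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m, p = 3, 4, 2                        # assumed numbers of input, hidden, output units
rng = np.random.default_rng(1)
w = rng.normal(scale=0.1, size=(m, n))   # input-to-hidden weights w_ji(t)
W = rng.normal(scale=0.1, size=(p, m))   # hidden-to-output weights W_kj(t)

I = np.array([0.2, -0.5, 1.0])           # one input pattern (assumed values)

net = w @ I                              # 1. hidden net inputs net_j = sum_i w_ji I_i
y = f(net)                               #    hidden unit values y_j = f(net_j)
Net = W @ y                              # 2. output net inputs Net_k = sum_j W_kj y_j
O = f(Net)                               #    output unit values O_k = f(Net_k)
print("y =", y, "\nO =", O)
```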

27 Backward Pass Recall the delta rule (Lecture 5): the error measure for pattern n is
E(n) = 1/2 Σ_k ( d_k(n) − O_k(n) )².
We want to know how to modify the weights in order to decrease E, i.e. to change each weight in proportion to −∂E/∂w_ij, both for hidden units and output units. This derivative can be rewritten as the product of two terms using the chain rule:
∂E/∂w_ij = ( ∂E/∂net_i ) ( ∂net_i/∂w_ij )

28 Term A: how the error for the pattern changes as a function of a change in the network input to unit i, ∂E/∂net_i. Term B: how the net input to unit i changes as a function of a change in the weight w_ij, ∂net_i/∂w_ij. (Both for hidden units and output units.)

29 Term A: let δ_i(t) = −∂E(t)/∂net_i(t) for hidden units, and Δ_i(t) = −∂E(t)/∂Net_i(t) for output units.
Term B: since net_i(t) = Σ_j w_ij(t) I_j(t), we have ∂net_i(t)/∂w_ij(t) = I_j(t); likewise ∂Net_i(t)/∂W_ij(t) = y_j(t).

30 Combining A and B gives ∂E/∂w_ij(t) = −δ_i(t) I_j(t) and ∂E/∂W_ij(t) = −Δ_i(t) y_j(t). So to achieve gradient descent in E we should change the weights by
w_ij(t+1) − w_ij(t) = η δ_i(t) I_j(t)
W_ij(t+1) − W_ij(t) = η Δ_i(t) y_j(t)

31 Now we need to find δ_i(t) and Δ_i(t) for each unit in the network; a simple recursive method is used to compute them.
For an output unit, Δ_i(t) = (Term 1)(Term 2), where
Term 1 = −∂E(t)/∂O_i(t) = ( d_i(t) − O_i(t) )
Term 2 = ∂O_i(t)/∂Net_i(t) = f'( Net_i(t) ), since O_i(t) = f( Net_i(t) ).
Combining term 1 and term 2 gives Δ_i(t) = ( d_i(t) − O_i(t) ) f'( Net_i(t) ).

32 For a hidden unit, Term 1 requires expanding E in terms of the hidden activities:
E(t) = 1/2 Σ_k ( d_k(t) − O_k(t) )²
     = 1/2 Σ_k [ d_k(t) − f( Net_k(t) ) ]²
     = 1/2 Σ_k [ d_k(t) − f( Σ_i W_ki(t) y_i(t) ) ]²
     = 1/2 Σ_k [ d_k(t) − f( Σ_i W_ki(t) f( net_i(t) ) ) ]²

33 For a hidden unit:
Term 1 = −∂E(t)/∂y_i(t) = Σ_k ( d_k(t) − O_k(t) ) f'( Net_k(t) ) W_ki(t) = Σ_k Δ_k(t) W_ki(t)
Term 2, as for the output unit: f'( net_i(t) )
Combining 1 and 2 gives δ_i(t) = f'( net_i(t) ) Σ_k Δ_k(t) W_ki(t).
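As a sanity check on this hidden-unit delta (a sketch with assumed sizes and a sigmoid f, not from the slides), the formula δ_i = f'(net_i) Σ_k Δ_k W_ki can be compared against a finite-difference estimate of −∂E/∂net_i:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(3)
n, m, p = 3, 4, 2                       # assumed sizes
w = rng.normal(size=(m, n))
W = rng.normal(size=(p, m))
I = rng.normal(size=n)
d = rng.normal(size=p)

def E_of(net):
    """Error E = 1/2 sum_k (d_k - O_k)^2 as a function of the hidden net inputs."""
    O = f(W @ f(net))
    return 0.5 * np.sum((d - O) ** 2)

net = w @ I
y = f(net)
O = f(W @ y)
Delta = (d - O) * O * (1 - O)           # output deltas, using f' = f (1 - f)
delta = y * (1 - y) * (W.T @ Delta)     # hidden deltas: f'(net_i) * sum_k Delta_k W_ki

eps = 1e-6                              # central-difference check of -dE/dnet_i
numeric = np.array([-(E_of(net + eps * e) - E_of(net - eps * e)) / (2 * eps)
                    for e in np.eye(m)])
print(np.allclose(delta, numeric, atol=1e-6))   # expected: True
```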

34 Backward Pass The weights W_ki here can be viewed as providing a degree of 'credit' or 'blame' to the hidden units.

35 Summary Weight updates are local:
output unit: W_ij(t+1) = W_ij(t) + η Δ_i(t) y_j(t), with Δ_i(t) = ( d_i(t) − O_i(t) ) f'( Net_i(t) )
hidden unit: w_ij(t+1) = w_ij(t) + η δ_i(t) I_j(t), with δ_i(t) = f'( net_i(t) ) Σ_k Δ_k(t) W_ki(t)
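Putting the forward and backward passes together, here is a minimal sketch of these local update rules applied repeatedly to a single pattern. The sizes, learning rate, input and target are illustrative assumptions; the deltas and updates follow the summary above.

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

def f_prime(x):
    s = f(x)
    return s * (1.0 - s)

n, m, p = 3, 4, 2                        # assumed numbers of input, hidden, output units
rng = np.random.default_rng(2)
w = rng.normal(scale=0.5, size=(m, n))   # input-to-hidden weights w_ij
W = rng.normal(scale=0.5, size=(p, m))   # hidden-to-output weights W_ij
eta = 0.5                                # learning rate (assumed)

I = np.array([0.1, 0.7, -0.3])           # input pattern (assumed)
d = np.array([0.9, 0.1])                 # desired output (assumed)

for t in range(2000):
    # Forward pass
    net = w @ I; y = f(net)
    Net = W @ y; O = f(Net)
    # Backward pass: deltas
    Delta = (d - O) * f_prime(Net)            # output units: (d_i - O_i) f'(Net_i)
    delta = f_prime(net) * (W.T @ Delta)      # hidden units: f'(net_i) sum_k Delta_k W_ki
    # Local weight updates, applied at the same time
    W = W + eta * np.outer(Delta, y)          # W_ij += eta Delta_i y_j
    w = w + eta * np.outer(delta, I)          # w_ij += eta delta_i I_j

print("trained output:", f(W @ f(w @ I)), " target:", d)
```

For several training patterns one would loop over the patterns, or accumulate the weight changes before applying them.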

36 Once the weight changes are computed for all units, the weights are updated at the same time (the bias is included as a weight here). We now compute the derivative of the activation function.

37 [Figure slide: the derivative of the activation function.]
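Since the slide's content is not recoverable from the transcript, here is a sketch assuming the commonly used logistic sigmoid, f(x) = 1 / (1 + e^(−x)); its derivative takes the convenient form f'(x) = f(x) (1 − f(x)), so it can be computed directly from values already obtained in the forward pass.

```python
import numpy as np

def f(x):
    """Logistic sigmoid activation (assumed choice of f)."""
    return 1.0 / (1.0 + np.exp(-x))

def f_prime_from_output(fx):
    """Derivative in terms of the unit's own output: f'(x) = f(x) (1 - f(x))."""
    return fx * (1.0 - fx)

x = np.linspace(-4.0, 4.0, 9)
fx = f(x)

eps = 1e-6                                   # compare with a numerical derivative
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
print(np.allclose(f_prime_from_output(fx), numeric, atol=1e-6))   # expected: True
```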

