CS621: Artificial Intelligence. Pushpak Bhattacharyya, CSE Dept., IIT Bombay. Lecture 21: Perceptron training and convergence.


1 CS621 : Artificial Intelligence Pushpak Bhattacharyya CSE Dept., IIT Bombay Lecture 21: Perceptron training and convergence

2 The Perceptron Model A perceptron is a computing element with input lines having associated weights and the cell having a threshold value. The perceptron model is motivated by the biological neuron. (Figure: inputs x_1, …, x_n with weights w_1, …, w_n feeding a cell with threshold θ and output y.)

3 Step function / Threshold function: y = 1 for Σ w_i x_i ≥ θ, y = 0 otherwise. (Figure: y plotted against Σ w_i x_i, stepping from 0 to 1 at θ.)
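
To make the threshold law concrete, here is a minimal Python sketch (the name step_output is just illustrative, not from the lecture):

```python
def step_output(weights, inputs, theta):
    """Step / threshold function: y = 1 if the weighted sum reaches theta, else 0."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= theta else 0

print(step_output([0.5, 0.5], [1, 1], 0.9))   # 1.0 >= 0.9 -> 1
print(step_output([0.5, 0.5], [1, 0], 0.9))   # 0.5 <  0.9 -> 0
```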

4 Computation of Boolean functions: AND of 2 inputs
x1  x2  y
0   0   0
0   1   0
1   0   0
1   1   1
The parameter values (weights and threshold) need to be found. (Figure: a perceptron with inputs x_1, x_2, weights w_1, w_2, threshold θ and output y.)

5 Computing parameter values
w1·0 + w2·0 ≤ θ => 0 ≤ θ; since y=0
w1·0 + w2·1 ≤ θ => w2 ≤ θ; since y=0
w1·1 + w2·0 ≤ θ => w1 ≤ θ; since y=0
w1·1 + w2·1 > θ => w1 + w2 > θ; since y=1
w1 = w2 = θ = 0.5 satisfies these inequalities and gives parameters that compute the AND function.
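
The derived parameters can be checked mechanically. The sketch below (illustrative only) uses the strict test Σ w_i x_i > θ, so that the 0-class inequalities above (sums ≤ θ) map to output 0, and prints the AND truth table for w1 = w2 = θ = 0.5:

```python
def fires(weights, theta, inputs):
    # Strict test: y = 1 only when the weighted sum exceeds theta,
    # so the 0-class cases above (sum <= theta) give y = 0.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > theta else 0

w1 = w2 = theta = 0.5
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, fires([w1, w2], theta, [x1, x2]))
# y = 1 only for x1 = x2 = 1, i.e. the AND function
```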

6 Perceptron as a learning device

7 Perceptron Training Algorithm (PTA) Preprocessing: 1. The computation law is modified to y = 1 if Σ w_i x_i > θ and y = 0 if Σ w_i x_i ≤ θ. (Figure: the unit before and after the change in the threshold test.)

8 PTA – preprocessing cont… 2. Absorb θ as a weight: add an extra input x_0 = -1 with weight w_0 = θ, so the test becomes Σ_{i=0..n} w_i x_i > 0. 3. Negate all the zero-class examples. (Figure: the perceptron with the additional input x_0 = -1 carrying weight w_0 = θ.)

9 Example to demonstrate preprocessing: OR perceptron. 1-class: <0,1>, <1,0>, <1,1>; 0-class: <0,0>. Augmented x vectors: 1-class: <-1,0,1>, <-1,1,0>, <-1,1,1>; 0-class: <-1,0,0>. Negate 0-class: <1,0,0>.

10 Example to demonstrate preprocessing cont… Now the vectors are:
      x0  x1  x2
X1    -1   0   1
X2    -1   1   0
X3    -1   1   1
X4     1   0   0
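
A small helper (an assumed implementation, not from the lecture) reproduces this table from the raw OR examples by applying the two preprocessing steps:

```python
import numpy as np

def preprocess(X, y):
    """Absorb theta by prepending x0 = -1 to every example, then negate the 0-class rows."""
    A = np.hstack([-np.ones((len(X), 1), dtype=int), np.asarray(X, dtype=int)])
    A[np.asarray(y) == 0] *= -1
    return A

# OR function: 1-class <0,1>, <1,0>, <1,1>; 0-class <0,0>
print(preprocess([[0, 1], [1, 0], [1, 1], [0, 0]], [1, 1, 1, 0]))
# -> rows <-1,0,1>, <-1,1,0>, <-1,1,1>, <1,0,0>, i.e. X1..X4 above
```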

11 Perceptron Training Algorithm
1. Start with a random value of w.
2. Test for w·x_i > 0. If the test succeeds for i = 1, 2, …, n then return w.
3. Modify w: w_next = w_prev + x_fail, and go back to step 2.
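
A direct Python rendering of these three steps is sketched below (the zero initial weight vector and the "first failing vector" selection rule are arbitrary choices, which by the theorems on the next slide do not affect convergence):

```python
import numpy as np

def pta(vectors, w0=None, max_iters=1000):
    """Perceptron Training Algorithm on preprocessed (augmented, 0-class-negated) vectors."""
    X = np.asarray(vectors, dtype=float)
    w = np.zeros(X.shape[1]) if w0 is None else np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        failed = [x for x in X if w @ x <= 0]   # vectors failing the test w . x_i > 0
        if not failed:
            return w                            # every test succeeds: return w
        w = w + failed[0]                       # w_next = w_prev + x_fail, then retest
    raise RuntimeError("did not converge within max_iters")

# OR example from the previous slides
or_vectors = [[-1, 0, 1], [-1, 1, 0], [-1, 1, 1], [1, 0, 0]]
w = pta(or_vectors)
print(w, all(np.dot(w, x) > 0 for x in or_vectors))   # a separating w, True
```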

12 Tracing PTA on OR-example. Starting from the initial w, the test w·x_i > 0 fails for x1, then x4, x2, x1, x4, x2 and x4 in turn; after each failure the failed vector is added to w, and after the last update every test succeeds, so PTA terminates.

13 Theorems on PTA 1. The process will terminate. 2. The order of selection of x_i for testing and of w_next does not matter.

14 Proof of Convergence of PTA Perceptron Training Algorithm (PTA) Statement: Whatever be the initial choice of weights and whatever be the vector chosen for testing, PTA converges if the vectors are from a linearly separable function.

15 Proof of Convergence of PTA. Suppose w_n is the weight vector at the n-th step of the algorithm. At the beginning, the weight vector is w_0. Go from w_i to w_{i+1} when a vector X_j fails the test w_i·X_j > 0, and update w_i as w_{i+1} = w_i + X_j. Since the X_j's come from a linearly separable function, ∃ w* s.t. w*·X_j > 0 ∀ j.

16 Proof of Convergence of PTA. Consider the expression G(w_n) = (w_n · w*) / |w_n|, where w_n = weight at the n-th iteration. Then G(w_n) = (|w_n| · |w*| · cos β) / |w_n|, where β = angle between w_n and w*, so G(w_n) = |w*| · cos β, and hence G(w_n) ≤ |w*| (as -1 ≤ cos β ≤ 1).

17 Behavior of Numerator of G
w_n · w* = (w_{n-1} + X^fail_{n-1}) · w*
= w_{n-1} · w* + X^fail_{n-1} · w*
= (w_{n-2} + X^fail_{n-2}) · w* + X^fail_{n-1} · w*
= …
= w_0 · w* + (X^fail_0 + X^fail_1 + … + X^fail_{n-1}) · w*
w* · X^fail_i is always positive: note carefully.
Suppose |X_j| ≥ δ, where δ is the minimum magnitude. Then the numerator of G ≥ |w_0 · w*| + n·δ·|w*|. So the numerator of G grows with n.

18 Behavior of Denominator of G
|w_n| = √(w_n · w_n)
= √((w_{n-1} + X^fail_{n-1})^2)
= √((w_{n-1})^2 + 2·w_{n-1}·X^fail_{n-1} + (X^fail_{n-1})^2)
≤ √((w_{n-1})^2 + (X^fail_{n-1})^2)   (as w_{n-1} · X^fail_{n-1} ≤ 0, since the test failed)
≤ √((w_0)^2 + (X^fail_0)^2 + (X^fail_1)^2 + … + (X^fail_{n-1})^2)
With |X_j| ≤ κ (the maximum magnitude), the denominator ≤ √((w_0)^2 + n·κ^2).

19 Some Observations. The numerator of G grows as n, while the denominator of G grows as √n => the numerator grows faster than the denominator. If PTA does not terminate, the G(w_n) values will become unbounded.
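
These growth rates can be watched numerically. In the sketch below, w* = <1, 2, 2> is one hand-picked separator for the OR vectors (w*·X_j > 0 for every j, easy to check by hand); it is an assumption for illustration, not part of the lecture. G(w_n) stays below |w*| throughout the run while the weights keep growing:

```python
import numpy as np

vectors = np.array([[-1, 0, 1], [-1, 1, 0], [-1, 1, 1], [1, 0, 0]], dtype=float)
w_star = np.array([1.0, 2.0, 2.0])             # a separating weight vector: w* . X_j > 0 for all j

w = np.zeros(3)
for step in range(1, 20):
    failed = [x for x in vectors if w @ x <= 0]
    if not failed:
        break                                  # PTA has terminated
    w = w + failed[0]                          # one PTA update
    G = (w @ w_star) / np.linalg.norm(w)       # numerator grows like n, denominator like sqrt(n)
    print(step, round(G, 3), "<=", round(np.linalg.norm(w_star), 3))
```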

20 Some Observations contd. But, as |G(w_n)| ≤ |w*|, which is finite, this is impossible! Hence, PTA has to converge. The proof is due to Marvin Minsky.

21 Convergence of PTA Whatever be the initial choice of weights and whatever be the vector chosen for testing, PTA converges if the vectors are from a linearly separable function.

22 Study of Linear Separability. (Figure: a perceptron with inputs x_1, …, x_n, weights w_1, …, w_n, threshold θ and output y.) W·X_j = 0 defines a hyperplane in (n+1) dimensions => the W vector and such X_j vectors are perpendicular to each other.

23 Linear Separability. (Figure: positive points X_1, X_2, …, X_k and negative points X_{k+1}, X_{k+2}, …, X_m lying on opposite sides of a separating hyperplane.) Positive set: w·X_j > 0 ∀ j ≤ k. Negative set: w·X_j < 0 ∀ j > k.

24 Test for Linear Separability (LS). Theorem: A function is linearly separable iff the vectors corresponding to the function do not have a Positive Linear Combination (PLC). PLC is both a necessary and sufficient condition. X_1, X_2, …, X_m are the vectors of the function and Y_1, Y_2, …, Y_m the augmented, negated set: Y_i is obtained by prepending -1 to X_i, and for 0-class vectors the augmented vector is additionally negated.

25 Example (1) - XNOR. The set {Y_i} has a PLC if ∑ P_i Y_i = 0 (1 ≤ i ≤ m), where each P_i is a non-negative scalar and at least one P_i > 0. Example: 2-bit even parity (the XNOR function).
X_i          class   Y_i
X1 = <0,0>     +     Y1 = <-1,  0,  0>
X2 = <0,1>     -     Y2 = < 1,  0, -1>
X3 = <1,0>     -     Y3 = < 1, -1,  0>
X4 = <1,1>     +     Y4 = <-1,  1,  1>

26 Example (1) - XNOR. P_1·[-1 0 0]^T + P_2·[1 0 -1]^T + P_3·[1 -1 0]^T + P_4·[-1 1 1]^T = [0 0 0]^T. Taking all P_i = 1 gives the result. For the parity function a PLC exists => not linearly separable.
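
A two-line numpy check (just verifying the arithmetic above) confirms that taking all P_i = 1 gives the zero vector:

```python
import numpy as np

Y = np.array([[-1,  0,  0],   # Y1
              [ 1,  0, -1],   # Y2
              [ 1, -1,  0],   # Y3
              [-1,  1,  1]])  # Y4
print(np.ones(4) @ Y)         # -> [0. 0. 0.]: a PLC exists, so XNOR is not linearly separable
```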

27 Example (2) – Majority function. (Table: the eight vectors X_i of the 3-bit majority function with their classes and the corresponding augmented, negated vectors Y_i.) Suppose a PLC exists. The equations obtained are:
P1 + P2 + P3 - P4 + P5 - P6 - P7 - P8 = 0
-P5 + P6 + P7 + P8 = 0
-P3 + P4 + P7 + P8 = 0
-P2 + P4 + P6 + P8 = 0
On solving, all P_i are forced to 0. So for the 3-bit majority function: no PLC => linearly separable (LS).
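
The PLC test can also be automated. One possible formulation (an assumption of this note, not from the lecture) is a small linear program solved with scipy: maximise Σ P_i subject to Σ P_i Y_i = 0 and 0 ≤ P_i ≤ 1; a PLC exists iff the optimum is positive. Applied below, it reports no PLC for the 3-bit majority function (linearly separable) and a PLC for XNOR (not linearly separable):

```python
import numpy as np
from scipy.optimize import linprog

def augmented_negated(X, y):
    """Prepend x0 = -1 to each example and negate the 0-class rows (the preprocessing of slides 8-9)."""
    Y = np.hstack([-np.ones((len(X), 1)), np.asarray(X, dtype=float)])
    Y[np.asarray(y) == 0] *= -1
    return Y

def has_plc(Y, tol=1e-9):
    """True iff some P >= 0, not all zero, satisfies sum_i P_i Y_i = 0."""
    m = len(Y)
    res = linprog(c=-np.ones(m),                 # maximise sum(P) by minimising -sum(P)
                  A_eq=Y.T, b_eq=np.zeros(Y.shape[1]),
                  bounds=(0, 1), method="highs")
    return res.status == 0 and -res.fun > tol

bits = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
majority = [int(a + b + c >= 2) for a, b, c in bits]
xnor_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xnor = [int(a == b) for a, b in xnor_inputs]

print(has_plc(augmented_negated(bits, majority)))      # False -> no PLC -> linearly separable
print(has_plc(augmented_negated(xnor_inputs, xnor)))   # True  -> PLC exists -> not linearly separable
```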

28 Limitations of perceptron Non-linear separability is all pervading Single perceptron does not have enough computing power Eg: XOR cannot be computed by perceptron

29 Solutions. Tolerate error (e.g., the pocket algorithm used by connectionist expert systems), i.e., try to get the best possible hyperplane using only perceptrons. Use higher-dimensional surfaces, e.g., degree-2 surfaces like the parabola. Use a layered network.

30 Example - XOR. XOR is computed as x1·(not x2) OR (not x1)·x2, so the calculation needs one unit per product term and one for the OR. Calculation of (not x1)·x2: the unit with w1 = -1, w2 = 1.5, θ = 1 gives
x1  x2  (not x1)·x2
0   0   0
0   1   1
1   0   0
1   1   0
Calculation of XOR: the output unit with w1 = w2 = 1, θ = 0.5 ORs the two product terms.

31 Example - XOR. (Figure: the two-layer network for XOR: hidden units computing x1·(not x2) and (not x1)·x2 from inputs x1 and x2, using weights -1 and 1.5 with θ = 1, feed an output unit with w1 = w2 = 1 and θ = 0.5.)
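
For completeness, here is a hedged Python sketch of this two-layer network; the weights of the second hidden unit, for x1·(not x2), are assumed by symmetry with the unit shown on the previous slide:

```python
def unit(weights, theta, inputs):
    """Threshold unit: output 1 when the weighted sum exceeds theta."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > theta else 0

def xor(x1, x2):
    h1 = unit([-1.0, 1.5], 1.0, [x1, x2])    # computes (not x1) and x2, as on the previous slide
    h2 = unit([1.5, -1.0], 1.0, [x1, x2])    # mirror unit for x1 and (not x2); weights assumed
    return unit([1.0, 1.0], 0.5, [h1, h2])   # output unit with w1 = w2 = 1, theta = 0.5 ORs them

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))           # prints the XOR truth table: 0, 1, 1, 0
```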

32 Multi Layer Perceptron (MLP). Question: How to find weights for the hidden layers when no target output is available? The credit assignment problem, to be solved by “Gradient Descent”.

