1
Backpropagation. Introduction to Artificial Intelligence, COS302. Michael L. Littman, Fall 2001.
2
Administration Questions, concerns?
3
Classification Perceptron diagram: inputs x1, x2, x3, …, xD, plus a constant input 1, are weighted by w1, w2, w3, …, wD and w0, summed into net, and passed through the squashing function g to produce out.
4
Perceptrons Recall that the squashing function makes the output look more like bits: 0 or 1 decisions. What if we give it inputs that are also bits?
5
A Boolean Function
A B C D E F G | out
1 0 1 0 1 0 1 | 0
0 1 1 0 0 0 1 | 0
0 0 1 0 0 1 0 | 0
1 0 0 0 1 0 0 | 1
0 0 1 1 0 0 0 | 1
1 1 1 0 1 0 1 | 0
0 1 0 1 0 0 1 | 1
1 1 1 1 1 0 1 | 1
1 1 1 1 1 1 1 | 1
1 1 1 0 0 1 1 | 0
6
Think Graphically Can a perceptron learn this? Plot out against C and D: it is 1 at (C,D) = (0,0), (0,1), and (1,1), and 0 at (1,0).
7
Ands and Ors out(x) = g(Σ_k w_k x_k). How can we set the weights to represent the AND (v1)(v2)(~v7)? Set w_i = 0 except w1 = 10, w2 = 10, w7 = -10, w0 = -15; the rule is w0 = 5 - max, where max is the largest value the weighted sum can take, so net is +5 when the conjunction holds and at most -5 otherwise. How about the OR ~v3 + v4 + ~v8? Set w_i = 0 except w3 = -10, w4 = 10, w8 = -10, w0 = 15; the rule is w0 = -5 - min, so net is -5 only when all three literals are false and at least +5 otherwise.
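A minimal sketch (not from the slides) checking these weight settings, assuming a sigmoid squashing function and a 1/2 output threshold; the helper names are mine:

import itertools
import math

def g(net):
    # Sigmoid squashing function: output > 1/2 exactly when net > 0.
    return 1.0 / (1.0 + math.exp(-net))

def perceptron(x, weights, w0):
    # out(x) = g(w0 + sum_k w_k x_k); weights maps 1-based variable index -> weight.
    return g(w0 + sum(w * x[k - 1] for k, w in weights.items()))

# Check AND (v1)(v2)(~v7) and OR ~v3 + v4 + ~v8 on all 256 assignments of v1..v8.
for x in itertools.product([0, 1], repeat=8):
    and_out = perceptron(x, {1: 10, 2: 10, 7: -10}, -15) > 0.5
    or_out = perceptron(x, {3: -10, 4: 10, 8: -10}, 15) > 0.5
    assert and_out == (x[0] == 1 and x[1] == 1 and x[6] == 0)
    assert or_out == (x[2] == 0 or x[3] == 1 or x[7] == 0)
print("AND and OR weight settings verified on all 256 inputs")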
8
Majority Are at least half the bits on? Set all weights to 1 and w0 = -n/2 (here n = 7, so w0 = -3.5; see the check after the table).
A B C D E F G | out
1 0 1 0 1 0 1 | 1
0 1 1 0 0 0 1 | 0
0 0 1 0 0 1 0 | 0
1 0 0 0 1 0 0 | 0
1 1 1 0 1 0 1 | 1
0 1 0 1 0 0 1 | 0
1 1 1 1 1 0 1 | 1
1 1 1 1 1 1 1 | 1
Representation size using a decision tree?
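A quick check (my own sketch, same sigmoid assumption as above) that all-ones weights with w0 = -n/2 compute majority for n = 7:

import itertools
import math

n = 7
for x in itertools.product([0, 1], repeat=n):
    out = 1.0 / (1.0 + math.exp(-(sum(x) - n / 2)))  # all weights 1, bias -n/2
    assert (out > 0.5) == (sum(x) >= n / 2)           # at least half the bits on
print("majority verified on all", 2 ** n, "inputs")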
9
Sweet Sixteen? The 16 Boolean functions of two inputs a and b, each paired with its negation:
ab | (~a)+(~b)
a(~b) | (~a)+b
(~a)b | a+(~b)
(~a)(~b) | a+b
a | ~a
b | ~b
1 | 0
a = b | a exclusive-or b (a ⊕ b)
10
XOR Constraints
A B | out | constraint
0 0 | 0 | g(w0) < 1/2, so w0 < 0
0 1 | 1 | g(wB + w0) > 1/2, so wB + w0 > 0
1 0 | 1 | g(wA + w0) > 1/2, so wA + w0 > 0
1 1 | 0 | g(wA + wB + w0) < 1/2, so wA + wB + w0 < 0
Adding the middle two constraints gives wA + wB + 2 w0 > 0; since w0 < 0, this forces wA + wB + w0 > 0, contradicting the last row: 0 < wA + wB + w0 < 0. No weights satisfy all four.
11
Linearly Separable XOR is problematic: on the 2x2 grid of its two inputs, the output is 0 at (0,0) and (1,1) and 1 at (0,1) and (1,0), so no single line separates the 1s from the 0s.
12
How Represent XOR? A xor B = (A+B)(~A+~B). Use a two-layer net: hidden unit c1 computes A+B (weights 10, 10 from A and B, bias -5); hidden unit c2 computes ~A+~B (weights -10, -10, bias 15); the output unit ANDs them (weights 10, 10 from c1 and c2, bias -15).
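A minimal sketch of that network, assuming sigmoid squashing and the weights as reconstructed above:

import math

def g(net):
    return 1.0 / (1.0 + math.exp(-net))

def xor_net(a, b):
    c1 = g(10 * a + 10 * b - 5)       # hidden unit c1: A OR B
    c2 = g(-10 * a - 10 * b + 15)     # hidden unit c2: (~A) OR (~B)
    return g(10 * c1 + 10 * c2 - 15)  # output: c1 AND c2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", round(xor_net(a, b)))  # prints the XOR truth table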
13
Requiem for a Perceptron Rosenblatt proved that a perceptron will learn any linearly separable function. Minsky and Papert (1969) in Perceptrons: “there is no reason to suppose that any of the virtues carry over to the many- layered version.”
14
Backpropagation Bryson and Ho (1969, same year) described a training procedure for multilayer networks. Went unnoticed. Multiply rediscovered in the 1980s.
15
Multilayer Net diagram: inputs x1, x2, x3, …, xD feed weights W_kj into hidden sums net_1, net_2, …, net_H; each net_j is squashed by g into hid_j; the hidden values feed weights U_1, …, U_H (plus bias U_0) into the output sum net_i, which is squashed into out.
16
Multiple Outputs Makes no difference for the perceptron. Add more outputs off the hidden layer in the multilayer case.
17
Output Function out_i(x) = g(Σ_j U_ji g(Σ_k W_kj x_k)), where j ranges over the H "hidden" nodes. Also possible: use more than one hidden layer; use direct input-output weights.
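A minimal vectorized sketch of this output function (NumPy assumed; bias terms omitted for brevity; the shapes and names are my own):

import numpy as np

def g(net):
    return 1.0 / (1.0 + np.exp(-net))

def forward(x, W, U):
    # out_i(x) = g(sum_j U[j, i] * g(sum_k W[k, j] * x[k]))
    hid = g(x @ W)     # W has shape (D, H): one column per hidden node
    return g(hid @ U)  # U has shape (H, O): one column per output node

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))                 # D = 3 inputs, H = 2 hidden nodes
U = rng.normal(size=(2, 1))                 # O = 1 output
print(forward(np.array([1.0, 0.0, 1.0]), W, U))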
18
How Train? Find a set of weights U, W that minimize Σ_(x,y) Σ_i (y_i - out_i(x))^2 using gradient descent. Incremental version (vs. batch): move the weights a small amount for each training example.
19
Updating Weights
1. Feed-forward to hidden: net_j = Σ_k W_kj x_k; hid_j = g(net_j)
2. Feed-forward to output: net_i = Σ_j U_ji hid_j; out_i = g(net_i)
3. Update output weights: δ_i = g'(net_i) (y_i - out_i); U_ji += hid_j δ_i
4. Update hidden weights: δ_j = g'(net_j) Σ_i U_ji δ_i; W_kj += x_k δ_j
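A sketch of one incremental update following steps 1 to 4 (NumPy assumed; I add a learning rate alpha, which the slide leaves implicit, and compute the hidden deltas with the pre-update U):

import numpy as np

def g(net):
    return 1.0 / (1.0 + np.exp(-net))

def g_prime(net):
    s = g(net)
    return s * (1.0 - s)

def backprop_step(x, y, W, U, alpha=0.1):
    # x: inputs (D,), y: targets (O,), W: (D, H), U: (H, O)
    net_hid = x @ W                                    # 1. net_j = sum_k W_kj x_k
    hid = g(net_hid)                                   #    hid_j = g(net_j)
    net_out = hid @ U                                  # 2. net_i = sum_j U_ji hid_j
    out = g(net_out)                                   #    out_i = g(net_i)
    delta_out = g_prime(net_out) * (y - out)           # 3. delta_i
    delta_hid = g_prime(net_hid) * (U @ delta_out)     # 4. delta_j (uses old U)
    U = U + alpha * np.outer(hid, delta_out)           #    U_ji += hid_j delta_i
    W = W + alpha * np.outer(x, delta_hid)             #    W_kj += x_k delta_j
    return W, U

Looping backprop_step over the (x, y) training pairs gives the incremental version from the previous slide.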
20
Multilayer Net (schema): x_k feeds weight W_kj into net_j and hid_j, which feeds weight U_ji into net_i and out_i; comparing out_i with the target y_i yields δ_i at the output layer and δ_j at the hidden layer.
21
Does it Work? Sort of: lots of practical applications, and lots of people play with it. Fun. However, it can fall prey to the standard problems with local search… and it is NP-hard to train even a 3-node net.
22
Step Size Issues Too small? Too big?
23
Representation Issues Any continuous function can be represented by a net with one hidden layer, given sufficiently many hidden nodes. Any function at all can be represented by a net with two hidden layers, given sufficiently many hidden nodes. What's the downside for learning?
24
Generalization Issues Pruning weights: “optimal brain damage” Cross validation Much, much more to this. Take a class on machine learning.
25
What to Learn
Representing logical functions using sigmoid units
Majority (net vs. decision tree)
XOR is not linearly separable
Adding layers adds expressibility
Backprop is gradient descent
26
Homework 10 (due 12/12) 1. Describe a procedure for converting a Boolean formula in CNF (n variables, m clauses) into an equivalent network. How many hidden units does it have? 2. More soon