
1 Neural Networks Introduction to Artificial Intelligence COS302 Michael L. Littman Fall 2001

2 Administration
11/28 Neural Networks Ch. 19 [19.3, 19.4]
12/03 Latent Semantic Indexing
12/05 Belief Networks Ch. 15 [15.1, 15.2]
12/10 Belief Network Inference Ch. 19 [19.6]

3 Proposal
11/28 Neural Networks Ch. 19 [19.3, 19.4]
12/03 Backpropagation in NNs
12/05 Latent Semantic Indexing
12/10 Segmentation

4 Regression: Data
x_1 = 2, y_1 = 1
x_2 = 6, y_2 = 2.2
x_3 = 4, y_3 = 2
x_4 = 3, y_4 = 1.9
x_5 = 4, y_5 = 3.1
Given x, want to predict y.

5 Regression: Picture

6 Linear Regression Linear regression assumes that the expected value of the output given an input E(y|x) is linear. Simplest case: out(x) = w x for some unknown weight w. Estimate w given the data.

7 1-Parameter Linear Reg.
Assume that the data is formed by y_i = w x_i + noise, where…
- the noise signals are independent
- the noise is normally distributed with mean 0 and unknown variance σ^2

8 Distribution for ys
[Figure: Gaussian over y centered at wx, with tick marks at wx, wx ± σ, and wx ± 2σ]
Pr(y | w, x) is normally distributed with mean wx and variance σ^2.

9 Data to Model
Fix the xs. What w makes the ys most likely? Also known as…
argmax_w Pr(y_1 … y_n | x_1 … x_n, w)
= argmax_w Π_i Pr(y_i | x_i, w)
= argmax_w Π_i exp(-1/2 ((y_i - w x_i) / σ)^2)
= argmin_w Σ_i (y_i - w x_i)^2
Minimize the sum of squared residuals.
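
To make the chain above concrete, here is a small Python sketch (not from the slides) that scans candidate values of w on the slide-4 data and confirms that the likelihood maximizer and the least-squares minimizer coincide; the candidate grid and σ = 1 are illustrative assumptions.

```python
# A minimal sketch, not from the slides: scan candidate w values on the
# slide-4 data and confirm that the likelihood maximizer and the
# least-squares minimizer agree.  The grid and sigma = 1.0 are assumptions.
import math

xs = [2, 6, 4, 3, 4]
ys = [1, 2.2, 2, 1.9, 3.1]
sigma = 1.0  # assumed noise standard deviation; it does not change the argmax

def log_likelihood(w):
    # log Pr(y_1 ... y_n | x_1 ... x_n, w) under y_i ~ Normal(w * x_i, sigma^2)
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - 0.5 * ((y - w * x) / sigma) ** 2
               for x, y in zip(xs, ys))

def squared_error(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))

candidates = [i / 1000 for i in range(2001)]          # w in [0, 2]
print(max(candidates, key=log_likelihood))            # most likely w
print(min(candidates, key=squared_error))             # least-squares w (same)
```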

10 Residuals
[Figure: fitted line with the residuals drawn as vertical distances from each data point to the line]

11 How Minimize?
E = Σ_i (y_i - w x_i)^2 = Σ_i y_i^2 - (2 Σ_i x_i y_i) w + (Σ_i x_i^2) w^2
Minimize this quadratic function of w.
E is minimized at w* = (Σ_i x_i y_i) / (Σ_i x_i^2), so the ML model is Out(x) = w* x.
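
As a quick check, here is a minimal Python sketch (not part of the original slides) that computes w* on the slide-4 data; the query point x = 5 is just an illustrative prediction.

```python
# A quick check, not from the slides: the closed-form w* on the slide-4 data.
xs = [2, 6, 4, 3, 4]
ys = [1, 2.2, 2, 1.9, 3.1]

w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def out(x):
    return w_star * x

print(w_star)     # ML weight, roughly 0.51
print(out(5.0))   # illustrative prediction for a new input x = 5
```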

12 Multivariate Regression
What if the inputs are vectors? n data points, each with D components.
X stacks the inputs x_1 … x_n as rows (an n × D matrix); Y stacks the outputs y_1 … y_n (an n × 1 vector).

13 Closed Form Solution
Multivariate linear regression assumes a vector w s.t.
Out(x) = w^T x = w[1] x[1] + … + w[D] x[D]
ML solution: w = (X^T X)^-1 (X^T Y)
X^T X is D×D; its (k, j) element is Σ_i x_ij x_ik.
X^T Y is D×1; its k-th element is Σ_i x_ik y_i.
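
A minimal sketch of the closed-form solution, assuming NumPy is available; the 4-point, 2-component data matrix below is made up for illustration and is not from the slides.

```python
# A minimal sketch assuming NumPy; the n = 4, D = 2 data below is made up.
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])          # inputs stacked as rows (n x D)
Y = np.array([3.1, 2.4, 4.0, 6.6])  # outputs (n x 1)

# ML solution w = (X^T X)^-1 (X^T Y); solving the linear system is preferred
# to forming the explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ Y)
print(w)        # estimated weight vector
print(X @ w)    # fitted outputs Out(x_i) = w^T x_i
```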

14 Got Constants?

15 Fitting with an Offset We might expect a linear function that doesn’t go through the origin. Simple obvious hack so we don’t have to start from scratch…
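
One reading of the hack in code (my interpretation, not the slides'): append a constant 1 to every input so that an extra weight acts as the offset, then reuse the same closed-form solver. NumPy and the reuse of the slide-4 data are assumptions.

```python
# Append a constant 1 to every input so the extra weight acts as the offset,
# then reuse the closed-form solver.  NumPy and the slide-4 data are assumed.
import numpy as np

X = np.array([[2.0], [6.0], [4.0], [3.0], [4.0]])  # slide-4 inputs, D = 1
Y = np.array([1.0, 2.2, 2.0, 1.9, 3.1])

X1 = np.hstack([np.ones((X.shape[0], 1)), X])      # prepend a column of 1s
offset, slope = np.linalg.solve(X1.T @ X1, X1.T @ Y)
print(offset, slope)   # fitted line no longer forced through the origin
```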

16 Gradient Descent
Scalar function f(w): ℝ → ℝ. Want a local minimum.
Start with some value for w.
Gradient descent rule: w ← w - η ∂f(w)/∂w
η is the “learning rate” (a small positive number). Justify!
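
A minimal Python sketch of the rule on a toy scalar function; the function f, the starting point, the learning rate η = 0.1, and the iteration count are all illustrative assumptions.

```python
# A minimal sketch of w <- w - eta * df/dw on a toy scalar function.
# f, the start point, eta, and the iteration count are illustrative choices.
def f(w):
    return (w - 3.0) ** 2 + 1.0   # local (and global) minimum at w = 3

def df_dw(w):
    return 2.0 * (w - 3.0)

eta = 0.1      # learning rate: a small positive number
w = -5.0       # start with some value for w
for _ in range(100):
    w = w - eta * df_dw(w)
print(w, f(w))  # w ends up very close to 3
```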

17 Partial Derivatives
E = Σ_k (w^T x_k - y_k)^2 = f(w)
w_j ← w_j - η ∂f(w)/∂w_j
How would a small increase in weight w_j change the error?
Small positive? Large positive? Small negative? Large negative?

18 Neural Net Connection
Set of weights w. Find the weights that minimize the sum-of-squared residuals. Why?
When would we want to use gradient descent?

19 Linear Perceptron
Earliest, simplest NN.
[Figure: inputs x_1 … x_D and a constant input 1 feed a weighted sum with weights w_1 … w_D and w_0, producing output y]

20 Learning Rule
Multivariate linear function, trained by gradient descent. Derive the update rule…
out(x) = w^T x
E = Σ_k (w^T x_k - y_k)^2 = f(w)
w_j ← w_j - η ∂f(w)/∂w_j

21 “Batch” Algorithm
1. Randomly initialize w_1 … w_D
2. Append 1s to the inputs to allow the function to miss the origin
3. For i = 1 to n: δ_i = y_i - w^T x_i
4. For j = 1 to D: w_j = w_j + η Σ_i δ_i x_ij
5. If Σ_i δ_i^2 is small, stop; else go to 3.
Why squared?
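
A Python sketch of the batch algorithm above, run on the slide-4 data; the learning rate, tolerance, iteration cap, and initialization range are illustrative choices, not from the slides.

```python
# A sketch of the batch algorithm, run on the slide-4 data.  The learning
# rate, tolerance, iteration cap, and init range are illustrative choices.
import random

def train_batch(inputs, targets, eta=0.01, tol=1e-6, max_iters=10000):
    # Step 2: append a 1 to each input so the function can miss the origin.
    xs = [list(x) + [1.0] for x in inputs]
    D = len(xs[0])
    # Step 1: randomly initialize the weights.
    w = [random.uniform(-0.1, 0.1) for _ in range(D)]
    for _ in range(max_iters):
        # Step 3: residuals delta_i = y_i - w^T x_i.
        deltas = [y - sum(wj * xj for wj, xj in zip(w, x))
                  for x, y in zip(xs, targets)]
        # Step 4: w_j = w_j + eta * sum_i delta_i * x_ij.
        for j in range(D):
            w[j] += eta * sum(d * x[j] for d, x in zip(deltas, xs))
        # Step 5: stop if the summed squared residuals are small.
        if sum(d * d for d in deltas) < tol:
            break
    return w

print(train_batch([[2.0], [6.0], [4.0], [3.0], [4.0]],
                  [1.0, 2.2, 2.0, 1.9, 3.1]))   # [slope, offset]
```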

22 Classification Let’s say all outputs are 0 or 1. How can we interpret the output of the perceptron as zero or one?

23 Classification

24 Change Output Function
Solution: instead of out(x) = w^T x, we’ll use out(x) = g(w^T x),
where g: ℝ → (0, 1) is a squashing function.

25 Sigmoid
E = Σ_k (g(w^T x_k) - y_k)^2 = f(w), where g(h) = 1 / (1 + e^-h)
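
A short Python sketch of the squashing function, just to make the (0, 1) range visible; the sample points are arbitrary.

```python
# A short sketch of the squashing function; the sample points are arbitrary.
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

for h in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(h, g(h))   # every output falls strictly between 0 and 1
```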

26 Classification Percept.
[Figure: inputs x_1 … x_D and a constant input 1 feed a weighted sum net_i with weights w_0, w_1 … w_D, followed by the squashing function g to produce output y]

27 Classifying Regions
[Figure: examples labeled 1 and 0 plotted in the (x_1, x_2) plane, falling in separate regions]

28 Gradient Descent in Perceptrons
Notice g'(h) = g(h)(1 - g(h)).
Let net_i = Σ_k w_k x_ik and δ_i = y_i - g(net_i), so out(x_i) = g(net_i).
E = Σ_i (y_i - g(net_i))^2
∂E/∂w_j = Σ_i 2 (y_i - g(net_i)) (-∂g(net_i)/∂w_j)
= -2 Σ_i (y_i - g(net_i)) g'(net_i) ∂net_i/∂w_j
= -2 Σ_i δ_i g(net_i) (1 - g(net_i)) x_ij
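
A small sanity check (my own, not from the slides) comparing the analytic gradient above with a central finite-difference estimate; the weights, data points, and step size are made-up illustrations.

```python
# Compare the analytic gradient above with a finite-difference estimate.
# The weights, data, and step size are made-up illustrations.
import math

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

def error(w, xs, ys):
    return sum((y - g(sum(wk * xk for wk, xk in zip(w, x)))) ** 2
               for x, y in zip(xs, ys))

def analytic_grad(w, xs, ys, j):
    total = 0.0
    for x, y in zip(xs, ys):
        net = sum(wk * xk for wk, xk in zip(w, x))
        delta = y - g(net)
        total += -2.0 * delta * g(net) * (1.0 - g(net)) * x[j]
    return total

w = [0.3, -0.2]
xs = [[1.0, 2.0], [0.5, -1.0]]
ys = [1.0, 0.0]
eps = 1e-6
for j in range(len(w)):
    w_hi = list(w); w_hi[j] += eps
    w_lo = list(w); w_lo[j] -= eps
    numeric = (error(w_hi, xs, ys) - error(w_lo, xs, ys)) / (2 * eps)
    print(analytic_grad(w, xs, ys, j), numeric)   # the two columns agree
```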

29 Delta Rule for Perceptrons
w_j = w_j + η Σ_i δ_i out(x_i) (1 - out(x_i)) x_ij
Invented and popularized by Rosenblatt (1962).
Guaranteed convergence; stable behavior for overconstrained and underconstrained problems.
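
A sketch of the delta rule in use on a toy classification problem (the OR of two features); the learning rate, iteration count, and initialization range are illustrative assumptions, not from the slides.

```python
# A sketch of the delta rule on a toy problem (the OR of two features);
# the learning rate, iteration count, and init range are illustrative.
import math, random

def g(h):
    return 1.0 / (1.0 + math.exp(-h))

inputs = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0],   # trailing 1.0 is the bias input
          [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
targets = [0.0, 1.0, 1.0, 1.0]                # OR of the first two features

eta = 0.5
w = [random.uniform(-0.1, 0.1) for _ in range(3)]
for _ in range(5000):
    outs = [g(sum(wj * xj for wj, xj in zip(w, x))) for x in inputs]
    deltas = [y - o for y, o in zip(targets, outs)]
    # Delta rule: w_j += eta * sum_i delta_i * out(x_i) * (1 - out(x_i)) * x_ij
    for j in range(3):
        w[j] += eta * sum(d * o * (1.0 - o) * x[j]
                          for d, o, x in zip(deltas, outs, inputs))

print([round(g(sum(wj * xj for wj, xj in zip(w, x)))) for x in inputs])
# expected: [0, 1, 1, 1]
```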

30 What to Learn
Linear regression as ML
Gradient descent to find the ML solution
Perceptron training rule (regression version and classification version)
Sigmoids for classification problems

31 Homework 9 (due 12/5)
1. Write a program that decides if a pair of words are synonyms using WordNet. I’ll send you the list; you send me the answers.
2. Draw a decision tree that represents (a) f_1 + f_2 + … + f_n (or), (b) f_1 f_2 … f_n (and), (c) parity (an odd number of features “on”).
3. Show that g'(h) = g(h)(1 - g(h)).

