An Introduction to Artificial Neural Networks. Primitive Neuroscience.


1 An Introduction to Artificial Neural Networks

2 Primitive Neuroscience

3 Advances in Neuroscience

4 Modern Neuroscience

5 A Bottom-Up Approach to AI

Computer                                  | Brain
One or relatively few processors          | Massively parallel processing
Multifunction individual processor        | Computationally simple individual processor
Processor is relatively fast              | Processor is relatively slow
Processor, memory, program distinct       | Processor, memory, "program" indistinguishable
Numerical addressing                      | Content addressing
Processor failure catastrophic            | Fault-tolerant processing
Good at complex mathematics, exactness    | Good at vision, natural language, reasoning, inexactness

6 A Typical Neuron http://www.chm.bris.ac.uk/webprojects2006/Cowlishaw/300px-Action-potential.png

7 A Generic Artificial Neuron

8 A threshold provides a way to place a bound on unit output ("squashing function"). Some common nonlinearities:
a) Step function (hard limiter)
b) Radial function
c) Sigmoidal functions:
   Logistic function: f(u) = 1 / (1 + e^(-u)); as u ranges from -∞ to +∞, f(u) ranges from 0 to 1
   Arctangent function: f(u) = arctan(u); as u ranges from -∞ to +∞, f(u) ranges from -π/2 to +π/2
Consider implementing logic functions in a generic neuron: OR, AND, NOT …
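The slides don't give code for this; the following is a minimal sketch, in Python, of a generic neuron with a hard-limiter threshold implementing the logic functions mentioned above. The particular weights and thresholds are one workable choice among many, not values from the slides:

```python
import math

def step(u, theta=0.0):
    """Hard limiter: output 1 if the weighted sum exceeds the threshold, else 0."""
    return 1 if u > theta else 0

def logistic(u):
    """Logistic squashing function: maps (-inf, +inf) onto (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))

def neuron(inputs, weights, theta):
    """Generic threshold logic unit: weighted sum followed by a hard limiter."""
    u = sum(a * w for a, w in zip(inputs, weights))
    return step(u, theta)

# Logic functions over binary inputs (hypothetical weight/threshold choices):
AND = lambda a, b: neuron([a, b], [1, 1], 1.5)   # fires only when both inputs fire
OR  = lambda a, b: neuron([a, b], [1, 1], 0.5)   # fires when either input fires
NOT = lambda a:    neuron([a],    [-1],  -0.5)   # fires when the input does not
```

Any weights that put the firing patterns on the correct side of the threshold would do; XOR, by contrast, has no such single-neuron solution (see the XOR slides below).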

9 Thresholding functions: f(x) = 1 / (1 + e^(-λx)). Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005

10 Basic Neural Network Architectures

11 Felleman and Van Essen's Vision System Circuits (1991) (i.e., a real neural net). From Suzuki and Amaral in Crick (1994)

12 Some Learning Mechanisms
Unsupervised
- Hebbian
- Competitive
- Boltzmann
Supervised
- Perceptron learning
- Delta rule
- Backpropagation (generalized delta rule)
Reinforcement
Monte Carlo techniques (e.g., genetic algorithms, particle swarm optimization)

13 Hebb Rule
W_new = W_old + learning_rate × X_source × X_destination
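The rule above is a one-liner in code; this sketch (my rendering, not from the slides) shows the key property that the weight changes only when source and destination are simultaneously active:

```python
def hebb_update(w_old, rate, x_source, x_dest):
    """Hebb rule: strengthen a connection when source and destination fire together."""
    return w_old + rate * x_source * x_dest

# Both units active (activation 1): the weight grows.
w = hebb_update(0.5, 0.1, 1, 1)   # 0.5 + 0.1*1*1 = 0.6
# Either unit silent (activation 0): the product is 0 and the weight is unchanged.
w_same = hebb_update(0.5, 0.1, 0, 1)
```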

14 The "outstar" of node J, the "winner" in a winner-take-all network. The Y vector supervises the response on the output layer in Grossberg training. The "outstar" is shown in bold, with all weights 1; all other weights are 0. Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005

15 A Simple Auto-associative Neural Net Localist Representations

16 Auto-associative Neural Net: Notes

Purpose
Learn a single pattern so that a degraded image can be recognized.

Architecture
- Three layers of neurons (input, hidden, and output) separated by two layers of connections. External patterns are imposed on the input layer (retina) and reconstructed on the output (recognition) layer.
- Input and output layers each consist of 289 units (neurons), corresponding to a 17x17 square array of "pixels" representing the image during training and recall.
- The hidden (middle) layer contains 25 units and functions in a winner-take-all (WTA) fashion.
- Both levels of connections use 100% connectivity.
- Input-to-hidden weights are initialized with random weights in the range 0 to 1; hidden-to-output weights are initialized with zero weights.

Learning
- The network is exposed to a number of predefined patterns, one by one.
- Input-to-hidden connections are trained by setting weights between firing input neurons and the winning hidden-layer neuron to 1, and weights between non-firing input-layer neurons and the winning hidden unit to 0.
- Hidden-to-output connections are trained by imposing target activations on the output units, then setting weights between the winning hidden-layer neuron and firing output-layer neurons to 1.
- During training, the network learns all of the input patterns, one after another. For a given pattern: the input layer is activated; activation is propagated to the hidden layer; the winning neuron in the hidden layer is selected; weights to that unit are trained; weights to the output units from the winning hidden unit are trained. Activation of the winning neuron is then suppressed for the duration of the training phase (so it will be unlikely to win again). The process repeats for any additional patterns to be learned.

Recall
- Activations in all layers are reset to resting level prior to entry of a pattern.
- Pattern entry and activation propagation then follow the same sequence of steps as for training (but with no weight changes) through selection of a winning hidden-layer unit. Activation is then propagated from that unit to the output units via the previously trained connections.

Miscellaneous
The simple binary images used for this demo are composed of periods (.) and asterisks (*), interpreted by the system as 0s and 1s, respectively.

17 Perceptron learning
Supervised learning (desired response is known).
Perceptron: threshold logic unit (TLU) for N inputs.
Output[j] = +1 if Σ a[i]w[i,j] > θ; -1 if Σ a[i]w[i,j] ≤ θ
(where θ is the threshold, the sum over i is from 1 to N, a[i] are the inputs, and w[i,j] is the weight from unit i to unit j)
By creating a special weight and input (e.g., w[0,j] = -θ and a[0] = 1):
Output[j] = +1 if Σ a[i]w[i,j] > 0; -1 if Σ a[i]w[i,j] ≤ 0 (where the sum over i is from 0 to N)
Key: feedback based on correctness (i.e., only learns if there is an error).
Thus, for learning constant η, a_k a vector representing a member of the training set, and w_t the set of weights for the input connections to the perceptron at time t:
w_{t+1} = w_t if classification is correct
w_{t+1} = w_t + a positive amount if classification is incorrect and the response should be +1
w_{t+1} = w_t - a positive amount if classification is incorrect and the response should be -1
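The update rule above can be sketched in a few lines. This is a hypothetical Python rendering (the slides give no code); the threshold is folded in as weight w[0] on a constant input a[0] = 1, exactly as described, and the "positive amount" is taken to be η·target·a:

```python
def perceptron_output(a, w):
    """TLU output: +1 if the weighted sum is positive, else -1 (bias folded into w[0])."""
    s = sum(ai * wi for ai, wi in zip(a, w))
    return 1 if s > 0 else -1

def perceptron_train(patterns, w, eta=0.1, epochs=100):
    """Train only on errors: w <- w + eta * target * a when a pattern is misclassified."""
    for _ in range(epochs):
        errors = 0
        for a, target in patterns:
            if perceptron_output(a, w) != target:
                w = [wi + eta * target * ai for wi, ai in zip(w, a)]
                errors += 1
        if errors == 0:          # converged: every pattern classified correctly
            break
    return w

# Hypothetical linearly separable data (OR with ±1 outputs; a[0] = 1 is the bias input):
patterns = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = perceptron_train(patterns, [0.0, 0.0, 0.0])
```

Because this data set is linearly separable, the convergence theorem on the next slide guarantees the loop terminates with a correct weight vector.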

18 Perceptron learning (continued)
Key: a modified form of Hebbian learning; the net is trained only when there is an error. Speed of learning decreases as the network becomes more accurate.
The algorithm is local and parallel.
Many valid solutions exist (i.e., many weight vectors are possible).
Noisy data may cause weights to oscillate wildly without stabilizing.
Perceptron convergence theorem: if a finite collection of patterns is linearly separable (i.e., if a solution is possible), then the perceptron learning rule will produce (in a finite amount of time) a weight vector that correctly classifies the patterns.
Key difference between human learning and perceptron learning: after learning a task, humans continue to improve...

19 A two-dimensional plot of the data points in the table. The perceptron provides a linear separation of the data sets. A data set for perceptron classification. Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005

20 Widrow-Hoff (Least Mean Square (LMS) or delta rule) learning
w_ij(ts+1) = w_ij(ts) + η δ a_i
where: w_ij = weight from unit i to unit j; ts = time step; η = learning rate; δ = t_j - Σ_i a_i w_ij (t_j = target (desired) output at j; a_i = input)
Note that the delta rule acts like a Hebbian learning rule with the error δ playing the role of the output factor.
This is called the "delta rule" since what is being learned is the difference between the desired (target) output (thresholded) and the actual activation.
For relatively large η, the last input/output association will be well learned, possibly at the expense of some unlearning of previously presented pairs (i.e., a recency effect). The longer since a presentation, the less well remembered the corresponding association.
A Widrow-Hoff unit is sometimes called an Adaline.

21 Widrow-Hoff (delta rule) learning: Example
Consider an artificial neuron with 4 inputs and a hard-limiter binary threshold of 0.
Let a_1 = 2, a_2 = -1, a_3 = -4, a_4 = 3. Let w_1 = .5, w_2 = 1, w_3 = -.5, and w_4 = -1.
Suppose the desired (target) output is 1 and the learning rate is .1.
What is the activation?
Activation = (2)(.5) + (-1)(1) + (-4)(-.5) + (3)(-1) = -1
What is the output of the neuron?
Activation is < 0, so Output = 0.
What are the new weights after one weight adjustment? w_i(ts+1) = w_i(ts) + η δ a_i
δ = t - Σ_i a_i w_i = 1 - (-1) = 2
w_1(ts+1) = .5 + (.1)(2)(2) = .9
w_2(ts+1) = 1 + (.1)(2)(-1) = .8
w_3(ts+1) = -.5 + (.1)(2)(-4) = -1.3
w_4(ts+1) = -1 + (.1)(2)(3) = -.4
What is the activation after weight adjustment (using the same inputs)?
Activation = (2)(.9) + (-1)(.8) + (-4)(-1.3) + (3)(-.4) = 5

22 Another Widrow-Hoff learning example with different η
Consider an artificial neuron with 4 inputs and a hard-limiter binary threshold of 0.
Let a_1 = 2, a_2 = -1, a_3 = -4, a_4 = 3. Let w_1 = .5, w_2 = 1, w_3 = -.5, and w_4 = -1.
Suppose the desired (target) output is 1 and the learning rate is .01.
What is the activation?
Activation = (2)(.5) + (-1)(1) + (-4)(-.5) + (3)(-1) = -1
What is the output of the neuron?
Activation is < 0, so Output = 0.
What are the new weights after one weight adjustment? w_i(ts+1) = w_i(ts) + η δ a_i
δ = t - Σ_i a_i w_i = 1 - (-1) = 2
w_1(ts+1) = .5 + (.01)(2)(2) = .54
w_2(ts+1) = 1 + (.01)(2)(-1) = .98
w_3(ts+1) = -.5 + (.01)(2)(-4) = -.58
w_4(ts+1) = -1 + (.01)(2)(3) = -.94
What is the activation after weight adjustment (using the same inputs)?
Activation = (2)(.54) + (-1)(.98) + (-4)(-.58) + (3)(-.94) = -.4
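The two worked examples above can be reproduced with a few lines of Python. This is a minimal sketch of a single delta-rule update (not code from the slides); the same inputs, initial weights, and target are used for both learning rates:

```python
def delta_rule_step(a, w, target, eta):
    """One Widrow-Hoff update: delta = target - activation, then w_i += eta*delta*a_i."""
    activation = sum(ai * wi for ai, wi in zip(a, w))
    delta = target - activation
    return [wi + eta * delta * ai for ai, wi in zip(a, w)]

a = [2, -1, -4, 3]
w = [0.5, 1, -0.5, -1]
w1 = delta_rule_step(a, w, target=1, eta=0.1)    # -> [0.9, 0.8, -1.3, -0.4]
w2 = delta_rule_step(a, w, target=1, eta=0.01)   # -> [0.54, 0.98, -0.58, -0.94]
```

Re-running the activation with the updated weights gives 5 for η = .1 (a large overshoot past the target of 1) and -.4 for η = .01 (a small step toward it), which is exactly the large-η/small-η trade-off the delta-rule slides describe.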

23 Widrow-Hoff (Least Mean Square (LMS) or delta rule) learning
Another iterative learning process (as was the perceptron learning algorithm), but this one supports continued learning (even when there is no output error above the threshold).
Supervised learning = associative learning (association; classification).
Weight space; weight and error space; error surfaces. Learning = error reduction. Local and global minima. Gradient: direction of steepest descent.
The overall goal of learning is to minimize total error; this is done one pattern at a time. Decreasing the error for a given pattern may increase it for another, but the network will follow the gradient on average.
The goal is to change a weight by an amount proportional to the error times the input.
An error surface in two dimensions. The constant c dictates the size of the learning step. Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005

24 A Simple Two-Layer Pattern Recognizer Distributed Representations

25 The XOR Problem

26 A geometric aside: A general equation for a line is Ax + By + C = 0, where A and B are not both zero. A general equation for a plane is Ax + By + Cz + D = 0, where A, B, and C are not all zero. A general equation for a hyperplane is A_1x_1 + A_2x_2 + ... + A_Nx_N + B = 0. Note that when B = 0 this is just the dot product equation A∙x = 0.

27 Number of linearly separable Boolean functions of N inputs

N | 2^(2^N)  | # of linearly separable functions
1 | 4        | 4
2 | 16       | 14
3 | 256      | 104
4 | 65,536   | 1,882
5 | 4.3x10^9 | 94,572
6 | 1.8x10^19| 5,028,134

(Winder, 1960)

28

29 NetTalk

30 Backpropagation
Reconsidering the linear separability issue (alternative networks for XOR).
Backpropagation works by creating an artificial error term and using it in the training algorithm (transmitting it backwards over existing connections).
Classification considerations:
- A single set of connections can solve linearly separable problems only
- Two sets of connections can classify inputs into convex open or closed regions
- Three sets of connections: classification limited by the number of neurons and weights
Increasing classification power assumes hidden layers of neurons and nonlinear performance.
Luger: Artificial Intelligence, 5th edition. © Pearson Education Limited, 2005

31 Credit assignment problem
Solution: backwards propagation of error.
Weight change for the connection to neuron j in one layer from a previous-layer neuron i:
Δw[i,j] = η δ[j] x[i,j]
where δ[j] is the generalized error, x[i,j] is the input (from the previous layer), and η is the learning rate.
For weights to the output layer, the computation is easy; it is the same situation as for an Adaline (i.e., there is external knowledge about the desired output). For previous layers of weights there is no such knowledge, but an error term may be computed for weights between a layer with input f and one with output g by propagating the error backwards.

32 The error δ[j] (at g) is computed as: δ[j] = s'(w_fg ∙ f) Σ_k δ[k] w[j,k]
where:
1. s is the nonlinear ("squashing") function (usually a sigmoid or similar) operating on the dot product of the weights w_fg and the input f
2. w[j,k] is the weight from unit j to unit k
3. s' is the derivative of s with respect to net; e.g., for s = f(net) = 1 / (1 + e^(-λ·net)), s' = λ[f(net)(1 - f(net))]
4. δ[k] is the error of the kth unit in the layer beyond the one under consideration
Direction of activation propagation →
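The hidden-layer error formula above can be sketched directly. This is a hypothetical Python rendering (the slides give only the equation), assuming the logistic squashing function with λ = 1, so that s' = s·(1 - s); the numeric values in the example are invented for illustration:

```python
import math

def logistic(u):
    """Logistic squashing function s(u) = 1 / (1 + e^-u)."""
    return 1.0 / (1.0 + math.exp(-u))

def hidden_delta(net_j, deltas_next, w_j_to_next):
    """Generalized error for hidden unit j: s'(net_j) * sum_k delta[k] * w[j,k].
    Assumes logistic s with lambda = 1, so s'(net) = s(net) * (1 - s(net))."""
    s = logistic(net_j)
    return s * (1.0 - s) * sum(d * w for d, w in zip(deltas_next, w_j_to_next))

# Hypothetical hidden unit with net input 0, feeding two output units:
# s(0) = 0.5, so s'(0) = 0.25, and the errors from the next layer cancel here.
d = hidden_delta(0.0, deltas_next=[0.5, -0.25], w_j_to_next=[1.0, 2.0])
```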

33 Problems with gradient descent (e.g., Widrow-Hoff, backpropagation, etc.)
- Local minima on the error surface
- Assumptions about the correct way to measure error may not be correct
- May overfit the data, with a resulting loss of generality
- Long training times
- Not biologically plausible
- Catastrophic interference (unlearning)
- Possible lack of clear understanding of how a network solves a problem
To approximate gradient descent, either: 1) change weights after the presentation of each pattern, or 2) change weights after accumulating the total error for all patterns (after an entire training cycle, or epoch). The first method means less computation; the second may provide better results.

34 Vector Considerations
Inner (dot) product: a∙b = Σ a[i]b[i] (where the sum is for i = 1 to n, for vectors of n dimensions)
Length of vector v: |v| = [v∙v]^(1/2)
Normalization (make a collection of vectors a standard length): divide each component by the length of the vector (i.e., v/|v|)
Normalization:
Consider vector a = (a_1, a_2, …, a_n)
|a| = [a ∙ a]^(1/2) = [a_1a_1 + a_2a_2 + … + a_na_n]^(1/2) = [a_1² + a_2² + … + a_n²]^(1/2)
u_a = a/|a| (i.e., the normalized vector)
|u_a| = [u_a ∙ u_a]^(1/2)
      = [(a/|a|) ∙ (a/|a|)]^(1/2)
      = [(a_1/|a|, a_2/|a|, …, a_n/|a|) ∙ (a_1/|a|, a_2/|a|, …, a_n/|a|)]^(1/2)
      = [a_1²/|a|² + a_2²/|a|² + … + a_n²/|a|²]^(1/2)
      = [(a_1² + a_2² + … + a_n²)/|a|²]^(1/2)
      = [|a|²/|a|²]^(1/2) = 1

35 Vector Considerations
Alternative expression for the inner product: a∙b = |a| |b| cosθ (where θ is the angle between the vectors)
For normalized vectors:
When a and b are identical, cosθ = a∙a / ([a∙a]^(1/2) [a∙a]^(1/2)) = 1; when a and b are in opposite directions, cosθ = -1; when cosθ = 0, a and b are orthogonal (at right angles) to one another.
- If cosθ = 1, the patterns represented by a and b are maximally similar (i.e., the same)
- If cosθ = -1, the patterns they represent are maximally different
- If cosθ = 0, the patterns they represent are uncorrelated (orthogonal)

36 One reason (among many) that vector operations are important in the context of neural networks is that they give us a natural way to think about relationships between inputs and weights. For normalized vectors, the smaller the distance between a set of weights and an input vector, the larger their respective dot product is. This is also true for input vectors. The smaller the distance between two input vectors, the larger their dot product. Distance, and hence the dot product, are ways of measuring the similarity between vectors.

37 Example: Consider a neuron with three inputs and weights w[1] = 4, w[2] = 1, w[3] = 1.
Let the input vectors x and y be x = (4, 0, 0) and y = (3, 1, 1).
Weight vector length = (4² + 1² + 1²)^(1/2) = 4.2426
The normalized weight vector is w_n = (4/4.2426, 1/4.2426, 1/4.2426) = (.943, .236, .236)
The length of x is (4² + 0² + 0²)^(1/2) = 4; x normalized is x_n = (4/4, 0/4, 0/4) = (1, 0, 0)
The length of y is (3² + 1² + 1²)^(1/2) = 3.317; y normalized is y_n = (3/3.317, 1/3.317, 1/3.317) = (.904, .301, .301)
Inner (dot) product w_n ∙ x_n = (.943)(1) + (.236)(0) + (.236)(0) = .943
Distance between w_n and x_n = [(.943-1)² + (.236-0)² + (.236-0)²]^(1/2) = .339
Dot product w_n ∙ y_n = (.943)(.904) + (.236)(.301) + (.236)(.301) = .995
Distance between w_n and y_n = [(.943-.904)² + (.236-.301)² + (.236-.301)²]^(1/2) = .0999
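The example above can be checked in a few lines; this sketch (mine, not from the slides) computes the same normalized vectors, dot products, and distances, and confirms the point of slide 36: y is closer to w than x is, and correspondingly has the larger dot product:

```python
import math

def dot(u, v):
    """Inner (dot) product: sum of component-wise products."""
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale v to unit length: divide each component by |v| = sqrt(v.v)."""
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

w = normalize([4, 1, 1])   # the weight vector
x = normalize([4, 0, 0])
y = normalize([3, 1, 1])
# dot(w, y) ~ .995 > dot(w, x) ~ .943, while distance(w, y) ~ .100 < distance(w, x) ~ .339
```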

38 Notation: a rectangular matrix A has m rows and n columns, with components A[i,j], where i and j denote row and column positions respectively.
Matrices are frequently used for connection weights in neural models, although they may have other uses as well.
Addition (and subtraction): C[i,j] = A[i,j] + B[i,j] for all i and j (note that the dimensions of A, B, and C are the same)
Multiplication by a constant (scalar): cA is computed as cA[i,j] for all i and j
Multiplication: Given an m by n matrix A and an n by p matrix B, produce an m by p matrix C:
for i = 1 to m
  for j = 1 to p
    C[i,j] = 0
    for k = 1 to n
      C[i,j] = C[i,j] + A[i,k]*B[k,j]
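The multiplication pseudocode above translates directly into runnable Python (a sketch of mine; the example weight values are invented to show the neural-net use of the operation):

```python
def mat_mul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B (matrices as lists of rows)."""
    m, n, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

# Propagating a layer's activations: treat the input as a 1 x n matrix and the
# connection weights as an n x p matrix; the product is the 1 x p vector of net inputs.
net = mat_mul([[1, 0, 1]], [[0.5, -1], [2, 0], [0.25, 1]])   # [[0.75, 0]]
```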

39 Hebbian error-correcting network (diagram showing five numbered modules: Input Module, Observed Output, Target Output, Error (target - actual), and Training Control)

40 Hebbian Error-Correcting Network - Notes
This network does error-correcting learning in a simple pattern associator having one layer of trained connections. Unlike most pattern associators, in which the learning algorithm is implemented in software that implies components which are not actually part of the model, this model is self-contained. The learning is Hebbian in nature, but is between an input module and a module of neurons which actually compute the difference between target and actual values (a la Widrow-Hoff).
To use the net, train it for some number of runs on a set of input and matching (desired) output patterns, then test on a set of input patterns only (i.e., blank the name of the external file in the Target Output module). The model requires four iterations for full presentation of a relationship (12 iterations for 3 input/output pairs, 40 iterations for 10 pairs, etc.). Note that the Training Control module automatically prevents learning from occurring when there is no Target Output (i.e., during testing).
To apply this model to a problem with different input and output state vector sizes, simply change the neuron and connection configurations (as well as the external file references).

41 Major issues facing neural network research Large scale simulation Stability/plasticity (catastrophic interference) Interconnectivity Credit assignment Local versus distributed coding Graceful degradation Generalization Self-organization The “free-variable” critique

42

43 Neural Network Simulation Software
- Java source code
- Connectivity via linked lists (for sparse connectivity)
- Iterators for speed
- Classes: Network, Module, ModuleConnectivity, Neuron, NeuronConnectivity
Network
| => Module (a collection of neurons that function together as a unit)
| => ModuleConnectivity (parameters describing connections from one module to another)
| => Neuron (a neuron belonging to a specific module)
| => NeuronConnectivity (information about a connection from one neuron to another)

44 Thinking a bit deeper…

45 Selfridge’s Model (cont.)

46 http://www.russianlegacy.com/catalog/images/matryoshka_traditional/ND50-001.jpg What’s inside you that makes you intelligent?

47 10^11 neurons; 10^14-10^15 connections
Millisecond response times
Perception limited to three spatial dimensions and time
Meaning from structure…
How does intelligence arise from mindless mechanisms?
Image sources: http://www.mabot.com/brain/; http://webspace.ship.edu/cgboer/thebrain.html; http://domino.watson.ibm.com/comm/pr.nsf/pages/rscd.neurons_picb.html/$FILE/NeuronsInAColumn1_s.bmp; http://www.brainbasedbusiness.com/uploads/neuron.jpg; http://www.chm.bris.ac.uk/webprojects2006/Cowlishaw/300px-Action-potential.png

