Neural Networks, Winter-Spring 2014. Instructor: A. Sahebalam. Lecture 7: Adaptive Networks

Adaptive Networks
As you know, there is no equation that would tell you the ideal number of neurons in a multi-layer network. Ideally, we would like to use the smallest number of neurons that allows the network to do its task sufficiently accurately, because a smaller network has fewer weights, requires fewer training samples, trains faster, and typically generalizes better to new test samples.

Adaptive Networks
So far, we have determined the number of hidden-layer units in BPNs by “trial and error.” However, there are algorithmic approaches for adapting the size of a network to a given task. Some techniques start with a large network and then iteratively prune connections and nodes that contribute little to the network function. Other methods start with a minimal network and then add connections and nodes until the network reaches a given performance level. Finally, there are algorithms that combine these “pruning” and “growing” approaches.

Cascade Correlation
None of these algorithms are guaranteed to produce “ideal” networks. (It is not even clear how to define an “ideal” network.) However, numerous algorithms exist that have been shown to yield good results for most applications. We will take a look at one such algorithm named “cascade correlation.” It is of the “network growing” type and can be used to build multi-layer networks of adequate size. However, these networks are not strictly feed-forward in a level-by-level manner.

Refresher: Covariance and Correlation
For a dataset (x_i, y_i) with i = 1, …, n, the covariance is:

cov(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

[Figure: three scatter plots of (x, y) data illustrating cov(x, y) > 0, cov(x, y) ≈ 0, and cov(x, y) < 0.]

Refresher: Covariance and Correlation
Covariance tells us something about the strength and direction (directly vs. inversely proportional) of the linear relationship between x and y. For many applications, it is useful to normalize this quantity so that it ranges from -1 to 1. The result is the correlation coefficient r, which for a dataset (x_i, y_i) with i = 1, …, n is given by:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}

Refresher: Covariance and Correlation
[Figure: six scatter plots illustrating 0 < r < 1, r ≈ 0, -1 < r < 0, r = 1, r = -1, and r undefined.]

Refresher: Covariance and Correlation
In the case of a strongly positive (close to 1) or strongly negative (close to -1) correlation coefficient, we can use one variable as a predictor of the other. To quantify the linear relationship between the two variables, we can use linear regression:

[Figure: scatter plot of (x, y) data with the fitted regression line.]
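
As a concrete aside, here is a minimal NumPy sketch that computes the covariance, the correlation coefficient r, and a least-squares regression line for a small made-up dataset; the data values and variable names are illustrative only:

```python
import numpy as np

# Toy dataset (made-up values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Covariance: average product of deviations from the means
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Correlation coefficient r: covariance normalized to the range [-1, 1]
r = cov_xy / (x.std() * y.std())

# Least-squares regression line y ≈ a*x + b
a, b = np.polyfit(x, y, deg=1)

print(f"cov(x,y) = {cov_xy:.3f}, r = {r:.3f}, regression: y = {a:.3f}*x + {b:.3f}")
```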

Cascade Correlation
Now let us return to the cascade correlation algorithm. We start with a minimal network consisting of only the input neurons (one of them should be a constant offset = 1) and the output neurons, completely connected as usual. The output neurons (and later the hidden neurons) typically use output functions that can also produce negative outputs; e.g., we can subtract 0.5 from our sigmoid function for a (-0.5, 0.5) output range. Then we successively add hidden-layer neurons and train them to reduce the network error step by step:

Cascade Correlation
[Figure: the initial minimal network. Input nodes x1, x2, x3 are fully connected to the output node o1; the solid connections (input-to-output weights) are the ones being modified.]

Cascade Correlation
[Figure: the network after the first hidden node has been added between the input nodes x1, x2, x3 and the output node o1; the solid connections are the ones being modified.]

Cascade Correlation
[Figure: the network after a second hidden node has been added; it receives the inputs x1, x2, x3 and the first hidden node’s output, and feeds the output node o1. The solid connections are the ones being modified.]

Cascade Correlation
Weights to each new hidden node are trained to maximize the covariance of the node’s output with the current network error.

Covariance: S(\mathbf{w}_{new}) = \sum_{k} \left| \sum_{p} (o_p - \bar{o})(E_{p,k} - \bar{E}_k) \right|

\mathbf{w}_{new}: vector of weights to the new node
o_p: output of the new node for the p-th input sample
E_{p,k}: error of the k-th output node for the p-th input sample before the new node is added
\bar{o}, \bar{E}_k: averages over the training set
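
A minimal NumPy sketch of this objective, assuming we are handed the candidate node’s outputs and the current residual errors (the function and variable names are mine, not the lecture’s):

```python
import numpy as np

def cascade_covariance(node_out, errors):
    """Cascade-correlation objective S.

    node_out: shape (P,)   - candidate node output o_p for each training sample
    errors:   shape (P, K) - error E_{p,k} of each of the K output nodes,
                             measured before the candidate is added
    """
    o_centered = node_out - node_out.mean()        # o_p - o_bar
    e_centered = errors - errors.mean(axis=0)      # E_{p,k} - E_bar_k
    # Sum over samples for each output node, then sum absolute values over outputs
    return np.abs(o_centered @ e_centered).sum()
```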


Cascade Correlation
Since we want to maximize S (as opposed to minimizing some error), we use gradient ascent:

\Delta w_i = \eta \frac{\partial S}{\partial w_i} = \eta \sum_{p,k} \sigma_k \, (E_{p,k} - \bar{E}_k) \, f'_p \, x_{i,p}

x_{i,p}: i-th input for the p-th pattern
\sigma_k: sign of the covariance between the node’s output and the error of the k-th output node
\eta: learning rate
f'_p: derivative of the node’s activation function with respect to its net input, evaluated at the p-th pattern
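
A sketch of one such gradient-ascent step, assuming a sigmoidal candidate node shifted to the (-0.5, 0.5) range as suggested earlier; all names are illustrative, and the mean of the node’s output is treated as a constant, as in the gradient above:

```python
import numpy as np

def candidate_ascent_step(w, X, errors, lr=0.1):
    """One gradient-ascent step on S for a candidate node (sketch).

    w:      shape (I,)    - weights into the candidate node
    X:      shape (P, I)  - inputs to the candidate (network inputs and, in a
                            cascade, the outputs of earlier hidden nodes)
    errors: shape (P, K)  - current network errors, held fixed during this phase
    """
    net = X @ w                                   # net input, per sample
    out = 1.0 / (1.0 + np.exp(-net)) - 0.5        # sigmoid shifted to (-0.5, 0.5)
    fprime = (out + 0.5) * (0.5 - out)            # derivative of the shifted sigmoid
    e_centered = errors - errors.mean(axis=0)     # E_{p,k} - E_bar_k
    cov_k = (out - out.mean()) @ e_centered       # covariance with each output's error
    sigma_k = np.sign(cov_k)                      # sign of that covariance
    # dS/dw_i = sum_{p,k} sigma_k (E_{p,k} - E_bar_k) f'_p x_{i,p}
    grad = ((e_centered * sigma_k).sum(axis=1) * fprime) @ X
    return w + lr * grad                          # ascent: move *up* the gradient
```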

Cascade Correlation
If we can find weights so that the new node’s output perfectly covaries with the error in each output node, we can set the new output node weights and offsets so that the new error is zero. More realistically, there will be no perfect covariance, which means that we will set each output node weight so that the error is minimized. To do this, we can use gradient descent or linear regression for each individual output node weight. The next added hidden node will further reduce the remaining network error, and so on, until we reach a desired error threshold.
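
For the linear-regression option, here is a sketch of re-fitting the output weights once the new node’s input weights are frozen, assuming for simplicity a single linear output node (names are assumptions, not the lecture’s):

```python
import numpy as np

def refit_output_weights(H, targets):
    """Re-fit the output weights after a new hidden node has been added.

    H:       shape (P, M) - activations feeding the output node (inputs, bias,
                            and all hidden nodes, including the new one)
    targets: shape (P,)   - desired outputs
    Returns the least-squares weight vector, shape (M,).
    """
    w, *_ = np.linalg.lstsq(H, targets, rcond=None)
    return w
```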

Cascade Correlation
This learning algorithm is much faster than backpropagation learning, because only one neuron is trained at a time. On the other hand, its inability to retrain neurons may prevent the cascade correlation network from finding optimal weight patterns for encoding the given function.

Input Space Clusters
One of our basic assumptions about functions to be learned by ANNs is that inputs belonging to the same class (or requiring similar outputs) are located close to each other in the input space. Often, input vectors from the same class form clusters, i.e., local groups of data points. For such data distributions, the linear separating functions used by perceptrons, Adalines, or BPNs are not optimal.

Input Space Clusters
[Figure: example of a two-dimensional input space (x1, x2) in which the Class 1 samples form a cluster enclosed by Circle 1, surrounded by Class -1 samples; separating the classes with straight lines requires four of them (Line 1 through Line 4).]
A network with linearly separating functions would require four neurons plus one higher-level neuron. On the other hand, a single neuron with a local, circular “receptive field” would suffice.

Radial Basis Functions (RBFs)
To achieve such local “receptive fields,” we can use radial basis functions, i.e., functions whose output only depends on the Euclidean distance D between the input vector and another (“weight”) vector. A typical choice is a Gaussian function, e.g.

\phi(D) = e^{-D^2 / c^2},

where c determines the “width” of the Gaussian. However, any radially symmetric, non-increasing function could be used.
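
A minimal sketch of such a Gaussian RBF; the width parameter is called c as on the slide, but the exact normalization exp(-D²/c²) is an assumption:

```python
import numpy as np

def gaussian_rbf(x, mu, c=1.0):
    """Gaussian RBF: depends only on the Euclidean distance ||x - mu||.

    The width c controls how quickly the response falls off; the exact
    normalization (here exp(-D^2 / c^2)) is an assumption.
    """
    d = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(mu, dtype=float))
    return np.exp(-(d ** 2) / (c ** 2))
```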

Linear Interpolation: 1-Dimensional Case
For function approximation, the desired output for new (untrained) inputs could be estimated by linear interpolation. As a simple example, how do we determine the desired output of a one-dimensional function at a new input x_0 that is located between known data points x_1 and x_2?

f(x_0) = f(x_1) + \frac{x_0 - x_1}{x_2 - x_1}\left(f(x_2) - f(x_1)\right)

which simplifies to:

f(x_0) = \frac{f(x_1)\, D_1^{-1} + f(x_2)\, D_2^{-1}}{D_1^{-1} + D_2^{-1}}

with distances D_1 and D_2 from x_0 to x_1 and x_2, respectively.
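
A small sketch of this 1-dimensional case, written directly in the inverse-distance form (function name and argument order are illustrative):

```python
def interpolate_1d(x0, x1, f1, x2, f2):
    """Linearly interpolate f at x0 from known points (x1, f1) and (x2, f2),
    using the inverse-distance form from the slide."""
    d1, d2 = abs(x0 - x1), abs(x0 - x2)
    if d1 == 0:   # x0 coincides with a known point
        return f1
    if d2 == 0:
        return f2
    return (f1 / d1 + f2 / d2) / (1.0 / d1 + 1.0 / d2)

# For x0 between x1 and x2 this equals ordinary linear interpolation:
print(interpolate_1d(1.5, 1.0, 2.0, 2.0, 4.0))   # -> 3.0
```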

Linear Interpolation: Multiple Dimensions
In the multi-dimensional case, hyperplane segments connect neighboring points so that the desired output for a new input x_0 is determined by the P_0 known samples that surround it:

f(x_0) = \frac{\sum_{p=1}^{P_0} f(x_p)\, D_p^{-1}}{\sum_{p=1}^{P_0} D_p^{-1}},

where D_p is the Euclidean distance between x_0 and x_p, and f(x_p) is the desired output value for input x_p.
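
A sketch of this inverse-distance weighting for arbitrary dimensions, assuming the P_0 surrounding samples have already been selected (names are illustrative):

```python
import numpy as np

def inverse_distance_interpolation(x0, X, f, eps=1e-12):
    """Estimate f(x0) from the surrounding samples by inverse-distance weighting.

    x0: shape (d,)     - query point
    X:  shape (P0, d)  - the P0 surrounding sample inputs
    f:  shape (P0,)    - their desired outputs
    """
    D = np.linalg.norm(X - x0, axis=1)        # Euclidean distances D_p
    D = np.maximum(D, eps)                    # guard against a zero distance
    w = 1.0 / D
    return (w * f).sum() / w.sum()
```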

Linear Interpolation: Multiple Dimensions
Example for f: R² → R¹ (with desired outputs indicated):
[Figure: eight sample points X1, …, X8 in the plane with desired outputs 9, 5, 4, -6, 8, 7, 6, -9, respectively, and a query point X0 with unknown output. Its four nearest neighbors are X2, X3, X6, and X7, at distances D2, D3, D6, and D7.]
For the four nearest neighbors, the desired output for X0 is

f(X_0) = \frac{5 D_2^{-1} + 4 D_3^{-1} + 7 D_6^{-1} + 6 D_7^{-1}}{D_2^{-1} + D_3^{-1} + D_6^{-1} + D_7^{-1}}.

Radial Basis Functions
If we are using such linear interpolation, then our radial basis function (RBF) \phi_0 that weights an input vector based on its distance to a neuron’s reference (weight) vector is \phi_0(D) = D^{-1}. For the training samples x_p, p = 1, …, P_0, surrounding the new input x, we find for the network’s output o:

o = \frac{\sum_{p=1}^{P_0} f(x_p)\, \phi_0(\|x - x_p\|)}{\sum_{p=1}^{P_0} \phi_0(\|x - x_p\|)}

(In the following, to keep things simple, we will assume that the network has only one output neuron. However, any number of output neurons could be implemented.)

Radial Basis Functions
Since it is difficult to define what “surrounding” should mean, it is common to consider all P training samples and use any monotonically decreasing RBF \phi:

o = \frac{\sum_{p=1}^{P} f(x_p)\, \phi(\|x - x_p\|)}{\sum_{p=1}^{P} \phi(\|x - x_p\|)}

This, however, implies a network that has as many hidden nodes as there are training samples. This is unacceptable because of its computational complexity and likely poor generalization ability: the network resembles a look-up table.

Radial Basis Functions
It is more useful to have fewer neurons and accept that the training set cannot be learned 100% accurately:

o = \sum_{i=1}^{N} w_i\, \phi(\|x - \mu_i\|)

Here, ideally, each reference vector \mu_i of these N neurons should be placed in the center of an input-space cluster of training samples with identical (or at least similar) desired output. To learn near-optimal values for the reference vectors and the output weights, we can – as usual – employ gradient descent.

The RBF Network
Example: network function f: R³ → R.
[Figure: a three-layer RBF network: an input layer holding the input vector, an RBF layer with four radial basis nodes parameterized by (μ1, σ1), …, (μ4, σ4) plus a constant node 1, and an output layer that computes the output o1 via the weights w0, w1, …, w4.]
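
A sketch of the forward pass of the pictured network (three inputs, four Gaussian RBF nodes plus a constant node, one linear output); the function and parameter names and the Gaussian width convention are assumptions:

```python
import numpy as np

def rbf_forward(x, mus, sigmas, w):
    """Forward pass of a small RBF network (sketch).

    x:      shape (3,)    - input vector
    mus:    shape (4, 3)  - reference vectors mu_1..mu_4
    sigmas: shape (4,)    - widths sigma_1..sigma_4
    w:      shape (5,)    - w_0 (weight of the constant node) followed by w_1..w_4
    """
    d2 = np.sum((mus - x) ** 2, axis=1)            # squared distances ||x - mu_i||^2
    phi = np.exp(-d2 / (2.0 * sigmas ** 2))        # Gaussian RBF activations
    return w[0] * 1.0 + w[1:] @ phi                # constant node "1" plus weighted RBFs

# Example call with arbitrary parameters:
out = rbf_forward(np.array([0.2, -0.1, 0.5]),
                  np.random.randn(4, 3), np.ones(4), np.random.randn(5))
```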

Radial Basis Functions
For a fixed number of neurons N, we could learn the following output weights and reference vectors:

w_i \text{ and } \mu_i, \quad i = 1, \dots, N

To do this, we first have to define an error function E over the training samples (x_p, d_p), p = 1, …, P:

E = \sum_{p=1}^{P} \left(d_p - o(x_p)\right)^2

Taken together, we get:

E = \sum_{p=1}^{P} \left(d_p - \sum_{i=1}^{N} w_i\, \phi(\|x_p - \mu_i\|)\right)^2

Learning in RBF Networks
Then the error gradient with regard to w_1, …, w_N is:

\frac{\partial E}{\partial w_i} = -2 \sum_{p=1}^{P} \left(d_p - o(x_p)\right) \phi(\|x_p - \mu_i\|)

For \mu_{i,j}, the j-th vector component of \mu_i, we get:

\frac{\partial E}{\partial \mu_{i,j}} = -2 w_i \sum_{p=1}^{P} \left(d_p - o(x_p)\right) \frac{\partial \phi(\|x_p - \mu_i\|)}{\partial \mu_{i,j}}

Learning in RBF Networks
The vector length (‖…‖) expression is inconvenient to differentiate, because it is the square root of the given vector multiplied by itself. To eliminate this difficulty, we introduce a function R with R(D²) = \phi(D) and substitute R(\|x_p - \mu_i\|^2) for \phi(\|x_p - \mu_i\|). This leads to a simplified differentiation:

\frac{\partial R(\|x_p - \mu_i\|^2)}{\partial \mu_{i,j}} = R'(\|x_p - \mu_i\|^2)\, \frac{\partial \|x_p - \mu_i\|^2}{\partial \mu_{i,j}}

Learning in RBF Networks
Together with the following derivative…

\frac{\partial \|x_p - \mu_i\|^2}{\partial \mu_{i,j}} = -2\,(x_{p,j} - \mu_{i,j})

… we finally get the result for our error gradient:

\frac{\partial E}{\partial \mu_{i,j}} = 4 w_i \sum_{p=1}^{P} \left(d_p - o(x_p)\right) R'(\|x_p - \mu_i\|^2)\,(x_{p,j} - \mu_{i,j})

Learning in RBF Networks
This gives us the following updating rules:

\Delta w_i = \eta_i \sum_{p=1}^{P} \left(d_p - o(x_p)\right) R(\|x_p - \mu_i\|^2)

\Delta \mu_{i,j} = -\eta_{i,j}\, w_i \sum_{p=1}^{P} \left(d_p - o(x_p)\right) R'(\|x_p - \mu_i\|^2)\,(x_{p,j} - \mu_{i,j})

where the (positive) learning rates \eta_i and \eta_{i,j} could be chosen individually for each parameter w_i and \mu_{i,j}, and absorb the constant factors from the gradients. As usual, we can start with random parameters and then iterate these rules for learning until a given error threshold is reached.

Learning in RBF Networks
If the node function is given by a Gaussian (with width σ_i), then:

R(u) = e^{-u / (2\sigma_i^2)}, \qquad R'(u) = -\frac{1}{2\sigma_i^2}\, e^{-u / (2\sigma_i^2)}

As a result:

\frac{\partial E}{\partial \mu_{i,j}} = -\frac{2 w_i}{\sigma_i^2} \sum_{p=1}^{P} \left(d_p - o(x_p)\right) e^{-\|x_p - \mu_i\|^2 / (2\sigma_i^2)}\,(x_{p,j} - \mu_{i,j})

Learning in RBF Networks
The specific update rules are now:

\Delta w_i = \eta_i \sum_{p=1}^{P} \left(d_p - o(x_p)\right) e^{-\|x_p - \mu_i\|^2 / (2\sigma_i^2)}

and

\Delta \mu_{i,j} = \eta_{i,j}\, \frac{w_i}{\sigma_i^2} \sum_{p=1}^{P} \left(d_p - o(x_p)\right) e^{-\|x_p - \mu_i\|^2 / (2\sigma_i^2)}\,(x_{p,j} - \mu_{i,j})
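
A sketch of batch gradient descent with these two update rules for a Gaussian RBF network with a single linear output; constant factors are folded into the learning rates, and all names are illustrative:

```python
import numpy as np

def train_gaussian_rbf(X, d, mus, sigmas, w, lr_w=0.01, lr_mu=0.01, epochs=500):
    """Batch gradient descent on E = sum_p (d_p - o(x_p))^2 (sketch).

    X: (P, n) inputs, d: (P,) targets, mus: (N, n) reference vectors,
    sigmas: (N,) widths, w: (N,) output weights. Constant factors of the
    gradients are absorbed into the learning rates.
    """
    for _ in range(epochs):
        diff = X[:, None, :] - mus[None, :, :]                          # x_p - mu_i, (P, N, n)
        phi = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigmas ** 2))  # (P, N)
        err = d - phi @ w                                               # d_p - o(x_p), (P,)
        # Delta w_i      ~ sum_p err_p * phi_{p,i}
        dw = err @ phi
        # Delta mu_{i,j} ~ (w_i / sigma_i^2) * sum_p err_p * phi_{p,i} * (x_{p,j} - mu_{i,j})
        dmu = (w / sigmas ** 2)[:, None] * np.einsum('p,pi,pij->ij', err, phi, diff)
        w = w + lr_w * dw
        mus = mus + lr_mu * dmu
    return mus, w
```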

Learning in RBF Networks
It turns out that, particularly for Gaussian RBFs, it is more efficient and typically leads to better results to use partially offline training: First, we use any clustering procedure (e.g., k-means) to estimate cluster centers, which are then used to set the values of the reference vectors μ_i and their spreads (standard deviations) σ_i. Then we use the gradient descent method described above to determine the weights w_i.
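
A sketch of this partially offline scheme, with a tiny hand-rolled k-means so the example stays self-contained; note that, because the output weights enter the network linearly, this sketch solves for them in closed form by least squares rather than by the gradient descent described above (both reach the same minimum of E for fixed centers and spreads):

```python
import numpy as np

def kmeans(X, N, iters=50, seed=0):
    """Tiny k-means: returns N cluster centers and a rough per-cluster spread."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), N, replace=False)].astype(float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(axis=2), axis=1)
        for i in range(N):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    spreads = np.array([X[labels == i].std() if np.any(labels == i) else 1.0
                        for i in range(N)]) + 1e-3
    return centers, spreads

def fit_rbf_offline(X, d, N):
    """Set mu_i and sigma_i by clustering, then fit the output weights."""
    mus, sigmas = kmeans(X, N)
    d2 = ((X[:, None, :] - mus[None]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigmas ** 2))
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
    return mus, sigmas, w
```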