Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.

Presentation transcript:

Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000 with the permission of the authors and the publisher

Chapter 6: Multilayer Neural Networks (Sections 1-5, 8)
1. Introduction
2. Feedforward Operation and Classification
3. Backpropagation Algorithm
4. Error Surfaces
5. Backpropagation as Feature Mapping
8. Practical Techniques for Improving Backpropagation

1. Introduction
Goal: classify objects by learning nonlinearity.
There are many problems for which linear discriminants are insufficient for minimum error.
In previous methods, the central difficulty was the choice of the appropriate nonlinear functions.
A "brute-force" approach might be to select a complete basis set, such as all polynomials; such a classifier would require too many parameters to be determined from a limited number of training samples.

There is no automatic method for determining the nonlinearities when no information is provided to the classifier.
In multilayer neural networks, the form of the nonlinearity is learned from the training data.

2. Feedforward Operation and Classification
A three-layer neural network consists of an input layer, a hidden layer and an output layer, interconnected by modifiable weights represented by links between layers.
Note: some authors count the number of layers as the number of sets of weights, one less than the number of layers defined here.

A single "bias unit" is connected to each unit other than the input units.
Net activation:
net_j = Σ_{i=1}^{d} x_i w_ji + w_j0 = Σ_{i=0}^{d} x_i w_ji
where the subscript i indexes units in the input layer and j indexes units in the hidden layer; w_ji denotes the input-to-hidden layer weight at hidden unit j. (In neurobiology, such weights or connections are called "synapses".)
Each hidden unit emits an output that is a nonlinear function of its activation, that is: y_j = f(net_j).
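To make the net-activation formula concrete, here is a minimal NumPy sketch (not from the book) of the hidden-layer computation; the weight and bias values are arbitrary illustrations.

```python
import numpy as np

# Hidden-layer computation: net_j = sum_i w_ji * x_i + w_j0, then y_j = f(net_j).
# Illustrative weight values only.

def hidden_activations(x, W, b, f=np.tanh):
    """x: (d,) input; W: (n_H, d) input-to-hidden weights; b: (n_H,) bias weights."""
    net = W @ x + b          # net activation of each hidden unit
    return f(net)            # y_j = f(net_j)

x = np.array([1.0, -1.0])                  # a two-feature input pattern
W = np.array([[1.0, 1.0], [1.0, 1.0]])     # w_ji, illustrative values
b = np.array([0.5, -1.5])                  # bias weights w_j0
print(hidden_activations(x, W, b))
```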

Figure 6.1 shows a simple threshold function as f.
The function f(.) is also called the activation function or "nonlinearity" of a unit. There are more general activation functions with desirable properties.
Each output unit similarly computes its net activation based on the hidden unit signals:
net_k = Σ_{j=1}^{n_H} y_j w_kj + w_k0 = Σ_{j=0}^{n_H} y_j w_kj
where the subscript k indexes units in the output layer and n_H denotes the number of hidden units.

Each of the (possibly several) outputs is denoted z_k. An output unit computes the nonlinear function of its net activation, emitting z_k = f(net_k).
In the case of c outputs (classes), we can view the network as computing c discriminant functions z_k = g_k(x), and classify the input x according to the largest discriminant function g_k(x), k = 1, ..., c.
The three-layer network with the weights listed in Fig. 6.1 solves the XOR problem.

The hidden unit y_1 computes the boundary x_1 + x_2 + 0.5 = 0:
  net_1 ≥ 0  =>  y_1 = +1, otherwise y_1 = -1
The hidden unit y_2 computes the boundary x_1 + x_2 - 1.5 = 0:
  net_2 ≥ 0  =>  y_2 = +1, otherwise y_2 = -1
The final output unit emits z_1 = +1 if and only if y_1 = +1 and y_2 = -1, i.e.
  z_1 = y_1 AND NOT y_2 = (x_1 OR x_2) AND NOT (x_1 AND x_2) = x_1 XOR x_2
which provides the nonlinear decision of Fig. 6.1.
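As an illustration, a small Python sketch of this XOR network with sign (threshold) units; the bias values (+0.5, -1.5) and the output weights below are one consistent reconstruction in the spirit of Fig. 6.1, not necessarily the exact figure values.

```python
import numpy as np

def sign(a):
    return np.where(a >= 0, 1, -1)

def xor_net(x1, x2):
    y1 = sign(x1 + x2 + 0.5)                 # behaves like (x1 OR x2)
    y2 = sign(x1 + x2 - 1.5)                 # behaves like (x1 AND x2)
    z1 = sign(0.7 * y1 - 0.4 * y2 - 1.0)     # +1 iff y1 = +1 and y2 = -1
    return z1

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), "->", xor_net(x1, x2))
# prints -1, +1, +1, -1: the XOR truth table in +/-1 coding
```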

General Feedforward Operation – case of c output units:
g_k(x) = z_k = f( Σ_{j=1}^{n_H} w_kj f( Σ_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 ),  k = 1, ..., c    (1)
Hidden units enable us to express more complicated nonlinear functions and thus extend the classification.
The activation function does not have to be a sign function; it is often required to be continuous and differentiable.
We can allow the activation function in the output layer to be different from the activation function in the hidden layer, or have a different activation for each individual unit.
We assume for now that all activation functions are identical.

Expressive Power of Multilayer Networks
Question: Can every decision be implemented by a three-layer network described by equation (1)?
Answer: Yes (due to A. Kolmogorov). "Any continuous function from input to output can be implemented in a three-layer net, given a sufficient number of hidden units n_H, proper nonlinearities, and weights."
g(x) = Σ_{j=1}^{2n+1} Ξ_j( Σ_{i=1}^{d} ψ_ij(x_i) )
for properly chosen functions Ξ_j and ψ_ij.

Each of the 2n+1 hidden units Ξ_j takes as input a sum of d nonlinear functions, one for each input feature x_i.
Each hidden unit emits a nonlinear function Ξ_j of its total input.
The output unit emits the sum of the contributions of the hidden units.
Unfortunately, Kolmogorov's theorem tells us very little about how to find the nonlinear functions based on data; this is the central problem in network-based pattern recognition.

3. Backpropagation Algorithm
Any function from input to output can be implemented as a three-layer neural network.
These results are of greater theoretical than practical interest, since the construction of such a network requires the nonlinear functions and the weight values, which are unknown!

Our goal now is to set the interconnection weights based on the training patterns and the desired outputs.
In a three-layer network, it is a straightforward matter to understand how the output, and thus the error, depends on the hidden-to-output layer weights.
The power of backpropagation is that it enables us to compute an effective error for each hidden unit, and thus derive a learning rule for the input-to-hidden weights; this is known as the credit assignment problem.

A network has two modes of operation:
Feedforward: the feedforward operation consists of presenting a pattern to the input units and passing (or feeding) the signals through the network in order to get outputs from the output units (no cycles!).
Learning: supervised learning consists of presenting an input pattern and modifying the network parameters (weights) to reduce the distance between the computed output and the desired output.

Network Learning
Let t_k be the k-th target (or desired) output and z_k the k-th computed output, with k = 1, ..., c, and let w represent all the weights of the network.
The training error:
J(w) = 1/2 Σ_{k=1}^{c} (t_k - z_k)^2 = 1/2 ||t - z||^2
The backpropagation learning rule is based on gradient descent.
The weights are initialized with pseudo-random values and are changed in a direction that will reduce the error:
Δw = -η ∂J/∂w
where η is the learning rate, which indicates the relative size of the change in weights:
w(m+1) = w(m) + Δw(m)
where m indexes the m-th pattern presented.
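A minimal sketch of the squared-error criterion and the generic gradient-descent step; gradient_step takes the gradient as an argument, standing in for the backpropagation computation developed on the following slides.

```python
import numpy as np

def training_error(t, z):
    """J(w) = 1/2 * ||t - z||^2 for one pattern."""
    return 0.5 * np.sum((t - z) ** 2)

def gradient_step(w, grad_J, eta=0.1):
    """w(m+1) = w(m) + Delta_w(m), with Delta_w = -eta * dJ/dw."""
    return w - eta * grad_J

t = np.array([+1.0, -1.0])        # target vector
z = np.array([+0.4, -0.2])        # computed network output
print(training_error(t, z))       # 0.5 * (0.6**2 + 0.8**2) = 0.5
```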

Starting with a pseudo-random weight configuration, the stochastic backpropagation algorithm can be written as:
Begin  initialize n_H; w, criterion θ, η, m ← 0
   do  m ← m + 1
       x^m ← randomly chosen pattern
       w_ji ← w_ji + η δ_j x_i ;  w_kj ← w_kj + η δ_k y_j
   until ||∇J(w)|| < θ
   return w
End
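The following Python sketch fills in the pseudocode above for a concrete three-layer network with tanh activations and the squared-error criterion; the variable names, initialization range, and the crude single-pattern stopping test are illustrative choices, not the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(X, T, n_H=2, eta=0.1, theta=1e-3, max_iters=100_000):
    """Stochastic backpropagation for a d -> n_H -> c network (sketch)."""
    d, c = X.shape[1], T.shape[1]
    # weights include a bias weight; inputs and hidden outputs get a constant 1 appended
    W_hidden = rng.uniform(-1, 1, size=(n_H, d + 1))   # w_ji
    W_output = rng.uniform(-1, 1, size=(c, n_H + 1))   # w_kj
    for m in range(max_iters):
        i = rng.integers(len(X))                       # randomly chosen pattern
        x = np.append(X[i], 1.0)                       # input with bias component
        y = np.append(np.tanh(W_hidden @ x), 1.0)      # hidden outputs + bias component
        z = np.tanh(W_output @ y)                      # network outputs
        # sensitivities: delta_k = (t_k - z_k) f'(net_k), then backpropagate to delta_j
        delta_k = (T[i] - z) * (1.0 - z ** 2)
        delta_j = (1.0 - y[:-1] ** 2) * (W_output[:, :-1].T @ delta_k)
        grad_out = np.outer(delta_k, y)                # eta * grad_out is the w_kj update
        W_output += eta * grad_out
        W_hidden += eta * np.outer(delta_j, x)
        if np.linalg.norm(grad_out) < theta:           # crude single-pattern stopping test
            break
    return W_hidden, W_output

# Usage on the XOR problem in +/-1 coding:
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
T = np.array([[-1], [1], [1], [-1]], dtype=float)
Wh, Wo = train(X, T)
```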

Stopping criterion
The algorithm terminates when the change in the criterion function J(w) is smaller than some preset value θ.
There are other stopping criteria that lead to better performance than this one.

Learning Curves
Before training starts, the error on the training set is high; through the learning process, the error becomes smaller.
The error per pattern depends on the amount of training data and on the expressive power (such as the number of weights) of the network.
The average error on an independent test set is always higher than on the training set, and it can decrease as well as increase.
A validation set is used to decide when to stop training; we do not want to overfit the network and decrease its generalization power: "we stop training at a minimum of the error on the validation set".
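A sketch of stopped training driven by the validation error; train_one_epoch and validation_error are hypothetical callables standing in for one backpropagation pass over the training set and an error evaluation on the held-out validation set.

```python
def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    best_error, best_weights, epochs_since_best = float("inf"), weights, 0
    for epoch in range(max_epochs):
        weights = train_one_epoch(weights)
        err = validation_error(weights)
        if err < best_error:                       # new minimum on the validation set
            best_error, best_weights, epochs_since_best = err, weights, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:      # validation error stopped improving
                break
    return best_weights                            # weights at the validation minimum
```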

4. Error Surfaces
We can gain understanding and intuition about the backpropagation algorithm by studying error surfaces.
The network tries to find the global minimum, but local minima can be a problem.
The presence of plateaus can also be a problem.

Consider the simplest three-layer nonlinear network, with a single global minimum.

Consider the same network with a problem that is not linearly separable.

Consider a solution to the XOR problem.

5. Backpropagation as Feature Mapping
Because the hidden-to-output layer leads to a linear discriminant, the novel computational power provided by multilayer neural nets can be attributed to the nonlinear warping of the input to the representation at the hidden units.

Because the hidden-to-output weights lead to a linear discriminant, the input-to-hidden weights are the most instructive to examine.

8. Practical Techniques for Improving Backpropagation
Activation Function
Must be nonlinear.
Must saturate – have a maximum and minimum.
Must be continuous and smooth.
Must be monotonic.
The sigmoid class of functions provides these properties.
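As an illustration, a sigmoid-class activation of the form f(net) = a tanh(b net), which satisfies the properties listed above; the constants a = 1.716 and b = 2/3 are a commonly quoted choice, used here as illustrative defaults rather than required values.

```python
import numpy as np

def f(net, a=1.716, b=2.0 / 3.0):
    # nonlinear, saturates at +/- a, continuous, smooth, monotonic
    return a * np.tanh(b * net)

def f_prime(net, a=1.716, b=2.0 / 3.0):
    # derivative of f, needed by backpropagation
    return a * b / np.cosh(b * net) ** 2

print(f(np.array([-5.0, 0.0, 5.0])))   # approaches -a, 0, +a
```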

Parameters for the Sigmoid

Scaling Input
Data standardization: the full data set should be scaled to have the same variance in each feature component.
Example of the importance of standardization: in the fish classification example of Chapter 1, suppose the fish weight is measured in grams and the fish length in meters; the weight would be orders of magnitude larger than the length.
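A sketch of standardization; here each feature is also shifted to zero mean in addition to being scaled to unit variance, a common companion step (an assumption, not stated on this slide). The training-set statistics would also be applied to test patterns.

```python
import numpy as np

def standardize(X, eps=1e-12):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / (std + eps), mean, std

# e.g. weight in grams vs. length in meters: wildly different scales
X = np.array([[850.0, 0.31], [1200.0, 0.42], [640.0, 0.27]])
X_scaled, mean, std = standardize(X)
print(X_scaled.std(axis=0))   # ~[1. 1.]
```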

Target Values
Reasonable target values are +1 and -1.
For example, for a pattern in class 3 (of four classes), use the target vector (-1, -1, +1, -1).
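A small sketch of one-of-c target coding matching the example above.

```python
import numpy as np

def target_vector(class_index, c):
    # +1 for the correct class, -1 for all others
    t = -np.ones(c)
    t[class_index] = 1.0
    return t

print(target_vector(2, 4))   # class 3 of 4 -> [-1. -1.  1. -1.]
```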

Training with Noise
When the training set is small, additional (surrogate) training patterns can be generated by adding noise to the true training patterns.
This helps most classification methods, except for highly local classifiers such as kNN.
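A sketch of generating surrogate training patterns by adding zero-mean Gaussian noise to the (standardized) training patterns; the noise scale 0.1 and the number of copies are illustrative values, not recommendations from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise_patterns(X, T, copies=3, sigma=0.1):
    # each copy of X gets independent zero-mean Gaussian noise; targets are reused
    X_new = np.vstack([X + sigma * rng.standard_normal(X.shape) for _ in range(copies)])
    T_new = np.vstack([T] * copies)
    return np.vstack([X, X_new]), np.vstack([T, T_new])
```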

Manufacturing Data
As with training with noise, manufactured data can be used with a wide variety of pattern classification methods.
Manufactured training data, however, usually adds more information than Gaussian noise.
For example, in OCR additional training patterns could be generated by slight rotations or line thickening of the true training patterns.

Number of Hidden Units

Initializing Weights
Initial weights cannot be zero, because no learning would take place.
In setting weights in a given layer, we usually choose them randomly from a single distribution to help ensure uniform learning.
Uniform learning means that all weights reach their final equilibrium values at about the same time.
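A sketch of drawing a layer's initial weights from a single uniform distribution; the range ±1/sqrt(fan_in) is a common heuristic for keeping net activations in the active range of a sigmoid, used here as an assumption rather than the only valid choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    bound = 1.0 / np.sqrt(fan_in)
    # one uniform distribution for the whole layer, including the bias weight
    return rng.uniform(-bound, bound, size=(fan_out, fan_in + 1))

W_hidden = init_layer(fan_in=2, fan_out=3)   # d = 2 inputs, n_H = 3 hidden units
```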

Learning Rates

Momentum

Weight Decay
After each weight update, every weight is simply "decayed" or shrunk:
w_new = w_old (1 - ε)
Although there is no principled reason, weight decay helps in most cases.
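A one-line sketch of the decay step; eps is a small illustrative constant.

```python
def decay_weights(W, eps=1e-3):
    return W * (1.0 - eps)   # w_new = w_old * (1 - eps), applied after each update
```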

Hints
A hint adds extra output units during training that are trained on an auxiliary, related distinction; these units are removed once training is done.
For example, in OCR the hints might be distinguishing straight-line from curved characters.
Hints are a special benefit of neural networks.

On-line, Stochastic or Batch Training?
Each of the three leading training protocols has strengths and weaknesses.
On-line training is usually used only when the training set is huge or memory costs are high.
Batch learning is typically slower than stochastic learning.
For most applications, stochastic training is preferred.

Stopped Training
It is best to stop training when the error rate on a validation set reaches a minimum.

Number of Hidden Layers
Although three layers suffice to implement any arbitrary function, additional layers can facilitate learning.
For example, it is easier for a four-layer net to learn translations than for a three-layer net.

Criterion Function
The most common training criterion is the squared error (Eq. 9).
Alternatives include cross-entropy, Minkowski error, etc.
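A sketch contrasting the squared-error criterion with a cross-entropy criterion in generic textbook form (targets and outputs interpreted as probabilities in (0, 1) for the latter); this is not code from the book.

```python
import numpy as np

def squared_error(t, z):
    return 0.5 * np.sum((t - z) ** 2)

def cross_entropy(t, z, eps=1e-12):
    # t and z assumed in (0, 1); eps guards against log(0)
    return -np.sum(t * np.log(z + eps) + (1 - t) * np.log(1 - z + eps))

t = np.array([1.0, 0.0])
z = np.array([0.8, 0.3])
print(squared_error(t, z), cross_entropy(t, z))
```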