Presentation transcript:

November 26, 2013 Computer Vision Lecture 15: Object Recognition III

Slide 1: Backpropagation Network Structure
Perceptrons (and many other classifiers) can only linearly separate the input space. Backpropagation networks (BPNs) do not have this limitation: in principle, they can capture any statistical relationship between training inputs and desired outputs, although the training procedure is computationally complex. BPNs are multi-layered networks. It has been shown that three layers of neurons are sufficient to compute any function that could be useful in, for example, a computer vision application.

Slide 2: Backpropagation Network Structure
Most backpropagation networks use the following three layers:
Input layer: only stores the input and sends it to the hidden layer; performs no computation.
Hidden layer (i.e., not visible from the input or output side): receives data from the input layer, performs computation, and sends its results to the output layer.
Output layer: receives data from the hidden layer, performs computation, and its results form the network's output.

Slide 3: Backpropagation Network Structure
Example: a network computing a function f: R^3 → R^2. [Figure: the input vector (x_1, x_2, x_3) feeds the input layer, which connects to the hidden layer, which in turn connects to the output layer producing the output vector (o_1, o_2).]

Slide 4: The Backpropagation Algorithm
Idea behind backpropagation learning: neurons compute a continuous, differentiable function between their input and output. We define an error of the network output as a function of all the network's weights, and we then find those weights for which the error is minimal. With a differentiable error function, we can use the gradient descent technique to find a minimum of the error function.

Slide 5: Sigmoidal Neurons
In backpropagation networks, we typically choose τ = 1 and θ = 0 in the sigmoid function
f_i(net_i(t)) = 1 / (1 + e^(-(net_i(t) - θ)/τ)).
[Figure: f_i(net_i(t)) plotted against net_i(t) for τ = 1 and τ = 0.1.]

Slide 6: Sigmoidal Neurons
This leads to a simplified form of the sigmoid function:
S(net) = 1 / (1 + e^(-net)).
We do not need a modifiable threshold θ, because we will use "dummy" inputs as we did for perceptrons. The choice τ = 1 works well in most situations and results in a very simple derivative of S(net).

Slide 7: Sigmoidal Neurons
S'(net) = S(net) · (1 - S(net)).
This result will be very useful when we develop the backpropagation algorithm.
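The simplified sigmoid and its derivative can be sanity-checked in a few lines of Python (a sketch of our own; the function names are not from the lecture):

```python
import math

def sigmoid(net):
    """Simplified sigmoid S(net) = 1 / (1 + e^(-net))."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    """Derivative S'(net) = S(net) * (1 - S(net))."""
    s = sigmoid(net)
    return s * (1.0 - s)

# Numerical check of the derivative identity at net = 0.5:
h = 1e-6
numeric = (sigmoid(0.5 + h) - sigmoid(0.5 - h)) / (2 * h)
print(abs(numeric - sigmoid_prime(0.5)) < 1e-8)  # True
```

The central-difference check confirms that the closed-form derivative matches the slope of the curve.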

Slide 8: Gradient Descent
Gradient descent is a very common technique for finding minima of a function, and it is especially useful for high-dimensional functions. We will use it to iteratively minimize the network's (or neuron's) error by computing the gradient of the error surface in weight-space and adjusting the weights in the opposite direction.

Slide 9: Gradient Descent
Gradient-descent example: finding a minimum of a one-dimensional error function f(x). Starting at x_0, where the slope is f'(x_0), we take the step
x_1 = x_0 - η f'(x_0).
We repeat this iteratively until, for some x_i, f'(x_i) is sufficiently close to 0.
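The one-dimensional update rule above can be sketched directly (our own sketch; the error function and step size are made up for illustration):

```python
def gradient_descent_1d(f_prime, x0, eta=0.1, tol=1e-8, max_steps=10000):
    """Iterate x_{i+1} = x_i - eta * f'(x_i) until the slope is ~0."""
    x = x0
    for _ in range(max_steps):
        slope = f_prime(x)
        if abs(slope) < tol:
            break
        x = x - eta * slope
    return x

# Example error function f(x) = (x - 3)^2, with minimum at x = 3:
x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 4))  # 3.0
```

Each step moves against the slope, so the iterates slide downhill until the derivative is nearly zero.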

Slide 10: Gradient Descent
Gradients of two-dimensional functions: [Figure: the two-dimensional function in the left diagram is represented by contour lines in the right diagram, where arrows indicate the gradient of the function at different locations.] The gradient always points in the direction of the steepest increase of the function. In order to find the function's minimum, we should always move against the gradient.

Slide 11: Backpropagation Learning
As with the perceptron, the goal of the backpropagation learning algorithm is to modify the network's weights so that its output vector o_p = (o_{p,1}, o_{p,2}, …, o_{p,K}) is as close as possible to the desired output vector d_p = (d_{p,1}, d_{p,2}, …, d_{p,K}) for K output neurons and input patterns p = 1, …, P. The set of input-output pairs (exemplars) {(x_p, d_p) | p = 1, …, P} constitutes the training set.

Slide 12: Backpropagation Learning
We need a cumulative error function that is to be minimized. We can choose the mean square error (MSE):
E = (1/P) Σ_{p=1}^{P} E_p, where E_p = (1/2) Σ_{k=1}^{K} (d_{p,k} - o_{p,k})^2.
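The MSE definition translates into a few lines of Python (a sketch with made-up example values; the function names are ours):

```python
def pattern_error(d_p, o_p):
    """E_p = 1/2 * sum_k (d_pk - o_pk)^2 for one pattern."""
    return 0.5 * sum((d - o) ** 2 for d, o in zip(d_p, o_p))

def mean_square_error(desired, outputs):
    """E = (1/P) * sum_p E_p over all P patterns."""
    P = len(desired)
    return sum(pattern_error(d, o) for d, o in zip(desired, outputs)) / P

desired = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.8, 0.1], [0.2, 0.7]]
print(mean_square_error(desired, outputs))  # ≈ 0.045
```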

Slide 13: Backpropagation Learning
For input pattern p, the i-th input layer node holds x_{p,i}.
Net input to the j-th node in the hidden layer: net_j^{(1)} = Σ_i w_{j,i}^{(1,0)} x_{p,i}.
Output of the j-th node in the hidden layer: o_{p,j}^{(1)} = S(net_j^{(1)}).
Net input to the k-th node in the output layer: net_k^{(2)} = Σ_j w_{k,j}^{(2,1)} o_{p,j}^{(1)}.
Output of the k-th node in the output layer: o_{p,k} = S(net_k^{(2)}).
Network error for p: E_p = (1/2) Σ_k (d_{p,k} - o_{p,k})^2.
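The feedforward equations above can be sketched as one pass through a tiny network (a minimal sketch with hand-picked weights; `forward` is our name, not the lecture's):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(x_p, w_hidden, w_output):
    """One feedforward pass through a two-weight-layer network.
    w_hidden[j][i] = w_ji^(1,0), w_output[k][j] = w_kj^(2,1)."""
    # Hidden layer: net_j^(1), then o_j^(1) = S(net_j^(1))
    hidden_out = [sigmoid(sum(w_ji * x_i for w_ji, x_i in zip(row, x_p)))
                  for row in w_hidden]
    # Output layer: net_k^(2), then o_k = S(net_k^(2))
    output = [sigmoid(sum(w_kj * o_j for w_kj, o_j in zip(row, hidden_out)))
              for row in w_output]
    return hidden_out, output

# Tiny 3-2-2 network matching the f: R^3 -> R^2 example:
w_hidden = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
w_output = [[0.5, -0.5], [0.2, 0.6]]
hidden, out = forward([1.0, 0.5, -1.0], w_hidden, w_output)
print(len(hidden), len(out))  # 2 2
```

The hidden activations are kept because the weight-update equations derived next reuse them.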

Slide 14: Backpropagation Learning
As E is a function of the network weights, we can use gradient descent to find those weights that result in minimal error. For individual weights in the hidden and output layers, we move against the error gradient (omitting index p):
Output layer: Δw_{k,j}^{(2,1)} = -η ∂E/∂w_{k,j}^{(2,1)} (derivative easy to calculate).
Hidden layer: Δw_{j,i}^{(1,0)} = -η ∂E/∂w_{j,i}^{(1,0)} (derivative difficult to calculate).

Slide 15: Backpropagation Learning
When computing the derivative with regard to w_{k,j}^{(2,1)}, we can disregard any output units except o_k. Remember that o_k is obtained by applying the sigmoid function S to net_k^{(2)}, which is computed by
net_k^{(2)} = Σ_j w_{k,j}^{(2,1)} o_j^{(1)}.
Therefore, we need to apply the chain rule twice.

Slide 16: Backpropagation Learning
Since E = (1/2) Σ_k (d_k - o_k)^2, we have ∂E/∂o_k = -(d_k - o_k).
We know that ∂o_k/∂net_k^{(2)} = S'(net_k^{(2)}) and ∂net_k^{(2)}/∂w_{k,j}^{(2,1)} = o_j^{(1)}.
Which gives us:
∂E/∂w_{k,j}^{(2,1)} = ∂E/∂o_k · ∂o_k/∂net_k^{(2)} · ∂net_k^{(2)}/∂w_{k,j}^{(2,1)} = -(d_k - o_k) S'(net_k^{(2)}) o_j^{(1)}.

Slide 17: Backpropagation Learning
For the derivative with regard to w_{j,i}^{(1,0)}, notice that E depends on it through net_j^{(1)}, which influences each o_k with k = 1, …, K. Using the chain rule of derivatives again:
∂E/∂w_{j,i}^{(1,0)} = Σ_k [-(d_k - o_k) S'(net_k^{(2)}) w_{k,j}^{(2,1)}] · S'(net_j^{(1)}) · x_i.

Slide 18: Backpropagation Learning
This gives us the following weight changes at the output layer:
Δw_{k,j}^{(2,1)} = η (d_k - o_k) S'(net_k^{(2)}) o_j^{(1)}
… and at the inner layer:
Δw_{j,i}^{(1,0)} = η [Σ_k (d_k - o_k) S'(net_k^{(2)}) w_{k,j}^{(2,1)}] S'(net_j^{(1)}) x_i.

Slide 19: Backpropagation Learning
As you surely remember from a few minutes ago, S'(net) = S(net)(1 - S(net)). Then we can simplify the generalized error terms:
δ_k = (d_k - o_k) o_k (1 - o_k)
And:
δ_j = o_j^{(1)} (1 - o_j^{(1)}) Σ_k δ_k w_{k,j}^{(2,1)}.

Slide 20: Backpropagation Learning
The simplified error terms δ_k and δ_j use variables that are calculated in the feedforward phase of the network and can thus be computed very efficiently. Now let us state the final equations again and reintroduce the subscript p for the p-th pattern:
Δw_{k,j}^{(2,1)} = η δ_{p,k} o_{p,j}^{(1)}, with δ_{p,k} = (d_{p,k} - o_{p,k}) o_{p,k} (1 - o_{p,k}).
Δw_{j,i}^{(1,0)} = η δ_{p,j} x_{p,i}, with δ_{p,j} = o_{p,j}^{(1)} (1 - o_{p,j}^{(1)}) Σ_k δ_{p,k} w_{k,j}^{(2,1)}.
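The error terms can be computed from quantities already available after the feedforward phase. A minimal sketch with made-up activations and weights (function and variable names are ours):

```python
def deltas(d_p, o_out, o_hidden, w_output):
    """Generalized error terms for one pattern p:
    delta_k = (d_pk - o_pk) * o_pk * (1 - o_pk)
    delta_j = o_pj * (1 - o_pj) * sum_k delta_k * w_kj^(2,1)"""
    delta_out = [(d - o) * o * (1 - o) for d, o in zip(d_p, o_out)]
    delta_hid = [o_j * (1 - o_j) *
                 sum(dk * w_output[k][j] for k, dk in enumerate(delta_out))
                 for j, o_j in enumerate(o_hidden)]
    return delta_out, delta_hid

# Made-up feedforward results for a 2-hidden, 1-output network:
o_hidden = [0.6, 0.4]
o_out = [0.7]
d_p = [1.0]
w_output = [[0.5, -0.3]]
d_out, d_hid = deltas(d_p, o_out, o_hidden, w_output)
print(round(d_out[0], 4))  # 0.063
```

Note that no extra network passes are needed: the deltas reuse the stored activations, which is why backpropagation is efficient.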

Slide 21: Backpropagation Learning
Algorithm Backpropagation:
  Start with randomly chosen weights;
  while MSE is above the desired threshold and computational bounds are not exceeded, do
    for each input pattern x_p, 1 ≤ p ≤ P, picked in random order:
      Compute hidden node inputs;
      Compute hidden node outputs;
      Compute inputs to the output nodes;
      Compute the network outputs;
      Compute the error between output and desired output;
      Modify the weights between hidden and output nodes;
      Modify the weights between input and hidden nodes;
    end-for
  end-while.
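The loop above can be sketched as a minimal Python trainer (our own sketch, not the lecture's code: it visits patterns in a fixed rather than random order, runs a fixed number of epochs instead of an MSE threshold, trains on the XOR problem with softened targets 0.1/0.9, and all names are made up):

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train_bpn(patterns, n_hidden=4, eta=0.5, epochs=3000, seed=1):
    """Minimal backpropagation trainer for a one-hidden-layer network;
    a 'dummy' input fixed at 1 provides the offset for every node."""
    random.seed(seed)
    n_in = len(patterns[0][0]) + 1          # +1 for the dummy input
    n_out = len(patterns[0][1])
    w_h = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_o = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, d in patterns:
            xv = list(x) + [1.0]
            # Feedforward phase
            oh = [sigmoid(sum(w * xi for w, xi in zip(row, xv))) for row in w_h] + [1.0]
            oo = [sigmoid(sum(w * h for w, h in zip(row, oh))) for row in w_o]
            # Error terms delta_k and delta_j
            dk = [(dt - o) * o * (1 - o) for dt, o in zip(d, oo)]
            dj = [oh[j] * (1 - oh[j]) * sum(dk[k] * w_o[k][j] for k in range(n_out))
                  for j in range(n_hidden)]
            # Weight updates: hidden-to-output, then input-to-hidden
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_o[k][j] += eta * dk[k] * oh[j]
            for j in range(n_hidden):
                for i in range(n_in):
                    w_h[j][i] += eta * dj[j] * xv[i]
    return w_h, w_o

def mse(patterns, w_h, w_o):
    total = 0.0
    for x, d in patterns:
        xv = list(x) + [1.0]
        oh = [sigmoid(sum(w * xi for w, xi in zip(row, xv))) for row in w_h] + [1.0]
        oo = [sigmoid(sum(w * h for w, h in zip(row, oh))) for row in w_o]
        total += 0.5 * sum((dt - o) ** 2 for dt, o in zip(d, oo))
    return total / len(patterns)

xor = [([0, 0], [0.1]), ([0, 1], [0.9]), ([1, 0], [0.9]), ([1, 1], [0.1])]
w_h0, w_o0 = train_bpn(xor, epochs=0)      # initial random weights
w_h, w_o = train_bpn(xor, epochs=3000)     # trained weights (same seed)
print(mse(xor, w_h, w_o) < mse(xor, w_h0, w_o0))  # True
```

XOR is a convenient test case because, as discussed earlier, no single-layer perceptron can solve it.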

Slide 22: K-Class Classification Problem
Let us denote the k-th class by C_k, with n_k exemplars or training samples, forming the sets T_k for k = 1, …, K. The complete training set is T = T_1 ∪ … ∪ T_K. The desired output of the network for an input of class k is 1 for output unit k and 0 for all other output units:
d_p = (0, …, 0, 1, 0, …, 0), with a 1 at the k-th position if the sample is in class k.

Slide 23: K-Class Classification Problem
However, due to the sigmoid output function, the net input to the output units would have to be -∞ or +∞ to generate outputs 0 or 1, respectively. Because of the shallow slope of the sigmoid function at extreme net inputs, even approaching these values would be very slow. To avoid this problem, it is advisable to use desired outputs ε and (1 - ε) instead of 0 and 1, respectively. Typical values for ε range between 0.01 and 0.1. For ε = 0.1, desired output vectors would look like this:
d_p = (0.1, …, 0.1, 0.9, 0.1, …, 0.1), with 0.9 at the k-th position for a sample in class k.
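Building these softened target vectors is a one-liner (our own sketch; classes are numbered 1 through K as on this slide):

```python
def target_vector(k, K, eps=0.1):
    """Desired output for class k (1-based): (1 - eps) at position k, eps elsewhere."""
    return [(1 - eps) if j == k - 1 else eps for j in range(K)]

print(target_vector(2, 4))  # [0.1, 0.9, 0.1, 0.1]
```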

Slide 24: K-Class Classification Problem
We should not "punish" more extreme values, though. To avoid punishment, we can define the output error l_{p,j} as follows:
1. If d_{p,j} = (1 - ε) and o_{p,j} ≥ d_{p,j}, then l_{p,j} = 0.
2. If d_{p,j} = ε and o_{p,j} ≤ d_{p,j}, then l_{p,j} = 0.
3. Otherwise, l_{p,j} = o_{p,j} - d_{p,j}.
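The three cases translate directly into code (a sketch of our own; `output_error` is a made-up name):

```python
def output_error(d, o, eps=0.1):
    """l_pj: zero when the output overshoots the target in the 'right' direction."""
    if d == 1 - eps and o >= d:   # case 1: above a high target, no punishment
        return 0.0
    if d == eps and o <= d:       # case 2: below a low target, no punishment
        return 0.0
    return o - d                  # case 3: ordinary error

print(output_error(0.9, 0.97))           # 0.0 (more extreme than target)
print(round(output_error(0.1, 0.3), 2))  # 0.2
```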

Slide 25: NN Application Design
Now that we have gained some insight into the theory of backpropagation networks, how can we design networks for particular applications? Designing NNs is basically an engineering task. For example, there is no formula that would allow you to determine the optimal number of hidden units in a BPN for a given task.

Slide 26: Training and Performance Evaluation
How many samples should be used for training? Heuristic: at least 5-10 times as many samples as there are weights in the network. Formula (Baum & Haussler, 1989):
P ≥ |W| / (1 - a),
where P is the number of samples, |W| is the number of weights to be trained, and a is the desired accuracy (e.g., proportion of correctly classified samples).
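Assuming the simplified form of the bound given above, P ≥ |W| / (1 - a), the required training-set size is a quick calculation (our own sketch):

```python
def required_samples(n_weights, accuracy):
    """P >= |W| / (1 - a): heuristic lower bound on training-set size."""
    return n_weights / (1.0 - accuracy)

# A network with 500 weights trained to 90% accuracy:
print(round(required_samples(500, 0.9)))  # 5000
```

Note how a = 0.9 reproduces the "10 times as many samples as weights" rule of thumb.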

Slide 27: Training and Performance Evaluation
What learning rate η should we choose? The problems that arise when η is too small or too big are similar to those for the perceptron. Unfortunately, the optimal value of η depends entirely on the application. Values between 0.1 and 0.9 are typical for most applications. Often, η is initially set to a large value and is decreased during the learning process. This leads to better convergence of learning and also decreases the likelihood of "getting stuck" in a local error minimum at an early learning stage.

Slide 28: Training and Performance Evaluation
When training a BPN, what is the acceptable error, i.e., when do we stop the training? The minimum error that can be achieved depends not only on the network parameters but also on the specific training set. Thus, for some applications the minimum error will be higher than for others.

Slide 29: Training and Performance Evaluation
An insightful way of performance evaluation is partial-set training. The idea is to split the available data into two sets: the training set and the test set. The network's performance on the second set indicates how well the network has actually learned the desired mapping. We should expect the network to interpolate, but not extrapolate. Therefore, this test also evaluates our choice of training samples.
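Such a split can be sketched in a few lines (our own sketch; the 70% training fraction is just an illustrative default):

```python
import random

def partial_set_split(samples, train_fraction=0.7, seed=0):
    """Shuffle the available data and split it into a training set and a test set."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = partial_set_split(data)
print(len(train), len(test))  # 70 30
```

Shuffling before splitting avoids accidentally putting all exemplars of one class into the test set.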

Slide 30: Training and Performance Evaluation
If the test set contains only one exemplar, this type of training is called "hold-one-out" training. It is to be performed sequentially for every individual exemplar; this, of course, is a very time-consuming process. For example, if we have 1,000 exemplars and want to perform 100 epochs of training, this procedure involves 1,000 × 999 × 100 = 99,900,000 training steps. Partial-set training with a 70%-30% split would only require 70,000 training steps. On the positive side, the advantage of hold-one-out training is that all available exemplars (except one) are used for training, which might lead to better network performance.

Slide 31: Example: Face Recognition
Now let us assume that we want to build a network for a computer vision application. More specifically, our network is supposed to recognize faces and face poses. This is an example that has actually been implemented. All information, such as program code and data, can be found at:

Slide 32: Example: Face Recognition
The goal is to classify camera images of faces of various people in various poses. Images of 20 different people were collected, with up to 32 images per person. The following variables were introduced:
expression (happy, sad, angry, neutral)
direction of looking (left, right, straight ahead, up)
sunglasses (yes or no)
In total, 624 grayscale images were collected, each with a resolution of 30 by 32 pixels and intensity values between 0 and 255.

Slide 33: Example: Face Recognition
The network presented here only has the task of determining the face pose (left, right, up, straight) shown in an input image. It uses
960 input units (one for each pixel in the image),
3 hidden units,
4 output neurons (one for each pose).
Each output unit receives an additional ("dummy") input, which is always 1. By varying the weight for this input, the backpropagation algorithm can adjust an offset for the net input signal.
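The 960-3-4 architecture can be sketched as follows (our own sketch, not the original implementation: initial weight range, function names, and the pose-index ordering are all assumptions for illustration):

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def make_pose_network(n_inputs=960, n_hidden=3, n_outputs=4, seed=0):
    """Weight matrices for the 960-3-4 pose network; each output unit
    gets one extra weight for the constant 'dummy' input 1."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                for _ in range(n_hidden)]
    w_output = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)]
                for _ in range(n_outputs)]
    return w_hidden, w_output

def classify_pose(image, w_hidden, w_output):
    hidden = [sigmoid(sum(w * p for w, p in zip(row, image))) for row in w_hidden]
    hidden.append(1.0)  # dummy input providing the output-unit offsets
    outputs = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return outputs.index(max(outputs))  # pose index 0..3 (assumed ordering)

w_h, w_o = make_pose_network()
image = [0.5] * 960  # 30x32 grayscale pixels, here scaled to [0, 1]
print(0 <= classify_pose(image, w_h, w_o) <= 3)  # True
```

With only three hidden units, the hidden layer forms a drastic compression of the 960-pixel input, which is why the learned hidden weights are interpretable as images on the next slide.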

Slide 34: Example: Face Recognition
The following diagram visualizes all network weights after 1 epoch and after 100 epochs. Their values are indicated by brightness (ranging from black = -1 to white = 1). Each 30-by-32 matrix represents the weights of one of the three hidden-layer units. Each row of four squares represents the weights of one output neuron (three weights for the signals from the hidden units, and one for the constant signal 1). After training, the network is able to classify 90% of new (non-trained) face images correctly.

Slide 35: Example: Face Recognition

Slide 36: Online Demo: Character Recognition