Artificial Neural Networks

Beyond Linear Separability
Slides from: Doug Gray, David Poole

Artificial Neural Networks The basic idea in neural nets is to define interconnected networks of simple units (let's call them "artificial neurons") in which each connection has a weight. Weight wij is the weight of the ith input into unit j. The network has input units, where the feature values are placed, and it computes one or more output values. Each output unit corresponds to a class, and the network's prediction is the class whose output value is highest. Learning takes place by adjusting the weights in the network so that the desired output is produced whenever a training sample is presented at the inputs.

Single Perceptron Unit We start by looking at a simpler kind of "neural-like" unit called a perceptron. This is where the perceptron algorithm that we saw earlier came from. Perceptrons antedate modern neural nets. A perceptron unit compares a weighted combination of its inputs against a threshold value and outputs a 1 if the weighted inputs exceed the threshold, and a 0 otherwise. Trick: we treat the (arbitrary) threshold as if it were a weight w0 on a constant input x0 whose value is –1. In this way, we can write the basic rule of operation as computing the weighted sum of all the inputs and comparing it to 0.

Linear Classifier: Single Perceptron Unit The unit computes y = f(w · x) = f(Σi wi xi), where f(z) = 1 if z > 0 and 0 otherwise, and x0 = –1 so that w0 plays the role of the threshold.
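As a concrete illustration (a minimal sketch of my own, not code from the slides), a perceptron unit can be written so that the threshold is just the weight w0 on a constant input x0 = -1:

```python
def perceptron_output(weights, inputs):
    """Single perceptron unit: weights[0] is the threshold weight w0,
    paired with the constant input x0 = -1, so the unit fires exactly
    when the weighted sum of the real inputs exceeds the threshold."""
    x = [-1.0] + list(inputs)                      # prepend x0 = -1
    z = sum(w * xi for w, xi in zip(weights, x))   # weighted sum of all inputs
    return 1 if z > 0 else 0

# Example: weights [0.5, 1.0, 1.0] implement a logical OR of two 0/1 inputs
# (the unit fires whenever x1 + x2 exceeds the threshold 0.5).
assert perceptron_output([0.5, 1.0, 1.0], [0, 0]) == 0
assert perceptron_output([0.5, 1.0, 1.0], [1, 0]) == 1
```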

Beyond Linear Separability Since a single perceptron unit can only define a single linear boundary, it is limited to solving linearly separable problems. A problem like that illustrated by the values of the XOR boolean function cannot be solved by a single perceptron unit.

Multi-Layer Perceptron What if we consider more than one linear separator and combine their outputs; can we get a more powerful classifier? Yes. The introduction of "hidden" units into these networks makes them much more powerful: they are no longer limited to linearly separable problems. Earlier layers transform the problem into more tractable problems for the later layers.

Example: XOR problem See explanations…

Explanations To see how having hidden units can help, let us see how a two-layer perceptron network can solve the XOR problem that a single unit failed to solve. Each hidden unit defines its own "decision boundary", and the output from each of these units is fed to the output unit, which returns a solution to the whole problem. Let's look in detail at each of these boundaries and its effect. If we focus on the first decision boundary, we see that only one of the training points (the one with feature values (1,1)) lies in the half-space that the normal points into. This is the only point with a positive distance, so the perceptron unit outputs 1 for it; the other points have negative distance and the unit outputs 0. Those outputs are shown in the shaded column of the table.
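To make this concrete, here is a small hand-built sketch of the two-layer idea (the weights are my own illustrative choice, not necessarily the ones in the figure): one hidden unit computes OR of the inputs, the other computes AND, and the output unit fires when OR is true but AND is not, which is exactly XOR.

```python
def unit(weights, inputs):
    # weights[0] is the threshold weight, paired with the constant input -1
    x = [-1.0] + list(inputs)
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0

def xor_net(x1, x2):
    h_or  = unit([0.5, 1.0, 1.0], [x1, x2])       # fires unless both inputs are 0
    h_and = unit([1.5, 1.0, 1.0], [x1, x2])       # fires only when both inputs are 1
    return unit([0.5, 1.0, -1.0], [h_or, h_and])  # "OR and not AND" = XOR

for a in (0, 1):
    for b in (0, 1):
        assert xor_net(a, b) == (a ^ b)
```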

Example: XOR problem

Example: XOR problem

Multi-Layer Perceptron Learning Any set of training points can be separated by a three-layer perceptron network, and "almost any" set of points is separable by a two-layer perceptron network. However, the presence of the discontinuous threshold in the unit's operation means that there is no simple local search for a good set of weights; one is forced into trying possibilities in a combinatorial way. The limitations of the single-layer perceptron and the lack of a good learning algorithm for multilayer perceptrons essentially killed the field for quite a few years.

Soft Threshold A natural question to ask is whether we could use gradient ascent/descent to train a multi-layer perceptron. The answer is that we can't, as long as the output is discontinuous with respect to changes in the inputs and the weights. In a perceptron unit it doesn't matter how far a point is from the decision boundary; we still get a 0 or a 1. We need an output that varies smoothly as a function of changes in the network weights if we're to do gradient descent.

Sigmoid Unit The classic "soft threshold" used in neural nets is referred to as a "sigmoid" (meaning S-shaped) and is shown here. The variable z is the "total input" or "activation" of a neuron, that is, the weighted sum of all of its inputs. The sigmoid is applied to the weighted inputs (including the threshold value, as before). Note that when the input z is 0, the sigmoid's value is 1/2. There are actually many different types of sigmoids that can be (and are) used in neural networks. The one shown here, s(z) = 1 / (1 + e^(-z)), is called the logistic function.

Training The key property of the sigmoid is that it is differentiable. This means that we can use gradient based methods of minimization for training. The output of a multi-layer net of sigmoid units is a function of two vectors, the inputs (x) and the weights (w). The output of this function (y) varies smoothly with changes in the input and, importantly, with changes in the weights.

Training

Training Given a set of training points, each of which specifies the net inputs and the desired outputs, we can write an expression for the training error, usually defined as (half) the sum of the squared differences between the actual outputs (given the weights) and the desired outputs. The goal of training is to find a weight vector that minimizes the training error. We could also use the mean squared error (MSE), which divides the sum of squared errors by the number of training points rather than just by 2. Since the number of training points is a constant, the weights at which the minimum is attained are not affected.
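As a small sketch (assuming, for simplicity, a single output per training point), the two error measures differ only by a constant factor and therefore have the same minimizing weights:

```python
def sse(predictions, targets):
    """Sum of squared errors, with the conventional factor of 1/2."""
    return 0.5 * sum((y - t) ** 2 for y, t in zip(predictions, targets))

def mse(predictions, targets):
    """Mean squared error: divide by the number of training points instead."""
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(targets)
```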

Training

Gradient Descent We've seen that the simplest method for minimizing a differentiable function is gradient descent (or ascent, if we're maximizing). Recall that we are trying to find the weights that lead to a minimum value of training error. Here we see the gradient of the training error as a function of the weights. The descent rule is to change the weights by taking a small step (determined by the learning rate η) in the direction opposite to this gradient. Online version: at each step we consider the error for only one data item.

Gradient Descent – Single Unit Substituting in the equation of the previous slide, we get (for an arbitrary ith element) ∂E/∂wi = (y − y^m) y (1 − y) xi, and hence the weight update Δwi = −η (y − y^m) y (1 − y) xi. This is the delta rule.
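A minimal sketch of the resulting update for one sigmoid unit, trained online on a single (x, target) pair at a time (the function and variable names are mine, not from the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_rule_update(w, x, target, eta=0.1):
    """One online gradient-descent step for a single sigmoid unit.
    x[0] is assumed to be the constant -1 that carries the threshold."""
    z = sum(wi * xi for wi, xi in zip(w, x))       # total input
    y = sigmoid(z)                                 # unit output
    delta = (y - target) * y * (1.0 - y)           # dE/dz for E = 1/2 (y - target)^2
    return [wi - eta * delta * xi for wi, xi in zip(w, x)]
```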

Derivative of the sigmoid The logistic sigmoid has a particularly convenient derivative: ds/dz = s(z)(1 − s(z)). That is, the derivative can be computed from the unit's output alone, which is what keeps the update rules simple.
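As a quick sanity check (a sketch of my own, not part of the slides), the analytic derivative s(z)(1 − s(z)) can be compared against a finite-difference approximation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # derivative expressed via the output itself

eps = 1e-6
for z in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    assert abs(numeric - sigmoid_prime(z)) < 1e-6
```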

Generalized Delta Rule Now, let’s compute δ4. z4 influences E only indirectly, through z5 and z6 (the units that unit 4 feeds into).

Generalized Delta Rule In general, for a hidden unit j we have δj = yj (1 − yj) Σk wjk δk, where the sum runs over the units k that receive input from unit j.

Generalized Delta Rule For an output unit n we have δn = yn (1 − yn)(yn − yn^m), where yn^m is the desired output for the current training sample.

Backpropagation Algorithm
1. Initialize the weights to small random values.
2. Choose a random training sample, say (x^m, y^m).
3. Compute the total input zj and output yj for each unit (forward propagation).
4. Compute δn for the output layer: δn = yn(1 − yn)(yn − yn^m).
5. Compute δj for all preceding layers by the backpropagation rule.
6. Compute the weight changes by the descent rule (repeat for all weights).
Note that each expression involves only data local to a particular unit; we don't have to look around, summing things over the whole network. It is for this reason (simplicity, locality and, therefore, efficiency) that backpropagation has become the dominant paradigm for training neural nets.
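The following is a compact sketch of one backpropagation step for a fully connected network of sigmoid units (the data layout and names are my own choices for illustration, not notation from the slides):

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, x):
    """layers[l][j] is the weight list of unit j in layer l; entry 0 is the
    threshold weight, paired with the constant input -1."""
    activations = [list(x)]
    for layer in layers:
        prev = [-1.0] + activations[-1]
        activations.append([sigmoid(sum(w * a for w, a in zip(unit, prev)))
                            for unit in layer])
    return activations

def backprop_step(layers, x, target, eta=0.5):
    acts = forward(layers, x)
    # output layer: delta_n = y_n (1 - y_n)(y_n - t_n)
    deltas = [[y * (1 - y) * (y - t) for y, t in zip(acts[-1], target)]]
    # hidden layers, working backwards: delta_j = y_j (1 - y_j) * sum_k w_jk delta_k
    for l in range(len(layers) - 1, 0, -1):
        hidden = []
        for j, y in enumerate(acts[l]):
            downstream = sum(layers[l][k][j + 1] * deltas[0][k]
                             for k in range(len(layers[l])))
            hidden.append(y * (1 - y) * downstream)
        deltas.insert(0, hidden)
    # descent rule for every weight: w_ij <- w_ij - eta * delta_j * y_i
    for l, layer in enumerate(layers):
        prev = [-1.0] + acts[l]
        for j, unit in enumerate(layer):
            for i in range(len(unit)):
                unit[i] -= eta * deltas[l][j] * prev[i]

# One training step on a tiny 2-2-1 network (two inputs, two hidden units, one output).
random.seed(0)
layers = [[[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)],
          [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(1)]]
backprop_step(layers, [1.0, 0.0], [1.0])
```

Note how every quantity needed to update a weight (the delta of the unit it feeds and the activation on its input side) is local to that connection.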

Training Neural Nets Now that we have looked at the basic mathematical techniques for minimizing the training error of a neural net, we should step back and look at the whole approach to training a neural net, keeping in mind the potential problem of overfitting. Here we look at a methodology that attempts to minimize that danger.

Training Neural Nets Given: a data set with desired outputs and a neural net with m weights. Find a setting for the weights that will give good predictive performance on new data.
1. Split the data set into three subsets: a training set (used for adjusting the weights), a validation set (used to decide when to stop training), and a test set (used to evaluate final performance).
2. Pick random, small weights as initial values.
3. Perform iterative minimization of the error over the training set (backprop).
4. Stop when the error on the validation set reaches a minimum (to avoid overfitting).
5. Repeat training (from step 2) several times (to avoid local minima).
6. Use the best weights found to compute the error on the test set.
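As a sketch of the overall driver loop (the three callables stand in for pieces defined elsewhere and are hypothetical, not an API from the slides), the validation set decides when to stop and the test set is only touched at the very end:

```python
def train_with_early_stopping(init_weights, train_epoch, val_error,
                              restarts=5, max_epochs=1000):
    """init_weights() returns small random weights, train_epoch(w) performs one
    backprop pass over the training set, val_error(w) measures validation error."""
    best_weights, best_val = None, float("inf")
    for _ in range(restarts):                 # several restarts to avoid bad local minima
        weights = init_weights()
        prev_val = float("inf")
        for _ in range(max_epochs):
            weights = train_epoch(weights)
            val = val_error(weights)
            if val > prev_val:                # validation error starts rising: stop
                break
            prev_val = val
        if prev_val < best_val:
            best_val, best_weights = prev_val, weights
    return best_weights
```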

Autonomous Land Vehicle In a Neural Network (ALVINN) ALVINN is an automatic steering system for a car, based on input from a camera mounted on the vehicle. It was successfully demonstrated on a cross-country trip.

ALVINN The ALVINN neural network is shown here. It has 960 inputs (a 30x32 array derived from the pixels of an image), four hidden units and 30 output units (each representing a steering command).

Backpropagation Example First do forward propagation: compute the zi's and yi's. [Figure: a 2-2-1 network. Inputs x1, x2 feed hidden units 1 and 2 through weights w11, w21 (into unit 1) and w12, w22 (into unit 2), with threshold weights w01, w02 on constant inputs −1; the hidden outputs y1, y2 feed output unit 3 through weights w13, w23, with threshold weight w03 on a constant input −1, giving total input z3 and output y3.]

Input Representation One remaining issue has to do with the representation of discrete data (also known as "categorical" data). We could represent such values as either unary (one-hot) or binary numbers. Binary encodings are generally a bad choice; unary is much preferable.
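For example (a small sketch; the attribute and its values are invented for illustration), a four-valued attribute encoded in unary (one-hot) gets one input per value, whereas a binary encoding forces unrelated values to share input bits:

```python
COLORS = ["red", "green", "blue", "yellow"]

def one_hot(value):
    """Unary (one-hot) encoding: one input per category value."""
    return [1.0 if value == c else 0.0 for c in COLORS]

def binary(value):
    """Binary encoding: only two inputs for four values, but unrelated values
    end up at arbitrary distances, e.g. "red" (00) is one bit from "blue" (10)
    and two bits from "yellow" (11)."""
    i = COLORS.index(value)
    return [float((i >> 1) & 1), float(i & 1)]

print(one_hot("blue"))   # [0.0, 0.0, 1.0, 0.0]
print(binary("blue"))    # [1.0, 0.0]
```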