CS-424 Gregory Dudek Today’s Lecture Neural networks –Training Backpropagation of error (backprop) –Example –Radial basis functions
CS-424 Gregory Dudek Simple neural models Oldest ANN model is the McCulloch-Pitts neuron [1943]. –Inputs are +1 or -1, with real-valued weights. –If the sum of weighted inputs is > 0, then the neuron “fires” and gives +1 as an output. –Showed you can compute logical functions. –Relation to learning proposed (later!) by Donald Hebb [1949]. Perceptron model [Rosenblatt, 1958]. –Single-layer network with the same kind of neuron. Fires when the weighted input is above a threshold: ∑ x_i w_i > t. –Added a learning rule to allow weight selection. Not in text
CS-424 Gregory Dudek Perceptrons: Early motivation Von Neumann (1951), among others, worried about the theory behind how you get a network like the brain to learn things given: –Random (?) connections –Broken connections –Noisy signal. Boolean algebra & symbolic logic were not well suited to these questions. The perceptron (originally photo-perceptron) was used as a model for a cell that responds to illumination patterns. Not in text
CS-424 Gregory Dudek Perceptron nets
CS-424 Gregory Dudek Perceptron learning Perceptron learning: –Have a set of training examples (TS) encoded as input values (i.e. in the form of binary vectors). –Have a set of desired output values associated with these inputs. This is supervised learning. –Problem: how to adjust the weights to make the actual outputs match the training examples. NOTE: we do not allow the topology to change! [You should be thinking of a question here.] Intuition: when a perceptron makes a mistake, its weights are wrong. –Modify them to make the output bigger or smaller, as desired.
CS-424 Gregory Dudek Learning algorithm Desired output T_i, actual output O_i. Weight update formula (weight from unit j to unit i): W_j,i = W_j,i + k * x_j * (T_i - O_i), where k is the learning rate. If the examples can be learned (encoded), then the perceptron learning rule will find the weights. –How? Gradient descent. Key thing to prove is the absence of local minima.
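A minimal sketch of this update rule in NumPy (not from the slides; the threshold is folded in as an extra constant input, and the AND data set is just an illustration):

```python
import numpy as np

def perceptron_train(X, T, k=0.1, epochs=100):
    """Perceptron learning rule: W_j <- W_j + k * x_j * (T - O)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # constant input plays the role of the threshold t
    w = np.random.default_rng(0).normal(scale=0.1, size=Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, T):
            o = 1.0 if x @ w > 0 else -1.0      # the unit "fires" (+1) or not (-1)
            w += k * x * (target - o)           # zero update when the output is already correct
    return w

# AND on +/-1 inputs is linearly separable, so the rule converges:
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
T = np.array([-1, -1, -1, 1], dtype=float)
w = perceptron_train(X, T)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ w))   # should print [-1. -1. -1.  1.] after convergence
```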
CS-424 Gregory Dudek Perceptrons: what can they learn? Only linearly separable functions [Minsky & Papert, 1969]. In N dimensions, the decision boundary is a hyperplane.
CS-424 Gregory Dudek More general networks Generalize in 3 ways: –Allow continuous output values [0,1] –Allow multiple layers. This is key to learning a larger class of functions. –Allow a more complicated function than thresholded summation [why??] Generalize the learning rule to accommodate this: let’s see how it works.
CS-424 Gregory Dudek The threshold The key variant: –Change threshold into a differentiable function –Sigmoid, known as a “soft non-linearity” (silly). M = ∑ x_i w_i, O = 1 / (1 + e^(-k M))
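A small sketch of the sigmoid unit (assumed NumPy helper, not from the slides):

```python
import numpy as np

def sigmoid(m, k=1.0):
    """Soft threshold O = 1 / (1 + exp(-k*M)); larger k makes it closer to a hard step."""
    return 1.0 / (1.0 + np.exp(-k * m))

def unit_output(x, w, k=1.0):
    """Weighted sum M = sum_i x_i * w_i, pushed through the sigmoid."""
    return sigmoid(x @ w, k)
```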
CS-424 Gregory Dudek Recall: training For a single input-output layer, we could adjust the weights to get linear classification. –The perceptron computed a hyperplane over the space defined by the inputs. This is known as a linear classifier. By stacking layers, we can compute a wider range of functions. Compute the error derivative with respect to the weights. [Figure: network with inputs, a hidden layer, and an output.]
CS-424 Gregory Dudek “Train” the weights to correctly classify a set of examples (TS: the training set). Started with perceptron, which used summing and a step function, and binary inputs and outputs. Embellished by allowing continuous activations and a more complex “threshold” function. –In particular, we considered a sigmoid activation function, which is like a “blurred” threshold.
CS-424 Gregory Dudek The Gaussian Another continuous, differentiable function that is commonly used is the Gaussian function: Gaussian(x) = e^(-(x - c)^2 / σ^2), where σ is the width of the Gaussian and c its centre. The Gaussian is a continuous, differentiable version of the step function. [Figure: a step/boxcar function that is 1 in a band around c and 0 elsewhere.]
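A one-line version of this bump in NumPy (assumed form; the slide's exact normalisation isn't shown):

```python
import numpy as np

def gaussian(x, c=0.0, sigma=1.0):
    """Gaussian bump of width sigma centred at c: a smooth alternative to a hard 0/1 step."""
    return np.exp(-((x - c) ** 2) / sigma ** 2)

print(gaussian(np.linspace(-3, 3, 7)))   # near 1 at the centre, falling smoothly toward 0
```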
CS-424 Gregory Dudek What is learning? For a fixed set of weights w_1,...,w_n, f(x_1,...,x_n) = Sigma(x_1 w_1 + ... + x_n w_n) represents a particular scalar function of n variables. If we allow the weights to vary, then we can represent a family of scalar functions of n variables: F(x_1,...,x_n, w_1,...,w_n) = Sigma(x_1 w_1 + ... + x_n w_n). If the weights are real-valued, then the family of functions is determined by an n-dimensional parameter space, R^n. Learning involves searching in this parameter space.
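To make the "family of functions" idea concrete (an illustration with made-up numbers, not from the slides):

```python
import numpy as np

def F(x, w):
    """F(x_1..x_n, w_1..w_n) = Sigma(x_1*w_1 + ... + x_n*w_n), with Sigma the sigmoid."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x = np.array([0.5, -1.0, 2.0])
print(F(x, np.array([1.0, 0.0, 0.5])))    # one member of the family
print(F(x, np.array([-2.0, 1.0, 0.0])))   # another member: same input, different weights
# Learning = searching this n-dimensional weight space R^n for the member that fits the data.
```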
CS-424 Gregory Dudek Basis functions Here is another family of functions. In this case, the family is defined by a linear combination of basis functions {g_1, g_2, ..., g_n}. The input x could be scalar or vector valued. F(x, w_1,...,w_n) = w_1 g_1(x) + ... + w_n g_n(x)
CS-424 Gregory Dudek Combining basis functions We can build a network as follows:
g_1(x) --- w_1 ---\
g_2(x) --- w_2 ----\
  ...               ∑ --- f(x)
g_n(x) --- w_n ----/
E.g. from the basis {1, x, x^2} we can build quadratics: F(x, w_1, w_2, w_3) = w_1 + w_2 x + w_3 x^2
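A sketch of fitting such a combination by least squares (illustration only; the target quadratic and the noise are made up):

```python
import numpy as np

def combine(x, weights, basis):
    """F(x, w_1..w_n) = w_1*g_1(x) + ... + w_n*g_n(x)."""
    return sum(w * g(x) for w, g in zip(weights, basis))

basis = [lambda x: np.ones_like(x), lambda x: x, lambda x: x ** 2]   # spans all quadratics

rng = np.random.default_rng(0)
xs = np.linspace(-3, 3, 50)
ys = 2 - xs + 0.5 * xs ** 2 + rng.normal(scale=0.1, size=xs.shape)   # noisy samples of a quadratic
G = np.column_stack([g(xs) for g in basis])                          # one column per basis function
w_fit, *_ = np.linalg.lstsq(G, ys, rcond=None)
print(w_fit)                                                         # close to [2, -1, 0.5]
```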
CS-424 Gregory Dudek Receptive Field This construction can be generalized to an arbitrary vector space (e.g., R^n). It is often used to model what are called “localized receptive fields” in biological learning theory. –Such receptive fields are specially designed to represent the output of a learned function on a small portion of the input space. –How would you approximate an arbitrary continuous function using a sum of Gaussians, or a sum of piecewise constant functions of the sort described above?
CS-424 Gregory Dudek Backprop Consider sigmoid activation functions. We can examine the output of the net as a function of the weights. –How does the output change with changes in the weights? –Linear analysis: consider partial derivative of output with respect to weight(s). We saw this last lecture. –If we have multiple layers, consider effect on each layer as a function of the preceding layer(s). We propagate the error backwards through the net (using the chain rule for differentiation). Derivation on overheads [reference: DAA p. 212]
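A compact sketch of one backprop step for a single-hidden-layer sigmoid net with squared error (my own minimal illustration; the full derivation is on the overheads):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, lr=0.5):
    """One gradient-descent step, error E = 0.5 * (t - y)^2."""
    # Forward pass
    h = sigmoid(W1 @ x)                           # hidden activations
    y = sigmoid(W2 @ h)                           # output
    # Backward pass: propagate the error with the chain rule
    delta_out = (y - t) * y * (1 - y)             # dE/d(net input of output unit)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # dE/d(net input of hidden units)
    W2 -= lr * np.outer(delta_out, h)             # dE/dW2 = delta_out * h^T
    W1 -= lr * np.outer(delta_hid, x)             # dE/dW1 = delta_hid * x^T
    return W1, W2, 0.5 * np.sum((t - y) ** 2)

# Toy usage: 2 inputs -> 3 hidden -> 1 output
rng = np.random.default_rng(0)
W1, W2 = rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(1, 3))
x, t = np.array([1.0, 0.0]), np.array([1.0])
for _ in range(200):
    W1, W2, err = backprop_step(x, t, W1, W2)
print(err)   # error shrinks toward 0
```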
CS-424 Gregory Dudek Backprop observations We can do gradient descent in weight space. What is the dimensionality of this space? –Very high: each weight is a free variable. There are as many dimensions as weights. A “typical” net might have hundreds of weights. Can we find the minimum? –It turns out that for multi-layer networks, the error space (often called the “energy” of the network) is NOT CONVEX. [so?] –Commonest approach: multiple restart gradient descent, i.e. try learning from various random initial weight distributions.
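One way multiple restarts are commonly organized (train_fn is a hypothetical interface for illustration, not something defined in the lecture):

```python
import numpy as np

def train_with_restarts(train_fn, n_restarts=10, seed=0):
    """Run gradient descent from several random initial weight settings and keep the best.

    train_fn(rng) is assumed to initialise weights from rng, train the network,
    and return (weights, final_error).
    """
    best_w, best_err = None, np.inf
    for i in range(n_restarts):
        w, err = train_fn(np.random.default_rng(seed + i))
        if err < best_err:   # keep whichever basin of the non-convex error surface went lowest
            best_w, best_err = w, err
    return best_w, best_err
```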
CS-424 Gregory Dudek Success? Stopping? We have a training algorithm (backprop). We might like to ask: –1. Have we done enough training (yet)? –2. How good is our network at solving the problem? –3. Should we try again to learn the problem (from the beginning)? The first 2 problems have standard answers: –Can’t just look at energy. Why not? Because we want to GENERALIZE across examples. “I understand multiplication: I know 3*6=18, 5*4=20.” –What’s 7*3? Hmmmm. –Must have additional examples to validate the training. Separate input data into 2 classes: training and testing sets. Can also use cross-validation.
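A sketch of the training/testing split idea (assumed NumPy helper, not from the slides):

```python
import numpy as np

def train_test_split(X, T, test_fraction=0.25, seed=0):
    """Hold out part of the data so generalization is measured on examples never used for training."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return X[train], T[train], X[test], T[test]

# Stop training when error on the held-out set stops improving, not when training error alone is low;
# k-fold cross-validation repeats the same idea over k different splits.
```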
CS-424 Gregory Dudek What can we learn? Any mapping from input to output units can be learned if we have enough hidden units with the right weights! In practice, having many weights makes learning difficult. The right representation is critical! Generalization depends on bias. –The hidden units form an internal representation of the problem; we want them to learn something general. Bad example: one hidden unit learns exactly one training example. –We want to avoid learning by table lookup.
CS-424 Gregory Dudek Representation Much learning can be equated with selecting a good problem representation. –If we have the right hidden layer, things become easy. Consider the problem of face recognition from photographs. Or fingerprints. –Digitized photos: a big array (256x256 or 512x512) of intensities. –How do we match one array to another? (Either manually or by computer.) –Key: measure important properties and use those as criteria for estimating similarity.
CS-424 Gregory Dudek Faces (an example) What is an important property to measure for faces? –Eye distance? –Average intensity BAD! –Nose width? –Forehead height? These measurements form the basis functions for describing faces. –BUT NOT NECESSARILY photographs!!! We don’t need to reconstruct the photo. Some information is not needed.
CS-424 Gregory Dudek Radial basis functions Use “blobs” summed together to create an arbitrary function. –A good kind of blob is a Gaussian: circular, variable width, can be easily generalized to 2D, 3D,....
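A sketch of summing Gaussian "blobs" to approximate an arbitrary 1-D function (illustration with made-up centres, widths, and a sin target):

```python
import numpy as np

def rbf(x, centres, widths, weights):
    """F(x) = sum_j w_j * exp(-(x - c_j)^2 / s_j^2): a sum of Gaussian blobs."""
    return sum(w * np.exp(-((x - c) ** 2) / s ** 2)
               for c, s, w in zip(centres, widths, weights))

xs = np.linspace(0, 2 * np.pi, 100)
centres = np.linspace(0, 2 * np.pi, 10)     # evenly spaced blob centres
widths = np.full(10, 0.7)                   # fixed blob width
G = np.exp(-((xs[:, None] - centres[None, :]) ** 2) / widths[None, :] ** 2)
weights, *_ = np.linalg.lstsq(G, np.sin(xs), rcond=None)           # fit blob heights by least squares
print(np.max(np.abs(rbf(xs, centres, widths, weights) - np.sin(xs))))   # small approximation error
```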
CS-424 Gregory Dudek Topology changes Can we get by with fewer connections? When every neuron in one layer is connected to every neuron in the next layer, we call the network fully connected. What if we allow signals to flow backwards to a preceding layer? Recurrent networks.
CS-424 Gregory Dudek Inductive bias? Where’s the inductive bias? –In the topology and architecture of the network. –In the learning rules. –In the input and output representation. –In the initial weights. Not in text