Today’s Lecture: Neural Networks, Training (CS-424, Gregory Dudek)


Today’s Lecture: Neural networks
- Training
- Backpropagation of error (backprop)
- Example
- Radial basis functions

Recall: training
For a single input-output layer, we could adjust the weights to get linear classification. The perceptron computed a hyperplane over the space defined by the inputs; this is known as a linear classifier. By stacking layers, we can compute a wider range of functions. Training then requires computing the error derivative with respect to the weights.
(Figure: a layered network with inputs, a hidden layer, and an output.)

“Train” the weights to correctly classify a set of examples (TS: the training set). We started with the perceptron, which used summing and a step function, with binary inputs and outputs. This was embellished by allowing continuous activations and a more complex “threshold” function. In particular, we considered a sigmoid activation function, which acts like a “blurred” threshold.
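
As a concrete illustration (not from the original slides), here is a minimal sketch of a single unit that forms a weighted sum and passes it through either a hard step or a sigmoid; the weights, bias, and inputs are made up for the example.

    import numpy as np

    def step(z):
        # Hard threshold: output 1 if the weighted sum is non-negative, else 0.
        return (z >= 0).astype(float)

    def sigmoid(z):
        # "Blurred" threshold: smooth, differentiable squashing function.
        return 1.0 / (1.0 + np.exp(-z))

    # One unit with made-up weights w and bias b, applied to four binary inputs.
    w = np.array([0.8, -0.5])
    b = 0.1
    x = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

    z = x @ w + b                      # weighted sums, one per example
    print("step output:   ", step(z))
    print("sigmoid output:", np.round(sigmoid(z), 3))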

The Gaussian
Another continuous, differentiable function that is commonly used is the Gaussian function:

    Gaussian(x) = exp( -(x - c)^2 / (2 sigma^2) )

where sigma is the width of the Gaussian and c is its center. The Gaussian is a continuous, differentiable version of the step function.

(Figure: a piecewise-constant box of height 1 and width w, centered at c.)
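
A small numeric sketch of this idea, assuming the exp(-(x - c)^2 / (2 sigma^2)) form above; the center, width, and sample points are invented for illustration, and the box function stands in for the hard bump that the Gaussian smooths out.

    import numpy as np

    def gaussian(x, c=0.0, sigma=1.0):
        # Smooth bump of width sigma centered at c.
        return np.exp(-(x - c) ** 2 / (2.0 * sigma ** 2))

    def box(x, c=0.0, w=2.0):
        # Hard, piecewise-constant bump: 1 inside [c - w/2, c + w/2], 0 outside.
        return ((x >= c - w / 2) & (x <= c + w / 2)).astype(float)

    xs = np.linspace(-3, 3, 7)
    print("x:       ", xs)
    print("box:     ", box(xs))
    print("gaussian:", np.round(gaussian(xs), 3))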

What is learning?
For a fixed set of weights w1,...,wn,

    f(x1,...,xn) = Sigma(x1 w1 + ... + xn wn)

represents a particular scalar function of n variables. If we allow the weights to vary, then we can represent a family of scalar functions of n variables:

    F(x1,...,xn,w1,...,wn) = Sigma(x1 w1 + ... + xn wn)

If the weights are real-valued, then the family of functions is determined by an n-dimensional parameter space, R^n. Learning involves searching in this parameter space.
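
To make "searching the parameter space" concrete, here is a minimal sketch (not from the slides) that treats each weight vector w as one member of the family F and keeps whichever member best fits some made-up data; naive random search is used only for illustration, since gradient descent (backprop) is the practical search procedure discussed below.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def F(x, w):
        # One member of the family: fixing w gives a particular function of x.
        return sigmoid(x @ w)

    rng = np.random.default_rng(0)

    # Toy training set: inputs x and target outputs y (made up for illustration).
    x = rng.normal(size=(50, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = sigmoid(x @ true_w)

    # "Learning" as search in R^n: keep the weight vector with the lowest error.
    best_w, best_err = None, np.inf
    for _ in range(2000):
        w = rng.normal(size=3)               # a random point in parameter space
        err = np.mean((F(x, w) - y) ** 2)    # how well this member fits the data
        if err < best_err:
            best_w, best_err = w, err

    print("best weights found:", np.round(best_w, 2), "error:", round(best_err, 4))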

Basis functions
Here is another family of functions. In this case, the family is defined by a linear combination of basis functions {g1, g2, ..., gn}. The input x could be scalar or vector valued:

    F(x,w1,...,wn) = w1 g1(x) + ... + wn gn(x)

Combining basis functions
We can build a network as follows:

    g1(x) --- w1 ---\
    g2(x) --- w2 ----\
      ...             ∑ --- f(x)
    gn(x) --- wn ----/

E.g., from the basis {1, x, x^2} we can build quadratics:

    F(x, w1, w2, w3) = w1 + w2 x + w3 x^2
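
A minimal sketch of this family in action, with made-up noisy data generated from a quadratic: the design matrix holds the basis functions {1, x, x^2} evaluated at the inputs, and the weights of the linear combination are fit by ordinary least squares (used here for brevity; the same weights could also be learned by gradient descent).

    import numpy as np

    def design_matrix(x):
        # Each column is one basis function {1, x, x^2} evaluated at the inputs.
        return np.column_stack([np.ones_like(x), x, x ** 2])

    rng = np.random.default_rng(1)
    x = np.linspace(-2, 2, 30)
    y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(scale=0.1, size=x.shape)  # noisy quadratic

    # Fit the weights of F(x) = w1 + w2*x + w3*x^2 by least squares.
    G = design_matrix(x)
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    print("learned weights (w1, w2, w3):", np.round(w, 2))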

Receptive Field
The Gaussian can be generalized to an arbitrary vector space (e.g., R^n). It is often used to model what are called “localized receptive fields” in biological learning theory. Such receptive fields are specially designed to represent the output of a learned function on a small portion of the input space. How would you approximate an arbitrary continuous function using a sum of Gaussians, or a sum of piecewise constant functions of the sort described above?

Backprop
Consider sigmoid activation functions. We can examine the output of the net as a function of the weights: how does the output change with changes in the weights? Linear analysis: consider the partial derivative of the output with respect to the weight(s); we saw this last lecture. If we have multiple layers, consider the effect on each layer as a function of the preceding layer(s). We propagate the error backwards through the net (using the chain rule for differentiation). Derivation on overheads [reference: DAA p. 212].
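
To make the chain-rule bookkeeping concrete, here is a minimal sketch of backprop for one hidden layer of sigmoid units, trained by plain gradient descent on the XOR problem. The architecture, learning rate, and iteration count are made-up choices, and a given random seed may still land in a poor local minimum, as the next slide discusses.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(2)

    # Toy training set: XOR, which a single linear unit cannot represent.
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])

    # One hidden layer of 3 sigmoid units and one sigmoid output unit.
    W1 = rng.normal(size=(2, 3)); b1 = np.zeros(3)
    W2 = rng.normal(size=(3, 1)); b2 = np.zeros(1)
    lr = 1.0

    for _ in range(10000):
        # Forward pass.
        h = sigmoid(X @ W1 + b1)                 # hidden-layer activations
        out = sigmoid(h @ W2 + b2)               # network output

        # Backward pass: chain rule, pushing the error from the output
        # back through the hidden layer.
        d_out = (out - y) * out * (1 - out)      # dE/d(output pre-activation)
        d_h = (d_out @ W2.T) * h * (1 - h)       # dE/d(hidden pre-activations)

        # Gradient-descent update of every weight.
        W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
        W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

    print("outputs after training:",
          np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2).ravel(), 2))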

Backprop observations
We can do gradient descent in weight space. What is the dimensionality of this space? Very high: each weight is a free variable, so there are as many dimensions as weights, and a “typical” net might have hundreds of weights. Can we find the minimum? It turns out that for multi-layer networks, the error surface (often called the “energy” of the network) is NOT CONVEX, so gradient descent can get stuck in local minima. The most common approach: multiple-restart gradient descent, i.e. try learning from various random initial weight distributions.
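
A minimal sketch of multiple-restart gradient descent on a made-up, non-convex one-dimensional "energy" function; the function, learning rate, and number of restarts are invented for illustration, but the pattern is the same when the single scalar w is replaced by a network's weight vector.

    import numpy as np

    def energy(w):
        # A made-up non-convex error surface with several local minima.
        return np.sin(3 * w) + 0.1 * (w - 1.0) ** 2

    def grad(w):
        # Derivative of the energy above.
        return 3 * np.cos(3 * w) + 0.2 * (w - 1.0)

    def gradient_descent(w0, lr=0.01, steps=500):
        w = w0
        for _ in range(steps):
            w -= lr * grad(w)
        return w

    rng = np.random.default_rng(3)

    # Multiple restarts: descend from several random starting points
    # and keep the best minimum found.
    best_w, best_e = None, np.inf
    for _ in range(10):
        w = gradient_descent(rng.uniform(-4, 4))
        if energy(w) < best_e:
            best_w, best_e = w, energy(w)

    print("best minimum found: w =", round(best_w, 3), "energy =", round(best_e, 3))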

Success? Stopping?
We have a training algorithm (backprop). We might like to ask:
1. Have we done enough training (yet)?
2. How good is our network at solving the problem?
3. Should we try again to learn the problem (from the beginning)?
The first two questions have standard answers. We can't just look at the energy. Why not? Because we want to GENERALIZE across examples. “I understand multiplication: I know 3*6=18 and 5*4=20.” What's 7*3? Hmmmm. We must have additional examples to validate the training: separate the input data into two sets, a training set and a testing set. We can also use cross-validation.
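
A minimal sketch of the train/test split on made-up data; a simple polynomial fit stands in for the trained network so the example stays short, but a network's outputs would be validated the same way, and cross-validation just repeats this with different splits.

    import numpy as np

    rng = np.random.default_rng(4)

    # Made-up regression data: noisy samples of an underlying curve.
    x = rng.uniform(-1, 1, size=100)
    y = np.sin(2 * x) + rng.normal(scale=0.1, size=x.shape)

    # Hold out part of the data: train on one subset, test generalization on the rest.
    idx = rng.permutation(len(x))
    train, test = idx[:70], idx[70:]

    # Stand-in learner: a cubic polynomial fit to the training portion only.
    coeffs = np.polyfit(x[train], y[train], deg=3)

    def mse(a, b):
        return np.mean((a - b) ** 2)

    print("training error:", round(mse(np.polyval(coeffs, x[train]), y[train]), 4))
    print("test error:    ", round(mse(np.polyval(coeffs, x[test]), y[test]), 4))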

What can we learn?
Any mapping from input to output units can be learned if we have enough hidden units with the right weights! In practice, many weights means difficulty. The right representation is critical! Generalization depends on bias. The hidden units form an internal representation of the problem; we want them to learn something general. Bad example: one hidden unit learns exactly one training example. We want to avoid learning by table lookup.

Representation
Much of learning can be equated with selecting a good problem representation. If we have the right hidden layer, things become easy. Consider the problem of face recognition from photographs, or fingerprints. Digitized photos are a big array (256x256 or 512x512) of intensities. How do we match one array to another (either manually or by computer)? Key: measure important properties and use those as the criteria for estimating similarity.

Faces (an example)
What is an important property to measure for faces? Eye distance? Average intensity (BAD!)? Nose width? Forehead height? These measurements form the basis functions for describing faces, BUT NOT NECESSARILY photographs!!! We don't need to reconstruct the photo; some information is not needed.

Radial basis functions
Use “blobs” summed together to create an arbitrary function. A good kind of blob is a Gaussian: circular, with variable width, and easily generalized to 2D, 3D, ....
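
A minimal sketch of a radial basis function approximation, answering the earlier question about summing Gaussian blobs: the blob centers, width, and target function are made up, and the blob weights are fit by least squares here (they could equally be trained by gradient descent).

    import numpy as np

    def gaussian(x, c, sigma):
        return np.exp(-(x - c) ** 2 / (2.0 * sigma ** 2))

    # Target function to approximate (made up for illustration).
    def target(x):
        return np.abs(x) + np.sin(3 * x)

    x = np.linspace(-2, 2, 200)
    y = target(x)

    # Place Gaussian "blobs" at fixed centers across the input range and fit
    # the weights of their sum.
    centers = np.linspace(-2, 2, 10)
    sigma = 0.4
    G = np.column_stack([gaussian(x, c, sigma) for c in centers])
    w, *_ = np.linalg.lstsq(G, y, rcond=None)

    approx = G @ w
    print("max approximation error:", round(np.max(np.abs(approx - y)), 3))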

Topology changes
Can we get by with fewer connections? When every neuron in one layer is connected to every neuron in the next layer, we call the network fully connected. What if we allow signals to flow backwards to a preceding layer? Recurrent networks.

Inductive bias?
Where's the inductive bias? (Not in the text.)
- In the topology and architecture of the network.
- In the learning rules.
- In the input and output representation.
- In the initial weights.