The Gradient Descent Algorithm


The Gradient Descent Algorithm Initialize all weights to small random values. REPEAT until done:
1. For each weight w_ij set Δw_ij := 0.
2. For each data point (x, t)_p: set the input units to x, compute the value of the output units, and for each weight w_ij accumulate the negative gradient, Δw_ij := Δw_ij − ∂E_p/∂w_ij.
3. For each weight w_ij set w_ij := w_ij + µ Δw_ij.
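As a concrete illustration, here is a minimal Python sketch of this loop for a single-layer linear network with sum-squared error; the toy data, learning rate, and fixed number of passes are assumptions made for the example.

```python
import numpy as np

# A minimal sketch of the batch gradient descent loop above, for a single-layer
# linear network with sum-squared error. The toy data, learning rate, and the
# stopping rule (a fixed number of passes) are assumptions for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))          # 50 data points x with 3 input units
true_W = np.array([[1.5, -2.0, 0.5]])         # "teacher" weights used to make targets (assumed)
T = X @ true_W.T                              # targets t

mu = 0.1                                      # learning rate
W = rng.uniform(-0.1, 0.1, size=(1, 3))       # small random initial weights

for epoch in range(200):                      # REPEAT until done
    dW = np.zeros_like(W)                     # 1. for each weight w_ij set Δw_ij := 0
    for x, t in zip(X, T):                    # 2. for each data point (x, t)_p
        y = W @ x                             #    compute value of output units
        dW += np.outer(t - y, x)              #    accumulate the negative gradient
    W += mu * dW / len(X)                     # 3. w_ij := w_ij + µ Δw_ij (averaged here for stability)

print(W)                                      # should approach true_W
```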

The Learning Rate The learning rate µ determines by how much we change the weights w at each step. If µ is too small, learning is very slow; if it is too large, the weights oscillate and may diverge.
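A tiny numerical illustration of this, using an assumed one-dimensional quadratic error: a small µ shrinks the error at every step, while a µ that is too large makes the weight overshoot and diverge.

```python
import numpy as np

# Effect of the learning rate on the assumed 1-D error E(w) = w^2:
# a small step converges, an overly large step oscillates and diverges.
for mu in (0.1, 1.1):
    w = 1.0
    for _ in range(10):
        w -= mu * 2 * w           # gradient of w^2 is 2w
    print(f"mu = {mu}: w after 10 steps = {w:.4f}")
```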

Batch vs. Online Learning The gradient contributions for all data points in the training set are accumulated before updating the weights. This method is often referred to as batch learning. An alternative approach is online learning, where the weights are updated immediately after seeing each data point. Since the gradient for a single data point can be considered a noisy approximation to the overall gradient G (Fig. 5), this is also called stochastic (noisy) gradient descent.
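The difference between the two schemes amounts to where the weight update happens in the loop, as in this sketch (same assumed linear network and toy data as above):

```python
import numpy as np

# A sketch contrasting batch and online (stochastic) updates on the same
# single-layer linear network. Data, learning rate, and epoch count are assumed.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 3))
T = X @ np.array([[1.5, -2.0, 0.5]]).T
mu = 0.05

def train(online: bool, epochs: int = 100) -> np.ndarray:
    W = np.zeros((1, 3))
    for _ in range(epochs):
        dW = np.zeros_like(W)
        for x, t in zip(X, T):
            grad = np.outer(t - W @ x, x)          # noisy per-point gradient
            if online:
                W += mu * grad                     # update immediately (stochastic)
            else:
                dW += grad                         # accumulate over the whole set
        if not online:
            W += mu * dW / len(X)                  # one batch update per epoch
    return W

print("batch :", train(online=False))
print("online:", train(online=True))
```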

Multi-layer networks

Too large a hidden layer - or too many hidden layers - can degrade the network's performance, so one shouldn't use more hidden units than necessary. Start training with a very small network; if gradient descent fails to find a satisfactory solution, grow the network by adding a hidden unit and repeat. Any function can be expressed as a linear combination of tanh functions: tanh is a universal basis function. Two classes of activation functions commonly used in neural networks are the sigmoidal (S-shaped) basis functions (to which tanh belongs) and the radial basis functions.
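One way to sketch this "start small and grow" procedure is with scikit-learn's MLPRegressor standing in for the tanh network; the dataset, the error threshold used to decide "satisfactory", and the maximum size tried are all assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Start with a very small tanh network and grow it one hidden unit at a time
# until the fit is satisfactory. Dataset and threshold are assumed.
rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)     # an assumed nonlinear target

for n_hidden in range(1, 11):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), activation='tanh',
                       solver='lbfgs', max_iter=5000, random_state=0)
    net.fit(X, y)
    mse = mean_squared_error(y, net.predict(X))
    print(f"{n_hidden} hidden units: MSE = {mse:.4f}")
    if mse < 0.02:                                      # "satisfactory" threshold (assumed)
        break
```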

Error Backpropagation We have already seen how to train linear networks by gradient descent. In trying to do the same for multi-layer networks we encounter a difficulty: we don't have any target values for the hidden units.

The Algorithm How do we train a multi-layer feedforward network by gradient descent to approximate an unknown function, based on some training data consisting of pairs (x, t)?

Definitions: the error signal for unit j: δ_j ≡ −∂E/∂net_j. The (negative) gradient for weight w_ij: Δw_ij ≡ −∂E/∂w_ij. (Here net_j = Σ_k w_jk y_k is the net input to unit j, and y_j = f_j(net_j) is its output.)

Using the chain rule, we can split the gradient into two factors: Δw_ij = −∂E/∂w_ij = δ_i y_j. The first factor is the error of unit i. The second is ∂net_i/∂w_ij = y_j, the output of unit j. To compute this gradient, we thus need to know the activity and the error for all relevant nodes in the network.

Calculating output error. Assuming that we are using the sum-squared loss E = ½ Σ_o (t_o − y_o)², the error for output unit o is simply δ_o = t_o − y_o. Error backpropagation. For hidden units, we must propagate the error back from the output nodes (hence the name of the algorithm). Again using the chain rule, we can expand the error of a hidden unit in terms of its posterior nodes: δ_j = Σ_{i∈post(j)} δ_i w_ij f_j'(net_j).

Of the three factors inside the sum, the first is just the error of node i. The second is ∂net_i/∂y_j = w_ij, while the third is the derivative of node j's activation function: ∂y_j/∂net_j = f_j'(net_j). For hidden units h that use the tanh activation function, we can make use of the special identity tanh(u)' = 1 − tanh(u)², giving us f_h'(net_h) = 1 − y_h². Putting all the pieces together we get δ_h = Σ_{i∈post(h)} δ_i w_ih (1 − y_h²).
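The tanh identity is easy to check numerically (the test point below is arbitrary):

```python
import numpy as np

# Verify tanh'(u) = 1 - tanh(u)^2 against a finite-difference estimate.
u, eps = 0.7, 1e-6                       # arbitrary test point and step size (assumed)
analytic = 1.0 - np.tanh(u) ** 2
numeric = (np.tanh(u + eps) - np.tanh(u - eps)) / (2 * eps)
print(analytic, numeric)                 # the two values agree to ~1e-10
```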

The backprop algorithm then looks as follows:
1. Initialize the input layer: y_0 = x.
2. Propagate activity forward: for l = 1, 2, ..., L, y_l = f_l(W_l y_{l-1} + b_l), where b_l is the vector of bias weights.
3. Calculate the error in the output layer (for the sum-squared loss): δ_L = t − y_L.
4. Backpropagate the error: for l = L−1, L−2, ..., 1, δ_l = (W_{l+1})^T δ_{l+1} · f_l'(net_l).
5. Update the weights and biases: ΔW_l = µ δ_l (y_{l-1})^T and Δb_l = µ δ_l.
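Putting the five steps into code, here is a compact numpy sketch for a fully connected network with tanh hidden layers, a linear output unit, and the sum-squared loss; the layer sizes, training data, learning rate, and number of epochs are all assumptions.

```python
import numpy as np

# A sketch of the five backprop steps above: tanh hidden layers, linear output,
# sum-squared loss, online updates. Sizes, data, mu, and epochs are assumed.
rng = np.random.default_rng(3)
sizes = [1, 4, 4, 1]                                    # input, two hidden layers, output
L = len(sizes) - 1                                      # number of weight layers
W = [rng.uniform(-1, 1, (sizes[l + 1], sizes[l])) for l in range(L)]
b = [np.zeros((sizes[l + 1], 1)) for l in range(L)]
mu = 0.05

X = rng.uniform(-2, 2, size=(100, 1))
T = np.sin(X)                                           # assumed target function

def forward(x):
    """Steps 1 and 2: initialize y_0 = x, then propagate activity forward."""
    y = [x.reshape(-1, 1)]
    for l in range(L):
        net = W[l] @ y[-1] + b[l]
        y.append(net if l == L - 1 else np.tanh(net))   # linear output layer
    return y

for epoch in range(500):
    for x, t in zip(X, T):
        y = forward(x)
        delta = [None] * L
        delta[L - 1] = t.reshape(-1, 1) - y[-1]         # Step 3: output-layer error
        for l in range(L - 2, -1, -1):                  # Step 4: backpropagate the error
            delta[l] = (W[l + 1].T @ delta[l + 1]) * (1 - y[l + 1] ** 2)
        for l in range(L):                              # Step 5: update weights and biases
            W[l] += mu * delta[l] @ y[l].T
            b[l] += mu * delta[l]

preds = np.array([forward(x)[-1].item() for x in X])
print("training MSE after 500 epochs:", np.mean((preds - T.ravel()) ** 2))
```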

Backpropagation of error: an example We now follow an example of a backprop network as it learns to model highly nonlinear data.

To begin with, we set the weights, a..g, to random initial values in the range [-1, 1]. Each hidden unit is thus computing a random tanh function. The next figure shows the initial two activation functions and the output of the network, which is their sum plus a negative constant. (If you have difficulty making out the line types, the top two curves are the tanh functions; the one at the bottom is the network output.)

We now train the network (learning rate 0.3), updating the weights after each pattern (online learning). After we have been through the entire dataset 10 times (10 training epochs), the functions computed look like this (the output is the middle curve):

After 20 epochs, we have (output is the humpbacked curve):

and after 27 epochs we have a pretty good fit to the data:
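For reference, here is one way to reproduce an experiment of this kind in Python. The seven weights a..g are read as output = a·tanh(b·x + c) + d·tanh(e·x + f) + g, which matches "the sum of the two tanh functions plus a constant" described above; the dataset itself is not given in the slides, so a bump-shaped nonlinear target is assumed.

```python
import numpy as np

# The example network, read as: output = a*tanh(b*x + c) + d*tanh(e*x + f) + g
# (one plausible reading of the seven weights a..g). Trained online with
# learning rate 0.3 for 27 epochs, as in the slides. The dataset is assumed.
rng = np.random.default_rng(4)
X = np.linspace(-1, 1, 50)
T = np.exp(-4 * X ** 2)                           # assumed bump-shaped nonlinear data

w = rng.uniform(-1, 1, 7)                         # a..g, random in [-1, 1]
mu = 0.3

def forward(w, x):
    h1, h2 = np.tanh(w[1] * x + w[2]), np.tanh(w[4] * x + w[5])
    return w[0] * h1 + w[3] * h2 + w[6], h1, h2

for epoch in range(27):                           # 27 training epochs
    for x, t in zip(X, T):
        y, h1, h2 = forward(w, x)
        e = t - y                                 # output error
        # negative gradients of the sum-squared loss w.r.t. each of the seven weights
        w += mu * e * np.array([h1, w[0] * (1 - h1**2) * x, w[0] * (1 - h1**2),
                                h2, w[3] * (1 - h2**2) * x, w[3] * (1 - h2**2),
                                1.0])

print("MSE after 27 epochs:",
      np.mean([(forward(w, x)[0] - t) ** 2 for x, t in zip(X, T)]))
```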