ECE 471/571 - Lecture 17 Back Propagation.


Types of NN
- Recurrent (feedback during operation)
  - Hopfield
  - Kohonen
  - Associative memory
- Feedforward (no feedback during operation or testing; feedback is used only during determination of the weights, i.e., during training)
  - Perceptron
  - Backpropagation

History
- In the 1980s, NN became fashionable again, after the dark age of the 1970s
- One of the reasons was the paper by Rumelhart, Hinton, and Williams, which made the BP algorithm famous
- The algorithm was first discovered in the 1960s, but drew little attention at the time
- BP is most famous for its application to layered feedforward networks, or multilayer perceptrons

Limitations of the Perceptron
- The output has only two values (1 or 0)
- Can only classify samples that are linearly separable (separable by a straight line or plane)
- A single layer can only realize functions such as AND, OR, and NOT
- It cannot realize functions like XOR

XOR (3-layer NN)
[Figure: a 3-layer network with inputs x1, x2 feeding units S1, S2, hidden units S3, S4, and output unit S5]
- S1, S2 are identity functions; S3, S4, S5 are sigmoids
- w13 = 1.0, w14 = -1.0, w24 = 1.0, w23 = -1.0, w35 = 0.11, w45 = -0.1
- The inputs take on only the values –1 and 1

BP – 3-Layer Network
[Figure: input x_i passes through unit S_i; weight w_iq connects it to hidden unit S_q with activation h_q; weight w_qj connects the hidden layer to output unit S_j, whose net input is y_j and whose output is S(y_j)]
- The problem is essentially "how to choose the weights w so as to minimize the error between the expected output and the actual output"
- The basic idea behind BP is gradient descent
- w_st is the weight connecting input s to neuron t
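Below is a minimal sketch of one such gradient-descent step for this 3-layer network, assuming logistic sigmoid hidden and output units and the squared-error criterion; the variable names (W1, W2, eta) are illustrative and not taken from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_step(x, t, W1, W2, eta=0.1):
    """One gradient-descent step of backpropagation for a 3-layer network.

    x  : input vector, shape (d,)
    t  : target output vector, shape (c,)
    W1 : input-to-hidden weights, shape (nH, d)
    W2 : hidden-to-output weights, shape (c, nH)
    """
    # Forward pass
    h = sigmoid(W1 @ x)           # hidden activations h_q
    y = W2 @ h                    # net inputs y_j to the output units
    z = sigmoid(y)                # network outputs S(y_j)

    # Backward pass: deltas from the chain rule (squared-error criterion)
    delta_out = (z - t) * z * (1 - z)             # dJ/dy_j
    delta_hid = (W2.T @ delta_out) * h * (1 - h)  # dJ/d(net_q)

    # Gradient-descent weight updates
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return 0.5 * np.sum((z - t) ** 2)             # current error
```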

Exercise
[Figure: the same 3-layer network and notation as above: x_i, w_iq, h_q, w_qj, y_j, S(y_j)]

*The Derivative – Chain Rule
[Figure: the same 3-layer network and notation: x_i, w_iq, h_q, w_qj, y_j, S(y_j)]
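A sketch of the chain-rule derivation in this notation, assuming the squared-error criterion $J = \frac{1}{2}\sum_j \big(t_j - S(y_j)\big)^2$ and writing $net_q = \sum_i w_{iq} x_i$ with $h_q = S(net_q)$ (the symbols $t_j$ and $net_q$ are introduced here, not on the slide). For a hidden-to-output weight:

$$\frac{\partial J}{\partial w_{qj}} = \frac{\partial J}{\partial S(y_j)}\,\frac{\partial S(y_j)}{\partial y_j}\,\frac{\partial y_j}{\partial w_{qj}} = -\big(t_j - S(y_j)\big)\,S'(y_j)\,h_q$$

For an input-to-hidden weight, the error is propagated back through every output unit:

$$\frac{\partial J}{\partial w_{iq}} = \Big[\sum_j -\big(t_j - S(y_j)\big)\,S'(y_j)\,w_{qj}\Big]\,S'(net_q)\,x_i$$

Gradient descent then updates each weight by $\Delta w = -\eta\,\partial J/\partial w$.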

Threshold Function
- The traditional threshold function, as proposed by McCulloch-Pitts, is a binary step function
- The importance of being differentiable
- A threshold-like but differentiable form for S (a step that took about 25 years): the sigmoid
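The slide does not fix a particular form; a common choice is the logistic sigmoid,

$$S(x) = \frac{1}{1+e^{-x}}, \qquad S'(x) = S(x)\big(1 - S(x)\big) > 0,$$

which behaves like a soft threshold yet is smooth everywhere, so gradient descent on the weights is possible.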

*BP vs. MPP
- When n goes to infinity, the output of the BP network approximates the a-posteriori probability
- For the last step, add and subtract $\int P^2(\omega_k|x)\,p(x)\,dx$:
  $+\int P^2(\omega_k|x)\,p(x)\,dx - \int P^2(\omega_k|x)\,p(x)\,dx + \int P(\omega_k|x)\,p(x)\,dx$
- The last two terms equal $\int p(x)\,P(\omega_k|x)\big(1 - P(\omega_k|x)\big)\,dx$
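For context, this is the standard squared-error limiting argument; writing $g_k(x;w)$ for the k-th network output (notation assumed here), the criterion per sample tends to

$$\lim_{n\to\infty}\frac{1}{n}J(w) \;\propto\; \int \big[g_k(x;w) - P(\omega_k|x)\big]^2 p(x)\,dx \;+\; \int P(\omega_k|x)\big[1 - P(\omega_k|x)\big]\,p(x)\,dx,$$

and since the second term does not depend on $w$, minimizing $J$ drives $g_k(x;w)$ toward the a-posteriori probability $P(\omega_k|x)$, the same quantity used by the MPP classifier.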

Practical Improvements to Backpropagation

Activation (Threshold) Function
- The signum function vs. the sigmoid function
- Desired properties
  - Nonlinear: otherwise a 3-layer network provides no computational power above that of a 2-layer net
  - Saturating: so that the output has a minimum and a maximum
  - Continuous and smooth: so that the function and its derivative exist, and gradient descent can be used for learning
  - Monotonic (nonessential, so that S'(x) > 0): so that the derivative keeps the same sign when reducing the error; otherwise additional local minima or maxima could occur
- Improved sigmoid (see the sketch below)
  - Centered at zero
  - Antisymmetric (odd) – leads to faster learning
  - a = 1.716, b = 2/3, chosen to keep S'(0) ≈ 1; the linear range is –1 < x < 1, and the extrema of S''(x) occur roughly at x ≈ ±2
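A sketch of this antisymmetric sigmoid, assuming the common form f(x) = a*tanh(b*x); the slide gives the constants a and b but not the exact formula, so the functional form is an assumption:

```python
import numpy as np

# Antisymmetric sigmoid with the constants quoted on the slide
# (f(x) = a*tanh(b*x) is assumed; it is odd, centered at zero,
# and gives f'(0) = a*b ≈ 1.14).
A, B = 1.716, 2.0 / 3.0

def f(x):
    return A * np.tanh(B * x)

def f_prime(x):
    return A * B / np.cosh(B * x) ** 2   # derivative of a*tanh(b*x)

print(f(0.0), f_prime(0.0))   # 0.0, ~1.144 -- roughly linear near the origin
```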

Data Standardization
- Problem with the units of the inputs
  - Different units lead to differences in magnitude
  - Even with the same units, features can differ greatly in magnitude
- Standardization – scaling the input (see the sketch below)
  - Shift the input patterns so that the average of each feature over the training set is zero
  - Scale the full data set so that each feature component has the same variance (around 1.0)
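A minimal standardization sketch, assuming the training-set statistics are reused for the test set (standard practice, though not stated on the slide):

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling using training-set statistics.

    X_train, X_test: arrays of shape (n_samples, n_features).
    """
    mu = X_train.mean(axis=0)            # per-feature mean over the training set
    sigma = X_train.std(axis=0) + 1e-12  # per-feature std (guard against zero)
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```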

Target Values (output)
- Instead of 0/1 targets in a one-of-c encoding (c is the number of classes), use +1/-1
  - +1 indicates the target category
  - -1 indicates a non-target category
- Gives faster convergence
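A small sketch of this encoding (the function name and layout are illustrative):

```python
import numpy as np

def pm1_targets(labels, c):
    """Encode integer class labels 0..c-1 as +1/-1 one-of-c target vectors."""
    labels = np.asarray(labels)
    T = -np.ones((len(labels), c))          # -1 for every non-target category
    T[np.arange(len(labels)), labels] = 1.0  # +1 for the target category
    return T
```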

Number of Hidden Layers
- The number of hidden layers governs the expressive power of the network, and also the complexity of the decision boundary
- More hidden layers -> higher expressive power -> better tuned to the particular training set -> poorer performance on the testing set
- Rules of thumb (see the sketch below)
  - Choose the number of weights to be roughly n/10, where n is the total number of samples in the training set
  - Start with a "large" number of hidden units, and then "decay", prune, or eliminate weights
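A back-of-the-envelope sketch of the n/10 rule for a single-hidden-layer network; counting bias weights in the total is an assumption here, and the function name is illustrative:

```python
def hidden_units_rule_of_thumb(n_samples, d, c):
    """Pick the number of hidden units nH so that the total number of
    weights in a d-input, c-output, single-hidden-layer network
    (including bias weights) is roughly n_samples / 10.

    Total weights = (d + 1) * nH + (nH + 1) * c, solved for nH.
    """
    target_weights = n_samples / 10.0
    nH = (target_weights - c) / (d + 1 + c)
    return max(1, int(round(nH)))

# e.g., 10,000 training samples, 20 features, 3 classes
print(hidden_units_rule_of_thumb(10000, 20, 3))   # about 42 hidden units
```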

Number of Hidden Layers

Initializing Weights
- Cannot start with all-zero weights
- We want fast and uniform learning: all weights reach their final equilibrium values at about the same time
- Choose weights randomly from a uniform distribution to help ensure uniform learning
  - Equal spread of negative and positive weights
  - Set the weights such that the net activation at a hidden unit is in the range of –1 to +1
  - Input-to-hidden weights: (-1/sqrt(d), 1/sqrt(d)), where d is the input dimension
  - Hidden-to-output weights: (-1/sqrt(nH), 1/sqrt(nH)), where nH is the number of connected (hidden) units
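A sketch of this initialization (the function name and seed are illustrative):

```python
import numpy as np

def init_weights(d, nH, c, seed=0):
    """Uniform initialization in the ranges quoted on the slide:
    input-to-hidden in (-1/sqrt(d), 1/sqrt(d)),
    hidden-to-output in (-1/sqrt(nH), 1/sqrt(nH))."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-1/np.sqrt(d),  1/np.sqrt(d),  size=(nH, d))  # input-to-hidden
    W2 = rng.uniform(-1/np.sqrt(nH), 1/np.sqrt(nH), size=(c, nH))  # hidden-to-output
    return W1, W2
```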

Learning Rate
- The optimal learning rate
  - Calculate the 2nd derivative of the objective function with respect to each weight
  - Set the optimal learning rate separately for each weight: $\eta_{opt} = \left(\partial^2 J / \partial w^2\right)^{-1}$
  - A learning rate of 0.1 is often adequate
- The maximum learning rate is $\eta_{max} = 2\,\eta_{opt}$
- When $\eta < \eta_{opt}$, the convergence is slow
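A sketch of where these expressions come from: approximate the criterion near the current weight by a second-order Taylor expansion (one weight shown; this notation is introduced here):

$$J(w + \Delta w) \approx J(w) + \frac{\partial J}{\partial w}\,\Delta w + \frac{1}{2}\,\frac{\partial^2 J}{\partial w^2}\,\Delta w^2.$$

Minimizing over $\Delta w$ gives $\Delta w = -\left(\partial^2 J/\partial w^2\right)^{-1} \partial J/\partial w$, i.e. a gradient-descent step with $\eta_{opt} = \left(\partial^2 J/\partial w^2\right)^{-1}$ jumps (approximately) straight to the minimum of this quadratic; a step larger than $2\,\eta_{opt}$ overshoots so far that the error increases, which is why $\eta_{max} = 2\,\eta_{opt}$.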

Plateaus or Flat Surface in S’ Regions where the derivative is very small When the sigmoid function saturates Momentum Allows the network to learn more quickly when plateaus in the error surface exist Error surface having too many plateaus Momentum: Moving objects tend to keep moving unless acted upon by outside forces When c=0, this is bp With momentum, smoother learning curve, faster convergence Recursive or infinite-impluse-response low-pass filter that smooths the changes in w

Weight Decay
- Should almost always lead to improved performance
- Weights that are not needed for reducing the criterion function become smaller and smaller
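One common implementation (assumed here; the slide does not give a formula) simply shrinks every weight a little after each update,

$$w^{\text{new}} = w^{\text{old}}\,(1 - \epsilon), \qquad 0 < \epsilon \ll 1,$$

which has the same effect as adding a penalty proportional to $\|w\|^2$ to the criterion, so weights that the error term does not push back up decay toward zero.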

Batch Training vs. On-line Training
- Batch training
  - Add up the weight changes for all the training patterns and apply them in one go
  - True gradient descent (GD)
- On-line training
  - Update all the weights immediately after processing each training pattern
  - Not true GD, but often learns faster
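A minimal sketch of the two regimes; grad_fn is an assumed helper that returns the per-pattern gradient of the criterion with respect to the weights:

```python
def batch_epoch(X, T, w, grad_fn, eta=0.1):
    """Batch training: accumulate the gradient over all patterns,
    then apply a single update (true gradient descent)."""
    g = sum(grad_fn(x, t, w) for x, t in zip(X, T))
    return w - eta * g

def online_epoch(X, T, w, grad_fn, eta=0.1):
    """On-line training: update immediately after each pattern."""
    for x, t in zip(X, T):
        w = w - eta * grad_fn(x, t, w)
    return w
```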

Other Improvements
- Other error functions (e.g., the Minkowski error)
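The Minkowski error generalizes the sum-of-squares criterion (notation assumed; $z_k$ is the k-th network output and $t_k$ its target):

$$J(w) = \sum_k \lvert z_k - t_k \rvert^{R},$$

where $R = 2$ recovers the usual squared error and $R = 1$ gives a city-block error that is less sensitive to outliers.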

Further Discussions
- How to draw the decision boundary of a BPNN?
- How to set the range of valid outputs?
  - 0–0.5 and 0.5–1?
  - 0–0.2 and 0.8–1?
  - 0.1–0.2 and 0.8–0.9?
- The importance of having symmetric initial input