ECE 471/571 – Pattern Recognition Lecture 16: NN – Back Propagation


ECE 471/571 – Pattern Recognition
Lecture 16: NN – Back Propagation
Hairong Qi, Gonzalez Family Professor
Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://www.eecs.utk.edu/faculty/qi
Email: hqi@utk.edu

Pattern Classification
- Statistical Approach
  - Supervised
    - Basic concepts: Bayesian decision rule (MPP, LR, Discri.)
    - Parameter estimation (ML, BL)
    - Non-parametric learning (kNN)
    - LDF (Perceptron)
    - NN (BP)
    - Support Vector Machine
    - Deep Learning (DL)
  - Unsupervised
    - Basic concepts: Distance
    - Agglomerative method
    - k-means
    - Winner-takes-all
    - Kohonen maps
    - Mean-shift
- Non-Statistical Approach
  - Decision tree
  - Syntactic approach
- Dimensionality Reduction: FLD, PCA
- Performance Evaluation: ROC curve (TP, TN, FN, FP), cross validation
- Optimization: local opt (GD), global opt (SA, GA)
- Classifier Fusion: majority voting, NB, BKS

Types of NN
- Recurrent (feedback during operation)
  - Hopfield
  - Kohonen
  - Associative memory
- Feedforward (no feedback during operation or testing; feedback is used only during the determination of the weights, i.e., training)
  - Perceptron
  - Multilayer perceptron and backpropagation

Limitations of Perceptron
- The output only has two values (1 or 0)
- Can only classify samples that are linearly separable (separable by a straight line or a plane)
- A single layer can only learn functions such as AND, OR, NOT
- It cannot learn a function like XOR

Why Deeper? http://ai.stanford.edu/~quocle/tutorial2.pdf

Why Deeper? (Cont’) http://ai.stanford.edu/~quocle/tutorial2.pdf

XOR (3-layer NN)
[Figures: a generic neuron with inputs x1 … xd, weights w1 … wd, bias -b, and output y; and a 3-layer XOR network with inputs x1, x2 feeding identity units S1, S2, hidden units S3, S4, and output unit S5]
- S1, S2 are identity functions; S3, S4, S5 are sigmoid
- w13 = 1.0, w14 = -1.0, w24 = 1.0, w23 = -1.0, w35 = 0.11, w45 = -0.1
- The input takes on only -1 and 1
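The forward pass through this network can be sketched in a few lines. The weights below are the ones listed on the slide; the bias terms hinted at in the figure (the "-b" input) are not recoverable from the transcript, so they are omitted here and the outputs may differ slightly from the slide's.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Weights from the slide; S1, S2 are identity, S3, S4, S5 are sigmoid.
w13, w14 = 1.0, -1.0
w23, w24 = -1.0, 1.0
w35, w45 = 0.11, -0.1

def forward(x1, x2):
    s3 = sigmoid(w13 * x1 + w23 * x2)    # hidden unit S3
    s4 = sigmoid(w14 * x1 + w24 * x2)    # hidden unit S4
    return sigmoid(w35 * s3 + w45 * s4)  # output unit S5

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x1, x2, round(forward(x1, x2), 3))
```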

BP – 3-Layer Network
[Figure: a 3-layer network; input xi passes through unit Si, hidden unit Sq has net input hq, output unit Sj has net input yj and output S(yj); wiq connects input i to hidden unit q, wqj connects hidden unit q to output unit j]
- The problem is essentially "how to choose the weights w to minimize the error between the expected output and the actual output"
- The basic idea behind BP is gradient descent
- wst denotes the weight connecting input s to neuron t
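A minimal sketch of one gradient-descent step for this 3-layer network, using the slide's wiq / wqj notation with sigmoid units. The squared-error criterion, the learning rate, and the random data are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def S(v):                      # sigmoid activation
    return 1.0 / (1.0 + np.exp(-v))

d, nH, c, eta = 3, 4, 2, 0.1   # inputs, hidden units, outputs, learning rate (assumed)
w_iq = rng.uniform(-1, 1, (d, nH))   # input-to-hidden weights
w_qj = rng.uniform(-1, 1, (nH, c))   # hidden-to-output weights

x = rng.uniform(-1, 1, d)      # one training pattern (assumed)
t = np.array([1.0, -1.0])      # its target (assumed)

# Forward pass
h_q = x @ w_iq                 # net inputs of the hidden units
S_q = S(h_q)                   # hidden outputs
y_j = S_q @ w_qj               # net inputs of the output units
S_j = S(y_j)                   # network outputs

# Backward pass: gradient descent on E = 1/2 * sum_j (t_j - S_j)^2
delta_j = (S_j - t) * S_j * (1 - S_j)          # output-layer error term
delta_q = (w_qj @ delta_j) * S_q * (1 - S_q)   # hidden-layer error term (back-propagated)

w_qj -= eta * np.outer(S_q, delta_j)           # update hidden-to-output weights
w_iq -= eta * np.outer(x, delta_q)             # update input-to-hidden weights
```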

Exercise
[Figure: the same 3-layer network as on the previous slide (xi, Si, wiq, hq, Sq, wqj, yj, S(yj))]

*The Derivative – Chain Rule
[Figure: the same 3-layer network (xi, Si, wiq, hq, Sq, wqj, yj, S(yj))]
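The chain-rule derivation the slide refers to can be written out as follows; the squared-error criterion E and the target notation t_j are assumptions, since the slide itself only shows the network figure.

```latex
E = \frac{1}{2}\sum_j \big(t_j - S(y_j)\big)^2,\qquad
y_j = \sum_q w_{qj}\, S_q(h_q),\qquad
h_q = \sum_i w_{iq}\, x_i

\frac{\partial E}{\partial w_{qj}}
  = \frac{\partial E}{\partial S(y_j)}\,
    \frac{\partial S(y_j)}{\partial y_j}\,
    \frac{\partial y_j}{\partial w_{qj}}
  = -\big(t_j - S(y_j)\big)\, S'(y_j)\, S_q

\frac{\partial E}{\partial w_{iq}}
  = \sum_j \frac{\partial E}{\partial S(y_j)}\,
    \frac{\partial S(y_j)}{\partial y_j}\,
    \frac{\partial y_j}{\partial S_q}\,
    \frac{\partial S_q}{\partial h_q}\,
    \frac{\partial h_q}{\partial w_{iq}}
  = -\Big[\sum_j \big(t_j - S(y_j)\big)\, S'(y_j)\, w_{qj}\Big]\, S'_q(h_q)\, x_i
```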

Threshold Function
- The traditional threshold function, as proposed by McCulloch-Pitts, is a binary function
- The importance of being differentiable
- A threshold-like but differentiable form for S (25 years): the sigmoid
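As a small illustration of why differentiability matters, the logistic sigmoid and its derivative can be written directly; S'(x) = S(x)(1 - S(x)) is the standard identity, while the binary McCulloch-Pitts threshold has no usable derivative.

```python
import math

def S(x):
    """Logistic sigmoid: threshold-like but differentiable."""
    return 1.0 / (1.0 + math.exp(-x))

def S_prime(x):
    """Derivative of the sigmoid, S'(x) = S(x) * (1 - S(x))."""
    s = S(x)
    return s * (1.0 - s)

def step(x):
    """Binary McCulloch-Pitts threshold, for comparison (not differentiable at 0)."""
    return 1.0 if x >= 0 else 0.0

print(S(0.0), S_prime(0.0))   # 0.5, 0.25
```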

*BP vs. MPP
- When n goes to infinity, the output of the BP network approximates the a-posteriori probability
- For the last step, add and subtract \int P^2(\omega_k|x)\,p(x)\,dx:
  +\int P^2(\omega_k|x)\,p(x)\,dx - \int P^2(\omega_k|x)\,p(x)\,dx + \int P(\omega_k|x)\,p(x)\,dx
- The last two terms equal \int P(\omega_k|x)\big(1 - P(\omega_k|x)\big)\,p(x)\,dx
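Spelled out (following the standard derivation, with targets t_k = 1 when x belongs to ω_k and 0 otherwise, and g_k(x, w) the network output; this expansion is an assumption, since only the slide's final step is shown):

```latex
\lim_{n\to\infty}\frac{1}{n} J(w)
 = \int \big(g_k(x,w)-1\big)^2 P(\omega_k|x)\,p(x)\,dx
 + \int g_k^2(x,w)\big(1-P(\omega_k|x)\big)\,p(x)\,dx
 = \int g_k^2(x,w)\,p(x)\,dx
 - 2\int g_k(x,w)\,P(\omega_k|x)\,p(x)\,dx
 + \int P(\omega_k|x)\,p(x)\,dx
 = \int \big(g_k(x,w)-P(\omega_k|x)\big)^2 p(x)\,dx
 + \int P(\omega_k|x)\big(1-P(\omega_k|x)\big)\,p(x)\,dx
```

The last equality adds and subtracts \int P^2(\omega_k|x) p(x) dx. The second term does not depend on w, so minimizing J(w) drives g_k(x, w) toward the posterior P(\omega_k|x).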

Practical Improvements to Backpropagation

Activation (Threshold) Function
- The signum function
- The sigmoid function
  - Nonlinear: otherwise the 3-layer network provides no computational power beyond that of a 2-layer net
  - Saturating: so that the output has a minimum and a maximum
  - Continuous and smooth: so that the function and its derivative exist, and gradient descent can be used to learn
  - Monotonic (nonessential; so that S'(x) > 0): so that the derivative keeps a consistent sign when reducing the error; otherwise additional local minima or maxima can occur
- Improved form
  - Centered at zero
  - Antisymmetric (odd) – leads to faster learning
  - a = 1.716, b = 2/3, chosen to keep S'(0) ≈ 1, the linear range at -1 < x < 1, and the extrema of S''(x) roughly at x ≈ ±2
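A sketch of the improved, antisymmetric sigmoid with the constants quoted on the slide (a = 1.716, b = 2/3); using the scaled tanh form below is an assumption about which antisymmetric sigmoid is meant.

```python
import math

a, b = 1.716, 2.0 / 3.0   # constants from the slide

def S(x):
    """Zero-centered, antisymmetric sigmoid: S(x) = a * tanh(b * x)."""
    return a * math.tanh(b * x)

def S_prime(x):
    """Derivative: S'(x) = a * b * (1 - tanh(b * x) ** 2)."""
    return a * b * (1.0 - math.tanh(b * x) ** 2)

print(S(1.0), S(-1.0))    # roughly +1 and -1 at the edges of the linear range
print(S_prime(0.0))       # a * b, i.e. on the order of 1 at the origin
```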

Data Standardization
- Problem with the units of the inputs
  - Different units lead to differences in magnitude
  - Even with the same units, features can differ greatly in magnitude
- Standardization – scaling the input
  - Shift the input patterns so that the average of each feature over the training set is zero
  - Scale the full data set so that each feature component has the same variance (around 1.0)
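A minimal standardization sketch (zero mean, unit variance per feature), assuming the data are rows of a NumPy array; the statistics are computed on the training set and then reused for any test data.

```python
import numpy as np

def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation from the training set."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return mu, sigma

def standardize(X, mu, sigma):
    """Shift to zero mean and scale to (roughly) unit variance."""
    return (X - mu) / sigma

X_train = np.array([[180.0, 0.2], [160.0, 0.8], [170.0, 0.5]])  # toy data (assumed)
mu, sigma = fit_standardizer(X_train)
print(standardize(X_train, mu, sigma))
```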

Target Values (output)
- Instead of one-of-c (where c is the number of classes), use +1/-1
  - +1 indicates the target category
  - -1 indicates a non-target category
- For faster convergence

Number of Hidden Layers
- The number of hidden layers governs the expressive power of the network, and also the complexity of the decision boundary
- More hidden layers -> higher expressive power -> better tuned to the particular training set -> poorer performance on the testing set
- Rule of thumb
  - Choose the number of weights to be roughly n/10, where n is the total number of samples in the training set
  - Start with a "large" number of hidden units, then "decay", prune, or eliminate weights
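A quick back-of-the-envelope helper for the n/10 rule of thumb; the weight count below assumes a single hidden layer with bias units, which is an assumption beyond what the slide states.

```python
def hidden_units_for(n_samples, d, c):
    """Pick the number of hidden units so the total weight count is about n/10.

    Assumes a 3-layer net with bias units: (d + 1) * nH + (nH + 1) * c weights.
    """
    target_weights = n_samples / 10.0
    nH = (target_weights - c) / (d + 1 + c)
    return max(1, int(round(nH)))

print(hidden_units_for(n_samples=1000, d=10, c=3))  # about 7 hidden units
```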

Number of Hidden Layers
[Figure slide; image not included in the transcript]

Initializing Weights
- Can't start with all-zero weights
- Want fast and uniform learning: all weights reach their final equilibrium values at about the same time
- Choose weights randomly from a uniform distribution to help ensure uniform learning
- Use equal numbers of negative and positive weights
- Set the weights so that the net activation at a hidden unit is in the range of -1 to +1
  - Input-to-hidden weights: (-1/sqrt(d), 1/sqrt(d))
  - Hidden-to-output weights: (-1/sqrt(nH), 1/sqrt(nH)), where nH is the number of connected (hidden) units
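A sketch of the initialization ranges listed above for a single hidden layer; the uniform ranges come from the slide, while the NumPy representation is my own.

```python
import numpy as np

def init_weights(d, nH, c, rng=None):
    """Uniform initialization so that net activations fall roughly in (-1, +1)."""
    rng = rng or np.random.default_rng(0)
    w_ih = rng.uniform(-1.0 / np.sqrt(d),  1.0 / np.sqrt(d),  size=(d, nH))   # input-to-hidden
    w_ho = rng.uniform(-1.0 / np.sqrt(nH), 1.0 / np.sqrt(nH), size=(nH, c))   # hidden-to-output
    return w_ih, w_ho

w_ih, w_ho = init_weights(d=10, nH=5, c=2)
print(w_ih.shape, w_ho.shape)
```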

Learning Rate
- The optimal learning rate: calculate the 2nd derivative of the objective function with respect to each weight, and set the learning rate separately for each weight (the optimal rate is the inverse of that 2nd derivative)
- A learning rate of 0.1 is often adequate

Plateaus or Flat Surface in S'
- Regions where the derivative is very small, e.g., when the sigmoid function saturates
- The error surface may have many plateaus
- Momentum
  - Allows the network to learn more quickly when plateaus exist in the error surface
  - Physical analogy: moving objects tend to keep moving unless acted upon by outside forces
  - When the momentum coefficient c = 0, this reduces to plain BP
  - With momentum: smoother learning curve, faster convergence
  - Acts as a recursive (infinite-impulse-response) low-pass filter that smooths the changes in w
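One common way to add momentum to the weight update; the exact formula on the slide is not shown, so the form below (momentum coefficient c, reducing to plain BP when c = 0) is an assumption.

```python
import numpy as np

eta, c = 0.1, 0.9              # learning rate and momentum coefficient (assumed)
w = np.zeros(5)                # weights
delta_w_prev = np.zeros_like(w)

def momentum_step(grad, w, delta_w_prev):
    """Momentum update: smooths the weight changes like a low-pass filter."""
    delta_w = -eta * grad + c * delta_w_prev   # with c = 0 this is plain backprop
    return w + delta_w, delta_w

grad = np.array([0.2, -0.1, 0.0, 0.05, -0.3])  # gradient from backprop (assumed)
w, delta_w_prev = momentum_step(grad, w, delta_w_prev)
print(w)
```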

Weight Decay
- Should almost always lead to improved performance
- Weights that are not needed for reducing the criterion function become smaller and smaller
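A minimal sketch of weight decay as a multiplicative shrink applied after each update; the (1 - eps) factor follows the usual formulation, and eps is an assumed small constant.

```python
import numpy as np

eps = 1e-3   # decay constant (assumed)

def decay(weights):
    """Shrink every weight a little; unneeded weights drift toward zero."""
    return weights * (1.0 - eps)

w = np.array([0.8, -0.05, 1.2])
for _ in range(1000):
    # ... the normal backprop update for w would go here ...
    w = decay(w)
print(w)   # every weight shrunk by a factor of (1 - eps) ** 1000
```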

Batch Training vs. On-line Training
- Batch training
  - Add up the weight changes for all the training patterns and apply them in one go
  - True gradient descent (GD)
- On-line training
  - Update all the weights immediately after processing each training pattern
  - Not true GD, but faster learning
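The two schedules differ only in when the accumulated gradient is applied. The sketch below uses a hypothetical backprop_gradient(w, x, t) helper standing in for the per-pattern gradient computation.

```python
import numpy as np

def backprop_gradient(w, x, t):
    """Hypothetical placeholder: gradient of the error for one pattern (x, t).

    Here a linear unit with squared error, dE/dw = (w.x - t) * x.
    """
    return (w @ x - t) * x

eta = 0.1
X = np.random.default_rng(0).uniform(-1, 1, (20, 3))   # toy patterns (assumed)
T = np.sign(X[:, 0])                                   # toy targets (assumed)
w = np.zeros(3)

# Batch training: accumulate over all patterns, then apply once (true gradient descent)
grad = np.zeros_like(w)
for x, t in zip(X, T):
    grad += backprop_gradient(w, x, t)
w -= eta * grad

# On-line training: update immediately after each pattern (not true GD, but often faster)
for x, t in zip(X, T):
    w -= eta * backprop_gradient(w, x, t)
```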

Other Improvements
- Other error functions (e.g., the Minkowski error)
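The Minkowski error generalizes the squared error by raising the absolute residual to a power R (R = 2 recovers the usual sum of squares, R = 1 is less sensitive to outliers); the function below is a sketch of that definition.

```python
import numpy as np

def minkowski_error(target, output, R=2.0):
    """Minkowski error: sum over outputs of |t_j - z_j| ** R."""
    return np.sum(np.abs(np.asarray(target) - np.asarray(output)) ** R)

t = np.array([1.0, -1.0])
z = np.array([0.7, -0.2])
print(minkowski_error(t, z, R=2.0))   # squared error
print(minkowski_error(t, z, R=1.0))   # city-block error
```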

Further Discussions
- How to draw the decision boundary of a BPNN?
- How to set the range of valid outputs?
  - 0-0.5 and 0.5-1?
  - 0-0.2 and 0.8-1?
  - 0.1-0.2 and 0.8-0.9?
- The importance of having symmetric initial input