Lecture 10. MLP (II): Single Neuron Weight Learning and Nonlinear Optimization (C) 2001 by Yu Hen Hu

Outline: single neuron case (nonlinear error-correcting learning); nonlinear optimization.

Solving the XOR Problem: A training sample consists of a feature vector (x1, x2) and a label z. In the XOR problem, there are 4 training samples: (0, 0; 0), (0, 1; 1), (1, 0; 1), and (1, 1; 0). These four training samples give four equations for the 9 unknown weights of the MLP:
(x1, x2)   z
(0, 0)     0
(0, 1)     1
(1, 0)     1
(1, 1)     0
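The slide's count of unknowns can be made concrete with a small sketch, assuming the 9 variables are the weights and biases of a hypothetical 2-2-1 MLP (two inputs, two hidden units, one output); the architecture is an assumption, since any network with 9 parameters fits the statement:

```python
import numpy as np

# XOR training samples: feature vectors (x1, x2) and labels z
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
z = np.array([0, 1, 1, 0], dtype=float)

# A 2-2-1 MLP has 2*2 weights + 2 biases in the hidden layer
# and 2 weights + 1 bias in the output layer: 9 unknowns in total,
# while the 4 samples above supply only 4 equations.
n_hidden = 2 * 2 + 2
n_output = 2 * 1 + 1
print(n_hidden + n_output, "unknowns vs", X.shape[0], "equations")
```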

Finding the Weights of an MLP: In MLP applications there is often a large number of parameters (weights). Sometimes the number of unknowns is greater than the number of equations (as in the XOR case); in such a case the solution may not be unique. The objective is to solve for a set of weights that minimizes the difference between the actual output of the MLP and the corresponding desired output. Often the cost function is a sum of squares of such differences. This leads to a nonlinear least-squares optimization problem.

Single Neuron Learning: [Figure: a single neuron with inputs 1, x1, x2 weighted by w0, w1, w2 to form the net input u; the activation f(·) produces the output z, which is compared with the desired output d to form the error e, modulated by the derivative f'(u).] Nonlinear least-squares cost function: E(W) = Σ_{k=1..K} e²(k) = Σ_{k=1..K} [d(k) − z(k)]², where z(k) = f(u(k)) and u(k) = w0 + w1·x1(k) + w2·x2(k). Goal: find W = [w0, w1, w2] by minimizing E over the given training samples.
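As a concrete illustration, here is a minimal sketch of this sum-of-squares cost in Python, assuming a sigmoid activation f(u) = 1/(1 + e^(−u)) (the slide does not specify f, so the choice is an assumption):

```python
import numpy as np

def f(u):
    """Sigmoid activation (an assumption; any differentiable f works)."""
    return 1.0 / (1.0 + np.exp(-u))

def cost(W, X, d):
    """Nonlinear least-squares cost E(W) = sum_k [d(k) - z(k)]^2.

    W = [w0, w1, w2]; X is a K-by-2 array of (x1, x2); d has length K.
    """
    u = W[0] + X @ W[1:]          # u(k) = w0 + w1*x1(k) + w2*x2(k)
    z = f(u)                      # z(k) = f(u(k))
    return np.sum((d - z) ** 2)

# Example: evaluate E at an arbitrary weight vector on 4 samples
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 0], dtype=float)
print(cost(np.array([0.1, -0.2, 0.3]), X, d))
```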

Gradient-based Steepest Descent: The weight update formula has the form w_i(t+1) = w_i(t) − η·∂E/∂w_i, where t is the epoch index. During each epoch, K training samples are applied and the weights are updated once. Since E(W) = Σ_{k=1..K} [d(k) − z(k)]² and ∂z(k)/∂w_i = f'(u(k))·x_i(k) (with x_0(k) = 1 for the bias weight w0), we have ∂E/∂w_i = −2·Σ_{k=1..K} [d(k) − z(k)]·f'(u(k))·x_i(k). Hence w_i(t+1) = w_i(t) + η·Σ_{k=1..K} [d(k) − z(k)]·f'(u(k))·x_i(k), absorbing the constant factor 2 into the learning rate η.
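A minimal sketch of this batch gradient and one steepest-descent update, again assuming the sigmoid activation so that f'(u) = f(u)·(1 − f(u)); the function names are illustrative, not from the original slides:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def f_prime(u):
    # Derivative of the sigmoid: f'(u) = f(u) * (1 - f(u))
    s = f(u)
    return s * (1.0 - s)

def batch_gradient(W, X, d):
    """dE/dw_i = -2 * sum_k [d(k) - z(k)] * f'(u(k)) * x_i(k), with x_0(k) = 1."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the bias input x0 = 1
    u = Xa @ W
    z = f(u)
    return -2.0 * Xa.T @ ((d - z) * f_prime(u))

# One steepest-descent epoch: W(t+1) = W(t) - eta * dE/dW
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 0], dtype=float)
W = np.zeros(3)
eta = 0.1
W = W - eta * batch_gradient(W, X, d)
print(W)
```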

Delta Error: Denote δ(k) = [d(k) − z(k)]·f'(u(k)). Then δ(k) is the error signal e(k) = d(k) − z(k) modulated by the derivative of the activation function, f'(u(k)), and hence represents the amount of correction that needs to be applied to the weight w_i for the given input x_i(k). Therefore, the weight update formula for a single-layer, single-neuron perceptron has the form w_i(t+1) = w_i(t) + η·Σ_{k=1..K} δ(k)·x_i(k). This is the error-correcting learning formula discussed earlier.
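Putting the pieces together, a hedged sketch of the resulting error-correcting learning loop for a single sigmoid neuron; the learning rate, epoch count, and the AND-style target (chosen because a single neuron cannot realize XOR) are illustrative assumptions:

```python
import numpy as np

def f(u):
    return 1.0 / (1.0 + np.exp(-u))

def train_single_neuron(X, d, eta=0.5, epochs=5000):
    """Error-correcting learning: w_i(t+1) = w_i(t) + eta * sum_k delta(k) * x_i(k)."""
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # bias input x0(k) = 1
    W = np.zeros(Xa.shape[1])                       # W = [w0, w1, w2]
    for _ in range(epochs):
        u = Xa @ W
        z = f(u)
        delta = (d - z) * z * (1.0 - z)             # delta(k) = e(k) * f'(u(k))
        W = W + eta * Xa.T @ delta                  # one batch update per epoch
    return W

# Example: an AND-like task, which a single neuron can represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)
W = train_single_neuron(X, d)
print(f(np.hstack([np.ones((4, 1)), X]) @ W))       # outputs approach 0, 0, 0, 1
```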

Nonlinear Optimization: Problem: given a nonlinear cost function E(W) ≥ 0, find W such that E(W) is minimized. Taylor series expansion: E(W) ≈ E(W(t)) + (∇_W E)ᵀ·(W − W(t)) + (1/2)·(W − W(t))ᵀ·H_W(E)·(W − W(t)), where ∇_W E is the gradient vector with elements ∂E/∂w_i and H_W(E) is the Hessian matrix with elements ∂²E/(∂w_i ∂w_j), both evaluated at W(t).
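As an illustration, a small sketch that checks this second-order expansion numerically using finite-difference estimates of the gradient and Hessian; the test cost E, the expansion point, and the step sizes are all assumptions for demonstration only:

```python
import numpy as np

def E(W):
    # An arbitrary smooth nonnegative test cost (an assumption, not the slide's E)
    return (W[0] - 1.0) ** 2 + 2.0 * (W[1] + 0.5) ** 4

def numerical_gradient(E, W, h=1e-5):
    g = np.zeros_like(W)
    for i in range(W.size):
        e = np.zeros_like(W); e[i] = h
        g[i] = (E(W + e) - E(W - e)) / (2 * h)
    return g

def numerical_hessian(E, W, h=1e-4):
    n = W.size
    H = np.zeros((n, n))
    for i in range(n):
        ei = np.zeros(n); ei[i] = h
        for j in range(n):
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (E(W + ei + ej) - E(W + ei - ej)
                       - E(W - ei + ej) + E(W - ei - ej)) / (4 * h * h)
    return H

Wt = np.array([0.3, -0.2])
dW = np.array([0.01, -0.02])                    # small displacement W - W(t)
g, H = numerical_gradient(E, Wt), numerical_hessian(E, Wt)
taylor = E(Wt) + g @ dW + 0.5 * dW @ H @ dW     # second-order Taylor estimate
print(E(Wt + dW), taylor)                       # the two values should nearly agree
```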

Steepest Descent: Use a first-order approximation: E(W) ≈ E(W(t)) + (∇_{W(t)}E)ᵀ·(W − W(t)). The amount of reduction of E(W) from E(W(t)) is proportional to the inner product of ∇_{W(t)}E and ΔW = W − W(t). If |ΔW| is fixed, then the magnitude of the inner product is maximized when the two vectors are aligned. Since E(W) is to be reduced, we should have ΔW ∝ −∇_{W(t)}E = −g(t). This is called the steepest descent method. [Figure: E(W) plotted against W, showing the gradient ∇_W E at W(t) and the descent step ΔW taken in the opposite direction.]
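A minimal sketch of steepest descent with a fixed step size η; the test cost, its analytic gradient, and the hyperparameters below are assumptions for illustration:

```python
import numpy as np

def E(W):
    # Arbitrary smooth test cost (an assumption, not the slide's E)
    return (W[0] - 1.0) ** 2 + 2.0 * (W[1] + 0.5) ** 2

def grad_E(W):
    # Analytic gradient of the test cost above
    return np.array([2.0 * (W[0] - 1.0), 4.0 * (W[1] + 0.5)])

def steepest_descent(W0, eta=0.1, max_iter=200, tol=1e-8):
    """W(t+1) = W(t) - eta * g(t), stopping when the gradient is small."""
    W = W0.copy()
    for _ in range(max_iter):
        g = grad_E(W)
        if np.linalg.norm(g) < tol:
            break
        W = W - eta * g          # Delta W is proportional to -g(t)
    return W

print(steepest_descent(np.array([5.0, 5.0])))   # converges near (1, -0.5)
```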

Line Search: Steepest descent only gives the direction of ΔW, not its magnitude. Often we use a preset step size η to control the magnitude of ΔW. Line search allows us to find the locally optimal step size along −g(t). Set E(W(t+1), η) = E(W(t) + η·(−g)). Our goal is to find η so that E(W(t+1)) is minimized. Along −g, select points a, b, c such that E(a) > E(b) and E(c) > E(b). Then η is selected by assuming E(η) is a quadratic function of η. [Figure: E(η) plotted against η, with the bracketing points a, b, c and the minimum of the fitted quadratic.]
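A hedged sketch of the quadratic-interpolation step: given bracketing step sizes a < b < c with E(a) > E(b) and E(c) > E(b), fit a parabola through the three points and take its minimizer as η. The bracketing procedure itself is omitted, and the three-point formula below is the standard one rather than a formula quoted from the slide:

```python
import numpy as np

def quadratic_step(a, b, c, Ea, Eb, Ec):
    """Minimizer of the parabola through (a, Ea), (b, Eb), (c, Ec),
    assuming a < b < c with Ea > Eb and Ec > Eb (a bracketed minimum)."""
    num = (b - a) ** 2 * (Eb - Ec) - (b - c) ** 2 * (Eb - Ea)
    den = (b - a) * (Eb - Ec) - (b - c) * (Eb - Ea)
    return b - 0.5 * num / den

# Example with E(eta) = (eta - 0.3)**2; the fitted minimum is exactly 0.3
E = lambda eta: (eta - 0.3) ** 2
a, b, c = 0.0, 0.2, 1.0
print(quadratic_step(a, b, c, E(a), E(b), E(c)))
```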

Conjugate Gradient Method: Let d(t) be the search direction. The conjugate gradient method decides the new search direction d(t+1) as d(t+1) = −g(t+1) + β(t+1)·d(t), where β(t+1) is computed here using the Polak-Ribière formula β(t+1) = [g(t+1)ᵀ·(g(t+1) − g(t))] / [g(t)ᵀ·g(t)].
Conjugate Gradient Algorithm:
1. Choose W(0).
2. Evaluate g(0) = ∇_{W(0)}E and set d(0) = −g(0).
3. W(t+1) = W(t) + η·d(t), using line search to find the η that minimizes E(W(t) + η·d(t)).
4. If not converged, compute g(t+1) and β(t+1); then set d(t+1) = −g(t+1) + β(t+1)·d(t) and go to step 3.
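A compact sketch of this algorithm on a small quadratic test cost, using the Polak-Ribière β and an exact line search (which is available in closed form for a quadratic); the test matrix, vector, and convergence test are assumptions:

```python
import numpy as np

# Quadratic test cost E(W) = 0.5 * W^T A W - b^T W (an assumption for illustration);
# its gradient is g(W) = A W - b, and the exact line-search step along d is
# eta = -g^T d / (d^T A d).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda W: A @ W - b

def conjugate_gradient(W0, max_iter=50, tol=1e-10):
    W = W0.copy()
    g = grad(W)
    d = -g                                   # d(0) = -g(0)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        eta = -(g @ d) / (d @ A @ d)         # line search (exact for a quadratic)
        W = W + eta * d                      # W(t+1) = W(t) + eta * d(t)
        g_new = grad(W)
        beta = g_new @ (g_new - g) / (g @ g) # Polak-Ribiere beta(t+1)
        d = -g_new + beta * d                # d(t+1) = -g(t+1) + beta(t+1) * d(t)
        g = g_new
    return W

print(conjugate_gradient(np.zeros(2)))       # approaches the solution of A W = b
print(np.linalg.solve(A, b))
```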

Newton's Method: Second-order approximation: E(W) ≈ E(W(t)) + g(t)ᵀ·(W − W(t)) + (1/2)·(W − W(t))ᵀ·H_{W(t)}(E)·(W − W(t)). Setting ∇_W E = 0 with respect to W (not W(t)) and solving for ΔW = W − W(t), we have H_{W(t)}(E)·ΔW + g(W(t)) = 0. Thus ΔW = −H⁻¹·g(t). This is Newton's method. Difficulty: H⁻¹ is too complicated to compute! Solution: approximate H⁻¹ by different matrices: the quasi-Newton methods and the Levenberg-Marquardt method.
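Finally, a brief sketch contrasting a pure Newton step with a Levenberg-Marquardt-style damped step in which H is replaced by H + λI; the test cost, damping value λ, and stopping rule are illustrative assumptions:

```python
import numpy as np

# Assumed test cost: E(W) = (w0 - 1)^2 + 2*(w1 + 0.5)^4 (not the slide's E)
def grad(W):
    return np.array([2.0 * (W[0] - 1.0), 8.0 * (W[1] + 0.5) ** 3])

def hess(W):
    return np.array([[2.0, 0.0],
                     [0.0, 24.0 * (W[1] + 0.5) ** 2]])

def newton(W0, damping=0.0, max_iter=100, tol=1e-8):
    """Delta W = -(H + damping*I)^(-1) g; damping > 0 gives a Levenberg-Marquardt-style step."""
    W = W0.copy()
    for _ in range(max_iter):
        g = grad(W)
        if np.linalg.norm(g) < tol:
            break
        H = hess(W) + damping * np.eye(W.size)
        W = W - np.linalg.solve(H, g)   # solve H * Delta W = -g rather than forming H^(-1)
    return W

# Both runs head toward the minimizer (1, -0.5); damping makes each step
# more conservative, trading speed for robustness when H is near-singular.
print(newton(np.array([3.0, 3.0])))
print(newton(np.array([3.0, 3.0]), damping=0.1))
```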