Disadvantages of Discrete Neurons
Only boolean-valued functions can be computed. A simple learning algorithm for multi-layer discrete-neuron perceptrons is lacking. The computational capabilities of single-layer discrete-neuron perceptrons are limited. These disadvantages disappear when we consider multi-layer continuous-neuron perceptrons.
23-Nov-18 Rudolf Mak TU/e Computer Science
Preliminaries
A continuous-neuron perceptron with n inputs and m outputs computes a function $\mathbb{R}^n \to [0,1]^m$ when the sigmoid activation function is used, and a function $\mathbb{R}^n \to \mathbb{R}^m$ when a linear activation function is used. Here [0,1] denotes an interval. The learning rules for continuous-neuron perceptrons are based on optimization techniques for error functions; this requires a continuous and differentiable error function. Single-layer continuous-neuron perceptrons are also limited. Two layers can approximate any continuous function on $[0,1]^n$ to arbitrary accuracy.
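Sigmoid transfer function
The derivative of the standard sigmoid $\mathrm{sig}(z) = 1/(1 + e^{-z})$ can be expressed in the function itself: $\mathrm{sig}'(z) = \mathrm{sig}(z)\,(1 - \mathrm{sig}(z))$. A similar property holds for tanh: $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$. The two functions are related by $\tanh(z/2) = 2\,\mathrm{sig}(z) - 1$. Using tanh has a small practical advantage.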
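The identities on this slide can be checked numerically. The following is a minimal sketch, assuming the standard sigmoid sig(z) = 1/(1 + e^(-z)); the helper names (sig_prime, tanh_prime, central_difference) are illustrative and do not come from the lecture notes.

```python
import math

def sig(z):
    """Standard sigmoid (assumed definition: 1 / (1 + e^-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sig_prime(z):
    """Derivative of the sigmoid expressed in the sigmoid itself."""
    s = sig(z)
    return s * (1.0 - s)

def tanh_prime(x):
    """Derivative of tanh expressed in tanh itself."""
    return 1.0 - math.tanh(x) ** 2

def central_difference(f, x, h=1e-6):
    """Numerical derivative used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

POINTS = (-2.0, -0.5, 0.0, 0.5, 2.0)

for z in POINTS:
    err_sig = abs(sig_prime(z) - central_difference(sig, z))
    err_tanh = abs(tanh_prime(z) - central_difference(math.tanh, z))
    err_rel = abs(math.tanh(z / 2.0) - (2.0 * sig(z) - 1.0))
    print(f"z={z:+.1f}  sig'={err_sig:.1e}  tanh'={err_tanh:.1e}  relation={err_rel:.1e}")
```

All three printed differences should be at the level of rounding error.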
Computational Capabilities
Let $g: [0,1]^n \to \mathbb{R}$ be a continuous function and let $\varepsilon > 0$. Then there exists a two-layer perceptron with a first layer built from neurons with threshold and standard sigmoid activation function, and a second layer built from one neuron without threshold and with linear activation function, such that the function G computed by this network satisfies $|G(x) - g(x)| < \varepsilon$ for all $x \in [0,1]^n$.
Compare the Taylor series $g(x) = \sum_n \frac{x^n}{n!}\, g^{(n)}(0)$ with $G(x) = \sum_n w_n g_n(x)$: a truncated Taylor series uses the basis functions $g_n(x) = x^n$. Other basis functions are possible, such as sines and cosines (Fourier) or orthogonal polynomials. How many neurons are needed? We start with single-layer (single-neuron) networks.
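The idea of approximating g by a weighted sum of basis functions can be illustrated with the truncated Taylor series itself. The sketch below picks g = exp on [0,1] purely as an example (this choice is an assumption, not taken from the slides), so that the weights are w_n = 1/n!; it shows the approximation error shrinking as more terms are used. It does not build the two-layer sigmoid network of the theorem.

```python
import math

def truncated_taylor_exp(x, order):
    """G(x) = sum_{n=0}^{order} w_n * x^n with w_n = 1/n! (Taylor weights of exp at 0)."""
    return sum(x ** n / math.factorial(n) for n in range(order + 1))

def max_error(order, samples=101):
    """Largest |g(x) - G(x)| over an evenly spaced grid on [0, 1]."""
    xs = [i / (samples - 1) for i in range(samples)]
    return max(abs(math.exp(x) - truncated_taylor_exp(x, order)) for x in xs)

for order in (2, 4, 8):
    print(order, max_error(order))
```

For any requested accuracy a sufficiently high order can be chosen here, which mirrors the role of ε in the statement above.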
Single-layer networks
A single-layer network computes a function from $\mathbb{R}^n$ to $[0,1]^m$. It is sufficient to consider a single neuron, which computes $f\big(w_0 + \sum_{1 \le j \le n} w_j x_j\big)$. Assuming $x_0 = 1$, it computes $f\big(\sum_{0 \le j \le n} w_j x_j\big)$. The capabilities of single-layer networks are limited.
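The two ways of writing the neuron output give the same value, as the following minimal sketch shows. The function names and the example weights are hypothetical; the sigmoid is assumed as the transfer function f.

```python
import math

def sig(z):
    """Standard sigmoid, assumed here as the transfer function f."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_with_bias(w0, w, x, f=sig):
    """f(w0 + sum_{1<=j<=n} w_j x_j): bias kept as a separate term."""
    return f(w0 + sum(wj * xj for wj, xj in zip(w, x)))

def neuron_extended(w_ext, x, f=sig):
    """f(sum_{0<=j<=n} w_j x_j): input extended with the component x_0 = 1."""
    x_ext = [1.0] + list(x)
    return f(sum(wj * xj for wj, xj in zip(w_ext, x_ext)))

# Hypothetical weights and input, chosen only for illustration.
w0, w, x = -0.5, [1.0, 2.0], [0.25, 0.75]
print(neuron_with_bias(w0, w, x), neuron_extended([w0] + w, x))
```

Absorbing the bias into the weight vector means the learning rules only have to treat one kind of parameter.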
Error function
Again the weights are extended with the bias $w_0$ and the inputs with the component $x_0 = 1$. We no longer use the prime notation. The factor ½ is for computational convenience.
Gradient Descent
Least mean square (LMS) error function
Update of Weight i by Training Pair q
Hence Δw is in the direction of x. Simple cases arise when f is the sigmoid or tanh. The case is even simpler when f is the identity function f(z) = z, since then f'(z) = 1.
Delta Rule Learning (incremental version, arbitrary transfer function)
In the lecture notes the vector manipulation is replaced by a repetition:
for i := 0 to n do w_i := w_i + alpha * (t - y) * dy * x_i
Here dy denotes the derivative of the transfer function at the neuron's net input.
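The repetition above can be turned into a small runnable sketch. It assumes a sigmoid transfer function, the per-pair error ½(t − y)², and a hypothetical training set (logical OR); the names delta_rule_step and train_incremental, the learning parameter 0.5 and the number of epochs are illustrative choices, not values from the lecture.

```python
import math
import random

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def sig_prime(z):
    s = sig(z)
    return s * (1.0 - s)

def delta_rule_step(w, x, t, alpha, f=sig, f_prime=sig_prime):
    """One incremental update for one training pair; x already contains x[0] = 1."""
    z = sum(wi * xi for wi, xi in zip(w, x))  # net input
    y = f(z)                                  # neuron output
    dy = f_prime(z)                           # derivative of the transfer function
    return [wi + alpha * (t - y) * dy * xi for wi, xi in zip(w, x)]

def train_incremental(pairs, n_inputs, alpha=0.5, epochs=2000, seed=0):
    """Present the training pairs one at a time, updating the weights after each."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs + 1)]
    for _ in range(epochs):
        for x, t in pairs:
            w = delta_rule_step(w, [1.0] + list(x), t, alpha)
    return w

def output(w, x):
    return sig(sum(wi * xi for wi, xi in zip(w, [1.0] + list(x))))

# Hypothetical training set (logical OR), used only for illustration.
OR_PAIRS = [((0, 0), 0.0), ((0, 1), 1.0), ((1, 0), 1.0), ((1, 1), 1.0)]

weights = train_incremental(OR_PAIRS, n_inputs=2)
for x, t in OR_PAIRS:
    print(x, t, round(output(weights, x), 3))
```

Passing the identity function for f and a constant 1 for f_prime gives the simpler linear-neuron case.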
Stop criteria
Learning stops when one of the following holds: the mean square error becomes small enough; the mean square error no longer decreases, i.e. the gradient has become very small or even changes sign; or the maximum number of iterations has been exceeded.
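The three criteria can be combined into one check that a training loop calls after every iteration. This is a sketch only: the function name should_stop and the default thresholds are hypothetical and would need tuning for a real problem.

```python
def should_stop(mse_history, iteration,
                target_mse=1e-3, min_decrease=1e-8, max_iterations=10000):
    """Return the reason for stopping, or None if training should continue.

    mse_history holds the mean square error after each completed iteration.
    """
    # Criterion 1: the error is small enough.
    if mse_history and mse_history[-1] <= target_mse:
        return "error small enough"
    # Criterion 2: the error no longer decreases (this also catches an increase).
    if len(mse_history) >= 2 and mse_history[-2] - mse_history[-1] < min_decrease:
        return "error no longer decreases"
    # Criterion 3: the iteration budget is used up.
    if iteration >= max_iterations:
        return "maximum number of iterations exceeded"
    return None

print(should_stop([0.5, 0.3], iteration=2))
print(should_stop([0.5, 0.0005], iteration=2))
```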
Remarks
Delta rule learning is also called L(east) M(ean) S(quare) learning or Widrow-Hoff learning. Note that the incremental version of the delta rule is, strictly speaking, not a gradient descent algorithm, because in each step a different error function $E^{(q)}$ is used. Convergence of the incremental version can only be guaranteed if the learning parameter $\alpha$ goes to 0 during learning.
Perceptron Learning Rule (batch version, arbitrary transfer function)
Perceptron Learning Delta Rule (batch version, sigmoidal transfer function)
Perceptron Learning Rule (batch version, linear transfer function)
Convergence of the batch version
For a small enough learning parameter the batch version of the delta rule always converges. The resulting weights, however, may correspond to a local minimum of the error function instead of the global minimum. For the linear neuron we will analyze this further.
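The role of the learning parameter can be seen in a small experiment with a linear neuron. The sketch assumes the sum-of-squares error E(w) = ½ Σ_q (t − y)² and a hypothetical data set lying exactly on the line t = 2x + 1; the names batch_delta_rule and sum_squared_error and the two learning parameters are illustrative only.

```python
def sum_squared_error(w, pairs):
    """E(w) = 1/2 * sum over all pairs of (t - y)^2 for a linear neuron y = w0 + w1*x."""
    return 0.5 * sum((t - (w[0] + w[1] * x)) ** 2 for x, t in pairs)

def batch_delta_rule(pairs, alpha, iterations):
    """Batch version: accumulate the update over all pairs, then change the weights once."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        delta = [0.0, 0.0]
        for x, t in pairs:
            y = w[0] + w[1] * x            # linear transfer function, so f'(z) = 1
            delta[0] += (t - y) * 1.0      # component for the extended input x_0 = 1
            delta[1] += (t - y) * x
        w = [w[0] + alpha * delta[0], w[1] + alpha * delta[1]]
    return w

# Hypothetical data on the line t = 2x + 1, used only for illustration.
PAIRS = [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]

w_small = batch_delta_rule(PAIRS, alpha=0.05, iterations=1000)
w_large = batch_delta_rule(PAIRS, alpha=0.5, iterations=50)
print("alpha=0.05:", w_small, sum_squared_error(w_small, PAIRS))
print("alpha=0.5 :", w_large, sum_squared_error(w_large, PAIRS))
```

With the small learning parameter the weights approach (1, 2); with the large one the error grows instead of shrinking. For a linear neuron the error is quadratic in the weights, so the local-minimum caveat does not arise in this particular example.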
Linear Neurons and Least Squares
C is non-singular
Linear Least Squares Convergence
Gradient is a linear operator. Recall $\alpha' = P\alpha$. Inspect the batch version, with $X = \langle x^{(1)}, \ldots, x^{(P)} \rangle$.
Linear Least Squares Convergence
Find the line:
Solution: