Disadvantages of Discrete Neurons

1 Disadvantages of Discrete Neurons
- Only Boolean-valued functions can be computed.
- A simple learning algorithm for multi-layer discrete-neuron perceptrons is lacking.
- The computational capabilities of single-layer discrete-neuron perceptrons are limited.

These disadvantages disappear when we consider multi-layer continuous-neuron perceptrons.

2 Preliminaries
A continuous-neuron perceptron with n inputs and m outputs computes:
- a function $\mathbb{R}^n \to [0,1]^m$ when the sigmoid activation function is used;
- a function $\mathbb{R}^n \to \mathbb{R}^m$ when a linear activation function is used.

Here [0,1] denotes an interval.

The learning rules for continuous-neuron perceptrons are based on optimization techniques for error functions. This requires a continuous and differentiable error function.

Single-layer continuous-neuron perceptrons are also limited. Two-layer perceptrons can approximate any continuous function.

3 Sigmoid transfer function
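- The derivative of the sigmoid can be expressed in terms of the function itself: $\mathrm{sig}'(z) = \mathrm{sig}(z)\,(1 - \mathrm{sig}(z))$.
- A similar property holds for tanh: $\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x)$.
- The two functions are related by $\tanh(z/2) = 2\,\mathrm{sig}(z) - 1$.
- Using tanh gives a small practical advantage.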
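The identities on this slide can be checked numerically. The sketch below is only an illustration: the helper names (`sig`, `numeric_derivative`, `max_identity_gap`) are made up for this example, and the derivatives are compared against a central finite difference rather than derived symbolically.

```python
import math

def sig(z):
    """Standard logistic sigmoid, sig(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def numeric_derivative(f, z, h=1e-6):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

def max_identity_gap(points):
    """Largest deviation, over the given points, of the three identities:
    sig' = sig (1 - sig), tanh' = 1 - tanh^2, tanh(z/2) = 2 sig(z) - 1."""
    worst = 0.0
    for z in points:
        gaps = (
            abs(numeric_derivative(sig, z) - sig(z) * (1.0 - sig(z))),
            abs(numeric_derivative(math.tanh, z) - (1.0 - math.tanh(z) ** 2)),
            abs(math.tanh(z / 2.0) - (2.0 * sig(z) - 1.0)),
        )
        worst = max(worst, *gaps)
    return worst

if __name__ == "__main__":
    print(max_identity_gap([-2.0, -0.5, 0.0, 0.5, 2.0]))
```

The printed value should be close to zero, limited only by finite-difference and rounding error.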

4 Computational Capabilities
Let $g : [0,1]^n \to \mathbb{R}$ be a continuous function and let $\varepsilon > 0$. Then there exists a two-layer perceptron with:
- a first layer built from neurons with threshold and standard sigmoid activation function,
- a second layer built from one neuron without threshold and with linear activation function,

such that the function G computed by this network satisfies $|g(x) - G(x)| < \varepsilon$ for all $x \in [0,1]^n$.

Compare with a truncated Taylor series:
- $g(x) = \sum_n \frac{x^n}{n!}\, g^{(n)}(0)$
- $G(x) = \sum_n w_n\, g_n(x)$ with $g_n(x) = x^n$

Other basis functions are possible:
- sine and cosine (Fourier)
- orthogonal polynomials

How many neurons are needed? We start with single-layer (single-neuron) networks.
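The Taylor analogy can be made concrete with a small sketch. It is not a perceptron: it only shows a function written as a weighted sum of basis functions $g_n(x) = x^n$, using exp as an assumed example target (so that $g^{(n)}(0) = 1$ and $w_n = 1/n!$). The function names are hypothetical.

```python
import math

def truncated_taylor_exp(x, n_terms):
    """G(x) = sum over n < n_terms of w_n * g_n(x),
    with basis g_n(x) = x**n and weights w_n = 1/n! (Taylor series of exp at 0)."""
    return sum((1.0 / math.factorial(n)) * x ** n for n in range(n_terms))

def max_error(n_terms, samples=101):
    """Largest |exp(x) - G(x)| on an evenly spaced grid over [0, 1]."""
    xs = [i / (samples - 1) for i in range(samples)]
    return max(abs(math.exp(x) - truncated_taylor_exp(x, n_terms)) for x in xs)

if __name__ == "__main__":
    for n_terms in (3, 6, 10):
        print(n_terms, max_error(n_terms))
```

Adding basis functions shrinks the approximation error on [0,1], which mirrors the question of how many first-layer neurons a network needs.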

5 Single-layer networks
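- A single-layer network computes a function from $\mathbb{R}^n$ to $[0,1]^m$.
- It is sufficient to consider a single neuron.
- A single neuron computes a function $f\bigl(w_0 + \sum_{1 \le j \le n} w_j x_j\bigr)$.
- Assume $x_0 = 1$; then it computes a function $f\bigl(\sum_{0 \le j \le n} w_j x_j\bigr)$.
- Single-layer networks have limited capabilities.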
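A minimal sketch of the single-neuron computation, with the bias folded into the weight vector through the extra input $x_0 = 1$. The function name `neuron_output` and the choice of the sigmoid as default transfer function are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, f=sigmoid):
    """Compute f(sum_{0 <= j <= n} w_j x_j) for one neuron.

    weights: [w_0, w_1, ..., w_n], where w_0 is the bias weight.
    inputs:  [x_1, ..., x_n]; the constant x_0 = 1 is prepended here.
    """
    extended = [1.0] + list(inputs)
    if len(weights) != len(extended):
        raise ValueError("expected one weight per input plus a bias weight")
    z = sum(w * x for w, x in zip(weights, extended))
    return f(z)

if __name__ == "__main__":
    print(neuron_output([0.0, 0.0, 0.0], [3.0, -1.0]))                 # sigmoid neuron
    print(neuron_output([1.0, 2.0, 3.0], [1.0, 1.0], f=lambda z: z))   # linear neuron
```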

6 Error function
- Again the weights are extended with a bias $w_0$ and the inputs with a component $x_0 = 1$.
- We no longer use the prime notation.
- The factor ½ is for computational convenience.
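For reference, a squared-error function consistent with these remarks can be written as follows. This is an assumed form, not taken from the slide: $t^{(q)}$ denotes the target and $y^{(q)}$ the neuron output for training pair q, and P is the number of training pairs. Whether the sum is additionally divided by P is a matter of convention that only rescales the learning parameter.

```latex
E^{(q)}(w) = \tfrac{1}{2}\bigl(t^{(q)} - y^{(q)}\bigr)^2,
\qquad
y^{(q)} = f\Bigl(\sum_{0 \le j \le n} w_j x_j^{(q)}\Bigr),
\qquad
E(w) = \sum_{q=1}^{P} E^{(q)}(w)
```

The factor ½ cancels the factor 2 that appears when the square is differentiated.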

7 Gradient Descent 23-Nov-18 Rudolf Mak TU/e Computer Science
Least mean square (LMS) error function

8 Update of Weight i by Training Pair q
- Hence Δw is in the direction of x.
- Simple cases arise when f is the sigmoid or tanh.
- The case is even simpler when f is the identity function f(z) = z; then f'(z) = 1.
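The update for a single training pair can be worked out with the chain rule. The derivation below is a sketch under the assumption that the per-pair error is $E^{(q)} = \tfrac{1}{2}(t - y)^2$ and that α denotes the learning parameter; the index q is dropped on t, y, and x for readability.

```latex
\begin{aligned}
z &= \sum_{0 \le j \le n} w_j x_j, \qquad y = f(z), \qquad E^{(q)} = \tfrac{1}{2}(t - y)^2 \\
\frac{\partial E^{(q)}}{\partial w_i} &= -(t - y)\, f'(z)\, x_i \\
\Delta w_i &= -\alpha\, \frac{\partial E^{(q)}}{\partial w_i} = \alpha\, (t - y)\, f'(z)\, x_i
\end{aligned}
```

Since the scalar $\alpha (t - y) f'(z)$ is the same for every i, the vector Δw is a multiple of x. For the sigmoid $f'(z) = y(1 - y)$, for tanh $f'(z) = 1 - y^2$, and for the identity $f'(z) = 1$.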

9 Delta Rule Learning (incremental version, arbitrary transfer function)
In the lecture notes the vector manipulation is replaced by a repetition:
  for i := 0 to n do w_i := w_i + α (t − y) dy x_i
Here dy stands for the derivative f'(z) of the transfer function at the current net input.
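One incremental step can be sketched in Python as follows. The function name `delta_rule_step` and the sigmoid default are assumptions for this illustration, and `dy` is taken to be the derivative of the transfer function at the net input.

```python
import math

def sigmoid(z):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid, expressed in the function value."""
    y = sigmoid(z)
    return y * (1.0 - y)

def delta_rule_step(w, x, t, alpha, f=sigmoid, f_prime=sigmoid_prime):
    """Return the weights after one incremental update for the pair (x, t).

    w and x both include index 0: w[0] is the bias weight, x[0] must be 1.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))   # net input
    y = f(z)                                   # neuron output
    dy = f_prime(z)                            # derivative of the transfer function
    # Same update as the loop "for i := 0 to n", written as a comprehension.
    return [wi + alpha * (t - y) * dy * xi for wi, xi in zip(w, x)]

if __name__ == "__main__":
    print(delta_rule_step([0.0, 0.0], [1.0, 2.0], t=1.0, alpha=1.0))
```

With zero initial weights the output is 0.5 and the derivative 0.25, so each weight moves by 0.125 times its input.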

10 Stop criteria
- The mean square error becomes small enough.
- The mean square error does not decrease anymore, i.e. the gradient has become very small or even changes sign.
- The maximum number of iterations has been exceeded.
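These criteria can be combined in a small helper, sketched below. The name `should_stop`, the threshold values, and the returned labels are hypothetical. The second criterion is approximated here by comparing successive error values rather than by inspecting the gradient itself.

```python
def should_stop(error_history, iteration,
                tol=1e-4, min_improvement=1e-8, max_iterations=10_000):
    """Return a label for the stop criterion that fired, or None to continue.

    error_history: mean square errors recorded so far, oldest first.
    iteration:     number of iterations completed.
    """
    if error_history and error_history[-1] < tol:
        return "error_small"          # error is small enough
    if (len(error_history) >= 2
            and error_history[-2] - error_history[-1] < min_improvement):
        return "no_decrease"          # error stalled or started to increase
    if iteration >= max_iterations:
        return "max_iterations"       # iteration budget exhausted
    return None

if __name__ == "__main__":
    print(should_stop([0.5, 0.4], iteration=3))
    print(should_stop([0.5, 0.5], iteration=3))
```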

11 Remarks
- Delta rule learning is also called L(east) M(ean) S(quare) learning or Widrow-Hoff learning.
- Note that the incremental version of the delta rule is, strictly speaking, not a gradient descent algorithm, because in each step a different error function $E^{(q)}$ is used.
- Convergence of the incremental version can only be guaranteed if the learning parameter α goes to 0 during learning.

12 Perceptron Learning Rule (batch version, arbitrary transfer function)

13 Perceptron Learning Delta Rule (batch version, sigmoidal transfer function)

14 Perceptron Learning Rule (batch version, linear transfer function)

15 Convergence of the batch version
- For a small enough learning parameter, the batch version of the delta rule always converges.
- The resulting weights, however, may correspond to a local minimum of the error function instead of the global minimum.
- For the linear neuron we will analyze this further.
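The role of the learning parameter can be illustrated with a batch delta rule for a single linear neuron. This is a minimal sketch on made-up toy data (four points on the line t = 2x + 1); the function names and the values of `alpha` are assumptions chosen for the illustration.

```python
def batch_delta_rule_linear(samples, alpha, epochs):
    """Batch delta rule for one linear neuron, f(z) = z and f'(z) = 1.

    samples: list of (x, t) with x = [x_0 = 1, x_1, ..., x_n].
    The weight change is accumulated over all pairs before it is applied.
    """
    n = len(samples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        step = [0.0] * n
        for x, t in samples:
            y = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                step[i] += (t - y) * x[i]
        w = [wi + alpha * si for wi, si in zip(w, step)]
    return w

def sum_squared_error(w, samples):
    """E(w) = 1/2 * sum over pairs of (t - y)^2."""
    return 0.5 * sum(
        (t - sum(wi * xi for wi, xi in zip(w, x))) ** 2 for x, t in samples
    )

# Hypothetical toy data: points on the line t = 2x + 1, with x_0 = 1 prepended.
TOY_SAMPLES = [([1.0, float(x)], 2.0 * x + 1.0) for x in range(4)]

if __name__ == "__main__":
    print(batch_delta_rule_linear(TOY_SAMPLES, alpha=0.05, epochs=2000))
    print(sum_squared_error(batch_delta_rule_linear(TOY_SAMPLES, alpha=0.2, epochs=50), TOY_SAMPLES))
```

On this data the run with `alpha=0.05` approaches the weights [1, 2], while `alpha=0.2` is too large and the error grows instead of shrinking.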

16 Linear Neurons and Least Squares

17 Linear Neurons and Least Squares

18 C is non-singular

19 Linear Least Squares Convergence

20
- The gradient is a linear operator.
- Recall α' = P α.
- Inspect the batch version, with X = <x(1), …, x(P)>.

21 Linear Least Squares Convergence

22 Find the line:

23 Solution:

