Last lecture summary. biologically motivated synapses Neuron accumulates (Σ) positive/negative stimuli from other neurons. Then Σ is processed further.


1 Last lecture summary

2 Biologically motivated; figure label: synapses. A neuron accumulates (Σ) positive/negative stimuli from other neurons. Then Σ is processed further – f(Σ) – to produce an output, i.e. the neuron sends an output signal to the neurons connected to it.

3 Neural networks for applied science and engineering, Samarasinghe

4 Threshold neuron (McCulloch-Pitts) – only binary inputs and output – the weights are pre-set, no learning – set the threshold so that the classification is correct. Notation: x – inputs, w – weights, f(Σ) – activation (transfer) function, y – output

5 Heaviside (threshold) activation function

6 Threshold w_0 is incorporated as the weight of one additional input with input value x_0 = 1.0. Such an input is called the bias.
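
A minimal sketch of such a threshold neuron, with the threshold folded in as the bias weight w_0. The function names and the hand-set weights for AND and OR are illustrative assumptions, not taken from the slides.

```python
def heaviside(s):
    """Threshold activation: 1 if the weighted sum is positive, otherwise 0."""
    return 1 if s > 0 else 0


def threshold_neuron(x, w, w0):
    """Neuron with pre-set weights; w0 is the weight of the bias input x_0 = 1.0."""
    s = w0 * 1.0 + sum(wi * xi for wi, xi in zip(w, x))
    return heaviside(s)


# Hand-set weights (assumed values, no learning involved).
AND_W, AND_W0 = [1.0, 1.0], -1.5   # fires only when both inputs are 1
OR_W, OR_W0 = [1.0, 1.0], -0.5     # fires when at least one input is 1

inputs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_table = [threshold_neuron(x, AND_W, AND_W0) for x in inputs]
or_table = [threshold_neuron(x, OR_W, OR_W0) for x in inputs]
```

Changing only w_0 moves the decision threshold, which is why it can be treated like any other weight.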

7 Perceptron: binary classifier, maps its input x (real-valued vector) to f(x), a binary value (0 or 1): f(x) = 1 if w∙x > 0 (including bias), 0 otherwise. The perceptron can adjust its weights (i.e. can learn) – perceptron learning algorithm

8 Multiple output perceptron – for multicategory (i.e. more than 2 classes) classification – one output neuron for each class. Figure labels: input layer, output layer

9 Learning. Learning means there exists an algorithm for setting the neuron’s weights (the threshold w_0 is also set). – delta rule – gradient descent – β – learning rate

10 Iterative algorithm; one pass through the whole training set (epoch) is not enough. Online learning – adjust weights after each input pattern presentation – weight oscillation may occur. Batch learning – obtain the error gradient for each input pattern, average them at the end of the epoch
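
The online variant can be sketched as follows for a single perceptron, assuming the usual update w ← w + β (target − output) x. The function names, the zero initialisation and the choice β = 0.5 are assumptions made for this illustration.

```python
def predict(w, x):
    """Perceptron output; w[0] is the bias weight w_0 whose input is fixed at 1.0."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else 0


def train_perceptron(samples, beta=0.5, max_epochs=100):
    """Online learning: weights are adjusted after each pattern presentation."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for epoch in range(1, max_epochs + 1):
        errors = 0
        for x, target in samples:
            delta = target - predict(w, x)      # -1, 0 or +1
            if delta != 0:
                errors += 1
                w[0] += beta * delta            # bias input is 1.0
                for i, xi in enumerate(x):
                    w[i + 1] += beta * delta * xi
        if errors == 0:                         # a full epoch without mistakes
            return w, epoch
    return w, max_epochs


AND = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, epochs_used = train_perceptron(AND)
```

Several epochs are needed before a full pass produces no errors, which illustrates why one pass through the training set is not enough.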

11

12 New stuff: finishing perceptron

13 Perceptron failure. Please help me and draw the following functions on the blackboard: AND, OR, XOR (eXclusive OR, true when exactly one of the operands is true, otherwise false). [Figure: three plots with both axes running from 0 to 1, titled AND, OR and XOR; the XOR plot is marked ???]

14 Play with http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html

15 A perceptron has a linear decision boundary (the hyperplane w∙x = 0), so only linearly separable problems can be solved. 1969 – the famous book “Perceptrons” by Marvin Minsky and Seymour Papert showed that it was impossible for these classes of network to learn an XOR function. They conjectured (incorrectly!) that a similar result would hold for a perceptron with three or more layers. The often-cited Minsky/Papert text caused a significant decline in interest and funding of neural network research. It took more than ten years until neural network research experienced a resurgence in the 1980s.
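
A small illustration of the XOR failure, assuming a single threshold unit with two inputs and a bias. The search tries every weight triple on a coarse grid (the grid itself is an arbitrary choice); it is a demonstration, not a proof. The proof is short: XOR would need w0 ≤ 0, w0 + w1 > 0, w0 + w2 > 0 and w0 + w1 + w2 ≤ 0, and adding the two middle inequalities contradicts the sum of the outer two.

```python
from itertools import product


def fires(w0, w1, w2, a, b):
    """Single threshold unit with bias weight w0."""
    return 1 if w0 + w1 * a + w2 * b > 0 else 0


def find_weights(truth_table, grid):
    """Return the first (w0, w1, w2) on the grid reproducing the table, else None."""
    for w0, w1, w2 in product(grid, repeat=3):
        if all(fires(w0, w1, w2, a, b) == t for (a, b), t in truth_table.items()):
            return (w0, w1, w2)
    return None


GRID = [i / 2 for i in range(-6, 7)]  # -3.0 ... 3.0 in steps of 0.5

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_solution = find_weights(AND, GRID)
or_solution = find_weights(OR, GRID)
xor_solution = find_weights(XOR, GRID)   # no separating line exists
```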

16 Play with http://www.eee.metu.edu.tr/~halici/courses/543java/NNOC/Perceptron.html

17 Multilayer perceptron

18 Nonlinear activation functions. So far we have met threshold and linear activation functions. With them a neuron separates its inputs by a linear boundary, and consequently the solved problems must also be linear. Nonlinearity is introduced by using nonlinear activation functions.

19 logistic (sigmoid, unipolar); tanh (bipolar)
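
For reference, a short sketch of both functions and their derivatives, which gradient-based training needs. The function names are chosen for this example; "unipolar" refers to the output range (0, 1) and "bipolar" to (−1, 1).

```python
import math


def logistic(s):
    """Logistic (sigmoid, unipolar) activation with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def tanh(s):
    """Hyperbolic tangent (bipolar) activation with output in (-1, 1)."""
    return math.tanh(s)


def logistic_derivative(s):
    """Derivative expressed through the output: y * (1 - y)."""
    y = logistic(s)
    return y * (1.0 - y)


def tanh_derivative(s):
    """Derivative expressed through the output: 1 - y^2."""
    y = math.tanh(s)
    return 1.0 - y * y
```

The two are related by tanh(s) = 2·logistic(2s) − 1, so the bipolar function is a rescaled and shifted version of the unipolar one.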

20 Multilayer perceptron: MLP, the most famous type of neural network. Figure labels: input layer, hidden layer, output layer

21 Figure labels: input layer, hidden layer, output layer; three-layer vs. two-layer

22 Backpropagation training algorithm. How to train an MLP? With a gradient descent type of algorithm called backpropagation. MLP works in two passes: forward pass – present a training sample to the neural network – compare the network's output to the desired output from that sample – calculate the error in each output neuron

23 backward pass – compute the amount ∆w by which the weights should be updated – first calculate the gradient for hidden-to-output weights – then calculate the gradient for input-to-hidden weights; the hidden-to-output gradient is needed to calculate the input-to-hidden gradient – update the weights in the network. It is a gradient descent method – learning rate β is used – can get trapped in local minima

24 input signal propagates forward; error propagates backward
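
The two passes can be sketched for a small MLP with one hidden layer. Assumptions made for this illustration: logistic activations in both layers, squared error E = ½ Σ (y − t)², online updates, and a 2-2-1 network with random initial weights. None of these choices is prescribed by the slides.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, W_hid, W_out):
    """Forward pass. Each weight row starts with the bias weight (input fixed at 1)."""
    h = [sigmoid(r[0] + sum(w * xi for w, xi in zip(r[1:], x))) for r in W_hid]
    y = [sigmoid(r[0] + sum(w * hj for w, hj in zip(r[1:], h))) for r in W_out]
    return h, y


def error(x, target, W_hid, W_out):
    _, y = forward(x, W_hid, W_out)
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, target))


def backward(x, target, W_hid, W_out):
    """Backward pass: gradients of E for both weight layers."""
    h, y = forward(x, W_hid, W_out)
    # Error signals of the output neurons (hidden-to-output gradient first).
    d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
    grad_out = [[dk] + [dk * hj for hj in h] for dk in d_out]
    # Hidden error signals reuse d_out: the error propagates backward.
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * W_out[k][j + 1] for k in range(len(W_out)))
             for j, hj in enumerate(h)]
    grad_hid = [[dj] + [dj * xi for xi in x] for dj in d_hid]
    return grad_hid, grad_out


def sgd_step(x, target, W_hid, W_out, beta=0.5):
    """One online update: w <- w - beta * dE/dw, applied in place."""
    grad_hid, grad_out = backward(x, target, W_hid, W_out)
    for W, G in ((W_hid, grad_hid), (W_out, grad_out)):
        for row, grow in zip(W, G):
            for i in range(len(row)):
                row[i] -= beta * grow[i]


random.seed(0)
W_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]  # 2 inputs -> 2 hidden
W_out = [[random.uniform(-1, 1) for _ in range(3)]]                    # 2 hidden -> 1 output

XOR = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
for _ in range(2000):
    for x, t in XOR:
        sgd_step(x, t, W_hid, W_out)
```

Whether this run ends up reproducing XOR depends on the initial weights, since gradient descent can get trapped in local minima.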

25 online learning vs. batch learning – In online learning the weights are changed after each presentation of a training pattern. Weights may oscillate. Suitable for online learning. – In batch learning, the total gradient for the whole epoch is represented as the sum of the gradient for each of the n patterns. Batch learning improves the stability by averaging. Another averaging approach providing stability is using the momentum.

26 This method basically tags the average of the past weight changes onto the new weight increment at every weight change, thereby smoothing out the net weight change. Momentum μ is between 0 and 1. It indicates the relative importance of the past weight change ∆w_{m-1} on the new weight increment ∆w_m. Thus, the current gradient and the past weight change together decide how much the new weight increment will be.

27 For example, if μ is equal to 0, momentum does not apply at all, and the past history has no place. If μ is equal to 1, the current change is totally based on the past change. Values of μ between 0 and 1 result in a combined response to weight change.

28 The equation is recursive, so the influence of the past weight change incorporates that of all previous weight changes as well. Momentum can be used with both batch and online learning. In batch learning, it can provide further stability to the gradient descent. Momentum can be especially useful in online learning to minimize oscillations in error after the presentation of each pattern.
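
A minimal sketch of a momentum update, assuming the form ∆w_m = μ ∆w_{m-1} − (1 − μ) β g_m, where g_m is the current error gradient. This form is chosen because it matches the description above (μ = 0 gives plain gradient descent, μ = 1 repeats the past change). Another widely used variant omits the (1 − μ) factor. The names and default values are illustrative.

```python
def momentum_step(w, grad, prev_dw, beta=0.1, mu=0.9):
    """Return the updated weight and the weight change to remember for next time."""
    dw = mu * prev_dw - (1.0 - mu) * beta * grad
    return w + dw, dw


def minimize_quadratic(w=5.0, steps=200, beta=0.1, mu=0.9):
    """Toy run on E(w) = w^2 (gradient 2w), recording the path of the weight."""
    dw, path = 0.0, [w]
    for _ in range(steps):
        w, dw = momentum_step(w, 2.0 * w, dw, beta, mu)
        path.append(w)
    return path


path = minimize_quadratic()
```

Because each ∆w_m contains ∆w_{m-1}, which in turn contains ∆w_{m-2}, every earlier change contributes with a weight that shrinks geometrically in μ.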

29 Delta-Bar-Delta. In backpropagation the same learning rate β applies to all of the weights. More flexibility could be achieved if the learning rate of each weight were adjusted independently. This method is called delta-bar-delta (TurboProp). Each weight has its own learning rate; the rates are adjusted as follows: – if the direction in which the error decreases at the current point is the same as the direction in which the error has been decreasing recently, then the learning rate is increased. – if the opposite is true, the learning rate is decreased
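
One way to sketch this rule, assuming the commonly described scheme in which the "recent direction" is an exponential average of past gradients, rates grow additively and shrink multiplicatively. The parameter names kappa, phi and theta and their values are assumptions for this illustration, not taken from the slides.

```python
def delta_bar_delta_step(w, rate, bar, grad, kappa=0.01, phi=0.5, theta=0.7):
    """One update with a separate learning rate per weight.

    w, rate, bar, grad are lists of equal length; bar holds the exponential
    average of past gradients (the recent direction) for each weight.
    """
    new_rate = []
    for r, b, g in zip(rate, bar, grad):
        if b * g > 0:        # same direction as recently: increase the rate
            r = r + kappa
        elif b * g < 0:      # direction flipped: decrease the rate
            r = r * (1.0 - phi)
        new_rate.append(r)
    new_w = [wi - r * g for wi, r, g in zip(w, new_rate, grad)]
    new_bar = [(1.0 - theta) * g + theta * b for b, g in zip(bar, grad)]
    return new_w, new_rate, new_bar


# First weight keeps its direction, second weight has just changed sign.
w, rate, bar = delta_bar_delta_step(
    w=[0.0, 0.0], rate=[0.1, 0.1], bar=[1.0, -1.0], grad=[2.0, 2.0])
```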

30 Second order methods Surface curvature can be used to guide the error down the error surface more efficiently.

31 grad is a vector pointing in the direction of the greatest rate of increase of the function. How fast does the rate of increase of the function change in a small neighbourhood? This is given by the derivative of the gradient, the derivative of a derivative, i.e. the second derivative. The second derivatives with respect to all pairs of weights form the Hessian matrix.

32 Common methods using the Hessian – QuickProp – Gauss-Newton – Levenberg-Marquardt (LM). These methods are an order of magnitude faster (i.e. they reach minima in far fewer epochs) than first order methods (i.e. gradient based). However, the efficiency is gained at a considerable computational cost. – Computing and inverting the Hessian for large networks with a large number of training patterns is expensive (large storage requirements) and slow.
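
A toy comparison, assuming a two-weight quadratic error surface E(w) = (w1 − 1)² + 10 (w2 + 2)² whose gradient and Hessian are written out by hand. It shows a plain Newton step (w ← w − H⁻¹ grad), not any of the specific methods listed above; all names are hypothetical.

```python
def grad(w):
    """Gradient of E(w) = (w1 - 1)^2 + 10 * (w2 + 2)^2."""
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]


def hessian(w):
    """Second derivatives with respect to all pairs of weights (constant here)."""
    return [[2.0, 0.0], [0.0, 20.0]]


def solve_2x2(H, g):
    """Solve H * step = g for a 2x2 matrix using Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


def newton_step(w):
    step = solve_2x2(hessian(w), grad(w))
    return [wi - si for wi, si in zip(w, step)]


def gradient_step(w, beta=0.04):
    return [wi - beta * gi for wi, gi in zip(w, grad(w))]


def steps_to_converge(step_fn, w, tol=1e-6, max_steps=10000):
    """Count updates until every gradient component is below tol."""
    for n in range(max_steps + 1):
        if max(abs(g) for g in grad(w)) < tol:
            return n
        w = step_fn(w)
    return max_steps


newton_steps = steps_to_converge(newton_step, [5.0, 5.0])
gradient_steps = steps_to_converge(gradient_step, [5.0, 5.0])
```

On a quadratic surface a Newton step lands on the minimum directly. The cost is solving a linear system in the Hessian, which grows quickly with the number of weights.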

33 Bias-variance. Just a small reminder: bias (lack of fit, underfitting) – model does not fit the data well enough, is not flexible enough (too few parameters); variance (overfitting) – model is too flexible (too many parameters), fits noise; bias-variance tradeoff – improving the generalization ability of the model (i.e. finding the correct amount of flexibility)

34 Parameters in MLP: weights. If you use one more hidden neuron, by how much does the number of weights increase? – # input neurons + # output neurons. If MLP is used for a regression task, be careful! To use MLP statistically correctly, the number of degrees of freedom (i.e. weights) can’t exceed the number of data points. – Compare to the polynomial regression example from the 2nd lecture
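
The count can be checked with a small helper for an MLP with one hidden layer. The function name and the `bias` switch are assumptions for this sketch. With bias weights counted, each new hidden neuron adds one further weight (its own bias).

```python
def mlp_weight_count(n_in, n_hidden, n_out, bias=True):
    """Number of weights in an MLP with a single hidden layer."""
    b = 1 if bias else 0
    return (n_in + b) * n_hidden + (n_hidden + b) * n_out


# Adding one hidden neuron to a 3-4-2 network:
increase_no_bias = mlp_weight_count(3, 5, 2, bias=False) - mlp_weight_count(3, 4, 2, bias=False)
increase_with_bias = mlp_weight_count(3, 5, 2) - mlp_weight_count(3, 4, 2)


def enough_data(n_in, n_hidden, n_out, n_points):
    """Rule of thumb from the slide: weights should not outnumber data points."""
    return mlp_weight_count(n_in, n_hidden, n_out) <= n_points
```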

35 Improving generalization of MLP. Flexibility comes from hidden neurons. Choose the # of hidden neurons so that neither underfitting nor overfitting occurs. Three most common approaches: – exhaustive search – early stopping – regularization

36 Exhaustive search. Increase the number of hidden units, and monitor the performance on the validation data set. Figure axis label: number of neurons

37 Early stopping – a fixed and large number of neurons is used – the network is trained while testing its performance on a validation set at regular intervals – minimum of the validation error – correct weights. Figure axis label: epochs
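
The procedure can be sketched as a generic loop. `train_one_epoch` and `validation_error` are hypothetical callables supplied by the user, and the `patience` stopping criterion is an assumption added so that the loop can end once the validation minimum has clearly passed. The toy "network" at the end only imitates a validation curve with a single minimum.

```python
import copy


def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              max_epochs=200, check_every=1, patience=10):
    """Keep the weights from the epoch with the lowest validation error."""
    best_err = validation_error(weights)
    best_weights = copy.deepcopy(weights)
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        weights = train_one_epoch(weights)
        if epoch % check_every == 0:            # test at regular intervals
            err = validation_error(weights)
            if err < best_err:
                best_err = err
                best_weights = copy.deepcopy(weights)
                best_epoch = epoch
            elif epoch - best_epoch >= patience:
                break                            # no improvement for a while
    return best_weights, best_epoch, best_err


# Toy stand-ins: one "weight" that grows by 0.1 per epoch, and a
# validation error that is smallest when the weight is near 2.0.
def toy_epoch(w):
    return [w[0] + 0.1]


def toy_validation_error(w):
    return (w[0] - 2.0) ** 2


best_weights, best_epoch, best_err = train_with_early_stopping(
    [0.0], toy_epoch, toy_validation_error)
```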

38 Weight decay (regularization). Idea: keep the growth of weights to a minimum in such a way that non-important weights are pulled toward zero. Only the important weights are allowed to grow, others are forced to decay.

39 This is achieved by minimizing not the MSE alone but the MSE plus a second term – the regularization term. m – number of weights in the network; δ – regularization parameter – the larger the δ, the more important the regularization
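
A sketch of such a criterion, assuming the common sum-of-squares penalty, i.e. minimizing MSE + δ Σ_{j=1..m} w_j². The exact form of the regularization term is an assumption here, and the function names are illustrative.

```python
def regularized_error(errors, weights, delta=0.01):
    """MSE over the patterns plus delta times the sum of squared weights."""
    mse = sum(e * e for e in errors) / len(errors)
    penalty = delta * sum(w * w for w in weights)
    return mse + penalty


def penalty_gradient(weights, delta=0.01):
    """Gradient of the penalty term: it pulls every weight toward zero."""
    return [2.0 * delta * w for w in weights]


plain = regularized_error([1.0, -1.0], [3.0, 4.0], delta=0.0)
decayed = regularized_error([1.0, -1.0], [3.0, 4.0], delta=0.1)
```

In a gradient step the penalty contributes −β·2δ·w_j to each weight, so a weight only grows if the data gradient outweighs this pull.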

40 Network pruning Both early stopping and weight decay use all weights in the NN. They do not reduce the complexity of the model. Network pruning – reduce complexity by keeping only essential weights/neurons. Several pruning approaches, e.g. – optimal brain damage (OBD) – optimal brain surgeon (OBS) – optimal cell damage (OCD)

41 OBD. Based on sensitivity analysis – systematically change parameters in a model to determine the effects of such changes. Weights that are not important for the input-output mapping are removed. The importance (saliency) of a weight is measured based on the cost of setting the weight to zero.

42 How to perform OBD? 1. Train a flexible network in the normal way (i.e. use early stopping, weight decay, …) 2. Compute the saliency of each weight. Remove weights with small saliencies. 3. Train the reduced network again with the kept weights. Initialize the training with their values obtained in the previous step. 4. Repeat from step 1.
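
Step 2 can be sketched as follows, assuming the saliency used in the original OBD formulation, s_k = ½ h_kk w_k², where h_kk is the diagonal Hessian entry for weight k. The diagonal is taken as given here; computing it and retraining the network are outside this sketch. All names and numbers are illustrative.

```python
def obd_saliencies(weights, hessian_diag):
    """Estimated increase in error caused by setting each weight to zero."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]


def prune_smallest(weights, hessian_diag, n_prune):
    """Zero the n_prune weights with the smallest saliency; return weights and mask."""
    s = obd_saliencies(weights, hessian_diag)
    order = sorted(range(len(weights)), key=lambda k: s[k])
    mask = [1] * len(weights)
    for k in order[:n_prune]:
        mask[k] = 0              # this weight is removed from the network
    return [w * m for w, m in zip(weights, mask)], mask


weights = [2.0, 0.1, -1.0]
hessian_diag = [0.5, 4.0, 1.0]   # made-up curvature values
saliencies = obd_saliencies(weights, hessian_diag)
pruned, mask = prune_smallest(weights, hessian_diag, n_prune=1)
```

During retraining (step 3) the mask would be used to keep the removed weights at zero, while the kept weights start from their current values.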

