
Last lecture summary

Biologically motivated: through its synapses a neuron accumulates (Σ) positive/negative stimuli from other neurons. Σ is then processed further – f(Σ) – to produce an output, i.e. the neuron sends an output signal to the neurons connected to it.

Neural networks for applied science and engineering, Samarasinghe

Threshold neuron (McCulloch-Pitts)
– only binary inputs and output
– the weights are pre-set, no learning
– set the threshold so that the classification is correct
Notation: x – inputs, w – weights, f(Σ) – activation (transfer) function, y – output

Heaviside (threshold) activation function

The threshold w₀ is incorporated as the weight of one additional input with input value x₀ = 1.0. Such an input is called the bias.

Perceptron
– binary classifier, maps its input x (real-valued vector) to f(x) – a binary value (0 or 1)
  f(x) = 1 if w∙x > 0 (including bias), 0 otherwise
– the perceptron can adjust its weights (i.e. can learn) – perceptron learning algorithm
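A minimal sketch of this decision function in Python/NumPy (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def perceptron_predict(x, w):
    """Perceptron decision: returns 1 if w.x > 0, else 0.
    w[0] is the bias weight; the bias input x0 = 1.0 is prepended here."""
    x = np.concatenate(([1.0], x))      # prepend the bias input x0 = 1.0
    return 1 if np.dot(w, x) > 0 else 0
```

For example, with w = [-1.5, 1.0, 1.0] this computes the AND function on binary inputs.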

Multiple output perceptron
– for multicategory (i.e. more than 2 classes) classification
– one output neuron for each class
(figure: input layer, output layer)

Learning
Learning means there exists an algorithm for setting the neuron's weights (the threshold w₀ is also set).
– delta rule (a gradient descent rule)
– β – learning rate
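The delta rule equation itself did not survive the transcript; in its usual form (an assumption here, using the notation above with desired output d and actual output y) the update of weight w_i is:

```latex
\Delta w_i = \beta \,(d - y)\, x_i, \qquad w_i \leftarrow w_i + \Delta w_i
```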

Iterative algorithm – one pass through the whole training set (an epoch) is not enough.
– online learning – adjust the weights after each input pattern presentation; weight oscillation may occur
– batch learning – obtain the error gradient for each input pattern and average them at the end of the epoch
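A sketch contrasting the two update schemes for the delta rule above (NumPy; the names and the `predict(x, w)` signature are illustrative assumptions):

```python
import numpy as np

def train_online(X, d, w, predict, beta=0.1):
    """Online learning: update the weights after every pattern presentation."""
    for x, target in zip(X, d):
        w = w + beta * (target - predict(x, w)) * np.concatenate(([1.0], x))
    return w

def train_batch(X, d, w, predict, beta=0.1):
    """Batch learning: average the per-pattern gradients over the whole epoch."""
    grad = np.zeros_like(w)
    for x, target in zip(X, d):
        grad += (target - predict(x, w)) * np.concatenate(([1.0], x))
    return w + beta * grad / len(X)
```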

New stuff: finishing the perceptron

Perceptron failure
Please help me and draw the following functions on the blackboard:
– AND, OR, XOR (eXclusive OR – true when exactly one of the operands is true, otherwise false)
(figures: AND, OR, XOR – ???)

Play with

The perceptron uses a linear activation function, so only linearly separable problems can be solved.
– The famous book “Perceptrons” by Marvin Minsky and Seymour Papert showed that it was impossible for this class of network to learn an XOR function. They conjectured (incorrectly!) that a similar result would hold for a perceptron with three or more layers. The often-cited Minsky/Papert text caused a significant decline in interest in and funding of neural network research, and it took about ten more years until neural network research experienced a resurgence in the 1980s.

Play with

Multilayer perceptron

Nonlinear activation functions
So far we have met the threshold and linear activation functions. They are linear, and consequently the problems they can solve must also be linear. Nonlinearity is introduced by using nonlinear activation functions.

(figures: logistic (sigmoid, unipolar); tanh (bipolar))
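The two activation functions from the figures, written out (a small illustrative sketch; any steepness parameter is omitted for simplicity):

```python
import numpy as np

def logistic(z):
    """Unipolar sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh_act(z):
    """Bipolar activation: output in (-1, 1)."""
    return np.tanh(z)
```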

Multilayer perceptron
MLP – the most famous type of neural network
(figure: input layer, hidden layer, output layer)

(figure: input layer, hidden layer, output layer) – such a network is called three-layer or two-layer, depending on whether the input layer is counted.

Backpropagation training algorithm
How to train an MLP? With a gradient descent type of algorithm called backpropagation. Training works in two passes:
forward pass
– present a training sample to the neural network
– compare the network's output to the desired output from that sample
– calculate the error in each output neuron

backward pass
– compute the amount ∆w by which the weights should be updated
– first calculate the gradient for the hidden-to-output weights
– then calculate the gradient for the input-to-hidden weights (the hidden-to-output gradient is needed to calculate the input-to-hidden gradient)
– update the weights in the network
It is a gradient descent method:
– a learning rate β is used
– it can get trapped in local minima

(figure: the input signal propagates forward, the error propagates backward)
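A compact sketch of one backpropagation step for an MLP with a single hidden layer and logistic activations, in NumPy (all names and the squared-error choice are illustrative assumptions; biases are omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, d, W1, W2, beta=0.1):
    """One forward + backward pass for a 1-hidden-layer MLP.
    x: input vector, d: desired output vector,
    W1: hidden weights (n_hidden x n_in), W2: output weights (n_out x n_hidden)."""
    # forward pass
    h = sigmoid(W1 @ x)                            # hidden activations
    y = sigmoid(W2 @ h)                            # network output

    # backward pass: gradients of 0.5 * ||d - y||^2
    delta_out = (y - d) * y * (1 - y)              # output-layer error term
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error propagated back to the hidden layer

    # weight updates (gradient descent with learning rate beta)
    W2 = W2 - beta * np.outer(delta_out, h)
    W1 = W1 - beta * np.outer(delta_hid, x)
    return W1, W2
```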

online learning vs. batch learning
– In online learning the weights are changed after each presentation of a training pattern; the weights may oscillate.
– In batch learning, the total gradient for the whole epoch is the sum of the gradients for each of the n patterns. Batch learning improves stability by averaging.
Another averaging approach that provides stability is the momentum.

This method basically tags the average of the past weight changes onto the new weight increment at every weight change, thereby smoothing out the net weight change. The momentum μ is between 0 and 1. It indicates the relative importance of the past weight change ∆w_{m-1} for the new weight increment ∆w_m. Thus, the current gradient and the past weight change together decide how large the new weight increment will be.
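The slide's momentum equation is missing from the transcript; a form consistent with the description above and with the μ = 0 / μ = 1 behaviour discussed next is, as an assumption:

```latex
\Delta w_m = -(1-\mu)\,\beta\,\frac{\partial E}{\partial w} + \mu\,\Delta w_{m-1}
```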

For example, if μ is equal to 0, momentum does not apply at all, and the past history has no place. If μ is equal to 1, the current change is totally based on the past change. Values of μ between 0 and 1 result in a combined response to weight change.

The equation is recursive, so the influence of the past weight change incorporates that of all previous weight changes as well. Momentum can be used with both batch and online learning. In batch learning, it can provide further stability to the gradient descent. Momentum can be especially useful in online learning to minimize oscillations in error after the presentation of each pattern.

Delta-bar-delta
In backpropagation the same learning rate β applies to all of the weights. More flexibility can be achieved if the learning rate is adjusted for each weight independently. This method is called delta-bar-delta (TurboProp). Each weight has its own learning rate, and the rates are adjusted as follows:
– if the direction in which the error decreases at the current point is the same as the direction in which the error has been decreasing recently, the learning rate is increased
– if the opposite is true, the learning rate is decreased
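A schematic sketch of this per-weight rate adaptation (the constants, and the use of a running-average gradient as the "recent direction", are illustrative assumptions):

```python
import numpy as np

def update_rates(rates, grad, avg_grad, kappa=0.01, phi=0.5, theta=0.7):
    """Delta-bar-delta style adaptation: one learning rate per weight.
    grad: current gradient, avg_grad: running average of past gradients (the "delta bar")."""
    same_dir = grad * avg_grad > 0                            # error keeps decreasing in the same direction
    rates = np.where(same_dir, rates + kappa, rates * phi)    # grow additively, shrink multiplicatively
    avg_grad = theta * avg_grad + (1 - theta) * grad          # update the running average
    return rates, avg_grad
```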

Second order methods
The curvature of the error surface can be used to guide the descent down the error surface more efficiently.

The gradient is a vector pointing in the direction of the greatest rate of increase of the function. How fast does this rate of increase change in a small neighbourhood? This is given by the derivative of the gradient – the derivative of a derivative, i.e. the second derivative. The second derivatives with respect to all pairs of weights form the Hessian matrix.
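For concreteness (standard definitions, not reproduced from the slides): the Hessian of the error E with respect to the weights, and the second-order (Newton) weight update it enables, are

```latex
H_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j}, \qquad
\Delta \mathbf{w} = -H^{-1} \nabla E
```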

Common methods using the Hessian:
– QuickProp
– Gauss-Newton
– Levenberg-Marquardt (LM)
These methods are an order of magnitude faster (i.e. they reach a minimum in far fewer epochs) than first-order (gradient-based) methods. However, the efficiency is gained at a considerable computational cost:
– computing and inverting the Hessian for large networks with a large number of training patterns is expensive (large storage requirements) and slow.

Bias-variance
Just a small reminder:
– bias (lack of fit, underfitting) – the model does not fit the data well enough, it is not flexible enough (too few parameters)
– variance (overfitting) – the model is too flexible (too many parameters) and fits the noise
– bias-variance tradeoff – improving the generalization ability of the model (i.e. finding the correct amount of flexibility)

Parameters in an MLP: weights
If you use one more hidden neuron, by how much does the number of weights increase?
– # input neurons + # output neurons
If the MLP is used for a regression task, be careful! To use an MLP statistically correctly, the number of degrees of freedom (i.e. weights) can't exceed the number of data points.
– Compare to the polynomial regression example from the 2nd lecture.
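A small helper illustrating the weight count for a single-hidden-layer MLP (an illustrative sketch; bias units are counted as extra inputs here, matching the earlier convention):

```python
def n_weights(n_in, n_hidden, n_out):
    """Weights in a 1-hidden-layer MLP, counting bias units as extra inputs."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

# Adding one hidden neuron adds (n_in + 1) + n_out weights,
# i.e. # inputs (counting the bias input) + # outputs.
print(n_weights(3, 5, 2) - n_weights(3, 4, 2))   # -> 6
```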

Improving the generalization of an MLP
Flexibility comes from the hidden neurons. Choose the # of hidden neurons so that neither underfitting nor overfitting occurs. The three most common approaches:
– exhaustive search
– early stopping
– regularization

Exhaustive search
Increase the number of hidden units and monitor the performance on the validation data set.
(figure: validation performance vs. number of neurons)

Early stopping
– a fixed and large number of neurons is used
– the network is trained while testing its performance on a validation set at regular intervals
– the weights at the minimum of the validation error are the correct weights
(figure: error vs. epochs)
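A sketch of the early-stopping loop just described (the training and evaluation routines are passed in as callables; all names and the checking interval are illustrative assumptions):

```python
import copy

def train_with_early_stopping(net, train_one_epoch, validation_error,
                              max_epochs=1000, check_every=5):
    """Early stopping: keep the weights from the epoch with the lowest validation error.
    train_one_epoch(net) trains the network in place; validation_error(net) returns a float."""
    best_err, best_net = float("inf"), copy.deepcopy(net)
    for epoch in range(max_epochs):
        train_one_epoch(net)
        if epoch % check_every == 0:
            err = validation_error(net)
            if err < best_err:
                best_err, best_net = err, copy.deepcopy(net)
    return best_net
```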

Weight decay (regularization)
Idea: keep the growth of the weights to a minimum in such a way that non-important weights are pulled toward zero. Only the important weights are allowed to grow; the others are forced to decay.

This is achieved by minimizing not the MSE alone, but the MSE plus a second term – the regularization term.
– m – number of weights in the network
– δ – regularization parameter – the larger δ is, the more important the regularization
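The regularized error function itself is not in the transcript; with the legend above (m weights w_i, parameter δ), a common weight-decay form – given here as an assumption about the missing formula – is:

```latex
E = \mathrm{MSE} + \delta \, \frac{1}{m} \sum_{i=1}^{m} w_i^{2}
```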

Network pruning
Both early stopping and weight decay use all the weights in the NN; they do not reduce the complexity of the model. Network pruning reduces complexity by keeping only the essential weights/neurons. There are several pruning approaches, e.g.:
– optimal brain damage (OBD)
– optimal brain surgeon (OBS)
– optimal cell damage (OCD)

OBD
Based on sensitivity analysis – systematically change the parameters of a model to determine the effects of such changes. Weights that are not important for the input-output mapping are removed. The importance (saliency) of a weight is measured by the cost of setting that weight to zero.
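In the original OBD paper (LeCun et al.) this cost is approximated using the diagonal of the Hessian; the saliency of weight w_i is given below as a reference formula, not taken from these slides:

```latex
s_i = \frac{1}{2} \, H_{ii} \, w_i^{2}
```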

How to perform OBD?
1. Train a flexible network in the normal way (i.e. use early stopping, weight decay, …).
2. Compute the saliency of each weight. Remove the weights with small saliencies.
3. Train the reduced network again with the kept weights, initializing the training with their values from the previous step.
4. Repeat from step 1.