1
COMP24111: Machine Learning and Optimisation
Chapter 5: Neural Networks and Deep Learning Dr. Tingting Mu
2
Outline
Understand the perceptron algorithm.
Understand the multi-layer perceptron.
Understand the back-propagation method.
Understand the concept of deep learning.
3
Neuron Structure
A biological neuron is an electrically excitable cell that processes and transmits information by electro-chemical signalling. Input signals are sent from other neurons; connection strengths determine how the signals are accumulated, and if enough signals accumulate, the neuron fires a signal. An artificial neural network (ANN) neuron simulates this: it receives multiple inputs and generates one output.
4
Single Neuron Model
[Diagram: inputs x1, x2, …, xd with weights w1, w2, …, wd and a bias b feed an adder; an activation function produces the output y.]
An ANN neuron maps multiple inputs [x1, x2, …, xd] to one output y. Basic elements of a typical neuron include:
A set of synapses or connections, each characterised by a weight (strength).
An adder for summing the input signals, weighted by the respective synapses: a = w1x1 + w2x2 + … + wdxd + b.
An activation function f, which squashes the permissible amplitude range of the output signal: y = f(a).
Given d inputs, a neuron is modelled by d+1 parameters (d weights plus the bias).
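As a concrete illustration, here is a minimal Python/NumPy sketch of this single-neuron computation. The function name, the example numbers and the default tanh activation are illustrative choices, not part of the lecture.

```python
import numpy as np

def neuron_output(x, w, b, activation=np.tanh):
    """Single neuron: weighted sum of the inputs plus bias, passed through an activation."""
    a = np.dot(w, x) + b      # adder: w1*x1 + ... + wd*xd + b
    return activation(a)      # activation squashes the output range

# Example with d = 3 inputs: the neuron has d + 1 = 4 parameters (3 weights + 1 bias).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron_output(x, w, b))
```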
5
Types of Activation Function
Identity function: f(a) = a.
Threshold function: f(a) = +1 if a ≥ 0, and −1 otherwise.
Sigmoid function ("S"-shaped curve): the logistic sigmoid f(a) = 1 / (1 + e^(−a)), with output in (0, 1); the related tanh(a) ranges from −1 to +1.
Rectified linear unit (ReLU): f(a) = max(0, a).
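The following is a small NumPy sketch of these activation functions, using the ±1 threshold convention from the slide; the function names and test values are illustrative.

```python
import numpy as np

def identity(a):
    return a

def threshold(a):
    # Sign-style threshold: +1 if a >= 0, otherwise -1.
    return np.where(a >= 0, 1.0, -1.0)

def sigmoid(a):
    # Logistic sigmoid, output in (0, 1).
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    # Rectified linear unit, output in [0, inf).
    return np.maximum(0.0, a)

a = np.linspace(-3, 3, 7)
for f in (identity, threshold, sigmoid, np.tanh, relu):
    print(f.__name__, f(a))
```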
6
The Perceptron Algorithm
When the activation function is set to the identity function, the single neuron model becomes the linear model we learned in previous chapters: the neuron weights and bias are equivalent to the coefficient vector of the linear model. When the activation function is set to the threshold function, the model is still linear, and it is known as the perceptron of Rosenblatt (1962). The perceptron algorithm is for two-class classification, and it occupies an important place in the history of pattern recognition algorithms.
7
The Perceptron Algorithm
Parameters stored in w are optimised by minimising an error function called the perceptron criterion. If a sample is correctly classified, it contributes an error penalty of zero; if it is misclassified, it contributes the penalty −y_n (wᵀx_n), which is positive when the label y_n ∈ {−1, +1} disagrees with the sign of wᵀx_n. Summing over the set M of misclassified samples gives the perceptron criterion E_P(w) = −Σ_{n∈M} y_n (wᵀx_n). We want to reduce the number of misclassified samples, and therefore to minimise this error penalty.
8
The Perceptron Algorithm
Stochastic gradient descent is used for training. The gradient is estimated using a misclassified sample (x_n, y_n): ∇E_P(w) ≈ −y_n x_n. The weight update equation is w(t+1) = w(t) + η y_n x_n, where η is the learning rate. The update uses a misclassified sample in the current iteration!
9
Training Algorithm
Perceptron Training:
Initialise the weights (stored in w(0)) to random numbers in the range −1 to +1.
For t = 1 to NUM_ITERATIONS:
For each training sample (xi, yi):
Calculate the activation using the current weights (stored in w(t)).
Update the weights (stored in w(t+1)) by the learning rule, using only misclassified samples, as in the code sketch below.
What weight changes do the following cases produce?
If the true label = −1 and the activation output = −1: no change.
If the true label = +1 and the activation output = +1: no change.
If the true label = −1 and the activation output = +1: add −η xi (i.e. subtract).
If the true label = +1 and the activation output = −1: add +η xi.
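Below is a minimal NumPy sketch of this training loop, assuming labels in {−1, +1}, a fixed learning rate eta, and a bias handled by appending a constant 1 to each sample; the function name and the toy data are illustrative.

```python
import numpy as np

def train_perceptron(X, y, num_iterations=100, eta=0.1, seed=0):
    """Perceptron training with labels in {-1, +1}.

    X: (n_samples, d) feature matrix; y: (n_samples,) labels.
    The bias is handled by appending a constant 1 to each sample.
    """
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])           # append bias input
    w = rng.uniform(-1, 1, Xb.shape[1])                      # initialise weights in [-1, +1]
    for t in range(num_iterations):
        for xi, yi in zip(Xb, y):
            activation = 1.0 if np.dot(w, xi) >= 0 else -1.0  # threshold output
            if activation != yi:                               # update only on mistakes
                w += eta * yi * xi                             # add +eta*x or -eta*x
    return w

# Toy usage: a linearly separable 2D problem.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y))
```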
10
One neuron can be used to construct a linear model. The input nodes x1, x2, …, xd feed, with weights w1, w2, …, wd and bias b, directly into the adder and activation that produce the output y. It has only one layer of such connections (the input layer), and is called a single-layer perceptron.
What can many connected neurons achieve?
11
Adding Hidden Layers!
The presence of hidden layers allows the network to formulate more complex functions. Each hidden node finds a partial solution to the problem, which is combined in the next layer.
[Diagram: an input layer x1, x2, …, xd feeding one or two hidden layers, which in turn feed the output y.]
12
Multilayer Perceptron
A multilayer perceptron (MLP), also called a feedforward artificial neural network, consists of at least three layers of nodes (input, hidden and output layers).
The number of neurons in the input layer is equal to the number of input features.
The number of hidden layers is a hyperparameter to be set.
The numbers of neurons in the hidden layers are also hyperparameters to be set.
The number of neurons in the output layer depends on the task to be solved.
13
Multilayer Perceptron
An MLP example with one hidden layer consisting of four hidden neurons. It takes 9 input features and returns 2 output variables (9 neurons in the input layer, 2 neurons in the output layer).
Output of the j-th neuron in the hidden layer (j = 1, 2, 3, 4), for the n-th training sample: z_j(n) = f( Σ_{i=1}^{9} W_{ij}^{(h)} x_i(n) + b_j^{(h)} ), i.e. 9+1 = 10 weights per hidden neuron.
Output of the k-th neuron in the output layer (k = 1, 2), for the n-th training sample: y_k(n) = f( Σ_{j=1}^{4} W_{jk}^{(o)} z_j(n) + b_k^{(o)} ), i.e. 4+1 = 5 weights per output neuron.
10 × 4 = 40 weights between the input and hidden layers, and 5 × 2 = 10 weights between the hidden and output layers: a total of 40 + 10 = 50 weights to be optimised in this neural network (including bias parameters).
Information flows feed-forward when computing the output variables.
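A hedged NumPy sketch of the feed-forward pass for this 9-4-2 network follows, assuming sigmoid activations in both layers; the random weights are placeholders, and the shapes reproduce the 40 + 10 = 50 parameters counted above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W_h, b_h, W_o, b_o):
    """Forward pass through one hidden layer and one output layer."""
    z = sigmoid(W_h @ x + b_h)     # hidden layer: z_j = f(sum_i W_ij x_i + b_j)
    y = sigmoid(W_o @ z + b_o)     # output layer: y_k = f(sum_j W_jk z_j + b_k)
    return y, z

rng = np.random.default_rng(0)
W_h, b_h = rng.standard_normal((4, 9)), rng.standard_normal(4)   # (9+1)*4 = 40 parameters
W_o, b_o = rng.standard_normal((2, 4)), rng.standard_normal(2)   # (4+1)*2 = 10 parameters
x = rng.standard_normal(9)                                        # one 9-feature sample
y, z = mlp_forward(x, W_h, b_h, W_o, b_o)
print(z, y)   # 4 hidden activations, 2 outputs
```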
14
Neural Network Training
Neural network training is the process of finding the optimal setting of the neural network weights.
15
Neural Network Training
Neural network training is the process of finding the optimal setting of the neural network weights. A neural network can be viewed as a powerful feature extractor: its hidden layers transform the original features x into new features φ(x), an effective representation of the sample that helps the prediction task. The prediction layer (the new output layer) then computes the loss, Loss(φ(x)).
16
Neural Network Training
Treating φ(x) as the new features and using these as the input of a linear model, all the objective functions we learned in previous chapters can be used to optimise the neural network weights:
Minimising the sum-of-squares error (least squares model, Chapter 2).
Minimising a mixture of the sum-of-squares error and a regularisation term (regularised least squares model, Chapter 2).
Maximising the (log) likelihood, or minimising the cross-entropy error (logistic regression, Chapter 3).
Optimising a mixture of the hinge loss and the separation margin (SVM, Chapter 4).
Training (optimisation) methods: stochastic gradient descent, mini-batch gradient descent. A sketch of such a training loop is given below.
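To make this concrete, here is a generic mini-batch gradient-descent sketch in NumPy; the gradient function is passed in so any of the objectives above could be plugged in. The squared-error gradient shown, and all variable names, are illustrative assumptions rather than the course's reference implementation.

```python
import numpy as np

def minibatch_sgd(grad_fn, w0, X, y, eta=0.01, batch_size=32, epochs=10, seed=0):
    """Generic mini-batch SGD: grad_fn(w, X_batch, y_batch) returns the gradient
    of the chosen objective (sum-of-squares, cross-entropy, hinge, ...) w.r.t. w."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)                    # shuffle samples each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w -= eta * grad_fn(w, X[idx], y[idx])     # step against the batch gradient
    return w

# Illustrative gradient of the sum-of-squares error for a linear model on new features Phi.
def squared_error_grad(w, Phi, y):
    return 2.0 * Phi.T @ (Phi @ w - y) / len(y)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((100, 5))                   # new features phi(x) for 100 samples
y = Phi @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)
print(minibatch_sgd(squared_error_grad, np.zeros(5), Phi, y, eta=0.05, epochs=50))
```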
17
Example: Two-class classification
Convert the output of the neural network into a single probability value using the logistic sigmoid function: the hidden layers map the original features x = [x1, …, xd] to new features z = φ(x) = [z1, …, zD], and the prediction layer computes the probability that the sample belongs to a class as σ(wᵀz). Optimise the neural network weights and the prediction parameters w by likelihood maximisation (maximising the chance of observing the data).
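A minimal sketch of this sigmoid prediction layer, assuming the new features z = φ(x) are already computed by the network; the weights, feature values and the cross-entropy helper are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(z, w, b):
    """Probability that the sample belongs to the positive class, given features z = phi(x)."""
    return sigmoid(np.dot(w, z) + b)

def binary_cross_entropy(p, t):
    """Negative log-likelihood for a single sample with label t in {0, 1}."""
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

z = np.array([0.2, -0.7, 1.1])          # illustrative new features phi(x)
w, b = np.array([0.5, -0.3, 0.8]), 0.1
p = predict_proba(z, w, b)
print(p, binary_cross_entropy(p, t=1))
```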
18
Example: Multi-class classification
Convert the output of the neural network into a set of c probability values using the softmax function: the hidden layers map the original features x = [x1, …, xd] to new features z = φ(x) = [z1, …, zD], and the prediction layer applies the softmax function to the scores w1ᵀz, …, wcᵀz to produce one probability per class (e.g. red, green, purple). Optimise the neural network weights and the softmax function parameters w1, …, wc by likelihood maximisation.
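An analogous sketch for the c-class case, stacking the class parameters w1, …, wc as rows of a matrix W and applying the softmax; the example features, weights and class names are illustrative.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def predict_class_probs(z, W, b):
    """W has one row per class; returns c probabilities that sum to 1."""
    return softmax(W @ z + b)

z = np.array([0.2, -0.7, 1.1])                     # new features phi(x)
W = np.array([[0.5, -0.3, 0.8],                    # w1 (e.g. "red")
              [-0.2, 0.9, 0.1],                    # w2 (e.g. "green")
              [0.3, 0.3, -0.6]])                   # w3 (e.g. "purple")
b = np.zeros(3)
print(predict_class_probs(z, W, b))
```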
19
Backpropagation
Technically, backpropagation calculates the gradient of the loss function with respect to the layers of neural network weights. It uses the chain rule to iteratively compute the gradients for each layer. It can be viewed as a process of calculating the error contribution of each neuron after processing a batch of training data.
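As a hedged illustration, here is a compact backpropagation sketch for a one-hidden-layer network with sigmoid activations and a sum-of-squares loss, applying the chain rule layer by layer; the network shape (9 inputs, 4 hidden, 2 outputs) mirrors the earlier example, and all names are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_one_hidden(x, t, W_h, b_h, W_o, b_o):
    """Gradients of the sum-of-squares error 0.5*||y - t||^2 for one sample."""
    # Forward pass.
    z = sigmoid(W_h @ x + b_h)                 # hidden activations
    y = sigmoid(W_o @ z + b_o)                 # network outputs
    # Backward pass (chain rule, layer by layer).
    delta_o = (y - t) * y * (1 - y)            # error signal at the output layer
    grad_W_o = np.outer(delta_o, z)
    grad_b_o = delta_o
    delta_h = (W_o.T @ delta_o) * z * (1 - z)  # error propagated back to the hidden layer
    grad_W_h = np.outer(delta_h, x)
    grad_b_h = delta_h
    return grad_W_h, grad_b_h, grad_W_o, grad_b_o

rng = np.random.default_rng(0)
x, t = rng.standard_normal(9), np.array([1.0, 0.0])
W_h, b_h = rng.standard_normal((4, 9)), np.zeros(4)
W_o, b_o = rng.standard_normal((2, 4)), np.zeros(2)
grads = backprop_one_hidden(x, t, W_h, b_h, W_o, b_o)
print([g.shape for g in grads])
```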
20
[Diagram: backpropagation propagates error information backwards through the network, from the prediction layer and the new features z, through the hidden layers, towards the original features x.]
21
Deep Learning
Deep learning refers to techniques for learning using neural networks with more hidden layers. Deep learning is considered a kind of representation (feature) learning technique. Example: AlexNet contains a total of 5 convolutional layers and 3 fully connected layers. The two figures are from Figs. 1.5 and 1.4 of the Deep Learning book (I. Goodfellow, et al., 2016).
22
Popular Neural Networks
Convolutional neural networks (CNN) have neurons arranged in 3 dimensions: width, height and depth. This makes them suitable for processing images: a CNN automatically learns a good feature vector for an image from its pixels. Example applications: NeuralStyle, DeepDream.
Recurrent neural networks (RNN) are especially useful for learning from sequential data: each neuron can use its internal memory to maintain information about the previous input. This makes them suitable for processing natural language, speech, music, etc. Example application: PoemGenerator.
Other architectures are suitable for processing videos, and for joint language/text and image learning, e.g. NeuralTalk, TalkingMachines. Another example is a system that learns jointly from images, sound, etc.
23
Goodbye! Enjoy your reading week! See you in revision week.