
1 Deep neural networks (DNNs)

2 Conventional and deep networks
What is the difference between a conventional and a deep network?
- The structural difference is that deep networks have more hidden layers (5-10 instead of 1-2, nowadays even more)
- Sounds simple – so why did it take so long?
  - The training of deep networks requires new algorithms; the first one was DBN pre-training (2006)
  - The advantages of deep learning show up only for huge amounts of data, which were not available in the '80s
  - Training deep networks is slow; this is now solved by GPUs
- All three factors – new algorithms, access to a lot of data, the invention of GPUs – contributed to the current success of deep learning

3 Why are deep networks more efficient?
We saw a proof earlier that we can solve any task with a network of 2 hidden layers
- However, this is only true if we have infinitely many neurons, an infinite amount of training data, and a training algorithm that guarantees a global optimum
- With a fixed, finite number of neurons, it is more efficient to arrange them into several smaller layers rather than into 1-2 "wide" layers
- This allows the network to process the data hierarchically
- In image recognition tasks the higher layers indeed learn more and more abstract notions: pixel → edge → mouth, nose → face → human, …

4 Training deep neural networks
Training deep networks is more difficult than training "shallow" networks
- Backpropagation propagates the error from the output to the hidden layers
- The more layers we go back, the larger the chance that the gradient "vanishes", so the deeper layers will not learn
Solution approaches for training deep networks:
- Pre-training with unlabelled data (DBN pre-training using the contrastive divergence (CD) cost function)
  - This was the first solution; it is mathematically involved and slow
- Building and training the network by adding layer after layer
  - Much simpler, requires only the backpropagation algorithm
- Using newer types of activation functions
  - The simplest solution, we will look at this first
- Training very deep networks requires further tricks (batch normalization, highway networks, etc.)

5 Modifying the activation function
The sigmoid activation function has been used for 30 years
- The tanh activation function is equivalent to it (sigmoid returns values in [0,1], tanh returns values in [-1,1])
- Problem: the two ends are very "flat" → the derivative is practically 0 → the gradient may easily vanish
- To avoid this, the rectifier activation function was proposed first, and since then many new activation functions have been recommended

6 The rectifier activation function
In comparison with the tanh activation function:
- Formally: relu(x) = max(0, x)
- For positive input the derivative is always 1, it never vanishes
- For negative input the output and the derivative are both 0
- The nonlinearity is very important! (compare it with a linear activation)
- It works fine in spite of the 0 derivative for negative input, but improved versions have been proposed for which the derivative is never 0
- Networks built out of "rectified linear" (ReLU) neurons are currently the de facto standard in deep learning
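To make the comparison concrete, here is a minimal NumPy sketch (not from the slides) of the rectifier and its derivative next to tanh; the function names are just for illustration:

```python
import numpy as np

def relu(x):
    # Rectifier: 0 for negative input, identity for positive input
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 for positive input, 0 for negative input (never saturates on the positive side)
    return (x > 0).astype(float)

def tanh_grad(x):
    # Derivative of tanh: close to 0 for large |x|, which is what makes the gradient vanish
    return 1.0 - np.tanh(x) ** 2

x = np.array([-5.0, -1.0, 0.5, 5.0])
print(relu(x), relu_grad(x))       # [0. 0. 0.5 5.]  [0. 0. 1. 1.]
print(np.tanh(x), tanh_grad(x))    # tanh saturates near +/-1, its derivative near 0
```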

7 Even newer activation functions
Linear, sigmoid, tanh, ReLU, ELU, SELU, softplus, …
- These sometimes give slightly better results than the ReLU function, but none of them has resulted in a general breakthrough
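As a rough illustration of some of the listed alternatives (a sketch only, not the plots from the slide; the alpha parameter and the SELU constants follow commonly cited values):

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: identity for positive input, smooth negative saturation at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softplus(x):
    # Softplus: a smooth approximation of the rectifier, log(1 + e^x)
    return np.log1p(np.exp(x))

def selu(x, alpha=1.6733, scale=1.0507):
    # SELU: a scaled ELU with constants (rounded here) chosen to keep activations self-normalizing
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```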

8 The restricted Boltzmann machine (RBM)
Very similar to a pair of network layers, but it works with binary values
Training: contrastive divergence (CD)
- It is an unsupervised method (there are no class labels)
- It seeks to reconstruct the input from the hidden representation
- It can be interpreted as an approximation of the Maximum Likelihood cost function
- Iterative, similar to backpropagation
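A rough NumPy sketch of one CD-1 update for a binary RBM may help; the variable names and the learning rate are illustrative assumptions, and a real implementation would add mini-batching, momentum, etc.:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, lr=0.1, rng=np.random.default_rng(0)):
    # Positive phase: hidden activation given the data
    h0_prob = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: reconstruct the input, then re-infer the hidden units
    v1_prob = sigmoid(h0 @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)
    # Contrastive divergence gradient: data statistics minus reconstruction statistics
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
    b_vis += lr * (v0 - v1_prob).mean(axis=0)
    b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
    return W, b_vis, b_hid

rng = np.random.default_rng(1)
v = (rng.random((16, 784)) < 0.5).astype(float)   # a mini-batch of 16 binary input vectors
W, b_v, b_h = cd1_update(v, np.zeros((784, 100)), np.zeros(784), np.zeros(100))
```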

9 Deep Belief Network (DBN)
Deep Belief Network: Restricted Boltzmann machines stacked on top of each other
- Training: the CD algorithm, adding layer after layer

10 Conversion into a deep neural network
DBNs were proposed for the initialization ("pre-training") of deep networks
- After training, the DBN can be converted into a deep network
- The RBMs are turned into conventional sigmoid layers (with the same weights)
- A softmax output layer is added on top
- So we can continue with supervised backpropagation training
In the early papers DBN pre-training was used to overcome the difficulty of backpropagation training
- DBN pre-training provided a good starting point for backpropagation
The current view is that DBN pre-training is no longer necessary
- We usually train on much more training data
- The new activation functions also help a lot
- We have new weight initialization methods and other training tricks (e.g. batch normalization)

11 Cases when DBNs are still useful
Training DBNs using the CD criterion is unsupervised training
- DBN pre-training may still be useful when we have a lot of unlabelled data and only a few labelled examples
The connection between two RBM layers is symmetric
- We can easily reconstruct the input from a hidden representation
- We can easily visualize what hidden representation the network has learned

12 Convolutional neural networks
Classic architecture: the "fully connected" network
- Between two layers, each neuron is connected to all neurons of the other layer
- The order of the inputs plays no role: if we permute the inputs randomly (but using the same permutation for each vector!), the network will attain very similar accuracy
- For many classification tasks the order of the inputs indeed plays no role
  - E.g. our very first example: (fever, joint_pain, cough) → influenza is obviously just as learnable as (joint_pain, cough, fever) → influenza
- But there are tasks where the order (topology) of the features is important
  - The best example is image recognition
  - In this case it is worth using a special network structure → the convolutional network

13 Motivation #1
There are tasks where the order (relation, topology) of the features is important
- Image recognition: we won't see the same picture if we mix up the pixels
- But a fully connected network would achieve the same performance in both cases
- The arrangement of the pixels contains vital information for our brain, but a fully connected network is not able to exploit it
- The pixels form the image together
- The semantic connection between nearby pixels is typically stronger (they form objects together)

14 Motivation #2 A picture typically has a hierarchical structure
From simpler, local building blocks to larger, more complex notions
- The early shape recognition experiments using ANNs considered such complex tasks to be hopeless
- They tried only simpler tasks like character recognition
- Now we can recognize complex "real" images
- Convolutional networks greatly contributed to this success

15 Motivation #3
Certain research results show that the human brain also processes images hierarchically
- Neurons in the visual cortex fire when seeing certain simple graphical primitives, such as edges with a certain orientation
- Consider the "Thatcher illusion": for the upside-down image our brain concludes that the mouth, nose and eyes seem to be correct and in place (they form a face together), but it does not notice the error in the fine details

16 Convolutional neural networks
Based on the above motivations, the convolutional neural network
- Processes the input hierarchically; as a consequence, it will necessarily be deep
- The input of each neuron is local, so it focuses on a small block of the image
- Going upwards, neurons in the higher layers cover larger and larger parts of the image, but with decreasing resolution (fine details count less)
- The network will be less sensitive to the exact location of objects (this is where the convolution operation will help)
The main application area of convolutional networks is image recognition, but they may be applied in any other area where the input has a hierarchical structure (for example, they are also used in speech recognition)

17 Building blocks of convolutional networks
The typical architecture of a convolutional neural network:
- Convolutional neurons (also known as filters)
- Pooling operation
- These two are repeated several times
- On top, the network usually has some fully connected layers and a softmax output layer
The network can be trained with the usual backpropagation algorithm (adjusted to the convolution and pooling steps)

18 The convolution operation
Convolutional neurons are very similar to standard neurons. The main differences:
- Locality: they process only a small part of the input image
- Convolution: the same neuron is evaluated at many different locations of the image (as these evaluations use the same weights, this property is also known as "weight sharing")
The operation performed by these neurons is very similar to the filters known from image processing, which is why they are sometimes called filters
- But here we also apply a nonlinear activation function to the filter output
- And the parameters (weights) are tuned automatically, not specified manually

19 The convolution operation
An example of what a convolutional neuron does
- Input image, filter (the weight matrix of the neuron), and the result
- The evaluation can be performed at every position, so the result is a matrix of the same size as the input
- But the processed range may also be restricted
  - Stride: the step size of the filter. E.g. stride 2 means that we skip every second position
  - Zero padding: fitting the filter on the edge pixels may require "padding" the image (with zeros) – this is a design decision
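A minimal sketch of the operation described above, sliding a single filter over an image with a configurable stride and optional zero padding (plain NumPy, no deep learning library; the example filter is an assumption):

```python
import numpy as np

def conv2d(image, kernel, stride=1, zero_pad=0):
    # Optionally pad the image with zeros so the filter also fits on the border pixels
    if zero_pad > 0:
        image = np.pad(image, zero_pad)
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights (kernel) are reused at every position: "weight sharing"
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.arange(36.0).reshape(6, 6)
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)               # a simple vertical-edge detector
print(conv2d(img, edge_filter, stride=2, zero_pad=1).shape)  # (3, 3): stride 2 halves the resolution
```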

20 The convolution operation
We expect the convolutional filters to learn abstract notions (e.g. to be able to detect an edge or a nose)
- For this we usually need more neurons, so we train a set of neurons for the same task in parallel
- In this example we use 3 neurons to process each position, resulting in 3 output matrices; we call this the "depth" of the output
- We train further layers on the output of the given layer, so we interpret the layer as a feature extractor for the subsequent layer (hopefully extracting more and more abstract features as we go up)
- This is why it is called a "feature map"
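A sketch of the same idea with a bank of filters: each filter yields one output matrix, and stacking them gives the depth of the feature map (3 filters here, as in the example; scipy's correlate2d stands in for the convolution and is an implementation choice, not part of the slide):

```python
import numpy as np
from scipy.signal import correlate2d

def conv_layer(image, kernels):
    # Each filter produces one output matrix ("feature map"); stacking the results
    # along a third axis gives the "depth" of the layer's output
    maps = [correlate2d(image, k, mode="same") for k in kernels]
    return np.maximum(np.stack(maps, axis=-1), 0.0)    # ReLU nonlinearity on the filter outputs

rng = np.random.default_rng(0)
filters = rng.standard_normal((3, 3, 3))                # 3 filters of size 3x3 (weights to be learned)
out = conv_layer(np.arange(36.0).reshape(6, 6), filters)
print(out.shape)                                        # (6, 6, 3): height x width x depth
```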

21 The pooling operation
As we go up, we don't want to preserve the fine details
- E.g. if the "nose detector" found a nose, we don't want to keep its precise position and shape in the higher layers
- This is what the pooling step does: in a local neighborhood it pools the output values, for example by taking their maximum
Advantages:
- Shift invariance within the pooling region (no matter where we found the nose, the maximum will be the same)
- We gradually decrease the output size (downsampling) – reducing the number of parameters is always useful
- Going up, we cover larger and larger areas with filters of the same size – hierarchical processing
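A minimal sketch of non-overlapping 2x2 max pooling (the block size and the example matrix are illustrative):

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Non-overlapping pooling: keep only the maximum of each size x size block,
    # which halves the resolution (for size=2) and gives local shift invariance
    h, w = feature_map.shape[0] // size, feature_map.shape[1] // size
    blocks = feature_map[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3))

fm = np.array([[1., 3., 2., 0.],
               [5., 6., 1., 2.],
               [0., 1., 9., 4.],
               [2., 3., 7., 8.]])
print(max_pool(fm))   # [[6. 2.]
                      #  [3. 9.]]
```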

22 Hierarchic processing
We stack many convolution + pooling steps on top of each other
- We expect the higher layers to extract more and more abstract features
- This is more or less true (example from a face recognition network)
- The final classification is performed by fully connected layers

23 Summary The advantages and drawbacks of convolutional networks
Advantages of convolution: local processing of input blocks using the same weights – fewer parameters, shift invariance
- Drawback: slower computation than with fully connected nets
Advantages of pooling: parameter reduction, shift invariance, hierarchical processing from local to larger context
- Drawback: fine details are lost, which would be useful in certain cases
Advantages of hierarchical processing: complex pictures have a hierarchical structure, so it makes sense to process them hierarchically
- Drawback: the network will inevitably be deep, and training deep networks is problematic

24 Examples
Complex image recognition task: multiple objects on the same image
- ImageNet database: 1.2 million high-resolution images, 1000 target labels
- The network is allowed 5 guesses; the answer is considered correct if the correct label is among them
- Before convolutional networks the smallest error was 26%

25 More examples
First convolutional network (2012): 16% error → now: <5%

26 Modelling time series with recurrent neural networks
So far we assumed that examples which follow each other are independent
- This is true for most recognition tasks, e.g. image recognition
- But there are tasks where the order of the examples carries vital information
- This typically occurs when we model time series, e.g. speech recognition, language processing, handwriting recognition, video analysis, stock exchange rates, …
The question is how to modify the network so that it takes the neighbors of an input vector into consideration
- Feed-forward network over several vectors: time-delay neural network
- Recurrent neural network
- Recurrent neural network with memory: long short-term memory network

27 Using neighboring input vectors
The network that processes one input vector looks like this (a line corresponds to full connection)
- We can easily modify it to process more than one input vector
- No modification is required in the network structure
- Drawback 1: the input size greatly increases
- Drawback 2: the input context is larger, but still finite
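A small sketch of this input-side trick: each vector is replaced by the concatenation of itself and its +/- 2 neighbors, while the network itself is untouched (the context size and the edge handling are illustrative choices):

```python
import numpy as np

def splice(frames, context=2):
    # Concatenate each frame with its +/- context neighbors; the edges are handled
    # by repeating the first / last frame (one common, simple choice)
    padded = np.concatenate([frames[:1].repeat(context, axis=0),
                             frames,
                             frames[-1:].repeat(context, axis=0)])
    return np.concatenate([padded[i:i + len(frames)] for i in range(2 * context + 1)], axis=1)

x = np.random.default_rng(0).standard_normal((100, 40))   # 100 input vectors of 40 features
print(splice(x, context=2).shape)                         # (100, 200): the input size grows 5x
```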

28 Time-Delay Neural Network
Processes several neighboring input vectors
- But (at the lower level) we perform the same processing on each vector
- The results are combined only at a higher level
- The 5 blocks of the hidden layer use the same weights: "weight sharing"
- Advantage: we can increase the input size without increasing the number of processing neurons
- If the size of the hidden layer is relatively small, then the input size of the output layer grows only slowly
- Of course, both the lower and the upper processing parts may consist of several layers
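A sketch of the TDNN forward pass under these assumptions: one shared weight matrix processes each of the 5 neighboring vectors, and only the concatenated hidden activations reach the upper part (all sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, n_frames = 40, 30, 10, 5

W_h = rng.standard_normal((n_in, n_hidden)) * 0.1   # one weight matrix, shared by all 5 blocks
W_o = rng.standard_normal((n_frames * n_hidden, n_out)) * 0.1

def tdnn_forward(frames):
    # Lower level: the SAME weights process each neighboring vector ("weight sharing")
    hidden = [np.tanh(f @ W_h) for f in frames]
    # Upper level: the results are combined only here
    return np.concatenate(hidden) @ W_o

frames = rng.standard_normal((n_frames, n_in))
print(tdnn_forward(frames).shape)    # (10,)
```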

29 Time-Delay Neural Network 2
The TDNN is a feedforward network
- Backpropagation can be used as before
- Backpropagation through the "weight sharing" is the only complication
- Evaluation ("forward pass"): similar to a fully connected net, but the same weights are used at many positions
- Training ("backward pass"): the error values obtained along the different paths belong to the same weights, so they must be summed before the update
- The TDNN is a close relative of convolutional neural networks

30 Recurrent neural networks (RNN)
We introduce real recurrent connections
- So the input consists not only of the actual input vector, but also of the previous output
- We usually feed back the hidden layer rather than the output layer
- The network now sees its previous hidden states, so it is like "adding memory" to the network
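A minimal sketch of one recurrent step, where the hidden layer receives both the current input and its own previous value (generic weight names, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 8, 16
W_x = rng.standard_normal((n_in, n_hidden)) * 0.1      # input  -> hidden
W_h = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # hidden -> hidden (the recurrent connection)
b = np.zeros(n_hidden)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input AND the previous hidden state
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

h = np.zeros(n_hidden)                                  # h_{t-1} must be initialized somehow
for x_t in rng.standard_normal((20, n_in)):             # process a sequence of 20 vectors, left to right
    h = rnn_step(x_t, h)
print(h.shape)                                          # (16,)
```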

31 Backpropagation through time
How can we evaluate an RNN?
- From left to right, vector by vector; we cannot skip vectors
- In the first step h_{t-1} must be initialized somehow
How can we train an RNN?
- Theoretically, in each step we need the previous h_{t-1} → infinite recursion
- In practice, however, any training data set is finite, and we may also cut it into chunks artificially
- This way the network can be "unfolded" in time and trained using backpropagation
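A sketch of the unfolding idea with truncation: the finite sequence is cut into chunks, each chunk is unrolled into ordinary feed-forward steps, and the hidden state is carried over between chunks (the chunk length of 10 is an arbitrary illustration; the gradient computation itself is only indicated in the comments):

```python
import numpy as np

rng = np.random.default_rng(0)
W_x = rng.standard_normal((8, 16)) * 0.1
W_h = rng.standard_normal((16, 16)) * 0.1

sequence = rng.standard_normal((1000, 8))   # a long (but finite) training sequence
chunk_len = 10                              # artificial cut: gradients flow only within a chunk
h = np.zeros(16)                            # initial hidden state

for start in range(0, len(sequence), chunk_len):
    chunk = sequence[start:start + chunk_len]
    states = []
    for x_t in chunk:                       # "unfold" the chunk into chunk_len ordinary layers
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    # Here backpropagation would run backwards through the unfolded copies,
    # summing the gradients of the shared weights W_x and W_h before the update;
    # h is carried over to the next chunk, but no gradient flows across the cut.
```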

32 Backpropagation through time
What are the difficulties of "backpropagation through time"?
- The "copies" at different positions have the same weights! The errors must be collected, just as we saw in the case of "weight sharing"
- The training involves very long paths along time
- Theoretically, the actual output is influenced by all previous inputs
- In practice the gradients along the long paths may vanish or blow up
- The training of RNNs has instability problems; RNNs usually fail to learn very long-term dependencies
- These are the same problems as with the training of very deep networks

33 Long short-term memory (LSTM) nets
We want to allow the neurons to learn which previous inputs are important and which may be forgotten
- This would also solve the problem of learning long-term relations
- We introduce an inner state that acts as a memory and is less sensitive to backpropagation
- The information flows through several paths
- A "gate" controls when the memory is deleted ("forget gate"), what to store from the actual input ("input gate"), and what contributes to the output ("output gate")
- This new model is the LSTM neuron; we replace our previous recurrent neurons with these

34 RNN vs. LSTM neuron
RNN (with tanh activation) vs. the LSTM cell

35 How the gates operate
The gating weights multiply each component of the input
- The weights are kept between 0 and 1 using the sigmoid function
- The optimal weights are learned
- Example with 0 and 1 gating weights: a weight of 0 completely blocks a component, a weight of 1 passes it through unchanged

36 LSTM neuron
Forget gate: which components should be discarded from the memory C
Input gate: which input components should be stored in C

37 LSTM neuron
The cell state ("memory") C is updated using the forget gate and the input gate
The new value of the hidden state h is calculated from the updated cell state and the output gate (see the sketch below)
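A NumPy sketch of these updates in the standard textbook form (the slide's own figures and formulas are not reproduced; weight shapes and names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    # One weight matrix per gate/candidate, applied to the concatenation [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(z @ W["f"] + b["f"])          # forget gate: what to discard from the memory C
    i = sigmoid(z @ W["i"] + b["i"])          # input gate: what to store from the current input
    o = sigmoid(z @ W["o"] + b["o"])          # output gate: what to expose in the output
    C_tilde = np.tanh(z @ W["c"] + b["c"])    # candidate values for the memory
    C = f * C_prev + i * C_tilde              # cell state update: forget old + add new
    h = o * np.tanh(C)                        # new hidden state
    return h, C

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = {k: rng.standard_normal((n_hid + n_in, n_hid)) * 0.1 for k in "fioc"}
b = {k: np.zeros(n_hid) for k in "fioc"}
h, C = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h.shape, C.shape)   # (16,) (16,)
```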

38 LSTM summary

39 LSTM network
We can construct a network from LSTM cells in the same way as from standard recurrent neurons
- Of course, training is slower and more complicated
- But for most tasks it gives better results than a standard RNN
Variants of LSTMs:
- There are variants with more paths, e.g. LSTM with "peephole" connections
- Some people try to simplify LSTMs, e.g. the GRU (gated recurrent unit)
- This is one of the most active current research topics

40 Bidirectional recurrent networks
We process the input in both directions
- There is a hidden layer for each direction (they are independent)
- The output layer combines the two hidden layers
- Advantage: it takes both the earlier and the later context into consideration
- Disadvantage: it cannot operate in real time
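A sketch of the bidirectional scheme: two independent recurrent passes over the same sequence, one left-to-right and one right-to-left, whose hidden states are combined (here simply concatenated) before the output layer (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
Wx_f, Wh_f = rng.standard_normal((n_in, n_hid)) * 0.1, rng.standard_normal((n_hid, n_hid)) * 0.1
Wx_b, Wh_b = rng.standard_normal((n_in, n_hid)) * 0.1, rng.standard_normal((n_hid, n_hid)) * 0.1

def run(sequence, W_x, W_h):
    # A plain recurrent pass that returns the hidden state at every time step
    h, states = np.zeros(n_hid), []
    for x_t in sequence:
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    return np.stack(states)

x = rng.standard_normal((20, n_in))
h_forward = run(x, Wx_f, Wh_f)                               # left-to-right pass
h_backward = run(x[::-1], Wx_b, Wh_b)[::-1]                  # right-to-left pass, re-aligned in time
combined = np.concatenate([h_forward, h_backward], axis=1)   # input of the output layer
print(combined.shape)                                        # (20, 32): past and future context at every step
```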

41 Deep recurrent networks
Of course, we can stack many recurrent layers on top of each other
- These can even be bidirectional
- But this is less common (or done with fewer layers) than with feed-forward layers
- Recurrent layers have a larger representational power
- But their training is much slower and more complicated

