1
Deep Learning Qing LU, Siyuan CAO
2
Some Applications
Deep Learning started to show its power in speech recognition (Hinton et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition”, 2012).
9/22/2018 Deep Learning
3
Contents
Deep Learning Basic Idea
A simple deep network: Deep Belief Network
Unsupervised Learning: Auto-encoder
A more popular network: Convolutional Neural Network
4
Deep Learning Basic Idea
Basic Architecture
5
Architecture
Example: This is a deep network with
one 3-unit Input Layer
two 5-unit Hidden Layers
one 2-unit Output Layer
Note: A unit is also known as a “neuron”.
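To make the layer sizes concrete, here is a minimal NumPy sketch of a forward pass through a 3-5-5-2 network. The sigmoid activation, the random weights, and the names `layer_sizes`, `weights`, `biases` and `forward` are assumptions made for illustration; the slide only fixes the number of units per layer.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation (an assumed choice; the slide does not name one)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [3, 5, 5, 2]  # input, hidden, hidden, output, as on the slide

# One weight matrix and one bias vector per pair of adjacent layers
# (random placeholder values, not trained parameters).
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Propagate an input vector through every layer in turn."""
    for W, b in zip(weights, biases):
        x = sigmoid(W @ x + b)
    return x

output = forward(np.array([1.0, 0.0, 1.0]))
n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(output.shape, n_params)
```

Even this small network has 62 adjustable parameters (weights plus biases), which hints at why training time becomes an issue for larger networks.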
6
Why such architecture?
Because we learn things in this way.
Most popular Deep Learning architectures are built on the Artificial Neural Network, which was quite popular from the 1950s until the 1990s.
The Artificial Neural Network met resistance because of the computation time needed to train the network.
7
Deep Learning Basic Idea
Basic Architecture
How it works
8
How does it work?
[Figure labels: “It is an apple!”, “Smells good!”, other parameters]
9
How does it work?
Chemical materials (e.g. hormones) are created.
Signals pass through neurons.
[Figure labels: “It is an apple!”, “Smells good!”, other parameters]
10
How does it work?
Signals pass to the next level of neurons.
[Figure labels: “It is an apple!”, “Smells good!”, other parameters]
11
How does it work?
Signals pass to the brain and mouth.
[Figure labels: “It is an apple!”, “It is delicious!”, “Smells good!”, “Make the mouth water”, other parameters]
It is obvious that the input and the output are visible, while the middle layers are unknown to us unless we are biologists. Therefore we call the middle layers “Hidden Layers”.
12
Notice
Each layer is another kind of representation of the input (Feature Learning).
There is no feedback loop in this specific architecture.
Feedback models exist, e.g. the Recurrent Neural Network, but their computational complexity is higher. That is why they are less popular than non-feedback models.
13
How can we make a machine learn in this way?
14
Deep Belief Network
[Figure: Deep Belief Network with its Input Vector]
15
Deep Belief Network
The unit $h_{1,j}$ is activated according to a certain probability $P(h_{1,j} = 1 \mid \mathbf{v})$.
Note:
1. Each unit is a binary unit, i.e. its value is either 0 or 1.
2. The weights of units and connections are real numbers.
[Figure: Deep Belief Network with its Input Vector]
16
Deep Belief Network
The unit $h_{2,j}$ is activated according to a certain probability $P(h_{2,j} = 1 \mid \mathbf{h}_1)$.
[Figure: Deep Belief Network with its Input Vector]
17
Deep Belief Network
Again, the output unit is activated according to a certain probability.
[Figure: Deep Belief Network with its Input Vector]
18
Question
How can we find this probability? (i.e. the training process)
First, a DBN is a stack of several simple single networks, i.e. Restricted Boltzmann Machines (RBMs), invented under the name “Harmonium” by Paul Smolensky in 1986.
19
Architecture of DBN
24
Restricted Boltzmann Machine (RBM)
Introduction to RBM:
An RBM is a variant of the Boltzmann Machine.
An RBM has only two layers, commonly referred to as the “visible” and “hidden” layers.
Connections exist only between a “visible” unit and a “hidden” unit.
There is NO connection between two “visible” units or between two “hidden” units.
25
Question
How can we find this probability? (i.e. the training process)
First, a DBN is a stack of several single networks, i.e. Restricted Boltzmann Machines (RBMs), invented under the name “Harmonium” by Paul Smolensky in 1986.
Second, energy is introduced into the model.
26
Energy Based Models
We associate a scalar energy $E(x, y)$ with each configuration.
The probability distribution w.r.t. the energy is defined as
$$p(x, y) = \frac{e^{-E(x, y)}}{Z}$$
where $Z$ is the normalizing factor
$$Z = \sum_{x, y} e^{-E(x, y)}$$
We want the following property: lower energy indicates a more “desirable” configuration.
What is “desirable”? For a given data pair $(x, y)$, $x$ is the input and $y$ is the output.
If $x$ and $y$ are compatible, then the energy should be low.
If $x$ and $y$ are not compatible, then the energy should be high.
Reference: Lecun-06.pdf
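As a small numeric illustration of the energy-to-probability mapping, the sketch below normalizes $e^{-E}$ over four configurations. The energy values are made up for the example and do not come from the slides.

```python
import numpy as np

# Hypothetical energies for four (x, y) configurations; lower means more compatible.
energies = np.array([0.5, 2.0, 1.0, 3.0])

unnormalized = np.exp(-energies)   # e^{-E(x, y)} for each configuration
Z = unnormalized.sum()             # normalizing factor
probabilities = unnormalized / Z   # p(x, y) = e^{-E(x, y)} / Z

print(probabilities)
```

The configuration with the lowest energy receives the highest probability, which is the “desirable” behaviour described above.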
27
Energy Function
For an RBM, each elementary configuration contains two units and one connection. Therefore the energy function is defined as follows:
$$E(v_i, h_j) = -\left(a_i v_i + b_j h_j + v_i w_{ij} h_j\right)$$
where
$v_i$ and $h_j$ are binary units (i.e. $v_i, h_j \in \{0, 1\}$),
$a_i$ and $b_j$ are the biases of the units,
$w_{ij}$ is the weight of the connection.
28
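Energy Function
We expand the energy function to the vectors $\mathbf{v}$ and $\mathbf{h}$:
$$E(\mathbf{v}, \mathbf{h}) = -\left(\sum_i a_i v_i + \sum_j b_j h_j + \sum_i \sum_j v_i w_{ij} h_j\right)$$
Furthermore, entirely in vector form:
$$E(\mathbf{v}, \mathbf{h}) = -\left(\mathbf{a}^T \mathbf{v} + \mathbf{b}^T \mathbf{h} + \mathbf{v}^T \mathbf{W} \mathbf{h}\right)$$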
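The two forms of the RBM energy can be checked against each other numerically. The sketch below assumes `W` has shape (number of visible units, number of hidden units); the function names and the random parameter values are illustrative only.

```python
import numpy as np

def energy_sums(v, h, a, b, W):
    """Energy written with explicit sums over units and connections."""
    total = 0.0
    for i in range(len(v)):
        total += a[i] * v[i]
    for j in range(len(h)):
        total += b[j] * h[j]
    for i in range(len(v)):
        for j in range(len(h)):
            total += v[i] * W[i, j] * h[j]
    return -total

def energy_vector(v, h, a, b, W):
    """Same energy in vector form: -(a^T v + b^T h + v^T W h)."""
    return -(a @ v + b @ h + v @ W @ h)

rng = np.random.default_rng(0)
a, b = rng.normal(size=3), rng.normal(size=2)
W = rng.normal(size=(3, 2))
v = np.array([1.0, 0.0, 1.0])   # binary visible vector
h = np.array([0.0, 1.0])        # binary hidden vector

print(energy_sums(v, h, a, b, W), energy_vector(v, h, a, b, W))
```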
29
Something about Probabilities
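The probability distribution is defined as
$$P(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v}, \mathbf{h})}}{Z}$$
where $Z$ is the normalizing factor
$$Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$$
The probability of a visible vector is
$$P(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$$
Because there is no connection between two visible units or between two hidden units, the hidden units are independent of each other given the visible units. The same holds for the visible units given the hidden units. Therefore, we have the following conditional probabilities:
$$P(\mathbf{v} \mid \mathbf{h}) = \prod_i P(v_i \mid \mathbf{h})$$
$$P(\mathbf{h} \mid \mathbf{v}) = \prod_j P(h_j \mid \mathbf{v})$$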
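The factorization of $P(\mathbf{h} \mid \mathbf{v})$ can be verified by brute force on a tiny RBM. The sketch below enumerates every hidden state, computes the exact conditional from the energies, and compares it with the product of per-unit probabilities. It assumes the standard RBM result that $P(h_j = 1 \mid \mathbf{v})$ is the logistic sigmoid of $b_j + \sum_i w_{ij} v_i$; the sizes and parameter values are arbitrary.

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(v, h, a, b, W):
    """RBM energy in vector form, with W of shape (n_visible, n_hidden)."""
    return -(a @ v + b @ h + v @ W @ h)

rng = np.random.default_rng(0)
n_visible, n_hidden = 3, 2
a = rng.normal(size=n_visible)
b = rng.normal(size=n_hidden)
W = rng.normal(size=(n_visible, n_hidden))
v = np.array([1.0, 0.0, 1.0])

# Exact P(h | v): enumerate all 2^n_hidden hidden states and normalize e^{-E}.
hidden_states = [np.array(h, dtype=float)
                 for h in itertools.product([0, 1], repeat=n_hidden)]
unnormalized = np.array([np.exp(-energy(v, h, a, b, W)) for h in hidden_states])
p_h_given_v = unnormalized / unnormalized.sum()

# Factorized form: product over hidden units of P(h_j | v).
p_on = sigmoid(b + v @ W)  # P(h_j = 1 | v) for every j
p_factorized = np.array([np.prod(np.where(h == 1, p_on, 1.0 - p_on))
                         for h in hidden_states])

print(np.allclose(p_h_given_v, p_factorized))
```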
30
Activation Probability
The activation probability of unit $v_i$ or $h_j$ is:
$$P(v_i = 1 \mid \mathbf{h}) = \frac{1}{1 + e^{-\left(a_i + \sum_j w_{ij} h_j\right)}}$$
$$P(h_j = 1 \mid \mathbf{v}) = \frac{1}{1 + e^{-\left(b_j + \sum_i w_{ij} v_i\right)}}$$
How do we get the Sigmoid Function?
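One way to answer the question is to isolate the part of the energy that depends on a single hidden unit. This is a short sketch of that argument, where $C$ is shorthand (introduced here, not on the slide) for all energy terms that do not involve $h_j$, and $x_j$ is the total input to unit $j$:

```latex
\begin{aligned}
E(\mathbf{v}, \mathbf{h}) &= -x_j h_j + C,
  \qquad x_j = b_j + \sum_i w_{ij} v_i \\
P(h_j = 1 \mid \mathbf{v})
  &= \frac{e^{-E(\mathbf{v}, \mathbf{h})}\big|_{h_j = 1}}
          {e^{-E(\mathbf{v}, \mathbf{h})}\big|_{h_j = 0} + e^{-E(\mathbf{v}, \mathbf{h})}\big|_{h_j = 1}}
   = \frac{e^{x_j - C}}{e^{-C} + e^{x_j - C}}
   = \frac{1}{1 + e^{-x_j}}
\end{aligned}
```

The factor $e^{-C}$ cancels, so the result does not depend on the other hidden units. The same argument with the roles of $\mathbf{v}$ and $\mathbf{h}$ exchanged gives $P(v_i = 1 \mid \mathbf{h})$.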
31
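Training Algorithm
For a given training data set $V$ (a matrix in which each row is a visible vector $\mathbf{v}$), the RBM is trained to find
$$\arg\max_{\theta} \prod_{\mathbf{v} \in V} P(\mathbf{v})$$
or equivalently
$$\arg\max_{\theta} \sum_{\mathbf{v} \in V} \log P(\mathbf{v})$$
with $\theta = \{\mathbf{a}, \mathbf{b}, \mathbf{W}\}$.
This is Maximum Likelihood: $\prod_{\mathbf{v} \in V} P(\mathbf{v})$ is the likelihood function, as in the Machine Learning course.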
32
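Training Algorithm
$$\frac{\partial \log P(\mathbf{v})}{\partial \theta} = \frac{\partial}{\partial \theta}\left(\log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} - \log \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}\right)$$
$$\frac{\partial \log P(\mathbf{v})}{\partial \theta} = \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v}) \frac{\partial}{\partial \theta}\left(-E(\mathbf{v}, \mathbf{h})\right) - \sum_{\mathbf{v}, \mathbf{h}} P(\mathbf{v}, \mathbf{h}) \frac{\partial}{\partial \theta}\left(-E(\mathbf{v}, \mathbf{h})\right)$$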
33
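Training Algorithm
$$\frac{\partial}{\partial w_{ij}} E(\mathbf{v}, \mathbf{h}) = -v_i h_j \qquad \frac{\partial}{\partial a_i} E(\mathbf{v}, \mathbf{h}) = -v_i \qquad \frac{\partial}{\partial b_j} E(\mathbf{v}, \mathbf{h}) = -h_j$$
$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = \sum_{\mathbf{h}} P(\mathbf{h} \mid \mathbf{v})\, v_i h_j - \sum_{\mathbf{v}, \mathbf{h}} P(\mathbf{v}, \mathbf{h})\, v_i h_j$$
$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = P(h_j = 1 \mid \mathbf{v})\, v_i - \sum_{\mathbf{v}} P(\mathbf{v})\, P(h_j = 1 \mid \mathbf{v})\, v_i$$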
34
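Training Algorithm
$$\frac{\partial \log P(\mathbf{v})}{\partial w_{ij}} = P(h_j = 1 \mid \mathbf{v})\, v_i - \sum_{\mathbf{v}} P(\mathbf{v})\, P(h_j = 1 \mid \mathbf{v})\, v_i$$
$$\frac{\partial \log P(\mathbf{v})}{\partial a_i} = v_i - \sum_{\mathbf{v}} P(\mathbf{v})\, v_i$$
$$\frac{\partial \log P(\mathbf{v})}{\partial b_j} = P(h_j = 1 \mid \mathbf{v}) - \sum_{\mathbf{v}} P(\mathbf{v})\, P(h_j = 1 \mid \mathbf{v})$$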
37
Training Algorithm
To find the optimal $\theta = \{\mathbf{a}, \mathbf{b}, \mathbf{W}\}$, the three gradient equations should equal 0.
The first terms of the gradients are easy to compute. The second terms are difficult to compute, because they require many sampling steps (e.g. using Gibbs sampling).
However, it was shown that estimates obtained after just a few steps can be sufficient for model training.
Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Computation 14, 1771–1800 (2002)
Contrastive Divergence is commonly used to approximate the log-likelihood gradient for training an RBM.
38
Contrastive Divergence (CD or CD-k)
Usually, only 1 step (k = 1) is enough.
Source: An Introduction to Restricted Boltzmann Machines by Asja Fischer and Christian Igel
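As a rough illustration of CD with one step (CD-1), the sketch below updates a binary RBM from a single visible vector: the first gradient term is evaluated on the data, and the intractable second term is replaced by a sample obtained after one Gibbs step. This is one common formulation, not necessarily the exact pseudocode of the cited source. The function name `cd1_update`, the learning rate `lr`, the layer sizes and the toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, a, b, W, lr, rng):
    """One CD-1 step for a binary RBM; W has shape (n_visible, n_hidden)."""
    # Positive phase: hidden probabilities and a hidden sample given the data.
    p_h0 = sigmoid(b + v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One Gibbs step: reconstruct the visible units, then recompute the hidden probabilities.
    p_v1 = sigmoid(a + W @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b + v1 @ W)
    # Approximate gradient: data statistics minus reconstruction statistics.
    W_new = W + lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
    a_new = a + lr * (v0 - v1)
    b_new = b + lr * (p_h0 - p_h1)
    return a_new, b_new, W_new

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
a = np.zeros(n_visible)
b = np.zeros(n_hidden)
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))

# Toy training set: each row is one binary visible vector.
data = np.array([[1.0, 0.0, 1.0, 1.0],
                 [1.0, 0.0, 1.0, 0.0]])
for epoch in range(10):
    for v0 in data:
        a, b, W = cd1_update(v0, a, b, W, lr=0.1, rng=rng)
```

Using the hidden probabilities rather than sampled hidden states in the update is a design choice that reduces sampling noise; other variants exist.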
39
A Short Conclusion
Until now, we have only dealt with ONE RBM.
An RBM has only two layers, so it is not exactly “Deep”.
We can use Contrastive Divergence to train an RBM.
40
Architecture of DBN
Until now, we have only dealt with ONE RBM.
Then we do the same thing for the remaining RBMs, to compute their a, b, and W.
41
Architecture of DBN
Train the remaining RBMs with the same approach.
Note: Until now, we have only obtained a locally optimal configuration.
42
Backpropagation (Fine-Tuning)
The backpropagation algorithm is used to fine-tune the network and to get closer to the global optimum.
This step makes the DBN a supervised model.
43
Demo (MNIST Classifier)
Code provided by Ruslan Salakhutdinov and Geoff Hinton
44
Conclusion to DBN
Simple architecture, easy to scale
An efficient algorithm exists to pre-train the network
My test:
Layer 1   Layer 2   Layer 3   BP    Time
500       500       2000      200   36 hours
500       500       2000      50    6.5 hours
200       200       1000      50    2 hours
Limits:
Labeled data are needed
Computation time is still an issue (imagine a full color picture taken with a camera: how many parameters need to be updated for a network?)
45
Auto-encoder
To train a model with a classifier, labeled data are needed.
However, in most cases only unlabeled data are available, and labeling data is very expensive.
Therefore, we need an unsupervised way to train the model.
46
Auto-encoder basic idea
47
Auto-encoder with DBN
48
Demo (MNIST Autoencoder)
Code provided by Ruslan Salakhutdinov and Geoff Hinton
49
Convolutional Neural Network
Very popular in image recognition
A special architecture reduces the data size significantly (which means the number of network parameters is also reduced)
However, training the network still takes a long time because of the training algorithm
50
Question
Given an image, how would you reduce the data size?
51
Convolutional Neural Network Architecture
Convolution layer
Subsampling layer
Full Connection layer
[Figure: LeNet-5 architecture. Source: Gradient-based Learning Applied to Document Recognition, by Yann LeCun et al., 1998]
52
Basic Idea of CNN
Feedforward pass: to compute the error
Backpropagation pass: to update the weights and biases
53
Basic Idea of Feedforward Pass
Convolution Layer: Use several filters to enhance the features of the input (or of the previous layer).
Subsampling Layer: Because an image has local spatial relations, down-sampling can reduce the data size while still keeping the valuable information (e.g. imagine that you can still recognize a picture from its thumbnail).
Full Connection Layer: Can be regarded as a classifier.
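The convolution and subsampling steps can be sketched in plain NumPy as follows. This is a minimal illustration under stated assumptions: a single 32x32 input map, one 5x5 filter, “valid” convolution without padding, and 2x2 average down-sampling. The function names `conv2d_valid` and `downsample_mean` and the random values are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2D convolution without padding; the kernel is flipped, as in true convolution."""
    k = np.flip(kernel)  # reverse both axes
    kh, kw = k.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * k)
    return out

def downsample_mean(x, factor=2):
    """Average over non-overlapping factor x factor blocks (sizes must divide evenly)."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((32, 32))          # stand-in for a 32x32 digit image
kernel = rng.normal(size=(5, 5))      # one filter; in a CNN its values are learned

feature_map = conv2d_valid(image, kernel)   # 28 x 28
subsampled = downsample_mean(feature_map)   # 14 x 14
print(feature_map.shape, subsampled.shape)
```

The data shrinks from 1024 values to 196 after one convolution and one subsampling step, while only the 25 filter weights (plus a bias, omitted here) would need to be learned.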
54
A good video about Feedforward Pass
Convolution Layer part: from 5:42 to 7:16
  What 2D matrix convolution is
  What effect 2D matrix convolution can achieve
Subsampling Layer part: at 10:10
  How to subsample
55
Question
How many parameters are needed for a CNN, compared with a DBN? (Input data are 32x32 digit images, or full color pictures)
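A back-of-the-envelope count for the first layer alone already shows the difference. The sketch below assumes a fully connected first layer with 500 hidden units (as in the DBN test earlier in the deck) and a first convolution layer with six 5x5 filters (the size of LeNet-5's first convolution layer); both helper functions are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

def conv_params(n_input_maps, n_filters, kernel_size):
    """Weights plus one bias per filter of a convolution layer with shared weights."""
    return n_filters * (n_input_maps * kernel_size * kernel_size + 1)

# 32x32 grayscale digit image
dense_gray = dense_params(32 * 32, 500)
conv_gray = conv_params(1, 6, 5)

# 32x32 full color picture (3 channels)
dense_color = dense_params(32 * 32 * 3, 500)
conv_color = conv_params(3, 6, 5)

print(dense_gray, conv_gray, dense_color, conv_color)
```

Because the convolution filters are shared across all image positions, their parameter count does not grow with the image size, whereas the fully connected count grows linearly with the number of pixels.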
56
Convolutional Neural Network Architecture
Task of training a CNN: given labeled data, obtain suitable weights and biases for the matrices of the convolution layers and subsampling layers.
[Figure: LeNet-5 architecture. Source: Gradient-based Learning Applied to Document Recognition, by Yann LeCun et al., 1998]
57
Model of CNN
The following slides are based on “Notes on Convolutional Neural Networks” by Jake Bouvrie, 2006.
58
Model of CNN
For a multiclass problem with $c$ classes and $N$ training examples, the error is given by:
$$E_N = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{c} \left(t_{n,k} - y_{n,k}\right)^2$$
$E_N$: error over all training examples
$t_{n,k}$: label (target) of the n-th input data w.r.t. the k-th class
$y_{n,k}$: output of the network for the n-th input data w.r.t. the k-th class
59
Model of CNN
For a multiclass problem with $c$ classes and $N$ training examples, the error of the n-th example is given by:
$$E_n = \frac{1}{2} \sum_{k=1}^{c} \left(t_{n,k} - y_{n,k}\right)^2$$
or
$$E_n = \frac{1}{2} \left\lVert \mathbf{t}_n - \mathbf{y}_n \right\rVert_2^2$$
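A small worked example of the squared-error measure, assuming one-hot labels and made-up network outputs for $N = 2$ examples and $c = 3$ classes. The array names are illustrative.

```python
import numpy as np

# One row per training example, one column per class.
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])      # one-hot targets (assumed encoding)
outputs = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.6, 0.2]])     # hypothetical network outputs

per_example_error = 0.5 * np.sum((labels - outputs) ** 2, axis=1)  # error of each example
total_error = per_example_error.sum()                              # error over all examples

print(per_example_error, total_error)
```

Here the first example contributes 0.03 and the second 0.12, so the total error is 0.15; the total is simply the sum of the per-example errors.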
60
Model of General Feedforward Pass
The output of a certain layer:
$$\mathbf{x}_l = f(\mathbf{u}_l) \quad \text{with} \quad \mathbf{u}_l = \mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l$$
$l$: current layer. Layer 1 is the input data layer and layer $L$ is the output layer of the CNN. Therefore, $l$ runs from 2 to $L$.
$\mathbf{x}_l$: output of layer $l$
$\mathbf{W}_l$ and $\mathbf{b}_l$: weights and biases of layer $l$
$f(\cdot)$: activation function, commonly the sigmoid or hyperbolic tangent function
61
Model of General Backpropagation Pass
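The backpropagation algorithm is used to update the weights and biases.
$\delta$ is regarded as the sensitivity of the error to the bias, and it is propagated back through the network.
$$\delta \stackrel{\text{def}}{=} \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u} \frac{\partial u}{\partial b}$$
Since $\frac{\partial u}{\partial b} = 1$, $\delta$ becomes
$$\delta \stackrel{\text{def}}{=} \frac{\partial E}{\partial b} = \frac{\partial E}{\partial u}$$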
62
Model of General Backpropagation Pass
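$\delta$ for layer $l$:
$$\boldsymbol{\delta}_l = \left(\mathbf{W}_{l+1}\right)^T \boldsymbol{\delta}_{l+1} \circ f'(\mathbf{u}_l)$$
$\delta$ for layer $L$:
$$\boldsymbol{\delta}_L = f'(\mathbf{u}_L) \circ \left(\mathbf{y}_n - \mathbf{t}_n\right)$$
$\circ$: element-wise multiplication
Final equation to update the bias:
$$\Delta \mathbf{b}_l = -\eta \frac{\partial E}{\partial \mathbf{b}_l} = -\eta\, \boldsymbol{\delta}_l$$
$\eta$: learning rate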
63
Model of General Backpropagation Pass
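To update the weights, the process is analogous to the bias update. With $\mathbf{u}_l = \mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l$, the gradient is the outer product of the sensitivities and the layer input, which has the same shape as $\mathbf{W}_l$:
$$\frac{\partial E}{\partial \mathbf{W}_l} = \boldsymbol{\delta}_l \left(\mathbf{x}_{l-1}\right)^T$$
$$\Delta \mathbf{W}_l = -\eta \frac{\partial E}{\partial \mathbf{W}_l}$$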
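The general feedforward and backpropagation passes can be sketched for a small fully connected network as follows. The tanh activation, the layer sizes, the learning rate and all function names are assumptions for illustration. The weight gradient is arranged as the outer product of the sensitivity vector and the layer input so that it has the same shape as the weight matrix.

```python
import numpy as np

def f(u):
    return np.tanh(u)                 # assumed activation function

def f_prime(u):
    return 1.0 - np.tanh(u) ** 2      # its derivative

def forward(x, weights, biases):
    """Return pre-activations u_l and outputs x_l for every layer (xs[0] is the input)."""
    xs, us = [x], []
    for W, b in zip(weights, biases):
        us.append(W @ xs[-1] + b)
        xs.append(f(us[-1]))
    return us, xs

def error(x, t, weights, biases):
    """Squared error 1/2 * ||t - y||^2 for one example."""
    _, xs = forward(x, weights, biases)
    return 0.5 * np.sum((t - xs[-1]) ** 2)

def backprop(x, t, weights, biases):
    """Gradients of the error w.r.t. every weight matrix and bias vector."""
    us, xs = forward(x, weights, biases)
    delta = f_prime(us[-1]) * (xs[-1] - t)          # sensitivity of the output layer
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(delta, xs[l]))   # same shape as weights[l]
        grads_b.insert(0, delta)
        if l > 0:                                   # propagate the sensitivity one layer back
            delta = (weights[l].T @ delta) * f_prime(us[l - 1])
    return grads_W, grads_b

rng = np.random.default_rng(0)
sizes = [3, 5, 5, 2]
weights = [rng.normal(scale=0.5, size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]
x = np.array([1.0, 0.0, 1.0])   # one input example
t = np.array([1.0, 0.0])        # its label

eta = 0.1                        # learning rate
grads_W, grads_b = backprop(x, t, weights, biases)
new_weights = [W - eta * gW for W, gW in zip(weights, grads_W)]
new_biases = [b - eta * gb for b, gb in zip(biases, grads_b)]
```

A finite-difference comparison on a single weight is a convenient way to check such an implementation before using it for training.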
64
Detail Form for Convolution Layer
Feedforward Pass
$$\mathbf{x}_{l,j} = f\left(\sum_{i \in M_j} \mathbf{x}_{l-1,i} * \mathbf{k}_{l,ij} + \mathbf{b}_{l,j}\right)$$
$\mathbf{k}_{l,ij}$: weight matrix (kernel) of layer $l$ between feature maps $i$ and $j$
$M_j$: a selection of input maps
65
Detail Form for Convolution Layer
Backpropagation Pass
$$\boldsymbol{\delta}_{l,j} = \beta_{l+1,j} \left(f'(\mathbf{u}_{l,j}) \circ up\left(\boldsymbol{\delta}_{l+1,j}\right)\right)$$
$\beta_{l+1,j}$: see next slide
$up(\cdot)$: up-sampling method, e.g. Kronecker product
$f'(\cdot)$: derivative of the activation function
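Up-sampling with a Kronecker product can be sketched with NumPy's `np.kron`: every sensitivity value of the subsampling layer is copied into the block of the convolution layer it was computed from. The sketch assumes 2x2 subsampling and a tanh activation; `up`, `conv_layer_delta` and the example values are hypothetical.

```python
import numpy as np

def up(delta, factor=2):
    """Repeat each entry over a factor x factor block (Kronecker product with ones)."""
    return np.kron(delta, np.ones((factor, factor)))

def f_prime(u):
    return 1.0 - np.tanh(u) ** 2   # derivative of the assumed tanh activation

def conv_layer_delta(u, delta_next, beta_next, factor=2):
    """Sensitivity map of a convolution layer followed by a subsampling layer."""
    return beta_next * (f_prime(u) * up(delta_next, factor))

delta_next = np.array([[1.0, 2.0],
                       [3.0, 4.0]])    # sensitivities of the 2x2 subsampling map
u = np.zeros((4, 4))                   # pre-activations of the 4x4 convolution map
delta = conv_layer_delta(u, delta_next, beta_next=0.5)

print(up(delta_next))
print(delta)
```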
66
Detail Form for Subsampling Layer
Feedforward Pass
$$\mathbf{x}_{l,j} = f\left(\beta_{l,j}\, down\left(\mathbf{x}_{l-1,j}\right) + \mathbf{b}_{l,j}\right)$$
$\beta_{l,j}$: nothing special, just a “weight”. Here it is only a scalar, not a matrix.
$down(\cdot)$: down-sampling method, e.g. average, maximum, etc.
67
Detail Form for Subsampling Layer
Backpropagation Pass
$$\boldsymbol{\delta}_{l,j} = \left(\mathbf{k}_{l+1,j}\right)^T \boldsymbol{\delta}_{l+1,j} \circ f'(\mathbf{u}_{l,j})$$
68
A Short Conclusion (Feedforward)
Target: compute the error
Convolution Layers: convolution is used instead of multiplication.
Subsampling Layers: different down-sampling methods can be used.
69
A Short Conclusion (Backpropagation)
Target: back-propagate the error and update the weights and biases
Convolution Layers: up-sampling is needed
Subsampling Layers: a shortcut method exists in MATLAB
More detailed BP steps are given in “Notes on Convolutional Neural Networks” by Jake Bouvrie.
70
Conclusion to CNN
Significantly reduces the data size and the number of parameters
The training algorithm is not efficient (currently only the BP algorithm)
There is research on combining CNN and DBN
Personal view: what happens in the different layers is a little easier to understand than in a DBN, although it is still hard for us to understand why certain filters are chosen after training.
71
Conclusion to Deep Learning
Feature Learning
Hierarchical architecture (simulates brain activity)
There is no theoretical proof of what the optimal parameters are (number of layers, number of units, etc.)
Good performance in image and speech recognition, although it is hard for us to understand what is happening in the network
Computation time is still an issue