MLCV 182: Introduction to Deep Learning
Ron Shapira Weber, Computer Science, Ben-Gurion University
[Figure taken from https://ai.googleblog.com]
Contents
Introduction – What is Deep Learning?
Linear / Binary Perceptron
Multi-Layer Perceptron
[Figure from previous slide taken from https://ai.googleblog.com]
What is Deep Learning? From perceptron to deep neural networks
Example – Object recognition and localization
Note – Deep learning is a supervised learning algorithm. [Andrej Karpathy & Li Fei-Fei (2015). Deep Visual-Semantic Alignments for Generating Image Descriptions]
Some history – ImageNet challenge
1.2 million images in the training set, each labeled with one of 1,000 categories. An image classification problem.
Some history – ImageNet challenge
One of the Top-5 guesses needs to be the correct one.
Increasing Depth on ImageNet challenge
Trend of increasing depth (Img Credit: Kaiming He)
ImageNet architecture comparison
Number of operations for a single forward pass vs. top-1 accuracy. Dolev will talk more about this topic. [Canziani et al. (2016). An analysis of deep neural network models for practical applications]
Supervised Learning
Data:
X – dataset: images, videos, text, etc.
y – labels (cat, dog, platypus)
Image classification example: a classifier (SVM, LDA, deep neural network, etc.) maps an input image to a probability distribution over the classes, e.g., platypus 0.66, dog 0.14, cat 0.2.
*We'll also see variants of deep learning algorithms where the data isn't labeled.
Supervised Learning
An example of a supervised learning algorithm we saw in this course? Least-squares estimation in a linear model:
A known function $h: \mathbb{R}^n \to \mathbb{R}^{d \times k}$.
Data: $N$ pairs $\{(\boldsymbol{x}_i, \boldsymbol{y}_i)\}_{i=1}^{N}$, where $(\boldsymbol{x}_i, \boldsymbol{y}_i) \in \mathbb{R}^n \times \mathbb{R}^d$.
$\forall i$: define $\boldsymbol{H}_i \triangleq h(\boldsymbol{x}_i)$ ($\boldsymbol{H}_i$ is a $d \times k$ matrix).
Goal: find the optimal (in the least-squares sense) parameter $\theta$, assuming the model $\boldsymbol{y} = h(\boldsymbol{x})\,\theta$. In other words:
$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} \|\boldsymbol{y}_i - \boldsymbol{H}_i \theta\|^2$
Note that in this framework we try to predict the label ($y$) of the input ($x$): X – data, y – labels.
Unsupervised Learning
Solve some task given "unlabeled" data. An example of an unsupervised learning algorithm we saw in this course? Well… PCA, K-means, GMM.
Supervised Learning Framework:
Provide data and labels – $X$, $y$.
Split the data into:
Training data: the majority of the data (for instance, 60%), used to train the model.
Validation set: a partition of the data (e.g., 20%) used for tuning the parameters.
Test data: a partition of the data (e.g., 20%) used to test the accuracy of the model.
Define the algorithm.
Define a loss function; in the case of linear regression, the squared $\ell_2$ norm: $L(\theta) = \sum_{i} \|\boldsymbol{y}_i - h(\boldsymbol{x}_i)\theta\|_2^2$.
Define an optimization method to find $\theta$ such that $\hat{\theta} = \arg\min_{\theta} L(\theta)$.
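As a concrete illustration of the split above, here is a minimal NumPy sketch of a 60/20/20 train/validation/test partition (the function name, the synthetic data, and the exact ratios are illustrative assumptions, not part of the course material):

```python
import numpy as np

def train_val_test_split(X, y, train=0.6, val=0.2, seed=0):
    """Randomly partition (X, y) into train / validation / test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffle the sample indices
    n_train = int(train * len(X))
    n_val = int(val * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Example: 100 samples with 5 features each and binary labels.
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, size=100)
train_set, val_set, test_set = train_val_test_split(X, y)
```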
Example: Deep Learning for Image label classification
Provide data and labels – images and their classes.
Split the data into: training data, validation set, test data.
Define the algorithm: artificial neural network, convolutional NN, etc.
Define a loss function: L2 norm, cross-entropy, …
Define an optimization method to find $\theta$; usually there is no closed-form solution, so we can use iterative gradient-based methods.
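Since there is usually no closed-form solution, the parameters are updated iteratively. Below is a minimal sketch of plain gradient descent on a generic differentiable loss; the toy quadratic loss, learning rate, and step count are illustrative assumptions rather than the course's specific choices:

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.01, n_steps=200):
    """Repeatedly step in the direction of the negative gradient of the loss."""
    theta = theta0.copy()
    for _ in range(n_steps):
        theta -= lr * grad_fn(theta)      # theta <- theta - lr * dL/dtheta
    return theta

# Toy example: minimize L(theta) = ||A @ theta - b||^2, whose gradient is 2 A^T (A theta - b).
A = np.random.randn(50, 3)
b = np.random.randn(50)
grad = lambda theta: 2.0 * A.T @ (A @ theta - b)
theta_hat = gradient_descent(grad, theta0=np.zeros(3), lr=0.005)
```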
When working with images
Represent images as vectors: an image $I \in \mathbb{R}^{n \times m}$ is flattened so that $I_f \in \mathbb{R}^{nm}$. (The slide illustrates a 3×3 image whose pixels, indexed (0,0) through (2,2), are stacked row by row into a single vector.)
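A minimal NumPy sketch of this flattening step (the 3×3 image here is just an illustration):

```python
import numpy as np

# A small 3x3 "image" standing in for I in R^{n x m}.
I = np.arange(9).reshape(3, 3)

# Flatten row by row so that I_f lives in R^{n*m} (here, R^9).
I_f = I.reshape(-1)        # equivalently: I.flatten()
print(I.shape, I_f.shape)  # (3, 3) (9,)
```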
Perceptron
Inputs $x_1, \dots, x_n$ are combined with weights $w_1, \dots, w_n$ and a bias $b$, then passed through an activation $f$: $\text{output} = f\!\left(\sum_i w_i x_i + b\right)$.
Some History
The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. The multilayer perceptron is an extension of the perceptron, which was first introduced in the 1950s. In 1969, a famous book entitled "Perceptrons" by Marvin Minsky and Seymour Papert showed that it was impossible for perceptrons to learn an XOR function without adding a hidden layer; hence the term multilayer perceptron.
Linear Perceptron – Single Output
Inputs $x_1, \dots, x_n$ and weights $w_1, \dots, w_n$ give the prediction $\hat{y} = \sum_i w_i x_i + b$, with the $\ell_2$ loss $(y - \hat{y})^2$.
Linear Perceptron
Data: $N$ pairs $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$, where $(\boldsymbol{x}_i, y_i) \in \mathbb{R}^n \times \mathbb{R}$.
Try to predict $y_i$ by $\hat{y}_i = \boldsymbol{w}^T \boldsymbol{x}_i + b$.
This is a linear least-squares problem (see PS 4): $f(\boldsymbol{x}; \boldsymbol{w}) = \sum_{i=1}^{N} \|y_i - \boldsymbol{x}_i^T \boldsymbol{w}\|_{\ell_2}^2$, $\boldsymbol{w} \in \mathbb{R}^n$.
Find: $\hat{\boldsymbol{w}} = \arg\min_{\boldsymbol{w}} f(\boldsymbol{x}; \boldsymbol{w})$.
Therefore there is a closed-form solution, $\boldsymbol{X}^T \boldsymbol{X} \boldsymbol{w}_{LS} = \boldsymbol{X}^T \boldsymbol{y}$, where $\boldsymbol{X}$ is the entire dataset (each row is a sample).
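A minimal NumPy sketch of this closed-form solution (the synthetic data is illustrative; a column of ones is appended so the bias b is estimated together with w, and lstsq is used rather than explicitly inverting X^T X):

```python
import numpy as np

# Synthetic data: N samples in R^n drawn from a noisy linear model.
N, n = 200, 5
X = np.random.randn(N, n)
w_true = np.random.randn(n)
y = X @ w_true + 0.1 * np.random.randn(N)

# Append a column of ones so the bias is learned as an extra weight.
Xb = np.hstack([X, np.ones((N, 1))])

# Solve the normal equations X^T X w = X^T y (lstsq is the numerically stable route).
w_ls, *_ = np.linalg.lstsq(Xb, y, rcond=None)
w_hat, b_hat = w_ls[:-1], w_ls[-1]
```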
(Vanilla) Binary Perceptron – Single Output
Inputs $(1, x_1, \dots, x_n)$ and weights $(w_0, w_1, \dots, w_n)$ form a binary classifier:
$\text{output} = \begin{cases} 0, & \boldsymbol{w}^T \boldsymbol{x} \le 0 \\ 1, & \boldsymbol{w}^T \boldsymbol{x} > 0 \end{cases}$
Note: we've dropped the bias term and replaced it with $w_0$ and a constant input of 1.
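A minimal sketch of this step-function classifier, with the bias absorbed as $w_0$ acting on a constant input of 1 (the weights below are illustrative):

```python
import numpy as np

def vanilla_perceptron(x, w):
    """Binary step classifier: prepend a constant 1 so w[0] plays the role of the bias."""
    x_aug = np.concatenate(([1.0], x))
    return 1 if w @ x_aug > 0 else 0

# Illustrative weights for a 3-dimensional input: w0 (bias), w1, w2, w3.
w = np.array([-0.5, 1.0, 2.0, -1.0])
print(vanilla_perceptron(np.array([1.0, 0.0, 0.0]), w))  # 1, since -0.5 + 1.0 > 0
```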
(Sigmoid) Binary Perceptron – Single Output
Inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, and an activation function $\sigma$:
$\text{output} = \begin{cases} 1, & \sigma\!\left(\sum_i w_i x_i + b\right) > 0.5 \\ 0, & \text{else} \end{cases}$
Binary Perceptron
Data: $N$ pairs $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{N}$, $y_i \in \{0, 1\}$.
The binary perceptron acts as a binary classifier: $\sigma\!\left(\sum_i w_i x_i + b\right)$, where $\sigma(x) = \frac{1}{1 + e^{-x}}$ and $0 \le \sigma(x) \le 1$.
$f(x) = \begin{cases} 1, & \sigma(\boldsymbol{w}^T \boldsymbol{x} + b) > 0.5 \\ 0, & \text{else} \end{cases}$
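A minimal sketch of the sigmoid perceptron as a classifier (parameters are illustrative); note that $\sigma(z) > 0.5$ exactly when $z > 0$, so the decision boundary is still linear:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_perceptron(x, w, b):
    """Predict class 1 when sigma(w.x + b) exceeds 0.5, i.e. when w.x + b > 0."""
    return int(sigmoid(w @ x + b) > 0.5)

# Illustrative parameters for a 2-dimensional input.
w = np.array([1.5, -2.0])
b = 0.1
print(sigmoid_perceptron(np.array([1.0, 0.2]), w, b))  # 1, since 1.5 - 0.4 + 0.1 > 0
```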
(Softmax) Binary Perceptron - Multiple Outputs
Inputs $x_1, \dots, x_n$ and a weight matrix $W$: $W^T_{k \times n}\, \boldsymbol{x}_{n \times 1} = \boldsymbol{h}_{k \times 1}$ and $f(\boldsymbol{x}) = \sigma(\boldsymbol{h})$, where the activation function $\sigma$ is a generalization of the sigmoid called softmax:
$\sigma(\boldsymbol{h})_i = \frac{e^{h_i}}{\sum_{k=1}^{K} e^{h_k}}$
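A minimal NumPy sketch of the softmax output layer (the weights and input are random placeholders; subtracting the maximum score is a standard numerical-stability step):

```python
import numpy as np

def softmax(h):
    """Map scores h in R^K to a probability distribution over the K classes."""
    e = np.exp(h - np.max(h))    # subtract max(h) for numerical stability
    return e / e.sum()

# Illustrative single-layer, multi-output perceptron: h = W^T x.
K, n = 4, 3
W = np.random.randn(n, K)
x = np.random.randn(n)
p = softmax(W.T @ x)
print(p, p.sum())                # K probabilities that sum to 1
```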
Multiclass Binary Perceptron
Probability distribution over $k$ classes: the softmax outputs $\left(\frac{e^{h_1}}{\sum_{k=1}^{K} e^{h_k}}, \frac{e^{h_2}}{\sum_{k=1}^{K} e^{h_k}}, \dots, \frac{e^{h_K}}{\sum_{k=1}^{K} e^{h_k}}\right)$ form a distribution over the classes, e.g., $(0.02, 0.76, \dots, 0.06)$.
Multiclass Binary Perceptron
Correct class distribution: the target is one-hot, e.g., $(0, 1, \dots, 0)$, and is compared against the predicted softmax distribution, e.g., $(0.02, 0.76, \dots, 0.06)$.
We need to calculate a loss: how different is 'our' probability distribution over the possible classes from the correct one?
Cross-entropy (not to be confused with the joint entropy of two random variables):
$H(p, q) = -\sum_{\boldsymbol{x}} p(\boldsymbol{x}) \log q(\boldsymbol{x}) = H(p) + D_{KL}(p \| q)$, where $D_{KL}(p \| q) = \sum_{\boldsymbol{x}} p(\boldsymbol{x}) \log \frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$.
Since our target distribution $p$ is one-hot encoded, $(0, 0, \dots, 1, 0)$, we have $H(p) = 0$. This means minimizing the cross-entropy is equivalent to minimizing the KL divergence $D_{KL}(p \| q)$ between the two distributions. In other words, the cross-entropy objective 'wants' the predicted distribution to have all of its mass on the correct answer.
When using the softmax activation function with the cross-entropy loss we get:
$L_i = -\log \frac{e^{h_{y_i}}}{\sum_{k=1}^{K} e^{h_k}}$
Note: when implementing, use the log-sum-exp trick.
Intuitively: $-\log(1) = 0$ (we gave probability 1 to the correct class), while $-\log(0.25) \approx 0.602$ (base-10 log).
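A minimal NumPy sketch of the softmax cross-entropy loss $L_i$ computed with the log-sum-exp trick mentioned above (the class scores are illustrative):

```python
import numpy as np

def softmax_cross_entropy(h, y):
    """L_i = -log( e^{h_y} / sum_k e^{h_k} ), computed stably via log-sum-exp."""
    h = h - np.max(h)                        # shifting the scores leaves softmax unchanged
    log_sum_exp = np.log(np.sum(np.exp(h)))  # log of the softmax denominator
    return -(h[y] - log_sum_exp)             # -log softmax(h)[y]

scores = np.array([2.0, 0.5, -1.0])          # raw class scores h
print(softmax_cross_entropy(scores, y=0))    # small loss: class 0 already dominates
```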
Multilayer perceptron (MLP)
The XOR (“exclusive OR”) problem
Given 4 points in $\mathbb{R}^2$: $(0,0), (0,1), (1,0), (1,1)$, return the XOR of the two coordinates:
$x_1 = 0, x_2 = 0 \to 0$;  $x_1 = 0, x_2 = 1 \to 1$;  $x_1 = 1, x_2 = 0 \to 1$;  $x_1 = 1, x_2 = 1 \to 0$.
Can we solve the problem with a linear/binary perceptron (with a single output)? Is it linearly separable?
The XOR problem [Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. Cambridge: MIT Press]
The XOR problem
A single-layer perceptron is a linear combination of its inputs: the classification of the input is given by a line which separates the classes. If we look at the constraints:
$0 \cdot w_1 + 0 \cdot w_2 + b \le 0 \iff b \le 0$
$0 \cdot w_1 + 1 \cdot w_2 + b > 0 \iff b > -w_2$
$1 \cdot w_1 + 0 \cdot w_2 + b > 0 \iff b > -w_1$
$1 \cdot w_1 + 1 \cdot w_2 + b \le 0 \iff b \le -(w_1 + w_2)$
There is no solution to this system: adding the second and third inequalities gives $2b > -(w_1 + w_2) \ge b$ by the fourth, so $b > 0$, which contradicts the first.
The XOR problem
We can also try to treat this as a least-squares problem.
Loss function: the squared error over the four points, $J(\theta) = \frac{1}{4}\sum_{\boldsymbol{x}} \left(f^*(\boldsymbol{x}) - f(\boldsymbol{x}; \theta)\right)^2$.
Model: a linear model, $f(\boldsymbol{x}; \boldsymbol{w}, b) = \boldsymbol{x}^T \boldsymbol{w} + b$.
(Exercise) Solving the normal equations we get $\boldsymbol{w} = \boldsymbol{0}$ and $b = \frac{1}{2}$.
[Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
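A quick NumPy check of this result, fitting the linear model to the four XOR points by least squares (lstsq is used in place of explicitly forming the normal equations):

```python
import numpy as np

# The four XOR points and their targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Linear model f(x) = x.w + b: append a column of ones for the bias.
Xb = np.hstack([X, np.ones((4, 1))])
params, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(params)   # approximately [0, 0, 0.5]: w = 0 and b = 1/2, predicting 0.5 everywhere
```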
The XOR problem Adding a hidden layer can help solve the XOR problem.
We will add a vector of hidden units $\boldsymbol{h} = f^{(1)}(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c})$. The values of these hidden units are then used as input for the second (output) layer. Our model is now:
$\boldsymbol{h} = f^{(1)}(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c})$
$\hat{y} = f^{(2)}(\boldsymbol{h}; \boldsymbol{w}, b)$
$f(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c}, \boldsymbol{w}, b) = f^{(2)}\!\left(f^{(1)}(\boldsymbol{x})\right)$
[Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem
$f(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c}, \boldsymbol{w}, b) = f^{(2)}\!\left(f^{(1)}(\boldsymbol{x})\right)$
[Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem
What should be our choice of $f^{(1)}$? $f^{(1)}$ can't be linear, otherwise with $f^{(1)}(\boldsymbol{x}) = \boldsymbol{W}^T \boldsymbol{x}$ and $f^{(2)}(\boldsymbol{h}) = \boldsymbol{h}^T \boldsymbol{w}$ we get
$f(\boldsymbol{x}) = \boldsymbol{w}^T \boldsymbol{W}^T \boldsymbol{x} = \boldsymbol{x}^T \boldsymbol{w}'$, where $\boldsymbol{w}' = \boldsymbol{W} \boldsymbol{w}$,
so the whole network is still linear in $\boldsymbol{x}$. We must use a non-linear function for $f^{(1)}$.
[Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
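A tiny NumPy check of this collapse (the dimensions are illustrative): composing two linear layers gives exactly the single linear map with $\boldsymbol{w}' = \boldsymbol{W}\boldsymbol{w}$.

```python
import numpy as np

n, k = 4, 3
W = np.random.randn(n, k)        # first (linear) hidden layer
w = np.random.randn(k)           # second (linear) output layer
x = np.random.randn(n)

two_layer = w @ (W.T @ x)        # f(x) = w^T (W^T x)
one_layer = (W @ w) @ x          # x^T w' with w' = W w
print(np.allclose(two_layer, one_layer))   # True: still a linear model
```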
The XOR problem
$g(z) = \max\{0, z\}$, which is known as the Rectified Linear Unit (ReLU).
[Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem
Our new model: $f(\boldsymbol{x}; \boldsymbol{W}, \boldsymbol{c}, \boldsymbol{w}, b) = \boldsymbol{w}^T \max\{0, \boldsymbol{W}^T \boldsymbol{x} + \boldsymbol{c}\} + b$
You can find a complete walkthrough of the problem in chapter 6.1 of Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. [Figure from the same source]
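A minimal NumPy sketch of this model on the four XOR inputs, using one valid choice of parameters (the one worked out in the chapter 6.1 walkthrough: W all ones, c = (0, -1), w = (1, -2), b = 0):

```python
import numpy as np

def xor_mlp(x, W, c, w, b):
    """f(x) = w^T max(0, W^T x + c) + b: one ReLU hidden layer, one linear output."""
    h = np.maximum(0.0, W.T @ x + c)   # hidden representation
    return w @ h + b

# Parameter values from the Goodfellow et al. (2016), ch. 6.1 walkthrough.
W = np.array([[1.0, 1.0], [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_mlp(np.array(x, dtype=float), W, c, w, b))
# Prints 0.0, 1.0, 1.0, 0.0: the XOR function.
```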
No hidden layers
MLP with one hidden layer
Think of it as looking at the input layer.
MLP with one hidden layer
The hidden layer learns a representation so that the data is linearly separable
MLP with one hidden layer
[LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444]
How big should our hidden layer be?
Example of overfitting. [Figure taken from Stanford course notes]
Summary Deep learning is a class of supervised learning algorithms.
A linear / binary perceptron acts as a linear classifier. Hidden layers (followed by a non-linear activation function) allow for a non-linear transformation of the input so that it can become linearly separable. The number of neurons and connections in each layer determines our model capacity. [LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444]