
Presentation transcript:

MLCV 182: Introduction to Deep Learning. Ron Shapira Weber, Computer Science, Ben-Gurion University. [Title figure taken from https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html]

Contents: Introduction – What is Deep Learning?; Linear / Binary Perceptron; Multi-Layer Perceptron. [Figure from previous slide taken from https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html]

What is Deep Learning? From the perceptron to deep neural networks.

Example – Object recognition and localization. Note – deep learning, as used here, is a supervised learning algorithm. [Andrej Karpathy and Li Fei-Fei (2015): Deep Visual-Semantic Alignments for Generating Image Descriptions]

Some history – the ImageNet challenge: an image classification problem with 1.2 million images in the training set, each labeled with one of 1000 categories. https://cs.stanford.edu/people/karpathy/cnnembed/

Some history – the ImageNet challenge. Top-5 accuracy: one of the model's five highest-scoring guesses needs to be the correct label. https://blog.acolyer.org/2016/04/20/imagenet-classification-with-deep-convolutional-neural-networks/

Increasing depth on the ImageNet challenge – the trend of increasing depth (image credit: Kaiming He).

ImageNet architecture comparison: number of operations for a single forward pass vs. top-1 accuracy. Dolev will talk more about this topic. [Canziani et al. (2016). An analysis of deep neural network models for practical applications.]

Supervised Learning. Data: X – dataset (images, videos, text, etc.); y – labels (e.g. cat, dog, platypus). A classifier (SVM, LDA, deep neural network, etc.) maps the input to a probability distribution over the classes. Image classification example: for the classes (platypus, dog, cat) the classifier outputs the probabilities (0.66, 0.14, 0.2). *We'll also see variants of deep learning algorithms where the learning isn't supervised.

Supervised Learning. An example of a supervised learning algorithm we saw in this course: least-squares estimation in a linear model. A known function $h:\mathbb{R}^n \to \mathbb{R}^{d\times k}$. Data: $N$ pairs $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^N$ where $(\mathbf{x}_i,\mathbf{y}_i)\in\mathbb{R}^n\times\mathbb{R}^d$. For every $i$, define $\mathbf{H}_i \triangleq h(\mathbf{x}_i)$ ($\mathbf{H}_i$ is a $d\times k$ matrix). Goal: find the optimal (in the least-squares sense) parameter $\theta$, assuming the model $y = h(x)\,\theta$; in other words, $\hat{\theta} = \arg\min_\theta \sum_{i=1}^N \|\mathbf{y}_i - \mathbf{H}_i\theta\|_2^2$. Note that in this framework we try to predict the label ($y$) of the input ($x$): $X$ – data, $y$ – labels.
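To make this concrete, here is a minimal NumPy sketch of least-squares estimation in a linear model; the data, the choice $h(x) = [x, 1]$, and the noise level are made up for the illustration:

```python
import numpy as np

# Hypothetical illustration with d = 1, k = 2: h(x) = [x, 1], so y = h(x) theta fits a line.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)                 # N = 100 scalar inputs
H = np.stack([x, np.ones_like(x)], axis=1)           # rows H_i = h(x_i)
theta_true = np.array([2.0, -1.0])                   # made-up ground-truth parameter
y = H @ theta_true + 0.1 * rng.standard_normal(100)  # noisy labels y_i ~ h(x_i) theta

# Least-squares estimate of theta (solves the normal equations (H^T H) theta = H^T y).
theta_ls, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta_ls)                                      # close to [2, -1]
```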

Unsupervised Learning. Solve some task given "unlabeled" data. Examples of unsupervised learning algorithms we saw in this course: PCA, K-means, GMM.

Supervised Learning Framework: Provide data and labels – $X$, $y$. Split the data into: training data – the majority of the data (for instance 60%), used to train the model; validation set – a partition of the data (e.g. 20%) used for tuning hyperparameters; test data – a partition of the data (e.g. 20%) used to test the accuracy of the model. Define the algorithm. Define a loss function; in the case of linear regression, the squared L2 norm $L(\theta)=\sum_i \|y_i - h(x_i)\,\theta\|_2^2$. Define an optimization method to find $\theta$ such that $\hat{\theta}=\arg\min_\theta L(\theta)$.
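A minimal sketch of the data split described on this slide (60/20/20, as in the example ratios above); the function and variable names are placeholders:

```python
import numpy as np

def split_dataset(X, y, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle the data and split it into train / validation / test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(train_frac * len(X))
    n_val = int(val_frac * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# Dummy data: 100 samples with 5 features each.
X, y = np.random.randn(100, 5), np.random.randn(100)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_dataset(X, y)
print(len(train_X), len(val_X), len(test_X))   # 60 20 20
```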

Example: Deep Learning for image-label classification. Provide data and labels – images and their classes. Split the data into training data, validation set, and test data. Define the algorithm: artificial neural network, convolutional NN, etc. Define a loss function: L2 norm, cross-entropy. Define an optimization method to find $\theta$ that minimizes the loss. Usually there is no closed-form solution, so iterative gradient-based methods are used.
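A minimal sketch of such an iterative gradient-based method, here plain gradient descent on the squared-L2 loss for a linear model; the data, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))                  # 200 samples, 3 features
theta_true = np.array([1.0, -2.0, 0.5])            # made-up ground truth
y = X @ theta_true + 0.05 * rng.standard_normal(200)

theta = np.zeros(3)                                # initial guess
lr = 0.1                                           # learning rate (arbitrary choice)
for _ in range(500):
    residual = X @ theta - y                       # current prediction error
    grad = 2.0 * X.T @ residual / len(y)           # gradient of the mean squared L2 loss
    theta -= lr * grad                             # one gradient-descent step
print(theta)                                       # close to theta_true
```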

When working with images, represent images as vectors: an image $I\in\mathbb{R}^{n\times m}$ is flattened so that $I_f\in\mathbb{R}^{nm}$. [Figure: a 3×3 image whose pixels, indexed $(i,j)$, are unrolled into a length-9 vector.]
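In NumPy, the flattening step is a single reshape; a small sketch with a toy 3×3 image and a hypothetical batch of 28×28 images:

```python
import numpy as np

I = np.arange(9).reshape(3, 3)    # a toy 3x3 "image": I in R^{n x m}
I_f = I.reshape(-1)               # flattened vector: I_f in R^{nm}
print(I_f.shape)                  # (9,)

# For a batch of N images of shape (N, n, m), flatten each image separately:
batch = np.zeros((32, 28, 28))
batch_flat = batch.reshape(len(batch), -1)   # shape (32, 784)
```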

Perceptron. [Figure: inputs $x_1, x_2, x_3, \dots, x_n$ are weighted by $w_1, w_2, w_3, \dots, w_n$, summed as $\sum_i w_i x_i + b$, and passed through an activation to give $f(\sum_i w_i x_i + b)$.]

Some history. The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. In 1969 a famous book entitled "Perceptrons" by Marvin Minsky and Seymour Papert showed that it is impossible for a perceptron to learn the XOR function without adding a hidden layer; hence the term multilayer perceptron, an extension of the original perceptron of the 1950s. https://en.wikipedia.org/wiki/Perceptron

Linear Perceptron – Single Output. [Figure: inputs $x_1,\dots,x_n$ with weights $w_1,\dots,w_n$; prediction $\hat{y}=\sum_i w_i x_i + b$; loss ($\ell_2$): $(y-\hat{y})^2$.]

Linear Perceptron. Data: $N$ pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$ where $(\mathbf{x}_i, y_i)\in\mathbb{R}^n\times\mathbb{R}$. Try to predict $y_i$ by $\hat{y}_i = \mathbf{w}^T\mathbf{x}_i + b$. This is a linear least-squares problem (see PS 4): $f(\mathbf{x};\mathbf{w}) = \sum_{i=1}^N \|y_i - \mathbf{x}_i^T\mathbf{w}\|_{\ell_2}^2$, $\mathbf{w}\in\mathbb{R}^n$. Find $\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} f(\mathbf{x};\mathbf{w})$. Therefore there is a closed-form solution, given by the normal equations $\mathbf{X}^T\mathbf{X}\,\mathbf{w}_{LS} = \mathbf{X}^T\mathbf{y}$, where $\mathbf{X}$ is the entire dataset (each row is a sample).
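A sketch of this closed-form solution in NumPy, on synthetic data; the bias is folded in as an extra all-ones column (the same trick used on the next slide):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))                        # dataset: each row is a sample x_i
w_true, b_true = np.array([0.5, -1.0, 2.0, 0.0]), 0.3    # made-up ground truth
y = X @ w_true + b_true + 0.01 * rng.standard_normal(500)

Xb = np.hstack([X, np.ones((len(X), 1))])                # append a constant 1 so the bias is an extra weight
# Normal equations X^T X w = X^T y; lstsq solves them in a numerically stable way.
w_ls, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print(w_ls)                                              # approx [0.5, -1.0, 2.0, 0.0, 0.3]
```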

(Vanilla) Binary Perceptron – Single Output. [Figure: inputs $1, x_1,\dots,x_n$ with weights $w_0, w_1,\dots,w_n$ feed a binary classifier computing $\sum_i w_i x_i$.] output $= \begin{cases} 0, & \mathbf{w}^T\mathbf{x} \le 0 \\ 1, & \mathbf{w}^T\mathbf{x} > 0 \end{cases}$. Note: we've dropped the explicit bias term and replaced it with $w_0$ and a constant input 1.
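A minimal sketch of this step-function perceptron with the bias folded into $w_0$; the weight values are arbitrary illustration values:

```python
import numpy as np

def vanilla_perceptron(x, w):
    """Step-function perceptron: w[0] plays the role of the bias via a leading constant 1."""
    x = np.concatenate(([1.0], x))
    return 1 if w @ x > 0 else 0

w = np.array([-0.5, 1.0, 1.0])                       # hypothetical weights: w0 (bias), w1, w2
print(vanilla_perceptron(np.array([0.0, 0.0]), w))   # 0
print(vanilla_perceptron(np.array([1.0, 1.0]), w))   # 1
```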

(Sigmoid) Binary Perceptron – Single Output. [Figure: inputs $x_1,\dots,x_n$, weights $w_1,\dots,w_n$, activation function $\sigma(\sum_i w_i x_i)$.] output $= \begin{cases} 1, & \sigma(\sum_i w_i x_i + b) > 0.5 \\ 0, & \text{else} \end{cases}$

Binary Perceptron. Data: $N$ pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, $y_i\in\{0,1\}$. The binary perceptron acts as a binary classifier: $\sigma(\sum_i w_i x_i + b)$, where $\sigma(x) = \frac{1}{1+e^{-x}}$ and $0\le\sigma(x)\le 1$. $f(x) = \begin{cases} 1, & \sigma(\mathbf{w}^T\mathbf{x}+b) > 0.5 \\ 0, & \text{else} \end{cases}$
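The sigmoid variant as a small sketch; the parameters are again arbitrary illustration values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_perceptron(x, w, b):
    """Return the predicted class (0/1) and the sigmoid output sigma(w.x + b)."""
    p = sigmoid(w @ x + b)
    return int(p > 0.5), p

w, b = np.array([2.0, -1.0]), 0.1                       # hypothetical parameters
print(sigmoid_perceptron(np.array([1.0, 0.5]), w, b))   # (1, ~0.83)
```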

(Softmax) Perceptron – Multiple Outputs. [Figure: inputs $x_1,\dots,x_n$, weight matrix $W\in\mathbb{R}^{k\times n}$, one activation $\sigma(\sum_i w_{ij} x_i)$ per output $j$.] The scores are $\mathbf{h} = W\mathbf{x}$ with $\mathbf{h}\in\mathbb{R}^{k}$, and $f(\mathbf{x}) = \sigma(\mathbf{h})$, where $\sigma$ is a generalization of the sigmoid function called softmax: $\sigma(\mathbf{h})_i = \frac{e^{h_i}}{\sum_{k=1}^K e^{h_k}}$.
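A sketch of the softmax applied to the scores $\mathbf{h} = W\mathbf{x}$; the weight matrix and input are made up, and the max-subtraction is a standard stabilization that leaves the result unchanged:

```python
import numpy as np

def softmax(h):
    """sigma(h)_i = exp(h_i) / sum_k exp(h_k); subtracting max(h) avoids overflow, result unchanged."""
    e = np.exp(h - np.max(h))
    return e / e.sum()

W = np.array([[1.0, -1.0, 0.5],      # hypothetical k x n weight matrix (k = 2 classes, n = 3 inputs)
              [0.2,  0.3, -0.7]])
x = np.array([1.0, 2.0, 0.0])
h = W @ x                            # class scores, one per output
print(softmax(h), softmax(h).sum())  # probabilities summing to 1
```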

Multiclass (Softmax) Perceptron: the outputs form a probability distribution over the $k$ classes. [Figure: the softmax outputs $\frac{e^{h_1}}{\sum_{k=1}^K e^{h_k}}, \frac{e^{h_2}}{\sum_{k=1}^K e^{h_k}}, \dots, \frac{e^{h_K}}{\sum_{k=1}^K e^{h_k}}$, e.g. $(0.02, 0.76, \dots, 0.06)$.]

Multiclass (Softmax) Perceptron: the correct class distribution is one-hot, e.g. $(0, 1, \dots, 0)$, to be compared with the predicted distribution $(0.02, 0.76, \dots, 0.06)$. [Figure: the same network as the previous slide, with the target distribution shown alongside the softmax outputs.]

We need to calculate a loss: how different is 'our' probability distribution over the possible classes from the correct one? Cross-entropy (not to be confused with the joint entropy of two random variables): $H(p,q) = -\sum_{\mathbf{x}} p(\mathbf{x})\log q(\mathbf{x}) = H(p) + D_{KL}(p\|q)$, where $D_{KL}(p\|q) = \sum_{\mathbf{x}} p(\mathbf{x})\log\frac{p(\mathbf{x})}{q(\mathbf{x})}$. Since our target distribution $p$ is one-hot encoded, $(0,0,\dots,1,0)$, we have $H(p) = 0$, so minimizing the cross-entropy is equivalent to minimizing the KL divergence $D_{KL}(p\|q)$ between the two distributions. In other words, the cross-entropy objective 'wants' the predicted distribution to have all of its mass on the correct answer. When using the softmax activation function with the cross-entropy loss we get: $L_i = -\log\frac{e^{h_{y_i}}}{\sum_{k=1}^K e^{h_k}}$. Note: when implementing, use the log-sum-exp trick for numerical stability. Intuitively: $-\log(1) = 0$ (we gave probability 1 to the correct class), while $-\log(0.25) = 0.602$ (base-10 log). http://cs231n.github.io/linear-classify/
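A sketch of the per-sample softmax cross-entropy loss using the log-sum-exp trick mentioned above; the scores and label are made up:

```python
import numpy as np

def cross_entropy_loss(h, y):
    """L_i = -log( exp(h_y) / sum_k exp(h_k) ) = -h_y + logsumexp(h), computed stably."""
    m = np.max(h)
    log_sum_exp = m + np.log(np.sum(np.exp(h - m)))   # log-sum-exp trick: shift scores by their max
    return -h[y] + log_sum_exp

h = np.array([2.0, 5.0, -1.0])    # hypothetical class scores
y = 1                             # index of the correct class
print(cross_entropy_loss(h, y))   # small, since class 1 already gets most of the softmax mass
```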

Multilayer perceptron (MLP)

The XOR ("exclusive OR") problem. Given the 4 points $(0,0), (0,1), (1,0), (1,1)$ in $\mathbb{R}^2$, return the XOR of the two coordinates: the output is 1 for $(0,1)$ and $(1,0)$, and 0 for $(0,0)$ and $(1,1)$. Can we solve the problem with a linear/binary perceptron (with a single output)? Is it linearly separable?

The XOR problem. [Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.]

The XOR problem. A single-layer perceptron is a linear combination of its inputs; the classification of the input is given by a line which separates the classes. If we look at the equations: $0\cdot w_1 + 0\cdot w_2 + b \le 0 \Leftrightarrow b \le 0$; $0\cdot w_1 + 1\cdot w_2 + b > 0 \Leftrightarrow b > -w_2$; $1\cdot w_1 + 0\cdot w_2 + b > 0 \Leftrightarrow b > -w_1$; $1\cdot w_1 + 1\cdot w_2 + b \le 0 \Leftrightarrow b \le -(w_1 + w_2)$. There is no solution to this system: with $b \le 0$, the middle two inequalities force $w_1, w_2 > 0$, hence $-(w_1+w_2) < -w_2 < b$, contradicting the last inequality.

The XOR problem. We can also try to treat this problem as a least-squares problem. Loss function: $J(\theta) = \frac{1}{4}\sum_{\mathbf{x}\in X} (y(\mathbf{x}) - f(\mathbf{x};\theta))^2$; model: $f(\mathbf{x};\mathbf{w},b) = \mathbf{x}^T\mathbf{w} + b$. (Exercise) Solving the normal equations we get $\mathbf{w}=\mathbf{0}$ and $b=\frac{1}{2}$, i.e. the model outputs 0.5 everywhere. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. [Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]

The XOR problem. Adding a hidden layer can help solve the XOR problem. We add a vector of hidden units $\mathbf{h} = f^{(1)}(\mathbf{x};\mathbf{W},\mathbf{c})$; the values of these hidden units are then used as input for the second (output) layer. Our model is now: $\mathbf{h} = f^{(1)}(\mathbf{x};\mathbf{W},\mathbf{c})$, $\hat{y} = f^{(2)}(\mathbf{h};\mathbf{w},b)$, i.e. $f(\mathbf{x};\mathbf{W},\mathbf{c},\mathbf{w},b) = f^{(2)}(f^{(1)}(\mathbf{x}))$. [Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]

The XOR problem. $f(\mathbf{x};\mathbf{W},\mathbf{c},\mathbf{w},b) = f^{(2)}(f^{(1)}(\mathbf{x}))$. [Figure: the corresponding two-layer network. From: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]

The XOR problem. What should be our choice of $f^{(1)}$? $f^{(1)}$ can't be linear: otherwise, with $f^{(1)}(\mathbf{x}) = \mathbf{W}^T\mathbf{x}$ and $f^{(2)}(\mathbf{h}) = \mathbf{h}^T\mathbf{w}$, we get $f(\mathbf{x}) = \mathbf{w}^T\mathbf{W}^T\mathbf{x} = \mathbf{x}^T\mathbf{w}'$ where $\mathbf{w}' = \mathbf{W}\mathbf{w}$, so the whole network is still linear in $\mathbf{x}$. We must use a non-linear function for $f^{(1)}$. [Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]

The XOR problem. We use $g(z) = \max\{0, z\}$, which is known as the Rectified Linear Unit (ReLU). [Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]

The XOR problem. Our new model: $f(\mathbf{x};\mathbf{W},\mathbf{c},\mathbf{w},b) = \mathbf{w}^T\max\{0, \mathbf{W}^T\mathbf{x} + \mathbf{c}\} + b$. You can find a complete walkthrough of the problem at http://www.deeplearningbook.org/, chapter 6.1. [Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
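A sketch that plugs one set of weights into this model and checks that it reproduces XOR; these values match the chapter 6.1 walkthrough as best I recall, and in any case the printed outputs confirm they work:

```python
import numpy as np

def xor_net(x, W, c, w, b):
    """f(x; W, c, w, b) = w^T max(0, W^T x + c) + b  (one ReLU hidden layer, linear output)."""
    h = np.maximum(0.0, W.T @ x + c)   # hidden layer with ReLU activation
    return w @ h + b

# One set of weights that solves XOR (the loop below verifies that they indeed work).
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float), W, c, w, b))
# (0,0) -> 0.0, (0,1) -> 1.0, (1,0) -> 1.0, (1,1) -> 0.0
```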

No hidden layers http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

MLP with one hidden layer. Think of the hidden layer as a new way of looking at the input layer. http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

MLP with one hidden layer. The hidden layer learns a representation of the data so that it becomes linearly separable. http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

MLP with one hidden layer. [Figure from: LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539]

How big should our hidden layer be? An example of overfitting. [Taken from the Stanford notes; interactive demo: https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html]

Summary. Deep learning is a class of supervised learning algorithms. The linear / binary perceptron acts as a linear classifier. Hidden layers (followed by non-linear activation functions) allow for a non-linear transformation of the input so that it can become linearly separable. The number of neurons and connections in each layer determines our model's capacity. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.