A shallow introduction to Deep Learning
Zhiting Hu
Outline
Motivation: why go deep?
DL since 2006
Some DL Models
Discussion
Definition
Deep Learning: a wide class of machine learning techniques and architectures, with the hallmark of using many layers of non-linear information processing that are hierarchical in nature.
An example: deep neural networks.
Example: Neural Network
Input: x
Output: Y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0)
Deep Neural Network (DNN)
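To make this concrete, here is a minimal NumPy sketch of the forward pass of such a network. The layer sizes (a 784-dimensional input and two hidden layers feeding a 10-way softmax that matches the one-hot output Y above) are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, weights, biases):
    """Forward pass: alternate affine maps and non-linearities, softmax at the end."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(W @ h + b)                        # hidden layers: non-linear units
    return softmax(weights[-1] @ h + biases[-1])      # output layer: class probabilities

# Illustrative layer sizes (assumed): 784 -> 256 -> 64 -> 10
sizes = [784, 256, 64, 10]
rng = np.random.default_rng(0)
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

x = rng.random(784)                    # a stand-in input vector
y_hat = forward(x, weights, biases)    # 10 probabilities; argmax gives the predicted class
print(y_hat.shape, y_hat.argmax())
```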
Parameter learning: Back-propagation
Given a training dataset (X, Y), learn the parameters W.
Two phases:
(1) Forward propagation: pass the input through the network to compute the output and the loss.
(2) Backward propagation: propagate the error backwards through the network to obtain the gradient of the loss with respect to every weight, then update the weights.
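As an illustration of the two phases, here is a small NumPy sketch of forward and backward propagation for a one-hidden-layer network with a squared-error loss; the architecture, loss, and toy data are assumptions chosen for brevity rather than the exact setup of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 4))                       # 5 toy training examples, 4 features each
Y = rng.random((5, 3))                       # 3 target values per example
W1 = rng.normal(0, 0.5, (4, 8))              # input -> hidden weights
W2 = rng.normal(0, 0.5, (8, 3))              # hidden -> output weights
lr = 0.1

for step in range(100):
    # (1) Forward propagation: compute activations layer by layer and the loss.
    H = np.tanh(X @ W1)                      # hidden activations
    Y_hat = H @ W2                           # network outputs
    loss = 0.5 * ((Y_hat - Y) ** 2).sum() / X.shape[0]

    # (2) Backward propagation: apply the chain rule from the output back to the
    # input, yielding the gradient of the loss w.r.t. every weight matrix.
    dY = (Y_hat - Y) / X.shape[0]            # d loss / d Y_hat
    dW2 = H.T @ dY
    dH = dY @ W2.T
    dW1 = X.T @ (dH * (1 - H ** 2))          # tanh'(a) = 1 - tanh(a)^2

    # Gradient-descent update of the parameters W.
    W1 -= lr * dW1
    W2 -= lr * dW2

print(round(loss, 4))                        # loss after 100 updates
```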
Motivation: why go deep?
Brains have a deep architecture.
Humans organize their ideas hierarchically, through composition of simpler ideas.
Insufficiently deep architectures can be exponentially inefficient.
Distributed representations are necessary to achieve non-local generalization.
Intermediate representations allow sharing statistical strength.
Brains have a deep architecture
Deep Learning = learning hierarchical representations (features) [Lee, Grosse, Ranganath & Ng, 2009]
Deep Architecture in our Mind
Humans organize their ideas and concepts hierarchically.
Humans first learn simpler concepts and then compose them to represent more abstract ones.
Engineers break up solutions into multiple levels of abstraction and processing.
Insufficiently deep architectures can be exponentially inefficient
Theoretical arguments:
Two layers of neurons form a universal approximator.
But some functions compactly represented with k layers may require exponential size with only 2 layers.
Theorems on the advantage of depth: (Hastad et al. 1986 & 1991, Bengio et al. 2007, Bengio & Delalleau 2011, Braverman 2011)
Analogy: a "shallow" computer program vs. a "deep" computer program.
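As a rough illustration of the "shallow vs. deep program" analogy (my example, not the deck's): the parity of n bits can be computed by a logarithmic-depth tree of XOR operations that reuses intermediate results, whereas a depth-2 sum-of-products representation of parity needs 2^(n-1) terms; this blow-up is exactly the kind formalized by the theorems cited above.

```python
from functools import reduce
from itertools import product
from operator import xor

def parity_deep(bits):
    """'Deep' computation: a balanced tree of pairwise XORs, depth ~ log2(n)."""
    layer = list(bits)
    while len(layer) > 1:
        layer = [layer[i] ^ layer[i + 1] if i + 1 < len(layer) else layer[i]
                 for i in range(0, len(layer), 2)]
    return layer[0]

def parity_shallow_terms(n):
    """'Shallow' (depth-2, sum-of-products) representation: one AND-term per
    odd-weight input pattern, i.e. 2**(n-1) terms."""
    return sum(1 for p in product([0, 1], repeat=n) if sum(p) % 2 == 1)

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(parity_deep(bits), reduce(xor, bits))   # same answer, only ~log2(8)=3 levels of XOR
print(parity_shallow_terms(8))                # 128 = 2**7 AND-terms for just 8 inputs
```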
Why now? The "Winter of Neural Networks" since the 90's
Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets).
Main difficulty: local optima in the non-convex objective function of deep networks; back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima.
Other difficulties: too many parameters and small labeled datasets lead to overfitting; theoretical analysis is hard; training needs a lot of tricks.
So people turned to shallow models with convex loss functions (e.g., SVMs, CRFs).
What has changed?
New methods for unsupervised pre-training have been developed. Unsupervised: use unlabeled data. Pre-training: better initialization leads to better local optima.
GPUs and distributed systems make large-scale learning practical.
Success in object recognition
Task: classify the 1.2 million images of the ImageNet LSVRC-2010 contest into 1000 different classes.
Success in speech recognition
Google uses DL in its Android speech recognizer (both server-side and on some phones with enough memory).
Results from Google, IBM, and Microsoft.
Success in NLP: neural word embeddings
Use a neural network to learn a vector representation of each word.
Semantic relations appear as linear relationships in the space of learned representations:
King - Queen ≈ Man - Woman
Paris - France + Italy ≈ Rome
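A minimal sketch of how such analogies can be checked once embeddings are available; the 3-dimensional vectors below are made up purely for illustration, whereas real embeddings are learned by a neural network and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical word vectors (toy values for illustration only).
emb = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.7, 0.9]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "king - man + woman" should land near "queen" if the relations are linear.
query = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
print(best)   # -> queen (by construction of this toy example)
```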
DL in Industry
Microsoft: first successful DL models for speech recognition, by MSR in 2009.
Google: "Google Brain", led by Google fellow Jeff Dean; large-scale deep learning infrastructure (Le et al., ICML'12) trained on 10 million 200x200 images with a network of 1 billion connections, using 1,000 machines (16K cores) for 3 days.
Facebook: hired an NYU deep learning expert to run its new AI lab.
Outline
Motivation: why go deep?
DL since 2006
Some DL Models: Convolutional Neural Networks, Deep Belief Nets, Stacked auto-encoders / sparse coding
Discussion
Convolutional Neural Networks (CNNs)
Proposed by LeCun et al. (1989); the "only" successful DL model before 2006.
Widely applied to image data (and recently also to other tasks).
Motivation: nearby pixels are more strongly correlated than distant pixels, and we want translation invariance.
Key ingredients of CNNs:
Local receptive fields.
Weight sharing: all units in a convolutional layer detect the same pattern, but at different locations in the input image.
Subsampling: makes the representation relatively insensitive to small shifts of the image.
Training: back-propagation.
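A minimal NumPy sketch of these three ingredients, assuming a single hand-picked 3x3 filter and 2x2 max pooling; a real CNN uses many filters per layer and learns them by back-propagation, which this sketch does not do.

```python
import numpy as np

def conv2d(image, kernel):
    """One feature map: the SAME 3x3 weights (weight sharing) are applied at every
    location, and each output unit sees only a small patch (local receptive field)."""
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

def max_pool(fmap, size=2):
    """Subsampling: keep the max in each 2x2 block, so small shifts of the input
    barely change the output."""
    H, W = (fmap.shape[0] // size) * size, (fmap.shape[1] // size) * size
    return fmap[:H, :W].reshape(H // size, size, W // size, size).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((28, 28))                     # e.g. a 28x28 grayscale image (assumed)
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])          # a hand-picked vertical-edge detector
fmap = np.maximum(conv2d(image, edge_filter), 0) # convolution + a non-linearity (ReLU, assumed)
print(max_pool(fmap).shape)                      # (13, 13) subsampled feature map
```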
CNNs: MNIST handwritten digits benchmark
State of the art: 0.35% error rate (IJCAI 2011).
Restricted Boltzmann Machine (RBM)
Building block of Deep Belief Nets (DBNs) and the Deep Boltzmann Machine (DBM).
A bipartite undirected graphical model over visible units v and hidden units h.
Define the energy E(v, h) = -b^T v - c^T h - v^T W h, which gives the joint distribution P(v, h) ∝ exp(-E(v, h)).
Parameter learning: the model parameters are W, b, c. Maximize the likelihood of the training data by gradient descent, using Contrastive Divergence (CD) to approximate the gradient.
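A sketch of one Contrastive Divergence (CD-1) update for a binary RBM, following the standard formulation above; the layer sizes, learning rate, and toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One Contrastive Divergence (CD-1) step for a binary RBM with energy
    E(v, h) = -b^T v - c^T h - v^T W h: approximate the likelihood gradient with a
    single Gibbs step instead of running the Markov chain to equilibrium."""
    # Positive phase: hidden probabilities given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)
    # Approximate gradient: data statistics minus reconstruction statistics.
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)
    return W, b, c

# Illustrative sizes (assumed): 6 visible units, 4 hidden units.
n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
v = (rng.random(n_visible) < 0.5).astype(float)   # a toy binary training example
for _ in range(10):
    W, b, c = cd1_update(v, W, b, c)
```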
Deep Belief Nets (DBNs)
A DBN is built by stacking RBMs: the hidden layer of one RBM serves as the input to the next.
DBNs: layer-wise pre-training
Train the first RBM on the input data; then train each higher RBM on the hidden activations of the layer below, one layer at a time.
Supervised fine-tuning
After pre-training, the parameters W and c learned for each layer can be used to initialize a deep multi-layer neural network. These parameters are then fine-tuned using back-propagation on labeled data.
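The overall recipe, in a schematic Python sketch: `train_rbm` below is a stub standing in for the CD-based RBM training described above (not a function from any library), and the data and layer sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_rbm(data, n_hidden, epochs=10):
    """Stub for RBM training: in practice this would run Contrastive Divergence
    (as in the earlier CD-1 sketch); here it just returns initialized parameters
    and the hidden representations, to show the shape of the overall recipe."""
    W = rng.normal(0, 0.1, (data.shape[1], n_hidden))
    c = np.zeros(n_hidden)
    hidden = 1.0 / (1.0 + np.exp(-(data @ W + c)))   # hidden activations
    return (W, c), hidden

# (1) Unsupervised, greedy layer-wise pre-training on unlabeled data.
X_unlabeled = rng.random((100, 784))          # assumed toy data: 100 examples, 784 dims
layer_sizes = [256, 64]                       # assumed hidden-layer sizes
params, layer_input = [], X_unlabeled
for n_hidden in layer_sizes:
    (W, c), layer_input = train_rbm(layer_input, n_hidden)
    params.append((W, c))                     # W, c initialize one layer of the DNN

# (2) Supervised fine-tuning: use `params` to initialize a deep network, add an
# output layer, and run back-propagation on labeled data (as sketched earlier).
W_out = rng.normal(0, 0.1, (layer_sizes[-1], 10))
print([W.shape for (W, c) in params], W_out.shape)
```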
Stacked auto-encoders / sparse coding
Building blocks: auto-encoders / sparse coding (non-probabilistic).
Structure similar to DBNs.
Let's skip the details.
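For completeness, a minimal sketch of the auto-encoder building block; tied weights and a squared reconstruction loss are my assumptions, since the slide does not specify a particular variant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 20))                 # assumed toy data: 50 examples, 20 features
W = rng.normal(0, 0.1, (20, 8))          # encoder weights; decoder reuses W.T (tied)
b_enc, b_dec = np.zeros(8), np.zeros(20)
lr = 0.1

for step in range(200):
    H = np.tanh(X @ W + b_enc)           # encode to an 8-dimensional code
    X_hat = H @ W.T + b_dec              # decode back to the input space
    err = X_hat - X                      # reconstruction error drives the learning
    # Gradients of the mean squared reconstruction loss (tied weights combine the
    # encoder and decoder contributions to W).
    dH = (err @ W) * (1 - H ** 2)
    dW = (X.T @ dH + err.T @ H) / X.shape[0]
    W -= lr * dW
    b_enc -= lr * dH.mean(axis=0)
    b_dec -= lr * err.mean(axis=0)

print(np.mean((np.tanh(X @ W + b_enc) @ W.T + b_dec - X) ** 2))  # reconstruction MSE
```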
Deep Learning = learning hierarchical features
The pipeline of machine visual perception, like features in NLP, has traditionally relied on hand-crafted features; deep learning instead learns hierarchical features from data.
Problems
No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyperparameter tuning: number of layers, layer sizes, connectivity, and the learning rate.
Computational scaling: recent breakthroughs in speech, object recognition, and NLP hinged on faster computing, GPUs, and large datasets.
Lack of theoretical analysis.