A Shallow Introduction to Deep Learning
Zhiting Hu
2014-04-01
Outline Motivation: why go deep? DL since 2006 Some DL Models Discussion
Definition: Deep Learning is a wide class of machine learning techniques and architectures, with the hallmark of using many layers of non-linear information processing that are hierarchical in nature. An example: deep neural networks.
Example: Neural Network. Input: x; Output: Y = (0, 0, 0, 0, 0, 1, 0, 0, 0, 0), a one-hot vector over 10 classes.
Deep Neural Network (DNN)
Parameter learning: Back-propagation. Given a training dataset (X, Y), learn the parameters W. Two phases: (1) forward propagation: compute the activations of each layer from input to output; (2) backward propagation: propagate the error gradient from the output back through the layers and update the weights (a minimal sketch follows).
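A minimal sketch of the two phases, assuming a tiny two-layer sigmoid network trained with squared error; the sizes, random data, and learning rate are illustrative placeholders, not values from the slides:

```python
import numpy as np

# Toy two-layer network trained by back-propagation. All sizes, data and the
# learning rate are arbitrary choices made for this sketch.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((5, 4))                     # 5 training examples, 4 features
Y = np.eye(3)[rng.integers(0, 3, size=5)]  # fake one-hot targets, 3 classes

W1 = 0.1 * rng.standard_normal((4, 8))     # input -> hidden weights
b1 = np.zeros(8)
W2 = 0.1 * rng.standard_normal((8, 3))     # hidden -> output weights
b2 = np.zeros(3)
lr = 0.5

for epoch in range(1000):
    # (1) Forward propagation: compute activations layer by layer.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # (2) Backward propagation: push the error gradient back through the layers.
    delta2 = (y_hat - Y) * y_hat * (1 - y_hat)   # gradient at output pre-activation
    delta1 = (delta2 @ W2.T) * h * (1 - h)       # gradient at hidden pre-activation

    # Gradient-descent update of the parameters W (and biases).
    W2 -= lr * h.T @ delta2 / len(X)
    b2 -= lr * delta2.mean(axis=0)
    W1 -= lr * X.T @ delta1 / len(X)
    b1 -= lr * delta1.mean(axis=0)
```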
Why go deep?
- Brains have a deep architecture
- Humans organize their ideas hierarchically, through composition of simpler ideas
- Insufficiently deep architectures can be exponentially inefficient
- Distributed representations are necessary to achieve non-local generalization
- Intermediate representations allow sharing statistical strength
Brains have a deep architecture. Deep Learning = learning hierarchical representations (features) [Lee, Grosse, Ranganath & Ng, 2009]
Deep architecture in our mind
- Humans organize their ideas and concepts hierarchically
- Humans first learn simpler concepts and then compose them to represent more abstract ones
- Engineers break up solutions into multiple levels of abstraction and processing
Insufficiently deep architectures can be exponentially inefficient
- Theoretical arguments: two layers of neurons form a universal approximator, but some functions compactly represented with k layers may require exponential size with 2 layers
- Theorems on the advantage of depth: Hastad et al. (1986, 1991), Bengio et al. (2007), Bengio & Delalleau (2011), Braverman (2011)
Insufficiently deep architectures can be exponentially inefficient: analogy of a "shallow" computer program vs. a "deep" computer program.
Outline Motivation: why go deep? DL since 2006 Some DL Models Discussion
Why now? The "Winter of Neural Networks" since the 90s
- Before 2006, training deep architectures was unsuccessful (except for convolutional neural nets)
- Main difficulty: local optima in the non-convex objective function of deep networks; back-propagation (local gradient descent from random initialization) often gets trapped in poor local optima
- Other difficulties: too many parameters but only small labeled datasets => overfitting; hard to analyze theoretically; many tricks needed to make training work
- So people turned to shallow models with convex loss functions (e.g., SVMs, CRFs)
What has changed?
- New methods for unsupervised pre-training have been developed (unsupervised: use unlabeled data; pre-training: better initialization => better local optima)
- GPUs and distributed systems make large-scale learning feasible
Success in object recognition. Task: classify the 1.2 million images of the ImageNet LSVRC-2010 contest into 1000 different classes.
Success in speech recognition. Google uses DL in their Android speech recognizer (both server-side and on some phones with enough memory). Results from Google, IBM, and Microsoft.
Success in NLP: neural word embeddings
- Use a neural network to learn a vector representation of each word
- Semantic relations appear as linear relationships in the space of learned representations: King - Queen ≈ Man - Woman; Paris - France + Italy ≈ Rome (a toy sketch follows)
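A toy sketch of the vector arithmetic behind these analogies; the 3-dimensional "embeddings" below are made-up numbers chosen only to illustrate the idea, whereas real embeddings would come from a trained model such as word2vec:

```python
import numpy as np

# Made-up toy word vectors (NOT real learned embeddings).
emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.3, 0.9, 0.1]),
    "woman": np.array([0.3, 0.4, 0.6]),
    "apple": np.array([0.9, 0.2, 0.2]),   # distractor word
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# If semantic relations are linear, king - man + woman should land near queen.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(target, emb[w]))
print(best)   # -> "queen" with these toy vectors
```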
DL in Industry
- Microsoft: first successful DL models for speech recognition, by MSR in 2009
- Google: "Google Brain", led by Google Fellow Jeff Dean; large-scale deep learning infrastructure (Le et al., ICML'12): 10 million 200×200 images, a network with 1 billion connections, trained on 1,000 machines (16,000 cores) for 3 days
- Facebook: hired an NYU deep learning expert to run its new AI lab (Dec. 2013)
Outline Motivation: why go deep? DL since 2006 Some DL Models Convolutional Neural Networks Deep Belief Nets Stacked auto-encoders / sparse coding Discussion
Convolutional Neural Networks (CNNs)
- Proposed by LeCun et al. (1989); the "only" successful DL model before 2006
- Widely used for image data (recently also for other tasks)
- Motivation: nearby pixels are more strongly correlated than distant pixels; translation invariance
- Local receptive fields: each unit looks at a small patch of the input
- Weight sharing: all units in a convolutional layer detect the same pattern, but at different locations in the input image
- Subsampling: makes the representation relatively insensitive to small shifts of the image (see the sketch after this list)
- Training: back-propagation
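A rough sketch of the two building blocks named in the list above: a single shared 3x3 filter slid over every position (local receptive fields plus weight sharing), followed by 2x2 max pooling (subsampling). The image, the hand-picked edge filter, and all sizes are assumptions made for illustration; a real CNN learns many filters by back-propagation:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image: every output unit sees only a
    small local patch, and every position reuses the same weights."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Subsampling: keep only the max of each size-by-size block, which makes
    the representation tolerant to small shifts of the input."""
    H, W = fmap.shape
    out = np.zeros((H // size, W // size))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = fmap[i * size:(i + 1) * size,
                             j * size:(j + 1) * size].max()
    return out

image = np.random.rand(9, 9)                    # stand-in for a grey-scale image
edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # hand-picked vertical-edge kernel
feature_map = conv2d(image, edge_filter)        # 7x7 feature map (shared weights)
pooled = max_pool(feature_map)                  # 3x3 map after subsampling
```

The single kernel reused at every position is the weight sharing described above, and the pooling step is the subsampling that gives tolerance to small shifts.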
CNNs on the MNIST handwritten digits benchmark: state-of-the-art error rate of 0.35% (IJCAI 2011).
Outline Motivation: why go deep? DL since 2006 Some DL Models Convolutional Neural Networks Deep Belief Nets Stacked auto-encoders / sparse coding Discussion
Restricted Boltzmann Machine (RBM)
- Building block of Deep Belief Nets (DBNs) and Deep Boltzmann Machines (DBMs)
- Bipartite undirected graphical model over visible units v and hidden units h
- Define (standard binary RBM): E(v, h) = -b'v - c'h - v'Wh, with P(v, h) ∝ exp(-E(v, h))
- Parameter learning: model parameters W, b, c; maximize the log-likelihood of the training data by gradient descent, but use Contrastive Divergence (CD) to approximate the gradient (sketch below)
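A minimal sketch of a CD-1 training loop for a binary RBM with parameters W, b, c; the data, sizes, learning rate, and number of steps are placeholder assumptions:

```python
import numpy as np

# One-step Contrastive Divergence (CD-1) for a small binary RBM.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
V = (rng.random((20, 6)) > 0.5).astype(float)   # 20 binary training vectors
W = 0.01 * rng.standard_normal((6, 4))          # 6 visible units, 4 hidden units
b, c = np.zeros(6), np.zeros(4)                 # visible and hidden biases
lr = 0.1

for step in range(100):
    # Positive phase: hidden probabilities given the data.
    ph_data = sigmoid(V @ W + c)
    h_sample = (rng.random(ph_data.shape) < ph_data).astype(float)

    # Negative phase: one Gibbs step to get a reconstruction.
    pv_recon = sigmoid(h_sample @ W.T + b)
    ph_recon = sigmoid(pv_recon @ W + c)

    # CD-1 approximation to the log-likelihood gradient.
    W += lr * (V.T @ ph_data - pv_recon.T @ ph_recon) / len(V)
    b += lr * (V - pv_recon).mean(axis=0)
    c += lr * (ph_data - ph_recon).mean(axis=0)
```

The exact likelihood gradient is intractable because of the partition function; CD replaces the model expectation with statistics from a single Gibbs step started at the data.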
Deep Belief Nets (DBNs)
DBNs: layer-wise pre-training. Train the first RBM on the input data, then train each higher RBM on the hidden activations produced by the layer below, one layer at a time (sketched below).
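A compact sketch of the greedy stacking procedure, assuming the simplified CD-1 trainer defined inside (a stand-in, not the exact recipe from the original DBN papers); layer sizes and the unlabeled data are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_train(data, n_hidden, lr=0.1, steps=100, seed=0):
    """Train one RBM on `data` with CD-1 and return its weights and hidden biases."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    b, c = np.zeros(data.shape[1]), np.zeros(n_hidden)
    for _ in range(steps):
        ph = sigmoid(data @ W + c)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)
        ph2 = sigmoid(pv @ W + c)
        W += lr * (data.T @ ph - pv.T @ ph2) / len(data)
        b += lr * (data - pv).mean(axis=0)
        c += lr * (ph - ph2).mean(axis=0)
    return W, c

rng = np.random.default_rng(1)
X = (rng.random((50, 8)) > 0.5).astype(float)   # unlabeled binary data

layers, reps = [], X
for n_hidden in [6, 4, 3]:                      # train one RBM per layer, bottom-up
    W, c = cd1_train(reps, n_hidden)
    layers.append((W, c))
    reps = sigmoid(reps @ W + c)                # hidden activations feed the next RBM
```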
Supervised fine-tuning: after pre-training, the parameters W and c of each layer can be used to initialize a deep multi-layer neural network; these parameters are then fine-tuned using back-propagation on labeled data (sketch below).
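A sketch of the fine-tuning step; the "pre-trained" (W, c) pairs below are random stand-ins for parameters that would really come from the layer-wise RBM training above, and the labeled data, layer sizes, and learning rate are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layer_sizes = [6, 5, 4]                              # visible -> hidden1 -> hidden2
pretrained = [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
              for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

# Initialize a feed-forward net from the pre-trained (W, c) pairs and put a
# fresh, randomly initialized output layer on top.
weights = [W.copy() for W, _ in pretrained] + [0.1 * rng.standard_normal((4, 2))]
biases  = [c.copy() for _, c in pretrained] + [np.zeros(2)]

X = rng.random((10, 6))                              # labeled inputs (placeholders)
Y = np.eye(2)[rng.integers(0, 2, size=10)]           # labels for 2 classes

for epoch in range(200):                             # plain back-propagation
    acts = [X]
    for W, b in zip(weights, biases):                # forward pass
        acts.append(sigmoid(acts[-1] @ W + b))
    delta = (acts[-1] - Y) * acts[-1] * (1 - acts[-1])
    for i in reversed(range(len(weights))):          # backward pass, every layer
        grad_W = acts[i].T @ delta / len(X)
        grad_b = delta.mean(axis=0)
        if i > 0:
            delta = (delta @ weights[i].T) * acts[i] * (1 - acts[i])
        weights[i] -= 0.5 * grad_W
        biases[i]  -= 0.5 * grad_b
```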
Outline Motivation: why go deep? DL since 2006 Some DL Models Convolutional Neural Networks Deep Belief Nets Stacked auto-encoders / sparse coding Discussion
Stacked auto-encoders / sparse coding
- Building blocks: auto-encoders / sparse coding (non-probabilistic)
- Structure similar to DBNs
- Let's skip the details... (a minimal sketch follows anyway)
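Even though the slides skip the details, here is a minimal sketch of the auto-encoder building block (encode the input into a small code, decode it back, minimize the reconstruction error); all sizes, data, and the squared-error objective are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((30, 8))                      # unlabeled data (placeholders)
W_enc = 0.1 * rng.standard_normal((8, 3))    # encoder: 8 inputs -> 3-dim code
W_dec = 0.1 * rng.standard_normal((3, 8))    # decoder: code -> reconstruction
b_enc, b_dec = np.zeros(3), np.zeros(8)
lr = 0.5

for epoch in range(500):
    code = sigmoid(X @ W_enc + b_enc)        # encode
    recon = sigmoid(code @ W_dec + b_dec)    # decode
    # Back-propagate the squared reconstruction error.
    d_out = (recon - X) * recon * (1 - recon)
    d_code = (d_out @ W_dec.T) * code * (1 - code)
    W_dec -= lr * code.T @ d_out / len(X)
    b_dec -= lr * d_out.mean(axis=0)
    W_enc -= lr * X.T @ d_code / len(X)
    b_enc -= lr * d_code.mean(axis=0)
```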
Outline Motivation: why go deep? DL since 2006 Some DL Models Convolutional Neural Networks Deep Belief Nets Stacked auto-encoders / sparse coding Discussion
Discussion: Deep Learning = learning hierarchical features
- The pipeline of machine visual perception
- Features in NLP (hand-crafted)
Problems
- No need for feature engineering, but training DL models still requires a significant amount of engineering, e.g., hyper-parameter tuning: number of layers, layer sizes, connectivity, learning rate
- Computational scaling: recent breakthroughs in speech, object recognition, and NLP hinged on faster computing, GPUs, and large datasets
- Lack of theoretical analysis
Outline Motivation: why go deep? DL since 2006 Some DL Models Convolutional Neural Networks Deep Belief Nets Stacked auto-encoders / sparse coding Discussion
References