1
Signal processing and Networking for Big Data Applications Lectures 11-12: Deep Learning Basics
Zhu Han, University of Houston. Thanks to Dr. Hien Nguyen for the slides, and to Xunshen Du and Kevin Tsai for their help.
2
Outline
Motivation and overview: Why deep learning; Basic concepts; CNN, RNN vs. DBN
Idea of Deep Feedforward Networks: Example: Learning XOR; Output Units
3
Which one is better? Answer: "it depends."
A classic method such as a random forest might be better if you don't have a lot of data, if the training data has categorical features, if you want a more explainable model, or if you need high run-time speed. Deep learning is good when you have a lot of training data from the same or a related domain. Deep networks can be much slower than random forests. Appropriate scaling and normalization of the data are needed to improve convergence speed (see the sketch below).
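As a concrete illustration of the scaling point above, here is a minimal NumPy sketch (not from the slides; the array names are hypothetical) of standardizing features to zero mean and unit variance before feeding them to a deep network:

```python
import numpy as np

# Hypothetical training and test feature matrices (rows = samples).
X_train = np.random.rand(1000, 20) * 50.0
X_test = np.random.rand(200, 20) * 50.0

# Standardize using statistics of the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8      # avoid division by zero
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # reuse the training statistics
```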
4
Why Deep Learning? "How can I convince you?" "Beat me on ImageNet!" (Hinton, Malik, Alex)
5
Why Deep Learning: complex engineered systems vs. AlexNet
6
Why Deep Learning
7
Why Deep Learning
8
Why Deep Learning: carotid artery bifurcation landmark detection
9
Why Deep Learning “Deep” in deep learning:
Some functions computable with a polynomial-size logic-gate circuit of depth k require exponential size when restricted to depth k − 1 [J. Håstad, "Almost optimal lower bounds for small depth circuits," in Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pp. 6–20, Berkeley, California: ACM Press, 1986]. Complex patterns are composed of repeated simple patterns ("everything in the universe is within you"). Curse of dimensionality: the convergence rate decreases exponentially with the dimension of the data. We say that the expression of a function is compact when it has few computational elements, i.e., few degrees of freedom that need to be tuned by learning; compact means more generalizable in machine learning. The depth of an architecture is the maximum length of a path from any input of the graph to any output of the graph.
10
Why now? Big annotated datasets have become available: ImageNet, Google Video, Mechanical Turk crowdsourcing
GPU processing power
Better stochastic gradient descent variants: AdaGrad, Adam, RMSProp
11
Biological Neuron and Modeling
12
Hierarchical Structure Deep Learning
1981, David Hubel & Torsten Wiesel: the mammalian visual cortex is hierarchical
13
Deep Learning Structure
Hinton
14
Fully Connected Feed-Forward Network
Linear transformation followed by non-linear rectification. Training of the NN: back-propagation.
15
Rectifier Functions Activation function from biological data
16
Rectified Linear Unit (ReLU)
Pros: faster to compute; helps reduce the vanishing gradient problem; sparse activation; more biologically plausible; more robust to noise; information disentanglement, etc. Cons: a neuron with a very negative input cannot recover due to the zero gradient; Parametric ReLU comes to the rescue.
17
Dropout. Hinton et al. 2012. Randomly pick some neurons and replace their values with 0 during training. Reduces overfitting (pediatric ADHD) and co-adaptation.
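A minimal sketch of (inverted) dropout on a layer's activations, assuming NumPy; this illustrates the idea on the slide, not the original implementation:

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True):
    """Randomly zero out units with probability p_drop (inverted dropout)."""
    if not training or p_drop == 0.0:
        return h                       # at test time, use all units unchanged
    mask = (np.random.rand(*h.shape) >= p_drop)
    return h * mask / (1.0 - p_drop)   # rescale so the expected activation is preserved

h = np.random.randn(4, 8)              # hypothetical activations (batch of 4)
h_train = dropout(h, p_drop=0.5, training=True)
```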
18
DropConnect
Wang et al., ICML 2013. Randomly remove connections (weights) rather than neurons. Claimed to work better than dropout.
19
Other Tips: data augmentation / perturbation; batch normalization
L1/L2 regularization: a big network with a regularizer often outperforms a small network. Parameter search using Bayesian optimization.
20
Type 1: Convolutional Neural Networks (CNNs)
LeCun et al. 1989 Neural network with specialized connectivity structure
21
AlexNet: same model as LeCun’98, but a bigger model (8 layers)
More data (10^6 vs. 10^3 images); GPU implementation (50x speedup over CPU); better regularization (dropout). 7 hidden layers, 650,000 neurons, 60,000,000 parameters. Trained on 2 GPUs for a week.
22
AlexNet 14 million labeled images, 20k classes
Images gathered from the Internet; human labels via Amazon Mechanical Turk
23
Pre-Training Effect: pretrain the network on ImageNet
Fine-tune on a small set of samples. Dataset: Caltech-256. Zeiler & Fergus, arXiv, 2013
24
CNN Structure
Feed-forward pipeline: input image → convolution (learned) → non-linear rectifier → pooling → feature maps. Supervised: train the convolutional filters by back-propagating the classification error.
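To make the pipeline concrete, here is a minimal NumPy sketch (not the slides' code; filter values are random placeholders for learned weights) of one feed-forward stage: convolve a learned filter with the input, apply the rectifier, then max-pool:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation (what CNN 'convolution' layers compute)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling over size x size regions."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size].reshape(H2, size, W2, size)
    return x.max(axis=(1, 3))

image = np.random.randn(8, 8)          # hypothetical single-channel input
kernel = np.random.randn(3, 3)         # one 3x3 filter (learned in practice)
feature_map = max_pool(relu(conv2d(image, kernel)))   # conv -> ReLU -> pool
print(feature_map.shape)               # (3, 3)
```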
25
Filtering: the convolution filter is learned during training
Same filter at each location Each filter looks at local region only
26
Filtering Multiple filters at each layer
More complex patterns at deeper layers
27
Non-Linearity Rectified linear function Applied per-pixel
Output = max(0,input)
28
Pooling Spatial Pooling Non-overlapping / overlapping regions
Sum or max pooling. See Boureau et al., ICML’10 for theoretical analysis. Max pooling makes the representation more invariant to local transformations such as translation.
29
Pooling Spatial Pooling Non-overlapping / overlapping regions
Sum or max pooling. See Boureau et al., ICML’10 for theoretical analysis.
30
Role of Pooling: spatial pooling
Invariance to small transformations; larger receptive fields. Variants: pool across feature maps; tiled pooling.
31
Going Deeper
Current best models use much deeper architectures than AlexNet.
32
Going Deeper
CaffeNet: 8 layers; Google Inception Network: 22 layers; Residual Network: 152 layers
33
Detection with CNNs: sometimes we also want to locate specific objects in the scene. Traditional approaches: scan over the entire image at different scales; construct a feature for each location; apply a classifier to that feature; use sophisticated tricks to speed up scanning (integral images, efficient subwindow search); combine multiple hand-engineered features. Difficult to scale up to multiple classes.
34
Detection with CNNs: Faster R-CNN, NIPS 2015. Performs regression of bounding-box coordinates and object class at the same time. Jointly learns features and regressors. Best performer on PASCAL detection at the time.
35
Segmentation with CNNs
Noh et al., ICCV 2015. Add unpooling and upsampling layers. Competitive performance on the PASCAL dataset.
36
Reinforcement Learning with CNNs
The robot takes an action, gets a reward, and enters a new state. Goal: find the optimal policy for the robot. Q-learning (see the update rule below):
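The Q-learning rule referenced above is the standard tabular update, $Q(s,a) \leftarrow Q(s,a) + \alpha\,[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,]$. A minimal NumPy sketch (state/action sizes and rates are hypothetical); in deep Q-learning the table is replaced by a CNN mapping raw pixels to action values:

```python
import numpy as np

n_states, n_actions = 10, 4             # hypothetical sizes
Q = np.zeros((n_states, n_actions))     # tabular action-value function
alpha, gamma = 0.1, 0.99                # learning rate, discount factor

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s, a) toward the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=2, r=1.0, s_next=3)     # example transition
```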
37
Reinforcement Learning with CNNs
Sensor data in, robot actions out. The action-value function is jointly learned with the image features. Outperforms human experts (and linear methods).
38
Reinforcement Learning with CNNs
Model the probability of the next move with a CNN. Predict the distribution of the opponent's next move and feed it into policy learning. Play against itself to improve the policy. AlphaGo, for the game of Go.
39
Self-Driving Car with CNNs
NVIDIA
40
Why CNNs work well Small number of variables
Jointly learn features and classifier. Hierarchically learning complex patterns mitigates the curse of dimensionality. Designed to be invariant to shifts/translations.
41
Type 2: Recurrent Neural Network
A vast number of tasks involve time-series data. Need long-term dependencies: “I grew up in France… I speak fluent French.”
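As a minimal illustration (assumption: a vanilla RNN with tanh activation, not any specific model from the slides), the hidden state evolves as $\boldsymbol{h}_t = \tanh(\boldsymbol{W}\boldsymbol{h}_{t-1} + \boldsymbol{U}\boldsymbol{x}_t + \boldsymbol{b})$:

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0):
    """Run a vanilla RNN over a sequence xs; return all hidden states."""
    h = h0
    hs = []
    for x in xs:                        # one time step per input vector
        h = np.tanh(W @ h + U @ x + b)  # new state depends on the previous state
        hs.append(h)
    return hs

d_in, d_hid, T = 3, 5, 4                # hypothetical sizes and sequence length
rng = np.random.default_rng(0)
W = rng.normal(size=(d_hid, d_hid)) * 0.1
U = rng.normal(size=(d_hid, d_in)) * 0.1
b = np.zeros(d_hid)
xs = [rng.normal(size=d_in) for _ in range(T)]
states = rnn_forward(xs, W, U, b, h0=np.zeros(d_hid))
```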
42
Recurrent Network
43
Recurrent Network
44
Recurrent Network
45
Long Short-Term Memory Network
46
Long Short-Term Memory Network
47
Long Short-Term Memory Network
Jozefowicz et al., 2015. The forget gate is the most important; the output gate is the least important. Adding a positive bias (1 or 2) to the forget gate helps reduce vanishing gradients and greatly improves LSTM performance.
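A minimal NumPy sketch of one LSTM step in the standard formulation, with the positive forget-gate bias the slide recommends; parameter shapes and values are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wf, Wi, Wo, Wg, bf, bi, bo, bg):
    """One LSTM step on the concatenated input z = [h, x]."""
    z = np.concatenate([h, x])
    f = sigmoid(Wf @ z + bf)            # forget gate
    i = sigmoid(Wi @ z + bi)            # input gate
    o = sigmoid(Wo @ z + bo)            # output gate
    g = np.tanh(Wg @ z + bg)            # candidate cell update
    c_new = f * c + i * g               # cell state
    h_new = o * np.tanh(c_new)          # hidden state
    return h_new, c_new

d_in, d_hid = 3, 4
rng = np.random.default_rng(1)
shape = (d_hid, d_hid + d_in)
Wf, Wi, Wo, Wg = (rng.normal(size=shape) * 0.1 for _ in range(4))
bf = np.ones(d_hid)                     # positive forget-gate bias (1.0), per the slide
bi = bo = bg = np.zeros(d_hid)
h, c = np.zeros(d_hid), np.zeros(d_hid)
h, c = lstm_step(rng.normal(size=d_in), h, c, Wf, Wi, Wo, Wg, bf, bi, bo, bg)
```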
48
Sequence to Sequence: can be applied to map an input sequence to an output sequence
49
Example: Video to Text with LSTM
Connect a CNN to a stack of LSTMs
50
Network Visualization
What was learned in deep networks? The raw coefficients of learned filters in higher layers are difficult to interpret. Two main approaches: project activations back to pixel space; optimize the input image to maximize the response of a particular neuron / feature map.
51
Network Visualization
Projection from higher layers back to the input. Visualizing and Understanding Convolutional Networks, Matt Zeiler & Rob Fergus, ECCV 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, arXiv, 2013. Object Detectors Emerge in Deep Scene CNNs, Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba, ICLR 2015.
52
Network Visualization
Divide the image into regions; iteratively remove the regions causing the least change in the output score. Zhou & Torralba
53
Network Visualization
Find the K images producing the highest response of a neuron. Occlude random regions to find the most critical one, i.e., the one causing the largest change in the output. Object detectors emerge!
54
Network Visualization
Optimize the input to maximize a particular output: Erhan et al. 2009, Le et al., NIPS 2010. The result depends on the initialization. Google Deep Dream.
55
Network Visualization
Google Deep Dream Ask network to enhance what it sees
56
Type 3: Deep Belief Network: Restricted Boltzmann Machine
Each link is associated with a probability; the model is parametric.
57
Deep belief network layered structure
After learning an RBM, treat the activation probabilities of its hidden units as the data for training the RBM one layer up. Stacking a number of RBMs learned layer by layer from the bottom up gives a "pre-trained" network that can be used for better clustering (a sketch of this greedy layer-wise procedure follows).
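A minimal NumPy sketch of greedy layer-wise pre-training with CD-1 (contrastive divergence); this is a simplified illustration of the stacking idea rather than a production DBN trainer, and all sizes, learning rates, and the synthetic data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.05, epochs=10):
    """Train one RBM with CD-1; return weights, hidden bias, and the
    hidden-unit activation probabilities used as data for the next layer."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.01, size=(n_visible, n_hidden))
    b_v = np.zeros(n_visible)
    b_h = np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)                        # positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # sample hidden units
        p_v1 = sigmoid(h0 @ W.T + b_v)                      # reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_h)                      # negative phase
        n = data.shape[0]
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
        b_v += lr * (v0 - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_h, sigmoid(data @ W + b_h)

# Stack RBMs: the hidden activations of one layer become data for the next.
data = (rng.random((500, 64)) < 0.3).astype(float)   # hypothetical binary data
layers = []
for n_hidden in [32, 16]:
    W, b_h, data = train_rbm(data, n_hidden)
    layers.append((W, b_h))
```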
58
Comparisons
59
Deep Learning Software
Caffe, Theano, Torch, MATLAB, TensorFlow
60
Deep Learning Software
61
Outline
Motivation and overview; Idea of Deep Feedforward Networks
Example: Learning XOR; Output Units. Book: Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning
62
Concept: also known as Feedforward Neural Networks or Multi-Layer Perceptrons (MLPs)
The goal is to approximate some function $f^*$, e.g., $y = f^*(x)$. A feedforward network learns the parameters $\theta$ in a function $y = f(x;\theta)$. IDEA OF DEEP FEEDFORWARD NETWORKS
63
Deep? Feedforward? Networks?
NETWORKS – composed of multiple functions. FEEDFORWARD – information/input $x$ flows through the functions to produce the output $y$. Ex: $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$. Note: recurrent neural networks are the extension of feedforward neural networks with feedback connections. DEEP – the overall length of the chain gives the "depth" of the model (hence "deep learning"). Ex: $f^{(1)}$ is called the "1st layer", $f^{(2)}$ the "2nd layer", …, and the final layer of the network is called the "output layer". IDEA OF DEEP FEEDFORWARD NETWORKS
64
Linear vs. nonlinear: linear models usually fit efficiently and reliably, in closed form or with convex optimization. Nonlinear functions can overcome the limitations of linear models: apply a linear model to a transformed input $\phi(\boldsymbol{x})$, where $\phi$ is a nonlinear transformation. Options for $\phi$: a very generic, infinite-dimensional $\phi$; a manually engineered $\phi$ (the dominant approach before deep learning); or a $\phi$ generated (learned) by deep learning. IDEA OF DEEP FEEDFORWARD NETWORKS
65
Hidden layers. Recall: $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$
Each $f^{(n)}$ might be a linear or a nonlinear function. The training data does not specify what each individual layer should do; the learning algorithm must decide how to use the layers to approximate $f^*$. How the layers are connected (the architecture) is chosen by the designer and is not part of the "learning" itself. IDEA OF DEEP FEEDFORWARD NETWORKS
66
Hidden layers IDEA OF DEEP FEEDFORWARD NETWORKS
67
Cost functions The choice of the cost function is important for designing a deep neural network In most cases, we use cross-entropy between the training data and the model’s predictions as the cost function since the parametric model usually defines a distribution 𝑝(𝒚|𝒙;𝜃)
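For a classifier whose output defines a distribution, the cross-entropy cost is the average negative log-likelihood, $J(\theta) = -\frac{1}{N}\sum_n \log p(y_n \mid \boldsymbol{x}_n;\theta)$. A minimal NumPy sketch (the labels and predicted probabilities below are hypothetical):

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Average negative log-likelihood of the true class labels.
    probs: (N, K) predicted class probabilities; labels: (N,) integer classes."""
    n = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(n), labels] + eps))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels))    # about 0.29
```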
68
Training feedforward network
Design choices: the form of the output units, the cost function, and the optimizer. IDEA OF DEEP FEEDFORWARD NETWORKS
69
Example: Learning XOR
70
Objective: target function $y = f^*(x)$; our model $y = f(x;\theta)$
Objective: adapt $\theta$ to make $f$ as similar as possible to $f^*$. Example: Learning XOR
71
Strategy: treat XOR as a regression problem
Use a mean squared error (MSE) loss function, just to simplify the math here: $J(\boldsymbol{\theta}) = \frac{1}{4}\sum_{\boldsymbol{x}\in\mathbb{X}} \left(f^*(\boldsymbol{x}) - f(\boldsymbol{x};\boldsymbol{\theta})\right)^2$. Our model: $f(\boldsymbol{x};\boldsymbol{\theta}) = f(\boldsymbol{x};\boldsymbol{w},b) = \boldsymbol{x}^\top\boldsymbol{w} + b$. Example: Learning XOR
72
Problem: solving the linear model gives $w = 0$ and $b = 0.5$, i.e., an output of 0.5 everywhere
Useless! When $x_1 = 0$, the output must increase as $x_2$ increases; when $x_1 = 1$, the output must decrease as $x_2$ increases. A linear model has one fixed coefficient $w$ for $x_2$, so it cannot do both. Example: Learning XOR
73
Solution: add a hidden layer $\boldsymbol{h}$ which contains two hidden units
Rewrite the model as $y = f^{(2)}(f^{(1)}(\boldsymbol{x}))$: the first layer maps $\boldsymbol{x}$ to $\boldsymbol{h}$ and the second layer maps $\boldsymbol{h}$ to $y$. Example: Learning XOR
74
Solution: use a nonlinear function to describe the features
Note: most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an "activation function": $\boldsymbol{h} = f^{(1)}(\boldsymbol{x};\boldsymbol{W},\boldsymbol{c}) = g(\boldsymbol{W}^\top\boldsymbol{x} + \boldsymbol{c})$, where $\boldsymbol{W}$ are the weights and $\boldsymbol{c}$ are the biases. Complete network: $f(\boldsymbol{x};\boldsymbol{W},\boldsymbol{c},\boldsymbol{w},b) = \boldsymbol{w}^\top \max(0, \boldsymbol{W}^\top\boldsymbol{x} + \boldsymbol{c}) + b$. Example: Learning XOR
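The book gives one exact solution to XOR for this network; here is a hedged NumPy check of that solution with $\boldsymbol{W} = \begin{bmatrix}1&1\\1&1\end{bmatrix}$, $\boldsymbol{c} = (0,-1)^\top$, $\boldsymbol{w} = (1,-2)^\top$, $b = 0$:

```python
import numpy as np

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def f(x):
    """Complete network: f(x) = w^T max(0, W^T x + c) + b."""
    h = np.maximum(0.0, W.T @ x + c)   # hidden layer with ReLU
    return w @ h + b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print([f(x) for x in X])               # [0.0, 1.0, 1.0, 0.0] = XOR
```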
75
Side note: common activation functions
Example: Learning XOR
76
Gradient-based learning
77
Remark: linear models vs. neural networks (which include nonlinear functions)
Linearity: linear equation solvers have convergence guarantees, but linear models have limited capacity. Nonlinearity: causes most loss functions to become non-convex; in other words, we must use iterative, gradient-based optimizers that merely drive the cost function to a small value.
78
Stochastic Gradient Descent
No convergence guarantees. Sensitive to the initial parameter values; the initial parameters should be small random values (discussed in Ch. 8 of the book).
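A minimal sketch of stochastic gradient descent on the MSE loss for a linear model, with the small random initialization the slide suggests; the data here is synthetic and the hyperparameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # synthetic inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)    # noisy targets

w = 0.01 * rng.normal(size=3)                  # small random initialization
b = 0.0
lr, epochs, batch = 0.05, 50, 16

for _ in range(epochs):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        sel = idx[start:start + batch]         # one minibatch
        err = X[sel] @ w + b - y[sel]
        w -= lr * (X[sel].T @ err) / len(sel)  # gradient of 0.5 * mean squared error
        b -= lr * err.mean()

print(w, b)                                    # close to w_true and 0
```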
79
Output units
80
Output units: any kind of neural network unit can serve as an output unit (e.g., a hidden unit)
Note: output units can also be used internally in the network. The cost function is tightly coupled with the choice of output unit. OUTPUT UNITS
81
Output units: linear units for Gaussian output distributions
Based on an affine transformation with no nonlinearity. Output of a linear output unit: $\hat{\boldsymbol{y}} = \boldsymbol{W}^\top\boldsymbol{h} + \boldsymbol{b}$, where $\boldsymbol{h}$ are the given features. Often used to produce the mean of a conditional Gaussian distribution $p(\boldsymbol{y}\mid\boldsymbol{x}) = \mathcal{N}(\boldsymbol{y};\hat{\boldsymbol{y}},\boldsymbol{I})$. Learning the covariance instead is harder, since it must be a positive definite matrix for all inputs (hard to satisfy). Linear units do not saturate, so they pose little difficulty for gradient-based optimization algorithms. OUTPUT UNITS
82
Output units Sigmoid unit for Bernoulli output distributions
Useful for classification problems with two classes. The neural net needs to predict only $P(y=1\mid\boldsymbol{x})$, which must be a valid probability in $[0,1]$. A naive choice is $P(y=1\mid\boldsymbol{x}) = \max\{0, \min\{1, \boldsymbol{w}^\top\boldsymbol{h} + b\}\}$, but the gradient is 0 whenever $\boldsymbol{w}^\top\boldsymbol{h} + b$ is outside $[0,1]$, so the model is unable to learn there. What now? OUTPUT UNITS
83
Output units Sigmoid unit for Bernoulli output distributions
Ensure a strong gradient whenever the model has the wrong answer: combine sigmoid output units with maximum likelihood. Two components: a linear layer computing $z = \boldsymbol{w}^\top\boldsymbol{h} + b$, and a sigmoid activation function converting $z$ into a probability. Sigmoid output unit: $\hat{y} = \sigma(z)$, where $\sigma(x) = \frac{1}{1+\exp(-x)}$. OUTPUT UNITS
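As the slide notes, combining the sigmoid with the maximum-likelihood (cross-entropy) loss keeps the gradient strong; numerically it is best computed from the logit $z$ directly. A minimal NumPy sketch using the standard softplus identity (the identity and the example values are not from the slides):

```python
import numpy as np

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def sigmoid_nll(z, y):
    """Negative log-likelihood of a Bernoulli output unit, computed from the
    logit z = w^T h + b:  -log P(y | z) = softplus((1 - 2y) * z)."""
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([5.0, -3.0, 0.2])          # hypothetical logits
y = np.array([1.0, 0.0, 1.0])           # true labels
print(sigmoid_nll(z, y))                # small loss where the sign of z matches the label
```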
84
Output units softmax unit for multinoulli output distributions
Represents a probability distribution over a discrete variable with $n$ possible values. Often used as the output of a classifier. A generalization of the sigmoid function; outputs a vector $\hat{\boldsymbol{y}}$. OUTPUT UNITS
85
Output units softmax unit for multinoulli output distributions
$\hat{y}_i = P(y=i\mid\boldsymbol{x})$, where $\mathrm{softmax}(\boldsymbol{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$. OUTPUT UNITS
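A minimal NumPy sketch of this formula; subtracting max(z) is a standard trick for numerical stability and does not change the result (the example logits are hypothetical):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), computed stably."""
    z = z - np.max(z)                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p, p.sum())                      # probabilities summing to 1
```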
86
Intro to Hidden Units: hidden unit design is not yet fully understood
The design process consists of trial and error; the main choice is the activation function.
87
Outline Hidden Units & ReLU Architecture Design Visualization Design
Brief Intro to Propagation
88
Hidden Units Design process consists of trial and error
Do not expect the gradient to be exactly 0 at a solution, so non-differentiability of the activation at a few points is acceptable in practice. Hidden units are distinguished by the choice of activation function. Hidden Units & ReLU
89
Rectified linear units
The derivatives remain large whenever the unit is active: the derivative is 1 when the unit is active (and the 2nd derivative is 0 almost everywhere), so gradients are large and consistent. Typically used on top of an affine transformation: $\boldsymbol{h} = g(\boldsymbol{W}^\top\boldsymbol{x} + \boldsymbol{b})$. Initializing $\boldsymbol{b}$ to a small positive value makes the ReLU initially active for most inputs and allows the derivatives to pass through. Hidden Units & ReLU
90
ReLU generalizations: based on a non-zero slope $\alpha_i$ when $z_i < 0$
$h_i = g(\boldsymbol{z},\boldsymbol{\alpha})_i = \max(0, z_i) + \alpha_i \min(0, z_i)$. Absolute value rectification: $\alpha_i = 1$, giving $g(z) = |z|$. Leaky ReLU: $\alpha_i = 0.01$. Parametric ReLU (PReLU): treats $\alpha_i$ as a learnable parameter. Hidden Units & ReLU
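A minimal NumPy sketch of the generalized rectifier above and its special cases (the example inputs and the PReLU α value are hypothetical):

```python
import numpy as np

def generalized_relu(z, alpha):
    """h_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
relu        = generalized_relu(z, alpha=0.0)    # standard ReLU
abs_rect    = generalized_relu(z, alpha=1.0)    # absolute value rectification |z|
leaky_relu  = generalized_relu(z, alpha=0.01)   # Leaky ReLU
alpha_learn = np.full_like(z, 0.25)             # PReLU: alpha is a learned parameter
prelu       = generalized_relu(z, alpha_learn)
```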
91
ReLU generalizations: Maxout units, $g(\boldsymbol{z})_i = \max_{j\in\mathbb{G}^{(i)}} z_j$
where $\mathbb{G}^{(i)}$ is the set of inputs for group $i$, $\{(i-1)k+1,\ldots,ik\}$. Maxout acts as a learned activation function and is usually combined with dropout. Hidden Units & ReLU
92
Architecture Design
93
Architecture Design Number of units? How to connect?
The units are organized into groups called layers, and the layers are arranged in a chain structure. Ex: first layer $\boldsymbol{h}^{(1)} = g^{(1)}\left(\boldsymbol{W}^{(1)\top}\boldsymbol{x} + \boldsymbol{b}^{(1)}\right)$; second layer $\boldsymbol{h}^{(2)} = g^{(2)}\left(\boldsymbol{W}^{(2)\top}\boldsymbol{h}^{(1)} + \boldsymbol{b}^{(2)}\right)$. Deeper networks are able to use fewer units per layer and fewer parameters, but are often harder to optimize. The right architecture is found via experimentation guided by monitoring the validation-set error. Architecture Design
94
Universal approximation theorem
Define: a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function can approximate any Borel measurable function from one finite-dimensional space to another with any desired non-zero amount of error, provided that the network is given enough hidden units. A large MLP will therefore be able to represent any function we are trying to learn, but it is not guaranteed that the training algorithm will be able to learn that function. Two reasons learning can fail: the optimization algorithm may not find the correct parameters, or the training algorithm might choose the wrong function due to overfitting. Architecture Design
95
Visualization Design Brief Intro to Propagation & Visualization Design
96
Computational graph Brief Intro to Propagation & Visualization Design
97
Symbol-to-symbol derivatives
Brief Intro to Propagation & Visualization Design
98
Brief Intro to Propagation & Visualization Design
99
Forward propagation & back-propagation
The input $\boldsymbol{x}$ propagates up through the hidden units at each layer and produces $\hat{\boldsymbol{y}}$. During training, forward propagation continues onward until it produces a scalar cost function $J(\boldsymbol{\theta})$. Back-propagation (known as backprop) allows the information from the cost function to flow backwards through the network in order to compute the gradient. Brief Intro to Propagation & Visualization Design
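A minimal worked sketch of forward propagation followed by back-propagation through a one-hidden-layer ReLU network with an MSE cost; the gradients are derived by the chain rule, and the sizes and random values are hypothetical (this illustrates the idea, not the book's general algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # input
y = np.array([1.0])                    # target
W1 = rng.normal(size=(4, 3)) * 0.1     # first-layer weights
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4)) * 0.1     # output-layer weights
b2 = np.zeros(1)

# Forward propagation: input -> hidden units -> output -> scalar cost J.
a1 = W1 @ x + b1
h1 = np.maximum(0.0, a1)               # ReLU hidden layer
y_hat = W2 @ h1 + b2                   # linear output unit
J = 0.5 * np.sum((y_hat - y) ** 2)     # MSE cost

# Back-propagation: chain rule, flowing from the cost back to each parameter.
dy = y_hat - y                         # dJ/dy_hat
dW2 = np.outer(dy, h1)                 # dJ/dW2
db2 = dy
dh1 = W2.T @ dy                        # gradient w.r.t. hidden activations
da1 = dh1 * (a1 > 0)                   # ReLU derivative is 1 where active
dW1 = np.outer(da1, x)
db1 = da1
```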
100
Thank you