1 Signal Processing and Networking for Big Data Applications, Lectures 11-12: Deep Learning Basics
Zhu Han, University of Houston. Thanks to Dr. Hien Nguyen for his slides and to Xunshen Du and Kevin Tsai for their help.

2 Outline
Motivation and overview: why deep learning, basic concepts, CNN and RNN vs. DBN
Idea of deep feedforward networks: example of learning XOR, output units

3 Which One Is Better? Answer: "it depends"
Classic methods such as random forests might be good if you don't have a lot of data, your training data have categorical features, you want a more explainable model, or you want high run-time speed. Deep learning is good when you have a lot of training data from the same or a related domain. Deep networks can be much slower than random forests, and appropriate scaling and normalization must be applied to the data to improve the convergence speed.

4 Why Deep Learning? "Beat me on ImageNet!" "How can I convince you?" (Hinton, Malik, Alex)

5 Why Deep Learning Complex engineered systems AlexNet

6 Why Deep Learning

7 Why Deep Learning

8 Why Deep Learning Carotid artery bifurcation landmark detection

9 Why Deep Learning: "Deep" in deep learning
Some functions computable with a polynomial-size logic-gate circuit of depth k require exponential size when restricted to depth k−1 (Håstad, 1986).
Complex patterns are composed of repeated simple patterns ("everything in the universe is within you").
Curse of dimensionality: the convergence rate decreases exponentially with the dimension of the data.
We say that the expression of a function is compact when it has few computational elements, i.e., few degrees of freedom that need to be tuned by learning; compact means more generalizable in machine learning. The depth of an architecture is the maximum length of a path from any input of the graph to any output of the graph.
Reference: J. Håstad, "Almost optimal lower bounds for small depth circuits," in Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pp. 6-20, Berkeley, California: ACM Press, 1986.

10 Why Now?
Big annotated datasets have become available: ImageNet, Google Video, Mechanical Turk crowdsourcing.
GPU processing power.
Better stochastic gradient descent variants: AdaGrad, Adam, RMSProp.
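As a concrete illustration of these optimizer variants, here is a minimal NumPy sketch of single AdaGrad and RMSProp update steps (not the lecture's code; names and hyperparameter values are illustrative).

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """One AdaGrad update: the per-parameter step size shrinks as squared gradients accumulate."""
    cache += grad ** 2
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update: an exponential moving average of squared gradients
    keeps the effective step size from decaying to zero."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```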

11 Biological Neuron and Modeling

12 Hierarchical Structure and Deep Learning
1981, David Hubel & Torsten Wiesel: the mammalian visual cortex is hierarchical.

13 Deep Learning Structure
Hinton

14 Fully Connected Feed-Forward Network
A linear transformation followed by a non-linear rectification at each layer. Training of the network: back-propagation.
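A minimal NumPy sketch of one such network (one hidden layer with ReLU rectification) and its back-propagation step, assuming a squared-error loss; all shapes and names are illustrative, not the lecture's code.

```python
import numpy as np

# Toy one-hidden-layer network: linear transform -> ReLU -> linear output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # batch of 4 inputs, 3 features
y = rng.normal(size=(4, 1))            # regression targets
W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)

# Forward pass: linear transformation followed by non-linear rectification.
z1 = x @ W1 + b1
h = np.maximum(0, z1)                  # ReLU
y_hat = h @ W2 + b2
loss = 0.5 * np.mean((y_hat - y) ** 2)

# Back-propagation of the squared-error loss.
d_yhat = (y_hat - y) / len(x)
dW2 = h.T @ d_yhat
db2 = d_yhat.sum(axis=0)
dh = d_yhat @ W2.T
dz1 = dh * (z1 > 0)                    # ReLU gradient
dW1 = x.T @ dz1
db1 = dz1.sum(axis=0)
```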

15 Rectifier Functions Activation function from biological data

16 Rectified Linear Unit (ReLU)
Pros: faster to compute; helps reduce the vanishing gradient problem; sparse activation; more biologically plausible; more robust to noise; information disentanglement, etc.
Cons: a strongly negative neuron cannot recover because its gradient is zero (the "dying ReLU" problem); Parametric ReLU comes to the rescue.
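A minimal sketch of ReLU and its gradient, illustrating why a unit whose pre-activation stays negative stops receiving gradient (the con noted above).

```python
import numpy as np

def relu(z):
    """max(0, z): cheap to compute and gives sparse activations."""
    return np.maximum(0, z)

def relu_grad(z):
    """The gradient is 1 for positive inputs and exactly 0 otherwise,
    which is why a strongly negative unit can stop learning."""
    return (z > 0).astype(float)

z = np.array([-3.0, -0.1, 0.5, 2.0])
print(relu(z), relu_grad(z))   # gradient is zero for the two negative entries
```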

17 Dropout (Hinton et al., 2012): randomly pick some neurons and replace their values with 0 during training. Reduces overfitting (pediatric ADHD example) and co-adaptation.
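A minimal sketch of (inverted) dropout as described on this slide; the drop probability and names are illustrative.

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=None):
    """Randomly zero a fraction p_drop of the activations during training.
    Inverted dropout rescales the survivors so nothing changes at test time."""
    if not training:
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)
```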

18 DropConnect (Wang et al., ICML 2013)
Randomly removes connections (weights) rather than activations; claimed to work better than dropout.

19 Other Tips
Data augmentation / perturbation; batch normalization.
L1/L2 regularization: a big network with a regularizer often outperforms a small network.
Hyperparameter search using Bayesian optimization.
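A minimal sketch of how L1/L2 penalties enter the training loss; the coefficient values are illustrative.

```python
import numpy as np

def regularized_loss(data_loss, weights, l1=0.0, l2=1e-4):
    """Add L1/L2 penalties so a big network is pulled toward simpler solutions."""
    penalty = sum(l1 * np.abs(W).sum() + l2 * (W ** 2).sum() for W in weights)
    return data_loss + penalty
```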

20 Type 1: Convolutional Neural Networks (CNNs)
LeCun et al., 1989: a neural network with a specialized connectivity structure.

21 AlexNet
Same model as LeCun '98, but: a bigger model (8 layers), more data (10^6 vs. 10^3 images), a GPU implementation (50x speedup over CPU), and better regularization (dropout).
7 hidden layers; 650,000 neurons; 60,000,000 parameters; trained on 2 GPUs for a week.

22 AlexNet
14 million labeled images, 20k classes; images gathered from the Internet; human labels via Amazon Mechanical Turk.

23 Pre-Training Effect
Pretrain the network on ImageNet, then fine-tune on a small set of samples. Dataset: Caltech-256 (Zeiler & Fergus, arXiv, 2013).

24 CNN Structure
Feed-forward: convolve input, non-linearity (rectified linear), pooling (local max).
Supervised: train convolutional filters by back-propagating classification error.
[Diagram: INPUT IMAGE → CONVOLUTION (LEARNED) → NON-LINEAR RECTIFIER → POOLING → FEATURE MAPS]
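A minimal NumPy sketch of this feed-forward path (convolve, rectify, pool) for a single channel and a single learned filter; it only shows the forward computation, not training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: the same learned filter slides over every location."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size regions."""
    H, W = fmap.shape
    fmap = fmap[:H - H % size, :W - W % size]
    return fmap.reshape(H // size, size, W // size, size).max(axis=(1, 3))

image = np.random.rand(8, 8)
kernel = np.random.randn(3, 3)          # learned during training in a real CNN
feature_map = max_pool(np.maximum(0, conv2d(image, kernel)))
print(feature_map.shape)                # (3, 3)
```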

25 Filtering: Convolution
The filter is learned during training; the same filter is applied at each location; each filter looks at a local region only.

26 Filtering Multiple filters at each layer
More complex patterns at deeper layers

27 Non-Linearity Rectified linear function Applied per-pixel
Output = max(0,input)

28 Pooling
Spatial pooling over non-overlapping or overlapping regions; sum or max (see Boureau et al., ICML'10 for a theoretical analysis). Max pooling makes the representation more invariant to local transformations such as translation.

29 Pooling
Spatial pooling over non-overlapping or overlapping regions; sum or max (Boureau et al., ICML'10). [Figure: max vs. sum pooling examples]

30 Role of Pooling
Spatial pooling provides invariance to small transformations and larger receptive fields.
Variants: pooling across feature maps, tiled pooling.

31 Going Deeper
Deeper architectures: the current best models are much deeper than AlexNet.

32 Going Deeper
CaffeNet: 8 layers; Google Inception Network: 22 layers; Residual Network: 152 layers.

33 Detection with CNNs
Sometimes we also want to locate specific objects in the scene. Traditional approaches: scan the entire image at different scales, construct a feature for each location, and apply a classifier to that feature; sophisticated tricks (integral images, subwindow search) speed up the scanning, and multiple hand-engineered features are combined. Difficult to scale up to multiple classes.

34 Detection with CNNs
Faster R-CNN (NIPS 2015) performs regression of the bounding-box coordinates and the object class at the same time, jointly learning features and regressors. Currently the best performer on PASCAL detection.

35 Segmentation with CNNs
Noh et al., ICCV 2015: add unpooling and upsampling layers; competitive performance on the PASCAL dataset.

36 Reinforcement Learning with CNNs
The robot takes an action, gets a reward, and enters a new state. Goal: find the optimal policy for the robot. Q-learning:
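The Q-learning equation on this slide is an image in the original deck; the standard tabular update it presumably refers to is, with learning rate α and discount factor γ:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

In deep Q-learning the table Q(s, a) is replaced by a CNN that maps raw sensor input to action values, which is the setting of the next slides.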

37 Reinforcement Learning with CNNs
From sensor data to robot actions: the action-value function is jointly learned with the image features. Outperforms human experts (and linear methods).

38 Reinforcement Learning with CNNs
AlphaGo (the game of Go): model the probability of the next move with a CNN, predict the distribution of the opponent's next move and feed it into policy learning, and play against itself to improve the policy.

39 Self-Driving Car with CNNs
NVIDIA

40 Why CNNs Work Well
Small number of variables (parameters); features and classifier are learned jointly; hierarchically learning complex patterns mitigates the curse of dimensionality; designed to be invariant to shift/translation.

41 Type 2: Recurrent Neural Network
A vast number of tasks involve time-series data. Need long-term dependencies: "I grew up in France… I speak fluent French."
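A minimal sketch of a vanilla recurrent cell stepping through a sequence, showing how the hidden state carries information across time; shapes and names are illustrative.

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, bh, h0):
    """Vanilla RNN: each new hidden state depends on the input and the previous state."""
    h = h0
    states = []
    for x in xs:                                  # xs: list of input vectors
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        states.append(h)
    return states

rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]       # a length-5 sequence
Wxh, Whh, bh = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
states = rnn_forward(xs, Wxh, Whh, bh, h0=np.zeros(4))
```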

42 Recurrent Network

43 Recurrent Network

44 Recurrent Network

45 Long Short-Term Memory Network

46 Long Short-Term Memory Network

47 Long Short-Term Memory Network
Rafal Jozefowicz et al., JMLR 2015: the forget gate is the most important, the output gate the least important. Adding a positive bias (1 or 2) to the forget gate helps reduce vanishing gradients and greatly improves LSTM performance.
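A minimal sketch of a single LSTM step, with the forget-gate bias initialized to a positive value as the slide suggests; the gate ordering and names are illustrative, not a specific library's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps [x, h] to the stacked gate pre-activations
    (input, forget, output, candidate)."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g                 # the forget gate decides what memory to keep
    h = o * np.tanh(c)
    return h, c

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
b[n_hid:2 * n_hid] = 1.0              # positive forget-gate bias, as recommended above
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```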

48 Sequence to Sequence: could be applied to sequences of inputs and outputs

49 Example: Video to Text with LSTM
Connect a CNN to a stack of LSTMs.

50 Network Visualization
What was learned in the deep network? The raw coefficients of learned filters in higher layers are difficult to interpret. Two main approaches: project activations back to pixel space, or optimize the input image to maximize the response of a particular neuron / feature map.

51 Network Visualization
Projection from higher layers back to the input:
Visualizing and Understanding Convolutional Networks, Matt Zeiler & Rob Fergus, ECCV 2014
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, arXiv, 2013
Object Detectors Emerge in Deep Scene CNNs, Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba, ICLR 2015

52 Network Visualization
Divide the image into regions and iteratively remove the regions whose removal causes the least change in the output score (Zhou & Torralba).

53 Network Visualization
Find the K images producing the highest response of a neuron, then occlude random regions to find the most critical one, i.e., the one causing the largest change in the output. Object detectors emerge!

54 Network Visualization
Optimize the input to maximize a particular output (Erhan et al. 2009; Le et al., NIPS 2010). The result depends on the initialization. Example: Google Deep Dream.
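A toy sketch of the "optimize the input" idea: start from small random noise and take gradient-ascent steps on the image to increase one neuron's response. Here the neuron is just a fixed linear filter so the gradient is analytic; in a real network this gradient would come from back-propagating to the input.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))            # the neuron's weights over the input
img = rng.normal(size=(32, 32)) * 0.01   # the final result depends on this initialization

for _ in range(100):
    grad = w                             # gradient of the neuron's response w.r.t. the image
    img = img + 0.1 * grad               # ascend the response
    img *= 0.99                          # mild decay acts as a crude image prior

print(np.sum(w * img))                   # the neuron's (now much larger) response
```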

55 Network Visualization
Google Deep Dream: ask the network to enhance what it sees.

56 Type 3: Deep Belief Network: Restricted Boltzmann Machine
Each link is associated with a probability; the model is parametric. [Figure: men / women example]

57 Deep Belief Network: Layered Structure
After learning an RBM, treat the activation probabilities of its hidden units as the data for training the RBM one layer up. Stacking a number of RBMs learned layer by layer from the bottom up gives a trained deep network, e.g. for better clustering.
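A rough NumPy sketch of this greedy layer-wise procedure: each RBM is trained with one-step contrastive divergence (CD-1), and its hidden activation probabilities become the training data for the next RBM up. Biases are omitted for brevity; all sizes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=10, seed=0):
    """CD-1 training of a binary RBM (biases omitted); a rough sketch."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden units
        v1 = sigmoid(h0 @ W.T)                                # reconstruction
        p_h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(data)     # CD-1 gradient estimate
    return W

# Greedy layer-wise stacking: hidden activation probabilities of one RBM
# become the training data for the RBM one layer up.
data = (np.random.rand(100, 20) > 0.5).astype(float)
W1 = train_rbm(data, n_hidden=10)
h1 = sigmoid(data @ W1)
W2 = train_rbm(h1, n_hidden=5)
```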

58 Comparisons

59 Deep Learning Software
Caffe, Theano, Torch, MATLAB, TensorFlow

60 Deep Learning Software

61 Outline
Motivation and overview; idea of deep feedforward networks; example: learning XOR; output units.
Book: Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning.

62 Concept
Also known as feedforward neural networks or multi-layer perceptrons (MLPs).
They approximate some function f*, e.g. y = f*(x); a feedforward network learns the parameters θ of a function, e.g. y = f(x; θ).

63 Deep? Feedforward? Networks?
NETWORKS: composed of multiple functions.
FEEDFORWARD: information (the input x) flows through the functions to produce the output y, e.g. f(x) = f^(3)(f^(2)(f^(1)(x))). Note: recurrent neural networks are the extension of feedforward networks with feedback connections.
DEEP: the overall length of the chain gives the "depth" of the model (hence "deep learning"). f^(1) is called the first layer, f^(2) the second layer, and so on; the final layer of the network is called the output layer.

64 Linear → Nonlinear
Linear models usually fit efficiently and reliably, in closed form or with convex optimization. Nonlinear functions can overcome the limitations of linear models: apply a linear model to a transformed input φ(x), where φ is a nonlinear transformation.
Choices of φ: a very generic (e.g. infinite-dimensional) φ; a manually engineered φ (the dominant approach before deep learning); or a φ generated by deep learning.

65 Hidden Layers
Recall: f(x) = f^(3)(f^(2)(f^(1)(x))), where each f^(n) might be a linear or a nonlinear function.
The training data do not determine what each individual layer should do; the learning algorithm must decide how to use the layers. How the layers are connected (the architecture) is NOT itself part of "learning".

66 Hidden Layers

67 Cost Functions
The choice of cost function is important when designing a deep neural network. In most cases we use the cross-entropy between the training data and the model's predictions, since the parametric model usually defines a distribution p(y | x; θ).

68 Training a Feedforward Network
Design choices: the form of the output units, the cost function, and the optimizer.

69 Example: Learning XOR

70 Objective
Target function: y = f*(x). Our model: y = f(x; θ).
Objective: adapt θ to make f as similar as possible to f*.

71 Strategy
Treat it as a regression problem and use a mean squared error (MSE) loss, J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))², just to simplify the math here.
Our model: f(x; θ) = f(x; w, b) = xᵀw + b.

72 Problem
Minimizing the MSE gives w = 0 and b = 0.5, so the linear model outputs 0.5 everywhere. Useless!
When x1 = 0 the output must increase as x2 increases, but when x1 = 1 the output must decrease as x2 increases; a linear model cannot do both because the coefficient on x2 is fixed.
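A quick check of this claim: the least-squares fit of a linear model to the four XOR cases indeed gives w ≈ 0 and b = 0.5.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # the four XOR inputs
y = np.array([0, 1, 1, 0], dtype=float)                       # XOR targets

A = np.hstack([X, np.ones((4, 1))])          # append a column of ones for the bias b
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w1, w2, b)                             # ~0, ~0, 0.5: the model outputs 0.5 everywhere
```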

73 Solution
Add a hidden layer h containing two hidden units and rewrite the model as a composition of two functions: one mapping x to h, and one mapping h to y.

74 Solution
Use a nonlinear function to describe the features. Note: most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an "activation function": h = f^(1)(x; W, c) = g(Wᵀx + c), where W are the weights and c are the biases.
Complete network: f(x; W, c, w, b) = wᵀ max(0, Wᵀx + c) + b.
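A quick check of this complete network with the hand-constructed parameters given in the Deep Learning book (W all ones, c = [0, −1], w = [1, −2], b = 0), verifying that two ReLU hidden units solve XOR exactly.

```python
import numpy as np

W = np.array([[1., 1.], [1., 1.]])   # hidden-layer weights
c = np.array([0., -1.])              # hidden-layer biases
w = np.array([1., -2.])              # output weights
b = 0.0                              # output bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = np.maximum(0, X @ W + c)         # h = g(W^T x + c) with g = ReLU
y_hat = h @ w + b
print(y_hat)                         # [0. 1. 1. 0.] -- exactly XOR
```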

75 Side Note: Common Activation Functions

76 Gradient-based learning

77 Remark
Linear models: linearity gives linear-equation solvers with convergence guarantees, but limited capacity.
Neural networks (which include nonlinear functions): nonlinearity causes most loss functions to become non-convex, so we instead use iterative, gradient-based optimizers that drive the cost function to a small value.

78 Stochastic Gradient Descent
No convergence guarantees; sensitive to the initial parameter values. The initial parameters should be small random values (discussed in Ch. 8 of the book).

79 Output units

80 Output Units
An output unit can be any kind of neural network unit (e.g., a hidden unit); note that the same kinds of units can also be used internally in the network. The cost function is tightly coupled with the choice of output unit.

81 Output Units: Linear Units for Gaussian Output Distributions
Based on an affine transformation with no nonlinearity. Output of a linear output unit: ŷ = Wᵀh + b, where h are the given features.
Often used to produce the mean of a conditional Gaussian distribution, p(y | x) = N(y; ŷ, I); modeling the covariance as well is harder, because it must be a positive definite matrix for all inputs.
Linear units do not saturate, so they pose little difficulty for gradient-based optimization algorithms.

82 Output Units: Sigmoid Units for Bernoulli Output Distributions
Useful for classification problems with two classes: the neural net only needs to predict P(y = 1 | x), which must be a valid probability in [0, 1].
A naive choice, P(y = 1 | x) = max{0, min{1, wᵀh + b}}, has zero gradient whenever wᵀh + b is outside [0, 1], so the model is unable to learn there. What now?

83 Output Units: Sigmoid Units for Bernoulli Output Distributions
To ensure a strong gradient whenever the model has the wrong answer, combine the sigmoid output unit with maximum likelihood.
Two components: a linear layer computing z = wᵀh + b, and a sigmoid activation function converting z into a probability.
Sigmoid output unit: ŷ = σ(z), where σ(x) = 1 / (1 + exp(−x)).
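A minimal sketch of the sigmoid output unit paired with the maximum-likelihood (cross-entropy) loss; writing the negative log-likelihood in softplus form is one common, numerically stable way to do this, not necessarily the deck's exact formulation.

```python
import numpy as np

def softplus(x):
    """log(1 + exp(x)), computed stably."""
    return np.logaddexp(0.0, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(z, y):
    """Negative log-likelihood of the sigmoid output unit.
    Equals -log sigmoid(z) when y = 1 and -log(1 - sigmoid(z)) when y = 0;
    this form keeps the gradient strong whenever the model is badly wrong."""
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([-3.0, 0.0, 3.0])     # pre-activations w^T h + b
y = np.array([1.0, 1.0, 0.0])      # labels
print(sigmoid(z), bernoulli_nll(z, y))
```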

84 Output Units: Softmax Units for Multinoulli Output Distributions
Represent a probability distribution over a discrete variable with n possible values; often used as the output of a classifier. A generalization of the sigmoid function; the output is a vector ŷ.

85 Output Units: Softmax Units for Multinoulli Output Distributions
ŷᵢ = P(y = i | x), with softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ).
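A minimal, numerically stable sketch of this softmax formula (subtracting the maximum does not change the result and avoids overflow):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), with the max subtracted for stability."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities over 3 classes, sums to 1
```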

86 Intro to Hidden Units
Hidden units are not yet fully understood theoretically; the design process consists of trial and error. They are distinguished by the choice of activation function.

87 Outline
Hidden Units & ReLU; Architecture Design; Visualization Design; Brief Intro to Propagation

88 Hidden Units
The design process consists of trial and error. We do not expect training to reach a point where the gradient is exactly 0. Hidden units are distinguished by the choice of activation function.

89 Rectified Linear Units
Derivatives remain large whenever the unit is active: the derivative is 1 when active (and the second derivative is 0 almost everywhere), so gradients are large and consistent.
Typically used on top of an affine transformation, h = g(Wᵀx + b). Initializing b to a small positive value makes the ReLUs initially active for most inputs, allowing derivatives to pass through.

90 ReLU Generalizations
Based on a non-zero slope for zᵢ < 0: hᵢ = g(z, α)ᵢ = max(0, zᵢ) + αᵢ min(0, zᵢ).
Absolute value rectification: αᵢ = −1, giving g(z) = |z|.
Leaky ReLU: αᵢ = 0.01.
Parametric ReLU (PReLU): treats αᵢ as a learnable parameter.

91 ReLU Generalizations
Maxout units: g(z)ᵢ = max_{j ∈ G(i)} zⱼ, where G(i) is the set of inputs for group i, {(i−1)k+1, …, ik}.
Maxout serves as the activation function and is usually combined with dropout.
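A minimal sketch of the generalized rectifier family from these two slides, plus a maxout over groups of k pre-activations; the test values are illustrative.

```python
import numpy as np

def general_relu(z, alpha):
    """g(z, alpha)_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0, z) + alpha * np.minimum(0, z)

z = np.array([-2.0, -0.5, 1.5])
abs_rect = general_relu(z, -1.0)                       # absolute value rectification: |z|
leaky    = general_relu(z, 0.01)                       # leaky ReLU
prelu    = general_relu(z, np.array([0.3, 0.3, 0.3]))  # PReLU: alpha would be learned

def maxout(z, k=2):
    """Maxout: take the max within each consecutive group of k pre-activations."""
    return z.reshape(-1, k).max(axis=1)

print(abs_rect, leaky, prelu)
print(maxout(np.array([1.0, -2.0, 0.5, 3.0])))         # [1. 3.]
```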

92 Architecture Design

93 Architecture Design
How many units, and how should they be connected?
Units are organized into groups called layers, and the layers are arranged in a chain structure. Example: first layer h^(1) = g^(1)(W^(1)ᵀ x + b^(1)); second layer h^(2) = g^(2)(W^(2)ᵀ h^(1) + b^(2)).
Deeper networks are able to use fewer units per layer and fewer parameters, but are often harder to optimize. The architecture is chosen via experimentation guided by monitoring the validation-set error.

94 Universal Approximation Theorem
A feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function can approximate any Borel measurable function from one finite-dimensional space to another to any desired non-zero amount of error, provided the network is given enough hidden units.
So a large MLP will be able to represent any function we are trying to learn, but it is not guaranteed that the training algorithm will be able to learn that function. Two reasons learning can fail: the optimization algorithm may not find the correct parameters, and the training algorithm might choose the wrong function due to overfitting.

95 Visualization Design

96 Computational Graph

97 Symbol-to-Symbol Derivatives

98 Brief Intro to Propagation & Visualization Design

99 Forward Propagation & Back-Propagation
Forward propagation: the input x propagates up through the hidden units at each layer and produces the output ŷ; during training, forward propagation continues onward until it produces a scalar cost J(θ).
Back-propagation ("backprop") allows information from the cost function to flow backwards through the network in order to compute the gradient.

100 Thank you

