1
CNT 6805 Network Science and Applications
Lecture 2: Unsupervised Deep Learning
Dr. Dapeng Oliver Wu
Department of Electrical and Computer Engineering, University of Florida
Fall 2016
2
Outline
- Introduction to Machine Learning
- Chronological Development of Ideas
- Problems with Neural Networks
- What Exactly Is Different in Deep Learning
- Energy-Based Models and Training
- Applications to Real-World Problems
- Scalability Issues
3
Learning to Learn
Examples: face recognition, object recognition, weather prediction.
ML can be broadly classified into three major categories of problems: clustering, regression, and classification.
4
Chronological Development
- G0: Blind guess
- G1: Linear methods (PCA, LDA, LR). But what if the relationship is nonlinear?
- G2: Neural networks: use multiple nonlinear elements to approximate the nonlinear mapping
- G3: Kernel machines: linear computations in an infinite-dimensional space without 'actually' learning a mapping
5
Neural Network
A nonlinear transformation is applied at the summing nodes in the hidden and output layers, e.g., the sigmoid. The outputs are estimates of the posterior probabilities.
6
Back-Propagation If S is a logistic function,
then S’(x) = S(x)(1 – S(x))
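To make the identity concrete, here is a small NumPy check (our own sketch, not from the slides) that S'(x) = S(x)(1 - S(x)) agrees with a numerical derivative; the helper names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # S'(x) = S(x)(1 - S(x))

x = np.linspace(-4, 4, 9)
eps = 1e-6
num_grad = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(num_grad - sigmoid_prime(x))))            # prints a very small number
```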
7
Challenges with Multi-Layer NNs
- Training gets stuck in local minima or plateaus due to random initialization.
- Vanishing gradient: the gradient's effect becomes smaller and smaller in the lower layers.
- Excellent training performance but poor test performance: a classic case of overfitting.
8
Why Vanishing Gradient?
Both the sigmoid and its derivative are less than 1 (the derivative is at most 1/4). The gradient used to train each layer is a chain of such factors, one per layer above it, so the lower layers remain undertrained. A numeric illustration follows.
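A tiny NumPy sketch of the effect (ours, not the lecture's; weight factors are omitted for simplicity): each layer contributes one sigmoid-derivative factor of at most 0.25, so the product shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(10):                      # 10 stacked sigmoid layers
    pre_activation = rng.normal()            # arbitrary pre-activation value
    s = sigmoid(pre_activation)
    grad *= s * (1.0 - s)                    # each factor is <= 0.25
    print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")
```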
9
Deep Learning – Early Phase
- Unsupervised pre-training followed by traditional supervised backpropagation.
- Let the data speak for itself: try to derive the inherent features of the input.
Why it clicks:
- Pre-training creates a data-dependent prior and hence better regularization.
- It provides a set of weights W that is a better starting point.
- The lower layers are better optimized, so vanishing gradients do not affect training much.
10
Restricted Boltzmann Machine-I
- x: visible (input) units; h: hidden (latent) units.
- Energy: E(x, h) = -b^T x - c^T h - h^T W x.
- Joint probability: P(x, h) = exp(-E(x, h)) / Z, where Z is the partition function Z = sum over (x, h) of exp(-E(x, h)).
- The target is to maximize P(x) (or its log-likelihood).
- P(h|x) and P(x|h) factorize over units.
- For the binary case {0, 1}, the sigmoid function arises again: P(h_j = 1 | x) = sigmoid(c_j + W_j x) and P(x_i = 1 | h) = sigmoid(b_i + (W^T h)_i).
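A toy NumPy sketch of the factorized conditionals (our own illustration; the dimensions and random weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.normal(size=(n_hidden, n_visible))   # weights
b = np.zeros(n_visible)                            # visible biases
c = np.zeros(n_hidden)                             # hidden biases

x = rng.integers(0, 2, size=n_visible).astype(float)   # one binary visible vector

p_h_given_x = sigmoid(c + W @ x)                   # P(h_j = 1 | x), factorized
h = (rng.random(n_hidden) < p_h_given_x).astype(float)
p_x_given_h = sigmoid(b + W.T @ h)                 # P(x_i = 1 | h), factorized
print(p_h_given_x, p_x_given_h)
```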
11
Restricted Boltzmann Machine-II
The gradient of the log-likelihood can be written as
  ∂ log P(x) / ∂θ = -∂F(x)/∂θ + sum over x̃ of P(x̃) ∂F(x̃)/∂θ,
where F(x) = -log sum_h exp(-E(x, h)) is called the free energy. Averaging over the training set Q, the RHS becomes
  -E_Q[ ∂F(x)/∂θ ] + E_P[ ∂F(x̃)/∂θ ].
So, gradient = -(training term) + (model term) = -(observable statistics) + (reconstruction statistics).
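For binary hidden units the free energy has the closed form F(x) = -b^T x - sum_j log(1 + exp(c_j + W_j x)). A small sketch (ours, with toy parameters):

```python
import numpy as np

def free_energy(x, W, b, c):
    """F(x) = -b.x - sum_j log(1 + exp(c_j + W_j . x)) for a binary RBM."""
    return -b @ x - np.sum(np.logaddexp(0.0, c + W @ x))

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 6))
b, c = np.zeros(6), np.zeros(4)
x = rng.integers(0, 2, size=6).astype(float)
print(free_energy(x, W, b, c))
```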
12
Sampling Approximations
The model expectation is generally intractable, but approximating it with samples drawn by Gibbs sampling reduces it to a simpler sampling problem. The update equation then replaces the model term with an average over the sampled "negative" examples.
13
Cont'd
Now we take the partial derivatives of the free energy with respect to the parameter vector. An unbiased, sample-based estimate of the gradient then gives a weight update of the form
  ΔW ∝ ĥ(x) x^T - ĥ(x̃) x̃^T,
where ĥ(·) denotes the hidden-unit probabilities and x̃ is a sample (reconstruction) drawn from the model. Usually one Gibbs step (contrastive divergence, CD-1) is sufficient; a sketch follows.
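A hedged CD-1 training sketch in NumPy (toy data, illustrative learning rate; this is our reconstruction of the standard recipe, not the lecture's exact code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_visible, n_hidden, lr = 6, 4, 0.1
W = 0.1 * rng.normal(size=(n_hidden, n_visible))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
data = rng.integers(0, 2, size=(100, n_visible)).astype(float)   # toy binary data

for epoch in range(5):
    for x in data:
        # positive phase: hidden probabilities given the data
        ph = sigmoid(c + W @ x)
        h = (rng.random(n_hidden) < ph).astype(float)
        # negative phase: one Gibbs step to get a reconstruction
        px_tilde = sigmoid(b + W.T @ h)
        x_tilde = (rng.random(n_visible) < px_tilde).astype(float)
        ph_tilde = sigmoid(c + W @ x_tilde)
        # CD-1 updates: observable statistics minus reconstruction statistics
        W += lr * (np.outer(ph, x) - np.outer(ph_tilde, x_tilde))
        b += lr * (x - x_tilde)
        c += lr * (ph - ph_tilde)
    err = np.mean((data - sigmoid(b + sigmoid(c + data @ W.T) @ W)) ** 2)
    print(f"epoch {epoch}: mean reconstruction error = {err:.4f}")
```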
14
Deep Belief Network
- The model combines conditional distributions for layers 0, 1, ..., l-1 with a joint (RBM) distribution for the top two layers.
- Each layer is initialized as an RBM.
- Training is done greedily, layer by layer, in sequential order (sketched below).
- The pre-trained stack is then fed into a conventional neural network for supervised fine-tuning.
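A schematic of the greedy layer-wise procedure (our pseudocode-style sketch; `train_rbm` stands in for a CD-1 trainer like the one above and is not a real library call):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, seed=0):
    """Placeholder CD-1 trainer; returns (W, b, c) as in the sketch above."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.1 * rng.normal(size=(n_hidden, n_visible))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    # ... CD-1 updates as shown earlier ...
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM's hidden activations feed the next RBM."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden)
        weights.append((W, b, c))
        layer_input = sigmoid(c + layer_input @ W.T)   # propagate data up one layer
    return weights

rng = np.random.default_rng(0)
dbn = pretrain_dbn(rng.integers(0, 2, size=(100, 6)).astype(float), [8, 4])
```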
15
Deep Autoencoders
- An autoencoder encodes its input and then reconstructs it at the output.
- Autoencoders can be stacked to form deep networks, analogous to DBNs.
- The training procedure is similarly layer-by-layer, except that the final fine-tuning step may be supervised or unsupervised (plain backprop in either case).
- Variants: denoising AE, contractive AE, regularized AE. A denoising-AE sketch follows.
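A minimal denoising-autoencoder sketch with one hidden layer and tied weights (all names, the noise level, and the hyperparameters are our own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, lr = 8, 3, 0.1
W = 0.1 * rng.normal(size=(n_hidden, n_in))     # tied weights: decoder uses W.T
b_h, b_out = np.zeros(n_hidden), np.zeros(n_in)
X = rng.random((200, n_in))                     # toy data in [0, 1]

for epoch in range(20):
    for x in X:
        x_noisy = x * (rng.random(n_in) > 0.3)  # denoising: randomly drop 30% of inputs
        h = sigmoid(b_h + W @ x_noisy)          # encode the corrupted input
        x_hat = sigmoid(b_out + W.T @ h)        # decode: reconstruct the clean x
        # gradients of the squared reconstruction error
        d_out = (x_hat - x) * x_hat * (1 - x_hat)
        d_h = (W @ d_out) * h * (1 - h)
        W -= lr * (np.outer(d_h, x_noisy) + np.outer(h, d_out))
        b_out -= lr * d_out
        b_h -= lr * d_h
```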
16
Dimensionality Reduction
[Figure: reconstructions compared across methods: original images, DBN (deep autoencoder), logistic PCA, and plain PCA.]
17
What does it learn? The higher layers take a bird's-eye view and capture invariant features.
[Figure: features learned by a denoising AE vs. stacked RBMs (DBN).]
19
Computational Considerations
Part 1: unsupervised pre-training
- Dominated by matrix multiplications.
- Weight updates are sequential (just like adaptive systems/filters), but can be parallelized over nodes/dimensions.
- Tricks: use minibatches, i.e., update the weights only once per batch of examples by taking the average gradient (see the sketch below).
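A generic sketch of the minibatch trick (our own toy objective; the point is simply that one averaged update replaces many per-example updates):

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, batch_size, lr = 10, 32, 0.1
theta = np.zeros(n_params)

def per_example_gradient(x, theta):
    """Stand-in for any per-example gradient (here: a toy squared-error gradient)."""
    return theta - x

data = rng.normal(size=(320, n_params))
for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]
    grads = np.stack([per_example_gradient(x, theta) for x in batch])
    theta -= lr * grads.mean(axis=0)     # one update per minibatch, using the average gradient
```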
20
Unsupervised Pre-Training: Rarely Used Now
- With a large number of labeled training examples, the lower layers will eventually change anyway.
- Recent architectures prefer weight-initialization schemes such as Glorot et al. (2011): a zero-mean Gaussian distribution whose variance is scaled by the layer fan-in and fan-out (see the sketch after this list).
- Srivastava, Hinton, et al. (2014) propose dropout to mitigate overfitting.
- He et al. (2015) derive an initialization suited to ReLU/PReLU activations.
- ReLU: Rectified Linear Unit; PReLU: Parametric Rectified Linear Unit.
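Sketches of the two initializations mentioned, using their standard formulas (the layer sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256          # example layer sizes (ours)

# Glorot/Xavier initialization: variance scaled by fan-in and fan-out
W_glorot = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

# He initialization for ReLU/PReLU units: variance 2 / fan_in
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
```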
21
Dropout Neural Net Model (1)
Srivastava N., Hinton G. E., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
22
Dropout Neural Net Model (2)
23
Dropout Neural Net Model (3)
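A minimal sketch of dropout applied to one hidden activation, in the "inverted dropout" form that rescales at training time rather than scaling the weights at test time (the keep probability 0.5 and the toy layer are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5                                  # probability of keeping a unit
h = rng.random(10)                            # some hidden-layer activations

# Training: randomly zero units and rescale so the expected activation is unchanged
mask = (rng.random(h.shape) < p_keep)
h_train = h * mask / p_keep

# Testing: use all units as-is (the rescaling above already accounts for p_keep)
h_test = h
```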
24
Example: Handwritten Digits Recognition
25
Recurrent Neural Networks (RNN)
- Deep learning for time-series data.
- Uses memory (a recurrent hidden state) to process input sequences.
- The output y can be any supervised target, or even future samples of x, as in prediction. A minimal forward-pass sketch follows.
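A minimal vanilla-RNN forward pass in NumPy (toy dimensions of our choosing), showing the recurrent state h that carries memory across time steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 1, 8
W_xh = 0.1 * rng.normal(size=(n_hidden, n_in))
W_hh = 0.1 * rng.normal(size=(n_hidden, n_hidden))
W_hy = 0.1 * rng.normal(size=(n_out, n_hidden))

xs = rng.normal(size=(T, n_in))               # a toy input sequence
h = np.zeros(n_hidden)                        # the recurrent "memory"
for t in range(T):
    h = np.tanh(W_xh @ xs[t] + W_hh @ h)      # recurrent state update
    y = W_hy @ h                              # output: supervised target or next-sample prediction
    print(t, y)
```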
26
Vanishing/Exploding Gradients: Both Temporal and Spatial
- Multi-layer RNNs have their lower layers undertrained (the spatial direction).
- Information from previous inputs is not properly carried forward, because the backpropagated gradients are chained products across time steps (the temporal direction).
- As a result, plain RNNs cannot handle long-range dependencies.
27
Why, again? We cannot relate inputs from the distant past to the target output.
28
Long Short-Term Memory (LSTM)
Error signals trapped within a memory cell flow back without being changed, so they do not vanish. The gates must learn which errors to trap in the cell and which to forget. A single-cell sketch follows.
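A single LSTM-cell step using the standard gate equations (dimensions, initialization, and the concatenated-input convention are our choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 6
# one weight matrix per gate, acting on the concatenated [h_prev, x]
W_f, W_i, W_o, W_c = (0.1 * rng.normal(size=(n_hidden, n_hidden + n_in)) for _ in range(4))

x = rng.normal(size=n_in)
h_prev, c_prev = np.zeros(n_hidden), np.zeros(n_hidden)
z = np.concatenate([h_prev, x])

f = sigmoid(W_f @ z)              # forget gate: what to erase from the cell
i = sigmoid(W_i @ z)              # input gate: what new content to write
o = sigmoid(W_o @ z)              # output gate: what to expose
c_tilde = np.tanh(W_c @ z)        # candidate cell content
c = f * c_prev + i * c_tilde      # memory cell update
h = o * np.tanh(c)                # new hidden state
```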
29
Conclusion
- A practical breakthrough: companies are happy, but theoreticians remain unconvinced.
- Deep learning architectures have won many competitions in the recent past.
- There are plans to apply these concepts to build an artificial brain for big data.