1
CNT 6805 Network Science and Applications
Lecture 2: Unsupervised Deep Learning
Dr. Dapeng Oliver Wu
Department of Electrical and Computer Engineering, University of Florida
Fall 2016
2
Outline
- Introduction to Machine Learning
- Chronological Development of Ideas
- Problems with Neural Networks
- What Exactly Is Different in Deep Learning
- Energy-Based Models and Training
- Applications to Real-World Problems
- Scalability Issues
3
Learning to Learn
Examples: face recognition, object recognition, weather prediction.
ML can be broadly classified into three major categories of problems: clustering, regression, and classification.
4
Chronological Development
- G0: Blind guess
- G1: Linear methods (PCA, LDA, LR). But what if the relationship is nonlinear?
- G2: Neural networks: use multiple nonlinear elements to approximate the nonlinear mapping
- G3: Kernel machines: linear computations in an infinite-dimensional space without 'actually' learning a mapping
5
Neural Network
A nonlinear transformation is applied at the summing nodes in the hidden and output layers, e.g., the sigmoid. The outputs are estimates of the posterior probabilities.
6
Back-Propagation If S is a logistic function,
then S’(x) = S(x)(1 – S(x))
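To make the identity concrete, here is a small NumPy check (our own sketch, not from the slides) that S'(x) = S(x)(1 - S(x)) agrees with a numerical derivative; the helper names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # S'(x) = S(x)(1 - S(x))

x = np.linspace(-4, 4, 9)
eps = 1e-6
num_grad = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)  # central difference
print(np.max(np.abs(num_grad - sigmoid_prime(x))))            # prints a very small number
```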
7
Challenges with Multi-Layer NNs
- Training gets stuck in local minima or plateaus due to random initialization.
- Vanishing gradient: the gradient's effect becomes smaller and smaller in the lower layers.
- Excellent training performance but poor test performance: a classic case of overfitting.
8
Why Vanishing Gradient?
Both the sigmoid and its derivative are less than 1 (the derivative is at most 1/4). The gradient used to train each layer is a chain of such factors, one per layer above it, so the lower layers remain undertrained. A numeric illustration follows.
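A tiny NumPy sketch of the effect (ours, not the lecture's; weight factors are omitted for simplicity): each layer contributes one sigmoid-derivative factor of at most 0.25, so the product shrinks geometrically with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(10):                      # 10 stacked sigmoid layers
    pre_activation = rng.normal()            # arbitrary pre-activation value
    s = sigmoid(pre_activation)
    grad *= s * (1.0 - s)                    # each factor is <= 0.25
    print(f"after layer {layer + 1}: gradient factor = {grad:.2e}")
```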
9
Deep Learning – Early Phase
- Unsupervised pre-training followed by traditional supervised backpropagation.
- Let the data speak for itself: try to derive the inherent features of the input.
Why it clicks:
- Pre-training creates a data-dependent prior and hence better regularization.
- It provides a set of weights W that is a better starting point.
- The lower layers are better optimized, so vanishing gradients do not affect training much.
10
Restricted Boltzmann Machine-I
- x: visible (input) units; h: hidden (latent) units.
- Energy: E(x, h) = -b^T x - c^T h - h^T W x.
- Joint probability: P(x, h) = exp(-E(x, h)) / Z, where Z is the partition function Z = sum over (x, h) of exp(-E(x, h)).
- The target is to maximize P(x) (or its log-likelihood).
- P(h|x) and P(x|h) factorize over units.
- For the binary case {0, 1}, the sigmoid function arises again: P(h_j = 1 | x) = sigmoid(c_j + W_j x) and P(x_i = 1 | h) = sigmoid(b_i + (W^T h)_i).
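A toy NumPy sketch of the factorized conditionals (our own illustration; the dimensions and random weights are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = 0.1 * rng.normal(size=(n_hidden, n_visible))   # weights
b = np.zeros(n_visible)                            # visible biases
c = np.zeros(n_hidden)                             # hidden biases

x = rng.integers(0, 2, size=n_visible).astype(float)   # one binary visible vector

p_h_given_x = sigmoid(c + W @ x)                   # P(h_j = 1 | x), factorized
h = (rng.random(n_hidden) < p_h_given_x).astype(float)
p_x_given_h = sigmoid(b + W.T @ h)                 # P(x_i = 1 | h), factorized
print(p_h_given_x, p_x_given_h)
```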
11
Restricted Boltzmann Machine-II
The gradient of the log-likelihood can be written as
  ∂ log P(x) / ∂θ = -∂F(x)/∂θ + sum over x̃ of P(x̃) ∂F(x̃)/∂θ,
where F(x) = -log sum_h exp(-E(x, h)) is called the free energy. Averaging over the training set Q, the RHS becomes
  -E_Q[ ∂F(x)/∂θ ] + E_P[ ∂F(x̃)/∂θ ].
So, gradient = -(training term) + (model term) = -(observable statistics) + (reconstruction statistics).
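For binary hidden units the free energy has the closed form F(x) = -b^T x - sum_j log(1 + exp(c_j + W_j x)). A small sketch (ours, with toy parameters):

```python
import numpy as np

def free_energy(x, W, b, c):
    """F(x) = -b.x - sum_j log(1 + exp(c_j + W_j . x)) for a binary RBM."""
    return -b @ x - np.sum(np.logaddexp(0.0, c + W @ x))

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 6))
b, c = np.zeros(6), np.zeros(4)
x = rng.integers(0, 2, size=6).astype(float)
print(free_energy(x, W, b, c))
```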
12
Sampling Approximations
The model expectation is generally intractable, but approximating it with samples drawn by Gibbs sampling reduces it to a simpler sampling problem. The update equation then replaces the model term with an average over the sampled "negative" examples.
13
Cont'd
Now we take the partial derivatives of the free energy with respect to the parameter vector. An unbiased, sample-based estimate of the gradient then gives a weight update of the form
  ΔW ∝ ĥ(x) x^T - ĥ(x̃) x̃^T,
where ĥ(·) denotes the hidden-unit probabilities and x̃ is a sample (reconstruction) drawn from the model. Usually one Gibbs step (contrastive divergence, CD-1) is sufficient; a sketch follows.
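A hedged CD-1 training sketch in NumPy (toy data, illustrative learning rate; this is our reconstruction of the standard recipe, not the lecture's exact code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_visible, n_hidden, lr = 6, 4, 0.1
W = 0.1 * rng.normal(size=(n_hidden, n_visible))
b, c = np.zeros(n_visible), np.zeros(n_hidden)
data = rng.integers(0, 2, size=(100, n_visible)).astype(float)   # toy binary data

for epoch in range(5):
    for x in data:
        # positive phase: hidden probabilities given the data
        ph = sigmoid(c + W @ x)
        h = (rng.random(n_hidden) < ph).astype(float)
        # negative phase: one Gibbs step to get a reconstruction
        px_tilde = sigmoid(b + W.T @ h)
        x_tilde = (rng.random(n_visible) < px_tilde).astype(float)
        ph_tilde = sigmoid(c + W @ x_tilde)
        # CD-1 updates: observable statistics minus reconstruction statistics
        W += lr * (np.outer(ph, x) - np.outer(ph_tilde, x_tilde))
        b += lr * (x - x_tilde)
        c += lr * (ph - ph_tilde)
    err = np.mean((data - sigmoid(b + sigmoid(c + data @ W.T) @ W)) ** 2)
    print(f"epoch {epoch}: mean reconstruction error = {err:.4f}")
```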
14
Deep Belief Network
- The model combines conditional distributions for layers 0, 1, ..., l-1 with a joint (RBM) distribution for the top two layers.
- Each layer is initialized as an RBM.
- Training is done greedily, layer by layer, in sequential order (sketched below).
- The pre-trained stack is then fed into a conventional neural network for supervised fine-tuning.
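A schematic of the greedy layer-wise procedure (our pseudocode-style sketch; `train_rbm` stands in for a CD-1 trainer like the one above and is not a real library call):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, seed=0):
    """Placeholder CD-1 trainer; returns (W, b, c) as in the sketch above."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.1 * rng.normal(size=(n_hidden, n_visible))
    b, c = np.zeros(n_visible), np.zeros(n_hidden)
    # ... CD-1 updates as shown earlier ...
    return W, b, c

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-wise pre-training: each RBM's hidden activations feed the next RBM."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(layer_input, n_hidden)
        weights.append((W, b, c))
        layer_input = sigmoid(c + layer_input @ W.T)   # propagate data up one layer
    return weights

rng = np.random.default_rng(0)
dbn = pretrain_dbn(rng.integers(0, 2, size=(100, 6)).astype(float), [8, 4])
```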
15
Deep Autoencoders
- An autoencoder encodes its input and then reconstructs it at the output.
- Autoencoders can be stacked to form deep networks, analogous to DBNs.
- The training procedure is similarly layer-by-layer, except that the final fine-tuning step may be supervised or unsupervised (plain backprop in either case).
- Variants: denoising AE, contractive AE, regularized AE. A denoising-AE sketch follows.
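A minimal denoising-autoencoder sketch with one hidden layer and tied weights (all names, the noise level, and the hyperparameters are our own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, lr = 8, 3, 0.1
W = 0.1 * rng.normal(size=(n_hidden, n_in))     # tied weights: decoder uses W.T
b_h, b_out = np.zeros(n_hidden), np.zeros(n_in)
X = rng.random((200, n_in))                     # toy data in [0, 1]

for epoch in range(20):
    for x in X:
        x_noisy = x * (rng.random(n_in) > 0.3)  # denoising: randomly drop 30% of inputs
        h = sigmoid(b_h + W @ x_noisy)          # encode the corrupted input
        x_hat = sigmoid(b_out + W.T @ h)        # decode: reconstruct the clean x
        # gradients of the squared reconstruction error
        d_out = (x_hat - x) * x_hat * (1 - x_hat)
        d_h = (W @ d_out) * h * (1 - h)
        W -= lr * (np.outer(d_h, x_noisy) + np.outer(h, d_out))
        b_out -= lr * d_out
        b_h -= lr * d_h
```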
16
Dimensionality Reduction
[Figure: reconstructions compared across methods: original images, DBN (deep autoencoder), logistic PCA, and plain PCA.]
17
What does it learn? The higher layers take a bird's-eye view and capture invariant features.
[Figure: features learned by a denoising AE vs. stacked RBMs (DBN).]
19
Computational Considerations
Part 1: unsupervised pre-training
- Dominated by matrix multiplications.
- Weight updates are sequential (just like adaptive systems/filters), but can be parallelized over nodes/dimensions.
- Tricks: use minibatches, i.e., update the weights only once per batch of examples by taking the average gradient (see the sketch below).
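A generic sketch of the minibatch trick (our own toy objective; the point is simply that one averaged update replaces many per-example updates):

```python
import numpy as np

rng = np.random.default_rng(0)
n_params, batch_size, lr = 10, 32, 0.1
theta = np.zeros(n_params)

def per_example_gradient(x, theta):
    """Stand-in for any per-example gradient (here: a toy squared-error gradient)."""
    return theta - x

data = rng.normal(size=(320, n_params))
for start in range(0, len(data), batch_size):
    batch = data[start:start + batch_size]
    grads = np.stack([per_example_gradient(x, theta) for x in batch])
    theta -= lr * grads.mean(axis=0)     # one update per minibatch, using the average gradient
```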
20
Unsupervised Pre-Training: Rarely Used Now
- With a large number of labeled training examples, the lower layers will eventually change anyway.
- Recent architectures prefer weight-initialization schemes such as Glorot et al. (2011): a zero-mean Gaussian distribution whose variance is scaled by the layer fan-in and fan-out (see the sketch after this list).
- Srivastava, Hinton, et al. (2014) propose dropout to mitigate overfitting.
- He et al. (2015) derive an initialization suited to ReLU/PReLU activations.
- ReLU: Rectified Linear Unit; PReLU: Parametric Rectified Linear Unit.
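Sketches of the two initializations mentioned, using their standard formulas (the layer sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256          # example layer sizes (ours)

# Glorot/Xavier initialization: variance scaled by fan-in and fan-out
W_glorot = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_out, fan_in))

# He initialization for ReLU/PReLU units: variance 2 / fan_in
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
```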
21
Dropout Neural Net Model (1)
Srivastava N., Hinton G. E., Krizhevsky A., Sutskever I., Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
22
Dropout Neural Net Model (2)
23
Dropout Neural Net Model (3)
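A minimal sketch of dropout applied to one hidden activation, in the "inverted dropout" form that rescales at training time rather than scaling the weights at test time (the keep probability 0.5 and the toy layer are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
p_keep = 0.5                                  # probability of keeping a unit
h = rng.random(10)                            # some hidden-layer activations

# Training: randomly zero units and rescale so the expected activation is unchanged
mask = (rng.random(h.shape) < p_keep)
h_train = h * mask / p_keep

# Testing: use all units as-is (the rescaling above already accounts for p_keep)
h_test = h
```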
24
Example: Handwritten Digits Recognition
25
Recurrent Neural Networks (RNN)
- Deep learning for time-series data.
- Uses memory (a recurrent hidden state) to process input sequences.
- The output y can be any supervised target, or even future samples of x, as in prediction. A minimal forward-pass sketch follows.
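A minimal vanilla-RNN forward pass in NumPy (toy dimensions of our choosing), showing the recurrent state h that carries memory across time steps:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, T = 3, 5, 1, 8
W_xh = 0.1 * rng.normal(size=(n_hidden, n_in))
W_hh = 0.1 * rng.normal(size=(n_hidden, n_hidden))
W_hy = 0.1 * rng.normal(size=(n_out, n_hidden))

xs = rng.normal(size=(T, n_in))               # a toy input sequence
h = np.zeros(n_hidden)                        # the recurrent "memory"
for t in range(T):
    h = np.tanh(W_xh @ xs[t] + W_hh @ h)      # recurrent state update
    y = W_hy @ h                              # output: supervised target or next-sample prediction
    print(t, y)
```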
26
Vanishing/Exploding Gradients: Both Temporal and Spatial
- Multi-layer RNNs have their lower layers undertrained (the spatial direction).
- Information from previous inputs is not properly carried forward, because the backpropagated gradients are chained products across time steps (the temporal direction).
- As a result, plain RNNs cannot handle long-range dependencies.
27
Why, again? We cannot relate inputs from the distant past to the target output.
28
Long Short-Term Memory (LSTM)
Error signals trapped within a memory cell flow back without being changed, so they do not vanish. The gates must learn which errors to trap in the cell and which to forget. A single-cell sketch follows.
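A single LSTM-cell step using the standard gate equations (dimensions, initialization, and the concatenated-input convention are our choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 6
# one weight matrix per gate, acting on the concatenated [h_prev, x]
W_f, W_i, W_o, W_c = (0.1 * rng.normal(size=(n_hidden, n_hidden + n_in)) for _ in range(4))

x = rng.normal(size=n_in)
h_prev, c_prev = np.zeros(n_hidden), np.zeros(n_hidden)
z = np.concatenate([h_prev, x])

f = sigmoid(W_f @ z)              # forget gate: what to erase from the cell
i = sigmoid(W_i @ z)              # input gate: what new content to write
o = sigmoid(W_o @ z)              # output gate: what to expose
c_tilde = np.tanh(W_c @ z)        # candidate cell content
c = f * c_prev + i * c_tilde      # memory cell update
h = o * np.tanh(c)                # new hidden state
```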
29
Conclusion
- A practical breakthrough: companies are happy, but theoreticians remain unconvinced.
- Deep learning architectures have won many competitions in the recent past.
- There are plans to apply these concepts to build an artificial brain for big data.