CNT 6805 Network Science and Applications
Lecture 2: Unsupervised Deep Learning
Dr. Dapeng Oliver Wu
Department of Electrical and Computer Engineering, University of Florida, Fall 2016
Outline
- Introduction to Machine Learning
- Chronological Development of Ideas
- Problems with Neural Networks
- What Exactly Is Different in Deep Learning
- Energy-Based Models and Training
- Applications to Real-World Problems
- Scalability Issues
Learning to Learn
- Examples: face recognition, object recognition, weather prediction.
- ML can be broadly classified into three major categories of problems: clustering, regression, and classification.
Chronological Development
- G0: blind guess.
- G1: linear methods (PCA, LDA, LR). What if the relationship is nonlinear?
- G2: neural networks, which use multiple nonlinear elements to approximate the mapping.
- G3: kernel machines: linear computations in an infinite-dimensional space without 'actually' learning the mapping.
Neural Network
- Nonlinear transformation at the summing nodes of the hidden and output layers, e.g. a sigmoid.
- The outputs are estimates of the posterior probabilities.
Back-Propagation
- If S is the logistic function, then S'(x) = S(x)(1 - S(x)).
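As a rough illustration of this derivative in use (my own sketch, not from the lecture; the layer sizes, loss, and learning rate are arbitrary), a one-hidden-layer network can be trained with backpropagation as follows:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy data: 4 samples, 3 features, binary target
X = np.random.randn(4, 3)
y = np.array([[0.], [1.], [1.], [0.]])

# random initialization: one hidden layer of 5 units
W1, b1 = np.random.randn(3, 5) * 0.1, np.zeros(5)
W2, b2 = np.random.randn(5, 1) * 0.1, np.zeros(1)
lr = 0.5

for epoch in range(1000):
    # forward pass
    h = sigmoid(X @ W1 + b1)          # hidden activations
    p = sigmoid(h @ W2 + b2)          # output (posterior estimate)

    # backward pass (squared-error loss), using S'(x) = S(x)(1 - S(x))
    delta2 = (p - y) * p * (1 - p)            # error at the output layer
    delta1 = (delta2 @ W2.T) * h * (1 - h)    # error propagated to the hidden layer

    # gradient-descent updates
    W2 -= lr * h.T @ delta2; b2 -= lr * delta2.sum(axis=0)
    W1 -= lr * X.T @ delta1; b1 -= lr * delta1.sum(axis=0)
```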
Challenges with Multi-Layer NNs
- Gets stuck in local minima or on plateaus due to random initialization.
- Vanishing gradient: the error signal becomes smaller and smaller in the lower layers.
- Excellent training performance but poor testing performance: a classic case of overfitting.
Why Vanishing Gradient?
- Both the sigmoid and its derivative are < 1 (in fact S'(x) <= 1/4).
- By the chain rule, the gradient used to train a layer is a product of per-layer factors, each involving W and S'(.), so it shrinks as it is propagated downward.
- Lower layers therefore remain undertrained.
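A toy numerical demonstration (my own, not from the slides) of this shrinkage: push a unit error signal back through a stack of sigmoid layers and watch its norm decay, since every layer contributes a factor of W^T times S'(a) with S'(a) <= 1/4.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
n_layers, width = 8, 50
Ws = [np.random.randn(width, width) * 0.1 for _ in range(n_layers)]

# forward pass, storing activations
x = np.random.randn(width)
acts = []
for W in Ws:
    x = sigmoid(W @ x)
    acts.append(x)

# backward pass: each layer multiplies the error by W^T and by S'(a) = S(a)(1 - S(a))
grad = np.ones(width)
for l in reversed(range(n_layers)):
    grad = (Ws[l].T @ grad) * acts[l] * (1 - acts[l])
    print(f"layer {l}: |grad| = {np.linalg.norm(grad):.2e}")
```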
Deep Learning, Early Phase
- Unsupervised pre-training followed by traditional supervised backpropagation.
- Let the data speak for itself: try to derive the inherent features of the input.
- Why does it work?
  - Pre-training creates a data-dependent prior and hence better regularization.
  - It gives a set of W's that is a better starting point.
  - The lower layers are better optimized, so vanishing gradients matter less.
Restricted Boltzmann Machine, I
- x: visible (input) units; h: hidden (latent) units.
- Energy: E(x, h) = -b^T x - c^T h - h^T W x.
- Joint probability: P(x, h) = e^{-E(x,h)} / Z, where Z is the partition function Z = Σ_{x,h} e^{-E(x,h)}.
- The target is to maximize P(x) (or its log-likelihood).
- P(h|x) and P(x|h) factorize over units.
- For the binary case {0,1}, the sigmoid function again arises:
  P(h_j = 1 | x) = sigm(c_j + W_j x),  P(x_i = 1 | h) = sigm(b_i + (W^T h)_i).
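The energy and the factorized conditionals translate directly into code. The following sketch is my own illustration (parameter names W, b, c and shapes follow the equations above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def energy(x, h, W, b, c):
    """E(x, h) = -b.x - c.h - h.W.x for binary visible x and hidden h."""
    return -b @ x - c @ h - h @ W @ x

def p_h_given_x(x, W, c):
    """Factorized conditional: P(h_j = 1 | x) = sigm(c_j + W_j x)."""
    return sigmoid(c + W @ x)

def p_x_given_h(h, W, b):
    """Factorized conditional: P(x_i = 1 | h) = sigm(b_i + (W^T h)_i)."""
    return sigmoid(b + W.T @ h)
```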
Restricted Boltzmann Machine, II
- The gradient of the log-likelihood looks like
  ∂ log P(x) / ∂θ = -∂F(x)/∂θ + Σ_{x'} P(x') ∂F(x')/∂θ,
  where F(x) = -log Σ_h e^{-E(x,h)} is called the free energy.
- If we average it over the training set Q, the RHS looks like
  -E_Q[∂F(x)/∂θ] + E_P[∂F(x)/∂θ].
- So, gradient = -training + model = -observable + reconstruction.
Sampling Approximations
- The model expectation E_P[.] is generally intractable, but approximations lead to a simpler sampling problem: replace it by the free-energy gradient at a sample x' drawn from the model (in practice, by Gibbs sampling started at x).
- The update equation now looks like
  ∂ log P(x)/∂θ ≈ -∂F(x)/∂θ + ∂F(x')/∂θ.
Cont'd
- Now we take the partial derivatives of the free energy F(x) with respect to the parameter vector θ = (W, b, c).
- So, an unbiased stochastic estimate of the gradient (given a sample x' from the model) gives the weight update
  ΔW = ε ( sigm(c + W x) x^T - sigm(c + W x') x'^T ).
- Usually one Gibbs step (CD-1) is sufficient.
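Putting the pieces together, a CD-1 training loop might look like the sketch below (my own illustration of the update rule, not code from the course; binary row-vector data and an arbitrary learning rate are assumed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_epoch(X, W, b, c, lr=0.1, rng=np.random):
    """One epoch of CD-1 over binary data X (n_samples x n_visible)."""
    for x in X:
        # positive phase: hidden probabilities given the data
        ph = sigmoid(c + W @ x)
        h = (rng.rand(len(ph)) < ph).astype(float)          # sample h ~ P(h|x)

        # negative phase: one Gibbs step back to visible, then to hidden
        px = sigmoid(b + W.T @ h)
        x_recon = (rng.rand(len(px)) < px).astype(float)    # reconstruction x'
        ph_recon = sigmoid(c + W @ x_recon)

        # CD-1 updates: data term minus reconstruction term
        W += lr * (np.outer(ph, x) - np.outer(ph_recon, x_recon))
        b += lr * (x - x_recon)
        c += lr * (ph - ph_recon)
    return W, b, c
```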
Deep Belief Network
- The joint distribution combines conditional distributions for layers 0, 1, ..., l-1 with a joint distribution for the top two layers:
  P(x, h^1, ..., h^l) = P(h^{l-1}, h^l) ∏_{k=0}^{l-2} P(h^k | h^{k+1}),  with h^0 = x.
- Each layer is initialized as an RBM.
- Training is done greedily, layer by layer, in sequential order.
- The pre-trained weights are then fed into a conventional neural network for supervised fine-tuning.
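As a hedged sketch of the greedy layer-wise procedure (not the course's code), scikit-learn's BernoulliRBM can stand in for each layer's RBM, with a supervised classifier trained on the top-level features; the data, labels, and layer widths below are placeholders:

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

# X: data scaled to [0, 1], y: labels -- placeholders for illustration
X = np.random.rand(200, 64)
y = np.random.randint(0, 2, 200)

layer_sizes = [32, 16]               # arbitrary hidden-layer widths
rbms, rep = [], X
for n_hidden in layer_sizes:
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=20)
    rep = rbm.fit_transform(rep)     # train this layer, then feed its hidden
    rbms.append(rbm)                 # representation to the next layer

# supervised step on top of the pre-trained features
clf = LogisticRegression().fit(rep, y)
```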
Deep Autoencoders
- Encode the input and then reconstruct it at the output.
- Can be stacked to form deep networks, analogous to DBNs.
- The training procedure is similar, layer by layer, except that the final fine-tuning step may be supervised (just like backprop) or unsupervised.
- Variants: denoising AE, contractive AE, regularized AE.
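A minimal denoising-autoencoder sketch in Keras (one possible implementation, not the course's; the noise level, layer width, optimizer, and placeholder data are assumptions):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# X: training data scaled to [0, 1], e.g. flattened images -- placeholder here
X = np.random.rand(1000, 784)

# corrupt the input, then train the network to reconstruct the clean version
X_noisy = np.clip(X + 0.3 * np.random.randn(*X.shape), 0.0, 1.0)

autoencoder = Sequential()
autoencoder.add(Dense(128, activation='relu', input_shape=(784,)))   # encoder / learned code
autoencoder.add(Dense(784, activation='sigmoid'))                    # decoder / reconstruction
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(X_noisy, X, batch_size=128, epochs=10)
```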
Dimensionality Reduction
[Figure: reconstructions from the original data, a DBN, logistic PCA, and plain PCA.]
What Does It Learn?
- Higher layers take a bird's-eye view: increasingly invariant features.
[Figure: features learned by a denoising AE and by stacked RBMs (DBN).]
Computational Considerations
- Part 1, unsupervised pretraining: dominated by matrix multiplications.
- The weight updates are sequential (just like adaptive systems/filters), but can be parallelized over nodes/dimensions.
- Tricks: use minibatches, updating the weights only once per batch by averaging the gradients over it.
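A small sketch of the minibatch trick (my own, assuming "taking the average" means averaging gradients over a batch; grad_fn is a hypothetical per-sample gradient function):

```python
import numpy as np

def minibatch_sgd(W, X, grad_fn, lr=0.1, batch_size=32):
    """One pass over X, updating W once per minibatch with the averaged gradient."""
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]
        # average the per-sample gradients, then apply a single update
        g = np.mean([grad_fn(W, x) for x in batch], axis=0)
        W -= lr * g
    return W
```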
Unsupervised Pre-Training: Rarely Used Now
- With a large number of labeled training examples, the lower layers will eventually change anyway.
- Recent architectures prefer direct weight initialization, e.g. Glorot et al. (2011): a zero-mean Gaussian with variance 2/(n_in + n_out).
- Srivastava, Hinton, et al. (2014) propose dropout to mitigate overfitting.
- He et al. (2015) derive an optimal weight initialization for ReLU/PReLU activations (variance 2/n_in).
- ReLU: Rectified Linear Unit. PReLU: Parametric Rectified Linear Unit.
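The two initialization schemes mentioned above are easy to write down (standard formulas, not code from the slides): Glorot scaling uses both fan-in and fan-out, while He initialization for ReLU uses variance 2/fan_in.

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng=np.random):
    """Glorot (Xavier) initialization: zero-mean Gaussian, var = 2/(fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.randn(fan_out, fan_in) * std

def he_normal(fan_in, fan_out, rng=np.random):
    """He initialization for ReLU/PReLU layers: zero-mean Gaussian, var = 2/fan_in."""
    std = np.sqrt(2.0 / fan_in)
    return rng.randn(fan_out, fan_in) * std

W1 = glorot_normal(784, 512)   # e.g. first layer of an MNIST MLP
W2 = he_normal(512, 512)       # a ReLU hidden layer
```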
Dropout Neural Net Model (1)
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.
Dropout Neural Net Model (2)
Dropout Neural Net Model (3)
Example: Handwritten Digit Recognition
https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py
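A condensed sketch loosely modeled on the referenced example (layer sizes and hyperparameters are illustrative and may differ from the repository), showing dropout applied between fully connected layers:

```python
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))                       # randomly drop 20% of activations
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))    # class posteriors

model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=10,
          validation_data=(x_test, y_test))
```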
Recurrent Neural Networks (RNN)
- Deep learning for time-series data.
- Uses memory (a hidden state) to process input sequences.
- The output y can be any supervised target, or even the future samples of x, as in prediction.
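A minimal sketch of the vanilla RNN recurrence (my own, with arbitrary sizes): h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), with a linear read-out for y_t.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h), y_t = W_hy h_t + b_y."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x_t in xs:                       # xs: sequence of input vectors
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        ys.append(W_hy @ h + b_y)        # e.g. next-sample prediction
    return np.array(ys), h
```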
Vanishing/Exploding Gradient: Both Temporal and Spatial
- Multi-layered RNNs have their lower layers undertrained.
- Information from previous inputs is not properly carried through the chained gradients.
- Also, plain RNNs cannot handle long-range dependencies.
Why, again?
- We cannot relate inputs from the distant past to the target output: backpropagation through time multiplies the error by the recurrent Jacobian at every step, so the contribution of distant inputs decays (or blows up) exponentially.
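A toy illustration (my own) of why: each step of backpropagation through time multiplies the error by the recurrent Jacobian, so its norm shrinks or explodes exponentially depending on whether the weight scale is below or above 1.

```python
import numpy as np

np.random.seed(0)
for scale in (0.9, 1.1):                       # spectral norm below / above 1
    W_hh = np.linalg.qr(np.random.randn(50, 50))[0] * scale   # scaled orthogonal matrix
    grad = np.ones(50)
    for t in range(100):
        grad = W_hh.T @ grad                   # one BPTT step (tanh' factor omitted)
    print(f"scale {scale}: |grad| after 100 steps = {np.linalg.norm(grad):.2e}")
```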
Long Short-Term Memory
- Error signals trapped within a memory cell cannot change (they are carried back unchanged), so they do not vanish.
- The gates have to learn which errors to trap and which ones to forget.
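A short Keras sketch of an LSTM used for the kind of next-sample prediction mentioned earlier (illustrative only; the sequence length, feature count, layer width, and placeholder data are assumptions):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps, n_features = 20, 1                     # assumed shape of the input windows
X = np.random.rand(500, timesteps, n_features)    # placeholder sequences
y = np.random.rand(500, n_features)               # next-sample targets

model = Sequential()
model.add(LSTM(64, input_shape=(timesteps, n_features)))   # gated memory cells
model.add(Dense(n_features))                                # predict the next sample
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, batch_size=32, epochs=5)
```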
Conclusion
- A practical breakthrough: companies are happy, but theoreticians remain unconvinced.
- Deep learning architectures have won many competitions in the recent past.
- There are plans to apply these concepts to building an artificial brain for big data.