Limitations of Traditional Deep Network Architectures

Limitations of Traditional Deep Network Architectures
with Daniel L. Silver, Ph.D. and Christian Frey, BBA, April 11-12, 2017
Deep Learning Workshop

Outline
Problem of the vanishing gradient beyond two hidden layers
Methods of overcoming the problem:
Long, patient training
Better activation functions
Pre-training
Constrained internal representations (weight sharing)
Better regularizers (sparsity, weight decay, dropout, injected noise)
Tutorial: demo of how more than three layers can hurt

Exploding or Vanishing Gradients
There is a big difference between the forward and backward passes of BP.
Forward pass: uses squashing functions (like the logistic) to prevent the activity vectors from exploding.
Backward pass: completely linear. If you double the error derivatives at the final layer, all the error derivatives will double.
https://www.youtube.com/watch?v=SKMpmAOUa2Q
Based on a slide by Geoff Hinton

Exploding or Vanishing Gradients
What happens to the magnitude of the gradients as we backpropagate through many layers?
If the weights are big, the gradients grow exponentially (explode).
If the weights are small, the gradients shrink exponentially (vanish).
Typical feed-forward neural nets can cope with these exponential effects because they have only a few hidden layers.
Deep networks: exploding gradients can be handled by clipping the weights; the vanishing gradient is the bigger problem. A small numerical illustration follows below.
Details: https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b
Based on a slide by Geoff Hinton
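A minimal numpy sketch (an illustration added here, not from the slides) of the exponential effect: a unit gradient is pushed backwards through many random layers with sigmoid-style derivatives, once with small weights and once with large weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 20, 50

def backprop_norm(weight_scale):
    """Push a unit gradient backwards through n_layers random linear layers,
    each followed by a logistic-sigmoid derivative factor, and return its size."""
    grad = np.ones(width)
    for _ in range(n_layers):
        W = rng.normal(0, weight_scale, (width, width))
        a = rng.random(width)               # stand-in sigmoid activations in (0, 1)
        grad = (W.T @ grad) * a * (1 - a)   # chain rule: linear map times f'(net)
    return np.linalg.norm(grad)

print("small weights:", backprop_norm(0.1))  # shrinks toward 0 (vanishing)
print("large weights:", backprop_norm(5.0))  # blows up (exploding)
```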

Vanishing Gradient
The early layers of an MLP do not get trained well, which leads to very slow training, especially in the lower layers.
The top couple of layers can usually learn any task "pretty well", but the error signal reaching the lower layers drops quickly. The lower layers never get the opportunity to use their capacity to improve results; they just provide a random feature map. We need a way for the early layers to do effective work.
If the weights are large, the gradient can explode as it goes down the net. This is much less common, but it can still happen; note that large weights usually lead to saturation and therefore an even smaller f'(net).
Another important reason the error is low: the final layer updates its weights so that overall accuracy is already fairly good based on the last hidden layer alone (with the earlier layers still close to their initial weights). The output error (t - z) therefore gets small quickly from the learning of the last few layers, while the initial layers act as little more than a random feature shuffle.
Local minima? Some argue the real difficulty is more likely slow, difficult curvature (Martens: Hessian-Free learning). Properly tuned momentum with proper initial conditions can make a big difference. LSTMs are a more specialized approach for very long memory cases, but are they best for typical tasks? Hessian-Free learning claims to outperform them and can also handle the specialized tasks.
Based on slides from Tony Martinez

Overcoming Vanishing/Exploding Gradients
Perform long, patient training with small learning rates.
Use better weight initialization (see the sketch below).
Normalize the outputs between layers.
Use parallel processing: GPUs, etc.
Usually there are more nodes in the internal layers than in the input layer; the layers need not be the same size.
ReLUs have a constant derivative, so it can be treated as 1 and scaled with the learning rate; once a unit saturates at 0, can it come back?
There is evidence that tanh and other activations are better than the sigmoid for avoiding early saturation of the top layers.
Based on slides from Tony Martinez
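As one illustration of the "better weight initialization" point, here is a minimal numpy sketch (an assumption-laden example, not the workshop's code) of Glorot/Xavier-style uniform initialization, which scales the initial weights by the layer fan-in and fan-out so that activations and gradients keep a reasonable magnitude.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier uniform initialization: variance approx. 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Initialize a 784-256-64-10 network (layer sizes chosen only for illustration).
layer_sizes = [784, 256, 64, 10]
weights = [glorot_uniform(m, n) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
print([round(W.std(), 4) for W in weights])  # spread shrinks as the layers widen
```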

Better Activation Functions
Rectified Linear Units (ReLU): f(x) = max(0, x).
More efficient gradient propagation: the derivative is either 0 or a constant, which can simply be folded into the learning rate.
More efficient computation: only comparison, addition, and multiplication are needed.
Sparse activation: for example, in a randomly initialized network only about 50% of hidden units are activated (have a non-zero output), and learning in the linear range is easier for most learning models.
Variations on the ReLU: the Leaky ReLU, f(x) = x if x > 0 else a*x with 0 ≤ a ≤ 1 (e.g. a = 0.01), so that the derivative is not 0 and some learning can happen for net < 0 (the unit does not "die"). There are many other variations.
The softplus function f(x) = ln(1 + e^x) (shown in green on the original slide) is a smooth version of the ReLU; its derivative is the sigmoid 1/(1 + e^-x).
For the regular ReLU, set the bias slightly positive so that units are active to start with.
Note that the ReLU is not everywhere differentiable. At 0, just choose the left or right derivative (0 or 1); this works fine empirically, though it is theoretically a bit bothersome. Since the net input is a real number, the probability of landing exactly on 0 is zero anyway, and if the computer produces exactly 0 it is probably just a representation issue: in a true system we would be slightly to the left or right. So why not do the same with the threshold function? Because its derivative would be 0 everywhere, with no learning.
Principle: networks are easier to optimize when their behavior is closer to linear. We are still learning why ReLUs work as well as they do. See the sketch below.
Based on slides from Tony Martinez
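A minimal numpy sketch (an illustration, not the workshop's code) of the activation functions discussed above and their derivatives:

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative is 0 for x < 0 and 1 for x > 0; at exactly 0 we pick 0."""
    return (x > 0).astype(float)

def leaky_relu(x, a=0.01):
    """Leaky ReLU: small slope a for x < 0 so the unit never 'dies'."""
    return np.where(x > 0, x, a * x)

def leaky_relu_grad(x, a=0.01):
    return np.where(x > 0, 1.0, a)

def softplus(x):
    """Softplus: ln(1 + e^x), a smooth ReLU; its derivative is the sigmoid."""
    return np.log1p(np.exp(x))

def softplus_grad(x):
    return 1.0 / (1.0 + np.exp(-x))   # the logistic sigmoid

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), softplus(x), sep="\n")
```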

Pre-training of Deep Learning Architectures
Hinton and Bengio (2007+): learning internal representations.
Layered networks of unsupervised auto-encoders efficiently develop hierarchies of features that capture regularities in their respective inputs.
Building blocks: auto-encoders and Restricted Boltzmann Machines.
Can be used to develop models for families of tasks.

Autoencoders using a BP ANN
Learn to predict their input at the output.
Learn an internal representation (encoding) of the image, i.e. feature detectors of the input.
Can be used to pre-train a classifier: the encoding is fed to a standard BP ANN to do classification. A minimal sketch follows below.
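A minimal numpy sketch (toy data and layer sizes are assumptions, not the workshop's code) of a single-hidden-layer autoencoder trained with backpropagation; the learned encoder could then initialize the first layer of a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 64))              # toy "images": 200 examples, 64 inputs
n_in, n_hid = X.shape[1], 16

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder and decoder weights (untied here, for simplicity).
W_enc = rng.normal(0, 0.1, (n_in, n_hid)); b_enc = np.zeros(n_hid)
W_dec = rng.normal(0, 0.1, (n_hid, n_in)); b_dec = np.zeros(n_in)

lr = 0.5
for epoch in range(200):
    H = sigmoid(X @ W_enc + b_enc)           # encoding (internal representation)
    R = sigmoid(H @ W_dec + b_dec)           # reconstruction of the input
    err = R - X                              # derivative of 0.5*||R - X||^2 w.r.t. R
    d_out = err * R * (1 - R)                # output deltas (times sigmoid derivative)
    d_hid = (d_out @ W_dec.T) * H * (1 - H)  # deltas backpropagated to the hidden layer
    W_dec -= lr * H.T @ d_out / len(X); b_dec -= lr * d_out.mean(axis=0)
    W_enc -= lr * X.T @ d_hid / len(X); b_enc -= lr * d_hid.mean(axis=0)

H = sigmoid(X @ W_enc + b_enc)
print("final reconstruction MSE:", np.mean((sigmoid(H @ W_dec + b_dec) - X) ** 2))
# (W_enc, b_enc) can now pre-train the first layer of a BP classifier.
```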

Deep Learning Architectures
Consider the problem of trying to classify hand-written digits (example images were shown on the slide).

Deep Learning Architectures
[Figure: a layered network for digit recognition. Images of digits 0-9 (28 x 28 pixels) feed 500 neurons of low-level features, then 500 neurons of higher-level features, topped by 2000 top-level artificial neurons connected to the digit labels.]
Neural network trained on 40,000 examples.
Learns to label / recognize images and to generate images from labels.
Probabilistic in nature. (Demo.)

Constrained Internal Representation
Weight sharing based on knowledge of the input space reduces the overall complexity of the model.
It is one of the key hints for using deep learning architectures. A small illustration follows below.
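A minimal numpy sketch (an illustration, not the workshop's code) of weight sharing: a small convolutional filter is reused at every position of the input, so the number of free parameters is tiny compared with a fully connected layer over the same input.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100)          # a 1-D input signal (e.g. a row of pixels)
kernel = rng.normal(size=5)  # 5 shared weights, reused at every position

# Convolution: the same 5 weights slide over every window of the input.
feature_map = np.convolve(x, kernel, mode="valid")   # 96 outputs from 5 parameters

# A fully connected layer producing the same 96 outputs would need 100 * 96 weights.
print("shared parameters:", kernel.size)                      # 5
print("dense-layer parameters:", x.size * feature_map.size)   # 9600
```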

Deep Learning Architectures
Andrew Ng's work on deep learning networks (ICML 2012).
Problem: learn to recognize human faces, cats, etc. from unlabeled data.
Dataset of 10 million images; each image has 200 x 200 pixels.
A 9-layered, locally connected neural network (1B connections).
Parallel algorithm; 1,000 machines (16,000 cores) for three days.
Building High-level Features Using Large Scale Unsupervised Learning. Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeffrey Dean, and Andrew Y. Ng. ICML 2012: 29th International Conference on Machine Learning, Edinburgh, Scotland, June 2012.

Deep Learning Architectures
Results: a face detector that is 81.7% accurate and robust to translation, scaling, and rotation.
Further results: 15.8% accuracy in recognizing 20,000 object categories from ImageNet, a 70% relative improvement over the previous state of the art.

Better Regularization of Free Parameters
The weights (free parameters) make up the ANN model. Beyond choosing the training examples, we can control the weights through:
Momentum
Weight decay
Sparsity
Dropout
Injected noise
A minimal sketch of two of these follows below.
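A minimal numpy sketch (an illustration, not the workshop's code) of two of these regularizers: inverted dropout applied to a hidden layer during training, and L2 weight decay added to the weight update.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_forward(X, W, drop_prob=0.5, training=True):
    """One ReLU hidden layer with inverted dropout."""
    H = np.maximum(0.0, X @ W)
    if training:
        mask = (rng.random(H.shape) > drop_prob) / (1.0 - drop_prob)
        H = H * mask          # randomly silence units, rescale the survivors
    return H

def sgd_step_with_weight_decay(W, grad, lr=0.01, decay=1e-4):
    """Weight decay shrinks every weight a little on each update."""
    return W - lr * (grad + decay * W)

X = rng.random((8, 20))
W = rng.normal(0, 0.1, (20, 10))
H = hidden_forward(X, W)                                              # training-time pass
W = sgd_step_with_weight_decay(W, grad=rng.normal(0, 0.01, W.shape))  # toy gradient
print(H.shape, W.shape)
```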

Softmax and Cross-Entropy
The sum-squared error (L2) loss gradient seeks the maximum likelihood hypothesis under the assumption that the training data can be modeled by Normally distributed noise added to the target function value. That is fine for regression but less natural for classification.
For classification problems it is advantageous, and increasingly popular, to use the softmax activation function, just at the output layer, with the cross-entropy loss function. Softmax "softens" the 1-of-n targets to mimic a probability vector over the outputs. Cross-entropy seeks the maximum likelihood hypothesis under the assumption that the observed (1-of-n) Boolean output is a probabilistic function of the input instance. Maximizing the likelihood is cast as the equivalent problem of minimizing the negative log likelihood.
With the new loss and activation functions we must recalculate the gradient equation. The gradient/error on the output is just (t - z), with no f'(net) term: the exponent of the softmax is unraveled by the ln of the cross-entropy. This helps avoid gradient saturation, and it is common in deep networks, with ReLU activations for the hidden nodes. Cross-entropy with a sigmoid output also has a (t - z) gradient, since the ln undoes the exponential of the sigmoid.
Softmax with the sum-squared loss instead gives a gradient of (t - z) z (d - z_target), where d is 1 if the node is the target node and 0 otherwise, and this saturates. The global cross-entropy loss is 0 for non-target nodes and -ln(z) for the target node: if z = 1 the loss is 0, and the loss grows as z approaches 0. A worked numerical check follows below.
Based on slides from Tony Martinez
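A minimal numpy sketch (an illustration added here, not from the slides) that checks numerically that the gradient of the cross-entropy loss with a softmax output, taken with respect to the output net values, is simply z - t (equivalently, the output error is t - z with no f'(net) factor):

```python
import numpy as np

def softmax(net):
    e = np.exp(net - net.max())       # shift for numerical stability
    return e / e.sum()

def cross_entropy(net, t):
    return -np.sum(t * np.log(softmax(net)))

net = np.array([2.0, -1.0, 0.5])      # output-layer net values (3 classes)
t = np.array([0.0, 1.0, 0.0])         # 1-of-n target

# Analytic gradient claimed on the slide: z - t.
analytic = softmax(net) - t

# Numerical gradient by central differences.
eps, numeric = 1e-6, np.zeros_like(net)
for i in range(len(net)):
    d = np.zeros_like(net); d[i] = eps
    numeric[i] = (cross_entropy(net + d, t) - cross_entropy(net - d, t)) / (2 * eps)

print(analytic)   # approximately [ 0.786 -0.961  0.175]
print(numeric)    # matches the analytic z - t to high precision
```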

TUTORIAL 4
Demonstrate how the addition of hidden layers makes the development of a BP network more challenging (Python code). A small sketch of this kind of experiment follows below.
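The workshop's actual tutorial code is not included in this transcript; the following is a hedged sketch of the kind of experiment it describes, using scikit-learn's MLPClassifier (an assumption, not the original tutorial) to compare a shallow sigmoid network against progressively deeper ones on the digits dataset.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X / 16.0, y, random_state=0)

# Same width per layer; only the number of hidden layers changes.
for depth in (1, 2, 4, 6):
    net = MLPClassifier(hidden_layer_sizes=(32,) * depth,
                        activation="logistic",   # sigmoid hidden units, as in classic BP
                        solver="sgd", learning_rate_init=0.1,
                        max_iter=300, random_state=0)
    net.fit(X_train, y_train)
    print(f"{depth} hidden layer(s): test accuracy = {net.score(X_test, y_test):.3f}")

# With sigmoid units and plain SGD, the deeper networks typically train more
# slowly and may score worse, illustrating the vanishing-gradient issue.
```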