Supervised Training of Deep Networks


Supervised Training of Deep Networks
- Early layers of an MLP do not get trained well
  - Diffusion of gradient: the error signal attenuates as it propagates back to earlier layers (a small demonstration is sketched below)
  - Leads to very slow training
  - Exacerbated because the top couple of layers can usually learn any task "pretty well", so the error reaching earlier layers drops quickly as the top layers "mostly" solve the task; the lower layers never get the chance to use their capacity to improve results and are left computing an essentially random feature map
  - Need a way for early layers to do effective work
- Often not enough labeled data is available, while there may be lots of unlabeled data
  - Can we use unsupervised/semi-supervised approaches to take advantage of the unlabeled data?
- Deep networks tend to have more trouble with local minima than shallow networks during supervised training
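
As an aside not on the original slide, the gradient-diffusion effect is easy to see in a toy experiment. The sketch below (assuming PyTorch; the layer sizes and data are arbitrary illustrative choices) builds a deep stack of sigmoid layers and prints the weight-gradient norm per layer, which typically shrinks sharply toward the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A deep stack of small sigmoid layers (sizes chosen purely for illustration).
layers = [nn.Sequential(nn.Linear(32, 32), nn.Sigmoid()) for _ in range(10)]
net = nn.Sequential(*layers)

x = torch.randn(64, 32)
target = torch.randn(64, 32)
loss = nn.functional.mse_loss(net(x), target)
loss.backward()

# Gradient magnitude per layer: typically much smaller near the input (layer 0)
# than near the output, which is why the early layers learn so slowly.
for i, layer in enumerate(layers):
    grad_norm = layer[0].weight.grad.norm().item()
    print(f"layer {i:2d}: weight grad norm = {grad_norm:.2e}")
```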

Deep Learning Overview
- Multiple layers build an improved feature space
  - The first layer learns first-order features (e.g., edges)
  - The second layer learns higher-order features (combinations of first-layer features, e.g., combinations of edges)
- In current models, layers often learn in an unsupervised mode and discover general features of the input space, which can serve multiple tasks over the same instances (image recognition, etc.)
- The final-layer features are then fed into a supervised model
- The entire network is often subsequently tuned with supervised training of the whole net, starting from the weights learned in the unsupervised phase

Greedy Layer-Wise Training
1. Train the first layer using the data without labels (unsupervised)
   - Since there are no targets at this level, labels don't help; we can also use the more abundant unlabeled data that is not part of the training set
2. Freeze the first layer's parameters and train the second layer, using the output of the first layer as the unsupervised input to the second layer
3. Repeat for as many layers as desired
   - This builds our set of robust features
4. Use the outputs of the final layer as inputs to a supervised layer/model and train the last supervised layer(s), leaving the earlier weights frozen
5. Unfreeze all weights and fine-tune the full network with supervised training, starting from the pre-trained weight settings
- If we trained the whole network in a supervised fashion from the start, we might not get the benefit of building up this incrementally abstracted feature space
- The full recipe is sketched in code below
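
A minimal, self-contained sketch of this recipe, assuming PyTorch and one-hidden-layer auto-encoders as the per-layer unsupervised learners; the data, layer sizes, and optimizer settings are illustrative choices, not taken from the slides.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: 100-dim inputs, two hidden layers, 10 classes.
layer_sizes = [(100, 64), (64, 32)]
unlabeled = torch.randn(512, 100)
labeled_x, labeled_y = torch.randn(256, 100), torch.randint(0, 10, (256,))

encoders = []
inputs = unlabeled
for d_in, d_hid in layer_sizes:
    # One auto-encoder per layer: encode, decode, train to reconstruct its input.
    encoder = nn.Sequential(nn.Linear(d_in, d_hid), nn.Sigmoid())
    decoder = nn.Linear(d_hid, d_in)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    for _ in range(200):                      # unsupervised: no labels used
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(inputs)), inputs)
        loss.backward()
        opt.step()
    for p in encoder.parameters():            # freeze this layer, drop its decoder
        p.requires_grad = False
    inputs = encoder(inputs).detach()         # its outputs feed the next layer
    encoders.append(encoder)

# Supervised output layer on top of the frozen, pre-trained feature layers.
head = nn.Linear(layer_sizes[-1][1], 10)
stack = nn.Sequential(*encoders, head)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(stack(labeled_x), labeled_y)
    loss.backward()
    opt.step()

# Finally unfreeze everything and fine-tune the whole network with supervision.
for p in stack.parameters():
    p.requires_grad = True
opt = torch.optim.Adam(stack.parameters(), lr=1e-4)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(stack(labeled_x), labeled_y)
    loss.backward()
    opt.step()
```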

Deep Net with Greedy Layer-Wise Training
[Diagram: original inputs feed an unsupervised learner that builds a new feature space; the new features (optionally together with the original inputs) feed a supervised ML model]

Greedy Layer-Wise Training
- Avoids many of the problems of trying to train a deep net in a supervised fashion
  - Each layer gets full learning focus in its turn, since it is the only current "top" layer
  - Can take advantage of the unlabeled data
  - When you finally tune the entire network with supervised training, the weights have already been adjusted into a good error basin and just need fine-tuning
- This helps with the problems of
  - Ineffective early-layer learning
  - Deep-network local minima

(Stacked) Auto-Encoders (I)
- A type of unsupervised learning which tries to discover generic features of the data
- Learns the identity function by learning important sub-features (not by just passing the data through)
  - Useful for compression, etc.
- Can use just the new features as the new training set, or concatenate them with the original inputs (see the sketch below)
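
A minimal single auto-encoder sketch (again assuming PyTorch; the data and layer sizes are made up): it is trained only to reconstruct its own input, and the bottleneck forces it to learn compact sub-features rather than simply passing the data through.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.randn(256, 20)                    # unlabeled data only

encoder = nn.Sequential(nn.Linear(20, 8), nn.Sigmoid())   # 8-unit bottleneck
decoder = nn.Linear(8, 20)

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
for step in range(500):
    opt.zero_grad()
    reconstruction = decoder(encoder(data))    # the target is the input itself
    loss = nn.functional.mse_loss(reconstruction, data)
    loss.backward()
    opt.step()

features = encoder(data).detach()              # learned generic features
# Either use `features` alone as the new training set,
# or concatenate them with the original inputs:
combined = torch.cat([data, features], dim=1)
print(features.shape, combined.shape)          # torch.Size([256, 8]) torch.Size([256, 28])
```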

(Stacked) Auto-Encoders (II)
- Once the features have been learned:
  - Remove the (decoder) output layer
  - Train output weights using the inputs and the learned features

Stacked Auto-Encoders (III)
- Stack many auto-encoders in succession and train them using greedy layer-wise training (exactly the recipe sketched after the Greedy Layer-Wise Training slide)
- Drop the decode (output) layer each time

Stacked Auto-Encoders (IV)
- Do supervised training on the last layer, using the final features
- Then do supervised training on the entire network to fine-tune all weights
- The slide's figure shows a softmax output layer (recapped below), but a standard BP output layer or any other variation could be used
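
As a quick reminder of what the softmax output layer on the slide computes (the logit values below are made up), each class score is exponentiated and normalized to give a probability distribution over classes.

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])             # raw scores from the last layer
exp = np.exp(logits - logits.max())            # subtract the max for numerical stability
probs = exp / exp.sum()

print(probs)          # approx [0.659, 0.242, 0.099]
print(probs.sum())    # 1.0 -- a proper probability distribution over the classes
```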

Deep Learning Tasks
- Usually best when the input space is locally structured, spatial or temporal (images, language, etc.), as opposed to arbitrary input features
- Images example: view of the vision layer
  - Each square in the figure shows the input image that maximally activates one of 100 hidden units (auto-encoder trained on 10x10 images): the units learn to detect edges
(Source: Adobe – Deep Learning and Active Learning)
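
For a linear hidden unit, the norm-bounded input that maximally activates it is simply its normalized weight vector, which is essentially how visualizations like the one described here are produced. A tiny sketch follows (NumPy assumed, with random weights standing in for a trained auto-encoder).

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(100, 10 * 10))   # 100 hidden units, each over a 10x10 image patch

# The unit-norm input that maximizes w.x is x = w / ||w||.
max_activating = W / np.linalg.norm(W, axis=1, keepdims=True)
tiles = max_activating.reshape(100, 10, 10)   # one 10x10 "image" per hidden unit

print(tiles.shape)   # (100, 10, 10) -- the 100 squares shown in the slide's figure
```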

Convolutional Neural Networks (CNNs) (I)
- Networks built specifically for problems with low-dimensional (e.g., 2-D) local structure
  - Character recognition: neighboring pixels have high correlations and form local features (edges, corners, etc.), while distant pixels (features) are uncorrelated
  - Natural images are stationary, meaning the statistics of one part of the image are the same as those of any other part
- Not well suited to general learning with abstract features that have no prescribed ordering

CNNs (II)
- Use three basic ideas:
  - Local receptive fields
  - Shared weights
  - Pooling layers
- We look at each in turn

Local Receptive Fields
- In a standard NN, nodes take input from all nodes in the previous layer
- In CNNs, nodes take input only from a small set of nodes/features in the previous layer that are spatially or temporally close to each other, called a local receptive field (e.g., 3x3, 5x5)
- The local receptive field slides across the entire input image; for each position of the receptive field there is a different hidden neuron in the next layer
  - E.g., a 28x28 input image with 5x5 local receptive fields -> the next layer has 24x24 neurons (shape check below)
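
A quick way to verify the 28x28 -> 24x24 claim is to push a dummy image through a single 5x5 convolution with stride 1 and no padding (PyTorch assumed; any equivalent library would do).

```python
import torch
import torch.nn as nn

# One feature map, 5x5 local receptive field, stride 1, no padding.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5)

image = torch.randn(1, 1, 28, 28)          # a batch of one 28x28 grayscale image
hidden = conv(image)

print(hidden.shape)                        # torch.Size([1, 1, 24, 24])
# In general: output_size = input_size - kernel_size + 1 = 28 - 5 + 1 = 24
```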

Shared Weights
- Instead of each node in the next layer having its own individual weights and bias (e.g., 5x5 weights and 1 bias), all of the hidden nodes in a feature map share the same weights and bias
- E.g., suppose the weights and bias are such that a hidden neuron picks out a vertical edge in its particular local receptive field; that ability is likely useful elsewhere in the image, so it makes sense to apply the same feature detector everywhere (translation invariance); see the sketch below
- The map from the input to such a hidden layer is a feature map
- In practice there is more than one feature map, each defined by its own set of shared weights and bias
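
The vertical-edge example can be made concrete: one shared 3x3 filter (hand-set here for illustration rather than learned) responds to the edge wherever it appears in the image (PyTorch assumed).

```python
import torch
import torch.nn as nn

# One feature map whose shared 3x3 filter is a hand-set vertical-edge detector.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
edge_filter = torch.tensor([[-1.0, 0.0, 1.0],
                            [-1.0, 0.0, 1.0],
                            [-1.0, 0.0, 1.0]])
with torch.no_grad():
    conv.weight.copy_(edge_filter.view(1, 1, 3, 3))

# An image that is dark on the left half and bright on the right half:
image = torch.zeros(1, 1, 8, 8)
image[..., :, 4:] = 1.0

response = conv(image)
# Because the same weights are used at every position, the dark-to-bright edge is
# detected wherever it falls under the filter: strong responses (value 3) appear
# in the output columns spanning the boundary, zeros elsewhere.
print(response[0, 0])
```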

Pooling Layers
- Generally used right after convolutional layers, and applied to each feature map
- Each unit in a pooling layer summarizes a region (e.g., 2x2) of the feature map, creating a condensed feature map
- Typical choice: max-pooling (output the maximum value in the region); a small example follows
- Pooling checks whether a given feature is found anywhere in a region of the image; the exact location matters less than the rough location relative to other features, and there are many fewer pooled features/parameters for later layers
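
A small max-pooling check (PyTorch assumed, with a made-up 4x4 feature map): a 2x2 max-pooling layer keeps the strongest response in each 2x2 region and halves each spatial dimension.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2)     # 2x2 regions, stride 2

feature_map = torch.tensor([[[[1.0, 2.0, 0.0, 1.0],
                              [3.0, 4.0, 1.0, 0.0],
                              [0.0, 1.0, 5.0, 6.0],
                              [1.0, 2.0, 7.0, 8.0]]]])

condensed = pool(feature_map)
print(condensed)
# tensor([[[[4., 1.],
#           [2., 8.]]]])
# Each output is the maximum of one 2x2 region: the feature's rough location
# is kept, while its exact position within the region is discarded.
```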

CNNs (III)
- The final layer of connections in the network is a fully-connected layer
- The output node computes a sigmoid
- Convolution and pooling layers are often interleaved (a minimal end-to-end example follows)
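
Putting the pieces together, here is a minimal network in the spirit of this slide (PyTorch assumed; the layer sizes are illustrative, not from the slides): one convolution, one max-pooling layer, then a fully-connected layer with a sigmoid output node.

```python
import torch
import torch.nn as nn

# conv -> pool -> fully connected, matching the structure described above.
cnn = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5),   # 28x28 -> 20 feature maps of 24x24
    nn.Sigmoid(),
    nn.MaxPool2d(2),                   # 24x24 -> 12x12 per feature map
    nn.Flatten(),                      # 20 * 12 * 12 = 2880 features
    nn.Linear(20 * 12 * 12, 1),        # final fully-connected layer
    nn.Sigmoid(),                      # sigmoid output node
)

out = cnn(torch.randn(4, 1, 28, 28))   # a batch of four 28x28 images
print(out.shape)                       # torch.Size([4, 1])
```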

CNNs (IV)
- C layers are convolutions, S layers pool/sub-sample
- Often starts with fairly raw features at the initial input and lets the CNN discover an improved feature layer for the final supervised learner, e.g., an MLP trained with BP

CNN Training
- Trained with BP, but with weight tying within each feature map
  - Initial weights are randomized throughout the entire network
  - The weight updates are simply averaged over the tied weights in the feature-map layers
- Convolution layer
  - Each feature map has one weight for each input in its receptive field plus one bias
  - Thus a feature map with a 5x5 receptive field has a total of 26 weights, which are the same coming into each node of the feature map
  - If a convolution layer has 10 feature maps, there are only 260 unique weights to train in that layer (far fewer than in an arbitrary deep-net layer without sharing); see the check below
- Pooling/sub-sampling layer
  - For each pooling node, all elements of its receptive field are maxed, averaged, or summed; the result is multiplied by one trainable weight, a bias is added, and the value is then squashed
  - If a layer has 10 pooling feature maps, there are only 20 unique weights to train
- While all weights are trained, the structure of the CNN is currently usually hand-crafted by trial and error
  - Number of total layers, number of receptive fields, size of receptive fields, size of sub-sampling (pooling) fields, which fields of the previous layer to connect to
- Typically the size of the feature maps decreases and the number of feature maps increases in later layers
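
A quick check of the slide's weight-count arithmetic, with PyTorch used only to confirm the total for a 10-feature-map convolution layer over a single input channel.

```python
import torch.nn as nn

receptive_field = 5 * 5               # 25 shared weights per feature map
per_map = receptive_field + 1         # plus one bias -> 26 trainable values per map
n_maps = 10
print(per_map, n_maps * per_map)      # 26 260

# Confirm with an actual convolution layer: 10 feature maps, one input channel.
conv = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5)
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)                       # 260
```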

Deep Learning Summary
- Other approaches: Deep Belief Networks, LSTM networks, "Google Brain", Sum-Product Networks
- Good in structured/Markovian spaces
  - Problems with strong local spatial/temporal correlation: vision, speech, audio, etc.
- Important research question: to what extent can we use deep learning in more arbitrary feature spaces?
- Potential for significant improvements