Deep Learning. Some slides are from Prof. Andrew Ng of Stanford.
Training set: feature extraction problem
Object detection
Raw image
Convolution: a 3x3 or 5x5 filter slides over the raw image to produce a feature map
Activation map or feature map
Learning filters A convolutional neural network learns the values of these filters on its own during the training process. The practitioner specifies parameters such as the number of filters, the filter size, the architecture of the network, etc. The number of filters is called the depth. The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
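To make the mechanics concrete, here is a minimal sketch of a single 3x3 convolution producing a feature map, in plain NumPy (the 6x6 image and the vertical-edge filter values are illustrative assumptions, not from the slides):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and
    sum the element-wise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(6, 6)               # stand-in for a raw image
edge_filter = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])       # a vertical-edge detector
feature_map = convolve2d(image, edge_filter)   # shape (4, 4)
```

In a real CNN these filter values are not hand-designed; they are exactly the weights the network learns during training.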
Subsampling, down-sampling, or pooling. Pooling achieves dimensionality reduction. Stride is the number of pixels by which we slide our filter matrix over the input matrix.
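A minimal sketch of 2x2 max pooling with stride 2, under the same illustrative assumptions (window size, stride, and input size are examples, not prescribed by the slides):

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window, moving by `stride` pixels."""
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            r, c = i * stride, j * stride
            out[i, j] = feature_map[r:r+size, c:c+size].max()
    return out

fm = np.random.rand(4, 4)
print(max_pool(fm))   # a 4x4 map reduced to 2x2: dimensionality reduction
```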
Convolutional Neural Networks
Advantages of CNN Character recognition, natural images. Finds edges, corners, endpoints, and local 2-D structures. Translation invariance. Convolution and sub-sampling layers are interleaved. Sub-sampling smooths the data. The exact position of a detected feature is not important, but the relative positions can be, and can be captured by later layers. 5x5, 4x4 common.
Computer vision: Identify coffee mug
Why is computer vision hard?
Learning from tagged data (supervised)
Why deep learning? Deep learning uses a neural network with several layers. The sequence of layers identifies features in stages, much as our brains seem to. On image and speech data, it often performs better than other methods.
Building huge neural networks
AlexNet In the 2012 ImageNet computer image recognition competition, Alex Krizhevsky of the University of Toronto won with a network of 5 convolutional layers, 60 million parameters, and 650,000 neurons, trained on 1 million training images. Trained on two NVIDIA GPUs for a week. Used hidden-unit dropout to reduce overfitting.
DNN 2015 Using deep learning, Google and Microsoft both beat the best human score in the ImageNet challenge. Microsoft and the University of Science and Technology of China announced a DNN that achieved IQ test scores at the college post-graduate level. Baidu announced that a deep learning system called Deep Speech 2 had learned both English and Mandarin. Deep learning had achieved superhuman levels of perception for the challenge.
Deep Learning Overview Train networks with many layers (vs. shallow nets with just a couple of layers). Multiple layers work to build an improved feature space: the first layer learns 1st-order features (e.g. edges), and the 2nd layer learns higher-order features (combinations of first-layer features, combinations of edges, etc.). In current models, layers often learn in an unsupervised mode and discover general features of the input space, serving multiple tasks related to the unsupervised instances (image recognition, etc.). The final layer's features are then fed into supervised layer(s), and the entire network is often subsequently tuned using supervised training, starting from the initial weightings learned in the unsupervised phase. Fully supervised versions are also possible (early backpropagation attempts).
Learning from tagged data
AI will transform the internet
Deep network training We have always had good algorithms for learning the weights in networks with one hidden layer, but these algorithms are not good at learning the weights for networks with many hidden layers. What's new: algorithms for training many-layer networks.
Handwritten digits
What is this unit doing?
Hidden layer units become self-organised feature detectors [figure: the weights from input pixels 1–25 into one hidden unit; strong +ve weight vs. low/zero weight]
What does this unit detect? It will send a strong signal for a horizontal line in the top row, ignoring everywhere else.
What does this unit detect? A strong signal for a dark area in the top-left corner.
What features might you expect a good NN to learn, when trained with data like this?
Vertical lines
Horizontal lines
Small circles
But what about position invariance? Our example unit detectors were tied to specific parts of the image.
Successive layers can learn higher-level features. 1st layer: detect lines in specific positions. 2nd layer: horizontal line, vertical line, upper loop, etc.
Layers in the brain
New way to train MLP
Train this layer first, then this layer, then this layer, then this layer, and finally this layer (one layer at a time).
EACH of the (non-output) layers is trained to be an auto-encoder. Basically, it is forced to learn good features that describe what comes from the previous layer
Auto-encoding Unsupervised training: input = output, an identity mapping. By making this happen with fewer hidden units than inputs, the hidden layer units are forced to become good feature detectors. The Restricted Boltzmann Machine is an example of an auto-encoder.
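A minimal sketch of such an auto-encoder, assuming TensorFlow/Keras 2.x; the 784-unit input (a flattened 28x28 digit image) and the 32-unit bottleneck are illustrative choices, not from the slides:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

input_dim = 784    # e.g. a flattened 28x28 image
bottleneck = 32    # fewer units than the input forces feature learning

inputs = layers.Input(shape=(input_dim,))
code = layers.Dense(bottleneck, activation="relu")(inputs)      # encoder: compress
recon = layers.Dense(input_dim, activation="sigmoid")(code)     # decoder: reconstruct
autoencoder = models.Model(inputs, recon)

# Unsupervised training: the input is also the target (identity mapping).
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
```

Because the bottleneck is smaller than the input, reconstructing the input forces the hidden units to learn compact, informative features.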
Deep auto-encoding A deep auto-encoder often performs dimensionality reduction better than principal component analysis.
Stacked Auto-Encoders Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise training. Drop the decoder output layer each time.
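One way the greedy layer-wise procedure could look in code (a sketch: the `train_autoencoder` helper, the layer sizes, the random stand-in data, and the training settings are all assumptions for illustration):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

def train_autoencoder(data, n_hidden, epochs=5):
    """Train one auto-encoder on `data`; return only its encoder half."""
    n_in = data.shape[1]
    inputs = layers.Input(shape=(n_in,))
    code = layers.Dense(n_hidden, activation="relu")(inputs)
    recon = layers.Dense(n_in, activation="sigmoid")(code)
    ae = models.Model(inputs, recon)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(data, data, epochs=epochs, batch_size=256, verbose=0)
    return models.Model(inputs, code)   # drop the decoder output layer

x = np.random.rand(1000, 784).astype("float32")  # stand-in for real inputs
encoders = []
for size in (256, 64, 32):          # stack three auto-encoders, greedily
    enc = train_autoencoder(x, size)
    encoders.append(enc)
    x = enc.predict(x, verbose=0)   # this layer's codes feed the next layer
```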
Face recognition
Dropout – Overfit avoidance Very common with current deep networks. For each training instance, drop a node (hidden or input) and its connections with probability p and train; the final net just has all the weights averaged (actually scaled by 1 - p). It is as if we were ensembling 2^n different network substructures, so we won't overfit one particular network structure; this forces regularization. Variants: *Dropconnect – randomly drop connections. Shakeout – instead of randomly discarding units as Dropout does at the training stage, it randomly chooses to enhance or reverse the contributions of each unit to the next layer. Others – Dropin, Standout, etc.
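A sketch of the basic dropout mechanism in NumPy, matching the description above (the function names, p, and the activation shapes are illustrative):

```python
import numpy as np

def dropout_train(h, p=0.5):
    """Training: drop each unit (and its connections) with probability p."""
    mask = np.random.rand(*h.shape) >= p
    return h * mask

def dropout_test(h, p=0.5):
    """Test: use all units, scaled by (1 - p), approximating the average
    over the 2^n trained substructures."""
    return h * (1.0 - p)

h = np.random.rand(4, 8)    # a batch of hidden activations
print(dropout_train(h))     # roughly half the units are zeroed
print(dropout_test(h))      # all units kept, scaled down
```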
Weaknesses of CNN Plain nets (stacking 3x3 convolution layers): a 56-layer net has higher training error and test error than a 20-layer net.
Google’s Artificial Brain A neural network of 16,000 computer processors with one billion connections and 20,000 output neurons was trained on 10 million randomly selected YouTube video thumbnails over the course of three days. Results: 81.7% accuracy in detecting human faces, 76.7% accuracy when identifying human body parts, 74.8% accuracy when identifying cats, and 15.8% accuracy in recognizing 20,000 object categories.
Residual Network A residual is the difference between an original image and a changed image. By preserving the base information, the residual of a network can treat perturbation.
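For reference, a minimal sketch of the standard ResNet-style residual block with an identity shortcut, assuming TensorFlow/Keras; the filter count and input shape are illustrative, and the input is assumed to already have `filters` channels so the addition is valid:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Output y = F(x) + x: the convolutions only learn the residual F,
    while the shortcut preserves the base information."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.Add()([y, shortcut])     # add the input back in
    return layers.Activation("relu")(y)

inputs = tf.keras.Input(shape=(32, 32, 64))  # input already has 64 channels
outputs = residual_block(inputs)
```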
Residual Network Deeper ResNets have lower training error
Results Deep ResNets can be trained without difficulties. Deeper ResNets have lower training error, and also lower test error.
Results 1st place in all five main tracks of the ILSVRC & COCO 2015 competitions: ImageNet Classification, ImageNet Detection, ImageNet Localization, COCO Detection, COCO Segmentation.
Deep net tools
user interface (UI)
Google’s TensorFlow Nodes represent operations. Edges represent the flow of data. Data are tensors: a tensor of rank n is represented by an n-dimensional array. TensorFlow is the flow of arrays in a computational graph.
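A tiny example of tensors flowing through an operation node, assuming TensorFlow 2.x with eager execution (the values are illustrative):

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])   # rank-2 tensor: a 2-D array
b = tf.constant([[1.0], [1.0]])             # rank-2 tensor, shape (2, 1)
c = tf.matmul(a, b)     # a node (operation); the edges carry the tensors
print(c.numpy())        # [[3.], [7.]]
```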
Deep learning libraries
Object detection
Summary Residual nets can be trained to a depth of 200 layers. Deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer fashion.