Data Mining, Neural Network and Genetic Programming Deep Learning and Transfer Learning for Object Recognition Yi Mei Yi.mei@ecs.vuw.ac.nz
Outline Object Recognition Deep Learning Overview Recipes of deep learning Automated ANN Architecture Design Transfer Learning DLID DeCAF DAN
Object Recognition Object recognition usually refers to object classification, but sometimes it refers to the whole procedure of finding objects in large pictures. Object detection also has other meanings/interpretations. Object Recognition: Object Classification; Object Localization; One-class object detection; Multi-class object detection
Methods for Object Detection Neural Methods and Genetic Methods (Genetic Algorithms, Genetic Programming); each is used either for classification only or for classification and localisation. Neural methods include Feed Forward Networks, Shared Weight Networks, SOMs, High Order Networks, Deep Learning, …
Deep Learning Machine learning algorithms based on learning multiple levels of representation / abstraction. Fig: I. Goodfellow
Deep Learning Has been successful in many areas: object recognition, object detection, speech recognition, natural language processing, …
Deep Learning LeNet: 7 layers [LeCun et al. 1998] Subsampling (Max-pooling)
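To make the layer structure concrete, here is a minimal sketch of a LeNet-style network in PyTorch; the channel sizes follow the classic LeNet-5 description, while the use of ReLU and max-pooling (rather than the original tanh and average subsampling) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """A LeNet-style CNN: conv -> subsample -> conv -> subsample -> fully connected."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # subsampling: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 6x14x14 -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # subsampling: 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = LeNetStyle()(torch.randn(4, 1, 32, 32))  # batch of 4 grey-scale 32x32 images -> (4, 10)
```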
Deep Learning Subsampling will not change the object (the class label is preserved), but over-subsampling throws away too much information
Deep Learning AlexNet: 8 layers [Krizhevsky et al. 2012] Similar to LeNet but Bigger model (7 hidden layers, 650k units, 60M params) Error: 15.315% for ImageNet 2012 challenge (No. 1) image-net.org
Deep Learning VGGNet: 19 layers [Simonyan and Zisserman 2014] Error: 7.32% for ImageNet 2014 challenge
Deep Learning GoogLeNet: 22 layers [Szegedy et al. 2014] Error: 6.67% for ImageNet 2014 challenge (No. 1)
Deep Learning ResNet: 152 layers [He et al. 2015] Error: 3.57% for ImageNet 2015 challenge (No. 1)
Automated CNN Architecture Design Manually designing a CNN architecture requires a lot of domain knowledge and trial-and-error. Use genetic programming to automatically evolve an architecture
Automated CNN Architecture Design Cartesian GP
Automated CNN Architecture Design Functions (Operators): ConvBlock (stride 1 with padding, keeps input size); ResBlock (stride 1 with padding, keeps input size); Max (average) pooling (2x2 filter, stride 2); Summation (element-wise addition). Constraint on the architecture: only sum two feature maps with the same size
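To illustrate how Cartesian GP can represent such an architecture, here is a minimal sketch; the node format, function names and the active-node decoding are illustrative assumptions, not the exact encoding used in the cited work.

```python
# Each node: (function, input_node_1, input_node_2); a node is inactive
# if no path connects it to the output node.
FUNCTIONS = ["ConvBlock32", "ConvBlock64", "ResBlock32", "MaxPool", "AvgPool", "Sum"]

genotype = [
    ("ConvBlock32", 0, 0),  # node 1: input is node 0 (the image)
    ("MaxPool",     1, 1),  # node 2
    ("ResBlock32",  2, 2),  # node 3
    ("ConvBlock64", 1, 1),  # node 4 (inactive: not used by the output)
    ("Sum",         2, 3),  # node 5: element-wise sum of two same-size maps
]
output_node = 5

def active_nodes(genotype, output_node):
    """Trace back from the output to find which nodes are actually used."""
    active, stack = set(), [output_node]
    while stack:
        i = stack.pop()
        if i == 0 or i in active:
            continue
        active.add(i)
        _, a, b = genotype[i - 1]
        stack.extend([a, b])
    return sorted(active)

print(active_nodes(genotype, output_node))  # -> [1, 2, 3, 5]
```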
Automated CNN Architecture Design Results ConvSet much better than VGG ResSet much better than ResNet
Automated CNN Architecture Design Using ResSet can evolve a much simpler architecture
Why is Deep Learning Hard? The vanishing gradient problem
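A sketch of the standard argument, assuming sigmoid activations and, schematically, a chain with one unit per layer: the sigmoid derivative is at most 1/4, and backpropagation multiplies one such factor (times a weight) per layer, so the gradient reaching the early layers shrinks roughly geometrically with depth.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) \le \tfrac{1}{4}

\frac{\partial E}{\partial w^{(1)}}
\;\propto\; \prod_{\ell=2}^{L} w^{(\ell)}\,\sigma'\bigl(z^{(\ell)}\bigr)
\quad\Longrightarrow\quad
\Bigl|\frac{\partial E}{\partial w^{(1)}}\Bigr| \lesssim \Bigl(\tfrac{1}{4}\,\lvert w\rvert_{\max}\Bigr)^{L-1}
```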
Recipes of Deep Learning Mini-batch (online learning) Proper loss function: Cross entropy New activation function: ReLU Adaptive learning rate Regularisation (Weight decay) Dropout Data augmentation
Mini-batch Offline (full-batch) learning updates the weights once after using all the training examples in one epoch. Online learning updates the weights after each training example. Mini-batch splits the training examples into a number of batches and updates the weights after each batch (e.g. 100 examples in a mini-batch)
Mini-batch The objective function (error) changes from one mini-batch to another, so we are not minimising the real (full training) error, yet it can still give better performance in practice (e.g. on cross-validation/test data)
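A minimal sketch of the three update schemes, assuming a toy linear model and squared error purely for illustration:

```python
import numpy as np

def grad(w, X, y):
    """Gradient of mean squared error for a linear model (placeholder example)."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, eta, batch_size = np.zeros(5), 0.01, 100

for epoch in range(10):
    idx = rng.permutation(len(y))           # shuffle each epoch
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]   # one mini-batch (e.g. 100 examples)
        w -= eta * grad(w, X[b], y[b])      # update after each mini-batch

# batch_size = 1      -> online learning
# batch_size = len(y) -> offline (full-batch) learning
```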
Proper Loss Function For a 2-layer network (1 hidden layer), compare the error surfaces of cross entropy and square error (figure)
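For reference, the two losses being compared, for network output y-hat and one-hot target y over C classes; with a softmax output, the cross-entropy gradient avoids the extra sigmoid-derivative factor that flattens the square-error surface far from the target:

```latex
E_{\text{sq}} = \sum_{c=1}^{C} \bigl(\hat{y}_c - y_c\bigr)^2,
\qquad
E_{\text{ce}} = -\sum_{c=1}^{C} y_c \log \hat{y}_c
```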
Rectified Linear Unit (ReLU) Fast to compute; fixed gradient (1) when the input is positive; no learning (gradient 0) when the input is negative
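The unit and its gradient, for reference:

```latex
a = \operatorname{ReLU}(z) = \max(0, z),
\qquad
\frac{\partial a}{\partial z} =
\begin{cases}
1, & z > 0\\
0, & z < 0
\end{cases}
```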
Adaptive Learning Rate NN performance heavily relies on the learning rate (step size) Even with the correct direction (gradient), it is unknown how far we should go along that direction
Adaptive Learning Rate Decrease learning rate over time At the beginning, we are far from the destination, so we use larger learning rate After several epochs, we are close to the destination, so we reduce the learning rate Adagrad Smaller derivatives, larger learning rate, and vice versa
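The standard Adagrad update that implements this per-parameter adaptation (epsilon is a small constant for numerical stability; this is the textbook form, given here for reference):

```latex
w_{t+1} = w_t - \frac{\eta}{\sqrt{\sum_{\tau=0}^{t} g_\tau^{2}} + \epsilon}\, g_t,
\qquad
g_t = \left.\frac{\partial E}{\partial w}\right|_{w = w_t}
```

Each parameter accumulates its own sum of squared gradients, so parameters with consistently small derivatives get a relatively larger effective learning rate, and vice versa.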
Regularisation L2 (weight decay): prevents weights from growing too large; pushes more weights towards zero so they can effectively be ignored (each update multiplies the weights by a factor slightly below 1, e.g. 0.99)
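The L2-regularised objective and the resulting gradient-descent update, which is where a decay factor such as 0.99 comes from (with learning rate eta and regularisation strength lambda; e.g. eta*lambda = 0.01 gives the factor 0.99):

```latex
E'(w) = E(w) + \frac{\lambda}{2}\lVert w \rVert^{2}
\quad\Longrightarrow\quad
w_{t+1} = (1 - \eta\lambda)\, w_t - \eta\, \frac{\partial E}{\partial w}
```

Weights that receive little gradient signal therefore decay towards zero over many updates.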
Dropout Each time before updating the weights, each neuron has probability p% of being dropped out, so the network structure changes; for each mini-batch, we resample the dropped-out neurons
Dropout This effectively trains a set (ensemble) of different thinned networks; ideally we would test by averaging the outputs y1, y2, y3, y4, … of all trained networks, but we cannot actually compute them all
Dropout Approximation: test with the full network; if a neuron has probability p% of being dropped out, then at test time its weights are multiplied by (1 - p%)
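A minimal NumPy sketch of this train/test asymmetry; the layer shape, ReLU activation and dropout rate are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5  # dropout probability for each neuron

def forward_train(a, W):
    """Training: each input neuron is dropped with probability p,
    and the dropout mask is resampled for every mini-batch."""
    mask = rng.random(a.shape) >= p
    return np.maximum(0, (a * mask) @ W)

def forward_test(a, W):
    """Testing: keep all neurons but scale their outgoing weights by (1 - p),
    approximating the average over the ensemble of thinned networks."""
    return np.maximum(0, a @ (W * (1 - p)))

a = rng.normal(size=(4, 8))   # a mini-batch of 4 activation vectors
W = rng.normal(size=(8, 3))
print(forward_train(a, W).shape, forward_test(a, W).shape)  # (4, 3) (4, 3)
```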
Data Augmentation Create synthetic training images by transformations Rotation, scaling, flipping, cropping, noise, …
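A hedged sketch of such transformations using torchvision; the particular transforms and parameters are assumptions, and any image library with similar operations would do.

```python
from torchvision import transforms

# Each training image is randomly perturbed, so the network sees a
# label-preserving synthetic variant on every epoch.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop + rescale
    transforms.RandomHorizontalFlip(),                     # flipping
    transforms.RandomRotation(15),                         # small rotation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # photometric noise
    transforms.ToTensor(),
])

# e.g. augmented = augment(pil_image)  # apply to a PIL image at load time
```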
Transfer Learning/Domain Adaptation Use knowledge learned from past (source domain) to help solve the problem at hand (target domain)
Transfer Learning/Domain Adaptation Standard supervised learning assumes that the training and test examples (x, y) are drawn i.i.d. from a distribution D. In domain adaptation, the source and target domains have different but related distributions D_S and D_T. One can extract/use knowledge from more than one source domain at the same time. Unsupervised: labelled and unlabelled source examples + unlabelled target examples. Semi-supervised: also consider a small set of labelled target examples. Supervised: all examples are labelled
Transfer Learning/Domain Adaptation DLID (Deep Learning by Interpolation between Domain) DeCAF DAN (Deep Adaptation Networks)
Dataset The Office dataset with three domains: Amazon, DSLR, Webcam
DLID [Chopra et al. 2013] Discrete interpolation from the source domain to the target domain; each interpolation point uses a different proportion of training examples from the source and target domains; features are extracted at each point by an unsupervised trainer; the new feature representation combines all the extracted features
DeCAF [Donahue et al. ICML 2014] 1. Take the AlexNet trained on ImageNet 2. Do feed-forward operation using AlexNet on new images 3. Get the 6th or 7th layer activations as “features” (DeCAF6 and DeCAF7) 4. Apply any classifier (e.g. SVM and logistic regression)
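A hedged sketch of this pipeline with torchvision and scikit-learn (assuming a recent torchvision; the truncation point inside AlexNet's classifier and the linear SVM are illustrative choices, not the exact DeCAF setup):

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.svm import LinearSVC

# 1. AlexNet pretrained on ImageNet
alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
alexnet.eval()

# 3. Truncate the classifier so its output is the fc6 activation ("DeCAF6")
decaf6 = nn.Sequential(alexnet.features, alexnet.avgpool, nn.Flatten(),
                       *list(alexnet.classifier.children())[:3])

@torch.no_grad()
def extract(images):
    # 2. Feed-forward only, no fine-tuning; images: tensor (N, 3, 224, 224)
    return decaf6(images).numpy()

# 4. Train any off-the-shelf classifier on the extracted features, e.g.
# clf = LinearSVC().fit(extract(X_train), y_train)
```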
DAN [Long et al. ICML 2015] Fine-tune the AlexNet trained on ImageNet: freeze layers 1-3, fine-tune layers 4-5, and enforce similar distributions in the source and target domains for the hidden representations in layers 6-8 via a regulariser computed by MK-MMD
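The regulariser is the (multi-kernel) maximum mean discrepancy between source and target activations of each adapted layer; in its basic single-kernel form, with feature map phi inducing kernel k:

```latex
\mathrm{MMD}^{2}(D_S, D_T) =
\Bigl\lVert \mathbb{E}_{x^{s} \sim D_S}\bigl[\phi(x^{s})\bigr]
          - \mathbb{E}_{x^{t} \sim D_T}\bigl[\phi(x^{t})\bigr] \Bigr\rVert_{\mathcal{H}}^{2},
\qquad
k(x, x') = \langle \phi(x), \phi(x') \rangle
```

In MK-MMD the kernel is a convex combination of several (e.g. Gaussian) kernels, and DAN adds the MMD terms for layers 6-8, weighted by a trade-off parameter, to the classification loss.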
Summary Deep learning is hard (standard BP does not work well) because of the vanishing gradient problem. Many good ideas in deep learning (e.g. mini-batch, cross entropy, ReLU, adaptive learning rate, regularisation, dropout, data augmentation). Transfer learning/domain adaptation is a trending direction (similar to human learning). Strategies in transfer learning: learn shared hidden representations (e.g. DLID); share features (e.g. DeCAF, DAN)