AN ANALYSIS OF SINGLE-LAYER NETWORKS IN UNSUPERVISED FEATURE LEARNING [1]
Yani Chen
10/14/2014
Outline
Introduction
Framework for feature learning
Unsupervised feature learning algorithms
Effect of some parameters
Experiments and analysis of the results
Introduction
1. Much prior work has focused on increasingly complex unsupervised feature learning algorithms.
2. Simple factors, such as the number of hidden nodes, may matter more for achieving high performance than the learning algorithm or the depth of the model.
3. A single-layer network alone can give very good feature learning results.
Unsupervised feature learning framework
1. Extract random patches from unlabeled training images (images are used as the example domain).
2. Apply a pre-processing stage to the patches.
3. Learn a feature mapping using an unsupervised feature learning algorithm.
4. Extract features from equally spaced sub-patches covering the input images.
5. Pool the features together to reduce the number of feature values.
6. Train a linear classifier to predict the labels given the feature vectors.
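Steps 1 and 2 of the framework can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the talk: the function names, the random stand-in images, and the regularizer `eps` are assumptions, and the pre-processing shown (per-patch mean subtraction and division by the standard deviation) is only one possible choice.

```python
import numpy as np

def extract_random_patches(images, patch_size, num_patches, rng):
    """Sample square patches at random positions from a stack of images.

    images: array of shape (n_images, height, width, channels).
    Returns an array of shape (num_patches, patch_size * patch_size * channels).
    """
    n, h, w, c = images.shape
    patches = np.empty((num_patches, patch_size * patch_size * c))
    for i in range(num_patches):
        img = rng.integers(n)                      # pick a random image
        r = rng.integers(h - patch_size + 1)       # random top-left row
        col = rng.integers(w - patch_size + 1)     # random top-left column
        patch = images[img, r:r + patch_size, col:col + patch_size, :]
        patches[i] = patch.reshape(-1)
    return patches

def normalize_patches(patches, eps=10.0):
    """Subtract each patch's mean and divide by its (regularized) std."""
    mean = patches.mean(axis=1, keepdims=True)
    var = patches.var(axis=1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)   # eps is an assumed value

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(20, 32, 32, 3))  # stand-in for unlabeled images
patches = normalize_patches(extract_random_patches(images, 6, 1000, rng))
```

With 6-pixel patches from 3-channel images, each patch becomes a 108-dimensional vector, which is the input to the feature learning algorithm in step 3.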
Unsupervised learning algorithms
1. Sparse autoencoder
2. Sparse restricted Boltzmann machine
3. K-means clustering
4. Gaussian mixture model clustering
Sparse autoencoder
Objective function (minimize):
Feature mapping function:
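For reference, one common form of the sparse autoencoder objective is sketched here. It is an assumption and not necessarily the exact formula shown in the talk: the decoder parameters W', b', the weights λ and β, and the target activation ρ are symbols introduced for illustration. The feature mapping follows the form used in [1], with g the logistic sigmoid.

```latex
\[
\min_{W,\,b,\,W',\,b'} \;
\frac{1}{m}\sum_{i=1}^{m}
\bigl\| W' g(W x^{(i)} + b) + b' - x^{(i)} \bigr\|_2^2
\;+\; \lambda \bigl( \|W\|_F^2 + \|W'\|_F^2 \bigr)
\;+\; \beta \sum_{k=1}^{K} \mathrm{KL}\bigl(\rho \,\|\, \hat{\rho}_k\bigr),
\qquad
\hat{\rho}_k = \frac{1}{m}\sum_{i=1}^{m} g\bigl(W x^{(i)} + b\bigr)_k
\]
\[
f(x) = g(Wx + b), \qquad g(z) = \frac{1}{1 + e^{-z}}
\]
```

The first term is the reconstruction error, the second is weight decay, and the third penalizes hidden units whose average activation differs from ρ.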
Sparse restricted Boltzmann machine
Energy function of an RBM:
The same type of sparsity penalty as in the sparse autoencoder can be added.
Sparse RBMs can be trained using a contrastive divergence approximation [7].
Feature mapping function:
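As a hedged reference, the standard energy of an RBM with binary units is written out here; the talk may have shown a different variant (for real-valued pixel inputs a Gaussian visible layer is often used instead). The visible bias a is a symbol introduced for illustration. The feature mapping has the same form as for the sparse autoencoder, as in [1].

```latex
\[
E(x, h) = -a^{\top} x - b^{\top} h - h^{\top} W x
\]
\[
f(x) = g(Wx + b), \qquad
f_k(x) = P(h_k = 1 \mid x) = \frac{1}{1 + e^{-(W_k x + b_k)}}
\]
```

Under this energy, the feature value for hidden unit k is its conditional activation probability given the input patch x.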
K-means clustering
Objective function for learning K centroids:
Feature mapping function:
1. hard assignment
2. soft assignment
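The k-means objective and the two feature mappings can be written as follows. This follows the formulation in [1], where the soft assignment is the "triangle" encoding that appears in the results tables; c^{(k)} denotes the k-th centroid.

```latex
\[
\min_{c^{(1)},\dots,c^{(K)}} \;
\sum_{i=1}^{m} \min_{k} \bigl\| x^{(i)} - c^{(k)} \bigr\|_2^2
\]
\[
\text{hard:}\quad
f_k(x) =
\begin{cases}
1 & \text{if } k = \arg\min_j \bigl\| c^{(j)} - x \bigr\|_2^2 \\
0 & \text{otherwise}
\end{cases}
\]
\[
\text{soft (triangle):}\quad
f_k(x) = \max\bigl\{0,\; \mu(z) - z_k\bigr\},
\qquad z_k = \bigl\| x - c^{(k)} \bigr\|_2,
\quad \mu(z) = \frac{1}{K}\sum_{j=1}^{K} z_j
\]
```

The triangle encoding sets a feature to zero whenever the patch is farther than average from that centroid, so roughly half of the features are zero for any given patch.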
GMM clustering
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters.
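In symbols, the definition above corresponds to the standard mixture density. The mixing weights π_k, means c^{(k)} and covariances Σ_k are notation introduced here for illustration.

```latex
\[
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\bigl(x \mid c^{(k)}, \Sigma_k\bigr),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
\]
```

The unknown parameters to be estimated are the weights, means and covariances of the K components.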
GMM (Gaussian mixture models)
EM algorithm
The EM (expectation-maximization) algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models.
E-step: assign points to clusters (softly, by computing each cluster's responsibility for each point).
M-step: estimate the model parameters from those assignments.
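The two steps can be illustrated on a one-dimensional Gaussian mixture. This is a minimal sketch using only the standard library; the function names, the toy data, the initial parameters and the small variance floor are all assumptions made for the example.

```python
import math
import random

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, updating the parameter lists in place."""
    k = len(means)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                 for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj + 1e-6
    return means, variances, weights

random.seed(0)
# Toy data: two well-separated clusters centred at -3 and +3
data = ([random.gauss(-3.0, 1.0) for _ in range(300)]
        + [random.gauss(3.0, 1.0) for _ in range(300)])
means, variances, weights = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

On this toy data the estimated means move from the initial guesses toward the two cluster centres, and the weights stay close to one half each.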
Gaussian mixtures
Feature mapping function:
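A sketch of the GMM feature mapping, following the form given in [1]: each feature is the Gaussian density of one mixture component evaluated at the input patch. Here d is the input dimension, and c^{(k)} and Σ_k are the mean and covariance of component k.

```latex
\[
f_k(x) = \frac{1}{(2\pi)^{d/2} \, |\Sigma_k|^{1/2}}
\exp\!\Bigl( -\tfrac{1}{2} \bigl(x - c^{(k)}\bigr)^{\top} \Sigma_k^{-1} \bigl(x - c^{(k)}\bigr) \Bigr)
\]
```

Multiplying by the mixing weights and normalizing over k would turn these values into posterior membership probabilities.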
Feature extraction and classification
Convolutional feature extraction and pooling (sum)
Classification: (L2) SVM
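Convolutional extraction with sum pooling can be sketched as follows. This is an illustration under stated assumptions: the encoder is the soft k-means (triangle) mapping, pooling is a sum over four image quadrants as described in [1], the image is square, and the image and centroids are random stand-ins. Patch pre-processing is omitted for brevity.

```python
import numpy as np

def triangle_features(patches, centroids):
    """Soft k-means (triangle) encoding: f_k = max(0, mean(z) - z_k)."""
    # z[i, k] = Euclidean distance from patch i to centroid k
    z = np.linalg.norm(patches[:, None, :] - centroids[None, :, :], axis=2)
    return np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)

def extract_and_pool(image, centroids, w, stride):
    """Encode every w-by-w patch of a square image, then sum-pool over 4 quadrants."""
    n = image.shape[0]
    positions = range(0, n - w + 1, stride)
    patches = np.array([image[r:r + w, c:c + w].reshape(-1)
                        for r in positions for c in positions])
    feats = triangle_features(patches, centroids)
    side = len(positions)
    feats = feats.reshape(side, side, -1)          # grid of K-dimensional features
    half = side // 2
    quadrants = [feats[:half, :half], feats[:half, half:],
                 feats[half:, :half], feats[half:, half:]]
    # One K-dimensional sum per quadrant, concatenated into a 4K vector
    return np.concatenate([q.sum(axis=(0, 1)) for q in quadrants])

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))               # stand-in for one input image
centroids = rng.normal(size=(50, 6 * 6 * 3))       # stand-in for learned centroids
features = extract_and_pool(image, centroids, w=6, stride=1)
```

With K = 50 centroids the pooled vector has 4K = 200 values; this vector is what the linear (L2) SVM would be trained on.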
Data
1. CIFAR-10 (this dataset is used to tune the parameters)
2. NORB
3. Downsampled STL (96x96 to 32x32)
CIFAR-10 dataset
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images [3].
NORB dataset
This dataset is intended for experiments in 3D object recognition from shape. It contains images of 50 toys belonging to 5 generic categories: animals, human figures, airplanes, trucks, and cars.
24,300 training image pairs (96x96), 24,300 test image pairs [4].
STL-10 dataset
The STL-10 dataset consists of x64 color images and 3200 test images in 4 classes, airplane, cat, car and dog. There are training images and test images. [5]
Factors affecting performance
1. with or without whitening
2. number of features
3. stride (spacing between patches)
4. receptive field size
Effect of whitening
Result of whitening:
1. the features are less correlated with each other
2. the features all have the same variance
For the sparse autoencoder and sparse RBM:
with only 100 features, whitening gives a significant benefit; as the number of features grows, the advantage disappears.
For the clustering algorithms:
whitening is a must-have step, because they cannot handle the correlations in the data.
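The two effects of whitening listed above can be checked numerically. The sketch below uses ZCA whitening, which is one common choice and an assumption here; the function name, the regularizer `eps` and the toy data are also illustrative.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X (one patch per row).

    Returns the whitened data, the mean, and the whitening matrix so that
    the same transform can be reused on new patches.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)         # cov is symmetric
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W

rng = np.random.default_rng(0)
# Toy data with correlated dimensions (stand-in for image patches)
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.5, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(5000, 4)) @ A
Xw, mean, W = zca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]                    # close to the identity matrix
```

After whitening, the off-diagonal entries of the covariance are near zero (decorrelated features) and the diagonal entries are near one (equal variance).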
Effect of number of features
Numbers of features used: 100, 200, 400, 800, 1600
All algorithms generally achieved higher performance by learning more features.
Effect of stride
Stride is the spacing between the patches from which feature values are extracted.
Performance decreases as the stride (step size) increases.
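A short calculation shows how quickly the number of extracted patches shrinks as the stride grows, which is one plausible reason larger strides hurt. The helper name and the stride values are illustrative; the 32-pixel image and 6-pixel receptive field match the settings mentioned elsewhere in the talk.

```python
def num_patch_positions(image_size, field_size, stride):
    """Number of patch positions along one image dimension."""
    return (image_size - field_size) // stride + 1

# Patch counts for a 32x32 image with a 6-pixel receptive field
for stride in (1, 2, 4, 8):
    per_side = num_patch_positions(32, 6, stride)
    print(stride, per_side, per_side ** 2)
```

A stride of 1 gives 27 x 27 = 729 patches per image, while a stride of 8 gives only 4 x 4 = 16.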
Effect of receptive field size
The receptive field size is the patch size.
Overall, the 6-pixel receptive field worked best.
Classification results
Algorithm                            Accuracy
Raw pixels                           37.3%
3-way factored RBM (3 layers)        65.3%
Mean-covariance RBM (3 layers)       71.0%
Improved Local Coord. Coding         74.5%
Conv. Deep Belief Net (2 layers)     78.9%
Sparse auto-encoder                  73.4%
Sparse RBM                           72.4%
K-means (Hard)                       68.6%
K-means (Triangle, 1600 features)    77.9%
K-means (Triangle, 4000 features)    79.6%
Table 1: Test recognition accuracy on CIFAR-10
Settings: stride = 1, receptive field = 6, with whitening, large number of features
Classification results
Algorithm                            Accuracy (error)
Conv. Neural Network                 93.4% (6.6%)
Deep Boltzmann Machine               92.8% (7.2%)
Deep Belief Network                  95.0% (5.0%)
Best result of [6]                   94.4% (5.6%)
Deep neural network                  97.13% (2.87%)
Sparse auto-encoder                  96.9% (3.1%)
Sparse RBM                           96.2% (3.8%)
K-means (Hard)                       96.9% (3.1%)
K-means (Triangle, 1600 features)    97.0% (3.0%)
K-means (Triangle, 4000 features)    97.21% (2.79%)
Table 2: Test recognition accuracy (and error) for NORB (normalized-uniform)
Settings: stride = 1, receptive field = 6, with whitening, large number of features
Classification results
Algorithm                            Accuracy
Raw pixels                           31.8% (±0.62%)
K-means (Triangle, 1600 features)    51.5% (±1.73%)
Table 3: Test recognition accuracy on STL-10
The proposed method is strongest when large labeled training sets are available.
Conclusion
The best performance was obtained with k-means clustering, which is easy and fast and has no hyperparameters to tune.
A one-layer network can give good results.
Using more features and dense extraction improves performance.
References
[1] Coates, Adam, Andrew Y. Ng, and Honglak Lee. "An analysis of single-layer networks in unsupervised feature learning." International Conference on Artificial Intelligence and Statistics, 2011.
[2]
[3] A. Krizhevsky. Learning multiple layers of features from Tiny Images. Master's thesis, Dept. of Comp. Sci., University of Toronto, 2009.
[4] LeCun, Yann, Fu Jie Huang, and Leon Bottou. "Learning methods for generic object recognition with invariance to pose and lighting." Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. IEEE, 2004.
[5]
[6] Jarrett, Kevin, et al. "What is the best multi-stage architecture for object recognition?" 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009.
[7] Goh, Hanlin, Nicolas Thome, and Matthieu Cord. "Biasing restricted Boltzmann machines to manipulate latent selectivity and sparsity." NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
THANK YOU!