Article Review: "Building High-Level Features Using Large Scale Unsupervised Learning" (Le et al., 2012)
Todd Hricik
Learning High-Level Features
- Previous studies in computer vision have used labeled data to learn "higher-level" features
- This requires a large training set containing the features you wish to recognize, which is difficult to obtain in many cases
- The focus of the work reviewed here is to build high-level, class-specific feature detectors from unlabeled images
Learning Features From Unlabeled Data
- RBMs (Hinton et al., 2006)
- Autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007)
- Sparse coding (Lee et al., 2007) and K-means (Coates et al., 2011)
- To date, most have only succeeded in learning low-level features such as "lines" or "blobs"
- The authors consider the possibility of capturing more complex features using deep autoencoders on unlabeled data
Deep Autoencoders
- Made up of symmetrical encoding (blue) and decoding (red) deep belief networks (see the sketch below)
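A minimal sketch of the symmetric encoder/decoder idea in PyTorch; the layer sizes, sigmoid activations, and the DeepAutoencoder class are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

# Sketch of a symmetric deep autoencoder: the decoder mirrors the encoder
# layer-for-layer. Layer sizes are illustrative placeholders.
class DeepAutoencoder(nn.Module):
    def __init__(self, dims=(784, 256, 64, 16)):
        super().__init__()
        enc, dec = [], []
        for d_in, d_out in zip(dims, dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.Sigmoid()]   # encoding path (blue)
        for d_in, d_out in zip(dims[::-1], dims[::-1][1:]):
            dec += [nn.Linear(d_in, d_out), nn.Sigmoid()]   # decoding path (red)
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = DeepAutoencoder()
x = torch.rand(8, 784)                  # batch of flattened inputs
loss = ((model(x) - x) ** 2).mean()     # reconstruction error
```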
Training Set
- Randomly sampled 200x200 pixel frames from 10 million YouTube videos
- The OpenCV face detector was run on 60x60 randomly sampled patches from the training set
- About 3% of the 100,000 sampled patches were flagged as faces by the OpenCV detector (sketch of this step below)
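A hedged sketch of that patch-labeling step using OpenCV's stock Haar cascade; the cascade file, patch count, and the count_face_patches helper are assumptions for illustration, not the authors' exact procedure.

```python
import cv2
import numpy as np

# Run OpenCV's frontal-face Haar cascade over randomly sampled 60x60
# patches of a frame and count how many are flagged as faces.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def count_face_patches(frame, n_patches=1000, size=60):
    h, w = frame.shape[:2]
    hits = 0
    for _ in range(n_patches):
        y = np.random.randint(0, h - size)
        x = np.random.randint(0, w - size)
        patch = frame[y:y + size, x:x + size]
        gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY)
        if len(detector.detectMultiScale(gray)) > 0:
            hits += 1
    return hits
```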
Deep Autoencoder Architecture
- 1 billion trainable parameters; still tiny, as the human visual cortex is ~10^6 times larger
- Local receptive fields (LRF): each feature connects to a small region of the layer below
- Local L2 pooling: square root of the sum of squares of the inputs (sketch below)
- Local contrast normalization (LCN)
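A minimal numpy sketch of local L2 pooling over non-overlapping windows, directly implementing the square-root-of-sum-of-squares definition above; the window size and the local_l2_pool helper are illustrative assumptions.

```python
import numpy as np

# Local L2 pooling: each pooling unit outputs the square root of the
# sum of squares of its inputs within a local window.
def local_l2_pool(x, win=3):
    h, w = x.shape
    h, w = h - h % win, w - w % win          # crop to a multiple of win
    blocks = x[:h, :w].reshape(h // win, win, w // win, win)
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

x = np.random.randn(9, 9)
print(local_l2_pool(x).shape)                # (3, 3)
```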
Learning and Optimization
- Pooling parameters H are fixed to uniform weights; only the encoding weights W1 and decoding weights W2 of the first sublayers are learned
- Lambda: tradeoff parameter between sparsity and reconstruction
- m, k: number of examples and pooling units in a layer, respectively
- The objective function of the model is the sum of the individual objectives of the three layers (see the sketch below)
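As a sketch, the per-layer objective from Le et al. (2012) combines a reconstruction term with a pooled-sparsity penalty over the k pooling units, using the symbols defined above; the small constant epsilon (for numerical stability of the square root) is stated here as an assumption of my reading of the paper:

```latex
\min_{W_1, W_2} \;
\sum_{i=1}^{m} \left(
  \left\| W_2 W_1^{\top} x^{(i)} - x^{(i)} \right\|_2^2
  + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^{\top} x^{(i)} \right)^2}
\right)
```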
Validation of Higher-Level Features Learned
- Control experiments were used to analyze the invariance properties of the face detector
- The test set consists of 37,000 images (containing 13,026 faces) sampled from the Labeled Faces in the Wild dataset (Huang et al., 2007) and the ImageNet dataset (Deng et al., 2009)
- After training, the test set was used to measure the performance of each neuron in classifying faces against distractors
- For each neuron, compute its maximum and minimum activation values, then pick 20 equally spaced thresholds in between
- The reported accuracy is the best classification accuracy among the 20 thresholds (sketch below)
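A short numpy sketch of that per-neuron evaluation loop; acts and labels are assumed arrays of a neuron's activations and ground-truth face labels over the test set.

```python
import numpy as np

# Sweep 20 equally spaced thresholds between a neuron's min and max
# activation and report the best classification accuracy.
def best_threshold_accuracy(acts, labels, n_thresh=20):
    thresholds = np.linspace(acts.min(), acts.max(), n_thresh)
    accs = [((acts > t) == labels).mean() for t in thresholds]
    return max(accs)
```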
Validation Results
- Best neuron obtained 81.7% accuracy in detecting faces (serendipity?)
- Random guessing achieves 64.8% accuracy
- Best neuron in a one-layer network achieved 71% accuracy
[Chart: accuracy on a sub-sample of the test set with positive/negative ratio = 1 vs. on the entire test set]
Validation Results Analysis I
- Removing the LCN layer reduced the accuracy of the best-performing neuron to 78.5%
- Robustness of the face detector to translation, scaling, and out-of-plane rotation (Figs. 4, 5; probe sketch below)
- Removing all images that contain faces from the training set and repeating the experiment results in 72.5% accuracy
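A sketch of how one might probe translation robustness of this kind; neuron is a hypothetical callable mapping an image to the unit's activation, and the shift range is an assumption, not the paper's exact protocol.

```python
import numpy as np

# Record a (hypothetical) neuron's response as a test face is shifted
# horizontally by growing offsets.
def translation_response(neuron, image, max_shift=20, step=4):
    responses = []
    for dx in range(0, max_shift + 1, step):
        # circular horizontal shift (wrap-around; a crude stand-in
        # for true translation with padding)
        shifted = np.roll(image, dx, axis=1)
        responses.append((dx, neuron(shifted)))
    return responses
```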
Can Other Well-Performing Neurons Recognize Other High-Level Features?
- Constructed two datasets with positive/negative ratios similar to the face ratio in the training data
- Human bodies vs. distractors (Keller et al., 2009)
- Cat faces vs. distractors (Zhang et al., 2008)
Summary of Results and Comparisons to State-of-the-Art Methods

Thank you. Questions?