What is the Best Multi-Stage Architecture for Object Recognition? Kevin Jarrett, Koray Kavukcuoglu, Marc'Aurelio Ranzato and Yann LeCun. Presented by Lingbo Li, ECE, Duke University, Dec. 13, 2010

Outline: Introduction; Model Architecture; Training Protocol; Experiments (Caltech 101, NORB, and MNIST datasets); Conclusions.

Introduction (I) A feature extraction stage consists of: a filter bank, a non-linear operation, and a pooling operation. Recognition architectures: a single stage of features followed by a supervised classifier (e.g. SIFT, HoG); or two or more successive stages of feature extraction followed by a supervised classifier (e.g. convolutional networks).

Introduction (II) Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy? Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than a single stage? Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters?

Model Architecture (I) Filter bank layer F_CSG: the input x is a set of n1 feature maps of size n2 x n3, the output y is a set of m1 feature maps, and each filter k_ij has size l1 x l2. The j-th output feature map is y_j = g_j tanh(sum_i k_ij * x_i), where * denotes 2D convolution, tanh is the hyperbolic tangent non-linearity, and g_j is a trainable gain. The first stage uses a filter bank layer with 64 filters of size 9x9.
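
A minimal PyTorch sketch of this filter bank layer, assuming a single-channel 143x143 input, 64 trainable 9x9 filters, and a learnable per-map gain g_j (the module name FilterBankCSG is ours, not the paper's):

import torch
import torch.nn as nn

class FilterBankCSG(nn.Module):
    # F_CSG: Convolution, tanh Squashing, trainable per-map Gain
    def __init__(self, in_maps=1, out_maps=64, kernel=9):
        super().__init__()
        self.conv = nn.Conv2d(in_maps, out_maps, kernel, bias=False)
        self.gain = nn.Parameter(torch.ones(out_maps, 1, 1))  # g_j

    def forward(self, x):
        # y_j = g_j * tanh(sum_i k_ij * x_i)
        return self.gain * torch.tanh(self.conv(x))

x = torch.randn(1, 1, 143, 143)   # one gray-scale image
y = FilterBankCSG()(x)            # -> (1, 64, 135, 135) feature maps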

Model Architecture (II) Local contrast normalization layer N: Subtractive normalization: v_ijk = x_ijk - sum_{ipq} w_pq x_{i,j+p,k+q}, where w_pq is a Gaussian weighting window normalized so that sum_{ipq} w_pq = 1. Divisive normalization: y_ijk = v_ijk / max(c, sigma_jk), where sigma_jk = (sum_{ipq} w_pq v_{i,j+p,k+q}^2)^{1/2}.
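
A rough NumPy/SciPy sketch of this normalization, under the simplifying assumption of a single feature map (the paper's version also sums across neighboring feature maps); the window width and the constant c are illustrative:

import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(x, sigma_w=2.0, c=1.0):
    # Subtractive step: remove the Gaussian-weighted local mean
    v = x - gaussian_filter(x, sigma_w)
    # Divisive step: divide by the local standard deviation,
    # but never by less than the constant c
    local_std = np.sqrt(gaussian_filter(v ** 2, sigma_w))
    return v / np.maximum(c, local_std)

img = np.random.rand(143, 143).astype(np.float32)
out = local_contrast_normalize(img)   # same shape, locally whitened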

Model Architecture (III) Average pooling and subsampling layer P_A with 4x4 down-sampling: y_ijk = sum_{pq} w_pq x_{i,j+p,k+q}, where w_pq is a uniform weighting window. Max-pooling layer P_M with 4x4 down-sampling: y_ijk = max_{pq} x_{i,j+p,k+q}.
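
Both pooling variants in PyTorch, assuming 4x4 windows with stride 4 (the paper's average pooling uses a weighted window; a plain average is used here for brevity):

import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 128, 128)                 # 64 feature maps
avg = F.avg_pool2d(x, kernel_size=4, stride=4)   # P_A -> (1, 64, 32, 32)
mx  = F.max_pool2d(x, kernel_size=4, stride=4)   # P_M -> (1, 64, 32, 32)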

Model Architecture (IV) Combining Modules into a Hierarchy
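
A sketch of one such stage (filter bank, abs rectification, pooling) chained with nn.Sequential; the local contrast normalization layer is omitted, the filter count and size follow the slide's first stage, and the pooling geometry is illustrative:

import torch
import torch.nn as nn

class Abs(nn.Module):                       # Rabs: absolute-value rectification
    def forward(self, x):
        return torch.abs(x)

stage = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=9),        # F_CSG filter bank: 64 filters, 9x9
    nn.Tanh(),
    Abs(),
    # the local contrast normalization layer (N) would go here
    nn.MaxPool2d(kernel_size=4, stride=4),  # pooling with 4x4 down-sampling
)

x = torch.randn(1, 1, 143, 143)
print(stage(x).shape)                       # torch.Size([1, 64, 33, 33])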

Training Protocol (I) Optimal sparse coding: represent each input x as a sparse linear combination of the columns of a dictionary W. Under the sparsity condition, this can be written as the optimization problem z* = argmin_z ||x - W z||_2^2 + lambda * ||z||_1. Given training samples, learning proceeds by: 1) minimizing the loss function with respect to the dictionary W; 2) finding the optimal code z* for each sample by running a rather expensive iterative optimization algorithm.
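
A tiny NumPy sketch of this inference step using ISTA (iterative shrinkage-thresholding); the dictionary W, sparsity weight lam, and iteration count are illustrative, not values from the paper:

import numpy as np

def sparse_code(x, W, lam=0.1, n_iter=200):
    # z* = argmin_z 0.5*||x - W z||^2 + lam*||z||_1, found iteratively
    # (this per-sample optimization is what makes sparse coding expensive)
    L = np.linalg.norm(W, 2) ** 2            # Lipschitz constant of the gradient
    z = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ z - x)             # gradient of the quadratic term
        z = z - grad / L
        z = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return z

W = np.random.randn(64, 256)                 # dictionary: 64-dim inputs, 256 atoms
x = np.random.randn(64)
z_star = sparse_code(x, W)                   # sparse code for x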

Training Protocol (II) Predictive Sparse Decomposition (PSD): PSD trains a regressor F(x; P_f) = g ⊙ tanh(W_f x + b) (⊙ is element-wise multiplication) to approximate the sparse solution z* for all training samples. Learning proceeds by minimizing the loss function L = ||x - W_d z||_2^2 + lambda * ||z||_1 + alpha * ||z - F(x; P_f)||_2^2 over z and the parameters. Thus W_d (the dictionary) and P_f = {W_f, g, b} (the filters) are simultaneously optimized.
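
A NumPy sketch of the PSD loss for a single sample, with the encoder written as F(x) = g * tanh(W_f x + b); the weights lam and alpha and all array sizes are illustrative. In practice the loss is minimized alternately over the code z and over the dictionary/encoder parameters:

import numpy as np

def psd_loss(x, z, W_d, W_f, b, g, lam=0.1, alpha=1.0):
    # ||x - W_d z||^2 + lam*||z||_1 + alpha*||z - F(x)||^2
    f = g * np.tanh(W_f @ x + b)            # feed-forward predictor F(x)
    recon  = np.sum((x - W_d @ z) ** 2)     # reconstruction (dictionary) term
    sparse = lam * np.sum(np.abs(z))        # L1 sparsity penalty
    pred   = alpha * np.sum((z - f) ** 2)   # make the code predictable
    return recon + sparse + pred

x   = np.random.randn(64)                   # input patch
z   = np.random.randn(256)                  # its sparse code
W_d = np.random.randn(64, 256)              # dictionary
W_f = np.random.randn(256, 64)              # encoder filters
b, g = np.zeros(256), np.ones(256)
print(psd_loss(x, z, W_d, W_f, b, g))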

Training Protocol (III) A single letter (e.g. R or U) denotes an architecture with a single stage of feature extraction followed by a classifier; a double letter (e.g. RR or UU) denotes an architecture with two stages of feature extraction followed by a classifier. R: filters are set to random values and kept fixed; the classifier is trained in supervised mode. U: filters are trained with the unsupervised PSD algorithm and kept fixed; the classifier is trained in supervised mode. R+: filters are initialized with random values; the entire system (feature stages + classifier) is trained in supervised mode with gradient descent. U+: filters are initialized with the unsupervised PSD algorithm; the entire system (feature stages + classifier) is trained in supervised mode with gradient descent.
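
The practical difference between the fixed-filter protocols (R, U) and the fine-tuned ones (R+, U+) is simply whether the feature-extraction stages receive gradient updates. A PyTorch-flavored sketch of that distinction, with a single convolution standing in for a whole stage (all module choices here are placeholders, not the paper's code):

import torch
import torch.nn as nn

stage = nn.Conv2d(1, 64, 9)            # stands in for a feature-extraction stage
classifier = nn.Linear(64, 101)        # stands in for the supervised classifier

# R / U: filters fixed (random or PSD-initialized); only the classifier learns
for p in stage.parameters():
    p.requires_grad = False
optim_fixed = torch.optim.SGD(classifier.parameters(), lr=0.01)

# R+ / U+: the whole system is fine-tuned end to end by gradient descent
for p in stage.parameters():
    p.requires_grad = True
optim_full = torch.optim.SGD(
    list(stage.parameters()) + list(classifier.parameters()), lr=0.01)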

Experiments (I) – Caltech 101 Data pre-processing: 1) Convert to gray-scale and resize to 151x151 pixels; 2) Subtract the image mean and divide by the image standard deviation; 3) Apply subtractive/divisive normalization (N layer with c=1); 4) Zero-pad the shorter side to 143 pixels. Recognition rates are averaged over 5 random draws of the training set (30 images per class). Hyper-parameters are selected to maximize performance on a validation set of 5 samples per class held out from the training set.
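
A rough sketch of pre-processing steps 1-2 with Pillow and NumPy; image.jpg is a placeholder path, and the normalization layer and zero-padding of steps 3-4 are not shown:

import numpy as np
from PIL import Image

img = Image.open("image.jpg").convert("L")   # 1) convert to gray-scale
img = img.resize((151, 151))                 #    resize as in the slide
x = np.asarray(img, dtype=np.float32)
x = (x - x.mean()) / (x.std() + 1e-8)        # 2) subtract mean, divide by std
# 3) subtractive/divisive normalization and 4) zero-padding would follow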

Experiments (I) – Caltech 101 Using a single stage of feature extraction: 64 feature maps of size 26x26 are fed to either a multinomial logistic regression classifier or a PMK-SVM. Using two stages of feature extraction: the first stage produces 64 feature maps of size 26x26, the second stage produces 256 feature maps of size 4x4, which are fed to either a multinomial logistic regression classifier or a PMK-SVM.

Experiments (I) – Caltech 101

Random filters and no filter learning whatsoever can achieve decent performance; supervised fine-tuning improves the performance; two-stage systems are better than their single-stage counterparts; with rectification and normalization, unsupervised training does not improve the performance; abs rectification is a crucial component for good performance; a single-stage system with PMK-SVM reaches the same performance as a two-stage system with logistic regression.

Experiments (II) – NORB Dataset The NORB dataset has 5 object categories, with 24,300 training samples and 24,300 test samples (4,860 per class); each image is gray-scale with 96x96 pixels; only a subset of the training protocols is considered. 1) Random filters do not perform as well as learned filters once more labeled samples are available. 2) The use of abs rectification and normalization makes a big difference.

Experiments (II) – NORB Dataset Gradient descent is used to find the optimal input patterns for a given architecture. In the figure: (1-a) random stage-1 filters; (1-b) the corresponding optimal inputs; (2-a) PSD filters; (2-b) the corresponding optimal input patterns; (3) a subset of stage-2 filters after PSD training and supervised refinement on Caltech-101.
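
The "optimal input" for a unit can be approximated by gradient ascent on the input image itself; a PyTorch sketch under the assumption of a small single-stage model (the actual experiment uses the paper's full stage and its own optimization details):

import torch
import torch.nn as nn

stage = nn.Sequential(nn.Conv2d(1, 64, 9), nn.Tanh())   # stand-in for a stage
x = torch.zeros(1, 1, 32, 32, requires_grad=True)       # start from a blank input
opt = torch.optim.SGD([x], lr=0.1)

for _ in range(100):
    opt.zero_grad()
    response = stage(x)[0, 0].mean()   # activation of one output feature map
    (-response).backward()             # gradient ascent = minimize the negative
    opt.step()

# x now approximates the input pattern this unit responds to most strongly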

Experiments (III) – MNIST Dataset 60,000 gray-scale 28x28-pixel images for training and 10,000 images for testing. Two stages of feature extraction: the 34x34 input image is convolved with 50 7x7 filters (50 28x28 feature maps) and max-pooled over 2x2 windows (50 14x14 feature maps) in the first stage; the second stage convolves with 64 5x5 filters (64 10x10 feature maps) and max-pools over 2x2 windows (64 5x5 feature maps); the result is fed to a 10-way multinomial logistic regression classifier.
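
A PyTorch sketch of this two-stage network with the layer sizes from the slide; the tanh non-linearities and random weight initialization here are assumptions, and the PSD-based initialization of the filters is not shown:

import torch
import torch.nn as nn

model = nn.Sequential(
    # stage 1: 34x34 input -> 50 28x28 maps -> 50 14x14 maps
    nn.Conv2d(1, 50, kernel_size=7),
    nn.Tanh(),
    nn.MaxPool2d(2),
    # stage 2: -> 64 10x10 maps -> 64 5x5 maps
    nn.Conv2d(50, 64, kernel_size=5),
    nn.Tanh(),
    nn.MaxPool2d(2),
    # 10-way multinomial logistic regression on the flattened 64*5*5 features
    nn.Flatten(),
    nn.Linear(64 * 5 * 5, 10),
)

x = torch.randn(8, 1, 34, 34)   # a batch of 28x28 MNIST digits padded to 34x34
print(model(x).shape)           # torch.Size([8, 10])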

Experiments (III) – MNIST Dataset Parameters are trained with PSD; the only hyper-parameter is tuned on a validation set of 10,000 training samples. The classifier is randomly initialized, and the whole system is then fine-tuned in supervised mode. A test error rate of 0.53% is obtained.

Conclusions (I) Q1: How do the non-linearities that follow the filter banks influence the recognition accuracy? 1) A rectifying non-linearity is the single most important factor. 2) A local normalization layer can also improve the performance. Q2: Is there any advantage to using an architecture with two successive stages of feature extraction, rather than a single stage? 1) Two stages are better than one. 2) The performance of the two-stage system is similar to that of the best single-stage systems based on SIFT and PMK-SVM.

Conclusions (II) Q3: Does learning the filter banks in an unsupervised or supervised manner improve the performance over hard-wired filters or even random filters? 1) Random filters yield good performance only in the case of a small training set. 2) The optimal input patterns for a randomly initialized stage are similar to the optimal inputs for a stage that uses learned filters. 3) Global supervised learning of the filters yields good recognition rates with the proper non-linearities. 4) Unsupervised pre-training followed by supervised refinement yields the best overall accuracy.