Towards Deep Understanding on Convolutional Neural Networks

Lingxi Xie The Johns Hopkins University Slides available on my homepage (search “Lingxi Xie”)

Outline
Introduction
Towards Deep Understanding on CNN
  Neuron-level: the Max-Conv Operator
  Mid-level: the GNPP Algorithm
  Regularization: the DisturbLabel Algorithm
  Feature Visualization: the InterActive Algorithm
Conclusions and Future Work
8/9/2019 Towards Deep Understanding on CNN

Lingxi Xie (谢凌曦)
Education Background
  Bachelor in Engineering, Tsinghua University, 2010
  Ph.D. in Engineering, Tsinghua University, 2015
Working Experience
  Visiting Student, the University of Texas at San Antonio, 2014 (supervisor: Prof. Qi Tian)
  Research Intern, Microsoft Research Asia, 2013 – 2015 (supervisor: Dr. Jingdong Wang)
  Postdoc Researcher, University of California, Los Angeles, 2015 – 2016 (supervisor: Prof. Alan Yuille)

Lingxi Xie (谢凌曦)
Research interests
  Computer Vision
    Large-scale image classification
    Fine-grained object recognition
  Multimedia Information Retrieval
    Near-duplicate object retrieval
    Large-scale image search
  Deep Learning
    The Convolutional Neural Networks (CNN)

Contributors in Alphabetical Order
Dr. John Flynn, Univ. of California, Los Angeles
Prof. Weiyao Lin, Shanghai Jiao Tong University
Prof. Qi Tian, Univ. of Texas at San Antonio
Dr. Jingdong Wang, Microsoft Research Asia
Prof. Meng Wang, Hefei Univ. of Technology
Zhen Wei, Shanghai Jiao Tong University
Prof. Alan Yuille, the Johns Hopkins University
Prof. Bo Zhang, Tsinghua University
Dr. Liang Zheng, Univ. of Texas at San Antonio

Introduction
Computer vision and image processing
  An important direction of artificial intelligence
  Teaching computers to understand images
A wide range of real-world applications
  Image classification
  Image retrieval or search
  Object detection
  Semantic segmentation
  ......

Talk: Deep Learning and CNN
Image Classification
[Figure: example images from three datasets (Bird-200, Dog-120, Flwr-102) with category labels (BIRD, DOG, FLOWER) and fine-grained labels such as Black-foot. Albatross, Chihuahua, and daffodil; test images are to be assigned both kinds of label.]

Image Retrieval
[Figure: a query image from the Holiday dataset and its retrieved results, each marked as a true positive (TP) or a false positive (FP).]

Fast Development and Challenges
Computer vision is promising
  Performance on almost every benchmark has improved substantially over the last five years
  Industry has applied many state-of-the-art techniques to real-world systems
Computer vision is very difficult
  Many fundamental problems remain unsolved
  Deep learning: a double-edged sword

A Machine Learning Perspective
Problem settings
  Dataset: $\mathcal{D} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$
    $\mathbf{x}_n$: $D$-dimensional input vector ($D$ is often large)
    $\mathbf{y}_n$: $C$-dimensional output vector ($C$ is the number of classes)
    Classification vs. regression: different encoding schemes
  Model: $\mathbf{f}(\mathbf{x}) \in \mathbb{M}^{C}$
    For image processing, the function $\mathbf{f}(\mathbf{x})$ is often very complicated, and thus difficult to design manually
  Learning: supervised vs. unsupervised

Deep Learning
The state-of-the-art machine learning theory
  Using a cascade of many layers of non-linear neurons for feature extraction and transformation
Learning multiple levels of feature representation
  Higher-level features are derived from lower-level features to form a hierarchical architecture
  Multiple levels of representation correspond to different levels of abstraction

Deep Learning (cont.)
The prerequisites of deep learning
  A large-scale dataset (e.g., ImageNet)
  Powerful computational resources (e.g., GPU)
A side note: designing an effective image processing framework
  From global to local: the BoVW model
  From shallow to deep: the CNN model

Blooming Development!
Conference publications
  CVPR 2012: <10 Deep Learning papers
  CVPR 2016: >200 Deep Learning papers!
Competitions
  ILSVRC 2011: no CNNs in top-10
  ILSVRC 2012: only AlexNet in top-10 (No. 1)
  ILSVRC 2015: all CNNs in top-10

Applications in Computer Vision
Basic networks for image classification
  AlexNet, GoogLeNet, VGGNet, ResNet, etc.
Fine-tuning for other computer vision tasks
  Object detection: R-CNN, Faster R-CNN, etc.
  Semantic segmentation: FCN, DeepLab, etc.
Deep features for transfer learning
  Image classification on small datasets, image retrieval, person re-identification, etc.

Applications in Other Fields
Which problems to solve?
  Given a problem, if we can define a differentiable loss function, then it is possible to construct a deep network to formulate this problem
What are the advantages?
  Deep learning is good at learning complicated data distributions. It is especially advantageous when the exact distribution is unknown and we can “let the data speak for themselves”

Convolutional Neural Networks
The Convolutional Neural Networks
  A fundamental machine learning tool
  Good performance in a wide range of problems in computer vision as well as other research areas
  Evolutions in many real-world applications
Theory: a multi-layer, hierarchical network often has a larger capacity, but also requires a larger amount of data to train

Designing CNN Structures
History From linear to non-linear From shallow to deep From fully-connected to convolutional Today A cascade of various types of non-linear units Typical units: convolution, pooling, activation, etc. 8/9/2019 CVPR 2017 – In submission

The Convolutional Neural Networks Example 1: the LeNet [LeCun, 1998] 8/9/2019 Towards Deep Understanding on CNN

The Convolutional Neural Networks Example 2: the AlexNet [Krizhevsky, 2012] 8/9/2019 Towards Deep Understanding on CNN

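Basic Notations
A CNN is trained on a dataset $\mathcal{S} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$, where $N$ is the dataset size, $\mathbf{x}_n \in \mathbb{R}^{D}$ is a data point, and $\mathbf{y}_n = (0, \cdots, 0, 1, 0, \cdots, 0)^{\top} \in \mathbb{R}^{C}$
Let the CNN be a model $\mathbb{M}: \mathbf{h}(\mathbf{X}_0; \boldsymbol{\theta}) \in \mathbb{R}^{C}$, in which $\mathbf{X}_0$ is the input image, and $\boldsymbol{\theta}$ is the set of weights (neuron connections)
$\mathbf{X}_l$ denotes the neuron responses on the $l$-th layer, which form a $W_l \times H_l \times D_l$ cube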
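The label vector in the notation above is a one-hot encoding. A minimal sketch of that encoding follows; the helper name `one_hot` and the 0-based class index are assumptions made for illustration, not part of the slides.

```python
def one_hot(label, num_classes):
    """Return a C-dimensional list with a single 1 at position `label`.

    `label` is a 0-based class index (an assumption; the slides do not fix
    an indexing convention here).
    """
    if not 0 <= label < num_classes:
        raise ValueError("label out of range")
    return [1.0 if c == label else 0.0 for c in range(num_classes)]


# Class 2 out of C = 5 classes.
print(one_hot(2, 5))
```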

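CNN Training Process
In each iteration $t$, a mini-batch $\mathcal{B}_t$ is sampled from $\mathcal{S}$, and the current model parameters $\boldsymbol{\theta}_{t-1}$ are updated via Stochastic Gradient Descent (SGD):
$$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} + \gamma_{t-1} \cdot \frac{1}{|\mathcal{B}_t|} \cdot \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{B}_t} \nabla_{\boldsymbol{\theta}_{t-1}} L(\mathbf{x}, \mathbf{y})$$
  $\gamma_{t-1}$: the current learning rate
  $L(\mathbf{x}, \mathbf{y})$: the loss function of $(\mathbf{x}, \mathbf{y})$
  Computed via gradient back-propagation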
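The mini-batch update can be sketched on a toy problem. The sketch below is only an illustration under stated assumptions: the model is a single scalar weight fitted by squared error (not a CNN), the function names are hypothetical, and the step subtracts the averaged gradient, which is the usual descent convention for a loss that is minimized (the slide writes the update with a plus sign).

```python
import random


def sgd_step(theta, batch, grad_fn, lr):
    """One update: theta minus lr times the gradient averaged over the batch."""
    mean_grad = sum(grad_fn(theta, x, y) for x, y in batch) / len(batch)
    return theta - lr * mean_grad


def squared_loss_grad(theta, x, y):
    """Gradient of 0.5 * (theta * x - y) ** 2 with respect to the scalar theta."""
    return (theta * x - y) * x


def train(data, steps=500, batch_size=4, lr=0.1, seed=0):
    """Sample a mini-batch per iteration and apply sgd_step."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        theta = sgd_step(theta, batch, squared_loss_grad, lr)
    return theta


# Toy dataset generated by y = 3x; the learned weight should approach 3.
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(1, 11)]
print(train(data))
```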

Our Goal
Many basic properties of CNN remain unclear
  The main reason may lie in the deep structure and the high-dimensional space, which make it very difficult to analyze the behavior of the network
CNN should not be considered as a black box
  We aim at exploring the properties in different ways, including going deep into the framework, and performing some fine-scaled visual recognition tasks (part detection, etc.)

Motivation
CNN features are not reversal-invariant
  Reversal-invariant: given an image $\mathbf{X}_0$ and its reversed (left-right flipped) copy, the intermediate output of a model $\mathbb{M}$ should be identical on them
  A convolutional filter (kernel) can only capture visual contents in one direction
  Given that the number of filters on each layer is limited, this property largely constrains the model capacity

Deep Features on Small Datasets
Six (6) datasets used
  Generic classification: Caltech256
  Scene classification: MIT Indoor-67, SUN-397
  Fine-grained object recognition: Oxford Pet-37, Oxford Flower-102, Caltech-UCSD Bird-200
We evaluate the original deep features and InterActive features on different layers individually, and then combine them together

Reversal Invariance: Deep Features
| Model | C-256 | I-67 | S-397 | P-37 | F-102 | B-200 |
|---|---|---|---|---|---|---|
| AlexNet (w/o AUGM), ORIG | 67.69 | 53.91 | 41.01 | 76.95 | 84.56 | 43.43 |
| AlexNet (w/o AUGM), AVG | 70.39 | 58.10 | 44.47 | 79.60 | 87.17 | 47.98 |
| AlexNet (w/o AUGM), MAX | 70.17 | 57.77 | 44.19 | 79.40 | 86.88 | 47.82 |
| AlexNet (w/ AUGM), ORIG | 70.48 | 57.78 | 44.77 | 80.85 | 87.27 | 47.17 |
| AlexNet (w/ AUGM), AVG | 71.75 | 59.76 | 46.42 | 81.79 | 88.34 | 49.28 |
| AlexNet (w/ AUGM), MAX | 71.57 | 59.45 | 46.24 | 81.55 | 88.26 | 49.15 |
| VGGNet-16 (w/ AUGM), ORIG | 82.69 | 75.78 | 60.43 | 93.09 | 93.69 | 71.62 |
| VGGNet-16 (w/ AUGM), AVG | 83.09 | 76.06 | 61.50 | 93.31 | 94.01 | 72.66 |
| VGGNet-16 (w/ AUGM), MAX | 83.12 | 75.93 | 61.39 | 93.25 | 93.97 | 72.73 |
| VGGNet-19 (w/ AUGM), ORIG | 83.51 | 75.49 | 61.30 | 93.10 | 93.57 | 71.70 |
| VGGNet-19 (w/ AUGM), AVG | 83.90 | 75.93 | 62.40 | 93.17 | 93.83 | 72.55 |
| VGGNet-19 (w/ AUGM), MAX | 83.90 | 75.83 | 62.25 | 93.12 | 93.84 | 72.59 |

Original Convolution
Original convolution
$$f^{l}_{a,b,k}(\mathbf{I}; \mathbb{M}) = \langle \mathbf{f}^{l-1}_{a,b}, \boldsymbol{\theta}^{l}_{k} \rangle + b^{l}_{k}$$
When the kernel $\boldsymbol{\theta}^{l}_{k}$ is not symmetric, $f^{l}_{a,b,k}$ is not reversal-invariant
Solution: using reversal-invariant operators!

Reversal-Invariant Convolution
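Average convolution (Avg-Conv)
$$f^{l}_{a,b,k}(\mathbf{I}; \mathbb{M}) = \langle \mathbf{f}^{l-1}_{a,b}, \boldsymbol{\theta}^{l}_{k} \rangle + \langle \mathbf{f}^{l-1,\mathrm{R}}_{a,b}, \boldsymbol{\theta}^{l}_{k} \rangle + b^{l}_{k}$$
Max convolution (Max-Conv)
$$f^{l}_{a,b,k}(\mathbf{I}; \mathbb{M}) = \max\left\{ \langle \mathbf{f}^{l-1}_{a,b}, \boldsymbol{\theta}^{l}_{k} \rangle, \langle \mathbf{f}^{l-1,\mathrm{R}}_{a,b}, \boldsymbol{\theta}^{l}_{k} \rangle \right\} + b^{l}_{k}$$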
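The two operators can be checked on a single patch. The sketch below is an illustration under assumptions: a patch and a kernel are small 2-D lists, "reversed" means a left-right flip of the patch, and Avg-Conv sums the two matching scores exactly as the slide's formula is written (a factor of 1/2 would only rescale it). All function names are hypothetical.

```python
def flip_lr(patch):
    """Left-right reversed copy of a 2-D patch (list of rows)."""
    return [row[::-1] for row in patch]


def inner(patch, kernel):
    """Inner product <patch, kernel>, i.e. the template matching score."""
    return sum(p * k for prow, krow in zip(patch, kernel)
               for p, k in zip(prow, krow))


def conv(patch, kernel, bias=0):
    """Original convolution response at one position."""
    return inner(patch, kernel) + bias


def avg_conv(patch, kernel, bias=0):
    """Avg-Conv: sum of the scores of the patch and its reversed copy."""
    return inner(patch, kernel) + inner(flip_lr(patch), kernel) + bias


def max_conv(patch, kernel, bias=0):
    """Max-Conv: the larger score of the patch and its reversed copy."""
    return max(inner(patch, kernel), inner(flip_lr(patch), kernel)) + bias


# An asymmetric (vertical-edge) kernel and a patch with a bright left column.
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
patch = [[1, 0, 0], [1, 0, 0], [1, 0, 0]]
reversed_patch = flip_lr(patch)

print(conv(patch, kernel), conv(reversed_patch, kernel))          # differ
print(avg_conv(patch, kernel), avg_conv(reversed_patch, kernel))  # equal
print(max_conv(patch, kernel), max_conv(reversed_patch, kernel))  # equal
```

Because flipping twice returns the original patch, both operators see the same pair of scores for a patch and its reversed copy, which is why their outputs agree.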

Explanation
Convolution is a template matching process
  $\langle \mathbf{f}^{l-1}_{a,b}, \boldsymbol{\theta}^{l}_{k} \rangle$: the matching score between the input patch $\mathbf{f}^{l-1}_{a,b}$ and the template $\boldsymbol{\theta}^{l}_{k}$
  An input patch gets a high score if it is similar to the template

Explanation (cont.)
Reversal-invariant convolution: computing the matching scores of a patch and its reversed copy, and applying the average or max operator
  Avg-Conv: a patch gets a high score if it is similar to both the template and its reversed copy
  Max-Conv: a patch gets a high score if it is similar to either the template or its reversed copy
Max-Conv is more reasonable, and it enlarges the model capacity by having more templates

Reversal-Invariant CNN
Replacing all the convolution layers (including fully-connected layers) of the original network with reversal-invariant convolution operators
Such a network produces identical features for an image and its reversed copy
  This property can be proved by mathematical induction

CIFAR Experiments The CIFAR10/CIFAR100 datasets Low-resolution natural images (10 or 100 classes) 50,000 training and 10,000 testing images Uniformly distributed over 10/100 classes The network structure A 3-layer LeNet (input size: 32×32) 8/9/2019 Towards Deep Understanding on CNN

CIFAR10 Results

| CIFAR10 | w/o AUGM | w/ AUGM |
|---|---|---|
| LeNet | 18.11±.20 | 16.99±.22 |
| LeNet, AVG | 21.01±.35 | 20.99±.26 |
| LeNet, MAX | 16.93±.18 | **16.64**±.17 |

CIFAR100 Results

| CIFAR100 | w/o AUGM | w/ AUGM |
|---|---|---|
| LeNet | 46.08±.26 | 44.55±.10 |
| LeNet, AVG | 47.79±.41 | 47.55±.31 |
| LeNet, MAX | 43.90±.19 | **43.65**±.16 |

ILSVRC2012 Experiments The ILSVRC2012 dataset High-resolution natural images (1,000 classes) 1.3M training and 50K validation images Almost uniformly distributed over all classes The network structure The AlexNet 5 convolution layers, 3 pooling layers, and 3 fully-connected layers 8/9/2019 Towards Deep Understanding on CNN

ILSVRC2012 Results

| ILSVRC2012, top-1 | w/o AUGM | w/ AUGM |
|---|---|---|
| AlexNet | 43.05±.19 | 42.52±.07 |
| AlexNet, MAX | 42.16±.05 | **42.10**±.07 |

ILSVRC2012 Results

| ILSVRC2012, top-5 | w/o AUGM | w/ AUGM |
|---|---|---|
| AlexNet | 20.62±.08 | 19.52±.05 |
| AlexNet, MAX | 19.42±.03 | **19.12**±.07 |

Back to Deep Features

| Model | C-256 | I-67 | S-397 | P-37 | F-102 | B-200 |
|---|---|---|---|---|---|---|
| AlexNet (w/ AUGM), ORIG | 70.75 | 58.04 | 45.12 | 81.02 | 87.39 | 47.53 |
| AlexNet (w/ AUGM), AVG | 71.97 | 60.01 | 46.64 | 81.98 | 88.40 | 49.53 |
| AlexNet (w/ AUGM), MAX | 71.81 | 59.77 | 46.47 | 81.93 | 88.29 | 49.42 |
| AlexNet-MAX (w/ AUGM), ORIG | 71.78 | 59.91 | 46.47 | 81.92 | 88.17 | 49.55 |

Summary
Reversal invariance is important in extracting deep features and training a CNN model
  Cancelling out the reversal transform in natural images
Max-Conv is an efficient solution
  Considering the reversed copy of a filter (kernel)
  Equivalently increasing the model capacity
  Achieving reversal invariance in deep features

Motivation
The basic units in a CNN are neurons
  CNN considers each neuron individually, while we argue that it is important to model the co-occurrence of neuron responses
  In the BoVW model, visual words are grouped as visual phrases, but it remains unclear if the same idea works well in neural networks

Neural Words
Defined on a hidden layer $\mathbf{X}_l$ of the CNN; for simplicity, denote $\mathbf{X}_l$ as $\mathbf{X}$
  $\mathbf{X}$ is a 3D cube with $W \times H \times D$ neurons
  We naturally consider the data as a set of $D$-dimensional neural words: $\mathcal{X} = \{\mathbf{x}_{w,h}\}_{w=1,h=1}^{W,H}$
  The spatial position of each word is closely related to its receptive field on the input image

Geometric Neural Phrase
A geometric neural phrase is a group of neighboring neurons: $\mathcal{G}_{w,h} = \{\mathbf{x}^{k}_{w,h}\}_{k=0}^{K}$
  $\mathbf{x}^{0}_{w,h} = \mathbf{x}_{w,h}$: the central word
  $\mathbf{x}^{k}_{w,h}$ ($k > 0$): the side words, which are located in a small neighborhood of $\mathbf{x}_{w,h}$
For each neural word, there is a neural phrase

Geometric Neural Phrase (cont.)
[Figure: a convolutional neural network (input image, conv, pool, fully-connect, classifier) with a neuron map showing the central word and its side words for Neural Phrase Type 1 and Type 2.]

Geometric Neural Phrase Pooling
Computing a $D$-dimensional vector for each geometric neural phrase individually
$$\mathbf{z}_{w,h} = \mathbf{x}_{w,h} + \max_{k>0} \left\{ s^{k}_{w,h} \times \mathbf{x}^{k}_{w,h} \right\}$$
  $\max_{k>0}$: dimension-wise maximization
  $s^{k}_{w,h} = \sigma$ or $s^{k}_{w,h} = \sigma^{2}$, according to the relative position to the central word
  $\sigma$, with values between 0 and 1: the smoothing parameter
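The pooling rule can be sketched on a small neuron cube. Several details below are assumptions made for illustration and are not stated on the slide: the four edge-adjacent neighbors are weighted by sigma, the optional diagonal neighbors by sigma squared, and neighbors that fall outside the map are skipped. The function name `gnpp` and the nested-list layout `x[w][h][d]` are hypothetical.

```python
def gnpp(x, sigma=0.8, use_diagonals=False):
    """Apply z = x + max_k(s_k * x_k) at every position of a W x H x D cube.

    x is a nested list indexed as x[w][h][d]. The maximization is taken
    independently for each of the D dimensions.
    """
    W, H, D = len(x), len(x[0]), len(x[0][0])
    # (dw, dh, weight): edge-adjacent neighbors use sigma (assumption).
    offsets = [(-1, 0, sigma), (1, 0, sigma), (0, -1, sigma), (0, 1, sigma)]
    if use_diagonals:
        # Diagonal neighbors use sigma squared (assumption).
        offsets += [(dw, dh, sigma ** 2) for dw in (-1, 1) for dh in (-1, 1)]
    z = [[[0.0] * D for _ in range(H)] for _ in range(W)]
    for w in range(W):
        for h in range(H):
            for d in range(D):
                side = [s * x[w + dw][h + dh][d]
                        for dw, dh, s in offsets
                        if 0 <= w + dw < W and 0 <= h + dh < H]
                z[w][h][d] = x[w][h][d] + (max(side) if side else 0.0)
    return z


# An isolated response keeps its value, while a response with an active
# neighbor is boosted, so isolated responses become relatively weaker.
isolated = [[[0.0] for _ in range(3)] for _ in range(3)]
isolated[1][1][0] = 1.0
clustered = [[[0.0] for _ in range(3)] for _ in range(3)]
clustered[1][1][0] = 1.0
clustered[1][2][0] = 1.0
print(gnpp(isolated)[1][1][0], gnpp(clustered)[1][1][0])
```

The output has the same shape as the input, which matches the slides' remark that GNPP can be inserted as an intermediate layer.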

GNPP: an Illustration
Isolated neuron responses are penalized
  We argue that an isolated neuron response often corresponds to unexpected random noise, and is thus less reliable than clustered neuron responses
[Figure: neuron response maps with a color scale from 0.0 to 1.0]

GNPP as a Network Layer
GNPP does not change the dimension of data, thus it can be inserted as an intermediate layer anywhere in a network
In practice, we only insert the GNPP layer between a convolutional layer and a pooling layer, since GNPP penalizes isolated responses, and pooling after GNPP is an efficient way of aggregating these rectified neuron responses

MNIST Experiments The MNIST dataset Handwritten digit recognition (10 classes) 60,000 training and 10,000 testing images The network structure A 2-layer LeNet (input size: 28×28) 8/9/2019 Towards Deep Understanding on CNN

MNIST Results

| L1 | L2 | D | T1 (1.0) | T1 (0.9) | T1 (0.8) | T2 (1.0) | T2 (0.9) | T2 (0.8) |
|---|---|---|---|---|---|---|---|---|
|   |   |   | 0.87±.02 | 0.87±.02 | 0.87±.02 | 0.87±.02 | 0.87±.02 | 0.87±.02 |
| √ |   |   | 0.72±.04 | 0.73±.03 | 0.70±.05 | 0.71±.06 | 0.71±.06 | 0.72±.04 |
|   | √ |   | 0.75±.03 | 0.79±.02 | 0.77±.05 | 0.73±.04 | 0.75±.04 | 0.73±.05 |
| √ | √ |   | **0.72**±.03 | **0.67**±.04 | **0.69**±.04 | **0.63**±.03 | **0.64**±.03 | **0.67**±.03 |
|   |   | √ | 0.72±.03 | 0.72±.03 | 0.72±.03 | 0.72±.03 | 0.72±.03 | 0.72±.03 |
| √ |   | √ | 0.59±.02 | 0.61±.05 | 0.62±.03 | 0.59±.03 | 0.59±.02 | 0.63±.03 |
|   | √ | √ | 0.63±.03 | 0.62±.07 | 0.64±.03 | 0.62±.05 | 0.60±.03 | 0.65±.03 |
| √ | √ | √ | **0.58**±.05 | **0.55**±.05 | **0.57**±.02 | **0.54**±.05 | **0.56**±.04 | **0.61**±.05 |

SVHN Experiments The SVHN dataset Street view digit recognition (10 classes) 73,257 training, 26,032 testing and 531,131 extra images (after pre-proc., 598,388 training images) The network structure A 3-layer LeNet (input size: 32×32) 8/9/2019 Towards Deep Understanding on CNN

SVHN Results

| L1 | L2 | L3 | T1 (1.0) | T1 (0.9) | T1 (0.8) | T2 (1.0) | T2 (0.9) | T2 (0.8) |
|---|---|---|---|---|---|---|---|---|
|   |   |   | 4.63±.06 | 4.63±.06 | 4.63±.06 | 4.63±.06 | 4.63±.06 | 4.63±.06 |
| √ |   |   | 4.46±.06 | 4.47±.05 | 4.42±.09 | 4.42±.08 | 4.42±.07 | 4.43±.09 |
|   | √ |   | 4.15±.08 | 4.18±.01 | 4.17±.07 | 4.08±.10 | 4.19±.07 | 4.20±.05 |
| √ | √ |   | 3.76±.03 | 3.72±.05 | 3.77±.06 | 3.53±.07 | 3.64±.07 | 3.65±.10 |
|   |   | √ | 4.10±.05 | 4.07±.03 | 4.10±.05 | 4.10±.07 | 4.10±.03 | 4.14±.07 |
| √ |   | √ | 3.55±.10 | 3.60±.03 | 3.67±.06 | 3.47±.05 | 3.47±.02 | 3.55±.09 |
|   | √ | √ | **3.43**±.06 | 3.52±.07 | **3.55**±.04 | **3.41**±.03 | 3.42±.04 | 3.51±.05 |
| √ | √ | √ | 3.46±.07 | **3.47**±.06 | 3.55±.06 | 3.43±.05 | **3.39**±.01 | **3.46**±.03 |

CIFAR Experiments The CIFAR10/CIFAR100 datasets Low-resolution natural images (10 or 100 classes) 50,000 training and 10,000 testing images Uniformly distributed over 10/100 classes The network structure A 3-layer LeNet (input size: 32×32) 8/9/2019 Towards Deep Understanding on CNN

CIFAR10 Results

| L1 | L2 | L3 | T1 (1.0) | T1 (0.9) | T1 (0.8) | T2 (1.0) | T2 (0.9) | T2 (0.8) |
|---|---|---|---|---|---|---|---|---|
|   |   |   | 17.07±.15 | 17.07±.15 | 17.07±.15 | 17.07±.15 | 17.07±.15 | 17.07±.15 |
| √ |   |   | 16.67±.22 | 16.80±.25 | 16.84±.12 | 16.65±.19 | 17.03±.15 | 17.04±.17 |
|   | √ |   | 15.79±.22 | 16.09±.17 | 15.95±.31 | 15.69±.11 | 16.07±.27 | 15.90±.09 |
| √ | √ |   | 15.49±.15 | 15.31±.20 | 15.51±.25 | 15.27±.10 | 15.29±.14 | 15.28±.16 |
|   |   | √ | 15.82±.23 | 15.76±.18 | 15.98±.14 | 16.05±.29 | 15.90±.25 | 15.94±.09 |
| √ |   | √ | 15.15±.20 | 15.29±.12 | 15.44±.19 | 15.29±.32 | 15.19±.35 | 15.20±.35 |
|   | √ | √ | **14.92**±.18 | 15.00±.18 | 15.15±.15 | **14.83**±.25 | 14.93±.20 | 14.92±.16 |
| √ | √ | √ | 14.97±.17 | **14.83**±.23 | **14.78**±.17 | 15.22±.16 | **14.79**±.26 | **14.85**±.26 |

CIFAR100 Results

| L1 | L2 | L3 | T1 (1.0) | T1 (0.9) | T1 (0.8) | T2 (1.0) | T2 (0.9) | T2 (0.8) |
|---|---|---|---|---|---|---|---|---|
|   |   |   | 44.99±.19 | 44.99±.19 | 44.99±.19 | 44.99±.19 | 44.99±.19 | 44.99±.19 |
| √ |   |   | 44.62±.17 | 44.53±.45 | 44.78±.06 | 44.43±.29 | 44.58±.36 | 44.58±.52 |
|   | √ |   | 43.34±.23 | 43.71±.19 | 43.37±.26 | 43.21±.23 | 43.03±.27 | 43.37±.30 |
| √ | √ |   | 43.11±.24 | 42.77±.37 | 42.99±.24 | 42.96±.32 | 42.81±.38 | 43.08±.39 |
|   |   | √ | 43.99±.07 | 43.63±.11 | 43.50±.26 | 43.38±.37 | 43.34±.27 | 43.46±.25 |
| √ |   | √ | 42.85±.38 | 42.81±.27 | 42.82±.29 | 43.08±.27 | 42.79±.34 | 42.93±.22 |
|   | √ | √ | **42.35**±.30 | **42.34**±.31 | **42.04**±.20 | **42.92**±.33 | **42.72**±.25 | **42.54**±.29 |
| √ | √ | √ | 42.97±.29 | 42.77±.36 | 42.36±.18 | 43.31±.34 | 42.85±.18 | 42.60±.36 |

Analysis on Small Experiments
GNPP produces consistent accuracy gain when it is inserted anywhere into the network
  However, larger improvement is observed when it is inserted after a high-level convolutional layer, which supports the hypothesis that high-level layers better satisfy the assumption of GNPP
The parameters $K$ and $\sigma$ have little impact on the accuracy
  In the following experiments, $K = 4$ and $\sigma = 0.8$

Larger Experiments
Experiments with a big network (BigNet) on the four small datasets
  BigNet: an 11-layer network with 3×3 filters and 3 pooling layers; GNPP is inserted before the second and the third pooling layers
Experiments on ImageNet (ILSVRC2012) with the AlexNet
  GNPP is only inserted before the pool-5 layer

Results with a Big Network
| Method | MNIST | SVHN | CIFAR10 | CIFAR100 |
|---|---|---|---|---|
| Wan, ICML’13 | **0.21** | 1.94 | 9.32 | − |
| Zeiler, ICLR’13 | 0.47 | 2.80 | 15.13 | 42.51 |
| Goodfellow, ICML’14 | 0.45 | 2.47 | 9.38 | 38.57 |
| Lin, ICLR’14 |   | 2.35 | 8.81 | 35.68 |
| Lee, AISTATS’15 | 0.39 | 1.92 | 7.97 | 34.57 |
| Liang, CVPR’15 | 0.31 | 1.77 | 7.09 | **31.75** |
| BigNet, w/o GNPP | 0.36 | 2.14 | 7.80 | 31.03 |
| BigNet, w/ GNPP | **0.32** | **1.87** | **7.14** | **29.74** |
| Lee, AISTATS’16 |   | **1.69** | **6.05** | 32.37 |

ILSVRC2012 Results

| ILSVRC2012 | top-1 | top-5 |
|---|---|---|
| AlexNet, w/o GNPP | 43.19 | 19.87 |
| AlexNet, w/ GNPP | **42.16** | **19.24** |

GNPP Helps Image Representation
When GNPP is inserted after the conv-5 layer, the neuron response becomes smoother
Interestingly, since the roles of fully-connected layers do not change, the conv-5 layer adjusts itself to be more concentrated
As a result, the conv-5 layer in a GNPP-Net produces better image representation
  On Caltech256, classification accuracy +1.20%

GNPP Helps Image Representation
The heatmap of neuron responses
[Figure: original images (eagle, snake, pig, boat, monkey, sleigh, crab) with heatmaps of the AlexNet conv-5 layer and of the GNPPNet GNPP-5 layer.]

GNPP Increases Model Capacity
GNPP adds more fixed neuron connections between a layer and its previous layer
  Example: the GNPP layer after the conv-5 layer of the AlexNet increases the number of connections of each neuron to the previous layer from 9 to 21, and the total number of connections between conv-4 and conv-5 from 149.5M to 348.9M
Meanwhile, the number of parameters does not change, which prevents the net from over-fitting
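The counts in this example can be reproduced with a short calculation. The sketch assumes values that are not stated on the slide: a 3×3 conv-5 kernel, a phrase made of the central word plus its four edge-adjacent neighbors, a 13×13 conv-5 map with 256 output channels and 384 input channels, and no filter grouping. Under those assumptions the totals round to the slide's 149.5M and 348.9M. Function names are hypothetical.

```python
def footprint(kernel=3,
              phrase_offsets=((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))):
    """Number of distinct spatial positions on the previous layer that are
    covered by the kernel windows of all words in one neural phrase."""
    r = kernel // 2
    cells = set()
    for dw, dh in phrase_offsets:
        for a in range(-r, r + 1):
            for b in range(-r, r + 1):
                cells.add((dw + a, dh + b))
    return len(cells)


def total_connections(spatial, width=13, height=13,
                      out_channels=256, in_channels=384):
    """Connections between two layers when every output neuron is linked to
    `spatial` positions across all input channels (assumed layer sizes)."""
    return width * height * out_channels * spatial * in_channels


without_gnpp = footprint(phrase_offsets=((0, 0),))  # central word only
with_gnpp = footprint()                             # central word + 4 neighbors
print(without_gnpp, total_connections(without_gnpp))
print(with_gnpp, total_connections(with_gnpp))
```

The 21 positions are a 5×5 window without its four corners, since each corner would need a diagonal neighbor to be reached.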

GNPP Increases Model Capacity
Comparison: AlexNet with GNPP vs. AlexNet with 512 channels on conv-5
  Number of connections between conv-4 and conv-5: 348.9M vs M
  Extra time cost: 1.29% vs. 9.97%
  Extra memory cost: 2.52% vs. 5.58%
  Top-1 error rate: 42.16% vs %
  Top-5 error rate: 19.24% vs %

GNPP Accelerates Network Training
GNPP allows a neuron to have a larger receptive field, so visual information is propagated faster throughout the network and, consequently, the training process converges faster

Summary
The co-occurrence of neuron responses needs to be considered explicitly
GNPP is a possible solution for this purpose
  Effective: consistently improves the accuracy
  Efficient: only requires 1.29% extra time and 2.52% extra memory costs
  Can be explained in many different ways, including improving image representation, increasing model capacity, and accelerating network training

The CNN Training Process
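In each iteration $t$, a mini-batch $\mathcal{B}_t$ is sampled from $\mathcal{S}$, and the current model parameters $\boldsymbol{\theta}_{t-1}$ are updated via Stochastic Gradient Descent (SGD):
$$\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} + \gamma_{t-1} \cdot \frac{1}{|\mathcal{B}_t|} \cdot \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{B}_t} \nabla_{\boldsymbol{\theta}_{t-1}} L(\mathbf{x}, \mathbf{y})$$
  $\gamma_{t-1}$: the current learning rate
  $L(\mathbf{x}, \mathbf{y})$: the loss function of $(\mathbf{x}, \mathbf{y})$
  Computed via gradient back-propagation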

A Review of CNN Regularization
A way of preventing over-fitting
Typical CNN regularization methods
  Weight decay: constraining the parameters with $\ell_2$-regularization
  Data augmentation: generating more training data by randomly transforming input images
  Dropout: randomly discarding a part of neuron responses in training
Introducing stochastic operation in training

A Summary of CNN Regularization
| Regularization Method | Regularization Units |
|---|---|
| Weight decay | Neuron connections (weights) |
| Data augmentation | Input layer (neurons) |
| Dropout | Hidden layer (neurons) |
| DropConnect |   |
| Stochastic Pooling | Pooling layer (neurons) |
| DisturbLabel | Loss layer (neurons) |

DisturbLabel is the first work to regularize CNN on the loss layer!

The DisturbLabel Algorithm
Working on each mini-batch independently
An extra sampling process for each data point
  Each datum is disturbed with probability $\alpha$
  $\alpha$ is named the noise rate of the algorithm
For a disturbed datum $(\mathbf{x}_n, \mathbf{y}_n)$, a new class label $\tilde{c}$ is assigned, which is distributed uniformly among $\{1, 2, \cdots, C\}$, regardless of the true label $c$. This datum is changed to $(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$, in which $\tilde{\mathbf{y}}_n$ depends on $\tilde{c}$, and sent into the network training process.
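The sampling step described here is short enough to sketch directly. The function name `disturb_labels` is hypothetical, and class labels are 0-based integers in the sketch (the slide numbers them from 1 to C). The sketch only covers label sampling for one mini-batch, not the training loop around it.

```python
import random


def disturb_labels(labels, num_classes, alpha, rng):
    """Return a copy of a mini-batch's labels after DisturbLabel sampling.

    Each label is disturbed with probability alpha; a disturbed label is
    redrawn uniformly from all classes, so it may equal the true label.
    """
    disturbed = []
    for c in labels:
        if rng.random() < alpha:
            disturbed.append(rng.randrange(num_classes))
        else:
            disturbed.append(c)
    return disturbed


# One mini-batch of labels, C = 10 classes, noise rate alpha = 10%.
rng = random.Random(0)
batch_labels = [5, 2, 3, 1, 7, 0, 9, 4]
print(disturb_labels(batch_labels, num_classes=10, alpha=0.1, rng=rng))
```

Calling the function again on the next mini-batch draws fresh noise, which matches the slides' point that each mini-batch is disturbed independently.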

A Toy Example of DisturbLabel
Each mini-batch is disturbed independently
The disturbed label may remain unchanged
[Figure: disturbed labels assigned in Batch 1 through Batch $T$]
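Since a disturbed label is redrawn uniformly over all $C$ classes, the chance that a training label ends up correct or wrong follows directly from the noise rate. A short worked version, assuming only the sampling rule stated for the algorithm (disturb with probability $\alpha$, then draw uniformly among the $C$ classes, the true one included):

```latex
P(\tilde{c} = c) = (1 - \alpha) + \frac{\alpha}{C}, \qquad
P(\tilde{c} = c') = \frac{\alpha}{C} \quad (c' \neq c)
% Example: \alpha = 0.1, C = 10 gives P(\tilde{c} = c) = 0.91
% and P(\tilde{c} = c') = 0.01 for each of the 9 wrong classes.
```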

The Effect of Noise Rate 𝛼
A proper noise rate helps to improve the accuracy, but introducing too much noise harms recognition Just like the drop-ratio in Dropout 8/9/2019 Towards Deep Understanding on CNN

DisturbLabel as Regularizer
DisturbLabel acts as a regularizer, which improves the recognition accuracy by preventing over-fitting Training error increases while testing error decreases 8/9/2019 Towards Deep Understanding on CNN

DisturbLabel as Model Ensemble
Given the original dataset $\mathcal{S} = \{(\mathbf{x}_n, \mathbf{y}_n)\}_{n=1}^{N}$ and the way of disturbing labels, we generate a family of noisy datasets $\mathcal{U} = \{(\mathcal{S}_u, \lambda_u)\}_{u=1}^{U}$, where $\mathcal{S}_u$ is the $u$-th noisy set, and $\lambda_u$ is the probability of its presence
Note that the total number $U$ of possible datasets is exponentially large, thus it is impossible to train an individual model for each of these sets, or to combine them at the testing stage

DisturbLabel as Model Ensemble (cont.)
An equivalent solution is to use mini-batches. The family of all possible mini-batches is $\mathcal{V} = \{(\mathcal{B}_v, \tau_v)\}_{v=1}^{V}$, where $\tau_v$ is the probability of the presence of the $v$-th mini-batch $\mathcal{B}_v$
  A mini-batch can be sampled from different $\mathcal{S}_u$’s
DisturbLabel samples one mini-batch following the probability distribution over the family $\mathcal{V}$, and serves as an alternative way of training the same model with different data

Cooperation with Dropout
Both Dropout and DisturbLabel ensemble models
  Dropout: different structures trained on the same data
  DisturbLabel: the same structure trained on different data

DisturbLabel as Data Augmentation
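Given a disturbed data point $(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$, its loss function value is $L(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$. We can generate a data point $(\tilde{\mathbf{x}}_n, \mathbf{y}_n)$ with the original class label preserved, and $L(\tilde{\mathbf{x}}_n, \mathbf{y}_n) \approx L(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$, so that the effect of $(\tilde{\mathbf{x}}_n, \mathbf{y}_n)$ is approximately equivalent to that of $(\mathbf{x}_n, \tilde{\mathbf{y}}_n)$
  $(\tilde{\mathbf{x}}_n, \mathbf{y}_n)$ can be considered an augmented datum
  $\tilde{\mathbf{x}}_n$ can be computed by iterative back-propagation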

Visualizing Augmented Data
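[Figure: augmented data for digits 1 to 9 visualized at different training epochs, with the error rate at each epoch. First sequence: Ep. 1 1.77%, Ep. 2 1.08%, Ep. 5 0.97%, Ep. 10 0.90%, Ep. 20 0.86%. Second sequence: Ep. 10 28.97%, Ep. 20 25.61%, Ep. 30 24.82%, Ep. 40 24.68%, Ep. 60 23.33%, Ep. 80 22.74%, Ep. 100 22.50%.]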

Application: Few Training Data
The MNIST and CIFAR10 datasets
  In all the classes, only 1% or 10% training data are preserved

| Setting | MNIST | CIFAR10 |
|---|---|---|
| 1% data, w/o DisturbLabel | 10.92 | 43.29 |
| 1% data, w/ DisturbLabel | 6.38 | 37.83 |
| 10% data, w/o DisturbLabel | 2.83 | 27.21 |
| 10% data, w/ DisturbLabel | 1.89 | 24.37 |
| 100% data, w/o DisturbLabel | 0.86 | 22.50 |
| 100% data, w/ DisturbLabel | 0.66 | 20.26 |

Application: Imbalanced Training Data
The MNIST and CIFAR10 datasets
  Except for the first class, only 1% or 10% training data are preserved

| Setting | MNIST, overall | MNIST, first class | CIFAR10, overall | CIFAR10, first class |
|---|---|---|---|---|
| 1% data, w/o DisturbLabel | 9.31 | 0.28 | 42.01 | 11.48 |
| 1% data, w/ DisturbLabel | 6.29 | 2.35 | 36.92 | 24.30 |
| 10% data, w/o DisturbLabel | 2.78 | 0.47 | 26.50 | 13.09 |
| 10% data, w/ DisturbLabel | 1.76 | 1.46 | 24.03 | 18.19 |
| 100% data, w/o DisturbLabel | 0.86 | 0.89 | 22.50 | 22.41 |
| 100% data, w/ DisturbLabel | 0.66 | 0.71 | 20.26 | 20.29 |

MNIST Experiments The MNIST dataset Handwritten digit recognition (10 classes) 60,000 training and 10,000 testing images The network structure A 2-layer LeNet (input size: 28×28) A 5-layer BigNet (input size: 24×24) DisturbLabel with 𝛼=10% 8/9/2019 Towards Deep Understanding on CNN

SVHN Experiments The SVHN dataset Street view digit recognition (10 classes) 73,257 training, 26,032 testing and 531,131 extra images (after pre-proc., 598,388 training images) The network structure A 3-layer LeNet (input size: 32×32) A 5-layer BigNet (input size: 24×24) DisturbLabel with 𝛼=10% 8/9/2019 Towards Deep Understanding on CNN

MNIST and SVHN Results

| Method | MNIST, w/o DA | MNIST, w/ DA | SVHN, w/o DA | SVHN, w/ DA |
|---|---|---|---|---|
| Wan, ICML’13 | 0.52 | 0.21 | − | 1.94 |
| Zeiler, ICLR’13 | 0.47 | − | 2.80 | − |
| Goodfellow, ICML’14 | 0.45 | − | 2.47 | − |
| Lin, ICLR’14 | 0.47 | − | 2.35 | − |
| Lee, AISTATS’15 | 0.39 | − | 1.92 | − |
| Liang, CVPR’15 | 0.31 | − | 1.77 | − |
| LeNet, no regul. | 0.86 | 0.48 | 3.93 | 3.48 |
| LeNet, + Dropout | 0.68 | 0.43 | 3.65 | 3.25 |
| LeNet, + DisturbLabel | 0.66 | 0.45 | 3.69 | 3.27 |
| LeNet, + both regul. | 0.63 | 0.41 | 3.61 | 3.21 |
| BigNet, no regul. | 0.69 | 0.39 | 2.87 | 2.35 |
| BigNet, + Dropout | 0.36 | 0.29 | 2.23 | 2.08 |
| BigNet, + DisturbLabel | 0.38 | 0.32 | 2.28 | 2.21 |
| BigNet, + both regul. | 0.33 | 0.28 | 2.19 | 2.02 |

CIFAR Experiments The CIFAR10/CIFAR100 datasets Low-resolution natural images (10 or 100 classes) 50,000 training and 10,000 testing images Uniformly distributed over 10/100 classes The network structure A 3-layer LeNet (input size: 32×32) A 5-layer BigNet (input size: 24×24) DisturbLabel with 𝛼=10% 8/9/2019 Towards Deep Understanding on CNN

CIFAR Results

| Method | CIFAR10, w/o DA | CIFAR10, w/ DA | CIFAR100, w/o DA | CIFAR100, w/ DA |
|---|---|---|---|---|
| Wan, ICML’13 | − | 9.32 | − | − |
| Zeiler, ICLR’13 | 15.13 | − | 42.51 | − |
| Goodfellow, ICML’14 | 11.68 | 9.38 | 38.57 | − |
| Lin, ICLR’14 | 10.41 | 8.81 | 35.68 | − |
| Lee, AISTATS’15 | 9.69 | 7.97 | 34.57 | − |
| Liang, CVPR’15 | 8.69 | 7.09 | 31.75 | − |
| LeNet, no regul. | 22.50 | 15.76 | 56.72 | 43.31 |
| LeNet, + Dropout | 19.42 | 14.24 | 49.08 | 41.28 |
| LeNet, + DisturbLabel | 20.26 | 14.48 | 51.83 | 41.84 |
| LeNet, + both regul. | 19.18 | 13.98 | 48.72 | 40.98 |
| BigNet, no regul. | 11.23 | 9.29 | 39.54 | 33.59 |
| BigNet, + Dropout | 9.69 | 7.08 | 33.30 | 27.05 |
| BigNet, + DisturbLabel | 9.82 | 7.93 | 34.81 | 28.39 |
| BigNet, + both regul. | 9.45 | 6.98 | 32.99 | 26.63 |

ILSVRC2012 Experiments The ILSVRC2012 dataset High-resolution natural images (1,000 classes) 1.3M training and 50K validation images Almost uniformly distributed over all classes The network structure The AlexNet 5 convolution layers, 3 pooling layers, and 3 fully-connected layers 8/9/2019 Towards Deep Understanding on CNN

ILSVRC2012 Results

| ILSVRC2012 | top-1 | top-5 |
|---|---|---|
| AlexNet, + Dropout | 43.1 | 19.9 |
| AlexNet, + both regul. | 42.8 | 19.7 |

Summary
Regularization is an important technique to prevent over-fitting in network training
DisturbLabel regularizes CNN on the loss layer
  DisturbLabel is very simple to implement
  DisturbLabel works well in a wide range of tasks
  DisturbLabel can be interpreted as an implicit way of model ensemble and/or data augmentation

The State-of-the-Art Method
Extracting deep features from a single image
  Given an image, passing it through a pre-trained network (e.g., AlexNet or VGGNet)
    Since VGGNet produces better results, we take it as the default network in the following experiments
  Extracting the intermediate output on a specified layer (e.g., pool-5 or fc-6)
  Sending the feature vectors into machine learning tools (e.g., SVM or KNN classifiers)

Improving Deep Features
For an input image, we resize it so that:
  The aspect ratio is maximally preserved
  The area (number of pixels) is approximately
  The width and height are multiples of 32 (the down-sampling rate of VGGNet)
On each layer (e.g., pool-4, pool-5, fc-6, etc.), we average the neuron responses on different spatial positions into a vector (e.g., on the pool-5 and fc-6 layers, the vectors are of 512 and 4096 dimensions, respectively)
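The resizing rule and the spatial averaging can be sketched as two small helpers. This is only an illustration: the target area is left as a parameter because its value is not given here, the 512×512 figure in the usage line is purely hypothetical, rounding to the nearest multiple of 32 is an assumed way of meeting the constraints, and the function names are made up for the sketch.

```python
import math


def resize_dims(width, height, target_area, stride=32):
    """Pick a new (width, height) whose area is close to target_area, whose
    aspect ratio stays close to the original, and whose sides are multiples
    of `stride` (32 is the down-sampling rate stated for VGGNet)."""
    scale = math.sqrt(target_area / float(width * height))
    new_w = max(stride, int(round(width * scale / stride)) * stride)
    new_h = max(stride, int(round(height * scale / stride)) * stride)
    return new_w, new_h


def spatial_average(x):
    """Average a W x H x D cube (nested list x[w][h][d]) over all spatial
    positions, giving one D-dimensional feature vector."""
    W, H, D = len(x), len(x[0]), len(x[0][0])
    return [sum(x[w][h][d] for w in range(W) for h in range(H)) / (W * H)
            for d in range(D)]


# Hypothetical target area of 512 * 512 pixels for a 640 x 480 image.
print(resize_dims(640, 480, target_area=512 * 512))
print(spatial_average([[[1.0, 2.0], [3.0, 4.0]]]))
```

Averaging over positions is what lets images of different sizes map to vectors of the same length on a given layer.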

Improving Deep Features (cont.)
This algorithm significantly improves classification accuracy, compared to the original image resizing method (resizing an image to 224×224 directly, where 224×224 is the input size of VGGNet)
  Examples (using the pool-5 features):
    On Caltech256: improved from 77.46% to 81.40%
    On SUN-397: improved from 48.19% to 55.22%
    On Flower-102: improved from 86.87% to 94.70%
We use it as the default way of feature extraction

Problems Encountered
In the improved deep feature extraction
  The input image size is relatively large (≈ )
  The receptive field of a neuron is relatively small
    On the pool-4, pool-5 and fc-6 layers, a neuron can “see” , and input pixels, respectively
Two problems occur
  A low-level neuron may not see enough visual context to make a prediction (the “small” problem)
  There may be some irrelevant low-level neurons which contaminate the image representation (the “big” problem)

Problems Encountered (cont.)
Result: low-level features are less reliable
However, we need low-level features to represent some local visual attributes (e.g., object parts)
[Figure: example images for the SMALL problem (sky, water) and the BIG problem (dog)]

Solution?
How to deal with these problems?
- For the “small” problem: low-level neurons must receive information from high-level neurons
- For the “big” problem: a weight (activeness) must be computed for each low-level neuron
In short, we need to back-propagate high-level information through the network to improve the representative ability of low-level features
This process should be unsupervised!

The InterActive Algorithm
Main goal: add a back-propagation process to deep feature extraction, so that low-level neurons can receive high-level information
Flowchart:
- Define a score function at the top level
- Back-propagate gradients to get the activeness of neuron connections
- Collect the activeness of network connections as the activeness of neurons

InterActive: an Illustration
[Figure: a network with an input layer, hidden layers 1 to 3, and an output layer]

InterActive: an Illustration
[Figure: a network annotated with the score function, the activeness of connections, and the activeness of hidden layers 3, 2, 1 and the input layer]

Related Algorithms
The back-propagation process resembles several other algorithms, such as:
- Gradient back-propagation in network training
- Visualizing the CNN [Zeiler, ECCV’14]
- Object detectors in scene CNNs [Zhou, ICLR’15]
- Top-down visual attention [Cao, ICCV’15]

Related Algorithms (cont.)
InterActive differs from these algorithms:
- We focus on generating descriptive image features, while they focus on network visualization
- We can visualize back-propagated neuron activeness, while they visualize neuron responses
- We perform back-propagation in an unsupervised way, while all the others are supervised
  - Being unsupervised, our method can generalize to many more problems with different sets of image classes

Mathematical Notations
- Let the CNN be a model $\mathbb{M}: \mathbf{h}(\mathbf{X}^{(0)}; \boldsymbol{\theta})$, where $\mathbf{X}^{(0)}$ is the input image and $\boldsymbol{\theta}$ denotes the weights of neuron connections
- $\mathbf{X}^{(l)}$ is the neuron responses on the $l$-th layer, which is a $W_l \times H_l \times D_l$ cube
- $\mathbf{x}^{(l)}$ is the average of $\mathbf{X}^{(l)}$ over all spatial positions, i.e., a $D_l$-dimensional vector:
  $$x_d^{(l)} = \frac{1}{W_l \times H_l} \sum_{w=0}^{W_l-1} \sum_{h=0}^{H_l-1} X_{w,h,d}^{(l)}$$
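The spatial average defined above can be written out directly. The sketch below assumes the response cube is stored as nested Python lists indexed as `cube[w][h][d]`; the function name `spatial_average` is illustrative only.

```python
def spatial_average(cube):
    """Average a W x H x D response cube over all spatial positions.

    `cube[w][h][d]` is the response at position (w, h) in channel d.
    Returns a list of D values, one per channel.
    """
    W, H, D = len(cube), len(cube[0]), len(cube[0][0])
    return [
        sum(cube[w][h][d] for w in range(W) for h in range(H)) / (W * H)
        for d in range(D)
    ]


# A 2 x 2 x 2 toy cube: channel 0 holds 1..4, channel 1 holds 10..40.
toy = [[[1, 10], [2, 20]], [[3, 30], [4, 40]]]
print(spatial_average(toy))
```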

The Statistics of Neuron Responses
We pass each image in Caltech256 through VGGNet and collect statistics of the neuron responses on different layers
Higher levels have better feature sparsity

The PDF of Neuron Responses
We study the PDF of neuron responses on the $T$-th layer, which is the starting point of back-propagation. We assume the following PDF:
$$f(\mathbf{x}^{(T)}) = C_p \times \exp\left\{ -\left\| \mathbf{x}^{(T)} \right\|_p^p \right\}$$
- $p$: the norm (1 or 2 in this work)
- $C_p$: the normalization coefficient
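As a worked step that follows from this assumed PDF (not shown on the slide), the gradient of the log-likelihood with respect to each top-layer response can be written in closed form, for responses away from zero:

```latex
\ln f(\mathbf{x}^{(T)}) = \ln C_p - \sum_{d} \left| x_d^{(T)} \right|^{p}
\quad\Longrightarrow\quad
\frac{\partial \ln f}{\partial x_d^{(T)}}
  = -p \left| x_d^{(T)} \right|^{p-1} \operatorname{sign}\!\left( x_d^{(T)} \right)
```

For $p = 1$ this is $-\operatorname{sign}(x_d^{(T)})$, and for $p = 2$ it is $-2 x_d^{(T)}$, so with $p = 2$ the gradient magnitude grows with the response itself.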

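The Score Function
- A popular method to produce discriminative features from generative models
- Computing the gradient of the log-likelihood with respect to the model parameters
- Given the input $\mathbf{X}^{(0)}$, we compute $\mathbf{X}^{(T)}$ and define a PDF $f_T \overset{\mathrm{def}}{=} f(\mathbf{X}^{(T)})$; the score function w.r.t. the parameters of the $l$-th layer, $\boldsymbol{\theta}^{(l)}$, is then:
  $$\frac{\partial \ln f_T}{\partial \boldsymbol{\theta}^{(l)}} = \frac{\partial \ln f_T}{\partial \mathbf{X}^{(l+1)}} \times \frac{\partial \mathbf{X}^{(l+1)}}{\partial \boldsymbol{\theta}^{(l)}}$$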

The Activeness of Connections
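The score function defines the activeness of each individual neuron connection:
$$\frac{\partial \ln f_T}{\partial \boldsymbol{\theta}^{(l)}} = \frac{\partial \ln f_T}{\partial \mathbf{X}^{(l+1)}} \times \frac{\partial \mathbf{X}^{(l+1)}}{\partial \boldsymbol{\theta}^{(l)}}$$
- $\frac{\partial \ln f_T}{\partial \mathbf{X}^{(l+1)}}$: the layer score, computed by back-propagating gradients through the network
- $\frac{\partial \mathbf{X}^{(l+1)}}{\partial \boldsymbol{\theta}^{(l)}}$: the inter-layer activeness, computed by definition ($\mathbf{X}^{(l+1)}$ is a simple function of $\boldsymbol{\theta}^{(l)}$)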
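A toy instance of this chain rule is sketched below. It rests on several assumptions that are not in the slides: a single fully connected linear layer (no convolution, no non-linearity), the start layer placed directly above it, and the PDF $f(\mathbf{x}) = C_p \exp\{-\|\mathbf{x}\|_p^p\}$. All function names are hypothetical.

```python
def layer_score(x_top, p):
    """Gradient of ln f w.r.t. the top responses, for f(x) = C_p * exp(-||x||_p^p)."""
    def sign(v):
        return (v > 0) - (v < 0)
    return [-p * abs(v) ** (p - 1) * sign(v) for v in x_top]


def connection_activeness(x_low, theta, p):
    """Activeness of every connection theta[i][j] of one linear layer.

    Forward pass: x_top[j] = sum_i theta[i][j] * x_low[i], so the
    inter-layer term d x_top[j] / d theta[i][j] is simply x_low[i].
    """
    n_in, n_out = len(x_low), len(theta[0])
    x_top = [sum(theta[i][j] * x_low[i] for i in range(n_in)) for j in range(n_out)]
    score = layer_score(x_top, p)
    # Chain rule: (layer score of target j) times (inter-layer activeness x_low[i]).
    return [[x_low[i] * score[j] for j in range(n_out)] for i in range(n_in)]


x_low = [1.0, 2.0]
theta = [[1.0, 0.0], [0.5, -1.0]]
print(connection_activeness(x_low, theta, p=2))
```

In a real CNN the layer score would come from back-propagating through every layer between the start layer and layer $l+1$; this sketch collapses that to a single step.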

The Importance of Neurons
We rewrite each neuron activeness term as:
$$\frac{\partial \ln f_T}{\partial \theta^{(l)}_{w,h,d,w',h',d'}} = x^{(l)}_{w,h,d} \times \alpha^{(l)}_{w,h,d,w',h',d'}$$
$\alpha^{(l)}_{w,h,d,w',h',d'}$: the importance of the neuron $x^{(l)}_{w,h,d}$ to the connection (weight) between $x^{(l)}_{w,h,d}$ and $x^{(l+1)}_{w',h',d'}$, i.e., $\theta^{(l)}_{w,h,d,w',h',d'}$

The Importance of Neurons (cont.)
We rewrite the entire neuron activeness as:
$$\tilde{x}^{(l)}_{w,h,d} = x^{(l)}_{w,h,d} \times \gamma^{(l)}_{w,h,d}, \qquad \gamma^{(l)}_{w,h,d} = \sum_{w',h',d'} \alpha^{(l)}_{w,h,d,w',h',d'}$$
- $\gamma^{(l)}_{w,h,d}$ is the importance of each neuron $x^{(l)}_{w,h,d}$
- To visualize $\gamma^{(l)}_{w,h,d}$, we sum it over the channels to obtain a 2D heatmap:
  $$\gamma^{(l)}_{w,h} = \sum_{d} \gamma^{(l)}_{w,h,d}$$
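The three sums on this slide can be sketched in a few lines. This is an illustration under assumed data layouts, not the authors' implementation: `alpha[i][j]` holds the importance of source neuron `i` to its connection with target neuron `j` (the spatial and channel indices are flattened into `i` and `j`), and `gamma_cube[w][h][d]` holds per-neuron importance. All names are hypothetical.

```python
def neuron_importance(alpha):
    """gamma[i]: sum of alpha[i][j] over all connections j leaving neuron i."""
    return [sum(row) for row in alpha]


def reweight(x, gamma):
    """Activeness of each neuron: its response times its importance."""
    return [xi * gi for xi, gi in zip(x, gamma)]


def heatmap(gamma_cube):
    """Collapse gamma_cube[w][h][d] over channels d into a 2D map [w][h]."""
    return [[sum(channels) for channels in column] for column in gamma_cube]


alpha = [[0.5, 0.25], [1.0, 0.0]]
gamma = neuron_importance(alpha)
print(gamma, reweight([2.0, 3.0], gamma))
print(heatmap([[[1, 2], [3, 4]]]))
```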

Different Configurations
- The norm $p$: a larger $p$ indicates higher sparsity
- The relationship between the start layer $T$ and the end layer $l$
  - The last configuration: $T = L - 1$ ($L$ is the total number of layers); the $l$-th layer receives information from the highest level of the network
  - The next configuration: $T = l + 1$; the $l$-th layer receives information only from the next layer

Visualization (cont.)
[Figure: neuron activeness heatmaps for a bird image and a flower image on layers pool-1, pool-2, conv-3-3, pool-3, conv-4-3, pool-4, conv-5-3 and pool-5; for each image, the rows show the last and next configurations with p=1 and p=2, next to the original image]

Visualization (cont.)
[Figure: neuron activeness heatmaps for a cat image and a scene image on layers pool-1, pool-2, conv-3-3, pool-3, conv-4-3, pool-4, conv-5-3 and pool-5; for each image, the rows show the last and next configurations with p=1 and p=2, next to the original image]

Discussions
- Gradient carries image weighting information
  - Back-propagation is useful at the testing stage
- The next and last configurations
  - Using the last configuration is often better
  - Richer information is considered when we propagate deeper information through the network
- The power parameter $p$
  - A larger $p$ makes the spatial weighting more concentrated (higher sparsity)
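The effect of $p$ can be illustrated at the start layer itself, where the gradient magnitude of the log-likelihood for the PDF $f(\mathbf{x}) = C_p \exp\{-\|\mathbf{x}\|_p^p\}$ is $p\,|x|^{p-1}$. The sketch below is only a toy demonstration of that one step, with made-up non-zero responses and a hypothetical function name; it does not model how weights propagate through deeper layers.

```python
def normalized_weights(responses, p):
    """Share of the total gradient magnitude assigned to each response.

    Uses |d ln f / d x| = p * |x|^(p-1); the toy inputs are non-zero,
    where this expression is well defined for p = 1 as well.
    """
    magnitudes = [p * abs(v) ** (p - 1) for v in responses]
    total = sum(magnitudes)
    return [m / total for m in magnitudes]


responses = [1.0, 2.0, 5.0]
print(normalized_weights(responses, 1))  # every position gets the same share
print(normalized_weights(responses, 2))  # shares follow the response size
```

With $p = 1$ all positions receive the same weight, while with $p = 2$ the largest response takes the largest share, which is one way to read "more concentrated".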

Experiments on Small Datasets
Six datasets are used:
- Generic classification: Caltech256
- Scene classification: MIT Indoor-67, SUN-397
- Fine-grained object recognition: Oxford Pet-37, Oxford Flower-102, Caltech-UCSD Bird-200
We evaluate the original deep features and the InterActive features on each layer individually, and then combine them

Results: Separate Layers
Classification accuracy (%):

| Model | Dims | C-256 | I-67 | S-397 | P-37 | F-102 | B-200 |
|---|---|---|---|---|---|---|---|
| pool-1 ORIG, AVG-p | 64 | 11.12 | 19.96 | 8.52 | 12.09 | 29.36 | 5.10 |
| pool-1 ORIG, MAX-p | 64 | 8.77 | 16.82 | 7.27 | 14.83 | 27.95 | 7.81 |
| pool-1 next, p=1 | 64 | 11.01 | 19.97 | 8.62 | 11.60 | 29.11 | 4.95 |
| pool-1 next, p=2 | 64 | 11.26 | 19.71 | 8.92 | 12.38 | 31.07 | 5.30 |
| pool-1 last, p=1 | 64 | 12.93 | 20.83 | 9.83 | 20.64 | 32.93 | 8.55 |
| pool-1 last, p=2 | 64 | 𝟏𝟑.𝟏𝟒 | 𝟐𝟏.𝟏𝟎 | 𝟏𝟎.𝟎𝟐 | 𝟐𝟏.𝟏𝟗 | 𝟑𝟑.𝟓𝟖 | 𝟗.𝟎𝟏 |
| pool-2 ORIG, AVG-p | 128 | 21.03 | 31.12 | 18.63 | 20.49 | 45.77 | 8.30 |
| pool-2 ORIG, MAX-p | 128 | 19.47 | 28.29 | 16.05 | 24.60 | 43.39 | 11.28 |
| pool-2 next, p=1 | 128 | 20.98 | 30.93 | 18.59 | 19.89 | 45.62 | 8.01 |
| pool-2 next, p=2 | 128 | 20.65 | 30.95 | 19.01 | 21.18 | 48.27 | 9.60 |
| pool-2 last, p=1 | 128 | 25.84 | 33.24 | 20.25 | 37.29 | 53.72 | 18.52 |
| pool-2 last, p=2 | 128 | 𝟐𝟔.𝟐𝟎 | 𝟑𝟑.𝟒𝟕 | 𝟐𝟎.𝟓𝟎 | 𝟑𝟖.𝟒𝟐 | 𝟓𝟒.𝟐𝟐 | 𝟏𝟗.𝟒𝟑 |

Results: Separate Layers (cont.)
Classification accuracy (%):

| Model | Dims | C-256 | I-67 | S-397 | P-37 | F-102 | B-200 |
|---|---|---|---|---|---|---|---|
| conv-3-3 ORIG, AVG-p | 256 | 26.44 | 36.42 | 22.73 | 27.78 | 49.70 | 10.47 |
| conv-3-3 ORIG, MAX-p | 256 | 24.18 | 33.27 | 19.71 | 31.43 | 48.02 | 13.85 |
| conv-3-3 next, p=1 | 256 | 27.29 | 36.97 | 22.84 | 28.89 | 50.62 | 10.93 |
| conv-3-3 next, p=2 | 256 | 27.62 | 37.36 | 23.41 | 30.38 | 54.06 | 12.73 |
| conv-3-3 last, p=1 | 256 | 34.50 | 39.40 | 25.84 | 49.41 | 60.53 | 24.21 |
| conv-3-3 last, p=2 | 256 | 𝟑𝟓.𝟐𝟗 | 𝟑𝟗.𝟔𝟖 | 𝟐𝟔.𝟎𝟐 | 𝟓𝟎.𝟓𝟕 | 𝟔𝟏.𝟎𝟔 | 𝟐𝟓.𝟐𝟕 |
| pool-3 ORIG, AVG-p | 256 | 29.17 | 37.98 | 23.59 | 29.88 | 52.44 | 11.00 |
| pool-3 ORIG, MAX-p | 256 | 26.53 | 34.65 | 20.83 | 33.68 | 50.93 | 13.66 |
| pool-3 next, p=1 | 256 | 29.09 | 38.12 | 24.05 | 30.08 | 52.26 | 10.89 |
| pool-3 next, p=2 | 256 | 29.55 | 38.61 | 24.31 | 31.98 | 55.06 | 12.65 |
| pool-3 last, p=1 | 256 | 36.96 | 41.02 | 26.73 | 50.91 | 62.41 | 24.58 |
| pool-3 last, p=2 | 256 | 𝟑𝟕.𝟒𝟎 | 𝟒𝟏.𝟒𝟓 | 𝟐𝟕.𝟐𝟐 | 𝟓𝟏.𝟗𝟔 | 𝟔𝟑.𝟎𝟔 | 𝟐𝟓.𝟒𝟕 |

Classification accuracy (%):

| Model | Dims | C-256 | I-67 | S-397 | P-37 | F-102 | B-200 |
|---|---|---|---|---|---|---|---|
| conv-4-3 ORIG, AVG-p | 512 | 49.62 | 59.66 | 42.03 | 55.57 | 76.98 | 21.45 |
| conv-4-3 ORIG, MAX-p | 512 | 47.73 | 55.83 | 40.10 | 59.40 | 75.72 | 23.39 |
| conv-4-3 next, p=1 | 512 | 51.83 | 60.37 | 43.59 | 59.29 | 78.54 | 25.01 |
| conv-4-3 next, p=2 | 512 | 53.52 | 60.65 | 44.17 | 63.40 | 80.48 | 31.07 |
| conv-4-3 last, p=1 | 512 | 61.62 | 62.45 | 45.43 | 75.29 | 85.91 | 52.26 |
| conv-4-3 last, p=2 | 512 | 𝟔𝟏.𝟗𝟖 | 𝟔𝟐.𝟕𝟒 | 𝟒𝟓.𝟖𝟕 | 𝟕𝟕.𝟔𝟏 | 𝟖𝟔.𝟎𝟖 | 𝟓𝟒.𝟏𝟐 |
| pool-4 ORIG, AVG-p | 512 | 60.39 | 66.49 | 49.73 | 66.76 | 85.56 | 28.56 |
| pool-4 ORIG, MAX-p | 512 | 57.92 | 62.96 | 47.29 | 69.23 | 84.39 | 30.01 |
| pool-4 next, p=1 | 512 | 60.59 | 66.48 | 49.55 | 66.28 | 85.68 | 28.40 |
| pool-4 next, p=2 | 512 | 62.06 | 66.94 | 50.10 | 72.40 | 87.36 | 37.49 |
| pool-4 last, p=1 | 512 | 68.20 | 67.20 | 51.04 | 81.04 | 91.22 | 57.41 |
| pool-4 last, p=2 | 512 | 𝟔𝟖.𝟔𝟎 | 𝟔𝟕.𝟒𝟎 | 𝟓𝟏.𝟑𝟎 | 𝟖𝟐.𝟓𝟔 | 𝟗𝟐.𝟎𝟎 | 𝟓𝟗.𝟐𝟓 |

Classification accuracy (%):

| Model | Dims | C-256 | I-67 | S-397 | P-37 | F-102 | B-200 |
|---|---|---|---|---|---|---|---|
| conv-5-3 ORIG, AVG-p | 512 | 77.40 | 74.66 | 59.47 | 88.36 | 94.03 | 55.44 |
| conv-5-3 ORIG, MAX-p | 512 | 75.93 | 71.38 | 57.03 | 87.10 | 91.30 | 55.19 |
| conv-5-3 next, p=1 | 512 | 80.31 | 𝟕𝟒.𝟖𝟎 | 59.63 | 90.29 | 94.84 | 67.64 |
| conv-5-3 next, p=2 | 512 | 80.73 | 74.52 | 𝟓𝟗.𝟕𝟒 | 𝟗𝟏.𝟓𝟔 | 95.16 | 𝟕𝟑.𝟏𝟒 |
| conv-5-3 last, p=1 | 512 | 80.77 | 73.68 | 59.10 | 90.73 | 95.40 | 69.32 |
| conv-5-3 last, p=2 | 512 | 𝟖𝟎.𝟖𝟒 | 73.58 | 58.96 | 91.19 | 𝟗𝟓.𝟕𝟎 | 69.75 |
| pool-5 ORIG, AVG-p | 512 | 81.40 | 𝟕𝟒.𝟗𝟑 | 𝟓𝟓.𝟐𝟐 | 91.78 | 94.70 | 69.72 |
| pool-5 ORIG, MAX-p | 512 | 79.61 | 71.88 | 54.04 | 89.43 | 90.01 | 68.52 |
| pool-5 next, p=1 | 512 | 81.50 | 72.70 | 53.83 | 92.01 | 95.41 | 71.96 |
| pool-5 next, p=2 | 512 | 81.58 | 72.63 | 53.57 | 𝟗𝟐.𝟑𝟎 | 95.40 | 𝟕𝟑.𝟐𝟏 |
| pool-5 last, p=1 | 512 | 81.60 | 72.58 | 53.93 | 92.20 | 𝟗𝟓.𝟒𝟑 | 72.47 |
| pool-5 last, p=2 | 512 | 𝟖𝟏.𝟔𝟖 | 72.68 | 53.79 | 92.18 | 95.41 | 72.51 |

Results: All Layers
Classification accuracy (%):

| Model | C-256 | I-67 | S-397 | P-37 | F-102 | B-200 |
|---|---|---|---|---|---|---|
| Murray et al., CVPR’14 | − | − | − | 56.8 | 84.6 | 33.3 |
| Kobayashi et al., CVPR’15 | 58.3 | 64.8 | − | − | − | 30.0 |
| Xie et al., ICCV’15 | 60.25 | 64.93 | 50.12 | 63.49 | 86.45 | 50.81 |
| Razavian et al., CVPR’14 | − | 69.0 | − | − | 86.8 | 61.8 |
| Qian et al., CVPR’15 | − | − | − | 81.18 | 89.45 | 67.86 |
| Xie et al., ICMR’15 | − | 70.13 | 54.87 | 90.03 | 86.82 | 62.02 |
| ORIG, AVG-pooling | 84.02 | 78.02 | 62.30 | 93.02 | 95.70 | 73.35 |
| ORIG, MAX-pooling | 84.38 | 77.32 | 61.87 | 93.20 | 95.98 | 74.76 |
| next, p=1 | 84.43 | 78.01 | 62.26 | 92.91 | 96.02 | 74.37 |
| next, p=2 | 84.64 | 78.23 | 62.50 | 93.22 | 96.26 | 74.61 |
| last, p=1 | 84.94 | 78.40 | 62.69 | 93.40 | 96.35 | 75.47 |
| last, p=2 | 𝟖𝟓.𝟎𝟔 | 𝟕𝟖.𝟔𝟓 | 𝟔𝟐.𝟗𝟕 | 𝟗𝟑.𝟒𝟓 | 𝟗𝟔.𝟒𝟎 | 𝟕𝟓.𝟔𝟐 |

ImageNet Experiments
- Evaluated on the ImageNet (ILSVRC2012) dataset
- Baseline: VGGNet (16-layer and 19-layer models)
- For each testing image, we forward-propagate the input signal, then compute the InterActive features on the second-to-last (fc-7) layer, and finally use the InterActive features to update the fc-8 layer and get the prediction results

Results on ILSVRC2012
We test on the 16-layer and 19-layer VGGNets, as well as the combined model (averaging the neuron responses on the output layer)

| Model | top-1 error | top-5 error |
|---|---|---|
| VGGNet-16, original | 24.2% | 7.1% |
| VGGNet-16, InterActive | 23.8% | 6.9% |
| VGGNet-19, original | 24.0% | 7.0% |
| VGGNet-19, InterActive | 23.5% | 6.7% |
| VGGNet-combined, original | 23.6% | 6.7% |
| VGGNet-combined, InterActive | 23.2% | 6.5% |
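For readers unfamiliar with the two ingredients of this table, the sketch below shows one plausible reading of "averaging the neuron responses on the output layer" and the standard top-k error. It is a toy illustration with made-up scores and hypothetical function names, not the evaluation code behind the reported numbers.

```python
def average_outputs(scores_a, scores_b):
    """Combine two models by averaging their output-layer scores class by class."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


def top_k_error(all_scores, labels, k):
    """Fraction of samples whose true label is not among the k highest scores."""
    wrong = 0
    for scores, label in zip(all_scores, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Two samples, three classes; the first sample is misclassified at top-1.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
labels = [2, 0]
print(top_k_error(scores, labels, k=1), top_k_error(scores, labels, k=2))
```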

Summary
- Gradient carries rich information
  - The inter-layer gradient: the likelihood of convolution
  - The layer score: the prior given by probability
- Back-propagation gives the chance to consider context
  - The receptive field of each neuron is enlarged
    - In the next configuration: same as the neurons in the next layer
    - In the last configuration: the entire image
  - Large benefit is obtained on low-level layers (features)
- Result: a soft spatial weighting scheme
  - Working better than both sum-pooling and max-pooling
  - Generalized to a wide range of image classification tasks

Conclusions
- Although CNN is a powerful machine learning tool, many aspects of its behavior remain unclear
- Rather than treating each neural network as a black box, we explore the inner structure and working mechanism of the model
- We hope to find useful clues for future research

Future Directions
- Explainability
- Efficiency
- Learning: supervised vs. unsupervised
- Structure: fixed vs. variable

Thanks! Questions, please?
