VALSE Webinar ICCV Pre-conference SORT & Genetic CNN Speaker: Lingxi Xie Slides available at my homepage (TALKS)! Department of Computer Science The Johns Hopkins University http://lingxixie.com/
We Focus on Image Recognition Image recognition (classification) is important: it is the most basic level of understanding an image, and data are easy to collect, enabling large-scale datasets. Recognition by itself is of limited direct use, but it helps many other tasks: instance retrieval, object detection, semantic segmentation, boundary detection, etc. all benefit from models pre-trained on a large dataset. Meanwhile, the recognition task is still developing: a single label is not enough to describe an image, and recognition is being combined with natural language processing. 11/22/2018 VALSE Webinar 2017
Brief History: Image Recognition Image recognition: a fundamental task Clearly defined, labeled data easy to obtain Development in datasets Small datasets: from two classes to a few classes Mid-level datasets: tens or hundreds of classes Current age: more than 10,000 classes [Deng, 2009] Evolution in algorithms Early years: global features, e.g., color histograms From the 2000s: local features, e.g., SIFT Current age: deep neural networks, e.g., AlexNet 11/22/2018 VALSE Webinar 2017
Key Principles: Image Recognition Principle #1: invariance. The ability to model and capture invariance determines how well features transfer; local features are often more repeatable than global features. Example: handcrafted features, from global to local. Principle #2: parameters. A large parameter count increases the risk of over-fitting. Example: neuron connectivity, from fully-connected to convolutional (partially-connected with weight sharing). Principle #3: capacity. A model with larger capacity benefits more as more data become available. Example: network structure, from shallow to deep. 11/22/2018 VALSE Webinar 2017
Deep Learning Basics Deep learning is the idea of constructing a very complicated mathematical function as a hierarchy of differentiable operations. We provide a large function space and let the data speak for themselves. The hierarchy usually appears as a network structure, and the operations are illustrated as links between neurons. People tend to believe that a network with sufficient depth and a sufficient number of neurons is able to fit any complicated function. 11/22/2018 VALSE Webinar 2017
Recognition: Background Deeper architectures AlexNet: the first deep network for large-scale recognition (8 layers) VGGNet: deeper structures (16 or 19 layers) GoogLeNet: multi-scale, multi-path (22 layers) ResNet: deeper networks with highway connections (50, 101 layers or more) DenseNet: dense layer connections (100+ layers) 11/22/2018 VALSE Webinar 2017
Recognition: Background (cont.) Towards efficient network training Basic elements: learning rate, mini-batch, momentum ReLU: a non-linear unit to prevent gradient vanishing Dropout: introducing randomness to prevent over-fitting Batch normalization: towards better numerical stability 11/22/2018 VALSE Webinar 2017
Our Work on Image Recognition Novel network modules: L. Xie et al., Towards Reversal-Invariant Image Representation, ICCV'2015, IJCV'2017; L. Xie et al., Geometric Neural Phrase Pooling: Modeling the Spatial Co-occurrence of Neurons, ECCV'2016; Y. Wang et al., SORT: Second-Order Response Transform for Visual Recognition, ICCV'2017. A new training strategy: L. Xie et al., DisturbLabel: Regularizing CNN on the Loss Layer, CVPR'2016. Automatically discovering new network structures: L. Xie et al., Genetic CNN, ICCV'2017. 11/22/2018 VALSE Webinar 2017
ICCV 2017 SORT: Second-Order Response Transform for Visual Recognition Speaker: Lingxi Xie Authors: Yan Wang, Lingxi Xie, Chenxi Liu, Siyuan Qiao, Ya Zhang, Wenjun Zhang, Qi Tian, Alan Yuille Department of Computer Science The Johns Hopkins University http://lingxixie.com/
Outline Introduction Second-Order Response Transform Experiments Conclusions and Future Work 11/22/2018 VALSE Webinar 2017
Outline Introduction Second-Order Response Transform Experiments Conclusions and Future Work 11/22/2018
Introduction Deep Learning The state-of-the-art machine learning theory Using a cascade of many layers of non-linear neurons for feature extraction and transformation Learning multiple levels of feature representation Higher-level features are derived from lower-level features to form a hierarchical architecture Multiple levels of representation correspond to different levels of abstraction 11/22/2018
Introduction (cont.) The Convolutional Neural Networks A fundamental machine learning tool Good performance in a wide range of problems in computer vision as well as other research areas Evolutions in many real-world applications Theory: a multi-layer, hierarchical network often has a larger capacity, but also requires a larger amount of data to be trained 11/22/2018
Outline Introduction Second-Order Response Transform Experiments Conclusions and Future Work 11/22/2018 VALSE Webinar 2017
Motivation The representation ability of deep neural networks comes from the composition of nonlinear functions Currently, the main source of nonlinearity comes from the ReLU (or sigmoid) activation, and the max-pooling operation We add a second-order term into the network to facilitate nonlinearity 11/22/2018 VALSE Webinar 2017
Branched Network Structures An input data cube x is fed into two parallel modules to obtain intermediate outputs F_1(x; θ_1) and F_2(x; θ_2), which are then fused into an output cube y. Example 1: in the Maxout network, F_1(x) = θ_1 x, F_2(x) = θ_2 x, and y = max(F_1(x), F_2(x)). Example 2: in the deep ResNet, F_1(x) = x, F_2(x) = θ_2' σ(θ_2 x), and y = F_1(x) + F_2(x). 11/22/2018 VALSE Webinar 2017
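As a quick reference for the two examples above, here is a minimal sketch of both fusion rules in PyTorch (the framework mentioned later for the ImageNet experiments); the function names are illustrative only, not part of any released code.

```python
import torch

def maxout_fuse(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    # Maxout-style fusion: y = max(F1(x), F2(x)), taken element-wise.
    return torch.maximum(f1, f2)

def residual_fuse(x: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    # ResNet-style fusion: F1(x) = x is the identity branch, so y = x + F2(x).
    return x + f2
```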
Formulation Adding a second-order term into the fusion of F_1(x) and F_2(x): y = F_1(x) + F_2(x) + F_1(x) ⊙ F_2(x), where ⊙ denotes the element-wise product. Implementation details: gradient back-propagation is straightforward; less than 5% extra time, and no extra memory. 11/22/2018 VALSE Webinar 2017
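To illustrate how lightweight the fusion is, here is a minimal PyTorch sketch of a two-branch block fused by SORT; the module and layer choices below are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SORTFusion(nn.Module):
    """SORT fusion: y = F1 + F2 + F1 * F2 (element-wise product)."""
    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        # Autograd handles the extra gradient paths of the product term,
        # so back-propagation needs no special treatment.
        return f1 + f2 + f1 * f2

class TwoBranchBlock(nn.Module):
    """An illustrative block with two parallel conv branches fused by SORT."""
    def __init__(self, channels: int):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
        self.branch1, self.branch2 = branch(), branch()
        self.fuse = SORTFusion()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(self.branch1(x), self.branch2(x))
```

For a residual block, the same fusion reads y = x + F(x) + x ⊙ F(x), as shown on the next slide.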
Illustration A single-branch network, after each convolution layer is replaced by a two-branch module, can be improved by SORT. For a two-branch block with branches F_1 and F_2: original fusion y_R = F_1(x) + F_2(x); SORT fusion y_S = F_1(x) + F_2(x) + F_1(x) ⊙ F_2(x). For a residual block with residual branch F: original y_R = x + F(x); SORT y_S = x + F(x) + x ⊙ F(x). (Figure: ORIGINAL vs. SORT fusion, shown for a two-branch block built from conv-1a/conv-1b and conv-2a/conv-2b layers and for a residual block built from conv-a/conv-b layers.) 11/22/2018 VALSE Webinar 2017
Benefit? What is the benefit of the second-order term? Increasing nonlinearity The roles of different orders Cross-branch gradient back-propagation Other explanations? 11/22/2018 VALSE Webinar 2017
Increasing the Nonlinearity Both ReLU and max are piecewise linear, i.e., nonlinear only on part of the input space, while a true second-order term is nonlinear over the entire input space. (Table: ResNet-20 on CIFAR10 with different combinations of F_1 + F_2, max(F_1, F_2), and F_1 ⊙ F_2 as the fusion; error rates range from 7.55 to 7.90, the second-order term alone does not converge, and the best result, 7.14, is obtained by the SORT combination of the linear sum and the second-order term.) 11/22/2018 VALSE Webinar 2017
The Role of Different Orders Linear terms help convergence: it is not recommended to use F_1 ⊙ F_2 alone. Nonlinear terms help representation ability: using a second-order term is better than using a piecewise-linear term (such as ReLU and max). A combination of linear and nonlinear terms produces the best performance. 11/22/2018 VALSE Webinar 2017
Cross-Branch Gradient Back-Prop Original form: y = F_1(x; θ_1) + F_2(x; θ_2); here ∂y/∂θ_1 only depends on θ_1, and ∂y/∂θ_2 only depends on θ_2. SORT: y = F_1(x; θ_1) + F_2(x; θ_2) + F_1(x; θ_1) ⊙ F_2(x; θ_2); now both ∂y/∂θ_1 and ∂y/∂θ_2 depend on both θ_1 and θ_2, so each branch can update its parameters using information from the other branch. 11/22/2018 VALSE Webinar 2017
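Written out element-wise, the coupling is a direct consequence of the product rule applied to the SORT formula above:

```latex
% SORT output per element i:  y_i = F_{1,i} + F_{2,i} + F_{1,i} F_{2,i}
\frac{\partial y_i}{\partial F_{1,i}} = 1 + F_{2,i}, \qquad
\frac{\partial y_i}{\partial F_{2,i}} = 1 + F_{1,i}
% Chain rule: each branch's gradient is modulated by the other branch's response.
\frac{\partial y_i}{\partial \boldsymbol{\theta}_1}
   = \left(1 + F_{2,i}\right) \frac{\partial F_{1,i}}{\partial \boldsymbol{\theta}_1}, \qquad
\frac{\partial y_i}{\partial \boldsymbol{\theta}_2}
   = \left(1 + F_{1,i}\right) \frac{\partial F_{2,i}}{\partial \boldsymbol{\theta}_2}
```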
Any Other Explanations? This is still an open problem! Possible perspectives: using a nonlinear kernel in visual recognition; gating, a popular idea in recurrent neural networks; the mask operation in attention models. 11/22/2018 VALSE Webinar 2017
Outline Introduction Second-Order Response Transform Experiments Conclusions and Future Work 11/22/2018 VALSE Webinar 2017
Small-Scale Experiments Datasets CIFAR10, CIFAR100, SVHN Networks LeNet (5 layers) BigNet (11 layers) ResNet (20 layers, 32 layers, 56 layers) WideResNet (28 layers) 11/22/2018 VALSE Webinar 2017
Small-Scale Results (error rates, %; for the networks in the lower half, each cell lists the baseline error / the error with SORT)
Network | CIFAR10 | CIFAR100 | SVHN
DSN (2014) | 7.97 | 34.57 | 1.92
r-CNN (2015) | 7.09 | 31.75 | 1.77
GePool (2016) | 6.05 | 32.37 | 1.69
WRN (2016) | 5.37 | 24.53 | 1.85
StocNet (2016) | 5.25 | 24.98 | 1.75
DenNet (2017) | 3.74 | 19.25 | 1.59
LeNet* | 11.10 / 10.34 | 36.93 / 34.75 | 2.55 / 2.39
BigNet* | 6.84 / 6.60 | 29.25 / 28.07 | 1.97 / 1.87
ResNet-20 | 7.60 / 7.14 | 30.66 / 30.19 | 2.04 / 2.01
ResNet-32 | 6.72 / 6.16 | 29.55 / 28.84 | 2.20 / 1.94
ResNet-56 | 6.00 / 5.52 | 27.55 / 26.88 | 2.22 / 1.81
WRN-28 | 4.78 / 4.00 | 22.05 / 20.94 | 1.80 / 1.52
11/22/2018 VALSE Webinar 2017
ImageNet Experiments Dataset: ILSVRC2012. Networks: AlexNet (8 layers), ResNet (18, 34, or 50 layers). The Facebook PyTorch implementation is used. 11/22/2018 VALSE Webinar 2017
ImageNet Results (error rates, %)
Network | Top-1 Error | Top-5 Error
AlexNet | 43.19 | 19.87
AlexNet* | 36.66 | 14.79
AlexNet*+SORT | 35.66 | 14.13
ResNet-18 | 30.50 | 11.07
ResNet-18+SORT | 29.95 | 10.80
ResNet-34 | 27.02 | 8.77
ResNet-34+SORT | 26.57 | 8.55
ResNet-50 | 24.10 | 7.11
ResNet-50+SORT | 23.82 | 6.72
11/22/2018 VALSE Webinar 2017
Outline Introduction Second-Order Response Transform Experiments Conclusions and Future Work 11/22/2018 VALSE Webinar 2017
Conclusions SORT: a simple idea to improve deep networks. Effective: accuracy is boosted consistently. Efficient: a lightweight operation that needs less than 2% extra time and no extra memory. Applicable to a wide range of networks. The role of different terms: first-order terms provide the basic behavior and help convergence; second-order terms provide nonlinearity. 11/22/2018 VALSE Webinar 2017
Future Work Applying SORT to the concatenation module? Inception, ResNeXt, DenseNet, etc. Adding other terms? Even higher-order, or arbitrary polynomial terms Non-polynomial terms Application to recurrent neural networks? 11/22/2018 VALSE Webinar 2017
ICCV 2017 Genetic CNN Speaker: Lingxi Xie Authors: Lingxi Xie, Alan Yuille Department of Computer Science The Johns Hopkins University http://lingxixie.com/
Outline Introduction Designing CNN Structures Genetic CNN Experiments Discussions and Conclusions 11/22/2018 VALSE Webinar 2017
Outline Introduction Designing CNN Structures Genetic CNN Experiments Discussions and Conclusions 11/22/2018 VALSE Webinar 2017
Introduction Deep Learning The state-of-the-art machine learning theory Using a cascade of many layers of non-linear neurons for feature extraction and transformation Learning multiple levels of feature representation Higher-level features are derived from lower-level features to form a hierarchical architecture Multiple levels of representation correspond to different levels of abstraction 11/22/2018 VALSE Webinar 2017
Introduction (cont.) The Convolutional Neural Networks A fundamental machine learning tool Good performance in a wide range of problems in computer vision as well as other research areas Evolutions in many real-world applications Theory: a multi-layer, hierarchical network often has a larger capacity, but also requires a larger amount of data to be trained 11/22/2018 VALSE Webinar 2017
Outline Introduction Designing CNN Structures Genetic CNN Experiments Discussions and Conclusions 11/22/2018 VALSE Webinar 2017
Designing CNN Structures History From linear to non-linear From shallow to deep From fully-connected to convolutional Today A cascade of various types of non-linear units Typical units: convolution, pooling, activation, etc. 11/22/2018 VALSE Webinar 2017
Example Networks LeNet [LeCun et al., 1998] 11/22/2018 VALSE Webinar 2017
Example Networks (cont.) AlexNet [Krizhevsky et al., 2012] 11/22/2018 VALSE Webinar 2017
Example Networks (cont.) Other deep networks VGGNet [Simonyan et al., 2014] GoogLeNet (Inception) [Szegedy et al., 2014] Deep ResNet [He et al., 2016] DenseNet [Huang et al., 2016] 11/22/2018 VALSE Webinar 2017
Problem All of these network architectures are fixed, which limits the ability and complexity of the networks. There are examples such as the Stochastic Network [Huang et al., 2016], which allows the network to skip some layers during training, but we point out that this is still a fixed structure with a stochastic training strategy. 11/22/2018 VALSE Webinar 2017
Outline Introduction Designing CNN Structures Genetic CNN Experiments Discussions and Conclusions 11/22/2018 VALSE Webinar 2017
General Idea Modeling a large family of CNN architectures as a solution space In this work, each architecture is encoded into a binary string of a fixed length Using an efficient search algorithm to explore good candidates In this work, the genetic algorithm is used 11/22/2018 VALSE Webinar 2017
The Genetic Algorithm A metaheuristic inspired by the process of natural selection that belongs to the larger class of evolutionary algorithms Commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection https://en.wikipedia.org/wiki/Genetic_algorithm 11/22/2018 VALSE Webinar 2017
The Genetic Algorithm (cont.) Typical requirements of a genetic process A genetic representation of each individual (a sample in the solution space) A function to evaluate each individual (cost function or loss function) 11/22/2018 VALSE Webinar 2017
The Genetic Algorithm (cont.) Flowchart of a genetic process Initialization: generating a population of individuals to start with Selection: determining which individuals survive Genetic operations: crossover, mutation, etc. Iteration: repeating the above two steps several times and ending the process when a termination condition holds 11/22/2018 VALSE Webinar 2017
The Genetic Algorithm (cont.) Example: the Traveling Salesman Problem (TSP) Finding the shortest Hamilton path over 𝑁 towns A typical genetic algorithm for TSP Genetic representation: a permutation of 𝑁 numbers Cost function: the total length of the current path Crossover: switching the sub-sequences in two paths Mutation: switching the position of two towns in a path Termination: after a fixed number of generations 11/22/2018 VALSE Webinar 2017
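For illustration, a tiny sketch of the cost function and swap mutation for this TSP setting (the distance matrix dist and the open-path reading of "Hamilton path" are assumptions):

```python
import random

def path_length(tour, dist):
    # Cost function: total length of the (open) Hamilton path over the towns.
    return sum(dist[tour[i]][tour[i + 1]] for i in range(len(tour) - 1))

def swap_mutation(tour):
    # Mutation: switch the positions of two towns in the path.
    t = list(tour)
    i, j = random.sample(range(len(t)), 2)
    t[i], t[j] = t[j], t[i]
    return t
```

Note that the crossover step needs a permutation-preserving operator (e.g., order crossover) so that each child still visits every town exactly once.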
General Framework Two requirements of the genetic algorithm A genetic representation: each CNN is encoded into a binary string of fixed length L An evaluation function: the network is trained from scratch and its accuracy is obtained Note: the genetic algorithm is only used to generate network structures; the network weights are always trained from scratch! 11/22/2018 VALSE Webinar 2017
General Framework (cont.) Input: # of individuals N, # of generations T, network configuration (to be detailed later), hyper-parameters (to be detailed later) Initialization: generating N random individuals Evaluating each individual by training from scratch Repeat the following process for T rounds Selection: generating N individuals with Russian roulette Crossover and mutation: generating new individuals pairwise or singly Evaluating each new individual by training from scratch Output: the population after T generations 11/22/2018 VALSE Webinar 2017
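A minimal sketch of this outer loop. The evaluate callback is assumed to decode an individual, train the resulting network from scratch, and return its validation accuracy; the select and crossover_and_mutate helpers are sketched after the later slides, and all names are illustrative.

```python
import random

def genetic_search(N, T, L, stage_slices, evaluate, p_C, p_M, q_C, q_M):
    """Genetic CNN outer loop: individuals are L-bit strings, fitness = accuracy."""
    population = [[random.randint(0, 1) for _ in range(L)] for _ in range(N)]
    fitness = [evaluate(ind) for ind in population]            # train from scratch
    for t in range(T):
        population = select(population, fitness, N)            # Russian roulette
        population = crossover_and_mutate(population, stage_slices,
                                          p_C, p_M, q_C, q_M)
        fitness = [evaluate(ind) for ind in population]        # train from scratch
    return population, fitness
```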
Encoding CNN into Binary Codes Input: the number of stages S and the number of nodes K_s in each stage Each stage is a DAG: node j can receive information from any node i with i < j There is one bit denoting whether node j takes input from node i A node sums up all of its inputs and then performs convolution There is a "source" node at the beginning of each stage, performing convolution and feeding its result to all nodes without a precedent; there is a "destination" node at the end, collecting the outputs of all nodes without a follower Output: a binary vector of length L = Σ_s ½ K_s (K_s − 1) 11/22/2018 VALSE Webinar 2017
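A small sketch of the bit counting and per-stage decoding this scheme implies; the exact bit ordering (grouped by the receiving node) is an assumption made for illustration.

```python
def code_length(node_counts):
    # Total number of bits: L = sum over stages of K_s * (K_s - 1) / 2.
    # code_length([3, 5]) == 13 and code_length([3, 4, 5]) == 19,
    # matching the MNIST and CIFAR10 settings later in the talk.
    return sum(k * (k - 1) // 2 for k in node_counts)

def decode_stage(bits, K):
    # inputs[j] lists the earlier encoded nodes feeding node j
    # (one bit per ordered pair i < j).
    inputs, idx = {j: [] for j in range(K)}, 0
    for j in range(1, K):
        for i in range(j):
            if bits[idx] == 1:
                inputs[j].append(i)
            idx += 1
    return inputs

# Example: with K = 4, the code 1-01-001 decodes to a simple chain 0 -> 1 -> 2 -> 3,
# the VGGNet-style case on the "Relationship to Popular Nets" slide.
assert decode_stage([1, 0, 1, 0, 0, 1], 4) == {0: [], 1: [0], 2: [1], 3: [2]}
```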
What is Encoded? What is encoded: the connections between layers in the same stage. What is not encoded: network weights; the number of filters at each layer; geometric information such as stride and kernel size; other layers such as pooling and activation; the fully-connected stages. 11/22/2018 VALSE Webinar 2017
Example of CNN Encoding (figure): a two-stage network over a 32×32×3 input. Stage 1: source node A0 and destination node A5 around encoded nodes A1–A4, each performing conv@32; code 1-00-111; followed by POOL1, giving 16×16×32. Stage 2: source node B0 and destination node B6 around encoded nodes B1–B5, each performing conv@64; code 0-10-000-0011; followed by POOL2, giving 8×8×64. Only the connections inside the encoding area of each stage are encoded. 11/22/2018 VALSE Webinar 2017
Relationship to Popular Nets What can be encoded: chain nets (e.g., VGGNet; with K = 4, code 1-01-001), highway nets (e.g., ResNet; with K = 3, code 1-11), and DenseNet [Huang et al., 2016]. What cannot be encoded: multi-scale nets (e.g., GoogLeNet, a.k.a. Inception) and tricky modules (e.g., Maxout). 11/22/2018 VALSE Webinar 2017
Notations N: # of individuals; T: # of rounds; S: # of stages; K_s: # of nodes at the s-th stage; L = Σ_s ½ K_s (K_s − 1): # of bits; M_{t,n}: the n-th individual in the t-th round; b_{t,n}^l ∈ {0, 1}: the l-th bit of M_{t,n}; r_{t,n}: the fitness value of M_{t,n} 11/22/2018 VALSE Webinar 2017
Initialization b_{0,n}^l ~ Bernoulli(0.5), l = 1, 2, …, L We shall see later that the initialization does not have much impact on the genetic process 11/22/2018 VALSE Webinar 2017
Selection An individual is more likely to be selected if it produces better recognition performance: the probability of selecting M_{t,n} is proportional to r_{t,n} − min_n r_{t,n} (a Russian-roulette process). The worst individual is always eliminated, and some good individuals may be selected multiple times. 11/22/2018 VALSE Webinar 2017
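A minimal sketch of this selection step (function and variable names are illustrative):

```python
import random

def select(population, fitness, N):
    """Russian-roulette selection: the probability of keeping an individual is
    proportional to its fitness minus the worst fitness, so the worst one can
    never survive while good ones may be picked several times."""
    base = min(fitness)
    weights = [f - base for f in fitness]
    if sum(weights) == 0:          # degenerate case: all individuals equally fit
        weights = [1.0] * len(population)
    return [list(ind) for ind in random.choices(population, weights=weights, k=N)]
```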
Crossover and Mutation Enumerating each pair of individuals: performing crossover with probability p_C; if a pair is not used for crossover, performing mutation with probability p_M. Crossover: swapping each stage (multiple bits) between the two individuals with probability q_C. Mutation: flipping each bit with probability q_M. 11/22/2018 VALSE Webinar 2017
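A minimal sketch of these two operations; pairing consecutive individuals and passing the per-stage bit ranges as (start, end) slices are assumptions made for illustration.

```python
import random

def crossover_and_mutate(population, stage_slices, p_C, p_M, q_C, q_M):
    """Crossover with prob. p_C (swap whole stages with prob. q_C); individuals
    in pairs not chosen for crossover are mutated with prob. p_M by flipping
    each bit with prob. q_M."""
    pop = [list(ind) for ind in population]
    for a in range(0, len(pop) - 1, 2):
        b = a + 1
        if random.random() < p_C:                      # crossover on this pair
            for start, end in stage_slices:
                if random.random() < q_C:              # swap this whole stage
                    pop[a][start:end], pop[b][start:end] = \
                        pop[b][start:end], pop[a][start:end]
        else:                                          # otherwise consider mutation
            for ind in (pop[a], pop[b]):
                if random.random() < p_M:
                    for l in range(len(ind)):
                        if random.random() < q_M:
                            ind[l] ^= 1                # flip the bit
    return pop
```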
Evaluation A training-from-scratch process on M_{t,n} If M_{t,n} has been evaluated before, it is evaluated once again and the averaged accuracy is kept To guarantee that the testing data remain unseen, we partition the original training set into two (training and validation) subsets 11/22/2018 VALSE Webinar 2017
Outline Introduction Designing CNN Structures Genetic CNN Experiments Discussions and Conclusions 11/22/2018 VALSE Webinar 2017
MNIST Experiments The MNIST dataset: 10 classes; 60,000 training images (50,000 for training, 10,000 for validation) and 10,000 testing images Network setting: S = 2, (K_1, K_2) = (3, 5); L = 13, 2^L = 8192 Genetic setting: N = 20 (individuals), T = 50 (rounds); p_M = 0.8, q_M = 0.1, p_C = 0.2, q_C = 0.3 11/22/2018 VALSE Webinar 2017
MNIST Results (table): recognition accuracy (%) per generation, reported as Max / Min / Avg / Med / Std-Dev over the population. Accuracy improves steadily with the genetic process, e.g., generation 1 gives 99.61 / 99.40 / 99.53 / 99.54 / 0.05, and by generation 50 the best individual reaches 99.66. 11/22/2018 VALSE Webinar 2017
CIFAR10 Experiments The CIFAR10 dataset: 10 classes; 50,000 training images (40,000 for training, 10,000 for validation) and 10,000 testing images Network setting: S = 3, (K_1, K_2, K_3) = (3, 4, 5); L = 19, 2^L = 524,288 Genetic setting: N = 20 (individuals), T = 50 (rounds); p_M = 0.8, q_M = 0.05, p_C = 0.2, q_C = 0.2 11/22/2018 VALSE Webinar 2017
CIFAR10 Results (recognition accuracy, %)
Gen | Max | Min | Avg | Med | Std-Dev
0 | 75.96 | 71.81 | 74.39 | 74.53 | 0.91
1 | | 73.93 | 75.01 | 75.17 | 0.57
2 | | 73.95 | 75.32 | 75.48 |
3 | 76.06 | 73.47 | 75.37 | 75.62 | 0.70
5 | 76.24 | 72.60 | | 75.65 | 0.89
8 | 76.59 | 74.75 | 75.77 | 75.86 | 0.53
10 | 76.72 | 73.92 | 75.68 | 75.80 | 0.88
20 | 76.83 | 74.91 | 76.45 | 76.79 | 0.61
30 | 76.95 | 74.38 | 76.42 | 76.53 | 0.46
50 | 77.06 | 75.84 | 76.58 | 76.81 | 0.55
11/22/2018 VALSE Webinar 2017
Diagnosis: Initialization Issues Is the genetic process sensitive to initialization? 11/22/2018 VALSE Webinar 2017
Diagnosis: Rationality Do strong parents generate strong children? 11/22/2018 VALSE Webinar 2017
Designed CNN Structures Two independent genetic processes are performed, and the best individuals after the final round are shown. Somewhat surprisingly, the learned network structures are similar across the two runs, and they contain patterns resembling well-known hand-designed networks: chain-shaped stages (as in AlexNet and VGGNet), multiple-path stages (as in GoogLeNet), and highway connections (as in deep ResNet). (Figure: the learned stage structures over nodes 1–6 with their binary codes, e.g., 1-01, 0-01-100, 1-01-100, and 0-11-101-0001.) 11/22/2018 VALSE Webinar 2017
Transferring to Other Datasets Using a basic structure learned from VGGNet. For small datasets: 3 learned stages followed by fully-connected layers, with 64, 128, 256 filters at the 3 stages. For ILSVRC2012: 2 fixed down-sampling stages, followed by 3 learned stages and then fully-connected layers, with 256, 512, 512 filters at the 3 learned stages. 11/22/2018 VALSE Webinar 2017
Experiments: SVHN and CIFAR (error rates, %)
Network | SVHN | CIFAR10 | CIFAR100
DSN [Lee et al., 2014] | 1.92 | 7.97 | 34.57
Gener. Pooling [Lee et al., 2016] | 1.69 | 6.05 | 32.37
WideResNet [Zagoruyko, 2016] | 1.85 | 5.37 | 24.53
StocNet [Huang et al., 2016] | 1.75 | 5.25 | 24.98
DenseNet [Huang et al., 2016] | 1.59 | 3.74 | 19.25
GeNet #1, after Gen-00 | 2.25 | 8.18 | 31.46
GeNet #1, after Gen-05 | 2.15 | 7.67 | 30.17
GeNet #1, after Gen-20 | 2.05 | 7.36 | 29.63
GeNet #1, after Gen-50 | 1.99 | 7.19 | 29.03
GeNet #2, after Gen-50 | 1.97 | 7.10 | 29.05
11/22/2018 VALSE Webinar 2017
Experiments: ILSVRC12 (error rates, %)
Network | Top-1 | Top-5 | Depth
AlexNet [Krizhevsky et al., 2012] | 42.6 | 19.6 | 8
GoogLeNet [Szegedy et al., 2016] | 34.2 | 12.9 | 22
VGGNet-16 [Simonyan et al., 2016] | 28.5 | 9.9 | 16
VGGNet-19 [Simonyan et al., 2016] | 28.7 | | 19
ResNet-50 [He et al., 2016] | 24.6 | 7.7 | 50
ResNet-101 [He et al., 2016] | 23.4 | 7.0 | 101
ResNet-152 [He et al., 2016] | 23.0 | 6.7 | 152
GeNet #1 | 28.12 | 9.95 |
GeNet #2 | 27.87 | 9.74 |
http://www.vlfeat.org/matconvnet/pretrained/
11/22/2018 VALSE Webinar 2017
Outline Introduction Designing CNN Structures Genetic CNN Experiments Discussions and Conclusions 11/22/2018 VALSE Webinar 2017
Limitations The genetic process is very slow The explored network structures are still of limited flexibility Our approach has not been evaluated on very deep networks (hundreds of layers) Our approach cannot jointly learn the network structure and the network weights 11/22/2018 VALSE Webinar 2017
Conclusions A genetic process to explore CNN structures Foundation: a CNN encoding scheme Observation: strong individuals tend to share strong "genes" Efficient genetic operations are performed A lot of future work remains Increasing the depth of the networks Adding more network modules Incorporating the learning of network weights 11/22/2018 VALSE Webinar 2017
Thank you! Questions please? 11/22/2018 VALSE Webinar 2017