VALSE Webinar ICCV Pre-conference: SORT & Genetic CNN

Presentation transcript:

VALSE Webinar ICCV Pre-conference: SORT & Genetic CNN. Speaker: Lingxi Xie, Department of Computer Science, The Johns Hopkins University. Slides available at my homepage (TALKS)! http://lingxixie.com/

We Focus on Image Recognition
Image recognition (classification) is important:
- It is the most basic level of understanding an image, and data collection at large scale is relatively easy.
- Recognition by itself is of limited use, but it supports other tasks: instance retrieval, object detection, semantic segmentation, boundary detection, etc. all benefit from models pre-trained on a large recognition dataset.
- Meanwhile, the recognition task itself is still developing: a single label is not enough to describe an image, and recognition is being combined with natural language processing.

Brief History: Image Recognition
Image recognition is a fundamental task: clearly defined, with labeled data that is easy to obtain.
Development in datasets:
- Small datasets: from two classes to a few classes
- Mid-level datasets: tens or hundreds of classes
- Current age: more than 10,000 classes [Deng, 2009]
Evolution in algorithms:
- Early years: global features, e.g., color histograms
- From the 2000s: local features, e.g., SIFT
- Current age: deep neural networks, e.g., AlexNet

Key Principles: Image Recognition
Principle #1: invariance. The ability to model and capture invariance determines transferability; local features are often more repeatable than global features. Example: handcrafted features evolved from global to local.
Principle #2: parameters. A large parameter count raises the risk of over-fitting. Example: neuron connectivity evolved from fully-connected to convolutional (partially-connected with weight sharing).
Principle #3: capacity. A model with large capacity benefits more from additional data. Example: network structure evolved from shallow to deep.

Deep Learning Basics
Deep learning is the idea of constructing a very complicated mathematical function as a hierarchy of differentiable operations: we provide a large function space and let the data speak for themselves. The hierarchy usually appears as a network structure, and the operations are illustrated as links between neurons. People tend to believe that a network with enough depth and a sufficient number of neurons can fit any complicated feature space.

Recognition: Background
Deeper architectures:
- AlexNet: the first deep network for large-scale recognition (8 layers)
- VGGNet: deeper structures (16 or 19 layers)
- GoogLeNet: multi-scale, multi-path (22 layers)
- ResNet: deeper networks with highway connections (50, 101 layers or more)
- DenseNet: dense layer connections (100+ layers)

Recognition: Background (cont.)
Towards efficient network training:
- Basic elements: learning rate, mini-batch size, momentum
- ReLU: a non-linear unit that alleviates gradient vanishing
- Dropout: introducing randomness to prevent over-fitting
- Batch normalization: towards better numerical stability
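To make these training elements concrete, here is a minimal PyTorch-style sketch (my own illustration, not code from the talk) of a convolutional block that combines them; the module name and hyper-parameter values are arbitrary:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Illustrative block: convolution -> batch normalization -> ReLU -> dropout."""
    def __init__(self, in_channels, out_channels, drop_prob=0.1):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)   # better numerical stability
        self.relu = nn.ReLU(inplace=True)        # non-linearity, milder gradient vanishing
        self.drop = nn.Dropout2d(p=drop_prob)    # randomness against over-fitting

    def forward(self, x):
        return self.drop(self.relu(self.bn(self.conv(x))))

# SGD with the basic elements listed above: learning rate, mini-batches, momentum
block = ConvBlock(3, 64)
optimizer = torch.optim.SGD(block.parameters(), lr=0.1, momentum=0.9)
```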

Our Work on Image Recognition
Novel network modules:
- L. Xie et al., Towards Reversal-Invariant Image Representation, ICCV 2015 / IJCV 2017
- L. Xie et al., Geometric Neural Phrase Pooling: Modeling the Spatial Co-occurrence of Neurons, ECCV 2016
- Y. Wang et al., SORT: Second-Order Response Transform for Visual Recognition, ICCV 2017
A new training strategy:
- L. Xie et al., DisturbLabel: Regularizing CNN on the Loss Layer, CVPR 2016
Automatically discovering new network structures:
- L. Xie et al., Genetic CNN, ICCV 2017

SORT: Second-Order Response Transform for Visual Recognition (ICCV 2017). Speaker: Lingxi Xie. Authors: Yan Wang, Lingxi Xie, Chenxi Liu, Siyuan Qiao, Ya Zhang, Wenjun Zhang, Qi Tian, Alan Yuille. Department of Computer Science, The Johns Hopkins University. http://lingxixie.com/

Outline: Introduction; Second-Order Response Transform; Experiments; Conclusions and Future Work

Introduction
Deep learning: the state-of-the-art machine learning approach. It uses a cascade of many layers of non-linear neurons for feature extraction and transformation, and learns multiple levels of feature representation: higher-level features are derived from lower-level features to form a hierarchical architecture, and the multiple levels of representation correspond to different levels of abstraction.

Introduction (cont.)
Convolutional neural networks: a fundamental machine learning tool with good performance on a wide range of problems in computer vision as well as other research areas, and used in many real-world applications. In theory, a multi-layer, hierarchical network often has a larger capacity, but it also requires a larger amount of data to be trained.

Outline: Introduction; Second-Order Response Transform; Experiments; Conclusions and Future Work

Motivation
The representation ability of deep neural networks comes from the composition of nonlinear functions. Currently, the main sources of nonlinearity are the ReLU (or sigmoid) activation and the max-pooling operation. We add a second-order term into the network to increase nonlinearity.

Branched Network Structures
An input data cube x is fed into two parallel modules, producing intermediate outputs F_1(x; θ_1) and F_2(x; θ_2), which are then fused into an output cube y.
Example 1: in the Maxout network, F_1(x) = θ_1 x, F_2(x) = θ_2 x, and y = max(F_1(x), F_2(x)).
Example 2: in the deep ResNet, F_1(x) = x, F_2(x) = θ_2′ σ(θ_2 x), and y = F_1(x) + F_2(x).

Formulation
Add a second-order term into the fusion of F_1(x) and F_2(x):
y = F_1(x) + F_2(x) + F_1(x) ⊙ F_2(x),
where ⊙ denotes the element-wise product.
Implementation details: gradient back-propagation is straightforward; the extra cost is less than 5% additional time and no extra memory.
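As an illustration (not the authors' released implementation), a minimal PyTorch-style sketch of the SORT fusion; `SORTTwoBranch` and its branch arguments are hypothetical names, and the two branches are assumed to output tensors of the same shape:

```python
import torch
import torch.nn as nn

def sort_fuse(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """SORT fusion: first-order sum plus an element-wise second-order term."""
    return f1 + f2 + f1 * f2   # '*' is the element-wise (Hadamard) product

class SORTTwoBranch(nn.Module):
    """Wraps two parallel branches and fuses their outputs with SORT."""
    def __init__(self, branch1: nn.Module, branch2: nn.Module):
        super().__init__()
        self.branch1, self.branch2 = branch1, branch2

    def forward(self, x):
        return sort_fuse(self.branch1(x), self.branch2(x))

# For a residual block, branch1 is the identity, giving y = x + F(x) + x * F(x).
```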

Illustration
A single-branch network, after each convolution layer is replaced by a two-branch module, can be improved by SORT.
Two-branch block (branches F_1 and F_2, each built from convolution layers): original fusion y_R = F_1(x) + F_2(x); SORT fusion y_S = F_1(x) + F_2(x) + F_1(x) ⊙ F_2(x).
Residual block (identity branch plus residual branch F): original y_R = x + F(x); SORT y_S = x + F(x) + x ⊙ F(x).

Benefit? What is the benefit of the second-order term? Increasing nonlinearity; the roles of different orders; cross-branch gradient back-propagation; other explanations?

Increasing the Nonlinearity
The ReLU and max operations are nonlinear only on a low-dimensional subset of the input space (they are piecewise linear), whereas a true second-order term is nonlinear over the entire input space.
Ablation with ResNet-20 on CIFAR10 (error, %), using different combinations of the fusion terms F_1 + F_2, max(F_1, F_2), and F_1 ⊙ F_2:
- F_1 + F_2 only: 7.60
- max(F_1, F_2) only: 7.55
- F_1 ⊙ F_2 only: does not converge
- F_1 + F_2 and max(F_1, F_2): 7.63
- F_1 + F_2 and F_1 ⊙ F_2 (SORT): 7.14
- max(F_1, F_2) and F_1 ⊙ F_2: 7.64
- all three terms: 7.90
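One way to see the distinction stated above (my own remark, consistent with the slide): the max operation is piecewise linear, while the element-wise product has non-zero curvature everywhere.

```latex
% max is piecewise linear: linear on each region where the sign of (a - b) is fixed
\max(a, b) \;=\; \tfrac{1}{2}\bigl(a + b + |a - b|\bigr),
% whereas the second-order term has a non-vanishing mixed second derivative everywhere
\frac{\partial^2 (a \cdot b)}{\partial a \, \partial b} \;=\; 1 \quad \text{for every entry pair } (a, b).
```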

The Role of Different Orders
- Linear terms help convergence: it is not recommended to use F_1 ⊙ F_2 alone.
- Nonlinear terms help representation ability: a second-order term works better than a piecewise-linear term (such as ReLU or max).
- A combination of linear and nonlinear terms produces the best performance.

Cross-Branch Gradient Back-Propagation
Original form: y = F_1(x; θ_1) + F_2(x; θ_2), so ∂y/∂θ_1 depends only on θ_1 and ∂y/∂θ_2 depends only on θ_2.
SORT: y = F_1(x; θ_1) + F_2(x; θ_2) + F_1(x; θ_1) ⊙ F_2(x; θ_2), so both ∂y/∂θ_1 and ∂y/∂θ_2 depend on both θ_1 and θ_2: each branch can update its parameters using information from the other branch.
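Writing out the gradients makes the cross-branch coupling explicit; this follows directly from the SORT formula above (element-wise notation, with 1 denoting the all-ones tensor):

```latex
% Without the second-order term, y = F_1(x;\theta_1) + F_2(x;\theta_2):
\frac{\partial \mathbf{y}}{\partial \boldsymbol{\theta}_1}
   = \frac{\partial \mathbf{F}_1}{\partial \boldsymbol{\theta}_1}
   \qquad \text{(no dependence on } \boldsymbol{\theta}_2 \text{)}.

% With SORT, y = F_1 + F_2 + F_1 \odot F_2:
\frac{\partial \mathbf{y}}{\partial \boldsymbol{\theta}_1}
   = (\mathbf{1} + \mathbf{F}_2) \odot \frac{\partial \mathbf{F}_1}{\partial \boldsymbol{\theta}_1},
\qquad
\frac{\partial \mathbf{y}}{\partial \boldsymbol{\theta}_2}
   = (\mathbf{1} + \mathbf{F}_1) \odot \frac{\partial \mathbf{F}_2}{\partial \boldsymbol{\theta}_2},
% so each gradient depends on the other branch through F_2 or F_1.
```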

Any Other Explanations? This is still an open problem. Possible connections: nonlinear kernels in visual recognition; gating, a popular idea in recurrent networks; the mask operation in attention models.

Outline: Introduction; Second-Order Response Transform; Experiments; Conclusions and Future Work

Small-Scale Experiments
Datasets: CIFAR10, CIFAR100, SVHN.
Networks: LeNet (5 layers), BigNet (11 layers), ResNet (20, 32, and 56 layers), WideResNet (28 layers).

Small-Scale Results (error, %; for our networks, the two numbers are baseline / with SORT)

Network          CIFAR10         CIFAR100        SVHN
DSN (2014)       7.97            34.57           1.92
r-CNN (2015)     7.09            31.75           1.77
GePool (2016)    6.05            32.37           1.69
WRN (2016)       5.37            24.53           1.85
StocNet (2016)   5.25            24.98           1.75
DenNet (2017)    3.74            19.25           1.59
LeNet*           11.10 / 10.34   36.93 / 34.75   2.55 / 2.39
BigNet*          6.84 / 6.60     29.25 / 28.07   1.97 / 1.87
ResNet-20        7.60 / 7.14     30.66 / 30.19   2.04 / 2.01
ResNet-32        6.72 / 6.16     29.55 / 28.84   2.20 / 1.94
ResNet-56        6.00 / 5.52     27.55 / 26.88   2.22 / 1.81
WRN-28           4.78 / 4.00     22.05 / 20.94   1.80 / 1.52

ImageNet Experiments
Dataset: ILSVRC2012.
Networks: AlexNet (8 layers), ResNet (18, 34, or 50 layers). The Facebook PyTorch implementation is used.

ImageNet Results (error, %)

Network            Top-1   Top-5
AlexNet            43.19   19.87
AlexNet*           36.66   14.79
AlexNet* + SORT    35.66   14.13
ResNet-18          30.50   11.07
ResNet-18 + SORT   29.95   10.80
ResNet-34          27.02   8.77
ResNet-34 + SORT   26.57   8.55
ResNet-50          24.10   7.11
ResNet-50 + SORT   23.82   6.72

Outline: Introduction; Second-Order Response Transform; Experiments; Conclusions and Future Work

Conclusions
SORT: a simple idea to improve deep networks.
- Effective: accuracy is boosted consistently.
- Efficient: a lightweight operation that needs less than 2% extra time and no extra memory.
- Applicable to a wide range of networks.
The roles of the different terms: first-order terms provide the basic behavior and convergence; the second-order term provides nonlinearity.

Future Work
- Applying SORT to concatenation-based modules (Inception, ResNeXt, DenseNet, etc.)?
- Adding other terms: even higher-order or arbitrary polynomial terms, or non-polynomial terms?
- Application to recurrent neural networks?

Genetic CNN (ICCV 2017). Speaker: Lingxi Xie. Authors: Lingxi Xie, Alan Yuille. Department of Computer Science, The Johns Hopkins University. http://lingxixie.com/

Outline: Introduction; Designing CNN Structures; Genetic CNN; Experiments; Discussions and Conclusions

Introduction
Deep learning: the state-of-the-art machine learning approach. It uses a cascade of many layers of non-linear neurons for feature extraction and transformation, and learns multiple levels of feature representation: higher-level features are derived from lower-level features to form a hierarchical architecture, and the multiple levels of representation correspond to different levels of abstraction.

Introduction (cont.)
Convolutional neural networks: a fundamental machine learning tool with good performance on a wide range of problems in computer vision as well as other research areas, and used in many real-world applications. In theory, a multi-layer, hierarchical network often has a larger capacity, but it also requires a larger amount of data to be trained.

Outline: Introduction; Designing CNN Structures; Genetic CNN; Experiments; Discussions and Conclusions

Designing CNN Structures
History: from linear to non-linear; from shallow to deep; from fully-connected to convolutional.
Today: a cascade of various types of non-linear units; typical units include convolution, pooling, activation, etc.

Example Networks: LeNet [LeCun et al., 1998]

Example Networks (cont.): AlexNet [Krizhevsky et al., 2012]

Example Networks (cont.)
Other deep networks: VGGNet [Simonyan et al., 2014]; GoogLeNet (Inception) [Szegedy et al., 2014]; deep ResNet [He et al., 2016]; DenseNet [Huang et al., 2016].

Problem
All of these network architectures are fixed, which limits the ability and complexity of the networks. There are examples such as the stochastic-depth network [Huang et al., 2016], which allows the network to skip some layers during training, but we point out that this is still a fixed structure with a stochastic training strategy.

Outline: Introduction; Designing CNN Structures; Genetic CNN; Experiments; Discussions and Conclusions

General Idea
Model a large family of CNN architectures as a solution space: in this work, each architecture is encoded into a binary string of a fixed length. Then use an efficient search algorithm to explore good candidates: in this work, the genetic algorithm is used.

The Genetic Algorithm
A metaheuristic inspired by the process of natural selection, belonging to the larger class of evolutionary algorithms. It is commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover, and selection. (https://en.wikipedia.org/wiki/Genetic_algorithm)

The Genetic Algorithm (cont.)
Typical requirements of a genetic process: a genetic representation of each individual (a sample in the solution space), and a function to evaluate each individual (a fitness, cost, or loss function).

The Genetic Algorithm (cont.)
Flowchart of a genetic process:
- Initialization: generate a population of individuals to start with.
- Selection: determine which individuals survive.
- Genetic operations: crossover, mutation, etc.
- Iteration: repeat the selection and genetic-operation steps several times, ending the process when a stopping condition holds.

The Genetic Algorithm (cont.)
Example: the Traveling Salesman Problem (TSP), i.e., finding the shortest Hamiltonian path over N towns. A typical genetic algorithm for TSP (see the sketch below):
- Genetic representation: a permutation of the N towns.
- Cost function: the total length of the current path.
- Crossover: exchanging sub-sequences between two paths.
- Mutation: swapping the positions of two towns in a path.
- Termination: after a fixed number of generations.
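For concreteness, a small self-contained Python sketch of such a genetic algorithm for TSP (my own illustration; the town coordinates, population size, rates, and the rank-based survivor selection are arbitrary choices, not taken from the talk):

```python
import math
import random

def path_length(path, towns):
    """Total length of an open path visiting the towns in the given order."""
    return sum(math.dist(towns[path[i]], towns[path[i + 1]]) for i in range(len(path) - 1))

def crossover(p1, p2):
    """Order crossover: copy a sub-sequence from p1, fill the rest in p2's order."""
    a, b = sorted(random.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[a:b] = p1[a:b]
    rest = [t for t in p2 if t not in child]
    for i in range(len(child)):
        if child[i] is None:
            child[i] = rest.pop(0)
    return child

def mutate(path):
    """Swap the positions of two towns."""
    i, j = random.sample(range(len(path)), 2)
    path[i], path[j] = path[j], path[i]

def tsp_ga(towns, pop_size=50, generations=200, p_mut=0.2):
    pop = [random.sample(range(len(towns)), len(towns)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: path_length(p, towns))   # keep the shorter half
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            c = crossover(*random.sample(survivors, 2))
            if random.random() < p_mut:
                mutate(c)
            children.append(c)
        pop = survivors + children
    return min(pop, key=lambda p: path_length(p, towns))

towns = [(random.random(), random.random()) for _ in range(20)]
best = tsp_ga(towns)
print(path_length(best, towns))
```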

General Framework
Two requirements of the genetic algorithm:
- A genetic representation: each CNN structure is encoded into a fixed-length (L-bit) binary code.
- An evaluation function: the network is trained from scratch and its accuracy is used as the fitness.
Note: the genetic algorithm is only used to generate network structures; the network weights are always trained from scratch.

General Framework (cont.)
Input: number of individuals N, number of generations T, network configuration (detailed later), hyper-parameters (detailed later).
Initialization: generate N random individuals and evaluate each by training it from scratch.
Repeat for T rounds:
- Selection: generate N individuals by Russian roulette.
- Crossover and mutation: generate new individuals pairwise or individually.
- Evaluation: evaluate each new individual by training it from scratch.
Output: the population after T generations.
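A schematic Python rendering of this loop (my own sketch, not the authors' implementation); `evaluate`, `select`, and `crossover_and_mutate` are placeholder callables standing in for the steps described on the surrounding slides, and the default N and T match the settings reported later:

```python
import random

def genetic_cnn_search(L, evaluate, select, crossover_and_mutate, N=20, T=50):
    """Outer loop of the genetic search over L-bit network encodings.

    evaluate(individual) -> validation accuracy (training the CNN from scratch);
    select(population, fitness, N) -> N survivors (e.g., Russian roulette);
    crossover_and_mutate(population) -> new population of the same size.
    """
    # Initialization: N random bit-strings, each bit drawn from Bernoulli(0.5)
    population = [[random.randint(0, 1) for _ in range(L)] for _ in range(N)]
    fitness = [evaluate(ind) for ind in population]

    for _ in range(T):
        population = select(population, fitness, N)
        population = crossover_and_mutate(population)
        fitness = [evaluate(ind) for ind in population]   # every individual is trained from scratch
    return population, fitness
```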

Encoding CNN into Binary Codes
Input: the number of stages S and the number of nodes K_s in each stage.
Each stage is a DAG: node j can receive input from any node i with i < j, and one bit denotes whether node j takes input from node i. A node sums up all of its inputs and then performs convolution. A default "source" node at the beginning of each stage performs convolution and feeds its output to every node without a predecessor; a default "destination" node at the end collects the outputs of every node without a successor.
Output: a binary vector of length L = Σ_s ½ K_s(K_s − 1).

What is Encoded?
Encoded: the connections between layers within the same stage.
Not encoded: the network weights; the number of filters at each layer; geometric information such as stride and size; other layers such as pooling and activation; the fully-connected stages.

Example of CNN Encoding (figure)
Stage 1: input 32×32×3, conv@32; ordinary nodes A1–A4 plus default source A0 and destination A5; code 1-00-111; POOL1 reduces the map to 16×16×32.
Stage 2: conv@64; ordinary nodes B1–B5 plus default source B0 and destination B6; code 0-10-000-0011; POOL2 reduces the map to 8×8×64.
A decoding sketch for such codes follows.
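To make the encoding concrete, a small sketch (illustrative, not the authors' code) that expands a stage code such as "1-00-111" into the list of encoded inputs of each ordinary node:

```python
def decode_stage(code: str):
    """Decode a stage code like '1-00-111' into a dict: node j -> list of input nodes i (i < j).

    Group j-1 of the code holds the j-1 incoming bits of ordinary node j (j = 2, ..., K);
    node 1 has no encoded inputs. A node without encoded inputs takes the default source
    node, and a node whose output is unused feeds the default destination node.
    """
    groups = code.split('-')
    inputs = {1: []}
    for j, bits in enumerate(groups, start=2):
        assert len(bits) == j - 1, "group %d should have %d bits" % (j - 1, j - 1)
        inputs[j] = [i + 1 for i, b in enumerate(bits) if b == '1']
    return inputs

# Stage 1 of the example: node 2 takes input from node 1; node 3 from none
# (so it connects to the source node); node 4 from nodes 1, 2 and 3.
print(decode_stage('1-00-111'))   # {1: [], 2: [1], 3: [], 4: [1, 2, 3]}
```

Applied to the document's own examples, the chain code 1-01-001 (K = 4) yields the VGGNet-like path 1→2→3→4, and 1-11 (K = 3) adds the skip connection of a ResNet-like block.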

Relationship to Popular Nets
Can be encoded: chain networks (e.g., VGGNet: K = 4, code 1-01-001); highway networks (e.g., ResNet: K = 3, code 1-11); DenseNet [Huang et al., 2016].
Cannot be encoded: multi-scale networks (e.g., GoogLeNet, a.k.a. Inception); tricky modules (e.g., Maxout).

Notations
- N: number of individuals; T: number of rounds (generations).
- S: number of stages; K_s: number of nodes at the s-th stage; L = Σ_s ½ K_s(K_s − 1): number of bits.
- M_{t,n}: the n-th individual in the t-th round.
- b^l_{t,n} ∈ {0, 1}: the l-th bit of M_{t,n}.
- r_{t,n}: the fitness value of M_{t,n}.
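As a quick consistency check against the experiment settings reported later, the code lengths for the two configurations used in this talk:

```latex
L = \sum_{s=1}^{S} \tfrac{1}{2} K_s (K_s - 1):
\qquad (K_1, K_2) = (3, 5) \;\Rightarrow\; L = 3 + 10 = 13,
\qquad (K_1, K_2, K_3) = (3, 4, 5) \;\Rightarrow\; L = 3 + 6 + 10 = 19.
```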

Initialization
Each bit is sampled independently: b^l_{0,n} ~ Bernoulli(0.5) for l = 1, 2, ..., L. We shall see later that the initialization does not have much impact on the genetic process.

Selection
An individual is more likely to be selected if it gives better recognition performance: the probability of selecting M_{t,n} is proportional to r_{t,n} − min_n′ r_{t,n′}. This is a Russian-roulette (fitness-proportional) process: the worst individual is always eliminated, and good individuals may be selected multiple times.
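A minimal sketch of this selection step, matching the description above (selection probability proportional to fitness minus the current minimum); it could serve as the `select` callable in the search-loop sketch earlier:

```python
import random

def select_russian_roulette(population, fitness, N):
    """Sample N individuals with probability proportional to (fitness - min fitness).

    The worst individual gets weight 0 and is never selected; strong individuals
    may be selected several times.
    """
    base = min(fitness)
    weights = [f - base for f in fitness]
    if sum(weights) == 0:                 # degenerate case: all fitness values equal
        weights = [1.0] * len(population)
    return [list(ind) for ind in random.choices(population, weights=weights, k=N)]
```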

Crossover and Mutation
Pairs of individuals are enumerated; each pair undergoes crossover with probability p_C, and an individual not used for crossover undergoes mutation with probability p_M.
- Crossover: exchange each stage (a group of bits) between the two parents with probability q_C.
- Mutation: flip each bit with probability q_M.
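And a matching sketch of the two genetic operators (illustrative; `stage_lengths` lists the number of bits per stage, e.g. [3, 10] for the MNIST setting below). A wrapper that applies them with the probabilities p_C, p_M, q_C, q_M would play the role of `crossover_and_mutate` in the earlier loop sketch:

```python
import random

def crossover(parent1, parent2, stage_lengths, q_c):
    """Exchange whole stages (groups of bits) between two parents, each with probability q_c."""
    child1, child2, pos = list(parent1), list(parent2), 0
    for length in stage_lengths:
        if random.random() < q_c:
            child1[pos:pos + length], child2[pos:pos + length] = \
                child2[pos:pos + length], child1[pos:pos + length]
        pos += length
    return child1, child2

def mutate(individual, q_m):
    """Flip each bit independently with probability q_m."""
    return [1 - b if random.random() < q_m else b for b in individual]
```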

Evaluation
Each individual M_{t,n} is evaluated by training the corresponding network from scratch. If M_{t,n} has been evaluated before, it is evaluated once again and the average accuracy is kept. To guarantee that the testing data remain unseen, the original training set is partitioned into training and validation subsets.

Outline: Introduction; Designing CNN Structures; Genetic CNN; Experiments; Discussions and Conclusions

MNIST Experiments
The MNIST dataset: 10 classes; 60,000 training images (50,000 for training, 10,000 for validation) and 10,000 testing images.
Network and genetic settings: S = 2, (K_1, K_2) = (3, 5), so L = 13 and 2^L = 8192; N = 20 individuals, T = 50 rounds; p_M = 0.8, q_M = 0.1, p_C = 0.2, q_C = 0.3.

MNIST Results: recognition accuracy (%) of the population at selected generations ("–" = value not available)

Gen   Max %   Min %   Avg %   Med %   Std %
0     99.59   99.38   99.50   –       0.06
1     99.61   99.40   99.53   99.54   0.05
2     99.62   99.43   99.55   99.58   –
50    99.66   99.51   99.65   –       –

CIFAR10 Experiments
The CIFAR10 dataset: 10 classes; 50,000 training images (40,000 for training, 10,000 for validation) and 10,000 testing images.
Network and genetic settings: S = 3, (K_1, K_2, K_3) = (3, 4, 5), so L = 19 and 2^L = 524,288; N = 20 individuals, T = 50 rounds; p_M = 0.8, q_M = 0.05, p_C = 0.2, q_C = 0.2.

CIFAR10 Results: recognition accuracy (%) of the population by generation ("–" = value not available)

Gen   Max %   Min %   Avg %   Med %   Std %
0     75.96   71.81   74.39   74.53   0.91
1     –       73.93   75.01   75.17   0.57
2     –       73.95   75.32   75.48   –
3     76.06   73.47   75.37   75.62   0.70
5     76.24   72.60   75.65   –       0.89
8     76.59   74.75   75.77   75.86   0.53
10    76.72   73.92   75.68   75.80   0.88
20    76.83   74.91   76.45   76.79   0.61
30    76.95   74.38   76.42   76.53   0.46
50    77.06   75.84   76.58   76.81   0.55

Diagnosis: Initialization Issues. Is the genetic process sensitive to initialization?

Diagnosis: Rationality. Do strong parents generate strong children?

Designed CNN Structures
(Figure: the best learned stage structures, with codes such as 1-01, 0-01-100, 1-01-100, and 0-11-101-0001, showing chain-shaped patterns reminiscent of AlexNet and VGGNet, multiple-path patterns reminiscent of GoogLeNet, and highway patterns reminiscent of deep ResNet.)
Two independent genetic processes were performed, and the best individuals after the final round are shown. Somewhat surprisingly, the network structures learned by the two independent runs are similar.

Transferring to Other Datasets
The overall configuration follows VGGNet, with the learned stages plugged in:
- For the small datasets: 3 learned stages followed by fully-connected layers, with 64, 128, and 256 filters at the 3 stages.
- For ILSVRC2012: 2 fixed down-sampling stages, then 3 learned stages, then fully-connected layers, with 256, 512, and 512 filters at the 3 learned stages.

Experiments: SVHN and CIFAR (error, %)

Network                                  SVHN   CIFAR10   CIFAR100
DSN [Lee et al., 2014]                   1.92   7.97      34.57
Generalized Pooling [Lee et al., 2016]   1.69   6.05      32.37
WideResNet [Zagoruyko, 2016]             1.85   5.37      24.53
StocNet [Huang et al., 2016]             1.75   5.25      24.98
DenseNet [Huang et al., 2016]            1.59   3.74      19.25
GeNet #1, after Gen-00                   2.25   8.18      31.46
GeNet #1, after Gen-05                   2.15   7.67      30.17
GeNet #1, after Gen-20                   2.05   7.36      29.63
GeNet #1, after Gen-50                   1.99   7.19      29.03
GeNet #2, after Gen-50                   1.97   7.10      29.05

Experiments: ILSVRC2012 (error, %; reference numbers from http://www.vlfeat.org/matconvnet/pretrained/)

Network                             Top-1   Top-5   Depth
AlexNet [Krizhevsky et al., 2012]   42.6    19.6    8
GoogLeNet [Szegedy et al., 2016]    34.2    12.9    22
VGGNet-16 [Simonyan et al., 2016]   28.5    9.9     16
VGGNet-19 [Simonyan et al., 2016]   28.7    –       19
ResNet-50 [He et al., 2016]         24.6    7.7     50
ResNet-101 [He et al., 2016]        23.4    7.0     101
ResNet-152 [He et al., 2016]        23.0    6.7     152
GeNet #1                            28.12   9.95    –
GeNet #2                            27.87   9.74    –

Outline: Introduction; Designing CNN Structures; Genetic CNN; Experiments; Discussions and Conclusions

Limitations
- The genetic process is very slow.
- The explored network structures are still of limited flexibility.
- Our approach has not been evaluated on very deep networks (hundreds of layers).
- Our approach cannot jointly learn the network structure and the network weights.

Conclusions
A genetic process to explore CNN structures:
- Foundation: a CNN encoding scheme.
- Observation: strong parents tend to pass good "genes" to their children.
- Efficient genetic operations are performed.
A lot of future work remains: increasing the depth of the networks, adding more network modules, and incorporating the learning of network weights.

Thank you! Questions, please?