ALL YOU NEED IS A GOOD INIT

Presentation transcript:

ALL YOU NEED IS A GOOD INIT
Dmytro Mishkin, Jiri Matas
Center for Machine Perception, Czech Technical University in Prague
Presenter: Qi Sun

Weight Initialization
Why: a proper initialization of the weights in a neural network is critical to its convergence, avoiding exploding or vanishing gradients.
How: keep the signal's variance constant across layers.
Gaussian noise with variance Var(W_l) = 2 / (n_l + n_{l+1}) (Glorot et al., 2010)
Gaussian noise with variance Var(W_l) = 2 / n_l (He et al., 2015)
Orthogonal initial conditions on the weights: SVD -> W = U or V (Saxe et al., 2013)
Data-dependent: LSUV
Other method: batch normalization (Ioffe et al., 2015), which helps relax the careful tuning of weight initialization
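A minimal NumPy sketch of the two Gaussian variance formulas above (the orthogonal and LSUV schemes are sketched under the later slides); the function names and layer shapes are illustrative assumptions, not code from the paper:

import numpy as np

def glorot_init(n_in, n_out):
    # Glorot & Bengio (2010): Var(W_l) = 2 / (n_l + n_{l+1})
    std = np.sqrt(2.0 / (n_in + n_out))
    return np.random.randn(n_in, n_out) * std

def he_init(n_in, n_out):
    # He et al. (2015), derived for ReLU units: Var(W_l) = 2 / n_l
    std = np.sqrt(2.0 / n_in)
    return np.random.randn(n_in, n_out) * std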

Weight Initialization

Layer-Sequential Unit-Variance Initialization

Pre-initialize
Pre-initialize the network with orthonormal matrices, as in Saxe et al. (2014).
Why
The fundamental action of the network is repeated matrix multiplication.
Orthonormal matrices: all the eigenvalues of an orthogonal matrix have absolute value 1, so the resulting product neither explodes nor vanishes.
How (briefly)
Initialize a matrix with a standard Gaussian distribution.
Apply Singular Value Decomposition (SVD) to the matrix.
Initialize the weight array with whichever of the resulting orthogonal factors matches the shape of the input array.
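A short sketch of the Gaussian-draw -> SVD -> orthonormal-factor recipe above; the helper name and the flattening of convolutional kernels to 2-D are assumptions for illustration, not the authors' code:

import numpy as np

def orthonormal(shape):
    # Flatten e.g. a conv kernel (out_ch, in_ch, k, k) to a 2-D matrix.
    flat = (shape[0], int(np.prod(shape[1:])))
    a = np.random.randn(*flat)                        # step 1: standard Gaussian draw
    u, _, vt = np.linalg.svd(a, full_matrices=False)  # step 2: SVD
    q = u if u.shape == flat else vt                  # step 3: keep the factor that fits the shape
    return q.reshape(shape)

# Example: a 3x3 convolution with 32 input and 64 output channels.
w_conv = orthonormal((64, 32, 3, 3))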

Iterative Procedure
Using minibatches of data, rescale the weights so that the output of each layer has unit variance.
Why
Data-driven: the normalization is performed only on the first mini-batch, so unlike batch normalization it adds no cost during training.
The similarity to batch normalization is the unit-variance normalization procedure.
How (briefly), for each layer
STEP 1: given a minibatch, compute the variance of the layer's output activations.
STEP 2: compare that variance to the target variance of 1, within a tolerance Tol_var.
If the deviation exceeds Tol_var and the maximum number of iterations has not been reached, divide the layer weights by the square root of the minibatch variance (its standard deviation) and go to STEP 1; otherwise this layer is initialized.
Variants
LSUV normalizing the input activations of each layer instead of the output ones: SAME performance.
Pre-initialization of the weights with Gaussian noise instead of orthonormal matrices: SMALL DECREASE in performance.
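A hedged PyTorch sketch of the per-layer loop described above; the function name, tolerance, iteration cap, and hook-based variance capture are illustrative assumptions rather than the authors' implementation:

import torch

@torch.no_grad()
def lsuv_layer(model, layer, batch, tol_var=0.1, max_iter=10):
    # Rescale layer.weight so its output variance on `batch` is close to 1.
    captured = {}
    handle = layer.register_forward_hook(lambda m, i, o: captured.update(out=o))
    for _ in range(max_iter):
        model(batch)                            # STEP 1: forward pass on the minibatch
        var = captured["out"].var().item()      # variance of this layer's output activations
        if abs(var - 1.0) < tol_var:            # STEP 2: close enough to the target variance 1
            break
        layer.weight /= var ** 0.5              # divide weights by the standard deviation
    handle.remove()

# Applying lsuv_layer to each Conv2d/Linear module in forward order gives the
# layer-sequential behaviour; pre-initialize the weights orthonormally first.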

Main contribution
A simple initialization procedure that leads to state-of-the-art thin and very deep neural nets.
Explores this initialization with different activation functions in very deep networks: ReLU, hyperbolic tangent, sigmoid, maxout, and VLReLU.
Motivation: the absence of a general, repeatable and efficient procedure for end-to-end training of such networks. Romero et al. (2015) stated that deep and thin networks are very hard to train by back-propagation if deeper than five layers, especially with uniform initialization.

Validation
Network architecture
Datasets: MNIST, CIFAR-10/100, ILSVRC-2012

Validation

Validation

Validation

Comparison to batch normalization
As good as a batch-normalized network
No extra computations during training

Conclusion
LSUV initialization allows learning of very deep nets via standard SGD and leads to (near) state-of-the-art results on the MNIST, CIFAR, and ImageNet datasets, outperforming sophisticated systems designed specifically for very deep nets, such as FitNets.
The proposed initialization works well with different activation functions.

Questions?