Published by Solomon Johnson
1
Understanding the Difficulty of Training Deep Feedforward Neural Networks
Qiyue Wang Oct 27, 2017
2
Outline
1. Introduction
2. Experimental setting and datasets
3. Analysis of activation functions
4. Analysis of gradients
5. Experimental validation and conclusion
3
Introduction
Paper information: Xavier Glorot and Yoshua Bengio, 13th AISTATS, 2010, Italy.
Paper purpose: deep neural networks are well motivated in theory but difficult to train in practice. The paper aims to understand the difficulty of training deep feedforward neural networks by looking at:
(1) the activation function
(2) gradient propagation (initialization)
4
Experimental Datasets
Shapeset-3 × 2: images containing 1 or 2 two-dimensional objects.
3 shape categories (triangle, parallelogram, ellipse), 9 classes.
Objects are placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale.
5
Experimental Datasets
MNIST: 50,000 training, 10,000 validation and 10,000 test examples; 28 × 28 grayscale images; 10 digit classes (0-9).
CIFAR-10: 32 × 32 color images; 10 object classes.
Small-ImageNet: 90,000 training, 10,000 validation and 10,000 test examples; 37 × 37 gray-level images; 10 object classes.
6
Experimental Datasets
Figure: example images from MNIST, CIFAR-10 and Small-ImageNet.
7
Experimental setting
1. Cost function: negative log-likelihood, $-\log P(y|x)$
2. Mini-batch size: 10
3. Gradient descent: $\theta \leftarrow \theta - \epsilon g$
4. Learning rate $\epsilon$: optimized based on validation set error
5. Activation functions:
(1) sigmoid: $\frac{1}{1 + e^{-x}}$
(2) hyperbolic tangent: $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
(3) softsign: $\frac{x}{1 + |x|}$
6. Network depth: optimized on validation data
7. Output (logistic) layer: softmax
8. Initialization: $W_{ij} \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$, where $n$ is the size of the previous layer
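The pieces of this setting can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names, the layer sizes and the seed are assumptions, and `n` is read as the number of units in the previous layer.

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^{-x}), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def softsign(x):
    # x / (1 + |x|), output in (-1, 1), approaches its asymptotes polynomially
    return x / (1.0 + np.abs(x))

def standard_init(n_in, n_out, rng):
    # W_ij ~ U[-1/sqrt(n), 1/sqrt(n)], with n taken as the previous layer size
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

def sgd_step(theta, grad, lr):
    # theta <- theta - epsilon * g
    return theta - lr * grad

rng = np.random.default_rng(0)      # seed chosen arbitrarily
W = standard_init(1000, 1000, rng)  # hypothetical layer sizes

print("sigmoid(0), tanh(0), softsign(1):", sigmoid(0.0), np.tanh(0.0), softsign(1.0))
print("empirical Var[W]:", W.var(), " theoretical 1/(3n):", 1.0 / (3 * 1000))
```

A uniform distribution on $[-a, a]$ has variance $a^2/3$, so the standard initialization gives $\mathrm{Var}[W] = 1/(3n)$, which the last line checks empirically.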
8
Effect of Activation function
Principle: two things should be avoided during training:
(1) Excessive saturation: the gradient vanishes and is not propagated well.
(2) Overly linear units: the deep hierarchy becomes pointless and high-level features are not learned well, because stacked linear layers collapse into a single linear map: $G(Wh + b) = (GW)h + Gb$
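The identity for overly linear units can be checked numerically. The sketch below uses arbitrary random matrices (the shapes and seed are assumptions) to show that two linear layers equal one linear layer, and that the equality breaks once a tanh is placed between them.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
h = rng.normal(size=4)          # hypothetical hidden vector
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
G = rng.normal(size=(2, 3))

two_layers = G @ (W @ h + b)     # G(Wh + b)
one_layer = (G @ W) @ h + G @ b  # (GW)h + Gb
with_tanh = G @ np.tanh(W @ h + b)

print("linear stack equals single linear map:", np.allclose(two_layers, one_layer))
print("still equal with tanh in between:", np.allclose(with_tanh, one_layer))
```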
9
Effect of Activation function – Sigmoid
Saturation result: evolution of the activation values (after the nonlinearity) at each hidden layer during training of a deep architecture (Shapeset-3 × 2; the other datasets give similar results).
Figure annotations: saturation begins quickly (5 epochs); saturation lasts very long (5th to 95th epoch); the network then escapes from the saturation.
10
Effect of Activation function – Sigmoid
Explanation: with random initialization, the hidden-layer outputs end up in the saturation region of the sigmoid activation function.
Hypothesis:
The output layer computes $\mathrm{softmax}(Wh + b)$.
At the beginning of training, the output depends mainly on the biases $b$, which are learned fast.
The hidden activations $h$ depend on the lower layers and tend to be pushed towards 0 at the beginning of training.
An output of 0 corresponds to saturation for the sigmoid function, but it is a good operating point for the hyperbolic tangent and softsign.
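To see why an activation value of 0 is harmful for the sigmoid but harmless for the symmetric functions, it helps to write each derivative in terms of the unit's output $a$. The helper names below are illustrative assumptions; the formulas follow directly from the definitions of the three functions.

```python
import numpy as np

def dsigmoid_from_output(a):
    # sigmoid'(x) = a * (1 - a), where a = sigmoid(x)
    return a * (1.0 - a)

def dtanh_from_output(a):
    # tanh'(x) = 1 - a^2, where a = tanh(x)
    return 1.0 - a ** 2

def dsoftsign_from_output(a):
    # softsign'(x) = 1 / (1 + |x|)^2 = (1 - |a|)^2, where a = softsign(x)
    return (1.0 - np.abs(a)) ** 2

# Unit outputs pushed close to 0
print("sigmoid, output 0.001:", dsigmoid_from_output(0.001))
print("tanh, output 0:", dtanh_from_output(0.0))
print("softsign, output 0:", dsoftsign_from_output(0.0))

# For comparison, the sigmoid's best case is at output 0.5
print("sigmoid, output 0.5:", dsigmoid_from_output(0.5))
```

A sigmoid output near 0 has an almost zero derivative, so little gradient passes through it, whereas tanh and softsign have their maximum derivative of 1 at output 0.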
11
Effect of Activation function – hyperbolic tangent & softsign
Figure: 98th percentiles (markers alone) and standard deviation (solid lines with markers) of the distribution of the activation values during training.
Figure: normalized histogram of the activation values at the end of training, with the linear area, the knee area and the saturation area marked.
12
Small Summary
With the sigmoid activation function and random initialization, training falls into saturation very quickly, stays there for a long period, and then escapes from it.
With the hyperbolic tangent and softsign activations, saturation also occurs under random initialization, for reasons that are not yet understood.
Compared with the hyperbolic tangent, activation values under softsign are distributed more in the knee areas, where the unit is non-linear and the gradient can still flow well.
13
Gradients and their propagation – cost function
The conditional log-likelihood cost function $-\log P(y|x)$ with softmax outputs works much better than the quadratic cost function on classification problems.
Explanation: its cost surface has fewer plateaus.
Figure: log-likelihood cost (black, surface on top) and quadratic cost (red, bottom surface) as a function of two weights.
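One plausible source of such plateaus can be illustrated on a single softmax layer. The sketch below is not the experiment behind the figure: it assumes the quadratic cost is applied to the softmax outputs, and it uses a hand-picked, confidently wrong prediction. It compares the gradient with respect to the pre-softmax scores under both costs.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())  # subtract the max for numerical stability
    return e / e.sum()

def grad_nll(s, y):
    # Gradient of -log softmax(s)[target] w.r.t. s, with y one-hot: p - y
    return softmax(s) - y

def grad_quadratic(s, y):
    # Gradient of 0.5 * ||softmax(s) - y||^2 w.r.t. s.
    # The softmax Jacobian is diag(p) - p p^T (symmetric).
    p = softmax(s)
    jacobian = np.diag(p) - np.outer(p, p)
    return jacobian @ (p - y)

s = np.array([10.0, 0.0, 0.0])  # scores strongly favour class 0 (assumed example)
y = np.array([0.0, 1.0, 0.0])   # the true class is 1

print("||grad|| log-likelihood:", np.linalg.norm(grad_nll(s, y)))
print("||grad|| quadratic:     ", np.linalg.norm(grad_quadratic(s, y)))
```

When the softmax is saturated on the wrong class, the log-likelihood gradient stays large, while the quadratic-cost gradient is multiplied by a nearly zero Jacobian and almost vanishes.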
14
Gradients and their propagation – a new initialization
Standard initialization: $W_{ij} \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$
Proposed new initialization (normalized initialization): $W_{ij} \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]$, where $n_j$ is the number of activation nodes in layer $j$.
Theory: based on keeping $\mathrm{Var}\left[\frac{\partial Cost}{\partial s^{i}}\right]$ and $\mathrm{Var}[z^{i}]$ invariant from layer to layer.
Hypotheses:
(1) linear activation function
(2) activation derivative at 0 is 1: $f'(0) = 1$
(3) $E[w] = 0$, $E[x] = 0$
(4) $W$ and $x$ are independent
Symbol definitions: $z^{i}$ is the input of layer $i$, $s^{i}$ is the output of layer $i$.
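The normalized initialization can be sketched as a small NumPy helper. It assumes the bound is read as $\sqrt{6}/\sqrt{n_j + n_{j+1}}$; the function name, layer sizes and seed are illustrative.

```python
import numpy as np

def normalized_init(n_in, n_out, rng):
    # W_ij ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)]
    bound = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

rng = np.random.default_rng(0)  # arbitrary seed
n_in, n_out = 800, 1200         # hypothetical layer sizes
W = normalized_init(n_in, n_out, rng)

# U[-a, a] has variance a^2 / 3, so Var[W] should be close to 2 / (n_in + n_out)
print("empirical Var[W]:", W.var())
print("target 2/(n_in + n_out):", 2.0 / (n_in + n_out))
```

The variance depends on both the fan-in and the fan-out of the layer, unlike the standard initialization, which only uses the fan-in.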
15
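Gradients and their propagation – a new initialization
Forward propagation: $z^{i+1} = f(z^{i} W^{i} + b)$. Assuming that $f$ is linear (the identity function):
$\mathrm{Var}[z^{i+1}] = n_i \, \mathrm{Var}[W^{i}] \, \mathrm{Var}[z^{i}]$
This uses the fact that, for independent variables, $\mathrm{Var}(XY) = \mathrm{Var}(X)\,\mathrm{Var}(Y)$ if $E(X) = 0$ and $E(Y) = 0$.
Back propagation (chain rule), after substitution and with $f'(0) = 1$:
$\mathrm{Var}\left[\frac{\partial Cost}{\partial s^{i}}\right] = n_{i+1} \, \mathrm{Var}[W^{i}] \, \mathrm{Var}\left[\frac{\partial Cost}{\partial s^{i+1}}\right]$
To keep information flowing similarly in both directions, we want:
$\mathrm{Var}[z^{i+1}] = \mathrm{Var}[z^{i}]$, that is $n_i \, \mathrm{Var}[W^{i}] = 1$
$\mathrm{Var}\left[\frac{\partial Cost}{\partial s^{i}}\right] = \mathrm{Var}\left[\frac{\partial Cost}{\partial s^{i+1}}\right]$, that is $n_{i+1} \, \mathrm{Var}[W^{i}] = 1$
As a compromise between these two constraints, we might want to have:
$\mathrm{Var}[W^{i}] = \frac{2}{n_i + n_{i+1}}$
Since $U[-a, a]$ has variance $a^2/3$, this gives:
$W_{ij} \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]$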
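The two variance relations, and the sense in which the normalized initialization is a compromise, can be checked on a single linear layer whose fan-in and fan-out differ. This is a simulation under the stated hypotheses (identity activation, zero-mean independent inputs); the sizes, the seed and the random stand-in for the incoming gradient are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_in, n_out, n_samples = 300, 700, 2000

# Normalized initialization: Var[W] = 2 / (n_in + n_out)
bound = np.sqrt(6.0 / (n_in + n_out))
W = rng.uniform(-bound, bound, size=(n_in, n_out))

z = rng.normal(size=(n_samples, n_in))          # zero-mean layer inputs
grad_out = rng.normal(size=(n_samples, n_out))  # stand-in for the gradient from above

# Forward: Var[z W] / Var[z] should be close to n_in * Var[W]
forward_ratio = (z @ W).var() / z.var()
# Backward: Var[grad W^T] / Var[grad] should be close to n_out * Var[W]
backward_ratio = (grad_out @ W.T).var() / grad_out.var()

print("forward ratio:", forward_ratio, " n_in * Var[W]:", n_in * W.var())
print("backward ratio:", backward_ratio, " n_out * Var[W]:", n_out * W.var())
```

With unequal layer sizes neither ratio is exactly 1: the forward ratio falls below 1 and the backward ratio rises above 1, which is the compromise between the two constraints. Both become 1 when the two sizes are equal.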
16
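Gradients and their propagation – a new initialization
Forward propagation: $\mathrm{Var}[z^{i+1}] = n_i \, \mathrm{Var}[W^{i}] \, \mathrm{Var}[z^{i}]$
Standard initialization: $W_{ij} \sim U\left[-\frac{1}{\sqrt{n}}, \frac{1}{\sqrt{n}}\right]$
Normalized initialization: $W_{ij} \sim U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}, \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right]$
Figure: normalized histograms of activation values with hyperbolic tangent activation.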
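A rough simulation gives a feel for what such activation histograms summarize. The sketch below pushes random inputs through an untrained tanh network and records the standard deviation of the activations at each layer for both schemes. The layer widths, depth, input distribution and seed are assumptions, not the configuration used for the figure.

```python
import numpy as np

def activation_stds(scheme, sizes, rng, n_samples=500):
    """Std of tanh activations at each layer of an untrained network."""
    z = rng.normal(size=(n_samples, sizes[0]))  # unit-variance inputs (assumption)
    stds = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        if scheme == "standard":
            bound = 1.0 / np.sqrt(n_in)
        else:  # "normalized"
            bound = np.sqrt(6.0 / (n_in + n_out))
        W = rng.uniform(-bound, bound, size=(n_in, n_out))
        z = np.tanh(z @ W)  # biases left at zero
        stds.append(z.std())
    return stds

sizes = [256] * 6  # five weight layers of equal width (hypothetical)
std_standard = activation_stds("standard", sizes, np.random.default_rng(0))
std_normalized = activation_stds("normalized", sizes, np.random.default_rng(0))

for layer, (a, b) in enumerate(zip(std_standard, std_normalized), start=1):
    print(f"layer {layer}: standard {a:.3f}  normalized {b:.3f}")
```

Under these assumptions the activations shrink noticeably faster from layer to layer with the standard initialization, since $n_i \, \mathrm{Var}[W^{i}] = 1/3$ there, than with the normalized one.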
17
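Gradients and their propagation – a new initialization
Back propagation: $\mathrm{Var}\left[\frac{\partial Cost}{\partial s^{i}}\right] = n_{i+1} \, \mathrm{Var}[W^{i}] \, \mathrm{Var}\left[\frac{\partial Cost}{\partial s^{i+1}}\right]$
Figure: normalized histograms of back-propagated gradients with hyperbolic tangent activation, for the standard initialization and the normalized initialization.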
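The backward relation can be explored the same way. The sketch below runs a forward pass through an untrained tanh network, injects unit-variance noise as a stand-in for the cost gradient at the top layer, and back-propagates it with the chain rule. The network shape, the noise gradient and the seed are assumptions made for illustration only.

```python
import numpy as np

def backprop_grad_stds(scheme, sizes, rng, n_samples=500):
    """Std of dCost/ds at each layer, listed from the lowest layer to the top."""
    z = rng.normal(size=(n_samples, sizes[0]))
    weights, pre_acts = [], []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        if scheme == "standard":
            bound = 1.0 / np.sqrt(n_in)
        else:  # "normalized"
            bound = np.sqrt(6.0 / (n_in + n_out))
        W = rng.uniform(-bound, bound, size=(n_in, n_out))
        s = z @ W          # pre-activation of this layer
        z = np.tanh(s)
        weights.append(W)
        pre_acts.append(s)

    # Stand-in for dCost/ds at the top layer: unit-variance noise (assumption)
    grad_s = rng.normal(size=pre_acts[-1].shape)
    stds = [grad_s.std()]
    # Chain rule: dCost/ds^i = (dCost/ds^{i+1} @ W^{i+1}.T) * tanh'(s^i)
    for W_next, s in zip(reversed(weights[1:]), reversed(pre_acts[:-1])):
        grad_s = (grad_s @ W_next.T) * (1.0 - np.tanh(s) ** 2)
        stds.append(grad_s.std())
    return stds[::-1]

sizes = [256] * 6  # hypothetical architecture
grads_standard = backprop_grad_stds("standard", sizes, np.random.default_rng(0))
grads_normalized = backprop_grad_stds("normalized", sizes, np.random.default_rng(0))

for layer, (a, b) in enumerate(zip(grads_standard, grads_normalized), start=1):
    print(f"layer {layer}: standard {a:.4f}  normalized {b:.4f}")
```

In this toy setting the gradient reaching the lowest layer is much smaller under the standard initialization than under the normalized one, which is the qualitative effect the variance relation predicts.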
18
Gradients and their propagation – a new initialization
Back propagation (chain rule).
Figure: normalized histograms of weight gradients with hyperbolic tangent activation just after initialization, for the standard initialization and the normalized initialization.
19
Error curve and conclusion
The classical neural networks with sigmoid activation and standard initialization perform rather poorly.
The softsign networks seem to be more robust to the initialization procedure than the tanh networks, presumably because of their gentler non-linearity.
For tanh networks, the proposed normalized initialization can be quite helpful, presumably because the layer-to-layer transformations maintain the magnitudes of activations (flowing forward) and gradients (flowing backward).
20
Welcome to discuss