1
Random walk initialization for training very deep feedforward networks
Wenhua Jiao 10/30/2017
2
Outline
Introduction
Analysis and Proposed Initialization
Experiment and Results
Summary
3
Introduction
Author: David Sussillo, Research Scientist @ Google Brain and Adjunct Professor at Stanford
Paper purpose: use mathematical analysis to address the vanishing gradient problem and to identify the features that affect the training of very deep networks.
4
Analysis and proposed initialization
The vanishing gradient (VG) problem has been known since the early 1990s. Because of VG, adding many extra layers to a feedforward network (FFN) usually does not improve performance. Back-propagation involves applying similar matrices repeatedly to compute the error gradient, and the outcome depends on whether the magnitudes of the leading eigenvalues tend to be greater than or less than one. Only if these magnitudes are tightly constrained can there be a useful, non-vanishing gradient, and this can be achieved by appropriate initialization.
5
Analysis and proposed initialization
Feedforward networks of the form:
$a_d = g W_d h_{d-1} + b_d, \qquad h_d = f(a_d)$
$h_d$ is the vector of hidden activations, $W_d$ the linear transformation, and $b_d$ the biases, all at depth $d$, with $d = 0, 1, \dots, D$. $f$ is an element-wise nonlinearity with $f'(0) = 1$, and $g$ is a scale factor on the matrices. Assume the network has $D$ layers and each layer has width $N$; the elements of $W_d$ are initially drawn from a Gaussian distribution with mean 0 and variance $1/N$, and the elements of $b_d$ are initialized to 0. Define $h_0$ to be the inputs, $h_D$ the outputs, and $E$ the objective function.
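The forward pass and initialization defined above are straightforward to simulate. Below is a minimal numpy sketch (not the paper's code), assuming tanh as the nonlinearity with $f'(0) = 1$; the sizes and helper names are illustrative.

```python
import numpy as np

def init_layers(N, D, rng):
    """W_d: zero-mean Gaussian entries with variance 1/N; biases b_d start at 0."""
    Ws = [rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N)) for _ in range(D)]
    bs = [np.zeros(N) for _ in range(D)]
    return Ws, bs

def forward(h0, Ws, bs, g, f=np.tanh):
    """Apply a_d = g W_d h_{d-1} + b_d, h_d = f(a_d), layer by layer."""
    h = h0
    for W, b in zip(Ws, bs):
        h = f(g * (W @ h) + b)
    return h

rng = np.random.default_rng(0)
N, D, g = 100, 200, 1.005          # illustrative sizes; g is the scale factor to be tuned
Ws, bs = init_layers(N, D, rng)
h_D = forward(rng.normal(size=N), Ws, bs, g)
```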
6
Analysis and proposed initialization
$\delta_d \equiv \left. \dfrac{\partial E}{\partial a} \right|_d = g \tilde{W}_{d+1} \delta_{d+1}$  (back-propagation equation for the gradient vector), with $\tilde{W}_d^{(i,j)} = f'(a_d^{(i)}) W_d^{(i,j)}$
$|\delta_d|^2 = g^2 z_{d+1} |\delta_{d+1}|^2$  (squared magnitude of the gradient vector), where $z_d = |\tilde{W}_d \delta_d|^2 / |\delta_d|^2$
$Z = \dfrac{|\delta_0|^2}{|\delta_D|^2} = g^{2D} \prod_{d=1}^{D} z_d$  (across-all-layers gradient magnitude)
Solving the VG problem amounts to keeping $Z$ close to 1 by appropriately adjusting $g$. The matrices $W$ and $\tilde{W}$ change during learning, so the author argues this can only be done for the initial configuration of the network, before learning has made those changes.
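These quantities are easy to check numerically. The sketch below (an illustration, not the paper's code) back-propagates a random error vector through $D$ random linear layers ($f' = 1$, so $\tilde{W}_d = W_d$), records each $z_d$, and confirms that $Z = g^{2D} \prod_d z_d$.

```python
import numpy as np

def backprop_stats(N, D, g, rng):
    delta = rng.normal(size=N)                # delta_D: error vector at the output layer
    norm_D = np.dot(delta, delta)
    zs = []
    for _ in range(D):
        W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))   # fresh random W per layer
        Wd = W @ delta
        zs.append(np.dot(Wd, Wd) / np.dot(delta, delta))     # z_d = |W delta|^2 / |delta|^2
        delta = g * Wd                                        # next back-propagated gradient
    Z = np.dot(delta, delta) / norm_D                         # Z = |delta_0|^2 / |delta_D|^2
    return Z, np.array(zs)

rng = np.random.default_rng(1)
N, D = 100, 200
g = np.exp(1.0 / (2 * N))                     # g_linear = exp(1/(2N)), about 1.005
Z, zs = backprop_stats(N, D, g, rng)
print(Z, g ** (2 * D) * zs.prod())            # the two numbers agree: Z = g^{2D} prod z_d
```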
7
Analysis and proposed initialization
Because the matrices $\tilde{W}$ are initially random, we can treat the $z_d$ as random variables, and $Z$ is then proportional to a product of random variables:
$\ln Z = D \ln g^2 + \sum_{d=1}^{D} \ln(z_d)$
We can view $\ln(Z)$ as the result of a random walk, with step $d$ of the walk given by the random variable $\ln(z_d)$. The goal of Random Walk Initialization is to choose $g$ to make $\ln(Z)$ as close to zero as possible.
8
Calculation of the optimal g values
To make analytic statements that apply to all networks, the author averages over realizations of the matrices $\tilde{W}$ applied during back-propagation:
$\langle \ln(Z) \rangle = D \left( \ln g^2 + \langle \ln(z) \rangle \right) = 0 \;\Rightarrow\; g = \exp\!\left( -\tfrac{1}{2} \langle \ln(z) \rangle \right)$
Here $z$ is a random variable determined by $z = |\tilde{W}\delta|^2 / |\delta|^2$, with $\tilde{W}$ and $\delta$ drawn from the same distributions as the $\tilde{W}_d$ and $\delta_d$ variables of the different layers of the network.
9
Calculation of the optimal g values
When $\tilde{W}$ is Gaussian, $\tilde{W}\delta / |\delta|$ is Gaussian for any vector $\delta$, so $z$ is $\chi^2$ distributed. With the $N \times N$ matrix $\tilde{W}$ having variance $1/N$, writing $z = \eta / N$, $\eta$ is distributed according to $\chi^2_N$. Expanding the logarithm in a Taylor series about $z = 1$ and using the mean and variance of this distribution:
$\langle \ln(z) \rangle \approx \langle z - 1 \rangle - \tfrac{1}{2} \langle (z - 1)^2 \rangle = -\tfrac{1}{N}$
$g_{\text{linear}} = \exp\!\left( \tfrac{1}{2N} \right)$
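A quick Monte Carlo check of the linear/Gaussian result (a sketch; the trial count and seed are arbitrary): estimate $\langle \ln z \rangle$ by sampling $z = |W\delta|^2 / |\delta|^2$, and compare it with $-1/N$ and the implied $g$ with $\exp(1/(2N))$.

```python
import numpy as np

def mean_log_z_linear(N, trials, rng):
    logs = np.empty(trials)
    for t in range(trials):
        delta = rng.normal(size=N)
        delta /= np.linalg.norm(delta)                       # unit-norm delta
        W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
        logs[t] = np.log(np.sum((W @ delta) ** 2))           # ln z, with |delta| = 1
    return logs.mean()

rng = np.random.default_rng(2)
N = 100
est = mean_log_z_linear(N, trials=20000, rng=rng)
print(est, -1.0 / N)                              # estimate of <ln z> vs the predicted -1/N
print(np.exp(-0.5 * est), np.exp(1.0 / (2 * N)))  # g = exp(-<ln z>/2) vs g_linear
```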
10
Calculation of the optimal g values
For the ReLU case, the derivative of the ReLU function sets $N - M$ rows of $\tilde{W}$ to 0 and leaves $M$ rows with Gaussian entries. $z$ is then the sum of the squares of $M$ random variables with variance $1/N$; writing $z = \eta / N$, $\eta$ is distributed according to $\chi^2_M$. Computing $\langle \ln(z) \rangle$ in a Taylor series about $\langle z \rangle = 1/2$ gives $\langle \ln(z) \rangle \approx \ln \langle z \rangle - \tfrac{2}{N}$. The author also computed $\langle \ln(z) \rangle$ numerically and fit simple analytic expressions to the results, obtaining:
$\langle \ln(z) \rangle \approx -\ln 2 - \dfrac{2.4}{\max(N, 6) - 2.4}$
$g_{\text{ReLU}} = \sqrt{2}\, \exp\!\left( \dfrac{1.2}{\max(N, 6) - 2.4} \right)$
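The two closed-form choices of $g$ are easy to package as a helper. The sketch below simply transcribes the formulas above; the function name g_random_walk is an invented convenience, not an API from the paper.

```python
import numpy as np

def g_random_walk(N, nonlinearity="linear"):
    """Suggested initial scale factor g for an N-wide layer (per the formulas above)."""
    if nonlinearity == "linear":
        return np.exp(1.0 / (2.0 * N))                          # g_linear = exp(1/(2N))
    if nonlinearity == "relu":
        return np.sqrt(2.0) * np.exp(1.2 / (max(N, 6) - 2.4))   # g_ReLU
    raise ValueError(f"unknown nonlinearity: {nonlinearity}")

print(g_random_walk(100, "linear"))   # ~1.005
print(g_random_walk(100, "relu"))     # ~1.43
```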
11
Computational verification
Sample random walks of random vectors back-propagated through a linear network, with N = 100, D = 500, and g = 1.005.
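A sketch of this kind of check under the stated settings (N = 100, D = 500, g ≈ 1.005): track $\ln |\delta_d|^2$ as a random error vector is back-propagated through D random linear layers. With $g \approx \exp(1/(2N))$ the log-norm trace should wander around its starting value rather than drift steadily up or down.

```python
import numpy as np

def random_walk_of_log_norm(N=100, D=500, g=1.005, seed=0):
    rng = np.random.default_rng(seed)
    delta = rng.normal(size=N)
    log_norms = [np.log(np.dot(delta, delta))]
    for _ in range(D):
        W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))
        delta = g * (W @ delta)                       # one linear back-propagation step
        log_norms.append(np.log(np.dot(delta, delta)))
    return np.array(log_norms)

walk = random_walk_of_log_norm()
print(walk[0], walk[-1])    # start and end of the walk; no systematic drift expected
```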
12
Computational verification
Predicted g values as a function of the layer width N and the nonlinearity, with D = 200.
13
Computational verification
The growth of the magnitude of $\delta_0$ relative to $\delta_D$ for fixed N = 100, as a function of the scaling parameter g and the nonlinearity.
14
Experiment and Results
A slight adjustment to g may be helpful, because most real-world data are far from normally distributed. The initial scaling of the final output layer may also need to be adjusted separately, since the back-propagating errors are affected by the initialization of that layer. Random Walk Initialization therefore requires tuning three parameters: the input scaling (or $g_1$), $g_D$, and $g$; the first two handle transient effects of the inputs and errors, and the last tunes the network as a whole. By far the most important of the three is $g$.
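A minimal sketch of how these three quantities might be laid out in a fully connected stack, assuming the input scaling is folded into the first layer's weight scale; the helper below and the specific values used for $g_1$ and $g_D$ are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def init_scaled_stack(layer_sizes, g1, g, gD, rng):
    """Gaussian weights with variance 1/N_in per entry, times a per-layer scale:
    g1 on the first (input) layer, g on the hidden layers, gD on the output layer."""
    scales = [g1] + [g] * (len(layer_sizes) - 3) + [gD]
    Ws = []
    for (n_in, n_out), s in zip(zip(layer_sizes[:-1], layer_sizes[1:]), scales):
        Ws.append(s * rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in)))
    return Ws

rng = np.random.default_rng(3)
# 30 hidden layers of width 100; g1 = 0.5 and gD = 0.1 are placeholder values to be tuned.
Ws = init_scaled_stack([784] + [100] * 30 + [10], g1=0.5, g=1.005, gD=0.1, rng=rng)
```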
15
Experiment and Results
Experiments were run on both the MNIST and TIMIT datasets with a standard FFN. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The TIMIT speech corpus contains a total of 6,300 sentences: 10 sentences spoken by each of 630 speakers drawn from 8 major dialect regions of the USA; 70% of the speakers are male and 30% are female.
16
Experiment and Results
Is training on a real-world dataset affected by choosing g according to Random Walk Initialization?
17
Experiment and Results
Does increased depth actually help to decrease the objective function?
18
Summary
With the proposed g values, networks can be successfully trained on real datasets at depths upwards of 200 layers. Increasing N decreases the fluctuations in the norm of the back-propagated errors. Learning rate scheduling made a huge difference in performance for very deep networks.