Natural Gradient Works Efficiently in Learning, by S. Amari. (Fri) Computational Modeling of Intelligence. Summarized by Joon Shik Kim
Abstract When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. The dynamical behavior of natural gradient online learning is analyzed and proved to be Fisher efficient. The plateau phenomenon, which appears in backpropagation learning of multilayer perceptrons, might disappear or become less serious when the natural gradient is used.
Introduction (1/2) The stochastic gradient method is a popular learning method in the general nonlinear optimization framework. In many cases the parameter space is not Euclidean but has a Riemannian metric structure. In these cases, the ordinary gradient does not give the steepest direction of the target function.
Introduction (2/2) Barkai, Seung, and Sompolinsky (1995) proposed an adaptive method of adjusting the learning rate. We generalize their idea and evaluate its performance based on the Riemannian metric of errors.
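Natural Gradient (1/5) In an orthonormal (Euclidean) coordinate system, the squared length of a small incremental vector dw is $|dw|^2 = \sum_i (dw_i)^2$. When the coordinate system is nonorthogonal, the squared length is given by the quadratic form $|dw|^2 = \sum_{i,j} g_{ij}(w)\, dw_i\, dw_j$, where $G = (g_{ij})$ is the Riemannian metric tensor.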
Natural Gradient (2/5) The steepest descent direction of a function L(w) at w is defined by the vector dw that minimizes L(w+dw) where |dw| has a fixed length, that is, under the constraint $|dw|^2 = \varepsilon^2$ for a sufficiently small constant $\varepsilon$.
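Natural Gradient (3/5) The steepest descent direction of L(w) in a Riemannian space is given by $-\tilde{\nabla} L(w) = -G^{-1}(w)\,\nabla L(w)$, where $G^{-1} = (g^{ij})$ is the inverse of the metric $G = (g_{ij})$ and $\nabla L$ is the conventional gradient. $\tilde{\nabla} L$ is called the natural gradient of L.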
Natural Gradient (4/5)
Natural Gradient (5/5)
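The steepest-descent claim can be checked numerically. The following is a minimal sketch, not taken from the paper: the metric `G`, the gradient `grad`, and the helper names are hypothetical values chosen for illustration. It rescales the ordinary and the natural gradient steps to the same length under the metric and compares the first-order decrease of L.

```python
import numpy as np

def natural_gradient(grad, G):
    """Return G^{-1} grad by solving G x = grad (avoids forming the inverse)."""
    return np.linalg.solve(G, grad)

def metric_norm(v, G):
    """Length of v under the metric G: sqrt(v^T G v)."""
    return float(np.sqrt(v @ G @ v))

# Hypothetical metric and gradient at some point w (assumed for illustration).
G = np.array([[4.0, 1.0],
              [1.0, 1.0]])
grad = np.array([1.0, 2.0])
eps = 1e-3

# Ordinary gradient step, rescaled so that |dw| = eps under G.
d_ord = -grad
d_ord = eps * d_ord / metric_norm(d_ord, G)

# Natural gradient step, rescaled to the same length under G.
d_nat = -natural_gradient(grad, G)
d_nat = eps * d_nat / metric_norm(d_nat, G)

# First-order change of L for each step: grad . dw
change_ord = float(grad @ d_ord)
change_nat = float(grad @ d_nat)
print(change_ord, change_nat)
```

With both steps constrained to the same length under G, the natural gradient step should give the larger decrease; the two coincide only when G is proportional to the identity.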
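Natural Gradient Learning Risk function or average loss: $L(w) = E[l(z, w)]$. Learning is a procedure to search for the optimal w* that minimizes L(w). Stochastic gradient descent learning: $w_{t+1} = w_t - \eta_t\, C(w_t)\, \nabla l(z_t, w_t)$, where $\eta_t$ is the learning rate and C(w) is a suitably chosen positive definite matrix. The choice $C(w) = G^{-1}(w)$ gives natural gradient online learning.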
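An online update of this kind can be sketched as follows. This is an illustrative example under stated assumptions, not the paper's experiment: the squared loss, the linear teacher `w_true`, the learning-rate schedule, and the function names are all hypothetical. Passing `C=None` gives ordinary stochastic gradient descent; passing an inverse metric would give a natural gradient step.

```python
import numpy as np

def online_step(w, z, grad_loss, eta, C=None):
    """One update w <- w - eta * C @ grad l(z, w); C=None means the identity."""
    g = grad_loss(z, w)
    return w - eta * (g if C is None else C @ g)

def grad_sq_loss(z, w):
    """Gradient of the assumed loss l(z, w) = 0.5 * (y - w.x)^2 with z = (x, y)."""
    x, y = z
    return -(y - w @ x) * x

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])   # hypothetical teacher parameter
w = np.zeros(2)
for t in range(1, 5001):
    x = rng.normal(size=2)
    y = w_true @ x + 0.1 * rng.normal()
    # Decaying learning rate; the offset 10 is an arbitrary stabilizing choice.
    w = online_step(w, (x, y), grad_sq_loss, eta=1.0 / (t + 10))
print(w)
```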
Statistical Estimation of Probability Density Function (1/2) In the case of statistical estimation, we assume a statistical model {p(z,w)}, and the problem is to obtain the probability distribution in the model that best approximates the unknown density function q(z). The loss function is the negative log likelihood, $l(z, w) = -\log p(z, w)$.
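Statistical Estimation of Probability Density Function (2/2) The expected loss is then given by $L(w) = -E[\log p(z, w)] = E_q\!\left[\log \frac{q(z)}{p(z, w)}\right] + H_Z$, where $H_Z$ is the entropy of q(z), which does not depend on w. Minimizing L(w) is therefore equivalent to minimizing the Kullback-Leibler divergence from q(z) to p(z,w). The Riemannian metric is the Fisher information, $g_{ij}(w) = E\!\left[\frac{\partial \log p(z, w)}{\partial w_i}\,\frac{\partial \log p(z, w)}{\partial w_j}\right]$.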
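The Fisher information can be estimated by averaging outer products of the score vector over samples. The sketch below assumes a one-dimensional Gaussian model with parameters (mu, sigma), which is a hypothetical choice for illustration and not a model from the slides; for that model the exact Fisher matrix is diag(1/sigma^2, 2/sigma^2), which gives something to compare against.

```python
import numpy as np

def gaussian_score(z, mu, sigma):
    """Gradient of log p(z; mu, sigma) w.r.t. (mu, sigma), one row per sample."""
    d_mu = (z - mu) / sigma**2
    d_sigma = ((z - mu) ** 2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma], axis=1)

def empirical_fisher(scores):
    """Monte Carlo estimate of E[s s^T] from an (N, d) array of score vectors."""
    return scores.T @ scores / scores.shape[0]

rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0                      # assumed true parameters
z = rng.normal(mu, sigma, size=200_000)   # samples drawn from the model itself

G_hat = empirical_fisher(gaussian_score(z, mu, sigma))
G_exact = np.diag([1 / sigma**2, 2 / sigma**2])
print(G_hat)
print(G_exact)
```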
Fisher Information as the Metric of Kullback-Leibler Divergence (1/2) p = q(θ + h)
Fisher Information as the Metric of Kullback-Leibler Divergence (2/2) I: Fisher information
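The standard second-order relation between the two quantities, D(q(θ) || q(θ+h)) ≈ ½ hᵀ I(θ) h for small h, can be checked numerically. The sketch below assumes a Bernoulli model with parameter θ, chosen here only because both the divergence and the Fisher information I(θ) = 1/(θ(1−θ)) have closed forms; the function names are hypothetical.

```python
import numpy as np

def kl_bernoulli(a, b):
    """KL divergence D(Bernoulli(a) || Bernoulli(b))."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def fisher_bernoulli(theta):
    """Fisher information of a Bernoulli model with success probability theta."""
    return 1.0 / (theta * (1.0 - theta))

theta = 0.3
ratios = {}
for h in (1e-1, 1e-2, 1e-3):
    exact = kl_bernoulli(theta, theta + h)
    approx = 0.5 * fisher_bernoulli(theta) * h**2
    ratios[h] = exact / approx   # should approach 1 as h shrinks
    print(h, exact, approx)
```

The ratio of the exact divergence to the quadratic form should move toward 1 as h decreases, since the neglected terms are of third order in h.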
Multilayer Neural Network (1/2)
Multilayer Neural Network (2/2) c is a normalizing constant
Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (1/4) $D_T = \{(x_1, y_1), \ldots, (x_T, y_T)\}$ is a set of T independent input-output examples generated by the teacher network having parameter w*. Minimizing the log loss over the training data $D_T$ yields the estimator $\hat{w}_T$ that minimizes the training error $L_{\mathrm{train}}(w) = \frac{1}{T}\sum_{t=1}^{T} l(x_t, y_t; w)$.
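Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (2/4) The Cramér-Rao theorem states that the expected squared error of an unbiased estimator satisfies $E\!\left[(\hat{w}_T - w^*)(\hat{w}_T - w^*)^{\top}\right] \ge \frac{1}{T} G^{-1}$, where the inequality holds in the sense of positive definiteness of matrices. An estimator is said to be efficient or Fisher efficient when it satisfies this bound with equality for large T.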
Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (3/4) Theorem 2. The natural gradient online estimator is Fisher efficient. Proof
Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (4/4)
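A small special case may help make Theorem 2 concrete. The sketch below is an illustration under assumptions, not the proof: it takes the model to be a Gaussian N(w, sigma^2) with known sigma, uses the learning rate eta_t = 1/t, and the function name is hypothetical. In this case the Fisher information of the mean is 1/sigma^2, and the natural gradient online update reduces to the running sample mean, which is the efficient estimator of a Gaussian mean.

```python
import numpy as np

def natural_gradient_mean_estimator(samples, sigma=1.0, w0=0.0):
    """Online natural gradient for the mean of N(w, sigma^2) with eta_t = 1/t."""
    G_inv = sigma**2                      # inverse Fisher information of the mean
    w = w0
    for t, z in enumerate(samples, start=1):
        grad = -(z - w) / sigma**2        # gradient of l(z, w) = -log p(z; w)
        w = w - (1.0 / t) * G_inv * grad  # natural gradient step
    return w

rng = np.random.default_rng(1)
samples = rng.normal(3.0, 2.0, size=1000)  # assumed teacher: mean 3, sigma 2
w_hat = natural_gradient_mean_estimator(samples, sigma=2.0)
print(w_hat, samples.mean())
```

Here the factor sigma^2 from the inverse metric cancels the 1/sigma^2 in the gradient, so the update is w + (z - w)/t regardless of the noise scale, and the first step discards the initial value w0.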