
1 Natural Gradient Works Efficiently in Learning, S. Amari. 11.03.18 (Fri), Computational Modeling of Intelligence. Summarized by Joon Shik Kim

2 Abstract The ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. The dynamical behavior of natural gradient online learning is analyzed and proved to be Fisher efficient. The plateau phenomenon, which appears in the backpropagation learning algorithm of the multilayer perceptron, might disappear or become much less serious when the natural gradient is used.

3 Introduction (1/2) The stochastic gradient method is a popular learning method in the general framework of nonlinear optimization. In many cases, the parameter space is not Euclidean but has a Riemannian metric structure. In these cases, the ordinary gradient does not give the steepest direction of the target function.

4 Introduction (2/2) Barkai, Seung, and Sompolinsky (1995) proposed an adaptive method of adjusting the learning rate. We generalize their idea and evaluate its performance based on the Riemannian metric of errors.

5 Natural Gradient (1/5) The squared length of a small incremental vector dw is the sum of the squares of its components when the coordinate system is orthonormal (Euclidean). When the coordinate system is nonorthonormal, the squared length is given by the quadratic form shown below.
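A minimal LaTeX reconstruction of the two length formulas (the slide's equation images are not in the transcript; the metric notation G(w) = (g_ij(w)) follows Amari, 1998):

|dw|^2 = \sum_i (dw_i)^2 \quad \text{(orthonormal Euclidean coordinates)}

|dw|^2 = \sum_{i,j} g_{ij}(w)\, dw_i\, dw_j \quad \text{(Riemannian metric } G(w) = (g_{ij}(w))\text{)}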

6 Natural Gradient (2/5) The steepest descent direction of a function L(w) at w is defined by the vector dw that minimizes L(w + dw), where |dw| has a fixed length, that is, under the constraint shown below.
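A reconstruction of the constraint, following the paper (ε is a small constant):

|dw|^2 = \sum_{i,j} g_{ij}(w)\, dw_i\, dw_j = \varepsilon^2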

7 Natural Gradient (3/5) The steepest descent direction of L(w) in a Riemannian space is given by the natural gradient, shown below.
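A reconstruction of the missing formula, matching the paper's definition of the natural gradient:

\tilde{\nabla} L(w) = G^{-1}(w)\, \nabla L(w), \qquad \nabla L = \Big(\tfrac{\partial L}{\partial w_1}, \ldots, \tfrac{\partial L}{\partial w_n}\Big)^{\top}

The steepest descent direction is then -\tilde{\nabla} L(w); when G is the identity matrix (a Euclidean space with orthonormal coordinates), it reduces to the ordinary gradient.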

8 Natural Gradient (4/5)

9 Natural Gradient (5/5)

10 Natural Gradient Learning The risk function, or average loss, is L(w). Learning is a procedure to search for the optimal w* that minimizes L(w). Stochastic gradient descent learning updates w using one example at a time (see below).
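The slide's equations are not in the transcript; a reconstruction under the paper's notation, with per-example loss l(x, y; w) and learning rate η_t:

L(w) = E[\, l(x, y; w)\,] \quad \text{(risk / average loss)}

w_{t+1} = w_t - \eta_t\, \nabla l(x_t, y_t; w_t) \quad \text{(ordinary stochastic gradient descent)}

w_{t+1} = w_t - \eta_t\, G^{-1}(w_t)\, \nabla l(x_t, y_t; w_t) \quad \text{(natural gradient learning)}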

11 Statistical Estimation of Probability Density Function (1/2) In the case of statistical estimation, we assume a statistical model {p(z, w)}, and the problem is to obtain the probability distribution that best approximates the unknown density function q(z). The loss function is given below.
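A reconstruction of the loss named on the slide (the negative log-likelihood, consistent with the entropy decomposition on the next slide):

l(z, w) = -\log p(z, w)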

12 Statistical Estimation of Probability Density Function (2/2) The expected loss is then given by the sum of H_z and the Kullback-Leibler divergence from q to the model, where H_z is the entropy of q(z) and does not depend on w. The Riemannian metric is the Fisher information matrix (see below).
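A reconstruction of the two missing equations, using the standard decomposition of the expected negative log-likelihood:

L(w) = -\int q(z)\, \log p(z, w)\, dz = H_z + D_{KL}\big(q(z)\,\|\, p(z, w)\big), \qquad H_z = -\int q(z)\, \log q(z)\, dz

g_{ij}(w) = E\!\left[ \frac{\partial \log p(z, w)}{\partial w_i}\, \frac{\partial \log p(z, w)}{\partial w_j} \right] \quad \text{(Fisher information matrix } G(w) = (g_{ij}(w))\text{)}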

13 Fisher Information as the Metric of Kullback-Leibler Divergence (1/2) Consider the perturbed distribution p = q(θ + h).

14 Fisher Information as the Metric of Kullback-Leibler Divergence (2/2) Here I denotes the Fisher information matrix.
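A reconstruction of the relation these two slides point to, the second-order expansion of the Kullback-Leibler divergence between q(θ) and the perturbed distribution p = q(θ + h):

D_{KL}\big(q(\theta)\,\|\, q(\theta + h)\big) \approx \tfrac{1}{2}\, h^{\top} I(\theta)\, h

so the Fisher information I(θ) acts as the local metric measuring the distance between nearby distributions.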

15 Multilayer Neural Network (1/2)

16 Multilayer Neural Network (2/2) c is a normalizing constant
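The model equation is missing from the transcript; under the additive Gaussian-noise assumption used in the paper, the conditional density that c normalizes is presumably

p(y \mid x; w) = c\, \exp\!\big(-\tfrac{1}{2}\, \|\, y - f(x, w)\, \|^2\big), \qquad p(x, y; w) = q(x)\, p(y \mid x; w),

where f(x, w) is the input-output function of the multilayer network and q(x) is the input distribution.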

17 Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (1/4) D_T = {(x_1, y_1), …, (x_T, y_T)} is a set of T independent input-output examples generated by the teacher network with parameter w*. Minimizing the log loss over the training data D_T yields the estimator ŵ_T that minimizes the training error shown below.
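A reconstruction of the training-error objective, assuming the log loss l = -log p(y | x; w) from the earlier slides:

\hat{w}_T = \arg\min_{w}\; E_{\mathrm{train}}(w), \qquad E_{\mathrm{train}}(w) = \frac{1}{T} \sum_{t=1}^{T} l(x_t, y_t; w) = -\frac{1}{T} \sum_{t=1}^{T} \log p(y_t \mid x_t; w)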

18 Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (2/4) The Cramér-Rao theorem states that the expected squared error of an unbiased estimator satisfies the bound below. An estimator is said to be efficient, or Fisher efficient, when it attains this bound.
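A reconstruction of the bound as stated in the paper, where G(w*) is the Fisher information matrix and V_T the error covariance of the unbiased estimator ŵ_T:

V_T = E\big[ (\hat{w}_T - w^*)(\hat{w}_T - w^*)^{\top} \big] \succeq \frac{1}{T}\, G^{-1}(w^*)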

19 Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (3/4) Theorem 2. The natural gradient online estimator is Fisher efficient. Proof.
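As a numerical illustration only (not from the slides), a minimal Python sketch of an online natural-gradient estimator using the η_t = 1/t learning-rate schedule from the efficiency analysis; the model (a Gaussian with known variance) and the values w_true = 1.5, sigma = 2.0, T = 10000 are assumptions chosen for the example. For this model the Fisher information is 1/σ², and the resulting estimate attains the Cramér-Rao variance bound σ²/T.

import numpy as np

rng = np.random.default_rng(0)
w_true, sigma, T = 1.5, 2.0, 10000   # assumed example values, not from the slides

theta = 0.0                      # initial estimate of the mean
fisher = 1.0 / sigma**2          # Fisher information of N(theta, sigma^2)
for t, x in enumerate(rng.normal(w_true, sigma, size=T), start=1):
    grad = -(x - theta) / sigma**2               # gradient of -log p(x; theta)
    theta -= (1.0 / t) * (1.0 / fisher) * grad   # natural-gradient update, eta_t = 1/t

print("estimate:", theta)                        # close to w_true
print("Cramer-Rao variance bound:", sigma**2 / T)

With η_t = 1/t this update reduces to the running sample mean, which is exactly the maximum likelihood estimator, so its variance meets the Cramér-Rao bound asymptotically.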

20 Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (4/4)

