Gradient Descent Learning of Neural Networks: Information Geometry and the Natural Gradient


1 Gradient Descent Learning of Neural Networks: Information Geometry and the Natural Gradient
Hyeyoung Park, Lab. for Mathematical Neuroscience, Brain Science Institute, RIKEN, Japan. Tutorial at the Korea Information Science Society (KISS) Spring Conference.

2 Overview (1/2)
Introduction: feed-forward neural networks, learning of neural networks, the plateau problem.
Geometrical approach to learning: geometry of neural networks, information geometry, information geometry for neural networks.
Natural gradient: superiority of the natural gradient, natural gradient and plateaus, problems of natural gradient learning.

3 Overview (2/2)
Adaptive natural gradient learning (ANGL): basic formula, ANGL for regression problems, ANGL for classification problems, computational experiments.
Comparison with second-order methods: Newton method, Gauss-Newton and Levenberg-Marquardt methods, natural gradient vs. Gauss-Newton method.
Conclusions.

4 Feed Forward Neural Networks
A network model.
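The slide's network diagram and input-output equation are not preserved in the transcript; as a minimal sketch of the kind of model discussed throughout the tutorial (a one-hidden-layer perceptron, with layer sizes chosen here only for illustration):

```python
import numpy as np

def mlp_forward(x, W, w, b, c):
    """Forward pass of a one-hidden-layer perceptron: f(x) = w . tanh(W x + b) + c."""
    return w @ np.tanh(W @ x + b) + c

# illustrative sizes: 4 inputs, 3 hidden units, scalar output
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)
w, c = rng.normal(size=3), 0.0
print(mlp_forward(rng.normal(size=4), W, w, b, c))
```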

5 Learning of Neural Networks
Data set; error function (squared error function or negative log likelihood); training error.
Learning (gradient descent learning). Goal: find an optimal parameter. Search for an estimate of the optimal parameter step by step, either in on-line mode or in batch mode.
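A schematic of the two update modes just named (a sketch; grad_error is an assumed helper that returns the gradient of the chosen error function over the examples it is given):

```python
def online_gd(theta, data, grad_error, lr=0.01):
    # on-line mode: update the parameter after every single example (x, y)
    for x, y in data:
        theta = theta - lr * grad_error(theta, [(x, y)])
    return theta

def batch_gd(theta, data, grad_error, lr=0.01, epochs=100):
    # batch mode: update with the gradient of the total training error
    for _ in range(epochs):
        theta = theta - lr * grad_error(theta, data)
    return theta
```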

6 Plateau problem Typical learning curve of neural networks
Plateaus make learning extremely slow.

7 Why Do Plateaus Appear? (1/2)
[Saad and Solla 1995] analyzed the dynamics of the parameters during learning using statistical mechanics. In the early stage of learning, the network is drawn into a suboptimal symmetric phase, in which all hidden nodes have the same weight values. Breaking this symmetry takes the dominant part of the learning time, which produces the plateau. (The slide figure sketches the learning trajectory of the error from the starting point through the suboptimal symmetric phase to the optimal point.)

8 Why Do Plateaus Appear? (2/2)
[Fukumizu and Amari 1999] Parameter spaces of smaller networks are embedded as subspaces of larger networks (the critical subspace). Global and local minima of a smaller network can become local minima or saddle points of the larger network. These saddle points are a main cause of plateaus.

9 Hierarchical Structure of Space of NN
(Figure: the space of the larger network contains a critical subspace corresponding to the space of the smaller network; a minimum of the smaller network appears there as a saddle point or local minimum of the larger network.)

10 Geometrical Structure of Neural Manifold
Which is the fastest way to the optimal point? How can we find an efficient path? (Figure: the error surface over the parameter space, viewed as a neural manifold.)

11 Information Geometry
The study of the space of probability density functions p(x; θ) specified by a parameter θ. Basic characteristics: the space is a Riemannian space, so a local metric is needed to measure distances; the corresponding metric is given by the Fisher information matrix. The steepest-descent direction of a function e(θ) on this space is given by the natural gradient.
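As a compact summary of the two quantities just named (standard notation assumed; the slide's own formulas are not recoverable from the transcript):

```latex
G(\theta) = \mathbb{E}_{p(x;\theta)}\!\left[\nabla_\theta \log p(x;\theta)\,\nabla_\theta \log p(x;\theta)^{\top}\right]
\qquad \text{(Fisher information matrix)}

\tilde{\nabla} e(\theta) = G(\theta)^{-1}\,\nabla e(\theta)
\qquad \text{(natural gradient)}
```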

12 An Example of Riemannian Space
(Figure: a curved space and its locally Euclidean approximation, together with the metric for the space.)
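As one concrete illustration of a Riemannian metric (not necessarily the example drawn on the slide), the surface of a sphere of radius R has the line element

```latex
ds^2 = R^2\left(d\vartheta^2 + \sin^2\vartheta\, d\varphi^2\right)
```

so distances between nearby points depend on where they lie on the surface, even though every small patch looks locally Euclidean.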

13 Information Geometry for Neural Networks
Stochastic neural networks: a neural network can be viewed as a conditional probability density function. The gradient in this space of neural networks then leads to natural gradient learning.
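One common way to make this concrete, and presumably what the missing slide formulas express, is to attach noise to the network output f(x; θ) so that the network defines a conditional density, and then descend the natural gradient of the negative log likelihood ℓ:

```latex
p(y \mid x;\theta) \propto \exp\!\left(-\tfrac{1}{2\sigma^{2}}\,\lVert y - f(x;\theta)\rVert^{2}\right),
\qquad
\theta_{t+1} = \theta_t - \eta_t\, G(\theta_t)^{-1}\,\nabla_\theta \ell(\theta_t),
```

with G the Fisher information matrix of this conditional model, averaged over the input distribution.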

14 Why Natural Gradient? - Related Researches -
[Amari 1998] showed that the natural gradient gives the steepest direction of a loss function at an arbitrary point on the manifold of probability distributions, and that natural gradient learning achieves the best asymptotic performance that any unbiased learning algorithm can achieve.
[Park et al. 1999] suggested the possibility of avoiding plateaus, or quickly escaping from them, with natural gradient learning, and showed experimental evidence of plateau avoidance.
[Rattray and Saad 1999] confirmed the possibility of avoiding plateaus through a statistical-mechanical analysis.

15 Why Natural Gradient? - Intuitive explanations -
Consider the movement of the parameter near the critical subspace under the standard gradient descent method and under the natural gradient method (illustrated on the slide).

16 Why Natural Gradient? - Experimental evidence - (1/3)
Toy model with a 2-dimensional parameter space. Model assumptions: input x ~ N(0, I), noise ε ~ N(0, 0.1); the number of parameters is reduced to two. Training data are generated by a teacher network of the same structure with the true parameters. (Figure: the (a1, a2) parameter plane with the true point (a1*, a2*), the initial point (a1o, a2o), and the critical subspace a1 = a2.) A simulation sketch follows.
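The exact toy model is not recoverable from the transcript, but a two-parameter student-teacher setup in the same spirit can be simulated in a few lines; everything below (the tanh units, the teacher values, the initial point, the learning rates) is an illustrative assumption:

```python
import numpy as np

# Student f(x; a1, a2) = tanh(a1*x) + tanh(a2*x); the line a1 = a2 is the
# symmetric critical subspace.  The teacher has the same form.
rng = np.random.default_rng(0)
a_true = np.array([2.0, -1.0])           # assumed teacher parameters (a1*, a2*)
sigma = np.sqrt(0.1)                     # output noise std, as on the slide

X = rng.normal(size=2000)                # input x ~ N(0, 1)
Y = np.tanh(a_true[0] * X) + np.tanh(a_true[1] * X) + sigma * rng.normal(size=2000)

def train(natural, lr=0.02, steps=3000):
    a = np.array([0.6, 0.5])             # initial point close to the a1 = a2 line
    for _ in range(steps):
        H = np.tanh(np.outer(X, a))      # hidden outputs, shape (N, 2)
        err = H.sum(axis=1) - Y
        J = (1.0 - H**2) * X[:, None]    # d f / d a_i = x * sech^2(a_i * x)
        grad = J.T @ err / len(X)        # ordinary gradient of the squared error
        if natural:
            G = J.T @ J / len(X) + 1e-8 * np.eye(2)   # empirical Fisher metric
            grad = np.linalg.solve(G, grad)
        a = a - lr * grad
    return a

# compare where each method ends up after the same number of steps
print("ordinary gradient:", train(False))
print("natural gradient :", train(True))
```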

17 Why Natural Gradient? - Experimental evidence - (2/3)
Dynamics of ordinary gradient learning.

18 Why Natural Gradient? - Experimental evidence - (3/3)
Dynamics of natural gradient learning.

19 Problem of Natural Gradient Learning
Updating rule: every step requires the Fisher information matrix and its inverse. Calculating the Fisher information matrix requires knowing the input distribution, which in practice is estimated by a sample mean. Calculating the inverse of the Fisher information matrix has a high computational cost. An adaptive estimation method is therefore needed.
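To make the cost concrete, this is what the brute-force approach looks like (a sketch; the function name and the small ridge term are my own, not from the slides):

```python
import numpy as np

def naive_natural_gradient_step(theta, grad_loss, score_samples, lr=0.01):
    """One natural-gradient step with a brute-force Fisher estimate.

    score_samples: array of shape (N, P) holding per-sample score vectors
    (gradients of log p with respect to the P parameters).  The Fisher matrix
    is approximated by the sample mean of their outer products and then
    inverted explicitly -- O(P^2) memory and O(P^3) time per step, which is
    exactly the cost the adaptive method below is designed to avoid.
    """
    G = score_samples.T @ score_samples / len(score_samples)  # sample-mean Fisher
    nat_grad = np.linalg.solve(G + 1e-8 * np.eye(len(theta)), grad_loss)
    return theta - lr * nat_grad
```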

20 Adaptive Natural Gradient Learning(1/2)
Stochastic neural networks and their Fisher information matrix.

21 Adaptive Natural Gradient Learning(2/2)
Adaptive estimation of the Fisher information matrix, adaptive estimation of its inverse, and the resulting adaptive natural gradient learning rule (see the sketch below).
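The formulas on this slide did not survive the transcript, but the adaptive scheme of Amari, Park & Fukumizu (2000) maintains a running estimate of the inverse Fisher matrix and refreshes it with a rank-one (Sherman-Morrison) update instead of re-inverting at every step. A minimal sketch, with eps and lr as assumed constant learning rates (in practice they are scheduled):

```python
import numpy as np

def angl_step(theta, G_inv, score, grad_loss, eps=0.01, lr=0.01):
    """One adaptive natural gradient step (sketch).

    G_inv     : current estimate of the inverse Fisher matrix, shape (P, P)
    score     : gradient of log p(y|x; theta) for the current example, shape (P,)
    grad_loss : gradient of the loss for the current example, shape (P,)

    The underlying Fisher estimate is G_{t+1} = (1-eps)*G_t + eps*score*score^T;
    the Sherman-Morrison formula updates its inverse directly.
    """
    v = G_inv @ score
    G_inv = (G_inv - np.outer(v, v) * eps / (1.0 - eps + eps * (score @ v))) / (1.0 - eps)
    theta = theta - lr * (G_inv @ grad_loss)
    return theta, G_inv
```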

22 Implementation of ANGL
Consider two types of practical applications.
Regression problems give a prediction of output values for a given input (time series prediction, nonlinear system identification); the outputs are generally continuous.
Classification problems assign a given input to one of several classes (pattern recognition, data mining); the outputs are binary.
Each type uses a different stochastic model: an additive-noise model (squared error function) for regression, and a coin-flipping model (cross-entropy error) for classification.

23 ANGL for Regression problem(1/3)
Stochastic model of the neural network: additive noise drawn from a probability distribution. The error function is the negative log likelihood; here the noise is Gaussian with scalar variance σ².
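Under these assumptions the model and the error function take the standard form (written out here because the slide's equations are not in the transcript):

```latex
y = f(x;\theta) + \xi, \qquad \xi \sim N(0,\, \sigma^{2} I),

E(\theta) = -\sum_{t=1}^{N} \log p(y_t \mid x_t;\theta)
          = \frac{1}{2\sigma^{2}} \sum_{t=1}^{N} \lVert y_t - f(x_t;\theta) \rVert^{2} + \text{const.}
```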

24 ANGL for Regression problem(2/3)
Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm.

25 ANGL for Regression problem(3/3)
Case of Gaussian additive noise with scalar variance.

26 ANGL for classification problem(1/4)
Classification problems: an output node represents a class (binary values), so a different stochastic model from the regression case is needed. Stochastic model I (the two-class case): the output is 1 for class 1 and 0 for class 2, and the error function is the cross-entropy error function.
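In standard notation (assumed here, since the slide formulas did not survive the transcript), with the network output f(x;θ) in (0, 1) interpreted as the probability of class 1:

```latex
p(y \mid x;\theta) = f(x;\theta)^{\,y}\,\bigl(1 - f(x;\theta)\bigr)^{1-y}, \qquad y \in \{0,1\},

E(\theta) = -\sum_{t=1}^{N} \Bigl[\, y_t \log f(x_t;\theta) + (1-y_t)\log\bigl(1 - f(x_t;\theta)\bigr) \Bigr].
```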

27 ANGL for classification problem(2/4)
Estimation of the Fisher information matrix; adaptive natural gradient learning algorithm.

28 ANGL for classification problem(3/4)
Stochastic model II (the case of L classes): L output nodes are needed, one per class, and the error function is again the cross-entropy error function.
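In the usual multi-class form (assumed notation), with one-of-L target vectors y and output f_k(x;θ) for class k:

```latex
p(y \mid x;\theta) = \prod_{k=1}^{L} f_k(x;\theta)^{\,y_k},
\qquad
E(\theta) = -\sum_{t=1}^{N} \sum_{k=1}^{L} y_{t,k} \log f_k(x_t;\theta).
```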

29 ANGL for classification problem(4/4)
Estimation of the Fisher information matrix; adaptive natural gradient learning algorithm.

30 Experiments on Regression Problems (1/3)
Mackey-Glass time series prediction. Input: four previous values x(t-18), x(t-12), x(t-6), x(t); output: one future value x(t+6). Learning data: 500 points (t = 200, ..., 700); test data: 500 points (t = 5000, ..., 5500). The output noise follows a Gaussian distribution.
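The generating equation was shown on the slide but is not in the transcript; the sketch below uses the standard Mackey-Glass benchmark settings (delay tau = 17, simple Euler integration), which are an assumption, and then builds the input/output pairs exactly as described above:

```python
import numpy as np

def mackey_glass(n_steps, tau=17, beta=0.2, gamma=0.1, n=10, dt=1.0, x0=1.2):
    """Generate a Mackey-Glass series:
    dx/dt = beta * x(t - tau) / (1 + x(t - tau)**n) - gamma * x(t).
    Parameter values are the common benchmark choices, assumed here."""
    history = int(tau / dt)
    x = np.full(n_steps + history, x0)
    for t in range(history, n_steps + history - 1):
        x[t + 1] = x[t] + dt * (beta * x[t - history] / (1.0 + x[t - history] ** n)
                                - gamma * x[t])
    return x[history:]

series = mackey_glass(6000)

def make_pairs(x, ts):
    # pairs as on the slide: (x(t-18), x(t-12), x(t-6), x(t)) -> x(t+6)
    X = np.stack([[x[t - 18], x[t - 12], x[t - 6], x[t]] for t in ts])
    y = np.array([x[t + 6] for t in ts])
    return X, y

X_train, y_train = make_pairs(series, range(200, 700))
X_test,  y_test  = make_pairs(series, range(5000, 5500))
```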

31 Experiments on Regression Problems (2/3)
Experimental results (averages over 10 trials; the table on the slide reports errors on a scale of ×10⁻⁵).

32 Experiments on Regression Problems (3/3)
Learning curve.

33 Experiments on Classification Problems - case of two classes (1/3)
Extended XOR problem with 2 classes, using stochastic model I. Learning data: 1800; test data: 900.

34 Experiments on Classification Problems - case of two classes (2/3)
Experimental results (averages over 10 trials).

35 Experiments on Classification Problems - case of two classes (3/3)
Learning curve.

36 Experiments on Classification Problems - case of multiple classes (1/3)
IRIS classification problem: classify three different species of iris flower. Input: 4 attributes describing the shape of the plant (4 input nodes); output: 3 flower classes (3 output nodes). Stochastic model II is used. Learning data: 90 (30 per class); test data: 60 (20 per class).

37 Experiments on Classification Problems - case of multiple classes (2/3)
Experimental results (averages over 10 trials).

38 Experiments on Classification Problems - case of multiple classes (3/3)
Learning curve.

39 Comparison with Second Order Method(1/3)
Newton method: uses the second-order Taylor expansion of the error function around the optimal point θ*. The resulting updating rule is effective only near the optimal point and can be unstable depending on the condition of the Hessian matrix.
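For reference, the standard form of the expansion and update (the slide's own equations are not in the transcript):

```latex
E(\theta) \approx E(\theta^{*}) + \tfrac{1}{2}\,(\theta - \theta^{*})^{\top} H\, (\theta - \theta^{*}),
\qquad H = \nabla^{2} E(\theta^{*}),

\theta_{t+1} = \theta_t - \eta_t\, H^{-1}\, \nabla E(\theta_t).
```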

40 Comparison with Second Order Method (2/3)
Gauss-Newton method: consider a sum-of-squares error function and the Gauss-Newton approximation of the Hessian, which gives the updating rule. The Levenberg-Marquardt method adds a damping term to stabilize it.
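In standard notation (assumed), with residuals r_t(θ) = y_t - f(x_t;θ) stacked in r and J the Jacobian of r with respect to θ:

```latex
E(\theta) = \tfrac{1}{2} \sum_{t} r_t(\theta)^{2},
\qquad
H \approx J^{\top} J \quad \text{(Gauss-Newton approximation)},

\theta_{t+1} = \theta_t - \bigl(J^{\top} J\bigr)^{-1} J^{\top} r
\qquad \text{vs.} \qquad
\theta_{t+1} = \theta_t - \bigl(J^{\top} J + \lambda I\bigr)^{-1} J^{\top} r \ \ \text{(Levenberg-Marquardt)}.
```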

41 Comparison with Second Order Method (3/3)
Natural gradient learning: updating rule based on the inverse Fisher information matrix; works in on-line or batch mode; applies to a general error function; takes the geometrical characteristics of the space of neural networks into account.
Gauss-Newton method: updating rule based on the approximated Hessian; batch learning; assumes a sum-of-squares error; relies on a quadratic approximation and an approximation of the Hessian.
Under the assumption of additive Gaussian noise with scalar variance, the natural gradient method gives a theoretical justification of the Gauss-Newton approximation; in this sense the natural gradient method is a generalization of the Gauss-Newton method.
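The connection can be sketched in one line, assuming the regression model of slide 23 with output Jacobian ∇f and N training examples:

```latex
G(\theta) = \frac{1}{\sigma^{2}}\, \mathbb{E}_x\!\left[\nabla_\theta f(x;\theta)\, \nabla_\theta f(x;\theta)^{\top}\right]
\;\approx\; \frac{1}{N\sigma^{2}}\, J^{\top} J,
```

so the natural gradient direction G⁻¹∇E coincides, up to scale, with the Gauss-Newton direction (JᵀJ)⁻¹Jᵀr, while remaining defined for general stochastic models and for on-line updates.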

42 Conclusions A study on an efficient learning method
This tutorial considered the plateau problem in learning and the geometrical structure of the space of neural networks, took an information-geometrical approach to the plateau problem, presented natural gradient learning as a solution, presented adaptive natural gradient learning for realizing the natural gradient in neural networks, showed its practical advantages, and compared it with second-order methods. When is natural gradient learning a good choice? For problems with large data sets and small network size, and for problems requiring fine approximation accuracy.

43 References
Plateau problem:
Fukumizu, K. & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13.
Saad, D. & Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52.
Second-order methods and learning theory:
Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). In G. B. Orr & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade. Springer Lecture Notes in Computer Science, vol. 1524. Heidelberg: Springer.
Information geometry:
Amari, S. & Nagaoka, H. (1999). Information Geometry. AMS and Oxford University Press.

44 References
Basic concept of the natural gradient:
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10.
Natural gradient for neural networks:
Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12.
Park, H., Amari, S., & Lee, Y. (1999). An information geometrical approach on plateau problems in multilayer perceptron learning. Journal of KISS(B): Software and Applications, 26(4). (In Korean.)
Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13.
Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81.

