Gradient Descent Learning of Neural Networks: Information Geometry and the Natural Gradient

Gradient Descent Learning of Neural Networks: Information Geometry and the Natural Gradient
Hyeyoung Park, Lab. for Mathematical Neuroscience, Brain Science Institute, RIKEN, Japan
Tutorial, Korea Information Science Society (KISS) Spring Conference, April 28, 2001

Overview (1/2)
- Introduction
  - Feedforward neural networks
  - Learning of neural networks
  - The plateau problem
- Geometrical Approach to Learning
  - Geometry of neural networks
  - Information geometry
  - Information geometry for neural networks
- Natural Gradient
  - Superiority of the natural gradient
  - Natural gradient and plateaus
  - Problems of natural gradient learning

Overview (2/2)
- Adaptive Natural Gradient Learning (ANGL)
  - Basic formula
  - ANGL for regression problems
  - ANGL for classification problems
  - Computational experiments
- Comparison with Second-Order Methods
  - Newton method
  - Gauss-Newton method and Levenberg-Marquardt method
  - Natural gradient vs. Gauss-Newton method
- Conclusions

Feed-Forward Neural Networks
- A network model
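The slide's network diagram and defining equations are not reproduced above. As a reference point, the following is the standard two-layer perceptron form used throughout the cited natural-gradient papers; it is an assumed reconstruction (scalar output for simplicity), not the slide's own equation.

```latex
% Two-layer MLP with m hidden units, input x in R^n, scalar output
% (assumed standard form; the original figure is not available)
f(\boldsymbol{x};\boldsymbol{\theta})
  = \sum_{i=1}^{m} v_i \,\varphi\!\left(\boldsymbol{w}_i^{\top}\boldsymbol{x} + b_i\right) + c,
\qquad \varphi(u) = \tanh(u),
\qquad \boldsymbol{\theta} = \{\boldsymbol{w}_i, b_i, v_i, c\}_{i=1}^{m}.
```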

Learning of Neural Networks
- Data set
- Error function
  - Squared error function
  - Negative log likelihood
  - Training error
- Learning (gradient descent learning)
  - Goal: find an optimal parameter
  - Search for an estimate of the optimal parameter step by step
  - On-line mode
  - Batch mode
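For concreteness, a sketch of the quantities listed on this slide, in notation consistent with the cited papers (the exact symbols used on the original slide are not preserved):

```latex
% Training data and the two error functions mentioned on the slide
D = \{(\boldsymbol{x}_t, y_t)\}_{t=1}^{T},\qquad
E_{\mathrm{sq}}(\boldsymbol{\theta}) = \frac{1}{2}\sum_{t=1}^{T}\bigl\|y_t - f(\boldsymbol{x}_t;\boldsymbol{\theta})\bigr\|^{2},\qquad
E_{\mathrm{nll}}(\boldsymbol{\theta}) = -\sum_{t=1}^{T}\log p(y_t\mid \boldsymbol{x}_t;\boldsymbol{\theta}).

% Gradient descent: batch mode uses the full sum, on-line mode one sample at a time
\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - \eta\,\nabla_{\boldsymbol{\theta}} E(\boldsymbol{\theta}_k)
\quad\text{(batch)},\qquad
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta_t\,\nabla_{\boldsymbol{\theta}} \ell_t(\boldsymbol{\theta}_t)
\quad\text{(on-line, per-sample loss } \ell_t\text{)}.
```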

Plateau Problem
- Typical learning curve of neural networks
- Plateaus make learning extremely slow

Why Do Plateaus Appear? (1/2)
- [Saad and Solla 1995] analyzed the dynamics of the parameters during learning using statistical mechanics
- In the early stage of learning, the network is drawn into a suboptimal symmetric phase, in which all hidden nodes have the same weight values
- Breaking this symmetry takes the dominant share of learning time, producing a plateau
- [Figure: error e along the learning trajectory, from the starting point through the suboptimal symmetric phase to the optimal point]

Why Do Plateaus Appear? (2/2)
- [Fukumizu and Amari 1999] Parameter spaces of smaller networks are embedded as subspaces of larger networks (critical subspaces)
- Global/local minima of a smaller network can become local minima or saddle points of the larger network
- These saddle points are a main cause of plateaus

Hierarchical Structure of the Space of Neural Networks
- [Figure: parameter space of the larger network, containing a critical subspace that is the space of the smaller network; the smaller network's minimum appears in the larger network as saddle points or local minima]

Geometrical Structure of the Neural Manifold
- Which is the fastest way to the optimal point? How can an efficient path be found?
- [Figures: the neural manifold; the error surface on the parameter space]

Information Geometry
- The study of the space of probability density functions p(x; θ) specified by a parameter θ
- Basic characteristics:
  - It is a Riemannian space
  - A local metric is needed to measure distance
  - The corresponding metric is given by the Fisher information matrix
  - The steepest descent direction of a function e(θ) on this space is given by the natural gradient
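The formulas on this slide are not preserved; the standard definitions from Amari (1998), which the talk follows, are:

```latex
% Fisher information matrix of the parametric family p(x;\theta)
G(\boldsymbol{\theta}) = \mathbb{E}_{p(x;\boldsymbol{\theta})}\!\left[
  \nabla_{\boldsymbol{\theta}}\log p(x;\boldsymbol{\theta})\,
  \nabla_{\boldsymbol{\theta}}\log p(x;\boldsymbol{\theta})^{\top}\right].

% Natural gradient: steepest descent direction under the Riemannian metric G
\tilde{\nabla} e(\boldsymbol{\theta}) = G^{-1}(\boldsymbol{\theta})\,\nabla e(\boldsymbol{\theta}).
```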

An Example of a Riemannian Space
- Curved space and locally Euclidean space
- Metric for the space

Information Geometry for Neural Networks
- Stochastic neural networks: a neural network can be regarded as a probability density function
- Gradient in the space of neural networks
- Natural gradient learning
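In the cited formulation (Amari, Park & Fukumizu 2000), the deterministic network output is turned into a density by attaching a noise model; the natural-gradient learning rule then takes the following form (a reconstruction of the slide's missing equations, with ℓ the per-sample negative log likelihood):

```latex
% Stochastic model: input density times the conditional model defined by the network
p(\boldsymbol{x}, y;\boldsymbol{\theta}) = q(\boldsymbol{x})\,p(y\mid \boldsymbol{x};\boldsymbol{\theta}).

% Natural gradient learning rule (on-line form)
\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t
  - \eta_t\, G^{-1}(\boldsymbol{\theta}_t)\,
    \nabla_{\boldsymbol{\theta}}\,\ell(\boldsymbol{x}_t, y_t;\boldsymbol{\theta}_t),
\qquad \ell = -\log p(y\mid\boldsymbol{x};\boldsymbol{\theta}).
```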

Why Natural Gradient? - Related Research
- [Amari 1998]
  - Showed that the natural gradient gives the steepest descent direction of a loss function at any point in the manifold of probability distributions
  - Natural gradient learning achieves the best asymptotic performance that any unbiased learning algorithm can achieve
- [Park et al. 1999]
  - Suggested the possibility of avoiding plateaus, or quickly escaping from them, with natural gradient learning
  - Showed experimental evidence of avoiding plateaus
- [Rattray and Saad 1999]
  - Confirmed the possibility of avoiding plateaus through a statistical mechanical analysis

Why Natural Gradient? - Intuitive Explanation
- Consider the movement of the parameter around the critical subspace under:
  - the standard gradient descent learning method
  - the natural gradient method

Why Natural Gradient? - Experimental Evidence (1/3)
- Toy model with a 2-dimensional parameter space
- Model assumptions: input x ~ N(0, I), additive noise ~ N(0, 0.1)
- The number of parameters is reduced to two (a1, a2)
- Training data: generated from a teacher network with the same structure and the true parameters
- [Figure: the (a1, a2) parameter space, showing the true point (a1*, a2*), the initial point (a1o, a2o), and the critical subspace a1 = a2]

Why Natural Gradient? - Experimental Evidence (2/3)
- Dynamics of ordinary gradient learning

Why Natural Gradient? - Experimental Evidence (3/3)
- Dynamics of natural gradient learning

Problems of Natural Gradient Learning
- Updating rule
- Calculation of the Fisher information matrix
  - Requires knowledge of the input distribution
  - Estimated by the sample mean
- Calculation of the inverse of the Fisher information matrix
  - High computational cost
- An adaptive estimation method is needed

Adaptive Natural Gradient Learning (1/2)
- Stochastic neural networks
- Fisher information matrix

Adaptive Natural Gradient Learning (2/2)
- Adaptive estimation of the Fisher information matrix
- Inverse of the Fisher information matrix
- Adaptive natural gradient learning
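The formulas on these two slides are not preserved above. As a hedged sketch of the kind of adaptive scheme described in Amari, Park & Fukumizu (2000), the inverse Fisher matrix can be tracked with a rank-one (Sherman-Morrison) update of a running average. The scalar-output, unit-variance Gaussian case is assumed here, the learning rates are illustrative, and all function and variable names are hypothetical rather than taken from the slide.

```python
import numpy as np

def angl_step(theta, G_inv, x, y, f, grad_f, lr=0.01, eps=0.005):
    """One adaptive natural gradient step (sketch).

    theta  : current parameter vector
    G_inv  : current estimate of the inverse Fisher information matrix
    f      : network output, f(x, theta) -> scalar
    grad_f : gradient of the output w.r.t. theta, grad_f(x, theta) -> vector
    Assumes additive Gaussian output noise with unit variance, so the
    per-sample Fisher contribution is g g^T and the loss gradient is
    -(y - f) g, where g = df/dtheta.
    """
    g = grad_f(x, theta)

    # Running estimate G_{t+1} = (1 - eps) G_t + eps * g g^T, inverted in
    # closed form via Sherman-Morrison (avoids an O(n^3) matrix inversion).
    Gg = G_inv @ g
    denom = (1.0 - eps) + eps * (g @ Gg)
    G_inv = (G_inv - (eps / denom) * np.outer(Gg, Gg)) / (1.0 - eps)

    # Natural gradient update of the parameters.
    grad_loss = -(y - f(x, theta)) * g      # gradient of 0.5 * (y - f)^2
    theta = theta - lr * (G_inv @ grad_loss)
    return theta, G_inv
```

In practice G_inv would be initialized to the identity matrix and eps decreased over time; for networks with several output nodes the rank-one term generalizes to a low-rank update built from the output Jacobian, roughly as described in the cited paper.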

Implementation of ANGL
- Consider two types of practical applications:
  - Regression problems
    - Predict output values for a given input
    - Time-series prediction, nonlinear system identification
    - Generally continuous outputs
  - Classification problems
    - Assign a given input to one of several classes
    - Pattern recognition, data mining
    - Binary outputs
- Use a different stochastic model for each type:
  - Additive noise model (squared error function)
  - Coin-flipping model (cross-entropy error)

ANGL for Regression Problems (1/3)
- Stochastic model of neural networks: additive noise subject to a probability distribution
- Error function: negative log likelihood
- Noise subject to a Gaussian with scalar variance σ²
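The equations for this slide are missing; under the stated additive Gaussian noise assumption, the model and error function take the following standard form (a reconstruction consistent with Park, Amari & Fukumizu 2000, not copied from the slide):

```latex
% Additive-noise regression model
\boldsymbol{y} = \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta}) + \boldsymbol{\xi},
\qquad \boldsymbol{\xi}\sim \mathcal{N}(\boldsymbol{0},\sigma^{2} I),
\qquad
p(\boldsymbol{y}\mid\boldsymbol{x};\boldsymbol{\theta})
 \propto \exp\!\Bigl(-\tfrac{1}{2\sigma^{2}}\|\boldsymbol{y}-\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\theta})\|^{2}\Bigr).

% The negative log likelihood reduces to the squared error (up to constants)
E(\boldsymbol{\theta}) = -\sum_{t}\log p(\boldsymbol{y}_t\mid\boldsymbol{x}_t;\boldsymbol{\theta})
 = \frac{1}{2\sigma^{2}}\sum_{t}\|\boldsymbol{y}_t-\boldsymbol{f}(\boldsymbol{x}_t;\boldsymbol{\theta})\|^{2} + \mathrm{const}.
```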

ANGL for Regression Problems (2/3)
- Estimation of the Fisher information matrix and the adaptive natural gradient learning algorithm

ANGL for Regression Problems (3/3)
- Case of Gaussian additive noise with scalar variance

ANGL for Classification Problems (1/4)
- Classification problems
  - An output node represents a class (binary values)
  - A different stochastic model from the regression case is needed
- Stochastic model I (case of 2 classes)
  - Output 1 for class 1, output 0 for class 2
- Error function: cross-entropy error function
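Again the equations are missing from the transcript; the usual coin-flipping (Bernoulli) formulation that the slide describes is:

```latex
% Bernoulli model for two classes, network output f(x;\theta) in (0,1)
p(y\mid\boldsymbol{x};\boldsymbol{\theta})
  = f(\boldsymbol{x};\boldsymbol{\theta})^{\,y}\,
    \bigl(1-f(\boldsymbol{x};\boldsymbol{\theta})\bigr)^{1-y},
\qquad y\in\{0,1\}.

% Negative log likelihood = cross-entropy error
E(\boldsymbol{\theta}) = -\sum_{t}\Bigl[\,y_t\log f(\boldsymbol{x}_t;\boldsymbol{\theta})
  + (1-y_t)\log\bigl(1-f(\boldsymbol{x}_t;\boldsymbol{\theta})\bigr)\Bigr].
```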

ANGL for Classification Problems (2/4)
- Estimation of the Fisher information matrix
- Adaptive natural gradient learning algorithm

ANGL for Classification Problems (3/4)
- Stochastic model II (case of multiple classes, L)
  - L output nodes are needed, so that each output node represents one class
- Error function: cross-entropy error function
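A hedged reconstruction of the multi-class model the slide refers to, with one output node per class and one-hot targets (the slide's exact parameterization is not preserved):

```latex
% Multi-class model with L output nodes f_k(x;\theta) and one-hot targets y
p(\boldsymbol{y}\mid\boldsymbol{x};\boldsymbol{\theta})
  = \prod_{k=1}^{L} f_k(\boldsymbol{x};\boldsymbol{\theta})^{\,y_k},
\qquad y_k\in\{0,1\},\quad \textstyle\sum_{k} y_k = 1.

% Cross-entropy error function
E(\boldsymbol{\theta}) = -\sum_{t}\sum_{k=1}^{L} y_{t,k}\,\log f_k(\boldsymbol{x}_t;\boldsymbol{\theta}).
```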

ANGL for Classification Problems (4/4)
- Estimation of the Fisher information matrix
- Adaptive natural gradient learning algorithm

Experiments on Regression Problems (1/3)
- Mackey-Glass time-series prediction
  - Generation of the time series
  - Input: 4 previous values x(t-18), x(t-12), x(t-6), x(t); output: 1 future value x(t+6)
  - Training data: 500 samples (t = 200, ..., 700); test data: 500 samples (t = 5000, ..., 5500)
  - Output noise: subject to a Gaussian distribution
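A hedged sketch of how such a data set is commonly generated. The slide does not give the equation or its parameters; the usual Mackey-Glass benchmark values (tau = 17, a = 0.2, b = 0.1) with a unit-step Euler discretization are assumed here, and the noise level in the example is illustrative.

```python
import numpy as np

def mackey_glass(n_steps, tau=17, a=0.2, b=0.1, x0=1.2, dt=1.0):
    """Generate a Mackey-Glass series by Euler integration of
    dx/dt = a*x(t-tau)/(1 + x(t-tau)^10) - b*x(t).
    Parameter values are common benchmark choices, not taken from the slide."""
    x = np.zeros(n_steps + tau)
    x[:tau] = x0
    for t in range(tau, n_steps + tau - 1):
        x[t + 1] = x[t] + dt * (a * x[t - tau] / (1.0 + x[t - tau] ** 10) - b * x[t])
    return x[tau:]

def make_pairs(x, t_range, noise_std=0.0, seed=0):
    """Build (x(t-18), x(t-12), x(t-6), x(t)) -> x(t+6) pairs, as on the slide."""
    rng = np.random.default_rng(seed)
    inputs = np.array([[x[t - 18], x[t - 12], x[t - 6], x[t]] for t in t_range])
    targets = np.array([x[t + 6] for t in t_range])
    targets = targets + rng.normal(0.0, noise_std, len(targets))  # Gaussian output noise
    return inputs, targets

# Example usage mirroring the slide's split (500 training, 500 test samples);
# the noise standard deviation 0.01 is illustrative, not from the slide.
series = mackey_glass(6000)
X_train, y_train = make_pairs(series, range(200, 700), noise_std=0.01)
X_test, y_test = make_pairs(series, range(5000, 5500))
```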

Experiments on Regression Problems (2/3)
- Experimental results (average over 10 trials)
- [Table: errors reported on a scale of ×10⁻⁵]

Experiments on Regression Problems (3/3)
- Learning curve

Experiments on Classification Problems - Case of Two Classes (1/3)
- Extended XOR problem, 2 classes
- Use stochastic model I
- Training data: 1800; test data: 900

Experiments on Classification Problems - Case of Two Classes (2/3)
- Experimental results (average over 10 trials)

Experiments on Classification Problems - Case of Two Classes (3/3)
- Learning curve

Experiments on Classification Problems - Case of Multiple Classes (1/3)
- IRIS classification problem: classify three different species of iris flower
- Input: 4 attributes describing the shape of the plant (4 input nodes)
- Output: 3 classes of flower (3 output nodes)
- Use stochastic model II
- Training data: 90 (30 per class); test data: 60 (20 per class)

Experiments on Classification Problems - Case of Multiple Classes (2/3)
- Experimental results (average over 10 trials)

Experiments on Classification Problems - Case of Multiple Classes (3/3)
- Learning curve

Comparison with Second-Order Methods (1/3)
- Newton method
  - Uses a second-order Taylor expansion of the error function around the optimal point θ*
  - Updating rule
  - Effective only in the neighborhood of the optimal point
  - Can be unstable depending on the condition of the Hessian matrix
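For reference, the standard Newton step the slide refers to (reconstructed, not copied from the slide):

```latex
% Second-order Taylor expansion around the current estimate \theta_k
E(\boldsymbol{\theta}) \approx E(\boldsymbol{\theta}_k)
  + \nabla E(\boldsymbol{\theta}_k)^{\top}(\boldsymbol{\theta}-\boldsymbol{\theta}_k)
  + \tfrac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_k)^{\top} H(\boldsymbol{\theta}_k)(\boldsymbol{\theta}-\boldsymbol{\theta}_k),
\qquad H = \nabla^{2} E.

% Newton update: jump to the minimizer of the quadratic model
\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k - H^{-1}(\boldsymbol{\theta}_k)\,\nabla E(\boldsymbol{\theta}_k).
```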

Comparison with Second-Order Methods (2/3)
- Gauss-Newton method
  - Considers a sum-of-squares error function
  - Gauss-Newton approximation of the Hessian
  - Updating rule
- Levenberg-Marquardt method
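Reconstructed standard forms of the approximations named on this slide (consistent with Bishop 1995, which the talk cites; not the slide's own notation):

```latex
% Sum-of-squares error over residuals r_t(\theta) = y_t - f(x_t;\theta)
E(\boldsymbol{\theta}) = \tfrac{1}{2}\sum_t r_t(\boldsymbol{\theta})^{2},
\qquad
H = \sum_t \nabla r_t \nabla r_t^{\top} + \sum_t r_t \nabla^{2} r_t
  \;\approx\; \sum_t \nabla r_t \nabla r_t^{\top}
  \quad\text{(Gauss-Newton: drop the second-derivative term)}.

% Gauss-Newton update and Levenberg-Marquardt damping
\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k
  - \Bigl(\textstyle\sum_t \nabla r_t \nabla r_t^{\top}\Bigr)^{-1} \nabla E(\boldsymbol{\theta}_k),
\qquad
\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k
  - \Bigl(\textstyle\sum_t \nabla r_t \nabla r_t^{\top} + \lambda I\Bigr)^{-1} \nabla E(\boldsymbol{\theta}_k)
  \quad\text{(Levenberg-Marquardt)}.
```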

Comparison with Second-Order Methods (3/3)
- Natural gradient learning
  - Updating rule
  - On-line or batch learning
  - Works with a general error function
  - Takes the geometrical characteristics of the space of neural networks into account
- Gauss-Newton method
  - Updating rule
  - Batch learning
  - Assumes a sum-of-squares error
  - Uses a quadratic approximation of the error and an approximation of the Hessian
- Under the assumption of additive Gaussian noise with scalar variance, the natural gradient method gives a theoretical justification of the Gauss-Newton approximation; the natural gradient method can be seen as a generalization of the Gauss-Newton method
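The connection claimed on this slide can be seen from the Fisher matrix of the additive Gaussian model (a short derivation added here, not on the slide):

```latex
% For p(y|x;\theta) = N(y; f(x;\theta), \sigma^2), the Fisher information matrix is
G(\boldsymbol{\theta})
  = \mathbb{E}_{x}\!\left[\tfrac{1}{\sigma^{2}}\,
    \nabla f(\boldsymbol{x};\boldsymbol{\theta})\,\nabla f(\boldsymbol{x};\boldsymbol{\theta})^{\top}\right].
```

Up to the factor 1/σ², this is the expectation of the Gauss-Newton term Σ ∇r ∇rᵀ, so the natural-gradient update with this G coincides in expectation with a Gauss-Newton step, while remaining defined for general error functions and on-line learning.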

Conclusions
- A study of an efficient learning method
  - Considered the plateau problem in learning
  - Considered the geometrical structure of the space of neural networks
- Took an information-geometrical approach to the plateau problem
  - Presented natural gradient learning as a solution to the plateau problem
  - Presented adaptive natural gradient learning as a practical way of realizing the natural gradient for neural networks
- Showed practical advantages of adaptive natural gradient learning
  - Compared it with second-order methods
- When is natural gradient learning a good choice?
  - Problems with large data sets and small network size
  - Problems requiring fine accuracy of approximation

References
- Plateau problem
  - Fukumizu, K. & Amari, S. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13, 317-328.
  - Saad, D. & Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52, 4225-4243.
- Second-order methods and learning theory
  - Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
  - LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). In G. B. Orr and K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer Lecture Notes in Computer Science, vol. 1524. Heidelberg: Springer.
- Information geometry
  - Amari, S. & Nagaoka, H. (1999). Information Geometry. AMS and Oxford University Press.

References (continued)
- Basic concept of the natural gradient
  - Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251-276.
- Natural gradient for neural networks
  - Amari, S., Park, H., & Fukumizu, K. (2000). Adaptive method of realizing natural gradient learning for multilayer perceptrons. Neural Computation, 12, 1399-1409.
  - Park, H., Amari, S., & Lee, Y. (1999). An information geometrical approach on plateau problems in multilayer perceptron learning. Journal of KISS(B): Software and Applications, 26(4), 546-556. (in Korean)
  - Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13, 755-764.
  - Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461-5464.