Natural Gradient Works Efficiently in Learning
S. Amari
11.03.18 (Fri), Computational Modeling of Intelligence
Summarized by Joon Shik Kim


Abstract
The ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. The dynamical behavior of natural gradient online learning is analyzed and proved to be Fisher efficient. The plateau phenomenon, which appears in the backpropagation learning algorithm of the multilayer perceptron, might disappear or might not be so serious when the natural gradient is used.

Introduction (1/2)
The stochastic gradient method is a popular learning method in the general nonlinear optimization framework. In many cases, however, the parameter space is not Euclidean but has a Riemannian metric structure. In these cases, the ordinary gradient does not give the steepest direction of the target function.

Introduction (2/2)
Barkai, Seung, and Sompolinsky (1995) proposed an adaptive method of adjusting the learning rate. We generalize their idea and evaluate its performance based on the Riemannian metric of errors.

Natural Gradient (1/5)
The squared length of a small incremental vector dw in a Euclidean orthonormal coordinate system is $|dw|^2 = \sum_i (dw_i)^2$. When the coordinate system is nonorthogonal, the squared length is given by the quadratic form $|dw|^2 = \sum_{i,j} g_{ij}(w)\, dw_i\, dw_j$, where $G = (g_{ij}(w))$ is the Riemannian metric tensor.
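As a quick illustration of the two length formulas above, the following sketch (not from the slides; the metric values are assumed for the example) computes $|dw|^2$ under an identity metric and under a non-orthonormal metric G.

```python
import numpy as np

# Minimal sketch: squared length of a small increment dw under a metric G.
# In the Euclidean (orthonormal) case G is the identity; in the Riemannian
# case G = (g_ij(w)) is a positive-definite matrix that depends on w.
dw = np.array([0.1, -0.2])

G_euclid = np.eye(2)                      # orthonormal coordinates
G_riemann = np.array([[2.0, 0.5],         # example metric (assumed values)
                      [0.5, 1.0]])

sq_len_euclid = dw @ G_euclid @ dw        # sum_i dw_i^2
sq_len_riemann = dw @ G_riemann @ dw      # sum_ij g_ij dw_i dw_j

print(sq_len_euclid, sq_len_riemann)
```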

Natural Gradient (2/5)
The steepest descent direction of a function L(w) at w is defined by the vector dw that minimizes L(w + dw), where |dw| has a fixed length, that is, under the constraint $|dw|^2 = \varepsilon^2$ for a sufficiently small constant $\varepsilon$.
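The slide states the constrained problem without the intermediate steps; the following LaTeX sketch is my reconstruction of the standard Lagrange-multiplier argument (not taken from the slide) showing why the minimizer points along $G^{-1}\nabla L$.

```latex
% Derivation sketch: minimize the first-order expansion of L(w + dw)
% under the fixed-length constraint, using a Lagrange multiplier lambda.
\begin{align*}
  &\min_{dw}\; \nabla L(w)^{\top} dw
     \quad \text{subject to} \quad dw^{\top} G(w)\, dw = \varepsilon^{2}, \\
  &\frac{\partial}{\partial\, dw}\Bigl[ \nabla L(w)^{\top} dw
     - \lambda\, dw^{\top} G(w)\, dw \Bigr]
   = \nabla L(w) - 2\lambda\, G(w)\, dw = 0 \\
  &\;\Longrightarrow\; dw \propto G^{-1}(w)\, \nabla L(w).
\end{align*}
% Taking the sign that decreases L gives dw proportional to -G^{-1}(w) grad L(w),
% i.e. the natural gradient direction of the next slide.
```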

Natural Gradient (3/5)
The steepest descent direction of L(w) in a Riemannian space is given by $-\tilde{\nabla} L(w) = -G^{-1}(w)\, \nabla L(w)$, where $G^{-1}(w)$ is the inverse of the metric $G(w)$; $\tilde{\nabla} L(w)$ is called the natural gradient of L.
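A minimal numerical sketch of this formula, with an assumed loss $L(w) = w_1^2 + 10 w_2^2$ and an assumed diagonal metric G, showing how the natural gradient direction $G^{-1}\nabla L$ differs from the ordinary gradient:

```python
import numpy as np

# Minimal sketch (assumed example, not from the slides): ordinary vs.
# natural gradient direction for L(w) under a non-Euclidean metric G(w).
def loss_grad(w):
    # gradient of the example loss L(w) = w1^2 + 10*w2^2
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([1.0, 1.0])
G = np.array([[1.0, 0.0],      # assumed Riemannian metric at w
              [0.0, 10.0]])

grad = loss_grad(w)                    # ordinary gradient
nat_grad = np.linalg.solve(G, grad)    # natural gradient G^{-1} grad L

print("ordinary:", -grad)              # steepest direction only if G = I
print("natural: ", -nat_grad)          # steepest direction in the Riemannian sense
```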

Natural Gradient (4/5)

Natural Gradient (5/5)

Natural Gradient Learning
The risk function, or average loss, is $L(w) = E[\,l(x, y; w)\,]$. Learning is a procedure to search for the optimal w* that minimizes L(w). Stochastic gradient descent learning uses the update $w_{t+1} = w_t - \eta_t\, \nabla l(x_t, y_t; w_t)$; natural gradient learning replaces the ordinary gradient with the natural gradient, $w_{t+1} = w_t - \eta_t\, G^{-1}(w_t)\, \nabla l(x_t, y_t; w_t)$, where $\eta_t$ is the learning rate.
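A minimal sketch of this online rule on a toy linear-Gaussian model (my own example, not the paper's experiment); the metric is estimated on the fly from the input stream as $E[xx^\top]$, which is proportional to the Fisher information of this model, and $\eta_t = 1/t$:

```python
import numpy as np

# Minimal sketch (assumed toy example): natural gradient learning
#   w_{t+1} = w_t - eta_t * G^{-1}(w_t) * grad l(x_t, y_t; w_t)
# on a linear-Gaussian model y = w.x + noise with eta_t = 1/t.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
scale = np.array([3.0, 0.3])       # inputs with very different variances
w = np.zeros(2)
G = np.eye(2)                      # running estimate of the metric

for t in range(1, 2001):
    x = rng.normal(size=2) * scale
    y = w_true @ x + 0.1 * rng.normal()
    G = (1 - 1 / (t + 1)) * G + (1 / (t + 1)) * np.outer(x, x)  # online E[x x^T]
    grad = -(y - w @ x) * x                        # gradient of the squared-error loss
    w = w - (1.0 / t) * np.linalg.solve(G, grad)   # natural gradient step

print(w)   # should be close to w_true
```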

Statistical Estimation of Probability Density Function (1/2)
In the case of statistical estimation, we assume a statistical model {p(z, w)}, and the problem is to obtain the probability distribution that approximates the unknown density function q(z) in the best way. The loss function is the negative log likelihood, $l(z, w) = -\log p(z, w)$.

Statistical Estimation of Probability Density Function (2/2)
The expected loss is then given by $L(w) = E_q[\,-\log p(z, w)\,] = H_Z + D(q \,\|\, p(\cdot, w))$, where $H_Z$ is the entropy of q(z), not depending on w, and D is the Kullback-Leibler divergence. The Riemannian metric in this case is the Fisher information matrix, $g_{ij}(w) = E\!\left[\frac{\partial \log p(z, w)}{\partial w_i}\, \frac{\partial \log p(z, w)}{\partial w_j}\right]$.
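To make the Fisher-information metric concrete, here is a sketch (my own example) that estimates G by Monte Carlo from the score function of a two-parameter Gaussian model and compares it against the known closed form:

```python
import numpy as np

# Minimal sketch: G(w) = E[ score(z, w) score(z, w)^T ] estimated by
# Monte Carlo for a Gaussian model p(z, w) = N(z; mu, sigma^2) with
# w = (mu, sigma). Analytic Fisher matrix: diag(1/sigma^2, 2/sigma^2).
rng = np.random.default_rng(0)
mu, sigma = 0.5, 2.0
z = rng.normal(mu, sigma, size=200_000)

score_mu = (z - mu) / sigma**2                       # d/dmu log p
score_sigma = (z - mu)**2 / sigma**3 - 1.0 / sigma   # d/dsigma log p
scores = np.stack([score_mu, score_sigma], axis=1)

G = scores.T @ scores / len(z)                       # empirical Fisher matrix
print(G)     # approx [[0.25, 0], [0, 0.5]] for sigma = 2
```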

Fisher Information as the Metric of Kullback-Leibler Divergence (1/2)
Consider the Kullback-Leibler divergence between two nearby distributions in the model, $q(z; \theta)$ and $p(z) = q(z; \theta + h)$: $D[\,q(\cdot; \theta) \,\|\, q(\cdot; \theta + h)\,] = \int q(z; \theta) \log \frac{q(z; \theta)}{q(z; \theta + h)}\, dz$.

Fisher Information as the Metric of Kullback-Leibler Divergence (2/2)
Expanding the divergence to second order in h gives $D[\,q(\cdot; \theta) \,\|\, q(\cdot; \theta + h)\,] \approx \tfrac{1}{2}\, h^{\top} I(\theta)\, h$, where I is the Fisher information matrix, so the Fisher information acts as the local metric of the Kullback-Leibler divergence.
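A quick numerical check of this quadratic approximation on a Bernoulli(θ) model, an assumed example chosen because its Fisher information $1/(\theta(1-\theta))$ is simple:

```python
import numpy as np

# Minimal sketch: for a small perturbation h of the parameter theta,
#   KL( q(z; theta) || q(z; theta + h) ) ~= (1/2) h^T I(theta) h,
# checked for a Bernoulli(theta) model with I(theta) = 1/(theta*(1-theta)).
theta, h = 0.3, 0.01

def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

fisher = 1.0 / (theta * (1 - theta))
print(kl_bernoulli(theta, theta + h))   # exact KL divergence
print(0.5 * fisher * h**2)              # quadratic approximation, nearly equal
```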

Multilayer Neural Network (1/2)

Multilayer Neural Network (2/2)
With additive Gaussian noise on the output, the network defines the conditional distribution $p(y \mid x, w) = c\, \exp\!\left(-\tfrac{1}{2}\, \|y - \varphi(x, w)\|^2\right)$, where $\varphi(x, w)$ is the network output and c is a normalizing constant.
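A sketch of the resulting stochastic model for an assumed one-hidden-layer network with unit-variance Gaussian output noise; the architecture and parameter names here are illustrative, not those of the paper:

```python
import numpy as np

# Minimal sketch (assumed architecture): a one-hidden-layer perceptron
# phi(x, w) with additive Gaussian noise, so that
#   p(y | x, w) = c * exp(-0.5 * ||y - phi(x, w)||^2),
# where c is the Gaussian normalizing constant; the negative
# log-likelihood is then the usual squared error plus a constant.
def phi(x, W1, b1, w2, b2):
    h = np.tanh(W1 @ x + b1)        # hidden layer
    return w2 @ h + b2              # scalar output

def log_likelihood(x, y, W1, b1, w2, b2):
    c = 1.0 / np.sqrt(2.0 * np.pi)  # normalizing constant for unit noise variance
    return np.log(c) - 0.5 * (y - phi(x, W1, b1, w2, b2)) ** 2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
w2, b2 = rng.normal(size=3), 0.0
print(log_likelihood(np.array([0.5, -1.0]), 0.2, W1, b1, w2, b2))
```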

Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (1/4)
Let $D_T = \{(x_1, y_1), \ldots, (x_T, y_T)\}$ be T independent input-output examples generated by the teacher network having parameter w*. Minimizing the log loss over the training data $D_T$ is to obtain the estimator $\hat{w}_T$ that minimizes the training error $E_{\mathrm{train}}(w) = \frac{1}{T} \sum_{t=1}^{T} l(x_t, y_t; w)$.

Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (2/4)
The Cramér-Rao theorem states that the expected squared error of an unbiased estimator $\hat{w}_T$ satisfies $E\!\left[(\hat{w}_T - w^*)(\hat{w}_T - w^*)^{\top}\right] \ge \frac{1}{T}\, G^{-1}$, where G is the Fisher information matrix. An estimator is said to be efficient, or Fisher efficient, when it attains this bound asymptotically.
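To see the bound concretely, the following sketch checks it for the sample mean of Gaussian data, an estimator that attains the bound (an assumed toy example, not from the slides):

```python
import numpy as np

# Minimal sketch: the Cramer-Rao bound  E[(w_hat - w*)^2] >= 1/(T * I(w*))
# for the sample mean of T Gaussian observations. The sample mean is
# efficient, so the bound is attained: variance = sigma^2 / T.
rng = np.random.default_rng(0)
w_star, sigma, T, trials = 1.0, 2.0, 50, 20_000

samples = rng.normal(w_star, sigma, size=(trials, T))
w_hat = samples.mean(axis=1)                  # unbiased estimator of w*

empirical_mse = np.mean((w_hat - w_star) ** 2)
cramer_rao = sigma**2 / T                     # 1 / (T * I), with I = 1/sigma^2
print(empirical_mse, cramer_rao)              # approximately equal
```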

Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (3/4)
Theorem 2. The natural gradient online estimator $\hat{w}_t$ is Fisher efficient. Proof idea: with learning rate $\eta_t = 1/t$, the error covariance $V_t = E[(\hat{w}_t - w^*)(\hat{w}_t - w^*)^{\top}]$ of the natural gradient update satisfies a recursion whose solution is $V_t = \frac{1}{t} G^{-1} + O(1/t^2)$, so the Cramér-Rao bound is attained asymptotically.

Natural Gradient Gives Fisher-Efficient Online Learning Algorithms (4/4)
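As a numerical illustration of Theorem 2 (a toy sketch, not the paper's simulation), the following simulates online natural gradient estimation of a Gaussian mean with $\eta_t = 1/t$ and compares its mean squared error with the Cramér-Rao bound $\sigma^2 / T$:

```python
import numpy as np

# Toy illustration of Theorem 2 (assumed example): online natural gradient
# estimation of a Gaussian mean w* with known variance, eta_t = 1/t, and
# metric G = 1/sigma^2, so the update reduces to
#   w_{t+1} = w_t + (1/t) * (z_t - w_t).
# The resulting estimator should attain the Cramer-Rao bound sigma^2 / T.
rng = np.random.default_rng(0)
w_star, sigma, T, trials = 0.0, 1.0, 500, 2000

errors = np.empty(trials)
for k in range(trials):
    w = 0.5                                    # arbitrary initialization
    for t in range(1, T + 1):
        z = rng.normal(w_star, sigma)
        grad = -(z - w) / sigma**2             # gradient of -log p(z; w)
        w = w - (1.0 / t) * sigma**2 * grad    # natural gradient step, G^{-1} = sigma^2
    errors[k] = (w - w_star) ** 2

print(errors.mean(), sigma**2 / T)             # empirical MSE vs. Cramer-Rao bound
```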