Fluctuation-Dissipation Relations for Stochastic Gradient Descent Sho Yaida [arXiv: 1810.00004]
Physics of Machine Learning: dynamics of SGD, geometry of the loss-function landscape, algorithms for faster/better learning, … (SGD = Stochastic Gradient Descent)
Outline: 01 FDR for SGD in theory; 02 FDR for SGD in action; 03 Outlook. (FDR = Fluctuation-Dissipation Relation, SGD = Stochastic Gradient Descent)
01 FDR for SGD in theory
ML as optimization of the loss function w.r.t. model parameters: the loss is averaged over a training set such as MNIST or CIFAR-10, and accuracy improves with larger training sets.
GD descends the full-batch gradient of the loss: better accuracy with larger training sets, but each update becomes computationally more expensive → SGD estimates the gradient on a mini-batch instead.
SGD dynamics: θ_{t+1} = θ_t − η ∇f^B(θ_t), where θ are the model parameters, η is the learning rate, and B is a mini-batch realization of size |B|.
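For concreteness, a minimal sketch of the setup the slides assume, in standard notation (the per-sample losses f_α and the training-set size N_s are my labels; check arXiv:1810.00004 for the paper's exact notation):

```latex
% Full-batch loss over the training set and its mini-batch estimator
f(\theta) = \frac{1}{N_s}\sum_{\alpha=1}^{N_s} f_\alpha(\theta),
\qquad
f^{B}(\theta) = \frac{1}{|B|}\sum_{\alpha\in B} f_\alpha(\theta)

% GD follows the full-batch gradient; SGD replaces it with the mini-batch gradient
\mathrm{GD}:\;\; \theta_{t+1} = \theta_t - \eta\,\nabla f(\theta_t)
\qquad\longrightarrow\qquad
\mathrm{SGD}:\;\; \theta_{t+1} = \theta_t - \eta\,\nabla f^{B}(\theta_t)
```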
Stationarity Assumption: SGD sampling of the model parameters at long times is governed by a stationary-state distribution.
No quadratic assumption on loss surfaces; no Gaussian assumption on noise distributions.
⟨ … ⟩ denotes the stationary-state average.
A bit of song & dance with the stationarity assumption then yields a natural, exact relation between stationary-state averages: the first FDR (sketched below).
Exact for any stationary state; easy to measure on the fly; use it to check equilibration/stationarity; use it for adaptive training.
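For reference, here is the first fluctuation-dissipation relation these points refer to, as I transcribe it from arXiv:1810.00004 in the notation above (treat the precise form as something to check against the paper):

```latex
% FDR1: exact for any stationary state of discrete-time SGD;
% no harmonic (quadratic) loss assumption, no Gaussian-noise assumption.
% <.> = stationary-state average, including the average over mini-batch draws.
\left\langle \theta \cdot \nabla f(\theta) \right\rangle
  \;=\;
\frac{\eta}{2}\,
\left\langle \nabla f^{B}(\theta) \cdot \nabla f^{B}(\theta) \right\rangle
```

Both sides are plain averages along the training trajectory, which is what makes the relation easy to monitor on the fly.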
Intuition within the harmonic approximation: a bit of song & dance gives height of the noise ball ~ “temperature”.
Corrections beyond this picture come from higher derivatives of the loss and higher correlations of the noise.
Linearity for small η; nonlinearity for high η: breakdown of the constant-Hessian & constant-noise-matrix approximation.
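A minimal sketch of the harmonic intuition behind the previous points, under the assumptions of a locally quadratic loss and a constant noise matrix (H and C are my labels for the Hessian and the mini-batch gradient-noise covariance; this is an illustration, not the paper's relations verbatim):

```latex
% Locally quadratic loss; mini-batch gradient = full gradient + noise of covariance C
f(\theta) \approx \tfrac{1}{2}\,\theta^{\mathsf T} H\,\theta,
\qquad
\nabla f^{B}(\theta) = H\theta + \xi, \quad \mathrm{Cov}(\xi) = C

% Stationary covariance of SGD at leading order in \eta (a discrete Lyapunov equation):
H\,\langle \theta\,\theta^{\mathsf T} \rangle
  + \langle \theta\,\theta^{\mathsf T} \rangle\, H
  \;\approx\; \eta\, C
% In one dimension: <theta^2> ~ eta C / (2H), i.e. the "height of the noise ball"
% grows linearly with the learning rate, which plays the role of a temperature.
```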
02 FDR for SGD in action
FDR1 checks equilibration/stationarity: compare the (half-running) time averages of its two sides and expect them to agree (ratio → 1) once sampling has become stationary.
[Plot: the two (half-running) time averages, dotted vs. solid, during training.]
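A minimal PyTorch-style sketch of how the two sides of FDR1 could be tracked on the fly; the toy model, data stream, and all names here are illustrative assumptions, not code from the paper (which also uses half-running rather than full running averages):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)          # hypothetical toy model
eta = 0.1                         # learning rate
opt = torch.optim.SGD(model.parameters(), lr=eta)

sum_left = 0.0                    # accumulates  theta . grad f^B    ~ LHS of FDR1
sum_right = 0.0                   # accumulates |grad f^B|^2         ~ RHS of FDR1 (times eta/2)

for step in range(1, 2001):
    # hypothetical noisy data stream, so that a nontrivial stationary state exists
    x = torch.randn(32, 10)
    y = x.sum(dim=1, keepdim=True) + 0.5 * torch.randn(32, 1)
    loss = ((model(x) - y) ** 2).mean()

    opt.zero_grad()
    loss.backward()

    # read the observables off theta_t and grad f^B(theta_t), BEFORE the update
    sum_left += sum((p.detach() * p.grad).sum().item() for p in model.parameters())
    sum_right += sum((p.grad * p.grad).sum().item() for p in model.parameters())

    opt.step()

    if step % 500 == 0:
        # expect this ratio to approach 1 as sampling becomes stationary
        print(step, sum_left / (0.5 * eta * sum_right))
```

Discarding the first half of the run (half-running averages) drops the transient and sharpens the check.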
Slope @ small η: magnitude of the Hessian. Nonlinearity @ high η: breakdown of the constant-Hessian & constant-noise-matrix approximation. (Measured for an MLP on MNIST and a CNN on CIFAR-10.)
FDR1 algorithmizes adaptive training scheduling: measure the FDR ratio at the end of each epoch; if the two sides agree to within a preset tolerance, then decay the learning rate (see the sketch below).
Adaptive training scheduling in action: MLP for MNIST, CNN for CIFAR-10.
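A minimal sketch of such an FDR-based scheduler; the tolerance, the decay factor, and the class/method names are illustrative assumptions, and the paper's exact criterion may differ:

```python
class FDRScheduler:
    """Decay the learning rate once the FDR1 check passes within tolerance.

    sum_left and sum_right are the per-epoch accumulations of theta . grad f^B
    and |grad f^B|^2, as in the earlier monitoring sketch.
    """

    def __init__(self, optimizer, tolerance=0.01, decay=0.1):
        self.optimizer = optimizer    # a PyTorch-style optimizer with param_groups
        self.tolerance = tolerance    # illustrative threshold on |ratio - 1|
        self.decay = decay            # illustrative learning-rate decay factor

    def end_of_epoch(self, sum_left, sum_right):
        eta = self.optimizer.param_groups[0]["lr"]
        ratio = sum_left / (0.5 * eta * sum_right)
        # agreement of the two sides of FDR1 signals equilibration at the
        # current learning rate, so anneal: lower the "temperature" eta
        if abs(ratio - 1.0) < self.tolerance:
            for group in self.optimizer.param_groups:
                group["lr"] *= self.decay
        return ratio
```

Cooling only after the FDR check passes mirrors annealing: equilibrate at the current temperature, then lower it and keep training.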
03 Outlook
Physics of Machine Learning: dynamics of SGD, geometry of the loss-function landscape, algorithms for faster/better learning, …
FDR for SGD: for equilibration dynamics, for the loss-function landscape, for adaptive training algorithms.
Adaptive training algorithm toward near-SOTA performance.
Time-dependence (Onsager, Green-Kubo, Jarzynski, …): time-dependent sample distributions (ads; lifelong/sequential/continual learning); quasi-stationarity (cascading overfitting dynamics in deep learning); etc.
Connection to whatever Dan is doing.