Fluctuation-Dissipation Relations for Stochastic Gradient Descent

Presentation transcript:

Fluctuation-Dissipation Relations for Stochastic Gradient Descent. Sho Yaida [arXiv:1810.00004]

Physics of Machine Learning: dynamics of SGD; geometry of the loss-function landscape; algorithms for faster/better learning; ... (SGD = Stochastic Gradient Descent)

Outline: 01 FDR for SGD in theory; 02 FDR for SGD in action; 03 Outlook. (FDR = Fluctuation-Dissipation Relations, SGD = Stochastic Gradient Descent)

01 FDR for SGD in theory

ML as optimization of the loss function f(θ) = (1/N) Σ_α f_α(θ), averaged over the N training samples, with respect to the model parameters θ: better accuracy with larger N (e.g., MNIST, CIFAR-10).

GD: θ(t+1) = θ(t) − η ∇f[θ(t)], a full-batch gradient step. With larger N this is more accurate but computationally more expensive → SGD.

SGD dynamics: θ(t+1) = θ(t) − η ∇f_B[θ(t)], where θ are the model parameters, η is the learning rate, and B is a mini-batch realization of size |B|, drawn at random from the training set at each step. A sketch of this update follows.
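
A minimal sketch of this update rule in NumPy, using an illustrative toy quadratic loss f_α(θ) = ½‖θ − x_α‖²; every name and value below is an assumption for illustration, not something taken from the slides:

```python
import numpy as np

# Toy setup (illustrative): N = 1000 samples in 10 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))       # "training samples" x_alpha
theta = np.zeros(10)                  # model parameters theta
eta, batch_size = 0.1, 32             # learning rate eta, mini-batch size |B|

def grad_minibatch_loss(theta, batch):
    # For f_alpha(theta) = 0.5*||theta - x_alpha||^2, the mini-batch gradient is
    # grad f_B(theta) = theta - mean of x_alpha over the mini-batch B.
    return theta - X[batch].mean(axis=0)

for t in range(500):
    batch = rng.choice(len(X), size=batch_size, replace=False)  # mini-batch realization B
    theta = theta - eta * grad_minibatch_loss(theta, batch)     # theta(t+1) = theta(t) - eta * grad f_B
```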

Stationarity Assumption: SGD sampling at long times is governed by a stationary-state distribution over the model parameters θ.

Stationarity Assumption: SGD sampling at long times is governed by a stationary-state distribution; no quadratic assumption on the loss surface, no Gaussian assumption on the noise distribution.

Stationarity Assumption: SGD sampling at long times is governed by a stationary-state distribution p_ss(θ); the stationary-state average of an observable O(θ) is ⟨O⟩ ≡ ∫ dθ p_ss(θ) O(θ), so in particular ⟨O[θ(t+1)]⟩ = ⟨O[θ(t)]⟩.

Stationarity Assumption + the SGD update rule (a short song & dance: take the observable ‖θ‖², expand ⟨‖θ(t+1)‖²⟩ = ⟨‖θ(t)‖²⟩, and use the fact that the mini-batch gradient averages to the full gradient, ⟨∇f_B⟩ = ∇f) naturally yield a fluctuation-dissipation relation, FDR1:

⟨θ · ∇f⟩ = (η/2) ⟨∇f_B · ∇f_B⟩

FDR1 is exact for any stationary state and easy to measure on the fly: use it to check equilibration/stationarity, and use it for adaptive training (a sketch of the on-the-fly measurement follows).
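
A minimal sketch of measuring both sides of FDR1 on the fly, reusing the illustrative toy quadratic loss from the sketch above (the names O_L and O_R and all values are assumptions for illustration). Since the mini-batch is drawn independently of θ(t), ⟨θ·∇f_B⟩ = ⟨θ·∇f⟩, so the mini-batch gradient already computed for the update can be reused on the left-hand side:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))       # toy "training samples"
theta = rng.normal(size=10)
eta, batch_size = 0.05, 32

O_L, O_R = [], []                     # samples of the two sides of FDR1
for t in range(20000):
    batch = rng.choice(len(X), size=batch_size, replace=False)
    grad_B = theta - X[batch].mean(axis=0)      # mini-batch gradient grad f_B(theta)
    O_L.append(theta @ grad_B)                  # estimates <theta . grad f>
    O_R.append(0.5 * eta * grad_B @ grad_B)     # estimates (eta/2) <grad f_B . grad f_B>
    theta = theta - eta * grad_B                # SGD update

# At stationarity the two long-time averages should agree:
print(np.mean(O_L[10000:]), np.mean(O_R[10000:]))
```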

Intuition within the harmonic approximation (again a short song & dance): SGD settles into a noise ball around a minimum, and the height of the noise ball plays the role of a "temperature".

Beyond the harmonic approximation, higher derivatives of the loss and higher correlations of the noise enter.

Linearity for small η; nonlinearity for high η signals the breakdown of the constant-Hessian and constant-noise-matrix approximation.

02 FDR for SGD in action

FDR1 checks equilibration/stationarity.

Checking equilibration/stationarity: compare the (half-running) time averages of the two sides of FDR1; expect their ratio to approach 1 at stationarity (a sketch follows).
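
A minimal sketch of the half-running time average, assuming it is defined as the average over the most recent half of the run; the function name and the reuse of the O_L / O_R sample lists from the sketch above are illustrative assumptions:

```python
import numpy as np

def half_running_average(samples):
    # Average over the second half of the history collected so far, i.e. over [t/2, t].
    samples = np.asarray(samples)
    return samples[len(samples) // 2:].mean()

# With O_L and O_R collected as in the previous sketch, the ratio of their
# half-running averages is expected to approach 1 once SGD has equilibrated:
# ratio = half_running_average(O_L) / half_running_average(O_R)
```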

[Figure: half-running time averages of the two sides of FDR1, dotted vs. solid.]

Slope @ small η: magnitude of the Hessian. Nonlinearity @ high η: breakdown of the constant-Hessian & constant-noise-matrix approximation (shown for an MLP on MNIST and a CNN on CIFAR-10).

FDR1 also algorithmizes adaptive training scheduling: measure both sides of FDR1 during each epoch; if, at the end of the epoch, their time averages agree to within a preset tolerance, then decrease the learning rate η (demonstrated for an MLP on MNIST and a CNN on CIFAR-10). A sketch of such a schedule follows.
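
A minimal sketch of such an FDR-based schedule on the illustrative toy problem used above; the tolerance, the decay factor, and the use of a plain per-epoch average in place of a half-running average are all assumptions made for brevity:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))       # toy "training samples"
theta = rng.normal(size=10)
eta, batch_size = 0.1, 32
tol, decay = 0.05, 0.1                # illustrative tolerance and learning-rate decay factor

for epoch in range(50):
    O_L, O_R = [], []
    for step in range(len(X) // batch_size):
        batch = rng.choice(len(X), size=batch_size, replace=False)
        grad_B = theta - X[batch].mean(axis=0)
        O_L.append(theta @ grad_B)                # left-hand side of FDR1
        O_R.append(0.5 * eta * grad_B @ grad_B)   # right-hand side of FDR1
        theta = theta - eta * grad_B
    ratio = np.mean(O_L) / np.mean(O_R)
    if abs(ratio - 1.0) < tol:        # FDR1 satisfied: stationary at this learning rate
        eta *= decay                  # anneal the learning rate and keep training
```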

03 Outlook

Physics of Machine Learning: dynamics of SGD; geometry of the loss-function landscape; algorithms for faster/better learning; ...

FDR for SGD: for equilibration dynamics, for the loss-function landscape, and for an adaptive training algorithm.

Adaptive training algorithm for near-SOTA. Time-dependence (Onsager, Green-Kubo, Jarzynski, ...): in the sample distribution (ads, lifelong/sequential/continual learning); quasi-stationarity (cascading overfitting dynamics in deep learning); etc. Connection to whatever Dan is doing.