A Stochastic Quasi-Newton Method for Large-Scale Learning
Jorge Nocedal, Northwestern University
with S. Hansen, R. Byrd and Y. Singer
IPAM, UCLA, Feb 2014

Goal
- Propose a robust quasi-Newton method that operates in the stochastic approximation regime: a purely stochastic method (not batch), built to compete with the stochastic gradient (SG) method
- Full non-diagonal Hessian approximation
- Scalable to millions of parameters

Outline
- Are iterations of the following form viable?
  - theoretical considerations; iteration costs
  - differencing noisy gradients?
- Key ideas: compute curvature information pointwise at regular intervals, and build on the strength of BFGS updating, recalling that it is an overwriting (not an averaging) process
- Results on text and speech problems
  - examine both training and testing errors

Problem
- Applications: simulation optimization, machine learning
- The algorithm is not (yet) applicable to simulation-based optimization
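The objective on this slide appeared only as an image; a standard formulation for this setting (the symbols F, f_i, N, and n are assumed notation, not copied from the slide) is

    \min_{w \in \mathbb{R}^n} \; F(w) = \mathbb{E}_{\xi}\big[f(w;\xi)\big] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} f_i(w),

i.e. an expected loss approximated by an empirical average over N training examples.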

Stochastic gradient method
For the loss function, the Robbins-Monro (stochastic gradient) method takes steps along a stochastic gradient estimator computed on a mini-batch.
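The iteration itself was shown as an image; the standard mini-batch form, in assumed notation, is

    w_{k+1} = w_k - \alpha_k \,\widehat{\nabla} F(w_k), \qquad
    \widehat{\nabla} F(w_k) = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla f_i(w_k),

where S_k is the mini-batch of size b sampled at iteration k and \alpha_k is the steplength.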

Why it won't work ...
1. Is there any reason to think that including a Hessian approximation will improve upon the stochastic gradient method?
2. Iteration costs are so high that even if the method is faster than SG in terms of training cost, it will be a weaker learner.

Theoretical Considerations
Number of iterations needed to compute an epsilon-accurate solution:
- depends on the Hessian at the true solution and the gradient covariance matrix
- depends on the condition number of the Hessian at the true solution
- a properly scaled (Newton-like) iteration completely removes the dependency on the condition number (Murata 98); cf. Bottou-Bousquet

Computational cost
Assuming we obtain the efficiencies of classical quasi-Newton methods in limited-memory form:
- each iteration requires 4Md operations (with M = 5, that is 20d)
- M = memory in the limited-memory implementation; M = 5
- d = dimension of the optimization problem
Compare with the cost of the stochastic gradient method.

Mini-batching
- Assuming a mini-batch of size b = 50, the cost of a stochastic gradient is 50d
- Use of small mini-batches will be a game-changer: b = 10, 50, ...

Game changer? Not quite...
Mini-batching makes the operation counts favorable, but it does not resolve the challenges related to noise.
1. Avoid differencing noise
   - curvature estimates must not be corrupted by sporadic spikes in noise (Schraudolph et al. (1999), Ribeiro et al. (2013))
   - quasi-Newton updating is an overwriting process, not an averaging process
   - control the quality of curvature information
2. Cost of curvature computation
   - use of small mini-batches will be a game-changer: b = 10, 50, ...

Design of a Stochastic Quasi-Newton Method
- Propose a method based on the famous BFGS formula
  - all components seem to fit together well
  - numerical performance appears to be strong
- Propose a new quasi-Newton updating formula
  - specifically designed to deal with noisy gradients
  - work in progress

Review of the deterministic BFGS method
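The formulas on this slide were shown as images; the classical deterministic BFGS iteration (in inverse-Hessian form), which the rest of the talk builds on, is

    w_{k+1} = w_k - \alpha_k H_k \nabla F(w_k), \qquad
    s_k = w_{k+1} - w_k, \quad y_k = \nabla F(w_{k+1}) - \nabla F(w_k),

    H_{k+1} = (I - \rho_k s_k y_k^T)\, H_k \,(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T, \qquad
    \rho_k = \frac{1}{y_k^T s_k}.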

The remarkable properties of the BFGS method (convex case)
- Superlinear convergence; global convergence for strongly convex problems; self-correction properties
- Only need to approximate the Hessian in a subspace
- Powell 76; Byrd-N 89

Adaptation to the stochastic setting
- We cannot mimic the classical approach and update after each iteration: since the batch size b is small, this would yield highly noisy curvature estimates
- Instead: use a collection of iterates to define the correction pairs

Stochastic BFGS: Approach 1
- Define two collections of size L
- Define the average iterate/gradient over each collection
- Form a new curvature pair from these averages
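The definitions were images; a plausible reconstruction consistent with the surrounding text (the index sets I, J for consecutive windows and the averaged quantities are assumed notation):

    \bar{w}_I = \frac{1}{L} \sum_{i \in I} w_i, \qquad
    \bar{w}_J = \frac{1}{L} \sum_{i \in J} w_i,

    s = \bar{w}_I - \bar{w}_J, \qquad
    y = \widehat{\nabla} F(\bar{w}_I) - \widehat{\nabla} F(\bar{w}_J).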

Stochastic L-BFGS: First Approach

Stochastic BFGS: Approach 1
We could not make this work in a robust manner!
1. Two sources of error
   - sample variance
   - lack of sample uniformity
2. Initial reaction
   - control the quality of the average gradients
   - use of sample variance ... dynamic sampling
Proposed solution: control the quality of the curvature estimate y directly.

Key idea: avoid differencing
- The standard definition of the curvature vector y arises from differencing gradients
- Hessian-vector products are often available
- Define the curvature vector for L-BFGS via a Hessian-vector product, performed only every L iterations
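The formulas here were also images; consistent with the description, the differenced curvature vector is replaced by a subsampled Hessian-vector product (the sample S_H of size b_H is assumed notation):

    y = \Big( \frac{1}{|S_H|} \sum_{i \in S_H} \nabla^2 f_i(\bar{w}_I) \Big) (\bar{w}_I - \bar{w}_J),

computed once every L iterations rather than at every step.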

Structure of the Hessian-vector product
(it has the same mini-batch structure as the stochastic gradient)
1. Code the Hessian-vector product directly
2. Achieve sample uniformity automatically (cf. Schraudolph)
3. Avoid numerical problems when ||s|| is small
4. Control the cost of the y computation
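As an illustration of point 1, a sketch of a directly coded, subsampled Hessian-vector product for L2-regularized binary logistic regression (the test problems in this talk are of this type); the function name, arguments, and regularizer are assumptions, not the authors' code:

import numpy as np

def logistic_hvp(w, v, X, lam=0.0):
    # Subsampled Hessian-vector product for L2-regularized binary logistic regression.
    # X: (m, n) mini-batch of features, v: direction vector of length n.
    # The labels are not needed: for the logistic loss the per-example curvature
    # sigma(z)*(1 - sigma(z)) does not depend on the label.
    z = X @ w                        # margins on the mini-batch
    p = 1.0 / (1.0 + np.exp(-z))     # sigmoid probabilities
    d = p * (1.0 - p)                # per-example curvature weights
    return X.T @ (d * (X @ v)) / X.shape[0] + lam * v

The cost is two matrix-vector products with the mini-batch matrix X, i.e. comparable to a mini-batch gradient evaluation, which is what makes point 4 possible.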

The Proposed Algorithm

Algorithmic Parameters
- b: stochastic gradient batch size
- b_H: Hessian-vector batch size
- L: controls the frequency of quasi-Newton updating
- M: memory parameter in L-BFGS updating (M = 5); use the limited-memory form
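The algorithm itself appeared only as an image two slides back; below is a minimal Python sketch of the iteration as the surrounding slides describe it: SGD steps scaled by an L-BFGS matrix, with a curvature pair formed every L iterations from averaged iterates and a subsampled Hessian-vector product. The helper functions grad_fn and hess_vec_fn, the constant steplength, and the curvature safeguard are assumptions, not the authors' implementation (a function like logistic_hvp above could serve as hess_vec_fn).

import numpy as np

def two_loop_recursion(grad, s_list, y_list):
    # Standard L-BFGS two-loop recursion: returns an approximation to H * grad.
    q = grad.copy()
    rhos = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        a = rho * np.dot(s, q)
        alphas.append(a)
        q -= a * y
    s, y = s_list[-1], y_list[-1]
    q *= np.dot(s, y) / np.dot(y, y)          # initial scaling H_0 = (s^T y / y^T y) I
    for (s, y, rho), a in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * np.dot(y, q)
        q += (a - beta) * s
    return q

def sqn(grad_fn, hess_vec_fn, w0, n_data, alpha=0.01,
        b=300, b_H=1000, L=20, M=5, max_iters=10000, seed=0):
    # grad_fn(w, idx)        -> stochastic gradient on the mini-batch idx   (assumed)
    # hess_vec_fn(w, s, idx) -> subsampled Hessian times s on batch idx     (assumed)
    rng = np.random.default_rng(seed)
    w = w0.copy()
    s_list, y_list = [], []                   # L-BFGS correction pairs
    w_sum = np.zeros_like(w)                  # running sum of iterates for averaging
    w_bar_prev = None                         # average iterate of the previous window
    for t in range(1, max_iters + 1):
        idx = rng.choice(n_data, size=b, replace=False)
        g = grad_fn(w, idx)
        d = two_loop_recursion(g, s_list, y_list) if s_list else g
        w -= alpha * d                        # plain SGD step until curvature pairs exist
        w_sum += w
        if t % L == 0:                        # curvature update every L iterations
            w_bar = w_sum / L
            w_sum[:] = 0.0
            if w_bar_prev is not None:
                s = w_bar - w_bar_prev
                idx_H = rng.choice(n_data, size=b_H, replace=False)
                y = hess_vec_fn(w_bar, s, idx_H)   # y = (subsampled Hessian) * s
                if np.dot(s, y) > 1e-10:           # keep only safe curvature pairs
                    s_list.append(s)
                    y_list.append(y)
                    if len(s_list) > M:            # limited memory: drop the oldest pair
                        s_list.pop(0)
                        y_list.pop(0)
            w_bar_prev = w_bar
    return w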

Need a Hessian to implement a quasi-Newton method? Are you out of your mind?
We do not need the Hessian-vector product, but it has many advantages: complete freedom in sampling and accuracy.

Numerical Tests
- Stochastic gradient method (SGD)
- Stochastic quasi-Newton method (SQN)
It is well known that SGD is highly sensitive to the choice of steplength, and so is the SQN method (though perhaps less so).

RCV1 Problem (figure): SGD vs. SQN, plotted against accessed data points (count includes Hessian-vector products); b = 50, 300, 1000; M = 5, L = 20; b_H = ...; n = ..., N = ...

Speech Problem (figure): SGD vs. SQN; b = 100, 500, ...; M = 5, L = 20; b_H = ...; n = 30315, N = ...

Varying the Hessian batch size b_H: RCV1, b = 300 (figure)

Varying the memory size M in limited-memory BFGS: RCV1 (figure)

Varying the L-BFGS memory size: synthetic problem (figure)

Generalization Error: RCV1 Problem (figure, SGD vs. SQN)

Test Problems
- Synthetically generated logistic regression (Singer et al.): n = 50, N = 7000
- RCV1 dataset: n = ..., N = ...
- SPEECH dataset: NF = 235, |C| = 129, n = NF x |C| = 30315, N = ...

Iteration Costs
- SGD: mini-batch stochastic gradient
- SQN: mini-batch stochastic gradient + a Hessian-vector product every L iterations + an L-BFGS matrix-vector product

Iteration Costs
Typical parameter values: b = 300 (for both SGD and SQN), b_H = 1000, L = 20, M = 5 (range 3-20).
Resulting work per iteration: SGD = bn = 300n; SQN = bn + b_H n / L + 4Mn = 300n + 50n + 20n = 370n.

Hasn't this been done before?
- Hessian-free Newton method: Martens (2010), Byrd et al. (2011)
  - claim: stochastic Newton is not competitive with stochastic BFGS
- Prior work: Schraudolph et al.
  - similar, but cannot ensure the quality of y
  - changes the BFGS formula to a one-sided form

Supporting theory?
Work in progress: Figen Oztoprak, Byrd, Solntsev
- combine the classical analysis (Murata, Nemirovsky et al.) with asymptotic quasi-Newton theory
- effect on the constants (condition number)
- invoke the self-correction properties of BFGS
Practical implementation: limited-memory BFGS
- loses the superlinear convergence property
- enjoys the self-correction mechanisms

Small batches: RCV1 Problem
- SGD: b adp/iter; bn work/iter
- SQN: b + b_H/L adp/iter; bn + b_H n/L + 4Mn work/iter
- with b_H = 1000, M = 5, L = 200
Parameters L, M and b_H provide freedom in adapting the SQN method to a specific application.

Alternative quasi-Newton framework
- The BFGS method was not derived with noisy gradients in mind: how do we know it is an appropriate framework?
- Start from scratch: derive quasi-Newton updating formulas that are tolerant to noise.

Foundations
- Define a quadratic model q around a reference point z
- Using a collection of points indexed by I, it is natural to require that the residuals are zero in expectation
- This is not enough information to determine the whole model
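The model and the residual condition were shown as images; a sketch of the kind of construction being described, with the specific form of the residual condition an assumption:

    q(w) = F(z) + g^T (w - z) + \tfrac{1}{2} (w - z)^T A (w - z),

with g and A to be determined, together with a requirement of the form

    \mathbb{E}\Big[ \sum_{i \in I} \big( \hat{f}(w_i) - q(w_i) \big) \Big] = 0,

i.e. the residuals of the model against the sampled observations vanish in expectation.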

Mean square error
- Given a collection I, choose the model q to minimize the mean square error of its residuals
- Differentiating with respect to g gives an update formula
- Encouraging: the residual condition is recovered

The End