Stochastic Approximation Neta Shoham

References This presentation is based entirely on the book Introduction to Stochastic Search and Optimization (2003) by James C. Spall. The slides here draw heavily on slides that have been used as a teaching aid for ISSO. The latter slides can be found here:

Agenda 1. STOCHASTIC APPROXIMATION FOR ROOT FINDING 2. STOCHASTIC GRADIENT

STOCHASTIC APPROXIMATION FOR ROOT FINDING IN NONLINEAR MODELS

Stochastic Root-Finding Problem Focus is on finding θ (i.e., θ*) such that g(θ) = 0. g(θ) is typically a nonlinear function of θ. Assume only noisy measurements of g(θ) are available: Y_k(θ) = g(θ) + e_k(θ), k = 0, 1, 2,…. The above problem arises frequently in practice: optimization with noisy measurements, where g(θ) represents the gradient of a loss function (see Chapter 5 of ISSO); equation solving in physics-based models; machine learning (see Chapter 11 of ISSO).

Some remarks on the noisy measurements In many applications the measurements include an input vector x_k: Y_k(θ) = f(θ, x_k) + ε_k. We can see this is not far from Y_k(θ) = g(θ) + e_k(θ) by substituting e_k(θ) = f(θ, x_k) − g(θ) + ε_k. If the inputs x_k are independent and identically distributed with E[f(θ, x_k)] = g(θ), the noise e_k(θ) has mean zero. In the case where the distribution of x_k varies with k, we will have the time-varying problem g_k(θ).

Core Algorithm for Stochastic Root-Finding Basic algorithm published in Robbins and Monro (1951). The algorithm is a stochastic analogue of steepest descent when used for optimization: the noisy measurement Y_k(θ) replaces the exact gradient g(θ). It is generally wasteful to average measurements at a given value of θ; instead, average across iterations (changing θ). The core Robbins-Monro algorithm for unconstrained root-finding is θ̂_{k+1} = θ̂_k − a_k Y_k(θ̂_k). A constrained version of the algorithm also exists.
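A minimal sketch of the Robbins-Monro recursion above. The test problem is invented for illustration: g(θ) = θ³ − 8 (root θ* = 2) observed with additive N(0, 1) noise, and clipping to [0, 3] plays the role of the constrained version of the algorithm.

```python
import numpy as np

# Robbins-Monro: theta_{k+1} = theta_k - a_k * Y_k(theta_k), projected to [0, 3].
rng = np.random.default_rng(0)
theta = 1.0
for k in range(20000):
    a_k = 1.0 / (k + 1) ** 0.7                 # a_k -> 0, sum a_k = infinity
    y_k = theta**3 - 8.0 + rng.normal()        # noisy measurement Y_k(theta)
    theta = np.clip(theta - a_k * y_k, 0.0, 3.0)
```

Without the projection, the large early gains combined with the cubic growth of g would make the iterate diverge, which is exactly why a constrained variant is useful in practice.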

Sample mean as an SA algorithm X_i are independent measurements with E(X_i) = μ; the goal is to find μ. The sample mean obeys the recursion θ̂_{k+1} = θ̂_k − a_k(θ̂_k − X_{k+1}) with a_k = 1/(k+1). Letting Y_k(θ) = θ − X_{k+1} puts this recursion in the framework of SA; we can also let g(θ) = θ − μ and e_k = μ − X_{k+1}.
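The identity above can be checked directly: running the SA recursion with a_k = 1/(k+1) reproduces the ordinary sample mean. The data (μ = 5, σ = 2) are made up for the demonstration.

```python
import numpy as np

# Sample mean written as an SA recursion: Y_k(theta) = theta - X_{k+1},
# so g(theta) = theta - mu and the root is theta* = mu.
rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=10000)  # E[X_i] = mu = 5 (illustrative)

theta = 0.0
for k, x_next in enumerate(x):
    a_k = 1.0 / (k + 1)
    theta = theta - a_k * (theta - x_next)   # theta_{k+1} = theta_k - a_k * Y_k
```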

Circular Error Probable (CEP): Example of Root-Finding (Example 4.3 in ISSO) Interested in estimating the radius of a circle about the target such that half of the impacts lie within the circle (θ is the scalar radius). Define the success variable Y_k(θ) = 1{impact k lies within radius θ} − 1/2. The root-finding algorithm becomes θ̂_{k+1} = θ̂_k − a_k Y_k(θ̂_k). Figure on next slide illustrates results for one study.
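A sketch of the CEP recursion under an assumed impact model: impacts drawn from a standard bivariate normal about the aim point, so the true CEP radius is the median of a Rayleigh miss distance, sqrt(2 ln 2) ≈ 1.177. All settings are illustrative.

```python
import numpy as np

# SA recursion for the CEP radius: push theta down when the impact falls
# inside the current circle, up when it falls outside.
rng = np.random.default_rng(7)
theta = 1.0                                   # initial radius guess
for k in range(50000):
    a_k = 1.0 / (k + 1) ** 0.7
    r = np.hypot(*rng.normal(size=2))         # miss distance of impact k
    y_k = (1.0 if r <= theta else 0.0) - 0.5  # success variable minus 1/2
    theta = theta - a_k * y_k
```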

True and estimated CEP: 1000 impact points with impact mean differing from target point (Example 4.3 in ISSO)

Convergence Conditions A central aspect of root-finding SA is the set of conditions for formal convergence of the iterate θ̂_k to a root θ*. This provides a rigorous basis for many popular algorithms (LMS, backpropagation, simulated annealing, etc.). Section 4.3 of ISSO contains two sets of conditions: “statistics” conditions based on classical assumptions about g(θ), the noise, and the gains a_k, and “ODE” conditions based on a connection to a deterministic ordinary differential equation (ODE). Neither set of conditions is a special case of the other.

“Statistics” Conditions
A1 (Gain sequence): a_k > 0, a_k → 0, Σ a_k = ∞, and Σ a_k² < ∞
A2 (Search direction): (θ − θ*)ᵀ B g(θ) > 0 for all θ ≠ θ* and some positive definite B
A3 (Mean-zero noise): E[e_k(θ)] = 0 for all θ and k
A4 (Growth and variance bounds): ‖g(θ)‖² + E[‖e_k(θ)‖²] ≤ c(1 + ‖θ‖²) for all θ, k and some c > 0
“ODE” Conditions
B1 (Gain sequence): a_k > 0, a_k → 0, and Σ a_k = ∞
B2 (Relationship to ODE): g(θ) is continuous and the ODE dZ/dt = −g(Z) has a stable equilibrium point at θ*
B3 (Iterate boundedness): θ̂_k lies infinitely often in A, a compact subset of the domain of attraction (in B2)
B4 (Bounded variance property of the measurement error e_k)
B5 (Disappearing bias in Y_k as an estimate of g)

Connection to ODEs To motivate the connection to an ODE, note that in the deterministic case the algorithm is θ_{k+1} = θ_k − a_k g(θ_k). Define t_k = Σ_{i=0}^{k−1} a_i and Z(t_k) = θ_k. Then we can write (Z(t_{k+1}) − Z(t_k))/a_k = −g(Z(t_k)). Suppose a_k → 0. Then the ODE dZ/dt = −g(Z) can be regarded as a limiting form of the difference equation.
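The limiting relationship above can be seen numerically on a toy problem (g(θ) = θ, root 0, chosen for illustration): the deterministic recursion with a small constant gain is an Euler step of dZ/dt = −Z, whose exact solution is Z(0)e^{−t}.

```python
import numpy as np

# Difference equation theta_{k+1} = theta_k - a*g(theta_k) vs. its limiting ODE.
a = 0.01
theta = 1.0
t = 0.0
for _ in range(100):
    theta = theta - a * theta   # one Euler step of dZ/dt = -Z
    t += a                      # elapsed "ODE time" t_k = sum of gains

ode_value = np.exp(-t)          # exact ODE solution at time t_k = 1.0
```

Shrinking the gain a (while running proportionally more steps) drives the gap between `theta` and `ode_value` to zero, which is the sense in which the ODE is a limiting form.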

ODE Convergence Paths for Nonlinear Problem in Example 4.6 in ISSO: Satisfies ODE Conditions Due to Asymptotic Stability and Global Domain of Attraction

Gain Selection Choice of the gain sequence a_k is critical to the performance of SA. Famous conditions for convergence are Σ a_k = ∞ and Σ a_k² < ∞. A common practical choice of gain sequence is a_k = a/(k + 1 + A)^α, where 1/2 < α ≤ 1, a > 0, and A ≥ 0. A strictly positive A (“stability constant”) allows for a larger a (possibly faster convergence) without risking unstable behavior in early iterations. α and A can usually be pre-specified; the critical coefficient a is usually chosen by “trial and error”.
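The standard gain sequence above is a one-liner; the parameter values below (including α = 0.602, a value often used in practice) are illustrative, not recommendations.

```python
# Gain sequence a_k = a / (k + 1 + A)**alpha. A strictly positive stability
# constant A tames the first few gains so a larger a can be used safely.
def gain(k, a=0.5, A=50.0, alpha=0.602):
    return a / (k + 1 + A) ** alpha

gains = [gain(k) for k in range(1000)]
```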

Some asymptotic details and more about gain selection Fabian (1968) shows that, under appropriate regularity conditions, for a_k = a/(k+1)^α, k^{α/2}(θ̂_k − θ*) converges in distribution to N(μ, Σ), where Σ depends on {a_k} and the Jacobian matrix of g(θ). Under general conditions the rate of convergence is O(1/k^{α/2}) in an appropriate stochastic sense. (Note that in deterministic gradient descent the rate of convergence is O(c^k), 0 < c < 1.) The asymptotically optimal gain is a_k = H(θ*)^{−1}/(k+1), where a_k is now a matrix and H is the Jacobian matrix of g(θ) (analogous to Newton-Raphson search). In practice we do not know θ* or H.

Constant step size In many applications a constant step size is used, to avoid a step size that is too small for large k. Typical applications involve adaptive tracking or control, where θ* is changing in time. A constant step size also appears in neural network training, although there is no variation in the underlying θ*. Algorithms with constant step size generally do not formally converge (there is a theory for weak convergence). The restart trick: periodically restart the algorithm, using the current iterate as the new initial condition and resetting the gain to a_0.
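A sketch of why a constant gain suits tracking: the target θ*_k below drifts linearly, and a constant step keeps following it where a decaying gain would eventually freeze. The drift rate, gain, and noise level are all made up.

```python
import numpy as np

# Constant-gain SA tracking a slowly drifting root theta*_k.
rng = np.random.default_rng(3)
a = 0.1                                           # constant step size
theta = 0.0
for k in range(5000):
    target = 0.001 * k                            # drifting theta*_k
    y_k = (theta - target) + 0.1 * rng.normal()   # noisy measurement of g_k
    theta = theta - a * y_k

final_target = 0.001 * 4999
```

The steady-state tracking error is roughly drift-per-step divided by the gain, plus a noise floor proportional to sqrt(a); this trade-off is why the gain is kept constant rather than decayed.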

Extensions to Basic Root-Finding SA (Section 4.5 of ISSO) Joint parameter and state evolution: there exists a state vector x_k related to the system being optimized, e.g., a state-space model governing the evolution of x_k, where the model depends on the values of θ. Adaptive estimation and higher-order algorithms: adaptively estimating the gain a_k; SA analogues of fast Newton-Raphson search. Iterate averaging: see slides to follow. Time-varying functions: see slides to follow.

Adaptive Estimation and Higher-Order Algorithms The aim is to enhance the convergence rate. SA analogues of Newton-Raphson search try to adaptively estimate the Jacobian in order to achieve the asymptotically optimal gain a_k = H(θ*)^{−1}/(k+1). One of the first adaptive gains (Kesten 1958) is based on the sign of (θ̂_{k+1} − θ̂_k)(θ̂_k − θ̂_{k−1}). Frequent sign changes are an indication that we are close to θ*, and the gain is decreased. While the above methods are effective ways to speed convergence, they are restricted in their range of application.
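A sketch of Kesten's rule for a scalar problem: the gain a/(1 + s_k) shrinks only when successive increments change sign, which happens rarely far from the root and about half the time near it. The test problem (noisy g(θ) = θ − 2) is hypothetical.

```python
import numpy as np

# Kesten (1958) adaptive gain: count sign changes of successive increments.
rng = np.random.default_rng(11)
theta, prev_inc = 0.0, 0.0
sign_changes = 0
for k in range(5000):
    a_k = 1.0 / (1 + sign_changes)
    inc = -a_k * ((theta - 2.0) + 0.5 * rng.normal())
    if k > 0 and inc * prev_inc < 0:     # direction reversed: near the root,
        sign_changes += 1                #   so shrink the gain
    theta, prev_inc = theta + inc, inc
```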

Iterate Averaging Iterate averaging is an important and relatively recent development in SA asymptotics. It provides a means for achieving optimal asymptotic performance without using optimal gains a_k (optimality holds among the sequences satisfying a_{k+1}/a_k = 1 + o(a_k)). Basic iterate averaging uses the following sample mean as the final estimate: θ̄_k = (1/(k+1)) Σ_{i=0}^{k} θ̂_i. Results in finite-sample practice are mixed. Success relies on a large proportion of the individual iterates hovering in some balanced way around θ*. Many practical problems have the iterate approaching θ* in a roughly monotonic manner; monotonicity is not consistent with good performance of iterate averaging (see plot on following slide).
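A sketch of iterate averaging on a noisy scalar problem invented for illustration (g(θ) = θ, root 0): a slowly decaying gain (α = 0.7 < 1) leaves the raw iterate hovering noisily around the root, and averaging the iterates smooths that noise out.

```python
import numpy as np

# SA with a "large" gain, followed by averaging of the whole iterate path.
rng = np.random.default_rng(5)
theta = 2.0
iterates = []
for k in range(20000):
    a_k = 0.5 / (k + 1) ** 0.7
    theta = theta - a_k * (theta + rng.normal())   # noisy measurement of g
    iterates.append(theta)

theta_bar = np.mean(iterates)   # averaged estimate of the root at 0
```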

Contrasting Search Paths for Typical p = 2 Problem: Ineffective and Effective Uses of Iterate Averaging

Time-Varying Functions In some problems, the root-finding function varies with the iteration: g_k(θ) (rather than g(θ)). Examples: adaptive control with a time-varying target vector; experimental design with user-specified input values. Let θ*_k denote the root of g_k(θ) = 0. Suppose that θ*_k → θ* for some fixed value θ* (equivalent to the fixed θ* in conventional root-finding). In such cases, much standard theory continues to apply. Plot on following slide shows a case where g_k(θ) represents a gradient function with scalar θ.

Time-Varying g k (  ) =  L k (  )  /  for Loss Functions with Limiting Minimum

STOCHASTIC GRADIENT FORM OF STOCHASTIC APPROXIMATION

Stochastic Gradient Formulation For differentiable L(θ), recall the familiar set of p equations in p unknowns for use in finding a minimum θ*: g(θ) = ∂L/∂θ = 0. The above is a special case of the root-finding problem. Suppose we cannot observe L(θ) and g(θ) except in the presence of noise: adaptive control (target tracking), simulation-based optimization, etc. Seek an unbiased measurement of ∂L/∂θ for optimization.

Stochastic Gradient Formulation (Cont’d) Suppose L(θ) = E[Q(θ, V)], where V represents all random effects and Q(θ, V) represents the “observed” cost (a noisy measurement of L(θ)). Seek a representation where ∂Q/∂θ is an unbiased measurement of ∂L/∂θ. This is not true when the distribution function for V depends on θ. The above implies that the desired representation is L(θ) = ∫ Q(θ, v) p_V(v) dv, in which the density function p_V(·) for V does not depend on θ (rather than a representation where the θ-dependence enters through the density).

Stochastic Gradient Measurement and Algorithm When the density p_V(·) is independent of θ, ∂Q(θ, V)/∂θ is an unbiased measurement of ∂L/∂θ (in general, ∂Q(θ, V)/∂θ is not equal to ∂L/∂θ). The above requires the derivative-integral interchange in ∂L/∂θ = ∂E[Q(θ, V)]/∂θ = E[∂Q(θ, V)/∂θ] to be valid. Can use the root-finding (Robbins-Monro) SA algorithm to attempt to find θ*: θ̂_{k+1} = θ̂_k − a_k ∂Q(θ̂_k, V_k)/∂θ. The unbiased measurement satisfies the key convergence conditions of SA (Section 4.3 in ISSO).
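A sketch of the stochastic gradient recursion on a hypothetical problem: Q(θ, V) = (θ − V)² with V ~ N(3, 1), so L(θ) = E[Q(θ, V)] is minimized at θ* = E[V] = 3 and ∂Q/∂θ = 2(θ − V) is an unbiased measurement of ∂L/∂θ.

```python
import numpy as np

# theta_{k+1} = theta_k - a_k * dQ(theta_k, V_k)/dtheta
rng = np.random.default_rng(1)
theta = 0.0
for k in range(20000):
    a_k = 0.5 / (k + 1) ** 0.7
    v = rng.normal(loc=3.0)                   # draw of the random effect V_k
    theta = theta - a_k * 2.0 * (theta - v)   # unbiased gradient measurement
```

Note that the density of V here does not depend on θ, so the derivative-integral interchange and the unbiasedness both hold.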

Example of conversion to the preferred stochastic gradient form Suppose that Q(θ, V) = f(θ, Z) + W′, where W′ ~ N(0, θ²σ²) and Z is independent of θ. Then V = (Z, W′) is dependent on θ. We can write W′ = θW, where W ~ N(0, σ²). Now Q(θ, V′) = f(θ, Z) + θW, and V′ = (Z, W) is independent of θ.

Stochastic Gradient Tendency to Move Iterate in Correct Direction

Implementation for a general nonlinear regression problem Instantaneous input form: at each step, use the gradient of the single-measurement error Q_k(θ) = (z_k − h(θ, x_k))². Batch form: minimize the average error (1/n) Σ_{i=1}^{n} (z_i − h(θ, x_i))². The minimum of the batch form is tied to the fixed n (and z_1, …, z_n); to implement it we need the full set of data at the beginning.

Stochastic Gradient and LMS Connections The basic linear model is z_k = h_kᵀθ + v_k. Consider the standard MSE loss L(θ) = (1/2)E[(z_k − h_kᵀθ)²], which implies Q = (1/2)(z_k − h_kᵀθ)². Then the basic LMS algorithm is θ̂_{k+1} = θ̂_k + a_k h_k (z_k − h_kᵀθ̂_k). Hence LMS is a direct application of stochastic gradient SA. Proposition 5.1 in ISSO shows how SA convergence theory applies to LMS, implying convergence of LMS to θ*.
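A sketch of LMS as stochastic gradient SA on a simulated linear model z_k = h_kᵀθ* + v_k; the true θ*, noise level, and constant gain below are made up for the demonstration.

```python
import numpy as np

# LMS update: theta <- theta + a * h_k * (z_k - h_k^T theta).
rng = np.random.default_rng(9)
theta_star = np.array([1.0, -2.0, 0.5])     # hypothetical true parameter
theta = np.zeros(3)
a = 0.01                                    # small constant LMS gain
for k in range(20000):
    h = rng.normal(size=3)                  # regressor vector h_k
    z = h @ theta_star + 0.1 * rng.normal() # noisy scalar output z_k
    theta = theta + a * h * (z - h @ theta)
```

The update is exactly θ̂ − a ∂Q/∂θ for Q = (1/2)(z_k − h_kᵀθ)², which is the stochastic gradient connection the slide describes.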

Neural Networks Neural networks (NNs) are general function approximators. The actual output z_k is represented by a NN according to the standard model z_k = h(θ, x_k) + v_k, where h(θ, x_k) represents the NN output for input x_k and weight values θ, and v_k represents noise. Diagram of a simple feedforward NN on next slide. The most popular training method is backpropagation (with a mean-squared-type loss function). Backpropagation is the following stochastic gradient recursion: θ̂_{k+1} = θ̂_k − a_k ∂Q(θ̂_k, V_k)/∂θ, with Q the squared-error cost for the k-th input-output pair.

Simple Feedforward Neural Network with p = 25 Weight Parameters

Image Restoration Aim is to recover the true image subject to having a recorded image corrupted by noise. Common to construct a least-squares-type problem in Hs, where Hs represents a convolution of the measurement process (H) and the true pixel-by-pixel image (s). Can be solved by either batch linear regression methods or the LMS/RLS methods. Nonlinear measurements need the full power of the stochastic gradient method; measurements modeled as Z = F(s, x, V).

Multiple-pass implementation The recursive and batch forms represent two extremes in the implementation of the stochastic gradient algorithm. A hybrid form is to use the instantaneous gradient, as in the recursive form, yet make multiple passes through the full data. The user may choose to restart the gain value at a_0 on each pass.
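The hybrid scheme above can be sketched as follows on a hypothetical scalar regression data set (z = θ*x + noise, θ* = 2): instantaneous gradients as in the recursive form, several passes over the same n samples, and the gain restarted at a_0 at the start of each pass.

```python
import numpy as np

# Multiple-pass stochastic gradient on a fixed data set of n samples.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
z = 2.0 * x + 0.1 * rng.normal(size=n)    # data from z = theta* x + noise

theta = 0.0
for p in range(20):                        # multiple passes through full data
    for k in range(n):
        a_k = 0.1 / (k + 1) ** 0.7         # gain restarted at a_0 each pass
        grad = -2.0 * x[k] * (z[k] - theta * x[k])   # instantaneous gradient
        theta = theta - a_k * grad
```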