CHAPTER 4: STOCHASTIC APPROXIMATION FOR ROOT FINDING IN NONLINEAR MODELS
Organization of chapter in ISSO
–Introduction and potpourri of examples
  Sample mean
  Quantile and CEP
  Production function (contrast with maximum likelihood)
–Convergence of the SA algorithm
–Asymptotic normality of SA and choice of gain sequence
–Extensions to standard root-finding SA
  Joint parameter and state estimation
  Higher-order methods for algorithm acceleration
  Iterate averaging
  Time-varying functions
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

4-2 Stochastic Root-Finding Problem
Focus is on finding θ (i.e., θ*) such that g(θ) = 0
–g(θ) is typically a nonlinear function of θ (contrast with Chapter 3 in ISSO)
Assume only noisy measurements of g(θ) are available:
  Y_k(θ) = g(θ) + e_k(θ), k = 0, 1, 2, …
Above problem arises frequently in practice
–Optimization with noisy measurements (g(θ) represents gradient of loss function) (see Chapter 5 of ISSO)
–Quantile-type problems
–Equation solving in physics-based models
–Machine learning (see Chapter 11 of ISSO)

4-3 Core Algorithm for Stochastic Root-Finding
Basic algorithm published in Robbins and Monro (1951)
Algorithm is a stochastic analogue to steepest descent when used for optimization
–Noisy measurement Y_k(θ) replaces exact gradient g(θ)
Generally wasteful to average measurements at a given value of θ
–Average across iterations instead (changing θ)
Core Robbins-Monro algorithm for unconstrained root-finding is
  θ̂_{k+1} = θ̂_k − a_k Y_k(θ̂_k)
Constrained version of algorithm also exists
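As a concrete illustration, here is a minimal Python sketch of the unconstrained Robbins-Monro recursion θ̂_{k+1} = θ̂_k − a_k Y_k(θ̂_k). The root-finding function g_true, the additive Gaussian noise model, and the gain constants a, A, and α are illustrative assumptions, not values taken from ISSO.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_true(theta):
    # Illustrative (made-up) nonlinear root-finding function with root theta* = 1.0
    return theta**3 - 1.0

def noisy_measurement(theta):
    # Y_k(theta) = g(theta) + e_k(theta), with N(0, 1) noise standing in for e_k
    return g_true(theta) + rng.normal(scale=1.0)

def robbins_monro(theta0, n_iter=5000, a=0.5, A=10.0, alpha=1.0):
    # Unconstrained Robbins-Monro recursion: theta_{k+1} = theta_k - a_k * Y_k(theta_k)
    theta = theta0
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha   # decaying gain of the form discussed on slide 4-8
        theta = theta - a_k * noisy_measurement(theta)
    return theta

print(robbins_monro(theta0=3.0))   # should settle near the root theta* = 1.0
```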

4-4 Circular Error Probable (CEP): Example of Root-Finding (Example 4.3 in ISSO)
Interested in estimating radius of circle about target such that half of impacts lie within circle (θ is scalar radius)
Define success variable s_k = 1 if the kth impact lies within distance θ̂_k of the target, and s_k = 0 otherwise
Root-finding algorithm becomes
  θ̂_{k+1} = θ̂_k − a_k (s_k − 1/2)
Figure on next slide illustrates results for one study
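A hedged sketch of how this CEP recursion might be simulated is given below. The bivariate-normal impact model, its mean offset from the target, and the gain constants are hypothetical choices for illustration only and are not the specific values used in ISSO's Example 4.3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical impact model: bivariate normal impacts whose mean differs from the
# target at the origin (loosely echoing the situation pictured on the next slide).
impact_mean = np.array([0.5, 0.3])
impact_cov = np.array([[1.0, 0.2], [0.2, 0.8]])

def cep_estimate(n_impacts=20000, theta0=1.0, a=2.0, A=50.0, alpha=1.0):
    # SA recursion theta_{k+1} = theta_k - a_k * (s_k - 1/2), where the success
    # variable s_k = 1 if the k-th impact lands within radius theta_k of the target.
    theta = theta0
    for k in range(n_impacts):
        impact = rng.multivariate_normal(impact_mean, impact_cov)
        s_k = 1.0 if np.linalg.norm(impact) <= theta else 0.0
        a_k = a / (k + 1 + A) ** alpha
        theta = theta - a_k * (s_k - 0.5)
    return theta

print(cep_estimate())   # estimated radius containing roughly half of the impacts
```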

4-5 True and estimated CEP: 1000 impact points with impact mean differing from target point (Example 4.3 in ISSO)

4-6 Convergence Conditions
Central aspect of root-finding SA is the set of conditions for formal convergence of the iterate to a root θ*
–Provides rigorous basis for many popular algorithms (LMS, backpropagation, simulated annealing, etc.)
Section 4.3 of ISSO contains two sets of conditions:
–“Statistics” conditions based on classical assumptions about g(θ), noise, and gains a_k
–“Engineering” conditions based on connection to deterministic ordinary differential equation (ODE)
Convergence and stability of ODE dZ(τ)/dτ = −g(Z(τ)) closely related to convergence of SA algorithm (Z(τ) represents p-dimensional time-varying function and τ denotes time)
Neither set of conditions (statistics or engineering) is a special case of the other
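To make the ODE connection tangible, the short sketch below integrates the limiting ODE dZ(τ)/dτ = −g(Z(τ)) by forward Euler for the same illustrative g used in the Robbins-Monro sketch above; the step size and iteration count are arbitrary assumptions.

```python
def g(z):
    # Same illustrative root-finding function as in the Robbins-Monro sketch (root at 1.0)
    return z**3 - 1.0

# Forward-Euler integration of the limiting ODE dZ(tau)/dtau = -g(Z(tau)).
# Asymptotic stability of this deterministic path at theta* is what the
# "engineering" (ODE-based) convergence conditions exploit.
z, dtau = 3.0, 0.01
for _ in range(5000):
    z = z - dtau * g(z)

print(z)   # approaches the root theta* = 1.0, mirroring the limit of the SA iterate
```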

4-7 ODE Convergence Paths for Nonlinear Problem in Example 4.6 in ISSO: Satisfies ODE Conditions Due to Asymptotic Stability and Global Domain of Attraction

4-8 Gain Selection
Choice of the gain sequence a_k is critical to the performance of SA
Famous conditions for convergence are
  Σ_k a_k = ∞ and Σ_k a_k² < ∞
A common practical choice of gain sequence is
  a_k = a/(k + 1 + A)^α, where 1/2 < α ≤ 1, a > 0, and A ≥ 0
Strictly positive A (“stability constant”) allows for larger a (possibly faster convergence) without risking unstable behavior in early iterations
α and A can usually be pre-specified; critical coefficient a usually chosen by “trial-and-error”
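A small sketch of this gain family, contrasting A = 0 with a strictly positive stability constant; the particular values of a, A, and α below are arbitrary illustrations.

```python
def gain(k, a, A, alpha):
    # a_k = a / (k + 1 + A)^alpha with 1/2 < alpha <= 1, a > 0, A >= 0
    return a / (k + 1 + A) ** alpha

# With A = 0 the first steps are large relative to later ones; a strictly positive A
# tempers the early steps, which is what permits choosing a larger a.
print([round(gain(k, a=1.0, A=0.0, alpha=1.0), 4) for k in range(8)])
print([round(gain(k, a=5.0, A=50.0, alpha=1.0), 4) for k in range(8)])
```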

4-9 Extensions to Basic Root-Finding SA (Section 4.5 of ISSO)
Joint Parameter and State Evolution
–There exists state vector x_k related to system being optimized
–E.g., state-space model governing evolution of x_k, where model depends on values of θ
Adaptive Estimation and Higher-Order Algorithms
–Adaptively estimating gain a_k
–SA analogues of fast Newton-Raphson search
Iterate Averaging
–See slides to follow
Time-Varying Functions
–See slides to follow

4-10 Iterate Averaging
Iterate averaging is important and relatively recent development in SA
Provides means for achieving optimal asymptotic performance without using optimal gains a_k
Basic iterate average uses following sample mean as final estimate:
  θ̄_n = (1/(n+1)) Σ_{k=0}^{n} θ̂_k
Results in finite-sample practice are mixed
Success relies on large proportion of individual iterates hovering in some balanced way around θ*
–Many practical problems have iterate approaching θ* in roughly monotonic manner
–Monotonicity not consistent with good performance of iterate averaging; see plot on following slide
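The sketch below layers basic iterate averaging on top of the Robbins-Monro recursion from the earlier sketch; the problem, noise level, and gain constants remain illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_g(theta):
    # Same style of illustrative noisy measurement as before (root at theta* = 1.0)
    return (theta**3 - 1.0) + rng.normal(scale=1.0)

def rm_with_averaging(theta0=3.0, n_iter=5000, a=0.5, A=10.0, alpha=0.7):
    theta, iterates = theta0, []
    for k in range(n_iter):
        a_k = a / (k + 1 + A) ** alpha   # slower-than-1/k decay, as iterate averaging favors
        theta = theta - a_k * noisy_g(theta)
        iterates.append(theta)
    # Basic iterate average: sample mean of the iterates as the final estimate
    return theta, float(np.mean(iterates))

final_iterate, averaged_estimate = rm_with_averaging()
print(final_iterate, averaged_estimate)
```

Consistent with the caveat above, averaging over all iterates includes the initial transient away from θ*, so when the raw iterate approaches the root roughly monotonically the averaged estimate can be worse than the final iterate; in practice the average is often taken only over later iterates.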

4-11 Contrasting Search Paths for Typical p = 2 Problem: Ineffective and Effective Uses of Iterate Averaging

4-12 Time-Varying Functions
In some problems, the root-finding function varies with iteration: g_k(θ) (rather than g(θ))
–Adaptive control with time-varying target vector
–Experimental design with user-specified input values
–Signal processing based on Markov models (Subsection of ISSO)
Let θ_k* denote the root to g_k(θ) = 0
Suppose that θ_k* → θ* for some fixed value θ* (equivalent to the fixed θ* in conventional root-finding)
In such cases, much standard theory continues to apply
Plot on following slide shows case when g_k(θ) represents a gradient function with scalar θ
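As a toy illustration of tracking a time-varying root, the sketch below uses a hypothetical g_k(θ) = θ − θ_k* whose roots θ_k* converge to a fixed limit θ*; the specific g_k, noise level, and gain sequence are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def root_k(k):
    # Hypothetical time-varying root theta_k* that settles to the fixed limit theta* = 2.0
    return 2.0 + 1.0 / (k + 1)

def noisy_g_k(theta, k):
    # Illustrative time-varying function g_k(theta) = theta - theta_k*, observed with noise
    return (theta - root_k(k)) + rng.normal(scale=0.5)

theta = 0.0
for k in range(20000):
    a_k = 1.0 / (k + 1 + 20.0)
    theta = theta - a_k * noisy_g_k(theta, k)

print(theta)   # close to the limiting root theta* = 2.0, as the standard theory suggests
```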

4-13 Time-Varying g_k(θ) = ∂L_k(θ)/∂θ for Loss Functions with Limiting Minimum