CHAPTER 1 S TOCHASTIC S EARCH AND O PTIMIZATION: M OTIVATION AND S UPPORTING R ESULTS Organization of chapter in ISSO –Introduction –Some principles of.

Slides:



Advertisements
Similar presentations
1 Modeling and Simulation: Exploring Dynamic System Behaviour Chapter9 Optimization.
Advertisements

Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
EKF, UKF TexPoint fonts used in EMF.
CHAPTER 13 M ODELING C ONSIDERATIONS AND S TATISTICAL I NFORMATION “All models are wrong; some are useful.”  George E. P. Box Organization of chapter.
CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.
CHAPTER 3 CHAPTER 3 R ECURSIVE E STIMATION FOR L INEAR M ODELS Organization of chapter in ISSO –Linear models Relationship between least-squares and mean-square.
CHAPTER 8 A NNEALING- T YPE A LGORITHMS Organization of chapter in ISSO –Introduction to simulated annealing –Simulated annealing algorithm Basic algorithm.
CHAPTER 2 D IRECT M ETHODS FOR S TOCHASTIC S EARCH Organization of chapter in ISSO –Introductory material –Random search methods Attributes of random search.
1 Reinforcement Learning Introduction & Passive Learning Alan Fern * Based in part on slides by Daniel Weld.
CHAPTER 9 E VOLUTIONARY C OMPUTATION I : G ENETIC A LGORITHMS Organization of chapter in ISSO –Introduction and history –Coding of  –Standard GA operations.
Gizem ALAGÖZ. Simulation optimization has received considerable attention from both simulation researchers and practitioners. Both continuous and discrete.
Institute of Intelligent Power Electronics – IPE Page1 Introduction to Basics of Genetic Algorithms Docent Xiao-Zhi Gao Department of Electrical Engineering.
Classification and Prediction: Regression Via Gradient Descent Optimization Bamshad Mobasher DePaul University.
CHAPTER 16 MARKOV CHAIN MONTE CARLO
Visual Recognition Tutorial
D Nagesh Kumar, IIScOptimization Methods: M1L1 1 Introduction and Basic Concepts (i) Historical Development and Model Building.
Spring, 2013C.-S. Shieh, EC, KUAS, Taiwan1 Heuristic Optimization Methods Prologue Chin-Shiuh Shieh.
Motion Analysis (contd.) Slides are from RPI Registration Class.
Stochastic Approximation Neta Shoham. References This Presentation is totally based on the book Introduction to Stochastic Search and Optimization (2003)
Job Release-Time Design in Stochastic Manufacturing Systems Using Perturbation Analysis By: Dongping Song Supervisors: Dr. C.Hicks & Dr. C.F.Earl Department.
No Free Lunch (NFL) Theorem Many slides are based on a presentation of Y.C. Ho Presentation by Kristian Nolde.
Gradient Methods May Preview Background Steepest Descent Conjugate Gradient.
Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.
1 Hybrid Agent-Based Modeling: Architectures,Analyses and Applications (Stage One) Li, Hailin.
D Nagesh Kumar, IIScOptimization Methods: M1L4 1 Introduction and Basic Concepts Classical and Advanced Techniques for Optimization.
Maximum likelihood (ML)
Normalised Least Mean-Square Adaptive Filtering
Collaborative Filtering Matrix Factorization Approach
CHAPTER 15 S IMULATION - B ASED O PTIMIZATION II : S TOCHASTIC G RADIENT AND S AMPLE P ATH M ETHODS Organization of chapter in ISSO –Introduction to gradient.
Ch 8.1 Numerical Methods: The Euler or Tangent Line Method
Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.
Introduction to Adaptive Digital Filters Algorithms
Computational Stochastic Optimization: Bridging communities October 25, 2012 Warren Powell CASTLE Laboratory Princeton University
1 Hybrid methods for solving large-scale parameter estimation problems Carlos A. Quintero 1 Miguel Argáez 1 Hector Klie 2 Leticia Velázquez 1 Mary Wheeler.
CHAPTER 17 O PTIMAL D ESIGN FOR E XPERIMENTAL I NPUTS Organization of chapter in ISSO* –Background Motivation Finite sample and asymptotic (continuous)
CHAPTER 10 E VOLUTIONARY C OMPUTATION II : G ENERAL M ETHODS AND T HEORY Organization of chapter in ISSO –Introduction –Evolution strategy and evolutionary.
CHAPTER 14 CHAPTER 14 S IMULATION - B ASED O PTIMIZATION I : R EGENERATION, C OMMON R ANDOM N UMBERS, AND R ELATED M ETHODS Organization of chapter in.
Biointelligence Laboratory, Seoul National University
Basic Numerical methods and algorithms
CHAPTER 4 S TOCHASTIC A PPROXIMATION FOR R OOT F INDING IN N ONLINEAR M ODELS Organization of chapter in ISSO –Introduction and potpourri of examples Sample.
Stochastic Linear Programming by Series of Monte-Carlo Estimators Leonidas SAKALAUSKAS Institute of Mathematics&Informatics Vilnius, Lithuania
1 ECE-517 Reinforcement Learning in Artificial Intelligence Lecture 7: Finite Horizon MDPs, Dynamic Programming Dr. Itamar Arel College of Engineering.
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Deterministic vs. Random Maximum A Posteriori Maximum Likelihood Minimum.
CS 478 – Tools for Machine Learning and Data Mining Backpropagation.
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
CHAPTER 5 S TOCHASTIC G RADIENT F ORM OF S TOCHASTIC A PROXIMATION Organization of chapter in ISSO –Stochastic gradient Core algorithm Basic principles.
Computational Intelligence: Methods and Applications Lecture 23 Logistic discrimination and support vectors Włodzisław Duch Dept. of Informatics, UMK Google:
CHAPTER 6 STOCHASTIC APPROXIMATION AND THE FINITE-DIFFERENCE METHOD
Vaida Bartkutė, Leonidas Sakalauskas
CHAPTER 17 O PTIMAL D ESIGN FOR E XPERIMENTAL I NPUTS Organization of chapter in ISSO –Background Motivation Finite sample and asymptotic (continuous)
Local Search and Optimization Presented by Collin Kanaley.
Data Modeling Patrice Koehl Department of Biological Sciences National University of Singapore
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
1  Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
Monte-Carlo based Expertise A powerful Tool for System Evaluation & Optimization  Introduction  Features  System Performance.
Chapter 2-OPTIMIZATION G.Anuradha. Contents Derivative-based Optimization –Descent Methods –The Method of Steepest Descent –Classical Newton’s Method.
Optimization in Engineering Design 1 Introduction to Non-Linear Optimization.
INTRO TO OPTIMIZATION MATH-415 Numerical Analysis 1.
Bounded Nonlinear Optimization to Fit a Model of Acoustic Foams
An Evolutionary Approach
Chapter 7. Classification and Prediction
C.-S. Shieh, EC, KUAS, Taiwan
Meta-Heuristic Algorithms 16B1NCI637
Collaborative Filtering Matrix Factorization Approach
Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall CHAPTER 15 SIMULATION-BASED OPTIMIZATION II: STOCHASTIC GRADIENT AND.
CHAPTER 12 STATISTICAL METHODS FOR OPTIMIZATION IN DISCRETE PROBLEMS
CMSC 471 – Fall 2011 Class #25 – Tuesday, November 29
Presentation transcript:

CHAPTER 1 S TOCHASTIC S EARCH AND O PTIMIZATION: M OTIVATION AND S UPPORTING R ESULTS Organization of chapter in ISSO –Introduction –Some principles of stochastic search and optimization Key points in implementation and analysis No free lunch theorems –Gradients, Hessians, etc. –Steepest descent and Newton-Raphson search Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

1-2 Potpourri of Problems Using Stochastic Search and Optimization Minimize the costs of shipping from production facilities to warehouses Maximize the probability of detecting an incoming warhead (vs. decoy) in a missile defense system Place sensors in manner to maximize useful information Determine the times to administer a sequence of drugs for maximum therapeutic effect Find the best red-yellow-green signal timings in an urban traffic network Determine the best schedule for use of laboratory facilities to serve an organization’s overall interests

1-3 Search and Optimization Algorithms as Part of Problem Solving There exist many deterministic and stochastic algorithms partAlgorithms are part of the broader solution Need clear understanding of problem structure, constraints, data characteristics, political and social context, limits of algorithms, etc. “Imagine how much money could be saved if truly appropriate techniques were applied that go beyond simple linear programming.” (Michalewicz and Fogel, 2000) –Deeper understanding required to provide truly appropriate solutions; “COTS” software usually not enough! Many (most?) real-world implementations involve stochastic effects

4 Two Fundamental Problems of Interest Let  be the domain of allowable values for a vector   represents a vector of “adjustables” –  may be continuous or discrete (or both) Two fundamental problems of interest: Problem 1. Problem 1. Find the value(s) of a vector    that minimize a scalar-valued loss function L(  ) — or — Problem 2. Problem 2. Find the value(s) of    that solve the equation g(  ) = 0 for some vector-valued function g(  ) Frequently (but not necessarily) g(  ) = Let   denote solution to Problem 1 or 2 (may or may not be unique)

1-5 Continuous Discrete/ Continuous Discrete Three Common Types of Loss Functions

1-6 Classical Calculus-Based Optimization Classical optimization setting of interest Find  that minimizes the differentiable loss L(  ) subject to  satisfying relevant constraints (   ) Standard nonlinear unconstrained optimization setting: Find  such that = 0 –Lagrange multipliers useful for some types of constraints

1-7 Global vs. Local Solutions Typically the solution to = 0 may only be a local (vs. global) optimal very difficultGeneral global optimization problem is very difficult Sometimes local optimization is “good enough” given limited resources available Global methods include: genetic algorithms, evolutionary strategies, simulated annealing, etc.

1-8 Global vs. Local Solutions (cont’d) Global methods tend to have following characteristics: –Inefficient, especially for high-dimensional  –Relatively difficult to use (e.g., require very careful selection of algorithm coefficients) –Sometimes questionable theoretical foundation for global convergence –Multiple runs usually required to have confidence in reaching global optimum Some “hype” with many methods; e.g., genetic algorithm (GA) software advertisement: –“…uses GAs to solve any optimization problem!” someBut there are some sound methods restricted settings –E.g., restricted settings for GAs, simulated annealing, and stochastic approximation

1-9 Examples of Easy and Hard Problems for Global Optimization Note: “Hard problem” below (plot (b)) is indicative of effective volume of region around   in practical high-dimensional problems

1-10 Stochastic Search and Optimization Focus here is stochastic search and optimization: A. Random noise in input information (e.g., noisy measurements of L(  ): y(  )  L(  ) + noise) — and/or — B. Injected randomness (Monte Carlo) in choice of algorithm iteration magnitude/direction Contrasts with deterministic methods –E.g., steepest descent, Newton-Raphson, etc. –Assume perfect information about L(  ) (and its gradients) –Search magnitude/direction deterministic at each iteration Injected randomness (B) in search magnitude/direction can offer benefits in efficiency and robustness –E.g., Capabilities for global (vs. local) optimization

1-11 Some Popular Stochastic Search and Optimization Techniques Random search Stochastic approximation –Robbins-Monro and Kiefer-Wolfowitz –SPSA –NN backpropagation –Infinitesimal perturbation analysis –Recursive least squares –Many others Simulated annealing Genetic algorithms Evolutionary programs and strategies Reinforcement learning Markov chain Monte Carlo (MCMC) Etc.

1-12 Effects of Noise on Simple Optimization Problem (Example 1.4 in ISSO)

1-13 Consider tracking problem where controller and/or system depend on design parameters  –E.g.: Missile guidance, robot arm manipulation, attaining macroeconomic target values, etc. Aim is to pick  to minimize mean-squared error (MSE): In general nonlinear and/or non-Gaussian systems, not possible to compute L(  ) observedGet observed squared error by running system Note that –Values of y(  ), not L(  ), used in optimization of  Example of Noisy Loss Measurements: Tracking Problem (Example 1.5 in ISSO)

1-14 Have credible Monte Carlo simulation of real system Parameters  in simulation have physical meaning in system –E.g.:  is machine locations in plant layout, timing settings in traffic control, resource allocation in military operations, etc. Run simulation to determine best  for use in real system Want to minimize average measure of performance L(  ) –Let y(  ) represent one simulation output (y(  ) = L(  ) + noise) Example of Noisy Loss Measurements: Simulation-Based Optimization (Example 1.6 in ISSO)  Stochasticoptimizer y()y() Monte Carlo Simulation inputs

1-15 Algorithm comparisons via number of measurements of L(  ) or g(  ) (not iterations) –Function measurements typically represent major cost Curse of dimensionality –E.g.: If dim(  ) = 10, each element of  can take on 10 values. Taking 10,000 random samples: Prob(finding at least one of 500 best  ) = –Above example would be even much harder with only noisy function measurements Constraints Limits of numerical comparisons –Avoid broad claims based on numerical studies and –Best to combine theory and numerical analysis Some Key Properties in Implementation and Evaluation of Stochastic Algorithms

1-16 = 0 setting usually associated with unconstrained optimization Most real problems include constraints Many constrained problems can also be converted to = 0 –Penalty functions, projection methods, ad hoc methods and common sense, etc Considerations for constraints: –Hard vs. soft –Explicit vs. implicit Constrained vs. Unconstrained

1-17 No Free Lunch Theorems Wolpert and Macready (1997) establish several “No Free Lunch” (NFL) Theorems for optimization NFL Theorems apply to settings where parameter set  and set of loss function values are finite, discrete sets –Relevant for continuous  problem when considering digital computer implementation –Results are valid for deterministic and stochastic settings Number of optimization problems — mappings from  to set of loss values — is finite NFL Theorems state, in essence, that no one search algorithm is “best” for all problems

18 No Free Lunch Theorems—Basic Formulation Suppose that N   number of possible values of  N L  number of possible values of loss function Then There is a finite (but possibly huge) number of loss functions Basic form of NFL considers average performance over all loss functions

1-19 Illustration of No Free Lunch Theorems (Example 1.7 in ISSO) Three values of , two outcomes for noise free loss L –Eight possible mappings, hence eight optimization problems Mean loss across all problems is same regardless of  ; entries 1 or 2 in table below represent two possible L outcomes  1 2 3 Map

1-20 No Free Lunch Theorems—Basic Formulation Assume algorithm A is applied to L(  ) Let represent value of a loss function after n unique function evaluations Consider probability that given algorithm A and loss function L Proof of NFL Theorems consider this probability for choices of algorithms A and loss functions L Conclusion of NFL based on averaging above probability w.r.t. either L or A Most important special case of is = L(   )

21 No Free Lunch Theorems (cont’d) NFL Theorems state, in essence: In particular, if algorithm 1 performs better than algorithm 2 over some set of problems, then algorithm 2 performs better than algorithm 1 on another set of problems NFL theorems say nothing about specific algorithms on specific problems –NFL provides no direct guidance for implementation, but useful as conceptual device to avoid exaggerated claims of power of specific algorithms Averaging (uniformly) over all possible problems (loss functions L), all algorithms perform equally well Overall relative efficiency of two algorithms cannot be inferred from a few sample problems

1-22 Often used directly in deterministic methods like steepest descent and Newton-Raphson; indirectly in stochastic methods not –Exact gradients and Hessians generally not available in stochastic optimization Gradient g(  ) of L(  ) is the vector of 1st partial derivatives Hessian of L(  ) is the matrix H(  ) consisting the 2nd partial derivatives Hessian useful in characterizing shape of L and in providing search direction for Newton-Raphson algorithm Gradients and Hessians

1-23 Rationale Behind Steepest Descent Update Direction for i th Element of 

1-24 Steepest Descent with Noisy and Noise-Free Gradient Input (Example 1.8 in ISSO)

1-25 Noisy and Noise-Free Gradient Input (cont’d): Relative Loss Values (Example 1.8 in ISSO)

1-26 Deterministic Optimization: 1 st -order (Steepest Descent) and 2 nd -order (Newton-Raphson) Directions with p = 2

27 Final Remarks on Contrast of Deterministic and Stochastic Optimization: Relative Theoretical Rates of Convergence Theoretical analysis based on convergence rates of iterates where k is iteration counter Recall   is optimal value of  deterministicFor deterministic optimization, a standard rate result is: noisy measurementsCorresponding rate with noisy measurements Stochastic rate inherently slower in theory and practice

1-28 Concluding Remarks Stochastic search and optimization very widely used –Handles noise in the function evaluations –Generally better for global optimization –Broader applicability to “non-nice” problems (robustness) Some challenges in practical problems –Noise dramatically affects convergence –Distinguishing global from local minima not generally easy –Curse of dimensionality –Choosing algorithm “tuning coefficients” Rarely sufficient to use theory for standard deterministic methods to characterize stochastic methods “No free lunch” theorems are barrier to exaggerated claims of power and efficiency of any specific algorithm Algorithms should be implemented in context: “Better a rough answer to the right question than an exact answer to the wrong one” (Lord Kelvin)

29 Selected References on Stochastic Optimization Fogel, D. B. (2000), Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (2nd ed.), IEEE Press, Piscataway, NJ. Fu, M. C. (2002), “Optimization for Simulation: Theory vs. Practice” (with discussion by S. Andradóttir, P. Glynn, and J. P. Kelly), INFORMS Journal on Computing, vol. 14, pp. 192  227. Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA. Gosavi, A. (2003), Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Kluwer, Boston. Holland, J. H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor. Kleijnen, J. P. C. (2008), Design and Analysis of Simulation Experiments, Springer. Kushner, H. J. and Yin, G. G. (2003), Stochastic Approximation and Recursive Algorithms and Applications (2nd ed.), Springer-Verlag, New York. Michalewicz, Z. and Fogel, D. B. (2000), How to Solve It: Modern Heuristics, Springer-Verlag, New York. Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, Hoboken, NJ. Zhigljavsky, A. A. (1991), Theory of Global Random Search, Kluwer, Boston.