Approximate Dynamic Programming and Policy Search: Does anything work?
Rutgers Applied Probability Workshop, June 6, 2014
Warren B. Powell and Daniel R. Jiang, with contributions from Daniel Salas, Vincent Pham, and Warren Scott
© 2014 Warren B. Powell, Princeton University

Storage problems
How much energy should we store in a battery to handle the volatility of wind and spot prices while meeting demand?

Storage problems
How much money should we hold in cash, given variable market returns and interest rates, to meet the needs of a business? (Figure: bond and stock price trajectories.)

Storage problems
Elements of a "storage problem":
» Controllable scalar state giving the amount in storage: the decision may be to deposit money, charge a battery, chill the water, or release water from the reservoir. There may also be exogenous changes (deposits/withdrawals).
» Multidimensional "state of the world" variable that evolves exogenously: prices, interest rates, weather, demand/loads.
» Other features: the problem may be time-dependent (and finite horizon) or stationary, and we may have access to forecasts of the future.

Storage problems
Dynamics are captured by the transition function:
» Controllable resource (scalar)
» Exogenous state variables
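The transition equations themselves did not survive in this transcript. A minimal sketch consistent with the storage setting above, in assumed notation (R_t for the scalar resource, W_t for the exogenous state, x_t for the decision), is

\[
R_{t+1} = R_t + x_t + \hat{R}_{t+1}, \qquad
W_{t+1} = S^W\!\big(W_t, \hat{W}_{t+1}\big), \qquad
S_t = (R_t, W_t),
\]

where \hat{R}_{t+1} captures exogenous deposits/withdrawals and \hat{W}_{t+1} is the new exogenous information (prices, interest rates, weather, loads).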

Stochastic optimization models
The objective function:
» With deterministic problems, we want to find the best decision.
» With stochastic problems, we want to find the best function (policy) for making a decision.
(The slide annotates the objective with: decision function (policy), state variable, cost function, finding the best policy, expectation over all random outcomes.)
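The objective itself is missing from the transcript; a standard form matching those annotations (notation assumed) is

\[
\min_{\pi \in \Pi} \; \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t\, C\big(S_t, X^\pi(S_t)\big)\right],
\]

where X^\pi(S_t) is the decision function (policy), S_t is the state variable, C is the cost function, and the expectation runs over all random outcomes. The storage examples later in the talk can equivalently be written as maximizing contributions.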

Four classes of policies
1) Policy function approximations (PFAs): lookup tables, rules, parametric functions.
2) Cost function approximations (CFAs).
3) Policies based on value function approximations (VFAs).
4) Lookahead policies (a.k.a. model predictive control): deterministic lookahead (rolling horizon procedures) or stochastic lookahead (stochastic programming, MCTS).
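The defining expressions for the CFA and VFA classes are lost in the transcript; generic forms in the style Powell uses elsewhere (assumed here, not copied from the slide) are

\[
X^{CFA}(S_t \mid \theta) = \arg\min_{x} \bar{C}\big(S_t, x \mid \theta\big),
\qquad
X^{VFA}(S_t) = \arg\min_{x} \Big( C(S_t, x) + \gamma\, \mathbb{E}\big[\bar{V}_{t+1}(S_{t+1}) \mid S_t, x\big] \Big),
\]

where \bar{C} is a parametrically modified cost function (possibly with modified constraints) and \bar{V}_{t+1} is an approximation of the downstream value.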

Value iteration
Classical backward dynamic programming:
» The three curses of dimensionality: the state space, the outcome space, the action space.

A storage problem Energy storage with stochastic prices, supplies and demands.

A storage problem Bellman’s optimality equation
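The equation on this slide is not in the transcript; the standard form (assumed notation, discretized states) is

\[
V_t(S_t) = \min_{x \in \mathcal{X}} \Big( C(S_t, x) + \gamma \sum_{s'} \mathbb{P}\big(S_{t+1} = s' \mid S_t, x\big)\, V_{t+1}(s') \Big),
\]

whose nested expectation over the outcome space is what backward dynamic programming must compute for every state and action.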

Managing a water reservoir Backward dynamic programming in one dimension

Managing cash in a mutual fund Dynamic programming in multiple dimensions

Approximate dynamic programming
Algorithmic strategies:
» Approximate value iteration: mimics backward dynamic programming.
» Approximate policy iteration: mimics policy iteration.
» Policy search: based on the field of stochastic search.

Approximate value iteration
Step 1: Start with a pre-decision state.
Step 2: Solve the deterministic optimization using an approximate value function to obtain a decision.
Step 3: Update the value function approximation (recursive statistics).
Step 4: Obtain a Monte Carlo sample of the exogenous information (simulation) and compute the next pre-decision state.
Step 5: Return to step 1.
Because states are generated by following the current approximation, this is "on-policy learning."
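As a concrete (if simplified) illustration of this loop, here is a minimal sketch assuming a lookup-table value function approximation around the post-decision state and a constant stepsize. All of the callables (actions, cost, post_decision, add_noise, sample_noise) are placeholders a reader would supply for a particular storage problem; none of the names come from the talk.

```python
from collections import defaultdict

def approximate_value_iteration(s0, actions, cost, post_decision, add_noise, sample_noise,
                                n_iter=1000, alpha=0.1, gamma=1.0):
    """Sketch of the loop on this slide: follow the VFA-greedy policy
    ("on-policy learning") and update a lookup-table VFA as we go."""
    vbar = defaultdict(float)      # approximate value of each post-decision state
    S, Sx_prev = s0, None          # Step 1: start with a pre-decision state
    for _ in range(n_iter):
        # Step 2: solve the deterministic optimization using the approximate VFA
        x_hat, v_hat = min(
            ((x, cost(S, x) + gamma * vbar[post_decision(S, x)]) for x in actions(S)),
            key=lambda pair: pair[1],
        )
        # Step 3: update the VFA at the previous post-decision state (recursive statistics)
        if Sx_prev is not None:
            vbar[Sx_prev] = (1 - alpha) * vbar[Sx_prev] + alpha * v_hat
        # Step 4: Monte Carlo sample of the exogenous information -> next pre-decision state
        Sx_prev = post_decision(S, x_hat)
        S = add_noise(Sx_prev, sample_noise())
        # Step 5: return to step 1 (continue the loop)
    return vbar
```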

Approximate value iteration The true (discretized) value function

Outline: Least squares approximate policy iteration; Direct policy search; Approximate policy iteration using machine learning; Exploiting concavity; Exploiting monotonicity; Closing thoughts.

Approximate dynamic programming
Classical approximate dynamic programming:
» We can estimate the value of being in a state using sampled observations of the value.
» Use linear regression (in a set of basis functions) to estimate the value function.
» Our policy is then given by optimizing the one-period cost plus the approximate downstream value.
» This is known as Bellman error minimization.
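The formulas on this slide are lost in the transcript; a standard rendering of this setup (notation assumed) is

\[
\hat{v}^n = \min_{x} \Big( C(S^n, x) + \gamma\, \mathbb{E}\big[\bar{V}(S') \mid S^n, x\big] \Big),
\qquad
\bar{V}(S) = \sum_{f} \theta_f\, \phi_f(S),
\]

with \theta fit by regressing the sampled values \hat{v}^n on the basis functions \phi(S^n), i.e., minimizing \sum_n \big(\hat{v}^n - \phi(S^n)^\top \theta\big)^2, and the policy obtained by plugging \bar{V} into the same one-step optimization.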

Approximate dynamic programming
Least squares policy iteration (Lagoudakis and Parr):
» Bellman's equation for a fixed policy …
» … can be rearranged into a regression equation in which a difference of basis functions plays the role of the explanatory variable "X".
» But we cannot compute this exactly (due to the expectation), so we sample it.
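The equations are missing from the transcript; the standard LSPI derivation they refer to (notation assumed) is: for a fixed policy \pi, Bellman's equation is V^\pi(S_t) = C\big(S_t, X^\pi(S_t)\big) + \gamma\, \mathbb{E}\big[V^\pi(S_{t+1}) \mid S_t\big]. Substituting a linear architecture V^\pi(S) \approx \phi(S)^\top \theta and rearranging gives

\[
C\big(S_t, X^\pi(S_t)\big) \approx \big(\phi(S_t) - \gamma\, \phi(S_{t+1})\big)^\top \theta,
\]

so the explanatory variable is \phi(S_t) - \gamma\,\phi(S_{t+1}) and the response is the observed one-period cost; since the expectation cannot be computed, S_{t+1} is replaced by a simulated sample.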

Approximate dynamic programming … in matrix form:
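The matrix equation itself is not in the transcript. Stacking n sampled transitions, with rows \phi(S^i)^\top in \Phi and \phi(S'^i)^\top in \Phi' (notation assumed), the sampled regression and its least squares solution read

\[
c \approx \big(\Phi - \gamma\,\Phi'\big)\,\theta,
\qquad
\hat{\theta} = \Big[\big(\Phi - \gamma\Phi'\big)^\top \big(\Phi - \gamma\Phi'\big)\Big]^{-1} \big(\Phi - \gamma\Phi'\big)^\top c .
\]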

Approximate dynamic programming
First issue:
» This is known as an "errors-in-variables" model, which produces biased estimates of the regression coefficients. We can correct the bias using instrumental variables.
» We sample a state and compute its basis functions, then simulate the next state and compute its basis functions; the simulated term puts noise in the explanatory variable.
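A compact way to state the instrumental variables fix (assumed notation, continuing the matrix form above) is to use the pre-decision basis functions \Phi as the instrument: they are correlated with the noisy regressor \Phi - \gamma\Phi' but uncorrelated with the simulation noise in \Phi', giving

\[
\hat{\theta}^{\,IV} = \Big[\Phi^\top \big(\Phi - \gamma\,\Phi'\big)\Big]^{-1} \Phi^\top c .
\]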

Approximate dynamic programming
Second issue:
» Bellman's optimality equation written using basis functions does not possess a fixed point (result due to de Farias and Van Roy). This is the reason that classical Bellman error minimization using basis functions does not work. Instead we have to use the projected Bellman error, where the projection operator maps onto the space spanned by the basis functions.
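In symbols (assumed notation, for a fixed policy with transition matrix P and one-period costs c), classical Bellman error minimization is \min_\theta \|\Phi\theta - (c + \gamma P\Phi\theta)\|^2, while the projected Bellman error replaces the target by its projection onto the span of the basis functions:

\[
\min_\theta \big\| \Phi\theta - \Pi_\Phi\big(c + \gamma P\,\Phi\theta\big) \big\|^2,
\qquad
\Pi_\Phi = \Phi\big(\Phi^\top \Phi\big)^{-1} \Phi^\top
\]

(in practice the norm and projection are weighted by the state-visit distribution).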

Approximate dynamic programming
Surprising result:
» Theorem (W. Scott and W.B.P.): Bellman error minimization using instrumental variables and projected Bellman error minimization are the same!

Optimizing storage For benchmark datasets, see:

Optimizing storage For benchmark datasets, see:

Optimizing storage For benchmark datasets, see:

Outline: Least squares approximate policy iteration; Direct policy search; Approximate policy iteration using machine learning; Exploiting concavity; Exploiting monotonicity; Closing thoughts.

Policy search
Finding the best policy ("policy search"):
» Assume our policy is given by a parameterized function of the state (the slide's annotation points to an error correction term inside the policy).
» We wish to maximize the resulting objective over the policy parameters.
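The expressions are missing from the transcript; one reading consistent with the "error correction term" annotation (assumed notation, not copied from the slide) is a value-function-based policy with a tunable correction,

\[
X^\pi(S_t \mid \theta) = \arg\max_{x} \Big( C(S_t, x) + \bar{V}\big(S^x_t\big) + \sum_f \theta_f\, \phi_f(S_t, x) \Big),
\]

and the policy-search objective

\[
\max_\theta\; F(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t\, C\big(S_t, X^\pi(S_t \mid \theta)\big)\right],
\]

optimized directly by simulating the policy for different values of \theta.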

Policy search A number of fields work on this problem under different names: »Stochastic search »Simulation optimization »Black box optimization »Sequential kriging »Global optimization »Open loop control »Optimization of expensive functions »Bandit problems (for on-line learning) »Optimal learning

Policy search For benchmark datasets, see:

Policy search For benchmark datasets, see:

Outline: Least squares approximate policy iteration; Direct policy search; Approximate policy iteration using machine learning; Exploiting concavity; Exploiting monotonicity; Closing thoughts.

Approximate policy iteration
Step 1: Start with a pre-decision state.
Step 2: Inner loop, for m = 1, …, M:
  Step 2a: Solve the deterministic optimization using the approximate value function of the fixed policy to obtain a decision.
  Step 2b: Update the value function approximation being fitted.
  Step 2c: Obtain a Monte Carlo sample of the exogenous information and compute the next pre-decision state.
Step 3: Update the policy using the fitted value function approximation and return to step 1.
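A minimal sketch of this two-loop scheme, reusing the same placeholder callables and lookup-table representation as the approximate value iteration sketch earlier (again, the names are illustrative, not the talk's):

```python
from collections import defaultdict

def approximate_policy_iteration(s0, actions, cost, post_decision, add_noise, sample_noise,
                                 n_outer=50, M=500, alpha=0.1, gamma=1.0):
    """Outer loop: policy improvement. Inner loop: evaluate the *fixed* policy
    defined by vbar_policy while fitting a fresh approximation vbar_new."""
    vbar_policy = defaultdict(float)           # VFA that defines the current policy
    for _ in range(n_outer):
        vbar_new = defaultdict(float)          # VFA being fitted for this policy
        S, Sx_prev = s0, None                  # Step 1: start with a pre-decision state
        for _ in range(M):                     # Step 2: inner loop, m = 1, ..., M
            # Step 2a: deterministic optimization under the fixed policy's VFA
            x_hat = min(actions(S),
                        key=lambda x: cost(S, x) + gamma * vbar_policy[post_decision(S, x)])
            # sampled value of this policy, measured with the VFA being fitted
            v_hat = cost(S, x_hat) + gamma * vbar_new[post_decision(S, x_hat)]
            # Step 2b: update the value function approximation
            if Sx_prev is not None:
                vbar_new[Sx_prev] = (1 - alpha) * vbar_new[Sx_prev] + alpha * v_hat
            # Step 2c: Monte Carlo sample of exogenous information -> next pre-decision state
            Sx_prev = post_decision(S, x_hat)
            S = add_noise(Sx_prev, sample_noise())
        vbar_policy = vbar_new                 # Step 3: update the policy and repeat
    return vbar_policy
```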

Approximate policy iteration
Machine learning methods (coded in R):
» SVR – support vector regression with Gaussian radial basis kernel
» LBF – weighted linear combination of polynomial basis functions
» GPR – Gaussian process regression with Gaussian RBF
» LPR – kernel smoothing with second-order local polynomial fit
» DC-R – Dirichlet clouds (local parametric regression)
» TRE – regression trees with constant local fit

Approximate policy iteration
Test problem sets:
» Linear Gaussian control: L1 is linear quadratic regulation; the remaining problems are nonquadratic.
» Finite-horizon energy storage problems (Salas benchmark problems): 100 time-period problems; value functions are fitted for each time period.

Linear Gaussian control (figure legend): SVR – support vector regression, GPR – Gaussian process regression, DC-R – local parametric regression, LPR – kernel smoothing, LBF – linear basis functions, TRE – regression trees.

Energy storage applications (figure legend): SVR – support vector regression, GPR – Gaussian process regression, LPR – kernel smoothing, DC-R – local parametric regression.

Approximate policy iteration
A tale of two distributions:
» The sampling distribution, which governs the likelihood that we sample a state.
» The learning distribution, which is the distribution of states we would visit under the current policy.
(Figure: both distributions plotted over the state S.)

Approximate policy iteration
Using the optimal value function: now we are going to use the optimal policy to fit approximate value functions and watch the stability. (Figures: the optimal value function with a quadratic fit; the state distribution under the optimal policy.)

Approximate policy iteration
Policy evaluation: 500 samples (the problem only has 31 states!).
» After 50 policy improvements with the optimal state distribution: divergence in the sequence of VFAs, 40%–70% optimality.
» After 50 policy improvements with a uniform state distribution: stable VFAs, 90% optimality.
(Figures: state distribution under the optimal policy; VFAs estimated after 50 and 51 policy iterations.)

Outline: Least squares approximate policy iteration; Direct policy search; Approximate policy iteration using machine learning; Exploiting concavity; Exploiting monotonicity; Closing thoughts.

Exploiting concavity
Bellman's optimality equation:
» With pre-decision state.
» With post-decision state (the resource component of the post-decision state is the inventory held over from the previous time period).
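The equations did not survive the transcript; standard pre- and post-decision forms (assumed notation, written as maximization since it is the concavity of the value function in the storage level that is being exploited) are

\[
V_t(S_t) = \max_{x} \Big( C(S_t, x) + \gamma\, \mathbb{E}\big[V_{t+1}(S_{t+1}) \mid S_t, x\big] \Big),
\qquad
V^x_t(S^x_t) = \mathbb{E}\Big[ \max_{x'} \Big( C(S_{t+1}, x') + \gamma\, V^x_{t+1}\big(S^x_{t+1}\big) \Big) \,\Big|\, S^x_t \Big],
\]

where the post-decision resource state is, e.g., R^x_t = R_t + x_t, the inventory held over to the next period. The post-decision form moves the expectation outside the max, so the inner optimization is deterministic.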

Exploiting concavity We update the piecewise linear value functions by computing estimates of slopes using a backward pass: »The cost along the marginal path is the derivative of the simulation with respect to the flow perturbation.
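To make this concrete, here is a minimal sketch of folding such a sampled slope into a piecewise-linear concave approximation stored as a vector of slopes over discretized storage levels. The smoothing-plus-projection structure is in the spirit of the SPAR/CAVE-type methods cited on the next slide, but the function names and the pool-adjacent-violators projection used here are illustrative choices, not the talk's exact algorithm.

```python
def _project_nonincreasing(slopes):
    """Pool adjacent violators: project a slope vector onto the set of
    non-increasing sequences, which restores concavity of the piecewise-linear
    value function built from these slopes."""
    blocks = []                                   # each block is [sum, count]
    for s in slopes:
        blocks.append([s, 1])
        # merge blocks while their averages violate the non-increasing order
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]:
            v, c = blocks.pop()
            blocks[-1][0] += v
            blocks[-1][1] += c
    out = []
    for v, c in blocks:
        out.extend([v / c] * c)
    return out

def update_concave_vfa(slopes, r_index, vhat_slope, alpha=0.1):
    """One update of a piecewise-linear concave VFA: smooth the sampled
    derivative `vhat_slope` (e.g., from the backward pass described above)
    into the slope at storage level `r_index`, then restore concavity."""
    slopes = list(slopes)
    slopes[r_index] = (1 - alpha) * slopes[r_index] + alpha * vhat_slope
    return _project_nonincreasing(slopes)
```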

Exploiting concavity Derivatives are used to estimate a piecewise linear approximation

Exploiting concavity
Convergence results for piecewise linear, concave functions:
» Godfrey, G. and W.B. Powell, "An Adaptive, Distribution-Free Algorithm for the Newsvendor Problem with Censored Demands, with Application to Inventory and Distribution Problems," Management Science, Vol. 47, No. 8, 2001.
» Topaloglu, H. and W.B. Powell, "An Algorithm for Approximating Piecewise Linear Concave Functions from Sample Gradients," Operations Research Letters, Vol. 31, No. 1, 2003.
» Powell, W.B., A. Ruszczynski and H. Topaloglu, "Learning Algorithms for Separable Approximations of Stochastic Optimization Problems," Mathematics of Operations Research, Vol. 29, No. 4, 2004.
Convergence results for storage problems:
» Nascimento, J. and W.B. Powell, "An Optimal Approximate Dynamic Programming Algorithm for Concave, Scalar Storage Problems with Vector-Valued Controls," IEEE Transactions on Automatic Control, Vol. 58, No. 12, 2013.
» Nascimento, J. and W.B. Powell, "An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem," Mathematics of Operations Research, 2009.

Exploiting concavity (figure: percent of optimal, by storage problem)

Exploiting concavity (figure: percent of optimal, by storage problem)

Grid-level storage

ADP (blue) vs. LP optimal (black)

Exploiting concavity
The problem of dealing with the state of the world:
» Temperature, interest rates, …
(Figure label: state of the world.)

Exploiting concavity
This is an active area of research; key ideas center on different methods for clustering. (Figure labels: state of the world, query state.)
» Lauren Hannah, W.B. Powell, D. Dunson, "Semi-Convex Regression for Metamodeling-Based Optimization," SIAM J. on Optimization, Vol. 24, No. 2, 2014.

Outline: Least squares approximate policy iteration; Direct policy search; Approximate policy iteration using machine learning; Exploiting concavity; Exploiting monotonicity; Closing thoughts.

Hour-ahead bidding
The bid is placed at 1pm, consisting of charge and discharge prices for the hour between 2pm and 3pm.

A bidding problem The exact value function

A bidding problem Approximate value function without monotonicity
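The slides do not spell out the monotonicity-preserving update, but the idea behind figures like the one above can be sketched as follows, in the spirit of monotone ADP. The array layout, the direction of monotonicity, and the function names are assumptions for illustration, not the talk's algorithm.

```python
import numpy as np

def monotone_update(vbar, state, vhat, alpha=0.1):
    """Smooth a new observation `vhat` into the lookup-table VFA `vbar` at the
    visited `state` (a tuple of indices), then project the table back onto the
    set of functions that are nondecreasing in every coordinate."""
    vbar[state] = (1 - alpha) * vbar[state] + alpha * vhat
    v = vbar[state]
    for idx in np.ndindex(vbar.shape):
        if all(i >= j for i, j in zip(idx, state)) and vbar[idx] < v:
            vbar[idx] = v        # states "above" the visited one cannot be worth less
        elif all(i <= j for i, j in zip(idx, state)) and vbar[idx] > v:
            vbar[idx] = v        # states "below" it cannot be worth more
    return vbar
```

For example, with a two-dimensional bid state one might call monotone_update(np.zeros((20, 20)), (5, 7), 12.3); the structural information propagates the single observation to many unvisited states, which is what makes the lookup table with structure competitive in the results below.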

A bidding problem

Outline: Least squares approximate policy iteration; Direct policy search; Approximate policy iteration using machine learning; Exploiting concavity; Exploiting monotonicity; Closing thoughts.

Observations
» Approximate value iteration using a linear model can produce very poor results under the best of circumstances, and potentially terrible results.
» Least squares approximate policy iteration, the highly regarded classic algorithm of Lagoudakis and Parr, works poorly.
» Approximate policy iteration is OK with support vector regression, but below expectation for such a simple problem.
» A basic lookup table by itself works poorly.
» A lookup table with structure works very well: convexity does not require explicit exploration; monotonicity does require explicit exploration, but is limited to a very low-dimensional information state.
» So, we can conclude that nothing works reliably in a way that would scale to more complex problems!