Computational Stochastic Optimization: Bridging communities October 25, 2012 Warren Powell CASTLE Laboratory Princeton University © 2012 Warren B. Powell, Princeton University

Outline: From stochastic search to dynamic programming. From dynamic programming to stochastic programming.

From stochastic search to DP Classical stochastic search »The prototypical stochastic search problem is posed as $\max_x \mathbb{E}\, F(x,W)$, where x is a deterministic parameter and W is a random variable. »Variations: the expectation cannot be computed; function evaluations may be expensive; the random noise may be heavy-tailed (e.g., rare events); the function F may or may not be differentiable.
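A minimal sketch of this problem class, not from the talk: when we can only sample F(x,W), we can search for x with a stochastic gradient method. The newsvendor-style profit function and the exponential demand distribution below are illustrative assumptions.

```python
# A minimal sketch (not from the talk): solving max_x E[F(x, W)] by a
# stochastic gradient method when we can only sample F.  The newsvendor-style
# profit function and the exponential demand distribution are illustrative
# assumptions.
import numpy as np

rng = np.random.default_rng(0)
price, cost = 10.0, 6.0

def stochastic_gradient(x, W):
    """Sample gradient of F(x, W) = price*min(x, W) - cost*x with respect to x."""
    return (price if x < W else 0.0) - cost

x = 5.0
for n in range(1, 5001):
    W = rng.exponential(scale=20.0)                          # sample the random input W
    x = max(0.0, x + (5.0 / n) * stochastic_gradient(x, W))  # declining step size 5/n

print(f"estimated maximizer of E[F(x, W)]: x = {x:.2f}")
```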

From stochastic search to DP Imagine that our policy $X^\pi(S_t \mid \theta)$ is parameterized by a vector $\theta$ (for example, a regression vector weighting a set of basis functions). Instead of estimating the value of being in a state, what if we tune $\theta$ to get the best performance? »This is known as policy search. It builds on classical fields such as stochastic search and simulation-optimization. »Very stable, but it is generally limited to problems with a much smaller number of parameters.

From stochastic search to DP The next slide illustrates experiments on a simple battery storage problem. We developed 20 benchmark problems which we could solve optimally using classical methods from the MDP literature. We then compared four policies: »A myopic policy (which did not store energy) »Two policies that use Bellman error minimization: LSAPI, least squares approximate policy iteration; and IVAPI, which is LSAPI using instrumental variables »Direct: here we use the same policy as LSAPI, but use policy search directly to find the regression vector.

From stochastic search to DP Performance using Bellman error minimization (light blue and purple bars).

Optimal learning Now assume we have five choices, with uncertainty in our belief about how well each one will perform. If you can make one measurement, which would you measure?

Optimal learning Policy search process: »Choose a parameter vector $\theta$. »Simulate the policy along a sample path $\omega$ to get a noisy estimate of its value, $\hat{F}(\theta, \omega)$.
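A sketch of this process under illustrative assumptions (a toy buy-low/sell-high storage policy, Gaussian prices, and a small grid search over theta; none of these details come from the talk):

```python
# A sketch of policy search by simulation (all details are illustrative
# assumptions): the policy X(S | theta) buys energy when the price is below
# theta[0] and sells when it is above theta[1].  We estimate F(theta) by
# averaging simulated sample paths and keep the best theta on a small grid.
import numpy as np

rng = np.random.default_rng(1)

def simulate_policy(theta, T=200):
    """One sample path omega: return a noisy estimate of the policy's value."""
    buy_below, sell_above = theta
    storage, profit = 0.0, 0.0
    for _ in range(T):
        price = max(0.0, 30.0 + 10.0 * rng.standard_normal())  # exogenous W_t
        if price < buy_below and storage < 10.0:    # charge the battery
            storage += 1.0
            profit -= price
        elif price > sell_above and storage > 0.0:  # discharge and sell
            storage -= 1.0
            profit += price
    return profit

def estimate_value(theta, n_paths=50):
    """Monte Carlo estimate of F(theta) = E[ F_hat(theta, omega) ]."""
    return np.mean([simulate_policy(theta) for _ in range(n_paths)])

candidates = [(b, s) for b in (20, 25, 30) for s in (35, 40, 45)]
best_theta = max(candidates, key=estimate_value)
print("best (buy_below, sell_above):", best_theta)
```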

Optimal learning At first, we believe that the value of each alternative follows a prior distribution. But when we measure alternative x and observe its sampled value, our beliefs change: the estimated mean shifts toward the observation and its uncertainty shrinks. Thus, our beliefs about the rewards are gradually improved over measurements.
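A minimal sketch of this updating, assuming independent Gaussian beliefs and Gaussian measurement noise (the prior means, variances, and observed value are illustrative): precisions add, and the posterior mean is a precision-weighted average of the prior mean and the observation.

```python
# A minimal sketch of the belief update, assuming independent Gaussian beliefs
# and Gaussian measurement noise (all numbers are illustrative).
import numpy as np

mu = np.array([5.0, 4.0, 6.0, 3.0, 5.5])     # prior means for five alternatives
var = np.array([4.0, 4.0, 4.0, 4.0, 4.0])    # prior variances
noise_var = 1.0                              # measurement noise variance

def update(mu, var, x, W_hat):
    """Return updated beliefs after observing W_hat for alternative x."""
    mu, var = mu.copy(), var.copy()
    beta_prior, beta_W = 1.0 / var[x], 1.0 / noise_var   # precisions
    mu[x] = (beta_prior * mu[x] + beta_W * W_hat) / (beta_prior + beta_W)
    var[x] = 1.0 / (beta_prior + beta_W)
    return mu, var

mu, var = update(mu, var, x=2, W_hat=7.3)
print("posterior means:    ", mu)
print("posterior variances:", var)
```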

Optimal learning No improvement: the measurement may leave our choice unchanged. Now assume we have five choices, with uncertainty in our belief about how well each one will perform. If you can make one measurement, which would you measure?

Optimal learning New solution: the value of learning is that it may change your decision. Now assume we have five choices, with uncertainty in our belief about how well each one will perform. If you can make one measurement, which would you measure?

Optimal learning An important problem class involves correlated beliefs: measuring one alternative tells us something about other alternatives, so those beliefs change too.
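A sketch of correlated-belief updating (the covariance matrix and the observation below are illustrative assumptions): measuring alternative x also moves the beliefs about the other alternatives through the covariance.

```python
# A sketch of updating correlated Gaussian beliefs (the covariance matrix and
# the observation are illustrative assumptions).
import numpy as np

mu = np.array([5.0, 4.0, 6.0])
Sigma = np.array([[4.0, 3.0, 1.0],
                  [3.0, 4.0, 3.0],
                  [1.0, 3.0, 4.0]])
noise_var = 1.0

def update_correlated(mu, Sigma, x, W_hat):
    """Standard multivariate Gaussian update after measuring alternative x."""
    Sigma_col = Sigma[:, x]                       # Sigma e_x
    denom = Sigma[x, x] + noise_var
    mu_new = mu + (W_hat - mu[x]) / denom * Sigma_col
    Sigma_new = Sigma - np.outer(Sigma_col, Sigma_col) / denom
    return mu_new, Sigma_new

mu, Sigma = update_correlated(mu, Sigma, x=1, W_hat=6.5)
print(mu)    # beliefs about alternatives 0 and 2 move as well
```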

Optimal learning with a physical state The knowledge gradient »The knowledge gradient is the expected marginal value of a single measurement x, given by $\nu^{KG}_x = \mathbb{E}\big[\max_{x'} \bar{\mu}^{n+1}_{x'}(x) \mid S^n\big] - \max_{x'} \bar{\mu}^{n}_{x'}$, where $S^n$ is the knowledge state, $x'$ indexes the implementation decision, the expectation is over the different measurement outcomes, the first term is the new optimization problem solved with the updated knowledge state given measurement x, and the second term is the optimization problem given what we know now. »The knowledge gradient policy chooses the measurement with the highest marginal value.

The knowledge gradient Computing the knowledge gradient for Gaussian beliefs »The change in variance from one more measurement of x can be found to be $\tilde{\sigma}^{2}_x = \sigma^{2,n}_x - \sigma^{2,n+1}_x$. »Next compute the normalized influence, $\zeta_x = -\left|\bar{\mu}^{n}_x - \max_{x' \neq x} \bar{\mu}^{n}_{x'}\right| / \tilde{\sigma}_x$, which compares x to the best of the other alternatives. »Let $f(\zeta) = \zeta\,\Phi(\zeta) + \phi(\zeta)$, where $\Phi$ and $\phi$ are the standard normal cdf and density. »The knowledge gradient is then computed using $\nu^{KG}_x = \tilde{\sigma}_x f(\zeta_x)$.
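A sketch of these steps for independent Gaussian beliefs (the prior means, variances, and noise level are illustrative):

```python
# A sketch of the knowledge-gradient computation for independent Gaussian
# beliefs; the means, variances, and noise level are illustrative.
import numpy as np
from scipy.stats import norm

mu = np.array([5.0, 4.0, 6.0, 3.0, 5.5])     # current belief means
var = np.array([4.0, 1.0, 0.5, 4.0, 2.0])    # current belief variances
noise_var = 1.0                              # measurement noise variance

# Change in the standard deviation from one more measurement of x
sigma_tilde = np.sqrt(var - 1.0 / (1.0 / var + 1.0 / noise_var))

# Normalized influence: distance to the best of the *other* alternatives
best_other = np.array([np.max(np.delete(mu, x)) for x in range(len(mu))])
zeta = -np.abs(mu - best_other) / sigma_tilde

# f(zeta) = zeta*Phi(zeta) + phi(zeta); knowledge gradient = sigma_tilde * f
kg = sigma_tilde * (zeta * norm.cdf(zeta) + norm.pdf(zeta))

print("knowledge gradients:", np.round(kg, 4))
print("measure alternative", int(np.argmax(kg)))
```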

Optimizing storage After four measurements: »Whenever we measure at a point, the value of another measurement at the same point goes down. The knowledge gradient guides us to measuring areas of high uncertainty. (Figure: estimated value and knowledge gradient surfaces, with the latest measurement and the new optimum marked.)

Optimizing storage After five measurements. (Figure: estimated value and knowledge gradient after the measurement.)

Optimizing storage After six, seven, eight, nine, and ten samples. (Figures: the estimated value and knowledge gradient surfaces after each sample.)

Optimizing storage After ten samples, our estimate of the surface. (Figure: estimated value vs. true value.)

From stochastic search to DP Performance using direct policy search (yellow bars).

From stochastic search to DP Notes: »Direct policy search can be used to tune the parameters of any policy: the horizon for a deterministic lookahead policy; the sampling strategy when using a stochastic lookahead policy; the parameters of a parametric policy function approximation. »But there are some real limitations. It can be very difficult to obtain gradients of the objective function with respect to the tunable parameters. It is very hard to do derivative-free stochastic search with large numbers of parameters. This limits our ability to handle time-dependent policies.

Outline: From stochastic search to dynamic programming. From dynamic programming to stochastic programming.

From DP to stochastic programming The slides that follow start from the most familiar form of Bellman's optimality equation for discrete states and actions. We then create a bridge to the classical formulations used in stochastic programming. Along the way, we show that stochastic programming is actually a lookahead policy that solves a reduced dynamic program over a shorter horizon with a restricted representation of the random outcomes.

From DP to stochastic programming All dynamic programming starts with Bellman's equation: $V(s) = \max_a \big( C(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \big)$. All problems in stochastic programming are time-dependent, so we write it as $V_t(S_t) = \max_{a_t} \big( C(S_t,a_t) + \gamma \sum_{s'} P(s' \mid S_t,a_t)\, V_{t+1}(s') \big)$. We cannot compute the one-step transition matrix, so we first replace it with the expectation form, $V_t(S_t) = \max_{a_t} \big( C(S_t,a_t) + \gamma\, \mathbb{E}\{ V_{t+1}(S_{t+1}) \mid S_t, a_t \} \big)$.
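For the discrete-state, discrete-action case, Bellman's equation can be solved by value iteration; the tiny two-state MDP below is a sketch with made-up data, not an example from the talk.

```python
# A sketch of solving Bellman's equation by value iteration for a tiny
# discrete-state, discrete-action problem (the two-state MDP is made up
# for illustration).
import numpy as np

gamma = 0.9
C = np.array([[1.0, 0.0],            # C[s, a]: one-period contribution
              [2.0, 3.0]])
P = np.array([[[0.8, 0.2],           # P[a, s, s']: transition probabilities
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.9, 0.1]]])

V = np.zeros(C.shape[0])
for _ in range(1000):
    # V(s) = max_a ( C(s, a) + gamma * sum_s' P(s' | s, a) V(s') )
    Q = C + gamma * np.einsum('asp,p->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

print("optimal values: ", V)
print("optimal actions:", Q.argmax(axis=1))
```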

From DP to stochastic programming Implicit in the value function is an assumption that we are following an optimal policy. We are going to temporarily assume that we are following a fixed policy $\pi$. We cannot compute the expectation, so we replace it with an average over a Monte Carlo sample of outcomes. We use this opportunity to make the transition from discrete actions a to vectors x.

From DP to stochastic programming The state consists of two components: »The resource vector $R_{t'}$, which is determined by the prior decisions. »The exogenous information. In stochastic programming, it is common to represent the exogenous information as the entire history (starting at time t), which we might write as $h_{t'} = (W_{t+1}, W_{t+2}, \ldots, W_{t'})$. While we can always use the history itself as the state, in most applications the state is lower dimensional. »We will write the state as a function of the history, by which we mean that the state carries the same information content as the history.

From DP to stochastic programming We are now going to drop the reference to the generic policy, and instead reference the decision vector $x_{t'}(h_{t'})$ indexed by the history (alternatively, by the node in the scenario tree). »Note that this is equivalent to a lookup table representation using simulated histories. »We write this in the form of a policy, and also make the transition to the horizon $t, \ldots, t+H$. Here, the decision at each time $t'$ is a vector over all histories $h_{t'}$. This is a lookahead policy, which optimizes the lookahead model.

From DP to stochastic programming We make one last tweak to get it into a more compact form: »In this formulation, we let $\omega$ denote a full sample path drawn from a sampled set $\hat{\Omega}$. »We are now writing a vector $x_{t'}(\omega)$ for each sample path in $\hat{\Omega}$. This introduces a complication that we did not encounter when we indexed each decision by a history: we are now letting the decision "see" the future.

From DP to stochastic programming When we indexed by histories, there might be one history at time t (since this is where we are starting), 10 histories for time t+1 and 100 histories for time t+2, giving us 111 vectors to determine. When we have a vector for each sample path, we have 100 vectors for each of times t, t+1 and t+2, giving us 300 vectors. When we index on the sample path, we are effectively letting the decision "see" the entire sample path.

From DP to stochastic programming To avoid the problem of letting a decision see into the future, we create sets of all the sample paths that share the same history: $\mathcal{H}_{t'}(h_{t'}) = \{\omega \in \hat{\Omega} : h_{t'}(\omega) = h_{t'}\}$. We now require that all decisions with the same history be the same: $x_{t'}(\omega) = \bar{x}_{t'}(h_{t'})$ for all $\omega \in \mathcal{H}_{t'}(h_{t'})$. »These are known in stochastic programming as nonanticipativity constraints.
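A small sketch (with made-up outcomes) of how these constraint groups are generated from a set of sample paths:

```python
# A small sketch (made-up outcomes) of generating nonanticipativity
# constraints: sample paths that share the same history up to time t' must
# receive the same decision x_t'(omega).
from collections import defaultdict

# Each sample path omega is the sequence of outcomes (W_{t+1}, W_{t+2}, ...)
sample_paths = {
    0: ("high", "high"),
    1: ("high", "low"),
    2: ("low", "high"),
    3: ("low", "low"),
}

def nonanticipativity_groups(sample_paths, t_prime):
    """Group sample paths whose decisions at time t_prime must agree."""
    groups = defaultdict(list)
    for omega, path in sample_paths.items():
        history = path[:t_prime]              # what has been observed so far
        groups[history].append(omega)
    # each group implies constraints x[t_prime, omega] == x[t_prime, omega_0]
    return [omegas for omegas in groups.values() if len(omegas) > 1]

print(nonanticipativity_groups(sample_paths, t_prime=1))
# -> [[0, 1], [2, 3]]: paths 0 and 1 share history ("high",), paths 2 and 3 share ("low",)
```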

From DP to stochastic programming A scenario tree »A node in the scenario tree is equivalent to a history.

From DP to stochastic programming This is a lookahead policy that solves the lookahead model directly, by optimizing over all decisions over all time periods at the same time. Not surprisingly, this can be computationally demanding. The lookahead model is, first and foremost, a dynamic program (although simpler than the original dynamic program), and can be solved using Bellman's equation (but exploiting convexity).

From DP to stochastic programming We are going to start with Bellman's equation as it is used in the stochastic programming community (e.g., by Shapiro and Ruszczynski): $Q_t(x_{t-1}, \xi_{[t]}) = \min_{x_t \in X_t} \big( c_t x_t + \mathbb{E}\{ Q_{t+1}(x_t, \xi_{[t+1]}) \mid \xi_{[t]} \} \big)$. Translation: $Q_t$ is the value function; the previous decision $x_{t-1}$ plays the role of the resource state; and $\xi_{[t]}$ is the history of exogenous information up through time t.

From DP to stochastic programming We first modify the notation to reflect that we are solving the lookahead model, indexing its variables both by t (when the lookahead model is formed) and by t' (the time period within the lookahead horizon). »This gives us the lookahead version of the same recursion.

From DP to stochastic programming For our next step, we have to introduce the concept of the post-decision state. Our resource vector evolves according to $R_{t'} = R^{x}_{t'-1} + \hat{W}_{t'}$, where $\hat{W}_{t'}$ represents exogenous (stochastic) input at time t' (when we are solving the lookahead model starting at time t). Now define the post-decision resource state $R^{x}_{t'}$: this is the state immediately after a decision is made, a deterministic function of the pre-decision state $R_{t'}$ and the decision $x_{t'}$.

From DP to stochastic programming We can now write Bellman's equation around the post-decision state as $V_{t'}(R_{t'}) = \max_{x_{t'}} \big( c_{t'} x_{t'} + V^{x}_{t'}(R^{x}_{t'}) \big)$, where $V^{x}_{t'}(R^{x}_{t'}) = \mathbb{E}\{ V_{t'+1}(R_{t'+1}) \mid R^{x}_{t'} \}$. We can approximate this value function using Benders cuts, a piecewise-linear approximation built from the duals of the next-stage problem.
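A sketch of what a cut-based approximation looks like, assuming a maximization problem (so each cut is an upper bound and the approximation is the pointwise minimum of the cuts); the cut coefficients and the stage problem below are made up for illustration.

```python
# A sketch of a cut-based value function approximation (all numbers made up).
# For a maximization problem each Benders cut alpha_k + beta_k * R is an upper
# bound on the post-decision value, so the approximation is the pointwise
# minimum over the cuts, a concave piecewise-linear function of R.
import numpy as np

cuts = [(10.0, 0.0), (2.0, 2.0), (6.0, 0.8)]   # (alpha_k, beta_k) pairs

def V_bar(R):
    """Cut-based approximation of the post-decision value function."""
    return min(alpha + beta * R for alpha, beta in cuts)

# Stage problem: pick a storage level x, trading the immediate contribution
# c*x against the approximate future value of the post-decision state R = x.
c = -1.0                                       # storing costs 1 per unit now
candidates = np.linspace(0.0, 10.0, 101)
best_x = max(candidates, key=lambda x: c * x + V_bar(x))
print(f"store {best_x:.1f} units, approximate value {c * best_x + V_bar(best_x):.2f}")
```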

From DP to stochastic programming We can use value functions to step backward through the tree as a way of solving the lookahead model.

From DP to stochastic programming Stochastic programming, in most applications, consists of solving a stochastic lookahead model using one of two methods: »Direct solution of all decisions over the scenario tree. »Solution of the scenario tree using approximate dynamic programming with Benders cuts. Finding optimal solutions, even with a restricted representation of the outcomes in a scenario tree, can be computationally very demanding. An extensive literature exists on designing optimal algorithms (or algorithms with provable bounds) for the scenario-restricted representation. Solving the lookahead model is an approximate policy: an optimal solution to the lookahead model is not an optimal policy, and bounds on its solution do not provide bounds on performance relative to the optimal policy.