Approximate Dynamic Programming for High-Dimensional Resource Allocation Problems

Presentation transcript:

Approximate Dynamic Programming for High-Dimensional Resource Allocation Problems
Princeton University, Warren B. Powell

LONG-TERM PAYOFF
Efficient, robust control of complex systems of people, equipment and resources. Advances in fundamental algorithms for stochastic control with many applications.

OBJECTIVES
Fast algorithms for real-time control.
Optimal learning rates that maximize the rate of convergence and adapt to new conditions.
Self-adapting planning models that minimize tuning and calibration.
Robust solutions that improve the response to random events.

APPROACH / TECHNICAL CHALLENGES
Approximate dynamic programming combining math programming, signal processing and recursive statistics.
Challenge: quickly finding stable policies in the presence of high-dimensional state variables.

ACCOMPLISHMENTS / RESULTS
New method for optimal learning (the "knowledge gradient").
Reduced solution time for a class of nonlinear control problems from hours to seconds.
Adaptations for robust response in spare parts management, transportation control and general resource allocation.

FUNDING ($K)
AFOSR funds: FY05 190, FY06 197, FY07 205.
Several industrial sponsors provide applications for ADP in different settings. Additional funds from NSF and the Canadian Air Force.

TRANSITIONS
Adoption by several industrial sponsors.
9 publications (including 4 to appear) in 2006.
Interaction with AFRL (summer 2006) and an ongoing relationship with AMC (tanker refueling).

STUDENTS, POST-DOCS
3 students, 1 post-doc.

AFOSR is the major government sponsor for CASTLE Laboratory, which focuses on the development of modeling and algorithmic technologies for solving high-dimensional stochastic resource allocation problems, with applications in both the real-time control and the planning of complex operational systems. These systems arise in the management of UAVs, people and equipment on the ground, the Air Mobility Command, mid-air refueling, and other resources such as fuel, medical personnel, vaccines and other supplies. This research has produced advances for stochastic optimization problems with many applications outside this domain. A major breakthrough has been the development of a set of techniques, under the umbrella of approximate dynamic programming, which merge dynamic programming, math programming, statistics and simulation.

Technical approach
Approximate dynamic programming for high-dimensional resource allocation problems.
Standard formulation of Bellman's equation (the slide's equations are written out after these notes).
State variables with 10 dimensions can produce state spaces with 100 million elements; real problems exhibit state vectors with millions of dimensions.
Steps in the ADP strategy:
Simulate (from the post-decision to the pre-decision state).
Solve a deterministic optimization problem.
Statistically update the value function approximation.

At the heart of our approach is a technique that breaks the traditional transition equation describing the evolution of the state variable into two steps: the evolution from the classical pre-decision state variable S_t to the post-decision state variable S^x_t. The core algorithm then consists of three steps:
Simulation: simulate the evolution from the post-decision state S^x_{t-1} to the next pre-decision state S_t.
Optimization: solve a deterministic optimization problem (using any of a vast array of solvers) that uses a value function estimated around the post-decision state, which is a deterministic function of the pre-decision state S_t and the decision x_t. Because this problem covers one time period at a time, we can solve very large-scale instances using commercial solvers such as CPLEX.
Statistics: use the result of the optimization problem at time t to update the value function around the previous post-decision state S^x_{t-1}.

Technical advances in 2006:
A new optimal stepsize algorithm accelerates recursive learning (it extends the Kalman filter to nonstationary data; the publication has appeared in Machine Learning).
A new hierarchical learning algorithm estimates a variable-dimension approximation, adapting to a higher-dimensional representation as the algorithm progresses, and avoids the need to prespecify a particular functional form.
Convergence proofs for special problem classes that do not require state-space exploration (exploration does not scale to high-dimensional problems).
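The equations on the original slide did not survive the transcript. A standard rendering in the notation of the notes above is given below as a sketch; the contribution function C, discount factor gamma, stepsize alpha_n, and the transition functions S^{M,W} and S^{M,x} follow Powell's usual ADP notation and are not taken verbatim from the slide.

Bellman's equation around the pre-decision state:
\[
V_t(S_t) = \max_{x_t} \Big( C(S_t, x_t) + \gamma\, \mathbb{E}\big[ V_{t+1}(S_{t+1}) \,\big|\, S_t, x_t \big] \Big)
\]

The three ADP steps at iteration n, time t:
\[
\text{Simulate:}\quad S_t^n = S^{M,W}\big(S_{t-1}^{x,n},\, W_t(\omega^n)\big)
\]
\[
\text{Optimize:}\quad \hat{v}_t^n = \max_{x_t} \Big( C(S_t^n, x_t) + \bar{V}_t^{\,n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)
\]
\[
\text{Update:}\quad \bar{V}_{t-1}^{\,n}\big(S_{t-1}^{x,n}\big) = (1-\alpha_{n-1})\,\bar{V}_{t-1}^{\,n-1}\big(S_{t-1}^{x,n}\big) + \alpha_{n-1}\,\hat{v}_t^n
\]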

Control of stochastic "storage" problems
Advances in the stability of stochastic algorithms.
"Storage" problems are a class of nonlinear control problems: a scalar control (quantity of gas being stored, temperature) driven by complex exogenous information processes.
Provably convergent algorithms can still produce unstable behavior.
Research demonstrated the importance of evolving from simple to more complex functional approximations.
[Slide figure: value function with control regions labeled "decrease storage," "no change," and "increase storage."]
Solution: use a weighted combination of successively more complex approximations, where adaptive weighting puts more weight on approximations with more parameters as we gain information (see the sketch after these notes).

One problem we addressed this year was the optimal control of a storage process in which a good (natural gas, fuel, number of aircraft, people in the military, energy resources) must be held in the presence of various forms of random, exogenous information (market prices, availability of a technology, weather, ...). Such problems have a single, scalar state variable that is controlled (how much is in storage) and a vector of variables describing exogenous conditions. We have shown that our standard ADP algorithm is provably convergent using what is known as "pure exploitation." This is a significant feature in ADP, where we typically have to explore (visit states just to estimate the value of being in a state), but it introduces significant statistical problems. Using this problem (which is relatively simple compared to other problems we are working on), we found that standard statistical methods for approximating the value function, even when provably convergent, could work very poorly in practice. We designed an adaptive learning algorithm that makes the transition from simple to more complex approximations as the data needed to estimate those approximations accurately is created.
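A minimal sketch of the adaptive-weighting idea described above, with hypothetical class and method names. The slide does not give the weighting rule; this sketch weights each approximation inversely to its running mean squared error, which is one way to shift weight toward richer models as data accumulate.

import numpy as np

class WeightedApproximation:
    """Combine value-function approximations of increasing complexity.

    Each model exposes predict(state) and update(state, value).  Weights
    are set inversely proportional to each model's running mean squared
    error, so richer models gain weight as observations accumulate.
    (Sketch only; the exact weighting rule on the slide is not stated.)
    """

    def __init__(self, models):
        self.models = models                    # ordered simple -> complex
        self.mse = np.ones(len(models))         # running MSE estimates
        self.n = np.zeros(len(models))          # observation counts

    def predict(self, state):
        w = 1.0 / (self.mse + 1e-8)
        w /= w.sum()
        return sum(wi * m.predict(state) for wi, m in zip(w, self.models))

    def update(self, state, observed_value):
        for i, m in enumerate(self.models):
            err = observed_value - m.predict(state)
            self.n[i] += 1
            alpha = 1.0 / self.n[i]             # harmonic stepsize
            self.mse[i] = (1 - alpha) * self.mse[i] + alpha * err ** 2
            m.update(state, observed_value)     # each model's own recursion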

Optimal learning using "knowledge gradients"
Challenge: you have N tries to learn as much as possible about a function that you can only measure with noise.
The knowledge gradient policy optimizes the value of what we learn in each iteration, balancing learning (variance reduction) against the value of what we learn (identifying good decisions).
Outperforms Boltzmann exploration, Gittins exploration, Kaelbling's interval estimation, pure exploitation and pure exploration.
Optimal for several special cases. Computationally fast.
Applications in approximate DP, stochastic optimization, sequential estimation, search problems, ...
[Slide figure: observations and confidence intervals across the set of choices, with the best choice highlighted.]

An important problem in approximate dynamic programming, and an important problem in its own right, is the challenge of determining how to collect information efficiently. There is a well-developed theory for this problem in an infinite-horizon, online setting (the so-called "bandit problem"), but there are many applications where we have a fixed number of trials (possibly iterations of an algorithm) to estimate a function, after which we have to use the function. We developed a simple, easy-to-compute policy called the "knowledge gradient," which produces an index for each action we might take; we simply choose the action with the highest index. The index trades off the value of learning (reducing the variance) against the power of what we are learning (whether we have a chance of actually changing our opinion of what is best). The policy is provably convergent for some important special cases, and it has also been found to work well against the standard heuristics proposed in the literature (Boltzmann exploration, Gittins exploration); a sketch of the index computation follows these notes.
[Slide figure: performance comparison of the knowledge gradient policy, Boltzmann exploration, pure exploration and Gittins exploration.]
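The slide does not show the index formula. The following is a sketch of the standard knowledge-gradient computation for independent normal beliefs; the means mu, variances sigma2 and measurement-noise variance noise2 in the example are illustrative inputs, not values from the slides.

import numpy as np
from scipy.stats import norm

def knowledge_gradient_indices(mu, sigma2, noise2):
    """Knowledge gradient index for independent normal beliefs.

    mu      : posterior mean of each alternative
    sigma2  : posterior variance of each alternative
    noise2  : measurement-noise variance
    Returns one index per alternative; the policy measures the argmax.
    """
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # Standard deviation of the change in belief from one more measurement
    sigma_tilde = sigma2 / np.sqrt(sigma2 + noise2)
    kg = np.empty_like(mu)
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))   # best competing mean
        z = -abs(mu[x] - best_other) / sigma_tilde[x]
        kg[x] = sigma_tilde[x] * (z * norm.cdf(z) + norm.pdf(z))
    return kg

# Example: five noisy alternatives; measure the one with the largest index
mu = [1.0, 1.2, 0.8, 1.1, 0.5]
sigma2 = [0.5, 0.4, 0.9, 0.3, 1.0]
print(np.argmax(knowledge_gradient_indices(mu, sigma2, noise2=0.25)))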

Transitions
Managing locomotives at Norfolk Southern Railroad: using AFOSR-sponsored research, we modified the ADP algorithm to use nested nonlinear value function approximations, producing a dramatic improvement in the rate of convergence and the stability of the solution.
Planning driver operations at Schneider National: the ADP algorithm plans the movements of 7,000 drivers with complex operational strategies, and is used by Schneider full time to plan driver management policies.
Fleet simulations for NetJets: captures pilots, aircraft and requirements (customers), with full work rules and operational constraints. After three years of development, the model was judged to be calibrated in 2006; the first study evaluated the effect of aircraft reliability.
Mid-air refueling simulator for AMC: optimized scheduling of tankers under uncertainty, reducing the required number of tankers by 20 percent relative to current planning methods.

CASTLE Lab attracts funding from a number of industrial sponsors who provide specific problem settings in which to test our ideas. Above are three companies (Norfolk Southern, one of the four largest railroads in the U.S.; Schneider National, the largest truckload motor carrier; and NetJets, the largest operator of business jets) plus the Air Mobility Command, which gave us a nice problem in planning mid-air refueling. These projects often produce new challenges for us, but they serve primarily as an opportunity to test the theory developed under AFOSR funding. They demonstrate that our ADP methodology is scalable and can handle a wide range of issues that arise in real applications.