Approximate Dynamic Programming for High-Dimensional Resource Allocation Problems
Princeton University, Warren B. Powell

LONG-TERM PAYOFF: Efficient, robust control of complex systems of people, equipment and resources. Advances in fundamental algorithms for stochastic control with many applications.

OBJECTIVES: Fast algorithms for real-time control. Optimal learning rates to maximize the rate of convergence and adapt to new conditions. Self-adapting planning models that minimize tuning and calibration. Robust solutions that improve the response to random events.

APPROACH/TECHNICAL CHALLENGES: Approximate dynamic programming combining math programming, signal processing and recursive statistics. Challenge: quickly finding stable policies in the presence of high-dimensional state variables.

ACCOMPLISHMENTS/RESULTS: New method for optimal learning (the "knowledge gradient"). Reduced solution time for a class of nonlinear control problems from hours to seconds. Adaptation to robust response in spare parts, transportation control and general resource allocation.

FUNDING ($K), FY05-FY09: AFOSR funds; several industrial sponsors who provide applications for ADP in different settings; additional funds from NSF and the Canadian Air Force.

TRANSITIONS: Adoption by several industrial sponsors. Nine publications (including four to appear) in 2006. Interaction with AFRL (summer 2006) and an ongoing relationship with AMC (tanker refueling).

STUDENTS, POST-DOCS: 3 students, 1 post-doc.

Notes: AFOSR is the major government sponsor for CASTLE Laboratory, which focuses on the development of modeling and algorithmic technologies for solving high-dimensional stochastic resource allocation problems, with applications in both the real-time control and the planning of complex operational systems. These systems arise in the management of UAVs, on-ground people and equipment, the Air Mobility Command, mid-air refueling, and other resources such as fuel, medical personnel, vaccines and other supplies. This research has produced advances for stochastic optimization problems with many applications outside this domain. A major breakthrough has been the development of a set of techniques, under the umbrella of approximate dynamic programming, which merge dynamic programming, math programming, statistics and simulation.
Technical approach
Approximate dynamic programming for high-dimensional resource allocation problems

Standard formulation of Bellman's equation (see the reconstruction below). State variables with 10 dimensions can produce state spaces with 100 million elements; real problems exhibit state vectors with millions of dimensions.

Steps in the ADP strategy (equations sketched below):
1. Simulate (from the post-decision to the pre-decision state).
2. Solve a deterministic optimization problem.
3. Statistically update the value function approximation.

Notes: At the heart of our approach is a technique that breaks the traditional transition equation describing the evolution of the state variable into two steps: the evolution from the classical pre-decision state variable S_t to the post-decision state variable S^x_t. The core algorithm then consists of three steps. Simulation: simulate the evolution from the post-decision state S^x_{t-1} to the next pre-decision state S_t. Optimization: solve a deterministic optimization problem (using any of a vast array of solvers) built around a value function estimated at the post-decision state, which is a deterministic function of the pre-decision state S_t and the decision x_t. Because this covers one time period at a time, we can solve very large-scale problems using commercial solvers such as Cplex. Statistics: use the results of the optimization problem at time t to update the value function around the previous post-decision state S^x_{t-1}.

Technical advances in 2006: New optimal stepsize algorithm accelerates recursive learning (extends the Kalman filter to nonstationary data; the publication has appeared in Machine Learning). New hierarchical learning algorithm estimates a variable-dimensioned approximation, adapting to a higher-dimensional representation as the algorithm progresses, which avoids the need to prespecify a particular functional form. Convergence proofs for special problem classes that do not require state-space exploration (exploration does not work for high-dimensional problems).
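The Bellman equation and the three update equations referenced above appear only as images in the original slide. The following is a hedged reconstruction in the standard pre/post-decision notation used in this line of work (V_t is the value function, C_t the contribution, S^x_t the post-decision state, W_t the exogenous information, alpha_n the stepsize); it illustrates the form of the steps rather than reproducing the exact formulation on the slide.

```latex
% Standard (pre-decision) formulation of Bellman's equation:
V_t(S_t) = \max_{x_t \in \mathcal{X}_t}
           \Big( C_t(S_t, x_t) + \gamma \,\mathbb{E}\big[\, V_{t+1}(S_{t+1}) \mid S_t, x_t \,\big] \Big)

% Step 1 (simulate): evolve from the previous post-decision state to the next pre-decision state
S_t = S^{M,W}\!\big(S^x_{t-1},\, W_t(\omega^n)\big)

% Step 2 (optimize): deterministic problem around the post-decision state
\hat{v}^{\,n}_t = \max_{x_t \in \mathcal{X}_t}
                  \Big( C_t(S_t, x_t) + \bar{V}^{\,n-1}_t\big(S^{M,x}(S_t, x_t)\big) \Big)

% Step 3 (statistics): smooth the new observation into the value function approximation
\bar{V}^{\,n}_{t-1}\big(S^x_{t-1}\big)
  = (1-\alpha_{n-1})\,\bar{V}^{\,n-1}_{t-1}\big(S^x_{t-1}\big) + \alpha_{n-1}\,\hat{v}^{\,n}_t
```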
Control of stochastic "storage" problems
Advances in the stability of stochastic algorithms
"Storage" problems are a class of nonlinear control problems: a scalar control (quantity of gas being stored, temperature) with complex exogenous information processes. Provably convergent algorithms can still produce unstable behavior. Research demonstrated the importance of evolving from simple to more complex functional approximations.
[Figure: policy regions for the storage decision: decrease storage / no change / increase storage]
Solution: use a weighted combination of successively more complex approximations. Adaptive weighting puts more weight on approximations with more parameters as we gain information (see the sketch below).

Notes: One problem we addressed this year was the optimal control of a storage process, where there is a good (natural gas, fuel, number of aircraft, people in the military, energy resources) and we have to determine how much of it to hold in the presence of various forms of random, exogenous information (market prices, availability of a technology, weather, ...). Such problems have a single scalar state variable which is controlled (how much is in storage) and a vector of variables describing exogenous conditions. We have shown that our standard ADP algorithm is provably convergent using what is known as "pure exploitation." This is a significant feature in ADP, where we typically have to explore (visit states just to estimate the value of being in a state), but exploitation introduces significant statistical problems. Using this problem (which is relatively simple compared to other problems we are working on), we found that standard statistical methods for approximating the value function, even when provably convergent, could work very poorly in practice. We designed an adaptive learning algorithm that makes the transition from simple to more complex approximations as we generate the data needed to estimate these approximations accurately.
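As an illustration of the adaptive weighting idea described above, here is a minimal Python sketch, our own construction rather than the lab's published algorithm: it weights a family of increasingly complex approximations by the inverse of their estimated squared prediction error, so simple approximations dominate early and richer ones take over as observations accumulate.

```python
import numpy as np

class AdaptiveWeighting:
    """Weight a family of approximations of increasing complexity by the
    inverse of their estimated squared prediction error (an illustrative
    scheme, not the published algorithm)."""

    def __init__(self, n_levels, smoothing=0.1):
        self.sq_err = np.ones(n_levels)   # running squared-error estimate per level
        self.alpha = smoothing            # smoothing constant for the error estimates

    def update(self, predictions, observed):
        # predictions[g]: what approximation level g predicted for the observed point
        errors = observed - np.asarray(predictions, dtype=float)
        self.sq_err = (1 - self.alpha) * self.sq_err + self.alpha * errors ** 2

    def combine(self, predictions):
        # Inverse-error weights, normalized to sum to one; levels with more
        # parameters earn weight only once enough data supports them.
        w = 1.0 / (self.sq_err + 1e-8)
        w /= w.sum()
        return float(np.dot(w, np.asarray(predictions, dtype=float)))
```

For a storage problem, level 0 might be a constant estimate of the marginal value of inventory, level 1 a linear fit, and level 2 a piecewise-linear approximation; all three names and the inverse-error weighting rule are assumptions made for the sketch.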
Optimal learning using “knowledge gradients”
Challenge: you have N tries to learn as much as possible about a function which you can only measure with noise. The knowledge gradient policy optimizes the value of what we learn at each iteration, balancing learning (variance reduction) against the value of what we learn (identifying good decisions). It outperforms Boltzmann exploration, Gittins exploration, Kaelbling's interval estimation, pure exploitation and pure exploration; it is optimal for several special cases and computationally fast (a sketch of the index follows below). Applications in approximate DP, stochastic optimization, sequential estimation, search problems, ...
[Figure: noisy observations and confidence intervals for each choice, with the best choice highlighted]

Notes: An important problem in approximate dynamic programming, as well as an important problem in its own right, is the challenge of determining how to collect information efficiently. There is a well-developed theory for solving this problem in an infinite-horizon, online setting (the so-called "bandit problem"). There are many applications where we have a fixed number of trials (possibly iterations of an algorithm) to estimate a function, after which we have to use the function. We developed a simple, easy-to-compute policy called "knowledge gradients" which produces an index for each action we might take; we simply choose the action with the highest index. The index trades off the value of learning (reducing the variance) against the power of what we are learning (do we have a chance of actually changing our opinion of what is best). The policy is provably optimal for some important special cases, and it has also been found to work well against the standard heuristics proposed in the literature (Boltzmann exploration, Gittins exploration).
[Figure: performance comparison of the knowledge gradient policy, Boltzmann exploration, pure exploration and Gittins exploration]
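A minimal sketch of the knowledge-gradient index for the special case of independent normal beliefs; this is the standard published form of the index, but the function and variable names below are our own and it is not the authors' production code.

```python
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma2, noise_var):
    """Knowledge-gradient index for independent normal beliefs.

    mu        : current posterior mean of each choice
    sigma2    : current posterior variance of each choice
    noise_var : variance of a single noisy measurement
    Returns the KG value of measuring each choice once; measure the argmax."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)

    # Standard deviation of the change in the posterior mean from one measurement
    sigma_tilde = sigma2 / np.sqrt(sigma2 + noise_var)

    kg = np.empty_like(mu)
    for x in range(len(mu)):
        # Best competing alternative if choice x is not measured
        best_other = np.max(np.delete(mu, x))
        z = -abs(mu[x] - best_other) / sigma_tilde[x]
        # f(z) = z * Phi(z) + phi(z): expected improvement in the best estimate
        kg[x] = sigma_tilde[x] * (z * norm.cdf(z) + norm.pdf(z))
    return kg

# Example: pick the next measurement, then implement the best posterior mean at the end
mu, sigma2 = [1.0, 1.2, 0.8], [0.5, 0.5, 2.0]
next_choice = int(np.argmax(knowledge_gradient(mu, sigma2, noise_var=1.0)))
```

The index balances exactly the two terms named above: sigma_tilde captures how much a measurement reduces uncertainty, and the f(z) factor captures whether that reduction could change which choice looks best.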
Transitions

Managing locomotives at Norfolk Southern Railroad: using AFOSR-sponsored research, we modified the ADP algorithm to use nested nonlinear value function approximations, with a dramatic improvement in the rate of convergence and the stability of the solution.
Planning driver operations at Schneider National: the ADP algorithm plans the movements of 7,000 drivers with complex operational strategies. Used by Schneider full time to plan driver management policies.
Fleet simulations for Netjets: captures pilots, aircraft and requirements (customers), with full work rules and operational constraints. After three years, the model was judged to be calibrated in 2006; the first study was used to evaluate the effect of aircraft reliability.
Mid-air refueling simulator for AMC: optimized scheduling of tankers under uncertainty, reducing the required number of tankers by 20 percent over current planning methods.

Notes: CASTLE Lab attracts funding from a number of industrial sponsors who provide specific problem settings in which to test our ideas. Above are three companies (Norfolk Southern, one of the four largest railroads in the U.S.; Schneider National, the largest truckload motor carrier; and Netjets, the largest operator of business jets) plus the Air Mobility Command (which gave us a nice problem for planning mid-air refueling). These projects often produce new challenges for us, but they serve primarily as an opportunity to test the theory developed under AFOSR funding. They demonstrate that our ADP methodology is scalable and can handle a wide range of issues that arise in real applications.