Presentation transcript:

INFORMS Annual Meeting Austin 1
Learning in Approximate Dynamic Programming for Managing a Multi-Attribute Driver
Martijn Mes, Department of Operational Methods for Production and Logistics, University of Twente, The Netherlands
Sunday, November 7, 2010, INFORMS Annual Meeting Austin

INFORMS Annual Meeting Austin 2/40 OUTLINE
1. Illustration: a transportation application
2. Stylized illustration: the Nomadic Trucker Problem
3. Approximate Dynamic Programming (ADP)
4. Challenges with ADP
5. Optimal Learning
6. Optimal Learning in ADP
7. Challenges with Optimal Learning in ADP
8. Sketch of our solution concept

INFORMS Annual Meeting Austin 3/40 TRANSPORTATION APPLICATION
- Heisterkamp
- Trailer trucking: providing trucks and drivers
- Planning department: accept orders, assign orders to trucks, assign drivers to trucks
- Types of orders:
  - Direct order: move a trailer from A to B; the client pays depending on the distance between A and B, but the trailer might pass through hubs to change the truck and/or driver
  - Customer guidance order: rent a truck and driver to a client for some time period

INFORMS Annual Meeting Austin 4/40 REAL APPLICATION  Heisterkamp

INFORMS Annual Meeting Austin 5/40 CHARACTERISTICS
- The drivers are bound by EU drivers' hours regulations
- However, given a sufficient supply of orders and drivers, trucks can in principle be utilized 24/7 by switching drivers
- Even though we can replace a driver (to increase truck utilization), we might still face costs for the old driver
- Objective: increase profits by 'clever' order acceptance and by minimizing the costs of drivers, trucks, and empty moves (i.e., moves without a trailer)
- We solve a dynamic assignment problem, given the state of all trucks and the (probabilistically) known orders, at specific time instances over a fixed horizon
- This problem is known as the Dynamic Fleet Management Problem (DFMP). For illustrative purposes we now focus on the single-vehicle version of the DFMP.

INFORMS Annual Meeting Austin 6/40 THE NOMADIC TRUCKER PROBLEM
- A single trucker moves from city to city, either with a load or empty
- Rewards are earned when moving loads; otherwise costs are incurred
- A vector of attributes a_t describes the single resource, with A the set of possible attribute vectors; the attributes cover the truck, the driver, and dynamic attributes

INFORMS Annual Meeting Austin 7/40 MODELING THE DYNAMICS
- State S_t = (R_t, D_t), where
  - R_t = (R_{ta}), a ∈ A, with R_{ta} = 1 when the truck has attribute vector a (in the DFMP, R_{ta} gives the number of resources at time t with attribute vector a)
  - D_t = (D_{tl}), l ∈ L, with D_{tl} the number of loads of type l
- Decision x_t: make a loaded move, wait at the current location, or move empty to another location; x_t follows from a decision function X^π_t(S_t), where π ∈ Π is a member of a family of policies
- Exogenous information W_{t+1}: information arriving between t and t+1, such as new loads, wear of the truck, occurrence of breakdowns, etc.
- Choosing decision x_t in the current state S_t, together with the exogenous information W_{t+1}, results in a transition S_{t+1} = S^M(S_t, x_t, W_{t+1}) with contribution (payment or cost) C_t(S_t, x_t)
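As a concrete illustration of this state representation, the sketch below encodes a pre-decision state and the transition in Python. It is a minimal sketch, assuming a single truck and an attribute vector (location, arrival time, domicile) as in the later slide example; the field names, load types, and the helper apply_decision are illustrative assumptions, not part of the original model.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical attribute vector a = (location, arrival_time, domicile).
Attribute = Tuple[str, int, str]

@dataclass
class State:
    R: Dict[Attribute, int]  # R_ta: 1 if the (single) truck has attribute vector a
    D: Dict[str, int]        # D_tl: number of available loads of type l

def apply_decision(R: Dict[Attribute, int], x: str, travel_time: int = 1) -> Dict[Attribute, int]:
    """Move the truck to location x, keeping its domicile and advancing its arrival time."""
    (_, arr, dom), = [a for a, r in R.items() if r == 1]
    return {(x, arr + travel_time, dom): 1}

def transition(S: State, x: str, W_new_loads: Dict[str, int]) -> State:
    """S_{t+1} = S^M(S_t, x_t, W_{t+1}): apply the decision, then merge the new loads."""
    D_next = dict(S.D)
    for load_type, count in W_new_loads.items():
        D_next[load_type] = D_next.get(load_type, 0) + count
    return State(apply_decision(S.R, x), D_next)

# Example: a truck in Amsterdam moves to Berlin while two new loads arrive.
S0 = State({("Amsterdam", 0, "Utrecht"): 1}, {"Amsterdam->Berlin": 1})
S1 = transition(S0, "Berlin", {"Berlin->Paris": 2})
print(S1.R, S1.D)
```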

INFORMS Annual Meeting Austin 8/40 OBJECTIVE
- The objective is to find the policy π ∈ Π that maximizes the expected sum of discounted contributions over all time periods
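The objective appears as an equation image in the original deck; in the notation introduced above it can be written as follows (with discount factor γ assumed):

\[
\max_{\pi \in \Pi} \; \mathbb{E} \left\{ \sum_{t=0}^{T} \gamma^{t} \, C_t\bigl(S_t, X^{\pi}_t(S_t)\bigr) \right\}
\]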

INFORMS Annual Meeting Austin 9/40 SOLVING THE PROBLEM
- Optimality equation (expectation form of Bellman's equation)
- Enumerating by backward induction?
  - Suppose a = (location, arrival time, domicile) and we discretize to 500 locations and 50 possible arrival times → |A| = 500 x 50 x 500 = 12,500,000
  - In the backward loop we not only have to visit all states, but we also have to evaluate all actions and, to compute the expectation, we probably also have to evaluate all possible outcomes
- Backward dynamic programming might become intractable → Approximate Dynamic Programming
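The optimality equation referred to here is not rendered in the transcript; in its expectation form it reads (a reconstruction in the notation of slide 7):

\[
V_t(S_t) = \max_{x_t \in \mathcal{X}_t} \Bigl( C_t(S_t, x_t) + \gamma \, \mathbb{E}\bigl\{ V_{t+1}(S_{t+1}) \,\big|\, S_t, x_t \bigr\} \Bigr),
\]

with the expectation taken over the exogenous information W_{t+1}.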

INFORMS Annual Meeting Austin 10/40 APPROXIMATE DYNAMIC PROGRAMMING
- We replace the original optimality equation with the following (the replaced equations on slides 10-14 are reconstructed after slide 14)

INFORMS Annual Meeting Austin 11/40 APPROXIMATE DYNAMIC PROGRAMMING
- We replace the original optimality equation with the following
- (1) Using a value function approximation: this allows us to step forward in time

INFORMS Annual Meeting Austin 12/40 APPROXIMATE DYNAMIC PROGRAMMING
- We replace the original optimality equation with the following
- (2) Using the post-decision state variable, which is a deterministic function (of the current state and decision)

INFORMS Annual Meeting Austin 13/40 APPROXIMATE DYNAMIC PROGRAMMING
- We replace the original optimality equation with the following
- (3) Generating sample paths

INFORMS Annual Meeting Austin 14/40 APPROXIMATE DYNAMIC PROGRAMMING
- We replace the original optimality equation with the following
- (4) Learning through iterations
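The equations on slides 10-14 are images in the transcript. A standard way to write the replacement they describe (value function approximation, post-decision state, sample paths, iterative learning) is the pair below; this follows the usual ADP notation and is a reconstruction, not a verbatim copy of the slides:

\[
\hat{v}^{\,n}_t = \max_{x_t \in \mathcal{X}_t} \Bigl( C_t(S^n_t, x_t) + \gamma \, \bar{V}^{\,n-1}_t\bigl(S^{x}(S^n_t, x_t)\bigr) \Bigr),
\]
\[
\bar{V}^{\,n}_{t-1}\bigl(S^{x,n}_{t-1}\bigr) = (1 - \alpha_{n-1}) \, \bar{V}^{\,n-1}_{t-1}\bigl(S^{x,n}_{t-1}\bigr) + \alpha_{n-1} \, \hat{v}^{\,n}_t,
\]

where S^x(S_t, x_t) is the (deterministic) post-decision state, S^{x,n}_{t-1} is the post-decision state visited at time t-1 in iteration n, n indexes the sample path, and α_{n-1} is a stepsize.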

INFORMS Annual Meeting Austin 15/40 OUTLINE OF THE ADP ALGORITHM
(Figure: the ADP algorithm, annotated with its three building blocks: deterministic optimization, simulation, and statistics.)
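As a minimal sketch of the algorithm outlined on this slide, assuming a lookup-table value function approximation over post-decision states and a generic problem interface (the method names initial_state, actions, contribution, post_decision, sample_exogenous, and next_state are illustrative assumptions, not from the original slides):

```python
from collections import defaultdict

def adp(problem, T, N, gamma=0.9, alpha=0.1):
    """Generic forward-pass ADP with post-decision states and a lookup-table approximation."""
    V = defaultdict(float)                      # V[post_decision_state] ~ value estimate
    for n in range(N):                          # learning through iterations (sample paths)
        S = problem.initial_state()
        prev_post = None
        for t in range(T):
            # Deterministic optimization: pick the action maximizing contribution + approximate value.
            def q(x):
                return problem.contribution(S, x) + gamma * V[problem.post_decision(S, x)]
            x = max(problem.actions(S), key=q)
            v_hat = q(x)
            # Statistics: update the estimate of the previously visited post-decision state.
            if prev_post is not None:
                V[prev_post] = (1 - alpha) * V[prev_post] + alpha * v_hat
            prev_post = problem.post_decision(S, x)
            # Simulation: sample exogenous information and step forward in time.
            W = problem.sample_exogenous(S, x)
            S = problem.next_state(S, x, W)
    return V
```

A problem object supplying those six methods (for the Nomadic Trucker: locations, loads, rewards, and travel outcomes) would make this runnable end to end.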

INFORMS Annual Meeting Austin 16/40 CHALLENGES WITH ADP
- Exploration vs. exploitation:
  - Exploitation: we do what we currently think is best
  - Exploration: we choose to try something else and learn more (information collection)
- To avoid getting stuck in a local optimum, we have to explore. But what do we want to explore, and for how long? Do we need to explore the whole state space?
- Do we update the value functions using the results of the exploration steps, or do we want to perform off-policy control?
- Techniques from Optimal Learning might help here

INFORMS Annual Meeting Austin 17/40 OPTIMAL LEARNING
- To cope with the exploration vs. exploitation dilemma
- Undirected exploration:
  - Try to randomly explore the whole state space
  - Examples: pure exploration and epsilon-greedy (explore with probability ε_n and exploit with probability 1 - ε_n)
- Directed exploration:
  - Utilize past experience to explore efficiently (costs are gradually avoided by making more expensive actions less likely)
  - Examples of directed exploration (the selection rules are written out after this slide):
    - Boltzmann exploration: choose x according to a softmax over the estimated values
    - Interval estimation: choose the x that maximizes an upper confidence bound on its value
    - The knowledge gradient policy (see the next slides)
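The selection rules for the two directed exploration examples appear as images in the transcript; standard forms are given below (a reconstruction, with temperature T^n and confidence parameter z_α as assumed symbols, and θ̄ⁿ_x, σ̄ⁿ_x the estimated mean and standard deviation of the value of alternative x):

\[
\mathbb{P}\{x^n = x\} = \frac{\exp\bigl(\bar{\theta}^{\,n}_x / T^n\bigr)}{\sum_{x'} \exp\bigl(\bar{\theta}^{\,n}_{x'} / T^n\bigr)}
\qquad \text{(Boltzmann exploration)}
\]
\[
x^n = \arg\max_x \Bigl( \bar{\theta}^{\,n}_x + z_{\alpha} \, \bar{\sigma}^{\,n}_x \Bigr)
\qquad \text{(interval estimation)}
\]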

INFORMS Annual Meeting Austin 18/40 THE KNOWLEDGE GRADIENT POLICY [1/2]
- Basic principle:
  - Assume you can make only one measurement, after which you have to make a final choice (the implementation decision)
  - What choice would you make now to maximize the expected value of the implementation decision?
(Figure: an observation of option 5 produces an updated estimate of its value; only a change in estimated value large enough to change the implementation decision creates value from the measurement.)

INFORMS Annual Meeting Austin 19/40 THE KNOWLEDGE GRADIENT POLICY [2/2]
- The knowledge gradient is the expected marginal value of a single measurement x
- The knowledge gradient policy measures the alternative with the largest knowledge gradient (the defining formulas follow this slide)
- There are many problems where making one measurement tells us something about what we might observe from other measurements (e.g., in our transportation application, nearby locations have similar properties)
- Correlations are particularly important when the number of possible measurements is extremely large relative to the measurement budget (or when the functions are continuous)
- There are various extensions of the knowledge gradient policy that take such similarities between alternatives into account → the Hierarchical Knowledge Gradient policy
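The defining equations appear as images on the slide; in standard knowledge gradient notation (θ̄ⁿ_x for the estimated value of alternative x after n measurements and Sⁿ for the knowledge state, both reconstructed rather than copied from the slide) they are:

\[
\nu^{KG,n}_x = \mathbb{E}\Bigl[ \max_{x'} \bar{\theta}^{\,n+1}_{x'} - \max_{x'} \bar{\theta}^{\,n}_{x'} \,\Big|\, S^n, \, x^n = x \Bigr],
\qquad
X^{KG}(S^n) = \arg\max_x \nu^{KG,n}_x .
\]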

INFORMS Annual Meeting Austin 20/40 HIERARCHICAL KNOWLEDGE GRADIENT (HKG)
- Idea: instead of having a belief on the true value θ_x of each alternative x (a Bayesian prior with a mean and a precision), we have a belief on the value of each alternative at various levels of aggregation (with a mean and a precision at each level)
- Using aggregation, we express our estimate of θ_x as a weighted combination of the estimates at the different aggregation levels
- Intuition: the highest weight goes to the levels with the lowest sum of variance and bias; see [1] and [2] for details.
[1] M.R.K. Mes, W.B. Powell, and P.I. Frazier (2010). Hierarchical Knowledge Gradient for Sequential Sampling.
[2] A. George, W.B. Powell, and S.R. Kulkarni (2008). Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management.
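The weighted combination referred to above can be written as follows (a reconstruction along the lines of [1] and [2]; w^{g,n}_x is the weight given to aggregation level g for alternative x after n measurements, σ̄^{g,n}_x the standard deviation of the estimate at level g, and δ̄^{g,n}_x its estimated bias relative to the disaggregate level):

\[
\bar{\theta}^{\,n}_x = \sum_{g} w^{g,n}_x \, \bar{\theta}^{\,g,n}_x,
\qquad
w^{g,n}_x \propto \Bigl( \bigl(\bar{\sigma}^{\,g,n}_x\bigr)^2 + \bigl(\bar{\delta}^{\,g,n}_x\bigr)^2 \Bigr)^{-1}.
\]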

INFORMS Annual Meeting Austin 21/40 STATISTICAL AGGREGATION
- Example of an aggregation structure for the Nomadic Trucker Problem (* = attribute included at this level, - = attribute excluded at this level; we need this for each time unit):

  Level | Location | Driver type | Day of week | Size of state space
  0     | City     | *           | *           | 500 x 10 x 7 = 35,000
  1     | Region   | *           | *           | 50 x 10 x 7 = 3,500
  2     | Region   | -           | *           | 50 x 1 x 7 = 350
  3     | Region   | -           | -           | 50 x 1 x 1 = 50
  4     | Province | -           | -           | 10 x 1 x 1 = 10
  5     | Country  | -           | -           | 1 x 1 x 1 = 1

- With HKG we would have 38,911 beliefs, and our belief about a single alternative can be expressed as a function of 6 beliefs (1 for each aggregation level).
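The 38,911 figure is simply the sum of the state-space sizes over the six aggregation levels, per time unit:

```python
level_sizes = [500 * 10 * 7, 50 * 10 * 7, 50 * 1 * 7, 50 * 1 * 1, 10 * 1 * 1, 1 * 1 * 1]
assert level_sizes == [35000, 3500, 350, 50, 10, 1]
print(sum(level_sizes))  # 38911 beliefs per time unit
```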

INFORMS Annual Meeting Austin 22/40 ILLUSTRATION OF HKG
- The knowledge gradient policy prefers to measure alternatives with a high mean and/or a low precision:
  - Equal means → measure the alternative with the lowest precision
  - Equal precisions → measure the alternative with the highest mean
- Demo HKG…
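This preference can be checked numerically with the standard closed-form knowledge gradient for independent normal beliefs and known measurement noise; the sketch below is an illustration (the noise_var parameter and the small example are assumptions, not taken from the demo on the slide):

```python
import math

def kg_factor(means, precisions, x, noise_var=1.0):
    """Knowledge gradient of measuring alternative x once, for independent normal beliefs.

    means[i], precisions[i]: prior mean and precision (1/variance) of alternative i.
    """
    var_x = 1.0 / precisions[x]
    # Standard deviation of the change in the posterior mean of x after one measurement.
    sigma_tilde = math.sqrt(var_x - 1.0 / (precisions[x] + 1.0 / noise_var))
    best_other = max(m for i, m in enumerate(means) if i != x)
    zeta = -abs(means[x] - best_other) / sigma_tilde
    phi = math.exp(-0.5 * zeta * zeta) / math.sqrt(2 * math.pi)   # standard normal pdf
    Phi = 0.5 * (1 + math.erf(zeta / math.sqrt(2)))               # standard normal cdf
    return sigma_tilde * (zeta * Phi + phi)

# Equal means: the alternative with the lowest precision has the largest knowledge gradient.
means, precisions = [1.0, 1.0, 1.0], [4.0, 1.0, 9.0]
print(max(range(3), key=lambda x: kg_factor(means, precisions, x)))  # -> 1
```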

INFORMS Annual Meeting Austin 23/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
(Figure: a grid of locations A-D versus times t-1, t, t+1, t+2, used in the following slides.)
- Illustration of learning in ADP
- State S_t = (R_t, D_t), where R_t represents a location, R_t ∈ {A, B, C, D}, and D_t the available loads going out from R_t
- Decision x_t is a location to move to, x_t ∈ {A, B, C, D}
- Exogenous information W_t are the new loads D_t

INFORMS Annual Meeting Austin 24/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- We were in the post-decision state in which we decided to move to location C. After observing the new loads, we are in the pre-decision state.

INFORMS Annual Meeting Austin 25/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
(Equation slide: the location-time grid gains an iteration axis n, n+1; the accompanying value-update equation is shown on the original slide.)

INFORMS Annual Meeting Austin 26/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- So the sampling decision does not necessarily influence the value we just computed; however, it determines the state we update next.

INFORMS Annual Meeting Austin 27/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Using Optimal Learning, we estimate the knowledge gain.

INFORMS Annual Meeting Austin 28/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- We decide to move to location B, resulting in a post-decision state.

INFORMS Annual Meeting Austin 29/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- After observing the new loads, we are in the pre-decision state.

INFORMS Annual Meeting Austin 30/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
(Equation slide: the value-update equation for this step is shown on the original slide.)

INFORMS Annual Meeting Austin 31/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Again, we have to make a sampling decision.

INFORMS Annual Meeting Austin 32/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Again, we estimate the knowledge gain.

INFORMS Annual Meeting Austin 33/40 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- We decide to move to location B, resulting in a post-decision state.

INFORMS Annual Meeting Austin 34/40 CHALLENGES WITH OPTIMAL LEARNING IN ADP
- The impact on the next iteration is hard to compute → we therefore assume a similar resource and demand state in the next iteration and evaluate the impact of an updated knowledge state
- Bias:
  - Decisions have an impact on the value of states in the downstream path (we learn what we measure)
  - Decisions have an impact on the value of states in the upstream path (with on-policy control)
  - The decision to measure a state will change its value, which in turn might influence our decisions in the next iteration: simply measuring states more often might increase their estimated values, which in turn makes them more attractive next time

INFORMS Annual Meeting Austin 35/40 SKETCH OF OUR SOLUTION APPROACH
- To cope with the bias, we propose using so-called projected value functions
- Assumption: an exponential increase (or decrease, if we start with optimistic estimates) in the estimated values as a function of the number of iterations
- Value iteration is known to converge geometrically, see [1]
(Figure: for n > n_0, a projected value function is fitted to the weighted estimates; it is characterized by the output after n_0, a limiting value, and a rate, and hopefully tracks the eventual behaviour.)
[1] M.L. Puterman (1994). Markov Decision Processes. New York: John Wiley & Sons.
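One parameterization consistent with the labels on this slide (output after n_0, limiting value, rate) is the exponential form below; this is an assumed form for illustration, not necessarily the exact one used in the talk:

\[
\bar{V}^{\,\mathrm{proj}}(n) = V^{\infty} - \bigl( V^{\infty} - \bar{V}^{\,n_0} \bigr) \, e^{-\lambda (n - n_0)}, \qquad n > n_0,
\]

where V̄^{n_0} is the (weighted) estimate after n_0 iterations, V^∞ the limiting value, and λ > 0 the rate.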

INFORMS Annual Meeting Austin 36/40 SKETCH OF OUR SOLUTION APPROACH
- Illustration of projected value functions (the plot is shown on the original slide).

INFORMS Annual Meeting Austin 37/40 NEW ADP ALGORITHM
- Step 2b:
  - Update the value function estimates at all levels of aggregation
  - Update the weights and compute the weighted value function estimates, possibly for many states at once
- Step 2c:
  - Combine the updated value function estimates with the prior distributions on the projected value functions to obtain posterior distributions, see [1] for details
  - The new state follows from running HKG using our beliefs on the projected value functions as input
- So we have completely separated the updating step (steps 2a/b) from the exploration step (step 2c); a high-level sketch follows this slide
[1] P.I. Frazier, W.B. Powell, and H.P. Simão (2009). Simulation Model Calibration with Correlated Knowledge-Gradients.
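A minimal sketch of how one time step of this modified algorithm might be organized, assuming the generic problem interface from the earlier ADP sketch and with the helpers update_estimates, combine_with_priors, and hkg_choose passed in as stand-ins for steps 2b and 2c; it illustrates the separation between updating and exploration, not the authors' exact implementation:

```python
def adp_step_with_hkg(problem, S, V_levels, projections,
                      update_estimates, combine_with_priors, hkg_choose, gamma=0.9):
    """One time step: greedy value computation (2a), aggregated updates (2b), HKG exploration (2c)."""
    # Step 2a: observed value of the current state under the greedy decision.
    v_hat = max(problem.contribution(S, x)
                + gamma * V_levels.weighted_estimate(problem.post_decision(S, x))
                for x in problem.actions(S))
    # Step 2b: update the estimates at all aggregation levels and their weights.
    V_levels = update_estimates(V_levels, S, v_hat)
    # Step 2c: combine the updated estimates with the priors on the projected value
    # functions, then let HKG choose the next sampling decision from the resulting beliefs.
    beliefs = combine_with_priors(V_levels, projections)
    x_next = hkg_choose(beliefs, problem.actions(S))
    W = problem.sample_exogenous(S, x_next)
    return problem.next_state(S, x_next, W), V_levels
```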

INFORMS Annual Meeting Austin 38/40 PERFORMANCE IMPRESSION  Experiment on an instance of the Nomadic Trucker Problem

INFORMS Annual Meeting Austin 39/40 SHORTCOMINGS
- Fitting:
  - It is not always possible to find a good fit
  - For example, if the observed values increase slightly faster in the beginning and more slowly after that (compared to the fitted exponential), we still have the bias where sampled states look more attractive than others; after a sufficient number of measurements this will be corrected
- Computation time:
  - We have to spend quite some computation time on making the sampling decision; we could have used this time simply to sample states instead of thinking about which one to sample
- Application area: a large state space (where pure exploration does not make sense) but a small action space

INFORMS Annual Meeting Austin 40/40 CONCLUSIONS
- We illustrated the challenges of ADP using the Nomadic Trucker example
- We illustrated how optimal learning can be helpful here
- We illustrated the difficulty of learning in ADP due to the bias: our estimated values are influenced by the measurement policy, which in turn is influenced by our estimated values
- To cope with this bias we introduced the notion of projected value functions
- This enables us to use the HKG policy to (i) cope with the exploration vs. exploitation dilemma and (ii) allow generalization across states
- We briefly illustrated the potential of this approach, but also mentioned several shortcomings

INFORMS Annual Meeting Austin 41 QUESTIONS?
Martijn Mes
Assistant Professor, University of Twente
School of Management and Governance
Operational Methods for Production and Logistics
The Netherlands
Contact
Phone:
Web: