
1 Learning in Approximate Dynamic Programming for Managing a Multi-Attribute Driver
Martijn Mes
Department of Operational Methods for Production and Logistics
University of Twente, The Netherlands
Sunday, November 7, 2010, INFORMS Annual Meeting, Austin

2 OUTLINE
1. Illustration: a transportation application
2. Stylized illustration: the Nomadic Trucker Problem
3. Approximate Dynamic Programming (ADP)
4. Challenges with ADP
5. Optimal Learning
6. Optimal Learning in ADP
7. Challenges with Optimal Learning in ADP
8. Sketch of our solution concept

3 TRANSPORTATION APPLICATION
- Heisterkamp
- Trailer trucking: providing trucks and drivers
- Planning department:
  - Accept orders
  - Assign orders to trucks
  - Assign drivers to trucks
- Types of orders:
  - Direct order: move a trailer from A to B; the client pays depending on the distance between A and B, but the trailer might go through hubs to change the truck and/or driver
  - Customer guidance order: rent a truck and driver to a client for some time period

4 REAL APPLICATION
- Heisterkamp

5 CHARACTERISTICS
- The drivers are bound by EU drivers' hours regulations
- However, given a sufficient supply of orders and drivers, trucks can in principle be utilized 24/7 by switching drivers
- Even though we can replace a driver (to increase utilization of trucks), we may still face costs for the old driver
- Objective: increase profits through 'clever' order acceptance and minimization of costs for drivers, trucks, and empty moves (i.e., without a trailer)
- We solve a dynamic assignment problem, given the state of all trucks and the (probabilistically) known orders, at specific time instances over a fixed horizon
- This problem is known as the Dynamic Fleet Management Problem (DFMP). For illustrative purposes we now focus on the single-vehicle version of the DFMP.

6 THE NOMADIC TRUCKER PROBLEM
- A single trucker moves from city to city, either with a load or empty
- Moving loads earns rewards; otherwise costs are involved
- A vector of attributes a describes the single resource, with A the set of possible attribute vectors; the attributes cover the truck, the driver, and dynamic attributes

7 MODELING THE DYNAMICS
- State S_t = (R_t, D_t), where
  - R_t = (R_ta)_{a in A}, with R_ta = 1 when the truck has attribute vector a (in the DFMP, R_ta gives the number of resources at time t with attribute vector a)
  - D_t = (D_tl)_{l in L}, with D_tl the number of loads of type l
- Decision x_t: make a loaded move, wait at the current location, or move empty to another location; x_t follows from a decision function X_t^pi(S_t), where pi is a policy from a family of policies Pi
- Exogenous information W_{t+1}: information arriving between t and t+1, such as new loads, wear of the truck, occurrence of breakdowns, etc.
- Choosing decision x_t in the current state S_t, with exogenous information W_{t+1}, results in a transition S_{t+1} = S^M(S_t, x_t, W_{t+1}) with a contribution (payment or costs) C(S_t, x_t)

8 OBJECTIVE
- The objective is to find the policy pi that maximizes the expected sum of discounted contributions over all time periods
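The objective formula on this slide is missing from the transcript; using the notation from slide 7 and assuming a discount factor $\gamma$ and horizon $T$, the standard form would be:

```latex
\max_{\pi \in \Pi} \; \mathbb{E}\left\{ \sum_{t=0}^{T} \gamma^{t}\, C\big(S_t, X_t^{\pi}(S_t)\big) \right\}
```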

9 SOLVING THE PROBLEM
- Optimality equation (expectation form of Bellman's equation)
- Enumerate by backward induction?
- Suppose a = (location, arrival time, domicile) and we discretize to 500 locations and 50 possible arrival times; since the domicile is also a location, |A| = 500 x 50 x 500 = 12,500,000
- In the backward loop we not only have to visit all states, but we also have to evaluate all actions and, to compute the expectation, probably all possible outcomes as well
- Backward dynamic programming may therefore become intractable → Approximate Dynamic Programming
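The expectation form of Bellman's equation referred to above is also lost in the transcript; a reconstruction consistent with the notation used here, again assuming a discount factor $\gamma$, is:

```latex
V_t(S_t) \;=\; \max_{x_t}\Big( C(S_t, x_t) \;+\; \gamma\, \mathbb{E}\big[\, V_{t+1}(S_{t+1}) \mid S_t, x_t \,\big] \Big)
```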

10 APPROXIMATE DYNAMIC PROGRAMMING
- We replace the original optimality equation with an approximate version; the next four slides highlight its ingredients one by one

11 APPROXIMATE DYNAMIC PROGRAMMING
- Ingredient 1: a value function approximation, which allows us to step forward in time

12 APPROXIMATE DYNAMIC PROGRAMMING
- Ingredient 2: the post-decision state variable, reached through a deterministic function of the state and the decision

13 APPROXIMATE DYNAMIC PROGRAMMING
- Ingredient 3: generating sample paths

14 APPROXIMATE DYNAMIC PROGRAMMING
- Ingredient 4: learning through iterations
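The approximate recursion itself appears only as images in the original slides. A standard Powell-style reconstruction combining the four ingredients above (value function approximation $\bar{V}$, deterministic post-decision transition $S^{M,x}$, sample path index $n$, and iterative smoothing with stepsize $\alpha_{n-1}$; this notation is assumed, not taken from the slides) would be:

```latex
\hat{v}_t^{\,n} \;=\; \max_{x_t}\Big( C(S_t^n, x_t) \;+\; \gamma\, \bar{V}_t^{\,n-1}\big(S^{M,x}(S_t^n, x_t)\big) \Big)

\bar{V}_{t-1}^{\,n}\big(S_{t-1}^{x,n}\big) \;=\; (1-\alpha_{n-1})\,\bar{V}_{t-1}^{\,n-1}\big(S_{t-1}^{x,n}\big) \;+\; \alpha_{n-1}\,\hat{v}_t^{\,n}
```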

15 OUTLINE OF THE ADP ALGORITHM
[Figure: algorithm flow combining deterministic optimization, simulation, and statistics; a toy code sketch follows below]
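Since the algorithm outline exists only as a figure here, the sketch below is a minimal, runnable toy rendering of the forward ADP loop described on the previous slides, loosely inspired by the Nomadic Trucker example with locations A–D that appears later in the talk. All names, numbers, and the reward model are invented for illustration and are not the algorithm actually used in the work.

```python
import random

LOCATIONS = ["A", "B", "C", "D"]
DIST = {(i, j): abs(LOCATIONS.index(i) - LOCATIONS.index(j))
        for i in LOCATIONS for j in LOCATIONS}

def sample_loads():
    """Exogenous information W_{t+1}: a random payoff for a load toward each location (toy model)."""
    return {loc: random.choice([0, 10, 20]) for loc in LOCATIONS}

def adp(T=20, N=200, gamma=0.9, alpha=0.1, epsilon=0.1):
    v_bar = {}                                    # VFA keyed by post-decision state (location, t)
    for n in range(N):                            # ingredient 4: learning through iterations
        loc, loads = "A", sample_loads()          # ingredient 3: generate a sample path
        for t in range(T):
            def value(dest):
                reward = loads[dest] - DIST[(loc, dest)]       # contribution of the move
                post = (dest, t)                               # ingredient 2: deterministic post-decision state
                return reward + gamma * v_bar.get(post, 0.0)   # ingredient 1: value function approximation
            # Deterministic optimization, with simple epsilon-greedy exploration
            # (exploration is the topic of the Optimal Learning slides).
            if random.random() < epsilon:
                dest = random.choice(LOCATIONS)
            else:
                dest = max(LOCATIONS, key=value)
            v_hat = value(dest)
            # Statistics: smooth the observation into the previous post-decision state.
            prev_post = (loc, t - 1)
            v_bar[prev_post] = (1 - alpha) * v_bar.get(prev_post, 0.0) + alpha * v_hat
            # Simulation: step forward in time and observe new loads.
            loc, loads = dest, sample_loads()
    return v_bar

if __name__ == "__main__":
    estimates = adp()
    print({k: round(v, 1) for k, v in sorted(estimates.items())[:5]})
```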

16 CHALLENGES WITH ADP
- Exploration vs. exploitation:
  - Exploitation: we do what we currently think is best
  - Exploration: we choose to try something in order to learn more (information collection)
- To avoid getting stuck in a local optimum, we have to explore. But what do we want to explore, and for how long? Do we need to explore the whole state space?
- Do we update the value functions using the results of the exploration steps, or do we want to perform off-policy control?
- Techniques from Optimal Learning might help here

17 OPTIMAL LEARNING
- To cope with the exploration vs. exploitation dilemma
- Undirected exploration:
  - Try to randomly explore the whole state space
  - Examples: pure exploration and epsilon-greedy (explore with probability ε_n and exploit with probability 1 − ε_n)
- Directed exploration:
  - Utilize past experience to explore efficiently (costs are gradually avoided by making more expensive actions less likely)
  - Examples of directed exploration (see the code sketch after this slide):
    - Boltzmann exploration: choose x randomly, with probabilities that increase exponentially in the estimated value
    - Interval estimation: choose the x that maximizes the estimated value plus a multiple of its standard deviation
    - The knowledge gradient policy (see the next slides)
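The selection formulas for Boltzmann exploration and interval estimation appear only as images in the original slides; the sketch below gives textbook versions of these rules together with epsilon-greedy. The temperature `tau`, the confidence factor `z`, and the example numbers are illustrative assumptions, not values from the talk.

```python
import math
import random

def epsilon_greedy(estimates, epsilon):
    """Undirected: explore with probability epsilon, otherwise exploit the best estimate."""
    if random.random() < epsilon:
        return random.choice(list(estimates))
    return max(estimates, key=estimates.get)

def boltzmann(estimates, tau=1.0):
    """Directed: sample x with probability proportional to exp(estimate / tau)."""
    xs = list(estimates)
    weights = [math.exp(estimates[x] / tau) for x in xs]
    return random.choices(xs, weights=weights, k=1)[0]

def interval_estimation(estimates, std_devs, z=2.0):
    """Directed: pick the x with the highest upper confidence value."""
    return max(estimates, key=lambda x: estimates[x] + z * std_devs[x])

# Example usage with made-up estimates for four alternatives:
est = {"A": 5.0, "B": 7.5, "C": 6.0, "D": 4.0}
std = {"A": 2.0, "B": 0.5, "C": 3.0, "D": 1.0}
print(epsilon_greedy(est, 0.1), boltzmann(est), interval_estimation(est, std))
```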

18 THE KNOWLEDGE GRADIENT POLICY [1/2]
- Basic principle:
  - Assume you can make only one measurement, after which you have to make a final choice (the implementation decision)
  - What choice would you make now to maximize the expected value of the implementation decision?
[Figure: five options (1–5); labels: an observation of option 5, the updated estimate of the value of option 5, the change in the estimated value of option 5 due to measuring it, and the change that produces a change in the decision]

19 THE KNOWLEDGE GRADIENT POLICY [2/2]
- The knowledge gradient is the expected marginal value of a single measurement x
- The knowledge gradient policy chooses the measurement with the largest knowledge gradient
- There are many problems where making one measurement tells us something about what we might observe from other measurements (e.g., in our transportation application, nearby locations have similar properties)
- Correlations are particularly important when the number of possible measurements is extremely large relative to the measurement budget (or when the function is continuous)
- There are various extensions of the Knowledge Gradient policy that take similarities between alternatives into account → Hierarchical Knowledge Gradient policy
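The knowledge gradient formulas on this slide were images; the standard definitions (a reconstruction, with $\theta_x^{\,n}$ the estimate of alternative $x$ after $n$ measurements and $S^n$ the knowledge state; this notation is assumed) are:

```latex
\nu_x^{KG,n} \;=\; \mathbb{E}\Big[\, \max_{x'} \theta_{x'}^{\,n+1} \;-\; \max_{x'} \theta_{x'}^{\,n} \;\Big|\; S^n,\; x^n = x \,\Big],
\qquad
X^{KG}(S^n) \;=\; \arg\max_{x}\, \nu_x^{KG,n}
```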

20 HIERARCHICAL KNOWLEDGE GRADIENT (HKG)
- Idea: instead of having a belief on the true value θ_x of each alternative x (a Bayesian prior with a mean and a precision), we have a belief on the value of each alternative at various levels of aggregation
- Using aggregation, we express our estimate of θ_x as a weighted combination of the estimates at the different aggregation levels
- Intuition: the highest weight goes to the levels with the lowest sum of variance and bias; see [1] and [2] for details.
[1] M.R.K. Mes, W.B. Powell, and P.I. Frazier (2010). Hierarchical Knowledge Gradient for Sequential Sampling.
[2] A. George, W.B. Powell, and S.R. Kulkarni (2008). Value Function Approximation using Multiple Aggregation for Multiattribute Resource Management.
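The weighted combination itself did not survive the transcript; along the lines of [1] and [2], with $\bar{\theta}_x^{\,g}$ the estimate of alternative $x$ at aggregation level $g$ and weights $w_x^{\,g}$ normalized to sum to one, it takes a form like:

```latex
\bar{\theta}_x \;=\; \sum_{g} w_x^{\,g}\, \bar{\theta}_x^{\,g},
\qquad
w_x^{\,g} \;\propto\; \Big( \big(\bar{\sigma}_x^{\,g}\big)^2 + \big(\bar{\mu}_x^{\,g}\big)^2 \Big)^{-1}
```

where $(\bar{\sigma}_x^{\,g})^2$ is the variance of the level-$g$ estimate and $\bar{\mu}_x^{\,g}$ its estimated bias, so that levels with a low sum of variance and (squared) bias receive the highest weight.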

21 STATISTICAL AGGREGATION
- Example of an aggregation structure for the Nomadic Trucker Problem (* = included at this level, - = excluded at this level; we need this for each time unit):

Level | Location | Driver type | Day of week | Size of state space
0     | City     | *           | *           | 500 x 10 x 7 = 35,000
1     | Region   | *           | *           | 50 x 10 x 7 = 3,500
2     | Region   | -           | *           | 50 x 1 x 7 = 350
3     | Region   | -           | -           | 50 x 1 x 1 = 50
4     | Province | -           | -           | 10 x 1 x 1 = 10
5     | Country  | -           | -           | 1 x 1 x 1 = 1

- With HKG we would have 38,911 beliefs, and our belief about a single alternative can be expressed as a function of 6 beliefs (one for each aggregation level).
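As a small illustration of the table above, the sketch below maps one attribute vector to its key at each aggregation level; the mapping dictionaries `city_to_region` and `region_to_province` and the example names are hypothetical and not part of the talk.

```python
def aggregate(city, driver_type, day, city_to_region, region_to_province):
    """Return the aggregation key of one alternative at each of the 6 levels."""
    region = city_to_region[city]          # 500 cities -> 50 regions
    province = region_to_province[region]  # 50 regions -> 10 provinces
    return [
        (0, city, driver_type, day),    # level 0: 500 x 10 x 7
        (1, region, driver_type, day),  # level 1: 50 x 10 x 7
        (2, region, day),               # level 2: driver type dropped
        (3, region),                    # level 3: region only
        (4, province),                  # level 4: province only
        (5, "NL"),                      # level 5: whole country
    ]

# Hypothetical usage: one belief per key, so an alternative's estimate is a
# weighted combination of the 6 beliefs returned here.
keys = aggregate("Enschede", "long_haul", "Mon",
                 {"Enschede": "Twente"}, {"Twente": "Overijssel"})
print(keys)
```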

22 ILLUSTRATION OF HKG
- The knowledge gradient policy prefers to measure alternatives with a high mean and/or a low precision:
  - Equal means → measure the alternative with the lowest precision
  - Equal precisions → measure the alternative with the highest mean
- Demo HKG…

23 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Illustration of learning in ADP
- State S_t = (R_t, D_t), where R_t represents a location, R_t in {A, B, C, D}, and D_t the available loads going out from R_t
- Decision x_t is a location to move to, x_t in {A, B, C, D}
- Exogenous information W_t: the new loads D_t
[Figure: time–location grid with locations A–D on the vertical axis and times t−1, t, t+1, t+2 on the horizontal axis]

24 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- We were in the post-decision state in which we had decided to move to location C. After observing the new loads, we are in the pre-decision state.

25 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
[Figure: the same time–location grid, now also showing iterations n and n+1]
- The value observed in this pre-decision state is used to update the value estimate of the previous post-decision state.

26 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- So the sampling decision does not necessarily influence the observed value; however, it determines the state we update next.

27 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Using Optimal Learning, we estimate the knowledge gain.

28 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- We decide to move to location B, resulting in a post-decision state.

29 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- After observing the new loads, we are in the pre-decision state.

30 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Again, the value observed in this pre-decision state is used to update the value estimate of the previous post-decision state.

31 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Again, we have to make a sampling decision.

32 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- Again, we estimate the knowledge gain.

33 COMBINING OPTIMAL LEARNING AND ADP DECISIONS
- We decide to move to location B, resulting in a post-decision state.

34 CHALLENGES WITH OPTIMAL LEARNING IN ADP
- The impact on the next iteration is hard to compute → we therefore assume a similar resource and demand state in the next iteration and evaluate the impact of an updated knowledge state
- Bias:
  - Decisions have an impact on the value of states in the downstream path (we learn what we measure)
  - Decisions have an impact on the value of states in the upstream path (with on-policy control)
  - The decision to measure a state will change its value, which in turn might influence our decisions in the next iteration: simply measuring states more often might increase their estimated values, which in turn makes them more attractive next time

35 SKETCH OF OUR SOLUTION APPROACH
- To cope with the bias, we propose using so-called projected value functions
- Assumption: an exponential increase (or decrease, if we started with optimistic estimates) in the estimated values as a function of the number of iterations, approaching a limiting value at some rate for n > n_0, where the weighted estimates observed up to n_0 serve as input
- Value iteration is known to converge geometrically, see [1]
[1] M.L. Puterman (1994). Markov Decision Processes. New York: John Wiley & Sons.
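The projected-value-function formula on this slide did not survive the transcript; one way to write the assumed exponential form, consistent with the annotations on the slide (limiting value, rate, weighted estimates observed up to $n_0$; the symbols below are assumed, not taken from the talk), would be:

```latex
\tilde{V}^{\,n}(s) \;=\; V^{\infty}(s) \;-\; \big( V^{\infty}(s) - \bar{V}^{\,n_0}(s) \big)\, e^{-\lambda (n - n_0)}, \qquad n > n_0
```

with $V^{\infty}(s)$ the limiting value, $\lambda$ the rate, and $\bar{V}^{\,n_0}(s)$ the weighted estimate after $n_0$ iterations.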

36 SKETCH OF OUR SOLUTION APPROACH
- Illustration of projected value functions

37 NEW ADP ALGORITHM
- Step 2b:
  - Update the value function estimates at all levels of aggregation
  - Update the weights and compute the weighted value function estimates, possibly for many states at once
- Step 2c:
  - Combine the updated value function estimates with the prior distributions on the projected value functions to obtain posterior distributions; see [1] for details
  - The new state follows from running HKG using our beliefs on the projected value functions as input
- So we have completely separated the updating step (steps 2a/2b) from the exploration step (step 2c)
[1] P.I. Frazier, W.B. Powell, and H.P. Simão (2009). Simulation Model Calibration with Correlated Knowledge-Gradients.
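As a rough illustration of the "combine with a prior to obtain a posterior" idea in step 2c, here is a minimal sketch of a conjugate normal update, treating the new weighted estimate as a noisy observation of the projected value. This only illustrates the general idea; the actual combination with correlated knowledge gradients is described in [1], and all numbers below are made up.

```python
def update_belief(prior_mean, prior_precision, observation, obs_precision):
    """Conjugate normal update: precisions add, means are precision-weighted."""
    post_precision = prior_precision + obs_precision
    post_mean = (prior_precision * prior_mean + obs_precision * observation) / post_precision
    return post_mean, post_precision

# Example: prior belief N(mean=10, precision=0.5); a new weighted value
# estimate of 14 observed with precision 1.0.
print(update_belief(10.0, 0.5, 14.0, 1.0))   # -> (12.666..., 1.5)
```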

38 PERFORMANCE IMPRESSION
- Experiment on an instance of the Nomadic Trucker Problem

39 SHORTCOMINGS
- Fitting:
  - It is not always possible to find a good fit
  - For example, if the observed values increase slightly faster in the beginning and more slowly after that (compared to the fitted exponential), we still have the bias in which sampled states look more attractive than others; after a sufficient number of measurements this is corrected
- Computation time:
  - We have to spend quite some computation time on making the sampling decision; we could have used this time simply to sample states instead of deliberating about which one to sample
- Application area: a large state space (where pure exploration makes no sense) combined with a small action space

40 CONCLUSIONS
- We illustrated the challenges of ADP using the Nomadic Trucker example
- We illustrated how optimal learning can help here
- We illustrated the difficulty of learning in ADP due to the bias: our estimated values are influenced by the measurement policy, which in turn is influenced by our estimated values
- To cope with this bias we introduced the notion of projected value functions
- This enables us to use the HKG policy to
  - cope with the exploration vs. exploitation dilemma
  - allow generalization across states
- We briefly illustrated the potential of this approach, but also mentioned several shortcomings

41 QUESTIONS?
Martijn Mes
Assistant Professor, University of Twente
School of Management and Governance
Operational Methods for Production and Logistics
The Netherlands

Contact:
Phone: +31-534894062
Email: m.r.k.mes@utwente.nl
Web: http://mb.utwente.nl/ompl/staff/Mes/

