1
A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles
Scott Proper, Prasad Tadepalli, Hong Tang, Rasaratnam Logendran
Oregon State University
2
Vehicle Routing & Product Delivery
3
Contributions of our Research
Multiple-vehicle product delivery is a well-studied problem in operations research
We have formulated this problem as an average-reward reinforcement learning (RL) problem
We have combined inventory control with vehicle routing
We have scaled RL methods to work with large state spaces
4
Markov Decision Processes
Actions are stochastic: taking action a in state i leads to state j with probability P_{i,j}(a)
Actions have costs or rewards: r_i(a)
[Figure labels: Action a, Move, Unload]
5
Average Reward Reinforcement Learning
Goal: maximize average reward per time step
– Minimize stockout penalty + movement penalty
Policy: states → actions
Value function: states → real values
– Expected long-term reward from a state, relative to other states, when following the optimal policy
6
H-Learning
The value function satisfies the Bellman equation
The optimal action a* maximizes the immediate reward + expected value of the next state
H-Learning is a real-time algorithm for solving for the value function
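The equation on this slide did not survive in the transcript. As a sketch, the average-reward Bellman equation usually given for H-learning can be written in the notation of slide 4 as follows. The symbols h (the relative value function) and ρ (the average reward per time step of the optimal policy) are names introduced here, not taken from the slides.

```latex
h(i) = \max_{a} \Big\{ r_i(a) + \sum_{j} P_{i,j}(a)\, h(j) \Big\} - \rho
\qquad
a^{*}(i) = \arg\max_{a} \Big\{ r_i(a) + \sum_{j} P_{i,j}(a)\, h(j) \Big\}
```

The second expression restates the slide's description of a*: the action that maximizes immediate reward plus expected value of the next state.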
7
H-Learning: an example (1)
[Figure: graph over states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/9; 0, 0/9]
Value table: A = 0, B = 0, C = 0, D = 0, E = 0
8
H-Learning: an example (2)
Stockout penalty: -20
[Figure: graph over states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/10; -20, 1/10]
Value table: A = -.1, B = 0, C = 0, D = 0, E = 0
9
H-Learning: an example (3)
[Figure: graph over states A, B, C, D, E; edge labels: -.1, 1/1; 0, 9/10; -20, 1/10]
Value table: A = -.1, B = 0, C = 0, D = 0, E = 0
10
H-Learning: an example (4)
Move penalty: -.1
[Figure: graph over states A, B, C, D, E; edge labels: -.1, 2/2; 0, 9/10; -20, 1/10]
Value table: A = -.1, B = 0, C = 0, D = 0, E = 0
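The example slides track per-action rewards, transition counts, and a value table. A minimal tabular sketch of an H-learning-style update, following the algorithm as it is usually described in the literature, might look like the code below. The class and method names are hypothetical, unvisited actions are assumed to back up to 0, and the sketch is not meant to reproduce the exact numbers in the value tables above.

```python
from collections import defaultdict


class HLearner:
    """Tabular H-learning sketch: learns action models and relative values h."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.h = defaultdict(float)       # h(i): relative value of state i
        self.visits = defaultdict(int)    # N(i, a): times a was taken in i
        self.counts = defaultdict(lambda: defaultdict(int))  # N(i, a, j)
        self.reward = defaultdict(float)  # running-mean estimate of r_i(a)
        self.rho = 0.0                    # estimated average reward per step
        self.alpha = 1.0                  # step size used to update rho

    def backup(self, i, a):
        """Estimated r_i(a) + sum_j P_ij(a) h(j) under the learned model."""
        n = self.visits[(i, a)]
        if n == 0:
            return 0.0  # assumption: untried actions back up to 0
        expected_h = sum(c / n * self.h[j]
                         for j, c in self.counts[(i, a)].items())
        return self.reward[(i, a)] + expected_h

    def greedy_actions(self, i):
        best = max(self.backup(i, a) for a in self.actions)
        return [a for a in self.actions if self.backup(i, a) == best]

    def update(self, i, a, r, j):
        """Process one observed transition: state i, action a, reward r, next state j."""
        was_greedy = a in self.greedy_actions(i)  # judged before the model changes
        # Update the action model from counts, as in the "1/10"-style labels.
        self.visits[(i, a)] += 1
        self.counts[(i, a)][j] += 1
        n = self.visits[(i, a)]
        self.reward[(i, a)] += (r - self.reward[(i, a)]) / n
        # Only greedy steps contribute to the average-reward estimate.
        if was_greedy:
            sample = self.reward[(i, a)] - self.h[i] + self.h[j]
            self.rho = (1 - self.alpha) * self.rho + self.alpha * sample
            self.alpha = self.alpha / (self.alpha + 1)
        # Bellman backup for the state just left.
        self.h[i] = max(self.backup(i, u) for u in self.actions) - self.rho
```

A driver loop would call `update` once per time step with the observed transition, choosing mostly greedy actions with some exploration.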
11
On-line Product Delivery
Deliver 1 product
9 truck actions:
– 4 levels of unload
– 4 move directions
– wait
P(Inventory decrease | shop)
Stockout penalty: -20
Movement penalty: -.1
[Figure labels: 5 Shops, Depot]
12
The problem of state-space explosion
The loads of trucks and shop inventories are discretized into 5 levels
States grow exponentially in shops and trucks
– 10 locations, 5 shops, 2 trucks = (10^2)(5^5)(5^2) = 7,812,500 states
– 5 trucks = (10^5)(5^5)(5^5) = 976,562,500,000 states
Table-based methods take too much time and space
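The state counts on this slide can be checked with a few lines of arithmetic. The function name and parameter names below are illustrative only.

```python
def count_states(locations, shops, trucks, levels=5):
    """Number of table entries: truck positions x shop inventories x truck loads."""
    truck_positions = locations ** trucks
    shop_inventories = levels ** shops
    truck_loads = levels ** trucks
    return truck_positions * shop_inventories * truck_loads


two_trucks = count_states(locations=10, shops=5, trucks=2)
five_trucks = count_states(locations=10, shops=5, trucks=5)
print(f"{two_trucks:,} states with 2 trucks")
print(f"{five_trucks:,} states with 5 trucks")
```

Each extra truck multiplies the count by 10 × 5 = 50, which is why a table indexed by full state quickly becomes impractical.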
13
Piecewise Linear Function Approximation
We use a different linear function for each possible 5-tuple of locations l_1, …, l_5 of trucks
Each function is linear in truck loads and shop inventories
Every function represents about 10 million states (5^5 × 5^5 = 9,765,625)
This gives a million-fold reduction of learnable parameters
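A minimal sketch of such an approximator keeps one weight vector per tuple of truck locations and evaluates a linear function of loads and inventories. The class name, the constant (bias) term, and the delta-rule update are assumptions for illustration; the slides do not say how the weights are trained.

```python
class PiecewiseLinearValue:
    """One linear function of (loads, inventories) per tuple of truck locations."""

    def __init__(self, n_trucks, n_shops):
        # Assumed feature layout: constant term, then truck loads, then inventories.
        self.n_features = 1 + n_trucks + n_shops
        self.weights = {}  # location tuple -> list of weights

    def _weights_for(self, locations):
        key = tuple(locations)
        return self.weights.setdefault(key, [0.0] * self.n_features)

    @staticmethod
    def _features(loads, inventories):
        return [1.0, *loads, *inventories]

    def value(self, locations, loads, inventories):
        w = self._weights_for(locations)
        x = self._features(loads, inventories)
        return sum(wi * xi for wi, xi in zip(w, x))

    def update(self, locations, loads, inventories, target, lr=0.01):
        """Delta-rule step toward a target value (an assumed training rule)."""
        w = self._weights_for(locations)
        x = self._features(loads, inventories)
        error = target - sum(wi * xi for wi, xi in zip(w, x))
        for k, xk in enumerate(x):
            w[k] += lr * error * xk


v = PiecewiseLinearValue(n_trucks=5, n_shops=5)
v.update((0, 1, 2, 3, 4), (1, 0, 2, 0, 1), (4, 3, 2, 1, 0), target=1.0)
print(v.n_features, "weights per location tuple")
```

With 5 trucks and 5 shops each function has 11 weights under this layout, so 10^5 location tuples give 1.1 million weights in total.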
14
Piecewise linear function approximation vs. table-based
15
Storing and using the action models
Problem: exponential time to determine the expected value of the next state
- Each shop’s consumption is independent
- Value function is piecewise linear
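One reading of these two bullets is that linearity of expectation removes the exponential enumeration: within a single linear piece, the expected value of the next state equals the linear function applied to the expected next inventories. The sketch below compares the two computations under simplifying assumptions that are not in the slides: each shop consumes at most one unit per step with its own probability, and inventories stay above zero so no clipping occurs.

```python
from itertools import product


def expected_value_enumerated(w0, w, inventories, p_consume):
    """Sum over all 2^n consumption outcomes (exponential in the number of shops)."""
    total = 0.0
    for outcome in product([0, 1], repeat=len(inventories)):
        prob = 1.0
        for consumed, p in zip(outcome, p_consume):
            prob *= p if consumed else 1.0 - p
        next_inv = [inv - c for inv, c in zip(inventories, outcome)]
        total += prob * (w0 + sum(wi * x for wi, x in zip(w, next_inv)))
    return total


def expected_value_linear(w0, w, inventories, p_consume):
    """Apply the linear function to the expected next inventories (linear time)."""
    expected_next = [inv - p for inv, p in zip(inventories, p_consume)]
    return w0 + sum(wi * x for wi, x in zip(w, expected_next))


w0, w = 1.0, [0.5, -0.2, 0.3]
inventories, p_consume = [4, 2, 3], [0.1, 0.5, 0.9]
slow = expected_value_enumerated(w0, w, inventories, p_consume)
fast = expected_value_linear(w0, w, inventories, p_consume)
print(abs(slow - fast) < 1e-9)
```

Strictly, linearity of expectation alone is enough for this identity; independence matters because it lets each shop's consumption model be stored and used separately.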
16
Ignoring Truck Identity
m = number of locations (10)
k = number of trucks (2-5)
With truck identity, 5 trucks: m^k = 10^5 functions
– Learnable parameters: 1.1 million
Ignoring truck identity, 5 trucks: 2002 functions
– Learnable parameters: 22,022
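The figure of 2002 matches the number of unordered selections of k = 5 locations from m = 10 with repetition allowed, C(m + k - 1, k) = C(14, 5). The sketch below checks the counts and shows one hypothetical way to map a state to a truck-order-independent key by sorting (location, load) pairs together. The helper names and the 11 weights per function (5 loads, 5 inventories, 1 constant) are assumptions consistent with the totals on the slide.

```python
from itertools import combinations_with_replacement
from math import comb


def function_counts(m, k):
    """Number of linear functions with and without truck identity."""
    ordered = m ** k                # one function per ordered location tuple
    unordered = comb(m + k - 1, k)  # one function per multiset of locations
    return ordered, unordered


def canonical_trucks(locations, loads):
    """Sort trucks by (location, load) so that swapping two trucks gives the same key."""
    pairs = sorted(zip(locations, loads))
    return tuple(loc for loc, _ in pairs), tuple(load for _, load in pairs)


ordered, unordered = function_counts(m=10, k=5)
weights_per_function = 11  # assumed: 5 loads + 5 inventories + 1 constant
print(ordered * weights_per_function, unordered * weights_per_function)
print(len(list(combinations_with_replacement(range(10), 5))) == unordered)
```

Sorting loads along with locations keeps each load attached to its truck, which is what makes the sorted key a valid stand-in for the original state.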
17
The problem of action-space explosion
Every action a is a vector of individual “truck actions”: a = (a_1, a_2, …, a_n)
Actions grow exponentially in the number of trucks
– 9 “truck actions”
– For 2 trucks: 9^2 = 81 total actions
– For 5 trucks: 9^5 = 59,049 total actions
18
Hill Climbing Search
We initialize the vector of truck actions a to all “wait” actions
We use hill climbing to reach a local optimum
Randomly perturb a truck action, repeat
This results in an order-of-magnitude improvement in search time
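The steps on this slide can be sketched as a coordinate-wise hill climb with random perturbations. The action names, the `score` callback (standing in for immediate reward plus expected next-state value), the number of restarts, and the choice to change one truck's action at a time are all assumptions for illustration.

```python
import random

# Assumed names for the 9 truck actions: wait, 4 moves, 4 unload levels.
TRUCK_ACTIONS = ["wait", "north", "south", "east", "west",
                 "unload1", "unload2", "unload3", "unload4"]


def hill_climb(joint, score):
    """Change one truck's action at a time until no single change improves the score."""
    joint = list(joint)
    best = score(joint)
    improved = True
    while improved:
        improved = False
        for t in range(len(joint)):
            for action in TRUCK_ACTIONS:
                if action == joint[t]:
                    continue
                candidate = joint[:t] + [action] + joint[t + 1:]
                s = score(candidate)
                if s > best:
                    joint, best, improved = candidate, s, True
    return joint, best


def search_joint_action(n_trucks, score, restarts=5, rng=None):
    """Start from all 'wait', climb, then perturb one truck action and climb again."""
    rng = rng or random.Random(0)
    joint, best = hill_climb(["wait"] * n_trucks, score)
    for _ in range(restarts):
        perturbed = list(joint)
        perturbed[rng.randrange(n_trucks)] = rng.choice(TRUCK_ACTIONS)
        candidate, s = hill_climb(perturbed, score)
        if s > best:
            joint, best = candidate, s
    return joint, best


# Toy score: number of trucks whose action matches a made-up target vector.
target = ["north", "unload2", "wait", "east", "unload4"]
joint, best = search_joint_action(5, lambda j: sum(a == b for a, b in zip(j, target)))
print(joint, best)
```

Each sweep of the climb evaluates at most 5 × 8 = 40 candidates for 5 trucks, compared with 9^5 = 59,049 for exhaustive search, at the cost of possibly stopping at a local optimum.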
19
Hill climbing vs. exhaustive search for 4 and 5 trucks
20
Conclusion
Average-reward RL and piecewise linear function approximation are promising approaches for real-time product delivery
Hill climbing shows great potential for speeding up search in domains with a large action space
Problems of scaling are surmountable
21
Future Work
Scaling! More trucks, more locations, more shops, more depots, and more items
Allowing trucks to move with non-uniform speeds (event-based model needed)
Real-valued shop inventory and truck load levels