1 A Reinforcement Learning Approach for Product Delivery by Multiple Vehicles Scott Proper, Prasad Tadepalli, Hong Tang, Rasaratnam Logendran Oregon State University

2 Vehicle Routing & Product Delivery

3 Contributions of our Research Multiple-vehicle product delivery is a well-studied problem in operations research We have formulated this problem as an average-reward reinforcement learning (RL) problem We have combined inventory control with vehicle routing We have scaled RL methods to work with large state spaces

4 Markov Decision Processes In each state the agent takes an action a (e.g. Move or Unload) –Actions are stochastic: P_ij(a) is the probability of reaching state j from state i under action a –Actions have costs or rewards: r_i(a)

5 Average Reward Reinforcement Learning Goal: Maximize average reward/time step –Minimize stockout penalty + movement penalty Policy: states → actions Value function: states → real values –expected long-term reward from a state, relative to other states, when following the optimal policy
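
The "average reward per time step" being maximized can be written out explicitly. This is the standard gain formulation, not shown on the slides themselves:

```latex
% Average reward (gain) of a policy \pi -- standard definition, assumed here
\rho^{\pi} = \lim_{N \to \infty} \frac{1}{N}\,
  \mathbb{E}\!\left[\sum_{t=0}^{N-1} r_t \,\middle|\, \pi\right]
```

The value function then assigns each state its expected long-term reward *relative* to this gain, which is why it is described as "relative to other states."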

6 H-Learning The value function satisfies the Bellman equation The optimal action a* maximizes the immediate reward plus the expected value of the next state H-Learning is a real-time algorithm for solving the Bellman equation
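
The Bellman equation shown on this slide was lost in transcription. For average-reward H-learning it takes the following standard form, using the P_ij(a) and r_i(a) notation from the MDP slide, where h is the relative value function and ρ the average reward:

```latex
h(i) = \max_{a} \Big[\, r_i(a) - \rho + \sum_{j} P_{i,j}(a)\, h(j) \,\Big]
```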

7 H-Learning: an example [figure: a five-state world with locations A–E; edges labeled with "reward, count/total" transition statistics such as -.1, 1/1 and 0, 9/9; value table initialized to A=0, B=0, C=0, D=0, E=0]

8 H-Learning: an example 2 Stockout penalty: -20 [figure: same world; an observed stockout transition (-20, 1/10) alongside (0, 9/10); value table now A=-.1, others 0]

9 H-Learning: an example 3 [figure: same world and transition statistics; value table unchanged]

10 H-Learning: an example 4 Move penalty: -.1 [figure: same world; the move transition now observed twice (-.1, 2/2); value table A=-.1, others 0]

11 On-line Product Delivery Deliver 1 product to 5 shops from a depot –9 truck actions: 4 levels of unload, 4 move directions, wait –Shop inventories decrease stochastically: P(inventory decrease | shop) –Stockout penalty: -20 –Movement penalty: -.1 [figure: delivery grid showing the 5 shops and the depot]

12 The problem of state-space explosion The loads of trucks and shop inventories are discretized into 5 levels States grow exponentially in shops and trucks –10 locations, 5 shops, 2 trucks: (10^2)(5^5)(5^2) = 7,812,500 states –5 trucks: (10^5)(5^5)(5^5) = 976,562,500,000 states Table-based methods take too much time and space
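
The counts on this slide follow directly from the discretization: each truck has one of 10 positions and one of 5 load levels, and each shop has one of 5 inventory levels. A short sketch of the arithmetic:

```python
def num_states(locations, shops, trucks, levels=5):
    # Truck positions x shop inventory levels x truck load levels
    return locations ** trucks * levels ** shops * levels ** trucks

print(num_states(10, 5, 2))  # 7,812,500
print(num_states(10, 5, 5))  # 976,562,500,000
```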

13 Piecewise Linear Function Approximation We use a different linear function for each possible tuple of truck locations l_1, …, l_k (a 5-tuple for 5 trucks) Each function is linear in the truck loads and shop inventories For 5 trucks, each function covers nearly 10 million states, giving a roughly million-fold reduction in learnable parameters
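
The scheme above can be sketched as a lookup from the truck-location tuple to a small linear model over the continuous-ish features (loads and inventories). The class name, learning rate, and bias feature here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

class PiecewiseLinearValue:
    """One linear function per tuple of truck locations (illustrative sketch)."""

    def __init__(self, n_trucks, n_shops, lr=0.01):
        self.lr = lr
        self.n_features = n_trucks + n_shops  # truck loads + shop inventories
        self.params = {}  # location tuple -> weight vector (incl. bias)

    def _w(self, locations):
        # Lazily create the linear function for this location tuple
        return self.params.setdefault(tuple(locations),
                                      np.zeros(self.n_features + 1))

    def value(self, locations, loads, inventories):
        x = np.concatenate(([1.0], loads, inventories))  # bias feature
        return float(self._w(locations) @ x)

    def update(self, locations, loads, inventories, target):
        # One gradient step of the selected linear function toward the target
        x = np.concatenate(([1.0], loads, inventories))
        w = self._w(locations)
        w += self.lr * (target - w @ x) * x
```

Only the functions for location tuples actually visited are ever instantiated, which is part of what makes the representation compact in practice.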

14 Piecewise linear function approximation vs. table-based

15 Storing and using the action models Problem: naively, computing the expected value of the next state takes time exponential in the number of shops Two properties make it tractable: –Each shop's consumption is independent –The value function is piecewise linear
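
Why those two properties help: the expectation of a linear function over independent per-shop consumption distributions factorizes, so instead of summing over all joint consumption outcomes (exponential in the number of shops), each shop contributes only its expected consumption. A minimal sketch, with hypothetical argument names:

```python
def expected_next_value(bias, weights, consumption_dists, inventories):
    # V is linear: V(x) = bias + sum_i w_i * x_i.
    # Shops consume independently, so the expectation factorizes per shop:
    # E[V(x - c)] = bias + sum_i w_i * (x_i - E[c_i])
    total = bias
    for w, inv, dist in zip(weights, inventories, consumption_dists):
        expected_consumption = sum(p * c for c, p in dist.items())
        total += w * (inv - expected_consumption)
    return total
```

This runs in time linear in the number of shops, which is the efficiency the slide is pointing at.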

16 Ignoring Truck Identity m = number of locations (10), k = number of trucks (2-5) Distinguishing trucks requires m^k linear functions –5 trucks: 10^5 = 100,000 functions, 1.1 million learnable parameters Treating trucks as interchangeable leaves only C(m+k-1, k) multisets of locations –5 trucks: 2002 functions, 22,022 learnable parameters
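
The counts on this slide are the standard ordered-tuple versus multiset ("stars and bars") counts; the per-function parameter count of 11 (a bias plus 5 shop inventories plus 5 truck loads) is inferred from the totals rather than stated on the slide:

```python
from math import comb

m, k = 10, 5                      # locations, trucks
ordered = m ** k                  # one function per ordered location tuple
unordered = comb(m + k - 1, k)    # location multisets: trucks interchangeable
print(ordered, unordered)         # 100000 2002

params_per_function = 1 + 5 + k   # assumed: bias + 5 shop inventories + k loads
print(ordered * params_per_function)    # 1,100,000
print(unordered * params_per_function)  # 22,022
```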

17 The problem of action-space explosion Every joint action a is a vector of individual "truck actions": a = (a_1, a_2, …, a_n) Actions grow exponentially in the number of trucks –9 "truck actions" per truck –For 2 trucks: 9^2 = 81 joint actions –For 5 trucks: 9^5 = 59,049 joint actions

18 Hill Climbing Search We initialize the vector of truck actions a to all "wait" actions We use hill climbing to reach a local optimum, then randomly perturb a truck action and repeat This results in an order-of-magnitude improvement in search time
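
The search above can be sketched as coordinate ascent over the joint action vector with random restarts. The evaluator `value`, the restart count, and the action encoding are illustrative assumptions; in the real system the evaluator would be the learned value model:

```python
import random

TRUCK_ACTIONS = list(range(9))  # 4 unload levels + 4 move directions + wait
WAIT = 8

def hill_climb(value, n_trucks, perturbations=10, seed=0):
    """Greedy search over joint truck-action vectors (illustrative sketch)."""
    rng = random.Random(seed)
    best = [WAIT] * n_trucks          # start with all trucks waiting
    best_v = value(tuple(best))
    for t in range(perturbations + 1):
        a = best[:]
        if t > 0:                     # restart: perturb one truck's action
            a[rng.randrange(n_trucks)] = rng.choice(TRUCK_ACTIONS)
        improved = True
        while improved:               # climb: improve one truck at a time
            improved = False
            for i in range(n_trucks):
                for act in TRUCK_ACTIONS:
                    cand = a[:]
                    cand[i] = act
                    if value(tuple(cand)) > value(tuple(a)):
                        a = cand
                        improved = True
        v = value(tuple(a))
        if v > best_v:
            best, best_v = a, v
    return tuple(best), best_v
```

Each climbing pass evaluates at most 9 × n_trucks candidates instead of the 9^n_trucks joint actions of exhaustive search, which is where the order-of-magnitude speedup comes from.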

19 Hill climbing vs. exhaustive search for 4 and 5 trucks

20 Conclusion Average-reward RL and piecewise linear function approximation are promising approaches for real-time product delivery Hill climbing shows great potential for speeding up search in domains with large action spaces The problems of scaling are surmountable

21 Future Work Scaling! More trucks, more locations, more shops, more depots, and more items Allowing trucks to move with non-uniform speeds (event-based model needed) Real-valued shop inventory and truck load levels

