1
Optimal Route Selection in Complex Multi-stage Supply Chain Networks using SARSA(λ)
Authors: Md. Arafat Habib, Muhidul Islam Khan, Jia Uddin
2
Dr. Jia Uddin is an Assistant Professor in the Department of Computer Science and Engineering at BRAC University. He was previously an Assistant Professor in the Department of Computer and Communication Engineering at International Islamic University Chittagong, Bangladesh. He received his Ph.D. degree in Computer Engineering from the University of Ulsan, South Korea. During his Ph.D. studies he was involved with the research laboratory "Embedded Ubiquitous Computing System Lab" and published a number of peer-reviewed journal papers. He has attended several international conferences and symposiums at home and abroad. Prior to his Ph.D., he obtained an M.Sc. Engg. (Telecommunications) degree from Blekinge Institute of Technology, Sweden, in 2010.

Md. Muhidul Islam Khan received his Bachelor's degree in Computer Science and Engineering from Khulna University of Engineering and Technology (KUET) in 2007, and his Master's degree from Bangladesh University of Engineering and Technology (BUET) in 2009. He participated in the "eLINK" project at Corvinus University of Budapest, Hungary, from September 2009 until July 2010 (funded by the European Union). His specialization lies in the fields of Wireless Sensor Networks, Networked Embedded Systems, and Pervasive Computing. He started his Ph.D. studies under the Erasmus Mundus Grant from the European Commission at Klagenfurt University, Austria, from January 2011 to July 2012, spent a year at the University of Genova, Italy, and returned to Klagenfurt in July 2013 to continue his work and his dissertation. He obtained his joint doctorate degree in September 2014.
3
[1] N. Rahman, A. Habib, Z. Alam, A. Zoarder, and M. Haque, "Route Optimization in Supply Chain Network," Imperial Journal of Interdisciplinary Research, vol. 2, no. 7.
[2] T. Stockheim, M. Schwind, A. Korth, and B. Simsek, "Supply Chain Yield Management Based on Reinforcement Learning," [online].
[3] F. Altiparmak, M. Gen, L. Lin, and T. Paksoy, "A Genetic Algorithm Approach for Multi-objective Optimization of Supply Chain Networks," vol. 51, no. 1.
[4] Z. Mortaza, A. Selamat, and S. M. Hashim, "Route Planning Model of Multi-agent System for a Supply Chain Management," vol. 40, no. 5, 2013.
4
Proposed Model
- Designing an MDP
  - Deciding states
  - Deciding actions
  - Defining the goal state
  - Defining a reward function
- Solving the MDP with SARSA(λ) to find the optimal route

Fig. 1. Diagram of the different phases of the proposed model.
5
The MDP that models our optimal route selection problem is M = {S, A, T, R, β}, where:

S = {S, M, D, R, C, Ca}
Here, S is the supplier, M is the manufacturer, D stands for distributor, R for retailer, C for customer, and lastly Ca stands for carrier.

A = {t, r, a, w, i}
In the action set, t denotes the action of transporting products from one place to another by truck, r refers to the transportation of goods by rail, a stands for product transportation via air, and w represents the transportation of products by water.

T is the probability distribution of going to a state s′ from s by taking an action a.
R is the cost function that expresses the reward if action a is taken at state s.
β is the discount factor, 0 < β < 1.
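As a rough illustration, this MDP can be written down as plain Python data structures. The transition probabilities below are placeholders invented for the sketch (the slide does not list them), and the fifth action i is kept uninterpreted, as on the slide.

# A sketch of the designed MDP as plain Python data structures.
STATES = ["S", "M", "D", "R", "C", "Ca"]  # supplier, manufacturer, distributor, retailer, customer, carrier
ACTIONS = ["t", "r", "a", "w", "i"]       # truck, rail, air, water, and the unspecified action i

BETA = 0.8  # discount factor beta, 0 < beta < 1 (value assumed)

# T[(s, a)] is a distribution over next states s'.
# The probabilities below are placeholders, not figures from the paper.
T = {
    ("S", "t"): {"M": 0.9, "S": 0.1},
    ("M", "r"): {"D": 0.85, "M": 0.15},
    # ... remaining (state, action) pairs filled in the same way
}

def transition_prob(s, a, s_next):
    """P(s' | s, a); pairs not listed in T default to probability 0."""
    return T.get((s, a), {}).get(s_next, 0.0)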
6
Designed MDP
Fig. 2. State transition diagram for the MDP.
7
Reward Function

R = β × Cost + (1 − β) × Penalty   (1)

where

Cost = Cr × Sg × (Cr × T)   (2)
Penalty = Pt × (1 + (Pd − Psla) / Psla)   (3)

In equation (2):
Cr = cost of the rented transport
Sg = amount of goods to be shifted (in kilograms)
T = total number of transports required

In equation (3):
Pt = penalty for not meeting the desired time deadline
Pd = performance displayed by the system (drawn randomly)
Psla = performance the service provider must maintain, i.e., on-time delivery with the goods in the finest condition

Lastly, β is the balancing factor.
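The reward computation in equations (1)-(3) is straightforward to express in code. The sketch below mirrors the formulas exactly as printed on the slide; the example values at the end are made up for illustration.

def cost(c_r, s_g, t):
    """Equation (2): Cost = Cr x Sg x (Cr x T), as printed on the slide."""
    return c_r * s_g * (c_r * t)

def penalty(p_t, p_d, p_sla):
    """Equation (3): Penalty = Pt x (1 + (Pd - Psla) / Psla)."""
    return p_t * (1 + (p_d - p_sla) / p_sla)

def reward(beta, c_r, s_g, t, p_t, p_d, p_sla):
    """Equation (1): R = beta * Cost + (1 - beta) * Penalty."""
    return beta * cost(c_r, s_g, t) + (1 - beta) * penalty(p_t, p_d, p_sla)

# Example with made-up numbers: beta = 0.5, rental cost 10 per transport,
# 500 kg of goods, 2 transports, base penalty 100, observed performance 0.8
# against an SLA target of 0.9.
print(reward(0.5, 10, 500, 2, 100, 0.8, 0.9))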
8
Q-learning
1. (∀s ∈ S)(∀a ∈ A(s)):
2.   initialize Q(s, a)
3. s := the initial observed state
4. loop
5.   Choose a ∈ A(s) according to a policy derived from Q
6.   Take action a and observe next state s′ and reward r
7.   Q[s, a] := Q[s, a] + α(r + γ max_a′ Q[s′, a′] − Q[s, a])
8.   s := s′
9. end loop
10. return π(s) = argmax_a Q(s, a)
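A minimal runnable version of this pseudocode, assuming a tabular setting and an environment callback step(s, a) that returns (s′, r). The callback, the ε-greedy policy choice, and all hyperparameter values are assumptions made for illustration, not the paper's settings.

import random
from collections import defaultdict

def q_learning(states, actions, step, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, start="S", terminal="C"):
    """Tabular Q-learning following the pseudocode above."""
    Q = defaultdict(float)  # Q[(s, a)] initialized to 0
    for _ in range(episodes):
        s = start
        while s != terminal:
            if random.random() < epsilon:            # epsilon-greedy exploration
                a = random.choice(actions)
            else:                                    # greedy action from Q
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r = step(s, a)                   # environment callback (assumed)
            best_next = max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    # greedy policy extraction: pi(s) = argmax_a Q(s, a)
    return {s: max(actions, key=lambda a_: Q[(s, a_)]) for s in states}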
9
SARSA(λ)
1. Initialize Q(s, a) arbitrarily and e(s, a) := 0 for all s, a
2. Repeat (for each episode):
3.   Initialize s
4.   Choose a from s using the policy derived from Q
5.   Repeat (for each step of the episode):
6.     Take action a, observe r, s′
7.     Choose a′ from s′ using the policy derived from Q
8.     δ := r + γ Q[s′, a′] − Q[s, a]
9.     e(s, a) := e(s, a) + 1
10.    For all s, a:
11.      Q[s, a] := Q[s, a] + α δ e(s, a)
12.      e(s, a) := γ λ e(s, a)
13.    s := s′; a := a′
14. until s is terminal
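For comparison, the same tabular setting gives the following SARSA(λ) sketch with accumulating eligibility traces. As above, the step(s, a) callback and the hyperparameter values are illustrative assumptions.

import random
from collections import defaultdict

def sarsa_lambda(actions, step, episodes=500, alpha=0.1, gamma=0.9,
                 lam=0.9, epsilon=0.1, start="S", terminal="C"):
    """Tabular SARSA(lambda) mirroring the pseudocode above."""
    Q = defaultdict(float)

    def policy(s):
        if random.random() < epsilon:                # epsilon-greedy exploration
            return random.choice(actions)
        return max(actions, key=lambda a_: Q[(s, a_)])

    for _ in range(episodes):
        e = defaultdict(float)       # eligibility traces, reset each episode
        s = start
        a = policy(s)
        while s != terminal:
            s_next, r = step(s, a)
            a_next = policy(s_next)
            delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
            e[(s, a)] += 1.0         # accumulating trace (line 9)
            for key in list(e):      # lines 10-12: update all visited pairs
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s_next, a_next
    return Q

Resetting the traces each episode and decaying them by γλ per step is what lets SARSA(λ) propagate a reward back to all recently visited state-action pairs at once, which is consistent with the faster convergence reported in Fig. 11.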
10
Fig. 3. Cost vs. Penalty graph for β in Q-learning.
11
Fig. 4. Cost vs. Penalty graph for β in Q-learning.
12
Fig. 5. Cost vs. Penalty graph for β in Q-learning and SARSA(λ).
13
Fig. 6. Varying learning rate to find convergence (Q-learning).
14
Fig. 7. Varying learning rate to find convergence (SARSA(λ)).
15
Fig. 11. Convergence speed comparison of Q-learning and SARSA(λ).
16
Optimized Routes
17
Future Work
18
Questions?