Some Final Thoughts
Abhijit Gosavi
From MDPs to SMDPs
- The Semi-MDP (SMDP) is a more general model in which the time spent in each transition is also a random variable.
- The MDP Bellman equations can be extended to SMDPs to accommodate time.
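One common way to write that extension is sketched below. The notation is an assumption, not taken from the slides: p(i,a,j) is the transition probability, r(i,a,j) the immediate (lump-sum) reward, t(i,a,j) the transition time (taken as deterministic given i, a, j for simplicity), γ the discount rate, and ρ* the optimal average reward per unit time.

```latex
% Discounted reward: future value is discounted by the time the transition takes
J^*(i) = \max_{a \in A(i)} \sum_{j} p(i,a,j)
         \left[ r(i,a,j) + e^{-\gamma\, t(i,a,j)}\, J^*(j) \right]

% Average reward: the reward rate \rho^* is charged for the expected transition time
h^*(i) = \max_{a \in A(i)} \left[ \bar{r}(i,a) - \rho^*\, \bar{t}(i,a)
         + \sum_{j} p(i,a,j)\, h^*(j) \right],
\quad
\bar{r}(i,a) = \sum_{j} p(i,a,j)\, r(i,a,j),
\quad
\bar{t}(i,a) = \sum_{j} p(i,a,j)\, t(i,a,j)
```

Setting t(i,a,j) = 1 for every transition recovers the usual MDP equations, with discount factor e^{-γ} in the discounted case.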
SMDPs (contd.)
- In the average reward case, we are interested in maximizing the average reward per unit time.
- In the discounted reward case, we need to discount in proportion to the time spent in each transition.
- The Q-Learning algorithm for discounted reward has a direct extension.
- For average reward, we have a family of algorithms called R-SMART (see book for references).
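A minimal sketch of what the discounted-reward extension of Q-Learning might look like, assuming continuous discounting at rate gamma and a lump-sum reward earned at the start of the transition. The function name `smdp_q_update` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def smdp_q_update(Q, i, a, r, t, j, alpha=0.1, gamma=0.05):
    """One Q-Learning update for a discounted-reward SMDP (illustrative sketch).

    Q     : list of lists, Q[state][action]
    i, a  : state visited and action taken
    r     : immediate reward observed (assumed earned at the start of the transition)
    t     : time the transition took (the extra quantity an SMDP adds)
    j     : next state
    alpha : learning rate (assumed constant here)
    gamma : discount rate per unit time (assumption: continuous discounting)
    """
    # Discount in proportion to the time spent in the transition.
    discount = math.exp(-gamma * t)
    target = r + discount * max(Q[j])
    Q[i][a] = (1 - alpha) * Q[i][a] + alpha * target
    return Q[i][a]


# Usage: two states, two actions; a transition 0 -> 1 under action 1.
Q = [[0.0, 0.0], [1.0, 2.0]]
smdp_q_update(Q, i=0, a=1, r=5.0, t=0.0, j=1, alpha=0.5)
```

With t fixed at 1 for every transition, the update reduces to the ordinary MDP Q-Learning rule with discount factor exp(-gamma).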
Policy Iteration
- Another method to solve the MDP: an alternative to value iteration
- Slightly more involved mathematically
- Sometimes more efficient than value iteration
- Its Reinforcement Learning counterpart is called Approximate Policy Iteration
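To make the two alternating steps concrete, here is a minimal sketch of policy iteration for a small discounted-reward MDP. The data layout (`P[a][i][j]`, `R[a][i][j]`), the discount factor `lam`, and the toy two-state problem are all assumptions made for illustration. Policy evaluation is done by repeated substitution rather than by solving the linear system exactly, which keeps the sketch free of external libraries.

```python
def policy_iteration(P, R, lam=0.9, tol=1e-10):
    """Policy iteration for a discounted MDP (illustrative sketch).

    P[a][i][j] : probability of moving from state i to j under action a
    R[a][i][j] : immediate reward for that transition
    lam        : discount factor, 0 <= lam < 1
    """
    n_actions, n_states = len(P), len(P[0])

    def q_value(i, a, J):
        # Expected immediate reward plus discounted value of the next state.
        return sum(P[a][i][j] * (R[a][i][j] + lam * J[j]) for j in range(n_states))

    policy = [0] * n_states          # arbitrary starting policy
    J = [0.0] * n_states
    while True:
        # Step 1, policy evaluation: iterate until the value of the
        # current policy stops changing.
        while True:
            J_new = [q_value(i, policy[i], J) for i in range(n_states)]
            delta = max(abs(x - y) for x, y in zip(J_new, J))
            J = J_new
            if delta < tol:
                break
        # Step 2, policy improvement: switch an action only if it is
        # strictly better, so ties cannot cause cycling.
        new_policy = list(policy)
        for i in range(n_states):
            best = max(range(n_actions), key=lambda a: q_value(i, a, J))
            if q_value(i, best, J) > q_value(i, policy[i], J) + 1e-9:
                new_policy[i] = best
        if new_policy == policy:
            return policy, J
        policy = new_policy


# Toy problem: action 0 = "stay", action 1 = "switch".
# Staying in state 1 earns 1; everything else earns 0.
P = [[[1.0, 0.0], [0.0, 1.0]],   # stay
     [[0.0, 1.0], [1.0, 0.0]]]   # switch
R = [[[0.0, 0.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
policy, J = policy_iteration(P, R, lam=0.9)
print(policy)  # [1, 0]: switch out of state 0, stay in state 1
```

In this toy problem the values converge to roughly 9 for state 0 and 10 for state 1, and the algorithm stops after two improvement steps.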
Other Applications
- Supply Chain Problems
- Disaster Response Management
- Production Planning in Remanufacturing Systems
- Continuous event systems (LQG control)
What you’ve learned (hopefully)
- Markov chains and how they can be employed to model systems
- Markov decision processes: the idea of optimizing systems (controls) driven by Markov chains
- Some concepts from Artificial Intelligence
- Some (hopefully) cool applications of Reinforcement Learning
- Some coding (for those who were not averse to doing it)
- Systems thinking
- Coding iterative algorithms
- Some discrete-event simulation
HOPE YOU’VE ENJOYED THE CLASS!