Real-Time Scheduling via Reinforcement Learning
Robert Glaubius, Terry Tidwell, Christopher Gill, and William D. Smart
Department of Computer Science and Engineering
Washington University in St. Louis

Problem

In cyber-physical systems, setting and enforcing a utilization target for shared resources is a useful mechanism for ensuring timely execution of tasks. Existing techniques model stochastic tasks pessimistically according to their worst-case execution times, but better performance can be attained by reasoning about the full distribution of task behavior.

Given:
- A mutually exclusive shared resource.
- Tasks a ∈ {1, …, n}, each characterized by a finitely-supported duration distribution P(t_a).
- A utilization target u = (u_1, …, u_n), where u_a is the fraction of time that task a should hold the resource.

Find: A scheduling policy that maintains the relative resource utilization among tasks near u over the system lifetime.

MDP Representation

Basic Model. States are usage vectors x = (x_1, …, x_n), where x_a is task a's accumulated resource usage. Actions a ∈ {1, …, n} correspond to the decision to dispatch task a. When task a is run, the system transitions from x to y = (x_1, …, x_a + t, …, x_n) with probability P(t_a = t). The cost of a state x is its distance from the utilization ray, C(x) = ‖x − λ(x)·u‖, where λ(x) = Σ_a x_a is the total resource usage accumulated in state x.

Wrapped Model. States x and y with equal displacement from the utilization ray {αu : α ≥ 0} have identical optimal values and optimal actions [1], so an equivalent MDP formulation retains just one state from each such equivalence class. This formulation allows us to approximate optimal scheduling policies, provided task models are available.

[Figure: two-task state space (Task 1 Resource Use vs. Task 2 Resource Use) showing the utilization ray {αu : α ≥ 0}.]

Sample Complexity of Scheduling

MDP construction requires prior knowledge of task behaviors, i.e., their duration distributions. In practice we often need to estimate these distributions from observations, which naturally leads to the question: how many observations do we need to guarantee a good policy? Two challenges are particular to this domain:

- Unbounded state space: transitions from any state x depend only on the duration distributions, so there is only one "type" [2] of state whose dynamics must be learned.
- Unbounded costs: values are nevertheless bounded pointwise, since costs grow polynomially while being discounted exponentially.

Analytical Results

Let W be the longest possible duration among all tasks, m the number of observations, P_m the estimated task model, and Q_m the optimal value function of the MDP built from P_m.

Simulation Lemma. If each estimated duration distribution in P_m is within a suitable constant of the true distribution of its task T_i, then the optimal values of the estimated and true MDPs are correspondingly close.

Theorem. If each task is sampled an equal number of times, with m sufficiently large, then Q_m is close to the true optimal value with probability at least 1 − δ.

Corollary. By applying a classical result of [3], for sufficiently large m the resulting policy is ε-optimal with probability at least 1 − δ.

Empirical Results

We compared the performance of three exploration strategies, averaged across 400 randomly generated two-task problem instances:

- ε-greedy, with exploration rate ε_k = k^(−10) at decision epoch k.
- m decision epochs of balanced wandering.
- Interval-based action selection [4], which chooses actions according to confidence intervals on the estimated action values.

[Figure: suboptimal decisions made under each exploration strategy.]

Effective Exploitation. In this domain, explicit exploration mechanisms are apparently less effective than always exploiting the available information. The problem structure itself enforces adequate exploration, since ignoring any task causes costs to grow without bound.

[Figure: two-task state space (Task 1 Resource Use vs. Task 2 Resource Use).]
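To make the basic model concrete, the following is a minimal Python sketch, not the authors' implementation: it encodes the cost C(x) = ‖x − λ(x)·u‖ and the duration-distribution transitions described above, and dispatches tasks greedily by expected one-step cost as a simple stand-in for the exploit-only strategy (the wrapped-model solver of [1] is not reproduced here). The function names, the one-step-greedy rule, and the example distributions are illustrative assumptions.

```python
import random

# Minimal sketch of the basic scheduling MDP (illustrative, not the authors' code):
# states are per-task resource-usage vectors, the cost of a state is its Euclidean
# distance from the utilization ray, and the dispatcher greedily picks the task
# that minimizes the expected cost of the successor state.

def cost(x, u):
    """C(x) = ||x - lambda(x) * u||, where lambda(x) = sum_a x_a."""
    lam = sum(x)
    return sum((xa - lam * ua) ** 2 for xa, ua in zip(x, u)) ** 0.5

def expected_successor_cost(x, a, duration_dist, u):
    """Expected cost after dispatching task a; duration_dist[a] maps t -> P(t_a = t)."""
    total = 0.0
    for t, p in duration_dist[a].items():
        y = list(x)
        y[a] += t
        total += p * cost(y, u)
    return total

def greedy_dispatch(x, duration_dist, u):
    """Choose the task whose dispatch minimizes expected successor cost."""
    return min(range(len(x)),
               key=lambda a: expected_successor_cost(x, a, duration_dist, u))

def simulate(duration_dist, u, epochs=1000, seed=0):
    """Run the dispatcher, sampling true durations, and report the average cost."""
    rng = random.Random(seed)
    x = [0] * len(u)
    total_cost = 0.0
    for _ in range(epochs):
        a = greedy_dispatch(x, duration_dist, u)
        durations, probs = zip(*duration_dist[a].items())
        x[a] += rng.choices(durations, weights=probs)[0]
        total_cost += cost(x, u)
    return total_cost / epochs

if __name__ == "__main__":
    # Two tasks with finitely supported duration distributions and a 2:1 utilization target.
    dists = [{1: 0.5, 3: 0.5},        # task 0: mean duration 2
             {2: 0.25, 4: 0.75}]      # task 1: mean duration 3.5
    target = (2 / 3, 1 / 3)
    print("average distance from utilization ray:", simulate(dists, target))
```

Passing estimated distributions P_m in place of the true ones for `duration_dist` corresponds to exploiting the current model estimate at every decision epoch, i.e., the exploit-only strategy compared in the empirical results.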
References

[1] R. Glaubius, T. Tidwell, W. D. Smart, and C. Gill. Scheduling design and verification for open soft real-time systems. In RTSS'08, pages 505–514, 2008.
[2] B. R. Leffler, M. L. Littman, and T. Edmunds. Efficient reinforcement learning with relocatable action models. In AAAI'07, pages 572–577, 2007.
[3] S. P. Singh and R. C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
[4] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In COLT'02, pages 255–270, 2002.

Acknowledgements

This research has been supported in part by NSF grants CNS-0716764 (Cybertrust) and CCF-0448562 (CAREER).