 Real-Time Scheduling via Reinforcement Learning

Robert Glaubius, Terry Tidwell, Christopher Gill, and William D. Smart
Department of Computer Science and Engineering, Washington University in St. Louis

Problem

In cyber-physical systems, setting and enforcing a utilization target for shared resources is a useful mechanism for ensuring timely execution of tasks. Existing techniques model stochastic tasks pessimistically according to their worst-case execution times, but better performance can be attained by reasoning about the full distribution of task behavior.

Given:
- A mutually exclusive shared resource.
- Tasks a ∈ {1, …, n}, each characterized by a finitely supported duration distribution P(t_a).
- A utilization target u = (u_1, …, u_n), where u_a is the fraction of time that task a should hold the resource.

Find: A scheduling policy that keeps the relative resource utilization among tasks near u over the system lifetime.

Sample Complexity of Scheduling

Constructing the MDP requires prior knowledge of task behavior, i.e., the tasks' duration distributions. In practice we often need to estimate these distributions from observations, which raises the question: how many observations do we need to guarantee a good policy? Two challenges are particular to this domain:

- Unbounded state space: transitions from any state x depend only on the duration distributions, so there is effectively only one type [2] of state whose dynamics we need to learn.
- Unbounded costs: values nonetheless remain bounded pointwise, since costs grow only polynomially while they are discounted exponentially.

Analytical Results

Let W be the longest possible duration among all tasks, m the number of observations, P_m the estimated task model, and Q_m the optimal value function of the MDP built from P_m.

Simulation Lemma. If there is a constant such that, for every task T_i, the estimated duration distribution is within that constant of the true distribution, then Q_m is correspondingly close to the true optimal value.

Theorem. If each task is sampled an equal number of times, and the number of samples m is sufficiently large, then Q_m is close to the true optimal values with probability at least 1 − δ.

Corollary. By applying a classical result of [3], if m is large enough then the resulting policy is ε-optimal with probability at least 1 − δ.

MDP Representation

Basic Model. States are x = (x_1, …, x_n), where x_a is task a's accumulated resource usage. Actions a ∈ {1, …, n} correspond to the decision to dispatch task a. When task a is run for duration t, the system transitions from x to y = (x_1, …, x_a + t, …, x_n) with probability P(t_a = t). The cost of a state x is its distance from the utilization ray: C(x) = ||x − (Σ_a x_a) u||.

Wrapped Model. States with equal displacement from the utilization ray {αu : α ≥ 0} have identical optimal values and optimal actions [1], so an equivalent MDP formulation retains just one representative of each such set of states. This formulation allows us to approximate optimal scheduling policies, provided task models are available.

(Figure: the two-task state space, with axes Task 1 Resource Use and Task 2 Resource Use and the utilization ray {αu : α ≥ 0}.)

Empirical Results

We compared the performance of three exploration strategies, averaged across 400 randomly generated two-task problem instances:

- ε-greedy exploration, with ε_k = k^(−10) at decision epoch k.
- m decision epochs of balanced wandering.
- Interval-based action selection [4].

(Figure: suboptimal decisions made under each strategy.)

Effective Exploitation. In this domain, explicit exploration mechanisms appear to be less effective than always exploiting the available information. The problem structure itself enforces sufficient exploration, since ignoring any task causes costs to grow without bound.

(Figure: two-task state space, axes Task 1 Resource Use and Task 2 Resource Use.)
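To make the basic model concrete, here is a minimal Python/NumPy sketch of the scheduling MDP and of the "always exploit" strategy that the empirical comparison favors. The two-task duration distributions, the 50/50 utilization target, and the one-step greedy dispatch rule are illustrative assumptions, not the authors' exact experimental setup.

```python
import numpy as np

# Minimal sketch of the basic scheduling MDP described above (illustrative only,
# not the authors' implementation). States are vectors of accumulated resource
# usage; dispatching task a adds a random duration t ~ P(t_a) to component a,
# and the cost of a state is its distance from the utilization ray.

# Hypothetical two-task workload: finitely supported duration distributions.
durations = [np.array([1, 2]),      # task 0 runs for 1 or 2 time quanta
             np.array([1, 3])]      # task 1 runs for 1 or 3 time quanta
probs = [np.array([0.5, 0.5]),
         np.array([0.8, 0.2])]
u = np.array([0.5, 0.5])            # utilization target (equal shares, assumed)

def cost(x):
    """C(x) = ||x - (sum_a x_a) u||: displacement from the utilization ray."""
    return np.linalg.norm(x - x.sum() * u)

def expected_next_cost(x, a):
    """Expected cost after dispatching task a once from state x."""
    return sum(p * cost(x + t * np.eye(len(x))[a])
               for t, p in zip(durations[a], probs[a]))

def greedy_dispatch(x):
    """Always exploit: choose the task minimizing expected one-step cost."""
    return min(range(len(durations)), key=lambda a: expected_next_cost(x, a))

# Simulate a few decision epochs under the purely exploitative policy.
rng = np.random.default_rng(0)
x = np.zeros(2)
for epoch in range(10):
    a = greedy_dispatch(x)
    t = rng.choice(durations[a], p=probs[a])   # sample this run's duration
    x[a] += t
    print(f"epoch {epoch}: ran task {a} for {t}, usage {x}, cost {cost(x):.3f}")
```

Note how the cost term keeps the greedy policy returning to whichever task has fallen behind its target share: as the poster observes, ignoring any task makes costs grow without bound, so the problem structure itself supplies the exploration that explicit mechanisms would otherwise provide.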
References

[1] R. Glaubius, T. Tidwell, W. D. Smart, and C. Gill. Scheduling design and verification for open soft real-time systems. In RTSS'08, pages 505–514, 2008.
[2] B. R. Leffler, M. L. Littman, and T. Edmunds. Efficient reinforcement learning with relocatable action models. In AAAI'07, pages 572–577, 2007.
[3] S. P. Singh and R. C. Yee. An upper bound on the loss from approximate optimal-value functions. Machine Learning, 16(3):227–233, 1994.
[4] E. Even-Dar, S. Mannor, and Y. Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In COLT'02, pages 255–270, 2002.

Acknowledgements

This research has been supported in part by NSF grants CNS-0716764 (Cybertrust) and CCF-0448562 (CAREER).