Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj.

1 Service Level Agreement Based Scheduling Heuristics Rizos Sakellariou, Djamila Ouelhadj

2 Motivation – is this a good state of affairs?
Scheduling jobs onto (high-performance) compute resources is traditionally queue-based (and has been since time immemorial).
Two basic levels of service are provided:
– Run this when it gets to the head of the queue (in other words, whenever!)
– Run this at a precise time (advance reservation)
Even sophisticated systems, such as Condor, are still queue-based…

3 Scheduling workflows
DAG scheduling heuristics do exist… but in a queue-based system:
– To maintain dependences, each component is scheduled only after all its parents have finished; the penalty is that each component pays the cost of the batch queue latency!
– Assurances about the start time and completion time of each component are desirable!
The best that one can aim for at the moment is advance reservation: too restrictive!

4 Advance Reservation
Setting times precisely is not what the user really wants. Often users are only interested in the bounds (e.g., latest end time). This information is neither captured nor used!
Advance reservation doesn't fit well into the batch processing model.
– Utilisation (hence income) decreases rapidly as the number of AR jobs increases (gaps can't be effectively plugged: checkpointing and/or suspend/resume costs!)

5 But it's not only about workflows…
Renegotiation of resources: a long-term goal of the RealityGrid project.
Experiments may need to be extended in time, at short notice (the discovery of the century is around the corner).
Resources may need to be changed, in which case checkpointing/restart is needed: state may be in the order of 1TB!
Could it also be about expanding the user base?

6 A novel approach to scheduling?
There is no queue; jobs do not have a priority.
The schedule is based on satisfying constraints.
These constraints are expressed in a Service Level Agreement: a contract between users and brokers, between brokers and local schedulers, etc.
What to optimise for? (objective function)
– Resource utilisation (income)
– If someone comes with lots of cash, the scheduler may want to break some smaller agreements (money rules?) – reliability?

7 [Figure: architecture diagram. Labels: Users; Jobs to finish anytime (no guarantee required); Super Scheduler 1, Super Scheduler 2, Super Scheduler 3; Local Scheduler 1, Local Scheduler 2, Local Scheduler 3, …, Local Scheduler n; Compute Resources (Cluster1, Cluster2, Cluster3); MetaSLA; subSLA; Resource Record]

8 Key components
Users: they negotiate and agree an SLA with a broker (or superscheduler).
Brokers: based on SLAs agreed with users, they negotiate and agree SLAs with local schedulers (and possibly other brokers).
Local Schedulers: they schedule the work that corresponds to an SLA agreed with a broker.
Two types (?) of SLA:
– Meta-SLA (between user and broker)
– Sub-SLA (between broker and local scheduler)

9 Issues
Definition of SLAs
– Resources, start/finish time, how long, cost, guarantee, penalty for failure
– Meta-SLAs are negotiated first, sub-SLAs come later
Negotiation protocols
– Based on availability (needs behaviour model)
Scheduling
– Jobs onto resources (local)
Renegotiation
Economy (selfish entities…), metrics

10 The Research Challenges
[Diagram labels: AI planning & scheduling; scheduling for the Grid; fuzzy logic; multicriteria scheduling; AI constraint satisfaction; SLAs; negotiation; scheduling heuristics; economic considerations]

11 SLA Contents
metaSLA
– Info: name, ID
– Resource (list of resources: H/W, response time): hardware, number of nodes, estimated response time
– Time: date; deadline (execution results by this time); time period (task execution time: start time, end time)
– Guarantee level
– Cost: payment for task execution
– Budget constraint: max cost specified by the user
– Nodes: execution host (preference for a specific machine)
subSLA
– Info (book-keeping info): ID, client, owner, remote machine
– Resource (compute node definition; list of resources: H/W, S/W): hardware (arch, memory, disk, CPU, b/w); software (OS name and version)
– Time (resource reservation time): date, start time, end time

12 Negotiation
Meta-SLA
– User requests an SLA
– Based on a (high-level view of) availability, the broker suggests an SLA
– If the user accepts, a contract is in place
Sub-SLAs
– Broker has agreed a meta-SLA
– Usage of resources needs to be agreed: a sub-SLA is requested
– Bids are made, based on availability
– Sub-SLA is agreed
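The sub-SLA bidding step can be illustrated with a minimal Python sketch. The slides do not say how a broker chooses between bids, so the selection rule here (cheapest bid that meets the deadline and the budget constraint) is an assumption, and the names `Bid` and `select_bid` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bid:
    scheduler: str   # name of the bidding local scheduler (hypothetical)
    start: float     # earliest start time the scheduler can offer
    finish: float    # expected finish time for the requested work
    cost: float      # price quoted for the sub-SLA


def select_bid(bids: list[Bid], deadline: float, budget: float) -> Optional[Bid]:
    """Return the cheapest bid that meets the deadline and budget, or None."""
    feasible = [b for b in bids if b.finish <= deadline and b.cost <= budget]
    if not feasible:
        return None  # no sub-SLA can be agreed with these bids
    # Assumed rule: lowest cost first, earlier finish breaks ties.
    return min(feasible, key=lambda b: (b.cost, b.finish))


bids = [
    Bid("LS1", start=0.0, finish=12.0, cost=40.0),
    Bid("LS2", start=2.0, finish=9.0, cost=55.0),
    Bid("LS3", start=1.0, finish=20.0, cost=10.0),  # cheap but misses the deadline
]
chosen = select_bid(bids, deadline=15.0, budget=60.0)
```

A real broker would probably also weigh guarantee level and penalties; those are left out to keep the sketch short.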

13 When Super Scheduler unable to check locally
[Sequence diagram between Client, Super Scheduler 1 (SS1) and Local Scheduler 1 (LS1). Labels, in extraction order rather than message order: Submit job execution req.; Authorize Client Job; Check Local Resources Availability; Send a metaSLA with the id; Checks Resource Availability; Response; Verify Resources; Submit subSLA(s); Agree subSLA(s); Parse subSLA(s) + Verify; Response; Status Information Request; Response; Completion Report; Task Execution period; Initiate Task Execution; Parse req.; Create + Store subSLA(s); Task Initiation Notification (email); Reservation + Set Deadline; Update Storage of State Info about LS; Update Storage of State Info about LS, SLA Store; Report (LS state info); Assign an SLAid; Cost Calculation; Agree metaSLA; Create + Store metaSLA; Request (execution host info); Optional Action; Update LS; metaSLA negotiation takes place...]

14 Local Scheduling
EPSRC e-Science Meeting 2005
The scheduling problem is defined as the allocation of a set of independent SLAs, $S = \{SLA_1, \dots, SLA_s\}$, to a network of hosts $H = \{H_1, \dots, H_n\}$.
$E_{ij}$ is the expected execution time of $SLA_i$ on host $H_j$.
The earliest possible start time $ST_i$ of $SLA_i$ on a selection of hosts is the latest free time of all the selected hosts.
The expected completion time of $SLA_i$ on host $H_j$ is $C_i = ST_i + E_{ij}$.
The makespan is defined as $C_{\max} = \max_{1 \le i \le s} C_i$.
The objective is to minimise the makespan: $\min C_{\max}$.
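As a small worked illustration of these definitions, the Python sketch below computes the completion times C_i and the makespan C_max for a given assignment. It assumes, for simplicity, that each SLA runs on exactly one host and that every host is free at time 0; the function name `completion_and_makespan` and the sample data are hypothetical.

```python
def completion_and_makespan(exec_time, order, host_of):
    """Return ({i: C_i}, C_max) for SLAs processed in the given order.

    exec_time[i][j] is E_ij, the expected execution time of SLA i on host j.
    host_of[i] is the single host chosen for SLA i (a simplifying assumption).
    """
    n_hosts = len(exec_time[0])
    free_at = [0.0] * n_hosts          # time at which each host becomes free
    completion = {}
    for i in order:
        j = host_of[i]
        start = free_at[j]             # ST_i: the host's latest free time
        completion[i] = start + exec_time[i][j]   # C_i = ST_i + E_ij
        free_at[j] = completion[i]
    return completion, max(completion.values())   # C_max = max_i C_i


# Three SLAs, two hosts (made-up numbers).
E = [[3.0, 5.0],
     [2.0, 4.0],
     [6.0, 1.0]]
completion, c_max = completion_and_makespan(E, order=[0, 1, 2],
                                            host_of={0: 0, 1: 0, 2: 1})
# SLA 0 and SLA 1 run back to back on host 0; SLA 2 runs alone on host 1.
```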

15 Tabu Search for Local Scheduling
To solve the problem we propose to investigate the use of advanced search techniques: tabu search, genetic algorithms, simulated annealing, etc.
Tabu search is a high-level iterative procedure that makes use of memory structures, and of exploration strategies based on the information stored in memory, to search beyond local optima.
In tabu search, the search process starts from a feasible solution and iteratively moves from the current solution to its best neighbouring solution, even if that move worsens the objective function value (Glover, 1997).

16 Tabu Search for Local Scheduling
Initial solution: FCFS, Min-min, Max-min, sufferage, and backfilling.
The solution is improved by using two moves: SLA-swap and SLA-transfer.
– The SLA-swap move swaps two SLAs performed by different processors.
– The SLA-transfer move shifts an SLA to another processor.
Composite neighbourhood.
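A minimal Python sketch of tabu search with these two moves is given below. It is an illustration under stated assumptions, not the authors' implementation: each SLA runs on one processor, the objective is the makespan (the largest processor load), the initial solution is a simple FCFS-style list schedule, and the tabu list stores recently visited solutions rather than move attributes. All function names and parameter values are hypothetical.

```python
from collections import deque
from itertools import combinations


def makespan(assign, E, n_hosts):
    """Largest total load over hosts; assign[i] is the host of SLA i."""
    load = [0.0] * n_hosts
    for i, j in enumerate(assign):
        load[j] += E[i][j]
    return max(load)


def initial_fcfs(E, n_hosts):
    """FCFS-style start: take SLAs in order, place each on the earliest-free host."""
    load = [0.0] * n_hosts
    assign = []
    for row in E:
        j = min(range(n_hosts), key=lambda h: load[h])
        assign.append(j)
        load[j] += row[j]
    return tuple(assign)


def neighbours(assign, n_hosts):
    """Composite neighbourhood: every SLA-transfer and every SLA-swap move."""
    for i, j in enumerate(assign):                      # SLA-transfer
        for h in range(n_hosts):
            if h != j:
                yield assign[:i] + (h,) + assign[i + 1:]
    for a, b in combinations(range(len(assign)), 2):    # SLA-swap
        if assign[a] != assign[b]:
            swapped = list(assign)
            swapped[a], swapped[b] = swapped[b], swapped[a]
            yield tuple(swapped)


def tabu_search(E, n_hosts, iterations=50, tenure=5):
    current = initial_fcfs(E, n_hosts)
    best, best_cost = current, makespan(current, E, n_hosts)
    tabu = deque([current], maxlen=tenure)   # short-term memory of visited solutions
    for _ in range(iterations):
        candidates = [s for s in neighbours(current, n_hosts) if s not in tabu]
        if not candidates:
            break
        # Move to the best non-tabu neighbour, even if it is worse than current.
        current = min(candidates, key=lambda s: makespan(s, E, n_hosts))
        tabu.append(current)
        cost = makespan(current, E, n_hosts)
        if cost < best_cost:
            best, best_cost = current, cost
    return best, best_cost


E = [[4.0, 6.0], [3.0, 5.0], [8.0, 2.0], [5.0, 5.0]]   # made-up E_ij values
start_cost = makespan(initial_fcfs(E, 2), E, 2)
best, best_cost = tabu_search(E, n_hosts=2)
```

On this toy instance the FCFS-style start has makespan 12, and a single SLA-swap already reaches a makespan of 7.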

17 Other objective functions for Local Scheduling
Other objective functions: minimising maximum lateness, minimising cost to the user, maximising profit (to the supplier), maximising personal/general utility, maximising resource utilisation, etc.

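18 Fuzzy Scheduling
Uncertainty handling using fuzzy models: fuzzy due dates, fuzzy execution time.
[Figure: two membership-function plots, each with a vertical axis from 0 to 1: μ(P) against x with points p1, p2, p3 marked, and μ(D) against x with points d1, d2 marked]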
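The shapes of the two membership functions are not recoverable from the transcript. The Python sketch below assumes the forms commonly used in fuzzy scheduling: a triangular fuzzy execution time defined by (p1, p2, p3), and a due-date satisfaction that is 1 up to d1 and falls linearly to 0 at d2. Both shapes and both function names are assumptions made for illustration.

```python
def triangular(x, p1, p2, p3):
    """Membership of x in a triangular fuzzy number (p1, p2, p3), p1 < p2 < p3."""
    if x <= p1 or x >= p3:
        return 0.0
    if x <= p2:
        return (x - p1) / (p2 - p1)    # rising edge
    return (p3 - x) / (p3 - p2)        # falling edge


def due_date_satisfaction(c, d1, d2):
    """Assumed fuzzy due date: fully satisfied up to d1, unsatisfied from d2 (d1 < d2)."""
    if c <= d1:
        return 1.0
    if c >= d2:
        return 0.0
    return (d2 - c) / (d2 - d1)        # linear drop between d1 and d2


# An execution time of "about 5, between 2 and 8" and a due date of
# "10, tolerated up to 14" (made-up numbers).
mu_p = triangular(3.5, 2.0, 5.0, 8.0)
mu_d = due_date_satisfaction(12.0, 10.0, 14.0)
```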

19 Fuzzy objective function
The objective is to minimise the maximum fuzzy completion time:

20 Re-negotiation in the Presence of Uncertainties
Dynamic nature of Grid computing: resources may fail, high-priority jobs may be submitted, new resources can be added, etc.
In the presence of real-time events that leave the LS agents no longer able to execute their SLAs, the SS agents re-negotiate the failed SLAs at the local and global levels of the Grid in order to find alternative LS agents to execute them.

21 Renegotiation in the Presence of Uncertainties
[Figure: User; super schedulers SS 1, SS 2, …, SS n; local schedulers 11, 21, …, m1 under SS 1 and 12, 22, …, w2 under SS 2; arrows labelled "Sub-SLA re-negotiation" and "Meta-SLA re-negotiation"]
– SS 1 detects the resource failure.
– SS 1 re-negotiates the failed sub-SLAs to find alternative Local Schedulers within the same cluster, by initiating a sub-SLA negotiation session with the suitable Local Schedulers.
– If it cannot manage to do so, SS 1 re-negotiates the meta-SLAs with the neighbouring SSs by initiating a meta-SLA negotiation session.
– SS 2 re-negotiates the failed sub-SLAs to find alternative Local Schedulers.
– SS 2 locates LS 22 to execute the failed job.
– At the end of task execution, LS 22 sends a final report, including the output file details, to the user.
– If SS 1 cannot find alternative Local Schedulers at either the local or the global level, it sends an alert message to the user to say that the meta-SLA cannot be fulfilled.
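The escalation order described above (local sub-SLA re-negotiation first, then meta-SLA re-negotiation with neighbouring super schedulers, then an alert to the user) can be sketched in Python as follows. This is a simplified illustration: availability is reduced to a boolean per local scheduler, and the function names and data layout are hypothetical.

```python
from typing import Optional


def find_local(cluster: dict[str, bool]) -> Optional[str]:
    """Sub-SLA re-negotiation: first local scheduler able to take the work."""
    for name, available in cluster.items():
        if available:
            return name
    return None


def renegotiate(sla_id: str, own_cluster: dict[str, bool],
                neighbours: dict[str, dict[str, bool]]):
    """Return (sla_id, level, target); level is 'local', 'global' or 'alert'."""
    ls = find_local(own_cluster)                 # step 1: same cluster
    if ls is not None:
        return (sla_id, "local", ls)
    for ss_name, cluster in neighbours.items():  # step 2: neighbouring SSs
        ls = find_local(cluster)
        if ls is not None:
            return (sla_id, "global", f"{ss_name}/{ls}")
    return (sla_id, "alert", None)               # step 3: tell the user


# Mirrors the slide: nothing is available under SS 1, LS 22 under SS 2 is.
ss1_cluster = {"LS11": False, "LS21": False}
neighbouring = {"SS2": {"LS12": False, "LS22": True}}
outcome = renegotiate("sla-42", ss1_cluster, neighbouring)
```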

22 Methodology
Simulation-based approach
– Need to evaluate different approaches for agreeing SLAs (e.g., conservative vs overbooking), generating bids, pricing/penalties, scheduling, …
– Need to model users' behaviour with SLAs
Evaluation metrics:
– Resource utilisation, jobs completed / SLAs broken
Difficult to do a fair comparison with a batch-queuing system!
– If job waiting time was the issue, it would translate to comparing FCFS with soft real-time scheduling!

23 Conclusions
SLAs have the potential of changing the way that jobs are assigned onto compute resources.
Increased flexibility appears to be the main advantage.
Long-term risk: batch systems have shown a remarkable resistance to change!
http://www.gridscheduling.org

24 The people
Manchester:
– Viktor Yarmolenko
– Rizos Sakellariou
– Jon MacLaren (now at Louisiana State University)
Nottingham:
– Djamila Ouelhadj
– Jon Garibaldi

