
1 Workflow Scheduling Optimisation: The case for revisiting DAG scheduling
Rizos Sakellariou and Henan Zhao University of Manchester

2 Scheduling
Slide Courtesy: Ewa Deelman (deelman@isi.edu, www.isi.edu/~deelman), pegasus.isi.edu

3 Execution Environment
Slide Courtesy: Ewa Deelman, pegasus.isi.edu

4 Aim: to minimise the execution time of the workflow. How?
In this talk, optimisation relates to performance. What affects performance? The aim is to minimise the execution time of the workflow. How? By exploiting task parallelism. But, even if there is enough parallelism, can the environment guarantee that this parallelism can be exploited to improve performance? No! Why? Because of interference from the batch job schedulers that are traditionally used to submit jobs to HPC resources!

5 Example
The uncertainty of batch schedulers means that any workflow enactment engine must wait for components to complete before it can schedule dependent components. Furthermore, it is not clear that parallelism will be fully exploited: if, say, three tasks that can execute in parallel are submitted to three queues of different length, there is no guarantee that they will actually run in parallel – job queues rule! This execution model fails to hide the latencies caused by the length of the job queues, and these latencies end up determining the execution time of the workflow.

6 Then… try to get rid of the evil job queues!
Advance reservation of resources has been proposed to make jobs run at a precise time. However, resources would be wasted if they are reserved for the whole execution of the workflow. Can we automatically make advance reservations for individual tasks?

7 Assuming that there is no job queue…
…what affects performance?
The structure of the workflow: the number of parallel tasks; how long these tasks take to execute.
The number of resources: typically much smaller than the parallelism available.
In addition: there are communication costs; there is heterogeneity; estimating computation + communication is not trivial.

8 What does all this imply for mapping?
An order in which tasks will be executed needs to be established (e.g., red, yellow, or blue first?). Resources need to be chosen for each task (some resources are fast, some are not so fast!). The cost of moving data between resources should not outweigh the benefits of parallelism.

9 Does the order matter?
[Figure: a DAG with tasks 0–9.] If task 6 takes comparatively longer to run, we'd like to execute task 2 just after task 0 finishes and before tasks 1, 3, 4, 5. Follow the critical path! Is this new? Not really…

10 Modelling the problem…
A workflow is a Directed Acyclic Graph (DAG). Scheduling DAGs onto resources is well studied in the context of homogeneous systems – less so in the context of heterogeneous systems (and mostly without taking any uncertainty into account). Needless to say, this is an NP-complete problem. Are workflows really a general type of DAG, or a subclass? We don't really know… (some are clearly not DAGs – only DAGs are considered here…)
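
To keep the rest of the talk concrete, here is a minimal sketch (in Python, with illustrative names and numbers that are not from the talk) of the model used here: a DAG with per-machine execution times for each task and a data volume on each edge.

```python
# Minimal sketch of the scheduling model: per-task execution times on each
# machine, plus per-edge data volumes that cost time only when the two
# endpoint tasks are mapped to different machines. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Dag:
    tasks: dict[int, list[float]]        # tasks[t][m] = time of task t on machine m
    edges: dict[tuple[int, int], float]  # edges[(i, j)] = data volume from i to j

    def successors(self, i: int) -> list[int]:
        return [j for (a, j) in self.edges if a == i]

    def predecessors(self, j: int) -> list[int]:
        return [a for (a, b) in self.edges if b == j]

# A 3-task fork on 2 heterogeneous machines.
dag = Dag(tasks={0: [10.0, 12.0], 1: [5.0, 3.0], 2: [7.0, 7.0]},
          edges={(0, 1): 4.0, (0, 2): 2.0})
print(dag.successors(0))  # [1, 2]
```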

11 Our approach…
Revisit the DAG scheduling problem for heterogeneous systems… Start with simple static scenarios… Even this problem is not well understood, despite the fact that perhaps more than 30 heuristics have been published… (check the Heterogeneous Computing Workshop proceedings for a start…) Try to build up as we obtain a good understanding of each step!

12 Outline
Static DAG scheduling onto heterogeneous systems (i.e., we know computation & communication a priori). Introduce uncertainty in computation times. Handle multiple DAGs at the same time. Use the knowledge accumulated above to reserve slots for tasks on resources.

13 Based on…
[1] Rizos Sakellariou, Henan Zhao. A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. Proceedings of the 13th IEEE Heterogeneous Computing Workshop (HCW'04) (in conjunction with IPDPS 2004), Santa Fe, April 2004. IEEE Computer Society Press, 2004.
[2] Rizos Sakellariou, Henan Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004.
[3] Henan Zhao, Rizos Sakellariou. Scheduling Multiple DAGs onto Heterogeneous Systems. Proceedings of the 15th IEEE Heterogeneous Computing Workshop (HCW'06) (in conjunction with IPDPS 2006), Rhodes, April 2006. IEEE Computer Society Press.
[4] Henan Zhao, Rizos Sakellariou. Advance Reservation Policies for Workflows. Proceedings of the 12th Workshop on Job Scheduling Strategies for Parallel Processing, 2006.

14 How to schedule? Our model…
A DAG, 10 tasks, 3 machines (assume we know execution times and communication costs).
[Figure: a DAG with tasks 0–9.]
[Table: execution time of each task (0–9) on machines M1–M3; most entries were lost in transcription, e.g. task 0 takes 37/39/27 time units on M1/M2/M3 and task 1 takes 30/20/24.]

15 A simple idea…
Assign nodes to the fastest machine! [Figure: tasks 0–9 all mapped to the fastest machines.] But communication between nodes 4 and 8 takes far too long!!! The makespan is > 1000! Heuristics that take into account the whole structure of the DAG are needed…
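
To see the failure mode concretely, here is a tiny sketch of the greedy idea with hypothetical numbers (not the slide's table): each task individually picks its fastest machine, and a single expensive edge then dominates the makespan.

```python
# Naive heuristic: map every task to its individually fastest machine,
# ignoring communication. Numbers are hypothetical, not the slide's.
exec_time = {
    "t4": {"M0": 10, "M1": 50},
    "t8": {"M0": 60, "M1": 12},
}
comm = {("t4", "t8"): 900}  # transfer cost if t4 and t8 run on different machines

assignment = {t: min(cost, key=cost.get) for t, cost in exec_time.items()}
print(assignment)           # {'t4': 'M0', 't8': 'M1'}: the edge crosses machines

makespan = exec_time["t4"]["M0"] + comm[("t4", "t8")] + exec_time["t8"]["M1"]
print(makespan)             # 922: communication dominates, as on the slide
```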

16 Still, if we consider the whole DAG…
HEFT – a minor change (in the rank function) leads to schedules that differ by ~15%: [Gantt charts: two HEFT schedules of tasks 0–9 on three machines over a 0–170 time axis; one makespan is 164, the other value was lost in transcription.]
H. Zhao, R. Sakellariou. An experimental study of the rank function of HEFT. Proceedings of EuroPar'03.

17 Hmm… This was a rather well-defined problem…
This was just a small change in the algorithm… What about different heuristics? What about more generic problems?

18 DAG scheduling: A Hybrid Heuristic
Trying to find out why there were such differences in the outcome of HEFT, we observed problems with the task order… To address those problems we came up with a Hybrid Heuristic… and it worked quite well! Phases:
1. Rank the tasks (list scheduling).
2. Create groups of independent tasks.
3. Schedule each group of independent tasks. This can be carried out using any scheduling algorithm for independent tasks, e.g. MinMin, MaxMin, … or a novel heuristic: Balanced Minimum Completion Time (BMCT).
R. Sakellariou, H. Zhao. A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. Proceedings of the IEEE Heterogeneous Computing Workshop (HCW'04), 2004.

19 An Example
[Figure: a DAG with nodes 0–9.]
[Table: execution time of each node (0–9) on machines M0–M2; partially lost in transcription, e.g. node 0 takes 17/19/21 time units on M0/M1/M2 and node 1 takes 22/27/23.]
Communication cost per data unit between machines: M0–M1: 0.9; M1–M2: 1.0; M0–M2: 1.4.

20 An Example: Phase 1 – Rank the nodes
Mean + upward ranking scheme. The resulting order is {0, 1, 4, 5, 7, 2, 3, 6, 8, 9}.

Node  Weight  Rank
0     19      149.93
1     24      120.66
2     13      85.6
3     7       84.13
4     17      112.93
5     25      95.39
6     16      58.06
7     –       85.66
8     21      57.93
9     23      23.0
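
The rank on this slide is the standard HEFT-style upward rank; a small self-contained sketch of how it is computed, using made-up weights on a 4-task DAG rather than the table above:

```python
# Mean + upward rank (HEFT-style):
# rank(i) = mean_exec(i) + max over successors j of (mean_comm(i, j) + rank(j)).
from functools import lru_cache

mean_exec = {0: 19, 1: 24, 4: 17, 9: 23}                     # mean over machines
mean_comm = {(0, 1): 10, (0, 4): 8, (1, 9): 12, (4, 9): 9}   # mean per edge
succ = {0: [1, 4], 1: [9], 4: [9], 9: []}

@lru_cache(maxsize=None)
def upward_rank(i: int) -> float:
    if not succ[i]:
        return mean_exec[i]   # exit task: rank equals its own weight
    return mean_exec[i] + max(mean_comm[(i, j)] + upward_rank(j)
                              for j in succ[i])

order = sorted(succ, key=upward_rank, reverse=True)
print(order)  # [0, 1, 4, 9] — decreasing upward rank
```

Note that in the table above the exit node 9 has rank 23.0, equal to its own weight, exactly as the recurrence prescribes.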

21 An Example: Phase 2 – Create groups of independent tasks
The order is {0, 1, 4, 5, 7, 2, 3, 6, 8, 9}.

Group  Tasks
0      {0}
1      {1, 4, 5}
2      {7, 2, 3}
3      {6, 8}
4      {9}
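
A sketch of how such groups can be cut from the ranked order: walk the tasks in rank order and start a new group as soon as a task depends on something already in the current group. The edge set below is a guess at the example DAG (the figure did not survive transcription), but it reproduces the groups in the table.

```python
# Phase 2 sketch: every group contains only mutually independent tasks.
succ = {0: [1, 2, 3, 4, 5], 1: [7], 4: [7], 5: [7],
        2: [6], 3: [6], 7: [8], 6: [9], 8: [9], 9: []}

def is_descendant(j, i):
    """True if j is reachable from i, i.e. j depends on i."""
    return any(j == k or is_descendant(j, k) for k in succ[i])

order = [0, 1, 4, 5, 7, 2, 3, 6, 8, 9]   # from Phase 1
groups, current = [], []
for t in order:
    if any(is_descendant(t, g) for g in current):
        groups.append(current)   # t depends on the current group:
        current = []             # close it and start a new one
    current.append(t)
groups.append(current)
print(groups)  # [[0], [1, 4, 5], [7, 2, 3], [6, 8], [9]]
```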

22 Balanced Minimum Completion Time Algorithm (BMCT)
Step I: Assign each task to the machine that gives the fastest execution time. Step II: Find the machine M with the maximal finish time; move a task from M to another machine if doing so reduces the overall makespan. Repeat Step II until no further move helps.
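
A simplified sketch of BMCT for a single group of independent tasks, ignoring the ready times and communication delays that the full algorithm accounts for; execution times are illustrative. The worked example on the following slides traces the same two steps.

```python
# BMCT sketch for independent tasks: start from the fastest-machine
# assignment, then repeatedly try to move a task off the machine that
# finishes last, keeping a move only if it shortens the makespan.
exec_time = {1: [22, 27, 23], 4: [8, 14, 20], 5: [30, 18, 25]}  # illustrative

def finish_times(assign):
    loads = [0.0, 0.0, 0.0]
    for t, m in assign.items():
        loads[m] += exec_time[t][m]
    return loads

# Step I: each task picks its fastest machine.
assign = {t: min(range(3), key=lambda m: exec_time[t][m]) for t in exec_time}

# Step II: move tasks off the most-loaded machine while it helps.
improved = True
while improved:
    improved = False
    loads = finish_times(assign)
    worst = max(range(3), key=loads.__getitem__)
    best = (max(loads), None, None)       # (makespan, task, target machine)
    for t, m in assign.items():
        if m != worst:
            continue
        for m2 in range(3):
            if m2 == worst:
                continue
            cand = max(finish_times({**assign, t: m2}))
            if cand < best[0]:
                best = (cand, t, m2)
    if best[1] is not None:
        assign[best[1]] = best[2]
        improved = True

print(assign, max(finish_times(assign)))  # balanced assignment and makespan
```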

23 An Example (1): Phase 3 – Schedule independent tasks in Group 0
Balanced Minimum Completion Time (BMCT). [Gantt chart: machines M0–M2, time axis 0–140.] Initially assign each task in the group to the machine giving the fastest time. No movement for the entry task.

24 An Example (2): Phase 3 – Schedule independent tasks in Group 1
[Gantt chart: tasks 1, 4, 5 placed on machines M0–M2.] Initially assign each task in the group to the machine giving the fastest time.

25 An Example (3): Phase 3 – Schedule independent tasks in Group 1
[Gantt chart.] Initially assign each task in the group to the machine giving the fastest time. M2 is the machine with the maximal finish time (70).

26 An Example (4): Phase 3 – Schedule independent tasks in Group 1
[Gantt chart.] Task 5 moves to M0, since this achieves an earlier overall finish time. Now M0 is the machine with the maximal finish time (69).

27 An Example (5): Phase 3 – Schedule independent tasks in Group 1
[Gantt chart.] Task 1 moves to M2, since this achieves an earlier overall finish time. Now M2 is the machine with the maximal finish time (59). No task can profitably be moved from M2, so the movement stops. Schedule the next group.

28 An Example (6): Phase 3 – Schedule independent tasks in Group 2
[Gantt chart.] Initially assign each task in this group to the machine giving the fastest time.

29 An Example (7): Phase 3 – Schedule independent tasks in Group 2
[Gantt chart.] Task 2 moves to M1, since this achieves an earlier overall finish time. M2 is the machine with the maximal finish time; no movement from M2 helps. Schedule the next group.

30 An Example (8): Phase 3 – Schedule independent tasks in Group 3
[Gantt chart.] Initially assign each task in this group to the machine giving the fastest time.

31 An Example (9): Phase 3 – Schedule independent tasks in Group 3
[Gantt chart.] Task 6 moves to M0, since this achieves an earlier overall finish time. M2 is the machine with the maximal finish time.

32 An Example (10): Phase 3 – Schedule independent tasks in Group 3
[Gantt chart.] Task 8 moves to M1, since this achieves an earlier overall finish time. M1 is now the machine with the maximal finish time; no movement from M1 helps. Schedule the next group.

33 An Example (11): Phase 3 – Schedule independent tasks in Group 4
[Gantt chart.] Initially assign each task in this group to the machine giving the fastest time. No movement for the exit task.

34 The Final Schedule
[Gantt chart: the final placement of tasks 0–9 on machines M0–M2.]

35 Experiments
DAG scheduling algorithms: Hybrid.BMCT (i.e., the algorithm as presented) and Hybrid.MinMin (i.e., MinMin instead of BMCT).
Applications: randomly generated graphs, Laplace, FFT, fork-join graphs.
Heterogeneity setting (following an approach by Siegel et al.): consistent, partially consistent, inconsistent.

36 Hybrid Heuristic Comparison
[Chart: NSL (normalised schedule length) for random DAGs with inconsistent task heterogeneity.] Average improvement ≈ 25%.

37 Hmm…
Yes, but, so far, you have used static task execution times… In practice, such times are difficult to specify exactly… There is an answer for run-time deviations: adjust at run time… But don't we need to understand the static case first?

38 Characterise the Schedule
Spare time indicates the maximum time that a node i may be delayed without affecting the start time of an immediate successor j.
For a node i with an immediate successor j on the DAG: spare(i,j) = Start_Time(j) – Data_Arrival_Time(i,j).
For a node i with an immediate successor j on the same machine: spare(i,j) = Start_Time(j) – Finish_Time(i).
The minimum of the above over all immediate successors is the MinSpare of a task.
R. Sakellariou, H. Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004.
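
A small sketch of the spare-time computation, wired up with the schedule fragment used in the example on the next slide (helper names are illustrative):

```python
# MinSpare: the smallest spare over a task's DAG successors and its
# successor on the same machine. Data is from the example on the next slide.
start = {5: 29.5, 7: 45.5}          # ST(j)
finish = {3: 28.0}                  # FT(i)
data_arrival = {(4, 7): 40.5}       # DAT(i, j) for DAG edges
dag_succ = {4: [7]}                 # immediate successors in the DAG
machine_succ = {3: [5]}             # next task on the same machine

def min_spare(i):
    spares = []
    for j in dag_succ.get(i, []):
        spares.append(start[j] - data_arrival[(i, j)])
    for j in machine_succ.get(i, []):
        spares.append(start[j] - finish[i])
    return min(spares) if spares else float("inf")

print(min_spare(4))  # 45.5 - 40.5 = 5.0
print(min_spare(3))  # 29.5 - 28.0 = 1.5
```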

39 Example (DAT: Data_Arrival_Time, ST: Start_Time, FT: Finish_Time)
DAT(4,7) = 40.5 and ST(7) = 45.5, hence spare(4,7) = 5. FT(3) = 28 and ST(5) = 29.5, hence spare(3,5) = 1.5.

40 Characterise the schedule (cont.)
Slack indicates the maximum time that a node i may be delayed without affecting the overall makespan: slack(i) = min(slack(j) + spare(i,j)) over all successor nodes j (both on the DAG and on the machine). The idea: keep track of the values of the slack and/or the spare time, and reschedule only when the delay exceeds the slack…
R. Sakellariou, H. Zhao. A low-cost rescheduling policy for efficient mapping of workflows on grid systems. Scientific Programming, 12(4), December 2004.
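
A sketch of the slack recurrence on a toy 4-task DAG with made-up spare values; exit tasks get slack 0, since delaying them delays the makespan directly:

```python
# slack(i) = min over successors j of slack(j) + spare(i, j).
from functools import lru_cache

spare = {(0, 1): 5.0, (0, 2): 2.0, (1, 3): 0.0, (2, 3): 4.0}  # illustrative
succ = {0: [1, 2], 1: [3], 2: [3], 3: []}

@lru_cache(maxsize=None)
def slack(i):
    if not succ[i]:
        return 0.0  # delaying an exit task delays the makespan directly
    return min(slack(j) + spare[(i, j)] for j in succ[i])

print({i: slack(i) for i in succ})  # {0: 5.0, 1: 0.0, 2: 4.0, 3: 0.0}
```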

41 Lessons Learned… (simulation, with deviations of up to 100%)
Heuristics that perform better statically also perform better under uncertainty. By using the spare-time metrics, one can provide guarantees for the maximum deviation from the static estimate. Then we can minimise the number of times we reschedule while still achieving good results. This could lead to orders-of-magnitude improvement with respect to workflow execution using DAGMan (it would depend on the workflow; only partly true with Montage…).

42 Challenges still unanswered…
What are the representative DAGs (workflows) in the context of Grid computing? Extensive evaluation and analysis (theoretical too) is needed. It is not clear what the best achievable makespan is (because it is not easy to find the critical path). What are the uncertainties involved? How good are the estimates that we can obtain for execution times and communication costs? Performance prediction is hard… How 'heterogeneous' are our Grid resources, really?

43 Moving on… to multiple DAGs
It is rather idealistic to assume that we have exclusive use of the resources… In practice, we may have multiple DAGs competing for resources at the same time…
Henan Zhao, Rizos Sakellariou. Scheduling Multiple DAGs onto Heterogeneous Systems. Proceedings of the 15th IEEE Heterogeneous Computing Workshop (HCW'06) (in conjunction with IPDPS 2006), Rhodes, April 2006. IEEE Computer Society Press.

44 Scheduling Multiple DAGs: Approaches
Approach 1: Schedule one DAG after the other with existing DAG scheduling algorithms. Low resource utilisation and long overall makespan.
Approach 2: Still one after the other, but do some backfilling to fill the gaps. Which DAG to schedule first – the one with the longest makespan or the one with the shortest?
Approach 3: Merge all DAGs into a single, composite DAG. Much better than Approaches 1 or 2.

45 Example: Two DAGs to be scheduled together
[Figure: DAG A with tasks A1–A5 and DAG B with tasks B1–B7.]

46 Composition Techniques
C1: Common entry and common exit node. [Figure: DAGs A and B joined under a common entry node and above a common exit node.]
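
One plausible reading of C1 as code, assuming globally unique task names and zero-cost pseudo-edges from and to the added nodes (the helper and its signature are illustrative):

```python
# C1 sketch: join all DAGs under one entry node and above one exit node,
# then schedule the composite DAG as a single graph.
def compose_c1(dags):
    """dags: list of (entry_nodes, exit_nodes, edges) per DAG; task names
    are assumed globally unique (e.g. 'A1', 'B3')."""
    edges = []
    for entries, exits, dag_edges in dags:
        edges += dag_edges
        edges += [("ENTRY", n) for n in entries]  # zero-weight pseudo-edges
        edges += [(n, "EXIT") for n in exits]
    return edges

dag_a = (["A1"], ["A5"], [("A1", "A2"), ("A2", "A5"), ("A1", "A3"), ("A3", "A5")])
dag_b = (["B1"], ["B7"], [("B1", "B2"), ("B2", "B7")])
print(compose_c1([dag_a, dag_b]))
```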

47 Composition Techniques
C2: Level-based ordering. [Figure: the tasks of DAGs A and B interleaved level by level in the composite DAG.]

48 Composition Techniques
C3: Alternate between DAGs… (round robin between DAGs)… Easy!

49 Composition Techniques
C4: Ranking-based composition (compute a weight for each node and merge accordingly). [Figure: the composite DAG.]

Node  Rank
A1    –
A2    –
A3    –
A4    20
A5    6
B1    200
B2    152
B3    122
B4    140
B5    45
B6    63
B7    13

(The ranks of A1–A3 were lost in transcription.)

50 But, is makespan optimisation a good objective when scheduling multiple DAGs?

51 Mission: Fairness
In multiple DAGs:
The user's perspective: I want my DAG to complete execution as soon as possible.
The system's perspective: I would like to keep as many users as possible happy; I would like to increase resource utilisation.
Let's be fair to users! (The system may want to take into account the different levels of quality of service agreed with each user.)

52 Slowdown
Slowdown: the delay that a DAG experiences as a result of sharing the resources with other DAGs (as opposed to having the resources on its own). The average slowdown over all n DAGs: avg_slowdown = (1/n) × Σ_i slowdown(DAG_i).

53 Unfairness
Unfairness indicates, over all DAGs, how different the slowdown of each DAG is from the average slowdown: unfairness = Σ_i |slowdown(DAG_i) – avg_slowdown|. The higher the difference, the higher the unfairness!

54 Scheduling for Fairness
Key idea: at each step (that is, every time a task is to be scheduled), select the most affected DAG (that is, the DAG with the highest slowdown value) and schedule a task from it. But what is the most affected DAG at any given point in time?

55 Fairness Scheduling Policies
F1: Based on latest finish time – calculates the slowdown value of a DAG only at the time the last task scheduled for that DAG finishes.
F2: Based on current time – re-calculates the slowdown value for every DAG after any task finishes; for tasks still running, a proportional share of their time is counted at the moment the calculation is carried out.
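
The selection rule shared by F1 and F2 is tiny; the two policies differ only in when the slowdown estimates are refreshed. A sketch with illustrative numbers:

```python
# Fairness-driven selection: whenever a scheduling decision is due, give the
# next slot to the unfinished DAG currently estimated to be most affected.
def pick_most_affected(slowdown_estimate, unfinished):
    return max(unfinished, key=lambda d: slowdown_estimate[d])

estimates = {"A": 1.40, "B": 1.10, "C": 1.25}  # refreshed per F1 or F2
print(pick_most_affected(estimates, {"A", "B", "C"}))  # 'A' is served next
```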

56 Lessons Learned… Open questions…
It is possible to achieve reasonably good fairness without affecting the makespan. An algorithm with good behaviour in the static case appears to make it easier to achieve fairness… Open questions: What is fairness? What is the behaviour when run-time changes occur? What about different notions of Quality of Service (SLAs, etc.)?

57 Finally… How to automate advance reservations at the task level for a workflow, when the user has specified a deadline constraint only for the whole workflow? Henan Zhao, Rizos Sakellariou. Advance Reservation Policies for Workflows. Proceedings of the 12th Workshop on Job Scheduling Strategies for Parallel Processing, 2006.

58 The Schedule
[Gantt chart: the earlier schedule of tasks 0–9 on machines M0–M2.] The schedule on the left can be used to plan reservations. However, if one task fails to finish within its slot, the reservations for the remaining tasks have to be re-negotiated.

59 What we are looking for is…

60 The Idea
1. Obtain an initial assignment using any DAG scheduling algorithm (HEFT, HBMCT, …).
2. Repeat:
I. Compute the Application Spare Time (= user-specified deadline – DAG finish time).
II. Distribute the Application Spare Time among the tasks.
3. Until the Application Spare Time is below a threshold.
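
A naive sketch of the loop in steps 2–3, treating the DAG as a serial chain so that its finish time is just the sum of the reserved slots (the real algorithm distributes over an actual schedule); names and numbers are illustrative.

```python
# Spread the gap between the deadline and the current finish time over the
# tasks (here: evenly, as one naive variant), repeating until the remaining
# Application Spare Time falls below a threshold.
def distribute_spare(durations, deadline, threshold=1.0):
    slots = dict(durations)  # task -> reserved slot length
    while True:
        app_spare = deadline - sum(slots.values())  # serial chain, for simplicity
        if app_spare < threshold:
            return slots
        bonus = app_spare / len(slots)
        for t in slots:
            slots[t] += bonus

print(distribute_spare({"t1": 10.0, "t2": 20.0, "t3": 30.0}, deadline=90.0))
# {'t1': 20.0, 't2': 30.0, 't3': 40.0} — every slot can absorb a delay
```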

61 Spare Time
The Spare Time indicates the maximum time that a node may be delayed without affecting the start time of any of its immediate successor nodes. For a node i with an immediate successor j on the DAG: spare(i,j) = Start_Time(j) – Data_Arrival_Time(i,j). For a node i with an immediate successor j on the same machine: spare(i,j) = Start_Time(j) – Finish_Time(i). The minimum of the above over all immediate successors is the Spare Time of a task. Distributing the Application Spare Time needs to take care of this inherently present spare time!

62 Two main strategies
Recursive spare time allocation: the Application Spare Time is divided among all the tasks; this is a repetitive process, until the Application Spare Time falls below a threshold.
Critical-path-based allocation: divide the Application Spare Time among the tasks on the critical path, then balance the Spare Time of all the other tasks.
(A total of 6 variants have been studied.)

63 An Example

64 Critical Path based allocation

65 Finally…

66 Findings…
Advance reservations of resources for workflows can be automatically converted to reservations at the task level, thus improving resource utilisation. If the deadline set for the DAG allows enough spare time, then we can reserve resources for each individual task so that deviations of the same order, for each task, can be absorbed without any problems. Advance reservation is known to harm resource utilisation; but this study indicated that if the user is prepared to pay for full usage even when only 60% of the slot is used, there is no loss for the machine owner.

67 …which leads to pricing!
R. Sakellariou, H. Zhao, E. Tsiakkouri, M. Dikaiakos. "Scheduling workflows under budget constraints". To appear as a chapter in a book of selected papers from the 1st CoreGrid Integration Workshop. The idea: given a specific budget, what is the best schedule you can obtain for your workflow? Multi-criteria optimisation is hard! Our approach: start from a good solution for one objective, and try to meet the other! It works! How well… is difficult to tell!

68 To summarize…
Understanding the basic static scenarios, and having robust solutions for them, helps the extension to more complex cases… Pretty much everything here is addressed by heuristics, whose evaluation requires extensive experimentation. Still: no agreement about what DAGs (workflows) look like; no agreement about how heterogeneous the resources are. The problems addressed here are perhaps more related to what is supposed to be core CS… But… we may be talking about a lot of work for only incremental improvements… 10–15%…

69 Who cares in Computer Science about performance improvements in the order of 10–15%???
(Yet, if Gordon Brown were to increase our taxes by 10–15%, everyone would be so unhappy…) Oh well…

70 Thank you!

