New Workflow Scheduling Techniques
Presenter: Anirban Mandal
VGrADS Workshop @ UCSD, Sep 2005
Outline
- Drawbacks of Workflow Scheduler v.0
- Middle-Out Scheduling
- Scheduling onto systems with batch queues
- Scheduling onto Abstract Resource Classes

Premise: Automating good application-level scheduling using performance models, by taking advantage of vgES features
Top-Down Scheduling

  Until all components mapped:
    For each heuristic:
      While all available components not mapped:
        For each (component, resource) pair:
          ECT(c,r) = rank(c,r) + EAT(r)
        End for each
        Run the heuristic (min-min, max-min or sufferage)
        Store mapping
      End while
    End for each
    Select mapping with minimum makespan
    Map available components to resources
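A minimal Python sketch of one top-down scheduling pass using the min-min heuristic (max-min and sufferage would be run the same way and the best resulting makespan kept). The data structures and function name are illustrative, not from the slides; rank(c,r) and EAT(r) follow the notation above.

```python
# Sketch: map one set of available workflow components with min-min.
# rank[(c, r)] -- performance-model estimate of running component c on resource r
# eat[r]       -- estimated availability time of resource r (updated in place)

def min_min_level(components, resources, rank, eat):
    mapping = {}
    unmapped = set(components)
    while unmapped:
        # For each unmapped component, find its best resource by ECT = rank + EAT.
        best = {}
        for c in unmapped:
            r = min(resources, key=lambda r: rank[(c, r)] + eat[r])
            best[c] = (rank[(c, r)] + eat[r], r)
        # min-min: commit the component whose best ECT is smallest.
        c = min(best, key=lambda c: best[c][0])
        ect, r = best[c]
        mapping[c] = r
        eat[r] = ect              # resource r is now busy until c finishes
        unmapped.remove(c)
    makespan = max(eat.values())
    return mapping, makespan
```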
Drawbacks of Workflow Scheduler v.0

The top-down Workflow Scheduler suffers from:
- Myopia
  - Top-down traversal implies no look-ahead
  - Potential for poor mapping of critical steps because of decisions taken higher up in the workflow
- Assumption of instant resource availability
  - Many systems have batch-queue front ends
  - Jobs have to wait before they start
- Scaling problems
  - Scheduling onto individual nodes poses scaling problems in large resource environments (issue raised at the site visit)
Addressing the Drawbacks

We address the drawbacks as follows:
- Myopia: Middle-Out Scheduling
  - Schedule the critical step first and propagate the mapping up and down (see the sketch after the next slide)
- Assumption of instant resource availability
  - Incorporate batch-queue wait times into scheduling decisions (joint work: Rice + UCSB)
- Scaling problems
  - Use a two-level scheduling strategy: explicit resource pruning (via vgDL or other means), then scheduling (joint work: Rice + UCSD + Hawaii)
  - Scheduling onto abstract resource classes / clusters: Ryan's talk
Middle-Out Scheduling

[Figure: workflow DAG with the key step highlighted, comparing the Top-Down and the Middle-Out mappings]
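A sketch of the middle-out idea under simple assumptions: the workflow is organized into levels, one "key" level is scheduled first while all resources are free, and its mapping then constrains the levels scheduled upward and downward from it. The helper names (`schedule_level`, the `hint` argument, the `levels` structure) are hypothetical.

```python
# Sketch of middle-out ordering: map the key (critical) level first,
# then propagate the mapping outward in both directions.
# schedule_level(level, hint) stands in for the per-level heuristic
# scheduler (e.g. min_min_level above), biased by an adjacent mapping.

def middle_out(levels, key_level, schedule_level):
    mappings = {}
    # 1. Map the key computational step first, unconstrained.
    mappings[key_level] = schedule_level(levels[key_level], hint=None)
    # 2. Propagate upward: earlier levels are mapped to respect where
    #    their already-mapped successors expect their inputs.
    for lvl in range(key_level - 1, -1, -1):
        mappings[lvl] = schedule_level(levels[lvl], hint=mappings[lvl + 1])
    # 3. Propagate downward: later levels follow their mapped predecessors.
    for lvl in range(key_level + 1, len(levels)):
        mappings[lvl] = schedule_level(levels[lvl], hint=mappings[lvl - 1])
    return mappings
```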
Middle-Out Scheduling: Results

Compared makespans for middle-out vs. top-down scheduling
- Resource set: 5 clusters (2 Opteron clusters and 3 Itanium clusters)
- 6 resource-topology scenarios: combinations of Opteron clusters close, normal and far with fast and slow Itaniums, e.g. {(Opteron close, fast Itanium), ...}
- Application: actual EMAN DAG with 3 different communication-to-computation ratios (CCR): 0.1, 1 and 10
  - Used known performance-model values for the computational components
  - Varied file sizes to obtain the desired CCR between each pair of synchronization points (a rough illustration follows below)
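As a rough illustration of what "varied file sizes to obtain the desired CCR" can mean (the exact formula below is an assumption, not from the slides): if the data transferred at a synchronization point moves over a link of a given bandwidth, and the adjacent computation takes a known time from the performance model, the file size that yields a target CCR is

```python
# Hypothetical helper: choose a file size (bytes) so that the transfer
# time at a synchronization point is CCR times the computation time.
def file_size_for_ccr(ccr, compute_time_s, bandwidth_bytes_per_s):
    # transfer_time = file_size / bandwidth == ccr * compute_time
    return ccr * compute_time_s * bandwidth_bytes_per_s

# e.g. CCR = 10, 100 s of computation, 10 MB/s link -> 1e10 bytes (10 GB)
size = file_size_for_ccr(10, 100.0, 10e6)
```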
Middle-Out Scheduling: Results (CCR 0.1)

- CCR 0.1: computation is 10 times the communication
- A fast Itanium causes the top-down scheduler to "get stuck" at the Itanium clusters
- Since the key computation step is scheduled on both Opteron clusters, the makespan depends on the Opteron connectivity
- In the slow-Itanium case, the top-down scheduler "got lucky"
- Gain from middle-out scheduling is small
Middle-Out Scheduling: Results (CCR 1)

- CCR 1: equal communication and computation
- A fast Itanium causes the top-down scheduler to "get stuck" at the Itanium clusters
- Since the key computation step is scheduled on both Opteron clusters, the makespan depends on the Opteron connectivity
- In the slow-Itanium case, the top-down scheduler "got lucky"
Middle-Out Scheduling: Results (CCR 10)

- CCR 10: communication is 10 times the computation
- A fast Itanium causes the top-down scheduler to "get stuck" at the Itanium clusters
- Since the key computation step is scheduled on both Opteron clusters, the makespan depends on the Opteron connectivity
- In the slow-Itanium case, the top-down scheduler "got lucky"
Middle-Out Scheduling: Results With increasing communication, the middle-out scheduler performs better when the top-down scheduler gets stuck
Outline
- Drawbacks of Workflow Scheduler v.0
- Middle-Out Scheduling
- Scheduling onto systems with batch queues
- Scheduling onto Abstract Resource Classes
Scheduling onto Batch-Queue Systems

- Incorporated point-value predictions of batch-queue wait times
- Slight modification to the top-down scheduler:
  - At every scheduling step, include the estimated time the job will wait in the queue in the estimated completion time for the job [ECT(c,r) in the algorithm]
  - Keep track of the queue wait time for each cluster and the number of nodes that wait time corresponds to
  - With each mapping, update the estimated availability time [EAT in the algorithm] with the queue wait time, as required
- Joint work with Dan Nurmi and Rich Wolski (a sketch of the modified ECT follows below)
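A minimal sketch of the ECT modification under assumed interfaces: `queue_wait(cluster, n_nodes)` stands in for the point-value wait-time prediction tracked per cluster, and the `max` combination of wait time and resource availability is an illustrative choice, not a statement of the implemented rule.

```python
# Sketch: estimated completion time of component c on resource r when r
# sits behind a batch-queue front end.
# rank[(c, r)] and eat[r] are as in the top-down algorithm;
# queue_wait(cluster, n_nodes) is the predicted point-value wait time.

def ect_with_queue(c, r, cluster_of, rank, eat, queue_wait, nodes_needed):
    cluster = cluster_of[r]
    wait = queue_wait(cluster, nodes_needed[c])
    # The job cannot start before the resource is free AND the queued
    # request has been dispatched, so take the larger of the two delays.
    start = max(eat[r], wait)
    return start + rank[(c, r)]
```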
Scheduling onto Batch-Queue Systems: Example

[Figure: input DAG being mapped onto two clusters; Cluster 0 offers resources R0, R1 with queue wait time 20 (1 node at this wait time), Cluster 1 offers R2, R3 with queue wait time 10 (2 nodes at this wait time); T denotes the timeline]
Outline
- Drawbacks of Workflow Scheduler v.0
- Middle-Out Scheduling
- Scheduling onto systems with batch queues
- Scheduling onto Abstract Resource Classes
  - Addresses the scaling problem
  - Modify the scheduler to schedule onto clusters instead of individual nodes
Scheduling onto Clusters

- Input:
  - Workflow DAG with restricted structure: nodes at the same level do the same computation
  - Set of available clusters (numNodes, arch, CPU speed, etc.) and inter-cluster network connectivity
  - Per-node performance models for each cluster
- Output:
  - Mapping: for each level, the number of instances mapped to each cluster
- Objective:
  - Minimize the makespan at each step
Scheduling onto Clusters: Modeling

Abstract modeling of the mapping problem for one DAG level
- Given:
  - N instances
  - M clusters, with r_1..r_M nodes per cluster
  - t_1..t_M: rank value per node for each cluster (incorporates both computation and communication)
- Aim: find a partition (n_1, n_2, ..., n_M) of N, with n_1 + n_2 + ... + n_M = N, such that the overall time is minimized
- Analytical solution: no "obvious" solution because of the discrete nature of the problem
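One way to make "overall time" concrete is the round-based cost model implied by the round(j) bookkeeping on the next slide; the exact formula below is an assumption stated for illustration. Instances assigned to cluster j run in rounds of at most r_j at a time, each round taking t_j, and the level finishes when the slowest cluster does:

```latex
% Assumed round-based cost model for one DAG level
T(n_1,\dots,n_M) \;=\; \max_{1 \le j \le M} \left\lceil \frac{n_j}{r_j} \right\rceil t_j ,
\qquad \text{subject to } \sum_{j=1}^{M} n_j = N .
```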
Scheduling onto Clusters

Iterative approach to solve the problem (addresses the scaling issue):

  For each instance i from 1 to N:
    For each cluster j from 1 to M:
      Tentatively map i onto j
      Record the resulting makespan, taking care of round(j)
    End for each
    Find the cluster p with the minimum makespan increase
    Map i to p
    Update round(p), numMapped(p)

Complexity: O(#instances x #clusters)
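A Python sketch of this iterative mapping under the round-based cost model assumed above (variable names are illustrative). For clarity it recomputes the full makespan at each tentative mapping; the slide's O(#instances x #clusters) bound assumes round(j) and numMapped(j) are updated incrementally instead.

```python
from math import ceil

def map_level_to_clusters(N, r, t):
    """Greedily partition N instances of one DAG level over M clusters.

    r[j] : number of nodes in cluster j
    t[j] : per-node rank value (time per instance) on cluster j
    Returns n, where n[j] is the number of instances mapped to cluster j.
    """
    M = len(r)
    n = [0] * M
    for _ in range(N):
        best_j, best_makespan = None, None
        for j in range(M):
            # Tentatively map this instance onto cluster j and compute
            # the resulting level makespan (rounds of r[j] instances each).
            n[j] += 1
            makespan = max(ceil(n[k] / r[k]) * t[k] for k in range(M))
            n[j] -= 1
            if best_makespan is None or makespan < best_makespan:
                best_j, best_makespan = j, makespan
        n[best_j] += 1        # commit the instance to the best cluster
    return n

# e.g. 10 instances over a 4-node cluster (5 s/instance) and an
# 8-node cluster (3 s/instance):
#   map_level_to_clusters(10, [4, 8], [5.0, 3.0])
```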
Discussions…