VGrADS Tools Activities Chuck Koelbel VGrADS Workshop, February 23, 2006
Tools: Where We Are. Achieved: initial workflow scheduling methods (“Anirban scheduler” [Rice, UH, UCSD, ISI], supported by performance prediction and NWS); initial fault tolerance implementations (FT-MPI [UTK], optimal checkpoint scheduling [UCSB]); platform-independent application launch and optimization (LLVM, run-time reoptimization experiments [Rice]). Working on it: Virtual Grid scheduling methods; building workflow DAGs.
Ongoing Tools Thrusts. Scheduling methods: most of the rest of this talk [Rice, UCSD, UCSB, UTK, ISI]; all based on pre-scheduling (aka off-line scheduling) of workflows (aka dataflow, aka DAGs) using performance prediction. Performance prediction: queue delay model [UCSB]. Other: launching and reoptimization [Rice]; DAG construction [Rice].
Scheduling Methods Two-level (choose VG, map onto it) Richard Huang (UCSD), Anirban Mandal & Ryan Zhang (Rice) Batch queue (include est. queue delay in cost model) Anirban Mandal (Rice), Dan Nurmi (UCSB) Cluster (assign block of tasks to cluster) Anirban Mandal (Rice) Provisioning (minimize reservation time + execution time) Gurmeet Singh (ISI) Robust (schedule to reduce sensitivity to variability) Zhiao Shi (UTK)
Scheduling Comparison (objective function to min / max; costs; important assumptions)
Two-level. Objective: Scheduler + Makespan. Costs: DAG nodes & edges; (Rice) Proc type. Assumptions: dedicated resources available for duration of application.
Batch queue. Objective: Scheduler + Makespan + Queue delays. Costs: DAG nodes & edges; Proc type, Queue delay. Assumptions: resources controlled by queues, but no relevant allocation limits.
Cluster. Costs: DAG nodes & edges; Proc type. Assumptions: dedicated resources available for duration of application; homogeneous DAG nodes on each level.
Provisioning. Objective: Reservation time and Sched quality. Costs: Reservation, Schedule costs. Assumptions: resources controlled by provisioning mechanism.
Robust. Objective: Robustness subject to makespan. Costs: DAG edges(?). Assumptions: shared resources resulting in runtime performance variability.
Results. [Figures: Huang, 2-level scheduler, Montage DAG; Mandal, cluster scheduler, EMAN DAG; Shi, robust scheduler, ??? DAG]
Tools Research Going Forward Interface between vgES and schedulers What capabilities can schedulers expect from vgES? How can schedulers exploit this capability? How can schedulers work around this capability? Some interesting operating points vgES provisions VG / application takes what’s given vgES returns shared VG nodes / application adapts to perf variance vgES returns queued VG resources / application manages queues vgES provisions VG, monitors for additional resources / application starts immediately, adapts to changes
Tools Research Going Forward. Generating vgDL requests for 2-level methods: balance request complexity against the difficulty of scheduling onto the VG.
VG1 = ClusterOf (node) [1:N] [Rank=Cluster.nodes] {node = [CPU=Opteron]}
VG2 = ClusterOf (node) [1:N] [Rank=Cluster.nodes*node.clock] {node = [CPU=Opteron]}
VG3 = ClusterOf (node) [1:N] [Rank=PerfModel(Cluster.nodes,Cluster.bw,node.clock,node.mem)] {node = [CPU=Opteron]}
Automatic vgDL generation from DAGs: template-driven? Heuristic-driven? Extended vgDL capabilities: global constraints (e.g. total # of nodes); temporal constraints (e.g. available within 60 min); probabilistic constraints (e.g. 95% likely to succeed).
Tools Research Going Forward New scheduling criteria Deadline scheduling Economic scheduling Real-time scheduling New scheduling situations Rescheduling Adapting to new resources Adapting to resource failures Incremental scheduling Managing dynamic applications “Horizon scheduling” for limited-time predictions Hybrid static / dynamic scheduling Contingency scheduling Static planning for dynamic optimizations
Backup Slides Beyond This Point
Two-Level Scheduling (Huang) Target Application Workflows represented by DAG Performance Metrics Application Turn-Around Time Resource Selection Scheduling Time Application Makespan Major Assumptions of the Scheduler Resources are dedicated Resources available for duration of application Scheduling Algorithms (so far) Greedy Modified Critical Path
Experimental Setup. Use synthetic resource generator to generate 1000 clusters (33,667 hosts). Execute one “simple” (greedy) and one “complex” (Modified Critical Path) scheduling heuristic. Tests on Montage DAG. Each scheduling heuristic (simple, complex) is run against three resource sets: the resource universe, the top x percent fastest hosts, and an appropriate virtual grid.
Initial Results. [Plots: original CCR; CCR = 0.1] Two-phase scheduling is necessary to avoid excessive scheduling time. Appropriate virtual grids are necessary for better performance. Using a more complex heuristic did not improve performance if you have the appropriate resource abstractions!
Batch Queue Scheduling (Mandal). Make batch-queue predictions on-the-fly from the “live” systems (new NWS functionality). Parameterize the performance models using the 95% upper bound on the median prediction as a prediction of delay. The performance models can take into account the amount of time needed to start a computation. Run a top-down (heuristic) scheduler to choose a resource set. Scheduler is smart enough to understand that the start-up delay can be amortized. Joint work with Dan Nurmi and Rich Wolski.
Top-Down Scheduling
Top-Down (outer loop):
  For each heuristic
    Until all components mapped
      Map available components to resources
  Select mapping with minimum makespan
Top-Down (mapping available components):
  While all available components not mapped
    For each (component, resource) pair
      ECT(c,r) = rank(c,r) + EAT(r)
    End For each
    Run min-min, max-min and sufferage
    Store mapping
  End while
Scheduling onto Batch-Queue Systems Details: Modification of Top-Down scheduler At every scheduling step, take into account the estimated time the job has to wait in the queue in the estimated completion time for the job [ECT(c,r) in the algorithm] Keep track of the queue wait times for each cluster and the number of nodes that correspond to the queue wait time With each mapping, update the estimated availability time [EAT in the algorithm] with the queue wait time, as required
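The queue-aware completion-time rule above can be sketched in a few lines. This is a minimal illustration, not the VGrADS scheduler: it covers only the min-min heuristic, only one set of already-available (independent) components, and it assumes a single queue-wait estimate per resource. The names `min_min_schedule`, `rank`, and `queue_wait` are hypothetical.

```python
def min_min_schedule(rank, queue_wait):
    """Map ready components to resources with the min-min heuristic.

    rank[c][r]    : predicted run time of component c on resource r
    queue_wait[r] : estimated batch-queue delay before r can start work
                    (assumed here to be one number per resource)
    Returns (mapping, makespan) where mapping[c] = (resource, completion time).
    """
    # EAT(r): estimated availability time, seeded with the queue wait.
    eat = dict(queue_wait)
    unmapped = set(rank)
    mapping = {}
    while unmapped:
        best = None
        for c in sorted(unmapped):
            for r in sorted(eat):
                ect = rank[c][r] + eat[r]  # ECT(c,r) = rank(c,r) + EAT(r)
                if best is None or ect < best[0]:
                    best = (ect, c, r)
        ect, c, r = best
        # Min-min: commit the pair with the smallest completion time,
        # then push the resource's availability time forward.
        mapping[c] = (r, ect)
        eat[r] = ect
        unmapped.remove(c)
    makespan = max(t for _, t in mapping.values())
    return mapping, makespan
```

With `rank = {"a": {"R0": 5, "R1": 9}, "b": {"R0": 6, "R1": 7}}`, zero queue waits place `a` on `R0` and `b` on `R1`; a queue wait of 20 on `R0` moves both components onto `R1`, which shows how the queue estimate changes the mapping and not just the predicted makespan.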
Scheduling onto Batch-Queue Systems: Example. [Figure: input DAG mapped over time onto resources R0 to R3 in Cluster 0 and Cluster 1] Queue wait time [Cluster 0] = 20 (# nodes for this wait time = 1). Queue wait time [Cluster 1] = 10 (# nodes for this wait time = 2).
Discussions Experiments to evaluate EMAN scheduling with batch-queues Control experiment Schedule with and without queue-wait estimates, run application with the two schedules on Teragrid and compare turnaround times Accuracy of the results - how close to actual Other future issues Predictive/opportunistic approach Submit to queues even before data arrives with hope that data arrives by the time job moves to the front of the queue Point-valued predictions of probabilistic systems are problematic Need to schedule based on ranges or distributions Probabilistic deadline scheduling
Cluster Scheduling (Mandal) Motivation: Scheduler scaling problem for ‘large’ Grids Idea: Schedule directly onto clusters Input: Workflow DAG with restricted structure - nodes at the same level do the same computation Set of available Clusters (numNodes, arch, CPU speed etc.) and inter-cluster network connectivity (latency, bandwidth) Per-node performance models for each cluster Output: Mapping: for each level the number of instances mapped to each cluster Objective: Minimize makespan
Scheduling onto Clusters: Modeling Abstract modeling of mapping problem for a DAG level Given: N instances M clusters r1..rM nodes/cluster t1..tM - rank value per node per cluster (incorporates both computation and communication) Aim: To find a partition (n1, n2,… nM) of N such that overall time is minimized with n1+n2+..nM = N Analytical solution: No ‘obvious’ solution because of discrete nature of problem
Scheduling onto Clusters: Iterative solution. Big picture: iterative assignment of tasks to clusters (DP approach).
  For each instance, i from 1 to N
    For each cluster, j from 1 to M
      Tentatively map i onto j
      Record makespan for each j by taking care of round(j)
    End For each
    Find cluster, p with minimum makespan increase
    Map i to p
    Update round(p), numMapped(p)
Complexity: O(#instances * #clusters)
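The iterative assignment can be sketched as follows. The cost model is an assumption: a cluster with r nodes runs its n instances in ceil(n / r) rounds, each lasting the per-node rank value t, and the instance goes to the cluster whose finish time after the tentative mapping is smallest (which also minimizes the increase in the level's makespan). The function name `schedule_level` is hypothetical.

```python
import math


def schedule_level(n_instances, nodes, rank):
    """Partition the instances of one DAG level across clusters.

    nodes[j] : number of nodes r_j in cluster j
    rank[j]  : time t_j for one instance on one node of cluster j
    Returns (mapped, makespan) with mapped[j] = instances given to cluster j.
    """
    n_clusters = len(nodes)
    mapped = [0] * n_clusters

    def finish(j, count):
        # Assumed model: instances run in rounds of nodes[j] at a time.
        rounds = math.ceil(count / nodes[j])
        return rounds * rank[j]

    for _ in range(n_instances):
        # Tentatively map the instance onto every cluster, keep the best.
        p = min(range(n_clusters), key=lambda j: finish(j, mapped[j] + 1))
        mapped[p] += 1

    makespan = max(finish(j, mapped[j]) for j in range(n_clusters))
    return mapped, makespan
```

For example, `schedule_level(7, [2, 4], [3, 5])` returns `([3, 4], 6)`: the small fast cluster takes a second round only once the larger cluster is full. The loop does one pass over the clusters per instance, matching the O(#instances * #clusters) bound.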
Scheduling onto Clusters: Evaluation Application Representative DAGs from Montage and EMAN with varying widths Known performance models Simulation Platform Resource Model: Synthetic cluster generator (Kee et al SC’04) Network Model: BRITE to generate network topology, generate latency/bandwidth following a truncated normal distribution Experiment Varying number of clusters (nodes) 250 to 1000 clusters (8.5K to 36K nodes) Ran three scheduling approaches Heuristic (min-min/max-min/sufferage heuristics based) Greedy (simple greedy heuristic based) Simple (the Cluster level scheduler) Compared turnaround time (Makespan + Scheduling Time)
Scheduling onto Clusters: Results Montage Application 103 node Montage DAG 717 node Montage DAG Cluster level Scheduler (Simple) offers Scalability - scales to ‘large’ Grids Improved turnaround time No significant degradation of application makespan quality
Scheduling onto Clusters: Results EMAN Application 171 node EMAN DAG 666 node EMAN DAG Cluster level Scheduler (Simple) offers Scalability - scales to ‘large’ Grids Improved turnaround time No significant degradation of application makespan quality
Robust Task Scheduling (Shi). Task scheduling: assigning the tasks of a meta-task (workflow-type application) to a set of resources so as to achieve certain goals, e.g. minimizing the schedule length. The problem is NP-complete, so finding an optimal solution is generally impractical. Approaches: heuristics (list scheduling, duplication, clustering) and optimization (genetic algorithms, simulated annealing, etc.). Previously we focused on list scheduling algorithms for the case where processors have different capabilities.
Non-deterministic environment Actual resource environment is non-deterministic inherently due to resource sharing. Previously we used expected values of execution times of tasks, network speed. The optimal solution for task scheduling problem with expected values of resource characteristics is NOT optimal for the corresponding problem with non-deterministic values. We focus on variable execution time in this work.
Possible Solutions. Static scheduling: overestimate the execution time to avoid exceeding the allotted use of the machine, at the expense of machine utilization; compute schedules for various scenarios and at run time adopt the one that fits the current status; find schedules more robust to variable execution time. Dynamic scheduling: at each scheduling point (when a task is ready to be executed), gather current resource information and compute a new schedule for the unscheduled tasks.
Robustness. Schedule delay and robustness are measured using M0(s), the makespan of schedule s obtained with expected values of execution time, and M(s), the makespan of schedule s with real execution times. Each realization of the execution times gives a different schedule delay.
Slack. The slack of a task node is defined as follows: slack(ni) = makespan - [b_level(ni) + t_level(ni)]. Slack is closely related to robustness: a large slack means a task node can tolerate a large increase in execution time without increasing the makespan.
Robustness and slack. [Figure: disjunctive graph for a schedule on three processors; P1: 1, 3, 8; P2: 2, 5, 7; P3: 4, 6, 9, 10] The disjunctive graph is used to calculate the expected makespan and the real makespan.
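The slack computation can be sketched directly from the formula slack(ni) = makespan - [b_level(ni) + t_level(ni)]. The level definitions used here are the usual ones and are an assumption about what the talk intends: t_level is the longest path from an entry node up to (not including) the node, and b_level is the longest path from the node (inclusive) to an exit node. Communication costs are ignored. To analyze a concrete schedule, the `edges` list would also carry the disjunctive edges between consecutive tasks on the same processor. The function name `slacks` is hypothetical.

```python
from collections import defaultdict


def slacks(weights, edges):
    """Return ({node: slack}, makespan) for a weighted DAG.

    weights[n] : execution time of node n (every node must appear here)
    edges      : list of (u, v) pairs meaning u must finish before v starts
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # Topological order (Kahn's algorithm); the list grows as we scan it.
    indeg = {n: len(pred[n]) for n in weights}
    order = [n for n in weights if indeg[n] == 0]
    for n in order:
        for v in succ[n]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)

    t_level = {}
    for n in order:
        t_level[n] = max((t_level[p] + weights[p] for p in pred[n]), default=0)

    b_level = {}
    for n in reversed(order):
        b_level[n] = weights[n] + max((b_level[s] for s in succ[n]), default=0)

    makespan = max(b_level.values())
    slack = {n: makespan - (t_level[n] + b_level[n]) for n in weights}
    return slack, makespan
```

For a diamond DAG with weights `{"a": 2, "b": 3, "c": 1, "d": 2}` and edges a→b, a→c, b→d, c→d, the makespan is 7 and only `c` has nonzero slack (2): it can run two time units longer without delaying `d`.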
Task execution time modeling. Least Time to Compute (LTC) matrix {ltcij}: time to compute task i on processor j, generated from a single number by applying a gamma distribution twice, once per dimension (machine, task); different values of the gamma parameters represent different heterogeneities of machines or tasks. Uncertainty level {ulij}: expected actual time to compute / least time to compute. Actual computation time: actij = ltcij * ulij.
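A small sketch of this model follows. The parameterization is an assumption, not taken from the talk: each gamma draw is specified by a mean and a coefficient of variation (shape = 1/cv², scale = mean/shape), the first draw sets a per-task mean from a single base value, and the second draw spreads that mean across processors. The names `gamma_with_mean`, `make_ltc`, and `actual_times` are hypothetical.

```python
import random


def gamma_with_mean(rng, mean, cv):
    """Draw from a gamma distribution with the given mean and coeff. of variation."""
    shape = 1.0 / (cv * cv)
    scale = mean / shape
    return rng.gammavariate(shape, scale)


def make_ltc(n_tasks, n_procs, base, task_cv, proc_cv, seed=0):
    """Build an LTC matrix: ltc[i][j] = least time for task i on processor j."""
    rng = random.Random(seed)
    ltc = []
    for _ in range(n_tasks):
        # First gamma draw: heterogeneity across tasks.
        task_mean = gamma_with_mean(rng, base, task_cv)
        # Second gamma draw: heterogeneity across machines for this task.
        ltc.append([gamma_with_mean(rng, task_mean, proc_cv)
                    for _ in range(n_procs)])
    return ltc


def actual_times(ltc, ul):
    """act[i][j] = ltc[i][j] * ul[i][j], with ul[i][j] >= 1 by definition."""
    return [[l * u for l, u in zip(l_row, u_row)]
            for l_row, u_row in zip(ltc, ul)]
```

A larger `task_cv` or `proc_cv` yields a more heterogeneous matrix, and a uniform uncertainty level such as `ul = [[1.5] * n_procs for _ in range(n_tasks)]` scales every entry by the same factor.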
Genetic Algorithm
1. [Start] Generate an initial population of n chromosomes (suitable solutions for the problem).
2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population.
3. [New population] Create a new population by repeating the following steps until the new population is complete:
   [Selection] Select two parent chromosomes from the population according to their fitness.
   [Crossover] With a crossover probability, cross over the parents to form new offspring (children). If no crossover is performed, the offspring are exact copies of the parents.
   [Mutation] With a mutation probability, mutate the new offspring at each locus (position in the chromosome).
   [Accepting] Place the new offspring in the new population.
4. [Replace] Use the newly generated population for a further run of the algorithm.
5. [Test] If the end condition is satisfied, stop and return the best solution in the current population.
6. [Loop] Go to step 2.
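The loop above can be sketched on a deliberately simplified problem. The assumptions are all illustrative: tasks are independent (no DAG), a chromosome is a list giving the processor index for each task, fitness is the inverse of the makespan, and the end condition is a fixed number of generations. The names `makespan` and `genetic_schedule` and all parameter defaults are hypothetical.

```python
import random


def makespan(chrom, times, n_procs):
    """Largest processor load when task i runs on processor chrom[i]."""
    load = [0.0] * n_procs
    for task, proc in enumerate(chrom):
        load[proc] += times[task]
    return max(load)


def genetic_schedule(times, n_procs, pop_size=30, generations=60,
                     p_cross=0.8, p_mut=0.05, seed=0):
    """Assign tasks (positive run times) to processors with a basic GA."""
    rng = random.Random(seed)
    n = len(times)
    # [Start] random initial population
    pop = [[rng.randrange(n_procs) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):                      # [Loop]
        # [Fitness] shorter makespan means higher fitness
        fit = [1.0 / makespan(c, times, n_procs) for c in pop]
        new_pop = []
        while len(new_pop) < pop_size:                # [New population]
            p1, p2 = rng.choices(pop, weights=fit, k=2)   # [Selection]
            c1, c2 = p1[:], p2[:]
            if n > 1 and rng.random() < p_cross:      # [Crossover] one-point
                cut = rng.randrange(1, n)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):
                for i in range(n):                    # [Mutation] per locus
                    if rng.random() < p_mut:
                        child[i] = rng.randrange(n_procs)
                new_pop.append(child)                 # [Accepting]
        pop = new_pop[:pop_size]                      # [Replace]
    # [Test] fixed generation budget reached: return the best chromosome
    return min(pop, key=lambda c: makespan(c, times, n_procs))
```

A call such as `genetic_schedule([4, 3, 2, 2, 1], n_procs=2)` returns one processor index per task. Because this sketch has no elitism, the best chromosome of an earlier generation can be lost; keeping the best individual across generations is a common refinement.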
Single objective optimization makespan optimization robustness optimization
Multi-objective optimization Goal: Minimize makespan and maximize robustness at the same time. Conflict - there cannot be a single optimum solution which simultaneously optimizes both objectives. Solution – seek balance between the 2 objectives.
Multi-objective optimization. Classical methods: weighted sum, which scalarizes multiple objectives into a single objective; ε-constraint, which optimizes one of the objectives subject to constraints imposed on the other objectives.
Weighted sum Objective function aws: average weighted slack: ni is scheduled on pj
ε-constraint. Objective: maximize aws (average weighted slack) subject to ms < ε*ms0. Solutions: feasible (ms < ε*ms0) or infeasible (ms ≥ ε*ms0). Fitness:
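The exact fitness expressions are not given in the text above, so the following is only a sketch of how the two classical scalarizations could be coded, under stated assumptions. For the weighted sum, both objectives are normalized by reference scales `ms_ref` and `aws_ref` before weighting. For the ε-constraint, fitness is a tuple compared lexicographically so that every feasible schedule outranks every infeasible one, and infeasible schedules are ranked by how far they exceed the limit. All function and parameter names are hypothetical.

```python
def weighted_sum_score(ms, aws, ms_ref, aws_ref, w=0.5):
    """Single score to maximize: reward slack, penalize makespan.

    ms, aws         : makespan and average weighted slack of a schedule
    ms_ref, aws_ref : normalizing scales (assumed; e.g. values of a baseline)
    w               : weight on makespan, 1 - w on slack
    """
    return (1.0 - w) * (aws / aws_ref) - w * (ms / ms_ref)


def epsilon_constraint_fitness(ms, aws, ms0, eps):
    """Fitness tuple to maximize: maximize aws subject to ms < eps * ms0.

    ms0 is the reference makespan and eps >= 1 the allowed stretch.
    """
    limit = eps * ms0
    if ms < limit:
        # Feasible: rank by average weighted slack.
        return (1, aws)
    # Infeasible: ranked below all feasible schedules, closer to the limit first.
    return (0, -(ms - limit))
```

With `ms0 = 100` and `eps = 1.2`, a schedule with makespan 110 and any slack outranks one with makespan 125, however much slack the latter has. The weighted-sum score depends on the chosen reference scales, which is why normalization of the objectives matters.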
Summary. Studied robust scheduling in a non-deterministic environment using a GA. Provided a measurement of robustness. A robust schedule can be generated by optimizing the average weighted slack (AWS) of a task graph. Makespan and robustness are two conflicting objectives, so multi-objective optimization methods are employed. The weighted sum method is easy to use and intuitive, but setting up an appropriate weight vector depends on the scaling of each objective function, and normalization of the objectives is usually required. ε-constraint methods let the user optimize one objective while imposing constraints on the other objectives.