Scheduling Strategies for Mapping Application Workflows Onto the Grid
  A. Mandal, K. Kennedy, C. Koelbel, G. Marin, J. Mellor-Crummey, B. Liu, L. Johnsson
The Forest
  – Performance Prediction (G. Marin, 2004) + Scheduling Heuristics (T. Braun, 1999)
  – Result: Static Schedule for Workflow Components
Environment
  – GrADSoft
    – Runs on top of Globus
    – Facilitates scheduling, launching, and monitoring of grid apps
  – This work extends GrADSoft to handle workflows (not only tightly coupled apps)
What’s a workflow?
  – A set of applications (workflow components) that must run in an order that respects their dependencies
  – Represented as a DAG (Directed Acyclic Graph): nodes are components, edges are dependencies
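To make the DAG view concrete, here is a minimal sketch of deriving one valid run order from a dependency graph. The component names are hypothetical placeholders, not taken from the talk, and the standard-library `graphlib` module (Python 3.9+) stands in for whatever a real workflow manager does internally.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key maps a component to the set of components
# that must finish before it can start (its predecessors in the DAG).
workflow = {
    "prepare": set(),
    "classify": {"prepare"},
    "align": {"prepare"},
    "reconstruct": {"classify", "align"},
}

def valid_order(dag):
    """Return one execution order that respects every dependency edge."""
    return list(TopologicalSorter(dag).static_order())

order = valid_order(workflow)
print(order)
```

"classify" and "align" have no edge between them, so either may run first (or both in parallel); that freedom is what a scheduler exploits.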
Workflow Scheduling
  – Condor DAGMan: dynamic, effectively random scheduling
  – This work's approach: static scheduling
    – Classic problem: given a set of machines, a set of jobs, and the performance of each job on each machine, schedule all jobs so as to minimize the makespan
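The classic problem on this slide can be written down compactly. The notation below is an assumption introduced for illustration (the slides do not define symbols), and it covers the simplest case of independent jobs that each machine runs back to back.

```latex
% Assumed notation (not from the slides):
%   T = set of jobs, M = set of machines,
%   p_{t,m} = predicted run time of job t on machine m,
%   \sigma : T \to M = a schedule assigning each job to one machine.
\[
  \mathrm{makespan}(\sigma) \;=\; \max_{m \in M} \; \sum_{t \in T,\; \sigma(t) = m} p_{t,m},
  \qquad
  \sigma^{*} \;=\; \arg\min_{\sigma} \; \mathrm{makespan}(\sigma)
\]
```

In words: the makespan is the finish time of the busiest machine, and the scheduler looks for the assignment that makes it as small as possible.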
Determining Machine Fitness
  – Marin and Mellor-Crummey’s performance models
    – For each workflow component and target machine, produce a performance model
    – Advantage of performance models over cycle-accurate simulations!
  – Add data transfer penalty (using Network Weather Service)
  – We now have the expected time to completion (ETC) of every task on every machine
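A minimal sketch of how the two ingredients above could be combined into an ETC table. Everything here is an illustrative assumption: the function name `build_etc`, the plain dictionaries standing in for performance-model output and for Network Weather Service forecasts, and the simple latency-plus-size-over-bandwidth transfer cost. It is not the GrADSoft implementation.

```python
def build_etc(compute_time, input_bytes, data_site, bandwidth, latency):
    """Return etc[task][machine] = predicted compute time + data transfer penalty.

    compute_time[task][machine]: seconds predicted by a performance model (assumed given)
    input_bytes[task]:           size of the task's input data
    data_site[task]:             machine where that input currently lives
    bandwidth[(src, dst)]:       forecast bytes/second between two machines
    latency[(src, dst)]:         forecast seconds of latency between two machines
    """
    etc = {}
    for task, per_machine in compute_time.items():
        etc[task] = {}
        src = data_site[task]
        for machine, seconds in per_machine.items():
            if src == machine:
                transfer = 0.0  # data is already local, so no penalty
            else:
                link = (src, machine)
                transfer = latency[link] + input_bytes[task] / bandwidth[link]
            etc[task][machine] = seconds + transfer
    return etc

# Hypothetical numbers: machine B computes faster, but the input lives on A.
etc = build_etc(
    compute_time={"classify": {"A": 100.0, "B": 60.0}},
    input_bytes={"classify": 1e9},
    data_site={"classify": "A"},
    bandwidth={("A", "B"): 1e8},
    latency={("A", "B"): 0.5},
)
print(etc)
```

The point of the penalty term is that the fastest machine is not always the fittest one once moving the input data is counted.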
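Minimum Multiprocessor Scheduling Problem
  – Classic problem is NP-complete
  – Use traditional heuristics:
    – Min-Min: schedule first the job whose best (minimum) completion time is smallest
    – Max-Min: schedule first the job whose best (minimum) completion time is largest
    – Sufferage: schedule first the job with the most to lose by waiting (largest gap between its best and second-best completion times)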
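The three heuristics differ only in which unscheduled job they pick next, which a short sketch can show. This is a simplified illustration under stated assumptions: jobs are independent (a workflow would have to be handled in dependency order), the ETC table is a plain dictionary, each picked job goes to the machine where it finishes earliest, and the function name `schedule` is hypothetical. The sufferage rule here is the simple "largest gap first" form, not necessarily the exact variant used in the work presented.

```python
def schedule(etc, heuristic="min-min"):
    """Map independent jobs to machines; return (assignment, makespan).

    etc[job][machine] is the expected run time of the job on that machine.
    """
    machines = sorted({m for row in etc.values() for m in row})
    ready = {m: 0.0 for m in machines}  # time at which each machine becomes free
    unscheduled = set(etc)
    assignment = {}
    while unscheduled:
        candidates = []
        for job in sorted(unscheduled):
            # Completion time of this job on every machine, earliest first.
            times = sorted((ready[m] + etc[job][m], m) for m in etc[job])
            best_time, best_machine = times[0]
            second_time = times[1][0] if len(times) > 1 else best_time
            if heuristic == "min-min":
                priority = -best_time               # smallest best time wins
            elif heuristic == "max-min":
                priority = best_time                # largest best time wins
            elif heuristic == "sufferage":
                priority = second_time - best_time  # most to lose by waiting wins
            else:
                raise ValueError(f"unknown heuristic: {heuristic}")
            candidates.append((priority, job, best_machine, best_time))
        # max() returns the first maximal entry, so ties go to the earlier job name.
        _, job, machine, finish = max(candidates, key=lambda c: c[0])
        assignment[job] = machine
        ready[machine] = finish
        unscheduled.remove(job)
    return assignment, max(ready.values())

# Hypothetical ETC table: three jobs, two machines.
etc = {
    "t1": {"A": 2, "B": 4},
    "t2": {"A": 3, "B": 9},
    "t3": {"A": 10, "B": 12},
}
for name in ("min-min", "max-min", "sufferage"):
    print(name, schedule(etc, name))
```

Because no single rule wins on every input, one reasonable design is to run all three and keep the schedule with the smallest makespan.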
Is This a Workflow Problem?
  – Scheduling only one component is easy (Marin already showed this works)
  – Scheduling many components may not be tractable
Evaluation
  – EMAN: Electron Micrograph Analysis
  – Almost the entire time is spent here
Evaluation
  – RN: Random scheduling (DAGMan)
  – RA: Weighted random scheduling
  – HC: Heuristic scheduling with crude performance models (CPU speed)
  – HA: Heuristic scheduling with accurate performance models (this scheme)
Evaluation Testbed
  – 147 machines of 4 types
    – 64 dual-processor 900MHz Itanium IA-64 nodes (RTC, Houston)
    – 16 Opteron 2009MHz nodes (Medusa, Houston)
    – 60 dual-processor 1300MHz Itanium IA-64 nodes (acrl, Houston)
    – 7 Pentium IA-32 nodes (Knoxville); used?
Results
  – 2.2x improvement over random scheduling
Discussion
  – Static vs. dynamic scheduling
    – Problems?
    – Why not use performance models dynamically?
  – Does this apply to workflows, or more to parameter sweeps?
  – How did they achieve load balance?
  – Barriers to adoption?