GangLL
Gang Scheduling on the IBM SP
Andy B. Yoo and Morris A. Jette, Lawrence Livermore National Laboratory {yoo2,
Jose Moreira and Liana Fong, IBM T. J. Watson Research Center {jmoreira,
Gang Scheduling Overview
Permits time-sharing or preemption of parallel jobs
»All tasks of a parallel job are grouped into a “gang”, then suspended and resumed synchronously
Adds another dimension to scheduling
»Virtual machines created as needed
Responsiveness improved by 30% in our tests
»More users/jobs can make progress (at a reduced rate)
»High priority work started quickly
Utilization improved
»Large jobs can be started without accumulating resources piecemeal while smaller jobs complete
»Well utilized virtual machines can be allocated more resources
LoadLeveler Scheduling Example
Poor Responsiveness and System Utilization
[Figure: node allocation at times 0:00, 0:01, 0:30, 1:00, 1:30, and 2:00. At 0:00, Jobs A through E are running when high-priority Job X arrives. In the following snapshots, nodes released by finished jobs are left empty. At 2:00, Job X is running alongside Job E.]
Initiation delayed for 2 hours
GangLL Scheduling: Preemption
Good Responsiveness and System Utilization
[Figure: two snapshots of node allocation. Initial state: Jobs A through E are running when high-priority Job X arrives. New configuration (seconds later): Jobs A, X, and E are running; Jobs B, C, and D are stopped until Job X completes.]
GangLL Scheduling: Timesharing
Good Responsiveness and System Utilization
[Figure: node allocation per time slice.
Time 0: Jobs A, B, C, D, E
Time 1: Jobs A, X (high priority), F
Time 2: Jobs A, X (high priority), E
Time 3: Jobs A, B, C, D, F
Time 4: Jobs A, X (high priority), E
Time 5: Jobs A, X (high priority), F
Etc.]
GangLL Design
Built into LoadLeveler
Can schedule multiple jobs per node
Global time scheduling performed by GangLL central manager
»Scheduling matrix changes when jobs initiated or terminated
»Scheduling matrix distributed to nodes as needed
»Each node follows its individual schedule
»Nodes must have synchronized clocks
Context switching
»All processes of a job stop (SIGSTOP) or resume (SIGCONT) synchronously
»User Space switch window state saved/resumed at context switch time by Communications Sub-System (CSS)
Joint LLNL and IBM design and development effort
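The context switching bullets above can be illustrated with a minimal Python sketch of what one node might do at a slice boundary: stop every process of the outgoing gang, then resume every process of the incoming gang. This is not GangLL code. The function names and the injectable `send` parameter are assumptions made for illustration, and the CSS switch window save/restore step is not modeled.

```python
import os
import signal
from typing import Callable, Iterable, List


def signal_gang(pids: Iterable[int], sig: int,
                send: Callable[[int, int], None] = os.kill) -> List[int]:
    """Deliver `sig` to every process of one gang; return the PIDs signalled."""
    signalled = []
    for pid in pids:
        try:
            send(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            # The task has already exited, so there is nothing to stop or resume.
            continue
    return signalled


def context_switch(outgoing: Iterable[int], incoming: Iterable[int],
                   send: Callable[[int, int], None] = os.kill) -> None:
    """Stop the outgoing gang first, then resume the incoming gang."""
    signal_gang(outgoing, signal.SIGSTOP, send)
    signal_gang(incoming, signal.SIGCONT, send)
```

Passing a fake `send` callable lets the ordering be checked without touching real processes; with the default `os.kill`, the caller needs permission to signal the listed PIDs.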
Scheduling Matrix Distribution
[Figure: the scheduling matrix from the GangLL central manager is propagated through Nodes 1 to 7.]
New matrix is divided and propagated through nodes with acknowledgement and commit before taking effect
Each node keeps only its column of the scheduling matrix and operates according to that
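A rough Python sketch of the idea on this slide follows: the matrix is split by column, each node stages its column and acknowledges, and the new schedule takes effect only after every node has acknowledged. The `Node` class, `split_columns`, and `distribute` are hypothetical names. The sketch assumes rows are time slices and columns are nodes, sends columns directly instead of propagating them node to node, and assumes the schedule repeats cyclically.

```python
from typing import Dict, List, Optional, Sequence

Matrix = List[List[Optional[str]]]  # rows are time slices, columns are nodes


def split_columns(matrix: Matrix,
                  node_names: Sequence[str]) -> Dict[str, List[Optional[str]]]:
    """Return each node's column: the job it runs in slice 0, slice 1, ..."""
    for row in matrix:
        if len(row) != len(node_names):
            raise ValueError("every row needs exactly one entry per node")
    return {name: [row[i] for row in matrix] for i, name in enumerate(node_names)}


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.active: List[Optional[str]] = []                # column in effect
        self.staged: Optional[List[Optional[str]]] = None    # received, not committed

    def stage(self, column: Sequence[Optional[str]]) -> bool:
        """Store the new column without using it yet; True is the acknowledgement."""
        self.staged = list(column)
        return True

    def commit(self) -> None:
        """Make the staged column the one this node follows."""
        if self.staged is not None:
            self.active, self.staged = self.staged, None

    def job_for_slice(self, slice_index: int) -> Optional[str]:
        """Job to run in a slice (None means idle); the column repeats cyclically."""
        if not self.active:
            return None
        return self.active[slice_index % len(self.active)]


def distribute(matrix: Matrix, nodes: Sequence[Node]) -> bool:
    """Stage a column on every node, then commit only if all acknowledged."""
    columns = split_columns(matrix, [n.name for n in nodes])
    acks = [n.stage(columns[n.name]) for n in nodes]
    if not all(acks):
        return False
    for n in nodes:
        n.commit()
    return True
```

Separating `stage` from `commit` keeps a node on its old column until the whole new matrix is known to have arrived everywhere, which is the point of the acknowledgement step.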
GangLL Configurability
Each LoadLeveler job class has several scheduling parameters
»Which job classes it can time-share with
–How large a slice should it get relative to other job classes in time-sharing mode?
»Which job classes it can preempt (stop)
Each node has a multi-programming level (number of concurrent jobs allowed)
Duration of time-slice configurable
»Typical values 15 seconds to 15 minutes
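The per-class parameters listed above can be pictured as a small table. The Python sketch below is only a model of that table, not LoadLeveler configuration syntax: the `JobClass` fields, the helper functions, the weights, and the 60-second base slice are all assumptions. It also assumes that time-sharing must be permitted in both directions and that a class's slice is the base slice multiplied by its weight.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class JobClass:
    name: str
    slice_weight: int = 1                         # relative slice size (assumed integer)
    timeshares_with: FrozenSet[str] = frozenset()
    preempts: FrozenSet[str] = frozenset()


def can_timeshare(a: JobClass, b: JobClass) -> bool:
    """Assumption: both classes must list each other."""
    return b.name in a.timeshares_with and a.name in b.timeshares_with


def can_preempt(a: JobClass, b: JobClass) -> bool:
    """True if class `a` is configured to stop jobs of class `b`."""
    return b.name in a.preempts


def slice_seconds(classes: Iterable[JobClass], base_seconds: int) -> Dict[str, int]:
    """Scale the base time slice by each class's relative weight."""
    return {c.name: base_seconds * c.slice_weight for c in classes}


# Hypothetical values for illustration only.
pdebug = JobClass("pdebug", 1, frozenset({"pbatch"}))
pbatch = JobClass("pbatch", 3, frozenset({"pdebug"}))
expedited = JobClass("expedited", 1, preempts=frozenset({"pbatch", "pdebug"}))

lengths = slice_seconds([pdebug, pbatch, expedited], base_seconds=60)
```

With these made-up weights, a pbatch job would receive a slice three times as long as a pdebug job sharing the same nodes.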
GangLL User Considerations
Only thread-safe User Space communications support preemption (due to manpower constraints)
»MPL not supported (impacts MPICH-G)
Jobs must link with thread-safe libraries OR use IP communications OR use a non-preemptable job class
»Size and time limits on non-preemptable job classes may be restricted
»Jobs will be killed if a GangLL preemption request is ignored
GangLL User Considerations
Ptrace, DPCL, and TotalView jobs need to be made non-preemptable
»TotalView modifies application with time-critical connections
»LLNL version of TotalView integrated with GangLL tool
Real-time clock no longer reflects actual run time
»Run time clock to be added at a later time
»Use clock() function for CPU time used for now
The xgang tool will show real-time scheduling activities
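The difference between wall-clock time and CPU time can be seen in a short Python sketch. The slide refers to the C `clock()` function; `time.process_time()` is used here as the closest Python analogue, which is an assumption of this example, and `timed` is a hypothetical helper. Sleeping stands in for the time a job spends stopped: it adds to elapsed wall-clock time but not to CPU time.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float, float]:
    """Run fn(*args) and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # user + system CPU time of this process
    result = fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# While sleeping, the wall clock advances but almost no CPU time is consumed.
_, wall, cpu = timed(time.sleep, 0.2)
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")
```

A time-shared job that reports only the wall-clock figure would overstate its run time by however long it spent suspended.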
Sample xgang Displays
Context Switch Issues
Context switches are fast if paging is not induced
»Paging is painfully slow (79 minutes for 2.5 GB in+out)
»Most LLNL applications have modest memory demands
»Development underway to avoid paging by preventing large memory jobs from being time-shared
Need to avoid oversubscribing disk space
»Does not appear to be a problem at LLNL
It is configurable which LoadLeveler job classes can preempt or time-share with other specific job classes
Job can also explicitly be preempted or made non-preemptable
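One way to picture "preventing large memory jobs from being time-shared" is an admission check made before a job is added to a node's schedule. The Python sketch below is an illustration under stated assumptions, not the mechanism under development: the function names, the per-node memory estimates, and the `reserved_mb` allowance for the operating system are all hypothetical.

```python
from typing import Iterable


def fits_in_memory(job_mem_mb: Iterable[int], node_mem_mb: int,
                   reserved_mb: int = 0) -> bool:
    """True if all jobs can stay resident at once, so switching causes no paging."""
    return sum(job_mem_mb) + reserved_mb <= node_mem_mb


def may_timeshare(new_job_mb: int, resident_jobs_mb: Iterable[int],
                  node_mem_mb: int, reserved_mb: int = 0) -> bool:
    """Admit a new job for time-sharing only if the combined footprint fits."""
    combined = [new_job_mb, *resident_jobs_mb]
    return fits_in_memory(combined, node_mem_mb, reserved_mb)
```

A check like this is only as good as its inputs, so it depends on a reasonable per-job memory estimate being available.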
LLNL Configuration Before GangLL
Parallel debug class - fast response for development
»Only have access to a small portion of computer (8 nodes)
»Small (4 node) and short (1 hour) jobs
Parallel batch class - “production” jobs
»Majority of machine resources (315 nodes)
»Relatively short time limits to provide daytime responsiveness
–8AM-5PM 2 hours (OK for development, but too short for production work)
–5PM-8AM weeknights <=8 hours (gradually lowered through morning)
–5PM Friday to 8AM Monday <=12 hours (gradually lowered through morning)
»Scheduled with backfill algorithm
»Scheduling large node count jobs wastes significant resources due to limited selection of jobs for backfill
Expedited jobs - To be run ASAP
»Uses parallel batch class nodes
»Other jobs are manually terminated to free resources
[Figure: machine partitions labeled pdebug and pbatch + expedited.]
LLNL Configuration After Gang Scheduling (preliminary)
Parallel debug class - responsive development runs
»Separate partition eliminated
»Larger job sizes (32 node) and longer run times (2 hours) permitted, access to all nodes permitted on demand
Parallel batch class - “production” jobs
»Longer run times permitted without reducing responsiveness
–Always 12 hours (or more?)
Expedited class - To be run ASAP
»Other jobs are automatically preempted to free resources
Non-stop class - Jobs which cannot be preempted (new)
»Access to limited node count and shorter run times
Large job class - Jobs with large node counts
»Can preempt jobs as desired, no need to accumulate resources, backfill less critical (LLNL workload has few short running jobs)
GangLL Status
In production use at LLNL since November 1999
Not an IBM supported product, but limited support is available
Management of memory to avoid paging still needs work
»Essential for fully operational system
»Need accurate estimate of job’s memory need (user issue)
»Need enforcement of memory limits (GangLL + AIX issue)
Biggest technical problem is slow AIX paging
»Being addressed