Scalable Systems Software Center
Resource Management and Accounting Working Group
Face-to-Face Meeting
October 10-11, 2002
Resource Management and Accounting Working Group
- Working Group Scope and Components
- Progress made
- Current issues being worked
- Next steps
- Discussions involving larger group
Working Group Scope
The Resource Management Working Group is involved in the areas of resource management, scheduling, and accounting.
This working group will focus on the following software components:
- Queue Manager / Job Manager
- Scheduler
- Accounting and Allocation Manager
- Meta Scheduler
Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
- Process Manager
- Node Monitor
Proposed Component Architecture
[Diagram: components color-coded by working group]
Components: Queue Manager, Allocation Manager, Node Monitor, Meta Scheduler, Local Scheduler, Node Manager, Process Manager, Security System, Information Service, Discovery Service, Event Manager
Color key (working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure; Infrastructure Services
Resource Management Prototype Demonstration
[Diagram: numbered message flow between components, color-coded by working group]
Components: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Local Scheduler, Process Manager, Discovery Service
Color key (working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure
Message sequence:
0 Service-Lookup
1 Submit-Job
2 Query-Job
3 Query-Node
4 Create-Reservation
5 Run-Job
6 Exec-Process
7 Query-Job
8 Delete-Job
9 Withdraw-Allocation
This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.
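The numbered labels on the demo slide can be put back into execution order with a few lines of Python. This is only an illustrative sketch: the step numbers and message names come from the slide, while the dictionary layout and the `ordered_steps` helper are assumptions made for this example and are not part of SSSRMAP. The slide does not state which component sends each message, so none is recorded here.

```python
# Step numbers and message names as they appear on the demo slide.
# The data layout and helper name are hypothetical, for illustration only.
DEMO_STEPS = {
    1: "Submit-Job",
    3: "Query-Node",
    6: "Exec-Process",
    4: "Create-Reservation",
    2: "Query-Job",
    5: "Run-Job",
    8: "Delete-Job",
    0: "Service-Lookup",
    7: "Query-Job",
    9: "Withdraw-Allocation",
}


def ordered_steps(steps):
    """Return the message names sorted by their step number."""
    return [name for _, name in sorted(steps.items())]


if __name__ == "__main__":
    for number, name in enumerate(ordered_steps(DEMO_STEPS)):
        print(number, name)
```

Read in order, the flow starts with a service lookup and ends with the allocation being withdrawn after the job is deleted, which matches the slide's description of a job running past its wallclock limit.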
General Progress
- Initial draft of Scalable Systems Software Resource Management and Accounting Protocol (SSSRMAP) completed
- Requirements documents nearly complete for all components
- All components under revision control
Scheduler Progress
- Extended internal XML usage
- Implemented SSSRMAP XML interface for queue manager, node monitor, and allocation/accounting manager
- Enhanced internal scalability to support up to 50,000 nodes
- Added support for HTTP framing protocol
- Added internal suspend/resume and checkpoint/requeue management code (interfaced to PBS, LSF, and LL)
- Created subset of XML-based job control and state control clients for use with GUI tools
- Significant testing and documentation of existing features (priority and QOS enhancements)
Queue Manager Progress
- Conformance to the SSSRMAP XML specification
- Synchronization of the job attribute types with PBS SSS front-end
- Full wire protocol compatibility with basic, challenge, and ANL versions of basic and challenge
- Multiple server ports employed to allow multiple client protocols simultaneously
- New interface with Event Manager
- Added job signaling support with the Process Manager
Allocation Manager Progress
- Requirements and survey sent out to 15 sites and vendors
- Allocation management component placed under BitKeeper
- Implemented HTTP framing protocol and tested performance
- Support for expression grouping in queries
- Journaling implemented: undo and redo working
- Got SHA1-HMAC security working with QBank/Maui
- Reframed bank objects (accounts, users, allocations, etc.) as dynamically introduced objects
- Object actions defined in metadata cache
- Creation of dynamic web GUI using PHP and JavaScript (forms for object creation, querying, modification, deletion, and undeletion)
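For readers unfamiliar with SHA1-HMAC, the sketch below shows the general technique using Python's standard library. It is not the QBank/Maui implementation or the SSSRMAP wire format: the shared key, the message payload, and the function names are all hypothetical and chosen only to show how a keyed digest is produced and checked.

```python
import hashlib
import hmac


def sign_message(key: bytes, message: bytes) -> str:
    """Return the hex-encoded HMAC-SHA1 digest of message under key."""
    return hmac.new(key, message, hashlib.sha1).hexdigest()


def verify_message(key: bytes, message: bytes, signature: str) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = sign_message(key, message)
    return hmac.compare_digest(expected, signature)


if __name__ == "__main__":
    shared_key = b"example-shared-secret"       # hypothetical key
    request = b"<Request action='Query'/>"      # hypothetical payload
    tag = sign_message(shared_key, request)
    print(verify_message(shared_key, request, tag))         # untampered message
    print(verify_message(shared_key, request + b"x", tag))  # altered message
```

Both sides must hold the same key; a receiver that recomputes a different digest knows the message was altered or was signed with another key.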
Meta Scheduler Progress
- Development of submission client
- Support for PBS 'command file' keywords and semantics
- Ability to run jobs end-to-end
- Fault tolerance improvements (cluster scheduler reconnection and global JobId tracking)
- Added interfaces to interoperate with grid systems (Globus)
- Improved user interface
- Partial XML local scheduler-meta scheduler language defined and implemented
Current Issues
- Job state management for Queue Manager
- Data staging
- Job signaling
- Support for job steps
- Integration with Node Monitor
Next Work
- Prepare for SC demos
- Scalability testing
- Release v1.0 of Resource Management System for existing components
- Basic documentation
- Security authentication
- Need to solidify RMS-wide standards for packaging, build procedure, revision control, and distribution home
Scheduler Future
- Integrate SSS security protocols
- Extend GUI support
- Full support for XML allocation manager language
- Extend SSS language to support suspend/resume and checkpoint/requeue
- Test TM interface fault tolerance features (corrupt data, bad connections, etc.)
Queue Manager Future
- Add epilogue/prologue support
- Add job submission verification script
- Interface with Node Monitor
- Full PBS qsub compatibility
- Add interface with Node Manager to support job-dependent node OS image installation
Allocation Manager Future
- Focus on getting QBank ready for bundling and release with SSS RMS system (security, use key, improved installation procedure)
- Focus effort on open-sourcing the new Allocation Manager (gold)
- Implementation of enhanced allocation and reservation mechanisms that use a simple pricing engine and log job and usage data
- Security authentication (gold)
- Support for operations on returned fields (sort, sum, max, unique, group by, etc.)
- Integrate SSSLIB connection protocol and discovery service
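The planned operations on returned fields (sort, sum, max, unique, group by) can be illustrated on a small in-memory result set. This is a sketch only: the rows, the field names `user`, `account`, and `charge`, and the `group_sum` helper are invented for the example and do not reflect the Allocation Manager's actual schema or query language.

```python
from collections import defaultdict

# Hypothetical query result; field names are assumptions for illustration.
rows = [
    {"user": "alice", "account": "chem", "charge": 120},
    {"user": "bob", "account": "chem", "charge": 80},
    {"user": "alice", "account": "bio", "charge": 50},
]


def group_sum(records, key, field):
    """Group records by `key` and sum `field` within each group."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record[field]
    return dict(totals)


sorted_rows = sorted(rows, key=lambda r: r["charge"])  # sort
total_charge = sum(r["charge"] for r in rows)          # sum
largest_charge = max(r["charge"] for r in rows)        # max
unique_users = sorted({r["user"] for r in rows})       # unique
charge_by_account = group_sum(rows, "account", "charge")  # group by

if __name__ == "__main__":
    print(total_charge, largest_charge, unique_users, charge_by_account)
```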
Meta Scheduler Future
- Fault tolerance improvements
- Initial data management (data stage-in/stage-back)
- Full XML local scheduler-meta scheduler language defined and implemented
Issues requiring inter-group coordination
- Resource controller for handling switch allocation, licenses, and resource limit enforcement (logical partitioning)
- How are checkpointing and suspend/resume requests routed?
- Who manages node access control?
- Dynamic jobs