Scalable Systems Software Center
Resource Management and Accounting Working Group
Face-to-Face Meeting
January 15-16, 2004
Argonne, IL
Resource Management and Accounting Working Group

Working group scope
Progress over last quarter
Next steps
Topics for group consideration
Working Group Scope

The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting. This working group will focus on the following software components:
  – Queue Manager
  – Scheduler
  – Accounting and Allocation Manager
  – Meta Scheduler
Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
  – Process Manager
  – Cluster Monitor
Resource Management Component Architecture

[Architecture diagram. Components shown: Queue Manager, Allocation Manager, Node Monitor, Meta Scheduler, Local Scheduler, Node Manager, Process Manager, Security System, Information Service, Discovery Service, Event Manager. Color key by working group: Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure. Additional label: Infrastructure Services.]
Resource Management Prototype Demonstration

[Diagram. Components shown: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Local Scheduler, Process Manager, Discovery Service. Color key by working group: Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure.]

Numbered messages in the diagram:
  0 Service-Lookup
  1 Submit-Job
  2 Query-Job
  3 Query-Node
  4 Create-Reservation
  5 Run-Job
  6 Exec-Process
  7 Query-Job
  8 Delete-Job
  9 Withdraw-Allocation

This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.
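The numbered messages of the demonstration can be put into sequence with a small sketch. It only orders the message labels by the numbers printed on the slide; which component sends or receives each message is not recoverable from the slide text, so that is deliberately not modeled. The names `DEMO_MESSAGES` and `ordered_messages` are hypothetical and exist only for this illustration.

```python
# Message numbers and labels as they appear on the slide (in extraction order).
# Sender and receiver of each message are not stated in the text, so only the
# ordering is represented here.
DEMO_MESSAGES = [
    (1, "Submit-Job"),
    (3, "Query-Node"),
    (6, "Exec-Process"),
    (4, "Create-Reservation"),
    (2, "Query-Job"),
    (5, "Run-Job"),
    (8, "Delete-Job"),
    (0, "Service-Lookup"),
    (7, "Query-Job"),
    (9, "Withdraw-Allocation"),
]


def ordered_messages(messages=DEMO_MESSAGES):
    """Return the message labels sorted by their step number."""
    return [name for _, name in sorted(messages)]


if __name__ == "__main__":
    for step, name in enumerate(ordered_messages()):
        print(step, name)
```

Read in this order, the run starts with a service lookup and ends with the allocation being withdrawn after the job is deleted, which matches the described scenario of a job exceeding its wallclock limit.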
General Progress

Created Node Object Specification version 2.0
Implemented SSSRMAP v2 response/status codes
Completed portability testing for initial release components
  – AIX, Tru64, HP-UX, IRIX, Solaris, Linux
Completed system testing for SSSRMAP v2 and SC Release
  – on xtorc-sss, a RedHat 9.0 system (configured similarly to the OSCAR-sss target)
  – Included Maui, Bamboo, Warehouse, Process Manager, Gold, QBank, OpenPBS_sss, sss_xml_svr, etc.
General Progress

Released RMWG components for SC2003
  – packaged as tarballs, RPMs and OSCAR packages
  – Includes (some new) components:
      Bamboo Queue Manager v0.9.0
      Maui-sss Scheduler v3.2p0
      Gold Accounting and Allocation Manager v1.0.a0.0
      Warehouse System Monitor v0.6.0
RMWG Webpage updated with SC release
  – Added Bamboo, Gold and Warehouse
  – Linked into main SSS home page
General Progress

Deployed User Oriented Problem Response System
  – Implemented using RT
  – Created project and support queues for all RMWG components
Created SSSRMAP C-implementation module
Completed per-component interface specification documents (binding to SSSRMAP)
Something about our functionality milestones
Scheduler Progress

Generated Maui SSSRMAP binding document
Added response code support
Created SSS communication library containing reference implementation of SSSRMAP v2.0
XMLized Silver/Maui interface
Augmented implementation of SSSRMAP to use more of the advanced features (where, set, op, units)
Added support for (Warehouse) System Monitor Interface (and SSSRMAP v2 Node Object)
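To make the "advanced features" (where, set, op, units) more concrete, here is a minimal sketch of how a request carrying them might be assembled. The element and attribute names (`Request`, `Object`, `Set`, `Where`, `action`, `actor`, `name`, `op`, `units`) are assumptions based on the feature names on the slide and should be checked against the SSSRMAP message format specification. The object, field names and operator values in the usage note are invented for illustration, and `build_modify_request` is a hypothetical helper, not part of the Maui or SSS libraries.

```python
import xml.etree.ElementTree as ET


def build_modify_request(actor, obj, sets, wheres):
    """Build an illustrative request element.

    sets:   list of (name, value, op, units) tuples; op and units may be None
    wheres: list of (name, value, op) tuples; op may be None
    Element and attribute names are assumptions, not taken from the spec.
    """
    request = ET.Element("Request", {"action": "Modify", "actor": actor})
    ET.SubElement(request, "Object").text = obj
    for name, value, op, units in sets:
        attrs = {"name": name}
        if op:
            attrs["op"] = op
        if units:
            attrs["units"] = units
        ET.SubElement(request, "Set", attrs).text = str(value)
    for name, value, op in wheres:
        attrs = {"name": name}
        if op:
            attrs["op"] = op
        ET.SubElement(request, "Where", attrs).text = str(value)
    return request


def to_xml(request):
    """Serialize the request element to a string."""
    return ET.tostring(request, encoding="unicode")
```

For example, `build_modify_request("scheduler", "Job", [("WallDuration", 3600, "Inc", "seconds")], [("JobId", "job.42", "EQ")])` would describe "increase the wall duration by 3600 seconds for the job whose id equals job.42", with every one of those names being a placeholder.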
Scheduler Progress

Completed suspend/resume and checkpoint/restart based SSS calls (synchronized with the anticipated XML and tested with the QM as far as we can go)
  – blocked until we can test with the checkpoint/restart (CR) developers
Enhanced support for dynamic modification of job attributes (dynamic jobs)
  – blocked until support is provided in the PM and QM
Added support for policy specification for resource limit enforcement and tracking
  – blocked until support from the PM and QM progresses
Queue Manager Progress

Initial release of Bamboo made available in Nov.
Produced Queue Manager binding document for the SSSRMAP protocol.
Data storage via ODBC-compliant database fully implemented.
Packaging and installation scripts created for the sss-oscar release.
SSS suite has been installed on a cluster at Ames; it is not quite production ready, but close.
Accounting and Allocation Manager Progress

QBank
  – Portability testing has been completed
      Linux, AIX, Tru64, HP-UX, IRIX and Solaris
  – This is probably as far as we are going to take it
Gold
  – Released pre-alpha early SC release of Gold
      Public release under a BSD open source license (14 NOV 2003)
      Packaged as a tarball, RPM (RedHat Linux 9.0 and 7.3, x86), and initial OSCAR packaging
  – Added support for Service Directory registration
  – Implemented SSSRMAP v2 response/status codes
  – Implemented instance-level role-based authorization
Accounting and Allocation Manager Progress

Gold
  – Gold test results from PNNL 11.8TF cluster (MPP2) analyzed
      Accounting was coherent and stable over the 2-week test period
      Memory and performance issues analyzed with profiler
      Initial chunking implementation was shown to successfully handle large response messages
  – Progress on GUI
      Implemented SSSRMAP SSL and Password authentication
      User, Project and Machine management views nearly complete
      Added search filter to List (and Modify, Delete, Undelete) operations
  – Improved debug logging (implemented log4j and debug flags)
  – Portability enhancements (archived Java components into a jar file)
  – Documentation, packaging and installation refinements
  – Introduced GNU Readline support in interactive client
  – Creation of interim regression test suite (Condor DAGMan)
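The idea behind response chunking can be sketched as follows: rather than returning one very large response message, the server splits the result rows into bounded pieces that the client reassembles. This is only an illustration of the general technique. Gold's actual chunk format and wire behavior are not described on the slide, and `chunk_response` with its `chunk`/`of`/`rows` fields is a hypothetical construction.

```python
def chunk_response(rows, chunk_size):
    """Yield a large result set as a sequence of bounded chunks.

    Each chunk records its 1-based position and the total chunk count so a
    client can tell when the response is complete. The dictionary layout is
    an assumption made for this sketch.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    total = (len(rows) + chunk_size - 1) // chunk_size  # ceiling division
    for index in range(total):
        start = index * chunk_size
        yield {
            "chunk": index + 1,
            "of": total,
            "rows": rows[start:start + chunk_size],
        }


def reassemble(chunks):
    """Concatenate the rows of all chunks in order."""
    rows = []
    for chunk in chunks:
        rows.extend(chunk["rows"])
    return rows
```

Because `chunk_response` is a generator, the server side never has to hold more than one serialized chunk at a time, which is the property that helps with the memory issues of very large responses.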
Meta-Scheduler Progress

Added threaded support for local scheduler interface (can talk to multiple schedulers simultaneously)
Improved Silver installation procedure (autoconf)
Enhanced user commands to support direct reservation management
Successful deployment and testing of data-staging
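The benefit of a threaded local scheduler interface is that a slow or unreachable scheduler does not hold up queries to the others. A minimal sketch of that pattern, written in Python for illustration only and not taken from Silver's implementation, is shown here. The function `query_all` and the idea of representing each local scheduler as a callable are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def query_all(schedulers, query):
    """Send the same query to several local schedulers concurrently.

    schedulers: dict mapping a scheduler name to a callable that takes the
                query and returns a result (a stand-in for a real connection)
    Returns a dict mapping each name to ("ok", result) or ("error", message),
    so one failing scheduler does not prevent answers from the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(schedulers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in schedulers.items()}
        for name, future in futures.items():
            try:
                results[name] = ("ok", future.result())
            except Exception as exc:  # report the failure instead of aborting
                results[name] = ("error", str(exc))
    return results
```

In a real meta-scheduler each callable would wrap a network exchange with one site's scheduler, and a timeout on `future.result()` would be a natural addition.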
Future Work

Draft and release SSSRMAP v3 protocol specifications
Release alpha versions of new components (based on v2)
  – (Bamboo, Maui, Gold, Warehouse)
Portability testing for new (alpha release) components
  – (at least Linux, AIX, +other_UNIX)
Complete Design Specification documents for new components
Future Work

Local Scheduler
  – Complete integration of SSSRMAP v2 for queue objects
  – Support full suite of AM interface calls
  – Full support for multi-source RM interface
  – Add support for encryption
  – Intelligent decision response based on error codes
  – Full support for checkpoint/restart, dynamic jobs, and resource limit enforcement and tracking when enabled by other components
Future Work

Queue Manager
  – Retrieve exit codes and update to the Jan PM XML
  – Finish prologue/epilogue support (dependent on exit codes)
  – Interface with Node Monitor once process monitoring is supported
  – IO staging (may need API from Process Manager)
  – Full multi-step job support
  – Add support for optional site job submission verification script
Future Work

Accounting and Allocation Manager
  – Complete Allocation Management portion of GUI
  – Fully implement response chunking (part of v3)
  – Resolve performance issues (reimplement server in Perl?)
  – Automatic association deletion (undeletion)
  – Port Gold to other operating systems
  – Production deployment of Gold on 11.8TF Linux cluster (as primary allocation system)
  – Support for challenge/SSL with Directory Service
  – Open source QBank
Future Work

Meta Scheduler
  – More Silver client development
  – Update documentation
  – Enhance co-allocation support (tighter specification language)
  – Implement SSSRMAP v2 Wire Protocol and Message Format
  – Add allocation manager interface support
Issues requiring inter-group discussion

Need process exit codes from process manager
Need process manager support for resource limit enforcement
Timeframe/schedule for dynamic jobs
Schedule for integrating/testing with checkpoint/restart
Discuss possibility of support for encryption(/type?) within Service Directory
Portability Testing Progress