Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting Aug 26-27, 2004 Argonne, IL.


1 Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting Aug 26-27, 2004 Argonne, IL

2 Resource Management and Accounting Working Group
Working group scope
Progress since last face-to-face
Future Work
Other issues

3 Working Group Scope
The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting.
This working group will focus on the following software components:
–Queue Manager
–Scheduler
–Accounting and Allocation Manager
–Meta Scheduler
Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
–Process Manager
–Cluster Monitor

4 Resource Management Component Architecture
[Diagram] Components: Queue Manager, Allocation Manager, Node Monitor, Grid Scheduler, Cluster Scheduler, Node Manager, Process Manager, Security System, Information Service, Discovery Service, Event Manager
Color key (by working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure; Infrastructure Services

5 Resource Management Prototype Demonstration
[Diagram] Components: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Cluster Scheduler, Process Manager, Discovery Service
Color key (by working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure
Message sequence: 0 Service-Lookup, 1 Submit-Job, 2 Query-Job, 3 Query-Node, 4 Create-Reservation, 5 Run-Job, 6 Exec-Process, 7 Query-Job, 8 Delete-Job, 9 Withdraw-Allocation
This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.

6 General Progress
Updated and implemented SSSRMAP v3 specifications
–SSSRMAP Wire Protocol v3.0.3: uses chunked HTTP transfer encoding
–SSSRMAP Message Format v3.0.3: moved condition, assignment and option values into body of Element (instead of in value attribute)
–SSS Job Object v3.0.3: added job properties in support of input/output, interactive jobs, dynamic jobs, suspend/resume, checkpoint/restart, resource limit enforcement, partitions, charges
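As an illustration of the chunked HTTP transfer encoding mentioned for the wire protocol, the following is a minimal sketch of standard HTTP/1.1 chunk framing. It is not taken from the SSSRMAP specification; the function names and the sample XML payload are hypothetical placeholders.

```python
def encode_chunked(payload: bytes, chunk_size: int = 16) -> bytes:
    """Frame a payload with HTTP/1.1 chunked transfer encoding."""
    out = bytearray()
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        # Each chunk: size in hex, CRLF, the data, CRLF.
        out += f"{len(chunk):X}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    # A zero-length chunk (with no trailers) terminates the body.
    out += b"0\r\n\r\n"
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Reassemble the payload from a chunked body (trailers are ignored)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = data.index(b"\r\n", pos)
        # Drop any chunk extensions after ';' before parsing the hex size.
        size = int(data[pos:line_end].split(b";", 1)[0], 16)
        pos = line_end + 2
        if size == 0:
            break
        body += data[pos:pos + size]
        if data[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk is not terminated by CRLF")
        pos += size + 2
    return bytes(body)


if __name__ == "__main__":
    # Placeholder XML, not an actual SSSRMAP message.
    message = b"<Request action='Query'><Object>Job</Object></Request>"
    framed = encode_chunked(message)
    assert decode_chunked(framed) == message
    print(framed)
```

Chunking lets a sender start transmitting a response before its total length is known, which fits large query results.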

7 General Progress
Completed system testing for Second Alpha Release
–on xtorc-sss, a RedHat 9.0 System
–Included Maui, Bamboo, Warehouse, Gold, Process Manager, etc.
Released second alpha versions of RMWG components
–Fully implements version 3 of the SSSRMAP specification
–Bamboo Queue Manager v0.9.6
–Maui Scheduler v3.2.6p9 (production version)
–Gold Accounting and Allocation Manager v1.0.a2.1
–Warehouse System Monitor v0.7.0
RMWG Webpage updated with Second Alpha release
–Updated info, docs, downloads, etc.
–Added an interactive FAQ engine (FAQOMATIC)

8 Cluster Scheduler Progress
Completed merger of Maui 3.2 and Maui SSS
Further added intrinsic support for SSS messages
–client-server, allocation manager, queue manager, resource manager interfaces, callbacks
–Status object
–Error codes
Enhanced support for SSS node and job objects
–allocation manager, queue manager, resource manager interfaces
–extended MCom library to support additional node and job object attributes
Improved socket and XML call reliability and security (added buffer checking and detailed failure reporting)
Built the SSS integration guide and updated Maui documentation

9 Queue Manager Progress
Third release of Bamboo made available
Supports basic SSSRMAP v3 message format
Interactive job support finished and tested
New submission client to handle LoadLeveler job scripts
Packaging updated to separate out components required on the execution nodes
Added support for job dependencies (i.e., chained jobs are now supported)
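To make the idea of chained jobs concrete, here is a small sketch that derives a valid run order from job dependencies. It is not Bamboo's implementation; the job names and the `run_order` helper are hypothetical, and it uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return job names in an order that respects their dependencies.

    `dependencies` maps each job to the set of jobs that must finish first.
    Raises ValueError if the dependencies contain a cycle.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"cyclic job dependencies: {exc.args[1]}") from exc


if __name__ == "__main__":
    # A three-job chain: each job waits for the previous one (hypothetical names).
    chain = {
        "prepare": set(),
        "simulate": {"prepare"},
        "postprocess": {"simulate"},
    }
    print(run_order(chain))
```

A queue manager would hold each dependent job until its prerequisites complete; rejecting cycles at submission time avoids jobs that can never become eligible.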

10 Queue Manager Progress
PM interface updated to use scoping of signal
–Job termination code changed to implement a “soft” kill (i.e., SIGTERM followed later by a SIGKILL, if needed)
SSS suite was updated on cluster in Ames in July
–Appears to resolve most known problems
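The “soft” kill pattern described above can be sketched as follows for a single local process on a POSIX system. This is an illustration only, not the Bamboo or Process Manager code; the `soft_kill` name and the grace period are assumptions.

```python
import signal
import subprocess
import sys


def soft_kill(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM, wait for a grace period, then SIGKILL if still running.

    Returns the process's exit status (negative signal number on POSIX
    when the process was terminated by a signal).
    """
    if proc.poll() is not None:
        return proc.returncode  # already exited
    proc.send_signal(signal.SIGTERM)  # polite request: allows cleanup
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX: cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    # A child that simply sleeps exits on the first SIGTERM.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", soft_kill(child, grace_seconds=2.0))
```

The grace period gives a job the chance to flush output or write a checkpoint before it is forcibly removed, which matters when a job is terminated for exceeding its wallclock limit.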

11 Accounting and Allocation Manager Progress
Completed rewrite of Gold server and all business logic in Perl
Significantly improved account/allocation design
Created an account statement report
Implemented hierarchical account nesting and tested trickle down deposits and trickle up charges
Implemented and tested credit accounts
Added support for auto-creation of users, projects and machines
Implemented automatic recursive association deletion/undeletion
Added support for query row limit, object aliases
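One possible reading of hierarchical account nesting is sketched below: a deposit into a parent account is split among its children according to per-child shares (trickle down), and a charge that exceeds a child's balance draws the remainder from its ancestors (trickle up). These semantics, the `Account` class, and its fields are assumptions made for illustration; Gold's actual design may differ.

```python
class Account:
    """Toy nested account; all names and rules here are hypothetical."""

    def __init__(self, name, balance=0.0):
        self.name = name
        self.balance = balance
        self.parent = None
        self.children = []  # list of (child_account, deposit_share) pairs

    def add_child(self, child, share):
        """Attach a child that receives `share` (0..1) of each deposit."""
        child.parent = self
        self.children.append((child, share))

    def deposit(self, amount):
        """Trickle down: forward each child's share, keep the rest here."""
        forwarded = 0.0
        for child, share in self.children:
            part = amount * share
            child.deposit(part)
            forwarded += part
        self.balance += amount - forwarded

    def charge(self, amount):
        """Trickle up: pay from this account first, then from ancestors."""
        chain = []
        node = self
        while node is not None:
            chain.append(node)
            node = node.parent
        # Check the whole chain first so a failed charge changes nothing.
        if sum(acct.balance for acct in chain) < amount:
            raise ValueError("insufficient funds in account chain")
        remaining = amount
        for acct in chain:
            paid = min(acct.balance, remaining)
            acct.balance -= paid
            remaining -= paid
            if remaining == 0:
                break


if __name__ == "__main__":
    project = Account("project")
    team_a, team_b = Account("team_a"), Account("team_b")
    project.add_child(team_a, 0.5)
    project.add_child(team_b, 0.25)
    project.deposit(1000.0)
    team_a.charge(600.0)  # 500 from team_a, 100 from project
    print(project.balance, team_a.balance, team_b.balance)
```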

12 Accounting and Allocation Manager Progress
Made compliant with SSSRMAP v3 specification
Fully implemented response chunking
Updated clients and Gold User’s Guide
Completed Allocation, Reservation, Quotation, and ChargeRates portions of GUI
Further simplified dependent module installation
Updated Component and Application Binding docs (v3.0.3)
Released Second Alpha release of Gold
–Regression and system tested on RedHat 9.0 (xtorc-sss)
Upgraded Gold on PNNL SGI cluster to the latest second alpha version

13 Grid Scheduler Progress
Migrated grid scheduler interface to use SSS message format for all scheduler-grid scheduler interface calls
Migrated Silver client commands to utilize SSS MCom XML library
Enhanced global queue management
Added diagnostic clients
Verified new job management state machine

14 Grid Scheduler Progress
Introduced three new SSS objects
–developed new SSS time range object
–defined and implemented support for cluster to grid scheduler interface reservation object
–proposed new cluster/machine object for exchanging high level policy and resource availability information

15 Future Work
Beta release of all components
–Including new Silver Meta-scheduler
Portability testing for new components
–Tier 1: Linux::RedHat (9.0)
–Tier 2: Linux::SuSE, AIX, Tru-64
–Tier 3: OS-X, Unicos
–Tier 4: HP-UX, IRIX, Solaris
Fault Tolerance supporting 25% cluster loss
Complete Design Specification documents for new components

16 Future Work: Cluster Scheduler
Convert to using SSS job object for job submission and resource queries
Integrate/test Checkpoint-Restart support
Extend and mature the resource manager and grid scheduler interfaces

17 Future Work: Queue Manager
Add job group support (mainly for submission)
Add Task Group support (in progress)
Add Job Submission filter

18 Future Work: Accounting and Allocation Manager
Complete and test design for distributed accounting and multi-organizational involvement in job startup
Add support for multi-site authentication/authorization (each site having its own symmetric key)
Complete alpha version of GUI (fully featured)
Beta release of Gold (fully functional multi-site version with GUI)
Production deployment of Gold on 11.8TF Linux cluster (as primary allocation system) and several other sites as beta testers
Documentation to include roles and custom objects
Port Gold to other OSes (Tiers 1 and 2)
Create regression test suite (w/ APITest when ready)
Performance and scalability testing

19 Future Work: Grid Scheduler
First SSS release of Silver Grid Scheduler
Add additional statistics clients (global information gathering and global policies)
Fault tolerance improvements
Add improved cluster level job start time estimations
Initiate evaluation of peer-to-peer grid scheduling model
Test support for Globus 3.x

20 Resource Limit Enforcement
Bamboo: PBS JDL Specification, add support to PM
Maui: Scheduler policies
PM: Specification language and setting OS limits at job launch (Thanks!)
Warehouse: Measure the metrics by session and job
PM: Need session id/process id mapping
Maui-Bamboo: Initialization Phase

21 Dynamic Jobs
Malleable Jobs – Ability to change size and duration up until start
Dynamically Modifiable Jobs – Change attributes while idle or running
Dynamic Jobs – Job changes its size and duration itself while running
Bamboo: Needs to add support for opaque extension attributes and QOS as well as dynamically modifiable jobs
Maui: Policy support (growth bounds, QOS/queue support)
PM: For dynamic jobs, MPI needs to handle growth/shrinkage and have that information reported to QM
Warehouse: Aggregate statistics by session id, job id and process id
(We need to know the model for dynamic job support with MPI)

22 Checkpoint/Restart
{Suspend/Resume, Preempt/Restart, Checkpoint/Continue}?
{System Initiated, User Initiated}?
Bamboo: How to specify in JDL that a job is checkpointable (also maybe specify other parameters like filesystem, etc.)
Bamboo-Maui: Needs to be able to keep track of how much walltime was used up before checkpoint and not count checkpoint idle time
Maui: Policy handling
–needs to know which resources are released when suspended
Checkpoint Manager: Status from Berkeley? Can we reattempt checkpoint/restart test Thursday evening?
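The walltime bookkeeping requirement above (count time used before a checkpoint, but not the idle time while checkpointed) can be sketched as a small accumulator. The `WalltimeTracker` class and its method names are hypothetical and do not come from Bamboo or Maui; timestamps are passed in explicitly as seconds.

```python
class WalltimeTracker:
    """Accumulates walltime only while the job is actually running."""

    def __init__(self, limit_seconds):
        self.limit = limit_seconds
        self.banked = 0.0          # walltime consumed in earlier run intervals
        self._started_at = None    # start of the current interval, if running

    def start(self, now):
        """Job starts or restarts from a checkpoint."""
        if self._started_at is not None:
            raise RuntimeError("job is already running")
        self._started_at = now

    def checkpoint(self, now):
        """Job is checkpointed and stops running; bank the elapsed time."""
        if self._started_at is None:
            raise RuntimeError("job is not running")
        self.banked += now - self._started_at
        self._started_at = None

    def used(self, now):
        """Walltime consumed so far, excluding checkpoint idle time."""
        running = 0.0 if self._started_at is None else now - self._started_at
        return self.banked + running

    def remaining(self, now):
        """Walltime left before the limit is reached (never negative)."""
        return max(0.0, self.limit - self.used(now))


if __name__ == "__main__":
    tracker = WalltimeTracker(limit_seconds=3600)
    tracker.start(now=0)
    tracker.checkpoint(now=100)   # 100 s of walltime used
    tracker.start(now=500)        # 400 s idle while checkpointed: not counted
    print(tracker.used(now=550), tracker.remaining(now=550))
```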

23 Other Issues Supercomputing demos

