Published by Helen Waters. Modified over 9 years ago.
1
Scalable Systems Software Center
Resource Management and Accounting Working Group
Face-to-Face Meeting
May 10-11, 2005
Argonne, IL
2
Resource Management and Accounting Working Group
Working group scope
Progress since last face-to-face
Future work
Other issues
3
Working Group Scope
The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting.
This working group will focus on the following software components:
– Queue Manager
– Scheduler
– Accounting and Allocation Manager
– Meta Scheduler
Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
– Process Manager
– Cluster Monitor
4
Resource Management Component Architecture
[Diagram] Components shown: Queue Manager, Allocation Manager, Node Monitor, Grid Scheduler, Cluster Scheduler, Node Manager, Process Manager, Security System, Discovery Service, Event Manager.
Color key (by working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure; Infrastructure Services.
5
Resource Management Prototype Demonstration
[Diagram] Components shown: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Cluster Scheduler, Process Manager, Discovery Service.
Color key (by working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure.
Numbered messages in the diagram:
0 Service-Lookup
1 Submit-Job
2 Query-Job
3 Query-Node
4 Create-Reservation
5 Run-Job
6 Exec-Process
7 Query-Job
8 Delete-Job
9 Withdraw-Allocation
This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.
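The numbered messages on this slide can be put into their labeled order with a few lines of Python. This is only a sketch: the message names and numbers come from the slide, while `DemoStep`, `message_sequence` and `past_wallclock_limit` are hypothetical names. The slide does not say which component sends or receives each message, so the sketch does not model that.

```python
from typing import List, NamedTuple


class DemoStep(NamedTuple):
    """One labeled message from the demo diagram (hypothetical type)."""
    number: int
    message: str


# Labels in the order they appear in the slide transcript.
DEMO_STEPS = [
    DemoStep(1, "Submit-Job"),
    DemoStep(3, "Query-Node"),
    DemoStep(6, "Exec-Process"),
    DemoStep(4, "Create-Reservation"),
    DemoStep(2, "Query-Job"),
    DemoStep(5, "Run-Job"),
    DemoStep(8, "Delete-Job"),
    DemoStep(0, "Service-Lookup"),
    DemoStep(7, "Query-Job"),
    DemoStep(9, "Withdraw-Allocation"),
]


def message_sequence(steps: List[DemoStep]) -> List[str]:
    """Return the message names sorted by their step number."""
    return [step.message for step in sorted(steps)]


def past_wallclock_limit(elapsed_seconds: float, limit_seconds: float) -> bool:
    """Assumed check: a job is over its limit once elapsed time exceeds it."""
    return elapsed_seconds > limit_seconds
```

Calling `message_sequence(DEMO_STEPS)` gives the ten messages from Service-Lookup through Withdraw-Allocation in numeric order.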
6
General Progress
New release of RMWG components made available from the SSS web site
– Bamboo Queue Manager v1.1
– Maui Scheduler v3.2.6p13
– Gold Accounting and Allocation Manager v2.b2.10.2
7
General Progress
Continued adoption of SSS components and interfaces
– SSS suite running on additional systems in Ames
– Gold being used in production on University of Utah’s Icebox cluster
8
General Progress
Working on integration of SSSRMAP into ssslib
– Bill Pitre: implementing the SSSRMAP Message Format SDK (Python classes)
– Craig Steffen: integrating the SSSRMAP Wire Level protocol into ssslib
9
General Progress
Paper accepted for presentation and publication at a conference
– Title: Allocation Management Solutions for High Performance Computing
– Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA'05)
– Workshop on “Scheduling and Resource Management for Parallel and Distributed Systems”
10
General Progress
New documents in the SSS RMWG Notebook
– Considerations for using SOAP as the basis for SSSRMAP v4
– Fault Tolerance with Gold
– Last Quarter’s Weekly RMWG Meeting Notes
11
Queue Manager Progress
V1.1 release of Bamboo made available
SSS suite running on several systems in Ames
Support for Task Groups and Node Properties added to server
Added a new mailing feature
New fountain component created to pull node information from multiple sources
– Simple node information now supported
– Working on adding support for SuperMon, Ganglia and NWPerf
12
Accounting and Allocation Manager Progress
New release of Gold available: 2nd Gold Beta v2.b2.10.2
– v2.b2.7.0 incorporated into OSCAR release
Gold being used in production on University of Utah’s Icebox cluster
Implemented and tested design for distributed accounting and multi-organizational negotiation in job launching
Implemented fault tolerance to 50% cluster loss by adding support for a backup Gold server
– Clients can fail over to a backup Gold server if one is defined
– The database can be made fault tolerant by using a synchronous multi-master replication system such as pgcluster
– Documented in the RMWG notebook
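The client-side failover described on this slide can be illustrated with a minimal sketch. It assumes a client holds an ordered list of servers (primary first, then backup) and retries the same request on the next one when a connection fails. Nothing here is Gold's actual client code; `call_with_failover`, `fake_request` and the host names are hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(servers: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for server in servers:
        try:
            return request(server)
        except ConnectionError as exc:
            # This server is unreachable; remember why and try the next one.
            last_error = exc
    raise ConnectionError(f"all servers failed: {list(servers)}") from last_error


def fake_request(server: str) -> str:
    """Stand-in transport for illustration: the primary is always down."""
    if server == "gold-primary.example":
        raise ConnectionError("primary unreachable")
    return f"balance reply from {server}"


reply = call_with_failover(
    ["gold-primary.example", "gold-backup.example"], fake_request
)
```

If no backup is defined, the list holds only the primary and the connection error reaches the caller, which matches the "if defined" condition on the slide.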
13
Accounting and Allocation Manager Progress
Simplified allocation management for basic configurations by adding the ability to hide the account abstraction layer
– Enabled account auto-generation, project-level deposits, etc.
Ported Gold to Tier3 and Tier4 OSes
– (OS-X, IRIX, HP-UX, Solaris); unable to get access to Unicos
Enabled support for MySQL database
14
Cluster Scheduler Progress
Migrated latest MCOM library into Maui
– Includes support for encryption, scalability enhancements, SSS return codes, job description extensions, etc.
Enabled support for partitions, node features
Enhanced recovery modes for failures and unexpected conditions
Additional QOS modes for Allocation Manager
– Fallback QOS, QOS requested vs. delivered
Fixed additional packaging bugs, buffer overflows
Started work on multi-taskgroup jobs
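The slide names a "fallback QOS" mode and a "QOS requested vs. delivered" distinction without defining either. One plausible reading, sketched below purely as an assumption, is that a job gets its requested QOS when the allocation manager can charge for it and otherwise gets the fallback, with both values recorded. `QosDecision`, `choose_qos` and the `can_charge` callback are hypothetical and are not Maui or Gold APIs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QosDecision:
    """Records what the job asked for and what it was actually given."""
    requested: str
    delivered: str


def choose_qos(
    requested: str, fallback: str, can_charge: Callable[[str], bool]
) -> QosDecision:
    """Deliver the requested QOS if it can be charged, else the fallback.

    `can_charge` stands in for a query to the allocation manager (assumed).
    """
    delivered = requested if can_charge(requested) else fallback
    return QosDecision(requested=requested, delivered=delivered)


# Example: the account cannot cover "premium", so the job falls back.
decision = choose_qos("premium", "standard", lambda qos: qos != "premium")
```

Keeping both fields makes it possible to report later which jobs ran at a lower QOS than they asked for.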
15
Grid Scheduler Progress
Added support for multi-site authentication (per peer-service symmetric keys)
Rolling X.509 credential management into MCOM library
Enabled support for Globus 3.x (had to work around a lot of Globus bugs)
Enhanced grid job queue and launch
Reliability - completed Globus failure diagnostics, logging and auto-recovery
Data Staging - completed Globus/non-Globus data staging failure auto-recovery
Fairness - implemented Priority, Fairshare, and Usage Limit based policy enforcement
Statistics - added credential, job, and cluster based usage statistics
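Authentication with a separate symmetric key per peer can be sketched as follows. The slide does not say which algorithm the MCOM library uses, so HMAC-SHA256 from the Python standard library is an assumption chosen for illustration. The peer names, keys and function names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key table: each peer service shares its own secret with us.
PEER_KEYS = {
    "site-a": b"secret-shared-with-site-a",
    "site-b": b"secret-shared-with-site-b",
}


def sign(peer: str, message: bytes) -> str:
    """Compute a keyed digest of the message using that peer's key."""
    return hmac.new(PEER_KEYS[peer], message, hashlib.sha256).hexdigest()


def verify(peer: str, message: bytes, signature: str) -> bool:
    """Accept the message only if the digest matches the claimed peer's key."""
    key = PEER_KEYS.get(peer)
    if key is None:
        return False  # unknown peer: no shared key, so reject
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature)
```

Because every peer has its own key, a compromised key at one site does not let that site impersonate another.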
16
Future Work
General release of all components
– Including new Silver Meta-scheduler
Increase deployment base
Integrate SSSRMAP into ssslib
Portability testing for all components
Fault Tolerance supporting 25% cluster loss
17
Future Work
Queue Manager
Add job group support (mainly for submission)
Add Job Submission filter
Finish final missing portions of PBS-style job language support
18
Future Work
Accounting and Allocation Manager
General release to be made available by mid-year
Production deployment of Gold on additional sites
Port Gold GUI from JSP to Perl CGI
Add support for multi-site authentication (each site having its own symmetric key)
Documentation to include object customization
19
Future Work
Cluster Scheduler
Add support for multi-taskgroup SSS jobs
Support SSS job extensions and job-level policies
Peer Diagnostics - add auto-recovery to failed service interfaces
Resource Utilization - complete development of all resource utilization objectives
Resource Limits - complete development of all resource limits objectives
Checkpoint Restart - test with LBNL and optimize resource management for suspended jobs
Get X.509 credential management working
20
Future Work
Grid Scheduler
Release Silver meta-scheduler
– Targeting end of June for alpha release
– Need to test Maui/Silver interoperability with new MCOM lib
Need to test
– Priority, Fairshare, and Usage Limit based policy enforcement
– Credential, job, and cluster based usage statistics
Optimization - add network co-allocation reservation
General - mature client commands to provide status reporting in a more intuitive manner