Published by Helen Joseph. Modified over 9 years ago.
1
Scalable Systems Software Center
Resource Management and Accounting Working Group
Face-to-Face Meeting
Jan 25-26, 2005
Washington, D.C.
2
Resource Management and Accounting Working Group
Working group scope
Progress since last face-to-face
Future work
3
Working Group Scope
The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting.
This working group will focus on the following software components:
– Queue Manager
– Scheduler
– Accounting and Allocation Manager
– Meta Scheduler
Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
– Process Manager
– Cluster Monitor
4
Resource Management Component Architecture
[Diagram] Components: Queue Manager, Allocation Manager, Node Monitor, Grid Scheduler, Cluster Scheduler, Node Manager, Process Manager, Security System, Discovery Service, Event Manager
Color key (working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure; Infrastructure Services
5
Resource Management Prototype Demonstration
[Diagram] Components: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Cluster Scheduler, Process Manager, Discovery Service
Color key (working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure
Message sequence:
0 Service-Lookup
1 Submit-Job
2 Query-Job
3 Query-Node
4 Create-Reservation
5 Run-Job
6 Exec-Process
7 Query-Job
8 Delete-Job
9 Withdraw-Allocation
This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.
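The numbered sequence in the demonstration can be traced with a short Python sketch. The step names come from the slide. The helper names (`over_wallclock`, `demo_trace`) are hypothetical, and the idea that steps 8 and 9 fire only once the job exceeds its wallclock limit is an assumption made for illustration; none of this is taken from the SSS code.

```python
# Step numbers and names as listed on the slide.
DEMO_STEPS = {
    0: "Service-Lookup",
    1: "Submit-Job",
    2: "Query-Job",
    3: "Query-Node",
    4: "Create-Reservation",
    5: "Run-Job",
    6: "Exec-Process",
    7: "Query-Job",
    8: "Delete-Job",
    9: "Withdraw-Allocation",
}


def over_wallclock(start_s, now_s, limit_s):
    """Return True when the job has run longer than its wallclock limit."""
    return (now_s - start_s) > limit_s


def demo_trace(start_s, now_s, limit_s):
    """List the messages exchanged so far (hypothetical helper).

    Assumption: steps 0-7 always happen, and steps 8-9 are issued only
    after the job overruns its wallclock limit.
    """
    trace = [DEMO_STEPS[i] for i in range(8)]
    if over_wallclock(start_s, now_s, limit_s):
        trace += [DEMO_STEPS[8], DEMO_STEPS[9]]
    return trace
```

Calling `demo_trace(0, 120, 60)` models a job with a 60-second limit observed at 120 seconds, so the trace ends with the delete and withdraw messages.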
6
General Progress
Protocol has stabilized: very little change in the SSSRMAP Wire Protocol or Message Format
Scott: wrote a good deal of the SSSRMAP Message Format SDK (Python classes)
– all that is left is Data integration into Request and Response
Craig: initial efforts on SSSRMAP Wire Level integration into ssslib
7
General Progress
SC2004 release of RMWG components
– System tested and bundled with SSS-OSCAR 1.0
Bamboo Queue Manager v1.0.0
Maui Scheduler v3.2.6p10
Gold Accounting and Allocation Manager v2.0.b1.1
Warehouse System Monitor v0.7.0
8
General Progress
Starting to see evidence of adoption and added value of the SSS components
Bamboo Queue Manager
– built-in support for checkpoint/restart
– PBS or LoadLeveler job submission syntax
– interfaces with ANL process manager
– has been in production use on the Ames cluster for over a year now
9
General Progress
Adoption and value add (continued)
Gold Allocation Manager
– very successful in ensuring that the right work gets done
– very successful in establishing a project cycle and managing capacity
– Gold is in production use on multiple PNNL systems including the 11.8TF Linux Cluster
– dozens of sites have downloaded it
– about 3 other sites currently evaluating Gold (also began discussions with DOD HPCMP sites)
10
General Progress
Adoption and value add (continued)
Maui Scheduler
– implemented support for checkpoint/restart
– sites are using the new resource utilization tracking and enforcement capabilities to advantage
– because of SSS-directed work in enhanced prioritization, throttling policies and quality of service, sites are better able to dial in their preferences for improved:
  – fairness
  – higher system utilization
  – improved response time
  – targeted cycle delivery
11
General Progress
Maui Scheduler (continued)
– Maui has been installed on over 2,500 clusters
– and downloaded over 100,000 times last year
– Maui is running on more supercomputers than any other scheduler in the world
– In early 2003 it was found to be running on the following share of the Top 500 list:
  – ~15 out of the top 20
  – ~75 out of the top 100
12
Queue Manager Progress
v1.0 (and v1.0.1) release of Bamboo made available
Full support for SSSRMAP v3 message format
Submission clients support PBS in addition to LoadLeveler style job scripts
Checkpoint/Restart manager interfaces tested and debugged
– Job output now correct for suspended jobs
SSS suite was updated on the cluster in Ames in November with the full SC code release
13
Accounting and Allocation Manager Progress
Released Gold Beta at SC2004
– Included in SSS-OSCAR 1.0 distribution
Beta version of Gold in production on PNNL’s 11.8TF Linux cluster
Full-featured Web-based Graphical User Interface
Performance testing and tuning carried out
Improved robustness (timeout select in non-blocking read/write loops prevents client and server communication hangs)
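The robustness item above (a `select` with a timeout inside a non-blocking read loop) can be illustrated with a minimal Python sketch. This is not Gold's code: the function name `read_with_timeout` and its parameters are hypothetical, and the sketch only shows the general technique of bounding each wait so that a silent peer cannot hang the reader.

```python
import select
import socket


def read_with_timeout(sock, nbytes, timeout_s):
    """Read up to nbytes from sock, giving up if the peer stays silent.

    Hypothetical helper: each wait for data is bounded by timeout_s,
    so a stalled peer raises TimeoutError instead of blocking forever.
    """
    sock.setblocking(False)
    chunks = []
    remaining = nbytes
    while remaining > 0:
        # Wait until the socket is readable or the timeout expires.
        readable, _, _ = select.select([sock], [], [], timeout_s)
        if not readable:
            raise TimeoutError("peer sent nothing within %.1f s" % timeout_s)
        data = sock.recv(remaining)
        if not data:
            # An empty read means the peer closed the connection.
            break
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks)
```

A write loop would follow the same pattern, passing the socket in the second (writable) list of `select.select`.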
14
Accounting and Allocation Manager Progress
Ported Gold to Tier 1 and Tier 2 OSs
Added support for SQLite embedded database
Added support for encryption/decryption (in Perl)
Support for variable decimal precision currency
New reservation design improves handling of charges that span allocation boundaries
Created a project usage report
New User Guide chapters on Allocations, Installation, Roles, gold shell, Passwords
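Two of the items above, variable decimal precision and charges that span allocation boundaries, can be pictured together in a small Python sketch. The slides do not describe how Gold implements either feature, so everything here is an assumption: the function name `split_charge`, the flat per-second rate, and the rule of apportioning a charge by time overlap with each allocation.

```python
from decimal import Decimal, ROUND_HALF_UP


def split_charge(job_start, job_end, rate, allocations, precision=2):
    """Apportion a job's charge across the allocations it overlaps.

    Hypothetical model: allocations is a list of (name, start, end)
    half-open intervals in seconds, rate is a Decimal cost per second,
    and precision is the number of decimal places kept for currency.
    """
    quantum = Decimal(1).scaleb(-precision)  # e.g. 0.01 for precision=2
    charges = {}
    for name, a_start, a_end in allocations:
        # Seconds of the job that fall inside this allocation's window.
        overlap = min(job_end, a_end) - max(job_start, a_start)
        if overlap > 0:
            amount = Decimal(overlap) * rate
            charges[name] = amount.quantize(quantum, rounding=ROUND_HALF_UP)
    return charges
```

For a job running from second 50 to second 150 across two allocations that meet at second 100, the sketch charges each allocation for its own 50 seconds rather than billing the whole job to one of them.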
15
Cluster Scheduler Progress
Peer Diagnostics - added service health checks
SSS Interface - added support for numerous job attributes
Packaging - enhanced packaging for pre-req auto-detection
Security - added interface buffer overflow prevention
Allocation Manager Interface - extended support for allocation debit/reservation attributes
Added end-to-end support for Bamboo+Berkeley Checkpoint Manager based suspend/resume
General - numerous stability and usability enhancements
16
Grid Scheduler Progress
Cluster Service API - rewrote Cluster Service interface to use SSS job object and message layer communication protocols
Usability - added node monitoring, job monitoring, statistics, and job management client commands
Submission - significantly enhanced job submission client and Globus job staging infrastructure
Data Staging - improved performance and reliability of GridFTP, GASS, and SCP based data staging
Grid Fairness - added initial support for grid level usage policies, fairshare, and priority
General - enhanced multi-cluster job co-allocation, improved packaging, documentation, and internal diagnostics of Globus, network, job, and resource failures
17
MCOM Progress (common library used by the cluster scheduler and grid scheduler)
XML - added failure logging and exception handling for corrupt XML
Compression - added inline socket data compression
Encryption - added initial key based data encryption (not full SSS standard)
General - made general improvements in socket communication, XML processing, SSS job processing, and node resource monitoring
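The XML and compression items above can be sketched together in Python. MCOM itself is not shown in the slides, so this is only an illustration under stated assumptions: the names `pack` and `unpack` are hypothetical, zlib stands in for whatever compression MCOM uses, and a corrupt message is logged and dropped rather than crashing the receiver.

```python
import logging
import xml.etree.ElementTree as ET
import zlib

log = logging.getLogger("mcom_sketch")


def pack(xml_text):
    """Compress an XML message before it is written to a socket."""
    return zlib.compress(xml_text.encode("utf-8"))


def unpack(payload):
    """Decompress and parse a message, logging and rejecting corrupt input.

    Returns the parsed root element, or None when the payload cannot be
    decompressed, decoded, or parsed as XML.
    """
    try:
        text = zlib.decompress(payload).decode("utf-8")
        return ET.fromstring(text)
    except (zlib.error, UnicodeDecodeError, ET.ParseError) as exc:
        log.error("dropping corrupt message: %s", exc)
        return None
```

Returning `None` keeps the failure local to one message, which is one simple way to get the logged, non-fatal handling the slide describes.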
18
Future Work
General release of all components
– Including new Silver Meta-scheduler
Increase deployment base
Portability testing for new components
– Tier 1: Linux::RedHat (9.0)
– Tier 2: Linux::SuSE, AIX, Tru-64
– Tier 3: OS-X, Unicos
– Tier 4: HP-UX, IRIX, Solaris
Fault Tolerance supporting 25% cluster loss
19
Future Work
Queue Manager
Add job group support (mainly for submission)
Add Task Group support / multi-requirement job support to submission clients
Add Job Submission filter
Finish the remaining portions of PBS style job language support
20
Future Work
Accounting and Allocation Manager
General release to be made available by mid-year
Production deployment of Gold on additional sites
Port Gold to other OSs (Tiers 3 and 4) and databases
Complete and test design for distributed accounting and multi-organizational involvement in job startup
Add support for multi-site authentication/authorization (each site having its own symmetric key)
Improvements in the web-based GUI
Documentation to include object customization
21
Future Work
Cluster Scheduler
Peer Diagnostics - add auto-recovery to failed service interfaces
Resource Utilization - complete development of all resource utilization objectives
Resource Limits - complete development of all resource limits objectives
Checkpoint Restart - optimize resource management for suspended jobs
22
Future Work
Grid Scheduler
Reliability - complete Globus failure diagnostics and auto-recovery
Data Staging - complete Globus/Non-Globus data staging failure auto-recovery
Optimization - add network co-allocation reservation
Fairness - complete Priority, Fairshare, and Usage Limit based policy enforcement
Statistics - add credential, job, and cluster based usage statistics
General - mature client commands to provide status reporting in a more intuitive manner