Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting Jan 25-26, 2005 Washington D.C.


Resource Management and Accounting Working Group
  Working group scope
  Progress since last face-to-face
  Future work

Working Group Scope
  The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting.
  This working group will focus on the following software components:
    – Queue Manager
    – Scheduler
    – Accounting and Allocation Manager
    – Meta Scheduler
  Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
    – Process Manager
    – Cluster Monitor

Resource Management Component Architecture
  [Architecture diagram]
  Components: Queue Manager, Allocation Manager, Node Monitor, Grid Scheduler, Cluster Scheduler, Node Manager, Process Manager, Security System, Discovery Service, Event Manager
  Color key (by working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure; Infrastructure Services

Resource Management Prototype Demonstration
  [Demonstration diagram]
  Components: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Cluster Scheduler, Process Manager, Discovery Service
  Color key (by working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure
  Message sequence:
    0 Service-Lookup
    1 Submit-Job
    2 Query-Job
    3 Query-Node
    4 Create-Reservation
    5 Run-Job
    6 Exec-Process
    7 Query-Job
    8 Delete-Job
    9 Withdraw-Allocation
  This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.
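The numbered labels on the demo slide give the order in which messages are exchanged. As a minimal sketch, the snippet below sorts the labels exactly as they appear on the slide into that order. The function name `ordered_steps` is hypothetical, and the slide does not state which component sends each message, so no sender or receiver is modeled here.

```python
import re
from typing import List, Tuple

# Labels as they appear on the demo slide: "<step number> <message name>".
SLIDE_LABELS = (
    "1 Submit-Job 3 Query-Node 6 Exec-Process 4 Create-Reservation "
    "2 Query-Job 5 Run-Job 8 Delete-Job 0 Service-Lookup 7 Query-Job "
    "9 Withdraw-Allocation"
)


def ordered_steps(labels: str) -> List[Tuple[int, str]]:
    """Return (step, message) pairs sorted by step number."""
    pairs = re.findall(r"(\d+)\s+([A-Za-z]+-[A-Za-z]+)", labels)
    return sorted((int(number), name) for number, name in pairs)


if __name__ == "__main__":
    for step, message in ordered_steps(SLIDE_LABELS):
        print(step, message)
```

Read in order, the sequence starts with a service lookup, ends with the allocation being withdrawn, and queries the job twice (steps 2 and 7).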

General Progress
  Protocol has stabilized: very little change in the SSSRMAP Wire Protocol or Message Format
  Scott: wrote a good deal of the SSSRMAP Message Format SDK (Python classes)
    – all that is left is Data integration into Request and Response
  Craig: initial efforts on SSSRMAP Wire Level integration into ssslib

General Progress
  SC2004 release of RMWG components
    – System tested and bundled with SSS-OSCAR 1.0
  Bamboo Queue Manager v1.0.0
  Maui Scheduler v3.2.6p10
  Gold Accounting and Allocation Manager v2.0.b1.1
  Warehouse System Monitor v0.7.0

General Progress
  Starting to see evidence of adoption and value add of the SSS components
  Bamboo Queue Manager
    – built-in support for checkpoint/restart
    – PBS or LoadLeveler job submission syntax
    – interfaces with ANL process manager
    – has been in production use on the Ames cluster for over a year now

General Progress
  Adoption and value add (continued)
  Gold Allocation Manager
    – very successful in ensuring that the right work gets done
    – very successful in establishing a project cycle and managing capacity
    – Gold is in production use on multiple PNNL systems, including the 11.8TF Linux cluster
    – dozens of sites have downloaded it
    – about 3 other sites are currently evaluating Gold (discussions have also begun with DOD HPCMP sites)

General Progress
  Adoption and value add (continued)
  Maui Scheduler
    – implemented support for checkpoint/restart
    – sites are using the new resource utilization tracking and enforcement capabilities to advantage
    – because of SSS-directed work in enhanced prioritization, throttling policies and quality of service, sites are better able to dial in their preferences for improved:
        fairness
        higher system utilization
        improved response time
        targeted cycle delivery

General Progress
  Maui Scheduler (continued)
    – Maui has been installed on over 2,500 clusters
    – and was downloaded over 100,000 times last year
    – Maui is running on more supercomputers than any other scheduler in the world
    – In early 2003 it was found to be running on (out of the Top 500 list):
        ~15 out of the top 20
        ~75 out of the top 100

Queue Manager Progress
  v1.0 (and v1.0.1) release of Bamboo made available
  Full support for SSSRMAP v3 message format
  Submission clients support PBS in addition to LoadLeveler style job scripts
  Checkpoint/Restart manager interfaces tested and debugged
    – Job output now correct for suspended jobs
  SSS suite was updated on the cluster in Ames in November with the full SC code release

Accounting and Allocation Manager Progress
  Released Gold beta at SC2004
    – Included in SSS-OSCAR 1.0 distribution
  Beta version of Gold in production on PNNL’s 11.8TF Linux cluster
  Full-featured Web-based Graphical User Interface
  Performance testing and tuning carried out
  Improved robustness (timeout select in non-blocking read/write loops prevents client and server communication hangs)
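The robustness item above describes a general technique: waiting on a socket with a `select` timeout so a silent peer cannot hang a read loop. The sketch below illustrates that technique in Python using only the standard library. It is not Gold's code; the function name `recv_exact` and the timeout values are assumptions made for illustration.

```python
import select
import socket


def recv_exact(sock, nbytes, timeout):
    """Read exactly nbytes from a non-blocking socket.

    Raises TimeoutError if the peer stays silent for `timeout` seconds,
    instead of blocking forever.
    """
    sock.setblocking(False)
    chunks = []
    remaining = nbytes
    while remaining > 0:
        # Wait until the socket is readable, but never longer than timeout.
        readable, _, _ = select.select([sock], [], [], timeout)
        if not readable:
            raise TimeoutError("peer sent no data within %.2fs" % timeout)
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    client, server = socket.socketpair()
    server.sendall(b"ping")
    print(recv_exact(client, 4, timeout=1.0))
    try:
        recv_exact(client, 1, timeout=0.05)  # nothing more was sent
    except TimeoutError as err:
        print("gave up:", err)
    client.close()
    server.close()
```

The same pattern applies to the write side by passing the socket in the second (writable) list of `select.select`.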

Accounting and Allocation Manager Progress
  Ported Gold to Tier 1 and Tier 2 OS’s
  Added support for SQLite embedded database
  Added support for encryption/decryption (in Perl)
  Support for variable decimal precision currency
  New reservation design improves handling of charges that span allocation boundaries
  Created a project usage report
  New User Guide chapters on Allocations, Installation, Roles, gold shell, Passwords
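One possible reading of "charges that span allocation boundaries" is a charge larger than what remains in a single allocation, which must then be debited from several. The sketch below shows that idea together with a configurable number of decimal places. It is a hypothetical illustration only: the slides do not describe Gold's actual algorithm or data model, and the names `quantize` and `debit` are invented here.

```python
from decimal import Decimal, ROUND_HALF_UP


def quantize(amount, places):
    """Round amount to the given number of decimal places."""
    unit = Decimal(1).scaleb(-places)  # places=2 gives Decimal('0.01')
    return Decimal(amount).quantize(unit, rounding=ROUND_HALF_UP)


def debit(balances, charge, places=2):
    """Debit charge from allocation balances in order.

    Returns (debits, new_balances). Raises ValueError if the balances
    together cannot cover the charge.
    """
    remaining = quantize(charge, places)
    debits = []
    new_balances = []
    for balance in balances:
        balance = quantize(balance, places)
        taken = min(balance, remaining)
        debits.append(taken)
        new_balances.append(balance - taken)
        remaining -= taken
    if remaining > 0:
        raise ValueError("insufficient funds: %s left uncharged" % remaining)
    return debits, new_balances


if __name__ == "__main__":
    # A 12.50 charge against allocations holding 10.00 and 25.00 credits.
    print(debit(["10.00", "25.00"], "12.50"))
```

Amounts are passed as strings so that `Decimal` never sees binary floating-point rounding error.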

Cluster Scheduler Progress
  Peer Diagnostics - added service health checks
  SSS Interface - added support for numerous job attributes
  Packaging - enhanced packaging for pre-req auto-detection
  Security - added interface buffer overflow prevention
  Allocation Manager Interface - extended support for allocation debit/reservation attributes
  Checkpoint/Restart - added end-to-end support for Bamboo+Berkeley Checkpoint Manager based suspend/resume
  General - numerous stability and usability enhancements

Grid Scheduler Progress
  Cluster Service API - rewrote Cluster Service interface to use SSS job object and message layer communication protocols
  Usability - added node monitoring, job monitoring, statistics, and job management client commands
  Submission - significantly enhanced job submission client and Globus job staging infrastructure
  Data Staging - improved performance and reliability of GridFTP, GASS, and SCP based data staging
  Grid Fairness - added initial support for grid level usage policies, fairshare, and priority
  General - enhanced multi-cluster job co-allocation, improved packaging, documentation, and internal diagnostics of Globus, network, job, and resource failures

MCOM Progress
  (common library used by the cluster scheduler and grid scheduler)
  XML - added failure logging and exception handling for corrupt XML
  Compression - added inline socket data compression
  Encryption - added initial key based data encryption (not full SSS standard)
  General - made general improvements in socket communication, XML processing, SSS job processing, and node resource monitoring
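The XML item above, failure logging and exception handling for corrupt XML, can be sketched in a few lines of Python. This is an illustration of the general pattern only, not MCOM code: the function name `parse_message`, the logger name, and the `Job`/`Id` element names are all hypothetical.

```python
import logging
import xml.etree.ElementTree as ET

log = logging.getLogger("mcom_sketch")  # hypothetical logger name


def parse_message(payload):
    """Parse an XML message; log and return None if it is corrupt."""
    try:
        return ET.fromstring(payload)
    except ET.ParseError as err:
        # Record the failure and let the caller continue with other messages.
        log.error("discarding corrupt XML message: %s", err)
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    good = parse_message("<Job><Id>1</Id></Job>")
    bad = parse_message("<Job><Id>1</Job>")  # mismatched closing tag
    print(good.tag, bad)
```

Returning `None` rather than raising keeps one malformed message from taking down a long-running service loop; a caller that needs stricter behavior could re-raise instead.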

Future Work
  General release of all components
    – Including new Silver Meta-scheduler
  Increase deployment base
  Portability testing for new components
    – Tier 1: Linux::RedHat (9.0)
    – Tier 2: Linux::SuSE, AIX, Tru-64
    – Tier 3: OS-X, Unicos
    – Tier 4: HP-UX, IRIX, Solaris
  Fault tolerance supporting 25% cluster loss

Future Work
  Queue Manager
    – Add job group support (mainly for submission)
    – Add Task Group support / multi-requirement job support to submission clients
    – Add Job Submission filter
    – Finish the remaining portions of PBS style job language support

Future Work
  Accounting and Allocation Manager
    – General release to be made available by mid-year
    – Production deployment of Gold on additional sites
    – Port Gold to other OS’s (Tiers 3 and 4) and databases
    – Complete and test design for distributed accounting and multi-organizational involvement in job startup
    – Add support for multi-site authentication/authorization (each site having its own symmetric key)
    – Improvements in the web-based GUI
    – Documentation to include object customization

Future Work
  Cluster Scheduler
    – Peer Diagnostics - add auto-recovery to failed service interfaces
    – Resource Utilization - complete development of all resource utilization objectives
    – Resource Limits - complete development of all resource limits objectives
    – Checkpoint Restart - optimize resource management for suspended jobs

Future Work
  Grid Scheduler
    – Reliability - complete Globus failure diagnostics and auto-recovery
    – Data Staging - complete Globus/Non-Globus data staging failure auto-recovery
    – Optimization - add network co-allocation reservation
    – Fairness - complete Priority, Fairshare, and Usage Limit based policy enforcement
    – Statistics - add credential, job, and cluster based usage statistics
    – General - mature client commands to provide status reporting in a more intuitive manner