Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting Aug 26-27, 2004 Argonne, IL.



Resource Management and Accounting Working Group
Working group scope
Progress since last face-to-face
Future work
Other issues

Working Group Scope
The Resource Management Working Group is involved in the areas of resource management, scheduling and accounting.
This working group will focus on the following software components:
  – Queue Manager
  – Scheduler
  – Accounting and Allocation Manager
  – Meta Scheduler
Other critical resource management components are being developed in the Process Management and Monitoring Working Group:
  – Process Manager
  – Cluster Monitor

Resource Management Component Architecture
[Diagram] Components: Queue Manager, Allocation Manager, Node Monitor, Grid Scheduler, Cluster Scheduler, Node Manager, Process Manager, Security System, Information Service, Discovery Service, Event Manager
Color key (working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure; Infrastructure Services

Resource Management Prototype Demonstration
[Diagram] Components: Job Submission Client, Queue Manager, Allocation Manager, Node Monitor, Cluster Scheduler, Process Manager, Discovery Service
Color key (working group): Resource Management and Accounting; Execution Management and Monitoring; Node Configuration and Infrastructure
Message sequence shown in the diagram:
  0 Service-Lookup
  1 Submit-Job
  2 Query-Job
  3 Query-Node
  4 Create-Reservation
  5 Run-Job
  6 Exec-Process
  7 Query-Job
  8 Delete-Job
  9 Withdraw-Allocation
This demo runs a simple end-to-end test in which a submitted job runs past its wallclock limit.

General Progress
Updated and implemented SSSRMAP v3 specifications
  – SSSRMAP Wire Protocol v3.0.3
      Uses chunked HTTP transfer encoding
  – SSSRMAP Message Format v3.0.3
      Moved condition, assignment and option values into the body of the Element (instead of in the value attribute)
  – SSS Job Object v3.0.3
      Added job properties in support of input/output, interactive jobs, dynamic jobs, suspend/resume, checkpoint/restart, resource limit enforcement, partitions, charges
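The wire protocol item above mentions chunked HTTP transfer encoding. As a rough illustration of that framing only, here is a minimal sketch of standard HTTP/1.1 chunked encoding and decoding. The function names and the sample payload are hypothetical, and nothing here is taken from the SSSRMAP specification itself.

```python
def encode_chunked(payload: bytes, chunk_size: int = 16) -> bytes:
    """Frame a payload with HTTP/1.1 chunked transfer encoding."""
    out = bytearray()
    for start in range(0, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        # Each chunk: size in hex, CRLF, data, CRLF
        out += f"{len(chunk):X}".encode("ascii") + b"\r\n" + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # a zero-length chunk terminates the body
    return bytes(out)


def decode_chunked(framed: bytes) -> bytes:
    """Reassemble a chunked body (trailers are not handled in this sketch)."""
    body = bytearray()
    pos = 0
    while True:
        line_end = framed.index(b"\r\n", pos)
        # Ignore optional chunk extensions after ';'
        size = int(framed[pos:line_end].split(b";")[0], 16)
        pos = line_end + 2
        if size == 0:
            break
        body += framed[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)
```

Chunking lets a sender start streaming a response before its total length is known, which is presumably why it suits large query results.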

General Progress
Completed system testing for Second Alpha Release
  – on xtorc-sss, a RedHat 9.0 system
  – Included Maui, Bamboo, Warehouse, Gold, Process Manager, etc.
Released second alpha versions of RMWG components
  – Fully implements version 3 of the SSSRMAP specification
      Bamboo Queue Manager v0.9.6
      Maui Scheduler v3.2.6p9 (production version)
      Gold Accounting and Allocation Manager v1.0.a2.1
      Warehouse System Monitor v0.7.0
RMWG Webpage updated with Second Alpha release
  – Updated info, docs, downloads, etc.
  – Added an interactive FAQ engine (FAQOMATIC)

Cluster Scheduler Progress
Completed merger of Maui 3.2 and Maui SSS
Further added intrinsic support for SSS messages
  – client-server, allocation manager, queue manager, resource manager interfaces, callbacks
  – Status object
  – Error codes
Enhanced support for SSS node and job objects
  – allocation manager, queue manager, resource manager interfaces
  – extended MCom library to support additional node and job object attributes
Improved socket and XML call reliability and security (added buffer checking and detailed failure reporting)
Built the SSS integration guide and updated Maui documentation

Queue Manager Progress
Third release of Bamboo made available
Supports basic SSSRMAP v3 message format
Interactive job support finished and tested
New submission client to handle LoadLeveler job scripts
Packaging updated to separate out components required on the execution nodes
Added support for job dependencies (i.e., chained jobs are now supported)

Queue Manager Progress
PM interface updated to use scoping of signal
  – Job termination code changed to implement a “soft” kill (i.e., SIGTERM followed later by a SIGKILL, if needed)
SSS suite was updated on cluster in Ames in July
  – Appears to resolve most known problems
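The “soft” kill described above follows a common pattern: ask the process to exit, give it a grace period, and force termination only if it is still running. The sketch below shows that pattern with Python's standard subprocess module. It is not Bamboo's implementation, and the function names and grace period are assumptions made for illustration.

```python
import subprocess
import sys


def soft_kill(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        return proc.wait()


def demo() -> int:
    """Start a sleeping child process and soft-kill it (illustrative only)."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    return soft_kill(child, grace_seconds=2.0)
```

The grace period gives a job the chance to flush output or clean up scratch files before it is forcibly removed.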

Accounting and Allocation Manager Progress
Completed rewrite of Gold server and all business logic in Perl
Significantly improved account/allocation design
Created an account statement report
Implemented hierarchical account nesting and tested trickle-down deposits and trickle-up charges
Implemented and tested credit accounts
Added support for auto-creation of users, projects and machines
Implemented automatic recursive association deletion/undeletion
Added support for query row limit, object aliases
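The slide does not define trickle-down deposits or trickle-up charges. One plausible reading, assumed here, is that a deposit into a parent account is split among its children by fixed shares, and that a charge a child cannot cover overflows to its parent. The sketch below illustrates that reading only. The Account class, the share weights and the account names are all hypothetical and do not describe Gold's actual data model.

```python
from typing import Dict, Optional


class Account:
    """Toy nested account (hypothetical; not Gold's schema)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.balance = 0.0
        self.parent: Optional["Account"] = None
        self.children: Dict[str, "Account"] = {}
        self.shares: Dict[str, float] = {}  # child name -> fraction of deposits

    def add_child(self, child: "Account", share: float) -> None:
        child.parent = self
        self.children[child.name] = child
        self.shares[child.name] = share

    def deposit(self, amount: float) -> None:
        """Trickle down: hand each child its share, keep the remainder."""
        passed = 0.0
        for name, share in self.shares.items():
            portion = amount * share
            self.children[name].deposit(portion)
            passed += portion
        self.balance += amount - passed

    def available(self) -> float:
        """Funds reachable from this account, including its ancestors."""
        above = self.parent.available() if self.parent else 0.0
        return self.balance + above

    def charge(self, amount: float) -> None:
        """Trickle up: debit locally first, overflow goes to the parent."""
        if amount > self.available():
            raise ValueError("insufficient funds")
        local = min(self.balance, amount)
        self.balance -= local
        if amount > local and self.parent is not None:
            self.parent.charge(amount - local)
```

For example, depositing 1000 into a parent whose two children each hold a 0.25 share leaves 500 at the parent and 250 in each child. A later charge of 300 against one child empties it and draws the remaining 50 from the parent.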

Accounting and Allocation Manager Progress
Made compliant with SSSRMAP v3 specification
Fully implemented response chunking
Updated clients and Gold User’s Guide
Completed Allocation, Reservation, Quotation, and ChargeRates portions of GUI
Further simplified dependent module installation
Updated Component and Application Binding docs (v3.0.3)
Released Second Alpha release of Gold
  – Regression and system tested on RedHat 9.0 (xtorc-sss)
Upgraded Gold on PNNL SGI cluster to the latest second alpha version

Grid Scheduler Progress
Migrated grid scheduler interface to use SSS message format for all scheduler-grid scheduler interface calls
Migrated Silver client commands to utilize SSS MCom XML library
Enhanced global queue management
Added diagnostic clients
Verified new job management state machine

Grid Scheduler Progress
Introduced three new SSS objects
  – developed new SSS time range object
  – defined and implemented support for cluster to grid scheduler interface reservation object
  – proposed new cluster/machine object for exchanging high level policy and resource availability information

Future Work
Beta release of all components
  – Including new Silver Meta-scheduler
Portability testing for new components
  – Tier 1: Linux::RedHat (9.0)
  – Tier 2: Linux::Sousa, AIX, Tru-64
  – Tier 3: OS-X, Unicos
  – Tier 4: HP-UX, IRIX, Solaris
Fault Tolerance supporting 25% cluster loss
Complete Design Specification documents for new components

Future Work: Cluster Scheduler
Convert to using SSS job object for job submission and resource queries
Integrate/test Checkpoint-Restart support
Extend and mature the resource manager and grid scheduler interfaces

Future Work: Queue Manager
Add job group support (mainly for submission)
Add Task Group support (in progress)
Add Job Submission filter

Future Work: Accounting and Allocation Manager
Complete and test design for distributed accounting and multi-organizational involvement in job startup
Add support for multi-site authentication/authorization (each site having its own symmetric key)
Complete alpha version of GUI (fully featured)
Beta release of Gold (fully functional multi-site version with GUI)
Production deployment of Gold on 11.8TF Linux cluster (as primary allocation system) and several other sites as beta testers
Documentation to include roles and custom objects
Port Gold to other OS’s (Tiers 1 and 2)
Create regression test suite (w/ APITest when ready)
Performance and scalability testing
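The multi-site item above says only that each site would have its own symmetric key. One common way to use such keys is a keyed message digest (HMAC), where the receiver looks up the key for the claimed site and checks the signature. The sketch below shows that idea with Python's standard hmac module. The choice of HMAC-SHA256, the site names and the keys are assumptions for illustration and do not describe Gold's planned design.

```python
import hashlib
import hmac
from typing import Dict

# Hypothetical per-site shared secrets; real keys would never be hard-coded.
SITE_KEYS: Dict[str, bytes] = {
    "site-a": b"example-secret-for-site-a",
    "site-b": b"example-secret-for-site-b",
}


def sign(site: str, message: bytes) -> str:
    """Return a hex HMAC-SHA256 signature using the site's own key."""
    return hmac.new(SITE_KEYS[site], message, hashlib.sha256).hexdigest()


def verify(site: str, message: bytes, signature: str) -> bool:
    """Check a signature against the key registered for the claimed site."""
    key = SITE_KEYS.get(site)
    if key is None:
        return False  # unknown site: reject
    expected = hmac.new(key, message, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched
    return hmac.compare_digest(expected, signature)
```

Because each site signs with a different key, a message signed by one site does not verify under another site's name, so a leaked key compromises only that site.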

Future Work: Grid Scheduler
First SSS release of Silver Grid Scheduler
Add additional statistics clients (global information gathering and global policies)
Fault tolerance improvements
Add improved cluster level job start time estimations
Initiate evaluation of peer-to-peer grid scheduling model
Test support for Globus 3.x

Resource Limit Enforcement
Bamboo: PBS JDL Specification, add support to PM
Maui: Scheduler policies
PM: Specification language and setting OS limits at job launch (Thanks!)
Warehouse: Measure the metrics by session and job
PM: Need session id/process id mapping
Maui-Bamboo: Initialization Phase

Dynamic Jobs
Malleable Jobs – Ability to change size and duration up until start
Dynamically Modifiable Jobs – Change attributes while idle or running
Dynamic Jobs – Job changes its size and duration itself while running
Bamboo: Needs to add support for opaque extension attributes and QOS as well as dynamically modifiable jobs
Maui: Policy support (growth bounds, QOS/queue support)
PM: For dynamic jobs, MPI needs to handle growth/shrinkage and have that information reported to QM
Warehouse: Aggregate statistics by session id, job id and process id
(We need to know the model for dynamic job support with MPI)

Checkpoint/Restart
{Suspend/Resume, Preempt/Restart, Checkpoint/Continue}?
{System Initiated, User Initiated}?
Bamboo: How to specify in JDL that a job is checkpointable (also maybe specify other parameters like filesystem, etc.)
Bamboo-Maui: Needs to be able to keep track of how much walltime was used up before checkpoint and not count checkpoint idle time
Maui: Policy handling
  – needs to know which resources are released when suspended
Checkpoint Manager: Status from Berkeley? Can we reattempt the checkpoint/restart test Thursday evening?
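The Bamboo-Maui item above asks for walltime accounting that counts only the time a job actually ran and ignores the idle gap between a checkpoint and its restart. A minimal way to express that, assuming the queue manager records each run segment as a (start, end) pair of timestamps in seconds, is sketched below. The function names and the interval representation are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[float, float]  # (start, end) of one run segment, in seconds


def walltime_used(run_intervals: Iterable[Interval]) -> float:
    """Sum only the periods in which the job was actually running."""
    total = 0.0
    for start, end in run_intervals:
        if end < start:
            raise ValueError("interval ends before it starts")
        total += end - start
    return total


def walltime_remaining(limit: float, run_intervals: Iterable[Interval]) -> float:
    """Walltime left under the limit; checkpoint idle gaps are not counted."""
    return max(0.0, limit - walltime_used(run_intervals))
```

For example, a job that ran from 0 to 600 seconds, sat checkpointed until 1800, and then ran until 2400 has used 1200 seconds, so 2400 seconds remain under a 3600 second limit.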

Other Issues
Supercomputing demos