Progress on Release, API Discussions, Vote on APIs, and PI mtg
Al Geist
January 14-15, 2004
Chicago, IL
Participating Organizations Changes
Coordinator: Al Geist
Participating Organizations
  ORNL ANL LBNL PNNL PSC SDSC IBM SGI SNL LANL Ames NCSA Cray Intel Unlimited Scale
Scalable Systems Software
Participating Organizations
  ORNL ANL LBNL PNNL NCSA PSC SDSC SNL LANL Ames
  IBM Cray Intel SGI
Problem
  Computer centers use incompatible, ad hoc set of systems tools
  Present tools are not designed to scale to multi-Teraflop systems
Goals
  Collectively (with industry) define standard interfaces between systems components for interoperability
  Create scalable, standardized management tools for efficiently running our large computing centers
Resource Management
Accounting & user mgmt
System Build & Configure
Job management
System Monitoring
To learn more visit
Potential Impact of Project
Fundamentally change the way future high-end systems software is developed and distributed
Reduced facility management costs
  reduce need to support ad hoc software
  better systems tools available
  able to get machines up and running faster and keep running
More effective use of machines by scientific applications
  scalable launch of jobs and checkpoint/restart
  job monitoring and management tools
  allocation management interface
Scalable Systems Software Suite (component diagram)
Components: Accounting, Event Manager, Service Directory, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint / Restart, Validation & Testing, Hardware Infrastructure Manager, Packaging & Install
Meta Services (Grid Interfaces): Meta Scheduler, Meta Monitor, Meta Manager
Standard XML interfaces; authentication; communication
Working Components and Interfaces (bold)
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
First Release at SC2003
Scalable Systems Software Center September Washington DC Review of Last Meeting Details in Main project notebook
Highlights from Sept. mtg
Rusty Lusk – Using SSS as the production systems software on Chiba City for a number of months now.
  Use restriction syntax for everything.
  Got blessing of ANL sysadmin group.
Scott Jackson – Standard error reporting and codes across components.
  Discuss dividing up code space in a consistent way.
Eric Debenedictus – Issues for peta-scale systems
  Redstorm and Bluelight use a mesh rather than a switch, so topology is an important consideration for SSS:
    XML attribute to specify topology and I/O resources
    XML attribute to specify data arrangement on disk
    OS functionality hints to help auto placement
Thomas Naughton – SSS deployment using OSCAR
  A release of OSCAR that contains all SSS software
  Roll SSS components into OSCAR packages – RPM format
  Create repository for OSCAR package uploads
Highlights from Sept. mtg (cont.)
Al Geist – Plans for SC2003
Working Group Leaders –
  What areas their working group is addressing
  Progress report on what their group has done
  Present problems being addressed
  Next steps for the group
  Discussion items for the larger group to consider
Long Term Strategy –
  Get Computer Centers involved and using suite
  Get vendors to be compliant with APIs
Slides can be found in Main Notebook
Consensus and Voting: Communication Infrastructure Spec
Wire protocols – need to add security envelope protocol
Added service location. Bootstrapped using /etc/sss/
Vote to Accept as spec for
  Wire Protocol definition to get new ones accepted
  Service Directory interface
  Event Manager interface
Second vote: 16 yes, 2 abstaining, 0 no
Agreement for having common error objects with 3-digit codes and messages. Message is a human-readable string.
  Two special ones: 000 success, 999 unknown
Straw vote: 15 no 1 Abs 0
Al suggests these general error classes – success, warning, temp failure, partial failure, failure
People need to come up with a counter proposal if they care
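The agreed error object (3-digit code plus human-readable message, with 000 and 999 reserved) could be sketched roughly as follows. This is only an illustration: the class name `SSSError` and the mapping of the leading digit to Al's suggested error classes are assumptions, not something the group voted on.

```python
from dataclasses import dataclass

# ASSUMPTION: the leading digit selects one of the suggested error classes.
# The notes only fix 000 = success and 999 = unknown; the rest is hypothetical.
ERROR_CLASSES = {
    0: "success",
    1: "warning",
    2: "temp failure",
    3: "partial failure",
    4: "failure",
}


@dataclass(frozen=True)
class SSSError:
    """Hypothetical common error object: numeric code and readable message."""
    code: int
    message: str

    def __post_init__(self):
        # Codes must fit in three digits (000 through 999).
        if not 0 <= self.code <= 999:
            raise ValueError("code must be between 000 and 999")

    @property
    def code_str(self):
        # Zero-padded form as it would appear on the wire, e.g. "000".
        return f"{self.code:03d}"

    @property
    def error_class(self):
        # Anything outside the assumed ranges, including 999, is "unknown".
        return ERROR_CLASSES.get(self.code // 100, "unknown")


ok = SSSError(0, "request completed")
unknown = SSSError(999, "unclassified error")
```

A component could then branch on `error_class` rather than on individual codes, which is one way to keep the code space divided consistently across components.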
Scalable Systems Software Center September-January Progress Since Last Meeting
Systems Software Suite Release
Open Source License – Fred asks that we come up with one general text that all organizations can agree on and then he will bless it. DONE
SSS-OSCAR – Packaging done of all components (working around those components with license issues)
First Release – Announced at SC2003. Available from project web site
SC2003 Scalable System Demos and Talks
Rusty – fancy dancing meatball in wxPython
Thomas – SSS-OSCAR working
Will – fancy graphic demonstration of APITest ????
Brett – demonstrate swapping components in SSS architecture
Paul – chkpoint interacting with PM on Chiba
Locations: All across the show floor
  SciDAC booth – Talks by Rusty, Craig
  OSCAR BOF on Tuesday 5:00-6:00 mentions SSS-OSCAR
Five Project Notebooks
A main notebook for general information
And individual notebooks for each working group
Over 297 total pages – 16 added since last meeting
BC and PM groups need to get specs into their notebooks
Add Telecom meeting notes even if short
Get to all notebooks through main web site
  Click on side bar or at “project notebooks” at bottom of page
Bi-Weekly Working Group Telecoms
Starting back up after Holidays
Resource management, scheduling, and accounting
  Tuesday 3:00 pm (Eastern) keyword “SSS mtg”
Validation and Testing Group
  No need for telecoms recently
Process management, system monitoring, and checkpointing
  Thursday 1:00 pm (Eastern) mtg code
Node build, configuration, and information service
  Thursday 3:00 pm (Eastern) mtg code (changes)
Scalable Systems Software Center January 14-15, 2004 This Meeting
Major Topics this Meeting
Stability of Systems Software Suite – first release is out. Are we ready for a more robust second release?
Large Scale test run – NCSA has dedicated some time tonight to run our suite on their 1250 dual node cluster
Quarterly Report Due – would like to get one to Fred by end of January. Will need text from WG leaders.
Formal API presentations and voting – it is that time in the project when we are finalizing some APIs.
SciDAC PI Mtg – March in Charleston SC. We will need poster(s), talk, and 2 page summary document
Agenda - January 15
8:30 Al Geist – Project Status
9:15 Thomas Naughton – SSS OSCAR software suite release
Working Group Reports
  Progress report on what their group has done
  API Proposals for adoption by the group
  Progress on software suite improvements
9:30 Narayan Desai – Node Build, Configure
10:30 Break
11:00 Will McClendon – Validation and Testing
12:00 Lunch (on own – cafeteria room B)
1:00 Paul Hargrove – Process Management
2:00 Scott Jackson – Resource Management
3:00 Break
3:30 Narayan – Review of "restriction syntax" style of XML
4:00 Rusty – Discussion of restriction syntax for scheduler and queue mgr
4:30 Craig – Brief on big testbed run
5:00 Eric – competitive system to SSS
5:30 Adjourn
Evening: Working groups may want to help with large NCSA test run
Agenda – January 16
8:30 Discussion, proposals, votes
  Rusty – Process Manager API (discussion/vote)
  Narayan – Node state API (discussion/vote)
  Scott – Allocation Manager API (discussion/vote)
  Brett – Queue manager API (discussion/vote)
  Scott – SSSRMAP interface
  Al – Progress report
  Al – SciDAC mtg 2 pager, posters, talks
10:30 Break
11:00 Al Geist – Summary
  SciDAC PI Mtg March 22-24, Charleston SC
  next meeting date: May location: Argonne
12:00 meeting ends
Meeting notes
Al presents his slides
Thomas Naughton – SSS deployment using OSCAR
Good –
  RPMs created for all SSS components!
  OSCAR packaging (varying levels)
  SourceForge project supplied central CVS location
Bad –
  not all scripts are created equal – new untested submissions
  Some pain getting SF accounts.
  Time constraints forced script hacks
  OSCAR testing framework
Status –
  Tarball available, fairly toxic but builds full working cluster w/ SSS
  Updated OSCAR pkg HowTo
ToDo –
  clean up hacks, integrate remaining SSS components (qbank)
  Add SSS interface to OSCAR itself
  Would like to establish release schedule – March 1
Not clear that anyone has downloaded yet
Discussion of how many orgs in our group could shakedown the tarball
Group feels better to have few very reliable components than all components
Meeting notes
Narayan – node build progress report
Only had a few minor bug fixes
Infrastructure has been reliable for 6 months
Library updates:
  Portability – OSX support, 64-bit tested, Tru64 support
  Thread-safety
  SSL wire-protocol module – soon to be the default protocol in ssslib
Node state manager – reliable
Build System – building vs configuration interface/conflict issues
Hardware infrastructure – model needs refinement WRT topology info
Restriction Syntax augmentations
  New operators added – negations, numeric, regular expression
  Integrated into all Python components
Next steps –
  work on new models for hardware infrastructure
  Work on multiple implementations of BCM components
  Performance tuning – for ssslib, event manager, service directory
Meeting notes
Will McClendon – Component Interface testing report
Description of his work for the new folks
SC2003 demo of APItest v.1 in ASCI booth (GUI HTTP interface)
  built on Twisted Framework
  Db interfacing, distributed component testing, HTTPD mode
APItest development. Lessons learned.
V.2 new test file formats – collab with Jackson
  separate individual tests from batch grouping
Runs through some examples. Feedback is encouraged
Hope to get some real test suites going this quarter
Ron Oldfield – introduced
  Shows graphical APItest demo that was given at SC2003
Meeting notes
Paul Hargrove – Process management report
SSS-OSCAR release
Coming to a point where components have to interact more, e.g. Chkpt
Real deployment/testing on Chiba (ANL), XTORC (ORNL)
Checkpoint manager – progress
  ported to RH9 (hard – Red Hat kernel’s…)
  checkpoint using LAM/MPI
  stand-alone package w/ LAM/MPI for chkpt
  suspend/resume interface working with queue manager
Outstanding issues –
  need to design restart-time interactions
  need to implement a full interface – restriction syntax, event generation, error reporting
  basic ideas on file management
Monitoring progress
  in SSS-OSCAR
  Scalability work – thread pool, internal protocol changes
  fix service directory connections
  write documentation
Meeting notes
Process manager (cont)
Rusty Lusk – Process Manager functionality overview
Shows schematic of process management components
Various commands that are in the syntax
Progress – already a stable component, fixed several bugs at SC03
  Improved queries and error codes
Future
  INTEGRATION! Stable software makes this possible
  Chiba production use has forced the issue
  Continued development
Meeting notes
Scott Jackson – Resource Manager report
Short overview for new attendees
Progress – released in SSS-OSCAR: Bamboo, Maui, Gold, Warehouse
  Updated RM web page for new components being available
  Deployed user oriented problem response system
  Created SSSRMAP C-implementation module
  Completed per-component interface documents
Schedule Progress
  - Completed chkpt/restart based SSS calls.
    blocked until can test with checkpoint guys
  - support for dynamic jobs
    blocked until support provided in PM and QM
    discussion of feature of dynamic jobs, how/if we should work on it
  - resource limit enforcement and tracking
    need rusage on process exit
    blocked until support from PM and QM progress
Too much blocking: it seems the RM group lacks coordination with other groups.
Meeting notes
Scott Jackson – Resource Manager report (cont)
Initial release of Bamboo and wrote API document
Accounting and allocation
  Qbank was an initial solution, replaced by Gold
  Gold – released under BSD open source license
    packaged as tarball, and initial OSCAR rpm created
    added support for Service Directory registration
    implemented status codes
    implemented instance-level role-based authorization
  Gold running on 11 TF cluster at PNNL
  GUI improved to include user, project, machine management views
Meta-scheduler –
  added thread support
  improved Silver installation procedure
  testing of (grid level) data staging
Future –
  draft of SSSRMAP v3 protocol spec (chunking)
  release alpha versions of Bamboo, Maui, Gold, Warehouse
  complete design spec documents for above components.
Meeting notes
Discussion of having two XML syntax styles (functional, object)
  Al says he would like to see one common style across the suite; he does not care which one as long as the whole group can agree.
  Rusty brought up a second issue, wire protocol, and having a single library that has all the protocols used by the components in the SSS suite.
Narayan – Restriction Syntax Overview
  Command syntax – incorporates imperative and database operations
    allows uniform data queries across components
    easy to process
    improves atomicity of operations
  Semantics – Examples given
    going across, attributes are ANDed, and multiple lines are ORed
  An issue of uniqueness was brought up and will be taken into consideration by Narayan.
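The AND/OR semantics described for the restriction syntax can be illustrated with a small sketch. It does not use the actual XML form, which is not shown in these notes; here each restriction line is modeled as a Python dict, and the names `matches`, `select`, and the sample node attributes are hypothetical.

```python
def matches(record, restriction):
    """Return True if the record satisfies the restriction.

    restriction is a list of lines; each line maps attribute -> required value.
    Attributes within one line are ANDed; the lines themselves are ORed.
    An empty restriction (no lines) matches nothing in this sketch.
    """
    return any(
        all(record.get(attr) == value for attr, value in line.items())
        for line in restriction
    )


def select(records, restriction):
    """Return the records that satisfy the restriction, in original order."""
    return [r for r in records if matches(r, restriction)]


# Hypothetical node records; attribute names are made up for illustration.
nodes = [
    {"name": "n1", "state": "up", "arch": "ia32"},
    {"name": "n2", "state": "down", "arch": "ia32"},
    {"name": "n3", "state": "up", "arch": "ia64"},
]

# (state == up AND arch == ia32) OR (name == n3)
restriction = [
    {"state": "up", "arch": "ia32"},
    {"name": "n3"},
]

selected = [n["name"] for n in select(nodes, restriction)]
print(selected)  # ['n1', 'n3']
```

Because every component would evaluate queries the same way, a form like this is what allows uniform data queries across components.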
Meeting notes
Rusty – Restriction Syntax on Chiba City
  David would like to see a paper on the requirements of the Chiba effort.
Narayan – Hack of quick interfaces for Queue Manager
  Restriction Interface has 4 commands (add, del, run, get)
  Doesn’t show Scheduler Interface
Craig – 1280 dual Xeon cluster “Titanium” is available this evening to test the scalability of SSS suite.
  One node will be used as head node to install our suite and run on entire cluster.
  Could build everything but Bamboo and ssslib due to Xerces
  Will begin to be available at 6pm
Eric – A competing package. From his Russian “secret city” trip Oct. 03
  Package for – distributed calculations, metacomputing, Grid.
  System is based on XML, web-based user interface. Configure, manage, and submit jobs.
  Challenges: auto load balance.
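As a rough picture of a queue manager interface with the four commands named above (add, del, run, get), here is a minimal in-memory sketch. It is not Narayan's interface: the class `ToyQueue`, the job attributes, and the use of keyword arguments as a single restriction line are all assumptions made for illustration.

```python
class ToyQueue:
    """Hypothetical in-memory job queue with add, del, run, and get."""

    def __init__(self):
        self._jobs = {}      # job id -> attribute dict
        self._next_id = 1

    def _select(self, **restriction):
        # All given attributes must match (ANDed), as in one restriction line.
        return [
            job for job in self._jobs.values()
            if all(job.get(k) == v for k, v in restriction.items())
        ]

    def add(self, **attrs):
        """Queue a new job and return its id."""
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = {"id": job_id, "state": "queued", **attrs}
        return job_id

    def get(self, **restriction):
        """Return copies of the jobs matching the restriction."""
        return [dict(job) for job in self._select(**restriction)]

    def run(self, **restriction):
        """Mark matching jobs as running and return their ids."""
        started = []
        for job in self._select(**restriction):
            job["state"] = "running"
            started.append(job["id"])
        return started

    def delete(self, **restriction):
        """The 'del' command ('del' is a reserved word in Python)."""
        doomed = [job["id"] for job in self._select(**restriction)]
        for job_id in doomed:
            del self._jobs[job_id]
        return doomed


q = ToyQueue()
q.add(user="alice")
q.add(user="bob")
started = q.run(user="alice")
still_queued = q.get(state="queued")
removed = q.delete(user="bob")
```

Using the same selection logic for get, run, and del is the point of the sketch: every command takes a restriction, so the commands differ only in what they do to the matching jobs.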
Meeting notes
Late night session on 1280 node testbed
  PM ran at 1280, worked at 4000, hung at 6000
  Warehouse had a problem at 1280 and took out head node
  RM components ran on head node OK until Warehouse crashed it
Rusty – Process Manager Spec for first vote
  Presentation and discussion…
  Who is responsible for limit enforcement, PM or QM? I.e. must use certain amount of memory, must not execute OS command (in general, things that happen after fork)
  Rusty says the question is good and he needs to think about how this may affect the interface.
  Other items to think about
    - use of wildcard as “to be returned” operator – OK
    - Inclusion but don’t show me.
    - Dynamic jobs and PM.
    - improve readability
  Delay vote until we have a written proposal.
Meeting notes
How to write spec to describe how XML should be extended for future needs.
Narayan – Node State Manager spec (no written doc so no vote)
  Presentation and lots of discussion…
Scott – Allocation Manager spec (has written doc in notebook)
  Goes through examples in the document. Discussion.
Switches to discussion of comparison between both XML syntaxes
  Andrew Lusk thinks that a translator could be created for queries (but not for output)
  Rusty thinks it is a bad idea and feels it is not a problem to have two syntaxes.
  David says the translation is good because it could buy time to switch syntax
  Andrew, Paul, and Craig offer to help build a prototype translator to see how / if it is possible.
  Investigate standardization of tokens across the two syntaxes
Meeting notes How