Progress on Integration, Vote on APIs SC2003, and SW release Al Geist September 11-12, 2003 Rockville, MD.

1 Progress on Integration, Vote on APIs SC2003, and SW release Al Geist September 11-12, 2003 Rockville, MD

2 Participating Organizations
Coordinator: Al Geist
Organizations: ORNL, ANL, LBNL, PNNL, PSC, SDSC, IBM, SNL, LANL, Ames, NCSA, Cray, Intel, Unlimited Scale
External reviewers want to see more vendors involved. This could be an important point in our long-term plans.
We have begun working with Don Mason and John Lawson to set up a presentation to a vendor forum. We will need your participation when logistics are known.
No progress since last meeting.

3 Scalable Systems Software
Participating Organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SDSC, SNL, LANL, Ames, IBM, Cray, Intel, Unlimited Scale
Problem
  Computer centers use incompatible, ad hoc sets of systems tools
  Present tools are not designed to scale to multi-Teraflop systems
Goals
  Collectively (with industry) define standard interfaces between systems components for interoperability
  Create scalable, standardized management tools for efficiently running our large computing centers
Impact
  Reduced facility mgmt costs
  More effective use of machines by scientific applications
Resource Management, Accounting & user mgmt, System Build & Configure, Job management, System Monitoring
To learn more visit www.scidac.org/ScalableSystems

4 Scalable Systems Software Suite: Working Components and Interfaces (bold)
[Architecture diagram. Labels recovered from the figure:]
Components: Grid Interfaces, Meta Services (Meta Scheduler, Meta Monitor, Meta Manager), Accounting, Event Manager, Service Directory, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint / Restart, Validation & Testing, Hardware Infrastructure Manager
Infrastructure: Standard XML interfaces, authentication, communication
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite

5 Scalable Systems Software Center, June 5-6, Chicago, IL
Review of Last Meeting
Details in Main project notebook

6 Highlights from June Meeting
Matt Sottile – Using SSS to create bstat_sss
  It is a prototype distribution, so some of these issues are expected.
  Major gripes: had to write socket code and XML parsing and creation code. These should be APIs.
  XML parsing – the schema and associated parser are intimately related
Craig Steffan – Warehouse Monitoring Software Infrastructure
  Describes the old way the cluster monitor worked and its scalability issues. Presents new design.
Thomas Naughton – SSS deployment using OSCAR
  Seems to be the consensus of the group to do this for SC2003
Slides can be found in Main Notebook

7 Highlights from June Meeting (cont.)
Narayan Desai – All Service Directory, BC, and PM APIs changed to restriction syntax – draft spec given.
Scott Jackson – SSSRMAP v2 proposal
  Have taken an object-oriented approach to jobs and attributes
  Discussion of the differences between RM Schema and BC Schema
  Part of the difference is the incorporation of security
  Another part is functional vs. object-oriented
  Good discussion of the strengths and weaknesses of both.

8 Consensus and Voting: Communication Infrastructure Spec Draft
We should be able to hardwire components together.
Existence of static file to define where things are – may just have service directory.
Unix domain socket protocol for SMP servers
Vote – accept the spec pending amendment to allow hardwired components: Yes 15, No 0, Abstaining 0
Agreement on having common error objects with 3-digit codes and messages. The message is a human-readable string. Two special codes: 000 success, 999 unknown.
  Straw vote: Yes 15, No 1, Abstaining 0
Add “supported schema version” to Service Directory
  Vote: Yes 15, No 0, Abstaining 0
Discussion of outer (envelope, signature, body) framing, to be put in SSSlib (the SSSlib guys said it would be done; no vote taken)

9 Scalable Systems Software Center June-September Progress Since Last Meeting

10 Five Project Notebooks – little activity this Qtr
A main notebook for general information, and individual notebooks for each working group
Over 281 total pages – 11 added since last meeting
BC and PM groups need to get info into their notebooks
Add telecom meeting notes even if short
Get to all notebooks through main web site www.scidac.org/ScalableSystems
  Click on side bar or on “project notebooks” at bottom of page

11 Bi-Weekly Working Group Telecoms
Starting to pick up as SC2003 approaches
Resource management, scheduling, and accounting: Tuesday 3:00 pm (Eastern), 1-800-664-0771, keyword “SSS mtg”
Validation and testing (hasn’t met since last year): Wednesday 1:00 pm (Eastern), 1-877-540-9892, mtg code 999157
Process management, system monitoring, and checkpointing: Thursday 1:00 pm (Eastern), 1-877-252-5250, mtg code 160910
Node build, configuration, and information service: Thursday 3:00 pm (Eastern), 1-888-469-1934, mtg code (changes)

12 Scalable Systems Software Center September 11-12, 2003 This Meeting

13 Major Topics this Meeting
Five-year project – Fred says that the five-year projects will go all five years, but they need to be finished at that point. He asks “What is our exit strategy?”
Open Source License – Fred asks that we come up with one general text that all organizations can agree on, and then he will bless it.
Software Release – deadline for a suite release is SC2003
Formal API presentations and voting – it is that time in the project when we should be settling on some APIs. Use less time for progress reports.
SC2003 prep – booth space, demos, posters

14 Agenda – September 11
8:30 Al Geist – Project Status
9:15 Rusty Lusk – Use of Scalable Systems Suite on Chiba
Working Group Reports
  Progress report on what their group has done
  API proposals for adoption by the group
  Progress on SC2003 software release date
9:30 Scott Jackson – Resource Management
10:30 Break
11:00 Erik Debenedictis – Validation and Testing
12:00 Lunch (on own – hotel restaurant)
1:00 Paul Hargrove – Process Management
2:00 Narayan Desai – Node Build, Configure
3:00 Break
3:30 Thomas Naughton – Discussion of SSS OSCAR software suite release, XML syntax
4:30 Discussion of SC2003 demos, booths, posters
5:30 Adjourn
Working groups may wish to prepare material for voting Friday

15 Agenda – September 12
8:30 Discussion, proposals, votes
  Erik – tweaks for peta-scale systems
  Scott – error codes extensibility
  Narayan – communication infrastructure 2nd vote
  Rusty – mystery topic
  Plans for SC2003 demos and talks
10:30 Break
11:00 Al Geist – Summary
  Review plans for SC2003
  Next meeting date: January 15-16, 2004; location: ANL
12:00 Meeting ends

16 Meeting notes
Rusty Lusk – SSS on Chiba project (summer project)
Experiment to see if SSS architecture can replace Chiba City SW
  Needed better software on the cluster, test SSS (chkpt), scalable testbed
  Needed external testing
  Needed more experience with published XML API
Use ANL SSS components, and stubs for scheduler, QM w/ PBS compatibility
Use restriction syntax for everything
After 2-week shakedown Remy agreed to go forward and use SSS only
Been running user job mix for about 3 weeks without disasters
Shook out XML ambiguities, fixed bugs, fixed scalability problems
Plans, short term: incorporate chkpt, LAM support, monitoring warehouse
Plans, long term: incorporate components from RM group, use Chiba
Question: What is Jazz using? Qbank, MPI-GM, Veridian PBS, etc.
Question: What is the bug fix load? Exponential decrease with low load.

17 Meeting notes
Scott Jackson – RM progress
SSSRMAP v2.0 is in all components except Silver meta-scheduler
Tested security
Running suite on ORNL’s XTORC; some issues with ssslib and PM
Created Node object v1.0
Proposed set of response/status codes (more tomorrow)
Suite for SC2003 includes openPBS_sss, Maui, Qbank, sss_xml_svr
  Running on Linux, HP-UX, AIX 5.1, IRIX 6.5 (to come: Tru64, Solaris)
  Uses SSSRMAP v1.0
Webpage for RMwg recreated w/ documentation, tarballs, rpms, bug tracking
Scheduler progress – support for error codes
QM progress – named “Bamboo”, implements SSSRMAP v2 incl. security
Accounting and Allocation – QBank portability testing
Gold – implements SSSRMAP v2.0
  Reimplemented in Perl to overcome latency issues in Java startup
  Created a suite of full-featured Perl command line clients
  Installed Gold on PNNL 11.8TF Linux cluster to compare to Qbank
  Slow progress on open sourcing. Asks about public domain SW; group says no

18 Meeting notes
Scott Jackson – RM progress continued
Future work:
  Release alpha of Bamboo, Silver, Gold
  Support multi-source resource management, multi-step job support
  Interface to system monitor
  I/O staging (need API from PM)
  Package code for distribution
  Open source Gold (BSD)
  SSL on web GUI
Issues for group discussion:
  Response Codes
  SC03
  Problem Response System
  Need process exit codes from PM
  Cluster Monitor
  Open Source
  Discuss OpenPBS_sss: is it a real SSS component, can it drop in?
http://sss.scl.ameslab.gov/downloads.shtml

19 Meeting notes
Will McClendon – Validation and Testing
Neil reports that when the SNL Institutional Cluster is up, SSS will be able to use Cplant for scalability testing.
API test supports multiprotocol
Status daemon – configurable monitoring infrastructure for clusters
Distributed Runtime System Testing
Progress: week at ANL (July)
  Major rework on framework for APItest – individual tests are atomic
  Framework handles checking tests, dependencies, and aggregate results
  Extensibility – new types of tests are easy to create
  Dependency system – defines relationships as a DAG encoded in XML (shows many examples); edges are boolean dependencies
Supported tests:
  sssTest – use ssslib to communicate with ssslib components
  shellTest – execute a command
  httpTest – app testing web interfaces
  tcpipTest – raw socket via tcp
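The dependency idea on this slide (tests related by a DAG encoded in XML, with boolean edges) can be illustrated with a small sketch. This is not APItest's actual file format or API: the element names (`suite`, `test`, `dependency`), the attributes (`on`, `expect`), and the functions `load_suite` and `run_suite` are all hypothetical, chosen only to show how a boolean edge can gate whether a dependent test runs.

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical suite description; NOT the real APItest schema.
SUITE_XML = """
<suite>
  <test name="service-directory-up" type="shellTest"/>
  <test name="register-component" type="sssTest">
    <dependency on="service-directory-up" expect="pass"/>
  </test>
  <test name="web-status-page" type="httpTest">
    <dependency on="register-component" expect="pass"/>
  </test>
  <test name="report-outage" type="shellTest">
    <dependency on="service-directory-up" expect="fail"/>
  </test>
</suite>
"""

def load_suite(xml_text):
    """Return {test name: {dependency name: required boolean outcome}}."""
    root = ET.fromstring(xml_text.strip())
    deps = {}
    for test in root.findall("test"):
        deps[test.get("name")] = {
            d.get("on"): d.get("expect") == "pass"
            for d in test.findall("dependency")
        }
    return deps

def run_suite(deps, run_test):
    """Run tests in dependency order.

    A test runs only if every dependency finished with the outcome its
    edge requires; otherwise it is recorded as None (skipped).
    """
    graph = {name: set(edges) for name, edges in deps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        edges = deps.get(name, {})
        satisfied = all(results.get(dep) is want for dep, want in edges.items())
        results[name] = bool(run_test(name)) if satisfied else None
    return results

if __name__ == "__main__":
    # Pretend every test that runs passes.
    print(run_suite(load_suite(SUITE_XML), lambda name: True))
```

With a runner in which every test passes, `report-outage` is skipped because its edge requires `service-directory-up` to fail. Ordering the tests topologically keeps each test atomic while the framework decides what is eligible to run.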

20 Meeting notes
Will McClendon – Validation and Testing
How is this different from “DART” or other testing harnesses?
  They are not doing DAG dependencies
  They don’t have regular expression matching
The SW is released inside ssslib (already available)
Updates are placed directly into the CVS by Will.
Issue tracking (same topic as Scott brought up): Is anyone using Bugzilla on the SSS website?
New hire: Ron Oldfield (new PhD)

21 Meeting notes
Paul Hargrove – PM progress
Checkpoint manager – docs nearly done
  Issue of open files 30% done
  Need to chase kernel versions (need to be a part of OSCAR)
  Hope to test on unknowing NERSC users on PDSF system
  Expect to deploy on Chiba
  Still need to define XML interfaces to checkpoint
  Outside interest from Altair (PBS Pro), LANL (SLURM), Quadrics (RMS)
  Will be able to have “something” in the SC2003 suite (toy)
Process Manager – improved scalability, mods to support SSS-PM
  Support for multi-step jobs – uses MPISH
  Now the production PM on Chiba
Monitoring – Data Warehouse written and tested
  XML parsing 80% done, response not done, Service Directory registration not done yet
Future
  Integrate/deploy on Chiba
  Release it (OSCAR-based release at SC2003)
  Demo it at SC2003

22 Meeting notes
Narayan Desai – BC progress
Communication – scale tested and in production
  Added schema version, added component tier
Event manager – data persistence and event statistics
Integration with APITest – service directory tests written, event mgr next
SSSlib – core rewrite to improve code reuse, smaller code base
Node State Manager – improved diagnostics
Build system – new config mgmt system, working on OSCAR implementation
Cluster HW Infrastructure – identified need for generic topology support
Restriction Syntax – command format
  Provides SQL-like functionality
  Now it is Disjunctive Normal Form
  Data ownership is explicit – which component owns
  Basic command syntax (describes and shows examples from report)
Future – improved integration to RM, admin tools leveraging R syntax
  More APITest tests
  Dummy components – such as “file stager”
Long syntax discussion
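The slide describes the restriction syntax as SQL-like and as being in Disjunctive Normal Form. As a rough illustration of what DNF selection means, here is a minimal sketch. It is not the restriction syntax itself, whose command format is defined in the BC group's report; the data layout and the names `matches`, `select`, and `NODES` are hypothetical.

```python
# A query in disjunctive normal form is an OR of clauses,
# where each clause is an AND of simple field == value tests.

def matches(record, dnf):
    """True if the record satisfies at least one clause completely."""
    return any(
        all(record.get(field) == value for field, value in clause.items())
        for clause in dnf
    )

def select(records, dnf, fields):
    """SQL-like selection: project `fields` from the records matching `dnf`."""
    return [{f: r.get(f) for f in fields} for r in records if matches(r, dnf)]

# Hypothetical node records, for illustration only.
NODES = [
    {"name": "n1", "state": "up", "arch": "x86"},
    {"name": "n2", "state": "down", "arch": "x86"},
    {"name": "n3", "state": "up", "arch": "ppc"},
]

if __name__ == "__main__":
    # (state == up AND arch == x86) OR (name == n2)
    query = [{"state": "up", "arch": "x86"}, {"name": "n2"}]
    print(select(NODES, query, ["name"]))  # [{'name': 'n1'}, {'name': 'n2'}]
```

One reason a normal form is attractive here is that every query has the same two-level shape, so a component only needs one evaluation loop regardless of how the restriction was written.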

23 Meeting notes
Thomas Naughton – SSS deployment using OSCAR
A release of OSCAR that contains all SSS software
Roll SSS components into OSCAR packages – RPM format
Create repository for OSCAR package uploads
  SourceForge sss-oscar.sf.net for our team use
  Accounts & CVS permissions
Establish “supported” Linux distribution
  RedHat 7.3? Or 9.0? Discussion, and group decides 9.0
  Myrinet?
  Put an OSCAR RH9.0 version on ftp site for team to grab
OSCAR Homepage: http://oscar.sf.net
Proposed timeline for SC2003 SW release
  Oct 06: SSS pkgs OSCAR-ized & in CVS
  Oct 24: CVS freeze – begin beta tests
  Nov 17: SC2003

24 Meeting notes Day 2
Erik Debenedictis – Issues for peta-scale systems
Redstorm and Bluelight: mesh rather than switch
  Means that topology is an important consideration
Speed of light is about 5% of communication time
  But this is growing at 40% a year, so that in 2008 light will be 20%
  Discussion that machine size in 2008 may be physically smaller
Either SW has to have hooks for manual placement, or automatically optimized placement
For SSS to consider:
  XML attribute to specify topology and I/O resources
  XML attribute to specify data arrangement on disk
  OS functionality hints to help auto placement
Ron Oldfield – distributed file copy and permutation
To what extent does SSS want to be involved in the post-100T range? Yes.
Is it appropriate for SNL to consider work in this area as part of its contribution to SSS? No. Fred asked us not to do I/O – he funds I/O through other projects.
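The speed-of-light figures on this slide can be sanity-checked with a quick compounding calculation. The sketch below assumes simple compound growth of the 5% share at 40% per year, which is an assumption on our part; the slide does not state the base year or the exact model, and the function name `projected_share` is illustrative.

```python
def projected_share(initial_percent, annual_growth, years):
    """Compound a percentage share forward by a fixed annual growth rate."""
    return initial_percent * (1 + annual_growth) ** years

if __name__ == "__main__":
    for years in range(6):
        print(years, round(projected_share(5.0, 0.40, years), 1))
```

Under this assumption the share reaches about 19% after four years and about 27% after five, so the slide's 20% figure corresponds to roughly four years of growth.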

25 Meeting notes Day 2
Scott Jackson – Error reporting and codes
Divide up code space in a consistent way.
  Code  Class
  0xx   Success
  1xx   Warning
  2xx   Wire protocol
  3xx   Message XML format
  4xx   Security
  5xx   Event Management
  6xx   Reserved
  7xx   Server application
  8xx   Client application
  9xx   Misc Failure
  999   Unknown Failure
Rusty mentions MPI error classes and error codes
Al suggests these general error classes – success, warning, temp failure, partial failure, failure
People need to come up with a counter proposal if they care
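A minimal sketch of how a client might map a three-digit response code onto the proposed classes. The class names come from the slide; the table name `CODE_CLASSES` and the function `classify` are hypothetical and not part of any SSS library. Codes are treated as three-character strings here so that leading zeros such as 000 are preserved.

```python
# Leading digit -> class name, as proposed on the slide.
CODE_CLASSES = {
    "0": "Success",
    "1": "Warning",
    "2": "Wire protocol",
    "3": "Message XML format",
    "4": "Security",
    "5": "Event Management",
    "6": "Reserved",
    "7": "Server application",
    "8": "Client application",
    "9": "Misc Failure",
}

def classify(code):
    """Return the class name for a three-digit response code string."""
    if len(code) != 3 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a three-digit code, got {code!r}")
    if code == "999":  # special case called out on the slide
        return "Unknown Failure"
    return CODE_CLASSES[code[0]]

if __name__ == "__main__":
    for code in ("000", "312", "999"):
        print(code, classify(code))
```

Because the leading digit alone selects the class, a component can react sensibly to a code it has never seen, which is the main benefit of dividing the code space consistently.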

26 Meeting notes Day 2
Narayan – Communication second vote
Wire protocols – need to add security envelope protocol
Added service location. Bootstrapped using /etc/sss/
Vote to accept as spec for:
  Wire Protocol definition to get new ones accepted
  Service Directory interface
  Event Manager interface
Second vote: 16 yes, 2 abs, 0 no
Rusty – Plan for voting on specific component interfaces
  Service directory (today)
  Event manager (today)
  Node state manager (1st vote next time)
  Build system (discuss next time)
  Process manager (1st vote next time)

27 Meeting notes Day 2
Al – SC2003
Rusty – fancy dancing meatball in wxPython!
Mike – try to run SSS suite on 1400-node cluster (not at SC, before SC). Capture a trace log to play.
Thomas – SSS-OSCAR working! Implies that the whole suite works together
Will – fancy graphic demonstration of APITest
Brett – demonstrate swapping components in SSS architecture (show accounting?)
Paul – chkpoint interacting with PM on Chiba
Where? All across the show floor
SciDAC booth – talks by Geist, Rusty, Craig
OSCAR BOF on Tuesday 5:00-6:00 will mention SSS-OSCAR

28 Scalable Systems Software Suite: Working Components and Interfaces (bold), Interfaces needing work (red)
[Architecture diagram. Labels recovered from the figure:]
Components: Grid Interfaces, Meta Services (Meta Scheduler, Meta Monitor, Meta Manager), Accounting, Event Manager, Service Directory, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint / Restart, Validation & Testing, Hardware Infrastructure Manager
Infrastructure: Standard XML interfaces, authentication, communication
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite

