Review report, Vote on APIs Quarterly report, and SW release Al Geist June 5-6, 2003 Chicago, IL
Coordinator: Al Geist
Participating Organizations: ORNL, ANL, LBNL, PNNL, PSC, SDSC, IBM, SNL, LANL, Ames, NCSA, Cray, Intel, Unlimited Scale
External reviewers want to see more vendors involved.
Have begun working with Don Mason and John Lawson to set up a presentation to a vendor forum.
Will need your participation when logistics are known.
Scalable Systems Software
Problem:
  Computer centers use incompatible, ad hoc set of systems tools
  Present tools are not designed to scale to multi-Teraflop systems
Goals:
  Collectively (with industry) define standard interfaces between systems components for interoperability
  Create scalable, standardized management tools for efficiently running our large computing centers
Impact:
  Reduced facility mgmt costs.
  More effective use of machines by scientific applications.
Areas: Resource Management, Accounting & user mgmt, System Build & Configure, Job management, System Monitoring
Scalable Systems Software Center February Chicago ILL Review of Last Meeting Details in Main project notebook
Progress Reports at Feb. mtg
Al Geist – preparation for external review, SciDAC PI meeting, posters, and demos
Working Group Leaders –
  What areas their working group is addressing
  Progress report on what their group has done
  Present problems being addressed
  Next steps for the group
  Discussion items for the larger group to consider
Discussion of Prototype Components
Prep for external review demo
Slides can be found in Main Notebook
Consensus and Voting: None at last meeting. Something we need to start doing again.
Scalable Systems Software Center February-June Progress Since Last Meeting
SciDAC PI mtg – all 50 projects
March 10-11, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode
20 minute talk – presented by Al
Scalable Systems, CCA, PERC, SDM
Poster Presentation
External SciDAC Review mtg
March 12-13, 2003 – Napa, California
Attending for Scalable Systems – Al Geist, Brett Bode, Paul Hargrove, Narayan Desai, Mike Showerman. (Rusty)
Four ISIC Projects were reviewed separately – Scalable Systems, CCA, PERC, SDM
External review panel (9 members): Bob Lucas, Jim McGraw, Jose Munoz, Lauren Smith, Richard Mount, Ricky Kendall, Rod Oldehoeft, Tony Mezzacappa, and John Grosh
Day 1 – We had 1 ¾ hours to present project
Day 2 – We got grilled by panel for 1½ hrs
External Review mtg Agenda
Wednesday, March 12
7:45 Welcome, charge to reviewers
8:15 Plenary session for Common Component Architecture ISIC
10:00 Break
10:15 Plenary session for Scalable Systems Software ISIC
  Al Geist gives 1 hr project overview, vision, goals
  Last 45 minutes team gives demos, answers questions
12:00 Reviewer caucus
12:15 Lunch
1:15 Plenary session for Scientific Data Management ISIC
3:00 Break
3:15 Plenary session for Performance Engineering ISIC
5:00 Reviewer caucus
5:30 Adjourn
External Review Demo: Working Components and Interfaces (bold)
Components: Grid Interfaces, Accounting, Event Manager, Service Directory, Meta Scheduler, Meta Monitor, Meta Manager, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, Meta Services, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Checkpoint / Restart, Validation & Testing, Hardware Infrastructure Manager
Standard XML interfaces; authentication; communication
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
External Review mtg Agenda Day 2
Thursday, March 13
8:00 Meetings between reviewers and ISIC members
  A. Common Component Architecture
  B. Scalable Systems Software (Jim McGraw)
9:45 Break
10:00 Meetings between reviewers and ISIC members
  C. Scientific Data Management
  D. Performance Engineering
11:45 Reviewer Caucus/End of ISIC Reviews
12:15 Lunch (McGraw gives initial assessment)
Team brainstorms on the response to the initial comments. These were sent to McGraw that same day.
External Review Initial Comments
Response on top two issues:
1. Lack of large-scale testbed for scalable systems
  Mike Showerman of NCSA says they will have a 900 processor system by late summer that Scalable Systems software could be tested on. He also said there are plans to get an additional 1300 node system. CPlant has also been suggested as a possible large scale (~1200 processor) test platform.
2. Get more vendors involved and more "buy-in"
  I will redouble my efforts to get SGI involved in Scalable Systems again. HP has been a tough nut to crack; both PSC and PNNL have tried to get them to engage. I'll see if PSC and PNNL are willing to try again. By late summer we will have a beta release of the suite that I can use to demonstrate to vendors our progress and the advantages of going the scalable systems path.
Official External Review Report
Arrived in May 2003
Organizationally the project has developed effective working units
The project appears to be on schedule for technical issues
It has made several noteworthy accomplishments
Recommendations:
  The two greatest obstacles to the success of this project are the availability of an adequate testbed for proving scalability of the interface design and the willingness of vendors to adopt the design for future systems
Secondary considerations:
  Investigate relationship with CCA
  Investigate file system plan
  Investigate security plan
  Importance of fault tolerance at smaller cluster sizes
  Develop test workloads
Five Project Notebooks filling up
A main notebook for general information
And individual notebooks for each working group
Over 270 total pages – few added since last meeting
Add Telecon meeting notes even if short
Have had several web server problems this quarter
Get to all notebooks through main web site
Click on side bar or on “project notebooks” at bottom of page
Bi-Weekly Working Group Telecons
Have been sparse since March review
Resource management, scheduling, and accounting
  Tuesday 3:00 pm (Eastern) keyword “SSS mtg”
Validation and Testing (hasn’t met since last year)
  Wednesday 1:00 pm (Eastern) mtg code
Process management, system monitoring, and checkpointing
  Thursday 1:00 pm (Eastern) mtg code
Node build, configuration, and information service
  Thursday 3:00 pm (Eastern) mtg code (changes)
Scalable Systems Software Center June 5-6, 2003 This Meeting
Major Topics this Meeting
MICS request for Highlights – Fred sent out a call for 2 page highlights due to MICS by June 12. Has anyone responded? I sent in our 2 pager
Response to Reviewers Report – need feedback from the team on our official response to the points in the report
Quarterly Report Due – would like to get one to Fred by end of June. Will need text from WG leaders.
Formal API presentations and voting – it is that time in the project when we should be settling on some APIs.
SC2003 Tutorial – proposal submitted at Fred’s request.
Have a software suite released before SC2003
Agenda – June 5
8:30 Al Geist – Project Status. Qtr report coming up and External review report
9:00 Matt Sottile – Using Scalable Systems API
Working Group Reports
9:30 Scott Jackson – Resource Management
10:30 Break
11:00 Erik Debenedictis – Validation and Testing
12:00 Lunch (on own - walk to cafeteria)
1:00 Paul Hargrove – Process Management + Rusty slides
1:30 Craig Stefan – Warehouse Monitoring framework
2:00 Narayan Desai – Node Build, Configure
  Stephen Scott – OSCAR release with SSS inside
3:00 Break
3:30 Presentation of formal APIs for discussion
5:00 Rusty, Scott, Narayan, Paul?
5:30 Adjourn
Working groups may wish to prepare material for voting Friday
Agenda – June 6
8:30 Discussion, proposals, straw votes
  Discussion of review report
  API proposals for envelope
10:30 Break
11:00 Al Geist – Summary Qtr Report. Next meeting date.
12:00 Meeting ends
Meeting notes
Matt Sottile – bproc (bstat_sss) software integrated with cluster status component
  Good (was able to do it in a day), bad (shouldn’t take 8 hours), ugly (python)
  Example with distribution didn’t help much. XML isn’t well documented
  But it is a prototype distribution, so some of these issues are expected
  Major gripes: had to write code for sockets and for XML parsing and creation. These should be APIs
  He then talks about Linux TCP being a hack
  XML parsing – the schema and associated parser are intimately related
  Noted that code had some constructs that could be made more robust
CCA thoughts on relation to our project
  His expertise is language interoperability and runtime frameworks
  Law of least surprises. Consistency is good
  Insulate developers from the support structure
  Components: the wheel everyone continues to reinvent
  But in SSS there aren’t components – just XML and wire protocol
  CCA provides: SIDL, standard interfaces to runtimes – CCAFFEINE, CCAT, Dune, …
  Suggests: could try to leverage CCA messaging layer, define interfaces in SIDL, build services that conform to SIDL
  CCA provides no security
  Concentrate on interfaces and the problem of mapping concrete services into the interface space of SSS
Conclusion: Clean up APIs to minimize possibilities for version skew. Too late to adopt CCA model
Overall things worked – a good accomplishment
Showed demo
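The gripe above, that every integrator ends up rewriting socket and XML code, can be made concrete with a small helper. This is only a sketch: the 4-byte length prefix, the function names `send_xml` and `recv_xml`, and the `get-node-state` message are assumptions for illustration, not the SSS wire protocol or the SSSlib API.

```python
import socket
import struct
import xml.etree.ElementTree as ET


def send_xml(sock, element):
    """Serialize an Element and send it behind a 4-byte length prefix (assumed framing)."""
    payload = ET.tostring(element, encoding="utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock, count):
    """Read exactly `count` bytes, or raise if the peer closes early."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_xml(sock):
    """Receive one length-prefixed message and parse it into an Element."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return ET.fromstring(_recv_exact(sock, length))


def demo():
    """Round-trip a hypothetical request over a local socket pair."""
    left, right = socket.socketpair()
    try:
        send_xml(left, ET.Element("get-node-state", {"node": "n001"}))
        reply = recv_xml(right)
        return reply.tag, reply.get("node")
    finally:
        left.close()
        right.close()
```

With helpers like these a component author would only build and inspect XML elements; the framing and partial-read handling live in one place.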
Meeting notes
Scott Jackson – RM wg report
Progress – SSS front end created for Qbank. Soon to release v1.0 of Open PBS, Maui, and Qbank, all with SSS XML front end.
Created Job Object specification v2.0
Created SSSRMAP v2.0 – in notebook
Scheduler progress:
  40% of clients now using SSSRMAP
  Supports dynamic reservations to support growing and shrinking MPI jobs
  Security – support for a user specified keyfile
  Fault tolerance – implemented a fallback server
  Ease of use – initial web-GUI developed
Queue Manager progress – updated service directory and event manager interfaces
Accounting and allocation manager progress – GOLD
  All functionality of Qbank plus support for deposits, support for hierarchical accounts, support for refunds, guaranteed quotes, negotiation of options
  Added role-based access control, authentication, and encryption
  Got PNL OK to open source as BSD, sent to Fred for DOE OK
Will talk about SSSRMAP v2 details this afternoon, in particular interfaces to other working group components.
Meeting notes
Will McLendon – Validation and testing WG update
Strategies for distributed runtime system testing – users expect high quality
ESP benchmark – out of NERSC, used in procurement to predict the effectiveness of a system before it is purchased.
  Could be used to test the SSS suite
  Consider putting ESP on the SSS testbed(s)
APItest – most of the work this quarter is going on here.
  Recoded in Python for portability (C++ version had portability problems)
  Integrated into SSSlib as part of the distribution
  Tests well under development for ssslib components
  Status slide shows working, prototype, and planned features
  Black box testing – does component support the API
  White box testing – coverage tests, internal states of component, unreachable states
  Encoding XML inside XML is a problem
  Ran real demos of APItest running on Chiba City
  MySQL database support – used to store raw test results
Work still to do – see status slide
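The notes do not say how APItest handles XML nested inside XML. One common workaround, sketched here with hypothetical element names (`test-result`, `raw-reply`), is to store the inner document as escaped text and parse it separately on the way out.

```python
import xml.etree.ElementTree as ET


def wrap_result(test_name, raw_xml):
    """Store a component's raw XML reply as escaped text inside a result record."""
    result = ET.Element("test-result", {"name": test_name})
    reply = ET.SubElement(result, "raw-reply")
    reply.text = raw_xml  # ElementTree escapes <, > and & when serializing text
    return ET.tostring(result, encoding="unicode")


def unwrap_result(record_xml):
    """Recover the inner document and parse it as XML in its own right."""
    record = ET.fromstring(record_xml)
    return ET.fromstring(record.find("raw-reply").text)


# Hypothetical reply from a component under test.
inner_doc = '<node name="n001"><note>a &amp; b</note></node>'
record = wrap_result("node-state-query", inner_doc)
recovered = unwrap_result(record)
```

The trade-off is that the outer record can no longer be validated against the inner schema, and each extra level of nesting escapes the ampersands again.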
Meeting notes
Paul Hargrove – used my laptop for presentation – see slides
  Checkpoint/restart progress is stalled because person has been pulled off our project by Bill McCurdy to work on NERSC projects.
Craig Steffan – Warehouse Monitoring Software Infrastructure
  Describes the old way cluster monitor worked and scalability issues with it
  Presents new design – each node is a peer, each can be root of subtree
  They can be grouped into “information storehouses” w/ multiple sources and sinks
  Showed how it can be used to monitor multiple clusters in a compute center
  Information storehouse infrastructure is done.
  Sources and Sinks – next step will be to write simple ones, then more complex
  Lots of questions about the design. Good answers from Craig
  Only update changing information
  In next 6 months – self balancing systems by tuning update intervals, and message passing to request information through the tree
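The "only update changing information" idea can be illustrated with a toy tree of peers. This is a sketch of the general technique, not Warehouse code; the class `MonitorNode`, its fields, and the sample metric are all hypothetical.

```python
class MonitorNode:
    """A peer in a monitoring tree that forwards only values that changed."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # None means this peer is the root of its tree
        self.store = {}        # last known value per (source, metric)
        self.forwarded = 0     # number of updates passed up the tree

    def report(self, source, metric, value):
        """Record a reading; return True only if it changed and was propagated."""
        key = (source, metric)
        if key in self.store and self.store[key] == value:
            return False       # unchanged, so nothing travels up the tree
        self.store[key] = value
        if self.parent is not None:
            self.forwarded += 1
            self.parent.report(source, metric, value)
        return True


center = MonitorNode("center")
cluster = MonitorNode("cluster-a", parent=center)
cluster.report("n001", "load", 0.5)
cluster.report("n001", "load", 0.5)  # duplicate reading stays local
cluster.report("n001", "load", 0.7)
```

Because any node can act as the root of its own subtree, the same class serves for a single cluster or for a center-wide tree of clusters.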
Meeting notes
Narayan Desai – BCWG report
  All APIs changed to restriction syntax – draft spec
  Service directory – new schema and new implementation
  Event manager – same
  SSSlib – more wire protocol modules – SSL, SSSRMAP
  OSX port in progress
  Build and configuration now has diagnostic services
  Hardware infrastructure issues discussed – what does system look like right now?
  Open issues:
    specification formats – what tests does it need to pass
    release formats – see OSCAR slides
    XML interface formats
    multiple implementations
Thomas Naughton – SSS deployment using OSCAR
  How do users download and install the SSS suite?
  Propose to leverage OSCAR framework
  OSCAR core – SIS, C3, ODA, Env-Switcher
  OSCAR package facility – RPMs and other package classes
  OSCAR package loader
  Seems to be consensus of group to do this for SC2003
Meeting notes
Rusty Proposal – an API for the Process Management Component
  He says the material is not quite in the form needed to vote on, but here is the process we should follow to vote in standard APIs
  Voting should be on a document that has:
    descriptions
    examples, both simple and complex
    details of XML schema
  See his slides for details of his process manager interface proposal
  Much discussion.
Scott Jackson – SSSRMAP v2 proposal
  Have taken an object oriented approach to jobs and attributes
  Goes over basic examples in proposal (found in RM notebook)
  Discussion of the differences between RM Schema and BC Schema
    Part of the difference is the incorporation of security
    Another part is functional vs object oriented
  Discussion of outer (envelope, signature, body) framing and putting it in SSSlib (vote)
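To make the envelope, signature, body framing concrete, here is a minimal sketch. The element names, the HMAC-SHA256 digest, and the shared key are assumptions for illustration; the actual SSSRMAP v2 framing is defined in the proposal in the RM notebook, and a real scheme would need proper XML canonicalization before signing.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET


def _sign(body, key):
    """Hex HMAC-SHA256 over the serialized body element (no canonicalization)."""
    data = ET.tostring(body, encoding="unicode").encode("utf-8")
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def build_envelope(body, key):
    """Wrap a body element in an Envelope carrying a Signature and a Body."""
    envelope = ET.Element("Envelope")
    ET.SubElement(envelope, "Signature").text = _sign(body, key)
    ET.SubElement(envelope, "Body").append(body)
    return ET.tostring(envelope, encoding="unicode")


def verify_envelope(envelope_xml, key):
    """Recompute the digest over the received body and compare in constant time."""
    envelope = ET.fromstring(envelope_xml)
    claimed = envelope.findtext("Signature", default="")
    body = envelope.find("Body")[0]
    return hmac.compare_digest(claimed, _sign(body, key))


# Hypothetical request and shared key.
request = ET.Element("Request", {"action": "Query", "object": "Job"})
message = build_envelope(request, b"shared-secret")
tampered = message.replace('object="Job"', 'object="Node"')
```

Keeping this framing in one library, as discussed for SSSlib, means components differ only in what they put inside the body.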
Meeting notes Day 2
Al Geist – action items
1. Need Working group leaders to send me a couple pages for the Qtr rpt: Status and Progress from Feb-June
2. Any comments on points in the external reviewers report. Paragraph or two is fine.
Meeting notes
Narayan Desai – Restriction syntax proposal
  Goes over basic command syntax where an attribute can be “*” wildcarded
  Goes over complex command syntax
  Matching semantics – especially for wildcards
  Benefits of this approach – compact, powerful, simple syntax, validatable, data ownership is explicit
  Uses MySQL on the backend
  This syntax has Conjunctive Normal Form
    Discussion that we need to add negation before this is true
  What about regular expression support?
  More discussion on how to do various things like “join” and “union”
Discussion of the Communication Infrastructure Spec Draft (hardcopy handed out)
  We should be able to hardwire components together.
  Existence of static file to define where things are – may just have service directory
  Unix Domain socket protocol for SMP servers
  Vote – accept the spec pending: Yes 15, No 0, Abstaining 0
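A toy version of wildcard matching, to show the kind of semantics under discussion. The rule assumed here ("*" matches any value, but the attribute must be present) and the dictionary representation are illustrative guesses, not the draft spec.

```python
def matches(restriction, record):
    """True if every attribute in the restriction is satisfied by the record."""
    for attr, wanted in restriction.items():
        if attr not in record:
            return False          # assumed: a wildcard still requires the attribute
        if wanted != "*" and record[attr] != wanted:
            return False
    return True


def select(restriction, records):
    """Return the records that satisfy the restriction, in their original order."""
    return [record for record in records if matches(restriction, record)]


# Hypothetical node records.
nodes = [
    {"name": "n001", "state": "up", "arch": "ia32"},
    {"name": "n002", "state": "down", "arch": "ia32"},
    {"name": "n003", "state": "up"},
]
up_with_arch = select({"state": "up", "arch": "*"}, nodes)
```

Negation, regular expressions, and "join" or "union" style queries, all raised in the discussion, would need more than this flat attribute match.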
Meeting notes
Paul – Discusses the idea of hiding the socket code in a library
  Matt says he would be happy to contribute such a server.
Discussion of scalability of the event manager – not a problem because the number of meatballs does not increase with system size.
  Question about the ordering of event notifications
Scott – Lively discussion of the two XML variants
  What are the strengths and weaknesses of both
Agreement on having common error objects with 3 digit codes and messages
  Message is a human readable string. Two special ones: 000 success, 999 unknown
  Straw vote: Yes 15, No 1, Abs 0
Add “supported scheme version” to Service directory
  Vote: Yes 15, No 0, Abs 0
Next meeting September 9-10 in DC so Fred can attend?
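The agreed error object (a 3 digit code plus a human readable message, with 000 for success and 999 for unknown) could look something like the following. The class name `Status` and the XML layout are assumptions; only the code format and the two special values come from the notes.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

SUCCESS = "000"   # special code from the notes: success
UNKNOWN = "999"   # special code from the notes: unknown error


@dataclass(frozen=True)
class Status:
    """A common error object: 3 digit code plus human readable message."""
    code: str
    message: str

    def __post_init__(self):
        if not re.fullmatch(r"[0-9]{3}", self.code):
            raise ValueError("status code must be exactly three digits")

    @property
    def ok(self):
        return self.code == SUCCESS

    def to_xml(self):
        """Serialize using a hypothetical <Status code="..."> layout."""
        element = ET.Element("Status", {"code": self.code})
        element.text = self.message
        return ET.tostring(element, encoding="unicode")


def status_from_xml(text):
    """Parse a Status element; a missing code is reported as 999 unknown."""
    element = ET.fromstring(text)
    return Status(element.get("code", UNKNOWN), element.text or "")
```

Keeping the code as a string preserves leading zeros, so "000" stays distinct from an integer 0 when it travels between components written in different languages.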