Progress on Release, API Discussions, Vote on APIs, and PI mtg. Al Geist, January 14-15, 2004, Chicago, IL.

Participating Organizations Changes
Coordinator: Al Geist
Participating Organizations: ORNL ANL LBNL PNNL PSC SDSC IBM SGI SNL LANL Ames NCSA Cray Intel Unlimited Scale

Scalable Systems Software
Participating Organizations: ORNL ANL LBNL PNNL NCSA PSC SDSC SNL LANL Ames; industry: IBM Cray Intel SGI
Problem: computer centers use incompatible, ad hoc sets of systems tools; present tools are not designed to scale to multi-teraflop systems
Goals: collectively (with industry) define standard interfaces between systems components for interoperability; create scalable, standardized management tools for efficiently running our large computing centers
Areas: Resource Management; Accounting & User Mgmt; System Build & Configure; Job Management; System Monitoring
To learn more visit

Potential Impact of Project
Fundamentally change the way future high-end systems software is developed and distributed
Reduced facility management costs – reduce need to support ad hoc software; better systems tools available; able to get machines up and running faster and keep them running
More effective use of machines by scientific applications – scalable launch of jobs and checkpoint/restart; job monitoring and management tools; allocation management interface

Scalable Systems Software Suite – Working Components and Interfaces (bold)
Meta Services: Meta Scheduler, Meta Monitor, Meta Manager, Grid Interfaces
Components: Accounting; Event Manager; Service Directory; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint / Restart; Hardware Infrastructure Manager; Validation & Testing
Standard XML interfaces; authentication; communication
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
First release at SC2003; packaging & install
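Since components in any of these languages interoperate through standard XML interfaces over a common wire protocol, an exchange can be sketched as building and parsing a small XML request. The element and attribute names below are invented for illustration; they are not the actual SSS schemas.

```python
import xml.etree.ElementTree as ET

# Build a hypothetical query to a service-directory-like component.
# (Tag and attribute names are illustrative only, not the real SSS spec.)
def build_request(component, command):
    req = ET.Element("request", component=component)
    ET.SubElement(req, "command", name=command)
    return ET.tostring(req, encoding="unicode")

# The receiving component parses the same XML back into its parts.
def parse_request(xml_text):
    root = ET.fromstring(xml_text)
    return root.get("component"), root.find("command").get("name")

wire = build_request("service-directory", "lookup")
assert parse_request(wire) == ("service-directory", "lookup")
```

Because the on-the-wire unit is plain XML text, a component's implementation language is invisible to its peers, which is what lets C, Java, Perl, and Python pieces mix in one suite.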

Review of Last Meeting
Scalable Systems Software Center, September, Washington D.C.
Details in the Main project notebook

Highlights from Sept. mtg
Rusty Lusk – Using SSS as the production systems software on Chiba City for a number of months now. Uses restriction syntax for everything. Got blessing of the ANL sysadmin group.
Scott Jackson – Standard error reporting and codes across components. Discussed dividing up the code space in a consistent way.
Eric Debenedictus – Issues for peta-scale systems. Redstorm and Bluelight use a mesh rather than a switch, so topology is an important consideration for SSS: an XML attribute to specify topology and I/O resources; an XML attribute to specify data arrangement on disk; OS functionality hints to help automatic placement.
Thomas Naughton – SSS deployment using OSCAR: a release of OSCAR that contains all SSS software; roll SSS components into OSCAR packages (RPM format); create a repository for OSCAR package uploads.

Highlights from Sept. mtg (cont.)
Al Geist – Plans for SC2003.
Working Group Leaders – What areas their working group is addressing; progress report on what their group has done; present problems being addressed; next steps for the group; discussion items for the larger group to consider.
Long Term Strategy – Get computer centers involved and using the suite; get vendors to be compliant with the APIs.
Slides can be found in the Main Notebook.

Consensus and Voting: Communication Infrastructure Spec
Wire protocols – need to add a security envelope protocol. Added service location, bootstrapped using /etc/sss/. Vote to accept as the spec for wire protocol definition, with a process to get new ones accepted.
Service Directory interface and Event Manager interface – second vote: 16 yes, 2 abstaining, 0 no.
Agreement on having common error objects with 3-digit codes and messages, where the message is a human-readable string. Two special codes: 000 success, 999 unknown. Straw vote: 15 yes, 1 abstaining, 0 no.
Al suggests these general error classes: success, warning, temp failure, partial failure, failure. People need to come up with a counter proposal if they care.
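The agreed convention (a 3-digit code plus a human-readable message, with 000 for success and 999 for unknown) could be modeled roughly as below. The class name, validation, and helper method are assumptions for illustration, not part of the voted spec.

```python
# Sketch of a common SSS-style error object: 3-digit code + message.
# Codes 000 (success) and 999 (unknown) are the two special values
# agreed at the meeting; everything else here is illustrative.
class SSSError:
    SUCCESS = "000"
    UNKNOWN = "999"

    def __init__(self, code, message):
        # Enforce the agreed "exactly three digits" code space.
        if not (len(code) == 3 and code.isdigit()):
            raise ValueError("error code must be exactly 3 digits")
        self.code = code
        self.message = message  # human-readable string

    def ok(self):
        return self.code == self.SUCCESS

ok = SSSError("000", "success")
err = SSSError("999", "unknown")
assert ok.ok() and not err.ok()
```

Dividing the remaining 001–998 range among Al's suggested classes (warning, temp failure, partial failure, failure) would be the follow-on step the group left open.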

Progress Since Last Meeting
Scalable Systems Software Center, September–January

Systems Software Suite Release
Open Source License – Fred asked that we come up with one general text that all organizations can agree on, and then he will bless it. DONE.
SSS-OSCAR – Packaging done for all components (working around those components with license issues).
First Release – Announced at SC2003. Available from the project web site.

SC2003 Scalable Systems Demos and Talks
Rusty – fancy dancing meatball in wxPython
Thomas – SSS-OSCAR working
Will – fancy graphic demonstration of APITest ????
Brett – demonstrate swapping components in the SSS architecture
Paul – checkpoint interacting with PM on Chiba
Locations: all across the show floor
SciDAC booth – talks by Rusty, Craig
OSCAR BOF on Tuesday 5:00-6:00 mentions SSS-OSCAR

Five Project Notebooks
A main notebook for general information, and individual notebooks for each working group
Over 297 total pages – 16 added since last meeting
BC and PM groups need to get specs into their notebooks
Add telecom meeting notes even if short
Get to all notebooks through the main web site: click on the side bar or at "project notebooks" at the bottom of the page

Bi-Weekly Working Group Telecoms
Starting back up after the holidays
Resource management, scheduling, and accounting – Tuesday 3:00 pm (Eastern), keyword "SSS mtg"
Validation and Testing Group – no need for telecoms recently
Process management, system monitoring, and checkpointing – Thursday 1:00 pm (Eastern), mtg code
Node build, configuration, and information service – Thursday 3:00 pm (Eastern), mtg code (changes)

This Meeting
Scalable Systems Software Center, January 14-15, 2004

Major Topics this Meeting
Stability of Systems Software Suite – first release is out. Are we ready for a more robust second release?
Large Scale test run – NCSA has dedicated some time tonight to run our suite on their 1250 dual-node cluster
Quarterly Report due – would like to get one to Fred by end of January. Will need text from WG leaders.
Formal API presentations and voting – it is that time in the project when we are finalizing some APIs
SciDAC PI Mtg – March in Charleston, SC. We will need poster(s), a talk, and a 2-page summary document

Agenda – January 15
8:30 Al Geist – Project Status
9:15 Thomas Naughton – SSS-OSCAR software suite release
Working Group Reports: progress report on what each group has done; API proposals for adoption by the group; progress on software suite improvements
9:30 Narayan Desai – Node Build, Configure
10:30 Break
11:00 Will McLendon – Validation and Testing
12:00 Lunch (on own – cafeteria room B)
1:00 Paul Hargrove – Process Management
2:00 Scott Jackson – Resource Management
3:00 Break
3:30 Narayan – Review of "restriction syntax" style of XML
4:00 Rusty – Discussion of restriction syntax for scheduler and queue mgr
4:30 Craig – Brief on big testbed run
5:00 Eric – competitive system to SSS
5:30 Adjourn
Evening – working groups may want to help with the large NCSA test run

Agenda – January 16
8:30 Discussion, proposals, votes
Rusty – Process Manager API (discussion/vote)
Narayan – Node State API (discussion/vote)
Scott – Allocation Manager API (discussion/vote)
Brett – Queue Manager API (discussion/vote)
Scott – SSSRMAP interface
Al – Progress report
Al – SciDAC mtg 2-pager, posters, talks
10:30 Break
11:00 Al Geist – Summary
SciDAC PI Mtg: March 22-24, Charleston SC
Next meeting: May; location: Argonne
12:00 meeting ends

Meeting notes
Al presents his slides.
Thomas Naughton – SSS deployment using OSCAR
Good – RPMs created for all SSS components! OSCAR packaging (varying levels). SourceForge project supplied a central CVS location.
Bad – not all scripts are created equal (new untested submissions). Some pain getting SF accounts. Time constraints forced script hacks. OSCAR testing framework.
Status – Tarball available; fairly toxic but builds a full working cluster with SSS. Updated OSCAR pkg HowTo.
ToDo – clean up hacks; integrate remaining SSS components (qbank); add SSS interface to OSCAR itself. Would like to establish a release schedule – March 1.
Not clear that anyone has downloaded it yet. Discussion of how many orgs in our group could shake down the tarball. Group feels it is better to have a few very reliable components than all components.

Meeting notes
Narayan – Node Build progress report
Only had a few minor bug fixes; infrastructure has been reliable for 6 months.
Library updates – portability (OS X support, 64-bit tested, Tru64 support); thread-safety; SSL wire-protocol module, soon to be the default protocol in ssslib.
Node State Manager – reliable.
Build System – building vs. configuration interface/conflict issues.
Hardware Infrastructure – model needs refinement with respect to topology info.
Restriction Syntax augmentations – new operators added (negations, numeric, regular expression); integrated into all Python components.
Next steps – work on new models for hardware infrastructure; work on multiple implementations of BCM components; performance tuning for ssslib, event manager, service directory.

Meeting notes
Will McLendon – Component Interface testing report
Description of his work for the new folks.
SC2003 demo of APITest v.1 in the ASCI booth (GUI HTTP interface), built on the Twisted framework: DB interfacing, distributed component testing, HTTPD mode.
APITest development – lessons learned. V.2 new test file formats (collaboration with Jackson); separate individual tests from batch grouping.
Runs through some examples. Feedback is encouraged. Hopes to get some real test suites going this quarter.
Ron Oldfield – introduced. Shows the graphical APITest demo that was given at SC2003.

Meeting notes
Paul Hargrove – Process Management report
SSS-OSCAR release. Coming to a point where components have to interact more, e.g. checkpoint. Real deployment/testing on Chiba (ANL), XTORC (ORNL).
Checkpoint Manager progress – ported to RH9 (hard – Red Hat kernels…); checkpoint using LAM/MPI; stand-alone package with LAM/MPI for checkpoint; suspend/resume interface working with the queue manager.
Outstanding issues – need to design restart-time interactions; need to implement a full interface (restriction syntax, event generation, error reporting); basic ideas on file management.
Monitoring progress in SSS-OSCAR. Scalability work – thread pool, internal protocol changes; fix service directory connections; write documentation.

Meeting notes
Process Manager (cont.)
Rusty Lusk – Process Manager functionality overview
Shows a schematic of the process management components and the various commands that are in the syntax.
Progress – already a stable component; fixed several bugs at SC03; improved queries and error codes.
Future – INTEGRATION! Stable software makes this possible; Chiba production use has forced the issue. Continued development.

Meeting notes
Scott Jackson – Resource Manager report
Short overview for new attendees.
Progress – released in SSS-OSCAR: Bamboo, Maui, Gold, Warehouse. Updated RM web page as new components became available. Deployed a user-oriented problem response system. Created SSSRMAP C-implementation module. Completed per-component interface documents.
Scheduler progress – completed chkpt/restart-based SSS calls, blocked until testing is possible with the checkpoint group. Support for dynamic jobs, blocked until support is provided in PM and QM; discussion of the dynamic-jobs feature and how/if we should work on it. Resource limit enforcement and tracking needs rusage on process exit, blocked until support from PM and QM progresses.
Too much blocking – it seems the RM group lacks coordination with the other groups.

Meeting notes
Scott Jackson – Resource Manager report (cont.)
Initial release of Bamboo; wrote API document.
Accounting and allocation – QBank was an initial solution, since replaced by Gold.
Gold – released under the BSD open source license; packaged as a tarball, and an initial OSCAR RPM created; added support for Service Directory registration; implemented status codes; implemented instance-level role-based authorization. Gold is running on an 11 TF cluster at PNNL. GUI improved to include user, project, and machine management views.
Meta-scheduler – added thread support; improved the Silver installation procedure; testing of (grid-level) data staging.
Future – draft of the SSSRMAP v3 protocol spec (chunking); release alpha versions of Bamboo, Maui, Gold, Warehouse; complete design spec documents for the above components.

Meeting notes
Discussion of having two XML syntax styles (functional, object). Al says he would like to see one common style across the suite; he doesn't care which one as long as the whole group can agree. Rusty brought up a second issue, the wire protocol: having a single library that holds all the protocols used by the components in the SSS suite.
Narayan – Restriction Syntax Overview
Command syntax – incorporates imperative and database operations; allows uniform data queries across components; easy to process; improves atomicity of operations.
Semantics – examples given. Going across, attributes are ANDed, and multiple lines are ORed.
An issue of uniqueness was brought up and will be taken into consideration by Narayan.
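The AND/OR semantics Narayan described (attributes on one line are ANDed; multiple lines are ORed) can be sketched as a matcher over attribute dictionaries. The data model here is invented for illustration and is not the actual restriction-syntax wire format.

```python
# Each query "line" is a dict of attribute -> required value.
# Within a line, all attributes must match (AND);
# a record matches the query if any line matches it (OR).
def line_matches(record, line):
    return all(record.get(attr) == val for attr, val in line.items())

def query_matches(record, lines):
    return any(line_matches(record, line) for line in lines)

node = {"name": "ccn42", "state": "up", "arch": "ia32"}
query = [
    {"state": "up", "arch": "ia32"},   # line 1: state AND arch
    {"state": "down"},                 # line 2: ORed alternative
]
assert query_matches(node, query)
assert not query_matches({"state": "degraded"}, query)
```

This AND-across/OR-down shape is what lets one uniform query mechanism serve components as different as the node state manager and the queue manager.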

Meeting notes
Rusty – Restriction Syntax on Chiba City
David would like to see a paper on the requirements that the Chiba effort produced.
Narayan – hack of quick interfaces for the Queue Manager. The restriction interface has 4 commands (add, del, run, get). Doesn't show the Scheduler interface.
Craig – the 1280 dual-Xeon cluster "Titanium" is available this evening to test the scalability of the SSS suite. One node will be used as the head node to install our suite and run on the entire cluster. Could build everything but Bamboo and ssslib, due to Xerces. Will begin to be available at 6 pm.
Eric – a competing package, from his Russian "secret city" trip, Oct. 03. A package for distributed calculations, metacomputing, Grid. The system is based on XML, with a web-based user interface; configure, manage, and submit jobs. Challenges: automatic load balance.
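Narayan's quick Queue Manager interface exposes just four commands (add, del, run, get). A toy in-memory skeleton in that spirit might look like the following; the job model, states, and return values are invented, and `del` is spelled `delete` here only because `del` is a Python keyword.

```python
# Minimal sketch of a four-command (add/del/run/get) queue interface.
# States and return values are illustrative, not from the SSS spec.
class ToyQueueManager:
    def __init__(self):
        self.jobs = {}     # jobid -> state
        self.next_id = 1

    def add(self, spec):
        # Accept a job description and return its new id.
        jobid = self.next_id
        self.next_id += 1
        self.jobs[jobid] = "queued"
        return jobid

    def delete(self, jobid):
        # Remove a job; report whether it existed.
        return self.jobs.pop(jobid, None) is not None

    def run(self, jobid):
        # Move a queued job to running.
        if self.jobs.get(jobid) == "queued":
            self.jobs[jobid] = "running"
            return True
        return False

    def get(self, jobid):
        # Query current state (None if unknown).
        return self.jobs.get(jobid)

qm = ToyQueueManager()
j = qm.add({"exec": "/bin/hostname"})
assert qm.get(j) == "queued"
assert qm.run(j) and qm.get(j) == "running"
assert qm.delete(j) and qm.get(j) is None
```

In the real suite, each of these commands would arrive as a restriction-syntax XML query rather than a direct method call.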

Meeting notes
Late-night session on the 1280-node testbed: PM ran at 1280, worked at 4000, hung at 6000. Warehouse had a problem at 1280 and took out the head node. RM components ran on the head node OK until Warehouse crashed it.
Rusty – Process Manager spec for first vote. Presentation and discussion…
Who is responsible for limit enforcement, PM or QM? I.e., must use a certain amount of memory, must not execute OS commands (in general, things that happen after fork). Rusty says the question is good and he needs to think about how this may affect the interface.
Other items to think about: use of wildcard as a "to be returned" operator – OK; inclusion but don't show me; dynamic jobs and PM; improve readability.
Delay the vote until we have a written proposal.

Meeting notes
How to write the spec to describe how the XML should be extended for future needs.
Narayan – Node State Manager spec (no written doc, so no vote). Presentation and lots of discussion…
Scott – Allocation Manager spec (has a written doc in the notebook). Goes through examples in the document. Discussion.
Switches to a discussion comparing the two XML syntaxes. Andrew Lusk thinks that a translator could be created for queries (but not for output). Rusty thinks it is a bad idea and feels it is not a problem to have two syntaxes. David says the translation is good because it could buy time to switch syntax. Andrew, Paul, and Craig offer to help build a prototype translator to see how / if it is possible. Investigate standardization of tokens across the two syntaxes.
