A View from the Top Preparing for Review Al Geist February 24-25 Chicago, IL.

A View from the Top: Preparing for Review. Al Geist, February 24-25, 2003, Chicago, IL

Coordinator: Al Geist
Participating Organizations: ORNL, ANL, LBNL, PNNL, PSC, SDSC, IBM, SNL, LANL, Ames, NCSA, Cray, Intel, Unlimited Scale
Main Web Site

Scalable Systems Software
Participating Organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SDSC, SNL, LANL, Ames, IBM, Cray, Intel, Unlimited Scale
Problem: computer centers use incompatible, ad hoc sets of systems tools, and present tools are not designed to scale to multi-teraflop systems.
Goals: collectively (with industry) define standard interfaces between systems components for interoperability, and create scalable, standardized management tools for efficiently running our large computing centers.
Impact: reduced facility management costs and more effective use of machines by scientific applications.
Areas: Resource Management, Accounting & User Management, System Build & Configure, Job Management, System Monitoring.
Learn more: visit the project web site.

Progress So Far on the Integrated Suite
Working components and interfaces (shown in bold on the slide): Grid Interfaces, Accounting, Event Manager, Service Directory, Meta Scheduler, Meta Monitor, Meta Manager, Meta Services, Scheduler, Node State Manager, Allocation Management, Process Manager, Usage Reports, System & Job Monitor, Job Queue Manager, Node Configuration & Build Manager, Hardware Infrastructure Manager, Checkpoint / Restart, Validation & Testing.
Components talk through standard XML interfaces over the authenticated communication infrastructure. Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software suite.
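
To illustrate why language-neutral XML messaging lets components in different languages interoperate, here is a minimal sketch in Python. The element names, host, and port are assumptions made up for illustration; they are not the actual SSS wire protocol or the ssslib API.

    # Hypothetical sketch: a component announcing itself to a service directory
    # over a plain TCP socket using an XML message. Element names, host, and
    # port are illustrative only, not the actual SSS protocol or ssslib API.
    import socket
    import xml.etree.ElementTree as ET

    def build_register_message(component, host, port):
        """Build a toy XML registration message for a component."""
        root = ET.Element("register")
        ET.SubElement(root, "component", name=component, host=host, port=str(port))
        return ET.tostring(root, encoding="unicode")

    def send_message(message, sd_host="localhost", sd_port=5150):
        """Send the XML message to a (hypothetical) service directory."""
        with socket.create_connection((sd_host, sd_port), timeout=5) as sock:
            sock.sendall(message.encode("utf-8"))
            return sock.recv(4096).decode("utf-8")

    if __name__ == "__main__":
        msg = build_register_message("queue-manager", "node0", 7600)
        print(msg)  # <register><component name="queue-manager" ... /></register>
        # send_message(msg)  # would contact the service directory if one were running

Because the interface is just XML over the common communication layer, the same message could equally be produced by a Perl or Java component.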

Review of Last Meeting: Scalable Systems Software Center, October 2002, Houston, TX. Details are in the main project notebook.

Progress Reports at the October Meeting
- Al Geist: preparation for Supercomputing 2002 (booth space, posters, demos)
- Working group leaders: what areas their working group is addressing, progress report on what their group has done, present problems being addressed, next steps for the group, discussion items for the larger group to consider
- Demonstrations of prototype components; prep for the SC demo
Slides can be found in the main notebook, page 29.

Consensus and Voting:

Components in the architecture diagram: Accounting, File System, Event Manager, Service Directory, Meta Scheduler, Meta Monitor, Meta Manager, Meta Services, Scheduler, User DB, Allocation Management, Process Manager, Usage Reports, User Utilities, High Performance Communication & I/O, Application Environment, System & Job Monitor, Checkpoint / Restart, Grid Interfaces, Job Queue Manager, Node Configuration & Build Manager. (Diagram annotation: "These interface to all.")

Progress Since Last Meeting: Scalable Systems Software Center, November through February

SciDAC Booth

SC2002 Systems Posters

Five Project Notebooks Filling Up
- A main notebook for general information, and individual notebooks for each working group
- Over 216 total pages; 20 added since the last meeting
- A lot of XML schemas to comment on
- New subscription feature
- Get to all notebooks through the main web site: click the sidebar or "project notebooks" at the bottom of the page

Weekly Working Group Telecons
- Resource management, scheduling, and accounting: Tuesday 3:00 pm (Eastern), keyword "SSS mtg"
- Validation and testing (hasn't met since last year): Wednesday 1:00 pm (Eastern), mtg code
- Process management, system monitoring, and checkpointing: Thursday 1:00 pm (Eastern), mtg code
- Node build, configuration, and information service: Thursday 3:00 pm (Eastern), mtg code (changes)

This Meeting: Scalable Systems Software Center, February 24-25, 2003

Agenda: February 24
8:30  Al Geist: project status, SciDAC PI meeting, and external project review
9:00  Matt Sottile: Science Appliance project
Working Group Reports
9:30  Scott Jackson: Resource Management
10:30 Break
11:00 Erik DeBenedictis: Validation and Testing
12:00 Lunch (on your own; walk to the cafeteria)
1:00  Paul Hargrove: Process Management
2:00  Narayan Desai: Node Build, Configure
3:00  Break
3:30  Large-scale run on Chiba City, debugging components
5:00  Open discussion of the review report
5:30  Adjourn (working groups may wish to hack in the evening)

Agenda: February 25
8:30  Discussion, proposals, straw votes: write a paper on each component; draft report in the main notebook; comments on the "restricted interface" XML shown by Rusty; external review demo (can we?)
10:30 Break
11:00 Al Geist: summary of PI meeting talk and poster; external review agenda; next meeting date: June 5-6 at Argonne; thank our hosts, ANL
12:00 Meeting ends

SciDAC PI Meeting (all 50 projects)
- March 10-11, 2003, Napa, California
- Attending for Scalable Systems: Al Geist, Brett Bode
- 20-minute talk, presented by Al (Scalable Systems, CCA, PERC, SDM)
- Poster presentation

External SciDAC Review Meeting
- March 12-13, 2003, Napa, California
- Attending for Scalable Systems: Al Geist, Brett Bode, Paul Hargrove, Narayan Desai, Mike Showerman (Rusty)
- The four ISIC projects are reviewed separately: Scalable Systems, CCA, PERC, SDM
- External review panel (8 members): Bob Lucas, Jim McGraw, Jose Munoz, Lauren Smith, Richard Mount, Ricky Kendall, Rod Oldehoeft, and Tony Mezzacappa [John Grosh?]
- We owe them a review report
- Day 1: each project gets 1¾ hours to present
- Day 2: each project gets grilled by the panel for 1½ hours

External Review Meeting Agenda: Wednesday, March 12
7:45  Welcome, charge to reviewers
8:15  Plenary session for Common Component Architecture ISIC
10:00 Break
10:15 Plenary session for Scalable Systems Software ISIC
12:00 Reviewer caucus
12:15 Lunch
1:15  Plenary session for Scientific Data Management ISIC
3:00  Break
3:15  Plenary session for Performance Engineering ISIC
5:00  Reviewer caucus
5:30  Adjourn

External Review Meeting Agenda: Thursday, March 13
8:00  Meetings between reviewers and ISIC members: A. Common Component Architecture, B. Scalable Systems Software
9:45  Break
10:00 Meetings between reviewers and ISIC members: C. Scientific Data Management, D. Performance Engineering
11:45 Reviewer caucus / end of ISIC reviews
12:15 Lunch (on your own)
1:15  Programming Models Review Session I
3:00  Break
3:15  Programming Models Review Session II
5:00  Programming Models reviewer caucus
5:30  Meeting adjourns

Meeting Notes
Matt: Pink, a 1024-node Science Appliance
- Provides a pseudo single-system image (SSI) that scales; tolerates failure; single point for management
- Reduces boot and install time by 100x; reduces the number of FTEs needed per number of nodes
- Science Appliance has very little in common with older Linux
- The software is called Clustermatic: LinuxBIOS, BProc, V9fs, Supermon, Panasas or Lustre (parallel file system by someone else), Beoboot, asymmetric SSI, private name spaces from Plan 9, BJS (BProc Job Scheduler)
- Other work: ZPL (automatic checkpoint), debuggers (parallel, relative debugging with Guard), porting TotalView, latency-tolerant applications
- Users: SNL/CA, U Penn, Clemson
- What are the overlap opportunities? Each piece can be separated out: Supermon, BProc
- Remy will be sending more material on collaboration soon

Meeting Notes
Scott: Resource Management update
- Diagram of architecture and infrastructure services
- SC02 demo showed the components working; they used polling, now moving to event-driven components
- Release of the initial RM suite from the web site: OpenPBS-sss, Maui scheduler, QBank (accounting system)
- SSSRMAP protocol using HTTP validated; scalability testing performed on all components
- Scheduler progress; Queue Manager progress; Accounting and Allocation Manager progress (QBank and Gold prototype)
- Meta-scheduler progress: Globus interface, Gold information service
- Next work: Release 2 of the RM interface; implement and test SSSRMAP security authentication (XML digital signatures); discuss the need for SSS wrappers on the initial RM suite
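
As context for the SSSRMAP-over-HTTP item, the sketch below shows the general shape of posting an XML request body over HTTP and attaching a keyed digest, in Python. The element names, endpoint URL, custom header, and HMAC-based signing are illustrative assumptions only; they are not the actual SSSRMAP schema or its XML digital-signature mechanism.

    # Illustrative only: an XML request POSTed over HTTP with a keyed digest.
    # Endpoint, element names, header, and HMAC signing are assumptions for
    # illustration; the real SSSRMAP messages and signatures are defined by
    # the working group's specification, not by this sketch.
    import base64
    import hashlib
    import hmac
    import urllib.request

    ALLOCATION_MGR_URL = "http://localhost:7112/SSSRMAP"  # hypothetical endpoint
    SHARED_SECRET = b"example-secret"                      # hypothetical key

    def sign(body: bytes) -> str:
        """Compute a base64 HMAC-SHA1 digest of the request body (toy example)."""
        return base64.b64encode(hmac.new(SHARED_SECRET, body, hashlib.sha1).digest()).decode()

    def post_request(body: str) -> str:
        data = body.encode("utf-8")
        req = urllib.request.Request(
            ALLOCATION_MGR_URL,
            data=data,
            headers={"Content-Type": "text/xml", "X-Signature": sign(data)},  # hypothetical header
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode("utf-8")

    if __name__ == "__main__":
        request_body = '<Request action="Query"><Object>Allocation</Object></Request>'
        print(sign(request_body.encode("utf-8")))  # shows the digest; POSTing needs a live server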

Meeting Notes
Will: Validation and Testing update
- Users expect a high degree of quality in today's HPC
- Strategies: QMTest (the RM group is using it and likes it, "easy"), app test packages, APITEST
- APITEST, growing out of the October discussion: C++-driven, XML-schema scriptable tests of network components
  - Black-box testing: TCP, ssslib, Portals support, fault injection
  - White-box testing: try to exercise all paths in a known suite
  - v0.1a underway, 75% done
- Discussion of how this could be useful to Scalable Systems
- Cluster Integration Toolkit (CIT), James Laros: management tasks on Cplant, scalable to 1800 nodes, done in Perl
  - Creating a Scalable Systems interface to CIT would be a good test of the flexibility of the standard
  - USI, IBM, and Linux Networx are looking at it
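
To make the black-box and fault-injection idea concrete, here is a minimal sketch of the kind of check such a harness might run against a TCP component: connect, send a deliberately malformed message, then verify the service still responds to a clean probe. This is a generic Python illustration, not APITEST itself, and the host, port, and probe message are hypothetical.

    # Generic black-box/fault-injection illustration, not the APITEST tool itself.
    # Host, port, and probe message are hypothetical placeholders for a
    # component under test.
    import socket

    def service_responds(host: str, port: int, probe: bytes = b"ping\n") -> bool:
        """Return True if the service accepts a connection and replies to a probe."""
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                sock.sendall(probe)
                return bool(sock.recv(1024))
        except OSError:
            return False

    def malformed_input_test(host: str, port: int) -> bool:
        """Inject garbage input, then confirm the component still answers a clean probe."""
        try:
            with socket.create_connection((host, port), timeout=5) as sock:
                sock.sendall(b"\x00\xff<unterminated-xml")  # deliberately invalid message
        except OSError:
            pass  # a dropped connection is acceptable; a hung or dead service is not
        return service_responds(host, port)

    if __name__ == "__main__":
        ok = malformed_input_test("localhost", 7600)  # hypothetical component address
        print("component survived malformed input:", ok)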

Meeting Notes
Paul: Process Management report, moving beyond prototypes
- Checkpoint manager: beta code, April release awaiting legal OK; will do a scalability test today; working on the XML interface for checkpoint/restart (draft in May)
Mike: Monitoring (job, system, node, and meta-version)
- What data is needed: an extensible framework; defined stream and single-item data; working on scalability now
Rusty: Process Manager
- Schematic of the PM component
- MPD-2, written in Python and distributed with MPICH-2, supports separate executables, arguments, and environment variables
- New XML for the PM, with queries that allow wildcards and ranges
- The combination of published interfaces, XML, and the communication library gives us a power greater than the sum of its parts
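
To illustrate the kind of query the new process-manager XML allows, here is a hedged sketch: a hypothetical query element using a wildcard host pattern and a rank range, plus a few lines of Python that apply such a restriction to sample data. The element and attribute names are invented for this sketch and are not the actual PM schema.

    # Hypothetical illustration of a wildcard-and-range style query, in Python.
    # The XML element/attribute names are invented for this sketch; the real
    # process manager schema is defined by the working group, not here.
    import fnmatch
    import xml.etree.ElementTree as ET

    QUERY = '<process-query host="ccn*" ranks="0-3"/>'  # hypothetical query message

    PROCESSES = [  # sample data a process manager might hold
        {"host": "ccn001", "rank": 0, "pid": 4311},
        {"host": "ccn002", "rank": 1, "pid": 4312},
        {"host": "login1", "rank": 2, "pid": 900},
        {"host": "ccn003", "rank": 5, "pid": 4315},
    ]

    def match(query_xml, processes):
        """Return processes whose host matches the wildcard and rank falls in the range."""
        q = ET.fromstring(query_xml)
        host_pattern = q.get("host", "*")
        lo, hi = (int(x) for x in q.get("ranks", "0-0").split("-"))
        return [p for p in processes
                if fnmatch.fnmatch(p["host"], host_pattern) and lo <= p["rank"] <= hi]

    if __name__ == "__main__":
        for proc in match(QUERY, PROCESSES):
            print(proc)  # ccn001 rank 0 and ccn002 rank 1 match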

Meeting Notes
Narayan: Build and Configure report
- Tests suggest scalability to 2000-host clusters
- Communication infrastructure: more protocol support, high-availability option
- Build and configuration: complete implementation on Chiba City; second, OSCAR implementation underway
- Three components: hardware manager (needs a more modular, extensible design), build system, node manager (admin control panel for a cluster); plus system diagnostics
- Restriction-based syntax for XML interfaces
- API augmentation: APIs need more documentation to describe the event-handling protocol

Meeting Notes
- John Dawson asks about the license; Al says it is like MPI's
- Don (Cray) asks about the license (not GNU) and about holding a workshop for industry
- Talk with Remy about Science Appliance collaboration
- Talk with Rusty about writing a paper on each component
- Groups: work on the large scalability test on Chiba City and XTORC