Published by Arline Hill. Modified over 9 years ago.
1
Working Group updates, SSS-OSCAR Releases, API Discussions, External Users, and SciDAC Phase 2
Al Geist
May 10-11, 2005
Chicago, IL
2
Scalable Systems Software
Participating Organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SNL, LANL, Ames; IBM, Cray, Intel, SGI
Problem
- Computer centers use incompatible, ad hoc sets of systems tools
- Present tools are not designed to scale to multi-Teraflop systems
Goals
- Collectively (with industry) define standard interfaces between systems components for interoperability
- Create scalable, standardized management tools for efficiently running our large computing centers
Resource Management, Accounting & user mgmt, System Build & Configure, Job management, System Monitoring
To learn more visit www.scidac.org/ScalableSystems
3
SSS-OSCAR Scalable Systems Software Suite (component diagram)
Components: Grid Interfaces; Accounting; Event Manager; Service Directory; Meta Scheduler; Meta Monitor; Meta Manager; Scheduler; Node State Manager; Allocation Management; Process Manager; Usage Reports; Meta Services; System & Job Monitor; Job Queue Manager; Node Configuration & Build Manager; Checkpoint / Restart; Validation & Testing; Hardware Infrastructure Manager
Standard XML interfaces: authentication, communication
Components written in any mixture of C, C++, Java, Perl, and Python can be integrated into the Scalable Systems Software Suite
Any updates to this diagram?
4
Components in Suites (component diagram)
Implementations: Gold; EM; SD; Grid scheduler; Warehouse; Meta Manager; Maui sched; NSM; Gold; PM; Usage Reports; Meta Services; Warehouse (superMon, NWPerf); Bamboo QM; BCM; ssslib; BLCR; APITest; HIM
Multiple component implementations exist
Compliant with PBS and LoadLeveler job scripts
5
Scalable Systems Users
Production use today:
- Running an SSS suite at ANL and Ames
- Running components at PNNL
- Maui w/ SSS API (3000/mo), Moab (Amazon, Ford, TeraGrid, …)
Who can we involve before the end of the project?
- National Leadership-class facility? NLCF is a partnership between ORNL (Cray), ANL (BG), PNNL (cluster)
- NERSC and NSF centers: NCSA cluster(s), NERSC cluster?
6
Goals for This Meeting
- Updates on the Integrated Software Suite components
- Planning for SciDAC phase 2: discuss new directions and June SciDAC meeting
- Preparing for next SSS-OSCAR software suite release: What is missing? What needs to be done?
- Getting more outside users: production use and feedback to suite
- Discussion of involvement with NLCF machines: IBM BG/L, Cray XT3, clusters
7
Highlights of Last Meeting (Jan. 25-26 in DC)
Details in main project notebook
- Fred attended: he gave state of MICS, SciDAC-2, and his vision for changed focus
- Discussion of whitepaper and presentation for Strayer: ideas and Fred feedback
- API Discussions
  - voted for Process Manager API (12 yes, 0 no, 0 abstain)
  - new Warehouse protocol presented
- Agreed to quarterly suite releases this year, and dates
8
Since Last Meeting
- CS ISICs meet with SciDAC director (Strayer), Feb 17, DC
  - Whitepaper: some issues with Mezzacappa
  - Gave hour “highlight” presentation on goals, impact, and potential CS ISIC ideas for next round
  - Strayer was very positive. Fred reported that the meeting could not have gone any better.
- Cray Software Workshop (called by Fred), January in Minneapolis
  - Status of Cray SW and how DOE research could help
  - Several SSS members there. Anything since?
- Telecoms and new entries in Electronic Notebooks
  - Pretty sparse since last meeting
9
Major Topics for This Meeting
- Latest news on the Software Suite components
- Preparing for next SSS-OSCAR software suite release
- Discuss ideas for next round of CS ISICs
- Preparation for upcoming meetings in June
- Presentation / 1st vote on Queue Manager API
- Getting more users and feedback on suite
10
Agenda - May 10
8:00 Continental Breakfast
8:30 Al Geist - Project Status
9:00 Discussion of ideas presented to Strayer
9:30 Scott Jackson - Resource Management components
10:30 Break
11:00 Will McLendon - Validation and Testing; Ron Oldfield - integrated SSS test suites
12:00 Lunch (on own at cafeteria)
1:30 Paul Hargrove - Process Management and Monitoring
2:30 Narayan Desai - Node Build, Configure, and Cobalt on BG/L
3:30 Break
4:00 Craig Steffen - SSSRMAP in ssslib
4:30 Discussion of getting SSS users and feedback
5:30 Adjourn for dinner
11
Agenda - May 11
8:00 Continental Breakfast
8:30 Thomas Naughton - SSS OSCAR software releases through SC05
9:30 Discussion and voting: Bret Bode - XML API for Queue Manager
10:30 Group discussion of ideas for SciDAC-2
11:00 Preparations for upcoming meetings: FastOS meeting June 8-10; SciDAC PI Meeting June 26-30 (poster and panels); set next meeting date/location: August 17-19, ORNL
12:00 Meeting Ends
12
Ideas Presented to SciDAC Director Mike Strayer February 17, 2005 Washington DC
13
View to the Future
HW, CS, and Science Teams all contribute to the science breakthroughs
Diagram labels:
- Ultrascale Hardware (Rainier, Blue Gene, Red Storm): OS/HW teams
- Computing Environment (common look & feel across diverse HW)
- Software & Libs: SciDAC CS teams
- Tuned codes, Research team, High-End science problem: SciDAC Science Teams
- Leadership-class Platforms
- Breakthrough Science
14
SciDAC Phase 2 and CS ISICs
Future CS ISICs need to be mindful of the needs of:
- National Leadership Computing facility
  - w/ Cray, IBM BG, SGI, clusters, multiple OS
  - No one architecture is best for all applications
- SciDAC Science Teams
  - Needs depend on application areas chosen
  - End stations? Do they have special SW needs?
- FastOS Research Projects
  - Complement, don’t duplicate these efforts
- Cray software roadmap
  - Making the Leadership computers usable, efficient, fast
15
Gaps and potential next steps
- Heterogeneous leadership-class machines: science teams need to have a robust environment that presents similar programming interfaces and tools across the different machines
- Fault tolerance requirements in apps and systems software, particularly as systems scale up to petascale around 2010
- Support for application users submitting interactive jobs: computational steering as a means of scientific discovery
- High performance File System and I/O research: increasing demands of security, scalability, and fault tolerance
- Security: one-time passwords and impact on scientific progress
16
Heterogeneous Machines
- Heterogeneous architectures
  - Vector architectures, scalar, SMP, hybrids, clusters
  - How is a science team to know what is best for them?
- Multiple OS
  - Even within one machine, e.g. Blue Gene, Red Storm
  - How to effectively and efficiently administer such systems?
- Diverse programming environment
  - Science teams need to have a robust environment that presents similar programming interfaces and tools across the different machines
- Diverse system management environment
  - Managing and scheduling multiple node types
  - System updates, accounting, … everything will be harder in round 2
17
Fault Tolerance
- Holistic fault tolerance
  - Research into schemes that take into account the full impact of faults: application, middleware, OS, and hardware
- Fault tolerance in systems software
  - Research into prediction and prevention
  - Survivability and resiliency when faults cannot be avoided
- Application recovery
  - Transparent failure recovery
  - Research into intelligent checkpointing based on active monitoring, sophisticated rule-based recoveries, diskless checkpointing…
  - For petascale systems, research into recovery w/o checkpointing
18
Interactive Computing
- Batch jobs are not always the best for science
  - Good for large numbers of users and a wide mix of jobs, but the National Leadership Computing Facility has a different focus
- Computational steering as a paradigm for discovery
  - Break the cycle: simulate, dump results, analyze, rerun simulation
  - More efficient use of the computer resources
- Needed for application development
  - Scaling studies on terascale systems
  - Debugging applications which only fail at scale
19
File System and I/O Research
- Lustre is today’s answer
  - There are already concerns about its capabilities as systems scale up to 100+ TF
- What is the answer for 2010?
  - Research is needed to explore the file system and I/O requirements for petascale systems that will be here in 5 years
- I/O continues to be a bottleneck in large systems
  - Hitting the memory access wall on a node
  - Too expensive to scale I/O bandwidth with Teraflops across nodes
  - Research needed to understand how to structure applications or modify I/O to allow applications to run efficiently
20
Security
- New, stricter access policies to computer centers
  - Attacks on supercomputer centers have gotten worse
  - One-time passwords, PIV?
  - Sites are shifting policies, tightening firewalls, going to SecurID tokens
- Impact on scientific progress
  - Collaborations within international teams
  - Foreign nationals clearance delays
  - Access to data and computational resources
- Advances required in system software
  - To allow compliance with different site policies and be able to handle tightest requirements
  - Study how to reduce impact on scientists
21
Meeting notes
Al Geist - project status
Al Geist - ideas for CS ISICs in next round of SciDAC
Scott Jackson
- Production use at more places, e.g. U. Utah Icebox (430 proc)
- Incorporation of SSSRMAP into ssslib in progress
- Paper accepted and new documents (see RM notebook)
- SOAP as basis for SSSRMAP v4: discussion of pros and cons (scalability issues, but ssslib can support)
- Fault tolerance in Gold using hot failover
- New Gold release v2 b2.10.2 includes distributed accounting
  - Simplify allocation management
  - Enabled support for MySQL database
- Bamboo QM v1.1 released
- New fountain component alternate to Warehouse used in work for support for SuperMon, Ganglia, and NWPerf
- Maui: improved grid scheduler, multisite authentication, support for Globus 4
- Future work: increase deployment base, ssslib integration, portability, support for LoadLeveler-like multi-step jobs and PBS job language, release of Silver meta-scheduler
22
Meeting notes
Will McLendon - APITest project status
- Current release v1.0
- Latest work: new look using cascading style sheets
- New capabilities: pass/fail batch files, better parse error reporting
- User Guide documentation done (50 pages) and SNL approved
- SW requirements: Python 2.3+, ElementTree, MySQL, ssslib, Twisted (version 2.0 added new dependencies)
- Helping fix bad tests led to good discussion of this utility
- Future work: config file, test developer GUI, more…
Ron Oldfield - Testing SSS suites
- 2 wks ago hired full-time contractor (Tod Cordenbach) plus summer student
- Goals and deliverables for summer work
  - performance testing of SSS-OSCAR
  - comparison to other components
  - write tech report of results
- What is important for each component: scheduler, job launch, queue, I/O, …
- Discussion of metrics. Scalability? User time, admin time, HW resource efficiency
- Report what works, what doesn’t, what is performance critical
23
Meeting notes
Paul Hargrove - PM update
- Checkpoint (BLCR) status: users on four continents, bug fixes, works with Linux 2.6.11, partial AMD64/EM64T port
  - Next step is process groups/sessions
  - OpenMPI work this summer (student of Lumsdaine)
  - Have sketch of less restrictive syntax API
- Process manager status: complete rewrite of MPD, more OO and pythonic
  - provided a non-MPD implementation for BG/L using SSS API
Narayan Desai - BCM update
- SSS infrastructure in use at ANL: clusters, BG/L, IA32, PPC64
- Better documentation
- LRS syntax: spec done, SDK complete, todo ssslib integration
- BG/L: arrived in January, initial Cobalt (SSS) suite in February
  - many features being requested, e.g. node modes set in mpirun
  - DB2 used for everything
- Cobalt: same as SW on Chiba City
  - All Python components implemented using SSS-SDK
  - Several major extensions required for BG/L
24
Meeting notes
Narayan Desai - Cobalt update for BG/L
- Scheduler (bgsched): new implementation
  - needed to be topology aware, use DB2
  - partition unit is 512 nodes
- Queue Manager (cqm): same SW as Chiba
  - OS change on BG/L is trivial since system is rebooted for each job
- Process Manager (bgpm): new implementation
  - compute nodes don’t run full OS, so no MPD
  - mpirun complicated
- Allocation Manager (am): same as Chiba, very simple design
- Experiences: SSS really works
  - Easy to port; simple approach makes system easy to understand
  - Agility required for BG/L
  - Comprehensive interfaces expose all information
  - Admins can access internal state; component behavior less mysterious; extracting new info is easy
- Shipping Cobalt to a couple of other sites
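The 512-node partition unit noted for bgsched suggests that job sizes map onto whole partition units. A minimal sketch of that arithmetic, assuming a request is rounded up to a whole number of 512-node units; the rounding policy and the function names are illustrative assumptions, not taken from Cobalt:

```python
# Hypothetical illustration only: not Cobalt/bgsched code.
PARTITION_UNIT = 512  # nodes per partition unit, as stated in the notes


def partitions_needed(requested_nodes: int) -> int:
    """Number of whole partition units needed to cover a node request."""
    if requested_nodes <= 0:
        raise ValueError("requested_nodes must be positive")
    # Integer ceiling division: round up to the next whole unit.
    return (requested_nodes + PARTITION_UNIT - 1) // PARTITION_UNIT


def allocated_nodes(requested_nodes: int) -> int:
    """Nodes actually reserved once the request is rounded up (assumed policy)."""
    return partitions_needed(requested_nodes) * PARTITION_UNIT
```

Under this assumption, a 600-node request would occupy two units, leaving some nodes idle, which is one reason a scheduler for such a machine has to reason about partitions instead of individual nodes.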
25
Meeting notes
Craig Steffen - (no slides)
- Not as much to report. Sidetracked for past three months on other projects. Gives reasons.
- Warehouse bugs also not done. Fixes to be done by next OSCAR release.
- Graphical display for Warehouse created
  - Same interfaces as Maui wrt requesting everything from all nodes
- SSSRMAP into ssslib
  - Initial skeleton code for integration into ssslib begun
  - Needs questions answered from Jackson and Narayan to proceed
26
Meeting notes
Thomas Naughton - SSS OSCAR releases
- Testing for v1.1 release
- Base OSCAR v4.1 includes SSS
- APItest runs post-install tests on packages
- Discussion that Debian support will require both RPM and DEB formats
- Future work: complete v1.1 testing, migrate distribution to FRE repository, extend SSS component tests, distribute as basic OSCAR “package set”
  - needed ordering within a phase (work around for now)
- Release schedule:
  Version | Freeze | Release | New
  v1.0 | | Nov (SC05) | first full suite release
  v1.1 | Feb 15 | May | Gold update, bug fixes
  v1.2 | Jun 15 | July | RH9 to Fedora2, oscar 4.1, BLCR to linux 2.6, improved tests, close known bug reports
  v2.0b | Aug 15 | Sept | less restrictive syntax switch over, perf tests, Silver meta-scheduler, Fedora4
  v2.0 | Oct 15 | Nov (SC05) | bug fixes, minor updates
- In oscar 5.0 as package set (after SC05)
- Remove Bugzilla link from web page
27
Meeting notes
Bret Bode - Queue Manager API
- Lists all the functions, then goes through detailed scheme of each
- Bamboo uses SSSRMAP messaging and wire protocol
- Authentication: uses ssslib
- Authorization: uses info in SSSRMAP wire protocol
- Questions and discussion of interfaces
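As a rough illustration of what an XML request to a queue manager might look like, the sketch below builds and parses a small message with Python's standard ElementTree module. The element and attribute names (Envelope, Body, Request, action, actor, Object) are hypothetical placeholders chosen for the example; they are not taken from the SSSRMAP specification or the Queue Manager API presented at the meeting.

```python
import xml.etree.ElementTree as ET

# Hypothetical message layout for illustration; not the real SSSRMAP schema.


def build_request(action: str, obj: str, actor: str) -> str:
    """Serialize a request as an XML string."""
    envelope = ET.Element("Envelope")
    body = ET.SubElement(envelope, "Body")
    request = ET.SubElement(body, "Request", {"action": action, "actor": actor})
    ET.SubElement(request, "Object").text = obj
    return ET.tostring(envelope, encoding="unicode")


def parse_request(xml_text: str) -> dict:
    """Extract the action, actor, and object from a request string."""
    root = ET.fromstring(xml_text)
    request = root.find("./Body/Request")
    if request is None:
        raise ValueError("no Request element found")
    return {
        "action": request.get("action"),
        "actor": request.get("actor"),
        "object": request.findtext("Object"),
    }
```

A round trip such as `parse_request(build_request("Query", "Job", "alice"))` returns the same three fields, which is the kind of check an interface test tool could run against each component.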