Scalable Systems Software for Terascale Computer Centers
Coordinator: Al Geist
Participating Organizations: ORNL, ANL, LBNL, PNNL, PSC, SDSC, IBM, Compaq, SNL, LANL, Ames, NCSA, SGI, Scyld, Intel, Unlimited Scale
The Problem Today
System administrators and managers of terascale computer centers are facing a crisis:
- Computer centers use incompatible, ad hoc sets of systems tools.
- Present tools are not designed to scale to multi-teraflop systems.
- Commercial solutions are not emerging because business forces drive industry towards servers, not HPC.
Scope of the Effort
- Resource & queue management
- Accounting & user management
- System build & configure
- Job management
- System monitoring
- Security
- Allocation management
- Fault tolerance
- Checkpoint/restart
[Figure labels: allocation management; submit jobs to batch queue; start parallel processes; job monitoring; checkpoint/restart]
Goals
- Collectively (with industry) agree on and specify standardized interfaces between system components in order to promote interoperability, portability, and long-term usability. The specification will proceed through a series of open meetings following a format similar to that used by the MPI Forum.
- Produce a fully integrated suite of systems software and tools for the effective management and utilization of terascale computational resources, particularly those at the DOE facilities.
- Research and develop more advanced versions of the components required to support the scalability, fault tolerance, and performance requirements of large science applications.
- Carry out a software lifecycle plan for support and maintenance of the systems software suite.
Impact
- Fundamentally change the way future high-end systems software is developed and distributed
- Reduced facility management costs
  - reduce need to support ad hoc software
  - better systems tools available
  - able to get machines up and running faster and keep them running
- More effective use of machines by scientific applications
  - scalable launch of jobs and checkpoint/restart
  - job monitoring and management tools
  - allocation management interface
Electronic Notebooks Keep Working Groups on Track
Four working groups to interact with:
1. Node build, configuration, and information service
2. Resource management, scheduling, and allocation
3. Process management, system monitoring, and checkpointing
4. Validation and integration
A main notebook holds general information and meeting notes, and each working group has its own notebook.
- Allows groups to keep track of other groups' progress and comment on items of overlap
- Allows Center members and interested parties to see what is being defined and implemented
Interactions
- Principal customers are system administrators and supercomputer managers.
- CCA looks to Scalable Systems to provide services to launch parallel components on large systems and to provide event services for fault detection and monitoring.
- DOE Science Grid will be involved with Scalable Systems through its integration of Grid tools with the monitoring and resource management services layer of the systems software.
- Applications using the terascale SciDAC resources, including climate, accelerator design, and astrophysics, will use the job submission, job monitoring, user-assisted checkpointing, and allocation tools developed by the Center.
- Other organizations and vendors are participating in the Scalable Systems effort even though they are not funded by SciDAC.
ORNL Electronic Notebook
Input from:
- Keyboard
- Files
- Images
- Voice
- Instruments
- Sketchpad
Reading entries
Annotation by remote colleagues
Shared electronic notebook: accessible with password through secure web site
Personal (stand-alone) notebook: drag and drop notes from private to shared notebooks
Advantages and Features
- look and feel of a paper notebook
- access from any web browser
- no software to install
- can be shared across a group
- or set up as a personal notebook
- can run stand-alone on a laptop