OS and System Software for Ultrascale Architectures – Panel
Jeffrey Vetter, Oak Ridge National Laboratory
Presented to SOS8, 13 April 2004

JSV2: Panel Charge
Each panelist has a short opportunity to stand on a soap box and start a riot if he wishes. The overall purpose of the panel is to raise the issues and suggest paths forward for OS and system software for Ultrascale architectures... [Standard disclaimers apply…]

JSV3: Many Objectives for Ultrascale System Software – More than Performance
- Performance efficiency is critical
- However, other system software qualities can be equally important for effective Ultrascale computing:
  - Functionality (compatibility)
  - Reliability, Availability, Serviceability
  - Usability, Administration
- How can we make the proper tradeoffs that balance performance with these other qualities?

JSV4: Some Questions
- Imagine that you gain X% performance on your application by changing the system software:
  - Is it acceptable to make Y% of applications, libraries, and tools incompatible?
  - Is it acceptable to make the system software Z% less reliable than before?
- Others:
  - How many FTEs should it take to keep an Ultrascale system up and running 24/7?
  - FTE scaling rate? Source code scaling rate?
  - Performance stability? How many possible configurations?
- A myopic focus on performance (or any single factor) can have long-term detrimental effects on overall Ultrascale system effectiveness
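
To make the first two questions concrete, here is a toy back-of-the-envelope model (an assumption of this write-up, not something from the slides): discount a raw speedup by the fraction of applications that remain compatible and by the relative reliability of the modified system. The function name and all numbers are illustrative.

```python
# Hypothetical model, not from the panel slides: charge compatibility and
# reliability costs against a raw performance gain.

def effective_speedup(raw_speedup: float,
                      frac_apps_still_compatible: float,
                      relative_reliability: float) -> float:
    """Net benefit of a system-software change after discounting for the
    applications it breaks and the uptime it costs."""
    return raw_speedup * frac_apps_still_compatible * relative_reliability

# Example: a 15% speedup that breaks 10% of applications and makes the
# system 5% less reliable nets out to roughly a 2% loss, not a gain.
print(effective_speedup(1.15, 0.90, 0.95))   # ~0.983
```

Under these made-up numbers, an attractive-looking 15% speedup becomes a net loss once compatibility and reliability are accounted for, which is exactly the slide's warning about single-factor tuning.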

JSV5: ORNL is Involved in Several Projects that Span these Objectives
- OSCAR: cluster building and installation toolkit
- Scalable Systems Software: scalable, standardized management tools and interfaces for system management
- HARNESS: customizable runtime infrastructure for scientific computing

JSV6: OSCAR: Cluster Toolkit
- Framework for cluster management
  - Wizard-based installation of the cluster software: operating system and cluster environment
  - Automatically configures cluster components
  - Increases consistency among cluster builds
  - Reduces the time to build/install a cluster and the need for expertise
  - Requires a pre-installed head node with a supported Linux distribution; thereafter, the wizard guides the user through setup and installation of the entire cluster
- Package-based framework
  - Content: software plus configuration, tests, and documentation
  - Types: core (SIS, C3, Switcher, ODA, OPD, support libraries) and non-core (selected and third-party packages)
  - Access: repositories accessible via OPD/OPDer
- Many partners…
- Over 120,000 downloads on SourceForge!
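
As a purely illustrative sketch of the package-based framework, the record below shows the kind of information such a package bundles together (software, configuration, tests, documentation). The field names and values are invented for this example and do not reproduce OSCAR's actual package metadata format.

```python
# Hypothetical OSCAR-style package record; field names are invented for
# illustration and are not OSCAR's real metadata schema.
package = {
    "name": "example-mpi",          # invented package name
    "class": "non-core",            # core packages on the slide: SIS, C3, Switcher, ODA, OPD
    "software": ["example-mpi-1.0.rpm"],
    "configuration": {"default_interface": "eth0"},
    "tests": ["test_ring_pingpong.sh"],
    "docs": ["README", "install.html"],
}

# A package repository is then a collection of such records that a
# downloader (OPD/OPDer on the slide) can browse and fetch from.
print(package["name"], package["class"])
```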

JSV7: What's Next for OSCAR: High Availability Release
- Goals:
  - COTS-based HPC solution moving toward non-stop services
  - Production-quality Linux clustering
  - Ease of build, operation, and maintenance
- HA-OSCAR 1.0 beta release (March 2004):
  - The first known field-grade HA Beowulf cluster release
  - Self-configuring multi-head Beowulf system
  - Combines HA and HPC clustering techniques to enable critical HPC infrastructure
  - Self-healing with 3-5 second automatic failover time
  - 1-1.5 hours to self-build failover head nodes without a preloaded OS
  - Optional image server for disaster recovery
  - Supports existing HPC applications (e.g., MPI) without any modification
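
The failover behavior can be pictured with a minimal heartbeat loop. This is a hypothetical sketch of the general self-healing idea only, not HA-OSCAR's implementation; the host name, port, and timing constants are assumptions chosen to echo the 3-5 second failover figure.

```python
# Illustrative sketch only -- not HA-OSCAR's actual code. A standby head node
# polls the primary and promotes itself after several missed heartbeats.
import socket
import time

PRIMARY = ("primary-head", 22)   # hypothetical host/port to probe
HEARTBEAT_PERIOD = 1.0           # seconds between probes
MAX_MISSES = 3                   # ~3 s until failover, in the spirit of the 3-5 s figure

def primary_alive(addr=PRIMARY, timeout=0.5) -> bool:
    """Return True if a TCP connection to the primary head node succeeds."""
    try:
        with socket.create_connection(addr, timeout=timeout):
            return True
    except OSError:
        return False

def promote_standby() -> None:
    # Placeholder: a real system would take over the service address, start the
    # scheduler and other services, and resynchronize cluster state here.
    print("heartbeat lost -- promoting standby head node")

def monitor() -> None:
    misses = 0
    while True:
        misses = 0 if primary_alive() else misses + 1
        if misses >= MAX_MISSES:
            promote_standby()
            break
        time.sleep(HEARTBEAT_PERIOD)

# monitor()  # would loop until the primary stops answering
```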

Scalable Systems Software
- Participating organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SDSC, SNL, LANL, Ames, Clemson; industry partners include IBM, Cray, Intel, and SGI
- Problem:
  - Computer centers use incompatible, ad hoc sets of systems tools
  - Present tools are not designed to scale to multi-teraflop systems
- Goals:
  - Collectively (with industry) define standard interfaces between system components for interoperability
  - Create scalable, standardized management tools for efficiently running our large computing centers
- Impact:
  - Reduced facility management costs
  - More effective use of machines by scientific applications
- Working areas: resource management; accounting and user management; system build and configuration; job management; system monitoring (schedulers and job managers, system monitors, accounting and user management, checkpoint/restart, build and configuration systems)

JSV9: SSS Status
- Currently testing the 2nd pre-release*
  - Bundled for distribution via OSCAR
  - Builds a full working cluster with the current SSS packages
  - sss-oscar-0.2a4-v3.0
* Release information as of 3/29/04

JSV10 Grid Interfaces Accounting Event Manager Service Directory Meta Scheduler Meta Monitor Meta Manager Scheduler Node State Manager Allocation Management Process Manager Usage Reports Meta Services System & Job Monitor Job Queue Manager Node Configuration & Build Manager Working Components and Interfaces (bold) authentication communication Components written in any mixture of C, C++, Perl, Java, and Python can be integrated into the Scalable Systems Software Suite through defined XML interfaces Checkpoint / Restart Validation & Testing Hardware Infrastructure Manager OSCAR (Open Source Cluster Resources) used to package, build, and install the suite OSCAR-SSS Release of Scalable Systems Software Integrated Suite

JSV11: Harness
- Key ideas for Harness:
  - A parallel plug-in interface that allows users or applications to dynamically customize, adapt, and extend the environment's features
  - Distributed peer-to-peer control that prevents a single point of failure
  - Multiple distributed virtual machines that can collaborate, merge, or split
- Collaborative effort between ORNL, the University of Tennessee, and Emory University
- Design:
  - Uses a pluggable framework in C and Java
  - Allows the computing environment to be dynamically customized to suit the application's needs
  - Manages a set of plug-ins as directed by the scientific application, with a lightweight kernel
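
A minimal, single-node sketch of the plug-in idea follows, assuming ordinary Python modules stand in for plug-ins. The real Harness kernel is implemented in C and Java and manages parallel plug-ins across distributed virtual machines, which this toy version deliberately omits.

```python
# Toy sketch of a pluggable kernel: its only job is to load, dispatch to, and
# unload plug-ins on request. Not Harness itself; Python modules are used as
# stand-ins for plug-ins purely for illustration.
import importlib

class Kernel:
    def __init__(self):
        self.plugins = {}

    def load(self, name: str):
        # Dynamically load a plug-in (e.g., a numerical library or programming model).
        self.plugins[name] = importlib.import_module(name)
        return self.plugins[name]

    def unload(self, name: str) -> None:
        self.plugins.pop(name, None)

    def call(self, name: str, func: str, *args):
        # Dispatch a request to a service exported by a loaded plug-in.
        return getattr(self.plugins[name], func)(*args)

kernel = Kernel()
kernel.load("math")                    # stand-in for a real Harness plug-in
print(kernel.call("math", "sqrt", 2))  # 1.4142...
kernel.unload("math")
```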

JSV12: Harness Implementation
- The kernel provides only basic functions, such as dynamic loading and unloading of plug-ins
- Plug-ins offer a wide variety of services, and parallel plug-ins enable distributed services
  - These services include numerical libraries, parallel programming models, networking, resource discovery, and distributed control
  - FT-MPI and PVM plug-ins are available
- The pluggable remote method invocation framework RMIX provides IPC over standard protocols (e.g., RPC, JRMP, and SOAP)

JSV13: Summary
- Performance is one important criterion
- Other important criteria include:
  - Functionality (compatibility)
  - Reliability, Availability, Serviceability
  - Usability, Administration
- We need measurements and historical data: costs, reliability, FTE levels
- ORNL is involved in several projects that address these issues: OSCAR, SSS, and HARNESS

Bonus Slides