The ORNL Cluster Computing Experience…
Stephen L. Scott
Oak Ridge National Laboratory
Computer Science and Mathematics Division
Network and Cluster Computing Group
December 6, 2004
RAMS Workshop, Oak Ridge, TN

OSCAR

What is OSCAR?
Open Source Cluster Application Resources

OSCAR Framework (cluster installation, configuration, and management)
– Remote installation facility
– Small set of "core" components
– Modular package & test facility
– Package repositories

Use "best known methods"
– Leverage existing technology where possible

Wizard-based cluster software installation
– Operating system
– Cluster environment (administration & operation)

Automatically configures cluster components
Increases consistency among cluster builds
Reduces time to build / install a cluster
Reduces need for expertise

[Wizard diagram: Start… Steps 1-8 … Done!]
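
As a rough sketch of how that wizard flow was typically driven, the head-node bootstrap looked something like the commands below. The tarball name, the internal interface name (eth1), and the install_cluster script name are recalled from OSCAR documentation of roughly that era and should be treated as assumptions; exact names vary by release.

# Hypothetical OSCAR head-node bootstrap (names are assumptions; check the release docs)
$ tar xzf oscar-4.0.tar.gz      # unpack the OSCAR distribution on the head node
$ cd oscar-4.0
$ ./install_cluster eth1        # launch the wizard on the cluster-internal NIC;
                                # it then walks through package selection and configuration,
                                # server install, client image build, client definition,
                                # network setup, node boot, and cluster tests (Steps 1-8)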

OSCAR Components

Administration/Configuration
– SIS, C3, OPIUM, Kernel-Picker, NTPconfig, cluster services (dhcp, nfs, …)
– Security: Pfilter, OpenSSH

HPC Services/Tools
– Parallel libraries: MPICH, LAM/MPI, PVM
– Torque, Maui, OpenPBS
– HDF5
– Ganglia, Clumon, … (monitoring systems)
– Other 3rd-party OSCAR packages

Core Infrastructure/Management
– System Installation Suite (SIS), Cluster Command & Control (C3), Env-Switcher
– OSCAR DAtabase (ODA), OSCAR Package Downloader (OPD)

Open Source Community Development Effort

Open Cluster Group (OCG)
– Informal group formed to make cluster computing more practical for HPC research and development
– Membership is open, directed by a steering committee

OCG working groups
– OSCAR (core group)
– Thin-OSCAR (diskless Beowulf)
– HA-OSCAR (high availability)
– SSS-OSCAR (Scalable Systems Software)
– SSI-OSCAR (single system image)
– BIO-OSCAR (bioinformatics cluster system)

OSCAR Core Partners (as of November 2004)
– Dell
– IBM
– Intel
– Bald Guy Software
– RevolutionLinux
– Indiana University
– NCSA
– Oak Ridge National Laboratory
– Université de Sherbrooke
– Louisiana Tech University

eXtreme TORC
eXtreme TORC, powered by OSCAR
– 65 Pentium IV machines
– Peak performance: GFLOPS
– RAM: GB
– Disk capacity: 2.68 TB
– Dual interconnects: Gigabit & Fast Ethernet


HA-OSCAR: the first known field-grade open source HA Beowulf cluster release
– Self-configuring, multi-head Beowulf system
– Combines HA and HPC clustering techniques to enable critical HPC infrastructure
– Active/hot-standby head nodes
– Self-healing, with 3-5 second automatic failover time
– RAS management for the HPC cluster


Scalable Systems Software

Scalable Systems Software

Participating organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SDSC, SNL, LANL, Ames
Industry: IBM, Cray, Intel, SGI

Problem
– Computer centers use incompatible, ad hoc sets of systems tools
– Present tools are not designed to scale to multi-teraflop systems

Goals
– Collectively (with industry) define standard interfaces between systems components for interoperability
– Create scalable, standardized management tools for efficiently running our large computing centers

Impact
– Reduced facility management costs
– More effective use of machines by scientific applications

Functional areas: resource management, accounting & user management, system build & configure, job management, system monitoring

To learn more visit

SSS-OSCAR: Scalable Systems Software
Leverage the OSCAR framework to package and distribute the Scalable Systems Software (SSS) suite, sss-oscar.
– sss-oscar: a release of OSCAR containing all SSS software in a single downloadable bundle.

The SSS project is developing standard interfaces for scalable tools
– Improve interoperability
– Improve long-term usability & manageability
– Reduce costs for supercomputing centers

Map out functional areas
– Schedulers, job managers
– System monitors
– Accounting & user management
– Checkpoint/restart
– Build & configuration systems

Standardize the system interfaces
– Open forum of universities, labs, and industry representatives
– Define component interfaces in XML
– Develop communication infrastructure

OSCAR-ized SSS Components
– Bamboo – queue/job manager
– BLCR – Berkeley Lab Checkpoint/Restart
– Gold – accounting & allocation management system
– LAM/MPI (w/ BLCR) – checkpoint/restart-enabled MPI
– MAUI-SSS – job scheduler
– SSSLib – SSS communication library (includes SD, EM, PM, BCM, NSM, NWI)
– Warehouse – distributed system monitor
– MPD2 – MPI process manager
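
To make the checkpoint/restart pieces above concrete, the generic BLCR command-line workflow for a standalone process is sketched below. This is only an illustration: my_app, app.ckpt, and the PID 12345 are placeholders, and the SSS/LAM-MPI integration drives the same mechanism through the job manager rather than by hand; option details may differ by BLCR version.

$ cr_run ./my_app &                 # run the application with the BLCR library preloaded
$ cr_checkpoint -f app.ckpt 12345   # checkpoint process 12345 (placeholder PID) to the context file app.ckpt
$ cr_restart app.ckpt               # later, resume the process from the saved checkpoint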

Cluster Power Tools

C3 Power Tools
Command-line interface for cluster system administration and parallel user tools.

Parallel execution: cexec
– Execute across a single cluster or multiple clusters at the same time

Scatter/gather operations: cpush / cget
– Distribute or fetch files for all node(s)/cluster(s)

Used throughout OSCAR and as the underlying mechanism for tools like OPIUM's useradd enhancements.

C3 Building Blocks

System administration
– cpushimage – "push" a system image across the cluster
– cshutdown – remote shutdown to reboot or halt the cluster

User & system tools
– cpush – push a single file or directory to the nodes
– crm – delete a single file or directory on each node
– cget – retrieve files from each node
– ckill – kill a process on each node
– cexec – execute an arbitrary command on each node
– cexecs – serial mode of cexec, useful for debugging
– clist – list each available cluster and its type
– cname – returns a node name from a given node position
– cnum – returns a node position from a given node name

C3 Power Tools

Example: run hostname on all nodes of the default cluster:
$ cexec hostname

Example: push an RPM to /tmp on the first 3 nodes:
$ cpush :1-3 helloworld-1.0.i386.rpm /tmp

Example: get a file from node 1 and nodes 3-6:
$ cget :1,3-6 /tmp/results.dat /tmp

* The destination can be left off with cget; the same location as the source is used.
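
A few more illustrative invocations of the building blocks from the previous slide, using the same node-range syntax; the file and process names are placeholders, and option details may differ between C3 versions:

$ cexecs :1-3 uptime                      # serial mode: run uptime on nodes 1-3, one node at a time
$ ckill :1-3 myjob                        # kill the process named myjob (placeholder) on nodes 1-3
$ crm :1-3 /tmp/helloworld-1.0.i386.rpm   # remove the RPM pushed in the example above
$ clist                                   # list the clusters defined in the C3 configuration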

Motivation for Success!

RAMS – Summer 2004

Preparation for Success!

Personality & attitude
– Adventurous
– Self-starter
– Self-learner

Dedication
– Willing to work long hours
– Able to manage time
– Willing to fail

Work experience
– Responsible
– Mature personal and professional behavior

Academic
– Minimum of sophomore standing
– CS major
– Above-average GPA
– Extremely high faculty recommendations
– Good communication skills
– Two or more programming languages
– Data structures
– Software engineering