The ORNL Cluster Computing Experience…

Stephen L. Scott
Oak Ridge National Laboratory
Computer Science and Mathematics Division
Network and Cluster Computing Group

RAMS Workshop, Oak Ridge, TN
December 6, 2004

scottsl@ornl.gov
www.csm.ornl.gov/~sscott
OSCAR
What is OSCAR?
OSCAR: Open Source Cluster Application Resources

OSCAR framework (cluster installation, configuration, and management)
– Remote installation facility
– Small set of “core” components
– Modular package & test facility
– Package repositories

Uses “best known methods”
– Leverages existing technology where possible

Wizard-based cluster software installation (wizard screenshot: Start… through Steps 1-8 to Done!)
– Operating system
– Cluster environment (administration and operation)

Automatically configures cluster components
Increases consistency among cluster builds
Reduces time to build / install a cluster
Reduces need for expertise
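As a rough sketch of how an OSCAR install of this era was typically started (the command name is from memory and the tarball name is hypothetical; details may differ by version), the wizard was launched from the unpacked OSCAR distribution, naming the head node's internal network interface:

$ tar xzf oscar-4.0.tar.gz && cd oscar-4.0
$ ./install_cluster eth1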
OSCAR Components

Administration/Configuration
– SIS, C3, OPIUM, Kernel-Picker, NTPconfig
– Cluster services (dhcp, nfs, ...)
– Security: Pfilter, OpenSSH

HPC Services/Tools
– Parallel libraries: MPICH, LAM/MPI, PVM
– Torque, Maui, OpenPBS
– HDF5
– Monitoring systems: Ganglia, Clumon, …
– Other 3rd-party OSCAR packages

Core Infrastructure/Management
– System Installation Suite (SIS), Cluster Command & Control (C3), Env-Switcher
– OSCAR DAtabase (ODA), OSCAR Package Downloader (OPD)
Open Source Community Development Effort

Open Cluster Group (OCG)
– Informal group formed to make cluster computing more practical for HPC research and development
– Membership is open, directed by a steering committee

OCG working groups
– OSCAR (core group)
– Thin-OSCAR (diskless Beowulf)
– HA-OSCAR (high availability)
– SSS-OSCAR (Scalable Systems Software)
– SSI-OSCAR (single system image)
– BIO-OSCAR (bioinformatics cluster system)
OSCAR Core Partners (November 2004)
– Dell
– IBM
– Intel
– Bald Guy Software
– RevolutionLinux
– Indiana University
– NCSA
– Oak Ridge National Laboratory
– Université de Sherbrooke
– Louisiana Tech University
eXtreme TORC, powered by OSCAR
– 65 Pentium IV machines
– Peak performance: 129.7 GFLOPS
– RAM: 50.152 GB
– Disk capacity: 2.68 TB
– Dual interconnects: Gigabit and Fast Ethernet
HA-OSCAR
The first known field-grade open source HA Beowulf cluster release
– Self-configuration
– Multi-head Beowulf system
– HA and HPC clustering techniques to enable critical HPC infrastructure
– Active/hot standby
– Self-healing, with 3-5 second automatic failover time
– RAS management for HPC clusters
Scalable Systems Software
Scalable Systems Software

Participating organizations: ORNL, ANL, LBNL, PNNL, NCSA, PSC, SDSC, SNL, LANL, Ames
Industry partners: IBM, Cray, Intel, SGI

Problem
– Computer centers use incompatible, ad hoc sets of systems tools
– Present tools are not designed to scale to multi-teraflop systems

Goals
– Collectively (with industry) define standard interfaces between systems components for interoperability
– Create scalable, standardized management tools for efficiently running our large computing centers

Impact
– Reduced facility management costs
– More effective use of machines by scientific applications

Functional areas: Resource Management, Accounting & User Management, System Build & Configure, Job Management, System Monitoring

To learn more, visit www.scidac.org/ScalableSystems
SSS-OSCAR: Scalable Systems Software

Leverages the OSCAR framework to package and distribute the Scalable Systems Software (SSS) suite, sss-oscar.
– sss-oscar: a release of OSCAR containing all SSS software in a single downloadable bundle

The SSS project is developing standard interfaces for scalable tools
– Improve interoperability
– Improve long-term usability & manageability
– Reduce costs for supercomputing centers

Map out functional areas
– Schedulers, job managers
– System monitors
– Accounting & user management
– Checkpoint/restart
– Build & configuration systems

Standardize the system interfaces
– Open forum of universities, labs, and industry representatives
– Define component interfaces in XML
– Develop communication infrastructure
OSCAR-ized SSS Components
– Bamboo: queue/job manager
– BLCR: Berkeley checkpoint/restart
– Gold: accounting & allocation management system
– LAM/MPI (w/ BLCR): checkpoint/restart-enabled MPI
– MAUI-SSS: job scheduler
– SSSLib: SSS communication library (includes SD, EM, PM, BCM, NSM, NWI)
– Warehouse: distributed system monitor
– MPD2: MPI process manager
Cluster Power Tools
C3 Power Tools

Command-line interface for cluster system administration and parallel user tools.

Parallel execution: cexec
– Execute across a single cluster or multiple clusters at the same time

Scatter/gather operations: cpush / cget
– Distribute or fetch files for all node(s)/cluster(s)

Used throughout OSCAR and as the underlying mechanism for tools like OPIUM's useradd enhancements.
C3 Building Blocks

System administration
– cpushimage: “push” an image across the cluster
– cshutdown: remote shutdown to reboot or halt the cluster

User & system tools
– cpush: push a single file to a directory on each node
– crm: delete a single file or directory on each node
– cget: retrieve files from each node
– ckill: kill a process on each node
– cexec: execute an arbitrary command on each node
– cexecs: serial mode, useful for debugging
– clist: list each available cluster and its type
– cname: returns a node name from a given node position
– cnum: returns a node position from a given node name
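A few illustrative invocations of these tools, following the same node-range syntax as the cexec/cpush/cget examples on the next slide; this is only a sketch with hypothetical file and process names, and exact options and defaults may differ between C3 releases.

Remove a file from /tmp on the first 3 nodes:
$ crm :1-3 /tmp/helloworld-1.0.i386.rpm

Kill a runaway process by name on nodes 2 through 4:
$ ckill :2-4 myjob

Run a command one node at a time while debugging:
$ cexecs hostname

List the clusters defined in the local C3 configuration:
$ clist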
C3 Power Tools

Example: run hostname on all nodes of the default cluster:
$ cexec hostname

Example: push an RPM to /tmp on the first 3 nodes:
$ cpush :1-3 helloworld-1.0.i386.rpm /tmp

Example: get a file from node 1 and nodes 3-6:
$ cget :1,3-6 /tmp/results.dat /tmp

* The destination can be omitted with cget; the source location is then used as the destination.
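The earlier C3 Power Tools slide notes that these commands can span multiple clusters at once. As a hedged sketch, assuming two clusters named cluster1 and cluster2 (hypothetical names) are defined in the local C3 configuration, a cluster name can be given in front of the colon:

Example: run hostname across two clusters:
$ cexec cluster1: cluster2: hostname

Example: push an RPM to /tmp on nodes 1-3 of a named cluster:
$ cpush cluster1:1-3 helloworld-1.0.i386.rpm /tmp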
Motivation for Success!
RAMS – Summer 2004
Preparation for Success!

Personality & attitude
– Adventurous
– Self-starter
– Self-learner

Dedication
– Willing to work long hours
– Able to manage time
– Willing to fail

Work experience
– Responsible
– Mature personal and professional behavior

Academic
– Minimum of sophomore standing
– CS major
– Above-average GPA
– Extremely high faculty recommendations
– Good communication skills
– Two or more programming languages
– Data structures
– Software engineering