1
Alliance Clusters, Cluster in a Box
Rob Pennington
Acting Associate Director, Computing and Communications Division, NCSA
How to stuff a penguin in a box and make everyone happy, even the penguin.
2
Where Do Clusters Fit?
Distributed systems
– Gather (unused) resources
– System SW manages resources
– System SW adds value
– 10% - 20% overhead is OK
– Resources drive applications
– Time to completion is not critical
– Time-shared
– Commercial: PopularPower, United Devices, Centrata, ProcessTree, Applied Meta, etc.
MP systems
– Bounded set of resources
– Apps grow to consume all cycles
– Application manages resources
– System SW gets in the way
– 5% overhead is maximum
– Apps drive purchase of equipment
– Real-time constraints
– Space-shared
Example systems spanning the spectrum: Internet, SETI@home (15 TF/s delivered), Condor, Legion/Globus, Berkeley NOW, Beowulf, Superclusters, ASCI Red Tflops (1 TF/s delivered)
Src: B. Maccabe, UNM; R. Pennington, NCSA
3
Alliance Clusters Overview
Major Alliance cluster systems
– NT-based cluster at NCSA
– Linux-based clusters
– University of New Mexico: Roadrunner, LosLobos
– Argonne National Lab: Chiba City
"Develop Locally, Run Globally"
– Local clusters used for development and parameter studies
Issues
– Compatible software environments
– Compatible hardware
– Evaluate technologies at multiple sites: OS, processors, interconnect, middleware
Computational resource for users
4
Cluster in a Box Rationale
Conventional wisdom: building a cluster is easy
– Recipe: buy hardware from Computer Shopper, Best Buy or Joe's place
– Find a grad student not making enough progress on thesis work and distract him/her with the prospect of playing with the toys
– Allow to incubate for a few days to weeks
– Install your application, run and be happy
Building it right is a little more difficult
– Multi-user cluster, security, performance tools
– Basic question: what works reliably?
Building it to be compatible with the Grid/Alliance...
– Compilers, libraries
– Accounts, file storage, reproducibility
– Hardware configs may be an issue
5
Alliance Cluster Growth: 1 TFLOP in 2 Years
[Chart: Intel processor count in Alliance clusters, Jan-98 through Oct-00, growing to 1600+ Intel CPUs. Systems labeled: NCSA 32p SGI; NCSA 128p HP; UNM 128p Alta; NCSA 192p HP, Compaq; 256p NT Cluster; NCSA 128p HP; UNM 512p IBM; ANL 512p IBM, VA Linux.]
6
Alliance Cluster Status
UNM Los Lobos
– Linux, 512 processors
– May 2000
– Operational system, first performance tests, friendly users
Argonne Chiba City
– Linux, 512 processors, Myrinet interconnect
– November 1999
– Deployment
NCSA NT Cluster
– Windows NT 4, 256 processors, Myrinet
– December 1999
– Review Board allocations
UNM Road Runner
– Linux, 128 processors, Myrinet
– September 1999
– Review Board allocations
7
NT Cluster Usage: Large, Long Jobs
8
A Pyramid Scheme (Involve Your Friends and Win Big)
– Top: full production resources at major site(s)
– Middle: Alliance resources at partner sites
– Base: small, private systems in labs/offices
Can a "Cluster in a Box" support all of the different configs at all of the sites?
No, but it can provide an established & tested base configuration.
This is a non-exclusive club at all levels!
9
Cluster in a Box Goals
Open source software kit for scientific computing
– Surf the groundswell
– Some things are going to be add-ons
– Invest in compilers; vendors have spent BIG $ optimizing them
Integration of commonly used components
– Minimal development effort
– Time to delivery is critical
Initial target is small to medium clusters
– Up to 64 processors
– ~1 interconnect switch
Compatible environment for development and execution across different systems (Grid, anyone?)
– Common libraries, compilers
10
Key Challenges and Opportunities
Technical and Applications
– Development environment
– Compilers, debuggers
– Performance tools
– Storage performance
– Scalable storage
– Common filesystem
– Admin tools
– Scalable monitoring tools
– Parallel process control
– Node size
– Resource contention
– Shared memory apps
– Few users => many users
– 600 users/month on O2000
– Heterogeneous systems
– New generations of systems
– Integration with the Grid
Organizational
– Integration with existing infrastructure
– Accounts, accounting
– Mass storage
– Training
– Acceptance by community
– Increasing quickly
– Software environments
11
Cluster Configuration
[Diagram: cluster node types connected by a network — compute nodes, front-end nodes (user logins), I/O nodes (storage, HSM), management nodes, visualization nodes, debug nodes, and a systems testbed. Green denotes present-generation clusters.]
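As a purely illustrative way to capture this layout, the sketch below records the node roles from the diagram in a small Python structure of the kind a cluster configuration database might hold. The role names, counts and descriptions are hypothetical examples, not taken from the slides.

```python
# Illustrative sketch only: node roles from the configuration diagram
# captured as a simple data structure. Counts are hypothetical examples.
from dataclasses import dataclass

@dataclass
class NodeRole:
    name: str        # role, e.g. "compute"
    count: int       # hypothetical number of nodes in this role
    purpose: str     # what the role does, paraphrased from the diagram

cluster_config = [
    NodeRole("front_end", 2, "user logins, compiling and job submission"),
    NodeRole("compute", 64, "run space-shared parallel applications"),
    NodeRole("io", 4, "serve the common filesystem and HSM-backed storage"),
    NodeRole("management", 2, "monitoring, scheduling, administration"),
    NodeRole("visualization", 2, "post-processing and visualization"),
    NodeRole("debug", 4, "interactive debugging partition"),
]

if __name__ == "__main__":
    total = sum(r.count for r in cluster_config)
    for r in cluster_config:
        print(f"{r.name:>13}: {r.count:3d} nodes - {r.purpose}")
    print(f"{'total':>13}: {total:3d} nodes")
```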
12
Space Sharing Example on 64 Nodes
[Diagram: 64 nodes partitioned among six applications (App1–App6), each running on its own disjoint set of nodes.]
Users "own" the nodes allocated to them
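To make the space-sharing policy concrete, here is a minimal sketch of an allocator that hands each job a disjoint set of nodes and returns them when the job finishes. This is not the scheduler actually used on these clusters (that role belongs to PBS); the class and variable names are illustrative.

```python
# Minimal space-sharing sketch: each job gets exclusive use of a disjoint
# node set until it releases them. Purely illustrative, not the Alliance
# clusters' actual scheduler.
class SpaceSharedCluster:
    def __init__(self, num_nodes):
        self.free_nodes = set(range(num_nodes))
        self.owner = {}            # node id -> job name

    def allocate(self, job, nodes_needed):
        """Give `job` exclusive ownership of `nodes_needed` free nodes."""
        if nodes_needed > len(self.free_nodes):
            return None            # job must wait; nodes are never shared
        granted = {self.free_nodes.pop() for _ in range(nodes_needed)}
        for n in granted:
            self.owner[n] = job
        return granted

    def release(self, job):
        """Return all of `job`'s nodes to the free pool."""
        done = [n for n, j in self.owner.items() if j == job]
        for n in done:
            del self.owner[n]
            self.free_nodes.add(n)

cluster = SpaceSharedCluster(64)
print(cluster.allocate("App1", 16))   # App1 owns 16 nodes outright
print(cluster.allocate("App2", 32))   # App2 owns a disjoint 32 nodes
cluster.release("App1")               # App1's nodes return to the pool
```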
13
OSCAR: A(nother) Package for Linux Clustering
OSCAR (Open Source Cluster Application Resources) is a snapshot of the best known methods for building and using cluster software.
14
The OSCAR Consortium
OSCAR is being developed by:
– NCSA/Alliance
– Oak Ridge National Laboratory
– Intel
– IBM
– Veridian Systems
Additional supporters are:
– SGI, HP, Dell, MPI Software Technology, MSC
15
OSCAR Components: Status
– Packaging: integration underway; documentation under development.
– Job management: PBS validated and awaiting integration; long-term replacement for PBS under consideration.
– Cluster management: C3/M3C core complete, but further refinement is planned; evaluation of alternative solutions underway.
– Installation & cloning: configuration database design is complete; LUI is complete and awaiting integration with the database.
– OS: core validation OSs selected (Red Hat, TurboLinux and SuSE); integration support issues being worked.
Src: N. Gorsuch, NCSA
16
OSCAR: Open Source Cluster Application Resources
Open source cluster on a "CD"
– Integration meeting v0.5: September 2000
– Integration meeting at ORNL, October 24 & 25: v1.0
– v1.0 to be released at Supercomputing 2000 (November 2000)
Research and industry consortium
– NCSA, ORNL, Intel, IBM, MSC Software, SGI, HP, Veridian, Dell
Components
– OS layer: Linux (Red Hat, TurboLinux, SuSE, etc.)
– Installation and cloning: LUI
– Security: OpenSSH for now
– Cluster management: C3/M3C
– Job management: OpenPBS
– Programming environment: gcc, etc.
– Packaging: OSCAR
Src: N. Gorsuch, NCSA
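Since OpenPBS is the kit's job manager, the sketch below shows what user-side job submission might look like: a small Python helper that writes a PBS batch script and hands it to qsub. The resource request, script contents and file names are illustrative assumptions, not anything prescribed by the slides.

```python
# Illustrative sketch: submit an MPI job to OpenPBS by writing a batch
# script and calling qsub. Resource request, paths and names here are
# hypothetical examples, not an Alliance-prescribed configuration.
import subprocess
import tempfile

def submit_mpi_job(executable, nodes=4, ppn=2, job_name="cluster_test"):
    script = f"""#!/bin/sh
#PBS -N {job_name}
#PBS -l nodes={nodes}:ppn={ppn}
#PBS -j oe
cd $PBS_O_WORKDIR
mpirun -np {nodes * ppn} {executable}
"""
    with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
        f.write(script)
        path = f.name
    # qsub prints the new job identifier on success
    result = subprocess.run(["qsub", path], capture_output=True, text=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_mpi_job("./sppm"))   # e.g. an sPPM binary built on the cluster
```

On success qsub prints the new job identifier, and qstat would then show the job queued or running on its space-shared node set.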
17
OSCAR Cluster Installation Process
– Install Linux on the cluster master or head node
– Copy the contents of the OSCAR CD onto the cluster head
– Collect cluster information and enter it into the LUI database (a manual phase right now)
– Run the pre-client installation script
– Boot the clients and let them install themselves (can be done over the net or from a floppy)
– Run the post-client installation script
KEEP IT SIMPLE!
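A sketch of how those steps might be strung together from the head node is below. The script names (pre_client_install, post_client_install), mount point and install path are placeholders standing in for whatever the OSCAR CD actually ships; the LUI data-entry step is left manual, as the slide notes.

```python
# Hypothetical driver for the OSCAR install steps on the head node.
# Script names and paths are placeholders, not actual OSCAR commands.
import shutil
import subprocess
import sys

OSCAR_CD_MOUNT = "/mnt/cdrom"      # assumed mount point of the OSCAR CD
OSCAR_DIR = "/usr/local/oscar"     # assumed install location on the head node

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)    # stop if any step fails

def install_oscar():
    # Step 1 (install Linux on the head node) is assumed already done.
    # Step 2: copy the CD contents onto the cluster head.
    shutil.copytree(OSCAR_CD_MOUNT, OSCAR_DIR)

    # Step 3: cluster information goes into the LUI database by hand for now.
    input("Enter cluster/node information into the LUI database, then press Enter...")

    # Step 4: pre-client installation script (placeholder name).
    run([f"{OSCAR_DIR}/pre_client_install"])

    # Step 5: boot the clients (over the net or from a floppy) so they
    # install themselves; wait for the operator to confirm.
    input("Boot the client nodes and wait for their installs to finish...")

    # Step 6: post-client installation script (placeholder name).
    run([f"{OSCAR_DIR}/post_client_install"])

if __name__ == "__main__":
    sys.exit(install_oscar())
```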
18
Testbeds
Basic cluster configuration for prototyping at NCSA
– Interactive node + 4 compute nodes
– Development site for OSCAR contributors
– 2nd set of identical machines for testbed
– Rolling development between the two testbeds
POSIC (Linux)
– 56 dual-processor nodes
– Mixture of Ethernet and Myrinet
– User-accessible testbed for apps porting and testing
19
IA-64 Itanium Systems at NCSA
Prototype systems
– Early hardware
– Not running at production spec
– Code porting and validation
– Community codes
– Required software infrastructure
Running 64-bit Linux and Windows
– Dual-boot capable
– Usually one OS for extended periods
Clustered IA-64 systems
– Focused on MPI applications porting/testing
– Myrinet, Ethernet, shared memory
20
HPC Applications Running on Itanium
IA-64 test cluster: IA-64 compute nodes + IA-32 compile nodes + Linux or Win64
[Diagram: 4p and 2p IA-64 compute nodes alongside IA-32 Linux and IA-32 Win32 nodes, connected by Myrinet.]
Applications/Packages: Cactus, MILC, ARPI-3D, ATLAS, sPPM, WRF, PUPI, ASPCG, HDF4/5, FFTW, PBS, Globus; compilers for C/C++/F90
Interconnects: shared memory, Fast Enet + MPICH, Myrinet + GM + VMI + MPICH
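The application codes above are C/C++/Fortran programs built against MPICH. As a minimal illustration of the MPI pattern they share (rank query, point-to-point messages, a reduction), here is a sketch using Python's mpi4py bindings; mpi4py is not part of the deck's software stack and is used here purely to keep the example short.

```python
# Minimal MPI sketch: rank query, ring message, global reduction.
# Uses mpi4py for brevity; the codes on the slide are C/C++/F90 programs
# built against MPICH over Myrinet/GM or Fast Ethernet.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank contributes its rank number; allreduce sums them everywhere.
total = comm.allreduce(rank, op=MPI.SUM)

# Simple ring exchange: send to the right-hand neighbor, receive from the left.
dest = (rank + 1) % size
source = (rank - 1) % size
msg = comm.sendrecv(f"hello from rank {rank}", dest=dest, source=source)

print(f"rank {rank}/{size}: sum of ranks = {total}, received '{msg}'")
```

Run with an MPI launcher, e.g. `mpiexec -n 4 python ring.py`, under an MPI implementation such as MPICH.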
21
Future
Scale up current cluster efforts
– Capability computing at NCSA and Alliance sites
– NT and Linux clusters expand
– Scalable computing platforms: commodity turnkey systems
– Current technology has 1 TF within reach: <1000 IA-32 processors
Teraflop systems integrated with the Grid
– Multiple systems within the Alliance
– Complement to current SGI SMP systems at NCSA
– Next generation of technologies: Itanium at ~3 GFLOP, 1 TF is ~350 processors (rough arithmetic below)
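The processor-count estimate follows directly from the per-processor rate; a quick check of the slide's rough numbers:

```latex
% Rough check of the slide's estimate, assuming ~3 GFLOP/s per Itanium processor.
\[
  \frac{1\ \text{TFLOP/s}}{3\ \text{GFLOP/s per processor}}
  = \frac{10^{12}}{3 \times 10^{9}} \approx 333\ \text{processors}
\]
```

which is consistent with the slide's round figure of ~350 processors.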