Swiss-T1: A Commodity MPI computing solution
March 1999
Ralf Gruber, EPFL-SIC/CAPA/Swiss-Tx, Lausanne
Content (March 2000):
1. Distributed Commodity HPC
2. Characterisation of machines and applications
3. Swiss-Tx project
Past: Supercomputer (July 1998)
Manufacturers: Cray Research, Convex, Connection Machines, KSR, Intel Paragon, Japanese companies, Teracomputers
What happened: taken over by SGI; taken over by HP; disappeared; stopped supercomputing; still existing (not main); in development for 6 years
Why it happened: produced own processors; developed own memory switches; needed special memories; developed own operating system; developed own compiler; special I/O (HW and SW); own communication system
Processor performance evolution July 1998
SMP/NUMA: What is the trend?
Manufacturer: parallel server
DIGITAL: Wildfire
SUN: Starfire
IBM: SP-2
HP: Exemplar
SGI: Origin 2000
.....
Present situation: off the shelf processors; off the shelf memory switches; off the shelf memories; special parts of operating system; special compiler extensions; special I/O and SW; own communication system
Commodity Computing (MPI/PCI) (March 2000)
PC clusters/Linux:
  Fast Ethernet: Beowulf
SOS cooperation (Alpha):
  Myrinet/DS10: C-Plant (SNL)
  T-Net/DS20: Swiss-T1 (EPFL)
Customised commodity:
  Quadrics/ES40: Compaq/Sierra
Off the shelf processors
Off the shelf memory switches
Off the shelf memories
Off the shelf local I/O HW and SW
Off the shelf operating systems
Off the shelf compilers
New communication system
New distributed file/IO system
SOS workshop on Distributed Commodity HPC (March)
Participants: SNL, ORNL, Swiss-Tx, LLNL, LANL, ANL, NASA, LBL, PSC, DOE, UNM, Syracuse, Compaq, IBM, Cray, Sun, SMEs
Content: Vision, Clusters, Interconnects, Integration, OS, I/O, Applications, Usability, Crystal ball
Distributed commodity HPC User’s Group (March 2000)
Goals:
  Characterise the machines
  Characterise the applications
  Match machines to applications
Characterise processors, machines, and applications: performance
Processors: $V_{mac}$ = peak processor performance / peak memory bandwidth
Parallel machines: $\gamma_{mac}$ = effective processor performance / effective network performance
Applications: $\gamma_{app}$ = operation count / words to be sent
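The three characteristic ratios defined above are plain quotients, so a few lines of Python can make the units concrete. This is only a sketch: the function names and the sample figures are illustrative assumptions and do not come from the slides.

```python
def v_mac(peak_mflops: float, peak_memory_mwords: float) -> float:
    """Processor ratio: peak performance [Mflop/s] over peak memory bandwidth [Mword/s]."""
    return peak_mflops / peak_memory_mwords


def machine_ratio(effective_mflops: float, effective_network_mwords: float) -> float:
    """Machine ratio: effective performance [Mflop/s] over effective network bandwidth [Mword/s]."""
    return effective_mflops / effective_network_mwords


def application_ratio(operations: float, words_sent: float) -> float:
    """Application ratio: operation count over the number of words to be sent."""
    return operations / words_sent


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the arithmetic.
    print("processor ratio:  ", v_mac(1000.0, 250.0))
    print("machine ratio:    ", machine_ratio(6400.0, 80.0))
    print("application ratio:", application_ratio(5.0e9, 1.0e8))
```

All three ratios are in operations per word, which is what allows a machine and an application to be compared directly.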
In a box: $V_{mac}$ values (15 June 1998)
$V_{mac}$ = R [Mflop/s] / M [Mword/s]
Table: $V_{mac}$ values for Alpha and boxes and NEC SX-4
Columns: Machine, N, R, M, $V_{mac}$
Rows: Alpha server, DS, DS, NEC SX
Between boxes: $\gamma_{mac}$ value
$\gamma_{mac}$ = (N * R [Mflop/s] scaled by the measured efficiency) / C [Mword/s]
Table: $\gamma_{mac}$ of different machines
Columns: Machine, Type, Nproc, Peak, Eff perf, Eff bw, $\gamma_{mac}$
Gravitor (Beowulf) *
Swiss-T1 (T-Net)
Swiss-T1 (FE)
Baby T1 (C+PCI)
Origin2K (NUMA/MPI)
NEC SX4 (vector)
Effective performance measured with MATMULT; * estimated. Effective bandwidth measured with point to point.
The $\gamma_{app}$ value
$\gamma_{app}$ = Operations / Communicated words
Material sciences (3D Fourier analysis): $\gamma_{app}$ ~ 50. Beowulf insufficient, Swiss-T1 just about right.
Crash analysis (3D non-linear FE): $\gamma_{app}$ > 1000. Beowulf sufficient, latency?
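The two examples above suggest a simple matching rule: a machine is adequate when the application ratio is at least as large as the machine ratio. The sketch below encodes that reading. The rule itself is an assumption inferred from the examples, and the machine values are hypothetical placeholders, not measurements from the slides.

```python
def machine_suits_application(app_ratio: float, mac_ratio: float) -> bool:
    """Assumed rule: the network keeps up when the application performs at
    least as many operations per communicated word as the machine needs."""
    return app_ratio >= mac_ratio


# Hypothetical machine ratios (operations per word), for illustration only.
machines = {
    "Fast Ethernet cluster (hypothetical)": 500.0,
    "switched cluster (hypothetical)": 40.0,
}

# Application ratios quoted on the slide; 1000 is the lower bound given for crash analysis.
applications = {
    "3D Fourier analysis": 50.0,
    "3D non-linear FE crash analysis": 1000.0,
}

if __name__ == "__main__":
    for app_name, app_ratio in applications.items():
        for mac_name, mac_ratio in machines.items():
            verdict = "sufficient" if machine_suits_application(app_ratio, mac_ratio) else "insufficient"
            print(f"{app_name} on {mac_name}: {verdict}")
```

The rule ignores latency, which is the open question the slide raises for the crash analysis case.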
The $\gamma_{app}$ value for Finite Elements
$\gamma_{app}$ = Operations / Communicated words
FE operations scale with:
  Nb of volume nodes
  Nb of variables per node, squared
  Nb of non-zero matrix elements
  Nb of operations per matrix element
FE communication scales with:
  Nb of surface nodes
  Nb of variables per node
FE $\gamma_{app}$ is determined by:
  Nb of nodes in one direction
  Nb of variables per node
  Nb of non-zero matrix elements
  Nb of operations per matrix element
  Nb of surfaces
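The dependencies listed above can be turned into a rough estimate for a cubic subdomain. The sketch below assumes n nodes per direction (so n^3 volume nodes and n^2 nodes on each exchanged surface) and treats every listed factor as a plain multiplier. These modelling choices, the parameter names, and the sample numbers are assumptions made for illustration.

```python
def fe_app_ratio(
    nodes_per_direction: int,
    variables_per_node: int,
    nonzeros_per_node: int,
    ops_per_matrix_element: int,
    surfaces: int,
) -> float:
    """Operations per communicated word for a cubic finite element subdomain (sketch)."""
    n = nodes_per_direction
    # Operations: volume nodes * variables^2 * non-zero elements * operations per element.
    operations = n**3 * variables_per_node**2 * nonzeros_per_node * ops_per_matrix_element
    # Communication: surface nodes on every exchanged face * variables per node.
    words_sent = surfaces * n**2 * variables_per_node
    return operations / words_sent


if __name__ == "__main__":
    # Hypothetical subdomain: 16 nodes per direction, 3 variables per node,
    # 27 non-zero couplings per node, 2 operations per matrix element, 6 faces.
    print(fe_app_ratio(16, 3, 27, 2, 6))
```

Under these assumptions the ratio grows linearly with the number of nodes in one direction, the usual volume-to-surface argument: larger subdomains per processor tolerate a slower network.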
The $\gamma_{app}$ value: statistics for 3D brick problem (Finite elements)
Table: Current day case, 4096 elements
Columns: Nb of subdomains; Nb of nodes; Nb of interface nodes; Mflop and kB per cycle, in total and per processor (data transfer); $\gamma_{app}$
March 2000 Fat-tree/Crossbars 16x16 N=8, P=8, N*P=64 PUs, X=12, BiW=32, L=64
Circulant graphs/Crossbars 12x12 (March 2000)
K=2 (1/3): N=8, P=8, X=8, BiW=8, L=16
K=3 (1/3/5): N=11, P=6, X=11, BiW=18, L=33
K=4 (1/3/5/7): N=16, P=4, X=16, BiW=32, L=64
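The circulant figures above can be checked with a short script. The sketch assumes that N is the number of crossbars placed on a ring, that the numbers in parentheses are the ring offsets each crossbar links to, that L counts the links between crossbars, that P is the number of processors attached to each crossbar, and that BiW counts the links cut when the ring is split into two contiguous halves. These interpretations are assumptions; the slide does not define the symbols.

```python
def circulant_links(n: int, offsets: tuple) -> set:
    """Undirected links of a circulant graph: node i is linked to node (i + s) mod n."""
    links = set()
    for i in range(n):
        for s in offsets:
            j = (i + s) % n
            links.add((min(i, j), max(i, j)))
    return links


def contiguous_cut_width(n: int, offsets: tuple) -> int:
    """Links crossing a cut into nodes 0..n//2-1 and the rest.

    A true bisection width is the minimum over all balanced cuts; this
    evaluates one particular cut only.
    """
    half = n // 2
    return sum(1 for a, b in circulant_links(n, offsets) if (a < half) != (b < half))


def ports_used(processors_per_crossbar: int, offsets: tuple) -> int:
    """Crossbar ports: one per attached processor plus two per ring offset."""
    return processors_per_crossbar + 2 * len(offsets)


if __name__ == "__main__":
    for n, p, offsets in [(8, 8, (1, 3)), (11, 6, (1, 3, 5)), (16, 4, (1, 3, 5, 7))]:
        print(n, offsets,
              "links:", len(circulant_links(n, offsets)),
              "cut:", contiguous_cut_width(n, offsets),
              "ports:", ports_used(p, offsets))
```

Under these assumptions the script gives L = 16, 33, 64 and cut widths 8, 18, 32, matching the slide, and every configuration uses exactly the 12 ports of a 12x12 crossbar.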
March 2000 Fat-tree/Circulant graphs
The Swiss-Tx machines (September 1998)
Machines: Swiss-T0, Swiss-T0 (Dual), Baby T1*, Swiss-T1, Swiss-T2
Installation date, place: EPFL EPFL 8.99 EPFL 4.00 DGM 1.00 EPFL
#P, Peak Gflop/s, Memory GBytes, Disk GBytes
Archive TBytes: 1** - -
Operating system: Digital Unix, Windows NT, Digital Unix, Tru64 Unix
Connection: EasyNet bus, FE bus system, Crossbar 12x12, FE switch, EasyNet bus, FE switch, Crossbar 12x12, FE switch, ? (not decided), Crossbar 12x12, FE switch
* Baby T1 is an upgrade of T0(Dual)
** Archive ported from T0 to T1
March 2000 Swiss-T1
Components
  32 computational DS20E
  2 frontend DS20E
  1 development DS20E
  300 GB RAID disks
  600 GB distributed disks
  1 TB DLT archive
  Fast/Gigabit Ethernet
  Tru64/TruCluster Unix
  LSF, GRD/Codine
  Totalview, Paradyn
  MPICH/PVM

T-Net network technology
  (8+1) 12x12 crossbar, 100 MB/s
  32 bit PCI adapter, 75 MB/s (64 bit PCI adapter, 180 MB/s)
  Flexible, non-blocking
  Reliable
  Optimal routing
  FCI 5 µs, MPI 18 µs
  Monitoring system
  Remote control
  Up to 3 Tflop/s ( < 100)
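The latency and bandwidth figures quoted for T-Net can be combined in the usual linear cost model, time = latency + size / bandwidth. The sketch below reads the FCI and MPI figures as latencies in microseconds and pairs the MPI latency with the 75 MB/s of the 32 bit PCI adapter. Both readings, and the model itself, are assumptions made for illustration.

```python
def transfer_time_us(message_bytes: float, latency_us: float, bandwidth_mb_per_s: float) -> float:
    """Linear cost model. With 1 MB = 10**6 bytes, 1 MB/s equals 1 byte per microsecond."""
    return latency_us + message_bytes / bandwidth_mb_per_s


def effective_bandwidth_mb_per_s(message_bytes: float, latency_us: float, bandwidth_mb_per_s: float) -> float:
    """Bandwidth actually seen by a message of the given size."""
    return message_bytes / transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_s)


if __name__ == "__main__":
    mpi_latency_us = 18.0   # assumed reading of the MPI figure
    adapter_mb_per_s = 75.0  # 32 bit PCI adapter figure
    for size in (100, 1350, 100_000):
        print(size, "bytes:",
              round(transfer_time_us(size, mpi_latency_us, adapter_mb_per_s), 1), "us,",
              round(effective_bandwidth_mb_per_s(size, mpi_latency_us, adapter_mb_per_s), 1), "MB/s")
```

In this model a message of latency times bandwidth (1350 bytes here) reaches half of the nominal bandwidth; shorter messages are dominated by latency.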
March 2000 Swiss-T1 Architecture
March 2000 Swiss-T1 Routing table
Swiss-T1: Software in a Box (March 2000)
*Digital Unix (Compaq): Operating system in each box
*F77/F90 (Compaq): Fortran compilers
*HPF (Compaq): High Performance Fortran
*C/C++ (Compaq): C and C++ compilers
*DXML (Compaq): Digital math library in each box
*MPI (Compaq): SMP message passing interface
*Posix threads (Compaq): Threading in a box
*OpenMP (Compaq): Multiprocessor usage in a box through directives
*KAP-F (KAI): To parallelise a Fortran code in a multiprocessor box
*KAP-C (KAI): To parallelise a C program in a multiprocessor box
Swiss-T1: Software between Boxes (March 2000)
*LSF (Platform Inc.): Load Sharing Facility for resource management
*Totalview (Dolphin): Parallel debugger
*Paradyn (Madison/CSCS): Profiler to help parallelise programs
*MPI-1/FCI (SCS AG): Message passing interface between boxes running over T-Net
*MPICH (Argonne): Message passing interface running over Fast Ethernet
**PVM (UTK): Parallel virtual machine running over Fast Ethernet
*BLACS (UTK): Basic linear algebra communication subprograms
*ScaLAPACK (UTK): Linear algebra matrix solvers
MPI I/O (SCS/LSP): Message passing interface for I/O
MONITOR (EPFL): Monitoring of system parameters
NAG (NAG): Math library package
Ensight (Ensight): 4D visualisation
MEMCOM (SMR SA): Data management system for distributed architectures
Shmem (EPFL): Interface Cray to Swiss-Tx
March 2000 Baby T1 Architecture
Swiss-T1 : Alternative network March 2000
Swiss-T2 : K-Ring architecture
Create SwissTx Company
  Commercialise T-Net
  Commercialise dedicated machines
  Transfer know-how in parallel application technology
Between boxes: $\gamma_{mac}$ value
$\gamma_{mac}$ = (N * R [Mflop/s] scaled by the measured efficiency) / C [Mword/s]
Table: The $\gamma_{mac}$ values for Swiss-T0, Swiss-T0(Dual) and Swiss-T1 for MATMUL
* measured (SAXPY and Parkbench), ** expected
Columns: Machine, N, R, %, N * R, C, $\gamma_{mac}$
T0 (Bus): * 400 * 4 * 1100
T0(Dual) (Bus): 8* * 1000 * 4 * 1250
Baby T1 (Switch): 6* * 2400 * 90* 1 27
T1(local) (Switch): 4* * 1600 * 60 ** 1 27
T1(global) (Switch): 32* * * 400**
T1 (Fast Ethernet): 32* * 12800* 80** 1160
Time Schedule (March)
Phases: 1st phase, 2nd phase
Swiss-T2: 504 processors, OS not defined
Baby T1: 12 processors, Digital Unix
Swiss-T0(Dual): 16 processors, Windows NT
Swiss-T0(Dual): 16 processors, Digital Unix
Swiss-T1: 68 processors, Digital Unix
Machine classes: EasyNet bus based prototypes; T-Net switch based prototype/production machines
Phase I: Machines installed (March 2000)
Swiss-T0: 23 December 97 (accepted 25 May 98)
Swiss-T0(Dual): 29 September 98 (accepted 11 Dec. 98 / NT)
Swiss-T0(Dual): 29 September 98 (accepted 22 Jan. 99 / Unix)
Swiss-T1 Baby: 19 August 99 (accepted 18 Oct. 99 / Unix)
Swiss-T1: 21 Jan. 2000
Swiss-T1 Node Architecture (March 1999)
2nd Phase Swiss-Tx: The 8 WPs (March)
Managing Board: Michel Deville
Technical Team: Ralf Gruber
Management: Jean-Michel Lafourcade
WP1: Hardware development (Roland Paul, SCS)
WP2: Communication software development (Martin Frey, SCS)
WP3: System and user environment (Michel Jaunin, SIC-EPFL)
WP4: Data management issues (Roger Hersch, DI-EPFL)
WP5: Applications (Ralf Gruber, CAPA/SIC-EPFL)
WP6: Swiss-Tx concept (Pierre Kuonen, DI-EPFL)
WP7: Management (Jean-Michel Lafourcade, CAPA/DGM-EPFL)
WP8: SwissTx Spin-off Company (Jean-Michel Lafourcade, CAPA/DGM-EPFL)
2nd Phase Swiss-Tx: The MUSTs (March)
WP1: PCI adapter page table / 64 bit PCI adapter
WP2: Dual processor FCI / Network monitoring / Shmem
WP3: Management / Automatic SI / Monitoring / PE / Libraries
WP4: MPI-I/O / Distributed file management
WP5: Applications
WP6: Swiss-Tx architecture / Autoparallelisation
WP7: Management
WP8: SwissTx Spin-off Company