STATE OF THE ART
Kei Davis and Fabrizio Petrini {kei,fabrizio}@lanl.gov, Europar 2004, Pisa, Italy
Section 2: Overview
- We briefly describe some state-of-the-art supercomputers.
- The goal is to evaluate the degree of integration of the three main components: processing nodes, the interconnection network, and system software.
- The analysis is limited to six supercomputers (ASCI Q, ASCI Thunder, System X, BlueGene/L, Cray XD1 and ASCI Red Storm), due to space and time limitations.
ASCI Q: Los Alamos National Laboratory
ASCI Q
- Total: 20.48 TF/s, #3 in the Top 500
- Systems: 2,048 AlphaServer ES45s; 8,192 EV-68 1.25-GHz CPUs with 16-MB cache
- Memory: 22 Terabytes
- System interconnect: dual-rail Quadrics interconnect; 4,096 QSW PCI adapters; four 1,024-way QSW federated switches
- Operational in 2002
Node: HP (Compaq) AlphaServer ES45 21264 system architecture (block diagram): four EV68 1.25-GHz CPUs with 16-MB cache per CPU; up to 32 GB of memory on four memory motherboards, reached over two 256-bit, 125-MHz buses (4.0 GB/s each); PCI I/O on 64-bit buses at 33 MHz (266 MB/s) and 66 MHz (528 MB/s), including hot-swap slots.
QsNET: Quaternary Fat Tree
- Hardware support for collective communication
- MPI latency 4 µs, bandwidth 300 MB/s
- Barrier latency less than 10 µs
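Latency and bandwidth figures like these are the kind of numbers a simple MPI ping-pong microbenchmark produces. A minimal sketch of that measurement idea (message size and iteration count are arbitrary choices here, not the benchmark actually used for the slide):

```c
/* Minimal MPI ping-pong sketch: half the round-trip time of a tiny message
 * approximates latency; large messages approximate point-to-point bandwidth.
 * Run with at least two ranks (e.g., mpirun -np 2). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    const int bytes = 1 << 20;            /* 1 MB for the bandwidth test */
    char *buf = malloc(bytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0) {
        /* Each iteration moves the message twice (there and back). */
        printf("bandwidth: %.1f MB/s\n", 2.0 * iters * bytes / t / 1e6);
        /* Re-running with a few bytes and reporting t / (2 * iters)
         * estimates the half round-trip latency. */
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```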
Interconnection Network (diagram): a federated fat tree of sixteen 64-up/64-down switches (nodes 0-63 on the 1st, ..., nodes 960-1023 on the 16th), joined through mid-level and top-level switch stages; 1024 nodes per rail (2x = 2048 nodes).
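For scale, a quaternary fat tree (a 4-ary n-tree) built from radix-8 switch elements with 4 links down and 4 links up needs log4(N) switch levels; the radix-8 building block is an assumption about the abstract topology, not the physical QSW switch packaging shown above. A short sketch of the arithmetic:

```c
/* Sketch of k-ary n-tree sizing: N = k^n end nodes, n switch levels,
 * N/k switch elements per level. Assumes radix-2k elements (k down, k up);
 * this describes the abstract topology, not the QSW cabinet packaging. */
#include <stdio.h>

int main(void)
{
    const int k = 4;          /* quaternary fat tree */
    const int N = 1024;       /* end nodes per rail */

    int levels = 0, nodes = 1;
    while (nodes < N) { nodes *= k; levels++; }   /* levels = ceil(log_k N) */

    int per_level = nodes / k;                    /* switch elements per level */
    printf("%d nodes: %d levels, %d elements per level, %d total\n",
           nodes, levels, per_level, levels * per_level);
    /* For k = 4, N = 1024: 5 levels, 256 elements per level, 1280 total. */
    return 0;
}
```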
System Software
- Operating system is Tru64
- Nodes organized in clusters of 32 for resource allocation and administration purposes (TruCluster)
- Resource management carried out over Ethernet (RMS)
ASCI Q: Overview
- Node integration: Low (multiple boards per node, network interface on the I/O bus)
- Network integration: High (HW support for atomic collective primitives)
- System software integration: Medium/Low (TruCluster)
ASCI Thunder, 1,024 nodes, 23 TF/s peak
ASCI Thunder, Lawrence Livermore National Laboratory: 1,024 nodes, 4,096 processors, 23 TF/s, #2 in the Top 500
ASCI Thunder: Configuration
- 1,024 nodes, quad 1.4-GHz Itanium2, 8 GB DDR266 SDRAM per node (8 Terabytes total)
- 2.5 µs MPI latency and 912 MB/s bandwidth over Quadrics Elan4; barrier synchronization 6 µs, allreduce 15 µs
- 75 TB of local disk (73 GB/node, UltraSCSI320)
- Lustre file system with 6.4 GB/s delivered parallel I/O performance
- Linux RH 3.0, SLURM, CHAOS
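Barrier and allreduce times like those quoted above are typically obtained by timing many back-to-back collectives and dividing by the iteration count. A minimal sketch of that procedure (iteration count is an arbitrary choice, not the actual measurement code behind the slide):

```c
/* Minimal sketch for timing MPI collectives (barrier and a 1-element
 * allreduce), illustrating how figures such as "barrier 6 us" are obtained. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    double in = (double)rank, out = 0.0, t0, t;

    MPI_Barrier(MPI_COMM_WORLD);              /* warm up and synchronize */
    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    t = MPI_Wtime() - t0;
    if (rank == 0)
        printf("barrier:   %.2f us\n", 1e6 * t / iters);

    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    t = MPI_Wtime() - t0;
    if (rank == 0)
        printf("allreduce: %.2f us\n", 1e6 * t / iters);

    MPI_Finalize();
    return 0;
}
```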
CHAOS: Clustered High Availability Operating System
- Derived from Red Hat, but differs in the following areas:
  - Modified kernel (Lustre and hardware-specific patches)
  - New packages for cluster monitoring, system installation, power/console management
  - SLURM, an open-source resource manager
ASCI Thunder: Overview
- Node integration: Medium/Low (network interface on the I/O bus)
- Network integration: Very High (HW support for atomic collective primitives)
- System software integration: Medium (CHAOS)
System X: Virginia Tech
System X, 10.28 TF/s
- 1,100 dual Apple G5 2-GHz CPU based nodes
  - 8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance
- Each node has 4 GB of main memory and 160 GB of Serial ATA storage
  - 176 TB total secondary storage
- InfiniBand: 8 µs latency and 870 MB/s bandwidth; partial support for collective communication
- System-level fault tolerance (Déjà vu)
System X: Overview
- Node integration: Medium/Low (network interface on the I/O bus)
- Network integration: Medium (limited support for atomic collective primitives)
- System software integration: Medium (system-level fault tolerance)
BlueGene/L System (packaging hierarchy)
- Chip (2 processors): 2.8/5.6 GF/s, 4 MB
- Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
- Node card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
- Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
- System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
BlueGene/L compute ASIC (block diagram): two PowerPC 440 cores (one CPU, one I/O processor), each with 32k/32k L1 caches and a "double FPU"; L2 and a multiported shared SRAM buffer; a shared L3 directory (with ECC) for 4 MB of embedded DRAM usable as L3 cache or memory; a 144-bit-wide DDR controller with ECC for 256/512 MB of external memory; integrated torus (6 out and 6 in links at 1.4 Gbit/s each), tree (3 out and 3 in links at 2.8 Gbit/s each), global interrupt (4 global barriers or interrupts), Gbit Ethernet and JTAG access. IBM CU-11 0.13-µm process, 11 x 11 mm die, 25 x 32 mm CBGA, 474 pins (328 signal), 1.5/2.5 V.
BlueGene/L node card (photo): 16 compute cards, 2 I/O cards, DC-DC converters (40 V to 1.5 V and 2.5 V).
BlueGene/L Interconnection Networks
- 3-dimensional torus
  - Interconnects all compute nodes (65,536)
  - Virtual cut-through hardware routing
  - 1.4 Gb/s on all 12 node links (2.1 GBytes/s per node)
  - 350/700 GBytes/s bisection bandwidth
  - Communications backbone for computations
- Global tree
  - One-to-all broadcast functionality
  - Reduction operations functionality
  - 2.8 Gb/s of bandwidth per link
  - Latency of a tree traversal on the order of 5 µs
  - Interconnects all compute and I/O nodes (1,024)
- Ethernet
  - Incorporated into every node ASIC
  - Active in the I/O nodes (1:64)
  - All external communication (file I/O, control, user interaction, etc.)
- Low-latency global barrier
  - 8 single wires crossing the whole system, touching all nodes
- Control network (JTAG)
  - For booting, checkpointing, error logging
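On a 3-D torus the minimal distance between two nodes is the sum of the wrap-around distances along each axis. A small sketch of the coordinate mapping and hop count for the 64x32x32 system (the simple linear rank-to-coordinate ordering is an illustrative assumption, not BlueGene/L's actual task placement):

```c
/* Sketch: map a linear node index to (x, y, z) coordinates on a 64x32x32
 * torus and compute the minimal hop count between two nodes. */
#include <stdio.h>
#include <stdlib.h>

#define DX 64
#define DY 32
#define DZ 32

typedef struct { int x, y, z; } Coord;

static Coord to_coord(int rank)
{
    Coord c = { rank % DX, (rank / DX) % DY, rank / (DX * DY) };
    return c;
}

/* Shortest distance along one ring of size dim (may wrap around). */
static int ring_dist(int a, int b, int dim)
{
    int d = abs(a - b);
    return d < dim - d ? d : dim - d;
}

static int torus_hops(int r1, int r2)
{
    Coord a = to_coord(r1), b = to_coord(r2);
    return ring_dist(a.x, b.x, DX) + ring_dist(a.y, b.y, DY) +
           ring_dist(a.z, b.z, DZ);
}

int main(void)
{
    printf("nodes: %d\n", DX * DY * DZ);                     /* 65536 */
    printf("hops 0 -> 65535: %d\n", torus_hops(0, DX * DY * DZ - 1));
    /* Worst case is half of each dimension away: 32 + 16 + 16 = 64 hops. */
    return 0;
}
```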
BlueGene/L System Software Organization
- Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK)
- I/O nodes run Linux and provide OS services
  - File access
  - Process launch/termination
  - Debugging
- Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring), largely transparent to application/system software
Operating Systems
- Compute nodes: CNK
  - Specialized, simple OS: 5,000 lines of code, 40 KBytes in core
  - No thread support, no virtual memory
  - Protection: protect the kernel from the application; some network devices in user space
  - File I/O offloaded ("function shipped") to I/O nodes through kernel system calls
  - "Boot, start the app and then stay out of the way"
- I/O nodes: Linux
  - 2.4.19 kernel (2.6 underway) with ramdisk
  - NFS/GPFS client
  - CIO daemon to start/stop jobs and execute file I/O
- Global OS (CMCS, service node)
  - Invisible to user programs
  - Global and collective decisions
  - Interfaces with external policy modules (e.g., the job scheduler)
  - Commercial database technology (DB2) stores static and dynamic state: partition selection, partition boot, running of jobs, system error logs, checkpoint/restart mechanism
  - Scalability, robustness, security
- Execution mechanisms in the core; policy decisions in the service node
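The "function shipping" of file I/O can be pictured as the CNK packing system-call arguments into a request for the I/O node's CIO daemon, which issues the real call and ships the result back. A toy single-process sketch of that idea (the message format and the names cnk_write/ciod_handle_write are invented for illustration; this is not the actual BlueGene/L protocol):

```c
/* Toy sketch of function-shipped file I/O: the compute-node side packs the
 * write() arguments into a request; the I/O-node side performs the real
 * system call and returns the result. Here both sides run in one process and
 * the "message" is passed by pointer; on the real machine the request would
 * travel over the tree network to the CIO daemon on the I/O node. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_PAYLOAD 4096

struct io_request {                 /* hypothetical request format */
    int fd;
    size_t len;
    char payload[MAX_PAYLOAD];
};

/* "I/O-node" side: unpack the request and issue the real system call. */
static ssize_t ciod_handle_write(const struct io_request *req)
{
    return write(req->fd, req->payload, req->len);
}

/* "Compute-node" side: what a CNK write() stub might do instead of
 * performing the I/O locally. */
static ssize_t cnk_write(int fd, const void *buf, size_t len)
{
    struct io_request req;
    if (len > MAX_PAYLOAD)
        len = MAX_PAYLOAD;          /* a real protocol would split large writes */
    req.fd = fd;
    req.len = len;
    memcpy(req.payload, buf, len);
    /* ...send req to the I/O node and wait for the reply... (simulated) */
    return ciod_handle_write(&req);
}

int main(void)
{
    const char msg[] = "hello from a function-shipped write\n";
    return cnk_write(STDOUT_FILENO, msg, sizeof msg - 1) < 0;
}
```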
BlueGene/L: Overview
- Node integration: High (the processing node integrates processors and network interfaces, with the network interfaces directly connected to the processors)
- Network integration: High (separate tree network)
- System software integration: Medium/High (compute kernels are not globally coordinated)
- #2 and #4 in the Top 500
Cray XD1
Cray XD1 System Architecture
- Compute: 12 AMD Opteron 32/64-bit x86 processors; high-performance Linux
- RapidArray interconnect: 12 communications processors; 1 Tb/s switch fabric
- Active management: dedicated processor
- Application acceleration: 6 co-processors
- Processors directly connected to the interconnect
Cray XD1 processing node (chassis diagram): six 2-way SMP blades, six SATA hard drives, four independent PCI-X slots, a 500 Gb/s crossbar switch with a 12-port inter-chassis connector, a connector to a 2nd 500 Gb/s crossbar switch and 12-port inter-chassis connector, and 4 fans.
Cray XD1 compute blade (diagram): two AMD Opteron 2XX processors, each with 4 DIMM sockets of DDR 400 registered ECC memory, a RapidArray communications processor, and a connector to the main board.
Fast Access to the Interconnect
- Cray XD1: memory 6.4 GB/s (DDR 400); interconnect 8 GB/s (RapidArray)
- Xeon server: memory 5.3 GB/s (DDR 333); I/O 1 GB/s (PCI-X); interconnect 0.25 GB/s (GigE)
Communications Optimizations: RapidArray communications processor
- HT/RA tunnelling with bonding
- Routing with route redundancy
- Reliable transport
- Short-message latency optimization
- DMA operations
- System-wide clock synchronization
(Diagram: the Opteron connects to the RapidArray communications processor over a 3.2 GB/s HyperTransport link; the RA processor provides two 2 GB/s links into the RapidArray fabric.)
Active Management Software
- Usability: single-system command and control
- Resiliency:
  - Dedicated management processors, real-time OS and communications fabric
  - Proactive background diagnostics with self-healing
  - Synchronized Linux kernels
Cray XD1: Overview
- Node integration: High (direct access from HyperTransport to RapidArray)
- Network integration: Medium/High (HW support for collective communication)
- System software integration: High (compute kernels are globally coordinated)
- Early stage
ASCI Red Storm
Red Storm Architecture
- Distributed-memory MIMD parallel supercomputer
- Fully connected 3D mesh interconnect; each compute node processor has a bidirectional connection to the primary communication network
- 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz)
- ~10 TB of DDR memory @ 333 MHz
- Red/Black switching: ~1/4, ~1/2, ~1/4
- 8 service and I/O cabinets on each end (256 processors for each color); 240 TB of disk storage (120 TB per color)
Red Storm Architecture
- Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
- Partitioned operating system (OS): Linux on service and I/O nodes, LWK (Catamount) on compute nodes, stripped-down Linux on RAS nodes
- Separate RAS and system management network (Ethernet)
- Router table-based routing in the interconnect
Red Storm architecture (partition diagram): users reach the service partition; the compute partition sits between the service nodes and the file I/O (/home) and network I/O partitions.
System layout (27 x 16 x 24 mesh) (diagram): normally unclassified cabinets at one end, normally classified cabinets at the other, switchable nodes in the middle, with disconnect cabinets between the sections.
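The layout is consistent with the figures quoted earlier; a quick arithmetic check (the even spread of memory across nodes is an assumption for the per-node estimate):

```c
/* Consistency check of the Red Storm figures: a 27 x 16 x 24 mesh gives
 * 10,368 positions, matching the 10,368 compute processors, and ~10 TB of
 * DDR memory spread evenly over them is roughly 1 GB per node. */
#include <stdio.h>

int main(void)
{
    int nodes = 27 * 16 * 24;
    double total_mem_tb = 10.0;
    printf("mesh positions:   %d\n", nodes);                        /* 10368 */
    printf("memory per node: ~%.2f GB\n", total_mem_tb * 1024.0 / nodes);
    return 0;
}
```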
Red Storm System Software
- Run-time system: logarithmic loader (fast, efficient); node allocator; batch system (PBS); libraries (MPI, I/O, math)
- File systems being considered include: PVFS (interim file system); Lustre (Pathforward support); Panasas...
- Operating systems: Linux on service and I/O nodes; Sandia's LWK (Catamount) on compute nodes; Linux on RAS nodes
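A "logarithmic loader" fans the job image out over a doubling tree, so the number of forwarding rounds grows with log2 of the node count rather than linearly. A small sketch of such a doubling schedule (illustrative only; not Red Storm's actual loader implementation):

```c
/* Sketch of a logarithmic (recursive-doubling) fan-out schedule: in round r,
 * every node that already holds the job image forwards it to the node whose
 * id is its own plus 2^r. After ceil(log2(N)) rounds all N nodes have it. */
#include <stdio.h>

int main(void)
{
    const int N = 16;                 /* small node count for readability */
    int have[16] = { 1 };             /* node 0 starts with the job image */
    int rounds = 0;

    for (int step = 1; step < N; step *= 2, rounds++) {
        int newly[16] = { 0 };        /* recipients in this round */
        for (int src = 0; src < N; src++) {
            int dst = src + step;
            if (have[src] && dst < N && !have[dst]) {
                printf("round %d: node %2d -> node %2d\n", rounds, src, dst);
                newly[dst] = 1;
            }
        }
        for (int i = 0; i < N; i++)   /* recipients forward in later rounds */
            if (newly[i]) have[i] = 1;
    }
    printf("%d nodes loaded in %d rounds (log2 N)\n", N, rounds);
    return 0;
}
```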
ASCI Red Storm: Overview
- Node integration: High (direct access from HyperTransport to the network through a custom network interface chip)
- Network integration: Medium (no support for collective communication)
- System software integration: Medium/High (scalable resource manager, no global coordination between nodes)
- Expected to become the most powerful machine in the world (competition permitting)
Overview
                 Node Integration    Network Integration    Software Integration
ASCI Q           Low                 High                   Medium/Low
ASCI Thunder     Medium/Low          Very High              Medium
System X         Medium/Low          Medium                 Medium
BlueGene/L       High                High                   Medium/High
Cray XD1         High                Medium/High            High
Red Storm        High                Medium                 Medium/High