Case study: IBM Blue Gene/L system and InfiniBand
Interconnect family share for the 06/2011 TOP500 supercomputers

Interconnect Family    Count    Share %
Myrinet                    4     0.80 %
Quadrics                   1     0.20 %
Gigabit Ethernet
Infiniband
Mixed                      1     0.20 %
NUMAlink                   2     0.40 %
SP Switch                  1     0.20 %
Proprietary
Fat Tree                   1     0.20 %
Custom
Totals                   500      100 %
Overview of the IBM Blue Gene/L System Architecture
Design objectives
Hardware overview
–System architecture
–Node architecture
–Interconnect architecture
Highlights
A 64K-node, highly integrated supercomputer based on system-on-a-chip technology
–Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
Distributed-memory, massively parallel processing (MPP) architecture
–Uses the message-passing programming model (MPI)
360 Tflops peak performance
Optimized for cost/performance
Design objectives
Objective 1: a 360-Tflops supercomputer
–Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops Linpack (40.96 Tflops peak)
Objective 2: power efficiency
–Performance/rack = performance/watt * watt/rack
  Watt/rack is roughly constant at around 20 kW
  Performance/watt therefore determines performance/rack
Power efficiency:
–360 Tflops would require about 20 megawatts with conventional processors
–Need a low-power processor design (2-10 times better power efficiency)
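A rough check of these targets, using only the numbers given above (360 Tflops, 20 MW, roughly 20 kW per rack):

\[
\frac{360\ \text{Tflops}}{20\ \text{MW}} = 18\ \text{Mflops/W},
\qquad
18\ \text{Mflops/W} \times 20\ \text{kW} \approx 0.36\ \text{Tflops per rack}.
\]

At conventional power efficiency a 20 kW rack delivers only about 0.36 Tflops; a 2-10x improvement in performance/watt raises that to roughly 0.7-3.6 Tflops per rack, which is what brings a 360-Tflops machine within a practical rack count.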
Design objectives (continued)
Objective 3: extreme scalability
–Optimized for cost/performance: use low-power, less powerful processors, so a lot of processors are needed
  Up to 65,536 (64K) nodes, i.e., 131,072 processor cores
–Interconnect scalability
Blue Gene/L system components
Blue Gene/L Compute ASIC
2 PowerPC 440 cores with floating-point enhancements
–700 MHz
–Everything of a typical superscalar processor: pipelined microarchitecture with dual instruction fetch and decode, and out-of-order issue, dispatch, execution, and completion, etc.
–1 W each through extensive power management
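The clock rate and node count are consistent with the 360-Tflops peak figure if each core retires 4 floating-point operations per cycle. That 4 flops/cycle figure (two FPUs per core, each capable of a fused multiply-add) is part of the BG/L floating-point enhancements but is not spelled out on the slide above:

\[
2\ \text{cores} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \times 0.7\ \text{GHz} = 5.6\ \text{Gflops per node},
\qquad
65{,}536\ \text{nodes} \times 5.6\ \text{Gflops} \approx 367\ \text{Tflops} \approx 360\ \text{Tflops}.
\]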
Memory system on a BG/L node
BG/L only supports the distributed-memory paradigm, so there is no need for efficient hardware cache coherence on each node
–Coherence is enforced by software if needed
The two cores operate in two modes:
–Communication coprocessor mode: coherence is needed and is managed in system-level libraries
–Virtual node mode: memory is physically partitioned (not shared)
Blue Gene/L networks
Five networks:
–100 Mbps Ethernet control network for diagnostics, debugging, and some other things
–1000 Mbps Ethernet for I/O
–Three high-bandwidth, low-latency networks for data transmission and synchronization:
  3-D torus network for point-to-point communication
  Collective network for global operations
  Barrier network
All network logic is integrated in the BG/L node ASIC
–Memory-mapped interfaces from user space
3-D torus network
Supports point-to-point communication
Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
64x32x32 torus: diameter = 64 hops, worst-case hardware latency 6.4 us
Cut-through routing
Adaptive routing
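The 64-hop diameter follows from the wrap-around links: the farthest node is half-way around each dimension, 64/2 + 32/2 + 32/2 = 64 hops, so the 6.4 us worst case corresponds to roughly 100 ns per hop. Below is a minimal sketch (not from the slides) of how application point-to-point traffic is typically expressed: an MPI Cartesian communicator shaped like the torus, with a nearest-neighbor exchange that maps to single-hop links.

/* Minimal sketch (not from the slides): nearest-neighbor exchange on a 3-D
 * torus using MPI's Cartesian topology support. The full 64x32x32 shape needs
 * 65,536 MPI ranks; shrink dims[] to try it on a small machine. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[3]    = {64, 32, 32};   /* torus shape from the slide above     */
    int periods[3] = {1, 1, 1};      /* wrap-around (torus) links everywhere */
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &torus);

    if (torus != MPI_COMM_NULL) {
        /* Neighbors one hop away in the x dimension. */
        int left, right;
        MPI_Cart_shift(torus, 0, 1, &left, &right);

        /* Exchange one value with the -x and +x neighbors:
         * single-hop traffic on the torus network. */
        double send = 1.0, recv = 0.0;
        MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                     &recv, 1, MPI_DOUBLE, left,  0,
                     torus, MPI_STATUS_IGNORE);

        MPI_Comm_free(&torus);
    }
    MPI_Finalize();
    return 0;
}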
Collective network
Binary tree topology, static routing
Link bandwidth: 2.8 Gb/s
Maximum hardware latency: 5 us
Arithmetic and logic hardware in the tree can perform integer operations on the data
–Efficient support for reduce, scan, global sum, and broadcast operations
–Floating-point operations can be done with 2 passes
Barrier network
Hardware support for global synchronization: 1.5 us for a barrier on 64K nodes
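A minimal sketch (not from the slides) of the MPI calls that these two networks accelerate on BG/L: an integer global sum for the collective network and a barrier for the barrier network.

/* Minimal sketch: the MPI operations the collective and barrier networks
 * above are designed to accelerate (integer global sum in hardware; floating
 * point, per the slide, needs two passes). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Global integer sum: maps onto the arithmetic hardware of the
     * collective (tree) network. */
    int local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Global synchronization: maps onto the dedicated barrier network. */
    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 0..%d = %d\n", nprocs - 1, sum);

    MPI_Finalize();
    return 0;
}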
IBM Blue Gene/L summary
Optimized for cost/performance
–Limits the range of target applications
–Uses a low-power design: lower frequency, system-on-a-chip
–Great performance-per-watt metric
Scalability support
–Hardware support for global communication and barriers
–Low-latency, high-bandwidth networks
Case 2: InfiniBand architecture
–Specification (InfiniBand Architecture Specification release 1.2.1, January 2008 / Oct. 2006) available from the InfiniBand Trade Association
InfiniBand architecture overview
–Components: links, channel adapters, switches, routers
–The specification allows an InfiniBand wide area network, but it is mostly adopted as a system/storage area network
–Topology:
  Irregular
  Regular: fat tree
–Link speed:
  Single data rate (SDR): 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X)
  Double data rate (DDR): 5 Gbps (1X), 20 Gbps (4X)
  Quad data rate (QDR): 40 Gbps (4X)
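These rates are simply the per-lane signaling rate times the link width (1X, 4X, or 12X); DDR and QDR double and quadruple the per-lane rate. (A detail not on the slide: at these speeds the links use 8b/10b encoding, so usable data bandwidth is 80% of the signaling rates listed.)

\[
4\text{X SDR} = 4 \times 2.5 = 10\ \text{Gbps},\quad
12\text{X SDR} = 12 \times 2.5 = 30\ \text{Gbps},\quad
4\text{X DDR} = 4 \times 5 = 20\ \text{Gbps},\quad
4\text{X QDR} = 4 \times 10 = 40\ \text{Gbps}.
\]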
Layers: somewhat similar to TCP/IP
–Physical layer
–Link layer
  Error detection (CRC checksums)
  Flow control (credit based)
  Switching, virtual lanes (VL), forwarding tables computed by the subnet manager
–Single-path deterministic routing (not adaptive)
–Network layer: routing across subnets; not used in the cluster environment
–Transport layer
  Reliable/unreliable, connection/datagram
–Verbs: the interface between adapters and the OS/users
InfiniBand link layer
Packet format:
–Local Route Header (LRH): 8 bytes; used for local routing by switches within an IBA subnet
–Global Route Header (GRH): 40 bytes; used for routing between subnets
–Base Transport Header (BTH): 12 bytes; for IBA transport
–Extended transport headers:
  Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagrams
  Datagram Extended Transport Header (DETH): 8 bytes
  RDMA Extended Transport Header (RETH): 16 bytes
  Atomic, ACK, Atomic ACK, and Immediate Data extended transport headers: 4 bytes each, optimized for small packets
–Invariant CRC and variant CRC: CRCs over the fields that do not change in flight and those that do
Local Route Header:
–Switching is based on the destination port address (destination LID)
–Multipath switching by allocating multiple LIDs to one port
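These header sizes add up to a small fixed per-packet overhead. A rough illustration follows; the ICRC (4 bytes) and VCRC (2 bytes) sizes are taken from the IBA specification rather than from the slides.

/* Illustrative only: per-packet header overhead for a few IBA packet types,
 * using the header sizes listed above (ICRC = 4 B and VCRC = 2 B come from
 * the InfiniBand spec, not from the slides). */
#include <stdio.h>

enum {
    LRH  = 8,    /* Local Route Header: carries the DLID used by switches */
    GRH  = 40,   /* Global Route Header: only for inter-subnet traffic    */
    BTH  = 12,   /* Base Transport Header                                 */
    RETH = 16,   /* RDMA Extended Transport Header                        */
    ICRC = 4,    /* invariant CRC                                         */
    VCRC = 2     /* variant CRC                                           */
};

int main(void)
{
    /* Reliable-connection SEND inside one subnet: LRH + BTH + CRCs. */
    printf("RC SEND overhead:       %d bytes\n", LRH + BTH + ICRC + VCRC);

    /* RDMA WRITE inside one subnet adds the RETH. */
    printf("RC RDMA WRITE overhead: %d bytes\n", LRH + BTH + RETH + ICRC + VCRC);

    /* The same RDMA WRITE crossing subnets also carries a GRH. */
    printf("Routed RDMA WRITE:      %d bytes\n", LRH + GRH + BTH + RETH + ICRC + VCRC);
    return 0;
}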
Subnet management
Initializes the network
–Discovers the subnet topology and topology changes, computes the paths, assigns LIDs, distributes the routes, and configures devices
–Related devices and entities:
  Devices: channel adapters (CAs), host channel adapters (HCAs), switches, routers
  Subnet manager (SM): discovers, configures, activates, and manages the subnet
  A subnet management agent (SMA) in every device generates and responds to control packets (subnet management packets, SMPs) and configures local components for subnet management
  The SM exchanges control packets with SMAs through the subnet management interface (SMI)
Subnet management phases:
–Topology discovery: sending direct-routed SMPs to every port and processing the responses
–Path computation: computing valid paths between each pair of end nodes
–Path distribution: configuring the forwarding tables
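To make the path computation and path distribution phases concrete, here is a runnable toy (not subnet-manager code): for a small hand-wired fabric it computes, for every node, a forwarding table mapping each destination LID to an output port along a shortest path. A real SM would learn the topology with direct-routed SMPs and write these tables into the switches with configuration SMPs; the 4-node fabric below is purely illustrative.

/* Toy forwarding-table computation for a tiny hand-coded fabric, in the
 * spirit of the path computation / path distribution phases above. */
#include <stdio.h>

#define N 4                      /* nodes 0..3; think of them as LIDs      */
static const int adj[N][N] = {   /* adj[a][b] = port on a that reaches b,  */
    { -1,  1,  2, -1 },          /* -1 = no direct link                    */
    {  1, -1, -1,  2 },
    {  2, -1, -1,  1 },
    { -1,  2,  1, -1 },
};

int main(void)
{
    /* For every (source, destination) pair pick the first-hop port on a
     * shortest path, found by breadth-first search. */
    for (int src = 0; src < N; src++) {
        for (int dst = 0; dst < N; dst++) {
            if (src == dst) continue;
            int prev[N], queue[N], head = 0, tail = 0;
            for (int i = 0; i < N; i++) prev[i] = -1;
            prev[src] = src;
            queue[tail++] = src;
            while (head < tail) {
                int u = queue[head++];
                for (int v = 0; v < N; v++)
                    if (adj[u][v] >= 0 && prev[v] == -1) {
                        prev[v] = u;
                        queue[tail++] = v;
                    }
            }
            int hop = dst;                            /* walk back to find   */
            while (prev[hop] != src) hop = prev[hop]; /* the first hop       */
            printf("node %d: to LID %d -> port %d\n", src, dst, adj[src][hop]);
        }
    }
    return 0;
}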
Base Transport Header
Verbs
–The OS and users access the adapter through verbs
–Communication mechanism: the Queue Pair (QP)
  Users can queue up a set of instructions that the hardware executes
  Each QP is a pair of queues: one for send, one for receive
  Users post send requests to the send queue and receive requests to the receive queue
  Three types of send operations: SEND, RDMA (WRITE, READ, ATOMIC), and MEMORY BINDING
  One receive operation (matching SEND)
To communicate:
–Make system calls to set everything up (open the QP, bind the QP to a port, bind completion queues, connect the local QP to the remote QP, register memory, etc.)
–Post send/receive requests as user-level instructions
–Check completion
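A sketch of the data-path steps above, written against libibverbs (a widely used open-source verbs interface; the slides themselves only say "verbs"). The setup step is assumed to have been done already: the QP is connected, the completion queue exists, and buf lies inside the memory region mr registered with ibv_reg_mr(). The calls and flags used are real libibverbs names; the helper function itself is just an illustration.

/* Post one SEND and spin on the completion queue until it completes.
 * Returns 0 on success, -1 on any error. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,   /* registered user buffer          */
        .length = (uint32_t)len,
        .lkey   = mr->lkey,         /* local key from ibv_reg_mr()     */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,          /* plain SEND (matches a RECV) */
        .send_flags = IBV_SEND_SIGNALED,    /* ask for a completion entry  */
    };
    struct ibv_send_wr *bad = NULL;

    /* "Post send requests as user-level instructions": no system call here. */
    if (ibv_post_send(qp, &wr, &bad))
        return -1;

    /* "Check completion": poll the completion queue from user space. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                   /* busy-wait for the completion */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}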
InfiniBand has an almost perfect software/network interface:
–The network subsystem realizes most user-level functionality: the network supports in-order delivery and fault tolerance, and buffer management is pushed out to the user
–OS bypass: user-level access to the network interface; a few machine instructions accomplish the transmission task without involving the OS