1
Interconnection networks: the network interface and a case study
2
Network interface design issues
The networking requirements, from the user's perspective:
– In-order message delivery
– Reliable delivery
  - Error control
  - Flow control
– Deadlock-free operation
Typical network hardware features:
– Arbitrary delivery order (adaptive/multipath routing)
– Finite buffering
– Limited fault handling
How and where should we bridge the gap?
– In the network hardware? In the network system software? Or with a combined hardware/systems/software approach?
3
The Internet approach
– How does the Internet realize these functions?
  - No deadlock issue.
  - Reliability, flow control, and in-order delivery are handled at the TCP layer.
  - The network layer (IP) provides best-effort service.
– IP is implemented in software as well.
– Drawbacks:
  - Too many layers of software.
  - Users must go through the OS to access the communication hardware (system calls can cause context switches); see the sketch below.
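To make the last point concrete, here is a minimal sketch in C of the kernel-mediated path (address and port are placeholders, error handling omitted): each of the marked calls is a system call, so every message crosses the user/kernel boundary.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* system call */

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                       /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);     /* placeholder address */

    connect(fd, (struct sockaddr *)&peer, sizeof peer);  /* system call */

    const char msg[] = "hello";
    send(fd, msg, sizeof msg, 0);   /* system call; TCP supplies in-order,
                                       reliable, flow-controlled delivery */
    close(fd);                      /* system call */
    return 0;
}
```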
4
The approach in HPC networks
Where should these functions be realized?
– High-performance networking:
  - Most functionality below the network layer is implemented in hardware (or nearly in hardware).
  - The hardware provides the APIs for network transactions.
– If there is a mismatch between what the network provides and what users want, a software messaging layer is created to bridge the gap.
5
Messaging layer
The bridge between the hardware functionality and the user's communication requirements:
– Typical network hardware features:
  - Arbitrary delivery order (adaptive/multipath routing)
  - Finite buffering
  - Limited fault handling
– Typical user communication requirements:
  - In-order delivery
  - End-to-end flow control
  - Reliable transmission
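As an illustration of the bridging work, here is a minimal sketch (hypothetical names and sizes, not a real messaging-layer API) of how per-source sequence numbers and a finite reorder buffer can provide in-order delivery on top of hardware that may reorder packets.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 64            /* finite reorder buffer (illustrative size) */

struct packet {
    uint32_t seq;            /* per-source sequence number */
    char     payload[256];
};

struct reorder_state {
    uint32_t      next_seq;          /* next sequence number to deliver */
    bool          present[WINDOW];
    struct packet slot[WINDOW];
};

/* Called for every packet the hardware hands us; deliver() is whatever
   the layer above (e.g. MPI) expects. */
void on_arrival(struct reorder_state *st, const struct packet *p,
                void (*deliver)(const struct packet *))
{
    if (p->seq - st->next_seq >= WINDOW)
        return;                          /* duplicate or outside the window */

    uint32_t idx = p->seq % WINDOW;
    st->slot[idx] = *p;                  /* buffer the out-of-order arrival */
    st->present[idx] = true;

    /* Drain the head of the window while it is contiguous. */
    while (st->present[st->next_seq % WINDOW]) {
        uint32_t head = st->next_seq % WINDOW;
        deliver(&st->slot[head]);
        st->present[head] = false;
        st->next_seq++;
    }
}
```

For example, if packets 1 and 2 arrive before packet 0, they sit in the buffer and are delivered in the order 0, 1, 2 once packet 0 shows up; reliability and end-to-end flow control would add acknowledgments and credits on top of the same structure.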
6
Messaging Layer
7
Communication cost
Communication cost = hardware cost + software cost (messaging-layer cost)
– Hardware message time: msize / bandwidth
– Software time:
  - Buffer management
  - End-to-end flow control
  - Running protocols
– Which one dominates? It depends on how much the software has to do (see the worked example below).
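A worked example with illustrative numbers (not taken from the slide): at 1 GB/s, a 1 KB message occupies the hardware for about 1 µs, so a software path costing even a few microseconds dominates small-message cost.

```latex
T_{\mathrm{comm}} = T_{\mathrm{sw}} + \frac{m_{\mathrm{size}}}{B},
\qquad
\text{e.g. } \frac{1\,\mathrm{KB}}{1\,\mathrm{GB/s}} = 1\,\mu\mathrm{s}
```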
8
Network software/hardware interaction: a case study
A case study of communication performance issues on the CM-5:
– V. Karamcheti and A. A. Chien, "Software Overhead in Messaging Layers: Where Does the Time Go?" ACM ASPLOS-VI, 1994.
9
What do we see in the study?
The mismatch between user requirements and network functionality can introduce significant software overhead (50%-70%).
Implications:
– Should we focus on hardware, software, or hardware/software co-design?
– Improving routing performance may increase software cost: adaptive routing introduces out-of-order packets.
– Providing low-level network features directly to applications is problematic.
10
Summary from the study
In the design of a communication system, a holistic understanding must be achieved:
– Focusing on network hardware alone may not be sufficient; software overhead can be much larger than routing time.
– It would be ideal for the network to directly provide high-level services.
  - Newer generations of interconnect hardware try to achieve this.
11
Case studies
– IBM Blue Gene/L system
– InfiniBand
12
Interconnect family share for the 06/2011 Top 500 supercomputers

Interconnect Family | Count | Share % | Rmax Sum (GF) | Rpeak Sum (GF) | Processor Sum
Myrinet             |     4 |  0.80 % |      384451   |       524412   |        55152
Quadrics            |     1 |  0.20 % |       52840   |        63795   |         9968
Gigabit Ethernet    |   232 | 46.40 % |    11796979   |     22042181   |      2098562
Infiniband          |   206 | 41.20 % |    22980393   |     32759581   |      2411516
Mixed               |     1 |  0.20 % |       66567   |        82944   |        13824
NUMAlink            |     2 |  0.40 % |      107961   |       121241   |        18944
SP Switch           |     1 |  0.20 % |       75760   |        92781   |        12208
Proprietary         |    29 |  5.80 % |     9841862   |     13901082   |      1886982
Fat Tree            |     1 |  0.20 % |      122400   |       131072   |         1280
Custom              |    23 |  4.60 % |    13500813   |     15460859   |      1271488
Totals              |   500 |  100 %  | 58930025.59   |  85179949.00   |      7779924
13
Overview of the IBM Blue Gene/L system architecture
– Design objectives
– Hardware overview
  - System architecture
  - Node architecture
  - Interconnect architecture
14
Highlights
A 64K-node, highly integrated supercomputer based on system-on-a-chip technology:
– Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
– Distributed-memory, massively parallel processing (MPP) architecture
– Uses the message-passing programming model (MPI)
– 360 Tflops peak performance
– Optimized for cost/performance
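The peak figure follows from the node count and clock rate, assuming each core's enhanced (dual) FPU retires 4 flops per cycle; that per-cycle figure is an assumption not stated on the slide.

```latex
65536\ \text{nodes} \times 2\ \tfrac{\text{cores}}{\text{node}} \times 0.7\,\mathrm{GHz}
\times 4\ \tfrac{\text{flops}}{\text{cycle}}
\approx 367\,\mathrm{Tflops} \approx 360\,\mathrm{Tflops\ peak}
```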
15
Design objectives
Objective 1: a 360-Tflops supercomputer
– For comparison, the Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops
Objective 2: power efficiency
– Performance/rack = performance/watt * watt/rack
  - Watt/rack is roughly constant at around 20 kW
  - Performance/watt therefore determines performance/rack
16
Power efficiency:
– 360 Tflops would require about 20 megawatts with conventional processors
– A low-power processor design is needed (2-10 times better power efficiency); the arithmetic is worked out below
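The arithmetic behind the claim, using only the numbers on the slide:

```latex
\frac{360\,\mathrm{Tflops}}{20\,\mathrm{MW}} = 18\,\mathrm{Mflops/W}
\;\;\Rightarrow\;\;
2\text{--}10\times\ \text{better} = 36\text{--}180\,\mathrm{Mflops/W}
\;\;\Rightarrow\;\;
2\text{--}10\,\mathrm{MW}\ \text{total power for 360 Tflops}
```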
17
Design objectives (continued)
Objective 3: extreme scalability
– Optimizing for cost/performance means using low-power, less powerful processors, which in turn means needing many of them: up to 65536 processors.
– Interconnect scalability
– Reliability, availability, and serviceability
– Application scalability
18
Blue Gene/L system components
19
Blue Gene/L Compute ASIC
Two PowerPC 440 cores with floating-point enhancements:
– 700 MHz clock
– Everything expected of a typical superscalar processor: a pipelined microarchitecture with dual instruction fetch, decode, and out-of-order issue, dispatch, execution, and completion, etc.
– 1 W each through extensive power management
20
Blue Gene/L Compute ASIC
21
Memory system on a BG/L node
BG/L supports only the distributed-memory paradigm, so there is no need for efficient hardware cache coherence on each node.
– Coherence is enforced by software if needed.
The two cores operate in one of two modes:
– Communication coprocessor mode: coherence is needed and is managed in system-level libraries.
– Virtual node mode: memory is physically partitioned (not shared).
22
Blue Gene/L networks
Five networks:
– 100 Mbps Ethernet control network for diagnostics, debugging, and other management tasks
– 1000 Mbps Ethernet for I/O
– Three high-bandwidth, low-latency networks for data transmission and synchronization:
  - 3-D torus network for point-to-point communication
  - Collective network for global operations
  - Barrier network
All network logic is integrated in the BG/L node ASIC:
– Memory-mapped interfaces accessed from user space (see the generic sketch below)
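As a generic illustration of a memory-mapped user-space interface (the device path and register layout are hypothetical, not the actual BG/L API): the injection FIFO is mapped once, and subsequent sends are plain stores with no system call per message.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/torus0", O_RDWR);   /* hypothetical device node */
    if (fd < 0)
        return 1;

    /* Map the device's injection FIFO into the process address space. */
    volatile uint64_t *fifo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (fifo == MAP_FAILED)
        return 1;

    fifo[0] = 0x2A;          /* write a packet header word directly...   */
    fifo[1] = 0xDEADBEEF;    /* ...followed by payload words, no syscall */

    munmap((void *)fifo, 4096);
    close(fd);
    return 0;
}
```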
23
3-D torus network
– Supports point-to-point communication
– Link bandwidth 1.4 Gb/s, 6 bidirectional links per node (1.2 GB/s)
– 64x32x32 torus: diameter 32+16+16 = 64 hops, worst-case hardware latency 6.4 µs
– Cut-through routing
– Adaptive routing
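The diameter is half of each dimension, summed, because torus links wrap around; dividing the worst-case latency by the diameter gives roughly 100 ns per hop (a derived figure, not stated on the slide).

```latex
D_{64\times32\times32} = \frac{64}{2} + \frac{32}{2} + \frac{32}{2} = 64\ \text{hops},
\qquad
\frac{6.4\,\mu\mathrm{s}}{64\ \text{hops}} = 100\,\mathrm{ns/hop}
```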
24
Collective network
– Binary tree topology, static routing
– Link bandwidth: 2.8 Gb/s
– Maximum hardware latency: 5 µs
– Arithmetic and logic hardware in the tree can perform integer operations on the data
  - Efficient support for reduce, scan, global sum, and broadcast operations
  - Floating-point operations can be done with two passes (one possible scheme is sketched below)
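A sketch, under assumptions, of how a two-pass floating-point global sum could be built on integer-only reduction hardware: pass 1 takes the integer maximum of the exponents, pass 2 sums mantissas shifted to that common exponent. The reduction functions below are single-node stand-ins for the hypothetical hardware hooks, and precision/overflow handling is simplified.

```c
#include <math.h>
#include <stdint.h>

/* Stand-ins for the collective network's integer reductions; on one
   "node" they simply return their input. Real hardware would combine
   the contributions from all nodes. */
static int64_t tree_allreduce_max_i64(int64_t v) { return v; }
static int64_t tree_allreduce_sum_i64(int64_t v) { return v; }

/* Two-pass floating-point global sum on integer-only hardware. Low-order
   bits may be lost, and very large node counts would need a wider
   accumulator; this is only a sketch of the idea. */
double two_pass_global_sum(double x)
{
    int e = 0;
    frexp(x, &e);                              /* x = m * 2^e, 0.5 <= |m| < 1 */

    int64_t max_exp = tree_allreduce_max_i64(e);          /* pass 1 */

    /* Scale this contribution so the largest value keeps ~52 fraction bits. */
    int64_t fixed = (int64_t)ldexp(x, 52 - (int)max_exp);
    int64_t total = tree_allreduce_sum_i64(fixed);        /* pass 2 */

    return ldexp((double)total, (int)max_exp - 52);
}
```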
25
Barrier network
– Hardware support for global synchronization
– 1.5 µs for a barrier across 64K nodes
26
IBM Blue Gene/L summary
Optimized for cost/performance (which limits the range of applications):
– Low-power design: lower frequency, system-on-a-chip
– Great performance-per-watt metric
Scalability support:
– Hardware support for global communication and barriers
– Low-latency, high-bandwidth support