1
Interconnection networks: the network interface and a case study
2
Network interface design issues
The networking requirements, from the user's perspective:
– In-order message delivery
– Reliable delivery
  - Error control
  - Flow control
– Deadlock-free operation
Typical network hardware features:
– Arbitrary delivery order (adaptive/multipath routing)
– Finite buffering
– Limited fault handling
How and where should we bridge the gap?
– In the network hardware? In the network system software? Or with a combined hardware/systems/software approach?
3
The Internet approach
– How does the Internet realize these functions?
  - No deadlock issue.
  - Reliability, flow control, and in-order delivery are handled at the TCP layer.
  - The network layer (IP) provides best-effort service.
– IP is implemented in software as well.
– Drawbacks:
  - Too many layers of software.
  - Users must go through the OS to access the communication hardware (system calls can cause context switches); see the sketch below.
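To make the last point concrete, here is a minimal sketch in C of the kernel-mediated path (address and port are placeholders, error handling omitted): each of the marked calls is a system call, so every message crosses the user/kernel boundary.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);            /* system call */

    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                       /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);     /* placeholder address */

    connect(fd, (struct sockaddr *)&peer, sizeof peer);  /* system call */

    const char msg[] = "hello";
    send(fd, msg, sizeof msg, 0);   /* system call; TCP supplies in-order,
                                       reliable, flow-controlled delivery */
    close(fd);                      /* system call */
    return 0;
}
```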
4
The approach in HPC networks
Where should these functions be realized?
– High-performance networking:
  - Most functionality below the network layer is implemented in hardware (or nearly in hardware).
  - The hardware provides the APIs for network transactions.
– If there is a mismatch between what the network provides and what users want, a software messaging layer is created to bridge the gap.
5
Messaging layer
The bridge between the hardware functionality and the user's communication requirements:
– Typical network hardware features:
  - Arbitrary delivery order (adaptive/multipath routing)
  - Finite buffering
  - Limited fault handling
– Typical user communication requirements:
  - In-order delivery
  - End-to-end flow control
  - Reliable transmission
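As an illustration of the bridging work, here is a minimal sketch (hypothetical names and sizes, not a real messaging-layer API) of how per-source sequence numbers and a finite reorder buffer can provide in-order delivery on top of hardware that may reorder packets.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 64            /* finite reorder buffer (illustrative size) */

struct packet {
    uint32_t seq;            /* per-source sequence number */
    char     payload[256];
};

struct reorder_state {
    uint32_t      next_seq;          /* next sequence number to deliver */
    bool          present[WINDOW];
    struct packet slot[WINDOW];
};

/* Called for every packet the hardware hands us; deliver() is whatever
   the layer above (e.g. MPI) expects. */
void on_arrival(struct reorder_state *st, const struct packet *p,
                void (*deliver)(const struct packet *))
{
    if (p->seq - st->next_seq >= WINDOW)
        return;                          /* duplicate or outside the window */

    uint32_t idx = p->seq % WINDOW;
    st->slot[idx] = *p;                  /* buffer the out-of-order arrival */
    st->present[idx] = true;

    /* Drain the head of the window while it is contiguous. */
    while (st->present[st->next_seq % WINDOW]) {
        uint32_t head = st->next_seq % WINDOW;
        deliver(&st->slot[head]);
        st->present[head] = false;
        st->next_seq++;
    }
}
```

For example, if packets 1 and 2 arrive before packet 0, they sit in the buffer and are delivered in the order 0, 1, 2 once packet 0 shows up; reliability and end-to-end flow control would add acknowledgments and credits on top of the same structure.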
6
Messaging Layer
7
Communication cost
Communication cost = hardware cost + software cost (messaging-layer cost)
– Hardware message time: msize / bandwidth
– Software time:
  - Buffer management
  - End-to-end flow control
  - Running protocols
– Which one dominates? It depends on how much the software has to do (see the worked example below).
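A worked example with illustrative numbers (not taken from the slide): at 1 GB/s, a 1 KB message occupies the hardware for about 1 µs, so a software path costing even a few microseconds dominates small-message cost.

```latex
T_{\mathrm{comm}} = T_{\mathrm{sw}} + \frac{m_{\mathrm{size}}}{B},
\qquad
\text{e.g. } \frac{1\,\mathrm{KB}}{1\,\mathrm{GB/s}} = 1\,\mu\mathrm{s}
```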
8
Network software/hardware interaction: a case study
A case study of communication performance issues on the CM-5:
– V. Karamcheti and A. A. Chien, "Software Overhead in Messaging Layers: Where Does the Time Go?" ACM ASPLOS-VI, 1994.
9
What do we see in the study?
The mismatch between user requirements and network functionality can introduce significant software overhead (50%-70%).
Implications:
– Should we focus on hardware, software, or hardware/software co-design?
– Improving routing performance may increase software cost: adaptive routing introduces out-of-order packets.
– Providing low-level network features directly to applications is problematic.
10
Summary from the study
In the design of a communication system, a holistic understanding must be achieved:
– Focusing on network hardware alone may not be sufficient; software overhead can be much larger than routing time.
– It would be ideal for the network to directly provide high-level services.
  - Newer generations of interconnect hardware try to achieve this.
11
Case studies
– IBM Blue Gene/L system
– InfiniBand
12
Interconnect family share for the 06/2011 Top 500 supercomputers

Interconnect Family | Count | Share % | Rmax Sum (GF) | Rpeak Sum (GF) | Processor Sum
Myrinet             |     4 |  0.80 % |      384451   |       524412   |        55152
Quadrics            |     1 |  0.20 % |       52840   |        63795   |         9968
Gigabit Ethernet    |   232 | 46.40 % |    11796979   |     22042181   |      2098562
Infiniband          |   206 | 41.20 % |    22980393   |     32759581   |      2411516
Mixed               |     1 |  0.20 % |       66567   |        82944   |        13824
NUMAlink            |     2 |  0.40 % |      107961   |       121241   |        18944
SP Switch           |     1 |  0.20 % |       75760   |        92781   |        12208
Proprietary         |    29 |  5.80 % |     9841862   |     13901082   |      1886982
Fat Tree            |     1 |  0.20 % |      122400   |       131072   |         1280
Custom              |    23 |  4.60 % |    13500813   |     15460859   |      1271488
Totals              |   500 |  100 %  | 58930025.59   |  85179949.00   |      7779924
13
Overview of the IBM Blue Gene/L system architecture
– Design objectives
– Hardware overview
  - System architecture
  - Node architecture
  - Interconnect architecture
14
Highlights
A 64K-node, highly integrated supercomputer based on system-on-a-chip technology:
– Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
– Distributed-memory, massively parallel processing (MPP) architecture
– Uses the message-passing programming model (MPI)
– 360 Tflops peak performance
– Optimized for cost/performance
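The peak figure follows from the node count and clock rate, assuming each core's enhanced (dual) FPU retires 4 flops per cycle; that per-cycle figure is an assumption not stated on the slide.

```latex
65536\ \text{nodes} \times 2\ \tfrac{\text{cores}}{\text{node}} \times 0.7\,\mathrm{GHz}
\times 4\ \tfrac{\text{flops}}{\text{cycle}}
\approx 367\,\mathrm{Tflops} \approx 360\,\mathrm{Tflops\ peak}
```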
15
Design objectives
Objective 1: a 360-Tflops supercomputer
– For comparison, the Earth Simulator (Japan, fastest supercomputer from 2002 to 2004): 35.86 Tflops
Objective 2: power efficiency
– Performance/rack = performance/watt * watt/rack
  - Watt/rack is roughly constant at around 20 kW
  - Performance/watt therefore determines performance/rack
16
Power efficiency:
– 360 Tflops would require about 20 megawatts with conventional processors
– A low-power processor design is needed (2-10 times better power efficiency); the arithmetic is worked out below
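The arithmetic behind the claim, using only the numbers on the slide:

```latex
\frac{360\,\mathrm{Tflops}}{20\,\mathrm{MW}} = 18\,\mathrm{Mflops/W}
\;\;\Rightarrow\;\;
2\text{--}10\times\ \text{better} = 36\text{--}180\,\mathrm{Mflops/W}
\;\;\Rightarrow\;\;
2\text{--}10\,\mathrm{MW}\ \text{total power for 360 Tflops}
```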
17
Design objectives (continued)
Objective 3: extreme scalability
– Optimizing for cost/performance means using low-power, less powerful processors, which in turn means needing many of them: up to 65536 processors.
– Interconnect scalability
– Reliability, availability, and serviceability
– Application scalability
18
Blue Gene/L system components
19
Blue Gene/L Compute ASIC
Two PowerPC 440 cores with floating-point enhancements:
– 700 MHz clock
– Everything expected of a typical superscalar processor: a pipelined microarchitecture with dual instruction fetch, decode, and out-of-order issue, dispatch, execution, and completion, etc.
– 1 W each through extensive power management
20
Blue Gene/L Compute ASIC
21
Memory system on a BG/L node
BG/L supports only the distributed-memory paradigm, so there is no need for efficient hardware cache coherence on each node.
– Coherence is enforced by software if needed.
The two cores operate in one of two modes:
– Communication coprocessor mode: coherence is needed and is managed in system-level libraries.
– Virtual node mode: memory is physically partitioned (not shared).
22
Blue Gene/L networks
Five networks:
– 100 Mbps Ethernet control network for diagnostics, debugging, and other management tasks
– 1000 Mbps Ethernet for I/O
– Three high-bandwidth, low-latency networks for data transmission and synchronization:
  - 3-D torus network for point-to-point communication
  - Collective network for global operations
  - Barrier network
All network logic is integrated in the BG/L node ASIC:
– Memory-mapped interfaces accessed from user space (see the generic sketch below)
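As a generic illustration of a memory-mapped user-space interface (the device path and register layout are hypothetical, not the actual BG/L API): the injection FIFO is mapped once, and subsequent sends are plain stores with no system call per message.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/torus0", O_RDWR);   /* hypothetical device node */
    if (fd < 0)
        return 1;

    /* Map the device's injection FIFO into the process address space. */
    volatile uint64_t *fifo = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (fifo == MAP_FAILED)
        return 1;

    fifo[0] = 0x2A;          /* write a packet header word directly...   */
    fifo[1] = 0xDEADBEEF;    /* ...followed by payload words, no syscall */

    munmap((void *)fifo, 4096);
    close(fd);
    return 0;
}
```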
23
3-D torus network
– Supports point-to-point communication
– Link bandwidth 1.4 Gb/s, 6 bidirectional links per node (1.2 GB/s)
– 64x32x32 torus: diameter 32+16+16 = 64 hops, worst-case hardware latency 6.4 µs
– Cut-through routing
– Adaptive routing
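The diameter is half of each dimension, summed, because torus links wrap around; dividing the worst-case latency by the diameter gives roughly 100 ns per hop (a derived figure, not stated on the slide).

```latex
D_{64\times32\times32} = \frac{64}{2} + \frac{32}{2} + \frac{32}{2} = 64\ \text{hops},
\qquad
\frac{6.4\,\mu\mathrm{s}}{64\ \text{hops}} = 100\,\mathrm{ns/hop}
```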
24
Collective network
– Binary tree topology, static routing
– Link bandwidth: 2.8 Gb/s
– Maximum hardware latency: 5 µs
– Arithmetic and logic hardware in the tree can perform integer operations on the data
  - Efficient support for reduce, scan, global sum, and broadcast operations
  - Floating-point operations can be done with two passes (one possible scheme is sketched below)
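A sketch, under assumptions, of how a two-pass floating-point global sum could be built on integer-only reduction hardware: pass 1 takes the integer maximum of the exponents, pass 2 sums mantissas shifted to that common exponent. The reduction functions below are single-node stand-ins for the hypothetical hardware hooks, and precision/overflow handling is simplified.

```c
#include <math.h>
#include <stdint.h>

/* Stand-ins for the collective network's integer reductions; on one
   "node" they simply return their input. Real hardware would combine
   the contributions from all nodes. */
static int64_t tree_allreduce_max_i64(int64_t v) { return v; }
static int64_t tree_allreduce_sum_i64(int64_t v) { return v; }

/* Two-pass floating-point global sum on integer-only hardware. Low-order
   bits may be lost, and very large node counts would need a wider
   accumulator; this is only a sketch of the idea. */
double two_pass_global_sum(double x)
{
    int e = 0;
    frexp(x, &e);                              /* x = m * 2^e, 0.5 <= |m| < 1 */

    int64_t max_exp = tree_allreduce_max_i64(e);          /* pass 1 */

    /* Scale this contribution so the largest value keeps ~52 fraction bits. */
    int64_t fixed = (int64_t)ldexp(x, 52 - (int)max_exp);
    int64_t total = tree_allreduce_sum_i64(fixed);        /* pass 2 */

    return ldexp((double)total, (int)max_exp - 52);
}
```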
25
Barrier network
– Hardware support for global synchronization
– 1.5 µs for a barrier across 64K nodes
26
IBM Blue Gene/L summary
Optimized for cost/performance (which limits the range of applications):
– Low-power design: lower frequency, system-on-a-chip
– Great performance-per-watt metric
Scalability support:
– Hardware support for global communication and barriers
– Low-latency, high-bandwidth support