Case study: IBM BlueGene/L system and InfiniBand

Interconnect family share for the 06/2011 Top 500 supercomputers:

Interconnect Family   Count   Share %   Rmax Sum (GF)   Rpeak Sum (GF)   Processor Sum
Myrinet                   4    0.80 %   n/a             n/a              n/a
Quadrics                  1    0.20 %   n/a             n/a              n/a
Gigabit Ethernet        n/a    n/a      n/a             n/a              n/a
Infiniband              n/a    n/a      n/a             n/a              n/a
Mixed                     1    0.20 %   n/a             n/a              n/a
NUMAlink                  2    0.40 %   n/a             n/a              n/a
SP Switch                 1    0.20 %   n/a             n/a              n/a
Proprietary             n/a    n/a      n/a             n/a              n/a
Fat Tree                  1    0.20 %   n/a             n/a              n/a
Custom                  n/a    n/a      n/a             n/a              n/a
Totals                  500    100 %    n/a             n/a              n/a

Overview of the IBM Blue Gene/L System Architecture
Design objectives
Hardware overview
–System architecture
–Node architecture
–Interconnect architecture

Highlights
A 64K-node, highly integrated supercomputer based on system-on-a-chip technology
–Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
Distributed-memory, massively parallel processing (MPP) architecture
–Uses the message-passing programming model (MPI)
360 Tflops peak performance
Optimized for cost/performance

Design objectives
Objective 1: a 360-Tflops supercomputer
–Earth Simulator (Japan, the fastest supercomputer from 2002 to 2004): 35.86 Tflops (Linpack)
Objective 2: power efficiency
–Performance/rack = performance/watt * watt/rack
  Watt/rack is roughly constant, around 20 kW
  Performance/watt therefore determines performance/rack

Power efficiency:
–360 Tflops would need roughly 20 megawatts with conventional processors
–Need a low-power processor design (2-10 times better power efficiency); see the rough arithmetic below
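A back-of-the-envelope check of what the formula above implies (the 64-rack count is BG/L's actual full configuration; the per-rack arithmetic is our illustration, not from the slides):

360 Tflops / 64 racks ≈ 5.6 Tflops per rack
5.6 Tflops / 20 kW ≈ 0.28 Gflops per watt

Conventional processors of that era fell well short of 0.28 Gflops per watt, hence the 2-10x power-efficiency target.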

Design objectives (continued)
Objective 3: extreme scalability
–Optimized for cost/performance => use low-power, less powerful processors => need a lot of processors
  Up to 65,536 nodes (131,072 processor cores)
–Interconnect scalability

Blue Gene/L system components

Blue Gene/L Compute ASIC
2 PowerPC 440 cores with floating-point enhancements
–700 MHz
–All the machinery of a typical superscalar processor: a pipelined microarchitecture with dual instruction fetch, decode, and out-of-order issue, dispatch, execution, and completion, etc.
–About 1 W each, through extensive power management

Blue Gene/L Compute ASIC

Memory system on a BG/L node
BG/L only supports the distributed-memory paradigm
–No need for efficient hardware support for cache coherence on each node
–Coherence is enforced by software if needed
The two cores operate in two modes:
–Communication coprocessor mode
  Needs coherence, managed in system-level libraries
–Virtual node mode
  Memory is physically partitioned (not shared)

Blue Gene/L networks
Five networks:
–100 Mbps Ethernet control network for diagnostics, debugging, and other management tasks
–1000 Mbps Ethernet for I/O
–Three high-bandwidth, low-latency networks for data transmission and synchronization:
  3-D torus network for point-to-point communication
  Collective network for global operations
  Barrier network
All network logic is integrated in the BG/L node ASIC
–Memory-mapped interfaces from user space

3-D torus network
Supports point-to-point communication
Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
64x32x32 torus: diameter = 64 hops (see the sketch below), worst-case hardware latency 6.4 us
Cut-through routing
Adaptive routing
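Where the 64-hop diameter comes from: each ring of a torus can be traversed in either direction, so the worst case per dimension is half the ring, and the three dimensions add. A small illustrative sketch:

```c
#include <stdio.h>

/* Shortest hop distance along one ring of a torus: you can travel
 * either way around, so it is at most size/2 hops. */
static int ring_dist(int a, int b, int size) {
    int d = a > b ? a - b : b - a;
    return d < size - d ? d : size - d;
}

int main(void) {
    const int X = 64, Y = 32, Z = 32;
    /* Worst case: the destination is half the ring away in every dimension. */
    int diameter = ring_dist(0, X / 2, X)
                 + ring_dist(0, Y / 2, Y)
                 + ring_dist(0, Z / 2, Z);
    printf("64x32x32 torus diameter = %d hops\n", diameter);  /* 32+16+16 = 64 */
    return 0;
}
```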

Collective network
Binary tree topology, static routing
Link bandwidth: 2.8 Gb/s
Maximum hardware latency: 5 us
Arithmetic and logic hardware can perform integer operations on the data in flight
–Efficient support for reduce, scan, global sum, and broadcast operations
–Floating-point operations can be done with 2 passes (one possible scheme is sketched below)
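The slides do not spell out the two-pass trick. One common scheme, shown here as an assumption rather than as BG/L's exact implementation, is an integer max-reduction over the exponents first, then an integer sum-reduction over mantissas aligned to that common exponent:

```c
#include <limits.h>
#include <math.h>
#include <stdio.h>

#define MBITS 48  /* mantissa bits kept; leaves headroom in 64-bit sums */

/* Two-pass floating-point sum built only from integer reductions.
 * On the real collective network each pass would be one hardware
 * reduction across all nodes; here each pass is just a local loop. */
double two_pass_sum(const double *v, int n) {
    /* Pass 1: integer max-reduction over the exponents. */
    int emax = INT_MIN;
    for (int i = 0; i < n; i++) {
        int e;
        frexp(v[i], &e);
        if (v[i] != 0.0 && e > emax) emax = e;
    }
    if (emax == INT_MIN) return 0.0;  /* all inputs were zero */

    /* Pass 2: integer sum-reduction over mantissas aligned to emax. */
    long long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (long long)ldexp(v[i], MBITS - emax);

    return ldexp((double)sum, emax - MBITS);
}

int main(void) {
    double v[] = { 1.5, 2.25, -0.75, 1e-3 };
    printf("%.6f\n", two_pass_sum(v, 4));  /* prints roughly 3.001 */
    return 0;
}
```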

Barrier network
Hardware support for global synchronization
1.5 us for a barrier across 64K nodes

IBM BlueGene/L summary
Optimized for cost/performance
–Limits the range of suitable applications
–Uses a low-power design: lower frequency, system-on-a-chip
–Great performance-per-watt metric
Scalability support
–Hardware support for global communication and barriers
–Low-latency, high-bandwidth networks

Case 2: InfiniBand architecture
–Specification (InfiniBand Architecture Specification release 1.2.1, January 2008/Oct. 2006) available from the InfiniBand Trade Association

InfiniBand architecture overview

–Components: links, channel adapters, switches, routers
–The specification allows an InfiniBand wide area network, but InfiniBand is mostly adopted as a system/storage area network
–Topology:
  Irregular
  Regular (e.g., fat tree)
–Link speed:
  Single data rate (SDR): 2.5 Gbps (1X), 10 Gbps (4X), and 30 Gbps (12X)
  Double data rate (DDR): 5 Gbps (1X), 20 Gbps (4X)
  Quad data rate (QDR): 40 Gbps (4X)

Layers: somewhat similar to TCP/IP
–Physical layer
–Link layer
  Error detection (CRC checksum)
  Flow control (credit based)
  Switching, virtual lanes (VLs), forwarding tables computed by the subnet manager
    Single-path, deterministic routing (not adaptive)
–Network layer: routing across subnets; not used in the cluster environment
–Transport layer
  Reliable/unreliable, connection/datagram
–Verbs: the interface between adapters and the OS/users

InfiniBand link layer
Packet format:
–Local Route Header (LRH): 8 bytes. Used for local routing by switches within an IBA subnet (an illustrative layout follows below)
–Global Route Header (GRH): 40 bytes. Used for routing between subnets
–Base Transport Header (BTH): 12 bytes, for the IBA transport
–Extended transport headers:
  Reliable Datagram Extended Transport Header (RDETH): 4 bytes, only for reliable datagrams
  Datagram Extended Transport Header (DETH): 8 bytes
  RDMA Extended Transport Header (RETH): 16 bytes
  Atomic, ACK, Atomic ACK, and Immediate Data extended transport headers: 4 bytes each, optimized for small packets
–Invariant CRC and variant CRC: CRCs over the fields that do not change in transit and those that do, respectively
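As a rough picture of the 8-byte LRH layout (field widths as we recall them from the IBA spec; the bitfield packing below is compiler-dependent, so real code packs these fields with explicit shifts and masks):

```c
#include <stdint.h>

/* Illustrative layout of the 8-byte Local Route Header. */
struct lrh {
    uint8_t  vl     : 4;   /* virtual lane */
    uint8_t  lver   : 4;   /* link version */
    uint8_t  sl     : 4;   /* service level */
    uint8_t  rsvd1  : 2;
    uint8_t  lnh    : 2;   /* link next header: GRH or BTH follows */
    uint16_t dlid;         /* destination LID: switches route on this */
    uint16_t rsvd2  : 5;
    uint16_t pktlen : 11;  /* packet length in 4-byte words */
    uint16_t slid;         /* source LID */
};
```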

Local Route Header:
–Switching is based on the destination port address (LID)
–Multipath switching is achieved by allocating multiple LIDs to one port (see the lookup sketch below)
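A switch's unicast forwarding is essentially a table lookup indexed by the DLID; with an LID Mask Control (LMC) of n bits, a port owns 2^n consecutive LIDs, and the subnet manager may route each one along a different path. A minimal sketch (the names and table layout are our own, not from the spec):

```c
#include <stdint.h>

#define LFT_SIZE 49152  /* size of the unicast LID space */

/* Linear forwarding table: one output port per destination LID,
 * programmed into the switch by the subnet manager. */
static uint8_t lft[LFT_SIZE];

/* With LMC = lmc, a base LID has its low lmc bits zero; ORing in a
 * path index selects one of up to 2^lmc routes to the same port. */
uint16_t path_lid(uint16_t base_lid, unsigned lmc, unsigned path) {
    return (uint16_t)(base_lid | (path & ((1u << lmc) - 1)));
}

/* Deterministic switching: one output port per DLID, no adaptivity. */
uint8_t forward(uint16_t dlid) {
    return lft[dlid];
}
```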

Subnet management
Initializes the network:
–Discover the subnet topology and topology changes, compute the paths, assign LIDs, distribute the routes, and configure devices
–Related devices and entities:
  Devices: channel adapters (CAs), host channel adapters (HCAs), switches, routers
  Subnet manager (SM): discovers, configures, activates, and manages the subnet
  Subnet management agent (SMA): present in every device; it generates and responds to control packets (subnet management packets, SMPs) and configures local components for subnet management
  The SM exchanges control packets with SMAs through the subnet management interface (SMI)

Subnet management phases:
–Topology discovery: send directed-route SMPs to every port and process the responses
–Path computation: compute valid paths between each pair of end nodes (a minimal sketch follows below)
–Path distribution: configure the forwarding tables
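A minimal sketch of the path-computation phase, assuming shortest-path (min-hop) routing, which common subnet managers such as OpenSM default to; the data structures are our own simplification:

```c
#include <stdint.h>
#include <string.h>

#define MAX_NODES 4096
#define MAX_PORTS 36

/* adj[u][p] = node reached through port p of node u, or -1 if the port
 * is unconnected; filled in during the topology-discovery phase. */
static int adj[MAX_NODES][MAX_PORTS];

/* BFS outward from one destination, recording at every other node which
 * output port lies on a shortest path toward it.  Running this once per
 * end node yields the forwarding tables that the SM then distributes. */
void compute_routes_to(int dst, uint8_t out_port[]) {
    int dist[MAX_NODES], queue[MAX_NODES], head = 0, tail = 0;
    memset(dist, -1, sizeof dist);   /* -1 means "not reached yet" */
    dist[dst] = 0;
    queue[tail++] = dst;
    while (head < tail) {
        int u = queue[head++];
        for (int p = 0; p < MAX_PORTS; p++) {
            int v = adj[u][p];
            if (v >= 0 && dist[v] < 0) {
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
                /* v's next hop toward dst is the port leading back to u
                 * (assume symmetric links and look it up). */
                for (int q = 0; q < MAX_PORTS; q++)
                    if (adj[v][q] == u) { out_port[v] = (uint8_t)q; break; }
            }
        }
    }
}
```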

Base Transport Header:

Verbs
–The OS and users access the adapter through verbs
–Communication mechanism: the Queue Pair (QP)
  Users queue up sets of instructions that the hardware executes
  Each QP is a pair of queues: one for send, one for receive
  Users post send requests to the send queue and receive requests to the receive queue
  Three types of send operations: SEND; RDMA (WRITE, READ, ATOMIC); MEMORY BINDING
  One receive operation (matching SEND)

To communicate:
–Make system calls to set everything up (open a QP, bind the QP to a port, bind completion queues, connect the local QP to the remote QP, register memory, etc.)
–Post send/receive requests as user-level instructions
–Check for completion
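With the libibverbs API, the data-path half of this looks roughly like the sketch below. It deliberately omits the setup half (ibv_open_device, ibv_alloc_pd, ibv_create_cq, ibv_create_qp, ibv_reg_mr, QP connection) and all error handling:

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Post one SEND of a registered buffer and spin until it completes.
 * qp, cq, and mr are assumed to come from the setup phase. */
int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, void *buf, uint32_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,          /* proves the memory is registered */
    };
    struct ibv_send_wr wr, *bad_wr;
    memset(&wr, 0, sizeof wr);
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a completion entry */

    /* Posting is a user-level operation: no system call, no OS involvement. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Check completion by polling the completion queue. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;  /* busy-wait; real code may block on a completion channel */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```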

InfiniBand has an almost perfect software/network interface:
–The network subsystem realizes most user-level functionality
  The network supports in-order delivery and fault tolerance
  Buffer management is pushed out to the user
–OS bypass: user-level access to the network interface
  A few machine instructions accomplish a transmission without involving the OS