Interconnection networks: the network interface and a case study
Network interface design issues
The networking requirements from the user's perspective:
– In-order message delivery
– Reliable delivery (error control, flow control)
– Deadlock freedom
Typical network hardware features:
– Arbitrary delivery order (adaptive/multipath routing)
– Finite buffering
– Limited fault handling
How and where should we bridge the gap?
– Network hardware? Network systems? Or a combined hardware/systems/software approach?
The Internet approach
How does the Internet realize these functions?
– No deadlock issue.
– Reliability, flow control, and in-order delivery are handled at the TCP layer.
– The network layer (IP) provides best-effort service; IP is implemented in software as well.
Drawbacks:
– Too many layers of software.
– Users must go through the OS to access the communication hardware (system calls can cause context switches); see the sketch below.
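To make this concrete, here is a minimal TCP sender sketch in C (the host and port are hypothetical, and error handling is mostly elided): reliability, in-order delivery, and flow control all come for free from the TCP layer, but every send() is a system call into the OS.

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Minimal TCP sender sketch; the host and port are hypothetical.
 * TCP provides reliability, in-order delivery, and flow control,
 * but each send() traps into the OS (a system call). */
int send_message(const char *msg, size_t len) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                      /* hypothetical port */
    inet_pton(AF_INET, "192.0.2.1", &addr.sin_addr);  /* hypothetical host */
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    ssize_t n = send(fd, msg, len, 0);                /* system call */
    close(fd);
    return n < 0 ? -1 : 0;
}
```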
The approach in HPC networks
Where should these functions be realized?
– In high-performance networking, most functionality at and below the network layer is done in hardware (or close to the hardware), which provides the APIs for network transactions.
– If there is a mismatch between what the network provides and what users want, a software messaging layer is created to bridge the gap.
Messaging layer
The bridge between the hardware functionality and the user's communication requirements (a sketch of one such mechanism follows):
– Typical network hardware features: arbitrary delivery order (adaptive/multipath routing), finite buffering, limited fault handling
– Typical user communication requirements: in-order delivery, end-to-end flow control, reliable transmission
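One classic job of the messaging layer is re-establishing in-order delivery on top of hardware that may deliver packets out of order under adaptive routing. Below is a hedged C sketch of a receiver-side reorder buffer; the structure, names, and fixed window size are assumptions for illustration, not the interface of any real messaging layer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define WINDOW 64          /* finite buffering: assumed reorder-window size */
#define MAX_PKT 256

/* Hypothetical reorder buffer: the sender side of the messaging layer
 * stamps packets with sequence numbers; the receiver side delivers
 * them to the application strictly in order. */
struct reorder_buf {
    unsigned next_seq;              /* next sequence number to deliver */
    bool     present[WINDOW];       /* slot occupied?                  */
    char     data[WINDOW][MAX_PKT]; /* buffered out-of-order packets   */
    size_t   len[WINDOW];
};

/* Called for each packet the hardware hands up, in arbitrary order.
 * deliver() is the (assumed) upcall into the application. */
void on_packet(struct reorder_buf *rb, unsigned seq,
               const char *pkt, size_t len,
               void (*deliver)(const char *, size_t)) {
    if (seq - rb->next_seq >= WINDOW)
        return;                     /* old duplicate or outside window: drop */
    unsigned slot = seq % WINDOW;
    memcpy(rb->data[slot], pkt, len);
    rb->len[slot] = len;
    rb->present[slot] = true;

    /* Drain every packet that is now in order. */
    while (rb->present[rb->next_seq % WINDOW]) {
        unsigned s = rb->next_seq % WINDOW;
        deliver(rb->data[s], rb->len[s]);
        rb->present[s] = false;
        rb->next_seq++;
    }
}
```

The finite window also shows why end-to-end flow control is needed: the sender must keep at most WINDOW packets outstanding, or the receiver has nowhere to buffer them.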
Communication cost
Communication cost = hardware cost + software cost (messaging-layer cost)
– Hardware message time: msize / bandwidth
– Software time: buffer management, end-to-end flow control, running protocols
– Which one dominates? It depends on how much the software has to do (see the model below).
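A back-of-the-envelope version of this cost model in C; the overhead and bandwidth values are illustrative assumptions, not measurements.

```c
#include <stdio.h>

/* Simple communication-cost sketch: total time = software overhead
 * (messaging layer) + hardware transfer time (msize / bandwidth).
 * With bandwidth in MB/s (= bytes/us), msize / bandwidth is in us.
 * The parameter values are illustrative assumptions only. */
int main(void) {
    double sw_overhead_us = 20.0;    /* buffer mgmt, flow control, protocol */
    double bandwidth_MBps = 1200.0;  /* assumed link bandwidth              */
    for (double msize = 64; msize <= 1 << 20; msize *= 16) {
        double hw_us = msize / bandwidth_MBps;
        double total = sw_overhead_us + hw_us;
        printf("%8.0f B: hw %8.2f us, total %8.2f us (%4.1f%% software)\n",
               msize, hw_us, total, 100.0 * sw_overhead_us / total);
    }
    return 0;
}
```

For small messages the software term dominates; only for large messages does the hardware transfer time take over.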
Network software/hardware interaction: a case study
A case study of the communication performance issues on the CM-5:
– V. Karamcheti and A. A. Chien, "Software Overhead in Messaging Layers: Where Does the Time Go?" ACM ASPLOS-VI, 1994.
What do we see in the study?
The mismatch between user requirements and network functionality can introduce significant software overhead (50%-70%).
Implications:
– Should we focus on hardware, software, or software/hardware co-design?
– Improving routing performance may increase software cost: adaptive routing introduces out-of-order packets.
– Exposing low-level network features directly to applications is problematic.
Summary from the study
Designing a communication system requires a holistic understanding:
– Focusing on network hardware alone may not be sufficient; software overhead can be much larger than routing time.
– It would be ideal for the network to directly provide high-level services. Newer generations of interconnect hardware try to achieve this.
Case studies
– IBM Blue Gene/L system
– InfiniBand
Interconnect family share for the 06/2011 Top 500 supercomputers

Interconnect Family    Count    Share %
Myrinet                    4      0.80%
Quadrics                   1      0.20%
Gigabit Ethernet
Infiniband
Mixed                      1      0.20%
NUMAlink                   2      0.40%
SP Switch                  1      0.20%
Proprietary
Fat Tree                   1      0.20%
Custom
Totals                   500       100%
Overview of the IBM Blue Gene/L system architecture
– Design objectives
– Hardware overview: system architecture, node architecture, interconnect architecture
Highlights
A 64K-node, highly integrated supercomputer based on system-on-a-chip technology:
– Two ASICs: Blue Gene/L Compute (BLC) and Blue Gene/L Link (BLL)
– Distributed-memory, massively parallel processing (MPP) architecture
– Uses the message-passing programming model (MPI); see the sketch below
– 360 Tflops peak performance
– Optimized for cost/performance
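Since BG/L applications are written to the message-passing model, here is a minimal sketch of the style of code the machine runs; the calls are standard MPI, nothing BG/L-specific.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing sketch in standard MPI: rank 0 sends an
 * integer to rank 1. On BG/L, point-to-point traffic like this is
 * carried by the 3-D torus network described later. */
int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```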
Design objectives
Objective 1: a 360-Tflops supercomputer
– For comparison, the Earth Simulator (Japan, fastest supercomputer from 2002 to 2004) delivered 35.86 Tflops on Linpack.
Objective 2: power efficiency
– Performance/rack = performance/watt * watt/rack
– Watt/rack is roughly constant at around 20 kW, so performance/watt determines performance/rack.
Power efficiency:
– 360 Tflops would require about 20 megawatts with conventional processors.
– A low-power processor design is needed (2-10 times better power efficiency); the arithmetic is worked below.
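Working the rack arithmetic through in C: the watt/rack figure comes from the slide, while the flops-per-watt value is an assumption chosen only to illustrate the relation.

```c
#include <stdio.h>

/* Performance/rack = performance/watt * watt/rack, with watt/rack
 * roughly fixed near 20 kW. The flops-per-watt value is an assumed,
 * illustrative number, not a published BG/L specification. */
int main(void) {
    double watts_per_rack = 20e3;     /* ~constant, from the slide */
    double flops_per_watt = 0.28e9;   /* illustrative assumption   */
    double perf_per_rack  = flops_per_watt * watts_per_rack;
    double racks          = 360e12 / perf_per_rack;
    printf("perf/rack = %.1f Tflops, racks for 360 Tflops = %.0f\n",
           perf_per_rack / 1e12, racks);   /* 5.6 Tflops, ~64 racks */
    return 0;
}
```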
Design objectives (continued)
Objective 3: extreme scalability
– Optimizing for cost/performance with low-power, less powerful processors means many processors are needed: up to 64K nodes.
– Interconnect scalability
– Reliability, availability, and serviceability
– Application scalability
Blue Gene/L system components
Blue Gene/L Compute ASIC
Two PowerPC 440 cores with floating-point enhancements:
– 700 MHz
– Everything of a typical superscalar processor: pipelined microarchitecture with dual instruction fetch, decode, out-of-order issue, out-of-order dispatch, out-of-order execution, out-of-order completion, etc.
– About 1 W each through extensive power management
Memory system on a BG/L node
BG/L only supports the distributed-memory paradigm, so there is no need for efficient cache-coherence support on each node:
– Coherence is enforced by software if needed.
The two cores operate in one of two modes:
– Communication coprocessor mode: coherence is needed and is managed in system-level libraries.
– Virtual node mode: memory is physically partitioned (not shared).
Blue Gene/L networks
Five networks:
– 100 Mbps Ethernet control network for diagnostics, debugging, and other management tasks
– 1000 Mbps Ethernet for I/O
– Three high-bandwidth, low-latency networks for data transmission and synchronization:
  3-D torus network for point-to-point communication
  Collective network for global operations
  Barrier network
All network logic is integrated in the BG/L node ASIC:
– Memory-mapped interfaces from user space (the idea is sketched below)
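The point of a memory-mapped, user-space interface is that sending a packet is just a store instruction, with no system call on the critical path. The sketch below uses a plain array to stand in for the mapped device FIFO; every name and the descriptor layout are invented for illustration and do not match the real BG/L hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for an injection FIFO mapped into user space; on real
 * hardware this would be a mapped device address, not an array. */
static volatile uint64_t inj_fifo[2];

/* "Sending" is two ordinary stores: no OS involvement at all.
 * The header/payload format here is hypothetical. */
static inline void inject_packet(uint64_t dest_xyz, uint64_t payload) {
    inj_fifo[0] = dest_xyz;   /* header word: destination torus coords */
    inj_fifo[1] = payload;    /* one payload word                      */
}

int main(void) {
    inject_packet(0x00102030, 0xdeadbeefULL);   /* hypothetical values */
    printf("header %llx payload %llx\n",
           (unsigned long long)inj_fifo[0],
           (unsigned long long)inj_fifo[1]);
    return 0;
}
```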
3-D torus network
– Supports point-to-point communication
– Link bandwidth 1.4 Gb/s; 6 bidirectional links per node (1.2 GB/s)
– 64x32x32 torus: diameter = 64 hops, worst-case hardware latency 6.4 us (the diameter arithmetic is checked below)
– Cut-through routing
– Adaptive routing
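The 64-hop diameter follows from summing half of each dimension, since a torus can route around either way; a few lines of C to check the arithmetic:

```c
#include <stdio.h>

/* In a torus, the farthest node along one dimension of size d is
 * d/2 hops away (wrap-around links cut the distance in half), so
 * the diameter is the sum of d/2 over all dimensions. */
int main(void) {
    int dims[3] = {64, 32, 32};
    int diameter = 0;
    for (int i = 0; i < 3; i++)
        diameter += dims[i] / 2;
    printf("64x32x32 torus diameter = %d hops\n", diameter);  /* 64 */
    return 0;
}
```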
Collective network
– Binary tree topology, static routing
– Link bandwidth: 2.8 Gb/s
– Maximum hardware latency: 5 us
– With arithmetic and logical hardware, it can perform integer operations on the data:
  Efficient support for reduce, scan, global-sum, and broadcast operations
  Floating-point operations can be done with 2 passes
Barrier network
– Hardware support for global synchronization
– 1.5 us for a barrier across 64K nodes
From the application, these operations are reached through ordinary MPI collectives, as sketched below.
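A minimal MPI sketch of the operations the collective and barrier networks accelerate: a global integer sum and a barrier. The calls are standard MPI; whether a given MPI implementation offloads them to the BG/L hardware is up to that implementation.

```c
#include <mpi.h>
#include <stdio.h>

/* Global integer sum and a barrier: the kinds of operations BG/L's
 * collective and barrier networks support in hardware. */
int main(int argc, char **argv) {
    int rank, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1;                          /* each rank contributes */
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);               /* global synchronization */
    if (rank == 0)
        printf("global sum = %d\n", global);
    MPI_Finalize();
    return 0;
}
```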
IBM Blue Gene/L summary
Optimized for cost/performance, at the cost of limiting the application domain:
– Low-power design: lower frequency, system-on-a-chip
– Great performance-per-watt metric
Scalability support:
– Hardware support for global communication and barriers
– Low-latency, high-bandwidth support