HPS Switch and Adapter Architecture, Design & Performance
Rama K Govindaraju, IBM Systems & Technology Group
HiPC 2004, Bangalore, India, December 19-22, 2004
Copyright by IBM


Team
Architecture – Peter Hochschild, Don Grice, Kevin Gildea, Rama Govindaraju
Hardware – Carl A Bender, Jay Herring, Piyush Chaudhary, Steven Martin, Jason Goscinski, John Houston, …
Software – Chulho Kim, Robert Blackmore, Rajeev Sivaram, Hanhong Xue, …
And many others contributed to this effort.

Outline
– What is HPS?
– Example HPS customers
– Historical interconnect performance
– HPS switch architecture
– HPS adapter architecture
– HPS software architecture
– Transport modes
– HPS performance
– Lessons learned and future work

What is HPS?
HPS (High Performance Switch)
– 4th generation switch and adapter to interconnect IBM's Power processor based nodes (Power 4 and Power 5)
– To be used in many of the world's fastest supercomputers; 20 of the top 100 today use HPS
– Addresses requirements of HPC labs, DOE, and others: weather forecasting, the petroleum sector, the automotive and aerospace sectors, NSA and DOD
– Core infrastructure for the 100 TF ASCI Purple system to be delivered in June 2005

Example HPS Customers
– More than 30 and growing
– Several over 1000 CPUs
– Total: over 200 TF

Historical Interconnect Performance
IBM-developed switch interconnects and adapters:

Adapter:              TB2        TB3        TBMX         Colony       HPS
Switch:               HPS        TBS        TBS          SP-Switch2   HPS
Processor:            Power 2    Power 2    Power PC/3   Power 3      Power 4
Peak link bandwidth:  40 MB/s    150 MB/s   150 MB/s     500 MB/s     2 GB/s
MPI bandwidth:        35 MB/s    110 MB/s   135 MB/s     375 MB/s     1.8-14 GB/s
MPI latency:          40 us      24 us      21 us        17 us        <4.2 us
Links/node (server):  1          1          1            1, 2         2, 4, 6, 8

HPS Switch Fabric
– 4K end points
– 59 ns latency
– 2 GB/s bandwidth per link per direction

HPS Adapter Microcode Model (diagram)

HPS Software Architecture (diagram)
Components shown: APPLICATION; IBM's MPI; LAPI; Parallel ESSL; ESSL; GPFS; VSD; SOCKETS; TCP; UDP; IP; IF_LS; HAL; LL; CSM; device driver (DD); hypervisor (HYP); HPS adapter; HPS switch fabric; HMC; FNM; service processor; user space / kernel space boundary.

FIFO versus RDMA Models (diagram)
Components shown: user space / kernel space; MPI; LAPI; HAL; Federation adapter interface layer; user buffer; HAL buffers; IP interface; UDP; TCP; sockets. Data paths shown: FIFO copy, FIFO DMA, and RDMA.

Supported Communication Modes
FIFO mode (see the sketch below)
– Message chopped into 2 KB packet chunks on the host and copied by the CPU
– Memory bus crossings depend on caching; at least one I/O bus crossing
RDMA enablement
– No slave-side protocol
– CPU offload
– Enhanced programming model
– One I/O bus crossing
(Diagram: user buffer to network either via CPU load/store into a FIFO followed by adapter DMA, or directly via RDMA.)
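To make the FIFO-mode description above concrete, here is a minimal host-side sketch of chopping a message into 2 KB packets and copying each into a staging FIFO slot with the CPU. It is only an illustration of the copy cost described on the slide; the packet-size constant and names such as fifo_slot and fifo_mode_send are assumptions, not the HPS microcode or HAL interface.

```c
/* Illustrative model of the FIFO-mode send path: the host CPU chops a
 * message into 2 KB packets and copies each into a staging FIFO slot.
 * Names and sizes are assumptions for illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PACKET_SIZE 2048          /* 2 KB packet chunks, as on the slide */

/* Stand-in for one send-FIFO slot that the adapter would later DMA out. */
static unsigned char fifo_slot[PACKET_SIZE];

static void fifo_mode_send(const unsigned char *msg, size_t len)
{
    size_t offset = 0;
    while (offset < len) {
        size_t chunk = (len - offset < PACKET_SIZE) ? len - offset : PACKET_SIZE;
        /* CPU load/store copy from the user buffer into the FIFO slot;
         * on the real adapter the slot would then be DMAed onto the link. */
        memcpy(fifo_slot, msg + offset, chunk);
        offset += chunk;
    }
    printf("copied %zu bytes as %zu packets\n",
           len, (len + PACKET_SIZE - 1) / PACKET_SIZE);
}

int main(void)
{
    size_t len = 100 * 1024;                  /* a 100 KB message */
    unsigned char *msg = malloc(len);
    memset(msg, 0xab, len);
    fifo_mode_send(msg, len);                 /* 50 packets of 2 KB */
    free(msg);
    return 0;
}
```

In FIFO mode every byte crosses the memory bus in this copy before the adapter moves it, which is exactly the overhead the RDMA path avoids.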

RDMA Value Proposition
Possible overlap of computation and communication
– Fragmentation/reassembly offloaded to the adapter
– Minimizes packet arrival interrupts
– Requires the application to be written to take advantage of overlap
One-sided programming model (see the example below)
Zero-copy transport and reduced memory subsystem load
Striping advantage
KEY DIFFERENTIATOR: reliable RDMA protocol over an unreliable datagram transport
– Allows striping across multiple paths
– Tolerates out-of-order arrival
– Reduces hot spotting and contention
Cons
– Pinned memory usage
– Resource management and fairness issues
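The one-sided programming model mentioned above is the style that standard MPI exposes through RMA windows. The deck shows no code, so the fragment below is a generic MPI-2 one-sided example (MPI_Win_create/MPI_Put), not an HPS- or LAPI-specific API; it simply illustrates a transfer that completes without a matching receive on the target.

```c
/* Generic MPI one-sided (RMA) example: rank 0 puts data directly into a
 * window exposed by rank 1, without rank 1 issuing a matching receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024] = {0};
    MPI_Win win;
    /* Every rank exposes buf as a window; an RDMA-capable transport can
     * service remote puts without involving the target CPU. */
    MPI_Win_create(buf, sizeof(buf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                    /* open an access epoch */
    if (rank == 0) {
        double payload[1024];
        for (int i = 0; i < 1024; i++) payload[i] = i;
        /* One-sided transfer into rank 1's window. */
        MPI_Put(payload, 1024, MPI_DOUBLE, 1, 0, 1024, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);                    /* complete the epoch */

    if (rank == 1)
        printf("buf[1023] = %g\n", buf[1023]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

Run with two ranks (for example, mpirun -np 2 ./a.out). On an RDMA-capable transport the put can progress without the target CPU copying data.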

Federation Performance Summary
Latency (Power 4, 1.9 GHz, HPS):
– MPI latency: 4.34 us
– Interrupt latency: adds 10 us
– 8-task latency: adds 1 us
Bandwidth (Power 4, 1.9 GHz, HPS):
– FIFO mode: unidirectional ~1.8 GB/s, bidirectional 2.1 GB/s
– RDMA mode: unidirectional ~1.8 GB/s, bidirectional ~3.0 GB/s
– Linear striping performance up to 8 links: unidirectional 14 GB/s, bidirectional 24 GB/s
These are preliminary measurements.
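As a rough consistency check, the 8-link striping figures above scale almost linearly from the quoted single-link RDMA rates:

```latex
\[
8 \times 1.8\,\mathrm{GB/s} \approx 14.4\,\mathrm{GB/s}
  \quad (\text{reported unidirectional striping: } 14\,\mathrm{GB/s})
\]
\[
8 \times 3.0\,\mathrm{GB/s} = 24\,\mathrm{GB/s}
  \quad (\text{reported bidirectional striping: } 24\,\mathrm{GB/s})
\]
```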

HPS: MPI Latency

Machine type        Latency measurement
1.9 GHz             4.34 us
1.7 GHz             …
1.7 GHz             …
1.5 GHz             …
1.3 GHz, p690       5.5 us

All measurements were made using IBM's thread-safe MPI libraries.
8-task latency adds approximately 1 additional microsecond.
Interrupt latency adds approximately 10 microseconds.
All measurements are preliminary.
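The deck does not show the measurement harness behind these numbers; a conventional way to measure small-message MPI latency is a two-rank ping-pong, with latency taken as half the average round trip. A minimal sketch of that kind of test (assumed, not IBM's actual benchmark):

```c
/* Minimal MPI ping-pong latency sketch (not IBM's benchmark): ranks 0 and
 * 1 bounce a 1-byte message; latency is half the average round trip. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char byte = 0;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* half round trip, converted to microseconds */
        printf("latency: %.2f us\n", (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```

The roughly 10 us of interrupt latency quoted above would come on top of a number measured this way when completions are interrupt-driven.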

Unidirectional Bandwidth Peak

Machine type        Peak unidirectional bandwidth
1.9 GHz             ~1.8 GB/s
1.7 GHz             …
1.7 GHz             …
1.5 GHz             …
1.3 GHz             …

All measurements are preliminary.

Unidirectional Bandwidth Profile (plot: bandwidth in MB/s versus message size in bytes)
– p655, 1.7 GHz based system
– M1/2 = 32K, M3/4 = 128K
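M1/2 and M3/4 in profiles like this are presumably the message sizes at which half and three-quarters of the peak bandwidth are reached. Below is a hedged sketch of the kind of message-size sweep that produces such a profile; the window depth and size range are arbitrary choices for illustration, not the benchmark actually used for the slide.

```c
/* Sketch of a unidirectional bandwidth sweep: rank 0 streams a window of
 * messages of each size to rank 1 and reports MB/s per size. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 16                      /* messages in flight per timing */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(1 << 22);       /* up to 4 MB messages */
    MPI_Request req[WINDOW];

    for (size_t size = 1024; size <= (1 << 22); size *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        if (rank == 0) {
            for (int i = 0; i < WINDOW; i++)
                MPI_Isend(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            /* wait for the receiver's ack so the timing covers delivery */
            MPI_Recv(buf, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            for (int i = 0; i < WINDOW; i++)
                MPI_Irecv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(buf, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);   /* ack */
        }
        double secs = MPI_Wtime() - t0;
        if (rank == 0)
            printf("%8zu bytes: %.1f MB/s\n", size,
                   (double)size * WINDOW / secs / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```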

Bidirectional Bandwidth Profile (plot: bandwidth in MB/s versus message size in bytes)
– p655, 1.7 GHz based system
– M1/2 = 16K, M3/4 = 64K

Striping Options (diagram comparing communication time per thread/task T1, T2, T3)
a) Asynchronous model
b) Synchronous model
c) Aggregate communication thread model

Striping Models (diagram: MPI layer over LAPI layer over HAL over the adapters, shown for both models)
– Model 1: multiple threads doing copies
– Model 2: single thread with pipelined RDMA
The second approach gives:
– A more elegant failover model
– Fewer synchronization issues and less CPU contention, via RDMA
(An application-level analogy follows below.)
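HPS does its striping below MPI, in LAPI/HAL with pipelined RDMA across the adapter links, but the idea can be illustrated at the application level: split one large buffer into chunks and keep several nonblocking transfers in flight at once, letting them complete out of order. The sketch below is only that analogy (NSTRIPES and the message size are arbitrary), not the HPS implementation.

```c
/* Application-level analogy of striping: a large buffer is split into
 * NSTRIPES chunks sent as concurrent nonblocking messages, the way
 * LAPI/HAL stripes a single message across several adapter links. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NSTRIPES 4                   /* stand-in for the number of links */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t total = 8 * 1024 * 1024;          /* 8 MB message */
    size_t chunk = total / NSTRIPES;
    char *buf = malloc(total);
    MPI_Request req[NSTRIPES];

    if (rank == 0) {
        for (int s = 0; s < NSTRIPES; s++)   /* one in-flight send per stripe */
            MPI_Isend(buf + s * chunk, chunk, MPI_CHAR, 1, s,
                      MPI_COMM_WORLD, &req[s]);
        MPI_Waitall(NSTRIPES, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        for (int s = 0; s < NSTRIPES; s++)   /* stripes may complete out of order */
            MPI_Irecv(buf + s * chunk, chunk, MPI_CHAR, 0, s,
                      MPI_COMM_WORLD, &req[s]);
        MPI_Waitall(NSTRIPES, req, MPI_STATUSES_IGNORE);
        printf("received %zu bytes in %d stripes\n", total, NSTRIPES);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Tolerating out-of-order completion of the stripes is what the reliable-RDMA-over-unreliable-datagram design on the earlier slide makes possible at the transport level.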

RDMA Unidirectional Bandwidth (plot)

RDMA Bidirectional Bandwidth (plot)

How Can Users Exploit RDMA?
Overlap computation and communication (see the sketch below)
– Use non-blocking calls
– Reuse communication buffers if possible
– User-exposed RDMA in 11/05
Minimize interrupts for large transfers
Reduce contention for memory
Better raw bandwidth for messages over 80 KB
Possibility of overlapping collectives better (via striping)
IP transport much more efficient (translates to improved GPFS performance)
Select striping when sending large messages
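The overlap bullet above is the standard nonblocking pattern: start the transfer, compute on data the transfer does not touch, then wait for completion. On a transport with CPU offload and RDMA, the message can progress during the compute phase. A generic sketch (the compute loop is just a stand-in for real work):

```c
/* Generic compute/communication overlap: start a nonblocking send/receive,
 * do unrelated computation, then wait. With RDMA offload the transfer can
 * progress while the CPU computes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *comm_buf = malloc(N * sizeof(double));
    double acc = 0.0;
    MPI_Request req;

    if (rank == 0) {
        for (int i = 0; i < N; i++) comm_buf[i] = i;
        MPI_Isend(comm_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(comm_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* Computation that does not touch comm_buf, so it can overlap
     * with the in-flight transfer. */
    for (int i = 0; i < 5000000; i++)
        acc += (double)i * 1e-9;

    if (rank <= 1)
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the transfer */
    if (rank == 1)
        printf("acc=%.3f, comm_buf[N-1]=%.0f\n", acc, comm_buf[N - 1]);

    free(comm_buf);
    MPI_Finalize();
    return 0;
}
```

Reusing comm_buf across iterations, as the slide suggests, also keeps the pinned-memory registration cost of RDMA amortized over many transfers.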

Future Work
– Enable HPS for Power 5 based nodes
– Exploit SMT in the Power 5 processor for FIFO mode
– Further attack MPI latency
– Use RDMA to improve MPI collectives performance
– Further exploit IP over RDMA for parallel file systems (GPFS)
– Take lessons learned into the PERCS project