1 Using HPS Switch on Bassi
Jonathan Carter, User Services Group Lead
NERSC User Group Meeting, June 12, 2006

2 IBM Switch Evolution

3
Year  Name                 Peak BW                                   Latency  Processor
1996  SP Switch            300 MB/s per node (2x150 MB/s channel)    us       Power2/Power3
2000  SP Switch2 (Colony)  2 GB/s per node (2x500 MB/s per port)     ~17 us   Power3/Power4
2003  HPS (Federation)     2 GB/s per port                           5-14 us  Power4/Power5

4 HPS Switch Configuration

5 Bassi Switch Configuration
(node layout diagram: nodes B0101 through B1212 arranged in a grid by frame and position)

6 IBM Software
Parallel Environment (PE 4.2.2), which contains poe and MPI, remains unchanged
Parallel System Support Package (PSSP 3.5.0), which contains LAPI, has been absorbed into the Reliable Scalable Cluster Technology (RSCT 2.4.2) software stack

7 IBM Software
MPI
–Uses LAPI as a reliable transport layer
–Uses threads, not signals, for asynchronous activities
Binary compatible
New performance characteristics
–Eager
–Bulk transfer
–Collectives

8 IBM Software Stack
(stack diagram: applications and libraries ESSL, PESSL, GPFS, and Sockets at the top; MPI, VSD, TCP, and UDP beneath them; LAPI, IF_LS, and IP below those; HAL, the SMA3+ adapter, and the HPS switch at the bottom)

9 Communication Modes
FIFO mode
–Chopped into 2KB chunks on the host, copied by the CPU
Remote Direct Memory Access (RDMA)
–CPU offload
–One I/O bus crossing
(diagram: data path from user buffer to adapter, either copied by the CPU through the FIFO or moved directly by DMA for RDMA)

10 RDMA (Bulk transfer)
Overlap of communication and computation possible
–Asynchronous-messaging applications
–One-sided communications
Reduce CPU work
–Offload fragmentation and reassembly
–Minimize packet arrival interrupts
Reduce memory subsystem load
–Zero-copy transport
Striping across adapters
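
To make the overlap point concrete, here is a minimal sketch (not from the slides) of the pattern that benefits from bulk transfer: post nonblocking sends and receives of a large buffer, compute, then wait. The file name, buffer size, and compute_step() routine are my own placeholders; compile with the usual IBM MPI compiler wrapper (e.g. mpcc_r).

/* overlap.c -- sketch of communication/computation overlap with
 * nonblocking MPI.  Large messages (above MP_BULK_MIN_MSG_SIZE) can
 * take the RDMA path when MP_USE_BULK_XFER=yes, so the adapter moves
 * the data while the CPU runs compute_step().  Sizes are arbitrary. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)              /* 1M doubles, ~8 MB: well above the bulk threshold */

static void compute_step(double *w, int n)    /* placeholder for local work */
{
    int i;
    for (i = 0; i < n; i++)
        w[i] = w[i] * 1.000001 + 1.0;
}

int main(int argc, char **argv)
{
    int rank, size, left, right, i;
    MPI_Request req[2];
    MPI_Status  st[2];
    double *sendbuf = malloc(N * sizeof(double));
    double *recvbuf = malloc(N * sizeof(double));
    double *work    = malloc(N * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    right = (rank + 1) % size;
    left  = (rank + size - 1) % size;

    for (i = 0; i < N; i++) { sendbuf[i] = rank; work[i] = i; }

    /* Post the exchange first ... */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);

    /* ... then do local work while the transfer proceeds ... */
    compute_step(work, N);

    /* ... and only then wait for the messages. */
    MPI_Waitall(2, req, st);

    MPI_Finalize();
    free(sendbuf); free(recvbuf); free(work);
    return 0;
}

The same code runs without RDMA, but then fragmentation and copying happen on the host CPU inside the MPI calls; the threaded library plus RDMA offload are what make genuine overlap likely here.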

11 RDMA vs. Packet

12 MPI Transfer Protocols
Eager: send data immediately; store in remote buffer
–No synchronization
–Only one message sent
–Uses memory for buffering (less for the application)
Rendezvous: send message header; wait for recv to be posted; send data
–No data copy may be required
–No memory required for buffering (more for the application)
–More messages required
–Synchronization (standard send blocks until recv posted)
(diagram: P0/P1 exchange; eager shown as data + ack, rendezvous as req, ack, data, ack)
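
One practical consequence of the two protocols, shown in the sketch below (the file name, default message size, and two-task assumption are mine, not from the slides): a standard send below the eager limit usually returns once the data is buffered at the receiver, while above the limit it waits for the matching receive.

/* protocols.c -- sketch of eager vs. rendezvous behaviour.  Both
 * ranks send before they receive.  For a small message the eager
 * protocol buffers the data at the receiver and both sends return;
 * for a message above MP_EAGER_LIMIT the standard send follows the
 * rendezvous protocol and waits for the receive, so the exchange
 * can deadlock.  Run with exactly two tasks; the message size in
 * bytes is an optional command-line argument. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank, other;
    int n = (argc > 1) ? atoi(argv[1]) : 256;   /* default: small, eager */
    char *sbuf = malloc(n);
    char *rbuf = malloc(n);
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                           /* assumes two tasks */
    memset(sbuf, rank, n);

    /* Send first, receive second -- on both ranks. */
    MPI_Send(sbuf, n, MPI_BYTE, other, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, n, MPI_BYTE, other, 0, MPI_COMM_WORLD, &st);

    if (rank == 0)
        printf("exchange of %d bytes completed\n", n);

    MPI_Finalize();
    free(sbuf);
    free(rbuf);
    return 0;
}

The portable fix is MPI_Sendrecv or nonblocking calls; relying on eager buffering ties correctness to MP_EAGER_LIMIT and MP_BUFFER_MEM.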

13 Eager vs. Rendezvous

14 Latency
System     Intra (us)  Inter (us)
Seaborg
Jacquard
Bassi      1.1         4.5
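
Numbers like these typically come from a zero-byte ping-pong; below is a minimal sketch of such a micro-benchmark (the file name and iteration count are my choices, and this is not necessarily the benchmark used for the table).

/* pingpong.c -- sketch of a latency micro-benchmark: time many
 * zero-byte round trips between tasks 0 and 1 and report half the
 * average round-trip time in microseconds. */
#include <mpi.h>
#include <stdio.h>

#define REPS 10000

int main(int argc, char **argv)
{
    int rank, i;
    char dummy;
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&dummy, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(&dummy, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("half round-trip latency: %.2f us\n",
               0.5e6 * (t1 - t0) / REPS);

    MPI_Finalize();
    return 0;
}

Whether this reports the intra- or inter-node figure depends on where the two tasks are placed (same node vs. two nodes).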

15 Internode Comparison

16 Internode Comparison

17 Intranode Comparison

18 Intranode Comparison

19 Packed-node Comparison

20 Packed-node Comparison

21 POE environment variables
MP_SINGLE_THREAD
–Set to yes for a slight latency decrease; set to no for MPI I/O, OpenMP, etc.
MP_USE_BULK_XFER
–Default is yes
MP_BULK_MIN_MSG_SIZE
–Default is ~150KB
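
These variables are read from the environment when poe starts the job. As an illustration only (not an official tool; the file name is mine), a small sketch a batch job can run so its log records what the runtime actually saw:

/* checkenv.c -- sketch: print the POE tuning variables from this
 * slide as seen by task 0.  Any variable not set in the environment
 * prints as (unset). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

static void show(const char *name)
{
    const char *val = getenv(name);
    printf("  %-22s = %s\n", name, val ? val : "(unset)");
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("POE settings for this run:\n");
        show("MP_SINGLE_THREAD");
        show("MP_USE_BULK_XFER");
        show("MP_BULK_MIN_MSG_SIZE");
    }
    MPI_Finalize();
    return 0;
}

The buffering variables on the next slide can be checked the same way.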

22 POE environment variables
MP_BUFFER_MEM
–Default is 64MB
MP_EAGER_LIMIT
–Varies from 32KB to 1KB depending on job size; can be increased in conjunction with MP_BUFFER_MEM
LAPI parameters for apps with many blocking sends of small messages:
–MP_REXMIT_BUF_SIZE: default is 128 bytes
–MP_REXMIT_BUF_CNT: default is 128 buffers
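
The "many blocking sends of small messages" pattern those settings target looks roughly like the sketch below (the file name, record size, and message count are hypothetical). Each send stays under the eager limit, so the eager limit, buffer memory, and retransmit buffer settings all come into play.

/* smallsends.c -- sketch of the pattern the retransmit-buffer
 * settings target: task 0 streams many small records to task 1 with
 * standard blocking sends, all well under the eager limit. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NMSGS  100000
#define RECLEN 64                    /* bytes per record (hypothetical) */

int main(int argc, char **argv)
{
    int rank, i;
    char rec[RECLEN];
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(rec, 0, RECLEN);

    if (rank == 0) {
        for (i = 0; i < NMSGS; i++)
            MPI_Send(rec, RECLEN, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        for (i = 0; i < NMSGS; i++)
            MPI_Recv(rec, RECLEN, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &st);
        printf("received %d records of %d bytes\n", NMSGS, RECLEN);
    }

    MPI_Finalize();
    return 0;
}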

23 IBM Documentation
RSCT for AIX 5L LAPI Programming Guide (SA )
–LAPI programming
Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 1 (SA )
–Running jobs
Parallel Environment for AIX 5L V4.2.2 Operation and Use, Vol 2 (SA )
–Performance tools
Parallel Environment for AIX 5L V4.2.2 MPI Programming Guide (SA )
–IBM MPI implementation