Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with uDAPL
Jasjit Singh, Yogeshwar Sonawane
C-DAC, Pune, India
IEEE Cluster 2010, September 2010
This work has been developed under the project 'National PARAM Supercomputing Facility and Next Generation HPC Technology' sponsored by the Government of India's Department of Information Technology (DIT) under the Ministry of Communication and Information Technology (MCIT), vide administrative approval No. DIT/R&D/C-DAC/2(2)/2008 dated 26/05/2008.

Presentation outline
Introduction
Problem Statement
Proposed Design
Performance Evaluation
Related Work
Conclusion & Future Work

Introduction
HPC clusters are increasing in size to address the computational needs of large, challenging problems.
MPI is the de-facto standard for writing parallel applications. It typically uses a fully connected topology.
The ADI (Abstract Device Interface) provides MPI with portability across multiple networks and network interfaces.

uDAPL Overview
uDAPL is proposed by the Direct Access Transport (DAT) Collaborative.
It defines a lightweight, transport-independent and platform-independent set of user-level APIs to exploit RDMA capabilities, such as those present in InfiniBand, VIA and iWARP.
It is supported by many MPI implementations, such as MVAPICH2, Intel MPI, Open MPI and HP-MPI.

uDAPL Communication Model
[Diagram: two processes, each with memory buffers and an Endpoint consisting of a send queue (SQ) and a receive queue (RQ); descriptor posting flows from software to the hardware Endpoint, and event completions flow back through the EVD/CQ.]

Reliable Connection
In RC, a connection is formed between every process pair using endpoints (equivalent to queue pairs) at both ends.
The limited number of endpoints on an HCA restricts the number of connections that an MPI application can establish,
–thus limiting the number of nodes that can be deployed in the cluster.

Endpoint (EP) requirement
A cluster has (N * P) processes, where
P = number of processes or cores per node,
N = number of nodes in the cluster.
Every process needs to establish connections to the remaining (N * P – 1) processes. For simplicity, assume it to be (N * P).
EP requirement for a process = (N * P)
EP requirement for a node = (N * P * P)
Increasing N or P increases the EP requirement.
–Increasing P drastically increases the EP requirement.
N_max = (Endpoints on the HCA) / (P * P)
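The arithmetic above can be checked with a few lines of code. The sketch below is illustrative only (the `max_nodes` helper is hypothetical); it uses the platform parameters quoted later in the talk, 4096 endpoints per pnet3 HCA and P = 16 cores per node, which yields N_max = 16 without multiplexing.

```c
#include <stdio.h>

/* Each of the P local processes needs roughly N * P connections
 * (one per remote process), i.e. N * P * P hardware endpoints per
 * node without multiplexing. */
static int max_nodes(int hca_endpoints, int procs_per_node)
{
    return hca_endpoints / (procs_per_node * procs_per_node);
}

int main(void)
{
    int hca_endpoints = 4096;  /* pnet3 HCA endpoints (from the talk)  */
    int P = 16;                /* quad-socket quad-core nodes          */

    printf("EP requirement per node for N=16: %d\n", 16 * P * P); /* 4096 */
    printf("N_max without multiplexing: %d\n",
           max_nodes(hca_endpoints, P));                          /* 16   */
    return 0;
}
```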

Problem Statement
A hardware upgrade to meet increased endpoint requirements is costly and time-consuming.
Can an optimal solution be devised with the existing HCA?

Multiplexing approach
Extends scalability with existing hardware.
Maps multiple software connections onto fewer hardware connections without incurring any significant performance penalty.
Thus, the same HCA can support a larger number of nodes in the cluster.

Multiplexing Design: swep & hwep
We distinguish between a software endpoint (swep) and a hardware endpoint (hwep). Multiple sweps use a single hwep for data transfer.
[Diagram: sweps of processes P1–P4 in software mapped onto a smaller set of hweps in hardware.]
A hardware connection is between hweps on two nodes.
–Therefore only software connections between these two nodes will use this hardware connection.
One hwep is shared by sweps belonging to different processes on a node.
Multiplexing must support connection management as well as data transfer routines such as send, receive, RDMA Write, etc. A minimal data-structure sketch follows below.
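As an illustration only (the structure and field names below are hypothetical, not the provider's actual data structures), the swep-to-hwep association can be pictured as each swep holding a reference to a shared hwep:

```c
#include <stdint.h>

/* Hypothetical sketch of the swep-to-hwep association described above.
 * One hwep (hardware queue pair) is shared by sweps that belong to
 * different local processes but connect to the same remote node. */
struct hwep {
    uint32_t hw_index;     /* index of the hardware endpoint on the HCA   */
    uint32_t vid;          /* virtual identifier (see the VID slides)     */
    uint32_t users;        /* number of sweps multiplexed onto this hwep  */
};

struct swep {
    uint32_t local_pid;    /* LPID: local process identifier   */
    uint32_t remote_pid;   /* RPID: remote process identifier  */
    uint32_t remote_node;  /* RN: remote node number           */
    struct hwep *hw;       /* shared hardware endpoint used for
                              connection setup and data transfer */
};
```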

Multi-Way Binding Problem
[Diagram: node N1 (processes P1, P3, P5, P7; hweps H0, H1, ...) connected to node N2 (processes P2, P4, P6, P8; hweps h0, h1), illustrating multiple local hweps attempting to bind to the same remote hwep.]

Multiplexing Design: Multi-way binding
The processing (issuing or servicing) of a connection request at a node is completely independent of the processing at the remote node.
Without multiplexing, multi-way binding cannot occur, as every connection request sent or received allocates a separate hwep.
This is an issue related to connection management: the connection between hweps has to be strictly one-to-one, yet two hweps on one side (H1 and H3) may try to bind to a single remote hwep (H2).
[Diagram: hweps H1 and H3 on node N1 (processes P1, P3) both attempting to bind to hwep H2 on node N2 (process P2).]

Solution with VID
[Diagram: the same two-node example, with hweps now labelled by VID (e.g. H1, vid 0 on N1 paired with the vid-0 hwep on N2), so that hweps are bound one-to-one by matching VID.]

Multiplexing Design: Solution with VID
A Virtual Identifier (VID) acts as a unique identifier for a hwep; hweps with the same VID will be connected to each other.
For equal sharing, the total number of hweps on an HCA can be divided as N * m, where N is the number of nodes in the cluster.
–Here m is less than the practical EP requirement of P * P.
If the range of VIDs for a remote node (0 to m-1) is exhausted, a hwep that is already in use (preferably the least used) has to be reused.
[Diagram: on node N1, hweps H1 (vid 0) and H3 (vid 1); on node N2, hwep H2 (vid 0); hweps are paired by matching VID, so H1 binds to H2.]
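A minimal sketch of the selection logic described above, assuming hypothetical per-remote-node bookkeeping (the `hwep_slot` table and the value of m are illustrative, not the provider's actual code): each new software connection to a remote node takes the next free VID, and once all m VIDs are in use the least-used hwep is reused.

```c
#include <stdint.h>

#define M_PER_NODE 16   /* hypothetical m: hweps reserved per remote node */

/* Per-remote-node table of hweps, indexed by VID (0 .. m-1). */
struct hwep_slot {
    int      allocated;  /* has this VID been bound to a hardware endpoint? */
    uint32_t users;      /* number of sweps currently multiplexed onto it   */
};

/* Pick the VID (and hence the hwep) for a new connection to one remote
 * node: a free VID if any remains, otherwise reuse the least-used hwep,
 * as described on the slide. */
static int pick_vid(struct hwep_slot slots[M_PER_NODE])
{
    int vid, best = 0;
    uint32_t best_users = UINT32_MAX;

    for (vid = 0; vid < M_PER_NODE; vid++) {
        if (!slots[vid].allocated) {         /* unused VID available */
            slots[vid].allocated = 1;
            slots[vid].users = 1;
            return vid;
        }
        if (slots[vid].users < best_users) { /* track least-used hwep */
            best_users = slots[vid].users;
            best = vid;
        }
    }
    slots[best].users++;                     /* range exhausted: reuse */
    return best;
}
```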

Multiplexing Design: Endpoint as a Queue-pair
Generally, one hwep corresponds to one swep. A hwep context contains all the information about a single swep or connection, such as the EVD number and PZ number.
In multiplexing, one hwep is used by multiple sweps.
Fig. (a) is redrawn to show the hwep as a queue pair (SQ and RQ) in fig. (b).
–Either of the queues can own the context information.
–Both queues inherit the VID.
[Diagram: (a) sweps of processes P1 and P2 mapped onto hweps; (b) the same hweps drawn as separate SQ and RQ queues.]

Separating SQ and RQ
Many MPI libraries use a single EVD, a single PZ and the same memory privileges for a process.
–Hence all sweps of a process use the same EVD and PZ.
We share the SQ among processes and associate the RQ with only one process.
–Thus the RQ owns the information stored in the hwep context, while the same information for the SQ is conveyed as part of the descriptor (see the sketch below).
During connection establishment, only the RQ is selected.
–The remote SQ is chosen automatically: the VID of the remote SQ is the same as that of the local RQ.
SRQ functionality is feasible using the RQ of a hwep.
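For illustration only (the field names below are hypothetical; the actual pnet3 descriptor format is not described in the talk), a send descriptor posted on a shared SQ could carry the per-swep context that a dedicated hwep would otherwise hold:

```c
#include <stdint.h>

/* Hypothetical send descriptor for a shared SQ. Because the SQ is
 * multiplexed across processes, the per-connection context that a
 * dedicated hwep would normally store (EVD number, PZ number, swep
 * identity) travels with each work request instead. */
struct send_descriptor {
    uint64_t buf_addr;    /* registered source buffer                */
    uint32_t length;      /* payload length in bytes                 */
    uint32_t lmr_context; /* memory registration handle              */
    uint32_t evd_number;  /* completion EVD of the posting process   */
    uint32_t pz_number;   /* protection zone of the posting process  */
    uint32_t swep_id;     /* software endpoint issuing this request  */
    uint32_t vid;         /* selects the remote RQ with matching VID */
};
```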

Static Mapping: Division of sweps
For a fixed cluster environment, static mapping avoids various multiplexing overheads,
–such as those incurred while allocating hweps and sweps and maintaining their association.
[Diagram: for each local process (LPID 0 to P-1), P sweps are allocated per remote node (RN 0 to N-1), one per remote process (RPID 0 to P-1).]
LPID = Local Process Identifier, RPID = Remote Process Identifier, RN = Remote Node Number.

Static Mapping: Division of hweps
Similarly, a static allocation of hweps is possible.
The multiplexing is (N * P * P) : (N * P * X), i.e. P : X,
–where X is less than P,
–P sweps share X hweps,
–X SQs and X RQs are used by P sweps.
A combination of LPID, RPID and RN acts as the VID. A sketch of one such mapping follows below.
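One plausible static mapping is sketched below. It is illustrative only: the talk states that a combination of LPID, RPID and RN acts as the VID and that P sweps share X hweps, but not the exact function, so the modulo scheme here is an assumption rather than the authors' implementation.

```c
/* Fold the P sweps that one local process uses toward one remote node
 * onto X hweps, giving a P : X mux-ratio and N * P * X VIDs per node. */
struct static_map {
    int N;  /* nodes in the cluster                     */
    int P;  /* processes (cores) per node (rpid range)  */
    int X;  /* hweps shared by each group of P sweps    */
};

/* VID identifying the hwep used for the connection (lpid, rn, rpid). */
static int swep_to_vid(const struct static_map *m,
                       int lpid, int rn, int rpid)
{
    int slot = rpid % m->X;              /* fold P remote processes onto X hweps */
    return (lpid * m->N + rn) * m->X + slot;
}
```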

Performance Evaluation
We compare results for the following two models:
a) without multiplexing, termed the basic model;
b) with multiplexing, termed the scalable model.
We have evaluated the multiplexing design using uDAPL over the PARAMNet-3 (pnet3) interconnect.

Experimental Platform
Two clusters: Cluster A of 16 nodes and Cluster B of 48 nodes.
Each node has four 2.93 GHz Intel Xeon Tigerton quad-core processors, 64 GB RAM and a PCI-Express based pnet3 HCA.
Intel MPI is used, with environment-variable based control for using only RDMA Write operations.
Pnet3 is a high-performance cluster interconnect developed by C-DAC. It comprises
–a 48-port switch with 10 Gbps full-duplex CX4 connectivity,
–an x4/x8 PCIe HCA having 4096 endpoints,
–a lightweight protocol software stack known as KSHIPRA.
KSHIPRA supports the uDAPL library as well as some selected components of the OFED stack, i.e. IPoIB, SDP and iSER.

Multiplexing Ratio (mux-ratio)
Mux-ratio is the ratio in which multiple sweps use a single hwep.

Multiplexing ratio   Sweps supported   No. of nodes (max)
No multiplexing      4k                16
2:1                  8k                32
4:1                  16k               64
8:1                  32k               128
16:1                 64k               256

[Table: hweps used (= sweps / mux-ratio) for mux-ratios from the Basic Model up to 16:1, at 8, 16, 32 and 48 nodes.]

It is not possible to run applications using the Basic Model beyond 16 nodes.
With multiplexing, increasing the mux-ratio increases the number of nodes that can be deployed in the cluster.
–It brings the hwep requirement down to the number of hweps supported by the HCA.
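The first table follows directly from the endpoint arithmetic given earlier. The short sketch below (same assumptions as before: 4096 hweps per HCA and P = 16 processes per node) prints the sweps supported and the maximum node count for each mux-ratio.

```c
#include <stdio.h>

int main(void)
{
    const int hca_hweps = 4096;   /* pnet3 HCA endpoints        */
    const int P = 16;             /* processes (cores) per node */
    const int ratios[] = { 1, 2, 4, 8, 16 };
    const int n = (int)(sizeof ratios / sizeof ratios[0]);

    printf("%-16s %-16s %s\n", "mux-ratio", "sweps supported", "max nodes");
    for (int i = 0; i < n; i++) {
        int sweps = hca_hweps * ratios[i];   /* software endpoints supported */
        int max_nodes = sweps / (P * P);     /* N_max = sweps / (P * P)      */
        printf("%-16d %-16d %d\n", ratios[i], sweps, max_nodes);
    }
    return 0;
}
```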

Intel MPI Benchmarks (IMB)
Very little variation in the readings is observed across all the mux-ratios in nearly all of the benchmarks.
[Chart: IMB Alltoall, 128 processes on 8 nodes.]

NAS Parallel Benchmarks (NPB)
NPB contains computing kernels typical of various CFD scientific applications. Each benchmark has a different communication pattern.
IS shows a maximum of 5% degradation with 16:1 multiplexing.
[Chart: NAS Class C readings, 256 processes on 16 nodes.]

HPL Benchmark
[Table: HPL results by node/process count, with columns Nodes, Processes, % Memory used for N, Peak computing power (TFlops), and achieved Gflops for the Basic Model and for 2:1, 4:1, 8:1 and 16:1 MUX; the Basic Model is Not Applicable beyond 16 nodes.]
Runs on up to 48 nodes show successful scalability of MPI applications using the multiplexing technique.
The marginal improvement over the Basic Model is due to the management of a smaller number of hweps on the HCA.

Related Work
SRQ-based designs for reducing communication buffer requirements.
On-demand connection management: a connection is set up only when required.
–In the worst case, an all-to-all pattern may still emerge.
–As our work is incorporated into the uDAPL provider, many features of MPI can be used in conjunction with our technique.
The eXtended Reliable Connection (XRC) transport provides the services of the RC transport while offering additional scalability for multi-core clusters.
–It allows a single connection from one process to an entire node.
The hybrid programming model (e.g. OpenMP with MPI) uses threads within a node and MPI processes across nodes.
–All threads running on a node share the same set of connections.
–For the hybrid model to work, MPI applications must be thread-enabled.
–Our work is part of the transport library, so MPI applications can run seamlessly.

Conclusion and Future Work
We proposed a multiplexing technique to extend the scalability of MPI applications.
–The effort is to map the MPI requirement onto the available pool of endpoints on the HCA.
The multiplexing technique can be applied to any transport library that provides a connection-oriented service.
We can scale the cluster size in the same proportion as the mux-ratio.
–E.g. with a 16:1 mux-ratio, the number of nodes in the cluster can be 16 times larger with the same HCA.
No visible performance degradation is observed up to 48 nodes.
Future work includes evaluation at a larger scale, the addition of send-receive support and the addition of SRQ support.

Thank you

Backup slides

uDAPL Communication Model
Support for both channel semantics (Send/Receive) and memory semantics (RDMA Write and RDMA Read).
Reliable, connection-oriented model with endpoints as the source and sink of a communication channel.
Data Transfer Operations (DTOs) (i.e. work requests or descriptors) are posted on an endpoint.
Completion of a DTO is reported as an event on an Event Dispatcher (EVD) (similar to a CQ).
–Either a polling/dequeue or a wait model can be used for completion reaping.
Protection Zone (PZ) and memory privilege flags validate memory access.
uDAPL also defines an SRQ mechanism that provides the ability to share receive buffers among several connections.
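A rough sketch of this model, assuming an already-connected endpoint `ep`, a buffer `buf` registered earlier (yielding `lmr_context`), and a DTO EVD `dto_evd` created during initialization; error handling is omitted. A send DTO is posted on the endpoint and its completion is reaped by dequeuing the EVD in a polling loop.

```c
#include <dat/udat.h>   /* uDAPL (DAT 1.x) header; DAT 2.0 ships <dat2/udat.h> */
#include <stdint.h>

/* Post a send DTO on a connected endpoint and poll its EVD for the
 * DTO completion event. ep, dto_evd, lmr_context, buf and len are
 * assumed to have been set up during initialization. */
static void send_and_wait(DAT_EP_HANDLE ep, DAT_EVD_HANDLE dto_evd,
                          DAT_LMR_CONTEXT lmr_context,
                          void *buf, DAT_VLEN len)
{
    DAT_LMR_TRIPLET iov;
    DAT_DTO_COOKIE cookie;
    DAT_EVENT event;

    iov.lmr_context     = lmr_context;                /* registered memory handle  */
    iov.virtual_address = (DAT_VADDR)(uintptr_t)buf;
    iov.segment_length  = len;
    cookie.as_ptr       = buf;                        /* returned with completion  */

    /* Post the send work request (descriptor) on the endpoint. */
    dat_ep_post_send(ep, 1, &iov, cookie, DAT_COMPLETION_DEFAULT_FLAG);

    /* Polling model: dequeue events from the EVD until the DTO completes. */
    for (;;) {
        if (dat_evd_dequeue(dto_evd, &event) == DAT_SUCCESS &&
            event.event_number == DAT_DTO_COMPLETION_EVENT)
            break;
    }
}
```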

Send-Receive Handling Complexities
During recv DTO processing, a mismatch between receive descriptors and their corresponding send descriptors can occur.
This is due to the sharing of the hwep RQ: the RQ can hold descriptors from different sweps with varied lengths.
Additional hardware support is required to handle these complexities.