Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with uDAPL Jasjit Singh, Yogeshwar Sonawane C-DAC, Pune,

Multiplexing Endpoints of HCA for Scaling MPI Applications: Design and Performance Evaluation with uDAPL
Jasjit Singh, Yogeshwar Sonawane
C-DAC, Pune, India
IEEE Cluster 2010, 21st September 2010
This work has been developed under the project 'National PARAM Supercomputing Facility and Next Generation HPC Technology' sponsored by Government of India's Department of Information Technology (DIT) under Ministry of Communication and Information Technology (MCIT) vide administrative approval No. DIT/R&D/C-DAC/2(2)/2008 dated 26/05/2008.

Presentation outline
Introduction
Problem Statement
Proposed Design
Performance Evaluation
Related Work
Conclusion & Future Work

Introduction
HPC clusters are increasing in size to address the computational needs of large, challenging problems.
MPI is the de facto standard for writing parallel applications. It typically uses a fully connected topology.
The Abstract Device Interface (ADI) gives MPI portability across multiple networks and network interfaces.

uDAPL Overview
uDAPL is proposed by the Direct Access Transport (DAT) Collaborative.
It defines a lightweight, transport-independent and platform-independent set of user-level APIs to exploit RDMA capabilities, such as those present in InfiniBand, VIA and iWARP.
It is supported by many MPI implementations, such as MVAPICH2, Intel MPI, OpenMPI and HP-MPI.

uDAPL Communication Model
[Figure: two processes, each with memory buffers, an endpoint (SQ and RQ) and an EVD on the software side, backed by a CQ on the hardware side. Arrows show descriptor posting and event completion.]

Reliable Connection
In RC, a connection is formed between every process pair using endpoints (equivalent to queue pairs) at both ends.
The limited number of endpoints on an HCA restricts the number of connections that an MPI application can establish.
–This in turn limits the number of nodes that can be deployed in the cluster.

Endpoint (EP) requirement
A cluster has (N * P) processes, where
P = number of processes (or cores) per node,
N = number of nodes in the cluster.
Every process needs to establish connections to the remaining (N * P – 1) processes. For simplicity, assume this to be (N * P).
EP requirement for a process = (N * P)
EP requirement for a node = (N * P * P)
Increasing N or P increases the EP requirement.
–Increasing P increases it drastically, because the per-node requirement grows with the square of P.
N max = (endpoints on the HCA) / (P * P)
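The arithmetic on this slide can be checked with a short sketch. The function names are illustrative only; the sample values (4096 endpoints per HCA, 16 processes per node) are taken from the experimental platform described later in the talk.

```python
def ep_per_process(n_nodes: int, procs_per_node: int) -> int:
    """Endpoints one process needs, using the slide's N * P simplification."""
    return n_nodes * procs_per_node


def ep_per_node(n_nodes: int, procs_per_node: int) -> int:
    """Endpoints one node needs: P processes, each needing N * P endpoints."""
    return n_nodes * procs_per_node * procs_per_node


def max_nodes(hca_endpoints: int, procs_per_node: int) -> int:
    """N max = endpoints on the HCA / (P * P), rounded down."""
    return hca_endpoints // (procs_per_node * procs_per_node)


if __name__ == "__main__":
    # 4096 endpoints and 16 processes per node, as on the pnet3 test clusters.
    print(max_nodes(4096, 16))    # largest cluster without multiplexing
    print(ep_per_node(32, 16))    # requirement if the cluster grew to 32 nodes
```

With these inputs the limit comes out at 16 nodes, which matches the "No multiplexing" row of the mux-ratio table.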

Problem Statement
A hardware upgrade to meet increased endpoint requirements is costly and time-consuming.
Can an optimal solution be devised with the existing HCA?

Multiplexing approach
Extends scalability with existing hardware.
Maps multiple software connections to fewer hardware connections without incurring any significant performance penalty.
Thus, the same HCA can support more nodes in the cluster.

Multiplexing Design: swep & hwep
We distinguish software ep (swep) and hardware ep (hwep).
Multiple sweps use a single hwep for data transfer.
[Figure: processes P1 to P4 with their sweps (software) mapped onto hweps (hardware).]
A hardware connection is between hweps from two nodes.
–Therefore only software connections between these two nodes will use this hardware connection.
One hwep is shared by sweps belonging to different processes on a node.
Multiplexing should support both connection management and data transfer routines such as send, receive, RDMA Write etc.

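Multi-Way Binding Problem
[Figure: node N1 (processes P1, P3, P5, P7) and node N2 (processes P2, P4, P6, P8) with their hweps, illustrating several hweps on one node attempting to bind to the same remote hwep.]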

Multiplexing Design: Multi-way binding
This is an issue related to connection management.
A connection between hweps has to be strictly one-to-one.
Two hweps on one side (H1 and H3) are trying to bind to a single remote hwep (H2).
The processing (issuing or servicing) of a connection request at a node is completely independent of the processing at the remote node.
Without multiplexing, multi-way binding does not occur, as every connection request sent or received allocates a separate hwep.
[Figure: H1 and H3 on node N1 (processes P1, P3) both binding to H2 on node N2 (process P2).]

Solution with VID
[Figure: the same two-node layout (N1 with P1, P3, P5, P7; N2 with P2, P4, P6, P8), with hweps H1 and H2 both tagged vid 0 so that they connect to each other.]

Multiplexing Design: Solution with VID
A Virtual Identifier (VID) acts as a unique identifier for a hwep.
Hweps with the same VID will be connected to each other.
For equal sharing, the total number of hweps on an HCA can be divided as N * m, where N is the number of nodes in the cluster.
–Here m is less than the practical EP requirement of P * P.
If the range of VIDs for a remote node (0 to m-1) is exhausted, a hwep already in use (preferably the least used) has to be reused.
[Figure: H1 (vid 0) on node N1 connects to H2 (vid 0) on node N2; H3 on N1 carries vid 1. Processes P1 and P3 are on N1, P2 is on N2.]
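The reuse rule above can be sketched as a small allocator. This is only an illustration of the policy stated on the slide, not the provider's implementation: the class name `VidAllocator` and its method are hypothetical, and how the chosen VID is conveyed to the remote node is not modelled.

```python
class VidAllocator:
    """Hypothetical per-remote-node pool of m VIDs (0 to m-1).

    Unused VIDs are handed out first; once the range is exhausted,
    the least-used VID (and hence its hwep) is reused.
    """

    def __init__(self, m: int) -> None:
        if m < 1:
            raise ValueError("m must be at least 1")
        # use_count[vid] = number of sweps currently mapped to that hwep
        self.use_count = [0] * m

    def acquire(self) -> int:
        # min() returns the lowest-numbered VID among those with the
        # smallest use count, so fresh VIDs are consumed in order.
        vid = min(range(len(self.use_count)), key=self.use_count.__getitem__)
        self.use_count[vid] += 1
        return vid


if __name__ == "__main__":
    pool = VidAllocator(m=2)
    # Five connection requests towards the same remote node.
    print([pool.acquire() for _ in range(5)])
```

With m = 2, the third request already has to share a hwep with the first, which is the reuse case the slide describes.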

Multiplexing Design: Endpoint as a Queue-pair
Generally, one hwep corresponds to one swep.
A hwep context contains all the information about a single swep or connection, such as the EVD number and PZ number.
In multiplexing, one hwep is used by multiple sweps.
Fig. (a) is redrawn in fig. (b) to show the hwep as a queue pair. Both queues inherit the VID.
–Either of the queues can own the context information.
[Figure: (a) sweps of processes P1 and P2 in software sharing hweps in hardware; (b) the same hweps drawn as SQ/RQ pairs.]

Separating SQ and RQ
Many MPI libraries use a single EVD, a single PZ and the same memory privilege for a process.
–Hence all sweps of a process use the same EVD and PZ.
We share the SQ among processes but assign the RQ to only one process.
–Thus the RQ owns the information stored in a hwep context, while the same information for the SQ is conveyed as part of the descriptor.
During connection establishment, only the RQ is selected.
–The remote SQ is chosen automatically: its VID is the same as that of the local RQ.
SRQ functionality is feasible using the RQ of a hwep.
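A rough data-structure sketch may help to picture this split. All class and field names below are hypothetical and chosen only for illustration; the slides do not describe the actual hwep context or descriptor layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecvQueue:
    """RQ half of a hwep: owned by one process, so it can hold the context."""
    vid: int
    owner_pid: int
    evd: int   # EVD number of the owning process
    pz: int    # PZ number of the owning process


@dataclass
class SendDescriptor:
    """A send-side descriptor carries the poster's EVD/PZ itself."""
    evd: int
    pz: int
    length: int


@dataclass
class SendQueue:
    """SQ half of a hwep: shared by several processes, so no single owner."""
    vid: int
    posted: List[SendDescriptor] = field(default_factory=list)

    def post(self, desc: SendDescriptor) -> None:
        self.posted.append(desc)


if __name__ == "__main__":
    sq = SendQueue(vid=0)
    sq.post(SendDescriptor(evd=1, pz=1, length=64))    # posted by one process
    sq.post(SendDescriptor(evd=2, pz=2, length=128))   # posted by another
    print([d.evd for d in sq.posted])
```

Because each descriptor names its own EVD and PZ, completions on a shared SQ can still be reported to the process that posted the work.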

Static Mapping: Division of sweps
For a fixed cluster environment, static mapping avoids various multiplexing overheads.
–Such as those of allocating hweps and sweps and maintaining their association.
[Figure: sweps divided by LPID 0 to LPID (P-1); each LPID block is divided by remote node RN 0 to RN (N-1); each remote-node block holds P sweps, one per RPID 0 to RPID (P-1).]
LPID = Local Process Identifier
RPID = Remote Process Identifier
RN = Remote Node Number

Static Mapping: Division of hweps
Similarly, static allocation of hweps is possible.
Multiplexing is (N * P * P) : (N * P * X), i.e. P : X.
–where X is less than P.
–P sweps will share X hweps.
–X SQs and X RQs will be used by P sweps.
The combination of LPID, RPID and RN acts as a VID.
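The counting behind the P : X ratio can be illustrated with a sketch. The slides do not give the actual mapping function, so the index formulas below, and in particular folding the RPID with a modulo, are assumptions chosen only to reproduce the stated N * P * P sweps and N * P * X hweps per node.

```python
def swep_index(lpid: int, rn: int, rpid: int, n_nodes: int, p: int) -> int:
    """Flat index of the swep for (LPID, RN, RPID): one of N * P * P per node."""
    return (lpid * n_nodes + rn) * p + rpid


def hwep_index(lpid: int, rn: int, rpid: int, n_nodes: int, x: int) -> int:
    """Assumed static mapping of a swep onto one of N * P * X hwep slots.

    The P sweps of one (LPID, RN) block are folded onto X slots
    by taking RPID modulo X (an illustrative choice).
    """
    return (lpid * n_nodes + rn) * x + (rpid % x)


if __name__ == "__main__":
    N, P, X = 4, 8, 2
    slots = {
        hwep_index(lpid, rn, rpid, N, X)
        for lpid in range(P)
        for rn in range(N)
        for rpid in range(P)
    }
    print(len(slots), N * P * X)   # hwep slots in use vs. the slide's N * P * X
```

Since the mapping is a pure function of LPID, RN and RPID, no allocation table has to be kept at run time, which is the overhead static mapping is meant to avoid.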

Performance Evaluation
We compare results for the following two models:
a) without multiplexing, termed the basic model
b) with multiplexing, termed the scalable model.
We have evaluated the multiplexing design using uDAPL over the PARAMNet-3 (pnet3) interconnect.

Experimental Platform
Two clusters: Cluster A of 16 nodes, Cluster B of 48 nodes.
Each node has four quad-core 2.93 GHz Intel Xeon (Tigerton) processors, 64 GB RAM and a PCI Express based pnet3 HCA.
Intel MPI is used, with its environment-variable based control set to use only RDMA Write operations.
Pnet3 is a high-performance cluster interconnect developed by C-DAC. It comprises
–a 48-port switch with 10 Gbps full-duplex CX4 connectivity.
–an x4/x8 PCIe HCA having 4096 endpoints.
–a lightweight protocol software stack known as KSHIPRA.
KSHIPRA supports the uDAPL library as well as selected components of the OFED stack, i.e. IPoIB, SDP and iSER.

Multiplexing Ratio (mux-ratio)
Mux-ratio is the ratio in which multiple sweps use a single hwep.

Multiplexing ratio   Sweps supported   No. of nodes (max)
No multiplexing      4k                16
2:1                  8k                32
4:1                  16k               64
8:1                  32k               128
16:1                 64k               256

hweps used (sweps / mux-ratio)
mux-ratio     8 nodes   16 nodes   32 nodes   48 nodes
Basic Model   2048      4096       8192       12288
2             1024      2048       4096       6144
4             512       1024       2048       3072
8             256       512        1024       1536
16            128       256        512        768

It is not possible to run applications using the Basic Model beyond 16 nodes.
In multiplexing, increasing the mux-ratio increases the number of nodes that can be deployed in the cluster.
–It brings the hwep requirement down to the number of hweps supported by the HCA.
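The "hweps used" figures follow directly from N * P * P divided by the mux-ratio, with P = 16 processes per node on the test clusters. A minimal sketch, with illustrative function names, reproduces them and finds the smallest ratio that fits a 4096-endpoint HCA:

```python
def hweps_used(nodes: int, procs_per_node: int, mux_ratio: int = 1) -> int:
    """hweps needed per node: sweps (N * P * P) divided by the mux-ratio."""
    return nodes * procs_per_node * procs_per_node // mux_ratio


def min_mux_ratio(nodes, procs_per_node, hca_endpoints, ratios=(1, 2, 4, 8, 16)):
    """Smallest candidate ratio whose hwep requirement fits the HCA, else None."""
    for ratio in ratios:
        if hweps_used(nodes, procs_per_node, ratio) <= hca_endpoints:
            return ratio
    return None


if __name__ == "__main__":
    for nodes in (8, 16, 32, 48):
        row = [hweps_used(nodes, 16, r) for r in (1, 2, 4, 8, 16)]
        print(nodes, row, "min ratio:", min_mux_ratio(nodes, 16, 4096))
```

A ratio of 1 corresponds to the Basic Model. The minimum ratios this yields for 32 and 48 nodes (2:1 and 4:1) agree with the configurations marked "Not Applicable" in the HPL results.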

Intel MPI Benchmarks (IMB)
Very little variation in readings is observed across all the mux-ratios in nearly all of the benchmarks.
[Figure: IMB Alltoall, 128 processes on 8 nodes.]

NAS Parallel Benchmarks (NPB)
NPB contains computing kernels typical of various CFD scientific applications.
Each benchmark has a different communication pattern.
IS shows a maximum degradation of 5% with 16:1 multiplexing.
[Figure: NAS Class C readings, 256 processes on 16 nodes.]

HPL Benchmark

Nodes   Processes   % memory used for N   Peak computing power (TFlops)   Basic Model (Gflops)   2:1 MUX (Gflops)   4:1 MUX (Gflops)   8:1 MUX (Gflops)   16:1 MUX (Gflops)
16      256         90                    3                               2121                   2182               2144               2181               2173
32      512         80                    6                               Not Applicable         4157               4111               4247               4274
48      768         80                    9                               Not Applicable         Not Applicable     6076               6031               6024

The 32-node and 48-node runs show successful scaling of MPI applications using the multiplexing technique.
The marginal improvement with multiplexing is due to the HCA managing fewer hweps.

Related Work
SRQ based designs for reducing communication buffer requirements.
On-demand connection management: a connection is established only when required.
–In the worst case, an all-to-all pattern may emerge.
–As our work is incorporated into the uDAPL provider, many features of MPI can be used in conjunction with our technique.
eXtended Reliable Connection (XRC) transport provides the services of RC transport while providing additional scalability for multi-core clusters.
–It allows a single connection from one process to an entire node.
Hybrid programming model (e.g. OpenMP with MPI) uses threads within a node and MPI processes across nodes.
–All threads running on a node share the same set of connections.
–For the hybrid model to work, MPI applications should be thread enabled.
–Our work is part of the transport library, so MPI applications can run seamlessly.

Conclusion and Future Work
We proposed a multiplexing technique to extend the scalability of MPI applications.
–The effort is to map the MPI requirement to the available pool of endpoints on the HCA.
The multiplexing technique can be applied to any transport library that provides a connection-oriented service.
We can scale the cluster size in the same proportion as the mux-ratio.
–E.g. with a 16:1 mux-ratio, the number of nodes in the cluster can be 16 times larger with the same HCA.
No visible performance degradation is observed up to 48 nodes.
Future work includes evaluation at larger scale, addition of send-receive support and addition of SRQ support.

Thank you yogeshwars@cdac.in www.cdac.in www.cdac.in/html/htdg/products.asp

Backup slides

uDAPL Communication Model
Support for both Channel Semantics (Send/Receive) and Memory Semantics (RDMA Write and RDMA Read).
Reliable, connection-oriented model with endpoints as the source and sink of a communication channel.
Data Transfer Operations (DTOs), i.e. Work Requests or descriptors, are posted on an endpoint.
Completion of a DTO is reported as an event on an Event Dispatcher (EVD) (similar to a CQ).
–Either a polling/de-queue model or a wait model can be used for reaping completions.
Protection Zone (PZ) and Memory Privilege flags validate memory access.
uDAPL defines an SRQ mechanism that provides the ability to share receive buffers among several connections.

Send-Receive Handling Complexities
During recv DTO processing, receive descriptors can be mismatched with their corresponding send descriptors.
This is due to sharing of the hwep RQ.
A hwep RQ can hold descriptors of varied lengths from different sweps.
Additional hardware support is required to handle the above complexities.
