Optimizing Threaded MPI Execution on SMP Clusters

Hong Tang and Tao Yang
Department of Computer Science
University of California, Santa Barbara
Parallel Computation on SMP Clusters

- Massively parallel machines -> SMP clusters.
- Commodity components: off-the-shelf processors + fast network (Myrinet, Fast/Gigabit Ethernet).
- Parallel programming models for SMP clusters:
  - MPI: portability, performance, legacy programs.
  - MPI + variations: MPI + multithreading, MPI + OpenMP.
Threaded MPI Execution

- MPI paradigm: separate address spaces for different MPI nodes.
- Natural solution: map MPI nodes to processes.
- What if we map MPI nodes to threads? (See the sketch after this slide.)
  - Faster synchronization among MPI nodes running on the same machine.
  - Demonstrated in previous work [PPoPP '99] for a single shared-memory machine (developed techniques to safely execute MPI programs using threads).
- Threaded MPI execution on SMP clusters:
  - Intra-machine communication through shared memory.
  - Inter-machine communication through the network.
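A minimal sketch of the idea, assuming POSIX threads: the MPI nodes assigned to one cluster node are started as threads of a single process, so they share one address space and an intra-machine "message" can be a pointer hand-off rather than an inter-process copy. The names tmpi_node_main and NODES_PER_MACHINE are illustrative, not the actual TMPI interface.

    /* Sketch only: run the MPI nodes of one cluster node as threads. */
    #include <pthread.h>
    #include <stdio.h>

    #define NODES_PER_MACHINE 4          /* e.g., one MPI node per CPU */

    static void *tmpi_node_main(void *arg)
    {
        int rank = (int)(long)arg;
        /* A real runtime would invoke the user's transformed main() here,
         * passing the per-thread MPI context for this rank. */
        printf("MPI node %d running as a thread\n", rank);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NODES_PER_MACHINE];

        /* All MPI nodes on this machine share one address space. */
        for (long r = 0; r < NODES_PER_MACHINE; r++)
            pthread_create(&tid[r], NULL, tmpi_node_main, (void *)r);
        for (int r = 0; r < NODES_PER_MACHINE; r++)
            pthread_join(tid[r], NULL);
        return 0;
    }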
Threaded MPI Execution Benefits Inter-Machine Communication

- Common intuition: inter-machine communication cost is dominated by network delay, so the advantage of executing MPI nodes as threads diminishes.
- Our finding: using threads can significantly reduce the buffering and orchestration overhead of inter-machine communication.
Related Work

- MPI on network clusters:
  - MPICH – a portable MPI implementation.
  - LAM/MPI – communication through a standalone RPI server.
- Collective communication optimization:
  - SUN-MPI and MPI-StarT – modify the MPICH ADI layer; target SMP clusters.
  - MagPIe – targets SMP clusters connected through a WAN.
- Lower communication layer optimization:
  - MPI-FM and MPI-AM.
- Threaded execution of message passing programs:
  - MPI-Lite, LPVM, TPVM.
Background: MPICH Design
MPICH Communication Structure

[Figures: MPICH without shared memory; MPICH with shared memory]
TMPI Communication Structure
Comparison of TMPI and MPICH

- Drawbacks of MPICH w/ shared memory:
  - Intra-node communication is limited by the shared memory size.
  - Busy polling to check for messages from either the daemon or a local peer.
  - Cannot do automatic resource clean-up.
- Drawbacks of MPICH w/o shared memory:
  - Large overhead for intra-node communication.
  - Too many daemon processes and open connections.
- Drawback of both MPICH configurations:
  - Extra data copying for inter-machine communication.
TMPI Communication Design
Separation of Point-to-Point and Collective Communication Channels

- Observation: MPI point-to-point and collective communication have different semantics:

    Point-to-point                         | Collective
    ---------------------------------------+--------------------------------------------------
    Unknown source (MPI_ANY_SOURCE)        | Determined source (ancestor in the spanning tree)
    Out-of-order delivery (message tags)   | In-order delivery
    Asynchronous (non-blocking receive)    | Synchronous

- Separate channels for point-to-point and collective communication (a data-structure sketch follows this slide):
  - Eliminates daemon intervention for collective communication.
  - Less effective for MPICH – no sharing of ports among processes.
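A hedged sketch of what the channel separation might look like in data-structure terms (hypothetical layout, not the TMPI source): point-to-point traffic needs tag matching, wildcard sources, and out-of-order delivery, so it gets a searchable queue; collective traffic always arrives in order from a known parent in the spanning tree, so a single slot guarded by a flag suffices and no daemon has to intervene.

    #include <pthread.h>
    #include <stddef.h>

    typedef struct msg {
        int          src, tag;       /* matched against MPI_ANY_SOURCE / tags */
        void        *data;
        size_t       len;
        struct msg  *next;
    } msg_t;

    typedef struct {                 /* point-to-point channel: searchable,   */
        msg_t          *head, *tail; /* out-of-order, asynchronous            */
        pthread_mutex_t lock;
        pthread_cond_t  arrived;
    } p2p_channel_t;

    typedef struct {                 /* collective channel: in-order, source  */
        void           *data;        /* is the known parent in the tree       */
        size_t          len;
        int             ready;       /* parent sets, child waits              */
        pthread_mutex_t lock;
        pthread_cond_t  cv;
    } coll_channel_t;

A receiver on the collective channel simply waits on cv until its parent sets ready, which is what keeps the daemon out of the collective path.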
Hierarchy-Aware Collective Communication

- Observation: two-level communication hierarchy.
  - Inside an SMP node: communication through shared memory (microsecond-scale latency).
  - Between SMP nodes: communication through the network (much higher latency).
- Idea: build the communication spanning tree in two steps (sketched after this slide).
  - First, choose a root MPI node on each cluster node and build a spanning tree among all the cluster nodes.
  - Second, all other MPI nodes connect to their local root node.
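A sketch of a hierarchy-aware broadcast under this two-step scheme. The extern helpers are assumptions introduced for illustration (they are not the TMPI API); the point is that only the per-machine root nodes touch the network, while every other MPI node receives the data from its local root through shared memory.

    #include <stddef.h>

    /* Hypothetical runtime helpers, assumed to exist: */
    extern int  is_local_root(int rank);                  /* root on this SMP?  */
    extern int  is_global_root(int rank);                 /* root of the tree?  */
    extern void recv_from_parent_root(void *buf, size_t len);     /* network    */
    extern void forward_to_child_roots(const void *buf, size_t len);
    extern void post_to_local_nodes(const void *buf, size_t len); /* shared mem */
    extern void wait_for_local_root(void *buf, size_t len);       /* shared mem */

    void hier_bcast(void *buf, size_t len, int rank)
    {
        if (is_local_root(rank)) {
            /* Step 1: broadcast among per-machine roots over the network. */
            if (!is_global_root(rank))
                recv_from_parent_root(buf, len);
            forward_to_child_roots(buf, len);
            /* Step 2: fan out to the other MPI nodes on this machine. */
            post_to_local_nodes(buf, len);
        } else {
            /* Non-root nodes never touch the network for this broadcast. */
            wait_for_local_root(buf, len);
        }
    }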
Adaptive Buffer Management

- Question: how do we manage temporary buffering of message data when the remote receiver is not ready to accept the data?
- Choices:
  - Send the data with the request – eager push.
  - Send the request only and send the data when the receiver is ready – three-phase protocol.
- TMPI adapts between the two methods (see the sketch below).
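A sketch of the adaptive choice, with hypothetical helper names and an assumed eager-size threshold (neither is from the TMPI implementation): short messages, or messages the receiver is known to have room for, are pushed eagerly together with the request; otherwise the sender falls back to the three-phase protocol so the receiver never has to buffer data it cannot yet accept.

    #include <stddef.h>

    #define EAGER_LIMIT (16 * 1024)       /* assumed cut-off, tuned per network */

    /* Hypothetical runtime helpers, assumed to exist: */
    extern int  receiver_has_buffer(int dst, size_t len);
    extern void send_request_and_data(int dst, const void *buf, size_t len);
    extern void send_request_only(int dst, size_t len);
    extern void wait_for_receiver_ready(int dst);
    extern void send_data(int dst, const void *buf, size_t len);

    void adaptive_send(int dst, const void *buf, size_t len)
    {
        if (len <= EAGER_LIMIT || receiver_has_buffer(dst, len)) {
            send_request_and_data(dst, buf, len);   /* eager push            */
        } else {
            send_request_only(dst, len);            /* phase 1: request      */
            wait_for_receiver_ready(dst);           /* phase 2: ready        */
            send_data(dst, buf, len);               /* phase 3: data         */
        }
    }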
Experimental Study

- Goal: illustrate the advantage of threaded MPI execution on SMP clusters.
- Hardware setting:
  - A cluster of 6 quad-Xeon 500 MHz SMPs, each with 1 GB main memory and 2 Fast Ethernet cards.
- Software setting:
  - OS: RedHat Linux 6.0, with channel bonding enabled in the kernel.
  - Process-based MPI system: MPICH 1.2.
  - Thread-based MPI system: TMPI (45 functions of the MPI 1.1 standard).
Inter-Cluster-Node Point-to-Point

- Ping-pong benchmark, TMPI vs. MPICH w/ shared memory.
- [Figure: (a) ping-pong short messages – round trip time (us) vs. message size (bytes); (b) ping-pong long messages – transfer rate (MB/s) vs. message size (KB); curves: TMPI, MPICH]
Intra-Cluster-Node Point-to-Point

- Ping-pong benchmark, TMPI vs. MPICH1 (MPICH w/ shared memory) and MPICH2 (MPICH w/o shared memory).
- [Figure: (a) ping-pong short messages – round trip time (us) vs. message size (bytes); (b) ping-pong long messages – transfer rate (MB/s) vs. message size (KB); curves: TMPI, MPICH1, MPICH2]
Collective Communication

- Benchmarked operations: Reduce, Bcast, Allreduce.
- Each cell reports TMPI / MPICH_SHM / MPICH_NOSHM latency (us).
- Three node distributions (4x1, 1x4, 4x4), three root-node settings (same, rotate, combo).
- [Table: latencies for Reduce, Bcast, and Allreduce under each node distribution and root setting, in TMPI / MPICH_SHM / MPICH_NOSHM format]
- Findings:
  1) MPICH w/o shared memory performs the worst.
  2) TMPI is 70+ times faster than MPICH w/ shared memory for MPI_Bcast and MPI_Reduce.
  3) For TMPI, the performance of the 4x4 cases is roughly the sum of that of the 4x1 cases and that of the 1x4 cases.
Macro-Benchmark Performance
Conclusions

- Great advantage of threaded MPI execution on SMP clusters:
  - Micro-benchmarks: 70+ times faster than MPICH.
  - Macro-benchmarks: 100% faster than MPICH.
- Optimization techniques:
  - Separated collective and point-to-point communication channels.
  - Adaptive buffer management.
  - Hierarchy-aware communications.
Background: Safe Execution of MPI Programs Using Threads

- Program transformation: eliminate global and static variables (called permanent variables).
- Thread-specific data (TSD):
  - Each thread can associate a pointer-sized data variable with a commonly defined key value (an integer).
  - With the same key, different threads can set/get their own copies of the data variable.
- TSD-based transformation:
  - Each permanent variable declaration is replaced with a KEY declaration.
  - Each node associates its private copy of the permanent variable with the corresponding key.
  - Wherever a permanent variable is referenced, the global key is used to retrieve the per-thread copy of the variable.
Program Transformation – An Example
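The slide's original example is not reproduced here; below is a minimal sketch of the TSD-based transformation using POSIX thread-specific data, with an illustrative permanent variable named counter. The variable's declaration becomes a pthread key; each MPI node (now a thread) binds its own heap-allocated copy to the key at start-up, and every reference goes through the key.

    #include <pthread.h>
    #include <stdlib.h>

    /* Before:  int counter = 0;   -- unsafe once MPI nodes share a process. */

    /* After: the declaration is replaced with a key ... */
    static pthread_key_t  counter_key;
    static pthread_once_t counter_once = PTHREAD_ONCE_INIT;

    static void counter_key_create(void)
    {
        pthread_key_create(&counter_key, free);   /* free per-thread copy     */
    }

    /* ... each node binds its private copy to the key at start-up ... */
    static void counter_init(void)
    {
        pthread_once(&counter_once, counter_key_create);
        int *mine = calloc(1, sizeof *mine);
        pthread_setspecific(counter_key, mine);
    }

    /* ... and every reference retrieves the per-thread copy via the key. */
    static int *counter(void)
    {
        return pthread_getspecific(counter_key);
    }

    /* Original use:  counter++;      Transformed use:  (*counter())++;  */

The transformation preserves the one-copy-per-node semantics of process-based execution while letting all MPI nodes on a machine share a single address space.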