Implementation and Optimization of MPI point-to-point communications on SMP-CMP clusters with RDMA capability
MPI point-to-point communication
Pairing MPI_Send with MPI_Recv, or MPI_Isend/MPI_Irecv with MPI_Wait.
There is an implicit synchronization – the receiver can complete only after the sender performs the send; the communication operation cannot complete until both sender and receiver are ready.
MPI point-to-point communication
Use different protocols for large and small messages.
Eager protocol for small messages
  Low-latency communication
  Sender does not depend on the receiver
Rendezvous protocol for large messages
  No message copy
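The size-based choice between the two protocols can be sketched as a simple threshold test. This is a minimal illustration, not the design from these slides: the function name `select_protocol` and the 8 KiB limit are assumptions made for the example, and real MPI libraries tune the eager limit per interconnect.

```python
# Hypothetical eager limit; the slides do not state a value.
EAGER_LIMIT_BYTES = 8 * 1024


def select_protocol(message_bytes, eager_limit=EAGER_LIMIT_BYTES):
    """Return which protocol a message of the given size would use.

    Small messages go eager (low latency, sender does not wait for the
    receiver); large messages go rendezvous (no message copy).
    """
    if message_bytes < 0:
        raise ValueError("message size must be non-negative")
    return "eager" if message_bytes <= eager_limit else "rendezvous"


if __name__ == "__main__":
    for size in (64, 8 * 1024, 1 << 20):
        print(size, select_protocol(size))
```

The threshold trades the cost of an extra copy for small messages against the cost of a handshake for large ones, so in practice it would be measured rather than fixed.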
Eager protocol
Rendezvous protocol
Existing RDMA based small message channel – the MVAPICH design [Liu03]
Our improved design – eliminating persistent buffer association
Further improvement – node-shared Small message channels
Optimizing Rendezvous protocol – ideal rendezvous protocol
SS – Send start, SW – Send wait, RS – Receive start, RW – Receive wait.
When both sender and receiver have initiated the communication, data transfer should start.
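The ideal behavior can be written as a small timing model: the transfer starts as soon as both SS and RS have occurred, that is, at max(SS, RS). The sketch below is illustrative only. The names `Timing` and `ideal_rendezvous` are hypothetical, and it is an assumption of this example (not a statement from the slides) that each wait call returns no earlier than the moment the transfer completes.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    ss: float  # Send start
    sw: float  # Send wait (time the sender calls wait)
    rs: float  # Receive start
    rw: float  # Receive wait (time the receiver calls wait)


def ideal_rendezvous(t, transfer_time):
    """Event times under the ideal protocol for one message."""
    # Data transfer begins once both sides have initiated the communication.
    start = max(t.ss, t.rs)
    done = start + transfer_time
    return {
        "transfer_start": start,
        "transfer_done": done,
        # Assumption: a wait returns when it is called or when the
        # transfer finishes, whichever is later.
        "send_wait_returns": max(t.sw, done),
        "recv_wait_returns": max(t.rw, done),
    }


if __name__ == "__main__":
    print(ideal_rendezvous(Timing(ss=0, sw=10, rs=2, rw=12), transfer_time=3))
```

In the usage line, both waits are called after the transfer has finished, so the communication is fully overlapped with computation and neither wait blocks.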
Optimizing Rendezvous protocol – the problem
Poor progress
Optimizing Rendezvous protocol – the problem
The performance is heavily affected by the timing of the events.
Is it possible to have near-optimal performance for all timing situations?
How to use these protocols
Dynamic protocol selection – design a mega-protocol that combines several of these protocols.
Profile-guided optimization – use profiling to determine the timing information, and use that information to select the protocol.
Compiler-assisted optimization – use compiler analysis to determine the timing information, and use that information to select the best-performing protocol.
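As a rough illustration of the profile-guided option, the sketch below picks a protocol for one call site from profiled start times. Everything named here is an assumption for the example: the slides do not name the protocol variants, so `sender_initiated_rendezvous` and `receiver_initiated_rendezvous` are hypothetical labels, and using the average skew between send start and receive start as the selection criterion is one plausible choice, not the authors' stated method.

```python
from statistics import mean

# Hypothetical variant names; the slides do not list the protocols combined.
PROTOCOL_FOR_PATTERN = {
    "sender_first": "sender_initiated_rendezvous",
    "receiver_first": "receiver_initiated_rendezvous",
}


def choose_protocol(profile):
    """Select a protocol from profiled (send_start, recv_start) pairs.

    Each pair holds the timestamps observed for one execution of the
    same communication call site during a profiling run.
    """
    if not profile:
        raise ValueError("profile must contain at least one sample")
    # Positive skew means the sender usually arrives before the receiver.
    avg_skew = mean(rs - ss for ss, rs in profile)
    pattern = "sender_first" if avg_skew >= 0 else "receiver_first"
    return PROTOCOL_FOR_PATTERN[pattern]


if __name__ == "__main__":
    samples = [(0.0, 2.0), (1.0, 1.5)]
    print(choose_protocol(samples))
```

A compiler-assisted variant would supply the same timing estimate from static analysis instead of measured samples.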