Optimization of Collective Communication in Intra-Cell MPI

Ashok Srinivasan
Florida State University

Goals
1. Efficient implementation of collectives for intra-Cell MPI
2. Evaluate the impact of different algorithms on performance

Collaborators: A. Kumar (1), G. Senthilkumar (1), M. Krishna (1), N. Jayam (1), P.K. Baruah (1), R. Sarma (1), S. Kapoor (2)
(1) Sri Sathya Sai University, Prashanthi Nilayam, India
(2) IBM, Austin

Acknowledgment: IBM, for providing access to a Cell blade under the VLP program
Outline
- Cell Architecture
- Intra-Cell MPI Design Choices
- Barrier
- Broadcast
- Reduce
- Conclusions and Future Work
Cell Architecture
- A PowerPC core (PPE), with 8 co-processors (SPEs) with 256 KB local store each
- Shared 512 MB - 2 GB main memory, which SPEs can access through DMA
- Peak speeds of Gflops in single precision and Gflops in double precision for SPEs
- GB/s EIB bandwidth, 25.6 GB/s for memory
- Two Cell processors can be combined to form a Cell blade with global shared memory

Figure: DMA put times
Figure: Memory to memory copy using (i) SPE local store, (ii) memcpy by PPE
Intra-Cell MPI Design Choices
Cell features
- In-order execution, but DMAs can be out of order
- Over 100 simultaneous DMAs can be in flight
Constraints
- Unconventional, heterogeneous architecture
- SPEs have limited functionality, and can act directly only on local stores
- SPEs access main memory through DMA
- Use of PPE should be limited to get good performance
MPI design choices
- Application data in: (i) local store or (ii) main memory
- MPI data in: (i) local store or (ii) main memory
- PPE involvement: (i) active or (ii) only during initialization and finalization
- Collective calls can: (i) synchronize or (ii) not synchronize
Barrier (1)
- OTA List: “Root” receives notification from all others, and then acknowledges through a DMA list
- OTA: Like OTA List, but root notifies others through individual non-blocking DMAs
- SIG: Like OTA, but others notify root through a signal register in OR mode
- Degree-k TREE
  - In each step, a node has k-1 children
  - In the first phase, children notify parents
  - In the second phase, parents acknowledge children
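The two phases of the degree-k TREE barrier can be sketched as a small simulation. This is only an illustration under stated assumptions: the slide does not give the tree layout, so a k-nomial tree rooted at SPU 0 is assumed here (it is one layout in which a node gains k-1 children per step), and the names `tree_children` and `tree_barrier_order` are hypothetical. Notifications and acknowledgments are modeled as (sender, receiver) pairs rather than real DMAs.

```python
def tree_children(rank, nprocs, k):
    """Children of `rank`, grouped by step, in an assumed k-nomial tree rooted at 0."""
    steps = []
    stride = 1
    while stride < nprocs:
        if rank % (stride * k) != 0:
            break  # from this step on, `rank` is itself a child
        kids = [rank + m * stride for m in range(1, k) if rank + m * stride < nprocs]
        steps.append(kids)  # at most k-1 children in this step
        stride *= k
    return steps


def tree_barrier_order(nprocs, k):
    """Return (notify, ack): one valid ordering of the messages in both phases."""
    notify, ack = [], []

    def gather(rank):
        # Phase 1: a child notifies its parent only after its own subtree has arrived.
        for kids in tree_children(rank, nprocs, k):
            for kid in kids:
                gather(kid)
                notify.append((kid, rank))

    def release(rank):
        # Phase 2: a parent acknowledges its children, which then release theirs.
        for kids in reversed(tree_children(rank, nprocs, k)):
            for kid in kids:
                ack.append((rank, kid))
                release(kid)

    gather(0)
    release(0)
    return notify, ack
```

With 8 SPUs and k = 2, the root collects from SPUs 1, 2 and 4 in successive steps, and every other SPU sends exactly one notification and receives exactly one acknowledgment.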
Barrier (2)
- PE: Consider SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension
- DIS: In step i, SPU j sends to SPU j + 2^i and receives from SPU j - 2^i

Comparison of MPI_Barrier on different hardware
P | Cell (PE) µs | Xeon/Myrinet µs | NEC SX-8 µs | SGI Altix BX2 µs
80.4 10 13 14 5

Alternatives
- Atomic increments in main memory: several microseconds
- PPE coordinates using mailbox: tens of microseconds
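The partner patterns of PE and DIS can be written down directly. The sketch below is illustrative only: it assumes SPU indices wrap around modulo the SPU count in DIS (the pattern usually called a dissemination barrier), and that PE is run on a power-of-two number of SPUs. All function names are hypothetical.

```python
def dissemination_partners(rank, nprocs):
    """(send_to, recv_from) for each DIS step, assuming indices wrap modulo nprocs."""
    rounds = []
    dist = 1  # 2^i for step i
    while dist < nprocs:
        rounds.append(((rank + dist) % nprocs, (rank - dist) % nprocs))
        dist *= 2
    return rounds


def pairwise_exchange_partners(rank, nprocs):
    """Hypercube neighbor of `rank` along each dimension (PE)."""
    if nprocs & (nprocs - 1):
        raise ValueError("this sketch assumes a power-of-two SPU count")
    return [rank ^ (1 << i) for i in range(nprocs.bit_length() - 1)]


def everyone_heard_from_everyone(nprocs):
    """Check that after all DIS steps each SPU has transitively heard from all others."""
    knows = [{r} for r in range(nprocs)]
    nrounds = len(dissemination_partners(0, nprocs))
    for i in range(nrounds):
        # Each SPU merges what its step-i sender knew at the start of the step.
        knows = [
            knows[r] | knows[dissemination_partners(r, nprocs)[i][1]]
            for r in range(nprocs)
        ]
    return all(k == set(range(nprocs)) for k in knows)
```

The last function is the reason DIS works as a barrier for any SPU count, not only powers of two: after ceil(log2 P) steps every SPU depends on the arrival of every other.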
Broadcast (1)
Figure: OTA on 4 SPUs
- OTA: Each SPE copies data to its location
  - Different shifts are used to avoid hotspots in memory
  - Different shifts on larger numbers of SPUs yield results that are close to each other
Figure: AG on 16 SPUs
- AG: Each SPE is responsible for a different portion of data
  - Different minimum sizes are tried
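The two ideas on this slide, shifted starting points in OTA and per-SPE portions with a minimum size in AG, might look as follows. This is a sketch under assumptions, since the slide gives neither definition precisely: a "shift" is taken to mean that SPE r starts copying r * shift chunks into the buffer and wraps around, and the "minimum size" is taken to mean that fewer SPEs take part when the message is too small to give each one a portion of that size. The names `ota_chunk_order` and `ag_portions` are hypothetical.

```python
def ota_chunk_order(rank, nchunks, shift):
    """Order in which SPE `rank` copies the chunks of the source buffer.

    Staggering the start keeps SPEs from reading the same region at the same time.
    """
    start = (rank * shift) % nchunks
    return [(start + i) % nchunks for i in range(nchunks)]


def ag_portions(nbytes, nprocs, min_size):
    """Split `nbytes` into contiguous (offset, length) portions, one per participating SPE."""
    # Use fewer SPEs when the data cannot give each one at least `min_size` bytes.
    workers = max(1, min(nprocs, nbytes // min_size))
    base, extra = divmod(nbytes, workers)
    portions, offset = [], 0
    for w in range(workers):
        length = base + (1 if w < extra else 0)  # spread the remainder evenly
        portions.append((offset, length))
        offset += length
    return portions
```

For example, with a shift of 1 and 4 chunks, SPE 1 copies chunks in the order 1, 2, 3, 0, while a 1024-byte message with a 256-byte minimum is handled by 4 of 16 SPEs.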
Broadcast (2)
Figure: TREEMM on 12 SPUs
- TREEMM: Tree structured Send/Recv type implementation
  - Data for degrees 2 and 4 are close
  - Degree 3 is best, or close to it, for all SPU counts
Figure: TREE on 16 SPUs
- TREE: Pipelined tree structured communication based on local stores
  - Results are similar to this figure for other SPU counts
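The benefit of pipelining in TREE can be seen in an idealized step-count model. The sketch below is not the implementation behind the figures. It assumes a heap-shaped tree (node r has children degree*r + 1 through degree*r + degree), assumes a node can forward one chunk to all of its children in a single step, and ignores local store limits. The name `pipelined_tree_broadcast` is hypothetical.

```python
def pipelined_tree_broadcast(nprocs, degree, nchunks):
    """Simulate a chunked broadcast from node 0; return (steps, received chunks per node)."""
    children = {
        r: [c for c in range(degree * r + 1, degree * r + degree + 1) if c < nprocs]
        for r in range(nprocs)
    }
    received = {r: [] for r in range(nprocs)}
    received[0] = list(range(nchunks))  # the root starts with all the data
    forwarded = [0] * nprocs            # chunks each node has already passed on
    steps = 0
    while any(len(received[r]) < nchunks for r in range(nprocs)):
        # Decide every send from the state at the start of the step, then apply them,
        # so a chunk moves down at most one tree level per step.
        sends = [
            (r, forwarded[r])
            for r in range(nprocs)
            if children[r] and forwarded[r] < len(received[r])
        ]
        for r, chunk in sends:
            for c in children[r]:
                received[c].append(chunk)
            forwarded[r] += 1
        steps += 1
    return steps, received
```

In this model a broadcast of n chunks over a tree of depth d finishes in d + n - 1 steps, whereas forwarding the whole message level by level would take d * n chunk transfers' worth of time.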
Broadcast (3)
Figure: Broadcast on 16 SPEs (2 processors)
- TREE: Pipelined tree structured communication based on LS
- TREEMM: Tree structured Send/Recv type implementation
- AG: Each SPE is responsible for a different portion of data
- OTA: Each SPE copies data to its location
- G: Root copies all data
Figure: Broadcast with good choice of algorithms for each data size and SPE count
- Maximum main memory bandwidth is also shown
Broadcast (4)
- Each node of the SX-8 has 8 vector processors capable of 16 Gflop/s, with 64 GB/s bandwidth to memory from each processor
  - The total bandwidth to memory for a node is 512 GB/s
  - Nodes are connected through a crossbar switch capable of 16 GB/s in each direction
- The Altix is a CC-NUMA system with a global shared memory
  - Each node contains eight Itanium 2 processors
  - Nodes are connected using NUMALINK4; bandwidth between processors on a node is 3.2 GB/s, and between nodes 1.6 GB/s

Comparison of MPI_Bcast on different hardware
Data Size | Cell (PE) µs (P = 8, P = 16) | Infiniband µs (P = 8, P = 16) | NEC SX-8 µs (P = 8, P = 16) | SGI Altix BX2 µs (P = 8, P = 16)
B 18 10 1 KB 25 KB MB 100 215 2600 3100
Reduce
Figure: Reduce of MPI_INT with MPI_SUM on 16 SPUs
- Similar trends were observed for other SPU counts

Comparison of MPI_Reduce on different hardware
Data Size | Cell (PE) µs | IBM SP µs | NEC SX-8 µs | SGI Altix BX2 µs
P = 8 | P = 16 | P = 8 | P = 16 | P = 8 | P =
B 40 1 KB 60 1 MB 230 350
- Each node of the IBM SP was a 16-processor SMP
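For readers unfamiliar with the operation being timed, the result of a Reduce of MPI_INT with MPI_SUM is the elementwise sum of every rank's integer vector, delivered to the root. The slide does not say which algorithm was used on the Cell, so the sketch below assumes a plain binomial tree purely for illustration; the name `tree_reduce_sum` is hypothetical.

```python
def tree_reduce_sum(vectors):
    """Elementwise integer sum of equal-length vectors; the result ends up at rank 0.

    vectors[r] is the contribution of rank r. An assumed binomial tree is used:
    in each round, rank r receives from rank r + stride and adds it to its partial sum.
    """
    partial = [list(v) for v in vectors]  # copy so the inputs are left untouched
    nprocs = len(partial)
    stride = 1
    while stride < nprocs:
        for rank in range(0, nprocs, 2 * stride):
            src = rank + stride
            if src < nprocs:
                partial[rank] = [a + b for a, b in zip(partial[rank], partial[src])]
        stride *= 2
    return partial[0]
```

With 16 ranks this takes 4 rounds, and the rank count does not need to be a power of two.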
Conclusions and Future Work
Conclusions
- The Cell processor has good potential for MPI implementations
  - PPE should have a limited role
  - High bandwidth and low latency even with application data in main memory
    - But local store should be used effectively, with double buffering to hide latency
    - Main memory bandwidth is then the bottleneck
Current and future work
- Implemented: Collective communication operations optimized for contiguous data
- Future work: Optimize collectives for derived data types with non-contiguous data
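Double buffering, as mentioned in the conclusions, means starting the transfer of the next chunk before working on the current one, so that transfer time overlaps with computation. The sketch below only mimics the control flow: a single worker thread stands in for an asynchronous DMA from main memory into a local store buffer, and the name `double_buffered_sum` and the summation workload are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def double_buffered_sum(main_memory, chunk):
    """Sum `main_memory` chunk by chunk, prefetching the next chunk during computation."""

    def fetch(start):
        # Stand-in for a DMA get into one of two local store buffers.
        return main_memory[start:start + chunk]

    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, 0)            # fill the first buffer
        for start in range(0, len(main_memory), chunk):
            current = pending.result()            # wait for the buffer in flight
            nxt = start + chunk
            if nxt < len(main_memory):
                pending = dma.submit(fetch, nxt)  # start filling the other buffer
            total += sum(current)                 # compute while the transfer proceeds
    return total
```

Only two buffers are live at any time, the one being processed and the one being filled, which is what makes the scheme fit in a small local store.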