Comparison of MPI Benchmark Programs on an SGI Altix ccNUMA Shared Memory Machine
Nor Asilah Wati Abdul Hamid, Paul Coddington, Francis Vaughan
School of Computer Science, University of Adelaide
IPDPS - PMEO, April 2006
Motivation
- MPI benchmark programs were developed and tested mainly on distributed memory machines.
- Measuring MPI performance on ccNUMA shared memory machines may be more complex than on distributed memory machines.
- Our recent work on measuring MPI performance on the SGI Altix showed differences in the results from some MPI benchmarks.
- There are no existing detailed comparisons of results from different MPI benchmarks.
- So, we aim to:
  - Compare the results of different MPI benchmarks on the SGI Altix.
  - Investigate the reasons for any variations, which are expected to arise from:
    - Differences in the measurement techniques
    - Implementation details
    - Default configurations of the different benchmarks
MPI Benchmark Experiments on the Altix
- MPI benchmarks tested: Pallas MPI Benchmark (PMB), SKaMPI, MPBench, Mpptest, MPIBench.
- Measurements were done on the SGI Altix 3700 at SAPAC:
  - ccNUMA architecture
  - 160 Intel Itanium 2 1.3 GHz processors
  - 160 GB of RAM
  - SGI NUMAlink3 for communication
  - SGI Linux (ProPac3)
  - SGI MPI library
- Used MPI_DSM_CPULIST for process binding (a simple binding check is sketched below):
  - Processes must be bound to CPUs for best performance.
  - Avoided use of CPU 0 (used for system processes).
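A minimal sketch of how each rank's CPU binding could be checked (our own illustration, not part of any of the benchmarks). MPI_DSM_CPULIST itself is set in the environment before launching the job; the example value below is hypothetical, and sched_getcpu() is a Linux/glibc call that may not be available on older systems.

```c
/* Minimal sketch for verifying process binding (illustration only,
 * not part of any benchmark). MPI_DSM_CPULIST is set in the
 * environment before launching, e.g.
 *   setenv MPI_DSM_CPULIST 1-8
 * (a hypothetical list that skips CPU 0).
 * Each rank reports the CPU it is currently running on. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    char *cpulist;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cpulist = getenv("MPI_DSM_CPULIST");
    printf("rank %d running on CPU %d (MPI_DSM_CPULIST=%s)\n",
           rank, sched_getcpu(), cpulist ? cpulist : "unset");

    MPI_Finalize();
    return 0;
}
```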
SGI Altix Architecture
[Diagram: architecture of a 128-processor SGI Altix (Itanium 2). C-brick: CPUs and memory; R-brick: router interconnect.]
Avoiding Message Buffering Using Single Copy
- It is possible to avoid the need to buffer messages in SGI MPI: SGI MPI has single copy options with better performance.
- Single copy is the default for MPI_Isend, MPI_Sendrecv, MPI_Alltoall, MPI_Bcast, MPI_Allreduce and MPI_Reduce for messages above a specified size (default is 2 KBytes).
- However, it is not enabled by default for MPI_Send.
- Single copy (non-buffered) sends can be forced using MPI_BUFFER_MAX (see the sketch below).
- Using a single copy send can give significantly better communication performance, almost a factor of 10 in some cases.
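A minimal sketch of a send/receive timing loop that allows SGI MPI's single copy path to be used. MPI_BUFFER_MAX is, as we understand it, an SGI MPI environment variable giving a message-size threshold; the exact value and semantics shown in the comments are assumptions to be checked against the SGI MPI documentation.

```c
/* Minimal sketch: timing MPI_Send/MPI_Recv with buffers that allow
 * SGI MPI to use its single copy path. MPI_BUFFER_MAX is set in the
 * environment before launch, e.g.
 *   setenv MPI_BUFFER_MAX 2000
 * so that messages larger than this size are candidates for single
 * copy (threshold semantics per our reading of the SGI MPI docs). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, reps = 100, n = 1 << 20;   /* 1 MByte messages */
    double t0;
    char *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Separate, heap-allocated buffers: using the same array for the
       send and receive data can prevent the single copy optimisation. */
    sendbuf = malloc(n);
    recvbuf = malloc(n);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(sendbuf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(recvbuf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(sendbuf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("average round trip time: %g us\n",
               (MPI_Wtime() - t0) / reps * 1e6);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```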
Point-to-Point Communication
- Different MPI benchmarks use different communication patterns to measure MPI_Send/MPI_Recv.
- MPIBench measures the effects of contention when all processors take part in point-to-point communication concurrently, not just the time for a ping-pong communication between two processors:
  - processor p communicates with processor (p + n/2) mod n, where n is the total number of processors (a sketch of this pattern is given below).
[Diagram: 8 processors P0-P7, each exchanging messages with the processor halfway around the ring.]
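A minimal sketch of the concurrent pairing pattern described above (an illustration of the pattern, not MPIBench's actual source):

```c
/* MPIBench-style concurrent point-to-point test: every rank is paired
 * with the rank (p + n/2) mod n, so all processors send and receive at
 * the same time and interconnect contention is included in the
 * measured time. Assumes an even number of processes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, reps = 1000, nbytes = 1024;
    char sendbuf[1024], recvbuf[1024];
    double t0, t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int partner = (rank + nprocs / 2) % nprocs;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank < nprocs / 2) {          /* lower half sends first */
            MPI_Send(sendbuf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(recvbuf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {                          /* upper half receives first */
            MPI_Recv(recvbuf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(sendbuf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        }
    }
    t = (MPI_Wtime() - t0) / (2 * reps);  /* per one-way message */
    printf("rank %d <-> %d: %.2f us\n", rank, partner, t * 1e6);

    MPI_Finalize();
    return 0;
}
```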
- PMB and Mpptest involve processors 0 and 1 only.
- MPBench uses the first and last processor.
- SKaMPI tests for the processor with the slowest communication time to P0 and uses that pair for measurements (a sketch of this pair selection follows below).
[Diagrams: pairing of P0 with P1 (PMB, Mpptest) and of the first with the last processor (MPBench) on 8 processors.]
- These different approaches may give very different results on a hierarchical ccNUMA architecture.
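A minimal sketch of SKaMPI-style pair selection (an illustration of the idea, not SKaMPI's actual code): rank 0 ping-pongs with every other rank and the slowest partner is chosen for the subsequent measurements.

```c
/* Sketch of selecting the slowest partner for rank 0, as SKaMPI is
 * described to do (illustration only). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, reps = 100, nbytes = 1024;
    char buf[1024];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int slowest = 1;
    double worst = 0.0;

    for (int q = 1; q < nprocs; q++) {
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) {
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                MPI_Send(buf, nbytes, MPI_CHAR, q, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, q, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            }
            double t = (MPI_Wtime() - t0) / reps;
            if (t > worst) { worst = t; slowest = q; }
        } else if (rank == q) {
            for (int i = 0; i < reps; i++) {
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
    }
    if (rank == 0)
        printf("slowest partner for P0: rank %d (%.2f us round trip)\n",
               slowest, worst * 1e6);

    MPI_Finalize();
    return 0;
}
```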
MPI_Send/MPI_Recv
[Figure: comparison of results from the different MPI benchmarks for point-to-point (send/receive) communication on 8 CPUs.]
- MPIBench results are highest since all processors communicate at once, so there is contention.
- SKaMPI and MPBench are the next highest because they measure between the first and last CPUs, which are on different C-bricks.
- PMB and Mpptest results are lowest since they measure communication time between CPUs on the same node.
- In measuring single copy MPI_Send:
  - The results for SKaMPI and MPIBench were the same as for the default MPI setting that uses buffered copy.
  - This occurs because both SKaMPI and MPIBench use the same array to hold the send and receive message data (see the sketch below).
  - This appears to be an artefact of the SGI MPI implementation.
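A small illustration of the buffer usage described above (hypothetical code, not taken from SKaMPI or MPIBench): pattern A reuses one array for both the outgoing and incoming message, which appears to prevent SGI MPI from applying its single copy optimisation, whereas the disjoint buffers of pattern B allow it.

```c
#include <mpi.h>
#include <stdlib.h>

void pattern_a(int partner, int n)   /* same array for send and recv */
{
    char *buf = malloc(n);
    MPI_Send(buf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(buf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    free(buf);
}

void pattern_b(int partner, int n)   /* disjoint send and recv buffers */
{
    char *sendbuf = malloc(n);
    char *recvbuf = malloc(n);
    MPI_Send(sendbuf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, n, MPI_CHAR, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    free(sendbuf);
    free(recvbuf);
}
```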
MPI_Bcast
[Figure: comparison between MPI benchmarks for MPI_Bcast on 8 CPUs.]
- The main difference is that, by default:
  - SKaMPI, Mpptest and PMB assume data is not held in cache memory.
  - MPIBench and MPBench do preliminary "warm-up" repetitions to ensure data is in cache before measurements are taken.
- Another difference is that collective communication time is measured at the root node:
  - For broadcast, however, the root node is the first to finish, and this may lead to biased results (a pipelining effect).
  - One solution is to insert a barrier operation before each repetition.
  - Mpptest and PMB adopt a different approach: they assign a different root processor for each repetition, which also acts to clear the cache.
  - On a distributed memory machine this has little effect on the results; on a ccNUMA shared memory machine, however, moving the root to a different processor has a significant overhead.
- Changing the benchmarks to remove these differences gives very similar results (within a few percent). A sketch of the two timing approaches is shown below.
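A minimal sketch of the two MPI_Bcast timing approaches described above (illustrative only, not the benchmarks' actual code):

```c
#include <mpi.h>

/* Approach 1: fixed root, with a barrier before every repetition to
 * stop the root racing ahead of the other processes (pipelining). */
double time_bcast_fixed_root(void *buf, int count, int reps)
{
    double t0;
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Bcast(buf, count, MPI_CHAR, 0, MPI_COMM_WORLD);
    }
    return (MPI_Wtime() - t0) / reps;   /* includes the barrier cost */
}

/* Approach 2 (Mpptest/PMB style): rotate the root each repetition,
 * which also flushes the data from the previous root's cache. */
double time_bcast_rotating_root(void *buf, int count, int reps)
{
    int nprocs;
    double t0;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Bcast(buf, count, MPI_CHAR, i % nprocs, MPI_COMM_WORLD);
    return (MPI_Wtime() - t0) / reps;
}
```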
Times for Different Processors
- SKaMPI and MPIBench provide average times for each processor.
- MPIBench also provides distributions of times over all processors or for individual processors.
- The effects of the binary tree algorithm can be seen.
[Figures: per-node times produced by SKaMPI for MPI_Bcast at 4 MBytes on 8 CPUs; distribution results produced by MPIBench for MPI_Bcast at 4 MBytes on 8 CPUs.]
MPI_Barrier
- The SKaMPI result is a bit higher than MPIBench and PMB.
- This is probably due to the global clock synchronization that SKaMPI uses by default for its measurements; the SKaMPI authors claim this is more accurate since it avoids the pipelining effect. (A sketch of this kind of clock-offset estimation is given below.)
[Figure: comparison between MPI benchmarks for MPI_Barrier from 2 to 128 CPUs.]
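A minimal sketch of the clock-offset estimation that a globally synchronized measurement relies on (illustrative only; SKaMPI's actual window-based scheme is more elaborate):

```c
/* Estimate the offset between rank 0's clock and a peer's clock.
 * Rank 0's MPI_Wtime plus this offset gives a common time base for
 * scheduling globally synchronized measurements. */
#include <mpi.h>

double estimate_offset(int peer, int rounds)
{
    int rank;
    double offset = 0.0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < rounds; i++) {
        if (rank == 0) {
            double t0, t1, tremote;
            t0 = MPI_Wtime();
            MPI_Send(&t0, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(&tremote, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            t1 = MPI_Wtime();
            /* assume the peer's timestamp was taken halfway through
               the round trip */
            offset += tremote - 0.5 * (t0 + t1);
        } else if (rank == peer) {
            double tmp, tlocal;
            MPI_Recv(&tmp, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            tlocal = MPI_Wtime();
            MPI_Send(&tlocal, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    return offset / rounds;   /* meaningful on rank 0 only */
}
```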
MPI_Scatter
[Figure: comparison between MPI benchmarks for MPI_Scatter on 32 CPUs.]
- Only MPIBench and SKaMPI measure MPI_Scatter and MPI_Gather.
- They use a similar approach and report very similar times.
- There is an unexpected hump for data sizes between 128 bytes and 2 KBytes per process; SGI MPI uses buffered communications for message sizes less than 2 KBytes.
- The time for MPI_Scatter grows remarkably slowly with data size.
MPI_Gather
[Figure: comparison between MPI benchmarks for MPI_Gather on 32 CPUs.]
- Time is approximately proportional to total data size for a fixed number of CPUs.
- The bottleneck is the root processor, which gathers the data.
- Times are slower for more CPUs due to serialization and contention effects.
MPI_Alltoall
[Figure: comparison between MPI benchmarks for MPI_Alltoall on 32 CPUs.]
- Measured by MPIBench, PMB and SKaMPI.
- Results for 32 processors are similar to MPI_Scatter, but with a sharper increase for larger data sizes, probably indicating the effects of contention.
- Results from the different benchmarks mostly agree within about 10%.
Summary
- Different MPI benchmarks can give significantly different results for certain MPI routines on the SGI Altix.
  - This is not usually the case for typical distributed memory architectures.
  - The differences are due to the different measurement techniques used by the benchmarks.
- For point-to-point communications:
  - Different communication patterns
  - Differences in how averages are computed
  - Implementation details of SGI MPI on the Altix, which affect whether single copy is used (e.g. SKaMPI and MPIBench for MPI_Send/Recv)
- For some collective communication routines:
  - Different defaults for the use of cache
  - Differences in synchronizing calls to the routines on each processor
- The hierarchical ccNUMA architecture enhances the effects of these differences.
Recommendations
- Users of MPI benchmarks on shared memory machines should be careful in interpreting the results.
- MPI benchmarks were primarily designed for, and tested on, distributed memory machines; consideration should be given to how they perform on shared memory machines.
- Developers of MPI benchmarks might consider modifying their benchmark programs to provide more accurate results for ccNUMA shared memory machines.
END