Parallel Scaling of parsparsecircuit3.c
Tim Warburton

1 Process Per Node: In these tests we use only one of the two processors on each node.
blackbear: 16 processors, 16 nodes
Apart from the MPI_Allreduce calls, this is an almost perfect picture of parallelism.
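To make the synchronization point concrete, here is a minimal sketch (not the actual parsparsecircuit3.c source; variable names are illustrative) of why MPI_Allreduce shows up in the trace: every process must block at the reduction until the slowest one arrives.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Stand-in for a local partial sum, e.g. one piece of a dot product
   * in an iterative sparse solver. */
  double local_dot = (double)(rank + 1);
  double global_dot = 0.0;

  /* Collective: all processes block here until everyone contributes,
   * so load imbalance appears as time spent in MPI_Allreduce. */
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);

  if (rank == 0)
    printf("global dot = %g\n", global_dot);

  MPI_Finalize();
  return 0;
}
```

Run under an MPI launcher, e.g. `mpirun -np 16 ./a.out`.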
2 Processes Per Node: We use both processors on each node.

blackbear: 8 nodes, 16 processes. Notice the prevalence of MPI_Waitany. Clearly this code does not work as well as it does when running with 1 process per node.
blackbear: 8 nodes, 16 processes (zoom in). I suspect that the threaded MPI communication serving the non-blocking MPI_Isend and MPI_Irecv calls is competing for CPU time with the user code. There could also be contention between the two processors on a node for the memory bus and the network interface.
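The pattern under discussion can be sketched as follows (a hedged illustration, not the actual exchange routine from parsparsecircuit3.c; the function name, buffer layout, and neighbor list are assumptions): non-blocking receives and sends are posted, then serviced with MPI_Waitany. Time blocked in that loop is what appears as MPI_Waitany in the trace.

```c
#include <mpi.h>
#include <stdlib.h>

/* Exchange one contiguous block of 'count' doubles with each of
 * 'nneighbors' neighboring ranks using non-blocking point-to-point
 * calls, completing them in whatever order they finish. */
void exchange(double *sendbuf, double *recvbuf, int count,
              int nneighbors, const int *neighbors) {
  MPI_Request *reqs = malloc(2 * nneighbors * sizeof(MPI_Request));

  for (int n = 0; n < nneighbors; ++n) {
    MPI_Irecv(recvbuf + n * count, count, MPI_DOUBLE, neighbors[n],
              0, MPI_COMM_WORLD, &reqs[n]);
    MPI_Isend(sendbuf + n * count, count, MPI_DOUBLE, neighbors[n],
              0, MPI_COMM_WORLD, &reqs[nneighbors + n]);
  }

  /* Service requests as they complete; if the MPI library drives
   * progress with an internal thread, that thread competes with the
   * user computation for the CPU when both processors per node are
   * already running application processes. */
  for (int done = 0; done < 2 * nneighbors; ++done) {
    int idx;
    MPI_Waitany(2 * nneighbors, reqs, &idx, MPI_STATUS_IGNORE);
    /* ...local work on completed pieces could be overlapped here... */
  }

  free(reqs);
}
```

One design point: MPI_Waitany lets computation proceed on whichever message arrives first, rather than imposing a fixed completion order as MPI_Wait in a loop would.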
Timings for M=1024 (N=1024^2), blackbear, -O3:

nodes | Nprocs | wallclock time
Timings for Two Processes Per Node on Los Lobos:

nodes | Nprocs | wallclock time

Timings courtesy of Zhaoxian Zhou.