Slide 1: Parallel Scaling of parsparsecircuit3.c
Tim Warburton
Slide 2: One Process Per Node
In these tests we use only one of the two processors on each node.
Slide 3: blackbear, 16 processors, 16 nodes
Slide 4
Apart from the MPI_Allreduce calls, this is an almost perfect picture of parallelism.
Slide 5: Two Processes Per Node
We use both processors on each node.
Slide 6: blackbear, 8 nodes, 16 processes
Notice the prevalence of MPI_Waitany. Clearly the code does not work as well here as it does with one process per node.
Slide 7: blackbear, 8 nodes, 16 processes (zoom in)
I suspect that the threaded MPI communicators servicing the non-blocking MPI_Isend and MPI_Irecv calls are competing for CPU time with the user code. There could also be competition between the two processors for the memory bus and the network interface.
Slide 8: Timings for M=1024 (N=1024^2), blackbear, -O3

Two processes per node:
nodes  Nprocs  wallclock time
1      2       19.4909
2      4       9.85369
4      8       5.01486
8      16      3.19801
16     32      3.77791

One process per node:
nodes  Nprocs  wallclock time
1      1       19.2675
2      2       10.2486
4      4       5.43999
8      8       2.79451
16     16      1.43782
Slide 9: Timings for Two Processes Per Node on Los Lobos

nodes  Nprocs  wallclock time
1      2       8.9453
2      4       4.47474
4      8       2.17246
8      16      1.15644

Timings courtesy of Zhaoxian Zhou.