Non-blocking communications in RK dynamics: current status and future work
Stefano Zampini, CASPUR/CNMCA
WG-6 PP POMPA @ COSMO GM, Rome, September 6 2011
Halo exchange in COSMO
- 3 types of point-to-point communication: 2 partially non-blocking and 1 fully blocking (with MPI_SENDRECV)
- Halo swapping needs completion of the East-West exchange before starting the South-North one (implicit corner exchange, see the sketch below)
- Also: choice between explicit buffering and derived MPI datatypes
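A minimal sketch of the blocking two-phase swap and of the implicit corner exchange, assuming explicit buffering; subroutine and variable names are hypothetical and buffer packing/unpacking is omitted (this is not the actual exchg_boundaries code):

  SUBROUTINE swap_blocking(field, ie, je, nbl, neigh_w, neigh_e, neigh_s, neigh_n, icomm)
    USE mpi
    IMPLICIT NONE
    INTEGER, INTENT(IN)    :: ie, je, nbl, neigh_w, neigh_e, neigh_s, neigh_n, icomm
    REAL,    INTENT(INOUT) :: field(ie, je)
    REAL    :: zbuf_snd(MAX(ie, je)*nbl), zbuf_rcv(MAX(ie, je)*nbl)
    INTEGER :: istatus(MPI_STATUS_SIZE), ierr

    ! Phase 1: exchange halo columns with the Eastern/Western neighbours
    ! (only one direction shown, the symmetric one is analogous).
    ! MPI_SENDRECV blocks, so the E-W halo is complete before phase 2 starts.
    CALL MPI_SENDRECV(zbuf_snd, je*nbl, MPI_REAL, neigh_e, 1, &
                      zbuf_rcv, je*nbl, MPI_REAL, neigh_w, 1, icomm, istatus, ierr)

    ! Phase 2: exchange halo rows with the Northern/Southern neighbours over
    ! the FULL i extent, i.e. including the halo columns just received in
    ! phase 1: this makes the corner exchange implicit (no diagonal messages).
    CALL MPI_SENDRECV(zbuf_snd, ie*nbl, MPI_REAL, neigh_n, 2, &
                      zbuf_rcv, ie*nbl, MPI_REAL, neigh_s, 2, icomm, istatus, ierr)
  END SUBROUTINE swap_blocking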
Details on non-blocking exchange
- Full halo exchange including corners: twice the number of messages, same amount of data on the network
- 3 different stages: send, receive and wait
- Minimizing overhead: at the first time step, persistent requests are created with calls to MPI_SEND_INIT and MPI_RECV_INIT
- During the model run: MPI_STARTALL is used to start the requests, MPI_TESTANY/MPI_WAITANY to complete them (sketched below)
- Current implementation with explicit send and receive buffering only: needs to be extended to derived MPI datatypes
- Strategy used in RK dynamics (manual implementation):
  - Sends are posted as soon as the needed data has been computed locally
  - Receives are posted as soon as the receive buffer is ready to be used
  - Waits are posted just before the data is needed for the next local computation
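A minimal sketch of the persistent-request pattern for a single East-West swap, assuming explicit buffering; module, subroutine and buffer names are hypothetical and the packing/unpacking code is omitted:

  MODULE mo_halo_persistent
    USE mpi
    IMPLICIT NONE
    INTEGER, PARAMETER :: nreq = 4       ! send/recv towards West and East
    INTEGER            :: ireq(nreq)     ! persistent requests kept as module variables
    LOGICAL            :: lfirst = .TRUE.
  CONTAINS
    SUBROUTINE swap_ew(sndbuf_w, rcvbuf_w, sndbuf_e, rcvbuf_e, ncount, neigh_w, neigh_e, icomm)
      INTEGER, INTENT(IN)    :: ncount, neigh_w, neigh_e, icomm
      REAL,    INTENT(INOUT) :: sndbuf_w(ncount), rcvbuf_w(ncount), sndbuf_e(ncount), rcvbuf_e(ncount)
      INTEGER :: istatus(MPI_STATUS_SIZE), idx, ierr

      IF (lfirst) THEN
        ! First time step only: create the persistent requests. The
        ! (buffer, count, datatype, peer, tag, communicator) tuples are fixed,
        ! so the same actual buffers must be passed at every call.
        CALL MPI_SEND_INIT(sndbuf_w, ncount, MPI_REAL, neigh_w, 1, icomm, ireq(1), ierr)
        CALL MPI_RECV_INIT(rcvbuf_e, ncount, MPI_REAL, neigh_e, 1, icomm, ireq(2), ierr)
        CALL MPI_SEND_INIT(sndbuf_e, ncount, MPI_REAL, neigh_e, 2, icomm, ireq(3), ierr)
        CALL MPI_RECV_INIT(rcvbuf_w, ncount, MPI_REAL, neigh_w, 2, icomm, ireq(4), ierr)
        lfirst = .FALSE.
      END IF

      ! Every time step: restart all requests, then complete them one by one,
      ! so each receive buffer can be unpacked as soon as its message arrives.
      CALL MPI_STARTALL(nreq, ireq, ierr)
      DO
        CALL MPI_WAITANY(nreq, ireq, idx, istatus, ierr)
        IF (idx == MPI_UNDEFINED) EXIT   ! all requests are now inactive
        ! unpack rcvbuf_w / rcvbuf_e here when idx corresponds to a receive
      END DO
    END SUBROUTINE swap_ew
  END MODULE mo_halo_persistent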
New synopsis for the swap subroutine
Current call: subroutine exchg_boundaries. The new subroutine iexchg_boundaries takes 4 more arguments (usage sketched under "New Implementation" below):
- ilocalreq(16): array of requests (integers declared as module variables, one for each swap scenario inside the module)
- operation(3): array of logicals indicating which stages to perform (send, recv, wait)
- istartpar, iendpar: needed for the corners' definition
New Implementation
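A hedged sketch of how the split-phase interface can be driven from the RK dynamics; iexchg_sketch, its argument list and the request slots are illustrative only, not the actual iexchg_boundaries interface:

  PROGRAM rk_split_phase_sketch
    IMPLICIT NONE
    INTEGER :: ilocalreq(16)       ! one slot per swap scenario (module variable in the real code)
    LOGICAL :: operation(3)        ! (1) post sends, (2) post receives, (3) wait

    ! Boundary data just computed locally: post sends and receives, do not wait.
    operation = (/ .TRUE., .TRUE., .FALSE. /)
    CALL iexchg_sketch(ilocalreq, operation)

    ! ... local computation on the interior that does not need halo values ...

    ! Halo values needed by the next computation: complete the pending requests.
    operation = (/ .FALSE., .FALSE., .TRUE. /)
    CALL iexchg_sketch(ilocalreq, operation)
  CONTAINS
    SUBROUTINE iexchg_sketch(ireq, op)
      INTEGER, INTENT(INOUT) :: ireq(:)
      LOGICAL, INTENT(IN)    :: op(3)
      ! Placeholder: in the real code this starts (MPI_STARTALL) and/or
      ! completes (MPI_TESTANY/MPI_WAITANY) the persistent requests
      ! selected by op(1:3).
    END SUBROUTINE iexchg_sketch
  END PROGRAM rk_split_phase_sketch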
Benchmark details
- COSMO RAPS 5.0 with MeteoSwiss namelists (25 hours of forecast)
- COSMO-2 (520x350x60, dt = 20 s) and COSMO-7 (393x338x60, dt = 60 s)
- Decompositions: tiny (10x12+4), small (20x24+4) and usual (28x35+4)
- Code compiled with Intel ifort 11.1.072 and HP-MPI:
  COMFLG1 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG2 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG3 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
  COMFLG4 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
  LDFLG   = -finline-functions -O3
- Runs on the PORDOI Linux cluster at CNMCA: 128 dual-socket quad-core nodes (1024 cores in total)
- Each socket: quad-core Intel Xeon E5450 @ 3.00 GHz with 1 GB RAM per core
- Profiling with Scalasca 1.3.3 (very small overhead)
Early results: COSMO-7
[Charts: total time (s) for the model runs; mean total time for the RK dynamics]
Early results: COSMO-2
[Charts: total time (s) for the model runs; mean total time for the RK dynamics]
Comments and future work
- Almost identical computational times for the test cases considered with the Intel/HP-MPI configuration
- Not shown: 5% improvement in computational times with PGI/MVAPICH2 (but with worse absolute times)
- CFL check performed only locally when izdebug < 2
- Still a lot of synchronization in collective calls during the multiplicative filling of the semi-Lagrangian scheme: Allreduce and Allgather operations in multiple calls to the sum_DDI subroutine (a bottleneck for more than 1000 cores)
- Bad performance in w_bbc_rk_up5 during the RK loop over the small time steps. Rewrite the loop code?
- What about automatic detection/insertion of swapping calls in the microphysics and other parts of the code?
- Is Testany/Waitany the most efficient way to ensure completion? (one alternative is sketched below)
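One possible alternative to the Testany/Waitany completion loop, sketched below with hypothetical names: complete all pending requests in a single MPI_WAITALL call. Whether this beats per-request completion (which lets each buffer be unpacked as soon as it arrives) would have to be measured.

  SUBROUTINE complete_all(ireq, nreq)
    USE mpi
    IMPLICIT NONE
    INTEGER, INTENT(IN)    :: nreq
    INTEGER, INTENT(INOUT) :: ireq(nreq)
    INTEGER :: istatuses(MPI_STATUS_SIZE, nreq), ierr

    ! Single call instead of a loop over MPI_WAITANY/MPI_TESTANY;
    ! all receive buffers are safe to unpack only after this returns.
    CALL MPI_WAITALL(nreq, ireq, istatuses, ierr)
  END SUBROUTINE complete_all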