1 Non-blocking communications in RK dynamics
Non-blocking communications in RK dynamics: current status and future work. Stefano Zampini, CASPUR/CNMCA. WG-6 PP, COSMO General Meeting, Rome, September.

2 Halo exchange in COSMO
Three types of point-to-point communication: two partially non-blocking and one fully blocking (with MPI_SENDRECV). Halo swapping needs completion of the East-to-West exchange before starting the South-to-North communication (implicit corner exchange). Also: choice between explicit buffering and derived MPI datatypes. A minimal sketch of the blocking variant is given below.
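This is not the actual COSMO routine: a minimal free-form Fortran sketch of what the fully blocking variant with MPI_SENDRECV looks like for one direction, assuming hypothetical neighbour ranks neigh_w/neigh_e, explicit packing buffers and an nhalo-wide halo. It only illustrates why this pass must finish before the South-North pass, which carries the corner values implicitly.

! Hedged sketch (not the actual COSMO exchange code): fully blocking
! East-to-West halo swap with MPI_SENDRECV on explicit buffers.
! Neighbour ranks neigh_w/neigh_e and the buffer layout are hypothetical.
SUBROUTINE swap_east_to_west (field, ie, je, ke, nhalo, neigh_w, neigh_e, icomm)
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: ie, je, ke, nhalo, neigh_w, neigh_e, icomm
  REAL,    INTENT(INOUT) :: field(ie, je, ke)          ! halo columns included
  REAL    :: sendbuf(nhalo*je*ke), recvbuf(nhalo*je*ke)
  INTEGER :: istat(MPI_STATUS_SIZE), ierr, ncount

  ncount = nhalo*je*ke

  ! Pack the westmost interior columns, send them to the West neighbour
  ! (which stores them in its East halo), and in the same call receive the
  ! matching columns from the East neighbour into the local East halo.
  sendbuf = RESHAPE(field(nhalo+1:2*nhalo, :, :), (/ ncount /))
  CALL MPI_SENDRECV(sendbuf, ncount, MPI_REAL, neigh_w, 100,  &
                    recvbuf, ncount, MPI_REAL, neigh_e, 100,  &
                    icomm, istat, ierr)
  field(ie-nhalo+1:ie, :, :) = RESHAPE(recvbuf, (/ nhalo, je, ke /))

  ! The opposite direction and the subsequent South-North pass (which
  ! carries the corners implicitly) follow the same pattern.
END SUBROUTINE swap_east_to_west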

3 Details on nonblocking exchange
Full halo exchange including corners: twice as many messages, but the same amount of data on the network. Three different stages: send, receive and wait. Minimizing overhead: at the first time step, persistent requests are created with calls to MPI_SEND_INIT and MPI_RECV_INIT. During the model run, MPI_STARTALL is used to start the requests, and MPI_TESTANY/MPI_WAITANY to complete them. The current implementation works with explicit send and receive buffering only; it still needs to be extended to derived MPI datatypes. Strategy used in RK dynamics (manual implementation): sends are posted as soon as the needed data has been computed locally; receives are posted as soon as the receive buffer is ready to be reused; waits are posted just before the data is needed for the next local computation. A sketch of the persistent-request pattern is given below.
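This is not the iexchg_boundaries code itself, only a sketch of the persistent-request pattern the slide describes (MPI_SEND_INIT/MPI_RECV_INIT once at the first time step, MPI_STARTALL at every step, MPI_WAITANY for completion). The module name, buffer layout, neighbour array neigh(4) and tag are hypothetical, and the request array is simplified to 8 entries rather than the 16 swap scenarios of the real module.

! Hedged sketch of the persistent-request pattern; buffers and neighbour
! ranks are hypothetical and must be the same actual arguments on every
! call, since the persistent requests are bound to them at creation time.
MODULE halo_persistent
  USE mpi
  IMPLICIT NONE
  INTEGER, PARAMETER :: nreq = 8            ! 4 sends + 4 receives
  INTEGER :: ilocalreq(nreq)
  LOGICAL :: linit = .FALSE.
CONTAINS
  SUBROUTINE exchange (sendbuf, recvbuf, ncount, neigh, icomm)
    INTEGER, INTENT(IN)    :: ncount, neigh(4), icomm
    REAL,    INTENT(INOUT) :: sendbuf(ncount,4), recvbuf(ncount,4)
    INTEGER :: i, idx, ierr, istat(MPI_STATUS_SIZE)

    ! First time step only: create the persistent requests.
    IF (.NOT. linit) THEN
      DO i = 1, 4
        CALL MPI_SEND_INIT(sendbuf(1,i), ncount, MPI_REAL, neigh(i), 200, &
                           icomm, ilocalreq(i),   ierr)
        CALL MPI_RECV_INIT(recvbuf(1,i), ncount, MPI_REAL, neigh(i), 200, &
                           icomm, ilocalreq(4+i), ierr)
      END DO
      linit = .TRUE.
    END IF

    ! Every time step: restart all requests, then complete them one at a
    ! time so that unpacking can start as soon as a message has arrived.
    CALL MPI_STARTALL(nreq, ilocalreq, ierr)
    DO i = 1, nreq
      CALL MPI_WAITANY(nreq, ilocalreq, idx, istat, ierr)
      ! IF (idx > 4) unpack recvbuf(:, idx-4) here
    END DO
  END SUBROUTINE exchange
END MODULE halo_persistent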

4 New synopsis for swap subroutine
Current call is to subroutine exchg_boundaries; the call to subroutine iexchg_boundaries takes 4 more arguments: - ilocalreq(16): array of requests (integers declared as module variables, one for each swap scenario inside the module) - operation(3): array of logicals indicating which stages to perform (send, recv, wait) - istartpar, iendpar: needed for the definition of the corners. A hedged sketch of such an interface is given below.
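The argument names below (ilocalreq, operation, istartpar, iendpar) come from the slide, but the body is an assumption: a minimal sketch of how a logical operation(3) array could gate the send, receive and wait stages separately, not the actual iexchg_boundaries implementation. Packing and unpacking are left as comments.

! Hedged sketch of the extended interface: operation = (/ lsend, lrecv, lwait /)
! selects which stage(s) to perform on this call. Requests 1-4 are assumed
! to be the sends and 5-8 the receives; the helper steps are hypothetical.
SUBROUTINE iexchg_boundaries (field, ilocalreq, operation, istartpar, iendpar, icomm)
  USE mpi
  IMPLICIT NONE
  REAL,    INTENT(INOUT) :: field(:,:,:)
  INTEGER, INTENT(INOUT) :: ilocalreq(:)     ! persistent requests (module variable)
  LOGICAL, INTENT(IN)    :: operation(3)     ! (/ do_send, do_recv, do_wait /)
  INTEGER, INTENT(IN)    :: istartpar, iendpar, icomm
  INTEGER :: i, idx, ierr, istat(MPI_STATUS_SIZE)

  IF (operation(2)) THEN
    ! Receives are started as soon as the receive buffers may be reused.
    CALL MPI_STARTALL(4, ilocalreq(5:8), ierr)
  END IF

  IF (operation(1)) THEN
    ! pack send buffers from field(istartpar:iendpar, ...) here (hypothetical)
    CALL MPI_STARTALL(4, ilocalreq(1:4), ierr)
  END IF

  IF (operation(3)) THEN
    ! Waits are posted just before the halo data is needed locally.
    DO i = 1, 8
      CALL MPI_WAITANY(8, ilocalreq, idx, istat, ierr)
      ! IF (idx > 4) unpack receive buffer idx-4 into field here (hypothetical)
    END DO
  END IF
END SUBROUTINE iexchg_boundaries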

5 New Implementation

6 Benchmark details
COSMO RAPS 5.0 with the MeteoSwiss namelist (25 hours of forecast). Cosmo2 (520x350x60, dt 20) and Cosmo7 (393x338x60, dt 60). Decompositions: tiny (10x12+4), small (20x24+4) and usual (28x35+4). Code compiled with Intel ifort and HP-MPI:
COMFLG1 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
COMFLG2 = -xssse3 -O3 -fp-model precise -free -fpp -override-limits -convert big_endian
COMFLG3 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
COMFLG4 = -xssse3 -O2 -fp-model precise -free -fpp -override-limits -convert big_endian
LDFLG = -finline-functions -O3
Runs on the PORDOI Linux cluster at CNMCA: 128 dual-socket quad-core nodes (1024 cores in total). Each socket: a quad-core Intel Xeon with 1 GB of RAM per core. Profiling with Scalasca (very small overhead).

7 Early results: COSMO 7
Charts: total time (s) for the model runs; mean total time for RK dynamics.

8 Early results: COSMO 2
Charts: total time (s) for the model runs; mean total time for RK dynamics.

9 Comments and future work
Almost the same computational times for the test cases considered with the Intel/HP-MPI configuration. Not shown: a 5% improvement in computational times with PGI/MVAPICH2 (but with worse absolute times). CFL check performed only locally with izdebug<2. Still a lot of synchronization in collective calls during multiplicative filling in the semi-Lagrangian scheme: Allreduce and Allgather operations in multiple calls to the sum_DDI subroutine (a bottleneck above 1000 cores). Bad performance in w_bbc_rk_up5 during the RK loop over the small time steps: rewrite the loop code? What about automatic detection/insertion of swapping calls in microphysics and other parts of the code? Is Testany/Waitany the most efficient way to ensure completion? (A sketch of an alternative is given below.)
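On that last question, a hedged comparison sketch: the per-message MPI_WAITANY loop versus a single MPI_WAITALL call. The request array stands in for ilocalreq; everything else is an assumption, and which variant is faster would have to be measured, since it depends on how much unpacking can overlap with the remaining messages.

! Hedged sketch of two completion strategies for the same request array.
SUBROUTINE complete_requests (ilocalreq, nreq, lwaitall)
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: nreq
  INTEGER, INTENT(INOUT) :: ilocalreq(nreq)
  LOGICAL, INTENT(IN)    :: lwaitall
  INTEGER :: i, idx, ierr
  INTEGER :: istat(MPI_STATUS_SIZE), istats(MPI_STATUS_SIZE, nreq)

  IF (lwaitall) THEN
    ! One library call; unpacking can only start once everything has arrived.
    CALL MPI_WAITALL(nreq, ilocalreq, istats, ierr)
  ELSE
    ! One call per message; unpacking can start as soon as any message
    ! completes, at the cost of repeatedly scanning the request array.
    DO i = 1, nreq
      CALL MPI_WAITANY(nreq, ilocalreq, idx, istat, ierr)
      ! unpack the buffer associated with request idx here (hypothetical)
    END DO
  END IF
END SUBROUTINE complete_requests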

