1 Porting DL_MESO_DPD on GPUs
Jony Castagna (Hartree Centre, Daresbury Laboratory, STFC)

2 What is DL_MESO (DPD)?
DL_MESO is a general-purpose mesoscale simulation package developed by Michael Seaton for CCP5 and UKCOMES under a grant provided by EPSRC. It is written in Fortran90 and C++ and supports both the Lattice Boltzmann Equation (LBE) and Dissipative Particle Dynamics (DPD) methods.

3 ...similar to MD
- Free spherical particles which interact over a range that is of the same order as their diameters.
- The particles can be thought of as assemblies or aggregates of molecules, such as solvent molecules or polymers, or more simply as carriers of momentum.
[Figure: particles i and j within the cut-off for short-range forces, plus long-range forces]
F_i is the sum of conservative, drag and random (or stochastic) pair forces:
F_i = Σ_{j≠i} ( F_ij^C + F_ij^D + F_ij^R )
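To make the three pair-force terms concrete, below is a minimal CUDA sketch of a single DPD pair interaction using the standard Groot-Warren weight w(r) = 1 - r/rc. It is not the DL_MESO_DPD kernel: the function name, the parameters (A, gamma, sigma, rc, theta, invSqrtDt) and the use of single precision are assumptions made only for illustration.

// Minimal sketch of one DPD pair-force evaluation (illustrative only).
// F_ij = F^C + F^D + F^R, with w^D(r) = w(r)^2 and w(r) = 1 - r/rc.
__device__ float3 pairForceDPD(float3 ri, float3 rj, float3 vi, float3 vj,
                               float A, float gamma, float sigma,
                               float rc, float invSqrtDt, float theta)
{
    // relative position and velocity of the pair (i, j)
    float3 dr = make_float3(ri.x - rj.x, ri.y - rj.y, ri.z - rj.z);
    float3 dv = make_float3(vi.x - vj.x, vi.y - vj.y, vi.z - vj.z);
    float  r2 = dr.x*dr.x + dr.y*dr.y + dr.z*dr.z;

    float3 f = make_float3(0.f, 0.f, 0.f);
    if (r2 < rc*rc && r2 > 0.f) {
        float  r   = sqrtf(r2);
        float  w   = 1.0f - r/rc;                         // weight function
        float3 e   = make_float3(dr.x/r, dr.y/r, dr.z/r); // unit vector from j to i
        float  edv = e.x*dv.x + e.y*dv.y + e.z*dv.z;

        float fc = A * w;                                 // conservative
        float fd = -gamma * w * w * edv;                  // drag (dissipative)
        float fr = sigma * w * theta * invSqrtDt;         // random, theta ~ N(0,1)

        float fm = fc + fd + fr;                          // total magnitude along e
        f = make_float3(fm*e.x, fm*e.y, fm*e.z);
    }
    return f;                                             // contribution to F_i
}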

4 Examples of DL_MESO_DPD applications
Vesicle formation, lipid bilayer, phase separation, polyelectrolyte.
Reference: DL_MESO: highly scalable mesoscale simulations, Molecular Simulation 39 (10), 2013.

5 Common Algorithms
Integrate the Newton equations of motion for the particles in time using:
- velocity Verlet algorithm => O(N) (see the sketch after this list)
- cell-linked method + Verlet list for short-range interactions => O(N)
- Ewald summation for long-range interactions => O(N^2) or O(N^(3/2))
- SPME summation for long-range interactions => O(N*log(N))
- domain decomposition for parallelization using MPI/OpenMP
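As a reminder of what the two velocity Verlet half-steps do on the GPU (one thread per particle), here is a minimal CUDA sketch; the array names r, v, f, mass, the flat float3 layout and single precision are assumptions for illustration and not the DL_MESO_DPD data structures.

// First half-step: half-kick with the current forces, then drift.
__global__ void vvFirstStep(float3 *r, float3 *v, const float3 *f,
                            const float *mass, float dt, int nPart)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per particle
    if (i >= nPart) return;

    float h = 0.5f * dt / mass[i];
    v[i].x += h * f[i].x;  v[i].y += h * f[i].y;  v[i].z += h * f[i].z;
    r[i].x += dt * v[i].x; r[i].y += dt * v[i].y; r[i].z += dt * v[i].z;
}

// Second half-step: half-kick with the newly computed forces.
__global__ void vvSecondStep(float3 *v, const float3 *f,
                             const float *mass, float dt, int nPart)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPart) return;

    float h = 0.5f * dt / mass[i];
    v[i].x += h * f[i].x;  v[i].y += h * f[i].y;  v[i].z += h * f[i].z;
}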

6 Porting on NVidia GPU
DL_MESO_DPD on GPU (host = CPU, device = GPU):
- initialisation, IO, etc. (Fortran)
- pass arrays to C (Fortran)
- copy to device (CUDA C)
- main loop, repeated until time = final time:
  - first step of the VV algorithm
  - construct the re-organized cell-linked array
  - find the particle-particle forces
  - second step of the VV algorithm
  (all of the above done by the GPU with 1 thread per particle)
  - gather statistics: currently done by the host, without overlapping computation and communication
- pass arrays back to Fortran and end
A host-side sketch of this main loop is given below.
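The sketch below only mirrors the structure of the flow chart; every kernel and array name is a placeholder, the kernel bodies are deliberately left empty, and the real code receives its arrays from the Fortran side through the C interface before this point.

#include <cuda_runtime.h>

// Empty stand-ins for the real kernels (one thread per particle in the port).
__global__ void kFirstStepVV   (float3 *r, float3 *v, const float3 *f, float dt, int n) {}
__global__ void kBuildCellArray(const float3 *r, float3 *rCell, int *npc, int n)        {}
__global__ void kPairForces    (const float3 *rCell, float3 *f, int n)                  {}
__global__ void kSecondStepVV  (float3 *v, const float3 *f, float dt, int n)            {}

// Host-side driver: the four kernels run on the GPU; statistics are gathered
// by the host each step, so for now there is a synchronisation (and copy)
// every iteration with no overlap of computation and communication.
extern "C" void gpuMainLoop(float3 *d_r, float3 *d_v, float3 *d_f,
                            float3 *d_rCell, int *d_npc,
                            float dt, int nSteps, int nPart)
{
    const int nThreads = 128;
    const int nBlocks  = (nPart + nThreads - 1) / nThreads;

    for (int step = 0; step < nSteps; ++step) {                  // time = final time?
        kFirstStepVV   <<<nBlocks, nThreads>>>(d_r, d_v, d_f, dt, nPart);
        kBuildCellArray<<<nBlocks, nThreads>>>(d_r, d_rCell, d_npc, nPart);
        kPairForces    <<<nBlocks, nThreads>>>(d_rCell, d_f, nPart);
        kSecondStepVV  <<<nBlocks, nThreads>>>(d_v, d_f, dt, nPart);

        cudaDeviceSynchronize();
        // ...copy the reduced quantities back and gather statistics on the CPU...
    }
    // afterwards: copy the arrays back to the host and return control to Fortran
}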

7 Main problem: memory access pattern
Cell-linked method, e.g. i = 7: the interactions in cell j for particle i are 7-6, 7-13, 7-16, etc.
[Figure: one thread per particle; the positions x2...x16, y2...y16, z2...z16 are laid out contiguously by particle index]
Because particle locations are stored in contiguous (particle) order, the particles belonging to one cell sit at scattered addresses: very uncoalesced access to the memory!

8 Reorganize the cell-linked array
Cell locations are now stored in contiguous order: the k-th particle slot of neighbouring cells (j-lx, j-lx+1, ..., j-1, j, j+1, ..., j+lx) lies at consecutive addresses, so while thread j accesses location j-lx, thread j+1 accesses location j-lx+1: coalesced!
[Figure: positions and velocities re-packed cell by cell, e.g. x16 x10 x2 x7 x30 x25 for the first particle slot of consecutive cells]
This saves N = 6*(maxNpc-1) uncoalesced accesses per time step per thread; the factor 6 comes from 3 positions and 3 velocities (maxNpc = max number of particles per cell).
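A minimal sketch of the idea follows. It is simplified to one thread per cell so that the access pattern is easy to see, whereas the actual port uses one thread per particle; the kernel name, the separate xCell/yCell/zCell arrays and the per-cell counter npc are assumptions made for illustration.

// In the re-organised layout the k-th particle slot of every cell is stored
// contiguously across cells: index = slot * nCells + cell. Consecutive
// threads therefore read consecutive addresses -> coalesced loads.
__device__ __forceinline__
int slotIndex(int cell, int slot, int nCells) { return slot * nCells + cell; }

__global__ void sumPositionsPerCell(const float *xCell, const float *yCell,
                                    const float *zCell, const int *npc,
                                    float3 *cellSum, int nCells, int maxNpc)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;    // one thread per cell here
    if (j >= nCells) return;

    float3 s = make_float3(0.f, 0.f, 0.f);
    for (int k = 0; k < npc[j] && k < maxNpc; ++k) {
        int idx = slotIndex(j, k, nCells);            // threads j and j+1 hit
        s.x += xCell[idx];                            // adjacent addresses
        s.y += yCell[idx];
        s.z += zCell[idx];
    }
    cellSum[j] = s;
}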

9 Shared Memory usage (?)
[Figure: the same cell-major blocks for the cells j-lx, j-lx+1, ..., j-1, j, j+1, ..., j+lx]
Load the data into the shared memory: this can save up to 13*SM/6 readings from global memory!
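A minimal sketch of the staging pattern being considered is shown below; the tile size, the array names and the trivial "copy out" at the end are purely illustrative assumptions. In a force kernel the staged tile would instead be re-read many times by the whole block, which is where the savings in global-memory reads would come from.

#define TILE 128   // slot entries staged per block (illustrative; blockDim.x == TILE)

__global__ void stageCellsInShared(const float *xCell, const float *yCell,
                                   const float *zCell, float3 *out, int nSlots)
{
    __shared__ float sx[TILE], sy[TILE], sz[TILE];

    int g = blockIdx.x * TILE + threadIdx.x;          // global slot index
    if (g < nSlots) {                                 // one coalesced global load per array
        sx[threadIdx.x] = xCell[g];
        sy[threadIdx.x] = yCell[g];
        sz[threadIdx.x] = zCell[g];
    }
    __syncthreads();                                  // tile now visible to the whole block

    // From here every thread in the block can re-read these positions as many
    // times as it needs (e.g. for all pair interactions) at shared-memory cost.
    if (g < nSlots)
        out[g] = make_float3(sx[threadIdx.x], sy[threadIdx.x], sz[threadIdx.x]);
}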

10 Results for a Gaussian electrolyte system
Size: 46x46x46 (reduced units)
Algorithm: Ewald summation → reciprocal-space vector length k = 23!
Charges: ±1 on all particles (neutral plasma)
[Figure: speedup of Tesla P100 vs Intel IvyBridge (E5-2697), 12 cores; a typical DPD simulation has 5 particles/cell]
At 77% of the GPU memory used: need multiple GPUs!
Thread scheduling overhead (?)

11 Energy & Costs Tesla P100 IvyBridge (E5-2697) 12-cores ratio
Current Cost (Jun-17) £5084.0 £ 2 Max Power (W) 250 130 1.9 Theoretical peak (TeraFLOPS) 4.7 0.336 14 Bandwidth GB/s 732 59.7 12! Mixture of 0.5M particles (no electrostatic force) particles are clustering very quickly introducing strong imbalance between threads! 11 11

12 Summary: pros and cons…
Pros (for a 4x speedup):
- 2x more cost efficient
- 2x more power efficient
- users can easily install 1 or more GPUs on their workstation: you don't need a cluster!
Cons:
- need to maintain at least 2 versions of the code (of which 1 is in Fortran!); suggestion: never introduce new physics directly in the CUDA version
- porting the main framework took around 4 months (and still not all the physics is covered!)
- linked to a single vendor (NVidia)

13 …and future plans
- 3rd APOD cycle: use shared memory (?) → ~5M particles
- add the SPME method
- extend to multiple GPUs on the same node (NVLink, MPI) → ~20M particles (4 GPUs per node)
- extend to multiple nodes, splitting the CPU and GPU workloads (MPI/OpenMP) → ~500M particles (30 nodes)

14 Acknowledgments
STFC: Michael Seaton, Silvia Chiacchiera, Leon Petit, Luke Mason and the HPSE group!
ECAM EU*: Alan O'Cais, Liang Liang
* Grant Agr. N

15 Questions?

