Porting DL_MESO_DPD on GPUs

Presentation transcript:

Porting DL_MESO_DPD on GPUs
Jony Castagna (Hartree Centre, Daresbury Laboratory, STFC)

What is DL_MESO (DPD)?
DL_MESO is a general-purpose mesoscale simulation package developed by Michael Seaton for CCP5 and UKCOMES under a grant provided by EPSRC. It is written in Fortran90 and C++ and supports both the Lattice Boltzmann Equation (LBE) and Dissipative Particle Dynamics (DPD) methods.
https://www.scd.stfc.ac.uk/Pages/DL_MESO.aspx

...similar to MD
- Free spherical particles which interact over a range that is of the same order as their diameters.
- The particles can be thought of as assemblies or aggregates of molecules, such as solvent molecules or polymers, or more simply as carriers of momentum.
[Figure: two particles i and j, with a cut-off for short-range forces plus long-range forces]
Fi is the sum of conservative, drag and random (or stochastic) pair forces, as written below.
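In the standard DPD notation (the symbols follow the usual convention, they are not spelled out on the slide), this decomposition reads:

```latex
% Total force on particle i: sum over neighbours j within the cut-off of the
% conservative (C), drag/dissipative (D) and random (R) pair contributions.
F_i = \sum_{j \ne i} \left( F^{C}_{ij} + F^{D}_{ij} + F^{R}_{ij} \right)
```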

Examples of DL_MESO_DPD applications
[Figures: vesicle formation, lipid bilayer, phase separation, polyelectrolyte]
DL_MESO: highly scalable mesoscale simulations, Molecular Simulation 39 (10), pp. 796-821, 2013

Common Algorithms
Integrate the Newton equations of motion for the particles in time using:
- velocity Verlet (VV) algorithm => O(N) (sketched below)
- cell-linked method + Verlet list for short-range interactions => O(N)
- Ewald summation for long-range interactions => O(N^2) or O(N^(3/2))
- SPME summation for long-range interactions => O(N log N)
- domain decomposition for parallelization using MPI/OpenMP
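For reference, the velocity Verlet scheme mentioned above in its usual two-half-step form (standard textbook form, not taken from the slide):

```latex
% First half-step: half-kick the velocities with the current forces, then drift the positions.
v_i\!\left(t+\tfrac{\Delta t}{2}\right) = v_i(t) + \tfrac{\Delta t}{2}\,\frac{F_i(t)}{m_i},
\qquad
r_i(t+\Delta t) = r_i(t) + \Delta t\, v_i\!\left(t+\tfrac{\Delta t}{2}\right)

% Second half-step: recompute the forces at the new positions, then finish the kick.
v_i(t+\Delta t) = v_i\!\left(t+\tfrac{\Delta t}{2}\right) + \tfrac{\Delta t}{2}\,\frac{F_i(t+\Delta t)}{m_i}
```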

Porting on NVidia GPU
Flow of DL_MESO_DPD on GPU (host = CPU, device = GPU):
- initialisation, IO, etc. (Fortran)
- pass arrays to C (Fortran) and copy them to the device (CUDA C)
- start main loop:
  - first step of the VV algorithm
  - construct the re-organized cell-linked array
  - find particle-particle forces
  - second step of the VV algorithm
  (all of the above done by the GPU with 1 thread per particle; a minimal kernel sketch follows below)
  - gather statistics (currently done by the host, without overlapping computation and communication)
  - if time = final time exit the loop, otherwise repeat
- pass arrays back to Fortran and end
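As a rough illustration of the "one thread per particle" pattern above, a minimal sketch of what the first velocity Verlet step could look like in CUDA. All names (firstVVStep, pos, vel, force, mass, nParticles) and the plain double3 layout are assumptions for the sketch, not the actual DL_MESO_DPD CUDA code.

```cuda
#include <cuda_runtime.h>

// Sketch: first velocity Verlet step, one thread per particle.
// Half-kick the velocity with the current force, then drift the position.
__global__ void firstVVStep(int nParticles, double dt,
                            const double *mass, const double3 *force,
                            double3 *pos, double3 *vel)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per particle
    if (i >= nParticles) return;

    double h = 0.5 * dt / mass[i];
    vel[i].x += h * force[i].x;      // half-step velocity update
    vel[i].y += h * force[i].y;
    vel[i].z += h * force[i].z;

    pos[i].x += dt * vel[i].x;       // full-step position update
    pos[i].y += dt * vel[i].y;
    pos[i].z += dt * vel[i].z;
}
```

It would be launched with something like firstVVStep<<<(nParticles + 255) / 256, 256>>>(...); the second VV step is the mirror image, applying the remaining half-kick once the new forces are available.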

Main problem: memory access pattern
With the cell-linked method, the interactions in cell j for particle i = 7 are 7-6, 7-13, 7-16, etc.
[Figure: thread n works on particle n; the coordinates x2…x16, y2…y16, z2…z16 are laid out contiguously in particle order]
Particle locations are stored contiguously in particle order, so the neighbours of a particle within a cell sit at scattered addresses: very uncoalesced access to the memory! (A sketch of this naive gather is given below.)
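To make the problem concrete, a hypothetical naive force kernel over the classic head/list cell-linked structure might look like the sketch below (the names and the head/list layout are assumptions, and only the particle's own cell is scanned for brevity). Consecutive threads handle consecutive particle indices but chase unrelated indices j through the list, so their loads of pos[j] hit scattered addresses: uncoalesced.

```cuda
// Sketch: naive per-particle gather over a cell-linked list (own cell only).
// head[c] is the first particle of cell c, list[p] the next particle in the
// same cell, -1 terminates the chain; all of this is assumed for illustration.
__global__ void forcesNaive(int nParticles, const int *cellOfParticle,
                            const int *head, const int *list,
                            const double3 *pos, double3 *force)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    int c = cellOfParticle[i];
    for (int j = head[c]; j >= 0; j = list[j])   // walk this cell's chain
    {
        if (j == i) continue;
        // Scattered, uncoalesced reads: neighbouring threads load pos[j]
        // for unrelated particle indices j.
        double dx = pos[i].x - pos[j].x;
        double dy = pos[i].y - pos[j].y;
        double dz = pos[i].z - pos[j].z;
        // ... the DPD pair force would be evaluated from (dx, dy, dz) here
        force[i].x += dx;  force[i].y += dy;  force[i].z += dz;  // placeholder
    }
}
```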

Reorganize the cell-linked array
Cell locations are now stored in contiguous order: thread j accesses the data at location j-lx, thread j+1 accesses location j-lx+1, and so on, so consecutive threads touch consecutive memory locations: coalesced!
[Figure: per-cell columns j-lx, j-lx+1, …, j-1, j, j+1, j+lx holding the x, y, z coordinates of the particles of each cell]
This saves N = 6*(maxNpc-1) uncoalesced accesses per time step per thread: 6 because there are 3 positions and 3 velocities (maxNpc = maximum number of particles per cell). A sketch of how such a cell-ordered array can be built follows below.
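One possible way to build such a cell-ordered layout, sketched under assumptions (fixed maxNpc slots per cell, a posCell array indexed as slot * nCells + cell, and an atomic counter per cell); this is not claimed to be DL_MESO_DPD's actual implementation:

```cuda
// Sketch: scatter each particle into a slot of its cell. For a fixed slot,
// consecutive cells are contiguous in posCell, so a later kernel in which
// thread j handles cell j reads the data fully coalesced.
__global__ void buildCellArray(int nParticles, int nCells, int maxNpc,
                               const int *cellOfParticle, const double3 *pos,
                               int *npc,          // per-cell particle counters, zeroed beforehand
                               double3 *posCell)  // size maxNpc * nCells
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nParticles) return;

    int c = cellOfParticle[i];
    int slot = atomicAdd(&npc[c], 1);            // claim the next free slot of cell c
    if (slot < maxNpc)
        posCell[slot * nCells + c] = pos[i];
}
```

npc would be reset (e.g. with cudaMemset) before every rebuild; velocities would be reorganized in the same way, which is where the factor 6 (3 positions + 3 velocities) in the saving above comes from.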

Shared memory usage (?)
Load the cell data into shared memory (see the sketch below): this saves up to 13*SM/6 readings from global memory!
[Figure: the per-cell columns j-lx, …, j-1, j, j+1, j+lx of x, y, z coordinates staged in shared memory]
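A minimal sketch of the shared-memory idea, again under assumptions (one thread per cell, a fixed block width, halo cells at the block edges ignored for brevity): each block stages the positions of its cells once, so neighbouring threads re-read them from shared memory instead of global memory.

```cuda
#define MAX_NPC 8        // assumed upper bound on particles per cell
#define BLOCK_CELLS 128  // cells handled by one block (= blockDim.x)

// Sketch: thread j works on cell j. Each thread copies its own cell's positions
// into shared memory (coalesced across the block for each slot s); the force
// loop would then read neighbouring cells that fall inside the block from sPos
// instead of re-loading them from global memory.
__global__ void forcesShared(int nCells, const int *npc, const double3 *posCell)
{
    __shared__ double3 sPos[MAX_NPC][BLOCK_CELLS];

    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < nCells)
    {
        int n = npc[cell];
        for (int s = 0; s < n; ++s)
            sPos[s][threadIdx.x] = posCell[s * nCells + cell];  // coalesced load
    }
    __syncthreads();

    // ... pair forces for the particles of `cell` are computed here, taking the
    // partner positions from sPos[][] for the neighbour cells held by this block.
}
```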

Results for a Gaussian electrolyte system
Size: 46x46x46 (reduced units)
Algorithm: Ewald summation → reciprocal-space vector length k = 23!
Charges: ±1 on all particles (neutral plasma)
[Plot: speedup of a Tesla P100 vs a 12-core Intel IvyBridge (E5-2697); annotations mark the typical DPD simulation density (5 particles/cell), the point where 77% of the GPU memory is used (need multiple GPUs!) and a possible thread-scheduling overhead]

Energy & Costs

                               Tesla P100   IvyBridge (E5-2697) 12-cores   ratio
Current cost (Jun-17)          £5084.0      £2531.99                       2
Max power (W)                  250          130                            1.9
Theoretical peak (TeraFLOPS)   4.7          0.336                          14
Bandwidth (GB/s)               732          59.7                           12!

Mixture of 0.5M particles (no electrostatic force): the particles cluster very quickly, introducing a strong imbalance between threads!

Summary: pros and cons…
Pros (for a 4x speedup):
- 2x more cost efficient
- 2x more power efficient
- users can easily install 1 or more GPUs on their workstation: you don't need a cluster!
Cons:
- need to maintain at least 2 versions of the code (of which 1 is in Fortran!); suggestion: never introduce new physics directly in the CUDA version
- porting the main framework took around 4 months (and still not all the physics is covered!)
- linked to a single vendor (NVidia)

…and future plans
- use shared memory (?) in a 3rd APOD cycle → ~5M particles
- add the SPME method; extend to multiple GPUs on the same node (NVLink, MPI) → ~20M particles (4 GPUs per node)
- extend to multiple nodes, splitting CPU and GPU workloads (MPI/OpenMP) → ~500M particles (30 nodes)

Acknowledgments
STFC: Michael Seaton, Silvia Chiacchiera, Leon Petit, Luke Mason and the HPSE group!
ECAM EU: Alan O'Cais, Liang Liang (* Grant Agreement N. 676531)

Questions?