
1 GPU Acceleration of Particle-In-Cell Methods
B. M. Cowan, J. R. Cary, S. W. Sides
Tech-X Corporation

2 GPU characteristics
- Graphics processing units (GPUs) offer tremendous computational performance
  - Much greater processing capability per monetary and power cost
  - Achieved through massive parallelism
- Example: NVIDIA Tesla K20
  - 1.17 Tflop/s (double precision), 3.52 Tflop/s (single precision)
  - 13 streaming multiprocessors with 192 cores each = 2,496 processor cores
  - 5 GB memory with 208 GB/s bandwidth
  - Cost: ~$3,000
  - Power consumption: 225 W

3 GPU restrictions
- Multiprocessors are SIMD devices
  - Warps of 32 threads must execute the same instruction
  - Threads can be grouped into (larger) blocks for convenience
  - Branches can effectively stall threads (all threads go through all logical paths)
- Full bandwidth of global memory is realized only if memory accesses are coalesced
  - Blocks of 128 bytes accessed by consecutive groups of threads (see the sketch below)
- Multiprocessor shared memory can be accessed without such restrictions by all threads in a block, but has limited size
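
As a concrete illustration of the coalescing rule, here is a minimal CUDA sketch (hypothetical kernels, not VSim code) contrasting an access pattern a warp can service with a few 128-byte transactions against a strided pattern that scatters the warp across many segments:

```cpp
#include <cuda_runtime.h>

// Coalesced: thread i reads word i, so a warp's 32 reads fall in consecutive
// 128-byte segments and are served with a handful of memory transactions.
__global__ void copyCoalesced(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read words far apart, so each read may touch a
// different 128-byte segment and effective bandwidth collapses.
__global__ void copyStrided(const float* in, float* out, int n, int stride) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  long long j = (long long)i * stride % n;
  if (i < n) out[i] = in[j];
}
```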

4 The particle-in-cell algorithm
- Lorentz force interpolated from gridded fields
- Currents deposited to grid from particles
- Update cycle: push particles, deposit current to grid, advance fields, interpolate fields to particle positions (see the sketch below)
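
The cycle above can be summarized in code. This sketch uses hypothetical placeholder types and stage functions, not the VSim/Vorpal API; the point is only the order of the four stages within one time step:

```cpp
struct Fields {};     // E and B on the grid
struct Particles {};  // positions, momenta, weights
struct Currents {};   // J on the grid

// Placeholder stage functions; real implementations live in the PIC engine.
void interpolateFields(const Fields&, Particles&) {}     // gather E, B at particle positions
void pushParticles(Particles&, double /*dt*/) {}         // Lorentz-force momentum and position update
void depositCurrent(const Particles&, Currents&) {}      // scatter particle currents to the grid
void advanceFields(Fields&, const Currents&, double) {}  // field update driven by J

void picStep(Fields& f, Particles& p, Currents& j, double dt) {
  interpolateFields(f, p);
  pushParticles(p, dt);
  depositCurrent(p, j);
  advanceFields(f, j, dt);
}
```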

5 Difficulties of PIC on GPU
- Both the field update and the particle phase-space advance are straightforward
  - For the field update, each thread updates a cell
  - For the particle push, each thread updates a particle
- Field interpolation and current deposition present problems
  - It's not known a priori which cells particles occupy and hence which field values are needed
  - Naive one-particle-per-thread memory accesses won't be coalesced
  - Deposition may also experience race conditions: multiple threads try to write the same current value (see the sketch below)
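
A minimal sketch of the deposition race and the simplest (not necessarily fastest) fix, atomic adds. Linear weighting in 1D for brevity; names are illustrative, not VSim's:

```cpp
#include <cuda_runtime.h>

// One particle per thread; linear (cloud-in-cell) weighting in 1D.
// Two particles handled by different threads can land in the same cell, so
// the writes to Jx must be atomic or the updates would race.
__global__ void depositCurrent1D(const float* x, const float* qv, int nParticles,
                                 float* Jx, float dx, int nCells) {
  int p = blockIdx.x * blockDim.x + threadIdx.x;
  if (p >= nParticles) return;
  float s = x[p] / dx;
  int   i = (int)floorf(s);
  float f = s - i;                       // fractional distance toward cell i+1
  if (i < 0 || i + 1 >= nCells) return;  // skip out-of-range particles in this sketch
  atomicAdd(&Jx[i],     qv[p] * (1.0f - f));
  atomicAdd(&Jx[i + 1], qv[p] * f);
}
```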

6 Some techniques
- Optimize memory access
  - For interpolation, read fields into shared memory (see the sketch below)
  - Can then interpolate using one particle per thread
  - But need to be careful about available shared memory size
- Tile the particles
  - Group the particles by small rectangular regions (tiles) of cells
  - Particles in a tile will generally be processed by the same thread block
  - Tile size trade-off: smaller tiles increase occupancy, but each tile needs guard cells
  - Tile size must be set dynamically, based on problem specifications and hardware
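
Here is a sketch of the shared-memory idea for a 1D tile of cells, assuming particles have already been grouped by tile and each tile's particle count fits in one block; sizes and names are illustrative assumptions, not VSim code:

```cpp
#include <cuda_runtime.h>

#define TILE_CELLS 64   // cells per tile; a tuning parameter in practice

// One block per tile. Threads cooperatively stage the tile's field values
// (plus one guard cell for linear weighting) into shared memory, then each
// thread interpolates for one particle. Assumes tileCount[tile] <= blockDim.x.
__global__ void interpTileEx(const float* Ex, const float* xp, float* ExAtP,
                             const int* tileStart, const int* tileCount, float dx) {
  __shared__ float sEx[TILE_CELLS + 1];

  int tile = blockIdx.x;
  int firstCell = tile * TILE_CELLS;

  // Coalesced, cooperative load of this tile's field values.
  for (int c = threadIdx.x; c < TILE_CELLS + 1; c += blockDim.x)
    sEx[c] = Ex[firstCell + c];
  __syncthreads();

  // One particle per thread; the particle's cell lies within this tile.
  if (threadIdx.x < tileCount[tile]) {
    int p = tileStart[tile] + threadIdx.x;
    float s = xp[p] / dx - firstCell;   // position in tile-local cell units
    int   i = (int)s;
    float f = s - i;
    ExAtP[p] = (1.0f - f) * sEx[i] + f * sEx[i + 1];
  }
}
```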

7 General coding principles
- Portability
  - Write main computational procedures (e.g. cell field update, particle push) as functions that can be executed on both host and device (see the sketch below)
  - Also take advantage of MIC, vectorization
  - On the CPU, the function is executed in a loop
  - On the GPU, the function is executed by a thread
- Generality
  - Design main management routines to work with multiple algorithm variants
  - Different types of field updates, e.g. absorbing boundaries, controlled dispersion
  - High-order particles: complicates memory management
  - Other physics: metallic boundaries, dielectric materials, collisions, cut cells, ...
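
The portability pattern might look like the following sketch (illustrative names and a placeholder update, not VSim's): the per-particle procedure is a single __host__ __device__ function, called from a CPU loop or from a one-thread-per-particle kernel.

```cpp
#include <cuda_runtime.h>

struct Particle { float x, y, z, ux, uy, uz; };

// Placeholder free-streaming update; a real push applies the Lorentz force.
__host__ __device__ inline void pushOne(Particle& p, float dt) {
  p.x += p.ux * dt;
  p.y += p.uy * dt;
  p.z += p.uz * dt;
}

// CPU: the same function executed in a loop (vectorizable, MIC-friendly).
void pushAllHost(Particle* ptcls, int n, float dt) {
  for (int i = 0; i < n; ++i) pushOne(ptcls[i], dt);
}

// GPU: the same function executed by one thread per particle.
__global__ void pushAllKernel(Particle* ptcls, int n, float dt) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) pushOne(ptcls[i], dt);
}
```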

8 Status of full GPU PIC in VSim
- Work in progress, part of an ongoing DARPA project
- Completed main interpolate-push-deposit update
  - Results correct in basic tests
- Coded consistently with general practices for good GPU performance, but not optimized yet
- Still, we can start to get insight about performance trade-offs

9 Performance scaling with PPC
- Tests run for the interpolate/push part of the algorithm
- As expected, more particles per cell (PPC) give better per-particle performance
  - Amortizes the time to load field data from global memory

10 Performance scaling with simulation size
- Scaling with the number of cells in the domain is more complex
  - Number of cells per tile held constant
- Could be seeing effects of register pressure, limited shared memory, occupancy of the SMs
- Sizing tiles to the shared memory limit is not the most performant choice

11 Code considerations
- VSim uses the Vorpal engine, which was written from the ground up in C++
  - Different algorithms are selected through run-time polymorphism with virtual functions
  - This was great in 2007
- Now this approach has limitations
  - Hardware considerations: want to avoid branching
  - CUDA restrictions: an object passed by value as in __global__ void kernel(Object myObj) { /*... */ } must be "flat": no virtual methods or bases
- Run-time polymorphism is still OK for high-level logic, but use template policy classes for low-level logic (see the sketch below)
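
A sketch of the "flat" policy-class pattern, with illustrative class names: the low-level variation point becomes a template parameter whose object carries no vtable, so it can safely be passed by value to a kernel and is dispatched at compile time.

```cpp
#include <cuda_runtime.h>

// A policy class: plain data plus a __host__ __device__ call operator.
// No virtual methods or bases, so it is safe to pass by value to a kernel.
struct SimplePush {
  float dt;
  __host__ __device__ void operator()(float& x, float u) const { x += u * dt; }
};

// Low-level logic templated on the policy: the call is resolved at compile
// time, so there is no vtable lookup and no divergent branch per particle.
template <typename PushPolicy>
__global__ void pushKernel(float* x, const float* u, int n, PushPolicy push) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) push(x[i], u[i]);
}

// High-level host code can still choose among variants, e.g.
//   pushKernel<<<blocks, threads>>>(x, u, n, SimplePush{dt});
```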

12 Moving toward code generality
- VSim is a full multiphysics package
  - Plasmas, metals, dielectrics, collisions, ...
- We want to enable all these features on the GPU
- Starting with grids
  - Cartesian and cylindrical coordinate systems; uniform and variable discretizations
  - All grid types except variable cylindrical implemented; uniform grids tested and working
  - Variable discretizations require runtime branching (see the sketch below)
- Collisions in progress
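
The branching cost of variable discretizations shows up already in cell lookup: for a uniform grid it is a single divide, while for a variable grid it becomes a data-dependent search. A minimal sketch with illustrative names, not VSim's grid classes:

```cpp
// Uniform grid: branch-free cell index from a divide.
__host__ __device__ inline int cellIndexUniform(float x, float xMin, float dx) {
  return (int)((x - xMin) / dx);
}

// Variable grid: binary search over the nCells+1 cell-edge coordinates,
// which introduces data-dependent branching on the GPU.
__host__ __device__ inline int cellIndexVariable(float x, const float* edges, int nCells) {
  int lo = 0, hi = nCells;              // invariant: edges[lo] <= x < edges[hi]
  while (hi - lo > 1) {
    int mid = (lo + hi) / 2;
    if (x < edges[mid]) hi = mid; else lo = mid;
  }
  return lo;
}
```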

13 Next steps
- Refactoring basic classes to be GPU-friendly
  - Grids
  - Fields
- Integrate with FDTD field update
- Implement particle sinks
- Performance testing and optimization
  - What drives the performance fluctuations as tile size and domain size are changed?
  - Are there savings to be had from data management?
  - Can we take advantage of the fact that particles move by at most one cell or tile per step? Manual organization vs. global sort (a global-sort sketch follows this list)
- Integrate with in-progress domain decomposition work
  - General for arbitrary number of CPU cores and GPUs on the system
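
For the "global sort" option, one possible sketch using Thrust (purely illustrative, not necessarily the organization VSim will adopt): compute a tile key per particle, then sort a particle-index permutation by that key so particles in the same tile become contiguous.

```cpp
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// tileKey holds one tile index per particle (computed elsewhere from positions).
// On return, permutation lists particle indices grouped tile by tile; gathering
// the particle arrays through it reorders them into a tile-contiguous layout.
void binParticlesByTile(thrust::device_vector<int>& tileKey,
                        thrust::device_vector<int>& permutation) {
  permutation.resize(tileKey.size());
  thrust::sequence(permutation.begin(), permutation.end());   // 0, 1, 2, ...
  thrust::sort_by_key(tileKey.begin(), tileKey.end(), permutation.begin());
}
```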

14 Acknowledgments
- Work supported by DARPA contract W31P4Q-15-C-0061 (SBIR)
- Helpful discussions with D. N. Smithe

