Space Charge with PyHEADTAIL and PyPIC on the GPU Stefan Hegglin and Adrian Oeftiger Space Charge Working Group meeting –
Overview
1. PIC: Reminder
2. Implementation / Parallelisation Approach
3. Results
Motivation
self-consistent space charge models: the particle-in-cell (PIC) algorithm is the dominant time consumer in simulations
parallelisation is challenging (PIC is a memory-bound algorithm, i.e. few FLOP/byte)
Output
we parallelised PIC on the GPU (graphics processing unit)
PyPIC: PIC algorithms in a shared Python library
2.5D (slice-by-slice transverse) and full 3D models
much higher resolution possible, which suppresses noise issues
example (courtesy: F. Kesting, GSI): on a 128x128 mesh, artificial emittance growth is reduced when more particles are used
How to Approach Noise Issues?
less noise means longer applicability/validity of simulations
e.g. SPS injection plateau: 10.8 seconds ≈ 500’000 turns! Simulating this in full is impossible; instead we typically gain O(10’000 turns) of validity for a simulation time scale of O(1 week) with current software
convergence study: choose grid resolution (according to physics) → ≥10 macro-particles per grid cell → fix total number of macro-particles → evaluate emittance growth
New Available Parameter Space
1’000’000 macro-particles, 20 slices, 128 x 128 mesh size
[Figure annotations: 152 ms per kick, 134 ms per kick, 110 ms per kick]
Poisson Solving with PIC
particle-in-cell algorithm: standard in the accelerator physics domain
solve the Poisson equation:
  finite differences
  Hockney: FFT, (integrated) Green’s function for open boundaries
  FMM, particle-particle, …
see Ji Qiang’s talks in the PyHEADTAIL meeting and the Space Charge WG meeting:
PIC – 3 Steps
particle-in-cell algorithm:
1) particles to mesh: deposit charges onto the mesh nodes
2) solve the Poisson equation on the mesh (Hockney’s algorithm)
3) mesh to particles: interpolate the mesh fields to the particles
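Steps 1 and 3 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses bilinear (cloud-in-cell) weights, which the slides do not specify, and the function names are hypothetical and not part of PyPIC. Step 2, the mesh solve, is left out here.

```python
import numpy as np

def cic_weights(x, y, x0, y0, dx, dy):
    """Lower-left node indices and fractional offsets of each particle."""
    fx = (x - x0) / dx
    fy = (y - y0) / dy
    ix = np.floor(fx).astype(int)
    iy = np.floor(fy).astype(int)
    return ix, iy, fx - ix, fy - iy

def particles_to_mesh(x, y, charge, x0, y0, dx, dy, nx, ny):
    """Step 1: deposit each charge onto its four surrounding mesh nodes."""
    ix, iy, wx, wy = cic_weights(x, y, x0, y0, dx, dy)
    rho = np.zeros((nx, ny))
    # np.add.at accumulates correctly when several particles share a node
    np.add.at(rho, (ix, iy), charge * (1 - wx) * (1 - wy))
    np.add.at(rho, (ix + 1, iy), charge * wx * (1 - wy))
    np.add.at(rho, (ix, iy + 1), charge * (1 - wx) * wy)
    np.add.at(rho, (ix + 1, iy + 1), charge * wx * wy)
    return rho / (dx * dy)  # charge per unit area

def mesh_to_particles(field, x, y, x0, y0, dx, dy):
    """Step 3: bilinear interpolation of a node-based field to the particles."""
    ix, iy, wx, wy = cic_weights(x, y, x0, y0, dx, dy)
    return (field[ix, iy] * (1 - wx) * (1 - wy)
            + field[ix + 1, iy] * wx * (1 - wy)
            + field[ix, iy + 1] * (1 - wx) * wy
            + field[ix + 1, iy + 1] * wx * wy)
```

The sketch assumes all particles lie strictly inside the mesh, i.e. x0 <= x < x0 + (nx - 1) * dx and likewise in y. Using the same weights for deposition and interpolation keeps the two steps consistent with each other.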
Hockney’s algorithm
Solve the Poisson equation on a structured grid
Green’s function: analytical solution for open boundaries
Formal solution using convolution: O(n^2)
Trick: implementation using FFTs of 2x domain size
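The FFT trick can be illustrated with a minimal 2D NumPy sketch. Zero-padding the charge density to twice the mesh size in each direction makes the circular FFT convolution equal to the open-boundary convolution on the original mesh, at O(n log n) cost instead of O(n^2). The assumptions here are: the free-space 2D Green's function G = -ln(r^2)/(4π), physical constants such as 1/ε0 omitted, and a crude clamp of r near the origin to avoid the logarithmic singularity. The function names are hypothetical and not the PyPIC interface.

```python
import numpy as np

def green_2d(x, y, dx, dy):
    """Free-space 2D Green's function, with r clamped near the origin
    (an ad hoc regularisation chosen for this sketch only)."""
    r2 = x**2 + y**2
    r2_min = (0.5 * min(dx, dy))**2
    return -np.log(np.maximum(r2, r2_min)) / (4.0 * np.pi)

def hockney_solve_2d(rho, dx, dy):
    """Potential on the mesh from an open-boundary FFT convolution."""
    nx, ny = rho.shape
    kx = np.arange(2 * nx)
    ky = np.arange(2 * ny)
    # distances on the doubled domain, wrapped so that negative
    # offsets land in the upper half of each axis
    dist_x = np.minimum(kx, 2 * nx - kx) * dx
    dist_y = np.minimum(ky, 2 * ny - ky) * dy
    X, Y = np.meshgrid(dist_x, dist_y, indexing="ij")
    green = green_2d(X, Y, dx, dy)
    # zero-pad the charge density to the doubled domain
    rho_padded = np.zeros((2 * nx, 2 * ny))
    rho_padded[:nx, :ny] = rho
    phi_padded = np.fft.irfft2(
        np.fft.rfft2(rho_padded) * np.fft.rfft2(green),
        s=(2 * nx, 2 * ny),
    )
    # only the original quadrant carries the physical solution
    return phi_padded[:nx, :ny] * dx * dy
```

On a small mesh the result can be checked against the direct double sum over all node pairs, which is the O(n^2) formal solution named on the slide.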
Integrated Green’s function
The Green’s function approach has problems when the mesh has a large aspect ratio (the numerical integration uses a constant function value per cell)
Integrated Green’s function, main idea: integrate the Green’s function analytically over each mesh cell, then sum over all cells
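For the 2D case, the cell integral has a closed form, sketched below. The assumptions are the free-space 2D Green's function G = -ln(x^2 + y^2)/(4π) and the antiderivative F(x, y) = -3xy + x^2 atan(y/x) + y^2 atan(x/y) + xy ln(x^2 + y^2), whose mixed derivative is ln(x^2 + y^2). The function names are hypothetical and this is not taken from the PyPIC source.

```python
import math

def igf_antiderivative(x, y):
    """F(x, y) with d^2F/(dx dy) = ln(x^2 + y^2).
    Each guarded term tends to zero on the axes, so it is set to zero there."""
    r2 = x * x + y * y
    term_log = x * y * math.log(r2) if r2 > 0.0 else 0.0
    term_atan_x = x * x * math.atan(y / x) if x != 0.0 else 0.0
    term_atan_y = y * y * math.atan(x / y) if y != 0.0 else 0.0
    return -3.0 * x * y + term_atan_x + term_atan_y + term_log

def integrated_green_2d(xa, xb, ya, yb):
    """Exact integral of G over the cell [xa, xb] x [ya, yb]."""
    F = igf_antiderivative
    integral = F(xb, yb) - F(xa, yb) - F(xb, ya) + F(xa, ya)
    return -integral / (4.0 * math.pi)

def point_green_2d(x, y):
    """Point value of G, as used by the non-integrated approach."""
    return -math.log(x * x + y * y) / (4.0 * math.pi)
```

Unlike the point-sampled Green's function, the integrated value stays finite for the cell containing the origin, and it does not assume that G is constant across an elongated cell.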
Integrated Green’s function
[Figure: error of E_x versus x, comparing IGF and GF for an aspect ratio of 1:5. Source: Abell et al., PAC 07]
GPUs
GPU = Graphics Processing Unit:
  threads running massively parallel
  one concurrent instruction on >1000 cores
  large data arrays
  ‼ expensive global memory access
resources for ABP simulations:
  CERN: LIU-PS-GPU server, 4x NVIDIA Tesla C2075 cards (mid 2011)
  CNAF (Bologna): high performance cluster, 7x NVIDIA Tesla K20m (early 2013), 8x Tesla K40m (late 2013)
How to use the GPU
Script: minimal changes for GPU
how to submit a GPU job (CNAF):
python: GPU data introspection works as flexibly as on the CPU (print(), calculations with GPUArrays, …)
[Figure panel labels: GPU, CPU]
Parallelisation Approach
iterative cycle: identify bottleneck → optimise code → verify functionality → profiling
Different Bottlenecks: CPU vs. GPU
CPU: FFT solving is the bottleneck (FFT: O(nx² log nx), p2m: O(nx²))
GPU: particle-to-mesh deposition is the bottleneck
Implementation of 3 Steps
particle-in-cell algorithm:
particles to mesh (p2m), two variants:
  1) atomicAdd: one thread per particle
  2) parallel sort: one thread per cell
solve: cuFFT (parallel FFT)
mesh to particles (m2p): one thread per particle
Variant 1 of p2m
1 thread per particle
race condition: several threads may update the same mesh node at the same time
atomicAdd: properly serialises the memory updates
slow but correct
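The lost-update problem behind the race condition has a close analogue in NumPy, shown here as a CPU-side illustration only (it is not the CUDA kernel). With duplicate indices, a buffered fancy-index update keeps a single contribution per node, whereas np.add.at accumulates every contribution, which is the guarantee atomicAdd gives on the GPU. Nearest-node deposition and the function names are assumptions made for this sketch.

```python
import numpy as np

def deposit_racy(node_idx, weights, n_nodes):
    """Buffered update: particles sharing a node overwrite each other,
    mimicking the lost updates of an unsynchronised kernel."""
    rho = np.zeros(n_nodes)
    rho[node_idx] += weights
    return rho

def deposit_atomic(node_idx, weights, n_nodes):
    """Unbuffered update: every contribution is accumulated,
    as atomicAdd would ensure with one thread per particle."""
    rho = np.zeros(n_nodes)
    np.add.at(rho, node_idx, weights)
    return rho

# six particles falling onto three nodes
node_idx = np.array([0, 0, 1, 2, 2, 2])
weights = np.ones(6)
rho_racy = deposit_racy(node_idx, weights, 3)      # node sums are wrong
rho_atomic = deposit_atomic(node_idx, weights, 3)  # node sums are complete
```

The correct variant pays for its safety by serialising updates that hit the same node, which is why the slide calls it slow but correct.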
Variant 2 of p2m
1 thread per node
Sort particles by node index (optimises memory access!)
Avoids race condition (no concurrent writes)
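The sort-based variant can be emulated on the CPU as follows. After sorting, all particles of one node are contiguous in memory, so each node (one GPU thread in the real implementation) sums its own segment and writes a single result. The loop over nodes stands in for the parallel threads; nearest-node deposition and the function name are assumptions made for this sketch.

```python
import numpy as np

def deposit_sorted(node_idx, weights, n_nodes):
    """Deposit by sorting particles by node index, then summing per node."""
    order = np.argsort(node_idx, kind="stable")
    sorted_idx = node_idx[order]
    sorted_weights = weights[order]
    # segment boundaries of each node within the sorted particle array
    nodes = np.arange(n_nodes)
    starts = np.searchsorted(sorted_idx, nodes, side="left")
    ends = np.searchsorted(sorted_idx, nodes, side="right")
    rho = np.zeros(n_nodes)
    for node in range(n_nodes):  # one iteration corresponds to one thread
        # each node writes only its own entry: no concurrent writes
        rho[node] = sorted_weights[starts[node]:ends[node]].sum()
    return rho
```

Empty nodes get start == end and therefore a zero sum, so no special case is needed for cells without particles.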
Different numerical models
2.5D: slice the bunch into n slices and solve n independent 2D Poisson equations
  Approximation: bunch very long
  CPU: serial
  GPU: compute all slices simultaneously
3D: solve the full 3D bunch on a 3D grid
  CPU: not implemented (very slow)
  GPU: large memory requirements due to Hockney’s algorithm
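The memory argument can be made concrete with a rough estimate. Hockney's algorithm doubles the mesh in every dimension, so a 3D mesh needs 8x its nodes per array while a 2D slice needs only 4x. The sketch below counts the bytes of one real double-precision array on the doubled domain. The 128x128 mesh with 20 slices is the parameter set quoted earlier in the slides; the 128x128x128 mesh is a hypothetical example, and the helper name is an assumption.

```python
import math

def hockney_mesh_bytes(mesh_shape, n_batches=1, bytes_per_value=8):
    """Bytes for one real-valued array on the doubled Hockney domain."""
    doubled_nodes = math.prod(2 * n for n in mesh_shape)
    return n_batches * doubled_nodes * bytes_per_value

# 2.5D: 20 independent slices on a 128 x 128 mesh
mem_2p5d = hockney_mesh_bytes((128, 128), n_batches=20)
# 3D: a single 128 x 128 x 128 mesh (hypothetical size)
mem_3d = hockney_mesh_bytes((128, 128, 128))

print(mem_2p5d / 2**20, "MiB for 2.5D")  # 10.0 MiB
print(mem_3d / 2**20, "MiB for 3D")      # 128.0 MiB
```

A real solver holds several such arrays at once (charge density, transformed Green's function, complex FFT work buffers), so the actual footprint is a multiple of these numbers.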
Numeric Parameter Scans: Fixed nx
[Figure panels: fixed mesh size 256x256; fixed mesh size 512x512. Annotations: x4, x2]
Timing: Fully Loaded GPU Parameters
the 2.5D model works well at high particle numbers; at low numbers the GPU is far from fully exploited!
different slope of CPU vs. GPU (characteristic behaviour)
new hardware at CNAF is more efficient (x1.8)
Timing: CUDA 6 vs CUDA
speedup of up to x1.5 due to a faster implementation of the sorting algorithm (Thrust 1.8) and better cuFFT (2D, CNAF)
Summary
PyHEADTAIL now offers 2.5D (slice-by-slice transverse) and 3D self-consistent direct space charge models (on CPU and GPU):
  3D model allows cross-checking 2.5D approximations
  GPU speeds up ≈13x for large meshes and large numbers of particles
  wide numeric parameter spaces available now!
  larger resolutions help to mitigate noise effects (artefacts such as numerical emittance blow-up)
  improved validity for long simulations (real machine time)
next steps: SPS simulations (resonances)
Specifications of Used GPU Machines
available machines at CNAF:
Specification of Used CPU Machine
LIUPSGPU CPU:
PyPIC on GPU
Standalone Python module:
  GPU interfacing via PyCUDA/Thrust
  Flexible 2D/3D (integrated) Green’s function
  cuFFT
(new interface under branch: new_pypic_cpu_and_gpu)
Timing: Fully Loaded GPU Parameters II
on the GPU, particle-to-mesh deposition dominates
for a fixed mesh size, depositing more macro-particles onto the same grid makes memory bandwidth the limit on the speed-up