Published by Grant Jennings. Modified over 9 years ago.
2
Space Charge with PyHEADTAIL and PyPIC on the GPU
Stefan Hegglin and Adrian Oeftiger
Space Charge Working Group meeting – 29.10.2015
3
Overview
1. PIC: Reminder
2. Implementation / Parallelisation Approach
3. Results
4
Motivation
Self-consistent space charge models: the particle-in-cell (PIC) algorithm dominates the run time of simulations.
Parallelisation is challenging: PIC is a memory-bound algorithm, i.e. it performs few FLOP per byte.
5
Output
We parallelised PIC on the GPU (graphics processing unit).
PyPIC: PIC algorithms in a shared Python library.
2.5D (slice-by-slice transverse) and full 3D models.
Much higher resolution becomes possible, which suppresses noise issues.
Example (figure courtesy of F. Kesting, GSI, https://eventbooking.stfc.ac.uk/uploads/spacecharge15/numericalnoisekesting.pdf): on a 128x128 mesh, the artificial emittance growth is reduced for more particles.
6
How to Approach Noise Issues?
Less noise means longer applicability/validity of simulations.
E.g. the SPS injection plateau lasts 10.8 seconds ≈ 500’000 turns. Simulating all of it is impossible; instead we typically reach a validity of O(10’000) turns for a simulation time scale of O(1 week) with current software.
Convergence study: choose the grid resolution (according to the physics), use ≥10 macro-particles per grid cell, fix the total number of macro-particles, evaluate the emittance growth.
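As a rough illustration of the "≥10 macro-particles per grid cell" rule, the sketch below computes the minimum macro-particle count for a given mesh. It assumes (this is not stated on the slide) that in the 2.5D model every slice carries its own transverse mesh, so the cell count is nx * ny * n_slices. The function name is made up for this example.

```python
def min_macroparticles(nx, ny, n_slices=1, per_cell=10):
    """Smallest macro-particle count that gives `per_cell` particles
    per grid cell on average.

    Assumption for this sketch: each of the `n_slices` longitudinal
    slices has its own nx x ny transverse mesh.
    """
    n_cells = nx * ny * n_slices
    return per_cell * n_cells


if __name__ == "__main__":
    # Mesh size and slice count as quoted elsewhere in the talk.
    print(min_macroparticles(128, 128, n_slices=20))
```

Doubling the resolution in both transverse planes quadruples the cell count, and with it the macro-particle count needed to keep the same average occupancy.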
7
New Available Parameter Space
Figure panels: 1’000’000 macro-particles; 20 slices; 128 x 128 mesh size.
Timings quoted on the slide: 152 ms per kick, 134 ms per kick, 110 ms per kick.
8
Poisson Solving with PIC
Particle-in-cell algorithm: standard in the accelerator physics domain.
Solving the Poisson equation:
finite differences
Hockney: FFT, (integrated) Green’s function for open boundaries
FMM, particle-particle, …
See Ji Qiang’s talks in the PyHEADTAIL meeting and the Space Charge WG meeting: https://indico.cern.ch/event/433371/
9
PIC – 3 Steps
Particle-in-cell algorithm:
1) particles to mesh: deposit charges onto the mesh nodes
2) solve the Poisson equation on the mesh (Hockney’s algorithm)
3) mesh to particles: interpolate the mesh fields to the particles
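Steps 1 and 3 can be sketched in a few lines of plain Python. This is a minimal serial illustration, not the PyPIC code: it assumes bilinear (cloud-in-cell) weights on a uniform 2D mesh, particles strictly inside the mesh, and hypothetical function names. Step 2 (the solver) is left out here.

```python
import math


def cic_weights(x, y, x0, y0, dx, dy):
    """Lower-left node indices (i, j) and the four bilinear weights."""
    fx = (x - x0) / dx
    fy = (y - y0) / dy
    i, j = int(math.floor(fx)), int(math.floor(fy))
    tx, ty = fx - i, fy - j
    weights = ((1 - tx) * (1 - ty), tx * (1 - ty), (1 - tx) * ty, tx * ty)
    return i, j, weights


def particles_to_mesh(xs, ys, charges, nx, ny, x0, y0, dx, dy):
    """Step 1: deposit each particle's charge onto its 4 surrounding nodes."""
    mesh = [[0.0] * nx for _ in range(ny)]
    for x, y, q in zip(xs, ys, charges):
        i, j, (w00, w10, w01, w11) = cic_weights(x, y, x0, y0, dx, dy)
        mesh[j][i] += q * w00
        mesh[j][i + 1] += q * w10
        mesh[j + 1][i] += q * w01
        mesh[j + 1][i + 1] += q * w11
    return mesh


def mesh_to_particles(field, xs, ys, x0, y0, dx, dy):
    """Step 3: interpolate a mesh quantity to the particle positions."""
    values = []
    for x, y in zip(xs, ys):
        i, j, (w00, w10, w01, w11) = cic_weights(x, y, x0, y0, dx, dy)
        values.append(field[j][i] * w00 + field[j][i + 1] * w10
                      + field[j + 1][i] * w01 + field[j + 1][i + 1] * w11)
    return values
```

Using the same weights for deposition and interpolation is a common choice; with these weights the total deposited charge equals the sum of the particle charges.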
10
Hockney’s algorithm
Solve the Poisson equation on a structured grid.
Green’s function: analytical solution for open boundaries.
Formal solution using a convolution: O(n^2).
Trick: implement the convolution using FFTs of 2x the domain size.
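The doubling trick can be sketched in one dimension, purely for brevity (the talk concerns 2D and 3D meshes). The assumptions in this sketch: unit constants, the 1D free-space Green's function -|x|/2 of phi'' = -rho, a mesh length that is a power of two, and a small pure-Python FFT standing in for a library FFT. All function names are hypothetical.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 FFT (unnormalised); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def green_1d(x):
    """Free-space Green's function of phi'' = -rho in 1D (unit constants)."""
    return -0.5 * abs(x)


def potential_direct(rho, h):
    """Open-boundary potential by direct summation: O(n^2)."""
    n = len(rho)
    return [h * sum(green_1d(h * (i - j)) * rho[j] for j in range(n))
            for i in range(n)]


def potential_hockney(rho, h):
    """Same potential via FFTs on a zero-padded domain of twice the size."""
    n = len(rho)
    m = 2 * n
    rho_pad = list(rho) + [0.0] * n
    # Green's function stored cyclically: offsets 0..n-1, then -n..-1.
    g = [green_1d(h * (k if k < n else k - m)) for k in range(m)]
    product = [r * gk for r, gk in zip(fft(rho_pad), fft(g))]
    conv = fft(product, inverse=True)
    # Only the first n entries are physical; divide by m to normalise.
    return [h * c.real / m for c in conv[:n]]
```

The zero padding keeps the periodic images introduced by the FFT from overlapping the physical half of the domain, so the first n entries match the direct sum while the cost drops from O(n^2) to O(n log n).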
11
Integrated Green’s function
The Green’s function approach has problems when the mesh has a large aspect ratio (the numerical integration uses a constant function value per cell).
Integrated Green’s function: main idea: integrate the Green’s function analytically over each mesh cell, then sum over all cells.
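To make the difference concrete, the sketch below compares the two ways of assigning a Green's function value to one elongated cell. It assumes the 2D free-space kernel ln(x^2 + y^2) (the Green's function up to a constant prefactor) and a cell lying entirely in the quadrant x > 0, y > 0. The closed-form antiderivative used here can be checked by differentiating it with respect to x and y. The names and the cell geometry are illustrative only.

```python
import math


def antiderivative(x, y):
    """F(x, y) with d2F/dxdy = ln(x^2 + y^2), for x > 0 and y > 0."""
    return (-3.0 * x * y
            + x * x * math.atan(y / x)
            + y * y * math.atan(x / y)
            + x * y * math.log(x * x + y * y))


def cell_integrated(xc, yc, hx, hy):
    """Exact integral of the kernel over the cell centred at (xc, yc)."""
    x1, x2 = xc - 0.5 * hx, xc + 0.5 * hx
    y1, y2 = yc - 0.5 * hy, yc + 0.5 * hy
    return (antiderivative(x2, y2) - antiderivative(x1, y2)
            - antiderivative(x2, y1) + antiderivative(x1, y1))


def cell_midpoint(xc, yc, hx, hy):
    """Constant-value approximation: kernel at the centre times cell area."""
    return hx * hy * math.log(xc * xc + yc * yc)


def cell_quadrature(xc, yc, hx, hy, n=200):
    """Brute-force reference: midpoint rule on n x n sub-cells."""
    sx, sy = hx / n, hy / n
    total = 0.0
    for a in range(n):
        x = xc - 0.5 * hx + (a + 0.5) * sx
        for b in range(n):
            y = yc - 0.5 * hy + (b + 0.5) * sy
            total += math.log(x * x + y * y)
    return total * sx * sy


if __name__ == "__main__":
    cell = (1.0, 0.2, 1.0, 0.2)  # centre (1.0, 0.2), size 1.0 x 0.2
    print(cell_integrated(*cell), cell_midpoint(*cell), cell_quadrature(*cell))
```

For this 5:1 cell the constant-value estimate deviates visibly from the brute-force reference, whereas the analytically integrated value agrees with it.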
12
Integrated Green’s function
Figure: error of E_x versus x, comparing IGF and GF for an aspect ratio of 1:5 (Abell et al., PAC 07, 9850561).
13
GPUs
GPU = Graphics Processing Unit: threads running massively parallel.
One concurrent instruction on >1000 cores.
Large data arrays.
‼ Expensive global memory access.
Resources for ABP simulations:
CERN: LIU-PS-GPU server with 4x NVIDIA Tesla C2075 cards (mid 2011)
CNAF (Bologna): high performance cluster with 7x NVIDIA Tesla K20m (early 2013) and 8x Tesla K40m (late 2013)
14
How to use the GPU
Script: minimal changes for the GPU (panels: GPU, CPU).
How to submit a GPU job (CNAF):
Python: GPU data introspection works as flexibly as on the CPU (print(), calculations with GPUArrays, …).
15
Parallelisation Approach
Iterative cycle: identify bottleneck → optimise code → verify functionality → profiling.
16
Different Bottlenecks: CPU vs. GPU
CPU: FFT solving is the bottleneck.
GPU: particle-to-mesh deposition is the bottleneck.
Scaling: FFT: O(nx² log nx), p2m: O(nx²).
17
Implementation of 3 Steps
Particle-in-cell algorithm:
particles to mesh (p2m), two variants: 1) atomicAdd: one thread per particle; 2) parallel sort: one thread per cell
solve: cuFFT (parallel FFT)
mesh to particles (m2p): one thread per particle
18
Variant 1 of p2m
1 thread per particle.
Race condition: several threads may update the same mesh node at the same time.
atomicAdd: properly serialises the memory updates; slow but correct.
19
Variant 2 of p2m
1 thread per node.
Sort particles by node index (optimises memory access!).
Avoids the race condition (no concurrent writes).
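The two deposition variants can be mimicked serially in plain Python. This is only a sketch under simplifying assumptions: nearest-grid-point weighting (each particle contributes to exactly one node), a precomputed list `node_ids` holding the flattened node index of every particle, and hypothetical function names. On the GPU the loops would be distributed over threads.

```python
def deposit_scatter(node_ids, charges, n_nodes):
    """Variant 1 analogue: iterate over particles, accumulate into the mesh.

    With one GPU thread per particle, the update below is the line that
    would need an atomic add, since two particles can hit the same node.
    """
    mesh = [0.0] * n_nodes
    for node, q in zip(node_ids, charges):
        mesh[node] += q
    return mesh


def deposit_sorted(node_ids, charges, n_nodes):
    """Variant 2 analogue: sort by node index, then reduce per node."""
    order = sorted(range(len(node_ids)), key=node_ids.__getitem__)
    sorted_ids = [node_ids[p] for p in order]
    sorted_q = [charges[p] for p in order]

    # start[k] .. start[k + 1] is the slice of sorted particles on node k.
    start = [0] * (n_nodes + 1)
    for node in sorted_ids:
        start[node + 1] += 1
    for k in range(n_nodes):
        start[k + 1] += start[k]

    # Each node reads its own contiguous slice and writes only mesh[k],
    # so no two "threads" ever write to the same location.
    return [sum(sorted_q[start[k]:start[k + 1]], 0.0) for k in range(n_nodes)]
```

The sorted variant pays for the sort up front but afterwards reads memory contiguously and needs no synchronisation between nodes.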
20
Different numerical models
2.5D: slice the bunch into n slices and solve n independent 2D Poisson equations. Approximation: the bunch is very long. CPU: serial. GPU: compute all slices simultaneously.
3D: solve the full 3D bunch on a 3D grid. CPU: not implemented (very slow). GPU: large memory requirements due to Hockney’s algorithm.
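The memory remark for the 3D model can be made tangible with a back-of-the-envelope estimate. The sketch assumes that Hockney's method doubles the mesh in every dimension and that each padded array stores 16-byte complex doubles. The actual PyPIC memory footprint is not given in the talk, and the function name is hypothetical.

```python
def hockney_padded_bytes(shape, bytes_per_value=16):
    """Bytes needed for ONE zero-padded array of a Hockney-type solver.

    Assumptions of this sketch: the mesh is doubled along every axis and
    values are stored as complex doubles (16 bytes each).
    """
    n_values = 1
    for n in shape:
        n_values *= 2 * n
    return n_values * bytes_per_value


if __name__ == "__main__":
    for shape in [(128, 128), (128, 128, 128)]:
        mib = hockney_padded_bytes(shape) / 2**20
        print(shape, f"{mib:.0f} MiB per padded array")
```

Doubling every axis multiplies the storage by 4 in 2D and by 8 in 3D, and a solver typically holds several such arrays (charge density, Green's function, potential) at once.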
21
Numeric Parameter Scans: Fixed nx
Figure panels: fixed mesh size 256x256; fixed mesh size 512x512. Annotations: x4, x2.
22
Timing: Fully Loaded GPU Parameters
2.5D model.
Works well at high particle numbers, i.e. at low numbers the GPU is far from fully exploited!
Different slope of CPU vs. GPU (characteristic behaviour).
New hardware at CNAF is more efficient (x1.8).
23
Timing: CUDA 6 vs CUDA 7
Speed-up of up to x1.5 due to a faster implementation of the sorting algorithm (Thrust 1.8) and a better cuFFT.
2D, K20m @ CNAF.
24
Summary
PyHEADTAIL now offers 2.5D (slice-by-slice transverse) and 3D self-consistent direct space charge models (on CPU and GPU):
the 3D model allows cross-checking the 2.5D approximations
the GPU gives a speed-up of ≈13x for large meshes and particle numbers
wide numeric parameter spaces are available now!
larger resolutions help to mitigate noise effects (artefacts such as numerical emittance blow-up)
improved validity for long simulations (real machine time)
Next steps: SPS simulations (resonances)
26
Specifications of Used GPU Machines
Available machines at CNAF: http://wiki.infn.it/strutture/cnaf/clusterhpc/home
27
Specification of Used CPU Machine
LIUPSGPU CPU:
28
PyPIC on GPU
Standalone Python module:
GPU interfacing via PyCUDA/Thrust
flexible 2D/3D (integrated) Green’s function
cuFFT
http://github.com/PyCOMPLETE/PyPIC (new interface under branch: new_pypic_cpu_and_gpu)
29
Timing: Fully Loaded GPU Parameters II
On the GPU, particle-to-mesh deposition dominates.
For a fixed mesh size, depositing more macro-particles onto the same grid makes memory bandwidth the limit on the speed-up.