Space Charge with PyHEADTAIL and PyPIC on the GPU
Stefan Hegglin and Adrian Oeftiger
Space Charge Working Group meeting – 29.10.2015

Presentation transcript:

Overview
1. PIC: Reminder
2. Implementation / Parallelisation Approach
3. Results

Motivation
- self-consistent space charge models: the particle-in-cell (PIC) algorithm is the dominating time consumer in simulations
- parallelisation is challenging (PIC is a memory-bound algorithm, i.e. few FLOPs per byte)

Output
- we parallelised PIC on the GPU (graphics processing unit)
- PyPIC: PIC algorithms in a shared Python library
- 2.5D (slice-by-slice transverse) and full 3D models → much higher resolution possible, suppressing noise issues
- example (courtesy F. Kesting, GSI): on a 128 x 128 mesh, artificial emittance growth is reduced with more macro-particles

How to Approach Noise Issues?
- less noise → longer applicability/validity of simulations
- e.g. SPS injection plateau: 10.8 seconds ≈ 500’000 turns! → impossible with current software; instead we typically gain O(10’000 turns) of validity for a simulation time scale of O(1 week)
- convergence study loop: choose the grid resolution (according to the physics) → fix the total number of macro-particles (≥10 macro-particles per grid cell) → evaluate the emittance growth

New Available Parameter Space
[Figure: 1’000’000 macro-particles, 20 slices, 128 x 128 mesh size; measured timings of 152 ms, 134 ms and 110 ms per space charge kick]

Poisson Solving with PIC
- the particle-in-cell algorithm is the standard in the accelerator physics domain
- solving the Poisson equation: finite differences; Hockney: FFT with an (integrated) Green’s function for open boundaries; FMM; particle-particle; …
- see Ji Qiang’s talks in the PyHEADTAIL meeting and the Space Charge WG meeting

PIC – 3 Steps
The particle-in-cell algorithm:
1) particles to mesh: deposit the charges onto the mesh nodes
2) solve the Poisson equation on the mesh → Hockney’s algorithm
3) mesh to particles: interpolate the mesh fields back to the particles
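To make the three steps concrete, here is a minimal, self-contained NumPy sketch (nearest-grid-point weighting; a periodic FFT solve stands in for step 2, whereas PyPIC uses Hockney’s open-boundary solver from the next slide; all function and variable names are our illustration, not PyPIC’s API):

```python
import numpy as np

def particles_to_mesh(x, y, q, n, d):
    """Step 1 (p2m): deposit macro-particle charges onto an n x n mesh."""
    ix = np.clip((x / d).round().astype(int), 0, n - 1)   # nearest grid point
    iy = np.clip((y / d).round().astype(int), 0, n - 1)
    rho = np.zeros((n, n))
    np.add.at(rho, (ix, iy), q / d**2)                    # charge density
    return rho, ix, iy

def solve_poisson(rho, d, eps0=8.854e-12):
    """Step 2: solve nabla^2 phi = -rho/eps0 (periodic stand-in for Hockney)."""
    k = 2 * np.pi * np.fft.fftfreq(rho.shape[0], d)
    k2 = k[:, None]**2 + k[None, :]**2
    k2[0, 0] = 1.0                                        # avoid 0/0, fix gauge
    phi_hat = np.fft.fft2(rho) / (eps0 * k2)
    phi_hat[0, 0] = 0.0
    return np.fft.ifft2(phi_hat).real

def mesh_to_particles(phi, ix, iy, d):
    """Step 3 (m2p): interpolate the mesh E-field back to the particles."""
    ex, ey = np.gradient(-phi, d)                         # E = -grad phi
    return ex[ix, iy], ey[ix, iy]

# one space charge kick for a round Gaussian slice on a 128 x 128 mesh
n, L, n_mp = 128, 1e-2, 100000
x, y = np.random.normal(L / 2, L / 10, (2, n_mp))
rho, ix, iy = particles_to_mesh(x, y, np.full(n_mp, 1e-15), n, L / n)
ex, ey = mesh_to_particles(solve_poisson(rho, L / n), ix, iy, L / n)
```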

Hockney’s algorithm
- solve the Poisson equation on a structured grid
- Green’s function: analytical solution for open boundaries
- formal solution: a convolution of the Green’s function with the charge density, O(n²) when evaluated directly
- trick: implement the convolution with FFTs on a domain of twice the size, reducing the cost to O(n log n)
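A hedged NumPy sketch of the doubled-domain trick, using the 2D free-space Green’s function (the regularisation of the r = 0 cell is an ad-hoc choice of ours; production codes use the integrated Green’s function of the next slides):

```python
import numpy as np

def hockney_solve_2d(rho, d, eps0=8.854e-12):
    """Open-boundary Poisson solve: zero-pad rho to twice the domain so the
    cyclic FFT convolution equals the aperiodic convolution phi = G * rho."""
    n = rho.shape[0]
    # symmetric mesh distances on the doubled 2n x 2n domain
    i = np.minimum(np.arange(2 * n), 2 * n - np.arange(2 * n))
    r = d * np.hypot(i[:, None], i[None, :])
    r[0, 0] = 0.5 * d                        # regularise the r = 0 singularity
    G = -np.log(r) / (2 * np.pi * eps0)      # 2D free-space Green's function
    rho_pad = np.zeros((2 * n, 2 * n))
    rho_pad[:n, :n] = rho * d**2             # cell charges, original domain only
    # FFT convolution: O(N log N) in the number of nodes N, vs. O(N^2) directly
    phi_pad = np.fft.ifft2(np.fft.fft2(G) * np.fft.fft2(rho_pad)).real
    return phi_pad[:n, :n]
```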

Integrated Green’s function
- the plain Green’s function approach has problems when the mesh has a large aspect ratio (the numerical integration uses a constant function value per cell)
- main idea of the integrated Green’s function: integrate the Green’s function analytically over each mesh cell, then sum over all cells
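The idea in formulas (our notation, not from the slide): the point-sampled Green’s function is replaced by its analytic average over each cell,

```latex
\bar G_{ij} = \frac{1}{h_x h_y}
    \int_{x_i - h_x/2}^{x_i + h_x/2} \int_{y_j - h_y/2}^{y_j + h_y/2}
    G(x, y)\,\mathrm{d}y\,\mathrm{d}x ,
```

which is tractable in 2D because G ∝ ln(x² + y²) has the closed-form antiderivative

```latex
\iint \ln(x^2 + y^2)\,\mathrm{d}x\,\mathrm{d}y
    = xy \ln(x^2 + y^2) - 3xy
    + x^2 \arctan\frac{y}{x} + y^2 \arctan\frac{x}{y} ,
```

evaluated at the four corners of each cell.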

Integrated Green’s function
[Figure: error of Ex vs. x, comparing IGF and GF for an aspect ratio of 1:5; from Abell et al., PAC’07]

GPUs
GPU = Graphics Processing Unit:
- threads run massively parallel: one concurrent instruction on >1000 cores → suited to large data arrays
- ‼ expensive global memory access
resources for ABP simulations:
- CERN: LIU-PS-GPU server with 4x NVIDIA Tesla C2075 cards (mid 2011)
- CNAF (Bologna): high performance cluster with 7x NVIDIA Tesla K20m (early 2013) and 8x Tesla K40m (late 2013)

How to use the GPU
- script: minimal changes needed to run on the GPU (CPU and GPU versions side by side)
- how to submit a GPU job (CNAF)
- Python: GPU data introspection works as flexibly as on the CPU (print(), calculations with GPUArrays, …)
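A minimal PyCUDA session illustrating this workflow (the array contents and names are our example, not from the slide):

```python
import numpy as np
import pycuda.autoinit              # set up a CUDA context on the default GPU
import pycuda.gpuarray as gpuarray

# move data to the device; GPUArrays mimic the numpy array interface
x = gpuarray.to_gpu(np.linspace(0.0, 1.0, 8))
y = 2.0 * x + 1.0                   # elementwise arithmetic runs on the GPU

print(y)                            # introspection works just like on the CPU
print(y.get().sum())                # .get() copies the result back to the host
```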

Parallelisation Approach
iterative cycle: profiling → identify bottleneck → optimise code → verify functionality → (repeat)

Different Bottlenecks: CPU vs. GPU
- CPU: FFT solving is the bottleneck (FFT: O(nx² log nx) vs. p2m: O(nx²))
- GPU: particle-to-mesh deposition is the bottleneck

Implementation of 3 Steps
particle-in-cell algorithm on the GPU:
- particles to mesh (p2m): 1) atomicAdd, one thread per particle; 2) parallel sort, one thread per cell
- solve: cuFFT (parallel FFT)
- mesh to particles (m2p): one thread per particle

Variant 1 of p2m
- 1 thread per particle
- race condition when several particles write to the same node → atomicAdd properly serialises the memory updates → slow but correct
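A sketch of what such a kernel can look like, using PyCUDA as PyPIC does (nearest-grid-point weighting; float precision, since native double atomicAdd needs newer hardware than the Fermi/Kepler cards listed above; kernel and launch parameters are our illustration, not PyPIC’s actual code):

```python
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void p2m_atomic(const float* x, const float* y, const float* q,
                           float* rho, int n_mp, int nx, int ny, float d)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 1 thread per particle
    if (i >= n_mp) return;
    int ix = (int)(x[i] / d + 0.5f);                 // nearest grid point
    int iy = (int)(y[i] / d + 0.5f);
    if (ix < 0 || ix >= nx || iy < 0 || iy >= ny) return;
    // concurrent writes to the same node are serialised by the hardware
    atomicAdd(&rho[ix * ny + iy], q[i]);
}
""")
p2m_atomic = mod.get_function("p2m_atomic")

n_mp, nx = 1000000, 128
x = gpuarray.to_gpu(np.random.rand(n_mp).astype(np.float32))
y = gpuarray.to_gpu(np.random.rand(n_mp).astype(np.float32))
q = gpuarray.to_gpu(np.full(n_mp, 1e-15, dtype=np.float32))
rho = gpuarray.zeros((nx, nx), dtype=np.float32)
p2m_atomic(x.gpudata, y.gpudata, q.gpudata, rho.gpudata,
           np.int32(n_mp), np.int32(nx), np.int32(nx), np.float32(1.0 / nx),
           block=(256, 1, 1), grid=((n_mp + 255) // 256, 1))
```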

Variant 2 of p2m
- 1 thread per mesh node
- sort the particles by node index (optimises memory access!)
- avoids the race condition (no concurrent writes)
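The same idea in NumPy pseudocode (on the GPU the sort would be thrust::sort_by_key and each node would get one thread; the helper name is ours):

```python
import numpy as np

def p2m_sorted(node_idx, q, n_nodes):
    """Deposit without atomics: sort the particles by mesh-node index so each
    node owns one contiguous segment, then reduce every segment independently.
    No two writers ever touch the same node."""
    order = np.argsort(node_idx)                   # the (coalescing) sort step
    idx_sorted, q_sorted = node_idx[order], q[order]
    nodes, first = np.unique(idx_sorted, return_index=True)
    rho = np.zeros(n_nodes)
    rho[nodes] = np.add.reduceat(q_sorted, first)  # one segment per occupied node
    return rho
```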

Different numerical models
2.5D:
- slice the bunch into n slices and solve n independent 2D Poisson equations
- approximation: the bunch is very long
- CPU: serial; GPU: computes all slices simultaneously
3D:
- solve the full 3D bunch on a 3D grid
- CPU: not implemented (very slow); GPU: large memory requirements due to Hockney’s algorithm
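A rough sketch of the 2.5D bookkeeping (deposit_2d and solve_2d stand for routines like those sketched earlier; the serial loop marks exactly what the GPU replaces with one batched call over all slices):

```python
import numpy as np

def space_charge_kick_2p5d(x, y, z, q, n_slices, deposit_2d, solve_2d):
    """Slice the bunch longitudinally and solve one independent transverse
    2D Poisson problem per slice (valid for a very long bunch)."""
    edges = np.linspace(z.min(), z.max(), n_slices + 1)
    slice_of = np.clip(np.digitize(z, edges) - 1, 0, n_slices - 1)
    phi = []
    for s in range(n_slices):       # CPU: serial; GPU: one batched FFT call
        mask = slice_of == s
        phi.append(solve_2d(deposit_2d(x[mask], y[mask], q[mask])))
    return np.stack(phi), slice_of
```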

Numeric Parameter Scans: Fixed nx
[Figure: timing scans at fixed mesh sizes 256x256 and 512x512, with speed-up factors x4 and x2 annotated]

Timing: Fully Loaded GPU Parameters
- the 2.5D model works well at high particle numbers; at low numbers the GPU is far from fully exploited!
- different slopes for CPU vs. GPU (characteristic behaviour)
- the newer hardware at CNAF is more efficient (x1.8)

Timing: CUDA 6 vs. CUDA 7
- speedup of up to x1.5, due to a faster implementation of the sorting algorithm (Thrust 1.8) and a better 2D cuFFT (measured at CNAF)

Summary
PyHEADTAIL now offers 2.5D (slice-by-slice transverse) and 3D self-consistent direct space charge models (on CPU and GPU):
- the 3D model allows cross-checking the 2.5D approximations
- the GPU gives speed-ups of ≈13x for large meshes and particle numbers
- wide numeric parameter spaces are available now!
- larger resolutions help to mitigate noise effects (artefacts such as numerical emittance blow-up)
- improved validity for long simulations (real machine time)
next steps: SPS simulations (resonances)

Specifications of Used GPU Machines
available machines at CNAF: [specification table not preserved in the transcript]

Specification of Used CPU Machine
LIU-PS-GPU CPU: [specification table not preserved in the transcript]

PyPIC on GPU
standalone Python module:
- GPU interfacing via PyCUDA/Thrust
- flexible 2D/3D (integrated) Green’s function solvers
- cuFFT
(new interface under branch: new_pypic_cpu_and_gpu)

Timing: Fully Loaded GPU Parameters II
- on the GPU, particle-to-mesh deposition dominates
- for a fixed mesh size, depositing more macro-particles onto the same grid induces memory bandwidth limitations on the speed-up