Development of a GPU-based PIC
Martina Sofranac
9.10.2015
CST PARTICLE STUDIO
Particle in Cell
The PIC loop repeats every time step:
Update electric and magnetic field
Interpolate field at particle positions
Update momenta and positions
Calculate current distribution from charged particle motion
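As a structural sketch, this cycle can be written as a host-side CUDA loop that launches one kernel per step. The kernel names, signatures, and launch configuration below are placeholders for illustration, not CST Particle Studio code, and the kernel bodies are intentionally left empty.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels: each stands in for one step of the PIC loop.
// The real field solver, interpolation, pusher, and current deposition are omitted.
__global__ void updateFields(float *E, float *B, const float *J, int nCells) { /* field update omitted */ }
__global__ void gatherFields(const float *E, const float *B, const float *pos,
                             float *Ep, float *Bp, int nPart) { /* field interpolation omitted */ }
__global__ void pushParticles(float *pos, float *mom, const float *Ep, const float *Bp,
                              float dt, int nPart) { /* momentum/position update omitted */ }
__global__ void depositCurrent(const float *pos, const float *mom, float *J,
                               int nPart) { /* current deposition omitted */ }

void picTimeLoop(float *E, float *B, float *J, float *pos, float *mom,
                 float *Ep, float *Bp, int nCells, int nPart, int nSteps, float dt)
{
    const int tpb = 256;
    dim3 cellGrid((nCells + tpb - 1) / tpb), partGrid((nPart + tpb - 1) / tpb);

    for (int step = 0; step < nSteps; ++step) {
        updateFields<<<cellGrid, tpb>>>(E, B, J, nCells);                 // update E and B
        gatherFields<<<partGrid, tpb>>>(E, B, pos, Ep, Bp, nPart);        // interpolate fields at particles
        pushParticles<<<partGrid, tpb>>>(pos, mom, Ep, Bp, dt, nPart);    // update momenta and positions
        depositCurrent<<<partGrid, tpb>>>(pos, mom, J, nPart);            // calculate current distribution
    }
    cudaDeviceSynchronize();
}
```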
CPU vs GPU: different design philosophies
PIC on Multi-GPU: Domain Decomposition
Exchange field data between GPUs
Exchange particle data between GPUs
Calculate current distribution from charged particle motion
Update electric and magnetic field
Interpolate field at particle positions
Update momenta and positions
Domain Decomposition
[Diagram: subdomains on GPU i-1, GPU i, GPU i+1, separated by cutting planes i and i+1; data at the cutting planes is staged through pinned memory]
The domain is decomposed into one subdomain per GPU, with the cutting planes placed along the direction in which the CPU-GPU transfers are minimal.
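A minimal sketch of how the data at a cutting plane could be staged through pinned host memory between two GPUs; the PlaneBuffer struct, the device pointers, and planeBytes are assumptions made for illustration, not the solver's data structures.

```cuda
#include <cuda_runtime.h>

// Staging buffer for one cutting plane, kept in pinned (page-locked) host memory.
// It is allocated once and reused every time step, since pinned allocation is expensive.
struct PlaneBuffer {
    void  *host;     // pinned host staging area
    size_t bytes;    // size of the field data adjacent to the cutting plane
};

void createPlaneBuffer(PlaneBuffer &buf, size_t planeBytes)
{
    buf.bytes = planeBytes;
    cudaHostAlloc(&buf.host, planeBytes, cudaHostAllocDefault);
}

// Move the data next to a cutting plane from GPU devSrc to GPU devDst.
// d_send / d_recv are device pointers to that plane on the two GPUs (placeholders).
void exchangePlane(const PlaneBuffer &buf, int devSrc, const void *d_send,
                   int devDst, void *d_recv)
{
    cudaSetDevice(devSrc);
    cudaMemcpy(buf.host, d_send, buf.bytes, cudaMemcpyDeviceToHost);
    cudaSetDevice(devDst);
    cudaMemcpy(d_recv, buf.host, buf.bytes, cudaMemcpyHostToDevice);
}

void destroyPlaneBuffer(PlaneBuffer &buf) { cudaFreeHost(buf.host); }
```

Allocating the staging buffer once and reusing it every step matters, because pinned allocation itself is costly, as the following slides show.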
Exchange of data
Pinned vs. Non-Pinned Memory
One allocation, one transfer to the GPU, one copy of the data back to the CPU, and one deallocation
Number of transfers needed to break even, compared with the (number of time steps) * 2 transfers performed during a simulation
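A small host-side benchmark of that cycle, as a sketch only; the 64 MiB buffer size is arbitrary, and the measured numbers depend on the system.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Times the cycle from the slide (allocate, copy to GPU, copy back, free)
// for pageable vs. pinned host memory.
static double timeCycleMs(bool pinned, size_t bytes, void *d_buf)
{
    auto t0 = std::chrono::high_resolution_clock::now();

    void *h_buf;
    if (pinned) cudaHostAlloc(&h_buf, bytes, cudaHostAllocDefault);
    else        h_buf = malloc(bytes);

    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost);

    if (pinned) cudaFreeHost(h_buf);
    else        free(h_buf);

    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const size_t bytes = 64u << 20;   // 64 MiB, arbitrary
    void *d_buf;
    cudaMalloc(&d_buf, bytes);

    printf("pageable cycle: %.2f ms\n", timeCycleMs(false, bytes, d_buf));
    printf("pinned   cycle: %.2f ms\n", timeCycleMs(true,  bytes, d_buf));
    // Pinned allocation is more expensive, but each transfer is faster;
    // with (number of time steps) * 2 transfers per run, a pinned buffer
    // that is allocated once quickly breaks even.
    cudaFree(d_buf);
    return 0;
}
```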
DRAM and Memory Coalescing
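For the particle data, whether DRAM accesses coalesce is largely a question of layout; the structure-of-arrays versus array-of-structures comparison below is a generic illustration, not the layout used in CST Particle Studio.

```cuda
// Array-of-structures: threads in a warp read x from addresses 32 bytes apart
// (sizeof(ParticleAoS) = 32), so each field access splits into several transactions.
struct ParticleAoS { float x, y, z, px, py, pz, q, w; };

__global__ void pushAoS(ParticleAoS *p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x += p[i].px * dt;   // strided, poorly coalesced
}

// Structure-of-arrays: consecutive threads read consecutive floats, so a warp's
// loads and stores coalesce into a few wide DRAM transactions.
struct ParticleSoA { float *x, *y, *z, *px, *py, *pz, *q, *w; };

__global__ void pushSoA(ParticleSoA p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] += p.px[i] * dt;   // contiguous, coalesced
}
```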
Extraction from a dynamical set of data
Examples: Magnetron, Gap Filter, BWO, TWT
Privatization concept
Avoid contention by aggregating updates locally
Requires storage resources to keep copies of data structures
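A minimal illustration of the idea, using a histogram whose per-block private copy lives in shared memory and is merged into the global copy once per block; the bin count and key handling are arbitrary choices for the sketch.

```cuda
// Each block keeps a private copy of the histogram in shared memory, so most
// atomicAdd contention stays on-chip; only NUM_BINS global atomics per block
// touch the shared copy in DRAM.
#define NUM_BINS 64

__global__ void histogramPrivatized(const unsigned int *keys, int n, unsigned int *globalHist)
{
    __shared__ unsigned int localHist[NUM_BINS];

    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        localHist[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[keys[i] % NUM_BINS], 1u);   // cheap shared-memory atomic
    __syncthreads();

    // Aggregated update: one global atomic per bin per block.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&globalHist[b], localHist[b]);
}
```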
Kernel for extraction of particle data
[Diagram: block counters feed per-block sub-buffers; a global counter assigns each sub-buffer its place in the result buffer in DRAM]
Result of atomicAdd on a block counter -> in registers
Result of atomicAdd on the global counter -> in shared memory
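A sketch of such an extraction kernel, following the two-level scheme of the slide: each thread obtains its slot in the per-block sub-buffer from an atomicAdd on a shared-memory block counter (the result stays in a register), and one thread reserves the block's range in the DRAM buffer with an atomicAdd on the global counter (the result is broadcast through shared memory). The Particle struct and the selection predicate are placeholders.

```cuda
// Compacts the particles that satisfy a predicate (here a placeholder flag)
// into a dense output buffer.
struct Particle { float x, y, z, px, py, pz; int flag; };

__global__ void extractParticles(const Particle *particles, int n,
                                 Particle *outBuffer, unsigned int *globalCounter)
{
    __shared__ unsigned int blockCounter;       // number of hits in this block
    __shared__ unsigned int blockBase;          // this block's offset in outBuffer
    extern __shared__ unsigned int localIdx[];  // per-block sub-buffer of particle indices

    if (threadIdx.x == 0) blockCounter = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool hit = (i < n) && (particles[i].flag != 0);
    if (hit) {
        unsigned int myPos = atomicAdd(&blockCounter, 1u);  // result kept in a register
        localIdx[myPos] = (unsigned int)i;
    }
    __syncthreads();

    if (threadIdx.x == 0)                                   // reserve a range in DRAM
        blockBase = atomicAdd(globalCounter, blockCounter); // result kept in shared memory
    __syncthreads();

    // Copy this block's sub-buffer into its reserved range of the DRAM buffer.
    for (unsigned int k = threadIdx.x; k < blockCounter; k += blockDim.x)
        outBuffer[blockBase + k] = particles[localIdx[k]];
}
```

The kernel would be launched with blockDim.x * sizeof(unsigned int) bytes of dynamic shared memory for the sub-buffer, e.g. extractParticles<<<grid, block, block * sizeof(unsigned int)>>>(...).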
Current interpolation scheme
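The slide does not spell out the scheme itself; as a generic illustration, a cloud-in-cell (linear weighting) deposition of a particle's current onto a 1D grid could look like the sketch below. This is not necessarily the scheme used in the solver, and the boundary handling is simplified.

```cuda
// Generic cloud-in-cell deposition of particle currents onto a 1D grid.
// atomicAdd resolves the case of several particles writing to the same cell.
// dx is the cell size, Jx the grid current density.
__global__ void depositCurrent1D(const float *x, const float *vx, const float *q,
                                 int nPart, float *Jx, int nCells, float dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPart) return;

    float xi   = x[i] / dx;                 // position in cell units
    int   cell = (int)floorf(xi);
    float frac = xi - (float)cell;          // distance from the left node
    float cur  = q[i] * vx[i] / dx;         // particle's contribution

    if (cell >= 0 && cell < nCells - 1) {   // edge cells skipped in this sketch
        atomicAdd(&Jx[cell],     cur * (1.0f - frac));  // share to left node
        atomicAdd(&Jx[cell + 1], cur * frac);           // share to right node
    }
}
```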
Standard domain decomposition
[Diagram: computational domain split into subdomains with 'halo' layers at the boundaries]
Load Balancing
Option 1: Dynamically load-balanced domain decomposition
Option 2: Load-balanced splitting of particles, with the whole domain on every GPU
[Diagram: a CPU buffer is used for the exchange of current densities between the GPUs]
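For Option 1, one simple host-side heuristic is to shift a cutting plane toward the more loaded neighbor whenever the particle counts on its two sides drift apart; the counts, tolerance, and plane index below are assumptions for illustration, not the scheme used in the solver.

```cuda
// Host-side rebalancing heuristic for Option 1: if one GPU holds noticeably
// more particles than its neighbor, move the shared cutting plane by one
// mesh plane toward the heavier side.
void rebalanceCuttingPlane(long particlesLeft, long particlesRight,
                           int *cuttingPlane, int minPlane, int maxPlane,
                           double tolerance /* e.g. 0.1 = 10% imbalance */)
{
    double total = (double)(particlesLeft + particlesRight);
    if (total <= 0.0) return;

    double imbalance = (double)(particlesLeft - particlesRight) / total;

    if (imbalance > tolerance && *cuttingPlane > minPlane)
        --(*cuttingPlane);           // left side too heavy: shrink it
    else if (imbalance < -tolerance && *cuttingPlane < maxPlane)
        ++(*cuttingPlane);           // right side too heavy: shrink it
    // After a move, the field and particle data of the transferred mesh
    // plane must be migrated between the two GPUs.
}
```

With Option 2 the grid is replicated on every GPU, so no mesh planes move at all; instead the per-GPU current densities are summed through the CPU buffer each time step.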
Example 1: Results

Mesh Cells   Number of Particles   Simulation Time, 1 GPU [s]   Simulation Time, 2 GPUs [s]   Speed-up factor
81 920       41 876                50                           39                            1.22
128 180                            84                           65                            1.23
234 900                            106                          74                            1.30
490 100                            173                          124                           1.28
918 836                            296                          212                           1.29
4 320 000                          3326                         1922                          1.73
Comparison of different architectures
Simulation of a klystron with 1.4 million cells and 30 000 particles

CPU [s]   Fermi C2050 [s]   Kepler K40 [s]   2 Fermi Cards [s]
10 517    5 224             4 321            3 862

Number of cells [millions]   1 Fermi [s]   2 Fermi [s]   Speed-up
2.4                          6 458         4 473         1.44
3.1                          8 369         5 542         1.51
Klystron example
Conclusion: Kepler K80 versus Tesla C2050 gives a speed-up of 3.45
THANK YOU!