GPU Acceleration of Particle-In-Cell Methods
B. M. Cowan, J. R. Cary, S. W. Sides
Tech-X Corporation

GPU characteristics
- Graphics processing units (GPUs) offer tremendous computational performance
- Much greater processing capability per unit of monetary and power cost
- Achieved through massive parallelism
- Example: NVIDIA Tesla K20
  - 1.17 Tflop/s (double precision), 3.52 Tflop/s (single precision)
  - 13 streaming multiprocessors with 192 cores each = 2,496 processor cores
  - 5 GB memory with 208 GB/s bandwidth
  - Cost: ~$3000
  - Power consumption: 225 W

GPU restrictions
- Multiprocessors are SIMD devices
  - Warps of 32 threads must execute the same instruction
  - Threads can be grouped into (larger) blocks for convenience
  - Branches can effectively stall threads: all threads go through all logical paths
- Full bandwidth of global memory is realized only if memory accesses are coalesced
  - Blocks of 128 bytes accessed by consecutive groups of threads (see the sketch below)
- Multiprocessor shared memory can be accessed without such restrictions by all threads in a block, but has limited size
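A minimal sketch of why coalescing pushes particle data toward a structure-of-arrays (SoA) layout: thread i reads x[i], so a warp's 32 loads fall in consecutive 4-byte words and combine into a few 128-byte transactions. The names here are illustrative, not VSim's actual data structures.

    // SoA layout: one contiguous array per particle component.
    struct ParticlesSoA {
        float *x, *y, *z;     // positions
        float *ux, *uy, *uz;  // momenta
    };

    __global__ void pushX(ParticlesSoA p, float dt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            p.x[i] += p.ux[i] * dt;  // coalesced: consecutive threads touch consecutive words
    }

An array-of-structs layout would stride each load by the struct size, splitting every warp access across many memory transactions and wasting bandwidth.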

The particle-in-cell algorithm
- Lorentz force interpolated from gridded fields
- Currents deposited to grid from particles
- Each time step cycles through four phases (a sketch of this cycle as GPU kernel launches follows below):
  1. Push particles
  2. Deposit current to grid
  3. Advance fields
  4. Interpolate fields to particle positions
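A minimal sketch of one PIC step as a sequence of kernel launches, following the cycle above. The kernel names and flat argument structs are hypothetical, for illustration only; the kernel bodies are elided.

    struct ParticleData { float *x, *ux; int n; };
    struct FieldData { float *E, *B, *J; int nCells; };

    __global__ void interpFields(ParticleData p, FieldData f) { /* elided */ }
    __global__ void pushParticles(ParticleData p, float dt)   { /* elided */ }
    __global__ void depositCurrent(ParticleData p, FieldData f) { /* elided */ }
    __global__ void advanceFields(FieldData f, float dt)      { /* elided */ }

    void picStep(ParticleData p, FieldData f, float dt)
    {
        int tpb = 256;                       // threads per block
        int pb = (p.n + tpb - 1) / tpb;      // blocks over particles
        int cb = (f.nCells + tpb - 1) / tpb; // blocks over cells
        interpFields<<<pb, tpb>>>(p, f);     // grid fields -> particle positions
        pushParticles<<<pb, tpb>>>(p, dt);   // Lorentz-force phase space advance
        depositCurrent<<<pb, tpb>>>(p, f);   // particle currents -> grid J
        advanceFields<<<cb, tpb>>>(f, dt);   // e.g. FDTD update of E and B
        cudaDeviceSynchronize();             // finish the step before the next one
    }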

Difficulties of PIC on GPU
- Both field update and particle phase space advance are straightforward
  - For the field update, each thread updates a cell
  - For the particle push, each thread updates a particle
- Field interpolation and current deposition present problems
  - It's not known a priori which cells particles occupy, and hence which field values are needed
  - Naïve one-particle-per-thread memory accesses won't be coalesced
  - Deposition may also experience race conditions: multiple threads try to write the same current value (see the atomic-deposition sketch below)
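The simplest way to make scatter-style deposition safe is atomics, shown in the 1D linear-weighting sketch below (illustrative names; assumes J has a ghost node at the right edge). Atomics serialize colliding writes, which is correct but slow under contention, and is part of why tiling and reduction schemes are attractive.

    __global__ void deposit1D(const float *x, const float *q, int n,
                              float *J, float dxInv)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float xc = x[i] * dxInv;      // position in cell units
        int c = (int)floorf(xc);
        float w = xc - c;             // linear weight to right-hand node
        atomicAdd(&J[c],     q[i] * (1.0f - w));  // may collide with other threads
        atomicAdd(&J[c + 1], q[i] * w);           // atomicAdd resolves the race safely
    }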

Some techniques
- Optimize memory access (see the interpolation sketch after this list)
  - For interpolation, read fields into shared memory
  - Can then interpolate using one particle per thread
  - But need to be careful about available shared memory size
- Tile the particles
  - Group the particles by small rectangular regions (tiles) of cells
  - Particles in a tile will generally be processed by the same thread block
  - Tile size trade-off: smaller tiles increase occupancy, but each tile needs guard cells
  - Tile size must be set dynamically, based on problem specifications and hardware
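A 1D sketch of the combined idea, assuming particles are already grouped by tile and each block owns one tile (all names and the tile bookkeeping arrays are illustrative). The block cooperatively stages its tile's field values, plus one guard cell, into shared memory, then threads interpolate for the tile's particles.

    constexpr int TILE_CELLS = 128;   // cells per tile; set dynamically in practice

    __global__ void interpTile(const float *Ex, const int *tileFirstCell,
                               const int *tileFirstPart, const int *tilePartCount,
                               const float *x, float *ExAtP, float dxInv)
    {
        __shared__ float ExS[TILE_CELLS + 1];           // tile fields + 1 guard cell
        int base = tileFirstCell[blockIdx.x];           // first cell of this tile
        for (int c = threadIdx.x; c <= TILE_CELLS; c += blockDim.x)
            ExS[c] = Ex[base + c];                      // cooperative staged load
        __syncthreads();

        int i0 = tileFirstPart[blockIdx.x];             // this tile's particle range
        for (int k = threadIdx.x; k < tilePartCount[blockIdx.x]; k += blockDim.x) {
            float xc = x[i0 + k] * dxInv - base;        // position within the tile
            int c = (int)floorf(xc);
            float w = xc - c;
            ExAtP[i0 + k] = (1.0f - w) * ExS[c] + w * ExS[c + 1];  // linear interpolation
        }
    }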

General coding principles
- Portability
  - Write main computational procedures (e.g. cell field update, particle push) as functions that can be executed on both host and device (see the sketch below)
  - Also lets us take advantage of MIC and vectorization
  - On the CPU, the function is executed in a loop
  - On the GPU, the function is executed by a thread
- Generality
  - Design main management routines to work with multiple algorithm variants
  - Different types of field updates, e.g. absorbing boundaries, controlled dispersion
  - High-order particles: complicates memory management
  - Other physics: metallic boundaries, dielectric materials, collisions, cut cells, ...
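A minimal sketch of the portability pattern: the physics lives in a single __host__ __device__ function, which a CPU loop and a GPU kernel both call. Names and the simplified push are illustrative.

    __host__ __device__ inline void pushOne(float &x, float &ux,
                                            float ex, float qmdt)
    {
        ux += qmdt * ex;   // momentum update from the interpolated field
        x  += ux;          // position update (normalized units)
    }

    __global__ void pushKernel(float *x, float *ux, const float *ex,
                               float qmdt, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) pushOne(x[i], ux[i], ex[i], qmdt);  // one particle per thread
    }

    void pushHost(float *x, float *ux, const float *ex, float qmdt, int n)
    {
        for (int i = 0; i < n; ++i)       // same body, CPU loop;
            pushOne(x[i], ux[i], ex[i], qmdt);  // compiler can vectorize this
    }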

Status of full GPU PIC in VSim
- Work in progress, part of an ongoing DARPA project
- Completed main interpolate-push-deposit update
  - Results correct in basic tests
- Coded consistently with general practices for good GPU performance, but not yet optimized
- Still, we can start to get insight into performance trade-offs

Performance scaling with PPC
- Tests run for the interpolate/push part of the algorithm
- As expected, more particles per cell (PPC) give better per-particle performance
  - Amortizes the time to load field data from global memory

Performance scaling with simulation size
- Scaling with the number of cells in the domain is more complex
  - Number of cells per tile held constant
- Could be seeing effects of register pressure, limited shared memory, or occupancy of the SMs
- Sizing tiles to the shared memory limit is not the most performant choice

Code considerations
- VSim uses the Vorpal engine, which was written from the ground up in C++
- Different algorithms are selected through run-time polymorphism with virtual functions
- This was great in 2007; now this approach has limitations
  - Hardware considerations: want to avoid branching
  - CUDA restrictions: an object passed by value to a kernel must be "flat", with no virtual methods or bases:

        __global__ void kernel(Object myObj) { /* ... */ }

- Run-time polymorphism still OK for high-level logic, but use template policy classes for low-level logic (see the sketch below)
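A minimal sketch of the policy-class pattern under these constraints (illustrative types, not VSim's actual classes): the algorithm variant is a template parameter resolved at compile time, so the kernel argument stays flat, the call inlines, and the inner loop carries no virtual-dispatch branch.

    struct StandardUpdate {
        __host__ __device__ float apply(float e, float curlB, float dt) const
        { return e + dt * curlB; }
    };

    struct DampedUpdate {      // e.g. an absorbing-boundary variant
        float damp;
        __host__ __device__ float apply(float e, float curlB, float dt) const
        { return damp * (e + dt * curlB); }
    };

    template <typename UpdatePolicy>
    __global__ void fieldUpdate(float *E, const float *curlB, float dt,
                                int n, UpdatePolicy u)  // by value: must be flat
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) E[i] = u.apply(E[i], curlB[i], dt);  // inlined, branch-free
    }

    // Usage: fieldUpdate<<<blocks, 256>>>(E, curlB, dt, n, DampedUpdate{0.99f});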

Moving toward code generality
- VSim is a full multiphysics package
  - Plasmas, metals, dielectrics, collisions, ...
- We want to enable all these features on the GPU
- Starting with grids
  - Cartesian and cylindrical coordinate systems; uniform and variable discretizations
  - All grid types except variable cylindrical implemented; uniform grids tested and working
  - Variable discretizations require runtime branching
- Collisions in progress

Next steps
- Refactoring basic classes to be GPU-friendly
  - Grids
  - Fields
- Integrate with the FDTD field update
- Implement particle sinks
- Performance testing and optimization
  - What drives the performance fluctuations as tile size and domain size are changed?
  - Are there savings to be had from data management?
    - Can we take advantage of the fact that particles move by at most one cell or tile per step?
    - Manual organization vs. global sort
- Integrate with in-progress domain decomposition work
  - Generalize to an arbitrary number of CPU cores and GPUs on a system

Acknowledgments
- Work supported by DARPA contract W31P4Q-15-C-0061 (SBIR)
- Helpful discussions with D. N. Smithe