
1 VORPAL Optimizations for Petascale Systems
Paul Mullowney*, Peter Messmer*, Ben Cowan, Keegan Amyx, Stefan Muszala (Tech-X Corporation); Boyana Norris (Argonne National Laboratory)
*mullowney@txcorp.com, *messmer@txcorp.com
Work supported by DOE Office of Science SBIR Phase II Award DE-FG02-07ER84731

2 VORPAL: Introduction
VORPAL plasma framework is widely used on leadership-class systems
– Particle-in-cell (PIC) algorithm for the kinetic plasma model
– Finite-difference time-domain (FDTD) method for the Maxwell solver
– Access to various libraries (Trilinos, PETSc) for linear system solves (electrostatic PIC)
– ADI methods for implicit Maxwell solves
Self-consistent model for charged particles and electromagnetic fields
– Electromagnetic field discretized on a 3D Cartesian mesh
– Particles located anywhere in space
– Particles gather forces from, and scatter charges/currents to, the field (see the sketch below)
Parallelization via domain decomposition
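A minimal, self-contained sketch of the gather/push/scatter cycle, reduced to normalized 1D with nearest-grid-point weighting; all names and the simplified Ampere update are our illustration, not VORPAL's 3D implementation.

    // Hypothetical 1D PIC sketch: gather E, push particles, scatter current,
    // then update the field. Normalized units, nearest-grid-point weighting.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Particle { double x, v, q; };

    int main() {
      const int nx = 64; const double dx = 1.0, dt = 0.1, L = nx * dx;
      std::vector<double> E(nx, 0.0), J(nx, 0.0);
      std::vector<Particle> ps = { {10.0, 1.0, -1.0}, {30.0, -0.5, -1.0} };
      for (int step = 0; step < 100; ++step) {
        std::fill(J.begin(), J.end(), 0.0);
        for (auto& p : ps) {
          int i = int(p.x / dx) % nx;       // nearest cell
          p.v += p.q * E[i] * dt;           // gather field, push velocity
          p.x += p.v * dt;                  // push position
          p.x -= L * std::floor(p.x / L);   // periodic wrap
          J[i] += p.q * p.v / dx;           // scatter current to the mesh
        }
        for (int i = 0; i < nx; ++i) E[i] -= J[i] * dt;  // dE/dt = -J
      }
      std::printf("E[10] = %g\n", E[10]);
    }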

3 VORPAL Optimization Challenges
Petascale systems require strong scaling
– FDTD is computationally cheap to start with
– Need good performance on small domains
  * Efficient messaging on small computational domains
  * Efficient computation and messaging for heterogeneous architectures
If particles are present, they are the main contribution to compute time
– Significant time savings via an optimized particle-push algorithm
– Key challenge: finding a good data layout
  * Optimize nearest-neighbor messaging for small domain sizes
  * Optimize for heterogeneous architectures, i.e., GPUs

4 JumpShot Messaging Patterns in an FDTD Simulation
[Figures: JumpShot traces of PETSc on BG/L and of VORPAL on BG/P]

5 Improving Parallel Efficiency with Different Messaging Patterns
Conventional field messaging: send and receive messages to/from all 26 neighbors at once
Staged messaging: send in one direction at a time, waiting for each direction to complete before starting the next (see the sketch below)
Reduces the overall number of messages: 6 instead of 26
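A hedged host-side sketch of the staged exchange: guard cells move one axis at a time, and corner/edge data arrives indirectly by riding along in successive face exchanges. The function and buffer names are illustrative, not VORPAL's messaging API, and per-axis buffer packing is elided.

    // Hypothetical staged guard-cell exchange: 6 messages (2 per axis)
    // instead of 26 (one per neighbor, including edges and corners).
    #include <mpi.h>

    void stagedExchange(double* sendLo, double* recvLo,
                        double* sendHi, double* recvHi, int count,
                        const int nbrLo[3], const int nbrHi[3], MPI_Comm comm) {
      for (int axis = 0; axis < 3; ++axis) {
        // Pass guard data toward the low neighbor; receive from the high one.
        MPI_Sendrecv(sendLo, count, MPI_DOUBLE, nbrLo[axis], axis,
                     recvHi, count, MPI_DOUBLE, nbrHi[axis], axis,
                     comm, MPI_STATUS_IGNORE);
        // Pass guard data toward the high neighbor; receive from the low one.
        MPI_Sendrecv(sendHi, count, MPI_DOUBLE, nbrHi[axis], 3 + axis,
                     recvLo, count, MPI_DOUBLE, nbrLo[axis], 3 + axis,
                     comm, MPI_STATUS_IGNORE);
        // Received guards must be unpacked (and repacked into the next
        // axis's send buffers) here, so that corner data propagates.
      }
    }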

6 Messaging Results
Staged messaging can be up to 5× faster for small domain sizes
Similar performance on Cray XT4 and BG/P
[Figures: timing comparisons on Cray XT4 and BG/P]

7 Effect of E/B Field Memory Structure on FDTD Performance
Layout A (components interleaved per cell): Ex(i,j,k), Ey(i,j,k), Ez(i,j,k), Ex(i,j,k+1), Ey(i,j,k+1), Ez(i,j,k+1), …
Layout B (each component contiguous): Ex(i,j,k), Ex(i,j,k+1), …, Ey(i,j,k), Ey(i,j,k+1), …, Ez(i,j,k), Ez(i,j,k+1), …
Memory layout is a key consideration for GPU optimization (see the sketch below)
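A hedged illustration of the two layouts (type and kernel names are ours): with the component-contiguous layout, consecutive GPU threads touch consecutive addresses, so global-memory reads coalesce.

    // Layout A: components interleaved per cell (array-of-structs).
    // Thread i reading cells[i].Ex strides 6 floats between threads:
    // uncoalesced global-memory access.
    struct CellFields { float Ex, Ey, Ez, Bx, By, Bz; };

    // Layout B: each component stored contiguously (struct-of-arrays).
    // Thread i reading Ex[i] touches consecutive addresses: coalesced.
    struct FieldArrays { float *Ex, *Ey, *Ez, *Bx, *By, *Bz; };

    __global__ void scaleEx(float* Ex, float s, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) Ex[i] *= s;  // consecutive threads, consecutive addresses
    }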

8 Using GPUs for Accelerating FDTD
Fermi: 8× improvement over Tesla-series GPUs in double precision (> 500 GFLOPS), likely 2× improvement in single precision (~2 TFLOPS)
Fermi available in Spring 2010?

9 GPU Implementations of FDTD

Implementation 1: generic 4-point stencil kernels

    for (int i = tid; i < nx; i += nThreads) {
        float r = res[i];
        r += a1 * in1[i];
        r += a2 * in2[i];
        r += a3 * in3[i];
        r += a4 * in4[i];
        res[i] = r;  // write back the accumulated stencil result
    }

– Requires 6 calls to the generic kernel to do the full FDTD update
– Makes no use of the GPU memory hierarchy; all memory accesses are global

Implementation 2: Yee-mesh-specific kernels (Faraday update shown; the Ampere update is the corresponding call)

    float ex, ey, ez;
    for (int i = tid; i < n; i += nThreads) {
        ex = Ex[i]; ey = Ey[i]; ez = Ez[i];
        Bx[i] += dtOverDy * (ez - EzYp1[i]) + dtOverDz * (EyZp1[i] - ey);
        By[i] += dtOverDz * (ex - ExZp1[i]) + dtOverDx * (EzXp1[i] - ez);
        Bz[i] += dtOverDx * (ey - EyXp1[i]) + dtOverDy * (ExYp1[i] - ex);
    }

– Reuses 3 global memory accesses (ex, ey, ez)
– No use of shared memory or other trickery
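For context, a complete kernel wrapping the Implementation 2 loop above; the grid-stride setup and the shifted-field pointers (e.g., EzYp1 as Ez offset by one y-stride) are our assumptions about how the loop would be launched, with a guard layer assumed so the offset reads stay in bounds.

    // Hypothetical standalone Faraday kernel built around Implementation 2.
    __global__ void faradayUpdate(float* Bx, float* By, float* Bz,
                                  const float* Ex, const float* Ey,
                                  const float* Ez, int n,
                                  int xStride, int yStride, int zStride,
                                  float dtOverDx, float dtOverDy,
                                  float dtOverDz) {
      const int tid = blockIdx.x * blockDim.x + threadIdx.x;
      const int nThreads = gridDim.x * blockDim.x;
      // Views of E shifted one cell ahead along each axis (guard cells
      // assumed, so i + stride never reads past the allocation).
      const float *ExYp1 = Ex + yStride, *ExZp1 = Ex + zStride;
      const float *EyXp1 = Ey + xStride, *EyZp1 = Ey + zStride;
      const float *EzXp1 = Ez + xStride, *EzYp1 = Ez + yStride;
      for (int i = tid; i < n; i += nThreads) {    // grid-stride loop
        float ex = Ex[i], ey = Ey[i], ez = Ez[i];  // reuse 3 global loads
        Bx[i] += dtOverDy * (ez - EzYp1[i]) + dtOverDz * (EyZp1[i] - ey);
        By[i] += dtOverDz * (ex - ExZp1[i]) + dtOverDx * (EzXp1[i] - ez);
        Bz[i] += dtOverDx * (ey - EyXp1[i]) + dtOverDy * (ExYp1[i] - ex);
      }
    }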

10 Timings: FDTD Performance

11 Speedup

12 Specifics
Boundary conditions/PMLs can be handled through a spatially dependent dielectric constant
Dey-Mittra cut-cell algorithms can be used through slight modifications to the 4-point stencil kernel (see the sketch below)
GPULib (https://gpulib.txcorp.com): a collection of optimized vector routines that can perform all of the above-mentioned algorithms
– Accessible from within VORPAL or from high-level languages like IDL and MATLAB
– See Peter Messmer's dinner-time presentation on GPULib at the NUG meeting (Wed. 10/7)
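A hedged sketch of one such modification: promoting the scalar stencil coefficients of Implementation 1 to per-cell arrays, which can encode a spatially dependent dielectric or Dey-Mittra cell fractions. The kernel name and coefficient arrays are ours, not GPULib's API.

    // Hypothetical 4-pt stencil variant with per-cell coefficients c1..c4;
    // a varying dielectric or cut-cell fractions are folded into these
    // arrays on the host before launch.
    __global__ void stencil4VarCoef(float* res, const float* in1,
                                    const float* in2, const float* in3,
                                    const float* in4, const float* c1,
                                    const float* c2, const float* c3,
                                    const float* c4, int nx) {
      const int tid = blockIdx.x * blockDim.x + threadIdx.x;
      const int nThreads = gridDim.x * blockDim.x;
      for (int i = tid; i < nx; i += nThreads) {
        float r = res[i];
        r += c1[i] * in1[i];  // coefficient now varies cell by cell
        r += c2[i] * in2[i];
        r += c3[i] * in3[i];
        r += c4[i] * in4[i];
        res[i] = r;
      }
    }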

13 Future Work
Fully optimize the FDTD algorithm
Move to multiple GPUs so that VORPAL can take advantage of new heterogeneous systems
VORPAL ADI algorithm on GPUs:
– GPULib has highly optimized tridiagonal and pentadiagonal solvers for "small" linear systems (< 1000 unknowns) that are easily task-farmed out to GPU thread blocks; potential for huge speedup vs. a CPU implementation (see the sketch below)
Move particle push to the GPU
Vlasov solver?
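For illustration, a hedged sketch of the task-farming idea: one small tridiagonal system per thread block, solved serially with the Thomas algorithm by thread 0 of each block. This is our simplification for clarity, not GPULib's optimized solver.

    // Hypothetical batched solve: block b owns system b. Coefficients
    // a (sub), b (diag), c (super) and RHS d are overwritten in place;
    // the solution ends up in d.
    __global__ void thomasPerBlock(float* a, float* b, float* c,
                                   float* d, int n) {
      const int sys = blockIdx.x;
      float *aa = a + sys * n, *bb = b + sys * n;
      float *cc = c + sys * n, *dd = d + sys * n;
      if (threadIdx.x == 0) {
        for (int i = 1; i < n; ++i) {      // forward elimination
          float w = aa[i] / bb[i - 1];
          bb[i] -= w * cc[i - 1];
          dd[i] -= w * dd[i - 1];
        }
        dd[n - 1] /= bb[n - 1];            // back substitution
        for (int i = n - 2; i >= 0; --i)
          dd[i] = (dd[i] - cc[i] * dd[i + 1]) / bb[i];
      }
    }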

