Lawrence Berkeley National Laboratory, Future Technologies Group: Memory-Efficient Optimization of Gyrokinetic Particle-to-Grid Interpolation for Multicore Processors.

Presentation transcript:

Memory-Efficient Optimization of Gyrokinetic Particle-to-Grid Interpolation for Multicore Processors. Kamesh Madduri, Samuel Williams, Stephane Ethier, Leonid Oliker, John Shalf, Erich Strohmaier, Katherine Yelick. Lawrence Berkeley National Laboratory (LBNL), National Energy Research Scientific Computing Center (NERSC), Princeton Plasma Physics Laboratory (PPPL).

Outline: 1. Gyrokinetic Toroidal Code. 2. Challenges for efficient PIC simulations. 3. Solutions for multicore. 4. Results. 5. Summary and Discussion.

Gyrokinetic Toroidal Code

Gyrokinetic Toroidal Simulations: Simulate the particle-particle interactions of a charged plasma in a Tokamak fusion reactor. With millions of particles per processor, the naïve O(N²) method is totally intractable. The solution is to use a particle-in-cell (PIC) method.

Particle-in-Cell Methods: Particle-in-cell (or particle-mesh) methods simulate particle-particle interactions in O(N) time by examining the field rather than individual forces. They typically iterate on four steps:
- Scatter charge: from the individual particles, determine the spatial distribution of charge (ρ).
- Poisson solve: from the charge distribution, determine the electromagnetic potential, ∇²φ ~ ρ.
- Gather: from the potential, determine the force on each particle.
- Push: the force derived from φ accelerates and moves each particle.
This requires two auxiliary meshes (arrays): the spatial distribution of charge density and the spatial distribution of electromagnetic potential. In the sequential world, the particle arrays are an order of magnitude larger than the grids. A minimal sketch of one such time step is shown below.
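For concreteness, here is a toy sketch of one time step for a 1D periodic electrostatic problem. All names are illustrative, unit charge and charge-to-mass ratio are assumed, and a few relaxation sweeps stand in for a real Poisson solver; GTC's gyrokinetic deposition and field solve are far more involved.

```c
#include <math.h>

typedef struct { double x, v; } Particle;

/* One time step of a toy 1D electrostatic PIC loop on a periodic grid of
 * ng cells of width dx.  Unit charge and unit charge-to-mass ratio assumed. */
void pic_step(Particle *p, int np, double *rho, double *phi, double *E,
              int ng, double dx, double dt)
{
    const double L = ng * dx;

    /* 1. Scatter: deposit each particle's charge onto the grid (linear weights). */
    for (int g = 0; g < ng; g++) rho[g] = 0.0;
    for (int i = 0; i < np; i++) {
        int    g = (int)(p[i].x / dx) % ng;
        double w = p[i].x / dx - (int)(p[i].x / dx);   /* fraction toward right cell */
        rho[g]            += 1.0 - w;
        rho[(g + 1) % ng] += w;
    }

    /* 2. Field solve: nabla^2 phi ~ rho.  A few in-place relaxation sweeps stand
     *    in for a proper Poisson solver here.                                      */
    for (int it = 0; it < 100; it++)
        for (int g = 0; g < ng; g++)
            phi[g] = 0.5 * (phi[(g + 1) % ng] + phi[(g - 1 + ng) % ng]
                            + rho[g] * dx * dx);

    /* 3. Compute the field E = -dphi/dx at the grid points.                        */
    for (int g = 0; g < ng; g++)
        E[g] = -(phi[(g + 1) % ng] - phi[(g - 1 + ng) % ng]) / (2.0 * dx);

    /* 4. Gather + push: interpolate E to each particle, accelerate, move, wrap.    */
    for (int i = 0; i < np; i++) {
        int    g  = (int)(p[i].x / dx) % ng;
        double w  = p[i].x / dx - (int)(p[i].x / dx);
        double Ei = (1.0 - w) * E[g] + w * E[(g + 1) % ng];
        p[i].v += Ei * dt;
        p[i].x  = fmod(p[i].x + p[i].v * dt + L, L);
    }
}
```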

Challenges: technology, PIC, GTC, memory-level parallelism, locality.

Technology: In the past, DRAM capacity per core grew exponentially. In the future, DRAM will dominate the cost and power of extreme-scale machines. As such, DRAM per socket will remain constant or grow more slowly than the number of cores. Applications must be re-optimized for a fixed DRAM budget, i.e., for sustained flop/s per byte of DRAM capacity. Algorithms and optimizations whose DRAM capacity requirements scale linearly with the number of cores are unacceptable.

PIC Challenges: Nominally, push is embarrassingly parallel, and the technologies for solving PDEs on structured grids are well developed. Unfortunately, efficient HW/SW support for gather/scatter operations is still a developing area of research (single thread, multicore, and multinode). Although particles and grid points are laid out linearly in memory, there is no correlation between the two once the particles' spatial coordinates are mapped to the grid. Thus particles update random locations in the grid, or conversely, grid points are updated by random particles. Moreover, the load-store nature of modern microprocessors demands that each update be serialized (load-increment-store), as illustrated in the sketch below.
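The hazard is visible in the deposit's one-line inner statement (illustrative names; this is not GTC source):

```c
/* The charge deposit is a single += on a shared array in the source ...          */
void deposit(double *rho, int j, double w)
{
    rho[j] += w;   /* ... but executes as three steps: load rho[j]; add w; store. */
}
/* If two threads call deposit() with the same index j, their load-increment-store
 * sequences can interleave so that one contribution is silently lost, unless the
 * update is protected by a lock or performed atomically.                          */
```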

Sequential GTC Challenges: As if this weren't enough, GTC further complicates matters: the grid is a 3D torus; grid points are spatially uniform in psi; and particles are non-circular rings (approximated by 4 points). Luckily, each ring lies within a single poloidal plane, but the radius of a ring can grow to roughly 6% of the poloidal radius. [Figure: a 2D poloidal plane (coordinates r, psi) within the 3D torus (zeta), with the four ring points a, b, c, d; mgrid = total number of grid points.]

3D Issues: Remember, GTC is a 3D code. As such, each particle is sandwiched between two poloidal planes and scatters its charge to as many as 16 points in each plane. A rough sketch of the per-particle deposit is given below.
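A rough sketch of a single particle's deposit under these constraints. The regular nr x ntheta plane, the equal 1/4 weights for the four ring points, and the 2x2 bilinear in-plane stencil are simplifying assumptions for illustration; GTC's real mesh has a varying number of poloidal points per flux surface.

```c
/* Deposit one gyrokinetic particle: 4 ring points, each spread over a 2x2
 * in-plane stencil, in both of the two sandwiching poloidal planes (up to
 * 16 grid points per plane).  Simplified sketch: regular nr x ntheta planes. */
void deposit_particle(double *plane0, double *plane1, int nr, int ntheta,
                      double q, double wzeta,            /* linear weight toward plane1 */
                      const int ir[4], const int itheta[4],    /* ring-point cells,     */
                      const double wr[4], const double wt[4])  /* in-cell fractions     */
{
    (void)nr;  /* caller is assumed to supply ir[k] in [0, nr-2]                        */
    for (int k = 0; k < 4; k++) {                  /* 4-point gyro-averaging ring       */
        double qk = 0.25 * q;                      /* each ring point carries 1/4 of q  */
        for (int dr = 0; dr <= 1; dr++)
            for (int dt = 0; dt <= 1; dt++) {      /* 2x2 bilinear stencil per plane    */
                double w = qk * (dr ? wr[k] : 1.0 - wr[k])
                              * (dt ? wt[k] : 1.0 - wt[k]);
                int idx = (ir[k] + dr) * ntheta + (itheta[k] + dt) % ntheta;
                plane0[idx] += w * (1.0 - wzeta);  /* lower poloidal plane              */
                plane1[idx] += w * wzeta;          /* upper poloidal plane              */
            }
    }
}
```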

Multicore GTC Challenges (memory-level parallelism): Although out-of-order processors reorder instructions to exploit instruction-level parallelism, they resolve the data dependencies in hardware. If the sequence load1, add1, store1, load2, add2, store2 runs on one core, the hardware can reorder it into load1, load2, add1, add2, store1, store2, assuming addresses 1 and 2 differ. However, if sequences 1 and 2 run on different cores, this benefit is lost and the programmer must manage the data dependency in software.

Multicore GTC Challenges (data locality): Multicore SMPs have complex memory hierarchies. Although the caches are coherent, data migration between caches is slow and should be avoided. Moreover, each core has a limited cache size: if the random-access working set exceeds it, performance is diminished. Given the random-access (scatter/gather) nature of GTC, how do we partition the problem to mitigate these limitations?

Multicore Solutions

Focus: Charge Deposition. The charge deposition phase is the most complex: it must solve the data dependency challenges in addition to the data locality challenges found in the gather phase. As such, this talk focuses on optimizing the charge deposition (scatter) phase for shared-memory (threaded) multicore environments. In the MPI version of GTC, the torus is first partitioned in zeta (around the torus) into poloidal planes (one per process). Unfortunately, the physics limits this decomposition to a relatively small number of processes. Currently, additional processes work collaboratively on each poloidal plane, reducing their copies together at the end of the scatter. We explore threading rather than MPI parallelization of each plane.

Managing Data Locality (Particle Decomposition): Throughout this work we use a simple 1D decomposition of the particle array. Particles are initially sorted by their radial coordinate, and each thread is assigned a contiguous block of the sorted array. A sketch of this decomposition follows.
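A minimal sketch of the radial sort and the 1D block assignment (the particle fields and function names are illustrative, not GTC's actual data structures):

```c
#include <stdlib.h>

typedef struct { double r, theta, zeta, weight; } Particle;   /* illustrative fields */

/* Comparator for sorting particles by radial coordinate. */
static int cmp_radial(const void *a, const void *b)
{
    double ra = ((const Particle *)a)->r, rb = ((const Particle *)b)->r;
    return (ra > rb) - (ra < rb);
}

/* Radial sort, done once (or periodically) before the threaded scatter. */
void sort_particles_radially(Particle *p, int np)
{
    qsort(p, np, sizeof(Particle), cmp_radial);
}

/* 1D decomposition: thread t of nthreads owns the contiguous block [*start, *end). */
void particle_block(int np, int nthreads, int t, int *start, int *end)
{
    *start = (int)((long long)np *  t      / nthreads);
    *end   = (int)((long long)np * (t + 1) / nthreads);
}
```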

Managing Data Locality (Grid Decomposition): We explored four different strategies for managing data locality and total memory usage. In all cases there is a shared grid, which may be augmented with private per-thread copies; a thread updates its own copy of the grid if possible, and the shared grid otherwise. The four strategies are:
- shared grid: no replication, full overlap
- partitioned grid: O(1) replicas, no overlap
- partitioned grid (w/ghosts): O(1) replicas, partitions overlap by rmax/16
- replicated grids: O(P) replicas, no overlap
A sketch of the deposit logic for the partitioned grid with ghosts appears below.
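A sketch of the resulting deposit path for the "partitioned grid w/ghosts" case. All names, the slice-by-radius layout, and the per-ring locks are placeholders for illustration rather than the paper's exact implementation.

```c
#include <pthread.h>

/* Each thread owns a radial slice of the grid (plus a small ghost overlap);
 * deposits inside the slice go to the thread's private copy with no
 * synchronization, everything else falls back to the shared grid under a lock. */
typedef struct {
    double *private_grid;       /* this thread's slice, radial rows [lo, hi)      */
    int     lo, hi;             /* radial index range owned by this thread        */
    double *shared_grid;        /* full grid shared by all threads                */
    pthread_mutex_t *locks;     /* one lock per radial ring of the shared grid    */
} ThreadGrid;

static inline void deposit(ThreadGrid *g, int ir, int itheta, int ntheta, double w)
{
    if (ir >= g->lo && ir < g->hi) {
        g->private_grid[(ir - g->lo) * ntheta + itheta] += w;   /* fast, unsynced  */
    } else {
        pthread_mutex_lock(&g->locks[ir]);                      /* rare, slow path */
        g->shared_grid[ir * ntheta + itheta] += w;
        pthread_mutex_unlock(&g->locks[ir]);
    }
}
```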

Example #1: Consider an initial distribution of particles on the shared grid. As the grid is a single shared data structure, all updates require some form of synchronization.

Example #2: When using the partitioned grid, we see that some accesses go to the private partitions, but others go to the shared grid, where they will need some form of synchronization.

Managing Data Hazards (Synchronization): We explored five different synchronization strategies:
- coarse: lock all r & zeta for a given psi (2 rings)
- medium: lock all zeta for a given r & psi (2 grid points)
- fine: lock one grid point at a time
- atomic: 64b FP atomic increment via CAS (required some assembly/intrinsics)
- none: one barrier at the end of the scatter phase
Remember: the coarser the lock, the more the locking overhead is amortized, but the less concurrency is available. A sketch of the CAS-based atomic increment follows.
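The "atomic" strategy can be sketched as a compare-and-swap loop on the 64-bit image of the double. This version uses a GCC __sync builtin for brevity; the paper's implementation used platform-specific assembly/intrinsics.

```c
#include <stdint.h>
#include <string.h>

/* 64-bit floating-point atomic increment built from a CAS retry loop. */
static inline void atomic_add_double(volatile double *addr, double val)
{
    uint64_t old_bits, new_bits;
    double   old_val, new_val;
    do {
        old_val = *addr;                 /* snapshot the current value            */
        new_val = old_val + val;
        memcpy(&old_bits, &old_val, sizeof old_bits);   /* reinterpret as bits    */
        memcpy(&new_bits, &new_val, sizeof new_bits);
        /* Retry if another thread changed *addr between the read and the CAS.    */
    } while (!__sync_bool_compare_and_swap((volatile uint64_t *)addr,
                                           old_bits, new_bits));
}
```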

Visualizing Locking Granularity. [Figure: coarse locking vs. medium/fine locking.] Note that medium locking locks the same grid point in both sandwiching poloidal planes, whereas fine locking locks the point in one plane at a time.

Locality × Synchronization: There are 20 combinations of grid decomposition and synchronization strategy. However, 3 will not guarantee correct results (they lack required synchronization) and 4 are nonsensical (synchronization where none is required). As such, only 13 needed to be implemented:
- shared grid: coarse, medium, fine, or atomic (with none, results are incorrect)
- partitioned grid: coarse, medium, fine, or atomic (with none, results are incorrect)
- partitioned grid (w/ghosts): coarse, medium, fine, or atomic (with none, results are incorrect)
- replicated grids: none (coarse, medium, fine, and atomic are nonsensical, since each thread updates only its own copy)

Miscellaneous: In addition, we implemented a number of sequential optimizations, including a structure-of-arrays data layout, explicit SIMDization (via intrinsics), data alignment, loop fusion, and process pinning. The layout change is sketched below.
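The structure-of-arrays change can be sketched as follows (the particle fields shown are illustrative, not GTC's exact phase-space coordinates):

```c
/* Array-of-structures: one record per particle (convenient, but interleaved
 * fields make aligned SIMD loads and streaming awkward).                      */
typedef struct { double r, theta, zeta, v_par, weight; } ParticleAoS;

/* Structure-of-arrays: one contiguous array per field, so each field can be
 * allocated aligned and streamed through SIMD registers independently.        */
typedef struct {
    double *r, *theta, *zeta, *v_par, *weight;   /* e.g., 16/32-byte aligned    */
} ParticlesSoA;
```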

Results

Experimental Setup: We examined charge deposition performance on three multicore SMPs:
- a dual-socket, quad-core, hyperthreaded 2.66 GHz Intel Nehalem
- a dual-socket, eight-core, 8-way VMT 1.16 GHz Sun Niagara2
- a dual-socket, quad-core 2.3 GHz AMD Barcelona (in the SC'09 paper)
Niagara2 serves as a proxy for the thread-level parallelism of tomorrow's manycore machines. Problems are parameterized by grid size (mgrid): 32K, 151K, 600K, or 2.4M points, and by particles per grid point (micell): 2, 5, 10, 20, 50, or 100. Generally, we examine the performance of the threaded variant as a function of optimization or problem size, and we also compare against the conventional-wisdom MPI version.

Performance as a function of grid decomposition and synchronization: Consider a problem with 150K grid points and 5 particles per point. As the locks become increasingly fine, the overhead of pthreads becomes an impediment, but atomic operations reduce that overhead dramatically. Nehalem did very well with the partially overlapping (ghosted) decomposition, and performance is much better than MPI. The partitioned decomposition attained performance comparable to replication. [Figure: performance in GFlop/s on Nehalem and Niagara2 for the shared, partitioned + ghosts, and reduction (replicated) variants, compared against MPI.]

Memory usage as a function of grid decomposition and synchronization: Although the threaded performance was comparable to both the MPI variant and the naïve replication approach, memory usage was dramatically improved: roughly 12x lower on Nehalem and roughly 100x lower on Niagara2. [Figure: memory usage on Nehalem and Niagara2.]

Performance as a function of problem configuration: For the memory-efficient implementations (i.e., no replication), performance generally increases with increasing particle density (higher locality) and generally decreases with increasing grid size (larger working set). On Niagara2, problems need to be large enough to avoid contention among the 128 threads. [Figure: performance on Nehalem and Niagara2 as a function of grid size (32K to 2.4M points) and particles per grid point.]

Summary & Discussion

Summary: GTC (and PIC in general) poses a number of challenges to locality, parallelism, and synchronization. Message-passing implementations won't deliver the needed efficiency, and managing data dependencies is a nightmare for shared memory. We've shown that threading the charge deposition kernel can deliver roughly twice the performance of the MPI implementation. Moreover, we've shown that we can be memory-efficient (grid partitioning with synchronization) without sacrificing performance.

Further Reading: K. Madduri, S. Williams, S. Ethier, L. Oliker, J. Shalf, E. Strohmaier, and K. Yelick, "Memory-Efficient Optimization of Gyrokinetic Particle-to-Grid Interpolation for Multicore Processors", Supercomputing (SC), 2009.

Acknowledgements: Research supported by the DOE Office of Science under contract number DE-AC02-05CH11231. Research also supported by Microsoft (Award #024263) and Intel (Award #024894) funding, with matching funding by U.C. Discovery (Award #DIG ).

Questions?