LAWRENCE BERKELEY NATIONAL LABORATORY, FUTURE TECHNOLOGIES GROUP
Lattice Boltzmann Hybrid Auto-Tuning on High-End Computational Platforms
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick
Lawrence Berkeley National Laboratory (LBNL)
National Energy Research Scientific Computing Center (NERSC)

Outline
1. LBMHD
2. Auto-tuning LBMHD on Multicore SMPs
3. Hybrid MPI+Pthreads implementations
4. Distributed, Hybrid LBMHD Auto-tuning
5. Pthread results
6. OpenMP results
7. Summary

LBMHD

LBMHD
- Lattice Boltzmann Magnetohydrodynamics (CFD + Maxwell's equations)
- Plasma turbulence simulation via the Lattice Boltzmann Method, used for simulating astrophysical phenomena and fusion devices
- Three macroscopic quantities: density; momentum (vector); magnetic field (vector)
- Two distributions:
  - momentum distribution (27 scalar components)
  - magnetic distribution (15 Cartesian vector components)
[Figure: lattice diagrams of the momentum distribution, the magnetic distribution, and the macroscopic variables]

LBMHD: Code Structure
- Time evolution proceeds through a series of collision() and stream() functions.
- When parallelized, stream() should constitute roughly 10% of the runtime.
- collision()'s arithmetic intensity:
  - must read 73 doubles and update 79 doubles per lattice update (1216 bytes)
  - requires about 1300 floating-point operations per lattice update
  - just over 1.0 flops/byte on an ideal architecture
  - suggests LBMHD is memory-bound on the XT4
- A structure-of-arrays layout (components are separated) ensures that cache capacity requirements are independent of problem size.
  - However, the TLB capacity requirement increases to >150 entries.
- Periodic boundary conditions.
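A minimal sketch of this time-stepping structure, for orientation only. Grid, collision(), and stream() are placeholders, not the real LBMHD API:

    /* sketch: time evolution as alternating collision() and stream() phases */
    void evolve(Grid *grid, int num_timesteps) {
        for (int t = 0; t < num_timesteps; t++) {
            collision(grid);  /* local, flop-heavy: ~1300 flops per lattice update   */
            stream(grid);     /* ghost-zone exchange; target is ~10% of total runtime */
        }
    }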

Auto-tuning LBMHD on Multicore SMPs
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008.

LBMHD Performance (reference implementation)
- Generally, scalability looks good.
- But is the performance good?
[Figure: reference + NUMA performance; collision() only]

Lattice-Aware Padding
- For a given lattice update, the requisite velocities map to a relatively narrow range of cache sets (lines).
- As one streams through the grid, one cannot fully exploit the capacity of the cache, because conflict misses evict entire lines.
- In a structure-of-arrays format, pad each component so that, when referenced with the relevant offsets (±x, ±y, ±z), they are uniformly distributed across the sets of the cache.
- This maximizes cache utilization and minimizes conflict misses.
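One illustrative padding heuristic, as a sketch only: it is not the tuned heuristic from this work, and the constants (cache-line size, component count, slab layout) are assumptions. The idea is simply to skew each component array's starting offset so the same (x, y, z) index in different components lands in different cache sets:

    #include <stdlib.h>

    #define NCOMP 152                     /* >150 component arrays              */
    #define LINE  (64 / sizeof(double))   /* assumed 64-byte lines, in doubles  */

    static double *slab;
    static double *grid[NCOMP];

    void allocate_padded(size_t points_per_component) {
        size_t stride = points_per_component + NCOMP * LINE;   /* room for skew */
        slab = malloc(NCOMP * stride * sizeof(double));
        for (size_t c = 0; c < NCOMP; c++)
            grid[c] = slab + c * stride + c * LINE;   /* per-component skew of c lines */
    }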

LBMHD Performance (lattice-aware array padding)
- LBMHD touches >150 arrays.
- Most caches have limited associativity, so conflict misses are likely.
- Apply the padding heuristic to the arrays.
[Figure: performance with padding vs. reference + NUMA]

Vectorization
- A lattice method's collision() operator has two phases:
  - reconstruction of the macroscopic variables
  - updating the discretized velocities
- Normally this is done one lattice point at a time.
- Change the code to process a vector's worth of points at a time (loop interchange + tuning).
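A sketch of the interchange, under assumptions: reconstruct_macroscopics() and update_velocities() are hypothetical stand-ins for the two phases, and VL is the auto-tuned vector length.

    void reconstruct_macroscopics(int i);   /* assumed helpers, defined elsewhere */
    void update_velocities(int i);

    /* reference: one lattice point at a time */
    void collision_point_at_a_time(int nx) {
        for (int i = 0; i < nx; i++) {
            reconstruct_macroscopics(i);   /* phase 1: density, momentum, B   */
            update_velocities(i);          /* phase 2: discretized velocities */
        }
    }

    /* vectorized: each phase sweeps a VL-wide block of points, so the
       macroscopic temporaries for that block stay resident in cache */
    void collision_vectorized(int nx, int VL) {
        for (int ii = 0; ii < nx; ii += VL) {
            int end = (ii + VL < nx) ? ii + VL : nx;
            for (int i = ii; i < end; i++) reconstruct_macroscopics(i);
            for (int i = ii; i < end; i++) update_velocities(i);
        }
    }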

LBMHD Performance (architecture-specific optimizations)
- Add unrolling and reordering of the inner loop.
- Additionally, exploit SIMD explicitly where the compiler does not.
- Include an SPE/local-store optimized version.
[Figure: collision()-only performance for reference + NUMA, +padding, +vectorization, +unrolling, +SW prefetching, +explicit SIMDization, +small pages; speedups of 1.6x, 4x, 3x, and 130x]

Limitations
- Ignored the MPP (distributed-memory) world.
- Kept the problem size fixed and cubical.
- When run with only one process per SMP, maximizing threads per process always looked best.

Hybrid MPI+Pthreads Implementation

Flat MPI
- In the flat MPI world, there is one process per core and only one thread per process.
- All communication is through MPI.

Hybrid MPI + Pthreads/OpenMP
- Since multicore processors already provide cache coherency for free, we can exploit it to reduce MPI overhead and traffic.
- We examine pthreads and OpenMP for threading (other possibilities exist).
- For correctness with pthreads, we must include an intra-process (thread) barrier between function calls (we wrote our own).
- OpenMP barriers implicitly at the end of each parallel #pragma.
- We can choose any balance between processes/node and threads/process (we explored powers of 2).
- Initially, we did not assume a thread-safe MPI implementation (many versions return only MPI_THREAD_SERIALIZED); as such, only thread 0 performs MPI calls. A sketch of this structure follows.
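A minimal sketch of the hybrid structure described above, assuming placeholder names (collision(), exchange_ghosts_mpi(), NTIMESTEPS) rather than the real routines. All threads compute; a process-local barrier separates phases; only thread 0 touches MPI:

    #include <mpi.h>
    #include <pthread.h>

    #define NTIMESTEPS 100                 /* placeholder                        */
    void collision(int tid);               /* placeholders, defined elsewhere    */
    void exchange_ghosts_mpi(void);

    static pthread_barrier_t bar;          /* one intra-process barrier          */

    static void *worker(void *arg) {
        int tid = *(int *)arg;
        for (int t = 0; t < NTIMESTEPS; t++) {
            collision(tid);                       /* all threads compute their sub-block */
            pthread_barrier_wait(&bar);           /* intra-process barrier               */
            if (tid == 0) exchange_ghosts_mpi();  /* only thread 0 makes MPI calls       */
            pthread_barrier_wait(&bar);           /* others wait for the exchange        */
        }
        return NULL;
    }

    /* main() would call MPI_Init_thread(..., MPI_THREAD_FUNNELED, &provided),
       pthread_barrier_init(&bar, NULL, nthreads), and spawn the workers. */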

Distributed, Hybrid Auto-tuning

The Distributed Auto-tuning Problem
- We believe that even for relatively large problems, auto-tuning only the local computation (e.g., IPDPS'08) will deliver sub-optimal MPI performance.
- We want to explore the MPI/hybrid decomposition as well.
- The result is a combinatoric explosion in the search space, coupled with a large problem size (number of nodes):

    at each concurrency:
      for all process/thread balances
        for all aspect ratios
          for all thread grids
            for all data structures
              for all coding styles (reference, vectorized, vectorized+SIMDized)
                for all prefetching
                  for all vector lengths
                    for all code unrollings/reorderings
                      benchmark

Our Approach
- We employ a resource-efficient, three-stage greedy algorithm that successively prunes the search space:

    Stage 1:
      for all data structures
        for all coding styles (reference, vectorized, vectorized+SIMDized)
          for all prefetching
            for all vector lengths
              for all code unrollings/reorderings
                benchmark

    Stage 2, at limited concurrency (single node):
      for all process/thread balances
        for all aspect ratios
          for all thread grids
            benchmark

    Stage 3, at full concurrency:
      for all process/thread balances
        benchmark

Stage 1
- In stage 1, we prune the code-generation space.
- We ran this as a problem with 4 threads.
- As VL, unrolling, and reordering may be problem-dependent, we only prune:
  - padding
  - coding style
  - prefetch distance
- We observe that vectorization with SIMDization and a prefetch distance of 64 bytes worked best.

Stage 2
- Hybrid auto-tuning requires that we mimic the SPMD environment.
- Suppose we wish to explore this color-coded optimization space.
- In the serial world (or with fully threaded nodes), the tuning is easily run.
- However, in the MPI or hybrid world a problem arises: processes are not guaranteed to be synchronized, so one process may execute some optimizations faster than others simply due to fortuitous scheduling against another process's trials.
- Solution: add an MPI_Barrier() around each trial (a configuration with 100's of iterations); see the sketch below.
[Figure: timelines of two processes on one node, with and without per-trial barriers]
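A sketch of the barrier-synchronized trial timing described above; run_configuration() is a placeholder for one auto-tuning trial (hundreds of kernel iterations with one parameter set):

    #include <mpi.h>

    void run_configuration(void);   /* placeholder: one tuning trial */

    /* time one configuration with barriers on both sides, so every process
       measures the same, synchronized trial */
    double time_trial(void) {
        MPI_Barrier(MPI_COMM_WORLD);          /* line all processes up        */
        double t0 = MPI_Wtime();
        run_configuration();
        MPI_Barrier(MPI_COMM_WORLD);          /* wait for the slowest process */
        return MPI_Wtime() - t0;
    }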

Stage 2 (continued)
- We create a database of optimal VL/unrolling/DLP parameters for each thread/process balance, thread grid, and aspect ratio configuration.

Stage 3
- Given the database from Stage 2, we run a few large problems using the best known parameters/thread grid for different thread/process balances.
- We select the parameters based on minimizing:
  - overall local time
  - collision() time
  - local stream() time

Results

XT4 Results (512^3 problem on 512 cores)
- Finally, we present the best data for progressively more aggressive auto-tuning efforts.
- Note that each of the last three bars may have unique MPI decompositions as well as VL/unroll/DLP parameters.
- Observe that for this large problem, auto-tuning flat MPI delivered significant boosts (2.5x).
- However, expanding auto-tuning to include the domain decomposition and the balance between threads and processes provided an extra 17%.
- Two processes with two threads each was best.

What about OpenMP?

Conversion to OpenMP
- Converting the auto-tuned pthreads implementation to OpenMP seems relatively straightforward (#pragma omp parallel for).
- We modified the code to be single-source, supporting:
  - flat MPI
  - MPI+pthreads
  - MPI+OpenMP
- However, it is imperative (especially on NUMA SMPs) to correctly use the available affinity mechanisms:
  - on the XT, aprun has options to handle this
  - on Linux clusters (like NERSC's Carver), the user must manage it:

      #ifdef _OPENMP
      #pragma omp parallel
      { Affinity_Bind_Thread( MyFirstHWThread + omp_get_thread_num() ); }
      #else
      Affinity_Bind_Thread( MyFirstHWThread + Thread_Rank );
      #endif

  - use both to be safe
- Missing these or other key pragmas can cut performance in half (or by 90% in one particularly bad bug).
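For reference, one possible Linux implementation of an Affinity_Bind_Thread()-style helper; this is an assumption, the helper actually used in the study may differ:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* pin the calling thread to a single hardware thread (Linux) */
    void Affinity_Bind_Thread(int hw_thread) {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(hw_thread, &mask);
        pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
    }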

Optimization of Stream()
- In addition, we further optimized the stream() routine along three axes:
  1. messages could be blocked (24 velocities/direction/phase/process) or aggregated (1/direction/phase/process)
  2. packing could be sequential (thread 0 does all the work) or thread-parallel (using pthreads/OpenMP)
  3. MPI calls could be serialized (thread 0 does all the work) or parallel (MPI_THREAD_MULTIPLE)
- Of these eight combinations, we implemented four:
  - aggregate, sequential packing, serialized MPI
  - blocked, sequential packing, serialized MPI
  - aggregate, parallel packing, serialized MPI (simplest OpenMP code; see the sketch after this slide)
  - blocked, parallel packing, parallel MPI (simplest pthread code)
- Threaded MPI on Franklin requires:
  - using the threaded MPICH
  - initializing with MPI_THREAD_MULTIPLE
  - setting MPICH_MAX_THREAD_SAFETY=multiple
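A sketch of the "aggregate, parallel packing, serialized MPI" variant named above, assuming placeholder buffer and index names; the packing loop is thread-parallel with OpenMP, while the MPI call is made by the calling (master) thread only:

    #include <mpi.h>

    /* pack one face: 24 velocity components per point, aggregated into one message */
    void pack_and_send_face(double **grid, const int *face_index, int face_points,
                            double *sendbuf, int neighbor, int tag, MPI_Request *req) {
        #pragma omp parallel for
        for (int i = 0; i < face_points; i++)
            for (int v = 0; v < 24; v++)
                sendbuf[i*24 + v] = grid[v][face_index[i]];

        /* serialized MPI: only the calling thread posts the send */
        MPI_Isend(sendbuf, face_points*24, MPI_DOUBLE, neighbor, tag,
                  MPI_COMM_WORLD, req);
    }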

MPI vs. MPI+Pthreads vs. MPI+OpenMP (overall performance)
- When examining overall performance per core (512^3 problem), we see the choice made relatively little difference:
  - MPI+pthreads was slightly faster with 2 threads/process
  - MPI+OpenMP was slightly faster with 4 threads/process
  - the choice of best stream() optimization depends on thread concurrency and threading model
[Figure: performance per core for 1 thread/process (flat MPI), 2 threads/process, and 4 threads/process; up is good]

MPI vs. MPI+Pthreads vs. MPI+OpenMP (collision time)
- Looking at collision() time, surprisingly, 2 threads per process delivered slightly better performance for pthreads, but 4 threads per process was better for OpenMP.
- Interestingly, threaded MPI resulted in slower compute time.
[Figure: collision() time for 1 thread/process (flat MPI), 2 threads/process, and 4 threads/process; down is good]

MPI vs. MPI+Pthreads vs. MPI+OpenMP (stream time)
- Variation in stream() time was dramatic with 4 threads; here the blocked implementation was far faster.
- Interestingly, pthreads was faster with 2 threads, and OpenMP was faster with 4 threads.
[Figure: stream() time for 1 thread/process (flat MPI), 2 threads/process, and 4 threads/process; down is good]

Summary & Discussion

Summary
- Multicore-cognizant auto-tuning dramatically improves (2.5x) flat MPI performance.
- Tuning the domain decomposition and the hybrid implementations yielded almost an additional 20% performance boost.
- Although hybrid MPI promises improved performance through reduced communication, the observed benefit is thus far small.
- Moreover, the performance difference among hybrid models is small.
- Initial experiments on the XT5 (Hopper) and the Nehalem cluster (Carver) show similar results (it is a little trickier to get good OpenMP performance on the Linux cluster).
- LBMs probably will not make the case for hybrid programming models (they are purely concurrent, with no need for collaborative behavior).

Acknowledgements
- Research supported by the DOE Office of Science under contract number DE-AC02-05CH11231.
- All XT4 simulations were performed on the XT4 (Franklin) at the National Energy Research Scientific Computing Center (NERSC).
- George Vahala and his research group provided the original (Fortran) version of the LBMHD code.

Questions?

BACKUP SLIDES

Memory Access Patterns for Stencils
- Laplacian, divergence, and gradient
- Different reuse; different numbers of read/write arrays

LBMHD Stencil
- Simple example: reading from 9 arrays and writing to 9 arrays
- The actual LBMHD reads 73 arrays and writes 79 arrays

Arithmetic Intensity
- True arithmetic intensity (AI) ~ total flops / total DRAM bytes
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality), while for others it remains constant.
- Arithmetic intensity is ultimately limited by compulsory traffic.
- Arithmetic intensity is diminished by conflict or capacity misses.
[Figure: arithmetic intensity spectrum from O(1) to O(N): SpMV, BLAS1,2; stencils (PDEs); lattice methods; FFTs at O(log N); dense linear algebra (BLAS3); particle methods at O(N)]

Kernel Arithmetic Intensity and Architecture
- For a given architecture, one may calculate its flop:byte ratio.
- For a 2.3 GHz quad-core Opteron (like in the XT4):
  - 1 SIMD add + 1 SIMD multiply per cycle per core (4 flops/cycle/core), i.e., 36.8 GFlop/s
  - 12.8 GB/s of DRAM bandwidth
  - flop:byte ratio = 36.8 / 12.8 ~ 2.9 flops per byte
- When a kernel's arithmetic intensity is substantially less than the architecture's flop:byte ratio, transferring data takes longer than computing on it: the kernel is memory-bound.
- When a kernel's arithmetic intensity is substantially greater than the architecture's flop:byte ratio, computation takes longer than data transfers: the kernel is compute-bound.
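A worked example of this comparison, using only the numbers quoted on this slide (a sketch, not a benchmark):

    #include <stdio.h>

    int main(void) {
        /* peak flop rate: 2.3 GHz x 4 cores x 4 flops/cycle (SIMD add + SIMD mul) */
        double peak_gflops = 2.3 * 4 * 4;               /* = 36.8 GFlop/s           */
        double peak_gbs    = 12.8;                      /* DRAM bandwidth, GB/s     */
        double balance     = peak_gflops / peak_gbs;    /* ~2.9 flops per byte      */
        double kernel_ai   = 1300.0 / 1216.0;           /* LBMHD collision(): ~1.07 */

        printf("machine balance %.1f flops/byte, kernel AI %.2f -> %s\n",
               balance, kernel_ai,
               kernel_ai < balance ? "memory-bound" : "compute-bound");
        return 0;
    }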