Single Node Optimization
Computational Astrophysics
Outline
- Node Topology
- Vectorization and cache blocking
- OpenMP
- Performance Tuning Tips
Node Topology
- Consider a dual-socket Intel Haswell node
- Each socket is a NUMA (non-uniform memory access) domain
- The local memory controller is the gatekeeper for access to main memory (the local DIMMs); a first-touch sketch follows below
- Each socket has its own inclusive L3 cache
- Each core has its own private L2 cache
- Each core has its own L1 data and instruction caches
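One practical consequence of the "local memory controller is the gatekeeper" point: on Linux, physical pages are placed in the NUMA domain of the thread that first touches them. The sketch below is a minimal, hedged illustration of first-touch initialization; it uses OpenMP (covered later in these slides), assumes threads are pinned to cores, and the array name and size are made up for illustration.

  ! first_touch.f90 -- hedged sketch: initialize data with the same OpenMP
  ! loop structure that will later compute on it, so each thread's pages land
  ! in its own NUMA domain (assumes threads are pinned to cores; the name `q`
  ! and the size are illustrative only).
  program first_touch
    implicit none
    integer, parameter :: n = 50000000
    real(8), allocatable :: q(:)
    integer :: i

    allocate(q(n))           ! allocation reserves virtual pages only

    !$omp parallel do schedule(static)
    do i = 1, n
       q(i) = 0.0d0          ! first touch: each page lands near the touching thread
    end do
    !$omp end parallel do

    !$omp parallel do schedule(static)
    do i = 1, n
       q(i) = q(i) + 1.0d0   ! later work with the same schedule stays NUMA-local
    end do
    !$omp end parallel do

    print *, 'q(1) =', q(1)
  end program first_touch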
[Figure: Intel Haswell 12-core server chip, from cyberparse.co.uk]
[Figure from anandtech.com]
SIMD Vectorization
- SIMD = single instruction, multiple data
- For x86 we have:
  - SSE and revisions (128-bit vector length = 2 DP values)
  - AVX and revisions (256-bit vector length = 4 DP values)
  - AVX-512, coming on KNL (512-bit vector length = 8 DP values)
- Before MPP, SIMD was the way to get parallelization and performance
  - Cray Black Widow vector processors had 128-element vector registers
Vectorization Requirements
1. Independent operations for each iteration
   - Operations that depend on the previous iteration are called a "recurrence" and often prevent vectorization
2. Stride-1 access pattern
   - Data used in each iteration must be contiguous
   - Without restrict on pointers, this is typically why C code will not vectorize
3. Little to no conditional code in the loop body
   - x86 processors have "predicate registers" that allow conditional execution on portions of vectors, but their number is limited
Loop 1:
  DO i = 1, N
     q(i) = g(i)*b(i) + c(i)          ! independent, stride-1, no conditionals
  END DO

Loop 2:
  DO i = 1, N
     q(i) = q(i) + f * b(i)           ! q(i) is read and written only within its own iteration
  END DO

Loop 3:
  DO i = 1, N
     q(i) = 0.5 * (q(i-1) + q(i+1))   ! recurrence: q(i-1) was written by the previous iteration
  END DO

Loop 4:
  DO i = 1, N, 2
     q(i) = 0.5 * (q(i-1) + q(i+1))   ! stride-2 access; reads even elements, writes odd ones
  END DO

Loop 5:
  DO i = 1, Nx
     IF (F(i+1) .LT. 1.0E-10) F(i+1) = 0.0
     q(i) = q(i) - dtdx * (F(i+1) - F(i))   ! conditional code, and F(i+1) is stored then reused as F(i)
  END DO
Example Code
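The example code shown with this slide is not reproduced here; as a stand-in, below is a minimal sketch of a loop that meets all three requirements above, with assumed compile lines in the comments so the vectorization report can be checked. Array names and sizes are illustrative.

  ! vec_example.f90 -- hedged sketch of a vectorizable kernel.
  ! Assumed compile lines (flags as on the "Helpful Compiler Flags" slide):
  !   gfortran -O3 -fopt-info vec_example.f90
  !   ifort -O3 -opt-report vec_example.f90
  program vec_example
    implicit none
    integer, parameter :: n = 1000000
    real(8), allocatable :: q(:), b(:), c(:), g(:)
    integer :: i

    allocate(q(n), b(n), c(n), g(n))
    g = 1.0d0;  b = 2.0d0;  c = 3.0d0

    ! independent iterations, stride-1 access, no conditionals
    do i = 1, n
       q(i) = g(i)*b(i) + c(i)
    end do

    print *, 'q(n) =', q(n)
  end program vec_example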
Cache Blocking and Reuse
- Code may reuse data across multiple iterations in a loop
- Best is to keep that data in the highest level of cache until it is no longer needed
- Board example (a sketch follows below)
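Since the board example is not reproduced here, the following is a minimal, hedged cache-blocking sketch: a blocked 2D transpose. The array names, sizes, and block size are assumptions; the best block size depends on the cache sizes listed on the topology slide.

  ! blocked_transpose.f90 -- hedged sketch of cache blocking (tune `bs` so a
  ! bs x bs tile of each array fits comfortably in L1/L2; names and sizes are
  ! illustrative).
  program blocked_transpose
    implicit none
    integer, parameter :: n = 2048, bs = 64
    real(8), allocatable :: a(:,:), at(:,:)
    integer :: i, j, ii, jj

    allocate(a(n,n), at(n,n))
    a = 1.0d0

    ! loop over tiles, then over elements within a tile: the strided reads of
    ! `a` touch only bs cache lines per inner sweep, so those lines are still
    ! resident when they are reused for the next value of j
    do jj = 1, n, bs
       do ii = 1, n, bs
          do j = jj, min(jj+bs-1, n)
             do i = ii, min(ii+bs-1, n)
                at(i,j) = a(j,i)
             end do
          end do
       end do
    end do

    print *, 'at(1,n) =', at(1,n)
  end program blocked_transpose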
OpenMP Threading
- Directive-based language extension (C, C++, Fortran) that allows code to be multithreaded by the compiler
- Generally much easier to use than the pthread library or equivalent, and the code is much more portable
- Many compilers have fairly good OpenMP implementations that can scale to dozens of cores
  - Intel is quite good, but launches a "helper" thread that often gets in the way
  - GNU is fairly good, but thread synchronization performance is generally slower
  - PGI is OK
  - Cray/IBM/etc. custom compilers may perform much better on specific codes
OpenMP Threading
- User-defined parallel regions where all threads operate
- Work can be divided amongst threads either via sets of "tasks" or by giving portions of a loop to each thread
- Unsafe operations (such as stores to a shared variable) can be done with atomic operations or critical sections (see the sketch below)
- The LLNL OpenMP tutorial (llnl.gov) is a great resource!
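A minimal sketch of the points above: an explicit parallel region, loop worksharing, and a reduction used instead of unguarded shared stores (an atomic or critical section would also work, usually at higher cost). The kernel and names are illustrative; compile with -fopenmp / -openmp as on the compiler-flags slide.

  ! omp_example.f90 -- hedged sketch: parallel region + worksharing loop, with
  ! the shared accumulation handled safely by a reduction.
  program omp_example
    use omp_lib
    implicit none
    integer, parameter :: n = 1000000
    real(8), allocatable :: q(:), b(:)
    real(8) :: total
    integer :: i

    allocate(q(n), b(n))
    b = 2.0d0
    total = 0.0d0

    !$omp parallel                            ! user-defined parallel region: all threads active
    !$omp do schedule(static) reduction(+:total)
    do i = 1, n                               ! portions of the loop go to each thread
       q(i) = 0.5d0 * b(i)
       total = total + q(i)                   ! each thread sums privately, then results combine
    end do
    !$omp end do
    !$omp single                              ! one thread reports after the implicit barrier
    print *, omp_get_num_threads(), ' threads, total =', total
    !$omp end single
    !$omp end parallel
  end program omp_example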
[Figure from llnl.gov]
Put it all together!
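As a stand-in for the demo behind this slide, here is a hedged sketch that combines the pieces: OpenMP threads over blocks, cache blocking within each block, and a stride-1, recurrence-free inner loop the compiler can vectorize. This is not the code that produced the results below; the blocked matrix multiply kernel, sizes, and block size are all assumptions chosen for illustration.

  ! combined.f90 -- hedged sketch only: threading + cache blocking + a
  ! vectorizable inner loop, using c = c + a*b as the stand-in kernel.
  program combined
    implicit none
    integer, parameter :: n = 1024, bs = 64
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: i, j, k, jj, kk

    allocate(a(n,n), b(n,n), c(n,n))
    a = 1.0d0;  b = 2.0d0;  c = 0.0d0

    !$omp parallel do schedule(static)
    do jj = 1, n, bs                      ! each thread owns whole blocks of columns of c
       do kk = 1, n, bs                   ! cache blocking: small tile of b, panels of a and c reused
          do j = jj, min(jj+bs-1, n)
             do k = kk, min(kk+bs-1, n)
                do i = 1, n               ! stride-1, independent: vectorizes
                   c(i,j) = c(i,j) + a(i,k)*b(k,j)
                end do
             end do
          end do
       end do
    end do
    !$omp end parallel do

    print *, 'c(1,1) =', c(1,1), ' (expect', 2.0d0*n, ')'
  end program combined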
Performance Results
[Table: Version/Threads (Serial and OMP at several thread counts) vs. GNU, Intel, GNU cache blocked, and Intel cache blocked builds]
Tuning Tips
- Experiment! Write multiple versions of your code with various techniques included
- Try all available compilers. What does each one do?
- It is almost always fastest to write code that can vectorize, AND it makes adding OpenMP much easier if it does
- Count your operations if you can. Measure the run time and ask yourself: am I getting 0.1% of peak or 10%? Aim for 10% or better for your "hot" loops and subroutines (worked example below)
- IT CAN ALWAYS BE FASTER!!!
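A hedged worked example of the "percent of peak" check; the clock rate and peak figure below are assumptions for illustration, not measured values. A core at 2.5 GHz issuing 4-wide double-precision FMAs (2 flops each) on two ports peaks near 2.5e9 * 4 * 2 * 2 = 40 GFLOP/s; a loop doing 2 flops per element over 1e8 elements in 0.05 s delivers 4 GFLOP/s, i.e. about 10% of that peak. The sketch below does the same arithmetic in code, using Loop 2 from the earlier slide as the timed kernel.

  ! percent_of_peak.f90 -- hedged sketch: time a kernel with a known flop
  ! count and compare against an assumed per-core peak.
  program percent_of_peak
    implicit none
    integer, parameter :: n = 10000000
    real(8), parameter :: peak_gflops = 40.0d0   ! assumed: 2.5 GHz x 4 DP lanes x 2 flops/FMA x 2 ports
    real(8), allocatable :: q(:), b(:)
    real(8) :: f, seconds, gflops
    integer :: i
    integer(8) :: c0, c1, rate

    allocate(q(n), b(n))
    q = 1.0d0;  b = 2.0d0;  f = 0.5d0

    call system_clock(c0, rate)
    do i = 1, n
       q(i) = q(i) + f*b(i)        ! 2 flops per iteration (multiply + add)
    end do
    call system_clock(c1)

    seconds = real(c1 - c0, 8) / real(rate, 8)
    gflops  = 2.0d0*real(n, 8) / seconds / 1.0d9
    print *, 'time (s), GFLOP/s, % of assumed peak:', seconds, gflops, 100.0d0*gflops/peak_gflops
    print *, 'q(1) =', q(1)        ! keeps the compiler from discarding the loop
  end program percent_of_peak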
Helpful Compiler Flags
GNU:
- -fopt-info will show what optimizations the compiler did
- -fopenmp to enable OpenMP
Intel:
- -opt-report will produce a *.optrpt file for each source file showing where vectorization was done
- -openmp to enable OpenMP
Useful Tools
objdump -l -d [executable or .o]
- Can be used to look at the generated assembly. Use with the -g compiler flag to see where in the source you are.
- Try to learn assembly basics, as it is a very useful skill when investigating performance issues. No need to learn to program in it.
PAPI
- Used to gather counters from the processor
- Requires admin to install on Linux
DDT / TAU / Intel VTune / CrayPAT
- Full performance suites for analyzing applications, finding bottlenecks, etc.