1
Single Node Optimization
Computational Astrophysics
2
Outline
Node Topology
Vectorization and cache blocking
OpenMP
Performance tuning tips
3
Node Topology
Consider a dual-socket Intel Haswell node
Each socket is a NUMA (non-uniform memory access) domain
  The local memory controller is the gatekeeper for data access to main memory (local DIMMs)
Each socket has its own inclusive L3 cache
Each core has its own L2 cache
Each core has its own data and instruction L1 caches
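One way to see this layout on a Linux node is with standard tools (assuming the util-linux and numactl packages are installed; output varies by machine):

    lscpu                 # sockets, cores per socket, cache sizes
    numactl --hardware    # NUMA domains and the memory local to each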
4
[Figure: Intel Haswell 12-core server chip, from cyberparse.co.uk]
5
[Figure from anandtech.com]
7
SIMD Vectorization
SIMD = single instruction, multiple data
For x86 we have:
  SSE and revisions (128-bit vector length = 2 DP values)
  AVX and revisions (256-bit vector length = 4 DP values)
  AVX-512, coming on KNL (512-bit vector length = 8 DP values)
Before MPP systems, SIMD was the main way to get parallelism and performance
  Cray Black Widow vector processors had 128-element vector registers
8
Vectorization Requirements
1. Independent operations for each iteration
   Operations that depend on the previous iteration are called a "recurrence" and often prevent vectorization
2. Stride-1 access pattern
   Data used in each iteration must be contiguous
   In C, possible pointer aliasing (no restrict qualifier) is typically why code will not vectorize (see the sketch after this list)
3. Little to no conditional code in the loop body
   x86 processors have "predicate registers" that allow conditional execution on portions of vectors, but their number is limited
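A minimal C sketch of requirements 1 and 2 (function and variable names are hypothetical, not from the slides): with the restrict qualifiers the compiler knows q, b, and c do not overlap, so the independent, stride-1 iterations below can be vectorized; removing restrict typically forces the compiler to assume possible aliasing and give up.

    /* Hypothetical example: independent, stride-1 updates on
       non-aliasing arrays, the pattern compilers vectorize well. */
    void update(int n, double *restrict q,
                const double *restrict b,
                const double *restrict c, double f)
    {
        for (int i = 0; i < n; i++)
            q[i] = q[i] + f * b[i] + c[i];
    }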
9
Loop 1:
  DO i = 1, N
    q(i) = g(i)*b(i) + c(i)
  END DO

Loop 2:
  DO i = 1, N
    q(i) = q(i) + f * b(i)
  END DO

Loop 3:
  DO i = 1, N
    q(i) = 0.5 * (q(i-1) + q(i+1))
  END DO

Loop 4:
  DO i = 1, N, 2
    q(i) = 0.5 * (q(i-1) + q(i+1))
  END DO

Loop 5:
  DO i = 1, Nx
    IF (F(i+1) .LT. 1.0E-10) F(i+1) = 0.0
    q(i) = q(i) - dtdx * (F(i+1) - F(i))
  END DO
10
Example Code
11
Cache Blocking and Reuse
Code may reuse data across multiple iterations of a loop
Ideally, that data stays in the highest level of cache until it is no longer needed
Board example (a small C sketch follows below)
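The board example is not reproduced here, but a minimal C sketch of cache blocking (the block size and names are illustrative assumptions) is a tiled matrix transpose: each BLOCK x BLOCK tile is small enough to stay resident in cache while it is reused, instead of streaming whole rows and columns through the cache repeatedly.

    #define BLOCK 64   /* illustrative tile size; tune it to the cache level you target */

    /* Hypothetical sketch: transpose an n x n matrix in tiles so the
       working set of each tile fits in cache and is fully reused. */
    void transpose_blocked(int n, const double *a, double *b)
    {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int j = jj; j < jj + BLOCK && j < n; j++)
                        b[j * n + i] = a[i * n + j];
    }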
12
OpenMP Threading
Directive-based language extension (C, C++, Fortran) that allows code to be multithreaded by the compiler
Generally much easier to use than the pthread library or equivalent, and the code is much more portable
Many compilers have fairly good OpenMP implementations that can scale to dozens of cores
  Intel is quite good, but launches a "helper" thread that often gets in the way
  GNU is fairly good, but thread synchronization performance is generally slower
  PGI is OK
  Cray/IBM/etc. custom compilers may perform much better on specific codes
13
OpenMP Threading
User-defined parallel regions where all threads operate
Work can be divided among threads either via sets of "tasks" or by giving portions of a loop to each thread
Unsafe operations (such as stores to a shared variable) can be protected with atomic operations or critical sections (see the sketch after this list)
www.openmp.org is a great resource!
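A minimal C sketch of these ideas (names are hypothetical): a worksharing loop whose iterations are split across the threads of a parallel region, with the store to a shared variable protected by an atomic operation. In practice a reduction(+:total) clause is usually faster than an atomic in a hot loop; the atomic is shown only to illustrate the point above.

    #include <omp.h>

    /* Hypothetical sketch: loop iterations are divided among the threads
       of the parallel region; the shared accumulator is updated safely
       with an atomic operation. */
    double sum_squares(int n, const double *x)
    {
        double total = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double v = x[i] * x[i];
            #pragma omp atomic
            total += v;
        }
        return total;
    }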
14
[Figure from llnl.gov]
15
Put it all together!
16
Performance Results

Version / Threads    GNU     Intel   GNU cache blocked   Intel cache blocked
Serial / 1           5.724   5.556   5.276               5.344
OMP / 2              3.082   3.028   2.688               2.708
OMP / 4              1.634   1.529   1.432               1.373
OMP / 8              0.952   0.774   0.842               0.683
OMP / 16             0.547   0.388   0.470               0.344
17
Tuning Tips
Experiment! Write multiple versions of your code with various techniques included
Try all available compilers. What does each one do?
It is almost always fastest to write code that can vectorize, AND it makes adding OpenMP much easier
Count your operations if you can. Measure the run time and ask yourself: am I getting 0.1% of peak or 10%? Aim for 10% or better in your "hot" loops and subroutines (a timing sketch follows below)
IT CAN ALWAYS BE FASTER!!!
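A sketch of the operation-counting advice (the kernel and names are hypothetical): count the floating-point operations the loop performs, time it, and compare the achieved GFLOP/s against the node's theoretical peak.

    #include <stdio.h>
    #include <omp.h>   /* omp_get_wtime() gives portable wall-clock time */

    /* Hypothetical sketch: the loop does 2 flops per iteration
       (one multiply, one add), so achieved GFLOP/s = 2*n / time / 1e9.
       Divide by the node's peak GFLOP/s to estimate the percent of peak. */
    void report_rate(int n, double *q, const double *b, double f)
    {
        double t0 = omp_get_wtime();
        for (int i = 0; i < n; i++)
            q[i] = q[i] + f * b[i];
        double t1 = omp_get_wtime();
        printf("%.3f GFLOP/s\n", 2.0 * n / (t1 - t0) / 1.0e9);
    }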
18
Helpful Compiler Flags
GNU
  -fopt-info will show what optimizations the compiler did
  -fopenmp to enable OpenMP
Intel
  -opt-report will produce a *.optrpt file for each source file showing where vectorization was done
  -openmp to enable OpenMP
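For example, compile lines using only the flags above plus a standard optimization level (the source file name is hypothetical; newer Intel compilers spell these flags -qopt-report and -qopenmp):

    gcc -O3 -fopenmp -fopt-info mycode.c -o mycode
    icc -O3 -openmp -opt-report mycode.c -o mycode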
19
Useful Tools
objdump -l -d [executable or .o] (example invocation below)
  Can be used to look at the generated assembly. Use with the -g compiler flag to see where you are in the source.
  Try to learn assembly basics; it is a very useful skill when investigating performance issues. No need to learn to program in it.
PAPI
  Used to gather hardware counters from the processor
  Requires admin privileges to install on Linux
DDT / TAU / Intel VTune / CrayPAT
  Full performance suites for analyzing applications, finding bottlenecks, etc.
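Example invocation (file name hypothetical): compile with -g so objdump can map the assembly back to source lines.

    gcc -g -O3 -c kernel.c
    objdump -l -d kernel.o | less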