1
Single Node Optimization
Computational Astrophysics
2
Outline
Node Topology
Vectorization and cache blocking
OpenMP
Performance tuning tips
3
Node Topology
Consider a dual-socket Intel Haswell node
Each socket is a NUMA (non-uniform memory access) domain
  The local memory controller is the gatekeeper for data access to main memory (local DIMMs)
Each socket has its own inclusive L3 cache
Each core has its own L2 cache
Each core has its own data and instruction L1 caches
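One way to see this layout on a Linux node is with standard tools (assuming the util-linux and numactl packages are installed; output varies by machine):

    lscpu                 # sockets, cores per socket, cache sizes
    numactl --hardware    # NUMA domains and the memory local to each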
4
[Figure: Intel Haswell 12-core server chip, from cyberparse.co.uk]
5
[Figure from anandtech.com]
7
SIMD Vectorization
SIMD = single instruction, multiple data
For x86 we have:
  SSE and revisions (128-bit vector length = 2 DP values)
  AVX and revisions (256-bit vector length = 4 DP values)
  AVX-512, coming on KNL (512-bit vector length = 8 DP values)
Before MPP systems, SIMD was the main way to get parallelism and performance
  Cray Black Widow vector processors had 128-element vector registers
8
Vectorization Requirements
1. Independent operations for each iteration
   Operations that depend on the previous iteration are called a "recurrence" and often prevent vectorization
2. Stride-1 access pattern
   Data used in each iteration must be contiguous
   In C, possible pointer aliasing (no restrict qualifier) is typically why code will not vectorize (see the sketch after this list)
3. Little to no conditional code in the loop body
   x86 processors have "predicate registers" that allow conditional execution on portions of vectors, but their number is limited
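A minimal C sketch of requirements 1 and 2 (function and variable names are hypothetical, not from the slides): with the restrict qualifiers the compiler knows q, b, and c do not overlap, so the independent, stride-1 iterations below can be vectorized; removing restrict typically forces the compiler to assume possible aliasing and give up.

    /* Hypothetical example: independent, stride-1 updates on
       non-aliasing arrays, the pattern compilers vectorize well. */
    void update(int n, double *restrict q,
                const double *restrict b,
                const double *restrict c, double f)
    {
        for (int i = 0; i < n; i++)
            q[i] = q[i] + f * b[i] + c[i];
    }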
9
Loop 1:
  DO i = 1, N
    q(i) = g(i)*b(i) + c(i)
  END DO

Loop 2:
  DO i = 1, N
    q(i) = q(i) + f * b(i)
  END DO

Loop 3:
  DO i = 1, N
    q(i) = 0.5 * (q(i-1) + q(i+1))
  END DO

Loop 4:
  DO i = 1, N, 2
    q(i) = 0.5 * (q(i-1) + q(i+1))
  END DO

Loop 5:
  DO i = 1, Nx
    IF (F(i+1) .LT. 1.0E-10) F(i+1) = 0.0
    q(i) = q(i) - dtdx * (F(i+1) - F(i))
  END DO
10
Example Code
11
Cache Blocking and Reuse
Code may reuse data across multiple iterations of a loop
Ideally, that data stays in the highest level of cache until it is no longer needed
Board example (a small C sketch follows below)
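The board example is not reproduced here, but a minimal C sketch of cache blocking (the block size and names are illustrative assumptions) is a tiled matrix transpose: each BLOCK x BLOCK tile is small enough to stay resident in cache while it is reused, instead of streaming whole rows and columns through the cache repeatedly.

    #define BLOCK 64   /* illustrative tile size; tune it to the cache level you target */

    /* Hypothetical sketch: transpose an n x n matrix in tiles so the
       working set of each tile fits in cache and is fully reused. */
    void transpose_blocked(int n, const double *a, double *b)
    {
        for (int ii = 0; ii < n; ii += BLOCK)
            for (int jj = 0; jj < n; jj += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int j = jj; j < jj + BLOCK && j < n; j++)
                        b[j * n + i] = a[i * n + j];
    }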
12
OpenMP Threading
Directive-based language extension (C, C++, Fortran) that allows code to be multithreaded by the compiler
Generally much easier to use than the pthread library or equivalent, and the code is much more portable
Many compilers have fairly good OpenMP implementations that can scale to dozens of cores
  Intel is quite good, but launches a "helper" thread that often gets in the way
  GNU is fairly good, but thread synchronization performance is generally slower
  PGI is OK
  Cray/IBM/etc. custom compilers may perform much better on specific codes
13
OpenMP Threading
User-defined parallel regions where all threads operate
Work can be divided among threads either via sets of "tasks" or by giving portions of a loop to each thread
Unsafe operations (such as stores to a shared variable) can be protected with atomic operations or critical sections (see the sketch after this list)
www.openmp.org is a great resource!
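A minimal C sketch of these ideas (names are hypothetical): a worksharing loop whose iterations are split across the threads of a parallel region, with the store to a shared variable protected by an atomic operation. In practice a reduction(+:total) clause is usually faster than an atomic in a hot loop; the atomic is shown only to illustrate the point above.

    #include <omp.h>

    /* Hypothetical sketch: loop iterations are divided among the threads
       of the parallel region; the shared accumulator is updated safely
       with an atomic operation. */
    double sum_squares(int n, const double *x)
    {
        double total = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            double v = x[i] * x[i];
            #pragma omp atomic
            total += v;
        }
        return total;
    }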
14
[Figure from llnl.gov]
15
Put it all together!
16
Performance Results

Version / Threads    GNU     Intel   GNU cache blocked   Intel cache blocked
Serial / 1           5.724   5.556   5.276               5.344
OMP / 2              3.082   3.028   2.688               2.708
OMP / 4              1.634   1.529   1.432               1.373
OMP / 8              0.952   0.774   0.842               0.683
OMP / 16             0.547   0.388   0.470               0.344
17
Tuning Tips
Experiment! Write multiple versions of your code with various techniques included
Try all available compilers. What does each one do?
It is almost always fastest to write code that can vectorize, AND it makes adding OpenMP much easier
Count your operations if you can. Measure the run time and ask yourself: am I getting 0.1% of peak or 10%? Aim for 10% or better in your "hot" loops and subroutines (a timing sketch follows below)
IT CAN ALWAYS BE FASTER!!!
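A sketch of the operation-counting advice (the kernel and names are hypothetical): count the floating-point operations the loop performs, time it, and compare the achieved GFLOP/s against the node's theoretical peak.

    #include <stdio.h>
    #include <omp.h>   /* omp_get_wtime() gives portable wall-clock time */

    /* Hypothetical sketch: the loop does 2 flops per iteration
       (one multiply, one add), so achieved GFLOP/s = 2*n / time / 1e9.
       Divide by the node's peak GFLOP/s to estimate the percent of peak. */
    void report_rate(int n, double *q, const double *b, double f)
    {
        double t0 = omp_get_wtime();
        for (int i = 0; i < n; i++)
            q[i] = q[i] + f * b[i];
        double t1 = omp_get_wtime();
        printf("%.3f GFLOP/s\n", 2.0 * n / (t1 - t0) / 1.0e9);
    }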
18
Helpful Compiler Flags
GNU
  -fopt-info will show what optimizations the compiler did
  -fopenmp to enable OpenMP
Intel
  -opt-report will produce a *.optrpt file for each source file showing where vectorization was done
  -openmp to enable OpenMP
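For example, compile lines using only the flags above plus a standard optimization level (the source file name is hypothetical; newer Intel compilers spell these flags -qopt-report and -qopenmp):

    gcc -O3 -fopenmp -fopt-info mycode.c -o mycode
    icc -O3 -openmp -opt-report mycode.c -o mycode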
19
Useful Tools
objdump -l -d [executable or .o] (example invocation below)
  Can be used to look at the generated assembly. Use with the -g compiler flag to see where you are in the source.
  Try to learn assembly basics; it is a very useful skill when investigating performance issues. No need to learn to program in it.
PAPI
  Used to gather hardware counters from the processor
  Requires admin privileges to install on Linux
DDT / TAU / Intel VTune / CrayPAT
  Full performance suites for analyzing applications, finding bottlenecks, etc.
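Example invocation (file name hypothetical): compile with -g so objdump can map the assembly back to source lines.

    gcc -g -O3 -c kernel.c
    objdump -l -d kernel.o | less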