High Performance Computing with AMD Opteron Maurizio Davini
Agenda OS Compilers Libraries Some Benchmark results Conclusions
64-Bit Operating Systems: Recommendations and Status
SUSE SLES 9 with the latest available Service Pack
  Has technology for supporting the latest AMD processor features
  Widest breadth of NUMA support, enabled by default
  Oprofile system profiler installable as an RPM and modularized
  Complete support for statically and dynamically linked 32-bit binaries
Red Hat Enterprise Server 3.0 Service Pack 2 or later
  NUMA feature support not as complete as that of SUSE SLES 9
  Oprofile installable as an RPM, but the installation is not modularized and may require a kernel rebuild if the RPM version isn't satisfactory
  Only SP 2 or later has complete 32-bit shared object library support (a requirement to run all 32-bit binaries in 64-bit)
  POSIX threading library changed between 2.1 and 3.0; users may need to rebuild applications
AMD Opteron Compilers: PGI, GNU, Pathscale, Absoft, Intel, Microsoft and SUN
Compiler Comparisons Table: Critical Features Supported by x86 Compilers
Features compared: Vector SIMD Support, Peels Vector Loops, Global IPA, OpenMP, Links ACML Libraries, Profile-Guided Feedback, Aligns Vector Loops, Parallel Debuggers, Large Array Support, Medium Memory Model
Compilers compared: PGI, GNU, Intel, Pathscale, Absoft, SUN, Microsoft
Tuning Performance with Compilers: Maintaining Stability while Optimizing
STEP 0: Build the application using the following procedure:
  Compile all files with the most aggressive optimization flags: -tp k8-64 -fastsse
  If compilation fails or the application doesn't run properly, turn off vectorization: -tp k8-64 -fast -Mscalarsse
  If problems persist, compile at a low optimization level: -tp k8-64 -O0 (or -O1)
STEP 1: Profile the binary and determine the performance-critical routines
STEP 2: Repeat STEP 0 on the performance-critical functions, one at a time, and run the binary after each step to check stability
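STEP 0 above can be sketched as a small shell function that walks the three flag sets from most to least aggressive and reports the first one that both compiles and passes a check. This is only an illustration: the function name pick_flags, the check command, and the file names in the usage note are assumptions, not part of the original procedure.

```sh
# pick_flags COMPILER SOURCE OUTPUT CHECK
# Tries the flag sets from the slide, most aggressive first, and prints the
# first set for which the compile succeeds and CHECK OUTPUT returns 0.
# CHECK is a hypothetical command that runs the binary and validates it.
pick_flags() {
    compiler=$1; src=$2; out=$3; check=$4
    for flags in "-tp k8-64 -fastsse" \
                 "-tp k8-64 -fast -Mscalarsse" \
                 "-tp k8-64 -O0"; do
        # $flags is left unquoted on purpose so it splits into separate options
        if $compiler $flags -o "$out" "$src" >/dev/null 2>&1 &&
           "$check" "$out" >/dev/null 2>&1; then
            printf '%s\n' "$flags"
            return 0
        fi
    done
    return 1   # no flag set produced a working binary
}
```

A call such as `pick_flags pgcc solver.c solver ./check_results.sh` would then print the most aggressive working flag set for solver.c, where solver.c and check_results.sh are placeholder names. For STEP 2, the same function could be applied file by file to the routines the profiler flags as hot.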
PGI Compiler Flags: Optimization Flags
Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases:
Most aggressive: -tp k8-64 -fastsse -Mipa=fast
  Enables instruction-level tuning for Opteron, O2-level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling
  Strongly recommended for any single precision source code
Middle of the road: -tp k8-64 -fast -Mscalarsse
  Enables all of the most aggressive optimizations except vector code generation, which can reorder loops and generate slightly different results
  In double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
Least aggressive: -tp k8-64 -O0 (or -O1)
PGI Compiler Flags: Functionality Flags
-mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2GB
-Mlarge_arrays: use if any array in your application is greater than 2GB
-KPIC: use when linking to shared object (dynamically linked) libraries
-mp: process OpenMP/SGI directives/pragmas (build multi-threaded code)
-Mconcur: attempt auto-parallelization of your code on an SMP system with OpenMP
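The two 2GB rules above come down to simple arithmetic on data sizes. A minimal sketch, assuming "2GB" means 2^31 bytes and that the sizes are already known in bytes; the function name size_flags is hypothetical:

```sh
LIMIT=2147483648   # 2 GB taken as 2^31 bytes (an assumption about the slide's "2GB")

# size_flags LARGEST_ARRAY_BYTES TOTAL_STATIC_BYTES
# Prints the PGI functionality flags the slide's two rules would call for.
size_flags() {
    flags=""
    if [ "$2" -gt "$LIMIT" ]; then
        flags="-mcmodel=medium"          # total static data exceeds 2GB
    fi
    if [ "$1" -gt "$LIMIT" ]; then
        flags="$flags -Mlarge_arrays"    # a single array exceeds 2GB
    fi
    echo $flags                          # unquoted to drop a leading space
}
```

For example, one static 20000 x 20000 array of doubles takes 20000 * 20000 * 8 = 3,200,000,000 bytes, so `size_flags 3200000000 3200000000` reports both flags, while four separate 800,000,000-byte arrays trip only the total-size rule.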
Absoft Compiler Flags: Optimization Flags
Below are 3 different sets of recommended Absoft compiler flags for flag mining application source bases:
Most aggressive: -O3
  Loop transformations, instruction preference tuning, cache tiling, and SIMD code generation (CG). Generally provides the best performance but may cause compilation failure or slow performance in some cases
  Strongly recommended for any single precision source code
Middle of the road: -O2
  Enables most options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, pipelining and unrolling
  In double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
Least aggressive: -O1
Absoft Compiler Flags: Functionality Flags
-mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2GB
-g77: enables full compatibility with g77-produced objects and libraries (must use this option to link to GNU ACML libraries)
-fpic: use when linking to shared object (dynamically linked) libraries
-safefp: performs certain floating point operations in a slower manner that avoids overflow and underflow and assures proper handling of NaNs
Pathscale Compiler Flags: Optimization Flags
Most aggressive: -Ofast
  Equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
Aggressive: -O3
  Optimizations for highest quality code, enabled at the cost of compile time
  Some of the generally beneficial optimizations included may hurt performance
Reasonable: -O2
  Extensive conservative optimizations
  Optimizations almost always beneficial
  Faster compile time
  Avoids changes which affect floating point accuracy
Pathscale Compiler Flags: Functionality Flags
-mcmodel=medium: use if static data structures are greater than 2GB
-ffortran-bounds-check: (Fortran) check array bounds
-shared: generate position independent code for calling shared object libraries
Feedback Directed Optimization
  STEP 0: Compile the binary with -fb_create fbdata
  STEP 1: Run the code to collect data
  STEP 2: Recompile the binary with -fb_opt fbdata
-march=(opteron|athlon64|athlon64fx): optimize code for the selected platform (Opteron is the default)
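The three feedback-directed optimization steps above can be written out as a command sequence. The sketch below only prints the commands, so it can be inspected before anything is run. It assumes the pathcc driver, that the profile name is passed as a separate argument (-fb_create fbdata, mirroring -fb_opt fbdata), and that -O2 is used for both builds; the function name fdo_commands and the file names are hypothetical.

```sh
# fdo_commands SOURCE EXE [RUN_ARGS...]
# Prints the three commands of a feedback-directed build, one per line.
fdo_commands() {
    src=$1; exe=$2; shift 2
    # STEP 0: instrumented build that writes profile data to "fbdata"
    printf 'pathcc -O2 -fb_create fbdata -o %s %s\n' "$exe" "$src"
    # STEP 1: training run (any remaining arguments are passed to the binary)
    if [ $# -gt 0 ]; then
        printf './%s %s\n' "$exe" "$*"
    else
        printf './%s\n' "$exe"
    fi
    # STEP 2: rebuild using the collected profile
    printf 'pathcc -O2 -fb_opt fbdata -o %s %s\n' "$exe" "$src"
}
```

Piping the output through a shell, for example `fdo_commands app.c app input.dat | sh -e`, would execute the steps in order and stop at the first failure. The training input should resemble production workloads, since the profile steers the second compile.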
ACML 2.1: Features; BLAS, LAPACK, FFT Performance; OpenMP Performance; ACML 2.5 Snapshot (soon to be released)
Components of ACML: BLAS, LAPACK, FFTs
Linear Algebra (LA)
  Basic Linear Algebra Subprograms (BLAS)
    o Level 1 (vector-vector operations)
    o Level 2 (matrix-vector operations)
    o Level 3 (matrix-matrix operations)
    o Routines involving sparse vectors
  Linear Algebra PACKage (LAPACK)
    o Leverages BLAS to perform complex operations
    o 28 threaded LAPACK routines
Fast Fourier Transforms (FFTs)
  1D, 2D, single, double, r-r, r-c, c-r, c-c support
C and Fortran interfaces
64-bit BLAS Performance DGEMM ( Double Precision General Matrix Multiply )
64-bit FFT Performance (non-power of 2): MKL vs ACML on 2.2 GHz Opteron
64-bit FFT Performance (non-power of 2): 2.2 GHz Opteron vs 3.2 GHz Xeon EM64T
Multithreaded LAPACK Performance: Double Precision (LU, Cholesky, QR Factorize/Solve)
Conclusion and Closing Points
How good is our performance?
  Averaging over 70 BLAS/LAPACK/FFT routines
  Computation-weighted average
  All measurements performed on a 4P AMD Opteron(TM) 844 Quartet server
  ACML 32-bit is 55% faster than MKL
  ACML 64-bit is 80% faster than MKL 6.1
64-bit ACML 2.5 Snapshot: Small DGEMM Enhancements
Benchmark suites
Recent Caspur Results (thanks to M. Rosati)
ATLSIM: A full-scale GEANT3 simulation of the ATLAS detector (P. Nevski) (typical LHC Higgs events)
SixTrack: Tracking of two particles in a 6-dimensional phase space including synchrotron oscillations (F. Schmidt). SixTrack benchmark code: E. McIntosh
CERN U: Ancient "CERN Units" benchmark (E. McIntosh)
What was measured
On both platforms, we ran one or two simultaneous jobs for each of the benchmarks. On Opteron, we used the SuSE "numactl" interface to make sure that at any time each of the two processors uses the right bank of memory.
Example of submission, 2 simultaneous jobs:
  Intel: ./TestJob; ./TestJob
  AMD: numactl --cpubind=0 --membind=0 ./TestJob; numactl --cpubind=1 --membind=1 ./TestJob
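The Opteron submission above can be sketched as a pair of shell functions. The sketch assumes a two-node machine (nodes 0 and 1) and starts the jobs in the background so that they actually overlap in time. The function names are hypothetical, and the numactl options are the ones quoted on the slide.

```sh
# numa_cmd NODE JOB [ARGS...]
# Prints the numactl command line that pins JOB's CPU and memory to NODE.
numa_cmd() {
    node=$1; shift
    printf 'numactl --cpubind=%s --membind=%s %s\n' "$node" "$node" "$*"
}

# run_pinned JOB [ARGS...]
# Starts one copy of JOB on node 0 and one on node 1 at the same time,
# then waits for both to finish.
run_pinned() {
    for n in 0 1; do
        $(numa_cmd "$n" "$@") &   # unquoted so the printed line splits into words
    done
    wait
}
```

With these definitions, `time run_pinned ./TestJob` would give the elapsed time for two simultaneous pinned jobs, and `numa_cmd 0 ./TestJob` shows the command used for a single one.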
Results
Columns: CERN Units (1 job, 2 jobs); SixTrack, seconds/run (1 job, 2 jobs); ATLSIM, seconds/event (1 job, 2 jobs)
Intel Nocona , , ,484
AMD Opteron , , ,389
While both machines behave in a similar way when only one job is run, the situation changes visibly in the case of two jobs. It may take up to 30% more time to run two simultaneous jobs on Intel, while on AMD there is no visible performance drop.
HEP Software bench
An original MPI work on AMD Opteron
We got access to the MPI wrapper-library source
Environment:
  – 4-way servers
  – Myrinet interconnect
  – Linux 2.6 kernel
  – LAM MPI
We inserted libnuma calls after MPI_INIT to bind the newly created MPI tasks to specific processors
  – We avoid unnecessary memory traffic by having each processor access its own memory
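The work described above places libnuma calls inside the MPI wrapper library. As a rough stand-in that needs no source changes, the same rank-to-processor pinning idea can be sketched with a launcher wrapper around numactl. This is a different technique from the one on the slide. It assumes one NUMA node per processor on a 4-way server and a round-robin mapping, and it takes the rank as an explicit argument because how a launcher exposes the rank is not covered here. The function names are hypothetical.

```sh
# rank_to_node RANK [NODES]
# Maps an MPI rank to a NUMA node round-robin. NODES defaults to 4,
# assuming one node per processor on a 4-way server.
rank_to_node() {
    echo $(( $1 % ${2:-4} ))
}

# bind_rank RANK COMMAND [ARGS...]
# Replaces the current shell with COMMAND, pinned to the node chosen for RANK,
# so the task computes on one processor and allocates from its local memory.
bind_rank() {
    node=$(rank_to_node "$1")
    shift
    exec numactl --cpubind="$node" --membind="$node" "$@"
}
```

Doing the binding inside the library, as on the slide, has the advantage that every MPI program linked against it is pinned automatically, whereas a wrapper like this has to be put in front of each launch by hand.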
>20% improvement
Conclusions AMD Opteron: HPEW, High Performance Easy Way