Published by Silvia Mills. Modified over 6 years ago.
1
Compiler Ecosystem November 22, 2018 Computation Products Group
2
Compiler Comparisons Table: Critical Features Supported by x86 Compilers
Features compared (columns): Vector SIMD Support, Peels Loops, Global IPA, OpenMP, Links ACML Libraries, Profile Guided Feedback, Aligns, Parallel Debuggers, Large Array Support, Medium Memory Model
Compilers compared (rows): PGI, GNU, Intel, Pathscale, Absoft, SUN, Microsoft
3
Intel CPUID Checks: How to determine if they exist in a binary
The CPUID instruction reports:
- types of x86/x86-64 instructions supported (SSE, SSE2, SSE3)
- vendor of the processor (GenuineIntel or AuthenticAMD)
The runtime library environments of the Intel C and FORTRAN compilers check the “Vendor of Processor” and, on a non-Intel processor, run down an alternate code path that either:
- segmentation faults because Intel doesn’t support non-Intel processors, or
- executes legacy code optimized for Pentium Pro, PII or PIII
CPUID checks also exist in Intel’s Math Kernel Library:
- applications calling FFTs or Linear Algebra are strongly impacted
- ISVs and customers must utilize ACML (likely a 2x performance boost)
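To make the two kinds of CPUID information concrete, here is a minimal Python sketch that decodes register values as CPUID would return them. Python cannot execute CPUID itself, so the register values are assumed to be supplied by the caller, and the function names are hypothetical. The vendor string comes from leaf 0 (EBX, EDX, ECX in that order); the SSE, SSE2 and SSE3 flags come from leaf 1 (EDX bits 25 and 26, ECX bit 0).

```python
import struct
from typing import Dict


def vendor_string(ebx: int, edx: int, ecx: int) -> str:
    """Assemble the 12-character vendor string from CPUID leaf 0 registers."""
    # The string is stored little-endian across EBX, EDX, ECX (in that order).
    return struct.pack("<III", ebx, edx, ecx).decode("ascii")


def simd_features(edx: int, ecx: int) -> Dict[str, bool]:
    """Decode SSE/SSE2/SSE3 support from CPUID leaf 1 feature registers."""
    return {
        "SSE": bool((edx >> 25) & 1),
        "SSE2": bool((edx >> 26) & 1),
        "SSE3": bool(ecx & 1),
    }


if __name__ == "__main__":
    # Register values that spell out the AMD vendor string.
    print(vendor_string(0x68747541, 0x69746E65, 0x444D4163))
    # A processor reporting SSE and SSE2 but not SSE3.
    print(simd_features((1 << 25) | (1 << 26), 0))
```

A dispatch keyed on `simd_features` checks instruction compatibility directly, whereas one keyed on `vendor_string` only checks who made the chip, which is the practice the slide objects to.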
4
Intel CPUID Checks: How to determine if they exist in a binary
How to check if CPUID checks exist in a binary:
- Dump all assembly instructions in the binary to a text file, type:
    objdump -d "binary" > binary.txt
- Search the “binary.txt” file for lines containing cpuid instructions, type:
    grep "cpuid" binary.txt
- The search above prints the instruction address at the beginning of each line containing cpuid
- cpuid is located in a function called “IntelProcessorIdentificationFunction”; determine how many times it is called in “binary.txt” by typing:
    grep "IntelProcessorIdentificationFunction" binary.txt
Illustrating to ISVs and customers the practices employed by Intel at the user’s inconvenience builds rapport and confidence between them and AMD
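The same search can be scripted. The sketch below is an illustration, not a vetted tool: it assumes the usual `objdump -d` line layout (hex address, colon, raw bytes, mnemonic), and the helper names are hypothetical. The symbol name is the one quoted on the slide and may differ between compiler versions.

```python
import re
from typing import List

# Matches an instruction line such as "  401132:\t0f a2 \tcpuid".
CPUID_LINE = re.compile(r"^\s*([0-9a-f]+):\s.*\bcpuid\b")


def cpuid_addresses(disassembly: str) -> List[str]:
    """Return the address of every cpuid instruction in objdump -d output."""
    hits = []
    for line in disassembly.splitlines():
        match = CPUID_LINE.match(line)
        if match:
            hits.append(match.group(1))
    return hits


def call_sites(disassembly: str, symbol: str) -> int:
    """Count call instructions whose target mentions the given symbol."""
    return sum(
        1
        for line in disassembly.splitlines()
        if re.search(r"\bcallq?\s", line) and symbol in line
    )


if __name__ == "__main__":
    with open("binary.txt") as handle:  # file produced by objdump -d
        text = handle.read()
    print(cpuid_addresses(text))
    print(call_sites(text, "IntelProcessorIdentificationFunction"))
```

Unlike a plain `grep` for the symbol, `call_sites` skips the line that defines the function and counts only the calls to it.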
5
Intel Compiler and MKL on Opteron: Threat Assessment of using Intel Compilers
The compiler is a weapon:
- the maker can control the code generated and run upon their chip and their competitor’s
- working with PGI and NAG we can address the performance and functionality issues of a customer by modifying the compiler or ACML
CPUID checks:
- instruction compatibility is not checked, but rather the Vendor ID
- AMD platform issues are not supported unless reproducible on Intel platforms
- CPUID checks are placed into code because Intel doesn’t trust users’ intellect
- issues on AMD platforms cannot be addressed and will not be reproducible, since we do not issue the same Vendor ID in the CPUID instruction
- ISVs and customers draw the conclusion that AMD platforms aren’t dependable
6
Intel Compiler and MKL on Opteron: Threat Assessment of using Intel Compilers
- The AMD Core Math Library (ACML) cannot be linked with the Intel 8.1 AMD64 compiler; the only option is Intel’s MKL
  - Opteron runs many Intel MKL routines at 25-75% of the rate at which it runs the counterpart ACML routines (ex: CFFT1D, CFFT2D, DGEMM, …)
  - ISVs and customers whose applications are performance bound by FFTs, BLAS or LAPACK are strongly impacted (ex: ANSYS performance increased 43% moving to 64-bit using ACML rather than MKL)
- Necessitates increasing the number of compilers and binaries required to support both AMD and Intel platforms
  - PGI creates both AMD (-tp k8-64) and Intel (-tp p7-64) tuned binaries
  - work done by AMD tuning the PGI compiler is leveraged also in Intel binaries
  - on LS-DYNA the PGI 64-bit binary targeted towards Xeon with -tp p7-64 is faster than the Intel 8.1 binary by 4%
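As a quick worked example of the 25-75% figure: if a routine runs under MKL at a given fraction of its ACML rate, switching that routine to ACML speeds it up by the reciprocal of the fraction. The Python sketch below shows the arithmetic only. The function name is hypothetical, and the result applies to the routine itself, not to a whole application.

```python
def routine_speedup(mkl_rate_fraction: float) -> float:
    """Speedup of one routine when moving from MKL to ACML.

    mkl_rate_fraction is the MKL rate divided by the ACML rate
    (0.25 to 0.75 for the routines cited on the slide).
    """
    if not 0.0 < mkl_rate_fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return 1.0 / mkl_rate_fraction


if __name__ == "__main__":
    for fraction in (0.25, 0.50, 0.75):
        print(fraction, round(routine_speedup(fraction), 2))
```

The slide's range therefore corresponds to routine-level speedups between roughly 1.33x and 4x.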
7
Intel Compiler and MKL on Opteron: Threat Assessment of using Intel Compilers
- Intel has stated at the link below that in 8.1 Intel compilers the switches to target chips without SSE2 or SSE3 will no longer function
  - Opteron lacks SSE3 support until Jackhammer in Q2 ‘05
  - the user will be unable to tell the compiler not to utilize SSE3 instructions
  - ISVs and customers will have no solution for using binaries built by Intel compilers upon Opteron
- Occurrences such as this will continue every time Intel introduces a new instruction set for x86 based systems (SSE4?)
- Users presently using the Intel compiler upon Opteron based systems, or ISVs supporting customers in a similar manner, will have no method of optimizing code for an AMD based system with the exception of compiling without optimization
8
Tuning Performance with Compilers: Maintaining Stability while Optimizing
STEP 0: Build application using the following procedure:
- compile all files with the most aggressive optimization flags below:
    -tp k8-64 -fastsse
- if compilation fails or the application doesn’t run properly, turn off vectorization:
    -tp k8-64 -fast -Mscalarsse
- if problems persist, compile at optimization level 0:
    -tp k8-64 -O0
STEP 1: Profile binary and determine performance critical routines
STEP 2: Repeat STEP 0 on performance critical functions, one at a time, and run binary after each step to check stability
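STEP 0 amounts to walking a ladder of flag sets from most to least aggressive and stopping at the first one that builds and runs correctly. A minimal Python sketch of that loop follows. The flag strings are taken from the slide, while `first_stable_flags` and the `builds_and_runs` callback are hypothetical names, with the callback standing in for whatever build-and-validate step a project uses.

```python
from typing import Callable, Optional, Sequence

# Flag sets from the slide, most aggressive first.
PGI_FLAG_LADDER = (
    "-tp k8-64 -fastsse",
    "-tp k8-64 -fast -Mscalarsse",
    "-tp k8-64 -O0",
)


def first_stable_flags(
    builds_and_runs: Callable[[str], bool],
    ladder: Sequence[str] = PGI_FLAG_LADDER,
) -> Optional[str]:
    """Return the most aggressive flag set that passes, or None if none do.

    builds_and_runs(flags) is assumed to compile with the given flags,
    run the application's checks, and report success as a bool.
    """
    for flags in ladder:
        if builds_and_runs(flags):
            return flags
    return None


if __name__ == "__main__":
    # Stand-in check: pretend vectorized builds fail validation.
    print(first_stable_flags(lambda flags: "-fastsse" not in flags))
```

For STEP 2 the same function can be called once per performance critical source file, with the callback rebuilding only that file and rerunning the checks.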
9
Tuning Memory IO Bandwidth: Optimizing large streaming operations
2 methods of writing to memory in x86/x86-64:
- traditional memory stores cause write allocates to cache
    mov %rax,(%rdi)
    movsd %xmm0,(%rdi)
    movapd %xmm0,(%rdi)
  - the cache line to be modified is read into cache
  - the cache is modified, then written to memory when the line is evicted by newly loaded data
  - to write N bytes, 2N bytes of bandwidth generated
- non-temporal stores bypass cache and write directly to memory
  - no write allocate to cache; to write N bytes, N bytes of bandwidth generated
  - data is not backed up into cache; do not use with often reused data
Use only on functions which write L2/2 bytes of data or more (half the L2 cache size), which normally assures little cache reuse value
Group all eligible routines into a common file so as to simplify the compilation procedure.
Enable non-temporal stores in the PGI compiler with the -Mnontemporal compiler option
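The bandwidth arithmetic and the L2/2 rule of thumb can be written down directly. The Python sketch below only models the slide's accounting (2N bytes of traffic for write-allocate stores, N bytes for non-temporal stores). The function names are hypothetical, and the 1 MB L2 size in the usage example is an assumption to replace with the target processor's actual value.

```python
def store_traffic_bytes(n_bytes: int, non_temporal: bool) -> int:
    """Memory traffic generated by writing n_bytes, per the slide's model."""
    # Write-allocate stores read each line before writing it back: 2N.
    # Non-temporal stores skip the read: N.
    return n_bytes if non_temporal else 2 * n_bytes


def prefer_non_temporal(bytes_written: int, l2_bytes: int) -> bool:
    """Apply the rule of thumb: use non-temporal stores at L2/2 or more."""
    return bytes_written >= l2_bytes // 2


if __name__ == "__main__":
    l2 = 1 << 20  # assumed 1 MB L2 cache
    for size in (4096, 512 * 1024, 64 * 1024 * 1024):
        print(
            size,
            prefer_non_temporal(size, l2),
            store_traffic_bytes(size, False),
            store_traffic_bytes(size, True),
        )
```

Routines that fall below the threshold are better left on ordinary stores, since their output is likely to be reused while it is still in cache.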
10
PGI Compiler Flags: Optimization Flags
Below are 3 different sets of recommended PGI compiler flags for flag mining application source bases:
- Most aggressive: -tp k8-64 -fastsse -Mipa=fast
  - enables instruction level tuning for Opteron, O2 level optimizations, SSE scalar and vector code generation, inter-procedural analysis, LRE optimizations and unrolling
  - strongly recommended for any single precision source code
- Middle of the road: -tp k8-64 -fast -Mscalarsse
  - enables all of the most aggressive except vector code generation, which can reorder loops and generate slightly different results
  - in double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
- Least aggressive: -tp k8-64 -O0 (or -O1)
11
PGI Compiler Flags: Functionality Flags
- -mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2GB
- -Mlarge_arrays: use if any array in your application is greater than 2GB
- -KPIC: use when linking to shared object (dynamically linked) libraries
- -mp: process OpenMP/SGI directives/pragmas (build multi-threaded code)
- -Mconcur: attempt auto-parallelization of your code on SMP system with OpenMP
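The two 2GB rules are easy to confuse, since one applies to the total of statically allocated data and the other to any single array. A minimal Python sketch of the decision follows, assuming "2GB" means 2 GiB and that the caller already knows both sizes. `pgi_memory_flags` is a hypothetical helper, not part of any PGI tooling.

```python
from typing import List

TWO_GB = 2 * 1024 ** 3  # assumption: the slide's "2GB" read as 2 GiB


def pgi_memory_flags(static_total_bytes: int, largest_array_bytes: int) -> List[str]:
    """Pick the PGI functionality flags the slide's 2GB rules call for."""
    flags = []
    if static_total_bytes > TWO_GB:
        # Net sum of statically allocated data exceeds 2GB.
        flags.append("-mcmodel=medium")
    if largest_array_bytes > TWO_GB:
        # At least one individual array exceeds 2GB.
        flags.append("-Mlarge_arrays")
    return flags


if __name__ == "__main__":
    # 3 GiB of static data spread over arrays of at most 1 GiB each.
    print(pgi_memory_flags(3 * 1024 ** 3, 1024 ** 3))
```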
12
Absoft Compiler Flags: Optimization Flags
Below are 3 different sets of recommended Absoft compiler flags for flag mining application source bases:
- Most aggressive: -O3
  - loop transformations, instruction preference tuning, cache tiling, and SIMD code generation (CG). Generally provides the best performance but may cause compilation failure or slow performance in some cases
  - strongly recommended for any single precision source code
- Middle of the road: -O2
  - enables most options enabled by -O3, including SIMD CG, instruction preferences, common sub-expression elimination, and pipelining and unrolling
  - in double precision source bases a good substitute, since Opteron has the same throughput on both scalar and vector code
- Least aggressive: -O1
13
Absoft Compiler Flags: Functionality Flags
- -mcmodel=medium: use if your application statically allocates a net sum of data structures greater than 2GB
- -g77: enables full compatibility with g77 produced objects and libraries (must use this option to link to GNU ACML libraries)
- -fpic: use when linking to shared object (dynamically linked) libraries
- -safefp: performs certain floating point operations in a slower manner that avoids overflow, underflow and assures proper handling of NaNs
14
Pathscale Compiler Flags: Optimization Flags
- Most aggressive: -Ofast
  - equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno
- Aggressive: -O3
  - optimizations for highest quality code enabled at cost of compile time
  - some generally beneficial optimizations included may hurt performance
- Reasonable: -O2
  - extensive conservative optimizations
  - optimizations almost always beneficial
  - faster compile time
  - avoids changes which affect floating point accuracy
15
Pathscale Compiler Flags: Functionality Flags
- -mcmodel=medium: use if static data structures are greater than 2GB
- -ffortran-bounds-check: (fortran) check array bounds
- -shared: generate position independent code for calling shared object libraries
- Feedback Directed Optimization
  - STEP 0: Compile binary with -fb_create fbdata
  - STEP 1: Run code, collect data
  - STEP 2: Recompile binary with -fb_opt fbdata
- -march=(opteron|athlon64|athlon64fx): optimize code for selected platform (Opteron is default)
16
Microsoft Compiler Flags: Optimization Flags
Recommended Flags: /O2 /Ob2 /GL /fp:fast
- /O2 turns on several general optimizations
- /Ob2 enables inline expansion
- /GL enables inter-procedural optimizations
- /fp:fast allows the compiler to use a fast floating point model
Feedback Directed Optimization
- STEP 0: Compile binary with /LTCG:PGI
- STEP 1: Run code, collect data
- STEP 2: Recompile binary with /LTCG:PGO
Turn off Buffer Over Run Checking
- The compiler enables /GS by default to check for buffer overruns. Turning off checking by specifying /GS- may result in additional performance
17
Microsoft Compiler Flags: Functionality Flags
- /GR enables run-time type information
- /GT supports fiber safety for data allocated using static thread-local storage
- /Wp64 detects most 64-bit portability problems
- /LD creates a dynamic-link library
- /Oa assumes no aliasing
- /Ow assumes aliasing across function calls but not inside functions
18
64-Bit Operating Systems: Recommendations and Status
- SUSE SLES 9 with latest Service Pack available
  - has technology for supporting latest AMD processor features
  - widest breadth of NUMA support, enabled by default
  - Oprofile system profiler installable as an RPM and modularized
  - complete support for static and dynamically linked 32-bit binaries
- Red Hat Enterprise Server 3.0 Service Pack 2 or later
  - NUMA features support not as complete as that of SUSE SLES 9
  - Oprofile installable as an RPM, but installation is not modularized and may require a kernel rebuild if RPM version isn’t satisfactory
  - only SP 2 or later has complete 32-bit shared object library support (a requirement to run all 32-bit binaries in 64-bit)
  - POSIX threading library changed between 2.1 and 3.0, may require users to rebuild applications