Intel® Compilers For Xeon™ Processor

Agenda
  General Xeon™ processor optimizations
  Loop level optimizations
  Multi-pass optimizations
  Other

General Optimizations
  /Od, -O0: Disable optimizations
  /Zi, -g: Create symbols
  /O1, -O1: Optimize for speed without increasing code size (i.e. disables library function inlining)
  /O2, -O2: Optimize for speed (default)
  /O3, -O3: High-level optimizations

Instruction Scheduling
  Schedule instructions to be optimal for specific processor instruction latencies and cache sizes

  Processor                                                        Windows          Linux
  Pentium® processors and Pentium processors with MMX™ technology  -G5              -tpp5
  Pentium Pro, Pentium II and Pentium III processors               -G6 (Default)    -tpp6 (Default)
  Pentium 4 processor                                              -G7              -tpp7

  Note: default may change in future compilers

Under the Covers: P4
Shift/Multiply Latency
  Pentium
    – Shift has ~1x latency of adds
    – Multiply has ~10x latency of adds
  Pentium Pro, II, and III
    – Shift has ~1x latency of adds
    – Multiply has ~3x latency of adds
  Pentium 4 (may change in future releases)
    – Shift has ~8x latency of adds
    – Multiply has ~26x latency of adds
  Compiler accounts for these differences for you!

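Under the Covers: Xeon
  Source loop:

    for (int i = 0; i < length; i++) {
        p[i] = q[i] * 32;
    }

  Generated with -tpp6 (multiply by 32 becomes one shift):

    .B1.7:
            movl  (%ebx,%edx,4), %eax
            shll  $5, %eax
            movl  %eax, (%esi,%edx,4)
            incl  %edx
            cmpl  %ecx, %edx
            jl    .B1.7

  Generated with -tpp7 (the shift becomes adds; five doublings give 2^5 = 32):

    .B1.7:
            movl  (%ebx,%edx,4), %eax
            addl  %eax, %eax
            addl  %eax, %eax
            addl  %eax, %eax
            addl  %eax, %eax
            addl  %eax, %eax
            movl  %eax, (%esi,%edx,4)
            addl  $1, %edx
            cmpl  %ecx, %edx
            jl    .B1.7
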
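The three forms the compiler chooses between can be written out in C to check that they agree. This is a minimal sketch, not compiler output: the function names are hypothetical, and unsigned arithmetic is used for the shift and add forms so that negative inputs stay well defined.

```c
#include <stddef.h>
#include <stdint.h>

/* Form written in the source loop: a plain multiply by 32. */
void scale_mul(int32_t *p, const int32_t *q, size_t length) {
    for (size_t i = 0; i < length; i++)
        p[i] = q[i] * 32;
}

/* Shift form (the -tpp6 choice): one left shift by 5. */
void scale_shift(int32_t *p, const int32_t *q, size_t length) {
    for (size_t i = 0; i < length; i++)
        p[i] = (int32_t)((uint32_t)q[i] << 5);
}

/* Add form (the -tpp7 choice): five doublings, 2^5 = 32. */
void scale_add(int32_t *p, const int32_t *q, size_t length) {
    for (size_t i = 0; i < length; i++) {
        uint32_t v = (uint32_t)q[i];
        v += v;  /* x2  */
        v += v;  /* x4  */
        v += v;  /* x8  */
        v += v;  /* x16 */
        v += v;  /* x32 */
        p[i] = (int32_t)v;
    }
}
```

All three produce the same values for inputs whose product fits in 32 bits; which one is fastest depends on the shift, add and multiply latencies of the target processor, which is the decision -tpp6 and -tpp7 make for you.
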
Which Processor: [a]x?

  To require at least...                                                                     Use   Windows*   Linux*
  Pentium Pro and Pentium II processors with CMOV and FCMOV instructions                     i     -Qaxi      -axi
  Pentium processors with MMX instructions                                                   M     -QaxM      -axM
  Pentium III processor with Streaming SIMD Extensions (implies i and M above)               K     -QaxK      -axK
  Pentium 4 processor with Streaming SIMD Extensions 2 (implies i, M and K above)            W     -QaxW      -axW

Automatic Processor Dispatch
  Single executable
    – Pentium 4 target that runs on all x86 processors
  For the target processor it uses:
    – Processor-specific opcodes
    – Prefetch (Pentium III only)
    – Vectorization
  Low overhead
    – Some increase in code size
  Can mix and match: -xK -axW together makes Xeon/Pentium 4 the target and Pentium III the default

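The idea behind processor dispatch can be sketched by hand in C: keep a generic routine and a processor-specific routine in the same executable and pick one at run time. This is only an illustration of the concept, not how the Intel compiler implements -ax. The function names are hypothetical, the "SSE2" routine is a plain C stand-in, and the CPU check assumes the GCC/Clang builtin `__builtin_cpu_supports` on x86 (falling back to the generic path elsewhere).

```c
#include <stddef.h>

typedef void (*scale_fn)(float *dst, const float *src, size_t n);

/* Generic path: runs on any processor. */
static void scale_generic(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* Stand-in for the processor-specific path (plain C here, no SSE2
 * intrinsics); a dispatching compiler would emit SSE2 code instead. */
static void scale_sse2(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + src[i];
}

/* Nonzero when the running CPU reports SSE2 (assumption: GCC/Clang on x86). */
static int cpu_has_sse2(void) {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2");
#else
    return 0;  /* unknown platform: stay on the generic path */
#endif
}

/* Choose the implementation for the processor we are running on. */
scale_fn select_scale(void) {
    return cpu_has_sse2() ? scale_sse2 : scale_generic;
}

/* Public entry point: the check runs once, later calls reuse the choice. */
void scale(float *dst, const float *src, size_t n) {
    static scale_fn chosen = NULL;
    if (chosen == NULL)
        chosen = select_scale();
    chosen(dst, src, n);
}
```

The cost is one extra copy of each dispatched routine plus a one-time check, which matches the "low overhead, some increase in code size" trade-off noted on the slide.
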
Vectorization
  Automatically converts loops to utilize MMX/SSE/SSE2 instructions and registers
  Data types: char/short/int/float/double
    – (but not mixed)
  Can use the Short Vector Math Library
  Enabled through -[Q]xW, -[Q]xK, -[Q]axW, -[Q]axK
  -vec_report3 tells you which loops were vectorized, and if not, why not

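As an illustration of the data-type rule above, here is a sketch of one loop that fits it and one that does not. The function names are hypothetical, and `restrict` is used on the assumption that the arrays never overlap; whether a given compiler version vectorizes either loop is best confirmed with -vec_report3.

```c
#include <stddef.h>

/* Good candidate: one data type (float), unit stride, and no iteration
 * depends on another. restrict promises that x and y do not overlap. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Mixed data types (int in, double out): the kind of loop the
 * "but not mixed" rule excludes. It still computes the right answer,
 * just as ordinary scalar code. */
void mixed_types(double *restrict out, const int *restrict in, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * 0.5;
}
```
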
Under the Covers: Xeon
High Level Optimizer
  Windows: /O3 or Linux: -O3
  Use with -xW, -xK, -QxW, -QxK, etc.
    – additional loop optimizations
    – more aggressive dependency analysis
    – scalar replacement
    – software prefetch (-xK on Pentium III)
  Loops must meet criteria related to those for vectorization

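Scalar replacement, one of the transformations listed above, is easy to show by hand. The sketch below uses hypothetical function names and writes the "before" and "after" forms as two separate functions; it illustrates the transformation itself, not the compiler's actual output.

```c
#include <stddef.h>

/* Before: a[i] is read from and written to memory on every inner iteration. */
void row_sums_memory(double *a, const double *b, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        a[i] = 0.0;
        for (size_t j = 0; j < cols; j++)
            a[i] += b[i * cols + j];
    }
}

/* After scalar replacement: the running sum lives in a local variable
 * (a register candidate) and is stored to a[i] once per row. */
void row_sums_scalar(double *a, const double *b, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < cols; j++)
            sum += b[i * cols + j];
        a[i] = sum;
    }
}
```

The compiler can only make this change itself when it can prove that a and b do not overlap, which is where the more aggressive dependency analysis comes in.
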
SMP parallelism
  OpenMP
    – Easy multithreading using directives
    – Use KSL tools for development
    – Use Intel tools to optimize for IA in tandem with OpenMP
  Auto-parallelization
    – Simple loops threaded by compiler alone
  Loops must meet certain criteria…

OpenMP* Support
  OpenMP 1.1 for Fortran & 1.0 for C / C++
    – Debugger info support for OpenMP
    – Assure for Threads supported with Intel Compiler
  OpenMP switches:
    – -Qopenmp, -openmp (or -openmpP)
    – -QopenmpS, -openmpS (serial, for debugging)
    – -openmp_report[n] (diagnostics)
    – works in conjunction with vectorization

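A minimal sketch of what "multithreading using directives" looks like in C follows. The function name is hypothetical. The directive is wrapped in `#ifdef _OPENMP` so the same source builds as ordinary serial code when OpenMP is not enabled.

```c
/* Dot product with an OpenMP work-sharing directive. Each thread keeps a
 * private partial sum; the reduction clause adds them up at the end. */
double dot(const double *x, const double *y, int n) {
    double sum = 0.0;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Built with -Qopenmp or -openmp the loop is threaded; built without, the directive is skipped and the function runs serially with the same result (up to floating-point summation order).
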
Auto Parallelization
  Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives
    – -Qparallel (Windows*), -parallel (Linux*)
    – -Qpar_report[n], -par_report[n] (diagnostics)
  Better to use OpenMP directives
    – The compiler can identify "easy" candidates for parallelization, but large applications are difficult to analyze

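To make "easy candidate" concrete, the sketch below contrasts two loops. It assumes that independence between iterations is one of the criteria the compiler checks (the slides do not list the criteria); the function names are hypothetical, and -par_report is the way to see what the compiler actually decided.

```c
/* Easy candidate: every iteration writes its own element and reads only
 * inputs, so iterations can run in any order or on different threads. */
void add_arrays(double *restrict c, const double *restrict a,
                const double *restrict b, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Not an easy candidate: iteration i needs the result of iteration i-1
 * (a loop-carried dependence), so the loop cannot simply be split up. */
void prefix_sum(double *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```
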
Agenda
  General and processor optimization
  Loop level optimizations
  Multi-pass optimizations
    – Inter Procedural Optimization
    – Profile Guided Optimization
  Other

Inter-Procedural Optimizations (IPO)
  -Qip, -ip: Enables interprocedural optimizations for single file compilation
  -Qipo, -ipo: Enables interprocedural optimizations across files

Inter-Procedural Optimizations (IPO)
  More benefits than just inlining
    – Partial inlining
    – Interprocedural constant propagation
    – Passing arguments in registers
    – Loop-invariant code motion
    – Dead code elimination
    – Helps vectorization, memory disambiguation

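Two of the listed benefits, interprocedural constant propagation and dead code elimination, can be illustrated with a hand-written before/after pair. The names are hypothetical, and everything sits in one file to keep the sketch self-contained (the single-file case that -ip covers; with -ipo the helper could live in a different source file).

```c
/* Helper with three behaviours selected by a mode argument. */
static int scale_value(int x, int mode) {
    if (mode == 0)
        return x;
    if (mode == 1)
        return x * 2;
    return x * x;
}

/* The only caller always passes mode == 1. */
int sum_doubled(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += scale_value(v[i], 1);
    return total;
}

/* What the loop can reduce to once the constant 1 is propagated into the
 * helper, the helper is inlined, and the untaken branches are removed. */
int sum_doubled_specialized(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += v[i] * 2;
    return total;
}
```

The simplified loop is also a cleaner candidate for vectorization, which is one way IPO "helps vectorization".
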
IPO Usage: 2 Step Process
  Pass 1, compiling (produces virtual .obj and .il files):
    Windows*: icl -c /Qipo main.c func1.c func2.c
    Linux*:   icc -c -ipo main.c func1.c func2.c
  Pass 2, linking (produces the executable):
    Windows*: icl /Qipo main.obj func1.obj func2.obj
    Linux*:   icc -ipo main.o func1.o func2.o
  Windows* hint: LINK=link.exe should be replaced with LINK=xilink.exe, i.e.: xilink /Qipo main.obj func1.obj func2.obj

Profile-Guided Optimizations (PGO)
  Use execution-time feedback to guide optimization
  Helps I-cache, paging, branch prediction
  Enabled optimizations:
    – Basic block ordering
    – Better register allocation
    – Better decision of functions to inline
    – Function ordering
    – Switch-statement optimization
    – Better vectorization decisions

PGO Usage: 3 Step Process
  Step 1: Instrumented compilation (produces the instrumented executable, prog.exe)
    Windows: icl /Qprof_gen prog.c
    Linux:   icc -prof_gen prog.c
  Step 2: Instrumented execution (produces a DYN file containing dynamic info: .dyn)
    prog.exe (on a typical dataset)
  Step 3: Feedback compilation (uses the merged DYN summary file: .dpi)
    Windows: icl /Qprof_use prog.c
    Linux:   icc -prof_use prog.c
  Delete old dyn files if you don't want their info included too

When To Use PGO
  Applications with lots of functions, calls, or branching that are not loop-bound
    – Examples: databases, decision-support (enterprise), MCAD
    – Apps with computation spread throughout; not confined to kernels
  Considerations:
    – Different paradigm for builds: 3 steps
    – Schedule time in final stages of development when code is more stable
    – Use representative data set(s) (not for corner cases)

Programs That Benefit
  Consistent hot paths
  Many if statements or switches
  Nested if statements or switches
  [Figure: PGO benefit scale, from "Significant Benefit" to "Little Benefit"]

Under the Covers: P4
Indirect Branches
  Indirect branches are not as predictable
    – Compared with conditional branches
    – Usually generated for switch statements
    – Have much larger relative latency than direct branches
  The Intel Compiler:
    – Optimizes likely cases to use conditional branches

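The transformation described above can be written out by hand. In this sketch the opcode enum and function names are hypothetical, and OP_ADD is simply assumed to be the likely case (in practice that knowledge would come from profile data). The peeled version tests the likely case with an ordinary conditional branch before reaching the switch.

```c
enum op { OP_ADD, OP_SUB, OP_MUL, OP_NEG };  /* hypothetical opcodes */

/* Plain switch: a compiler may lower this to a jump table,
 * i.e. one indirect branch per call. */
int apply_switch(enum op op, int a, int b) {
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_MUL: return a * b;
    case OP_NEG: return -a;
    default:     return 0;
    }
}

/* Likely case peeled off: OP_ADD (assumed hot) is handled by a
 * conditional branch, and only the rarer cases reach the switch. */
int apply_peeled(enum op op, int a, int b) {
    if (op == OP_ADD)
        return a + b;
    switch (op) {
    case OP_SUB: return a - b;
    case OP_MUL: return a * b;
    case OP_NEG: return -a;
    default:     return 0;
    }
}
```

Both functions return the same values; only the branch structure differs.
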
Agenda
  General and processor optimization
  Loop level optimizations
  Multi-pass optimizations
  Other
    – Floating point precision
    – Math Libraries
    – Other

Floating Point Precision

  Windows         Linux           Description
  -Op             -mp             Strict ANSI C and IEEE 754 floating point (subset of -Za/-ansi)
  -Za             -Xc             Strict ANSI C and IEEE 754
  -Qlong_double   -long_double    long double=80, not the default of 64
  -Qprec*         -mp1            Precision closer to, but not quite, ANSI; faster than ANSI
  -Qprec_div*     -prec_div*      Turn off division into reciprocal multiply
  -Qpcn*          -pcn*           Round to n precision, n={32,64,80}
  -Qrcd*          -rcd*           Remove code that truncates during float to integer conversions

  * Only available on IA32

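The "division into reciprocal multiply" transformation that -Qprec_div / -prec_div turns off can be written out explicitly to see why it affects precision. This is a sketch with hypothetical function names, assuming IEEE 754 double arithmetic: a true divide is rounded once, while the reciprocal form rounds twice and so may differ from it in the last bit.

```c
/* True division: a single, correctly rounded operation. */
double div_exact(double a, double b) {
    return a / b;
}

/* Reciprocal-multiply form: the reciprocal is rounded, then the product
 * is rounded again. Cheaper when b is reused, slightly less accurate. */
double div_reciprocal(double a, double b) {
    double r = 1.0 / b;
    return a * r;
}

/* Largest relative gap between the two forms over a, b in 1..n. */
double max_relative_gap(int n) {
    double worst = 0.0;
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++) {
            double q = div_exact((double)i, (double)j);
            double d = q - div_reciprocal((double)i, (double)j);
            if (d < 0.0)
                d = -d;
            if (d / q > worst)
                worst = d / q;
        }
    }
    return worst;
}
```

The gap stays within a few units of the last place, which is why the faster form is often acceptable; code that needs results identical to a true divide is the case for keeping the transformation off.
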
Math Libraries
  Intel's LIBM (libimf on Linux)
  Short Vector Math Library (SVML)
    – Used when vectorizing loops which have math functions in them
  Automatically used when needed
    – LIB (Windows), LD_LIBRARY_PATH (Linux) environment variables
  Common math functions
    – sin/cos/tan/exp/sqrt/log, etc.
  Processor dispatch for every IA processor

Libraries on Linux

  Switch        Description
  -i_dynamic    link to shared libraries (default)
  -static       link to static libraries
  -shared       create a shared object
  -Vaxlib       link to portability library

Other Switches
  More switches
  Pragmas
    – #pragma IVDEP
    – Hints to the compiler that loops are independent and can be vectorized
  See Compiler User's Guide and Reference
  icc -help | icl -help
  Intel Developer Forum

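A sketch of where an IVDEP hint is useful: in the loop below the compiler cannot tell from the code alone whether k is negative, so it has to assume iterations might depend on each other. The function name is hypothetical, and the caller is assumed to guarantee k >= 0 and an array of at least n + k elements. The preprocessor guard picks the Intel spelling of the pragma under the Intel compiler and the GCC equivalent (`#pragma GCC ivdep`) under GCC; other compilers see no pragma.

```c
/* Scales elements read k places ahead into the front of the array.
 * With k >= 0 every element is read before it can be overwritten, so
 * the assumed dependence does not actually occur. */
void scale_ahead(float *a, int n, int k) {
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (int i = 0; i < n; i++)
        a[i] = a[i + k] * 2.0f;
}
```

The pragma is a promise from the programmer, not something the compiler verifies: calling this with a negative k after asserting independence could give wrong results once the loop is vectorized.
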
Summary
  Presented the major optimization switches of the Intel Compiler
    – General switches
    – Vectorization & high-level optimizations
    – Profile-guided optimizations
    – Interprocedural optimizations
  Explained how the Intel Compiler takes advantage of current IA
  Optimized PovRay using the Intel Compiler