1 www.intel.com/software/products Intel® Compilers For Xeon™ Processor.


1 Intel® Compilers For Xeon™ Processor www.intel.com/software/products

2 Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

3 Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

4 General Optimizations  /Od, -O0: Disable optimizations  /Zi, -g: Create symbols  /O1, -O1: Optimize for speed without increasing code size, i.e. disables library function inlining  /O2, -O2: Optimize for speed (default)  /O3, -O3: High-level optimizations

5 Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

6 Instruction Scheduling  Schedule instructions to be optimal for specific processor instruction latencies and cache sizes

Processor                                                          Windows          Linux
Pentium® processors and Pentium processors with MMX™ technology    -G5              -tpp5
Pentium Pro, Pentium II and Pentium III processors                 -G6 (Default)    -tpp6 (Default)
Pentium 4 processor                                                -G7              -tpp7

Note: default may change in future compilers

7 Shift/Multiply Latency  Pentium –Shift has ~1x latency of adds –Multiply has ~10x latency of adds  Pentium Pro, II, and III –Shift has ~1x latency of adds –Multiply has ~3x latency of adds  Pentium 4 (may change in future releases) –Shift has ~8x latency of adds –Multiply has ~26x latency of adds  Compiler accounts for these differences for you! Under the Covers: P4

8 for (int i=0;i<length;i++) { p[i] = q[i] * 32; }

.B1.7:                              # -tpp6
    movl (%ebx,%edx,4), %eax
    shll $5, %eax
    movl %eax, (%esi,%edx,4)
    incl %edx
    cmpl %ecx, %edx
    jl .B1.7

.B1.7:                              # -tpp7
    movl (%ebx,%edx,4), %eax
    addl %eax, %eax
    movl %eax, (%esi,%edx,4)
    addl $1, %edx
    cmpl %ecx, %edx
    jl .B1.7

Under the Covers: Xeon
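The listings differ in how the multiply by 32 is carried out: a shift under -tpp6, adds under -tpp7. A minimal sketch in C of the arithmetic involved, with illustrative function names that are not from the slides. Since 32 is 2^5, a multiply by 32 equals a left shift by 5, and done with adds alone it takes five doublings.

```c
#include <stdint.h>

/* Illustrative helpers; the names are hypothetical, not from the slides.
   Unsigned arithmetic is used so that overflow wraps identically in all forms. */

/* Plain multiply, as written in the source loop. */
uint32_t scale_mul(uint32_t x) { return x * 32u; }

/* Shift form: 32 == 2^5, so x * 32 == x << 5. */
uint32_t scale_shift(uint32_t x) { return x << 5; }

/* Add-only form: each x + x doubles the value; five doublings give x * 32. */
uint32_t scale_adds(uint32_t x) {
    for (int k = 0; k < 5; k++)
        x = x + x;
    return x;
}
```

Which form is cheapest depends on the relative add, shift, and multiply latencies of the target processor, which is what the scheduling switches select.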

9 Which Processor: [a]x?

To require at least...                                                                      Use    Windows*    Linux*
Pentium Pro and Pentium II processors with CMOV and FCMOV instructions                      i      -Qaxi       -axi
Pentium processors with MMX instructions                                                    M      -QaxM       -axM
Pentium III processor with Streaming SIMD Extensions (implies i and M above)                K      -QaxK       -axK
Pentium 4 processor with Streaming SIMD Extensions 2 (implies i, M and K above)             W      -QaxW       -axW

10 Automatic Processor Dispatch  Single executable –Pentium 4 target that runs on all x86 processors.  For Target Processor it uses: –Processor Specific Opcodes –Prefetch (Pentium III only) –Vectorization  Low Overhead –Some increase in code size  Can mix and match: -xK -axW together makes Xeon/Pentium 4 the target and Pentium III the default
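Processor dispatch can be pictured as choosing an implementation at run time. A minimal hand-written sketch, assuming a hypothetical cpu_level flag in place of real CPU detection. All names are illustrative, and the "tuned" path is only a stand-in for processor-specific code. It is not the compiler's actual dispatch mechanism.

```c
#include <stddef.h>

/* Hypothetical feature level; a real dispatcher would query the CPU. */
typedef enum { CPU_GENERIC = 0, CPU_SSE2 = 1 } cpu_level;

typedef void (*scale_fn)(float *dst, const float *src, size_t n);

/* Default path: plain code that runs on any x86 processor. */
static void scale_generic(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}

/* Stand-in for the processor-specific path; here it only unrolls by 4. */
static void scale_tuned(float *dst, const float *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = src[i]     * 2.0f;
        dst[i + 1] = src[i + 1] * 2.0f;
        dst[i + 2] = src[i + 2] * 2.0f;
        dst[i + 3] = src[i + 3] * 2.0f;
    }
    for (; i < n; i++)          /* remainder elements */
        dst[i] = src[i] * 2.0f;
}

/* Pick the implementation once, based on the detected level. */
scale_fn select_scale(cpu_level level) {
    return level == CPU_SSE2 ? scale_tuned : scale_generic;
}
```

Both paths live in the same executable, which is where the code size increase comes from.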

11 Agenda  General  Xeon™ processor optimizations  Loop level optimizations  Multi-pass optimizations  Other

12 Vectorization  Automatically converts loops to utilize MMX/SSE/SSE2 instructions and registers.  Data types: char/short/int/float/double –(but not mixed)  Can Use Short Vector Math Library  Enabled through -[Q]xW, -[Q]xK, -[Q]axW, -[Q]axK  -vec_report3 tells you which loops were vectorized, and if not, why not.
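A minimal sketch of a loop that fits the description above: one data type (float) throughout and unit-stride access. The function name and the use of C99 restrict are assumptions for illustration. Whether a given compiler version vectorizes it is best checked with the vectorization report.

```c
#include <stddef.h>

/* Candidate for auto-vectorization: a single data type (float), unit
   stride, and restrict-qualified pointers so the two arrays are known
   not to overlap. */
void saxpy(float *restrict y, const float *restrict x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```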

13 High Level Optimizer  Windows: /O3 or Linux: -O3  Use with -xW, -xK, -QxW, -QxK, etc. – additional loop optimizations – more aggressive dependency analysis – scalar replacement – software prefetch (-xK on Pentium III)  Loops must meet criteria related to those for vectorization Under the Covers: Xeon
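Scalar replacement, one of the optimizations listed, can be shown by hand. A minimal sketch with hypothetical function names: the first version accumulates through an array element, the second keeps the running sum in a local scalar that can stay in a register. Both produce the same result.

```c
#include <stddef.h>

/* Before: row_sum[i] is addressed through memory in the inner loop. */
void row_sums_memory(int *row_sum, const int *m, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        row_sum[i] = 0;
        for (size_t j = 0; j < cols; j++)
            row_sum[i] += m[i * cols + j];
    }
}

/* After scalar replacement: accumulate in a local, store once per row. */
void row_sums_scalar(int *row_sum, const int *m, size_t rows, size_t cols) {
    for (size_t i = 0; i < rows; i++) {
        int acc = 0;
        for (size_t j = 0; j < cols; j++)
            acc += m[i * cols + j];
        row_sum[i] = acc;
    }
}
```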

14 SMP parallelism  OpenMP –Easy multithreading using directives –Use KSL tools for Development –Use Intel tools to optimize for IA in tandem with OpenMP  Auto-parallelization –Simple loops threaded by compiler alone  Loops must meet certain criteria…

15 OpenMP* Support  OpenMP 1.1 for Fortran & 1.0 for C / C++ –Debugger info support for OpenMP –Assure for Threads supported with Intel Compiler  OpenMP switches: – -Qopenmp, -openmp (or -openmpP) – -QopenmpS, -openmpS (serial, for debugging) – -openmp_report[n] (diagnostics) – works in conjunction with vectorization
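A minimal sketch of directive-based threading, assuming a hypothetical sum_squares function. The reduction clause gives each thread a private partial sum that is combined at the end of the loop. When OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.

```c
/* Sum of squares with an OpenMP work-sharing loop. */
long sum_squares(const int *v, int n) {
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += (long)v[i] * v[i];
    return total;
}
```

Building with the OpenMP switch listed above enables the directive; building without it gives the serial version of the same source.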

16 Auto Parallelization  Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives. – -Qparallel (Windows*), -parallel (Linux*) – -Qpar_report[n], -par_report[n] (diagnostics)  Better to use OpenMP directives – Compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze.

17 Agenda  General and processor optimization  Loop level optimizations  Multi-pass optimizations –Inter Procedural Optimization –Profile Guided Optimization  Other

18 Inter-Procedural Optimizations (IPO)  -Qip, -ip: Enables interprocedural optimizations for single file compilation.  -Qipo, -ipo: Enables interprocedural optimizations across files.

19 Inter-Procedural Optimizations (IPO)  More benefits than just inlining –Partial inlining –Interprocedural constant propagation –Passing arguments in registers –Loop-invariant code motion –Dead code elimination –Helps vectorization, memory disambiguation
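A minimal sketch of code where interprocedural analysis has something to work with. The names are hypothetical. Every call to scale() passes the constant 4, so constant propagation across the call, followed by inlining, could reduce the helper to a multiply by a known constant inside the loop.

```c
/* Small helper: a natural inlining candidate. */
static int scale(int x, int factor) {
    return x * factor;
}

/* factor is always 4 at this call site, which interprocedural
   constant propagation can discover. */
int sum_scaled(const int *v, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += scale(v[i], 4);
    return total;
}
```

With both functions in one file this is a single-file (-ip) case; if scale() lived in another source file, the cross-file (-ipo) analysis would be the one that sees through the call.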

20 IPO Usage: 2 Step Process

Pass 1, Compiling (produces virtual .obj and .il files):
    Windows*: icl -c /Qipo main.c func1.c func2.c
    Linux*: icc -c -ipo main.c func1.c func2.c
Pass 2, Linking (produces the executable):
    Windows*: icl /Qipo main.obj func1.obj func2.obj
    Linux*: icc -ipo main.o func1.o func2.o

Windows* Hint: LINK=link.exe should be replaced with LINK=xilink.exe, i.e. xilink /Qipo main.obj func1.obj func2.obj

21 Profile-Guided Optimizations (PGO)  Use execution-time feedback to guide optimization  Helps I-cache, paging, branch-prediction  Enabled Optimizations: –Basic block ordering –Better register allocation –Better decision of functions to inline –Function ordering –Switch-statement optimization –Better vectorization decisions

22 PGO Usage: 3 Step Process

Step 1  Instrumented Compilation
        Windows: icl /Qprof_gen prog.c
        Linux: icc -prof_gen prog.c
        Instrumented Executable: prog.exe
Step 2  Instrumented Execution
        prog.exe (on a typical dataset)
        DYN file containing dynamic info: .dyn
Step 3  Feedback Compilation
        Windows: icl /Qprof_use prog.c
        Linux: icc -prof_use prog.c
        Merged DYN Summary File: .dpi

Delete old dyn files if you don’t want their info included too

23 When To Use PGO  Applications with lots of functions, calls, or branching that are not loop-bound –Examples: Databases, Decision-support (enterprise), MCAD –Apps with computation spread throughout; not confined to kernels  Considerations: –Different paradigm for builds - 3 steps –Schedule time in final stages of development when code is more stable. –Use representative data set(s) (not for corner cases)

24 Programs That Benefit  Consistent hot paths  Many if statements or switches  Nested if statements or switches  [Figure: scale of PGO benefit, from Significant Benefit to Little Benefit]
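A minimal sketch of the kind of code described above: branch-heavy, with no hot loop kernel, where which case dominates is only known from running on real data. The request types and cost values are hypothetical.

```c
/* Hypothetical request classifier: many branches, little arithmetic. */
typedef enum { REQ_READ, REQ_WRITE, REQ_DELETE, REQ_ADMIN } req_kind;

int request_cost(req_kind kind, int size) {
    switch (kind) {
    case REQ_READ:   return size;
    case REQ_WRITE:  return 2 * size + 1;
    case REQ_DELETE: return 5;
    case REQ_ADMIN:  return 100;
    }
    return -1;  /* unknown kind */
}

/* Driver over a batch of requests. Run on a representative batch during
   the instrumented execution, this records how often each case is taken. */
long total_cost(const req_kind *kinds, const int *sizes, int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        total += request_cost(kinds[i], sizes[i]);
    return total;
}
```

If the instrumented run sees mostly one request type, the feedback compilation has the information it needs to lay out that path favorably.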

25 Indirect Branches  Indirect branches are not as predictable –Compared with conditional branches –Usually generated for switch statements –Have much larger relative latency than direct branches  The Intel Compiler: –Optimizes likely cases to use conditional branches Under the Covers: P4
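The transformation described above can be written out by hand. A minimal sketch with hypothetical function names, assuming op == 0 is the likely case: the likely case is tested first with an ordinary conditional branch, and the switch handles everything else.

```c
/* Plain switch: a compiler may lower a dense switch to a jump table,
   which is reached through an indirect branch. */
int step_switch(int op, int acc) {
    switch (op) {
    case 0:  return acc + 1;
    case 1:  return acc - 1;
    case 2:  return acc * 2;
    case 3:  return -acc;
    default: return acc;
    }
}

/* Same logic with the likely case (assumed to be op == 0) peeled off
   into a conditional branch ahead of the switch. */
int step_peeled(int op, int acc) {
    if (op == 0)
        return acc + 1;
    return step_switch(op, acc);
}
```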

26 Agenda  General and processor optimization  Loop level optimizations  Multi-pass optimizations  Other –Floating point precision –Math Libraries –Other

27 Floating Point Precision

Windows          Linux            Description
-Op              -mp              Strict ANSI C and IEEE 754 Floating Point (subset of -Za/-ansi)
-Za              -Xc              Strict ANSI C and IEEE 754
-Qlong_double    -long_double     long double=80, not the default of 64
-Qprec*          -mp1             Precision closer to, but not quite, ANSI; faster than ANSI
-Qprec_div*      -prec_div*       Turn off division into reciprocal multiply
-Qpcn*           -pcn*            Round to n precision. n={32,64,80}
-Qrcd*           -rcd*            Remove code that truncates during float to integer conversions

* Only available on IA32
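Two of the rows above concern rewrites that change results slightly. A minimal sketch with hypothetical function names: dividing directly versus multiplying by a precomputed reciprocal, and the truncation that standard C performs when converting floating point to integer. The reciprocal form rounds twice, so it can differ from the direct division in the last bit for some inputs.

```c
/* Direct form: one division per element. */
void div_exact(float *out, const float *in, float d, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] / d;
}

/* Reciprocal form: one division up front, then multiplies. Usually
   faster, but rounds twice (once for 1/d, once for the multiply). */
void div_recip(float *out, const float *in, float d, int n) {
    float r = 1.0f / d;
    for (int i = 0; i < n; i++)
        out[i] = in[i] * r;
}

/* Standard C conversion from floating point to int truncates toward zero. */
int to_int_trunc(double x) {
    return (int)x;
}
```

When the divisor is a power of two, both forms agree exactly; for other divisors they agree only to within rounding error.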

28 Math Libraries  Intel’s LIBM (libimf on Linux)  Short Vector Math Library (SVML) –Used when vectorizing loops which have math functions in them  Automatically used when needed –LIB (windows), LD_LIBRARY_PATH (Linux) environment variables  Common math functions –sin/cos/tan/exp/sqrt/log, etc  Processor dispatch for every IA processor

29 Libraries on Linux  -i_dynamic link to shared libraries (default)  -static link to static libraries  -shared create a shared object  -Vaxlib link to portability library

30 Other Switches  More Switches  Pragmas –#pragma IVDEP –hints to compiler that loops are independent and can be vectorized  See Compiler User’s Guide and Reference  icc -help | icl -help  http://www.intel.com/software/products  Intel Developer Forum
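A minimal sketch of the ivdep pragma in use, with a hypothetical function name. Because k is not known at compile time, the compiler has to assume a[i + k] might read an element written by an earlier iteration. The pragma asserts that no such dependence exists, so the caller must guarantee k >= 0. Compilers that do not recognize the pragma ignore it.

```c
/* Shift-and-scale in place. The caller must pass k >= 0 and ensure
   that a has at least m + k elements. */
void shift_scale(float *a, int m, int k, float c) {
    #pragma ivdep
    for (int i = 0; i < m; i++)
        a[i] = a[i + k] * c;
}
```

The pragma is a promise from the programmer, not a check: with a negative k the loop would carry a real dependence and a vectorized version could give wrong results.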

31 Summary  Presented the major optimization switches of the Intel Compiler –General Switches –Vectorization & High Level Optimizations –Profile Guided Optimizations –InterProcedural Optimizations  Explained how the Intel Compiler takes advantage of current IA  Optimized PovRay using the Intel Compiler

