Presentation is loading. Please wait.

Presentation is loading. Please wait.

Multi-core Programming Tools. 2 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Topics General Ideas Compiler Switches Dual Core.

Similar presentations


Presentation on theme: "Multi-core Programming Tools. 2 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Topics General Ideas Compiler Switches Dual Core."— Presentation transcript:

1 Multi-core Programming Tools

2 2 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Topics General Ideas Compiler Switches Dual Core Vectorization

3 3 General Ideas - Optimization Exploiting Architectural Power requires Sophisticated Compilers Optimal use of – Registers & functional units – Dual-Core/Multi-processor – SSE instructions – Cache architecture

4 4 General Ideas – Compatibility Compatibility is a key concern for software development. Always read manuals to ensure: Tool compatibility with hardware revisions. Tool compatibility with software revisions. Tool compatibility with IDE. Tool compatibility with native (development) and target (deployment) operating system.

5 5 General Ideas – Intel C++ Compatibility with Microsoft Source & binary compatible with VC2003 with /Qvc71, Source & binary compatible with w/ VC 2005 under /Qvc8. Microsoft* & Intel OpenMP binaries are not compatible. Use the one compiler for all modules compiled with OpenMP For more information, refer to the User’s Guide

6 6 General Ideas – Tools Never ignore or take tools for granted. A key part of system development is initial specification and qualification of tools required to get the job done. The wrong tool alone can destroy project success chances.

7 7 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version General Ideas - Use Intel Compiler in Microsoft IDE C++

8 8 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Topics General Ideas Compiler Switches Dual Core Vectorization

9 9 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Compiler Switches - General Optimizations Windows*Linux*Mac* /Od-O0 Disables optimizations /Zi-g Creates symbols /O1-O1 Optimize for Binary Size: Server Code /O2-O2 Optimizes for speed (default) /O3-O3 Optimize for Data Cache: Loopy Floating Point Code

10 10 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Compiler Switches - Multi-pass Optimization Interprocedural Optimizations (IPO) ip: Enables interprocedural optimizations for single file compilation ipo: Enables interprocedural optimizations across files Can inline functions in separate files Enhances optimization when used in combination with other compiler features Windows*Linux*Mac* /Qip-ip /Qipo-ipo

11 11 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Compiler Switches - Multi-pass Optimization - IPO Usage: Two-Step Process Linking Windows*icl /Qipo main.o func1.o func2.o Linux*icc -ipo main.o func1.o func2.o Mac*icc -ipo main.o func1.o func2.o Pass 1 Pass 2 virtual.o executable Compiling Windows*icl -c /Qipo main.c func1.c func2.c Linux*icc -c -ipo main.c func1.c func2.c Mac*icc -c -ipo main.c func1.c func2.c

12 12 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Compiler Switches - Profile Guided Optimizations (PGO) Use execution-time feedback to guide many other compiler optimizations Helps I-cache, paging, branch-prediction Enabled optimizations: – Basic block ordering – Better register allocation – Better decision of functions to inline – Function ordering – Switch-statement optimization – Better vectorization decisions

13 13 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Instrumented Compilation (Mac*/Linux*)icc -prof_gen[x] prog.c (Windows*)icl -Qprof_gen[x] prog.c Instrumented Execution Run program on a typical dataset Feedback Compilation (Mac/Linux)icc -prof_use prog.c (Windows)icl -Qprof_use prog.c DYN file containing dynamic info:.dyn Instrumented executable Merged DYN summary file:.dpi Delete old dyn files if you do not want the info included Step 1 Step 2 Step 3 Compiler Switches - Multi-pass Optimization PGO: Three-Step Process

14 14 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Topics General Ideas Compiler Switches Dual Core Auto Parallelization OpenMP Threading Diagnostics Vectorization

15 15 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Auto-parallelization Auto-parallelization: Automatic threading of loops without having to manually insert OpenMP* directives. – Compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze. Windows*Linux*Mac* /Qparallel-parallel /Qpar_report[n]-par_report[n]

16 16 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version OpenMP* Threading Technology Pragma based approach to parallelism Usage: – OpenMP switches: -openmp : /Qopenmp – OpenMP reports: - openmp-report : /Qopenmp-report #pragma omp parallel for for (i=0;i<MAX;i++) A[i]= c*A[i] + B[i];

17 17 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version OpenMP: Workqueueing Extension Example Intel Compiler’s Workqueuing extension – Create Queue of tasks…Works on… Recursive functions Linked lists, etc. #pragma intel omp parallel taskq shared(p) { while (p != NULL) { #pragma intel omp task captureprivate(p) do_work1(p); p = p->next; }

18 18 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Parallel Diagnostics Source Instrumentation for Intel Thread Checker Allows thread checker to diagnose threading correctness bugs To use tcheck/Qtcheck you must have Intel Thread Checker installed Windows*Linux*Mac* /Qtcheck-tcheckNo support

19 19 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Topics General Ideas Compiler Switches Dual Core Vectorization SSE & Vectorization Vectorization Reports Explanations of a few specific vectorization inhibitors

20 20 SIMD – SSE, SSE2, SSE3 Support 16x bytes 8x words 4x dwords 2x qwords 1x dqword 4x floats 2x doubles MMX* SSE SSE2 SSE3 * MMX actually used the x87 Floating Point Registers - SSE, SSE2, and SSE3 use the new SSE registers

21 21 SIMD FP using AOS format* Thread Synchronization Video encoding Complex arithmetic FP to integer conversions HADDPD, HSUBPD HADDPS, HSUBPS MONITOR, MWAIT LDDQU ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP FISTTP * Also benefits Complex and Vectorization SSE3 Instructions

22 22 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Using SSE3 - Your Task: Convert This… 128-bit Registers A[0] B[0] C[0] + + + + A[1] B[1] C[1] not used for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

23 23 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version … Into This … 128-bit Registers A[3] A[2] B[3] B[2] C[3] C[2] + + A[1] A[0] B[1] B[0] C[1] C[0] + + for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

24 24 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Compiler Based Vectorization Processor Specific DescriptionUseWindows*Linux*Mac* Generate instructions and optimize for Intel ® Pentium ® 4 compatible processors including MMX, SSE and SSE2. W/QxW-xWDoes not apply Generate instructions and optimize for Intel ® processors with SSE3 capability including Core Duo. These processors support SSE3 as well as MMX,SSE and SSE2. P/QxP /QaxP -xP, -axP Vector- ization occurs by default

25 25 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Compiler Based Vectorization Automatic Processor Dispatch – ax[?] Single executable – Optimized for Intel® Core Duo processors and generic code that runs on all IA32 processors. For each target processor it uses: – Processor-specific instructions – Vectorization Low overhead – Some increase in code size

26 26 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Why Loops Don’t Vectorize Independence – Loop Iterations generally must be independent Some relevant qualifiers: – Some dependent loops can be vectorized. – Most function calls cannot be vectorized. – Some conditional branches prevent vectorization. – Loops must be countable. – Outer loop of nest cannot be vectorized. – Mixed data types cannot be vectorized.

27 27 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Why Didn’t My Loop Vectorize? Windows* Linux* Macintosh* -Qvec_reportn-vec_reportn-vec_reportn Set diagnostic level dumped to stdout – n=0: No diagnostic information – n=1: (Default) Loops successfully vectorized – n=2: Loops not vectorized – and the reason why not – n=3: Adds dependency Information – n=4: Reports only non-vectorized loops – n=5: Reports only non-vectorized loops and adds dependency info

28 28 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Why Loops Don’t Vectorize – “Existence of vector dependence” – “Nonunit stride used” – “Mixed Data Types” – “Unsupported Loop Structure” – “Contains unvectorizable statement at line XX” – There are more reasons loops don’t vectorize but we will disucss the reasons above

29 29 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version “Existence of Vector Dependency” Usually, indicates a real dependency between iterations of the loop, as shown here: for (i = 0; i < 100; i++) x[i] = A * x[i + 1];

30 30 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Defining Loop Independence Iteration Y of a loop is independent of when (or whether) iteration X occurs. int a[MAX], b[MAX]; for (j=0;j<MAX;j++) { a[j] = b[j]; }

31 31 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version “Nonunit stride used” for (I=0;I<=MAX;I++) for (J=0;J<=MAX;J++) { c[I][J]+=1; // Unit Stride c[J][I]+=1; // Non-Unit A[J*J]+=1; // Non-unit A[B[J]]+=1; // Non-Unit if (A[MAX-J])=1 last1=J;}// Non-Unit End Result: Loading Vector may take more cycles than executing operation sequentially. Memory

32 32 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version “Mixed Data Types” An example: int howmany_close(double *x, double *y) { int withinborder=0; double dist; for(int i=0;i<MAX;i++) { dist=sqrtf(x[i]*x[i] + y[i]*y[i]); if (dist<5) withinborder++; } Mixed data types are possible – but complicate things i.e.: 2 doubles vs 4 ints per SIMD register Some operations with specific data types won’t work

33 33 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version “Unsupported Loop Structure” Example: struct _xx { int data; int bound; } ; doit1(int *a, struct _xx *x) { for (int i=0; i bound; i++) a[i] = 0; An unsupported loop structure means the loop is not countable, or the compiler for whatever reason can’t construct a run-time expression for the trip count.

34 34 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version “Contains unvectorizable statement” for (i=1;i<nx;i++) { B[i] = func(A[i]); } 128-bit Registers A[3] A[2] B[3] B[2] func A[1] A[0] B[1] B[0] func

35 35 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Reference White papers and technical notes – www.intel.com/ids www.intel.com/ids – www.intel.com/software/products www.intel.com/software/products Product support resources – www.intel.com/software/products/support www.intel.com/software/products/support


Download ppt "Multi-core Programming Tools. 2 Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Topics General Ideas Compiler Switches Dual Core."

Similar presentations


Ads by Google