Day 1 Module 2: Use key compiler optimization switches

1 Day 1 Module 2

2 Use key compiler optimization switches
Upon completion of this module, you will be able to:
• Optimize software for the architecture
• Enhance performance with vectorization and other techniques

3 Introduction | Compiler switches | Dual Core | Vectorization

4 • Exploiting architectural power requires sophisticated compilers
• Optimal use of:
  • Registers and functional units
  • Dual-core / multiprocessor
  • SSE instructions
  • Cache architecture

5 Software Construction
• Source and binary compatible with VC2003 under /Qvc71.
• Source and binary compatible with VC 2005 under /Qvc8.
• Microsoft and Intel OpenMP binaries are not compatible.
• Use one compiler for all modules compiled with OpenMP.

6 Visual C++ 2005, x64: the compiler supports an IDE framework starting with the 9.1 compilers.

8 • ip: Enables interprocedural optimizations for single-file compilation
• ipo: Enables interprocedural optimizations across files
• Can inline functions in separate files
• Enhances optimization when used in combination with other compiler features
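
The cross-file inlining point can be pictured with a small sketch. Both functions are shown in one listing so that it compiles as-is; the file names util.c and main.c and the function names are hypothetical, not from the slides. Built as separate files, the call inside the loop is a candidate for inlining only when the ipo switch lets the compiler see across files.

```c
#include <assert.h>
#include <stddef.h>

/* Imagine this helper living in util.c (hypothetical file name). */
double scale(double x, double factor)
{
    return x * factor;
}

/* Imagine this caller living in main.c (hypothetical file name).
   Without cross-file optimization, scale() stays an opaque call here. */
double sum_scaled(const double *v, size_t n, double factor)
{
    assert(v != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += scale(v[i], factor);   /* small call in a hot loop */
    return total;
}
```

Once the helper is inlined, the loop body is a plain multiply-add, which may also make it easier for other optimizations to apply.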

9 [Figure: Pass 1, Pass 2, virtual.o, executable]

10 • Use execution-time feedback to guide many other compiler optimizations
• Helps I-cache, paging, branch prediction
• Enabled optimizations:
  • Basic block ordering
  • Better register allocation
  • Better decision of functions to inline
  • Function ordering
  • Switch-statement optimization

11 Step 1: Instrumented Compilation
  (Mac*/Linux*) icc -prof_gen[x] prog.c
  (Windows*) icl -Qprof_gen[x] prog.c
  Produces: instrumented executable
Step 2: Instrumented Execution
  Run program on a typical dataset
  Produces: DYN file containing dynamic info (.dyn)
Step 3: Feedback Compilation
  (Mac/Linux) icc -prof_use prog.c
  (Windows) icl -Qprof_use prog.c
  Merged DYN summary file: .dpi
  Delete old dyn files if you do not want the info included
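
As a rough illustration of what the profile can tell the compiler, consider a loop with a heavily biased branch. The function name classify and the assumption that negative values are rare are hypothetical; the build commands in the comment are the ones from the slide, and the ./a.out run line assumes the default output name.

```c
/* Build steps from the slide (Mac/Linux spelling):
     icc -prof_gen prog.c    instrumented executable
     ./a.out                 run on typical data, writes .dyn files
     icc -prof_use prog.c    feedback compilation                    */
#include <assert.h>
#include <stddef.h>

/* Hypothetical hot function. If the training data is almost always
   non-negative, the profile shows the else branch as the hot path. */
long classify(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long score = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < 0)
            score -= 10;    /* rare path */
        else
            score += 1;     /* hot path */
    }
    return score;
}
```

With that information the compiler can lay out the hot path contiguously, which is the basic block ordering benefit listed among the enabled optimizations.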

12 Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives. The compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze.
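
A sketch of what an “easy” candidate might look like, next to a loop that is hard to thread automatically. Both functions are hypothetical examples, not taken from the slides, and the auto-parallelizer switch itself is not named on the slide, so none is shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Easy candidate: countable loop, no dependence between iterations,
   and restrict promises that the arrays do not overlap. */
void add_arrays(double *restrict out, const double *restrict a,
                const double *restrict b, size_t n)
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Hard candidate: every iteration needs the result of the previous one. */
void prefix_sum(double *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        v[i] += v[i - 1];
}
```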

13 Pragma-based approach to parallelism
Usage:
  OpenMP switches: -openmp : /Qopenmp
  OpenMP reports: -openmp-report : /Qopenmp-report

  #pragma omp parallel for
  for (i=0;i<MAX;i++)
    A[i]= c*A[i] + B[i];
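
The loop on the slide can be wrapped into a complete function as follows. This is a minimal sketch: the value of MAX, the element type double, and the function name scale_add are assumptions, since the slide gives none of them.

```c
#include <assert.h>

#define MAX 1000   /* assumed size; the slide does not give one */

/* Scales A by c and adds B, with the iterations split across threads.
   If the OpenMP switch is not given, the pragma is ignored and the
   loop simply runs serially with the same result. */
void scale_add(double *A, const double *B, double c)
{
    assert(A != 0 && B != 0);
    int i;
    #pragma omp parallel for
    for (i = 0; i < MAX; i++)
        A[i] = c * A[i] + B[i];
}
```

The loop is safe to parallelize because each iteration reads and writes only its own A[i].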

14 Intel Compiler’s Workqueuing extension:
• Create queue of tasks
• Works on:
  • Recursive functions
  • Linked lists, etc.

  #pragma intel omp parallel taskq shared(p)
  {
    while (p != NULL) {
      #pragma intel omp task captureprivate(p)
      do_work1(p);
      p = p->next;
    }
  }
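
For comparison, the same linked-list traversal can be written with the standard OpenMP task construct (OpenMP 3.0 and later) instead of the Intel-specific taskq pragmas. This is a swapped-in technique, not the extension shown on the slide; the node type and the body of do_work1 are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

struct node {                 /* hypothetical list node */
    int value;
    struct node *next;
};

static void do_work1(struct node *p)   /* stand-in for the slide's do_work1 */
{
    assert(p != NULL);
    p->value *= 2;
}

void process_list(struct node *head)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            /* One thread walks the list and creates one task per node;
               the other threads in the team execute the tasks. */
            for (struct node *p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)
                do_work1(p);
            }
        }
    }   /* all tasks have completed by the end of the parallel region */
}
```

The firstprivate(p) clause plays the role of captureprivate(p) on the slide: each task gets its own copy of the pointer as it was when the task was created.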

15 • Source instrumentation for Intel Thread Checker
• Allows Thread Checker to diagnose threading correctness bugs
• To use -tcheck or /Qtcheck you must have Intel Thread Checker installed
• See Thread Checker documentation: http://www.intel.com/support/performancetools/sb/CS-009681.htm
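
One common threading correctness bug is a data race on a shared accumulator. The sketch below shows the correct form and marks in a comment where the race would appear; the function name sum_array is hypothetical, and the choice of this particular bug is an assumption about what such a checker is typically used for.

```c
#include <assert.h>

long sum_array(const int *v, int n)
{
    assert(v != 0 || n == 0);
    long sum = 0;
    int i;
    /* Dropping reduction(+:sum) would let all threads update the shared
       variable sum at once: a data race with unpredictable results. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += v[i];
    return sum;
}
```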

17 • SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
• Thread synchronization: MONITOR, MWAIT
• Video encoding: LDDQU
• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
• FP to integer conversions: FISTTP
*Also benefits Complex and Vectorization

18 [Figure: scalar addition, A[1] + B[1] = C[1]; the other register slots are not used]
  for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

19 [Figure: SIMD addition, A[3] A[2] + B[3] B[2] = C[3] C[2] and A[1] A[0] + B[1] B[0] = C[1] C[0]]
  for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];
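
Written by hand, the two-elements-per-instruction picture corresponds roughly to the sketch below, which uses SSE2 intrinsics when the compiler targets SSE2 and falls back to the scalar loop otherwise. The function name add_pairs and the choice of double are assumptions; a vectorizing compiler generates comparable code from the plain loop without any source changes.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2: two doubles per 128-bit register */
#endif

void add_pairs(double *c, const double *a, const double *b, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    /* Main loop: one packed add handles elements i and i + 1. */
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));
    }
#endif
    /* Remainder loop (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```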

21 • Single executable
• Optimized for Intel® Core Duo processors and generic code that runs on all IA32 processors
• For each target processor it uses:
  • Processor-specific instructions
  • Vectorization
• Low overhead
• Some increase in code size

22 • Independence
  • Loop iterations generally must be independent
Some relevant qualifiers:
• Some dependent loops can be vectorized
• Most function calls cannot be vectorized
• Some conditional branches prevent vectorization
• Loops must be countable
• Mixed data types cannot be vectorized

23 Windows*: -Qvec_reportn
Linux*: -vec_reportn
Macintosh*: -vec_reportn
Set diagnostic level dumped to stdout:
  n=0: No diagnostic information
  n=1: (Default) Loops successfully vectorized
  n=2: Loops not vectorized, and the reason why not
  n=3: Adds dependency information
  n=4: Reports only non-vectorized loops
  n=5: Reports only non-vectorized loops and adds dependency info

24 • Existence of vector dependence
• Nonunit stride used
• Mixed data types
• Unsupported loop structure
• Contains unvectorized statement at line XX
• There are more reasons loops don’t vectorize, but we will discuss the reasons above

25 Existence of vector dependence usually indicates a real dependency between iterations of the loop, as shown here:
  for (i = 0; i < 100; i++)
    x[i] = A * x[i + 1];
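
To see why a dependence between iterations blocks vectorization, it helps to compare serial execution with what a naive two-wide version would compute. The sketch below uses a backward reference x[i - 1] rather than the slide's x[i + 1]; that variation is an assumption chosen because it makes the hazard easy to reproduce. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Serial semantics: each element uses the value written just before it. */
void recur_serial(double *x, size_t n, double A)
{
    assert(x != NULL);
    for (size_t i = 1; i < n; i++)
        x[i] = A * x[i - 1];
}

/* Naive two-wide version: both inputs of a pair are loaded before
   either result is stored, so the second lane sees a stale x[i]. */
void recur_naive_pairs(double *x, size_t n, double A)
{
    assert(x != NULL);
    size_t i = 1;
    for (; i + 1 < n; i += 2) {
        double in0 = x[i - 1];
        double in1 = x[i];          /* stale: x[i] is not updated yet */
        x[i]     = A * in0;
        x[i + 1] = A * in1;
    }
    for (; i < n; i++)              /* scalar remainder */
        x[i] = A * x[i - 1];
}
```

Starting from five ones with A = 2, the serial loop ends with x[4] = 16 while the naive paired version ends with x[4] = 2, so the compiler has to refuse the transformation.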

26 Iteration Y of a loop is independent of when (or whether) iteration X occurs.
  int a[MAX], b[MAX];
  for (j=0;j<MAX;j++) {
    a[j] = b[j];
  }

27 [Figure: Memory] End result: loading a vector may take more cycles than executing the operation sequentially.
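
Assuming this slide illustrates the nonunit-stride case from the list of reasons, the difference can be sketched with two hypothetical summation loops: one walks memory contiguously, the other reads one column of a row-major matrix and so jumps cols elements between iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Unit stride: neighbouring iterations touch neighbouring memory,
   so one vector load can fetch several operands at once. */
double sum_unit(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Nonunit stride: consecutive iterations are cols doubles apart,
   so a vector has to be assembled from scattered loads. */
double sum_column(const double *m, size_t rows, size_t cols, size_t col)
{
    assert(col < cols);
    double s = 0.0;
    for (size_t r = 0; r < rows; r++)
        s += m[r * cols + col];
    return s;
}
```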

28 An example: Mixed data types are possible, but complicate things.
• Two doubles vs. 4 ints per SIMD register
Some operations with specific data types won’t work.
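
A loop of the following shape is one plausible instance of the mixed-type problem; it is a hypothetical example, not the one from the original slide. Each iteration reads an int and produces a double, so a 128-bit register that holds four of the inputs holds only two of the outputs.

```c
#include <assert.h>
#include <stddef.h>

/* Converts integer counts to scaled doubles. The int -> double
   conversion in every iteration is what mixes the data types. */
void scale_counts(double *out, const int *counts, double factor, size_t n)
{
    assert(out != NULL && counts != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = counts[i] * factor;
}
```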

29 An example: An unsupported loop structure means the loop is not countable, or the compiler for whatever reason can’t construct a run-time expression for the trip count.
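
The distinction between countable and non-countable loops can be sketched with two hypothetical functions. In the first, the trip count n is known before the loop starts; in the second, the exit depends on data discovered while iterating.

```c
#include <assert.h>
#include <stddef.h>

/* Countable: the number of iterations is n, fixed on entry. */
void negate(int *v, size_t n)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        v[i] = -v[i];
}

/* Not countable: the loop ends at the first zero element, so no
   trip count can be computed before the loop runs. */
size_t length_until_zero(const int *v)
{
    assert(v != NULL);
    size_t i = 0;
    while (v[i] != 0)
        i++;
    return i;
}
```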
