Published by Merryl Strickland. Modified over 9 years ago.
1
Day 1 Module 2:
2
Upon completion of this module, you will be able to:
  Use key compiler optimization switches
  Optimize software for the architecture
  Enhance performance with vectorization and other techniques
3
Introduction, Compiler switches, Dual Core, Vectorization
4
Exploiting architectural power requires sophisticated compilers
Optimal use of:
  Registers and functional units
  Dual-core / multiprocessor
  SSE instructions
  Cache architecture
5
Software Construction
  Source and binary compatible with VC 2003 under /Qvc71.
  Source and binary compatible with VC 2005 under /Qvc8.
  Microsoft and Intel OpenMP binaries are not compatible. Use one compiler for all modules compiled with OpenMP.
6
Visual C++ 2005, x64: the compiler supports an IDE framework starting with the 9.1 compilers.
8
ip: Enables interprocedural optimizations for single-file compilation.
ipo: Enables interprocedural optimizations across files.
  Can inline functions defined in separate files.
  Enhances optimization when used in combination with other compiler features.
9
[Figure: Pass 1 -> virtual .o; Pass 2 -> executable]
10
9/7/2015
Use execution-time feedback to guide many other compiler optimizations.
Helps I-cache, paging, and branch prediction.
Enabled optimizations:
  Basic block ordering
  Better register allocation
  Better decision of functions to inline
  Function ordering
  Switch-statement optimization
11
Step 1: Instrumented Compilation
  (Mac*/Linux*) icc -prof_gen[x] prog.c
  (Windows*)    icl -Qprof_gen[x] prog.c
  Output: instrumented executable
Step 2: Instrumented Execution
  Run the program on a typical dataset.
  Output: DYN file containing dynamic info (.dyn)
Step 3: Feedback Compilation
  (Mac/Linux) icc -prof_use prog.c
  (Windows)   icl -Qprof_use prog.c
  Output: merged DYN summary file (.dpi)
  Delete old .dyn files if you do not want their info included.
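As a rough illustration of the kind of code these three steps are aimed at, the sketch below has a branch whose outcome is heavily skewed on typical input. The function names (handle_error, sum_samples) and the rule that negative samples are errors are hypothetical and not from the slides; the commands in the comment are the ones listed above, and ./a.out assumes the compiler's default output name.

```c
#include <stddef.h>

/* Hypothetical example, not from the slides.
 * Build steps as listed on the slide (Mac/Linux form):
 *   icc -prof_gen prog.c     step 1: instrumented build
 *   ./a.out                  step 2: run on a typical dataset, writes a .dyn file
 *   icc -prof_use prog.c     step 3: feedback build
 */

/* Rare path: in this sketch, negative samples are treated as errors. */
int handle_error(int sample) {
    return -sample;
}

/* Sums samples; the error branch is assumed to be rarely taken. */
long sum_samples(const int *samples, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < 0)
            total += handle_error(samples[i]);  /* cold path */
        else
            total += samples[i];                /* hot path */
    }
    return total;
}
```

If the training run in step 2 rarely takes the error branch, the feedback build has the information it needs for decisions such as basic block ordering and whether to inline handle_error.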
12
Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.
The compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze.
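To make the "easy candidate" idea concrete, here is a minimal sketch contrasting a loop whose iterations are independent with one that carries a value from one iteration to the next. Both functions (scale_add, prefix_sum) are hypothetical examples, not taken from the slides, and whether a given compiler actually threads the first loop depends on its own analysis.

```c
#include <stddef.h>

/* Easy candidate: iteration i reads in[i] and writes out[i] only.
 * restrict tells the compiler the two arrays do not overlap. */
void scale_add(float *restrict out, const float *restrict in, float c, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = c * in[i] + 1.0f;
}

/* Hard candidate: iteration i needs the value written by iteration i - 1,
 * so the iterations cannot simply be handed to different threads. */
void prefix_sum(float *x, size_t n) {
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```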
13
Pragma-based approach to parallelism
Usage:
  OpenMP switches: -openmp (Linux*/Mac*) : /Qopenmp (Windows*)
  OpenMP reports: -openmp-report (Linux*/Mac*) : /Qopenmp-report (Windows*)

  #pragma omp parallel for
  for (i = 0; i < MAX; i++)
      A[i] = c*A[i] + B[i];
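A self-contained version of the loop from this slide might look like the sketch below. The function name update and the value of MAX are assumptions made only so the fragment compiles; the slide defines neither. Built without an OpenMP switch, the pragma is ignored and the loop runs serially with the same result.

```c
#define MAX 1000  /* hypothetical length; the slide does not define MAX */

/* A[i] = c*A[i] + B[i]; each iteration touches only element i,
 * so the iterations can be divided among threads. */
void update(float *A, const float *B, float c) {
    int i;
    #pragma omp parallel for
    for (i = 0; i < MAX; i++)
        A[i] = c * A[i] + B[i];
}
```

The loop index i is made private to each thread by the parallel for construct, so no extra clause is needed for it here.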
14
Intel Compiler’s workqueuing extension
Creates a queue of tasks. Works on:
  Recursive functions
  Linked lists, etc.

  #pragma intel omp parallel taskq shared(p)
  {
      while (p != NULL) {
          #pragma intel omp task captureprivate(p)
          do_work1(p);
          p = p->next;
      }
  }
15
Source Instrumentation for Intel Thread Checker
  Allows Thread Checker to diagnose threading correctness bugs.
  To use -tcheck (/Qtcheck), you must have Intel Thread Checker installed.
  See the Thread Checker documentation: http://www.intel.com/support/performancetools/sb/CS-009681.htm
17
SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
Thread synchronization: MONITOR, MWAIT
Video encoding: LDDQU
Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
FP to integer conversions: FISTTP
*Also benefits Complex and Vectorization
18
for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];
[Figure: scalar addition A[1] + B[1] = C[1]; the remaining register slots are marked "not used"]
19
for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];
[Figure: packed additions, A[3] A[2] + B[3] B[2] = C[3] C[2] and A[1] A[0] + B[1] B[0] = C[1] C[0]]
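The difference between the two pictures can be sketched in plain C. The second function below processes four elements per step and then finishes the leftovers one at a time, which is conceptually what a vectorizer does with the loop from the slide. The width of four, the value of MAX, and the function names are assumptions for illustration; a real compiler emits packed instructions, not four separate statements.

```c
#define MAX 9  /* hypothetical; the loop runs i = 0..MAX, so arrays hold MAX + 1 elements */

/* Scalar form from the slide: one addition per iteration. */
void add_scalar(const float *a, const float *b, float *c) {
    for (int i = 0; i <= MAX; i++)
        c[i] = a[i] + b[i];
}

/* Conceptual vector form: four additions per step, then a scalar remainder. */
void add_by_four(const float *a, const float *b, float *c) {
    int i = 0;
    for (; i + 3 <= MAX; i += 4) {
        /* These four adds stand in for one packed SIMD add. */
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i <= MAX; i++)  /* leftover elements that do not fill a group of four */
        c[i] = a[i] + b[i];
}
```

Note that the slide's loop condition is i<=MAX, so the arrays need MAX + 1 elements.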
21
Single executable
  Optimized for Intel® Core Duo processors, plus generic code that runs on all IA32 processors
For each target processor it uses:
  Processor-specific instructions
  Vectorization
Low overhead
  Some increase in code size
22
Independence
  Loop iterations generally must be independent
Some relevant qualifiers
  Some dependent loops can be vectorized
  Most function calls cannot be vectorized
  Some conditional branches prevent vectorization
  Loops must be countable
  Mixed data types cannot be vectorized
23
Windows*: -Qvec_reportn
Linux*: -vec_reportn
Macintosh*: -vec_reportn
Set diagnostic level dumped to stdout:
  n=0: No diagnostic information
  n=1: (Default) Loops successfully vectorized
  n=2: Loops not vectorized, and the reason why not
  n=3: Adds dependency information
  n=4: Reports only non-vectorized loops
  n=5: Reports only non-vectorized loops and adds dependency info
24
Existence of vector dependence
Nonunit stride used
Mixed data types
Unsupported loop structure
Contains unvectorized statement at line XX
There are more reasons loops don’t vectorize, but we will discuss the reasons above.
25
Usually indicates a real dependency between iterations of the loop, as shown here, where iteration i reads x[i + 1], the element that iteration i + 1 then overwrites:
  for (i = 0; i < 100; i++)
      x[i] = A * x[i + 1];
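To see why a dependence between iterations matters, the sketch below uses a related loop, not the one on the slide: each iteration reads x[i - 1], the value the previous iteration just stored. The second function imitates vector execution by performing a group of loads before any of the group's stores. The group width of 4, the array length N, and the function names are all assumptions for illustration.

```c
#define N 8
#define WIDTH 4  /* hypothetical SIMD width, for illustration only */

/* Sequential order: each iteration sees the value the previous one stored. */
void recurrence_scalar(float *x, float a) {
    for (int i = 1; i < N; i++)
        x[i] = a * x[i - 1];
}

/* Simulated vector order: up to WIDTH loads happen before any of the stores,
 * so later elements in a group read stale values. */
void recurrence_blocked(float *x, float a) {
    for (int i = 1; i < N; i += WIDTH) {
        float tmp[WIDTH];
        int n = (N - i < WIDTH) ? N - i : WIDTH;
        for (int k = 0; k < n; k++)
            tmp[k] = a * x[i + k - 1];
        for (int k = 0; k < n; k++)
            x[i + k] = tmp[k];
    }
}
```

Starting from an array of ones with a = 2, the sequential version doubles at every step, while the blocked version does not, because elements within a group never see their neighbor's new value. A compiler that cannot rule out this situation has to leave the loop scalar.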
26
Iteration Y of a loop is independent of when (or whether) iteration X occurs.
  int a[MAX], b[MAX];
  for (j = 0; j < MAX; j++) {
      a[j] = b[j];
  }
27
[Figure: Memory]
End result: loading the vector may take more cycles than executing the operation sequentially.
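Assuming this slide illustrates the "Nonunit stride used" reason listed earlier, the sketch below contrasts the two access patterns. Both functions are hypothetical examples. In the first, consecutive elements sit next to each other in memory, so several operands can be fetched together; in the second, the operands are spread out and have to be gathered individually.

```c
#include <stddef.h>

/* Unit stride: elements x[0], x[1], x[2], ... are adjacent in memory. */
void scale_unit(float *x, size_t n, float c) {
    for (size_t i = 0; i < n; i++)
        x[i] *= c;
}

/* Nonunit stride: only every stride-th element is touched,
 * so the operands of one vector are scattered across memory. */
void scale_strided(float *x, size_t n, size_t stride, float c) {
    for (size_t i = 0; i < n; i += stride)
        x[i] *= c;
}
```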
28
An example: mixed data types are possible, but complicate things.
  Two doubles vs. four ints per SIMD register
  Some operations with specific data types won’t work.
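A minimal sketch of what mixed data types look like in a loop, next to a loop that uses a single type throughout. The function names are hypothetical, and the lane counts in the comments assume 4-byte int, 8-byte double, 4-byte float, and 128-bit SIMD registers, matching the "two doubles vs. four ints" point above.

```c
#include <stddef.h>

/* Mixed types: int input (4 per register) feeds double output (2 per register),
 * so every element needs a conversion and the lane counts do not match. */
void int_to_scaled_double(const int *in, double *out, double c, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = c * in[i];
}

/* Uniform types: float in, float out, 4 lanes on both sides. */
void scale_float(const float *in, float *out, float c, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = c * in[i];
}
```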
29
An example: an unsupported loop structure means the loop is not countable, or the compiler, for whatever reason, can’t construct a run-time expression for the trip count.
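The distinction can be sketched with two copy loops. Both functions are hypothetical examples, not from the slides. In the first, the trip count n is known before the loop starts; in the second, the loop ends only when a zero is read, so there is no expression for the trip count ahead of time.

```c
#include <stddef.h>

/* Countable: the trip count n is fixed before the first iteration. */
void copy_n(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Not countable: the exit depends on data read inside the loop.
 * Returns the number of elements copied before the terminating zero. */
size_t copy_until_zero(int *dst, const int *src) {
    size_t i = 0;
    while (src[i] != 0) {
        dst[i] = src[i];
        i++;
    }
    return i;
}
```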