Intel Compilers 9.x on the Intel® Core Duo™ Processor Windows version Intel Software College
Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Objectives
At the successful completion of this module, you will be able to:
  Use key compiler optimization switches
  Optimize software for the architecture
  Enhance performance with vectorization and other techniques

Agenda
  Introduction
  Compiler Switches
  Dual Core
  Vectorization

Key to Optimizing: Intel® Core™ Duo
Exploiting architectural power requires sophisticated compilers that make optimal use of:
  Registers and functional units
  Dual-core/multi-processor
  SSE instructions
  Cache architecture

C++ Compatibility with Microsoft
  Source and binary compatible with VC 2003 under /Qvc71
  Source and binary compatible with VC 2005 under /Qvc8
  Microsoft* and Intel OpenMP binaries are not compatible. Use one compiler for all modules compiled with OpenMP.
  For more information, refer to the User’s Guide.

Use Intel Compiler in Microsoft IDE: C++

Agenda
  Introduction
  Compiler Switches
    Intel® C++ compiler
  Dual Core
  Vectorization

General Optimizations
  Windows*   Linux*/Mac*   Description
  /Od        -O0           Disables optimizations
  /Zi        -g            Creates symbols
  /O1        -O1           Optimize for binary size: server code
  /O2        -O2           Optimizes for speed (default)
  /O3        -O3           Optimize for data cache: loopy floating-point code

Multi-pass Optimization: Interprocedural Optimizations (IPO)
  ip: Enables interprocedural optimizations for single-file compilation
  ipo: Enables interprocedural optimizations across files
    Can inline functions in separate files
    Enhances optimization when used in combination with other compiler features
  Windows*   Linux*/Mac*
  /Qip       -ip
  /Qipo      -ipo

Multi-pass Optimization - IPO Usage: Two-Step Process
Pass 1, Compiling (produces virtual object files):
  Windows*   icl -c /Qipo main.c func1.c func2.c
  Linux*     icc -c -ipo main.c func1.c func2.c
  Mac*       icc -c -ipo main.c func1.c func2.c
Pass 2, Linking (produces the executable):
  Windows*   icl /Qipo main.obj func1.obj func2.obj
  Linux*     icc -ipo main.o func1.o func2.o
  Mac*       icc -ipo main.o func1.o func2.o

Profile Guided Optimizations (PGO)
Use execution-time feedback to guide many other compiler optimizations.
Helps I-cache, paging, and branch prediction.
Enabled optimizations:
  Basic block ordering
  Better register allocation
  Better decision of functions to inline
  Function ordering
  Switch-statement optimization
  Better vectorization decisions

Multi-pass Optimization PGO: Three-Step Process
Step 1, Instrumented Compilation (produces an instrumented executable):
  Mac*/Linux*   icc -prof_gen[x] prog.c
  Windows*      icl -Qprof_gen[x] prog.c
Step 2, Instrumented Execution (produces a DYN file containing dynamic info: .dyn):
  Run program on a typical dataset
Step 3, Feedback Compilation (uses the merged DYN summary file: .dpi):
  Mac*/Linux*   icc -prof_use prog.c
  Windows*      icl -Qprof_use prog.c
  Delete old .dyn files if you do not want their info included

Agenda
  Introduction
  Compiler Switches
  Dual Core
    Auto Parallelization
    OpenMP Threading
    Diagnostics
  Vectorization

Auto-parallelization
Auto-parallelization: automatic threading of loops without having to manually insert OpenMP* directives.
The compiler can identify “easy” candidates for parallelization, but large applications are difficult to analyze.
  Windows*          Linux*/Mac*
  /Qparallel        -parallel
  /Qpar_report[n]   -par_report[n]

OpenMP* Threading Technology
Pragma-based approach to parallelism
Usage:
  OpenMP switches: -openmp : /Qopenmp
  OpenMP reports: -openmp-report : /Qopenmp-report

  #pragma omp parallel for
  for (i=0;i<MAX;i++)
    A[i]= c*A[i] + B[i];
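
A minimal sketch of the parallel-for loop wrapped in a function, so it can be compiled and checked on its own. The name scale_add and its parameters are illustrative assumptions, not from the slides. When the file is compiled without an OpenMP switch, the pragma is ignored and the loop runs serially with the same result.

```c
/* Hypothetical helper: a[i] = c * a[i] + b[i] for n elements.
   Each iteration reads and writes only its own a[i] and b[i],
   so the iterations are independent and can be split across threads. */
void scale_add(double *a, const double *b, double c, int n)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = c * a[i] + b[i];
}
```

The loop index is declared outside the loop to stay close to the slide's style; OpenMP makes the index of a parallel for loop private to each thread.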

OpenMP: Workqueueing Extension Example
Intel Compiler’s workqueueing extension
  Creates a queue of tasks
  Works on:
    Recursive functions
    Linked lists, etc.

  #pragma intel omp parallel taskq shared(p)
  {
    while (p != NULL) {
      #pragma intel omp task captureprivate(p)
      do_work1(p);
      p = p->next;
    }
  }
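
For comparison, a sketch of the same linked-list pattern written with the standard OpenMP task construct (OpenMP 3.0 and later) rather than the Intel-specific taskq pragma shown on the slide. This is a swapped-in technique, not the workqueueing extension itself. The node type, the body of do_work1, and the name process_list are hypothetical. Without an OpenMP switch the pragmas are ignored and the list is processed serially.

```c
#include <stddef.h>

struct node {                 /* hypothetical list node */
    int value;
    struct node *next;
};

/* Hypothetical per-node work: doubles the stored value. */
static void do_work1(struct node *p)
{
    p->value *= 2;
}

/* One thread walks the list and creates one task per node.
   firstprivate(p) gives each task its own copy of the pointer,
   similar in spirit to captureprivate(p) in the taskq form. */
void process_list(struct node *head)
{
#pragma omp parallel
    {
#pragma omp single
        {
            struct node *p = head;
            while (p != NULL) {
#pragma omp task firstprivate(p)
                do_work1(p);
                p = p->next;
            }
        }
    }
}
```

All tasks are finished by the time process_list returns, because the parallel region ends with an implicit barrier.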

Parallel Diagnostics
Source instrumentation for Intel Thread Checker
  Allows Thread Checker to diagnose threading correctness bugs
  To use -tcheck or /Qtcheck you must have Intel Thread Checker installed
  See the Thread Checker documentation: mancetools/sb/CS htm
  Windows*   Linux*    Mac*
  /Qtcheck   -tcheck   No support

Agenda
  Introduction
  Compiler Switches
  Dual Core
  Vectorization
    SSE & Vectorization
    Vectorization Reports
    Explanations of a few specific vectorization inhibitors

SIMD: SSE, SSE2, SSE3 Support
[Figure: packed data types per register: 16x bytes, 8x words, 4x dwords, 2x qwords, 1x dqword, 4x floats, 2x doubles; instruction sets shown: MMX*, SSE, SSE2, SSE3]
* MMX actually used the x87 floating-point registers; SSE, SSE2, and SSE3 use the new SSE registers.

SSE3 Instructions
  SIMD FP using AOS format*:   HADDPD, HSUBPD, HADDPS, HSUBPS
  Thread synchronization:      MONITOR, MWAIT
  Video encoding:              LDDQU
  Complex arithmetic:          ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
  FP to integer conversions:   FISTTP
* Also benefits complex arithmetic and vectorization

Using SSE3 - Your Task: Convert This…

  for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];

[Figure: 128-bit registers each holding a single element (A[0], B[0], C[0], then A[1], B[1], C[1]); the rest of each register is not used]

… Into This …

  for (i=0;i<=MAX;i++)
    c[i]=a[i]+b[i];

[Figure: 128-bit registers each holding two elements; A[1] A[0] + B[1] B[0] gives C[1] C[0], and A[3] A[2] + B[3] B[2] gives C[3] C[2]]
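
To make the two-elements-per-register picture concrete, here is a hand-written sketch using SSE2 intrinsics for double-precision data. It only illustrates what the vectorizer aims for; it is not the code the compiler generates. The function name add_arrays and the explicit length n are assumptions for illustration. On targets where the __SSE2__ macro is not defined, the sketch falls back to the plain scalar loop.

```c
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* c[i] = a[i] + b[i] for n doubles, two elements per 128-bit register. */
void add_arrays(const double *a, const double *b, double *c, int n)
{
    int i = 0;
#if defined(__SSE2__)
    for (; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(&a[i]);           /* a[i], a[i+1] */
        __m128d vb = _mm_loadu_pd(&b[i]);           /* b[i], b[i+1] */
        _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));   /* one add, two results */
    }
#endif
    for (; i < n; i++)                              /* remainder or scalar fallback */
        c[i] = a[i] + b[i];
}
```

Unaligned loads and stores are used so the sketch works for any array address; aligned variants exist but require 16-byte-aligned data.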

Compiler Based Vectorization: Processor Specific
Use W: Generate instructions and optimize for Intel® Pentium® 4 compatible processors, including MMX, SSE, and SSE2.
  Windows*: /QxW
  Linux*:   -xW
  Mac*:     Does not apply
Use P: Generate instructions and optimize for Intel® processors with SSE3 capability, including Core Duo. These processors support SSE3 as well as MMX, SSE, and SSE2.
  Windows*: /QxP, /QaxP
  Linux*:   -xP, -axP
  Mac*:     Vectorization occurs by default

Compiler Based Vectorization: Automatic Processor Dispatch, ax[?]
Single executable
  Optimized for Intel® Core Duo processors, plus generic code that runs on all IA32 processors
  For each target processor it uses:
    Processor-specific instructions
    Vectorization
  Low overhead
  Some increase in code size
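
The idea of one executable carrying both a generic and a processor-specific code path can be sketched by hand with a function pointer. This is only an analogy: with the ax switches the compiler generates its own dispatch code, and nothing here reflects how it does so internally. All names are hypothetical, and the has_sse3 flag stands in for a real processor feature check.

```c
/* Generic path: plain C that runs on any processor. */
static void add_generic(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Stand-in for a processor-specific path. In a real build this would
   hold code tuned for the target; here it is just unrolled by two. */
static void add_tuned(const float *a, const float *b, float *c, int n)
{
    int i = 0;
    for (; i + 1 < n; i += 2) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
    }
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

typedef void (*add_fn)(const float *, const float *, float *, int);

/* has_sse3 is a hypothetical flag supplied by the caller. */
add_fn select_add(int has_sse3)
{
    return has_sse3 ? add_tuned : add_generic;
}
```

Both paths must produce the same results; the extra copy of the routine is where the "some increase in code size" comes from.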

Why Loops Don’t Vectorize
Independence: loop iterations generally must be independent.
Some relevant qualifiers:
  Some dependent loops can be vectorized.
  Most function calls cannot be vectorized.
  Some conditional branches prevent vectorization.
  Loops must be countable.
  Outer loop of nest cannot be vectorized.
  Mixed data types cannot be vectorized.

Why Didn’t My Loop Vectorize?
  Windows*     -Qvec_reportn
  Linux*       -vec_reportn
  Macintosh*   -vec_reportn
Set diagnostic level dumped to stdout:
  n=0: No diagnostic information
  n=1: (Default) Loops successfully vectorized
  n=2: Loops not vectorized, and the reason why not
  n=3: Adds dependency information
  n=4: Reports only non-vectorized loops
  n=5: Reports only non-vectorized loops and adds dependency info

Why Loops Don’t Vectorize
  “Existence of vector dependence”
  “Nonunit stride used”
  “Mixed Data Types”
  “Unsupported Loop Structure”
  “Contains unvectorizable statement at line XX”
There are more reasons loops don’t vectorize, but we will discuss the reasons above.

“Existence of Vector Dependency”
Usually indicates a real dependency between iterations of the loop, as shown here:

  for (i = 0; i < 100; i++)
    x[i] = A * x[i + 1];
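
A small sketch of why a cross-iteration dependency matters. It uses a backward reference, x[i - 1], chosen because it makes the effect easy to observe; it is a variant for illustration, not the loop from the slide. The second function emulates a two-wide vector step in scalar code by loading both inputs before storing either result. Both function names are hypothetical.

```c
/* Sequential order: each iteration reads the value the previous one wrote. */
void recur_sequential(double *x, double a, int n)
{
    int i;
    for (i = 1; i < n; i++)
        x[i] = a * x[i - 1];
}

/* Emulated two-wide vector step: both inputs are loaded before
   either result is stored, as a packed load and store would do. */
void recur_two_wide(double *x, double a, int n)
{
    int i;
    for (i = 1; i + 1 < n; i += 2) {
        double in0 = x[i - 1];
        double in1 = x[i];        /* stale: iteration i has not run yet */
        x[i]     = a * in0;
        x[i + 1] = a * in1;
    }
    for (; i < n; i++)            /* scalar remainder */
        x[i] = a * x[i - 1];
}
```

Starting from an array of five ones with a = 2, the sequential version ends with x[4] equal to 16, while the two-wide version ends with x[4] equal to 2. That difference is what the compiler guards against when it reports a vector dependence.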

Defining Loop Independence
Iteration Y of a loop is independent of when (or whether) iteration X occurs.

  int a[MAX], b[MAX];
  for (j=0;j<MAX;j++) {
    a[j] = b[j];
  }

“Nonunit stride used”

  for (I=0;I<=MAX;I++)
    for (J=0;J<=MAX;J++) {
      c[I][J]+=1;                // Unit stride
      c[J][I]+=1;                // Non-unit
      A[J*J]+=1;                 // Non-unit
      A[B[J]]+=1;                // Non-unit
      if (A[MAX-J]==1) last1=J;  // Non-unit
    }

End result: loading the vector may take more cycles than executing the operation sequentially.
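
A minimal sketch contrasting the two access patterns on a C two-dimensional array, which is stored row by row. Both functions compute the same sum; only the order in which memory is touched differs. The names and the DIM size are hypothetical.

```c
#define DIM 4   /* hypothetical matrix size */

/* Inner loop walks the last index: consecutive addresses (unit stride). */
long sum_unit_stride(int m[DIM][DIM])
{
    long s = 0;
    int i, j;
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            s += m[i][j];
    return s;
}

/* Inner loop walks the first index: each step jumps DIM ints ahead. */
long sum_strided(int m[DIM][DIM])
{
    long s = 0;
    int i, j;
    for (i = 0; i < DIM; i++)
        for (j = 0; j < DIM; j++)
            s += m[j][i];
    return s;
}
```

When the loop body allows it, swapping the loop order so the inner loop runs over the last index is one common way to turn a strided access into a unit-stride one.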

“Mixed Data Types”
An example:

  int howmany_close(double *x, double *y)
  {
    int withinborder=0;
    double dist;
    for(int i=0;i<MAX;i++) {
      dist=sqrt(x[i]*x[i] + y[i]*y[i]);
      if (dist<5) withinborder++;
    }
    return withinborder;
  }

Mixed data types are possible, but they complicate things, e.g. 2 doubles vs. 4 ints per SIMD register.
Some operations with specific data types won’t work.
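
One way to reduce the mismatch is to keep every value in the loop the same width. The sketch below is a variant under stated assumptions, not the slide's code: it switches the data to float so that it matches the 32-bit int counter, and it compares squared distances so no square-root call is needed. The function name, the MAX value, and both changes are illustrative. Whether a particular compiler version vectorizes it is not guaranteed.

```c
#define MAX 4   /* hypothetical array length */

/* Counts points whose distance from the origin is below 5.
   float data and the int counter are both 32 bits wide, so
   four of each fit in one 128-bit register. */
int howmany_close_f(const float *x, const float *y)
{
    int withinborder = 0;
    int i;
    for (i = 0; i < MAX; i++) {
        float dist2 = x[i] * x[i] + y[i] * y[i];   /* squared distance */
        if (dist2 < 25.0f)                         /* same test as dist < 5 */
            withinborder++;
    }
    return withinborder;
}
```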

“Unsupported Loop Structure”
Example:

  struct _xx { int data; int bound; };

  void doit1(int *a, struct _xx *x)
  {
    for (int i=0; i < x->bound; i++)
      a[i] = 0;
  }

An unsupported loop structure means the loop is not countable, or the compiler for whatever reason can’t construct a run-time expression for the trip count.
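
In the example, a store through a could in principle modify x->bound, since both are int objects reached through pointers, so the loop limit may change while the loop runs. A common remedy, sketched here under that assumption, is to copy the bound into a local variable before the loop. The name doit1_hoisted is hypothetical.

```c
struct _xx {
    int data;
    int bound;
};

/* The trip count is read once, before the loop starts, so stores
   through a can no longer affect the loop limit. */
void doit1_hoisted(int *a, const struct _xx *x)
{
    int n = x->bound;
    int i;
    for (i = 0; i < n; i++)
        a[i] = 0;
}
```

This changes behavior only if a really does overlap x->bound, which is exactly the case the compiler must otherwise assume is possible.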

“Contains unvectorizable statement”

  for (i=1;i<nx;i++) {
    B[i] = func(A[i]);
  }

[Figure: 128-bit registers holding A[1] A[0] and A[3] A[2]; func has to be applied to produce B[1] B[0] and B[3] B[2]]
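
A call to a function whose body the compiler cannot see is a typical unvectorizable statement. If the function is small and defined in the same file, the compiler has the chance to inline it, leaving plain arithmetic in the loop. The sketch below assumes a hypothetical body for func; the slides do not say what it computes, and inlining is at the compiler's discretion.

```c
/* Hypothetical stand-in for func: small and visible in this file. */
static double func(double v)
{
    return v * v + 1.0;
}

/* Same loop shape as the slide (starts at index 1). */
void apply_func(const double *A, double *B, int nx)
{
    int i;
    for (i = 1; i < nx; i++)
        B[i] = func(A[i]);
}
```

For functions defined in other files, interprocedural optimization across files (/Qipo) serves the same purpose by making cross-file inlining possible.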

Reference
  Web-based and classroom training
  White papers and technical notes
  Product support resources

Activity 1 - raytrace2: Initial Compilation
Set up environment and compile with both Microsoft* Visual C++ .NET (MSVC*) and Intel® C++ Compiler (icl)

Activity 2 - raytrace2: O3 Compilation
Use Intel compiler’s High Level Optimizer (-O3) for loop-centric codes

Activity 3 - raytrace2: IPO Compilation
Use Intel compiler’s Interprocedural Optimization (-Qipo)

Activity 4 - raytrace2: PGO Compilation
Use Intel compiler’s Profile-guided Optimization

Activity 5 - raytrace2: Vectorization
Use Intel compiler’s Vectorization optimization (-QxP)

Activity 6 - raytrace2: Putting it all together
Use all previous optimizations in tandem (-O3, -QxP, IPO and PGO)