
Modifying Program Code to Take Advantage of Compiler Optimizations
www.intel.com/software/products
*All other brands and names are the property of their respective owners. Intel Confidential. IA64_Tools_Overview2.ppt


Slide 3: Responsible Pointer Usage (Data Issues)
- Compiler alias analysis limits optimizations
- The developer knows the app – tell the compiler!
- Avoid pointing to the same memory address with two different pointers
- Use array notation when possible
- Avoid pointer arithmetic if possible

Slide 4: Pointer Disambiguation (Data Issues)
- -Oa file.c (Windows) / -fno-alias file.c (Linux): all pointers in file.c are assumed not to alias
- -Ow file.c (Windows; not yet available on Linux): assume no aliasing within functions (i.e., pointer arguments are unique)
- -Qrestrict file.c (Windows) / -restrict (Linux): enables the restrict qualifier for pointer disambiguation
- -Za file.c (Windows) / -ansi (Linux): enforce strict ANSI compliance (requires that pointers to different data types are not aliased)

Slide 5: High-Level Optimizations Available at -O3 (HLO)
- Prefetch
- Loop interchange
- Unrolling
- Cache blocking
- Unroll-and-jam
- Scalar replacement
- Redundant zero-trip elimination
- Data dependence analysis
- Reuse analysis
- Loop recovery
- Canonical expressions
- Loop fusion
- Loop distribution
- Loop reversal
- Loop skewing
- Loop peeling
- Scalar expansion
- Register blocking

Slide 6: Data Prefetching (HLO)

Original loop:
  for i = 1, M
    for j = 1, N
      A[j, i] = B[0, j] + B[0, j+1]
    end_for
  end_for

With selective prefetching (d = prefetch distance):
  for i = 1, M
    for j = 1, N
      A[j, i] = B[0, j] + B[0, j+1]
      if (mod(j, 8) == 0) lfetch.nta(A[j+d, i])
      if (i == 1) lfetch.nt1(B[0, j+d])
    end_for
  end_for

- Adds prefetch instructions using selective prefetching; works for arrays, pointers, C structures, and C/C++ parameters
- Goal: issue one prefetch instruction per cache line
- Itanium cache lines: L1 32B, L2 64B, L3 64B
- Itanium 2 cache lines: L1 64B, L2 128B, L3 128B
- -O3 does this for you: "Let the compiler do the work!"

Slide 7: Loop Interchange (Demo/Lab: Matrix with Loop Interchange, -O2)

  for (i = 0; i < NUM; i++) {
    for (j = 0; j < NUM; j++) {
      for (k = 0; k < NUM; k++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }

- Note: the c[i][j] term is constant in the inner loop
- k, the fast inner-loop index, does not walk consecutive memory in b[k][j]
- Interchange the loops to allow unit-stride memory access

Slide 8: Unit-Stride Memory Access (C/C++ example; Fortran is the opposite)
[Diagram: row-major layouts of arrays a and b. Incrementing k in a[i][k] gets consecutive memory elements (unit-strided access); incrementing k in b[k][j] gets non-consecutive memory elements, jumping a whole row at a time (non-unit-strided access).]

Slide 9: Loop After Interchange (Demo/Lab: Matrix with Loop Interchange, -O3)

  for (i = 0; i < NUM; i++) {
    for (k = 0; k < NUM; k++) {
      for (j = 0; j < NUM; j++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }

- Note: the a[i][k] term is constant in the inner loop
- Two loads, one store, one FMA per iteration: F/M = 0.33, with unit-stride access

Slide 10: Unit-Stride Memory Access After Interchange (C/C++)
[Diagram: with the i-k-j ordering, all data accesses are unit-strided. Incrementing j, the fastest inner-loop index, walks consecutive memory elements of b[k][j]; incrementing k, the next-fastest index, walks consecutive memory elements of a[i][k].]

Slide 11: Loop Unrolling (Demo/Lab: Matrix with Loop Unrolling by 2)

Original (N = 1025, M = 5):
  DO I = 1, N
    DO J = 1, M
      A(J,I) = B(J,I) + C(J,I) * D
    ENDDO
  ENDDO

Outer loop unrolled by 4:
  II = MOD(N, 4)
  ! Preconditioning loop handles the leftover iterations
  DO I = 1, II
    DO J = 1, M
      A(J,I) = B(J,I) + C(J,I) * D
    ENDDO
  ENDDO
  DO I = II+1, N, 4
    DO J = 1, M
      A(J,I)   = B(J,I)   + C(J,I)   * D
      A(J,I+1) = B(J,I+1) + C(J,I+1) * D
      A(J,I+2) = B(J,I+2) + C(J,I+2) * D
      A(J,I+3) = B(J,I+3) + C(J,I+3) * D
    ENDDO
  ENDDO

- Unroll the outer loop by 4; unroll the largest loops
- If the loop size is known at compile time, the preconditioning loop can be eliminated by choosing an unroll factor that divides the trip count

Slide 12: Loop Unrolling – Candidates
- If the trip count is low and known at compile time, it may make sense to fully unroll
- Poor candidates (similar issues apply to software pipelining and the vectorizer):
  - Low-trip-count loops, e.g. for (j = 0; j < N; j++) with N = 4 at runtime
  - Fat loops: the loop body already has lots of computation taking place
  - Loops containing procedure calls
  - Loops with branches

Slide 13: Loop Unrolling – Benefits and Costs
- Benefits:
  - Performs more computations per loop iteration
  - Reduces the effect of loop overhead
  - Can increase the floating-point-to-memory-access ratio (F/M)
- Costs:
  - Register pressure
  - Code bloat

Slide 14: Loop Unrolling – Example (Demo/Lab: Matrix with Loop Unrolling by 4)

  for (i = 0; i < NUM; i += 2) {
    for (k = 0; k < NUM; k += 2) {
      for (j = 0; j < NUM; j++) {
        c[i][j]   = c[i][j]   + a[i][k]     * b[k][j];
        c[i+1][j] = c[i+1][j] + a[i+1][k]   * b[k][j];
        c[i][j]   = c[i][j]   + a[i][k+1]   * b[k+1][j];
        c[i+1][j] = c[i+1][j] + a[i+1][k+1] * b[k+1][j];
      }
    }
  }

- The a[i][k] terms are loop-invariant in the inner loop
- All loops unrolled by 4 results in (per iteration) 32 loads, 16 stores, 64 FMAs: F/M = 1.33

Slide 15: Cache Blocking (HLO)

Original:
  for i = 1, 1000
    for j = 1, 1000
      for k = 1, 1000
        A[i, j, k] = A[i, j, k] + B[i, k, j]
      end_for
    end_for
  end_for

Blocked into 20x20 tiles:
  for v = 1, 1000, 20
    for u = 1, 1000, 20
      for k = v, v+19
        for j = u, u+19
          for i = 1, 1000
            A[i, j, k] = A[i, j, k] + B[i, k, j]
          end_for
        end_for
      end_for
    end_for
  end_for

- Use when all the arrays in the loop do not fit in cache
- Effective for huge out-of-core memory applications and large out-of-cache applications
- Work on "neighborhoods" of data and keep those neighborhoods in cache
- Helps reduce TLB and cache misses

