Modifying Program Code to Take Advantage of Compiler Optimizations
Responsible Pointer Usage
- Compiler alias analysis limits optimizations
- The developer knows the application - tell the compiler!
- Avoid pointing to the same memory address with two different pointers
- Use array notation when possible
- Avoid pointer arithmetic if possible
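As a minimal illustration of why aliasing limits optimization, the following C sketch (the function and array names are invented for this example, not taken from the labs) contrasts a pointer-based loop the compiler must treat conservatively with an array-notation version it can optimize freely:

    /* Pointer version: the compiler must assume dst and src may overlap,
     * so it cannot freely reorder, unroll, or software-pipeline the loop. */
    void scale_ptr(float *dst, float *src, int n, float f)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i] * f;      /* dst and src might alias */
    }

    /* Array notation on distinct file-scope arrays: no aliasing is
     * possible, so the compiler can schedule the loop aggressively. */
    #define N 1024
    static float a[N], b[N];

    void scale_array(float f)
    {
        for (int i = 0; i < N; i++)
            a[i] = b[i] * f;          /* a and b are provably distinct */
    }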
Pointer Disambiguation
- -Oa file.c (Windows) / -fno-alias file.c (Linux): all pointers in file.c are assumed not to alias
- -Ow file.c (Windows), not (yet) available on Linux: assume no aliasing within functions (i.e., pointer arguments are unique)
- -Qrestrict file.c (Windows) / -restrict (Linux): restrict qualifier, enables pointer disambiguation
- -Za file.c (Windows) / -ansi (Linux): enforce strict ANSI compliance (requires that pointers to different data types are not aliased)
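A minimal sketch of the restrict qualifier named above, assuming the keyword is enabled with -Qrestrict (Windows) or -restrict (Linux); the function below is an invented example:

    /* restrict promises the compiler that, within this function, memory
     * reached through out is never reached through in1 or in2.  With that
     * guarantee the loads of in1[i] and in2[i] can be hoisted, pipelined,
     * or vectorized without run-time overlap checks. */
    void vadd(float *restrict out,
              const float *restrict in1,
              const float *restrict in2,
              int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in1[i] + in2[i];
    }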
High Level Optimizations Available at -O3
- Prefetch
- Loop interchange
- Unrolling
- Cache blocking
- Unroll-and-jam
- Scalar replacement
- Redundant zero-trip elimination
- Data dependence analysis
- Reuse analysis
- Loop recovery
- Canonical expressions
- Loop fusion
- Loop distribution
- Loop reversal
- Loop skewing
- Loop peeling
- Scalar expansion
- Register blocking
Data Prefetching

Original loop:

    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
      end_for
    end_for

With selective prefetching inserted:

    for i = 1, M
      for j = 1, N
        A[j, i] = B[0, j] + B[0, j+1]
        if (mod(j, 8) == 0) lfetch.nta(A[j+d, i])
        if (i == 1)         lfetch.nt1(B[0, j+d])
      end_for
    end_for

- Adds prefetching instructions using selective prefetching
- Works for arrays, pointers, C structures, and C/C++ parameters
- Goal: issue one prefetch instruction per cache line
  - Itanium cache lines: L1 32B, L2 64B, L3 64B
  - Itanium 2 cache lines: L1 64B, L2 128B, L3 128B
- -O3 does this for you: "Let the compiler do the work!"
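For cases where prefetches are inserted by hand rather than by the compiler, the same pattern can be sketched in C. This is a minimal example using the __builtin_prefetch hint available in GCC-compatible compilers (not the Itanium lfetch instruction shown above); the prefetch distance AHEAD and the every-16th-iteration stride are illustrative assumptions:

    #include <stddef.h>

    #define AHEAD 64   /* prefetch distance in elements: an assumption */

    void copy_add(double *a, const double *b, size_t n)
    {
        for (size_t i = 0; i + 1 < n; i++) {
            /* Aim for roughly one prefetch per cache line: with 8-byte
             * doubles and a 128-byte line (Itanium 2 L2/L3), every 16th
             * iteration starts a new line. */
            if ((i & 15) == 0 && i + AHEAD < n)
                __builtin_prefetch(&b[i + AHEAD], /* rw= */ 0, /* locality= */ 0);
            a[i] = b[i] + b[i + 1];
        }
    }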
Loop Interchange

    for (i = 0; i < NUM; i++) {
        for (j = 0; j < NUM; j++) {
            for (k = 0; k < NUM; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

- Note: the c[i][j] term is constant in the inner loop
- The fast inner loop index k is not the consecutive memory index of b[k][j], so the inner loop does not access memory with unit stride
- Interchange the loops to allow unit-stride memory access

Demo / Lab: Matrix with Loop Interchange, -O2
Unit-Stride Memory Access (C/C++ example; Fortran is the opposite)

[Figure: element layout of arrays a and b for the original i, j, k loop order. For a[i][k], incrementing k walks consecutive memory elements (unit-stride access); for b[k][j], incrementing k jumps between rows, so it does not touch consecutive memory elements (non-unit-stride access).]
Loop After Interchange

    for (i = 0; i < NUM; i++) {
        for (k = 0; k < NUM; k++) {
            for (j = 0; j < NUM; j++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

- Note: the a[i][k] term is constant in the inner loop
- Two loads, one store, one FMA per iteration: F/M = 0.33, unit stride

Demo / Lab: Matrix with Loop Interchange, -O3
Unit-Stride Memory Access After Interchange (C/C++)

[Figure: element layout of arrays a and b for the interchanged i, k, j loop order. The fastest incremented index j gives consecutive memory access for b[k][j] (and for c[i][j]), and the next fastest index k walks a[i][k] consecutively, so all data access is unit-stride.]
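A self-contained sketch of the kind of kernel the interchange labs above exercise (the matrix size NUM, the array names, and the main harness are illustrative assumptions, not the actual lab source):

    #include <stdio.h>

    #define NUM 512                     /* illustrative size, not the lab's */

    static double a[NUM][NUM], b[NUM][NUM], c[NUM][NUM];

    /* Original i-j-k order: b[k][j] is accessed with a large stride. */
    void matmul_ijk(void)
    {
        for (int i = 0; i < NUM; i++)
            for (int j = 0; j < NUM; j++)
                for (int k = 0; k < NUM; k++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* Interchanged i-k-j order: the inner index j walks c and b
     * contiguously, giving unit-stride access throughout. */
    void matmul_ikj(void)
    {
        for (int i = 0; i < NUM; i++)
            for (int k = 0; k < NUM; k++)
                for (int j = 0; j < NUM; j++)
                    c[i][j] += a[i][k] * b[k][j];
    }

    int main(void)
    {
        for (int i = 0; i < NUM; i++)
            for (int j = 0; j < NUM; j++) {
                a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
            }
        matmul_ikj();                   /* swap in matmul_ijk() to compare */
        printf("c[0][0] = %f\n", c[0][0]);
        return 0;
    }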
Loop Unrolling

Original loop (N = 1025, M = 5):

    DO I = 1, N
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO

Outer loop unrolled by 4, with a preconditioning loop for the leftover iterations:

    II = MOD(N, 4)
    DO I = 1, II
      DO J = 1, M
        A(J,I) = B(J,I) + C(J,I) * D
      ENDDO
    ENDDO
    DO I = II+1, N, 4
      DO J = 1, M
        A(J,I)   = B(J,I)   + C(J,I)   * D
        A(J,I+1) = B(J,I+1) + C(J,I+1) * D
        A(J,I+2) = B(J,I+2) + C(J,I+2) * D
        A(J,I+3) = B(J,I+3) + C(J,I+3) * D
      ENDDO
    ENDDO

- Unroll the largest loops
- If the loop size is known, the preconditioning loop can be eliminated by choosing the number of times to unroll

Demo / Lab: Matrix with Loop Unrolling by 2
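The same pattern in C, as a minimal sketch (the unroll factor of 4 follows the Fortran example above; the function and array names are invented, and the remainder loop plays the role of the preconditioning loop):

    /* Unroll-by-4 with a remainder ("preconditioning") loop.
     * d is a loop-invariant scalar. */
    void add_scaled(float *a, const float *b, const float *c,
                    float d, int n)
    {
        int i;
        int rem = n % 4;

        /* Handle the leftover iterations first... */
        for (i = 0; i < rem; i++)
            a[i] = b[i] + c[i] * d;

        /* ...then run the unrolled main loop on an exact multiple of 4. */
        for (; i < n; i += 4) {
            a[i]     = b[i]     + c[i]     * d;
            a[i + 1] = b[i + 1] + c[i + 1] * d;
            a[i + 2] = b[i + 2] + c[i + 2] * d;
            a[i + 3] = b[i + 3] + c[i + 3] * d;
        }
    }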
Loop Unrolling - Candidates
- If the trip count is low and known at compile time, it may make sense to fully unroll
- Poor candidates (similar issues apply to software pipelining and the vectorizer):
  - Low trip count loops, e.g. for (j = 0; j < N; j++) with N = 4 at runtime
  - Fat loops: the loop body already has lots of computation taking place
  - Loops containing procedure calls
  - Loops with branches
Loop Unrolling - Benefits and Costs
- Benefits:
  - Performs more computations per loop iteration
  - Reduces the effect of loop overhead
  - Can increase the floating-point to memory access ratio (F/M)
- Costs:
  - Register pressure
  - Code bloat
Loop Unrolling - Example

    for (i = 0; i < NUM; i = i + 2) {
        for (k = 0; k < NUM; k = k + 2) {
            for (j = 0; j < NUM; j++) {
                c[i][j]   = c[i][j]   + a[i][k]     * b[k][j];
                c[i+1][j] = c[i+1][j] + a[i+1][k]   * b[k][j];
                c[i][j]   = c[i][j]   + a[i][k+1]   * b[k+1][j];
                c[i+1][j] = c[i+1][j] + a[i+1][k+1] * b[k+1][j];
            }
        }
    }

- The a[...] terms are loop-invariant in the inner loop
- With all loops unrolled by 4, each iteration performs 32 loads, 16 stores, and 64 FMAs: F/M = 1.33

Demo / Lab: Matrix with Loop Unrolling by 4
Cache Blocking

Original loop:

    for i = 1, 1000
      for j = 1, 1000
        for k = 1, 1000
          A[i, j, k] = A[i, j, k] + B[i, k, j]
        end_for
      end_for
    end_for

Blocked loop:

    for v = 1, 1000, 20
      for u = 1, 1000, 20
        for k = v, v+19
          for j = u, u+19
            for i = 1, 1000
              A[i, j, k] = A[i, j, k] + B[i, k, j]
            end_for
          end_for
        end_for
      end_for
    end_for

- Use when all arrays in the loop do not fit in cache
- Effective for huge out-of-core memory applications
- Effective for large out-of-cache applications
- Work on "neighborhoods" of data and keep these neighborhoods in cache
- Helps reduce TLB and cache misses
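A minimal C sketch of the same blocking idea on a 2-D problem (the size N, the block size BLK, and the transpose-add kernel are illustrative assumptions standing in for the 3-D example above):

    #define N   1024    /* illustrative problem size */
    #define BLK 64      /* illustrative block ("neighborhood") size; N % BLK == 0 */

    static double A[N][N], B[N][N];

    /* Blocked version of A[i][j] += B[j][i]: each (ii, jj) tile of A and
     * the corresponding tile of B are small enough to stay resident in
     * cache while they are being used, reducing cache and TLB misses. */
    void transpose_add_blocked(void)
    {
        for (int ii = 0; ii < N; ii += BLK)
            for (int jj = 0; jj < N; jj += BLK)
                for (int i = ii; i < ii + BLK; i++)
                    for (int j = jj; j < jj + BLK; j++)
                        A[i][j] += B[j][i];
    }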