Modifying Program Code to Take Advantage of Compiler Optimizations
www.intel.com/software/products

(Intel Confidential, IA64_Tools_Overview2.ppt)
*All other brands and names are the property of their respective owners.
Responsible Pointer Usage (Data Issues)

- Compiler alias analysis limits optimizations.
- The developer knows the application: tell the compiler!
- Avoid pointing to the same memory address with two different pointers.
- Use array notation when possible.
- Avoid pointer arithmetic if possible.
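The aliasing problem can be seen in a small C sketch (the function name `accumulate` is illustrative, not from the slides): when two pointer parameters may refer to the same memory, the compiler must conservatively re-read through one pointer after every store through the other.

```c
#include <assert.h>

/* If x and y may alias, the compiler must re-read y[0] on every
   iteration, because the store to x[i] might have changed it. */
void accumulate(int *x, const int *y, int n)
{
    for (int i = 0; i < n; i++)
        x[i] += y[0];
}
```

Calling `accumulate(a, a, n)` really does make the store to `x[0]` change `y[0]`, which is exactly why the compiler cannot assume the pointers are independent without help from the developer.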
Pointer Disambiguation (Data Issues)

- -Oa file.c (Windows) / -fno-alias file.c (Linux):
  all pointers in file.c are assumed not to alias.
- -Ow file.c (Windows; not yet on Linux):
  assume no aliasing within functions (i.e., pointer arguments are unique).
- -Qrestrict file.c (Windows) / -restrict (Linux):
  restrict qualifier: enables pointer disambiguation.
- -Za file.c (Windows) / -ansi (Linux):
  enforce strict ANSI compliance (requires that pointers to different
  data types are not aliased).
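The restrict qualifier that -Qrestrict/-restrict enables can also be written directly in C99 source. A minimal sketch (the function name `triad` is illustrative):

```c
#include <stddef.h>

/* restrict promises the compiler that a, b, and c do not overlap, so
   *c can be loaded once and hoisted out of the loop instead of being
   reloaded after every store to a[i]. */
void triad(float *restrict a, const float *restrict b,
           const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * (*c);
}
```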
High-Level Optimizations (HLO), Available at -O3

- Prefetching
- Loop interchange
- Unrolling
- Cache blocking
- Unroll-and-jam
- Scalar replacement
- Redundant zero-trip elimination
- Data dependence analysis
- Reuse analysis
- Loop recovery
- Canonical expressions
- Loop fusion
- Loop distribution
- Loop reversal
- Loop skewing
- Loop peeling
- Scalar expansion
- Register blocking
Data Prefetching (HLO)

Before:

  for i = 1, M
    for j = 1, N
      A[j, i] = B[0, j] + B[0, j+1]
    end_for
  end_for

After selective prefetching:

  for i = 1, M
    for j = 1, N
      A[j, i] = B[0, j] + B[0, j+1]
      if (mod(j,8) == 0) lfetch.nta(A[j+d, i])
      if (i == 1) lfetch.nt1(B[0, j+d])
    end_for
  end_for

- Adds prefetch instructions using selective prefetching.
- Works for arrays, pointers, C structures, and C/C++ parameters.
- Goal: issue one prefetch instruction per cache line.
  - Itanium cache lines:   L1: 32B,  L2: 64B,  L3: 64B
  - Itanium 2 cache lines: L1: 64B,  L2: 128B, L3: 128B
- -O3 does this for you: "Let the compiler do the work!"
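The same idea can be sketched by hand in C using the GCC/ICC intrinsic `__builtin_prefetch`. The cache-line size (64 bytes = 16 ints) and prefetch distance here are illustrative assumptions, and real hardware prefetch instructions do not fault on addresses past the array:

```c
/* Manual selective prefetching: issue one prefetch per cache line,
   a few lines ahead of the current position.  The arguments (addr, 0, 0)
   request a read prefetch with no temporal locality, analogous to the
   lfetch.nta hint on Itanium. */
#define LINE_INTS 16            /* assumed 64-byte line / 4-byte int */
#define DIST (4 * LINE_INTS)    /* assumed prefetch distance: 4 lines */

long sum_prefetched(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i % LINE_INTS == 0)                  /* once per cache line */
            __builtin_prefetch(&a[i + DIST], 0, 0);
        s += a[i];
    }
    return s;
}
```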
Loop Interchange (HLO)

  for (i = 0; i < NUM; i++) {
    for (j = 0; j < NUM; j++) {
      for (k = 0; k < NUM; k++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }

- Note: the c[i][j] term is constant in the inner loop.
- k is the fast inner-loop index, but j is the consecutive-memory index of b.
- Interchange to allow unit-stride memory access.

Demo / Lab: Matrix with Loop Interchange, -O2
Unit-Stride Memory Access (C/C++ example; Fortran is the opposite)

[Figure: memory layouts of arrays a and b.
 Array a, accessed as a[i][k]: incrementing k walks consecutive memory
 elements (unit-stride data access).
 Array b, accessed as b[k][j]: incrementing k skips a whole row per step
 (non-unit-stride data access).]
Loop After Interchange (HLO)

  for (i = 0; i < NUM; i++) {
    for (k = 0; k < NUM; k++) {
      for (j = 0; j < NUM; j++) {
        c[i][j] = c[i][j] + a[i][k] * b[k][j];
      }
    }
  }

- Note: the a[i][k] term is constant in the inner loop.
- Two loads, one store, one FMA per iteration: F/M = 0.33, unit stride.

Demo / Lab: Matrix with Loop Interchange, -O3
Unit-Stride Memory Access (C/C++)

[Figure: after interchange, all data accesses are unit-stride.
 j, the fastest incremented index, walks consecutive memory in both
 c[i][j] and b[k][j]; k, the next-fastest loop index, walks consecutive
 memory in a[i][k].]
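The interchange can be checked with a small runnable sketch (the size `N` and function names are illustrative): both loop orders compute the same product, while the ikj order makes every inner-loop access unit-stride.

```c
#include <string.h>

#define N 8

/* ijk order: b[k][j] is accessed with a non-unit stride in the
   inner k loop. */
void matmul_ijk(double c[N][N], double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* ikj order after interchange: c[i][j] and b[k][j] are both walked
   consecutively by j, and a[i][k] is invariant in the inner loop. */
void matmul_ikj(double c[N][N], double a[N][N], double b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```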
Loop Unrolling (HLO)

Original (N = 1025, M = 5):

  DO I = 1, N
    DO J = 1, M
      A(J,I) = B(J,I) + C(J,I) * D
    ENDDO
  ENDDO

Outer loop unrolled by 4, with a preconditioning loop for the remainder:

  II = MOD(N, 4)
  DO I = 1, II                  ! preconditioning loop
    DO J = 1, M
      A(J,I) = B(J,I) + C(J,I) * D
    ENDDO
  ENDDO
  DO I = II+1, N, 4             ! main loop, unrolled by 4
    DO J = 1, M
      A(J,I)   = B(J,I)   + C(J,I)   * D
      A(J,I+1) = B(J,I+1) + C(J,I+1) * D
      A(J,I+2) = B(J,I+2) + C(J,I+2) * D
      A(J,I+3) = B(J,I+3) + C(J,I+3) * D
    ENDDO
  ENDDO

- Unroll the largest loops.
- If the loop size is known at compile time, the preconditioning loop can
  be eliminated by choosing the number of times to unroll.

Demo / Lab: Matrix with Loop Unrolling by 2
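The same transformation written as a C sketch (the name `saxpy_like` is illustrative): a remainder loop handles the n mod 4 leftover elements first, then the main loop runs unrolled by four.

```c
/* Preconditioning (remainder) loop first, then the unrolled main loop,
   with d playing the role of the scalar multiplier D. */
void saxpy_like(float *a, const float *b, const float *c, float d, int n)
{
    int i;
    int rem = n % 4;
    for (i = 0; i < rem; i++)          /* preconditioning loop */
        a[i] = b[i] + c[i] * d;
    for (; i < n; i += 4) {            /* main loop, unrolled by 4 */
        a[i]     = b[i]     + c[i]     * d;
        a[i + 1] = b[i + 1] + c[i + 1] * d;
        a[i + 2] = b[i + 2] + c[i + 2] * d;
        a[i + 3] = b[i + 3] + c[i + 3] * d;
    }
}
```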
Loop Unrolling: Candidates

- If the trip count is low and known at compile time, it may make sense
  to fully unroll.
- Poor candidates (similar issues for software pipelining or the vectorizer):
  - Low-trip-count loops, e.g. for (j = 0; j < N; j++) with N = 4 at runtime.
  - "Fat" loops whose bodies already do lots of computation.
  - Loops containing procedure calls.
  - Loops with branches.
Loop Unrolling: Benefits and Costs

Benefits:
- Performs more computations per loop iteration.
- Reduces the effect of loop overhead.
- Can increase the floating-point-to-memory-access ratio (F/M).

Costs:
- Register pressure.
- Code bloat.
Loop Unrolling: Example

  for (i = 0; i < NUM; i = i + 2) {
    for (k = 0; k < NUM; k = k + 2) {
      for (j = 0; j < NUM; j++) {
        c[i][j]   = c[i][j]   + a[i][k]     * b[k][j];
        c[i+1][j] = c[i+1][j] + a[i+1][k]   * b[k][j];
        c[i][j]   = c[i][j]   + a[i][k+1]   * b[k+1][j];
        c[i+1][j] = c[i+1][j] + a[i+1][k+1] * b[k+1][j];
      }
    }
  }

- The a[i][k] terms are loop invariant in the inner loop.
- Unrolling all loops by 4 results in (per iteration): 32 loads, 16 stores,
  64 FMAs: F/M = 1.33.

Demo / Lab: Matrix with Loop Unrolling by 4
Cache Blocking (HLO)

Before:

  for i = 1, 1000
    for j = 1, 1000
      for k = 1, 1000
        A[i, j, k] = A[i, j, k] + B[i, k, j]
      end_for
    end_for
  end_for

After blocking with 20 x 20 tiles:

  for v = 1, 1000, 20
    for u = 1, 1000, 20
      for k = v, v+19
        for j = u, u+19
          for i = 1, 1000
            A[i, j, k] = A[i, j, k] + B[i, k, j]
          end_for
        end_for
      end_for
    end_for
  end_for

- Use when all arrays in the loop do not fit in cache.
- Effective for huge out-of-core memory applications.
- Effective for large out-of-cache applications.
- Work on "neighborhoods" of data and keep these neighborhoods in cache.
- Helps reduce TLB and cache misses.
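A runnable C sketch of the same blocking idea on a 2-D case (the size `DIM`, block factor `BLK`, and function name are illustrative assumptions): each tile of a and the matching transposed tile of b stay cache-resident while the two inner loops run.

```c
#define DIM 64
#define BLK 16

/* Blocked transpose-add: instead of streaming across entire rows of b
   column-wise, process BLK x BLK neighborhoods so the touched parts of
   both arrays fit in cache together. */
void add_transpose_blocked(double a[DIM][DIM], double b[DIM][DIM])
{
    for (int jj = 0; jj < DIM; jj += BLK)
        for (int ii = 0; ii < DIM; ii += BLK)
            for (int j = jj; j < jj + BLK; j++)
                for (int i = ii; i < ii + BLK; i++)
                    a[i][j] += b[j][i];
}
```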