Published by Kevin Flynn. Modified over 9 years ago.
1
Carnegie Mellon
Lecture 15: Loop Transformations
Chapter 11.10-11.11
Dror E. Maydan, CS243: Loop Optimization and Array Analysis
2
Loop Optimization
Domain
– Loops: change the order in which we iterate through loops
Goals
– Minimize inner loop dependences that inhibit software pipelining
– Minimize loads and stores
– Parallelism
  » SIMD vector today; in general, multiprocessor as well
– Minimize cache misses
– Minimize register spilling
Tools
– Loop interchange
– Fusion
– Fission
– Outer loop unrolling
– Cache tiling
– Vectorization
Algorithm for putting it all together
3
Loop Interchange
    for i = 1 to n
      for j = 1 to n
        A[j][i] = A[j-1][i-1] * b[i]
Should I interchange the two loops?
    for j = 1 to n
      for i = 1 to n
        A[j][i] = A[j-1][i-1] * b[i]
– Stride-1 accesses are better for caches
– But one more load in the inner loop
– But one fewer register needed to hold the result of the loop
4
Loop Interchange
    for i = 1 to n
      for j = 1 to n
        A[j][i] = A[j+1][i-1] * b[i]
Distance vector is (delta_i, delta_j) = (1, -1)
Direction vector is (>, <), where ">" marks a positive distance component and "<" a negative one
A dependence says that one reference, the write a_w, must happen before another, the read a_r
To permute loops, permute the direction vectors in the same manner
– Permutation is legal iff all permuted direction vectors are lexicographically positive
Special case: fully permutable loop nest
– Every dependence is either "carried" by a loop outside of the nest or has all components ">" or "="
– All the loops in the nest can be arbitrarily permuted
– (>, >, <): inner two loops are fully permutable
– (>=, =, >): all three loops are fully permutable
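The legality rule above can be sketched as a small check. This is only an illustration under stated assumptions: the names `permutation_is_legal`, `DIR_LT`, `DIR_EQ` and `DIR_GT` are hypothetical, each component is exactly one of ">", "=", "<" (no ">=" or "*"), and a vector of all "=" is treated as legal because permuting loops does not reorder statements within one iteration.

```c
#include <stdbool.h>
#include <stddef.h>

/* One direction-vector component, using the slide's convention:
   '>' is a positive distance, '<' a negative one. */
enum { DIR_LT = -1, DIR_EQ = 0, DIR_GT = 1 };

/* deps holds ndeps vectors of `depth` components each, row-major.
   perm[p] is the original loop index placed at position p (0 = outermost). */
bool permutation_is_legal(const int *deps, size_t ndeps, size_t depth,
                          const size_t *perm) {
    for (size_t d = 0; d < ndeps; d++) {
        for (size_t p = 0; p < depth; p++) {
            int c = deps[d * depth + perm[p]];
            if (c == DIR_GT) break;         /* first non-'=' is '>': still positive */
            if (c == DIR_LT) return false;  /* first non-'=' is '<': order reversed */
            /* DIR_EQ: keep scanning toward the inner loops */
        }
    }
    return true;
}
```

For the (>, <) dependence on this slide, the identity permutation passes and swapping the two loops fails, since the permuted vector (<, >) is lexicographically negative.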
5
Loop Interchange
    for i = 1 to n
      for j = 1 to i
        …
How do I interchange?
    for j = 1 to n
      for i = j to n
– In general ugly but doable
[Figure: triangular iteration space with axes i and j]
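A quick way to convince yourself that the interchanged bounds are right is to count visits to each (i, j) point in both orders. The sketch below assumes 1-based loops as on the slide and a caller-provided, zero-initialized counter array of (n+1)*(n+1) ints; the function names are hypothetical.

```c
#include <stddef.h>

/* Original order: i outer, j runs from 1 to i. */
void triangle_i_outer(int n, int *count) {
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= i; j++)
            count[(size_t)i * (n + 1) + j]++;
}

/* Interchanged: j outer. The inner bounds come from solving
   1 <= j <= i <= n for i, which gives i = j .. n. */
void triangle_j_outer(int n, int *count) {
    for (int j = 1; j <= n; j++)
        for (int i = j; i <= n; i++)
            count[(size_t)i * (n + 1) + j]++;
}
```

Both functions should touch exactly the same n(n+1)/2 points, each once; only the order differs.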
6
Non-Perfectly Nested Loops
    for i = 1 to n
      for j = 1 to n
        S1
      for j = 1 to n
        S2
– Can't always interchange
– Can be expensive when you can
7
Loop Fusion
    for i = 1 to n
      for j = 1 to n
        S1
      for j = 1 to n
        S2
becomes
    for i = 1 to n
      for j = 1 to n
        S1
        S2
Moving S2 across "j" iterations but not across any of the "i" iterations
Pretend to fuse, then check the direction vectors of the fused nest
Legal as long as there is no direction vector from S2 to S1 with "=" in all the outer loops and ">" in one of the inner loops: (=, =, …, =, >, …)
– That would imply that S2 is now before S1
8
Loop Fusion
    for i = 1 to n
      for j = 1 to n
        a[i][j] = …
      for j = 1 to n
        … = a[i][j+1]
Legal as long as there is no direction vector from the read to the write with "=" in all the outer loops and ">" in one of the inner loops: (=, =, …, =, >, …)
– Here the vector is (=, 1), so we can't fuse
9
Loop Fusion
    for i = 1 to n
      for j = 1 to n
        a[i][j] = …
      for j = 1 to n
        … = a[i][j+1]
If the first positive (">") distance is always a small literal constant, can skew the loop and allow fusion
Step 1: peel the first write and the last read
    for i = 1 to n
      a[i][1] = …
      for j = 2 to n
        a[i][j] = …
      for j = 1 to n-1
        … = a[i][j+1]
      … = a[i][n+1]
Step 2: shift the read loop by one
    for i = 1 to n
      a[i][1] = …
      for j = 2 to n
        a[i][j] = …
      for j = 2 to n
        … = a[i][j]
      … = a[i][n+1]
Step 3: fuse
    for i = 1 to n
      a[i][1] = …
      for j = 2 to n {
        a[i][j] = …
        … = a[i][j]
      }
      … = a[i][n+1]
Bonus: can get rid of a load and maybe a store
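The difference between naive fusion and the peel-shift-fuse sequence can be checked on a single row. This is a sketch under assumptions not in the slides: the i loop is dropped, the elided right-hand side is replaced by `10 * j`, the elided read target is an `out` array, and n >= 1. Array `a` needs n+2 slots and `out` needs n+1 slots (index 0 unused).

```c
/* Original, unfused form: all writes, then all reads. */
void unfused(int n, int *a, int *out) {
    for (int j = 1; j <= n; j++) a[j] = 10 * j;
    for (int j = 1; j <= n; j++) out[j] = a[j + 1];
}

/* Naive fusion: illegal, because iteration j reads a[j+1]
   before iteration j+1 has written it. */
void fused_naive(int n, int *a, int *out) {
    for (int j = 1; j <= n; j++) {
        a[j] = 10 * j;
        out[j] = a[j + 1];
    }
}

/* Peel the first write and last read, shift the read loop by one, then fuse.
   The value just stored is reused, which is the "get rid of a load" bonus. */
void fused_skewed(int n, int *a, int *out) {
    a[1] = 10;
    for (int j = 2; j <= n; j++) {
        a[j] = 10 * j;
        out[j - 1] = a[j];
    }
    out[n] = a[n + 1];
}
```

With identical inputs, `fused_skewed` should reproduce the output of `unfused`, while `fused_naive` returns the stale contents of `a`.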
10
Loop Fission
    for i = 1 to n
      for j = 1 to n
        S1
      for j = 1 to n
        S2
becomes
    for i = 1 to n
      for j = 1 to n
        S1
    for i = 1 to n
      for j = 1 to n
        S2
Moving S2 across all later "i" iterations
– Legal as long as there are no dependences from S2 to S1 with ">" in the fissioned outer loops
11
Loop Fission
    for i = 1 to n
      for j = 1 to n
        … = a[i-1][j]
      for j = 1 to n
        a[i][j] = …
would become
    for i = 1 to n
      for j = 1 to n
        … = a[i-1][j]
    for i = 1 to n
      for j = 1 to n
        a[i][j] = …
Moving S2 across all later "i" iterations
– Legal as long as there are no dependences from S2 to S1 with ">" in the fissioned outer loops
– Here there is a dependence from the write (S2) to the read (S1) with distance (1) in i, so this fission is not legal
12
Inner Loop Fission
    for i = 1 to n
      for j = 1 to n
        … = h[i];
        … = h[i+1];
        …
        … = h[i+49];
        … = h[i+50];
becomes
    for i = 1 to n
      for j = 1 to n
        … = h[i];
        …
        … = h[i+25];
      for j = 1 to n
        … = h[i+26];
        …
        … = h[i+50];
Legal as long as there is no dependence from an S2 to an S1 where the first ">" is in the "j" loop.
13
Inner Loop Fission
    for j = 1 to n
      S1
      S2
      S3
[Figure: dependence graph over S1, S2, S3 with edges labeled =, =, >]
Looking at edges carried by the innermost loops
– Strongly connected components cannot be fissioned
– Everything else can be fissioned as long as loops are emitted in topological order
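Grouping statements into strongly connected components can be sketched with a transitive closure, which is adequate for the handful of statements in a loop body. Everything here is an assumption for illustration: the names `fission_groups` and `MAX_STMTS`, the boolean adjacency matrix, and the use of Warshall's closure rather than whatever SCC algorithm a production compiler would use. Emitting the groups in topological order is left out.

```c
#include <stdbool.h>

#define MAX_STMTS 8

/* edge[a][b] is true when statement b depends on statement a.
   On return, group[s] is the smallest statement index in s's strongly
   connected component; statements sharing a group must stay in one loop. */
void fission_groups(int n, bool edge[MAX_STMTS][MAX_STMTS], int group[MAX_STMTS]) {
    bool reach[MAX_STMTS][MAX_STMTS];
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            reach[a][b] = edge[a][b] || a == b;
    /* Warshall's transitive closure */
    for (int k = 0; k < n; k++)
        for (int a = 0; a < n; a++)
            for (int b = 0; b < n; b++)
                if (reach[a][k] && reach[k][b]) reach[a][b] = true;
    /* a and b share a component when each reaches the other */
    for (int a = 0; a < n; a++) {
        group[a] = a;
        for (int b = 0; b < a; b++)
            if (reach[a][b] && reach[b][a]) { group[a] = group[b]; break; }
    }
}
```

As a hypothetical example (the slide's figure does not survive, so the exact edges are a guess), with edges S1→S2, S2→S3 and a loop-carried edge S3→S2, statements S2 and S3 form one component and S1 can be split into its own loop, emitted first.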
14
Outer Loop Unrolling
    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          c[i][j] += a[i][k] * b[k][j];
How many loads in the inner loop?
How many MACs?
15
Outer Loop Unrolling
    for i = 1 to n by 2
      for j = 1 to n by 2
        for k = 1 to n
          c[i][j] += a[i][k] * b[k][j];
          c[i][j+1] += a[i][k] * b[k][j+1];
          c[i+1][j] += a[i+1][k] * b[k][j];
          c[i+1][j+1] += a[i+1][k] * b[k][j+1];
Is it legal?
16
Outer Loop Unrolling
    for i = 1 to 2 by 2
      for j = 1 to 2 by 2
        for k = 1 to 2
          c[i][j] += a[i][k] * b[k][j];
          c[i][j+1] += a[i][k] * b[k][j+1];
          c[i+1][j] += a[i+1][k] * b[k][j];
          c[i+1][j+1] += a[i+1][k] * b[k][j+1];
If n = 2
– Original order was (1, 1, 1) (1, 1, 2) (1, 2, 1) (1, 2, 2) (2, 1, 1) (2, 1, 2) (2, 2, 1) (2, 2, 2)
– New order is (1, 1, 1) (1, 2, 1) (2, 1, 1) (2, 2, 1) (1, 1, 2) (1, 2, 2) (2, 1, 2) (2, 2, 2)
Equivalent to permuting the loops into
    for k = 1 to 2
      for i = 1 to 2
        for j = 1 to 2
If loops are fully permutable, can also outer loop unroll
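A C sketch of the 2x2 outer-loop unrolling makes the load count concrete: the original inner loop does two loads (a[i][k], b[k][j]) per multiply-accumulate, while the unrolled body does four loads for four multiply-accumulates. The code assumes 0-based, row-major n x n matrices with n even (no remainder loop is shown), and the function names are hypothetical.

```c
#include <stddef.h>

/* Reference: c += a * b. */
void matmul_ref(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* i and j unrolled by 2, with the four copies jammed into one k loop.
   Precondition: n is even. */
void matmul_unroll2x2(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i += 2)
        for (size_t j = 0; j < n; j += 2) {
            /* the four accumulators stay in registers across the k loop */
            double c00 = c[i * n + j],       c01 = c[i * n + j + 1];
            double c10 = c[(i + 1) * n + j], c11 = c[(i + 1) * n + j + 1];
            for (size_t k = 0; k < n; k++) {
                double a0 = a[i * n + k], a1 = a[(i + 1) * n + k]; /* 2 loads */
                double b0 = b[k * n + j], b1 = b[k * n + j + 1];   /* 2 loads */
                c00 += a0 * b0;  c01 += a0 * b1;                   /* 4 MACs  */
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            c[i * n + j] = c00;        c[i * n + j + 1] = c01;
            c[(i + 1) * n + j] = c10;  c[(i + 1) * n + j + 1] = c11;
        }
}
```

Each c[i][j] still accumulates its products in increasing k order, so the two versions should agree element for element.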
17
Unrolling Trapezoidal Loops
    for i = 1 to n by 2
      for j = 1 to i
[Figure: trapezoidal iteration space with axes i and j]
Ugly
– We unroll two-level trapezoidal loops, but the details are very ugly
18
Trapezoidal Example
    for (i=0; i<n; i++) {
      for (j=2*i; j<n-i; j++) {
        a[i][j] += 1;
      }
    }
becomes
    for(i = 0; i <= (n + -2); i = i + 2) {
      /* j range shared by rows i and i+1 */
      lstar = (i * 2) + 2;
      ustar = (n - (i + 1)) + -1;
      if(((i * 2) + 2) < (n - (i + 1))) {
        /* leading j values that only row i covers */
        for(r2d_i = i; r2d_i <= (i + 1); r2d_i = r2d_i + 1){
          for(j = r2d_i * 2; j <= ((i * 2) + 1); j = j + 1){
            a[r2d_i][j] = a[r2d_i][j] + 1;
          }
        }
        /* unrolled body over the shared j range */
        for(j0 = lstar; ustar >= j0; j0 = j0 + 1) {
          a[i][j0] = a[i][j0] + 1;
          a[i + 1][j0] = a[i + 1][j0] + 1;
        };
        /* trailing j values that only row i covers */
        for(r2d_i0 = i; r2d_i0 <= (i + 1); r2d_i0 = r2d_i0 + 1) {
          for(j1 = n - (i + 1); j1 < (n - r2d_i0); j1 = j1 + 1) {
            a[r2d_i0][j1] = a[r2d_i0][j1] + 1;
          };
        }
      } else {
        /* no shared range: run the two rows without unrolling */
        for(r2d_i1 = i; r2d_i1 <= (i + 1); r2d_i1 = r2d_i1 + 1) {
          for(j2 = r2d_i1 * 2; j2 < (n - r2d_i1); j2 = j2 + 1) {
            a[r2d_i1][j2] = a[r2d_i1][j2] + 1;
          }
        }
      }
    }
    /* leftover row when n is odd */
    if(n > i) {
      for(j3 = i * 2; j3 < (n - i); j3 = j3 + 1) {
        a[i][j3] = a[i][j3] + 1;
      };
    }
19
Cache Tiling
    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          c[i][j] += a[i][k] * b[k][j];
How many cache misses?
20
Cache Tiling
    for jb = 1 to n by B
      for kb = 1 to n by B
        for i = 1 to n
          for j = jb to min(jb+B-1, n)
            for k = kb to min(kb+B-1, n)
              c[i][j] += a[i][k] * b[k][j];
(B is the tile size, written in upper case here to keep it distinct from the array b; the inner bounds stop at jb+B-1 and kb+B-1 so that adjacent tiles do not overlap)
How many cache misses?
– Order B reuse for each array
If loops are fully permutable, can cache tile
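The tiled loop nest translates to C roughly as follows. This is a sketch, not a tuned kernel: it assumes 0-based, row-major n x n matrices, a caller-chosen `tile` of at least 1, and half-open inner bounds clamped to n so that n need not be a multiple of the tile size. The names `matmul_tiled` and `min_sz` are hypothetical.

```c
#include <stddef.h>

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* c += a * b with the j and k loops tiled.
   Precondition: tile >= 1. */
void matmul_tiled(size_t n, size_t tile,
                  const double *a, const double *b, double *c) {
    for (size_t jb = 0; jb < n; jb += tile)
        for (size_t kb = 0; kb < n; kb += tile)
            for (size_t i = 0; i < n; i++)
                /* half-open tile [jb, jb+tile), clamped at the matrix edge */
                for (size_t j = jb; j < min_sz(jb + tile, n); j++)
                    for (size_t k = kb; k < min_sz(kb + tile, n); k++)
                        c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Within one (jb, kb) pair the tile-by-tile block of `b` is reused for every i, which is where the reuse on the slide comes from. A real implementation would pick `tile` from the cache size, as the later cache-modeling slide describes.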
21
Vectorization: SIMD
    for i = 1 to n
      for j = 1 to n
        a[j][i] = 0;
becomes
    for i = 1 to n by 8
      for j = 1 to n
        a[j][i:i+7] = 0;
N-way parallel, where N is the SIMD width of the machine
22
Vectorization: SIMD
    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          S1
          …
          SM
We have moved later iterations of S1 ahead of earlier iterations of S2, …, SM
Legal as long as there is no dependence from a later S to an earlier S where that dependence is carried by the vector loop
– E.g., legal to vectorize "j" above if there is no dependence from a later S to an earlier S with direction (=, >, *)
23
Putting It All Together
Three-phase algorithm
1. Use fission and fusion to build perfectly nested loops
   – We prefer fusion, but it is not obvious that that is right
2. Enumerate possibilities for unrolling, interchanging, cache tiling and vectorizing
3. Use inner loop fission if necessary to minimize register pressure
24
Phase 2
Choose a loop to vectorize
– All references that refer to the vector loop must be stride-1
For each possible inner loop
– Compute best possible unrollings for each outer loop
– Compute best possible ordering and tiling
To compute best possible unrolling
– Try all combinations of unrolling up to a max product of 16
– For each possible unrolling
  » Estimate the machine cycles for the inner loop (ignoring cache)
  » Estimate the register pressure
  » Don't unroll more if too much register pressure
To compute best possible ordering and tiling
– Consider only loops with "reuse"
– Choose best three
– Iterate over all orderings of three with a binary search on cache tile size
Note and record the total cycle time for this configuration. Pick the best.
Estimating cycles
– Could compile every combination, but …
25
Machine Modeling
Recall that software pipelining had resource limits and latency limits
– Map high level IR to machine resources
– Unroll high level IR operations
  » Remove duplicate loads and stores
  » Count machine resources
  » Build a latency graph of unrolled operations
– Iterate over inner loop cycles and find worst cycle
  » Assume performance is worst of two limits
Model register pressure
– Count loop invariant loads and stores
– Count address streams
– Count cross-iteration CSEs, e.g. … = a[i] + a[i-2]
– Add machine dependent constant
26
Cache Modeling
Given a loop ordering and a set of tile factors
– Combine array references that differ by a constant, e.g. a[i][j] and a[i+1][j+1]
– Estimate capacity of all array references, multiply by a fudge factor for interference, and stop increasing block sizes if capacity is larger than the cache
– Estimate quantity of data that must be brought into cache
27
Phase 3: Inner Loop Fission
Does the inner loop use too many registers?
– Break down into SCCs
– Pick biggest SCC
  Does it use too many registers?
  – If yes, too bad
  – If no, search for other SCCs to merge in
    » Pick one with most commonality
    » Keep merging while enough registers
28
Extra: Reductions
    for i = 1 to n
      for j = 1 to n
        a[j] += b[i][j];
Can I unroll?
    for i = 1 to n by 2
      for j = 1 to n
        a[j] += b[i][j];
        a[j] += b[i+1][j];
Legal?
– Integer: yes
– Floating point: maybe
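One plausible reading of the "maybe" for floating point: the unrolled code as written still adds to a[j] in the original order, but once the two updates sit next to each other an optimizer may want to combine them as a[j] += (b[i][j] + b[i+1][j]), and floating-point addition is not associative. The sketch below, with hypothetical function names, shows the two association orders disagreeing on a deliberately extreme input.

```c
/* Source order: (acc + x) + y, as two separate += updates. */
double sum_in_order(double acc, double x, double y) {
    return (acc + x) + y;
}

/* Reassociated: acc + (x + y), as if the two unrolled updates were combined. */
double sum_reassociated(double acc, double x, double y) {
    return acc + (x + y);
}
```

With acc = 1e20, x = -1e20 and y = 1.0, the first form cancels the large terms and then adds 1.0, while the second loses the 1.0 when it is added to -1e20. With integer arithmetic that wraps, the two orders always agree, which matches the "Integer: yes" bullet.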
29
Extra: Outer Loop Invariants
    for i
      for j
        a[i][j] += b[i] * cos(c[j])
Can replace with
    for j
      t[j] = cos(c[j])
    for i
      for j
        a[i][j] += b[i] * t[j];
Need to integrate with model
– Model must assume that invariant computation will be replaced with loads
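The saving from hoisting the outer-loop invariant can be made visible by counting calls. In this sketch `expensive` is a hypothetical stand-in for cos() (chosen so the example has no math-library dependency), `calls` is a counter added purely for illustration, and the arrays are 0-based with an n x m row-major `a`.

```c
#include <stddef.h>

static unsigned long calls;  /* number of evaluations of the stand-in function */

/* Hypothetical stand-in for an expensive pure function such as cos(). */
static double expensive(double x) { calls++; return x * x + 1.0; }

/* Original form: the function is re-evaluated for every (i, j) pair. */
void update_naive(size_t n, size_t m, double *a,
                  const double *b, const double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            a[i * m + j] += b[i] * expensive(c[j]);
}

/* Hoisted form: evaluate once per j into t (room for m doubles),
   then the inner loop only loads t[j]. */
void update_hoisted(size_t n, size_t m, double *a,
                    const double *b, const double *c, double *t) {
    for (size_t j = 0; j < m; j++)
        t[j] = expensive(c[j]);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            a[i * m + j] += b[i] * t[j];
}
```

The naive version makes n*m calls and the hoisted version makes m, at the cost of m extra stores and n*m loads of t[j]; that trade is what the cost model has to account for.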