Optimizing Compilers for Modern Architectures
Creating Coarse-grained Parallelism for Loop Nests
Chapter 6, Sections 6.3 through 6.9
Last time: single-loop methods
- Privatization
- Loop distribution
- Alignment
- Loop fusion
Loop Interchange
- Moves dependence-free loops to the outermost level
- Theorem: in a perfect nest of loops, a particular loop can be parallelized at the outermost level if and only if its column of the direction matrix for that nest contains only '=' entries
- (Vectorization, by contrast, moves loops to the innermost level)
Loop Interchange

DO I = 1, N
   DO J = 1, N
      A(I+1, J) = A(I, J) + B(I, J)
   ENDDO
ENDDO

OK for vectorization; problematic for parallelization
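Worked check of the theorem on this nest: the only dependence is the true dependence of A(I+1, J) on A(I, J), with distance vector (1, 0) and direction vector (<, =). The I column of the direction matrix contains '<', so I cannot be the outermost parallel loop, but the J column is all '=', so J can be interchanged outward and parallelized, as on the next slide.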
Loop Interchange

PARALLEL DO J = 1, N
   DO I = 1, N
      A(I+1, J) = A(I, J) + B(I, J)
   ENDDO
END PARALLEL DO
Loop Interchange
Working with the direction matrix:
1. Move loops whose columns contain only '=' entries into the outermost positions, parallelize them, and remove their columns from the matrix
2. Move the loop with the most '<' entries into the next outermost position, sequentialize it, and eliminate its column along with any rows representing dependences it carries
3. Repeat from step 1
Loop Interchange

while L is not empty do
   while there exist columns in M with all "=" entries do
      success := true;
      l := loop with all-"=" column;
      remove l from L; parallelize l;
      eliminate l from M;
   end;
   if L is not empty then begin
      select_loop_and_interchange(L);
      l := outermost loop; remove l from L;
      sequentialize l;
      remove column corresponding to l from M;
      remove all rows corresponding to dependences carried by l from M;
   end
end
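Below is a minimal executable sketch of this driver, assuming the direction matrix is given as one row per dependence with a '<'/'='/'>' entry per loop (outermost loop first). The function name, the selection rule (most '<' entries), and the omission of interchange-legality checks are all simplifications:

def parallelize_nest(loops, matrix):
    # loops: loop names, outermost first; matrix: one row per dependence.
    # Returns the chosen loop order, outermost first.
    loops = list(loops)
    matrix = [list(row) for row in matrix]
    schedule = []
    while loops:
        # Step 1: a loop whose column is all '=' can run in parallel outermost.
        col = next((j for j in range(len(loops))
                    if all(row[j] == '=' for row in matrix)), None)
        if col is not None:
            schedule.append(('parallel', loops.pop(col)))
            for row in matrix:
                del row[col]
            continue
        # Step 2: sequentialize a loop (here: the one with the most '<'
        # entries), assuming it can legally be moved outermost, and drop
        # the dependence rows it now carries.
        col = max(range(len(loops)),
                  key=lambda j: sum(row[j] == '<' for row in matrix))
        schedule.append(('sequential', loops.pop(col)))
        matrix = [row[:col] + row[col + 1:] for row in matrix if row[col] != '<']
    return schedule

print(parallelize_nest(['I', 'J'], [['<', '=']]))
# -> [('parallel', 'J'), ('sequential', 'I')]
# i.e. the interchange example above: J parallel outermost, I sequential inside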
Loop Selection
- Goal: generate the most parallelism with adequate granularity
  - The key is to select the proper loops to run in parallel
- Informal parallel code generation strategy:
  - While there are loops that can be run in parallel, move them to the outermost position and parallelize them
  - Then select a sequential loop, run it sequentially, and see what new parallelism is revealed
Loop Selection

DO I = 2, N+1
   DO J = 2, M+1
      DO K = 1, L
         A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
      ENDDO
   ENDDO
ENDDO

Direction matrix:
   = < <
   < = =
Loop Selection

DO I = 2, N+1
   DO J = 2, M+1
      PARALLEL DO K = 1, L
         A(I, J, K+1) = A(I, J-1, K) + A(I-1, J, K+2) + A(I-1, J, K)
      END PARALLEL DO
   ENDDO
ENDDO
Loop Selection
- Is it possible to derive a selection heuristic that always yields optimal code?
  - Probably not: the underlying problem is NP-complete
- A simple approach: select the loop with the most '<' directions, so as to eliminate the maximum number of rows from the direction matrix
  - This heuristic fails on the counterexample direction matrix from this slide (row layout lost in extraction): < < = = < = < = = < = < = = = = < = = = = <
Loop Selection
- Favor selecting loops that must be sequentialized before parallelism can be uncovered (sketch below)
- If there exists a loop that can legally be moved to the outermost position, and a dependence for which that loop has the only '<' direction, sequentialize that loop
- If there are several such loops, all of them will need to be sequentialized at some point in the process
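A sketch of this refinement under the same direction-matrix representation as before; the test for whether the chosen loop can legally be moved outermost is omitted for brevity:

def select_loop(matrix, num_loops):
    # Prefer a loop that supplies the only '<' in some dependence row:
    # that dependence can only ever be carried by this loop, so
    # sequentializing the loop is unavoidable.
    for j in range(num_loops):
        for row in matrix:
            if row[j] == '<' and all(row[k] == '=' for k in range(num_loops) if k != j):
                return j
    # Otherwise fall back to the loop with the most '<' directions.
    return max(range(num_loops), key=lambda j: sum(row[j] == '<' for row in matrix))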
Loop Selection
- Sketch of why loop selection is NP-complete:
  - Encode each dependence row of the direction matrix as a bit vector over the loops, with '<' as 1 and '=' as 0 (the counterexample matrix above flattens to 1 1 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1)
  - Loop selection then corresponds to finding a minimal basis among the loops, with "logical or" as the combining operation
  - This is the minimum set cover problem, known to be NP-complete
- So loop selection is best done by a heuristic
Loop Selection
Example of the principles involved in heuristic loop selection:

DO I = 2, N
   DO J = 2, M
      DO K = 2, L
         A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
      ENDDO
   ENDDO
ENDDO

Direction matrix:
   = < =
   < = <
   = < <
Loop Selection

DO J = 2, M
   DO I = 2, N
      PARALLEL DO K = 2, L
         A(I, J, K) = A(I, J-1, K) + A(I-1, J, K-1) + A(I, J+1, K+1) + A(I-1, J, K+1)
      END PARALLEL DO
   ENDDO
ENDDO
Loop Reversal

DO I = 2, N+1
   DO J = 2, M+1
      DO K = 1, L
         A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
      ENDDO
   ENDDO
ENDDO

Direction matrix:
   = < >
   < = >
Loop Reversal

DO K = L, 1, -1
   PARALLEL DO I = 2, N+1
      PARALLEL DO J = 2, M+1
         A(I, J, K) = A(I, J-1, K+1) + A(I-1, J, K+1)
      END PARALLEL DO
   END PARALLEL DO
ENDDO

Reversal increases the range of options available to loop selection heuristics
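On the direction matrix, reversing K flips every entry in its column between '<' and '>'. The two dependences here have direction vectors (=, <, >) and (<, =, >) (derived from the subscripts above); after reversal they become (=, <, <) and (<, =, <), so K can legally be interchanged to the outermost position, where it carries both dependences and frees I and J to run in parallel.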
Loop Skewing

DO I = 2, N+1
   DO J = 2, M+1
      DO K = 1, L
         A(I, J, K) = A(I, J-1, K) + A(I-1, J, K)
         B(I, J, K+1) = B(I, J, K) + A(I, J, K)
      ENDDO
   ENDDO
ENDDO

Direction matrix:
   = < =
   < = =
   = = <
   = = =
Loop Skewing
Skewing with k = K + I + J yields:

DO I = 2, N+1
   DO J = 2, M+1
      DO k = I+J+1, I+J+L
         A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
         B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
      ENDDO
   ENDDO
ENDDO

Skewed direction matrix:
   = < <
   < = <
   = = <
   = = =
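To see why the matrix changes this way, note that skewing replaces each dependence's K-distance with dk = dI + dJ + dK. For example, the row (=, <, =) has distances (0, 1, 0), so dk = 1 and the row becomes (=, <, <); the all-'=' row has dk = 0 and stays (=, =, =).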
Loop Skewing
After interchanging k to the outermost position:

DO k = 5, N+M+L+2
   PARALLEL DO I = MAX(2, k-M-L-1), MIN(N+1, k-3)
      PARALLEL DO J = MAX(2, k-I-L), MIN(M+1, k-I-1)
         A(I, J, k-I-J) = A(I, J-1, k-I-J) + A(I-1, J, k-I-J)
         B(I, J, k-I-J+1) = B(I, J, k-I-J) + A(I, J, k-I-J)
      END PARALLEL DO
   END PARALLEL DO
ENDDO
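The bounds follow from k = I + J + K: with I in [2, N+1], J in [2, M+1], and K in [1, L], k ranges over [5, N+M+L+2]. For fixed k and I, the constraints 2 <= J <= M+1 and 1 <= k-I-J <= L give the MAX/MIN bounds on J, and requiring that J range to be non-empty gives I <= k-3 and I >= k-M-L-1.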
Loop Skewing
- Transforms a loop so that it can be interchanged to the outermost position without changing the meaning of the program
- Can be used so that, after outward interchange, the skewed loop carries all dependences formerly carried by the loop with respect to which it is skewed
Loop Skewing
Selection heuristics with skewing:
1. Parallelize the outermost loop if possible
2. Otherwise, sequentialize at most one outer loop to find parallelism in the next loop
3. If steps 1 and 2 fail, try skewing
4. If step 3 fails, sequentialize the loop that can be moved to the outermost position and that covers the most other loops
Unimodular Transformations
Definition: a transformation represented by a matrix T is unimodular if
- T is square,
- all the elements of T are integral, and
- the absolute value of the determinant of T is 1
Any composition of unimodular transformations is unimodular (examples below)
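Standard examples for a 2-deep nest: loop interchange is the permutation matrix [[0, 1], [1, 0]] (determinant -1), reversal of the inner loop is [[1, 0], [0, -1]] (determinant -1), and skewing the inner loop by the outer is the unit lower triangular matrix [[1, 0], [1, 1]] (determinant 1). Each is square and integral with |det| = 1, so any sequence of these transformations, applied to the iteration vector, is again unimodular.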
Profitability-Based Methods
- Motivation
  - Need minimum granularity with low synchronization cost
  - There are many alternatives for parallel code generation
- Static performance estimation function
  - Need not be accurate in absolute terms
  - Must be good at selecting the better of two alternatives
- Key considerations
  - Cost of memory references
  - Sufficiency of granularity
Profitability-Based Methods
- Picking the best of all possible permutations and parallelizations is impractical:
  - The total number of alternatives is exponential in the number of loops in a nest
  - Many of the loop upper bounds are unknown at compile time
- Instead, consider only a subset of the possible code arrangements, based on properties of the cost function
Profitability-Based Methods
- Subdivide all the references in the loop body into reference groups
- Classify subsequent accesses to the same reference as:
  - Loop invariant: cost = 1
  - Unit stride: cost = number of iterations / cache line size
  - Non-unit stride: cost = number of iterations
- Assign each loop a cost: the sum of the reference costs times the aggregate number of times the loop would execute if it were innermost in the nest (sketch below)
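A sketch of this cost model in Python; the cache line size LINE and the classification of each reference group are inputs, and all names here are illustrative:

from math import prod

LINE = 8  # assumed cache line size, in array elements

def loop_cost(candidate, trips, ref_kinds):
    # trips: {loop name: trip count}; ref_kinds: how each reference group
    # behaves when `candidate` is the innermost loop.
    n = trips[candidate]
    per_ref = {'invariant': 1, 'unit': n / LINE, 'nonunit': n}
    outer = prod(t for name, t in trips.items() if name != candidate)
    return outer * sum(per_ref[kind] for kind in ref_kinds.values())

# Matrix multiply (next slide) with K innermost: C(I,J) is loop invariant,
# A(I,K) is non-unit stride, B(K,J) is unit stride in column-major Fortran.
N = 512
print(loop_cost('K', {'I': N, 'J': N, 'K': N},
                {'C(I,J)': 'invariant', 'A(I,K)': 'nonunit', 'B(K,J)': 'unit'}))
# = N**2 * (1 + N + N/LINE), i.e. N^3(1 + 1/L) + N^2 with L = LINE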
Profitability-Based Methods

DO I = 1, N
   DO J = 1, N
      DO K = 1, N
         C(I, J) = C(I, J) + A(I, K) * B(K, J)
      ENDDO
   ENDDO
ENDDO

Per-reference costs with K innermost (L = cache line size): C(I,J) = 1, A(I,K) = N, B(K,J) = N/L
- Cost with K innermost: N^3(1 + 1/L) + N^2
- Cost with J innermost: 2N^3 + N^2
- Cost with I innermost: 2N^3/L + N^2
Reorder the loops from innermost to outermost by increasing loop cost (the desired order is not always achievable)
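Since 2N^3/L < N^3(1 + 1/L) < 2N^3 for any line size L > 1, the I loop is cheapest innermost, K next, and J most expensive, so the nest is reordered with J outermost, K in the middle, and I innermost: the classic JKI form of matrix multiply.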
Multilevel Loop Fusion
- Commonly used for imperfect loop nests
- Applied after maximal loop distribution
Multilevel Loop Fusion
- Decision making needs look-ahead
- Heuristic: fuse first with a loop that cannot be fused with one of its successors
Parallel Code Generation

procedure Parallelize(l, Dl);
   ParallelizeNest(l, success);
   if not success then begin
      if l can be distributed then begin
         distribute l into loop nests l1, l2, ..., ln;
         for i := 1 to n do begin
            Parallelize(li, Di);
         end
         Merge({l1, l2, ..., ln});
      end
      else begin  { l cannot be distributed }
Parallel Code Generation (continued)

         for each outer loop l0 nested in l do begin
            let D0 be the set of dependences between statements in l0,
               less the dependences carried by l;
            Parallelize(l0, D0);
         end
         let S be the set of outer loops and statements left in l;
         if ||S|| > 1 then Merge(S);
      end
   end
end
Erlebacher

DO J = 1, JMAXD
   DO I = 1, IMAXD
      F(I, J, 1) = F(I, J, 1) * B(1)
   ENDDO
ENDDO
DO K = 2, N-1
   DO J = 1, JMAXD
      DO I = 1, IMAXD
         F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      ENDDO
   ENDDO
ENDDO
DO J = 1, JMAXD
   DO I = 1, IMAXD
      TOT(I, J) = 0.0
   ENDDO
ENDDO
DO J = 1, JMAXD
   DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
   ENDDO
ENDDO
DO K = 2, N-1
   DO J = 1, JMAXD
      DO I = 1, IMAXD
         TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
      ENDDO
   ENDDO
ENDDO
Erlebacher

PARALLEL DO J = 1, JMAXD
   DO I = 1, IMAXD
      F(I, J, 1) = F(I, J, 1) * B(1)
   ENDDO
   DO K = 2, N-1
      DO I = 1, IMAXD
         F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
      ENDDO
   ENDDO
   DO I = 1, IMAXD
      TOT(I, J) = 0.0
   ENDDO
   DO I = 1, IMAXD
      TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
   ENDDO
   DO K = 2, N-1
      DO I = 1, IMAXD
         TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
      ENDDO
   ENDDO
END PARALLEL DO
Erlebacher

PARALLEL DO J = 1, JMAXD
   DO I = 1, IMAXD
      F(I, J, 1) = F(I, J, 1) * B(1)
      TOT(I, J) = 0.0
      TOT(I, J) = TOT(I, J) + D(1) * F(I, J, 1)
   ENDDO
   DO K = 2, N-1
      DO I = 1, IMAXD
         F(I, J, K) = (F(I, J, K) - A(K) * F(I, J, K-1)) * B(K)
         TOT(I, J) = TOT(I, J) + D(K) * F(I, J, K)
      ENDDO
   ENDDO
END PARALLEL DO
Strip Mining
Converts available parallelism into a form more suitable for the hardware:

DO I = 1, N
   A(I) = A(I) + B(I)
ENDDO

becomes (P = number of processors)

k = CEIL(N / P)
PARALLEL DO I = 1, N, k
   DO i = I, MIN(I + k - 1, N)
      A(i) = A(i) + B(i)
   ENDDO
END PARALLEL DO
Strip Mining

DO I = 1, N
   DO J = 2, I
      A(I, J) = A(I, J-1) + B(I)
   ENDDO
ENDDO

For a triangular nest like this, choose a smaller strip size to allow a more balanced load distribution
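For example, with N = 100 and P = 4 equal strips of the I loop, the inner loop runs I-1 times per outer iteration, so the first strip (I = 1 to 25) performs about 300 inner iterations while the last (I = 76 to 100) performs about 2175, roughly a 7x imbalance; many small strips dealt out on demand even the load out.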
Pipeline Parallelism
- Supported by the Fortran DOACROSS construct
- Useful where full parallelization is not available
- Beware high synchronization costs

DO I = 2, N-1
   DO J = 2, N-1
      A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
   ENDDO
ENDDO
Pipeline Parallelism

DOACROSS I = 2, N-1
   POST (EV(1))
   DO J = 2, N-1
      WAIT (EV(J-1))
      A(I, J) = .25 * (A(I-1, J) + A(I, J-1) + A(I+1, J) + A(I, J+1))
      POST (EV(J))
   ENDDO
ENDDO
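The pattern can be simulated directly; here is a runnable Python analogue that uses one event per (I, J) point rather than the slide's EV array (it illustrates the synchronization only, not compiler-generated code). The events also preserve the antidependence: row I reads the old A(I+1, J) before row I+1 is allowed to overwrite it:

import threading

N = 6
A = [[float(i + j) for j in range(N + 1)] for i in range(N + 1)]
ev = [[threading.Event() for _ in range(N + 1)] for _ in range(N + 1)]

def row(I):
    for J in range(2, N):                # DO J = 2, N-1
        if I > 2:
            ev[I - 1][J].wait()          # row I-1 must be past column J
        A[I][J] = .25 * (A[I-1][J] + A[I][J-1] + A[I+1][J] + A[I][J+1])
        ev[I][J].set()                   # let row I+1 proceed past column J

threads = [threading.Thread(target=row, args=(I,)) for I in range(2, N)]
for t in threads:
    t.start()
for t in threads:
    t.join()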
Pipeline Parallelism
[Figure: pipelined (wavefront) execution of the loop above, with each iteration of I staggered behind its predecessor]
Pipeline Parallelism
Blocking the J iterations in pairs to reduce synchronization:

DOACROSS I = 2, N-1
   POST (EV(1))
   K = 0
   DO J = 2, N-1, 2
      K = K + 1
      WAIT (EV(K))
      DO j = J, MIN(J+1, N-1)
         A(I, j) = .25 * (A(I-1, j) + A(I, j-1) + A(I+1, j) + A(I, j+1))
      ENDDO
      POST (EV(K+1))
   ENDDO
ENDDO
Pipeline Parallelism
[Figure: pipelined execution with iterations blocked in pairs, showing the reduced synchronization]
Scheduling Parallel Work
- Parallel execution is slower than serial execution if the scheduling and synchronization overhead exceeds the time saved by running iterations in parallel
- Bakery-counter scheduling: each processor fetches one iteration at a time from a shared counter
  - High synchronization overhead
Guided Self-Scheduling
- Minimizes synchronization overhead
  - Schedules groups of iterations, unlike the bakery-counter method
  - Block sizes go from large to small as the loop executes
- Keeps all processors busy at all times
- An idle processor at time t is dispensed ceil(R_t / p) iterations, where R_t is the number of iterations remaining and p is the number of processors (sketch below)
- Alternatively, GSS(k) guarantees that all blocks handed out are of size k or greater
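A sketch of this dispatch rule; gss_chunks is an illustrative helper, not from the text:

from math import ceil

def gss_chunks(n, p, k=1):
    # Block sizes handed out, in order, for n iterations on p processors.
    # GSS(k) never dispenses fewer than k iterations (except the last block).
    chunks, remaining = [], n
    while remaining > 0:
        size = min(remaining, max(k, ceil(remaining / p)))
        chunks.append(size)
        remaining -= size
    return chunks

print(gss_chunks(100, 4))
# -> [25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1]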
Guided Self-Scheduling
[Figure: example GSS(1) schedule; the table was lost in extraction]