Enhancing Fine-Grained Parallelism
Chapter 5 of Allen and Kennedy, Optimizing Compilers for Modern Architectures
Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
— Loop Interchange
— Scalar Expansion
— Scalar Renaming
— Array Renaming
— Node Splitting
Recall the vectorization procedure:

procedure codegen(R, k, D);
   // R is the region for which we must generate code.
   // k is the minimum nesting level of possible parallel loops.
   // D is the dependence graph among statements in R.
   find the set {S1, S2, ..., Sm} of maximal strongly-connected regions in the
      dependence graph D restricted to R;
   construct Rp from R by reducing each Si to a single node, and compute Dp,
      the dependence graph naturally induced on Rp by D;
   let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order consistent
      with Dp (use topological sort to do the numbering);
   for i = 1 to m do begin
      if pi is cyclic then begin
         generate a level-k DO statement;
         let Di be the dependence graph consisting of all dependence edges in D
            that are at level k+1 or greater and are internal to pi;
         codegen(pi, k+1, Di);   // we can fail here
         generate the level-k ENDDO statement;
      end
      else
         generate a vector statement for pi in r(pi)-k+1 dimensions, where
            r(pi) is the number of loops containing pi;
   end
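A minimal case of that failure (an illustrative example, not from the slides): a single-statement recurrence forms a cyclic node at the innermost level, leaving codegen no choice but a sequential loop.

      DO I = 1, N
S1       A(I+1) = A(I) + B(I)
      ENDDO

S1 has a true dependence on itself carried at level 1, so the strongly-connected region {S1} is cyclic; with no deeper loop to recurse into, no vector statement can be generated for it.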
Can we do better?
— codegen tries to find parallelism using the transformations of loop distribution and statement reordering.
— If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops.
— Goal in Chapter 5: explore other transformations to exploit parallelism.
Motivational Example

      DO J = 1, M
         DO I = 1, N
            T = 0.0
            DO K = 1, L
               T = T + A(I,K) * B(K,J)
            ENDDO
            C(I,J) = T
         ENDDO
      ENDDO

codegen will not uncover any vector operations here: the scalar T ties all three statements into a single cyclic dependence region at every loop level. However, by scalar expansion, we can get:

      DO J = 1, M
         DO I = 1, N
            T$(I) = 0.0
            DO K = 1, L
               T$(I) = T$(I) + A(I,K) * B(K,J)
            ENDDO
            C(I,J) = T$(I)
         ENDDO
      ENDDO
Motivational Example II
Loop distribution gives us:

      DO J = 1, M
         DO I = 1, N
            T$(I) = 0.0
         ENDDO
         DO I = 1, N
            DO K = 1, L
               T$(I) = T$(I) + A(I,K) * B(K,J)
            ENDDO
         ENDDO
         DO I = 1, N
            C(I,J) = T$(I)
         ENDDO
      ENDDO
Motivational Example III
Finally, interchanging the I and K loops, we get:

      DO J = 1, M
         T$(1:N) = 0.0
         DO K = 1, L
            T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
         ENDDO
         C(1:N,J) = T$(1:N)
      ENDDO

A couple of new transformations used:
— Loop interchange
— Scalar expansion
Loop Interchange

      DO I = 1, N
         DO J = 1, M
S           A(I,J+1) = A(I,J) + B
         ENDDO
      ENDDO

DV: (=, <)

Applying loop interchange:

      DO J = 1, M
         DO I = 1, N
S           A(I,J+1) = A(I,J) + B
         ENDDO
      ENDDO

DV: (<, =)

leads to:

      DO J = 1, M
S        A(1:N,J+1) = A(1:N,J) + B
      ENDDO
Loop Interchange
Loop interchange is a reordering transformation. Why?
— Think of statements being parameterized with the corresponding iteration vector.
— Loop interchange merely changes the execution order of these statements.
— It does not create new instances or delete existing instances.

      DO J = 1, M
         DO I = 1, N
S
         ENDDO
      ENDDO

Writing S(j, i) for the instance of S executed when J = j and I = i: if the loops are interchanged, S(2, 1) will execute before S(1, 2), reversing their original order.
Loop Interchange: Safety
Safety: not all loop interchanges are safe.

      DO J = 1, M
         DO I = 1, N
            A(I,J+1) = A(I+1,J) + B
         ENDDO
      ENDDO

Direction vector: (<, >). If we interchange the loops, the vector becomes (>, <), and we violate the dependence: the sink would execute before the source.
Loop Interchange: Safety
A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
Loop Interchange: Safety
A dependence is interchange-sensitive if it is carried by the same loop after interchange; that is, an interchange-sensitive dependence moves with its original carrier loop to the new level.

Example: interchange-sensitive?
Example: interchange-insensitive?
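One way to answer those two prompts (illustrative examples, not from the original slides): a dependence with direction vector (<, =) is interchange-sensitive, while one with (<, <) is interchange-insensitive.

      DO I = 1, N
         DO J = 1, M
S1          A(I+1,J) = A(I,J) + B
         ENDDO
      ENDDO

S1's dependence has DV (<, =): it is carried by the I loop at level 1, and after interchange the DV becomes (=, <), still carried by the I loop, now at level 2. The dependence moved with its carrier, so it is interchange-sensitive.

      DO I = 1, N
         DO J = 1, M
S2          C(I+1,J+1) = C(I,J) + B
         ENDDO
      ENDDO

S2's dependence has DV (<, <): before and after interchange it is carried at level 1, i.e., by whichever loop happens to be outermost. It does not move with any particular loop, so it is interchange-insensitive.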
Loop Interchange: Safety
Theorem 5.1: Let D(i, j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i, j).

The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest, and every such direction vector is represented by a row.
Loop Interchange: Safety

      DO I = 1, N
         DO J = 1, M
            DO K = 1, L
               A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
            ENDDO
         ENDDO
      ENDDO

The direction matrix for the loop nest is:

      < < =
      < = >

(the first row from the reference to A(I,J,K), the second from A(I,J+1,K+1)).

Theorem 5.2: A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.

Follows from Theorem 5.1 and Theorem 2.3.
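As a worked check of Theorem 5.2 on this matrix (not on the original slide): interchanging I and J permutes the columns to give rows (<, <, =) and (=, <, >); every row leads with "<" or "=", so that interchange is legal. Moving K to the outermost position instead gives rows (=, <, <) and (>, <, =); the second row's leftmost non-"=" entry is ">", so that permutation is illegal.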
Loop Interchange: Profitability
Profitability depends on the architecture.

      DO I = 1, N
         DO J = 1, M
            DO K = 1, L
S              A(I+1,J+1,K) = A(I,J,K) + B
            ENDDO
         ENDDO
      ENDDO

For SIMD machines with a large number of functional units:

      DO I = 1, N
S        A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
      ENDDO

Not suitable for vector register machines.
Loop Interchange: Profitability
For vector machines, we want to vectorize loops with stride-one memory access. Since Fortran stores arrays in column-major order, A(1,J,K) and A(2,J,K) are adjacent in memory, so it is useful to vectorize the I loop. Thus, transform to:

      DO J = 1, M
         DO K = 1, L
S           A(2:N+1,J+1,K) = A(1:N,J,K) + B
         ENDDO
      ENDDO
Loop Interchange: Profitability
MIMD machines with vector execution units want to cut down synchronization costs. Hence, shift the K loop to the outermost level:

      PARALLEL DO K = 1, L
         DO J = 1, M
            A(2:N+1,J+1,K) = A(1:N,J,K) + B
         ENDDO
      END PARALLEL DO
Scalar Expansion

      DO I = 1, N
S1       T = A(I)
S2       A(I) = B(I)
S3       B(I) = T
      ENDDO

Scalar expansion:

      DO I = 1, N
S1       T$(I) = A(I)
S2       A(I) = B(I)
S3       B(I) = T$(I)
      ENDDO
      T = T$(N)

leads to:

S1    T$(1:N) = A(1:N)
S2    A(1:N) = B(1:N)
S3    B(1:N) = T$(1:N)
      T = T$(N)
Scalar Expansion
However, it is not always profitable. Consider:

      DO I = 1, N
         T = T + A(I) + A(I+1)
         A(I) = T
      ENDDO

Scalar expansion gives us:

      T$(0) = T
      DO I = 1, N
S1       T$(I) = T$(I-1) + A(I) + A(I+1)
S2       A(I) = T$(I)
      ENDDO
      T = T$(N)

The true dependence of S1 on itself (T$(I) uses T$(I-1)) is a recurrence on values, which expansion cannot remove, so the loop still cannot be vectorized.
Scalar Expansion: Safety
Scalar expansion is always safe. When is it profitable?
— Naïve approach: expand all scalars, vectorize, then shrink all unnecessary expansions.
— However, we want to predict when expansion is profitable.
Dependences due to reuse of a memory location vs. reuse of values:
— Dependences due to reuse of values must be preserved.
— Dependences due to reuse of a memory location can be deleted by expansion (see the sketch below).
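A small illustration of the distinction (illustrative code, not from the slides):

      DO I = 1, N
S1       T = A(I) + B(I)
S2       C(I) = T
      ENDDO

The loop-carried antidependence S2 → S1 and the output dependence of S1 on itself exist only because every iteration reuses T's memory location; expanding T to T$(I) deletes them. The loop-independent flow dependence S1 → S2 communicates T's value and must be preserved; after expansion it remains, and the loop vectorizes.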
Scalar Expansion: Drawbacks
Expansion increases memory requirements. Solutions:
— Expand in a single loop.
— Strip mine the loop before expansion (see the sketch below).
— Forward substitution:

      DO I = 1, N
         T = A(I) + A(I+1)
         A(I) = T + B(I)
      ENDDO

becomes

      DO I = 1, N
         A(I) = A(I) + A(I+1) + B(I)
      ENDDO
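A sketch of strip mining before expansion (illustrative, assuming the swap loop from the Scalar Expansion slide and a block size of 64): the expanded temporary then needs only 64 elements rather than N.

      PARAMETER (NB = 64)
      REAL T$(NB)
      DO II = 1, N, NB
         IU = MIN(II+NB-1, N)
         T$(1:IU-II+1) = A(II:IU)
         A(II:IU)      = B(II:IU)
         B(II:IU)      = T$(1:IU-II+1)
      ENDDO
      T = B(N)

(The final assignment reproduces T's live-out value: after the swap, B(N) holds the old A(N).)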
Scalar Renaming

      DO I = 1, 100
S1       T = A(I) + B(I)
S2       C(I) = T + T
S3       T = D(I) - B(I)
S4       A(I+1) = T * T
      ENDDO

Renaming scalar T:

      DO I = 1, 100
S1       T1 = A(I) + B(I)
S2       C(I) = T1 + T1
S3       T2 = D(I) - B(I)
S4       A(I+1) = T2 * T2
      ENDDO
Scalar Renaming
will lead to:

S3    T2$(1:100) = D(1:100) - B(1:100)
S4    A(2:101) = T2$(1:100) * T2$(1:100)
S1    T1$(1:100) = A(1:100) + B(1:100)
S2    C(1:100) = T1$(1:100) + T1$(1:100)
      T = T2$(100)

(codegen places S3 and S4 first because the carried true dependence from S4's write of A(I+1) to S1's read of A(I) must be honored.)
Node Splitting
Sometimes renaming fails:

      DO I = 1, N
S1       A(I) = X(I+1) + X(I)
S2       X(I+1) = B(I) + 32
      ENDDO

The recurrence is kept intact by the renaming algorithm: S1's read of X(I+1) creates an antidependence S1 → S2, and S2's write of X(I+1) creates a carried true dependence S2 → S1, so the two statements form a cycle that renaming cannot break.
Node Splitting

      DO I = 1, N
S1       A(I) = X(I+1) + X(I)
S2       X(I+1) = B(I) + 32
      ENDDO

Break the critical antidependence: make a copy of the node from which the antidependence emanates.

      DO I = 1, N
S1'      X$(I) = X(I+1)
S1       A(I) = X$(I) + X(I)
S2       X(I+1) = B(I) + 32
      ENDDO

The recurrence is broken; this is vectorized to:

      X$(1:N) = X(2:N+1)
      X(2:N+1) = B(1:N) + 32
      A(1:N) = X$(1:N) + X(1:N)
Node Splitting
Determining a minimal set of critical antidependences is NP-complete, so doing a perfect job of node splitting is difficult. Heuristic:
— Select an antidependence.
— Delete it and test whether the dependence graph becomes acyclic.
— If it does, apply node splitting to break that antidependence.