1
Optimizing Compilers for Modern Architectures
Enhancing Fine-Grained Parallelism, Part II
Chapter 5 of Allen and Kennedy
2
Optimizing Compilers for Modern Architectures Seen So Far...
Uncovering potential vectorization in loops by
—Loop Interchange
—Scalar Expansion
—Scalar and Array Renaming
Safety and Profitability of these transformations
3
Optimizing Compilers for Modern Architectures Today's Talk...
More transformations
—Node Splitting
—Recognition of Reductions
—Index-Set Splitting
—Run-time Symbolic Resolution
—Loop Skewing
Unified framework to generate vector code
4
Optimizing Compilers for Modern Architectures Node Splitting
Sometimes Renaming fails:
   DO I = 1, N
S1:   A(I) = X(I+1) + X(I)
S2:   X(I+1) = B(I) + 32
   ENDDO
Recurrence kept intact by renaming algorithm
5
Optimizing Compilers for Modern Architectures Node Splitting
   DO I = 1, N
S1:   A(I) = X(I+1) + X(I)
S2:   X(I+1) = B(I) + 32
   ENDDO
Break the critical antidependence: make a copy of the node from which the antidependence emanates.
   DO I = 1, N
S1':  X$(I) = X(I+1)
S1:   A(I) = X$(I) + X(I)
S2:   X(I+1) = B(I) + 32
   ENDDO
Recurrence broken; vectorized to
   X$(1:N) = X(2:N+1)
   X(2:N+1) = B(1:N) + 32
   A(1:N) = X$(1:N) + X(1:N)
6
Optimizing Compilers for Modern Architectures Node Splitting Algorithm
Takes a constant loop-independent antidependence D:
—Add a new assignment x: T$ = source(D)
—Insert x before source(D)
—Replace source(D) with T$
—Make the corresponding changes in the dependence graph
7
Optimizing Compilers for Modern Architectures Node Splitting: Profitability
Not always profitable. For example:
   DO I = 1, N
S1:   A(I) = X(I+1) + X(I)
S2:   X(I+1) = A(I) + 32
   ENDDO
Node splitting gives
   DO I = 1, N
S1':  X$(I) = X(I+1)
S1:   A(I) = X$(I) + X(I)
S2:   X(I+1) = A(I) + 32
   ENDDO
Recurrence still not broken; the antidependence was not critical.
8
Optimizing Compilers for Modern Architectures Node Splitting
Determining a minimal set of critical antidependences is NP-complete, so doing a perfect job of node splitting is difficult.
Heuristic:
—Select an antidependence
—Delete it and test whether the dependence graph becomes acyclic
—If it does, apply node splitting to that antidependence (see the sketch below)
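A minimal sketch of one greedy reading of this heuristic, under an assumed adjacency-matrix representation of the dependence graph (the arrays DEP, ANTI, CRITICAL and the routine names are illustrative, not from Allen and Kennedy):

      SUBROUTINE SPLIT_HEURISTIC(DEP, ANTI, NS, CRITICAL)
! Tentatively delete each antidependence edge; if the rest of the graph
! is acyclic, the edge is critical and node splitting would be applied
! to it (the edge stays deleted, since splitting removes it from the cycle).
      INTEGER NS, I, J
      LOGICAL DEP(NS,NS), ANTI(NS,NS), CRITICAL(NS,NS), ACYCLIC
      CRITICAL = .FALSE.
      DO I = 1, NS
         DO J = 1, NS
            IF (ANTI(I,J) .AND. DEP(I,J)) THEN
               DEP(I,J) = .FALSE.
               IF (ACYCLIC(DEP, NS)) THEN
                  CRITICAL(I,J) = .TRUE.
               ELSE
                  DEP(I,J) = .TRUE.
               ENDIF
            ENDIF
         ENDDO
      ENDDO
      END

      LOGICAL FUNCTION ACYCLIC(DEP, NS)
! A graph is acyclic exactly when no node reaches itself in the
! transitive closure, computed here in the simplest cubic-time way.
      INTEGER NS, I, J, K
      LOGICAL DEP(NS,NS), REACH(NS,NS)
      REACH = DEP
      DO K = 1, NS
         DO J = 1, NS
            DO I = 1, NS
               IF (REACH(I,K) .AND. REACH(K,J)) REACH(I,J) = .TRUE.
            ENDDO
         ENDDO
      ENDDO
      ACYCLIC = .TRUE.
      DO I = 1, NS
         IF (REACH(I,I)) ACYCLIC = .FALSE.
      ENDDO
      END

Any cycle test would do; the transitive closure is simply the shortest one to write down.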
9
Optimizing Compilers for Modern Architectures Recognition of Reductions
Sum Reduction, Min/Max Reduction, Count Reduction: Vector ---> Single Element
   S = 0.0
   DO I = 1, N
      S = S + A(I)
   ENDDO
Not directly vectorizable.
10
Optimizing Compilers for Modern Architectures Recognition of Reductions
Assuming commutativity and associativity:
   S = 0.0
   DO k = 1, 4
      SUM(k) = 0.0
      DO I = k, N, 4
         SUM(k) = SUM(k) + A(I)
      ENDDO
      S = S + SUM(k)
   ENDDO
Distribute the k loop:
   S = 0.0
   DO k = 1, 4
      SUM(k) = 0.0
   ENDDO
   DO k = 1, 4
      DO I = k, N, 4
         SUM(k) = SUM(k) + A(I)
      ENDDO
   ENDDO
   DO k = 1, 4
      S = S + SUM(k)
   ENDDO
11
Optimizing Compilers for Modern Architectures Recognition of Reductions
After loop interchange:
   DO I = 1, N, 4
      DO k = I, min(I+3,N)
         SUM(k-I+1) = SUM(k-I+1) + A(k)
      ENDDO
   ENDDO
Vectorize to
   DO I = 1, N, 4
      SUM(1:4) = SUM(1:4) + A(I:I+3)
   ENDDO
12
Optimizing Compilers for Modern Architectures Recognition of Reductions
Useful for vector machines with a 4-stage pipeline: recognize the reduction and replace it by the efficient version.
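Putting the previous two slides together, a minimal sketch of what the recognized reduction is replaced with (the subroutine wrapper is illustrative, and N is assumed to be a multiple of 4; a scalar cleanup loop would handle any remainder):

      SUBROUTINE VSUM(A, N, S)
      INTEGER N, I
      REAL A(N), SUM(4), S
      SUM(1:4) = 0.0
! Each vector statement accumulates four independent partial sums,
! so successive additions do not wait on one another in the pipeline.
      DO I = 1, N, 4
         SUM(1:4) = SUM(1:4) + A(I:I+3)
      ENDDO
! Combine the partial sums into the final scalar result.
      S = SUM(1) + SUM(2) + SUM(3) + SUM(4)
      END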
13
Optimizing Compilers for Modern Architectures Recognition of Reductions
Properties of Reductions
—Reduce Vector/Array to one element
—No use of intermediate values
—Reduction operates on the vector and nothing else
14
Optimizing Compilers for Modern Architectures Recognition of Reductions
Reduction recognized by
—Presence of self true, output and anti dependences
—Absence of other true dependences
   DO I = 1, N
      S = S + A(I)
   ENDDO
is a reduction, while
   DO I = 1, N
      S = S + A(I)
      T(I) = S
   ENDDO
is not: the intermediate values of S are used by the assignment to T(I).
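The other reduction kinds listed on slide 9 are handled the same way; for instance, a max reduction can be replaced by interleaved partial maxima (an illustrative sketch, assuming N is a multiple of 4 and at least 4; the names are not from the book):

      SUBROUTINE VMAX(A, N, AMAX)
      INTEGER N, I
      REAL A(N), M4(4), AMAX
! Four interleaved partial maxima play the role of SUM(1:4) in the sum case.
      M4(1:4) = A(1:4)
      DO I = 5, N, 4
         M4(1:4) = MAX(M4(1:4), A(I:I+3))
      ENDDO
! Reduce the four partial maxima to the final result.
      AMAX = MAX(M4(1), M4(2), M4(3), M4(4))
      END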
15
Optimizing Compilers for Modern Architectures Index-set Splitting
Subdivide the loop into different iteration ranges to achieve partial parallelization
—Threshold Analysis [Strong SIV, Weak Crossing SIV]
—Loop Peeling [Weak Zero SIV]
—Section-based Splitting [variation of loop peeling]
16
Optimizing Compilers for Modern Architectures Threshold Analysis
   DO I = 1, 20
      A(I+20) = A(I) + B
   ENDDO
The dependence threshold (20) is not less than the trip count, so vectorize to
   A(21:40) = A(1:20) + B
With a longer loop the threshold still allows strip mining into vectorizable sections:
   DO I = 1, 100
      A(I+20) = A(I) + B
   ENDDO
Strip mine to
   DO I = 1, 100, 20
      DO i = I, I+19
         A(i+20) = A(i) + B
      ENDDO
   ENDDO
and vectorize the inner loop:
   DO I = 1, 100, 20
      A(I+20:I+39) = A(I:I+19) + B
   ENDDO
17
Optimizing Compilers for Modern Architectures Threshold Analysis
Crossing thresholds: the reads and writes trade places at the midpoint, so split the index set there.
   DO I = 1, 100
      A(101-I) = A(I) + B
   ENDDO
Strip mine to
   DO I = 1, 100, 50
      DO i = I, I+49
         A(101-i) = A(i) + B
      ENDDO
   ENDDO
Vectorize to
   DO I = 1, 100, 50
      A(101-I:52-I:-1) = A(I:I+49) + B
   ENDDO
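A quick way to convince yourself the split preserves the original semantics is an end-to-end comparison (an illustrative check, not from the book):

      PROGRAM CROSSCHECK
      REAL AS(100), AV(100), B
      INTEGER I
      B = 1.0
      DO I = 1, 100
         AS(I) = I
         AV(I) = I
      ENDDO
! Original loop, executed serially.
      DO I = 1, 100
         AS(101-I) = AS(I) + B
      ENDDO
! Split into two 50-iteration sections and vectorized.
      DO I = 1, 100, 50
         AV(101-I:52-I:-1) = AV(I:I+49) + B
      ENDDO
      PRINT *, 'match:', ALL(AS == AV)
      END

The program prints T: within each half the reads and writes do not overlap, and the second half reads values already produced by the first.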
18
Optimizing Compilers for Modern Architectures Loop Peeling
Source of the dependence is a single iteration:
   DO I = 1, N
      A(I) = A(I) + A(1)
   ENDDO
Loop peeled to
   A(1) = A(1) + A(1)
   DO I = 2, N
      A(I) = A(I) + A(1)
   ENDDO
Vectorize to
   A(1) = A(1) + A(1)
   A(2:N) = A(2:N) + A(1)
19
Optimizing Compilers for Modern Architectures Section-based Splitting
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
—S1 and S2 are tied into a recurrence (through B and A) carried by the I loop
—Only a portion of B is responsible for it
Partition the second loop into a loop that uses the result of S1 and a loop that does not:
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
20
Optimizing Compilers for Modern Architectures Section-based Splitting
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
S3 is now independent of S1 and S2, so loop distribute to
   DO I = 1, N
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
21
Optimizing Compilers for Modern Architectures Section-based Splitting
   DO I = 1, N
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
Vectorized to
   A(N/2+1:N, 2:N+1) = B(N/2+1:N, 1:N) + D
   DO I = 1, N
      B(1:N/2,I) = A(1:N/2,I) + C
      A(1:N/2,I+1) = B(1:N/2,I) + D
   ENDDO
22
Optimizing Compilers for Modern Architectures Run-time Symbolic Resolution
"Breaking Conditions":
   DO I = 1, N
      A(I+L) = A(I) + B(I)
   ENDDO
Transformed to
   IF (L.LE.0) THEN
      A(L+1:N+L) = A(1:N) + B(1:N)
   ELSE
      DO I = 1, N
         A(I+L) = A(I) + B(I)
      ENDDO
   ENDIF
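Why L.LE.0 is the right test can be checked empirically with a small illustrative program (not from the book; the array is padded so both signs of L stay in bounds):

      PROGRAM BREAKTEST
      INTEGER, PARAMETER :: N = 8
      REAL AS(-4:N+4), AV(-4:N+4), B(N)
      INTEGER I, L
      DO L = -2, 2, 4
! Reset both copies before each trial.
         DO I = -4, N+4
            AS(I) = I
            AV(I) = I
         ENDDO
         B = 1.0
! Serial loop semantics.
         DO I = 1, N
            AS(I+L) = AS(I) + B(I)
         ENDDO
! Vector statement semantics: all reads happen before any write.
         AV(L+1:N+L) = AV(1:N) + B(1:N)
         PRINT *, 'L =', L, '  match:', ALL(AS == AV)
      ENDDO
      END

For L = -2 the two versions agree, while for L = 2 they differ, which is exactly why the vector form must be guarded by the breaking condition.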
23
Optimizing Compilers for Modern Architectures Run-time Symbolic Resolution
Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete.
Heuristic:
—Identify when a critical dependence can be conditionally eliminated via a breaking condition
24
Optimizing Compilers for Modern Architectures Loop Skewing
Reshape the iteration space to uncover parallelism:
   DO I = 1, N
      DO J = 1, N
S:       A(I,J) = A(I-1,J) + A(I,J-1)
      ENDDO
   ENDDO
Direction vectors: (<,=) and (=,<)
Parallelism is not apparent: both loops carry a dependence.
25
Optimizing Compilers for Modern Architectures Loop Skewing Dependence Pattern before loop skewing
26
Optimizing Compilers for Modern Architectures Loop Skewing
Apply the transformation called loop skewing: j = J + I, i.e. J = j - I
   DO I = 1, N
      DO j = I+1, I+N
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
   ENDDO
Direction vectors: (<,<) and (=,<)
Note: the direction vectors change.
27
Optimizing Compilers for Modern Architectures Loop Skewing
The accesses to A have the following pattern (each assignment is listed with the skewed iteration (I,j) that executes it):
   A(1,1) = A(0,1) + A(1,0)    S(1,2)
   A(1,2) = A(0,2) + A(1,1)    S(1,3)
   A(1,3) = A(0,3) + A(1,2)    S(1,4)
   A(1,4) = A(0,4) + A(1,3)    S(1,5)
   A(2,1) = A(1,1) + A(2,0)    S(2,3)
   A(2,2) = A(1,2) + A(2,1)    S(2,4)
   A(2,3) = A(1,3) + A(2,2)    S(2,5)
   A(2,4) = A(1,4) + A(2,3)    S(2,6)
28
Optimizing Compilers for Modern Architectures Loop Skewing Dependence pattern after loop skewing
29
Optimizing Compilers for Modern Architectures Loop Skewing
   DO I = 1, N
      DO j = I+1, I+N
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
   ENDDO
Loop interchange to
   DO j = 2, N+N
      DO I = max(1,j-N), min(N,j-1)
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
   ENDDO
Vectorize to
   DO j = 2, N+N
      FORALL I = max(1,j-N), min(N,j-1)
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      END FORALL
   ENDDO
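The interchanged nest visits the iteration space along anti-diagonal wavefronts, and each wavefront depends only on the previous one, so it computes exactly what the original nest computes. A small illustrative check (program name and array sizes are not from the book):

      PROGRAM SKEWCHECK
      INTEGER, PARAMETER :: N = 6
      REAL A(0:N,0:N), B(0:N,0:N)
      INTEGER I, J, JS
      A = 1.0
      B = 1.0
! Original loop nest.
      DO I = 1, N
         DO J = 1, N
            A(I,J) = A(I-1,J) + A(I,J-1)
         ENDDO
      ENDDO
! Skewed and interchanged nest: JS indexes the wavefront,
! and the inner I loop is the one that can run in vector mode.
      DO JS = 2, N+N
         DO I = MAX(1,JS-N), MIN(N,JS-1)
            B(I,JS-I) = B(I-1,JS-I) + B(I,JS-I-1)
         ENDDO
      ENDDO
      PRINT *, 'match:', ALL(A == B)
      END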
30
Optimizing Compilers for Modern Architectures Loop Skewing
Disadvantages:
—Varying vector length; not profitable if N is small
—Not profitable if the vector startup time exceeds the time saved
—Vector bounds must be recomputed on each iteration of the outer loop
Apply loop skewing only if everything else fails.
31
Optimizing Compilers for Modern Architectures Putting It All Together
Good part
—Many transformations imply more choices to exploit parallelism
Bad part
—Choosing the right transformation
—How to automate the transformation selection process?
—Interference between transformations
32
Optimizing Compilers for Modern Architectures Putting It All Together
Example of interference:
   DO I = 1, N
      DO J = 1, M
         S(I) = S(I) + A(I,J)
      ENDDO
   ENDDO
Sum reduction gives
   DO I = 1, N
      S(I) = S(I) + SUM(A(I,1:M))
   ENDDO
while loop interchange and vectorization give
   DO J = 1, M
      S(1:N) = S(1:N) + A(1:N,J)
   ENDDO
33
Optimizing Compilers for Modern Architectures Putting It All Together
Any algorithm that ties all these transformations together must
—Take a global view of the transformed code
—Know the architecture of the target machine
Goal of our algorithm:
—Find ONE good vector loop [works well for most vector register architectures]
34
Optimizing Compilers for Modern Architectures Unified Framework
Detection: finding ALL loops for EACH statement that can be run in vector
Selection: choosing the best loop for vector execution for EACH statement
Transformation: carrying out the transformations necessary to vectorize the selected loop
35
Optimizing Compilers for Modern Architectures Unified Framework: Detection
procedure mark_loop(S, D)
   for each edge e in D deletable by scalar expansion, array and scalar
       renaming, node splitting or symbolic resolution do begin
      add e to deletable_edges;
      delete e from D;
   end
   mark_gen(S, 1, D);
   for each statement x in S with no vector loop marked do begin
      attempt Index-Set Splitting and loop skewing;
      mark vector loops found;
   end
   // Restore deletable edges from deletable_edges to D
end mark_loop
36
Optimizing Compilers for Modern Architectures Unified Framework: Detection
procedure mark_gen(S, k, D)
   // Variation of codegen; doesn't do vectorization, only marks vector loops
   for i = 1 to m do begin
      if Si is cyclic then
         if the outermost carried dependence is at level p > k then
            // Loop shifting
            mark all loops at level < p as vector for Si;
         else if Si is a reduction then
            mark loop k as vector; mark Si as a reduction;
         else begin
            // Recur at deeper level
            mark_gen(Si, k+1, Di);
         end
      else
         mark statements in Si as vector for loops k and deeper;
   end
end mark_gen
37
Optimizing Compilers for Modern Architectures Selection and Transformation
procedure transform_code(R, k, D)
   // Variation of codegen
   for i = 1 to m do begin
      if k is the index of the best vector loop for some statement in Ri then begin
         if Ri is cyclic then begin
            select_and_apply_transformation(Ri, k, D);
            // Retry vectorization on the new dependence graph
            transform_code(Ri, k, D);
         end
         else
            generate a vector statement for Ri in loop k;
      end
      else begin
         // Recur at deeper level;
         // generate the level-k DO and ENDDO statements
         transform_code(Ri, k+1, D);
      end
   end
end transform_code
38
Optimizing Compilers for Modern Architectures Selection of Transformations
procedure select_and_apply_transformation(Ri, k, D)
   if loop k does not carry a dependence in Ri then
      shift loop k to the innermost position;
   else if Ri is a reduction at level k then
      replace it with a reduction and adjust dependences;
   else // transform and adjust dependences
      if array renaming is possible then
         apply array renaming and adjust dependences;
      else if node splitting is possible then
         apply node splitting and adjust dependences;
      else if scalar expansion is possible then
         apply scalar expansion and adjust dependences;
      else
         apply loop skewing or index-set splitting and adjust dependences;
end select_and_apply_transformation
39
Optimizing Compilers for Modern Architectures Performance on Benchmark