1
Optimizing Compilers for Modern Architectures
Enhancing Fine-Grained Parallelism, Part II
Chapter 5 of Allen and Kennedy
2
Optimizing Compilers for Modern Architectures Seen So Far...
Uncovering potential vectorization in loops by
—Loop Interchange
—Scalar Expansion
—Scalar and Array Renaming
Safety and Profitability of these transformations
3
Optimizing Compilers for Modern Architectures Today's Talk...
More transformations
—Node Splitting
—Recognition of Reductions
—Index-Set Splitting
—Run-time Symbolic Resolution
—Loop Skewing
Unified framework to generate vector code
4
Optimizing Compilers for Modern Architectures Node Splitting
Sometimes Renaming fails:
   DO I = 1, N
S1:   A(I) = X(I+1) + X(I)
S2:   X(I+1) = B(I) + 32
   ENDDO
Recurrence kept intact by renaming algorithm
5
Optimizing Compilers for Modern Architectures Node Splitting
   DO I = 1, N
S1:   A(I) = X(I+1) + X(I)
S2:   X(I+1) = B(I) + 32
   ENDDO
Break the critical antidependence: make a copy of the node from which the antidependence emanates.
   DO I = 1, N
S1':  X$(I) = X(I+1)
S1:   A(I) = X$(I) + X(I)
S2:   X(I+1) = B(I) + 32
   ENDDO
Recurrence broken; vectorized to
   X$(1:N) = X(2:N+1)
   X(2:N+1) = B(1:N) + 32
   A(1:N) = X$(1:N) + X(1:N)
6
Optimizing Compilers for Modern Architectures Node Splitting Algorithm
Takes a constant loop-independent antidependence D:
—Add a new assignment x: T$ = source(D)
—Insert x before source(D)
—Replace source(D) with T$
—Make the corresponding changes in the dependence graph
7
Optimizing Compilers for Modern Architectures Node Splitting: Profitability
Not always profitable. For example:
   DO I = 1, N
S1:   A(I) = X(I+1) + X(I)
S2:   X(I+1) = A(I) + 32
   ENDDO
Node splitting gives
   DO I = 1, N
S1':  X$(I) = X(I+1)
S1:   A(I) = X$(I) + X(I)
S2:   X(I+1) = A(I) + 32
   ENDDO
Recurrence still not broken; the antidependence was not critical.
8
Optimizing Compilers for Modern Architectures Node Splitting
Determining a minimal set of critical antidependences is NP-complete, so doing a perfect job of node splitting is difficult.
Heuristic:
—Select an antidependence
—Delete it and test whether the dependence graph becomes acyclic
—If it does, apply node splitting to that antidependence (see the sketch below)
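A minimal sketch of one greedy reading of this heuristic, under an assumed adjacency-matrix representation of the dependence graph (the arrays DEP, ANTI, CRITICAL and the routine names are illustrative, not from Allen and Kennedy):

      SUBROUTINE SPLIT_HEURISTIC(DEP, ANTI, NS, CRITICAL)
! Tentatively delete each antidependence edge; if the rest of the graph
! is acyclic, the edge is critical and node splitting would be applied
! to it (the edge stays deleted, since splitting removes it from the cycle).
      INTEGER NS, I, J
      LOGICAL DEP(NS,NS), ANTI(NS,NS), CRITICAL(NS,NS), ACYCLIC
      CRITICAL = .FALSE.
      DO I = 1, NS
         DO J = 1, NS
            IF (ANTI(I,J) .AND. DEP(I,J)) THEN
               DEP(I,J) = .FALSE.
               IF (ACYCLIC(DEP, NS)) THEN
                  CRITICAL(I,J) = .TRUE.
               ELSE
                  DEP(I,J) = .TRUE.
               ENDIF
            ENDIF
         ENDDO
      ENDDO
      END

      LOGICAL FUNCTION ACYCLIC(DEP, NS)
! A graph is acyclic exactly when no node reaches itself in the
! transitive closure, computed here in the simplest cubic-time way.
      INTEGER NS, I, J, K
      LOGICAL DEP(NS,NS), REACH(NS,NS)
      REACH = DEP
      DO K = 1, NS
         DO J = 1, NS
            DO I = 1, NS
               IF (REACH(I,K) .AND. REACH(K,J)) REACH(I,J) = .TRUE.
            ENDDO
         ENDDO
      ENDDO
      ACYCLIC = .TRUE.
      DO I = 1, NS
         IF (REACH(I,I)) ACYCLIC = .FALSE.
      ENDDO
      END

Any cycle test would do; the transitive closure is simply the shortest one to write down.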
9
Optimizing Compilers for Modern Architectures Recognition of Reductions
Sum Reduction, Min/Max Reduction, Count Reduction: Vector ---> Single Element
   S = 0.0
   DO I = 1, N
      S = S + A(I)
   ENDDO
Not directly vectorizable.
10
Optimizing Compilers for Modern Architectures Recognition of Reductions
Assuming commutativity and associativity:
   S = 0.0
   DO k = 1, 4
      SUM(k) = 0.0
      DO I = k, N, 4
         SUM(k) = SUM(k) + A(I)
      ENDDO
      S = S + SUM(k)
   ENDDO
Distribute the k loop:
   S = 0.0
   DO k = 1, 4
      SUM(k) = 0.0
   ENDDO
   DO k = 1, 4
      DO I = k, N, 4
         SUM(k) = SUM(k) + A(I)
      ENDDO
   ENDDO
   DO k = 1, 4
      S = S + SUM(k)
   ENDDO
11
Optimizing Compilers for Modern Architectures Recognition of Reductions
After loop interchange:
   DO I = 1, N, 4
      DO k = I, min(I+3,N)
         SUM(k-I+1) = SUM(k-I+1) + A(k)
      ENDDO
   ENDDO
Vectorize to
   DO I = 1, N, 4
      SUM(1:4) = SUM(1:4) + A(I:I+3)
   ENDDO
12
Optimizing Compilers for Modern Architectures Recognition of Reductions
Useful for vector machines with a 4-stage pipeline: recognize the reduction and replace it by the efficient version.
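Putting the previous two slides together, a minimal sketch of what the recognized reduction is replaced with (the subroutine wrapper is illustrative, and N is assumed to be a multiple of 4; a scalar cleanup loop would handle any remainder):

      SUBROUTINE VSUM(A, N, S)
      INTEGER N, I
      REAL A(N), SUM(4), S
      SUM(1:4) = 0.0
! Each vector statement accumulates four independent partial sums,
! so successive additions do not wait on one another in the pipeline.
      DO I = 1, N, 4
         SUM(1:4) = SUM(1:4) + A(I:I+3)
      ENDDO
! Combine the partial sums into the final scalar result.
      S = SUM(1) + SUM(2) + SUM(3) + SUM(4)
      END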
13
Optimizing Compilers for Modern Architectures Recognition of Reductions
Properties of Reductions
—Reduce Vector/Array to one element
—No use of intermediate values
—Reduction operates on the vector and nothing else
14
Optimizing Compilers for Modern Architectures Recognition of Reductions
Reduction recognized by
—Presence of self true, output and anti dependences
—Absence of other true dependences
   DO I = 1, N
      S = S + A(I)
   ENDDO
is a reduction, while
   DO I = 1, N
      S = S + A(I)
      T(I) = S
   ENDDO
is not: the intermediate values of S are used by the assignment to T(I).
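The other reduction kinds listed on slide 9 are handled the same way; for instance, a max reduction can be replaced by interleaved partial maxima (an illustrative sketch, assuming N is a multiple of 4 and at least 4; the names are not from the book):

      SUBROUTINE VMAX(A, N, AMAX)
      INTEGER N, I
      REAL A(N), M4(4), AMAX
! Four interleaved partial maxima play the role of SUM(1:4) in the sum case.
      M4(1:4) = A(1:4)
      DO I = 5, N, 4
         M4(1:4) = MAX(M4(1:4), A(I:I+3))
      ENDDO
! Reduce the four partial maxima to the final result.
      AMAX = MAX(M4(1), M4(2), M4(3), M4(4))
      END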
15
Optimizing Compilers for Modern Architectures Index-set Splitting
Subdivide the loop into different iteration ranges to achieve partial parallelization
—Threshold Analysis [Strong SIV, Weak Crossing SIV]
—Loop Peeling [Weak Zero SIV]
—Section-based Splitting [variation of loop peeling]
16
Optimizing Compilers for Modern Architectures Threshold Analysis
   DO I = 1, 20
      A(I+20) = A(I) + B
   ENDDO
The dependence threshold (20) is not less than the trip count, so vectorize to
   A(21:40) = A(1:20) + B
With a longer loop the threshold still allows strip mining into vectorizable sections:
   DO I = 1, 100
      A(I+20) = A(I) + B
   ENDDO
Strip mine to
   DO I = 1, 100, 20
      DO i = I, I+19
         A(i+20) = A(i) + B
      ENDDO
   ENDDO
and vectorize the inner loop:
   DO I = 1, 100, 20
      A(I+20:I+39) = A(I:I+19) + B
   ENDDO
17
Optimizing Compilers for Modern Architectures Threshold Analysis
Crossing thresholds: the reads and writes trade places at the midpoint, so split the index set there.
   DO I = 1, 100
      A(101-I) = A(I) + B
   ENDDO
Strip mine to
   DO I = 1, 100, 50
      DO i = I, I+49
         A(101-i) = A(i) + B
      ENDDO
   ENDDO
Vectorize to
   DO I = 1, 100, 50
      A(101-I:52-I:-1) = A(I:I+49) + B
   ENDDO
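A quick way to convince yourself the split preserves the original semantics is an end-to-end comparison (an illustrative check, not from the book):

      PROGRAM CROSSCHECK
      REAL AS(100), AV(100), B
      INTEGER I
      B = 1.0
      DO I = 1, 100
         AS(I) = I
         AV(I) = I
      ENDDO
! Original loop, executed serially.
      DO I = 1, 100
         AS(101-I) = AS(I) + B
      ENDDO
! Split into two 50-iteration sections and vectorized.
      DO I = 1, 100, 50
         AV(101-I:52-I:-1) = AV(I:I+49) + B
      ENDDO
      PRINT *, 'match:', ALL(AS == AV)
      END

The program prints T: within each half the reads and writes do not overlap, and the second half reads values already produced by the first.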
18
Optimizing Compilers for Modern Architectures Loop Peeling
Source of the dependence is a single iteration:
   DO I = 1, N
      A(I) = A(I) + A(1)
   ENDDO
Loop peeled to
   A(1) = A(1) + A(1)
   DO I = 2, N
      A(I) = A(I) + A(1)
   ENDDO
Vectorize to
   A(1) = A(1) + A(1)
   A(2:N) = A(2:N) + A(1)
19
Optimizing Compilers for Modern Architectures Section-based Splitting
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
—S1 and S2 are tied into a recurrence (through B and A) carried by the I loop
—Only a portion of B is responsible for it
Partition the second loop into a loop that uses the result of S1 and a loop that does not:
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
20
Optimizing Compilers for Modern Architectures Section-based Splitting
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
S3 is now independent of S1 and S2, so loop distribute to
   DO I = 1, N
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
21
Optimizing Compilers for Modern Architectures Section-based Splitting
   DO I = 1, N
      DO J = N/2+1, N
S3:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
   DO I = 1, N
      DO J = 1, N/2
S1:      B(J,I) = A(J,I) + C
      ENDDO
      DO J = 1, N/2
S2:      A(J,I+1) = B(J,I) + D
      ENDDO
   ENDDO
Vectorized to
   A(N/2+1:N, 2:N+1) = B(N/2+1:N, 1:N) + D
   DO I = 1, N
      B(1:N/2,I) = A(1:N/2,I) + C
      A(1:N/2,I+1) = B(1:N/2,I) + D
   ENDDO
22
Optimizing Compilers for Modern Architectures Run-time Symbolic Resolution
"Breaking Conditions":
   DO I = 1, N
      A(I+L) = A(I) + B(I)
   ENDDO
Transformed to
   IF (L.LE.0) THEN
      A(L+1:N+L) = A(1:N) + B(1:N)
   ELSE
      DO I = 1, N
         A(I+L) = A(I) + B(I)
      ENDDO
   ENDIF
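Why L.LE.0 is the right test can be checked empirically with a small illustrative program (not from the book; the array is padded so both signs of L stay in bounds):

      PROGRAM BREAKTEST
      INTEGER, PARAMETER :: N = 8
      REAL AS(-4:N+4), AV(-4:N+4), B(N)
      INTEGER I, L
      DO L = -2, 2, 4
! Reset both copies before each trial.
         DO I = -4, N+4
            AS(I) = I
            AV(I) = I
         ENDDO
         B = 1.0
! Serial loop semantics.
         DO I = 1, N
            AS(I+L) = AS(I) + B(I)
         ENDDO
! Vector statement semantics: all reads happen before any write.
         AV(L+1:N+L) = AV(1:N) + B(1:N)
         PRINT *, 'L =', L, '  match:', ALL(AS == AV)
      ENDDO
      END

For L = -2 the two versions agree, while for L = 2 they differ, which is exactly why the vector form must be guarded by the breaking condition.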
23
Optimizing Compilers for Modern Architectures Run-time Symbolic Resolution
Identifying the minimum number of breaking conditions needed to break a recurrence is NP-complete.
Heuristic:
—Identify when a critical dependence can be conditionally eliminated via a breaking condition
24
Optimizing Compilers for Modern Architectures Loop Skewing
Reshape the iteration space to uncover parallelism:
   DO I = 1, N
      DO J = 1, N
S:       A(I,J) = A(I-1,J) + A(I,J-1)
      ENDDO
   ENDDO
Direction vectors: (<,=) and (=,<)
Parallelism is not apparent: both loops carry a dependence.
25
Optimizing Compilers for Modern Architectures Loop Skewing Dependence Pattern before loop skewing
26
Optimizing Compilers for Modern Architectures Loop Skewing
Apply the transformation called loop skewing: j = J + I, i.e. J = j - I
   DO I = 1, N
      DO j = I+1, I+N
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
   ENDDO
Direction vectors: (<,<) and (=,<)
Note: the direction vectors change.
27
Optimizing Compilers for Modern Architectures Loop Skewing
The accesses to A have the following pattern (each assignment is listed with the skewed iteration (I,j) that executes it):
   A(1,1) = A(0,1) + A(1,0)    S(1,2)
   A(1,2) = A(0,2) + A(1,1)    S(1,3)
   A(1,3) = A(0,3) + A(1,2)    S(1,4)
   A(1,4) = A(0,4) + A(1,3)    S(1,5)
   A(2,1) = A(1,1) + A(2,0)    S(2,3)
   A(2,2) = A(1,2) + A(2,1)    S(2,4)
   A(2,3) = A(1,3) + A(2,2)    S(2,5)
   A(2,4) = A(1,4) + A(2,3)    S(2,6)
28
Optimizing Compilers for Modern Architectures Loop Skewing Dependence pattern after loop skewing
29
Optimizing Compilers for Modern Architectures Loop Skewing
   DO I = 1, N
      DO j = I+1, I+N
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
   ENDDO
Loop interchange to
   DO j = 2, N+N
      DO I = max(1,j-N), min(N,j-1)
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      ENDDO
   ENDDO
Vectorize to
   DO j = 2, N+N
      FORALL I = max(1,j-N), min(N,j-1)
S:       A(I,j-I) = A(I-1,j-I) + A(I,j-I-1)
      END FORALL
   ENDDO
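The interchanged nest visits the iteration space along anti-diagonal wavefronts, and each wavefront depends only on the previous one, so it computes exactly what the original nest computes. A small illustrative check (program name and array sizes are not from the book):

      PROGRAM SKEWCHECK
      INTEGER, PARAMETER :: N = 6
      REAL A(0:N,0:N), B(0:N,0:N)
      INTEGER I, J, JS
      A = 1.0
      B = 1.0
! Original loop nest.
      DO I = 1, N
         DO J = 1, N
            A(I,J) = A(I-1,J) + A(I,J-1)
         ENDDO
      ENDDO
! Skewed and interchanged nest: JS indexes the wavefront,
! and the inner I loop is the one that can run in vector mode.
      DO JS = 2, N+N
         DO I = MAX(1,JS-N), MIN(N,JS-1)
            B(I,JS-I) = B(I-1,JS-I) + B(I,JS-I-1)
         ENDDO
      ENDDO
      PRINT *, 'match:', ALL(A == B)
      END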
30
Optimizing Compilers for Modern Architectures Loop Skewing
Disadvantages:
—Varying vector length; not profitable if N is small
—Not profitable if the vector startup time exceeds the time saved
—Vector bounds must be recomputed on each iteration of the outer loop
Apply loop skewing only if everything else fails.
31
Optimizing Compilers for Modern Architectures Putting It All Together
Good part
—Many transformations imply more choices to exploit parallelism
Bad part
—Choosing the right transformation
—How to automate the transformation selection process?
—Interference between transformations
32
Optimizing Compilers for Modern Architectures Putting It All Together
Example of interference:
   DO I = 1, N
      DO J = 1, M
         S(I) = S(I) + A(I,J)
      ENDDO
   ENDDO
Sum reduction gives
   DO I = 1, N
      S(I) = S(I) + SUM(A(I,1:M))
   ENDDO
while loop interchange and vectorization give
   DO J = 1, M
      S(1:N) = S(1:N) + A(1:N,J)
   ENDDO
33
Optimizing Compilers for Modern Architectures Putting It All Together
Any algorithm that ties all these transformations together must
—Take a global view of the transformed code
—Know the architecture of the target machine
Goal of our algorithm:
—Find ONE good vector loop [works well for most vector register architectures]
34
Optimizing Compilers for Modern Architectures Unified Framework
Detection: finding ALL loops for EACH statement that can be run in vector
Selection: choosing the best loop for vector execution for EACH statement
Transformation: carrying out the transformations necessary to vectorize the selected loop
35
Optimizing Compilers for Modern Architectures Unified Framework: Detection
procedure mark_loop(S, D)
   for each edge e in D deletable by scalar expansion, array and scalar
       renaming, node splitting or symbolic resolution do begin
      add e to deletable_edges;
      delete e from D;
   end
   mark_gen(S, 1, D);
   for each statement x in S with no vector loop marked do begin
      attempt Index-Set Splitting and loop skewing;
      mark vector loops found;
   end
   // Restore deletable edges from deletable_edges to D
end mark_loop
36
Optimizing Compilers for Modern Architectures Unified Framework: Detection
procedure mark_gen(S, k, D)
   // Variation of codegen; doesn't do vectorization, only marks vector loops
   for i = 1 to m do begin
      if Si is cyclic then
         if the outermost carried dependence is at level p > k then
            // Loop shifting
            mark all loops at level < p as vector for Si;
         else if Si is a reduction then
            mark loop k as vector; mark Si as a reduction;
         else begin
            // Recur at deeper level
            mark_gen(Si, k+1, Di);
         end
      else
         mark statements in Si as vector for loops k and deeper;
   end
end mark_gen
37
Optimizing Compilers for Modern Architectures Selection and Transformation
procedure transform_code(R, k, D)
   // Variation of codegen
   for i = 1 to m do begin
      if k is the index of the best vector loop for some statement in Ri then begin
         if Ri is cyclic then begin
            select_and_apply_transformation(Ri, k, D);
            // Retry vectorization on the new dependence graph
            transform_code(Ri, k, D);
         end
         else
            generate a vector statement for Ri in loop k;
      end
      else begin
         // Recur at deeper level;
         // generate the level-k DO and ENDDO statements
         transform_code(Ri, k+1, D);
      end
   end
end transform_code
38
Optimizing Compilers for Modern Architectures Selection of Transformations
procedure select_and_apply_transformation(Ri, k, D)
   if loop k does not carry a dependence in Ri then
      shift loop k to the innermost position;
   else if Ri is a reduction at level k then
      replace it with a reduction and adjust dependences;
   else // transform and adjust dependences
      if array renaming is possible then
         apply array renaming and adjust dependences;
      else if node splitting is possible then
         apply node splitting and adjust dependences;
      else if scalar expansion is possible then
         apply scalar expansion and adjust dependences;
      else
         apply loop skewing or index-set splitting and adjust dependences;
end select_and_apply_transformation
39
Optimizing Compilers for Modern Architectures Performance on Benchmark