Creating Coarse-Grained Parallelism
Chapter 6 of Allen and Kennedy
Dan Guez
Introduction
Previous lectures: fine-grained parallelism
  Superscalar and vector architectures
  Parallelizing inner loops
This lecture: coarse-grained parallelism
  Symmetric multiprocessor (SMP) architecture
  Parallelizing outer loops
SMP Architecture
Multiple asynchronous processors
Shared memory
Synchronization required!
  Communication between processors
  Barrier as the synchronization mechanism
  Expensive!
Roadmap
Single-loop methods:
  Privatization
  Alignment
  Loop fusion
Coarse-Grained vs. Fine-Grained
  Fine-grained         Coarse-grained
  Scalar expansion     Privatization
  Loop distribution    Alignment, loop fusion
Scalar Expansion - Reminder
Scalar expansion eliminates loop-carried dependences and enables vectorization.
Original loop:
    DO I = 1, N
      T = A(I)
      A(I) = B(I)
      B(I) = T
    ENDDO
After scalar expansion:
    DO I = 1, N
      T(I) = A(I)
      A(I) = B(I)
      B(I) = T(I)
    ENDDO
After vectorization:
    T(1:N) = A(1:N)
    A(1:N) = B(1:N)
    B(1:N) = T(1:N)
Privatization
Privatization eliminates loop-carried dependences.
Original loop:
    DO I = 1, N
      T = A(I)
      A(I) = B(I)
      B(I) = T
    ENDDO
After privatization:
    PARALLEL DO I = 1, N
      PRIVATE t
      t = A(I)
      A(I) = B(I)
      B(I) = t
    END PARALLEL DO
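A minimal Python sketch of why privatization helps, assuming that a parallel loop may run its iterations in any order. The function names and the shuffled-order simulation are illustrative and not part of the slides; a real PARALLEL DO would run the iterations on separate processors.

```python
import random

def swap_shared(a, b):
    """Original loop: one temporary T shared by all iterations, run in order."""
    a, b = list(a), list(b)
    for i in range(len(a)):
        t = a[i]
        a[i] = b[i]
        b[i] = t
    return a, b

def swap_private(a, b, order):
    """Privatized loop: every iteration owns its temporary, so any order works."""
    a, b = list(a), list(b)

    def body(i):
        t = a[i]          # t is local to this call, i.e. private to iteration i
        a[i] = b[i]
        b[i] = t

    for i in order:       # 'order' stands in for an arbitrary parallel schedule
        body(i)
    return a, b

a, b = [1, 2, 3, 4], [10, 20, 30, 40]
order = list(range(len(a)))
random.shuffle(order)
result = swap_private(a, b, order)
```

Whatever permutation `order` holds, `result` matches the sequential loop, because no iteration reads a temporary written by another iteration.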
Privatization - Definition
A scalar variable x defined within a loop is said to be privatizable with respect to that loop if and only if every path from the beginning of the loop body to a use of x within the loop body must pass through a definition of x before reaching that use.
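Privatization – formal solution
For each block x in the loop body, define the set up(x) of upward-exposed variables by the equation
\[ up(x) = use(x) \cup \Bigl( \neg def(x) \cap \bigcup_{y \in succ(x)} up(y) \Bigr) \]
  use(x): the set of all variables used within block x that have no prior definition within the block
  def(x): the set of variables defined in x
  succ(x): the successors of x within the loop body
Solve the above equations …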
Privatization – formal solution
The set of private variables is
\[ private = \neg up(b_0) \cap \bigcup_{y \in B} def(y) \]
  B: the collection of loop body blocks
  b_0: the entry block of the loop body
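The upward-exposed analysis can be sketched in Python as a small fixed-point iteration. This assumes the usual backward data-flow formulation (a variable is upward exposed in a block if the block uses it before defining it, or if a successor exposes it and the block does not define it); the two-block loop body, the variable names, and the function names are hypothetical, and the loop's back edge is deliberately left out of `succ`.

```python
def upward_exposed(blocks, succ, use, defs):
    """Iterate up(x) = use(x) | (union of up(y), y in succ(x), minus defs(x))."""
    up = {x: set(use[x]) for x in blocks}
    changed = True
    while changed:
        changed = False
        for x in blocks:
            incoming = set()
            for y in succ[x]:
                incoming |= up[y]
            new = use[x] | (incoming - defs[x])
            if new != up[x]:
                up[x] = new
                changed = True
    return up

def private_variables(blocks, succ, use, defs, entry):
    """Variables defined in the loop body that are not upward exposed at entry."""
    up = upward_exposed(blocks, succ, use, defs)
    defined = set().union(*(defs[x] for x in blocks))
    return defined - up[entry]

# Hypothetical loop body:  b0: T = A(I)    b1: S = S + T
blocks = ["b0", "b1"]
succ = {"b0": ["b1"], "b1": []}
use = {"b0": set(), "b1": {"S", "T"}}
defs = {"b0": {"T"}, "b1": {"S"}}
private = private_variables(blocks, succ, use, defs, "b0")
```

Here T is privatizable, while S is not: S is read in b1 before any definition reaches it, so its value flows in from the previous iteration.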
Privatization - Theorem
A variable x defined in a loop may be made private if and only if the SSA graph for the variable does not have a Φ-node at the entry to the loop.
Loop Distribution
Distributing loops using codegen.
Original loop nest:
    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO
After codegen:
    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
      ENDDO
      DO J = 1, 100
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO
Loop Distribution - vector architectures
Distributed loop nest:
    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
      ENDDO
      DO J = 1, 100
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO
Vectorized:
    DO I = 1, 100
      A(I,1:100) = B(I,1:100) + C(I,1:100)
      D(I,1:100) = A(I,0:99) * 2
    ENDDO
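Loop Distribution - SMP architectures
On an SMP, the two distributed inner loops must be separated by a barrier in every outer iteration:
    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
      ENDDO
      Barrier()
      DO J = 1, 100
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO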
The solution - Alignment
    DO I = 2, N
    S1  A(I) = B(I) + C(I)
    S2  D(I) = A(I-1) * 2
    ENDDO
[Figure: instances of S1 and S2 plotted against I = 2 .. N]
Basic Alignment
Original loop:
    DO I = 2, N
    S1  A(I) = B(I) + C(I)
    S2  D(I) = A(I-1) * 2
    ENDDO
Aligned loop:
    DO I = 1, N
    S1  IF (I > 1) A(I) = B(I) + C(I)
    S2  IF (I < N) D(I+1) = A(I) * 2
    ENDDO
[Figure: instances of S1 and S2 plotted against I = 1 .. N]
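Optimized Alignment – Option 1
Aligned loop with guards:
    DO I = 1, N
    S1  IF (I > 1) A(I) = B(I) + C(I)
    S2  IF (I < N) D(I+1) = A(I) * 2
    ENDDO
Guards removed by wrapping the first iteration of S1 around to I = N:
    DO I = 1, N-1
      J = I; IF (I == 1) J = N
      A(J) = B(J) + C(J)
      D(I+1) = A(I) * 2
    ENDDO
[Figure: instances of S1 and S2 plotted against I = 1 .. N]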
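Optimized Alignment – Option 2
Aligned loop with guards:
    DO I = 1, N
    S1  IF (I > 1) A(I) = B(I) + C(I)
    S2  IF (I < N) D(I+1) = A(I) * 2
    ENDDO
Guards removed by peeling the first and last iterations:
    D(2) = A(1) * 2
    DO I = 2, N-1
      A(I) = B(I) + C(I)
      D(I+1) = A(I) * 2
    ENDDO
    A(N) = B(N) + C(N)
[Figure: instances of S1 and S2 plotted against I = 1 .. N]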
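A small Python check of the peeled form (Option 2), written as a sketch with hypothetical function names. Arrays are 1-based with an unused slot 0 to mirror the Fortran indexing, and running the aligned loop in reverse order stands in for an arbitrary parallel schedule.

```python
def original(a, b, c, d, n):
    """DO I = 2, N:  A(I) = B(I) + C(I);  D(I) = A(I-1) * 2"""
    a, d = list(a), list(d)
    for i in range(2, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2
    return a, d

def aligned_peeled(a, b, c, d, n, order):
    """Option 2: first and last iterations peeled, no guards inside the loop."""
    a, d = list(a), list(d)
    d[2] = a[1] * 2
    for i in order:                 # any permutation of 2 .. N-1
        a[i] = b[i] + c[i]
        d[i + 1] = a[i] * 2         # A(I) was produced in this same iteration
    a[n] = b[n] + c[n]
    return a, d

n = 6
a = [0, 5, 0, 0, 0, 0, 0]
b = [0, 1, 2, 3, 4, 5, 6]
c = [0, 10, 20, 30, 40, 50, 60]
d = [0] * (n + 1)
reverse_order = list(range(n - 1, 1, -1))   # N-1 down to 2
same = aligned_peeled(a, b, c, d, n, reverse_order) == original(a, b, c, d, n)
```

The aligned loop gives the same arrays even when its iterations run backwards, which is what makes it safe to run as a parallel loop without a barrier.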
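Alignment problems
Recurrence: impossible to align
Different dependence distances: alignment fails (alignment conflict)
Original loop:
    DO I = 1, N
      A(I+1) = B(I) + C
      X(I) = A(I+1) + A(I)
    ENDDO
Attempted alignment:
    DO I = 0, N
      IF (I > 0) A(I+1) = B(I) + C
      IF (I < N) X(I+1) = A(I+2) + A(I+1)
    ENDDO
Aligning the distance-1 dependence on A(I) misaligns the distance-0 dependence on A(I+1): the second statement now needs A(I+2), which the first statement produces only in the next iteration.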
Code Replication
Code replication solves the alignment conflict.
Original loop:
    DO I = 1, N
      A(I+1) = B(I) + C
      X(I) = A(I+1) + A(I)
    ENDDO
With replication:
    DO I = 1, N
      A(I+1) = B(I) + C
      IF (I == 1) THEN
        t = A(I)
      ELSE
        t = B(I-1) + C
      ENDIF
      X(I) = A(I+1) + t
    ENDDO
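The replicated loop can be checked with a short Python sketch. The function names are hypothetical, arrays are 1-based with an unused slot 0, and the reversed iteration order stands in for an arbitrary parallel schedule.

```python
def original(a, b, c, n):
    """DO I = 1, N:  A(I+1) = B(I) + C;  X(I) = A(I+1) + A(I)"""
    a, x = list(a), [0] * (n + 1)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + c
        x[i] = a[i + 1] + a[i]
    return a, x

def replicated(a, b, c, n, order):
    """Recompute the value of A(I) locally instead of reading it from memory."""
    a, x = list(a), [0] * (n + 1)
    for i in order:
        a[i + 1] = b[i] + c
        t = a[i] if i == 1 else b[i - 1] + c   # replica of the first statement
        x[i] = a[i + 1] + t
    return a, x

n, c = 5, 100
a = [0, 7, 0, 0, 0, 0, 0]          # A(1) .. A(N+1), slot 0 unused
b = [0, 1, 2, 3, 4, 5]
backwards = list(range(n, 0, -1))
same = replicated(a, b, c, n, backwards) == original(a, b, c, n)
```

The price of removing the carried dependence is that B(I-1) + C is computed twice, once for A and once for t.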
Alignment Graph
G = (V,E): a directed acyclic graph
V: the set of loop body statements
  Labeling o(v): vertex offset
E: the set of dependences
  Labeling d(e): dependence distance
Alignment Graph - Example
    DO I = 1, N
    S1  A(I+2) = B(I) + C
    S2  X(I+1) = A(I) + D
    S3  Y(I) = A(I+1) + X(I)
    ENDDO
[Figure: alignment graph with edges S1→S2 (d=2), S1→S3 (d=1), S2→S3 (d=1); initial offset o = 0]
Alignment Goal
The graph G = (V,E) is said to be carry-free if for each edge e = (u,v):
    o(u) + d(e) = o(v)
The alignment procedure takes an alignment graph and generates a carry-free alignment graph.
Align Procedure
    While V is not empty
      Add an arbitrary vertex v from V to the worklist W
      While W is not empty
        Remove a vertex u from W
        Align all vertices adjacent to u; replicate a node if different alignments are required
        Add the newly aligned nodes to W
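Align Procedure - Example
[Figure: the example graph after Align, with vertices S1, S2, S3 and a replicated vertex S1'; vertex offsets o = 0, o = -1 and o = -3; edge distances d=2 and d=1]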
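The worklist procedure can be sketched in Python as below. This is one possible reading of the slides, not the book's exact algorithm: the graph is a list of (source, sink, distance) triples, a conflict is always resolved by replicating the source statement, and replicas are named by appending an apostrophe. All of these representation choices are assumptions made for illustration.

```python
from collections import deque

def align(vertices, edges):
    """Return (offset, edges) for a carry-free version of the alignment graph.

    vertices: statement names, tried as start vertices in this order.
    edges: (source, sink, distance) triples of a DAG.
    """
    edges = list(edges)
    offset = {}
    for start in vertices:
        if start in offset:
            continue                      # already aligned from an earlier start
        offset[start] = 0
        work = deque([start])
        while work:
            u = work.popleft()
            k = 0
            while k < len(edges):         # the edge list may grow while we scan
                src, dst, d = edges[k]
                k += 1
                if src == u:
                    other, want = dst, offset[u] + d
                elif dst == u:
                    other, want = src, offset[u] - d
                else:
                    continue
                if other not in offset:   # first visit: adopt the needed offset
                    offset[other] = want
                    work.append(other)
                elif offset[other] != want:
                    # Alignment conflict: replicate the source statement.
                    clone = src + "'"
                    while clone in offset:
                        clone += "'"
                    offset[clone] = offset[dst] - d
                    edges[k - 1] = (clone, dst, d)
                    # The replica needs the same inputs as the original.
                    edges.extend([(w, clone, dw) for (w, x, dw) in edges if x == src])
                    work.append(clone)
    return offset, edges

graph = [("S1", "S2", 2), ("S1", "S3", 1), ("S2", "S3", 1)]
offset, aligned = align(["S3", "S2", "S1"], graph)
```

Starting from S3, this run produces the offsets 0, -1 and -3 with one replica of S1. Which of the two copies of S1 is called the replica depends on the traversal order; either labeling satisfies o(u) + d(e) = o(v) on every edge.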
GenAlign Procedure
Set variables:
  hi: maximal vertex offset
  lo: minimal vertex offset
  Ivar: original iteration variable
  Lvar: original loop lower bound
  Uvar: original loop upper bound
Generate the loop statement "DO Ivar = Lvar-hi, Uvar-lo"
GenAlign Procedure cont.
Scan the vertices in topological order. Let v be the current vertex.
  if o(v) = lo then generate "IF (Ivar >= Lvar-o(v)) THEN" + the statement of v with Ivar+o(v) substituted for Ivar
  else if o(v) = hi then generate "IF (Ivar <= Uvar-o(v)) THEN" …
  else generate "IF (Ivar >= Lvar-o(v) AND Ivar <= Uvar-o(v)) THEN" …
GenAlign Procedure cont.
If v is a replicated vertex, replace the statement S with the following:
    "THEN
       t_v = RHS(S) with Ivar+o(v) substituted for Ivar
     ELSE
       t_v = LHS(S) with Ivar+o(v) substituted for Ivar
     ENDIF"
where t_v is a new unique scalar.
Replace the reference at the sink of every dependence from v by t_v.
GenAlign - Example
Original loop:
    DO I = 1, N
    S1  A(I+2) = B(I) + C
    S2  X(I+1) = A(I) + D
    S3  Y(I) = A(I+1) + X(I)
    ENDDO
Generated loop:
    DO I = 1, N+3
    S1  IF (I >= 4) A(I-1) = B(I-3) + C
    S1' IF (I >= 2 AND I <= N+1) THEN
          t = B(I-1) + C
        ELSE
          t = A(I+1)
        ENDIF
    S2  IF (I >= 2 AND I <= N+1) X(I) = A(I-1) + D
    S3  IF (I <= N) Y(I) = t + X(I)
    ENDDO
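The generated loop can be compared against the original with a short Python sketch. The function names and test data are hypothetical; arrays are 1-based with an unused slot 0, A is padded by two extra elements because the ELSE branch reads A(I+1) for I up to N+3, and a shuffled iteration order stands in for a parallel schedule.

```python
import random

def original(a, b, x, c, d, n):
    a, x, y = list(a), list(x), [0] * (n + 1)
    for i in range(1, n + 1):
        a[i + 2] = b[i] + c            # S1
        x[i + 1] = a[i] + d            # S2
        y[i] = a[i + 1] + x[i]         # S3
    return a, x, y

def aligned(a, b, x, c, d, n, order):
    a, x, y = list(a), list(x), [0] * (n + 1)
    for i in order:                    # any permutation of 1 .. N+3
        if i >= 4:
            a[i - 1] = b[i - 3] + c    # S1, offset -3
        if 2 <= i <= n + 1:
            t = b[i - 1] + c           # S1', offset -1: replica feeding S3
        else:
            t = a[i + 1]               # only used when i == 1
        if 2 <= i <= n + 1:
            x[i] = a[i - 1] + d        # S2, offset -1
        if i <= n:
            y[i] = t + x[i]            # S3, offset 0
    return a, x, y

n, c, d = 7, 100, 1000
a = [0, 11, 12] + [0] * (n + 2)        # A(1) .. A(N+4); only A(1), A(2) preset
b = list(range(n + 1))                 # B(1) .. B(N) = 1 .. N
x = [0, 5] + [0] * n                   # X(1) .. X(N+1); only X(1) preset
order = list(range(1, n + 4))
random.shuffle(order)
same = aligned(a, b, x, c, d, n, order) == original(a, b, x, c, d, n)
```

Every value read in an aligned iteration is either produced earlier in that same iteration or never written by the loop at all, so the iteration order does not matter.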
Loop Fusion - Motivation
Original loop:
    DO I = 1, N
      A(I) = B(I) + 1
      C(I) = A(I) + C(I-1)
      D(I) = A(I) + X
    ENDDO
After distribution:
    DO I = 1, N              (parallelizable)
      A(I) = B(I) + 1
    ENDDO
    DO I = 1, N              (serial)
      C(I) = A(I) + C(I-1)
    ENDDO
    DO I = 1, N              (parallelizable)
      D(I) = A(I) + X
    ENDDO
Loop Fusion - Motivation
Before fusion:
    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
    END PARALLEL DO
    DO I = 1, N
      C(I) = A(I) + C(I-1)
    ENDDO
    PARALLEL DO I = 1, N
      D(I) = A(I) + X
    END PARALLEL DO
After fusion:
    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
      D(I) = A(I) + X
    END PARALLEL DO
    DO I = 1, N
      C(I) = A(I) + C(I-1)
    ENDDO
Loop Fusion – Graphical View
[Figure: dependence graph with nodes L1, L2, L3; after fusion, nodes L1,3 and L2]
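Loop Fusion - Safety Constraints
Fusion-preventing dependence constraint: two loops cannot be fused if the fused loop would have a backward loop-carried dependence.
Before fusion:
    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
    END PARALLEL DO
    PARALLEL DO I = 1, N
      D(I) = A(I+1) + X
    END PARALLEL DO
After fusion (illegal: D(I) reads A(I+1) before iteration I+1 has written it):
    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
      D(I) = A(I+1) + X
    END PARALLEL DO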
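A small Python sketch of the fusion-preventing case, with hypothetical function names and test data. Arrays are 1-based with an unused slot 0, and A has one extra element because D(N) reads A(N+1).

```python
def unfused(a, b, x, n):
    a, d = list(a), [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + 1
    for i in range(1, n + 1):
        d[i] = a[i + 1] + x        # sees the new A(I+1) from the first loop
    return a, d

def fused(a, b, x, n):
    a, d = list(a), [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + 1
        d[i] = a[i + 1] + x        # A(I+1) has not been written yet
    return a, d

n, x = 4, 100
a = [0] * (n + 2)
b = [0, 1, 2, 3, 4]
d_unfused = unfused(a, b, x, n)[1]
d_fused = fused(a, b, x, n)[1]
```

Even in plain sequential order the fused loop computes different values of D, so this fusion is unsafe regardless of how the loop is scheduled.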
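Loop Fusion - Safety Constraints
Ordering constraint: two loops cannot be fused if there is a path of loop-independent dependences between them that passes through a loop that is not being fused with them.
[Figure: dependence graph with nodes L1, L2, L3; fusing L1 and L3 into L1,3 is ruled out when a dependence path runs through L2]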
Loop Fusion - Profitability Constraints
Separation constraint: do not fuse parallel loops with sequential loops …
Parallelism-inhibiting constraint: do not fuse two parallel loops if the fused loop would have a forward loop-carried dependence.
Before fusion:
    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
    END PARALLEL DO
    PARALLEL DO I = 1, N
      D(I) = A(I-1) + X
    END PARALLEL DO
After fusion (legal, but no longer parallel: iteration I reads A(I-1), written by iteration I-1):
    DO I = 1, N
      A(I) = B(I) + 1
      D(I) = A(I-1) + X
    ENDDO
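The parallelism-inhibiting case can be illustrated with a Python sketch in which the iteration order is a parameter. The function names and data are hypothetical, and the reversed order is just one example of a schedule a parallel loop would be allowed to pick.

```python
def unfused(a, b, x, n):
    a, d = list(a), [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + 1
    for i in range(1, n + 1):
        d[i] = a[i - 1] + x
    return a, d

def fused(a, b, x, n, order):
    a, d = list(a), [0] * (n + 1)
    for i in order:
        a[i] = b[i] + 1
        d[i] = a[i - 1] + x        # needs A(I-1) from the previous iteration
    return a, d

n, x = 4, 100
a = [7, 0, 0, 0, 0]                # A(0) .. A(N)
b = [0, 1, 2, 3, 4]
in_order = fused(a, b, x, n, [1, 2, 3, 4])
reordered = fused(a, b, x, n, [4, 3, 2, 1])
```

In sequential order the fused loop matches the two separate loops, so fusion is safe; under any other order it does not, so the fused loop would have to stay sequential and the parallelism of both original loops would be lost.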
Typed Fusion - Definition
P = (G, T, m, B, t_0)
  G = (V,E): directed acyclic graph (dependence graph)
  T: set of types (parallel, sequential)
  m: V → T, mapping of vertices to types
  B: set of bad edges (constraints)
  t_0: objective type (parallel)
Main Data Structures
num[n]: the number of the node that contains n in the fused graph
maxBadPrev[n]: the maximal number of a vertex of type t_0 in the fused graph that cannot be fused with n
  "The maximal fused vertex that n is preventing from being further fused"
TypedFusion Example
[Figure: example dependence graph with numbered loop nodes]
update_successors(n)
    Set t = type(n)
    For each edge (n,m) do
      If (t != t_0)
        maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])
      Else
        If (type(m) != t_0 or (n,m) in B)
          maxBadPrev[m] = MAX(maxBadPrev[m], num[n])
        Else
          maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])
TypedFusion Procedure
    Scan the vertices in topological order
    For each vertex v
      If v is of type t_0
        fuse v with the first available node numbered above maxBadPrev[v], or create a new node if there is none (this sets num[v])
        update_successors(v)
      Else
        create_new_node(v)
        update_successors(v)
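A compact Python sketch of the procedure, assuming the vertices are already given in topological order and that num[v] is assigned before the successors of v are updated (update_successors reads num[v]). The data representation, the function name, and the loop names L1, L2, L3 are illustrative choices, not the book's code.

```python
def typed_fusion(nodes, edges, node_type, bad_edges, t0):
    """Return num: vertex -> number of the fused node that contains it.

    nodes must be listed in topological order; edges are (source, sink) pairs.
    """
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
    max_bad_prev = {n: 0 for n in nodes}
    num = {}
    fused = []                     # numbers of fused nodes of type t0, ascending
    last = 0
    for n in nodes:
        if node_type[n] == t0:
            # First fused node of type t0 that lies after every bad predecessor.
            candidates = [p for p in fused if p > max_bad_prev[n]]
            if candidates:
                num[n] = candidates[0]
            else:
                last += 1
                num[n] = last
                fused.append(last)
        else:
            last += 1              # other types always get a node of their own
            num[n] = last
        # update_successors(n)
        for m in succ[n]:
            if node_type[n] == t0 and (node_type[m] != t0 or (n, m) in bad_edges):
                bound = num[n]
            else:
                bound = max_bad_prev[n]
            max_bad_prev[m] = max(max_bad_prev[m], bound)
    return num

types = {"L1": "parallel", "L2": "sequential", "L3": "parallel"}
loops = ["L1", "L2", "L3"]
free = typed_fusion(loops, [("L1", "L2"), ("L1", "L3")], types, set(), "parallel")
ordered = typed_fusion(loops, [("L1", "L2"), ("L1", "L3"), ("L2", "L3")],
                       types, set(), "parallel")
```

In `free`, L1 and L3 share a fused node while the sequential L2 keeps its own, matching the motivation example. In `ordered`, the extra dependence L2→L3 forces L3 into a node after L2, which is the ordering constraint at work.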
TypedFusion Example
[Figure: example graph annotated with (maxBadPrev, num) pairs: (0,1) (0,2) (1,3) (1,4) (1,5) (1,4) (4,6)]
TypedFusion Example
[Figure: the graph after fusing the parallel loops; fused nodes numbered (1) to (6)]
Fusing sequential loops …
[Figure: the graph after also fusing the sequential loops; fused nodes {1,3}, {2,4,6}, {5,8}, {7}]
Ordered Fusion
More than two types of loops: different loop headers
Choosing the "right" order in which to fuse the types is NP-hard
Assign a priority to each type
Type conflict
Ordered Fusion
Cohort Fusion
Allows sequential loops to run together with parallel loops
Settings:
  Single type
  Bad edges:
    Fusion-preventing edges
    Parallelism-inhibiting edges
    Edges mixing parallel and sequential loops
Cohort Fusion
Pros:
  Minimal number of barriers
Cons:
  Bad load balancing