1
Creating Coarse-Grained Parallelism
  Chapter 6 of Allen and Kennedy
  Dan Guez
2
Introduction
  Previous lectures: Fine-Grained Parallelism
    Superscalar and vector architectures
    Parallelizing inner loops
  This lecture: Coarse-Grained Parallelism
    Symmetric Multiprocessor (SMP) architecture
    Parallelizing outer loops
3
SMP Architecture
  Multiple asynchronous processors
  Shared memory
  Synchronization required!
    Communication between processors
    Barrier as synchronization mechanism
    Expensive!
4
Roadmap Single Loop Methods Privatization Alignment Loop Fusion
5
Coarse-Grained vs. Fine-Grained
  Fine-Grained         Coarse-Grained
  Scalar Expansion     Privatization
  Loop Distribution    Alignment
                       Loop Fusion
6
Roadmap Privatization Alignment Loop Fusion
7
Scalar Expansion - Reminder
  Scalar expansion eliminates loop-carried dependences and enables vectorization.

    DO I = 1, N
      T = A(I)
      A(I) = B(I)
      B(I) = T
    ENDDO

  After scalar expansion:

    DO I = 1, N
      T(I) = A(I)
      A(I) = B(I)
      B(I) = T(I)
    ENDDO

  After vectorization:

    T(1:N) = A(1:N)
    A(1:N) = B(1:N)
    B(1:N) = T(1:N)
8
Privatization
  Privatization eliminates loop-carried dependences.

    DO I = 1, N
      T = A(I)
      A(I) = B(I)
      B(I) = T
    ENDDO

  After privatization:

    PARALLEL DO I = 1, N
      PRIVATE t
      t = A(I)
      A(I) = B(I)
      B(I) = t
    END PARALLEL DO
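The effect of privatization can be sketched in Python. This is an illustration only: `swap_lockstep` is a hypothetical helper that models one possible parallel schedule, in which every iteration executes statement 1 before any iteration executes statement 2. With a single shared temporary the swap breaks; with one temporary per iteration it does not.

```python
def swap_lockstep(a, b, private):
    """Run all iterations 'at once', one statement at a time (one possible schedule)."""
    a, b = list(a), list(b)
    n = len(a)
    # One slot per iteration when private, otherwise a single shared slot.
    t = [None] * n if private else [None]
    slot = (lambda i: i) if private else (lambda i: 0)
    for i in range(n):          # statement 1 of every iteration: T = A(I)
        t[slot(i)] = a[i]
    for i in range(n):          # statement 2 of every iteration: A(I) = B(I)
        a[i] = b[i]
    for i in range(n):          # statement 3 of every iteration: B(I) = T
        b[i] = t[slot(i)]
    return a, b

print(swap_lockstep([1, 2, 3], [10, 20, 30], private=True))   # ([10, 20, 30], [1, 2, 3])
print(swap_lockstep([1, 2, 3], [10, 20, 30], private=False))  # ([10, 20, 30], [3, 3, 3])
```

The shared version loses every value of A except the last one written to the temporary, which is the loop-carried dependence that privatization removes.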
9
Privatization - Definition
  A scalar variable x defined within a loop is said to be privatizable with respect to that loop if and only if every path from the beginning of the loop body to a use of x within the loop body must pass through a definition of x before reaching that use.
10
Privatization – formal solution
  For each block x in the loop body, define an equation for its upward-exposed variables, using:
    use(x) – the set of all variables used within block x that have no prior definitions within the block
    def(x) – the set of variables defined in x
  Solve the above equations …
11
Privatization – formal solution
  The set of private variables is then computed from:
    B – the collection of loop body blocks
    b0 – the entry block to the loop body
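The equations themselves did not survive in this transcript. A minimal sketch, assuming the usual upward-exposure formulation: up(x) = use(x) ∪ ((variables not in def(x)) ∩ the union of up(y) over successors y of x), with the private set taken as the variables defined somewhere in B that are not in up(b0). The function name, the block encoding, and the successor map below are assumptions made for illustration.

```python
def privatizable_scalars(blocks, succ, entry):
    """blocks: name -> (use set, def set); succ: name -> successor names inside the loop body.

    Assumed equations (not taken from the slides):
      up(x)   = use(x) | (union of up(y) for y in succ(x)) - def(x)
      private = (union of def(y) for y in blocks) - up(entry)
    """
    up = {name: set(use) for name, (use, _) in blocks.items()}
    changed = True
    while changed:                      # iterate to a fixed point
        changed = False
        for name, (use, defs) in blocks.items():
            incoming = set()
            for s in succ.get(name, []):
                incoming |= up[s]
            new = set(use) | (incoming - set(defs))
            if new != up[name]:
                up[name] = new
                changed = True
    all_defs = set()
    for _, defs in blocks.values():
        all_defs |= set(defs)
    return all_defs - up[entry]

# Swap loop: T is defined before it is used, so it is not upward exposed.
swap = {"body": (set(), {"T"})}
print(privatizable_scalars(swap, {}, "body"))            # {'T'}

# T is defined on only one branch, so a use at the join is upward exposed.
cond = {
    "entry": (set(), set()),
    "then": (set(), {"T"}),
    "else": (set(), set()),
    "join": ({"T"}, set()),
}
edges = {"entry": ["then", "else"], "then": ["join"], "else": ["join"]}
print(privatizable_scalars(cond, edges, "entry"))        # set()
```

Only scalars are modeled here; the `use` sets are expected to already exclude variables that have a prior definition in the same block, as in the definition of use(x).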
12
Privatization - Theorem
  A variable x defined in a loop may be made private if and only if the SSA graph for the variable does not have a Φ-node at the entry to the loop.
13
Roadmap Privatization Alignment Loop Fusion
14
Loop Distribution
  Distributing loops using codegen

    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO

  codegen produces:

    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
      ENDDO
      DO J = 1, 100
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO
15
Loop Distribution - vector architectures

    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
      ENDDO
      DO J = 1, 100
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO

  Vectorized:

    DO I = 1, 100
      A(I,1:100) = B(I,1:100) + C(I,1:100)
      D(I,1:100) = A(I,0:99) * 2
    ENDDO
16
Loop Distribution - SMP architectures

    DO I = 1, 100
      DO J = 1, 100
        A(I,J) = B(I,J) + C(I,J)
      ENDDO
      Barrier()
      DO J = 1, 100
        D(I,J) = A(I,J-1) * 2
      ENDDO
    ENDDO
17
The solution - Alignment

    DO I = 2, N
S1    A(I) = B(I) + C(I)
S2    D(I) = A(I-1) * 2
    ENDDO

  [Figure: instances of S1 and S2 plotted against I = 2 .. N]
18
Basic Alignment

    DO I = 2, N
S1    A(I) = B(I) + C(I)
S2    D(I) = A(I-1) * 2
    ENDDO

  Aligned:

    DO I = 1, N
S1    IF (I > 1) A(I) = B(I) + C(I)
S2    IF (I < N) D(I+1) = A(I) * 2
    ENDDO

  [Figure: instances of S1 and S2 plotted against I = 1 .. N]
19
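Optimized Alignment – Option 1

    DO I = 1, N
S1    IF (I > 1) A(I) = B(I) + C(I)
S2    IF (I < N) D(I+1) = A(I) * 2
    ENDDO

  Optimized (the first iteration computes A(N) in place of the guarded-out S1):

    DO I = 1, N-1
      J = I
      IF (I == 1) J = N
      A(J) = B(J) + C(J)
      D(I+1) = A(I) * 2
    ENDDO

  [Figure: instances of S1 and S2 plotted against I = 1 .. N]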
20
Optimized Alignment – Option 2

    DO I = 1, N
S1    IF (I > 1) A(I) = B(I) + C(I)
S2    IF (I < N) D(I+1) = A(I) * 2
    ENDDO

  Optimized (first and last iterations peeled):

    D(2) = A(1) * 2
    DO I = 2, N-1
      A(I) = B(I) + C(I)
      D(I+1) = A(I) * 2
    ENDDO
    A(N) = B(N) + C(N)

  [Figure: instances of S1 and S2 plotted against I = 1 .. N]
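A quick way to convince oneself that the peeled form is equivalent to the original loop is to simulate both. The sketch below is an assumption-laden illustration: arrays are Python lists padded so that index 0 is unused, the function names are hypothetical, and the body of the aligned loop is run in a shuffled order to mimic the absence of carried dependences.

```python
import random

def align_original(a, b, c, n):
    """Original loop: DO I = 2, N ; A(I) = B(I)+C(I) ; D(I) = A(I-1)*2."""
    a = list(a)
    d = [0] * (n + 1)
    for i in range(2, n + 1):
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2
    return a, d

def align_peeled(a, b, c, n, order=None):
    """Option 2: peeled first/last iterations; body iterations may run in any order."""
    a = list(a)
    d = [0] * (n + 1)
    d[2] = a[1] * 2                     # peeled instance of S2
    body = list(range(2, n)) if order is None else order
    for i in body:
        a[i] = b[i] + c[i]
        d[i + 1] = a[i] * 2             # reads the value written in this same iteration
    a[n] = b[n] + c[n]                  # peeled instance of S1
    return a, d

n = 8
a = [0] + [random.randint(0, 9) for _ in range(n)]
b = [0] + [random.randint(0, 9) for _ in range(n)]
c = [0] + [random.randint(0, 9) for _ in range(n)]
order = list(range(2, n))
random.shuffle(order)
print(align_peeled(a, b, c, n, order) == align_original(a, b, c, n))  # True
```

The sketch assumes N is at least 2, as the peeled statements do.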
21
Alignment problems
  Recurrence – impossible to align
  Different dependence distances – alignment fails (alignment conflict)

    DO I = 1, N
      A(I+1) = B(I) + C
      X(I) = A(I+1) + A(I)
    ENDDO

  Attempted alignment:

    DO I = 0, N
      IF (I > 0) A(I+1) = B(I) + C
      IF (I < N) X(I+1) = A(I+2) + A(I+1)
    ENDDO
22
Code Replication
  Solves the alignment conflict

    DO I = 1, N
      A(I+1) = B(I) + C
      X(I) = A(I+1) + A(I)
    ENDDO

  With replication:

    DO I = 1, N
      A(I+1) = B(I) + C
      IF (I == 1) THEN
        t = A(I)
      ELSE
        t = B(I-1) + C
      ENDIF
      X(I) = A(I+1) + t
    ENDDO
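The replicated loop can be checked against the original by simulation. This is a sketch under stated assumptions: lists are padded so index 0 is unused, `c` is a scalar as on the slide, the function names are hypothetical, and the replicated loop is run in an arbitrary iteration order with `t` local to each iteration.

```python
import random

def repl_original(a, b, c, n):
    """Original loop: A(I+1) = B(I)+C ; X(I) = A(I+1)+A(I)."""
    a = list(a)
    x = [0] * (n + 1)
    for i in range(1, n + 1):
        a[i + 1] = b[i] + c
        x[i] = a[i + 1] + a[i]
    return a, x

def repl_replicated(a, b, c, n, order):
    """Replicated loop: A(I) is recomputed into a private t instead of being read."""
    a = list(a)
    x = [0] * (n + 1)
    for i in order:
        a[i + 1] = b[i] + c
        if i == 1:
            t = a[i]                    # A(1) is never written by the loop
        else:
            t = b[i - 1] + c            # replica of the statement that defines A(I)
        x[i] = a[i + 1] + t
    return a, x

n = 7
a = [0] + [random.randint(0, 9) for _ in range(n + 1)]   # indices 1 .. n+1
b = [0] + [random.randint(0, 9) for _ in range(n)]
order = list(range(1, n + 1))
random.shuffle(order)
print(repl_replicated(a, b, 5, n, order) == repl_original(a, b, 5, n))  # True
```

The price of removing the carried dependence is that `B(I-1) + C` is computed twice for most iterations.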
23
Alignment Graph
  G = (V,E) – directed acyclic graph
  V – set of loop body statements
    Labeling o(v) – vertex offset
  E – set of dependences
    Labeling d(e) – dependence distance
24
Alignment Graph - Example

    DO I = 1, N
S1    A(I+2) = B(I) + C
S2    X(I+1) = A(I) + D
S3    Y(I) = A(I+1) + X(I)
    ENDDO

  [Figure: alignment graph over S1, S2, S3 with edge labels d=2, d=1 and vertex label o = 0]
25
Alignment Goal
  The graph G = (V,E) is said to be carry-free if for each edge e = (u,v): o(u) + d(e) = o(v)
  The alignment procedure takes an alignment graph and generates a carry-free alignment graph.
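The carry-free condition is easy to state as code. In the sketch below the edge list is derived by hand from the three-statement example loop (S1 writes A(I+2), S2 reads A(I), S3 reads A(I+1) and X(I)); the offsets, the replica name `S1'`, and the function name are illustrative assumptions.

```python
def is_carry_free(offsets, edges):
    """True when o(u) + d(e) == o(v) holds for every edge e = (u, v, d)."""
    return all(offsets[u] + d == offsets[v] for u, v, d in edges)

# Dependences of the example loop: (source, sink, distance).
edges = [("S1", "S2", 2), ("S2", "S3", 1), ("S1", "S3", 1)]

# No single offset for S1 satisfies both of its outgoing edges.
print(is_carry_free({"S1": -3, "S2": -1, "S3": 0}, edges))   # False

# Replicating S1 as S1' and rerouting the S1 -> S3 edge removes the conflict.
edges_repl = [("S1", "S2", 2), ("S2", "S3", 1), ("S1'", "S3", 1)]
offsets_repl = {"S1": -3, "S1'": -1, "S2": -1, "S3": 0}
print(is_carry_free(offsets_repl, edges_repl))               # True
```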
26
Align Procedure
  While V is not empty
    Add an arbitrary vertex v from V to the worklist W
    While W is not empty
      Remove a vertex u from the worklist W
      Align all vertices adjacent to u, replicating a node if different alignments are required
      Add the newly aligned nodes to W
27
Align Procedure - Example
  [Figure: alignment graph over S1, S2, S3 and the replicated vertex S1'; edge labels d=2, d=1, d=2; vertex offsets o = 0, o = -1, o = -3]
28
GenAlign Procedure
  Set variables
    hi – maximal vertex offset
    lo – minimal vertex offset
    Ivar – original iteration variable
    Lvar – original loop lower bound
    Uvar – original loop upper bound
  Generate the loop statement "DO Ivar = Lvar-hi, Uvar-lo"
29
GenAlign Procedure cont.
  Scan the vertices in topological order
  Let v be the current vertex
    if o(v) = lo then
      generate "IF (Ivar >= Lvar-o(v)) THEN" + the statement of v with Ivar+o(v) substituted for Ivar
    else if o(v) = hi then
      generate "IF (Ivar <= Uvar-o(v)) THEN" …
    else
      generate "IF (Ivar >= Lvar-o(v) AND Ivar <= Uvar-o(v)) THEN" …
30
GenAlign Procedure cont.
  If v is a replicated vertex, replace the statement S with the following:
    "THEN
       tv = RHS(S) with Ivar+o(v) substituted for Ivar
     ELSE
       tv = LHS(S) with Ivar+o(v) substituted for Ivar
     ENDIF"
  where tv is a new unique scalar.
  Replace the reference at the sink of every dependence from v by tv.
31
GenAlign - Example

    DO I = 1, N
S1    A(I+2) = B(I) + C
S2    X(I+1) = A(I) + D
S3    Y(I) = A(I+1) + X(I)
    ENDDO

  Generated code:

    DO I = 1, N+3
S1    IF (I >= 4) A(I-1) = B(I-3) + C
S1'   IF (I >= 2 AND I <= N+1) THEN
        t = B(I-1) + C
      ELSE
        t = A(I+1)
      ENDIF
S2    IF (I >= 2 AND I <= N+1) X(I) = A(I-1) + D
S3    IF (I <= N) Y(I) = t + X(I)
    ENDDO
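The generated loop can be simulated to check that it matches the original and carries no dependence. This is a sketch with assumptions: lists are padded so index 0 is unused, A is allocated with extra trailing elements because the ELSE branch of S1' reads A(I+1) for I up to N+3 (the value is unused there), and the function names are hypothetical.

```python
import random

def gen_original(a, x, y, b, c, d, n):
    a, x, y = list(a), list(x), list(y)
    for i in range(1, n + 1):
        a[i + 2] = b[i] + c             # S1
        x[i + 1] = a[i] + d             # S2
        y[i] = a[i + 1] + x[i]          # S3
    return a, x, y

def gen_aligned(a, x, y, b, c, d, n, order):
    """Aligned loop over I = 1 .. N+3, run in an arbitrary iteration order."""
    a, x, y = list(a), list(x), list(y)
    for i in order:
        if i >= 4:
            a[i - 1] = b[i - 3] + c     # S1, offset -3
        if 2 <= i <= n + 1:
            t = b[i - 1] + c            # S1', replica with offset -1
        else:
            t = a[i + 1]                # falls back to the original array element
        if 2 <= i <= n + 1:
            x[i] = a[i - 1] + d         # S2, offset -1
        if i <= n:
            y[i] = t + x[i]             # S3, offset 0
    return a, x, y

n = 6
a = [0] + [random.randint(0, 9) for _ in range(n + 4)]   # indices 1 .. n+4
x = [0] + [random.randint(0, 9) for _ in range(n + 1)]   # indices 1 .. n+1
y = [0] * (n + 1)
b = [0] + [random.randint(0, 9) for _ in range(n)]
order = list(range(1, n + 4))
random.shuffle(order)
print(gen_aligned(a, x, y, b, 2, 3, n, order) == gen_original(a, x, y, b, 2, 3, n))  # True
```

Every value an iteration reads is either produced in that same iteration or never written by the loop, which is why the shuffled order gives the same result.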
32
Roadmap Privatization Alignment Loop Fusion
33
Loop Fusion - Motivation

    DO I = 1, N
      A(I) = B(I) + 1
      C(I) = A(I) + C(I-1)
      D(I) = A(I) + X
    ENDDO

  After distribution:

    DO I = 1, N                  (parallelizable)
      A(I) = B(I) + 1
    ENDDO
    DO I = 1, N                  (serial)
      C(I) = A(I) + C(I-1)
    ENDDO
    DO I = 1, N                  (parallelizable)
      D(I) = A(I) + X
    ENDDO
34
Loop Fusion - Motivation

    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
    END PARALLEL DO
    DO I = 1, N
      C(I) = A(I) + C(I-1)
    ENDDO
    PARALLEL DO I = 1, N
      D(I) = A(I) + X
    END PARALLEL DO

  After fusion:

    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
      D(I) = A(I) + X
    END PARALLEL DO
    DO I = 1, N
      C(I) = A(I) + C(I-1)
    ENDDO
35
Loop Fusion – Graphical View
  [Figure: loops L1, L2, L3 before fusion; after fusion, L1 and L3 are merged into L1,3 alongside L2]
36
Loop Fusion - Safety Constraints
  Fusion-preventing dependence constraint: fusion is illegal when the fused loop would contain a backward loop-carried dependence.

    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
    END PARALLEL DO
    PARALLEL DO I = 1, N
      D(I) = A(I+1) + X
    END PARALLEL DO

  Illegal fusion:

    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
      D(I) = A(I+1) + X
    END PARALLEL DO
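Why this fusion is unsafe shows up immediately in a simulation: in the fused loop, D(I) reads A(I+1) before it has been updated. The sketch below uses hypothetical function names and lists padded so that index 0 is unused; even a plain sequential execution of the fused loop gives a different D.

```python
def fuse_separate(a, b, x, n):
    """The two original loops, run one after the other."""
    a = list(a)
    d = [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + 1
    for i in range(1, n + 1):
        d[i] = a[i + 1] + x             # sees the updated A(I+1)
    return a, d

def fuse_fused(a, b, x, n):
    """The (illegal) fused loop, run sequentially."""
    a = list(a)
    d = [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + 1
        d[i] = a[i + 1] + x             # sees the OLD A(I+1): semantics changed
    return a, d

n = 4
a = [0] * (n + 2)                       # indices 1 .. n+1
b = [0, 10, 20, 30, 40]
print(fuse_separate(a, b, 100, n)[1])   # [0, 121, 131, 141, 100]
print(fuse_fused(a, b, 100, n)[1])      # [0, 100, 100, 100, 100]
```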
37
Loop Fusion - Safety Constraints
  Ordering constraint: two loops cannot be fused if a path of loop-independent dependences between them passes through a loop that cannot be fused with them.
  [Figure: loops L1, L2, L3; fusing L1 and L3 into L1,3 alongside L2 is crossed out]
38
Loop Fusion - Profitability Constraints
  Separation constraint: do not fuse parallel loops with sequential loops …
  Parallelism-inhibiting constraint: do not fuse when the fused loop would have a forward loop-carried dependence.

    PARALLEL DO I = 1, N
      A(I) = B(I) + 1
    END PARALLEL DO
    PARALLEL DO I = 1, N
      D(I) = A(I-1) + X
    END PARALLEL DO

  Fusion would produce a loop that carries a dependence from the write of A(I) to the read of A(I-1), so it could no longer run in parallel:

    DO I = 1, N
      A(I) = B(I) + 1
      D(I) = A(I-1) + X
    ENDDO
39
Typed Fusion - Definition
  P = (G, T, m, B, t0)
    G = (V,E) – directed acyclic graph (dependence graph)
    T – set of types (parallel, sequential)
    m: V → T – mapping of vertices to types
    B – set of bad edges (constraints)
    t0 – objective type (parallel)
40
Main Data Structures
  num[n] – the number of the fused-graph node that contains n
  maxBadPrev[n] – the maximal number of a type-t0 vertex in the fused graph that cannot be fused with n
    "The maximal fused vertex that n is preventing from being further fused"
41
TypedFusion Example
  [Figure: example dependence graph and its fused result]
42
update_successors(n)
  Set t = type(n)
  For each edge (n,m) do
    If (t != t0)
      maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])
    Else If (type(m) != t0 or (n,m) in B)
      maxBadPrev[m] = MAX(maxBadPrev[m], num[n])
    Else
      maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])
43
TypedFusion Procedure
  Scan the vertices in topological order
  For each vertex v
    If v is of type t0
      update_successors(v)
      fuse v with the first possible available node
    Else
      update_successors(v)
      create_new_node(v)
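The procedure and update_successors can be sketched together in Python. This is one reading of the slides, not a definitive implementation: the function name, the graph encoding, and the small three-loop graphs are hypothetical, num[v] is assigned before the successors are updated (update_successors needs it), and "first possible available node" is taken to mean the lowest-numbered t0 group whose number exceeds maxBadPrev[v].

```python
from collections import deque

def typed_fusion(nodes, edges, node_type, bad_edges, t0):
    """Return the fused groups in order of creation (group k has number k+1)."""
    succ = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    max_bad_prev = {n: 0 for n in nodes}
    num = {n: 0 for n in nodes}
    groups = []                 # fused nodes; numbers are 1-based
    t0_numbers = []             # numbers of groups of type t0, ascending
    work = deque(n for n in nodes if indeg[n] == 0)
    while work:                 # visits vertices in a topological order
        n = work.popleft()
        target = None
        if node_type[n] == t0:
            for k in t0_numbers:            # first group n is allowed to join
                if k > max_bad_prev[n]:
                    target = k
                    break
        if target is None:                  # create a new node in the fused graph
            groups.append([n])
            num[n] = len(groups)
            if node_type[n] == t0:
                t0_numbers.append(num[n])
        else:                               # fuse n into an existing group
            groups[target - 1].append(n)
            num[n] = target
        for m in succ[n]:                   # update_successors(n)
            if node_type[n] != t0:
                bound = max_bad_prev[n]
            elif node_type[m] != t0 or (n, m) in bad_edges:
                bound = num[n]
            else:
                bound = max_bad_prev[n]
            max_bad_prev[m] = max(max_bad_prev[m], bound)
            indeg[m] -= 1
            if indeg[m] == 0:
                work.append(m)
    return groups

types = {"L1": "parallel", "L2": "sequential", "L3": "parallel"}
# L1 feeds both L2 and L3 (the motivation example): L1 and L3 fuse.
print(typed_fusion(["L1", "L2", "L3"], [("L1", "L2"), ("L1", "L3")], types, set(), "parallel"))
# [['L1', 'L3'], ['L2']]
# A chain L1 -> L2 -> L3 through the sequential loop: the ordering constraint blocks fusion.
print(typed_fusion(["L1", "L2", "L3"], [("L1", "L2"), ("L2", "L3")], types, set(), "parallel"))
# [['L1'], ['L2'], ['L3']]
```

The returned list reflects creation order only; a final topological sort of the fused graph would still be needed to obtain an execution order.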
44
TypedFusion Example
  [Figure: example dependence graph with vertices 1 to 8]
45
[Figure: the same graph with vertices 1 to 8, each annotated with its (maxBadPrev, num) pair as the algorithm proceeds; pairs shown include (0,1), (0,2), (1,3), (1,4), (1,5), (1,4), (4,6)]
46
TypedFusion Example
  [Figure: resulting graph with fused nodes {1, 3} and {5, 8} and single nodes 2, 4, 6, 7; fused-graph nodes numbered (1) to (6)]
47
Fusing sequential loops …
  [Figure: resulting graph with nodes {1, 3}, {2, 4, 6}, {5, 8} and 7, numbered (1), (3), (4), (6)]
48
Ordered Fusion
  More than two types of loops: different loop headers
  Choosing the "right" order of types to fuse is NP-hard
  Assign a priority to each type
49
Type conflict
  [Figure: example graph with vertices 1 to 4]
50
Ordered Fusion
  [Figure: example graph with vertices 1 to 6]
51
Cohort Fusion
  [Figure: example graph with vertices 1 to 8]
52
Allows sequential loops to run together with parallel loops
  Settings:
    Single type
    Bad edges:
      Fusion-preventing edges
      Parallelism-inhibiting edges
      Edges mixing parallel and sequential loops
53
Cohort Fusion
  Pros: minimal number of barriers
  Cons: bad load balancing