Creating Coarse-Grained Parallelism Chapter 6 of Allen and Kennedy Dan Guez.

Introduction
Previous lectures: Fine-Grained Parallelism
    Superscalar and vector architectures
    Parallelizing inner loops
This lecture: Coarse-Grained Parallelism
    Symmetric Multi-Processor (SMP) architecture
    Parallelizing outer loops

SMP Architecture
Multiple asynchronous processors
Shared memory
Synchronization required!
    Communication between processors
    Barrier as synchronization mechanism
    Expensive!

Roadmap
Single-loop methods
    Privatization
    Alignment
    Loop Fusion

Coarse-Grained vs. Fine-Grained
Fine-Grained         Coarse-Grained
Scalar Expansion     Privatization
Loop Distribution    Alignment
                     Loop Fusion

Roadmap
    Privatization
    Alignment
    Loop Fusion

Scalar Expansion - Reminder
Scalar expansion eliminates loop-carried dependences and enables vectorization.
Original loop:
    DO I = 1, N
       T = A(I)
       A(I) = B(I)
       B(I) = T
    ENDDO
After scalar expansion:
    DO I = 1, N
       T(I) = A(I)
       A(I) = B(I)
       B(I) = T(I)
    ENDDO
After vectorization:
    T(1:N) = A(1:N)
    A(1:N) = B(1:N)
    B(1:N) = T(1:N)

Privatization
Privatization eliminates loop-carried dependences by giving each iteration its own copy of the scalar.
Original loop:
    DO I = 1, N
       T = A(I)
       A(I) = B(I)
       B(I) = T
    ENDDO
After privatization:
    PARALLEL DO I = 1, N
       PRIVATE t
       t = A(I)
       A(I) = B(I)
       B(I) = t
    END PARALLEL DO
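The effect of PRIVATE can be imitated in Python, used here only as an executable stand-in for the slide notation. This is a minimal sketch, not the lecture's code: the function name `parallel_swap` and the use of a thread pool are assumptions. The temporary `t` is a local variable of the loop body, so every iteration gets its own copy and no iteration can observe another iteration's `t`.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_swap(a, b, workers=4):
    """Swap the contents of a and b element-wise, one task per iteration."""
    def body(i):
        t = a[i]      # t is local to this call: the analogue of PRIVATE t
        a[i] = b[i]
        b[i] = t

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any exception in a task is raised here
        list(pool.map(body, range(len(a))))
    return a, b
```

If `t` were instead a single variable shared by all tasks, two iterations could overwrite each other's saved value, which is exactly the loop-carried dependence that privatization removes.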

Privatization - Definition A scalar variable x defined within a loop is said to be privatizable with respect to that loop if and only if every path from the beginning of the loop body to a use of x within the loop body must pass through a definition of x before reaching that use

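Privatization – formal solution
For each block x in the loop body, define the set of upward-exposed variables up(x) by the equation
    $up(x) = use(x) \cup \left( \neg def(x) \cap \bigcup_{y \in succ(x)} up(y) \right)$
where
    use(x) – the set of all variables used within block x that have no prior definitions within the block
    def(x) – the set of variables defined in x
    succ(x) – the successors of x within the loop body
Solve the above equations …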

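Privatization – formal solution
The set of private variables is:
    $private = \neg up(b_0) \cap \bigcup_{y \in B} def(y)$
where
    up(b_0) – the upward-exposed variables at the entry block
    B – the collection of loop body blocks
    b_0 – the entry block to the loop body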
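The formal solution can be sketched as a fixed-point iteration: a variable is upward exposed at a block if it is used there before being defined, or if it is not defined there and is upward exposed at some successor. The code below is an illustrative sketch under stated assumptions, not the book's implementation: the function name `privatizable`, the dictionary layout, and the block names are hypothetical, and `succ` is assumed to list only successors inside the loop body (no back edge).

```python
def privatizable(blocks, succ, entry):
    """blocks: name -> (use, defs); succ: name -> successor names in the loop body.

    Returns the variables defined in the loop that are not upward exposed
    at the entry block, i.e. the privatizable ones.
    """
    up = {name: set(use) for name, (use, _) in blocks.items()}
    changed = True
    while changed:                       # iterate until no up-set grows
        changed = False
        for name, (use, defs) in blocks.items():
            exposed_below = set()
            for s in succ.get(name, []):
                exposed_below |= up[s]
            new = set(use) | (exposed_below - set(defs))
            if new != up[name]:
                up[name] = new
                changed = True
    all_defs = set().union(*(defs for _, defs in blocks.values()))
    return all_defs - up[entry]


# Hypothetical loop body: b0 defines t; b1 defines s; b2 defines nothing;
# b3 uses s and t. Along the path b0 -> b2 -> b3, s is used without a definition.
example_blocks = {
    "b0": ({"x"}, {"t"}),
    "b1": ({"t"}, {"s"}),
    "b2": (set(), set()),
    "b3": ({"s", "t"}, {"y"}),
}
example_succ = {"b0": ["b1", "b2"], "b1": ["b3"], "b2": ["b3"], "b3": []}
```

For this example, `privatizable(example_blocks, example_succ, "b0")` yields `t` and `y`; `s` is excluded because one path reaches its use without passing through a definition.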

Privatization - Theorem
A variable x defined in a loop may be made private if and only if the SSA graph for the variable does not have a Φ-node at the entry to the loop.

Loop Distribution
Distributing loops using codegen
Original loop nest:
    DO I = 1, 100
       DO J = 1, 100
          A(I,J) = B(I,J) + C(I,J)
          D(I,J) = A(I,J-1) * 2
       ENDDO
    ENDDO
After codegen:
    DO I = 1, 100
       DO J = 1, 100
          A(I,J) = B(I,J) + C(I,J)
       ENDDO
       DO J = 1, 100
          D(I,J) = A(I,J-1) * 2
       ENDDO
    ENDDO

Loop Distribution - vector architectures
Distributed loop nest:
    DO I = 1, 100
       DO J = 1, 100
          A(I,J) = B(I,J) + C(I,J)
       ENDDO
       DO J = 1, 100
          D(I,J) = A(I,J-1) * 2
       ENDDO
    ENDDO
Vectorized:
    DO I = 1, 100
       A(I,1:100) = B(I,1:100) + C(I,1:100)
       D(I,1:100) = A(I,0:99) * 2
    ENDDO

Loop Distribution - SMP architectures
On an SMP, the distributed inner loops need a barrier between them:
    DO I = 1, 100
       DO J = 1, 100
          A(I,J) = B(I,J) + C(I,J)
       ENDDO
       Barrier()
       DO J = 1, 100
          D(I,J) = A(I,J-1) * 2
       ENDDO
    ENDDO

The solution - Alignment
    DO I = 2, N
S1     A(I) = B(I) + C(I)
S2     D(I) = A(I-1) * 2
    ENDDO
[Figure: iterations of S1 and S2 plotted against I = 2 .. N]

Basic Alignment
Original loop:
    DO I = 2, N
S1     A(I) = B(I) + C(I)
S2     D(I) = A(I-1) * 2
    ENDDO
Aligned loop:
    DO I = 1, N
S1     IF (I > 1) A(I) = B(I) + C(I)
S2     IF (I < N) D(I+1) = A(I) * 2
    ENDDO
[Figure: iterations of S1 and S2 plotted against I = 1 .. N after alignment]

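Optimized Alignment – Option 1
Aligned loop with guards:
    DO I = 1, N
S1     IF (I > 1) A(I) = B(I) + C(I)
S2     IF (I < N) D(I+1) = A(I) * 2
    ENDDO
Guards removed by wrapping the first iteration of S1 around to J = N:
    DO I = 1, N-1
       J = I ; IF (I == 1) J = N
       A(J) = B(J) + C(J)
       D(I+1) = A(I) * 2
    ENDDO
[Figure: iterations of S1 and S2 plotted against I = 1 .. N]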

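Optimized Alignment – Option 2
Aligned loop with guards:
    DO I = 1, N
S1     IF (I > 1) A(I) = B(I) + C(I)
S2     IF (I < N) D(I+1) = A(I) * 2
    ENDDO
Guards removed by peeling the first and last iterations:
    D(2) = A(1) * 2
    DO I = 2, N-1
       A(I) = B(I) + C(I)
       D(I+1) = A(I) * 2
    ENDDO
    A(N) = B(N) + C(N)
[Figure: iterations of S1 and S2 plotted against I = 1 .. N]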
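One way to convince oneself that the three aligned variants are equivalent to the original loop, and that they carry no dependence between iterations, is to run their iterations in a shuffled order and compare results. The following is a small sketch in Python (an executable stand-in for the slide notation); all function names are hypothetical, arrays are 1-based with index 0 unused, and N >= 2 is assumed.

```python
import random

def original(a, b, c, n):
    a, d = a[:], [0] * (n + 1)
    for i in range(2, n + 1):           # sequential: S2 reads the previous A
        a[i] = b[i] + c[i]
        d[i] = a[i - 1] * 2
    return a, d

def basic_aligned(a, b, c, n, order):   # order: any permutation of 1..N
    a, d = a[:], [0] * (n + 1)
    for i in order:
        if i > 1:
            a[i] = b[i] + c[i]
        if i < n:
            d[i + 1] = a[i] * 2
    return a, d

def option1(a, b, c, n, order):         # order: any permutation of 1..N-1
    a, d = a[:], [0] * (n + 1)
    for i in order:
        j = n if i == 1 else i          # first S1 iteration wraps around to N
        a[j] = b[j] + c[j]
        d[i + 1] = a[i] * 2
    return a, d

def option2(a, b, c, n, order):         # order: any permutation of 2..N-1
    a, d = a[:], [0] * (n + 1)
    d[2] = a[1] * 2                     # peeled first iteration
    for i in order:
        a[i] = b[i] + c[i]
        d[i + 1] = a[i] * 2
    a[n] = b[n] + c[n]                  # peeled last iteration
    return a, d

def check(n=8, seed=0):
    rng = random.Random(seed)
    a, b, c = ([rng.randint(0, 9) for _ in range(n + 1)] for _ in range(3))

    def shuffled(lo, hi):
        order = list(range(lo, hi + 1))
        rng.shuffle(order)
        return order

    expected = original(a, b, c, n)
    return (basic_aligned(a, b, c, n, shuffled(1, n)) == expected
            and option1(a, b, c, n, shuffled(1, n - 1)) == expected
            and option2(a, b, c, n, shuffled(2, n - 1)) == expected)
```

A shuffled sequential run is only a proxy for parallel execution, but it is enough to expose any dependence that crosses iterations.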

Alignment problems
Recurrence – impossible to align
Different dependence distances – alignment fails (alignment conflict)
Original loop:
    DO I = 1, N
       A(I+1) = B(I) + C
       X(I) = A(I+1) + A(I)
    ENDDO
Aligning the A(I) reference turns A(I+1) into A(I+2), which is written one iteration later:
    DO I = 0, N
       IF (I > 0) A(I+1) = B(I) + C
       IF (I < N) X(I+1) = A(I+2) + A(I+1)
    ENDDO

Code Replication
Solves the alignment conflict
Original loop:
    DO I = 1, N
       A(I+1) = B(I) + C
       X(I) = A(I+1) + A(I)
    ENDDO
With the computation of A(I) replicated into a temporary:
    DO I = 1, N
       A(I+1) = B(I) + C
       IF (I == 1) THEN
          t = A(I)
       ELSE
          t = B(I-1) + C
       ENDIF
       X(I) = A(I+1) + t
    ENDDO

Alignment Graph
G = (V,E) – directed acyclic graph
V – set of loop body statements
    Labeling o(v) – vertex offset
E – set of dependences
    Labeling d(e) – dependence distance

Alignment Graph - Example
    DO I = 1, N
S1     A(I+2) = B(I) + C
S2     X(I+1) = A(I) + D
S3     Y(I) = A(I+1) + X(I)
    ENDDO
[Figure: alignment graph over S1, S2, S3 with edge labels d=2 and d=1 and initial offset o = 0]

Alignment Goal
The graph G = (V,E) is said to be carry-free if, for each edge e = (u,v), o(u) + d(e) = o(v).
The alignment procedure takes an alignment graph and generates a carry-free alignment graph.

Align Procedure
While V is not empty
    Move an arbitrary vertex v from V to the worklist W
    While W is not empty
        Remove vertex u from W
        Align all vertices adjacent to u; replicate a node if different alignments are required
        Add the newly aligned nodes to W
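The worklist procedure can be sketched in Python as follows. This is an illustrative reading of the slide, not the book's exact algorithm: the names `align` and `is_carry_free`, the edge format `(source, sink, distance)`, the choice to always replicate the source of a conflicting edge, and the apostrophe naming of replicas are all assumptions. It makes no attempt to minimize the number of replicas, and it assumes an acyclic graph.

```python
from collections import deque

def align(vertices, edges):
    """Assign offsets so that offset[u] + d == offset[v] for every edge (u, v, d)."""
    edges = list(edges)
    unvisited = list(vertices)
    offset = {}

    def replicate(v, new_offset):
        # Copy v at a different offset; the copy inherits the edges entering v.
        copy = v + "'"
        while copy in offset:
            copy += "'"
        offset[copy] = new_offset
        edges.extend([(p, copy, d) for (p, q, d) in edges if q == v])
        return copy

    while unvisited:
        start = unvisited.pop(0)            # "arbitrary" vertex: first in the list
        offset[start] = 0
        work = deque([start])
        while work:
            u = work.popleft()
            for k in range(len(edges)):     # edges added meanwhile enter replicas only
                src, dst, dist = edges[k]
                if dst == u:                # edge entering u
                    want = offset[u] - dist
                    if src in unvisited:
                        unvisited.remove(src)
                        offset[src] = want
                        work.append(src)
                    elif offset[src] != want:
                        copy = replicate(src, want)
                        edges[k] = (copy, dst, dist)
                        work.append(copy)
                elif src == u:              # edge leaving u
                    want = offset[u] + dist
                    if dst in unvisited:
                        unvisited.remove(dst)
                        offset[dst] = want
                        work.append(dst)
                    elif offset[dst] != want:
                        copy = replicate(u, offset[dst] - dist)
                        edges[k] = (copy, dst, dist)
                        work.append(copy)
    return offset, edges

def is_carry_free(offset, edges):
    return all(offset[s] + d == offset[t] for s, t, d in edges)
```

For the three-statement example with edges S1→S2 (d=2), S1→S3 (d=1), S2→S3 (d=1), calling `align(["S3", "S2", "S1"], ...)` so that S3 is picked first gives offsets 0 for S3, -1 for S2 and S1, and -3 for the replica of S1. Picking a different starting vertex can give a different, equally carry-free result with more replicas.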

Align Procedure - Example
[Figure: alignment graph after Align, with vertex offsets o = 0 and o = -1, and a replicated vertex S1' with o = -3 on an edge with d=2]

GenAlign Procedure
Set variables
    hi – maximal vertex offset
    lo – minimal vertex offset
    Ivar – original iteration variable
    Lvar – original loop lower bound
    Uvar – original loop upper bound
Generate loop statement “DO Ivar = Lvar-hi, Uvar-lo”

GenAlign Procedure cont.
Scan vertices in a topological sort order
Let v be the current vertex
    if o(v) = lo then
        generate “IF ( Ivar >= Lvar-o(v) ) THEN” + the statement of v with Ivar+o(v) substituted for Ivar
    else if o(v) = hi then
        generate “IF ( Ivar <= Uvar-o(v) ) THEN” …
    else
        generate “IF ( Ivar >= Lvar-o(v) AND Ivar <= Uvar-o(v) ) THEN” …

GenAlign Procedure cont.
If v is a replicated vertex, replace the statement S with the following:
    “THEN
        t_v = RHS(S) with Ivar+o(v) substituted for Ivar
     ELSE
        t_v = LHS(S) with Ivar+o(v) substituted for Ivar
     ENDIF”
where t_v is a new unique scalar.
Replace the reference at the sink of every dependence from v by t_v.
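The bound and guard rules of GenAlign can be sketched as a small text generator. This is an assumption-laden illustration, not the book's procedure: the names `gen_align` and `shifted` are hypothetical, the lower bound is numeric and the upper bound symbolic, and the loop upper bound is taken as Uvar - lo, which is what the worked example (DO I = 1, N+3 with lo = -3) requires.

```python
def shifted(symbol, k):
    """Render symbol + k as text: ('N', 3) -> 'N+3', ('N', 0) -> 'N'."""
    return symbol if k == 0 else f"{symbol}{k:+d}"

def gen_align(offsets, lower=1, upper="N", ivar="I"):
    """Return the loop header and one guard per vertex for the given offsets."""
    hi, lo = max(offsets.values()), min(offsets.values())
    header = f"DO {ivar} = {lower - hi}, {shifted(upper, -lo)}"
    guards = {}
    for v, o in offsets.items():
        low_test = f"{ivar} >= {lower - o}"
        high_test = f"{ivar} <= {shifted(upper, -o)}"
        if o == lo:
            guards[v] = low_test        # upper test is implied by the loop bound
        elif o == hi:
            guards[v] = high_test       # lower test is implied by the loop bound
        else:
            guards[v] = f"{low_test} AND {high_test}"
    return header, guards
```

With offsets -3 for S1, -1 for S2 and for the replicated vertex, and 0 for S3, this produces the header `DO I = 1, N+3` and the guards `I >= 4`, `I >= 2 AND I <= N+1`, and `I <= N`.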

GenAlign - Example
Original loop:
    DO I = 1, N
S1     A(I+2) = B(I) + C
S2     X(I+1) = A(I) + D
S3     Y(I) = A(I+1) + X(I)
    ENDDO
Generated loop:
    DO I = 1, N+3
S1     IF (I >= 4) A(I-1) = B(I-3) + C
S1'    IF (I >= 2 AND I <= N+1) THEN
          t = B(I-1) + C
       ELSE
          t = A(I+1)
       ENDIF
S2     IF (I >= 2 AND I <= N+1) X(I) = A(I-1) + D
S3     IF (I <= N) Y(I) = t + X(I)
    ENDDO
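The generated loop can be checked against the original by executing its iterations in a shuffled order, which would break the result if any dependence still crossed iterations. The sketch below is a Python transliteration for checking purposes only; the function names are hypothetical, arrays are 1-based with index 0 unused, and they are allocated with N+5 elements because the ELSE branch reads A(I+1) for I up to N+3 (the value is then unused).

```python
import random

def original(a, b, x, y, c, d, n):
    a, x, y = a[:], x[:], y[:]
    for i in range(1, n + 1):
        a[i + 2] = b[i] + c
        x[i + 1] = a[i] + d
        y[i] = a[i + 1] + x[i]
    return a, x, y

def generated(a, b, x, y, c, d, n, order):
    """order: any permutation of 1..N+3; t is private to each iteration."""
    a, x, y = a[:], x[:], y[:]
    for i in order:
        if i >= 4:
            a[i - 1] = b[i - 3] + c
        if 2 <= i <= n + 1:
            t = b[i - 1] + c            # replicated computation of A(I+1)
        else:
            t = a[i + 1]                # value not produced inside the loop
        if 2 <= i <= n + 1:
            x[i] = a[i - 1] + d
        if i <= n:
            y[i] = t + x[i]
    return a, x, y

def check(n=8, seed=0):
    rng = random.Random(seed)
    size = n + 5
    a, b, x, y = ([rng.randint(0, 9) for _ in range(size)] for _ in range(4))
    order = list(range(1, n + 4))
    rng.shuffle(order)
    return original(a, b, x, y, 3, 5, n) == generated(a, b, x, y, 3, 5, n, order)
```

The constants 3 and 5 passed for C and D are arbitrary test values.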

Loop Fusion - Motivation
Original loop:
    DO I = 1, N
       A(I) = B(I) + 1
       C(I) = A(I) + C(I-1)
       D(I) = A(I) + X
    ENDDO
After distribution:
    DO I = 1, N              (parallelizable)
       A(I) = B(I) + 1
    ENDDO
    DO I = 1, N              (serial)
       C(I) = A(I) + C(I-1)
    ENDDO
    DO I = 1, N              (parallelizable)
       D(I) = A(I) + X
    ENDDO

Loop Fusion - Motivation
Before fusion:
    PARALLEL DO I = 1, N
       A(I) = B(I) + 1
    END PARALLEL DO
    DO I = 1, N
       C(I) = A(I) + C(I-1)
    ENDDO
    PARALLEL DO I = 1, N
       D(I) = A(I) + X
    END PARALLEL DO
After fusion:
    PARALLEL DO I = 1, N
       A(I) = B(I) + 1
       D(I) = A(I) + X
    END PARALLEL DO
    DO I = 1, N
       C(I) = A(I) + C(I-1)
    ENDDO

Loop Fusion – Graphical View
[Figure: dependence graph of loops L1, L2, L3 before fusion, and of the fused node L1,3 with L2 after fusion]

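Loop Fusion - Safety Constraints
Before fusion:
    PARALLEL DO I = 1, N
       A(I) = B(I) + 1
    END PARALLEL DO
    PARALLEL DO I = 1, N
       D(I) = A(I+1) + X
    END PARALLEL DO
After fusion (invalid):
    PARALLEL DO I = 1, N
       A(I) = B(I) + 1
       D(I) = A(I+1) + X
    END PARALLEL DO
Fusion-preventing dependence constraint: the loops must not be fused if the fused loop would contain a backward loop-carried dependence. Here D(I) would read A(I+1) before iteration I+1 has written it.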

Loop Fusion - Safety Constraints
Ordering constraint: two loops cannot be fused if a path of loop-independent dependences between them passes through a loop that is not being fused with them.
[Figure: graph of L1, L2, L3; fusing L1 and L3 into L1,3 around L2 is marked invalid (X)]

Loop Fusion - Profitability Constraints
Separation constraint: do not fuse parallel loops with sequential loops …
Parallelism-inhibiting constraint: do not fuse parallel loops if the fused loop would have a forward loop-carried dependence.
Before fusion:
    PARALLEL DO I = 1, N
       A(I) = B(I) + 1
    END PARALLEL DO
    PARALLEL DO I = 1, N
       D(I) = A(I-1) + X
    END PARALLEL DO
After fusion (A(I-1) is written by the previous iteration, so the loop is no longer parallel):
    DO I = 1, N
       A(I) = B(I) + 1
       D(I) = A(I-1) + X
    ENDDO
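For the simple pattern used on these slides, the two dependence-based constraints reduce to the sign of one number. The sketch below assumes two loops with identical bounds, where the first writes A(I + w) and the second reads A(I + r); the function name `classify_fusion` and its return strings are hypothetical. In the fused loop the element written in iteration i is read in iteration i + (w - r).

```python
def classify_fusion(write_offset, read_offset):
    """Classify fusing 'A(I+w) = ...' with '... = A(I+r)' over the same range."""
    distance = write_offset - read_offset
    if distance < 0:
        # the read would run in an earlier iteration than the write
        return "fusion-preventing"
    if distance > 0:
        # legal, but the fused loop carries a forward dependence
        return "parallelism-inhibiting"
    return "fusable"
```

For example, `classify_fusion(0, 1)` corresponds to D(I) = A(I+1) + X and `classify_fusion(0, -1)` to D(I) = A(I-1) + X.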

Typed Fusion - Definition
P = (G, T, m, B, t0)
    G = (V,E) – directed acyclic graph (dependence graph)
    T – set of types (parallel, sequential)
    m: V → T – mapping of vertices to types
    B – set of bad edges (constraints)
    t0 – objective type (parallel)

Main Data Structures
num[n] – the number of the node that contains n in the fused graph
maxBadPrev[n] – the maximal number of a vertex of type t0 in the fused graph that cannot be fused with n
    “The maximal fused vertex that n is preventing from being further fused”

TypedFusion Example
[Figure: example dependence graph and its fused graph, annotated with num and maxBadPrev values]

update_successors(n)
Set t = type(n)
For each edge (n,m) do
    If (t != t0)
        maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])
    Else
        If (type(m) != t0 or (n,m) in B)
            maxBadPrev[m] = MAX(maxBadPrev[m], num[n])
        Else
            maxBadPrev[m] = MAX(maxBadPrev[m], maxBadPrev[n])

TypedFusion Procedure
Scan the vertices in topological order
For each vertex v
    If v is of type t0
        fuse v with the first available node of type t0 numbered after maxBadPrev[v], or create a new node if there is none
    Else
        create_new_node(v)
    update_successors(v)    (uses num[v], so it runs after v has been placed)
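The procedure and update_successors can be combined into a short Python sketch. It is an illustration under assumptions rather than the book's code: the function name `typed_fusion` and its argument layout are hypothetical, the caller supplies the vertices already in topological order, and the "first available node" is found by a linear scan instead of the linked structure a linear-time implementation would use.

```python
def typed_fusion(order, types, edges, bad, t0):
    """order: vertices in topological order; edges: (n, m) pairs; bad: set of edges."""
    succs = {n: [] for n in order}
    for n, m in edges:
        succs[n].append(m)

    num = {}                                  # vertex -> fused node number (1-based)
    max_bad_prev = {n: 0 for n in order}
    fused = []                                # fused[k-1] = vertices in fused node k
    t0_nodes = []                             # numbers of fused nodes of type t0

    for n in order:
        target = None
        if types[n] == t0:
            # first t0 node numbered after maxBadPrev[n], if any
            target = next((k for k in t0_nodes if k > max_bad_prev[n]), None)
        if target is None:
            fused.append([n])
            target = len(fused)
            if types[n] == t0:
                t0_nodes.append(target)
        else:
            fused[target - 1].append(n)
        num[n] = target

        # update_successors(n), run after num[n] is known
        for m in succs[n]:
            if types[n] != t0:
                candidate = max_bad_prev[n]
            elif types[m] != t0 or (n, m) in bad:
                candidate = num[n]
            else:
                candidate = max_bad_prev[n]
            max_bad_prev[m] = max(max_bad_prev[m], candidate)
    return fused, num, max_bad_prev
```

Applied to the motivation example (L1 and L3 parallel, L2 sequential, edges L1→L2 and L1→L3, no bad edges), the sketch fuses L1 with L3 and leaves L2 on its own. Marking L1→L3 as a bad edge, or routing the dependence as L1→L2→L3, keeps all three loops separate.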

TypedFusion Example
[Figure: dependence graph with nodes 1 to 8]

[Figure: the same graph with each node annotated by its (maxBadPrev, num) pair as the scan proceeds; final pairs shown: (0,1) (0,2) (1,3) (1,4) (1,5) (1,4) (4,6)]

TypedFusion Example
[Figure: graph after fusing the parallel loops; fused nodes {1, 3}, {2}, {4}, {5, 8}, {6}, {7}, numbered (1) to (6)]

Fusing sequential loops …
[Figure: graph after also fusing the sequential loops; fused nodes {1, 3}, {2, 4, 6}, {5, 8}, {7}, labeled (1), (3), (4), (6)]

Ordered Fusion
More than two types of loops: different loop headers
Choosing the “right” order of types to fuse is NP-hard
Assign a priority to each type

Type conflict
[Figure: graph with nodes 1 to 4 illustrating a type conflict]

Ordered Fusion
[Figure: graph with nodes 1 to 6]

[Figure: dependence graph with nodes 1 to 8]

Cohort Fusion
Allows running sequential loops together with parallel loops
Settings:
    Single type
    Bad edges:
        Fusion-preventing edges
        Parallelism-inhibiting edges
        Edges mixing parallel and sequential loops

Cohort Fusion
Pros: minimal number of barriers
Cons: bad load balancing
