
1 Parallelizing Loops Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

2 Moreno Marzolla, Università di Bologna, Italy
Copyright © 2017 Moreno Marzolla, Università di Bologna, Italy. This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

3 Credits Salvatore Orlando (Univ. Ca' Foscari di Venezia)
Mary Hall (Univ. of Utah)

4 Loop optimization
Programs typically spend 90% of their execution time in 10% of the code, mostly in loops.
Loop optimizations transform loops while preserving their semantics.
Goals:
- Single-threaded systems: mostly optimizing for the memory hierarchy
- Multi-threaded and vector systems: loop parallelization

5 Reordering instructions
When can you change the order of two instructions without changing the semantics?
- When they do not operate (read or write) on the same variables, or
- When they only read the same variables.
This is formally captured in the concept of data dependence:
- True dependence: Write X - Read X (RAW)
- Output dependence: Write X - Write X (WAW)
- Anti dependence: Read X - Write X (WAR)
If you detect Read X - Read X (RAR), it is safe to change the order.

6 Race Condition vs Data Dependence
A race condition exists when the result of an execution depends on the timing of two or more events.
A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness.

7 Key Control Concept: Data Dependence
Q: When is parallelization guaranteed to be safe?
A: If there are no data dependences across reordered computations.
Definition: two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the accesses is a write.
Bernstein's conditions:
- Ij is the set of memory locations read by process Pj
- Oj is the set of memory locations written by Pj
To execute Pj and another process Pk in parallel, the following conditions must hold:
- Ij ∩ Ok = ∅ (write after read)
- Ik ∩ Oj = ∅ (read after write)
- Oj ∩ Ok = ∅ (write after write)
Bernstein, A. J. (1966). "Analysis of Programs for Parallel Processing". IEEE Transactions on Electronic Computers, EC-15(5): 757-763.
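A minimal illustration (not from the original slides) of two statements whose read and write sets satisfy all three conditions:

#include <stdio.h>

int main(void)
{
    int b = 2, c = 3, d = 4, e = 5;
    int a, f;

    a = b + c;   /* P1: I1 = {b, c}, O1 = {a} */
    f = d + e;   /* P2: I2 = {d, e}, O2 = {f} */

    /* I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅:
       all three conditions hold, so P1 and P2
       could safely execute in parallel. */
    printf("a = %d, f = %d\n", a, f);
    return 0;
}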

8 Data Dependence
Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.

Data-flow or true dependence, RAW (Read After Write):
    a = b + c;    /* S1 */
    d = 2 * a;    /* S2 */

Anti dependence, WAR (Write After Read):
    c = a + b;    /* S1 */
    a = 2 * a;    /* S2 */

Output dependence, WAW (Write After Write):
    a = k;                     /* S1 */
    if (a>0) { a = 2 * c; }    /* S2 */

9 Control Dependence
An instruction S2 has a control dependence on S1 if the outcome of S1 determines whether S2 is executed or not. Of course, S1 and S2 cannot be exchanged. This type of dependence applies to the condition of an if-then-else or of a loop with respect to their bodies.

    if (a>0) {       /* S1 */
        a = 2 * c;   /* S2 */
    } else {
        b = 3;       /* S3 */
    }

10 Data Dependence
In the following, we always use a simple arrow (S1 → S2) to denote any type of dependence: S2 depends on S1.

11 Fundamental Theorem of Dependence
Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.
Recognizing parallel loops (intuitively):
- Find the data dependences in the loop
- If no dependences cross an iteration boundary, parallelization of the loop's iterations is safe

12 Example

for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];    /* S1 */
}

No iteration depends on a previous one: there are no dependences crossing iteration boundaries. This loop is fully parallelizable: loop iterations can be performed concurrently, in any order, and the iterations can be split across processors.
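A minimal OpenMP sketch (the pragma and the function wrapper are additions, not part of the original slide):

#include <omp.h>

/* Sketch: since no dependence crosses an iteration boundary,
   iterations may be distributed across threads in any order. */
void vector_add(float *a, const float *b, const float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}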

13 Example

for (i=1; i<n; i++) {
    a[i] = a[i-1] + b[i];    /* S1 */
}

Each iteration depends on the previous one (RAW on a[i-1]): a loop-carried dependence. Hence, this loop is not parallelizable with trivial transformations.

14 Example

s = 0;
for (i=0; i<n; i++) {
    s = s + a[i];    /* S1 */
}

We have a loop-carried dependence on s that cannot be removed with trivial loop transformations, but can be removed with non-trivial transformations: this is a reduction, indeed!
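For reference, OpenMP's reduction clause performs exactly this non-trivial transformation: each thread accumulates a private partial sum, and the partial sums are combined at the end. A sketch (the wrapper function is an assumption):

#include <omp.h>

/* Sketch: the reduction clause removes the loop-carried
   dependence on s by giving each thread a private copy. */
float sum(const float *a, int n)
{
    float s = 0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++) {
        s = s + a[i];
    }
    return s;
}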

15 Exercise: draw the dependencies among iterations of the following loop
for (i=2; i<n; i++) {
    a[i] = 4 * c[i-1] - 2;    /* S1 */
    b[i] = a[i] * 2;          /* S2 */
    c[i] = a[i-1] + 3;        /* S3 */
    d[i] = b[i] + c[i-2];     /* S4 */
}

16-20 Exercise (cont.): solution

for (i=2; i<n; i++) {
    a[i] = 4 * c[i-1] - 2;    /* S1 */
    b[i] = a[i] * 2;          /* S2 */
    c[i] = a[i-1] + 3;        /* S3 */
    d[i] = b[i] + c[i-2];     /* S4 */
}

All the dependences are RAW:
- S1 → S2, within the same iteration: S2 reads a[i], written by S1
- S2 → S4, within the same iteration: S4 reads b[i], written by S2
- S1 → S3, loop-carried: S3 reads a[i-1], written by S1 in the previous iteration
- S3 → S1, loop-carried: S1 reads c[i-1], written by S3 in the previous iteration
- S3 → S4, loop-carried: S4 reads c[i-2], written by S3 two iterations earlier

Since some dependences cross iteration boundaries, the iterations of this loop cannot be executed concurrently as written.

21 Removing dependencies: Loop aligning
Dependences can sometimes be removed by aligning loop iterations.

Before:
a[0] = 0;
for (i=0; i<n-1; i++) {
    a[i+1] = b[i] * c[i];    /* S1 */
    d[i] = a[i] + 2;         /* S2 */
}

The dependence S1 → S2 is loop-carried: S2 at iteration i+1 reads a[i+1], written by S1 at iteration i.

After aligning (each iteration now reads the element of a it has just written):
a[0] = 0;
d[0] = a[0] + 2;
for (i=1; i<n-1; i++) {
    a[i] = b[i-1] * c[i-1];    /* T1 */
    d[i] = a[i] + 2;           /* T2 */
}
a[n-1] = b[n-2] * c[n-2];
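After alignment, no dependence crosses an iteration boundary, so the loop can be parallelized. A sketch with OpenMP (the pragma and the wrapper are additions, not part of the original slide):

#include <omp.h>

/* Sketch: the aligned loop has only intra-iteration dependences,
   so its iterations can be distributed across threads. */
void aligned(float *a, float *d, const float *b, const float *c, int n)
{
    a[0] = 0;
    d[0] = a[0] + 2;
    #pragma omp parallel for
    for (int i = 1; i < n-1; i++) {
        a[i] = b[i-1] * c[i-1];
        d[i] = a[i] + 2;
    }
    a[n-1] = b[n-2] * c[n-2];
}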

22 Removing dependencies: Reordering / 1
Some dependences can be removed by reordering the statements.

Before:
for (i=0; i<n-1; i++) {
    a[i+1] = c[i] + k;     /* S1 */
    b[i] = b[i] + a[i];    /* S2 */
    c[i+1] = d[i] + w;     /* S3 */
}

After reordering:
for (i=0; i<n-1; i++) {
    c[i+1] = d[i] + w;     /* S3 */
    a[i+1] = c[i] + k;     /* S1 */
    b[i] = b[i] + a[i];    /* S2 */
}

23 Reordering / 2
After reordering, we can align the loop iterations.

Reordered:
for (i=0; i<n-1; i++) {
    c[i+1] = d[i] + w;     /* S3 */
    a[i+1] = c[i] + k;     /* S1 */
    b[i] = b[i] + a[i];    /* S2 */
}

Reordered and aligned:
b[0] = b[0] + a[0];
a[1] = c[0] + k;
b[1] = b[1] + a[1];
for (i=1; i<n-2; i++) {
    c[i] = d[i-1] + w;           /* T3 */
    a[i+1] = c[i] + k;           /* T1 */
    b[i+1] = b[i+1] + a[i+1];    /* T2 */
}
c[n-2] = d[n-3] + w;
a[n-1] = c[n-2] + k;
c[n-1] = d[n-2] + w;

All dependences are now within a single iteration, so the loop is parallelizable.

24 Loop interchange
Exchanging the loop indexes might allow the outer loop to be parallelized. Why? To use coarse-grained parallelism (if appropriate).

Inner loop parallelized:
for (j=1; j<m; j++) {
    #pragma omp parallel for
    for (i=0; i<n; i++) {
        a[i][j] = a[i][j-1] + b[i];
    }
}

After the interchange, the outer loop is parallelized:
#pragma omp parallel for
for (i=0; i<n; i++) {
    for (j=1; j<m; j++) {
        a[i][j] = a[i][j-1] + b[i];
    }
}

25 Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized?

for (i=0; i<k; i++) {
    for (y=0; y<h; y++) {
        for (x=0; x<w; x++) {
            int xnext = (2*x + y) % w;
            int ynext = (x + y) % h;
            next[ynext][xnext] = cur[y][x];
        }
    }
    /* Swap old and new */
    tmp = cur; cur = next; next = tmp;
}

26 Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized? Consider the outer loop on i first:

for (i=0; i<k; i++) {
    for (y=0; y<h; y++) {
        for (x=0; x<w; x++) {
            int xnext = (2*x + y) % w;
            int ynext = (x + y) % h;
            next[ynext][xnext] = cur[y][x];    /* S1 */
        }
    }
    /* Swap old and new */
    tmp = cur;     /* S2 */
    cur = next;    /* S3 */
    next = tmp;    /* S4 */
}

Note that S1 depends on S4 in a RAW way: next[ynext][xnext] = ... does not modify the variable next (a pointer), but reads the pointer in order to modify the contents of the block it points to. The pointer itself is modified (written) by statement S4. There are loop-carried dependences: this loop cannot be parallelized.

27 Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized? Now consider the two inner loops on y and x:

for (i=0; i<k; i++) {
    for (y=0; y<h; y++) {
        for (x=0; x<w; x++) {
            int xnext = (2*x + y) % w;
            int ynext = (x + y) % h;
            next[ynext][xnext] = cur[y][x];
        }
    }
    /* Swap old and new */
    tmp = cur; cur = next; next = tmp;
}

There are no dependences across their iterations (there could be a WAW dependence on the elements of next, but we know that the cat map is bijective, so it can't happen). These loops can be fully parallelized.
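One possible OpenMP version (a sketch, not the official solution of the exercise; the pixel type and the collapse(2) clause are assumptions):

#include <omp.h>

/* Sketch: collapse(2) merges the y and x loops into a single
   iteration space distributed across threads; this is safe
   because the cat map is bijective, so no two iterations write
   the same element of next. */
void cat_map(unsigned char **cur, unsigned char **next, int w, int h, int k)
{
    for (int i = 0; i < k; i++) {
        #pragma omp parallel for collapse(2)
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int xnext = (2*x + y) % w;
                int ynext = (x + y) % h;
                next[ynext][xnext] = cur[y][x];
            }
        }
        /* The swap stays serial: it carries the dependence
           between consecutive iterations of the outer loop. */
        unsigned char **tmp = cur; cur = next; next = tmp;
    }
}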

28 Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized?

for (r=0; r<n; r++) {
    for (e=0; e<m; e++) {
        const int i = g->edges[e].src;
        const int j = g->edges[e].dst;
        const double wij = g->edges[e].w;
        if (d[i] + wij < d[j]) {      /* S1 */
            dnew[j] = d[i] + wij;     /* S2 */
        }
    }
    for (i=0; i<n; i++) {
        d[i] = dnew[i];               /* S3 */
    }
}

29 Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized? Consider the outer loop on r (code as on the previous slide).

There is a control dependence S1 → S2: S2 must follow S1, since it executes only if the condition tested by S1 holds. Looking at the outer loop only, we observe a loop-carried dependence (RAW) S3 → S1: each round reads the distances d[] written by S3 at the end of the previous round. Therefore, the outer loop cannot be parallelized.

30 Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized? Consider the two inner loops (code as on slide 28).

- The loop over the edges reads d[] and writes dnew[]: no data dependences across its iterations.
- The copy loop reads dnew[] and writes d[]: no data dependences across its iterations.

Both inner loops can therefore be parallelized.
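A hedged OpenMP sketch (the edge_t/graph_t types mirror the g->edges[e].src/.dst/.w accesses in the slides, and dnew is assumed to start each round as a copy of d). Strictly speaking, two edges sharing the same destination j both write dnew[j]; the critical section below serializes those updates conservatively:

#include <omp.h>

/* Assumed types, matching the field accesses in the slides */
typedef struct { int src, dst; double w; } edge_t;
typedef struct { edge_t *edges; } graph_t;

void bellman_ford_round(const graph_t *g, double *d, double *dnew, int n, int m)
{
    #pragma omp parallel for
    for (int e = 0; e < m; e++) {
        const int i = g->edges[e].src;
        const int j = g->edges[e].dst;
        const double wij = g->edges[e].w;
        if (d[i] + wij < d[j]) {
            /* Guard against a lost update when two edges
               relax the same destination j concurrently. */
            #pragma omp critical
            if (d[i] + wij < dnew[j]) {
                dnew[j] = d[i] + wij;
            }
        }
    }
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        d[i] = dnew[i];
    }
}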

31 Difficult dependencies
for (i=1; i<n; i++) {
    for (j=1; j<m; j++) {
        a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
    }
}

Each element a[i][j] depends on its neighbors a[i-1][j-1], a[i-1][j] and a[i][j-1].

32 Difficult dependencies
It is not possible to parallelize the loop iterations, no matter which loop we consider: whether we sweep along the rows or along the columns, every iteration depends on the previous one.

33 Solution
It is possible to parallelize the inner loop by sweeping the matrix diagonally (wavefront sweep):

for (slice=0; slice < n + m - 1; slice++) {
    z1 = slice < m ? 0 : slice - m + 1;
    z2 = slice < n ? 0 : slice - n + 1;
    /* The following loop can be parallelized */
    for (i = slice - z2; i >= z1; i--) {
        j = slice - i;
        /* process a[i][j] … */
    }
}

All elements on the same antidiagonal (i + j = slice) depend only on elements of previous antidiagonals, so they can be processed concurrently.
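Putting it together for the stencil of slide 31, a sketch (the function name and the boundary guard are assumptions):

#include <omp.h>

/* Sketch: wavefront sweep of an n-by-m matrix. Antidiagonals
   are processed one after the other; the elements of each
   antidiagonal are independent and run in parallel. */
void wavefront(double **a, int n, int m)
{
    for (int slice = 0; slice < n + m - 1; slice++) {
        const int z1 = slice < m ? 0 : slice - m + 1;
        const int z2 = slice < n ? 0 : slice - n + 1;
        #pragma omp parallel for
        for (int i = slice - z2; i >= z1; i--) {
            const int j = slice - i;
            if (i > 0 && j > 0) { /* first row/column stay as-is */
                a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
            }
        }
    }
}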

34 Example: Levenshtein's String Distance
Based on the concept of "edit distance": the number of editing operations required to transform a string S into a (possibly different) string T. Start with a cursor on the first character of S. The allowed edit operations are:
- Leave a character unchanged (cost 0)
- Remove a character (cost 1)
- Insert a character (cost 1)
- Replace a character with a different one (cost 1)
After each operation, move the cursor to the next character of S.

35 Example: transform "ALBERO" into "LIBRO"
One simple possibility is to remove all characters of "ALBERO" and then insert all characters of "LIBRO". Total cost: 6 removals + 5 insertions = 11.

ALBERO → LBERO    remove A
LBERO → BERO      remove L
BERO → ERO        remove B
ERO → RO          remove E
RO → O            remove R
O →               remove O
→ L               insert L
L → LI            insert I
LI → LIB          insert B
LIB → LIBR        insert R
LIBR → LIBRO      insert O

36 Example: we can do better
Total cost: 2 removals + 1 insertion = 3.

ALBERO → LBERO     remove A
LBERO → LBERO      do not change (L)
LBERO → LIBERO     insert I
LIBERO → LIBERO    do not change (B)
LIBERO → LIBRO     remove E
(R and O are left unchanged)

37 Definition
Consider two strings S[1..n] and T[1..m] of lengths n and m; one or both can be empty. The Levenshtein distance between S[1..n] and T[1..m] is the minimum cost among all sequences of edit operations that transform S into T.
Some additional definitions:
- S[1..i] is the substring made of the first i characters of S; i = 0 denotes the empty substring
- T[1..j] is the substring made of the first j characters of T; j = 0 denotes the empty substring

38 Dynamic Programming approach
Let L[i, j] be the minimum number of edit operations required to transform the prefix S[1..i] of S into the prefix T[1..j] of T. The edit distance between S and T is therefore L[n, m].

39 Computing L[i, j]
If i = 0 or j = 0:
- To transform an empty string into a nonempty string, it is necessary to insert all of its characters
- To transform a nonempty string into an empty string, it is necessary to remove all of its characters
If i > 0 and j > 0, take the minimum among:
- Transform S[1..i-1] into T[1..j], then remove the last character S[i] of S
- Transform S[1..i] into T[1..j-1], then insert the last character T[j] of T
- Transform S[1..i-1] into T[1..j-1], then replace S[i] with T[j] if they differ

40 Computing L[i, j]
If i = 0 or j = 0:
    L[i, j] = max{ i, j }
Otherwise, if S[i] = T[j]:
    L[i, j] = min{ L[i-1, j] + 1, L[i, j-1] + 1, L[i-1, j-1] }
Otherwise (S[i] ≠ T[j]):
    L[i, j] = min{ L[i-1, j] + 1, L[i, j-1] + 1, L[i-1, j-1] + 1 }
where:
- L[i-1, j] + 1 is the cost to transform S[1..i-1] into T[1..j] and remove S[i]
- L[i, j-1] + 1 is the cost to transform S[1..i] into T[1..j-1] and insert T[j]
- L[i-1, j-1] (+ 1) is the cost to transform S[1..i-1] into T[1..j-1] and leave the last character unchanged (or replace it, at cost 1, if S[i] ≠ T[j])
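A minimal serial C implementation of the recurrence (a sketch; this is not the Levenshtein.java program mentioned on the next slide). Note that L[i][j] depends on L[i-1][j], L[i][j-1] and L[i-1][j-1]: the same dependence pattern as the "difficult dependencies" example, so a parallel version would sweep the table by antidiagonals:

#include <string.h>

#define MAXN 100  /* assumed bound on the string lengths */

/* Returns the Levenshtein distance between s and t,
   filling the DP table row by row. */
int levenshtein(const char *s, const char *t)
{
    int L[MAXN+1][MAXN+1];
    const int n = strlen(s), m = strlen(t);

    for (int i = 0; i <= n; i++) {
        for (int j = 0; j <= m; j++) {
            if (i == 0 || j == 0) {
                L[i][j] = (i > j ? i : j);          /* max{ i, j } */
            } else {
                const int del = L[i-1][j] + 1;      /* remove S[i] */
                const int ins = L[i][j-1] + 1;      /* insert T[j] */
                const int sub = L[i-1][j-1] +
                                (s[i-1] != t[j-1]); /* keep or replace */
                int best = del < ins ? del : ins;
                if (sub < best) best = sub;
                L[i][j] = best;
            }
        }
    }
    return L[n][m];
}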

41 Example
Levenshtein table for S = ALBERO and T = LIBRO (the values in the table were computed with the program Levenshtein.java):

         ""  A  L  B  E  R  O
    ""    0  1  2  3  4  5  6
    L     1  1  1  2  3  4  5
    I     2  2  2  2  3  4  5
    B     3  3  3  2  3  4  5
    R     4  4  4  3  3  3  4
    O     5  5  5  4  4  4  3

For example, the entry in row B, column E (value 3) is the minimum number of edit operations required to transform ALBE into LIB; the bottom-right entry is the Levenshtein distance between ALBERO and LIBRO, which is 3.

42 Conclusions
- A loop can be parallelized if there are no dependences crossing loop iterations
- Some kinds of dependences can be eliminated by using different algorithms (e.g., reductions)
- Other kinds of dependences can be eliminated by sweeping across the loop iterations "diagonally"
- Unfortunately, there are situations where dependences cannot be removed, no matter what

