
1 Parallelizing Loops Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna

2 Moreno Marzolla, Università di Bologna, Italy
Copyright © 2017 Moreno Marzolla, Università di Bologna, Italy. This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

3 Credits Salvatore Orlando (Univ. Ca' Foscari di Venezia)
Mary Hall (Univ. of Utah)

4 Loop optimization
Programs typically spend 90% of their execution time in 10% of the code, mostly in loops.
Loop optimizations transform loops while preserving their semantics.
Goals:
- Single-threaded systems: mostly optimizing for the memory hierarchy
- Multi-threaded and vector systems: loop parallelization

5 Reordering instructions
When can you change the order of two instructions without changing the semantics?
- When they do not operate (read or write) on the same variables, or
- When they only read the same variables.
This is formally captured in the concept of data dependence:
- True dependence: Write X - Read X (RAW)
- Output dependence: Write X - Write X (WAW)
- Anti dependence: Read X - Write X (WAR)
If you detect Read X - Read X (RAR), it is safe to change the order.

6 Race Condition vs Data Dependence
A race condition exists when the result of an execution depends on the timing of two or more events.
A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness.

7 Key Control Concept: Data Dependence
Q: When is parallelization guaranteed to be safe?
A: If there are no data dependences across reordered computations.
Definition: two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the accesses is a write.
Bernstein's conditions:
- Ij is the set of memory locations read by process Pj
- Oj is the set of memory locations written by Pj
To execute Pj and another process Pk in parallel, the following conditions must hold:
- Ij ∩ Ok = ∅ (write after read)
- Ik ∩ Oj = ∅ (read after write)
- Oj ∩ Ok = ∅ (write after write)
Bernstein, A. J. (1966). "Analysis of Programs for Parallel Processing". IEEE Transactions on Electronic Computers, EC-15(5): 757-763.
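A minimal illustration (not from the original slides) of two statements whose read and write sets satisfy all three conditions:

#include <stdio.h>

int main(void)
{
    int b = 2, c = 3, d = 4, e = 5;
    int a, f;

    a = b + c;   /* P1: I1 = {b, c}, O1 = {a} */
    f = d + e;   /* P2: I2 = {d, e}, O2 = {f} */

    /* I1 ∩ O2 = ∅, I2 ∩ O1 = ∅, O1 ∩ O2 = ∅:
       all three conditions hold, so P1 and P2
       could safely execute in parallel. */
    printf("a = %d, f = %d\n", a, f);
    return 0;
}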

8 Data Dependence
Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.

Data-flow or true dependence, RAW (Read After Write):
    a = b + c;    /* S1 */
    d = 2 * a;    /* S2 */

Anti dependence, WAR (Write After Read):
    c = a + b;    /* S1 */
    a = 2 * a;    /* S2 */

Output dependence, WAW (Write After Write):
    a = k;                     /* S1 */
    if (a>0) { a = 2 * c; }    /* S2 */

9 Control Dependence
An instruction S2 has a control dependence on S1 if the outcome of S1 determines whether S2 is executed or not. Of course, S1 and S2 cannot be exchanged. This type of dependence applies to the condition of an if-then-else or of a loop with respect to their bodies.

    if (a>0) {       /* S1 */
        a = 2 * c;   /* S2 */
    } else {
        b = 3;       /* S3 */
    }

10 Data Dependence
In the following, we always use a simple arrow (S1 → S2) to denote any type of dependence: S2 depends on S1.

11 Fundamental Theorem of Dependence
Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.
Recognizing parallel loops (intuitively):
- Find the data dependences in the loop
- If no dependences cross an iteration boundary, parallelization of the loop's iterations is safe

12 Example

for (i=0; i<n; i++) {
    a[i] = b[i] + c[i];    /* S1 */
}

No iteration depends on a previous one: there are no dependences crossing iteration boundaries. This loop is fully parallelizable: loop iterations can be performed concurrently, in any order, and the iterations can be split across processors.
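A minimal OpenMP sketch (the pragma and the function wrapper are additions, not part of the original slide):

#include <omp.h>

/* Sketch: since no dependence crosses an iteration boundary,
   iterations may be distributed across threads in any order. */
void vector_add(float *a, const float *b, const float *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}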

13 Example

for (i=1; i<n; i++) {
    a[i] = a[i-1] + b[i];    /* S1 */
}

Each iteration depends on the previous one (RAW on a[i-1]): a loop-carried dependence. Hence, this loop is not parallelizable with trivial transformations.

14 Example

s = 0;
for (i=0; i<n; i++) {
    s = s + a[i];    /* S1 */
}

We have a loop-carried dependence on s that cannot be removed with trivial loop transformations, but can be removed with non-trivial transformations: this is a reduction, indeed!
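For reference, OpenMP's reduction clause performs exactly this non-trivial transformation: each thread accumulates a private partial sum, and the partial sums are combined at the end. A sketch (the wrapper function is an assumption):

#include <omp.h>

/* Sketch: the reduction clause removes the loop-carried
   dependence on s by giving each thread a private copy. */
float sum(const float *a, int n)
{
    float s = 0;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++) {
        s = s + a[i];
    }
    return s;
}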

15 Exercise: draw the dependencies among iterations of the following loop
for (i=2; i<n; i++) {
    a[i] = 4 * c[i-1] - 2;    /* S1 */
    b[i] = a[i] * 2;          /* S2 */
    c[i] = a[i-1] + 3;        /* S3 */
    d[i] = b[i] + c[i-2];     /* S4 */
}

16-20 Exercise (cont.): solution

for (i=2; i<n; i++) {
    a[i] = 4 * c[i-1] - 2;    /* S1 */
    b[i] = a[i] * 2;          /* S2 */
    c[i] = a[i-1] + 3;        /* S3 */
    d[i] = b[i] + c[i-2];     /* S4 */
}

All the dependences are RAW:
- S1 → S2, within the same iteration: S2 reads a[i], written by S1
- S2 → S4, within the same iteration: S4 reads b[i], written by S2
- S1 → S3, loop-carried: S3 reads a[i-1], written by S1 in the previous iteration
- S3 → S1, loop-carried: S1 reads c[i-1], written by S3 in the previous iteration
- S3 → S4, loop-carried: S4 reads c[i-2], written by S3 two iterations earlier

Since some dependences cross iteration boundaries, the iterations of this loop cannot be executed concurrently as written.

21 Removing dependencies: Loop aligning
Dependences can sometimes be removed by aligning loop iterations.

Before:
a[0] = 0;
for (i=0; i<n-1; i++) {
    a[i+1] = b[i] * c[i];    /* S1 */
    d[i] = a[i] + 2;         /* S2 */
}

The dependence S1 → S2 is loop-carried: S2 at iteration i+1 reads a[i+1], written by S1 at iteration i.

After aligning (each iteration now reads the element of a it has just written):
a[0] = 0;
d[0] = a[0] + 2;
for (i=1; i<n-1; i++) {
    a[i] = b[i-1] * c[i-1];    /* T1 */
    d[i] = a[i] + 2;           /* T2 */
}
a[n-1] = b[n-2] * c[n-2];
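After alignment, no dependence crosses an iteration boundary, so the loop can be parallelized. A sketch with OpenMP (the pragma and the wrapper are additions, not part of the original slide):

#include <omp.h>

/* Sketch: the aligned loop has only intra-iteration dependences,
   so its iterations can be distributed across threads. */
void aligned(float *a, float *d, const float *b, const float *c, int n)
{
    a[0] = 0;
    d[0] = a[0] + 2;
    #pragma omp parallel for
    for (int i = 1; i < n-1; i++) {
        a[i] = b[i-1] * c[i-1];
        d[i] = a[i] + 2;
    }
    a[n-1] = b[n-2] * c[n-2];
}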

22 Removing dependencies: Reordering / 1
Some dependences can be removed by reordering the statements.

Before:
for (i=0; i<n-1; i++) {
    a[i+1] = c[i] + k;     /* S1 */
    b[i] = b[i] + a[i];    /* S2 */
    c[i+1] = d[i] + w;     /* S3 */
}

After reordering:
for (i=0; i<n-1; i++) {
    c[i+1] = d[i] + w;     /* S3 */
    a[i+1] = c[i] + k;     /* S1 */
    b[i] = b[i] + a[i];    /* S2 */
}

23 Reordering / 2
After reordering, we can align the loop iterations.

Reordered:
for (i=0; i<n-1; i++) {
    c[i+1] = d[i] + w;     /* S3 */
    a[i+1] = c[i] + k;     /* S1 */
    b[i] = b[i] + a[i];    /* S2 */
}

Reordered and aligned:
b[0] = b[0] + a[0];
a[1] = c[0] + k;
b[1] = b[1] + a[1];
for (i=1; i<n-2; i++) {
    c[i] = d[i-1] + w;           /* T3 */
    a[i+1] = c[i] + k;           /* T1 */
    b[i+1] = b[i+1] + a[i+1];    /* T2 */
}
c[n-2] = d[n-3] + w;
a[n-1] = c[n-2] + k;
c[n-1] = d[n-2] + w;

All dependences are now within a single iteration, so the loop is parallelizable.

24 Loop interchange
Exchanging the loop indexes might allow the outer loop to be parallelized. Why? To use coarse-grained parallelism (if appropriate).

Inner loop parallelized:
for (j=1; j<m; j++) {
    #pragma omp parallel for
    for (i=0; i<n; i++) {
        a[i][j] = a[i][j-1] + b[i];
    }
}

After the interchange, the outer loop is parallelized:
#pragma omp parallel for
for (i=0; i<n; i++) {
    for (j=1; j<m; j++) {
        a[i][j] = a[i][j-1] + b[i];
    }
}

25 Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized?

for (i=0; i<k; i++) {
    for (y=0; y<h; y++) {
        for (x=0; x<w; x++) {
            int xnext = (2*x + y) % w;
            int ynext = (x + y) % h;
            next[ynext][xnext] = cur[y][x];
        }
    }
    /* Swap old and new */
    tmp = cur; cur = next; next = tmp;
}

26 Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized? Consider the outer loop on i first:

for (i=0; i<k; i++) {
    for (y=0; y<h; y++) {
        for (x=0; x<w; x++) {
            int xnext = (2*x + y) % w;
            int ynext = (x + y) % h;
            next[ynext][xnext] = cur[y][x];    /* S1 */
        }
    }
    /* Swap old and new */
    tmp = cur;     /* S2 */
    cur = next;    /* S3 */
    next = tmp;    /* S4 */
}

Note that S1 depends on S4 in a RAW way: next[ynext][xnext] = ... does not modify the variable next (a pointer), but reads the pointer in order to modify the contents of the block it points to. The pointer itself is modified (written) by statement S4. There are loop-carried dependences: this loop cannot be parallelized.

27 Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized? Now consider the two inner loops on y and x:

for (i=0; i<k; i++) {
    for (y=0; y<h; y++) {
        for (x=0; x<w; x++) {
            int xnext = (2*x + y) % w;
            int ynext = (x + y) % h;
            next[ynext][xnext] = cur[y][x];
        }
    }
    /* Swap old and new */
    tmp = cur; cur = next; next = tmp;
}

There are no dependences across their iterations (there could be a WAW dependence on the elements of next, but we know that the cat map is bijective, so it can't happen). These loops can be fully parallelized.
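One possible OpenMP version (a sketch, not the official solution of the exercise; the pixel type and the collapse(2) clause are assumptions):

#include <omp.h>

/* Sketch: collapse(2) merges the y and x loops into a single
   iteration space distributed across threads; this is safe
   because the cat map is bijective, so no two iterations write
   the same element of next. */
void cat_map(unsigned char **cur, unsigned char **next, int w, int h, int k)
{
    for (int i = 0; i < k; i++) {
        #pragma omp parallel for collapse(2)
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int xnext = (2*x + y) % w;
                int ynext = (x + y) % h;
                next[ynext][xnext] = cur[y][x];
            }
        }
        /* The swap stays serial: it carries the dependence
           between consecutive iterations of the outer loop. */
        unsigned char **tmp = cur; cur = next; next = tmp;
    }
}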

28 Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized?

for (r=0; r<n; r++) {
    for (e=0; e<m; e++) {
        const int i = g->edges[e].src;
        const int j = g->edges[e].dst;
        const double wij = g->edges[e].w;
        if (d[i] + wij < d[j]) {      /* S1 */
            dnew[j] = d[i] + wij;     /* S2 */
        }
    }
    for (i=0; i<n; i++) {
        d[i] = dnew[i];               /* S3 */
    }
}

29 Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized? Consider the outer loop on r (code as on the previous slide).

There is a control dependence S1 → S2: S2 must follow S1, since it executes only if the condition tested by S1 holds. Looking at the outer loop only, we observe a loop-carried dependence (RAW) S3 → S1: each round reads the distances d[] written by S3 at the end of the previous round. Therefore, the outer loop cannot be parallelized.

30 Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized? Consider the two inner loops (code as on slide 28).

- The loop over the edges reads d[] and writes dnew[]: no data dependences across its iterations.
- The copy loop reads dnew[] and writes d[]: no data dependences across its iterations.

Both inner loops can therefore be parallelized.
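A hedged OpenMP sketch (the edge_t/graph_t types mirror the g->edges[e].src/.dst/.w accesses in the slides, and dnew is assumed to start each round as a copy of d). Strictly speaking, two edges sharing the same destination j both write dnew[j]; the critical section below serializes those updates conservatively:

#include <omp.h>

/* Assumed types, matching the field accesses in the slides */
typedef struct { int src, dst; double w; } edge_t;
typedef struct { edge_t *edges; } graph_t;

void bellman_ford_round(const graph_t *g, double *d, double *dnew, int n, int m)
{
    #pragma omp parallel for
    for (int e = 0; e < m; e++) {
        const int i = g->edges[e].src;
        const int j = g->edges[e].dst;
        const double wij = g->edges[e].w;
        if (d[i] + wij < d[j]) {
            /* Guard against a lost update when two edges
               relax the same destination j concurrently. */
            #pragma omp critical
            if (d[i] + wij < dnew[j]) {
                dnew[j] = d[i] + wij;
            }
        }
    }
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        d[i] = dnew[i];
    }
}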

31 Difficult dependencies
for (i=1; i<n; i++) {
    for (j=1; j<m; j++) {
        a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
    }
}

Each element a[i][j] depends on its neighbors a[i-1][j-1], a[i-1][j] and a[i][j-1].

32 Difficult dependencies
It is not possible to parallelize the loop iterations, no matter which loop we consider: whether we sweep along the rows or along the columns, every iteration depends on the previous one.

33 Solution
It is possible to parallelize the inner loop by sweeping the matrix diagonally (wavefront sweep):

for (slice=0; slice < n + m - 1; slice++) {
    z1 = slice < m ? 0 : slice - m + 1;
    z2 = slice < n ? 0 : slice - n + 1;
    /* The following loop can be parallelized */
    for (i = slice - z2; i >= z1; i--) {
        j = slice - i;
        /* process a[i][j] … */
    }
}

All elements on the same antidiagonal (i + j = slice) depend only on elements of previous antidiagonals, so they can be processed concurrently.
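Putting it together for the stencil of slide 31, a sketch (the function name and the boundary guard are assumptions):

#include <omp.h>

/* Sketch: wavefront sweep of an n-by-m matrix. Antidiagonals
   are processed one after the other; the elements of each
   antidiagonal are independent and run in parallel. */
void wavefront(double **a, int n, int m)
{
    for (int slice = 0; slice < n + m - 1; slice++) {
        const int z1 = slice < m ? 0 : slice - m + 1;
        const int z2 = slice < n ? 0 : slice - n + 1;
        #pragma omp parallel for
        for (int i = slice - z2; i >= z1; i--) {
            const int j = slice - i;
            if (i > 0 && j > 0) { /* first row/column stay as-is */
                a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
            }
        }
    }
}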

34 Example: Levenshtein's String Distance
Based on the concept of "edit distance": the number of editing operations required to transform a string S into a (possibly different) string T. Start with a cursor on the first character of S. The allowed edit operations are:
- Leave a character unchanged (cost 0)
- Remove a character (cost 1)
- Insert a character (cost 1)
- Replace a character with a different one (cost 1)
After each operation, move the cursor to the next character of S.

35 Example: transform "ALBERO" into "LIBRO"
One simple possibility is to remove all characters of "ALBERO" and then insert all characters of "LIBRO". Total cost: 6 removals + 5 insertions = 11.

ALBERO → LBERO    remove A
LBERO → BERO      remove L
BERO → ERO        remove B
ERO → RO          remove E
RO → O            remove R
O →               remove O
→ L               insert L
L → LI            insert I
LI → LIB          insert B
LIB → LIBR        insert R
LIBR → LIBRO      insert O

36 Example: we can do better
Total cost: 2 removals + 1 insertion = 3.

ALBERO → LBERO     remove A
LBERO → LBERO      do not change (L)
LBERO → LIBERO     insert I
LIBERO → LIBERO    do not change (B)
LIBERO → LIBRO     remove E
(R and O are left unchanged)

37 Definition
Consider two strings S[1..n] and T[1..m] of lengths n and m; one or both can be empty. The Levenshtein distance between S[1..n] and T[1..m] is the minimum cost among all sequences of edit operations that transform S into T.
Some additional definitions:
- S[1..i] is the substring made of the first i characters of S; i = 0 denotes the empty substring
- T[1..j] is the substring made of the first j characters of T; j = 0 denotes the empty substring

38 Dynamic Programming approach
Let L[i, j] be the minimum number of edit operations required to transform the prefix S[1..i] of S into the prefix T[1..j] of T. The edit distance between S and T is therefore L[n, m].

39 Computing L[i, j]
If i = 0 or j = 0:
- To transform an empty string into a nonempty string, it is necessary to insert all of its characters
- To transform a nonempty string into an empty string, it is necessary to remove all of its characters
If i > 0 and j > 0, take the minimum among:
- Transform S[1..i-1] into T[1..j], then remove the last character S[i] of S
- Transform S[1..i] into T[1..j-1], then insert the last character T[j] of T
- Transform S[1..i-1] into T[1..j-1], then replace S[i] with T[j] if they differ

40 Computing L[i, j]
If i = 0 or j = 0:
    L[i, j] = max{ i, j }
Otherwise, if S[i] = T[j]:
    L[i, j] = min{ L[i-1, j] + 1, L[i, j-1] + 1, L[i-1, j-1] }
Otherwise (S[i] ≠ T[j]):
    L[i, j] = min{ L[i-1, j] + 1, L[i, j-1] + 1, L[i-1, j-1] + 1 }
where:
- L[i-1, j] + 1 is the cost to transform S[1..i-1] into T[1..j] and remove S[i]
- L[i, j-1] + 1 is the cost to transform S[1..i] into T[1..j-1] and insert T[j]
- L[i-1, j-1] (+ 1) is the cost to transform S[1..i-1] into T[1..j-1] and leave the last character unchanged (or replace it, at cost 1, if S[i] ≠ T[j])
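A minimal serial C implementation of the recurrence (a sketch; this is not the Levenshtein.java program mentioned on the next slide). Note that L[i][j] depends on L[i-1][j], L[i][j-1] and L[i-1][j-1]: the same dependence pattern as the "difficult dependencies" example, so a parallel version would sweep the table by antidiagonals:

#include <string.h>

#define MAXN 100  /* assumed bound on the string lengths */

/* Returns the Levenshtein distance between s and t,
   filling the DP table row by row. */
int levenshtein(const char *s, const char *t)
{
    int L[MAXN+1][MAXN+1];
    const int n = strlen(s), m = strlen(t);

    for (int i = 0; i <= n; i++) {
        for (int j = 0; j <= m; j++) {
            if (i == 0 || j == 0) {
                L[i][j] = (i > j ? i : j);          /* max{ i, j } */
            } else {
                const int del = L[i-1][j] + 1;      /* remove S[i] */
                const int ins = L[i][j-1] + 1;      /* insert T[j] */
                const int sub = L[i-1][j-1] +
                                (s[i-1] != t[j-1]); /* keep or replace */
                int best = del < ins ? del : ins;
                if (sub < best) best = sub;
                L[i][j] = best;
            }
        }
    }
    return L[n][m];
}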

41 Example
Levenshtein table for S = ALBERO and T = LIBRO (the values in the table were computed with the program Levenshtein.java):

         ""  A  L  B  E  R  O
    ""    0  1  2  3  4  5  6
    L     1  1  1  2  3  4  5
    I     2  2  2  2  3  4  5
    B     3  3  3  2  3  4  5
    R     4  4  4  3  3  3  4
    O     5  5  5  4  4  4  3

For example, the entry in row B, column E (value 3) is the minimum number of edit operations required to transform ALBE into LIB; the bottom-right entry is the Levenshtein distance between ALBERO and LIBRO, which is 3.

42 Conclusions
- A loop can be parallelized if there are no dependences crossing loop iterations
- Some kinds of dependences can be eliminated by using different algorithms (e.g., reductions)
- Other kinds of dependences can be eliminated by sweeping across the loop iterations "diagonally"
- Unfortunately, there are situations where dependences cannot be removed, no matter what

