Parallelizing Loops Moreno Marzolla Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/

Moreno Marzolla, Università di Bologna, Italy Copyright © 2017 Moreno Marzolla, Università di Bologna, Italy (http://www.moreno.marzolla.name/teaching/HPC/) This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Credits Salvatore Orlando (Univ. Ca' Foscari di Venezia), Mary Hall (Univ. of Utah)

Loop optimization
Roughly 90% of execution time is typically spent in 10% of the code, mostly in loops. Loop optimizations transform loops while preserving their semantics. The goal depends on the target: on single-threaded systems the main concern is optimizing for the memory hierarchy; on multi-threaded and vector systems it is loop parallelization.

Reordering instructions
When can you change the order of two instructions without changing the semantics? When they do not operate (read or write) on the same variables, or when they only read the same variables. This is formally captured by the concept of data dependence:
True dependence: Write X – Read X (RAW)
Output dependence: Write X – Write X (WAW)
Anti dependence: Read X – Write X (WAR)
A Read X – Read X pair (RAR) is not a dependence: it is always safe to change the order.

Race Condition vs Data Dependence A race condition exists when the result of an execution depends on the timing of two or more events. A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness.

Key Control Concept: Data Dependence
Q: When is parallelization guaranteed to be safe?
A: If there are no data dependences across reordered computations.
Definition: two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the accesses is a write.
Bernstein's conditions: let Ij be the set of memory locations read by process Pj, and Oj the set of memory locations written by Pj. To execute Pj and another process Pk in parallel, the following conditions must hold:
Ij ∩ Ok = ∅ (write after read)
Ik ∩ Oj = ∅ (read after write)
Oj ∩ Ok = ∅ (write after write)
Bernstein, A. J. (1 October 1966). "Analysis of Programs for Parallel Processing". IEEE Transactions on Electronic Computers. EC-15 (5): 757–763. doi:10.1109/PGEC.1966.264565
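As a minimal illustration (the variable names are arbitrary), the two OpenMP sections below satisfy Bernstein's conditions, so they may run concurrently:

    #include <stdio.h>

    int main( void )
    {
        double a = 1.0, b = 2.0, c = 3.0, d = 4.0;
        double x, y;
    #pragma omp parallel sections
        {
    #pragma omp section
            x = a + b;      /* reads {a, b}, writes {x} */
    #pragma omp section
            y = c * d;      /* reads {c, d}, writes {y} */
        }
        /* The read set of each section is disjoint from the write set of the
           other, and the two write sets are disjoint: no data dependence. */
        printf("x=%f y=%f\n", x, y);
        return 0;
    }

If either section also touched a variable written by the other (for example, x = a + b and b = 2 * x), at least one of the conditions would be violated and the two sections could no longer be executed in parallel.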

Data Dependence
Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.
Data-flow or true dependence, RAW (Read After Write):
    S1: a = b + c;
    S2: d = 2 * a;
Anti dependence, WAR (Write After Read):
    S1: c = a + b;
    S2: a = 2 * a;
Output dependence, WAW (Write After Write): both S1 and S2 write a
    S1: a = k;
    S2: if (a>0) { a = 2 * c; }

Control Dependence
An instruction S2 has a control dependence on S1 if the outcome of S1 determines whether S2 is executed or not; of course, S1 and S2 cannot be exchanged. This type of dependence applies to the condition of an if-then-else or of a loop with respect to their bodies.
    S1: if (a>0) {
    S2:     a = 2 * c;
        } else {
    S3:     b = 3;
        }

Data Dependence
In the following, we always use a simple arrow S1 → S2 to denote any type of dependence, meaning that S2 depends on S1.

Fundamental Theorem of Dependence
Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.
Recognizing parallel loops (intuitively): find the data dependences in the loop; if no dependence crosses an iteration boundary, parallelization of the loop's iterations is safe.

Example
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];     /* S1 */
    }
Each iteration does not depend on previous ones: there are no dependences crossing iteration boundaries, so this loop is fully parallelizable. Loop iterations can be performed concurrently, in any order, and iterations can be split across processors.
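For instance, such a loop can be annotated with an OpenMP parallel for directive (a minimal sketch; the function wrapper and names are illustrative):

    /* No dependence crosses iteration boundaries, so the iterations
       can be distributed among threads. */
    void vec_add(double *a, const double *b, const double *c, int n)
    {
        int i;
    #pragma omp parallel for
        for (i = 0; i < n; i++) {
            a[i] = b[i] + c[i];
        }
    }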

Example
    for (i=1; i<n; i++) {
        a[i] = a[i-1] + b[i];   /* S1 */
    }
Each iteration depends on the previous one (RAW): this is a loop-carried dependency. Hence, this loop is not parallelizable with trivial transformations.

Example
    s = 0;
    for (i=0; i<n; i++) {
        s = s + a[i];           /* S1 */
    }
We have a loop-carried dependency on s that cannot be removed with trivial loop transformations, but can be removed with non-trivial transformations: this is a reduction, indeed!
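A possible OpenMP sketch (the function wrapper is illustrative): the reduction clause gives each thread a private copy of s, and the partial sums are combined when the loop ends.

    double sum_reduction(const double *a, int n)
    {
        double s = 0.0;
        int i;
    #pragma omp parallel for reduction(+:s)
        for (i = 0; i < n; i++) {
            s = s + a[i];
        }
        return s;
    }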

Exercise: draw the dependencies among iterations of the following loop
    for (i=2; i<n; i++) {
        a[i] = 4 * c[i-1] - 2;    /* S1 */
        b[i] = a[i] * 2;          /* S2 */
        c[i] = a[i-1] + 3;        /* S3 */
        d[i] = b[i] + c[i-2];     /* S4 */
    }

Exercise (cont.)
The dependencies in the loop body are all of type RAW:
S1 → S2 (same iteration): S2 reads a[i], which is written by S1.
S3 → S1 (loop-carried): S1 reads c[i-1], which is written by S3 in the previous iteration.
S1 → S3 (loop-carried): S3 reads a[i-1], which is written by S1 in the previous iteration.
S2 → S4 (same iteration) and S3 → S4 (loop-carried): S4 reads b[i], written by S2, and c[i-2], written by S3 two iterations earlier.
Since some dependencies cross iteration boundaries, the iterations of this loop cannot all be executed concurrently as they are.

Removing dependencies: Loop aligning
Dependencies can sometimes be removed by aligning loop iterations.
Before:
    a[0] = 0;
    for (i=0; i<n-1; i++) {
        a[i+1] = b[i] * c[i];     /* S1 */
        d[i] = a[i] + 2;          /* S2 */
    }
After aligning (the first instance of S2 and the last instance of S1 are peeled out of the loop):
    a[0] = 0;
    d[0] = a[0] + 2;
    for (i=1; i<n-1; i++) {
        a[i] = b[i-1] * c[i-1];   /* T1 */
        d[i] = a[i] + 2;          /* T2 */
    }
    a[n-1] = b[n-2] * c[n-2];
In the original loop, S2 reads a[i], which is written by S1 in the previous iteration (a loop-carried RAW dependency). After aligning, each iteration reads only values it has just written, so no dependency crosses iteration boundaries.
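A minimal sketch of the aligned version with an OpenMP annotation (function and parameter names are illustrative; it assumes n >= 2):

    /* After aligning, no dependence crosses iteration boundaries,
       so the loop can be executed in parallel. */
    void aligned_loop(double *a, const double *b, const double *c, double *d, int n)
    {
        int i;
        a[0] = 0;
        d[0] = a[0] + 2;
    #pragma omp parallel for
        for (i = 1; i < n-1; i++) {
            a[i] = b[i-1] * c[i-1];
            d[i] = a[i] + 2;
        }
        a[n-1] = b[n-2] * c[n-2];
    }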

Removing dependencies: Reordering / 1
Some dependencies can be removed by reordering the statements in the loop body.
Before:
    for (i=0; i<n-1; i++) {
        a[i+1] = c[i] + k;        /* S1 */
        b[i] = b[i] + a[i];       /* S2 */
        c[i+1] = d[i] + w;        /* S3 */
    }
After reordering (S3 is moved before S1 and S2):
    for (i=0; i<n-1; i++) {
        c[i+1] = d[i] + w;        /* S3 */
        a[i+1] = c[i] + k;        /* S1 */
        b[i] = b[i] + a[i];       /* S2 */
    }

Reordering / 2
After reordering, we can align loop iterations.
Reordered loop:
    for (i=0; i<n-1; i++) {
        c[i+1] = d[i] + w;        /* S3 */
        a[i+1] = c[i] + k;        /* S1 */
        b[i] = b[i] + a[i];       /* S2 */
    }
After aligning (the first instances of S2 and S1 and the last instances of S3 and S1 are peeled out of the loop):
    b[0] = b[0] + a[0];
    a[1] = c[0] + k;
    b[1] = b[1] + a[1];
    for (i=1; i<n-2; i++) {
        c[i] = d[i-1] + w;        /* T3 */
        a[i+1] = c[i] + k;        /* T1 */
        b[i+1] = b[i+1] + a[i+1]; /* T2 */
    }
    c[n-2] = d[n-3] + w;
    a[n-1] = c[n-2] + k;
    c[n-1] = d[n-2] + w;
After the transformation, each iteration of the loop reads only values produced within the same iteration or loop-invariant data, so the dependencies no longer cross iteration boundaries.

Loop interchange
Exchanging the loop indexes might allow the outer loop to be parallelized. Why? To use coarse-grained parallelism (if appropriate).
Inner loop parallelized (the loop on j carries a dependency, since a[i][j] depends on a[i][j-1]):
    for (j=1; j<m; j++) {
    #pragma omp parallel for
        for (i=0; i<n; i++) {
            a[i][j] = a[i][j-1] + b[i];     /* S1 */
        }
    }
After interchanging the loops, the outer loop can be parallelized:
    #pragma omp parallel for
    for (i=0; i<n; i++) {
        for (j=1; j<m; j++) {
            a[i][j] = a[i][j-1] + b[i];     /* S1 */
        }
    }

Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized?
    for (i=0; i<k; i++) {
        for (y=0; y<h; y++) {
            for (x=0; x<w; x++) {
                int xnext = (2*x+y) % w;
                int ynext = (x + y) % h;
                next[ynext][xnext] = cur[y][x];
            }
        }
        /* Swap old and new */
        tmp = cur; cur = next; next = tmp;
    }

Example 1 (Arnold's Cat Map exercise)
Consider the outer loop on i. Let S1 denote the nested loops over y and x (whose body executes next[ynext][xnext] = cur[y][x]), and S2, S3, S4 the three statements of the pointer swap (tmp = cur; cur = next; next = tmp).
Note that S1 depends on S4 in a RAW way: the statement next[ynext][xnext] = cur[y][x] actually does not modify the variable next (a pointer), but reads the pointer to modify the content of the pointed block; the pointer itself is modified (written) by statement S4 of the previous iteration.
There are loop-carried dependencies: the outer loop cannot be parallelized.

Example 1 (Arnold's Cat Map exercise)
The two inner loops on y and x, instead, can be fully parallelized: there are no dependencies across their iterations (there could be a WAW dependency on next[ynext][xnext], but we know that the cat map is bijective, so two distinct source pixels can never map to the same destination).
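A possible OpenMP annotation for one step of the map (an illustrative fragment; collapse(2) is a standard OpenMP clause that merges the two loop nests into a single iteration space to be distributed among threads):

    /* The y and x iterations are independent, so their combined
       iteration space can be split among threads. */
    #pragma omp parallel for collapse(2)
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const int xnext = (2*x + y) % w;
            const int ynext = (x + y) % h;
            next[ynext][xnext] = cur[y][x];
        }
    }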

Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized?
    for (r=0; r<n; r++) {
        for (e=0; e<m; e++) {
            const int i = g->edges[e].src;
            const int j = g->edges[e].dst;
            const double wij = g->edges[e].w;
            if (d[i] + wij < d[j]) {       /* S1 */
                dnew[j] = d[i] + wij;      /* S2 */
            }
        }
        for (i=0; i<n; i++) {
            d[i] = dnew[i];                /* S3 */
        }
    }

Example 2 (Bellman-Ford Algorithm)
There is a control dependency: S2 must follow S1.
Looking at the outer loop only, we observe a loop-carried dependency (RAW) S3 → S1: each iteration of the outer loop reads the distances d[] written by the copy loop of the previous iteration. Therefore, the outer loop cannot be parallelized.

Example 2 (Bellman-Ford Algorithm)
The inner loops, instead, can be parallelized: the edge loop has no data dependencies (it reads d and writes dnew), and the copy loop has no data dependencies (it reads dnew and writes d).
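A possible OpenMP sketch of the two inner loops (illustrative only; the structure g and the arrays d, dnew follow the exercise). As an assumption not discussed in the slide, note that if two edges share the same destination j, different threads may write dnew[j] concurrently; a real implementation might want to guard that update (for example with a critical section, or by keeping per-thread minima).

    for (r = 0; r < n; r++) {
    #pragma omp parallel for
        for (e = 0; e < m; e++) {
            const int i = g->edges[e].src;
            const int j = g->edges[e].dst;
            const double wij = g->edges[e].w;
            if (d[i] + wij < d[j]) {
                /* Concurrent writes to dnew[j] are possible if several edges
                   share the same destination; see the caveat above. */
                dnew[j] = d[i] + wij;
            }
        }
    #pragma omp parallel for
        for (i = 0; i < n; i++) {
            d[i] = dnew[i];
        }
    }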

Difficult dependencies
    for (i=1; i<n; i++) {
        for (j=1; j<m; j++) {
            a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
        }
    }
Each element a[i][j] depends on its north-west, north, and west neighbors: a[i-1][j-1], a[i-1][j], a[i][j-1].

Difficult dependencies
It is not possible to parallelize the loop iterations, no matter which of the two loops we consider: both the i loop and the j loop carry dependencies.

Solution
It is possible to parallelize the inner loop by sweeping the matrix diagonally (a wavefront sweep): all elements on the same anti-diagonal depend only on elements of previous anti-diagonals, so they can be computed concurrently.
    for (slice=0; slice < n + m - 1; slice++) {
        z1 = slice < m ? 0 : slice - m + 1;
        z2 = slice < n ? 0 : slice - n + 1;
        /* The following loop can be parallelized */
        for (i = slice - z2; i >= z1; i--) {
            j = slice - i;
            /* process a[i][j] … */
        }
    }
https://stackoverflow.com/questions/2112832/traverse-rectangular-matrix-in-diagonal-strips
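One possible way to apply the wavefront idea to the stencil of the previous slides, using OpenMP (the diagonal index s = i + j and the bounds below are one possible formulation; the first row and the first column of a are assumed to be already initialized):

    /* Anti-diagonal (wavefront) sweep: all cells with the same i+j are
       independent, because each a[i][j] depends only on cells with a smaller i+j. */
    for (int s = 2; s <= (n-1) + (m-1); s++) {
        const int ilo = (s - (m-1) > 1) ? s - (m-1) : 1;
        const int ihi = (s - 1 < n - 1) ? s - 1 : n - 1;
    #pragma omp parallel for
        for (int i = ilo; i <= ihi; i++) {
            const int j = s - i;
            a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
        }
    }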

Example: Levenshtein's String Distance
Based on the concept of "edit distance": the edit distance is the number of editing operations required to transform a string S into a (possibly different) string T. Start with a "cursor" on the first character of S. The allowed edit operations are:
Leave a character unchanged (cost 0)
Remove a character (cost 1)
Insert a character (cost 1)
Replace a character with a different one (cost 1)
After each operation, move the cursor to the next character of S.

Example
Transform "ALBERO" into "LIBRO". One simple possibility is to remove all characters of "ALBERO" and then insert all characters of "LIBRO". Total cost: 6 removals + 5 inserts = 11.
ALBERO → LBERO (remove A)
LBERO → BERO (remove L)
BERO → ERO (remove B)
ERO → RO (remove E)
RO → O (remove R)
O → (remove O)
→ L (insert L)
L → LI (insert I)
LI → LIB (insert B)
LIB → LIBR (insert R)
LIBR → LIBRO (insert O)

Example
We can do better: 2 removals + 1 insert = 3.
ALBERO → LBERO (remove A)
LBERO → LBERO (do not change L)
LBERO → LIBERO (insert I)
LIBERO → LIBERO (do not change B)
LIBERO → LIBRO (remove E)
The remaining characters R and O are left unchanged.

Definition
Consider two strings S[1..n] and T[1..m] of lengths n and m; one or both can be empty. The Levenshtein distance between S[1..n] and T[1..m] is the minimum cost among all sequences of edit operations that transform S into T.
Some additional definitions: S[1..i] is the substring made from the first i characters of S (i = 0 denotes the empty substring); likewise, T[1..j] is the substring made from the first j characters of T (j = 0 denotes the empty substring).

Dynamic Programming approach
Let L[i, j] be the minimum number of edit operations required to transform the prefix S[1..i] of S into the prefix T[1..j] of T. The edit distance of S and T is therefore L[n, m].

Computing L[i, j]
If i = 0 or j = 0: to transform an empty string into a nonempty string it is necessary to insert all of its characters; to transform a nonempty string into an empty string it is necessary to remove all of its characters.
If i > 0 and j > 0, take the minimum among the following alternatives:
Transform S[1..i-1] into T[1..j], then remove the last character S[i] of S;
Transform S[1..i] into T[1..j-1], then insert the last character T[j] of T;
Transform S[1..i-1] into T[1..j-1], then replace S[i] with T[j] if they differ.

Computing L[i, j]
If i = 0 or j = 0:
L[i, j] = max{ i, j }
Otherwise, if S[i] = T[j]:
L[i, j] = min{ L[i-1, j] + 1, L[i, j-1] + 1, L[i-1, j-1] }
Otherwise (S[i] ≠ T[j]):
L[i, j] = min{ L[i-1, j] + 1, L[i, j-1] + 1, L[i-1, j-1] + 1 }
Here L[i-1, j] + 1 is the cost to transform S[1..i-1] into T[1..j] and then remove S[i]; L[i, j-1] + 1 is the cost to transform S[1..i] into T[1..j-1] and then insert T[j]; L[i-1, j-1] (plus 1 if the characters differ) is the cost to transform S[1..i-1] into T[1..j-1] and then leave the last character unchanged or replace it.
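A possible C implementation of the table fill (the function signature is just one choice; it uses a C99 variable-length array parameter). Note that L[i][j] depends on L[i-1][j], L[i][j-1] and L[i-1][j-1], the same stencil as the "difficult dependencies" example, so the wavefront sweep shown earlier can be used to parallelize the computation.

    /* Fill the (n+1) x (m+1) table L and return the Levenshtein distance
       between s (length n) and t (length m). */
    int levenshtein(const char *s, int n, const char *t, int m, int L[n+1][m+1])
    {
        for (int i = 0; i <= n; i++) L[i][0] = i;   /* remove all characters of S[1..i] */
        for (int j = 0; j <= m; j++) L[0][j] = j;   /* insert all characters of T[1..j] */
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                const int del = L[i-1][j] + 1;                      /* remove S[i] */
                const int ins = L[i][j-1] + 1;                      /* insert T[j] */
                const int sub = L[i-1][j-1] + (s[i-1] != t[j-1]);   /* keep or replace */
                int best = del < ins ? del : ins;
                if (sub < best) best = sub;
                L[i][j] = best;
            }
        }
        return L[n][m];
    }

With this definition, levenshtein("ALBERO", 6, "LIBRO", 5, L) returns 3, matching the cost of the edit sequence shown in the previous slides.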

Example
[The slide shows the full dynamic-programming table for the strings ALBERO (columns "" A L B E R O, indices 1..6) and LIBRO (rows); the values in the table were computed with the program Levenshtein.java. Two entries are highlighted: the minimum number of edit operations required to transform ALBE into LIB, and the bottom-right entry, the Levenshtein distance between ALBERO and LIBRO, which is 3.]

Conclusions
A loop can be parallelized if there are no dependencies crossing loop iterations. Some kinds of dependencies can be eliminated by using different algorithms (e.g., reductions); other kinds can be eliminated by sweeping across loop iterations "diagonally". Unfortunately, there are situations where dependencies cannot be removed, no matter what.