Parallelizing Loops
Moreno Marzolla
Dip. di Informatica—Scienza e Ingegneria (DISI), Università di Bologna
http://www.moreno.marzolla.name/

Copyright © 2017 Moreno Marzolla, Università di Bologna, Italy (http://www.moreno.marzolla.name/teaching/HPC/). This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Credits: Salvatore Orlando (Univ. Ca' Foscari di Venezia), Mary Hall (Univ. of Utah)

Loop optimization
Programs typically spend 90% of their execution time in 10% of the code, mostly in loops. Loop optimizations transform loops while preserving their semantics.
Goal:
- Single-threaded systems: mostly optimize for the memory hierarchy
- Multi-threaded and vector systems: loop parallelization

Reordering instructions
When can you change the order of two instructions without changing the semantics? When they do not operate (read or write) on the same variables, or when they only read the same variables.
This is formally captured in the concept of data dependence:
- True dependence: Write X – Read X (RAW)
- Output dependence: Write X – Write X (WAW)
- Anti dependence: Read X – Write X (WAR)
A Read X – Read X pair (RAR) is not a dependence, so it is safe to change the order.

Race Condition vs Data Dependence
A race condition exists when the result of an execution depends on the timing of two or more events.
A data dependence is an ordering on a pair of memory operations that must be preserved to maintain correctness.
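As an illustration (not part of the original slides), a minimal OpenMP sketch of a race condition: several threads update the same counter without synchronization, so the final value depends on the timing of the updates.

    #include <stdio.h>

    int main(void)
    {
        int counter = 0;
        /* Race condition: every thread performs an unsynchronized
           read-modify-write of `counter`, so updates may be lost. */
        #pragma omp parallel for
        for (int i = 0; i < 1000000; i++) {
            counter++;
        }
        /* When compiled with OpenMP, the printed value is usually < 1000000. */
        printf("counter = %d\n", counter);
        return 0;
    }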

Key Control Concept: Data Dependence
Q: When is parallelization guaranteed to be safe?
A: If there are no data dependencies across reordered computations.
Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the accesses is a write.
Bernstein's conditions: let Ij be the set of memory locations read by process Pj, and Oj the set of memory locations written by Pj. To execute Pj and another process Pk in parallel, the following conditions must hold:
- Ij ∩ Ok = ∅ (no write after read)
- Ik ∩ Oj = ∅ (no read after write)
- Oj ∩ Ok = ∅ (no write after write)
Bernstein, A. J. (1 October 1966). "Analysis of Programs for Parallel Processing". IEEE Transactions on Electronic Computers. EC-15 (5): 757–763. doi:10.1109/PGEC.1966.264565
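A minimal sketch (not from the slides) of how Bernstein's conditions could be checked for two tasks whose read and write sets are known; the AccessSet type and the address-based set representation are illustrative assumptions.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical representation: each task lists the addresses it reads and writes. */
    typedef struct {
        const void **reads;  size_t n_reads;
        const void **writes; size_t n_writes;
    } AccessSet;

    static bool intersects(const void **a, size_t na, const void **b, size_t nb)
    {
        for (size_t i = 0; i < na; i++)
            for (size_t j = 0; j < nb; j++)
                if (a[i] == b[j]) return true;
        return false;
    }

    /* Pj and Pk may run in parallel iff all three Bernstein conditions hold. */
    bool can_run_in_parallel(const AccessSet *Pj, const AccessSet *Pk)
    {
        return !intersects(Pj->reads,  Pj->n_reads,  Pk->writes, Pk->n_writes)   /* Ij ∩ Ok = ∅ */
            && !intersects(Pk->reads,  Pk->n_reads,  Pj->writes, Pj->n_writes)   /* Ik ∩ Oj = ∅ */
            && !intersects(Pj->writes, Pj->n_writes, Pk->writes, Pk->n_writes);  /* Oj ∩ Ok = ∅ */
    }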

Data Dependence
Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.
Data-flow (true) dependency, RAW (Read After Write):
    a = b + c;                /* S1 */
    d = 2 * a;                /* S2 */
Anti dependency, WAR (Write After Read):
    c = a + b;                /* S1 */
    a = 2 * a;                /* S2 */
Output dependency, WAW (Write After Write):
    a = k;                    /* S1 */
    if (a>0) { a = 2 * c; }   /* S2 */

Control Dependence
An instruction S2 has a control dependency on S1 if the outcome of S1 determines whether S2 is executed or not. Of course, S1 and S2 cannot be exchanged. This type of dependency applies to the condition of an if-then-else or of a loop with respect to their bodies.
    if (a>0) {        /* S1 */
        a = 2 * c;    /* S2 */
    } else {
        b = 3;        /* S3 */
    }

Data Dependence
In the following, we always use a simple arrow S1 → S2 to denote any type of dependence ("S2 depends on S1").

Fundamental Theorem of Dependence
Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.
Recognizing parallel loops (intuitively): find the data dependences in the loop; if no dependence crosses an iteration boundary, parallelizing the loop's iterations is safe.

Example
    for (i=0; i<n; i++) {
        a[i] = b[i] + c[i];   /* S1 */
    }
[Diagram: one instance of S1 per iteration i = 0, 1, 2, 3, …, with no arrows between iterations.]
Each iteration does not depend on previous ones: there are no dependencies crossing iteration boundaries, so this loop is fully parallelizable. Loop iterations can be performed concurrently, in any order, and can be split across processors.
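Since the iterations are independent, the loop can be parallelized directly, for example with OpenMP; a minimal sketch (not from the original slides), assuming the arrays do not alias:

    void vector_add(int n, double *a, const double *b, const double *c)
    {
        /* No loop-carried dependence: each iteration touches only its own a[i]. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            a[i] = b[i] + c[i];
        }
    }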

Example
    for (i=1; i<n; i++) {
        a[i] = a[i-1] + b[i];   /* S1 */
    }
[Diagram: S1 at iteration i = 1, 2, 3, 4, … depends on S1 at the previous iteration.]
Each iteration depends on the previous one (RAW): this is a loop-carried dependency. Hence, this loop is not parallelizable with trivial transformations.

Example
    s = 0;
    for (i=0; i<n; i++) {
        s = s + a[i];   /* S1 */
    }
[Diagram: S1 at each iteration i = 0, 1, 2, 3, … depends on S1 at the previous iteration, through s.]
We have a loop-carried dependency on s that cannot be removed with trivial loop transformations, but can be removed with non-trivial transformations: this is a reduction, indeed!
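In OpenMP, for instance, this kind of dependence is handled with a reduction clause; a sketch (not from the original slides):

    double array_sum(int n, const double *a)
    {
        double s = 0.0;
        /* The reduction clause gives each thread a private partial sum
           and combines the partial sums at the end of the loop. */
        #pragma omp parallel for reduction(+:s)
        for (int i = 0; i < n; i++) {
            s = s + a[i];
        }
        return s;
    }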

Exercise: draw the dependencies among iterations of the following loop
    for (i=2; i<n; i++) {
        a[i] = 4 * c[i-1] - 2;   /* S1 */
        b[i] = a[i] * 2;         /* S2 */
        c[i] = a[i-1] + 3;       /* S3 */
        d[i] = b[i] + c[i-2];    /* S4 */
    }
[Diagram: statements S1–S4 replicated for iterations i = 2, 3, 4, 5, 6, …]

Exercise (cont.)
    for (i=2; i<n; i++) {
        a[i] = 4 * c[i-1] - 2;   /* S1 */
        b[i] = a[i] * 2;         /* S2 */
        c[i] = a[i-1] + 3;       /* S3 */
        d[i] = b[i] + c[i-2];    /* S4 */
    }
[Diagram: the dependence arrows, added one at a time over several slides, are all RAW.]
The dependences are:
- S1 → S2: RAW on a[i], within the same iteration
- S1 → S3: RAW on a[i-1], carried across one iteration
- S3 → S1: RAW on c[i-1], carried across one iteration
- S2 → S4: RAW on b[i], within the same iteration
- S3 → S4: RAW on c[i-2], carried across two iterations
Since some dependencies cross iteration boundaries, the iterations of this loop cannot simply be executed in parallel.

Removing dependencies: Loop aligning
Dependencies can sometimes be removed by aligning loop iterations.
Original loop (S2 reads a[i], which is written by S1 in the previous iteration: loop-carried RAW):
    a[0] = 0;
    for (i=0; i<n-1; i++) {
        a[i+1] = b[i] * c[i];   /* S1 */
        d[i] = a[i] + 2;        /* S2 */
    }
Aligned loop (each iteration now reads only values it produces itself: no loop-carried dependence):
    a[0] = 0;
    d[0] = a[0] + 2;
    for (i=1; i<n-1; i++) {
        a[i] = b[i-1] * c[i-1];   /* T1 */
        d[i] = a[i] + 2;          /* T2 */
    }
    a[n-1] = b[n-1] * c[n-1];
[Diagram: after aligning, the arrows from T1 to T2 stay within the same iteration.]
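Once the iterations are aligned, the loop carries no dependence and could be parallelized; a sketch (not from the original slides), assuming the arrays do not alias and n ≥ 2:

    void align_and_parallelize(int n, double *a, double *d, const double *b, const double *c)
    {
        a[0] = 0;
        d[0] = a[0] + 2;
        /* After aligning, iteration i writes only a[i] and d[i] and reads
           values that are not written by other iterations. */
        #pragma omp parallel for
        for (int i = 1; i < n-1; i++) {
            a[i] = b[i-1] * c[i-1];
            d[i] = a[i] + 2;
        }
        a[n-1] = b[n-1] * c[n-1];
    }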

Removing dependencies: Reordering / 1
Some dependencies can be removed by reordering the statements in the loop body.
Original loop:
    for (i=0; i<n-1; i++) {
        a[i+1] = c[i] + k;      /* S1 */
        b[i] = b[i] + a[i];     /* S2 */
        c[i+1] = d[i] + w;      /* S3 */
    }
Reordered loop:
    for (i=0; i<n-1; i++) {
        c[i+1] = d[i] + w;      /* S3 */
        a[i+1] = c[i] + k;      /* S1 */
        b[i] = b[i] + a[i];     /* S2 */
    }
[Diagram: dependence arrows among S1, S2, S3 across iterations i = 0 … n-2, before and after the reordering.]

Reordering / 2
After reordering, we can align loop iterations.
Reordered loop (from the previous slide):
    for (i=0; i<n-1; i++) {
        c[i+1] = d[i] + w;      /* S3 */
        a[i+1] = c[i] + k;      /* S1 */
        b[i] = b[i] + a[i];     /* S2 */
    }
Reordered and aligned loop:
    b[0] = b[0] + a[0];
    a[1] = c[0] + k;
    b[1] = b[1] + a[1];
    for (i=1; i<n-2; i++) {
        c[i] = d[i-1] + w;            /* T3 */
        a[i+1] = c[i] + k;            /* T1 */
        b[i+1] = b[i+1] + a[i+1];     /* T2 */
    }
    c[n-2] = d[n-3] + w;
    a[n-1] = c[n-2] + k;
    c[n-1] = d[n-2] + w;
[Diagram: after aligning, all dependence arrows stay within the same iteration, so there are no loop-carried dependences.]
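After reordering and aligning, the loop body could also be parallelized; a sketch (not from the original slides), assuming the arrays do not alias, k and w are loop invariants, and n is large enough (n ≥ 4):

    void reorder_align_parallelize(int n, double *a, double *b, double *c,
                                   const double *d, double k, double w)
    {
        b[0] = b[0] + a[0];
        a[1] = c[0] + k;
        b[1] = b[1] + a[1];
        /* Each iteration reads only values it produces itself (c[i], a[i+1])
           or values never written in the loop (d), so iterations are independent. */
        #pragma omp parallel for
        for (int i = 1; i < n-2; i++) {
            c[i] = d[i-1] + w;
            a[i+1] = c[i] + k;
            b[i+1] = b[i+1] + a[i+1];
        }
        c[n-2] = d[n-3] + w;
        a[n-1] = c[n-2] + k;
        c[n-1] = d[n-2] + w;
    }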

Loop interchange
Exchanging the loop indexes might allow the outer loop to be parallelized. Why? To use coarse-grained parallelism (if appropriate).
Before the interchange, only the inner loop can be parallelized:
    for (j=1; j<m; j++) {
        #pragma omp parallel for
        for (i=0; i<n; i++) {
            a[i][j] = a[i][j-1] + b[i];   /* S1 */
        }
    }
After the interchange, the parallel loop is the outer one:
    #pragma omp parallel for
    for (i=0; i<n; i++) {
        for (j=1; j<m; j++) {
            a[i][j] = a[i][j-1] + b[i];   /* S1 */
        }
    }
[Diagram: within each row i, S1 at column j depends on S1 at column j-1; there are no dependences between different rows.]
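A self-contained sketch of the interchanged version (not from the slides; the function signature and the C99 variable-length-array parameter are illustrative assumptions):

    void interchanged(int n, int m, double a[n][m], const double b[n])
    {
        /* The dependence of a[i][j] on a[i][j-1] is carried only by the j loop,
           so whole rows can be processed in parallel. The parallel region is
           entered once, instead of m-1 times as in the non-interchanged version. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            for (int j = 1; j < m; j++) {
                a[i][j] = a[i][j-1] + b[i];
            }
        }
    }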

Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized?
    for (i=0; i<k; i++) {
        for (y=0; y<h; y++) {
            for (x=0; x<w; x++) {
                int xnext = (2*x+y) % w;
                int ynext = (x + y) % h;
                next[ynext][xnext] = cur[y][x];
            }
        }
        /* Swap old and new */
        tmp = cur;
        cur = next;
        next = tmp;
    }

Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized? Consider first the outermost loop on i:
    for (i=0; i<k; i++) {
        for (y=0; y<h; y++) {
            for (x=0; x<w; x++) {
                int xnext = (2*x+y) % w;
                int ynext = (x + y) % h;
                next[ynext][xnext] = cur[y][x];   /* S1 */
            }
        }
        /* Swap old and new */
        tmp = cur;    /* S2 */
        cur = next;   /* S3 */
        next = tmp;   /* S4 */
    }
[Diagram: for i = 0, 1, 2, …, a RAW arrow goes from S4 of one iteration to S1 of the next.]
Note that S1 depends on S4 in a RAW way: next[ynext][xnext] = cur[y][x] does not modify the variable next (a pointer), but reads the pointer to modify the content of the pointed block. The pointer itself is modified (written) by statement S4.
There are loop-carried dependencies: this loop cannot be parallelized.

Example 1 (Arnold's Cat Map exercise)
Which loop(s) can be parallelized? Consider now the two inner loops (on y and x) of the code above:
    for (y=0; y<h; y++) {
        for (x=0; x<w; x++) {
            int xnext = (2*x+y) % w;          /* S1 */
            int ynext = (x + y) % h;          /* S2 */
            next[ynext][xnext] = cur[y][x];   /* S3 */
        }
    }
There are no dependencies across their iterations (there could be a WAW dependency on next, but we know that the cat map is bijective, so it can't happen). These loops can be fully parallelized.
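A sketch of how the two inner loops could be parallelized with OpenMP (not from the original slides; the function signature, the pixel type, and the C99 variable-length-array parameters are illustrative assumptions):

    void cat_map_step(int w, int h, unsigned char cur[h][w], unsigned char next[h][w])
    {
        /* The y/x iterations are independent (the map is bijective), so the two
           loops can be collapsed into a single parallel loop. The outer loop over
           the number of map iterations stays serial, as discussed above. */
        #pragma omp parallel for collapse(2)
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int xnext = (2*x + y) % w;
                int ynext = (x + y) % h;
                next[ynext][xnext] = cur[y][x];
            }
        }
    }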

Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized?
    for (r=0; r<n; r++) {
        for (e=0; e<m; e++) {
            const int i = g->edges[e].src;
            const int j = g->edges[e].dst;
            const double wij = g->edges[e].w;
            if (d[i] + wij < d[j]) {        /* S1 */
                dnew[j] = d[i] + wij;       /* S2 */
            }
        }
        for (i=0; i<n; i++) {
            d[i] = dnew[i];                 /* S3 */
        }
    }

Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized? Consider first the outermost loop on r. Note the control dependency: S2 must follow S1.
[Diagram: for r = 0, 1, 2, …, a RAW arrow goes from S3 of one iteration to S1 of the next.]
Looking at the outer loop only, we observe a loop-carried dependency (RAW) S3 → S1. Therefore, the outer loop cannot be parallelized.

Example 2 (Bellman-Ford Algorithm)
Which loop(s) can be parallelized? Now consider the two inner loops.
The loop over the edges has no data dependencies across its iterations (it reads d and writes dnew):
    for (e=0; e<m; e++) {
        const int i = g->edges[e].src;
        const int j = g->edges[e].dst;
        const double wij = g->edges[e].w;
        if (d[i] + wij < d[j]) {
            dnew[j] = d[i] + wij;
        }
    }
The copy loop has no data dependencies across its iterations either (it reads dnew and writes d):
    for (i=0; i<n; i++) {
        d[i] = dnew[i];
    }
Hence both inner loops can be parallelized.
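A possible OpenMP parallelization of one relaxation round (a sketch, not from the slides): the edge_t and graph_t types are illustrative assumptions, dnew is assumed to be initialized from d before the call, and the write to dnew[j] is guarded because several edges may share the same destination vertex.

    typedef struct { int src, dst; double w; } edge_t;
    typedef struct { int n, m; edge_t *edges; } graph_t;

    void bellman_ford_round(const graph_t *g, double *d, double *dnew)
    {
        /* Relax all edges in parallel: read d[], write dnew[]. */
        #pragma omp parallel for
        for (int e = 0; e < g->m; e++) {
            const int i = g->edges[e].src;
            const int j = g->edges[e].dst;
            const double wij = g->edges[e].w;
            if (d[i] + wij < d[j]) {
                /* Protect concurrent writes when several edges end in j. */
                #pragma omp critical
                {
                    if (d[i] + wij < dnew[j]) {
                        dnew[j] = d[i] + wij;
                    }
                }
            }
        }
        /* Copy the updated distances back: read dnew[], write d[]. */
        #pragma omp parallel for
        for (int i = 0; i < g->n; i++) {
            d[i] = dnew[i];
        }
    }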

Difficult dependencies
    for (i=1; i<n; i++) {
        for (j=1; j<m; j++) {
            a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
        }
    }
[Diagram: on the n×m iteration space, each element a[i][j] depends on a[i-1][j-1], a[i-1][j] and a[i][j-1].]

Difficult dependencies
It is not possible to parallelize the loop iterations, no matter which loop we consider.
[Diagram: the same iteration space swept by rows and by columns; in both cases some dependence arrows cross the iteration boundaries.]

Solution
It is possible to parallelize the inner loop by sweeping the matrix diagonally (wavefront sweep):
    for (slice=0; slice < n + m - 1; slice++) {
        z1 = slice < m ? 0 : slice - m + 1;
        z2 = slice < n ? 0 : slice - n + 1;
        /* The following loop can be parallelized */
        for (i = slice - z2; i >= z1; i--) {
            j = slice - i;
            /* process a[i][j] … */
        }
    }
https://stackoverflow.com/questions/2112832/traverse-rectangular-matrix-in-diagonal-strips
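A sketch (not from the slides) of the wavefront sweep applied to the stencil of the previous slides, with the inner loop parallelized; the function signature and the C99 variable-length-array parameter are illustrative assumptions:

    void wavefront_sweep(int n, int m, double a[n][m])
    {
        /* Elements on the same anti-diagonal (i + j == slice) are independent:
           a[i][j] depends only on elements of previous anti-diagonals. */
        for (int slice = 2; slice <= (n-1) + (m-1); slice++) {
            const int ilo = (slice < m) ? 1 : slice - m + 1;   /* smallest i with j <= m-1 */
            const int ihi = (slice < n) ? slice - 1 : n - 1;   /* largest  i with j >= 1   */
            #pragma omp parallel for
            for (int i = ilo; i <= ihi; i++) {
                const int j = slice - i;
                a[i][j] = a[i-1][j-1] + a[i-1][j] + a[i][j-1];
            }
        }
    }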

Conclusions
A loop can be parallelized if there are no dependencies crossing loop iterations. Some kinds of dependencies can be eliminated by using different algorithms (e.g., reductions). Other kinds of dependencies can be eliminated by sweeping across loop iterations "diagonally". Unfortunately, there are situations where dependencies cannot be removed, no matter what.