1
Chapter 3 Shared Memory Parallel Programming Copyright @ 2005-2008 Yan Solihin Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: “Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008”.
2
Fundamentals of Parallel Computer Architecture - Chapter 3 2 Steps in Creating a Parallel Program
Task Creation: identifying parallel tasks, variable scopes, synchronization
Task Mapping: grouping tasks, mapping to processors/memory
3
Fundamentals of Parallel Computer Architecture - Chapter 3 3 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
4
Fundamentals of Parallel Computer Architecture - Chapter 3 4 Code Analysis
Goal: given the code, without knowledge of the algorithm, find parallel tasks
Focus on loop dependence analysis
Notation:
  S is a statement in the source code
  S[i,j,...] denotes an instance of statement S in loop iteration [i,j,...]
  S1 then S2 means that S1 happens before S2
  If S1 then S2:
    S1 T S2 denotes a true dependence, i.e., S1 writes to a location that is read by S2
    S1 A S2 denotes an anti dependence, i.e., S1 reads a location that is later written by S2
    S1 O S2 denotes an output dependence, i.e., S1 writes to the same location written by S2
5
Fundamentals of Parallel Computer Architecture - Chapter 3 5 Example
S1: x = 2;
S2: y = x;
S3: y = x + z;
S4: z = 6;
Dependences:
  S1 T S2
  S1 T S3
  S3 A S4
  S2 O S3
6
Fundamentals of Parallel Computer Architecture - Chapter 3 6 Loop-independent vs. Loop-carried Dependence
Loop-carried dependence: the dependence exists across iterations, i.e., if the loop is removed, the dependence no longer exists
Loop-independent dependence: the dependence exists within an iteration, i.e., even if the loop is removed, the dependence still exists

for (i=1; i<n; i++) {
  S1: a[i] = a[i-1] + 1;
  S2: b[i] = a[i];
}
S1[i] T S1[i+1]: loop-carried
S1[i] T S2[i]: loop-independent

for (i=1; i<n; i++)
  for (j=1; j<n; j++)
    S3: a[i][j] = a[i][j-1] + 1;
S3[i,j] T S3[i,j+1]: loop-carried on the for j loop; no loop-carried dependence in the for i loop

for (i=1; i<n; i++)
  for (j=1; j<n; j++)
    S4: a[i][j] = a[i-1][j] + 1;
S4[i,j] T S4[i+1,j]: loop-carried on the for i loop; no loop-carried dependence in the for j loop
7
Fundamentals of Parallel Computer Architecture - Chapter 3 7 Iteration-space Traversal Graph (ITG)
The ITG shows graphically the order of traversal in the iteration space (the happens-before relationship)
  Node = a point in the iteration space
  Directed edge = the next point that will be encountered after the current point is traversed
Example:
for (i=1; i<4; i++)
  for (j=1; j<4; j++)
    S3: a[i][j] = a[i][j-1] + 1;
[Figure: ITG over the 3 x 3 iteration space (i, j = 1..3); each row is traversed left to right in j, and rows are visited in increasing i]
8
Fundamentals of Parallel Computer Architecture - Chapter 3 8 Loop-carried Dependence Graph (LDG)
The LDG shows the true/anti/output dependence relationships graphically
  Node = a point in the iteration space
  Directed edge = a dependence
Example:
for (i=1; i<4; i++)
  for (j=1; j<4; j++)
    S3: a[i][j] = a[i][j-1] + 1;
S3[i,j] T S3[i,j+1]
[Figure: LDG over the 3 x 3 iteration space; true-dependence (T) edges point from each node [i,j] to [i,j+1] within the same row]
9
Fundamentals of Parallel Computer Architecture - Chapter 3 9 Further Example
Loop nest 1:
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];
Loop nest 2:
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
For each loop nest:
  Draw the ITG
  List all the dependence relationships
  Draw the LDG
10
Fundamentals of Parallel Computer Architecture - Chapter 3 10 Answer for Loop Nest 1: ITG
[Figure: ITG over the n x n iteration space (i, j = 1..n); each row traversed left to right in j, rows visited in increasing i]
11
Fundamentals of Parallel Computer Architecture - Chapter 3 11 Answer for Loop Nest 1
True dependences:
  S1[i,j] T S1[i,j+1]
  S1[i,j] T S1[i+1,j]
Output dependences: none
Anti dependences:
  S1[i,j] A S1[i+1,j]
  S1[i,j] A S1[i,j+1]
[Figure: LDG over the n x n iteration space; edges point from each node [i,j] to [i,j+1] and to [i+1,j], and each edge represents both a true and an anti dependence]
12
Fundamentals of Parallel Computer Architecture - Chapter 3 12 Answer for Loop Nest 2: ITG
[Figure: ITG over the n x n iteration space (i, j = 1..n); same traversal order as for loop nest 1]
13
Fundamentals of Parallel Computer Architecture - Chapter 3 13 Answer for Loop Nest 2
True dependences:
  S2[i,j] T S3[i,j+1]
Output dependences: none
Anti dependences:
  S2[i,j] A S3[i,j] (loop-independent dependence)
[Figure: LDG over the n x n iteration space; edges point from each node [i,j] to [i,j+1], and each edge represents only a true dependence]
14
Fundamentals of Parallel Computer Architecture - Chapter 3 14 Finding Parallel Tasks Across Iterations
Analyze loop-carried dependences:
  Dependences must be obeyed (especially true dependences)
  There are opportunities when some dependences are missing
Example 1:
for (i=2; i<=n; i++)
  S: a[i] = a[i-2];
LDG: each iteration depends only on the iteration two before it, so the even iterations and the odd iterations form two independent dependence chains.
The loop can therefore be divided into two parallel tasks (one with the odd iterations and another with the even iterations):
for (i=2; i<=n; i+=2)
  S: a[i] = a[i-2];
for (i=3; i<=n; i+=2)
  S: a[i] = a[i-2];
15
Fundamentals of Parallel Computer Architecture - Chapter 3 15 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S3: a[i][j] = a[i][j-1] + 1;
There are n parallel tasks (one task per i iteration): the loop-carried dependence is only on the j loop, so different i iterations are independent.
[Figure: LDG over the n x n iteration space; edges only point from [i,j] to [i,j+1] within a row]
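Since the i iterations are independent, a minimal OpenMP sketch of this example (the function name and array dimensions are assumptions, not part of the slide) parallelizes the outer loop while each task runs its j loop serially:

  #include <omp.h>

  /* Hedged sketch: one parallel task per i iteration; the inner j loop
     stays serial because of its loop-carried dependence through a[i][j-1]. */
  void sweep(int n, double a[n+1][n+1])
  {
      int i, j;
      #pragma omp parallel for private(j)
      for (i = 1; i <= n; i++)
          for (j = 1; j <= n; j++)
              a[i][j] = a[i][j-1] + 1;   /* S3 */
  }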
16
Fundamentals of Parallel Computer Architecture - Chapter 3 16 Further Example
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];
Where are the parallel tasks?
[Figure: LDG over the n x n iteration space; edges point from each node to [i,j+1] and to [i+1,j], and each edge represents both a true and an anti dependence]
17
Fundamentals of Parallel Computer Architecture - Chapter 3 17 Example 3
Identify which nodes are not dependent on each other
Within each anti-diagonal, the nodes are independent of each other
Need to rewrite the code to iterate over anti-diagonals
[Figure: the same LDG with its anti-diagonals highlighted; each edge represents both a true and an anti dependence]
18
Fundamentals of Parallel Computer Architecture - Chapter 3 18 Structure of Rewritten Code
Iterate over anti-diagonals, and over the elements within an anti-diagonal; parallelize the loop over the elements:

Calculate the number of anti-diagonals
Foreach anti-diagonal do:
  calculate the number of points in the current anti-diagonal
  For each point in the current anti-diagonal do:   // the parallel loop
    compute the current point in the matrix

Write the code (shown on the next slide).
19
Fundamentals of Parallel Computer Architecture - Chapter 3 19 Implementing Solution 1
for (i=1; i <= 2*n-1; i++) {   // 2n-1 anti-diagonals
  if (i <= n) {
    points = i;       // number of points in the anti-diagonal
    row    = i;       // first point (row,col) in the anti-diagonal
    col    = 1;       // note that row+col = i+1 always
  } else {
    points = 2*n - i;
    row    = n;
    col    = i-n+1;   // note that row+col = i+1 always
  }
  for_all (k=1; k <= points; k++) {
    a[row][col] = ... // update a[row][col]
    row--;
    col++;
  }
}
// In OpenMP, the directive would be:
#pragma omp parallel for default(shared) private(k) firstprivate(row,col)
for (k=1; k <= points; k++)
  ...
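Note that with firstprivate(row, col) each thread's private copies start from the anti-diagonal's first point, so the row--/col++ updates do not line up with the iterations each thread actually executes. One way around this, shown below only as a sketch and not as the textbook's code, is to compute each point's row and col directly from k so the iterations are fully independent; the function name and the (n+2) x (n+2) padding of a are assumptions, and the update uses the statement from the earlier example.

  #include <omp.h>

  /* Hedged sketch: anti-diagonal d (where row+col = d+1) is swept in parallel;
     a is assumed padded to (n+2) x (n+2) so the neighbor reads stay in bounds. */
  void antidiag_sweep(int n, double a[n+2][n+2])
  {
      for (int d = 1; d <= 2*n - 1; d++) {          /* 2n-1 anti-diagonals */
          int points   = (d <= n) ? d : 2*n - d;     /* points on this anti-diagonal */
          int firstrow = (d <= n) ? d : n;
          #pragma omp parallel for
          for (int k = 0; k < points; k++) {
              int row = firstrow - k;
              int col = d + 1 - row;                 /* row + col = d + 1 */
              a[row][col] = a[row][col-1] + a[row][col+1]
                          + a[row-1][col] + a[row+1][col];
          }
      }
  }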
20
Fundamentals of Parallel Computer Architecture - Chapter 3 20 DOACROSS Parallelism
for (i=1; i<=N; i++) {
  S: a[i] = a[i-1] + b[i] * c[i];
}
Opportunity for parallelism?
  S[i] T S[i+1], so the loop has a loop-carried dependence; executed serially it takes N x (TS1 + TS2), where TS1 is the time for the multiply part and TS2 the time for the add part.
  But notice that the b[i] * c[i] part has no loop-carried dependence.
Can change to:
for (i=1; i<=N; i++) {
  S1: temp[i] = b[i] * c[i];
}
for (i=1; i<=N; i++) {
  S2: a[i] = a[i-1] + temp[i];
}
Now the first loop is parallel, but the second one is not, and the array temp[] introduces storage overhead.
Better solution?
21
Fundamentals of Parallel Computer Architecture - Chapter 3 21 DOACROSS Parallelism
post(0);
for (i=1; i<=N; i++) {
  S1: temp = b[i] * c[i];
  wait(i-1);
  S2: a[i] = a[i-1] + temp;
  post(i);
}
Execution time is now TS1 + N x TS2
Small storage overhead (a private scalar temp instead of an N-element array)
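post() and wait() are abstract primitives, not C library calls. Below is a minimal sketch of how the DOACROSS loop might be expressed with C11 atomics and OpenMP; the flag array done, the function name, and the schedule(static,1) choice are all assumptions, not part of the slide, and a[0] is assumed to be initialized by the caller.

  #include <omp.h>
  #include <stdatomic.h>
  #include <stdlib.h>

  /* Hedged DOACROSS sketch: post(i)/wait(i) emulated with an array of atomic flags.
     The spin-wait is kept simple; production code would wait more efficiently. */
  void doacross(int N, double *a, const double *b, const double *c)
  {
      atomic_int *done = malloc((N + 1) * sizeof(atomic_int));
      for (int i = 0; i <= N; i++)
          atomic_init(&done[i], 0);
      atomic_store(&done[0], 1);                    /* post(0) */

      #pragma omp parallel for schedule(static, 1)
      for (int i = 1; i <= N; i++) {
          double temp = b[i] * c[i];                /* S1: no loop-carried dependence */
          while (!atomic_load(&done[i-1]))          /* wait(i-1): spin until posted */
              ;
          a[i] = a[i-1] + temp;                     /* S2: serial chain through a[i-1] */
          atomic_store(&done[i], 1);                /* post(i) */
      }
      free(done);
  }

With schedule(static,1), consecutive iterations go to different threads, so each thread's S1 work overlaps with the serial chain of S2 updates, giving roughly the TS1 + N x TS2 time on the slide; the release/acquire behavior of the atomic store and load makes the write to a[i] visible before done[i] is observed.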
22
Fundamentals of Parallel Computer Architecture - Chapter 3 22 Finding Parallel Tasks in the Loop Body
Identify dependences in the loop body; if there are independent statements, the loop can be split/distributed.
for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
  S4: d[i] = d[i-1] * d[i];
}
Loop-carried dependences between statements: S1[i] A S2[i+1]
Loop-independent dependences: S1[i] T S3[i]
Note that S4 has no dependences with the other statements.
"S1[i] A S2[i+1]" implies that S2 at iteration i+1 must be executed after S1 at iteration i. Hence the dependence is not violated if all S2's are executed after all S1's.
23
Fundamentals of Parallel Computer Architecture - Chapter 3 23 After Loop Distribution
Original loop:
for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
  S4: d[i] = d[i-1] * d[i];
}
After distribution:
for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
}
for (i=0; i<n; i++) {
  S4: d[i] = d[i-1] * d[i];
}
Each loop is a parallel task; this is referred to as function parallelism (a sketch of running the two loops concurrently follows). More distribution is possible (refer to the textbook).
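A minimal sketch of the function parallelism above using OpenMP sections; the function wrapper, the starting index of 1 (so that a[i-1] and d[i-1] stay in bounds), and the assumption that b has at least n+1 elements are mine, not the slide's.

  #include <omp.h>

  /* Hedged sketch: after distribution, the S1-S3 loop and the S4 loop touch
     disjoint data (a, b, c vs. d), so they can run as two concurrent tasks. */
  void distributed(int n, double *a, double *b, double *c, double *d, double coef)
  {
      #pragma omp parallel sections
      {
          #pragma omp section
          for (int i = 1; i < n; i++) {      /* still serial inside: a[i-1] dependence */
              a[i] = b[i+1] * a[i-1];        /* S1 */
              b[i] = b[i] * coef;            /* S2 */
              c[i] = 0.5 * (c[i] + a[i]);    /* S3 */
          }
          #pragma omp section
          for (int i = 1; i < n; i++)        /* serial inside: d[i-1] dependence */
              d[i] = d[i-1] * d[i];          /* S4 */
      }
  }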
24
Fundamentals of Parallel Computer Architecture - Chapter 3 24 Identifying Concurrency (contd.)
Function parallelism:
  modest degree of parallelism, does not grow with input size
  difficult to load balance
  pipelining, as in video encoding/decoding or polygon rendering
Most scalable programs are data parallel; use both when data parallelism is limited.
25
Fundamentals of Parallel Computer Architecture - Chapter 3 25 DOPIPE Parallelism
for (i=2; i<=N; i++) {
  S1: a[i] = a[i-1] + b[i];
  S2: c[i] = c[i] + a[i];
}
Loop-carried dependence: S1[i-1] T S1[i]
Loop-independent dependence: S1[i] T S2[i]
So, where is the parallelism opportunity? DOPIPE parallelism:
for (i=2; i<=N; i++) {
  a[i] = a[i-1] + b[i];
  signal(i);
}
for (i=2; i<=N; i++) {
  wait(i);
  c[i] = c[i] + a[i];
}
What is the max speedup? See the textbook.
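signal() and wait() are again abstract primitives. A minimal DOPIPE sketch, emulating them with C11 atomic flags and running the two pipeline stages as OpenMP sections (the names ready and dopipe are assumptions, and a[1] is assumed initialized by the caller):

  #include <omp.h>
  #include <stdatomic.h>
  #include <stdlib.h>

  /* Hedged DOPIPE sketch: stage 1 produces a[i] and signals; stage 2 waits
     for the signal before consuming a[i]. */
  void dopipe(int N, double *a, const double *b, double *c)
  {
      atomic_int *ready = malloc((N + 1) * sizeof(atomic_int));
      for (int i = 0; i <= N; i++)
          atomic_init(&ready[i], 0);

      #pragma omp parallel sections
      {
          #pragma omp section                   /* stage 1: producer of a[i] */
          for (int i = 2; i <= N; i++) {
              a[i] = a[i-1] + b[i];
              atomic_store(&ready[i], 1);       /* signal(i) */
          }
          #pragma omp section                   /* stage 2: consumer of a[i] */
          for (int i = 2; i <= N; i++) {
              while (!atomic_load(&ready[i]))   /* wait(i) */
                  ;
              c[i] = c[i] + a[i];
          }
      }
      free(ready);
  }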
26
Fundamentals of Parallel Computer Architecture - Chapter 3 26 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
27
Fundamentals of Parallel Computer Architecture - Chapter 3 27 Task Creation: Algorithm Analysis
Goal: find parallelization opportunities at the algorithm level that code analysis misses
Sometimes the ITG introduces unnecessary serialization
Consider the "ocean" algorithm:
  Numerical goal: in each sweep, compute how each point is affected by its neighbors
  Hence, any order of update (within a sweep) is an approximation
  Different orderings of the updates may converge more quickly or more slowly
  Change the ordering to improve parallelism:
    Partition the iteration space into red and black points
    The red sweep and the black sweep are each fully parallel
28
Fundamentals of Parallel Computer Architecture - Chapter 3 28 Example 3: Simulating Ocean Currents
Algorithm:
While not converging to a solution do:
  foreach timestep do:
    foreach cross-section do a sweep:
      foreach point in the cross-section do:
        compute the force interaction with its neighbors
Compare with the code that implements the algorithm:
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++) {
    S1: temp = A[i][j];
    S2: A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
    S3: diff += abs(A[i][j] - temp);
  }
}
29
Fundamentals of Parallel Computer Architecture - Chapter 3 29 Red-Black Coloring
Within one sweep there is no dependence between the black points and the red points.
Restructured algorithm:
While not converging to a solution do:
  foreach timestep do:
    foreach cross-section do:
      foreach red point do:       // red sweep
        compute the force interaction
      wait until the red sweep is complete
      foreach black point do:     // black sweep
        compute the force interaction
See the textbook for the code; a minimal sketch follows.
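The textbook has the actual code; the following is only a sketch, assuming the convention that red points are those where (i+j) is even and omitting the convergence (diff) computation. The function name and the (N+2) x (N+2) padding of A are also assumptions.

  #include <omp.h>

  /* Hedged red-black sketch: each sweep is fully parallel; the implicit barrier
     at the end of the first worksharing loop separates the red and black sweeps. */
  void red_black_sweep(int N, double A[N+2][N+2])
  {
      #pragma omp parallel
      {
          #pragma omp for                         /* red sweep */
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++)
                  if ((i + j) % 2 == 0)           /* assumed: red = (i+j) even */
                      A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                     + A[i][j+1] + A[i+1][j]);
          /* implicit barrier: black points wait until the red sweep is done */
          #pragma omp for                         /* black sweep */
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++)
                  if ((i + j) % 2 == 1)
                      A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                     + A[i][j+1] + A[i+1][j]);
      }
  }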
30
Fundamentals of Parallel Computer Architecture - Chapter 3 30 Task Creation: Further Algorithm Analysis
Can the algorithm tolerate asynchronous execution?
  Simply ignore the dependences within a sweep
  The parallel program becomes nondeterministic (timing-dependent!)
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++) {
    S1: temp = A[i][j];
    S2: A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
    S3: diff += abs(A[i][j] - temp);
  }
}
31
Fundamentals of Parallel Computer Architecture - Chapter 3 31 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
32
Fundamentals of Parallel Computer Architecture - Chapter 3 32 Determining Variable Scope
This step is specific to the shared memory programming model.
Analyze how each variable may be used across parallel tasks:
  Read-only: the variable is only read by all tasks
  R/W non-conflicting: the variable is read, written, or both by only one task
  R/W conflicting: the variable written by one task may be read by another task
33
Fundamentals of Parallel Computer Architecture - Chapter 3 33 Example 1
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
Define a parallel task as each "for i" loop iteration.
  Read-only: n, c, d
  R/W non-conflicting: a, b
  R/W conflicting: i, j
34
Fundamentals of Parallel Computer Architecture - Chapter 3 34 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S1: a[i][j] = b[i][j] + c[i][j];
    S2: b[i][j] = a[i-1][j] * d[i][j];
    S3: e[i][j] = a[i][j];
  }
Parallel task = each "for j" loop iteration.
  Read-only: n, i, c, d
  R/W non-conflicting: a, b, e
  R/W conflicting: j
35
Fundamentals of Parallel Computer Architecture - Chapter 3 35 Privatization
A privatizable variable is either:
  a conflicting variable that, in program order, is always defined (written) by a task before it is used (read) by the same task, or
  a conflicting variable whose values for different parallel tasks are known ahead of time (hence, private copies can be initialized to the known values)
Consequence: the conflicts disappear when the variable is "privatized"
  Privatization involves making private copies of a shared variable
  One private copy per thread (not per task)
  How is this achieved in the shared memory abstraction?
Result of privatization: R/W conflicting becomes R/W non-conflicting
36
Fundamentals of Parallel Computer Architecture - Chapter 3 36 Example 1
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
Define a parallel task as each "for i" loop iteration.
  Read-only: n, c, d
  R/W non-conflicting: a, b
  R/W conflicting but privatizable: i, j
  After privatization: i[ID], j[ID]
37
Fundamentals of Parallel Computer Architecture - Chapter 3 37 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S1: a[i][j] = b[i][j] + c[i][j];
    S2: b[i][j] = a[i-1][j] * d[i][j];
    S3: e[i][j] = a[i][j];
  }
Parallel task = each "for j" loop iteration.
  Read-only: n, i, c, d
  R/W non-conflicting: a, b, e
  R/W conflicting but privatizable: j
  After privatization: j[ID]
38
Fundamentals of Parallel Computer Architecture - Chapter 3 38 Reduction Variables and Operations
Reduction operation = an operation that reduces the elements of some vector/array down to one element
  Examples: SUM (+), multiplication (*), logical operations (AND, OR, ...)
Reduction variable = the scalar variable that is the result of a reduction operation
Criterion for reducibility:
  The reduction variable is updated by each task, and the order of the updates is not important
  Hence, the reduction operation must be commutative and associative
39
Fundamentals of Parallel Computer Architecture - Chapter 3 39 Reduction Operation
Compute: y = y_init op x1 op x2 op x3 ... op xn
op is a reduction operator if it is
  commutative: u op v = v op u
  and associative: (u op v) op w = u op (v op w)
Certain operations can be transformed into reduction operations (see the homeworks)
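A minimal OpenMP sketch of a SUM reduction; the function name is illustrative, and the reduction clause used here is the same one listed later among the OpenMP clauses:

  #include <omp.h>

  /* Hedged sketch: each thread accumulates a private copy of sum;
     the partial sums are combined when the loop ends. */
  double array_sum(int n, const double *x)
  {
      double sum = 0.0;
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += x[i];
      return sum;
  }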
40
Fundamentals of Parallel Computer Architecture - Chapter 3 40 Variable Partitioning
Should be declared private: privatizable variables
Should be declared shared: read-only variables, R/W non-conflicting variables
Should be declared reduction: reduction variables
Other R/W conflicting variables:
  Privatization possible? If so, privatize them
  Otherwise, declare them as shared, but protect them with synchronization
41
Fundamentals of Parallel Computer Architecture - Chapter 3 41 Example 1
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
Declare as shared: n, c, d, a, b
Declare as private: i, j
42
Fundamentals of Parallel Computer Architecture - Chapter 3 42 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S1: a[i][j] = b[i][j] + c[i][j];
    S2: b[i][j] = a[i-1][j] * d[i][j];
    S3: e[i][j] = a[i][j];
  }
Parallel task = each "for j" loop iteration.
Declare as shared: n, i, c, d, a, b, e
Declare as private: j
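These scope decisions map directly onto an OpenMP directive on the inner j loop. The sketch below is illustrative only; the function wrapper and the (n+1) x (n+1) array dimensions are assumptions.

  #include <omp.h>

  /* Hedged sketch: parallel task = one j iteration; j is private, the rest shared. */
  void compute(int n, double a[n+1][n+1], double b[n+1][n+1],
               double c[n+1][n+1], double d[n+1][n+1], double e[n+1][n+1])
  {
      int i, j;
      for (i = 1; i <= n; i++) {                 /* outer loop stays sequential */
          #pragma omp parallel for shared(n, i, a, b, c, d, e) private(j)
          for (j = 1; j <= n; j++) {
              a[i][j] = b[i][j] + c[i][j];       /* S1 */
              b[i][j] = a[i-1][j] * d[i][j];     /* S2 */
              e[i][j] = a[i][j];                 /* S3 */
          }
      }
  }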
43
Fundamentals of Parallel Computer Architecture - Chapter 3 43 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
44
Fundamentals of Parallel Computer Architecture - Chapter 3 44 Synchronization Primitives
Point-to-point:
  a pair of signal() and wait()
  a pair of send() and recv() in message passing
Lock:
  ensures mutual exclusion; only one thread can be in a locked region at a given time
Barrier:
  a point where a thread is allowed to go past it only when all threads have reached the point
45
Fundamentals of Parallel Computer Architecture - Chapter 3 45 Lock
// inside a parallel region
for (i=start_iter; i<end_iter; i++)
  sum = sum + a[i];
What problem may arise here?
// inside a parallel region
for (i=start_iter; i<end_iter; i++) {
  lock(x);
  sum = sum + a[i];
  unlock(x);
}
A lock ensures that only one thread is inside the locked region at a time.
Issues:
  What granularity to lock?
  How to build a lock that is correct and fast?
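In OpenMP, the lock/unlock pair can be written as a critical section; the sketch below is illustrative only (the function wrapper, the use of a parallel for instead of per-thread iteration ranges, and the loop bounds are assumptions, not the slide's code).

  #include <omp.h>

  /* Hedged sketch: the critical section plays the role of lock(x)/unlock(x). */
  double sum_with_critical(int n, const double *a)
  {
      double sum = 0.0;
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
          #pragma omp critical
          sum = sum + a[i];
      }
      return sum;
  }

For this particular loop, a reduction(+:sum) clause would remove the per-iteration serialization entirely, which is exactly the granularity question the slide raises.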
46
Fundamentals of Parallel Computer Architecture - Chapter 3 46 Barrier: Global Event Synchronization
Load balance is important: execution time is determined by the slowest thread
This is one reason for gang scheduling and for avoiding time sharing and context switching
47
Fundamentals of Parallel Computer Architecture - Chapter 3 47 Group Event Synchronization
Only a subset of processes is involved
  Can use flags or barriers (involving only the subset)
  Concept of producers and consumers
Major types:
  Single-producer, multiple-consumer: e.g., the producer sets a flag that the consumers spin on
  Multiple-producer, single-consumer: e.g., a barrier on the producers; the last process sets the flag that the consumer spins on
  Multiple-producer, multiple-consumer
48
Fundamentals of Parallel Computer Architecture - Chapter 3 48 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance) - details later
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
49
Fundamentals of Parallel Computer Architecture - Chapter 3 49 Intro to OpenMP: Directive Format
Refer to http://www.openmp.org and the OpenMP 2.0 specification on the course web site for more details.
#pragma omp directive-name [clause[ [,] clause]...] new-line
For example:
#pragma omp for [clause[[,] clause]...] new-line
  for-loop
The clause is one of:
  private(variable-list)
  firstprivate(variable-list)
  lastprivate(variable-list)
  reduction(operator: variable-list)
  ordered
  schedule(kind[, chunk_size])
  nowait
50
Fundamentals of Parallel Computer Architecture - Chapter 3 50 Very Very Short Intro to OpenMP
Parallel for loop:
#include <omp.h>
//...
#pragma omp parallel default(shared)
{
  ...
  #pragma omp for private(i)
  for (i=0; i<n; i++)
    A[i] = A[i]*A[i] - 3.0;
}
Parallel sections:
#pragma omp parallel shared(A,B) private(i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i=0; i<n; i++)
      A[i] = A[i]*A[i] - 4.0;
    #pragma omp section
    for (i=0; i<n; i++)
      B[i] = B[i]*B[i] + 9.0;
  } // end omp sections
} // end omp parallel
51
Fundamentals of Parallel Computer Architecture - Chapter 3 51 Types of Variables
shared, private, reduction, firstprivate, lastprivate
Semi-private data for parallel loops:
  reduction: variable that is the target of a reduction operation performed by the loop, e.g., sum
  firstprivate: initialize each private copy from the value of the shared variable just prior to the parallel section
  lastprivate: upon loop exit, the master thread holds the value seen by the thread assigned the last loop iteration (for parallel loops only)
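A minimal sketch of firstprivate and lastprivate in one loop; the variable names offset and last and the function shift are hypothetical, not from the slides.

  #include <omp.h>

  /* Hedged sketch: firstprivate initializes each thread's copy of offset from
     the shared value; lastprivate copies last from the sequentially final
     iteration (i = n-1) back to the shared variable after the loop. */
  double shift(int n, double *a)
  {
      int offset = 10;
      double last = 0.0;
      #pragma omp parallel for firstprivate(offset) lastprivate(last)
      for (int i = 0; i < n; i++) {
          a[i] = a[i] + offset;   /* reads the private, pre-initialized copy */
          last = a[i];            /* after the loop, last == a[n-1] */
      }
      return last;
  }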
52
Fundamentals of Parallel Computer Architecture - Chapter 3 52 Barriers
Barriers are implicit at the end of each parallel section
When barriers are not needed for correctness, use the nowait clause
The schedule clause will be discussed later
#include <omp.h>
//...
#pragma omp parallel
{
  ...
  #pragma omp for nowait private(i)
  for (i=0; i<n; i++)
    A[i] = A[i]*A[i] - 3.0;
}
53
Fundamentals of Parallel Computer Architecture - Chapter 3 53 Compile, Run, and Profile
First, find the subroutines/loops that take most of the execution time (pixie, gprof, ssrun, ...)
Parallelize or auto-parallelize the loops, e.g.:
  icc -parallel prog.c
  f77 -pfa prog.f
Compile the application, e.g.:
  cc -mp -O3 prog.c
Set the number of threads, e.g.:
  setenv MP_NUM_THREADS 8
  (can also be set from within the application)
Run it, e.g.:
  prog
or run it with profiling:
  ssrun -pcsamp prog
54
Fundamentals of Parallel Computer Architecture - Chapter 3 54 Matrix Multiplication Example
Reading assignment: read Section 3.8 in the textbook.
In MP1, you will be asked to parallelize a given code.