1
Chapter 3 Shared Memory Parallel Programming Copyright @ 2005-2008 Yan Solihin Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: “Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008”.
2
Fundamentals of Parallel Computer Architecture - Chapter 3 2 Steps in Creating a Parallel Program
Task Creation: identifying parallel tasks, variable scopes, synchronization
Task Mapping: grouping tasks, mapping to processors/memory
3
Fundamentals of Parallel Computer Architecture - Chapter 3 3 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
4
Fundamentals of Parallel Computer Architecture - Chapter 3 4 Code Analysis
Goal: given the code, without knowledge of the algorithm, find parallel tasks
Focus on loop dependence analysis
Notation:
  S is a statement in the source code
  S[i,j,...] denotes an instance of statement S in loop iteration [i,j,...]
  S1 then S2 means that S1 happens before S2
  If S1 then S2:
    S1 T S2 denotes a true dependence, i.e., S1 writes to a location that is read by S2
    S1 A S2 denotes an anti dependence, i.e., S1 reads a location that is later written by S2
    S1 O S2 denotes an output dependence, i.e., S1 writes to the same location written by S2
5
Fundamentals of Parallel Computer Architecture - Chapter 3 5 Example
S1: x = 2;
S2: y = x;
S3: y = x + z;
S4: z = 6;
Dependences:
  S1 T S2
  S1 T S3
  S3 A S4
  S2 O S3
6
Fundamentals of Parallel Computer Architecture - Chapter 3 6 Loop-independent vs. Loop-carried Dependence
Loop-carried dependence: the dependence exists across iterations, i.e., if the loop is removed, the dependence no longer exists
Loop-independent dependence: the dependence exists within an iteration, i.e., even if the loop is removed, the dependence still exists

for (i=1; i<n; i++) {
  S1: a[i] = a[i-1] + 1;
  S2: b[i] = a[i];
}
S1[i] T S1[i+1]: loop-carried
S1[i] T S2[i]: loop-independent

for (i=1; i<n; i++)
  for (j=1; j<n; j++)
    S3: a[i][j] = a[i][j-1] + 1;
S3[i,j] T S3[i,j+1]: loop-carried on the for j loop; no loop-carried dependence in the for i loop

for (i=1; i<n; i++)
  for (j=1; j<n; j++)
    S4: a[i][j] = a[i-1][j] + 1;
S4[i,j] T S4[i+1,j]: loop-carried on the for i loop; no loop-carried dependence in the for j loop
7
Fundamentals of Parallel Computer Architecture - Chapter 3 7 Iteration-space Traversal Graph (ITG)
The ITG shows graphically the order of traversal in the iteration space (the happens-before relationship)
  Node = a point in the iteration space
  Directed edge = the next point that will be encountered after the current point is traversed
Example:
for (i=1; i<4; i++)
  for (j=1; j<4; j++)
    S3: a[i][j] = a[i][j-1] + 1;
[Figure: ITG over the 3 x 3 iteration space (i, j = 1..3); each row is traversed left to right in j, and rows are visited in increasing i]
8
Fundamentals of Parallel Computer Architecture - Chapter 3 8 Loop-carried Dependence Graph (LDG)
The LDG shows the true/anti/output dependence relationships graphically
  Node = a point in the iteration space
  Directed edge = a dependence
Example:
for (i=1; i<4; i++)
  for (j=1; j<4; j++)
    S3: a[i][j] = a[i][j-1] + 1;
S3[i,j] T S3[i,j+1]
[Figure: LDG over the 3 x 3 iteration space; true-dependence (T) edges point from each node [i,j] to [i,j+1] within the same row]
9
Fundamentals of Parallel Computer Architecture - Chapter 3 9 Further Example
Loop nest 1:
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];
Loop nest 2:
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
For each loop nest:
  Draw the ITG
  List all the dependence relationships
  Draw the LDG
10
Fundamentals of Parallel Computer Architecture - Chapter 3 10 Answer for Loop Nest 1: ITG
[Figure: ITG over the n x n iteration space (i, j = 1..n); each row traversed left to right in j, rows visited in increasing i]
11
Fundamentals of Parallel Computer Architecture - Chapter 3 11 Answer for Loop Nest 1
True dependences:
  S1[i,j] T S1[i,j+1]
  S1[i,j] T S1[i+1,j]
Output dependences: none
Anti dependences:
  S1[i,j] A S1[i+1,j]
  S1[i,j] A S1[i,j+1]
[Figure: LDG over the n x n iteration space; edges point from each node [i,j] to [i,j+1] and to [i+1,j], and each edge represents both a true and an anti dependence]
12
Fundamentals of Parallel Computer Architecture - Chapter 3 12 Answer for Loop Nest 2: ITG
[Figure: ITG over the n x n iteration space (i, j = 1..n); same traversal order as for loop nest 1]
13
Fundamentals of Parallel Computer Architecture - Chapter 3 13 Answer for Loop Nest 2
True dependences:
  S2[i,j] T S3[i,j+1]
Output dependences: none
Anti dependences:
  S2[i,j] A S3[i,j] (loop-independent dependence)
[Figure: LDG over the n x n iteration space; edges point from each node [i,j] to [i,j+1], and each edge represents only a true dependence]
14
Fundamentals of Parallel Computer Architecture - Chapter 3 14 Finding Parallel Tasks Across Iterations
Analyze loop-carried dependences:
  Dependences must be obeyed (especially true dependences)
  There are opportunities when some dependences are missing
Example 1:
for (i=2; i<=n; i++)
  S: a[i] = a[i-2];
LDG: each iteration depends only on the iteration two before it, so the even iterations and the odd iterations form two independent dependence chains.
The loop can therefore be divided into two parallel tasks (one with the odd iterations and another with the even iterations):
for (i=2; i<=n; i+=2)
  S: a[i] = a[i-2];
for (i=3; i<=n; i+=2)
  S: a[i] = a[i-2];
15
Fundamentals of Parallel Computer Architecture - Chapter 3 15 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S3: a[i][j] = a[i][j-1] + 1;
There are n parallel tasks (one task per i iteration): the loop-carried dependence is only on the j loop, so different i iterations are independent.
[Figure: LDG over the n x n iteration space; edges only point from [i,j] to [i,j+1] within a row]
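Since the i iterations are independent, a minimal OpenMP sketch of this example (the function name and array dimensions are assumptions, not part of the slide) parallelizes the outer loop while each task runs its j loop serially:

  #include <omp.h>

  /* Hedged sketch: one parallel task per i iteration; the inner j loop
     stays serial because of its loop-carried dependence through a[i][j-1]. */
  void sweep(int n, double a[n+1][n+1])
  {
      int i, j;
      #pragma omp parallel for private(j)
      for (i = 1; i <= n; i++)
          for (j = 1; j <= n; j++)
              a[i][j] = a[i][j-1] + 1;   /* S3 */
  }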
16
Fundamentals of Parallel Computer Architecture - Chapter 3 16 Further Example
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++)
    S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];
Where are the parallel tasks?
[Figure: LDG over the n x n iteration space; edges point from each node to [i,j+1] and to [i+1,j], and each edge represents both a true and an anti dependence]
17
Fundamentals of Parallel Computer Architecture - Chapter 3 17 Example 3
Identify which nodes are not dependent on each other
Within each anti-diagonal, the nodes are independent of each other
Need to rewrite the code to iterate over anti-diagonals
[Figure: the same LDG with its anti-diagonals highlighted; each edge represents both a true and an anti dependence]
18
Fundamentals of Parallel Computer Architecture - Chapter 3 18 Structure of Rewritten Code
Iterate over anti-diagonals, and over the elements within an anti-diagonal; parallelize the loop over the elements:

Calculate the number of anti-diagonals
Foreach anti-diagonal do:
  calculate the number of points in the current anti-diagonal
  For each point in the current anti-diagonal do:   // the parallel loop
    compute the current point in the matrix

Write the code (shown on the next slide).
19
Fundamentals of Parallel Computer Architecture - Chapter 3 19 Implementing Solution 1
for (i=1; i <= 2*n-1; i++) {   // 2n-1 anti-diagonals
  if (i <= n) {
    points = i;       // number of points in the anti-diagonal
    row    = i;       // first point (row,col) in the anti-diagonal
    col    = 1;       // note that row+col = i+1 always
  } else {
    points = 2*n - i;
    row    = n;
    col    = i-n+1;   // note that row+col = i+1 always
  }
  for_all (k=1; k <= points; k++) {
    a[row][col] = ... // update a[row][col]
    row--;
    col++;
  }
}
// In OpenMP, the directive would be:
#pragma omp parallel for default(shared) private(k) firstprivate(row,col)
for (k=1; k <= points; k++)
  ...
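Note that with firstprivate(row, col) each thread's private copies start from the anti-diagonal's first point, so the row--/col++ updates do not line up with the iterations each thread actually executes. One way around this, shown below only as a sketch and not as the textbook's code, is to compute each point's row and col directly from k so the iterations are fully independent; the function name and the (n+2) x (n+2) padding of a are assumptions, and the update uses the statement from the earlier example.

  #include <omp.h>

  /* Hedged sketch: anti-diagonal d (where row+col = d+1) is swept in parallel;
     a is assumed padded to (n+2) x (n+2) so the neighbor reads stay in bounds. */
  void antidiag_sweep(int n, double a[n+2][n+2])
  {
      for (int d = 1; d <= 2*n - 1; d++) {          /* 2n-1 anti-diagonals */
          int points   = (d <= n) ? d : 2*n - d;     /* points on this anti-diagonal */
          int firstrow = (d <= n) ? d : n;
          #pragma omp parallel for
          for (int k = 0; k < points; k++) {
              int row = firstrow - k;
              int col = d + 1 - row;                 /* row + col = d + 1 */
              a[row][col] = a[row][col-1] + a[row][col+1]
                          + a[row-1][col] + a[row+1][col];
          }
      }
  }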
20
Fundamentals of Parallel Computer Architecture - Chapter 3 20 DOACROSS Parallelism
for (i=1; i<=N; i++) {
  S: a[i] = a[i-1] + b[i] * c[i];
}
Opportunity for parallelism?
  S[i] T S[i+1], so the loop has a loop-carried dependence; executed serially it takes N x (TS1 + TS2), where TS1 is the time for the multiply part and TS2 the time for the add part.
  But notice that the b[i] * c[i] part has no loop-carried dependence.
Can change to:
for (i=1; i<=N; i++) {
  S1: temp[i] = b[i] * c[i];
}
for (i=1; i<=N; i++) {
  S2: a[i] = a[i-1] + temp[i];
}
Now the first loop is parallel, but the second one is not, and the array temp[] introduces storage overhead.
Better solution?
21
Fundamentals of Parallel Computer Architecture - Chapter 3 21 DOACROSS Parallelism
post(0);
for (i=1; i<=N; i++) {
  S1: temp = b[i] * c[i];
  wait(i-1);
  S2: a[i] = a[i-1] + temp;
  post(i);
}
Execution time is now TS1 + N x TS2
Small storage overhead (a private scalar temp instead of an N-element array)
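post() and wait() are abstract primitives, not C library calls. Below is a minimal sketch of how the DOACROSS loop might be expressed with C11 atomics and OpenMP; the flag array done, the function name, and the schedule(static,1) choice are all assumptions, not part of the slide, and a[0] is assumed to be initialized by the caller.

  #include <omp.h>
  #include <stdatomic.h>
  #include <stdlib.h>

  /* Hedged DOACROSS sketch: post(i)/wait(i) emulated with an array of atomic flags.
     The spin-wait is kept simple; production code would wait more efficiently. */
  void doacross(int N, double *a, const double *b, const double *c)
  {
      atomic_int *done = malloc((N + 1) * sizeof(atomic_int));
      for (int i = 0; i <= N; i++)
          atomic_init(&done[i], 0);
      atomic_store(&done[0], 1);                    /* post(0) */

      #pragma omp parallel for schedule(static, 1)
      for (int i = 1; i <= N; i++) {
          double temp = b[i] * c[i];                /* S1: no loop-carried dependence */
          while (!atomic_load(&done[i-1]))          /* wait(i-1): spin until posted */
              ;
          a[i] = a[i-1] + temp;                     /* S2: serial chain through a[i-1] */
          atomic_store(&done[i], 1);                /* post(i) */
      }
      free(done);
  }

With schedule(static,1), consecutive iterations go to different threads, so each thread's S1 work overlaps with the serial chain of S2 updates, giving roughly the TS1 + N x TS2 time on the slide; the release/acquire behavior of the atomic store and load makes the write to a[i] visible before done[i] is observed.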
22
Fundamentals of Parallel Computer Architecture - Chapter 3 22 Finding Parallel Tasks in the Loop Body
Identify dependences in the loop body; if there are independent statements, the loop can be split/distributed.
for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
  S4: d[i] = d[i-1] * d[i];
}
Loop-carried dependences between statements: S1[i] A S2[i+1]
Loop-independent dependences: S1[i] T S3[i]
Note that S4 has no dependences with the other statements.
"S1[i] A S2[i+1]" implies that S2 at iteration i+1 must be executed after S1 at iteration i. Hence the dependence is not violated if all S2's are executed after all S1's.
23
Fundamentals of Parallel Computer Architecture - Chapter 3 23 After Loop Distribution
Original loop:
for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
  S4: d[i] = d[i-1] * d[i];
}
After distribution:
for (i=0; i<n; i++) {
  S1: a[i] = b[i+1] * a[i-1];
  S2: b[i] = b[i] * coef;
  S3: c[i] = 0.5 * (c[i] + a[i]);
}
for (i=0; i<n; i++) {
  S4: d[i] = d[i-1] * d[i];
}
Each loop is a parallel task; this is referred to as function parallelism (a sketch of running the two loops concurrently follows). More distribution is possible (refer to the textbook).
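A minimal sketch of the function parallelism above using OpenMP sections; the function wrapper, the starting index of 1 (so that a[i-1] and d[i-1] stay in bounds), and the assumption that b has at least n+1 elements are mine, not the slide's.

  #include <omp.h>

  /* Hedged sketch: after distribution, the S1-S3 loop and the S4 loop touch
     disjoint data (a, b, c vs. d), so they can run as two concurrent tasks. */
  void distributed(int n, double *a, double *b, double *c, double *d, double coef)
  {
      #pragma omp parallel sections
      {
          #pragma omp section
          for (int i = 1; i < n; i++) {      /* still serial inside: a[i-1] dependence */
              a[i] = b[i+1] * a[i-1];        /* S1 */
              b[i] = b[i] * coef;            /* S2 */
              c[i] = 0.5 * (c[i] + a[i]);    /* S3 */
          }
          #pragma omp section
          for (int i = 1; i < n; i++)        /* serial inside: d[i-1] dependence */
              d[i] = d[i-1] * d[i];          /* S4 */
      }
  }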
24
Fundamentals of Parallel Computer Architecture - Chapter 3 24 Identifying Concurrency (contd.)
Function parallelism:
  modest degree of parallelism, does not grow with input size
  difficult to load balance
  pipelining, as in video encoding/decoding or polygon rendering
Most scalable programs are data parallel; use both when data parallelism is limited.
25
Fundamentals of Parallel Computer Architecture - Chapter 3 25 DOPIPE Parallelism
for (i=2; i<=N; i++) {
  S1: a[i] = a[i-1] + b[i];
  S2: c[i] = c[i] + a[i];
}
Loop-carried dependence: S1[i-1] T S1[i]
Loop-independent dependence: S1[i] T S2[i]
So, where is the parallelism opportunity? DOPIPE parallelism:
for (i=2; i<=N; i++) {
  a[i] = a[i-1] + b[i];
  signal(i);
}
for (i=2; i<=N; i++) {
  wait(i);
  c[i] = c[i] + a[i];
}
What is the max speedup? See the textbook.
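signal() and wait() are again abstract primitives. A minimal DOPIPE sketch, emulating them with C11 atomic flags and running the two pipeline stages as OpenMP sections (the names ready and dopipe are assumptions, and a[1] is assumed initialized by the caller):

  #include <omp.h>
  #include <stdatomic.h>
  #include <stdlib.h>

  /* Hedged DOPIPE sketch: stage 1 produces a[i] and signals; stage 2 waits
     for the signal before consuming a[i]. */
  void dopipe(int N, double *a, const double *b, double *c)
  {
      atomic_int *ready = malloc((N + 1) * sizeof(atomic_int));
      for (int i = 0; i <= N; i++)
          atomic_init(&ready[i], 0);

      #pragma omp parallel sections
      {
          #pragma omp section                   /* stage 1: producer of a[i] */
          for (int i = 2; i <= N; i++) {
              a[i] = a[i-1] + b[i];
              atomic_store(&ready[i], 1);       /* signal(i) */
          }
          #pragma omp section                   /* stage 2: consumer of a[i] */
          for (int i = 2; i <= N; i++) {
              while (!atomic_load(&ready[i]))   /* wait(i) */
                  ;
              c[i] = c[i] + a[i];
          }
      }
      free(ready);
  }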
26
Fundamentals of Parallel Computer Architecture - Chapter 3 26 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
27
Fundamentals of Parallel Computer Architecture - Chapter 3 27 Task Creation: Algorithm Analysis
Goal: find parallelization opportunities at the algorithm level that code analysis misses
Sometimes the ITG introduces unnecessary serialization
Consider the "ocean" algorithm:
  Numerical goal: in each sweep, compute how each point is affected by its neighbors
  Hence, any order of update (within a sweep) is an approximation
  Different orderings of the updates may converge more quickly or more slowly
  Change the ordering to improve parallelism:
    Partition the iteration space into red and black points
    The red sweep and the black sweep are each fully parallel
28
Fundamentals of Parallel Computer Architecture - Chapter 3 28 Example 3: Simulating Ocean Currents
Algorithm:
While not converging to a solution do:
  foreach timestep do:
    foreach cross-section do a sweep:
      foreach point in the cross-section do:
        compute the force interaction with its neighbors
Compare with the code that implements the algorithm:
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++) {
    S1: temp = A[i][j];
    S2: A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
    S3: diff += abs(A[i][j] - temp);
  }
}
29
Fundamentals of Parallel Computer Architecture - Chapter 3 29 Red-Black Coloring
Within one sweep there is no dependence between the black points and the red points.
Restructured algorithm:
While not converging to a solution do:
  foreach timestep do:
    foreach cross-section do:
      foreach red point do:       // red sweep
        compute the force interaction
      wait until the red sweep is complete
      foreach black point do:     // black sweep
        compute the force interaction
See the textbook for the code; a minimal sketch follows.
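The textbook has the actual code; the following is only a sketch, assuming the convention that red points are those where (i+j) is even and omitting the convergence (diff) computation. The function name and the (N+2) x (N+2) padding of A are also assumptions.

  #include <omp.h>

  /* Hedged red-black sketch: each sweep is fully parallel; the implicit barrier
     at the end of the first worksharing loop separates the red and black sweeps. */
  void red_black_sweep(int N, double A[N+2][N+2])
  {
      #pragma omp parallel
      {
          #pragma omp for                         /* red sweep */
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++)
                  if ((i + j) % 2 == 0)           /* assumed: red = (i+j) even */
                      A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                     + A[i][j+1] + A[i+1][j]);
          /* implicit barrier: black points wait until the red sweep is done */
          #pragma omp for                         /* black sweep */
          for (int i = 1; i <= N; i++)
              for (int j = 1; j <= N; j++)
                  if ((i + j) % 2 == 1)
                      A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                     + A[i][j+1] + A[i+1][j]);
      }
  }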
30
Fundamentals of Parallel Computer Architecture - Chapter 3 30 Task Creation: Further Algorithm Analysis
Can the algorithm tolerate asynchronous execution?
  Simply ignore the dependences within a sweep
  The parallel program becomes nondeterministic (timing-dependent!)
for (i=1; i<=N; i++) {
  for (j=1; j<=N; j++) {
    S1: temp = A[i][j];
    S2: A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j] + A[i][j+1] + A[i+1][j]);
    S3: diff += abs(A[i][j] - temp);
  }
}
31
Fundamentals of Parallel Computer Architecture - Chapter 3 31 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
32
Fundamentals of Parallel Computer Architecture - Chapter 3 32 Determining Variable Scope
This step is specific to the shared memory programming model.
Analyze how each variable may be used across parallel tasks:
  Read-only: the variable is only read by all tasks
  R/W non-conflicting: the variable is read, written, or both by only one task
  R/W conflicting: the variable written by one task may be read by another task
33
Fundamentals of Parallel Computer Architecture - Chapter 3 33 Example 1
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
Define a parallel task as each "for i" loop iteration.
  Read-only: n, c, d
  R/W non-conflicting: a, b
  R/W conflicting: i, j
34
Fundamentals of Parallel Computer Architecture - Chapter 3 34 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S1: a[i][j] = b[i][j] + c[i][j];
    S2: b[i][j] = a[i-1][j] * d[i][j];
    S3: e[i][j] = a[i][j];
  }
Parallel task = each "for j" loop iteration.
  Read-only: n, i, c, d
  R/W non-conflicting: a, b, e
  R/W conflicting: j
35
Fundamentals of Parallel Computer Architecture - Chapter 3 35 Privatization
A privatizable variable is either:
  a conflicting variable that, in program order, is always defined (written) by a task before it is used (read) by the same task, or
  a conflicting variable whose values for different parallel tasks are known ahead of time (hence, private copies can be initialized to the known values)
Consequence: the conflicts disappear when the variable is "privatized"
  Privatization involves making private copies of a shared variable
  One private copy per thread (not per task)
  How is this achieved in the shared memory abstraction?
Result of privatization: R/W conflicting becomes R/W non-conflicting
36
Fundamentals of Parallel Computer Architecture - Chapter 3 36 Example 1
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
Define a parallel task as each "for i" loop iteration.
  Read-only: n, c, d
  R/W non-conflicting: a, b
  R/W conflicting but privatizable: i, j
  After privatization: i[ID], j[ID]
37
Fundamentals of Parallel Computer Architecture - Chapter 3 37 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S1: a[i][j] = b[i][j] + c[i][j];
    S2: b[i][j] = a[i-1][j] * d[i][j];
    S3: e[i][j] = a[i][j];
  }
Parallel task = each "for j" loop iteration.
  Read-only: n, i, c, d
  R/W non-conflicting: a, b, e
  R/W conflicting but privatizable: j
  After privatization: j[ID]
38
Fundamentals of Parallel Computer Architecture - Chapter 3 38 Reduction Variables and Operations
Reduction operation = an operation that reduces the elements of some vector/array down to one element
  Examples: SUM (+), multiplication (*), logical operations (AND, OR, ...)
Reduction variable = the scalar variable that is the result of a reduction operation
Criterion for reducibility:
  The reduction variable is updated by each task, and the order of the updates is not important
  Hence, the reduction operation must be commutative and associative
39
Fundamentals of Parallel Computer Architecture - Chapter 3 39 Reduction Operation
Compute: y = y_init op x1 op x2 op x3 ... op xn
op is a reduction operator if it is
  commutative: u op v = v op u
  and associative: (u op v) op w = u op (v op w)
Certain operations can be transformed into reduction operations (see the homeworks)
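A minimal OpenMP sketch of a SUM reduction; the function name is illustrative, and the reduction clause used here is the same one listed later among the OpenMP clauses:

  #include <omp.h>

  /* Hedged sketch: each thread accumulates a private copy of sum;
     the partial sums are combined when the loop ends. */
  double array_sum(int n, const double *x)
  {
      double sum = 0.0;
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += x[i];
      return sum;
  }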
40
Fundamentals of Parallel Computer Architecture - Chapter 3 40 Variable Partitioning
Should be declared private: privatizable variables
Should be declared shared: read-only variables, R/W non-conflicting variables
Should be declared reduction: reduction variables
Other R/W conflicting variables:
  Privatization possible? If so, privatize them
  Otherwise, declare them as shared, but protect them with synchronization
41
Fundamentals of Parallel Computer Architecture - Chapter 3 41 Example 1
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S2: a[i][j] = b[i][j] + c[i][j];
    S3: b[i][j] = a[i][j-1] * d[i][j];
  }
Declare as shared: n, c, d, a, b
Declare as private: i, j
42
Fundamentals of Parallel Computer Architecture - Chapter 3 42 Example 2
for (i=1; i<=n; i++)
  for (j=1; j<=n; j++) {
    S1: a[i][j] = b[i][j] + c[i][j];
    S2: b[i][j] = a[i-1][j] * d[i][j];
    S3: e[i][j] = a[i][j];
  }
Parallel task = each "for j" loop iteration.
Declare as shared: n, i, c, d, a, b, e
Declare as private: j
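These scope decisions map directly onto an OpenMP directive on the inner j loop. The sketch below is illustrative only; the function wrapper and the (n+1) x (n+1) array dimensions are assumptions.

  #include <omp.h>

  /* Hedged sketch: parallel task = one j iteration; j is private, the rest shared. */
  void compute(int n, double a[n+1][n+1], double b[n+1][n+1],
               double c[n+1][n+1], double d[n+1][n+1], double e[n+1][n+1])
  {
      int i, j;
      for (i = 1; i <= n; i++) {                 /* outer loop stays sequential */
          #pragma omp parallel for shared(n, i, a, b, c, d, e) private(j)
          for (j = 1; j <= n; j++) {
              a[i][j] = b[i][j] + c[i][j];       /* S1 */
              b[i][j] = a[i-1][j] * d[i][j];     /* S2 */
              e[i][j] = a[i][j];                 /* S3 */
          }
      }
  }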
43
Fundamentals of Parallel Computer Architecture - Chapter 3 43 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance)
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
44
Fundamentals of Parallel Computer Architecture - Chapter 3 44 Synchronization Primitives
Point-to-point:
  a pair of signal() and wait()
  a pair of send() and recv() in message passing
Lock:
  ensures mutual exclusion; only one thread can be in a locked region at a given time
Barrier:
  a point where a thread is allowed to go past it only when all threads have reached the point
45
Fundamentals of Parallel Computer Architecture - Chapter 3 45 Lock
// inside a parallel region
for (i=start_iter; i<end_iter; i++)
  sum = sum + a[i];
What problem may arise here?
// inside a parallel region
for (i=start_iter; i<end_iter; i++) {
  lock(x);
  sum = sum + a[i];
  unlock(x);
}
A lock ensures that only one thread is inside the locked region at a time.
Issues:
  What granularity to lock?
  How to build a lock that is correct and fast?
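In OpenMP, the lock/unlock pair can be written as a critical section; the sketch below is illustrative only (the function wrapper, the use of a parallel for instead of per-thread iteration ranges, and the loop bounds are assumptions, not the slide's code).

  #include <omp.h>

  /* Hedged sketch: the critical section plays the role of lock(x)/unlock(x). */
  double sum_with_critical(int n, const double *a)
  {
      double sum = 0.0;
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
          #pragma omp critical
          sum = sum + a[i];
      }
      return sum;
  }

For this particular loop, a reduction(+:sum) clause would remove the per-iteration serialization entirely, which is exactly the granularity question the slide raises.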
46
Fundamentals of Parallel Computer Architecture - Chapter 3 46 Barrier: Global Event Synchronization
Load balance is important: execution time is determined by the slowest thread
This is one reason for gang scheduling and for avoiding time sharing and context switching
47
Fundamentals of Parallel Computer Architecture - Chapter 3 47 Group Event Synchronization
Only a subset of processes is involved
  Can use flags or barriers (involving only the subset)
  Concept of producers and consumers
Major types:
  Single-producer, multiple-consumer: e.g., the producer sets a flag that the consumers spin on
  Multiple-producer, single-consumer: e.g., a barrier on the producers; the last process sets the flag that the consumer spins on
  Multiple-producer, multiple-consumer
48
Fundamentals of Parallel Computer Architecture - Chapter 3 48 Parallel Programming
Task Creation (correctness)
  Finding parallel tasks
    Code analysis
    Algorithm analysis
  Variable partitioning
    Shared vs. private vs. reduction
  Synchronization
Task Mapping (performance) - details later
  Static vs. dynamic
  Block vs. cyclic
  Dimension mapping: column-wise vs. row-wise
  Communication and data locality considerations
49
Fundamentals of Parallel Computer Architecture - Chapter 3 49 Intro to OpenMP: Directive Format
Refer to http://www.openmp.org and the OpenMP 2.0 specification on the course web site for more details.
#pragma omp directive-name [clause[ [,] clause]...] new-line
For example:
#pragma omp for [clause[[,] clause]...] new-line
  for-loop
The clause is one of:
  private(variable-list)
  firstprivate(variable-list)
  lastprivate(variable-list)
  reduction(operator: variable-list)
  ordered
  schedule(kind[, chunk_size])
  nowait
50
Fundamentals of Parallel Computer Architecture - Chapter 3 50 Very Very Short Intro to OpenMP
Parallel for loop:
#include <omp.h>
//...
#pragma omp parallel default(shared)
{
  ...
  #pragma omp for private(i)
  for (i=0; i<n; i++)
    A[i] = A[i]*A[i] - 3.0;
}
Parallel sections:
#pragma omp parallel shared(A,B) private(i)
{
  #pragma omp sections nowait
  {
    #pragma omp section
    for (i=0; i<n; i++)
      A[i] = A[i]*A[i] - 4.0;
    #pragma omp section
    for (i=0; i<n; i++)
      B[i] = B[i]*B[i] + 9.0;
  } // end omp sections
} // end omp parallel
51
Fundamentals of Parallel Computer Architecture - Chapter 3 51 Types of Variables
shared, private, reduction, firstprivate, lastprivate
Semi-private data for parallel loops:
  reduction: variable that is the target of a reduction operation performed by the loop, e.g., sum
  firstprivate: initialize each private copy from the value of the shared variable just prior to the parallel section
  lastprivate: upon loop exit, the master thread holds the value seen by the thread assigned the last loop iteration (for parallel loops only)
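A minimal sketch of firstprivate and lastprivate in one loop; the variable names offset and last and the function shift are hypothetical, not from the slides.

  #include <omp.h>

  /* Hedged sketch: firstprivate initializes each thread's copy of offset from
     the shared value; lastprivate copies last from the sequentially final
     iteration (i = n-1) back to the shared variable after the loop. */
  double shift(int n, double *a)
  {
      int offset = 10;
      double last = 0.0;
      #pragma omp parallel for firstprivate(offset) lastprivate(last)
      for (int i = 0; i < n; i++) {
          a[i] = a[i] + offset;   /* reads the private, pre-initialized copy */
          last = a[i];            /* after the loop, last == a[n-1] */
      }
      return last;
  }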
52
Fundamentals of Parallel Computer Architecture - Chapter 3 52 Barriers
Barriers are implicit at the end of each parallel section
When barriers are not needed for correctness, use the nowait clause
The schedule clause will be discussed later
#include <omp.h>
//...
#pragma omp parallel
{
  ...
  #pragma omp for nowait private(i)
  for (i=0; i<n; i++)
    A[i] = A[i]*A[i] - 3.0;
}
53
Fundamentals of Parallel Computer Architecture - Chapter 3 53 Compile, Run, and Profile
First, find the subroutines/loops that take most of the execution time (pixie, gprof, ssrun, ...)
Parallelize or auto-parallelize the loops, e.g.:
  icc -parallel prog.c
  f77 -pfa prog.f
Compile the application, e.g.:
  cc -mp -O3 prog.c
Set the number of threads, e.g.:
  setenv MP_NUM_THREADS 8
  (can also be set from within the application)
Run it, e.g.:
  prog
or run it with profiling:
  ssrun -pcsamp prog
54
Fundamentals of Parallel Computer Architecture - Chapter 3 54 Matrix Multiplication Example
Reading assignment: read Section 3.8 in the textbook.
In MP1, you will be asked to parallelize a given code.