Chapter 3 Shared Memory Parallel Programming Yan Solihin Copyright notice: No part of this publication may be reproduced, stored in a retrieval system, or transmitted by any means (electronic, mechanical, photocopying, recording, or otherwise) without the prior written permission of the author. An exception is granted for academic lectures at universities and colleges, provided that the following text is included in such copy: “Source: Yan Solihin, Fundamentals of Parallel Computer Architecture, 2008”.

Fundamentals of Parallel Computer Architecture - Chapter 3 2 Steps in Creating a Parallel Program
- Task Creation: identifying parallel tasks, variable scopes, synchronization
- Task Mapping: grouping tasks, mapping them to processors/memory

Fundamentals of Parallel Computer Architecture - Chapter 3 3 Parallel Programming
- Task Creation (correctness)
  - Finding parallel tasks
    - Code analysis
    - Algorithm analysis
  - Variable partitioning
    - Shared vs. private vs. reduction
  - Synchronization
- Task mapping (performance)
  - Static vs. dynamic
  - Block vs. cyclic
  - Dimension mapping: column-wise vs. row-wise
  - Communication and data locality considerations

Fundamentals of Parallel Computer Architecture - Chapter 3 4 Code Analysis
- Goal: given the code, without knowledge of the algorithm, find parallel tasks
- Focus on loop dependence analysis
- Notation:
  - S is a statement in the source code
  - S[i,j,...] denotes statement S in loop iteration [i,j,...]
  - "S1 then S2" means that S1 happens before S2
  - If S1 then S2:
    - S1 T S2 denotes true dependence, i.e., S1 writes to a location that is read by S2
    - S1 A S2 denotes anti dependence, i.e., S1 reads a location written by S2
    - S1 O S2 denotes output dependence, i.e., S1 writes to the same location written by S2

Fundamentals of Parallel Computer Architecture - Chapter 3 5 Example

  S1: x = 2;
  S2: y = x;
  S3: y = x + z;
  S4: z = 6;

- Dependences:
  - S1 T S2
  - S1 T S3
  - S3 A S4
  - S2 O S3

Fundamentals of Parallel Computer Architecture - Chapter 3 6 Loop-independent vs. loop-carried dependence
- Loop-carried dependence: the dependence exists across iterations, i.e., if the loop is removed, the dependence no longer exists
- Loop-independent dependence: the dependence exists within an iteration, i.e., if the loop is removed, the dependence still exists

  for (i=1; i<n; i++) {
    S1: a[i] = a[i-1] + 1;
    S2: b[i] = a[i];
  }

  for (i=1; i<n; i++)
    for (j=1; j<n; j++)
      S3: a[i][j] = a[i][j-1] + 1;

  for (i=1; i<n; i++)
    for (j=1; j<n; j++)
      S4: a[i][j] = a[i-1][j] + 1;

- S1[i] T S1[i+1]: loop-carried
- S1[i] T S2[i]: loop-independent
- S3[i,j] T S3[i,j+1]: loop-carried on the "for j" loop; no loop-carried dependence on the "for i" loop
- S4[i,j] T S4[i+1,j]: loop-carried on the "for i" loop; no loop-carried dependence on the "for j" loop

Fundamentals of Parallel Computer Architecture - Chapter 3 7 Iteration-space Traversal Graph (ITG)
- The ITG shows graphically the order of traversal in the iteration space (the happens-before relationship)
  - Node = a point in the iteration space
  - Directed edge = the next point that will be encountered after the current point is traversed
- Example:

  for (i=1; i<4; i++)
    for (j=1; j<4; j++)
      S3: a[i][j] = a[i][j-1] + 1;

[figure: ITG for the 3x3 iteration space, traversed row by row (increasing i), left to right within each row (increasing j)]

Fundamentals of Parallel Computer Architecture - Chapter 3 8 Loop-carried Dependence Graph (LDG)
- The LDG shows the true/anti/output dependence relationships graphically
  - Node = a point in the iteration space
  - Directed edge = a dependence
- Example:

  for (i=1; i<4; i++)
    for (j=1; j<4; j++)
      S3: a[i][j] = a[i][j-1] + 1;

  S3[i,j] T S3[i,j+1]

[figure: LDG for the 3x3 iteration space; true-dependence edges point from each point to its right neighbor along the j direction]

Fundamentals of Parallel Computer Architecture - Chapter 3 9 Further example

  // Loop nest 1
  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++)
      S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];

  // Loop nest 2
  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S2: a[i][j] = b[i][j] + c[i][j];
      S3: b[i][j] = a[i][j-1] * d[i][j];
    }

- Draw the ITG
- List all the dependence relationships
- Draw the LDG

Fundamentals of Parallel Computer Architecture - Chapter 3 10 Answer for Loop Nest 1
- ITG:

[figure: ITG over the n x n iteration space, traversed row by row (increasing i), left to right within each row (increasing j)]

Fundamentals of Parallel Computer Architecture - Chapter 3 11 Answer for Loop Nest 1
- True dependences:
  - S1[i,j] T S1[i,j+1]
  - S1[i,j] T S1[i+1,j]
- Output dependences: none
- Anti dependences:
  - S1[i,j] A S1[i+1,j]
  - S1[i,j] A S1[i,j+1]
- LDG:

[figure: LDG over the n x n iteration space; each edge represents both a true and an anti dependence, along both the j and i directions]

Fundamentals of Parallel Computer Architecture - Chapter 3 12 Answer for Loop Nest 2
- ITG:

[figure: ITG over the n x n iteration space, traversed row by row (increasing i), left to right within each row (increasing j)]

Fundamentals of Parallel Computer Architecture - Chapter 3 13 Answer for Loop Nest 2
- True dependences:
  - S2[i,j] T S3[i,j+1]
- Output dependences: none
- Anti dependences:
  - S2[i,j] A S3[i,j] (loop-independent dependence)
- LDG:

[figure: LDG over the n x n iteration space; each edge represents only a true dependence, from each point to its right neighbor along the j direction]

Fundamentals of Parallel Computer Architecture - Chapter 3 14 Finding parallel tasks across iterations
- Analyze loop-carried dependences:
  - Dependences must be obeyed (especially true dependences)
  - There are opportunities when some dependences are missing
- Example 1:

  for (i=2; i<=n; i++)
    S: a[i] = a[i-2];

- LDG: each iteration i depends on iteration i-2, forming two independent chains (one over even i, one over odd i)
- Can divide the loop into two parallel tasks (one with odd iterations and another with even iterations):

  for (i=2; i<=n; i+=2)
    S: a[i] = a[i-2];

  for (i=3; i<=n; i+=2)
    S: a[i] = a[i-2];
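A minimal sketch (my own, not from the slides) of how these two tasks could be expressed with OpenMP sections; each section remains sequential internally because of the a[i] = a[i-2] chain:

  #include <omp.h>

  /* Assumes a[] has at least n+1 elements and a[0], a[1] are initialized. */
  void update_even_odd(double *a, int n)
  {
      int i;
      #pragma omp parallel sections private(i)
      {
          #pragma omp section
          for (i = 2; i <= n; i += 2)      /* task 1: even-iteration chain */
              a[i] = a[i-2];

          #pragma omp section
          for (i = 3; i <= n; i += 2)      /* task 2: odd-iteration chain */
              a[i] = a[i-2];
      }
  }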

Fundamentals of Parallel Computer Architecture - Chapter 3 15 Example 2
- Example 2:

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      S3: a[i][j] = a[i][j-1] + 1;

- LDG:

[figure: true-dependence edges only along the j direction within each row; no edges between rows]

- There are n parallel tasks (one task per i iteration)

Fundamentals of Parallel Computer Architecture - Chapter 3 16 Further example

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++)
      S1: a[i][j] = a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j];

- LDG:

[figure: LDG over the n x n iteration space; each edge represents both a true and an anti dependence, along both the i and j directions]

- Where are the parallel tasks?

Fundamentals of Parallel Computer Architecture - Chapter 3 17 Example 3
- Identify which nodes are not dependent on each other
- In each anti-diagonal, the nodes are independent of each other
- Need to rewrite the code to iterate over anti-diagonals

[figure: the same LDG with its anti-diagonals highlighted; each edge represents both a true and an anti dependence]

Fundamentals of Parallel Computer Architecture - Chapter 3 18 Structure of Rewritten Code
- Iterate over anti-diagonals, and over elements within an anti-diagonal:

  Calculate number of anti-diagonals
  Foreach anti-diagonal do:
    calculate number of points in the current anti-diagonal
    For each point in the current anti-diagonal do:
      compute the current point in the matrix

- Parallelize the highlighted loop (the loop over points within an anti-diagonal)
- Write the code...

Fundamentals of Parallel Computer Architecture - Chapter 3 19 Implementing Solution 1

  for (i=1; i <= 2*n-1; i++) {   // 2n-1 anti-diagonals
    if (i <= n) {
      points = i;        // number of points in anti-diagonal
      row = i;           // first point (row,col) in anti-diagonal
      col = 1;           // note that row+col = i+1 always
    }
    else {
      points = 2*n - i;
      row = n;
      col = i-n+1;       // note that row+col = i+1 always
    }
    for_all (k=1; k <= points; k++) {
      a[row][col] = ...  // update a[row][col]
      row--;
      col++;
    }
  }

  // in OpenMP, the directive would be ...
  #pragma omp parallel for default(shared) private(k) firstprivate(row,col)
  for (k=1; k <= points; k++)
    ...
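A more complete sketch of the anti-diagonal sweep (my own filling-in of the elided parts; the update formula is assumed to be the 0.2-weighted neighbor average from the Ocean example later in this chapter, and a is assumed to be an (n+2) x (n+2) grid with boundary rows and columns). Computing (row, col) from k makes the iterations of the inner loop fully independent:

  #include <omp.h>

  void antidiagonal_sweep(double **a, int n)
  {
      int i, k;
      for (i = 1; i <= 2*n - 1; i++) {             /* 2n-1 anti-diagonals, done in order */
          int points   = (i <= n) ? i : 2*n - i;   /* points in this anti-diagonal */
          int firstrow = (i <= n) ? i : n;         /* first (row,col); row+col = i+1 */
          int firstcol = (i <= n) ? 1 : i - n + 1;

          #pragma omp parallel for default(shared) private(k)
          for (k = 0; k < points; k++) {
              int row = firstrow - k;              /* each k owns its own point */
              int col = firstcol + k;
              a[row][col] = 0.2 * (a[row][col] + a[row][col-1] + a[row][col+1]
                                   + a[row-1][col] + a[row+1][col]);
          }
      }
  }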

Fundamentals of Parallel Computer Architecture - Chapter 3 20 DOACROSS Parallelism
- Opportunity for parallelism?

  for (i=1; i<=N; i++) {
    S: a[i] = a[i-1] + b[i] * c[i];
  }

- S[i] T S[i+1], so the loop has a loop-carried dependence
- But notice that the b[i] * c[i] part has no loop-carried dependence
- Can change to:

  for (i=1; i<=N; i++) {
    S1: temp[i] = b[i] * c[i];
  }
  for (i=1; i<=N; i++) {
    S2: a[i] = a[i-1] + temp[i];
  }

- Now the first loop is parallel, but the second one is not
- Execution time: N x (T_S1 + T_S2)
- The array temp[] introduces storage overhead
- Better solution?

Fundamentals of Parallel Computer Architecture - Chapter 3 21 DOACROSS Parallelism

  post(0);
  for (i=1; i<=N; i++) {
    S1: temp = b[i] * c[i];
        wait(i-1);
    S2: a[i] = a[i-1] + temp;
        post(i);
  }

- Execution time is now T_S1 + N x T_S2
- Small storage overhead
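One way the abstract post()/wait() pair could be realized is with POSIX counting semaphores and an OpenMP loop; this is only an illustrative sketch (N, the arrays, and the per-iteration semaphore array are assumptions of the sketch), not the textbook's implementation:

  #include <omp.h>
  #include <semaphore.h>

  #define N 1000
  double a[N+1], b[N+1], c[N+1];
  sem_t done[N+1];                     /* done[i] is posted once a[i] is ready */

  void doacross(void)
  {
      int i;
      for (i = 0; i <= N; i++)
          sem_init(&done[i], 0, i == 0 ? 1 : 0);   /* post(0): a[0] is already valid */

      #pragma omp parallel for schedule(static, 1) private(i)
      for (i = 1; i <= N; i++) {
          double temp = b[i] * c[i];   /* S1: no loop-carried dependence */
          sem_wait(&done[i-1]);        /* wait(i-1) */
          a[i] = a[i-1] + temp;        /* S2: the serialized part */
          sem_post(&done[i]);          /* post(i) */
      }
  }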

Fundamentals of Parallel Computer Architecture - Chapter 3 22 Finding Parallel Tasks in Loop Body
- Identify dependences in a loop body
- If there are independent statements, the loop can be split/distributed

  for (i=0; i<n; i++) {
    S1: a[i] = b[i+1] * a[i-1];
    S2: b[i] = b[i] * coef;
    S3: c[i] = 0.5 * (c[i] + a[i]);
    S4: d[i] = d[i-1] * d[i];
  }

- Loop-carried dependences: S1[i] A S2[i+1]
- Loop-independent dependences: S1[i] T S3[i]
- Note that S4 has no dependences with the other statements
- "S1[i] A S2[i+1]" implies that S2 at iteration i+1 must be executed after S1 at iteration i. Hence the dependence is not violated if all S2's are executed after all S1's

Fundamentals of Parallel Computer Architecture - Chapter 3 23 After loop distribution

  // before distribution
  for (i=0; i<n; i++) {
    S1: a[i] = b[i+1] * a[i-1];
    S2: b[i] = b[i] * coef;
    S3: c[i] = 0.5 * (c[i] + a[i]);
    S4: d[i] = d[i-1] * d[i];
  }

  // after distribution
  for (i=0; i<n; i++) {
    S1: a[i] = b[i+1] * a[i-1];
    S2: b[i] = b[i] * coef;
    S3: c[i] = 0.5 * (c[i] + a[i]);
  }
  for (i=0; i<n; i++) {
    S4: d[i] = d[i-1] * d[i];
  }

- Each loop is a parallel task; this is referred to as function parallelism
- More distribution is possible (refer to the textbook)
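One possible way to express this function parallelism is with OpenMP sections, one per distributed loop (a sketch under my own assumptions: the loops start at i = 1 so that a[i-1] and d[i-1] stay in bounds, and b has at least n+1 elements); each loop stays sequential internally because of its own loop-carried dependences:

  #include <omp.h>

  void distributed_loops(double *a, double *b, double *c, double *d,
                         double coef, int n)
  {
      int i;
      #pragma omp parallel sections private(i)
      {
          #pragma omp section            /* task 1: S1, S2, S3 */
          for (i = 1; i < n; i++) {
              a[i] = b[i+1] * a[i-1];
              b[i] = b[i] * coef;
              c[i] = 0.5 * (c[i] + a[i]);
          }

          #pragma omp section            /* task 2: S4 */
          for (i = 1; i < n; i++)
              d[i] = d[i-1] * d[i];
      }
  }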

Fundamentals of Parallel Computer Architecture - Chapter 3 24 Identifying Concurrency (contd.)
- Function parallelism:
  - modest degree, does not grow with input size
  - difficult to load balance
  - pipelining, as in video encoding/decoding, or polygon rendering
- Most scalable programs are data parallel
  - use both when data parallelism is limited

Fundamentals of Parallel Computer Architecture - Chapter 3 25 DOPIPE Parallelism

  for (i=2; i<=N; i++) {
    S1: a[i] = a[i-1] + b[i];
    S2: c[i] = c[i] + a[i];
  }

- Loop-carried dependence: S1[i-1] T S1[i]
- Loop-independent dependence: S1[i] T S2[i]
- So, where is the parallelism opportunity? DOPIPE parallelism:

  // pipeline stage 1
  for (i=2; i<=N; i++) {
    a[i] = a[i-1] + b[i];
    signal(i);
  }

  // pipeline stage 2
  for (i=2; i<=N; i++) {
    wait(i);
    c[i] = c[i] + a[i];
  }

- What is the maximum speedup? (see the textbook)
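A rough sketch of how the signal()/wait() pipeline might be realized, again assuming POSIX semaphores (my illustration, not the textbook's code): one section is the producer stage for a[i], the other is the consumer stage.

  #include <omp.h>
  #include <semaphore.h>

  #define N 1000
  double a[N+1], b[N+1], c[N+1];
  sem_t ready;                         /* counts how many a[i] have been produced */

  void dopipe(void)
  {
      int i;
      sem_init(&ready, 0, 0);

      #pragma omp parallel sections private(i)
      {
          #pragma omp section          /* producer stage: S1 */
          for (i = 2; i <= N; i++) {
              a[i] = a[i-1] + b[i];
              sem_post(&ready);        /* signal(i) */
          }

          #pragma omp section          /* consumer stage: S2 */
          for (i = 2; i <= N; i++) {
              sem_wait(&ready);        /* wait(i): a[i] is now ready */
              c[i] = c[i] + a[i];
          }
      }
  }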

Fundamentals of Parallel Computer Architecture - Chapter 3 26 Parallel Programming
- Task Creation (correctness)
  - Finding parallel tasks
    - Code analysis
    - Algorithm analysis
  - Variable partitioning
    - Shared vs. private vs. reduction
  - Synchronization
- Task mapping (performance)
  - Static vs. dynamic
  - Block vs. cyclic
  - Dimension mapping: column-wise vs. row-wise
  - Communication and data locality considerations

Fundamentals of Parallel Computer Architecture - Chapter 3 27 Task Creation: Algorithm Analysis
- Goal: code analysis misses parallelization opportunities that are available at the algorithm level
- Sometimes the ITG introduces unnecessary serialization
- Consider the "ocean" algorithm:
  - Numerical goal: at each sweep, compute how each point is affected by its neighbors
  - Hence, any order of updates (within a sweep) is an approximation
  - Different orderings of updates may converge quicker or slower
  - Change the ordering to improve parallelism:
    - partition the iteration space into red and black points
    - the red sweep and the black sweep are each fully parallel

Fundamentals of Parallel Computer Architecture - Chapter 3 28 Example 3: Simulating Ocean Currents
- Algorithm:

  While not converging to a solution do:
    foreach timestep do:
      foreach cross section do a sweep:
        foreach point in a cross section do:
          compute the force interaction with its neighbors

- Compare with the code that implements the algorithm:

  for (i=1; i<=N; i++) {
    for (j=1; j<=N; j++) {
      S1: temp = A[i][j];
      S2: A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                           + A[i][j+1] + A[i+1][j]);
      S3: diff += abs(A[i][j] - temp);
    }
  }

Fundamentals of Parallel Computer Architecture - Chapter 3 29 Red-Black Coloring
- In one sweep, there is no dependence between the black points and the red points
- Restructured algorithm:

  While not converging to a solution do:
    foreach timestep do:
      foreach cross section do:
        foreach red point do:        // red sweep
          compute the force interaction
        wait until the red sweep is done
        foreach black point do:      // black sweep
          compute the force interaction

- See the textbook for the code
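A minimal sketch of one red-black sweep with OpenMP (my own illustration using the 0.2-weighted update from the code on the previous slide; the textbook's version may differ): a point (i, j) is treated as red when (i + j) is even and black when it is odd, and the implicit barrier at the end of the first parallel loop plays the role of "wait until the red sweep is done".

  #include <math.h>
  #include <omp.h>

  /* One red-black sweep over the interior of an (N+2) x (N+2) grid.
     Returns the accumulated difference so the caller can test convergence. */
  double red_black_sweep(double **A, int N)
  {
      double diff = 0.0;
      int i, j, color;

      for (color = 0; color <= 1; color++) {          /* 0 = red sweep, 1 = black sweep */
          #pragma omp parallel for private(i, j) reduction(+:diff)
          for (i = 1; i <= N; i++) {
              for (j = 1; j <= N; j++) {
                  if ((i + j) % 2 != color) continue; /* skip points of the other color */
                  double temp = A[i][j];
                  A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                   + A[i][j+1] + A[i+1][j]);
                  diff += fabs(A[i][j] - temp);
              }
          }   /* the barrier implied here separates the red and black sweeps */
      }
      return diff;
  }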

Fundamentals of Parallel Computer Architecture - Chapter 3 30 Task Creation: Further Algorithm Analysis
- Can the algorithm tolerate asynchronous execution?
  - simply ignore the dependences within a sweep
  - the parallel program becomes nondeterministic (timing-dependent!)

  for (i=1; i<=N; i++) {
    for (j=1; j<=N; j++) {
      S1: temp = A[i][j];
      S2: A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                           + A[i][j+1] + A[i+1][j]);
      S3: diff += abs(A[i][j] - temp);
    }
  }

Fundamentals of Parallel Computer Architecture - Chapter 3 31 Parallel Programming
- Task Creation (correctness)
  - Finding parallel tasks
    - Code analysis
    - Algorithm analysis
  - Variable partitioning
    - Shared vs. private vs. reduction
  - Synchronization
- Task mapping (performance)
  - Static vs. dynamic
  - Block vs. cyclic
  - Dimension mapping: column-wise vs. row-wise
  - Communication and data locality considerations

Fundamentals of Parallel Computer Architecture - Chapter 3 32 Determining Variable Scope
- This step is specific to the shared memory programming model
- Analyze how each variable may be used across parallel tasks:
  - Read-only: the variable is only read by all tasks
  - R/W non-conflicting: the variable is read, written, or both by only one task
  - R/W conflicting: a variable written by one task may be read by another

Fundamentals of Parallel Computer Architecture - Chapter 3 33 Example 1

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S2: a[i][j] = b[i][j] + c[i][j];
      S3: b[i][j] = a[i][j-1] * d[i][j];
    }

- Define a parallel task as each "for i" loop iteration
- Read-only: n, c, d
- R/W non-conflicting: a, b
- R/W conflicting: i, j

Fundamentals of Parallel Computer Architecture - Chapter 3 34 Example 2

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S1: a[i][j] = b[i][j] + c[i][j];
      S2: b[i][j] = a[i-1][j] * d[i][j];
      S3: e[i][j] = a[i][j];
    }

- Parallel task = each "for j" loop iteration
- Read-only: n, i, c, d
- R/W non-conflicting: a, b, e
- R/W conflicting: j

Fundamentals of Parallel Computer Architecture - Chapter 3 35 Privatization
- Privatizable variable =
  - a conflicting variable that, in program order, is always defined (written) by a task before it is used (read) by the same task, or
  - a conflicting variable whose values for different parallel tasks are known ahead of time (hence, private copies can be initialized to the known values)
- Consequence: conflicts disappear when the variable is "privatized"
- Privatization involves making private copies of a shared variable
  - one private copy per thread (not per task)
  - how is this achieved in the shared memory abstraction?
- Result of privatization: R/W conflicting -> R/W non-conflicting

Fundamentals of Parallel Computer Architecture - Chapter 3 36 Example 1

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S2: a[i][j] = b[i][j] + c[i][j];
      S3: b[i][j] = a[i][j-1] * d[i][j];
    }

- Define a parallel task as each "for i" loop iteration
- Read-only: n, c, d
- R/W non-conflicting: a, b
- R/W conflicting but privatizable: i, j
- After privatization: i[ID], j[ID]

Fundamentals of Parallel Computer Architecture - Chapter 3 37 Example 2

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S1: a[i][j] = b[i][j] + c[i][j];
      S2: b[i][j] = a[i-1][j] * d[i][j];
      S3: e[i][j] = a[i][j];
    }

- Parallel task = each "for j" loop iteration
- Read-only: n, i, c, d
- R/W non-conflicting: a, b, e
- R/W conflicting but privatizable: j
- After privatization: j[ID]

Fundamentals of Parallel Computer Architecture - Chapter 3 38 Reduction Variables and Operations
- Reduction operation = an operation that reduces the elements of some vector/array down to one element
  - Examples: SUM (+), multiplication (*), logical operations (AND, OR, ...)
- Reduction variable = the scalar variable that is the result of a reduction operation
- Criteria for reducibility:
  - the reduction variable is updated by each task, and the order of updates is not important
  - hence, the reduction operation must be commutative and associative

Fundamentals of Parallel Computer Architecture - Chapter 3 39 Reduction Operation
- Compute y = y_init op x1 op x2 op x3 ... op xn
- op is a reduction operator if it is commutative
    u op v = v op u
  and associative
    (u op v) op w = u op (v op w)
- Certain operations can be transformed into reduction operations (see the homeworks)
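For example, a sum reduction maps directly onto OpenMP's reduction clause; a small sketch (the array and its length are assumptions of the example):

  #include <omp.h>

  double sum_reduce(const double *x, int n)
  {
      double y = 0.0;                 /* y_init for op = + */
      int i;
      /* Each thread accumulates into a private partial sum; the partial sums
         are combined at the end, which is legal because + is commutative
         and associative. */
      #pragma omp parallel for reduction(+:y) private(i)
      for (i = 0; i < n; i++)
          y = y + x[i];
      return y;
  }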

Fundamentals of Parallel Computer Architecture - Chapter 3 40 Variable Partitioning
- Should be declared private:
  - privatizable variables
- Should be declared shared:
  - read-only variables
  - R/W non-conflicting variables
- Should be declared reduction:
  - reduction variables
- Other R/W conflicting variables:
  - privatization possible? If so, privatize them
  - otherwise, declare them as shared, but protect them with synchronization

Fundamentals of Parallel Computer Architecture - Chapter 3 41 Example 1

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S2: a[i][j] = b[i][j] + c[i][j];
      S3: b[i][j] = a[i][j-1] * d[i][j];
    }

- Declare as shared: n, c, d, a, b
- Declare as private: i, j

Fundamentals of Parallel Computer Architecture - Chapter 3 42 Example 2

  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++) {
      S1: a[i][j] = b[i][j] + c[i][j];
      S2: b[i][j] = a[i-1][j] * d[i][j];
      S3: e[i][j] = a[i][j];
    }

- Parallel task = each "for j" loop iteration
- Declare as shared: n, i, c, d, a, b, e
- Declare as private: j
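In OpenMP syntax, this scoping decision might look like the following sketch (my own arrangement; the arrays are assumed to be at least (n+1) x (n+1) with row 0 of a initialized, and only the inner "for j" loop is treated as the parallel task):

  #include <omp.h>

  void example2(double **a, double **b, double **c, double **d, double **e, int n)
  {
      int i, j;
      for (i = 1; i <= n; i++) {
          /* n, i, c, d are read-only; a, b, e are R/W non-conflicting,
             so all of them are declared shared; j is private. */
          #pragma omp parallel for default(shared) private(j)
          for (j = 1; j <= n; j++) {
              a[i][j] = b[i][j] + c[i][j];
              b[i][j] = a[i-1][j] * d[i][j];
              e[i][j] = a[i][j];
          }
      }
  }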

Fundamentals of Parallel Computer Architecture - Chapter 3 43 Parallel Programming
- Task Creation (correctness)
  - Finding parallel tasks
    - Code analysis
    - Algorithm analysis
  - Variable partitioning
    - Shared vs. private vs. reduction
  - Synchronization
- Task mapping (performance)
  - Static vs. dynamic
  - Block vs. cyclic
  - Dimension mapping: column-wise vs. row-wise
  - Communication and data locality considerations

Fundamentals of Parallel Computer Architecture - Chapter 3 44 Synchronization Primitives
- Point-to-point
  - a pair of signal() and wait()
  - a pair of send() and recv() in message passing
- Lock
  - ensures mutual exclusion: only one thread can be inside a locked region at a given time
- Barrier
  - a point that a thread is allowed to go past only when all threads have reached it

Fundamentals of Parallel Computer Architecture - Chapter 3 45 Lock
- What problem may arise here?

  // inside a parallel region
  for (i=start_iter; i<end_iter; i++)
    sum = sum + a[i];

- A lock ensures that only one thread is inside the locked region at a time:

  // inside a parallel region
  for (i=start_iter; i<end_iter; i++) {
    lock(x);
    sum = sum + a[i];
    unlock(x);
  }

- Issues:
  - What granularity to lock?
  - How to build a lock that is correct and fast?
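In OpenMP, the lock(x)/unlock(x) pair around the update could be expressed with a critical section; a sketch (the array and bounds are assumptions), with a note that for this particular pattern a reduction clause would avoid the locking cost altogether:

  #include <omp.h>

  double locked_sum(const double *a, int n)
  {
      double sum = 0.0;
      int i;
      #pragma omp parallel for private(i)
      for (i = 0; i < n; i++) {
          #pragma omp critical         /* plays the role of lock(x)/unlock(x) */
          sum = sum + a[i];
      }
      /* For a plain sum, reduction(+:sum) on the parallel for is both
         correct and much faster than per-iteration locking. */
      return sum;
  }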

Fundamentals of Parallel Computer Architecture - Chapter 3 46 Barrier: Global Event Synchronization
- Load balance is important
- Execution time is determined by the slowest thread
  - one reason for gang scheduling and for avoiding time sharing and context switching

Fundamentals of Parallel Computer Architecture - Chapter 3 47 Group Event Synchronization
- A subset of processes is involved
  - can use flags or barriers (involving only the subset)
  - concept of producers and consumers
- Major types:
  - single-producer, multiple-consumer
    - e.g., the producer sets a flag that the consumers spin on
  - multiple-producer, single-consumer
    - e.g., barrier on the producers, then the last process sets the flag that the consumer spins on
  - multiple-producer, multiple-consumer

Fundamentals of Parallel Computer Architecture - Chapter 3 48 Parallel Programming
- Task Creation (correctness)
  - Finding parallel tasks
    - Code analysis
    - Algorithm analysis
  - Variable partitioning
    - Shared vs. private vs. reduction
  - Synchronization
- Task mapping (performance) - details later
  - Static vs. dynamic
  - Block vs. cyclic
  - Dimension mapping: column-wise vs. row-wise
  - Communication and data locality considerations

Fundamentals of Parallel Computer Architecture - Chapter 3 49 Intro to OpenMP: directives format
- Refer to the OpenMP 2.0 specification and the materials on the course web site for more details
- General form:

  #pragma omp directive-name [clause[ [,] clause]...] new-line

- For example:

  #pragma omp for [clause[[,] clause]... ] new-line
    for-loop

- The clause is one of:
  - private(variable-list)
  - firstprivate(variable-list)
  - lastprivate(variable-list)
  - reduction(operator: variable-list)
  - ordered
  - schedule(kind[, chunk_size])
  - nowait

Fundamentals of Parallel Computer Architecture - Chapter 3 50 Very Very Short Intro to OpenMP
- Parallel for loop:

  #include <omp.h>
  //...
  #pragma omp parallel default(shared)
  {
    ...
    #pragma omp for private(i)
    for (i=0; i<n; i++)
      A[i] = A[i]*A[i] - 3.0;
  }

- Parallel sections:

  #pragma omp parallel shared(A,B) private(i)
  {
    #pragma omp sections nowait
    {
      #pragma omp section
      for (i=0; i<n; i++)
        A[i] = A[i]*A[i] - 4.0;

      #pragma omp section
      for (i=0; i<n; i++)
        B[i] = B[i]*B[i] + 9.0;
    } // end omp sections
  } // end omp parallel

Fundamentals of Parallel Computer Architecture - Chapter 3 51 Type of variables
- shared, private, reduction, firstprivate, lastprivate
- Semi-private data for parallel loops:
  - reduction: variable that is the target of a reduction operation performed by the loop, e.g., sum
  - firstprivate: initialize each private copy from the value of the shared variable prior to the parallel section
  - lastprivate: upon loop exit, the master thread holds the value seen by the thread assigned the last loop iteration (for parallel loops only)
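A minimal sketch illustrating firstprivate and lastprivate (the variable names are mine, not from the slides):

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
      int i, offset = 100, last_i = -1;

      /* firstprivate: each thread's private copy of offset starts at 100.
         lastprivate: after the loop, last_i holds the value written by the
         sequentially last iteration (i = 9). */
      #pragma omp parallel for firstprivate(offset) lastprivate(last_i)
      for (i = 0; i < 10; i++) {
          offset = offset + i;         /* uses the initialized private copy */
          last_i = i;
      }

      printf("last_i = %d\n", last_i); /* prints 9 */
      return 0;
  }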

Fundamentals of Parallel Computer Architecture - Chapter 3 52 Barriers
- Barriers are implicit at the end of each parallel region and each worksharing construct (for, sections)
- When a barrier is not needed for correctness, use the nowait clause
- The schedule clause will be discussed later

  #include <omp.h>
  //...
  #pragma omp parallel default(shared)
  {
    ...
    #pragma omp for nowait private(i)
    for (i=0; i<n; i++)
      A[i] = A[i]*A[i] - 3.0;
  }

Fundamentals of Parallel Computer Architecture - Chapter 3 53 Compile, run, and profile
- First, find the subroutines/loops that take most of the execution time (pixie, gprof, ssrun, ...)
- Parallelize or auto-parallelize the loops, e.g.:
  icc -parallel prog.c
  f77 -pfa prog.f
- Compile the application, e.g.:
  cc -mp -O3 prog.c
- Set the number of threads, e.g.:
  setenv MP_NUM_THREADS 8
  (can also be set from within the application)
- Run it, e.g.:
  prog
  or, with profiling:
  ssrun -pcsamp prog
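For reference, with any standard OpenMP compiler the thread count can also be set from within the application through the OpenMP runtime API (a sketch; the environment-variable name above is vendor specific):

  #include <omp.h>
  #include <stdio.h>

  int main(void)
  {
      omp_set_num_threads(8);              /* same effect as the environment variable */

      #pragma omp parallel
      {
          #pragma omp single
          printf("running with %d threads\n", omp_get_num_threads());
      }
      return 0;
  }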

Fundamentals of Parallel Computer Architecture - Chapter 3 54 Matrix Multiplication Example
- Reading assignment: read Section 3.8 in the textbook
- In MP1, you will be asked to parallelize a given code
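As a warm-up for the reading (a sketch only, not the code from Section 3.8): the i and j loops of a dense matrix multiply carry no dependences, so the outermost loop can be scoped and parallelized along the lines discussed in this chapter.

  #include <omp.h>

  /* C = A x B for n x n matrices stored row-major in 1-D arrays. */
  void matmul(const double *A, const double *B, double *C, int n)
  {
      int i, j, k;
      /* A, B, C, n are shared; i, j, k and sum are private to each thread. */
      #pragma omp parallel for default(shared) private(i, j, k)
      for (i = 0; i < n; i++) {
          for (j = 0; j < n; j++) {
              double sum = 0.0;
              for (k = 0; k < n; k++)
                  sum += A[i*n + k] * B[k*n + j];
              C[i*n + j] = sum;
          }
      }
  }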