ECE 1747 Parallel Programming Shared Memory: OpenMP Environment and Synchronization.

Presentation transcript:

ECE 1747 Parallel Programming Shared Memory: OpenMP Environment and Synchronization

What is OpenMP? Standard for shared memory programming for scientific applications. Has specific support for scientific application needs (unlike Pthreads). Rapidly gaining acceptance among vendors and application writers. See the OpenMP website (openmp.org) for more info.

OpenMP API Overview The API is a set of compiler directives inserted in the source program (in addition to some library functions). Ideally, compiler directives do not affect sequential code. –pragmas in C / C++. –(special) comments in Fortran code.

OpenMP API Example (1 of 2) Sequential code: statement1; statement2; statement3; Assume we want to execute statement 2 in parallel, and statements 1 and 3 sequentially.

OpenMP API Example (2 of 2) OpenMP parallel code: statement1; #pragma omp … statement2; statement3; Statement 2 is (or may be) executed in parallel. Statements 1 and 3 are executed sequentially.

Important Note By giving a parallel directive, the user asserts that the program will remain correct if the statement is executed in parallel. The OpenMP compiler does not check correctness. Some tools exist to help with that; Totalview, for example, is a good parallel debugger.

API Semantics Master thread executes sequential code. Master and slaves execute parallel code. Note: very similar to fork-join semantics of Pthreads create/join primitives.

OpenMP Implementation Overview OpenMP implementation –compiler, –library. Unlike Pthreads (purely a library).

OpenMP Example Usage (1 of 2) [Diagram: the annotated source is fed to the OpenMP compiler; a compiler switch selects whether a sequential program or a parallel program is generated.]

OpenMP Example Usage (2 of 2) If you give the sequential switch, –comments and pragmas are ignored. If you give the parallel switch, –comments and/or pragmas are read, and –cause translation into a parallel program. (With GCC, for example, -fopenmp plays the role of the parallel switch.) Ideally, one source serves for both the sequential and the parallel program (a big maintenance plus).

OpenMP Directives Parallelization directives: –parallel region –parallel for Data environment directives: –shared, private, threadprivate, reduction, etc. Synchronization directives: –barrier, critical

General Rules about Directives They always apply to the next statement, which must be a structured block. Examples –#pragma omp … statement –#pragma omp … { statement1; statement2; statement3; }

OpenMP Parallel Region #pragma omp parallel A number of threads are spawned at entry. Each thread executes the same code. Each thread waits at the end. Very similar to a number of create/joins with the same function in Pthreads.
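A minimal sketch (not from the slides) of a parallel region; the message inside the region is printed once by each thread, and only the master continues past the implicit join:

#include <stdio.h>
#include <omp.h>

int main( void )
{
    printf( "before the region: master only\n" );

    #pragma omp parallel
    printf( "inside the region: executed by every thread\n" );
    /* implicit barrier: all threads wait here, then only the master continues */

    printf( "after the region: master only\n" );
    return 0;
}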

Getting Threads to do Different Things Through explicit thread identification (as in Pthreads). Through work-sharing directives.

Thread Identification int omp_get_thread_num() – gets the thread id. int omp_get_num_threads() – gets the total number of threads.

Example #pragma omp parallel { if( !omp_get_thread_num() ) master(); else slave(); }
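A complete, compilable version of this example; master() and slave() are hypothetical stand-ins that just report who is running:

#include <stdio.h>
#include <omp.h>

void master( void ) { printf( "master is thread %d\n", omp_get_thread_num() ); }
void slave( void )  { printf( "slave is thread %d\n",  omp_get_thread_num() ); }

int main( void )
{
    #pragma omp parallel
    {
        if( !omp_get_thread_num() )   /* thread 0 plays the master */
            master();
        else
            slave();
    }
    return 0;
}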

Work Sharing Directives Always occur within a parallel region directive. The two principal ones are –parallel for –parallel sections

OpenMP Parallel For #pragma omp parallel #pragma omp for for( … ) { … } Each thread executes a subset of the iterations. All threads wait at the end of the parallel for.
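A small sketch (not from the slides) that makes the partitioning visible; each iteration prints the thread that executed it, so the default block distribution can be observed:

#include <stdio.h>
#include <omp.h>

int main( void )
{
    int i;

    #pragma omp parallel
    #pragma omp for
    for( i = 0; i < 8; i++ )           /* iterations are split among the threads */
        printf( "iteration %d done by thread %d\n", i, omp_get_thread_num() );
    /* implicit barrier: all threads finish their iterations before execution continues */

    return 0;
}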

Multiple Work Sharing Directives May occur within a single parallel region. #pragma omp parallel { #pragma omp for for( ; ; ) { … } #pragma omp for for( ; ; ) { … } } All threads wait at the end of the first for.

The NoWait Qualifier #pragma omp parallel { #pragma omp for nowait for( ; ; ) { … } #pragma omp for for( ; ; ) { … } } Threads proceed to second for w/o waiting.
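The nowait is only safe when no thread's work in the second loop depends on another thread's results from the first. A sketch (not from the slides; the arrays a, b, c, d are assumptions) where the two loops touch disjoint data, so dropping the barrier is harmless:

#include <omp.h>

#define N 1000
double a[N], b[N], c[N], d[N];

int main( void )
{
    int i;

    #pragma omp parallel
    {
        #pragma omp for nowait          /* no barrier: the second loop never reads a or b */
        for( i = 0; i < N; i++ )
            a[i] = 2.0 * b[i];

        #pragma omp for                 /* implicit barrier at the end of this one */
        for( i = 0; i < N; i++ )
            c[i] = d[i] + 1.0;
    }
    return 0;
}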

Parallel Sections Directive #pragma omp parallel { #pragma omp sections { { … } #pragma omp section /* this is a delimiter */ { … } #pragma omp section { … } … } }

A Useful Shorthand #pragma omp parallel #pragma omp for for ( ; ; ) { … } is equivalent to #pragma omp parallel for for ( ; ; ) { … } (Same for parallel sections)

Note the Difference between... #pragma omp parallel { #pragma omp for for( ; ; ) { … } f(); #pragma omp for for( ; ; ) { … } }

… and... #pragma omp parallel for for( ; ; ) { … } f(); #pragma omp parallel for for( ; ; ) { … } In the first version f() is inside the single parallel region and is therefore executed (redundantly) by every thread; in the second, the threads join after the first loop, f() is executed once by the master, and a new team of threads is spawned for the second loop.

Sequential Matrix Multiply for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; }

OpenMP Matrix Multiply #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; }

Sequential SOR for some number of timesteps/iterations { for( i=0; i<n; i++ ) for( j=1; j<n; j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] ); for( i=0; i<n; i++ ) for( j=1; j<n; j++ ) grid[i][j] = temp[i][j]; }

OpenMP SOR for some number of timesteps/iterations { #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] ); #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) grid[i][j] = temp[i][j]; }

Equivalent OpenMP SOR for some number of timesteps/iterations { #pragma omp parallel { #pragma omp for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] ); #pragma omp for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) grid[i][j] = temp[i][j]; } }

Some Advanced Features Conditional parallelism. Scheduling options. (More can be found in the specification)

Conditional Parallelism: Issue Oftentimes, parallelism is only useful if the problem size is sufficiently big. For smaller sizes, overhead of parallelization exceeds benefit.

Conditional Parallelism: Specification #pragma omp parallel if( expression ) #pragma omp for if( expression ) #pragma omp parallel for if( expression ) Execute in parallel if expression is true, otherwise execute sequentially.

Conditional Parallelism: Example for( i=0; i<n; i++ ) #pragma omp parallel for if( n-i > 100 ) for( j=i+1; j<n; j++ ) for( k=i+1; k<n; k++ ) a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];

Scheduling of Iterations: Issue Scheduling: assigning iterations to threads. So far, we have assumed the default, which is block scheduling. OpenMP allows other scheduling strategies as well, for instance cyclic, gss (guided self-scheduling), etc.

Scheduling of Iterations: Specification #pragma omp parallel for schedule( type ) type can be one of –block (default) –cyclic –gss

Example Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0).

Sequential Matrix Multiply Becomes for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=i; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; } Load imbalance with block distribution.

OpenMP Matrix Multiply #pragma omp parallel for schedule( cyclic ) for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=i; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; }
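Note that cyclic is the lecture's descriptive name; in standard OpenMP syntax the same round-robin assignment is written schedule(static,1), and schedule(dynamic) or schedule(guided) are the runtime load-balancing alternatives. A self-contained sketch (not from the slides) of the triangular multiply with a conforming schedule clause:

#include <omp.h>

#define N 512
double a[N][N], b[N][N], c[N][N];

int main( void )
{
    int i, j, k;

    /* row i does N-i inner iterations, so the work per row shrinks:
       block scheduling would be imbalanced, round-robin evens it out */
    #pragma omp parallel for schedule( static, 1 )
    for( i = 0; i < N; i++ )
        for( j = 0; j < N; j++ ) {
            c[i][j] = 0.0;
            for( k = i; k < N; k++ )
                c[i][j] += a[i][k] * b[k][j];
        }
    return 0;
}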

Data Environment Directives (1 of 2) All variables are by default shared. One exception: the loop variable of a parallel for is private. By using data directives, some variables can be made private or given other special characteristics.

Reminder: Matrix Multiply #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { c[i][j] = 0.0; for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j]; } a, b, c are shared; i, j, k are private.

Data Environment Directives (2 of 2) Private Threadprivate Reduction

Private Variables #pragma omp parallel for private( list ) Makes a private copy for each thread for each variable in the list. This and all further examples are with parallel for, but same applies to other region and work-sharing directives.

Private Variables: Example (1 of 2) for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } Swaps the values in a and b. Loop-carried dependence on tmp. Easily fixed by privatizing tmp.

Private Variables: Example (2 of 2) #pragma omp parallel for private( tmp ) for( i=0; i<n; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } Removes dependence on tmp. Would be more difficult to do in Pthreads.

Private Variables: Alternative 1 for( i=0; i<n; i++ ) { tmp[i] = a[i]; a[i] = b[i]; b[i] = tmp[i]; } Requires sequential program change. Wasteful in space, O(n) vs. O(p).

Private Variables: Alternative 2 f() { int tmp; /* local allocation on stack */ for( i=from; i<to; i++ ) { tmp = a[i]; a[i] = b[i]; b[i] = tmp; } } Each thread calls f() for its own range of iterations, so tmp lives on that thread's stack (the typical Pthreads solution).

Threadprivate Private variables are private on a parallel region basis. Threadprivate variables are global variables that are private throughout the execution of the program.

Threadprivate #pragma omp threadprivate( list ) Example: #pragma omp threadprivate( x ) In Pthreads this requires a program change: an array of size p, accessed as x[pthread_self()]. Costly if accessed frequently. Not cheap in OpenMP either.
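A minimal sketch (not from the slides) of a threadprivate global: each thread keeps its own copy of x, and the copy persists from one parallel region to the next as long as the thread count stays the same (and dynamic adjustment of the thread count is disabled):

#include <stdio.h>
#include <omp.h>

int x;                            /* file-scope variable, one copy per thread */
#pragma omp threadprivate( x )

int main( void )
{
    #pragma omp parallel
    x = omp_get_thread_num();     /* each thread writes its own copy */

    #pragma omp parallel
    printf( "thread %d still sees x = %d\n", omp_get_thread_num(), x );

    return 0;
}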

Reduction Variables #pragma omp parallel for reduction( op:list ) op is one of +, *, -, &, ^, |, &&, or || The variables in list must be used with this operator in the loop. The variables are automatically initialized to sensible values.

Reduction Variables: Example #pragma omp parallel for reduction( +:sum ) for( i=0; i<n; i++ ) sum += a[i]; Sum is automatically initialized to zero.

SOR Sequential Code with Convergence for( ; diff > delta; ) { for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { … } diff = 0; for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { diff = max(diff, fabs(grid[i][j] - temp[i][j])); grid[i][j] = temp[i][j]; } }

OpenMP SOR Code with Convergence (First Attempt) for( ; diff > delta; ) { #pragma omp parallel for for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { … } diff = 0; #pragma omp parallel for reduction( max: diff ) for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) { diff = max(diff, fabs(grid[i][j] - temp[i][j])); grid[i][j] = temp[i][j]; } } Bummer: no reduction operator for max or min. (Note: OpenMP 3.1 and later do provide max and min reductions in C/C++.)

Synchronization Primitives Critical #pragma omp critical ( name ) Implements critical sections, by name. Similar to Pthreads mutex locks (name ~ lock). Barrier #pragma omp barrier Implements a global barrier.
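A small sketch (not from the slides; the name count_lock is arbitrary) combining the two primitives: the named critical section serializes updates to a shared counter, and the barrier guarantees every update has happened before the counter is read:

#include <stdio.h>
#include <omp.h>

int count = 0;                    /* shared by all threads */

int main( void )
{
    #pragma omp parallel
    {
        #pragma omp critical ( count_lock )
        count++;                  /* only one thread at a time executes this */

        #pragma omp barrier       /* wait until every thread has done its increment */

        if( !omp_get_thread_num() )
            printf( "count = %d\n", count );
    }
    return 0;
}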

OpenMP SOR with Convergence (1 of 2) #pragma omp parallel private( mydiff ) for( ; diff > delta; ) { #pragma omp for nowait for( i=from; i<to; i++ ) for( j=0; j<n; j++ ) { … } diff = 0.0; mydiff = 0.0; #pragma omp barrier...

OpenMP SOR with Convergence (2 of 2)... #pragma omp for nowait for( i=from; i<to; i++ ) for( j=0; j<n; j++ ) { mydiff = max(mydiff, fabs(grid[i][j]-temp[i][j])); grid[i][j] = temp[i][j]; } #pragma omp critical diff = max( diff, mydiff ); #pragma omp barrier }

Synchronization Primitives Big bummer: no condition variables. Result: must busy wait for condition synchronization. Clumsy. Very inefficient on some architectures.

PIPE: Sequential Program for( i=0; i<num_pics && read(in_pic); i++ ) { int_pic_1 = trans1( in_pic ); int_pic_2 = trans2( int_pic_1 ); int_pic_3 = trans3( int_pic_2 ); out_pic = trans4( int_pic_3 ); }

Sequential vs. Parallel Execution [Figure: execution timelines for the sequential and the pipelined parallel version; color indicates the picture, horizontal line indicates the processor.]

PIPE: Parallel Program P0: for( i=0; i<num_pics && read(in_pic); i++ ) { int_pic_1[i] = trans1( in_pic ); signal( event_1_2[i] ); } P1: for( i=0; i<num_pics; i++ ) { wait( event_1_2[i] ); int_pic_2[i] = trans2( int_pic_1[i] ); signal( event_2_3[i] ); }

PIPE: Main Program #pragma omp parallel sections { #pragma omp section stage1(); #pragma omp section stage2(); #pragma omp section stage3(); #pragma omp section stage4(); }

PIPE: Stage 1 void stage1() { num1 = 0; for( i=0; i<num_pics && read(in_pic); i++ ) { int_pic_1[i] = trans1( in_pic ); #pragma omp critical ( c1 ) num1++; } }

PIPE: Stage 2 void stage2() { for( i=0; i<num_pics; i++ ) { do { #pragma omp critical ( c1 ) cond = (num1 <= i); } while( cond ); int_pic_2[i] = trans2( int_pic_1[i] ); #pragma omp critical ( c2 ) num2++; } }

OpenMP PIPE Note the need to exit the critical section while waiting; otherwise the other thread could never get in to update num1. Never busy-wait inside a critical section!