Programming with Shared Memory Introduction to OpenMP


1 Programming with Shared Memory Introduction to OpenMP
Part 1 ITCS4145/5145, Parallel Programming B. Wilkinson Feb 11, slides 8b-1.ppt

2 OpenMP Thread-based shared memory programming model.
Accepted standard developed in the late 1990s by a group of industry specialists. Higher-level than thread APIs such as Pthreads or Java threads. Write programs in C/C++ (or Fortran!) and use OpenMP compiler directives to specify parallelism. OpenMP also has a few supporting library routines and environment variables. Several compilers can compile OpenMP programs, including recent Linux C compilers.
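For example, with GCC, which enables OpenMP with the -fopenmp flag (hello.c is just an illustrative file name):

$ gcc -fopenmp hello.c -o hello
$ ./hello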

3 OpenMP thread model
Initially a single master thread executes. A parallel directive creates a team of threads, and the subsequent block of code is executed by the multiple threads in parallel. The exact number of threads is determined in one of several ways (see later). Other directives within a parallel construct specify parallel for loops and different blocks of code for threads. Code outside a parallel region is executed by the master thread only.
[Figure: the master thread forks into multiple threads at each parallel region, with synchronization at the end of each region; between regions only the master thread runs.]

4 Number of threads in a team
Established in one of three ways:
1. A num_threads clause on the parallel directive, e.g. #pragma omp parallel num_threads(5)
2. A prior call to the omp_set_num_threads() library routine, e.g. omp_set_num_threads(6);
3. The OMP_NUM_THREADS environment variable, e.g.
$ export OMP_NUM_THREADS=8
$ ./hello
These take precedence in the order given; if none of the above is set, the number is system dependent. The number of threads available can be altered dynamically to achieve the best use of system resources. A sketch of the first two mechanisms follows.
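A minimal sketch of the first two mechanisms (the thread counts 5 and 6 follow the examples above; each printf executes once per thread in its team):

#include <stdio.h>
#include <omp.h>

int main(void) {
  omp_set_num_threads(6);              /* request 6 threads for later parallel regions */

  #pragma omp parallel num_threads(5)  /* clause overrides the call: team of 5 here */
  printf("Region 1: %d threads\n", omp_get_num_threads());

  #pragma omp parallel                 /* no clause: team of 6, from the call above */
  printf("Region 2: %d threads\n", omp_get_num_threads());

  return 0;
}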

5 Finding number of threads and thread ID during program execution
omp_get_num_threads() – returns the total number of threads in the current team.
omp_get_thread_num() – returns the thread number (ID), an integer from 0 to omp_get_num_threads() - 1, where thread 0 is the master thread.
The names of these two functions are similar and easy to confuse.

6 OpenMP Parallel Directive
A C pragma directive instructs the compiler to use OpenMP features; all OpenMP directives begin with #pragma omp. The OpenMP parallel directive:

#pragma omp parallel
structured_block

structured_block is a single statement, or a compound statement created with { ... }, with a single entry point and a single exit point. The directive creates multiple threads, each one executing the specified structured_block. There is an implicit barrier at the end of the construct.

7 Hello world example

#pragma omp parallel
{
  printf("Hello World from thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
}

VERY IMPORTANT: the opening brace must be on a new line (tabs and spaces are ok).

Output from an 8-processor/core machine:
Hello World from thread 0 of 8
Hello World from thread 4 of 8
Hello World from thread 3 of 8
Hello World from thread 2 of 8
Hello World from thread 7 of 8
Hello World from thread 1 of 8
Hello World from thread 6 of 8
Hello World from thread 5 of 8

8 Global “shared” variables/data
Any variable declared outside a parallel construct is accessible by all threads unless otherwise specified:

int main (int argc, char *argv[]) {
  int x;   // accessible by all threads
  #pragma omp parallel
  {
    …      // each thread sees the same x
  }
}

9 Private variables Separate copies of variables for each thread.
They can be declared within each parallel region, but OpenMP also provides the private clause:

int tid;
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf("Hello World from thread = %d\n", tid);
}

Each thread has a local copy of the variable tid. A shared clause is also available for explicitly shared variables.

10 Another example of shared and private data
int main (int argc, char *argv[]) {
  int x;
  int tid;
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) x = 42;
    printf ("Thread %d, x = %d\n", tid, x);
  }
}

x is shared by all threads. tid is private – each thread has its own copy. Variables declared outside the parallel construct are shared unless otherwise specified.

11 Output

$ ./data
Thread 3, x = 0
Thread 2, x = 0
Thread 1, x = 0
Thread 0, x = 42
Thread 4, x = 42
Thread 5, x = 42
Thread 6, x = 42
Thread 7, x = 42

tid has a separate value for each thread. Why does x change? x is shared, so threads that print before thread 0 assigns x = 42 see the old value, and threads that print afterwards see 42 – a data race on x.
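One way to make every thread see x = 42 is to separate thread 0's write from the other threads' reads with an explicit barrier; a minimal sketch (names follow the slide's example):

#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  if (tid == 0) x = 42;
  #pragma omp barrier   /* all threads wait here until thread 0 has written x */
  printf("Thread %d, x = %d\n", tid, x);
}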

12 Another Example Shared versus Private
int a[100];
int tid, n;
#pragma omp parallel private(tid, n)
{
  tid = omp_get_thread_num();
  n = omp_get_num_threads();
  a[tid] = 10*n;
}

a[ ] is shared; tid and n are private. An optional shared clause makes the sharing explicit, with the same effect:

#pragma omp parallel private(tid, n) shared(a)
...

13 Variations of private variables
private clause – creates private copies of variables for each thread.
firstprivate clause – as the private clause, but initializes each copy to the value the variable held immediately prior to the parallel construct.
lastprivate clause – as private, but "the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable's original object."
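A short sketch of firstprivate and lastprivate on a parallel for (the values 10 and 100 are arbitrary):

int x = 10;
int last;
#pragma omp parallel for firstprivate(x) lastprivate(last)
for (int i = 0; i < 100; i++) {
  /* each thread's private copy of x starts at 10 */
  last = x + i;   /* lastprivate: the value from the sequentially last iteration survives */
}
/* here last == 10 + 99 == 109; the original x is unchanged */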

14 Specifying work inside a parallel region
Work-Sharing. Four constructs in this classification:
sections (and section)
for
single
master
In all cases there is an implicit barrier at the end of the construct unless a nowait clause is included, which overrides the barrier. Note: these constructs do not start a new team of threads. That is done by an enclosing parallel construct.

15 Sections
The construct:

#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
    structured_block
    #pragma omp section
    structured_block
  }
}

causes the structured blocks to be shared among the threads in the team: each block is executed once, by one of the available threads. The first section directive is optional. The team itself is created by the enclosing parallel directive.

16 Example

#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  #pragma omp sections nowait
  {
    #pragma omp section        /* one thread does this */
    {
      printf("Thread %d doing section 1\n",tid);
      for (i=0; i<N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
      }
    }
    #pragma omp section        /* another thread does this */
    {
      printf("Thread %d doing section 2\n",tid);
      for (i=0; i<N; i++) {
        d[i] = a[i] * b[i];
        printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
      }
    }
  } /* end of sections */
} /* end of parallel section */

17 Another sections example
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  #pragma omp sections nowait
  {
    #pragma omp section        /* one thread does this */
    {
      printf("Thread %d doing section 1\n",tid);
      for (i=0; i<N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]=%f\n",tid,i,c[i]);
      }
    }

With nowait, threads do not wait after finishing their section. (Continued on the next slide.)

18 Sections example continued
    #pragma omp section        /* another thread does this */
    {
      printf("Thread %d doing section 2\n",tid);
      for (i=0; i<N; i++) {
        d[i] = a[i] * b[i];
        printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
      }
    }
  } /* end of sections */
  printf ("Thread %d done\n", tid);
} /* end of parallel section */

19 Output

Thread 0 doing section 1
Thread 0: c[0]=
Thread 0: c[1]=
Thread 0: c[2]=
Thread 0: c[3]=
Thread 0: c[4]=
Thread 3 done
Thread 2 done
Thread 1 doing section 2
Thread 1: d[0]=
Thread 1: d[1]=
Thread 1: d[2]=
Thread 1: d[3]=
Thread 0 done
Thread 1: d[4]=
Thread 1 done

Threads do not wait (i.e. no barrier).

20 Output if the nowait clause is removed

Thread 0 doing section 1
Thread 0: c[0]=
Thread 0: c[1]=
Thread 0: c[2]=
Thread 0: c[3]=
Thread 0: c[4]=
Thread 3 doing section 2
Thread 3: d[0]=
Thread 3: d[1]=
Thread 3: d[2]=
Thread 3: d[3]=
Thread 3: d[4]=
Thread 3 done
Thread 1 done
Thread 2 done
Thread 0 done

If we remove the nowait, there is a barrier at the end of the sections construct: threads wait until they are all done with their sections.

21 Combining parallel and sections constructs
If a parallel directive is followed by a single sections directive, they can be combined into:

#pragma omp parallel sections
{
  #pragma omp section
  structured_block
  #pragma omp section
  structured_block
}

with similar effect. (However, a nowait clause is not allowed.)
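A compact sketch of the combined form (the printfs stand in for real work):

#pragma omp parallel sections
{
  #pragma omp section
  printf("Section 1 run by thread %d\n", omp_get_thread_num());
  #pragma omp section
  printf("Section 2 run by thread %d\n", omp_get_thread_num());
}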

22 Parallel For Loop

#pragma omp parallel
{
  #pragma omp for
  for ( i = 0; i < n; i++ ) {
    …  // for loop body
  }
}

This causes the for loop to be divided into parts, with the parts shared among the threads in the team – equivalent to a "forall." Different iterations will be executed by the available threads. The enclosing parallel region creates the team, and the opening brace must be on a new line. The loop must be a "for" loop of a simple C form, such as for (i = 0; i < n; i++), where the lower and upper bounds are constants.

23 Example

#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
  tid = omp_get_thread_num();
  if (tid == 0) {              /* executed by one thread */
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
  }
  printf("Thread %d starting...\n",tid);
  #pragma omp for              /* for loop shared among threads */
  for (i=0; i<N; i++) {
    c[i] = a[i] + b[i];
    printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
  }
} /* end of parallel section */

Without nowait, threads wait at the implicit barrier after finishing the loop.

24 Combined parallel and for constructs
If a parallel directive is followed by a single for directive, they can be combined into:

#pragma omp parallel for
  <for loop>

with similar effect.

25 Combining Directives Example
#pragma omp parallel for shared(a,b,c) private(i)
for (i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
  printf("Thread %d: c[%d]= %f\n", omp_get_thread_num(), i, c[i]);
}

This declares a parallel region and a parallel for in a single directive. Note that the thread ID is obtained directly from omp_get_thread_num() here, since there is no statement before the loop in which to assign it to a private variable.

26 Scheduling a Parallel For
By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to the available threads (static mapping). Sample output with four threads and the default chunk size:

Thread 1 starting...
Thread 1: i = 2, c[1] =
Thread 1: i = 3, c[1] =
Thread 2 starting...
Thread 2: i = 4, c[2] =
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] =
Thread 0: i = 1, c[0] =

There is a barrier at the end of the loop.

27 Loop Scheduling and Partitioning
OpenMP offers scheduling clauses to add to the for construct:

1. Static
#pragma omp parallel for schedule (static,chunk_size)
Partitions loop iterations into equal-sized chunks of chunk_size iterations. Chunks are assigned to threads in round-robin fashion.

2. Dynamic
#pragma omp parallel for schedule (dynamic,chunk_size)
Uses an internal work queue. A chunk-sized block of loop iterations is assigned to each thread as it becomes available.

28 3. Guided
#pragma omp parallel for schedule (guided,chunk_size)
Similar to dynamic, but the chunk size starts large and gets smaller, to reduce the time threads spend going back to the work queue:

chunk size = (number of iterations remaining) / (2 * number of threads)

4. Runtime
#pragma omp parallel for schedule (runtime)
Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
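A sketch contrasting static and dynamic scheduling on the same loop (the chunk size 4 is arbitrary, and do_work() is a hypothetical placeholder for an iteration whose cost may vary):

/* Static: iterations 0-3 go to one thread, 4-7 to the next, and so on, round robin. */
#pragma omp parallel for schedule(static, 4)
for (int i = 0; i < N; i++)
  do_work(i);

/* Dynamic: an idle thread grabs the next chunk of 4 iterations from the work queue,
   which balances the load when iteration costs differ. */
#pragma omp parallel for schedule(dynamic, 4)
for (int i = 0; i < N; i++)
  do_work(i);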

29 Question
Guided scheduling is similar to static except that the chunk sizes start large and get smaller. What is the advantage of using guided versus static?
Answer: Guided improves load balance. Threads that finish their chunks early pick up more work, and the shrinking chunk sizes keep all threads busy near the end of the loop.

30 Reduction

A reduction applies a commutative operator to a collection of values, creating a single value (similar to MPI_Reduce):

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++ ) {
  sum = sum + funct(k);
}

In reduction(+:sum), + is the operation and sum is the variable. The compiler creates a private copy of sum for each thread, and each private copy is added into sum at the end. This eliminates the need for critical sections here.
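For comparison, a sketch of what the reduction clause replaces: per-thread partial sums combined under a critical section (same loop and names as above):

sum = 0;
#pragma omp parallel
{
  int local = 0;            /* per-thread partial sum */
  #pragma omp for
  for (int k = 0; k < 100; k++)
    local = local + funct(k);
  #pragma omp critical      /* serialize only the final combining step */
  sum = sum + local;
}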

31 Single

The directive:

#pragma omp parallel
{
  #pragma omp single
  structured_block
}

causes the structured block to be executed by one thread only. (The opening brace must be on a new line.)
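A small sketch: one thread performs some serial setup while the rest wait at the implicit barrier at the end of single (n, read_input() and process() are hypothetical placeholders):

#pragma omp parallel
{
  #pragma omp single
  {
    n = read_input();   /* executed by exactly one thread */
  }                     /* implicit barrier: the other threads wait here */
  process(n);           /* every thread now sees the value of n */
}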

32 Master

The master directive:

#pragma omp parallel
{
  #pragma omp master
  structured_block
}

causes only the master thread to execute the structured block. It differs from the constructs in the work-sharing group in that there is no implied barrier at the end of the construct (nor at the beginning). Other threads encountering the master directive will ignore it and the associated structured block, and will move on.

33 Master Example

#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf ("Thread %d starting...\n", tid);
  #pragma omp master
  {
    printf("Thread %d doing work\n",tid);
    ...
  } /* end of master */
  printf ("Thread %d done\n", tid);
} /* end of parallel section */

34 Is there any difference between these two approaches:
Master directive:

#pragma omp parallel
{
  ...
  #pragma omp master
  structured_block
}

Using an if statement:

#pragma omp parallel private(tid)
{
  ...
  tid = omp_get_thread_num();
  if (tid == 0)
    structured_block
}

35 Questions

