1
Introduction to OpenMP
2
Shared-Memory Systems
[Diagram: four processors, each with a bus interface, connected by a processor/memory bus to a memory controller and a shared memory. All processors can access all of the shared memory.]
3
OpenMP
OpenMP uses compiler directives (similar to Paraguin) to parallelize a program.
The programmer inserts #pragma statements into the sequential program to tell the compiler how to parallelize it.
This is a higher level of abstraction than pthreads or Java threads.
OpenMP was standardized in the late 1990s; gcc supports OpenMP.
4
Getting Started
5
To begin
Syntax:

    #pragma omp parallel
    structured_block

omp indicates that the pragma is an OpenMP pragma (other compilers will ignore it).
parallel indicates the directive ("parallel" indicates the start of a parallel region).
structured_block will be either a single statement (such as a for loop) or a block of statements.
6
A parallel region indicates sections of code that are executed by all threads.
At the end of a parallel region, all threads synchronize as if there were a barrier.
Code outside a parallel region is executed by the master thread only.
[Diagram: the master thread runs alone, forks into multiple threads in each parallel region, and the threads synchronize at the end of each region.]
7
Hello World

    #include <stdio.h>
    #include <omp.h>

    int main (int argc, char *argv[])
    {
        #pragma omp parallel
        {
            printf("Hello World from thread = %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }

Very important: the opening brace of the parallel block must be on a new line (it cannot go on the same line as the pragma).
8
Compiling and Output

    $ gcc -fopenmp hello.c -o hello
    $ ./hello
    Hello world from thread 2 of 4
    Hello world from thread 0 of 4
    Hello world from thread 3 of 4
    Hello world from thread 1 of 4
    $

-fopenmp is the flag that tells gcc to interpret OpenMP directives.
9
Execution
omp_get_thread_num() – get the current thread's number
omp_get_num_threads() – get the total number of threads
The names of these two functions are similar and easy to confuse.
10
Execution
There are 3 ways to indicate how many threads you want (a sketch combining the first two appears below):
1. Use the num_threads clause within the directive, e.g.
       #pragma omp parallel num_threads(5)
2. Use the omp_set_num_threads function, e.g.
       omp_set_num_threads(6);
3. Use the OMP_NUM_THREADS environment variable, e.g.
       $ export OMP_NUM_THREADS=8
       $ ./hello
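As a minimal sketch (not from the slides), the first two methods can appear together in one program; the surrounding program is illustrative, but the clause and functions are standard OpenMP. The num_threads clause takes precedence over omp_set_num_threads, which in turn takes precedence over OMP_NUM_THREADS.

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
        omp_set_num_threads(6);              /* method 2: library call                  */

        #pragma omp parallel num_threads(5)  /* method 1: the clause overrides the call,
                                                so this region runs with 5 threads      */
        {
            printf("Thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }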
11
Shared versus Private Data
12
Shared versus Private Data
    int main (int argc, char *argv[])
    {
        int x;
        int tid;

        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            if (tid == 0) x = 42;
            printf ("Thread %d, x = %d\n", tid, x);
        }
    }

x is shared by all threads.
tid is private – each thread has its own copy.
Variables declared outside the parallel construct are shared unless otherwise specified.
13
Shared versus Private Data
    $ ./data
    Thread 3, x = 0
    Thread 2, x = 0
    Thread 1, x = 0
    Thread 0, x = 42
    Thread 4, x = 42
    Thread 5, x = 42
    Thread 6, x = 42
    Thread 7, x = 42

tid has a separate value for each thread.
x has the same value for each thread (well… almost – threads that print before thread 0 assigns 42 still see the old value).
14
Another Example Shared versus Private
    #pragma omp parallel private(tid, n)
    {
        tid = omp_get_thread_num();
        n = omp_get_num_threads();
        a[tid] = 10*n;
    }

OR (the shared clause is optional here):

    #pragma omp parallel private(tid, n) shared(a)
    ...

a[ ] is shared; tid and n are private.
15
Private Variables
private clause – creates private copies of variables for each thread.
firstprivate clause – like the private clause, but initializes each copy to the value the variable had immediately prior to the parallel construct.
lastprivate clause – like private, but "the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable's original object."
A sketch of firstprivate and lastprivate follows.
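A minimal sketch (not from the slides) of firstprivate and lastprivate on a parallel for; the variable names and N are illustrative and assumed to be declared:

    int x = 10;      /* firstprivate: each thread's private x starts at 10          */
    int last = -1;   /* lastprivate: receives the value from the last iteration     */

    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < N; i++) {
        x = x + i;   /* updates the thread's private copy only                      */
        last = i;    /* after the loop, last holds the value from iteration N-1     */
    }
    /* here: last == N-1, and the original x is still 10 */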
16
Work Sharing Constructs
17
Specifying Work Inside a Parallel Region
There are 4 constructs:
section – each section is executed by a different thread
for – each iteration is executed by a (potentially) different thread
single – executed by a single thread (sequential)
master – executed by the master thread only (sequential)
There is a barrier after each construct (except master) unless a nowait clause is given.
These must be used within a parallel region.
18
Sections Syntax Enclosing parallel region
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            structured_block

            #pragma omp section
            structured_block
            ...
        }
    }

Enclosing parallel region; the sections are executed by available threads.
19
Sections Example
Threads do not wait after finishing a section (nowait).

    #pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
    {
        tid = omp_get_thread_num();

        #pragma omp sections nowait
        {
            #pragma omp section
            {
                printf("Thread %d doing section 1\n", tid);
                for (i = 0; i < N; i++) {
                    c[i] = a[i] + b[i];
                    printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
                }
            }

One thread does this; the example continues on the next slide.
20
Sections Example (continued)

            #pragma omp section
            {
                printf("Thread %d doing section 2\n", tid);
                for (i = 0; i < N; i++) {
                    d[i] = a[i] * b[i];
                    printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
                }
            }
        } /* end of sections */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */

Another thread does this.
21
Sections Output
Threads do not wait (i.e. no barrier).

    Thread 0 doing section 1
    Thread 0: c[0]=
    Thread 0: c[1]=
    Thread 0: c[2]=
    Thread 0: c[3]=
    Thread 0: c[4]=
    Thread 3 done
    Thread 2 done
    Thread 1 doing section 2
    Thread 1: d[0]=
    Thread 1: d[1]=
    Thread 1: d[2]=
    Thread 1: d[3]=
    Thread 0 done
    Thread 1: d[4]=
    Thread 1 done
22
Sections Output (without nowait)

    Thread 0 doing section 1
    Thread 0: c[0]=
    Thread 0: c[1]=
    Thread 0: c[2]=
    Thread 0: c[3]=
    Thread 0: c[4]=
    Thread 3 doing section 2
    Thread 3: d[0]=
    Thread 3: d[1]=
    Thread 3: d[2]=
    Thread 3: d[3]=
    Thread 3: d[4]=
    Thread 3 done
    Thread 1 done
    Thread 2 done
    Thread 0 done

Barrier here: if we remove the nowait, there is a barrier at the end of the sections construct, and threads wait until they are all done with their sections.
23
Parallel For Syntax

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            ...
        }
    }

Enclosing parallel region; different iterations will be executed by available threads.
Must be a simple C for loop whose lower and upper bounds are known when the loop starts and do not change during the loop.
24
Parallel For Example

    #pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: i = %d, c[%d] = %f\n", tid, i, i, c[i]);
        }
    } /* end of parallel section */

Without "nowait", threads wait after finishing the loop.
25
Parallel For Output

    Thread 1 starting...
    Thread 1: i = 2, c[2] =
    Thread 1: i = 3, c[3] =
    Thread 2 starting...
    Thread 2: i = 4, c[4] =
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] =
    Thread 0: i = 1, c[1] =

Iterations of the loop are mapped to threads. In this example the mapping is in contiguous blocks: thread 0 gets i = 0,1; thread 1 gets i = 2,3; thread 2 gets i = 4. Barrier here at the end of the loop.
26
Combining Directives
If a parallel region consists of only one Parallel For or Parallel Sections construct, the two directives can be combined:
    #pragma omp parallel sections
    #pragma omp parallel for
27
Combining Directives Example
    #pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

Declares a parallel region and a parallel for in one directive.
28
Scheduling a Parallel For
By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to available threads (static mapping):

    Thread 1 starting...
    Thread 1: i = 2, c[2] =
    Thread 1: i = 3, c[3] =
    Thread 2 starting...
    Thread 2: i = 4, c[4] =
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] =
    Thread 0: i = 1, c[1] =

Default chunk size; barrier here at the end of the loop.
29
Scheduling a Parallel For
Static – partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.
    #pragma omp parallel for schedule (static, chunk_size)
Dynamic – uses an internal work queue. A chunk-sized block of iterations is assigned to each thread as it becomes available.
    #pragma omp parallel for schedule (dynamic, chunk_size)
A sketch using a dynamic schedule follows.
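As a minimal sketch (not from the slides), dynamic scheduling suits loops with uneven iteration costs. This is a fragment: the chunk size of 4 and the names i, N, result and work() are illustrative and assumed to be declared elsewhere.

    #pragma omp parallel for schedule(dynamic, 4) shared(result) private(i)
    for (i = 0; i < N; i++) {
        result[i] = work(i);   /* iterations are handed out 4 at a time as threads become free */
    }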
30
Scheduling a Parallel For
Guided – similar to dynamic, but the chunk size starts large and gets smaller, to reduce how often threads have to go back to the work queue.
    #pragma omp parallel for schedule (guided)
Runtime – uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
    #pragma omp parallel for schedule (runtime)
A usage example for the runtime schedule follows.
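With the runtime schedule, the choice is made when the program is run, e.g. (the program name here is illustrative):

    $ export OMP_SCHEDULE="dynamic,4"
    $ ./myprogram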
31
Question
Guided scheduling is similar to static except that the chunk sizes start large and get smaller. What is the advantage of using guided versus static?
Answer: guided improves load balance.
32
Reduction
A reduction applies a commutative operator across an aggregate of values to produce a single value (similar to MPI_Reduce).

    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < 100; k++) {
        sum = sum + funct(k);
    }

reduction(operation : variable)
A private copy of sum is created for each thread by the compiler; each private copy is added into sum at the end. This eliminates the need for a critical section here.
33
Single
Only one thread executes this section; no guarantee of which one.

    #pragma omp parallel
    {
        ...
        #pragma omp single
        structured_block
    }
34
Single Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp single
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of single */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
35
Single Results
Only one thread executes the single section.

    Thread 0 starting...
    Thread 0 doing work
    Thread 3 starting...
    Thread 2 starting...
    Thread 1 starting...
    Thread 0 done
    Thread 1 done
    Thread 2 done
    Thread 3 done

"nowait" was NOT specified, so there is a barrier at the end of the single block: threads wait for the one thread to finish.
36
Master
Only one thread (the master) executes this section.

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
    }

You cannot specify "nowait" here: there is no barrier after this block, so threads will NOT wait.
37
Master Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp master
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of master */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
38
Is there any difference between these two approaches?

Master directive:

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
    }

Using an if statement:

    #pragma omp parallel private(tid)
    {
        ...
        tid = omp_get_thread_num();
        if (tid == 0)
            structured_block
    }
39
Synchronization
40
Critical Section
A critical section implies mutual exclusion: only one thread is allowed to enter the critical section at a time.

    #pragma omp parallel
    {
        ...
        #pragma omp critical (name)
        structured_block
    }

name is optional.
41
Critical Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp critical (myCS)
        {
            printf("Thread %d in critical section\n", tid);
            sleep (1);
            printf("Thread %d finishing critical section\n", tid);
        } /* end of critical */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
42
Critical Results
There is a 1 second delay between threads entering the critical section.

    Thread 0 starting...
    Thread 0 in critical section
    Thread 3 starting...
    Thread 1 starting...
    Thread 2 starting...
    Thread 0 finishing critical section
    Thread 0 done
    Thread 3 in critical section
    Thread 3 finishing critical section
    Thread 3 done
    Thread 2 in critical section
    Thread 2 finishing critical section
    Thread 2 done
    Thread 1 in critical section
    Thread 1 finishing critical section
    Thread 1 done
43
Atomic
If the critical section is a simple update of a variable, then atomic is more efficient. It ensures mutual exclusion for that single statement.

    #pragma omp parallel
    {
        ...
        #pragma omp atomic
        expression_statement
    }

Must be a simple statement of the form:
    x = expression
    x += expression
    x -= expression
    ...
    x++;
    x--;
A small sketch follows.
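A minimal sketch (not from the slides) of atomic used to update a shared counter; the names count and N are illustrative and assumed to be declared. A critical section would also work here, but atomic is lighter-weight for a single update:

    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic
        count++;       /* each increment is performed atomically */
    }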
44
Barrier
Threads will wait at a barrier until all threads have reached the same barrier.
All threads must be able to reach the barrier (i.e. be careful about placing the barrier inside an if statement where some threads may not execute it).

    #pragma omp parallel
    {
        ...
        #pragma omp barrier
    }
45
Barrier Example
There is no barrier at the end of the single block (nowait), so threads wait at the explicit barrier, not at the single.

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp single nowait
        {
            printf("Thread %d busy doing work ... \n", tid);
            sleep(10);
        }

        printf("Thread %d reached barrier\n", tid);
        #pragma omp barrier
        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
46
Barrier Results
Thread 3 sleeps for 10 seconds, so there is a 10 second delay before any thread passes the barrier.

    Thread 3 starting...
    Thread 0 starting...
    Thread 0 reached barrier
    Thread 2 starting...
    Thread 2 reached barrier
    Thread 1 starting...
    Thread 1 reached barrier
    Thread 3 busy doing work ...
    Thread 3 reached barrier
    Thread 3 done
    Thread 0 done
    Thread 2 done
    Thread 1 done
47
Flush
A synchronization point that causes threads to have a "consistent" view of certain (or all) shared variables in memory.
All current read and write operations on the variables are allowed to complete and the values are written back to memory, but any memory operations in code after the flush are not started until the flush completes.
Format:
    #pragma omp flush (variable_list)
48
Flush
Only applies to the thread executing the flush, not to all threads in the team (so not all threads have to execute the flush).
A flush occurs automatically at entry to and exit from parallel and critical directives, and at the exit of for, sections, and single (if a nowait clause is not present).
A sketch of an explicit flush follows.
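A minimal sketch (not from the slides) of the classic producer/consumer flag pattern using explicit flushes. The variable names are illustrative, the fragment assumes stdio.h and omp.h are included, and modern code would normally also protect the flag with atomic:

    int flag = 0, data = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;
            #pragma omp flush (data)    /* make data visible before the flag          */
            flag = 1;
            #pragma omp flush (flag)    /* publish the flag                            */
        } else {
            while (1) {                 /* spin until the flag becomes visible         */
                #pragma omp flush (flag)
                if (flag) break;
            }
            #pragma omp flush (data)    /* refresh this thread's view of data          */
            printf("data = %d\n", data);
        }
    }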
49
More information
50
Questions