1
Introduction to OpenMP
2
Shared-Memory Systems
[Diagram: four processors, each with a bus interface, connected by a processor/memory bus to a memory controller and a shared memory. All processors can access all of the shared memory.]
3
OpenMP
OpenMP uses compiler directives (similar to Paraguin) to parallelize a program.
The programmer inserts #pragma statements into the sequential program to tell the compiler how to parallelize it.
This is a higher level of abstraction than pthreads or Java threads.
OpenMP was standardized in the late 1990s; gcc supports OpenMP.
4
Getting Started
5
To begin
Syntax:

    #pragma omp parallel
    structured_block

omp indicates that the pragma is an OpenMP pragma (other compilers will ignore it).
parallel indicates the directive ("parallel" indicates the start of a parallel region).
structured_block will be either a single statement (such as a for loop) or a block of statements.
6
A parallel region indicates sections of code that are executed by all threads.
At the end of a parallel region, all threads synchronize as if there were a barrier.
Code outside a parallel region is executed by the master thread only.
[Diagram: the master thread runs alone, forks into multiple threads in each parallel region, and the threads synchronize at the end of each region.]
7
Hello World

    #include <stdio.h>
    #include <omp.h>

    int main (int argc, char *argv[])
    {
        #pragma omp parallel
        {
            printf("Hello World from thread = %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }

Very important: the opening brace of the parallel block must be on a new line (it cannot go on the same line as the pragma).
8
Compiling and Output

    $ gcc -fopenmp hello.c -o hello
    $ ./hello
    Hello world from thread 2 of 4
    Hello world from thread 0 of 4
    Hello world from thread 3 of 4
    Hello world from thread 1 of 4
    $

-fopenmp is the flag that tells gcc to interpret OpenMP directives.
9
Execution
omp_get_thread_num() – get the current thread's number
omp_get_num_threads() – get the total number of threads
The names of these two functions are similar and easy to confuse.
10
Execution
There are 3 ways to indicate how many threads you want (a sketch combining the first two appears below):
1. Use the num_threads clause within the directive, e.g.
       #pragma omp parallel num_threads(5)
2. Use the omp_set_num_threads function, e.g.
       omp_set_num_threads(6);
3. Use the OMP_NUM_THREADS environment variable, e.g.
       $ export OMP_NUM_THREADS=8
       $ ./hello
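As a minimal sketch (not from the slides), the first two methods can appear together in one program; the surrounding program is illustrative, but the clause and functions are standard OpenMP. The num_threads clause takes precedence over omp_set_num_threads, which in turn takes precedence over OMP_NUM_THREADS.

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
        omp_set_num_threads(6);              /* method 2: library call                  */

        #pragma omp parallel num_threads(5)  /* method 1: the clause overrides the call,
                                                so this region runs with 5 threads      */
        {
            printf("Thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }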
11
Shared versus Private Data
12
Shared versus Private Data
    int main (int argc, char *argv[])
    {
        int x;
        int tid;

        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            if (tid == 0) x = 42;
            printf ("Thread %d, x = %d\n", tid, x);
        }
    }

x is shared by all threads.
tid is private – each thread has its own copy.
Variables declared outside the parallel construct are shared unless otherwise specified.
13
Shared versus Private Data
    $ ./data
    Thread 3, x = 0
    Thread 2, x = 0
    Thread 1, x = 0
    Thread 0, x = 42
    Thread 4, x = 42
    Thread 5, x = 42
    Thread 6, x = 42
    Thread 7, x = 42

tid has a separate value for each thread.
x has the same value for each thread (well… almost – threads that print before thread 0 assigns 42 still see the old value).
14
Another Example Shared versus Private
    #pragma omp parallel private(tid, n)
    {
        tid = omp_get_thread_num();
        n = omp_get_num_threads();
        a[tid] = 10*n;
    }

OR (the shared clause is optional here):

    #pragma omp parallel private(tid, n) shared(a)
    ...

a[ ] is shared; tid and n are private.
15
Private Variables
private clause – creates private copies of variables for each thread.
firstprivate clause – like the private clause, but initializes each copy to the value the variable had immediately prior to the parallel construct.
lastprivate clause – like private, but "the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable's original object."
A sketch of firstprivate and lastprivate follows.
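A minimal sketch (not from the slides) of firstprivate and lastprivate on a parallel for; the variable names and N are illustrative and assumed to be declared:

    int x = 10;      /* firstprivate: each thread's private x starts at 10          */
    int last = -1;   /* lastprivate: receives the value from the last iteration     */

    #pragma omp parallel for firstprivate(x) lastprivate(last)
    for (int i = 0; i < N; i++) {
        x = x + i;   /* updates the thread's private copy only                      */
        last = i;    /* after the loop, last holds the value from iteration N-1     */
    }
    /* here: last == N-1, and the original x is still 10 */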
16
Work Sharing Constructs
17
Specifying Work Inside a Parallel Region
There are 4 constructs:
section – each section is executed by a different thread
for – each iteration is executed by a (potentially) different thread
single – executed by a single thread (sequential)
master – executed by the master thread only (sequential)
There is a barrier after each construct (except master) unless a nowait clause is given.
These must be used within a parallel region.
18
Sections Syntax Enclosing parallel region
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            structured_block

            #pragma omp section
            structured_block
            ...
        }
    }

Enclosing parallel region; the sections are executed by available threads.
19
Sections Example
Threads do not wait after finishing a section (nowait).

    #pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
    {
        tid = omp_get_thread_num();

        #pragma omp sections nowait
        {
            #pragma omp section
            {
                printf("Thread %d doing section 1\n", tid);
                for (i = 0; i < N; i++) {
                    c[i] = a[i] + b[i];
                    printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
                }
            }

One thread does this; the example continues on the next slide.
20
Sections Example (continued)

            #pragma omp section
            {
                printf("Thread %d doing section 2\n", tid);
                for (i = 0; i < N; i++) {
                    d[i] = a[i] * b[i];
                    printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
                }
            }
        } /* end of sections */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */

Another thread does this.
21
Sections Output
Threads do not wait (i.e. no barrier).

    Thread 0 doing section 1
    Thread 0: c[0]=
    Thread 0: c[1]=
    Thread 0: c[2]=
    Thread 0: c[3]=
    Thread 0: c[4]=
    Thread 3 done
    Thread 2 done
    Thread 1 doing section 2
    Thread 1: d[0]=
    Thread 1: d[1]=
    Thread 1: d[2]=
    Thread 1: d[3]=
    Thread 0 done
    Thread 1: d[4]=
    Thread 1 done
22
Sections Output (without nowait)

    Thread 0 doing section 1
    Thread 0: c[0]=
    Thread 0: c[1]=
    Thread 0: c[2]=
    Thread 0: c[3]=
    Thread 0: c[4]=
    Thread 3 doing section 2
    Thread 3: d[0]=
    Thread 3: d[1]=
    Thread 3: d[2]=
    Thread 3: d[3]=
    Thread 3: d[4]=
    Thread 3 done
    Thread 1 done
    Thread 2 done
    Thread 0 done

Barrier here: if we remove the nowait, there is a barrier at the end of the sections construct, and threads wait until they are all done with their sections.
23
Parallel For Syntax

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            ...
        }
    }

Enclosing parallel region; different iterations will be executed by available threads.
Must be a simple C for loop whose lower and upper bounds are known when the loop starts and do not change during the loop.
24
Parallel For Example

    #pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: i = %d, c[%d] = %f\n", tid, i, i, c[i]);
        }
    } /* end of parallel section */

Without "nowait", threads wait after finishing the loop.
25
Parallel For Output

    Thread 1 starting...
    Thread 1: i = 2, c[2] =
    Thread 1: i = 3, c[3] =
    Thread 2 starting...
    Thread 2: i = 4, c[4] =
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] =
    Thread 0: i = 1, c[1] =

Iterations of the loop are mapped to threads. In this example the mapping is in contiguous blocks: thread 0 gets i = 0,1; thread 1 gets i = 2,3; thread 2 gets i = 4. Barrier here at the end of the loop.
26
Combining Directives
If a parallel region consists of only one Parallel For or Parallel Sections construct, the two directives can be combined:
    #pragma omp parallel sections
    #pragma omp parallel for
27
Combining Directives Example
    #pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

Declares a parallel region and a parallel for in one directive.
28
Scheduling a Parallel For
By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to available threads (static mapping):

    Thread 1 starting...
    Thread 1: i = 2, c[2] =
    Thread 1: i = 3, c[3] =
    Thread 2 starting...
    Thread 2: i = 4, c[4] =
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] =
    Thread 0: i = 1, c[1] =

Default chunk size; barrier here at the end of the loop.
29
Scheduling a Parallel For
Static – partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.
    #pragma omp parallel for schedule (static, chunk_size)
Dynamic – uses an internal work queue. A chunk-sized block of iterations is assigned to each thread as it becomes available.
    #pragma omp parallel for schedule (dynamic, chunk_size)
A sketch using a dynamic schedule follows.
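As a minimal sketch (not from the slides), dynamic scheduling suits loops with uneven iteration costs. This is a fragment: the chunk size of 4 and the names i, N, result and work() are illustrative and assumed to be declared elsewhere.

    #pragma omp parallel for schedule(dynamic, 4) shared(result) private(i)
    for (i = 0; i < N; i++) {
        result[i] = work(i);   /* iterations are handed out 4 at a time as threads become free */
    }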
30
Scheduling a Parallel For
Guided – similar to dynamic, but the chunk size starts large and gets smaller, to reduce how often threads have to go back to the work queue.
    #pragma omp parallel for schedule (guided)
Runtime – uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
    #pragma omp parallel for schedule (runtime)
A usage example for the runtime schedule follows.
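With the runtime schedule, the choice is made when the program is run, e.g. (the program name here is illustrative):

    $ export OMP_SCHEDULE="dynamic,4"
    $ ./myprogram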
31
Question
Guided scheduling is similar to static except that the chunk sizes start large and get smaller. What is the advantage of using guided versus static?
Answer: guided improves load balance.
32
Reduction
A reduction applies a commutative operator across an aggregate of values to produce a single value (similar to MPI_Reduce).

    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < 100; k++) {
        sum = sum + funct(k);
    }

reduction(operation : variable)
A private copy of sum is created for each thread by the compiler; each private copy is added into sum at the end. This eliminates the need for a critical section here.
33
Single
Only one thread executes this section; no guarantee of which one.

    #pragma omp parallel
    {
        ...
        #pragma omp single
        structured_block
    }
34
Single Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp single
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of single */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
35
Single Results
Only one thread executes the single section.

    Thread 0 starting...
    Thread 0 doing work
    Thread 3 starting...
    Thread 2 starting...
    Thread 1 starting...
    Thread 0 done
    Thread 1 done
    Thread 2 done
    Thread 3 done

"nowait" was NOT specified, so there is a barrier at the end of the single block: threads wait for the one thread to finish.
36
Master
Only one thread (the master) executes this section.

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
    }

You cannot specify "nowait" here: there is no barrier after this block, so threads will NOT wait.
37
Master Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp master
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of master */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
38
Is there any difference between these two approaches?

Master directive:

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
    }

Using an if statement:

    #pragma omp parallel private(tid)
    {
        ...
        tid = omp_get_thread_num();
        if (tid == 0)
            structured_block
    }
39
Synchronization
40
Critical Section
A critical section implies mutual exclusion: only one thread is allowed to enter the critical section at a time.

    #pragma omp parallel
    {
        ...
        #pragma omp critical (name)
        structured_block
    }

name is optional.
41
Critical Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp critical (myCS)
        {
            printf("Thread %d in critical section\n", tid);
            sleep (1);
            printf("Thread %d finishing critical section\n", tid);
        } /* end of critical */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
42
Critical Results
There is a 1 second delay between threads entering the critical section.

    Thread 0 starting...
    Thread 0 in critical section
    Thread 3 starting...
    Thread 1 starting...
    Thread 2 starting...
    Thread 0 finishing critical section
    Thread 0 done
    Thread 3 in critical section
    Thread 3 finishing critical section
    Thread 3 done
    Thread 2 in critical section
    Thread 2 finishing critical section
    Thread 2 done
    Thread 1 in critical section
    Thread 1 finishing critical section
    Thread 1 done
43
Atomic
If the critical section is a simple update of a variable, then atomic is more efficient. It ensures mutual exclusion for that single statement.

    #pragma omp parallel
    {
        ...
        #pragma omp atomic
        expression_statement
    }

Must be a simple statement of the form:
    x = expression
    x += expression
    x -= expression
    ...
    x++;
    x--;
A small sketch follows.
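A minimal sketch (not from the slides) of atomic used to update a shared counter; the names count and N are illustrative and assumed to be declared. A critical section would also work here, but atomic is lighter-weight for a single update:

    int count = 0;
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic
        count++;       /* each increment is performed atomically */
    }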
44
Barrier
Threads will wait at a barrier until all threads have reached the same barrier.
All threads must be able to reach the barrier (i.e. be careful about placing the barrier inside an if statement where some threads may not execute it).

    #pragma omp parallel
    {
        ...
        #pragma omp barrier
    }
45
Barrier Example
There is no barrier at the end of the single block (nowait), so threads wait at the explicit barrier, not at the single.

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp single nowait
        {
            printf("Thread %d busy doing work ... \n", tid);
            sleep(10);
        }

        printf("Thread %d reached barrier\n", tid);
        #pragma omp barrier
        printf ("Thread %d done\n", tid);
    } /* end of parallel section */
46
Barrier Results
Thread 3 sleeps for 10 seconds, so there is a 10 second delay before any thread passes the barrier.

    Thread 3 starting...
    Thread 0 starting...
    Thread 0 reached barrier
    Thread 2 starting...
    Thread 2 reached barrier
    Thread 1 starting...
    Thread 1 reached barrier
    Thread 3 busy doing work ...
    Thread 3 reached barrier
    Thread 3 done
    Thread 0 done
    Thread 2 done
    Thread 1 done
47
Flush
A synchronization point that causes threads to have a "consistent" view of certain (or all) shared variables in memory.
All current read and write operations on the variables are allowed to complete and the values are written back to memory, but any memory operations in code after the flush are not started until the flush completes.
Format:
    #pragma omp flush (variable_list)
48
Flush
Only applies to the thread executing the flush, not to all threads in the team (so not all threads have to execute the flush).
A flush occurs automatically at entry to and exit from parallel and critical directives, and at the exit of for, sections, and single (if a nowait clause is not present).
A sketch of an explicit flush follows.
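A minimal sketch (not from the slides) of the classic producer/consumer flag pattern using explicit flushes. The variable names are illustrative, the fragment assumes stdio.h and omp.h are included, and modern code would normally also protect the flag with atomic:

    int flag = 0, data = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;
            #pragma omp flush (data)    /* make data visible before the flag          */
            flag = 1;
            #pragma omp flush (flag)    /* publish the flag                            */
        } else {
            while (1) {                 /* spin until the flag becomes visible         */
                #pragma omp flush (flag)
                if (flag) break;
            }
            #pragma omp flush (data)    /* refresh this thread's view of data          */
            printf("data = %d\n", data);
        }
    }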
49
More information
50
Questions