1
Introduction to OpenMP
I - Introduction
ITCS4145/5145, Parallel Programming
C. Ferner and B. Wilkinson
Feb 3, 2016
2
OpenMP structure A standard developed in the 1990s for thread-based programming on shared memory systems. Higher level than low-level APIs such as Pthreads or Java threads. Consists of a set of compiler directives, plus a few library routines and environment variables. gcc supports OpenMP with the -fopenmp option, so no additional software is needed.
3
OpenMP compiler directives
OpenMP uses #pragma compiler directives to parallelize a program ("pragmatic" directive). The programmer inserts #pragma statements into the sequential program to tell the compiler how to parallelize it. When the compiler comes across a compiler directive, it creates the corresponding thread-based parallel code. The basic OpenMP pattern is the thread-pool pattern.
4
Thread-pool pattern Basic OpenMP pattern
Basic OpenMP pattern
Code outside a parallel region is executed by the master thread only. A parallel region indicates a section of code executed by all threads. At the end of a parallel region, all threads wait for each other as if at a "barrier" (unless otherwise specified).
[Figure: the master thread forks into multiple threads at each parallel region, with a synchronization point at the end of each region]
5
Parallel Region Syntax: #pragma omp parallel structured_block
omp indicates an OpenMP pragma (compilers that do not support OpenMP will ignore it). parallel indicates a parallel region. structured_block is either: a single statement terminated with ; or a block of statements, i.e. { statement1; statement2; … }
6
Hello World
#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[])
{
  #pragma omp parallel
  {
    printf("Hello World from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}
These two routines return the thread ID (from 0 onwards) and the total number of threads, respectively (easy to confuse, as the names are similar). Very important: the opening brace of the parallel block must be on a new line.
7
Compiling and Output
The -fopenmp flag tells gcc to interpret OpenMP directives:
$ gcc -fopenmp hello.c -o hello
$ ./hello
Hello World from thread 2 of 4
Hello World from thread 0 of 4
Hello World from thread 3 of 4
Hello World from thread 1 of 4
$
8
Number of threads Three ways to indicate how many threads you want:
1. Use the num_threads clause within the directive, e.g. #pragma omp parallel num_threads(5)
2. Use the omp_set_num_threads() function, e.g. omp_set_num_threads(6);
3. Use the OMP_NUM_THREADS environment variable, e.g.
$ export OMP_NUM_THREADS=8
$ ./hello
(A short sketch combining the first two methods follows.)
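A minimal sketch (not from the slides) combining the first two methods; the num_threads clause overrides the omp_set_num_threads() call for its region, which in turn overrides OMP_NUM_THREADS:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(6);                /* method 2: library routine */

  #pragma omp parallel                   /* typically runs with 6 threads */
  printf("Region A: thread %d of %d\n",
         omp_get_thread_num(), omp_get_num_threads());

  #pragma omp parallel num_threads(5)    /* method 1: clause overrides the call above */
  printf("Region B: thread %d of %d\n",
         omp_get_thread_num(), omp_get_num_threads());

  return 0;
}

Method 3 (OMP_NUM_THREADS) takes effect only when neither of the other two is used.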
9
Shared versus Private Data
int main (int argc, char *argv[])
{
  int x;
  int tid;
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    if (tid == 0) x = 42;
    printf ("Thread %d, x = %d\n", tid, x);
  }
}
x is shared by all threads; tid is private: each thread has its own copy. Variables declared outside the parallel construct are shared unless otherwise specified.
10
Abstractly, it looks like this
[Figure: four processors, each holding a private copy of tid (0, 1, 2, 3) in its local memory; the single variable x = 42 resides in shared memory]
11
Shared versus Private Data
$ ./data Thread 3, x = 0 Thread 2, x = 0 Thread 1, x = 0 Thread 0, x = 42 Thread 4, x = 42 Thread 5, x = 42 Thread 6, x = 42 Thread 7, x = 42 tid has a separate value for each thread x has the same value for each thread (well… almost)
12
Another Example Shared versus Private
int x, tid, n, a[100];
#pragma omp parallel private(tid, n)
{
  tid = omp_get_thread_num();
  n = omp_get_num_threads();
  a[tid] = 10*n;
}
OR, with the optional shared clause made explicit:
#pragma omp parallel private(tid, n) shared(a)
...
a[ ] is shared; tid and n are private.
13
Specifying Work Inside a Parallel Region (Work Sharing Constructs)
Four constructs:
sections – each section is executed by a different thread
for – each iteration (or chunk of iterations) is executed by a (potentially) different thread
single – executed by a single thread (sequential)
master – executed by the master thread only (sequential)
There is a barrier after each construct (except master) unless a nowait clause is given. These constructs must be used within a parallel region.
14
Sections Syntax
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section
      structured_block
    ...
  }
}
The outer directive declares the parallel region; the sections are executed by the available threads.
15
Sections Example Threads do not wait after finishing section
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  #pragma omp sections nowait
  {
    #pragma omp section
    {
      printf("Thread %d doing section 1\n",tid);
      for (i=0; i<N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]=%f\n",tid,i,c[i]);
      }
    }
With nowait, threads do not wait after finishing their section. One thread executes this section.
16
Sections example continued
    #pragma omp section
    {
      printf("Thread %d doing section 2\n",tid);
      for (i=0; i<N; i++) {
        d[i] = a[i] * b[i];
        printf("Thread %d: d[%d]=%f\n",tid,i,d[i]);
      }
    }
  } /* end of sections */
  printf ("Thread %d done\n", tid);
} /* end of parallel section */
Another thread executes this section.
17
Sections Output Threads do not wait (i.e. no barrier)
Thread 0 doing section 1 Thread 0: c[0]= Thread 0: c[1]= Thread 0: c[2]= Thread 0: c[3]= Thread 0: c[4]= Thread 3 done Thread 2 done Thread 1 doing section 2 Thread 1: d[0]= Thread 1: d[1]= Thread 1: d[2]= Thread 1: d[3]= Thread 0 done Thread 1: d[4]= Thread 1 done Threads do not wait (i.e. no barrier)
18
If we remove the nowait clause
Thread 0 doing section 1 Thread 0: c[0]= Thread 0: c[1]= Thread 0: c[2]= Thread 0: c[3]= Thread 0: c[4]= Thread 3 doing section 2 Thread 3: d[0]= Thread 3: d[1]= Thread 3: d[2]= Thread 3: d[3]= Thread 3: d[4]= Thread 3 done Thread 1 done Thread 2 done Thread 0 done
There is a barrier at the end of the sections construct: threads wait until all of them are done with the sections.
19
Parallel For Syntax
#pragma omp parallel
{
  #pragma omp for
  for (i = 0; i < N; i++) {
    ...
  }
}
The enclosing parallel region is required. Different iterations will be executed by the available threads. Must be a simple C for loop, where the lower and upper bounds are constants (actually loop invariant).
20
Iteration Space
Suppose N = 13. The iteration space is i = 0, 1, 2, …, 12.
21
Iteration Partitions Without further specification
iterations per partition (chunk size) = ceil(13/4) = 4
Partition 0: i = 0, 1, 2, 3
Partition 1: i = 4, 5, 6, 7
Partition 2: i = 8, 9, 10, 11
Partition 3: i = 12
22
Mapping
Iteration partitions are assigned to processors. Ex.:
[Figure: the four partitions mapped onto two processors, PE 0 and PE 1]
23
Parallel For Example
#pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
{
  tid = omp_get_thread_num();
  if (tid == 0) {
    nthreads = omp_get_num_threads();
    printf("Number of threads = %d\n", nthreads);
  }
  printf("Thread %d starting...\n",tid);
  #pragma omp for
  for (i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
    printf("Thread %d: i = %d, c[%d] = %f\n", tid, i, i, c[i]);
  }
} /* end of parallel section */
Without "nowait", threads wait after finishing the loop.
24
Parallel For Output Iterations of loop are mapped to threads
Thread 1 starting... Thread 1: i = 2, c[2] = Thread 1: i = 3, c[3] = Thread 2 starting... Thread 2: i = 4, c[4] = Thread 3 starting... Number of threads = 4 Thread 0 starting... Thread 0: i = 0, c[0] = Thread 0: i = 1, c[1] =
Iterations of the loop are mapped to threads. The mapping is ceil(N/P) iterations per thread; in this example (N = 5, 4 threads) ceil(5/4) = 2. Barrier here, at the end of the loop.
25
Combining Directives If a Parallel Region consists of only one Parallel For or Parallel Sections, they can be combined #pragma omp parallel sections #pragma omp parallel for
26
Combining Directives Example
#pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
for (i = 0; i < N; i++) {
  c[i] = a[i] + b[i];
}
Declares a parallel region with a parallel for.
27
Scheduling a Parallel For
By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to available threads (static mapping).
Thread 1 starting... Thread 1: i = 2, c[2] = Thread 1: i = 3, c[3] = Thread 2 starting... Thread 2: i = 4, c[4] = Thread 3 starting... Number of threads = 4 Thread 0 starting... Thread 0: i = 0, c[0] = Thread 0: i = 1, c[1] =
The default chunk size here is ceil(5/4) = 2. Barrier at the end of the loop.
28
Scheduling a Parallel For
Static – Partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion:
#pragma omp parallel for schedule (static, chunk_size)
(If chunk_size is not specified, the iterations are divided into chunks of approximately equal size, at most one chunk per thread.)
Dynamic – Chunk-sized blocks of iterations are assigned to threads as they become available:
#pragma omp parallel for schedule (dynamic, chunk_size)
(If chunk_size is not specified, it defaults to 1.)
29
Scheduling a Parallel For
Guided – Similar to dynamic, but the chunk size starts large and gets smaller:*
#pragma omp parallel for schedule (guided)
Runtime – Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used:
#pragma omp parallel for schedule (runtime)
* The actual algorithm is slightly more complicated; see the OpenMP standard.
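A minimal sketch (not from the slides; work() is a placeholder workload) showing one way to write the schedule clause for each policy:

#include <stdio.h>

#define N 1000

static double work(int i) { return (double)i * i; }   /* placeholder per-iteration work */

int main(void)
{
  static double a[N];
  int i;

  #pragma omp parallel for schedule(static, 15)   /* fixed chunks of 15, assigned round robin */
  for (i = 0; i < N; i++) a[i] = work(i);

  #pragma omp parallel for schedule(dynamic, 4)   /* chunks of 4, handed to threads on demand */
  for (i = 0; i < N; i++) a[i] = work(i);

  #pragma omp parallel for schedule(guided)       /* chunk size starts large and shrinks */
  for (i = 0; i < N; i++) a[i] = work(i);

  #pragma omp parallel for schedule(runtime)      /* policy read from OMP_SCHEDULE at run time */
  for (i = 0; i < N; i++) a[i] = work(i);

  printf("a[N-1] = %f\n", a[N-1]);
  return 0;
}

For the runtime schedule, the policy is chosen when the program runs, e.g. $ export OMP_SCHEDULE="dynamic,8" before executing.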
30
Static assignment
Suppose there are N = 100 iterations, chunk_size = 15, and P = 4 threads.
There are ceil(100/15) = 7 blocks (the last block has only 10 iterations).
Each thread will get at most ceil(7/4) = 2 blocks.
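A small sketch (illustrative, not from the slides) that computes this round-robin assignment of blocks to threads; its output matches the cyclic assignment shown on the next slide:

#include <stdio.h>

int main(void)
{
  const int N = 100, chunk = 15, P = 4;

  /* With schedule(static, chunk), block b covers iterations starting at b*chunk
     and is assigned to thread b % P (round robin). */
  for (int b = 0; b * chunk < N; b++) {
    int lo = b * chunk;
    int hi = (lo + chunk - 1 < N - 1) ? lo + chunk - 1 : N - 1;
    printf("block %d: i = %d .. %d -> thread %d\n", b, lo, hi, b % P);
  }
  return 0;
}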
31
Cyclic Assignment of Blocks
PE0: i = 0 … 14 and i = 60 … 74
PE1: i = 15 … 29 and i = 75 … 89
PE2: i = 30 … 44 and i = 90 … 99
PE3: i = 45 … 59
32
Question: Guided scheduling is similar to dynamic scheduling except that the chunk sizes start large and get smaller. What is the advantage of using Guided versus Static? Answer: Guided improves load balance. Is there any disadvantage of using Guided? Answer: Scheduling overhead.
33
Reduction A reduction is when a binary commutative operator is applied to a collection of values producing a single value
34
Reduction
A binary commutative operator applied to a collection of values, producing a single value. E.g., applying summation to the values
83 40 23 85 90 2 74 68 51 33
produces the single value 549. The OpenMP (and MPI) standards do not specify how a reduction should be implemented; however, …
Commutative: changing the order of the operands does not change the result. Associative: the order in which the operations are performed does not matter (with the sequence unchanged), i.e. rearranging parentheses does not alter the result.
35
Reduction Implementation
A reduction could be implemented fairly efficiently on multiple processors using a tree, in which case the time is O(log P).
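As an illustration (one possible implementation sketch, not the OpenMP runtime's actual code), a pairwise tree summation over the ten values from the previous slide; if each step's additions ran on separate processors, only ceil(log2 P) steps would be needed:

#include <stdio.h>

/* Combine p partial values pairwise, halving the number of live values
   at every step: O(log p) steps if each step's additions run in parallel. */
double tree_reduce(double *partial, int p)
{
  for (int stride = 1; stride < p; stride *= 2)
    for (int i = 0; i + stride < p; i += 2 * stride)
      partial[i] += partial[i + stride];   /* pairwise combine */
  return partial[0];
}

int main(void)
{
  double v[] = {83, 40, 23, 85, 90, 2, 74, 68, 51, 33};
  printf("sum = %g\n", tree_reduce(v, 10));   /* prints 549 */
  return 0;
}

In practice the reduction clause, shown on a later slide, lets the compiler and runtime choose how the per-thread copies are combined.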
36
Reduction Operators
Operator   Description
+          Summation
*          Product
^          Bitwise exclusive or
&&         Logical and
&          Bitwise and
||         Logical or
|          Bitwise or
min        Least number in reduction list
max        Largest number in reduction list
(Subtraction (-) is also in the spec, but it performs summation??)
37
Reduction
sum = 0;
#pragma omp parallel for reduction(+:sum)
for (k = 0; k < 100; k++ ) {
  sum = sum + funct(k);
}
In reduction(+:sum), + is the operation and sum is the variable. A private copy of sum is created for each thread by the compiler; each private copy is added to sum at the end. This eliminates the need for critical sections here.
38
Single
#pragma omp parallel
{
  ...
  #pragma omp single
    structured_block
}
Only one thread executes the structured block; there is no guarantee of which one.
39
Single Example
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf ("Thread %d starting...\n", tid);
  #pragma omp single
  {
    printf("Thread %d doing work\n",tid);
    ...
  } // end of single
  printf ("Thread %d done\n", tid);
} // end of parallel section
40
Single Results Only one thread executing the section
Thread 0 starting... Thread 0 doing work Thread 3 starting... Thread 2 starting Thread 1 starting... Thread 0 done Thread 1 done Thread 2 done Thread 3 done Only one thread executing the section “nowait” was NOT specified, so threads wait for the one thread to finish. Barrier here
41
Master
#pragma omp parallel
{
  ...
  #pragma omp master
    structured_block
}
Only one thread (the master) executes the structured block. Cannot specify "nowait" here; there is no barrier after this block, so the other threads will NOT wait.
42
Master Example
#pragma omp parallel private(tid)
{
  tid = omp_get_thread_num();
  printf ("Thread %d starting...\n", tid);
  #pragma omp master
  {
    printf("Thread %d doing work\n",tid);
    ...
  } // end of master
  printf ("Thread %d done\n", tid);
} // end of parallel section
43
Is there any difference between these two approaches:
Master Directive:
#pragma omp parallel
{
  ...
  #pragma omp master
    structured_block
}
Using an if statement:
#pragma omp parallel private(tid)
{
  ...
  tid = omp_get_thread_num();
  if (tid == 0)
    structured_block
}
(Neither approach has an implied barrier after the block; the if version additionally has to look up the thread ID.)
44
Assignment 1 Assignment posted:
Part 1: a tutorial on compiling and executing sample code – on your own computer
Part 2: parallelizing matrix multiplication – on your own computer
Part 3: executing the matrix multiplication code on the cluster
Due Sunday January 31st, 2016 (Week 3)
45
Questions
46
More information