Programming with OpenMP*


1 Programming with OpenMP*
Intel Software College

2 Objectives Upon completion of this module you will be able to use OpenMP to: implement data parallelism implement task parallelism Script: The objective of the course is to prepare students and faculty to use OpenMP to parallelize C, C++, and Fortran applications using either task or data parallelism. The course is expected to take 2 hours for the lecture, plus up to about an additional hour for the labs or demos.

3 Agenda
What is OpenMP? Parallel regions Worksharing Data environment Synchronization Optional Advanced topics Script: The agenda for the course is shown above. We will be covering the rudiments of what OpenMP is, how its clauses are constructed, how it can be controlled with environment variables, etc. We will also look at what parallel regions are and how they are comprised of structured blocks of code. Then we will spend the most time discussing worksharing and the three principal ways worksharing is accomplished in OpenMP: omp for, omp parallel sections, and omp tasks. We will also spend some time to understand the data environment and the ramifications, in several code examples, of different variables being either private or shared. We will look at several key synchronization constructs such as critical sections and the single, master, and atomic clauses. Then we will look at Intel's suggested placement of material within a university curriculum and lay out what topics we think are important to discuss. Finally we will wrap up with some optional material that covers the API, a Monte Carlo Pi example, and other helpful clauses.

4 What Is OpenMP?
Portable, shared-memory threading API Fortran, C, and C++ Multi-vendor support for both Linux and Windows Standardizes task & loop-level parallelism Supports coarse-grained parallelism Combines serial and parallel code in single source Standardizes ~20 years of compiler-directed threading experience Current spec is OpenMP 3.0, 318 pages (combined C/C++ and Fortran) Script: What is OpenMP? OpenMP is a portable (OpenMP codes can be moved between Linux and Windows, for example), shared-memory threading API that standardizes task and loop-level parallelism. Because OpenMP clauses have both lexical and dynamic extent, it is possible to support broad, multi-file, coarse-grained parallelism. Often the best technique is to parallelize at the coarsest grain possible, parallelizing tasks or loops from within the main driver itself, as this gives the most bang for the buck (the most computation for the necessary threading overhead costs). Another key benefit is that OpenMP allows a developer to parallelize an application incrementally. Since OpenMP is primarily a pragma- or directive-based approach, we can easily combine serial and parallel code in a single source. By simply compiling with or without the OpenMP compiler flag we can turn OpenMP on or off; code compiled without the flag simply ignores the OpenMP pragmas, which allows simple access back to the original serial application. OpenMP also standardizes about 20 years of compiler-directed threading experience. For more information, or to review the latest OpenMP spec (currently OpenMP 3.0), go to the OpenMP website.
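A minimal sketch (not from the slides) of the single-source point above: an OpenMP-aware compile defines the standard _OPENMP macro, so the same file builds and runs correctly whether or not the OpenMP flag is used. The file layout and messages here are illustrative assumptions.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
    /* With OpenMP enabled, the pragma forks a team of threads;
       compiled without the OpenMP flag it is ignored and the
       block below runs once, serially, from the same source. */
    #pragma omp parallel
    {
#ifdef _OPENMP
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
#else
        printf("hello from a serial build\n");
#endif
    }
    return 0;
}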

5 Programming Model Fork-Join Parallelism:
Master thread spawns a team of threads as needed Parallelism is added incrementally: that is, the sequential program evolves into a parallel program [Diagram: a master thread repeatedly forking into teams of threads and joining again across successive parallel regions.] Script: OpenMP uses a fork-join methodology to implement parallelism. A master thread, shown in red, begins executing code. At an omp parallel directive, the master thread forks other threads to do work. At a specified point in the code called a barrier, the master thread will wait for all the child threads to finish their work before proceeding. Because we can have multiple parallel regions, each created as above, we see that parallelism can be added incrementally: we can work on one region of code and get it running well as a parallel region, and we can simply turn off OpenMP in other, more problematic parallel regions until we get the desired program behaviour. That's the high-level overview of OpenMP; now we will look at some of the details to get us started. More background info, in case someone asks whether the threads are really created and destroyed between parallel regions: the Intel OpenMP implementation maintains a thread pool to minimize overhead from thread creation. Worker threads are reused from one parallel region to the next. In serial regions, the worker threads either sleep, spin, or spin for a user-specified amount of time before going to sleep. This behavior is controlled by the KMP_BLOCKTIME environment variable or the KMP_SET_BLOCKTIME function, which are described in the compiler documentation. Worker thread behavior in the serial regions can affect application performance. On shared systems, for example, spinning threads consume CPU resources that could be used by other applications. However, there is overhead associated with waking sleeping threads.

6 A Few Syntax Details to Get Started
Most of the constructs in OpenMP are compiler directives or pragmas For C and C++, the pragmas take the form: #pragma omp construct [clause [clause]…] For Fortran, the directives take one of the forms: C$OMP construct [clause [clause]…] !$OMP construct [clause [clause]…] *$OMP construct [clause [clause]…] Header file or Fortran 90 module: #include "omp.h" use omp_lib Script: Most of the constructs we will be looking at in OpenMP are compiler directives or pragmas. The C/C++ and Fortran versions of these directives are shown here. For C and C++, the pragmas take the form #pragma omp construct [clause [clause]…], where the clauses are optional modifiers. Be sure to include "omp.h" if you intend to use any routines from the OpenMP library. Now let's look at some environment variables that control OpenMP behavior. Note to Fortran users: you can use any of the sentinels shown (C$OMP, !$OMP, *$OMP), but in fixed-form source they must line up starting in column 1, and column 6 must be blank or contain a character (such as +) indicating that the line is a continuation of the previous line.

7 Agenda
What is OpenMP? Parallel regions Worksharing Data environment Synchronization Optional Advanced topics Script: Now we will discuss parallel regions.

8 Parallel Region & Structured Blocks (C/C++)
Most OpenMP constructs apply to structured blocks Structured block: a block with one point of entry at the top and one point of exit at the bottom The only "branches" allowed are STOP statements in Fortran and exit() in C/C++
A structured block:
#pragma omp parallel
{
   int id = omp_get_thread_num();
   more: res[id] = do_big_job(id);
   if (conv(res[id])) goto more;
}
printf("All done\n");
Not a structured block:
if (go_now()) goto more;
#pragma omp parallel
{
   int id = omp_get_thread_num();
   more: res[id] = do_big_job(id);
   if (conv(res[id])) goto done;
   goto more;
}
done: if (!really_done()) goto more;
Script: A parallel region is created using the #pragma omp parallel construct. A master thread creates a pool of worker threads once the master thread crosses this pragma. On this foil, the creation of the parallel region is highlighted in yellow and includes the pragma and the left curly brace "{". The parallel region extends from the left curly brace to the highlighted yellow right curly brace "}". There is an implicit barrier at the right curly brace, and that is the point at which the other worker threads complete execution and either go to sleep, spin, or otherwise idle. Parallel constructs form the foundation of OpenMP parallel execution. Each time an executing thread enters a parallel region, it creates a team of threads and becomes master of that team. This allows parallel execution to take place within that construct by the threads in that team. The directive needed for a parallel region is #pragma omp parallel. A parallel region consists of a structured block of code. The first example is a good example of a structured block: there is a single point of entry into the block at the top, one exit at the bottom, and no branches out of the block. Question to the class: can someone spot some reasons why the second block is unstructured? Here are a couple of reasons: it has multiple entrances and multiple exits. There are two entries into the block, one from the top of the block and one from the "goto more" statement outside it, which jumps into the block at the label "more:". Additionally, the bad example has multiple exits from the block: one from the bottom of the block and one from the "goto done" statement, which jumps out to the "done:" label after the block.

9 Activity 1: Hello Worlds
Modify the "Hello, Worlds" serial code to run multithreaded using OpenMP* Script: Take about 5 minutes to build and run the hello world lab. In this example we will print "hello world" from several threads. Run the code several times. Do you see any issues with the code; do you always get the expected results? Does anyone in the class see odd behavior in the sequence of the words that are printed out? Since printf writes to a single shared console, some students are likely to see "race conditions" where the output of one thread's printf is interleaved with, or overwritten by, the output of another thread's printf.

10 Agenda
What is OpenMP? Parallel regions Worksharing – Parallel For Data environment Synchronization Optional Advanced topics Script: Let's look at worksharing.

11 Worksharing
Automatically divides work among threads Worksharing is the general term used in OpenMP to describe distribution of work across threads. Three examples of worksharing in OpenMP are: omp for construct omp sections construct omp task construct Script: Worksharing is the general term used in OpenMP to describe distribution of work across threads. There are three primary categories of worksharing in OpenMP. The three examples are: the omp for construct, which automatically divides a for loop's work up and distributes it across threads; the omp sections directive, which distributes work among threads bound to a defined parallel region and is good for function-level parallelism where the tasks or functions are well defined and known at compile time; and the omp task pragma, which can be used to explicitly define a task. Now we will look more closely at the omp for construct.

12 omp for construct
[Diagram: the twelve iterations i = 1 … i = 12 of a loop inside #pragma omp parallel / #pragma omp for, divided among three threads, with an implicit barrier at the end of the work-sharing construct.]
// assume N=12
#pragma omp parallel
#pragma omp for
for (i = 1; i < N+1; i++)
   c[i] = a[i] + b[i];
Threads are assigned an independent set of iterations Threads must wait at the end of the work-sharing construct Script: The omp for directive instructs the compiler to distribute loop iterations within the team of threads that encounters this work-sharing construct. In this example the omp for construct resides inside a parallel region where a team of threads has been created; for the sake of argument, let's say that 3 threads are in the thread team. The omp for construct takes the number of loop iterations, N, and divides N by the number of threads in the team to arrive at a unit of work for each thread. Omp for assigns a different set of iterations to each thread. So let's say that N has the value 12. Then each of the 3 threads gets to work on 4 loop iterations: one thread is assigned to work on iterations 1-4, the next thread on 5-8, etc. At the end of the parallel region, the threads rejoin and a single thread exits the parallel region. Effectively, omp for was able to cut the execution time down significantly; for this loop the execution time would be on the order of 33% of the serial time, a reduction of 66%. To use the omp for directive, it must be placed immediately prior to the loop in question, as we see in the above code. Next we'll see how to combine OpenMP constructs. Background (see the OpenMP spec, or visit publib.boulder.ibm.com/infocenter for the definition and usage): #pragma omp for [clause], where clause is any of the following: private (list) Declares the scope of the data variables in list to be private to each thread. Data variables in list are separated by commas. firstprivate (list) Declares the scope of the data variables in list to be private to each thread. Each new private object is initialized as if there was an implied declaration within the statement block. Data variables in list are separated by commas. lastprivate (list) Declares the scope of the data variables in list to be private to each thread. The final value of each variable in list, if assigned, will be the value assigned to that variable in the last iteration. Variables not assigned a value will have an indeterminate value. Data variables in list are separated by commas. reduction (operator:list) Performs a reduction on all scalar variables in list using the specified operator. Reduction variables in list are separated by commas. A private copy of each variable in list is created for each thread. At the end of the statement block, the final values of all private copies of the reduction variable are combined in a manner appropriate to the operator, and the result is placed back into the original value of the shared reduction variable. Variables specified in the reduction clause: must be of a type appropriate to the operator, must be shared in the enclosing context, must not be const-qualified, and must not have pointer type. ordered Specify this clause if an ordered construct is present within the dynamic extent of the omp for directive. schedule (type) Specifies how iterations of the for loop are divided among available threads. Acceptable values for type are: auto With auto, scheduling is delegated to the compiler and runtime system.
The compiler and runtime system can choose any possible mapping of iterations to threads (including all possible valid schedules) and these may be different in different loops. dynamic Iterations of a loop are divided into chunks of size ceiling(number_of_iterations/number_of_threads). Chunks are dynamically assigned to threads on a first-come, first-serve basis as threads become available. This continues until all work is completed. dynamic,n As above, except chunks are set to size n. n must be an integral assignment expression of value 1 or greater. guided Chunks are made progressively smaller until the default minimum chunk size is reached. The first chunk is of size ceiling(number_of_iterations/number_of_threads). Remaining chunks are of size ceiling(number_of_iterations_left/number_of_threads). The minimum chunk size is 1. Chunks are assigned to threads on a first-come, first-serve basis as threads become available. This continues until all work is completed. guided,n As above, except the minimum chunk size is set to n. n must be an integral assignment expression of value 1 or greater. runtime Scheduling policy is determined at run time. Use the OMP_SCHEDULE environment variable to set the scheduling type and chunk size. static Iterations of a loop are divided into chunks of size ceiling(number_of_iterations/number_of_threads). Each thread is assigned a separate chunk. This scheduling policy is also known as block scheduling. static,n Iterations of a loop are divided into chunks of size n. Each chunk is assigned to a thread in round-robin fashion. n must be an integral assignment expression of value 1 or greater. This scheduling policy is also known as block cyclic scheduling. Note: if n=1, iterations of a loop are divided into chunks of size 1 and each chunk is assigned to a thread in round-robin fashion. nowait Use this clause to avoid the implied barrier at the end of the for directive. This is useful if you have multiple independent work-sharing sections or iterative loops within a given parallel region. Only one nowait clause can appear on a given for directive. The for_loop itself must be a for loop construct with the canonical shape for (init_expr; exit_cond; incr_expr) statement, where: init_expr takes the form iv = b or integer-type iv = b; exit_cond takes the form iv <= ub, iv < ub, iv >= ub, or iv > ub; and incr_expr takes the form ++iv, iv++, --iv, iv--, iv += incr, iv -= incr, iv = iv + incr, iv = incr + iv, or iv = iv - incr. Here iv is the iteration variable: it must be a signed integer not modified anywhere within the for loop; it is implicitly made private for the duration of the for operation; and if not specified as lastprivate, it will have an indeterminate value after the operation completes. b, ub, and incr are loop-invariant signed integer expressions: no synchronization is performed when evaluating these expressions, and evaluated side effects may result in indeterminate values. Usage: this pragma must appear immediately before the loop or loop block to be affected. Program sections using the omp for pragma must be able to produce a correct result regardless of which thread executes a particular iteration. Similarly, program correctness must not rely on using a particular scheduling algorithm. The for loop iteration variable is implicitly made private in scope for the duration of loop execution. This variable must not be modified within the body of the for loop.
The value of the increment variable is indeterminate unless the variable is specified as having a data scope of lastprivate. An implicit barrier exists at the end of the for loop unless the nowait clause is specified. Restrictions are: The for loop must be a structured block, and must not be terminated by a break statement. Values of the loop control expressions must be the same for all iterations of the loop. An omp for directive can accept only one schedule clause. The value of n (chunk size) must be the same for all threads of a parallel region.
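As a small illustration of the nowait clause described in the notes above, here is a hedged sketch (the function name, array names, and operations are illustrative assumptions, not from the course): two independent loops share one parallel region, and nowait on the first loop lets threads that finish early start the second loop instead of waiting at the implied barrier.

void scale_and_shift(float *a, float *b, int n)
{
    #pragma omp parallel
    {
        /* nowait removes the implied barrier at the end of this loop;
           it is safe here only because the second loop touches a
           different, independent array. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] *= 2.0f;

        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] += 1.0f;
    }
}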

13 Combining constructs These two code segments are equivalent
#pragma omp parallel
{
   #pragma omp for
   for (i=0; i < MAX; i++) {
      res[i] = huge();
   }
}
#pragma omp parallel for
for (i=0; i < MAX; i++) {
   res[i] = huge();
}
Script: In these equivalent code snippets we see that OpenMP constructs can be combined down to a single statement. Here the #pragma omp parallel and the nested #pragma omp for from the first example are combined into the single, tidier construct in the second: #pragma omp parallel for. Most often in this course we will use the more abbreviated, combined version. There can be occasions, however, when it may be useful to separate the constructs, such as when the parallel region has other work to do in addition to the for, perhaps an additional omp sections directive. Now we'll turn to lab activity 2. Background (side note, not shown but also useful): it is not just omp for that can be merged with omp parallel; it is also allowed to merge omp sections with omp parallel, giving #pragma omp parallel sections.

14 The Private Clause Reproduces the variable for each task
Variables are un-initialized; C++ object is default constructed Any value external to the parallel region is undefined
void* work(float* c, int N)
{
   float x, y;
   int i;
   #pragma omp parallel for private(x,y)
   for (i=0; i<N; i++) {
      x = a[i];
      y = b[i];
      c[i] = x + y;
   }
}
Script: We've just about wrapped up our coverage of the omp parallel for construct. Before we move on to the lab, however, we need to introduce one more concept: private variables. For reasons that we will go into later in this module, it is important to be able to make some variables have copies that are private to each thread. The omp private clause accomplishes this. Declaring a variable to be private with the omp private clause means that each thread has its own copy of that variable that can be modified without affecting any other thread's value of its similarly named variable. In this example, each thread will have private copies of the variables x and y. This means that thread 1, thread 2, thread 3, etc. all have a variable named x and a variable named y, but thread 1's variable x can contain a different value than thread 2's variable x. This construct allows each thread to proceed without affecting the computation of the other threads. We will talk more about private variables later in the context of race conditions. Background (private clause): the for-loop iteration variable is PRIVATE by default.

15 Activity 2 – Parallel Mandelbrot
Objective: create a parallel version of Mandelbrot. Modify the code to add OpenMP worksharing clauses to parallelize the computation of Mandelbrot. Follow the next Mandelbrot activity called Mandelbrot in the student lab doc Script: In this exercise we will use the combined #pragma omp parallel for construct to parallelize a specific loop in the mandelbrot application. Then we will use wall clock time to evaluate the performance of the parallel version as compared to the serial version

16 The schedule clause The schedule clause affects how loop iterations are mapped onto threads schedule(static [,chunk]) Blocks of iterations of size "chunk" to threads Round robin distribution Low overhead, may cause load imbalance schedule(dynamic[,chunk]) Threads grab "chunk" iterations When done with iterations, thread requests next set Higher threading overhead, can reduce load imbalance schedule(guided[,chunk]) Dynamic schedule starting with large block Size of the blocks shrink; no smaller than "chunk" Script: The omp for loop can be modified with a schedule clause that affects how the loop iterations are mapped onto threads. This mapping can dramatically improve performance by eliminating load imbalances or by reducing threading overhead. We are going to dissect the schedule clause, look at several options we can use in it, and see how OpenMP loop behavior changes for each option. The first schedule clause we are going to examine is the schedule(static) clause. Let's advance to the first sub-animation on the slide. For the sake of argument, we are going to assume that the loops in question have N iterations and that we have 4 threads in the thread pool, just to make the concepts a little more tangible. To talk about this scheduling we first need to define what a chunk is. A chunk is a contiguous range of iterations; iterations 0 through 99, for example, would be considered a chunk of iterations. What schedule(static) does is break the for loop into chunks of iterations. Each thread in the team gets one chunk. If we have N total iterations, schedule(static) assigns a chunk of N/(number of threads) iterations to each thread for execution. schedule(static, chunk): let's assume that chunk is 8. Then schedule(static,8) would interleave the allocation of chunks of size 8 to threads. That means that thread 1 gets 8 iterations, then thread 2 gets another 8, etc. The chunks of 8 are doled out in round-robin fashion to whatever threads are free for execution. Increasing chunk size reduces overhead and may increase cache hit rate; decreasing chunk size allows for finer balancing of workloads. Next animation. schedule(dynamic) takes more overhead to accomplish, but it effectively assigns iterations to threads one at a time dynamically. This is great for loops where iteration 1 takes a vastly different amount of computation than iteration N, for example. Using dynamic scheduling can greatly improve threading load balance in such cases. Threading load balance is the ideal state where all threads have an equal amount of work to do and can all finish their work in roughly an equal amount of time. schedule(dynamic, chunk) is similar to dynamic but assigns the threads a chunk at a time dynamically rather than one iteration at a time. Next animation. schedule(guided) is a specific case of dynamic scheduling, useful where the computation is known to take less time for early iterations and significantly longer for later iterations. For example, in finding prime numbers using a sieve, the larger primes take more time to test than the smaller primes; so if we are testing in a large loop, the earlier iterations compute quickly but the later iterations take a long time, and schedule(guided) may be a great choice for this situation. schedule(guided, chunk): dynamic allocation of chunks to tasks using the guided self-scheduling heuristic.
Initial chunks are bigger, later chunks are smaller; the minimum chunk size is "chunk". Let's look at recommended uses: Use static scheduling for predictable and similar work per iteration. Use dynamic scheduling for unpredictable, highly variable work per iteration. Use guided scheduling as a special case of dynamic to reduce scheduling overhead when the computation gets progressively more time consuming. Use auto scheduling to let the compiler or runtime environment select the schedule. Let's look at an example now; see the next foil. Background: see Dr. Michael Quinn's book, Parallel Programming in C with MPI and OpenMP. Here are some quick notes regarding each of the options we see on the screen: schedule(static) - block allocation of N/threads contiguous iterations to each thread. schedule(static, S) - interleaved allocation of chunks of size S to threads. schedule(dynamic) - dynamic one-at-a-time allocation of iterations to threads. schedule(dynamic, S) - dynamic allocation of S iterations at a time to threads. schedule(guided) - guided self-scheduling; minimum chunk size is 1. schedule(guided, S) - dynamic allocation of chunks to tasks using the guided self-scheduling heuristic; initial chunks are bigger, later chunks are smaller, and the minimum chunk size is S. Note: When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case. A compliant implementation of static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and 3) both loop regions bind to the same parallel region. A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied, allowing safe use of the nowait clause (see Section A.9 on page 170 of the spec for examples). dynamic: When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations. When no chunk_size is specified, it defaults to 1. guided: When schedule(guided, chunk_size) is specified, the iterations are assigned to threads in the team in chunks as the executing threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. For a chunk_size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations).
auto When schedule(auto) is specified, the decision regarding scheduling is delegated to the compiler and/or runtime system. The programmer gives the implementation the freedom to choose any possible mapping of iterations to threads in the team. runtime When schedule(runtime) is specified, the decision regarding scheduling is deferred until run time, and the schedule and chunk size are taken from the run-sched-var ICV. If the ICV is set to auto, the schedule is implementation defined.
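To make the scheduling discussion above concrete, here is a hedged sketch (the function names, the chunk size of 4, and the artificial inner loop are illustrative assumptions, not course code): because later iterations do more work, schedule(dynamic, 4) hands out chunks of four iterations to whichever thread is free, which balances the load better than a plain static split.

#include <math.h>

void uneven_work(double *out, int n)
{
    #pragma omp parallel for schedule(dynamic, 4)
    for (int i = 0; i < n; i++) {
        double s = 0.0;
        for (int k = 0; k <= i; k++)      /* later iterations cost more */
            s += sqrt((double)k + 1.0);
        out[i] = s;                       /* each iteration writes its own slot */
    }
}

Changing the clause to schedule(guided, 4), or to schedule(runtime) with OMP_SCHEDULE set in the environment, is a one-line experiment, which is the spirit of the Mandelbrot scheduling lab that follows.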

17 Schedule Clause Example
#pragma omp parallel for schedule (static, 8)
for( int i = start; i <= end; i += 2 ) {
   if ( TestForPrime(i) ) gPrimesFound++;
}
Iterations are divided into chunks of 8 If start = 3, then first chunk is i={3,5,7,9,11,13,15,17} Script: This example performs a test for primes. It uses an omp parallel for construct modified with a schedule(static,8) clause. This example has an increasing amount of work as the iteration counter gets larger; that is because testing larger numbers for primality with a brute-force method takes longer, since there are more candidate factors to test. This C example uses static scheduling: the set of iterations is divided up into chunks of size 8 and distributed to threads in round-robin fashion. Let's compare plain static scheduling with static,8 scheduling, and let's assume we have 4 threads in the team and that start = 1 and end = 1001. With simple static scheduling and 4 threads, each thread would be assigned a chunk of about 250 iterations. The first 250 iterations would be assigned to thread 1, the next 250 to thread 2, and so on. The last thread, thread 4, would get the 250 most difficult iterations and would take far longer than the others. On the other hand, when we use static,8 the runtime groups 8 iterations into a chunk. As we approach the end of the loop, we would have 4 threads all computing the more difficult calculations. Arguably the thread computing the second-to-last chunk, iterations numbered 985-993, is almost as challenged as the thread computing the last chunk, from iteration 993 to the end of the range. So we see that load balancing has improved. It may be even better to try dynamic or guided here and see how the performance compares. Now it's time for a lab.

18 Activity 2b – Mandelbrot Scheduling
Objective: create a parallel version of Mandelbrot that uses OpenMP dynamic scheduling Follow the next Mandelbrot activity called Mandelbrot Scheduling in the student lab doc Script: In this activity you will experiment with the OpenMP scheduling clause to empirically determine the best scheduling method for Mandelbrot. Static scheduling is not the best choice because the workload in the middle of the graphic is complicated and each row of pixels there takes a long time to compute, while a row of pixels near the top of the screen is very quick to compute. Next we will look at improving the Mandelbrot application by adding a scheduling clause and observing the wall-clock time for execution. Ask the students which scheduling method seemed to work best.

19 Agenda
What is OpenMP? Parallel regions Worksharing – Parallel Sections Data environment Synchronization Optional Advanced topics Script: Now a very brief look at function-level parallelism, also called task decomposition. We'll also look at omp parallel sections.

20 Task Decomposition
a = alice();
b = bob();
s = boss(a, b);
c = cy();
printf ("%6.2f\n", bigboss(s,c));
[Diagram: dependence graph in which alice and bob feed boss, and boss and cy feed bigboss.]
alice, bob, and cy can be computed in parallel Script: We will now look at ways to take advantage of task decomposition, also known as function-level parallelism. In this example, we have the functions alice, bob, boss, cy, and bigboss. There are various dependencies among these functions, as seen in the directed edge graph to the right. boss can't complete until the alice and bob functions are complete. bigboss can't complete until boss and cy are completed. How do we parallelize such a function or task dependency? We'll see one approach in a couple of slides. For now, let's identify the functions that can be computed independently (in parallel): alice, bob, and cy can all be computed at the same time, as there are no dependencies among them. Let's see if we can put this to good advantage later. Background: get more info in Michael Quinn's excellent book, Parallel Programming in C with MPI and OpenMP.

21 omp sections #pragma omp sections Must be inside a parallel region
Precedes a code block containing N blocks of code that may be executed concurrently by N threads Encompasses each omp section #pragma omp section Precedes each block of code within the encompassing block described above May be omitted for the first parallel section after the parallel sections pragma Enclosed program segments are distributed for parallel execution among available threads Script: The omp sections directive distributes work among threads bound to a defined parallel region. The omp sections construct (note the "s" in sections) indicates that there will be two or more omp section constructs ahead that can be executed in parallel. The omp sections construct must either be inside a parallel region or must be part of a combined omp parallel sections construct. When program execution reaches an omp sections directive, the program segments defined by the following omp section directives are distributed for parallel execution among available threads. The parallelism comes from executing each omp section in parallel. Let's look back at the previous example to see how to apply this to our boss/bigboss example. Background: get more info in Michael Quinn's excellent book, Parallel Programming in C with MPI and OpenMP, see the OpenMP spec, or visit publib.boulder.ibm.com/infocenter for the definition and usage. Parameters: clause is any of the following: private (list) Declares the scope of the data variables in list to be private to each thread. Data variables in list are separated by commas. firstprivate (list) Declares the scope of the data variables in list to be private to each thread. Each new private object is initialized as if there was an implied declaration within the statement block. Data variables in list are separated by commas. lastprivate (list) Declares the scope of the data variables in list to be private to each thread. The final value of each variable in list, if assigned, will be the value assigned to that variable in the last section. Variables not assigned a value will have an indeterminate value. Data variables in list are separated by commas. reduction (operator: list) Performs a reduction on all scalar variables in list using the specified operator. Reduction variables in list are separated by commas. A private copy of each variable in list is created for each thread. At the end of the statement block, the final values of all private copies of the reduction variable are combined in a manner appropriate to the operator, and the result is placed back into the original value of the shared reduction variable. Variables specified in the reduction clause: must be of a type appropriate to the operator, must be shared in the enclosing context, must not be const-qualified, and must not have pointer type. nowait Use this clause to avoid the implied barrier at the end of the sections directive. This is useful if you have multiple independent work-sharing sections within a given parallel region. Only one nowait clause can appear on a given sections directive. Usage: The omp section directive is optional for the first program code segment inside the omp sections directive. Following segments must be preceded by an omp section directive. All omp section directives must appear within the lexical construct of the program source code segment associated with the omp sections directive. When program execution reaches an omp sections directive, the program segments defined by the following omp section directives are distributed for parallel execution among available threads.
A barrier is implicitly defined at the end of the larger program region associated with the omp sections directive unless the nowait clause is specified.

22 Functional Level Parallelism w/ sections
#pragma omp parallel sections
{
   #pragma omp section   /* Optional */
      a = alice();
   #pragma omp section
      b = bob();
   #pragma omp section
      c = cy();
}
s = boss(a, b);
printf ("%6.2f\n", bigboss(s,c));
Script: Here we have enclosed the omp sections in the omp parallel construct (as the combined omp parallel sections) and placed the code of interest inside the parallel sections construct's code block. Next we added an omp section construct in front of each of the tasks that could be executed in parallel, namely alice, bob, and cy. When all of the threads executing the parallel sections reach the implicit barrier (the right curly brace "}" at the end of the parallel sections block), the master thread continues on, executing the boss function and later the bigboss function. Another possible approach that Quinn points out is to compute alice and bob together, then compute boss and cy together, then compute bigboss:
#pragma omp parallel sections
{
   #pragma omp section   /* Optional */
      a = alice();
   #pragma omp section
      b = bob();
}
#pragma omp parallel sections
{
   #pragma omp section
      s = boss(a, b);
   #pragma omp section
      c = cy();
}
printf ("%6.2f\n", bigboss(s,c));
Background: get more info in Michael Quinn's excellent book, Parallel Programming in C with MPI and OpenMP.

23 Advantage of Parallel Sections
Independent sections of code can execute concurrently, reducing execution time
#pragma omp parallel sections
{
   #pragma omp section
      phase1();
   #pragma omp section
      phase2();
   #pragma omp section
      phase3();
}
[Diagram: serial vs. parallel flow, with the three phases running one after another in the serial case and side by side in the parallel case.]
Script: In this example, phase1, phase2, and phase3 represent completely independent tasks. Sections are distributed among the threads in the parallel team. Each section is executed only once and each thread may execute zero or more sections. It is not possible to determine whether or not one section will be executed before another; therefore, the output of one section should not serve as the input to another concurrent section. Notice the overall parallelism achievable in the serial/parallel flow diagram. Now we will begin our exploration of omp tasks.

24 Agenda
What is OpenMP? Parallel regions Worksharing – Tasks Data environment Synchronization Optional Advanced topics Script: Next stop: omp tasks.

25 New Addition to OpenMP Tasks – Main change for OpenMP 3.0
Allows parallelization of irregular problems unbounded loops recursive algorithms producer/consumer Script: Tasks are a powerful addition to OpenMP as of the OpenMP 3.0 spec. Tasks allow parallelization of irregular problems that were impossible, or very difficult, to parallelize in OpenMP prior to 3.0. Now it is possible to parallelize unbounded loops (such as while loops), recursive algorithms, and producer/consumer patterns. Let's explore what tasks actually are.

26 What are tasks? Tasks are independent units of work
Threads are assigned to perform the work of each task Tasks may be deferred Tasks may be executed immediately The runtime system decides which of the above Tasks are composed of: code to execute data environment internal control variables (ICV) Script: First of all, tasks are independent units of work that get threads assigned to them in order to do some calculation. The assigned threads might start executing immediately, or their execution might be deferred, depending on decisions made by the OS and runtime. Tasks are composed of three components: 1) code to execute, the literal code in your program enclosed by the task directive; 2) a data environment, the shared and private data manipulated by the task; 3) internal control variables, thread scheduling and environment-variable type controls. A task is a specific instance of executable code and its data environment, generated when a thread encounters a task construct or a parallel construct. Background: New concept in OpenMP 3.0: the explicit task. OpenMP has simply added a way to create a task explicitly for a team of threads to execute. Key concept: all parallel execution is done in the context of a parallel region. A thread encountering a parallel construct packages up a set of N implicit tasks, one per thread; a team of N threads is created; and each thread begins execution of a separate implicit task immediately. Every part of an OpenMP program is part of one task or another!

27 Simple Task Example
#pragma omp parallel   // assume 8 threads
{
   #pragma omp single private(p)
   {
      while (p) {
         #pragma omp task
            processwork(p);
         p = p->next;
      }
   }
}
A pool of 8 threads is created here One thread gets to execute the while loop The single "while loop" thread creates a task for each instance of processwork() Script: In this example, we are looking at pointer chasing over a linked list. We create a parallel region using the #pragma omp parallel construct, and we are assuming, for the sake of illustration, that 8 threads are created once the master thread crosses into the parallel region. At this point we have a team of 8 threads. Let's also assume that the linked list contains ~1000 nodes. We immediately limit the number of threads that will operate the while loop: we only want one while loop running. Without the single construct, we would have 8 identical copies of the while loop all trying to process work and getting in each other's way. The omp task construct copies the code, data, and internal control variables to a new task, let's call it task01, and gives that task a thread from the team to execute the task's instance of the code and data. Since the omp task construct is called from within the while loop, and since the while loop is going to traverse all 1000 nodes, at some point 1000 tasks will be generated. It is unlikely that all 1000 tasks will be generated at the same time. Since we only have 8 threads to service the 1000 tasks, and the master thread is busy controlling the while loop, we will effectively have 7 threads to do the actual processwork. Let's say that the master thread keeps generating tasks and the 7 worker threads can't consume these tasks quickly enough. Then eventually the master thread may "task switch": it may suspend the work of controlling the while loop and creating tasks, because the runtime may decide that the master should begin servicing the tasks just like the rest of the threads. When the task pool drains enough due to the extra help, the master thread may task switch back to executing the while loop and begin generating new tasks once more. This process is at the heart of OpenMP tasks. We'll see some animations to demonstrate this in the next few foils.

28 Task Construct – Explicit Task View
A team of threads is created at the omp parallel construct A single thread is chosen to execute the while loop – let's call this thread "L" Thread L operates the while loop, creates tasks, and fetches next pointers Each time L crosses the omp task construct it generates a new task and has a thread assigned to it Each task runs in its own thread All tasks complete at the barrier at the end of the parallel region's single construct
#pragma omp parallel
{
   #pragma omp single
   {  // block 1
      node * p = head;
      while (p) {          // block 2
         #pragma omp task
            process(p);
         p = p->next;      // block 3
      }
   }
}
Script: This foil is one way to look at pointer chasing in a linked list, where the list must be traversed and each node in the linked list has to be processed by the function "process". Here we see the overview of the flow of this code snippet: A team of threads is created at the omp parallel construct. A single thread is chosen to execute the while loop; let's call this thread "L". Thread L operates the while loop, creates tasks, and fetches next pointers. Each time L crosses the omp task construct it generates a new task and has a thread assigned to it. Each task runs in its own thread. All tasks complete at the barrier at the end of the parallel region. The next foil will give more insight into the parallelism advantage of this approach.

29 Why are tasks useful?
Have potential to parallelize irregular patterns and recursive function calls
[Diagram: a single-threaded timeline running block 1, then block 2 / task 1, block 3, block 2 / task 2, block 3, block 2 / task 3 in sequence, versus a four-thread timeline in which thread 1 runs blocks 1 and 3 while threads 2-4 run the three block-2 tasks concurrently, with some idle time on the workers and an overall "time saved".]
#pragma omp parallel
{
   #pragma omp single
   {  // block 1
      node * p = head;
      while (p) {          // block 2
         #pragma omp task
            process(p);
         p = p->next;      // block 3
      }
   }
}
Script: Here is another look at the same example, but emphasizing the potential performance payoff from the pipelined approach that we get from using tasks. First off, observe that in single-threaded mode all the work is done sequentially: block 1 is executed (node p is assigned to head), then block 2 (pointer p is processed by process(p)), then block 3 (read the next pointer in the linked list), then blocks 2 and 3 repeat, and so on. 1st animation: now consider the same code executed in parallel. First, the master thread crosses the omp parallel construct and creates a team of threads. Next one of those threads is chosen to execute the while loop; let's call it thread L. Thread L encounters an omp task construct at block 2, which copies the code and data for process(p) to a new task, we'll call it task 1. Thread L increments the pointer p, grabbing a new node from the list, and loops to the top of the while loop. Then thread L again encounters the omp task construct at block 2 and creates task 2, and on the next trip, task 3. So thread L's job is simply to hand out work and traverse the linked list. The parallelism comes from the fact that thread L does not have to wait for the results of any task before generating a new task. If the system has sufficient resources (enough cores, registers, memory, etc.), then task 1, task 2, and task 3 can all be computed in parallel. So, roughly speaking, the execution time will be about the duration of the longest executing task (task 2 in this case) plus some extra administration time for thread L. The time saved can be significant compared to the serial execution of the same code. Obviously, the more parallel resources that are supplied the better the parallelism, up until the point where the longest serial task begins to dominate the total execution time. Now it's time for a lab activity.

30 Activity 3 – Linked List using Tasks
Objective: Modify the linked list pointer-chasing code to use tasks to parallelize the application Follow the linked list task activity called LinkedListTask in the student lab doc Script: Likely this lab will have to be skipped due to time constraints. However, it does show fair speedup by using tasks to parallelize a pointer-chasing while loop. This is a fairly simple lab in which you will start with a serial version of the application, add a few OpenMP pragmas, and build and run the application. Let's now look at when and where tasks are guaranteed to be complete.
while(p != NULL){
   do_work(p->data);
   p = p->next;
}

31 When are tasks guaranteed to be complete?
Tasks are guaranteed to be complete: At thread or task barriers At the directive: #pragma omp barrier At the directive: #pragma omp taskwait Script: Now we are going to quickly explore when tasks are guaranteed to be complete. For computations with dependent tasks, where task B specifically relies on completion of task A, it is sometimes necessary for developers to know explicitly when or where a task can be guaranteed to be completed. This foil addresses that concern. Tasks are guaranteed to be complete at the following three locations: at thread or task barriers, such as the end of a parallel region, the end of a single region, or the end of a parallel for region (we'll talk more about implicit thread or task barriers later); at the directive #pragma omp barrier; and at the directive #pragma omp taskwait. In the case of #pragma omp taskwait, the encountering task suspends at the point of the directive until all its children (all the child tasks created by the encountering task up to this point) are complete. Only direct children, not descendants! Similarly, a thread barrier (implicit or explicit) includes an implicit taskwait. This pretty much wraps up the discussion of tasks; now we are going to look at data environment topics such as data scoping.

32 Task Completion Example
#pragma omp parallel
{
   #pragma omp task
      foo();
   #pragma omp barrier
   #pragma omp single
   {
      #pragma omp task
         bar();
   }
}
Multiple foo tasks created here – one for each thread All foo tasks guaranteed to be completed here One bar task created here bar task guaranteed to be completed here Script: Let's take a look at an example that demonstrates where tasks are guaranteed to be complete. In this example, the master thread crosses the parallel construct and a team of N threads is created. Each thread in the team creates a task; in this case each thread creates a "foo" task, so that there are now N foo tasks. The exit of the omp barrier construct is where we are guaranteed that all N of the foo tasks are complete. Next a single thread crosses the omp task construct and a single task is created to execute the "bar" function. The bar task is guaranteed to be complete at the exit of the single construct's code block, the right curly brace that signifies the end of the single region. Now let's move on to the next item in the agenda.

33 Agenda
What is OpenMP? Parallel regions Worksharing Data environment Synchronization Optional Advanced topics Script: Next up is the data environment, including data scoping, shared & private variables, and quite a few quick examples to show the difference.

34 Data Scoping – What’s shared
OpenMP uses a shared-memory programming model Shared variable - a variable that can be read or written by multiple threads The shared clause can be used to make items explicitly shared Global variables are shared by default among tasks File scope variables, namespace scope variables, and static variables are shared Variables with const-qualified type having no mutable member are shared Static variables which are declared in a scope inside the construct are shared Script: Data scoping: what's shared? OpenMP uses a shared-memory programming model rather than a message-passing model. As such, the way that threads communicate results to the master thread or to each other is through shared variables in shared memory. Consider two threads each computing a partial sum, with a master thread waiting to add the partial sums to get a grand total. If each thread only had its own local copy of data and could only manipulate its own data, there would be no way to communicate a partial sum to the master thread. So shared variables play a very important role on a shared-memory system. A shared variable is a variable whose name provides access to the same block of storage for each task region. What variables are considered shared by default? Global variables, variables with file scope, variables with namespace scope, and static variables are all shared by default. A variable can be made shared explicitly by adding the shared(list,…) clause to any omp construct. For the Fortran junkies: common blocks, SAVE variables, and module variables are all shared by default. So now that we know what's shared, let's look at what's private. Background (predetermined data-sharing attributes): Variables with automatic storage duration that are declared in a scope inside the construct are private. • Variables with heap allocated storage are shared. • Static data members are shared. • Variables with const-qualified type having no mutable member are shared. C/C++ • Static variables which are declared in a scope inside the construct are shared.

35 Data Scoping – What’s private
But, not everything is shared... Examples of implicitly determined private variables: Stack (local) variables in functions called from parallel regions are PRIVATE Automatic variables within a statement block are PRIVATE Loop iteration variables are private Implicitly determined private variables within tasks will be treated as firstprivate The firstprivate clause declares one or more list items to be private to a task and initializes each of them with the value the corresponding original variable had when the construct was encountered Script: While shared variables may be essential, they also have a serious drawback that we will explore in some future foils. Shared variables open up the possibility of data races or race conditions. A race condition is a situation in which multiple threads of execution are all updating a shared variable in an unsynchronized fashion, causing indeterminate results; the same program running on the same data may arrive at different answers in multiple trials of the program execution. This is a fancy way of saying that you could write a program that adds 1 plus 3 and sometimes gets 4, other times gets 1, and other times gets 3. One way to combat data races is by making use of copies of data in private variables for every thread. In order to use private variables, we ought to know a little about them. First of all, it is important to know that some variables are implicitly considered to be private. Other variables can be made private by explicitly declaring them so using the private clause. Some examples of implicitly determined private variables are: stack variables in functions called from parallel regions, automatic variables within a statement block, and loop iteration variables. Second, all variables that are implicitly determined to be private within a task are treated as firstprivate, meaning that they are given an initial value from their associated original variable. Let's look at some examples to see what this means. Background: • Variables appearing in threadprivate directives are threadprivate. • Variables with automatic storage duration that are declared in a scope inside the construct are private. • Variables with heap allocated storage are shared. • Static data members are shared. • The loop iteration variable(s) in the associated for-loop(s) of a for or parallel for construct is (are) private. • Variables with const-qualified type having no mutable member are shared. C/C++ • Static variables which are declared in a scope inside the construct are shared. private (list) Declares the scope of the data variables in list to be private to each thread. Data variables in list are separated by commas. Predetermined data-sharing attributes: for each private variable referenced in the structured block, a new version of the original variable (of the same type and size) is created in memory for each task that contains code associated with the directive. References to a private variable in the structured block refer to the current task's private version of the original variable. A private variable in a task region that eventually generates an inner nested parallel region is permitted to be made shared by implicit tasks in the inner parallel region. A private variable in a task region can be shared by an explicit task region generated during its execution. However, it is the programmer's responsibility to ensure through synchronization that the lifetime of the variable does not end before completion of the explicit task region sharing it.
Any other access by one task to the private variables of another task results in unspecified behavior.
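A small hedged sketch of the private/firstprivate distinction described above (the variable names and values are illustrative assumptions): the firstprivate copy starts out initialized from the original variable, a plain private copy would not, and the original is untouched after the region.

#include <stdio.h>
#include <omp.h>

void scoping_demo(void)
{
    int offset = 100;

    #pragma omp parallel firstprivate(offset)
    {
        /* Each thread gets its own copy of offset, initialized to 100;
           the update below never touches another thread's copy. */
        offset += omp_get_thread_num();
        printf("thread %d sees offset = %d\n", omp_get_thread_num(), offset);
    }

    printf("after the region, offset is still %d\n", offset);  /* prints 100 */
}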

36 A Data Environment Example
float A[10];
main () {
   int index[10];
   #pragma omp parallel
      Work (index);
   printf ("%d\n", index[1]);
}
extern float A[10];
void Work (int *index) {
   float temp[10];
   static int count;
   <...>
}
Which variables are shared and which are private? A, index, and count are shared by all threads, but temp is local to each thread Script: Let me ask the class: from what you have learned already, tell me which variables are shared and which are private. A[] is shared; it is a global variable. index is also shared: we pass a pointer to its first element into Work, so all threads share this array. count is also shared. Why? Because it is declared static, so all tasks share that same value. temp is private: it is created inside Work. Each task has code and data for Work(), and temp is a local variable within the function Work(). Let's look at some more examples.

37 Data Scoping Issue – fib example
int fib ( int n )
{
   int x,y;
   if ( n < 2 ) return n;
   #pragma omp task
      x = fib(n-1);
   #pragma omp task
      y = fib(n-2);
   #pragma omp taskwait
   return x+y;
}
n is private in both tasks x is a private variable y is a private variable What's wrong here? Can't use private variables outside of tasks Script: We will assume that the parallel region exists outside of fib, and that fib and the tasks inside it are in the dynamic extent of a parallel region. n is firstprivate in both tasks. Reason: stack variables in functions called from a parallel region are implicitly determined to be private, which means that within both task directives they are then treated as firstprivate. Do you see any issues here? 1st animation: what about x and y? They are definitely private within the tasks, BUT we want to use their values OUTSIDE the tasks. The problem with assigning values to x and y (which by default are private) is that the values are needed outside the task constructs, after the taskwait, where the private copies are not defined. So, to make this meaningful, we have to provide a mechanism to communicate the values of these variables to the statement after the taskwait. We have several strategies available to make this work, as we shall see on the following foils.

38 Data Scoping Example – fib example
int fib ( int n ) { int x,y; if ( n < 2 ) return n; #pragma omp task shared(x) x = fib(n-1); #pragma omp task shared(y) y = fib(n-2); #pragma omp taskwait return x+y; } n is private in both tasks Script: Good solution. In this case, we shared x & y so that their values will be available outside each task construct – after the taskwait. x & y are shared Good solution – we need both values to compute the sum
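Putting the pieces together, a minimal complete program might look like the sketch below; the enclosing parallel region and single construct are assumptions based on the script's note that the parallel region lives outside fib().

#include <stdio.h>

int fib (int n)
{
    int x, y;
    if (n < 2) return n;

    #pragma omp task shared(x)   // x must be shared so its value survives the task
    x = fib(n-1);

    #pragma omp task shared(y)   // likewise for y
    y = fib(n-2);

    #pragma omp taskwait         // wait for both child tasks before using x and y
    return x + y;
}

int main ()
{
    int result;
    #pragma omp parallel
    {
        #pragma omp single       // one thread kicks off the top-level call
        result = fib(10);
    }
    printf ("fib(10) = %d\n", result);
    return 0;
}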

39 Data Scoping Issue – List Traversal
List ml; //my_list Element *e; #pragma omp parallel #pragma omp single { for(e=ml->first;e;e=e->next) #pragma omp task process(e); } What’s wrong here? Script: e will be assumed shared here because even though it appears to be a local variable, it is defined outside the parallel region – that means this variable will be treated as shared by default to each task in the parallel region Since e is shared in the task region, we will have a race condition since each task (ie process() call in this case) will be accessing e and possibly updating the variable e. What we want is to have each process have its own private copy of e – we shall see strategies for how to do this in following foils Possible data race ! Shared variable e updated by multiple tasks

40 Data Scoping Example – List Traversal
List ml; //my_list Element *e; #pragma omp parallel #pragma omp single { for(e=ml->first;e;e=e->next) #pragma omp task firstprivate(e) process(e); } Good solution – e is firstprivate Script: Here – we made e explicitly firstprivate – overriding the default rules that made it shared from the previous foil. Now that it is firstprivate each task has its own copy of e

41 Data Scoping Example – List Traversal
List ml; //my_list Element *e; #pragma omp parallel #pragma omp single private(e) { for(e=ml->first;e;e=e->next) #pragma omp task process(e); } Good solution – e is private Script: This is another possible solution – by making e private on the single construct, e is private in the enclosing context and is therefore treated as firstprivate by default in the tasks, so each task gets its own copy of e

42 Data Scoping Example – List Traversal
List ml; //my_list #pragma omp parallel { Element *e; for(e=ml->first;e;e=e->next) #pragma omp task process(e); } Script: In this case, we have another good solution – e is declared within the parallel region, which makes it a local, private variable there, and a private variable of the enclosing context is treated as firstprivate by default in the tasks, so each task again gets its own copy of e. Now we are going to move on to look at synchronization constructs Good solution – e is private
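A self-contained sketch of this pattern is shown below. The Element and List types and the process() routine are assumptions added only so the example compiles, and a single construct is included (as in the earlier variants) so that only one thread walks the list and generates the tasks.

#include <stdio.h>

typedef struct Element {
    int value;
    struct Element *next;
} Element;

typedef struct List {
    Element *first;
} List;

void process (Element *e)        // each task handles exactly one list element
{
    printf ("processing %d\n", e->value);
}

void traverse (List *ml)
{
    #pragma omp parallel
    {
        #pragma omp single       // one thread walks the list...
        {
            Element *e;          // ...and e, being private here, is implicitly
            for (e = ml->first; e; e = e->next)     // firstprivate in each task
                #pragma omp task
                process (e);
        }
    }
}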

43 Agenda What is OpenMP? Optional Advanced topics Parallel regions
Worksharing Data environment Synchronization Optional Advanced topics Script: On to synchronization

44 Example: Dot Product What is Wrong?
float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for shared(sum) for(int i=0; i<N; i++) { sum += a[i] * b[i]; } return sum; } Script: Here we see a simple-minded dot product. We have used a parallel for construct with a shared clause. So what is the problem? Answer: Multiple threads modifying sum with no protection – this is a race condition! Let's talk about race conditions some more on the next slide What is Wrong?

45 Race Condition A race condition is nondeterministic behavior caused by the times at which two or more threads access a shared variable For example, suppose both Thread A and Thread B are executing the statement area += 4.0 / (1.0 + x*x); Script: A race condition is nondeterministic behavior caused by the times at which two or more threads access a shared variable. Let's look at the code snippet below area = area + 4/(1+x*x) If variable “area” is private, then the individual subtotals will be lost when the loop is exited. But if variable “area” is shared, then we could run into a race condition. So – we have a quandary. On the one hand I want area to be shared because I want to combine partial sums from thread A & thread B – on the other hand – I don’t want a data race. The next few slides will illustrate the problem.

46 Two Timings Order of thread execution causes
Correct ordering – area starts at 11.667; Thread A adds 3.765 and writes back 15.432; Thread B then adds 3.563 and writes back 18.995. Race ordering – area starts at 11.667; Thread A and Thread B both read 11.667 before either writes; Thread A writes back 15.432, but Thread B then writes back 15.230 (11.667 + 3.563), so Thread A’s update is lost. Script: If thread A references “area”, adds to its value, and assigns the new value to “area” before thread B references “area”, then everything is okay. However, if thread B accesses the old value of “area” before thread A writes back the updated value, then variable “area” ends up with the wrong total. The value added by thread A is lost. We see that in a data race, the order of execution of each thread can change the resulting calculation. Order of thread execution causes nondeterminate behavior in a data race

47 Protect Shared Data Must protect access to shared, modifiable data
float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for shared(sum) for(int i=0; i<N; i++) { #pragma omp critical sum += a[i] * b[i]; } return sum; } Script: To resolve this issue – let's consider using a pragma omp critical – also called a critical section. We’ll go over the critical section in more detail on the next slide, but for now let's get a feel for how it works. The critical section allows only one thread to enter it at a given time. A critical section “protects” the next immediate statement or code block – in this case – sum += a[i] * b[i]; Whichever thread, A or B, gets to the critical section first, that thread is guaranteed exclusive access to that protected code region (called a critical region). Once the thread leaves the critical section, the other thread is allowed to enter. This ability of the critical section to force threads to “take turns” is what prevents the race condition Let's look a little closer at the anatomy of an OpenMP* Critical Construct

48 OpenMP* Critical Construct
#pragma omp critical [(lock_name)] Defines a critical region on a structured block float RES; #pragma omp parallel { float B; #pragma omp for for(int i=0; i<niters; i++){ B = big_job(i); #pragma omp critical (RES_lock) consum (B, RES); } } Threads wait their turn – only one at a time calls consum() thereby protecting RES from race conditions Naming the critical construct RES_lock is optional Script: The OpenMP* Critical Construct simply defines a critical region on a structured code block. Threads wait their turn – only one at a time calls consum(), thereby protecting RES from race conditions. Naming the critical construct RES_lock is optional. With named critical regions, a thread waits at the start of a critical region identified by a given name until no other thread in the program is executing a critical region with that same name. Critical sections not specifically named by the omp critical directive invocation are mapped to the same unspecified name. Now let's talk about another kind of synchronization called a reduction Good Practice – Name all critical sections

49 OpenMP* Reduction Clause
reduction (op : list) The variables in “list” must be shared in the enclosing parallel region Inside parallel or work-sharing construct: A PRIVATE copy of each list variable is created and initialized depending on the “op” These copies are updated locally by threads At end of construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable Script: The OpenMP* reduction clause is used to combine an array of values into a single scalar value based on the “op” operation passed in the parameter list. The reduction clause is used for very common math operations on large quantities of data – called a reduction. A reduction “reduces” the data in a list or array down to a representative single value – a scalar value. For example, if I want to compute the sum of a list of numbers, I can “reduce” the list to the “sum” of the list. We’ll see a code example on the next foil. Before jumping to the next foil let's take care of some business. The variables in “list” must be shared in the enclosing parallel region. The reduction clause must be inside a parallel or work-sharing construct. The way it works internally is that: A PRIVATE copy of each list variable is created and initialized depending on the “op” These copies are updated locally by threads At end of construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable Now let's look at the example on the next foil

50 Reduction Example Local copy of sum for each thread
#pragma omp parallel for reduction(+:sum) for(i=0; i<N; i++) { sum += a[i] * b[i]; } Local copy of sum for each thread All local copies of sum added together and stored in “global” variable Script: In this example, we are computing the sum of the product of two vectors. This is a reduction operation because I am taking an array or list or vector full of numbers and boiling the information down to a scalar. To use the reduction clause – I note that the basic reduction operation is an addition and that the reduction variable is sum. I add the following pragma: #pragma omp parallel for reduction(+:sum) The reduction will now apply a reduction to the variable sum based on the + operation. Internally – each thread effectively has its own copy of sum in which to add up partial sums. After all partial sums have been computed, all the local copies of sum are added together and stored in a “global” variable called sum accessible to the other threads but without a race condition. Following are some of the valid math operations that reductions are designed for
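Applied to the dot product from the earlier slides, the reduction clause replaces the critical section; a minimal sketch:

float dot_prod (float* a, float* b, int N)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   // each thread accumulates a private
    for (int i = 0; i < N; i++) {               // partial sum; the partials are added
        sum += a[i] * b[i];                     // into the shared sum at the end
    }
    return sum;
}

Because threads no longer serialize on every update to sum, this form typically scales far better than the critical-section version.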

51 Numerical Integration Example
∫ from 0 to 1 of 4.0/(1+x²) dx = π f(x) = 4.0/(1+x²) [figure: plot of f(x) = 4.0/(1+x²) on 0 ≤ x ≤ 1] static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; for (i=0; i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf(“Pi = %f\n”,pi); } Script: Next we will examine a numerical integration example, where the goal is to compute the value of Pi. We’ll do this using the code for an integration. Animation 1 For the math geeks out there – in this example we are looking at the code to compute the value of Pi. It basically computes the integral from 0 to 1 of the function 4/(1+x^2). Some of you may remember from calculus that this integral evaluates to 4 * arctangent(x). The 4 * arctangent of x evaluated on the range 0-1 yields a value of Pi. Animation 2 To approximate the area under the curve – which approximates the integral – we will have small areas (many = num_steps) that we add up. Each area will be “step” wide. The height of the function will be approximated by 4/(1+x*x). Animation 3 The area will just be the sum of all these small areas. For the rest of us – we will trust that the math is right and just dig into the code. Here we have a loop that runs from 0 to num_steps. Inside the loop – we calculate the approximate location on the x axis of the small area; x = (i+0.5)*step. We calculate the height of the function at this value of x: 4/(1+x*x). Then we add the small area we just calculated to a running total of all the area we have computed up to this point. When the loop is done – we print the value of pi. So let's ask the big question – are there any race conditions in play here? Which variables should be marked private? Which variables should be marked shared? That is the subject of our next lab

52 C/C++ Reduction Operations
A range of associative operands can be used with reduction Initial values are the ones that make sense mathematically Operand / Initial Value: + : 0 * : 1 - : 0 ^ : 0 & : ~0 | : 0 && : 1 || : 0 Script: Below is a table of the C/C++ reduction operations. A range of associative operands can be used with reduction, including +, *, -, ^ (bitwise exclusive or), bitwise and, bitwise or, logical and, logical or. Initial values are the ones that make sense mathematically – for the sum example, we would want to start our sum naturally at 0. For a multiplication – we would want to start a product at 1. Let's look at a more complicated example

53 Activity 4 - Computing Pi
static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; for (i=0; i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf(“Pi = %f\n”,pi); Parallelize the numerical integration code using OpenMP What variables can be shared? What variables need to be private? What variables should be set up for reductions? Script: Parallelize the numerical integration code using OpenMP What variables can be shared? What variables need to be private? What variables should be set up for reductions? Please spend about 20 minutes doing lab activity 6 from the student workbook Instructor Note: This is a serial version of the source code. It does not use a “sum” variable that could give a clue to an efficient solution (i.e., local partial sum variable that is updated each loop iteration). This code is small and efficient in serial, but will challenge the students to come up with an efficient solution. Of course, efficiency is not one of the goals with such a short module. Getting the answer correct with multiple threads is enough of a goal for this. One answer: static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; #pragma omp parallel for private(i, x) reduction(+:sum) for (i=0; i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf(“Pi = %f\n”,pi);
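One possible solution, matching the version sketched in the instructor note: i is the loop index of the parallel for (and therefore already private), x is a per-iteration temporary that must be made private, and sum is combined with a + reduction.

#include <stdio.h>

static long num_steps = 100000;
double step, pi;

int main ()
{
    int i;
    double x, sum = 0.0;
    step = 1.0 / (double) num_steps;

    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;                // midpoint of the i-th strip
        sum = sum + 4.0 / (1.0 + x*x);       // height of the curve at x
    }

    pi = step * sum;
    printf ("Pi = %f\n", pi);
    return 0;
}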

54 Single Construct Denotes block of code to be executed by only one thread First thread to arrive is chosen Implicit barrier at end #pragma omp parallel { DoManyThings(); #pragma omp single { ExchangeBoundaries(); } // threads wait here for single DoManyMoreThings(); } Script: The single construct denotes a block of code to be executed by only one thread. First thread to arrive is chosen – there is an implicit barrier at the end where all threads wait for the single to complete. In this example, a team of threads is created at the omp parallel construct. All the threads execute DoManyThings() in parallel. One thread is chosen by the runtime to execute the omp single construct. The single thread executes ExchangeBoundaries(). All the other threads in the team wait at that implied barrier – the right curly brace next to the green comment line. When all threads have arrived at the barrier – then all the threads are released to perform the DoManyMoreThings() routine in parallel. The final white curly brace indicates the end of the parallel region

55 Master Construct Denotes block of code to be executed only by the master thread No implicit barrier at end #pragma omp parallel { DoManyThings(); #pragma omp master { // if not master skip to next stmt ExchangeBoundaries(); } DoManyMoreThings(); } Script: The master construct denotes a block of code to be executed by only one thread – the master thread. Unlike single, there is no implicit barrier at the end: threads that are not the master skip the block and continue immediately. The example above is otherwise identical to the omp single example – except that the master thread is the thread chosen to do the work, and the other threads do not wait for it

56 Implicit Barriers Several OpenMP* constructs have implicit barriers
Parallel – necessary barrier – cannot be removed For – optional barrier Single – optional barrier Unnecessary barriers hurt performance and can be removed with the nowait clause The nowait clause is applicable to: For construct Single construct Script: Several OpenMP* constructs have implicit barriers Parallel – necessary barrier – cannot be removed For – optional barrier Single – optional barrier Unnecessary barriers hurt performance and can be removed with the nowait clause The nowait clause is applicable to: For construct Single construct

57 Nowait Clause #pragma omp for nowait for(...) {...}; #pragma omp single nowait { [...] } Use when threads unnecessarily wait between independent computations #pragma omp for schedule(dynamic,1) nowait for(int i=0; i<n; i++) a[i] = bigFunc1(i); #pragma omp for schedule(dynamic,1) for(int j=0; j<m; j++) b[j] = bigFunc2(j); Script: The nowait clause can be used to remove unnecessary waits between independent calculations. In the example above – we see how to use the nowait clause in conjunction with an omp for loop (upper left example). In the upper right we see how to use the clause with the pragma omp single. In the center of the foil we see a more complicated example. In this example – we have two for loops executing back to back. Without the nowait clause – all the threads in the upper for loop would have to complete and meet at the implied barrier at the end of the first loop – BEFORE any threads could begin execution in the second loop. BUT With the nowait clause, some threads that complete the upper loop early – can actually go on through and begin calculation in the lower second loop – WITHOUT having to wait for the rest of the threads in the upper loop We've examined the nowait clause – now let's look at the explicit barrier construct
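A compilable sketch of the two-loop case follows; bigFunc1 and bigFunc2 are the placeholder names from the slide, given trivial bodies here, and the two worksharing loops are wrapped in one explicit parallel region. The nowait is safe only because the two loops touch independent arrays.

#include <math.h>

#define N 1000
#define M 1000
double a[N], b[M];

double bigFunc1 (int i) { return sqrt((double)i); }   // stand-in workloads
double bigFunc2 (int j) { return (double)j * j; }

void compute (void)
{
    #pragma omp parallel
    {
        #pragma omp for schedule(dynamic,1) nowait    // no barrier here: threads that
        for (int i = 0; i < N; i++)                   // finish early move straight on
            a[i] = bigFunc1(i);

        #pragma omp for schedule(dynamic,1)           // implicit barrier kept here
        for (int j = 0; j < M; j++)
            b[j] = bigFunc2(j);
    }
}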

58 Barrier Construct Explicit barrier synchronization
Each thread waits until all threads arrive #pragma omp parallel shared (A, B, C) { DoSomeWork(A,B); // Processed A into B #pragma omp barrier DoSomeWork(B,C); // Processed B into C } Script: The barrier construct is just a way for a user to explicitly place a barrier wherever they wish in a parallel region. Just as with implicit barriers, the barrier construct forces all threads to wait until all threads arrive. In the example above – all threads in the parallel region participate in DoSomeWork(A,B) – as they all complete DoSomeWork(A,B), they are forced to wait until all threads have completed their work – then ONCE ALL THREADS have met up at the omp barrier, all threads are allowed to do the second round of work in DoSomeWork(B,C). This is one way to guarantee that no threads are altering values of B in the upper section while other threads are consuming B in the lower section. That wraps up barriers – let's look at the atomic construct

59 Atomic Construct Special case of a critical section
Applies only to simple update of memory location #pragma omp parallel for shared(x, y, index, n) for (i = 0; i < n; i++) { #pragma omp atomic x[index[i]] += work1(i); y[i] += work2(i); } Script: The Atomic Construct can be considered to be a special case of a critical section with some limitations. The limitations are that it only applies to simple updates of memory locations. Since index[i] can be the same for different i values, the update to x must be protected. Use of a critical section would serialize updates to x. Atomic protects individual elements of the x array, so that if multiple, concurrent instances of index[i] are different, updates can still be done in parallel. That nearly concludes the class – we just need to take some time to talk about what kinds of courses or curriculums would be best served by this material and also to quickly fly by the advanced topics to just show what is left to the reader.

60 Agenda What is OpenMP? Optional Advanced topics Parallel regions
Worksharing Data environment Synchronization Optional Advanced topics Script: Lets do a quick flyby of the more advanced topics

61 Advanced Concepts

62 Parallel Construct – Implicit Task View
Tasks are created in OpenMP even without an explicit task directive. Let's look at how tasks are created implicitly for the code snippet below Thread encountering parallel construct packages up a set of implicit tasks Team of threads is created. Each thread in team is assigned to one of the tasks (and tied to it). Barrier holds original master thread until all implicit tasks are finished. #pragma omp parallel { mydata code } [diagram: threads 1–3 execute the implicit tasks; barrier at the end of the region] Script: In this animation – we will see how tasks are created implicitly by the omp parallel statement – without any explicit tasks involved. First animation First the master thread will cross the omp parallel construct – creating a pool or team of threads. The encountering thread (for this example I’ll say the Master thread) packages up a set of implicit tasks – containing code, data & ICVs. 2nd animation Then the runtime assigns threads to each task. 3rd animation Each thread gets tied to a particular task. The threads begin executing their assigned tasks. As each thread completes its task, it can be recycled by being tied to a new task. 4th animation The Master thread meanwhile, is held at the barrier (end of parallel region “}”) until all the implicit tasks are finished. Now we will look at the anatomy of the omp task construct

63 Task Construct #pragma omp task [clause[[,]clause] ...]
structured-block where clause can be one of: if (expression) untied shared (list) private (list) firstprivate (list) default( shared | none ) Script: The omp task construct should be placed inside a parallel region, and the task should encapsulate a structured block. The syntax is #pragma omp task [clause[[,]clause] ...] Explicit tasks are created following the same steps just described: a thread that encounters the task construct packages up an explicit task, and a thread in the team is assigned to it (and tied to it). If the task construct is enclosed inside a while loop or other loop structure, then each time the omp task construct is crossed a new instance of the task is created – its execution can initially be deferred, or it can be executed immediately. At the end of the parallel region there is an implied barrier – this barrier holds the original master thread until all explicit tasks are finished. Where clause can be one of the following clauses: if (expression) – a user-directed optimization that weighs the cost of deferring the task versus executing the task code immediately; it can be used to control cache and memory affinity. untied – specifies that the task created is untied. An untied task has no long-term association with any given thread; any thread not otherwise occupied is free to execute it, and the thread assigned to execute it may only change at a "task scheduling point". A tied task, by contrast, once its task region is suspended can be resumed only by the same thread that suspended it; that is, the task is tied to that thread. shared (list) – with respect to a given set of task regions that bind to the same parallel region, a variable whose name provides access to the same block of storage for each task region. private (list) – a variable whose name provides access to a different block of storage for each task region. firstprivate (list) – a variable whose name provides access to a different block of storage for each task region and whose value is initialized with the value of the original variable, as if there were an implied declaration within the statement block; variables are firstprivate in a task unless they are shared in the enclosing context. default( shared | none ) – specifies the default data-scoping rules for the task. Now let's look at untied versus tied tasks. Background Info: see the OpenMP spec or visit publib.boulder.ibm.com/infocenter for the definitions and usage. When the if clause argument is false, the task is executed immediately by the encountering thread; the data environment is still local to the new task, and it is still a different task with respect to synchronization. This is used to execute immediately (when the expression is false) when the cost of deferring the task is too great compared to the cost of executing the task code – this can aid with cache and memory affinity. A variable which is part of another variable (as an array or structure element) cannot be shared independently of the other components, except for static data members of C++ classes. When a thread encounters a task construct, a task is generated from the code for the associated structured block. The data environment of the task is created according to the data-sharing attribute clauses on the task construct and any defaults that apply. The encountering thread may immediately execute the task, or defer its execution; in the latter case, any thread in the team may be assigned the task. Completion of the task can be guaranteed using task synchronization constructs. A task construct may be nested inside an outer task, but the task region of the inner task is not a part of the task region of the outer task. When an if clause is present on a task construct and the if clause expression evaluates to false, the encountering thread must suspend the current task region and begin execution of the generated task immediately, and the suspended task region may not be resumed until the generated task is completed. The task still behaves as a distinct task region with respect to data environment, lock ownership, and synchronization constructs. Note that the use of a variable in an if clause expression of a task construct causes an implicit reference to the variable in all enclosing constructs. A thread that encounters a task scheduling point within the task region may temporarily suspend the task region. By default, a task is tied and its suspended task region can only be resumed by the thread that started its execution. If the untied clause is present on a task construct, any thread in the team can resume the task region after a suspension. The task construct includes a task scheduling point in the task region of its generating task, immediately following the generation of the explicit task. Each explicit task region includes a task scheduling point at its point of completion. An implementation may add task scheduling points anywhere in untied task regions.

64 Tied & Untied Tasks Tied Tasks: Untied Tasks:
A tied task gets a thread assigned to it at its first execution and the same thread services the task for its lifetime A thread executing a tied task can be suspended, and sent off to execute some other task, but eventually, the same thread will return to resume execution of its original tied task Tasks are tied unless explicitly declared untied Untied Tasks: An untied task has no long term association with any given thread. Any thread not otherwise occupied is free to execute an untied task. The thread assigned to execute an untied task may only change at a "task scheduling point". An untied task is created by appending “untied” to the task construct Example: #pragma omp task untied Script: By default, a task is created as a tied task, meaning that it gets a thread assigned to the task for the life of the task. A tied task’s thread is the only thread that can service the task – however – since there can be many fewer threads than tasks – the runtime may suspend the assigned task (let's call it task Z) and send its thread off for new duties (such as being assigned to new tasks – let's say task Y). When task Y’s execution reaches a scheduling point and when the runtime decides that it is time – the thread can be unassigned from task Y and given back task Z to resume computation. So a thread may service multiple tasks – but each tied task can be serviced only by the thread originally assigned to it. By contrast – an untied task has no long term association with any given thread. Any thread not otherwise occupied is free to execute an untied task. The thread assigned to execute an untied task may only change at a "task scheduling point". There can be performance benefits from untied tasks in that the untied task is more likely to get serviced by some idle thread sooner rather than later. On the other hand, especially on NUMA architectures, untied tasks could have a negative impact on performance if the random thread assigned to service my task is running on a remote processor with a remote cache. It is recommended to avoid using untied tasks unless the developer is willing to explore these performance subtleties (the actual performance difference may NOT be subtle but the concept underlying the issue may be subtle) Now let's have a look at explicit tasks
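Syntactically the only change is the untied clause on the task construct. A minimal sketch (process_item() and the item array are assumptions added for illustration):

#include <stdio.h>

static void process_item (double v) { printf ("%f\n", v); }

void walk (double *item, int n)
{
    #pragma omp parallel
    {
        #pragma omp single                    // one thread generates the tasks
        for (int i = 0; i < n; i++)
            #pragma omp task untied           // any idle thread may pick the task up,
            process_item (item[i]);           // or resume it after a suspension
    }
}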

65 Task switching task switching The act of a thread switching from the execution of one task to another task. The purpose of task switching is distribute threads among the unassigned tasks in the team to avoid piling up long queues of unassigned tasks Task switching, for tied tasks, can only occur at task scheduling points located within the following constructs encountered task constructs encountered taskwait constructs encountered barrier directives implicit barrier regions at the end of the tied task region Untied tasks have implementation dependent scheduling points Script: The speaker notes have a lot more detail on task switching which we covered lightly in class already Next foil Background In untied task regions, task scheduling points may occur at implementation defined points anywhere in the region. In tied task regions, task scheduling points may occur only in task, taskwait, explicit or implicit barrier constructs, and at the completion point of the task. From the OpenMP 3.0 Spec The following example demonstrates a way to generate a large number of tasks with one thread and execute them with the threads in the parallel team. While generating these tasks, the implementation may reach its limit on unassigned tasks. If it does, the implementation is allowed to cause the thread executing the task generating loop to suspend its task at the task scheduling point in the task directive, and start executing unassigned tasks. Once the number of unassigned tasks is sufficiently low, the thread may resume execution of the task generating loop. Example A.13.5c #define LARGE_NUMBER double item[LARGE_NUMBER]; extern void process(double); int main() { #pragma omp parallel { #pragma omp single int i; for (i=0; i<LARGE_NUMBER; i++) #pragma omp task // i is firstprivate, item is shared process(item[i]); } C/C++ Task Scheduling Whenever a thread reaches a task scheduling point, the implementation may cause it to perform a task switch, beginning or resuming execution of a different task bound to the current team. Task scheduling points are implied at the following locations: • the point immediately following the generation of an explicit task • after the last instruction of a task region • in taskwait regions • in implicit and explicit barrier regions. In addition, implementations may insert task scheduling points in untied tasks anywhere that they are not specifically prohibited in this specification. When a thread encounters a task scheduling point it may do one of the following, subject to the Task Scheduling Constraints (below): • begin execution of a tied task bound to the current team. • resume any suspended task region, bound to the current team, to which it is tied. • begin execution of an untied task bound to the current team. • resume any suspended untied task region bound to the current team. If more than one of the above choices is available, it is unspecified as to which will be chosen. Task Scheduling Constraints 1. An explicit task whose construct contained an if clause whose if clause expression evaluated to false is executed immediately after generation of the task. 2. Other scheduling of new tied tasks is constrained by the set of task regions that are currently tied to the thread, and that are not suspended in a barrier region. If this set is empty, any new tied task may be scheduled. Otherwise, a new tied task may be scheduled only if it is a descendant of every task in the set. A program relying on any other assumption about task scheduling is non-conforming. 
Note – Task scheduling points dynamically divide task regions into parts. Each part is executed uninterruptedly from start to end. Different parts of the same task region are executed in the order in which they are encountered. In the absence of task synchronization constructs, the order in which a thread executes parts of different schedulable tasks is unspecified. A correct program must behave correctly and consistently with all conceivable scheduling sequences that are compatible with the rules above.

66 Task switching example
The thread executing the “for loop”, AKA the generating task, generates many tasks in a short time so... The SINGLE generating task will have to suspend for a while when the “task pool” fills up Task switching is invoked to start draining the “pool” When the “pool” is sufficiently drained – then the single task can begin generating more tasks again Script: Here’s a task switching example Next foil Background The thread executing the SINGLE will have to suspend generating tasks at some point, because the "task pool" will fill up. At that point, the SINGLE thread will have to stop generating tasks. It is allowed to start executing some of the tasks in the task pool in order to "drain" it. Once it has drained the pool enough, it may return to generating tasks. Lots of tasks generated in a short time so The SINGLE generating task will have to suspend for a while when the “task pool” fills up Task switching is invoked to start draining the “pool” The thread executing SINGLE starts executing the queued-up tasks When the “pool” is sufficiently drained – then the single task can begin generating more tasks again The thread executing SINGLE starts generating more tasks again #pragma omp single { for (i=0; i<ONEZILLION; i++) #pragma omp task process(item[i]); }

67 Optional foil - OpenMP* API
Get the thread number within a team int omp_get_thread_num(void); Get the number of threads in a team int omp_get_num_threads(void); Usually not needed for OpenMP codes Can lead to code not being serially consistent Does have specific uses (debugging) Must include a header file #include <omp.h> Script: Here’s an API foil showing detail of how to use omp_get_thread_num Next foil Background
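A minimal sketch of the two most common calls:

#include <stdio.h>
#include <omp.h>

int main ()
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();          // this thread's number in the team
        int nthreads = omp_get_num_threads();   // size of the current team
        printf ("Hello from thread %d of %d\n", id, nthreads);
    }
    return 0;
}

Note that the output order changes from run to run, which is exactly the kind of serial inconsistency the slide warns about.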

68 Optional foil - Monte Carlo Pi
loop 1 to MAX x.coor=(random#) y.coor=(random#) dist=sqrt(x^2 + y^2) if (dist <= 1) hits=hits+1 pi = 4 * hits/MAX Script: Here’s a very cool optional lab that uses Math Kernel Library (MKL) to do a monte carlo approximation for Pi Next foil Background
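In C, a serial version of this pseudocode might look like the sketch below; rand() is used only for illustration here, since (as the next slide points out) the standard generator is not suitable for the parallel versions.

#include <stdio.h>
#include <stdlib.h>

int main ()
{
    const long MAX = 1000000;
    long hits = 0;

    srand (1);
    for (long i = 0; i < MAX; i++) {
        double x = (double) rand() / RAND_MAX;   // random point in the unit square
        double y = (double) rand() / RAND_MAX;
        if (x*x + y*y <= 1.0)                    // same test as dist = sqrt(x^2+y^2) <= 1
            hits++;
    }
    printf ("Pi ~= %f\n", 4.0 * (double) hits / (double) MAX);
    return 0;
}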

69 Optional foil - Making Monte Carlo’s Parallel
hits = 0 call SEED48(1) DO I = 1, max x = DRAND48() y = DRAND48() IF (SQRT(x*x + y*y) .LT. 1) THEN hits = hits+1 ENDIF END DO pi = REAL(hits)/REAL(max) * 4.0 Script: The big takeaway is that the random number generator is not thread-safe, so Monte Carlo codes that use it cannot be parallelized as written Next foil Background Making Monte Carlo’s Parallel The random number generator maintains a static variable (the seed). One way to make this parallel is to put each call of DRAND() in a critical section, which is a lot of overhead. What is the challenge here?
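Because the library generator keeps hidden seed state, one thread-safe alternative (separate from the Intel MKL VSL approach used in the lab that follows) is to give each thread its own seed and use a reentrant generator. A hedged sketch, assuming the POSIX rand_r() function is available:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main ()
{
    const long MAX = 1000000;
    long hits = 0;

    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num();   // per-thread seed: no shared state
        #pragma omp for
        for (long i = 0; i < MAX; i++) {
            double x = (double) rand_r(&seed) / RAND_MAX;
            double y = (double) rand_r(&seed) / RAND_MAX;
            if (x*x + y*y <= 1.0)
                hits++;
        }
    }
    printf ("Pi ~= %f\n", 4.0 * (double) hits / (double) MAX);
    return 0;
}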

70 Optional Activity 5: Computing Pi
Use the Intel® Math Kernel Library (Intel® MKL) VSL: Intel MKL’s VSL (Vector Statistics Libraries) VSL creates an array, rather than a single random number VSL can have multiple seeds (one for each thread) Objective: Use basic OpenMP* syntax to make Pi parallel Choose the best code to divide the task up Properly categorize all variables Script: Here’s a lab that uses Intel MKL’s VSL (Vector Statistics Libraries) to create a vector of random numbers all in parallel – rather than one at a time as was done with DRAND() Next foil Background

71 Firstprivate Clause Variables initialized from shared variable
C++ objects are copy-constructed incr=0; #pragma omp parallel for firstprivate(incr) for (I=0;I<=MAX;I++) { if ((I%2)==0) incr++; A[I]=incr; } Script: More detail on the firstprivate clause Next foil Background

72 Lastprivate Clause Variables update shared variable using value from last iteration C++ objects are updated as if by assignment void sq2(int n, double *lastterm) { double x; int i; #pragma omp parallel #pragma omp for lastprivate(x) for (i = 0; i < n; i++){ x = a[i]*a[i] + b[i]*b[i]; b[i] = sqrt(x); } *lastterm = x; } Script: More detail on the lastprivate clause Next foil Background

73 Threadprivate Clause Preserves global scope for per-thread storage
Legal for name-space-scope and file-scope Use copyin to initialize from master thread struct Astruct A; #pragma omp threadprivate(A) #pragma omp parallel copyin(A) do_something_to(&A); #pragma omp parallel do_something_else_to(&A); Private copies of “A” persist between regions Script: More detail on the threadprivate clause Next foil Background Acts somewhat like a static variable and somewhat like a private variable. It is private to each thread but its value persists between parallel regions
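A compilable sketch of the same idea using a plain int (the counter variable is a hypothetical example introduced only for illustration); persistence of the per-thread copies between the two regions assumes the thread count does not change:

#include <stdio.h>
#include <omp.h>

int counter = 0;
#pragma omp threadprivate(counter)        // every thread keeps its own persistent copy

int main ()
{
    counter = 42;                         // initialize the master thread's copy

    #pragma omp parallel copyin(counter)  // copyin: each copy starts from the master's 42
    counter += omp_get_thread_num();

    #pragma omp parallel                  // per-thread values persist into this region
    printf ("thread %d: counter = %d\n", omp_get_thread_num(), counter);

    return 0;
}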

74 20+ Library Routines Runtime environment routines:
Modify/check the number of threads omp_[set|get]_num_threads() omp_get_thread_num() omp_get_max_threads() Are we in a parallel region? omp_in_parallel() How many processors in the system? omp_get_num_procs() Explicit locks omp_[set|unset]_lock() And many more... Script: More detail on the API info Next foil Background

75 Library Routines To fix the number of threads used in a program
Set the number of threads Then save the number returned Request as many threads as you have processors. #include <omp.h> void main () { int num_threads; omp_set_num_threads (omp_get_num_procs ()); #pragma omp parallel { int id = omp_get_thread_num (); #pragma omp single num_threads = omp_get_num_threads (); do_lots_of_stuff (id); } } Protect this operation because memory stores are not atomic Script: More detail on the API info Next foil Background


77 BACKUP

