1
CONCURRENCY PLATFORMS
2
Introduction An alternative to these low-level do-it-yourself tools is the concurrency platform: software that coordinates, schedules, and manages multicore resources. Examples of concurrency platforms include the following.
3
First: Cilk++ Cilk++ is a language-extension programming tool. Cilk++ is suited for divide-and-conquer problems, where the problem can be divided into parallel independent tasks and the results can be combined afterward. As such, the programmer bears the responsibility of structuring the program to expose its inherent parallelism. Cilk's runtime system bears the responsibility of scheduling the computational tasks on the parallel processor system. The techniques we discuss in this book give the programmer insight into the alternative parallelism options available for a given algorithm.
4
The application developer can use a few keywords provided by Cilk++ to convert a standard serial program into a parallel program. A standard C++ program can be converted to a Cilk++ program running under Intel's Cilk++ software development kit (SDK) by taking these initial steps: 1. Ensure that the serial C++ program is bug free. 2. Rename the source file extension from .cpp to .cilk. 3. Add #include <cilk.h>. 4. Rename the main() function to cilk_main(). At this stage, the program still has no parallelism. The programmer must add a few keywords to the program, such as:
5
• cilk, which alerts the compiler that this is a parallel program;
• cilk_spawn, which creates a locally spawned function that can be executed in parallel with other tasks;
• cilk_sync, which forces the current thread to wait for all locally spawned functions to complete; thus, all cilk_spawn functions must finish before execution can continue past the cilk_sync. This is equivalent to the join operation in the pthread library; and
• cilk_for, which is a parallel version of the serial for loop statement.
6
Example

#include <iostream>
#include <cilk/cilk.h>
using namespace std;

static void hello()
{
    cout << "Hello ";
}

static void world()
{
    cout << "world!" << endl;
}

int main()
{
    cilk_spawn hello();
    cilk_spawn world();
    cilk_sync;
    cout << "Done!" << endl;
}
7
Listing 6.1 Pseudocode for the evaluation of Fibonacci numbers

The Cilk++ constructs discussed above specify logical parallelism in the program. The operating system will map the tasks into processes or threads and schedule them for execution. Listing 6.1 is the pseudocode for the Fibonacci algorithm implemented using Cilk:

1: int fib (int n)
2: {
3:    if n < 2 then
4:       return n;
5:    else
6:    {
7:       int x, y;
8:       x = cilk_spawn fib(n − 1);
9:       y = cilk_spawn fib(n − 2);
10:      cilk_sync;
11:      return (x + y);
12:   }
13:   end if
14: }

The keywords in lines 8 and 9 indicate that the fib() function calls can be done in parallel. The keyword in line 10 ensures that the add operation in line 11 is performed only after the two function calls in lines 8 and 9 have completed.
8
Cilk++ Parallel Loop: cilk_for
The syntax of the Cilk++ for loop is very similar to that of the C++ for loop:

cilk_for (i = start_value; i < end_value; i++)
{
    statement_1;
    statement_2;
    ...
}

The end-of-iteration comparison can be any of the usual relational operators: <, <=, !=, >=, or >.
9
cilk_for does not have a break statement for early exit from the loop. Cilk++ divides the iterations of the loop into chunks, where each chunk consists of a few iterations of the loop. An implied cilk_spawn statement creates a thread, or strand, for each chunk. Thus, the loop is parallelized, since the chunk strands will be executed in parallel using a work-stealing scheduler. The chunk size is called the grain size. If the grain size is large, parallelism is reduced since the number of chunks will be small. If the grain size is small, then the overhead of dealing with too many strands reduces performance. The programmer can override the default grain size through the compiler directive #pragma cilk grainsize = expression, where expression is any valid C++ expression that yields an integer value. The pragma should immediately precede the cilk_for loop.
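As a concrete sketch of the pragma just described: the function name scale below and the grain size of 64 are our own illustrative choices, not values from the slides.

#include <cilk/cilk.h>

void scale(double *a, int n, double factor)
{
    // Each chunk of 64 iterations becomes one strand (64 is chosen only for illustration).
    #pragma cilk grainsize = 64
    cilk_for (int i = 0; i < n; ++i)
        a[i] *= factor;
}

In practice the default grain size chosen by the runtime is usually adequate; the pragma is mainly useful when profiling shows the loop body is very small or very large.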
10
Data Races and Program Indeterminacy
A data race occurs when two threads attempt to access the same variable in memory and at least one of them performs a write operation. This is the problem of shared, or nonlocal, variables. Nonlocal variables are variables that are declared outside the scope in which they are used. A global variable is a nonlocal variable declared in the outermost scope of the program. It is hard to rewrite code so that it avoids nonlocal variables; the problem arises when a function call has side effects and changes a variable declared outside the function. The obvious solution is to use local variables by passing the variable as a parameter of the function, but this leads to functions with long argument lists. The real problem is that, with multicores, nonlocal variables lead to race bugs. Parallel processors that share variables must guard against race bugs that compromise data integrity.
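A small sketch of the side-effect problem just described (the names counter and record_event are hypothetical): the first version modifies a nonlocal variable, while the second passes the state explicitly as a parameter.

int counter = 0;                  // nonlocal (global) variable

void record_event()
{
    counter++;                    // side effect: modifies a variable declared outside the function
}

void record_event_local(int &state)
{
    state++;                      // the state is passed explicitly; no nonlocal access
}

In a serial program the first version is merely hard to reason about; in a parallel program it becomes a potential race bug.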
11
A simple race bug is a determinacy race. A program is deterministic if the output is the same for any multicore strand scheduling strategy. A strand is defined as a sequence of executed instructions containing no parallel control. On the other hand, a program is nondeterministic if it produces different results from run to run. Consider the following serial code, which we will then parallelize to expose a determinacy race:

1: #include <iostream>
2: using namespace std;
3: void swap (int &x, int &y);
4: int main()
5: {
6:    int x = 1, y = 10;
7:    swap(x, y);
8:    x = 2 * x;
9:    cout << "x = " << x << endl;
10:   cout << "y = " << y << endl;
11: }
12: void swap (int &x, int &y)
13: {
14:    int temp;
15:    temp = x;
16:    x = y;
17:    y = temp;
18: }

The output of the serial program is x = 20 and y = 1, because x and y are swapped first and then x is doubled, according to lines 7 and 8, respectively.
12
Now consider similar code executed on a parallel computing platform with the directive cilk_spawn:

1: #include <iostream>
2: #include <cilk/cilk.h>
3: using namespace std;
4: void swap (int &x, int &y);
5: int main()
6: {
7:    int x = 1, y = 10;
8:    cilk_spawn swap(x, y);
9:    x = 2 * x;
10:   cilk_sync;
11:   cout << "x = " << x << endl;
12:   cout << "y = " << y << endl;
13: }
14: void swap (int &x, int &y)
15: {
16:    int temp;
17:    temp = x;
18:    x = y;
19:    y = temp;
20: }

The parallel program has a race bug: the output might be x = 20 and y = 1 on one run and x = 10 and y = 2 on another. Figure 6.1 shows the breakdown of the parallel program into strands A, B, C, and D. Strand A begins at the start of the program and ends at the cilk_spawn statement. The cilk_spawn statement creates strands B and C. Strand B executes the statement x = 2 * x and strand C executes the swap(x, y); statement. Strand D begins after the cilk_sync statement and runs to the end of the program.
13
The race condition occurs because strands B and C both involve reading and writing the same variable x. This will most certainly lead to data inconsistency of the types discussed, such as

1. output dependencies: write after write (WAW),
2. antidependencies: write after read (WAR),
3. true data dependencies: read after write (RAW), and
4. procedural dependencies.

Figure 6.1 Splitting of a program into strands using the directive cilk_spawn and merging the strands using cilk_sync statements.
14
Any of the following race conditions could take place depending on the operating system:

• Strand B executes completely before strand C.
• Strand C executes completely before strand B.
• Strand B partially executes, then strand C starts.
• Strand C partially executes, then strand B starts.
15
cilk_for While cilk_spawn and cilk_sync are great for expressing the parallelism in a recursive algorithm, one of the simplest ways to parallelize a program is to identify a loop with no dependencies between its iterations and run the iterations in parallel. The cilk_for statement converts a simple for loop into a parallel for loop, that is, one where the iterations of the loop body can be executed in parallel. Consider the following loop:

for (int i = 0; i < 8; ++i)
{
    do_work(i);
}

An obvious way to parallelize this would be to add the cilk_spawn keyword to the call to do_work():

for (int i = 0; i < 8; ++i)
{
    cilk_spawn do_work(i);
}
cilk_sync;
16
A better approach is to use a cilk_for loop:

cilk_for (int i = 0; i < 8; ++i)
{
    do_work(i);
}

The Intel Cilk Plus compiler and runtime cooperate to divide the work of the loop in half, and then divide it in half again, until there are enough pieces to keep the cores busy while minimizing the overhead imposed by cilk_spawn. Like the recursive implementation of fib() above, this efficiently spreads the work across the available cores and minimizes steals.
17
#include <iostream>
#include <cilk/cilk.h>
using namespace std;

int main()
{
    int sum = 0;
    cilk_for (int i = 0; i <= 10000; i++)
        sum += i;
    cout << sum << "\n";
}

Note that the program above will return a different answer almost every time. That is because it creates a race condition, which we will discuss and solve later in the tutorial. There are certain restrictions on cilk_for that you should take into account. First, you cannot change the loop control variable in the loop body, so the following is illegal:

cilk_for (int i = 0; i <= 10000; i++)
    i = someFunction();

Second, in C++ you cannot declare the loop control variable outside the loop, as opposed to C. More precisely, the following code will not work in C++:

int i = 0;
cilk_for (i = 0; i <= 10000; i++)
    //work
18
Locks Recall for a second our previous example, in which we sum up the first 10,000 integers. Whenever we run it we get a different result because of a race condition. One way to solve this problem is to use locks. Locks are synchronization mechanisms that prevent multiple threads from changing a variable concurrently. Thus, locks help to eliminate data races.

//Run the code with the '-ltbb' compiler flag, to allow mutexes to work
#include <iostream>
#include <cilk/cilk.h>
#include <tbb/mutex.h>  //mutex library
using namespace std;

int main()
{
    int sum = 0;
    tbb::mutex m;  //define the lock
    cilk_for (int i = 0; i <= 10000; i++)
    {
        m.lock();    //lock - prevents other threads from running this code
        sum += i;
        m.unlock();  //unlock - allows other threads to access this code
    }
    cout << sum << "\n";
}
19
Even though locks are a solution to data races, there are a few things that can go wrong. First, deadlock might occur, which is when threads wait on each other indefinitely and none can proceed. Second, since the threads have to wait on each other, the locked part of the code is serialized, causing performance issues. In Cilk™, the constructs that solve most of the issues associated with locks are called reducers, which we discuss in the next section.
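To make the deadlock scenario concrete, here is a minimal sketch under our own assumptions (the task names task_a and task_b and the lock order are invented for illustration; the TBB mutexes are the same kind used in the locks example above): two strands acquire the same two locks in opposite orders, so each can end up holding one lock while waiting forever for the other.

#include <cilk/cilk.h>
#include <tbb/mutex.h>

tbb::mutex m1, m2;

static void task_a()
{
    m1.lock();      // strand A holds m1 ...
    m2.lock();      // ... and waits for m2
    m2.unlock();
    m1.unlock();
}

static void task_b()
{
    m2.lock();      // strand B holds m2 ...
    m1.lock();      // ... and waits for m1 - neither strand can proceed
    m1.unlock();
    m2.unlock();
}

int main()
{
    cilk_spawn task_a();
    task_b();
    cilk_sync;      // may never be reached if the deadlock occurs
}

The usual remedy is to acquire locks in a fixed global order, or to avoid holding more than one lock at a time.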
20
Reducers You have seen how locks can be used to solve data races, but some of the problems associated with them can make them a poor solution. The better solution in Cilk™ is reducers. By definition, a reducer is a variable that can be safely used by multiple threads running in parallel. The runtime ensures that each thread has access to a private copy of the variable, eliminating the possibility of races without requiring locks. When the threads synchronize, the reducer copies are merged (or reduced) into a single variable. The runtime creates copies only when needed, minimizing overhead. Getting back to our summation example, where we add up the first 10,000 integers, take a look below at the reducer solution for the race condition problem:
21
#include <stdio.h>
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>  //needed to use the addition reducer

int main()
{
    cilk::reducer_opadd<int> sum;  //define sum as a reducer with an int value
    cilk_for (int i = 0; i <= 10000; i++)
        sum += i;
    printf("%d\n", sum.get_value());  //notice that sum is now an object
}

The first thing you need to do in order to use a reducer is to include the Cilk reducer library that best fits the needs of your program; there are multiple types of reducers: for mathematical operations, for strings, for determining minimum and maximum values of a list, and so on. For the full list of reducers, check out the Intel® Cilk™ documentation. In our case, we need a reducer for a summation, so we include the reducer_opadd.h library. Next, define the variable susceptible to a race condition as a reducer. Once the operation is complete, retrieve the final value of the computation by calling the get_value() function on the reducer (reducers are C++ hyperobjects, which is why this section is dedicated to C++ only).
22
Second: OpenMP OpenMP is a concurrency platform for multithreaded, shared-memory parallel processing in C, C++, and Fortran. By using OpenMP, the programmer is able to incrementally parallelize the program with little programming effort. The programmer manually inserts compiler directives that assist the compiler in generating threads for the parallel processor platform. The user does not need to create the threads or worry about the tasks assigned to each thread. In that sense, OpenMP is a higher-level programming model compared with pthreads in the POSIX library. At the current state of the art, there is something to be gained from manual parallelization: automatic parallelizing compilers cannot compete with a hand-coded parallel program. OpenMP uses three types of constructs to control the parallelization of a program: 1. Compiler directives 2. Runtime library routines 3. Environment variables
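As a minimal sketch of how the three kinds of constructs interact (the thread count of 4 is an arbitrary choice for illustration): a runtime library routine requests a number of threads, a compiler directive forks the team, and the same setting could instead come from the OMP_NUM_THREADS environment variable.

#include <omp.h>
#include <stdio.h>

int main()
{
    omp_set_num_threads(4);   /* runtime library routine; OMP_NUM_THREADS would also work */

    #pragma omp parallel      /* compiler directive: fork a team of threads */
    {
        printf("Thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}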
23
To compile an OpenMP program with gcc, one would issue the command gcc -fopenmp file.c -o file.
Listing 6.4 The following pseudocode is a sketch of how OpenMP parallelizes a serial code:

1: #include <omp.h>
2: main () {
3:    int var1, var2, var3;
4:    Serial code executed by master thread
5:
6:    #pragma omp parallel private(var1, var2) shared(var3)
7:    {
8:       Parallel section executed by all threads
9:
10:   }
11:   Resume serial code
12: }

Line 1 is an include file that defines the functions used by OpenMP. Lines 2–5 are serial code, just as in any C or C++ program. Line 6 is an OpenMP compiler directive instructing the compiler to parallelize the lines of code enclosed by the curly brackets spanning lines 7–10. The directive forks a team of threads and specifies variable scoping: some variables are private to each thread, and some are shared between the threads. Another name for a compiler directive is pragma. Line 7 is the start of the parallel code block, indicated by the left curly bracket. The code block is duplicated, and all newly forked threads execute that code in parallel. Line 8 is the start of the parallel section instructions. Line 10 is the end of the parallel code block, indicated by the right curly bracket; there, all threads join the master thread and disband. Lines 11–12 are the start of another serial code block.
24
Figure 6.2 shows how a serial single-thread code is broken up into multiple threads.
Figure 6.2a shows the original serial code, composed of several code sections as indicated by the numbered blocks. Also indicated on the figure are the compiler directives, manually inserted by the programmer at the start of a group of code sections, instructing the compiler to fork threads at that point. Figure 6.2b shows how the compiler forks as many threads as required to parallelize each code section that follows each compiler fork directive. A join synchronization compiler directive ensures that the program resumes only after the parallel threads have finished executing their tasks. There is a master thread, indicated by the solid thick line, which forks the other threads. Each thread is identified by an "ID" integer, and the master thread has an ID value of "0".
26
OpenMP consists of the following major components:
• Compiler directives instructing the compiler on how to parallelize the code
• Runtime library functions to modify and check the number of threads and to check how many processors there are in the multiprocessor system
• Environment variables to alter the execution of OpenMP applications

Like Cilk++, OpenMP does not require restructuring the serial program. The user only needs to add compiler directives to reconstruct the serial program into a parallel one.
27
Example OpenMP Hello World
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    int nthreads, tid;

    /* Fork a team of threads, giving them their own copies of the variables */
    #pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */
}
28
Open MP Compiler Directives
The user tells the compiler to recognize OpenMP commands by adding -omp on the cc command line. Compiler directives allow the programmer to instruct the compiler on issues of thread creation, workload distribution, data management, and thread synchronization. The format for an OpenMP compiler directive is

#pragma omp directive_name [clause, · · · ] newline

Notice that each directive can have a collection of clauses. Table 6.1 summarizes some of the OpenMP pragma directives. Listing 6.5 The following code fragment shows how the #pragma omp parallel compiler directive is used to fork additional threads to execute the tasks specified by the affected code section:
29
#pragma omp parallel default(shared) private(a, b)
{
    // The code between the brackets will run in parallel
    statement 1;
    statement 2;
    statement 3;
}
30
Compiler Directive Clauses
Some of the compiler directives use one or more clauses. The order in which clauses are written is not important, and most clauses accept a comma-separated list of items. Clauses address different aspects of a directive: some control data sharing among the threads, while others deal with copying a private variable's value from one thread to the corresponding variable in another thread.
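The short sketch below illustrates the most common data-sharing clauses on a parallel directive; the variable names a, b, and c are arbitrary, and firstprivate is shown as an example of a clause that copies data into the threads' private copies.

#include <omp.h>
#include <stdio.h>

int main()
{
    int a = 1, b = 2, c = 3;

    #pragma omp parallel private(a) firstprivate(b) shared(c)
    {
        /* a: each thread gets its own uninitialized private copy             */
        /* b: each thread gets a private copy initialized to 2 (data copying) */
        /* c: a single shared variable visible to all threads                 */
        printf("thread %d: b = %d, c = %d\n", omp_get_thread_num(), b, c);
    }
    return 0;
}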
32
OpenMP Work Sharing The work-sharing directives control which threads execute which statements. These directives do not fork new threads. The two directives are #pragma omp for and #pragma omp sections. We discuss these two directives in the following sections.
33
Example of Parallelizing a Loop
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#define N 1000  /* array size; the value was omitted on the original slide */

int main (int argc, char *argv[])
{
    int nthreads, tid, i;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i;

    #pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0)
        {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for
        for (i = 0; i < N; i++)
        {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
        }
    } /* end of parallel section */
}
34
Loop Directive: for Most parallel algorithms contain for loops, and we dedicate this section to the compiler directive related to for loops. The format of the for compiler directive is #pragma omp for [clause · · · ] newline. There are several clauses associated with the for compiler directive, as shown in Table 6.2. When the schedule clause is schedule(static, 3), iterations are divided into pieces of size 3 and are assigned to threads in a round-robin fashion ordered by the thread number. When the schedule clause is schedule(dynamic, 3), iterations are divided into pieces of size 3 and are assigned to the next available thread; when a thread completes its chunk, it looks for the next available chunk.
35
schedule(static, 3):

#pragma omp for schedule(static, 3)
for (i=0; i<N; i++)
{
    c[i] = a[i] + b[i];
    printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}

schedule(dynamic, 3):

#pragma omp for schedule(dynamic, 3)
for (i=0; i<N; i++)
{
    c[i] = a[i] + b[i];
    printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
}
36
https://www.dartmouth.edu/~rc/classes/intro_openmp/OpenMP_Clauses
37
Restrictions to the for directive are as follows:
• The for loop must be a structured block, and, in addition, its execution must not be terminated by a break statement.
• The values of the loop control expressions of the for loop associated with a for directive must be the same for all the threads in the team.
• The for loop iteration variable must have a signed integer type.
• Only a single schedule clause can appear on a for directive.
• Only a single ordered clause can appear on a for directive.
• Only a single nowait clause can appear on a for directive.
• It is unspecified if or how often any side effects within the chunk_size, lb, b, or incr expressions occur.
• The value of the chunk_size expression must be the same for all threads in the team.
38
Using the nowait Clause
If there are multiple independent loops within a parallel region, you can use the nowait clause to avoid the implied barrier at the end of the for directive, as follows:

#pragma omp parallel
{
    #pragma omp for nowait
    for (i=1; i<n; i++)
        b[i] = (a[i] + a[i-1]) / 2.0;

    #pragma omp for nowait
    for (i=0; i<m; i++)
        y[i] = sqrt(z[i]);
}
39
ordered: used when part of the loop must execute in serial order
The ordered clause plus an ordered directive:

/* C/C++ example */
#pragma omp parallel for private(myval) ordered
for (i=1; i<=n; i++)
{
    myval = do_lots_of_work(i);
    #pragma omp ordered
    {
        printf("%d %d\n", i, myval);
    }
}
40
Reduction Operations How reduction works:
• sum is the reduction variable.
• It cannot be declared shared, because the threads would overwrite the value of sum.
• It cannot be declared private, because private variables do not persist outside the parallel region.
• The specified reduction operation is performed on the individual values from each thread.

Each reduction operator has a defined initial value for the per-thread copies:

Operator   Initial value
+          0
*          1
-          0
&          ~0
|          0
^          0
&&         1
||         0

/* C/C++ Example */
for (i=1; i<=n; i++)
{
    sum = sum + a[i];
}
41
Implementation How does OpenMP parallelize a for loop declared with a reduction clause? OpenMP creates a team of threads and then shares the iterations of the for loop between the threads. Each thread has its own local copy of the reduction variable. The thread modifies only the local copy of this variable; therefore, there is no data race. When the threads join together, all the local copies of the reduction variable are combined into the global shared variable. For example, let us parallelize the following for loop with three threads in the team. Each thread has sumloc, a local copy of the reduction variable. The threads then perform the following computations:

sum = 0;
#pragma omp parallel for shared(a) reduction(+: sum)
for (auto i = 0; i < 9; i++)
{
    sum += a[i];
}
42
Thread 1: sumloc_1 = a[0] + a[1] + a[2]
Thread 2: sumloc_2 = a[3] + a[4] + a[5]
Thread 3: sumloc_3 = a[6] + a[7] + a[8]

In the end, when the threads join together, OpenMP reduces the local copies into the shared reduction variable:

sum = sumloc_1 + sumloc_2 + sumloc_3
43
// OMP_reduction.cpp : Defines the entry point for the console application.
#include <stdio.h>
#include <omp.h>

int main()
{
    int a[9];
    for (int i = 0; i < 9; i++)
        a[i] = i;

    int sum = 0;
    #pragma omp parallel for shared(a) reduction(+: sum)
    for (int i = 0; i < 9; i++)
    {
        printf("This is Thread %d\n", omp_get_thread_num());
        sum += a[i];
    }
    printf("Sum = %d\n", sum);
    return 0;
}
44
Atomic Directive Atomicity means that something is inseparable: an event either happens completely or it does not happen at all, and another thread cannot intervene during the execution of the event. The directive has the form

#pragma omp atomic
expression

// omp_atomic.cpp
// compile with: /openmp
#include <stdio.h>
#include <omp.h>
#define MAX 10

int main()
{
    int count = 0;
    #pragma omp parallel num_threads(MAX)
    {
        #pragma omp atomic
        count++;
    }
    printf_s("Number of threads: %d\n", count);
}

Output: Number of threads: 10
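As a small sketch connecting this back to the earlier summation example (this variant is ours, not from the slides), the atomic directive can also remove the race on sum without a reduction clause, at the cost of serializing each update:

#include <stdio.h>
#include <omp.h>

int main()
{
    int sum = 0;
    #pragma omp parallel for
    for (int i = 0; i <= 10000; i++)
    {
        #pragma omp atomic
        sum += i;      /* each update happens atomically, so there is no data race */
    }
    printf("%d\n", sum);
}

In practice the reduction clause is usually preferred, since repeated atomic updates of a single shared variable can become a bottleneck.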
45
Thread Control Barrier Each thread waits at the barrier until all threads reach the barrier.

/* C/C++ Example */
#pragma omp parallel private(myid, istart, iend)
{
    myrange(myid, nthreads, &istart, &iend);
    for (i=istart; i<=iend; i++)
    {
        a[i] = a[i] - b[i];
    }
    #pragma omp barrier
    dowork(a);
}
46
Thread Control (continued)
Master A section of code that runs only on the master thread (the thread with rank = 0).

/* C/C++ example */
#pragma omp parallel private(myid, istart, iend)
{
    myrange(myid, nthreads, global_start, global_end, &istart, &iend);
    for (i=istart; i<=iend; i++)
    {
        a[i] = b[i];
    }
    #pragma omp barrier
    #pragma omp master
    {
        n = global_end - global_start + 1;
        write_size = fwrite(a, 1, n, file_pointer);
    }
    do_work(istart, iend);
}

Single Similar to master, except that the region runs on the first thread to reach it (not necessarily the master).
47
Thread Control (continued)
Single The following example demonstrates the single construct. In the example, only one thread prints each of the progress messages. All other threads will skip the single region and stop at the barrier at the end of the single construct until all threads in the team have reached the barrier. If other threads can proceed without waiting for the thread executing the single region, a nowait clause can be specified, as is done in the third single construct in this example. The user must not make any assumptions as to which thread will execute a single region.
48
#include <stdio.h>

void work1() {}
void work2() {}

void single_example()
{
    #pragma omp parallel
    {
        #pragma omp single
        printf("Beginning work1.\n");

        work1();

        #pragma omp single
        printf("Finishing work1.\n");

        #pragma omp single nowait
        printf("Finished work1 and beginning work2.\n");

        work2();
    }
}
50
Thread Control (continued)
Critical Only one thread executes a specified section of the code at a time. Threads can enter the critical section in any order; this is similar to the ordered directive, except that ordered forces the threads to proceed in numerical order.

/* C/C++ Example */
the_max = 0.0;
#pragma omp parallel private(myid, istart, iend)
{
    myrange(myid, nthreads, global_start, global_end, &istart, &iend);
    nvals = iend - istart + 1;
    compute_a(a[istart], nvals);
    #pragma omp critical
    {
        the_max = max( maxval(a[istart], nvals), the_max );
    }
    more_work_on_a(a);
}
51
Thread Control (continued)
Sections/Section Each section is a block of code that is run by only one thread, and the different sections are performed in parallel. In the following example, the routines init_field and check_grid can be executed concurrently.

#pragma omp parallel
{
    #pragma omp sections
    {
        #pragma omp section
        init_field(field);

        #pragma omp section
        check_grid(grid);
    }
}
54
Parallel Regions Example
#include <stdio.h>
#include <omp.h>
#define MAX_THREADS 4

static long num_steps = 100000000;  /* value omitted on the slide; 100000000 assumed */
double step;

int main ()
{
    int i, j;
    double pi, full_sum = 0.0;
    double start_time, run_time;
    double sum[MAX_THREADS];

    step = 1.0/(double) num_steps;

    for (j=1; j<=MAX_THREADS; j++)
    {
        omp_set_num_threads(j);
        full_sum = 0.0;
        start_time = omp_get_wtime();

        #pragma omp parallel
        {
            int i;
            int id = omp_get_thread_num();
            int numthreads = omp_get_num_threads();
            double x;

            sum[id] = 0.0;
            if (id == 0) printf(" num_threads = %d", numthreads);

            for (i=id; i < num_steps; i += numthreads)
            {
                x = (i+0.5)*step;
                sum[id] = sum[id] + 4.0/(1.0+x*x);
            }
        }

        for (full_sum = 0.0, i=0; i<j; i++)
            full_sum += sum[i];

        pi = step * full_sum;
        run_time = omp_get_wtime() - start_time;
        printf("\n pi is %f in %f seconds %d threads \n", pi, run_time, j);
    }
}
55
The End