Tuesday, November 07, 2006
"If anything can go wrong, it will." - Murphy's Law
Shared Memory Model
§ A collection of processors, each with access to the same shared memory.
§ Processors can interact and synchronize with each other through shared variables.
Shared Memory Programming
§ It is possible to write parallel programs for multiprocessors using MPI.
§ But we can achieve better performance by using a programming model tailored for a shared-memory environment.
OpenMP
§ On shared-memory multiprocessors, memory is shared among the processors.
§ A directive-based OpenMP Application Program Interface (API) has been developed specifically for shared-memory parallel processing.
§ Directives assist the compiler in the parallelization of application codes.
§ In the past, almost all major manufacturers of high-performance shared-memory multiprocessor computers had their own sets of directives.
§ The functionalities and syntaxes of these directive sets varied among vendors.
Code portability…
§ To ensure code portability across shared-memory platforms, an independent organization, openmp.org, was established in 1997.
§ As a result, the OpenMP API came into being in 1997.
§ The primary benefit of using OpenMP is the relative ease of code parallelization made possible by the shared-memory architecture.
OpenMP
§ OpenMP has broad support from many major computer hardware and software manufacturers.
§ Similar to MPI's standing as the standard for distributed-memory parallel processing, OpenMP has emerged as the standard for shared-memory parallel computing.
(Figure: fork / join)
§ The standard view of parallelism in a shared-memory program is fork/join parallelism.
§ At the beginning of a program, only a single thread, called the master thread, is active.
§ At points where parallel operations are required, the master thread forks (creates or awakens) additional threads.
§ The master thread and the child threads work concurrently through the parallel section.
§ At the end of the parallel code, the child threads die or are suspended and the flow of control returns to the single master thread (join).
§ The number of active threads can change dynamically throughout the execution of the program.
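§ A minimal sketch of the fork/join model (assuming a C compiler with OpenMP support, e.g. gcc -fopenmp); the number of threads and the order of the output are not fixed:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("Before the fork: only the master thread is active\n");

    #pragma omp parallel                  /* fork: a team of threads starts here */
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                                     /* join: implicit barrier, team ends   */

    printf("After the join: only the master thread remains\n");
    return 0;
}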
Parallel for loops
§ Parallel operations are often expressed as loops.
§ With OpenMP it is easy to indicate when the iterations of a for loop may be executed in parallel.
for (i = first; i < size; i += prime)
    marked[i] = 1;
l No dependence between one iteration of the loop and another.
for (i = first; i < size; i += prime)
    marked[i] = 1;
l In OpenMP we will simply indicate that the iterations of the for loop may be executed in parallel.
l The compiler will take care of generating the code that forks/joins threads and schedules the iterations.
pragma
§ A compiler directive in C/C++ is called a pragma (pragmatic information).
§ A pragma is used to communicate information to the compiler.
pragma
§ The compiler may ignore that information and still generate a correct object program.
§ Information provided by a pragma can help the compiler optimize the program.
parallel for pragma
#pragma omp parallel for
§ Instructs the compiler to parallelize the for loop that immediately follows this directive.
parallel for pragma
#pragma omp parallel for
for (i = first; i < size; i += prime)
    marked[i] = 1;
parallel for pragma
#pragma omp parallel for
for (i = first; i < size; i += prime)
    marked[i] = 1;
§ The runtime system must have the information it needs to determine the number of iterations when it evaluates the control clause.
§ The for loop must not contain statements that allow the loop to be exited prematurely (i.e. break, return, exit, goto).
l continue is allowed, as in the sketch below.
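§ A small sketch of this restriction (the array contents are made up for illustration): continue only skips the rest of one iteration, so the trip count is still known in advance, whereas a break or return out of the loop would make it non-conforming.

#include <stdio.h>

int main(void)
{
    int i, a[100];

    for (i = 0; i < 100; i++)
        a[i] = i % 7;

    /* continue is allowed: the loop still executes a known number of iterations */
    /* (a break, return, exit or goto out of the loop would not be allowed)      */
    #pragma omp parallel for
    for (i = 0; i < 100; i++) {
        if (a[i] == 0)
            continue;                 /* skip multiples of 7 */
        a[i] = 2 * a[i];
    }

    printf("a[1] = %d, a[7] = %d\n", a[1], a[7]);   /* prints 2 and 0 */
    return 0;
}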
parallel for pragma
#pragma omp parallel for
for (i = first; i < size; i += prime)
    marked[i] = 1;
§ In the parallel for pragma, variables are by default shared, except the loop index, which is private.
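§ Putting the pieces together, a compilable sketch of the example above (the values of first, size and prime, and the size of marked, are made up for illustration); marked, first, size and prime are shared, while the loop index i is private by default.

#include <stdio.h>

int main(void)
{
    int marked[100] = {0};                 /* illustrative array               */
    int first = 4, size = 100, prime = 2;  /* made-up values for this sketch   */
    int i;

    /* the compiler and runtime fork the threads, split up the iterations, */
    /* and join the threads at the end of the loop                         */
    #pragma omp parallel for
    for (i = first; i < size; i += prime)
        marked[i] = 1;

    printf("marked[10] = %d, marked[11] = %d\n", marked[10], marked[11]);  /* 1, 0 */
    return 0;
}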
parallel for pragma

int b[3];
char *cptr;
int i;

cptr = malloc(1);                  /* malloc is declared in <stdlib.h> */

/* b and cptr are shared by all threads; the loop index i is private */
#pragma omp parallel for
for (i = 0; i < 3; i++)
    b[i] = i;
parallel for pragma

for (i = 2; i <= 5; i++)
    a[i] = a[i] + a[i-1];

§ Assume that the array a has been initialized with the integers 1 through 5.
parallel for pragma
§ Suppose we have 2 threads.
§ Assume the first thread (thread 0) is assigned loop indices 2 and 3.
§ The second thread (thread 1) is assigned indices 4 and 5.
parallel for pragma
§ One possible order of execution: thread 1 performs the computation for i=4, reading the value of a[3] before thread 0 has completed the computation for i=3, which updates a[3].
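§ A compilable sketch of this race (results will vary from run to run and with the number of threads); sequentially the loop produces 1 3 6 10 15, but the parallel version may not:

#include <stdio.h>

int main(void)
{
    int a[6] = {0, 1, 2, 3, 4, 5};    /* a[1]..a[5] hold the integers 1-5 */
    int i;

    /* WRONG on purpose: iteration i reads a[i-1], which iteration i-1 writes, */
    /* so with 2 or more threads the outcome depends on timing                 */
    #pragma omp parallel for
    for (i = 2; i <= 5; i++)
        a[i] = a[i] + a[i-1];

    for (i = 1; i <= 5; i++)
        printf("%d ", a[i]);          /* sequential answer: 1 3 6 10 15 */
    printf("\n");
    return 0;
}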
§ OpenMP will do what you tell it to.
§ If you parallelize a loop with a data dependency, it will give a wrong result.
§ The programmer is responsible for the correctness of the code.
parallel for pragma
§ The runtime system needs to know how many threads to create.
§ There are several ways to specify the number of threads to be used.
l One of these is to set the environment variable OMP_NUM_THREADS to the required number of threads.
parallel for pragma
§ Environment variable OMP_NUM_THREADS
In bash:
export OMP_NUM_THREADS=4
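§ Two other common ways are the omp_set_num_threads() library call and the num_threads clause; a minimal sketch (the precedence noted in the comments follows the OpenMP specification):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    omp_set_num_threads(4);               /* library call: overrides OMP_NUM_THREADS   */

    #pragma omp parallel num_threads(2)   /* clause: overrides both, this region only  */
    {
        #pragma omp single                /* let one thread report the team size       */
        printf("This parallel region has %d threads\n", omp_get_num_threads());
    }
    return 0;
}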
parallel for pragma
§ The loop indices are distributed among the specified number of threads.
§ The way in which the loop indices are distributed is known as the schedule.
parallel for pragma
§ In the "static" schedule, which is typically the default, each thread will get a chunk of indices of approximately equal size.
§ For example, if the loop goes from 1 to 100 and there are 3 threads:
l The first thread will process i=1 through i=34.
l The second thread will process i=35 through i=67.
l The third thread will process i=68 through i=100.
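§ A small sketch that makes the static distribution visible (the exact chunk boundaries are up to the implementation, but each thread gets one contiguous block of roughly equal size):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i;

    /* schedule(static) divides the 100 iterations into one nearly  */
    /* equal-sized contiguous chunk per thread, as in the example   */
    #pragma omp parallel for schedule(static)
    for (i = 1; i <= 100; i++)
        printf("i = %3d handled by thread %d\n", i, omp_get_thread_num());

    return 0;
}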
parallel for pragma
§ There is an implied barrier at the end of the loop.
l Each thread will wait at the end of the loop until all threads have reached that point before they continue.
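§ A sketch of why the implied barrier matters (the arrays b, c and the size N are made up for illustration): every element of b is guaranteed to be written before any thread starts the second loop, which reads b.

#include <stdio.h>

#define N 1000

int main(void)
{
    int b[N], c[N], i;

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        b[i] = i;                 /* all threads wait here (implied barrier) ... */

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        c[i] = 2 * b[i];          /* ... so every b[i] is ready before this loop */

    printf("c[%d] = %d\n", N - 1, c[N - 1]);   /* prints 1998 */
    return 0;
}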
§ A sequential program is a special case of a shared-memory parallel program (i.e. one with no forks/joins in it).
§ Directives, as well as OpenMP function calls, are treated as comments when OpenMP invocation is not preferred or not available during compilation; the code is then, in effect, a serial code.
§ This affords a unified code for both serial and parallel applications, which can ease code maintenance.
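§ A sketch of this unified-source idea: compiled without OpenMP (e.g. plain gcc, without -fopenmp), the pragma is ignored and the loop runs serially; the standard _OPENMP macro can guard calls to OpenMP library functions so the serial build still compiles and links.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>                  /* only pulled in when OpenMP is enabled */
#endif

int main(void)
{
    int i, a[100];

    /* without OpenMP this pragma is treated as a comment and the loop is serial */
    #pragma omp parallel for
    for (i = 0; i < 100; i++)
        a[i] = i * i;

#ifdef _OPENMP
    printf("Built with OpenMP, up to %d threads\n", omp_get_max_threads());
#else
    printf("Built as a plain serial program\n");
#endif
    printf("a[99] = %d\n", a[99]);    /* 9801 in either build */
    return 0;
}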
§ The shared-memory model supports incremental parallelization.
§ Incremental parallelization is the process of transforming a sequential program into a parallel program one block of code at a time.
§ Benefits of incremental parallelization?
§ Benefits of incremental parallelization
l Profile the execution of the sequential program.
l Sort the program blocks in terms of the time they consume.
l Parallelize the most time-consuming blocks first, checking the program's results after each step.
§ An overhead is incurred any time parallel threads are spawned, such as by the parallel for directive. This overhead is system dependent.
§ Therefore, when a short loop is parallelized, it will probably take longer to execute on multiple threads than on a single thread, since the overhead is greater than the time saved by parallelization.
"How long is long enough?"
§ The answer depends on the system and the loop under consideration.
§ As a very rough estimate: several thousand operations (total over all loop iterations, not per iteration).
§ There is only one way to know for sure: try parallelizing the loop, then time it and see whether it runs faster.
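§ One way to "time it and see", using omp_get_wtime() from the OpenMP runtime (the loop body and the size N are made up for illustration); comparing this time against a serial build, or against a run with OMP_NUM_THREADS=1, shows whether the loop is long enough to be worth parallelizing.

#include <stdio.h>
#include <omp.h>

#define N 10000000

static double a[N];               /* static so the large array is not on the stack */

int main(void)
{
    int i;
    double start = omp_get_wtime();

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 0.5 * i;

    printf("loop took %f seconds with up to %d threads\n",
           omp_get_wtime() - start, omp_get_max_threads());
    return 0;
}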