Introduction to OpenMP
For a more detailed tutorial, see the accompanying presentations.
Concepts
Directive-based programming
–declare properties of language structures (sections, loops)
–scope variables
A few service routines
–get information
Compiler options
Environment variables
OpenMP Programming Model
Fork-join parallelism
–Master thread spawns a team of threads as needed.
Typical OpenMP Use
Generally used to parallelize loops
–Find most time-consuming loops
–Split iterations up between threads

Serial version:

  void main() {
    double Res[1000];
    for (int i = 0; i < 1000; i++) {
      do_huge_comp(Res[i]);
    }
  }

Parallel version:

  void main() {
    double Res[1000];
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
      do_huge_comp(Res[i]);
    }
  }
Thread Interaction
OpenMP operates using shared memory
–Threads communicate via shared variables
Unintended sharing can lead to race conditions
–output changes due to thread scheduling
Control race conditions using synchronization
–synchronization is expensive
–change the way data is stored to minimize the need for synchronization
Syntax format
Compiler directives
–C/C++: #pragma omp construct [clause [clause] …]
–Fortran: C$OMP construct [clause [clause] …]
          !$OMP construct [clause [clause] …]
          *$OMP construct [clause [clause] …]
Because the directives look like comments (Fortran) or unknown pragmas (C/C++), no changes need to be made to a program for a compiler that doesn't support OpenMP.
Using OpenMP
Compilers can place directives automatically with an option
–-qsmp=auto
–xlf_r and xlc do a good job
–some loops may speed up, some may slow down
A compiler option is required when you insert directives yourself
–-qsmp=omp (IBM)
–-mp (SGI)
Can mix directives with automatic parallelization
–-qsmp=auto:omp
Scoping variables is the hard part!
–shared variables, thread-private variables
OpenMP Directives
5 categories
–Parallel Regions
–Worksharing
–Data Environment
–Synchronization
–Runtime functions / environment variables
Basically the same between C/C++ and Fortran
Parallel Regions
Create threads with omp parallel
Threads share A (default behavior)
Threads all start at the same time, then synchronize at a barrier at the end to continue with the serial code.

  double A[1000];
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int ID = omp_get_thread_num();
    dosomething(ID, A);
  }
Sections construct
The sections construct gives a different structured block to each thread
By default there is a barrier at the end; use the nowait clause to turn it off.

  #pragma omp parallel
  #pragma omp sections
  {
    X_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
  }
Work-sharing constructs
The for construct splits up loop iterations
By default, there is a barrier at the end of the "omp for". Use the "nowait" clause to turn off the barrier.

  #pragma omp parallel
  #pragma omp for
  for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
  }
Short-hand notation
Can combine parallel and work-sharing constructs
There is also a "parallel sections" construct

  #pragma omp parallel for
  for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
  }
A Rule
To be made parallel, a loop must have canonical "shape":

  for (index = start; index OP end; INCR)

where OP is one of <, <=, >=, > and INCR is one of:

  index++    ++index    index--    --index
  index += inc    index -= inc
  index = index + inc    index = inc + index    index = index - inc
An example

  #pragma omp parallel for private(j)
  for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
    for (j = 0; j < n; j++)
      a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);

By definition, private variable values are undefined at loop entry and exit.
To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses:

  x[0] = complex_function();
  #pragma omp parallel for private(j) firstprivate(x)
  for (i = 0; i < n; i++) {
    for (j = 0; j < m; j++)
      x[j] = g(i, x[j-1]);
    answer[i] = x[j] - x[i];
  }
Scheduling Iterations
The schedule clause affects how loop iterations are mapped onto threads
schedule(static [,chunk])
–Deal out blocks of iterations of size "chunk" to each thread.
schedule(dynamic [,chunk])
–Each thread grabs "chunk" iterations off a queue until all iterations have been handled.
schedule(guided [,chunk])
–Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
schedule(runtime)
–Schedule and chunk size taken from the OMP_SCHEDULE environment variable.
An example

  #pragma omp parallel for private(j) schedule(static, 2)
  for (i = 0; i < n; i++)
    for (j = 0; j < m; j++)
      x[j] = g(i, x[j-1]);

You can play with the chunk size to address load-balancing issues, etc.
Scheduling considerations
Dynamic is most general and provides load balancing
If the choice of scheduling has a (big) impact on performance, something is wrong:
–overhead too big => work in loop too small
n can be a specification expression, not just a constant
Reductions
Sometimes you want each thread to calculate part of a value, then collapse all the parts into a single value
Done with the reduction clause

  area = 0.0;
  #pragma omp parallel for private(x) reduction(+:area)
  for (i = 0; i < n; i++) {
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
  }
  pi = area / n;
Fortran Parallel Directives
PARALLEL / END PARALLEL
PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS
DO / END DO
–work-sharing directive for the DO loop immediately following
PARALLEL DO / END PARALLEL DO
–combined parallel region and work-sharing
Serial Directives
MASTER / END MASTER
–executed by master thread only
DO SERIAL / END DO SERIAL
–loop immediately following should not be parallelized
–useful with -qsmp=omp:auto
Synchronization Directives
BARRIER
–inside PARALLEL, all threads synchronize
CRITICAL (lock) / END CRITICAL (lock)
–section that can be executed by only one thread at a time
–lock is an optional name to distinguish several critical constructs from each other
An example

  double area, pi, x;
  int i, n;
  area = 0.0;
  #pragma omp parallel for private(x)
  for (i = 0; i < n; i++) {
    x = (i + 0.5)/n;
    #pragma omp critical
    area += 4.0/(1.0 + x*x);
  }
  pi = area / n;
Scope Rules
Shared memory programming model
–most variables are shared by default
Global variables are shared
But not everything is shared
–stack variables in functions are private
–a variable set and then used in a DO is PRIVATE
–an array whose subscript is constant w.r.t. the PARALLEL DO, and which is set and then used within the DO, is PRIVATE
Scope Clauses
The DO and for directives have extra clauses, the most important being
–PRIVATE (variable list)
–REDUCTION (op: variable list)
  op is sum, min, max
  variable is scalar; XLF allows arrays
Scope Clauses (2)
PARALLEL, PARALLEL DO, and PARALLEL SECTIONS also have
–DEFAULT (PRIVATE | SHARED | NONE)
  sets the default scope for variables not listed in other clauses
–SHARED (variable list)
–IF (scalar logical expression)
Directives are like a programming-language extension, not a compiler option
  integer i, j, n
  real*8 a(n,n), b(n)
  read (1) b
!$OMP PARALLEL DO PRIVATE (i,j) SHARED (a,b,n)
  do j = 1, n
    do i = 1, n
      a(i,j) = sqrt(1.d0 + b(j)*i)
    end do
  end do
!$OMP END PARALLEL DO
Matrix Multiply

!$OMP PARALLEL DO PRIVATE(i,j,k)
  do j = 1, n
    do i = 1, n
      do k = 1, n
        c(i,j) = c(i,j) + a(i,k) * b(k,j)
      end do
    end do
  end do
!$OMP END PARALLEL DO
Analysis
Outer loop is parallel: each thread handles different columns of c
Not optimal for cache use
Directives could be put on the inner loops as well
–but then the granularity might be too fine
OMP Functions
int omp_get_num_procs()
int omp_get_num_threads()
int omp_get_thread_num()
void omp_set_num_threads(int)