Lecture 8: OpenMP
Parallel Programming Models Parallel Programming Models: Data parallelism / Task parallelism Explicit parallelism / Implicit parallelism Shared memory / Distributed memory Other programming paradigms Object-oriented Functional and logic
Parallel Programming Models Shared Memory The programmer’s task is to specify the activities of a set of processes that communicate by reading and writing shared memory. Advantage: the programmer need not be concerned with data-distribution issues. Disadvantage: performance implementations may be difficult on computers that lack hardware support for shared memory, and race conditions tend to arise more easily Distributed Memory Processes have only local memory and must use some other mechanism (e.g., message passing or remote procedure call) to exchange information. Advantage: programmers have explicit control over data distribution and communication.
Shared vs Distributed Memory Shared memory Distributed memory Memory Bus PPPP PPPP MMMM Network
Parallel Programming Models Parallel Programming Tools: Parallel Virtual Machine (PVM) Distributed memory, explicit parallelism Message-Passing Interface (MPI) Distributed memory, explicit parallelism PThreads Shared memory, explicit parallelism OpenMP Shared memory, explicit parallelism High-Performance Fortran (HPF) Implicit parallelism Parallelizing Compilers Implicit parallelism
Parallel Programming Models Shared Memory Model Used on Shared memory MIMD architectures Program consists of many independent threads Concurrently executing threads all share a single, common address space. Threads can exchange information by reading and writing to memory using normal variable assignment operations
Parallel Programming Models Memory Coherence Problem To ensure that the latest value of a variable updated in one thread is used when that same variable is accessed in another thread. Hardware support and compiler support are required Cache-coherency protocol Thread 1Thread 2 X
Parallel Programming Models Distributed Shared Memory (DSM) Systems Implement Shared memory model on Distributed memory MIMD architectures Concurrently executing threads all share a single, common address space. Threads can exchange information by reading and writing to memory using normal variable assignment operations Use a message-passing layer as the means for communicating updated values throughout the system.
Parallel Programming Models Synchronization operations in Shared Memory Model Monitors Locks Critical sections Condition variables Semaphores Barriers
OpenMP
OpenMP Shared-memory programming model Thread-based parallelism Fork/Join model Compiler directive based No support for parallel I/O
OpenMP Master thread executes the sequential sections. Master thread forks additional threads. At the end of the parallel code, the created threads die and the control returns to the master thread (join). Master Threads Fork Join Fork Join
OpenMP General Code Structure #include main () { int var1, var2, var3; Serial code... Fork a team of threads. Specify variable scoping #pragma omp parallel private(var1, var2) shared(var3) { Parallel section executed by all threads... All threads join master thread and disband } Resume serial code... }
OpenMP General Code Structure #include main () { int nthreads, tid; #pragma omp parallel private(tid) /* Fork a team of threads */ { tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); if (tid == 0) {/* master thread */ nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); } } /* All threads join master thread and terminate */ }
OpenMP parallel Directive The execution of the code block after the parallel pragma is replicated among the threads. #include main () { struct job_struct job_ptr; struct task_struct task_ptr; #pragma omp parallel private(task_ptr) { task_ptr = get_next_task(&job_ptr); while(task_ptr != NULL) { complete_task(task_ptr) task_ptr = get_next_task(&job_ptr); } job_ptr task_ptr Master tread Thread 1 Shared variables
OpenMP parallel for Directive #include main () { int i; float b[5]; #pragma omp parallel for { for (i=0; i < 5; i++) b[i] = i; } b ii Master Thread (0) Thread 1 In parallel for, variables are by default shared with the exception that the loop index variable is private.
OpenMP Execution context: is an address space containing all of the variables the thread may access. Shared variable: has the same address in the execution context of every thread Private variable: has a different address in the execution context of every thread
OpenMP private Clause declares variables in its list to be private to each thread private (list) #include main () { int i,n; float a[10][10]; n=10; #pragma omp parallel for \ private(j) { for (i=0; i < n; i++) for (j=0; j < n; j++) a[i][j] = a[i][j] + i; }
OpenMP critical Directive Directs the compiler to enforce mutual exclusion among the threads executing the block of code. #include main() { int x; x = 0; #pragma omp parallel shared(x) { #pragma omp critical x = x + 1; } /* end of parallel section */ }
OpenMP reduction Directive reduction (operator : variable) #include main () { int i, n; float a, x, p; n=100; a=0.0; #pragma omp parallel for \ private(x) \ reduction(+:a) for (i=0; i < n; i++) { x = i/10.0; a += x*x; } p = a/n; }
OpenMP reduction Operators + addition - subtraction * multiplication & bitwise and | bitwise or ^ bitwise exclusive or && conditional and || conditional or
OpenMP Loop Scheduling: allows the iterations of a loop to be allocated to threads. Static schedule: all iterations are allocated to threads before they execute any loop iterations. Low overhead High load imbalance Dynamic schedule: only some of the iterations are allocated to threads at the beginning of loop’s execution. Threads that complete their iterations are then eligible to get additional work. Higher overhead Reduce load imbalance
OpenMP schedule Clause schedule (type [, chunk]) type: static, dynamic, etc. chunk: number of contiguous iterations assigned to each thread Increasing chunk size can reduce overhead and increase cache hit rate.
OpenMP schedule Clause #include main () { int i,n; float a[10]; n=10; #pragma omp parallel for \ private(i) \ schedule(static,5) { for (i=0; i < n; i++) a[i] = a[i] + i; }
OpenMP schedule Directive #include #define CHUNKSIZE 100 #define N 1000 main () { int i, chunk; float a[N], b[N], c[N]; for (i=0; i < N; i++) a[i] = b[i] = i * 1.0; chunk = CHUNKSIZE; #pragma omp parallel shared(a,b,c,chunk) private(i) { /* iterations will be distributed dynamically in chunk sized pieces*/ #pragma omp for schedule(dynamic,chunk) nowait { for (i=0; i < N; i++) c[i] = a[i] + b[i]; } } /* end of parallel section */ }
OpenMP nowait Clause tells the compiler to omit the barrier synchronization at the end of the parallel for loop #include #define CHUNKSIZE 100 #define N 1000 main () { int i, chunk; float a[N], b[N], c[N]; for (i=0; i < N; i++) a[i] = b[i] = i * 1.0; chunk = CHUNKSIZE; #pragma omp parallel shared(a,b,c,chunk) private(i) { #pragma omp for schedule(dynamic,chunk) nowait { for (i=0; i < N; i++) c[i] = a[i] + b[i]; }
OpenMP for Directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team. for (i=0; i < n; i++) { low=a[i]; high=b[i]; if (low > high) { break; } for (j=low; j < high; j++) c[j]=(c[j]-a[i])/b[i]; } #pragma omp parallel private(i,j) for (i=0; i < n; i++) { low=a[i]; high=b[i]; if (low > high) { break; } #pragma omp for for (j=low; j < high; j++) c[j]=(c[j]-a[i])/b[i]; }
OpenMP single Directive S pecifies that the enclosed code is to be executed by only one thread in the team. Threads in the team that do not execute the single directive wait at the end of the enclosed code block for (i=0; i < n; i++) { low=a[i]; high=b[i]; if (low > high) { break; } for (j=low; j < high; j++) c[j]=(c[j]-a[i])/b[i]; } #pragma omp parallel private(i,j) for (i=0; i < n; i++) { low=a[i]; high=b[i]; if (low > high) { #pragma omp single break; } #pragma omp for for (j=low; j < high; j++) c[j]=(c[j]-a[i])/b[i]; }
OpenMP #include int a, b, i, tid; float x; #pragma omp threadprivate(a, x) main () { omp_set_dynamic(0); /*Explicitly turn off dynamic threads*/ #pragma omp parallel private(b,tid) { tid = omp_get_thread_num(); a = tid; b = tid; x = 1.1 * tid +1.0; printf("Thread %d: a,b,x= %d %d %f\n",tid,a,b,x); } /* end of parallel section */ printf("Master thread doing serial work here\n"); #pragma omp parallel private(tid) { tid = omp_get_thread_num(); printf("Thread %d: a,b,x= %d %d %f\n",tid,a,b,x); } /* end of parallel section */ } Output: Thread 0: a,b,x= Thread 2: a,b,x= Thread 3: a,b,x= Thread 1: a,b,x= Master thread doing serial work here Thread 0: a,b,x= Thread 3: a,b,x= Thread 1: a,b,x= Thread 2: a,b,x= THREADPRIVATE Directive The THREADPRIVATE directive is used to make global file scope variables (C/C++) local and persistent to a thread through the execution of multiple parallel regions.
OpenMP parallel sections Directive – Functional Parallelism Specifies that the enclosed section(s) of code are to be divided among the threads in the team to be evaluated concurrently. #include main () {... #pragma omp parallel sections { #pragma omp section /* thread 1 */ v = alpha(); #pragma omp section /* thread 2 */ w = beta(); #pragma omp section /* thread 3 */ y = delta(); } /* end of parallel section */ x = gamma(v,w); printf(“%f\n”, epsilon(x,y)); }
OpenMP parallel sections Directive – Functional Parallelism Another solution: main () {... #pragma omp parallel { #pragma omp sections { #pragma omp section /* thread 1 */ v = alpha(); #pragma omp section /* thread 2 */ w = beta(); } #pragma omp sections { #pragma omp section /* thread 3 */ x = gamma(v,w); #pragma omp section /* thread 4 */ y = delta(); } } /* end of parallel section */ printf(“%f\n”, epsilon(x,y)); }
OpenMP section Directive #include #define N 1000 main () { int i; float a[N], b[N], c[N], d[N]; for (i=0; i < N; i++) { a[i] = i * 1.5; b[i] = i ; } #pragma omp parallel shared(a,b,c,d) private(i) { #pragma omp sections nowait { #pragma omp section for (i=0; i < N; i++) c[i] = a[i] + b[i]; #pragma omp section for (i=0; i < N; i++) d[i] = a[i] * b[i]; } /* end of sections */ } /* end of parallel section */ }
OpenMP Synchronization constructs: master directive: specifies a region that is to be executed only by the master thread of the team critical directive: specifies a region of code that must be executed by only one thread at a time barrier directive: synchronizes all threads in the team atomic directive: specifies that a specific memory location must be updates atomically (a mini critical section)
OpenMP barrier Directive #pragma omp barrier... ; atomic Directive #pragma omp atomic... ;
OpenMP Run-time Library Routines omp_set_num_threads(void): Sets the number of threads that will be used in the next parallel region omp_get_num_threads(void): Returns the number of threads that are currently executing the parallel region omp_get_thread_number(void): Returns the thread number omp_get_num_procs(void): Returns the number of processors omp_in_parallel(void): used to determine if the section of code is parallel or not
Parallel Programming Models Example: Pi calculation f 0 1 f(x) dx = f 0 1 4/(1+x 2 ) dx = w ∑ f(x i ) f(x) = 4/(1+x 2 ) n = 10 w = 1/n x i = w(i-0.5) x f(x) x i 1
Parallel Programming Models Sequential Code #define f(x) 4.0/(1.0+x*x); main(){ intn,i; float w,x,sum,pi; printf(“n?\n”); scanf(“%d”, &n); w=1.0/n; sum=0.0; for (i=1; i<=n; i++){ x=w*(i-0.5); sum += f(x); } pi=w*sum; printf(“%f\n”, pi); } = w ∑ f(x i ) f(x) = 4/(1+x 2 ) n = 10 w = 1/n x i = w(i-0.5) x f(x) x i 1
OpenMP #include #define f(x) 4.0/(1.0+x*x) #define NUM_THREADS 4 main() { floatsum, w, pi, x; inti, n, id; sum = 0.0; w=1.0/n; omp_set_num_threads(NUM_THREADS); #pragma omp parallel for private(x) { for (i=0; i<n; i++) { x=(i+0.5)*w; #pragma omp critical sum+=f(x); } pi = sum*w; printf(“pi=%f\n”, pi); } x f(x) x i 1