Presentation on theme: "Open[M]ulti[P]rocessing" — Presentation transcript:

1 Open[M]ulti[P]rocessing
Pthreads: Programmer explicitly defines thread behavior.
OpenMP: Compiler and runtime system define thread behavior.

Pthreads: Library is independent of the compiler.
OpenMP: Library requires compiler support.

Pthreads: Low-level, with maximum flexibility.
OpenMP: Higher-level, with less flexibility.

Pthreads: Application is parallelized all at once.
OpenMP: Programmer can incrementally parallelize an application.

Pthreads: Difficult to program high-performance parallel applications.
OpenMP: Much simpler to program high-performance parallel applications.

Pthreads: Explicit fork/join or detached-thread model.
OpenMP: Implicit fork/join model.
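As a point of contrast with Pthreads' explicit pthread_create/pthread_join calls, here is a minimal sketch of the implicit fork/join model (the message text is illustrative):

#include <stdio.h>

int main(void) {
    /* Fork: the pragma creates a team; every thread runs the block. */
#pragma omp parallel
    {
        printf("Hello from one of the team's threads\n");
    }
    /* Join: implicit barrier at the closing brace; only the initial
       thread continues past this point. */
    return 0;
}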

2 Creating OpenMP Programs
In C, make use of the preprocessor.
General syntax: #pragma omp directive [clause [clause] ...]
Note: To extend a directive across multiple lines, end each line with the '\' character.
Error checking: adapt gracefully if compiler support is lacking:

#ifdef _OPENMP
#include <omp.h>   /* Only include the header if it exists */
#endif

#ifdef _OPENMP
    /* Get the number of threads and the rank if OpenMP exists */
    int rank = omp_get_thread_num();
    int threads = omp_get_num_threads();
#else
    /* Default to one thread, rank zero, if there is no OpenMP support */
    int rank = 0;
    int threads = 1;
#endif
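A complete, compilable sketch of the guarded idiom above (assuming gcc, where the OpenMP switch is -fopenmp; other compilers use different flags). Without the flag, _OPENMP is undefined and the serial fallback compiles cleanly:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
#pragma omp parallel   /* treated as an unknown pragma without OpenMP */
    {
#ifdef _OPENMP
        int rank = omp_get_thread_num();
        int threads = omp_get_num_threads();
#else
        int rank = 0;
        int threads = 1;
#endif
        printf("Thread %d of %d\n", rank, threads);
    }
    return 0;
}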

3 OpenMP Parallel Regions
(Slide figure from "An Introduction Into OpenMP", copyright 2005 Sun Microsystems)

4 Parallel Directive Overview
Initiate a parallel structured block:

#pragma omp parallel
{
    /* code here */
}

Overview:
The compiler, on encountering the omp parallel pragma, creates multiple threads.
All threads execute the specified structured block of code.
A structured block is either a single statement or a compound statement created with { ... }, with a single entry point and a single exit point.
There is an implicit barrier at the end of the construct.
Within the block, we can declare variables to be:
Private: local to a particular thread
Shared: visible to all threads

5 Example

int x, threads;
#pragma omp parallel private(x, threads)
{
    x = omp_get_thread_num();
    threads = omp_get_num_threads();
    a[x] += threads;
}

omp_get_num_threads() returns the number of active threads.
omp_get_thread_num() returns the rank (counting from 0).
x and threads are private variables, local to each thread.
Array a[] is a global shared array.
Note: a[x] in this example is not a critical section (a[x] is a unique location for each thread).
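A self-contained version of this example (the array size and the final print loop are illustrative additions):

#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 64

int a[MAX_THREADS];                      /* shared: visible to all threads */

int main(void) {
    int x, threads;
#pragma omp parallel private(x, threads)
    {
        x = omp_get_thread_num();        /* rank: private copy per thread */
        threads = omp_get_num_threads(); /* team size: private copy per thread */
        a[x] += threads;                 /* a[x] is unique per thread: no race */
    }
    for (int i = 0; i < MAX_THREADS; i++)
        if (a[i] != 0)
            printf("a[%d] = %d\n", i, a[i]);
    return 0;
}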

6 Matrix Times Vector
(Slide figure from "An Introduction Into OpenMP", copyright 2005 Sun Microsystems)

7 Trapezoidal Rule
Area of one trapezoid: h * (f(x_i) + f(x_{i+1})) / 2
Calculate the approximate integral by summing a set of adjacent trapezoids, each with the same width.
Crudest approximation (one trapezoid): (b - a) * (f(a) + f(b)) / 2
A closer approximation (n trapezoids): (b - a)/n * [f(x_0)/2 + f(x_1) + f(x_2) + ... + f(x_{n-1}) + f(x_n)/2]
Sequential algorithm (given a, b, and n):

w = (b - a) / n;                 /* Width of one trapezoid */
integral = 0;
for (i = 1; i < n; i++)          /* Interior points f(x_1) .. f(x_{n-1}) */
    integral += f(a + i*w);
integral = w * (integral + f(a)/2 + f(b)/2);   /* The approximate result */
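A runnable serial sketch of the algorithm; the integrand f(x) = x*x and the bounds are illustrative, not from the slide:

#include <stdio.h>

double f(double x) { return x * x; }   /* example integrand */

int main(void) {
    double a = 0.0, b = 1.0;    /* integration bounds (illustrative) */
    int n = 1024;               /* number of trapezoids */
    double w = (b - a) / n;     /* width of one trapezoid */
    double integral = 0.0;
    for (int i = 1; i < n; i++)            /* interior points */
        integral += f(a + i * w);
    integral = w * (integral + f(a)/2 + f(b)/2);
    printf("Approximate integral = %.14e\n", integral);   /* approx 1/3 */
    return 0;
}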

8 Parallel Trapezoidal Integration
/* Requires #include <stdio.h>, <stdlib.h>, <omp.h>; f() is defined elsewhere */

void TrapezoidIntegration(double a, double b, int n, double *global_result) {
    int rank = omp_get_thread_num();
    int threads = omp_get_num_threads();
    double w = (b - a) / n;             /* Width of one trapezoid */
    int myN = n / threads;              /* Trapezoids per thread */
    double myA = a + rank * myN * w;    /* This thread's subinterval */
    double myB = myA + myN * w;
    double myResult = 0.0;
    for (int i = 1; i < myN; i++)
        myResult += f(myA + i*w);
#pragma omp critical                    /* Mutual exclusion on the shared sum */
    *global_result += w * (myResult + f(myA)/2 + f(myB)/2);
}

int main(int argc, char *argv[]) {
    int threads = strtol(argv[1], NULL, 10);
    double a = atof(argv[2]), b = atof(argv[3]);
    int n = strtol(argv[4], NULL, 10);  /* Number of trapezoids */
    double global_result = 0.0;
#pragma omp parallel num_threads(threads)
    TrapezoidIntegration(a, b, n, &global_result);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, global_result);
    return 0;
}

9 Global Reduction

/* Requires #include <stdio.h>, <stdlib.h>, <omp.h>; f() is defined elsewhere */

double TrapezoidIntegration(double a, double b, int n) {
    int rank = omp_get_thread_num();
    int threads = omp_get_num_threads();
    double w = (b - a) / n;
    int myN = n / threads;
    double myA = a + rank * myN * w;
    double myB = myA + myN * w;
    double myResult = 0.0;
    for (int i = 1; i < myN; i++)
        myResult += f(myA + i*w);
    return w * (myResult + f(myA)/2 + f(myB)/2);   /* This thread's partial integral */
}

int main(int argc, char *argv[]) {
    int threads = strtol(argv[1], NULL, 10);
    double a = atof(argv[2]), b = atof(argv[3]);
    int n = strtol(argv[4], NULL, 10);
    double result = 0.0;
#pragma omp parallel num_threads(threads) reduction(+: result)
    result += TrapezoidIntegration(a, b, n);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, result);
    return 0;
}

10 Parallel for Loop
Corresponds to the forall construct.

Syntax:
#pragma omp parallel for
for (i = 0; i < MAX; i++) { /* block of code */ }

The parallel directive creates a team of threads to execute the specified block of code in parallel.
To use system resources effectively, the number of threads in a team can be determined dynamically.
By default, an implied barrier follows the end of the loop (not each iteration).
Team size is determined by the first of the following that applies, in precedence order:
1. The num_threads clause on the parallel directive (specifies team size for that particular directive)
2. A call to the omp_set_num_threads() library routine
3. The environment variable OMP_NUM_THREADS
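A minimal sketch of the construct (the array names are illustrative). The pragma sits immediately before the loop, and the implicit barrier at the end of the loop guarantees all of z[] is written before any thread continues:

#include <omp.h>

#define MAX 1000
double x[MAX], y[MAX], z[MAX];

int main(void) {
    for (int i = 0; i < MAX; i++) { x[i] = i; y[i] = 2.0 * i; }

#pragma omp parallel for num_threads(4)   /* clause: highest-precedence team-size setting */
    for (int i = 0; i < MAX; i++)
        z[i] = x[i] + y[i];               /* iterations are independent */
    return 0;
}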

11 Illustration of parallel for loops
(Slide figure from "An Introduction Into OpenMP", copyright 2005 Sun Microsystems)

12 Data Dependencies
The compiler rejects loops that don't follow OpenMP's rules:
The number of iterations must be known in advance.
The loop control variable must be an integer (not a float or double), and the loop bounds cannot change during execution of the loop.
The index can only change via the increment part of the for statement.

int Linear_search(int key, int A[], int n) {
    int i;
#pragma omp parallel for num_threads(thread_count)
    for (i = 0; i < n; i++)
        if (A[i] == key)
            return i;   /* Compiler error: invalid exit from OpenMP structured block */
    return -1;
}
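One legal rework of the search (a sketch, not from the slide): record the match in a shared variable instead of returning from inside the loop, so the structured block keeps a single exit. thread_count is assumed to be defined as on the slide:

int Linear_search(int key, int A[], int n) {
    int position = -1;                 /* shared result variable */
#pragma omp parallel for num_threads(thread_count)
    for (int i = 0; i < n; i++) {
        if (A[i] == key) {
#pragma omp critical                   /* writes to position must not race */
            position = i;              /* if several match, one index wins */
        }
    }
    return position;
}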

13 Data Dependencies
One iteration depends upon computations of another: the code compiles, but the results are inconsistent.
Fibonacci example: f_0 = f_1 = 1; f_i = f_{i-1} + f_{i-2}

fibo[0] = fibo[1] = 1;
#pragma omp parallel for num_threads(threads)
for (i = 2; i < n; i++)
    fibo[i] = fibo[i-1] + fibo[i-2];

Possible outcomes using two threads: one run produces the correct sequence; another run produces wrong values in the second thread's portion, because that thread can read fibo[i-1] or fibo[i-2] before the other thread has written them.
Conclusions:
Dependencies within a single iteration work correctly.
OpenMP compilers don't reject parallel for directives whose iterations depend on one another.
Avoid attempting to parallelize loops with cross-iteration dependencies.

14 More parallel for Examples

Calculation of pi (1 - 1/3 + 1/5 - 1/7 + ...):

double sum = 0.0;
#pragma omp parallel for \
        num_threads(threads) \
        reduction(+: sum) \
        private(factor)
for (k = 0; k < n; k++) {
    if (k % 2 == 0)
        factor = 1.0;
    else
        factor = -1.0;
    sum += factor / (2*k + 1);
}
double pi = 4 * sum;

Trapezoidal integration:

h = (b - a) / n;
integral = (f(a) + f(b)) / 2.0;
#pragma omp parallel for \
        num_threads(threads) \
        reduction(+: integral)
for (i = 1; i <= n - 1; i++)
    integral += f(a + i*h);
integral = h * integral;

15 Odd/Even Sort
Note: The default(none) clause forces the programmer to specify the scope of all variables.

/* Note that for, unlike parallel, does not fork threads; it uses those available. */
/* Spawning new threads is an expensive operation, so it is done sparingly. */
#pragma omp parallel num_threads(threads) \
        default(none) shared(a, n) private(i, tmp, phase)
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
#pragma omp for
        for (i = 1; i < n; i += 2) {
            if (a[i-1] > a[i]) {
                tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp;
            }
        }
    } else {
#pragma omp for
        for (i = 1; i < n - 1; i += 2) {
            if (a[i] > a[i+1]) {
                tmp = a[i+1]; a[i+1] = a[i]; a[i] = tmp;
            }
        }
    }
}

Note: There is a default barrier at the end of each inner for construct, i.e., after each phase.

16 Scheduling of Threads
Clause: schedule(type, chunk)

Static:
Iterations are assigned to threads before the loop is executed.
The system assigns chunks of iterations to threads in round-robin fashion.
For eight iterations (0, 1, ..., 7) and two threads:
schedule(static, 1) assigns 0,2,4,6 to thread 0 and 1,3,5,7 to thread 1
schedule(static, 4) assigns 0,1,2,3 to thread 0 and 4,5,6,7 to thread 1

Dynamic or guided:
Iterations are assigned to threads while the loop is executing. After a thread completes its current set of iterations, it requests more.
Dynamic hands out chunks of the given size; guided initially assigns large chunks, which decrease down to chunk size as threads request more work.

auto: The compiler and/or the runtime system determine the schedule.
runtime: The schedule is chosen at run time, from the OMP_SCHEDULE environment variable.
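A sketch of when dynamic scheduling pays off (the workload function is an illustrative stand-in): iteration cost grows with i, so static chunks would be imbalanced, while dynamic lets fast threads grab the next chunk of two iterations as they finish:

#include <stdio.h>
#include <omp.h>

/* Illustrative: cost grows with i */
double work(int i) {
    double s = 0.0;
    for (int j = 0; j < i * 1000; j++)
        s += 1.0 / (j + 1);
    return s;
}

int main(void) {
    double sum = 0.0;
#pragma omp parallel for schedule(dynamic, 2) reduction(+: sum)
    for (int i = 0; i < 100; i++)
        sum += work(i);
    printf("sum = %f\n", sum);
    return 0;
}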

17 Scheduling Example

#pragma omp parallel for num_threads(threads) \
        reduction(+: sum) schedule(static, 1)
for (i = 0; i <= n; i++)
    sum += f(i);

Iterations are assigned before the loop executes, in round-robin fashion, one iteration at a time per thread.

18 Sections
The structured blocks are distributed among the threads of a team: each section is executed once, by one thread. The sections directive does not create a new team of threads.

/* Allocate sections among the available threads */
#pragma omp sections
{
    /* The first section directive is implied and optional */
#pragma omp section
    { /* structured_block */ }
    /* Each section can have its own individual code */
}

Notes: Sections can be nested. Independent code blocks in different sections run simultaneously.
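A minimal sections sketch (the printed text is illustrative): the two independent blocks may run simultaneously on different threads of the existing team, and the implicit barrier at the end of sections holds the team together:

#include <stdio.h>
#include <omp.h>

int main(void) {
#pragma omp parallel num_threads(2)
    {
#pragma omp sections
        {
#pragma omp section
            printf("section 1 on thread %d\n", omp_get_thread_num());
#pragma omp section
            printf("section 2 on thread %d\n", omp_get_thread_num());
        }   /* implicit barrier at the end of the sections construct */
    }
    return 0;
}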

19 OMP: Sequential within Parallel Blocks
Single: the block is executed by exactly one of the threads (not necessarily the master).
Syntax: #pragma omp single { /* code */ }
Note: There is an implied barrier at the end of the construct unless nowait appears on the pragma line.

Master: the block is executed by the master thread only.
Syntax: #pragma omp master { /* code */ }
Note: There is no implied barrier in this construct.
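A sketch contrasting the two constructs: single lets exactly one (unspecified) thread do the work while the others wait at its implicit barrier; master pins the block to thread 0 and lets the rest run ahead:

#include <stdio.h>
#include <omp.h>

int main(void) {
#pragma omp parallel
    {
#pragma omp single
        printf("executed once, by some thread (%d)\n", omp_get_thread_num());
        /* implicit barrier here: the whole team waits */

#pragma omp master
        printf("executed once, by the master thread only\n");
        /* no implicit barrier: the other threads continue immediately */
    }
    return 0;
}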

20 Critical Sections / Synchronization

Critical sections: #pragma omp critical(name) { /* code */ }
A critical section is keyed by its name: a thread reaching the critical directive blocks until no other thread is executing a critical section with the same name.
The name is optional; if it is not specified, a global default is used.

Barrier: #pragma omp barrier
Threads wait until all threads reach the barrier; then they all proceed together.
Caution: All threads must be able to reach the barrier.

Atomic expression: #pragma omp atomic <expressionStatement>
A critical section that updates a variable by executing a single simple expression.

Flush: #pragma omp flush(variable_list)
The executing thread gets a consistent view of the shared variables: pending read and write operations on the variables complete and values are written back to memory, and new memory operations in the code after the flush are not started beforehand, creating a "memory fence".
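A small sketch tying atomic, barrier, and single together (the counter and message are illustrative): atomic covers the single-expression update, and the barrier guarantees the count is complete before it is printed:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;                  /* shared */
#pragma omp parallel num_threads(8)
    {
#pragma omp atomic                    /* simple update: atomic is enough */
        counter++;

#pragma omp barrier                   /* every thread must reach this point */

#pragma omp single
        printf("counter = %d\n", counter);  /* team size: 8 if all threads granted */
    }
    return 0;
}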

