Open[M]ulti[P]rocessing


Open[M]ulti[P]rocessing
Pthreads: Programmer explicitly defines thread behavior. OpenMP: Compiler and runtime system define thread behavior.
Pthreads: Library is independent of the compiler. OpenMP: Requires compiler support.
Pthreads: Low-level, with maximum flexibility. OpenMP: Higher-level, with less flexibility.
Pthreads: Application is parallelized all at once. OpenMP: Programmer can incrementally parallelize an application.
Pthreads: Difficult to program high-performance parallel applications. OpenMP: Much simpler to program high-performance parallel applications.
Pthreads: Explicit fork/join or detached thread model. OpenMP: Implicit fork/join model.
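The implicit fork/join model in the last contrast is visible in even the smallest OpenMP program. A minimal sketch, assuming the compiler is invoked with OpenMP enabled (e.g. gcc -fopenmp):

#include <stdio.h>

int main(void) {
    printf("before: one thread\n");

    #pragma omp parallel                    // fork: a team of threads is created here
    {
        printf("inside: many threads\n");   // the structured block runs once per thread
    }                                       // join: implicit barrier, team disbands

    printf("after: one thread again\n");
    return 0;
}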

Creating OpenMP Programs
In C, OpenMP is used through the preprocessor.
General syntax: #pragma omp directive [clause [clause] ...]
Note: To extend a directive across multiple lines, end each line with the '\' character.
Error checking: adapt the code if compiler support is lacking:

#ifdef _OPENMP                      // Only include the header if it exists
#include <omp.h>
#endif

#ifdef _OPENMP                      // Get number of threads and rank if OpenMP exists
int rank = omp_get_thread_num();
int threads = omp_get_num_threads();
#else                               // Default to one thread, rank zero, if no OpenMP support
int rank = 0;
int threads = 1;
#endif
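A sketch of the same portability pattern as a complete file; the gcc compile lines in the comments are common practice and not from the slides:

// With OpenMP:     gcc -fopenmp portable.c -o portable
// Without OpenMP:  gcc portable.c -o portable    (the pragma is simply ignored)
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
    #pragma omp parallel
    {
#ifdef _OPENMP
        int rank = omp_get_thread_num();
        int threads = omp_get_num_threads();
#else
        int rank = 0, threads = 1;      // serial fallback: one thread, rank zero
#endif
        printf("rank %d of %d\n", rank, threads);
    }
    return 0;
}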

OpenMP Parallel Regions (slide figure from "An Introduction Into OpenMP", Copyright © 2005 Sun Microsystems)

Parallel Directive Overview
Initiate a parallel structured block: #pragma omp parallel { /* code here */ }
Overview:
The compiler, on encountering the omp parallel pragma, creates multiple threads.
All threads execute the specified structured block of code.
A structured block is either a single statement or a compound statement created with { ... }, with a single entry point and a single exit point.
There is an implicit barrier at the end of the construct.
Within the block, we can declare variables to be Private (local to a particular thread) or Shared (visible to all threads).

Example

int x, threads;
#pragma omp parallel private(x, threads)
{
    x = omp_get_thread_num();
    threads = omp_get_num_threads();
    a[x] += threads;
}

omp_get_num_threads() returns the number of active threads.
omp_get_thread_num() returns the rank (counting from 0).
x and threads are private variables, local to each thread.
Array a[] is a global shared array.
Note: a[x] in this example is not a critical section (a[x] is a unique location for each thread).
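A self-contained version of this fragment; the array size of 64, the includes, and main are additions for illustration:

#include <stdio.h>
#include <omp.h>

int a[64];                               // shared array; assumes at most 64 threads

int main(void) {
    int x, threads;
    #pragma omp parallel private(x, threads)
    {
        x = omp_get_thread_num();        // private: each thread's own rank
        threads = omp_get_num_threads(); // private copy of the team size
        a[x] += threads;                 // a[x] is unique per thread: no race
    }
    printf("a[0] = %d\n", a[0]);         // read after the implicit barrier
    return 0;
}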

Matrix Times Vector (slide figure from "An Introduction Into OpenMP", Copyright © 2005 Sun Microsystems)

Trapezoidal Rule
Trapezoid area: h * (f(x_i) + f(x_{i+1})) / 2
Calculate the approximate integral by summing a set of adjacent trapezoids, each with the same width.
Crude approximation (one trapezoid): (b - a) * (f(a) + f(b)) / 2
A closer approximation (n trapezoids): (b - a)/n * [f(x_0)/2 + f(x_1) + f(x_2) + ··· + f(x_{n-1}) + f(x_n)/2]
Sequential algorithm (given a, b, and n):

w = (b - a) / n;                    // Width of each trapezoid
integral = 0;
for (i = 1; i <= n - 1; i++) {      // Evaluate f at each interior point
    integral += f(a + i*w);
}
integral = w * (integral + f(a)/2 + f(b)/2);   // The approximate result

Parallel Trapezoidal Integration

void TrapezoidIntegration(double a, double b, int n, double* global_result) {
    int rank = omp_get_thread_num();
    int threads = omp_get_num_threads();
    double w = (b - a) / n;              // Width of each trapezoid
    int myN = n / threads;               // Trapezoids assigned to this thread
    double myA = a + rank * myN * w;     // This thread's sub-interval [myA, myB]
    double myB = myA + myN * w;
    double myResult = 0.0;
    for (int i = 1; i <= myN - 1; i++)
        myResult += f(myA + i*w);        // Evaluate at this thread's interior points
    #pragma omp critical                 // Mutual exclusion on the shared total
    *global_result += w * (myResult + f(myA)/2 + f(myB)/2);
}

int main(int argc, char* argv[]) {
    int threads = strtol(argv[1], NULL, 10);
    double a = atof(argv[2]), b = atof(argv[3]);
    int n = strtol(argv[4], NULL, 10);   // Number of trapezoids
    double result = 0.0;
    #pragma omp parallel num_threads(threads)
    TrapezoidIntegration(a, b, n, &result);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, result);
    return 0;
}

Global Reduction

double TrapezoidIntegration(double a, double b, int n) {
    int rank = omp_get_thread_num();
    int threads = omp_get_num_threads();
    double w = (b - a) / n;              // Width of each trapezoid
    int myN = n / threads;               // Trapezoids assigned to this thread
    double myA = a + rank * myN * w;     // This thread's sub-interval [myA, myB]
    double myB = myA + myN * w;
    double myResult = 0.0;
    for (int i = 1; i <= myN - 1; i++)
        myResult += f(myA + i*w);
    return w * (myResult + f(myA)/2 + f(myB)/2);
}

int main(int argc, char* argv[]) {
    int threads = strtol(argv[1], NULL, 10);
    double a = atof(argv[2]), b = atof(argv[3]);
    int n = strtol(argv[4], NULL, 10);   // Number of trapezoids
    double result = 0.0;
    #pragma omp parallel num_threads(threads) reduction(+: result)
    result += TrapezoidIntegration(a, b, n);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, result);
    return 0;
}

Parallel for Loop
Corresponds to the forall construct.
Syntax:
#pragma omp parallel for
for (i = 0; i < MAX; i++) { /* block of code */ }
The parallel for directive creates a team of threads and divides the loop iterations among them.
To make good use of system resources, the number of threads in a team can be determined dynamically.
By default, an implied barrier follows the end of the loop: all threads finish their iterations before any thread proceeds.
Any one of the following three approaches determines the team size, listed in increasing order of precedence:
The environment variable OMP_NUM_THREADS
Calling the omp_set_num_threads() library routine
A num_threads clause after the parallel directive, which specifies the team size for that particular directive
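A sketch of all three ways of setting the team size in one program; the loop body and the numbers are invented for illustration:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double a[1000];

    // 1. Environment variable (lowest precedence):  export OMP_NUM_THREADS=4
    // 2. Library routine, overrides the environment variable:
    omp_set_num_threads(8);

    // 3. num_threads clause, overrides both, for this directive only:
    #pragma omp parallel for num_threads(2)
    for (int i = 0; i < 1000; i++)
        a[i] = i * 0.5;                  // iterations divided between 2 threads

    printf("a[999] = %f\n", a[999]);
    return 0;
}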

Illustration of parallel for loops (slide figure from "An Introduction Into OpenMP", Copyright © 2005 Sun Microsystems)

Data Dependencies
The compiler rejects loops that don't follow the OpenMP rules:
The number of iterations must be known in advance.
The loop control expressions cannot be floats or doubles and cannot change during execution of the loop.
The index can only be changed by the increment part of the for statement.

int Linear_search(int key, int A[], int n) {
    int i;
    #pragma omp parallel for num_threads(thread_count)
    for (i = 0; i < n; i++)
        if (A[i] == key)
            return i;      // Compiler error: invalid exit from OpenMP structured block
    return -1;
}
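One legal alternative, sketched here: record a matching index in a shared variable under a critical section instead of returning from inside the structured block. The thread_count parameter is added only to keep the sketch self-contained.

#include <omp.h>

// Returns an index of key in A[0..n-1], or -1 if it is not present.
int Linear_search(int key, int A[], int n, int thread_count) {
    int position = -1;
    #pragma omp parallel for num_threads(thread_count)
    for (int i = 0; i < n; i++) {
        if (A[i] == key) {
            #pragma omp critical   // several threads may find a match
            position = i;          // keep one matching index; no early exit
        }
    }
    return position;
}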

Data Dependencies
One iteration depends upon computations of another: the code compiles, but the results are inconsistent.
Fibonacci example: f_0 = f_1 = 1; f_i = f_{i-1} + f_{i-2}

fibo[0] = fibo[1] = 1;
#pragma omp parallel for num_threads(threads)
for (i = 2; i < n; i++)
    fibo[i] = fibo[i-1] + fibo[i-2];

Possible outcomes using two threads:
1 1 2 3 5 8 13 21 34 55   // correct
1 1 2 3 5 8 0 0 0 0       // incorrect

Conclusions:
Dependencies within a single iteration will work correctly.
OpenMP compilers do not detect or reject cross-iteration dependences in a parallel for.
Avoid attempting to parallelize loops with cross-iteration dependencies.
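Not every loop-carried dependence is fatal; some can be removed by rewriting. A sketch (not from the slides) of an induction-variable dependence eliminated by computing the value directly from i:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    double a[N], x0 = 0.0, stride = 0.5;

    // Serial form with a dependence on x carried between iterations:
    //   double x = x0;
    //   for (int i = 0; i < N; i++) { x += stride; a[i] = 2.0 * x; }

    // Dependence removed: each iteration computes its own x from i.
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double x = x0 + (i + 1) * stride;
        a[i] = 2.0 * x;
    }

    printf("a[%d] = %f\n", N - 1, a[N - 1]);
    return 0;
}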

More parallel for Examples

Calculation of π:                         // 1 − 1/3 + 1/5 − 1/7 + ...
double sum = 0.0;
#pragma omp parallel for \
        num_threads(threads) \
        reduction(+: sum) \
        private(factor)
for (k = 0; k < n; k++) {
    if (k % 2 == 0) factor = 1.0;
    else factor = -1.0;
    sum += factor / (2*k + 1);
}
double pi = 4 * sum;

Trapezoidal integration:
h = (b - a) / n;
integral = (f(a) + f(b)) / 2.0;
#pragma omp parallel for \
        num_threads(threads) \
        reduction(+: integral)
for (i = 1; i <= n - 1; i++)
    integral += f(a + i*h);
integral = h * integral;

Odd/Even Sort
Note: the default(none) clause forces the programmer to specify the scope of all variables.
Note: for, unlike parallel, does not fork threads; it uses those already available. Spawning new threads is an expensive operation and is done sparingly.

#pragma omp parallel num_threads(threads) \
        default(none) shared(a, n) private(i, tmp, phase)
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
        #pragma omp for
        for (i = 1; i < n; i += 2) {
            if (a[i-1] > a[i]) {
                tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp;
            }
        }
    } else {
        #pragma omp for
        for (i = 1; i < n-1; i += 2) {
            if (a[i] > a[i+1]) {
                tmp = a[i+1]; a[i+1] = a[i]; a[i] = tmp;
            }
        }
    }
}

Note: there is a default barrier at the end of each omp for loop, i.e., after each phase.

Scheduling of Threads
Clause: schedule(type, chunk)
Static: iterations are assigned to threads before the loop is executed. The system assigns chunks of iterations to threads in round-robin fashion.
For iterations 0, 1, ..., 7 (eight iterations) and two threads:
schedule(static,1) assigns 0,2,4,6 to thread 0 and 1,3,5,7 to thread 1
schedule(static,4) assigns 0,1,2,3 to thread 0 and 4,5,6,7 to thread 1
Dynamic or guided: iterations are assigned to threads while the loop is executing. After a thread completes its current set of iterations, it requests more. Guided initially assigns large chunks, which shrink toward the chunk size as threads request more work; dynamic always uses the given chunk size.
auto: the compiler and/or the run-time system determine the schedule.
runtime: the schedule is taken at run time from the environment (the OMP_SCHEDULE variable).
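A sketch contrasting schedules on a loop with uneven iteration costs; the work() function and the constants are invented for illustration:

#include <stdio.h>
#include <omp.h>

// Simulated uneven work: later iterations are more expensive.
static double work(int i) {
    double s = 0.0;
    for (int j = 0; j < i * 1000; j++)
        s += 1.0 / (j + 1);
    return s;
}

int main(void) {
    double total = 0.0;

    // Dynamic scheduling: threads grab chunks of 4 iterations as they finish,
    // balancing the uneven costs better than a static split would.
    #pragma omp parallel for schedule(dynamic, 4) reduction(+: total)
    for (int i = 0; i < 256; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}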

Scheduling Example

#pragma omp parallel for num_threads(threads) \
        reduction(+: sum) schedule(static, 1)
for (i = 0; i <= n; i++)
    sum += f(i);

The iterations are assigned to the threads in round-robin fashion, one at a time, before the loop executes.

Sections
The structured blocks are distributed among the threads of the team; the sections directive does not create a new thread team.

// Allocate sections among the available threads
#pragma omp sections
{
    // The first section directive is implied and optional
    #pragma omp section
    { /* structured_block */ }
    // Each section can have its own individual code
    ...
}

Notes: sections can be nested; different independent code blocks run simultaneously in different sections, as in the sketch below.
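A concrete sketch of two independent blocks executing as sections; the two initialization loops are invented for illustration:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    static double x[N], y[N];

    #pragma omp parallel sections
    {
        #pragma omp section          // one thread fills x
        for (int i = 0; i < N; i++)
            x[i] = i * 2.0;

        #pragma omp section          // another thread, concurrently, fills y
        for (int i = 0; i < N; i++)
            y[i] = i * 3.0;
    }

    printf("x[10] = %f, y[10] = %f\n", x[10], y[10]);
    return 0;
}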

OMP: Sequential within Parallel Blocks
Single: the block is executed by exactly one of the threads (not necessarily the master).
Syntax: #pragma omp single { /* code */ }
Note: there is an implied barrier at the end of the construct unless nowait appears on the pragma line.
Master: the block is executed by the master thread only.
Syntax: #pragma omp master { /* code */ }
Note: there is no implied barrier in this construct.
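A sketch showing both constructs inside one parallel region; the messages and the explicit barrier are illustrative additions:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp single       // exactly one thread prints; implied barrier after
        printf("single: executed by thread %d\n", omp_get_thread_num());

        #pragma omp master       // only thread 0 prints; no implied barrier
        printf("master: executed by thread %d\n", omp_get_thread_num());

        #pragma omp barrier      // explicit barrier, since master has none
        printf("thread %d past the barrier\n", omp_get_thread_num());
    }
    return 0;
}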

Critical Sections / Synchronization
Critical section: #pragma omp critical [(name)] { /* code */ }
A critical section is keyed by its name. A thread reaching the critical directive blocks until no other thread is executing a critical section with the same name. The name is optional; if not specified, a global default name is used.
Barrier: #pragma omp barrier
Threads wait until all threads reach the barrier; then they all proceed together. Caution: all threads must be able to reach the barrier.
Atomic: #pragma omp atomic <expressionStatement>
A critical section that updates a single variable with a simple expression.
Flush: #pragma omp flush (variable_list)
The executing thread gets a consistent view of the shared variables: pending reads and writes of the listed variables complete and values are written back to memory, and memory operations appearing after the flush are not started earlier, creating a "memory fence".
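A short sketch exercising atomic, a named critical section, and a barrier together; the counters are invented for illustration:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int ticks = 0;
    double sum = 0.0;

    #pragma omp parallel num_threads(4)
    {
        // atomic: lightweight protection for a simple update of one variable
        #pragma omp atomic
        ticks++;

        // named critical section: at most one thread at a time runs this block
        #pragma omp critical(accumulate)
        sum += omp_get_thread_num() * 0.5;

        // all threads wait here before the final values are read
        #pragma omp barrier

        #pragma omp single
        printf("ticks = %d, sum = %f\n", ticks, sum);
    }
    return 0;
}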