Parallel Programming Models (Shared Address Space), 5th week



OpenMP Is … An Application Program Interface (API) to be used to explicitly direct multi-threaded, shared memory parallelism Three API components Compiler Directives Runtime Library Routines Environment Variables Portable APIs for C/C++ and Fortran Multiple platforms: most Unix platforms and Windows NT

OpenMP Is … (Cont’d) Standardized Jointly proposed by a group of major computer hardware and software vendors Expected to become an ANSI standard What does OpenMP stand for? Open specifications for multi-processing Collaborative work with interested parties from the hardware and software industry, government and academia

OpenMP Is Not … Meant for distributed memory parallel systems (by itself) Necessarily implemented identically by all vendors Guaranteed to make the most efficient use of shared memory (there are no data locality constructs)

History Directive-based, Fortran programming extensions In the early 90's, by vendors of shared-memory machines Augment a serial Fortran program with directives to specify loops to be parallelized The compiler is responsible for parallelizing such loops across the SMP processors Implementations were all functionally similar, but were diverging (as usual)

History (Cont’d) ANSI X3H5 In 1994 Rejected due to waning interest as distributed memory machines became popular. OpenMP In the spring of 1997 Taking over where ANSI X3H5 had left off, as newer shared memory machine architectures became popular

Goals Standardization Provide a standard among a variety of shared memory architectures(platforms) High-level interfaces to thread programming Lean and Mean A simple and limited set of directives for shared address space programming Just 3 or 4 directives are enough to represent significant parallelism

Hello World Program: Pthread Version

    #include <pthread.h>
    #include <stdio.h>

    /* Thread body: print the id passed in through arg */
    void* thrfunc(void* arg) {
        printf("hello from thread %d\n", *(int*)arg);
        return NULL;
    }

    int main(void) {
        pthread_t thread[4];
        pthread_attr_t attr;
        int arg[4] = {0,1,2,3};
        int i;

        // setup joinable threads with system scope
        pthread_attr_init(&attr);
        pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        // create N threads
        for(i=0; i<4; i++)
            pthread_create(&thread[i], &attr, thrfunc, (void*)&arg[i]);

        // wait for the N threads to finish
        for(i=0; i<4; i++)
            pthread_join(thread[i], NULL);
        return 0;
    }

Hello World: OpenMP Version

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* Every thread in the team executes the printf */
        #pragma omp parallel
        printf("hello from thread %d\n", omp_get_thread_num());
        return 0;
    }

Goals (Cont’d) Ease of use Incrementally parallelize a serial program Unlike all or nothing approach of message-passing Implement both coarse-grain and fine-grain parallelism Portability Fortran (77, 90, and 95), C, and C++ Public forum for API and membership

Matrix Multiplication: Sequential Version

    for (i=0; i<N; i++) {
        for (j=0; j<N; j++) {
            temp = 0;
            for (k=0; k<N; k++)
                temp += a[i][k] * b[k][j];
            c[i][j] = temp;
        }
    }

Matrix Multiplication: MPI Version

    /* Assumes a, b, c are N x N int arrays, nprocs is the number of
       processes, and N is divisible by nprocs */

    /* Determine block size (rows per process) */
    BlkSz = N / nprocs;
    start = BlkSz * Rank;
    end = start + BlkSz;

    /* Distribute blocks: all of b, and one block of rows of a per process */
    MPI_Bcast(b, N * N, MPI_INT, 0, MPI_COMM_WORLD);
    if (Rank == 0) {
        for (i=1; i<nprocs; i++)
            MPI_Send(a[BlkSz * i], BlkSz * N, MPI_INT, i, TAG_INIT, MPI_COMM_WORLD);
    } else {
        MPI_Recv(a[start], BlkSz * N, MPI_INT, 0, TAG_INIT, MPI_COMM_WORLD, &status);
    }

    /* Calculate partial matrix multiplication */
    for (i=start; i<end; i++) {
        for (j=0; j<N; j++) {
            temp = 0;
            for (k=0; k<N; k++)
                temp += a[i][k] * b[k][j];
            c[i][j] = temp;
        }
    }

    /* Gather partial result */
    if (Rank == 0) {
        for (i=1; i<nprocs; i++)
            MPI_Recv(c[BlkSz * i], BlkSz * N, MPI_INT, i, TAG_END, MPI_COMM_WORLD, &status);
    } else {
        MPI_Send(c[start], BlkSz * N, MPI_INT, 0, TAG_END, MPI_COMM_WORLD);
    }

Matrix Multiplication: OpenMP Version

    /* Add directive: i is private automatically; j, k and temp must be
       listed, otherwise the threads would share them */
    #pragma omp parallel for private(j, k, temp) schedule(static)
    for (i=0; i<N; i++) {
        for (j=0; j<N; j++) {
            temp = 0;
            for (k=0; k<N; k++)
                temp += a[i][k] * b[k][j];
            c[i][j] = temp;
        }
    }

Programming Model Thread Based Parallelism A shared memory process with multiple threads Based upon multiple threads in the shared memory programming paradigm Explicit Parallelism Explicit (not automatic) programming model Offer the programmer full control over parallelization

Programming Model (Cont’d) Fork - Join Model All OpenMP programs begin as a single sequential process: the master thread Fork at the beginning of parallel constructs The master thread creates a team of parallel threads The statements enclosed by the parallel region construct are executed in parallel Join at the end of parallel constructs The threads synchronize and terminate after completing the statements in the parallel construct Only the master thread continues

Fork-Join Model

Programming Model (Cont’d) Compiler Directive Based Parallelism is specified through the use of compiler directives embedded in C/C++ or Fortran source code Nested Parallelism Support Parallel constructs may include other parallel constructs inside. Implementation-dependent Dynamic Threads Alter the number of threads used to execute parallel regions Implementation-dependent

General Code Structure

    #include <omp.h>

    int main(void) {
        int var1, var2, var3;

        /* Serial code ... */

        /* Beginning of parallel section. Fork a team of threads.
           Specify variable scoping */
        #pragma omp parallel private(var1, var2) shared(var3)
        {
            /* Parallel section executed by all threads ... */

            /* All threads join master thread and disband */
        }

        /* Resume serial code ... */
        return 0;
    }

Terms Construct A statement, which consists of a directive and the subsequent structured block. Directive A C or C++ #pragma followed by the omp identifier, other text, and a new line. The directive specifies program behavior. Structured block A structured block is a statement that has a single entry and a single exit. A compound statement is a structured block if its execution always begins at the opening { and always ends at the closing }.

Terms (Cont’d) Lexical extent The code textually enclosed between the beginning and the end of a structured block following a directive. The static extent of a directive does not span multiple routines or code files Orphaned Directive An OpenMP directive that appears independently from another enclosing directive It exists outside of another directive's static (lexical) extent. Will span routines and possibly code files

Terms (Cont’d) Dynamic extent (region) All statements in the lexical extent, plus any statement inside a function that is executed as a result of the execution of statements within the lexical extent. The dynamic extent of a directive includes both its static (lexical) extent and the extents of its orphaned directives. Master thread The thread that creates a team when a parallel region is entered. Team One or more threads cooperating in the execution of a construct.

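Lexical/Orphan/Dynamic Extent

    /* Static extent: the text of the parallel construct */
    #pragma omp parallel
    {
        …
        #pragma omp for
        for(i=0; i<n; i++) {
            for(j=0; j<m; j++)
                sub1();
            sub2();
        }
    }

    /* Orphan directives: they appear in routines called from the region */
    sub1() {
        #pragma omp critical
        …
    }
    sub2() {
        #pragma omp sections
        …
    }

    /* Dynamic extent: the static extent plus the orphan directives */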

Terms (Cont’d) Parallel region Statements that bind to an OpenMP parallel construct and may be executed by multiple threads. Serial region Statements executed only by the master thread outside of the dynamic extent of any parallel region. Private A private variable names a block of storage that is unique to the thread making the reference. Shared A shared variable names a single block of storage. All threads in a team that access this variable will access this single block of storage.

OpenMP Components Directives Work-sharing constructs Data environment clauses Synchronization constructs Runtime libraries Environment variables

Directive Format
    #pragma omp     | Start of OpenMP C/C++ directives
    directive-name  | A valid OpenMP directive; appears after the pragma and before any clauses
    [clause, …]     | Clauses may appear in any order and can be repeated
    newline         | Required; precedes the structured block enclosed by this directive
Ex) #pragma omp parallel default(shared) private(beta, pi)

Directive Format (Cont’d) General Rules Directives follow conventions of the C/C++ standards Case sensitive Only one directive-name per directive Each directive applies to at most one succeeding structured block A long directive can be continued on the following lines by escaping the newline character with a backslash ("\") at the end of each directive line.

Parallel Directive
Purpose A block of code to be executed by multiple threads. The fundamental OpenMP parallel construct
Format

    #pragma omp parallel [clause ...] newline
        if (scalar_expression)
        private (list)
        shared (list)
        default (shared | none)
        firstprivate (list)
        reduction (operator: list)
        copyin (list)
    structured_block

Parallel Directive (Cont’d) Description On reaching a PARALLEL directive, a thread creates a team of threads and becomes the master The master is a member of that team (id = 0) The code is duplicated and all threads will execute that code. An implied barrier at the end of a parallel region Only the master thread continues execution past this point.

Parallel Directive (Cont’d) # of threads Determined by the following factors, in order of precedence: omp_set_num_threads() library function OMP_NUM_THREADS environment variable Implementation default Threads are numbered from 0 (master thread) to N-1

Parallel Directive (Cont’d) Clauses IF clause If present, it must evaluate to .TRUE. (Fortran) or non-zero (C/C++) in order for a team of threads to be created. Otherwise, the region is executed serially by the master thread. Data scope attribute clauses Restrictions A parallel region must be a structured block that does not span multiple routines or code files Only a single IF clause is permitted

Parallel Directive (Cont’d) Dynamic Threads By default, a program uses the same number of threads to execute each parallel region. The run-time system can dynamically adjust the number of threads omp_set_dynamic() library function OMP_DYNAMIC environment variable Nested Parallel Regions A parallel region nested within another parallel region results in the creation of a new team, consisting of one thread, by default. Implementation-dependent

Example of Parallel Region

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int nthreads, tid;

        /* Fork a team of threads giving them their own copies of variables */
        #pragma omp parallel private(nthreads, tid)
        {
            /* Obtain and print thread id */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        } /* All threads join master thread and terminate */
        return 0;
    }

Work-Sharing Constructs Description Divides the execution of the region among the members of the team An implied barrier at the end of the constructs No implied barrier upon the entry of the constructs Work-sharing constructs do not launch new threads

Construct Types #pragma omp for Shares iterations of a loop across the team. Represents a type of data parallelism #pragma omp single Serializes a section of code #pragma omp sections Breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of functional parallelism

Construct Types (Cont’d) #pragma omp parallel for Simplified form of #pragma omp parallel + #pragma omp for #pragma omp parallel sections Simplified form of #pragma omp parallel + #pragma omp sections

Work-Sharing Constructs Restrictions Must be enclosed dynamically within a parallel region for parallel execution Must be encountered by all members of a team or none at all Successive work-sharing constructs must be encountered in the same order by all members of a team

#pragma omp for Purpose The iterations of the loop immediately following this directive must be executed in parallel by the team This assumes a parallel region has already been initiated Otherwise it executes in serial on a single processor

#pragma omp for (Cont’d)

Format

    #pragma omp for [clause ...] newline
        schedule (type [,chunk])
        ordered
        private (list)
        firstprivate (list)
        lastprivate (list)
        reduction (operator: list)
        nowait
    for_loop

Clauses SCHEDULE clause How iterations of the loop are divided among the threads in the team The default schedule is implementation dependent STATIC Loop iterations are divided into pieces of size chunk and then statically assigned to threads By default, the iterations are evenly (if possible) divided contiguously among the threads

SCHEDULE Clause DYNAMIC Loop iterations are divided into pieces of size chunk, and dynamically scheduled among the threads When a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1 GUIDED The chunk size is exponentially reduced with each dispatched piece of the iteration space. The chunk size specifies the minimum number of iterations to dispatch each time. The default chunk size is 1.

SCHEDULE Clause (Cont’d) RUNTIME: The scheduling decision is deferred until runtime and taken from the environment variable OMP_SCHEDULE. It is illegal to specify a chunk size for this clause. ORDERED clause Must be present when ORDERED directives are enclosed within the for directive NOWAIT clause Threads do not synchronize at the end of the parallel loop Threads proceed directly to the next statements after the loop

SCHEDULE Clause (Cont’d) Restrictions The for loop can not be a do while loop, or a loop without loop control. The loop iteration variable must be an integer and the loop control parameters must be the same for all threads. Program correctness must not depend upon which thread executes a particular iteration. The chunk size must be specified as a loop invariant integer expression The C/C++ for directive requires that the for-loop must have canonical shape. ORDERED and SCHEDULE clauses may appear once each.

Example of For Directive

    #include <omp.h>
    #define CHUNKSIZE 100
    #define N 1000

    int main(void) {
        int i, chunk;
        float a[N], b[N], c[N];

        /* Some initializations */
        for (i=0; i < N; i++)
            a[i] = b[i] = i * 1.0;
        chunk = CHUNKSIZE;

        #pragma omp parallel shared(a,b,c,chunk) private(i)
        {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i=0; i < N; i++)
                c[i] = a[i] + b[i];
        } /* end of parallel section */
        return 0;
    }

#pragma omp sections Purpose A non-iterative work-sharing construct The enclosed section(s) of code are to be divided among the threads in the team Independent SECTION directives are nested within a SECTIONS directive Each SECTION is executed once by a thread in the team. Different sections may be executed by different threads.

#pragma omp sections (Cont’d)

Format

    #pragma omp sections [clause ...] newline
        private (list)
        firstprivate (list)
        lastprivate (list)
        reduction (operator: list)
        nowait
    {
        #pragma omp section newline
            structured_block
        #pragma omp section newline
            structured_block
    }

#pragma omp sections (Cont’d) Clauses An implied barrier at the end of a SECTIONS directive, unless the nowait clause is used Questions What happens if the number of threads and the number of SECTIONs are different? More threads than SECTIONs? Fewer threads than SECTIONs? Which thread executes which SECTION? Restriction SECTION directives must occur within the lexical extent of an enclosing SECTIONS directive

Example of Sections Directive

    #include <omp.h>
    #define N 1000

    int main(void) {
        int i;
        float a[N], b[N], c[N];

        /* Some initializations */
        for (i=0; i < N; i++)
            a[i] = b[i] = i * 1.0;

        #pragma omp parallel shared(a,b,c) private(i)
        {
            #pragma omp sections nowait
            {
                /* First half of the vector */
                #pragma omp section
                for (i=0; i < N/2; i++)
                    c[i] = a[i] + b[i];

                /* Second half of the vector */
                #pragma omp section
                for (i=N/2; i < N; i++)
                    c[i] = a[i] + b[i];
            } /* end of sections */
        } /* end of parallel section */
        return 0;
    }

#pragma omp single Purpose The enclosed code is to be executed by only one thread in the team May be useful when dealing with sections of code that are not thread safe (such as I/O)

#pragma omp single (Cont’d)

Format

    #pragma omp single [clause ...] newline
        private (list)
        firstprivate (list)
        nowait
    structured_block

Clauses Threads in the team that do not execute the SINGLE directive wait at the end of the enclosed code block, unless a nowait clause is specified

#pragma omp parallel for

    #include <omp.h>
    #define N 1000
    #define CHUNKSIZE 100

    int main(void) {
        int i, chunk;
        float a[N], b[N], c[N];

        /* Some initializations */
        for (i=0; i < N; i++)
            a[i] = b[i] = i * 1.0;
        chunk = CHUNKSIZE;

        #pragma omp parallel for shared(a,b,c,chunk) private(i) schedule(static,chunk)
        for (i=0; i < N; i++)
            c[i] = a[i] + b[i];
        return 0;
    }

Data Environment #pragma omp threadprivate Data scope clauses

#pragma omp threadprivate Purpose Make global file scope variables local and persistent to a thread through the execution of multiple parallel regions Format #pragma omp threadprivate (list)

#pragma omp threadprivate (Cont’d) Notes The directive must appear after the declaration of the listed variables/common blocks. Data written by one thread is not visible to other threads On first entry to a parallel region, data in THREADPRIVATE variables should be assumed undefined, unless a COPYIN clause is specified in the PARALLEL directive THREADPRIVATE variables differ from PRIVATE variables because they are persistent

#pragma omp threadprivate (Cont’d) Restrictions Data in THREADPRIVATE objects is guaranteed to persist only if the dynamic threads mechanism is "turned off" and the number of threads in different parallel regions remains constant. The default setting of dynamic threads is undefined. Must appear after every declaration of a thread private variable block.

Example of Threadprivate Directive

    #include <stdio.h>

    int alpha[10], beta[10], i;
    #pragma omp threadprivate(alpha)

    int main(void) {
        /* First parallel region */
        #pragma omp parallel private(i,beta)
        for (i=0; i < 10; i++)
            alpha[i] = beta[i] = i;

        /* Second parallel region */
        #pragma omp parallel
        printf("alpha[3]= %d and beta[3]= %d\n",alpha[3],beta[3]);
        return 0;
    }

Data Scope Clauses Data scope attribute clauses Explicitly define how variables should be scoped An important consideration for OpenMP programming is the understanding and use of data scoping Most variables are shared by default Global variables include File scope variables, static Private variables include Loop index variables Stack variables in subroutines called from parallel regions

Kinds of Data Scope Clauses #pragma … private #pragma … firstprivate #pragma … lastprivate #pragma … shared #pragma … default #pragma … reduction #pragma … copyin

Data Scope Clauses (Cont’d) Used in conjunction with several directives to control the scoping of enclosed variables Control the data environment during execution of parallel constructs. How and which data variables in the serial section of the program are transferred to the parallel sections of the program (and back) Which variables will be visible to all threads in the parallel sections and which variables will be privately allocated to each thread. Effective only within their lexical/static extent

PRIVATE Clause Purpose Declares variables in its list to be private to each thread Format private (list) Behavior A new object of the same type is declared once for each thread in the team All references to the original object are replaced with references to the new object Uninitialized for each thread

Comparison Between PRIVATE And THREADPRIVATE
                    | PRIVATE                                                   | THREADPRIVATE
    Data Item       | C/C++: variable                                           | C/C++: variable
    Where declared  | At start of region or work-sharing group                  | In declarations of each routine using block or global file scope
    Persistent      | No                                                        | Yes
    Extent          | Lexical only - unless passed as an argument to subroutine | Dynamic
    Initialized     | FIRSTPRIVATE                                              | COPYIN

Shared Clause Purpose Declares variables in its list to be shared among all threads in the team Format shared (list) Notes Exists in only one memory location and all threads can read or write to that address It is the programmer's responsibility to ensure that multiple threads properly access SHARED variables

Default Clause Purpose Allows the user to specify a default scope for all variables in the lexical extent of any parallel region. The C/C++ format shown here accepts SHARED or NONE; PRIVATE is a Fortran option. Format default (shared | none) Notes Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and REDUCTION clauses

Default Clause (Cont’d) Restrictions Only one DEFAULT clause can be specified on a PARALLEL directive

Firstprivate Clause Purpose Combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list. Format firstprivate (list) Notes Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct.

Lastprivate Clause Purpose Combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object Format lastprivate (list) Note The value copied back into the original variable object is obtained from the last (sequentially) iteration or section of the enclosing construct

Copyin Clause Purpose Provides a means for assigning the same value to THREADPRIVATE variables for all threads in the team Format copyin (list) Notes List contains the names of variables to copy. The master thread variable is used as the copy source. The team threads are initialized with its value upon entry into the parallel construct

Reduction Clause Purpose Performs a reduction on the variables that appear in its list. A private copy for each list variable is created for each thread. At the end of the reduction, the reduction operator is applied to all private copies of the shared variable, and the final result is written to the global shared variable Format reduction (operator: list)

Reduction Clause (Cont’d) Restrictions Variables in the list must be named scalar variables They must also be declared SHARED in the enclosing context. Reduction operations may not be associative for real numbers.

Reduction Clause (Cont’d) The reduction variable is used only in statements which have one of the following forms:
    x = x op expr
    x = expr op x (except subtraction)
    x binop= expr
    x++
    ++x
    x--
    --x

Reduction Example

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int i, n, chunk;
        float a[100], b[100], result;

        /* Some initializations */
        n = 100;
        chunk = 10;
        result = 0.0;
        for (i=0; i < n; i++) {
            a[i] = i * 1.0;
            b[i] = i * 2.0;
        }

        #pragma omp parallel for default(shared) private(i) schedule(static,chunk) \
            reduction(+:result)
        for (i=0; i < n; i++)
            result = result + (a[i] * b[i]);

        printf("Final result= %f\n",result);
        return 0;
    }

Synchronization Constructs #pragma omp master #pragma omp critical #pragma omp barrier #pragma omp atomic #pragma omp flush #pragma omp ordered

Race Condition

    increment(x) { x = x + 1; }    /* executed by both Thread 1 and Thread 2 */

One possible execution sequence (each thread has its own register A):
1. Thread 1 loads the value of x into register A.
2. Thread 2 loads the value of x into register A.
3. Thread 1 adds 1 to register A
4. Thread 2 adds 1 to register A
5. Thread 1 stores register A at location x
6. Thread 2 stores register A at location x
Result: x has been incremented only once, although increment was called twice.

Race Condition (Cont’d) Solutions The increment of x must be synchronized between the two threads OpenMP provides a variety of synchronization constructs to control how the execution of each thread proceeds relative to other team threads.

#pragma omp master Purpose Specifies a region to be executed only by the master thread of the team. All other threads on the team skip this section of code No implied barrier associated with this directive Format #pragma omp master newline structured_block

#pragma omp critical Purpose Specifies a region of code that must be executed by only one thread at a time. Format #pragma omp critical [ name ] newline structured_block

#pragma omp critical (Cont’d) Notes Prevents race conditions: if a thread is executing inside a CRITICAL region and another thread reaches that region, the other thread will block until the first thread exits the CRITICAL region The optional name enables multiple different CRITICAL regions to exist Different CRITICAL regions with the same name are treated as the same region All unnamed CRITICAL sections are treated as the same section

Example of Critical Directive

    #include <omp.h>

    int main(void) {
        int x;
        x = 0;

        #pragma omp parallel shared(x)
        {
            /* One thread at a time increments x */
            #pragma omp critical
            x = x + 1;
        } /* end of parallel section */
        return 0;
    }

#pragma omp atomic Purpose Specifies that a specific memory location must be updated atomically A mini-CRITICAL section Format #pragma omp atomic newline statement_expression

#pragma omp atomic (Cont’d) Restriction An atomic statement must have one of the following forms:
    x binop= expr
    x++
    ++x
    x--
    --x

#pragma omp ordered Purpose Specifies that iterations of the enclosed loop will be executed in the same order as if they were executed on a serial processor Format #pragma omp ordered newline structured_block

#pragma omp ordered (Cont’d) Restrictions Only appear in the dynamic extent of the following directives for or parallel for Only one thread is allowed in an ordered section at any time An iteration of a loop must not execute the same ORDERED directive more than once, and it must not execute more than one ORDERED directive. A loop which contains an ORDERED directive, must be a loop with an ORDERED clause.

Directive Binding Rules The for, SECTIONS, SINGLE, MASTER and BARRIER directives bind to the dynamically enclosing PARALLEL, if one exists. If no parallel region is currently being executed, the directives have no effect. The ORDERED directive binds to the dynamically enclosing for. The ATOMIC directive enforces exclusive access with respect to ATOMIC directives in all threads, not just the current team.

Directive Binding Rules (Cont’d) The CRITICAL directive enforces exclusive access with respect to CRITICAL directives in all threads, not just the current team. A directive can never bind to any directive outside the closest enclosing PARALLEL.

Directive Nesting Rules A PARALLEL directive dynamically inside another PARALLEL directive logically establishes a new team, which is composed of only the current thread unless nested parallelism is enabled. For, SECTIONS, and SINGLE directives that bind to the same PARALLEL are not allowed to be nested inside of each other. For, SECTIONS, and SINGLE directives are not permitted in the dynamic extent of CRITICAL, ORDERED and MASTER regions.

Directive Nesting Rules (Cont’d) CRITICAL directives with the same name are not permitted to be nested inside of each other. BARRIER directives are not permitted in the dynamic extent of DO/for, ORDERED, SECTIONS, SINGLE, MASTER and CRITICAL regions. MASTER directives are not permitted in the dynamic extent of DO/for, SECTIONS and SINGLE directives.

Directive Nesting Rules (Cont’d) ORDERED directives are not permitted in the dynamic extent of CRITICAL regions. Any directive that is permitted when executed dynamically inside a PARALLEL region is also legal when executed outside a parallel region. When executed dynamically outside a user- specified parallel region, the directive is executed with respect to a team composed of only the master thread.

Environment Variables All environment variable names are uppercase. The values assigned to them are not case sensitive. OMP_SCHEDULE Applies only to for, parallel for directives which have their schedule clause set to RUNTIME setenv OMP_SCHEDULE "guided, 4" setenv OMP_SCHEDULE "dynamic"

Environment Variables (Cont’d) OMP_NUM_THREADS Sets the maximum number of threads to use during execution. setenv OMP_NUM_THREADS 8 OMP_DYNAMIC Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. setenv OMP_DYNAMIC TRUE OMP_NESTED Enables or disables nested parallelism. setenv OMP_NESTED TRUE