OpenMP II
B. Estrade, LSU – High Performance Computing Enablement Group

Presentation transcript:

OpenMP II
B. Estrade, LSU – High Performance Computing Enablement Group

Objectives
● Learn about OpenMP's work-sharing constructs, including:
– loops
– parallel sections
– mutual exclusion

Work Sharing
● Refers to dividing tasks up among multiple threads.
● OpenMP provides methods to:
– parallelize loops
– define parallel sections
– define critical sections
– protect shared variables from conflicting updates

Synchronization
● barrier – no thread may move beyond this point in the code until all threads have reached it.
● nowait – used in conjunction with the OpenMP work-sharing constructs (DO/FOR, sections, and single); it instructs threads not to wait for the others to complete before moving on, removing the construct's implicit barrier.
● Other directives that affect how threads proceed include master, ordered, critical, and atomic; the latter three force thread serialization.
● Barriers make parallel programs less efficient, but they are a necessary part of most non-trivial applications.
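
A minimal C sketch (not from the original slides) of how nowait and an explicit barrier interact; the array name, size, and work inside the loop are illustrative:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000

    int main(void) {
       double a[N];

       #pragma omp parallel
       {
          /* nowait removes the implicit barrier at the end of this loop:
             threads that finish their iterations early may proceed. */
          #pragma omp for nowait
          for (int i = 0; i < N; i++)
             a[i] = i * 0.5;

          /* ... each thread could do unrelated work here without waiting ... */

          /* Explicit barrier: all of a[] is guaranteed complete below this point. */
          #pragma omp barrier

          #pragma omp single
          printf("a[10] = %f\n", a[10]);
       }
       return 0;
    }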

OpenMP's Flush Directive
● The flush directive is a special kind of synchronization point: it marks a point at which a consistent view of global memory must be provided.
● Even on cache-coherent systems this directive is available, because the OpenMP standard requires that the user be able to force this consistency explicitly.
● In other words, when called, the flush directive ensures that all threads see the most recent values of the shared variables that were 'flushed'.

    ! Compute values of array C in parallel.
    !$OMP PARALLEL SHARED(A, B, C), PRIVATE(I)
    !$OMP DO
          DO I = 1, N
             C(I) = A(I) + B(I)
             A(I) = A(I) + 1
             B(I) = B(I) + 1
          ENDDO
    !$OMP END DO
    ! Flush shared variables A, B, and C
    !$OMP FLUSH (A,B,C)
    !$OMP END PARALLEL
          PRINT *, C(10)
          END
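
A hedged C sketch (not from the slides) of the classic use of flush: one thread publishes a value, another polls a flag. The variable names and the two-thread setup are illustrative; it assumes at least two threads are available:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
       int data = 0, ready = 0;

       #pragma omp parallel num_threads(2) shared(data, ready)
       {
          if (omp_get_thread_num() == 0) {        /* producer */
             data = 42;
             #pragma omp flush(data)              /* make data visible first   */
             ready = 1;
             #pragma omp flush(ready)             /* then publish the flag     */
          } else {                                /* consumer */
             int done = 0;
             while (!done) {
                #pragma omp flush(ready)          /* re-read the flag          */
                done = ready;
             }
             #pragma omp flush(data)              /* ensure data is current    */
             printf("data = %d\n", data);
          }
       }
       return 0;
    }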

Parallelizing Loops
The iterations of a single loop can be divided among threads. For example, with three threads:

    // original loop
    for (int i = 1; i <= 99; i++) {
       // do stuff
    }

    // Thread 0 (i = 1..33)
    for (int i = 1; i <= 33; i++) {
       // do stuff
    }

    // Thread 1 (i = 34..66)
    for (int i = 34; i <= 66; i++) {
       // do stuff
    }

    // Thread 2 (i = 67..99)
    for (int i = 67; i <= 99; i++) {
       // do stuff
    }
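
A minimal sketch (not from the slides) of the directive that produces the division shown above; the three-way contiguous split assumes three threads and a static schedule:

    #include <omp.h>

    int main(void) {
       /* With 3 threads and a static schedule, iterations 1..99 are split
          into contiguous blocks of 33: 1-33, 34-66, and 67-99. */
       #pragma omp parallel for num_threads(3) schedule(static)
       for (int i = 1; i <= 99; i++) {
          // do stuff
       }
       return 0;
    }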

Parallelizing Loops
● OpenMP allows loops to be parallelized fairly easily.
● A few caveats:
– the loop must be a structured block and must not be terminated by a break statement
– the values of the loop control expression must be the same for all iterations of the loop
– critical sections, the (atomic) updating of shared variables, and use of the ordered directive should be avoided
– OpenMP works best when the loops are static and very straightforward

    // BAD: this is not something to parallelize with OpenMP
    for (int i = 0; i < 100; i++) {
       if (i > 50) break;    // breaking when i is greater than 50
    }

    // BAD: something else not to parallelize with OpenMP
    for (int i = 0; i < 100; i++) {
       if (i == 50) i = 0;   // resetting i when it equals 50
    }

Parallelizing Loops - Fortran

          PROGRAM VECTOR_ADD
          USE OMP_LIB
          PARAMETER (N=100)
          INTEGER N, I
          REAL A(N), B(N), C(N)
          CALL OMP_SET_DYNAMIC (.FALSE.)    ! ensures use of all available threads
          CALL OMP_SET_NUM_THREADS (20)     ! sets number of available threads to 20
    ! Initialize arrays A and B.
          DO I = 1, N
             A(I) = I * 1.0
             B(I) = I * 2.0
          ENDDO
    ! Compute values of array C in parallel.
    !$OMP PARALLEL SHARED(A, B, C), PRIVATE(I)
    !$OMP DO
          DO I = 1, N
             C(I) = A(I) + B(I)
          ENDDO
    !$OMP END DO [nowait]
    !     ... some more instructions
    !$OMP END PARALLEL
          PRINT *, C(10)
          END

Parallelizing Loops - C/C++

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
       float a[N], b[N], c[N];
       int i;

       omp_set_dynamic(0);       // ensures use of all available threads
       omp_set_num_threads(20);  // sets number of available threads to 20

       /* Initialize arrays a and b. */
       for (i = 0; i < N; i++) {
          a[i] = i * 1.0;
          b[i] = i * 2.0;
       }

       /* Compute values of array c in parallel. */
       #pragma omp parallel shared(a, b, c) private(i)
       {
          #pragma omp for /* nowait (optional) */
          for (i = 0; i < N; i++)
             c[i] = a[i] + b[i];
       }

       printf("%f\n", c[10]);
       return 0;
    }

Parallel Loop Scheduling
● Scheduling refers to how iterations are assigned to a particular thread.
● There are 4 schedule types:
– static: each thread is assigned a chunk of iterations in a round-robin fashion
– dynamic: each thread starts with a chunk and is assigned a new chunk of iterations, first come, first served, as it finishes
– guided: iterations are divided into pieces that successively decrease in size (roughly exponentially), with chunk being the smallest size
– runtime: the schedule is deferred to run time and is set using the OMP_SCHEDULE environment variable
● Limitations:
– only one schedule type may be used for a given loop
– the chunk size applies to all threads
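
A small C sketch (not from the slides) of the runtime schedule type; the array name and contents are illustrative. The actual schedule is picked up from OMP_SCHEDULE when the program starts, e.g. by launching it as OMP_SCHEDULE="guided,4" ./a.out:

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
       double a[N];
       for (int i = 0; i < N; i++) a[i] = i;

       /* schedule(runtime): the schedule type and chunk size come from
          the OMP_SCHEDULE environment variable, not from the source.  */
       #pragma omp parallel for schedule(runtime)
       for (int i = 0; i < N; i++)
          a[i] *= 2.0;

       printf("a[10] = %f\n", a[10]);
       return 0;
    }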

Comparison of Loop Schedules

Static and Dynamic
Iterations: 100   Threads: 5   Chunk size: 3

Static iterations assigned (chunks handed out in round-robin fashion, repeating until the iterations are exhausted):
  Thread 0: 1-3, 16-18, 31-33, ...
  Thread 1: 4-6, 19-21, 34-36, ...
  Thread 2: 7-9, 22-24, 37-39, ...
  Thread 3: 10-12, 25-27, 40-42, ...
  Thread 4: 13-15, 28-30, 43-45, ...

Dynamic iterations assigned: each thread is handed the next chunk of 3 iterations as it completes its current chunk, so the assignment depends on which thread finishes first.

Guided
Iterations: 100   Threads: 5   Chunk size: 3
(each chunk is the remaining iteration count divided by the number of threads, never smaller than the chunk size)

Round 1 iterations assigned:
  Thread 0: 100/5              = 20
  Thread 1: (100-20)/5 = 80/5  = 16
  Thread 2: (80-16)/5  = 64/5  = 12
  Thread 3: (64-12)/5  = 52/5  = 10
  Thread 4: (52-10)/5  = 42/5  =  8

Round 2 iterations assigned (the list of threads seeking work is effectively reversed, because threads with fewer iterations should finish sooner):
  Thread 4: (42-8)/5   = 34/5  =  6
  Thread 3: (34-6)/5   = 28/5  =  5
  Thread 2: (28-5)/5   = 23/5  =  4
  Thread 1: (23-4)/5   = 19/5  =  3
  Thread 0: (19-3)/5   = 16/5  =  3   (chunk size limit reached!)

Round 3 iterations assigned:
  Thread 0: (13 left) gets 3
  Thread 1: (10 left) gets 3
  Thread 2: ( 7 left) gets 3
  Thread 3: ( 4 left) gets 3
  Thread 4: ( 1 left) gets 1   done

Parallel Loop Scheduling - Example

Fortran:

    !$OMP PARALLEL SHARED(A, B, C), PRIVATE(I)
    !$OMP DO SCHEDULE (DYNAMIC, 4)
          DO I = 1, N
             C(I) = A(I) + B(I)
          ENDDO
    !$OMP END PARALLEL

C/C++:

    #pragma omp parallel shared(a, b, c) private(i)
    {
       #pragma omp for schedule (guided, 4)
       for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];
    }

The first argument of the schedule clause is the schedule type; the second is the (optional) chunk of iterations.

Parallel Loop Ordering
● Ordering refers to ensuring that a particular section inside a loop executes in the same order as it would if the loop were run sequentially.
● It is its own kind of scheduling, since it can't use static, dynamic, or guided schedules.
● In general this forces a serialization of threads, so the loss of efficiency is minimized when a high percentage of the work done per iteration lies outside the ordered block; the cost can be absorbed if most of each iteration's time is spent in calculation before the ordered block.

    #pragma omp parallel shared(a, b, c) private(i)
    {
       #pragma omp for ordered
       for (i = 0; i <= 99; i++) {
          //
          // a lot of non-order-dependent calculations
          // ...
          #pragma omp ordered
          {
             // some *really* quick set of instructions
          }
       }
    }
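
A complete, runnable C sketch (not from the slides) in the same spirit: the sqrt call stands in for the expensive, order-independent work, and only the printf is serialized in loop order. Compile with something like cc -fopenmp -lm:

    #include <stdio.h>
    #include <math.h>
    #include <omp.h>

    #define N 16

    int main(void) {
       double result[N];

       #pragma omp parallel for ordered schedule(dynamic)
       for (int i = 0; i < N; i++) {
          /* Expensive, order-independent work runs in parallel. */
          result[i] = sqrt((double)i) * 100.0;

          /* Only this small block is executed serially, in loop order. */
          #pragma omp ordered
          printf("i = %2d  result = %f\n", i, result[i]);
       }
       return 0;
    }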

Parallel Loop Ordering (example of 3 threads executing 3 iterations simultaneously)
[Diagram: Thread 0 (i = 1), Thread 1 (i = 2), and Thread 2 (i = 3) each perform lots of non-dependent calculations in parallel, then pass through the ordered block serially, one after another.]
Note: provided that the ordered part of the loop is sufficiently small compared to the non-order-dependent section of each iteration, the cost of the serialization per N threads can be minimized.

OpenMP Environment Variables
● OMP_NUM_THREADS – informs the runtime of the number of threads to use.
● OMP_SCHEDULE – applies to PARALLEL DO and work-sharing DO directives that have a schedule type of RUNTIME.
● OMP_DYNAMIC – enables or disables dynamic adjustment of the number of threads available for the execution of parallel regions.
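
A short C sketch (not from the slides) that reports what these environment variables resolved to at run time; omp_get_schedule requires OpenMP 3.0 or later. Try running it as, e.g., OMP_NUM_THREADS=8 OMP_SCHEDULE="dynamic,4" ./a.out:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
       omp_sched_t kind;
       int chunk;

       printf("max threads      : %d\n", omp_get_max_threads());  /* OMP_NUM_THREADS */
       printf("dynamic adjust   : %d\n", omp_get_dynamic());      /* OMP_DYNAMIC     */

       omp_get_schedule(&kind, &chunk);                            /* OMP_SCHEDULE    */
       printf("runtime schedule : kind=%d chunk=%d\n", (int)kind, chunk);
       return 0;
    }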

Parallel Sections
● Parallel sections allow blocks of code to be assigned to individual threads; each block is executed once.
● This provides finer-grained control than a basic parallel loop, because in a loop each iteration is essentially identical.
● There is an implicit barrier at the end of the sections construct, unless nowait is specified.

Parallel Sections - Examples

Fortran:

          PROGRAM SECTIONS
          USE OMP_LIB
          INTEGER SQUARE
          INTEGER X, Y, Z, XS, YS, ZS
          CALL OMP_SET_DYNAMIC (.FALSE.)
          CALL OMP_SET_NUM_THREADS (3)
          X = 2
          Y = 3
          Z = 5
    !$OMP PARALLEL
    !$OMP SECTIONS
    !$OMP SECTION
          XS = SQUARE(X)
          PRINT *, "ID = ", OMP_GET_THREAD_NUM(), "XS =", XS
    !$OMP SECTION
          YS = SQUARE(Y)
          PRINT *, "ID = ", OMP_GET_THREAD_NUM(), "YS =", YS
    !$OMP SECTION
          ZS = SQUARE(Z)
          PRINT *, "ID = ", OMP_GET_THREAD_NUM(), "ZS =", ZS
    !$OMP END SECTIONS
    !$OMP END PARALLEL
          END

          INTEGER FUNCTION SQUARE(N)
          INTEGER N
          SQUARE = N*N
          END

C/C++:

    #include <stdio.h>
    #include <omp.h>

    int square(int n);

    int main(void) {
       int x, y, z, xs, ys, zs;

       omp_set_dynamic(0);
       omp_set_num_threads(3);

       x = 2;
       y = 3;
       z = 5;

       #pragma omp parallel
       {
          #pragma omp sections
          {
             #pragma omp section
             {
                xs = square(x);
                printf("id = %d, xs = %d\n", omp_get_thread_num(), xs);
             }
             #pragma omp section
             {
                ys = square(y);
                printf("id = %d, ys = %d\n", omp_get_thread_num(), ys);
             }
             #pragma omp section
             {
                zs = square(z);
                printf("id = %d, zs = %d\n", omp_get_thread_num(), zs);
             }
          }
       }
       return 0;
    }

    int square(int n) { return n*n; }

Parallel Sections Example

    #pragma omp sections
    {
       #pragma omp section
       {
          xs = square(x);
          printf("id = %d, xs = %d\n", omp_get_thread_num(), xs);
       }
       #pragma omp section
       {
          ys = square(y);
          printf("id = %d, ys = %d\n", omp_get_thread_num(), ys);
       }
       #pragma omp section
       {
          zs = square(z);
          printf("id = %d, zs = %d\n", omp_get_thread_num(), zs);
       }
    }

[Diagram: threads 0, 1, and 2 each execute one of the three sections concurrently between t = 0 and t = 1.]

Singling Out Threads
● OpenMP allows blocks to be defined that are executed only by the master thread (master), or only by the first thread to arrive at the block (single).
● This saves time and overhead when one needs to revert to a single thread for a particular code section.
● Unless nowait is specified, there is an implied barrier for all threads at the end of a single block; the master block carries no implied barrier.

Master:
    !$OMP MASTER
      block
    !$OMP END MASTER

Single:
    !$OMP SINGLE
      block
    !$OMP END SINGLE
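
A short C sketch (not from the slides) of both constructs in one parallel region; the printf messages are illustrative:

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
       #pragma omp parallel
       {
          /* Executed only by the master thread (thread 0); no barrier. */
          #pragma omp master
          printf("master: set-up done by thread %d\n", omp_get_thread_num());

          /* Explicit barrier so the set-up is finished before anyone proceeds. */
          #pragma omp barrier

          /* Executed by whichever thread arrives first; the others wait
             at the implicit barrier at the end of the single block. */
          #pragma omp single
          printf("single: I/O done by thread %d\n", omp_get_thread_num());
       }
       return 0;
    }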

Mutual Exclusion
● Mutual exclusion refers to restricting the execution of a block of code, or the updating of a shared variable, to only one thread at a time.
● In other words, it forces all threads to execute a particular block of code serially.
● Mutual exclusion constructs, when used properly, provide for safe and efficient code execution.
● OpenMP provides two types of constructs to implement mutual exclusion in one's code:
– critical sections
– atomic blocks

Critical Sections
● A block of code defined inside a critical section will be executed by each thread, one thread at a time.
● Critical sections make threads wait, so OpenMP allows the creation of named critical sections; critical sections with different names can be executed by different threads at the same time.
● The following code allows any one of 3 threads to take turns executing in a critical section; if timed properly, idle time for each thread is minimized by this interleaving.

    #pragma omp critical(a)
    {
       // some code
    }
    #pragma omp critical(b)
    {
       // some code
    }
    #pragma omp critical(c)
    {
       // some code
    }

An unnamed critical section:

    #pragma omp critical
    {
       // some code
    }

Critical Section - Example

    /* Compute values of array c in parallel. */
    #pragma omp parallel shared(a, b, c) private(i)
    {
       #pragma omp critical
       {
          //
          // do stuff (one thread at a time)
          //
       }
    }

[Diagram: threads 0, 1, and 2 pass through the critical section one at a time, at t = 1, t = 2, and t = 3.]
Note: thread order is not guaranteed!

Named Critical Sections Example

    #include <stdio.h>
    #include <omp.h>

    #define N 100

    int main(void) {
       float a[N], b[N], c[3] = {0.0, 0.0, 0.0};
       int i;

       /* Initialize arrays a and b. */
       for (i = 0; i < N; i++) {
          a[i] = i * 1.0;
          b[i] = i * 2.0;
       }

       /* Compute values of array c in parallel. */
       #pragma omp parallel shared(a, b, c) private(i)
       {
          #pragma omp critical(a)
          {
             for (i = 0; i < N; i++) c[0] += a[i] + b[i];
             printf("%f\n", c[0]);
          }
          #pragma omp critical(b)
          {
             for (i = 0; i < N; i++) c[1] += a[i] + b[i];
             printf("%f\n", c[1]);
          }
          #pragma omp critical(c)
          {
             for (i = 0; i < N; i++) c[2] += a[i] + b[i];
             printf("%f\n", c[2]);
          }
       }
       return 0;
    }

Each critical section is run by a single thread at a time, but multiple threads may be active at the same time, each in a differently named section.

Named Critical Sections Example

    #pragma omp critical(a)
    {
       // some code
    }
    #pragma omp critical(b)
    {
       // some code
    }
    #pragma omp critical(c)
    {
       // some code
    }

[Diagram: at t = 1, 2, and 3, threads 0, 1, and 2 rotate through the three named sections, so all three threads stay busy at every step.]
Note: thread order is not guaranteed!

Atomic Variable Updating
● Forces the atomic updating of a shared variable by a single thread at a time.
● This serializes updates to a particular variable and protects it from multiple writes occurring at once; atomic updates can be considered “mini” critical sections.
● The example below, run without the atomic protection of count, would allow all threads to update the shared variable at once, causing unpredictable results.
● There is no such thing as a named atomic section to improve efficiency through interleaving, so if this is desired, use critical sections.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
       int count = 0;

       #pragma omp parallel shared(count)
       {
          #pragma omp atomic
          count++;
       }
       printf("Number of threads: %d\n", count);
       return 0;
    }

Note: thread order is not guaranteed!
