Shared Memory Programming with OpenMP Javier Delgado Grid-Enablement of Scientific Applications Professor S. Masoud Sadjadi
Outline Motivation for OpenMP Basics Work Sharing Constructs Synchronization Data Sharing and Scope Example Program
Motivation The message passing model is not optimized for shared memory: it is hard to code and is "all or nothing". Traditional threading libraries are not suitable either: they are overly complicated and have little Fortran support.
Brief History ANSI X3H5: not formally adopted, only basic parallelism support (e.g., loops). Pthreads: too complicated for HPC applications, little support for Fortran. Custom/proprietary solutions: not portable. OpenMP: improve upon X3H5, keeping scientific applications in mind.
Outline Motivation for OpenMP Basics What it is Design Goals Model Work Sharing Constructs Synchronization Data Sharing and Scope Example Program
OpenMP – What is it? An API for multi-threaded, shared-memory parallelism, consisting of compiler directives, runtime library routines, and environment variables. It is an abstraction of low-level threading constructs, optimized for HPC, with extensions for Fortran, C, and C++. OpenMP provides an abstraction over pthreads that is optimized for, and allows easier coding of, scientific applications. (A minimal example combining the three components follows.)
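The sketch below is a minimal illustration, not taken from the original slides, of the three API components just listed: a compiler directive (#pragma omp parallel), runtime library routines (omp_get_thread_num, omp_get_num_threads), and an environment variable (OMP_NUM_THREADS, read here only to show where it comes in). The build command is an assumption for a GCC-style compiler.

    /* hello_omp.c -- minimal sketch of the three OpenMP API components.
       Build (assumption): gcc -fopenmp hello_omp.c -o hello_omp */
    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        /* Environment variable: OMP_NUM_THREADS controls the team size. */
        const char *env = getenv("OMP_NUM_THREADS");
        printf("OMP_NUM_THREADS = %s\n", env ? env : "(not set)");

        /* Compiler directive: fork a team of threads for this block. */
        #pragma omp parallel
        {
            /* Runtime library routines: query thread id and team size. */
            int tid = omp_get_thread_num();
            int nthreads = omp_get_num_threads();
            printf("Hello from thread %d of %d\n", tid, nthreads);
        }   /* implicit join: the team synchronizes and only the master continues */

        return 0;
    }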
Design Goals Leanness: a simple and limited set of directives. Incremental parallelization of serial applications. Simplicity for implementing scientific applications.
Model Shared-memory, thread-based parallelism. The programmer has full control. Fork-join execution pattern. In the OpenMP model, a shared memory process consists of multiple threads. The programmer has full control of the parallelism of the application and is therefore responsible for parallelism-related issues such as synchronization, although some of this is taken care of implicitly by the API, as we will see later when the constructs are discussed.
Fork-Join Model Here we see two diagrams portraying how the fork-join model works. The one on the left-hand side shows the flow diagram of a simple program. You have a master thread, which is the original one spawned by the program. When a parallelizable region of code is encountered, it forks different parts of the job to separate threads. The set of threads is known as the "team." When the parallel region is over, the threads synchronize and terminate, which is known as the "join" stage. Afterwards, the master continues working serially. You can immediately notice that this model allows code to be incrementally parallelized. The diagram on the right-hand side simply shows a real-life analogy. source: http://www.mhpcc.edu source: http://dimsboiv.uqac.ca
Fork-Join Model All threads execute the parallel region. I/O atomicity and synchronization are the programmer's problem. If one thread fails in the parallel region, they all do. (A sketch of the fork-join pattern follows.)
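A short sketch of the fork-join flow described above (hypothetical code, not from the slides): work before the parallel region runs on the master thread only, the region itself is executed by the whole team, and execution joins back to the master afterwards.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("serial part: master thread only\n");    /* before the fork */

        #pragma omp parallel                            /* FORK: a team of threads */
        {
            printf("parallel region: thread %d\n", omp_get_thread_num());
        }                                               /* JOIN: implicit barrier,
                                                           the team is disbanded */

        printf("serial part again: master thread only\n");
        return 0;
    }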
Outline Motivation for OpenMP Basics Work Sharing Constructs Loops Sections Synchronization Data Sharing and Scope Example Program
Loops Distribute loop iterations amongst the threads of the team: #pragma omp for [clause ...] Clauses: SCHEDULE, NOWAIT, ORDERED. (A sketch follows.)
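A hedged sketch of the loop work-sharing construct: the iterations of the for loop are divided among the threads of the enclosing parallel team. The array size and loop body are made up for illustration.

    #include <stdio.h>

    #define N 1000

    int main(void)
    {
        double a[N], b[N], c[N];

        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        #pragma omp parallel            /* fork a team of threads */
        {
            #pragma omp for nowait      /* iterations are split across the team;
                                           NOWAIT removes the barrier at the end
                                           of the loop (the parallel region still
                                           ends with an implicit barrier) */
            for (int i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }                               /* implicit barrier at the join */

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }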
Scheduling The SCHEDULE clause describes the mapping of iterations to threads. Types: STATIC: divide evenly amongst threads. DYNAMIC: assign iterations as threads become available. GUIDED: assign dynamically, with exponentially declining "chunk" size. RUNTIME: decide according to an environment variable. OpenMP provides several methods for distributing iterations to threads when executing loops in parallel. The most basic method is STATIC, which has the least overhead: all assignments are made before execution of the parallel region begins. This is the best choice if you know beforehand that the load will be well balanced. DYNAMIC scheduling assigns iterations to threads as the threads become available. This adds overhead, but can result in faster computation time if the load is not well balanced. GUIDED scheduling provides the best load balancing when the loads are not balanced, but should not be abused since it has the most overhead. (A sketch of the clause syntax follows.)
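A sketch of the SCHEDULE clause, assuming a loop whose iterations have uneven cost (the work function is a made-up placeholder); the commented alternatives show the other schedule kinds.

    #include <stdio.h>

    /* Hypothetical uneven workload: cost grows with i. */
    static double do_work(int i)
    {
        double s = 0.0;
        for (int k = 0; k <= i; k++) s += k * 1e-6;
        return s;
    }

    int main(void)
    {
        enum { N = 1000 };
        static double out[N];

        /* DYNAMIC: chunks of 4 iterations are handed to threads as they
           finish their previous chunk.  Alternatives:
             schedule(static, 4)  -- fixed round-robin assignment, least overhead
             schedule(guided, 4)  -- shrinking chunks, 4 is the minimum size
             schedule(runtime)    -- kind read from the OMP_SCHEDULE variable */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < N; i++)
            out[i] = do_work(i);

        printf("out[N-1] = %f\n", out[N - 1]);
        return 0;
    }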
Static Scheduling [figure: timeline of Threads 1-3 executing chunks of 2 iterations; source: http://navet.ics.hawaii.edu/~casanova]
Dynamic Scheduling [figure: timeline of Threads 1-3 executing chunks of 2 iterations; source: http://navet.ics.hawaii.edu/~casanova]
Guided Scheduling [figure: timeline of Threads 1-3 with a starting chunk size of 2 iterations; note the changing chunk sizes (red borders); source: http://navet.ics.hawaii.edu/~casanova]
Sections Allow the programmer to specify sections of code that can be executed concurrently. Example (pseudocode): wake_up; SECTIONS; SECTION make_coffee || make_tea; cook_cereal; END SECTIONS; eat_breakfast. (A C sketch follows.)
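A hedged C sketch of the sections construct using the breakfast analogy above; the task functions are hypothetical placeholders. Each SECTION is executed exactly once, by some thread of the team, and different sections may run concurrently.

    #include <stdio.h>

    /* Hypothetical tasks from the breakfast analogy. */
    static void make_coffee(void) { printf("making coffee\n"); }
    static void cook_cereal(void) { printf("cooking cereal\n"); }

    int main(void)
    {
        printf("wake up\n");                 /* serial: master thread only */

        #pragma omp parallel sections
        {
            #pragma omp section
            make_coffee();                   /* one thread runs this section */

            #pragma omp section
            cook_cereal();                   /* another thread may run this
                                                one at the same time */
        }                                    /* implicit barrier */

        printf("eat breakfast\n");           /* serial again */
        return 0;
    }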
WORKSHARE Defines a section of code where each unit of work (e.g., each array statement) can be executed by a different thread. Fortran only. Example: vector operations on entire arrays, C(1:N) = A(1:N) + B(1:N).
Outline Motivation for OpenMP Basics Work Sharing Constructs Synchronization Data Sharing and Scope Example Program
Synchronization The programmer is responsible for the correctness of shared variables. Example: if x (initially 0) is updated by two threads at the same time, it can end up with a value of 1 instead of 2: shared int x; fork(); each thread then executes x = x + 1, i.e. a read of x followed by a write. The problem here is the same as with any concurrent or distributed system where data is shared and can potentially be modified at the same time.
Synchronization Solution 1: MASTER or SINGLE directive, so that only one thread executes the "critical" portion of code. Solution 2: CRITICAL or ATOMIC directive, so that only one thread executes it at a time. The simplest, but usually not optimal, solution to the synchronization problem is to use the MASTER or SINGLE directive. This forces the portion of code to run on only one thread. Another option is to use the CRITICAL directive, which ensures only one thread executes the piece of code at once; all other threads that arrive at that section block until the current thread has finished it. The ATOMIC directive is a one-statement CRITICAL section. (A sketch follows.)
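A hedged sketch of the race from the previous slide and its fix. Without protection the two updates can interleave and lose a write; CRITICAL (or, for a single update statement like this one, ATOMIC) serializes them.

    #include <stdio.h>

    int main(void)
    {
        int x = 0;                       /* shared by all threads */

        #pragma omp parallel num_threads(2)
        {
            /* An unprotected 'x = x + 1' here could lose an update (result 1). */

            #pragma omp critical         /* only one thread at a time */
            x = x + 1;

            /* Equivalent for a single update statement:
               #pragma omp atomic
               x = x + 1;                                                    */
        }

        printf("x = %d\n", x);           /* 2 with 2 threads */
        return 0;
    }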
Other Synchronization Directives Barrier: force synchronization. Flush: require a consistent view of memory. Ordered: execute loop iterations in order. When a thread encounters a barrier, it blocks until all others have reached it as well; either all or none of the threads must execute the BARRIER. FLUSH specifies a point at which a consistent view of memory must appear to all threads. ORDERED requires that the enclosed portion of the loop be executed in sequential iteration order (by default this does not necessarily hold, nor do we know which threads are going to be assigned to which processor/core). (A sketch follows.)
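A hedged sketch of BARRIER and ORDERED (FLUSH is omitted here, since it rarely appears directly in user code). The two-phase structure and the prints are made up for illustration.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        #pragma omp parallel
        {
            printf("phase 1 on thread %d\n", omp_get_thread_num());

            #pragma omp barrier          /* every thread waits here until the
                                            whole team has finished phase 1 */

            /* The ordered clause plus directive force the marked part of
               each iteration to run in sequential iteration order. */
            #pragma omp for ordered
            for (int i = 0; i < 8; i++)
            {
                /* ... work that may run out of order ... */
                #pragma omp ordered
                printf("iteration %d printed in order\n", i);
            }
        }
        return 0;
    }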
Outline Motivation for OpenMP Basics Work Sharing Constructs Synchronization Data Sharing and Scope Example Program
Variable Scope Shared memory -> variables are shared by default (usually). Shared by default: globals, i.e. file-scope and static variables. Private by default: loop indices, and stack variables in subroutines called from parallel regions.
Data Scope Attributes Shared: all threads access and modify the same variable. Private: a new object is created for each thread. FirstPrivate: same, but each copy is initialized with the master thread's value. LastPrivate: same, but the final value is assigned back to the master's variable upon completion of the parallel region. Reduction: after execution, perform a (specified) reduction and assign its value to a variable. Etc. Here are just a few of the data scope attributes that may be assigned to a variable. Shared variables are shared by all threads, so you have to deal with synchronization yourself. If a variable is declared PRIVATE, a new object of the same type is created for each thread. With FIRSTPRIVATE, each thread's copy starts with the value held by the master thread. With LASTPRIVATE, upon finishing the parallel region, the final value of the private variable (from the sequentially last iteration or section) is assigned to the master's variable. REDUCTION is a special type that performs a reduction operation, e.g. a summation or a logical operation, on all the threads' values for that variable and assigns the result to the variable in the master thread. (A sketch follows.)
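A hedged sketch of the private, firstprivate, lastprivate, and reduction clauses; the variable names and loop body are made up for illustration.

    #include <stdio.h>

    int main(void)
    {
        enum { N = 100 };
        int scale = 2;        /* firstprivate: copied into each thread */
        int last  = 0;        /* lastprivate: value from the final iteration */
        int tmp   = 0;        /* private: per-thread scratch, uninitialized */
        int sum   = 0;        /* reduction: per-thread partial sums combined */

        #pragma omp parallel for private(tmp) firstprivate(scale) \
                                 lastprivate(last) reduction(+:sum)
        for (int i = 0; i < N; i++)
        {
            tmp  = i * scale;     /* uses this thread's own tmp and scale */
            last = tmp;           /* after the loop, holds the i = N-1 value */
            sum += tmp;           /* partial sums are added together at the end */
        }

        printf("sum = %d, last = %d\n", sum, last);   /* 9900 and 198 */
        return 0;
    }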
Outline Motivation for OpenMP Basics Work Sharing Constructs Synchronization Data Sharing and Scope Example Program
Example Program

      program calc_pi
      integer n, i
      double precision w, x, sum, pi, f, a
      double precision start, finish, timef
      f(a) = 4.0 / (1.0 + a*a)        ! statement function: integrand of pi
      n = 100000000
      w = 1.0 / n
      sum = 0.0
      start = timef()                 ! timef() is assumed to be the vendor timing routine
!$OMP PARALLEL PRIVATE(x,i), SHARED(w,n), &
!$OMP REDUCTION(+:sum)
!$OMP DO
      do i = 1, n
         x = w * (i - 0.5)
         sum = sum + f(x)
      end do
!$OMP END DO
!$OMP END PARALLEL
      finish = timef()
      pi = w * sum
      print*, "value of pi, time taken:"
      print*, pi, finish - start
      end
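Assuming gfortran, this could be built and run as, e.g., gfortran -fopenmp calc_pi.f90 -o calc_pi and then OMP_NUM_THREADS=4 ./calc_pi. Note that timef() is a vendor extension (e.g. IBM/Intel Fortran); with other compilers a portable alternative is the OpenMP runtime routine omp_get_wtime().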
Disadvantages Scalability of the shared memory architecture: hardware limitations and, to an extent, software (OS) limitations. Price of shared memory supercomputers. There is a limit to the number of cores that can be physically put into a system. Furthermore, scalability issues must be dealt with in shared-memory architectures, since all processors need to access the same memory, which could lead to saturation of the bus. It is also necessary to modify the operating system to accommodate such massive computers; vendors of supercomputers often commit these changes to the Linux kernel and publish them. Cost is also a factor. Although commodity quad-core computers are already available today, there is not a whole lot you can do with just 4 cores, which means you need to seek custom solutions, which are very expensive, to get very powerful shared-memory computers. Here are two examples (next slide)...
SM Cost Examples IBM System p: 16-way processor @ 2.1 GHz, 73 GB storage, 8 GB memory; price: $473,770.00 (source: commercial vendor). Sun Fire E25K server: 16 UltraSPARC IV+ @ 1.8 GHz, 2 x 73 GB storage, 64 GB memory; price: $1,125,047.00 (source: Sun website). Prices obtained on April 28, 2008.