Shared Memory Programming with OpenMP

Shared Memory Programming with OpenMP
Javier Delgado
Grid Enablement of Scientific Applications
Professor S. Masoud Sadjadi

Outline
- Motivation for OpenMP
- Basics
- Work Sharing Constructs
- Synchronization
- Data Sharing and Scope
- Example Program

Motivation
- Message Passing Model not optimized for shared memory
  - Hard to code
  - "All or nothing"
- Traditional threading libraries not suitable
  - Overly complicated
  - Little Fortran support

Brief History
- ANSI X3H5
  - Not formally adopted
  - Only basic parallelism support (i.e. loops)
- Pthreads
  - Too complicated for HPC applications
  - Little support for Fortran
- Custom/proprietary solutions
  - Not portable
- OpenMP – improve upon X3H5, keeping scientific applications in mind

Outline
- Motivation for OpenMP
- Basics
  - What it is
  - Design Goals
  - Model
- Work Sharing Constructs
- Synchronization
- Data Sharing and Scope
- Example Program

OpenMP – What is it?
- API for multi-threaded, shared memory parallelism
  - Compiler directives
  - Runtime library routines
  - Environment variables
- Abstraction of low-level threading constructs
- Optimized for HPC
- Extensions for Fortran, C, and C++
OpenMP provides an abstraction over Pthreads that is optimized for, and allows easier coding of, scientific applications.
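
To make the three ingredients concrete, here is a minimal sketch (mine, not from the slides) combining a compiler directive, runtime library routines, and an environment variable; the thread count of 4 is just an example.

    /* hello_omp.c -- compile with: gcc -fopenmp hello_omp.c -o hello_omp
       run with e.g.:  OMP_NUM_THREADS=4 ./hello_omp   (environment variable) */
    #include <stdio.h>
    #include <omp.h>                      /* runtime library routines */

    int main(void)
    {
        #pragma omp parallel              /* compiler directive: fork a team of threads */
        {
            int id = omp_get_thread_num();    /* runtime routine: my thread id */
            int n  = omp_get_num_threads();   /* runtime routine: team size */
            printf("Hello from thread %d of %d\n", id, n);
        }                                 /* implicit join at the end of the region */
        return 0;
    }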

Design Goals
- Leanness: simple and limited set of directives
- Incremental parallelism of serial applications
- Simplicity for implementing scientific applications

Model
- Shared memory, thread-based parallelism
- Programmer has full control
- Fork-join execution pattern
In the OpenMP model, a shared memory process consists of multiple threads. The programmer has full control of the parallelism of the application and is therefore responsible for parallelism-related issues such as synchronization, although some of this is taken care of implicitly by the API, as we will see later when discussing the constructs.

Fork-Join Model
[Two diagrams: the flow diagram of a simple program (source: http://www.mhpcc.edu) and a real-life analogy (source: http://dimsboiv.uqac.ca)]
The flow diagram shows a master thread, the original one spawned by the program. When a parallelizable region of code is encountered, the master forks different parts of the job to separate threads; the set of threads is known as the "team." When the parallel region is over, the threads synchronize and terminate, which is known as the "join" stage. Afterwards, the master continues working serially. Notice that this model allows code to be parallelized incrementally. The second diagram simply shows a real-life analogy.

Fork-Join Model
- All threads execute the parallel region
- I/O atomicity and synchronization are the programmer's problem
- If one thread fails in the parallel region, they all do

Outline
- Motivation for OpenMP
- Basics
- Work Sharing Constructs
  - Loops
  - Sections
- Synchronization
- Data Sharing and Scope
- Example Program

Loops
- Distribute iterations amongst threads
- #pragma omp for [clause ...]
- Clauses:
  - SCHEDULE
  - NOWAIT
  - ORDERED
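
A minimal sketch (mine, not from the slides) of the loop work-sharing construct: the iterations of the enclosed loop are divided amongst the threads of the surrounding parallel region.

    void scale(double *a, int n, double factor)
    {
        #pragma omp parallel                   /* create the team of threads */
        {
            #pragma omp for schedule(static)   /* split the iterations amongst the team */
            for (int i = 0; i < n; i++)
                a[i] = a[i] * factor;
        }   /* implicit barrier at the end of the for construct (unless NOWAIT is given) */
    }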

Scheduling
- The SCHEDULE clause describes the mapping of threads to iterations
- Types
  - STATIC – divide iterations evenly amongst threads
  - DYNAMIC – assign iterations as they become available
  - GUIDED – dynamically reassign, with exponentially declining "chunk" size
  - RUNTIME – divide according to an environment variable
OpenMP provides several methods for distributing loop iterations to threads. The most basic is STATIC, which has the least overhead: all assignments are made before execution of the parallel region begins. This is the best choice if you know beforehand that the load will be well balanced. DYNAMIC scheduling assigns iterations to threads as the threads become available; this adds overhead, but can result in faster computation when the load is not well balanced. GUIDED scheduling provides the best load balancing for unbalanced loads, but should not be abused since it has the most overhead.

Static Scheduling
[Diagram: chunk size of 2 iterations, assigned to Threads 1-3 over time; source: http://navet.ics.hawaii.edu/~casanova]

Dynamic Scheduling
[Diagram: chunk size of 2 iterations, Threads 1-3 over time; source: http://navet.ics.hawaii.edu/~casanova]

Guided Scheduling
[Diagram: Threads 1-3 over time, initial chunk size of 2 iterations; note the changing chunk sizes (red borders); source: http://navet.ics.hawaii.edu/~casanova]
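
For illustration, a hedged fragment (the loop bound n and the per-iteration work() routine are hypothetical) showing how the SCHEDULE clause is written, including the RUNTIME variant driven by the OMP_SCHEDULE environment variable:

    void work(int i);    /* hypothetical per-iteration task */

    void process(int n)
    {
        /* DYNAMIC with a chunk size of 2: each idle thread grabs the next
           two iterations, which helps when iteration costs vary */
        #pragma omp parallel for schedule(dynamic, 2)
        for (int i = 0; i < n; i++)
            work(i);

        /* RUNTIME: kind and chunk size are read from the OMP_SCHEDULE
           environment variable, e.g. OMP_SCHEDULE="guided,4" */
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < n; i++)
            work(i);
    }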

Sections
- Allow the programmer to specify sections of code that can be executed concurrently
- Example:

    wake_up
    SECTIONS
      SECTION
        make_coffee || make_tea
      SECTION
        cook_cereal
    END SECTIONS
    eat_breakfast
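
In C, the same breakfast idea looks roughly like this (a sketch; the task routines are placeholders taken from the pseudocode above). Each SECTION is handed to a different thread, and the sections run concurrently:

    void wake_up(void); void make_coffee(void); void cook_cereal(void); void eat_breakfast(void);

    void breakfast(void)
    {
        wake_up();
        #pragma omp parallel sections
        {
            #pragma omp section
            make_coffee();               /* one thread makes the hot drink (or tea) */

            #pragma omp section
            cook_cereal();               /* another thread cooks the cereal at the same time */
        }                                /* implicit barrier: both sections are done */
        eat_breakfast();
    }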

Workshare
- Define a section of code where each line can be executed by a different processor
- Fortran only
- Example: vector operations on entire arrays

    C(1:N) = A(1:N) + B(1:N)

Outline
- Motivation for OpenMP
- Basics
- Work Sharing Constructs
- Synchronization
- Data Sharing and Scope
- Example Program

Synchronization
- Programmer is responsible for correctness of shared variables
- Example: if two threads update x at the same time, it can end up with a value of 1 instead of 2:

    shared int x
    fork()
    x = x + 1        x = x + 1      (one update per thread)
    read(x)

The problem here is the same as with any distributed system where data is shared and can potentially be modified at the same time.
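
A runnable C illustration of the same race (mine, not from the slides): the read-modify-write of x is not atomic, so updates from different threads can be lost.

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        int x = 0;
        #pragma omp parallel shared(x)
        {
            x = x + 1;        /* data race: two threads may read the same old value */
        }
        /* with 2 threads we expect 2, but a lost update can leave x == 1 */
        printf("x = %d\n", x);
        return 0;
    }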

Synchronization
- Solution 1: MASTER or SINGLE directive
  - Only one thread executes the "critical" portion of code
- Solution 2: CRITICAL or ATOMIC directive
  - Only one thread executes at a time
The simplest, but usually not optimal, solution to the synchronization problem is the MASTER or SINGLE directive, which forces the portion of code to run on only one thread. Another option is the CRITICAL directive, which ensures that only one thread executes the piece of code at once; all other threads that arrive at that section block until the current thread has finished it. The ATOMIC directive is a one-line CRITICAL section.
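
A sketch (mine) of the two directive-based fixes applied to the shared counter from the previous slide:

    int x = 0;

    #pragma omp parallel shared(x)
    {
        #pragma omp critical      /* only one thread at a time may enter this block */
        {
            x = x + 1;
        }

        #pragma omp atomic        /* single-statement update performed atomically,
                                     usually cheaper than a full critical section */
        x += 1;
    }
    /* afterwards x == 2 * (number of threads): no updates are lost */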

Other Synchronization Directives
- BARRIER – force synchronization
- FLUSH – require a consistent view of memory
- ORDERED – execute loop iterations in order
When a thread encounters a barrier, it blocks until all others have reached it as well; either all or none of the threads must execute the BARRIER section. FLUSH specifies a point at which a consistent view of memory must appear for all threads. ORDERED requires that iterations in the enclosed loop be executed sequentially (by default this does not necessarily hold, nor do we know which threads will run on which processor/core).
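
A small self-contained sketch (mine; the two-phase pattern is illustrative) of an explicit barrier separating a write phase from a read phase:

    #include <stdio.h>
    #include <omp.h>

    #define N 4
    int partial[N];                           /* shared, zero-initialized */

    int main(void)
    {
        int total = 0;
        #pragma omp parallel num_threads(N)   /* requests N threads (a request, not a guarantee) */
        {
            int id = omp_get_thread_num();
            partial[id] = id + 1;             /* phase 1: each thread writes its own slot */

            #pragma omp barrier               /* no thread continues until all slots are written */

            #pragma omp single                /* phase 2: one thread combines the results */
            {
                for (int i = 0; i < N; i++)
                    total += partial[i];
                printf("total = %d\n", total);
            }
        }
        return 0;
    }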

Outline
- Motivation for OpenMP
- Basics
- Work Sharing Constructs
- Synchronization
- Data Sharing and Scope
- Example Program

Variable Scope
- Shared memory -> shared variables ... by default ... usually
- Globals (shared):
  - File-scope variables
  - static variables
- Privates:
  - Loop index
  - Stack variables in subroutines called from parallel regions
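
A short C sketch (mine) of the default rules listed above:

    int counter = 0;                  /* file-scope variable: shared by all threads */

    void helper(void)
    {
        int tmp = 0;                  /* stack variable in a routine called from the
                                         parallel region: private to the calling thread */
        tmp++;
    }

    void example(int n)
    {
        static int calls = 0;         /* static variable: shared */
        calls++;

        #pragma omp parallel for
        for (int i = 0; i < n; i++)   /* loop index i: private to each thread */
            helper();
    }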

Data Scope Attributes
- SHARED – all threads modify the same variable
- PRIVATE – a new object is created for each thread
- FIRSTPRIVATE – same, but each copy is initialized from the master thread's value
- LASTPRIVATE – same, but the final value is assigned to the master's variable upon completion of the parallel region
- REDUCTION – after execution, perform a (specified) reduction and give its value to a variable
- etc.
These are just a few of the data scope attributes that may be assigned to a variable. SHARED variables are shared by all threads, so you have to deal with synchronization yourself. If a variable is declared PRIVATE, a new object of the same type is created for each thread. With FIRSTPRIVATE, each thread's copy starts with the value from the master thread. With LASTPRIVATE, upon finishing the parallel region, the final value of the private variable is assigned to the variable at the master. REDUCTION is a special type that performs a reduction operation, e.g. a summation or a logical operation, on all the threads' values of that variable and assigns the result to a variable in the main program (i.e. the master thread).
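
A hedged C sketch (mine; variable names and values are illustrative) showing several of the clauses on a single loop:

    void demo(void)
    {
        int base = 10;                /* copied into each thread by FIRSTPRIVATE */
        int last = 0, sum = 0;

        #pragma omp parallel for firstprivate(base) lastprivate(last) reduction(+:sum)
        for (int i = 0; i < 100; i++) {
            int v = base + i;         /* each thread starts with its own base == 10 */
            sum += v;                 /* REDUCTION: per-thread partial sums are combined at the join */
            last = v;                 /* LASTPRIVATE: the value from the final iteration (i == 99) is kept */
        }
        /* afterwards: sum == 5950 and last == 109 */
    }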

Outline
- Motivation for OpenMP
- Basics
- Work Sharing Constructs
- Synchronization
- Data Sharing and Scope
- Example Program

Example Program

    program calc_pi
      integer n, i
      double precision w, x, sum, pi, f, a
      double precision start, finish, timef
      f(a) = 4.0d0 / (1.0d0 + a*a)

      n = 100000000
      w = 1.0d0 / n
      sum = 0.0d0

      start = timef()      ! timef(): vendor timing routine (ms); omp_get_wtime() is the portable alternative
    !$OMP PARALLEL PRIVATE(x,i), SHARED(w,n), &
    !$OMP REDUCTION(+:sum)
    !$OMP DO
      do i = 1, n
        x = w * (i - 0.5d0)
        sum = sum + f(x)
      end do
    !$OMP END DO
    !$OMP END PARALLEL
      finish = timef()

      pi = w * sum
      print *, "value of pi, time taken:", pi, finish - start
    end
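
For comparison, a hedged C translation of the same computation (mine, not from the slides), using the standard omp_get_wtime() routine for timing:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 100000000;
        const double w = 1.0 / n;
        double sum = 0.0;

        double start = omp_get_wtime();

        /* i and x are private automatically (loop index and block-scope variable);
           w and n are shared; sum is combined by the reduction clause */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= n; i++) {
            double x = w * (i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }

        double elapsed = omp_get_wtime() - start;
        printf("value of pi, time taken: %.12f  %.3f s\n", w * sum, elapsed);
        return 0;
    }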

Disadvantages
- Scalability of shared memory architectures
  - Hardware limitations
  - Software (OS) limitations (to an extent)
- Price of shared memory supercomputers
There is a limit to the number of cores that can be physically put into a system. Furthermore, scalability issues must be dealt with in shared-memory architectures since all processors need to access the same memory, which could saturate the bus. It is also necessary to modify the operating system to accommodate such massive machines; supercomputer vendors often commit these changes to the Linux kernel and publish them. Cost is also a factor: although commodity quad-core computers are already available today, there is not a whole lot you can do with just 4 cores, which means you need to seek custom, very expensive solutions to get very powerful shared-memory computers. Here are two examples (next slide)...

SM Cost Examples
- IBM System p
  - 16-way processor @ 2.1 GHz
  - 73 GB storage
  - 8 GB memory
  - Price: $473,770.00 (source: commercial vendor)
- Sun Fire E25K server
  - 16 UltraSPARC IV+ @ 1.8 GHz
  - 2 x 73 GB storage
  - 64 GB memory
  - Price: $1,125,047.00 (source: Sun website)
Prices obtained on April 28, 2008.