Introduction to OpenMP
Credits: www.sdsc.edu/~allans/cs260/lectures/OpenMP.ppt and www.mgnet.org/~douglas/Classes/cs521-s02/...openmp/MPI-OpenMP.ppt


Introduction to OpenMP

OpenMP Introduction

Credits: OpenMP Homepage:

Module Objectives
Introduction to the OpenMP standard. After completion, users should be equipped to implement OpenMP constructs in their applications and realize performance improvements on shared-memory machines.

Definition
Parallel Computing: computing multiple things simultaneously. Usually this means computing different parts of the same problem simultaneously. In scientific computing, it often means decomposing a domain into several sub-domains and computing a solution on each sub-domain separately and simultaneously (or nearly so).

Types of Parallelism
–Perfect (a.k.a. embarrassing, trivial) parallelism: Monte-Carlo methods, cellular automata
–Data parallelism: domain decomposition, dense matrix multiplication
–Task parallelism: pipelining. Monte-Carlo? Cellular automata?

Performance Measures
–Peak performance: theoretical upper bound on performance.
–Sustained performance: highest consistently achievable speed.
–MHz: million cycles per second.
–MIPS: million instructions per second.
–Mflops: million floating-point operations per second.
–Speedup: sequential run time divided by parallel run time.
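A quick worked example with made-up numbers: if a code runs in 64 seconds sequentially and in 10 seconds on 8 processors, its speedup is 64 / 10 = 6.4 (the ideal speedup on 8 processors would be 8).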

Parallelism Issues
–Programming notation
–Algorithms and data structures
–Load balancing
–Problem size
–Communication
–Portability
–Scalability

Getting your feet wet

Memory Types
(Diagram: distributed memory – each CPU has its own memory; shared memory – all CPUs attach to a single memory.)

Clustered SMPs (symmetric multiprocessors)
(Diagram: multi-socket and/or multi-core SMP nodes, each with its own memory, connected by a cluster interconnect network.)

Distributed vs. Shared Memory
Shared – all processors share a global pool of memory
–simpler to program
–bus contention leads to poor scalability
Distributed – each processor physically has its own (private) memory
–scales well
–memory management is more difficult

What is OpenMP?
–OpenMP is a portable, multiprocessing API for shared-memory computers.
–OpenMP is not a “language”. Instead, OpenMP specifies a notation as part of an existing language (Fortran, C) for parallel programming on a shared-memory machine.
–Portable across different architectures.
–Scalable as more processors are added.
–Easy to convert sequential code to parallel.

Why should I use OpenMP?

Where should I use OpenMP?

OpenMP Specification
OpenMP consists of three main parts:
–Compiler directives, used by the programmer to communicate with the compiler
–A runtime library, which enables the setting and querying of parallel parameters
–Environment variables, which can be used to define a limited number of runtime parameters
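A minimal sketch (file name and thread count are illustrative) that touches all three parts: a compiler directive, runtime-library calls, and an environment variable supplied at run time.

/* parts.c -- build with an OpenMP compiler switch, run with e.g. OMP_NUM_THREADS=4 */
#include <stdio.h>
#include <omp.h>

int main() {
    /* runtime library: query the hardware */
    printf("available processors: %d\n", omp_get_num_procs());

    /* compiler directive: create a team of threads */
    #pragma omp parallel
    {
        /* team size is controlled by the OMP_NUM_THREADS environment variable */
        printf("hello from one of %d threads\n", omp_get_num_threads());
    }
    return 0;
}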

OpenMP Example Usage (1 of 2)
(Diagram: annotated source is fed to the OpenMP compiler; a compiler switch selects whether a sequential or a parallel program is produced.)

OpenMP Example Usage (2 of 2)
If you give the sequential switch,
–comments and pragmas are ignored.
If you give the parallel switch,
–comments and/or pragmas are read, and
–cause translation into a parallel program.
Ideally, one source serves as both the sequential and the parallel program (a big maintenance plus).

Simple OpenMP Program
–Most OpenMP constructs are compiler directives or pragmas.
–The focus of OpenMP is to parallelize loops.
–OpenMP offers an incremental approach to parallelism.

OpenMP Programming Model
OpenMP is a shared-memory model. The workload is distributed among threads.
–Variables can be shared among all threads or duplicated and made private to each thread.
–Threads communicate by sharing variables.
Unintended sharing of data can lead to race conditions:
–race condition: when the program’s outcome changes as the threads are scheduled differently.
To control race conditions:
–use synchronization (Chapter Four) to protect data conflicts.
Careless use of synchronization can lead to deadlocks (Chapter Four).

OpenMP Execution Model
Fork-join model of parallel execution:
–Begin execution as a single process (master thread).
–Start of a parallel construct: the master thread creates a team of threads.
–End of a parallel construct: threads in the team wait until all team work has been completed.
–Only the master thread continues execution.

The Basic Idea

OpenMP directive format in C
–#pragma directives are defined by the C standard as a mechanism for compiler-specific tasks, e.g. ignore errors, generate special code.
–A #pragma must be ignored if not understood; thus, SOME OpenMP programs can be compiled for sequential OR parallel execution.
–Typically, OpenMP directives are enabled by a compiler option.
OpenMP pragma usage (case sensitive):
#pragma omp directive_name [ clause [ clause ]... ] CR
Conditional compilation:
#ifdef _OPENMP
printf("%d avail. processors\n", omp_get_num_procs());
#endif
Include file for library routines:
#ifdef _OPENMP
#include <omp.h>
#endif

Microsoft Visual Studio OpenMP Option

Other Compilers
–Intel (icc, ifort, icpc): -openmp
–PGI (pgcc, pgf90, …): -mp
–GNU (gcc, gfortran, g++): -fopenmp (needs version 4.2 or later)
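For example, with the GNU compilers a typical build-and-run sequence might look like this (the file name is hypothetical):

gcc -fopenmp hello_omp.c -o hello_omp
export OMP_NUM_THREADS=4
./hello_omp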

OpenMP parallel region construct
–Block of code to be executed by multiple threads in parallel.
–Each thread executes the same code redundantly!
C/C++:
#pragma omp parallel [ clause [ clause ]... ] CR
{ structured-block }
clause can be either or both of the following:
–private(comma-separated identifier-list)
–shared(comma-separated identifier-list)
If there is no private/shared list, shared is assumed for all variables.
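A small sketch of the clauses in use (variable names are illustrative): n is visible to the whole team, while each thread gets its own copy of tid.

#include <stdio.h>
#include <omp.h>

int main() {
    int n = 100;   /* shared: one copy, seen by every thread */
    int tid;       /* private: each thread gets its own (uninitialized) copy */

    #pragma omp parallel private(tid) shared(n)
    {
        tid = omp_get_thread_num();
        printf("thread %d sees n = %d\n", tid, n);
    }
    return 0;
}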


Communicating Among Threads
Shared memory model:
–Threads read and write shared variables; no need for explicit message passing.
–Change a variable's storage attribute to private to minimize synchronization and improve cache reuse, because private variables are duplicated in every team member.

Storage Model – Data Scoping
Shared-memory programming model: variables are shared by default.
Global variables are SHARED among threads:
–C: file-scope variables, static variables
Private variables:
–exist only within the scope of each thread, i.e. they are uninitialized and undefined outside the data scope
–loop index variables
–stack variables in subprograms called from parallel regions
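A sketch of these default scopes (names are illustrative): counter is a file-scope variable and therefore shared; local, a stack variable in a routine called from the parallel region, is private to each thread.

#include <stdio.h>
#include <omp.h>

int counter = 7;                /* file scope: SHARED among threads */

void work(int tid) {
    int local = tid * 10;       /* stack variable in a called routine: PRIVATE */
    printf("thread %d: local = %d, shared counter = %d\n", tid, local, counter);
}

int main() {
    #pragma omp parallel
    {
        work(omp_get_thread_num());
    }
    return 0;
}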

OpenMP -- example
#include <stdio.h>

int main() {
    // Do this part in parallel
    printf( "Hello, World!\n" );
    return 0;
}

OpenMP -- example
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf( "Hello, World!\n" );
    }
    return 0;
}

OpenMP environment variables
OMP_NUM_THREADS
–sets the number of threads to use during execution
–when dynamic adjustment of the number of threads is enabled, the value of this environment variable is the maximum number of threads to use
setenv OMP_NUM_THREADS 16 [csh, tcsh]
export OMP_NUM_THREADS=16 [sh, ksh, bash]
At runtime: omp_set_num_threads(6)

OpenMP runtime library
omp_get_num_threads function
–Returns the number of threads currently in the team executing the parallel region from which it is called.
–C/C++: int omp_get_num_threads(void);
omp_get_thread_num function
–Returns the thread number within the team, which lies between 0 and omp_get_num_threads()-1, inclusive. The master thread of the team is thread 0.
–C/C++: int omp_get_thread_num(void);
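A short sketch combining the two calls:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();        /* 0 .. team size - 1 */
        int nthreads = omp_get_num_threads();  /* size of the current team */
        printf("Hello from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}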

Programming Model - Fork/Join
Output (with four threads): Hello…WorldWorldWorldWorld!

int main() {
    // serial region
    printf("Hello…");

    // parallel region (fork)
    #pragma omp parallel
    {
        printf("World");
    }

    // serial again (join)
    printf("!");
}

Programming Model – Thread Identification
Master thread:
–Thread with ID = 0
–Only thread that exists in sequential regions
–Depending on the implementation, may have a special purpose inside parallel regions
–Some special directives affect only the master thread (like master)
Other threads in a team have IDs 1..N-1.

Run-time Library: Timing
There are 2 portable timing routines.
omp_get_wtime
–portable wall-clock timer; returns a double-precision value that is the number of elapsed seconds from some point in the past
–gives time per thread – possibly not globally consistent
–difference two readings to get elapsed time in code
omp_get_wtick
–time between ticks (timer resolution) in seconds
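A sketch of the usual pattern: take two wall-clock readings and difference them (the loop body is just a placeholder).

#include <stdio.h>
#include <omp.h>

int main() {
    double start = omp_get_wtime();      /* first reading */

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        /* ... work to be timed ... */
    }

    double end = omp_get_wtime();        /* second reading */
    printf("elapsed: %f s (timer resolution %g s)\n", end - start, omp_get_wtick());
    return 0;
}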

Loop Constructs
Because the use of parallel followed by a loop construct is so common, this shorthand notation is often used (note: the directive should be followed immediately by the loop):
#pragma omp parallel for [ clause [ clause ]... ] CR
for ( ; ; ) { }
Subsets of the iterations are assigned to each thread in the team.

Programming Model – Concurrent Loops
OpenMP easily parallelizes loops
–Requirement: no data dependencies between iterations!
The preprocessor calculates loop bounds for each thread directly from the serial source:

#pragma omp parallel for
for( i=0; i < 25; i++ ) {
    printf("Foo");
}

Sequential Matrix Multiply
for( i=0; i<n; i++ )
    for( j=0; j<n; j++ ) {
        c[i][j] = 0.0;
        for( k=0; k<n; k++ )
            c[i][j] += a[i][k]*b[k][j];
    }

OpenMP Matrix Multiply
#pragma omp parallel for
for( i=0; i<n; i++ )
    for( j=0; j<n; j++ ) {
        c[i][j] = 0.0;
        for( k=0; k<n; k++ )
            c[i][j] += a[i][k]*b[k][j];
    }

OpenMP parallel for directive
clause can be one of the following:
–private( list )
–shared( list )
–default( none | shared | private )
–if( Boolean expression )
–reduction( operator : list )
–schedule( type [, chunk ] )
–nowait
–num_threads( N )
There is an implicit barrier at the end of for unless nowait is specified. If nowait is specified, threads do not synchronize at the end of the parallel loop.
The schedule clause specifies how iterations of the loop are divided among the threads of the team.
–Default is implementation dependent.
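A sketch of the nowait clause on a work-sharing for loop inside a parallel region (array names and sizes are illustrative). Because the second loop does not read a, it is safe to let threads start it without waiting at the end of the first loop.

#include <omp.h>

#define N 1000
double a[N], b[N];

int main() {
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * i;            /* no barrier after this loop */

        #pragma omp for
        for (int j = 0; j < N; j++)
            b[j] = (double) j * j;     /* implicit barrier at the end */
    }
    return 0;
}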

OpenMP parallel/for directive
#pragma omp parallel private(f)
{
    f = 7;
    #pragma omp for
    for (i=0; i<20; i++)
        a[i] = b[i] + f * (i+1);   // i is private; a, b are shared
} /* omp end parallel */

Default Clause
Note that the default storage attribute is DEFAULT(SHARED).
To change the default: DEFAULT(PRIVATE)
–each variable in the static extent of the parallel region is made private as if specified by a private clause
–mostly saves typing
DEFAULT(NONE): no default; you must list the storage attribute for each variable. USE THIS!

If Clause
if (Boolean expression)
–executes the parallel region normally (in parallel) if the expression is true; otherwise it executes the parallel region serially
–used to test whether there is sufficient work to justify the overhead of creating and terminating a parallel region

Conditional Parallelism: Example
for( i=0; i<n; i++ ) {
    #pragma omp parallel for if( n-i > 100 )
    for( j=i+1; j<n; j++ )
        for( k=i+1; k<n; k++ )
            a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
}

Data model
Private and shared variables:
–Variables in the global data space are accessed by all parallel threads (shared variables).
–Variables in a thread's private space can only be accessed by that thread (private variables).
–There are several variations, depending on the initial values and whether the results are copied outside the region.

#pragma omp parallel for private( privIndx, privDbl )
for ( i = 0; i < arraySize; i++ ) {
    for ( privIndx = 0; privIndx < 16; privIndx++ ) {
        privDbl = ( (double) privIndx ) / 16;
        y[i] = sin( exp( cos( - exp( sin(x[i]) ) ) ) ) + cos( privDbl );
    }
}
The parallel for loop index is private by default.

Reduction Variables
#pragma omp parallel for reduction( op : list )
–op is one of +, *, -, &, ^, |, &&, or ||
–The variables in list must be used with this operator in the loop.
–The variables are automatically initialized to sensible values.

The reduction clause
sum = 0.0;
#pragma omp parallel for default(none) shared(n, x) reduction(+ : sum)
for (int i = 0; i < n; i++)
    sum = sum + x[i];
–A private instance of sum is allocated to each thread.
–Each thread performs a local sum.
–Before terminating, each thread adds its local sum to the global sum variable.

Programming Model – Loop Scheduling
The schedule clause determines how loop iterations are divided among the thread team.
–static([chunk]) divides iterations statically between threads. Each thread receives [chunk] iterations, rounding as necessary to account for all iterations. Default [chunk] is ceil( # iterations / # threads ).
–dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes. Forms a logical work queue consisting of all loop iterations. Default [chunk] is 1.
–guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation.

Loop scheduling

Programming Model – Loop Scheduling (static)
#pragma omp parallel for schedule(static)
for( i=0; i<16; i++ ) {
    doIteration(i);
}

// Equivalent manual static scheduling (T = number of threads, tid = thread id)
int chunk = 16/T;
int base = tid * chunk;
int bound = (tid+1)*chunk;
for( i=base; i<bound; i++ ) {
    doIteration(i);
}

Programming Model – Loop Scheduling (dynamic)
#pragma omp parallel for schedule(dynamic)
for( i=0; i<16; i++ ) {
    doIteration(i);
}

// Equivalent manual dynamic scheduling (work-queue sketch)
int current_i;
while( workLeftToDo() ) {
    current_i = getNextIter();
    doIteration(current_i);
}

OpenMP sections directive
Several blocks are executed in parallel.
C/C++:
#pragma omp sections [ clause [ clause ]... ] new-line
{
    [#pragma omp section new-line]
    structured-block1
    [#pragma omp section new-line
    structured-block2]
    ...
}

OpenMP sections directive – example
#pragma omp parallel
{
    #pragma omp sections
    {
        { a=...; b=...; }      // first section (the section pragma is optional here)
        #pragma omp section
        { c=...; d=...; }
        #pragma omp section
        { e=...; f=...; }
        #pragma omp section
        { g=...; h=...; }
    } /*omp end sections*/
} /*omp end parallel*/

The omp sections clause - example

Threadprivate Private variables are private on a parallel region basis. Threadprivate variables are global variables that are private throughout the execution of the program.

Threadprivate
#pragma omp threadprivate( list )
Example: #pragma omp threadprivate( x )
Achieving the same effect in POSIX threads requires a program change: an array of size p, accessed as x[pthread_self()], which is costly if accessed frequently. It is not cheap in OpenMP either.

Threadprivate
Makes global data private to each thread
–C: file-scope and static variables
Different from making them PRIVATE:
–with PRIVATE, global scope is lost
–THREADPRIVATE preserves global scope for each thread
Threadprivate variables can be initialized using the COPYIN clause.
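A small sketch of threadprivate with copyin (names are illustrative): counter keeps a per-thread value, and copyin seeds every thread's copy from the master's value at the start of the parallel region.

#include <stdio.h>
#include <omp.h>

int counter = 0;                       /* file-scope (global) variable ...  */
#pragma omp threadprivate(counter)     /* ... made private to each thread   */

int main() {
    counter = 42;                      /* master thread's copy */

    #pragma omp parallel copyin(counter)
    {
        counter += omp_get_thread_num();   /* each thread starts from 42 */
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}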

Master structured block
–Only the master thread (thread 0) executes the block.
–The rest of the team skips the section and continues execution from the end of the master block.
–There is no barrier at the end (or start) of the master section.
–The worksharing construct omp single is similar in behavior but has an implied barrier at the end; single is performed by any one thread.
Syntax:
#pragma omp master
{ …… }
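A sketch contrasting the two constructs (the printed text is illustrative):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        #pragma omp master
        {
            printf("master only (thread %d), no barrier afterwards\n",
                   omp_get_thread_num());
        }

        #pragma omp single
        {
            printf("any one thread (here %d), implied barrier afterwards\n",
                   omp_get_thread_num());
        }
    }
    return 0;
}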

Ordered Structured Block
The enclosed code is executed in the same order as would occur in sequential execution of the loop.
Directive:
#pragma omp ordered
{ ….. }
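A sketch of ordered in use; note that the enclosing loop directive also needs the ordered clause.

#include <stdio.h>

int main() {
    #pragma omp parallel for ordered
    for (int i = 0; i < 8; i++) {
        /* ... work that may finish in any order ... */
        #pragma omp ordered
        {
            printf("iteration %d\n", i);   /* printed in sequential order 0..7 */
        }
    }
    return 0;
}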

OpenMP synchronization
Implicit barrier – all threads in a team wait for all threads to reach the barrier point
–beginning and end of parallel constructs
–end of all other control constructs
–the barrier can be removed with the nowait clause
Explicit: critical – only one thread at a time may execute a critical region.

OpenMP critical directive
Enclosed code is
–executed by all threads, but
–restricted to only one thread at a time.
C/C++:
#pragma omp critical [ ( name ) ] new-line
structured-block
A thread waits at the beginning of a critical region until no other thread in the team is executing a critical region with the same name. All unnamed critical directives map to the same unspecified name.

OpenMP critical – C/C++ example
cnt = 0;
f = 7;
#pragma omp parallel
{
    #pragma omp for
    for (i=0; i<20; i++) {
        if (b[i] == 0) {
            #pragma omp critical
            {
                cnt++;
            }
        } /* endif */
        a[i] = b[i] + f * (i+1);
    } /* end for */
} /* omp end parallel */

Clauses by Directive Table