OpenMP
Martin Kruliš, Jiří Dokulil

OpenMP
- OpenMP Architecture Review Board: Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy, …
- specifications (freely available):
  - 1.0 – separate C/C++ and FORTRAN versions
  - 2.0 – separate C/C++ and FORTRAN versions
  - 2.5 – combined C/C++ and FORTRAN
  - 3.0 – combined C/C++ and FORTRAN
  - 4.0 – combined C/C++ and FORTRAN (July 2013)

Basics
- fork–join model
- tailored mostly for large array operations
- pragmas: #pragma omp …
- only a few constructs
- programs should run (serially) even without OpenMP
  - possible but not enforced by the standard
  - OpenMP-specific code can be guarded by #ifdef _OPENMP

Simple example

#define N (1024*1024)
int* data = new int[N];
for (int i = 0; i < N; ++i) {
    data[i] = i;
}

Simple example – cont.

#define N (1024*1024)
int* data = new int[N];
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    data[i] = i;
}

Another example

int sum;
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    sum += data[i];
}

WRONG: sum is uninitialized, and all threads update it without synchronization (a data race).

Variable scope
- shared – one instance for all threads
- private – one instance for each thread
- reduction – special variant for reduction operations
- valid within the lexical extent
  - no effect in called functions

Variable scope – private
- default for the loop control variable
  - only for the parallelized loop; other control variables should (probably always) be made private explicitly
  - all loops in Fortran
- default for all variables declared within the parallelized block
- default for all non-static variables in called functions
  - allocated on the stack – private for each thread
- values are uninitialized at the start of the block and undefined after the block
  - except for classes: the default constructor is used (and must be accessible)
- private instances may not be shared among the threads

Variable scope – private

int j;
#pragma omp parallel for private(j)
for (int i = 0; i < N/2; ++i) {
    j = i * 2;
    data[j] = i;
    data[j+1] = i;
}

Variable scope – reduction
- performing e.g. a sum of an array
  - cannot be done with only a private variable
  - a shared variable requires explicit synchronization
  - a combination of both is possible and (relatively) efficient, but unnecessarily complex
- each thread works on a private copy
  - initialized to a default value (0 for +, 1 for *, …)
- final results are joined and available to the master thread

Variable scope – reduction

long long sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i) {
    sum += data[i];
}

Variable scope – firstprivate and lastprivate
- values of private variables at the start of the block and after the end of the block are undefined
- firstprivate
  - all copies are initialized to the value of the master thread
- lastprivate
  - the variable after the parallelized block is set to the value of the last iteration (last in the serial order)

parallel
- #pragma omp parallel
- launches threads and executes the block in parallel
- modifiers:
  - if (scalar expression)
  - variable scope modifiers (including reduction)
  - num_threads
- especially useful in conjunction with omp_get_thread_num

Loop-level parallelism
- #pragma omp parallel for
  - launches threads and executes the loop in parallel
  - can be nested
- #pragma omp for
  - parallel loop within another parallel block
  - no (direct) nesting
- only "simple" for expressions are allowed
- implicit barrier at the end

Loop-level parallelism – modifiers 1
- variable scope modifiers
- nowait – removes the barrier
  - cannot be used with #pragma omp parallel for
- ordered – the loop (or a called function) may contain a block marked #pragma omp ordered
  - such a block is executed in the same order as in a serial execution of the loop
  - at most one such block may exist

Loop-level parallelism – modifiers 2
- schedule
  - schedule(static[, chunk_size]) – round robin
    - no chunk size → an equal share for each thread
  - schedule(dynamic[, chunk_size]) – threads request chunks
    - default chunk size is 1
  - schedule(guided[, chunk_size]) – like dynamic, with the chunk size proportional to the amount of remaining work, but at least chunk_size
    - default chunk size is 1
  - schedule(auto) – selected by the implementation
  - schedule(runtime) – use the default value stored in the variable def-sched-var

Parallel sections
- #pragma omp sections
- #pragma omp section
- several blocks of code that should be evaluated in parallel
- modifiers:
  - private, firstprivate, lastprivate, reduction
  - nowait

Single
- #pragma omp single
- code is executed by only one thread of the team
- modifiers:
  - private, firstprivate
  - nowait – when not used, there is a barrier at the end of the block
  - copyprivate – the final value of the variable is distributed to all threads in the team after the block is executed
    - incompatible with nowait

Workshare
- Fortran only…

SUBROUTINE A11_1(AA, BB, CC, DD, EE, FF, N)
  INTEGER N
  REAL AA(N,N), BB(N,N), CC(N,N), DD(N,N), EE(N,N), FF(N,N)
!$OMP PARALLEL
!$OMP WORKSHARE
  AA = BB
  CC = DD
  EE = FF
!$OMP END WORKSHARE
!$OMP END PARALLEL
END SUBROUTINE A11_1

Master
- #pragma omp master
- the block is executed only by the master thread

Critical section
- #pragma omp critical [name]
- the well-known critical section
  - at most one thread at a time can execute a critical section with a given name
- multiple pragmas with the same name form one section
  - names have external linkage
  - all unnamed pragmas form one section

Barrier
- #pragma omp barrier
- no associated block of code
- some restrictions on placement; e.g. the following is invalid:

if (a < 10)
    #pragma omp barrier
{
    do_something();
}

Atomic
- #pragma omp atomic
- followed by an expression in the form x op= expr
  - op is one of +, *, -, /, &, ^, |, <<, >>
  - expr must not reference x
- also x++, ++x, x--, --x

Flush
- #pragma omp flush (variable list)
- makes the thread's view of the variables consistent with the main memory
  - the variable list may be omitted – all variables are flushed
- similar to volatile in C/C++
- influences memory operation reordering that can be performed by the compiler
  - reads/writes of a flushed variable cannot be moved to the other "side" of the flush operation
- all values of flushed variables are saved to the memory before the flush finishes
- the first read of a flushed variable after the flush is performed from the main memory
- same placement restrictions as barrier

threadprivate
- #pragma omp threadprivate(list)
- makes a global variable private for each thread
- complex restrictions apply

copyin, copyprivate
- copyin(list)
  - copies the value of a threadprivate variable from the master thread to the other members of the team
  - used as a modifier of #pragma omp parallel
  - values are copied at the start of the block
- copyprivate(list)
  - copies the value of one thread's threadprivate variable to all other members of the team
  - used as a modifier of #pragma omp single
  - values are copied at the end of the block

Task
- new in OpenMP 3.0
- #pragma omp task
- a piece of code to be executed in parallel, immediately or later
  - the if clause forces immediate execution when false
- tied or untied (to a thread)
  - a task can be suspended, e.g. by launching a nested task
- modifiers:
  - default, private, firstprivate, shared
  - untied
  - if

Task scheduling points
- after explicit generation of a task
- after the last instruction of a task region
- in taskwait regions
- in implicit and explicit barriers
- (almost) anywhere in untied tasks

Taskwait
- #pragma omp taskwait
- waits for completion of all child tasks generated since the start of the current task

Functions
- omp_set_num_threads, omp_get_max_threads
  - number of threads used for parallel regions without a num_threads clause
- omp_get_num_threads
  - number of threads in the current team
- omp_get_thread_num
  - number of the calling thread within the team (0 = master)
- omp_get_num_procs
  - number of processors available to the program

Functions – cont.
- omp_in_parallel
  - checks whether the caller is in an active parallel region
  - an active region is one without an if clause, or one whose if condition was true
- omp_set_dynamic, omp_get_dynamic
  - dynamic adjustment of the number of threads on/off
- omp_set_nested, omp_get_nested
  - nested parallelism on/off

Locks
- plain and nested: omp_lock_t, omp_nest_lock_t
- omp_init_lock, omp_init_nest_lock
  - initializes the lock
- omp_destroy_lock, omp_destroy_nest_lock
  - uninitializes the lock; it must be unlocked
- omp_set_lock, omp_set_nest_lock
  - locks the lock (which must be initialized)
  - blocks until the lock is acquired
- omp_unset_lock, omp_unset_nest_lock
  - unlocks the lock; it must be locked and owned by the calling thread
- omp_test_lock, omp_test_nest_lock
  - like set, but does not block

Timing routines
- double omp_get_wtime()
  - wall clock time in seconds since some "time in the past"
  - may not be consistent between threads
- double omp_get_wtick()
  - number of seconds between successive clock ticks of the timer used by omp_get_wtime

Environment variables
- OMP_NUM_THREADS
  - number of threads launched in parallel regions
  - see omp_set_num_threads, omp_get_max_threads
- OMP_SCHEDULE
  - used in loops with schedule(runtime)
  - e.g. "guided,4", "dynamic"
- OMP_DYNAMIC
  - set if the implementation may change the number of threads
  - true or false; see omp_set_dynamic, omp_get_dynamic
- OMP_NESTED
  - controls nested parallelism
  - true or false, default is false

Nesting of regions
- some limitations use the notion of "close nesting": no #pragma omp parallel nested between the two regions
- a "work-sharing region" is for, sections, single, (workshare)
- a work-sharing region may not be closely nested inside a work-sharing, critical, ordered, or master region
- a barrier region may not be closely nested inside a work-sharing, critical, ordered, or master region
- a master region may not be closely nested inside a work-sharing region
- an ordered region may not be closely nested inside a critical region
- an ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause
- a critical region may not be nested (closely or otherwise) inside a critical region with the same name
  - note that this restriction is not sufficient to prevent deadlock

OpenMP 4.0
- the newest version (June 2013), no implementations yet
- thread affinity
  - proc_bind(master | close | spread)
- SIMD support
  - explicit loop vectorization (by SSE, AVX, …)
- user-defined reductions
  - #pragma omp declare reduction (identifier : typelist : combiner-expr) [initializer-clause]
- atomic operations with sequential consistency (seq_cst)

OpenMP 4.0 – accelerator support
- Xeon Phi cards, GPUs, …
- #pragma omp target – offloads computation
  - device(idx)
  - map(variable map)
- #pragma omp target update