OpenMP Martin Kruliš.

About OpenMP (http://www.openmp.org)
- An API for multi-threaded, shared-memory parallelism
- Supported directly by compilers, for C/C++ and Fortran
- Activated by a compiler switch (e.g., g++ -fopenmp)
- Three components:
  - Compiler directives (pragmas)
  - Runtime library (functions)
  - Environment variables (defaults, runtime configuration)
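
For illustration, a minimal hello-world sketch (not from the slides) that touches all three components: the parallel pragma, the omp_get_thread_num()/omp_get_num_threads() runtime calls, and the OMP_NUM_THREADS environment variable. Compile with g++ -fopenmp:

    #include <cstdio>
    #include <omp.h>

    int main() {
        // Honors OMP_NUM_THREADS if it is set in the environment.
        #pragma omp parallel
        {
            std::printf("Hello from thread %d of %d\n",
                        omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }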

History
- Specification versions:
  - 1.0 – separate C/C++ and Fortran versions (1997–1998)
  - 2.0 – separate C/C++ and Fortran versions (2000–2002)
  - 2.5 – combined C/C++ and Fortran (2005)
  - 3.0 – combined C/C++ and Fortran (2008)
  - 4.0 – combined C/C++ and Fortran (2013)
  - 4.5 – current, widely available (2015)
  - 5.0 – newest version, not yet widely supported

Threading Model
- Fork/join model: a "team" of threads is forked at the start of a parallel region and joined at its end
- Nested (recursive) fork/joins may optionally be allowed
- Threads are allocated dynamically, but a thread pool may be used to optimize their management

Pragmas – Basic Syntax

    #pragma omp directive [clause, …]
    (block) statement

- The code should still work (serially) without the pragmas
- Pragmas may be used to:
  - spawn a parallel region
  - divide the workload among threads
  - serialize sections of code
  - synchronize threads

Parallel Region
- Spawning a thread team
- Implicit barrier at the end of the block

    #pragma omp parallel    // spawns one thread per CPU core by default
    {
        int tid = omp_get_thread_num();  // thread index within its team (master == 0)
        // use tid to do the thread's bidding
    }

- No branching or gotos that would make the program jump into or out of a parallel block
  - Regular branching and function calls are OK

Variable Scope
- Private – a copy per each thread
- Shared – all threads share the same variable
  - Synchronization may be required

    int x, y, id;
    #pragma omp parallel private(id), shared(x,y)
    {
        id = x + omp_get_thread_num();
        ...
    }

Variable Scope
- Private scope
  - Private by default: variables declared inside the parallel block and all non-static variables in called functions
  - Values are not initialized (neither at the beginning nor at the end of the block)
    - except for class instances (a default constructor must be accessible)
- Other scopes and additional clauses will be presented later

Work-sharing Constructs
- Divide the work within a parallel block

Work-sharing Constructs – For-loop

    #pragma omp parallel        // spawns the threads
    {
        #pragma omp for         // divides the for-loop workload among the threads
        for (int i = 0; i < N; ++i) { ... }
    }

- Additional clauses of the for directive:
  - schedule(strategy) – scheduling (work-division) strategy: static, dynamic, guided, runtime, auto
  - collapse(n) – encompasses n nested for-loops
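
A hypothetical sketch of how the two clauses might be combined; fill_matrix and its per-element work are invented for illustration:

    // Dynamic scheduling in chunks of 16 iterations, with two nested loops
    // collapsed into a single iteration space that is divided among threads.
    void fill_matrix(double* m, int rows, int cols) {
        #pragma omp parallel
        {
            #pragma omp for schedule(dynamic, 16) collapse(2)
            for (int i = 0; i < rows; ++i)
                for (int j = 0; j < cols; ++j)
                    m[i * cols + j] = 0.5 * i + j;   // any independent per-element work
        }
    }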

Work-sharing Constructs – For-loop
- #pragma omp ordered
  - Marks a block inside the for-loop that must be executed in exactly the same order as if the code were serial
- #pragma omp parallel for
  - Shorthand that both creates the parallel block and applies it to a for-loop
  - May take clauses of both the parallel and the for directives
  - Probably the most frequently used construct in OpenMP
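
A hypothetical sketch combining parallel for with an ordered block; print_squares is an invented example:

    #include <cstdio>

    // The iterations run in parallel, but the prints are emitted in the
    // original (serial) iteration order thanks to the ordered block.
    void print_squares(int n) {
        #pragma omp parallel for ordered
        for (int i = 0; i < n; ++i) {
            long long sq = (long long)i * i;      // computed in parallel
            #pragma omp ordered
            std::printf("%d^2 = %lld\n", i, sq);  // printed in iteration order
        }
    }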

Work-sharing Constructs – For-loop Pitfalls
- Using #pragma omp for outside of a #pragma omp parallel block
  - Has no effect, since only one thread is available
- Forgetting the for directive itself:

    #pragma omp parallel
    for (int i = 0; i < N; ++i) { … }

  - The entire loop is executed by ALL threads

Work-sharing Constructs – Sections
- Independent blocks of code executed concurrently

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            load_player_data();
            #pragma omp section
            load_game_maps();
            ...
        }
    }

Work-sharing Constructs – Single
- Code executed by a single thread of the team

    #pragma omp parallel
    {
        #pragma omp for
        for (...) ...
        #pragma omp single
        report_progress();    // only one thread reports the progress
        ...
    }

Work-sharing Constructs – Synchronization
- Implicit barrier at the end of each construct (for, sections, single)
- The nowait clause removes the barrier
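
A hypothetical sketch of the nowait clause; the second loop is independent of the first, so skipping the barrier between them is safe here:

    // Threads that finish the first loop early do not wait at its implicit
    // barrier (nowait) and move straight on to the second, independent loop.
    void scale_two_arrays(double* a, int n, double* b, int m) {
        #pragma omp parallel
        {
            #pragma omp for nowait
            for (int i = 0; i < n; ++i)
                a[i] *= 2.0;

            #pragma omp for            // independent of the first loop
            for (int i = 0; i < m; ++i)
                b[i] *= 3.0;
        }
    }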

Synchronization Constructs
- Synchronization directives, to be used within a parallel block
- #pragma omp master
  - A region executed only by the master thread
  - Similar to #pragma omp single
- #pragma omp critical [name]
  - A standard critical section with a lock guard
  - Only one thread may be in the section at a time
  - If a name is provided, all sections with the same name are interconnected (use the same lock)
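
A hypothetical sketch of a named critical section protecting a shared std::vector:

    #include <vector>

    // Only one thread at a time may execute a critical section named out_lock,
    // so concurrent push_back calls cannot corrupt the shared container.
    std::vector<int> collect_even(int n) {
        std::vector<int> out;                    // shared among the threads
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            if (i % 2 == 0) {
                #pragma omp critical(out_lock)
                out.push_back(i);
            }
        }
        return out;
    }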

Synchronization Constructs
- #pragma omp barrier
  - All threads in a team must meet at the barrier
- #pragma omp atomic
  - Followed by a statement such as x += expr; or ++x;
  - Allowed operations: +, *, -, /, &, ^, |, <<, or >>
- #pragma omp flush
  - Makes sure that changes of shared variables become visible
  - Executed implicitly by many directives (barrier, critical, parallel, …)
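
A hypothetical sketch of an atomic update on a shared counter:

    // A single simple update does not need a full critical section;
    // the atomic directive is enough to make the increment safe.
    long count_positive(const double* data, int n) {
        long hits = 0;
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            if (data[i] > 0.0) {
                #pragma omp atomic
                ++hits;
            }
        }
        return hits;
    }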

Tasks
- Pieces of code that may be executed by a different thread (within a parallel region)
- #pragma omp task
- Additional clauses:
  - untied – a different thread may resume the task after a yield
  - priority(p) – a scheduling hint
  - depend(list) – tasks that must conclude first
  - and a few others that control when and by whom the task can be executed

Tasks – Synchronization
- #pragma omp taskwait
  - Waits for all child tasks (spawned in this task)
- #pragma omp taskyield
  - Placed as a statement inside a task
  - The processing thread may suspend this task and pick up another one
- #pragma omp taskgroup
  - The spawning thread will not continue until all tasks in the group are completed
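
A hypothetical sketch of tasks and taskwait, using the classic recursive Fibonacci; the outer parallel/single pair spawns the root of the task tree:

    // Each invocation spawns two child tasks and waits (taskwait) for their
    // results before combining them.
    long fib(int n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }

    long fib_parallel(int n) {
        long result = 0;
        #pragma omp parallel
        #pragma omp single          // one thread spawns the root of the task tree
        result = fib(n);
        return result;
    }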

Data Sharing Management
- Data-related clauses – we already know private and shared
- firstprivate – like private, but each copy is initialized with the value from the main thread
- lastprivate – after the parallel block, the variable holds the last value that would have been computed in serial processing
- copyprivate – used with a single block; the value is broadcast to all threads
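
A hypothetical sketch of firstprivate and lastprivate in one loop; scale_and_track is an invented example:

    // scale enters every thread initialized to 2.0 (firstprivate); after the
    // loop, last holds the value from the final iteration (lastprivate),
    // exactly as it would in serial execution.
    double scale_and_track(double* a, int n) {
        double scale = 2.0;
        double last = 0.0;
        #pragma omp parallel for firstprivate(scale) lastprivate(last)
        for (int i = 0; i < n; ++i) {
            a[i] *= scale;
            last = a[i];
        }
        return last;    // value computed in iteration n - 1
    }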

Data Sharing Management – Reduction
- The variable is private (per thread) and reduced at the end
- #pragma omp ... reduction(op:list)
  - op represents the operation (+, *, &, |, …); each operation has its own default initial value (e.g., 0 for +)
  - list is the list of reduced variables
- Custom reducers:
  - #pragma omp declare reduction (identifier : typename-list : combiner) [initializer-clause]
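
A hypothetical sketch of a sum reduction:

    // Each thread accumulates its own private partial sum (initialized to 0.0),
    // and the partial sums are combined with + at the end of the loop.
    double sum_array(const double* data, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += data[i];
        return sum;
    }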

Local Thread Storage
- #pragma omp threadprivate(list)
  - Variables are made private for each thread
  - No connection to an explicit parallel block; values persist between blocks
  - Variables must be either global or static
- copyin(list) clause (of the parallel block)
  - Similar to firstprivate – threadprivate variables are initialized with the master thread's value
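
A hypothetical sketch of threadprivate combined with copyin; tally is an invented example:

    // A per-thread counter that persists across parallel regions; copyin
    // initializes every thread's copy from the master thread's value.
    static int counter = 0;
    #pragma omp threadprivate(counter)

    void tally(int start) {
        counter = start;                        // master thread's value
        #pragma omp parallel copyin(counter)
        {
            ++counter;                          // each thread updates its own copy
        }
    }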

Runtime Library Functions
- #include <omp.h> is usually required
- void omp_set_num_threads(int threads)
  - Sets the number of threads for the next parallel region
- int omp_get_num_threads(void)
  - Gets the number of threads in the current parallel region
- int omp_get_max_threads(void)
  - Current maximum number of threads in a parallel region
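
A hypothetical sketch requesting a team of four threads and checking the team size from inside the region:

    #include <cstdio>
    #include <omp.h>

    int main() {
        omp_set_num_threads(4);                     // request 4 threads for the next region
        #pragma omp parallel
        {
            #pragma omp single
            std::printf("team size: %d (max: %d)\n",
                        omp_get_num_threads(), omp_get_max_threads());
        }
        return 0;
    }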

Runtime Library Functions
- int omp_get_thread_num(void)
  - Current thread index within its team (master == 0)
- int omp_get_num_procs(void)
  - Actual number of (available) CPU cores
- int omp_in_parallel(void)
  - True if the code is being executed in parallel
- omp_set_dynamic(), omp_get_dynamic()
  - "Dynamic" ~ whether the number of threads in a block can be changed while the block is running

Runtime Library – Locks
- Regular and nested locks: omp_lock_t, omp_nest_lock_t
- Initialization and destruction: omp_init_lock(), omp_init_nest_lock(), omp_destroy_lock(), omp_destroy_nest_lock()
- Acquiring (blocking) and releasing: omp_set_lock(), omp_unset_lock(), …
- Acquiring the lock without blocking: omp_test_lock(), omp_test_nest_lock()
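
A hypothetical sketch of a lock's full life cycle (init, set/unset, destroy):

    #include <cstdio>
    #include <omp.h>

    // An explicit OpenMP lock serializing the printf calls of the threads.
    void log_threads() {
        omp_lock_t lock;
        omp_init_lock(&lock);
        #pragma omp parallel
        {
            omp_set_lock(&lock);                // blocks until the lock is acquired
            std::printf("thread %d checking in\n", omp_get_thread_num());
            omp_unset_lock(&lock);
        }
        omp_destroy_lock(&lock);
    }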

Environment Variables
- May affect the application without recompilation
- OMP_NUM_THREADS
  - Maximum number of threads during execution
- OMP_SCHEDULE
  - Scheduling strategy for the for-loop construct
  - The loop must have its schedule set to runtime
- OMP_DYNAMIC
  - Enables dynamic adjustment of the number of threads in a block
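
A hypothetical sketch of a loop whose schedule is deferred to runtime and thus controlled by OMP_SCHEDULE:

    // The scheduling strategy is not fixed at compile time; it can be chosen
    // via the OMP_SCHEDULE variable (e.g. OMP_SCHEDULE="guided,8") without
    // recompiling the program.
    void square_in_place(double* a, int n) {
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < n; ++i)
            a[i] = a[i] * a[i];
    }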

Thread Team Configuration
- The actual number of threads in a parallel region is determined as follows:
  - The if(condition) clause is evaluated first (a false condition forces a single thread)
  - The num_threads(n) clause is used if present
  - Otherwise the omp_set_num_threads() value is used
  - Otherwise the OMP_NUM_THREADS environment value is used
  - Otherwise the system default applies (typically the number of CPU cores)
- Affinity – whether threads are bound to CPU cores
  - OMP_PROC_BIND, OMP_PLACES
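
A hypothetical sketch combining the num_threads and if clauses; the threshold of 10000 is arbitrary:

    // The region uses at most 8 threads, and only when the problem is large
    // enough; for small n the if clause keeps the region serial.
    void shift(double* a, int n) {
        #pragma omp parallel num_threads(8) if(n > 10000)
        {
            #pragma omp for
            for (int i = 0; i < n; ++i)
                a[i] += 1.0;
        }
    }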

Nested Parallelism
- Nested parallel blocks
  - Implementations may not support them
  - Must be explicitly enabled: omp_set_nested(), OMP_NESTED
  - The nesting depth may be limited: omp_set_max_active_levels(), OMP_MAX_ACTIVE_LEVELS
  - Getting the right thread ID is more complex: omp_get_level(), omp_get_active_level(), omp_get_ancestor_thread_num()
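
A hypothetical sketch of a nested region (2 outer x 3 inner threads); both team sizes are arbitrary:

    #include <omp.h>

    // The inner region forks its own team only after nesting has been enabled.
    void nested_demo() {
        omp_set_nested(1);                       // or set OMP_NESTED=true
        #pragma omp parallel num_threads(2)
        {
            #pragma omp parallel num_threads(3)  // each outer thread forks an inner team
            {
                int outer = omp_get_ancestor_thread_num(1);  // id within the outer team
                int inner = omp_get_thread_num();            // id within the inner team
                (void)outer; (void)inner;
            }
        }
    }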

SIMD Support
- SIMD instructions are generated by the compiler when possible
- Hints may be provided to loops by pragmas: #pragma omp simd, #pragma omp for simd (uses both SIMD and the parallel for)
- Important clauses:
  - safelen – maximum safe loop unroll width
  - simdlen – recommended loop unroll width
  - aligned – declaration of array data alignment
  - linear – declaration of variables with a linear relation to the iteration variable
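
A hypothetical sketch of a combined parallel for simd loop; the 32-byte alignment and simdlen(8) are assumptions about the data and the target hardware:

    // The iteration space is divided among threads and each thread's chunk is
    // vectorized; y is declared as 32-byte aligned for the compiler.
    void scale(int n, float a, float* y) {
        #pragma omp parallel for simd aligned(y : 32) simdlen(8)
        for (int i = 0; i < n; ++i)
            y[i] *= a;
    }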

Accelerator (Offload) Support
- Execute code on an accelerator (GPU, FPGA, …): #pragma omp target ...
- Slightly more complicated:
  - Memory has to be allocated on the target device and the data transferred there and back, or the memory has to be mapped to the target device
  - The target device may have a more complex thread structure – OpenMP introduces the teams directive
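
A hypothetical sketch of offloading a saxpy-style loop with explicit map clauses; whether it actually runs on an accelerator depends on compiler and device support:

    // x and y are mapped to device memory; the updated y is copied back.
    // teams distribute parallel for spreads the loop over the device's teams
    // and their threads.
    void saxpy_offload(int n, float a, const float* x, float* y) {
        #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
        #pragma omp teams distribute parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }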

OpenMP 5.0 – Major Improvements
- Full support of accelerator devices (such as GPUs), introducing unified shared memory
- Multi-level memory systems – dealing with the NUMA properties of current systems
- Fully descriptive loop construct – gives the compiler the freedom to choose the best implementation
- Enhanced portability, improved debugging tools, …

Discussion