Programming with Shared Memory Introduction to OpenMP

Programming with Shared Memory Introduction to OpenMP Part 1 ITCS4145/5145, Parallel Programming B. Wilkinson Feb 11, 2013. slides 8b-1.ppt

OpenMP Thread-based shared memory programming model. An accepted standard, developed in the late 1990s by a group of industry specialists. Higher level than thread APIs such as Pthreads or Java threads. Programs are written in C/C++ (or Fortran) with OpenMP compiler directives to specify parallelism. OpenMP also provides a few supporting library routines and environment variables. Several compilers can build OpenMP programs, including recent Linux C compilers (e.g. gcc with the -fopenmp flag).

OpenMP thread model Initially, a single master thread executes. A parallel directive creates a team of threads; the subsequent block of code is executed by the multiple threads in parallel. The exact number of threads is determined in one of several ways (see later). Other directives within a parallel construct specify parallel for loops and different blocks of code for different threads. Code outside a parallel region is executed by the master thread only.
(Figure: fork-join thread model. Master thread only outside parallel regions; a team of multiple threads inside each parallel region, with synchronization at the end of each region before the master thread continues alone.)
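
A minimal sketch of this fork-join pattern, assuming a C compiler with OpenMP support (e.g. gcc -fopenmp); the structure mirrors the figure, with serial code between two parallel regions:

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      printf("Serial part: master thread only\n");          // executed once

      #pragma omp parallel                                   // fork: team of threads created
      {
         printf("Parallel region 1, thread %d\n", omp_get_thread_num());
      }                                                      // implicit barrier, then join

      printf("Serial part again: master thread only\n");

      #pragma omp parallel                                   // second fork
      {
         printf("Parallel region 2, thread %d\n", omp_get_thread_num());
      }
      return 0;
   }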

Number of threads in a team Established in one of three ways:
1. A num_threads clause after the parallel directive, e.g. #pragma omp parallel num_threads(5)
2. The omp_set_num_threads() library routine called previously, e.g. omp_set_num_threads(6);
3. The environment variable OMP_NUM_THREADS, e.g. $ export OMP_NUM_THREADS=8 then $ ./hello
These are considered in the order given, or the number is system dependent if none of the above is used. The number of threads available can be altered dynamically to achieve the best use of system resources.
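
A minimal sketch exercising the three methods, assuming gcc with -fopenmp (the file and thread counts are just for illustration):

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      // 1. num_threads clause: applies to this parallel region only
      #pragma omp parallel num_threads(5)
      {
         printf("clause:  thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
      }

      // 2. Library routine: applies to subsequent parallel regions
      omp_set_num_threads(6);
      #pragma omp parallel
      {
         printf("routine: thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads());
      }

      // 3. Environment variable, set before running the program:
      //    $ export OMP_NUM_THREADS=8
      //    $ ./hello
      return 0;
   }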

Finding number of threads and thread ID during program execution
omp_get_num_threads() – returns the total number of threads in the current team.
omp_get_thread_num() – returns the thread number (ID), an integer from 0 to omp_get_num_threads() - 1, where thread 0 is the master thread.
The names of these two functions are similar and easy to confuse.

OpenMP parallel Directive A C pragma directive instructs the compiler to use OpenMP features; all OpenMP directives begin with #pragma omp.

   #pragma omp parallel
   structured_block

The structured block is a single statement or a compound statement created with { ... }, with a single entry point and a single exit point. The parallel directive creates multiple threads, each one executing the specified structured_block. There is an implicit barrier at the end of the construct.

Hello world example VERY IMPORTANT: the opening brace must be on a new line (tabs and spaces are ok).

   #pragma omp parallel
   {
      printf("Hello World from thread %d of %d\n",
             omp_get_thread_num(), omp_get_num_threads());
   }

Output from an 8-processor/core machine:
Hello World from thread 0 of 8
Hello World from thread 4 of 8
Hello World from thread 3 of 8
Hello World from thread 2 of 8
Hello World from thread 7 of 8
Hello World from thread 1 of 8
Hello World from thread 6 of 8
Hello World from thread 5 of 8

Global “shared” variables/data Any variable declared outside a parallel construct is accessible by all threads unless otherwise specified:

   int main (int argc, char *argv[]) {
      int x;                 // accessible by all threads
      #pragma omp parallel
      {
         …                   // each thread sees the same x
      }
   }

Private variables Separate copies of a variable for each thread. They could be declared within each parallel region, but OpenMP provides the private clause:

   int tid;
   …
   #pragma omp parallel private(tid)
   {
      tid = omp_get_thread_num();      // each thread has a local variable tid
      printf("Hello World from thread = %d\n", tid);
   }

A shared clause is also available for explicitly declaring shared variables.

Another example of shared and private data Variables declared outside the parallel construct are shared unless otherwise specified.

   int main (int argc, char *argv[]) {
      int x;                              // x is shared by all threads
      int tid;                            // tid is private: each thread has its own copy
      #pragma omp parallel private(tid)
      {
         tid = omp_get_thread_num();
         if (tid == 0) x = 42;
         printf ("Thread %d, x = %d\n", tid, x);
      }
   }

Output

$ ./data
Thread 3, x = 0
Thread 2, x = 0
Thread 1, x = 0
Thread 0, x = 42
Thread 4, x = 42
Thread 5, x = 42
Thread 6, x = 42
Thread 7, x = 42

tid has a separate value for each thread. Why does x change? (x is shared, so whether a thread prints 0 or 42 depends on whether thread 0 has already executed x = 42 when that thread reads x: a race condition.)

Another example: shared versus private

   int a[100];
   #pragma omp parallel private(tid, n)
   {
      tid = omp_get_thread_num();
      n = omp_get_num_threads();
      a[tid] = 10*n;
   }

or, with the optional shared clause made explicit:

   #pragma omp parallel private(tid, n) shared(a)
   ...

a[ ] is shared; tid and n are private.

Variations of private variables
private clause – creates private copies of variables for each thread.
firstprivate clause – as private, but initializes each copy to the value given immediately prior to the parallel construct.
lastprivate clause – as private, but “the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable’s original object.”
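
A minimal sketch of firstprivate and lastprivate behaviour (variable names are illustrative; assumes gcc -fopenmp):

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      int x = 10;        // value set before the parallel construct
      int last = -1;

      // firstprivate: each thread's private copy of x starts at 10
      #pragma omp parallel firstprivate(x)
      {
         x += omp_get_thread_num();          // changes only this thread's copy
         printf("thread %d: x = %d\n", omp_get_thread_num(), x);
      }
      printf("after parallel region, x = %d\n", x);   // still 10

      // lastprivate: after the loop, 'last' holds the value from the
      // sequentially last iteration (i = 99)
      #pragma omp parallel for lastprivate(last)
      for (int i = 0; i < 100; i++)
         last = i * i;
      printf("last = %d\n", last);            // 99*99 = 9801
      return 0;
   }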

Work-sharing: specifying work inside a parallel region Four constructs in this classification:
sections (with section)
for
single
master
In all cases there is an implicit barrier at the end of the construct unless a nowait clause is included, which overrides the barrier. Note: these constructs do not start a new team of threads; that is done by an enclosing parallel construct.

Sections The construct:

   #pragma omp parallel              // enclosing parallel directive
   {
      #pragma omp sections
      {
         #pragma omp section
            structured_block
         #pragma omp section
            structured_block
         …
      }
   }

causes the structured blocks to be shared among the threads in the team (each block executed by an available thread). The first section directive is optional.

Example

   #pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
   {
      tid = omp_get_thread_num();
      #pragma omp sections nowait
      {
         #pragma omp section                       // one thread does this
         {
            printf("Thread %d doing section 1\n",tid);
            for (i=0; i<N; i++) {
               c[i] = a[i] + b[i];
               printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
            }
         }
         #pragma omp section                       // another thread does this
         {
            printf("Thread %d doing section 2\n",tid);
            for (i=0; i<N; i++) {
               d[i] = a[i] * b[i];
               printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
            }
         }
      } /* end of sections */
   } /* end of parallel section */

Another sections example (nowait: threads do not wait after finishing a section)

   #pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
   {
      tid = omp_get_thread_num();
      #pragma omp sections nowait
      {
         #pragma omp section                       // one thread does this
         {
            printf("Thread %d doing section 1\n",tid);
            for (i=0; i<N; i++) {
               c[i] = a[i] + b[i];
               printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
            }
         }

Sections example continued

         #pragma omp section                       // another thread does this
         {
            printf("Thread %d doing section 2\n",tid);
            for (i=0; i<N; i++) {
               d[i] = a[i] * b[i];
               printf("Thread %d: d[%d]= %f\n",tid,i,d[i]);
            }
         }
      } /* end of sections */
      printf ("Thread %d done\n", tid);
   } /* end of parallel section */

Output (with nowait, threads do not wait, i.e. no barrier)

Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 done
Thread 2 done
Thread 1 doing section 2
Thread 1: d[0]= 0.000000
Thread 1: d[1]= 6.000000
Thread 1: d[2]= 14.000000
Thread 1: d[3]= 24.000000
Thread 0 done
Thread 1: d[4]= 36.000000
Thread 1 done

Output if the nowait clause is removed

Thread 0 doing section 1
Thread 0: c[0]= 5.000000
Thread 0: c[1]= 7.000000
Thread 0: c[2]= 9.000000
Thread 0: c[3]= 11.000000
Thread 0: c[4]= 13.000000
Thread 3 doing section 2
Thread 3: d[0]= 0.000000
Thread 3: d[1]= 6.000000
Thread 3: d[2]= 14.000000
Thread 3: d[3]= 24.000000
Thread 3: d[4]= 36.000000
Thread 3 done
Thread 1 done
Thread 2 done
Thread 0 done

With nowait removed, there is a barrier at the end of the sections construct: threads wait until they are all done with their sections.

Combining parallel and sections constructs If a parallel directive is followed by a single sections directive, they can be combined into:

   #pragma omp parallel sections
   {
      #pragma omp section
         structured_block
      …
   }

with similar effect. (However, a nowait clause is not allowed.)
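
A minimal compilable sketch of the combined form (assumes gcc -fopenmp; the printed messages are just for illustration):

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      #pragma omp parallel sections
      {
         #pragma omp section
         printf("Section 1 run by thread %d\n", omp_get_thread_num());

         #pragma omp section
         printf("Section 2 run by thread %d\n", omp_get_thread_num());
      }   // implicit barrier here (nowait not allowed on the combined form)
      return 0;
   }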

Parallel for loop

   #pragma omp parallel           // enclosing parallel region
   {
      …
      #pragma omp for             // the for statement must start on a new line
      for ( i = 0; i < n; i++ ) {
         …                        // for loop body
      }
   }

causes the for loop to be divided into parts, with the parts shared among the threads in the team (equivalent to a “forall”): different iterations will be executed by the available threads. The loop must be a “for” loop of a simple C form such as for (i = 0; i < n; i++), where the lower bound and upper bound are constants.

Example

   #pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
   {
      tid = omp_get_thread_num();
      if (tid == 0) {                               // executed by one thread
         nthreads = omp_get_num_threads();
         printf("Number of threads = %d\n", nthreads);
      }
      printf("Thread %d starting...\n",tid);
      #pragma omp for
      for (i=0; i<N; i++) {                         // for loop shared among threads
         c[i] = a[i] + b[i];
         printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
      }
      // Without "nowait", threads wait (barrier) after finishing the loop
   } /* end of parallel section */

Combined parallel and for constructs If a parallel directive is followed by a single for directive, they can be combined into:

   #pragma omp parallel for
   for ( … ) {
      …
   }

with similar effect.

Combining directives example This declares a parallel region and a parallel for in one directive:

   #pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
   for (i = 0; i < N; i++) {
      tid = omp_get_thread_num();     // set tid so the printf reports the right thread
      c[i] = a[i] + b[i];
      printf("Thread %d: c[%d]= %f\n",tid,i,c[i]);
   }

Scheduling a parallel for By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to the available threads (static mapping), with a default chunk size and a barrier at the end of the loop:

Thread 1 starting...
Thread 1: i = 2, c[1] = 9.000000
Thread 1: i = 3, c[1] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[2] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[0] = 7.000000

Loop scheduling and partitioning OpenMP offers scheduling clauses to add to the for construct:

1. Static
   #pragma omp parallel for schedule (static,chunk_size)
Partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.

2. Dynamic
   #pragma omp parallel for schedule (dynamic,chunk_size)
Uses an internal work queue. A chunk-sized block of the loop is assigned to each thread as it becomes available.
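
A small sketch contrasting the two clauses; the chunk size of 4 and N = 16 are arbitrary choices for illustration (assumes gcc -fopenmp):

   #include <stdio.h>
   #include <omp.h>

   #define N 16

   int main(void) {
      // Static: iterations divided into chunks of 4, dealt round-robin to threads
      #pragma omp parallel for schedule(static, 4)
      for (int i = 0; i < N; i++)
         printf("static:  iteration %2d on thread %d\n", i, omp_get_thread_num());

      // Dynamic: chunks of 4 handed out from a work queue as threads become free
      #pragma omp parallel for schedule(dynamic, 4)
      for (int i = 0; i < N; i++)
         printf("dynamic: iteration %2d on thread %d\n", i, omp_get_thread_num());
      return 0;
   }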

3. Guided
   #pragma omp parallel for schedule (guided,chunk_size)
Similar to dynamic, but the chunk size starts large and gets smaller, to reduce the time threads spend going back to the work queue:

   chunk size = (number of iterations remaining) / (2 * number of threads)

4. Runtime
   #pragma omp parallel for schedule (runtime)
Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
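
A minimal sketch of the runtime option, where the schedule is picked at run time from OMP_SCHEDULE rather than fixed in the source (program name and loop bound are illustrative):

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      // Choose the schedule when running the program, e.g.
      //   $ export OMP_SCHEDULE="guided,2"
      //   $ ./sched
      #pragma omp parallel for schedule(runtime)
      for (int i = 0; i < 16; i++)
         printf("iteration %2d on thread %d\n", i, omp_get_thread_num());
      return 0;
   }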

Question Guided scheduling is similar to static scheduling except that the chunk sizes start large and get smaller, and chunks are handed out as threads become free. What is the advantage of using guided versus static? Answer: guided improves load balance, since the small chunks at the end fill in idle time when iterations take different amounts of time.

Reduction A reduction applies a commutative operator across an aggregate of values, creating a single value (similar to MPI_Reduce):

   sum = 0;
   #pragma omp parallel for reduction(+:sum)    // operation: +, variable: sum
   for (k = 0; k < 100; k++ ) {
      sum = sum + funct(k);
   }

A private copy of sum is created for each thread by the compiler. Each private copy is added into sum at the end. This eliminates the need for a critical section here.
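
A self-contained sketch of the same idea with a checkable result (the sum 1..100 is chosen only so the answer, 5050, is easy to verify; assumes gcc -fopenmp):

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      long sum = 0;

      // Each thread accumulates into its own private copy of 'sum';
      // the copies are combined with '+' at the end of the loop.
      #pragma omp parallel for reduction(+:sum)
      for (int k = 1; k <= 100; k++)
         sum += k;

      printf("sum = %ld\n", sum);   // 5050, independent of the thread count
      return 0;
   }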

Single The directive:

   #pragma omp parallel
   {
      …
      #pragma omp single           // structured block must start on a new line
      structured_block
      …
   }

causes the structured block to be executed by one thread only, with an implicit barrier for the other threads at the end of the construct.
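
A minimal sketch of single in use, e.g. for one-off initialization inside a parallel region (the printed messages are illustrative; assumes gcc -fopenmp):

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
      #pragma omp parallel
      {
         #pragma omp single
         {
            // Only one thread (not necessarily the master) executes this block
            printf("Initialization done by thread %d\n", omp_get_thread_num());
         }
         // Implicit barrier: all threads wait here until the single block finishes
         printf("Thread %d continues\n", omp_get_thread_num());
      }
      return 0;
   }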

Master The master directive:

   #pragma omp parallel
   {
      …
      #pragma omp master
      structured_block
      …
   }

causes only the master thread to execute the structured block. It differs from the constructs in the work-sharing group in that there is no implied barrier at the end of the construct (nor at the beginning). Other threads encountering the master directive will ignore it and its associated structured block, and move on.

Master example

   #pragma omp parallel private(tid)
   {
      tid = omp_get_thread_num();
      printf ("Thread %d starting...\n", tid);
      #pragma omp master
      {
         printf("Thread %d doing work\n",tid);
         ...
      } /* end of master */
      printf ("Thread %d done\n", tid);
   } /* end of parallel section */

Is there any difference between these two approaches?

Using an if statement:

   #pragma omp parallel private(tid)
   {
      ...
      tid = omp_get_thread_num();
      if (tid == 0)
         structured_block
   }

Using the master directive:

   #pragma omp parallel
   {
      ...
      #pragma omp master
      structured_block
   }

Questions