PARALLEL PROGRAMMING WITH OPENMP Ing. Andrea Marongiu

Programming model: OpenMP
 De-facto standard for the shared memory programming model
 A collection of compiler directives, library routines and environment variables
 Easy to specify parallel execution within a serial code
 Requires special support in the compiler
 Generates calls to threading libraries (e.g. pthreads)
 Focus on loop-level parallel execution
 Popular in high-end embedded systems

Fork/Join Parallelism
 Initially only the master thread is active
 The master thread executes sequential code
 Fork: the master thread creates or awakens additional threads to execute parallel code
 Join: at the end of the parallel code, the created threads are suspended upon barrier synchronization

(Figure: timelines of a sequential program vs. a parallel program)

Pragmas
 Pragma: a compiler directive in C or C++
 Stands for “pragmatic information”
 A way for the programmer to communicate with the compiler
 The compiler is free to ignore pragmas: the original sequential semantics are not altered
 Syntax: #pragma omp <directive> [clause [clause] …]
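A minimal sketch of this general form in practice (the num_threads clause and the body are illustrative choices, not taken from the slides):

  #include <stdio.h>

  int main()
  {
    /* Directive name (parallel) followed by an optional clause (num_threads) */
    #pragma omp parallel num_threads(4)
    {
      printf("Hello world!\n");
    }
    /* A compiler that does not support OpenMP simply ignores the pragma
       and the program keeps its original sequential semantics */
    return 0;
  }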

Components of OpenMP

Directives
 Parallel regions: #pragma omp parallel
 Work sharing: #pragma omp for, #pragma omp sections
 Synchronization: #pragma omp barrier, #pragma omp critical, #pragma omp atomic

Clauses
 Data scope attributes: private, shared, reduction
 Loop scheduling: static, dynamic

Runtime Library
 Thread forking/joining: omp_parallel_start(), omp_parallel_end()
 Loop scheduling
 Thread IDs: omp_get_thread_num(), omp_get_num_threads()
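As a quick orientation, a small sketch (not from the slides) that combines the three groups: a directive with clauses, plus calls to the runtime library. It uses the standard omp.h entry points; omp_parallel_start()/omp_parallel_end() listed above are the runtime-internal calls that the compiler generates.

  #include <stdio.h>
  #include <omp.h>

  int main()
  {
    int i, sum = 0;

    /* Directive (parallel for) with clauses: a data scope attribute
       (reduction) and a loop schedule (static) */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (i = 0; i < 100; i++)
      sum += i;

    /* Runtime library: query thread ID and thread count */
    #pragma omp parallel
    printf("thread %d of %d sees sum = %d\n",
           omp_get_thread_num(), omp_get_num_threads(), sum);

    return 0;
  }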

Outlining parallelism: the parallel directive
 Fundamental construct to outline parallel computation within a sequential program
 Code within its scope is replicated among threads
 Defers implementation of parallel execution to the runtime (machine-specific, e.g. pthread_create)

A sequential program...

  int main()
  {
    #pragma omp parallel
    {
      printf ("\nHello world!");
    }
  }

...is easily parallelized. The compiler transforms it into:

  int parfun(…)
  {
    printf ("\nHello world!");
  }

  int main()
  {
    omp_parallel_start(&parfun, …);
    parfun();
    omp_parallel_end();
  }

#pragma omp parallel: what the compiler does
The transformation shown above proceeds as follows:
 Code originally contained within the scope of the pragma is outlined to a new function (parfun) within the compiler
 The #pragma construct in the main function is replaced with function calls to the runtime library
 First we call the runtime to fork new threads, and pass them a pointer to the function to execute in parallel (omp_parallel_start)
 Then the master itself calls the parallel function (parfun)
 Finally we call the runtime to synchronize threads with a barrier and suspend them (omp_parallel_end)

#pragma omp parallel: data scope attributes
A slightly more complex example:

  int main()
  {
    int id;
    int a = 5;
    #pragma omp parallel
    {
      id = omp_get_thread_num();
      if (id == 0)
        printf ("Master: a = %d.", a*2);
      else
        printf ("Slave: a = %d.", a);
    }
  }

 The call to the runtime (omp_get_thread_num) returns the thread ID: every thread sees a different value
 Master and slave threads access the same variable a

Making the data scope explicit with clauses:

  #pragma omp parallel shared (a) private (id)

 shared (a): the compiler inserts code to retrieve the address of the shared object from within each parallel thread
 private (id): allows symbol privatization; each thread contains a private copy of this variable

Sharing work among threads: the for directive
 The parallel pragma instructs every thread to execute all of the code inside the block
 If we encounter a for loop that we want to divide among threads, we use the for pragma:

  #pragma omp for

#pragma omp for: what the compiler does

  int main()
  {
    #pragma omp parallel for
    for (i=0; i<10; i++)
      a[i] = i;
  }

The loop is outlined, and each thread receives its own chunk of the iteration space, delimited by a lower bound (LB) and an upper bound (UB):

  int parfun(…)
  {
    int LB = …;
    int UB = …;
    for (i=LB; i<UB; i++)
      a[i] = i;
  }

  int main()
  {
    omp_parallel_start(&parfun, …);
    parfun();
    omp_parallel_end();
  }
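For reference, a self-contained, compilable version of this loop (a sketch; the includes and the final printing loop are additions not in the slides). The canonical form attaches the pragma directly to the for statement, and the loop index of the associated loop is made private automatically.

  #include <stdio.h>

  int a[10];

  int main()
  {
    int i;

    /* Each thread executes its own chunk of the iteration space */
    #pragma omp parallel for
    for (i = 0; i < 10; i++)
      a[i] = i;

    for (i = 0; i < 10; i++)
      printf("a[%d] = %d\n", i, a[i]);
    return 0;
  }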

The schedule clause: static loop partitioning

  #pragma omp for schedule(static)
  for (i=0; i<12; i++)
    a[i] = i;

Example: 12 iterations (N) and 4 threads (Nthr). Each thread receives a data chunk of

  C = ceil(N / Nthr)        (here: 3 iterations per thread)

and thread TID executes the iterations between

  LB = C * TID                     (lower bound)
  UB = min(C * (TID + 1), N)       (upper bound)

(Figure: the iteration space split into equal chunks, one per thread ID 0-3)

Useful for:
 Simple, regular loops
 Iterations with equal duration

Static partitioning is a poor fit for UNBALANCED workloads, e.g.:

  #pragma omp for schedule(static)
  for (i=0; i<12; i++) {
    int start = rand();
    int count = 0;
    while (start++ < 256)
      count++;
    a[count] = foo();
  }

(Figure: iteration space with iterations of very different durations; the static chunks finish at very different times)
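The following sketch spells out the bound computation above by hand, using the runtime library to obtain TID and Nthr; it is an illustration of the formulas, not the actual code generated for schedule(static).

  #include <stdio.h>
  #include <omp.h>

  int main()
  {
    int a[12];
    int N = 12;

    #pragma omp parallel
    {
      int TID  = omp_get_thread_num();
      int Nthr = omp_get_num_threads();
      int C    = (N + Nthr - 1) / Nthr;                   /* ceil(N / Nthr) */
      int LB   = C * TID;
      int UB   = (C * (TID + 1) < N) ? C * (TID + 1) : N; /* min(C*(TID+1), N) */

      for (int i = LB; i < UB; i++)
        a[i] = i;
    }

    for (int i = 0; i < N; i++)
      printf("%d ", a[i]);
    printf("\n");
    return 0;
  }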

The schedule clause: dynamic loop partitioning

  #pragma omp for schedule(dynamic)
  for (i=0; i<12; i++) {
    int start = rand();
    int count = 0;
    while (start++ < 256)
      count++;
    a[count] = foo();
  }

Iterations are handed out to threads on demand from a work queue managed by the runtime environment:

  int parfun(…)
  {
    int LB, UB;
    GOMP_loop_dynamic_next(&LB, &UB);
    for (i=LB; i<UB; i++) {…}
  }

(Figure: with dynamic scheduling the workload across threads is BALANCED)
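As the programmer writes it, the unbalanced loop only needs the schedule clause; a compilable sketch (the chunk size of 1 and the placeholder foo() are illustrative assumptions):

  #include <stdio.h>
  #include <stdlib.h>

  static int foo(void) { return 1; }   /* placeholder: the slides never define foo() */

  int a[257];

  int main()
  {
    int i;

    /* Each thread grabs the next iteration from the runtime's work queue
       as soon as it finishes its previous one */
    #pragma omp parallel for schedule(dynamic, 1)
    for (i = 0; i < 12; i++) {
      int start = rand();
      int count = 0;
      while (start++ < 256)
        count++;
      a[count] = foo();
    }

    printf("done\n");
    return 0;
  }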

Sharing work among threads: the sections directive
 The for pragma exploits data parallelism in loops
 OpenMP also provides a directive to exploit task parallelism:

  #pragma omp sections

Task Parallelism Example
The sequential code:

  int main()
  {
    v = alpha();
    w = beta ();
    y = delta ();
    x = gamma (v, w);
    z = epsilon (x, y);
    printf ("%f\n", z);
  }

alpha, beta and delta take no inputs; gamma needs v and w; epsilon needs x and y. The independent calls are grouped into two parallel sections regions, and each task inside them is marked with #pragma omp section:

  int main()
  {
    #pragma omp parallel sections
    {
      #pragma omp section
      v = alpha();
      #pragma omp section
      w = beta ();
    }
    #pragma omp parallel sections
    {
      #pragma omp section
      y = delta ();
      #pragma omp section
      x = gamma (v, w);
    }
    z = epsilon (x, y);
    printf ("%f\n", z);
  }

Task Parallelism Example
An alternative way to split the same program into sections:

  int main()
  {
    #pragma omp parallel sections
    {
      #pragma omp section
      v = alpha();
      #pragma omp section
      w = beta ();
      #pragma omp section
      y = delta ();
    }
    #pragma omp parallel sections
    {
      #pragma omp section
      x = gamma (v, w);
      #pragma omp section
      z = epsilon (x, y);
    }
    printf ("%f\n", z);
  }

#pragma omp barrier
 The most important synchronization mechanism in shared memory fork/join parallel programming
 All threads participating in a parallel region wait until everybody has finished before computation flows on
 This prevents later stages of the program from working with inconsistent shared data
 A barrier is implied at the end of parallel constructs, as well as of for and sections (unless a nowait clause is specified)
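A short sketch (not from the slides) of the implied barrier and the nowait clause; the arrays and their contents are made up for illustration.

  #include <stdio.h>

  int a[100], b[100];

  int main()
  {
    int i;

    #pragma omp parallel
    {
      /* Implicit barrier at the end of this for: all of a[] is written
         before any thread moves on */
      #pragma omp for
      for (i = 0; i < 100; i++)
        a[i] = i;

      /* nowait removes the implied barrier of this worksharing construct */
      #pragma omp for nowait
      for (i = 0; i < 100; i++)
        b[i] = 2 * a[i];

      /* Explicit barrier: wait until every thread has finished writing b[] */
      #pragma omp barrier
    }

    printf("b[99] = %d\n", b[99]);
    return 0;
  }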

#pragma omp critical
 Critical section: a portion of code that only one thread at a time may execute
 We denote a critical section by putting the pragma

  #pragma omp critical

in front of a block of C code

π-finding code example

  double area, pi, x;
  int i, n;

  #pragma omp parallel for private(x) shared(area)
  for (i=0; i<n; i++) {
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
  }
  pi = area/n;

Race condition
 We must ensure atomic updates of the shared variable area to avoid a race condition, in which one thread may “race ahead” of another and ignore its changes

Race condition (Cont’d)
A possible interleaving over time:
 Thread A reads “11.667” into a local register
 Thread B reads “11.667” into a local register
 Thread A updates area with its new value
 Thread B ignores the write from thread A and updates area with its own (stale) result

π-finding code example

  double area, pi, x;
  int i, n;

  #pragma omp parallel for private(x) shared(area)
  for (i=0; i<n; i++) {
    x = (i + 0.5)/n;
    #pragma omp critical
    area += 4.0/(1.0 + x*x);
  }
  pi = area/n;

#pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution.

Correctness, not performance!
 As a matter of fact, using locks makes execution sequential
 To limit this effect we should use fine-grained locking (i.e. make critical sections as small as possible)
 A simple statement that computes the value of area in the previous example is translated into many more, simpler instructions within the compiler!
 The programmer is not aware of the real granularity of the critical section

(Figure: a dump of the program’s intermediate representation within the compiler — a call to the runtime to acquire the lock, the lock-protected operations forming the critical section, and a call to the runtime to release the lock)

π-finding code example

  double area, pi, x;
  int i, n;

  #pragma omp parallel for private(x) shared(area)
  for (i=0; i<n; i++) {
    x = (i + 0.5)/n;
    #pragma omp critical
    area += 4.0/(1.0 + x*x);
  }
  pi = area/n;

(Figure: execution timeline — only the code outside the critical section runs in parallel; the updates to area execute sequentially, with threads waiting for the lock)
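The slides fix the race with critical. For a simple read-modify-write like this one, #pragma omp atomic (listed earlier among the synchronization directives) is a lighter-weight alternative; a sketch, with an assumed value of n:

  #include <stdio.h>

  int main()
  {
    double area = 0.0, pi, x;
    int i, n = 1000000;   /* n is an assumed value; the slides leave it unset */

    #pragma omp parallel for private(x) shared(area)
    for (i = 0; i < n; i++) {
      x = (i + 0.5) / n;
      /* atomic protects only this single update, typically with less
         overhead than a general critical section */
      #pragma omp atomic
      area += 4.0 / (1.0 + x * x);
    }
    pi = area / n;

    printf("pi = %f\n", pi);
    return 0;
  }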

Correctness, not performance!
 A programming pattern such as

  area += 4.0/(1.0 + x*x);

in which we fetch the value of an operand, add a value to it, and store the updated value, is called a reduction. Reductions are commonly supported by parallel programming APIs.
 OpenMP takes care of storing partial results in private variables and combining the partial results after the loop

Correctness, not performance!

  double area, pi, x;
  int i, n;

  #pragma omp parallel for private(x) reduction(+:area)
  for (i=0; i<n; i++) {
    x = (i + 0.5)/n;
    area += 4.0/(1.0 + x*x);
  }
  pi = area/n;

The reduction(+:area) clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop, the partial sums are combined into the shared area variable.
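Putting it together, a self-contained, compilable version of the π loop with the reduction clause (the includes, the value of n and the final printf are additions not in the slides):

  #include <stdio.h>

  int main()
  {
    double area = 0.0, pi, x;
    int i, n = 1000000;   /* assumed value of n */

    /* Each thread accumulates into its own private copy of area; the
       partial sums are combined at the end of the loop */
    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
      x = (i + 0.5) / n;
      area += 4.0 / (1.0 + x * x);
    }
    pi = area / n;

    printf("pi = %f\n", pi);
    return 0;
  }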

Summary

Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy
 Memory latency is well recognized as a severe performance bottleneck
 MPSoCs feature a complex memory hierarchy, with multiple cache levels and private and shared on-chip and off-chip memories
 Using the memory hierarchy efficiently is of the utmost importance for exploiting the computational power of MPSoCs, but...

Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy
 ...it is a difficult task, requiring a deep understanding of the application and its memory access pattern
 The OpenMP standard doesn’t provide any facilities to deal with data placement and partitioning
 Customization of the programming interface would bring the advantages of OpenMP to the MPSoC world

Extending OpenMP to support Data Distribution
 We need a custom directive that enables specific code analysis and transformation
 When static code analysis can’t tell how to distribute data, we must rely on profiling
 The runtime is responsible for exploiting this information to efficiently map arrays to memories

Extending OpenMP to support Data Distribution
 The entire process is driven by the custom #pragma omp distributed directive

  {
    int A[m];
    float B[n];
    #pragma omp distributed (A, B)
    …
  }

is transformed into

  {
    int *A;
    float *B;
    A = distributed_malloc (m);
    B = distributed_malloc (n);
    …
  }

Originally stack-allocated arrays are transformed into pointers to allow for their explicit placement throughout the memory hierarchy within the program.

Extending OpenMP to support Data Distribution
 The transformed program invokes the runtime to retrieve the profile information which drives data placement
 When no profile information is found, distributed_malloc returns a pointer to shared memory

Data partitioning
 The OpenMP model is focused on loop parallelism
 In this parallelization scheme multiple threads may access different sections (discontiguous addresses) of shared arrays
 Data partitioning is the process of tiling data arrays and placing the tiles in memory such that a maximum number of accesses are satisfied from local memory
 The most obvious implementation of this concept is the data cache, but..
 Inter-array discontiguity often causes cache conflicts
 Embedded systems impose constraints on energy, predictability and real time that often make caches unsuitable

A simple example

  #pragma omp parallel for
  for (i = 0; i < 4; i++)
    for (j = 0; j < 6; j++)
      A[ i ][ j ] = 1.0;

(Figure: the 4x6 iteration space and the matching 4x6 data space of array A, on an architecture with four CPUs, each with its own scratchpad memory (SPM), connected through an interconnect to a shared memory)

 The iteration space is partitioned between the processors
 The array is accessed with the loop induction variables, so the data space overlaps with the iteration space: each processor accesses a different tile of A(i,j)
 The compiler can therefore split the matrix into four smaller arrays and allocate them onto the scratchpads
 No accesses to remote memories go through the bus, since data is allocated locally

Another example

  #pragma omp parallel for
  for (i = 0; i < 4; i++)
    for (j = 0; j < 6; j++)
      hist[A[i][j]]++;

(Figure: the 4x6 iteration space, array A and array hist on the same SPM-based architecture)

 Different locations within array hist are accessed by many processors

Another example
 In this case static code analysis can’t tell anything about the array access pattern
 How do we decide the most efficient partitioning?
 Split the array into as many tiles as there are processors
 Use access count information to map each tile to the processor that accesses it most (a sketch of this heuristic follows below)
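A sketch of the mapping heuristic just described, using the access counts reported on the next slide; this illustrates the decision rule only, not the actual tool-flow code.

  #include <stdio.h>

  #define NTILES 4
  #define NPROCS 4

  int main()
  {
    /* Profile-derived access counts: access_count[tile][processor] */
    int access_count[NTILES][NPROCS] = {
      {1, 2, 3, 2},   /* tile 1 */
      {1, 4, 2, 0},   /* tile 2 */
      {3, 0, 1, 4},   /* tile 3 */
      {2, 0, 0, 1},   /* tile 4 */
    };

    for (int t = 0; t < NTILES; t++) {
      int best = 0;
      for (int p = 1; p < NPROCS; p++)
        if (access_count[t][p] > access_count[t][best])
          best = p;
      /* tile t is placed in the scratchpad of the processor that
         accesses it most often */
      printf("tile %d -> processor %d\n", t + 1, best + 1);
    }
    return 0;
  }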

Another example
Profile-derived access counts for each tile of hist:

  TILE 1: PROC1 1, PROC2 2, PROC3 3, PROC4 2
  TILE 2: PROC1 1, PROC2 4, PROC3 2, PROC4 0
  TILE 3: PROC1 3, PROC2 0, PROC3 1, PROC4 4
  TILE 4: PROC1 2, PROC2 0, PROC3 0, PROC4 1

(Figure: the tiles of hist mapped onto the four scratchpads according to these access counts)

Now processors need to access remote scratchpads, since they work on multiple tiles!

Problem with data partitioning
 If the iteration space and the data space do not overlap, multiple processors may need to access different tiles
 In this case data partitioning introduces addressing difficulties, because the data tiles can become discontiguous in physical memory
 How do we generate efficient code to access data when performing loop and data partitioning?
 We can further extend the OpenMP programming interface to deal with that!
 The programmer only has to specify the intention of partitioning an array throughout the memory hierarchy, and the compiler does the necessary instrumentation

Code Instrumentation
In general, the steps for addressing an array element using tiling are:
 Compute the offset w.r.t. the base address
 Identify the tile to which this element belongs
 Re-compute the index relative to the current tile
 Load the tile base address from a metadata array
This metadata array is populated during the memory allocation step of the tool-flow. It relies on access count information to figure out the most efficient mapping of array tiles to memories. (A sketch of these steps follows below.)
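A sketch of what these four steps can look like for a 2-D array split into equal row-blocks; the row-major layout, the tile size and the tiles[] metadata layout are assumptions for illustration (the next slide shows the compiler-generated form).

  #include <stdio.h>

  #define ROWS   4
  #define COLS   6
  #define NTILES 4
  #define TILE_ELEMS ((ROWS * COLS) / NTILES)

  /* Metadata array: base address of each tile. In the tool-flow it is
     populated during the memory allocation step; here it is filled by hand. */
  int *tiles[NTILES];

  int read_elem(int i, int j)
  {
    int offset = i * COLS + j;          /* 1. offset w.r.t. the base address  */
    int tile   = offset / TILE_ELEMS;   /* 2. tile this element belongs to    */
    int index  = offset % TILE_ELEMS;   /* 3. index relative to the tile      */
    int *base  = tiles[tile];           /* 4. tile base address from metadata */
    return base[index];
  }

  int main()
  {
    /* Four row-block tiles; in the real system each could live in a
       different scratchpad or in shared memory */
    static int tile_storage[NTILES][TILE_ELEMS];
    for (int t = 0; t < NTILES; t++)
      tiles[t] = tile_storage[t];

    /* Write A[i][j] = i through the tiled view, then read one element back */
    for (int i = 0; i < ROWS; i++)
      for (int j = 0; j < COLS; j++)
        tiles[(i * COLS + j) / TILE_ELEMS][(i * COLS + j) % TILE_ELEMS] = i;

    printf("A[2][3] = %d\n", read_elem(2, 3));
    return 0;
  }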

Extending OpenMP to support data partitioning

  #pragma omp parallel tiled(A)
  {
    …
    /* Access memory */
    A[i][j] = foo();
    …
  }

is transformed into

  {
    /* Compute offset, tile and index for the distributed array */
    int offset = …;
    int tile = …;
    int index = …;
    /* Read tile base address */
    int *base = tiles[dvar][tile];
    /* Access memory */
    base[index] = foo();
    …
  }

The instrumentation process is driven by the custom tiled clause, which can be coupled with every parallel and work-sharing construct.