Programming with OpenMP* Intel Software College. Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or.

Slides:

Advertisements

Similar presentations

Implementing Domain Decompositions Intel Software College Introduction to Parallel Programming – Part 3.

Advertisements

Indian Institute of Science Bangalore, India भारतीय विज्ञान संस्थान बंगलौर, भारत Supercomputer Education and Research Centre (SERC) Adapted from: o “MPI-Message.

Open[M]ulti[P]rocessing Pthreads: Programmer explicitly define thread behavior openMP: Compiler and system defines thread behavior Pthreads: Library independent.

PARALLEL PROGRAMMING WITH OPENMP Ing. Andrea Marongiu

1 OpenMP—An API for Shared Memory Programming Slides are based on:

1 Tuesday, November 07, 2006 “If anything can go wrong, it will.” -Murphy’s Law.

Computer Architecture II 1 Computer architecture II Programming: POSIX Threads OpenMP.

Introduction to OpenMP For a more detailed tutorial see: Look at the presentations.

1 ITCS4145/5145, Parallel Programming B. Wilkinson Feb 21, 2012 Programming with Shared Memory Introduction to OpenMP.

CSCI-6964: High Performance Parallel & Distributed Computing (HPDC) AE 216, Mon/Thurs 2-3:20 p.m. Pthreads (reading Chp 7.10) Prof. Chris Carothers Computer.

OpenMPI Majdi Baddourah

A Very Short Introduction to OpenMP Basile Schaeli EPFL – I&C – LSP Vincent Keller EPFL – STI – LIN.

INTEL CONFIDENTIAL OpenMP for Domain Decomposition Introduction to Parallel Programming – Part 5.

INTEL CONFIDENTIAL Confronting Race Conditions Introduction to Parallel Programming – Part 6.

INTEL CONFIDENTIAL OpenMP for Task Decomposition Introduction to Parallel Programming – Part 8.

INTEL CONFIDENTIAL Reducing Parallel Overhead Introduction to Parallel Programming – Part 12.

1 Parallel Programming With OpenMP. 2 Contents  Overview of Parallel Programming & OpenMP  Difference between OpenMP & MPI  OpenMP Programming Model.

Programming with OpenMP* Intel Software College. Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or.

Programming with Shared Memory Introduction to OpenMP

Shared Memory Parallelization Outline What is shared memory parallelization? OpenMP Fractal Example False Sharing Variable scoping Examples on sharing.

Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (

Parallel Programming in Java with Shared Memory Directives.

Lecture 5: Shared-memory Computing with Open MP. Shared Memory Computing.

ME964 High Performance Computing for Engineering Applications “The inside of a computer is as dumb as hell but it goes like mad!” Richard Feynman © Dan.

2 3 Parent Thread Fork Join Start End Child Threads Compute time Overhead.

Chapter 17 Shared-Memory Programming. Introduction OpenMP is an application programming interface (API) for parallel programming on multiprocessors. It.

OpenMP - Introduction Süha TUNA Bilişim Enstitüsü UHeM Yaz Çalıştayı

1 OpenMP Writing programs that use OpenMP. Using OpenMP to parallelize many serial for loops with only small changes to the source code. Task parallelism.

OpenMP OpenMP A.Klypin Shared memory and OpenMP Simple Example Threads Dependencies Directives Handling Common blocks Synchronization Improving load balance.

Lecture 8: OpenMP. Parallel Programming Models Parallel Programming Models: Data parallelism / Task parallelism Explicit parallelism / Implicit parallelism.

OpenMP – Introduction* *UHEM yaz çalıştayı notlarından derlenmiştir. (uhem.itu.edu.tr)

OpenMP Martin Kruliš Jiří Dokulil. OpenMP OpenMP Architecture Review Board Compaq, HP, Intel, IBM, KAI, SGI, SUN, U.S. Department of Energy,…

CS 838: Pervasive Parallelism Introduction to OpenMP Copyright 2005 Mark D. Hill University of Wisconsin-Madison Slides are derived from online references.

Programming with OpenMP* Intel Software College. Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or.

Work Replication with Parallel Region #pragma omp parallel { for ( j=0; j

OpenMP fundamentials Nikita Panov

High-Performance Parallel Scientific Computing 2008 Purdue University OpenMP Tutorial Seung-Jai Min School of Electrical and Computer.

Threaded Programming Lecture 4: Work sharing directives.

Introduction to OpenMP

Shared Memory Parallelism - OpenMP Sathish Vadhiyar Credits/Sources: OpenMP C/C++ standard (openmp.org) OpenMP tutorial (

9/22/2011CS4961 CS4961 Parallel Programming Lecture 9: Task Parallelism in OpenMP Mary Hall September 22,

3/12/2013Computer Engg, IIT(BHU)1 OpenMP-1. OpenMP is a portable, multiprocessing API for shared memory computers OpenMP is not a “language” Instead,

10/05/2010CS4961 CS4961 Parallel Programming Lecture 13: Task Parallelism in OpenMP Mary Hall October 5,

Special Topics in Computer Engineering OpenMP* Essentials * Open Multi-Processing.

Heterogeneous Computing using openMP lecture 2 F21DP Distributed and Parallel Technology Sven-Bodo Scholz.

CPE779: Shared Memory and OpenMP Based on slides by Laxmikant V. Kale and David Padua of the University of Illinois.

Programming with OpenMP*” Part II Intel Software College.

Tuning Threaded Code with Intel® Parallel Amplifier.

CS240A, T. Yang, Parallel Programming with OpenMP.

Programming with OpenMP* Intel Software College. Objectives Upon completion of this module you will be able to use OpenMP to:  implement data parallelism.

B. Estrade, LSU – High Performance Computing Enablement Group OpenMP II B. Estrade.

Introduction to OpenMP

Martin Kruliš Jiří Dokulil

Shared Memory Parallelism - OpenMP

Lecture 5: Shared-memory Computing with Open MP

CS427 Multicore Architecture and Parallel Computing

Parallelize Codes Using Intel Software

Auburn University COMP7330/7336 Advanced Parallel and Distributed Computing A bug in the rwlock program Dr. Xiao Qin.

Open[M]ulti[P]rocessing

Computer Engg, IIT(BHU)

Introduction to OpenMP

Programming with OpenMP*

Computer Science Department

Programming with OpenMP*

Introduction to High Performance Computing Lecture 20

Programming with Shared Memory Introduction to OpenMP

Introduction to OpenMP

OpenMP Martin Kruliš.

OpenMP Parallel Programming

Parallel Programming with OPENMP

Presentation transcript:

Programming with OpenMP* Intel Software College

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 2 Objectives Upon completion of this module you will be able to use OpenMP to: implement data parallelism implement task parallelism

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 3 Agenda What is OpenMP? Parallel regions Worksharing Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 4 What Is OpenMP? Portable, shared-memory threading API –Fortran, C, and C++ –Multi-vendor support for both Linux and Windows Standardizes task & loop-level parallelism Supports coarse-grained parallelism Combines serial and parallel code in single source Standardizes ~ 20 years of compiler- directed threading experience Current spec is OpenMP Pages (combined C/C++ and Fortran)

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 5 Programming Model Fork-Join Parallelism: Master thread spawns a team of threads as needed Parallelism is added incrementally: that is, the sequential program evolves into a parallel program Parallel Regions Master Thread

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 6 A Few Syntax Details to Get Started Most of the constructs in OpenMP are compiler directives or pragmas For C and C++, the pragmas take the form: #pragma omp construct [clause [clause]…] For Fortran, the directives take one of the forms: C$OMP construct [clause [clause]…] !$OMP construct [clause [clause]…] *$OMP construct [clause [clause]…] Header file or Fortran 90 module #include “omp.h” use omp_lib

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 7 Agenda What is OpenMP? Parallel regions Worksharing Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 8 Most OpenMP constructs apply to structured blocks Structured block: a block with one point of entry at the top and one point of exit at the bottom The only “branches” allowed are STOP statements in Fortran and exit() in C/C++ A structured blockNot a structured block Parallel Region & Structured Blocks (C/C++) if (go_now()) goto more; #pragma omp parallel { int id = omp_get_thread_num(); more: res[id] = do_big_job(id); if (conv (res[id]) goto done; goto more; } done: if (!really_done()) goto more; #pragma omp parallel { int id = omp_get_thread_num(); more: res[id] = do_big_job (id); if (conv (res[id]) goto more; } printf (“All done\n”);

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 9 Activity 1: Hello Worlds Modify the “Hello, Worlds” serial code to run multithreaded using OpenMP*

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 10 Agenda What is OpenMP? Parallel regions Worksharing – Parallel For Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 11 Worksharing Worksharing is the general term used in OpenMP to describe distribution of work across threads. Three examples of worksharing in OpenMP are: omp for construct omp sections construct omp task construct Automatically divides work among threads

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 12 omp for Construct Threads are assigned an independent set of iterations Threads must wait at the end of work-sharing construct #pragma omp parallel #pragma omp for Implicit barrier i = 1 i = 2 i = 3 i = 4 i = 5 i = 6 i = 7 i = 8 i = 9 i = 10 i = 11 i = 12 // assume N=12 #pragma omp parallel #pragma omp for for(i = 1, i < N+1, i++) c[i] = a[i] + b[i];

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 13 Combining constructs These two code segments are equivalent #pragma omp parallel { #pragma omp for for (i=0;i< MAX; i++) { res[i] = huge(); } #pragma omp parallel for for (i=0;i< MAX; i++) { res[i] = huge(); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 14 The Private Clause Reproduces the variable for each task Variables are un-initialized; C++ object is default constructed Any value external to the parallel region is undefined void* work(float* c, int N) { float x, y; int i; #pragma omp parallel for private(x,y) for(i=0; i<N; i++) { x = a[i]; y = b[i]; c[i] = x + y; }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 15 Activity 2 – Parallel Mandelbrot Objective: create a parallel version of Mandelbrot. Modify the code to add OpenMP worksharing clauses to parallelize the computation of Mandelbrot. Follow the next Mandelbrot activity called Mandelbrot in the student lab doc

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 16 The schedule clause The schedule clause affects how loop iterations are mapped onto threads schedule(static [,chunk]) Blocks of iterations of size “chunk” to threads Round robin distribution Low overhead, may cause load imbalance schedule(dynamic[,chunk]) Threads grab “chunk” iterations When done with iterations, thread requests next set Higher threading overhead, can reduce load imbalance schedule(guided[,chunk]) Dynamic schedule starting with large block Size of the blocks shrink; no smaller than “chunk”

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 17 Schedule Clause Example #pragma omp parallel for schedule (static, 8) for( int i = start; i <= end; i += 2 ) { if ( TestForPrime(i) ) gPrimesFound++; } Iterations are divided into chunks of 8 If start = 3, then first chunk is i ={3,5,7,9,11,13,15,17}

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 18 Activity 3 –Mandelbrot Scheduling Objective: create a parallel version of mandelbrot. That uses OpenMP dynamic scheduling Follow the next Mandelbrot activity called Mandelbrot Scheduling in the student lab doc

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 19 Agenda What is OpenMP? Parallel regions Worksharing – Parallel Sections Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 20 Task Decomposition a = alice(); b = bob(); s = boss(a, b); c = cy(); printf ("%6.2f\n", bigboss(s,c)); alice,bob, and cy can be computed in parallel alice bob boss bigboss cy

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 21 omp sections #pragma omp sections Must be inside a parallel region Precedes a code block containing of N blocks of code that may be executed concurrently by N threads Encompasses each omp section #pragma omp section Precedes each block of code within the encompassing block described above May be omitted for first parallel section after the parallel sections pragma Enclosed program segments are distributed for parallel execution among available threads

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 22 Functional Level Parallelism w sections #pragma omp parallel sections { #pragma omp section /* Optional */ a = alice(); #pragma omp section b = bob(); #pragma omp section c = cy(); } s = boss(a, b ); printf ("%6.2f\n", bigboss(s,c));

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 23 Advantage of Parallel Sections Independent sections of code can execute concurrently – reduce execution time SerialParallel #pragma omp parallel sections { #pragma omp section phase1(); #pragma omp section phase2(); #pragma omp section phase3(); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 24 Agenda What is OpenMP? Parallel regions Worksharing – Tasks Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 25 New Addition to OpenMP Tasks – Main change for OpenMP 3.0 Allows parallelization of irregular problems unbounded loops recursive algorithms producer/consumer

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 26 What are tasks? Tasks are independent units of work Threads are assigned to perform the work of each task Tasks may be deferred Tasks may be executed immediately The runtime system decides which of the above Tasks are composed of: code to execute data environment internal control variables (ICV) SerialParallel

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 27 Simple Task Example A pool of 8 threads is created here #pragma omp parallel // assume 8 threads { #pragma omp single private(p) { … while (p) { #pragma omp task { processwork(p); } p = p->next; } One thread gets to execute the while loop The single “while loop” thread creates a task for each instance of processwork()

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 28 Task Construct – Explicit Task View A team of threads is created at the omp parallel construct A single thread is chosen to execute the while loop – lets call this thread “L” Thread L operates the while loop, creates tasks, and fetches next pointers Each time L crosses the omp task construct it generates a new task and has a thread assigned to it Each task runs in its own thread All tasks complete at the barrier at the end of the parallel region’s single construct #pragma omp parallel { #pragma omp single { // block 1 node * p = head; while (p) { //block 2 #pragma omp task process(p); p = p->next; //block 3 }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 29 Why are tasks useful? #pragma omp parallel { #pragma omp single { // block 1 node * p = head; while (p) { //block 2 #pragma omp task process(p); p = p->next; //block 3 } Have potential to parallelize irregular patterns and recursive function calls Block 1 Block 2 Task 1 Block 2 Task 2 Block 2 Task 3 Block 3 Time Single Threaded Block 1 Block 3 Thr1 Thr2 Thr3 Thr4 Block 2 Task 2 Block 2 Task 1 Block 2 Task 3 Time Saved Idle

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 30 Activity 4 – Linked List using Tasks while(p != NULL){ do_work(p->data); p = p->next; } Objective: Modify the linked list pointer chasing code to implement tasks to parallelize the application Follow the Linked List task activity called LinkedListTask in the student lab doc Note: We also have a companion lab, that uses worksharing to solve the same problem LinkedListWorkSharing We also have task labs on recursive functions - examples quicksort & fibonacci

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 31 Tasks are gauranteed to be complete: At thread or task barriers At the directive: #pragma omp barrier At the directive: #pragma omp taskwait When are tasks gauranteed to be complete?

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 32 Task Completion Example #pragma omp parallel { #pragma omp task foo(); #pragma omp barrier #pragma omp single { #pragma omp task bar(); } Multiple foo tasks created here – one for each thread All foo tasks guaranteed to be completed here One bar task created here bar task guaranteed to be completed here

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 33 Agenda What is OpenMP? Parallel regions Worksharing Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 34 Data Scoping – What’s shared OpenMP uses a shared-memory programming model Shared variable - a variable whose name provides access to a the same block of storage for each task region Shared clause can be used to make items explicitly shared Global variables are shared among tasks C/C++: File scope variables, namespace scope variables, static variables, Variables with const-qualified type having no mutable member are shared, Static variables which are declared in a scope inside the construct are shared. Fortran: COMMON blocks, SAVE variables, MODULE variables

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 35 Data Scoping – What’s private But, not everything is shared... Examples of implicitly determined private variables: Stack (local) variables in functions called from parallel regions are PRIVATE Automatic variables within a statement block are PRIVATE Loop iteration variables are private Implicitly declared private variables within tasks will be treated as firstprivate Firstprivate clause declares one or more list items to be private to a task, and initializes each of them with a value

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 36 A Data Environment Example temp A, index, count temp A, index, count Which variables are shared and which variables are private? float A[10]; main () { integer index[10]; #pragma omp parallel { Work (index); } printf (“%d\n”, index[1]); } extern float A[10]; void Work (int *index) { float temp[10]; static integer count; } A, index, and count are shared by all threads, but temp is local to each thread

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 37 int fib ( int n ) { int x,y; if ( n < 2 ) return n; #pragma omp task x = fib(n-1); #pragma omp task y = fib(n-2); #pragma omp taskwait return x+y } Data Scoping Issue – fib example n is a private variable it will be treated as firstprivate in both tasks What’s wrong here? Can’t use private variables outside of tasks x is a private variable y is a private variable

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 38 int fib ( int n ) { int x,y; if ( n < 2 ) return n; #pragma omp task shared(x) x = fib(n-1); #pragma omp task shared(y) y = fib(n-2); #pragma omp taskwait return x+y; } n is firstprivate in both tasks x & y are shared Good solution we need both values to compute the sum Data Scoping Example – fib example

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 39 List ml; //my_list Element *e; #pragma omp parallel #pragma omp single { for(e=ml->first;e;e=e->next) #pragma omp task process(e); } Data Scoping Issue – List Traversal What’s wrong here? Possible data race ! Shared variable e updated by multiple tasks

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 40 List ml; //my_list Element *e; #pragma omp parallel #pragma omp single { for(e=ml->first;e;e=e->next) #pragma omp task firstprivate(e) process(e); } Good solution – e is firstprivate Data Scoping Example – List Traversal

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 41 List ml; //my_list Element *e; #pragma omp parallel #pragma omp single private(e) { for(e=ml->first;e;e=e->next) #pragma omp task process(e); } Good solution – e is firstprivate Data Scoping Example – List Traversal

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 42 List ml; //my_list #pragma omp parallel #pragma omp for For (int i=0; i<N; i++) { Element *e; for(e=ml->first;e;e=e- >next) #pragma omp task process(e); } Data Scoping Example – Multiple List Traversal Good solution – e is firstprivate

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 43 Agenda What is OpenMP? Parallel regions Worksharing Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 44 Example: Dot Product float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for shared(sum) for(int i=0; i<N; i++) { sum += a[i] * b[i]; } return sum; } What is Wrong?

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 45 Race Condition A race condition is nondeterministic behavior caused by the times at which two or more threads access a shared variable For example, suppose both Thread A and Thread B are executing the statement area += 4.0 / (1.0 + x*x);

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 46 Two Timings Value of area Thread AThread B Value of area Thread AThread B Order of thread execution causes non determinant behavior in a data race

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 47 Protect Shared Data Must protect access to shared, modifiable data float dot_prod(float* a, float* b, int N) { float sum = 0.0; #pragma omp parallel for shared(sum) for(int i=0; i<N; i++) { #pragma omp critical sum += a[i] * b[i]; } return sum; }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 48 #pragma omp critical [(lock_name)] Defines a critical region on a structured block OpenMP* Critical Construct float RES; #pragma omp parallel { float B; #pragma omp for for(int i=0; i<niters; i++){ B = big_job(i); #pragma omp critical (RES_lock) consum (B, RES); } } Threads wait their turn –at a time, only one calls consum() thereby protecting RES from race conditions Naming the critical construct RES_lock is optional Good Practice – Name all critical sections

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 49 OpenMP* Reduction Clause reduction (op : list) The variables in “list” must be shared in the enclosing parallel region Inside parallel or work-sharing construct: A PRIVATE copy of each list variable is created and initialized depending on the “op” These copies are updated locally by threads At end of construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 50 Reduction Example Local copy of sum for each thread All local copies of sum added together and stored in “global” variable #pragma omp parallel for reduction(+:sum) for(i=0; i<N; i++) { sum += a[i] * b[i]; }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 51 A range of associative operands can be used with reduction Initial values are the ones that make sense mathematically C/C++ Reduction Operations OperandInitial Value +0 *1 -0 ^0 OperandInitial Value &~0 |0 &&1 ||0

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 52 Numerical Integration Example (1+x 2 ) f(x) = X  4.0 (1+x 2 ) dx =  0 1 static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; for (i=0; i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf(“Pi = %f\n”,pi); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 53 Activity 5 - Computing Pi Parallelize the numerical integration code using OpenMP What variables can be shared? What variables need to be private? What variables should be set up for reductions? static long num_steps=100000; double step, pi; void main() { int i; double x, sum = 0.0; step = 1.0/(double) num_steps; for (i=0; i< num_steps; i++){ x = (i+0.5)*step; sum = sum + 4.0/(1.0 + x*x); } pi = step * sum; printf(“Pi = %f\n”,pi);}

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 54 Denotes block of code to be executed by only one thread First thread to arrive is chosen Implicit barrier at end Single Construct #pragma omp parallel { DoManyThings(); #pragma omp single { ExchangeBoundaries(); } // threads wait here for single DoManyMoreThings(); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 55 Denotes block of code to be executed only by the master thread No implicit barrier at end Master Construct #pragma omp parallel { DoManyThings(); #pragma omp master { // if not master skip to next stmt ExchangeBoundaries(); } DoManyMoreThings(); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 56 Implicit Barriers Several OpenMP* constructs have implicit barriers Parallel – necessary barrier – cannot be removed for single Unnecessary barriers hurt performance and can be removed with the nowait clause The nowait clause is applicable to: For clause Single clause

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 57 Nowait Clause Use when threads unnecessarily wait between independent computations #pragma single nowait { [...] } #pragma omp for nowait for(...) {...}; #pragma omp for schedule(dynamic,1) nowait for(int i=0; i<n; i++) a[i] = bigFunc1(i); #pragma omp for schedule(dynamic,1) for(int j=0; j<m; j++) b[j] = bigFunc2(j);

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 58 Barrier Construct Explicit barrier synchronization Each thread waits until all threads arrive #pragma omp parallel shared (A, B, C) { DoSomeWork(A,B); printf(“Processed A into B\n”); #pragma omp barrier DoSomeWork(B,C); printf(“Processed B into C\n”); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 59 Atomic Construct Special case of a critical section Applies only to simple update of memory location #pragma omp parallel for shared(x, y, index, n) for (i = 0; i < n; i++) { #pragma omp atomic x[index[i]] += work1(i); y[i] += work2(i); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 60 Agenda What is OpenMP? Parallel regions Worksharing Data environment Synchronization Curriculum Placement Optional Advanced topics

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 61 OpenMP materials suggested for: This material is suggested for Faculty of: Algorithms or Data Structure course C/C++/Fortran language courses Applied Math, Applied Science, Engineering programming courses where loops access arrays in C/C++/ or Fortran

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 62 What OpenMP essentials should be covered in university course? Discussion of what OpenMP is and what it can do Parallel regions Work sharing – parallel for Work queuing – tasks Shared & private variables Protection of shared data, critical sections, locks etc Reduction clause Optional topics nowait, atomic, first private, last private,… API – omp_set_num_threads(), get_num_threads()

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 63 Agenda What is OpenMP? Parallel regions Worksharing Data environment Synchronization Curriculum Placement Optional Advanced topics

Advanced Concepts

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 65 Parallel Construct – Implicit Task View Tasks are created in OpenMP even without an explicit task directive. Lets look at how tasks are created implicitly for the code snippet below Thread encountering parallel construct packages up a set of implicit tasks Team of threads is created. Each thread in team is assigned to one of the tasks (and tied to it). Barrier holds original master thread until all implicit tasks are finished. #pragma omp parallel { int mydata code } { int mydata; code… } { mydata code } { mydata code } { mydata code } Thread Barrier

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 66 Task Construct #pragma omp task [clause[[,]clause]...] structured-block if (expression) untied shared (list) private (list) firstprivate (list) default( shared | none ) where clause can be one of:

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 67 Tied & Untied Tasks Tied Tasks: A tied task gets a thread assigned to it at its first execution and the same thread services the task for its lifetime A thread executing a tied task, can be suspended, and sent of to execute some other task, but eventually, the same thread will return to resume execution of its original tied task Tasks are tied unless explicitly declared untied Untied Tasks: An united task has no long term association with any given thread. Any thread not otherwise occupied is free to execute an untied task. The thread assigned to execute an untied task may only change at a "task scheduling point". An untied task is created by appending “untied” to the task clause Example: #pragma omp task untied

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 68 Task switching task switching The act of a thread switching from the execution of one task to another task. The purpose of task switching is distribute threads among the unassigned tasks in the team to avoid piling up long queues of unassigned tasks Task switching, for tied tasks, can only occur at task scheduling points located within the following constructs encountered task constructs encountered taskwait constructs encountered barrier directives implicit barrier regions at the end of the tied task region Untied tasks have implementation dependent scheduling points

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 69 The thread executing the “for loop”, AKA the generating task, generates many tasks in a short time so... The SINGLE generating task will have to suspend for a while when “task pool” fills up Task switching is invoked to start draining the “pool” When “pool” is sufficiently drained – then the single task can being generating more tasks again Task switching example int exp; //exp either T or F; #pragma omp single { for (i=0; i<ONEZILLION; i++) #pragma omp task process(item[i]); }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 70 Optional foil - OpenMP* API Get the thread number within a team int omp_get_thread_num(void); Get the number of threads in a team int omp_get_num_threads(void); Usually not needed for OpenMP codes Can lead to code not being serially consistent Does have specific uses (debugging) Must include a header file #include

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 71 Optional foil - Monte Carlo Pi loop 1 to MAX x.coor=(random#) y.coor=(random#) dist=sqrt(x^2 + y^2) if (dist <= 1) hits=hits+1 pi = 4 * hits/MAX r

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 72 Optional foil - Making Monte Carlo’s Parallel hits = 0 call SEED48(1) DO I = 1, max x = DRAND48() y = DRAND48() IF (SQRT(x*x + y*y).LT. 1) THEN hits = hits+1 ENDIF END DO pi = REAL(hits)/REAL(max) * 4.0 What is the challenge here?

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 73 Optional Activity 6: Computing Pi Use the Intel® Math Kernel Library (Intel® MKL) VSL: Intel MKL’s VSL (Vector Statistics Libraries) VSL creates an array, rather than a single random number VSL can have multiple seeds (one for each thread) Objective: Use basic OpenMP* syntax to make Pi parallel Choose the best code to divide the task up Categorize properly all variables

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 74 Variables initialized from shared variable C++ objects are copy-constructed Firstprivate Clause incr=0; #pragma omp parallel for firstprivate(incr) for (I=0;I<=MAX;I++) { if ((I%2)==0) incr++; A(I)=incr; }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 75 Variables update shared variable using value from last iteration C++ objects are updated as if by assignment Lastprivate Clause void sq2(int n, double *lastterm) { double x; int i; #pragma omp parallel #pragma omp for lastprivate(x) for (i = 0; i < n; i++){ x = a[i]*a[i] + b[i]*b[i]; b[i] = sqrt(x); } lastterm = x; }

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 76 Preserves global scope for per-thread storage Legal for name-space-scope and file-scope Use copyin to initialize from master thread Threadprivate Clause struct Astruct A; #pragma omp threadprivate(A) … #pragma omp parallel copyin(A) do_something_to(&A); … #pragma omp parallel do_something_else_to(&A); Private copies of “A” persist between regions

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners Library Routines Runtime environment routines: Modify/check the number of threads omp_[set|get]_num_threads() omp_get_thread_num() omp_get_max_threads() Are we in a parallel region? omp_in_parallel() How many processors in the system? omp_get_num_procs() Explicit locks omp_[set|unset]_lock() And many more...

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 78 Library Routines To fix the number of threads used in a program Set the number of threads Then save the number returned #include void main () { int num_threads; omp_set_num_threads (omp_num_procs ()); #pragma omp parallel { int id = omp_get_thread_num (); #pragma omp single num_threads = omp_get_num_threads (); do_lots_of_stuff (id); } } Request as many threads as you have processors. Protect this operation because memory stores are not atomic

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 79

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 80 Agenda What is OpenMP? Runtime functions/environment variables Parallel regions Worksharing/Workqueuing Data environment Synchronization Other Helpful Constructs and Clauses

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 81 Environment Variables Set the default number of threads OMP_NUM_THREADS integer Set the default scheduling protocol OMP_SCHEDULE “schedule[, chunk_size]” Enable dynamic thread adjustment OMP_DYNAMIC [TRUE|FALSE] Enable nested parallelism OMP_NESTED [TRUE|FALSE]

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners Library Routines Runtime environment routines: Modify/check the number of threads omp_[set|get]_num_threads() omp_get_thread_num() omp_get_max_threads() Are we in a parallel region? omp_in_parallel() How many processors in the system? omp_get_num_procs() Explicit locks omp_[set|unset]_lock() And many more... In this course, focuses on the directives only approach to OpenMP – which makes incremental parallelism easy

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 83 What is a Thread? An execution entity with a stack and associated static memory, called threadprivate memory..

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 84 Load Imbalance Unequal work loads lead to idle threads and wasted time. time Busy Idle #pragma omp parallel { #pragma omp for for( ; ; ){ } time

Copyright © 2008, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners. 85 #pragma omp parallel { #pragma omp critical {... }... } Synchronization Lost time waiting for locks time Busy Idle In Critical