Presentation transcript:

1 Friday, November 10, 2006 “Programs for sale: Fast, Reliable, Cheap: choose two.” -Anonymous

2
#pragma omp parallel for
for(i=1; i<=n; i++) {
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}

3 When the private clause is encountered, a separate memory location is allocated for each specified variable on each thread. The value of the variable is not initialized; a location is simply set aside for each thread. Private variables must be initialized within the loop.
#pragma omp parallel for private(temp)
for(i=1; i<=n; i++) {
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}

4
#pragma omp parallel for private(temp)
for(i=1; i<=n; i++) {
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}

#pragma omp parallel for private(temp)
{
  for(i=1; i<=n; i++) {
    temp = 2.0*a[i];
    a[i] = temp;
    b[i] = c[i]/temp;
  }
}
The Intel compiler gives an error for the second form: the combined parallel for directive must be followed immediately by the for loop itself, not by a compound { } block.

5 Note that the loop index, i, is always private no matter what the default.
#pragma omp parallel for default(private) shared(n,a,b,c)
for(i=1; i<=n; i++) {
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}
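An aside not on the slide: default(private) has traditionally been a Fortran-only form of the clause, and many C/C++ compilers accept only default(shared) or default(none). default(none) is a useful defensive choice because it forces every variable's scope to be stated explicitly; a minimal sketch of the same loop under that assumption:

#pragma omp parallel for default(none) shared(n, a, b, c) private(temp)
for(i=1; i<=n; i++) {
  temp = 2.0*a[i];
  a[i] = temp;
  b[i] = c[i]/temp;
}

(The loop index i is predetermined private, so it does not need to be listed.)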

6
j = jstart;
#pragma omp parallel for private(j)
for(i=1; i<=n; i++){
  if(i == 1 || i == n) j = j + 1;
  a[i] = a[i] + j;
}
What is wrong here?

7
j = jstart;
#pragma omp parallel for firstprivate(j)
for(i=1; i<=n; i++){
  if(i == 1 || i == n) j = j + 1;
  a[i] = a[i] + j;
}
The firstprivate clause initializes each thread's private copy of j with the value it had before the loop (jstart), whereas private leaves the copies uninitialized.

8
#pragma omp parallel for lastprivate(x)
for(i=1; i<=n; i++) {
  x = sin( pi * dx * (float)i );
  a[i] = exp(x);
}
lastx = x;
A parallel for pragma may have both firstprivate and lastprivate clauses, and they may have variables in common.
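A hedged sketch (not from the slides) of both clauses naming the same variable; threshold, a, b, and lastx are assumed names, and the point is only the clause mechanics: every thread's copy of x starts from the serial value (firstprivate), and after the loop x holds the value computed in the final iteration (lastprivate).

x = threshold;
#pragma omp parallel for firstprivate(x) lastprivate(x)
for(i=1; i<=n; i++) {
  if (a[i] > x) x = a[i];   /* per-thread running maximum, seeded with threshold */
  b[i] = x;
}
lastx = x;                  /* value from iteration i == n */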

9 Environment variable OMP_NUM_THREADS
In bash: export OMP_NUM_THREADS=4
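As a quick check (not on the slide), the setting can be confirmed from inside a program before any parallel region; omp_get_max_threads() reports the team size the next parallel region would use:

#include <stdio.h>
#include <omp.h>

int main(void){
  /* Reflects OMP_NUM_THREADS (or a prior omp_set_num_threads call). */
  printf("Next parallel region will use up to %d threads\n", omp_get_max_threads());
  return 0;
}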

10
#include <stdio.h>
#include <omp.h>

int main(void){
  int i, tid, numthreads, numprocs, size=10;
  numthreads = omp_get_num_threads();
  numprocs = omp_get_num_procs();
  printf("Numthreads before for-loop is=%d numprocs=%d\n", numthreads, numprocs);
  #pragma omp parallel for private(tid) schedule(static,2)
  for(i=0; i<size; i++){
    numthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    printf("Numthreads after for-loop is=%d\n", numthreads);
    printf("I am thread=%d, I have iteration=%d\n", tid, i);
  }
  return 0;
}

11 omp_get_thread_num
§ Returns the thread rank in a parallel region.
§ The rank of threads ranges from 0 to omp_get_num_threads() - 1.

12 To compile: icc -openmp myprog.c -o myprog
Note: Log out of past sessions so that the compiler environment settings take effect.
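If GCC is used instead of the Intel compiler (an assumption; the slide mentions only icc), the equivalent flag is -fopenmp:

gcc -fopenmp myprog.c -o myprog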

13 §Difference between front-end and compute nodes.

14
Numthreads before for-loop is=1 numprocs=4
Numthreads after for-loop is=4
I am thread=0, I have iteration=0
Numthreads after for-loop is=4
I am thread=0, I have iteration=1
Numthreads after for-loop is=4
I am thread=0, I have iteration=8
Numthreads after for-loop is=4
I am thread=0, I have iteration=9
Numthreads after for-loop is=4
I am thread=1, I have iteration=2
Numthreads after for-loop is=4
I am thread=1, I have iteration=3
Numthreads after for-loop is=4
I am thread=2, I have iteration=4
Numthreads after for-loop is=4
I am thread=2, I have iteration=5
Numthreads after for-loop is=4
I am thread=3, I have iteration=6
Numthreads after for-loop is=4
I am thread=3, I have iteration=7

15 void omp_set_num_threads(int t)
§ Uses the parameter value t to set the number of threads to be active in parallel sections of code.
§ Gives us the ability to tailor the level of parallelism.
§ Call omp_set_num_threads prior to the beginning of a parallel region for it to take effect.
§ The result is undefined if this routine is called within a parallel region.

16
#include <stdio.h>
#include <omp.h>

int main(void){
  int i, tid, size=10;
  omp_set_num_threads(4);
  #pragma omp parallel for private(tid) schedule(static,2)
  for(i=0; i<size; i++){
    tid = omp_get_thread_num();
    printf("I am thread=%d, I have iteration=%d\n", tid, i);
  }
  return 0;
}

17
I am thread=0, I have iteration=0
I am thread=0, I have iteration=1
I am thread=0, I have iteration=8
I am thread=0, I have iteration=9
I am thread=1, I have iteration=2
I am thread=1, I have iteration=3
I am thread=2, I have iteration=4
I am thread=2, I have iteration=5
I am thread=3, I have iteration=6
I am thread=3, I have iteration=7

18
#include <stdio.h>
#include <omp.h>

int main(void){
  int i, tid, size=10;
  #pragma omp parallel for private(tid) schedule(static,2) num_threads(4)
  for(i=0; i<size; i++){
    tid = omp_get_thread_num();
    printf("I am thread=%d, I have iteration=%d\n", tid, i);
  }
  return 0;
}

19 What is wrong here?
double area, pi, x;
int i, n;
//....
area = 0.0;
#pragma omp parallel for private(x)
for (i=0; i<n; i++){
  x = (i+0.5)/n;
  area += 4.0/(1.0+x*x);
}
pi = area/n;

20 What is wrong here?
double area, pi, x;
int i, n;
//....
area = 0.0;
#pragma omp parallel for private(x)
for (i=0; i<n; i++){
  x = (i+0.5)/n;
  area += 4.0/(1.0+x*x);   // Race Condition
}
pi = area/n;

21
double area, pi, x;
int i, n;
//....
area = 0.0;
#pragma omp parallel for private(x)
for (i=0; i<n; i++){
  x = (i+0.5)/n;
  #pragma omp critical
  area += 4.0/(1.0+x*x);
}
pi = area/n;

22
double area, pi, x;
int i, n;
//....
area = 0.0;
#pragma omp parallel for private(x)
for (i=0; i<n; i++){
  x = (i+0.5)/n;
  #pragma omp critical
  area += 4.0/(1.0+x*x);
}
pi = area/n;
This can affect speedup: the critical section serializes every single update of area.
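One common way to limit that serialization (a sketch, not taken from the slides) is to accumulate into a private partial sum and enter the critical section only once per thread; the reduction clause on the next slides achieves the same effect more concisely.

double area, pi, x;
int i, n;
//....
area = 0.0;
#pragma omp parallel private(x)
{
  double my_area = 0.0;           /* per-thread partial sum */
  #pragma omp for
  for (i=0; i<n; i++){
    x = (i+0.5)/n;
    my_area += 4.0/(1.0+x*x);
  }
  #pragma omp critical
  area += my_area;                /* one serialized update per thread */
}
pi = area/n;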

23
for(i=1; i<=n; i++){
  sum = sum + a[i];
}

24
#pragma omp parallel for reduction(+:sum)
for(i=1; i<=n; i++){
  sum = sum + a[i];
}

25 Different reduction operations and their initial values:
Operator   Initial value
+          0
*          1
&          all bits 1
|          0
^          0
&&         1
||         0
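A hedged illustration (not from the slides) of one of the non-arithmetic operators: a logical-AND reduction that tests whether every element of a[] is positive.

int all_positive = 1;
#pragma omp parallel for reduction(&&:all_positive)
for (i=0; i<n; i++){
  all_positive = all_positive && (a[i] > 0.0);
}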

26
double area, pi, x;
int i, n;
//....
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i=0; i<n; i++){
  x = (i+0.5)/n;
  area += 4.0/(1.0+x*x);
}
pi = area/n;

27 Conditional execution of loops
double area, pi, x;
int i, n;
//....
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area) if(n>5000)
for (i=0; i<n; i++){
  x = (i+0.5)/n;
  area += 4.0/(1.0+x*x);
}
pi = area/n;

28
for(i=1; i<=n; i++) {
  myval = do_lots_of_work(i);
  printf("%d %d\n", i, myval);
}

29
#pragma omp parallel for private(myval) ordered
for(i=1; i<=n; i++) {
  myval = do_lots_of_work(i);
  #pragma omp ordered
  {
    printf("%d %d\n", i, myval);
  }
}
Note: The opening curly brace may not appear on the same line as the ordered directive.

30
for (i=0; i<BLOCK_SIZE(id,p,n); i++)
  for(j=0; j<n; j++)
    a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);
§ If we parallelize the inner loop, what would happen?

31
for (i=0; i<BLOCK_SIZE(id,p,n); i++)
  for(j=0; j<n; j++)
    a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);
§ Parallelizing the inner loop incurs fork/join overhead on every iteration of the outer loop.
§ If we parallelize the outer loop instead, we incur the fork/join overhead only once.

32
#pragma omp parallel for
for (i=0; i<BLOCK_SIZE(id,p,n); i++)
  for(j=0; j<n; j++)
    a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);

33
#pragma omp parallel for
for (i=0; i<BLOCK_SIZE(id,p,n); i++)
  for(j=0; j<n; j++)
    a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);
Problem here!

34
#pragma omp parallel for private(j)
for (i=0; i<BLOCK_SIZE(id,p,n); i++)
  for(j=0; j<n; j++)
    a[i][j] = MIN(a[i][j], a[i][k]+tmp[j]);
Only the outer-loop index i is private by default; j must be declared private explicitly, otherwise all threads race on the shared inner-loop counter.

35 Scheduling loops
§ n is the number of iterations and t is the number of threads.
§ schedule(static): static allocation of about n/t contiguous iterations to each thread.
§ schedule(static, C): allocation of chunks to threads; each chunk contains C contiguous iterations, assigned in round-robin fashion.
§ schedule(dynamic): iterations are dynamically allocated, one at a time, to threads.

36 Scheduling loops
§ schedule(dynamic, C): iterations are dynamically allocated, C iterations at a time, to threads.
§ schedule(guided, C): also called guided self-scheduling. The first chunk is an implementation-dependent size, and each successive chunk is a fixed fraction of the preceding chunk, until the minimum chunk size C is reached.
§ schedule(guided): same as above, with C taken to be 1.
§ schedule(runtime): the schedule type is chosen at run time from the environment variable OMP_SCHEDULE, e.g. export OMP_SCHEDULE="static,1"
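A hedged sketch (not from the slides) of where a non-static schedule pays off: when iteration cost varies, as with the do_lots_of_work loop of slide 28, dynamic chunks keep the threads balanced. The result array is an assumed name.

#pragma omp parallel for schedule(dynamic, 4)
for(i=1; i<=n; i++) {
  result[i] = do_lots_of_work(i);   /* per-iteration cost varies, so chunks of 4 are handed out on demand */
}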

37 In OpenMP, there are two main approaches for assigning work to threads: loop-level and parallel regions.
§ In the first approach, loop-level, individual loops are parallelized with each thread being assigned a unique range of the loop index.
§ In parallel regions, any section of the code can be parallelized, not just loops. The work within the parallel regions is explicitly distributed among the threads using the unique identifier assigned to each thread. This can be done with if statements, e.g., if(mythreadid == 0) …, as sketched below.
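A minimal sketch of that parallel-regions style (not from the slides; setup_work and compute_work are hypothetical names):

#pragma omp parallel
{
  int mythreadid = omp_get_thread_num();   /* unique per-thread identifier */
  if (mythreadid == 0) {
    setup_work();                          /* done by thread 0 only */
  }
  #pragma omp barrier
  compute_work(mythreadid);                /* every thread does its own share */
}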

38 §In the loop-level approach, execution starts on a single thread. Then, when a parallel loop is encountered, multiple threads are spawned. When the parallel loop is finished, the extra threads are discarded and the execution is once again serial until the next parallel loop. §In the parallel-regions approach, multiple threads are maintained, irrespective of whether or not loops are encountered.

39
§ When using the parallel directive, the entire region of code within the braces is duplicated on all threads.
§ This allows more flexibility than restricting parallelism to loops, and we can parallelize code in a manner much like that used with MPI.

40
#pragma omp parallel for
for(i=1; i<=maxi; i++) {
  a[i] = b[i];
}

#pragma omp parallel
#pragma omp for
for(i=1; i<=maxi; i++) {
  a[i] = b[i];
}

41
#include <stdio.h>
#include <omp.h>

int main(void){
  int i, tid, numthreads, size=10;
  #pragma omp parallel for private(tid) schedule(static,2)
  for(i=0; i<size; i++){
    numthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    printf("Numthreads after for-loop is=%d\n", numthreads);
    printf("I am thread=%d, I have iteration=%d\n", tid, i);
  }
  return 0;
}

42
#include <stdio.h>
#include <omp.h>

int main(void){
  int i, tid, numthreads, size=10;
  #pragma omp parallel private(tid)
  {
    numthreads = omp_get_num_threads();
    tid = omp_get_thread_num();
    printf("Numthreads after for-loop is=%d\n", numthreads);
    #pragma omp for schedule(static,2)
    for(i=0; i<size; i++){
      printf("I am thread=%d, I have iteration=%d\n", tid, i);
    }
  }
  return 0;
}

43
Numthreads after for-loop is=4
I am thread=0, I have iteration=0
I am thread=0, I have iteration=1
I am thread=0, I have iteration=8
I am thread=0, I have iteration=9
Numthreads after for-loop is=4
I am thread=1, I have iteration=2
I am thread=1, I have iteration=3
Numthreads after for-loop is=4
I am thread=3, I have iteration=6
I am thread=3, I have iteration=7
I am thread=2, I have iteration=4
I am thread=2, I have iteration=5

44
#pragma omp parallel
{
  #pragma omp for
  for(i=1; i<=maxi; i++){
    a[i] = b[i];
  }
  #pragma omp for
  for(j=1; j<=maxj; j++){
    c[j] = d[j];
  }
}
Advantage when there are multiple loops: the thread team is forked and joined only once.

45
§ There is an implied barrier at the end of a loop with the for directive.
§ If a barrier is not desired, the nowait clause can be used.

46
#pragma omp parallel
{
  #pragma omp for nowait
  for(i=1; i<=maxi; i++){
    a[i] = b[i];
  }
  #pragma omp for
  for(j=1; j<=maxj; j++){
    c[j] = d[j];
  }
}

47 §In the loop-level approach, domain decomposition is performed automatically by distributing loop indices among the threads. §In the parallel regions approach, domain decomposition is performed manually.

48
#pragma omp parallel private(i,myid,istart,iend,nthreads,nper)
{
  nthreads = omp_get_num_threads();
  nper = imax/nthreads;          /* assumes imax is divisible by nthreads */
  myid = omp_get_thread_num();
  istart = myid*nper + 1;
  iend = istart + nper - 1;
  do_work(istart,iend);
  for(i=istart; i<=iend; i++)    /* i is private: each thread runs its own copy of the loop */
    a[i] = b[i]*c[i];
}
The size of the arrays is imax, and nper is the number of indices per thread.

49
§ Master/slave programming in the message-passing model.
§ The shared-memory model allows each thread to access the list of tasks, so there is no need for a separate master thread.

50 Pseudo-code:
int main(){
  struct task_struct *task_ptr;
  …
  task_ptr = get_next_task();
  while(task_ptr != NULL){
    complete_task(task_ptr);
    task_ptr = get_next_task();
  }
  …
}

51
int main(){
  struct task_struct *task_ptr;
  …
  #pragma omp parallel private(task_ptr)
  {
    task_ptr = get_next_task();
    while(task_ptr != NULL){
      complete_task(task_ptr);
      task_ptr = get_next_task();
    }
  }
  …
}

struct task_struct *get_next_task() {
  #pragma omp critical
  {
    // here the shared linked list is modified
  }
}
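A hedged sketch (not from the slides) of what the critical section inside get_next_task() might contain; task_list_head and the next field are assumed names for the shared linked list:

struct task_struct *get_next_task(void){
  struct task_struct *t;
  #pragma omp critical (task_list)
  {
    t = task_list_head;             /* take the first task, if any         */
    if (t != NULL)
      task_list_head = t->next;     /* unlink it from the shared list      */
  }
  return t;                         /* NULL signals that the list is empty */
}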

52 SELF-TEST §Example 7.11 §Example 7.12

53 SELF TEST: Matrix Multiplication
/* static scheduling of matrix multiplication loops */
#pragma omp parallel default(private) shared(a, b, c, dim) num_threads(4)
#pragma omp for schedule(static)
for (i = 0; i < dim; i++) {
  for (j = 0; j < dim; j++) {
    c(i, j) = 0;
    for (k = 0; k < dim; k++) {
      c(i, j) += a(i, k) * b(k, j);
    }
  }
}
(Textbook-style pseudocode: a(i, k) stands for element indexing, which in C would be a[i][k]; note also that many C/C++ compilers accept only default(shared) or default(none).)

54
v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
epsilon(x, y);
(Dependence graph: alpha and beta feed gamma; gamma and delta feed epsilon.)

55
§ The parallel sections pragma precedes a block of k blocks of code that may be executed concurrently by k threads.
§ The section pragma appears inside the enclosing parallel sections block and identifies the individual blocks within it.
§ There is an implied barrier at the end of the sections block.

56
#pragma omp parallel sections
{
  #pragma omp section /* optional pragma */
  v = alpha();
  #pragma omp section
  w = beta();
  #pragma omp section
  y = delta();
}
x = gamma(v,w);
epsilon(x,y);

57
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section /* optional pragma */
    v = alpha();
    #pragma omp section
    w = beta();
  }
  #pragma omp sections
  {
    #pragma omp section /* optional pragma */
    x = gamma(v,w);
    #pragma omp section
    y = delta();
  }
}
epsilon(x,y);

58
#pragma omp parallel
{
  #pragma omp sections
  {
    #pragma omp section /* optional pragma */
    v = alpha();
    #pragma omp section
    w = beta();
  }
  #pragma omp sections
  {
    #pragma omp section /* optional pragma */
    x = gamma(v,w);
    #pragma omp section
    y = delta();
  }
}
epsilon(x,y);
Better if only two processors are available: each sections block contains exactly two independent calls, so both processors stay busy in each phase.

59
§ #pragma omp single: the block is executed by exactly one thread (for example, for printing error messages); there is an implied barrier at the end.
§ #pragma omp master: the block is executed only by the master thread, and there is no implied barrier.
§ omp_set_dynamic and OMP_DYNAMIC: enable or disable dynamic adjustment of the number of threads.
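A hedged sketch (not from the slides) contrasting the two directives: any one thread may execute the single block, while only the master thread (rank 0) executes the master block.

#include <stdio.h>
#include <omp.h>

int main(void){
  #pragma omp parallel
  {
    #pragma omp single
    printf("printed once, by whichever thread reaches it first\n");   /* implied barrier follows */

    #pragma omp master
    printf("printed once, by thread %d (the master); no implied barrier\n", omp_get_thread_num());
  }
  return 0;
}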