Programming with OpenMP* Part II, Intel Software College


Agenda
What is OpenMP?
Parallel Regions
Worksharing Construct
Data Scoping to Protect Data
Explicit Synchronization
Scheduling Clauses
Other Helpful Constructs and Clauses

Assigning Iterations
The schedule clause affects how loop iterations are mapped onto threads.
schedule(static [,chunk])
    Blocks of iterations of size "chunk" assigned to threads
    Round-robin distribution
schedule(dynamic [,chunk])
    Threads grab "chunk" iterations at a time
    When done with its iterations, a thread requests the next set
schedule(guided [,chunk])
    Dynamic schedule starting with large blocks
    Block size shrinks, but never below "chunk"

Which Schedule to Use
Schedule Clause    When To Use
STATIC             Predictable and similar work per iteration
DYNAMIC            Unpredictable, highly variable work per iteration
GUIDED             Special case of dynamic to reduce scheduling overhead

Schedule Clause Example

#pragma omp parallel for schedule(static, 8)
for (int i = start; i <= end; i += 2) {
    if (TestForPrime(i))
        gPrimesFound++;   // note: this shared update still needs atomic/critical protection or a reduction
}

Iterations are divided into chunks of 8.
If start = 3, the first chunk is i = {3, 5, 7, 9, 11, 13, 15, 17}.
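For contrast, a minimal sketch (not from the original deck) of the same loop with a dynamic schedule; the loop bounds and the TestForPrime helper are taken from the slide above, and the shared counter is protected here with a reduction clause:

// Hypothetical variant: dynamic scheduling plus a reduction on the counter
long gPrimesFound = 0;
#pragma omp parallel for schedule(dynamic, 8) reduction(+:gPrimesFound)
for (int i = start; i <= end; i += 2) {
    if (TestForPrime(i))      // helper from the slide above
        gPrimesFound++;       // each thread counts privately; sums are combined at the end
}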

Parallel Sections
Independent sections of code can execute concurrently.

#pragma omp parallel sections
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}

Parallel Sections (continued)
Sections are distributed among the threads in the parallel team. Each section is executed exactly once, and each thread may execute zero or more sections.
It is not possible to know whether one section will execute before another. Therefore, the output of one section should not serve as the input to another; instead, the code that generates the output should be moved ahead of the sections construct, as sketched below.
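As an illustration (not from the original slides), a minimal sketch of that restructuring, assuming a hypothetical prepare_input() helper and phase functions that take the shared data as an argument:

// Hypothetical restructuring: produce the shared input before the sections construct
double *data = prepare_input();   // assumed helper; runs once, serially

#pragma omp parallel sections
{
    #pragma omp section
    phase1(data);                 // each section only reads data
    #pragma omp section
    phase2(data);
    #pragma omp section
    phase3(data);
}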

Single Construct
Denotes a block of code to be executed by only one thread.
The first thread to arrive is chosen.
Implicit barrier at the end.

#pragma omp parallel
{
    DoManyThings();
    #pragma omp single
    {
        ExchangeBoundaries();
    }   // threads wait here for single (implicit barrier)
    DoManyMoreThings();
}

Master Construct
Denotes a block of code to be executed only by the master thread.
No implicit barrier at the end.

#pragma omp parallel
{
    DoManyThings();
    #pragma omp master
    {   // threads other than the master skip to the next statement
        ExchangeBoundaries();
    }
    DoManyMoreThings();
}

Implicit Barriers
Several OpenMP* constructs have implicit barriers: parallel, for, single.
Unnecessary barriers hurt performance; waiting threads accomplish no work.
Suppress implicit barriers, when safe, with the nowait clause.

Nowait Clause
Use when threads would otherwise wait between independent computations.

#pragma omp single nowait
{ [...] }

#pragma omp for nowait
for (...) { ... }

#pragma omp for schedule(dynamic,1) nowait
for (int i = 0; i < n; i++)
    a[i] = bigFunc1(i);

#pragma omp for schedule(dynamic,1)
for (int j = 0; j < m; j++)
    b[j] = bigFunc2(j);

Need an Explicit Barrier Construct

#pragma omp barrier
Each thread waits until all threads arrive.

Consider the following code segment:

#pragma omp parallel shared(A, B, C)
{
    DoSomeWork(A, B);
    printf("Processed A into B\n");
    DoSomeWork(B, C);
    printf("Processed B into C\n");
}

What is wrong?
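One likely intended fix (a sketch, not necessarily the deck's exact answer): DoSomeWork(B, C) in one thread may read portions of B that another thread has not yet finished writing, so an explicit barrier belongs between the two phases:

#pragma omp parallel shared(A, B, C)
{
    DoSomeWork(A, B);                 // every thread produces its part of B
    printf("Processed A into B\n");
    #pragma omp barrier               // wait until all of B has been written
    DoSomeWork(B, C);                 // now safe to consume B
    printf("Processed B into C\n");
}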

More on Synchronization
Consider the following code segment:

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    x[index[i]] += work1(i);
    y[i] += work2(i);
}

What is wrong? Since index[i] can be the same for different values of i, the update to x must be protected.
Example:
    i=0, index[0]=5, x[5] = 82
    i=1, index[1]=5, x[5] = 12
There will be a race condition between threads 0 and 1.

Race Condition
A race condition is nondeterministic behavior caused by the relative timing with which two or more threads access a shared variable.
For example, suppose both Thread A and Thread B are executing the statement:
    area += 4.0 / (1.0 + y*y);

Two Timings: Possibility 1 and Possibility 2
[Two timing tables from the slides: the value of area as Thread A and Thread B interleave their read-modify-write of the shared variable under two different orderings.]
The order of thread execution causes nondeterministic behavior in a data race.
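To make the point concrete, a minimal sketch (not part of the original slides) of the racy update and two standard ways to protect it; f() and n are hypothetical stand-ins for the per-iteration value and trip count:

double area = 0.0;

// Racy version: two threads can read the same old value of area and one update is lost.
// #pragma omp parallel for shared(area)
// for (int i = 0; i < n; i++) { double y = f(i); area += 4.0 / (1.0 + y*y); }

// Fix 1: let OpenMP keep a private partial sum per thread and combine them at the end.
#pragma omp parallel for reduction(+:area)
for (int i = 0; i < n; i++) {
    double y = f(i);
    area += 4.0 / (1.0 + y*y);
}

// Fix 2: serialize just the update (correct, but slower than a reduction).
// #pragma omp parallel for shared(area)
// for (int i = 0; i < n; i++) {
//     double y = f(i);
//     #pragma omp atomic
//     area += 4.0 / (1.0 + y*y);
// }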

Using a Critical Section

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp critical
    x[index[i]] += work1(i);
    y[i] += work2(i);
}

This serializes the updates of x: threads wait their turn, and only one at a time updates x.
However, this may be overkill: not all updates to all elements of x need to be protected.

Using a Critical Section (continued)

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp critical
    x[index[i]] += work1(i);
    y[i] += work2(i);
}

Example:
    i=0, index[0]=5,  x[5]  = 82
    i=1, index[1]=29, x[29] = 12
    i=2, index[2]=8,  x[8]  = 435
    i=3, index[3]=5,  x[5]  = 12
Only the updates to x[5] need to be protected.

Atomic Construct
Special case of a critical section.
Applies only to a simple update of a memory location.

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
}

Atomic protects individual elements of the x array, so that when concurrent instances of index[i] are different, the updates can still proceed in parallel.

Atomic Construct (continued)

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
    #pragma omp atomic
    x[index[i]] += work1(i);
    y[i] += work2(i);
}

In the previous example:
    i=0, index[0]=5,  x[5]  = 82
    i=1, index[1]=29, x[29] = 12
    i=2, index[2]=8,  x[8]  = 435
    i=3, index[3]=5,  x[5]  = 12
Updates to x[29] and x[8] proceed in parallel; updates to x[5] are serialized.

OpenMP* API
Get the thread number within a team: int omp_get_thread_num(void);
Get the number of threads in a team: int omp_get_num_threads(void);
Usually not needed for OpenMP codes, but has specific uses (e.g., debugging).
Must include the header file: #include <omp.h>
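A minimal, self-contained sketch (not from the deck) showing both calls; note that omp_get_num_threads() returns 1 when called outside a parallel region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int id       = omp_get_thread_num();    // this thread's number, 0..nthreads-1
        int nthreads = omp_get_num_threads();   // size of the current team
        printf("Hello from thread %d of %d\n", id, nthreads);
    }
    return 0;
}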

Workshop on Patterns in High Performance Computing
University of Illinois at Urbana-Champaign, May 4-6, 2005

What Makes HPC Applications Challenging (MITRE / ISI / MIT Lincoln Laboratory)
Benchmarking Working Group Session Agenda
1:00-1:15   David Koester    What Makes HPC Applications Challenging?
1:15-1:30   Piotr Luszczek   HPCchallenge Challenges
1:30-1:45   Fred Tracy       Algorithm Comparisons of Application Benchmarks
1:45-2:00   Henry Newman     I/O Challenges
2:00-2:15   Phil Colella     The Seven Dwarfs
2:15-2:30   Glenn Luecke     Run-Time Error Detection Benchmark
2:30-3:00   Break
3:00-3:15   Bill Mann        SSCA #1 Draft Specification
3:15-3:30   Theresa Meuse    SSCA #6 Draft Specification
3:30-??     Discussions: User Needs, HPCS Vendor Needs for the MS4 Review, HPCS Vendor Needs for the MS5 Review, HPCS Productivity Team Working Groups

Killer Kernels: Phil Colella's "Seven Dwarfs"
Algorithms that consume the bulk of the cycles of current high-end systems in DOE:
    Structured Grids (including locally structured grids, e.g. AMR)
    Unstructured Grids
    Fast Fourier Transform
    Dense Linear Algebra
    Sparse Linear Algebra
    Particles
    Monte Carlo
(Should also include optimization / solution of nonlinear systems, which at the high end is used mainly in conjunction with the other seven.)

Monte Carlo Simulations
Monte Carlo (MC) stochastic methods are based on the use of random numbers and probability statistics to investigate problems in many fields, such as economics, nuclear physics, and many others.

Monte Carlo Simulations (continued)
Solving the equations that describe the interactions between hundreds or thousands of atoms directly is impossible. With MC, however, a large system can be sampled in a number of random configurations, and the resulting data can be used to describe the system as a whole.

Example: Calculating Pi with a MC Simulation
Consider a player throwing darts randomly at a square board.
The number of darts that hit the shaded part (a circle quadrant inscribed in the square) is proportional to the area of that part.

Monte Carlo Pi

loop 1 to MAX
    x.coor = (random #)
    y.coor = (random #)
    dist = sqrt(x^2 + y^2)
    if (dist <= 1) hits = hits + 1
pi = 4 * hits / MAX

Making Monte Carlo's Parallel

hits = 0
call SEED48(1)
DO I = 1, max
    x = DRAND48()
    y = DRAND48()
    IF (SQRT(x*x + y*y) .LT. 1) THEN
        hits = hits + 1
    ENDIF
END DO
pi = REAL(hits)/REAL(max) * 4.0

What is the challenge here?

Making Monte Carlo's Parallel (continued)
The random number generator maintains a static variable (the seed). The only straightforward way to parallelize the code as written is to put each call to DRAND48() in a critical section, which adds a lot of overhead. Giving each thread its own generator state avoids this; see the sketch below.
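A minimal C sketch (not the deck's solution, which uses Intel MKL VSL on the next slide) that sidesteps the shared seed by giving each thread its own generator state via POSIX rand_r(); the seed base value is arbitrary:

#include <stdlib.h>
#include <omp.h>

double mc_pi(long max)
{
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + omp_get_thread_num();  // per-thread seed: no shared RNG state
        #pragma omp for
        for (long i = 0; i < max; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;   // rand_r keeps its state in 'seed'
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x*x + y*y <= 1.0)
                hits++;
        }
    }
    return 4.0 * (double)hits / (double)max;
}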

Lab Activity: Computing Pi (due Tuesday, November 25)
Use the Intel® Math Kernel Library (Intel® MKL) VSL (Vector Statistical Library):
    VSL fills an array of random numbers, rather than returning a single number per call
    VSL can use multiple seeds/streams (one for each thread)
Objective: use basic OpenMP* syntax to make the Pi computation parallel
    Choose the best way to divide up the task
    Categorize all variables properly
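For orientation only, a rough sketch of how per-thread VSL streams might be set up. It assumes MKL's vslNewStream / vdRngUniform / vslDeleteStream interface and the MT2203 family of independent generators; the constant names (VSL_BRNG_MT2203, VSL_RNG_METHOD_UNIFORM_STD) should be checked against the MKL VSL documentation for your version, and this is not the lab solution:

#include <mkl_vsl.h>
#include <omp.h>

#define BLOCK 1024

long count_hits(long blocks_per_thread)
{
    long hits = 0;
    #pragma omp parallel reduction(+:hits)
    {
        VSLStreamStatePtr stream;
        double xy[2 * BLOCK];   // x and y coordinates, generated a block at a time

        // One independent MT2203-family generator per thread (constant name assumed).
        vslNewStream(&stream, VSL_BRNG_MT2203 + omp_get_thread_num(), 7777);

        for (long b = 0; b < blocks_per_thread; b++) {
            // Fill the whole array with uniform numbers in [0,1) in one call.
            vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 2 * BLOCK, xy, 0.0, 1.0);
            for (int i = 0; i < BLOCK; i++) {
                double x = xy[2*i], y = xy[2*i + 1];
                if (x*x + y*y <= 1.0)
                    hits++;
            }
        }
        vslDeleteStream(&stream);
    }
    return hits;
}

The caller would then estimate pi as 4.0 * hits divided by the total number of points generated (BLOCK * blocks_per_thread * number of threads).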

Programming with OpenMP: What's Been Covered
OpenMP* is a simple approach to parallel programming for shared-memory machines.
We explored basic OpenMP coding:
    Make code regions parallel (omp parallel)
    Split up work (omp for)
    Categorize variables (omp private, ...)
    Synchronize (omp critical, ...)


Advanced Concepts

More OpenMP*
Data environment constructs:
    FIRSTPRIVATE
    LASTPRIVATE
    THREADPRIVATE

Firstprivate Clause
Variables are initialized from the shared variable.
C++ objects are copy-constructed.

incr = 0;
#pragma omp parallel for firstprivate(incr)
for (i = 0; i <= MAX; i++) {
    if ((i % 2) == 0) incr++;
    A[i] = incr;
}

Lastprivate Clause
Variables update the shared variable using the value from the last iteration.
C++ objects are updated as if by assignment.

void sq2(int n, double *lastterm)
{
    double x;
    int i;
    #pragma omp parallel
    #pragma omp for lastprivate(x)
    for (i = 0; i < n; i++) {
        x = a[i]*a[i] + b[i]*b[i];
        b[i] = sqrt(x);
    }
    *lastterm = x;
}

Threadprivate Clause
Preserves global scope for per-thread storage.
Legal for namespace-scope and file-scope variables.
Use copyin to initialize from the master thread.

struct Astruct A;
#pragma omp threadprivate(A)
...
#pragma omp parallel copyin(A)
    do_something_to(&A);
...
#pragma omp parallel
    do_something_else_to(&A);

Private copies of "A" persist between regions.

Performance Issues
Idle threads do no useful work.
Divide work among threads as evenly as possible; threads should finish parallel tasks at the same time.
Synchronization may be necessary; minimize the time spent waiting for protected resources.

Load Imbalance
Unequal work loads lead to idle threads and wasted time.
[Timeline figure: threads marked Busy and Idle; some threads sit idle at the implicit barrier while others are still working.]

#pragma omp parallel
{
    #pragma omp for
    for ( ; ; ) {
        // unevenly sized iterations
    }
}

A dynamic schedule is one common mitigation; see the sketch below.
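As an illustration (not from the deck), a sketch of a hypothetical triangular kernel where iteration cost varies widely; a dynamic schedule lets fast threads keep taking work instead of idling at the barrier:

// Hypothetical triangular kernel: iteration i does (n - i) inner iterations,
// so equal-sized static chunks would leave the later threads idle.
void tri_mv(int n, double a[n][n], const double b[n], double c[n])
{
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = i; j < n; j++)
            sum += a[i][j] * b[j];
        c[i] = sum;
    }
}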

Synchronization
Lost time waiting for locks.
[Timeline figure: threads marked Busy, Idle, and In Critical; threads serialize while waiting to enter the critical section.]

#pragma omp parallel
{
    #pragma omp critical
    {
        ...
    }
    ...
}

Performance Tuning
Profilers use sampling to provide performance data.
Traditional profilers are of limited use for tuning OpenMP*:
    They measure CPU time, not wall-clock time
    They do not report contention for synchronization objects
    They cannot report load imbalance
    They are unaware of OpenMP constructs
Programmers need profilers specifically designed for OpenMP.

Static Scheduling: Doing It By Hand
Must know:
    The number of threads (Nthrds)
    Each thread's ID number (id)
Compute start and end iterations:

#pragma omp parallel
{
    int i, istart, iend;
    int id     = omp_get_thread_num();
    int Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend   = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) {
        c[i] = a[i] + b[i];
    }
}