Slide 1: Introduction to shared memory programming using OpenMP
NPACI Parallel Computing Institute, August 19-23, 2002
San Diego Supercomputer Center / National Partnership for Advanced Computational Infrastructure
Slide 2: Overview
1. OpenMP standard
2. General performance considerations
3. Stommel model with OpenMP
Slide 3: Shared memory programming
There are two basic types of parallelism: fine grain (loop level) and coarse grain (task level).
Loop level: parallelize only loops
- Easy to implement
- Highly readable code
- Less than optimal performance (sometimes)
- Most often used
Task level: distribute work by domain decomposition (similar to MPI)
- Time-consuming implementation
- Complicated source code
- Best performance
Slide 4: OpenMP - an interface based on compiler directives
OpenMP is suitable for both task level and loop level parallelism.
Fortran: directives begin with the !$OMP, C$OMP or *$OMP sentinel. To simplify things and avoid problems associated with the Fortran free and fixed source forms, use the !$OMP sentinel starting in column 1.
  !$OMP parallel
  !$OMP do
  !$OMP end parallel
C/C++: directives use the #pragma omp form and apply to the structured block or loop that follows them, so no explicit end directive is needed.
  #pragma omp parallel
  #pragma omp for
Slide 5: A Simple Example - Parallel Loop
  !$OMP PARALLEL DO
  do i=1,128
     b(i) = a(i) + c(i)
  enddo
  !$OMP END PARALLEL DO
The first directive specifies that the loop immediately following should be executed in parallel. The second directive is optional and marks the end of the parallel loop. For codes that spend the majority of their time executing the bodies of simple loops, the PARALLEL DO directive can result in a significant increase in performance.
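For readers who want to try it, here is a minimal, self-contained version of the same loop; the initialization values and the print statement are illustrative additions, not part of the original slide, and the file should be compiled with your compiler's OpenMP option.

  program simple_parallel_loop
     implicit none
     integer, parameter :: n = 128
     real :: a(n), b(n), c(n)
     integer :: i

     ! Illustrative initialization (not part of the original slide)
     a = 1.0
     c = 2.0

     ! The iterations of the loop below are divided among the threads;
     ! by default a, b, c are SHARED and the loop index i is PRIVATE.
  !$OMP PARALLEL DO
     do i = 1, n
        b(i) = a(i) + c(i)
     end do
  !$OMP END PARALLEL DO

     print *, 'b(1) =', b(1), 'b(n) =', b(n)
  end program simple_parallel_loop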
Slide 6: Distribution of work - the SCHEDULE clause
The division of work among CPUs can be controlled with the SCHEDULE clause. For example:
  !$OMP PARALLEL DO SCHEDULE(STATIC)
    Iterations are divided among the CPUs in contiguous chunks.
  !$OMP PARALLEL DO SCHEDULE(STATIC,N)
    Iterations are divided in round-robin fashion into chunks of size N.
  !$OMP PARALLEL DO SCHEDULE(DYNAMIC,N)
    Iterations are handed out in chunks of size N as CPUs become available.
  !$OMP PARALLEL DO SCHEDULE(GUIDED,N)
    Iterations are handed out in chunks of exponentially decreasing size (never smaller than N) as CPUs become available.
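Dynamic scheduling is most useful when iteration costs vary. The following self-contained sketch (not from the original slides; the chunk size of 8 and the artificial workload are arbitrary choices) lets each iteration do an amount of work proportional to its index, so DYNAMIC balances the load better than the default STATIC:

  program schedule_demo
     implicit none
     integer, parameter :: n = 1000
     real(kind=8) :: work(n), s
     integer :: i, j

     ! Iteration i performs i units of work, so iteration costs are uneven.
  !$OMP PARALLEL DO SCHEDULE(DYNAMIC,8) PRIVATE(i,j,s) SHARED(work)
     do i = 1, n
        s = 0.0d0
        do j = 1, i
           s = s + sqrt(dble(j))
        end do
        work(i) = s
     end do
  !$OMP END PARALLEL DO

     print *, 'work(n) =', work(n)
  end program schedule_demo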
Slide 7: Example - SCHEDULE(STATIC), 4 CPUs
  CPU0: do i=1,32
           a(i)=b(i)+c(i)
        enddo
  CPU1: do i=33,64
           a(i)=b(i)+c(i)
        enddo
  CPU2: do i=65,96
           a(i)=b(i)+c(i)
        enddo
  CPU3: do i=97,128
           a(i)=b(i)+c(i)
        enddo
Slide 8: Example - SCHEDULE(STATIC,16), 4 CPUs
  CPU0: do i=1,16
           a(i)=b(i)+c(i)
        enddo
        do i=65,80
           a(i)=b(i)+c(i)
        enddo
  CPU1: do i=17,32
           a(i)=b(i)+c(i)
        enddo
        do i=81,96
           a(i)=b(i)+c(i)
        enddo
  CPU2: do i=33,48
           a(i)=b(i)+c(i)
        enddo
        do i=97,112
           a(i)=b(i)+c(i)
        enddo
  CPU3: do i=49,64
           a(i)=b(i)+c(i)
        enddo
        do i=113,128
           a(i)=b(i)+c(i)
        enddo
Slide 9: Data scope
  SHARED - the variable is shared by all processors
  PRIVATE - each processor has a private copy of the variable
In the previous example of a simple parallel loop, we relied on the OpenMP defaults. Written out explicitly, the loop would be:
  !$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I)
  do I=1,N
     B(I) = A(I) + C(I)
  enddo
  !$OMP END PARALLEL DO
All CPUs have access to the same storage area for A, B, C and N, but each thread needs its own private copy of the loop index I.
Slide 10: PRIVATE data example
In the following loop, each processor needs its own private copy of the variable TEMP. If TEMP were shared, the result would be unpredictable, since multiple processors would be writing to the same memory location.
  !$OMP PARALLEL DO SHARED(A,B,C,N) PRIVATE(I,TEMP)
  do I=1,N
     TEMP = A(I)/B(I)
     C(I) = TEMP + SQRT(TEMP)
  enddo
  !$OMP END PARALLEL DO
Slide 11: FIRSTPRIVATE / LASTPRIVATE
FIRSTPRIVATE: the private copies of the variable are initialized from the original object on entry to the construct. Its use can lead to better performing code.
LASTPRIVATE: on exit from the parallel region or loop, the variable has the value it would have had after serial execution.
  A = 2.0
  ! On entry, each thread has A equal to 2.0
  !$OMP PARALLEL DO FIRSTPRIVATE(A) LASTPRIVATE(I)
  DO I=1,N
     Z(I) = A*X(I) + Y(I)
  ENDDO
  ! On exit, I is set to N
Slide 12: REDUCTION variables
Variables that accumulate a collective operation over the elements of an array can be labeled as REDUCTION variables.
  ASUM = 0.0
  APROD = 1.0
  !$OMP PARALLEL DO REDUCTION(+:ASUM) REDUCTION(*:APROD)
  do I=1,n
     ASUM = ASUM + A(I)
     APROD = APROD * A(I)
  enddo
  !$OMP END PARALLEL DO
Each processor works on its own copy of ASUM and APROD. After the parallel work is finished, the values generated by the individual processors are combined into the global results.
Slide 13: Parallel regions
The !$OMP PARALLEL directive can be used to mark an entire region as parallel. The following two examples are equivalent:
  !$OMP PARALLEL
  !$OMP DO
  do i=1,n
     a(i)=b(i)+c(i)
  enddo
  !$OMP DO
  do i=1,n
     x(i)=y(i)+z(i)
  enddo
  !$OMP END PARALLEL
and
  !$OMP PARALLEL DO
  do i=1,n
     a(i)=b(i)+c(i)
  enddo
  !$OMP PARALLEL DO
  do i=1,n
     x(i)=y(i)+z(i)
  enddo
Slide 14: A more practical example of using !$OMP PARALLEL
When a parallel region is exited, a barrier is implied: all threads must reach the barrier before any can proceed. The same is true at the end of each work-shared DO loop. By using the NOWAIT clause at the end of each loop inside the parallel region, this unnecessary synchronization of threads can be avoided.
  !$OMP PARALLEL
  !$OMP DO
  do i=1,n
     a(i)=b(i)+c(i)
  enddo
  !$OMP END DO NOWAIT
  !$OMP DO
  do i=1,n
     x(i)=y(i)+z(i)
  enddo
  !$OMP END DO NOWAIT
  !$OMP END PARALLEL
Slide 15: A note on OpenMP parallel loop directives
OpenMP provides two sets of directives for specifying a parallel loop; their appropriate uses are:
  !$OMP DO / !$OMP END DO - used inside parallel regions marked with the PARALLEL / END PARALLEL directives
  !$OMP PARALLEL DO / !$OMP END PARALLEL DO - used to mark isolated loops outside of parallel regions
Slide 16: Critical regions
Some parallel programs require a section of code where it is critical that only one processor execute that section at a time. Such regions can be marked with the CRITICAL / END CRITICAL directives.
  !$OMP PARALLEL SHARED(X,Y)
  !$OMP CRITICAL (SECTION1)
  call update(x)
  !$OMP END CRITICAL (SECTION1)
  !$OMP CRITICAL (SECTION2)
  call update(y)
  !$OMP END CRITICAL (SECTION2)
  !$OMP END PARALLEL
Because the two critical sections have different names (SECTION1, SECTION2), the update of X and the update of Y can proceed concurrently, while access within each named section remains serialized.
Slide 17: OpenMP runtime library
  OMP_GET_NUM_THREADS() - returns the current number of threads
  OMP_GET_THREAD_NUM() - returns the id of this thread
  OMP_SET_NUM_THREADS(n) - sets the desired number of threads
  OMP_IN_PARALLEL() - returns .TRUE. if called from inside a parallel region
  etc.
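A small, self-contained example (not from the original slides) showing how these routines are typically called from Fortran; it assumes the standard omp_lib module supplied by OpenMP-aware compilers, and the request for 4 threads is an arbitrary choice:

  program runtime_demo
     use omp_lib          ! interfaces for the OpenMP runtime library
     implicit none
     integer :: myid, nthreads

     call omp_set_num_threads(4)   ! ask for 4 threads (illustrative)

  !$OMP PARALLEL PRIVATE(myid, nthreads)
     myid = omp_get_thread_num()        ! 0 .. nthreads-1
     nthreads = omp_get_num_threads()   ! actual team size
     print *, 'Hello from thread', myid, 'of', nthreads
  !$OMP END PARALLEL
  end program runtime_demo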
Slide 18: More about OpenMP
Other important features of OpenMP:
  BARRIER - all threads must synchronize at the barrier
  ORDERED - order of execution matches the serial code
  MASTER - only thread number 0 executes this code
  ATOMIC - for an atomic variable update
  SECTIONS - work sharing based on sections of code
Environment variables: OMP_NUM_THREADS
For more complete details see the OpenMP web site at http://www.openmp.org
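As a brief illustration (not from the original slides), ATOMIC protects a single update of a shared variable and is typically cheaper than a CRITICAL section. A REDUCTION clause would also work for this particular loop and is usually preferable; ATOMIC is shown only to demonstrate the directive.

  program atomic_demo
     implicit none
     integer, parameter :: n = 1000
     integer :: i, hits

     hits = 0
     ! Count the even numbers from 1 to n; the increment of the shared
     ! counter is protected by ATOMIC so concurrent updates are not lost.
  !$OMP PARALLEL DO SHARED(hits) PRIVATE(i)
     do i = 1, n
        if (mod(i, 2) == 0) then
  !$OMP ATOMIC
           hits = hits + 1
        end if
     end do
  !$OMP END PARALLEL DO

     print *, 'even numbers found:', hits   ! expected: 500
  end program atomic_demo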
Slide 19: OpenMP Partners
The OpenMP Architecture Review Board comprises the following organizations:
  Compaq (Digital)
  Hewlett-Packard Company
  Intel Corporation
  International Business Machines (IBM)
  Kuck & Associates, Inc. (KAI)
  Silicon Graphics, Inc.
  Sun Microsystems, Inc.
  U.S. Department of Energy ASCI program
The following software vendors also endorse the OpenMP API:
  Absoft Corporation
  Edinburgh Portable Compilers
  Etnus, Inc.
  GENIAS Software GmbH
  Myrias Computer Technologies, Inc.
  The Portland Group, Inc. (PGI)
Slide 20: Summary of Part 1
  Use OpenMP directives to help the compiler parallelize your code.
  Loop level parallelism is the easiest (just use PARALLEL DO).
  Pay attention to variable scope (PRIVATE / SHARED).
  Avoid race conditions - use REDUCTION, CRITICAL, ATOMIC, MASTER, etc.
Slide 21: Part 2 - Performance considerations
The main obstacles to scaling of OpenMP codes are:
  1. Serial code (code outside of parallel loops, including CRITICAL regions)
  2. Load imbalance
  3. Directive overhead
  4. False sharing (when several processors repeatedly update the same cache line)
Items 3 and 4 are the price of using OpenMP as opposed to MPI; the advantage is the absence of communication overhead.
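To make item 4 concrete, here is a small self-contained sketch (not from the original slides) contrasting, in comments, a pattern that invites false sharing with the usual remedy of accumulating into per-thread private storage (here via REDUCTION):

  program false_sharing_demo
     implicit none
     integer, parameter :: n = 100000
     real(kind=8), allocatable :: a(:)
     real(kind=8) :: total
     integer :: i

     allocate(a(n))
     a = 1.0d0
     total = 0.0d0

     ! A naive alternative keeps one slot per thread in a shared array,
     ! e.g. partial(0:nthreads-1), and updates partial(mythread) inside
     ! the loop; neighbouring slots share a cache line, so every update
     ! invalidates that line on the other CPUs (false sharing).
     !
     ! Accumulating into a REDUCTION variable keeps the running sum in a
     ! private copy and combines the copies once at the end.
  !$OMP PARALLEL DO REDUCTION(+:total) PRIVATE(i)
     do i = 1, n
        total = total + a(i)
     end do
  !$OMP END PARALLEL DO

     print *, 'total =', total   ! expected: 100000.0
  end program false_sharing_demo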
Slide 22: OpenMP directives overhead
(The body of this slide is not reproduced in this transcript.)
Slide 23: General performance recommendations
Be aware of Amdahl's law:
  - Minimize serial code
  - Remove dependencies among iterations
Balance the load:
  - Experiment with the SCHEDULE clause
Be aware of directive cost:
  - Parallelize outer loops (see the sketch after this list)
  - Minimize the number of directives
  - Minimize synchronization: limit the use of BARRIER, CRITICAL and ORDERED
  - Consider using the NOWAIT clause of OMP DO when enclosing several loops inside one PARALLEL region
  - Merge loops to reduce synchronization cost
Reduce false sharing:
  - Use private variables
Try task level parallelism.
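As an illustration of the "parallelize outer loops" advice, a small self-contained sketch (not from the original slides): placing the directive on the outermost loop pays the fork/join and scheduling cost once for the whole nest, and with Fortran's column-major storage the inner loop still walks memory with stride one.

  program outer_loop_demo
     implicit none
     integer, parameter :: n = 256
     real(kind=8) :: a(n,n), b(n,n)
     integer :: i, j

     b = 1.0d0

     ! Parallelize the outer (column) loop; each thread handles whole
     ! columns, and the inner loop accesses a(:,j) contiguously.
  !$OMP PARALLEL DO PRIVATE(i,j) SHARED(a,b)
     do j = 1, n
        do i = 1, n
           a(i,j) = 2.0d0 * b(i,j)
        end do
     end do
  !$OMP END PARALLEL DO

     print *, 'a(n,n) =', a(n,n)   ! expected: 2.0
  end program outer_loop_demo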