Download presentation
Presentation is loading. Please wait.
Published byHilda Madeleine Rose Modified over 9 years ago
1
Shared Memory Parallelization Outline What is shared memory parallelization? OpenMP Fractal Example False Sharing Variable scoping Examples on sharing and synchronization
2
Shared Memory Parallelization All processors can access all the memory in the parallel system The time to access the memory may not be equal for all processors - not necessarily a flat memory Parallelizing on a SMP does not reduce CPU time - it reduces wallclock time Parallel execution is achieved by generating threads which execute in parallel Number of threads is independent of the number of processors
3
Shared Memory Parallelization Overhead for SMP parallelization is large (100-200 sec)- size of parallel work construct must be significant enough to overcome overhead SMP parallelization is degraded by other processes on the node - important to be dedicated on the SMP node Remember Amdahl's Law - Only get a speedup on code that is parallelized
4
Fork-Join Model 1.All OpenMP programs begin as a single process: the master thread 2.FORK: the master thread then creates a team of parallel threads 3.Parallel region statements executed in parallel among the various team threads 4.JOIN: threads synchronize and terminate, leaving only the master thread
5
OpenMP 1997: group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared- memory programming (SMP) on UNIX and Microsoft Windows NT platforms. www.openmp.org OpenMP parallelism specified through the use of compiler directives which are imbedded in C/C++ or Fortran source code. IBM does not yet support OpenMP for C++.
6
OpenMP How is OpenMP typically used? OpenMP is usually used to parallelize loops: –Find your most time consuming loops. –Split them up between threads. Better scaling can be obtained using OpenMP parallel regions, but can be tricky!
7
Loop Parallelization
8
Functional Parallelization
9
Fractal Example !$OMP PARALLEL !$OMP DO SCHEDULE(RUNTIME) do i=0,inos ! Long loop do k=1,niter ! Short loop if(zabs(z(i)).lt.lim) then if(z(i).eq.dcmplx(0.,0.)) then z(i)=c(i) else z(i)=z(i)**alpha+c(i) endif kount(i)=k else exit endif end do !$OMP END PARALLEL
10
Fractal Example (cont’d) Can also define parallel region thus: !$OMP PARALLEL DO SCHEDULE(RUNTIME) do i=0,inos ! Long loop do k=1,niter ! Short loop... end do C syntax: #pragma omp parallel for for(i=0; i <= inos; i++) for(k=1; j <= niter; k++) {... }
11
Fractal Example (cont’d) Number of threads is machine- dependent, or can be set at runtime by setting an environment variable SCHEDULE clause specifies how the iterations of the loop are divided among the threads: –STATIC: the loop iterations divided into contiguous chunks of equal size. –DYNAMIC: iterations are broken into chunks of specified size (default 1). As each thread finishes its work it dynamically obtains the next set of iterations. –RUNTIME: the schedule determined at runtime –GUIDED
12
Fractal Example (cont’d) Compilation: xlf90_r -qsmp=omp prog.f cc_r -qsmp=omp prog.c Threaded version of compilers will perform automatic parallelization of your program unless you specify otherwise using the -qsmp=omp (or noauto) option Program will run on four processors unless specified otherwise by setting the XLSMPOPTS=parthds= environment variable Default schedule is STATIC. Try setting it to DYNAMIC with: export XLSMPOPTS="SCHEDULE=dynamic” –This will assign loop iterations in chunks of 1. Try a larger chunk size (and get better performance), for example 40: export XLSMPOPTS="SCHEDULE=dynamic= 40"
13
Fractal Example (cont’d) Tradeoff between Load Balancing and Reduced Overhead The larger the size (GRANULARITY) of the piece of work, the lower the overall thread overhead. The smaller the size (GRANULARITY) of the piece of work, the better the dynamically scheduled load balancing Watch out for FALSE SHARING: chunk size smaller than cache line
14
False Sharing IBM Power3 cache line is 128 Bytes (16 8-Byte words) !$OMP PARALLEL DO do I=1,50 A(I)=B(I)+C(I) enddo say A(1-13)starts on cache line –then some of A(14-20)will be on first cache line so won’t be accessible until first thread finished Solution: set chunk size of 32 so won’t have overlap on other cache line
15
Variable Scoping Most difficult part of Shared Memory Parallelization –What memory is Shared –What memory is Private - each processor has its own copy Compare MPI: all variables are private Variables are shared by default, except: –loop indices –scalars, and arrays whose subscript is constant with respect to PARALLEL DO, that are set and then used in loop)
16
How does sharing work? THREAD 1: increment(x) { x = x + 1; } THREAD 1: 10 LOAD A, (x address) 20 ADD A, 1 30 STORE A, (x address) THREAD 2: increment(x) { x = x + 1; } THREAD 2: 10 LOAD A, (x address) 20 ADD A, 1 30 STORE A, (x address) X initially 0 Result could be 1 or 2 Need synchronization
17
Variable Scoping example read *,n sum = 0.0 call random (b) call random (c) !$OMP PARALLEL !$OMP PRIVATE (i,sump ) !$OMP SHARED (a,b,n,c,sum) sump = 0.0 !$OMP DO do i=1,n a(i) = sqrt(b(i)**2+c(i)**2) sump = sump + a(i) enddo !$OMP CRITICAL sum = sum + sump !$OMP ENDCRITICAL !$OMP END PARALLEL end
18
Scoping example #2 read *,n sum = 0.0 call random (b) call random (c) !$OMP PARALLEL DO !$OMP&PRIVATE (i) !$OMP&SHARED (a,b,n) !$OMP&REDUCTION (+:sum) do i=1,n a(i) = sqrt(b(i)**2+c(i)**2) sum = sum + a(i) Enddo !$OMP PARALLEL ENDDO end Each processor needs a separate copy of i everything else is Shared
19
Variable Scoping Global variables are SHARED among threads –Fortran: COMMON blocks, SAVE variables, MODULE variables –C: variables “visible” when #pragma omp parallel encountered, static variables declared within a parallel region But not everything is shared... –Stack variables in sub-programs called from parallel regions are PRIVATE –Automatic variables within a statement block are PRIVATE.
20
Hello World #1 (correct) PROGRAM HELLO INTEGER TID, OMP_GET_THREAD_NUM !$OMP PARALLEL PRIVATE(TID) TID = OMP_GET_THREAD_NUM() PRINT *, 'Hello World from thread = ', TID... !$OMP END PARALLEL END
21
Hello World #2 (incorrect) PROGRAM HELLO INTEGER TID, OMP_GET_THREAD_NUM !$OMP PARALLEL TID = OMP_GET_THREAD_NUM() PRINT *, 'Hello World from thread = ', TID... !$OMP END PARALLEL END
22
Hello World #3 (incorrect) PROGRAM HELLO INTEGER TID, OMP_GET_THREAD_NUM TID = OMP_GET_THREAD_NUM() PRINT *, 'Hello World from thread = ', TID !$OMP PARALLEL... !$OMP END PARALLEL END
23
Another Variable Scoping Example subroutine example4(n,m,a,b,c) real*8 a(100,100),B(100,100),c(100) integer n,i real*8 sum !$OMP PARALLEL DO !$OMP PRIVATE (j,i,c) !$OMP SHARED (a,b,m,n) do j=1,m do i=2,n-1 c(i) = sqrt(1.0+b(i,j)**2) enddo do i=1,n a(i,j) = sqrt(b(i,j)**2+c(i)**2) enddo end Each processor needs a separate copy of j,i,c everything else is Shared. What about c? c(1) and c(n)?
24
Another Variable Scoping Example (cont’d) subroutine example4(n,m,a,b,c) real*8 a(100,100),B(100,100),c(100) integer n,i real*8 sum !$OMP PARALLEL DO !$OMP PRIVATE (j,i) !$OMP SHARED (a,b,m,n) !$OMP FIRSTPRIVATE (c) do j=1,m do i=2,n-1 c(i) = sqrt(1.0+b(i,j)**2) enddo do i=1,n a(i,j) = sqrt(b(i,j)**2+c(i)**2) enddo end Need First Value of c. Master copies it's c array to all threads prior to DO loop
25
Another Variable Scoping Example (cont’d) What if last value of c is needed? Use LASTPRIVATE clause
26
References www.openmp.org ASCI Blue training : http://www.llnl.gov/computing/tutorials/ workshops/workshop/ EWOMP ‘99: http://www.it.lth.se/ewomp99/ programme.html EWOMP ‘00: http://www.epcc.ed.ac.uk/ewomp2000/ proceedings.html Multimedia tutorial at Boston University: http://scv.bu.edu/SCV/Tutorials/ OpenMP/
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.