Shared Memory Parallelization

Outline
– What is shared memory parallelization?
– OpenMP
– Fractal Example
– False Sharing
– Variable scoping
– Examples on sharing and synchronization

Shared Memory Parallelization
– All processors can access all the memory in the parallel system
– The time to access memory may not be equal for all processors - the memory is not necessarily flat
– Parallelizing on an SMP does not reduce CPU time - it reduces wallclock time
– Parallel execution is achieved by spawning threads that execute in parallel
– The number of threads is independent of the number of processors

Shared Memory Parallelization
– The overhead of SMP parallelization is large (on the order of microseconds) - the parallel work construct must be large enough to overcome that overhead
– SMP performance is degraded by other processes running on the node - it is important to have dedicated use of the SMP node
– Remember Amdahl's Law - you only get a speedup on the code that is parallelized
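As a rough worked example (not from the original slide): Amdahl's Law gives the maximum speedup on N threads as 1 / ((1 - P) + P/N), where P is the fraction of the runtime that is parallelized. If 90% of the code is parallel (P = 0.9), then 8 threads give at most 1 / (0.1 + 0.9/8) ≈ 4.7x, however fast the parallel part runs.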

Fork-Join Model
1. All OpenMP programs begin as a single process: the master thread
2. FORK: the master thread then creates a team of parallel threads
3. Parallel region statements executed in parallel among the various team threads
4. JOIN: threads synchronize and terminate, leaving only the master thread
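A minimal C sketch of the fork-join model (not from the original slides; compile with an OpenMP-enabled compiler, e.g. the -qsmp=omp flag shown later or gcc's -fopenmp): the pragma forks a team, each thread runs the block, and the threads join at the closing brace.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        printf("master thread only, before the fork\n");

        #pragma omp parallel                   /* FORK: create a team of threads */
        {
            int tid = omp_get_thread_num();    /* private to each thread         */
            printf("thread %d working inside the parallel region\n", tid);
        }                                      /* JOIN: implicit barrier, team ends */

        printf("master thread only, after the join\n");
        return 0;
    }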

OpenMP
– 1997: a group of hardware and software vendors announced their support for OpenMP, a new API for multi-platform shared-memory programming (SMP) on UNIX and Microsoft Windows NT platforms
– OpenMP parallelism is specified through compiler directives embedded in C/C++ or Fortran source code
– IBM does not yet support OpenMP for C++

OpenMP
How is OpenMP typically used? OpenMP is usually used to parallelize loops:
– Find your most time-consuming loops.
– Split them up between threads.
Better scaling can be obtained using OpenMP parallel regions, but this can be tricky!

Loop Parallelization
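The original slide is a figure; as a minimal hedged sketch in C, loop parallelization splits the iterations of a single loop across the thread team (array names and size are illustrative):

    #include <omp.h>
    #define N 1000000

    int main(void) {
        static double a[N], b[N], c[N];
        int i;

        /* Each thread gets a share of the iterations;
           a, b, c are shared, the loop index i is private. */
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        return 0;
    }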

Functional Parallelization
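The original slide is also a figure; a hedged sketch of functional parallelism using OpenMP sections, where different threads execute different pieces of work (the three routines are hypothetical placeholders):

    #include <stdio.h>
    #include <omp.h>

    void solve_flow(void)   { printf("flow solver   on thread %d\n", omp_get_thread_num()); }
    void solve_heat(void)   { printf("heat solver   on thread %d\n", omp_get_thread_num()); }
    void write_output(void) { printf("output writer on thread %d\n", omp_get_thread_num()); }

    int main(void) {
        /* Each section is executed once, by some thread in the team. */
        #pragma omp parallel sections
        {
            #pragma omp section
            solve_flow();
            #pragma omp section
            solve_heat();
            #pragma omp section
            write_output();
        }
        return 0;
    }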

Fractal Example

!$OMP PARALLEL
!$OMP DO SCHEDULE(RUNTIME)
      do i=0,inos                       ! Long loop
        do k=1,niter                    ! Short loop
          if(zabs(z(i)).lt.lim) then
            if(z(i).eq.dcmplx(0.,0.)) then
              z(i)=c(i)
            else
              z(i)=z(i)**alpha+c(i)
            endif
            kount(i)=k
          else
            exit
          endif
        end do
      end do
!$OMP END DO
!$OMP END PARALLEL

Fractal Example (cont’d)

Can also define the parallel region thus:

!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      do i=0,inos                       ! Long loop
        do k=1,niter                    ! Short loop
          ...
        end do
      end do

C syntax:

    #pragma omp parallel for
    for(i=0; i <= inos; i++)
      for(k=1; k <= niter; k++) {
        ...
      }

Fractal Example (cont’d)
The number of threads is machine-dependent, or can be set at runtime through an environment variable
The SCHEDULE clause specifies how the iterations of the loop are divided among the threads (see the sketch below):
– STATIC: the loop iterations are divided into contiguous chunks of equal size.
– DYNAMIC: iterations are broken into chunks of a specified size (default 1). As each thread finishes its work it dynamically obtains the next chunk of iterations.
– RUNTIME: the schedule is determined at runtime.
– GUIDED: like DYNAMIC, but the chunk size shrinks as the work is handed out.
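A minimal C sketch of choosing the schedule (not from the original slides; the chunk size of 40 and the array are illustrative):

    #include <omp.h>
    #define N 100000

    int main(void) {
        static double z[N];
        int i;

        /* Dynamic schedule: threads grab chunks of 40 iterations as they
           finish, which balances loops whose iterations vary in cost.    */
        #pragma omp parallel for schedule(dynamic, 40)
        for (i = 0; i < N; i++)
            z[i] = (double)i * i;

        /* schedule(runtime) defers the choice to the environment
           (the standard OMP_SCHEDULE variable, or XLSMPOPTS on IBM). */
        #pragma omp parallel for schedule(runtime)
        for (i = 0; i < N; i++)
            z[i] = z[i] + 1.0;

        return 0;
    }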

Fractal Example (cont’d)
Compilation:
    xlf90_r -qsmp=omp prog.f
    cc_r -qsmp=omp prog.c
The threaded versions of the compilers will perform automatic parallelization of your program unless you specify otherwise using the -qsmp=omp (or noauto) option
The program will run on four processors unless specified otherwise by setting parthds= in the XLSMPOPTS environment variable
The default schedule is STATIC. Try setting it to DYNAMIC with:
    export XLSMPOPTS="SCHEDULE=dynamic"
– This will assign loop iterations in chunks of 1. Try a larger chunk size (and get better performance), for example 40:
    export XLSMPOPTS="SCHEDULE=dynamic=40"

Fractal Example (cont’d)
Tradeoff between load balancing and reduced overhead:
– The larger the size (GRANULARITY) of the piece of work, the lower the overall thread overhead.
– The smaller the size (GRANULARITY) of the piece of work, the better the dynamically scheduled load balancing.
Watch out for FALSE SHARING: chunk size smaller than a cache line.

False Sharing
The IBM Power3 cache line is 128 Bytes (16 8-Byte words)

!$OMP PARALLEL DO
      do I=1,50
        A(I)=B(I)+C(I)
      enddo

Say A(1-13) starts on a cache line
– then some of A(14-20) will also be on that first cache line, so it won't be accessible until the first thread has finished
Solution: set a chunk size of 32 so chunks won't overlap onto another thread's cache line (see the sketch below)
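A hedged C sketch of the same idea (the arrays and chunk size are illustrative, assuming 8-byte doubles and a 128-byte cache line): giving each thread a chunk that is a multiple of the cache line keeps two threads from writing to the same line.

    #define N 1000000
    #define CACHE_LINE_DOUBLES 16   /* 128-byte line / 8-byte double */

    static double a[N], b[N], c[N];

    int main(void) {
        int i;

        /* Chunk of 32 = two full cache lines per piece of work, so adjacent
           chunks (assuming a starts on a cache-line boundary) do not update
           doubles that live on the same cache line.                         */
        #pragma omp parallel for schedule(dynamic, 2 * CACHE_LINE_DOUBLES)
        for (i = 0; i < N; i++)
            a[i] = b[i] + c[i];

        return 0;
    }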

Variable Scoping
The most difficult part of shared memory parallelization:
– What memory is Shared
– What memory is Private - each processor has its own copy
Compare MPI: all variables are private
Variables are shared by default, except:
– loop indices
– scalars, and arrays whose subscript is constant with respect to the PARALLEL DO, that are set and then used in the loop

How does sharing work?

THREAD 1:                        THREAD 2:
increment(x)                     increment(x)
{ x = x + 1; }                   { x = x + 1; }

THREAD 1:                        THREAD 2:
10 LOAD A, (x address)           10 LOAD A, (x address)
20 ADD A, 1                      20 ADD A, 1
30 STORE A, (x address)          30 STORE A, (x address)

x initially 0 - result could be 1 or 2 - need synchronization
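A hedged C sketch (not from the slides) of that race and the usual fix: an atomic (or critical) construct serializes the update so neither increment is lost.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x = 0;

        #pragma omp parallel num_threads(2) shared(x)
        {
            /* Without the atomic, both threads could load the same old value
               of x and one increment would be lost (result 1 instead of 2). */
            #pragma omp atomic
            x = x + 1;
        }

        printf("x = %d\n", x);   /* 2 when the team really has two threads */
        return 0;
    }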

Variable Scoping example

      read *,n
      sum = 0.0
      call random (b)
      call random (c)
!$OMP PARALLEL
!$OMP&PRIVATE (i,sump)
!$OMP&SHARED (a,b,n,c,sum)
      sump = 0.0
!$OMP DO
      do i=1,n
        a(i) = sqrt(b(i)**2+c(i)**2)
        sump = sump + a(i)
      enddo
!$OMP END DO
!$OMP CRITICAL
      sum = sum + sump
!$OMP END CRITICAL
!$OMP END PARALLEL
      end

Scoping example #2

      read *,n
      sum = 0.0
      call random (b)
      call random (c)
!$OMP PARALLEL DO
!$OMP&PRIVATE (i)
!$OMP&SHARED (a,b,n)
!$OMP&REDUCTION (+:sum)
      do i=1,n
        a(i) = sqrt(b(i)**2+c(i)**2)
        sum = sum + a(i)
      enddo
!$OMP END PARALLEL DO
      end

Each processor needs a separate copy of i; everything else is shared.
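For comparison, a hedged C sketch of the same reduction pattern (the array size and the initialization are illustrative, not from the slides):

    #include <math.h>
    #include <stdio.h>
    #define N 1000

    int main(void) {
        static double a[N], b[N], c[N];
        double sum = 0.0;
        int i;

        for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* Each thread accumulates into a private copy of sum; the copies
           are combined with + when the threads join.                     */
        #pragma omp parallel for private(i) shared(a, b, c) reduction(+:sum)
        for (i = 0; i < N; i++) {
            a[i] = sqrt(b[i]*b[i] + c[i]*c[i]);
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }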

Variable Scoping
Global variables are SHARED among threads:
– Fortran: COMMON blocks, SAVE variables, MODULE variables
– C: variables “visible” when #pragma omp parallel is encountered, static variables declared within a parallel region
But not everything is shared...
– Stack variables in sub-programs called from parallel regions are PRIVATE
– Automatic variables within a statement block are PRIVATE.
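A hedged C sketch of those rules (the names are illustrative): the file-scope variable is shared by every thread, while the local variable in the routine called from the parallel region is private to each caller.

    #include <stdio.h>
    #include <omp.h>

    int global_counter = 0;                    /* file scope: SHARED by all threads */

    void do_work(void) {
        int local = omp_get_thread_num();      /* stack variable: PRIVATE per thread */
        printf("thread %d sees its own local = %d\n", local, local);
    }

    int main(void) {
        #pragma omp parallel
        {
            do_work();
            #pragma omp atomic                 /* shared, so updates must be atomic */
            global_counter++;
        }
        printf("global_counter = %d (number of threads)\n", global_counter);
        return 0;
    }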

Hello World #1 (correct)

      PROGRAM HELLO
      INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(TID)
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
      ...
!$OMP END PARALLEL
      END

Hello World #2 (incorrect)

      PROGRAM HELLO
      INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL
!     TID is not PRIVATE, so all threads race on the same shared copy
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
      ...
!$OMP END PARALLEL
      END

Hello World #3 (incorrect)

      PROGRAM HELLO
      INTEGER TID, OMP_GET_THREAD_NUM
!     Called outside the parallel region: only the master thread runs this
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID
!$OMP PARALLEL
      ...
!$OMP END PARALLEL
      END

Another Variable Scoping Example

      subroutine example4(n,m,a,b,c)
      real*8 a(100,100),B(100,100),c(100)
      integer n,i
      real*8 sum
!$OMP PARALLEL DO
!$OMP&PRIVATE (j,i,c)
!$OMP&SHARED (a,b,m,n)
      do j=1,m
        do i=2,n-1
          c(i) = sqrt(1.0+b(i,j)**2)
        enddo
        do i=1,n
          a(i,j) = sqrt(b(i,j)**2+c(i)**2)
        enddo
      enddo
      end

Each processor needs a separate copy of j, i and c; everything else is shared. What about c? c(1) and c(n)?

Another Variable Scoping Example (cont’d)

      subroutine example4(n,m,a,b,c)
      real*8 a(100,100),B(100,100),c(100)
      integer n,i
      real*8 sum
!$OMP PARALLEL DO
!$OMP&PRIVATE (j,i)
!$OMP&SHARED (a,b,m,n)
!$OMP&FIRSTPRIVATE (c)
      do j=1,m
        do i=2,n-1
          c(i) = sqrt(1.0+b(i,j)**2)
        enddo
        do i=1,n
          a(i,j) = sqrt(b(i,j)**2+c(i)**2)
        enddo
      enddo
      end

The first value of c is needed: the master copies its c array to all threads prior to the DO loop.

Another Variable Scoping Example (cont’d)
What if the last value of c is needed after the loop? Use the LASTPRIVATE clause (see the sketch below).
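A hedged C sketch (not from the slides; the variables are illustrative) of firstprivate and lastprivate together: each thread starts from the master's copy, and the copy from the sequentially last iteration is written back after the loop.

    #include <stdio.h>
    #define N 8

    int main(void) {
        int i;
        int offset = 100;   /* initialized by the master thread */
        int last   = -1;

        /* firstprivate: every thread's private offset starts at 100.
           lastprivate:  after the loop, last holds the value assigned
                         in the sequentially final iteration (i = N-1). */
        #pragma omp parallel for firstprivate(offset) lastprivate(last)
        for (i = 0; i < N; i++) {
            last = offset + i;
        }

        printf("last = %d\n", last);   /* prints 107 */
        return 0;
    }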
