
1 OpenMP: Where Do We Go From Here? Dr. Barbara Chapman High Performance Computing and Tools group Computer Science Department University of Houston

2 Contents OpenMP Version 2.5 OpenMP for Clusters Implementation Conclusions

3 OpenMP A set of compiler directives and library routines for parallel application programmers Makes it easy to create multi-threaded programs in Fortran, C and C++ Strong points: –incremental parallelization –portability –ease of use Standardizes last 15 years of SMP practice

4 OpenMP Programming Model Fork-Join Parallelism: Master thread spawns a team of threads as needed. Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program. [Figure: master thread forking and joining thread teams across successive parallel regions]

5 What is OpenMP? API for shared memory programming for scientific applications, esp. small SMPs –First version, for Fortran 77, published late 1997 Memory is shared; threads may have private data Loop iterations are distributed among threads [Figure: array X[0]..X[n] in shared memory; each thread also holds private data]
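For illustration, a minimal OpenMP Fortran sketch of this model (not taken from the talk): the parallel do directive distributes the loop iterations among the threads of the team; the arrays are shared and the loop index is private.

      program saxpy_sketch
      implicit none
      integer :: i
      real    :: x(1000), y(1000)
      x = 1.0
      y = 2.0
!$omp parallel do shared(x, y) private(i)
      do i = 1, 1000
         x(i) = x(i) + 2.0*y(i)    ! iterations are divided among the threads
      end do
!$omp end parallel do
      print *, 'x(1) =', x(1)
      end program saxpy_sketch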

6 Why is OpenMP Relevant to HPC? HPC architectures –HPC platforms with global memory –SMPs are growing in size –SMT/CMT (hyperthreading) technology is spreading HPC programming –It can be used with other APIs to program clusters of SMT/SMP platforms –Remember productivity? Vendor support –OpenMP is actively maintained

7 Life is Short, Remember? It’s official: OpenMP is easier to use than MPI!

8 Uptake of OpenMP Widely available Single source code Ease of programming Major industry codes ported to OpenMP Hybrid codes: DOE apps, weather forecasting, NASA, … Flexibility: thread IDs allow for explicit multithreaded programming too!

9 The OpenMP ARB OpenMP is maintained by the OpenMP Architecture Review Board (the ARB). The ARB: Interprets OpenMP Writes new specifications - keeps OpenMP relevant. Works to increase the impact of OpenMP. Members are organizations - not individuals –Current members Permanent: Fujitsu, HP, IBM, Intel, NEC, SGI, Sun Auxiliary: ASCI, cOMPunity, EPCC, KSL, NASA, PGI That’s us!

10 Contents OpenMP Version 2.5 OpenMP for Clusters Implementation Conclusions

11 The Big Merge Timeline of the specifications: OpenMP Fortran 1.0 (1997), OpenMP C/C++ 1.0 (1998), OpenMP Fortran 1.1 (1999), OpenMP Fortran 2.0 (2000), OpenMP C/C++ 2.0 (2002). In future there will be just one specification for all languages

12 What Is OpenMP 2.5? Merges the Fortran and C/C++ specifications –text and terms are the same wherever possible, which helped resolve inconsistencies –Reorganized with new material, but no new features New material covers internal control variables and the memory model –more accurate specs for compiler writers –language-specific sections are identified graphically Public comment period has ended

13 Memory Model The API has to shield the user from the diverse ways that HW/OS maintain consistency Relaxing consistency requirements makes it easier to get performance Previous specs didn't say much about this... which means that nobody understood flush There is a need to define the relaxed memory model underlying OpenMP properly.

14 flush( list ) flush (with no list) is no problem: it’s a memory fence –All outstanding reads/writes must complete –No subsequent reads/writes can begin flush(a) means that –All outstanding reads/writes to a must complete –No subsequent reads/writes to a can begin But what about ordering between flush(a) and reads/writes to other variables?

15 Producer/Consumer Example Standard example for flush( list ) from Chandra et al. There is nothing to prevent the write of flag occurring before the write of data!

Producer:
      data = ...
!$omp flush(data)
      flag = 1
!$omp flush(flag)

Consumer:
      do
!$omp flush(flag)
      while (flag .eq. 0)
!$omp flush(data)
      ... = data

16 Rules for flush( list ) Enforcing ordering between flush(a) and reads/writes to other variables would make it equivalent to flush with no list. New rules: 1. Flush directives may be reordered with reads/writes to variables not in the list. 2. Flush directives may not be reordered with respect to each other.

17 Producer/Consumer Example Correct version according to 2.5: Unfortunately, almost all examples/uses in the 2.0 specs are now incorrect!

Producer:
      data = ...
!$omp flush(data,flag)
      flag = 1
!$omp flush(flag)

Consumer:
      do
!$omp flush(flag)
      while (flag .eq. 0)
!$omp flush(data)
      ... = data

18 OpenMP 3.0 2.5 took longer than we thought... There is a list of proposals for 3.0 extensions Some proposals –Parallelization of loop nests –Control of thread stack size –Control of idle thread behaviour Other issues have arisen from 2.5 discussions. More still to be looked at –Semaphores, schedule types, …

19 Parallelization of Loop Nests

      do i = 1, 33
         do j = 1, 33
            ... body of loop ...
         end do
      end do

With 32 threads, how can we get good load balance without manually collapsing the loops? Can we handle non-rectangular and/or imperfect nests? What can the compiler deal with (well)?
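As one illustration of the manual collapsing the slide refers to (a sketch with assumed names, not the talk's code), the 33 x 33 nest can be hand-linearised into a single loop of 1089 iterations, which 32 threads can share far more evenly than the 33 iterations of the outer loop alone:

      program collapse_sketch
      implicit none
      integer :: ij, i, j
      real    :: a(33,33)
!$omp parallel do private(ij, i, j) shared(a)
      do ij = 0, 33*33 - 1
         i = ij/33 + 1            ! recover the original loop indices
         j = mod(ij, 33) + 1
         a(i, j) = real(i + j)    ! stands in for the real loop body
      end do
!$omp end parallel do
      print *, a(33, 33)
      end program collapse_sketch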

20 Challenges for OpenMP 3.0 Applications outside scientific computing Compute-intensive commercial and financial applications need HPC technology. Multiprocessor game platforms are coming. Single processor optimization Multiple virtual processors on a single chip need multithreading Clusters and distributed shared memory Clusters are the fastest-growing HPC platform. Can OpenMP play a greater role? Does OpenMP have the right language features for these?

21 Contents OpenMP Version 2.5 OpenMP for Clusters Implementation Conclusions

22 OpenMP for Clusters Given the increasing use of clusters as compute platforms, and their cost advantages, OpenMP needs to be extended to distributed memory systems –Data distribution extensions (for DSM machines) SGI, Compaq –Software Distributed Shared Memory (SDSM) approach TreadMarks, Omni/Scash, ParADE, NanosDSM, FDSM

23 Global Arrays (GA) Developed by Pacific Northwest National Lab "Shared Memory" programming interface in distributed memory systems Greatly simplifies parallel programming for distributed memory systems Implemented as a library with C and Fortran-77 bindings [Figure: a virtual shared memory layered over the local memories of processes 1 to 4]

24 GA Features Global arrays –A data structure with shared memory abstraction –BLOCK and GEN_BLOCK data distribution –Library support Single-sided (asynchronous) communications Implemented using ARMCI library –ARMCI focuses on optimized transfer of contiguous and strided (non-contiguous) data

25 GA Programming Model

26 Translation of OpenMP to GA Basic strategy is largely straightforward –Shared data concept –GA library routines match most OpenMP constructs Resulting code may be optimized by compiler –optimize accesses to global arrays: move, merge or eliminate communications –Adjust work and data distribution to maximize the data locality and balance the workload –Inspector-Executor for irregular access applications

27 Translation: OpenMP → GA

OpenMP version:

!$OMP PARALLEL SHARED (b,a,sum)
!$OMP DO
      do j = 2, SIZE
         do i = 2, SIZE
            a(i, j) = (b(i - 1, j) + b(i + 1, j) + b(i, j - 1) + b(i, j + 1)) / 4
         enddo
      enddo
!$OMP END DO
      …
!$OMP DO REDUCTION(+:sum)
      do j = 1, SIZE_1
         do i = 1, SIZE_1
            sum = sum + b(i,j)
         end do
      end do
!$OMP END DO
!$OMP SINGLE
      print *, 'sum is ', sum
!$OMP END SINGLE
!$OMP END PARALLEL

GA version:

      ! MPI & GA initialization
      call MPI_INIT()
      call ga_initialize()
      myid = ga_nodeid()
      nproc = ga_nnodes()
      ! create Global Arrays for shared variables
      OK = ga_create(MT_DBL, SIZE_1, SIZE_1, 'A', SIZE_1, SIZE_1/nproc, g_a)
      OK = ga_create(MT_DBL, SIZE_1, SIZE_1, 'B', SIZE_1, SIZE_1/nproc, g_b)
      ! compute array region and get the local copy
      jlo = myid*(SIZE_1/nproc)+start_y-1
      jhi = myid*(SIZE_1/nproc)+end_y+1
      call ga_get(g_b, 1, SIZE_1, jlo, jhi, b, ld)
      call ga_sync()
      ! computation
      do j = 2, end_y-start_y+2
         do i = 2, SIZE_1-1
            a(i, j) = (b(i - 1, j) + b(i + 1, j) + b(i, j - 1) + b(i, j + 1)) / 4
         enddo
      enddo
      ! compute array region and put back the local copy
      jlo = myid*(SIZE_1/nproc)+1
      jhi = (myid+1)*(SIZE_1/nproc)
      call ga_put(g_b, 1, SIZE_1, jlo, jhi, b(1,2), ld)
      ! reduction
      do j = 1, SIZE_1/nproc
         do i = 1, SIZE_1
            sump = sump + b(i,j)
         end do
      end do
      call ga_dgop(MT_DBL, sump, 1, '+')
      ! MPI & GA termination
      call ga_terminate()
      call MPI_FINALIZE(rc)

28 Hard Parts Implementation of sequential regions may become nontrivial GA global arrays require BLOCK or GEN_BLOCK data distributions –The compiler has to generate them Indirect accesses to global data, e.g. A(IND(I)+1), require special handling –We don’t know what regions of the global array to fetch and write back (an inspector-executor sketch follows below) Actually, this is the “HPF problem”!
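To make the indirect-access problem concrete, here is a hedged sketch of the inspector-executor idea mentioned on the translation slide (all names are illustrative, and the placeholder assignment stands in for the ga_get the real translation would issue): an inspector pass scans IND to find which region of the global array A is actually touched, only that region is fetched into a local buffer, and the executor then runs the original loop on the local copy.

      program inspector_executor_sketch
      implicit none
      integer, parameter :: n = 8
      integer :: ind(n) = (/ 3, 7, 2, 9, 4, 4, 6, 1 /)
      real    :: a_local(0:11)          ! local buffer for the fetched region of A
      real    :: x(n)
      integer :: i, lo, hi

      ! Inspector: bounding region of the references A(IND(I)+1)
      lo = minval(ind) + 1
      hi = maxval(ind) + 1

      ! In the real translation, only A(lo:hi) would be fetched from the
      ! global array here (e.g. with a ga_get on that region).
      a_local(lo:hi) = 1.0              ! placeholder for the fetched data

      ! Executor: the original irregular loop, reading the local copy
      do i = 1, n
         x(i) = a_local(ind(i) + 1)
      end do
      print *, lo, hi, x(1)
      end program inspector_executor_sketch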

29 Hard Parts Irregular codes with strided accesses and array reshaping, e.g. UMT98 (a) Strided accesses in a procedure of the original OpenMP UMT98 (b) BLOCK data distribution in the npart dimension of psib(nbelem,ndir,npart) after modifying psib(nbelem, ndir*npart)

30 Optimizations Motivation for this approach is that it permits optimization Before transforming, compiler attempts aggressive privatization of OpenMP code After transforming, compiler may aggregate communications in GA code –Remove/merge redundant get/put operations –Move get and put operations as early as possible Array region and parallel control flow analyses are needed
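A sketch of the aggregation idea (variable names and bounds are assumed; only the ga_get call shape already shown on the translation slide is used): two gets on adjacent column ranges of the same global array can be merged into a single get over the union of the regions, and the merged call can then be hoisted as early as the data dependences allow.

      ! Before: two separate gets on adjacent column ranges of g_b
      call ga_get(g_b, 1, SIZE_1, jlo, jmid, b(1, 1), ld)
      call ga_get(g_b, 1, SIZE_1, jmid+1, jhi, b(1, jmid-jlo+2), ld)

      ! After aggregation: one get covering the union of the two regions
      call ga_get(g_b, 1, SIZE_1, jlo, jhi, b(1, 1), ld)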

31 LBE(1024x9x1024) NERSC IBM SP RS/6000 cluster (Seaborg) with 380 16-way nodes

32 GCCG Program in FIRE Benchmark NERSC IBM SP RS/6000 cluster (Seaborg) with 380 16-way nodes

33

34 Contents OpenMP Version 2.5 OpenMP for Clusters Implementation Conclusions

35 Implementation of OpenMP for Clusters Based on the Open64 compiler and the Dragon analysis tool Open64 –An open source compiler with partial OpenMP support –Rich analyses and optimizations Dragon –A graphical interactive program analysis tool –Interacts with our extended version of Open64 –Will become part of the TeraGrid toolset

36 So What’s A Compiler?

37 The Open64 Compiler An optimizing compiler suite for Linux/Intel IA-64 systems –Formerly Pro64, open-sourced by SGI C/C++ and Fortran77/90 + OpenMP compilers Open to all researchers/developers in the community Can also work on IA-32, under the NUE (HP Native User Environment) emulator.

38 The Dragon Tool [Screenshot: callgraph, flowgraph, source code browser; procedures containing OpenMP are highlighted]

39 Dynamic Callgraph/Flowgraph Shows feedback information after the application has been run one or more times: frequency, i.e. how many times a procedure was invoked at runtime, and cycle counts. Working on summarizing the execution of several OpenMP threads in a single callgraph/flowgraph. Integrating with KOJAK (UTK) and PerfSuite (UIUC). [Figure: BT NAS benchmark]

40 Contents OpenMP Version 2.5 OpenMP for Clusters Implementation Conclusions

41 Where will OpenMP be Relevant? To avoid the heat waves, we’ll need it! Simultaneous multithreading, hyperthreading, chip multithreading, streaming

42 Conclusions and Future Work Conclusions –OpenMP on clusters is still a work in progress –Intel is working on software based on TreadMarks (SDSM) to target clusters –Performance is not there yet Future work –Need to consider the needs of emerging architectures –Make it easier to combine with MPI

43 Do We Need To Extend OpenMP for HPC? Data distributions –User specification can help the compiler –But if introduced into the language, they bring many complications More loop schedules might help –E.g. a general block schedule (a hand-coded workaround is sketched below) Nested parallelism? –Exploit several levels of parallelism, e.g. in an SMP cluster: do we need more expressivity? Support for irregular codes –E.g. to indicate reuse of a communication set Maximum privatization of data (“sharable”)
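For the general block schedule bullet, a hedged sketch of today's workaround (illustrative only, with an assumed block-size table): each thread reads its own uneven block extent from an array and computes its bounds from its thread id, which is exactly the hand coding a GEN_BLOCK-style schedule clause could replace.

      program gen_block_sketch
      use omp_lib
      implicit none
      integer, parameter :: nt = 4
      integer :: blk(0:nt-1) = (/ 40, 20, 30, 10 /)   ! assumed uneven block sizes
      integer :: me, lo, hi, i
      real    :: a(100)
      a = 0.0
!$omp parallel private(me, lo, hi, i) shared(a, blk) num_threads(nt)
      me = omp_get_thread_num()
      lo = sum(blk(0:me-1)) + 1      ! start of this thread's block
      hi = lo + blk(me) - 1          ! end of this thread's block
      do i = lo, hi
         a(i) = real(i)
      end do
!$omp end parallel
      print *, a(1), a(100)
      end program gen_block_sketch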

44 Outlook We’re rounding up those cycles!!!

45 Part of the computation of the gradient of hydrostatic pressure in the POP code. Synchronization?

!$OMP PARALLEL
!$OMP DO
      do i=1,imt
         RHOKX(imt,i) = 0.0
      enddo
!$OMP ENDDO
!$OMP DO
      do i=1, imt
         do j=1, jmt
            if (k .le. KMU(j,i)) then
               RHOKX(j,i) = DXUR(j,i)*p5*RHOKX(j,i)
            endif
         enddo
      enddo
!$OMP ENDDO
!$OMP DO
      do i=1, imt
         do j=1, jmt
            if (k > KMU(j,i)) then
               RHOKX(j,i) = 0.0
            endif
         enddo
      enddo
!$OMP ENDDO
      if (k == 1) then
!$OMP DO
         do i=1, imt
            do j=1, jmt
               RHOKMX(j,i) = RHOKX(j,i)
            enddo
         enddo
!$OMP ENDDO
!$OMP DO
         do i=1, imt
            do j=1, jmt
               SUMX(j,i) = 0.0
            enddo
         enddo
!$OMP ENDDO
      endif
!$OMP SINGLE
      factor = dzw(kth-1)*grav*p5
!$OMP END SINGLE
!$OMP DO
      do i=1, imt
         do j=1, jmt
            SUMX(j,i) = SUMX(j,i) + factor * &
                        (RHOKX(j,i) + RHOKMX(j,i))
         enddo
      enddo
!$OMP ENDDO
!$OMP END PARALLEL

[Figures: runtime execution model (c stands for chunk); dataflow execution model associated with the translated code]

46 Dataflow Relationship Between Modules Courtesy of Guang R. Gao (U of Delaware) et al.

