Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting
  Bernd Mohr, Felix Wolf
    Forschungszentrum Jülich, John von Neumann - Institut für Computing, Zentralinstitut für Angewandte Mathematik, Jülich
  Allen Malony, Sameer Shende
    University of Oregon, Department of Computer and Information Science, Eugene, Oregon
© 2001 Forschungszentrum Jülich, University of Oregon
Outline
  Introduction
  Proposed OpenMP Performance Tool Interface
  Prototype Implementation
  Examples
  Future Work
Introduction
  Motivation
    – A “standard” OpenMP performance tool interface, similar in spirit to the MPI profiling interface (PMPI)
  Goals
    – Expose OpenMP parallel execution to the performance measurement system
    – Define the interface at the abstraction level of the OpenMP programming model
    – Make the performance measurement interface portable across different platforms, across all languages supported by OpenMP, and across different performance tools
    – Allow flexibility in how the interface is applied
Proposed OpenMP Performance Tool Interface (POMP)
  OpenMP Directive Instrumentation
  OpenMP Runtime Library Routine Instrumentation
  Performance Monitoring Library Control
  User Code Instrumentation
  Context Descriptors
  Conditional Compilation
  Conditional / Selective Transformations
  Remarks
  C/C++ OpenMP Pragma Instrumentation
  Implementation Issues
  Open Issues
OpenMP Directive Instrumentation
  Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives
    – NAME: name of the OpenMP construct
    – TYPE: kind of event
        fork, join: mark a change in the degree of parallelism
        enter, exit: flag entering/exiting an OpenMP construct
        begin, end: mark the start/end of the body of a construct
    – d: context descriptor
  Observation of the implicit barrier at DO, SECTIONS, WORKSHARE, SINGLE constructs
    – Add NOWAIT to the construct
    – Make the barrier explicit
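A minimal C sketch of the naming scheme and of the barrier observation described above. Everything prefixed with `demo_`/`DEMO_` (the descriptor type, the event log, the generator macro) is hypothetical and only stands in for a real measurement library; the `pomp_NAME_TYPE(d)` names follow the slide. The pragmas are guarded by `_OPENMP` so the sketch also builds without an OpenMP compiler.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, stripped-down context descriptor (a real one carries more). */
typedef struct { const char *name; } DemoDescr;

/* Event log in static memory. Not thread-safe: a real library would record
   events per thread; this sketch is meant to be called from a single thread. */
enum { DEMO_MAX_EVENTS = 16 };
static const char *demo_log[DEMO_MAX_EVENTS];
static int demo_log_len = 0;

static void demo_record(const char *event, const DemoDescr *d) {
    (void)d;  /* a real library would use the descriptor; the sketch only logs */
    assert(demo_log_len < DEMO_MAX_EVENTS);
    demo_log[demo_log_len++] = event;
}

/* Generate one pomp_NAME_TYPE(d) function per construct/event pair. */
#define DEMO_POMP_EVENT(NAME, TYPE)                          \
    static void pomp_##NAME##_##TYPE(const DemoDescr *d) {   \
        demo_record(#NAME "_" #TYPE, d);                     \
    }

DEMO_POMP_EVENT(for, enter)
DEMO_POMP_EVENT(for, exit)
DEMO_POMP_EVENT(barrier, enter)
DEMO_POMP_EVENT(barrier, exit)

static const DemoDescr demo_loop_descr = { "for" };

/* Worksharing loop with its implicit barrier made explicit: nowait is added
   to the construct and an instrumented barrier follows it. */
void demo_instrumented_loop(int *out, int n) {
    pomp_for_enter(&demo_loop_descr);
#ifdef _OPENMP
#pragma omp for nowait
#endif
    for (int i = 0; i < n; i++)
        out[i] = i * i;              /* each iteration writes its own element */
    pomp_barrier_enter(&demo_loop_descr);
#ifdef _OPENMP
#pragma omp barrier
#endif
    pomp_barrier_exit(&demo_loop_descr);
    pomp_for_exit(&demo_loop_descr);
}

/* Helpers for inspecting the log. */
int demo_event_count(void) { return demo_log_len; }
int demo_event_is(int index, const char *event) {
    return index >= 0 && index < demo_log_len &&
           strcmp(demo_log[index], event) == 0;
}
```

Called once from a single thread, the loop records four events in the order enter, barrier enter, barrier exit, exit, so the time between the two barrier events is the waiting time that would otherwise be hidden inside the construct.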
Example: !$OMP PARALLEL DO Instrumentation
  Original
    !$OMP PARALLEL DO clauses...
      do loop
    !$OMP END PARALLEL DO
  Instrumented (the combined directive is split into PARALLEL and DO, and the implicit barrier is made explicit)
    call pomp_parallel_fork(d)
    !$OMP PARALLEL other-clauses...
    call pomp_parallel_begin(d)
    call pomp_do_enter(d)
    !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
    !$OMP END DO NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_do_exit(d)
    call pomp_parallel_end(d)
    !$OMP END PARALLEL
    call pomp_parallel_join(d)
OpenMP Runtime Library Routine Instrumentation
  Transform
    omp_###_lock()       into  pomp_###_lock()
    omp_###_nest_lock()  into  pomp_###_nest_lock()
    [ ### = init | destroy | set | unset | test ]
  POMP version
    – Calls the omp version internally
    – Can do extra work before and after the call
  Are transformations of other OpenMP API functions necessary?
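The wrapper idea can be sketched in C as follows. To keep the example free of an OpenMP runtime dependency, `demo_lock_t`, `demo_omp_set_lock()` and `demo_omp_unset_lock()` are hypothetical stand-ins for `omp_lock_t`, `omp_set_lock()` and `omp_unset_lock()`; the counters are likewise an assumption about what a tool might record.

```c
#include <assert.h>

/* Stand-ins for omp_lock_t and the omp lock routines (hypothetical, so the
   sketch compiles and runs without an OpenMP runtime). */
typedef struct { int held; } demo_lock_t;

static void demo_omp_set_lock(demo_lock_t *lock) {
    assert(!lock->held);   /* single-threaded sketch: lock must be free */
    lock->held = 1;
}

static void demo_omp_unset_lock(demo_lock_t *lock) {
    assert(lock->held);
    lock->held = 0;
}

/* Counters standing in for whatever a measurement library would record. */
static int demo_lock_acquires = 0;
static int demo_lock_releases = 0;

/* POMP version: same signature as the routine it replaces. */
void pomp_set_lock(demo_lock_t *lock) {
    /* extra work before the call would go here, e.g. taking a timestamp */
    demo_omp_set_lock(lock);     /* call the original routine internally */
    demo_lock_acquires++;        /* extra work after the call */
}

void pomp_unset_lock(demo_lock_t *lock) {
    demo_omp_unset_lock(lock);
    demo_lock_releases++;
}

int demo_acquire_count(void) { return demo_lock_acquires; }
int demo_release_count(void) { return demo_lock_releases; }
```

Because the wrapper keeps the signature of the routine it replaces, the source transformation reduces to renaming the call.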
Performance Monitoring Library Control
  Give the programmer control over performance monitoring at runtime
    !$OMP INST [ INIT | FINALIZE | ON | OFF ]
  Translated into
    pomp_init(), pomp_finalize(), pomp_on(), pomp_off()
  Ignored in “normal” OpenMP compilation mode
  Alternatives
    – !$POMP ?
    – Use conditional compilation with explicit POMP calls
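One possible shape of the four control routines, sketched in C. The function names come from the slide; their argument lists, the rule that initialization also switches monitoring on, and the `demo_` helpers are assumptions made for illustration.

```c
#include <assert.h>

static int  demo_initialized = 0;
static int  demo_enabled     = 0;
static long demo_events      = 0;

/* Assumption: initializing the library also switches monitoring on. */
void pomp_init(void) {
    demo_initialized = 1;
    demo_enabled = 1;
    demo_events = 0;
}

void pomp_finalize(void) {
    assert(demo_initialized);   /* finalize without init is a usage error here */
    demo_enabled = 0;
    demo_initialized = 0;
}

void pomp_on(void) {
    assert(demo_initialized);
    demo_enabled = 1;
}

void pomp_off(void) {
    assert(demo_initialized);
    demo_enabled = 0;
}

/* Hypothetical event hook: every pomp_NAME_TYPE(d) routine would check the
   switch before recording anything. */
void demo_event(void) {
    if (demo_enabled)
        demo_events++;
}

long demo_event_count(void) { return demo_events; }
```

Events raised between `pomp_off()` and `pomp_on()` are dropped, which lets a programmer restrict measurement to the interesting phases of a run.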
User Code Instrumentation
  The compiler / transformation tool should insert calls to
    pomp_begin(d)
    pomp_end(d)
  at the beginning and end of each(?) user function
  Allow user-specified arbitrary (non-function) code regions
    !$OMP INST BEGIN ( )
      arbitrary user code
    !$OMP INST END ( )
  Alternatives
    – !$POMP ?
    – Use conditional compilation with explicit POMP calls (descriptor?)
Context Descriptors
  Describe execution contexts through a context descriptor
    typedef struct ompregdescr {
      char* name;                      /* construct */
      char* sub_name;                  /* region name */
      int   num_sections;
      char* filename;                  /* src filename */
      int   begin_line1, begin_lineN;  /* begin line # */
      int   end_line1, end_lineN;      /* end line # */
      WORD  data[4];                   /* perf. data */
      struct ompregdescr* next;
    } OMPRegDescr;
  Generate context descriptors in global static memory
    OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };
  Pass the address to the POMP functions
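A compilable C sketch of the descriptor idea. `DemoRegDescr` mirrors the fields shown above, but it is a hypothetical variant: string fields are `const char *`, and `long` stands in for the `WORD` type, whose definition the slides do not give. `demo_describe()` is an illustrative helper, not part of the proposed interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct DemoRegDescr {
    const char *name;                 /* construct */
    const char *sub_name;             /* region name */
    int num_sections;
    const char *filename;             /* source file name */
    int begin_line1, begin_lineN;     /* begin line numbers */
    int end_line1, end_lineN;         /* end line numbers */
    long data[4];                     /* performance data (stand-in for WORD) */
    struct DemoRegDescr *next;
} DemoRegDescr;

/* Generated once, with static storage duration and compile-time initialization. */
DemoRegDescr rd42675 = {
    "critical", "phase1", 0, "foo.c", 5, 5, 13, 13, {0, 0, 0, 0}, NULL
};

/* A POMP routine receives only the descriptor address and can keep its own
   data inside the descriptor; here data[0] counts how often it was entered. */
void pomp_critical_enter(DemoRegDescr *r) {
    assert(r != NULL);
    r->data[0]++;
}

/* Illustrative helper: format the source reference carried by a descriptor.
   Returns the number of characters snprintf would have written. */
int demo_describe(const DemoRegDescr *r, char *buf, size_t size) {
    assert(r != NULL && buf != NULL);
    return snprintf(buf, size, "%s(%s) %s:%d-%d",
                    r->name, r->sub_name, r->filename,
                    r->begin_line1, r->end_lineN);
}
```

After two calls to `pomp_critical_enter(&rd42675)`, `rd42675.data[0]` is 2, and `demo_describe()` yields `critical(phase1) foo.c:5-13`. No memory is allocated at run time and each call passes a single pointer.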
Conditional Compilation
  C, C++ [and Fortran, if supported]
    #ifdef _POMP
      arbitrary user code
    #endif
  Fortran free form
    !P$ arbitrary user code
  Fortran fixed form
    CP$ arbitrary
    *P$ user
    !P$ code
  Usual restrictions apply
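For the C case, a small sketch of how measurement-only code can be kept out of a normal build. It assumes that `_POMP` is defined only when the source goes through the instrumenting tool chain (or is set by hand with `-D_POMP`); the `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static int demo_trace_count = 0;   /* how often the measurement-only code ran */

int demo_sum(const int *v, size_t n) {
    assert(v != NULL || n == 0);
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
#ifdef _POMP
    /* Compiled only when _POMP is defined; a normal build never sees it. */
    demo_trace_count++;
#endif
    return total;
}

int demo_traces(void) { return demo_trace_count; }
```

The computed result is the same in both builds; only the bookkeeping inside the `#ifdef _POMP` block differs.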
Conditional / Selective Transformations
  (Temporarily) disable / re-enable POMP instrumentation at compile time
    !$OMP NOINSTRUMENT
    !$OMP INSTRUMENT
  Alternative: !$POMP ?
C/C++ OpenMP Pragma Instrumentation
  No END pragmas
    – Instrumentation for the “closing” part follows the structured block
    – Adding nowait has to be done in the “opening” part
        #pragma omp XXX
        {
          pomp_###_begin(d);
          structured block;
          pomp_###_end(d);
        }
  Simple differences in language
    – No “call” keyword
    – Statements end with “;”
    – !$OMP becomes #pragma omp
Example: #pragma omp sections Instrumentation
  Original
    #pragma omp sections
    {
      #pragma omp section
      structured block;
      #pragma omp section
      structured block;
    }
  Instrumented
    pomp_sections_enter(d);
    #pragma omp sections nowait
    {
      #pragma omp section
      { pomp_section_begin(d);
        structured block;
        pomp_section_end(d); }
      #pragma omp section
      { pomp_section_begin(d);
        structured block;
        pomp_section_end(d); }
    }
    pomp_barrier_enter(d);
    #pragma omp barrier
    pomp_barrier_exit(d);
    pomp_sections_exit(d);
Implementation Issues
  pomp_NAME_TYPE(d) is more efficient / simpler than
    pomp_event(POMP_TYPE, POMP_NAME, fname, line#, ...)
  Inlining of POMP calls is possible
  Context descriptors
    – Full context information available, incl. source reference
    – But minimal runtime overhead
        just one argument needs to be passed
        no need to dynamically allocate memory for data
        context data is initialized at compile time
    – Context data is kept together with the executable
    – Allows for separate compilation
  Potentially too much overhead for ATOMIC, CRITICAL, MASTER, SINGLE, and OpenMP lock calls
    --pomp-disable=construct-list
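How a translator might consult such a disable list is sketched below in C. The slide only names the option; that `construct-list` is a comma-separated list of construct names is an assumption, and `demo_construct_disabled()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if `construct` is an element of the comma-separated `list`,
   0 otherwise. Only whole elements match, so "atom" does not match "atomic". */
int demo_construct_disabled(const char *list, const char *construct) {
    assert(list != NULL && construct != NULL);
    size_t want = strlen(construct);
    const char *p = list;
    while (*p != '\0') {
        const char *comma = strchr(p, ',');
        size_t len = comma ? (size_t)(comma - p) : strlen(p);
        if (len == want && strncmp(p, construct, len) == 0)
            return 1;
        if (comma == NULL)
            break;
        p = comma + 1;
    }
    return 0;
}
```

With the list `atomic,critical,master`, a translator following this sketch would leave those three constructs uninstrumented and still instrument `single`.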
Open Issues
  ORDERED? FLUSH?
  Instrumentation of PARALLEL DO / FOR loop iterations
    – Potentially allows measurement of the influence of loop scheduling policies
    – Overhead??
  Allow passing additional user information to the POMP library
    – Conditional compilation
    – Extra parameter to !$OMP INST BEGIN/END ...
  Specification of the extent of user code instrumentation
    – Additional pragmas/directives?
    – Separate (outside source code) specification?
  OpenMP runtime instrumentation necessary?
Prototype Implementation: OPARI
  OPARI: OpenMP Pragma And Region Instrumentor
  Source-to-source translator that inserts POMP calls around OpenMP constructs and API functions
  Supports
    – Fortran77 and Fortran90, OpenMP 2.0
    – C and C++, OpenMP 1.0
    – Runtime library control ( init, finalize, on, off )
    – (Manual) user code instrumentation ( begin, end )
    – Conditional compilation ( #ifdef _POMP, !P$ )
    – Conditional / selective transformation ( [no]instrument )
  Preserves source code information ( #line line file )
  ~ 2000 lines of C++ code
OPARI Limitations
  Fortran
    – END DO and END PARALLEL DO directives are required
    – The atomic expression must be on a line by itself
  C/C++
    – Structured blocks must be a simple expression statement or a block (compound statement)
    – Exception: for statement after parallel for
  Could be fixed by enhancing OPARI’s parsing capabilities
  Source code and documentation available at
Prototype Implementation: POMP Library
  EXPERT: EXtensible PERformance Tool
    – Automatic event trace analyzer
  TAU: Tuning and Analysis Utilities
    – Performance analysis framework
  Implementing the tool-specific POMP libraries required ~ 1 day
Prototype Implementation: EXPERT POMP Library
    void pomp_for_enter(OMPRegDescr* r) {
      /* Get EPILOG region descriptor stored in r */
      ElgRegion* e = (ElgRegion*)(r->data[0]);
      /* If not yet there, initialize and store it */
      if (! e) e = ElgRegion_Init(r);
      /* Record enter event */
      elg_enter(e->rid);
    }

    void pomp_for_exit(OMPRegDescr* r) {
      /* Record collective exit event */
      elg_omp_collexit();
    }
Prototype Implementation: TAU POMP Library
    TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);

    void pomp_for_enter(OMPRegDescr* r) {
    #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
      TAU_GLOBAL_TIMER_START(tfor);
    #endif
    #ifdef TAU_OPENMP_REGION_VIEW
      TauStartOpenMPRegionTimer();
    #endif
    }

    void pomp_for_exit(OMPRegDescr* r) { ... }
Examples
  EXPERT
    – REMO: weather forecast, DKRZ Germany
    – MPI + OpenMP (experimental)
  TAU
    – Stommel: ocean circulation simulation, SDSC
    – MPI + OpenMP
    – Event trace based: Vampir
    – Profile based: RACY
Future Work
  Measure typical POMP calling overhead
    – EPCC OpenMP Microbenchmarks?
  Investigate “formal” standardization with the OpenMP forum [OpenMP Supplemental Standard?]
  OpenMP programmers
    – What do you expect from an OpenMP performance tool?
  Tool developers
    – Download and try out OPARI
    – Implement the POMP interface for your tool
    – Tell us about problems, comments, enhancements
  OpenMP ARB members
    – What do we need to do next?
Conclusion
  POMP: OpenMP Performance Tool Interface
    – Portable
    – Flexible
    – Efficient
    – Defined at the abstraction level of the OpenMP programming model
    – Standard?
  Prototype software
    – OPARI: OpenMP Pragma And Region Instrumentor
    – TAU: Tuning and Analysis Utilities
!$OMP PARALLEL Instrumentation
    call pomp_parallel_fork(d)
    !$OMP PARALLEL
    call pomp_parallel_begin(d)
      structured block
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_parallel_end(d)
    !$OMP END PARALLEL
    call pomp_parallel_join(d)
!$OMP DO Instrumentation
    call pomp_do_enter(d)
    !$OMP DO
      do loop
    !$OMP END DO NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_do_exit(d)
!$OMP WORKSHARE Instrumentation
    call pomp_workshare_enter(d)
    !$OMP WORKSHARE
      structured block
    !$OMP END WORKSHARE NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_workshare_exit(d)
!$OMP SECTIONS Instrumentation
    call pomp_sections_enter(d)
    !$OMP SECTIONS
    !$OMP SECTION
    call pomp_section_begin(d)
      structured block
    call pomp_section_end(d)
    !$OMP SECTION
    call pomp_section_begin(d)
      structured block
    call pomp_section_end(d)
    !$OMP END SECTIONS NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_sections_exit(d)
Synchronization Constructs Instrumentation 1
  SINGLE
    call pomp_single_enter(d)
    !$OMP SINGLE
    call pomp_single_begin(d)
      structured block
    call pomp_single_end(d)
    !$OMP END SINGLE NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_single_exit(d)
  MASTER
    !$OMP MASTER
    call pomp_master_begin(d)
      structured block
    call pomp_master_end(d)
    !$OMP END MASTER
Synchronization Constructs Instrumentation 2
  CRITICAL
    call pomp_critical_enter(d)
    !$OMP CRITICAL
    call pomp_critical_begin(d)
      structured block
    call pomp_critical_end(d)
    !$OMP END CRITICAL
    call pomp_critical_exit(d)
  BARRIER
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
  ATOMIC
    call pomp_atomic_enter(d)
    !$OMP ATOMIC
      atomic expression
    call pomp_atomic_exit(d)
Automatic Analysis
  EXPERT: EXtensible PERformance Tool
    – Programmable, extensible, flexible performance property specification
    – Based on event patterns
  Analyzes along three hierarchical dimensions
    – Performance properties (general to specific)
    – Dynamic call tree position
    – Location (machine, node, process, thread)
  Done: fully functional demonstration prototype
  Work in progress
    – Optimization / generalization
    – More performance properties
    – Source code and time line displays
EXPERT Result Presentation
  Interconnected weighted tree browser
    – Scalable, yet still accurate
  Each node has a weight
    – Percentage of CPU allocation time
    – I.e. time spent in the subtree of the call tree
  Displayed weight depends on the state of the node
    – Collapsed: including the weight of descendants
    – Expanded: without the weight of descendants
  Displayed using
    – Color: makes it easy to identify hot spots (bottlenecks)
    – Numerical value: detailed comparison
  Example: collapsed, main shows 100; expanded, main shows 10, with children foo (30) and bar (60)
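The collapsed/expanded rule can be written down as a small C sketch. The `DemoNode` type and both functions are hypothetical and say nothing about how EXPERT itself stores its trees; the numbers in the usage note are the ones from the slide (main 10, foo 30, bar 60).

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoNode {
    const char *name;
    double own_weight;            /* weight of the node itself, without descendants */
    struct DemoNode **children;
    size_t num_children;
    int expanded;                 /* 0 = collapsed, 1 = expanded */
} DemoNode;

/* Weight of a node including all of its descendants. */
double demo_inclusive_weight(const DemoNode *n) {
    assert(n != NULL);
    double total = n->own_weight;
    for (size_t i = 0; i < n->num_children; i++)
        total += demo_inclusive_weight(n->children[i]);
    return total;
}

/* A collapsed node shows its inclusive weight; an expanded node shows only
   its own share, because the children are displayed separately. */
double demo_displayed_weight(const DemoNode *n) {
    assert(n != NULL);
    return n->expanded ? n->own_weight : demo_inclusive_weight(n);
}
```

For a `main` node with own weight 10 and leaf children `foo` (30) and `bar` (60), the displayed weight is 100 while `main` is collapsed and 10 once it is expanded, so the visible numbers always add up to the same total.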
Performance Properties View
  Main problem: idle threads
  Fine: user code
  Fine: OpenMP + MPI
Dynamic Call Tree View
  1st optimization opportunity
  2nd optimization opportunity
  3rd optimization opportunity
Locations View
  Supports locations up to Grid scale
  Easily allows exploration of load balance problems on different levels
  [ Of course, the idle thread problem only applies to slave threads ]