Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting
Bernd Mohr, Felix Wolf
Forschungszentrum Jülich, John von Neumann-Institut für Computing, Zentralinstitut für Angewandte Mathematik, Jülich
Allen Malony, Sameer Shende
University of Oregon, Department of Computer and Information Science, Eugene, Oregon

© 2001 Forschungszentrum Jülich, University of Oregon [2] Outline
Introduction
Proposed OpenMP Performance Tool Interface
Prototype Implementation
Examples
Future Work

© 2001 Forschungszentrum Jülich, University of Oregon [3] Introduction
Motivation
"Standard" OpenMP performance tool interface, similar in spirit to the MPI profiling interface (PMPI)
Goals
Expose OpenMP parallel execution to the performance measurement system
Define it at the abstraction level of the OpenMP programming model
Make the performance measurement interface portable
– across different platforms
– across all languages supported by OpenMP
– across different performance tools
Allow flexibility in how the interface is applied

© 2001 Forschungszentrum Jülich, University of Oregon [4] Proposed OpenMP Performance Tool Interface
POMP
– OpenMP Directive Instrumentation
– OpenMP Runtime Library Routine Instrumentation
– Performance Monitoring Library Control
– User Code Instrumentation
– Context Descriptors
– Conditional Compilation
– Conditional / Selective Transformations
Remarks
– C/C++ OpenMP Pragma Instrumentation
– Implementation Issues
– Open Issues

© 2001 Forschungszentrum Jülich, University of Oregon [5] OpenMP Directive Instrumentation
Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives
NAME: name of the OpenMP construct
TYPE:
– fork, join: mark a change in the degree of parallelism
– enter, exit: flag entering/exiting an OpenMP construct
– begin, end: mark the start/end of the body of a construct
d: context descriptor
Observation of the implicit barrier at DO, SECTIONS, WORKSHARE, and SINGLE constructs:
– add NOWAIT to the construct
– make the barrier explicit
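To make the naming scheme concrete, here is a minimal sketch of a few of the resulting prototypes in C (assuming the OMPRegDescr context descriptor type introduced on a later slide; the set shown is illustrative, not the complete interface):

void pomp_parallel_fork (OMPRegDescr* d);   /* fork: degree of parallelism increases */
void pomp_parallel_begin(OMPRegDescr* d);   /* begin: start of the parallel region body */
void pomp_parallel_end  (OMPRegDescr* d);   /* end: end of the parallel region body */
void pomp_parallel_join (OMPRegDescr* d);   /* join: degree of parallelism decreases */
void pomp_do_enter      (OMPRegDescr* d);   /* enter: DO construct reached */
void pomp_do_exit       (OMPRegDescr* d);   /* exit: DO construct left */
void pomp_barrier_enter (OMPRegDescr* d);   /* enter: (explicit) barrier reached */
void pomp_barrier_exit  (OMPRegDescr* d);   /* exit: barrier left */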

© 2001 Forschungszentrum Jülich, University of Oregon [6] Example: !$OMP PARALLEL DO Instrumentation

Original:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO

Instrumented:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)

© 2001 Forschungszentrum Jülich, University of Oregon [7] OpenMP Runtime Library Routine Instrumentation
Transform
– omp_###_lock() → pomp_###_lock()
– omp_###_nest_lock() → pomp_###_nest_lock()
[ ### = init | destroy | set | unset | test ]
POMP version
– calls the omp version internally
– can do extra work before and after the call (see the sketch below)
Transformations of other OpenMP API functions necessary?
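A minimal sketch, in C, of what such a wrapper could look like (the event-recording comments are placeholders; only omp_set_lock is shown, the other ### variants follow the same pattern):

#include <omp.h>

void pomp_set_lock(omp_lock_t* lock) {
    /* extra work before the call, e.g. record a "lock acquisition requested" event */
    omp_set_lock(lock);    /* calls the omp version internally */
    /* extra work after the call, e.g. record a "lock acquired" event */
}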

© 2001 Forschungszentrum Jülich, University of Oregon [8] Performance Monitoring Library Control
Give the programmer control over performance monitoring at runtime
!$OMP INST [ INIT | FINALIZE | ON | OFF ]
Translated into pomp_init(), pomp_finalize(), pomp_on(), pomp_off()
Ignored in "normal" OpenMP compilation mode
Alternatives
– !$POMP ?
– use conditional compilation with explicit POMP calls (see the sketch below)
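A minimal sketch of the conditional-compilation alternative in C, using the control calls named above and the _POMP macro from the conditional-compilation slide (the pomp_lib.h header name and the setup()/compute() routines are hypothetical):

#ifdef _POMP
#include "pomp_lib.h"       /* hypothetical header declaring pomp_init/finalize/on/off */
#endif

void setup(void)   { /* hypothetical application set-up */ }
void compute(void) { /* hypothetical main computation */ }

int main(void) {
#ifdef _POMP
    pomp_init();            /* initialize the monitoring library */
    pomp_off();             /* do not measure the set-up phase */
#endif
    setup();
#ifdef _POMP
    pomp_on();              /* measure the main computation */
#endif
    compute();
#ifdef _POMP
    pomp_finalize();        /* shut down the monitoring library */
#endif
    return 0;
}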

© 2001 Forschungszentrum Jülich, University of Oregon [9] User Code Instrumentation
Compiler / transformation tool should insert pomp_begin(d) and pomp_end(d) calls at the beginning and end of each(?) user function
Allow user-specified arbitrary (non-function) code regions:
!$OMP INST BEGIN ( )
      arbitrary user code
!$OMP INST END ( )
Alternatives
– !$POMP ?
– use conditional compilation with explicit POMP calls → descriptor?

© 2001 Forschungszentrum Jülich, University of Oregon [10] Context Descriptors
Describe execution contexts through a context descriptor:

typedef struct ompregdescr {
  char  name[];                     /* construct */
  char  sub_name[];                 /* region name */
  int   num_sections;
  char  filename[];                 /* src filename */
  int   begin_line1, begin_lineN;   /* begin line # */
  int   end_line1,   end_lineN;     /* end line # */
  WORD  data[4];                    /* perf. data */
  struct ompregdescr* next;
} OMPRegDescr;

Generate context descriptors in global static memory:

OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };

Pass their address to the POMP functions
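A minimal sketch of how instrumented code would pass such a descriptor's address, combining the rd42675 descriptor above with the critical-construct calls shown on the later synchronization slides (C syntax, purely illustrative):

pomp_critical_enter(&rd42675);    /* only the descriptor address is passed; no dynamic allocation */
#pragma omp critical
{
    pomp_critical_begin(&rd42675);
    /* ... critical section body ("phase1", foo.c lines 5-13) ... */
    pomp_critical_end(&rd42675);
}
pomp_critical_exit(&rd42675);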

© 2001 Forschungszentrum Jülich, University of Oregon [11] Conditional Compilation
C, C++ [Fortran, if supported]:

#ifdef _POMP
  arbitrary user code
#endif

Fortran free form:

!P$ arbitrary user code

Fortran fixed form:

CP$ arbitrary
*P$ user
!P$ code

The usual restrictions apply

© 2001 Forschungszentrum Jülich, University of Oregon [12] Conditional / Selective Transformations
(Temporarily) disable / re-enable POMP instrumentation at compile time:
!$OMP NOINSTRUMENT
!$OMP INSTRUMENT
Alternative: !$POMP ?

© 2001 Forschungszentrum Jülich, University of Oregon [13] C/C++ OpenMP Pragma Instrumentation
No END pragmas
– instrumentation for the "closing" part follows the structured block
– adding nowait has to be done in the "opening" part

#pragma omp XXX
  structured block;

Simple differences in language
– no "call" keyword; calls end with ";"
– !$OMP → #pragma omp
– pomp_###_begin(d); and pomp_###_end(d); are placed inside an added { } block around the structured block
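As a small worked example, a minimal sketch of a plain "#pragma omp for" instrumented according to this scheme (the loop body and variables are made up; d stands for the context-descriptor argument used throughout these slides, pomp_for_enter/exit match the prototype library slides, and the nowait-plus-explicit-barrier pattern matches the Fortran examples):

pomp_for_enter(d);
#pragma omp for nowait
for (i = 0; i < n; i++) {
    a[i] = b[i] + c[i];        /* arbitrary loop body */
}
pomp_barrier_enter(d);
#pragma omp barrier
pomp_barrier_exit(d);
pomp_for_exit(d);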

© 2001 Forschungszentrum Jülich, University of Oregon [14] Example: #pragma omp sections Instrumentation

Original:
#pragma omp sections
{
  #pragma omp section
  structured block;
  #pragma omp section
  structured block;
}

Instrumented:
pomp_sections_enter(d);
#pragma omp sections nowait
{
  #pragma omp section
  { pomp_section_begin(d);
    structured block;
    pomp_section_end(d);
  }
  #pragma omp section
  { pomp_section_begin(d);
    structured block;
    pomp_section_end(d);
  }
}
pomp_barrier_enter(d);
#pragma omp barrier
pomp_barrier_exit(d);
pomp_sections_exit(d);

© 2001 Forschungszentrum Jülich, University of Oregon [15] Implementation Issues
pomp_NAME_TYPE(d) is more efficient / simpler than pomp_event(POMP_TYPE, POMP_NAME, fname, line#, ...)
– inlining of POMP calls possible
Context descriptors
– full context information available, including the source reference
– but minimal runtime overhead:
– just one argument needs to be passed
– no need to dynamically allocate memory for data
– context data initialization at compile time
– context data is kept together with the executable
– allows for separate compilation
Potentially too much overhead for ATOMIC, CRITICAL, MASTER, SINGLE, and OpenMP lock calls
→ --pomp-disable=construct-list

© 2001 Forschungszentrum Jülich, University of Oregon [16] Open Issues
ORDERED? FLUSH?
Instrumentation of PARALLEL DO / FOR loop iterations
– potentially allows measuring the influence of loop scheduling policies
– overhead?
Allow passing additional user information to the POMP library
– conditional compilation
– extra parameter to !$OMP INST BEGIN/END ...
Specification of the extent of user code instrumentation
– additional pragmas/directives?
– separate (outside the source code) specification?
Is OpenMP runtime instrumentation necessary?

© 2001 Forschungszentrum Jülich, University of Oregon [17] Prototype Implementation: OPARI
OPARI: OpenMP Pragma And Region Instrumentor
Source-to-source translator that inserts POMP calls around OpenMP constructs and API functions
Supports
– Fortran77 and Fortran90, OpenMP 2.0
– C and C++, OpenMP 1.0
– runtime library control (init, finalize, on, off)
– (manual) user code instrumentation (begin, end)
– conditional compilation (#ifdef _POMP, !P$)
– conditional / selective transformation ([no]instrument)
Preserves source code information (#line line file)
~2000 lines of C++ code

© 2001 Forschungszentrum Jülich, University of Oregon [18] OPARI Limitations
Fortran:
– END DO and END PARALLEL DO directives required
– atomic expression must be on a line by itself
C/C++:
– structured blocks must be a simple expression statement or a block (compound statement)
– exception: for statement after parallel for
Could be fixed by enhancing OPARI's parsing capabilities
Source code and documentation available at

© 2001 Forschungszentrum Jülich, University of Oregon [19] Prototype Implementation: POMP Library
EXPERT: EXtensible PERformance Tool
– automatic event trace analyzer
TAU: Tuning and Analysis Utilities
– performance analysis framework
Required ~1 day to implement each tool-specific POMP library

© 2001 Forschungszentrum Jülich, University of Oregon [20] Prototype Implementation: EXPERT POMP Library

void pomp_for_enter(OMPRegDescr* r) {
  /* Get EPILOG region descriptor stored in r */
  ElgRegion* e = (ElgRegion*)(r->data[0]);
  /* If not yet there, initialize and store it */
  if (! e) e = ElgRegion_Init(r);
  /* Record enter event */
  elg_enter(e->rid);
}

void pomp_for_exit(OMPRegDescr* r) {
  /* Record collective exit event */
  elg_omp_collexit();
}

© 2001 Forschungszentrum Jülich, University of Oregon [21] Prototype Implementation: TAU POMP Library

TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);

void pomp_for_enter(OMPRegDescr* r) {
#ifdef TAU_AGGREGATE_OPENMP_TIMINGS
  TAU_GLOBAL_TIMER_START(tfor);
#endif
#ifdef TAU_OPENMP_REGION_VIEW
  TauStartOpenMPRegionTimer();
#endif
}

void pomp_for_exit(OMPRegDescr* r) {
  ...
}

© 2001 Forschungszentrum Jülich, University of Oregon [22] Examples
EXPERT
– REMO: weather forecast, DKRZ Germany
– MPI + OpenMP (experimental)
– event trace based → Vampir
TAU
– Stommel: ocean circulation simulation, SDSC
– MPI + OpenMP
– profile based → RACY

© 2001 Forschungszentrum Jülich, University of Oregon [23]

© 2001 Forschungszentrum Jülich, University of Oregon [24]

© 2001 Forschungszentrum Jülich, University of Oregon [25]

© 2001 Forschungszentrum Jülich, University of Oregon [26] Future Work
Measure typical POMP calling overhead
– EPCC OpenMP microbenchmarks?
Investigate "formal" standardization with the OpenMP Forum
– OpenMP supplemental standard?
OpenMP programmers:
– What do you expect from an OpenMP performance tool?
Tool developers:
– Download and try out OPARI
– Implement the POMP interface for your tool
– Tell us about problems, comments, enhancements
OpenMP ARB members:
– What do we need to do next?

© 2001 Forschungszentrum Jülich, University of Oregon [27] Conclusion
POMP: OpenMP performance tool interface
– portable
– flexible
– efficient
– defined at the abstraction level of the OpenMP programming model
– a standard?
Prototype software
– OPARI: OpenMP Pragma And Region Instrumentor
– TAU: Tuning and Analysis Utilities

© 2001 Forschungszentrum Jülich, University of Oregon [28]

© 2001 Forschungszentrum Jülich, University of Oregon [29] !$OMP PARALLEL Instrumentation

      call pomp_parallel_fork(d)
!$OMP PARALLEL
      call pomp_parallel_begin(d)
      structured block
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)

© 2001 Forschungszentrum Jülich, University of Oregon [30] !$OMP DO Instrumentation

      call pomp_do_enter(d)
!$OMP DO
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)

© 2001 Forschungszentrum Jülich, University of Oregon [31] !$OMP WORKSHARE Instrumentation

      call pomp_workshare_enter(d)
!$OMP WORKSHARE
      structured block
!$OMP END WORKSHARE NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_workshare_exit(d)

© 2001 Forschungszentrum Jülich, University of Oregon [32] !$OMP SECTIONS Instrumentation

      call pomp_sections_enter(d)
!$OMP SECTIONS
!$OMP SECTION
      call pomp_section_begin(d)
      structured block
      call pomp_section_end(d)
!$OMP SECTION
      call pomp_section_begin(d)
      structured block
      call pomp_section_end(d)
!$OMP END SECTIONS NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_sections_exit(d)

© 2001 Forschungszentrum Jülich, University of Oregon [33] Synchronization Constructs Instrumentation 1

      call pomp_single_enter(d)
!$OMP SINGLE
      call pomp_single_begin(d)
      structured block
      call pomp_single_end(d)
!$OMP END SINGLE NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_single_exit(d)

!$OMP MASTER
      call pomp_master_begin(d)
      structured block
      call pomp_master_end(d)
!$OMP END MASTER

© 2001 Forschungszentrum Jülich, University of Oregon [34] Synchronization Constructs Instrumentation 2

      call pomp_critical_enter(d)
!$OMP CRITICAL
      call pomp_critical_begin(d)
      structured block
      call pomp_critical_end(d)
!$OMP END CRITICAL
      call pomp_critical_exit(d)

      call pomp_atomic_enter(d)
!$OMP ATOMIC
      atomic expression
      call pomp_atomic_exit(d)

© 2001 Forschungszentrum Jülich, University of Oregon [35] Automatic Analysis
EXPERT: EXtensible PERformance Tool
– programmable, extensible, flexible
– performance property specification based on event patterns
Analyzes along three hierarchical dimensions
– performance properties (general → specific)
– dynamic call tree position
– location (machine → node → process → thread)
Done: fully functional demonstration prototype
Work in progress:
– optimization / generalization
– more performance properties
– source code and time line displays

© 2001 Forschungszentrum Jülich, University of Oregon [36] Expert Result Presentation
Interconnected weighted tree browser
– scalable, yet still accurate
Each node has a weight
– percentage of CPU allocation time, i.e. time spent in the subtree of the call tree
Displayed weight depends on the state of the node
– collapsed: includes the weight of its descendants
– expanded: excludes the weight of its descendants
Displayed using
– color: allows hot spots (bottlenecks) to be identified easily
– numerical value: detailed comparison
Example tree from the slide: main has inclusive weight 100; expanded, it displays 10, with its children bar (60) and foo (30) shown separately.

© 2001 Forschungszentrum Jülich, University of Oregon [37] Performance Properties View
(Screenshot; callouts: the main problem is idle threads; user code is fine; OpenMP + MPI are fine)

© 2001 Forschungszentrum Jülich, University of Oregon [38] Dynamic Call Tree View
(Screenshot; callouts mark a 1st, 2nd, and 3rd optimization opportunity in the call tree)

© 2001 Forschungszentrum Jülich, University of Oregon [39] Locations View
Supports locations up to Grid scale
Easily allows exploration of load balance problems on different levels
[Of course, the Idle Threads problem only applies to slave threads]