On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks

Presentation transcript:

On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks
Rudi Eigenmann, Department of Electrical and Computer Engineering, Purdue University
Allen Malony, Department of Computer and Information Science, University of Oregon
Bernd Mohr, Forschungszentrum Jülich, John von Neumann-Institut für Computing

© 2000 Forschungszentrum Jülich, NIC-ZAM [2] Outline
SPEC OMP2001 benchmark suite
Motivation: integrated performance tools in benchmarking suites
Approach for OMPM2001
POMP OpenMP performance monitoring interface
Automatic OpenMP instrumentation (OPARI)
Performance analysis tools (EXPERT and TAU)
Experiments
Concluding remarks

© 2000 Forschungszentrum Jülich, NIC-ZAM [3] SPEC OMP2001 Benchmark Suite
11 application programs used in scientific computing
– CFD: APPLU, APSI, GALGEL, MGRID, SWIM
– Molecular dynamics: AMMP
– Crash simulation: FMA3D
– Neural network: ART
– Genetic algorithm: GAFORT
– Earthquake modeling: EQUAKE
– Quantum chromodynamics: WUPWISE
Fortran and C source code with OpenMP parallelization
Medium and large data sets
Goals of portability and relative ease of use

© 2000 Forschungszentrum Jülich, NIC-ZAM [4] OMPM2001 Performance Measurement Studies
OMP2001 measures and reports total execution time only
– Scalability results for different processor counts
"Performance Characteristics of the SPEC OMP2001 Benchmarks," Aslot and Eigenmann, EWOMP 2001
– Studies performance characteristics in detail
– Timing profiles (scalability) across parallel sections
– Memory system and cache (hardware counter) profiles
Use of high-resolution timers and hardware counters
Quantitative and qualitative explanations
Custom instrumentation and measurement libraries
– Required hand-instrumentation of OpenMP constructs (a sketch follows below)
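As a minimal sketch of the kind of hand-instrumentation such studies required, the following routine times a single parallel loop with the portable OpenMP wall-clock timer; the loop body and array are illustrative, not taken from the benchmarks:

      subroutine timed_loop(a, n)
      use omp_lib                         ! provides omp_get_wtime()
      implicit none
      integer :: n, i
      double precision :: a(n), t0, t1
      t0 = omp_get_wtime()                ! high-resolution wall-clock timer
!$OMP PARALLEL DO
      do i = 1, n
         a(i) = 2.0d0 * a(i)              ! illustrative work
      end do
!$OMP END PARALLEL DO
      t1 = omp_get_wtime()
      print *, 'parallel loop time [s]:', t1 - t0
      end subroutine timed_loop

Repeating this pattern by hand around every parallel construct and hardware-counter read is exactly the manual effort that the integrated tool approach described next is meant to remove.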

© 2000 Forschungszentrum Jülich, NIC-ZAM [5] Performance Tools and Benchmark Suites
Detailed performance measurement and analysis reveal interesting runtime characteristics in application codes
– Important for performance diagnosis and tuning
– Help to understand effects of a new parallel API (OpenMP)
Benchmark suites typically do not have integrated tools
– Portability of performance tools is poor
– Hard to configure tools for a benchmarking methodology
– Tools often require manual application and operation
BUT: automatic and portable performance tools could allow more in-depth, cross-platform performance analysis
Goal: integrated performance tools for OMP2001

© 2000 Forschungszentrum Jülich, NIC-ZAM [6] Approach for OMPM2001
Leverage state-of-the-art performance instrumentation, measurement, and analysis technology
– POMP OpenMP performance monitoring interface
– OPARI automatic OpenMP source instrumentation
– Performance profile and trace measurement libraries
– EXPERT automatic event trace analyzer
– TAU performance analysis system
Configure the performance tools as integrated and automated components in the OMPM2001 benchmarking methodology
Conduct performance experiments on the OMPM2001 codes
Evaluate with respect to portability, ease of use, and results

© 2000 Forschungszentrum Jülich, NIC-ZAM [7] POMP OpenMP Performance Monitoring Interface
OpenMP instrumentation
– OpenMP directive/pragma instrumentation
– OpenMP runtime library routine instrumentation
POMP directive/pragma extensions
– Runtime library control ( !$POMP INIT, FINALIZE, ON, OFF )
– (Manual) user code instrumentation (see the sketch below):
  !$POMP BEGIN(myname)
  … structured block …
  !$POMP END(myname)
– Conditional compilation ( #ifdef _POMP )
– Conditional / selective transformations ( !$POMP [NO]INSTRUMENT )
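A minimal sketch of such manual user-code instrumentation, using only the directives listed above; the region name and the loop are illustrative, and to an ordinary compiler the !$POMP lines are plain comments until OPARI or another POMP-aware tool translates them:

      program pomp_user_region_sketch
      implicit none
      integer :: i
      double precision :: s
      s = 0.0d0
!     initialize the POMP monitoring library
!$POMP INIT
!     mark a user-defined execution phase
!$POMP BEGIN(compute_phase)
      do i = 1, 1000000
         s = s + dble(i)                 ! illustrative work inside the phase
      end do
!$POMP END(compute_phase)
!$POMP FINALIZE
      print *, 'result =', s
      end program pomp_user_region_sketch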

© 2000 Forschungszentrum Jülich, NIC-ZAM [8] Example: !$OMP PARALLEL DO Instrumentation
Original code:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO
Instrumented code:
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_do_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)

© 2000 Forschungszentrum Jülich, NIC-ZAM [9] OpenMP Runtime Library Routine Instrumentation
Transform
– omp_###_lock() → pomp_###_lock()
– omp_###_nest_lock() → pomp_###_nest_lock()
– [ ### = init | destroy | set | unset | test ]
POMP version
– Calls the OpenMP version internally
– Can do extra work before and after the call (see the sketch below)
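A sketch of what one such POMP wrapper might look like; only the renaming scheme is taken from the slide, and the event-recording calls are hypothetical placeholders for whatever the monitoring library actually records:

      subroutine pomp_set_lock(lock)
      use omp_lib
      implicit none
      integer (kind=omp_lock_kind) :: lock
      call pomp_lock_event_enter()        ! hypothetical hook: record event before the call
      call omp_set_lock(lock)             ! call the OpenMP version internally
      call pomp_lock_event_exit()         ! hypothetical hook: record event after the call
      end subroutine pomp_set_lock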

© 2000 Forschungszentrum Jülich, NIC-ZAM [10] Instrumentation of OpenMP Constructs
OPARI: OpenMP Pragma And Region Instrumentor
Source-to-source translator to insert POMP calls around OpenMP constructs and API functions
Done:
– Supports Fortran77 and Fortran90, OpenMP 2.0
– C and C++, OpenMP 1.0
– POMP extensions
– EPILOG and TAU POMP monitoring library implementations
– Preserves source code information ( #line line file )

© 2000 Forschungszentrum Jülich, NIC-ZAM [11] History and Future of POMP
POMP OpenMP performance monitoring interface
– Forschungszentrum Jülich, University of Oregon
– Presented at EWOMP'01, LACSI'01, and SC'01
– Published in The Journal of Supercomputing, vol. 23
European IST project INTONE
– Development of OpenMP tools (incl. monitoring interface)
– Pallas, CEPBA, Royal Institute of Technology, Tech. Univ. Dresden
KSL-POMP
– Development of an OpenMP monitoring interface inside ASCI
– Based on POMP, but further developed in other directions
Work in progress:
– Investigating a joint proposal
– Investigating standardization through the OpenMP Forum

© 2000 Forschungszentrum Jülich, NIC-ZAM [12] EXPERT: Automatic Analysis of OpenMP + MPI Programs
EXPERT: EXtensible PERformance Tool
Programmable, extensible, flexible performance property specification
– Based on event patterns
Analyzes along three hierarchical dimensions
– Performance properties (general → specific)
– Dynamic call-tree position
– Location (machine → node → process → thread)
For each property, a severity matrix is computed
– Time losses due to the performance property
– Per location and call-tree node
(Matrix dimensions: property, call site, location.)
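In other words, EXPERT's output can be read as a function severity(p, c, l): the execution time lost to performance property p at dynamic call-tree node c on location l (machine, node, process, or thread).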

© 2000 Forschungszentrum Jülich, NIC-ZAM [13] Location How is the problem distributed across the machine? Class of Behavior Which kind of behavior caused the problem? Call Graph Where in the source code is the problem? In which context? Color Coding Shows the severity of the problem

© 2000 Forschungszentrum Jülich, NIC-ZAM [14] TAU Performance System Framework
TAU: Tuning and Analysis Utilities
Performance system framework for scalable parallel and distributed high-performance computing
Targets a general complex system computation model
– Nodes / contexts / threads
– Multi-level: system / software / parallelism
– Measurement and analysis abstraction
Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
– Portable performance profiling/tracing facility
– Open software approach

© 2000 Forschungszentrum Jülich, NIC-ZAM [15] TAU Performance System Architecture (architecture diagram; components labeled include EPILOG and Paraver)

© 2000 Forschungszentrum Jülich, NIC-ZAM [16] Instrumentation
User functions
– EXPERT:
  – Compiler instrumentation (Linux PGI, Hitachi SR-8000)
  – Manual instrumentation via !$POMP directives
– TAU:
  – Source instrumentation based on PDT (Program Database Toolkit), built on commercial parsers from EDG and Mutek
  – Dynamic instrumentation via Dyninst or DPCL
  – Manual instrumentation via the TAU API
OpenMP: source instrumentation via OPARI
MPI: wrapper library using the "standard" PMPI monitoring interface

© 2000 Forschungszentrum Jülich, NIC-ZAM [17] Measurement and Analysis
EXPERT
– EPILOG tracing library
– Automatic trace analysis through EXPERT
– Manual analysis through EPILOG → VTF3 converter + Vampir
TAU
– TAU tracing library
– Manual analysis through TAU → VTF converter + Vampir
– TAU profiling library
– Manual analysis through RACY/jRacy
– TAU EPILOG tracing library
– Automatic trace analysis through EXPERT

© 2000 Forschungszentrum Jülich, NIC-ZAM [18] Integration with the SPEC runspec Tool
Development of OPARI and OPARI/TAU compile and link scripts
– Take the "regular" compile / link command as argument
– Perform all necessary instrumentation, compilations, and linking
Example usage in a SPEC configuration file:
  default:default:opari:default
  FC  = opari-comp
  CC  = opari-comp
  FLD = opari-link
  CLD = opari-link
Invocation through runspec ... --extension=opari ...

© 2000 Forschungszentrum Jülich, NIC-ZAM [19] Experimental Setup: ZAMpano
ZAMpano: ZAM PArallel NOdes
9-node Linux cluster, each node with
– 4 x Intel Pentium III Xeon, 550 MHz, 512 KByte L2 cache
– 2 GByte ECC-RAM
– SuSE 7.2 Linux, GB-SMP kernel
– PGI F77, F90, C, C++ compilers V3.3-2
Advantages
– Exclusive reservation for extended measurement periods
– Simultaneous multiple measurements (on different nodes)
– Root access
– Full tool support

© 2000 Forschungszentrum Jülich, NIC-ZAM [20] Wishful Thinking Meets Reality ;-)
Problems with OMPM2001 building / compiling
– 1 GByte program+data size limit if dynamic linking is used
– PGI could not compile AMMP
– GALGEL, EQUAKE, GAFORT, ART dump core midway
– WUPWISE runs but has result output differences
– SWIM, MGRID, APPLU, APSI, FMA3D run
Problems with applying EXPERT
– Traces of SWIM, APPLU for the "ref" data set
– Traces of SWIM, MGRID, APPLU, FMA3D for the "test" data set
Problems with applying TAU
– PDT instrumentation failed due to non-ANSI Fortran
– Instrumentor bug when OpenMP loops are the first executable line

© 2000 Forschungszentrum Jülich, NIC-ZAM [21] Results: Event Statistics ("test" data set)
Per-benchmark run time, run time with POMP instrumentation, trace size [# events], and event rate [events/s], with the resulting tracing overheads:
Full tracing: SWIM 55%, MGRID 6%, APPLU 155%, FMA3D 324%
Restricted user event tracing: SWIM 0%, MGRID 1.6%, APPLU 76%, FMA3D 221%

© 2000 Forschungszentrum Jülich, NIC-ZAM [22] Results: Event Statistics ("ref" data set)
Full tracing:
– SWIM: run time 16,656 s, with POMP 16,679 s, overhead 0.1%, trace size 132,054 events, event rate 8 events/s
– APPLU: run time 9,593 s, with POMP 10,666 s, overhead 11%, trace size ~147.5 M events, event rate ~15,200 events/s
Restricted user event tracing:
– SWIM: run time 16,656 s, with POMP 17,068 s, overhead 2.5%, trace size 132,054 events, event rate 8 events/s
– APPLU: run time 9,593 s, with POMP 9,534 s, overhead 0%, trace size 188,298 events, event rate 20 events/s

© 2000 Forschungszentrum Jülich, NIC-ZAM [23] Results: EXPERT Analysis of SWIM

© 2000 Forschungszentrum Jülich, NIC-ZAM [24] Results: EXPERT Analysis of APPLU

© 2000 Forschungszentrum Jülich, NIC-ZAM [25] Results: Vampir SWIM "ref" data set

© 2000 Forschungszentrum Jülich, NIC-ZAM [26] Future Work
Get more benchmarks running (other compilers?)
Fix instrumentation, measurement, and analysis problems
– Fix TAU f90 instrumentor problems and get TAU profile data
– Get more / better EPILOG traces
– EXPERT profile library (to avoid huge traces)
Extend analysis to other platforms → SUN, SGI, Hitachi, IBM, NEC, …
Investigate runtime trace compression techniques
Other tools? → Guide/VGV, Paraver, INTONE

© 2000 Forschungszentrum Jülich, NIC-ZAM [27] Conclusions
More portable SPEC OMP benchmarks needed
– ANSI Fortran
– Dynamic data allocation
– "Small" data set
– Add !$OMP END [PARALLEL] DO directives
Integrated POMP instrumentation ( !$POMP BEGIN/END(myname) ) for important user functions and execution phases
– Document and specify additional measurement events
– Would also solve instrumentation problems
Integrated generic and portable SPEC POMP measurement library
– Could then easily be replaced by third-party / user POMP libraries
An OpenMP ARB POMP standard would be a big win (until then: OPARI)

© 2000 Forschungszentrum Jülich, NIC-ZAM [28] Additional Issues
Level of measurement detail
– What is necessary and appropriate?
– Could use a base level and allow user-configured levels
– Full program execution vs. portion of program execution
Distribution complexity
– Tool packages should be added to the benchmark distribution
– Packages need to be easily obtained and configured
– Must be public domain or licensed through SPEC
Publishing of detailed performance results
– Part of the official SPEC benchmark report?
…