ompP: A Profiling Tool for OpenMP. Karl Fürlinger and Michael Gerndt, Technische Universität München.

Performance Analysis of OpenMP Applications
Platform specific tools
– SUN Studio
– Intel Thread Analyzer
– ...
– Make use of platform/compiler specific knowledge (naming conventions, outlining of parallel regions, ...)
Platform independent tools
– How can we obtain performance data in a portable way?
– No standard performance measurement interface for OpenMP yet
– POMP proposal for such an interface [Mohr02]
– DMPL proposed as a debugging interface [Cownie03]

MPI Profiling Interface (PMPI)
Wrapper interposition approach
– Easy since MPI functionality is provided in a library
– No recompilation necessary
Performance measurement libraries
– For tracing: Vampir / Intel Trace Analyzer, Paraver, ...
– For profiling: mpiP
[mpiP report excerpt: "Aggregate Sent Message Size (top twenty, descending, bytes)", columns Call, Site, Count, Total, Avrg, MPI%; the values of the Send and Bcast rows are not preserved]

OpenMP Profiling Interface (POMP)
No standard yet, but there is the POMP proposal by Bernd Mohr et al.
Function calls are inserted in and around OpenMP constructs to expose execution events.
Implicit barriers are added to expose load imbalances.
Example: [figure of instrumented code not preserved]
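A minimal sketch of what such instrumentation might look like for a single parallel region. The call names follow the naming used in the POMP proposal, but the signatures, the four-field descriptor layout, and the stub bodies (which only count events) are simplifying assumptions for illustration, not the actual POMP library.

```c
#include <assert.h>

/* Simplified stand-in for a region descriptor; the field set is an assumption. */
struct ompregdescr {
    const char *name;       /* construct type, e.g. "parallel" */
    const char *file_name;  /* source file containing the construct */
    int begin_line;
    int end_line;
};

/* Event counters filled in by the stub monitoring calls below. */
static int fork_events, join_events, begin_events, end_events;
static int barrier_enter_events, barrier_exit_events;

/* Thread-safe increment; the pragma is ignored when OpenMP is disabled. */
static void count_event(int *counter)
{
    #pragma omp atomic
    *counter += 1;
}

/* Stub monitoring library: a real profiler would record timestamps here. */
void POMP_Parallel_fork(struct ompregdescr *r)  { (void)r; count_event(&fork_events); }
void POMP_Parallel_join(struct ompregdescr *r)  { (void)r; count_event(&join_events); }
void POMP_Parallel_begin(struct ompregdescr *r) { (void)r; count_event(&begin_events); }
void POMP_Parallel_end(struct ompregdescr *r)   { (void)r; count_event(&end_events); }
void POMP_Barrier_enter(struct ompregdescr *r)  { (void)r; count_event(&barrier_enter_events); }
void POMP_Barrier_exit(struct ompregdescr *r)   { (void)r; count_event(&barrier_exit_events); }

static struct ompregdescr omp_rd_1 = { "parallel", "main.c", 8, 11 };

/* What "#pragma omp parallel { work(); }" might look like after a
 * source-to-source instrumenter has processed it. Returns the number
 * of threads that executed the region body. */
int instrumented_region(void)
{
    int threads_seen = 0;

    POMP_Parallel_fork(&omp_rd_1);          /* master thread, before the fork */
    #pragma omp parallel
    {
        POMP_Parallel_begin(&omp_rd_1);     /* every thread, on entry */

        #pragma omp atomic
        threads_seen += 1;                  /* stands in for the original body */

        POMP_Barrier_enter(&omp_rd_1);      /* implicit barrier made explicit */
        #pragma omp barrier
        POMP_Barrier_exit(&omp_rd_1);

        POMP_Parallel_end(&omp_rd_1);       /* every thread, on exit */
    }
    POMP_Parallel_join(&omp_rd_1);          /* master thread, after the join */
    return threads_seen;
}

/* Returns 1 if every enter-type event has a matching exit-type event. */
int events_balanced(void)
{
    return fork_events == join_events &&
           begin_events == end_events &&
           barrier_enter_events == barrier_exit_events;
}
```

Making the barrier explicit is what lets the time a thread waits at the end of the region be measured separately from the time it spends in the body.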

ompP: OpenMP Profiler
ompP
– Simple execution profiler for OpenMP, based on POMP instrumentation
– Currently only counts and times are kept
– Hardware performance counter support is planned for the future
– Simple textual profiling report available immediately after execution of the target application

ompP: Design / Implementation
Opari creates a region descriptor for each identified OpenMP construct
– struct ompregdescr omp_rd_1 = { "parallel", "", 0, "main.c", 8, 8, 11, 11 };
– The descriptor is passed in POMP_* calls; multiple different calls use the same descriptor
– This complicates performance data bookkeeping, so larger POMP regions are broken down into smaller "pseudoregions"

Pseudoregions
– To simplify performance data bookkeeping, POMP regions are split into smaller conceptual pseudoregions: enter, exit, body, main, ...
– Exactly two "events" for each pseudoregion: ENTER and EXIT
– Times and counts are kept for each pseudoregion
[Figure: Opari instrumentation with pseudoregion nesting]
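The bookkeeping this implies can be sketched as one small record per pseudoregion that is updated on its ENTER and EXIT events. The type and function names below are hypothetical, and timestamps are passed in explicitly so the example stays deterministic; a real profiler would read a clock instead.

```c
#include <assert.h>

/* One record per pseudoregion (per thread): a count and an accumulated time. */
typedef struct {
    long   count;        /* number of completed ENTER/EXIT pairs */
    double total_time;   /* seconds accumulated between ENTER and EXIT */
    double enter_stamp;  /* timestamp of the pending ENTER */
    int    active;       /* 1 while an ENTER is waiting for its EXIT */
} pseudoregion;

void pr_init(pseudoregion *p)
{
    p->count = 0;
    p->total_time = 0.0;
    p->enter_stamp = 0.0;
    p->active = 0;
}

/* ENTER event: remember when the pseudoregion was entered. */
void pr_enter(pseudoregion *p, double now)
{
    assert(!p->active);          /* a pseudoregion is not entered twice */
    p->enter_stamp = now;
    p->active = 1;
}

/* EXIT event: add the elapsed time and bump the execution count. */
void pr_exit(pseudoregion *p, double now)
{
    assert(p->active);           /* every EXIT must follow an ENTER */
    p->total_time += now - p->enter_stamp;
    p->count += 1;
    p->active = 0;
}
```

Because each pseudoregion has exactly two events, the update logic is the same for every construct type; only the mapping from POMP calls to pseudoregions differs.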

Pseudoregions (2)
[Figure: OpenMP constructs / POMP regions and pseudoregions]

Performance Data Reporting
Region stack
– A stack of entered POMP regions is maintained
– Performance data is attributed to the stack, not to the entered region itself (similar to a call-graph profile vs. a flat profile)
Profiling report contains:
– Header with general information: date and time of the program run, number of threads, ...
– List of all identified POMP regions with their type (PARALLEL, ATOMIC, BARRIER, ...)
– Region summary list: performance data is summed over threads; the list is sorted by summed execution time
– Detailed region profile
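Attributing data to the region stack rather than to the region itself can be sketched as a profile table keyed by the full stack contents. Everything here (names, fixed-size arrays, linear lookup, single-threaded use) is an illustrative assumption and not ompP's actual data structure.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH   8
#define MAX_ENTRIES 32

/* Stack of currently entered region ids, outermost first. */
typedef struct {
    int ids[MAX_DEPTH];
    int depth;
} region_stack;

/* One profile entry per distinct stack, not per region. */
typedef struct {
    int    path[MAX_DEPTH];
    int    depth;
    double time;
    long   count;
} profile_entry;

static profile_entry profile_table[MAX_ENTRIES];
static int profile_entries_used;

void rs_push(region_stack *s, int id)
{
    assert(s->depth < MAX_DEPTH);
    s->ids[s->depth++] = id;
}

void rs_pop(region_stack *s)
{
    assert(s->depth > 0);
    s->depth--;
}

/* Find the entry whose path equals the current stack, creating it if needed. */
profile_entry *lookup(const region_stack *s)
{
    size_t bytes = (size_t)s->depth * sizeof(int);
    for (int i = 0; i < profile_entries_used; i++) {
        profile_entry *e = &profile_table[i];
        if (e->depth == s->depth && memcmp(e->path, s->ids, bytes) == 0)
            return e;
    }
    assert(profile_entries_used < MAX_ENTRIES);
    profile_entry *e = &profile_table[profile_entries_used++];
    memcpy(e->path, s->ids, bytes);
    e->depth = s->depth;
    e->time = 0.0;
    e->count = 0;
    return e;
}

/* Charge a measured time to the current stack. */
void attribute(const region_stack *s, double seconds)
{
    profile_entry *e = lookup(s);
    e->time += seconds;
    e->count += 1;
}
```

With this keying, the same region reached through two different enclosing regions produces two separate entries, which is the call-graph-style view the slide contrasts with a flat profile.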

Columns of the Detailed Region Profile
execT, execC: total inclusive execution time and number of executions, derived from the main or body pseudoregion
exitBarT, exitBarC: derived from the ibarr pseudoregion; correspond to the time spent in the implicit "exit barrier" of worksharing constructs or parallel regions. Useful for detecting load imbalances
startupT, startupC: defined for parallel regions, derived from the enter pseudoregion
shutdownT, shutdownC: defined for parallel regions, derived from the exit pseudoregion
singleBodyT, singleBodyC: defined for single regions, time spent inside the single region
sectionT, sectionC: defined for the sections construct, time spent inside a section
enterT, enterC, exitT, exitC: defined for critical constructs
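As a rough illustration of how these columns might be used, the sketch below sums per-thread rows and reports the share of execution time spent in the exit barrier. The struct, the function names, and the choice of "summed exitBarT divided by summed execT" as an imbalance indicator are assumptions made for this example; ompP itself only reports the raw times and counts.

```c
#include <assert.h>

/* One row of a detailed region profile for a single thread. */
typedef struct {
    double execT;     /* inclusive execution time in seconds */
    long   execC;     /* number of executions */
    double exitBarT;  /* time spent in the implicit exit barrier */
    long   exitBarC;  /* number of exit barrier executions */
} thread_row;

/* Sum the rows of all threads, as in the report's summary row. */
thread_row sum_rows(const thread_row *rows, int n)
{
    thread_row sum = { 0.0, 0, 0.0, 0 };
    for (int i = 0; i < n; i++) {
        sum.execT    += rows[i].execT;
        sum.execC    += rows[i].execC;
        sum.exitBarT += rows[i].exitBarT;
        sum.exitBarC += rows[i].exitBarC;
    }
    return sum;
}

/* Fraction of summed execution time spent waiting in the exit barrier.
 * A large value hints at load imbalance inside the region. */
double exit_barrier_fraction(const thread_row *rows, int n)
{
    thread_row sum = sum_rows(rows, n);
    assert(sum.execT > 0.0);
    return sum.exitBarT / sum.execT;
}
```

For four threads that each spend 4 s in a region and wait 0, 1, 2 and 3 s at its exit barrier, the fraction is 6 / 16 = 0.375.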

Usage Examples
Platform:
– 4-way Itanium-2 SMP system
– 1.3 GHz, 3 MB third-level cache and 8 GB main memory
– Intel compiler version 8.0
– Suse Linux kernel
Test applications:
– APART Test Suite
– Quicksort code from the OpenMP source code repository

APART Test Suite
ATS:
– Framework for testing automated and manual performance analysis tools
– Work functions specify a certain amount of (sequential) work for a single thread / process
– Distribution functions specify the distribution of work among threads / processes
– Individual programs demonstrate certain inefficiencies (imbalances, etc.)
– ompP output for the "imbalance in parallel loop" property:
    R00003 LOOP pattern.omp.imbalance_in_parallel_loop.c (15--18)
      001: [R0001] imbalance_in_parallel_loop.c (17--34)
      002: [R0002] pattern.omp.imbalance_in_parallel_loop.c (11--20)
      003: [R0003] pattern.omp.imbalance_in_parallel_loop.c (15--18)
    TID execT execC exitBarT exitBarC
    [per-thread values and the * row are not preserved]

Quicksort (1)
Parallel implementations of the quicksort algorithm are compared in [Suess04]
Code available in the OpenMP Source Code Repository (OmpSCR)
We compare two versions:
1. Global stack of work elements. Access is protected by two critical sections
2. Local stacks of work elements (the global stack is only accessed when the local stack is empty)
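The first variant can be sketched as a shared stack of index ranges whose push and pop operations each sit in a critical section. This is not the OmpSCR code: the names are hypothetical, and the driver loop is kept single-threaded for brevity, since a parallel driver would also need termination detection (an empty stack does not mean all work is finished while other threads are still partitioning).

```c
#include <assert.h>

#define STACK_CAP 1024

/* A work element: an inclusive index range that still has to be sorted. */
typedef struct { int lo, hi; } range;

static range global_stack[STACK_CAP];
static int   global_top;

/* Critical section 1: push a work element onto the shared stack. */
void push_work(range r)
{
    #pragma omp critical(work_stack)
    {
        assert(global_top < STACK_CAP);
        global_stack[global_top++] = r;
    }
}

/* Critical section 2: pop a work element; returns 0 if the stack is empty. */
int pop_work(range *out)
{
    int ok = 0;
    #pragma omp critical(work_stack)
    {
        if (global_top > 0) {
            *out = global_stack[--global_top];
            ok = 1;
        }
    }
    return ok;
}

/* Lomuto partition around the last element; returns the pivot's final index. */
static int partition_range(int *a, int lo, int hi)
{
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
        }
    }
    int t = a[i]; a[i] = a[hi]; a[hi] = t;
    return i;
}

/* Sequential driver: every stack access still goes through the critical
 * sections, which is where a profile would show enterT and exitT. */
void stack_quicksort(int *a, int n)
{
    range r = { 0, n - 1 };
    push_work(r);
    while (pop_work(&r)) {
        if (r.lo >= r.hi)
            continue;                       /* ranges of size 0 or 1 are sorted */
        int p = partition_range(a, r.lo, r.hi);
        range left  = { r.lo, p - 1 };
        range right = { p + 1, r.hi };
        push_work(left);
        push_work(right);
    }
}
```

Both critical sections share one name, so a push and a pop can never run at the same time. The second variant would give each thread a private stack and fall back to these two functions only when that private stack is empty, which reduces the number of times the critical sections are entered.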

Quicksort (2)
Version 1.0: global stack
– Total execution time: [...] seconds
– ∑enterT + exitT = 7.01 / 4.56
    R00002 CRITICAL cpp_qsomp1.cpp ( )
      001: [R0001] cpp_qsomp1.cpp ( )
      002: [R0002] cpp_qsomp1.cpp ( )
    TID execT execC enterT enterC exitT exitC
    [per-thread values and the * row are not preserved]
    R00003 CRITICAL cpp_qsomp1.cpp ( )
      001: [R0001] cpp_qsomp1.cpp ( )
      002: [R0003] cpp_qsomp1.cpp ( )
    TID execT execC enterT enterC exitT exitC
    [per-thread values and the * row are not preserved]

Quicksort (3)
Version 2.0: local stacks
– Total execution time: [...]
– ∑enterT + exitT = 5.55 / 3.32 => 25% improvement
    R00002 CRITICAL cpp_qsomp2.cpp ( )
      001: [R0001] cpp_qsomp2.cpp ( )
      002: [R0002] cpp_qsomp2.cpp ( )
    TID execT execC enterT enterC exitT exitC
    [per-thread values and the * row are not preserved]
    R00003 CRITICAL cpp_qsomp2.cpp ( )
      001: [R0001] cpp_qsomp2.cpp ( )
      002: [R0003] cpp_qsomp2.cpp ( )
    TID execT execC enterT enterC exitT exitC
    [per-thread values and the * row are not preserved]

Summary
ompP: simple profiling tool for OpenMP, based on POMP instrumentation
– Simple, but can be very effective as a first step in performance tuning
– Platform independent, can be used to compare performance on different platforms
– Dependent on the POMP instrumentation approach
– We would really like to have a standard profiling interface
Availability:
– First version was written in C++, which caused problems when linking with the ompP library (the C++ run-time needs to be included as well...)
– ompP v2.0: C-only version, same functionality
– Will be available soon from [...]
Thank You!

References
[Suess04] Michael Süß and Claudia Leopold. A user's experience with parallel sorting and OpenMP. In Proceedings of the Sixth Workshop on OpenMP (EWOMP'04), October.
[Cownie03] James Cownie, John DelSignore Jr., Bronis R. de Supinski, and Karen Warren. DMPL: An OpenMP DLL debugging interface. In Proceedings of the Workshop on OpenMP Applications and Tools (WOMPAT 2003).
[Mohr02] Bernd Mohr, Allen D. Malony, Hans-Christian Hoppe, Frank Schlimbach, Grant Haab, Jay Hoeflinger, and Sanjiv Shah. A performance monitoring interface for OpenMP. In Proceedings of the Fourth Workshop on OpenMP (EWOMP 2002), September 2002.