University of Houston Open Source Software Support for the OpenMP Runtime API for Profiling Oscar Hernandez, Ramachandra Nanjegowda, Van Bui, Richard Krufin.

Slides:



Advertisements
Similar presentations
OpenMP.
Advertisements

Parallel Virtual Machine Rama Vykunta. Introduction n PVM provides a unified frame work for developing parallel programs with the existing infrastructure.
Introductions to Parallel Programming Using OpenMP
The OpenUH Compiler: A Community Resource Barbara Chapman University of Houston March, 2007 High Performance Computing and Tools Group
NewsFlash!! Earth Simulator no longer #1. In slightly less earthshaking news… Homework #1 due date postponed to 10/11.
PARALLEL PROGRAMMING WITH OPENMP Ing. Andrea Marongiu
Cc Compiler Parallelization Options CSE 260 Mini-project Fall 2001 John Kerwin.
Distributed Systems Lecture #3: Remote Communication.
1 Tuesday, November 07, 2006 “If anything can go wrong, it will.” -Murphy’s Law.
DISTRIBUTED AND HIGH-PERFORMANCE COMPUTING CHAPTER 7: SHARED MEMORY PARALLEL PROGRAMMING.
Computer Architecture II 1 Computer architecture II Programming: POSIX Threads OpenMP.
On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
JVM-1 Introduction to Java Virtual Machine. JVM-2 Outline Java Language, Java Virtual Machine and Java Platform Organization of Java Virtual Machine Garbage.
Advanced OS Chapter 3p2 Sections 3.4 / 3.5. Interrupts These enable software to respond to signals from hardware. The set of instructions to be executed.
San Diego Supercomputer Center Performance Modeling and Characterization Lab PMaC Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation.
INTEL CONFIDENTIAL OpenMP for Domain Decomposition Introduction to Parallel Programming – Part 5.
Sameer Shende, Allen D. Malony Computer & Information Science Department Computational Science Institute University of Oregon.
GASP: A Performance Tool Interface for Global Address Space Languages & Libraries Adam Leko 1, Dan Bonachea 2, Hung-Hsun Su 1, Bryan Golden 1, Hans Sherburne.
Advanced Database CS-426 Week 2 – Logic Query Languages, Object Model.
Charm++ Load Balancing Framework Gengbin Zheng Parallel Programming Laboratory Department of Computer Science University of Illinois at.
MPI3 Hybrid Proposal Description
OpenMP in a Heterogeneous World Ayodunni Aribuki Advisor: Dr. Barbara Chapman HPCTools Group University of Houston.
Paradyn Week – April 14, 2004 – Madison, WI DPOMP: A DPCL Based Infrastructure for Performance Monitoring of OpenMP Applications Bernd Mohr Forschungszentrum.
ICOM 5995: Performance Instrumentation and Visualization for High Performance Computer Systems Lecture 7 October 16, 2002 Nayda G. Santiago.
Lecture 8. Profiling - for Performance Analysis - Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture &
Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,
A Specification Language and Test Planner for Software Testing Aolat A. Adedeji 1 Mary Lou Soffa 1 1 DEPARTMENT OF COMPUTER SCIENCE, UNIVERSITY OF VIRGINIA.
4.1 Silberschatz, Galvin and Gagne ©2005 Operating System Concepts Chapter 4: Threads Overview Multithreading Models Thread Libraries  Pthreads  Windows.
Lecture 10 : Introduction to Java Virtual Machine
Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.
PMaC Performance Modeling and Characterization Performance Modeling and Analysis with PEBIL Michael Laurenzano, Ananta Tiwari, Laura Carrington Performance.
Real-Time Java on JOP Martin Schöberl. Real-Time Java on JOP2 Overview RTSJ – why not Simple RT profile Scheduler implementation User defined scheduling.
OpenMP – Introduction* *UHEM yaz çalıştayı notlarından derlenmiştir. (uhem.itu.edu.tr)
Invitation to Computer Science 5 th Edition Chapter 6 An Introduction to System Software and Virtual Machine s.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
Colorama: Architectural Support for Data-Centric Synchronization Luis Ceze, Pablo Montesinos, Christoph von Praun, and Josep Torrellas, HPCA 2007 Shimin.
The Mach System Abraham Silberschatz, Peter Baer Galvin, Greg Gagne Presentation By: Agnimitra Roy.
Debugging parallel programs. Breakpoint debugging Probably the most widely familiar method of debugging programs is breakpoint debugging. In this method,
Introduction to OpenMP
Introduction to OpenMP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB Fredericton, New Brunswick.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Processes and Threads.
13-1 Chapter 13 Concurrency Topics Introduction Introduction to Subprogram-Level Concurrency Semaphores Monitors Message Passing Java Threads C# Threads.
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.
Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.
PAPI on Blue Gene L Using network performance counters to layout tasks for improved performance.
3/12/2013Computer Engg, IIT(BHU)1 OpenMP-1. OpenMP is a portable, multiprocessing API for shared memory computers OpenMP is not a “language” Instead,
Operating Systems Unit 2: – Process Context switch Interrupt Interprocess communication – Thread Thread models Operating Systems.
CEPBA-Tools experiences with MRNet and Dyninst Judit Gimenez, German Llort, Harald Servat
UDI Technology Benefits Slide 1 Uniform Driver Interface UDI Technology Benefits.
A Dynamic Tracing Mechanism For Performance Analysis of OpenMP Applications - Caubet, Gimenez, Labarta, DeRose, Vetter (WOMPAT 2001) - Presented by Anita.
Dynamic Tuning of Parallel Programs with DynInst Anna Morajko, Tomàs Margalef, Emilio Luque Universitat Autònoma de Barcelona Paradyn/Condor Week, March.
Special Topics in Computer Engineering OpenMP* Essentials * Open Multi-Processing.
1 ROGUE Dynamic Optimization Framework Using Pin Vijay Janapa Reddi PhD. Candidate - Electrical And Computer Engineering University of Colorado at Boulder.
CSCI/CMPE 4334 Operating Systems Review: Exam 1 1.
Beyond Application Profiling to System Aware Analysis Elena Laskavaia, QNX Bill Graham, QNX.
Parallel OpenFOAM CFD Performance Studies Student: Adi Farshteindiker Advisors: Dr. Guy Tel-Zur,Prof. Shlomi Dolev The Department of Computer Science Faculty.
Sung-Dong Kim, Dept. of Computer Engineering, Hansung University Java - Introduction.
SHARED MEMORY PROGRAMMING WITH OpenMP
CS427 Multicore Architecture and Parallel Computing
Protection of System Resources
TAU integration with Score-P
Computer Engg, IIT(BHU)
September 4, 1997 Parallel Processing (CS 667) Lecture 5: Shared Memory Parallel Programming with OpenMP* Jeremy R. Johnson Parallel Processing.
Runtime Analysis of Hotspot Java Virtual Machine
Capriccio – A Thread Model
Allen D. Malony Computer & Information Science Department
Dynamic Binary Translators and Instrumenters
Presentation transcript:

University of Houston Open Source Software Support for the OpenMP Runtime API for Profiling Oscar Hernandez, Ramachandra Nanjegowda, Van Bui, Richard Krufin and Barbara Chapman High Performance Computing and Tools Group (HPCTools) Computer Science Department University of Houston, Texas

University of Houston Agenda Introduction to collector API. The basic collector interface and usage. Simple tools. Advanced tools. Implementation of collector API support. Future work, conclusion and questions.

University of Houston Objective of collector interface Designed to provide portable means for tools to collect information about an OpenMP application during runtime in transparent and scalable and independent manner.

University of Houston Features of collector api Collector interface or collector API was proposed by SUN as a white paper for tools committee of OpenMP ARB. Bi-directional: Communication between the OpenMP runtime library and performance tools. Scalable: Minimal overhead as Is a query and event notification based interface. Transparent: No need for application change, recompilation, instrumentation. and supports dynamic binding Independent: The tool and the runtime evolve independently Extensible: Adding more events and requests.

University of Houston Related work 1.POMP – Source code instrumentation using Opari. 2.GASP – Profiling interface for global address space programming models (UPC, CAF). 3.PMPI – A set of wrappers for each MPI call, with instrumentation calls before and after MPI calls. 4.PERUSE – Complements PMPI, extends MPI to give more information about internal states.

University of Houston The basic interface The single routine, int __omp_collector_api(void *msg) used by tools to communicate with runtime. One call, many requests. Designed to support events/states needed for statistical profiling and tracing tools. OpenMP Program (object code) OpenMP Program (object code) Collector API Performance Tool Executable request events

University of Houston Request and events and response Events Examples – OMP_EVENT_FORK – OMP_EVENT_JOIN SZR#ECRSZMEMSZR#ECRSZMEM0 Request1Request2 “__omp_collector_api(void *msg)” msg Request Examples – OMP_REQ_START – OMP_REQ_REGIST ER Response – PRID and PRFP. – Embedded in the MEM passed in REQ

University of Houston Typical usage 1.Export LD_PRELOAD to point to tool. 2.The tool exports the initialization and finalization routines using __atribute__ GCC extension. 3.Tools checks if collector is present 4.Tools request collector for initialization. 5.Tools register thread events with callbacks to the OpenMP RTL. OpenMP Application with Collector API. Performance Tool Is there a collector API? Yes/No Register Event(s)/ Callback(s) Initialize collector API Success/Ready Event Notification /Callback

University of Houston Simple tools Collecting OpenMP metrics and thread states. Collecting user call stack

University of Houston OpenMP states and metrics Example: #pragma omp parallel for reduction (+:sum) for(i=0; i <N ; i++) sum += a[i]; Serial State Master Thread Slave Thread Slave Thread Slave Threads Overhead State ( Prepare for Fork) Idle State Overhead State (Scheduler) Work State sum += a[i]; Reduction State Implicit Barrier State Idle State Serial State Fork Event Join Event Begin Barrier End Barrier Begin idle Event End idle Event Begin idle Event Fork (end parallel region)

University of Houston Advanced usage Collector API with TAU. Selective instrumentation using dynamic optimizer Integration of Collector API with PIN.

University of Houston OpenMP Collector API with TAU Procedures were Instrumented with the compiler Parallel Region(s) Enabling Fork and Join Events. Begin/End Implicit Barriers

University of Houston Dynamic instrumentation using collector API Dynamic optimizer will turn off/on the feedback data collection. This minimizes code generation of instrumentation at runtime Generate conditionals and function pointers to support code patching double x_solve(… ) { if(feedback_on) { return x_solve_inst(….) } …. }

University of Houston Integration with PIN PIN, an dynamic instrumentation tool. Enables to analyze the application at instruction level. Leverage collector api to collect more highlevel statistics. Better understanding of the program behavior with low level instrumentation and high level statistics.

University of Houston Implementation of collector interface in OpenUH 2 functions inserted at different points in RTL – __ompc_event_callback(event_name) – __ompc_set_state(state_name) inline void __ompc_barrier(void) { __ompc_set_state(THR_IBAR_STATE); __ompc_event_callback(OMP_EVENT_THR_BEGIN_IBAR); __ompc_barrier_wait(&__omp_level_1_team_manager); __ompc_event_callback(OMP_EVENT_THR_END_IBAR); __ompc_set_state(THR_WORK_STATE); return; }

University of Houston Overhead and scalability A tool with callback for each event, and call stack display for unique regions. Most cases overhead is less than 5% relative to the execution time.

University of Houston Conclusion and future work Collector interface for both tool users and tool developers. Make the basic and advanced tools available. Integrate with other tools like TAU, PIN and many more. Integrate with dynamic compiler framework.

University of Houston References Formal definition White paper at api.htmlhttp:// api.html Providing Observability for OpenMP 3.0 Applications Yuan Lin, Oleg Mazurov

University of Houston Questions?

University of Houston Backup slides

University of Houston Collecting user callstack Libpsx extension to Perfsuite library. Callstack retrieval using libunwind. Mapping of callstack values to source code location using Binary File descriptor API (libbfd). Removing the RTL frames in the trace. OMP_JOIN_EVENT Callback IsUnique Region ? Collect call stack psx_rtc_cs() psx_bfd_map() 1. Request current PRID and PRFP 2. Determine unique region using Frame pointer. 1. Provides the instruction pointer values for each stack frame. 1. Map the instruction pointer values with source code locations

University of Houston Implementation of collector interface in OpenUH OpenMP Runtime Collector API Collector Tool OMP_REQ_START OMP_REQ_EVENT OMP_REQ_PRID REQ_START Init queue, callback tables, keep track of PRID’s Begin tracking of states, PRID’s Activate monitoring of event A Query Current state REQ_EVENT Store the callback functions in callback table. REQ_PRID PRID and PRFP in the response

University of Houston Hardware counters and Collector interface OpenMP Application Collector Tool Register Fork and Join Invoke callback Fork/Join Counter Oveflow PAPI_Init PAPI_Add_Event PAPI_Overflow Callback() if(FORK) PAPI_START if(JOIN) PAPI_STOP Handler() Mechanism to map to source line Invoke Handler