UPC Performance Tool Interface Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant Mr. Adam Leko, Sr. Research Assistant.

Slides:



Advertisements
Similar presentations
MPI Message Passing Interface
Advertisements

More on Processes Chapter 3. Process image _the physical representation of a process in the OS _an address space consisting of code, data and stack segments.
Programming Paradigms Introduction. 6/15/2005 Copyright 2005, by the authors of these slides, and Ateneo de Manila University. All rights reserved. L1:
2003 Michigan Technological University March 19, Steven Seidel Department of Computer Science Michigan Technological University
The Kernel Abstraction. Challenge: Protection How do we execute code with restricted privileges? – Either because the code is buggy or if it might be.
CS 4800 By Brandon Andrews.  Specifications  Goals  Applications  Design Steps  Testing.
“THREADS CANNOT BE IMPLEMENTED AS A LIBRARY” HANS-J. BOEHM, HP LABS Presented by Seema Saijpaul CS-510.
Chapter 5 Processes and Threads Copyright © 2008.
1 Homework Turn in HW2 at start of next class. Starting Chapter 2 K&R. Read ahead. HW3 is on line. –Due: class 9, but a lot to do! –You may want to get.
Precept 3 COS 461. Concurrency is Useful Multi Processor/Core Multiple Inputs Don’t wait on slow devices.
Scheduler Activations Effective Kernel Support for the User-Level Management of Parallelism.
3.5 Interprocess Communication Many operating systems provide mechanisms for interprocess communication (IPC) –Processes must communicate with one another.
Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.
Home: Phones OFF Please Unix Kernel Parminder Singh Kang Home:
3.5 Interprocess Communication
Chapter 11-12, Appendix D C Programs Higher Level languages Compilers C programming Converting C to Machine Code C Compiler for LC-3.
CS533 Concepts of Operating Systems Class 3 Integrated Task and Stack Management.
. Memory Management. Memory Organization u During run time, variables can be stored in one of three “pools”  Stack  Static heap  Dynamic heap.
1 RISE: Randomization Techniques for Software Security Dawn Song CMU Joint work with Monica Chew (UC Berkeley)
GASP: A Performance Tool Interface for Global Address Space Languages & Libraries Adam Leko 1, Dan Bonachea 2, Hung-Hsun Su 1, Bryan Golden 1, Hans Sherburne.
OPERATING SYSTEMS DESIGN AND IMPLEMENTATION Third Edition ANDREW S. TANENBAUM ALBERT S. WOODHULL Yan hao (Wilson) Wu University of the Western.
06 April 2006 Parallel Performance Wizard: Analysis Module Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant Mr.
MPI3 Hybrid Proposal Description
Nachos Phase 1 Code -Hints and Comments
Operating System Support for Virtual Machines Samuel T. King, George W. Dunlap,Peter M.Chen Presented By, Rajesh 1 References [1] Virtual Machines: Supporting.
Analyzing parallel programs with Pin Moshe Bach, Mark Charney, Robert Cohn, Elena Demikhovsky, Tevi Devor, Kim Hazelwood, Aamer Jaleel, Chi- Keung Luk,
Compilation Technology SCINET compiler workshop | February 17-18, 2009 © 2009 IBM Corporation Software Group Coarray: a parallel extension to Fortran Jim.
11 July 2005 Tool Evaluation Scoring Criteria Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant Mr. Adam Leko,
The University of Adelaide, School of Computer Science
Part I MPI from scratch. Part I By: Camilo A. SilvaBIOinformatics Summer 2008 PIRE :: REU :: Cyberbridges.
10/16/ Realizing Concurrency using the thread model B. Ramamurthy.
Overview of CrayPat and Apprentice 2 Adam Leko UPC Group HCS Research Laboratory University of Florida Color encoding key: Blue: Information Red: Negative.
Operating Systems Lecture 7 OS Potpourri Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing Liu School of Software.
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
Includes slides from course CS194 at UC Berkeley, by prof. Katherine Yelick Shared Memory Programming Pthreads: an overview Ing. Andrea Marongiu
CS333 Intro to Operating Systems Jonathan Walpole.
System calls for Process management
Introduction to OpenMP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB Fredericton, New Brunswick.
Processes CS 6560: Operating Systems Design. 2 Von Neuman Model Both text (program) and data reside in memory Execution cycle Fetch instruction Decode.
Overview of dtrace Adam Leko UPC Group HCS Research Laboratory University of Florida Color encoding key: Blue: Information Red: Negative note Green: Positive.
Operating Systems CSE 411 CPU Management Sept Lecture 10 Instructor: Bhuvan Urgaonkar.
12/22/ Thread Model for Realizing Concurrency B. Ramamurthy.
CSE 351 Final Exam Review 1. The final exam will be comprehensive, but more heavily weighted towards material after the midterm We will do a few problems.
© Janice Regan, CMPT 300, May CMPT 300 Introduction to Operating Systems Operating Systems Processes and Threads.
21 Sep UPC Performance Analysis Tool: Status and Plans Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant.
Hardware process When the computer is powered up, it begins to execute fetch-execute cycle for the program that is stored in memory at the boot strap entry.
C LANGUAGE Characteristics of C · Small size
Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.
Threads A thread is an alternative model of program execution
Barriers and Condition Variables
1 Why Threads are a Bad Idea (for most purposes) based on a presentation by John Ousterhout Sun Microsystems Laboratories Threads!
Slides created by: Professor Ian G. Harris Operating Systems  Allow the processor to perform several tasks at virtually the same time Ex. Web Controlled.
Unified Parallel C at LBNL/UCB Berkeley UPC Runtime Report Jason Duell LBNL September 9, 2004.
Concurrency in Java MD. ANISUR RAHMAN. slide 2 Concurrency  Multiprogramming  Single processor runs several programs at the same time  Each program.
Mutual Exclusion -- Addendum. Mutual Exclusion in Critical Sections.
7-Nov Fall 2001: copyright ©T. Pearce, D. Hutchinson, L. Marshall Oct lecture23-24-hll-interrupts 1 High Level Language vs. Assembly.
CSE 332: C++ Exceptions Motivation for C++ Exceptions Void Number:: operator/= (const double denom) { if (denom == 0.0) { // what to do here? } m_value.
7/9/ Realizing Concurrency using Posix Threads (pthreads) B. Ramamurthy.
Threads Some of these slides were originally made by Dr. Roger deBry. They include text, figures, and information from this class’s textbook, Operating.
CS399 New Beginnings Jonathan Walpole.
Indranil Roy High Performance Computing (HPC) group
Realizing Concurrency using Posix Threads (pthreads)
Realizing Concurrency using the thread model
Memory Allocation CS 217.
Allen D. Malony Computer & Information Science Department
Programming with Shared Memory
Realizing Concurrency using Posix Threads (pthreads)
Realizing Concurrency using the thread model
CSE 153 Design of Operating Systems Winter 19
Dynamic Binary Translators and Instrumenters
Presentation transcript:

UPC Performance Tool Interface Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant Mr. Adam Leko, Sr. Research Assistant Mr. Bryan Golden, Research Assistant Mr. Hans Sherburne, Research Assistant HCS Research Laboratory University of Florida PAT

2 Motivation for Tool Interface UPC performance  Can be comparable with MPI code  But, generally requires hand tuning No performance tool support for UPC programs. Why?  New language, but…  Complicated compilers Several different implementation strategies  Direct compilation (GCC-UPC, Cray)  Library approach (Berkeley w/GASNET, MTU UPC, HP) “Wrappers” and binary instrumentation must be handled on a compiler-by-complier basis  One-sided memory operations  Relaxed memory model, compiler optimizations/reorganization Direct source instrumentation not accurate enough!

3 Proposed Interface Event-based interface  UPC compiler/runtime communicate with performance tools using standard interface  Performance tool is notified when certain actions happen at runtime Notification structure  Function “callback” to tool developer code  Use a single function name Pass in event ID and source code location Use varargs for rest of arguments (like printf )  Notifications can come from compiler/runtime (system events) or from code (user events)  Callback has to be threadsafe, re-entrant

4 (Very) Simple Example enum pupc_event_type {pupc_event_type_start, pupc_event_type_end, pupc_event_type_atomic}; void pupc_event_notify( unsigned int event_id, enum pupc_event_type event_type, const char* source_file, unsigned int source_line, unsigned int source_col,...);

5 Event ID Conventions Assumption: 32-bit, unsigned integer System-level events  Events that arise from actions defined in spec  Convention shown right  Range: 0x to 0x5FFFFFFF Implementation-specific events  Defined by implementation  Useful for software cache miss events, etc  Range: 0x to 0xBFFFFFFF 0x GroupActionType User-level events  Users obtain identifiers from tool at runtime with pupc_create_event(const char* name)  Used for marking phases or basic blocks of a program’s computation  Range: 0xC to 0xFFFFFFFF Example system event: Group 3 = library event Action 7 = upc_memcpy Type 0 = executed

6 Defined System Events Startup and shutdown (group 0)  Initialization called by each UPC thread after UPC runtime has been initialized  Exit called before all threads stop (two types of events: collective exit & non-collective exit) Synchronization (group 1)  Fence, notify, wait, barrier start/end Work sharing (group 2)  Forall start/end Library events (group 3)  Lock functions & string functions start/end, etc

7 Defined System Events (2) Direct shared variable access (group 4)  Two sets of get/put events for relaxed and strict memory accesses Function entry/exit (group 5)  Events occur at the beginning and end of every user function Collective events (group 6)  Events for all collective routines defined in spec All system events have symbolic names defined in pupc.h (along with prototypes for all tool interface functions) For complete list and details of each event, see proposal at

8 System Events Sample Table: Synchronization

9 Instrumentation & Measurement Control Need to control overhead of instrumentation for interface --profile flag  Instructs compiler to instrument all events for use with performance tool  Compiler should instrument all events, except Shared local accesses Accesses that have been privatized through optimizations --profile-local flag  Instruments everything as in --profile, but also includes shared local accesses #pragma pupc [on / off] directive  Controls instrumentation during compile time, only has effect when -- profile or --profile-local have been used  Instructs compiler to avoid instrumentation for specific regions of code, if possible pupc_control(int on); function call  Controls measurement during runtime done by performance tool

10 Tool Access to High-Resolution Timers UPC runtime library provides functions to access hardware timers (if available) Not a globally synchronized timer, but locally consistent to each thread Modelled directly after Berkeley UPC timers Get abstract “ticks”, convert to microseconds Prototypes: pupc_tick_t pupc_ticks_now(); uint64_t pupc_ticks_to_us(pupc_tick_t ticks); double pupc_tick_granularityus(); double pupc_tick_overheadus(); #define PUPC_TICK_MAX..., #define PUPC_TICK_MIN...

11 Open Issues Column number argument in callback  Used to differentiate two statements on a single line  But, Available in most systems? Users can split statement, or figure out indirectly what events are from what  Suggestion: keep column number, systems that don’t support pass in 0 No direct support for sampling  Can use interface directly and keep data structures in memory, then sample those using timer  Higher overhead than sampling available structures from runtime directly

12 Open Issues (2) Lines containing multiple events  Tools should expect multiple events to come from a single line  e.g., shared int a; shared int b; a = b + b + (++a);  Ideally, should receive an event for each remote get/put  Will this limit or change source-to-source translations? User function entry/exit points  Potential for very high overhead  What about mixing in C/MPI code?  Can examine stack and relate addresses to source functions, but requires platform-specific code  Examining just stack means more instrumentation required for profiling tools

13 Open Issues (3) One-sided RMA events  Which thread should get a notification that a remote get/put is starting on the other side, if any?  What about relaxed model + nonblocking communication that blocks at fences/barriers?  Need a systematic way for creating implementation-specific events Non-collective exits  Distributed systems (like BUPC) can’t guarantee that non-collective exits will be propagated and all threads will receive an exit event  Non-collective exits can cause incomplete data due to buffering, unless guards are in place  Suggestion: Require users to avoid non-collective exits for profiling runs (reasonable)

14 Open Issues (4) What code is compiled by a UPC compiler?  In proposal, tool developer code (pupc_event_notify et. al) is compiled by UPC compiler with no --profile flag  Some compilers might not support varargs in UPC code  Potential solution: C pupc_event_notify function that makes upcalls to UPC code  Need way of passing shared void* from C to UPC though! Efficient access to MYTHREAD on pthreaded systems  How to access MYTHREAD (in TLS) efficiently?  How to access at all if C pupc_event_notify is used?  Suggestion: have PUPC_INIT return context pointer that gets passed in on all subsequent profile callbacks

15 Questions, Comments, Suggestions? What can I do to get you into this UPC performance tool interface today?

16 Changes From Last Proposal (v1.1 to now) No source location structs, pass file information directly in callback Inclusion of collective event ID category New event type argument in callback: start, stop, atomic  Halves number of event IDs required User-level events  Function is now pupc_user_event() and also includes event type  No more user function events  Extra arguments are meant to be used as in printf: pupc_create_event(“Step”, “%d %f”) pupc_user_event(i, f); Different flags for instrumentation control Timers not “global timer”