Dynaprof Evaluation Report
Adam Leko, Hans Sherburne
UPC Group, HCS Research Laboratory, University of Florida
Color encoding key: Blue: Information; Red: Negative note; Green: Positive note
2 Basic Information
- Name: Dynaprof
- Developer: Philip Mucci (UTK)
- Current versions:
  - Dynaprof CVS as of 2/21/2005
  - DynInst API v4.1.1 (dependency)
  - PAPI v3.0.7 (dependency)
- Website:
- Contact: Philip Mucci
3 Dynaprof Overview
- Merges existing tools
  - PAPI
  - DynInst API
- Command-line tool
- Dynamically instruments programs at runtime
  - Requires no recompilation!
  - Insert probes at runtime
- Metrics available
  - Wall clock time
  - Any PAPI metrics
  - Can be extended
- Only simple GUI available (see screenshot)
  - Just wrapper around command-line version
  - Currently pretty broken
[Screenshot: Dynaprof startup banner reading "DynaProf 0.9, Philip J. Mucci. Provided courtesy of UTK's Innovative Computing Laboratory. This is Open Source Software!", followed by the (dynaprof) prompt]
4 Instrumentation Overview
- Instrumentation is very easy
  - Especially for sequential/threaded applications
- Compile the application as usual (-g eases naming later)
  - gcc -O3 -g -o camel camel.c
- Dynaprof commands
  - Load the executable: load camel
  - Specify which probe to use: use papiprobe [args]
  - List available functions: list camel.c
  - Instrument command
    - All functions in a file: instr module camel.c
    - A single function: instr function camel.c main
  - Run command
    - continue
    - Pausing execution currently does not work
- Instrumentation output is written to an additional file (its location is shown at runtime)
5 Instrumentation Overview (2)
- No special commands needed for
  - Sequential applications
  - pthread applications
- MPI not supported directly through the command line
  - Wrapper scripts available for MPICH and LAM
  - Dynaprof must be run in “batch mode”
    - A file containing all instrumentation commands
    - Halts the app before MPI_Init() is called
  - However, this does not work with the current version of MPICH
    - An assertion failure occurs and Dynaprof stops working
    - Can only use MPI programs with 1 process
- UPC?
  - Tried
    - GCC-UPC
    - BUPC (smp + pthreads)
  - Both produced no output or crashed Dynaprof
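A batch-mode command file can be assembled from the commands shown on the previous slide. The sketch below is only an illustration: the helper name build_batch_script is hypothetical, passing a single PAPI metric name as the probe argument is an assumed form of "use papiprobe [args]", and the command that starts execution is deliberately left out because the slides do not state it unambiguously. How the file is handed to Dynaprof is also not covered here.

```python
def build_batch_script(executable, probe, probe_args="", modules=()):
    """Return the text of a command file, one Dynaprof command per line.

    Only commands that appear in the slides are emitted:
    load, use, and instr module.
    """
    lines = [f"load {executable}"]
    # probe_args is passed through verbatim; its exact syntax is an assumption.
    lines.append(f"use {probe} {probe_args}".rstrip())
    for module in modules:
        lines.append(f"instr module {module}")
    return "\n".join(lines) + "\n"


script = build_batch_script("camel", "papiprobe", "PAPI_FP_OPS", ["camel.c"])
print(script, end="")
```

Keeping the commands in a generated file makes it easy to re-run the same instrumentation after changing the probe or the list of modules.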
6 Instrumentation Overhead
- Could only instrument one-process MPI code
  - MPI run wrapper script broken
  - No PPerf apps! (all require > 1 process)
- Camel overhead very high
  - Only instrumented main
  - LU overhead really low?
- Possible causes of overhead
  - Frequent subroutine calls from main
  - Use of tsc.h processor counters for timers may confuse Dynaprof
- Expect overhead similar to Paradyn
  - 5-10% for most applications with a reasonable number of instrumentation points
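Overhead figures such as the 5-10% quoted above are usually expressed as the relative increase in run time over an uninstrumented run. A minimal sketch of that arithmetic follows; the function name and the timings are made up for illustration and are not measurements from this evaluation.

```python
def overhead_percent(baseline_s, instrumented_s):
    """Relative slowdown of the instrumented run, in percent."""
    if baseline_s <= 0:
        raise ValueError("baseline time must be positive")
    return 100.0 * (instrumented_s - baseline_s) / baseline_s


# Hypothetical timings in seconds, not results from the report.
print(overhead_percent(100.0, 107.5))
```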
7 Dynaprof Probe Information
- Probes perform all data collection and analysis
  - Provide code to insert into a function when instrumented
- Probes can be called at 4 different points
  - Function entry point
  - Function exit point
  - Function call point
  - Function return point
- Each probe is encapsulated in a shared library
  - Allows relatively easy creation of new probes
- Available probes
  - “Wallclock” probe (records wall clock time)
  - PAPI wallclock probe (same as wallclock, uses high-resolution timers)
  - PAPI probe (records any PAPI metric, such as FLOPs)
    - Specify PAPI metrics as args in the use papiprobe [args] command
- Existing probes provide profile-style data only
  - Although there is no reason that a trace could not also be collected
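The entry/exit probe idea can be illustrated without DynInst. The sketch below is an analogy in Python, not Dynaprof's probe interface: a decorator stands in for the inserted code, reading a timer at function entry and again at function exit, and accumulating profile-style data (call count and total time) rather than a trace. All names are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps

# function name -> [call count, accumulated inclusive seconds]
profile = defaultdict(lambda: [0, 0.0])


def wallclock_probe(func):
    """Wrap func so a timer is read at entry and at exit."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()               # entry-point probe
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start  # exit-point probe
            entry = profile[func.__name__]
            entry[0] += 1
            entry[1] += elapsed
    return wrapper


@wallclock_probe
def work(n):
    return sum(i * i for i in range(n))


for _ in range(3):
    work(10_000)

calls, seconds = profile["work"]
print(f"work: {calls} calls, {seconds:.6f} s inclusive")
```

Because only totals are kept per function, the individual call timings are discarded, which is the same trade-off the existing probes make by producing profiles instead of traces.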
8 Probe Output
- After running, an ASCII file containing raw data is created
  - At runtime, a message like “ …output will be in /home/leko/… ” will be printed indicating where the file will be
- Three programs are provided which analyze the raw data
  - wallclockrpt: for wall clock probe
  - papiclockrpt: for PAPI wall clock probe
  - papiproberpt: for PAPI probe
- Summary statistics are provided
  - Exclusive profile (metric collected excluding children)
  - Inclusive profile (metric collected including children)
  - 1-call level deep profile (see which functions an instrumented function called)
- Output from *rpt programs is simple ASCII (sample on next slide)
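The relationship between the inclusive and exclusive profiles can be shown with a small worked example. This is a sketch of the general definition, not the logic of the *rpt programs: a function's exclusive value is its inclusive value minus the inclusive values of its direct children. The function names and numbers are hypothetical, and the sketch assumes each function has a single caller and no recursion.

```python
def exclusive_times(inclusive, children):
    """Exclusive value = inclusive value minus inclusive values of direct children."""
    result = {}
    for name, total in inclusive.items():
        child_total = sum(inclusive[c] for c in children.get(name, []))
        result[name] = total - child_total
    return result


# Hypothetical inclusive totals (arbitrary time units) and call tree.
inclusive = {"main": 100, "solve": 70, "exchange": 20}
children = {"main": ["solve"], "solve": ["exchange"]}

exclusive = exclusive_times(inclusive, children)
print(exclusive)
```

Here main spends 30 units in its own code, solve 50, and exchange 20, and the exclusive values sum to the inclusive total of main.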
9 Sample Probe Report (lu.W.1)
[Sample report: output of "wallclockrpt lu- 1.wallclock", numeric columns omitted. The report has three sections:
 - Exclusive Profile (columns: Name, Percent, Total, Calls), with rows TOTAL, unknown, main
 - Inclusive Profile (columns: Name, Percent, Total, SubCalls), with rows TOTAL, main
 - One-level inclusive call tree (columns: Parent/-Child, Percent, Total, Calls), with rows TOTAL and main, and children f_setarg, f_setsig, f_init, atexit, MAIN__]
Note: only “main” was instrumented in this profiled run
10 Bottleneck Identification Test Suite
- Testing metric: what did output of probe tell us?
- CAMEL: FAILED
  - Instrumenting main caused too much application perturbation
- NAS LU (“W” workload): TOSS-UP
  - Given enough time, any bottleneck could be identified
    - Even cache miss problems, thanks to PAPI!
    - But how much time to identify bottlenecks?
  - Communication problems difficult/impossible to pinpoint
    - No tracing
    - No communication visualization
- PPerfMark tests: NOT TESTED
  - Could not evaluate PPerfMark suite (running MPI commands broken)
  - However, same comments for LU would probably apply to all
- In general
  - Heavily reliant on user’s proficiency with pinpointing problems
  - Incremental approach
    - Instrument, re-run, instrument w/PAPI, re-run…
    - Process can be tedious
    - But, ease of instrumentation does ease this
11 Dynaprof General Comments
- Good points
  - Free
  - Source code available, relatively organized
    - Good reference on how to use PAPI & DynInst API
  - Very easy to use
  - Relatively easy to extend
  - Developer very responsive to questions
- Not-so-good points
  - High instrumentation overhead in a few cases
  - Simple to understand, but not much available functionality
    - Only profiling data with current probes
  - Not really being updated much any more
  - Changing program arguments requires reloading & reinstrumenting executable
- Dynaprof illustrates that a tool doesn’t have to be ultra-complicated to be useful
  - KISS!
12 Adding UPC/SHMEM Support
- UPC support
  - Would need to do a ton of work
  - Best bet
    - Provide a UPC probe
    - Instrument “known” UPC runtime functions
      - GASNet functions for Berkeley UPC
      - Etc.
    - Need one probe per UPC runtime/compiler environment
- SHMEM support
  - No extra work necessary!
  - Handles instrumenting libraries like any other code
- However, a few potential problems
  - Reliance on DynInst
    - Hard to port
    - Hard to compile!
  - Reliance on PAPI
    - Can add own probes which do not use PAPI though…
- Best way to use Dynaprof
  - Steal ideas on how to make a tool extensible
    - Probes as shared libraries: nice idea!
  - Steal code on how to use DynInst & PAPI
13 Evaluation (1)
- Available metrics: 1/5
  - Can use PAPI to get lots of data
  - Limited in what you can collect in a single run: only
    - Two PAPI metrics, or
    - Wall clock time
- Cost: 5/5
  - Free
- Documentation quality: 4/5
  - Minimal documentation, but covers the basics pretty well
- Extensibility: 3.5/5
  - Open source
  - Can add new functionality by writing new probes
  - Must write new code to extend (not much existing functionality)
- Filtering and aggregation: 2/5
  - Most program data is filtered out for you
    - Direct result of the profile nature of current probes
    - Many times too much information is lost
  - Filtering and aggregation behavior is fixed in the source code of the probes
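Since a single run collects at most two PAPI metrics, gathering a longer list of metrics means planning several runs. The sketch below simply splits a metric list into groups of two; the helper name is hypothetical, the metric names are standard PAPI preset events chosen for illustration, and whether a given preset is available depends on the platform.

```python
def plan_runs(metrics, per_run=2):
    """Split a list of metric names into groups of at most per_run."""
    return [metrics[i:i + per_run] for i in range(0, len(metrics), per_run)]


# PAPI preset names used only as an example selection.
metrics = ["PAPI_TOT_CYC", "PAPI_FP_OPS", "PAPI_L1_DCM", "PAPI_L2_DCM", "PAPI_TLB_DM"]

for n, group in enumerate(plan_runs(metrics), start=1):
    print(f"run {n}: {' '.join(group)}")
```

Five metrics therefore need three runs, which is part of why the incremental instrument-and-re-run process can become tedious.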
14 Evaluation (2)
- Hardware support: 3/5
  - 64-bit Linux (Itanium only), Sparc, IRIX, AlphaServer (Tru64), IBM SP (AIX)
  - Most everything supported: Linux, AIX, IRIX, HP-UX
  - Reliance on PAPI and DynInst could hinder porting
  - No Cray support
- Heterogeneity support: 0/5 (not supported)
- Installation: 3/5
  - Dynaprof easy to compile, but PAPI and DynInst a nightmare to install
  - Also had to hack up some source code a bit to work with newer versions of gcc & javac (JDK1.5)
- Interoperability: 0.5/5
  - No export interoperability with other tools
  - There is a half-done TAU probe
    - Not sure if it works
    - Or how useful it is!
- Learning curve: 4/5
  - Very easy to use
  - Anyone used to prof/gprof will feel right at home
15 Evaluation (3)
- Manual overhead: 3/5
  - Can automatically instrument all functions, a handful of functions, and all function calls within a given function
  - Very easy to choose which functions you want instrumented
  - Can script behavior of dynaprof executable
  - Reinstrumenting requires no recompilation
- Measurement accuracy: 5/5
  - For LU, instrumentation overhead almost negligible using PAPI probes
  - Overhead small as long as the number of instrumented functions is kept reasonable
  - Program’s correctness of execution not affected
  - Dynamic instrumentation does not get in the compiler’s way for optimizations
- Multiple executions: 0/5
  - Not supported
- Multiple analyses & views: 1/5
  - One way of recording data, one way of presenting it
  - Probes could theoretically present things differently, but none currently do
16 Evaluation (4)
- Performance bottleneck identification: 1/5
  - No automatic detection
  - Usefulness of tool directly related to cleverness of user
  - Many bottlenecks would be very difficult to detect with only the basic profile information given by hardware counters
- Profiling/tracing support: 2/5
  - Only supports profiling
  - Could feasibly add tracing if you wanted to code it
- Response time: 3/5
  - No data at all until after the run has completed and the output file has been opened
  - Generating reports from raw data is instantaneous though
- Software support: 4.5/5
  - Can link against (and instrument!!) any existing library
  - Supports MPI (although broken) and shared-memory threaded programs
- Source code correlation: 2/5
  - Data reported to user at the function name level
- Searching: 0/5 (not supported)
17 Evaluation (5)
- System stability: 3/5
  - Command-line interface relatively stable
  - pause while running broken in command-line
  - GUI severely broken
- Technical support: 4/5
  - Responses from contact within 24 hours
  - Philip Mucci very helpful, knowledgeable