TAU: A Framework for Parallel Performance Analysis Allen D. Malony malony@cs.uoregon.edu ParaDucks Research Group Computer & Information Science Department Computational Science Institute University of Oregon
Outline Goals and challenges Targeted research areas TAU (Tuning and Analysis Utilities) computation model, architecture, toolkit framework performance system technology examples of TAU use Tools associated with TAU PDT (Program Database Toolkit) distributed runtime monitoring Future plans Conclusions November 21, 2018 Ptools Annual Meeting
Goal and Challenges Create robust (performance) technology for the analysis and tuning of parallel software and systems Challenges different scalable computing platforms different programming languages and systems common, portable framework for analysis extensibe, retargetable tool technology complex set of requirements November 21, 2018 Ptools Annual Meeting
Targeted Research Areas Performance analysis for scalable parallel systems targeting multiple programming and system levels and the mapping between levels Program code analysis for multiple languages enabling development of new source-based tools Integration and interoperation support for building analysis tool frameworks and environments Runtime tool interaction for dynamic applications November 21, 2018 Ptools Annual Meeting
TAU (Tuning and Analysis Utilities) Performance analysis framework for scalable parallel and distributed high-performance computing Target a general parallel computation model computer nodes shared address space contexts threads of execution multi-level parallelism Integrated toolkit for performance instrumentation, measurement, analysis, and visualization portable performance profiling/tracing facility open software approach Network November 21, 2018 Ptools Annual Meeting
TAU Architecture November 21, 2018 Ptools Annual Meeting
TAU Instrumentation Flexible, multiple instrumentation mechanisms source code manual automatic using PDT (tau_instrumentor) object code pre-instrumented libraries statically linked: MPI wrapper library using the MPI Profiling Interface (libTauMpi.a) dynamically linked: Java instrumentation using JVMPI and TAU shared object dynamically loaded in VM executable code dynamic instrumentation using DyninstAPI (tau_run) November 21, 2018 Ptools Annual Meeting
TAU Instrumentation (continued) Common target measurement interface (TAU API) C++ (object-based) instrumentation macro-based, using constructor/destructor techniques function, classes, and templates uniquely identify functions and templates name and type signature (name registration) static object creates performance entry dynamic object receives static object pointer runtime type identification for template instantiations with C and Fortran instrumentation variants Instrumentation optimization November 21, 2018 Ptools Annual Meeting
TAU Measurement Performance information Organization high resolution timer library (real-time clock) generalized software counter library hardware performance counters PCL (Performance Counter Library) (ZAM, Germany) PAPI (Performance API) (UTK, Ptools) consistent, portable API Organization node, context, thread levels profile groups for collective events (runtime selective) mapping between software levels November 21, 2018 Ptools Annual Meeting
TAU Measurement (continued) Profiling function-level, block-level, statement-level supports user-defined events TAU profile (function) database (PD) function callstack hardware counts instead of time Tracing profile-level events interprocess communication events timestamp synchronization User-controlled configuration (configure) November 21, 2018 Ptools Annual Meeting
Timing of Multi-threaded Applications Capture timing information on per thread basis Two alternative wall clock time works on all systems user-level measurement OS-maintained CPU time (e.g., Solaris, Linux) thread virtual time measurement TAU supports both alternatives CPUTIME module profiles user+system time % configure -pthread -CPUTIME November 21, 2018 Ptools Annual Meeting
TAU Analysis Profile analysis Trace analysis pprof racy parallel profiler with text-based display racy graphical interface to pprof Trace analysis trace merging and clock adjustment (if necessary) trace format conversion (ALOG, SDDF, PV, Vampir) Vampir trace analysis and visualization tool (Pallas) November 21, 2018 Ptools Annual Meeting
TAU Status Usage platforms languages communication libraries IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster languages C, C++, Fortran 77/90, HPF, pC++, HPC++, Java communication libraries MPI, PVM, Nexus, Tulip, ACLMPL thread libraries pthreads, Tulip, SMARTS, Java,Windows compilers KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray November 21, 2018 Ptools Annual Meeting
TAU Status (continued) application libraries Blitz++, A++/P++, ACLVIS, PAWS application frameworks POOMA, POOMA-2, MC++, Conejo, PaRP other projects ACPC, University of Vienna: Aurora UC Berkeley (Culler): Millenium, sensitivity analysis KAI and Pallas TAU profiling and tracing toolkit (Version 2.7) LANL ACL Fall 1999 CD-ROM distributed at SC'99 Extensive 70-page TAU User’s Guide http://www.acl.lanl.gov/tau November 21, 2018 Ptools Annual Meeting
TAU Examples Instrumentation Measurement Analysis C++ template profiling (PETE, Blitz++) Java and MPI PAPI Measurement mapping of asynchronous execution (SMARTS) hybrid execution (Opus/HPF) Analysis SMARTS scheduling November 21, 2018 Ptools Annual Meeting
C++ Template Instrumentation (Blitz++, PETE) High-level objects array classes templates Optimizations array processing expressions (PETE) Relate performance data to high-level statement Complexity of template evaluation Array expressions November 21, 2018 Ptools Annual Meeting
Standard Template Instrumentation Difficulties Instantiated templates result in mangled identifiers Standard profiling techniques and tools are deficient integrated with proprietary compilers specific systems platforms and programming models Uninterpretable routine names November 21, 2018 Ptools Annual Meeting
TAU Template Instrumentation and Profiling Profile of expression types Performance data presented with respect to high-level array expression types Graphical pprof November 21, 2018 Ptools Annual Meeting
Parallel Java Performance Instrumentation Multi-language applications (Java, C, C++, Fortran) Hybrid execution models (Java threads, MPI) Java Virtual Machine Profiler Interface (JVMPI) event instrumentation in JVM profiler agent (libTAU.so) fields events Java Native Interface (JNI) invoke JVMPI control routines to control Java threads and access thread information MPI profiling interface “Performance Tools for Parallel Java Environments,” Java Workshop, ICS 2000, May 2000. November 21, 2018 Ptools Annual Meeting
TAU Java Instrumentation Architecture Java program TAU package mpiJava package Thread API JNI MPI profiling interface Event notification TAU TAU wrapper Native MPI library JVMPI Profile DB November 21, 2018 Ptools Annual Meeting
Parallel Java Game of Life mpiJava testcase 4 nodes, 28 threads Node process grouping Thread message pairing Vampir display Multi-level event grouping November 21, 2018 Ptools Annual Meeting
TAU and PAPI: NAS Parallel LU Benchmark SGI Power Onyx (4 processors, R10K), MPI Floating point operations Cross-node full / routine profiles Full FP profile for each node Percentage profile November 21, 2018 Ptools Annual Meeting
TAU and PAPI: Matrix Multiply Data cache miss comparison, “regular” vs. “strip-mining” execution 512x512 32 KB (P) 2 MB (S) Regular causes 4.5 times more misses November 21, 2018 Ptools Annual Meeting
Asynchronous Performance Analysis (SMARTS) Scalable Multithreaded Asynchronuous Runtime System user-level threads, light-weight virtual processors macro-dataflow, asynchronous execution interleaving iterates from data-parallel statements integrated with POOMA II TAU measurement of asynchronous parallel execution utilized the TAU mapping API associate iterate performance with data parallel statement evaluate different scheduling policies “SMARTS: Exploting Temporal Locality & Parallelism through Vertical Execution,” ICS '99, August 1999. November 21, 2018 Ptools Annual Meeting
TAU Mapping of Asynchronous Execution Without mapping Two threads executing With mapping POOMA / SMARTS November 21, 2018 Ptools Annual Meeting
With and without mapping (Thread 0) Thread 0 blocks waiting for iterates Iterates get lumped together With mapping Iterates distinguished November 21, 2018 Ptools Annual Meeting
With and without mapping (Thread 1) Array initialization performance lumped Without mapping Performance associated with ExpressionKernel object With mapping Iterate performance mapped to array statement Array initialization performance correctly separated November 21, 2018 Ptools Annual Meeting
TAU and Hybrid Execution in Opus/HPF Fortran 77, Fortran 90, HPF Vienna Fortran Compiling System Opus / HPF combined data (HPF) and task (Opus) parallelism HPF compiler produces Fortran 90 modules processes interoperate using Opus runtime system producer / consumer model MPI and pthreads performance influence at multiple software levels November 21, 2018 Ptools Annual Meeting
TAU Profiling of Opus/HPF Application Multiple producers Multiple consumers Parallelism View November 21, 2018 Ptools Annual Meeting
TAU Profiling of SMARTS Iteration scheduling for two array expressions November 21, 2018 Ptools Annual Meeting
SMARTS Tracing (SOR) – Vampir Visualization SCVE scheduler used in Red/Black SOR running on 32 processors of SGI Origin 2000 Asynchronous, overlapped parallelism November 21, 2018 Ptools Annual Meeting
Program Database Toolkit (PDT) Program code analysis framework for developing source-based tools High-level interface to source code information Integrated toolkit for source code parsing, database creation, and database query commercial grade front end parsers portable IL analyzer, database format, and access API open software approach for tool development Target and integrate multiple source languages http://www.acl.lanl.gov/pdtoolkit November 21, 2018 Ptools Annual Meeting
PDT Architecture and Tools November 21, 2018 Ptools Annual Meeting
PDT Summary Program Database Toolkit (Version 1.1) LANL ACL Fall 1999 CD-ROM distributed at SC'99 EDG C++ Front End (Version 2.41.2) C++ IL Analyzer and DUCTAPE library tools: pdbmerge, pdbconv, pdbtree, pdbhtml standard C++ system header files (KAI KCC 3.4c) Fortran 90 IL Analyzer in progress Automated TAU performance instrumentation Program analysis support for SILOON (ACL CD) “A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software,” submitted to SC ’00. November 21, 2018 Ptools Annual Meeting
Distributed Monitoring Framework Extend usability of TAU performance analysis Access TAU performance data during execution Framework model each application context is a performance data server monitor agent thread is created within each context client processes attach to agents and request data server thread synchronization for data consistency pull mode of interaction Distributed TAU performance data space “A Runtime Monitoring Framework for the TAU Profiling System,” ISCOPE ’99, Nov. 1999. November 21, 2018 Ptools Annual Meeting
TAU Distributed Monitor Architecture Each context has a monitor agent Client in separate thread directs agent Pull model of interaction Initial HPC++ implementation TAU profile database November 21, 2018 Ptools Annual Meeting
Java Implementation of TAU Monitor Motivations more portable monitor middleware system (RMI) more flexible and programmable server interface (JNI) more robust client development (EJB, JDBC, Swing) November 21, 2018 Ptools Annual Meeting
Future Plans TAU platforms: SGI Itanium, Sun Starfire, IBM Linux, ... languages: Java (Java Grande) , OpenMP instrument: automatic (F90, Java), Dyninst measurement: hardware counter, support PAPI displays: “beyond bargraphs” performance views performance database and technology support for multiple runs open API for analysis tool development PDT complete F90 and Java IL Analyzer source browsers: function, class, template tools for aiding in data marshalling and translation November 21, 2018 Ptools Annual Meeting
Future Plans (continued) Distributed monitoring framework application and system monitoring ACL Supermon and SGI Performance Co-Pilot scalable SMP clusters and distributed systems performance monitoring clients Performance evaluation numerical libraries and frameworks scalable runtime systems ASCI application developers (benchmark codes) Investigate performance issues in Linux kernel Investigate integration with CCA November 21, 2018 Ptools Annual Meeting
Conclusions Complex parallel computing environments require robust program analysis tools portable, cross-platform, multi-level, integrated able to bridge and reuse existing technology technology savvy TAU offers a robust performance technology framework for complex parallel computing systems flexible instrumentation and instrumentation extendable profile and trace performance analysis integration with other performance technology Opportunities exist for open performance technology November 21, 2018 Ptools Annual Meeting
Open Performance Technology (OPT) Performance problem is complex diverse platforms, software development, applications things evolve History of incompatible and competing tools instrumentation / measurement technology reinvention lack of common, reusable software foundations Need “value added” (open) approach technology for high-level performance tool development layered performance tool architecture portable, flexible, programmable, integrative technology Opportunity for Industry/National Labs/PACI sites November 21, 2018 Ptools Annual Meeting