TAU: A Framework for Parallel Performance Analysis

Slides:



Advertisements
Similar presentations
K T A U Kernel Tuning and Analysis Utilities Department of Computer and Information Science Performance Research Laboratory University of Oregon.
Advertisements

Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.
Robert Bell, Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science.
Sameer Shende Department of Computer and Information Science Neuro Informatics Center University of Oregon Tool Interoperability.
TAU: Tuning and Analysis Utilities. TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel.
The TAU Performance Technology for Complex Parallel Systems (Performance Analysis Bring Your Own Code Workshop, NRL Washington D.C.) Sameer Shende, Allen.
On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.
Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science Institute University.
The TAU Performance System: Advances in Performance Mapping Sameer Shende University of Oregon.
June 2, 2003ICCS Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee.
The TAU Performance System Sameer Shende, Allen D. Malony, Robert Bell University of Oregon.
Sameer Shende, Allen D. Malony Computer & Information Science Department Computational Science Institute University of Oregon.
1 MPI-2 and Threads. 2 What are Threads? l Executing program (process) is defined by »Address space »Program Counter l Threads are multiple program counters.
SC’01 Tutorial Nov. 7, 2001 TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed.
9/13/20151 Threads ICS 240: Operating Systems –William Albritton Information and Computer Sciences Department at Leeward Community College –Original slides.
Silberschatz, Galvin and Gagne ©2011Operating System Concepts Essentials – 8 th Edition Chapter 4: Threads.
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Chapter 4 Threads, SMP, and Microkernels Patricia Roy Manatee Community College, Venice, FL ©2008, Prentice Hall Operating Systems: Internals and Design.
Crossing The Line: Distributed Computing Across Network and Filesystem Boundaries.
Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.
Debugging parallel programs. Breakpoint debugging Probably the most widely familiar method of debugging programs is breakpoint debugging. In this method,
Multithreaded Programing. Outline Overview of threads Threads Multithreaded Models  Many-to-One  One-to-One  Many-to-Many Thread Libraries  Pthread.
© 2006, National Research Council Canada © 2006, IBM Corporation Solving performance issues in OTS-based systems Erik Putrycz Software Engineering Group.
A. Frank - P. Weisberg Operating Systems Structure of Operating Systems.
Full and Para Virtualization
Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.
Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 4: Threads.
1 Chapter 5: Threads Overview Multithreading Models & Issues Read Chapter 5 pages
Sung-Dong Kim, Dept. of Computer Engineering, Hansung University Java - Introduction.
Computer System Structures
Chapter 4: Threads Modified by Dr. Neerja Mhaskar for CS 3SH3.
Introduction to threads
Kai Li, Allen D. Malony, Sameer Shende, Robert Bell
Productive Performance Tools for Heterogeneous Parallel Computing
Operating System & Application Software
Chapter 4: Multithreaded Programming
Chapter 4: Threads.
Chapter 4: Threads.
Current Generation Hypervisor Type 1 Type 2.
Netscape Application Server
Chapter 5: Threads Overview Multithreading Models Threading Issues
TAU integration with Score-P
The Client/Server Database Environment
Chapter 4: Threads 羅習五.
University of Technology
Allen D. Malony, Sameer Shende
Chapter 4: Threads.
Threads & multithreading
Chapter 4: Threads.
Operating System Concepts
TAU Parallel Performance System
Performance Technology for Complex Parallel and Distributed Systems
TAU Parallel Performance System
Chapter 4: Threads.
Chapter 4: Threads.
Modified by H. Schulzrinne 02/15/10 Chapter 4: Threads.
Lecture 1: Multi-tier Architecture Overview
Lecture Topics: 11/1 General Operating System Concepts Processes
Threads Chapter 4.
Allen D. Malony Computer & Information Science Department
CHAPTER 4:THreads Bashair Al-harthi OPERATING SYSTEM
Multithreaded Programming
Outline Introduction Motivation for performance mapping SEAA model
Chapter 4: Threads & Concurrency
Chapter 4: Threads.
Performance Technology for Complex Parallel and Distributed Systems
Parallel Program Analysis Framework for the DOE ACTS Toolkit
Chapter 4: Threads.
Chapter 4: Threads.
Presentation transcript:

TAU: A Framework for Parallel Performance Analysis Allen D. Malony malony@cs.uoregon.edu ParaDucks Research Group Computer & Information Science Department Computational Science Institute University of Oregon

Outline Goals and challenges Targeted research areas TAU (Tuning and Analysis Utilities) computation model, architecture, toolkit framework performance system technology examples of TAU use Tools associated with TAU PDT (Program Database Toolkit) distributed runtime monitoring Future plans Conclusions November 21, 2018 Ptools Annual Meeting

Goal and Challenges Create robust (performance) technology for the analysis and tuning of parallel software and systems Challenges different scalable computing platforms different programming languages and systems common, portable framework for analysis extensibe, retargetable tool technology complex set of requirements November 21, 2018 Ptools Annual Meeting

Targeted Research Areas Performance analysis for scalable parallel systems targeting multiple programming and system levels and the mapping between levels Program code analysis for multiple languages enabling development of new source-based tools Integration and interoperation support for building analysis tool frameworks and environments Runtime tool interaction for dynamic applications November 21, 2018 Ptools Annual Meeting

TAU (Tuning and Analysis Utilities) Performance analysis framework for scalable parallel and distributed high-performance computing Target a general parallel computation model computer nodes shared address space contexts threads of execution multi-level parallelism Integrated toolkit for performance instrumentation, measurement, analysis, and visualization portable performance profiling/tracing facility open software approach Network November 21, 2018 Ptools Annual Meeting

TAU Architecture November 21, 2018 Ptools Annual Meeting

TAU Instrumentation Flexible, multiple instrumentation mechanisms source code manual automatic using PDT (tau_instrumentor) object code pre-instrumented libraries statically linked: MPI wrapper library using the MPI Profiling Interface (libTauMpi.a) dynamically linked: Java instrumentation using JVMPI and TAU shared object dynamically loaded in VM executable code dynamic instrumentation using DyninstAPI (tau_run) November 21, 2018 Ptools Annual Meeting

TAU Instrumentation (continued) Common target measurement interface (TAU API) C++ (object-based) instrumentation macro-based, using constructor/destructor techniques function, classes, and templates uniquely identify functions and templates name and type signature (name registration) static object creates performance entry dynamic object receives static object pointer runtime type identification for template instantiations with C and Fortran instrumentation variants Instrumentation optimization November 21, 2018 Ptools Annual Meeting

TAU Measurement Performance information Organization high resolution timer library (real-time clock) generalized software counter library hardware performance counters PCL (Performance Counter Library) (ZAM, Germany) PAPI (Performance API) (UTK, Ptools) consistent, portable API Organization node, context, thread levels profile groups for collective events (runtime selective) mapping between software levels November 21, 2018 Ptools Annual Meeting

TAU Measurement (continued) Profiling function-level, block-level, statement-level supports user-defined events TAU profile (function) database (PD) function callstack hardware counts instead of time Tracing profile-level events interprocess communication events timestamp synchronization User-controlled configuration (configure) November 21, 2018 Ptools Annual Meeting

Timing of Multi-threaded Applications Capture timing information on per thread basis Two alternative wall clock time works on all systems user-level measurement OS-maintained CPU time (e.g., Solaris, Linux) thread virtual time measurement TAU supports both alternatives CPUTIME module profiles user+system time % configure -pthread -CPUTIME November 21, 2018 Ptools Annual Meeting

TAU Analysis Profile analysis Trace analysis pprof racy parallel profiler with text-based display racy graphical interface to pprof Trace analysis trace merging and clock adjustment (if necessary) trace format conversion (ALOG, SDDF, PV, Vampir) Vampir trace analysis and visualization tool (Pallas) November 21, 2018 Ptools Annual Meeting

TAU Status Usage platforms languages communication libraries IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster languages C, C++, Fortran 77/90, HPF, pC++, HPC++, Java communication libraries MPI, PVM, Nexus, Tulip, ACLMPL thread libraries pthreads, Tulip, SMARTS, Java,Windows compilers KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray November 21, 2018 Ptools Annual Meeting

TAU Status (continued) application libraries Blitz++, A++/P++, ACLVIS, PAWS application frameworks POOMA, POOMA-2, MC++, Conejo, PaRP other projects ACPC, University of Vienna: Aurora UC Berkeley (Culler): Millenium, sensitivity analysis KAI and Pallas TAU profiling and tracing toolkit (Version 2.7) LANL ACL Fall 1999 CD-ROM distributed at SC'99 Extensive 70-page TAU User’s Guide http://www.acl.lanl.gov/tau November 21, 2018 Ptools Annual Meeting

TAU Examples Instrumentation Measurement Analysis C++ template profiling (PETE, Blitz++) Java and MPI PAPI Measurement mapping of asynchronous execution (SMARTS) hybrid execution (Opus/HPF) Analysis SMARTS scheduling November 21, 2018 Ptools Annual Meeting

C++ Template Instrumentation (Blitz++, PETE) High-level objects array classes templates Optimizations array processing expressions (PETE) Relate performance data to high-level statement Complexity of template evaluation Array expressions November 21, 2018 Ptools Annual Meeting

Standard Template Instrumentation Difficulties Instantiated templates result in mangled identifiers Standard profiling techniques and tools are deficient integrated with proprietary compilers specific systems platforms and programming models Uninterpretable routine names November 21, 2018 Ptools Annual Meeting

TAU Template Instrumentation and Profiling Profile of expression types Performance data presented with respect to high-level array expression types Graphical pprof November 21, 2018 Ptools Annual Meeting

Parallel Java Performance Instrumentation Multi-language applications (Java, C, C++, Fortran) Hybrid execution models (Java threads, MPI) Java Virtual Machine Profiler Interface (JVMPI) event instrumentation in JVM profiler agent (libTAU.so) fields events Java Native Interface (JNI) invoke JVMPI control routines to control Java threads and access thread information MPI profiling interface “Performance Tools for Parallel Java Environments,” Java Workshop, ICS 2000, May 2000. November 21, 2018 Ptools Annual Meeting

TAU Java Instrumentation Architecture Java program TAU package mpiJava package Thread API JNI MPI profiling interface Event notification TAU TAU wrapper Native MPI library JVMPI Profile DB November 21, 2018 Ptools Annual Meeting

Parallel Java Game of Life mpiJava testcase 4 nodes, 28 threads Node process grouping Thread message pairing Vampir display Multi-level event grouping November 21, 2018 Ptools Annual Meeting

TAU and PAPI: NAS Parallel LU Benchmark SGI Power Onyx (4 processors, R10K), MPI Floating point operations Cross-node full / routine profiles Full FP profile for each node Percentage profile November 21, 2018 Ptools Annual Meeting

TAU and PAPI: Matrix Multiply Data cache miss comparison, “regular” vs. “strip-mining” execution 512x512 32 KB (P) 2 MB (S) Regular causes 4.5 times more misses November 21, 2018 Ptools Annual Meeting

Asynchronous Performance Analysis (SMARTS) Scalable Multithreaded Asynchronuous Runtime System user-level threads, light-weight virtual processors macro-dataflow, asynchronous execution interleaving iterates from data-parallel statements integrated with POOMA II TAU measurement of asynchronous parallel execution utilized the TAU mapping API associate iterate performance with data parallel statement evaluate different scheduling policies “SMARTS: Exploting Temporal Locality & Parallelism through Vertical Execution,” ICS '99, August 1999. November 21, 2018 Ptools Annual Meeting

TAU Mapping of Asynchronous Execution Without mapping Two threads executing With mapping POOMA / SMARTS November 21, 2018 Ptools Annual Meeting

With and without mapping (Thread 0) Thread 0 blocks waiting for iterates Iterates get lumped together With mapping Iterates distinguished November 21, 2018 Ptools Annual Meeting

With and without mapping (Thread 1) Array initialization performance lumped Without mapping Performance associated with ExpressionKernel object With mapping Iterate performance mapped to array statement Array initialization performance correctly separated November 21, 2018 Ptools Annual Meeting

TAU and Hybrid Execution in Opus/HPF Fortran 77, Fortran 90, HPF Vienna Fortran Compiling System Opus / HPF combined data (HPF) and task (Opus) parallelism HPF compiler produces Fortran 90 modules processes interoperate using Opus runtime system producer / consumer model MPI and pthreads performance influence at multiple software levels November 21, 2018 Ptools Annual Meeting

TAU Profiling of Opus/HPF Application Multiple producers Multiple consumers Parallelism View November 21, 2018 Ptools Annual Meeting

TAU Profiling of SMARTS Iteration scheduling for two array expressions November 21, 2018 Ptools Annual Meeting

SMARTS Tracing (SOR) – Vampir Visualization SCVE scheduler used in Red/Black SOR running on 32 processors of SGI Origin 2000 Asynchronous, overlapped parallelism November 21, 2018 Ptools Annual Meeting

Program Database Toolkit (PDT) Program code analysis framework for developing source-based tools High-level interface to source code information Integrated toolkit for source code parsing, database creation, and database query commercial grade front end parsers portable IL analyzer, database format, and access API open software approach for tool development Target and integrate multiple source languages http://www.acl.lanl.gov/pdtoolkit November 21, 2018 Ptools Annual Meeting

PDT Architecture and Tools November 21, 2018 Ptools Annual Meeting

PDT Summary Program Database Toolkit (Version 1.1) LANL ACL Fall 1999 CD-ROM distributed at SC'99 EDG C++ Front End (Version 2.41.2) C++ IL Analyzer and DUCTAPE library tools: pdbmerge, pdbconv, pdbtree, pdbhtml standard C++ system header files (KAI KCC 3.4c) Fortran 90 IL Analyzer in progress Automated TAU performance instrumentation Program analysis support for SILOON (ACL CD) “A Tool Framework for Static and Dynamic Analysis of Object-Oriented Software,” submitted to SC ’00. November 21, 2018 Ptools Annual Meeting

Distributed Monitoring Framework Extend usability of TAU performance analysis Access TAU performance data during execution Framework model each application context is a performance data server monitor agent thread is created within each context client processes attach to agents and request data server thread synchronization for data consistency pull mode of interaction Distributed TAU performance data space “A Runtime Monitoring Framework for the TAU Profiling System,” ISCOPE ’99, Nov. 1999. November 21, 2018 Ptools Annual Meeting

TAU Distributed Monitor Architecture Each context has a monitor agent Client in separate thread directs agent Pull model of interaction Initial HPC++ implementation TAU profile database November 21, 2018 Ptools Annual Meeting

Java Implementation of TAU Monitor Motivations more portable monitor middleware system (RMI) more flexible and programmable server interface (JNI) more robust client development (EJB, JDBC, Swing) November 21, 2018 Ptools Annual Meeting

Future Plans TAU platforms: SGI Itanium, Sun Starfire, IBM Linux, ... languages: Java (Java Grande) , OpenMP instrument: automatic (F90, Java), Dyninst measurement: hardware counter, support PAPI displays: “beyond bargraphs” performance views performance database and technology support for multiple runs open API for analysis tool development PDT complete F90 and Java IL Analyzer source browsers: function, class, template tools for aiding in data marshalling and translation November 21, 2018 Ptools Annual Meeting

Future Plans (continued) Distributed monitoring framework application and system monitoring ACL Supermon and SGI Performance Co-Pilot scalable SMP clusters and distributed systems performance monitoring clients Performance evaluation numerical libraries and frameworks scalable runtime systems ASCI application developers (benchmark codes) Investigate performance issues in Linux kernel Investigate integration with CCA November 21, 2018 Ptools Annual Meeting

Conclusions Complex parallel computing environments require robust program analysis tools portable, cross-platform, multi-level, integrated able to bridge and reuse existing technology technology savvy TAU offers a robust performance technology framework for complex parallel computing systems flexible instrumentation and instrumentation extendable profile and trace performance analysis integration with other performance technology Opportunities exist for open performance technology November 21, 2018 Ptools Annual Meeting

Open Performance Technology (OPT) Performance problem is complex diverse platforms, software development, applications things evolve History of incompatible and competing tools instrumentation / measurement technology reinvention lack of common, reusable software foundations Need “value added” (open) approach technology for high-level performance tool development layered performance tool architecture portable, flexible, programmable, integrative technology Opportunity for Industry/National Labs/PACI sites November 21, 2018 Ptools Annual Meeting