Performance Technology for Complex Parallel and Distributed Systems

Performance Technology for Complex Parallel and Distributed Systems Allen D. Malony, Sameer Shende {malony,sameer}@cs.uoregon.edu Computer & Information Science Department Computational Science Institute University of Oregon

Performance Needs  Performance Technology Observe/analyze/understand performance behavior Multiple levels of software and hardware Different types and detail of performance data Alternative performance problem solving methods Multiple targets of software and system application Robust AND ubiquitous performance technology Broad scope of performance observability Flexible and configurable mechanisms Technology integration and extension Cross-platform portability Open layered and modular framework architecture April 13, 2019

Complexity Challenges Computing system environment complexity Observation integration and optimization Access, accuracy, and granularity constraints Diverse/specialized observation capabilities/technology Restricted modes limit performance problem solving Sophisticated software development environments Programming paradigms and performance models Performance data mapping to software abstractions Uniformity of performance abstraction across platforms Rich observation capabilities and flexible configuration Common performance problem solving methods

General Problem How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?

Talk Outline Complexity and Performance Technology Computation Model for Performance Technology TAU Performance Framework Model-oriented framework architecture TAU performance system toolkit Complexity Scenarios Multi-threaded execution Virtual machine environments Distributed message passing systems Hierarchical, hybrid parallel systems Future Work and Conclusions

Computation Model for Performance Technology How to address dual performance technology goals? Robust capabilities + widely available methodologies Contend with problems of system diversity Flexible tool composition/configuration/integration Approaches Restrict computation types / performance problems limited performance technology coverage Base technology on abstract computation model general architecture and software execution features map features/methods to existing complex system types develop capabilities that can adapt and be optimized

Framework for Performance Problem Solving Model-based composition Instrumentation / measurement / execution models performance observability constraints performance data types and events Analysis / presentation model performance data processing performance views and model mapping Integration model performance tool component configuration / integration Can performance problem solving framework be designed based on general complex system model?

General Complex System Computation Model Node: physically distinct shared memory machine Message passing node interconnection network Context: distinct virtual memory space within node Thread: execution threads (user/system) in context [Figure: SMP nodes, each with node memory, joined by an interconnection network; each node holds contexts (VM spaces), and each context holds threads]
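
The node / context / thread hierarchy on this slide can be read as a three-part address for performance data. The following is a minimal sketch under that assumption; `ExecId` and `ProfileStore` are hypothetical names introduced for illustration and are not part of TAU.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical identifier for one thread of execution in the model:
// node (shared memory machine) > context (address space) > thread.
struct ExecId {
    int node;
    int context;
    int thread;
    bool operator<(const ExecId& o) const {
        return std::tie(node, context, thread) <
               std::tie(o.node, o.context, o.thread);
    }
};

// Accumulated time per (location, event name).
class ProfileStore {
public:
    void add(const ExecId& where, const std::string& event, double seconds) {
        assert(seconds >= 0.0);
        data_[where][event] += seconds;
    }
    // Total for one event over every context and thread of a given node.
    double nodeTotal(int node, const std::string& event) const {
        double sum = 0.0;
        for (const auto& entry : data_) {
            if (entry.first.node != node) continue;
            auto it = entry.second.find(event);
            if (it != entry.second.end()) sum += it->second;
        }
        return sum;
    }
private:
    std::map<ExecId, std::map<std::string, double>> data_;
};
```

Keying the data by all three levels keeps per-thread detail available while still allowing roll-ups per context or per node.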

TAU Performance Framework Tuning and Analysis Utilities Performance system framework for scalable parallel and distributed high-performance computing Targets a general complex system computation model nodes / contexts / threads multi-level: system / software / parallelism measurement and analysis abstraction Integrated toolkit for performance instrumentation, measurement, analysis, and visualization portable performance profiling/tracing facility open software approach

TAU Architecture [Figure: TAU architecture diagram]

TAU Instrumentation
Flexible, multiple instrumentation mechanisms
  Source code: manual; automatic using PDT (tau_instrumentor)
  Object code: pre-instrumented libraries; statically linked; dynamically linked
  Executable code: dynamic instrumentation using DynInstAPI (tau_run)

TAU Instrumentation (continued) Common target measurement interface (TAU API) C++ (object-based) design and implementation Macro-based, using constructor/destructor techniques Functions, classes, and templates Uniquely identify functions and templates name and type signature (name registration) static object creates performance entry dynamic object receives static object pointer runtime type identification for template instantiations C and Fortran instrumentation variants Instrumentation and measurement optimization
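
The constructor/destructor technique described on this slide can be sketched roughly as follows. This is an illustration only, not TAU's implementation: `FunctionInfo`, `registerFunction`, `ScopedProfiler`, and the `SKETCH_PROFILE` macro are hypothetical names. A static pointer registers the function once under its name and type signature, and a per-call object times the enclosing scope.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-function record, created once and reused on every call.
struct FunctionInfo {
    std::string name;
    std::string type;     // type signature, helps tell overloads apart
    long calls = 0;
    double seconds = 0.0;
};

// Registry of instrumented functions, keyed by name plus type signature.
inline std::map<std::string, FunctionInfo>& registry() {
    static std::map<std::string, FunctionInfo> r;
    return r;
}

inline FunctionInfo* registerFunction(const std::string& name,
                                      const std::string& type) {
    FunctionInfo& fi = registry()[name + " " + type];
    fi.name = name;
    fi.type = type;
    return &fi;   // std::map nodes are stable, so the pointer stays valid
}

// Dynamic object: starts timing in the constructor, stops in the destructor.
class ScopedProfiler {
public:
    explicit ScopedProfiler(FunctionInfo* fi)
        : fi_(fi), start_(std::chrono::steady_clock::now()) {
        assert(fi_ != nullptr);
        ++fi_->calls;
    }
    ~ScopedProfiler() {
        std::chrono::duration<double> d =
            std::chrono::steady_clock::now() - start_;
        fi_->seconds += d.count();
    }
private:
    FunctionInfo* fi_;
    std::chrono::steady_clock::time_point start_;
};

// Hypothetical macro (not TAU's): the static pointer is initialized on the
// first call only; the ScopedProfiler is constructed on every call.
#define SKETCH_PROFILE(name, type)                                   \
    static FunctionInfo* sketch_fi_ = registerFunction(name, type);  \
    ScopedProfiler sketch_prof_(sketch_fi_)

// Example of an instrumented function.
inline int work(int n) {
    SKETCH_PROFILE("work", "int (int)");
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += i;
    return sum;
}
```

Because the timer stops in the destructor, every exit path from the function is measured without extra instrumentation at each return.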

TAU Measurement
Performance information
  High resolution timer library (real-time / virtual clocks)
  Generalized software counter library
  Hardware performance counters
    PCL (Performance Counter Library) (ZAM, Germany)
    PAPI (Performance API) (UTK, Ptools Consortium)
    consistent, portable API
Organization
  Node, context, thread levels
  Profile groups for collective events (runtime selective)
  Mapping between software levels
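
One common way to make profile groups selectable at runtime is a bit mask that is checked before an event is measured. The sketch below assumes that approach; the group names and the `GroupFilter` class are hypothetical and do not describe how TAU stores its groups.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical profile groups, one bit each.
enum ProfileGroup : std::uint32_t {
    GROUP_IO      = 1u << 0,
    GROUP_COMM    = 1u << 1,
    GROUP_COMPUTE = 1u << 2
};

// Runtime-selectable filter: only events in an enabled group are measured.
class GroupFilter {
public:
    void enable(std::uint32_t groups) {
        assert(groups != 0);
        mask_ |= groups;
    }
    void disable(std::uint32_t groups) { mask_ &= ~groups; }
    bool shouldMeasure(std::uint32_t group) const {
        return (mask_ & group) != 0;
    }
private:
    std::uint32_t mask_ = 0;   // all groups start disabled
};
```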

TAU Measurement (continued)
Profiling
  Function-level, block-level, statement-level
  Supports user-defined events
  TAU profile (function) database (PD)
  Function callstack
  Hardware counts instead of time
Tracing
  Profile-level events
  Interprocess communication events
  Timestamp synchronization
User-controlled configuration (configure)

TAU Analysis
Profile analysis
  pprof: parallel profiler with text-based display
  racy: graphical interface to pprof
Trace analysis
  Trace merging and clock adjustment (if necessary)
  Trace format conversion (ALOG, SDDF, PV, Vampir)
  Vampir (Pallas)
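
Trace merging with clock adjustment can be illustrated with a small sketch: each node's timestamps are shifted by a known offset and the events are then ordered on a common time line. The `TraceEvent` layout and the `mergeTraces` function are assumptions made for this example, and how the offsets are obtained is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct TraceEvent {
    double time;        // timestamp on the recording node's local clock
    int node;
    std::string name;
};

// Merge per-node traces into one global trace. offsets[i] is the assumed
// clock offset of node i relative to the reference clock.
inline std::vector<TraceEvent> mergeTraces(
        const std::vector<std::vector<TraceEvent>>& perNode,
        const std::vector<double>& offsets) {
    assert(perNode.size() == offsets.size());
    std::vector<TraceEvent> merged;
    for (std::size_t i = 0; i < perNode.size(); ++i) {
        for (TraceEvent e : perNode[i]) {
            e.time -= offsets[i];   // clock adjustment
            merged.push_back(e);
        }
    }
    // stable_sort keeps the original relative order for equal timestamps.
    std::stable_sort(merged.begin(), merged.end(),
                     [](const TraceEvent& a, const TraceEvent& b) {
                         return a.time < b.time;
                     });
    return merged;
}
```

Without the adjustment, a receive recorded on a node whose clock runs ahead could appear later than it happened, or a send could appear after its matching receive.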

TAU Status
Usage (selective)
Platforms
  IBM SP, SGI Origin 2K, Intel Teraflop, Cray T3E, HP, Sun, Windows 95/98/NT, Alpha/Pentium Linux cluster
Languages
  C, C++, Fortran 77/90, HPF, pC++, HPC++, Java
Communication libraries
  MPI, PVM, Nexus, Tulip, ACLMPL
Thread libraries
  pthreads, Tulip, SMARTS, Java, Windows
Compilers
  KAI, PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray

TAU Status (continued)
Application libraries
  Blitz++, A++/P++, ACLVIS, PAWS
Application frameworks
  POOMA, POOMA-2, MC++, Conejo, PaRP
Other projects
  ACPC, University of Vienna: Aurora
  UC Berkeley (Culler): Millennium, sensitivity analysis
  KAI and Pallas
TAU profiling and tracing toolkit (Version 2.8 beta)
Extensive 70-page TAU User’s Guide
http://www.cs.uoregon.edu/research/paracomp/tau

Complexity Scenarios Multi-threaded execution Abstract thread-based performance measurement Virtual machine environments Performance instrumentation in virtual machine Measurement of multi-level virtual machine events Distributed message passing computation Performance measurement message passing library Integration with multi-threading Hierarchical, hybrid parallel systems Portable shared memory and message passing APIs Combined task and data parallel execution Performance system configuration and model mapping

Multi-Threading Performance Measurement Fine-grained parallelism Different forms and levels of threading Greater need for efficient instrumentation Issues Thread identity and per-thread data storage Performance measurement support and synchronization TAU general threading and measurement model Common thread layer and measurement support Interface to system specific libraries (reg, id, sync) Target different thread systems with core functionality Pthreads, Windows, Java, SMARTS, Tulip
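
The idea of a common thread layer over system-specific registration, identity, and synchronization can be sketched as below. This is a simplified illustration, not TAU's thread layer: `ThreadLayer` is a hypothetical class, and the thread package is assumed to supply an opaque native id as a 64-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

// Hypothetical common thread layer. A thread package (pthreads, Java, ...)
// supplies an opaque native id; the layer assigns a dense index that is
// used to address per-thread measurement data.
class ThreadLayer {
public:
    // Registration: returns the existing index or assigns the next one.
    std::size_t registerThread(std::uint64_t nativeId) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = index_.find(nativeId);
        if (it != index_.end()) return it->second;
        std::size_t idx = counts_.size();
        index_[nativeId] = idx;
        counts_.push_back(0);
        return idx;
    }
    // Per-thread data storage: each thread updates only its own slot.
    void recordEvent(std::uint64_t nativeId) {
        std::size_t idx = registerThread(nativeId);
        std::lock_guard<std::mutex> lock(mu_);
        ++counts_[idx];
    }
    long eventCount(std::size_t idx) const {
        std::lock_guard<std::mutex> lock(mu_);
        assert(idx < counts_.size());
        return counts_[idx];
    }
private:
    mutable std::mutex mu_;   // synchronization for the shared tables
    std::map<std::uint64_t, std::size_t> index_;
    std::vector<long> counts_;
};
```

Only the mapping from native id to index depends on the thread package, so the measurement code above it can stay the same across thread systems.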

Multi-Threading in Java
Profile and trace Java (JDK 1.2+) applications
Observe user-level and system-level threads
Observe events for different Java packages: /lang, /io, /awt, …
Test application: SciVis, NPAC, Syracuse University
  % ./configure -jdk=<dir_where_jdk_is_installed>
  % setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib
  % java -XrunTAU svserver

TAU Profiling of Java Application (SciVis) 24 threads of execution! Profile for each Java thread Captures events for different Java packages

TAU Tracing of Java Application (SciVis) Performance groups Timeline display Parallelism view

Vampir Dynamic Call Tree View (SciVis) Per thread call tree Expanded call tree Annotated performance

Virtual Machine Environments (Java) Integrate performance system with VM system/apps Captures robust performance data Maintain features of environment portability, concurrency, extensibility, interoperation Java No need to modify Java source, bytecode, or JVM Implemented using JVMPI (JVM profiling interface) fields JVMPI events Executes in memory space of JVM profiler agent loaded as shared object

TAU Java Instrumentation Architecture [Figure: Java program, TAU package, thread API, JNI, event notification, TAU, JVMPI, profile DB]

Distributed Message Passing (MPI) Performance measurement of communication events Integrate with application events Multi-language applications execution C, C++, Fortran, Java mpiJava (Syracuse, JavaGrande) Java wrapper package with JNI C bindings to MPI Integrate cross-language/system technology JVMPI and TAU profiler agent MPI profiling interface - link-time interposition library Cross execution mode uniformity and consistency invoke JVMPI control routines to control Java threads access thread information and expose to MPI interface

TAU Java Instrumentation Architecture [Figure: Java program, TAU package, mpiJava package, thread API, JNI, MPI profiling interface, event notification, TAU, TAU wrapper, native MPI library, JVMPI, profile DB]

Parallel Java Game of Life (Profile) mpiJava testcase 4 nodes, 28 threads Merged Java and MPI event profiles Thread 4 executes all MPI routines Node 0 Node 1 Node 2

Parallel Java Game of Life (Trace) Integrated event tracing Merged trace viz Node process grouping Thread message pairing Vampir display Multi-level event grouping

Hierarchical, Hybrid Parallel Systems (Opus/HPF) Hierarchical, hybrid programming and execution model Multi-threaded SMP and inter-node message passing Integrated task and data parallelism Opus / HPF environment (University of Vienna) Combined data (HPF) and task (Opus) parallelism HPF compiler produces Fortran 90 modules Processes interoperate using Opus runtime system producer / consumer model MPI and pthreads Performance influence at multiple software levels Performance analysis oriented to programming model

Opus / HPF Execution Trace 4-node, 28 process: consumers, producers Process-grouping in Vampir visualization

Opus / HPF Execution Trace Statistics

Hybrid Parallel Computation (OpenMP + MPI) Portable hybrid parallel programming OpenMP for shared memory parallel programming Fork-join model Loop level parallelism MPI for cross-box message-based parallelism OpenMP performance measurement Interface to OpenMP runtime system (RTS events) Compiler support and integration 2D Stommel model of ocean circulation Jacobi iteration, 5-point stencil Timothy Kaiser (San Diego Supercomputer Center)
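
To make "Jacobi iteration, 5-point stencil" with loop-level OpenMP parallelism concrete, here is a minimal sketch of one sweep. It solves the plain Laplace stencil rather than the Stommel equations, it omits the MPI exchange between nodes, and `Grid` and `jacobiSweep` are names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<double>>;

// One Jacobi sweep of a 5-point stencil for the Laplace equation: every
// interior point becomes the average of its four neighbours. Boundary
// values are copied unchanged. Without OpenMP enabled at compile time the
// pragma is ignored and the loop runs serially.
inline Grid jacobiSweep(const Grid& u) {
    assert(!u.empty());
    Grid next = u;
    const long rows = static_cast<long>(u.size());
    const long cols = static_cast<long>(u[0].size());
    #pragma omp parallel for
    for (long i = 1; i < rows - 1; ++i) {
        for (long j = 1; j < cols - 1; ++j) {
            next[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1]);
        }
    }
    return next;
}
```

Jacobi reads only the old grid and writes only the new one, so the rows can be divided among threads without synchronization inside a sweep; this is the fork-join, loop-level pattern the slide refers to.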

OpenMP + MPI Ocean Modeling (Trace) Thread message pairing Integrated OpenMP + MPI events

OpenMP + MPI Ocean Modeling (HW Profile)
  % configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/libo
Integrated OpenMP + MPI events
FP instructions

Summary Complex parallel computing environments require robust and widely available performance technology Portable, cross-platform, multi-level, integrated Able to bridge and reuse existing technology Technology savvy and open TAU is only a performance technology framework General computation model and core services Mapping, extension, and refinement Integration of additional performance technology Need for higher-level framework layers Computational and performance model archetypes Performance diagnosis

TAU Future Plans
Platforms
  IA-64 (!), Compaq ASCI Q, Sun Starfire, Linux SC
Languages
  OpenMP + MPI, Java (Java Grande), Opus / Java
Instrumentation
  Automatic source (F90, Java), DynInst, DPCL, DITools
Measurement
  Extend tracing support to include event data (e.g., HW counts)
  Dynamic performance measurement control
Displays
  Extensible Performance Display Tool (ExPeDiTo)
  TraceView 2 (TV2), Pajé
Performance database and technology
  Support for multiple runs
  Open API for analysis tool development

Open Performance Technology (OPT) Performance problem is complex diverse platforms, software development, applications things evolve History of incompatible and competing tools instrumentation / measurement technology reinvention lack of common, reusable software foundations Need “Value added” (open) approach technology for high-level performance tool development layered performance tool architecture portable, flexible, programmable, integrative technology Opportunity for performance technology integration in industry and across HPC software projects

TAU Measurement API
Configuration
  TAU_PROFILE_INIT(argc, argv);
  TAU_PROFILE_SET_NODE(myNode);
  TAU_PROFILE_SET_CONTEXT(myContext);
  TAU_PROFILE_EXIT(message);
Function and class methods
  TAU_PROFILE(name, type, group);
Template
  TAU_TYPE_STRING(variable, type);
  TAU_PROFILE(name, type, group);
  CT(variable);
User-defined timing
  TAU_PROFILE_TIMER(timer, name, type, group);
  TAU_PROFILE_START(timer);
  TAU_PROFILE_STOP(timer);

TAU Measurement API (continued)
User-defined events
  TAU_REGISTER_EVENT(variable, event_name);
  TAU_EVENT(variable, value);
  TAU_PROFILE_STMT(statement);
Mapping
  TAU_MAPPING(statement, key);
  TAU_MAPPING_OBJECT(funcIdVar);
  TAU_MAPPING_LINK(funcIdVar, key);
  TAU_MAPPING_PROFILE(FuncIdVar);
  TAU_MAPPING_PROFILE_TIMER(timer, FuncIdVar);
  TAU_MAPPING_PROFILE_START(timer);
  TAU_MAPPING_PROFILE_STOP(timer);
Reporting
  TAU_REPORT_STATISTICS();
  TAU_REPORT_THREAD_STATISTICS();
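
The pairing of TAU_REGISTER_EVENT and TAU_EVENT suggests an object that is registered once under a name and then triggered with values. The sketch below is a stand-in written for this transcript, not the TAU macros themselves; `UserEvent` is a hypothetical class, and the choice of count, minimum, maximum, and mean as the recorded statistics is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <string>

// Hypothetical stand-in for a registered user-defined event: it keeps
// summary statistics of the values it is triggered with.
class UserEvent {
public:
    explicit UserEvent(const std::string& name) : name_(name) {}

    // Analogous in spirit to triggering the event with a value.
    void trigger(double value) {
        ++count_;
        sum_ += value;
        min_ = std::min(min_, value);
        max_ = std::max(max_, value);
    }
    long count() const { return count_; }
    double min() const { assert(count_ > 0); return min_; }
    double max() const { assert(count_ > 0); return max_; }
    double mean() const { assert(count_ > 0); return sum_ / count_; }
    const std::string& name() const { return name_; }
private:
    std::string name_;
    long count_ = 0;
    double sum_ = 0.0;
    double min_ = std::numeric_limits<double>::max();
    double max_ = std::numeric_limits<double>::lowest();
};
```

A typical use would be recording message sizes or array lengths at the point where they occur, which is information a timer alone cannot capture.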

C++ Template Instrumentation (Blitz++, PETE) High-level objects Array classes Templates (Blitz++) Optimizations Array processing Expressions (PETE) Relate performance data to high-level statement Complexity of template evaluation Array expressions

Standard Template Instrumentation Difficulties Instantiated templates result in mangled identifiers Standard profiling techniques / tools are deficient Integrated with proprietary compilers Specific systems platforms and programming models Uninterpretable routine names

TAU Instrumentation and Profiling Profile of expression types Performance data presented with respect to high-level array expression types Graphical pprof

TAU and SMARTS: Asynchronous Performance Scalable Multithreaded Asynchronous RTS User-level threads, light-weight virtual processors Macro-dataflow, asynchronous execution interleaving iterates from data-parallel statements Integrated with POOMA II (parallel dense array library) Measurement of asynchronous parallel execution Utilized the TAU mapping API Associate iterate performance with data parallel statement Evaluate different scheduling policies “SMARTS: Exploiting Temporal Locality and Parallelism through Vertical Execution” (ICS '99)
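
Associating iterate performance with the data-parallel statement that produced it amounts to carrying a key from the statement to each iterate. The following sketch shows that idea under simplifying assumptions; `MappingTable` and its methods are hypothetical and are not the TAU mapping macros.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch of performance mapping: low-level work items
// (iterates) carry a key that links them back to the high-level
// statement that created them.
class MappingTable {
public:
    // Associate a key with a statement label when the statement is issued.
    void link(long key, const std::string& statement) {
        statementOf_[key] = statement;
    }
    // Attribute the measured time of one iterate to its source statement.
    // Iterates with an unknown key fall into a generic bucket.
    void record(long key, double seconds) {
        assert(seconds >= 0.0);
        auto it = statementOf_.find(key);
        const std::string label = (it != statementOf_.end())
                                      ? it->second
                                      : std::string("unmapped iterate");
        seconds_[label] += seconds;
    }
    double total(const std::string& label) const {
        auto it = seconds_.find(label);
        return it == seconds_.end() ? 0.0 : it->second;
    }
private:
    std::map<long, std::string> statementOf_;
    std::map<std::string, double> seconds_;
};
```

Without the key, all iterates would be charged to the one routine that executes them, which is the "lumped together" effect in the without-mapping profiles.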

TAU Mapping of Asynchronous Execution Without mapping Two threads executing With mapping POOMA / SMARTS

With and without mapping (Thread 0) Thread 0 blocks waiting for iterates Iterates get lumped together With mapping Iterates distinguished

With and without mapping (Thread 1) Array initialization performance lumped Without mapping Performance associated with ExpressionKernel object With mapping Iterate performance mapped to array statement Array initialization performance correctly separated

TAU Profiling of SMARTS Scheduling Iteration scheduling for two array expressions

SMARTS Tracing (SOR) – Vampir Visualization SCVE scheduler used in Red/Black SOR running on 32 processors of SGI Origin 2000 Asynchronous, overlapped parallelism

TAU Distributed Monitoring Framework Extend usability of TAU performance analysis Access TAU performance data during execution Framework model each application context is a performance data server monitor agent thread is created within each context client processes attach to agents and request data server thread synchronization for data consistency pull mode of interaction Distributed TAU performance data space “A Runtime Monitoring Framework for the TAU Profiling System” (ISCOPE '99)
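
The pull mode of interaction, with synchronization for data consistency, can be sketched as an agent that hands out a locked snapshot of the profile on request. `MonitorAgent` is a hypothetical class for illustration; the remote transport between client and agent is omitted.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical monitor agent living inside one application context.
// The application updates the profile; a client pulls a consistent copy.
class MonitorAgent {
public:
    using Profile = std::map<std::string, double>;

    // Called by application threads as measurements accumulate.
    void update(const std::string& event, double seconds) {
        assert(seconds >= 0.0);
        std::lock_guard<std::mutex> lock(mu_);
        profile_[event] += seconds;
    }
    // Pull request from a client: the lock ensures the snapshot is not
    // taken in the middle of an update.
    Profile pull() const {
        std::lock_guard<std::mutex> lock(mu_);
        return profile_;
    }
private:
    mutable std::mutex mu_;
    Profile profile_;
};
```

Returning a copy means the client can inspect the data at its own pace while the application continues to update the live profile.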

TAU Distributed Monitor Architecture Each context has a monitor agent Client in separate thread directs agent Pull model of interaction Initial HPC++ implementation TAU profile database

Java Implementation of TAU Monitor Motivations More portable monitor middleware system (RMI) More flexible and programmable server interface (JNI) More robust client development (EJB, JDBC, Swing)

Trigger Support for Runtime Monitoring Execution event triggering Inform external clients of events during execution “Server” library Java trigger modules JNI link between application and trigger modules “Client” trigger library [Figure: application contexts linked through JNI to trigger modules, with clients attached over RMI]

Trigger API and TAU Monitor Application Trigger at points of desired monitor access Pull TAU profile data Unblock trigger and continue

PDT and Monitor Future Plans Complete F90 and Java IL Analyzer Source browsers: function, class, template Tools for aiding in data marshalling and translation TAU monitoring framework Application and system monitoring ACL Supermon and SGI Performance Co-Pilot scalable SMP clusters and distributed systems Performance monitoring clients