Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon.

Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon

General Problems How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges? How do we apply performance technology effectively for the variety and diversity of performance problems that arise in the context of complex parallel and distributed computer systems.

Computation Model for Performance Technology  How to address dual performance technology goals?  Robust capabilities + widely available methodologies  Contend with problems of system diversity  Flexible tool composition/configuration/integration  Approaches  Restrict computation types / performance problems  limited performance technology coverage  Base technology on abstract computation model  general architecture and software execution features  map features/methods to existing complex system types  develop capabilities that can adapt and be optimized

General Complex System Computation Model  Node: physically distinct shared memory machine  Message passing node interconnection network  Context: distinct virtual memory space within node  Thread: execution threads (user/system) in context memory Node VM space Context SMP Threads node memory … … Interconnection Network Inter-node message communication * * physical view model view

Definitions – Profiling  Profiling  Recording of summary information during execution  inclusive, exclusive time, # calls, hardware statistics, …  Reflects performance behavior of program entities  functions, loops, basic blocks  user-defined “semantic” entities  Very good for low-cost performance assessment  Helps to expose performance bottlenecks and hotspots  Implemented through  sampling: periodic OS interrupts or hardware counter traps  instrumentation: direct insertion of measurement code

Definitions – Tracing  Tracing  Recording of information about significant points (events) during program execution  entering/exiting code region (function, loop, block, …)  thread/process interactions (e.g., send/receive message)  Save information in event record  timestamp  CPU identifier, thread identifier  Event type and event-specific information  Event trace is a time-sequenced stream of event records  Can be used to reconstruct dynamic program behavior  Typically requires code instrumentation

Event Tracing: Instrumentation, Monitor, Trace 1master 2slave 3... void slave { trace(ENTER, 2);... recv(A, tag, buf); trace(RECV, A);... trace(EXIT, 2); } void master { trace(ENTER, 1);... trace(SEND, B); send(B, tag, buf);... trace(EXIT, 1); } MONITOR 58AENTER1 60BENTER2 62ASENDB 64AEXIT1 68BRECVA... 69BEXIT2... CPU A: CPU B: Event definition timestamp

Event Tracing: “Timeline” Visualization 1master 2slave 3... 58AENTER1 60BENTER2 62ASENDB 64AEXIT1 68BRECVA... 69BEXIT2... main master slave 58606264666870 B A

TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed high-performance computing  Targets a general complex system computation model  nodes / contexts / threads  Multi-level: system / software / parallelism  Measurement and analysis abstraction  Integrated toolkit for performance instrumentation, measurement, analysis, and visualization  Portable performance profiling/tracing facility  Open software approach

TAU Performance System Architecture

Levels of Code Transformation  As program information flows through stages of compilation/linking/execution, different information is accessible at different stages  Each level poses different constraints and opportunities for extracting information  At what level should performance instrumentation be done?

TAU Instrumentation  Flexible instrumentation mechanisms at multiple levels  Source code  manual  automatic using Program Database Toolkit (PDT), OPARI  Object code  pre-instrumented libraries (e.g., MPI using PMPI)  statically linked  dynamically linked (e.g., Virtual machine instrumentation)  fast breakpoints (compiler generated)  Executable code  dynamic instrumentation (pre-execution) using DynInstAPI

TAU Instrumentation (continued)  Targets common measurement interface (TAU API)  Object-based design and implementation  Macro-based, using constructor/destructor techniques  Program units: function, classes, templates, blocks  Uniquely identify functions and templates  name and type signature (name registration)  static object creates performance entry  dynamic object receives static object pointer  runtime type identification for template instantiations  C and Fortran instrumentation variants  Instrumentation and measurement optimization

Multi-Level Instrumentation  Uses multiple instrumentation interfaces  Shares information: cooperation between interfaces  Taps information at multiple levels  Provides selective instrumentation at each level  Targets a common performance model  Presents a unified view of execution

Program Database Toolkit (PDT)  Program code analysis framework for developing source- based tools  High-level interface to source code information  Integrated toolkit for source code parsing, database creation, and database query  commercial grade front end parsers  portable IL analyzer, database format, and access API  open software approach for tool development  Target and integrate multiple source languages  Use in TAU to build automated performance instrumentation tools

PDT Architecture and Tools C/C++ Fortran 77/90

PDT Components  Language front end  Edison Design Group (EDG): C, C++, Java  Mutek Solutions Ltd.: F77, F90  creates an intermediate-language (IL) tree  IL Analyzer  processes the intermediate language (IL) tree  creates “program database” (PDB) formatted file  DUCTAPE (Bernd Mohr, ZAM, Germany)  C++ program Database Utilities and Conversion Tools APplication Environment  processes and merges PDB files  C++ library to access the PDB for PDT applications

TAU Measurement  Performance information  High-resolution timer library (real-time / virtual clocks)  General software counter library (user-defined events)  Hardware performance counters  PCL (Performance Counter Library) (ZAM, Germany)  PAPI (Performance API) (UTK, Ptools Consortium)  consistent, portable API  Organization  Node, context, thread levels  Profile groups for collective events (runtime selective)  Performance data mapping between software levels

TAU Measurement (continued)  Parallel profiling  Function-level, block-level, statement-level  Supports user-defined events  TAU parallel profile database  Function callstack  Hardware counts values (in replace of time)  Tracing  All profile-level events  Inter-process communication events  Timestamp synchronization  User-configurable measurement library (user controlled)

TAU Measurement System Configuration  configure [OPTIONS]  {-c++=, -cc= } Specify C++ and C compilers  {-pthread, -sproc}Use pthread or SGI sproc threads  -openmpUse OpenMP threads  -jdk= Specify location of Java Dev. Kit  -opari= Specify location of Opari OpenMP tool  {-pcl, -papi}= Specify location of PCL or PAPI  -pdt= Specify location of PDT  -dyninst= Specify location of DynInst Package  {-mpiinc=, mpilib= }Specify MPI library instrumentation  - TRACE Generate TAU event traces  -PROFILE Generate TAU profiles  -CPUTIMEUse usertime+system time  -PAPIWALLCLOCKUse PAPI to access wallclock time  -PAPIVIRTUALUse PAPI for virtual (user) time …

TAU Measurement Configuration – Examples ./configure -c++=xlC -cc=xlc –pdt=/usr/packages/pdtoolkit-2.1 -pthread  Use TAU with IBM’s xlC compiler, PDT and the pthread library  Enable TAU profiling (default) ./configure -TRACE –PROFILE  Enable both TAU profiling and tracing ./configure -c++=guidec++ -cc=guidec -papi=/usr/local/packages/papi –openmp -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib  Use OpenMP+MPI using KAI's Guide compiler suite and use PAPI for accessing hardware performance counters for measurements  Typically configure multiple measurement libraries

TAU Measurement API  Initialization and runtime configuration  TAU_PROFILE_INIT(argc, argv); TAU_PROFILE_SET_NODE(myNode); TAU_PROFILE_SET_CONTEXT(myContext); TAU_PROFILE_EXIT(message); TAU_REGISTER_THREAD();  Function and class methods  TAU_PROFILE(name, type, group);  Template  TAU_TYPE_STRING(variable, type); TAU_PROFILE(name, type, group); CT(variable);  User-defined timing  TAU_PROFILE_TIMER(timer, name, type, group); TAU_PROFILE_START(timer); TAU_PROFILE_STOP(timer);

Compiling: TAU Makefiles  Include TAU Makefile in the user’s Makefile.  Variables:  TAU_CXXSpecify the C++ compiler  TAU_CCSpecify the C compiler used by TAU  TAU_DEFSDefines used by TAU. Add to CFLAGS  TAU_LDFLAGSLinker options. Add to LDFLAGS  TAU_INCLUDEHeader files include path. Add to CFLAGS  TAU_LIBSStatically linked TAU library. Add to LIBS  TAU_SHLIBSDynamically linked TAU library  TAU_MPI_LIBSTAU’s MPI wrapper library for C/C++  TAU_MPI_FLIBSTAU’s MPI wrapper library for F90  TAU_FORTRANLIBSMust be linked in with C++ linker for F90.  Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs.

Including TAU Makefile - Example include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc CXX = $(TAU_CXX) CC = $(TAU_CC) CFLAGS = $(TAU_DEFS) LIBS = $(TAU_LIBS) OBJS =... TARGET= a.out TARGET: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS).cpp.o: $(CC) $(CFLAGS) -c $< -o $@

TAU Makefile for PDT include /usr/tau/include/Makefile CXX = $(TAU_CXX) CC = $(TAU_CC) PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor CFLAGS = $(TAU_DEFS) LIBS = $(TAU_LIBS) OBJS =... TARGET= a.out TARGET: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS).cpp.o: $(PDTPARSE) $< $(TAUINSTR) $*.pdb $< -o $*.inst.cpp $(CC) $(CFLAGS) -c $*.inst.cpp -o $@

Setup: Running Applications % setenv PROFILEDIR /home/data/experiments/profile/01 % setenv TRACEDIR/home/data/experiments/trace/01 % set path=($path / /bin) % setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\: / /lib For PAPI/PCL: % setenv PAPI_EVENT PAPI_FP_INS % setenv PCL_EVENT PCL_FP_INSTR For Java (without instrumentation): % java application With instrumentation: % java -XrunTAU application % java -XrunTAU:exclude=sun/io,java application For DyninstAPI: % a.out % tau_run a.out % tau_run -XrunTAUsh-papi a.out

TAU Analysis  Profile analysis  pprof  parallel profiler with text-based display  racy  graphical interface to pprof (Tcl/Tk)  jracy  Java implementation of Racy  Trace analysis and visualization  Trace merging and clock adjustment (if necessary)  Trace format conversion (ALOG, SDDF, Vampir)  Vampir (Pallas) trace visualization

Pprof Command  pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]  -cSort according to number of calls  -bSort according to number of subroutines called  -mSort according to msecs (exclusive time total)  -tSort according to total msecs (inclusive time total)  -eSort according to exclusive time per call  -iSort according to inclusive time per call  -vSort according to standard deviation (exclusive usec)  -rReverse sorting order  -sPrint only summary profile information  -n numPrint only first number of functions  -f fileSpecify full path and filename without node ids  -l List all functions and exit

Pprof Output (NAS Parallel Benchmark – LU)  Intel Quad PIII Xeon, RedHat, PGI F90  F90 + MPICH  Profile for: Node Context Thread  Application events and MPI events

jRacy (NAS Parallel Benchmark – LU) n: node c: context t: thread Global profiles Individual profile Routine profile across all nodes

Vampir Trace Visualization Tool  Visualization and Analysis of MPI Programs  Originally developed by Forschungszentrum Jülich  Current development by Technical University Dresden  Distributed by PALLAS, Germany  http://www.pallas.de/pages/vampir.htm

Vampir (NAS Parallel Benchmark – LU) Timeline display Callgraph display Communications display Parallelism display

Case Study: Hybrid Computation (OpenMPI + MPI)  Portable hybrid parallel programming  OpenMP for shared memory parallel programming  Fork-join model  Loop level parallelism  MPI for cross-box message-based parallelism  OpenMP performance measurement  Interface to OpenMP runtime system (RTS events)  Compiler support and integration  2D Stommel model of ocean circulation  Jacobi iteration, 5-point stencil  Timothy Kaiser (San Diego Supercomputing Center)

OpenMP Instrumentation  OPARI [FZJ, Germany]  OPARI  OpenMP Pragma And Region Instrumentor (OPARI)  Source-to-Source translator to insert POMP calls around OpenMP constructs and API functions  POMP  OpenMP Directive Instrumentation  OpenMP Runtime Library Routine Instrumentation  Performance Monitoring Library Control  User Code Instrumentation  Context Descriptors  Conditional Compilation  Conditional / Selective Transformations

Example: !$OMP PARALLEL DO Instrumentation !$OMP PARALLEL DO clauses... do loop !$OMP END PARALLEL DO !$OMP PARALLEL other-clauses... !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses do loop !$OMP END DO !$OMP END PARALLEL DO NOWAIT !$OMP BARRIER call pomp_parallel_fork(d) call pomp_parallel_begin(d) call pomp_parallel_end(d) call pomp_parallel_join(d) call pomp_do_enter(d) call pomp_do_exit(d) call pomp_barrier_enter(d) call pomp_barrier_exit(d)

Tracing Hybrid Executions – TAU and Vampir

Profiling Hybrid Executions

Case Study: Utah ASCI/ASAP Level 1 Center  C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions  Fundamental chemistry and engineering physics models  Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification  Very large-scale simulations  Computer science problems:  Coupling of multiple simulation codes  Software engineering across diverse expert teams  Achieving high performance on large-scale systems

Example C-SAFE Simulation Problems ∑ Heptane fire simulation Material stress simulation Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics

Uintah High-Level Component View

High Level Architecture C-SAFE Implicitly Connected to All Components UCF Data Control / Light Data Checkpointing Mixing Model Mixing Model Fluid Model Fluid Model Subgrid Model Subgrid Model Chemistry Database Controller Chemistry Database Controller Chemistry Databases Chemistry Databases High Energy Simulations High Energy Simulations Numerical Solvers Numerical Solvers Non-PSE Components Performance Analysis Performance Analysis Simulation Controller Simulation Controller Problem Specification Numerical Solvers Numerical Solvers MPM Material Properties Database Material Properties Database Blazer Database Visualization Data Manager Data Manager Post Processing And Analysis Post Processing And Analysis Parallel Services Parallel Services Resource Management Resource Management PSE Components Scheduler Uintah Parallel Component Architecture

Uintah Computational Framework  Execution model based on software (macro) dataflow  Exposes parallelism and hides data transport latency  Computations expressed a directed acyclic graphs of tasks  consumes input and produces output (input to future task)  input/outputs specified for each patch in a structured grid  Abstraction of global single-assignment memory  DataWarehouse  Directory mapping names to values (array structured)  Write value once then communicate to awaiting tasks  Task graph gets mapped to processing resources  Communications schedule approximates global optimal

Uintah Task Graph (Material Point Method)  Diagram of named tasks (ovals) and data (edges)  Imminent computation  Dataflow-constrained  MPM  Newtonian material point motion time step  Solid: values defined at material point (particle)  Dashed: values defined at vertex (grid)  Prime (‘): values updated during time step

Uintah PSE  UCF automatically sets up:  Domain decomposition  Inter-processor communication with aggregation/reduction  Parallel I/O  Checkpoint and restart  Performance measurement and analysis (stay tuned)  Software engineering  Coding standards  CVS (Commits: Y3 - 26.6 files/day, Y4 - 29.9 files/day)  Correctness regression testing with bugzilla bug tracking  Nightly build (parallel compiles)  170,000 lines of code (Fortran and C++ tasks supported)

Performance Technology Integration  Uintah present challenges to performance integration  Software diversity and structure  UCF middleware, simulation code modules  component-based hierarchy  Portability objectives  cross-language and cross-platform  multi-parallelism: thread, message passing, mixed  Scalability objectives  High-level programming and execution abstractions  Requires flexible and robust performance technology  Requires support for performance mapping

Performance Analysis Objectives for Uintah  Micro tuning  Optimization of simulation code (task) kernels for maximum serial performance  Scalability tuning  Identification of parallel execution bottlenecks  overheads: scheduler, data warehouse, communication  load imbalance  Adjustment of task graph decomposition and scheduling  Performance tracking  Understand performance impacts of code modifications  Throughout course of software development  C-SAFE application and UCF software

Uintah Performance Engineering Approach  Contemporary performance methodology focuses on control flow (function) level measurement and analysis  C-SAFE application involves coupled-models with task- based parallelism and dataflow control constraints  Performance engineering on algorithmic (task) basis  Observe performance based on algorithm (task) semantics  Analyze task performance characteristics in relation to other simulation tasks and UCF components  scientific component developers can concentrate on performance improvement at algorithmic level  UCF developers can concentrate on bottlenecks not directly associated with simulation module code

Task execution time dominates (what task?) MPI communication overheads (where?) Task Execution in Uintah Parallel Scheduler  Profile methods and functions in scheduler and in MPI library Task execution time distribution  Need to map performance data!

Semantics-Based Performance Mapping  Associate performance measurements with high-level semantic abstractions  Need mapping support in the performance measurement system to assign data correctly

Hypothetical Mapping Example  Particles distributed on surfaces of a cube Particle* P[MAX]; /* Array of particles */ int GenerateParticles() { /* distribute particles over all faces of the cube */ for (int face=0, last=0; face < 6; face++){ /* particles on this face */ int particles_on_this_face = num(face); for (int i=last; i < particles_on_this_face; i++) { /* particle properties are a function of face */ P[i] =... f(face);... } last+= particles_on_this_face; }

Hypothetical Mapping Example (continued)  How much time is spent processing face i particles?  What is the distribution of performance among faces?  How is this determined if execution is parallel? int ProcessParticle(Particle *p) { /* perform some computation on p */ } int main() { GenerateParticles(); /* create a list of particles */ for (int i = 0; i < N; i++) /* iterates over the list */ ProcessParticle(P[i]); }

Semantic Entities/Attributes/Associations (SEAA)  New dynamic mapping scheme  Entities defined at any level of abstraction  Attribute entity with semantic information  Entity-to-entity associations  Two association types (implemented in TAU API)  Embedded – extends data structure of associated object to store performance measurement entity  External – creates an external look-up table using address of object as the key to locate performance measurement entity

No Performance Mapping versus Mapping  Typical performance tools report performance with respect to routines  Does not provide support for mapping  Performance tools with SEAA mapping can observe performance with respect to scientist’s programming and problem abstractions TAU (no mapping) TAU (w/ mapping)

Uintah Task Performance Mapping  Uintah partitions individual particles across processing elements (processes or threads)  Simulation tasks in task graph work on particles  Tasks have domain-specific character in the computation  “interpolate particles to grid” in Material Point Method  Task instances generated for each partitioned particle set  Execution scheduled with respect to task dependencies  How to attributed execution time among different tasks  Assign semantic name (task type) to a task instance  SerialMPM::interpolateParticleToGrid  Map TAU timer object to (abstract) task (semantic entity)  Look up timer object using task type (semantic attribute)  Further partition along different domain-specific axes

Using External Associations  Two level mappings:  Level 1:  Level 2:  Embedded association vs External association Data (object) Performance Data... Hash Table

Task Performance Mapping Instrumentation void MPIScheduler::execute(const ProcessorGroup * pc, DataWarehouseP & old_dw, DataWarehouseP & dw ) {... TAU_MAPPING_CREATE( task->getName(), "[MPIScheduler::execute()]", (TauGroup_t)(void*)task->getName(), task->getName(), 0);... TAU_MAPPING_OBJECT(tautimer) TAU_MAPPING_LINK(tautimer,(TauGroup_t)(void*)task->getName()); // EXTERNAL ASSOCIATION... TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0) TAU_MAPPING_PROFILE_START(doitprofiler,0); task->doit(pc); TAU_MAPPING_PROFILE_STOP(0);... }

Task Performance Mapping (Profile) Performance mapping for different tasks Mapped task performance across processes

Task Performance Mapping (Trace) Work packet computation events colored by task type Distinct phases of computation can be identifed based on task

Task Performance Mapping (Trace - Zoom) Startup communication imbalance

Task Performance Mapping (Trace - Parallelism) Communication / load imbalance

Comparing Uintah Traces for Scalability Analysis 8 processes 32 processes

Scaling Performance Optimizations Last year: initial “correct” scheduler Reduce communication by 10 x Reduce task graph overhead by 20 x ASCI Nirvana SGI Origin 2000 Los Alamos National Laboratory

Scalability to 2000 Processors (Fall 2001) ASCI Nirvana SGI Origin 2000 Los Alamos National Laboratory

TAU Performance System Status  Computing platforms  IBM SP, SGI Origin, Intel Teraflop, Cray T3E, Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64 (Linux), …  Programming languages  C, C++, Fortran 77/90, HPF, Java  Communication libraries  MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava  Thread libraries  pthread, Java,Windows, SGI sproc, Tulip, SMARTS, OpenMP  Compilers  KAI, PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq

PDT Status  Program Database Toolkit (Version 2.1, web download)  EDG C++ front end (Version 2.45.2)  Mutek Fortran 90 front end (Version 2.4.1)  C++ and Fortran 90 IL Analyzer  DUCTAPE library  Standard C++ system header files (KCC Version 4.0f)  PDT-constructed tools  TAU instrumentor (C/C++/F90)  Program analysis support for SILOON and CHASM  Platforms  SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64), Apple, Windows, Cray T3E

Evolution of the TAU Performance System  Customization of TAU for specific needs  TAU’s existing strength lies in its robust support for performance instrumentation and measurement  TAU will evolve to support new performance capabilities  Online performance data access via application-level API  Dynamic performance measurement control  Generalize performance mapping  Runtime performance analysis and visualization

Information  TAU (http://www.acl.lanl.gov/tau)  PDT (http://www.acl.lanl.gov/pdtoolkit)

Support Acknowledgement  TAU and PDT support:  Department of Energy (DOE)  DOE 2000 ACTS contract  DOE MICS contract  DOE ASCI Level 3 (LANL, LLNL)  U. of Utah DOE ASCI Level 1 subcontract  DARPA  NSF National Young Investigator (NYI) award

Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon.

Similar presentations

Presentation on theme: "Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon.

Similar presentations

Presentation on theme: "Performance Technology for Complex Parallel Systems Sameer Shende University of Oregon."— Presentation transcript:

Similar presentations

About project

Feedback