Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department.

1 The TAU Parallel Performance System
Allen D. Malony
malony@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau
Performance Research Laboratory (PRL)
Neuroinformatics Center (NIC)
Department of Computer and Information Science
University of Oregon

2 OSDL 2006: Outline
- Research interests and motivation
- TAU performance system
  - Instrumentation
  - Measurement
  - Analysis tools
    - Parallel profile analysis (ParaProf)
    - Performance data management (PerfDMF)
    - Performance data mining (PerfExplorer)
- TAU status
- Open Trace Format (OTF)
- ZeptoOS and KTAU

3 The TAU Parallel Performance System, OSDL 2006: Research Motivation
- Tools for performance problem solving
  - Empirical-based performance optimization process
  - Performance technology concerns
[Diagram: optimization cycle of Performance Observation, Performance Experimentation, Performance Diagnosis, and Performance Tuning, with the labels characterization, properties, and hypotheses; performance technology supports it through instrumentation, measurement, analysis, and visualization, and through experiment management and performance data storage]

4 TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- Partners: LLNL, ANL, Research Center Jülich, LANL

5 TAU Parallel Performance System Goals
- Portable (open source) parallel performance system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Multi-level, multi-language performance instrumentation
- Flexible and configurable performance measurement
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component
- Support for performance mapping
- Integration of leading performance technology
- Scalable (very large) parallel performance analysis

6 General Complex System Computation Model
- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Figure: physical view (SMP nodes with memory, joined by an interconnection network) next to the model view (node, context as VM space, threads, inter-node message communication)]

7 TAU Performance System Architecture

8 TAU Performance System Architecture

9 TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/End events (“user-defined timers”)
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of “semantic” entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation optimization
  - Eliminate instrumentation in lightweight routines
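An atomic event records a value each time it is triggered (for example, the size of each memory allocation) and reports summary statistics rather than a duration. The following is a minimal sketch of that idea in plain C. The `atomic_event` type and its functions are hypothetical names chosen for illustration and are not the TAU API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one atomic event; not TAU's internal layout. */
typedef struct {
    size_t count;   /* number of times the event was triggered */
    double min;     /* smallest value seen so far */
    double max;     /* largest value seen so far */
    double sum;     /* running sum, used to derive the mean */
} atomic_event;

void atomic_event_init(atomic_event *e)
{
    assert(e != NULL);
    e->count = 0;
    e->min = e->max = e->sum = 0.0;
}

/* Record one sample, e.g. the byte count passed to an allocation call. */
void atomic_event_trigger(atomic_event *e, double value)
{
    assert(e != NULL);
    if (e->count == 0 || value < e->min) e->min = value;
    if (e->count == 0 || value > e->max) e->max = value;
    e->sum += value;
    e->count++;
}

/* Mean of all recorded samples, or 0 if nothing was recorded. */
double atomic_event_mean(const atomic_event *e)
{
    assert(e != NULL);
    return e->count ? e->sum / (double)e->count : 0.0;
}
```

A caller would trigger the event at each allocation site and read the count, minimum, maximum, and mean when the profile is written.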

10 TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically-linked and dynamically-linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DynInstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate instrumentation process

11 Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within/between levels
- Mapping
  - Associate performance data with high-level semantic abstractions
[Figure: instrumentation applied at each level, from user-level abstractions (problem domain) through source code, preprocessor, compiler, object code and libraries, linker, executable, OS, runtime image, and VM; a run produces performance data]

12 Program Database Toolkit (PDT)
[Figure: PDT workflow. Application / library source passes through the C / C++ parser or the Fortran parser (F77/90/95) to IL, then through the C / C++ or Fortran IL analyzer to program database files. DUCTAPE gives tools access to those files: PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), and tau_instrumentor (automatic source instrumentation)]

13 Program Database Toolkit (PDT)
- Program code analysis framework
  - Develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - Commercial grade front-end parsers
  - Portable IL analyzer, database format, and access API
  - Open software approach for tool development
- Multiple source languages
- Implement automatic performance instrumentation tools
  - tau_instrumentor

14 TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to Open Trace Format (OTF)
  - Trace streams and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Performance measurement for CCA component software

15 TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, phase profiling
  - Support for memory profiling
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
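Profiling with begin/end timers usually reports two times per event: inclusive time (the event plus everything it calls) and exclusive time (the event minus its instrumented children). The sketch below shows one common way to compute both with a timer stack. All names (`prof_timer`, `timer_stack`, `timer_start`, `timer_stop`) are hypothetical and the timestamps are passed in by the caller, so this is an illustration of the bookkeeping and not TAU's implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 64

/* Hypothetical per-event profile record. */
typedef struct {
    const char   *name;
    unsigned long calls;
    double        inclusive;  /* time in the event and its children */
    double        exclusive;  /* time in the event alone */
} prof_timer;

/* One active invocation on the timer stack. */
typedef struct {
    prof_timer *t;
    double      start;        /* timestamp at timer_start */
    double      child_time;   /* inclusive time of direct children */
} frame;

typedef struct {
    frame frames[MAX_DEPTH];
    int   depth;
} timer_stack;

void timer_start(timer_stack *s, prof_timer *t, double now)
{
    assert(s != NULL && t != NULL && s->depth < MAX_DEPTH);
    frame *f = &s->frames[s->depth++];
    f->t = t;
    f->start = now;
    f->child_time = 0.0;
}

void timer_stop(timer_stack *s, double now)
{
    assert(s != NULL && s->depth > 0);
    frame *f = &s->frames[--s->depth];
    double elapsed = now - f->start;

    f->t->calls++;
    f->t->inclusive += elapsed;
    f->t->exclusive += elapsed - f->child_time;

    /* Charge this invocation to the parent so the parent can subtract it. */
    if (s->depth > 0)
        s->frames[s->depth - 1].child_time += elapsed;
}
```

Note that this simple version double-counts inclusive time for recursive calls; a production profiler has to handle that case separately.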

16 Types of Parallel Performance Profiling
- Flat profiles
  - Metric (e.g., time) spent in an event (callgraph nodes)
  - Exclusive/inclusive, # of calls, child calls
- Callpath profiles (Calldepth profiles)
  - Time spent along a calling path (edges in callgraph)
  - “main => f1 => f2 => MPI_Send” (event name)
  - TAU_CALLPATH_LENGTH environment variable
- Phase profiles
  - Flat profiles under a phase (nested phases are allowed)
  - Default “main” phase
  - Supports static or dynamic (per-iteration) phases
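A callpath event name such as “main => f1 => f2 => MPI_Send” can be built from the active call stack and limited to a fixed number of entries. The sketch below assumes that the length limit keeps the innermost entries of the path; that behavior and the `callpath_name` function are assumptions made for illustration, not a description of how TAU applies TAU_CALLPATH_LENGTH.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the callpath name for the innermost `depth` entries of `stack`
 * (stack[0] is the outermost caller) into `out`.
 * Returns 0 on success, -1 if `out` is too small.
 */
int callpath_name(const char *const *stack, size_t n, size_t depth,
                  char *out, size_t out_size)
{
    assert(stack != NULL && out != NULL && out_size > 0);

    size_t first = (depth < n) ? n - depth : 0;  /* first entry to keep */
    size_t used = 0;
    out[0] = '\0';

    for (size_t i = first; i < n; i++) {
        const char *sep = (i > first) ? " => " : "";
        size_t need = strlen(sep) + strlen(stack[i]);
        if (used + need + 1 > out_size)
            return -1;                           /* buffer too small */
        snprintf(out + used, out_size - used, "%s%s", sep, stack[i]);
        used += need;
    }
    return 0;
}
```

With a depth of 2 the stack main, f1, f2, MPI_Send yields “f2 => MPI_Send”, so time in MPI_Send is split by its immediate caller only.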

17 Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)

18 ParaProf Parallel Performance Profile Analysis
[Figure: ParaProf data sources: raw files (HPMToolkit, MpiP, TAU) and PerfDMF-managed database; metadata organized as Application / Experiment / Trial]

19 Example Applications
- sPPM: ASCI benchmark, Fortran, C, MPI, OpenMP or pthreads
- Miranda: research hydrodynamics code, Fortran, MPI
- GYRO: tokamak turbulence simulation, Fortran, MPI
- FLASH: physics simulation, Fortran, MPI
- WRF: weather research and forecasting, Fortran, MPI
- S3D: 3D combustion, Fortran, MPI

20 ParaProf – Flat Profile (Miranda, BG/L)
- Miranda: hydrodynamics, Fortran + MPI, LLNL
- 8K processors shown (node, context, thread); run to 64K

21 ParaProf – Stacked View (Miranda)

22 ParaProf – Callpath Profile (Flash)
- Flash: thermonuclear flashes, Fortran + MPI, Argonne

23 ParaProf – Histogram View (Miranda)
[Figure: histograms for 8k processors and 16k processors]

24 NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to solver direction?
- Application routine names reflect phase semantics

25 NAS BT – Phase Profile (Main and X, Y, Z)
- Main phase shows nested phases and immediate events

26 ParaProf – 3D Full Profile (Miranda), 16k processors

27 ParaProf – 3D Full Profile (Flash), 128 processors

28 ParaProf Bar Plot (Zoom in/out +/-)

29 ParaProf – 3D Scatterplot (Miranda)
- Each point is a “thread” of execution
- A total of four metrics shown in relation
- ParaVis 3D profile visualization library
  - JOGL

30 ParaProf – Callgraph Zoom (Flash)
- Zoom in (+), zoom out (-)

31 Performance Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
  - MPI calls with HW counter information (not shown)
  - Detailed code behavior to focus optimization efforts

32 S3D on Lemieux (TAU-to-VTF3, Vampir)
- S3D: 3D combustion, Fortran + MPI, PSC

33 S3D on Lemieux (Zoomed)

34 Runtime MPI Shared Library Instrumentation
- We can now interpose the MPI wrapper library for applications that have already been compiled (no recompilation or re-linking necessary!)
  - Uses LD_PRELOAD for Linux
  - Soon on AIX using MPI_EUILIB/MPI_EUILIBPATH
- Simply compile TAU with MPI support and prefix your MPI program with tau_load.sh
    % mpirun -np 4 tau_load.sh a.out
- Requires shared library MPI
- Approach will work with other shared libraries

35 Workload Characterization
- Idea: partition performance data for individual functions based on runtime parameters
- Enable by configuring with -PROFILEPARAM
- TAU_PROFILE_PARAM1L (value, “name”)
- Simple example:

    void foo(int input) {
      TAU_PROFILE("foo", "", TAU_DEFAULT);
      TAU_PROFILE_PARAM1L(input, "input");
      ...
    }

36 Workload Characterization (continued)
- 5 seconds spent in function “foo” becomes
  - 2 seconds for “foo [ = ]”
  - 1 second for “foo [ = ]”
  - …
- Currently used in MPI wrapper library
  - Allows for partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)
  - Can be extrapolated to infer specifics about the MPI subsystem and system as a whole
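The partitioning idea can be sketched as a small table that keeps a separate time bucket for each distinct parameter value seen by a function. The code below is an illustration in plain C under that assumption. `param_profile` and `param_profile_add` are hypothetical names and do not reflect how TAU_PROFILE_PARAM1L stores its data.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUCKETS 32

/* Time and call count attributed to one parameter value. */
typedef struct {
    long          value;
    double        seconds;
    unsigned long calls;
} param_bucket;

/* Hypothetical per-function profile, partitioned by one parameter. */
typedef struct {
    const char  *func;    /* e.g. "foo" */
    const char  *param;   /* e.g. "input" */
    param_bucket buckets[MAX_BUCKETS];
    size_t       n;       /* number of buckets in use */
} param_profile;

/* Attribute `seconds` of one call to the bucket for `value`. */
void param_profile_add(param_profile *p, long value, double seconds)
{
    assert(p != NULL);
    for (size_t i = 0; i < p->n; i++) {
        if (p->buckets[i].value == value) {
            p->buckets[i].seconds += seconds;
            p->buckets[i].calls++;
            return;
        }
    }
    assert(p->n < MAX_BUCKETS);       /* sketch: fixed-size table */
    p->buckets[p->n].value = value;
    p->buckets[p->n].seconds = seconds;
    p->buckets[p->n].calls = 1;
    p->n++;
}

/* Total time across all buckets; equals the unpartitioned time. */
double param_profile_total(const param_profile *p)
{
    assert(p != NULL);
    double total = 0.0;
    for (size_t i = 0; i < p->n; i++)
        total += p->buckets[i].seconds;
    return total;
}
```

For an MPI wrapper, the value could be the message size in bytes, so that each message size gets its own share of the time spent in a routine such as MPI_Send.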

37 Characterization Based on Message Size
- Simple example: send/receive messages of doubling size (up to 32MB)

    #include <mpi.h>

    /* Static storage: a 64 MB array is too large for a typical stack. */
    static int buffer[16*1024*1024];

    int main(int argc, char **argv) {
      int rank, size, i, j;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < 1000; i++)
        for (j = 1; j < 16*1024*1024; j *= 2) {
          if (rank == 0) {
            /* Rank 0 sends j ints to rank 1 with tag 42. */
            MPI_Send(buffer, j, MPI_INT, 1, 42, MPI_COMM_WORLD);
          } else {
            /* Rank 1 receives them (run with exactly two processes). */
            MPI_Status status;
            MPI_Recv(buffer, j, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
          }
        }
      MPI_Finalize();
      return 0;
    }

38 Characterization Results
- Two different message sizes (~3.3MB and ~4K)

39 Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?

40 Performance Problem Solving Goals
- Answer questions at multiple levels of interest
  - Data from low-level measurements and simulations
    - use to predict application performance
  - High-level performance data spanning dimensions
    - machine, applications, code revisions, data sets
    - examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

41 Performance Data Management (PerfDMF)
K. Huck, A. Malony, R. Bell, A. Morris, “Design and Implementation of a Parallel Performance Data Management Framework,” ICPP 2005. (awarded best paper)

42 Performance Data Mining (Objectives)
- Conduct parallel performance analysis in a systematic, collaborative and reusable manner
  - Manage performance complexity
  - Discover performance relationships and properties
  - Automate process
- Multi-experiment performance analysis
- Large-scale performance data reduction
  - Summarize characteristics of large processor runs
- Implement extensible analysis framework
  - Abstraction / automation of data mining operations
  - Interface to existing analysis and data mining tools

43 Performance Data Mining (PerfExplorer)
- Performance knowledge discovery framework
  - Data mining analysis applied to parallel performance data
    - comparative, clustering, correlation, dimension reduction, …
  - Use the existing TAU infrastructure
    - TAU performance profiles, PerfDMF
  - Client-server based system architecture
- Technology integration
  - Java API and toolkit for portability
  - PerfDMF
  - R-project/Omegahat, Octave/Matlab statistical analysis
  - WEKA data mining package
  - JFreeChart for visualization, vector output (EPS, SVG)

44 Performance Data Mining (PerfExplorer)
K. Huck and A. Malony, “PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing,” SC 2005.

45 PerfExplorer Analysis Methods
- Data summaries, distributions, scatterplots
- Clustering
  - k-means
  - Hierarchical
- Correlation analysis
- Dimension reduction
  - PCA
  - Random linear projection
  - Thresholds
- Comparative analysis
- Data management views

46 Cluster Analysis
- Performance data represented as vectors: each dimension is the cumulative time for an event
- k-means: k random centers are selected and instances are grouped with the "closest" (Euclidean) center
- New centers are calculated and the process repeated until stabilization or max iterations
- Dimension reduction necessary for meaningful results
- Virtual topology, summaries constructed
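The k-means procedure described on this slide can be sketched in a few dozen lines of C. In this illustration each row of `points` would hold one thread's per-event cumulative times. To keep the example deterministic, the caller supplies the initial centers instead of drawing them at random, which is a simplification of the slide's description. The function name and signature are hypothetical; PerfExplorer itself delegates clustering to external packages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Squared Euclidean distance; the square root is not needed to rank centers. */
static double sq_dist(const double *a, const double *b, size_t dim)
{
    double d = 0.0;
    for (size_t j = 0; j < dim; j++)
        d += (a[j] - b[j]) * (a[j] - b[j]);
    return d;
}

/*
 * points:  n rows of dim values (row-major).
 * centers: k rows of dim values; initial centers in, final centers out.
 * assign:  receives the cluster index of each point.
 * Returns the number of center updates performed.
 */
int kmeans(const double *points, size_t n, size_t dim,
           double *centers, size_t k, int *assign, int max_iter)
{
    assert(points && centers && assign);
    assert(n > 0 && dim > 0 && k > 0 && max_iter > 0);

    double *sums = malloc(k * dim * sizeof *sums);
    size_t *counts = malloc(k * sizeof *counts);
    assert(sums != NULL && counts != NULL);

    for (size_t i = 0; i < n; i++)
        assign[i] = -1;

    int iter;
    for (iter = 0; iter < max_iter; iter++) {
        int changed = 0;

        /* Step 1: group each point with its closest center. */
        for (size_t i = 0; i < n; i++) {
            int best = 0;
            double best_d = sq_dist(points + i * dim, centers, dim);
            for (size_t c = 1; c < k; c++) {
                double d = sq_dist(points + i * dim, centers + c * dim, dim);
                if (d < best_d) { best_d = d; best = (int)c; }
            }
            if (assign[i] != best) { assign[i] = best; changed = 1; }
        }
        if (!changed)
            break;                      /* assignments stabilized */

        /* Step 2: recompute each center as the mean of its members. */
        for (size_t c = 0; c < k; c++) counts[c] = 0;
        for (size_t x = 0; x < k * dim; x++) sums[x] = 0.0;
        for (size_t i = 0; i < n; i++) {
            size_t c = (size_t)assign[i];
            counts[c]++;
            for (size_t j = 0; j < dim; j++)
                sums[c * dim + j] += points[i * dim + j];
        }
        for (size_t c = 0; c < k; c++)
            if (counts[c] > 0)          /* leave empty clusters unchanged */
                for (size_t j = 0; j < dim; j++)
                    centers[c * dim + j] = sums[c * dim + j] / (double)counts[c];
    }

    free(sums);
    free(counts);
    return iter;
}
```

Because every event is one dimension, distances become less informative as the number of events grows, which is one reason the slide calls for dimension reduction before clustering.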

47 sPPM Cluster Analysis

48 Flash Clustering on 16K BG/L Processors
- Four significant events automatically selected
- Clusters and correlations are visible

49 Correlation Analysis
- Describes strength and direction of a linear relationship between two variables (events) in the data
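The standard measure of the strength and direction of a linear relationship is the Pearson correlation coefficient, which ranges from -1 to +1. The slide does not name the statistic, so treating it as Pearson's r is an assumption. The sketch below computes r for two events' per-thread times; `pearson_r` is a hypothetical name, and a small Newton iteration stands in for `sqrt` from `<math.h>` only so that the example builds without linking the math library.

```c
#include <assert.h>
#include <stddef.h>

/* Newton iteration for sqrt(x); stands in for sqrt() from <math.h>. */
static double sqrt_newton(double x)
{
    if (x <= 0.0)
        return 0.0;
    double g = (x > 1.0) ? x : 1.0;     /* start at or above the root */
    for (int i = 0; i < 100; i++)
        g = 0.5 * (g + x / g);
    return g;
}

/*
 * Pearson correlation of x[0..n-1] and y[0..n-1].
 * Returns 0 when either variable is constant (correlation undefined).
 */
double pearson_r(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL && n > 1);

    double mx = 0.0, my = 0.0;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= (double)n;
    my /= (double)n;

    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (size_t i = 0; i < n; i++) {
        double dx = x[i] - mx, dy = y[i] - my;
        sxy += dx * dy;                 /* co-variation */
        sxx += dx * dx;                 /* variation of x */
        syy += dy * dy;                 /* variation of y */
    }
    if (sxx == 0.0 || syy == 0.0)
        return 0.0;
    return sxy / sqrt_newton(sxx * syy);
}
```

A value near +1 means two events grow together across threads, a value near -1 means one grows as the other shrinks (for example, computation versus waiting), and a value near 0 means no linear relationship.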

50 Comparative Analysis
- Relative speedup, efficiency
  - total runtime, by event, one event, by phase
- Breakdown of total runtime
- Group fraction of total runtime
- Correlating events to total runtime
- Timesteps per second
- Performance Evaluation Research Center (PERC)
  - PERC tools study (led by ORNL, Pat Worley)
  - In-depth performance analysis of select applications
  - Evaluate performance analysis requirements
  - Test tool functionality and ease of use
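Relative speedup and relative efficiency compare a run against a base case that need not be a single processor. The sketch below uses the usual definitions: speedup is base time divided by run time, and efficiency is that speedup divided by the increase in processor count. These formulas are assumed here, since the slide names the metrics without defining them, and the function names are hypothetical.

```c
#include <assert.h>

/* Speedup of a run relative to the base case: T_base / T. */
double relative_speedup(double t_base, double t)
{
    assert(t_base > 0.0 && t > 0.0);
    return t_base / t;
}

/*
 * Efficiency relative to the base case:
 *   (T_base * P_base) / (T * P)
 * A value of 1.0 means perfect scaling from the base case.
 */
double relative_efficiency(double t_base, int p_base, double t, int p)
{
    assert(t_base > 0.0 && t > 0.0 && p_base > 0 && p > 0);
    return (t_base * (double)p_base) / (t * (double)p);
}
```

For example, with a 16-processor base case that takes 100 seconds, a 64-processor run that takes 40 seconds has a relative speedup of 2.5 and a relative efficiency of 0.625. The same functions apply per event or per phase by passing that event's or phase's time in place of total runtime.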

51 PerfExplorer Interface
- Select experiments and trials of interest
- Data organized in application, experiment, trial structure (will allow arbitrary in future)
- Experiment metadata

52 PerfExplorer Interface
- Select analysis

53 Timesteps per Second
- Cray X1 is the fastest to solution in all 3 tests
- FFT (nl2) improves time for B3-gtc only
- TeraGrid faster than p690 for B1-std?
- Plots generated automatically
[Plots: timesteps per second for B1-std, B2-cy, and B3-gtc; TeraGrid labeled]

54 Relative Efficiency (B1-std)
- By experiment (B1-std)
  - Total runtime (Cheetah (red))
- By event for one experiment
  - Coll_tr (blue) is significant
- By experiment for one event
  - Shows how Coll_tr behaves for all experiments
[Plots: relative efficiency against a 16 processor base case; Cheetah and Coll_tr labeled]

55 PerfExplorer Future Work
- Extensions to PerfExplorer framework
- Examine properties of performance data
- Automated guidance of analysis
  - Workflow scripting for repeatable analysis
- Dependency modeling (go beyond correlation)
- Time-series analysis of phase-based data

56 Open Trace Format (OTF)
- Features
  - Hierarchical trace format
  - Replacement for proprietary formats such as STF
    - Pallas and Intel
  - Efficient streams based parallel access
  - Tracing library available on IBM BG/L platform
- Development of OTF supported by LLNL
- Joint development effort
  - ZiH / Technical University of Dresden
  - ParaTools, Inc.
- http://www.paratools.com/otf

57 OTF Options

58 Vampir and VNG
- Commercial trace based tools
  - Developed at ZiH, T.U. Dresden
    - Wolfgang Nagel, Holger Brunst and others…
  - http://www.vampir-ng.de
- Vampir Trace Visualizer
  - Known also as Intel ® Trace Analyzer v4.0
  - Sequential program
- Vampir Next Generation (VNG)
  - Client (vng) runs on a desktop, server (vngd) on a cluster
  - Parallel trace analysis
  - Orders of magnitude bigger traces (more memory)
  - State of the art in parallel trace visualization

59 Vampir Next Generation (VNG) Architecture
- Classic analysis: monolithic, sequential
[Figure: a parallel program's monitor system writes event streams (Trace 1 … Trace N) to the file system; merged traces feed the analysis server (master plus workers 1 … m, using parallel I/O and message passing); the visualization client connects over the Internet. Client view: segment indicator, thumbnail, and timeline with 16 visible traces from 768 processes]

60 TAU Tracing Enhancements
- Configure TAU with -TRACE -vtf= -otf= options
    % configure -TRACE -vtf= …
    % configure -TRACE -otf= …
  Generates tau_merge, tau2vtf, tau2otf tools in / /bin
- Instrument and execute application
    % tau_f90.sh app.f90 -o app
    % mpirun -np 4 app
- Merge and convert trace files to VTF3/OTF format
    % tau_treemerge.pl
    % tau2vtf tau.trc tau.edf app.vpt.gz
    % vampir app.vpt.gz
  OR
    % tau2otf tau.trc tau.edf app.otf -n
    % vampir app.otf
  OR use VNG to analyze OTF/VTF trace files

61 TAU Eclipse Integration
- Eclipse GUI integration of existing TAU tools
- New Eclipse plug-in for code instrumentation
- Integration with CDT and FDT
- Java, C/C++, and Fortran projects
  - Can be instrumented and run from within Eclipse
  - Each project can be given multiple build configurations corresponding to available TAU makefiles
  - All TAU configuration options are available
  - ParaProf tool can be launched automatically

62 TAU Eclipse Integration
[Screenshots: TAU configuration, TAU experimentation]

63 TAU Eclipse Future Work
- Development of the TAU Eclipse plugins for Java and the CDT/FDT is ongoing
- Planned features include:
  - Full integration with the Eclipse Parallel Tools project
  - Database storage of project performance data
  - Refinement of the plugin settings interface to allow easier selection of TAU runtime and compile-time options
  - Accessibility of TAU configuration and command-line tools via the Eclipse UI

64 ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation
  - OS research for petascale systems
- ZeptoOS project
  - scalable, adaptive components for petascale architectures
  - Argonne National Laboratory and University of Oregon
- University of Oregon
  - Kernel-level performance monitoring
  - OS component performance assessment and tuning
  - KTAU (Kernel Tuning and Analysis Utilities)
    - integration of TAU infrastructure in Linux kernel
    - integration with ZeptoOS (light-weight Linux-based kernel)
    - installation on BG/L and other platforms (e.g., Cray XT3)
    - Port to 32-bit and 64-bit Linux platforms

65 Linux Kernel Profiling using TAU – Goals
- Fine-grained kernel-level performance measurement
  - Parallel applications
  - Support both profiling and tracing
  - Both process-centric and system-wide view
- Merge user-space performance with kernel-space
  - User-space: (TAU) profile/trace
  - Kernel-space: (KTAU) profile/trace
- Detailed program-OS interaction data
  - Including interrupts (IRQ)
- Analysis and visualization compatible with TAU

66 KTAU Architecture

67 KTAU On BG/L

68 KTAU Future Work
- Dynamic measurement control
  - Enable/disable events w/o recompilation or reboot
- Add new performance data sources
  - Look into hardware counters
- Improve user-space integration
  - Full callpaths and phase-based profiling
  - Merged user/kernel traces
- Integration with monitoring technology
  - SuperMon, MRNet, TAUg
- New porting efforts
  - IA-64, PPC-64 and AMD Opteron
- System characterization studies

69 TAU Performance System Status
- Computing platforms
  - IBM, SGI, Cray, HP, Sun, Hitachi, NEC, Linux clusters, Apple, Windows, …
- Programming languages
  - C, C++, Fortran 90/95, UPC, HPF, Java, OpenMP, Python
- Thread libraries
  - pthreads, SGI sproc, Java, Windows, OpenMP
- Communications libraries
  - MPI-1/2, PVM, shmem, …
- Compilers
  - IBM, Intel, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey, PathScale, Open64

70 Project Affiliations (selected)
- Lawrence Livermore National Lab
  - Hydrodynamics (Miranda), radiation diffusion (KULL)
  - Open Trace Format (OTF) implementation on BG/L
- Argonne National Lab
  - ZeptoOS project and KTAU
  - Astrophysical thermonuclear flashes (Flash)
- Center for Simulation of Accidental Fires and Explosion
  - University of Utah, ASCI ASAP Center, C-SAFE
  - Uintah Computational Framework (UCF)
- Oak Ridge National Lab
  - Contribution to the Joule Report (S3D, AORSA3D)

71 Project Affiliations (continued)
- Sandia National Lab
  - Simulation of turbulent reactive flows (S3D)
  - Combustion code (CFRFS)
- Los Alamos National Lab
  - Monte Carlo transport (MCNP)
  - SAIC’s Adaptive Grid Eulerian (SAGE)
- CCSM / ESMF / WRF climate/earth/weather simulation
  - NSF, NOAA, DOE, NASA, …
- Common component architecture (CCA) integration
- Performance Evaluation Research Center (PERC)
  - DOE SciDAC center

72 Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
    - MICS, Argonne National Lab
  - ASC/NNSA
    - University of Utah ASC/NNSA Level 1
    - ASC/NNSA, Lawrence Livermore National Lab
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF Software and Tools for High-End Computing
- Research Centre Juelich
- Los Alamos National Laboratory
- ParaTools

73 Acknowledgements
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer
- Wyatt Spear, PRL staff
- Scott Biersdorff, PRL staff
- Robert Yelle, PRL staff
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
- Kai Li, Ph.D. student
- Li Li, Ph.D. student
- Suravee Suthikulpanit, M.S. student

