Profiling Tools In Ranger Carlos Rosales, Kent Milfeld and Yaakoub Y. El Kharma


SCALABILITY IN MPI APPLICATIONS mpiP and IPM

About mpiP
mpiP is an MPI profiling library: mpip.sourceforge.net
Scalable and lightweight
Multiplatform:
– Linux IA32/IA64/x86_64/MIPS64
– IBM POWER 4/5
– Cray XT3/XT4/X1E
Does not require manual code instrumentation:
– collects statistics of MPI functions (wraps original MPI function calls)
– less overhead than tracing tools
– less data than tracing tools
Easy to use (requires relinking but not recompilation)

Using mpiP
Load the mpiP module:
    % module load mpiP
Link the static library before any others:
    % mpicc -g -L$TACC_MPIP_LIB -lmpiP -lbfd -liberty ./srcFile.c
Set the environment variables controlling the mpiP output:
    % setenv MPIP '-t 10 -k 2'
In this case:
    -t 10: only callsites taking more than 10% of the MPI time are included in the report
    -k 2: set the callsite stack traceback depth to 2
Run the program through the queue as usual.

mpiP runtime options
option   description                                                     default
-c       generate concise report, no callsite process-specific detail
-k n     set callsite stack traceback depth to n                         1
-o       disable profiling at initialization
-t x     set print threshold; x is the MPI % of time for each callsite   0.0
-v       generate both concise and verbose reports
-x exe   specify the full path to the executable
Setting the options:
    (csh)  % setenv MPIP 'option1 option2 …'
    (bash) % export MPIP='option1 option2 …'

mpiP calls from C/Fortran
Generate arbitrary reports using the function call MPI_Pcontrol() with different arguments.
Useful to:
– profile specific sections of the code
– obtain individual profiles of multiple function calls
Argument   Output behavior
0          Disable profiling (default)
1          Enable profiling
2          Reset all callsite data
3          Generate verbose report
4          Generate concise report

MPI_Pcontrol examples
Scope limitation:
    switch (i) {
    case 5:
        MPI_Pcontrol(1);  // enable profiling
        break;
    case 6:
        MPI_Pcontrol(0);  // disable profiling
        break;
    default:
        break;
    }
    /* ... do something for one timestep ... */
Individual reports:
    switch (i) {
    case 5:
        MPI_Pcontrol(2);  // reset profile data
        MPI_Pcontrol(1);  // enable profiling
        break;
    case 6:
        MPI_Pcontrol(3);  // generate verbose report
        MPI_Pcontrol(4);  // generate concise report
        MPI_Pcontrol(0);  // disable profiling
        break;
    default:
        break;
    }
    /* ... do something for one timestep ... */

mpiP output
After running the executable, a file with the extension .mpiP is generated containing:
– MPI Time (MPI time for all MPI calls)
– MPI callsites
– Aggregate message size
– Aggregate time
For scalability analysis it is important to compare the total MPI time to the total running time of the application.
Detailed function call data can be used to identify communication hotspots.

mpiP output: MPI Time
[Table: MPI Time (seconds), with columns Task, AppTime, MPITime and MPI%, ending in an aggregate row marked *; values not recoverable]
Slide annotation: This process seems to be controlling all MPI exchanges

mpiP output: MPI callsites
Callsites:
ID  Lev  File/Address  Line  Parent_Funct  MPI_Call
1   0    matmultc.c    60    main          Send
2   0    matmultc.c    52    main          Bcast
3   0    matmultc.c    103   main          Barrier
4   0    matmultc.c    78    main          Send
5   0    matmultc.c    65    main          Recv
6   0    matmultc.c    74    main          Send
7   0    matmultc.c    98    main          Send
8   0    matmultc.c    92    main          Recv
9   0    matmultc.c    88    main          Bcast

mpiP output: Aggregate time
[Table: Aggregate Time (top twenty, descending, milliseconds), with columns Call, Site, Time, App%, MPI% and COV; calls listed in order: Recv, Bcast, Recv, Barrier, Send, Bcast, Send, Send, Send; values not recoverable]


mpiP output: Message Size
[Table: Aggregate Sent Message Size (top twenty, descending, bytes), with columns Call, Site, Count, Total, Avrg and Sent%; calls listed in order: Bcast, Send, Bcast, Send, Send, Send; values not recoverable]

About IPM
IPM is an Integrated Performance Monitoring tool
– Portable profiling infrastructure for parallel codes
– Low-overhead performance summary of computing and communication
– Quick, easy and concise profiling tool
– Requires no manual instrumentation, just adding the -g option to the compilation
– Produces XML output that is parsed by scripts to generate browser-readable HTML pages
The level of detail it reports is lower than TAU, PAPI, HPCToolkit or Scalasca but higher than mpiP

Using IPM
Available on Ranger for both Intel and PGI compilers, with mvapich and mvapich2.
Create the IPM environment with the module command before running the code:
    % module load ipm
In your job script, set up the following IPM environment before the ibrun command:
    module load ipm
    export LD_PRELOAD=$TACC_IPM_LIB/libipm.so
    export IPM_REPORT=full
    ibrun

Using IPM
export LD_PRELOAD=$TACC_IPM_LIB/libipm.so
– must be inside the job script to ensure the IPM wrappers for MPI calls are loaded properly
IPM_REPORT controls the level of information collected:
– full
– terse
– none
IPM_MPI_THRESHOLD reports only routines using this fraction (or more) of MPI time
– e.g. "IPM_MPI_THRESHOLD 0.3" reports subroutines that consume more than 30% of the total MPI time
Important details:
    % module help ipm

Output from IPM
When your code has finished running, IPM will create an XML file with a name like: username
Get basic or full information in text mode using:
    % ipm_parse username
    % ipm_parse -full username
You can also transform this XML file into standard HTML files:
    % ipm_parse -html username
This generates a directory which contains an index.html file readable by any web browser.
Tar this directory and scp the file to your own local computer to visualize the results.

IPM: Text Output
##IPMv0.922###############################################
#
# command : /work/01125/yye00/ICAC/cactus_SandTank SandTank.par
# host    : i /x86_64_Linux          mpi_tasks : 32 on 2 nodes
# start   : 05/26/09/11:49:06        wallclock : sec
# stop    : 05/26/09/11:49:09        %comm     : 2.01
# gbytes  : e+00 total               gflop/sec : e-02 total
#
##########################################################
# region : * [ntasks] = 32
#
#                 [total]   min   max
# entries
# wallclock
# user
# system
# %comm
# gflop/sec
# gbytes
#
# PAPI_FP_OPS     e   e   e   e+06
# PAPI_TOT_CYC    e   e   e   e+08
# PAPI_VEC_INS    e   e   e   e+07
# PAPI_TOT_INS    e   e   e   e+08
#
#                 [time]   [calls]
# MPI_Allreduce
# MPI_Comm_rank
# MPI_Barrier
# MPI_Allgatherv
# MPI_Bcast
# MPI_Allgather
# MPI_Recv
# MPI_Comm_size
# MPI_Send
###########################################################

IPM: HTML Output

IPM: Load Balance

IPM: Communication balancing, by task

IPM: Integrated Performance Monitoring

IPM: Connectivity, Buffer-size Distribution

IPM: Buffer-Size Distribution

IPM: Memory Usage

BASIC PROFILING TOOLS timers, gprof

Timers: Command Line
The command time is available in most Unix systems.
It is simple to use (no code instrumentation required).
Gives the total execution time of a process and all its children in seconds.
    % /usr/bin/time -p ./exeFile
    real 9.95
    user 9.86
    sys 0.06
Leave out the -p option to get additional information:
    % time ./exeFile
    9.860u 0.060s 0: %0+0k 0+0io 0pf+0w

Timers: Code Section
Fortran:
    INTEGER :: rate, start, stop
    REAL :: time
    CALL SYSTEM_CLOCK(COUNT_RATE = rate)
    CALL SYSTEM_CLOCK(COUNT = start)
    ! Code to time here
    CALL SYSTEM_CLOCK(COUNT = stop)
    ! Convert to REAL before dividing to avoid integer truncation
    time = REAL(stop - start) / REAL(rate)
C:
    #include <time.h>
    double start, stop, elapsed;
    /* clock() reports CPU time used by the process, not wall-clock time */
    start = (double)clock()/CLOCKS_PER_SEC;
    /* Code to time here */
    stop = (double)clock()/CLOCKS_PER_SEC;
    elapsed = stop - start;

About GPROF
GPROF is the GNU Project PROFiler: gnu.org/software/binutils/
Requires recompilation of the code.
Compiler options and libraries provide wrappers for each routine call and periodic sampling of the program.
A default gmon.out file is produced with the function call information.
GPROF links the symbol list in the executable with the data in gmon.out.

Types of Profiles
Flat Profile
– CPU time spent in each function (self and cumulative)
– Number of times a function is called
– Useful to identify the most expensive routines
Call Graph
– Number of times a function was called by other functions
– Number of times a function called other functions
– Useful to identify function relations
– Suggests places where function calls could be eliminated
Annotated Source
– Indicates the number of times a line was executed

Profiling with gprof
Use the -pg flag during compilation:
    % gcc -g -pg ./srcFile.c
    % icc -g -p ./srcFile.c
    % pgcc -g -pg ./srcFile.c
Run the executable. An output file gmon.out will be generated with the profiling information.
Execute gprof and redirect the output to a file:
    % gprof ./exeFile gmon.out > profile.txt
    % gprof -l ./exeFile gmon.out > profile_line.txt
    % gprof -A ./exeFile gmon.out > profile_annotated.txt

Flat profile
In the flat profile we can identify the most expensive parts of the code (in this case, the calls to matSqrt, matCube, and sysCube).
[Table: flat profile with columns % time, cumulative seconds, self seconds, calls, self s/call, total s/call and name; rows in order: matSqrt, matCube, sysCube, main, vecSqrt, sysSqrt, vecCube; values not recoverable]

Call Graph Profile
[Table: call graph with columns index, % time, self, children, called and name; values not recoverable. Entries shown:]
[1] main: calls sysSqrt [3], matSqrt [2], sysCube [5], matCube [4], vecSqrt [6], vecCube [7]
[2] matSqrt: called by main [1] and sysSqrt [3]
[3] sysSqrt: called by main [1]; calls matSqrt [2] and vecSqrt [6]

Visual Call Graph
[Figure: call graph with nodes main, matSqrt, vecSqrt, matCube, vecCube, sysSqrt, sysCube]


ADVANCED PROFILING TOOLS PerfExpert, Tau

Advanced Profiling Tools
Can be intimidating:
– Difficult to install
– Many dependencies
– Require kernel patches
Useful for serial and parallel programs
Extensive profiling and scalability information
Analyze code using:
– Timers
– Hardware registers (PAPI)
– Function wrappers
Not your problem!!

PAPI
PAPI is a Performance Application Programming Interface: icl.cs.utk.edu/papi
API to use hardware counters
Behind Tau, HPCToolkit
Multiplatform:
– Most Intel & AMD chips
– IBM POWER 4/5/6
– Cray X/XD/XT
– Sun UltraSparc I/II/III
– MIPS
– SiCortex
– Cell
Available as a module in Ranger

PAPI: Available Events
Counter/Event Name   Meaning
PAPI_L1_DCM          Level 1 data cache misses
PAPI_L1_ICM          Level 1 instruction cache misses
PAPI_L2_DCM          Level 2 data cache misses
PAPI_L2_ICM          Level 2 instruction cache misses
PAPI_L2_TCM          Level 2 cache misses
PAPI_L3_TCM          Level 3 cache misses
PAPI_FPU_IDL         Cycles floating point units are idle
PAPI_TLB_DM          Data translation lookaside buffer misses
PAPI_TLB_IM          Instruction translation lookaside buffer misses
PAPI_STL_ICY         Cycles with no instruction issue
PAPI_HW_INT          Hardware interrupts
PAPI_BR_TKN          Conditional branch instructions taken
PAPI_BR_MSP          Conditional branch instructions mispredicted
PAPI_TOT_INS         Instructions completed
PAPI_FP_INS          Floating point instructions
PAPI_BR_INS          Branch instructions
PAPI_VEC_INS         Vector/SIMD instructions
PAPI_RES_STL         Cycles stalled on any resource
PAPI_TOT_CYC         Total cycles
PAPI_L1_DCA          Level 1 data cache accesses
PAPI_L2_DCA          Level 2 data cache accesses
PAPI_L2_ICH          Level 2 instruction cache hits
PAPI_L1_ICA          Level 1 instruction cache accesses
PAPI_L2_ICA          Level 2 instruction cache accesses
PAPI_L1_ICR          Level 1 instruction cache reads
PAPI_L2_TCA          Level 2 total cache accesses
PAPI_L3_TCR          Level 3 total cache reads
PAPI_FML_INS         Floating point multiply instructions
PAPI_FAD_INS         Floating point add instructions (also includes subtract instructions)
PAPI_FDV_INS         Floating point divide instructions (counts both divide and square root instructions)
PAPI_FSQ_INS         Floating point square root instructions (counts both divide and square root instructions)
PAPI_FP_OPS          Floating point operations

About PerfExpert
Brand new tool, locally developed at UT
Easy to use and understand
Great for quick profiling and for beginners
Provides recommendations on "what to fix" in a subroutine
Collects information from PAPI using HPCToolkit
No MPI-specific profiling, no 3D visualization, no elaborate metrics
Combines ease of use with useful interpretation of gathered performance data

Using PerfExpert
Load the papi and java modules:
    % module load papi
    % module load java
Copy the PerfExpert.sge submission script (for editing):
    % cp /share/home/00976/burtsche/PerfExpert/PerfExpert.sge ./
Edit the PerfExpert.sge script to ensure the correct executable name, correct directory, correct project name and so on.
Submit your job:
    % qsub PerfExpert.sge
To analyze results:
    % /share/home/00976/burtsche/PerfExpert/PerfExpert ./hpctoolkit-….
Typical value for threshold is 0.1

About Tau
TAU is a suite of Tuning and Analysis Utilities
Multi-year project involving:
– University of Oregon Performance Research Lab
– LANL Advanced Computing Laboratory
– Research Centre Julich at ZAM, Germany
Integrated toolkit:
– Performance instrumentation
– Measurement
– Analysis
– Visualization

Using Tau
Load the papi and tau modules
Gather information for the profile run:
– Type of run (profiling/tracing, hardware counters, etc.)
– Programming paradigm (MPI/OMP)
– Compiler (Intel/PGI/GCC…)
Select the appropriate TAU_MAKEFILE based on your choices ($TAU/Makefile.*)
Set up the selected PAPI counters in your submission script
Run as usual and analyze using paraprof
– You can transfer the database to your own PC to do the analysis

TAU Performance System Architecture

Tau: Example
Load the papi and tau modules:
    % module load papi
    % module load tau
Say that we choose to do
– a profiling run with multiple counters for
– an MPI parallel code and use
– the PDT instrumentor with
– the PGI compiler
The TAU_MAKEFILE to use for this combination is:
    $TAU/Makefile.tau-multiplecounters-mpi-papi-pdt-pgi
So we set it up:
    % setenv TAU_MAKEFILE $TAU/Makefile.tau-multiplecounters-mpi-papi-pdt-pgi
And we compile using the wrapper provided by tau:
    % tau_cc.sh matmult.c

Tau: Example (Cont.)
Next we decide which hardware counters to use:
– GET_TIME_OF_DAY (time, profiling, similar to using gprof)
– PAPI_FP_OPS (floating point operations)
– PAPI_L1_DCM (data cache misses for the Level 1 cache)
We set these as environment variables on the command line or in the submission script.
For csh:
    % setenv COUNTER1 GET_TIME_OF_DAY
    % setenv COUNTER2 PAPI_FP_OPS
    % setenv COUNTER3 PAPI_L1_DCM
For bash (no spaces around the = sign):
    % export COUNTER1=GET_TIME_OF_DAY
    % export COUNTER2=PAPI_FP_OPS
    % export COUNTER3=PAPI_L1_DCM
Then we send the job through the queue as usual.

Tau: Example (Cont.)
When the program finishes running, one new directory will be created for each hardware counter we specified:
– MULTI__GET_TIME_OF_DAY
– MULTI__PAPI_FP_OPS
– MULTI__PAPI_L1_DCM
Analyze the results with paraprof:
    % paraprof

TAU: ParaProf Manager
– Shows the counters we asked for
– Provides machine details
– Organizes runs as Applications, Experiments and Trials

Tau: Metric View
Profile information is in the "GET_TIME_OF_DAY" metric.
Mean and standard deviation statistics are given.
Function legend: Windows -> Function Legend

Tau: Metric View Unstack the bars for clarity: Options -> Stack Bars Together

Tau: Function Data Window Click on any of the bars corresponding to function multiply_matrices. This opens the Function Data Window, which gives a closer look at a single function.

Tau: Floating Point OPS
Hardware counters provide floating point operations (Function Data view).

Tau: L1 Cache Data Misses

Derived Metrics Select Argument 1 (green ball); Select Argument 2 (green ball); Select Operation; then Apply. Derived Metric will appear as a new trial.

Derived Metrics (Cont.) Select a Function Function Data Window -> Options -> Select Metric -> Exclusive -> …

Derived Metrics
Be careful: even though the ratios are constant, cores may do different amounts of work/operations per call.
Since the FP/Miss ratios are constant, this must be a memory access problem.

Callpath
To find out about function calls within the program, follow the same process but use the following TAU_MAKEFILE:
    Makefile.tau-callpath-mpi-pdt-pgi
In the Metric View window two new options will be available:
– Windows -> Thread -> Call Graph
– Windows -> Thread -> Call Path Relations

Callpath Call Graph Paths (Must select through “thread” menu.)

Call Path

Profiling dos and don’ts
DO
– Test every change you make
– Profile typical cases
– Compile with optimization flags
– Test for scalability
DO NOT
– Assume a change will be an improvement
– Profile atypical cases
– Profile ad infinitum: set yourself a goal or set yourself a time limit

Other Profiling Tools In Ranger
Scalasca: scalasca.org
– Scalable all-in-one profiling package
– Requires recompilation of the source to instrument it, like Tau
– Accessible by loading the scalasca module:
    % module load scalasca
HPCToolkit: hpctoolkit.org
– All-in-one package similar to Tau and Scalasca, but of more recent development
– Uses binary instrumentation, so recompilation of the code is not required
– Accessible via the developer’s installation under: /scratch/projects/hpctoolkit/pkgs/hpctoolkit/bin/