Performance Monitoring Tools on TCS
Roberto Gomez and Raghu Reddy, Pittsburgh Supercomputing Center
David O'Neal, National Center for Supercomputing Applications
Objective
Measure single PE performance
–Operation counts, wall time, MFLOP rates
–Cache utilization ratio
Study scalability
–Time spent in MPI calls vs. computation
–Time spent in OpenMP parallel sections
Atom Tools
atom(1)
–Various tools
–Low overhead
–No recompiling or re-linking in some cases
Useful Tools
Flop2:
–Floating point operation counts
Timer5:
–Wall time (inclusive & exclusive) per routine
Calltrace:
–Detailed statistics of calls and their arguments
Developed by Dick (Compaq)
Instrumentation
Load atom module
–module load atom
Create routines file
–nm -g a.out | awk '{if ($5 == "T") print $1}' > routines
Edit routines file
–place the main routine first; remove unwanted entries
Instrument executable
–cat routines | atom -tool flop2 a.out
–cat routines | atom -tool timer5 a.out
Execute
–run a.out.flop2 and a.out.timer5 to create the fprof.* and tprof.* output files
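Putting the steps above together, a minimal end-to-end command sequence might look like the sketch below. The awk field numbers are taken directly from the slide; verify them against the nm(1) output format on your system.

    # Instrument a.out with the flop2 and timer5 tools, then run the
    # instrumented binaries to produce the fprof.* and tprof.* profiles.
    module load atom
    nm -g a.out | awk '{if ($5 == "T") print $1}' > routines
    # edit routines: place the main routine first, delete unwanted entries
    cat routines | atom -tool flop2  a.out
    cat routines | atom -tool timer5 a.out
    ./a.out.flop2      # writes fprof.*
    ./a.out.timer5     # writes tprof.*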
Single PE Performance Analysis
Sample Timer5 output file (values omitted):

    Procedure               Calls    Self Time    Total Time
    =========               =====    =========    ==========
    $null_evol$null_j_
    $null_eth$null_d1_
    $null_hyper_u$null_u_
    $null_hyper_w$null_w_
    =========================================================
    Total
Single PE Performance Analysis
Sample Flop2 output file (values omitted):

    Procedure               Calls    Fops
    =========               =====    ====
    $null_evol$null_j_
    $null_eth$null_d1_
    $null_hyper_u$null_u_
    $null_hyper_w$null_w_
    ======================================
    Total

Obtain MFLOPS = Fops / (Self Time), dividing by 10^6 when Fops is a raw operation count
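As a purely hypothetical example of the MFLOPS calculation: if flop2 reports 2.4e9 Fops for a routine and timer5 reports 12.0 s of self time for the same routine, the rate is 2.4e9 / (12.0 x 10^6) = 200 MFLOPS.

    # Hypothetical numbers, for illustration only.
    echo "2.4e9 12.0" | awk '{printf "%.1f MFLOPS\n", $1 / ($2 * 1.0e6)}'
    # prints: 200.0 MFLOPS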
MPI calltrace
module load atom
cat $ATOMPATH/mpicalls | atom -tool calltrace a.out
Execute a.out.calltrace to generate one trace file per PE
Gather timings for the desired MPI routines
Repeat for increasing numbers of processors
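A sketch of the scaling study described above is given below. The prun launcher, the per-PE trace-file names, and the grep-based extraction of timings are assumptions; adjust them to the actual calltrace output on your system.

    # Instrument once, then rerun at increasing PE counts.
    module load atom
    cat $ATOMPATH/mpicalls | atom -tool calltrace a.out
    foreach np ( 8 128 256 )
      prun -n $np ./a.out.calltrace            # one trace file per PE
      # Trace-file glob and record format below are placeholders.
      grep MPI_ALLREDUCE a.out.calltrace.*.trace > allreduce.$np.txt
    end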
Sample calltrace statistics (timings omitted):

    Number of processors        8 PEs      128 PEs    256 PEs
    Processor grid              2x2x2      8x4x4      8x8x4
    Total run time
    MPI_ISEND statistics
    MPI_RECV statistics
    MPI_WAIT statistics
    MPI_ALLTOALL statistics
    MPI_REDUCE statistics
    MPI_ALLREDUCE statistics
    MPI_BCAST statistics
    MPI_BARRIER statistics
    __________________________________________________________
    Total MPI time
[Figure: calltrace timings graph]
DCPI
Digital Continuous Profiling Infrastructure
Daemon and profiling utilities
Very low overhead (1-2%)
Aggregate or per-process data and analysis
No code modifications required
Requires interactive access to compute nodes
DCPI Example
Driver script
–creates map file and host list
–calls daemon and profiling scripts
Daemon startup script
–starts daemon with selected options
Daemon shutdown script
–halts daemon
Profiling script
–executes post-processing utility with selected options
DCPI Driver Script
PBS job file
–dcpi.pbs
Creates map file and host list
–Image map generated by dcpiscan(1)
–Host list used by dsh(1) commands
Executes daemon and profiling scripts
–Start daemon, run test executable, stop daemon, post-process
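A minimal driver sketch modeled on dcpi.pbs is shown below. The dcpiscan arguments, the host-list construction from $PBS_NODEFILE, the prun launch, and the rsh loop (used here instead of the dsh(1) commands in the real driver, to avoid guessing dsh option syntax) are all assumptions.

    #!/bin/csh -f
    #PBS -j oe
    set WORK = $PBS_O_WORKDIR
    set EXE  = $WORK/a.out
    set MAP  = $WORK/a.out.map
    cd $WORK
    dcpiscan $EXE > $MAP                  # image map (argument form assumed)
    sort -u $PBS_NODEFILE > hosts         # host list (source assumed)
    foreach host ( `cat hosts` )          # the real driver uses dsh(1) here
      rsh $host $WORK/dcpi_start.csh $MAP $WORK $EXE
    end
    prun $EXE                             # run the test executable (launcher assumed)
    foreach host ( `cat hosts` )
      rsh $host $WORK/dcpi_stop.csh
    end
    foreach host ( `cat hosts` )
      rsh $host $WORK/dcpi_post.csh $MAP $WORK $EXE
    end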
DCPI Startup Script
C shell script
–dcpi_start.csh
Three arguments defined by the driver job
–MAP, WORK, EXE
Creates database directory (DCPIDB)
–Derived from WORK + hostname
Starts dcpid(1) process
–Events of interest are specified here
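A minimal sketch of the startup script follows. The dcpid invocation is an assumption; in particular, the event-selection options that the real dcpi_start.csh passes are not shown (see dcpid(1)).

    #!/bin/csh -f
    # MAP and EXE are accepted for symmetry with the real script but unused here.
    set MAP  = $1
    set WORK = $2
    set EXE  = $3
    # Database directory derived from WORK + hostname, as described above.
    setenv DCPIDB $WORK/`hostname`
    mkdir -p $DCPIDB
    # Start the daemon in the background (a common pitfall; see the
    # "Common DCPI Problems" slide). Event-selection options would be added here.
    dcpid $DCPIDB &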
DCPI Stop Script
C shell script
–dcpi_stop.csh
No arguments
dcpiquit(1) flushes buffers and halts the daemon process
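The corresponding stop script is essentially a one-liner:

    #!/bin/csh -f
    # dcpiquit flushes the sample buffers and halts the local dcpid daemon;
    # per the slide it takes no arguments.
    dcpiquit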
DCPI Profiling Script
C shell script
–dcpi_post.csh
Three arguments defined by the driver job
–MAP, WORK, EXE
Determines database location (as before)
Uses dcpiprof(1) to post-process the database files
–Profile selection(s) must be consistent with the daemon startup options
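A bare-bones sketch of the profiling script follows. The dcpiprof invocation is deliberately minimal and is an assumption; the real dcpi_post.csh adds selections (e.g. memory and floating-point operation profiles) that must match the events dcpid sampled.

    #!/bin/csh -f
    # MAP and EXE are accepted for symmetry with the real script but unused here.
    set MAP  = $1
    set WORK = $2
    set EXE  = $3
    setenv DCPIDB $WORK/`hostname`     # same WORK + hostname derivation as the startup script
    # Default per-image breakdown written to stdout, appended to dcpi.output.
    dcpiprof >> $WORK/dcpi.output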
DCPI Example Output
Profiler writes to stdout by default
–dcpi.output
Single-node output in four sections
–Start daemon, run test, halt daemon
–Basic dcpiprof output
–Memory operations (MOPS)
–Floating point operations (FOPS)
Reference the profiling script for details
Other DCPI Options
Per-process output files
–See the dcpid(1) -bypid option
Trim output
–See the dcpiprof(1) -keep option
–The host list can also be cropped
ProfileMe events for EV67 and later
–Focus on -pm events
–See dcpiprofileme(1) options
Common DCPI Problems
Login denied (dsh)
–Requires permission to log in on the compute nodes
Daemon not started in the background
NFS is flaky for larger node counts (100+)
Incorrect file mode on the DCPIDB directory
Mismatch between startup configuration and profiling specifications
–See dcpid(1), dcpiprof(1), and dcpiprofileme(1)
Summary
Low-level interfaces provide access to the hardware counters
–Very effective, but requires experience
–Minimal overhead costs
Report timings, flop counts, and MFLOP rates for user code and library calls, e.g. MPI
More information is available, e.g. message sizes, time variability, etc.