Performance Monitoring Tools on TCS
Roberto Gomez and Raghu Reddy, Pittsburgh Supercomputing Center
David O'Neal, National Center for Supercomputing Applications
Objective
Measure single PE performance
–Operation counts, wall time, MFLOP rates
–Cache utilization ratio
Study scalability
–Time spent in MPI calls vs. computation
–Time spent in OpenMP parallel sections
Atom Tools
atom(1)
–Various tools
–Low overhead
–No recompiling or re-linking in some cases
Useful Tools
Flop2:
–Floating point operation counts
Timer5:
–Wall time (inclusive & exclusive) per routine
Calltrace:
–Detailed statistics of calls and their arguments
Developed by Dick (Compaq)
Instrumentation
Load atom module
–module load atom
Create routines file
–nm -g a.out | awk '{if ($5 == "T") print $1}' > routines
Edit routines file
–place the main routine first; remove unwanted entries
Instrument executable
–cat routines | atom -tool flop2 a.out
–cat routines | atom -tool timer5 a.out
Execute
–run a.out.flop2 and a.out.timer5 to create the fprof.* and tprof.* output files
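Putting the steps above together, a minimal end-to-end command sequence might look like the sketch below. The awk field numbers are taken directly from the slide; verify them against the nm(1) output format on your system.

    # Instrument a.out with the flop2 and timer5 tools, then run the
    # instrumented binaries to produce the fprof.* and tprof.* profiles.
    module load atom
    nm -g a.out | awk '{if ($5 == "T") print $1}' > routines
    # edit routines: place the main routine first, delete unwanted entries
    cat routines | atom -tool flop2  a.out
    cat routines | atom -tool timer5 a.out
    ./a.out.flop2      # writes fprof.*
    ./a.out.timer5     # writes tprof.*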
Single PE Performance Analysis
Sample Timer5 output file (values omitted):

    Procedure               Calls    Self Time    Total Time
    =========               =====    =========    ==========
    $null_evol$null_j_
    $null_eth$null_d1_
    $null_hyper_u$null_u_
    $null_hyper_w$null_w_
    =========================================================
    Total
Single PE Performance Analysis
Sample Flop2 output file (values omitted):

    Procedure               Calls    Fops
    =========               =====    ====
    $null_evol$null_j_
    $null_eth$null_d1_
    $null_hyper_u$null_u_
    $null_hyper_w$null_w_
    ======================================
    Total

Obtain MFLOPS = Fops / (Self Time), dividing by 10^6 when Fops is a raw operation count
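As a purely hypothetical example of the MFLOPS calculation: if flop2 reports 2.4e9 Fops for a routine and timer5 reports 12.0 s of self time for the same routine, the rate is 2.4e9 / (12.0 x 10^6) = 200 MFLOPS.

    # Hypothetical numbers, for illustration only.
    echo "2.4e9 12.0" | awk '{printf "%.1f MFLOPS\n", $1 / ($2 * 1.0e6)}'
    # prints: 200.0 MFLOPS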
MPI calltrace
module load atom
cat $ATOMPATH/mpicalls | atom -tool calltrace a.out
Execute a.out.calltrace to generate one trace file per PE
Gather timings for the desired MPI routines
Repeat for increasing numbers of processors
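A sketch of the scaling study described above is given below. The prun launcher, the per-PE trace-file names, and the grep-based extraction of timings are assumptions; adjust them to the actual calltrace output on your system.

    # Instrument once, then rerun at increasing PE counts.
    module load atom
    cat $ATOMPATH/mpicalls | atom -tool calltrace a.out
    foreach np ( 8 128 256 )
      prun -n $np ./a.out.calltrace            # one trace file per PE
      # Trace-file glob and record format below are placeholders.
      grep MPI_ALLREDUCE a.out.calltrace.*.trace > allreduce.$np.txt
    end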
Sample calltrace statistics (timings omitted):

    Number of processors        8 PEs      128 PEs    256 PEs
    Processor grid              2x2x2      8x4x4      8x8x4
    Total run time
    MPI_ISEND statistics
    MPI_RECV statistics
    MPI_WAIT statistics
    MPI_ALLTOALL statistics
    MPI_REDUCE statistics
    MPI_ALLREDUCE statistics
    MPI_BCAST statistics
    MPI_BARRIER statistics
    __________________________________________________________
    Total MPI time
[Figure: calltrace timings graph]
DCPI
Digital Continuous Profiling Infrastructure
Daemon and profiling utilities
Very low overhead (1-2%)
Aggregate or per-process data and analysis
No code modifications required
Requires interactive access to compute nodes
DCPI Example
Driver script
–creates map file and host list
–calls daemon and profiling scripts
Daemon startup script
–starts daemon with selected options
Daemon shutdown script
–halts daemon
Profiling script
–executes post-processing utility with selected options
DCPI Driver Script
PBS job file
–dcpi.pbs
Creates map file and host list
–Image map generated by dcpiscan(1)
–Host list used by dsh(1) commands
Executes daemon and profiling scripts
–Start daemon, run test executable, stop daemon, post-process
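A minimal driver sketch modeled on dcpi.pbs is shown below. The dcpiscan arguments, the host-list construction from $PBS_NODEFILE, the prun launch, and the rsh loop (used here instead of the dsh(1) commands in the real driver, to avoid guessing dsh option syntax) are all assumptions.

    #!/bin/csh -f
    #PBS -j oe
    set WORK = $PBS_O_WORKDIR
    set EXE  = $WORK/a.out
    set MAP  = $WORK/a.out.map
    cd $WORK
    dcpiscan $EXE > $MAP                  # image map (argument form assumed)
    sort -u $PBS_NODEFILE > hosts         # host list (source assumed)
    foreach host ( `cat hosts` )          # the real driver uses dsh(1) here
      rsh $host $WORK/dcpi_start.csh $MAP $WORK $EXE
    end
    prun $EXE                             # run the test executable (launcher assumed)
    foreach host ( `cat hosts` )
      rsh $host $WORK/dcpi_stop.csh
    end
    foreach host ( `cat hosts` )
      rsh $host $WORK/dcpi_post.csh $MAP $WORK $EXE
    end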
DCPI Startup Script
C shell script
–dcpi_start.csh
Three arguments defined by the driver job
–MAP, WORK, EXE
Creates database directory (DCPIDB)
–Derived from WORK + hostname
Starts dcpid(1) process
–Events of interest are specified here
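A minimal sketch of the startup script follows. The dcpid invocation is an assumption; in particular, the event-selection options that the real dcpi_start.csh passes are not shown (see dcpid(1)).

    #!/bin/csh -f
    # MAP and EXE are accepted for symmetry with the real script but unused here.
    set MAP  = $1
    set WORK = $2
    set EXE  = $3
    # Database directory derived from WORK + hostname, as described above.
    setenv DCPIDB $WORK/`hostname`
    mkdir -p $DCPIDB
    # Start the daemon in the background (a common pitfall; see the
    # "Common DCPI Problems" slide). Event-selection options would be added here.
    dcpid $DCPIDB &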
DCPI Stop Script
C shell script
–dcpi_stop.csh
No arguments
dcpiquit(1) flushes buffers and halts the daemon process
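The corresponding stop script is essentially a one-liner:

    #!/bin/csh -f
    # dcpiquit flushes the sample buffers and halts the local dcpid daemon;
    # per the slide it takes no arguments.
    dcpiquit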
DCPI Profiling Script
C shell script
–dcpi_post.csh
Three arguments defined by the driver job
–MAP, WORK, EXE
Determines database location (as before)
Uses dcpiprof(1) to post-process the database files
–Profile selection(s) must be consistent with the daemon startup options
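A bare-bones sketch of the profiling script follows. The dcpiprof invocation is deliberately minimal and is an assumption; the real dcpi_post.csh adds selections (e.g. memory and floating-point operation profiles) that must match the events dcpid sampled.

    #!/bin/csh -f
    # MAP and EXE are accepted for symmetry with the real script but unused here.
    set MAP  = $1
    set WORK = $2
    set EXE  = $3
    setenv DCPIDB $WORK/`hostname`     # same WORK + hostname derivation as the startup script
    # Default per-image breakdown written to stdout, appended to dcpi.output.
    dcpiprof >> $WORK/dcpi.output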
DCPI Example Output
Profiler writes to stdout by default
–dcpi.output
Single-node output in four sections
–Start daemon, run test, halt daemon
–Basic dcpiprof output
–Memory operations (MOPS)
–Floating point operations (FOPS)
Reference the profiling script for details
Other DCPI Options
Per-process output files
–See the dcpid(1) -bypid option
Trim output
–See the dcpiprof(1) -keep option
–The host list can also be cropped
ProfileMe events for EV67 and later
–Focus on -pm events
–See dcpiprofileme(1) options
Common DCPI Problems
Login denied (dsh)
–Requires permission to log in on the compute nodes
Daemon not started in the background
NFS is flaky for larger node counts (100+)
Incorrect file mode on the DCPIDB directory
Mismatch between startup configuration and profiling specifications
–See dcpid(1), dcpiprof(1), and dcpiprofileme(1)
Summary
Low-level interfaces provide access to the hardware counters
–Very effective, but requires experience
–Minimal overhead costs
Report timings, flop counts, and MFLOP rates for user code and library calls, e.g. MPI
More information is available, e.g. message sizes, time variability, etc.