1
Performance Monitoring Tools on TCS
Roberto Gomez and Raghu Reddy, Pittsburgh Supercomputing Center
David O'Neal, National Center for Supercomputing Applications
2
Objective
Measure single PE performance
– Operation counts, wall time, MFLOP rates
– Cache utilization ratio
Study scalability
– Time spent in MPI calls vs. computation
– Time spent in OpenMP parallel sections
3
Atom Tools
atom(1)
– Various tools
– Low overhead
– No recompiling or re-linking in some cases
4
Useful Tools
Flop2:
– Floating point operation counts
Timer5:
– Wall time (inclusive & exclusive) per routine
Calltrace:
– Detailed statistics of calls and their arguments
Developed by Dick Foster @ Compaq
5
Instrumentation
Load atom module
– module load atom
Create routines file
– nm -g a.out | awk '{if($5=="T") print $1}' > routines
Edit routines file
– place the main routine first; remove unwanted ones
Instrument the executable
– cat routines | atom -tool flop2 a.out
– cat routines | atom -tool timer5 a.out
Execute
– run a.out.flop2 and a.out.timer5 to create the fprof.* and tprof.* files
The full sequence is collected in the sketch below.
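A minimal end-to-end sketch of the steps above, assuming an executable named a.out; the routines file is normally edited by hand between the nm step and the atom steps:

    module load atom
    # List global text (T) symbols; these become the candidate routines
    nm -g a.out | awk '{if ($5 == "T") print $1}' > routines
    # (edit routines here: main routine first, unwanted entries removed)
    # Build the instrumented executables a.out.flop2 and a.out.timer5
    cat routines | atom -tool flop2 a.out
    cat routines | atom -tool timer5 a.out
    # Run them to produce the fprof.* and tprof.* profile files
    ./a.out.flop2
    ./a.out.timer5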
6
Single PE Performance Analysis
Sample Timer5 output file:

Procedure                 Calls     Self Time   Total Time
=========                 =====     =========   ==========
$null_evol$null_j_         3072      60596709     79880903
$null_eth$null_d1_        72458      45499161     45499161
$null_hyper_u$null_u_      3328      39889655     44500045
$null_hyper_w$null_w_      3328      19195271     33769541
...
Total                   1961226     248258934    248258934
7
Single PE Performance Analysis
Sample Flop2 output file:

Procedure                 Calls            Fops
=========                 =====            ====
$null_evol$null_j_         3072     20406036288
$null_eth$null_d1_        72458     20220926518
$null_hyper_u$null_u_      3328     14062774258
$null_hyper_w$null_w_      3328      3823795456
...
Total                   1936818     70876179927

Obtain MFLOPS = Fops / (Self Time)
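Combining the two profiles gives per-routine MFLOP rates. A sketch, assuming Self Time is reported in microseconds (consistent with the ~248-second total run above, which makes Fops/SelfTime come out directly in MFLOPS) and that fprof.out and tprof.out are whitespace-separated in the column orders shown; the file names and column positions are assumptions to check against the actual output:

    # Pass 1 reads fprof.out (name, calls, fops); pass 2 reads tprof.out
    # (name, calls, self, total) and prints MFLOPS = fops / self_time_us
    awk 'NR==FNR {fops[$1]=$3; next} ($1 in fops) && $3>0 {printf "%-24s %10.1f\n", $1, fops[$1]/$3}' fprof.out tprof.out

For the leading routine above this gives 20406036288 / 60596709, about 337 MFLOPS.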
8
MPI calltrace
module load atom
cat $ATOMPATH/mpicalls | atom -tool calltrace a.out
Execute a.out.calltrace to generate one trace file per PE
Gather timings for desired MPI routines (see the sketch below)
Repeat for increasing numbers of processors
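A sketch of the gathering step, assuming each PE writes its own trace file (here named ctrace.*) containing one statistics line per MPI routine with a cumulative time in the last field; both the file naming and the line layout are assumptions to adjust to the actual calltrace output:

    # Sum MPI_BARRIER time across all per-PE trace files
    grep MPI_BARRIER ctrace.* | awk '{ sum += $NF } END { print sum }'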
9
Sample calltrace statistics:

Number of processors      8 PEs     128 PEs    256 PEs
Processor grid            2x2x2     8x4x4      8x8x4
Total run time          277.028    314.857    422.170
MPI_ISEND                 1.250      1.498      2.265
MPI_RECV                  4.349     19.779     26.537
MPI_WAIT                  9.172     16.311     20.150
MPI_ALLTOALL              5.072      9.433     12.894
MPI_REDUCE                0.013      0.162      0.002
MPI_ALLREDUCE             0.391      2.073     10.313
MPI_BCAST                 0.061      1.135      1.382
MPI_BARRIER              14.959     28.694     62.028
____________________________________________________
Total MPI time           35.267     79.085    135.571

MPI time grows from about 13% of the total run time at 8 PEs to about 32% at 256 PEs, with MPI_BARRIER the largest single contributor.
10
calltrace timings graph
11
DCPI
Digital Continuous Profiling Infrastructure
– daemon and profiling utilities
Very low overhead (1-2%)
Aggregate or per-process data and analysis
No code modifications
Requires interactive access to compute nodes
12
DCPI Example
Driver script
– creates map file and host list
– calls daemon and profiling scripts
Daemon startup script
– starts daemon with selected options
Daemon shutdown script
– halts daemon
Profiling script
– executes post-processing utility with selected options
13
DCPI Driver Script
PBS job file
– dcpi.pbs
Creates map file and host list
– Image map generated by dcpiscan(1)
– Host list used by dsh(1) commands
Executes daemon and profiling scripts
– Start daemon, run test executable, stop daemon, post-process (sketched below)
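A skeletal driver along these lines, assuming dsh(1) runs the helper scripts on every compute node and prun launches the test executable; the PBS directives, the dcpiscan invocation, the host-list construction, and the dsh option for a host-list file are all assumptions to check against the local setup and man pages:

    #!/bin/csh
    #PBS -j oe
    set WORK = $PBS_O_WORKDIR
    set EXE  = $WORK/a.out
    set MAP  = $WORK/a.out.map
    # Build the image map and the list of hosts in this job
    dcpiscan $EXE > $MAP
    # ... write the allocated node names to $WORK/hostlist for dsh(1) ...
    dsh -f $WORK/hostlist $WORK/dcpi_start.csh $MAP $WORK $EXE
    prun $EXE
    dsh -f $WORK/hostlist $WORK/dcpi_stop.csh
    dsh -f $WORK/hostlist $WORK/dcpi_post.csh $MAP $WORK $EXE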
14
DCPI Startup Script
C shell script
– dcpi_start.csh
Three arguments defined by driver job
– MAP, WORK, EXE
Creates database directory (DCPIDB)
– Derived from WORK + hostname
Starts dcpid(1) process
– Events of interest are specified here (see the sketch below)
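A sketch of a startup script in this shape; how dcpid(1) takes the database directory and event selections varies, so the final line is an assumption to verify in the man page:

    #!/bin/csh
    # Arguments supplied by the driver job
    set MAP  = $1
    set WORK = $2
    set EXE  = $3
    # Per-host database directory: WORK + hostname
    set DB = $WORK/`hostname`
    mkdir -p $DB
    setenv DCPIDB $DB
    # Start the daemon in the background (it must not block the driver);
    # events of interest are specified here -- see dcpid(1) for syntax
    dcpid $DB &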
15
DCPI Stop Script
C shell script
– dcpi_stop.csh
No arguments
dcpiquit(1) flushes buffers and halts the daemon process
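The shutdown script reduces to a single call; a minimal sketch, assuming dcpiquit(1) locates the daemon via the DCPIDB environment set at startup:

    #!/bin/csh
    # Flush sample buffers and halt the dcpid daemon on this host
    dcpiquit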
16
DCPI Profiling Script
C shell script
– dcpi_post.csh
Three arguments defined by driver job
– MAP, WORK, EXE
Determines database location (as before)
Uses dcpiprof(1) to post-process database files
– Profile selection(s) must be consistent with daemon startup options
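A post-processing sketch; the dcpiprof(1) arguments shown are assumptions, and any event selections must match what dcpid was started with:

    #!/bin/csh
    set MAP  = $1
    set WORK = $2
    set EXE  = $3
    # Same per-host database directory the startup script created
    setenv DCPIDB $WORK/`hostname`
    # Basic profile of the test executable (add event selections that
    # are consistent with the dcpid startup options)
    dcpiprof $EXE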
17
DCPI Example Output
Profiler writes to stdout by default
– dcpi.output
Single node output in four sections
– Start daemon, run test, halt daemon
– Basic dcpiprof output
– Memory operations (MOPS)
– Floating point operations (FOPS)
See the profiling script for details
18
Other DCPI Options
Per-process output files
– See the dcpid(1) -bypid option
Trim output
– See the dcpiprof(1) -keep option
– The host list can also be cropped
ProfileMe events for EV67 and later
– Focus on -pm events
– See dcpiprofileme(1) options
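Illustrative invocations of the options named above; the argument forms are assumptions to verify against the man pages:

    # Per-process output files (see dcpid(1))
    dcpid -bypid $DCPIDB &
    # Trim dcpiprof output (see dcpiprof(1) for the -keep argument form)
    dcpiprof -keep 20 $EXE
    # ProfileMe analysis on EV67 and later (event names from dcpiprofileme(1))
    dcpiprofileme -pm <event> $EXE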
19
Common DCPI Problems
Login denied (dsh)
– Requires permission to log in on compute nodes
Daemon not started in the background
NFS is flaky for larger node counts (100+)
Filemode of the DCPIDB directory must be set correctly
Mismatch between startup configuration and profiling specifications
– See dcpid(1), dcpiprof(1), and dcpiprofileme(1)
20
Summary
Low-level interfaces provide access to hardware counters
Very effective, but experience is required
Minimal overhead costs
Report timings, flop counts, and MFLOP rates for user code and library calls, e.g. MPI
More information is available, e.g. message sizes, time variability