Performance Analysis Tools
Performance Analysis Goal

Once we have a working parallel program, we want to tune it to run faster.

- Hot spot: an area of code that uses a significant amount of CPU time.
- Bottleneck: an area of code that uses resources inefficiently and slows the program down (e.g. communication).
Timers

One way to identify hot spots and bottlenecks is to use timers. We have used them to measure the elapsed time of an entire algorithm, but they can also measure the time spent in individual parts of the algorithm.
Timers

Timer            Usage          Wallclock / CPU   Resolution       Languages
time             Shell command  Both              1/100th second   Any
gettimeofday     Subroutine     Wallclock         Microseconds     C/C++
read_real_time   Subroutine     Wallclock         Nanoseconds      C/C++ (IBM AIX systems)
MPI_Wtime        Subroutine     Wallclock         Microseconds     C/C++, Fortran
Time command

Usage:

  time mpirun -np <#> <command>

Result:

  real 0m1.071s
  user 0m0.177s
  sys  0m0.045s
Time command

Meaning:

- Real time: the total wall clock (start to finish) time your program took to load, execute, and exit.
- User time: the total amount of CPU time your program took to execute.
- System time: the amount of CPU time spent on operating system calls while executing your program.
gettimeofday

gettimeofday is a system call that fills in a structure giving the time since the Epoch (January 1, 1970):

  int gettimeofday(struct timeval *tv, struct timezone *tz);

The timeval structure has seconds and microseconds:

  struct timeval {
      time_t      tv_sec;   /* seconds */
      suseconds_t tv_usec;  /* microseconds */
  };
gettimeofday

Usage:

  #include <sys/time.h>

  struct timeval tv1, tv2;
  double elapsed_time;
  ...
  gettimeofday(&tv1, NULL);
  ... /* Work to be timed */
  gettimeofday(&tv2, NULL);

  /* Convert time to seconds */
  elapsed_time = (tv2.tv_sec - tv1.tv_sec)
               + (tv2.tv_usec - tv1.tv_usec) / 1000000.0;
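Assembled into a complete program, a minimal sketch (do_work is a hypothetical stand-in for the code being timed):

  #include <stdio.h>
  #include <sys/time.h>

  /* Hypothetical work to be timed: sum a series to burn CPU time. */
  static double do_work(void) {
      double sum = 0.0;
      for (long i = 1; i <= 50000000L; i++)
          sum += 1.0 / (double)i;
      return sum;
  }

  int main(void) {
      struct timeval tv1, tv2;
      double elapsed_time;

      gettimeofday(&tv1, NULL);
      double result = do_work();          /* work to be timed */
      gettimeofday(&tv2, NULL);

      /* Convert the two timestamps to an elapsed time in seconds */
      elapsed_time = (tv2.tv_sec - tv1.tv_sec)
                   + (tv2.tv_usec - tv1.tv_usec) / 1000000.0;

      printf("result=%f elapsed=%f seconds\n", result, elapsed_time);
      return 0;
  }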
MPI_Wtime

Returns a single double-precision value: the number of seconds since some time in the past (most likely the Epoch). MPI also provides an MPI_Wtick() routine that returns the timer's resolution (most likely microseconds).
MPI_Wtime

Usage:

  #include "mpi.h"
  ...
  double start, end, resolution;

  MPI_Init(&argc, &argv);
  start = MPI_Wtime();          /* start time */
  ... /* Work to be timed */
  end = MPI_Wtime();            /* end time */
  resolution = MPI_Wtick();
  printf("elapsed= %e resolution= %e\n", end - start, resolution);
MPI_Wtime

Sample output:

  Wallclock times (secs): start= 1384864000.067529 end= 1384864000.074005
  elapsed= 6.475925e-03 resolution= 1.000000e-06

The elapsed time is in seconds, accurate to microseconds.
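For reference, a minimal compilable sketch along the same lines (usleep stands in for real work; compile with mpicc and launch with mpirun):

  #include <stdio.h>
  #include <unistd.h>
  #include "mpi.h"

  int main(int argc, char *argv[]) {
      double start, end, resolution;

      MPI_Init(&argc, &argv);

      start = MPI_Wtime();      /* start time */
      usleep(5000);             /* stand-in for the work to be timed */
      end = MPI_Wtime();        /* end time */

      resolution = MPI_Wtick();
      printf("elapsed= %e resolution= %e\n", end - start, resolution);

      MPI_Finalize();
      return 0;
  }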
read_real_time

read_real_time is a system call (IBM AIX) that fills in a structure giving the time since the Epoch (January 1, 1970):

  int read_real_time(timebasestruct_t *t, size_t size_of_timebasestruct_t);

It is designed to measure time to nanosecond accuracy, and it guarantees correct time units across different IBM RS/6000 architectures.
read_real_time

Usage:

  #include <sys/time.h>
  ...
  timebasestruct_t start, finish;
  int secs, n_secs;

  read_real_time(&start, TIMEBASE_SZ);
  /* do some work */
  read_real_time(&finish, TIMEBASE_SZ);

  /* Make sure both values are in seconds and nanoseconds */
  time_base_to_time(&start, TIMEBASE_SZ);
  time_base_to_time(&finish, TIMEBASE_SZ);
read_real_time

Usage continued:

  /* Subtract the starting time from the ending time */
  secs = finish.tb_high - start.tb_high;
  n_secs = finish.tb_low - start.tb_low;

  /* Fix carry from low-order to high-order during the measurement */
  if (n_secs < 0) {
      secs--;
      n_secs += 1000000000;
  }

  printf("Sample time was %d seconds %d nanoseconds\n", secs, n_secs);
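Assembled into one program, an untested AIX-only sketch (usleep is a stand-in for the work being timed; the calls and TIMEBASE_SZ come from the usage above):

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/time.h>

  int main(void) {
      timebasestruct_t start, finish;
      int secs, n_secs;

      read_real_time(&start, TIMEBASE_SZ);
      usleep(1000);                        /* stand-in for the work to be timed */
      read_real_time(&finish, TIMEBASE_SZ);

      /* Convert both values to seconds and nanoseconds */
      time_base_to_time(&start, TIMEBASE_SZ);
      time_base_to_time(&finish, TIMEBASE_SZ);

      /* Subtract start from finish, fixing the carry if needed */
      secs = finish.tb_high - start.tb_high;
      n_secs = finish.tb_low - start.tb_low;
      if (n_secs < 0) {
          secs--;
          n_secs += 1000000000;
      }

      printf("Sample time was %d seconds %d nanoseconds\n", secs, n_secs);
      return 0;
  }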
Profilers

- prof
- gprof
- Xprofiler
- mpiP
prof

Compile your program with the -p option:

  gcc -p <program>.c -o <program>

Run the program. A profile file called mon.out is created. Then run:

  prof -m mon.out
Sample Output from prof

  Name           %Time  Seconds  Cumsecs  #Calls  msec/call
  .fft            51.8     0.59     0.59    1024      0.576
  .main           40.4     0.46     1.05       1    460.
  .bit_reverse     7.9     0.09     1.14    1024      0.088
  .cos             0.0     0.00     1.14     256

(The remaining entries each used a negligible fraction of the time: .sin, .catopen, .setlocale, ._doprnt (7 calls), ._flsbuf (11), ._xflsbuf, ._wrtchk, ._findbuf, ._xwrite, .free (2), .free_y, .write, .exit, .memchr (19), .atoi, .__nl_langinfo_std (4), .gettimeofday (8), .printf.)
gprof

Compile your program with the -pg option:

  gcc -pg <program>.c -o <program>

Run the program. A profile file called gmon.out is created. Then run:

  gprof <program> gmon.out
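Either profiler needs a target with a clear hot spot. A hypothetical toy program (compile with gcc -p for prof, or gcc -pg for gprof); nearly all samples should land in hot_spot:

  #include <stdio.h>

  /* Deliberately expensive function: most samples should land here. */
  static double hot_spot(int n) {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          for (int j = 1; j <= 10000; j++)
              sum += (double)i / j;
      return sum;
  }

  /* Cheap function: should account for far fewer samples. */
  static double cheap(int n) {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += i;
      return sum;
  }

  int main(void) {
      printf("%f %f\n", hot_spot(20000), cheap(20000));
      return 0;
  }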
Sample Output from gprof

  granularity: Each sample hit covers 4 bytes. Time: 1.17 seconds

                                      called/total      parents
  index  %time   self   descendents   called+self   name         index
                                      called/total      children

                 0.44      0.72          1/1        .__start [2]
  [1]    99.1    0.44      0.72          1          .main [1]
                 0.59      0.13       1024/1024     .fft [3]
                 0.00      0.00        256/256      .cos [6]
                 0.00      0.00        256/256      .sin [7]
                 0.00      0.00          8/8        .gettimeofday [11]
                 0.00      0.00          7/7        .printf [16]
                 0.00      0.00          1/1        .atoi [31]
                 0.00      0.00          1/1        .exit [33]
xprofiler

An X Window System profiler based on gprof. Compile and run the program as you would for gprof, then run:

  xprofiler <program> gmon.out

It provides a graphical representation of the program's execution.
Library View (xprofiler screenshot)

Function View (xprofiler screenshot)
mpiP

Compile an MPI program with -g and link against the mpiP libraries:

  mpcc -g <program>.c -o <program> -L/usr/local/tools/mpiP/lib -lmpiP -lbfd

Run the MPI program as usual. A file is created called <program>.N.XXXXX.mpiP, where N is the number of processors and XXXXX is the collector task process id.
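mpiP requires no source changes beyond relinking. A hypothetical toy program whose Bcast and Barrier call sites would appear in the report (this is not the sphot code shown in the sample output below):

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[]) {
      int rank, value = 0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0)
          value = 42;                                      /* root produces a value */
      MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* Bcast call site */
      MPI_Barrier(MPI_COMM_WORLD);                         /* Barrier call site */

      if (rank == 0)
          printf("broadcast value = %d\n", value);

      MPI_Finalize();
      return 0;
  }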
Sample output from mpiP

  @ mpiP
  @ Command           : sphot
  @ Version           : 0.9
  @ Build date        : Mar 8 2001, 16:22:46
  @ Start time        : 2001 04 11 16:04:23
  @ Stop time         : 2001 04 11 16:04:51
  @ Number of tasks   : 4
  @ Collector Rank    : 0
  @ Collector PID     : 30088
  @ Event Buffer Size : 100000
  @ Final Trace Dir   : .
  @ Local Trace Dir   : /usr/tmp
  @ Task Map          : 0 blue333.pacific.llnl.gov 0
  @ Task Map          : 1 blue334.pacific.llnl.gov 0
  @ Task Map          : 2 blue335.pacific.llnl.gov 0
  @ Task Map          : 3 blue336.pacific.llnl.gov 0
Sample output from mpiP

  ----------------------------------------------------------------
  @--- MPI Time (seconds) ----------------------------------------
  Task    AppTime    MPITime    MPI%
     0       27.9       7.18   25.73
     1       27.9       7.5    26.89
     2       27.9       7.78   27.90
     3       27.9       7.73   27.72
     *        112      30.2    27.06
Sample output from mpiP

  ----------------------------------------------------------
  @--- Callsites: 38 ---------------------------------------
  ID  MPICall    ParentFunction  Filename    Line  PC
   1  Barrier    copyglob        copyglob.f    65  10000b9c
   2  Barrier    copypriv@OL@1   copypriv.f   195  10001cd4
   3  Barrier    copypriv@OL@2   copypriv.f   237  1000213c
   4  Barrier    copypriv@OL@3   copypriv.f   279  10002624
   5  Barrier    copypriv@OL@4   copypriv.f   324  10002b04
   6  Barrier    sphot           sphot.f      269  10008f2c
   7  Bcast      rdopac          rdopac.f      49  10008638
   8  Comm_rank  copyglob        copyglob.f    13  100003a8
   9  Comm_rank  copypriv        copypriv.f    75  10000c38
  10  Comm_rank  genxsec         genxsec.f     37  1000503c
  11  Comm_rank  rdinput         rdinput.f     17  100071d4
  ...
Sample output from mpiP

  -----------------------------------------------------------------
  @--- Aggregate Time (top twenty, descending, milliseconds) ------
  Call       Site      Time   App%   MPI%
  Bcast         7  1.54e+04  13.79  50.95
  Barrier       1  1.42e+04  12.73  47.03
  Barrier       2       563   0.50   1.87
  Waitall      34      25.7   0.02   0.09
  Reduce       25       7.4   0.01   0.02
  Barrier       5      2.54   0.00   0.01
  Barrier       6      1.55   0.00   0.01
  Barrier       4      1.44   0.00   0.00
  Comm_rank    13      1.22   0.00   0.00
  Barrier       3      1.01   0.00   0.00
  Comm_rank     9     0.967   0.00   0.00
  ...
Sample output from mpiP

  -------------------------------------------------------------------------
  @--- Callsite statistics (all, milliseconds): 102 -----------------------
  Name     Site  Rank  Count       Max      Mean       Min   App%   MPI%
  Barrier     1     0      1     0.087     0.087     0.087   0.00   0.00
  Barrier     1     1      1      12.7      12.7      12.7   0.05   0.17
  Barrier     1     2      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.17
  Barrier     1     3      1  7.09e+03  7.09e+03  7.09e+03  25.44  91.75
  Barrier     1     *      4  7.09e+03  3.55e+03     0.087  12.73  47.03
  Barrier     2     0      1      0.12      0.12      0.12   0.00   0.00
  Barrier     2     1      1      0.29      0.29      0.29   0.00   0.00
  Barrier     2     2      1       307       307       307   1.10   3.95
  Barrier     2     3      1       255       255       255   0.92   3.31
  Barrier     2     *      4       307       141      0.12   0.50   1.87
  Send       31     1      1     0.169     0.169     0.169   0.00   0.00
  Send       31     2      1     0.341     0.341     0.341   0.00   0.00
  Send       31     3      1     0.184     0.184     0.184   0.00   0.00
  Send       31     *      3     0.341     0.231     0.169   0.00   0.00
  ...
Questions