Slides Prepared from the CI-Tutor Courses at NCSA By S. Masoud Sadjadi School of Computing and Information Sciences Florida International University March 2009 (Additional Slides by Javier Delgado) Parallel Computing Explained Timing and Profiling
Agenda 1 Parallel Computing Overview 2 How to Parallelize a Code 3 Porting Issues 4 Scalar Tuning 5 Parallel Code Tuning 6 Timing and Profiling 6.1 Timing Timing a Section of Code CPU Time Wall clock Time Timing an Executable Timing a Batch Job 6.2 Profiling Profiling Tools Profile Listings Profiling Analysis 6.3 Further Information
Timing and Profiling Now that your program has been ported to the new computer, you will want to know how fast it runs. This chapter describes how to measure the speed of a program using various timing routines. The chapter also covers how to determine which parts of the program account for the bulk of the computational load so that you can concentrate your tuning efforts on those computationally intensive parts of the program. This chapter also gives a summary of some available profiling tools.
Timing In the following sections, we’ll discuss timers and review the profiling tools ssrun and prof on the Origin and vprof and gprof on the Linux Clusters. The specific timing functions described are:
Timing a section of code
    FORTRAN
        etime, dtime, cpu_time for CPU time
        time and f_time for wallclock time
    C
        clock for CPU time
        gettimeofday for wallclock time
Timing an executable
    time a.out
Timing a batch run
    busage
    qstat
    qhist
CPU Time etime A section of code can be timed using etime. It returns the elapsed CPU time in seconds since the program started.
    real*4 tarray(2),time1,time2,timeres
    … beginning of program
    time1=etime(tarray)
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    time2=etime(tarray)
    timeres=time2-time1
CPU Time dtime A section of code can also be timed using dtime. It returns the elapsed CPU time in seconds since the last call to dtime.
    real*4 tarray(2),timeres
    … beginning of program
    timeres=dtime(tarray)
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    timeres=dtime(tarray)
    … rest of program
CPU Time The etime and dtime Functions User time. This is returned as the first element of tarray. It’s the CPU time spent executing user code. System time. This is returned as the second element of tarray. It’s the time spent executing system calls on behalf of your program. Sum of user and system time. This is the function value that is returned. It’s the time that is usually reported. Metric. Timings are reported in seconds. Timings are accurate to 1/100th of a second.
CPU Time Timing Comparison Warnings For the SGI computers: The etime and dtime functions return the MAX time over all threads for a parallel program. This is the time of the longest thread, which is usually the master thread. For the Linux Clusters: The etime and dtime functions are contained in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib. Another warning: Do not put calls to etime and dtime inside a do loop. The overhead is too large.
CPU Time cpu_time On the systems described here, the cpu_time routine is available only on the Linux clusters, where it is provided by the Intel FORTRAN compiler library (cpu_time is a standard Fortran 95 intrinsic). It provides substantially higher resolution and has substantially lower overhead than the older etime and dtime routines. It can be used as an elapsed CPU timer by taking the difference between two calls.
    real*8 time1, time2, timeres
    … beginning of program
    call cpu_time(time1)
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    call cpu_time(time2)
    timeres=time2-time1
    … rest of program
CPU Time clock C programmers can either call the cpu_time routine through a FORTRAN wrapper or call the standard library function clock to determine elapsed CPU time.
    #include <time.h>
    static const double iCPS = 1.0/(double)CLOCKS_PER_SEC;
    double time1, time2, timeres;
    …
    time1=(clock()*iCPS);
    …
    /* do some work */
    …
    time2=(clock()*iCPS);
    timeres=time2-time1;
Wall clock Time time For the Origin, the function time returns the time since 00:00:00 GMT, Jan. 1, 1970. It is a means of getting the elapsed wall clock time. The wall clock time is reported in integer seconds.
    external time
    integer*4 time1,time2,timeres
    … beginning of program
    time1=time( )
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    time2=time( )
    timeres=time2 - time1
Wall clock Time f_time For the Linux clusters, the appropriate FORTRAN function for elapsed time is f_time.
    integer*8 f_time
    external f_time
    integer*8 time1,time2,timeres
    … beginning of program
    time1=f_time()
    … start of section of code to be timed
    … lots of computation
    … end of section of code to be timed
    time2=f_time()
    timeres=time2 - time1
As above for etime and dtime, the f_time function is in the VAX compatibility library of the Intel FORTRAN Compiler. To use this library include the compiler flag -Vaxlib.
Wall clock Time gettimeofday For C programmers, wallclock time can be obtained by using the very portable routine gettimeofday.
    #include <stddef.h>   /* definition of NULL */
    #include <sys/time.h> /* definition of timeval struct and prototype of gettimeofday */
    double t1,t2,elapsed;
    struct timeval tp;
    int rtn;
    ....
    rtn=gettimeofday(&tp, NULL);
    t1=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    ....
    /* do some work */
    ....
    rtn=gettimeofday(&tp, NULL);
    t2=(double)tp.tv_sec+(1.e-6)*tp.tv_usec;
    elapsed=t2-t1;
Timing an Executable To time an executable (if using a csh or tcsh shell, explicitly call /usr/bin/time):
    time …options… a.out
where options can be ‘-p’ for a simple output or ‘-f format’, which allows the user to display more than just time-related information. Consult the man pages on the time command for format options.
Timing a Batch Job Time of a batch job running or completed.
Origin
    busage jobid
Linux clusters
    qstat jobid    # for a running job
    qhist jobid    # for a completed job
Profiling Profiling determines where a program spends its time. It detects the computationally intensive parts of the code. Use profiling when you want to focus attention and optimization efforts on those loops that are responsible for the bulk of the computational load. Most codes follow the 90-10 Rule. That is, 90% of the computation is done in 10% of the code.
Profiling Tools Profiling Tools on the Origin On the SGI Origin2000 computer there are profiling tools named ssrun and prof. Used together they do profiling, or what is called hot spot analysis. They are useful for generating timing profiles.
ssrun The ssrun utility collects performance data for an executable that you specify. The performance data is written to a file named "executablename.exptype.id".
prof The prof utility analyzes the data file created by ssrun and produces a report.
Example
    ssrun -fpcsamp a.out
    prof -h a.out.fpcsamp.m12345 > prof.list
Profiling Tools Profiling Tools on the Linux Clusters On the Linux clusters the profiling tools are still maturing. There are currently several efforts to produce tools comparable to the ssrun and perfex tools.
gprof Basic profiling information can be generated using the OS utility gprof. First, compile the code with the compiler flags -p -g for the Intel compiler (-g on the Intel compiler does not change the optimization level) or -pg for the GNU compiler. Second, run the program. Finally, analyze the resulting gmon.out file using the gprof utility: gprof executable gmon.out.
    efc -O -p -g -o foo foo.f
    ./foo
    gprof foo gmon.out
The Performance API (PAPI) Provides an interface to hardware performance counters integrated in CPU Provides more in-depth details about resource utilization E.g. cache misses, instructions per second Used by perfex, mpitrace, perfsuite, and other profiling tools Requires kernel patch to deploy on Linux
Profiling Tools Profiling Tools on the Linux Clusters vprof On the IA32 platform there is a utility called vprof that provides performance information using the PAPI instrumentation library. Instrumenting the whole application requires recompiling and linking against the vprof and PAPI libraries.
    setenv VMON PAPI_TOT_CYC
    ifc -g -O -o md md.f /usr/apps/tools/vprof/lib/vmonauto_gcc.o -L/usr/apps/tools/lib -lvmon -lpapi
    ./md
    /usr/apps/tools/vprof/bin/cprof -e md vmon.out
Profile Listings Profile Listings on the Origin Prof Output First Listing The first listing gives the number of cycles executed in each procedure (or subroutine). The procedures are listed in descending order of cycle count.
[Listing columns: Cycles, %, Cum%, Secs, Proc. Procedures listed: VSUB, PFSOR, PBSOR, PFSOR, VADD, ITSRCG, ITSRSI, ITJSI, ITJCG]
Profile Listings Profile Listings on the Origin Prof Output Second Listing The second listing gives the number of cycles per source code line. The lines are listed in descending order of cycle count.
[Listing columns: Cycles, %, Cum%, Line, Proc. Procedures listed: VSUB, PFSOR, PBSOR, VSUB, PFSOR, VSUB, VADD, VSUB, VADD, PFSOR]
Profile Listings Profile Listings on the Linux Clusters gprof Output First Listing The listing gives a 'flat' profile of functions and routines encountered, sorted by 'self seconds', which is the number of seconds accounted for by this function alone.
[Flat profile columns: % time, cumulative seconds, self seconds, calls, self us/call, total us/call, name. Routines listed: compute_, dist_, SIND_SINCOS, sin, cos, dotr8_, update_, f_fioinit, f_intorange, mov, initialize_]
Profile Listings Profile Listings on the Linux Clusters gprof Output Second Listing The second listing gives a 'call-graph' profile of functions and routines encountered. The definitions of the columns are specific to the line in question. Detailed information is contained in the full output from gprof.
[Call graph columns: index, % time, self, children, called, name. Entries shown: main [1] calls compute_ [2], update_ [8], and initialize_ [12]; compute_ [2] is called from main [1] and calls dist_ [3] and dotr8_ [7]; dist_ [3] is called from compute_ [2]; further entries follow, beginning with SIND_SINCOS [4] …]
Profile Listings Profile Listings on the Linux Clusters vprof Listing
Columns correspond to the following events:
    PAPI_TOT_CYC - Total cycles (1956 events)
File Summary:
    100.0% /u/ncsa/gbauer/temp/md.f
Function Summary:
    84.4% compute
    15.6% dist
Line Summary:
    67.3% /u/ncsa/gbauer/temp/md.f:…
    … (further lines of md.f, including line 105)
The above listing, produced with the -e option to cprof, displays not only cycles consumed by functions (a flat profile) but also the lines in the code that contribute to those functions.
Profile Listings Profile Listings on the Linux Clusters vprof Listing (cont.)
    0.7% /u/ncsa/gbauer/temp/md.f:…
    … (further lines of md.f, including line 100)
    …
    100   do j=1,np
    101     if (i .ne. j) then
    102       call dist(nd,box,pos(1,i),pos(1,j),rij,d)
    103       ! attribute half of the potential energy to particle 'j'
    104       pot = pot + 0.5*v(d)
    105       do k=1,nd
    106         f(k,i) = f(k,i) - rij(k)*dv(d)/d
    107       enddo
    108     endif
    109   enddo
(In the listing, a cycle percentage is shown beside source lines 100, 102, 104-107, and 109.)
Profiling Analysis The program being analyzed in the previous Origin example has thousands of source code lines and consists of many subroutines. The first profile listing shows that over 50% of the computation is done inside the VSUB subroutine. The second profile listing shows that line 8106 in subroutine VSUB accounted for 50% of the total computation. Going back to the source code, line 8106 is a line inside a do loop. By putting an OpenMP compiler directive in front of that do loop, you can get 50% of the program to run in parallel with almost no work on your part. Because the compiler rearranges the source lines, the line numbers given by ssrun/prof only indicate an area of the code to inspect. To view the rearranged source, use the options
    f90 … -FLIST:=ON
    cc … -CLIST:=ON
For the Intel compilers, the appropriate options are
    ifort … -E …
    icc … -E …
MPE and Jumpshot
MPE is a tracing library that comes with MPI
Jumpshot is a graphical application for analyzing the MPE output
MPE requires inserting code at the specific locations to be analyzed
Display options are specified in the code (e.g. “Show MPI_Broadcast events in dotted blue lines”)
Jumpshot
Perfsuite
Collection of tools, utilities, and libraries for software performance analysis
Intel architectures only
Provides many in-depth statistics: operations per cycle, cache miss/hit data, etc.
Not difficult to use (but may be difficult to compile)
    mpiexec -np $NN psrun wrf.exe
    psprocess wrf.exe.NN_n.xml
Requires PAPI kernel patch for showing most information
Perfsuite + Graphical App
CEPBA Tools Developed at the European Center for Parallelism at Barcelona Currently not free Provide text-based and graphical applications for: Execution analysis and optimization Execution prediction 3 Main tools: Mpitrace, Dimemas, Paraver
CEPBA Tools Powerful, but complex Requires PAPI kernel patch for showing most information May require application to be recompiled Very large trace files for long executions and/or high number of processors (e.g. over 10GB)
CEPBA Tools Source: Barcelona Supercomputing Center
Visualizing with Paraver Process:
1. (Compile application with mpitrace libraries linked)
2. Execute application (and preload mpitrace libraries if not linked to the application)
3. Convert individual trace files to a Paraver file
4. “Chop” the Paraver trace file, if it is too big
Paraver Screenshots
Dimemas Estimate impact of code changes without changing the code Estimate execution time on slightly different architectures
Further Information
SGI Irix
    man etime
    man 3 time
    man 1 time
    man busage
    man timers
    man ssrun
    man prof
    Origin2000 Performance Tuning and Optimization Guide
Linux Clusters
    man 3 clock
    man 2 gettimeofday
    man 1 time
    man 1 gprof
    man 1B qstat
    Intel Compilers
    Vprof on NCSA Linux Cluster