GPTL: A simple and free general-purpose tool for performance analysis and profiling
April 8, 2014
Jim Rosinski
NOAA/ESRL

Outline
    Motivation and Basic Usage
    Auto-instrumentation
    Auto-profiling MPI routines
    Summary across threads and tasks
    Induced overhead
    Choice of underlying timing routine
    PAPI interface
    Utility functions
    Future work
NCAR SEA

Motivation
Needed something to simplify the following pattern for an arbitrary number of regions to be timed:
    struct timeval tp1, tp2;
    time = 0;
    for (i = 0; i < 10; i++) {
      gettimeofday (&tp1, 0);
      compute ();
      gettimeofday (&tp2, 0);
      delta = tp2.tv_sec - tp1.tv_sec + 1.e-6*(tp2.tv_usec - tp1.tv_usec);
      time += delta;
    }
    printf ("compute took %g seconds\n", time);

Solution
    #include...
    ret = GPTLinitialize ();
    ret = GPTLstart ("total");
    for (i = 0; i < 10; i++) {
      ret = GPTLstart ("compute");
      compute ();
      ret = GPTLstop ("compute");
      ...
    }
    ret = GPTLstop ("total");
    ret = GPTLpr (0);

Results
Output file timing.0 contains one row per timer (total, compute) with the columns Called and Wallclock.

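To make the start/stop pattern above concrete, here is a minimal sketch of a named-timer table in plain C. It is not GPTL's implementation: the names toy_start, toy_stop, toy_print and the fixed-size table with linear search are hypothetical simplifications chosen for illustration.

```c
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#define MAX_TIMERS 64   /* hypothetical fixed capacity */
#define MAX_NAME   32   /* hypothetical name length limit, including NUL */

/* One accumulated timing region (toy structure, not GPTL's). */
typedef struct {
    char   name[MAX_NAME];
    long   ncalls;      /* completed start/stop pairs */
    double wall;        /* accumulated wallclock seconds */
    double last_start;  /* timestamp of the pending start */
    int    running;     /* nonzero between start and stop */
} toy_timer;

static toy_timer timers[MAX_TIMERS];
static int ntimers = 0;

/* Wallclock in seconds from gettimeofday(). */
static double toy_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.e-6 * (double)tv.tv_usec;
}

/* Linear search by name; creates the entry when create is nonzero. */
static toy_timer *toy_find(const char *name, int create)
{
    for (int i = 0; i < ntimers; i++)
        if (strcmp(timers[i].name, name) == 0)
            return &timers[i];
    if (!create || ntimers == MAX_TIMERS || strlen(name) >= MAX_NAME)
        return NULL;
    toy_timer *t = &timers[ntimers++];
    memset(t, 0, sizeof *t);
    strcpy(t->name, name);
    return t;
}

/* Start a named timer; returns 0 on success, -1 on error. */
int toy_start(const char *name)
{
    toy_timer *t = toy_find(name, 1);
    if (t == NULL || t->running)
        return -1;
    t->running = 1;
    t->last_start = toy_now();
    return 0;
}

/* Stop a named timer and accumulate its elapsed time. */
int toy_stop(const char *name)
{
    double now = toy_now();   /* read the clock before any lookup work */
    toy_timer *t = toy_find(name, 0);
    if (t == NULL || !t->running)
        return -1;
    t->wall += now - t->last_start;
    t->ncalls++;
    t->running = 0;
    return 0;
}

/* Print one row per timer: name, call count, accumulated wallclock. */
void toy_print(FILE *fp)
{
    fprintf(fp, "%-16s %8s %12s\n", "name", "Called", "Wallclock");
    for (int i = 0; i < ntimers; i++)
        fprintf(fp, "%-16s %8ld %12.6f\n",
                timers[i].name, timers[i].ncalls, timers[i].wall);
}
```

Calling toy_start ("compute") and toy_stop ("compute") around the work inside the loop, then toy_print (stdout), yields one row per region, which is the shape of output the Results slide describes.
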
Most of the API
    #include...
    ret = GPTLsetoption (PAPI_FP_OPS, 1);   // Enable a PAPI counter
    ret = GPTLsetutr (GPTLnanotime);        // Better wallclock timer
    ...
    ret = GPTLinitialize ();                // Once per process
    ret = GPTLstart ("total");              // Start a timer
    ret = GPTLstart ("compute");            // Start another timer
    compute ();                             // Do work
    ret = GPTLstop ("compute");             // Stop a timer
    ...
    ret = GPTLstop ("total");               // Stop a timer
    ret = GPTLpr (iam);                     // Print results
    ret = GPTLpr_summary (MPI_COMM_WORLD);  // Print results summary
                                            // across threads and tasks

Set options via Fortran namelist
Avoid recoding/recompiling by using the Fortran namelist option:
    call gptlprocess_namelist ('my_namelist', unitno, ret)
Example contents of 'my_namelist':
    &gptlnl
     utr = 'nanotime'
     eventlist = 'GPTL_CI','PAPI_FP_OPS'
    /

Auto-instrumentation
Works with Intel, GNU, Pathscale, PGI, AIX
    # icc -g -finstrument-functions *.c -lgptl
    # gfortran -g -finstrument-functions *.f90 -lgptl
    # pgcc -g -Minstrument:functions *.c -lgptl
The compiler automatically inserts a call at each function start:
    __cyg_profile_func_enter (void *this_fn, void *call_site);
And at each function exit:
    __cyg_profile_func_exit (void *this_fn, void *call_site);

Auto-instrumentation (cont'd)
GPTL handles these entry points with:
    void __cyg_profile_func_enter (void *this_fn, void *call_site)
    {
      (void) GPTLstart_instr (this_fn);
    }
    void __cyg_profile_func_exit (void *this_fn, void *call_site)
    {
      (void) GPTLstop_instr (this_fn);
    }

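The same hook mechanism can be tried without GPTL. The sketch below, assuming a GNU-compatible compiler, defines the two entry points itself and only counts calls and tracks nesting depth; the counters and hook_report are hypothetical names for illustration. The hooks carry the no_instrument_function attribute so that they are not instrumented themselves, which would otherwise recurse.

```c
#include <stdio.h>

/* Counters updated by the hooks (illustrative only). */
static long n_enter = 0;    /* number of function entries seen */
static long n_exit = 0;     /* number of function exits seen */
static int depth = 0;       /* current nesting depth */
static int max_depth = 0;   /* deepest nesting observed */

/* Keep the hooks and the reporter out of the instrumentation. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void hook_report(FILE *fp) __attribute__((no_instrument_function));

/* Called at the start of every instrumented function. */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn;     /* address of the function being entered */
    (void)call_site;   /* address it was called from */
    n_enter++;
    if (++depth > max_depth)
        max_depth = depth;
}

/* Called at the end of every instrumented function. */
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    n_exit++;
    depth--;
}

/* Print what the hooks have recorded so far. */
void hook_report(FILE *fp)
{
    fprintf(fp, "entries=%ld exits=%ld max depth=%d\n",
            n_enter, n_exit, max_depth);
}
```

Compiled together with application code using -finstrument-functions, every instrumented function passes through these hooks; without that flag the hooks are simply never called.
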
Auto-instrumentation (cont'd)
After running the app, convert addresses to names with:
    hex2name.pl [-demangle]

Dynamic call tree from auto-instrumentation
Stats for thread 0, with columns Called, Wallclock, max, min, FP_OPS. Regions in call-tree order:
      total
      HPCC_Init
    * HPL_pdinfo
    * HPL_all_reduce
    * HPL_broadcast
      HPL_pdlamch
    * HPL_fprintf
      HPCC_InputFileInit
      ReadInts
      PTRANS
      MaxMem
    * iceil_
    * ilcm_
      param_dump
      Cblacs_get
      Cblacs_gridmap
    * Cblacs_pinfo
    * Cblacs_gridinfo

MPI Auto-instrumentation
To enable MPI auto-instrumentation, set this in macros.make:
 – ENABLE_PMPI=yes

MPI Auto-instrumentation (cont'd)
Stats for thread 0, with columns Called, Wallclock, max, min, AVG_MPI_BYTES. Regions reported:
    MPI_Init_thru_Finalize
    MPI_Send
    MPI_Recv
    MPI_Ssend
    MPI_Issend
    MPI_Sendrecv
    MPI_Irecv
    MPI_Isend
    MPI_Wait
    MPI_Waitall
    MPI_Barrier
    MPI_Bcast
AVG_MPI_BYTES is shown as "-" for MPI_Init_thru_Finalize, MPI_Wait, MPI_Waitall and MPI_Barrier; the other routines report an average message size.

Induced Overhead
GPTL estimates its own overhead:
    overhead of 1 GPTLstart or GPTLstop call=1.28e-07 seconds
    Components are as follows:
    Fortran layer:             1.0e-09 =   1.5% of total
    Get thread number:         1.7e-08 =  13.3% of total
    Generate hash index:       1.9e-08 =  14.8% of total
    Find hashtable entry:      1.5e-08 =  11.7% of total
    Underlying timing routine: 7.0e-08 =  53.2% of total
    Misc start/stop functions: 7.0e-09 =   5.5% of total

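The largest component above is the underlying timing routine. A rough way to measure that cost on a given machine is to call the routine many times and divide the elapsed time by the call count. The sketch below does this for gettimeofday(), bracketing the loop with clock_gettime(CLOCK_MONOTONIC); the function name estimate_timer_overhead is hypothetical, and this is not how GPTL produces its own estimate.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* Monotonic wallclock in seconds, used only to bracket the test loop. */
static double mono_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.e-9 * (double)ts.tv_nsec;
}

/*
 * Average cost in seconds of one gettimeofday() call, measured over n calls.
 * Returns a negative value if n is not positive.
 */
double estimate_timer_overhead(long n)
{
    struct timeval tv;

    if (n <= 0)
        return -1.0;

    double t0 = mono_seconds();
    for (long i = 0; i < n; i++)
        gettimeofday(&tv, NULL);   /* the routine being measured */
    double t1 = mono_seconds();

    return (t1 - t0) / (double)n;
}
```

A start/stop pair pays this cost twice, once per call, so a result near 7e-08 seconds would be consistent with the breakdown shown on the slide.
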
Induced Overhead (cont'd)
Stats for thread 0, one row per timer starting with total, with columns Called, Wallclock, max, min, self_OH, parent_OH.

Underlying timing routine
Default is gettimeofday()
On Intel architectures, switch to a register read, which has better granularity and much lower overhead:
 – C or Fortran: GPTLsetutr(GPTLnanotime);
 – Fortran: utr = 'nanotime' in namelist &gptlnl
 – May cause problems on machines with variable clock rate (e.g. "turbo mode")

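A selectable timing routine can be sketched with a function pointer. The example below is an illustration only, not GPTL's internals: utr_select, utr_now and utr_granularity are hypothetical names, and clock_gettime(CLOCK_MONOTONIC) stands in for the register read as a portable higher-resolution alternative to gettimeofday().

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

typedef double (*utr_fn)(void);   /* a timing routine returns seconds */

typedef enum { UTR_GETTIMEOFDAY, UTR_CLOCK_GETTIME } utr_choice;

/* Microsecond-resolution wallclock. */
static double utr_gettimeofday(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + 1.e-6 * (double)tv.tv_usec;
}

/* Nanosecond-resolution monotonic clock (stand-in for a register read). */
static double utr_clock_gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.e-9 * (double)ts.tv_nsec;
}

/* The default mirrors the slide: gettimeofday(). */
static utr_fn current_utr = utr_gettimeofday;

/* Choose the underlying timing routine; returns 0 on success, -1 if unknown. */
int utr_select(utr_choice choice)
{
    switch (choice) {
    case UTR_GETTIMEOFDAY:  current_utr = utr_gettimeofday;  return 0;
    case UTR_CLOCK_GETTIME: current_utr = utr_clock_gettime; return 0;
    default:                return -1;
    }
}

/* Read the currently selected clock. */
double utr_now(void)
{
    return current_utr();
}

/*
 * Smallest observed step between two distinct readings of the selected
 * clock over the given number of tries; negative if tries is not positive.
 */
double utr_granularity(int tries)
{
    double best = -1.0;
    for (int i = 0; i < tries; i++) {
        double t0 = utr_now(), t1;
        do {
            t1 = utr_now();          /* spin until the clock advances */
        } while (t1 <= t0);
        if (best < 0.0 || t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Comparing utr_granularity for the two choices gives a feel for the granularity difference the slide refers to; the observed step also includes the cost of the call itself.
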
PAPI details handled by GPTL
This call:
    GPTLsetoption (PAPI_FP_OPS, 1);
Implies:
    PAPI_library_init (PAPI_VER_CURRENT);
    PAPI_thread_init ((unsigned long (*)(void)) (pthread_self));
    PAPI_create_eventset (&EventSet[t]);
    PAPI_assign_eventset_component (EventSet[t], 0);
    PAPI_multiplex_init ();
    PAPI_set_multiplex (EventSet[t]);
    PAPI_add_event (EventSet[t], PAPI_FP_OPS);
    PAPI_start (EventSet[t]);
PAPI multiplexing is handled automatically and enabled only if needed.

timing.summary file generated by GPTLpr_summary(comm)
Columns: name, ncalls, nranks, mean_time, std_dev, wallmax (rank), wallmin (rank)
Ranks holding the maximum and minimum wallclock for each region:
    name             wallmax rank   wallmin rank
    Diag                  0              1
    MainLoop              0              1
    ZeroTendencies        0              1
    SaveFlux              0              1
    RHStendencies         0              1
    Vdtotal               0              1
    Vdm                   0              1
    vdmfinish             0              1
    Vdn                   0              1
    Flux                  1              0
    Force                 1              0
    RKdiff                0              1
    TimeDiff              0              1
    Sponge                0              1
    pre_trisol            0              1
    Trisol                1              0
    post_trisol           0              1
    Vdmints               0              1
    Pstadv                1              0

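The per-region statistics in this file can be illustrated without MPI. The sketch below takes one wallclock value per rank, as if already gathered onto a single task, and computes the mean, standard deviation, and the maximum and minimum with their ranks. The names region_summary and summarize_region are hypothetical, and the population form of the standard deviation (divide by nranks) is an assumption; the slide does not say which form GPTL uses.

```c
#include <stddef.h>

/* Summary of one timed region across ranks (illustrative structure). */
typedef struct {
    double mean;      /* mean wallclock over ranks */
    double std_dev;   /* population standard deviation (assumed form) */
    double wallmax;   /* largest wallclock */
    double wallmin;   /* smallest wallclock */
    int    max_rank;  /* first rank holding wallmax */
    int    min_rank;  /* first rank holding wallmin */
} region_summary;

/* Newton iteration for sqrt, used here to avoid linking the math library. */
static double newton_sqrt(double x)
{
    if (x <= 0.0)
        return 0.0;
    double g = (x > 1.0) ? x : 1.0;
    for (int i = 0; i < 2000; i++) {
        double next = 0.5 * (g + x / g);
        if (next == g)
            break;
        g = next;
    }
    return g;
}

/*
 * Summarize per-rank times t[0..nranks-1].
 * Returns 0 on success, -1 on invalid arguments.
 */
int summarize_region(const double *t, int nranks, region_summary *out)
{
    if (t == NULL || out == NULL || nranks <= 0)
        return -1;

    double sum = 0.0;
    out->wallmax = out->wallmin = t[0];
    out->max_rank = out->min_rank = 0;

    for (int r = 0; r < nranks; r++) {
        sum += t[r];
        if (t[r] > out->wallmax) { out->wallmax = t[r]; out->max_rank = r; }
        if (t[r] < out->wallmin) { out->wallmin = t[r]; out->min_rank = r; }
    }
    out->mean = sum / nranks;

    double sumsq = 0.0;   /* sum of squared deviations from the mean */
    for (int r = 0; r < nranks; r++) {
        double d = t[r] - out->mean;
        sumsq += d * d;
    }
    out->std_dev = newton_sqrt(sumsq / nranks);
    return 0;
}
```

A large gap between wallmax and wallmin for a region, or a large std_dev relative to mean_time, points to load imbalance across ranks.
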
Utility functions
To print current memory usage at any point in your code:
 – ret = GPTLprint_memusage ("user string");
Produces e.g.
 – GPTLprint_memusage: user string size=19.5 MB rss=2.1 MB datastack=1.5 MB
To auto-profile current memory usage (at both function entry and exit points):
 – ret = GPTLsetoption (GPTLdopr_memusage, 1);
To retrieve wallclock, usr, sys timestamps into user code:
 – ret = GPTLstamp (&wallclock, &usr, &sys);

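Similar information is available from standard POSIX calls. The sketch below is not GPTL's implementation: toy_stamp and toy_max_rss are hypothetical names, the wallclock here is seconds since the epoch from gettimeofday(), and CPU times and peak resident set size come from getrusage().

```c
#include <stddef.h>
#include <sys/resource.h>
#include <sys/time.h>

/* Convert a struct timeval to seconds. */
static double tv_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + 1.e-6 * (double)tv.tv_usec;
}

/*
 * Fill in wallclock (seconds since the epoch), user CPU time and system
 * CPU time for the calling process. Returns 0 on success, -1 on error.
 */
int toy_stamp(double *wallclock, double *usr, double *sys)
{
    struct timeval now;
    struct rusage ru;

    if (wallclock == NULL || usr == NULL || sys == NULL)
        return -1;
    if (gettimeofday(&now, NULL) != 0)
        return -1;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    *wallclock = tv_seconds(now);
    *usr = tv_seconds(ru.ru_utime);
    *sys = tv_seconds(ru.ru_stime);
    return 0;
}

/*
 * Peak resident set size as reported by getrusage(), or -1 on error.
 * Units are platform dependent (kilobytes on Linux, bytes on macOS).
 */
long toy_max_rss(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}
```

Taking two stamps around a region and differencing them separates time spent computing (usr), time spent in the kernel (sys), and time spent waiting (the remainder of the wallclock difference).
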
Future Work
    XML output
    Port to GPU
    Dynamic thread allocation for PTHREADS option
    Autoconf?

Source and Documentation
Source:
 – git clone
Web-based documentation:
 – jmrosinski.github.io/GPTL
Feel free to email me: