NUG Meeting
Performance Profiling Using hpmcount, poe+ & libhpm
Richard Gerber, NERSC User Services

Introduction
  How to obtain performance numbers
  Tools based on IBM's PMAPI
  Relevant for FY2003 ERCAP

Agenda
  Low-level PAPI interface
  HPM Toolkit
    – hpmcount
    – poe+
  libhpm: hardware performance library

Overview
  These tools are used for performance measurement
  All of them can be used to tune applications as well as to measure performance
  Performance numbers are needed for FY2003 ERCAP applications

Vocabulary
  PMAPI: IBM's low-level interface
  PAPI: Performance API (portable)
  hpmcount, poe+: report overall code performance
  libhpm: can be used to instrument portions of code

PAPI
  Standard application programming interface
  Portable; do not confuse it with IBM's low-level PMAPI interface
  Can access hardware counter information
  V2.1 at NERSC
  See – –

Using PAPI
  PAPI is available through a module
    – module load papi
  You place calls in source code and compile with $PAPI
    – xlf -O3 source.F $PAPI

      #include "fpapi.h"
      …
      integer*8 values(2)
      integer counters(2), ncounters, irc
      …
      irc = PAPI_VER_CURRENT
      CALL papif_library_init(irc)
      counters(1) = PAPI_FMA_INS
      counters(2) = PAPI_FP_INS
      ncounters = 2
      CALL papif_start_counters(counters, ncounters, irc)
      …
      CALL papif_stop_counters(values, ncounters, irc)
      write(6,*) 'Total FMA ', values(1), ' Total FP ', values(2)
      …

hpmcount
  Easy to use
  Does not affect code performance
  Profiles entire code
  Uses hardware counters
  Reports flip (floating point instruction) rate and many other quantities

hpmcount usage
  Serial
    – % hpmcount executable
  Parallel
    – % poe hpmcount executable -nodes n -procs np
  Gives performance numbers for each task
  Prints output to STDOUT (or use -o filename)
  Beware! These forms profile the poe command, not your code:
    – hpmcount poe executable
    – hpmcount executable (if compiled with mp* compilers)

hpmcount example
  ex1.f - Unoptimized matrix-matrix multiply

      % xlf90 -o ex1 -O3 -qstrict ex1.f
      % hpmcount ./ex1
      hpmcount (V 2.3.1) summary
      Total execution time (wall clock time): seconds
      ######## Resource Usage Statistics ########
      Total amount of time in user mode : seconds
      Total amount of time in system mode : seconds
      Maximum resident set size : 3116 Kbytes
      Average shared memory use in text segment : 6900 Kbytes*sec
      Average unshared memory use in data segment : Kbytes*sec
      Number of page faults without I/O activity : 785
      Number of page faults with I/O activity : 1
      Number of times process was swapped out : 0
      Number of times file system performed INPUT : 0
      Number of times file system performed OUTPUT : 0
      Number of IPC messages sent : 0
      Number of IPC messages received : 0
      Number of signals delivered : 0
      Number of voluntary context switches : 1
      Number of involuntary context switches : 1727
      ####### End of Resource Statistics ########

hpmcount output
  ex1.f - Unoptimized matrix-matrix multiply

      % xlf90 -o ex1 -O3 -qstrict ex1.f
      % hpmcount ./ex1
      PM_CYC (Cycles) :
      PM_INST_CMPL (Instructions completed) :
      PM_TLB_MISS (TLB misses) :
      PM_ST_CMPL (Stores completed) :
      PM_LD_CMPL (Loads completed) :
      PM_FPU0_CMPL (FPU 0 instructions) :
      PM_FPU1_CMPL (FPU 1 instructions) :
      PM_EXEC_FMA (FMAs executed) :
      Utilization rate : %
      Avg number of loads per TLB miss :
      Load and store operations : M
      Instructions per load/store :
      MIPS :
      Instructions per cycle :
      HW Float points instructions per Cycle :
      Floating point instructions + FMAs : M
      Float point instructions + FMA rate : Mflip/s
      FMA percentage : %
      Computation intensity : 1.008

Floating point measures
  PM_FPU0_CMPL (FPU 0 instructions), PM_FPU1_CMPL (FPU 1 instructions)
    – The POWER3 processor has two Floating Point Units (FPUs), which operate in parallel.
    – Each FPU can start a new instruction every cycle.
    – These counters give the number of floating point instructions (add, multiply, subtract, divide, multiply+add) executed by each FPU.
  PM_EXEC_FMA (FMAs executed)
    – The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).

Total flop rate
  Float point instructions + FMA rate
    – Float point instructions + FMAs gives the number of floating point operations. The two are added together because an FMA instruction yields 2 floating point operations.
    – The rate gives the code's Mflops.
    – The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops per FMA instruction).
    – Our example: 22 Mflops.

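The peak-rate arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, the counter values passed to `mflops_from_counters` would have to come from your own hpmcount output, and the only numbers taken from the slides are the 375 MHz clock, the two FPUs, the 2 flops per FMA, and the 22 Mflops example.

```python
# Sketch: peak rate and achieved fraction of peak for one POWER3 CPU.
# Function names are illustrative, not part of any NERSC or IBM tool.

def peak_mflops(clock_mhz, n_fpus, flops_per_fma):
    """Peak rate if every FPU completes one FMA per cycle."""
    return clock_mhz * n_fpus * flops_per_fma

def mflops_from_counters(fpu0, fpu1, fma, seconds):
    """Flop rate from hpmcount-style counts.

    An FMA is already counted once as an FPU instruction, so adding the
    FMA count a second time credits it with its second operation.
    """
    return (fpu0 + fpu1 + fma) / seconds / 1.0e6

peak = peak_mflops(375, 2, 2)   # values from the slide
example_rate = 22.0             # Mflops, from the slide

print(f"peak = {peak} Mflops")
print(f"example reaches {100.0 * example_rate / peak:.1f}% of peak")
```

With the slide's numbers this prints a peak of 1500 Mflops and shows the unoptimized example running at about 1.5% of peak.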
Memory access
  Average number of loads per TLB miss
    – Memory addresses that are in the Translation Lookaside Buffer (TLB) can be accessed quickly.
    – Each time a TLB miss occurs, a new page (4 KB, 512 8-byte elements) is brought into the buffer.
    – A value of ~500 means each element is accessed ~1 time while the page is in the buffer.
    – A small value indicates that needed data is stored in widely separated places in memory; a redesign of data structures may help performance significantly.
    – Our example: 2.0

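The "~500" rule of thumb follows from the page size. The sketch below assumes 8-byte (double precision) elements and uses a deliberately rough model in which one TLB miss corresponds to one page of data; the helper names are hypothetical.

```python
# Sketch: why ~500 loads per TLB miss means "each element touched about once".
PAGE_BYTES = 4 * 1024   # 4 KB page, from the slide
ELEMENT_BYTES = 8       # assumption: 8-byte (double precision) elements

def elements_per_page(page_bytes=PAGE_BYTES, element_bytes=ELEMENT_BYTES):
    """Number of array elements that fit in one page."""
    return page_bytes // element_bytes

def accesses_per_element(loads_per_tlb_miss):
    """Rough average number of loads per element of a page, per TLB miss."""
    return loads_per_tlb_miss / elements_per_page()

print(elements_per_page())          # 512
print(accesses_per_element(2.0))    # 0.00390625
```

Under this model, the example's value of 2.0 means only about 2 of the 512 elements on a page are loaded before the next miss.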
Cache hits
  The -sN option to hpmcount specifies a different statistics set
  -s2 will include the L1 data cache hit rate
    – 33.4% for our example
    – See for more options and descriptions.

Optimizing the code
  Original code fragment

      DO I=1,N
        DO K=1,N
          DO J=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

Optimizing the code
  "Optimized" code: move I to inner loop

      DO J=1,N
        DO K=1,N
          DO I=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

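One way to see why moving I to the inner loop helps is to compute how far apart in memory consecutive inner-loop accesses are. Fortran stores arrays in column-major order, so A(i,j) sits at element offset (i-1) + (j-1)*n. The Python sketch below is illustrative only: the helper names and the value of N are hypothetical and do not come from the slides.

```python
# Sketch: element stride of each array reference when the inner loop
# index advances by one, for Fortran column-major storage.

def offset(i, j, n):
    """0-based element offset of A(i,j) in an n-by-n column-major array
    (i and j are 1-based, as in Fortran)."""
    return (i - 1) + (j - 1) * n

def inner_strides(inner, n):
    """Stride of Z(I,J), X(I,K) and Y(K,J) when index `inner` goes up by 1."""
    base = {"I": 1, "J": 1, "K": 1}
    nxt = dict(base)
    nxt[inner] += 1
    refs = {"Z": ("I", "J"), "X": ("I", "K"), "Y": ("K", "J")}
    return {
        name: offset(nxt[r], nxt[c], n) - offset(base[r], base[c], n)
        for name, (r, c) in refs.items()
    }

N = 1000  # hypothetical matrix dimension

print(inner_strides("J", N))  # original:  {'Z': 1000, 'X': 0, 'Y': 1000}
print(inner_strides("I", N))  # optimized: {'Z': 1, 'X': 1, 'Y': 0}
```

With J innermost, Z and Y jump N elements on every iteration, so consecutive accesses can land on different pages and cache lines. With I innermost, every reference is either unit-stride or loop-invariant.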
Optimized results
  Float point instructions + FMA rate
    – 461 vs. 22 Mflips (ESSL: 933)
  Avg number of loads per TLB miss
    – 20,877 vs. 2.0 (ESSL: 162)
  L1 cache hit rate
    – 98.9% vs. 33.4%

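To put these numbers in proportion, the short Python sketch below computes the improvement factors and the fraction of the 1500 Mflops POWER3 peak reached by each version. All input values are taken from the slides; the variable and function names are illustrative.

```python
# Sketch: improvement factors for the reordered loop, using slide values.
PEAK_MFLOPS = 1500.0  # POWER3 peak quoted earlier in the talk

# metric: (original, optimized)
results = {
    "Mflip/s": (22.0, 461.0),
    "loads per TLB miss": (2.0, 20877.0),
    "L1 hit rate (%)": (33.4, 98.9),
}

def improvement(before, after):
    """Ratio of the optimized value to the original value."""
    return after / before

for name, (before, after) in results.items():
    print(f"{name}: {before} -> {after} ({improvement(before, after):.1f}x)")

# Fraction of peak for the original loop, the reordered loop, and ESSL.
for label, rate in (("original", 22.0), ("optimized", 461.0), ("ESSL", 933.0)):
    print(f"{label}: {100.0 * rate / PEAK_MFLOPS:.1f}% of peak")
```

The loop reordering alone gives roughly a 21x higher flip rate, and the ESSL rate is about twice that again.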
Using libhpm
  libhpm can instrument code sections
  Embed calls into source code
    – Fortran, C, C++
  Contained in the hpmtoolkit module
    – module load hpmtoolkit
  Compile with $HPMTOOLKIT
    – xlf -O3 source.F $HPMTOOLKIT
  Execute program normally

libhpm example

      …
      #include "f_hpm.h"
      …
      CALL f_hpminit(0, "someid")
      CALL f_hpmstart(1, "matrix-matrix multiply")
      DO J=1,N
        DO K=1,N
          DO I=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
      CALL f_hpmstop(1)
      CALL f_hpmterminate(0)
      …

Parallel programs
  poe hpmcount executable -nodes n -procs np
    – Will print output to STDOUT separately for each task
  poe+ executable -nodes n -procs np
    – Will print aggregate numbers to STDOUT
  libhpm
    – Writes output to a separate file for each task
  Do not do these!
    – hpmcount poe executable …
    – hpmcount executable (if compiled with mp* compiler)

Summary
  Utilities to measure performance
    – PAPI
    – hpmcount
    – poe+
    – libhpm
  You need to quote performance data in your ERCAP application

Where to Get More Information
  NERSC Website: hpcf.nersc.gov
  PAPI –
  hpmcount, poe+ – –
  libhpm –