NUG Meeting
Performance Profiling Using hpmcount, poe+ & libhpm
Richard Gerber, NERSC User Services
ragerber@nersc.gov, 510-486-6820

Introduction
How to obtain performance numbers
Tools based on IBM's PMAPI
Relevant for FY2003 ERCAP

Agenda
Low Level PAPI Interface
HPM Toolkit
  - hpmcount
  - poe+
libhpm: hardware performance library

Overview
These tools are used for performance measurement
All can be used to tune applications and measure performance
Needed for FY 2003 ERCAP applications

Vocabulary
PMAPI: IBM's low-level interface
PAPI: Performance API (portable)
hpmcount, poe+ report overall code performance
libhpm can be used to instrument portions of code

PAPI
Standard application programming interface
Portable; don't confuse it with IBM's low-level PMAPI interface
Can access hardware counter info
V2.1 at NERSC
See
  - http://hpcf.nersc.gov/software/papi.html
  - http://icl.cs.utk.edu/projects/papi/

Using PAPI
PAPI is available through a module
  - module load papi
You place calls in source code
  - xlf -O3 source.F $PAPI
      #include "fpapi.h"
      …
      integer*8 values(2)
      integer counters(2), ncounters, irc
      …
      irc = PAPI_VER_CURRENT
      CALL papif_library_init(irc)
      counters(1)=PAPI_FMA_INS
      counters(2)=PAPI_FP_INS
      ncounters=2
      CALL papif_start_counters(counters,ncounters,irc)
      …
      call papif_stop_counters(values,ncounters,irc)
      write(6,*) 'Total FMA ',values(1), ' Total FP ', values(2)
      …

hpmcount
Easy to use
Does not affect code performance
Profiles entire code
Uses hardware counters
Reports flip (floating point instruction) rate and many other quantities

hpmcount usage
Serial
  - % hpmcount executable
Parallel
  - % poe hpmcount executable -nodes n -procs np
Gives performance numbers for each task
Prints output to STDOUT (or use -o filename)
Beware! These profile the poe command:
  - hpmcount poe executable
  - hpmcount executable (if compiled with mp* compilers)

hpmcount example
ex1.f - Unoptimized matrix-matrix multiply
  % xlf90 -o ex1 -O3 -qstrict ex1.f
  % hpmcount ./ex1
  hpmcount (V 2.3.1) summary
  Total execution time (wall clock time): 17.258385 seconds
  ######## Resource Usage Statistics ########
  Total amount of time in user mode            : 17.220000 seconds
  Total amount of time in system mode          : 0.040000 seconds
  Maximum resident set size                    : 3116 Kbytes
  Average shared memory use in text segment    : 6900 Kbytes*sec
  Average unshared memory use in data segment  : 5344036 Kbytes*sec
  Number of page faults without I/O activity   : 785
  Number of page faults with I/O activity      : 1
  Number of times process was swapped out      : 0
  Number of times file system performed INPUT  : 0
  Number of times file system performed OUTPUT : 0
  Number of IPC messages sent                  : 0
  Number of IPC messages received              : 0
  Number of signals delivered                  : 0
  Number of voluntary context switches         : 1
  Number of involuntary context switches       : 1727
  ####### End of Resource Statistics ########

hpmcount output
ex1.f - Unoptimized matrix-matrix multiply
  % xlf90 -o ex1 -O3 -qstrict ex1.f
  % hpmcount ./ex1
  PM_CYC (Cycles)                        : 6428126205
  PM_INST_CMPL (Instructions completed)  : 693651174
  PM_TLB_MISS (TLB misses)               : 122468941
  PM_ST_CMPL (Stores completed)          : 125758955
  PM_LD_CMPL (Loads completed)           : 250513627
  PM_FPU0_CMPL (FPU 0 instructions)      : 249691884
  PM_FPU1_CMPL (FPU 1 instructions)      : 3134223
  PM_EXEC_FMA (FMAs executed)            : 126535192
  Utilization rate                       : 99.308 %
  Avg number of loads per TLB miss       : 2.046
  Load and store operations              : 376.273 M
  Instructions per load/store            : 1.843
  MIPS                                   : 40.192
  Instructions per cycle                 : 0.108
  HW Float points instructions per Cycle : 0.039
  Floating point instructions + FMAs     : 379.361 M
  Float point instructions + FMA rate    : 21.981 Mflip/s
  FMA percentage                         : 66.710 %
  Computation intensity                  : 1.008
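The derived lines in this output can be reproduced from the raw counters and the wall clock time. The sketch below is one way to do that in Python; the formulas are inferred from the printed numbers rather than taken from the hpmcount source, and the function and key names are made up for illustration. The utilization rate is left out because its formula could not be recovered from these numbers alone.

```python
# Raw values copied from the hpmcount run shown above.
WALL_SECONDS = 17.258385
COUNTERS = {
    "PM_CYC": 6428126205,
    "PM_INST_CMPL": 693651174,
    "PM_TLB_MISS": 122468941,
    "PM_ST_CMPL": 125758955,
    "PM_LD_CMPL": 250513627,
    "PM_FPU0_CMPL": 249691884,
    "PM_FPU1_CMPL": 3134223,
    "PM_EXEC_FMA": 126535192,
}


def derived_metrics(c, wall_seconds):
    """Recompute hpmcount's derived values (formulas are inferred, not official)."""
    fp_instr = c["PM_FPU0_CMPL"] + c["PM_FPU1_CMPL"]  # instructions on both FPUs
    flips = fp_instr + c["PM_EXEC_FMA"]               # each FMA counted a second time
    load_store = c["PM_LD_CMPL"] + c["PM_ST_CMPL"]
    return {
        "loads_per_tlb_miss": c["PM_LD_CMPL"] / c["PM_TLB_MISS"],
        "load_store_millions": load_store / 1e6,
        "instr_per_load_store": c["PM_INST_CMPL"] / load_store,
        "mips": c["PM_INST_CMPL"] / wall_seconds / 1e6,
        "instr_per_cycle": c["PM_INST_CMPL"] / c["PM_CYC"],
        "fp_instr_per_cycle": fp_instr / c["PM_CYC"],
        "flips_millions": flips / 1e6,
        "mflips_per_second": flips / wall_seconds / 1e6,
        "fma_percentage": 100.0 * 2 * c["PM_EXEC_FMA"] / flips,
        "computation_intensity": flips / load_store,
    }


metrics = derived_metrics(COUNTERS, WALL_SECONDS)
for name, value in metrics.items():
    print(f"{name:24s} {value:10.3f}")
```

Each printed value should agree with the corresponding hpmcount line to within rounding in the third decimal place.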

Floating point measures
PM_FPU0_CMPL (FPU 0 instructions), PM_FPU1_CMPL (FPU 1 instructions)
  - The POWER3 processor has two Floating Point Units (FPUs) that operate in parallel.
  - Each FPU can start a new instruction every cycle.
  - These counters give the number of floating point instructions (add, multiply, subtract, divide, multiply+add) executed by each FPU.
PM_EXEC_FMA (FMAs executed)
  - The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).

Total flop rate
Float point instructions + FMA rate
  - Float point instructions + FMAs gives the number of floating point operations. The FMA count is added to the instruction count because each FMA instruction performs 2 floating point operations.
  - The rate gives the code's Mflops.
  - The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 flops per FMA instruction).
  - Our example: 22 Mflops.
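The peak-rate arithmetic on this slide can be written out as a small Python sketch. The clock speed, FPU count, and flops per FMA come from the slide; the function names are illustrative only.

```python
CLOCK_MHZ = 375      # POWER3 clock from the slide
FPUS = 2             # floating point units per processor
FLOPS_PER_FMA = 2    # one multiply plus one add per FMA instruction


def peak_mflops(clock_mhz, fpus, flops_per_instruction):
    """Peak rate assuming every FPU retires one FMA per cycle."""
    return clock_mhz * fpus * flops_per_instruction


def percent_of_peak(measured_mflops, peak):
    """Measured rate as a percentage of the theoretical peak."""
    return 100.0 * measured_mflops / peak


peak = peak_mflops(CLOCK_MHZ, FPUS, FLOPS_PER_FMA)
print(f"peak: {peak} Mflops")
print(f"example: {percent_of_peak(22, peak):.1f}% of peak")
```

On these numbers the unoptimized example runs at roughly 1.5% of peak.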

Memory access
Average number of loads per TLB miss
  - Memory addresses that are in the Translation Lookaside Buffer can be accessed quickly.
  - Each time a TLB miss occurs, a new page (4 KB, 512 8-byte elements) is brought into the buffer.
  - A value of ~500 means each element is accessed ~1 time while the page is in the buffer.
  - A small value indicates that needed data is stored in widely separated places in memory, and a redesign of data structures may help performance significantly.
  - Our example: 2.0
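As a rough reading aid, the page arithmetic above can be turned into a "loads per element of a page" figure. This is a simplified sketch that assumes 8-byte elements and 4 KB pages as stated on the slide; the helper names are hypothetical, and the ratio is only an average, not a statement about which elements are actually touched.

```python
PAGE_BYTES = 4 * 1024   # page size from the slide
ELEMENT_BYTES = 8       # one double-precision value


def elements_per_page(page_bytes=PAGE_BYTES, element_bytes=ELEMENT_BYTES):
    """Number of array elements that fit in one page."""
    return page_bytes // element_bytes


def loads_per_element(loads_per_tlb_miss):
    """Average loads per page element between consecutive TLB misses."""
    return loads_per_tlb_miss / elements_per_page()


print(elements_per_page())               # elements in a 4 KB page
print(f"{loads_per_element(2.046):.4f}")  # the unoptimized example
```

A result near 1 corresponds to the "~500" case on the slide; a result far below 1, as in the example, means most of each page goes unused before the next miss.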

Cache hits
The -sN option to hpmcount specifies a different statistics set
-s2 will include L1 data cache hit rate
  - 33.4% for our example
  - See http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html for more options and descriptions.

Optimizing the code
Original code fragment
      DO I=1,N
        DO K=1,N
          DO J=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO

Optimizing the code
"Optimized" code: move I to inner loop
      DO J=1,N
        DO K=1,N
          DO I=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
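The slides do not spell out why moving I inward helps, but the usual explanation is memory stride: Fortran stores arrays column-major, so the first subscript is the contiguous one. The Python sketch below computes the element stride of each array reference as the innermost loop index advances, for both loop orders. The array size N and all names here are illustrative assumptions, not values from the talk.

```python
# Subscripts of each array reference in Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
REFERENCES = {"Z": ("I", "J"), "X": ("I", "K"), "Y": ("K", "J")}


def column_major_stride(subscripts, loop_index, n):
    """Element stride of an n-by-n reference A(s0, s1) when loop_index grows by 1.

    Column-major layout puts A(s0, s1) at offset (s0 - 1) + (s1 - 1) * n.
    """
    first, second = subscripts
    return (1 if first == loop_index else 0) + (n if second == loop_index else 0)


def inner_loop_strides(inner_index, n):
    """Stride of every reference for a given innermost loop index."""
    return {
        name: column_major_stride(subs, inner_index, n)
        for name, subs in REFERENCES.items()
    }


N = 1000  # hypothetical matrix dimension; the slides do not give one
print("J innermost:", inner_loop_strides("J", N))
print("I innermost:", inner_loop_strides("I", N))
```

With J innermost, Z and Y are walked with a stride of N elements, so successive accesses land on different pages once N exceeds 512. With I innermost, Z and X are unit stride and Y stays fixed.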

Optimized results
Float point instructions + FMA rate
  - 461 vs. 22 Mflips (ESSL: 933)
Avg number of loads per TLB miss
  - 20,877 vs. 2.0 (ESSL: 162)
L1 cache hit rate
  - 98.9% vs. 33.4%
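To put these rates in context, the sketch below computes the speedup over the original loop order and the fraction of the 1500 Mflops POWER3 peak quoted earlier in the talk. The rates are taken from this slide; the dictionary keys and function name are illustrative.

```python
PEAK_MFLOPS = 1500.0  # POWER3 peak quoted earlier in the talk

# Measured rates from the slide, in Mflip/s
RATES = {"original": 22.0, "reordered": 461.0, "ESSL": 933.0}


def summarize(rates, baseline, peak):
    """Return (speedup over baseline, percent of peak) for each version."""
    base = rates[baseline]
    return {
        name: (rate / base, 100.0 * rate / peak)
        for name, rate in rates.items()
    }


summary = summarize(RATES, "original", PEAK_MFLOPS)
for name, (speedup, pct) in summary.items():
    print(f"{name:10s} speedup {speedup:5.1f}x   {pct:5.1f}% of peak")
```

On these numbers the loop reordering is worth about a factor of 21, and the ESSL routine reaches roughly 62% of peak.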

Using libhpm
libhpm can instrument code sections
Embed calls into source code
  - Fortran, C, C++
Contained in hpmtoolkit module
  - module load hpmtoolkit
Compile with $HPMTOOLKIT
  - xlf -O3 source.F $HPMTOOLKIT
Execute program normally

libhpm example
      …
#include "f_hpm.h"
      …
      CALL f_hpminit(0,"someid")
      CALL f_hpmstart(1,"matrix-matrix multiply")
      DO J=1,N
        DO K=1,N
          DO I=1,N
            Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
          END DO
        END DO
      END DO
      CALL f_hpmstop(1)
      CALL f_hpmterminate(0)
      …

Parallel programs
poe hpmcount executable -nodes n -procs np
  - Will print output to STDOUT separately for each task
poe+ executable -nodes n -procs np
  - Will print aggregate number to STDOUT
libhpm
  - Writes output to a separate file for each task
Do not do these!
  - hpmcount poe executable …
  - hpmcount executable (if compiled with mp* compiler)

Summary
Utilities to measure performance
  - PAPI
  - hpmcount
  - poe+
  - libhpm
You need to quote performance data in your ERCAP application

Where to Get More Information
NERSC Website: hpcf.nersc.gov
PAPI
  - http://hpcf.nersc.gov/software/tools/papi.html
hpmcount, poe+
  - http://hpcf.nersc.gov/software/ibm/hpmcount/
  - http://hpcf.nersc.gov/software/ibm/hpmcount/counter.html
libhpm
  - http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html
