
1 NUG Meeting: Performance Profiling Using hpmcount, poe+ & libhpm
Richard Gerber
NERSC User Services
ragerber@nersc.gov
510-486-6820

2 Introduction
- How to obtain performance numbers
- Tools based on IBM’s PMAPI
- Relevant for FY2003 ERCAP

3 Agenda
- Low-level PAPI interface
- HPM Toolkit
  - hpmcount
  - poe+
- libhpm: hardware performance library

4 Overview
- These tools are used for performance measurement
- All can be used to tune applications and measure performance
- Needed for FY2003 ERCAP applications

5 Vocabulary
- PMAPI: IBM’s low-level interface
- PAPI: Performance API (portable)
- hpmcount, poe+: report overall code performance
- libhpm: can be used to instrument portions of code

6 PAPI
- Standard application programming interface
- Portable; don’t confuse it with IBM’s low-level PMAPI interface
- Can access hardware counter information
- V2.1 at NERSC
- See
  - http://hpcf.nersc.gov/software/papi.html
  - http://icl.cs.utk.edu/projects/papi/

7 Using PAPI
- PAPI is available through a module
  - module load papi
- You place calls in source code and compile with $PAPI
  - xlf -O3 source.F $PAPI

    #include "fpapi.h"
    …
    integer*8 values(2)
    integer counters(2), ncounters, irc
    …
    ! irc passes the expected library version in and returns the status
    irc = PAPI_VER_CURRENT
    CALL papif_library_init(irc)
    ! Count FMA instructions and floating point instructions
    counters(1)=PAPI_FMA_INS
    counters(2)=PAPI_FP_INS
    ncounters=2
    CALL papif_start_counters(counters,ncounters,irc)
    …
    ! Stop counting and read the totals into values
    CALL papif_stop_counters(values,ncounters,irc)
    write(6,*) 'Total FMA ',values(1), ' Total FP ', values(2)
    …

8 hpmcount
- Easy to use
- Does not affect code performance
- Profiles entire code
- Uses hardware counters
- Reports flip (floating point instruction) rate and many other quantities

9 hpmcount usage
- Serial
  - % hpmcount executable
- Parallel
  - % poe hpmcount executable -nodes n -procs np
- Gives performance numbers for each task
- Prints output to STDOUT (or use -o filename)
- Beware! These profile the poe command rather than your code:
  - hpmcount poe executable
  - hpmcount executable (if compiled with mp* compilers)

10 hpmcount example
ex1.f - Unoptimized matrix-matrix multiply

    % xlf90 -o ex1 -O3 -qstrict ex1.f
    % hpmcount ./ex1
    hpmcount (V 2.3.1) summary
    Total execution time (wall clock time): 17.258385 seconds
    ########  Resource Usage Statistics  ########
    Total amount of time in user mode            : 17.220000 seconds
    Total amount of time in system mode          : 0.040000 seconds
    Maximum resident set size                    : 3116 Kbytes
    Average shared memory use in text segment    : 6900 Kbytes*sec
    Average unshared memory use in data segment  : 5344036 Kbytes*sec
    Number of page faults without I/O activity   : 785
    Number of page faults with I/O activity      : 1
    Number of times process was swapped out      : 0
    Number of times file system performed INPUT  : 0
    Number of times file system performed OUTPUT : 0
    Number of IPC messages sent                  : 0
    Number of IPC messages received              : 0
    Number of signals delivered                  : 0
    Number of voluntary context switches         : 1
    Number of involuntary context switches       : 1727
    #######  End of Resource Statistics  ########

11 hpmcount output
ex1.f - Unoptimized matrix-matrix multiply

    % xlf90 -o ex1 -O3 -qstrict ex1.f
    % hpmcount ./ex1
    PM_CYC (Cycles)                          : 6428126205
    PM_INST_CMPL (Instructions completed)    : 693651174
    PM_TLB_MISS (TLB misses)                 : 122468941
    PM_ST_CMPL (Stores completed)            : 125758955
    PM_LD_CMPL (Loads completed)             : 250513627
    PM_FPU0_CMPL (FPU 0 instructions)        : 249691884
    PM_FPU1_CMPL (FPU 1 instructions)        : 3134223
    PM_EXEC_FMA (FMAs executed)              : 126535192
    Utilization rate                         : 99.308 %
    Avg number of loads per TLB miss         : 2.046
    Load and store operations                : 376.273 M
    Instructions per load/store              : 1.843
    MIPS                                     : 40.192
    Instructions per cycle                   : 0.108
    HW Float points instructions per Cycle   : 0.039
    Floating point instructions + FMAs       : 379.361 M
    Float point instructions + FMA rate      : 21.981 Mflip/s
    FMA percentage                           : 66.710 %
    Computation intensity                    : 1.008
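
Most of the derived lines in this output can be reproduced from the raw counters. The sketch below shows one way to do that in Python. The formulas are assumptions inferred from the printed numbers, not taken from hpmcount's documentation or source, and the wall clock time is the 17.258385 seconds reported on the hpmcount example slide. The utilization rate is left out because its formula is not evident from the output.

```python
# Raw counter values copied from the hpmcount output for ex1.
counters = {
    "PM_CYC": 6428126205,
    "PM_INST_CMPL": 693651174,
    "PM_TLB_MISS": 122468941,
    "PM_ST_CMPL": 125758955,
    "PM_LD_CMPL": 250513627,
    "PM_FPU0_CMPL": 249691884,
    "PM_FPU1_CMPL": 3134223,
    "PM_EXEC_FMA": 126535192,
}
WALL_SECONDS = 17.258385  # total execution time from the same run


def derive_metrics(c, wall_seconds):
    """Recompute derived metrics. Formulas are inferred, not official."""
    loads_stores = c["PM_LD_CMPL"] + c["PM_ST_CMPL"]
    fpu_instr = c["PM_FPU0_CMPL"] + c["PM_FPU1_CMPL"]
    flips = fpu_instr + c["PM_EXEC_FMA"]  # each FMA is counted a second time
    return {
        "loads_per_tlb_miss": c["PM_LD_CMPL"] / c["PM_TLB_MISS"],
        "load_store_M": loads_stores / 1e6,
        "instr_per_load_store": c["PM_INST_CMPL"] / loads_stores,
        "mips": c["PM_INST_CMPL"] / wall_seconds / 1e6,
        "instr_per_cycle": c["PM_INST_CMPL"] / c["PM_CYC"],
        "fp_instr_per_cycle": fpu_instr / c["PM_CYC"],
        "flips_M": flips / 1e6,
        "mflips_per_s": flips / wall_seconds / 1e6,
        "fma_percentage": 100.0 * 2 * c["PM_EXEC_FMA"] / flips,
        "computation_intensity": flips / loads_stores,
    }


metrics = derive_metrics(counters, WALL_SECONDS)
for name, value in metrics.items():
    print(f"{name:24s} {value:10.3f}")
```

Under these assumed formulas each value agrees with the corresponding hpmcount line to three decimal places, which suggests the rates are computed against wall clock time.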

12 Floating point measures
- PM_FPU0_CMPL (FPU 0 instructions), PM_FPU1_CMPL (FPU 1 instructions)
  - The POWER3 processor has two Floating Point Units (FPUs) that operate in parallel.
  - Each FPU can start a new instruction every cycle.
  - These counters give the number of floating point instructions (add, multiply, subtract, divide, multiply+add) executed by each FPU.
- PM_EXEC_FMA (FMAs executed)
  - The POWER3 can execute a computation of the form x=s*a+b with one instruction. This is known as a Floating point Multiply & Add (FMA).

13 Total flop rate
- Float point instructions + FMA rate
  - Float point instructions + FMAs gives the number of floating point operations. The FPU counts already include each FMA once as a single instruction, so adding the FMA count again credits each FMA with its 2 floating point operations.
  - The rate gives the code’s Mflops.
  - The POWER3 has a peak rate of 1500 Mflops (375 MHz clock x 2 FPUs x 2 Flops/FMA instruction).
  - Our example: 22 Mflops.
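
The peak-rate arithmetic on this slide can be written out as a short Python sketch. The function names are illustrative only; the clock rate, FPU count, and flops per FMA come from the slide, and the measured rate is the 21.981 Mflip/s line from the hpmcount output.

```python
CLOCK_MHZ = 375      # POWER3 clock rate quoted on the slide
FPUS = 2             # floating point units per processor
FLOPS_PER_FMA = 2    # one multiply plus one add per FMA instruction


def peak_mflops(clock_mhz, fpus, flops_per_instruction):
    """Peak rate if every FPU retires one FMA on every cycle."""
    return clock_mhz * fpus * flops_per_instruction


def percent_of_peak(measured_mflops, peak):
    """Measured rate expressed as a percentage of the peak rate."""
    return 100.0 * measured_mflops / peak


peak = peak_mflops(CLOCK_MHZ, FPUS, FLOPS_PER_FMA)
example_pct = percent_of_peak(21.981, peak)
print(f"peak = {peak} Mflops, example runs at {example_pct:.2f}% of peak")
```

The unoptimized example therefore reaches roughly 1.5% of the processor's peak rate.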

14 Memory access
- Average number of loads per TLB miss
  - Memory addresses that are in the Translation Lookaside Buffer can be accessed quickly.
  - Each time a TLB miss occurs, a new page (4KB, 512 8-byte elements) is brought into the buffer.
  - A value of ~500 means each element is accessed ~1 time while the page is in the buffer.
  - A small value indicates that needed data is stored in widely separated places in memory and a redesign of data structures may help performance significantly.
  - Our example: 2.0
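
The rule of thumb above can be made concrete with a small Python sketch. It assumes, as the slide does, 4KB pages and 8-byte elements, and it uses a deliberately simple model in which every load touches one element of the page that was just brought in. The function name is hypothetical.

```python
PAGE_BYTES = 4 * 1024   # page size quoted on the slide
ELEMENT_BYTES = 8       # one double-precision value

elements_per_page = PAGE_BYTES // ELEMENT_BYTES


def loads_per_element(loads_per_tlb_miss):
    """Average loads per element of a page between two TLB misses."""
    return loads_per_tlb_miss / elements_per_page


print("elements per page:", elements_per_page)
print("full reuse       :", loads_per_element(512))
print("our example      :", loads_per_element(2.0))
```

In this model a value of 512 loads per miss corresponds to touching every element of the page once, while the example's 2.0 means only about 2 of the 512 elements are used before the code moves on to another page.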

15 Cache hits
- The -sN option to hpmcount specifies a different statistics set
- -s2 will include L1 data cache hit rate
  - 33.4% for our example
  - See http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html for more options and descriptions.

16 Optimizing the code
Original code fragment

    DO I=1,N
      DO K=1,N
        DO J=1,N
          Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
        END DO
      END DO
    END DO

17 Optimizing the code
“Optimized” code: move I to inner loop

    DO J=1,N
      DO K=1,N
        DO I=1,N
          Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
        END DO
      END DO
    END DO
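
Why the loop order matters follows from Fortran's column-major array layout, in which the first index varies fastest in memory. The Python sketch below computes how far each array reference moves, in elements, when only the innermost loop variable advances by one. The value of N is hypothetical, since the slides do not give the matrix size, and the helper names are illustrative.

```python
def column_major_offset(i, j, n):
    """Element offset of A(i, j) in an n-by-n Fortran array (1-based)."""
    return (i - 1) + (j - 1) * n


# Each array reference in the loop body, as a function of the loop variables.
ARRAYS = {
    "Z(I,J)": lambda v: (v["I"], v["J"]),
    "X(I,K)": lambda v: (v["I"], v["K"]),
    "Y(K,J)": lambda v: (v["K"], v["J"]),
}


def inner_stride(index_of, inner, n):
    """Offset change when only the innermost loop variable advances by one."""
    base = {"I": 1, "J": 1, "K": 1}
    step = dict(base)
    step[inner] += 1
    return (column_major_offset(*index_of(step), n)
            - column_major_offset(*index_of(base), n))


def strides(inner, n):
    return {name: inner_stride(f, inner, n) for name, f in ARRAYS.items()}


N = 1000  # hypothetical matrix dimension
original = strides("J", N)    # J is the innermost loop
optimized = strides("I", N)   # I is the innermost loop
print("original :", original)
print("optimized:", optimized)
```

With J innermost, Z and Y are each walked with a stride of N elements, so for large N consecutive accesses land on different pages. With I innermost, Z and X are walked with unit stride and Y(K,J) does not change inside the inner loop.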

18 Optimized results
- Float point instructions + FMA rate
  - 461 vs. 22 Mflip/s (ESSL: 933)
- Avg number of loads per TLB miss
  - 20,877 vs. 2.0 (ESSL: 162)
- L1 cache hit rate
  - 98.9% vs. 33.4%
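
To put these numbers side by side, the Python sketch below computes the speedup of each variant over the original loop order and converts the L1 hit rates into miss rates. The figures are the rounded values from this slide; the dictionary layout and function name are illustrative, and no L1 hit rate is given for ESSL.

```python
results = {
    "original":  {"mflips": 22.0,  "loads_per_tlb_miss": 2.0,     "l1_hit_pct": 33.4},
    "optimized": {"mflips": 461.0, "loads_per_tlb_miss": 20877.0, "l1_hit_pct": 98.9},
    "ESSL":      {"mflips": 933.0, "loads_per_tlb_miss": 162.0,   "l1_hit_pct": None},
}


def speedup_over_original(data):
    """Ratio of each variant's Mflip/s to the original code's Mflip/s."""
    base = data["original"]["mflips"]
    return {name: row["mflips"] / base for name, row in data.items()}


speedup = speedup_over_original(results)
miss_pct = {
    name: 100.0 - row["l1_hit_pct"]
    for name, row in results.items()
    if row["l1_hit_pct"] is not None
}

for name, factor in speedup.items():
    print(f"{name:10s} {factor:6.1f}x the original rate")
for name, pct in miss_pct.items():
    print(f"{name:10s} L1 miss rate {pct:5.1f}%")
```

Reordering the loops alone gives roughly a 21x higher rate, and the L1 miss rate falls from about two thirds of accesses to about one percent.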

19 Using libhpm
- libhpm can instrument code sections
- Embed calls into source code
  - Fortran, C, C++
- Contained in hpmtoolkit module
  - module load hpmtoolkit
- Compile with $HPMTOOLKIT
  - xlf -O3 source.F $HPMTOOLKIT
- Execute program normally

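20 libhpm example

    …
    #include "f_hpm.h"
    …
    ! Initialize libhpm for this task
    CALL f_hpminit(0,"someid")
    ! Start counting for instrumented section 1
    CALL f_hpmstart(1,"matrix-matrix multiply")
    DO J=1,N
      DO K=1,N
        DO I=1,N
          Z(I,J) = Z(I,J) + X(I,K) * Y(K,J)
        END DO
      END DO
    END DO
    ! Stop counting for section 1, then write the results
    CALL f_hpmstop(1)
    CALL f_hpmterminate(0)
    …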

21 Parallel programs
- poe hpmcount executable -nodes n -procs np
  - Will print output to STDOUT separately for each task
- poe+ executable -nodes n -procs np
  - Will print aggregate numbers to STDOUT
- libhpm
  - Writes output to a separate file for each task
- Do not do these!
  - hpmcount poe executable …
  - hpmcount executable (if compiled with mp* compiler)

22 Summary
- Utilities to measure performance
  - PAPI
  - hpmcount
  - poe+
  - libhpm
- You need to quote performance data in your ERCAP application

23 Where to Get More Information
- NERSC Website: hpcf.nersc.gov
- PAPI
  - http://hpcf.nersc.gov/software/tools/papi.html
- hpmcount, poe+
  - http://hpcf.nersc.gov/software/ibm/hpmcount/
  - http://hpcf.nersc.gov/software/ibm/hpmcount/counter.html
- libhpm
  - http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_README.html

