Understanding Performance Counter Data - 1
Methodology
- [Configuration micro-benchmark]
- Validation micro-benchmark (used to predict event count)
- Prediction via tool, mathematical model, and/or simulation
- Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
- Comparison/analysis
- Report findings
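The collection and comparison steps can be sketched roughly as follows. This is a minimal illustration, not the tooling used in the study: the function name `summarize_counts`, the sample readings, and the predicted value are all hypothetical, and the choice of the sample (n - 1) standard deviation is an assumption, since the slides do not say which form was used.

```python
import statistics


def summarize_counts(counts, predicted):
    """Summarize repeated hardware-counter readings against a predicted count.

    counts    -- event counts from repeated runs of the instrumented benchmark
    predicted -- event count predicted by a tool, model, or simulation
    """
    mean = statistics.mean(counts)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stdev = statistics.stdev(counts)
    return {
        "mean": mean,
        "stdev": stdev,
        "abs_diff": mean - predicted,                # measured minus predicted
        "rel_diff": (mean - predicted) / predicted,  # fraction of the prediction
    }


# Hypothetical data: 100 runs whose counts vary slightly around 1000 events.
counts = [1000 + (i % 5) for i in range(100)]
summary = summarize_counts(counts, predicted=1000)
print(summary)
```

A small standard deviation relative to the mean suggests the count is stable across runs, so any remaining gap between mean and prediction is more likely systematic (for example, measurement overhead) than noise.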
Understanding Performance Counter Data - 2
- Can quantify PAPI overhead in some cases, e.g.,
  - Loads and stores
  - Floating-point operations (on some platforms)
- Can show that count is reasonable in others, e.g.,
  - L1 Dcache misses
  - DTLB misses (R10K)
  - Multiprocessor cache consistency protocol-related events (R10K)
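One way such an overhead could be quantified is sketched below. The slides do not describe the actual procedure, so this assumes the simplest case: the instrumentation adds a fixed number of events that does not depend on benchmark size. The function name `estimate_overhead` and all numbers are hypothetical.

```python
def estimate_overhead(predicted, measured):
    """Estimate a constant additive measurement overhead.

    predicted -- predicted event counts, one per benchmark size
    measured  -- mean measured event counts for the same sizes
    Returns (overhead, spread): the average excess over the prediction and
    the range of the per-size excesses. A spread that is small relative to
    the overhead supports the constant-overhead assumption.
    """
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("need matching, non-empty lists")
    diffs = [m - p for p, m in zip(predicted, measured)]
    overhead = sum(diffs) / len(diffs)
    spread = max(diffs) - min(diffs)
    return overhead, spread


# Hypothetical load counts for three benchmark sizes.
predicted = [1000, 10000, 100000]
measured = [1086.0, 10086.0, 100086.0]
overhead, spread = estimate_overhead(predicted, measured)
print(overhead, spread)
```

If the excess instead grows with benchmark size, a constant overhead is the wrong model, and the count can at best be shown to be reasonable rather than exactly accounted for.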
Understanding Performance Counter Data - 3
Interesting facts
- Stream buffers are incredibly effective!
- Itanium retires 17% more instructions and incurs 17% more Icache misses than predicted; this is due to no-ops
- Itanium has 5x as many TLB misses as predicted; the cause is not yet known
- Power3 has 5x (for smaller versions of the benchmark) and 2x (for larger versions) as many TLB misses as predicted; the cause is not yet known
Understanding Performance Counter Data - 4
Interesting facts
- Power3 (gcc compiler): single-precision vs. double-precision floating-point add benchmark
  - The double-precision benchmark has ½ the number of floating-point operations, due to rounding instructions needed for the single-precision benchmark
  - The single-precision benchmark takes 1.39x the cycles of the double-precision benchmark