Published by Sarah Megan Gordon. Modified over 9 years ago.
1
On-board Performance Counters: What do they really tell us?
Pat Teller, The University of Texas at El Paso (UTEP)
PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
2
Credits (Person Power)
- Michael Maxwell, Graduate (Ph.D.) Student
- Leonardo Salayandia, Graduate (M.S.) Student (graduating in Dec. 2002)
- Alonso Bayona, Undergraduate
- Alexander Sainz, Undergraduate
3
Credits (Financial)
- DoD PET Program
- NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
- UTEP Dodson Endowment
4
Motivation
- Facilitate performance-tuning efforts that employ aggregate event counts (not time-multiplexed) accessed via PAPI
- When possible, provide calibration data, i.e., quantify overhead related to PAPI and other sources
- Identify unexpected results: are they errors, or misunderstandings of processor functionality?
5
Road Map
- Scope of Research
- Methodology
- Results
- Future Work and Conclusions
6
Processors Under Study
- MIPS R10K and R12K: 2 counters, 32 events
- IBM Power3: 8 counters, 100+ events
- Linux/IA-64: 4 counters, 150 events
- Linux/Pentium: 2 counters, 80+ events
7
Events Studied So Far
- Number of load and store instructions executed
- Number of floating-point instructions executed
- Total number of instructions executed (issued/committed)
- Number of L1 I-cache and L1 D-cache misses
- Number of L2 cache misses
- Number of TLB misses
- Number of branch mispredictions
8
PAPI Overhead
- Extra instructions
- Read counter before and after workload
- Processing of counter overflow interrupts
- Cache pollution
- TLB pollution
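The effect of reading the counter before and after the workload can be illustrated with a small simulation. This sketch does not call PAPI: `simulated_counter`, `READ_COST`, `read_counter`, `run_workload`, and `measure` are all hypothetical stand-ins, and the cost of 7 events per read is an invented figure chosen only to make the arithmetic visible.

```c
#include <assert.h>

#define READ_COST 7 /* hypothetical number of events charged by one counter read */

/* Stand-in for a hardware event counter (assumption: not a real PAPI counter). */
static long long simulated_counter = 0;

/* Return the current count, then charge READ_COST events for the read itself. */
long long read_counter(void)
{
    long long value = simulated_counter;
    simulated_counter += READ_COST;
    return value;
}

/* Stand-in workload that generates a known (predicted) number of events. */
void run_workload(long long events)
{
    assert(events >= 0);
    simulated_counter += events;
}

/* Read the counter before and after the workload and report the difference. */
long long measure(long long events)
{
    long long before = read_counter();
    run_workload(events);
    long long after = read_counter();
    return after - before;
}
```

In this model the reported count is the workload's events plus the cost of the first read, so an empty workload exposes the read overhead directly.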
9
Methodology
- [Configuration micro-benchmark]
- Validation micro-benchmark: used to predict event count
- Prediction via tool, mathematical model, and/or simulation
- Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)
- Comparison/analysis
- Report findings
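The collection step (repeated runs, then mean and standard deviation) might be summarized along the following lines. This is a minimal sketch, not the authors' tooling: `count_stats` and `summarize_counts` are hypothetical names, and the function returns the sample variance so that the block needs no math library; the standard deviation is its square root.

```c
#include <assert.h>
#include <stddef.h>

/* Summary of the event counts reported over repeated runs. */
struct count_stats {
    double mean;
    double variance; /* sample variance; the standard deviation is its square root */
};

/* Compute the mean and sample variance of n reported event counts (n >= 2). */
struct count_stats summarize_counts(const long long *counts, size_t n)
{
    assert(counts != NULL && n >= 2);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (double)counts[i];
    double mean = sum / (double)n;

    double squared_deviations = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = (double)counts[i] - mean;
        squared_deviations += d * d;
    }

    struct count_stats s = { mean, squared_deviations / (double)(n - 1) };
    return s;
}
```

With 100 runs, `counts` would hold the 100 reported values for one event; a small variance relative to the mean corresponds to what the slides call a consistent result.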
10
Validation Micro-benchmark
- Simple, usually small program
- Stresses a portion of the microarchitecture or memory hierarchy
- Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
- Basic types: array, loop, in-line, and floating-point
- Scalable w.r.t. granularity, i.e., number of generated events
11
Example: Loop Validation Micro-benchmark
for (i = 0; i < number_of_loops; i++) {
    sequence of 100 instructions with data dependencies that prevent compiler reordering or optimization
}
- Used to stress a particular functional unit, e.g., the load/store unit
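A loop of this kind could look roughly like the sketch below, which targets the load/store unit with a chain of dependent loads. It is an illustration under stated assumptions, not the authors' benchmark: the names `run_load_loop`, `predicted_loads`, and `LOADS_PER_ITERATION` are hypothetical, the body has 4 loads instead of 100 instructions to stay short, and the prediction covers only the explicit loads in the loop body (the compiler may add loads for loop control, especially without optimization).

```c
#include <assert.h>
#include <stddef.h>

#define LOADS_PER_ITERATION 4 /* the slide's loop body uses 100 instructions */

/* Each load's address depends on the previous load's result, and the array is
   volatile, so the compiler can neither reorder nor remove the loads. */
size_t run_load_loop(const volatile size_t *chain, size_t number_of_loops)
{
    assert(chain != NULL);
    size_t idx = 0;
    for (size_t i = 0; i < number_of_loops; i++) {
        idx = chain[idx];
        idx = chain[idx];
        idx = chain[idx];
        idx = chain[idx];
    }
    return idx;
}

/* Predicted number of loads issued by the loop body alone. */
unsigned long long predicted_loads(size_t number_of_loops)
{
    return (unsigned long long)number_of_loops * LOADS_PER_ITERATION;
}
```

Scaling `number_of_loops` changes the granularity of the benchmark, i.e., the number of generated events, without changing its structure.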
12
Configuration Micro-benchmark
- Simple, usually small program
- Designed to provide insight into structure and management algorithms of microarchitecture and/or memory hierarchy
- Example: program to identify the page size used to store user data
13
Some Results
14
Reported Event Counts: Expected, Consistent, and Quantifiable Results
- Overhead related to PAPI and other sources is consistent and quantifiable
- Reported Event Count - Predicted Event Count = Overhead
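When the overhead is consistent, the relation above can be applied in both directions: once to derive the overhead from a validation micro-benchmark, and again to correct a later measurement. A minimal sketch, with hypothetical function names and example numbers that do not come from the slides:

```c
#include <assert.h>

/* Overhead = reported event count - predicted event count. */
long long calibration_overhead(long long reported, long long predicted)
{
    assert(reported >= predicted);
    return reported - predicted;
}

/* Remove a previously measured, consistent overhead from a reported count. */
long long calibrated_count(long long reported, long long overhead)
{
    assert(reported >= overhead);
    return reported - overhead;
}
```

For example, if a benchmark predicted to issue 10,000 loads reports 10,040, the overhead is 40, and a later reported count of 250,040 calibrates to 250,000.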
15
Example 1: Number of Loads (Itanium, Power3, and R12K)
16
Example 2: Number of Stores (Itanium, Power3, and R12K)
17
Example 2: Number of Stores (Power3 and Itanium)
Platform: MIPS R12K, IBM Power3, Linux/IA-64, Linux/Pentium
Loads: 462886 N/A
Stores: 31129 N/A
18
Example 3: Total Number of Floating-Point Operations (Pentium II, R10K and R12K, and Itanium)
Table columns: Processor, Accurate, Consistent
Table rows: Pentium II; MIPS R10K, R12K; Itanium
- Even when counters overflow
- No overhead due to PAPI
19
Reported Event Counts: Unexpected and Consistent Results (Errors?)
- The hardware-reported counts are multiples of the predicted counts
- Reported Event Count / Multiplier = Predicted Event Count
- Cannot identify overhead for calibration
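One way to check whether a set of measurements fits this pattern is to test for a single integer multiplier across all benchmark sizes. The sketch below is an assumption about how such a check might be written (the name `common_multiplier` is hypothetical), and it requires exact multiples, which real counts with any additive overhead would not satisfy.

```c
#include <assert.h>
#include <stddef.h>

/* Return k if reported[i] == k * predicted[i] for every i, otherwise 0.
   All predicted counts must be positive. */
long long common_multiplier(const long long *reported,
                            const long long *predicted, size_t n)
{
    assert(reported != NULL && predicted != NULL && n >= 1);
    assert(predicted[0] > 0);

    long long k = reported[0] / predicted[0];
    for (size_t i = 0; i < n; i++) {
        if (predicted[i] <= 0 || reported[i] != k * predicted[i])
            return 0; /* no single exact multiplier fits all pairs */
    }
    return k;
}
```

A nonzero result means Reported Event Count / k reproduces every predicted count; a result of 0 means the counts are not exact multiples under this strict test.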
20
Example 1: Total Number of Floating-Point Operations (Power3)
Table columns: Accurate, Consistent
21
Reported Counts: Expected (Not Quantifiable) Results
- Predictions: only possible under special circumstances
- Reported event counts seem reasonable
- But are they useful without knowing more about the algorithm used by the vendor?
22
Example 1: Total Data TLB Misses
- Replacement policy can (unpredictably) affect event counts
- PAPI may (unpredictably) affect event counts
- Other processes may (unpredictably) affect event counts
23
Example 1: Total Compulsory Data TLB Misses for R10K
Chart: % difference per no. of references
- Predicted values consistently lower than reported
- Small standard deviations
- Greater predictability with increased no. of references
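The slides do not state how the compulsory-miss prediction was made. A plausible sketch, assuming one compulsory miss per distinct page touched by a contiguous, page-aligned region, together with the percent-difference measure named on the chart, could look like this (both function names are hypothetical):

```c
#include <assert.h>

/* Assumed model: one compulsory data TLB miss per distinct page touched by a
   contiguous, page-aligned region of bytes_touched bytes. */
unsigned long long predicted_compulsory_tlb_misses(unsigned long long bytes_touched,
                                                   unsigned long long page_size)
{
    assert(page_size > 0);
    return (bytes_touched + page_size - 1) / page_size; /* round up */
}

/* Percent difference of the reported count relative to the predicted count. */
double percent_difference(double reported, double predicted)
{
    assert(predicted > 0.0);
    return 100.0 * (reported - predicted) / predicted;
}
```

Under this model a positive percent difference corresponds to predicted values that are lower than reported, as on the slide.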
24
Example 2: L1 D-Cache Misses
- The number of misses stays relatively constant as the number of array references increases
25
Example 2: L1 D-Cache Misses
- On some of the processors studied, as the number of accesses increased, the miss rate approached 0
- Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word
- What's going on?
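A naive cold-cache prediction shows why the stride result is surprising. The sketch below is not the authors' model: it assumes a line-aligned start, no prefetching or stream buffers, and no reuse of a line after eviction, and the function name is hypothetical.

```c
#include <assert.h>

/* Naive count of distinct cache lines touched by `accesses` references that
   start at a line-aligned address and advance by stride_bytes each time. */
unsigned long long naive_line_misses(unsigned long long accesses,
                                     unsigned long long stride_bytes,
                                     unsigned long long line_bytes)
{
    assert(stride_bytes > 0 && line_bytes > 0);
    if (accesses == 0)
        return 0;
    if (stride_bytes >= line_bytes)
        return accesses; /* every reference lands in a new line */

    /* With a stride smaller than a line, no line is skipped, so every line up
       to the one holding the last reference is touched exactly once. */
    unsigned long long last_offset = (accesses - 1) * stride_bytes;
    return last_offset / line_bytes + 1;
}
```

With illustrative sizes of a 4-byte word, a 32-byte line, and a 32 KB cache, 1,024 accesses in one-word strides predict 128 misses, while 1,024 accesses in strides of two cache sizes plus one line predict 1,024. Similar reported counts for both patterns therefore contradict this simple model.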
26
Example 2: L1 D-Cache Misses with Random Access (Foil Prefetch Scheme Used by Stream Buffers)
[Figure: L1 D-cache misses as a function of % filled. x-axis: % of cache filled; y-axis: % error; series: Power3, R12K, Pentium]
27
Example 2: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses
total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss
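The model translates directly into code. In this sketch the parameter names follow the slide's formula, while the function name and the example values are illustrative assumptions rather than measured data.

```c
#include <assert.h>

/* total_number_of_cycles =
       iterations * exec_cycles_per_iteration
     + cache_misses * cycles_per_cache_miss */
unsigned long long total_number_of_cycles(unsigned long long iterations,
                                          unsigned long long exec_cycles_per_iteration,
                                          unsigned long long cache_misses,
                                          unsigned long long cycles_per_cache_miss)
{
    return iterations * exec_cycles_per_iteration
         + cache_misses * cycles_per_cache_miss;
}
```

With the iteration count held fixed, each additional miss adds `cycles_per_cache_miss` cycles, so measured cycle counts can be used to cross-check a reported miss count.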
28
Reported Event Counts: Unexpected but Consistent Results
- Predicted counts and reported counts differ significantly but in a consistent manner
- Is this an error?
- Are we missing something?
29
Example 1: Compulsory Data TLB Misses for Itanium
Chart: % difference per no. of references
- Reported counts consistently ~5 times greater than predicted
30
Example 3: Compulsory Data TLB Misses for Power3
Chart: % difference per no. of references
- Reported counts are consistently about 5 times greater than predicted for small counts and about 2 times greater for large counts
31
Reported Event Counts: Unexpected Results
- Outliers
- Puzzles
32
Example 1: Outliers
L1 D-Cache Misses for Itanium
33
Example 1: Supporting Data
Itanium L1 Data Cache Misses
                            Mean       Standard Deviation
90% of data, 1M accesses    1,290      170
10% of data, 1M accesses    782,891    566,370
34
Example 2: R10K Floating-Point Division Instructions
Sequence 1:
a = init_value; b = init_value; c = init_value;
a = b / init_value; b = a / init_value; c = b / init_value;
Sequence 2:
a = init_value; b = init_value; c = init_value;
a = a / init_value; b = b / init_value; c = c / init_value;
Reported: 1 FP instruction counted; 3 FP instructions counted
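The two sequences can be written as compilable C for experimentation. This is a sketch under assumptions: the `triple` struct and the function names are hypothetical, the variables are declared volatile so that all three divisions survive optimization, and the mapping of the reported counts (1 and 3) to the sequences is presumed to follow the order shown on the slide.

```c
#include <assert.h>

struct triple { double a, b, c; };

/* Sequence 1: each division consumes the result of the previous one. */
struct triple chained_divisions(double init_value)
{
    assert(init_value != 0.0);
    volatile double a = init_value, b = init_value, c = init_value;
    a = b / init_value;
    b = a / init_value;
    c = b / init_value;
    struct triple t = { a, b, c };
    return t;
}

/* Sequence 2: the three divisions are independent of one another. */
struct triple independent_divisions(double init_value)
{
    assert(init_value != 0.0);
    volatile double a = init_value, b = init_value, c = init_value;
    a = a / init_value;
    b = b / init_value;
    c = c / init_value;
    struct triple t = { a, b, c };
    return t;
}
```

Both functions execute three floating-point divisions, so a counter of FP instructions would be expected to report the same value for each.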
35
Example 2: Assembler Code Analysis
- No optimization
- Same instructions
- Different (expected) operands
- Three division instructions in both
- No reason for different FP counts
Sequence 1: l.d s.d l.d s.d l.d s.d l.d div.d s.d l.d div.d s.d l.d div.d s.d
Sequence 2: l.d s.d l.d s.d l.d s.d l.d div.d s.d l.d div.d s.d l.d div.d s.d
36
Example 3: L1 D-Cache Misses with Random Access (Itanium, only when array size = 10x cache size)
37
Example 4: L1 I-Cache Misses and Instructions Retired (Itanium)
- Both about 17% more than expected
38
Future Work
- Extend events studied: include multiprocessor events
- Extend processors studied: include Power4
- Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling
39
Conclusions
- Performance counters provide informative data that can be used for performance tuning
- Expected frequency of event may determine usefulness of event counts
- Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
- The usefulness of some event counts (as well as our research) could be enhanced with vendor collaboration
- The usefulness of some event counts is questionable without documentation of the related behavior
40
Should we attach the following warning to some event counts on some platforms?
CAUTION: The values in the performance counters may be greater than you think.
41
And should we attach the PCAT Seal of Approval to others?
42
Invitation to Vendors
- Help us understand what's going on, when to attach the "warning," and when to attach the "seal of approval."
- Application programmers will appreciate your efforts and so will we!
43
Question to You
- On-board Performance Counters: What do they really tell you?
- With all the caveats, are they useful nonetheless?