1 On-board Performance Counters: What do they really tell us? Pat Teller The University of Texas at El Paso (UTEP) PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

2 Credits (Person Power)  Michael Maxwell, Graduate (Ph.D.) Student  Leonardo Salayandia, Graduate (M.S.) Student – graduating in Dec. 2002  Alonso Bayona, Undergraduate  Alexander Sainz, Undergraduate PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

3 Credits (Financial)  DoD PET Program  NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program  UTEP Dodson Endowment PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

4 Motivation  Facilitate performance-tuning efforts that employ aggregate event counts (not time-multiplexed) accessed via PAPI  When possible, provide calibration data, i.e., quantify overhead related to PAPI and other sources  Identify unexpected results – Errors? Misunderstandings of processor functionality? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

5 Road Map  Scope of Research  Methodology  Results  Future Work and Conclusions PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

6 Processors Under Study PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002  MIPS R10K and R12K: 2 counters, 32 events  IBM Power3: 8 counters, 100+ events  Linux/IA-64: 4 counters, 150 events  Linux/Pentium: 2 counters, 80+ events

7 Events Studied So Far  Number of load and store instructions executed  Number of floating-point instructions executed  Total number of instructions executed (issued/committed)  Number of L1 I-cache and L1 D-cache misses  Number of L2 cache misses  Number of TLB misses  Number of branch mispredictions PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
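For concreteness, the PAPI preset event names that commonly correspond to the counts listed above are sketched below. This mapping is an assumption for illustration: availability differs by platform, and the study may have relied on native events instead.

    #include <papi.h>

    /* Sketch: PAPI preset events that commonly map to the studied counts.
     * Assumed for illustration; not every preset exists on every platform. */
    static const int studied_events[] = {
        PAPI_LD_INS,   /* load instructions executed           */
        PAPI_SR_INS,   /* store instructions executed          */
        PAPI_FP_INS,   /* floating-point instructions executed */
        PAPI_TOT_INS,  /* total instructions completed         */
        PAPI_L1_ICM,   /* L1 instruction-cache misses          */
        PAPI_L1_DCM,   /* L1 data-cache misses                 */
        PAPI_L2_TCM,   /* L2 cache misses (total)              */
        PAPI_TLB_DM,   /* data TLB misses                      */
        PAPI_BR_MSP,   /* mispredicted branches                */
    };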

8 PAPI Overhead PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002  Extra instructions  Read counter before and after workload  Processing of counter overflow interrupts  Cache pollution  TLB pollution
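As an illustration of the first two overhead sources (extra instructions, and reading the counter before and after the workload), a minimal sketch using the PAPI low-level C API follows: whatever is counted around an empty region is pure measurement overhead. Error handling is omitted and the event choice is only an example.

    #include <stdio.h>
    #include <papi.h>

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long count;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TOT_INS);  /* total instructions */

        PAPI_start(eventset);
        /* empty workload: anything counted here is measurement overhead */
        PAPI_stop(eventset, &count);

        printf("instructions charged to an empty region: %lld\n", count);
        return 0;
    }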

9 Methodology  [Configuration micro-benchmark]  Validation micro-benchmark – used to predict event count  Prediction via tool, mathematical model, and/or simulation  Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)  Comparison/analysis  Report findings PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
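A sketch of what the collection step might look like: run the instrumented micro-benchmark 100 times and report the mean and standard deviation of the event count. The PAPI_LD_INS event and the trivial workload() stand-in are illustrative assumptions, not the group's actual harness.

    #include <math.h>
    #include <stdio.h>
    #include <papi.h>

    #define RUNS 100

    static volatile double sink;

    static void workload(void)   /* stand-in for a validation micro-benchmark */
    {
        double t = 0.0;
        for (long i = 0; i < 1000000; i++)
            t += (double)i;
        sink = t;
    }

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long count;
        double sum = 0.0, sumsq = 0.0;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_LD_INS);   /* loads executed */

        for (int i = 0; i < RUNS; i++) {
            PAPI_start(eventset);
            workload();
            PAPI_stop(eventset, &count);
            sum   += (double)count;
            sumsq += (double)count * (double)count;
        }

        double mean = sum / RUNS;
        double sd = sqrt(sumsq / RUNS - mean * mean);   /* population std dev */
        printf("mean = %.1f, standard deviation = %.1f\n", mean, sd);
        return 0;
    }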

10 Validation Micro-benchmark  Simple, usually small program  Stresses a portion of the microarchitecture or memory hierarchy  Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated  Basic types: array, loop, in-line, and floating-point  Scalable w.r.t. granularity, i.e., number of generated events PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

11 Example – Loop Validation Micro-benchmark for (i = 0; i < number_of_loops; i++) { sequence of 100 instructions with data dependencies that prevent compiler reordering or optimization } Used to stress a particular functional unit, e.g., the load/store unit PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
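A shortened sketch of such a loop micro-benchmark follows; the real body contains on the order of 100 dependent statements, and the volatile buffer is just one way to keep the compiler from removing the loads and stores.

    /* Sketch: stresses the load/store unit.  Each statement depends on the
     * previous one, so the compiler cannot reorder or eliminate the work.
     * Only 4 of the ~100 statements in the real body are shown. */
    volatile double buf[4] = {1.0, 2.0, 3.0, 4.0};

    void loadstore_workload(long number_of_loops)
    {
        double t = 0.0;
        for (long i = 0; i < number_of_loops; i++) {
            t = buf[0] + t;   /* load  */
            buf[1] = t;       /* store */
            t = buf[2] + t;   /* load  */
            buf[3] = t;       /* store */
            /* ... continues to roughly 100 dependent instructions ... */
        }
    }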

12 Configuration Micro-benchmark  Simple, usually small program  Designed to provide insight into structure and management algorithms of microarchitecture and/or memory hierarchy  Example: program to identify the page size used to store user data PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
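One possible shape for the page-size probe is sketched below: touch one byte every `stride` bytes of a large array while counting data-TLB misses; the smallest stride at which nearly every touch misses suggests the page size. The array size, stride range, and use of the PAPI_TLB_DM preset are assumptions for illustration, not the group's actual configuration micro-benchmark.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    #define ARRAY_BYTES (64L * 1024 * 1024)

    int main(void)
    {
        char *a = malloc(ARRAY_BYTES);
        int eventset = PAPI_NULL;
        long long misses;

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_TLB_DM);   /* data TLB misses */

        for (long stride = 1024; stride <= 64 * 1024; stride *= 2) {
            PAPI_start(eventset);
            for (long i = 0; i < ARRAY_BYTES; i += stride)
                a[i]++;                          /* one touch per stride */
            PAPI_stop(eventset, &misses);
            printf("stride %6ld: %lld misses for %ld touches\n",
                   stride, misses, ARRAY_BYTES / stride);
        }
        free(a);
        return 0;
    }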

13 Some Results

14 Reported Event Counts: Expected, Consistent and Quantifiable Results  Overhead related to PAPI and other sources is consistent and quantifiable  Reported Event Count – Predicted Event Count = Overhead PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

15 Example 1: Number of Loads Itanium, Power3, and R12K PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

16 Example 2: Number of Stores Itanium, Power3, and R12K PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

17 Example 2: Number of Stores Power3 and Itanium PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

    Platform    MIPS R12K    IBM Power3    Linux/IA-64    Linux/Pentium
    Loads           46            28            86             N/A
    Stores          31            12             9             N/A

18 Example 3: Total Number of Floating Point Operations – Pentium II, R10K and R12K, and Itanium

    Processor           Accurate    Consistent
    Pentium II
    MIPS R10K, R12K
    Itanium

Even when counters overflow. No overhead due to PAPI. PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

19 Reported Event Counts: Unexpected and Consistent Results -- Errors?  The hardware-reported counts are multiples of the predicted counts  Reported Event Count / Multiplier = Predicted Event Count  Cannot identify overhead for calibration PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

20 Example 1: Total Number of Floating-Point Operations – Power3 [Table columns: Accurate, Consistent] PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

21 Reported Counts: Expected (Not Quantifiable) Results  Predictions: only possible under special circumstances  Reported event counts seem reasonable  But are they useful without knowing more about the algorithm used by the vendor? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

22 Example 1: Total Data TLB Misses  Replacement policy can (unpredictably) affect event counts  PAPI may (unpredictably) affect event counts  Other processes may (unpredictably) affect event counts PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

23 Example 1: Total Compulsory Data TLB Misses for R10K  % difference per no. of references  Predicted values consistently lower than reported  Small standard deviations  Greater predictability with increased no. of references PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
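For reference, a sketch of how the predicted value and the percent difference used here could be computed; the assumption is one compulsory miss per distinct data page touched (cold TLB, sequential access).

    /* Sketch: predicted compulsory DTLB misses and percent difference from
     * the hardware-reported count.  Assumes a cold TLB and one miss per
     * distinct data page touched. */
    long predicted_dtlb_misses(long array_bytes, long page_bytes)
    {
        return (array_bytes + page_bytes - 1) / page_bytes;   /* ceiling */
    }

    double percent_difference(long long reported, long long predicted)
    {
        return 100.0 * (double)(reported - predicted) / (double)predicted;
    }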

24 Example 2: L1 D-Cache Misses # misses relatively constant as # of array references increases PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

25 Example 2: L1 D-Cache Misses  On some of the processors studied, as the number of accesses increased, the miss rate approached 0  Accessing the array in strides of two cache sizes plus one cache line resulted in approximately the same event count as accessing the array in strides of one word  What’s going on? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

26 Example 2: L1 D-Cache Misses with Random Access (Foil Prefetch Scheme Used by Stream Buffers) [Chart: % error in reported L1 D-cache misses as a function of % of cache filled, for Power3, R12K, and Pentium] PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
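A sketch of how "random access" might be implemented so that stream-buffer prefetching cannot anticipate the next line: visit each cache line exactly once in a shuffled order. The 32-byte line size and the Fisher-Yates shuffle are assumptions for illustration.

    #include <stdlib.h>

    #define LINE 32   /* assumed cache-line size in bytes */

    /* Touch every cache line of `a` exactly once, in random order, so a
     * sequential or strided prefetcher cannot predict the next access. */
    void random_walk(char *a, long lines)
    {
        long *order = malloc(lines * sizeof(long));
        for (long i = 0; i < lines; i++)
            order[i] = i;
        for (long i = lines - 1; i > 0; i--) {   /* Fisher-Yates shuffle */
            long j = rand() % (i + 1);
            long t = order[i]; order[i] = order[j]; order[j] = t;
        }
        for (long i = 0; i < lines; i++)
            a[order[i] * LINE]++;                /* one touch per line */
        free(order);
    }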

27 Example 2: A Mathematical Model that Verifies that Execution Time increases Proportionately with L1 D-Cache Misses total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
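The model written as code, with a made-up numeric example to show the proportionality; all values below are illustrative, not measurements.

    /* Sketch: the execution-time model from the slide above. */
    long long total_number_of_cycles(long long iterations,
                                     long long exec_cycles_per_iteration,
                                     long long cache_misses,
                                     long long cycles_per_cache_miss)
    {
        return iterations * exec_cycles_per_iteration
             + cache_misses * cycles_per_cache_miss;
    }

    /* Illustrative only: 1,000,000 iterations at 4 cycles each plus
     * 31,250 misses at 80 cycles each => 4,000,000 + 2,500,000 cycles. */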

28 Reported Event Counts: Unexpected but Consistent Results  Predicted counts and reported counts differ significantly but in a consistent manner  Is this an error?  Are we missing something? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

29 Example 1: Compulsory Data TLB Misses for Itanium  % difference per no. of references  Reported counts consistently ~5 times greater than predicted PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

30 Example 3: Compulsory Data TLB Misses for Power3  % difference per no. of references  Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

31 Reported Event Counts: Unexpected Results  Outliers  Puzzles PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

32 Example 1: Outliers L1 D-Cache Misses for Itanium PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

33 Example 1: Supporting Data – Itanium L1 Data Cache Misses

                                   Mean      Standard Deviation
    90% of data (1M accesses)      1,290             170
    10% of data (1M accesses)    782,891         566,370

PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

34 Example 2: R10K Floating-Point Division Instructions

Counted as 1 FP instruction:
    a = init_value; b = init_value; c = init_value;
    a = b / init_value; b = a / init_value; c = b / init_value;

Counted as 3 FP instructions:
    a = init_value; b = init_value; c = init_value;
    a = a / init_value; b = b / init_value; c = c / init_value;

PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

35 Example 2: Assembler Code Analysis  No optimization  Same instructions  Different (expected) operands  Three division instructions in both  No reason for different FP counts

Both versions compile to the same instruction sequence (only the operands differ):
    l.d s.d l.d s.d l.d s.d l.d div.d s.d l.d div.d s.d l.d div.d s.d

PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

36 Example 3: L1 D-Cache Misses with Random Access – Itanium Anomalous counts occur only when array size = 10x cache size PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

37 Example 4: L1 I-Cache Misses and Instructions Retired - Itanium PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002 Both about 17% more than expected.

38 Future Work  Extend events studied – include multiprocessor events  Extend processors studied – include Power4  Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

39 Conclusions  Performance counters provide informative data that can be used for performance tuning  Expected frequency of an event may determine the usefulness of its event counts  Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)  The usefulness of some event counts -- as well as our research -- could be enhanced with vendor collaboration  The usefulness of some event counts is questionable without documentation of the related behavior PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

40 Should we attach the following warning to some event counts on some platforms? CAUTION: The values in the performance counters may be greater than you think. PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

41 And should we attach the PCAT Seal of Approval on others? [PCAT seal graphic] PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

42 Invitation to Vendors Help us understand what’s going on, when to attach the “warning,” and when to attach the “seal of approval.” Application programmers will appreciate your efforts and so will we! PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

43 Question to You On-board Performance Counters: What do they really tell you? With all the caveats, are they useful nonetheless? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

