On-board Performance Counters: What do they really tell us? Pat Teller The University of Texas at El Paso (UTEP) PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Credits (Person Power)  Michael Maxwell, Graduate (Ph.D.) Student  Leonardo Salayandia, Graduate (M.S.) Student – graduating in Dec  Alonso Bayona, Undergraduate  Alexander Sainz, Undergraduate PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Credits (Financial)  DoD PET Program  NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program  UTEP Dodson Endowment PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Motivation  Facilitate performance-tuning efforts that employ aggregate event counts (not time-multiplexed) accessed via PAPI  When possible, provide calibration data, i.e., quantify overhead related to PAPI and other sources  Identify unexpected results – errors? misunderstandings of processor functionality? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Road Map  Scope of Research  Methodology  Results  Future Work and Conclusions PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Processors Under Study  MIPS R10K and R12K: 2 counters, 32 events  IBM Power3: 8 counters, 100+ events  Linux/IA-64: 4 counters, 150 events  Linux/Pentium: 2 counters, 80+ events PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Events Studied So Far  Number of load and store instructions executed  Number of floating-point instructions executed  Total number of instructions executed (issued/committed)  Number of L1 I-cache and L1 D-cache misses  Number of L2 cache misses  Number of TLB misses  Number of branch mispredictions PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
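For reference, these events correspond roughly to the following PAPI preset events; the slides do not name the presets actually used on each platform, so treat this mapping as an assumption rather than the authors' exact configuration.

/* Plausible PAPI presets for the events listed above (assumed, not taken from the slides). */
int events[] = {
    PAPI_LD_INS,   /* load instructions executed           */
    PAPI_SR_INS,   /* store instructions executed          */
    PAPI_FP_INS,   /* floating-point instructions executed */
    PAPI_TOT_IIS,  /* total instructions issued            */
    PAPI_TOT_INS,  /* total instructions completed         */
    PAPI_L1_ICM,   /* L1 I-cache misses                    */
    PAPI_L1_DCM,   /* L1 D-cache misses                    */
    PAPI_L2_TCM,   /* L2 cache misses                      */
    PAPI_TLB_DM,   /* data TLB misses                      */
    PAPI_BR_MSP    /* mispredicted branches                */
};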

PAPI Overhead  Extra instructions  Read counter before and after workload  Processing of counter overflow interrupts  Cache pollution  TLB pollution PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Methodology  [Configuration micro-benchmark]  Validation micro-benchmark – used to predict event count  Prediction via tool, mathematical model, and/or simulation  Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated)  Comparison/analysis  Report findings PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
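A minimal sketch of the PAPI collection step, using PAPI's low-level C API; run_microbenchmark() is a hypothetical placeholder for one of the validation micro-benchmarks, and in practice this measurement would be repeated 100 times to obtain the mean and standard deviation, as described above.

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

extern void run_microbenchmark(void);           /* hypothetical validation micro-benchmark */

int main(void)
{
    int eventset = PAPI_NULL;
    long long reported;                         /* hardware-reported event count */

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(1);
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_LD_INS);      /* e.g., count load instructions */

    PAPI_start(eventset);                       /* read the counter before the workload */
    run_microbenchmark();
    PAPI_stop(eventset, &reported);             /* ... and after it */

    printf("reported loads: %lld\n", reported); /* compare against the predicted count */
    return 0;
}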

Validation Micro-benchmark  Simple, usually small program  Stresses a portion of the microarchitecture or memory hierarchy  Its size, simplicity, or execution time facilitates tracing its execution path and/or predicting the number of times an event is generated  Basic types: array, loop, in-line, and floating-point  Scalable w.r.t. granularity, i.e., number of generated events PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example – Loop Validation Micro-benchmark for (i = 0; i < number_of_loops; i++) { /* sequence of 100 instructions with data dependencies that prevent compiler reordering or optimization */ } Used to stress a particular functional unit, e.g., the load/store unit PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
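A concrete, purely illustrative instance of such a loop aimed at the load/store unit (the variable names, the volatile buffer, and the count of four loads per iteration are my assumptions, not the original benchmark): the dependence chain through x, together with the volatile buffer, keeps the compiler from reordering or eliminating the memory operations, so the predicted load count is simply 4 * number_of_loops plus a small fixed overhead.

/* Hypothetical load-stress validation micro-benchmark (illustrative only). */
volatile double buf[4] = {1.0, 2.0, 3.0, 4.0};

double stress_loads(long number_of_loops)
{
    double x = 0.0;
    for (long i = 0; i < number_of_loops; i++) {
        x = x + buf[0];   /* load 1 */
        x = x + buf[1];   /* load 2, chained through x */
        x = x + buf[2];   /* load 3 */
        x = x + buf[3];   /* load 4 */
    }
    return x;             /* keep x live so the loop is not optimized away */
}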

Configuration Micro-benchmark  Simple, usually small program  Designed to provide insight into the structure and management algorithms of the microarchitecture and/or memory hierarchy  Example: program to identify the page size used to store user data PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
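One way such a page-size probe might work; this is a sketch under my own assumptions (buffer size, stride range, and the use of PAPI_TLB_DM are illustrative), not the authors' program. Touching one byte every stride bytes of a buffer much larger than the TLB reach keeps misses per touch well below 1 while the stride is smaller than a page; the smallest stride at which misses per touch reach about 1 reveals the page size used for user data.

#include <papi.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_BYTES (64L * 1024 * 1024)   /* assumed: much larger than the TLB reach */

int main(void)
{
    char *buf = malloc(BUF_BYTES);
    int es = PAPI_NULL;
    long long misses;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_TLB_DM);    /* data TLB misses */

    for (long stride = 1024; stride <= 64 * 1024; stride *= 2) {
        long touches = 0;
        PAPI_start(es);
        for (long i = 0; i < BUF_BYTES; i += stride, touches++)
            buf[i]++;                   /* touch one byte per stride */
        PAPI_stop(es, &misses);
        printf("stride %6ld: %.3f TLB misses per touch\n",
               stride, (double)misses / touches);
    }
    free(buf);
    return 0;
}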

Some Results

Reported Event Counts: Expected, Consistent, and Quantifiable Results  Overhead related to PAPI and other sources is consistent and quantifiable  Reported Event Count – Predicted Event Count = Overhead PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 1: Number of Loads Itanium, Power3, and R12K PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 2: Number of Stores Itanium, Power3, and R12K PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 2: Number of Stores – Power3 and Itanium
Platform   MIPS R12K   IBM Power3   Linux/IA-64   Linux/Pentium
Loads      46          28           86            N/A
Stores     31          12           9             N/A
PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 3: Total Number of Floating-Point Operations – Pentium II, R10K and R12K, and Itanium  For each of these processors the reported counts were both accurate and consistent, even when counters overflow. No overhead due to PAPI. PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Reported Event Counts: Unexpected and Consistent Results – Errors?  The hardware-reported counts are multiples of the predicted counts  Reported Event Count / Multiplier = Predicted Event Count  Cannot identify overhead for calibration PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 1: Total Number of Floating-Point Operations – Power3 (Accurate / Consistent) PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Reported Counts: Expected (Not Quantifiable) Results  Predictions: only possible under special circumstances  Reported event counts seem reasonable  But are they useful without knowing more about the algorithm used by the vendor? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 1: Total Data TLB Misses  Replacement policy can (unpredictably) affect event counts  PAPI may (unpredictably) affect event counts  Other processes may (unpredictably) affect event counts PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 1: Total Compulsory Data TLB Misses for R10K  % difference per no. of references  Predicted values consistently lower than reported  Small standard deviations  Greater predictability with increased no. of references PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
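The kind of analytical model the methodology mentions can be stated very simply for this case; the following is a sketch under my own assumptions (each data page touched for the first time exactly once during a sequential pass, no evictions), not necessarily the authors' exact model.

/* Predicted compulsory data-TLB misses for one sequential pass over an array. */
long predicted_dtlb_misses(long array_bytes, long page_bytes)
{
    return (array_bytes + page_bytes - 1) / page_bytes;   /* ceil(array / page) */
}

Because replacement policy, PAPI itself, and other processes add misses unpredictably, the reported counts land above such a prediction by an amount that cannot be cleanly calibrated out.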

Example 2: L1 D-Cache Misses  The number of misses stays relatively constant as the number of array references increases PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 2: L1 D-Cache Misses  On some of the processors studied, as the number of accesses increased, the miss rate approached 0  Accessing the array in strides of size two cache-size units plus one cache-line resulted in approximately the same event count as accessing the array in strides of one word  What’s going on? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
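To make the comparison concrete, here is a hypothetical version of the two access patterns; the cache size, line size, array size, and function name are my assumptions, since the slides do not give the benchmark parameters. On several of the processors studied, both patterns reported roughly the same L1 D-cache miss count, which is the puzzle described above.

#define L1_SIZE   (32 * 1024)                    /* assumed L1 D-cache size */
#define LINE_SIZE 64                             /* assumed L1 line size    */
#define WORDS     (16 * 1024 * 1024 / sizeof(double))

double a[WORDS];

/* Walk the array `accesses` times with a given stride (in words), wrapping around.
 * stride_words = 1 gives the one-word-stride pattern; stride_words =
 * (2 * L1_SIZE + LINE_SIZE) / sizeof(double) gives the two-cache-sizes-plus-one-line pattern. */
double sweep(long accesses, long stride_words)
{
    double s = 0.0;
    long j = 0;
    for (long i = 0; i < accesses; i++) {
        s += a[j];
        j = (j + stride_words) % WORDS;
    }
    return s;
}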

Example 2: L1 D-Cache Misses with Random Access (Foils the Prefetch Scheme Used by Stream Buffers)  Chart: % error in reported L1 D-cache misses as a function of % of cache filled, for Power3, R12K, and Pentium PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
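A sketch of how the random access might be arranged so that stream buffers and other sequential prefetchers see no useful pattern; the shuffled-index approach and all names here are my assumptions, since the slides do not show the benchmark itself.

#include <stdlib.h>

/* Fisher-Yates shuffle: fix a random visiting order before measurement begins. */
void shuffle(long *idx, long n)
{
    for (long i = 0; i < n; i++)
        idx[i] = i;
    for (long i = n - 1; i > 0; i--) {
        long j = rand() % (i + 1);
        long t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
}

/* Visit every element exactly once, in the shuffled order, so the next line
 * fetched is unpredictable to the hardware prefetcher. */
double random_sweep(const double *a, const long *idx, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[idx[i]];
    return s;
}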

Example 2: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses  total_number_of_cycles = iterations * exec_cycles_per_iteration + cache_misses * cycles_per_cache_miss PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002
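For example, with purely illustrative numbers (not taken from the slides): 1,000,000 iterations at 4 cycles each plus 31,250 misses at 80 cycles each gives 4,000,000 + 2,500,000 = 6,500,000 cycles; doubling the miss count adds another 2,500,000 cycles, which is the linear growth in execution time that the model predicts.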

Reported Event Counts: Unexpected but Consistent Results  Predicted counts and reported counts differ significantly but in a consistent manner  Is this an error?  Are we missing something? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 1: Compulsory Data TLB Misses for Itanium  % difference per no. of references  Reported counts consistently ~5 times greater than predicted PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 3: Compulsory Data TLB Misses for Power3  % difference per no. of references  Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Reported Event Counts: Unexpected Results  Outliers  Puzzles PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 1: Outliers L1 D-Cache Misses for Itanium PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 1: Supporting Data – Itanium L1 Data Cache Misses
                               Mean       Standard Deviation
90% of data (1M accesses)      1,…        …
10% of data (1M accesses)      782,891    566,370
PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 2: R10K Floating-Point Division Instructions
1 FP instruction counted:
a = init_value; b = init_value; c = init_value;
a = b / init_value;
b = a / init_value;
c = b / init_value;
3 FP instructions counted:
a = init_value; b = init_value; c = init_value;
a = a / init_value;
b = b / init_value;
c = c / init_value;
PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 2: Assembler Code Analysis  No optimization  Same instructions  Different (expected) operands  Three division instructions in both  No reason for different FP counts
Both variants compile to the same instruction sequence (only the operands differ):
l.d / s.d, l.d / s.d, l.d / s.d (initializations), then l.d / div.d / s.d, l.d / div.d / s.d, l.d / div.d / s.d (divisions)
PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 3: L1 D-Cache Misses with Random Access – Itanium  Unexpected counts only when array size = 10x cache size PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Example 4: L1 I-Cache Misses and Instructions Retired – Itanium  Both about 17% more than expected. PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Future Work  Extend events studied – include multiprocessor events  Extend processors studied – include Power4  Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Conclusions  Performance counters provide informative data that can be used for performance tuning  The expected frequency of an event may determine the usefulness of its counts  Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)  The usefulness of some event counts – as well as of our research – could be enhanced with vendor collaboration  The usefulness of some event counts is questionable without documentation of the related behavior PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Should we attach the following warning to some event counts on some platforms? CAUTION: The values in the performance counters may be greater than you think. PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

And should we attach the PCAT Seal of Approval to others? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Invitation to Vendors Help us understand what’s going on, when to attach the “warning,” and when to attach the “seal of approval.” Application programmers will appreciate your efforts and so will we! PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002

Question to You On-board Performance Counters: What do they really tell you? With all the caveats, are they useful nonetheless? PTools 2002 Annual Meeting, University of Tennessee, September 10-12, 2002