Performance Monitoring on the Intel ® Itanium ® 2 Processor CGO’04 Tutorial 3/21/04 CK. Luk Massachusetts Microprocessor Design.

Performance Monitoring on the Intel® Itanium® 2 Processor
CGO’04 Tutorial, 3/21/04
CK. Luk
Massachusetts Microprocessor Design Center, Intel Corporation

Itanium 2 Performance Monitoring
The Itanium Architecture defines a generic framework for the Performance Monitoring Unit (PMU):
  – Consistent software APIs across processor models
  – Yet, a processor model can implement its own PMU extension
Generic PMU support:
  – 4 64-bit Performance Monitor Data registers (PMDs) (extensible to 256 in total)
  – 8 64-bit Performance Monitor Configuration registers (PMCs) (extensible to 256 in total)
  – A performance monitor = 1 PMC + N PMDs (where N >= 1)
  – 3 additional status/control registers: PSR, DCR, PMV
Itanium 2 PMU support:
  – Monitors a rich set (140+) of events
  – 16 PMCs, 18 PMDs (4 for event counting, the others for buffering event-specific info)
  – Can pinpoint exactly where a “miss” event happened in the program
PMU usages:
  – Workload characterization
  – Profiling

Workload Characterization
Understand the performance characteristics of the workload
Two fundamental measures of interest:
  – Event occurrences: how often did certain events occur?
  – Cycle accounting: how were the cycles spent during execution?

Measuring Event Occurrences
On Itanium 2, 140+ micro-architectural events can be monitored:
  – Basic: clock cycles, retired instructions
  – Instruction Dispersal, Instruction Execution, Branch, Pipeline Stall
  – Cache, TLB, RSE
  – System, Bus
Much more information can be derived, e.g.:
  – IPC = IA64_INST_RETIRED / CPU_CYCLES
  – #I-cache refs = L1I_READS + L1I_PREFETCHES
Multi-occurrence events (> 1 event per cycle):
  – E.g., instructions retired, number of live entries in the issue queue
  – Thresholding: an event is counted only if it occurs more times than the threshold in a cycle
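
The derived metrics and the thresholding rule on this slide can be sketched in a few lines of Python. This is only an illustration: the counter values are made up, `derived_metrics` and `count_with_threshold` are hypothetical helper names (not part of any Intel tool), and the sketch assumes a thresholded counter advances by one for each cycle whose occurrence count exceeds the threshold.

```python
def derived_metrics(counts):
    """Compute derived metrics from raw event counts (dict: event name -> count)."""
    return {
        "IPC": counts["IA64_INST_RETIRED"] / counts["CPU_CYCLES"],
        "icache_refs": counts["L1I_READS"] + counts["L1I_PREFETCHES"],
    }


def count_with_threshold(per_cycle_occurrences, threshold):
    """Count the cycles in which a multi-occurrence event exceeds `threshold`.

    Assumption: the counter advances by one per qualifying cycle.
    """
    return sum(1 for n in per_cycle_occurrences if n > threshold)


# Made-up counter readings, for illustration only.
sample = {
    "IA64_INST_RETIRED": 3000,
    "CPU_CYCLES": 2000,
    "L1I_READS": 500,
    "L1I_PREFETCHES": 120,
}
metrics = derived_metrics(sample)

# Instructions retired in each of six hypothetical cycles.
retired_per_cycle = [0, 2, 6, 3, 1, 4]
busy_cycles = count_with_threshold(retired_per_cycle, threshold=2)
```

With a threshold of 2, only the cycles that retired 6, 3, and 4 instructions are counted.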

Itanium 2 Pipeline with Events
[Figure: the Itanium 2 pipeline (stages IPG, ROT, EXP, REN, REG, EXE, DET, WRB), split into a front end and a back end around the instruction buffer, annotated with the events observed at each point: FE_BUBBLE, FE_LOST_BW, BACK_END_BUBBLE.FE, BE_LOST_BW_DUE_TO_FE, IDEAL_BE_LOST_BW_DUE_TO_FE, DISP_STALLED, INST_DISPERSED, SYLL_NOT_DISPERSED, BE_RSE_BUBBLE, BE_EXE_BUBBLE, BE_L1D_FPU_BUBBLE, BE_FLUSH_BUBBLE.BRU, BE_FLUSH_BUBBLE.XPN, and "back end not stalled".]

Cycle Accounting
Cycle breakdown by stall and flush reasons:
  – Overlapping stall and flush conditions are prioritized in the reverse pipeline order
CPU_CYCLES is broken down into (in decreasing priority):
  – Branch Mispredict Flush
  – Exception/Interruption Flush
  – Data Access Stall
  – Scoreboard and Register Dependency Stall
  – RSE Spill/Fill Stall
  – Front End Stall
  – IA64_INST_RETIRED >= 1 (unstalled execution)
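
The prioritization rule can be illustrated with a small Python sketch: when several stall or flush conditions overlap in a cycle, the cycle is charged to the highest-priority one only, so the buckets always sum to the total cycle count. The category names, the `account_cycles` helper, and the per-cycle trace are hypothetical; real hardware does this attribution in the PMU rather than from a software trace.

```python
from collections import Counter

# Categories in decreasing priority, in the order the slide lists them.
PRIORITY = [
    "branch_mispredict_flush",
    "exception_flush",
    "data_access_stall",
    "scoreboard_stall",
    "rse_stall",
    "front_end_stall",
]


def account_cycles(cycles):
    """Charge each cycle to exactly one bucket.

    `cycles` is a list of sets; each set holds the conditions active in that
    cycle. An empty set means the cycle was unstalled.
    """
    breakdown = Counter()
    for active in cycles:
        # Pick the first (highest-priority) category that is active.
        bucket = next((c for c in PRIORITY if c in active), "unstalled")
        breakdown[bucket] += 1
    return breakdown


# Made-up five-cycle trace with two overlapping conditions.
trace = [
    set(),
    {"front_end_stall"},
    {"data_access_stall", "front_end_stall"},
    {"branch_mispredict_flush", "rse_stall"},
    set(),
]
breakdown = account_cycles(trace)
```

Because every cycle lands in exactly one bucket, `sum(breakdown.values())` equals `len(trace)`.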

Profiling
Common uses of profile feedback:
  – Manual tuning of applications
  – Driving automatic compiler optimizations
Profiling support on the Itanium 2:
  – Program counter (PC) sampling
  – Miss event address sampling
  – Event qualification (filtering)

Program Counter Sampling
Two ways to sample the program counter:
  – Time-based: sample the PC with timer interrupts (when ITC == ITM)
  – Event-based: sample the PC when an event counter overflows
PC sampling can be used to derive basic-block execution counts
But it can’t precisely identify which instructions are causing performance problems
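
As a rough illustration of how PC samples lead to basic-block execution estimates, the Python sketch below buckets sampled PCs by the basic block that contains them. The block start addresses, the sample values, and the `block_histogram` helper are all hypothetical. The resulting sample counts are only proportional estimates of how much time was spent in each block, not exact execution counts.

```python
from bisect import bisect_right
from collections import Counter


def block_histogram(sampled_pcs, block_starts):
    """Count PC samples per basic block.

    `block_starts` must be sorted ascending; each block is assumed to extend
    from its start address up to the next block's start (the last block is
    treated as open-ended in this sketch).
    """
    hist = Counter()
    for pc in sampled_pcs:
        i = bisect_right(block_starts, pc) - 1
        if i >= 0:  # ignore samples that fall before the first block
            hist[block_starts[i]] += 1
    return hist


# Hypothetical basic-block start addresses and sampled PCs.
block_starts = [0x1000, 0x1040, 0x10A0]
samples = [0x1000, 0x1010, 0x1050, 0x1044, 0x10B0, 0x1060, 0x0FF0]
hist = block_histogram(samples, block_starts)
```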

Miss Event Address Sampling
Problems:
  – When a cache miss or branch mispredict event occurs, PC sampling tends to indicate the stall point, not the source
  – The sampled PC is N instructions away from the actual miss instruction
  – N is non-deterministic due to the nature of the micro-architecture
Solution provided by the Itanium 2:
  – Hardware provides a set of Event Address Registers (EARs) to record the instruction and data addresses of the offending instruction (plus other useful information)

Three Types of EARs
Instruction EAR (I-EAR):
  – I-cache
  – I-TLB
Data EAR (D-EAR):
  – D-cache
  – D-TLB
  – ALAT
Branch Trace Buffer (BTB)
The I-EAR, D-EAR, and BTB can be activated simultaneously

Instruction EAR
Instruction Cache:
  – Triggers on I-cache misses
  – Records the instruction address and the fetch latency
Instruction TLB:
  – Triggers on I-TLB misses
  – Records the instruction address and who serviced the miss (VHPT or software)

Data EAR
Data Cache:
  – Triggers on D-cache misses
  – Records the load PC, the data address, and the load latency
Data TLB:
  – Triggers on D-TLB misses
  – Records the load PC, the data address, and who serviced the miss (L2 D-TLB, VHPT, or software)
ALAT:
  – Triggers on ALAT misses
  – Records the PC of the instruction (chk.a or ld.c) that missed

Example of a D-EAR Sample
  : [MFI] addl r16=4812,r1
          nop.f 0x0
          addl r17=4804,r1;;
  : [MMI] ld4 r14=[r16];;
          adds r14=1,r14
          nop.i 0x0;;
  : [MMB] st4 [r17]=r14
          nop.m 0x0
          br.ret.sptk.many b0;;
pmd2 = 0x c04 (data address)
pmd3 = 0x7 (7 cycles)
pmd17 = 0x (load’s PC)
  – bits 0-1: slot
  – bit 2: bundle
  – bit 3: valid
  – bits 4-63: bundle addr
interrupted pc = 0x
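
The pmd17 bit layout given on this slide can be unpacked with a few shifts and masks. The Python sketch below is illustrative only: `decode_pmd17` is a hypothetical helper, the example register value is made up (the value on the slide is not fully legible), and the sketch assumes that bits 4-63 hold the 16-byte-aligned bundle address in place, so masking off the low four bits yields the address.

```python
def decode_pmd17(value):
    """Split a pmd17 value into the fields listed on the slide."""
    return {
        "slot": value & 0x3,                # bits 0-1: slot within the bundle
        "bundle": (value >> 2) & 0x1,       # bit 2: bundle bit
        "valid": bool((value >> 3) & 0x1),  # bit 3: sample is valid
        # bits 4-63: bundle address (assumed to be stored in place)
        "bundle_addr": value & 0xFFFFFFFFFFFFFFF0,
    }


# Made-up register value, not the one from the slide.
example = 0x4000000000000A69
fields = decode_pmd17(example)
```

A profiler would typically discard the sample when `valid` is false before attributing the miss to `bundle_addr` and `slot`.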

Branch Trace Buffer (BTB)
A circular buffer of 8 entries:
  – Captures the last 4 branches (each branch occupies two entries: the branch address and its target)
Can select branches to monitor based on:
  – Taken/not taken, path prediction, target prediction
  – Type (any combination of IP-relative, return, non-return indirect)
Information recorded for each branch:
  – PC of the branch itself
  – Target bundle address of the branch
  – Taken or not
  – Correctly predicted or not
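
A circular buffer of 8 entries that retains the last 4 branches can be modeled in Python with a bounded deque. This is a simplified software model, not the hardware register format: the `BranchTraceBuffer` class and its methods are hypothetical, and it assumes every recorded branch writes exactly two entries (one for the branch, one for its target).

```python
from collections import deque


class BranchTraceBuffer:
    """Toy model of an 8-entry circular branch trace buffer."""

    def __init__(self, entries=8):
        # A bounded deque silently drops the oldest entries when full.
        self.buf = deque(maxlen=entries)

    def record(self, branch_pc, target, taken, predicted_ok):
        # Assumption: each branch occupies two consecutive entries.
        self.buf.append(("branch", branch_pc, taken, predicted_ok))
        self.buf.append(("target", target))

    def last_branches(self):
        """Return (branch_pc, target) pairs, oldest first."""
        items = list(self.buf)
        return [(items[i][1], items[i + 1][1]) for i in range(0, len(items), 2)]


btb = BranchTraceBuffer()
for i in range(6):  # record six made-up branches
    btb.record(0x1000 + 0x10 * i, 0x2000 + 0x10 * i, taken=True, predicted_ok=(i % 2 == 0))
recent = btb.last_branches()
```

After six branches have been recorded, only the four most recent remain in the buffer.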

Event Qualification (Filtering)
An event is counted only if certain constraints are met
Constraints supported:
  – Instruction address range check: monitor specific DLLs, functions, loops
  – Instruction opcode match: monitor specific instruction types or register usages
  – Data address range check: focus on particular data structures
  – Event umasks: further specification within an event (e.g., bus transactions originating from the core or from I/O)
  – Instruction set check: IA-32 or IA-64
  – Privilege level (2 dimensions): process vs. system; user vs. kernel vs. interrupt
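
Event qualification amounts to counting an event only when it passes every configured constraint. The Python sketch below models three of the constraints from this slide (instruction address range, data address range, privilege level) as a software predicate. The `Event` and `Qualifier` types, the addresses, and the privilege labels are hypothetical; on the real PMU these checks are configured through registers and applied in hardware.

```python
from dataclasses import dataclass


@dataclass
class Event:
    pc: int          # address of the instruction that caused the event
    data_addr: int   # data address touched by the instruction
    privilege: str   # "user" or "kernel" in this sketch


@dataclass
class Qualifier:
    code_range: tuple  # half-open (low, high) instruction address range
    data_range: tuple  # half-open (low, high) data address range
    privilege: str

    def matches(self, ev):
        code_lo, code_hi = self.code_range
        data_lo, data_hi = self.data_range
        return (
            code_lo <= ev.pc < code_hi
            and data_lo <= ev.data_addr < data_hi
            and ev.privilege == self.privilege
        )


# Count only user-level events from one function that touch one data structure.
qual = Qualifier(code_range=(0x4000, 0x4100), data_range=(0x9000, 0x9800), privilege="user")
events = [
    Event(0x4010, 0x9010, "user"),    # passes all three checks
    Event(0x4010, 0x9010, "kernel"),  # wrong privilege level
    Event(0x5000, 0x9010, "user"),    # outside the code range
    Event(0x4020, 0xA000, "user"),    # outside the data range
    Event(0x40F0, 0x97F8, "user"),    # passes all three checks
]
qualified_count = sum(1 for ev in events if qual.matches(ev))
```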

References
General Itanium PMUs:
  – IA-64 Linux Kernel: Design and Implementation, D. Mosberger and S. Eranian (Chapter 9)
Itanium 2-specific PMUs:
  – Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization (Chapters 10 and 11)