Performance Monitoring on the Intel® Itanium® 2 Processor. CGO’04 Tutorial, 3/21/04. CK. Luk, Massachusetts Microprocessor Design Center.


1. Performance Monitoring on the Intel® Itanium® 2 Processor
CGO’04 Tutorial, 3/21/04
CK. Luk, chi-keung.luk@intel.com
Massachusetts Microprocessor Design Center, Intel Corporation

2. Itanium 2 Performance Monitoring
The Itanium Architecture defines a generic framework for the Performance Monitoring Unit (PMU):
– Consistent software APIs across processor models
– Yet, a processor model can implement its own PMU extension
Generic PMU support:
– Four 64-bit Performance Monitor Data registers (PMDs) (extensible to 256 in total)
– Eight 64-bit Performance Monitor Configuration registers (PMCs) (extensible to 256 in total)
– A performance monitor = 1 PMC + N PMDs (where N >= 1)
– Three additional status/control registers: PSR, DCR, PMV
Itanium 2 PMU support:
– Monitor a rich set (140+) of events
– 16 PMCs, 18 PMDs (4 for event counting, others for buffering event-specific info.)
  Can pinpoint exactly where a “miss” event happened in the program
PMU usages:
– Workload characterization
– Profiling

3. Workload Characterization
Understand the performance characteristics of the workload
Two fundamental measures of interest:
– Event occurrences
  How often did certain events occur?
– Cycle accounting
  How were the cycles spent during execution?

4. Measuring Event Occurrences
On Itanium 2, 140+ monitored micro-architectural events:
– Basic: clock cycles, retired instructions
– Instruction Dispersal, Instruction Execution, Branch
– Pipeline Stall
– Cache, TLB, RSE
– System, Bus
Much more information can be derived, e.g.:
– IPC = IA64_INST_RETIRED / CPU_CYCLES
– #I-cache refs = L1I_READS + L1I_PREFETCHES
Multi-occurrence events (> 1 event per cycle)
– E.g., instructions retired, # live entries in the issue queue
Thresholding:
– An event is counted only if it occurs more times than the threshold in a cycle
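The two derived metrics and the thresholding rule on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not a PMU driver: the event names come from the slide, but the counter values are made up, and the thresholding helper assumes the counter advances by one for each cycle that exceeds the threshold.

```python
def derived_metrics(counts):
    """Compute the two derived metrics named on the slide from raw event counts."""
    ipc = counts["IA64_INST_RETIRED"] / counts["CPU_CYCLES"]
    icache_refs = counts["L1I_READS"] + counts["L1I_PREFETCHES"]
    return {"IPC": ipc, "I-cache refs": icache_refs}


def count_with_threshold(per_cycle_occurrences, threshold):
    """Count the cycles in which an event occurred more than `threshold` times.

    Assumption: a thresholded counter advances by one per qualifying cycle.
    """
    return sum(1 for n in per_cycle_occurrences if n > threshold)


if __name__ == "__main__":
    # Hypothetical counter readings, for illustration only.
    counts = {
        "CPU_CYCLES": 1000,
        "IA64_INST_RETIRED": 1500,
        "L1I_READS": 400,
        "L1I_PREFETCHES": 100,
    }
    print(derived_metrics(counts))
    # Hypothetical per-cycle retirement counts for six cycles.
    print(count_with_threshold([0, 1, 3, 6, 2, 4], threshold=2))
```

With a threshold of 2, only the cycles that retired 3, 6, and 4 instructions qualify, so a multi-occurrence event is turned into a count of "busy" cycles.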

5. Itanium 2 Pipeline with Events
[Figure: the Itanium 2 pipeline stages (IPG, ROT, EXP, REN, REG, EXE, DET, WRB), divided into a front end and a back end around the instruction buffer, annotated with the events observable at each point.]
Events shown: FE_BUBBLE, FE_LOST_BW, IDEAL_BE_LOST_BW_DUE_TO_FE, BE_LOST_BW_DUE_TO_FE, BACK_END_BUBBLE.FE, DISP_STALLED, INST_DISPERSED, SYLL_NOT_DISPERSED, BE_RSE_BUBBLE, BE_EXE_BUBBLE, BE_L1D_FPU_BUBBLE, BE_FLUSH_BUBBLE.BRU, BE_FLUSH_BUBBLE.XPN, and "Back End not stalled".

6. Cycle Accounting
Cycle breakdown by stall and flush reasons:
– Overlapping stall and flush conditions are prioritized in the reverse pipeline order
[Figure: CPU_CYCLES broken down into the components below, marked "decreasing priority"]
– Branch Mispredict Flush
– Exception/Interruption Flush
– Data Access Stall
– Scoreboard and Register Dependency Stall
– RSE Spill/Fill Stall
– Front End Stall
– IA64_INST_RETIRED >= 1
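As a rough illustration of cycle accounting, the sketch below turns a set of stall and flush counts into fractions of CPU_CYCLES and treats the remainder as unstalled cycles. The pairing of each category with an event name from the pipeline slide is an assumption made for illustration and should be checked against the reference manual; the counter values are invented. Because the hardware prioritizes overlapping conditions, each cycle is assumed to be charged to exactly one category.

```python
# Assumed pairing of the slide's categories with event names from the
# pipeline slide; verify against the processor reference manual.
STALL_EVENTS = [
    ("Branch mispredict flush", "BE_FLUSH_BUBBLE.BRU"),
    ("Exception/interruption flush", "BE_FLUSH_BUBBLE.XPN"),
    ("Data access stall", "BE_L1D_FPU_BUBBLE"),
    ("Scoreboard and register dependency stall", "BE_EXE_BUBBLE"),
    ("RSE spill/fill stall", "BE_RSE_BUBBLE"),
    ("Front end stall", "BACK_END_BUBBLE.FE"),
]


def cycle_breakdown(counts):
    """Return each category's share of CPU_CYCLES, plus the unstalled remainder."""
    total = counts["CPU_CYCLES"]
    breakdown = {}
    stalled = 0
    for label, event in STALL_EVENTS:
        cycles = counts.get(event, 0)
        breakdown[label] = cycles / total
        stalled += cycles
    # Whatever is not charged to a stall or flush is treated as unstalled.
    breakdown["Unstalled"] = (total - stalled) / total
    return breakdown


if __name__ == "__main__":
    # Hypothetical counter readings, for illustration only.
    sample = {
        "CPU_CYCLES": 1000,
        "BE_FLUSH_BUBBLE.BRU": 50,
        "BE_FLUSH_BUBBLE.XPN": 10,
        "BE_L1D_FPU_BUBBLE": 300,
        "BE_EXE_BUBBLE": 100,
        "BE_RSE_BUBBLE": 20,
        "BACK_END_BUBBLE.FE": 120,
    }
    for label, share in cycle_breakdown(sample).items():
        print(f"{label:45s} {share:6.1%}")
```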

7. Profiling
Common uses of profile feedback:
– Manual tuning of applications
– Driving automatic compiler optimizations
Profiling support on the Itanium 2:
– Program counter (PC) sampling
– Miss event address sampling
– Event qualification (filtering)

8. Program Counter Sampling
Two ways to sample the program counter:
– Time-based: sample the PC with timer interrupts (when ITC == ITM)
– Event-based: sample the PC when an event counter overflows
PC sampling can be used to derive basic-block execution counts
But it can’t precisely identify which instructions are causing performance problems
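A minimal sketch of how sampled PCs might be attributed to basic blocks is shown below. The block start addresses and samples are hypothetical, and the result is a per-block sample count (a relative weight), not an exact execution count.

```python
from bisect import bisect_right
from collections import Counter


def samples_per_block(sampled_pcs, block_starts):
    """Attribute each sampled PC to the basic block that contains it.

    block_starts must be a sorted list of basic-block start addresses.
    Samples below the first block are dropped; the last block is treated
    as open-ended because no end address is given.
    """
    hits = Counter()
    for pc in sampled_pcs:
        i = bisect_right(block_starts, pc) - 1
        if i >= 0:
            hits[block_starts[i]] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical block boundaries and samples, for illustration only.
    blocks = [0x100, 0x140, 0x180]
    samples = [0x100, 0x110, 0x150, 0x190, 0x1A0, 0x90]
    for start, n in sorted(samples_per_block(samples, blocks).items()):
        print(f"block {start:#x}: {n} samples")
```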

9. Miss Event Address Sampling
Problems:
– When a cache miss/branch mispredict event occurs, PC sampling tends to indicate the stall point, not the source
  The sampled PC is N instructions off the actual miss instruction
  N is non-deterministic due to the nature of the micro-architecture
Solution provided by the Itanium 2:
– Hardware provides a set of Event Address Registers (EARs) to record the instruction and data addresses of the offending instruction (plus other useful information)

10. Three Types of EARs
Instruction EAR (I-EAR)
– I-cache
– I-TLB
Data EAR (D-EAR)
– D-cache
– D-TLB
– ALAT
Branch Trace Buffer (BTB)
I-EAR, D-EAR, and BTB can be activated simultaneously

11. Instruction EAR
Instruction Cache
– Triggers on I-cache misses
– Records the instruction address and the fetch latency
Instruction TLB
– Triggers on I-TLB misses
– Records the instruction address and who serviced the miss (VHPT or software)

12. Data EAR
Data Cache
– Triggers on D-cache misses
– Records the load PC, data address, and the load latency
Data TLB
– Triggers on D-TLB misses
– Records the load PC, data address, and who serviced the miss (L2 D-TLB, VHPT, or software)
ALAT
– Triggers on ALAT misses
– Records the PC of the instruction (chk.a or ld.c) that missed

13. Example of a D-EAR Sample
4000000000000450: [MFI] addl r16=4812,r1
                        nop.f 0x0
                        addl r17=4804,r1;;
4000000000000460: [MMI] ld4 r14=[r16];;
                        adds r14=1,r14
                        nop.i 0x0;;
4000000000000470: [MMB] st4 [r17]=r14
                        nop.m 0x0
                        br.ret.sptk.many b0;;
pmd2  = 0x6000000000005c04   data address
pmd3  = 0x7                  load latency (7 cycles)
pmd17 = 0x4000000000000468   load’s PC
  bits 0-1: slot
  bit 2: bundle
  bit 3: valid
  bits 4-63: bundle addr
interrupted pc = 0x4000000000000470
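The pmd17 field layout on this slide can be unpacked with a few bit operations. The sketch below follows the bit positions exactly as listed on the slide; the function name and the dictionary keys are made up for illustration, and the meaning of the "bundle" bit is left uninterpreted.

```python
def decode_pmd17(value):
    """Split a D-EAR pmd17 value into the fields listed on the slide."""
    return {
        "slot": value & 0x3,                 # bits 0-1
        "bundle": (value >> 2) & 0x1,        # bit 2
        "valid": bool((value >> 3) & 0x1),   # bit 3
        "bundle_addr": value & ~0xF,         # bits 4-63, low four bits cleared
    }


if __name__ == "__main__":
    fields = decode_pmd17(0x4000000000000468)
    print(f"valid={fields['valid']} slot={fields['slot']} "
          f"bundle={fields['bundle']} bundle_addr={fields['bundle_addr']:#x}")
```

For the sample value, the valid bit is set and the decoded location is slot 0 of the bundle at 0x4000000000000460, which is the `ld4 r14=[r16]` instruction, while the interrupted PC (0x4000000000000470) is already one bundle further on.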

14. Branch Trace Buffer (BTB)
A circular buffer of 8 entries
– Captures the last 4 branches
Can select branches to monitor based on:
– Taken/not taken, path prediction, target prediction
– Type (any combo of ip-rel, ret, non-ret indirect)
Information recorded for each branch:
– PC of the branch itself
– Target bundle address of the branch
– Taken or not
– Correctly predicted or not
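The behaviour of an 8-entry circular buffer that retains the last 4 branches can be modelled in software as below. This is a toy model, not the hardware register format: it assumes, for simplicity, that every recorded branch occupies exactly two entries (one for the branch, one for its target), and the class and method names are hypothetical.

```python
from collections import deque


class BranchTraceBuffer:
    """Toy model of a circular branch trace buffer.

    Assumption: each branch takes two entries (source and target), so an
    8-entry buffer holds the 4 most recent branches.
    """

    def __init__(self, entries=8):
        # A bounded deque silently drops the oldest entry when full.
        self._buf = deque(maxlen=entries)

    def record(self, branch_pc, target, taken, predicted_ok):
        self._buf.append(("source", branch_pc, taken, predicted_ok))
        self._buf.append(("target", target))

    def last_branches(self):
        """Return (branch_pc, target, taken, predicted_ok) tuples, oldest first."""
        items = list(self._buf)
        return [
            (items[i][1], items[i + 1][1], items[i][2], items[i][3])
            for i in range(0, len(items), 2)
        ]


if __name__ == "__main__":
    btb = BranchTraceBuffer()
    # Hypothetical branch stream, for illustration only.
    for i in range(6):
        btb.record(branch_pc=0x1000 + 0x10 * i, target=0x2000 + 0x10 * i,
                   taken=True, predicted_ok=(i % 2 == 0))
    for pc, target, taken, ok in btb.last_branches():
        print(f"{pc:#x} -> {target:#x} taken={taken} predicted_ok={ok}")
```

After six branches are recorded, only the four most recent remain, which mirrors the "8 entries, last 4 branches" statement on the slide.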

15. Event Qualification (Filtering)
An event is counted only if certain constraints are met
Constraints supported:
– Instruction address range check
  Monitor specific DLLs, functions, loops
– Instruction opcode match
  Monitor specific instruction types or register usages
– Data address range check
  Focus on particular data structures
– Event umasks
  Further specification within an event
  e.g., bus transactions originated from the core or I/O
– Instruction set check
  IA-32 or IA-64
– Privilege level (2 dimensions):
  Process vs. system
  User vs. kernel vs. interrupt
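Conceptually, event qualification is a conjunction of predicates applied before an event is counted. The sketch below models three of the listed constraints in software; the `Qualifier` class, its field names, the half-open address ranges, and the integer privilege levels are all assumptions made for illustration and do not reflect how the PMU registers are programmed.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Qualifier:
    """Hypothetical software model of event qualification.

    A field left as None places no constraint on the event.
    Address ranges are half-open: start <= address < end.
    """
    code_range: Optional[Tuple[int, int]] = None
    data_range: Optional[Tuple[int, int]] = None
    privilege_levels: Optional[FrozenSet[int]] = None

    def accepts(self, pc, data_addr, privilege):
        """Return True only if every configured constraint is met."""
        if self.code_range is not None:
            if not (self.code_range[0] <= pc < self.code_range[1]):
                return False
        if self.data_range is not None:
            if not (self.data_range[0] <= data_addr < self.data_range[1]):
                return False
        if self.privilege_levels is not None:
            if privilege not in self.privilege_levels:
                return False
        return True


if __name__ == "__main__":
    # Count only user-level (level 3 assumed) events inside one hypothetical loop.
    q = Qualifier(code_range=(0x4000, 0x4100), privilege_levels=frozenset({3}))
    events = [(0x4010, 0x9000, 3), (0x4010, 0x9000, 0), (0x5000, 0x9000, 3)]
    print(sum(1 for e in events if q.accepts(*e)))
```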

16. References
General Itanium PMUs
– D. Mosberger and S. Eranian, IA-64 Linux Kernel: Design and Implementation (Chapter 9)
Itanium 2-specific PMUs
– Intel® Itanium® 2 Processor Reference Manual for Software Development and Optimization (Chapters 10 and 11)
  http://developer.intel.com/design/itanium2/manuals/251110.htm

