Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Intel® Software College.


Copyright © 2007, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Objective
At the successful completion of this module, you will be able to:
- Use the VTune™ Performance Analyzer to identify micro-architectural bottlenecks in software running on Intel® Core™ 2 Duo Xeon® processors
- Address the performance bottleneck for Intel® Core™ 2 Duo Xeon® processors

Agenda
- Core® micro-architecture review
- Event basics
- Events identifying Intel® Core™ 2 Duo Xeon® processors bottlenecks
- Summary

Next Generation Micro-Architecture: Intel® Core™ 2 Duo Processor
[Block diagram: two cores (CPU-0 and CPU-1), each with a 32 KB L1 data cache (L1D), a 32 KB L1 instruction cache (L1I), L0/L1 DTLBs, and a PMH, sharing a 4 MB L2 cache connected to the FSB.]

Architecture Block and Instruction Flow
[Block diagram. Fetch/Decode: 32 KB Instruction Cache, Next IP, Branch Target Buffer, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT). Out-of-order core: Re-Order Buffer (ROB), 96-entry; Reservation Stations (RS), 32-entry; Scheduler / Dispatch Ports. Execute ports: FP Add, FP Div/Mul, Integer Arithmetic, Integer Shift/Rotate, SIMD Integer Arithmetic, Load, Store Addr, Store Data. Memory: Memory Order Buffer (MOB), 32 KB Data Cache, Bus Unit to L2 Cache/Memory. Retire: IA Register Set.]
Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc.) are shared between cores.


VTune™ Analyzer Event Basics: Events Versus Samples
- A performance counter increments on the CPU every time an event occurs
- A sample of the execution context is recorded every time a performance counter overflows
- Events = samples * sample after value
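The relation between events and samples can be sketched in a few lines of Python. This is only an illustration: the function name and the sample numbers are made up for the example and are not VTune™ analyzer output.

```python
def estimate_event_count(samples: int, sample_after_value: int) -> int:
    """Estimate how many events occurred from the number of samples taken.

    Each sample is recorded when the counter overflows, that is, once every
    `sample_after_value` events, so events = samples * sample after value.
    """
    if samples < 0 or sample_after_value <= 0:
        raise ValueError("samples must be >= 0 and sample_after_value > 0")
    return samples * sample_after_value


# Hypothetical numbers: 1,500 samples collected with a sample after value of 2,000,000.
print(estimate_event_count(1_500, 2_000_000))
```

The estimate is only as fine-grained as the sample after value: event counts below one overflow period produce no sample at all.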

VTune™ Analyzer Event Basics: Retired Versus Non-Retired Events
- Retired events include only events that occur due to instructions that are committed to the machine state. For example, when measuring the Loads Retired event, a load that occurs on a mispredicted execution path is not counted.
- Most retired events can also be precise events, which have no event skid.

VTune™ Analyzer Event Basics: Event Skid
- Because of interrupt latency, events can appear a few lines after the point where they actually occur in the disassembly or source view.

VTune™ Analyzer Event Basics: Precise Events
- Do not suffer from event skid
- Use hardware to record the address where the event occurs
- Reduce the number of events you can collect at once

VTune™ Analyzer Event Basics: Precise Events (cont.)
[Screenshots comparing results with precise events on and off.]

VTune™ Analyzer Event Basics: Event Ratios
- Calculate common processor performance metrics
- Built in to the VTune™ analyzer

VTune™ Analyzer Event Basics: Clockticks and Instructions Retired
- Clockticks measure CPU cycles
- Clockticks / processor frequency = time in seconds
- Instructions retired = the number of instructions committed to the processor state (executed completely)
- Cycles per instruction (CPI) = clockticks / instructions retired
- High CPI usually indicates opportunities for optimization
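As a small worked example of the two ratios above, the following Python sketch computes CPI and elapsed time from raw counts. The function names and the input values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def cycles_per_instruction(clockticks: int, instructions_retired: int) -> float:
    """CPI = clockticks / instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


def elapsed_seconds(clockticks: int, frequency_hz: float) -> float:
    """Time in seconds = clockticks / processor frequency."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    return clockticks / frequency_hz


# Hypothetical run: 6e9 clockticks, 4e9 instructions retired, 3 GHz processor.
print(cycles_per_instruction(6_000_000_000, 4_000_000_000))
print(elapsed_seconds(6_000_000_000, 3_000_000_000))
```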

VTune™ Analyzer Event Basics: Clockticks Versus Non-halted Clockticks
- Clockticks = halted + non-halted cycles (but no sleep cycles)
- The clockticks event measures cycles when the physical processor is not in any sleep mode.
- The non-halted clockticks event measures the cycles in which a logical processor is neither asleep nor halted.
- If you measure clockticks on a Hyper-Threading Technology-enabled system while running a single-threaded application, you will see many samples around the halt instruction in processor.sys.

Agenda
- Core® micro-architecture review
- Event basics
- Performance tuning for Intel® Core™ 2 Duo Xeon® processors
  - Events for performance
  - Performance optimization methodology
  - X86 cycle accounting
- Summary

Performance Events along µ-op Flow (1)
[Block diagram: the architecture block and instruction flow (Fetch/Decode, RAT, 96-entry ROB, 32-entry RS, Scheduler / Dispatch Ports, execute ports, MOB, 32 KB data cache, Bus Unit to L2 Cache/Memory, Retire).]

Memory Access (Examples)
Latencies
- L1 miss that hits L2: ~10 cycles
- L2 miss, access to memory: ~300 cycles (server/FBD)
- L2 miss, access to memory: ~165 cycles (desktop/DDR2)
Cache bandwidth
- Bandwidth to cache: ~8.5 bytes/cycle
Memory bandwidth
- Desktop: ~6 GB/sec/socket (Linux*)
- Server: ~3.5 GB/sec/socket

Performance Events for the Front End
(P = precise event)

EVENT                          P  Description
CPU_CLK_UNHALTED
INST_RETIRED.ANY_P             P
INST_RETIRED.LOADS
INST_RETIRED.STORES
BUS_TRANS_ANY                     all bus transactions
BUS_TRANS_MEM                     bus transactions to memory
BUS_TRANS_BURST                   whole cache lines to memory
BUS_TRANS_BRD                     whole line reads from memory
BUS_TRANS_WB                      writebacks (no NT writes)
BUS_TRANS_RFO                     cache lines in for RFO (no HW prefetch)
BUS_DRDY_CLOCKS.ALL_AGENTS        all busy bus cycles
BUS_DRDY_CLOCKS.THIS_AGENT        all busy bus cycles due to writes
MEM_LOAD_RETIRED.L2_LINE_MISS  P  L2 demand misses
MMX2_PRE_MISS.T1                  SW prefetch to L1 instructions
MMX2_PRE_MISS.T2                  SW prefetch to L2 instructions
MMX2_PRE_MISS.STORES              non-temporal stores executed
L2_LINES_IN.SELF.DEMAND           L2 cache lines in for RFO, load, SW prefetch
L2_LINES_IN.SELF.PREFETCH         L2 cache lines in for HW prefetch
L2_LINES_OUT.SELF.DEMAND          demanded L2 cache lines evicted
L2_LINES_OUT.SELF.PREFETCH        HW prefetch L2 cache lines evicted

Memory BW = 64 * BUS_TRANS_MEM * freq / CPU_CLK_UNHALTED
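The memory bandwidth formula from this slide can be written out as a short Python sketch. The factor of 64 is taken here to be the cache line size in bytes, which is an assumption about the formula's intent. The function name, parameter names, and input values are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed meaning of the constant 64 in the slide's formula


def memory_bandwidth_bytes_per_sec(bus_trans_mem: int,
                                   cpu_clk_unhalted: int,
                                   frequency_hz: float) -> float:
    """Memory BW = 64 * BUS_TRANS_MEM * freq / CPU_CLK_UNHALTED.

    CPU_CLK_UNHALTED / freq is the measured run time in seconds, so the
    result is bytes moved per second.
    """
    if cpu_clk_unhalted <= 0:
        raise ValueError("cpu_clk_unhalted must be positive")
    return CACHE_LINE_BYTES * bus_trans_mem * frequency_hz / cpu_clk_unhalted


# Hypothetical counts: 1e8 memory bus transactions in 4e9 unhalted cycles at 2 GHz.
bw = memory_bandwidth_bytes_per_sec(100_000_000, 4_000_000_000, 2_000_000_000)
print(f"{bw / 1e9:.2f} GB/sec")
```

Comparing the result against the approximate peak figures quoted for the platform gives a quick indication of whether a function is bandwidth limited.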

Lab Activity 1: Calculating the Memory Access Bandwidth
In this lab, you will calculate the memory bandwidth from performance counter events using the VTune™ analyzer.

Performance Events along µ-op Flow (2)
[Block diagram: the architecture block and instruction flow, annotated to show that Resource_Stalls is measured at the transfer from Decode.]

Performance Events of Resource_Stalls
µ-op flow to the OOO engine is blocked by a downstream cause.
- Resource_Stalls.BR_MISS_CLEAR: pipeline stalls due to flushing mispredicted branches
- Combine in Resource_Stalls.CLEAR: mispredicted branch followed by FP instruction
- Resource_Stalls.ROB_FULL: 96 instructions in ROB
- Resource_Stalls.LD_ST: all store or load buffers in use
- Resource_Stalls.RS_FULL: 32 instructions waiting for inputs in Reservation Station

Measuring Instruction Starvation
- There really is no good way to do this
- Anti-correlate with Resource_Stalls.RS_FULL
- There could be:
  - Cycles the decode queue is empty
  - Cycles the RS is empty
  - Cycles the ROB is empty

Performance Events along µ-op Flow (3)
[Block diagram: the architecture block and instruction flow, annotated to show that Rs_uops_dispatched is measured at Execution and that other stalls are measured at Execution.]

Measuring Efficiency in the Execution Stage
- The OOO engine optimizes instruction issue to the functional units from the Reservation Station
  - µ-ops wait there until their inputs are available
- RS_UOPS_DISPATCHED measures the number of µ-ops dispatched from the RS on each cycle
- There are chains that prevent the OOO engine from executing in parallel:
  - Partial register stall
  - Partial flag register stall
  - Domain bypass
  - Others

Performance Events along µ-op Flow (4)
[Block diagram: the architecture block and instruction flow, annotated to show that µ-ops retired is measured at Retirement.]

Retirement vs Dispatch
Which counters to work on first?
- For loops, the difference is due to OOO execution
- Fewer false positives when "stalls" are measured at Dispatch
- Retirement is generally more important than Dispatch

Performance Optimization Methodology
This style of optimization has 2 components:
- Minimizing instruction count (path length)
  - A sort of "tree height" minimization
- Minimizing deviations from ideal execution
  - Generically thought of as "stall cycles"
Treating both equally is critical.

Stalls, Execution Imperfection and Performance Analysis
- Stall cycles are used to indicate less than perfect execution
- An architectural decomposition of "stalls" can be used to guide the selection of architectural events
- The IP correlation of "stalls" and architectural events then guides the optimization effort
- Stalls have 4 basic components in x86:
  - Front End stalls
  - Execution stage instruction starvation (Front End)
  - Mispredicted branch pipeline flushing
  - Execution stalls (waiting on input/scoreboard, L2 miss, BW, DTLB, glass jaws, etc.)
  - Cycles wasted executing instructions that are not retired

X86 Cycle Accounting and SW Optimization
Cpu_clk_unhalted = "stalls" + dispatch
                 = "stalls" + non_ret_dispatch + ret_dispatch
- Traditional stall removal
- Reduce branch mispredictions: PGO
- Improve optimization to reduce instruction count; split loops to increase ILP
Resource_stalls.br_miss_clear will estimate stalls due to pipeline flush.

Cycle Accounting on X86
- Cycles = "stalls" + dispatch (an equality by definition)
- Cycles ~ CPU_CLK_UNHALTED.CORE (for CPU-intensive applications/sampling)
- Stall cycles = cycles with no uops dispatched = RS_UOPS_DISPATCHED.CYCLES_NONE
- Dispatch cycles = RS_UOPS_DISPATCHED
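A minimal Python sketch of this accounting follows. It takes the identity Cycles = "stalls" + dispatch at face value and derives dispatch cycles as the remainder, which is an assumption of the sketch and not a separate counter reading. The function name and the counts are hypothetical.

```python
def cycle_accounting(cpu_clk_unhalted_core: int,
                     rs_uops_dispatched_cycles_none: int) -> dict:
    """Split unhalted cycles into stall cycles and dispatch cycles.

    Stall cycles are cycles with no uops dispatched
    (RS_UOPS_DISPATCHED.CYCLES_NONE); dispatch cycles are derived here as
    the remainder, following cycles = "stalls" + dispatch.
    """
    if cpu_clk_unhalted_core <= 0:
        raise ValueError("cpu_clk_unhalted_core must be positive")
    stall_cycles = rs_uops_dispatched_cycles_none
    dispatch_cycles = cpu_clk_unhalted_core - stall_cycles
    return {
        "stall_cycles": stall_cycles,
        "dispatch_cycles": dispatch_cycles,
        "stall_fraction": stall_cycles / cpu_clk_unhalted_core,
    }


# Hypothetical counts for one hot function.
print(cycle_accounting(1_000_000, 250_000))
```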

Cycle Accounting on X86 (cont.)
- Dispatch ~ cycles_dispatch_retiring_uops + cycles_dispatch_non_retiring_uops
  - Assumes no overlap of retired/non-retired uops (worst-case scenario)
- Non-retired uops = rs_uops_dispatched - (uops_retired.any + uops_retired.fused)
- Non-retired uop cycles ~ non-retired uops / avg_uops_per_cycle
- Fractional wasted work = rs_uops_dispatched / (uops_retired.any + uops_retired.fused) - 1
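The non-retired uop and wasted-work formulas above translate directly into code. The sketch below is illustrative only; the function name and the counter values are invented for the example.

```python
def wasted_work(rs_uops_dispatched: int,
                uops_retired_any: int,
                uops_retired_fused: int) -> tuple:
    """Return (non-retired uops, fractional wasted work).

    non_retired = rs_uops_dispatched - (uops_retired.any + uops_retired.fused)
    fraction    = rs_uops_dispatched / (uops_retired.any + uops_retired.fused) - 1
    """
    retired = uops_retired_any + uops_retired_fused
    if retired <= 0:
        raise ValueError("retired uop count must be positive")
    non_retired = rs_uops_dispatched - retired
    fraction = rs_uops_dispatched / retired - 1
    return non_retired, fraction


# Hypothetical counts: 1,500 uops dispatched, 800 + 200 retired.
non_retired, fraction = wasted_work(1_500, 800, 200)
print(non_retired, fraction)
```

A fraction well above zero suggests that a large share of dispatched uops came from paths that were later discarded, for example after branch mispredictions.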

Pulling Cycle Accounting Together
[Chart: illustrative example only, not real data.]

Decomposing Stalls: Elephants First
- Pipeline Flush = Resource_Stalls.Br_Miss_Clear / cycles
- L2 Hits = (MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS) * 10 / cycles
- DTLB/L2 Miss = event count * penalty / cycles
- FE + Scoreboard = Stalls - all of the above
[Chart: illustrative example only, not real data.]
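The decomposition above can be sketched as a single Python function that returns each component as a fraction of cycles. The default penalties are assumptions drawn from approximate figures quoted elsewhere in this deck (about 10 cycles for an L2 hit, about 165 cycles for a desktop L2 miss, about 10 cycles for a DTLB miss) and should be replaced with values measured on the target platform. All names and counts are hypothetical.

```python
def decompose_stalls(cycles: int,
                     stall_cycles: int,
                     br_miss_clear: int,
                     l1d_line_miss: int,
                     l2_line_miss: int,
                     dtlb_miss: int,
                     l2_hit_penalty: float = 10,    # assumed: L1 miss that hits L2
                     l2_miss_penalty: float = 165,  # assumed: desktop/DDR2 figure
                     dtlb_penalty: float = 10) -> dict:  # assumed: typical L1 DTLB miss
    """Break stall cycles into components, each as a fraction of all cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    pipeline_flush = br_miss_clear / cycles
    l2_hits = (l1d_line_miss - l2_line_miss) * l2_hit_penalty / cycles
    l2_miss = l2_line_miss * l2_miss_penalty / cycles
    dtlb = dtlb_miss * dtlb_penalty / cycles
    stalls = stall_cycles / cycles
    # Whatever the large, countable causes do not explain is attributed
    # to the front end and scoreboard.
    fe_scoreboard = stalls - (pipeline_flush + l2_hits + l2_miss + dtlb)
    return {
        "pipeline_flush": pipeline_flush,
        "l2_hits": l2_hits,
        "l2_miss": l2_miss,
        "dtlb": dtlb,
        "fe_scoreboard": fe_scoreboard,
    }


# Hypothetical counts, not real data.
for name, value in decompose_stalls(1_000_000, 500_000, 100_000,
                                    15_000, 1_000, 2_000).items():
    print(f"{name}: {value:.3f}")
```

Because the penalties are estimates and the events can overlap, the residual can come out small or even negative, which is a sign that the listed causes are over-counted.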

Decomposing Unstalled Cycles
- Non_Retired = (1 - (Uops_retired.any + Uops_retired.fused) / RS_Uops_Dispatched) * RS_Uops_Dispatched.Cycles_None / CPU_CLK_UNHALTED.CORE
- OOO Bursts = Uops_Retired.Any - Stalls - Non_Retired
[Chart: illustrative example only, not real data.]

Pulling it All Together
- Risks over-counting / minimizing FE + Scoreboard
- But offers a guide to execution inefficiencies
[Chart: illustrative example only, not real data.]

The "Big 4" Events for Performance
- Cycles: CPU_CLK_UNHALTED.CORE
- Stalls: RS_UOPS_DISPATCHED.CYCLES_NONE
- Unprefetched loads: MEM_LOAD_RETIRED.L2_LINE_MISS
- Bandwidth: BUS_TRANS_ANY.SELF

Architectural Pitfalls: The Ants

Issue                                                      Performance Counter        Approx. Penalty (cycles)
Store to unknown address precedes load                     Load_Blocks.ADR            ~5
Store forwarding 4 bytes from middle of 8                  Load_Blocks.Overlap_Store  ~6
Store to known address precedes load offset by N*4096      Load_Blocks.Overlap_Store  ~6
Load from 2 cache lines (not in L1D)                       Load_Blocks.UNTIL_RETIRE   ~22
Load from 2 cache lines with preceding store (not in L1D)  Load_Blocks.UNTIL_RETIRE   ~20
Length Changing Prefix (16-bit imm)                        ILD_STALLS                 ILD_STALLS, or ~6 per

- These contribute to "FE + Scoreboard"
- And don't forget micro-fusion, macro-fusion, etc.

A Heuristic Break-down for Stall Analysis
[Decision-tree diagram: starting from "Stalls?", the analysis branches into Front End Stalls, Resource Stalls, Exe Unit Stalls, and Retirement Efficiency and others. Branch annotations: the "Big 4 (L2 cache)", L1D cache; RS related and RAT related; register related, domain related; instruction decoding, LCP.]

A Heuristic Break-down for Stall Analysis (cont.)

Stall Component      Counter Name                     Solutions
Front End
  L2 cache           MEM_LOAD_RETIRED.L2_LINE_MISS    Alignment
  DTLB               MEM_LOAD_RETIRED.DTLB_MISS       SW prefetch
  L1 data cache      MEM_LOAD_RETIRED.L1D_LINE_MISS
  Instruction Queue  INST_QUEUE.FULL                  Decode pattern
  Branch prediction  RESOURCE_STALLS.BR_MISS_CLEAR    PGO, removing uncertainty or branch
Execution Core
  Reservation station  RESOURCE_STALLS.RS_FULL
  ReOrder Buffer       RAT_STALLS.ROB_READ_PORT
                       RESOURCE_STALLS.ROB_FULL
  Dispatching          RS_UOPS_DISPATCHED
  Partial updating     RAT_STALLS.FLAGS               Whole register update
                       RAT_STALLS.PARTIAL_CYCLES
  Domain swing         RESOURCE_STALLS.FPCW
                       FP_MMX_TRANS.TO_MMX
                       FP_MMX_TRANS.TO_FP
Memory                 BUS_TRANS_ANY

Lab Activity 2: Using SW Tools to Reduce the Instruction Count (Path Length)
In this lab, you will practice using the Intel compiler vectorization switch to reduce the instruction count.

Lab Activity 3: Addressing the Performance Bottleneck in the Front End
In this lab, you will use the "Big 4" events analysis to identify and address a performance issue in the Front End of the processor.

Lab Activity 4: Addressing the Performance Bottleneck in the Execution Core
In this lab, you will identify and address a performance issue in the execution core of the processor.

A Loop Methodology
- Identify hot functions and raise optimization
- Fix alignments, split loops to enhance vectorization
- Identify BW limited functions
- Merge BW loops with FP limited loops
- Identify L2 misses and add SW prefetch
- Optimize flow through OOO engine
- Use loop splitting to assist here

More Detailed Event Selection Hierarchy

FIRST PASS EVENTS                       Sample After Value
CPU_CLK_UNHALTED.CORE                   2,000,000
RS_UOPS_DISPATCHED.CYCLES_NONE          2,000,000
UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED   2,000,000
RS_UOPS_DISPATCHED                      2,000,000
MEM_LOAD_RETIRED.L2_LINE_MISS           10,000
INST_RETIRED.ANY_P                      2,000,000
Loops
BUS_TRANS_ANY.SELF                      100,000
BUS_TRANS_ANY.ALL_AGENTS                100,000
Branch Dominated
RESOURCE_STALLS.BR_MISS_CLEAR           2,000,000

SAV values are selected so that the ratio of samples ~ absorbs the penalty.

More Detailed Event Selection Hierarchy (cont.)

SECOND LEVEL EVENTS                       Sample After Value
MEM_LOAD_RETIRED.DTLB_MISS                20,000
MEM_LOAD_RETIRED.L1D_LINE_MISS            200,000
BR_CND_EXEC, BR_CND_EXEC_MISPRED          2,000,000
BR_CALL_EXEC, BR_CALL_EXEC_MISPRED        200,000
RESOURCE_STALLS.RS_FULL (anti-correlate)  2,000,000
ILD_STALLS                                200,000
LOAD_BLOCK.STORE_OVERLAP                  200,000

SAV values are selected so that the ratio of samples ~ absorbs the penalty.
Example: the L1 miss / L2 hit penalty is 10 cycles, so the L1 miss SAV of 200,000 is one tenth of the 2,000,000 SAV used for cycles.
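One way to read "ratio of samples absorbs penalty" is sketched below in Python: when an event's sample after value times its penalty equals the sample after value used for cycles, the ratio of event samples to cycle samples directly approximates the fraction of cycles lost to that event. This reading is an interpretation of the slide, and the function name and sample counts are hypothetical.

```python
def fraction_of_cycles_lost(event_samples: int, event_sav: int, penalty_cycles: float,
                            cycle_samples: int, cycle_sav: int) -> float:
    """Estimate the share of cycles lost to an event.

    lost cycles  = event_samples * event_sav * penalty_cycles
    total cycles = cycle_samples * cycle_sav
    """
    total_cycles = cycle_samples * cycle_sav
    if total_cycles <= 0:
        raise ValueError("cycle sample count and SAV must be positive")
    return event_samples * event_sav * penalty_cycles / total_cycles


# L1 miss / L2 hit: SAV 200,000 and a penalty of ~10 cycles, against cycles at SAV 2,000,000.
# Hypothetical sample counts: 50 L1-miss samples and 400 clocktick samples.
estimate = fraction_of_cycles_lost(50, 200_000, 10, 400, 2_000_000)
print(estimate, 50 / 400)  # the two values agree because 200,000 * 10 == 2,000,000
```

With SAVs chosen this way, sample counts for different events can be compared side by side in the tool without converting each one back to cycles by hand.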

Summary
- Utilize the Core™ micro-architecture for software performance
  - Front end
  - OOO execution core
- Use the VTune™ analyzer to identify micro-architectural bottlenecks in your software.
- Use a cycle accounting methodology to improve performance.


Micro-Architecture Comparison
Values are listed as Intel NetBurst™ (++) versus NGMA (**).
- Pipeline stages: 31 vs 14
- Threads per core: 2 vs 1
- L1 cache organization: 12K uop Trace Cache / 16K Data vs 32K I / 32K Data
- L2 cache organization: 2 x 2MB vs 1 x 4MB (shared)
- Instruction decoders: 1 vs 4
- Integer units: 2 (2x core freq) vs 3 (1x core freq)
- SIMD units: 2 x 64-bits vs 3 x 128-bits
- SIMD instructions issued per clock: 1 vs 3
- FP units: 3 (Add/Mul/Div)
- FP instructions issued per clock: 1 vs up to 2 (Add + Mul or Div)
- Power: 135W vs 80W
++ Cedar Mill/Dempsey
** NGMA = Next Generation Micro-Architecture (Conroe/Woodcrest)
= per core

49 Execution Unit Comparisons

[Figure: execution units by dispatch port, Intel NetBurst® Micro-Architecture vs. NGMA.
NetBurst labels: FP Add/Mul/Div, Integer Shift/Rotate, SIMD, Integer Multiply, SIMD Integer Arithmetic, 2x Core Freq.
NGMA labels: Port 0, Port 1, Port 5 (FP Add, FP Div/Mul, SIMD, Integer Arithmetic, Integer Shift/Rotate, SIMD Integer Arithmetic), Port 2 Load, Port 4 Store.]

50 DTLB Structure

DTLB component   entries  ways  sets  miss event               ~ miss penalty
L0 small page    16       4     4     Dtlb_Misses.L0_miss      2
L1 small page    256      4     64    Dtlb_Misses.L1_miss      typical ~ 10
L0 Large Page    16       4     4     Dtlb_Misses.L0_miss_LG   2
L1 Large Page    32       4     8     Dtlb_Misses.L1_miss_LG   typical ~ 11-12
HW Page Walks                         PMH.Walks                ~PMH.Cycles

[Chart legend: L2 $ Hit, L1DTLB Miss; L1 $ Hit, L1DTLB Miss; L1 $ Hit, L1DTLB Hit]
Disclaimer: Data is from a pointer chasing microbenchmark and for illustrative purposes only.

51 PEBS Usage and Issues
- Precise Event Based Sampling (PEBS) captures the architectural state at the time the event occurs.
- Basic block (BB) execution = average of inst_retired over the BB.
- However, inst_retired does not give a flat distribution within a basic block, so the average over the basic block should be used.

52 Manipulating the XML File

CB08
0xCB       <- event number
0x08       <- event mask or user mask
0x53       <- Cmask, Inv etc.
0x601001   <- bitmask for the groups the event is in; add 2 to put it in "favorites"
0 0        <- counters that can be used; precise events must use 0
MEM_LOAD_RETIRED.L2_LINE_MISS
L2 cache line missed by retired loads (precise event).
pmm.chm
10000      <- default SAV
yes        <- identifier for precise events

53 DL's New Favorite

A000
0xA0
0x00
0x1D3      <- setting cmask = 1 and inv = 1
0x503
0 0        <- forcing counter 0
RS_UOPS_DISPATCHED_c1_inv   <- new name
Uops Dispatched from the RS
pmm.chm
2000000
Cycles Where NO Uops are Dispatched From RS
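
The flag values 0x53 and 0x1D3 can be decoded with a short sketch. It assumes that this XML field holds bits 16-31 of the event select register, shifted down by 16; that reading is an interpretation, although it matches the slide's note that 0x1D3 sets cmask = 1 and inv = 1. The function name decode_flags is hypothetical.

```python
# Bit positions inside the flags field, assuming it mirrors bits 16-31 of the
# event select register (USR at register bit 16, CMASK at register bits 24-31).
FLAG_BITS = {"USR": 0, "OS": 1, "E": 2, "PC": 3, "INT": 4, "EN": 6, "INV": 7}


def decode_flags(field):
    """Return (list of set flag names, cmask value) for a flags field."""
    flags = [name for name, bit in FLAG_BITS.items() if (field >> bit) & 1]
    cmask = (field >> 8) & 0xFF
    return flags, cmask


default_flags = decode_flags(0x53)    # plain event from the previous slide
stall_flags = decode_flags(0x1D3)     # RS_UOPS_DISPATCHED_c1_inv
print(default_flags)
print(stall_flags)
```

Under this assumption 0x53 means "count in user and OS mode, interrupt on overflow, counter enabled", and 0x1D3 adds INV with a counter mask of 1.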

54 Loop Distribution for Resource Management

Before:
for (i ...; i++) {
    inst1
    inst2
    inst3
    ...
    instN (final store)
}

After:
for (i ...; i += blk) {
    for (j = i; j < i + blk; j++) {
        inst1
        inst2
        ...
        instM
        store_intermediate[j-i]
    }
    for (j = i; j < i + blk; j++) {
        load_intermediate[j-i]
        instM+1
        ...
        instN (final store)
    }
}

Shorter Loops -> Greater Unrolling -> Greater ILP
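
The transformation can be sketched as runnable code. This is only a structural illustration with a made-up loop body; Python will not show the ILP benefit the slide is about, and the block size of 4 is arbitrary. It does show that the distributed form, with a small intermediate buffer, produces the same results as the original loop.

```python
def fused(x):
    """Original loop: all N instructions in one body."""
    out = []
    for v in x:
        t = v * v + 1.0          # stands in for inst1 .. instM
        out.append(t * 0.5 - v)  # stands in for instM+1 .. instN (final store)
    return out


def distributed(x, blk=4):
    """Distributed loop: two short inner loops over blocks of `blk` elements."""
    out = []
    for i in range(0, len(x), blk):
        end = min(i + blk, len(x))  # handle a partial last block
        intermediate = [0.0] * (end - i)
        for j in range(i, end):
            intermediate[j - i] = x[j] * x[j] + 1.0  # store_intermediate[j-i]
        for j in range(i, end):
            t = intermediate[j - i]                  # load_intermediate[j-i]
            out.append(t * 0.5 - x[j])
    return out


data = [float(k) for k in range(10)]
same = fused(data) == distributed(data, blk=4)
print(same)
```

The intermediate buffer is sized to the block so that it stays small enough to remain in cache between the two inner loops.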

55 Cycle Accounting on X86

Non-retired uop cycles ~ non-retired uops / avg_uops_per_cycle
    ~ rs_uops_dispatched:c1 * ( 1 - (uops_retired.any + uops_retired.fused) / rs_uops_dispatched )

CPU_CLK_UNHALTED = Stalls + non_retired + effective
    = rs_uops_dispatched:c1:i1
    + rs_uops_dispatched:c1 * ( 1 - (uops_retired.any + uops_retired.fused) / rs_uops_dispatched )
    + Effective_cycles
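
As a worked example of the decomposition above, the following sketch plugs hypothetical counter readings into the formula. All numbers are invented for illustration, and the effective cycles are taken as the remainder after stalls and non-retired cycles, which is how the equation defines them.

```python
def cycle_accounting(clk_unhalted, dispatched, dispatched_c1, dispatched_c1_i1,
                     retired_any, retired_fused):
    """Split CPU_CLK_UNHALTED into stall, non-retired and effective cycles."""
    retired_fraction = (retired_any + retired_fused) / dispatched
    stalls = dispatched_c1_i1                              # cycles with no dispatch
    non_retired = dispatched_c1 * (1 - retired_fraction)   # wasted dispatch cycles
    effective = clk_unhalted - stalls - non_retired        # remainder
    return {"stalls": stalls, "non_retired": non_retired, "effective": effective}


# Hypothetical readings for one run.
breakdown = cycle_accounting(
    clk_unhalted=1_000_000,
    dispatched=1_500_000,        # rs_uops_dispatched
    dispatched_c1=600_000,       # rs_uops_dispatched:c1
    dispatched_c1_i1=400_000,    # rs_uops_dispatched:c1:i1
    retired_any=1_000_000,       # uops_retired.any
    retired_fused=200_000,       # uops_retired.fused
)
for name, cycles in breakdown.items():
    print(f"{name:12s} {cycles:12.0f}")
```

In this made-up run 80% of dispatched uops retire, so one fifth of the 600,000 dispatch cycles (120,000) count as non-retired work.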

56 Methodology Overview
- The traditional view of performance tuning on X86 processors has focused on instruction retirement.
- The OOO engine has always been viewed as an impenetrable and incomprehensible beast.
- This is perhaps not the best perspective.

57 Four Component HW Prefetcher
- L1 Cache Prefetch (first in Intel® Core Duo Processor)
  - DCU or Streaming prefetcher (DCU = Data Cache Unit)
  - IP prefetch: repeated stride load at a frequently executed IP
- L2 Prefetch (similar to Pentium™ 4 processor)

58 VTune™ Analyzer Edit Event
See Backup Slides for Creating New Pre-Edited Events in XML File.

59 Some Features of the PMU

Event select fields, from high bits to low bits: CMASK | INV | EN | INT | PC | E | OS | USR | umask | Event #
- CMASK: value to be compared against
- INV: invert the comparison from GE to LT
- EN: enable counters
- INT: APIC interrupt enable
- PC: pin control
- E: count on changing edge
- OS: count Ring 0 execution
- USR: count Ring 3 execution

Setting CMASK = 1 and INV = 1 for RS_uops_dispatched counts cycles where NO UOPS WERE DISPATCHED == Stalls
RS_UOPS_DISPATCHED.CYCLES_NONE
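
The effect of CMASK and INV can be modeled in a few lines. This is a simplified software model rather than a description of the hardware: it takes a made-up list of uops dispatched in each cycle and applies the comparison the slide describes (count a cycle when the per-cycle value is GE the mask, or LT the mask when inverted). Treating INV as ignored when CMASK is 0 is a simplification of this model.

```python
def count_with_cmask(per_cycle_events, cmask=0, inv=False):
    """Model a counter with CMASK/INV applied to per-cycle event counts."""
    if cmask == 0:
        # No threshold: the counter simply accumulates the events.
        return sum(per_cycle_events)
    if inv:
        # Inverted comparison: count cycles with fewer than `cmask` events.
        return sum(1 for n in per_cycle_events if n < cmask)
    # Normal comparison: count cycles with at least `cmask` events.
    return sum(1 for n in per_cycle_events if n >= cmask)


# Hypothetical uops dispatched in eight consecutive cycles.
uops = [0, 3, 1, 0, 0, 4, 2, 0]
total_uops = count_with_cmask(uops)                       # RS_UOPS_DISPATCHED
busy_cycles = count_with_cmask(uops, cmask=1)             # :c1
stall_cycles = count_with_cmask(uops, cmask=1, inv=True)  # :c1:i1
print(total_uops, busy_cycles, stall_cycles)
```

In the model, busy and stall cycles always add up to the total number of cycles, which is what makes the decomposition of CPU_CLK_UNHALTED possible.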

60 A Methodology?

Total Cycles ~ CPU_CLK_UNHALTED
- Execution cycles: RS_UOPS_DISPATCHED:c1
- Stall cycles: RS_UOPS_DISPATCHED:c1:i1

CPU_CLK_UNHALTED can be decomposed into execution and stall cycles in the OOO engine.
Requires >99% CPU utilization OR user PL only/sampling.
EVENTS COUNT EVEN DURING HALTED CYCLES

61 VTune™ Analyzer Event Basics: Thread Specific and Independent Event Categories
- Thread Specific (TS): sample count is per logical processor.
- Thread Independent (TI): sample count is per physical processor. All events are attributed to logical processor 0. WATCH OUT: the addresses might be incorrect!
- Thread Specific, ESCR limited (TS-E): sample count is per logical processor, but only data for one logical processor can be captured in a single run.
If not specified, the event is TS.

62 The Distribution of uops/cycle

emon -q -t0 -C $RS_UOPS_DISPATCHED:v$ -f $1_uop_count.txt $1

Up to N uops/cycle:
emon -q -t0 -C $RS_UOPS_DISPATCHED:c1:i1:v$ -F $1_uop_count.txt $1
emon -q -t0 -C $RS_UOPS_DISPATCHED:c2:i1:v$ -F $1_uop_count.txt $1
emon -q -t0 -C $RS_UOPS_DISPATCHED:c3:i1:v$ -F $1_uop_count.txt $1
emon -q -t0 -C $RS_UOPS_DISPATCHED:c4:i1:v$ -F $1_uop_count.txt $1
emon -q -t0 -C $RS_UOPS_DISPATCHED:c5:i1:v$ -F $1_uop_count.txt $1
emon -q -t0 -C $RS_UOPS_DISPATCHED:c6:i1:v$ -F $1_uop_count.txt $1
emon -q -t0 -C $RS_UOPS_DISPATCHED:c7:i1:v$ -F $1_uop_count.txt $1

Subtract the N-1 value.
Replace with VTune graph.
[Chart: Distribution of the Instruction Level Parallelism (example: a[i] = exp(x[i]); in a simple loop)]
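
The "subtract the N-1 value" step can be sketched as follows. Each cN:i1 run counts cycles in which fewer than N uops were dispatched, so differencing consecutive readings gives the number of cycles with exactly N-1 uops. The readings below are invented, and parsing the emon output file is not shown because its format is not given here.

```python
def uops_per_cycle_histogram(less_than_counts):
    """Turn cumulative cN:i1 readings into a per-cycle uop histogram.

    less_than_counts[k] is the reading for N = k + 1, i.e. the number of
    cycles with fewer than k + 1 uops. The result's element k is the
    number of cycles with exactly k uops dispatched.
    """
    histogram = []
    previous = 0
    for count in less_than_counts:
        histogram.append(count - previous)  # subtract the N-1 value
        previous = count
    return histogram


# Hypothetical readings for c1:i1 through c7:i1.
readings = [400, 550, 700, 850, 950, 990, 1000]
histogram = uops_per_cycle_histogram(readings)
for uops, cycles in enumerate(histogram):
    print(f"{uops} uops/cycle: {cycles} cycles")
```

The first bucket (0 uops/cycle) is the stall count, and the buckets sum to the last cumulative reading.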
