Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Michael D’Mello

Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda
- Basic data collection mechanisms
- Some architectural considerations
- Cycle accounting methodology
- Use the VTune™ Performance Analyzer to identify micro-architectural bottlenecks in software running on Intel® Core™ 2 Duo Xeon® processors
- Address the performance bottleneck for Intel® Core™ 2 Duo Xeon® processors

Basic data collection mechanisms
- Deterministic interrupts
  - Processor interrupted at regular time intervals
- Interrupts based on a pre-assigned metric
  - A performance counter increments on the CPU every time an event occurs
  - A sample of the execution context is recorded every time a performance counter overflows
  - Events = samples * sample after value
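The relation "Events = samples * sample after value" can be illustrated with a minimal sketch. The function names and the simple software counter are illustrative assumptions, not VTune or hardware interfaces; the point is only that each overflow yields one sample, so the estimate is accurate to within one sample-after value.

```python
def simulate_sampling(event_count: int, sample_after_value: int) -> int:
    """Count how many samples a counter would record (hypothetical model).

    The counter increments once per event and records one sample each time
    it reaches sample_after_value, then starts counting again from zero.
    """
    if sample_after_value <= 0:
        raise ValueError("sample_after_value must be positive")
    counter = 0
    samples = 0
    for _ in range(event_count):
        counter += 1
        if counter == sample_after_value:  # counter "overflows"
            samples += 1
            counter = 0
    return samples


def estimate_events(samples: int, sample_after_value: int) -> int:
    """Events = samples * sample after value."""
    return samples * sample_after_value


if __name__ == "__main__":
    true_events = 10_500
    sav = 1_000
    samples = simulate_sampling(true_events, sav)
    print(samples, estimate_events(samples, sav))
```

Events that have not yet caused an overflow are not counted, so the estimate can undercount by up to one sample-after value.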

Architecture Block and Instruction Flow

Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc…) are shared between cores.

[Block diagram. Labeled blocks:]
- Fetch / Decode: Branch Target Buffer, Next IP, 32 KB Instruction Cache, Instruction Decode (4 issue), Microcode Sequencer, Register Allocation Table (RAT)
- Retire: Re-Order Buffer (ROB) – 96 entry, IA Register Set
- Reservation Stations (RS) 32 entry
- Scheduler / Dispatch Ports
- Execute: FP Add, FP Div/Mul, Integer Shift/Rotate, SIMD Integer Arithmetic, Load, Store Addr, Store Data
- Memory Order Buffer (MOB), 32 KB Data Cache, Bus Unit, To L2 Cache/Memory

Simpler abstraction – OOO engine

[Diagram. Blocks shown: Fetch & Branch prediction, Decode, Reservation Station, Execution Units, Re-Order Buffer, Retirement, Writeback]

Notes:
- uops wait in the RS until their inputs are available
- uops wait in the ROB to be retired

Accounting for cycles - 1
- For simplicity, select the micro-op dispatch point to begin the analysis
- Decompose Total Cycles into the sum of two "input" parts
  - Time spent issuing micro-ops to the execution units
  - Time spent not issuing micro-ops (i.e. execution stalls)
- Decompose Total Cycles into three "output" components
  - Cycles during which executed micro-ops are retired
  - Cycles during which executed micro-ops are not retired
  - Stalls
- Use simple balance equations to dig deeper: micro-ops dispatched/executed = # retired + # not retired
- Convert to units of cycles using the micro-op dispatch rate
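The decomposition above can be written out as balance equations. This is one reading of the slide, not the author's own notation: the symbols are assumptions introduced here, with $C$ for cycle counts, $U$ for micro-op counts, and $r$ for the micro-op dispatch rate.

```latex
\begin{align}
  C_{\text{total}} &= C_{\text{dispatch}} + C_{\text{stall}}
      && \text{(input decomposition)} \\
  U_{\text{dispatched}} &= U_{\text{retired}} + U_{\text{not retired}}
      && \text{(micro-op balance)} \\
  r &= \frac{U_{\text{dispatched}}}{C_{\text{dispatch}}}
      && \text{(dispatch rate, micro-ops per cycle)} \\
  C_{\text{total}} &= \frac{U_{\text{retired}}}{r}
      + \frac{U_{\text{not retired}}}{r}
      + C_{\text{stall}}
      && \text{(output decomposition)}
\end{align}
```

Dividing the micro-op balance by $r$ gives $C_{\text{dispatch}} = U_{\text{retired}}/r + U_{\text{not retired}}/r$, which is how the two "input" parts turn into the three "output" components.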

Accounting for cycles - 2
Use VTune® Sampling to track selected events:
i. CPU_CLK_UNHALTED.CORE tracks Total Cycles
ii. RS_UOPS_DISPATCHED tracks the total number of micro-ops dispatched
iii. RS_UOPS_DISPATCHED:C=1 tracks cycles during which micro-ops are dispatched
iv. RS_UOPS_DISPATCHED_NONE (same as RS_UOPS_DISPATCHED:C=1:I=1) gives the second term of the input equation
v. UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED gives an (approximate) estimate of total micro-ops retired
vi. The micro-op dispatch rate is obtained by dividing (ii) by (iii)
vii. The number of cycles spent executing micro-ops that are not ultimately retired is the difference (ii) – (v) divided by the dispatch rate (vi)
viii. Using (i), (vi), and (vii), obtain Stalls
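The steps above can be sketched as a small calculation on already-collected event totals. This is a minimal sketch, not a VTune API: the function, class, and parameter names are hypothetical. It assumes the non-retired micro-op count is converted to cycles by dividing by the dispatch rate from step (vi), which keeps the result in units of cycles.

```python
from dataclasses import dataclass


@dataclass
class CycleBreakdown:
    dispatch_rate: float       # micro-ops per dispatching cycle, step (vi)
    retired_cycles: float      # cycles attributed to retired micro-ops
    non_retired_cycles: float  # cycles attributed to non-retired micro-ops, step (vii)
    stall_cycles: float        # remaining cycles, step (viii)


def cycle_accounting(total_cycles: float,
                     uops_dispatched: float,
                     dispatch_cycles: float,
                     uops_retired: float) -> CycleBreakdown:
    """Split total cycles into retired, non-retired, and stall components.

    total_cycles     ~ CPU_CLK_UNHALTED.CORE                  (i)
    uops_dispatched  ~ RS_UOPS_DISPATCHED                     (ii)
    dispatch_cycles  ~ RS_UOPS_DISPATCHED:C=1                 (iii)
    uops_retired     ~ UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED  (v)
    """
    if uops_dispatched <= 0 or dispatch_cycles <= 0:
        raise ValueError("dispatch counts must be positive")
    dispatch_rate = uops_dispatched / dispatch_cycles
    retired_cycles = uops_retired / dispatch_rate
    non_retired_cycles = (uops_dispatched - uops_retired) / dispatch_rate
    stall_cycles = total_cycles - retired_cycles - non_retired_cycles
    return CycleBreakdown(dispatch_rate, retired_cycles,
                          non_retired_cycles, stall_cycles)


if __name__ == "__main__":
    # Made-up totals, purely for illustration.
    print(cycle_accounting(total_cycles=1000, uops_dispatched=1200,
                           dispatch_cycles=600, uops_retired=1000))
```

With these definitions the stall component works out to the total cycles minus the dispatching cycles, so it can be cross-checked against the event in step (iv).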

Recap
Achieve the basic objective (minimize Total Cycles) as follows:
- Minimize the Stall component by removing memory & other bottlenecks
- Minimize the Non-Retired component by reducing branch mispredictions
- Minimize the Retired component by reducing instructions (SSEx)

Characterizing Stalls & Branch Mispredictions
- Percentage of time stalled: (RS_UOPS_DISPATCHED_CYCLES_NONE / CPU_CLK_UNHALTED.CORE) * 100
- Fractions of useful & wasted work
  1. Count the number of UOPS dispatched: use RS_UOPS_DISPATCHED
  2. Count the number of UOPS executed which are eventually retired: use (UOPS_RETIRED.ANY + UOPS_RETIRED.FUSED)
  3. Count the number of UOPS executed which are not retired: the difference between the amount dispatched and the amount retired
  4. Compute fractions
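The two characterizations above reduce to simple ratios of event totals. The sketch below is illustrative only: the function and parameter names are hypothetical, and the inputs are assumed to be event totals already collected for the same run.

```python
from typing import Tuple


def stall_percentage(cycles_no_dispatch: float, total_cycles: float) -> float:
    """(RS_UOPS_DISPATCHED_CYCLES_NONE / CPU_CLK_UNHALTED.CORE) * 100."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return 100.0 * cycles_no_dispatch / total_cycles


def work_fractions(uops_dispatched: float,
                   uops_retired_any: float,
                   uops_retired_fused: float) -> Tuple[float, float]:
    """Return (useful, wasted) fractions of dispatched micro-ops.

    useful = retired / dispatched
    wasted = (dispatched - retired) / dispatched
    """
    if uops_dispatched <= 0:
        raise ValueError("uops_dispatched must be positive")
    retired = uops_retired_any + uops_retired_fused          # step 2
    non_retired = uops_dispatched - retired                  # step 3
    return retired / uops_dispatched, non_retired / uops_dispatched  # step 4


if __name__ == "__main__":
    # Made-up totals, purely for illustration.
    print(stall_percentage(250, 1000))
    print(work_fractions(1000, 700, 100))
```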

Characterizing FSB Usage/Saturation
- Method: [(Core Frequency) * 64 * BUS_TRANS_BURST.BOTH_CORES.ALL_AGENTS] divided by CPU_CLK_UNHALTED.CORE
- Always useful to run a calibration test case
- Further analysis via the BUS_TRANS_ set of events
- Use the VTune® Help facility for an explanation of each event
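The method above can be sketched as a one-line calculation. Two points are assumptions here, since the slide states neither: the factor 64 is read as the 64 bytes moved by each burst transaction, and the core frequency is taken in Hz, so the result comes out in bytes per second. The function and parameter names are hypothetical.

```python
BYTES_PER_BURST = 64  # assumed: one cache line per burst transaction


def fsb_bandwidth_bytes_per_sec(core_frequency_hz: float,
                                bus_trans_burst: float,
                                unhalted_core_cycles: float) -> float:
    """core_frequency * 64 * BUS_TRANS_BURST... / CPU_CLK_UNHALTED.CORE.

    unhalted_core_cycles / core_frequency_hz is the elapsed (unhalted) time,
    so this is bytes transferred divided by that time.
    """
    if unhalted_core_cycles <= 0:
        raise ValueError("unhalted_core_cycles must be positive")
    return core_frequency_hz * BYTES_PER_BURST * bus_trans_burst / unhalted_core_cycles


if __name__ == "__main__":
    # Made-up totals: a 2 GHz core, one second of unhalted cycles.
    bw = fsb_bandwidth_bytes_per_sec(2_000_000_000, 1_000_000, 2_000_000_000)
    print(f"{bw / 1e6:.1f} MB/s")
```

Comparing the result against the figure measured for a calibration test case on the same machine indicates how close the workload is to saturating the bus.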

Other characterizations of Stalls
- Instruction starvation component may be monitored via RESOURCE_STALLS.ANY
- ROB & RS must be purged of incorrect executions
  - Approximate via RESOURCE_STALLS.BR_MISS_CLEAR
  - Units of this event are in cycles
- Other resource limited stalls:
  - Resource_Stalls.ROB_FULL (96 instructions in ROB)
  - Resource_Stalls.LD_ST (all Store or Load buffers in use)
  - Resource_Stalls.RS_FULL (32 instructions waiting for inputs in the Reservation Station)
- For more information, see the paper on Cycle Accounting by David Levinthal (available through
