Confessions of a Performance Monitor Hardware Designer. Workshop on Hardware Performance Monitor Design, HPCA-11, 13 February 2005. Jim Callister, Intel Corporation.


Confessions of a Performance Monitor Hardware Designer
Workshop on Hardware Performance Monitor Design, HPCA-11, 13 February 2005
Jim Callister, Intel Corporation
© Intel Corp. 2005

Itanium® Processor PMU, 13 February 2005

Why Include a PMU?
Ya gotta do something with all those transistors!
Cause my PAPI told me to
To give competitors a fighting chance
To show my boss how great my branch predictor is (i.e., get a raise)
To improve the performance of current and future systems

How much Performance would you give up for PMU Functionality?
Transistors may be “free” but…
  – Wires are not!
  – Design time costs
  – Validation costs
  – Documentation costs
  – Time to Market costs
The answer is not 0%
  – PMU proven to improve performance
But it’s not 10% either!

The PMU Has Tentacles Everywhere!
[Figure label: PMU Central Collector]

What to Architect in the PMU?
“Machine Architecture is a contract between hardware and software”
Architect too much…
  – Lowers performance through design constraints
  – Events don’t map well to hardware
Architect too little…
  – Jeopardizes Software Investment
  – Discourages Software Support

Itanium® Architecture: PMU
Architected
  – Access & Management of PMU Resources
      – PMD registers for Data, PMC registers to control PMU
  – Counter Overflow Behavior and Interrupt Handling
  – Only a few basic counter events
Implementation Dependent
  – Number of counters, width of counters
  – Non-counter performance monitors
  – Events: Encourage use of CPU-specific tables
Itanium architecture protects OS and Tool infrastructure while promoting performance and full visibility

Performance Events – Let me count the ways…
Which events are important?
  – How will the events be used?
  – Do you really care about a cache miss if it doesn’t cause any stalls?
Mapping an event to signals
  – Needed signal may not be available
      – On critical path, lack of wires, no signal
  – Combining signals is problematic
      – Distance between signals, timing, logic

Itanium® 2 Processor PMU Events

Event Category          Number of Events
Cycle Accounting        89
Instruction Execution   42
Branches                69
Caches & TLBs           150
Bus                     73
Misc                    20
Total                   443

Where are the Performance Problems?
Counters only give type of problem and magnitude of the problem
Use filters on counters (hunt & peck)
Itanium® architecture currently includes:
  – Opcode Filters
  – Privilege Level Filters
  – Instruction Address Range Filters
  – Data Address Range Filters
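
The "hunt & peck" use of filters can be sketched in software. This is a toy model only: the class, its fields, and the `hunt` helper are hypothetical names, the event stream is replayed from a list rather than read from hardware, and only two of the four filter types (privilege level and instruction address range) are modeled.

```python
from dataclasses import dataclass


@dataclass
class FilteredCounter:
    """Toy event counter gated by two filters (all names are hypothetical)."""
    lo: int                       # inclusive start of the instruction address range
    hi: int                       # exclusive end of the instruction address range
    privilege_levels: frozenset   # privilege levels that are allowed to count
    count: int = 0

    def observe(self, ip: int, privilege: int) -> None:
        """Count one event only if it passes both filters."""
        if privilege in self.privilege_levels and self.lo <= ip < self.hi:
            self.count += 1


def hunt(events, ranges, privilege_levels=frozenset({3})):
    """Replay the same events once per candidate address range."""
    totals = {}
    for lo, hi in ranges:
        counter = FilteredCounter(lo, hi, privilege_levels)
        for ip, privilege in events:
            counter.observe(ip, privilege)
        totals[(lo, hi)] = counter.count
    return totals
```

Each candidate range needs its own pass over the workload, which is why narrowing down a problem with filters alone is slow.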

A Better Way to Locate Performance Problems
Event Address Registers (EARs)
  – Logs information about a single cache miss
  – The logs are sampled by software
  – Creates a statistical profile of cache misses
Branch Trace Buffer (BTB)
  – Logs information about consecutive branches
  – Logs also sampled by software

Lend Me an EAR
Instruction & Data EARs
  – Log Instruction Address of Miss
      – Data EAR also logs Data Address of Miss
  – Log Latency of Miss
  – Filter by latency bin
  – Have an associated counter event
  – Can also log TLB misses
      – And where TLB miss was resolved
Have proven to be extremely useful
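
How software might turn sampled EAR records into a statistical profile can be sketched as follows. The record layout and the names `EarSample` and `profile` are assumptions made for illustration; they are not the register format. The latency threshold is applied here in software, whereas the slide describes the filter as part of the EAR itself.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class EarSample(NamedTuple):
    """One sampled data-miss record (field names are hypothetical)."""
    instr_addr: int   # address of the instruction that missed
    data_addr: int    # address of the data that missed
    latency: int      # miss latency in cycles


def profile(samples: Iterable[EarSample], min_latency: int = 0) -> Counter:
    """Total sampled latency per instruction address, ignoring short misses."""
    totals = Counter()
    for s in samples:
        if s.latency >= min_latency:
            totals[s.instr_addr] += s.latency
    return totals
```

Calling `profile(samples, min_latency=16).most_common(5)` would list the five instruction addresses with the most sampled miss latency, which is the kind of answer counters alone cannot give.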

The D-EAR Shadow Effect
[Figure labels: Miss Recorded; Miss Recorded; Latency Counter Busy; “Without extra hardware, these misses would never be recorded!”]

The D-EAR Shadow Effect: The Itanium® 2 Processor Solution
Don’t track every opportunity; randomly pick misses to track
Tradeoff: shadow mitigation versus sampling frequency
Use an LFSR (linear-feedback shift register) to decide which port to sample and whether to sample
Every miss has ~1 in 8 chance of being tracked
This mitigates the shadow effect but does not totally eliminate it
Customer feedback indicates it works very well
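
The idea of letting an LFSR pick roughly one miss in eight can be illustrated with a small sketch. The slides do not give the LFSR width, taps, port count, or which bits drive the decision, so everything here is an assumption: a textbook 16-bit Fibonacci LFSR, two ports, the low three bits deciding whether to sample, and the next bit choosing the port.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11) by one step."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def choose_misses(num_misses: int, seed: int = 0xACE1):
    """Yield (miss_index, port) for the misses selected for tracking.

    Assumptions: two ports, a 16-bit LFSR, and the low three bits deciding
    whether to sample. None of these details come from the slides.
    """
    state = seed
    for i in range(num_misses):
        state = lfsr16_step(state)
        if state & 0b111 == 0:          # roughly 1 in 8 states
            yield i, (state >> 3) & 1   # another bit picks the port
```

Because the selection is pseudo-random rather than "first miss after the counter frees up", misses that would otherwise always fall in the shadow of an earlier miss get a chance to be tracked, at the cost of a lower sampling rate.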

The Itanium® 2 Processor’s Branch Trace Buffer (BTB)
An eight-entry circular buffer
Each entry contains either:
  – Address & Prediction Data of a branch, or
  – Address of a branch target
Uses of the BTB
  – Mis-predicted branch profiler
  – An efficient Instruction Address Profiler
  – Path Profiler
Cool use: in conjunction with EARs
  – Path leading up to sampled miss!
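
The behavior of an eight-entry circular trace buffer can be modeled in a few lines. Only the entry count and the two entry kinds come from the slide; the class name, method names, and tuple layout are hypothetical.

```python
from collections import deque

BTB_ENTRIES = 8  # the slide states an eight-entry circular buffer


class BranchTrace:
    """Toy model: the newest entry overwrites the oldest once the buffer is full."""

    def __init__(self):
        self._entries = deque(maxlen=BTB_ENTRIES)

    def record_branch(self, addr: int, mispredicted: bool) -> None:
        """Log a branch address with its prediction outcome."""
        self._entries.append(("branch", addr, mispredicted))

    def record_target(self, addr: int) -> None:
        """Log a branch target address."""
        self._entries.append(("target", addr, None))

    def snapshot(self) -> list:
        """Return the entries oldest-first, as sampling software might read them."""
        return list(self._entries)
```

If a snapshot is taken when an EAR sample fires, the branch and target entries give the path that led up to the sampled miss.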

The Itanium® 2 Processor’s PMU Helps Improve Performance
Performance is measured using specific computer systems and reflects the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Conclusions
Walking a micron in HW design shoes
  – Balancing PMU functionality & overall performance
We need to move beyond counters!
  – Itanium® 2 processors provide EARs and BTBs
  – What’s next?
The Itanium® 2 processor’s PMU has much to offer
  – Customers are making good use of it
  – Would like to see more use: how do we do it?
Discussion
  – What is the long-term vision for the PMU?
  – What can the PMU provide to improve current and future systems?
  – Did anything “stick” or resonate?
Itanium® and Itanium® 2 are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries

For More Information…
Manuals
Intel Itanium Architecture Software Developer's Manuals, Volume 1: Application Architecture
  – Part II: Optimization Guide
Intel Itanium Architecture Software Developer's Manuals, Volume 2: System Architecture
  – Chapter 7: Debugging and Performance Monitoring
  – Chapter 12: Performance Monitoring Support
Intel Itanium 2 Processor Reference Manual for Software Development and Optimization
  – Chapter 10: Performance Monitoring
  – Chapter 11: Performance Monitor Events