ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors. Jeffrey Dean, Jamey Hicks, Carl Waldspurger, William Weihl, George Chrysos.

Presentation transcript:

Slide 1: ProfileMe: Hardware Support for Instruction-Level Profiling on Out-of-Order Processors
Jeffrey Dean, Jamey Hicks, Carl Waldspurger, William Weihl, George Chrysos
Digital Equipment Corporation

Slide 2: Motivation
Identify Performance Bottlenecks
– especially unpredictable dynamic stalls, e.g. cache misses, branch mispredicts, etc.
– complex out-of-order processors make this difficult
Guide Optimizations
– help programmers understand and improve code
– automatic, profile-driven optimizations
Profile Production Workloads
– low overhead
– transparent
– profile whole system

Slide 3: Outline
Obtaining Instruction-Level Information
ProfileMe
– sample instructions, not events
– sample interactions via paired sampling
Potential Applications of Profile Data
Future Work
Conclusions

Slide 4: Existing Instruction-Level Sampling
Use Hardware Event Counters
– small set of software-loadable counters
– each counts a single event at a time, e.g. dcache miss
– counter overflow generates interrupt
Advantages
– low overhead vs. simulation and instrumentation
– transparent vs. instrumentation
– complete coverage, e.g. kernel, shared libs, etc.
Effective on In-Order Processors
– analysis computes execution frequency
– heuristics identify possible reasons for stalls
– example: DIGITAL's Continuous Profiling Infrastructure

Slide 5: Problems with Event-Based Counters
Can't Simultaneously Monitor All Events
Limited Information About Events
– event has occurred, but no additional context, e.g. cache miss latencies, recent execution path, ...
Blind Spots in Non-Interruptible Code
Key Problem: Imprecise Attribution
– interrupt delivers restart PC, not PC that caused event
– problem worse on out-of-order processors

Slide 6: Problem: Imprecise Attribution
Experiment
– monitor data loads
– loop: single load + hundreds of nops
In-Order Processor
– Alpha
– skew; large peak
– analysis plausible
Out-of-Order Processor
– Intel Pentium Pro
– skew and smear
– analysis hopeless
[Figure: histograms of sampled PCs for the two processors, with the position of the load marked]
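A toy simulation, not taken from the slides, can illustrate the difference between skew and smear. It assumes that an event-counter interrupt reports an instruction some distance after the one that caused the event: a fixed distance on the in-order machine, a variable distance on the out-of-order machine. The function name `attributed_pcs` and the delay values (6, and 10 to 40) are hypothetical, and a real in-order profile shows a large peak with some spread rather than a single spike.

```python
import random
from collections import Counter

def attributed_pcs(event_index, num_events, delay_fn):
    """Histogram of instruction indices that the interrupt blames for events
    that were all caused by the instruction at event_index."""
    return Counter(event_index + delay_fn() for _ in range(num_events))

rng = random.Random(0)  # fixed seed so the run is repeatable

# In-order model (assumed): the interrupt always lands 6 instructions late.
in_order = attributed_pcs(0, 1000, lambda: 6)

# Out-of-order model (assumed): the delay varies with the work in flight.
out_of_order = attributed_pcs(0, 1000, lambda: rng.randint(10, 40))

print("in-order: distinct instructions blamed =", len(in_order))
print("out-of-order: distinct instructions blamed =", len(out_of_order))
```

In both models the load itself (index 0) never receives a sample. With a fixed delay an analysis tool can shift the peak back to the load; with a variable delay the samples are spread over many instructions and no single shift recovers the culprit.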

Slide 7: Outline
Obtaining Instruction-Level Information
ProfileMe
– sample instructions, not events
– sample interactions via paired sampling
Potential Applications of Profile Data
Future Work
Conclusions

Slide 8: ProfileMe: Instruction-Centric Profiling
[Diagram: pipeline stages fetch, map, issue, exec, retire, shown with the icache, branch predictor, dcache, and arithmetic units]
Labels on the diagram:
– counter overflow? tag!
– tagged? capture!
– done? interrupt!
– internal processor registers: pc, addr, retired?, miss? (appears twice), mp?, history, stage latencies

Slide 9: Instruction-Level Statistics
PC + Retire Status → execution frequency
PC + Cache Miss Flag → cache miss rates
PC + Branch Mispredict → mispredict rates
PC + Event Flag → event rates
PC + Branch Direction → edge frequencies
PC + Branch History → path execution rates
PC + Latency → instruction stalls
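The mapping from sampled fields to statistics can be sketched as a simple aggregation by PC. This is a minimal illustration, not the authors' tooling: the `Sample` record, its field names, the single `latency` value, and the `sampling_interval` parameter (mean number of fetched instructions between samples) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Sample:
    pc: int            # address of the sampled instruction
    retired: bool      # True if it retired, False if it was aborted
    dcache_miss: bool  # data-cache miss flag
    mispredict: bool   # branch mispredict flag
    latency: int       # cycles in one pipeline stage (hypothetical field)

def per_pc_stats(samples, sampling_interval):
    """Group samples by PC and derive per-instruction statistics."""
    groups = defaultdict(list)
    for s in samples:
        groups[s.pc].append(s)

    stats = {}
    for pc, group in groups.items():
        n = len(group)
        retired = sum(s.retired for s in group)
        stats[pc] = {
            # each retired sample stands for about sampling_interval executions
            "est_retire_count": retired * sampling_interval,
            "miss_rate": sum(s.dcache_miss for s in group) / n,
            "mispredict_rate": sum(s.mispredict for s in group) / n,
            "mean_latency": sum(s.latency for s in group) / n,
        }
    return stats

samples = [
    Sample(0x100, True, True, False, 10),
    Sample(0x100, True, False, False, 2),
    Sample(0x104, False, False, True, 4),
]
stats = per_pc_stats(samples, sampling_interval=1000)
print(stats[0x100])
```

Because every sample carries its own PC and flags, all of these statistics come from one pass over the same records, with no need to dedicate a separate counter to each event type.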

Slide 10: Example: Retire Count Convergence
[Plot: ratio of estimated to actual retire count (Estimate / Actual); accuracy ∝ 1/√N]
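The convergence behaviour can be reproduced with a small Monte Carlo sketch. It assumes that N is the number of samples collected for an instruction and that each execution is sampled independently with probability 1/interval, which is a simplification of the hardware. The function names and the parameter values are hypothetical.

```python
import math
import random

def estimate_retire_count(actual_count, interval, rng):
    """Sample each execution with probability 1/interval, then scale the
    number of samples back up to estimate the true count."""
    samples = sum(1 for _ in range(actual_count)
                  if rng.random() < 1.0 / interval)
    return samples * interval

def expected_relative_error(num_samples):
    """Rule of thumb for sampling error: about 1/sqrt(N) for N samples."""
    return 1.0 / math.sqrt(num_samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
actual = 100_000
estimate = estimate_retire_count(actual, interval=100, rng=rng)
print("estimate / actual =", estimate / actual)
print("expected relative error at about 1000 samples =",
      expected_relative_error(1000))
```

Under this model, quadrupling the number of samples halves the expected relative error, so frequently executed instructions converge quickly while rarely executed ones need much longer profiling runs.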

Slide 11: Identifying True Bottlenecks
ProfileMe: Detailed Data for Single Instruction
In-Order Processors
– ProfileMe PC + latency data identifies stalls
– stalled instructions back up pipeline
Out-of-Order Processors
– explicitly designed to mask stall latency, e.g. dynamic reordering, speculative execution
– stall does not necessarily imply bottleneck
Example: Does This Stall Matter?
    load r1, …
    add …, r1, …
    average latency: 35.0 cycles
    … other instructions …

Slide 12: Issue: Need to Measure Concurrency
Interesting Concurrency Metrics
– retired instructions per cycle
– issue slots wasted while an instruction is in flight
– pipeline stage utilization
How to Measure Concurrency?
Special-Purpose Hardware
– some metrics difficult to measure, e.g. need retire/abort status
Sample Potentially-Concurrent Instructions
– aggregate info from pairs of samples
– statistically estimate metrics

Slide 13: Paired Sampling
Sample Two Instructions
– may be in-flight simultaneously
– replicate ProfileMe hardware, add intra-pair distance
Nested Sampling
– sample window around first profiled instruction
– randomly select second profiled instruction
– statistically estimate frequency for F(first, second)
[Figure: time axis with a sampling window from -W to +W around the first instruction, showing overlap and no-overlap cases]
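One way to read the nested-sampling idea is as a ratio estimate: if the second instruction is chosen uniformly from the window around the first, the fraction of pairs that overlap, scaled by the window size, estimates how many instructions in the window are in flight together with the first. The sketch below is an illustration under that assumption only. The `InFlight` record, its cycle fields, and `window_size` (the number of candidate instructions in the window) are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class InFlight:
    pc: int
    fetch_cycle: int   # cycle at which the instruction was fetched
    done_cycle: int    # cycle at which it retired or was aborted

def overlaps(a, b):
    """True if the two instructions were in flight during a common cycle."""
    return a.fetch_cycle <= b.done_cycle and b.fetch_cycle <= a.done_cycle

def estimate_overlap(pairs, window_size):
    """For each first-sample PC, estimate how many instructions in the
    sampling window are in flight at the same time as that instruction."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for first, second in pairs:
        totals[first.pc] += 1
        if overlaps(first, second):
            hits[first.pc] += 1
    return {pc: window_size * hits[pc] / totals[pc] for pc in totals}

first = InFlight(pc=0x10, fetch_cycle=0, done_cycle=40)
pairs = [
    (first, InFlight(0x20, 5, 10)),    # overlaps
    (first, InFlight(0x24, 50, 60)),   # no overlap
    (first, InFlight(0x28, 35, 45)),   # overlaps
    (first, InFlight(0x2c, 41, 50)),   # no overlap
]
overlap_estimate = estimate_overlap(pairs, window_size=100)
print(overlap_estimate)
```

The same pair records could be filtered further, for example counting only second instructions that retired, to approximate metrics such as useful work done while the first instruction was stalled.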

Slide 14: Other Uses of Paired Sampling
Path Profiling
– two PCs close in time can identify execution path
– identify control flow, e.g. indirect branches, calls, traps
Direct Latency Measurements
– data load-to-use
– loop iteration cost

Slide 15: Outline
Obtaining Instruction-Level Information
ProfileMe
– sample instructions, not events
– sample interactions via paired sampling
Potential Applications of Profile Data
Future Work
Conclusions

Slide 16: Exploiting Profile Data
Latencies and Concurrency
– identify and understand bottlenecks
– improved scheduling, code generation
Cache Miss Data
– code stream rearrangement
– guide prefetching, instruction scheduling
Miss Addresses
– inform OS page mapping policies
– data reorganization
Branch History, PC Pairs
– identify common execution paths
– trace scheduling

Slide 17: Example: Path Profiles
Experiment
– intra-procedural path reconstruction
– control-flow merges
– SPECint95 data
Execution Counts
– most likely path based on frequency
History Bits
– path consistent with global branch history
History + Pairs
– path must contain both PCs in pair
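The slides do not spell out the reconstruction algorithm, so the following is only one plausible sketch of the "History Bits" idea: walk backward from the sampled block through the control-flow graph and keep the paths whose conditional-branch outcomes match the captured history. The graph encoding (`preds`), the function name, and the convention that `history` lists outcomes most recent first are assumptions, and the sketch assumes every cycle in the graph contains a conditional branch.

```python
def consistent_paths(preds, block, history, entry):
    """Return the backward paths into `block` that agree with `history`.

    preds:   block -> list of (predecessor, outcome); outcome is 1 (taken)
             or 0 (not taken) for a conditional edge, None otherwise.
    history: branch outcomes, most recent first.
    """
    if block == entry:
        return [[block]]
    paths = []
    truncated = False
    for pred, outcome in preds.get(block, []):
        if outcome is None:
            remaining = history          # unconditional edge uses no bit
        elif not history:
            truncated = True             # out of history: cannot look further
            continue
        elif history[0] == outcome:
            remaining = history[1:]      # edge agrees with the newest bit
        else:
            continue                     # edge contradicts the history
        for prefix in consistent_paths(preds, pred, remaining, entry):
            paths.append(prefix + [block])
    if truncated:
        paths.append([block])            # partial path starting here
    return paths

# Diamond: A branches to B (taken) or C (not taken); both merge into D.
preds = {
    "B": [("A", 1)],
    "C": [("A", 0)],
    "D": [("B", None), ("C", None)],
}
print(consistent_paths(preds, "D", [1], "A"))
print(consistent_paths(preds, "D", [], "A"))
```

With one history bit the merge at D is resolved to a single path; with no history both arms remain candidates, which is the ambiguity that execution counts alone would have to settle by frequency.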

Slide 18: Future Work
Analyze Production Systems
Develop New Analyses for ProfileMe Data
– cluster samples using events, branch history
– reconstruct frequently-occurring pipeline states
Explore Automatic Optimizations
– better scheduling, prefetching, code and data layout
– inform OS policies
ProfileMe for Memory System Transactions
– can sample memory behavior not visible from processor
– sample cache sharing and interference

Slide 19: Related Work
Westcott & White (IBM Patent)
– collects latency and some event info for instructions
– only for retired instructions
– only when instruction is assigned particular inum, which can introduce bias into samples
Specialized Hardware Mechanisms
– CML Buffer (Bershad et al.): locations of frequent misses
– Informing Loads (Horowitz et al.): status bit to allow SW to react to cache misses
– can often obtain similar info by analyzing ProfileMe data

Slide 20: Conclusions
ProfileMe: Sample Instructions, Not Events
– provides wealth of instruction-level information
– paired sampling reveals dynamic interactions
– modest hardware cost
– useful for in-order processors, essential for out-of-order
Improvements Over Imprecise Event Counters
– precise attribution
– no blind spots
– improved event collection, e.g. branch history, concurrency, correlated events

Slide 21: Further Information
DIGITAL's Continuous Profiling Infrastructure project: