Performance monitoring on HP Alpha using DCPI


Performance monitoring on HP Alpha using DCPI
Paul J. Drongowski
Hewlett Packard Corporation
Paul.Drongowski@hp.com
10 September 2002

Objectives for this presentation
- Give a brief introduction to the HP Continuous Profiling Infrastructure (DCPI)
- Present instruction sampling, a technique to precisely assign hardware events to instructions on out-of-order execution architectures
- Demonstrate the accuracy of instruction sampling as applied to a small floating point program kernel

HP Continuous Profiling Infrastructure
- In daily use with hundreds of registered customers
- Application and system profiler
- System-wide data collection and analysis
- Practical applications include:
  - Troubleshooting performance
  - Driving compiler and post-link optimization (SPIKE)
  - Guiding hardware/software architectural design decisions

“The goal of the continuous profiling project is to produce runtime execution profiles of unmodified Alpha UNIX programs with such low overhead that customers boot with profiling turned on and don’t turn it off. Continuous profiling is [part of a larger project] to substantially improve the performance of large customer programs.” Dick Sites, 1996.

Features
- Transparent
- Comprehensive, system-wide profiles
- Don’t need to modify source or binary
- Continuous profiling with low overhead (2% to 5%)
- Incorporates many novel, patented techniques
  - Instruction sampling
  - Aggregation during data collection
  - Stall blame analysis
  - Value profiling

Definition: Conventional sampling
- AKA “PC sampling”
- Hardware counter counts occurrences of an event
- Hardware triggers interrupt on overflow
- Device driver associates overflow with program counter value
- Builds up a profile of dynamic program behavior (e.g., number of times each instruction retired, number of I-cache misses for each instruction)
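The counter-overflow mechanism described on this slide can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the trace, the addresses, and the function name `pc_sample` are hypothetical, the sampling period is fixed, and the sample is charged to exactly the instruction that caused the overflow, which is the idealized case that real hardware does not guarantee.

```python
from collections import Counter

def pc_sample(trace, period):
    """Toy model of counter-overflow ("PC") sampling.

    trace  -- iterable of program counter values, one per counted event
    period -- number of events between overflow interrupts (assumed fixed)
    """
    profile = Counter()
    count = 0
    for pc in trace:
        count += 1              # hardware counter increments on each event
        if count == period:     # counter overflows and raises an interrupt
            profile[pc] += 1    # driver charges the sample to the current PC
            count = 0           # counter is re-armed for the next period
    return profile

# Hypothetical trace: a 4-instruction loop executed 1000 times.
loop = [0x1000, 0x1004, 0x1008, 0x100C]
trace = loop * 1000
profile = pc_sample(trace, period=7)

for pc in loop:
    print(hex(pc), profile[pc])
```

Because every instruction in the loop executes equally often, the per-address sample counts come out nearly equal, which is the behavior a profile is meant to reflect.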

Problem: skew and smear
- In-order instruction execution
  - Sequential issue
  - Predictable order of instruction execution
  - Predictable skew, very little smear
  - Alpha implementations: 21064, 21164
- Out-of-order instruction execution
  - Non-sequential issue based on data-/resource-readiness
  - Unpredictable order of instruction execution
  - Unknown skew and smear over in-flight instruction window
  - Alpha implementations: 21264, 21364

ProfileMe instruction sampling
- Implemented in Alpha 21264A and later
- Eliminate skew and smear
- Basic approach
  - Randomly select an instruction to monitor
  - Capture event information as instruction executes in pipeline
  - Trigger interrupt when instruction completes or aborts
  - Collect and aggregate instruction information/events
- Program counter value is part of ProfileMe information so event attribution to the instruction is precise
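The basic approach listed above can be sketched in software. This is a rough illustration, not the hardware design: the function name `profileme_sample`, the record fields, the trace format, and the uniform random interval are all assumptions made for the example. The point it shows is that each sampled record carries its own program counter, so events are aggregated against the instruction that actually experienced them.

```python
import random
from collections import defaultdict

def profileme_sample(trace, mean_period, seed=0):
    """Toy model of instruction sampling.

    trace       -- list of (pc, retired, dtb_miss) tuples, one per fetched
                   instruction (a hypothetical format for this sketch)
    mean_period -- average number of instructions between samples
    """
    rng = random.Random(seed)
    profile = defaultdict(lambda: {"retired": 0, "aborted": 0, "dtb_miss": 0})
    # Randomized interval until the next instruction is selected.
    countdown = rng.randint(1, 2 * mean_period - 1)
    for pc, retired, dtb_miss in trace:
        countdown -= 1
        if countdown > 0:
            continue
        # The selected instruction is tracked until it completes or aborts;
        # its record, including the PC, is then aggregated.
        record = profile[pc]
        record["retired" if retired else "aborted"] += 1
        record["dtb_miss"] += int(dtb_miss)
        countdown = rng.randint(1, 2 * mean_period - 1)
    return profile

# Hypothetical trace: one instruction that always retires and one that
# always aborts after a DTB miss.
trace = [(0x10, True, False), (0x14, False, True)] * 5000
profile = profileme_sample(trace, mean_period=50)

for pc, record in sorted(profile.items()):
    print(hex(pc), record)
```

In this model the abort and DTB-miss counts can only land on address 0x14, because attribution uses the PC stored in the sample rather than the PC at interrupt time.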

Experiment: Measurement of instruction execution frequency
- Key technique needed to compute FLOPS
- How accurate is a sampling-based estimate of individual instruction frequency?
- Accuracy
  - Precision (statistical dispersion)
  - Bias (not addressed here; subject for additional study)
- Method
  - Run FP kernel 1000 times and capture profile data
  - Record number of retire samples for FP instruction in inner loop
  - Record basic block frequency estimation for same instruction
  - Compute and assess descriptive statistics

Example FP kernel

    /* Matrix-Matrix multiply */
    for (i=0;i<INDEX;i++)
      for(j=0;j<INDEX;j++)
        for(k=0;k<INDEX;k++)
          mresult[i][j] =
            mresult[i][j] +
            matrixa[i][k] * matrixb[k][j]
          ;

Execution time (667MHz Alpha 21264A)
- Without DCPI: 52.58 seconds; with DCPI: 54.90 seconds
- Overhead: 4.41% while collecting 25,000 cycle and ProfileMe samples per second
- 15,751 retire samples expected per inner loop instruction: (iterations) / (sample period) = 1,000,000,000 / 63,488
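The two figures on this slide follow from simple arithmetic, reproduced here as a check. The variable names are illustrative; the inputs are the values quoted on the slide.

```python
# Inputs taken from the slide.
iterations = 1_000_000_000      # inner loop iterations
sample_period = 63_488          # instructions per ProfileMe sample
time_without = 52.58            # seconds, without DCPI
time_with = 54.90               # seconds, with DCPI

# Each inner-loop instruction executes once per iteration, so the expected
# number of retire samples per instruction is iterations / sample period.
expected_samples = iterations / sample_period

# Overhead is the added run time relative to the unprofiled run.
overhead = (time_with - time_without) / time_without

print(f"expected retire samples: {expected_samples:.0f}")
print(f"profiling overhead: {overhead:.2%}")
```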

Image-by-image overview
DCPI provides top level view to find candidates for drill down

    retired:count       %     cum%  image
           197520  48.69%   48.69%  /dsk0h/dcpidb/PALcode
           157574  38.85%   87.54%  flops
            48656  12.00%   99.54%  /vmunix
             1451   0.36%   99.89%  /usr/shlib/libc.so
              181   0.04%   99.94%  /usr/bin/dcpid
               71   0.02%   99.96%  /sbin/loader
               51   0.01%   99.97%  . . .

Instruction-by-instruction - ProfileMe

     Retire     BB
    samples   freq  Address        Instruction
      15780  15713  0x120001218 :  lds $f10, 0(a4)
      15508  15713  0x12000121c    lds $f11, 0(a5)
      15789  15713  0x120001220    addl a3, 0x1, a3
      15582  15713  0x120001224    lda a4, 4(a4)
      15753  15713  0x120001228    lda t10, -1000(a3)
      15693  15713  0x12000122c    lda a5, 4000(a5)
      15920  15713  0x120001230    muls $f10,$f11,$f10
      15607  15713  0x120001234    adds $f1,$f10,$f1
      15787  15713  0x120001238    sts $f1, 0(a1)
      15714  15713  0x12000123c    blt t10, 0x120001218

Source line summary - ProfileMe

     Retire     BB
    samples   freq  Source line
          0      0  /* Matrix-Matrix multiply */
          0  15713  for (i=0;i<INDEX;i++)
         74  15713  for(j=0;j<INDEX;j++)
      78459  15713  for(k=0;k<INDEX;k++)
      15651  15713  mresult[i][j] =
      15835  15713  mresult[i][j] +
      47322  15713  matrixa[i][k]*matrixb[k][j]
          0      0  ;

Instruction-by-instruction - Conventional

     Retire
    samples  Address        Instruction
      14013  0x120001218 :  lds $f10, 0(a4)
     291864  0x12000121c    lds $f11, 0(a5)
      22708  0x120001220    addl a3, 0x1, a3
          0  0x120001224    lda a4, 4(a4)
          0  0x120001228    lda t10, -1000(a3)
          0  0x12000122c    lda a5, 4000(a5)
      10794  0x120001230    muls $f10,$f11,$f10
          0  0x120001234    adds $f1,$f10,$f1
       5365  0x120001238    sts $f1, 0(a1)
          0  0x12000123c    blt t10, 0x120001218

Retire samples (1000 runs)
Floating add retires
[Figure: histogram of ProfileMe retire samples (x-axis, 15200 to 16200) against frequency (y-axis), with -2sd, -sd, +sd and +2sd marked]
- Minimum sample: 15359
- Maximum sample: 16111
- Average: 15745.859
- Standard deviation: 122.243
- Coeff of variation: 0.776%
- Error: ± 1.552%

BB frequency estimates (1000 runs)
Floating add BB freq est
[Figure: histogram of basic block frequency estimates (x-axis, 15600 to 15900) against frequency (y-axis), with -2sd, -sd, +sd and +2sd marked]
- Minimum sample: 15639
- Maximum sample: 15841
- Average: 15745.983
- Standard deviation: 35.745
- Coeff of variation: 0.227%
- Error: ± 0.454%
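The dispersion figures on the two histogram slides can be recomputed from the reported averages and standard deviations. The slides do not define "Error"; the sketch below assumes it is twice the coefficient of variation (roughly a ±2 standard deviation band relative to the mean), which matches the quoted values to within rounding. The function name is illustrative.

```python
def relative_dispersion(mean, sd):
    """Return (coefficient of variation, assumed error band of 2 * CV)."""
    cv = sd / mean
    return cv, 2 * cv

# Values reported on the slides for the floating add instruction.
raw_cv, raw_err = relative_dispersion(mean=15745.859, sd=122.243)
bb_cv, bb_err = relative_dispersion(mean=15745.983, sd=35.745)

print(f"raw retire samples: CV {raw_cv:.3%}, error +/- {raw_err:.3%}")
print(f"BB frequency est.:  CV {bb_cv:.3%}, error +/- {bb_err:.3%}")
print(f"sd ratio (raw / BB): {122.243 / 35.745:.2f}")
```

On these numbers the basic block estimate has a standard deviation about 3.4 times smaller than the raw per-instruction sample count.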

Improving precision
- Square root law: quadrupling the sample size doubles the precision of the estimate
- Practical techniques for increasing sample size
  - Execute test for a longer period of time
  - Aggregate the results of multiple program runs
  - Increase the sampling rate
- DCPI facilitates all three techniques
  - It supports analysis of very large, long running programs
  - It automatically aggregates multiple runs
  - It supports higher, selectable sampling rates
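The square root law can be made concrete with a short calculation. The sketch assumes that sample counts behave like a Poisson count, so the relative standard error of a count of n samples is about 1/sqrt(n); that model is an assumption of this example, not something stated in the slides.

```python
import math

def relative_std_error(n_samples):
    """Relative standard error of a count, assuming Poisson-like sampling."""
    return 1.0 / math.sqrt(n_samples)

base = 15_751  # expected retire samples per instruction in the experiment

for factor in (1, 4, 16):
    n = base * factor
    print(f"{n:>7} samples -> relative error {relative_std_error(n):.3%}")
```

Under this assumption, 15,751 samples give a relative error of about 0.80%, the same order as the 0.776% coefficient of variation measured for raw retire samples, and each quadrupling of the sample count halves it.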

Conclusions
- DCPI: a practical tool for program analysis
  - An accepted, production-ready tool
  - Transparent, low-overhead, comprehensive
  - Pinpoints performance issues at the instruction and source language statement levels
- Experimental results
  - ProfileMe mitigates the effects of smear and skew that are present in conventional sampling on O-O-O machines
  - Precision as measured in the experiment
    - Raw samples: ±1.552%
    - Basic block frequency estimation: ±0.454%
  - Basic block frequency analysis substantially improves precision
- DCPI and ProfileMe technology can be applied to architectures other than Alpha such as IA-32 and IPF

HP Continuous Profiling Infrastructure
- Offered as an “Advanced Development Kit” on Tru64 UNIX
- Agree to on-line field test agreement
- Current version is 3.9.6
- Contact: dcpi@hp.com
- URL: http://www.tru64unix.compaq.com/dcpi

Main components and their roles
- Performance counters: monitor and count CPU events
- Device driver: collects samples and performs first level data aggregation
- Daemon: image correlation and second level aggregation
- Database: stores profile data by epoch, host, image
- Tools: access, analyze and present profile information

[Diagram: flow of profile data through Performance counters, Device driver, Daemon, Database, Tools]

ProfileMe instruction information
- Program counter
- Instruction was a regular/PAL instruction
- Pipeline trap occurred
- Misprediction trap occurred
- Load-store order trap occurred
- Pipeline trap type
- Instruction was not yet prefetched
- Instruction was killed before register mapping
- Instruction stalled before register mapping
- Instruction retired without trapping (valid)
- Branch was taken
- Auxiliary counts: retire delay, retires in profiling window

Image-by-image overview
Many aborted instructions due to DTB misses

     Retire    Abort  DTBmiss
    samples  samples  samples  Image
      97480     9699        1  /dsk0h/dcpidb/PALcode
      78253   200815     3878  flops
      24766      597        1  /vmunix
        668      148        0  /usr/shlib/libc.so
         49       31        0  /usr/bin/dcpid
         30        3        0  /sbin/loader

DTB misses
DTB miss summary for inner loop of flops program

     Retire    Abort  DTBmiss
    samples  samples  samples  Freq  Address        Instruction
       7822    18283       39  7805  0x120001218 :  lds $f10, 0(a4)
       7677    22059     3812  7805  0x12000121c    lds $f11, 0(a5)
       7747    19259        0  7805  0x120001220    addl a3, 0x1, a3
       7797    19377        0  7805  0x120001224    lda a4, 4(a4)
       7980    19338        0  7805  0x120001228    lda t10, -1000(a3)
       7865    19415        0  7805  0x12000122c    lda a5, 4000(a5)
       7682    20811        0  7805  0x120001230    muls $f10,$f11,$f10
       7831    20576        0  7805  0x120001234    adds $f1,$f10,$f1
       7865    20715       27  7805  0x120001238    sts $f1, 0(a1)
       7785    20727        0  7805  0x12000123c    blt t10, 0x120001218

Dynamic Access to DCPI Data (DADD)
- Provide dynamic, runtime access to performance data
- Client / server relationship
  - Application (client) registers interest with DCPI daemon (server)
  - Daemon serves data to application via shared memory region
- “Virtual counter” API
  - Daemon summarizes profile information into event counts
  - Event counts are written periodically into shared memory region
  - DADD provides virtual counters to PAPI implementation
- Status: Experimental prototype under development

[Diagram: flow of profile/performance data through Performance counters, Device driver, Daemon (server), Application (client)]