1
Performance monitoring on HP Alpha using DCPI
Paul J. Drongowski Hewlett Packard Corporation 10 September 2002
2
Objectives for this presentation
  Give a brief introduction to the HP Continuous Profiling Infrastructure (DCPI)
  Present instruction sampling, a technique for precisely assigning hardware events to instructions on out-of-order execution architectures
  Demonstrate the accuracy of instruction sampling as applied to a small floating-point program kernel
3
HP Continuous Profiling Infrastructure
  In daily use with hundreds of registered customers
  Application and system profiler
  System-wide data collection and analysis
  Practical applications include:
    Troubleshooting performance
    Driving compiler and post-link optimization (SPIKE)
    Guiding hardware/software architectural design decisions
  “The goal of the continuous profiling project is to produce runtime execution profiles of unmodified Alpha UNIX programs with such low overhead that customers boot with profiling turned on and don’t turn it off. Continuous profiling is [part of a larger project] to substantially improve the performance of large customer programs.” Dick Sites, 1996.
4
Features
  Transparent
  Comprehensive, system-wide profiles
  No need to modify source or binary
  Continuous profiling with low overhead (2% to 5%)
  Incorporates many novel, patented techniques:
    Instruction sampling
    Aggregation during data collection
    Stall blame analysis
    Value profiling
5
Definition: Conventional sampling
  Also known as “PC sampling”
  A hardware counter counts occurrences of an event
  Hardware triggers an interrupt on counter overflow
  The device driver associates the overflow with the program counter value
  Builds up a profile of dynamic program behavior (e.g., number of times each instruction retired, number of I-cache misses for each instruction)
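The counter-overflow scheme above can be sketched as a small software model. This is only an illustration, not DCPI's driver code: the `pc_sampler` structure, the function names, and the fixed-size histogram indexed by instruction are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROFILE_SLOTS 16  /* hypothetical: one histogram slot per instruction */

/* Hypothetical model of one hardware event counter plus the driver's profile. */
struct pc_sampler {
    uint64_t period;                  /* events between overflow interrupts */
    uint64_t count;                   /* events seen since the last overflow */
    uint64_t samples[PROFILE_SLOTS];  /* samples attributed to each instruction */
};

void pc_sampler_init(struct pc_sampler *s, uint64_t period)
{
    assert(s != NULL && period > 0);
    s->period = period;
    s->count = 0;
    for (size_t i = 0; i < PROFILE_SLOTS; i++)
        s->samples[i] = 0;
}

/*
 * Record one event. `interrupted_pc` is the instruction index the interrupt
 * handler would observe if the counter overflowed on this event.
 * Returns 1 if a sample was taken, 0 otherwise.
 */
int pc_sampler_event(struct pc_sampler *s, size_t interrupted_pc)
{
    assert(s != NULL && interrupted_pc < PROFILE_SLOTS);
    if (++s->count < s->period)
        return 0;
    s->count = 0;                  /* overflow: restart the counter */
    s->samples[interrupted_pc]++;  /* attribute the sample to the observed PC */
    return 1;
}
```

Calling `pc_sampler_event` once per event builds the per-instruction histogram. The model attributes each sample to whatever PC the handler observes, which is exactly where skew and smear enter on a real machine.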
6
Problem: skew and smear
  In-order instruction execution
    Sequential issue
    Predictable order of instruction execution
    Predictable skew, very little smear
    Alpha implementations: 21064, 21164
  Out-of-order instruction execution
    Non-sequential issue based on data and resource readiness
    Unpredictable order of instruction execution
    Unknown skew and smear over the in-flight instruction window
    Alpha implementations: 21264, 21364
7
ProfileMe instruction sampling
  Implemented in Alpha 21264A and later
  Eliminates skew and smear
  Basic approach:
    Randomly select an instruction to monitor
    Capture event information as the instruction executes in the pipeline
    Trigger an interrupt when the instruction completes or aborts
    Collect and aggregate instruction information/events
  The program counter value is part of the ProfileMe information, so event attribution to the instruction is precise
8
Experiment: Measurement of instruction execution frequency
  Key technique needed to compute FLOPS
  How accurate is a sampling-based estimate of individual instruction frequency?
  Accuracy
    Precision (statistical dispersion)
    Bias (not addressed here; subject for additional study)
  Method
    Run the FP kernel 1000 times and capture profile data
    Record the number of retire samples for the FP instruction in the inner loop
    Record the basic block frequency estimate for the same instruction
    Compute and assess descriptive statistics
9
Example FP kernel
  /* Matrix-Matrix multiply */
  for (i = 0; i < INDEX; i++)
    for (j = 0; j < INDEX; j++)
      for (k = 0; k < INDEX; k++)
        mresult[i][j] = mresult[i][j] + matrixa[i][k] * matrixb[k][j];
  Execution time (667MHz Alpha 21264A)
    Without DCPI: seconds; with DCPI: seconds
    Overhead: 4.41% while collecting 25,000 cycle and ProfileMe samples per second
  15,751 retire samples expected per inner loop instruction
    (iterations) / (sample period) = 1,000,000,000 / 63,488
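The expected-sample arithmetic on this slide can be checked with a few lines of C. The function names are hypothetical, and INDEX is assumed to be 1000, which is what the 1,000,000,000 iterations quoted on the slide imply.

```c
#include <assert.h>
#include <stdint.h>

/* Inner-loop iterations of the triple-nested kernel for a given INDEX. */
uint64_t kernel_iterations(uint64_t index)
{
    return index * index * index;
}

/*
 * Expected samples for an instruction that retires `executions` times when
 * one instruction in every `sample_period` is sampled on average.
 */
double expected_samples(uint64_t executions, uint64_t sample_period)
{
    assert(sample_period > 0);
    return (double)executions / (double)sample_period;
}
```

With `kernel_iterations(1000)` as the execution count and a sample period of 63,488, `expected_samples` gives roughly 15,751, matching the figure on the slide.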
10
Image-by-image overview
  DCPI provides a top-level view to find candidates for drill-down
  Columns: retired:count, %, cum%, image
  Images listed:
    /dsk0h/dcpidb/PALcode
    flops
    /vmunix
    /usr/shlib/libc.so
    /usr/bin/dcpid
    /sbin/loader
    . . .
11
Instruction-by-instruction - ProfileMe
  Columns: Retire samples, BB freq, Address, Instruction
  Inner loop instructions:
    lds $f10, 0(a4)
    lds $f11, 0(a5)
    addl a3, 0x1, a3
    lda a4, 4(a4)
    lda t10, -1000(a3)
    lda a5, 4000(a5)
    muls $f10,$f11,$f10
    adds $f1,$f10,$f1
    sts $f1, 0(a1)
    blt t10, 0x
12
Source line summary - ProfileMe
  Columns: Retire samples, BB freq, Source line
  /* Matrix-Matrix multiply */
  for (i = 0; i < INDEX; i++)
    for (j = 0; j < INDEX; j++)
      for (k = 0; k < INDEX; k++)
        mresult[i][j] = mresult[i][j] + matrixa[i][k] * matrixb[k][j];
13
Instruction-by-instruction - Conventional
  Columns: Retire samples, Address, Instruction
  Inner loop instructions (a retire sample count of 0 is shown where the slide reports it):
    lds $f10, 0(a4)
    lds $f11, 0(a5)
    addl a3, 0x1, a3
    lda a4, 4(a4)            0
    lda t10, -1000(a3)       0
    lda a5, 4000(a5)         0
    muls $f10,$f11,$f10
    adds $f1,$f10,$f1        0
    sts $f1, 0(a1)
    blt t10, 0x              0
14
Retire samples (1000 runs)
  Histogram: floating add retires (x-axis: ProfileMe retire samples, 15200 to 16200; y-axis: frequency; markers at -2sd, -sd, +sd, +2sd)
  Statistics reported: minimum sample, maximum sample, average, standard deviation, coefficient of variation
  % Error: ±1.552%
15
BB frequency estimates (1000 runs)
  Histogram: floating add basic block frequency estimate (x-axis: basic block frequency estimate, 15600 to 15900; y-axis: frequency; markers at -2sd, -sd, +sd, +2sd)
  Statistics reported: minimum sample, maximum sample, average, standard deviation, coefficient of variation
  % Error: ±0.454%
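The descriptive statistics named on these two slides (average, standard deviation, coefficient of variation) can be computed over the per-run sample counts roughly as follows. This is a sketch under assumptions: the names are hypothetical, the sample (n - 1) form of the standard deviation is assumed, and the slides do not say how their "% Error" figure was derived, so it is not computed here.

```c
#include <assert.h>
#include <stddef.h>

struct sample_stats {
    double mean;       /* average over all runs */
    double stddev;     /* sample standard deviation (n - 1 denominator) */
    double coeff_var;  /* stddev / mean */
};

/* Newton iteration, used only to keep this sketch free of libm;
 * sqrt() from <math.h> would normally be used instead. */
static double newton_sqrt(double v)
{
    double x = v > 1.0 ? v : 1.0;  /* start at or above the root */
    if (v <= 0.0)
        return 0.0;
    for (int i = 0; i < 60; i++)
        x = 0.5 * (x + v / x);
    return x;
}

/* Descriptive statistics for n per-run counts (n must be at least 2,
 * and the mean must be nonzero for coeff_var to be meaningful). */
struct sample_stats describe(const double *x, size_t n)
{
    struct sample_stats st;
    double sum = 0.0, sq = 0.0;

    assert(x != NULL && n > 1);
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    st.mean = sum / (double)n;
    for (size_t i = 0; i < n; i++)
        sq += (x[i] - st.mean) * (x[i] - st.mean);
    st.stddev = newton_sqrt(sq / (double)(n - 1));
    st.coeff_var = st.stddev / st.mean;
    return st;
}
```

Passing the 1000 per-run retire sample counts for one instruction to `describe` would yield the kind of summary shown under each histogram.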
16
Improving precision
  Square root law: quadrupling the sample size doubles the precision of the estimate
  Practical techniques for increasing sample size:
    Execute the test for a longer period of time
    Aggregate the results of multiple program runs
    Increase the sampling rate
  DCPI facilitates all three techniques:
    It supports analysis of very large, long-running programs
    It automatically aggregates multiple runs
    It supports higher, selectable sampling rates
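The square root law can be turned around to estimate how many more samples a target precision needs. The sketch below assumes the relative error is proportional to 1/sqrt(samples); the function name is hypothetical.

```c
#include <assert.h>

/*
 * Factor by which the sample count must grow to move from `current_error`
 * to `target_error`, assuming error is proportional to 1 / sqrt(samples).
 */
double sample_growth_factor(double current_error, double target_error)
{
    double ratio;

    assert(current_error > 0.0 && target_error > 0.0);
    ratio = current_error / target_error;
    return ratio * ratio;
}
```

For example, halving an error of ±1.552% to ±0.776% gives a factor of 4, which could be reached by running four times as long, aggregating four runs, or sampling four times as often.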
17
Conclusions
  DCPI: a practical tool for program analysis
    An accepted, production-ready tool
    Transparent, low-overhead, comprehensive
    Pinpoints performance issues at the instruction and source language statement levels
  Experimental results
    ProfileMe mitigates the effects of smear and skew that are present in conventional sampling on out-of-order machines
    Precision as measured in the experiment:
      Raw samples: ±1.552%
      Basic block frequency estimation: ±0.454%
    Basic block frequency analysis substantially improves precision
  DCPI and ProfileMe technology can be applied to architectures other than Alpha, such as IA-32 and IPF
18
HP Continuous Profiling Infrastructure
  Offered as an “Advanced Development Kit” on Tru64 UNIX
  Requires agreeing to an on-line field test agreement
  Current version is 3.9.6
  Contact:
  URL:
20
Main components and their roles
  Performance counters: monitor and count CPU events
  Device driver: collects samples and performs first-level data aggregation
  Daemon: image correlation and second-level aggregation
  Database: stores profile data by epoch, host, image
  Tools: access, analyze and present profile information
  Flow of profile data: Performance counters -> Device driver -> Daemon -> Database -> Tools
21
ProfileMe instruction information
  Program counter
  Instruction was a regular/PAL instruction
  Pipeline trap occurred
  Misprediction trap occurred
  Load-store order trap occurred
  Pipeline trap type
  Instruction was not yet prefetched
  Instruction was killed before register mapping
  Instruction stalled before register mapping
  Instruction retired without trapping (valid)
  Branch was taken
  Auxiliary counts: retire delay, retires in profiling window
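One way to picture how such per-instruction records might be folded into per-PC counts is sketched below. Everything here is hypothetical: the flag names, the record layout, and the rule that any record without the retired flag counts as an abort are illustrative assumptions, not the 21264A register format or DCPI's data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for a few of the boolean items listed above. */
enum pm_flags {
    PM_PAL_MODE     = 1u << 0,  /* instruction was a PAL instruction */
    PM_TRAPPED      = 1u << 1,  /* pipeline trap occurred */
    PM_MISPREDICT   = 1u << 2,  /* misprediction trap occurred */
    PM_LDST_ORDER   = 1u << 3,  /* load-store order trap occurred */
    PM_RETIRED      = 1u << 4,  /* retired without trapping (valid) */
    PM_BRANCH_TAKEN = 1u << 5   /* branch was taken */
};

/* Hypothetical record captured for one sampled instruction. */
struct pm_record {
    uint64_t pc;            /* program counter of the sampled instruction */
    uint32_t flags;         /* bitwise OR of pm_flags */
    uint32_t retire_delay;  /* auxiliary count */
};

/* Aggregated counts for one program counter value. */
struct pm_counts {
    uint64_t pc;
    uint64_t retired;
    uint64_t aborted;
    uint64_t mispredicts;
};

/*
 * Fold one record into a small per-PC table using a linear search.
 * Returns 0 on success, -1 if the PC is new and the table is full.
 */
int pm_aggregate(struct pm_counts *table, size_t *used, size_t capacity,
                 const struct pm_record *r)
{
    size_t i;

    assert(table != NULL && used != NULL && r != NULL && *used <= capacity);
    for (i = 0; i < *used; i++)
        if (table[i].pc == r->pc)
            break;
    if (i == *used) {
        if (*used == capacity)
            return -1;
        table[i].pc = r->pc;
        table[i].retired = 0;
        table[i].aborted = 0;
        table[i].mispredicts = 0;
        (*used)++;
    }
    if (r->flags & PM_RETIRED)
        table[i].retired++;
    else
        table[i].aborted++;  /* assumption: not retired means aborted */
    if (r->flags & PM_MISPREDICT)
        table[i].mispredicts++;
    return 0;
}
```

Because the program counter travels with the record, each count lands on the instruction that actually experienced the event, with no need to guess from the interrupted PC.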
22
Image-by-image overview
  Many aborted instructions due to DTB misses
  Columns: Retire samples, Abort samples, DTBmiss samples, Image
  Images listed:
    /dsk0h/dcpidb/PALcode
    flops
    /vmunix
    /usr/shlib/libc.so
    /usr/bin/dcpid
    /sbin/loader
23
DTB misses
  DTB miss summary for inner loop of flops program
  Columns: Retire samples, Abort samples, DTBmiss samples, Freq, Address, Instruction
  Inner loop instructions:
    lds $f10, 0(a4)
    lds $f11, 0(a5)
    addl a3, 0x1, a3
    lda a4, 4(a4)
    lda t10, -1000(a3)
    lda a5, 4000(a5)
    muls $f10,$f11,$f10
    adds $f1,$f10,$f1
    sts $f1, 0(a1)
    blt t10, 0x
24
Dynamic Access to DCPI Data (DADD)
  Provides dynamic, runtime access to performance data
  Client/server relationship
    Application (client) registers interest with the DCPI daemon (server)
    Daemon serves data to the application via a shared memory region
  “Virtual counter” API
    Daemon summarizes profile information into event counts
    Event counts are written periodically into the shared memory region
  DADD provides virtual counters to the PAPI implementation
  Status: Experimental prototype under development
  Flow of profile/performance data: Performance counters -> Device driver -> Daemon (server) -> Application (client)