Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization Alex Shye M.S. Thesis Defense Committee: Daniel A. Connors,

1 Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization Alex Shye M.S. Thesis Defense Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani University of Colorado at Boulder Department of Electrical and Computer Engineering DRACO Architecture Research Group

2 Thesis Statement Hardware Performance Monitoring (HPM) can be utilized to provide a low-overhead alternative to current techniques for profiling run-time code behavior.

3 Introduction Profile information is critical to the success of profile-based optimizations –Point Profile - BB count, edge profile, etc. –Path Profile - correlated points Off-line Path Profiling Methods: –Use static/dynamic instrumentation to gather full path profile On-line Path Profiling Method: –Interpretation and MRET Both incur high overhead!! –Slowdown of 2-3x with Pin for BB counting [Figure: example CFG with blocks A-G and edge weights 80/20 and 70/30. Edge Profile: ABDFG 70-50. Path Profile: ABDFG 60, ACDFG 10, …]

4 Performance Monitoring HPM through on-chip Performance Monitoring Units (PMUs) –Itanium, Pentium 4, PowerPC –Coarse-grained, fine-grained features Obstacles to PMU profiling –Non-deterministic (sampling) –Sample aliasing –Less information Compiler analysis can extend PMU information!!!
Itanium-2 PMU Features
Feature | Description
Event Counters | Counts of coarse-grained events, e.g. CPU cycles, flushes, etc.
Branch Trace Buffer (BTB) | Records branch vector of last 4 branches executed. Filters: T/NT, predicted correct/mispredicted, etc.
Instruction Event Address Registers (IEAR) | Sample Icache/ITLB misses. Addresses and latency
Data Event Address Registers (DEAR) | Sample Dcache, DTLB, ALAT misses. Addresses and latency
Goal: Use sampled branch vectors on PMU to derive a path profile comparable to software path profiling techniques.

5 Contributions I. Characterize the information provided by PMU sampling of branch vectors II. Characterize the effect of compiler analysis on PMU information III. Demonstrate the construction of a PMU-based path profiler

6 PMU Profiling Framework [Diagram components. Online: perfmon interface, kernel buffer, interrupt on kernel buffer overflow, branch vector hash table, intermediate file. Offline compiler analysis: PMU branch vectors, address map, annotated binary, partial paths, partial path extensions, dominator analysis, path profile generation, profile information.] Terminology Branch Vector: Series of addresses from BTB Partial Path: Path of ops in compiler IR

7 PMU Configuration Itanium-2 PMU BTB masks –Taken Mask (All, T, NT, None) –Predicted Target Address Mask (All, Correct, Incorrect, None) –Predicted Predicate Mask (All, Correct, Incorrect, None) –Branch Type Mask (All, Indirect, Return, IP-relative) Configuration depends on goal –Branch prediction performance? Building call graph? PMU configured to sample only taken branches for path information –Not taken branches can be inferred in control flow graph
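The inference of not-taken branches from taken-branch samples can be illustrated with a minimal sketch. It assumes each sampled entry is a (source address, target address) pair for a taken branch and that basic blocks are laid out contiguously in address order, so everything between one branch's target and the next branch's source was reached by fall-through. The names `block_of`, `partial_path`, and the addresses used are hypothetical and not taken from the thesis tool.

```python
from bisect import bisect_right


def block_of(block_starts, addr):
    """Index of the basic block containing addr (block_starts is sorted)."""
    idx = bisect_right(block_starts, addr) - 1
    if idx < 0:
        raise ValueError("address precedes the first basic block")
    return idx


def partial_path(block_starts, branch_vector):
    """Rebuild the executed block sequence from taken branches only.

    branch_vector: list of (source_addr, target_addr) pairs, oldest first.
    Blocks between a target and the next source are assumed to have been
    reached by not-taken (fall-through) control flow.
    """
    if not branch_vector:
        return []
    path = [block_of(block_starts, branch_vector[0][0])]
    for (_, tgt), (next_src, _) in zip(branch_vector, branch_vector[1:]):
        first = block_of(block_starts, tgt)
        last = block_of(block_starts, next_src)
        if last < first:
            raise ValueError("inconsistent branch vector")
        path.extend(range(first, last + 1))
    path.append(block_of(block_starts, branch_vector[-1][1]))
    return path


# Five blocks starting at these (made-up) addresses.
starts = [0x100, 0x110, 0x120, 0x130, 0x140]
# Taken branch from block 0 to block 2, then from block 3 back to block 0.
vector = [(0x108, 0x120), (0x134, 0x100)]
print(partial_path(starts, vector))  # block 3 is reached by fall-through
```

In this example block 1 is skipped by the first taken branch, while block 3 appears in the path even though no sample names it as a target.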

8 Partial Path Extensions Compiler view of CFG can be used to extend paths Extend until point of uncertainty –Up until Join Point –Down until Branch Point [Figure: CFG showing the partial path from BTB branch vector 1-2-3-4 and the extended partial path, bounded above by a join point and below by a branch point]
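A minimal sketch of the extension rule follows, assuming the CFG is given as a successor map with no duplicate edges. Going up, a block with exactly one predecessor must have been preceded by it, so the walk stops at a join point; going down, a block with exactly one successor must be followed by it, so the walk stops at a branch point. The function name, the CFG, and the guard against revisiting blocks are illustrative assumptions rather than the thesis implementation.

```python
def extend_partial_path(succs, path):
    """Extend a partial path up to a join point and down to a branch point.

    succs: dict mapping each block to its list of successor blocks.
    path:  non-empty list of blocks taken from a branch vector.
    """
    preds = {}
    for block, targets in succs.items():
        preds.setdefault(block, [])
        for target in targets:
            preds.setdefault(target, []).append(block)

    extended = list(path)
    seen = set(extended)  # guard so a cycle cannot extend forever

    # Upward: stop when the head has zero or several predecessors.
    while len(preds.get(extended[0], [])) == 1:
        pred = preds[extended[0]][0]
        if pred in seen:
            break
        extended.insert(0, pred)
        seen.add(pred)

    # Downward: stop when the tail has zero or several successors.
    while len(succs.get(extended[-1], [])) == 1:
        succ = succs[extended[-1]][0]
        if succ in seen:
            break
        extended.append(succ)
        seen.add(succ)

    return extended


cfg = {
    "S": ["A"], "T": ["A"],      # A is a join point
    "A": ["B"],
    "B": ["C", "X"],
    "C": ["D"], "X": ["D"],
    "D": ["E"],
    "E": ["F", "G"],             # E is a branch point
    "F": [], "G": [],
}
print(extend_partial_path(cfg, ["B", "C"]))  # ['A', 'B', 'C', 'D', 'E']
```

The upward walk adds A and then stops because A has two predecessors; the downward walk adds D and E and then stops because E has two successors.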

9 Dominator Analysis –Finds all blocks guaranteed to execute Partial Path Extensions –Subset of dominator analysis –Constrained to a path [Figure: CFG showing the partial path from BTB branch vector 1-2-3-4, with join point and branch point marked, and the basic blocks added with dominator analysis] Terminology Dominator: u dominates v if all paths from Entry to v include u Post-dominator: u post-dominates v if all paths from v to Exit include u
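The two definitions can be made concrete with the textbook iterative data-flow algorithm, sketched here under the assumption of a single-entry, single-exit CFG in which execution runs from Entry to Exit. Post-dominators are obtained by running the same routine on the reversed graph. The helper `guaranteed_blocks` and the diamond-shaped CFG are hypothetical illustrations; the slides do not say which dominator algorithm the compiler uses.

```python
def reverse_cfg(succs):
    """Return the CFG with every edge reversed."""
    rev = {node: [] for node in succs}
    for node, targets in succs.items():
        for target in targets:
            rev.setdefault(target, []).append(node)
    return rev


def dominators(succs, entry):
    """Map each block to the set of blocks that dominate it."""
    nodes = set(succs) | {t for ts in succs.values() for t in ts}
    preds = {n: set() for n in nodes}
    for node, targets in succs.items():
        for target in targets:
            preds[target].add(node)

    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}  # unreachable from entry
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def guaranteed_blocks(succs, entry, exit_block, observed):
    """Blocks that must execute whenever `observed` executes."""
    dom = dominators(succs, entry)
    pdom = dominators(reverse_cfg(succs), exit_block)
    return dom[observed] | pdom[observed]


diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(sorted(guaranteed_blocks(diamond, "A", "E", "B")))  # ['A', 'B', 'D', 'E']
```

A single sample of block B therefore also credits A, D, and E as covered, but not C, which lies on the other side of the diamond.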

10 Path Profile Generation Combine compiler analysis and PMU branch vectors to generate a path profile comparable to software path profiling techniques Issues: –Path of a branch vector inherently different Random start and end of path - path ambiguity Spans boundaries compiler-based paths do not –Number of paths increases exponentially Must map PMU paths to compiler paths –Region Formation –Split partial paths –Path Matching –Path Crediting [Figure: hot path compared with a BTB trace]

11 Region Formation Use region-based paths –Makes total # paths more manageable Functions can be large Create loop-based regions –Programs spend most of time in loops Rules for Region R: –R must be single entry –R may not cross function boundaries –R may not cross loop boundaries [Figure: CFG of blocks A-Y partitioned into Region 1, Region 2, and Region 3]

12 Path Matching and Crediting Path Matching –Find list of all paths that contain partial path Path Crediting –Distribute partial path weight equally among matched paths Ex. ABDLMOP, ABDEFHIK, OPRSUVX
Partial Path | Count | Matches | Inc | Total
ABDLMOP | 100 | ABDLMOPRSUVX ABDLMOPRSUWX ABDLMOPRSUVX ABDLMOPRSUWX | +25 | 25
ABD | 160 | ABDLMOPRSUVX …(14 more) ABDLNOQRTUWX | +10 … +10 | 35 10
EFHIK | 160 | EFHIK | +160 | 160
OPRSUVX | 280 | ABDLMOPRSUVX ABDLNOPRSUVX ACDLMOPRSUVX ACDLNOPRSUVX | +70 | 105 80 70
[Figure: CFG of blocks A-Y partitioned into Region 1, Region 2, and Region 3]
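Matching and crediting can be sketched in a few lines, assuming each basic block is a single letter so that a path is a string and "contains" means contiguous substring. Each partial path's count is divided equally among the region paths that contain it, which is the same arithmetic as a count of 100 split over four matches giving +25 each. The function name and the small path set are hypothetical and chosen only to keep the numbers easy to check.

```python
from collections import defaultdict


def credit_paths(region_paths, partial_counts):
    """Distribute each partial path's count equally over matching paths.

    region_paths:   list of full paths, one string of block letters each.
    partial_counts: dict mapping a partial path string to its sample count.
    Returns a dict of accumulated credit per full path.
    """
    totals = defaultdict(float)
    for partial, count in partial_counts.items():
        matches = [path for path in region_paths if partial in path]
        if not matches:
            continue  # partial path fits no known path; it earns no credit
        share = count / len(matches)
        for path in matches:
            totals[path] += share
    return dict(totals)


paths = ["ABDG", "ACDG", "ABEG"]
samples = {"AB": 100, "DG": 60, "ABE": 30}
print(credit_paths(paths, samples))
# {'ABDG': 80.0, 'ABEG': 80.0, 'ACDG': 30.0}
```

Equal division is the simplest policy when a partial path is ambiguous; the total credit handed out always equals the total count of the partial paths that matched at least one path.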

13 Methodology Experiments run on Itanium-2 with 2.6.10 kernel Developed tool using perfmon kernel interface and libpfm-3.1 to interface with PMU Benchmarks –Set of SPEC2000 benchmarks –Compiled with the OpenIMPACT Research Compiler Compared to full path profile gathered with a Pin path profiling tool

14 Effect of Sampling Period Sampling Overhead due to: –Periodic interrupt, copying between buffers, hash table insertion

15 PMU vs Actual Instruction Distribution Kullback-Leibler Divergence (Relative Entropy) –$d = \sum_{k=0} p_k \log_2 (p_k / q_k)$ Relative measure of distance between two distributions
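As a worked illustration of the divergence, the sketch below normalizes two count vectors into distributions and evaluates the sum in bits. Which distribution plays the role of p and which of q (PMU samples versus the actual instruction distribution) is not stated on the slide, so the assignment here is an assumption, as are the function names and sample values. Terms with p_k = 0 contribute nothing, and the result is infinite if q_k = 0 where p_k > 0.

```python
import math


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q, in bits."""
    d = 0.0
    for pk, qk in zip(p, q):
        if pk == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qk == 0:
            return math.inf  # p has mass where q has none
        d += pk * math.log2(pk / qk)
    return d


actual = normalize([50, 50])    # hypothetical per-instruction execution counts
sampled = normalize([25, 75])   # hypothetical PMU-derived counts
print(round(kl_divergence(actual, sampled), 4))  # 0.2075
print(kl_divergence(actual, actual))             # 0.0
```

The measure is not symmetric, so swapping the two arguments generally gives a different value.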

16 Code Coverage Explore how PMU branch vectors translate to code coverage information Code Coverage Types –Single BB: Simulates PC-sampling –Branch Vectors –Branch Vectors w/ Dom. Analysis Coverage percentage is percent of actually covered code discovered with compiler-aided analysis of branch vectors
Number of Instructions and Actual Code Covered
Benchmark | # Ops | # Covered Ops
164.gzip | 6,466 | 3,063 (47%)
175.vpr | 23,573 | 12,229 (52%)
177.mesa | 89,006 | 7,390 (8%)
179.art | 2,201 | 1,515 (69%)
181.mcf | 1,973 | 1,401 (71%)
183.equake | 3,033 | 2,265 (75%)
188.ammp | 19,562 | 5,835 (30%)
197.parser | 17,541 | 11,271 (64%)
256.bzip2 | 5,095 | 3,138 (62%)
300.twolf | 40,490 | 15,705 (39%)

17 Code Coverage

18 Hot Instruction Thresholds For top 10-30% of instructions, code coverage does well (80-100%) Drops off at around 40-50% of hot instructions

19 Stability Across 20 runs, PMU code coverage varies ~5-10%

20 Multiple Runs Regular Sampling: 1) gzip, parser, twolf improve greatly Randomized Sampling may discover code regular sampling cannot

21 Partial Path Characteristics Partial Path extensions increase length ~20% However, splitting drastically decreases lengths –~30% on function boundaries, ~20% more on loop back edges

22 Accuracy Results Accuracy measured similar to Wall’s weight matching scheme [Wall91] –Threshold = 0.125%

23 Conclusion Motivates and presents initial results and rationale for PMU-based profiling Characterizes branch vector sampling –Improves code coverage > 50% over PC-sampling –Branch vector paths are inter-procedural Characterizes effect of compiler analysis –Partial path extensions increase length by ~20% –Dominator analysis on branch vectors improves code coverage > 50% Demonstrates construction of a PMU-based path profiler –~85% accurate at 1% overhead (at sampling period of 5M) Questions?