Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization Alex Shye M.S. Thesis Defense Committee: Daniel A. Connors,

Presentation transcript:

Exploring the Potential of Performance Monitoring Hardware to Support Run-time Optimization Alex Shye M.S. Thesis Defense Committee: Daniel A. Connors, Andrew R. Pleszkun, and Manish Vachharajani University of Colorado at Boulder Department of Electrical and Computer Engineering DRACO Architecture Research Group

Thesis Statement Hardware Performance Monitoring (HPM) can be utilized to provide a low-overhead alternative to current techniques for profiling run-time code behavior.

Introduction
Profile information is critical to the success of profile-based optimizations
– Point Profile: BB count, edge profile, etc.
– Path Profile: correlated points
Off-line Path Profiling Methods:
– Use static/dynamic instrumentation to gather a full path profile
On-line Path Profiling Method:
– Interpretation and MRET
Both incur high overhead!
– Slowdown of 2-3x with Pin for BB counting
[Figure: example CFG with blocks A, B, C, D, E, F, G. Edge Profile: ABDFG. Path Profile: ABDFG 60, ACDFG 10, …]

Performance Monitoring
HPM through on-chip Performance Monitoring Units (PMUs)
– Itanium, Pentium 4, PowerPC
– Coarse-grained, fine-grained features
Obstacles to PMU profiling
– Non-deterministic (sampling)
– Sample aliasing
– Less information
Compiler analysis can extend PMU information!

Itanium-2 PMU Features

| Feature | Description |
|---|---|
| Event Counters | Counts of coarse-grained events, e.g. CPU cycles, flushes, etc. |
| Branch Trace Buffer (BTB) | Records branch vector of last 4 branches executed. Filters: T/NT, predicted correct/mispredicted, etc. |
| Instruction Event Address Registers (IEAR) | Sample Icache/ITLB misses. Addresses and latency |
| Data Event Address Registers (DEAR) | Sample Dcache, DTLB, ALAT misses. Addresses and latency |

Goal: Use sampled branch vectors on PMU to derive a path profile comparable to software path profiling techniques.

Contributions
I. Characterize the information provided by PMU sampling of branch vectors
II. Characterize the effect of compiler analysis on PMU information
III. Demonstrate the construction of a PMU-based path profiler

PMU Profiling Framework
[Figure: framework diagram. Online components: PMU, perfmon interface, Kernel Buffer (interrupt on kernel buffer overflow), Branch Vector Hash Table, Intermediate File. Offline Compiler Analysis components: Branch Vectors, Address Map, Annotated Binary, Partial Paths, Partial Path Extensions, Dominator Analysis, Path Profile Generation, Profile Information]
Terminology
– Branch Vector: Series of addresses from BTB
– Partial Path: Path of ops in compiler IR

PMU Configuration
Itanium-2 PMU BTB masks
– Taken Mask (All, T, NT, None)
– Predicted Target Address Mask (All, Correct, Incorrect, None)
– Predicted Predicate Mask (All, Correct, Incorrect, None)
– Branch Type Mask (All, Indirect, Return, IP-relative)
Configuration depends on goal
– Branch prediction performance? Building call graph?
PMU configured to sample only taken branches for path information
– Not-taken branches can be inferred in control flow graph
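The last point, that not-taken branches can be inferred from the control flow graph, can be illustrated with a minimal sketch. It assumes a block-level view in which each BTB entry has already been mapped to a (source block, target block) pair and in which `layout` lists the blocks of one function in address order. The function name `branch_vector_to_blocks` and this representation are hypothetical and are not taken from the thesis tool.

```python
def branch_vector_to_blocks(branch_vector, layout):
    """Expand a vector of taken branches into the sequence of blocks executed.

    branch_vector: list of (source_block, target_block) pairs, oldest first
                   (hypothetical block-level form of a BTB sample).
    layout:        blocks of one function in address order (assumption).
    Between one taken branch's target and the next taken branch's source,
    every branch was not taken, so execution fell through in layout order.
    """
    index = {block: i for i, block in enumerate(layout)}
    path = [branch_vector[0][0]]  # source of the oldest taken branch
    for (_, dst), (next_src, _) in zip(branch_vector, branch_vector[1:]):
        i, j = index[dst], index[next_src]
        if j < i:
            raise ValueError("no fall-through path between the two samples")
        path.extend(layout[i:j + 1])  # fall-through blocks, ends included
    path.append(branch_vector[-1][1])  # target of the newest taken branch
    return path


# Diamond-shaped example: the taken branches B->D and E->G imply that the
# branch at the end of D was not taken (D fell through to E).
layout = ["A", "B", "C", "D", "E", "F", "G"]
print(branch_vector_to_blocks([("B", "D"), ("E", "G")], layout))
# prints ['B', 'D', 'E', 'G']
```

A real implementation would work on instruction addresses and the compiler's address map instead of block names.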

Partial Path Extensions
Compiler view of CFG can be used to extend paths
Extend until point of uncertainty
– Up until Join Point
– Down until Branch Point
[Figure: CFG showing a BTB Branch Vector, the Partial Path from Branch Vector, and the Extended Partial Path, bounded by a Join Point above and a Branch Point below]
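The extension rule can be sketched as follows, assuming the CFG is given as a dictionary from each block to its list of successors. The function name `extend_partial_path` and the block-name representation are hypothetical. Upward extension stops at a join point (a block with more than one predecessor), and downward extension stops at a branch point (a block with more than one successor).

```python
def extend_partial_path(path, succs):
    """Extend a partial path through blocks whose execution is certain.

    path:  list of block names from a branch vector (hypothetical format).
    succs: dict mapping every block to its list of successor blocks.
    """
    # Build predecessor lists from the successor map.
    preds = {block: [] for block in succs}
    for block, targets in succs.items():
        for target in targets:
            preds.setdefault(target, []).append(block)

    extended = list(path)
    seen = set(extended)

    # Up: a block with exactly one predecessor must have been reached from it.
    while len(preds.get(extended[0], [])) == 1:
        pred = preds[extended[0]][0]
        if pred in seen:  # guard against walking around a loop forever
            break
        extended.insert(0, pred)
        seen.add(pred)

    # Down: a block with exactly one successor must continue into it.
    while len(succs.get(extended[-1], [])) == 1:
        succ = succs[extended[-1]][0]
        if succ in seen:
            break
        extended.append(succ)
        seen.add(succ)

    return extended


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
print(extend_partial_path(["B"], cfg))  # prints ['A', 'B', 'D']
print(extend_partial_path(["E"], cfg))  # prints ['D', 'E', 'G']
```

In the first call the path stops at D because D is a branch point. In the second it stops at D on the way up because D is a join point.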

Dominator Analysis
– Finds all blocks guaranteed to execute
Partial Path Extensions
– Subset of dominator analysis
– Constrained to a path
[Figure: CFG showing a BTB Branch Vector, the Partial Path from Branch Vector, and the Basic Blocks added with Dom. Analysis, with the Join Point and Branch Point marked]
Terminology
– Dominator: u dominates v if all paths from Entry to v include u
– Post-dominator: u post-dominates v if all paths from v to Exit include u
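The two definitions above can be turned into a small sketch using the standard iterative data-flow formulation of dominators. It assumes the CFG is a successor dictionary with a single entry and a single exit, and that every block is reachable. The helper names (`dominators`, `reverse_cfg`, `guaranteed_blocks`) are hypothetical and do not come from the thesis tool.

```python
def dominators(succs, entry):
    """Return {block: set of blocks that dominate it} by fixed-point iteration."""
    nodes = set(succs)
    for targets in succs.values():
        nodes.update(targets)
    preds = {n: set() for n in nodes}
    for block, targets in succs.items():
        for target in targets:
            preds[target].add(block)

    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            else:
                new = {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse_cfg(succs):
    """Flip every edge, so dominators of the result are post-dominators."""
    rev = {n: [] for n in succs}
    for block, targets in succs.items():
        for target in targets:
            rev.setdefault(target, []).append(block)
    return rev


def guaranteed_blocks(succs, entry, exit_block, sampled):
    """Blocks that must execute whenever `sampled` executes."""
    dom = dominators(succs, entry)
    pdom = dominators(reverse_cfg(succs), exit_block)
    return dom[sampled] | pdom[sampled]


cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
print(sorted(guaranteed_blocks(cfg, "A", "G", "E")))  # prints ['A', 'D', 'E', 'G']
```

Seeing block E in a sample is enough to mark A, D, and G as covered, even though none of them appears in the sample itself.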

Path Profile Generation
Combine compiler analysis and PMU branch vectors to generate a path profile comparable to software path profiling techniques
Issues:
– Path of a branch vector inherently different
   Random start and end of path (path ambiguity)
   Spans boundaries compiler-based paths do not
– Number of paths increases exponentially
Must map PMU paths to compiler paths
– Region Formation
– Split partial paths
– Path Matching
– Path Crediting
[Figure: Hot Path compared with a BTB Trace]

Region Formation
Use region-based paths
– Makes total # paths more manageable
   Functions can be large
Create loop-based regions
– Programs spend most of time in loops
Rules for Region R:
– R must be single entry
– R may not cross function boundaries
– R may not cross loop boundaries
[Figure: CFG with blocks A through Y partitioned into Region 1, Region 2, and Region 3]

Path Matching and Crediting
Path Matching
– Find list of all paths that contain partial path
Path Crediting
– Distribute partial path weight equally among matched paths
Ex. ABDLMOP, ABDEFHIK, OPRSUVX

| Partial Path | Count | Matches | Inc | Total |
|---|---|---|---|---|
| ABDLMOP | 100 | ABDLMOPRSUVX, ABDLMOPRSUWX | | |
| ABD | 160 | ABDLMOPRSUVX, …(14 more), ABDLNOQRTUWX | +10 | … |
| EFHIK | 160 | EFHIK | | |
| OPRSUVX | 280 | ABDLMOPRSUVX, ABDLNOPRSUVX, ACDLMOPRSUVX, ACDLNOPRSUVX | | |

[Figure: CFG with blocks A through Y partitioned into Region 1, Region 2, and Region 3]
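Matching and crediting as described above can be sketched in a few lines, assuming each path is a string of single-letter block names and that "contains" means the partial path appears as a contiguous run of blocks. The function names and the sample counts below are hypothetical and are chosen only to show the equal split.

```python
from collections import defaultdict


def contains(full_path, partial_path):
    """True if partial_path occurs as a contiguous run inside full_path."""
    n, m = len(full_path), len(partial_path)
    return any(tuple(full_path[i:i + m]) == tuple(partial_path)
               for i in range(n - m + 1))


def credit_paths(full_paths, partial_counts):
    """Distribute each partial path's count equally among its matched paths."""
    profile = defaultdict(float)
    for partial, count in partial_counts.items():
        matches = [p for p in full_paths if contains(p, partial)]
        if not matches:
            continue  # a sample that matches no region path is dropped
        share = count / len(matches)
        for path in matches:
            profile[path] += share
    return dict(profile)


# Hypothetical region paths and sample counts on a diamond-shaped CFG.
full_paths = ["ABDFG", "ACDFG", "ABDEG", "ACDEG"]
partial_counts = {"ABD": 160, "DFG": 100}
print(credit_paths(full_paths, partial_counts))
# prints {'ABDFG': 130.0, 'ABDEG': 80.0, 'ACDFG': 50.0}
```

Here ABD matches two paths and credits each with 80, DFG matches two paths and credits each with 50, so ABDFG accumulates 130. The same rule gives the +10 increment in the slide's ABD row, where a count of 160 is split across 16 matched paths.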

Methodology
Experiments run on Itanium-2 with kernel
Developed tool using perfmon kernel interface and libpfm-3.1 to interface with PMU
Benchmarks
– Set of SPEC2000 benchmarks
– Compiled with the OpenIMPACT Research Compiler
Compared to full path profile gathered with a Pin path profiling tool

Effect of Sampling Period Sampling Overhead due to: –Periodic interrupt, copying between buffers, hash table insertion

PMU vs Actual Instruction Distribution
Kullback-Leibler Divergence (Relative Entropy)
– $d = \sum_{k=0} p_k \log_2 (p_k / q_k)$
Relative measure of distance between two distributions
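As a worked illustration of the formula, the sketch below computes the divergence between two instruction-frequency distributions stored as dictionaries. The slide does not say which distribution plays the role of p and which of q, so the argument order here is an assumption, as are the function names and the sample counts.

```python
import math


def normalize(counts):
    """Turn raw per-instruction counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def kl_divergence(p, q):
    """d = sum_k p_k * log2(p_k / q_k), in bits.

    Terms with p_k == 0 contribute nothing. If q_k == 0 where p_k > 0 the
    divergence is infinite, which is returned as math.inf.
    """
    d = 0.0
    for key, pk in p.items():
        if pk == 0:
            continue
        qk = q.get(key, 0.0)
        if qk == 0:
            return math.inf
        d += pk * math.log2(pk / qk)
    return d


# Hypothetical counts for two instructions under two profiles.
p = normalize({"i1": 50, "i2": 50})
q = normalize({"i1": 25, "i2": 75})
print(round(kl_divergence(p, q), 4))  # prints 0.2075
print(kl_divergence(p, p))            # prints 0.0
```

The measure is not symmetric, so swapping the two arguments generally gives a different value.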

Code Coverage
Explore how PMU branch vectors translate to code coverage information
Code Coverage Types
– Single BB: Simulates PC-sampling
– Branch Vectors
– Branch Vectors w/ Dom. Analysis
Coverage percentage is percent of actually covered code discovered with compiler-aided analysis of branch vectors

Number of Instructions and Actual Code Covered

| Benchmark | # Ops | # Covered Ops |
|---|---|---|
| 164.gzip | 6,466 | 3,063 (47%) |
| 175.vpr | 23,573 | 12,229 (52%) |
| 177.mesa | 89,006 | 7,390 (8%) |
| 179.art | 2,201 | 1,515 (69%) |
| 181.mcf | 1,973 | 1,401 (71%) |
| 183.equake | 3,033 | 2,265 (75%) |
| 188.ammp | 19,562 | 5,835 (30%) |
| 197.parser | 17,541 | 11,271 (64%) |
| 256.bzip2 | 5,095 | 3,138 (62%) |
| 300.twolf | 40,490 | 15,705 (39%) |

Code Coverage

Hot Instruction Thresholds For top 10-30% of instructions, code coverage does well (80-100%) Drops off at around 40-50% of hot instructions

Stability Across 20 runs, PMU code coverage varies ~5-10%

Multiple Runs Regular Sampling: 1) gzip, parser, twolf improve greatly Randomized Sampling may discover code regular sampling cannot

Partial Path Characteristics
Partial path extensions increase length ~20%
However, splitting drastically decreases lengths
– ~30% on function boundaries, ~20% more on loop back edges

Accuracy Results
Accuracy measured similar to Wall’s weight matching scheme [Wall91]
– Threshold = 0.125%
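The slide does not spell out the accuracy formula, so the sketch below shows one common formulation of weight matching, offered as an assumption and not as the thesis's exact metric. Hot paths are those whose actual frequency is at least the threshold fraction of total flow. The estimated profile nominates the same number of paths, and the score is the share of actual hot-path weight those nominees cover. The function name and the sample profiles are hypothetical.

```python
def weight_matching_accuracy(actual, estimated, threshold=0.00125):
    """Score an estimated path profile against the actual one.

    actual, estimated: dicts mapping path -> frequency.
    threshold: hot-path cutoff as a fraction of total actual flow
               (0.00125 corresponds to the slide's 0.125%).
    """
    total = sum(actual.values())
    hot_actual = [p for p, freq in actual.items() if freq >= threshold * total]
    if not hot_actual:
        return 1.0  # nothing to match

    # The estimate picks as many paths as there are actual hot paths.
    ranked = sorted(estimated, key=estimated.get, reverse=True)
    hot_estimated = ranked[:len(hot_actual)]

    covered = sum(actual.get(p, 0) for p in hot_estimated)
    return covered / sum(actual[p] for p in hot_actual)


# Hypothetical profiles: the estimate wrongly ranks P4 above P2 and P3.
actual = {"P1": 60, "P2": 30, "P3": 10}
estimated = {"P1": 50, "P4": 30, "P2": 5, "P3": 1}
print(weight_matching_accuracy(actual, estimated))  # prints 0.9
```

In this example the estimate selects P1, P4, and P2, which together carry 90 of the 100 units of actual hot-path weight.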

Conclusion
Motivates and presents initial results and rationale for PMU-based profiling
Characterizes branch vector sampling
– Improves code coverage > 50% over PC-sampling
– Branch vector paths are inter-procedural
Characterizes effect of compiler analysis
– Partial path extensions increase length by ~20%
– Dominator analysis on branch vectors improves code coverage > 50%
Demonstrates construction of a PMU-based path profiler
– ~85% accurate at 1% overhead (at sampling period of 5M)
Questions?