Analysis of Path Profiling Information Generated with Performance Monitoring Hardware
Alex Shye, Matt Iyer, Tipp Moseley, Dave Hodgdon, Dan Fay, Vijay Janapa Reddi, Dan Connors

Presentation transcript:

Analysis of Path Profiling Information Generated with Performance Monitoring Hardware
Alex Shye, Matt Iyer, Tipp Moseley, Dave Hodgdon, Dan Fay, Vijay Janapa Reddi, Dan Connors
University of Colorado at Boulder
Department of Electrical and Computer Engineering
DRACO Architecture Research Group

Introduction
Profile information is critical to the success of optimizers
–Point Profile - basic block (BB) counts, edge profiles, etc.
–Path Profile - correlated branches
Off-line Path Profiling Methods:
–Use static/dynamic instrumentation to gather full path profile [Ball96][Joshi04][Bond05]
On-line Path Profiling Method:
–Interpretation: MRET [Bala00][Bruen03]
Both incur high overhead!! For run-time systems, this overhead is unacceptable
[Figure: control flow graph with blocks A through G]
Edge Profile: ABDFG
Path Profile: ABDFG 60, ACDFG 10, …
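To illustrate why a path profile captures branch correlation that an edge profile cannot, the sketch below projects two different path profiles onto edges and shows that they collapse to the same edge profile. The block names follow the A through G graph on the slide, but the graph shape and all counts are assumptions made up for the illustration, not data from the talk.

```python
from collections import Counter

def edge_profile(path_profile):
    """Project a path profile onto edges by summing path counts per block pair."""
    edges = Counter()
    for path, count in path_profile.items():
        for src, dst in zip(path, path[1:]):
            edges[src + dst] += count
    return edges

# Hypothetical counts (not from the slides). Both profiles run over an assumed
# diamond-shaped CFG: A -> {B, C} -> D -> {E, F} -> G.
profile_1 = {"ABDFG": 60, "ACDEG": 40}
profile_2 = {"ABDEG": 40, "ABDFG": 20, "ACDFG": 40}

# The two path profiles disagree on which paths are hot,
# yet every edge carries the same count in both.
same_edges = edge_profile(profile_1) == edge_profile(profile_2)
print(same_edges)
```

An optimizer that sees only the edge counts cannot tell these two executions apart, which is the information a path profile adds.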

Performance Monitoring
Modern processors contain on-chip Performance Monitoring Units (PMUs)
–Itanium, Pentium 4, and PowerPC support branch vectors
Sampling PMU
–Less information
–Non-deterministic, phase behavior
Branch Execution Information
–Itanium-2 PMU Branch Trace Buffer (BTB) - up to four branches
  Different configurations: Last-4 branches, Last-4 taken branches, etc.
–Compiler can expand this information

PMU-based Path Profiling
Goal: Combine compiler analysis and PMU branch vectors to generate a path profile
For PMU-based path profiling to be effective, it must be comparable to a full path profile, e.g. Ball-Larus path profiling [Ball96]
Other forms of PMU-based profile information have been shown to be effective for run-time optimization - ADORE [Chen03][Lu04]
[Figure labels: Hot Path, BTB Trace]

Hardware Profiling Approaches
Proposed Techniques:
–BTB profile buffer [Conte94]
  OS coupled with BTB hardware to fill out an edge profile
–Hot Spot Detection [Merten00]
  Proposed Branch Behavior Buffer to store branch information to fill out edge profile
–Programmable Path Profiler [Vaswani05]
  Hardware Path Stack and Path Detector
Performance Monitoring Unit Techniques
–Continuous Profiling/Optimization Systems
  Simple PMU - event counters
–ADORE Dynamic Optimizer [Chen03][Lu04]
  Sampling Itanium-2 PMU to drive memory optimizations

Motivation
Characteristics of the Ideal Run-time Profiler
1. Accuracy - Ability to reflect run-time execution well
2. Single-Stage - Can profile binary on-the-fly without extra compilation stages
3. Low Overhead - Incurs little to no overhead
Unfortunately, most existing techniques achieve only one or two of these.
This project aims to combine the accuracy of path profiling with low overhead by using existing performance monitoring hardware.
[Table: rows Static Instrumentation, Dynamic Instrumentation, Interpretation, Hardware Techniques; columns Accuracy, Single-Stage, Low-Overhead]

Itanium-2 PMU Path Profiling
2 Phases
–Online
  BTB Trace Collection
–Offline
  Partial Path Creation
  Region Formation
  Path Profile Generation
  –Path Matching
  –Path Crediting
[Figure: PMU/Processor, BTB Traces, Partial Paths, Compiler-Aided Offline Analysis (Region Formation, Path Matching/Crediting), Paths]
Terminology
BTB Trace: Series of addresses from BTB
Partial Path: Path of ops in compiler IR
Region: Single-entrance region in CFG
Path: Complete path through a region

BTB Trace Collection
BTB Trace: Sequence of four branches per sample
–Configured to sample only taken branches
  Allows for longer partial paths to be built
  The not-taken path is trivial to follow
BTB Trace placed into specialized hash table every sample
–If BTB Trace exists, increment count
At the end of execution, BTB Traces and counts are dumped to a file
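The collection step can be sketched in a few lines. This is a minimal illustration, assuming each sample arrives as a tuple of up to four taken-branch addresses. It uses an ordinary Python `Counter` in place of the specialized hash table, and the function names, sample addresses, and dump format are hypothetical rather than taken from the authors' tool.

```python
from collections import Counter

def collect_btb_traces(samples):
    """Count how often each distinct BTB trace (tuple of branch addresses) is sampled."""
    table = Counter()
    for sample in samples:
        # A new trace starts at 1; an existing trace has its count incremented.
        table[tuple(sample)] += 1
    return table

def dump_traces(table):
    """Render traces and counts as text, one 'addr,addr,... count' line per trace."""
    lines = []
    for trace, count in sorted(table.items()):
        addrs = ",".join(format(addr, "x") for addr in trace)
        lines.append(f"{addrs} {count}")
    return "\n".join(lines)

# Made-up samples: the first and third are the same four-branch trace.
samples = [
    (0x10AC, 0x640, 0x66C, 0x10C),
    (0x2000, 0x2040, 0x2100, 0x2180),
    (0x10AC, 0x640, 0x66C, 0x10C),
]
table = collect_btb_traces(samples)
print(dump_traces(table))
```

Keying the table on the whole trace rather than on individual branches preserves the ordering of the four branches, which the later partial-path reconstruction relies on.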

Partial Path Creation
Partial Path: List of low-level IR ops
Partial Path Formation
–Recreate path from BTB Trace
–Partial Path weight = count
–Perform Partial Path Extensions
  Up until Join Point
  Down until Branch Point
[Figure labels: Join Point, Branch Point, Partial Path from BTB Trace, Extended Partial Path, BTB Trace Branch]
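The extension step can be sketched as follows, under simplifying assumptions: the path is a list of basic-block names rather than IR ops, the CFG is a plain successor map, and the example graph is hypothetical. The idea is that a block with exactly one predecessor (or one successor) leaves no ambiguity about where control came from (or goes next), so the path can grow until it hits a join point on top or a branch point at the bottom.

```python
def predecessors(succs):
    """Invert a successor map into a predecessor map."""
    preds = {block: [] for block in succs}
    for src, dsts in succs.items():
        for dst in dsts:
            preds.setdefault(dst, []).append(src)
    return preds

def extend_partial_path(path, succs):
    """Grow a partial path upward to a join point and downward to a branch point."""
    preds = predecessors(succs)
    path = list(path)
    # Up: a head with a single predecessor must have been reached from it.
    # Stop at a join point (several predecessors) or at an entry block.
    while len(preds.get(path[0], [])) == 1 and preds[path[0]][0] not in path:
        path.insert(0, preds[path[0]][0])
    # Down: a tail with a single successor must fall through to it.
    # Stop at a branch point (several successors) or at an exit block.
    while len(succs.get(path[-1], [])) == 1 and succs[path[-1]][0] not in path:
        path.append(succs[path[-1]][0])
    return path

# Hypothetical CFG: A branches to B or C, both join at D, D falls into E, E branches.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"],
       "E": ["F", "G"], "F": [], "G": []}
extended = extend_partial_path(["B"], cfg)
print(extended)
```

The `not in path` checks are a guard against looping forever on a cyclic graph; they are a choice made for this sketch, not something stated on the slide.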

Path Matching and Crediting
Path Matching
–Find list of all paths that contain partial path
Path Crediting
–Distribute partial path weight equally among matched paths
Challenge:
–Number of paths grows exponentially
–Large control flow graphs present a problem
Example:
[Figure: control flow graph with blocks A through Y]
Partial Path | Count | Matches | Inc | Total
OPRSUVXY | 100 | ABDLMOPRSUVXY, ACDLMOPRSUVXY, ABDLNOPRSUVXY, ACDLNOPRSUVXY | |
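Matching and crediting can be sketched as below. The successor map is an assumed reconstruction of the A through Y graph, inferred only from the path names that appear on the slides, so its exact shape is a guess. The function names are hypothetical, and brute-force path enumeration stands in for whatever the authors' compiler module does.

```python
from collections import defaultdict

# Assumed CFG: each key maps to a string of single-letter successor blocks.
SUCCS = {
    "A": "BC", "B": "D", "C": "D", "D": "LE",
    "L": "MN", "M": "O", "N": "O", "O": "PQ", "P": "R", "Q": "R",
    "R": "ST", "S": "U", "T": "U", "U": "VW", "V": "X", "W": "X", "X": "Y",
    "E": "FG", "F": "H", "G": "H", "H": "IJ", "I": "K", "J": "K", "K": "Y",
    "Y": "",
}

def enumerate_paths(succs, entry):
    """Return every entry-to-exit path of an acyclic CFG as a tuple of blocks."""
    paths, stack = [], [(entry,)]
    while stack:
        path = stack.pop()
        nexts = succs.get(path[-1], "")
        if not nexts:
            paths.append(path)
        for block in nexts:
            stack.append(path + (block,))
    return paths

def contains(path, partial):
    """True if partial occurs in path as a contiguous run of blocks."""
    n = len(partial)
    return any(path[i:i + n] == partial for i in range(len(path) - n + 1))

def credit_paths(partials, paths):
    """Split each partial path's count equally among the full paths containing it."""
    totals = defaultdict(float)
    for partial, count in partials.items():
        matches = [p for p in paths if contains(p, partial)]
        for p in matches:
            totals[p] += count / len(matches)
    return dict(totals)

paths = enumerate_paths(SUCCS, "A")
credited = credit_paths({tuple("OPRSUVXY"): 100}, paths)
for path, weight in sorted(credited.items()):
    print("".join(path), weight)
```

On this assumed graph the partial path OPRSUVXY matches the same four full paths listed on the slide, and each receives an equal quarter of the count. Even this small graph has 40 complete paths, which hints at the exponential growth named under Challenge.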

Region Formation
We use region-based paths
–Makes total # paths more manageable
–Limits number of matching paths
Rules for Region R:
–R must be single entry
–R may not cross loop boundaries
  Loop Regions created first
–R may not cross function boundaries
–Total # paths in R is limited by a threshold
–R must be as large as possible
Side Effects of Region Formation
–Partial Paths must be split at:
  Loop boundaries
  Function boundaries
  Region boundaries
[Figure: control flow graph with blocks A through Y, divided into Region 1, Region 2, and Region 3]

Path Generation Example
Suppose we encounter these paths:
–ABDLMOP
–ABDEFHIK
  Split into ABD, EFHIK
–OPRSUVX
Partial Path | Count | Matches | Inc | Total
ABDLMOP | 100 | ABDLMOPRSUVX, ABDLMOPRSUWX, ABDLMOPRSUVX, ABDLMOPRSUWX | |
ABD | 160 | ABDLMOPRSUVX, …(14 more), ABDLNOQRTUWX | +10 | …
EFHIK | 160 | EFHIK | |
OPRSUVX | 280 | ABDLMOPRSUVX, ABDLNOPRSUVX, ACDLMOPRSUVX, ACDLNOPRSUVX | |
[Figure: control flow graph with blocks A through Y, divided into Region 1, Region 2, and Region 3]
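The split of ABDEFHIK into ABD and EFHIK can be sketched as cutting a partial path wherever consecutive blocks belong to different regions. The block-to-region assignment below is hypothetical: it is one assignment consistent with the splits and match lists in this example, not the region numbering from the slide's figure.

```python
from itertools import groupby

# Hypothetical block-to-region map, chosen to be consistent with the example.
REGION_OF = {block: 1 for block in "ABCDLMNOPQRSTUVWX"}
REGION_OF.update({block: 2 for block in "EFGHIJK"})
REGION_OF["Y"] = 3

def split_at_regions(partial_path, region_of):
    """Cut a partial path into maximal runs of blocks that share a region."""
    return ["".join(run) for _, run in groupby(partial_path, key=region_of.get)]

print(split_at_regions("ABDEFHIK", REGION_OF))
print(split_at_regions("ABDLMOP", REGION_OF))
```

Under this assignment ABD can be completed in 16 ways through its region (two choices at each of four later branch points), so a count of 160 credits each matched path with 10, which agrees with the +10 increment shown in the table.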

Methodology
Experiments run on Itanium 2
Developed tool using perfmon kernel interface and libpfm [perfmon] to interface with PMU
Benchmarks
–Set of SPEC2000 benchmarks
–Compiled with the OpenIMPACT Research Compiler [oicc]
  Without aggressive profile-directed optimizations
Off-line analysis with OpenIMPACT module
Compared to full path profile gathered with a Pin path profiling tool

Effect of Sampling Period
Knee of overhead curve at ~500K
Number of unique paths grows consistently as sampling period decreases
–Levels off somewhat between 50K and 100K

Accuracy Results
Accuracy is measured with a scheme similar to Wall’s weight matching [Wall91]
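The slide does not spell out the metric, so the following is only a sketch of one common formulation of weight matching, with hypothetical names and made-up counts. The estimated profile picks its N hottest paths, and the score is the share of true weight those paths carry relative to the true weight of the actual N hottest paths. The authors' variant may differ in how N is chosen or how ties are handled.

```python
def weight_match(actual, estimated, top_n):
    """Score an estimated profile by the actual weight of the paths it ranks hottest."""
    hot_estimated = sorted(estimated, key=estimated.get, reverse=True)[:top_n]
    hot_actual = sorted(actual, key=actual.get, reverse=True)[:top_n]
    # Numerator: true weight covered by the estimate's choice of hot paths.
    covered = sum(actual.get(path, 0) for path in hot_estimated)
    # Denominator: true weight of the genuinely hottest paths (the best possible choice).
    best = sum(actual[path] for path in hot_actual)
    return covered / best if best else 0.0

# Made-up profiles: the estimate wrongly ranks p3 above p2.
actual = {"p1": 60, "p2": 30, "p3": 10}
estimated = {"p1": 50, "p3": 40, "p2": 10}
score = weight_match(actual, estimated, top_n=2)
print(round(score, 3))
```

A score of 1.0 means the estimate selected paths carrying as much true weight as the real hot set, and lower scores mean hot weight was missed in favor of colder paths.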

Incorrectly Detected Paths
With our path crediting technique:
–We can distinguish hot paths in a region
–May incorrectly detect hot paths in program
  May be crediting cold paths enough for them to seem hot compared to the rest of the program
Partial Path | Count | Matches | Inc | Total
ABDLMOP | 100 | ABDLMOPRSUVX, ABDLMOPRSUWX, ABDLMOPRSUVX, ABDLMOPRSUWX | |
[Figure: control flow graph with blocks A through Y, divided into Region 1, Region 2, and Region 3]

Partial Path Length
Length of partial paths drops drastically after splitting on function boundaries and loop back edges

Function Correlation
MANY partial paths cross function boundaries
–Should use function correlation

Multiple Runs
It may be possible to use multiple runs to provide more accurate path profile data

Future Work
Region Formation
–Characterize quality of our regions
  Important because no correlation between regions
–Regions stretching across function boundaries
Noise Elimination
–Crucial to removing false positives due to path crediting
Effects of Optimization
–Find effects of superblocks, inlining, etc. on partial paths and accuracy of path profile

Conclusion
We introduce the rationale for PMU-based path profiling and present initial data
PMU-based profiling shows promise
At Sampling Period = 5M cycles
–~85% accurate
–~1% overhead
Questions?

References
[Bala00] V. Bala, E. Duesterwald and S. Banerjia. “Dynamo: A Transparent Dynamic Optimization System” PLDI
[Ball92] T. Ball and J.R. Larus. “Optimally Profiling and Tracing Programs” TOPLAS
[Ball96] T. Ball and J.R. Larus. “Efficient Path Profiling” MICRO-29
[Bond05] M.D. Bond and K.S. McKinley. “Practical Path Profiling for Dynamic Optimizers” CGO
[Bruen03] D. Bruening, R. Garnett and S. Amarasinghe. “An Infrastructure for Adaptive Dynamic Optimization” CGO
[Chen03] H. Chen, W.C. Hsu, J. Lu, P.C. Yew and D.Y. Chen. “Dynamic Trace Selection Using Performance Monitoring Hardware Sampling” CGO
[Conte94] T.M. Conte, B.A. Patel and J.S. Cox. “Using Branch Handling Hardware to Support Profile-Driven Optimization” MICRO-27, 1994.

References (cont)
[Intel04] Intel. “Intel Itanium 2 Processor Reference Manual: For Software Development and Optimization” May
[Joshi04] R. Joshi, M.D. Bond and C. Zilles. “Targeted Path Profiling: Lower Overhead Path Profiling for Staged Dynamic Optimization Systems” CGO
[Kistler01] T. Kistler and M. Franz. “Continuous Program Optimization” IEEE Trans. on Computers v50 no6 June
[Lu04] J. Lu, H. Chen, P.C. Yew and W.C. Hsu. “Design and Implementation of a Lightweight Dynamic Optimization System” Journal of ILP 6, 2004
[Merten00] M.C. Merten, A.R. Trick, E.M. Nystrom, R.D. Barnes, and W.W. Hwu. “A Hardware Mechanism for Dynamic Extraction and Relayout of Program Hot Spots” ISCA
[oicc]
[pin]

Extra Slides

ADORE Trace Selection
Goal: Gather hot traces with many cache misses to add pre-fetches
However, hot traces may not be enough to detect full hot paths
Compiler can perform further analysis
–Correlate BTB based traces into longer paths
[Figure: Itanium 2 PMU samples last 4 taken branches into a Branch Trace Table; labels BTB Trace, Hot Path]
Branch Trace Table
Branch Trace | D1 Misses | I1 Misses | Cycles
10ac,640,66c, 10c | … | … | …

Partial Path Characteristics
Partial Path extensions increase length ~20%
However, splitting drastically decreases lengths
–~30% on function boundaries, ~20% more on loop back edges
Many paths span 1 or more function boundaries
–Indicates a great amount of function correlation is being thrown away
[Table: Average Partial Path Lengths; columns Benchmark, Initial, Ext, Func, Loop; rows 164.gzip, 175.vpr, 177.mesa, 179.art, 181.mcf, 183.equake, 186.crafty, 188.ammp, 197.parser, 256.bzip2, 300.twolf]
[Table: Function Boundaries Spanned; rows 164.gzip, 175.vpr, 177.mesa, 179.art, 181.mcf, 183.equake, 186.crafty, 188.ammp, 197.parser, 256.bzip2, 300.twolf]