Practical Path Profiling for Dynamic Optimizers Michael Bond, UT Austin Kathryn McKinley, UT Austin.

Slides:



Advertisements
Similar presentations
Project : Phase 1 Grading Default Statistics (40 points) Values and Charts (30 points) Analyses (10 points) Branch Predictor Statistics (30 points) Values.
Advertisements

Link-Time Path-Sensitive Memory Redundancy Elimination Manel Fernández and Roger Espasa Computer Architecture Department Universitat.
P3 / 2004 Register Allocation. Kostis Sagonas 2 Spring 2004 Outline What is register allocation Webs Interference Graphs Graph coloring Spilling Live-Range.
Computer Architecture Instruction-Level Parallel Processors
CSCI 4717/5717 Computer Architecture
Dynamic History-Length Fitting: A third level of adaptivity for branch prediction Toni Juan Sanji Sanjeevan Juan J. Navarro Department of Computer Architecture.
CPE 731 Advanced Computer Architecture Instruction Level Parallelism Part I Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Lecture 8 Dynamic Branch Prediction, Superscalar and VLIW Advanced Computer Architecture COE 501.
1 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Feb 24, 2009 Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler.
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
Dynamic Branch PredictionCS510 Computer ArchitecturesLecture Lecture 10 Dynamic Branch Prediction, Superscalar, VLIW, and Software Pipelining.
Optimal Instruction Scheduling for Multi-Issue Processors using Constraint Programming Abid M. Malik and Peter van Beek David R. Cheriton School of Computer.
Overview Motivations Basic static and dynamic optimization methods ADAPT Dynamo.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
Probabilistic Calling Context Michael D. Bond Kathryn S. McKinley University of Texas at Austin.
The Use of Traces for Inlining in Java Programs Borys J. Bradel Tarek S. Abdelrahman Edward S. Rogers Sr.Department of Electrical and Computer Engineering.
Perceptron-based Global Confidence Estimation for Value Prediction Master’s Thesis Michael Black June 26, 2003.
1 Improving Branch Prediction by Dynamic Dataflow-based Identification of Correlation Branches from a Larger Global History CSE 340 Project Presentation.
Spring Path Profile Estimation and Superblock Formation Jeff Pang Jimeng Sun.
VLSI Project Neural Networks based Branch Prediction Alexander ZlotnikMarcel Apfelbaum Supervised by: Michael Behar, Spring 2005.
EENG449b/Savvides Lec /17/04 February 17, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG.
EECC551 - Shaaban #1 Spring 2006 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.
EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.
Partial Method Compilation using Dynamic Profile Information John Whaley Stanford University October 17, 2001.
Chapter 2 Instruction-Level Parallelism and Its Exploitation
Incremental Path Profiling Kevin Bierhoff and Laura Hiatt Path ProfilingIncremental ApproachExperimental Results Path profiling counts how often each path.
San Diego Supercomputer Center Performance Modeling and Characterization Lab PMaC Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation.
Multiscalar processors
Computer Architecture Instruction Level Parallelism Dr. Esam Al-Qaralleh.
EECC551 - Shaaban #1 Winter 2011 lec# Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level.
EENG449b/Savvides Lec /25/05 March 24, 2005 Prof. Andreas Savvides Spring g449b EENG 449bG/CPSC 439bG.
Variational Path Profiling Erez Perelman*, Trishul Chilimbi †, Brad Calder* * University of Califonia, San Diego †Microsoft Research, Redmond.
Appendix A Pipelining: Basic and Intermediate Concepts
Instruction Scheduling II: Beyond Basic Blocks Comp 412 Copyright 2010, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp.
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Catching Accurate Profiles in Hardware Satish Narayanasamy, Timothy Sherwood, Suleyman Sair, Brad Calder, George Varghese Presented by Jelena Trajkovic.
1 Storage Free Confidence Estimator for the TAGE predictor André Seznec IRISA/INRIA.
P ath & E dge P rofiling Michael Bond, UT Austin Kathryn McKinley, UT Austin Continuous Presented by: Yingyi Bu.
1 Sampling-based Program Locality Approximation Yutao Zhong, Wentao Chang Department of Computer Science George Mason University June 8th,2008.
Predicated Static Single Assignment (PSSA) Presented by AbdulAziz Al-Shammari
Timing Analysis of Embedded Software for Speculative Processors Tulika Mitra Abhik Roychoudhury Xianfeng Li School of Computing National University of.
CS 211: Computer Architecture Lecture 6 Module 2 Exploiting Instruction Level Parallelism with Software Approaches Instructor: Morris Lancaster.
CSCI 6461: Computer Architecture Branch Prediction Instructor: M. Lancaster Corresponding to Hennessey and Patterson Fifth Edition Section 3.3 and Part.
Targeted Path Profiling : Lower Overhead Path Profiling for Staged Dynamic Optimization Systems Rahul Joshi, UIUC Michael Bond*, UT Austin Craig Zilles,
Advanced Computer Architecture Lab University of Michigan Compiler Controlled Value Prediction with Branch Predictor Based Confidence Eric Larson Compiler.
Targeted Path Profiling : Lower Overhead Path Profiling for Staged Dynamic Optimization Systems Rahul Joshi, UIUC Michael Bond*, UT Austin Craig Zilles,
Interprocedural Path Profiling David Melski Thomas Reps University of Wisconsin.
Optimal Superblock Scheduling Using Enumeration Ghassan Shobaki, CS Dept. Kent Wilken, ECE Dept. University of California, Davis
Branch Prediction Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes based on notes by John P. Shen Updated by Mikko Lipasti.
2D-Profiling Detecting Input-Dependent Branches with a Single Input Data Set Hyesoon Kim M. Aater Suleman Onur Mutlu Yale N. Patt HPS Research Group The.
Application Domains for Fixed-Length Block Structured Architectures ACSAC-2001 Gold Coast, January 30, 2001 ACSAC-2001 Gold Coast, January 30, 2001.
1 ROGUE Dynamic Optimization Framework Using Pin Vijay Janapa Reddi PhD. Candidate - Electrical And Computer Engineering University of Colorado at Boulder.
Online Subpath Profiling
Feedback directed optimization in Compaq’s compilation tools for Alpha
EE 382N Guest Lecture Wish Branches
Inlining and Devirtualization Hal Perkins Autumn 2011
Correcting the Dynamic Call Graph Using Control Flow Constraints
Adaptive Optimization in the Jalapeño JVM
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Instruction Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Dynamic Hardware Prediction
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
How to improve (decrease) CPI
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Pipelining and Exploiting Instruction-Level Parallelism (ILP)
Loop-Level Parallelism
Static Scheduling Techniques
Presentation transcript:

Practical Path Profiling for Dynamic Optimizers Michael Bond, UT Austin Kathryn McKinley, UT Austin

Why path profiling? Processors need long instruction sequences Programs have branches A C E B D

Why path profiling? Compiler identifies hot paths across multiple basic blocks A C E B D

Why path profiling? A C E B A C E B D Compiler identifies hot paths across multiple basic blocks – Forms and optimizes “traces”

Why path profiling? A C E B A C E B D Oops! Compiler identifies hot paths across multiple basic blocks – Forms and optimizes “traces”

Why path profiling? Superblocks Dynamo fragments Hyperblocks rePLay frames MSSP tasks Less aggressiveMore aggressive Compiler identifies hot paths across multiple basic blocks – Forms and optimizes “traces”

Ball-Larus path profiling Edge profiling Ball-Larus path profiling Targeted path profiling Practical path profiling Instrumentation measures execution frequency of each path Acyclic, intraprocedural paths

Edge profiling Ball-Larus path profiling Targeted path profiling Practical path profiling Hardware or sampling Estimate hot paths from edge profile

Ideal for dynamic optimizer Edge profiling Ball-Larus path profiling Targeted path profiling Practical path profiling

Targeted path profiling [Joshi et al. ’04] Edge profiling Ball-Larus path profiling Targeted path profiling Practical path profiling Profile-guided profiling Accuracy good Overhead high for dynamic optimizer

Practical path profiling Edge profiling Ball-Larus path profiling Targeted path profiling Practical path profiling

Outline Background – Staged dynamic optimization – Profile-guided profiling – Ball-Larus path profiling Practical path profiling Methodology – Edge profile-guided inlining and unrolling – Measuring accuracy with branch-flow metric Accuracy and overhead

Staged dynamic optimization Static optimizations Stage 0

Staged dynamic optimization Static optimizations Edge profile Stage 0 Sampling-based edge profiler

Staged dynamic optimization Static optimizations Edge profile Stage 0 Local optimizations incl. inlining & unrolling Stage 1 Sampling-based edge profiler

Staged dynamic optimization Static optimizations Edge profile Stage 0 Local optimizations incl. inlining & unrolling Stage 1 Larger routines Longer paths More challenging platform for path profiling Sampling-based edge profiler

Staged dynamic optimization Static optimizations Edge profile Stage 0 Local optimizations incl. inlining & unrolling Path profiling instrumentation Stage 1 Sampling-based edge profiler

Staged dynamic optimization Static optimizations Edge profile Stage 0 Local optimizations incl. inlining & unrolling Path profiling instrumentation Stage 1 Path profile Sampling-based edge profiler

Staged dynamic optimization Static optimizations Edge profile Stage 0 Local optimizations incl. inlining & unrolling Path profiling instrumentation Global optimizations Stage 2 Stage 1 Path profile Sampling-based edge profiler

Staged dynamic optimization Static optimizations Edge profile Stage 0 Local optimizations incl. inlining & unrolling Path profiling instrumentation Global optimizations Stage 2 Stage 1 Path profile Edge profile: Identifies hot and cold edges Provides partial path profile Sampling-based edge profiler

Profile-guided profiling Static optimizations Edge profile Stage 0 Local optimizations incl. inlining & unrolling Path profiling instrumentation Global optimizations Stage 2 Stage 1 Path profile Edge profile: Identifies hot and cold edges Provides partial path profile Sampling-based edge profiler

Ball-Larus path profiling Acyclic, intraprocedural paths – Handles cyclic routines Instrumentation maintains execution frequency of each path – Each path computes unique integer in [0, N-1]

Ball-Larus path profiling 4 paths  [0, 3]

Ball-Larus path profiling paths  [0, 3] Each path sums to unique integer 0 0

Ball-Larus path profiling 2 4 paths  [0, 3] Each path sums to unique integer Path

Ball-Larus path profiling 2 4 paths  [0, 3] Each path sums to unique integer Path 0 Path

Ball-Larus path profiling 2 4 paths  [0, 3] Each path sums to unique integer Path 0 Path 1 Path

Ball-Larus path profiling 2 4 paths  [0, 3] Each path sums to unique integer Path 0 Path 1 Path 2 Path

Ball-Larus path profiling r=r+2 r=0 r=r+1 count[r]++ r : path register – Computes path number count : – Stores path frequencies

Ball-Larus path profiling r=r+2 r=0 count[r]++ r : path register – Computes path number count : – Stores path frequencies – Array by default – Too many paths? Hash table High overhead r=r+1

Outline Background – Ball-Larus path profiling – Staged dynamic optimization – Profile-guided profiling Practical path profiling Methodology – Edge profile-guided inlining and unrolling – Measuring accuracy with branch-flow metric Accuracy and overhead

Practical path profiling Goal: Reduce instrumentation overhead without hurting accuracy – Use profile-guided profiling Strategies – Decrease number of possible paths – Avoid instrumenting paths edge profile predicts well – Simplify instrumentation on profiled paths

Practical path profiling Goal: Reduce instrumentation overhead without hurting accuracy – Use profile-guided profiling Strategies – Decrease number of possible paths – Avoid instrumenting paths edge profile predicts well – Simplify instrumentation on profiled paths Techniques from targeted path profiling – Improves techniques – Adds new techniques

Strategy 1: Fewer possible paths Goal: Hash table  array Want to remove cold paths

Strategy 1: Fewer possible paths Goal: Hash table  array Want to remove cold paths Observation: A path with a cold edge is a cold path Remove cold edges – Local and global criteria New

Strategy 1: Fewer possible paths Goal: Hash table  array Want to remove cold paths Observation: A path with a cold edge is a cold path Remove cold edges – Local and global criteria Paths: 16  4

Remaining paths potentially hot 4 paths  [0, 3] 2 1 Strategy 1: Fewer possible paths 0 0

Remaining paths potentially hot 4 paths  [0, 3] Strategy 1: Fewer possible paths r=r+2 r=0 r=r+1 count[r]++

What if cold edge taken? Strategy 1: Fewer possible paths r=r+2 r=0 r=r+1 count[r]++

What if cold edge taken? Cold edges “poison” path register – Set it to N – Cold paths use [N, 2N-1] Strategy 1: Fewer possible paths r=r+2 r=0 r=4 r=r+1 count[r]++ New

What if cold edge taken? Cold edges “poison” path register – Set it to N – Cold paths use [N, 2N-1] What if still too many possible paths? r=r+2 r=0 r=4 r=r+1 count[r]++ Strategy 1: Fewer possible paths

What if cold edge taken? Cold edges “poison” path register – Set it to N – Cold paths use [N, 2N-1] What if still too many possible paths? Adjust cold edge threshold until hashing avoided Strategy 1: Fewer possible paths r=r+2 r=0 r=4 r=r+1 count[r]++ New

Strategy 2: Avoid instrumenting paths Consider right half of CFG

Strategy 2: Avoid instrumenting paths Consider right half of CFG – Obvious paths: Each path has an edge unique to it – Edge profile provides perfect path profile

Strategy 2: Avoid instrumenting paths Consider right half of CFG – Obvious paths: Each path has an edge unique to it – Edge profile provides perfect path profile We don’t instrument the right half of the CFG r=r+2 r=r+1 r=0 count[r]++

Strategy 2: Avoid instrumenting paths Synergy: Cold edge removal creates more obvious paths

Strategy 2: Avoid instrumenting paths Synergy: Cold edge removal creates more obvious paths – Right half is obvious

Strategy 2: Avoid instrumenting paths What if cold edge is part of obvious and non-obvious paths?

Strategy 2: Avoid instrumenting paths What if cold edge is part of obvious and non-obvious paths? Right half obvious

Strategy 2: Avoid instrumenting paths r=r+2 r=r+1 r=0 count[r]++ What if cold edge is part of obvious and non-obvious paths? Right half obvious – But we haven’t avoided instrumenting it! r=4

Strategy 2: Avoid instrumenting paths What if cold edge is part of obvious and non-obvious paths? Right half obvious – But we haven’t avoided instrumenting it! Aggressive instrumentation pushing r=r+2 r=r+1 r=0 count[r]++ New

Strategy 2: Avoid instrumenting paths Overcounts some hot paths r=r+2 r=r+1 r=0 count[r]++

Strategy 2: Avoid instrumenting paths Overcounts some hot paths Example cold path counts hot path number 1 Overcount tends to be small r=r+2 r=r+1 r=0 count[r]++

Some paths need profiling Correlation between cascading branches

Strategy 3: Simplify instrumentation Moderately biased branches New

Strategy 3: Simplify instrumentation Moderately biased branches Put zeros on hotter edges

Strategy 3: Simplify instrumentation Moderately biased branches Put zeros on hotter edges – No instrumentation on hotter edges r=r+2 r=0 r=r+1 count[r]++

Outline Background – Staged dynamic optimization – Profile-guided profiling – Ball-Larus path profiling Practical path profiling Methodology – Edge profile-guided inlining and unrolling – Measuring accuracy with branch-flow metric Accuracy and overhead

Methodology Path profiling implemented in Scale [McKinley et al.] – Ahead-of-time compiler  deterministic platform Edge profile-guided inlining and unrolling precede path profiling

Methodology Path profiling implemented in Scale [McKinley et al.] – Ahead-of-time compiler  deterministic platform Edge profile-guided inlining and unrolling precede path profiling Alpha binaries for subset of SPEC2000 – C and Fortran 77 only – Scale wouldn’t compile gzip, vortex, gcc ref inputs for all runs

Measuring accuracy Compare estimated profile with actual profile – Wall weight matching* or profile overlap Weight paths by flow: amount of execution – Previous work measures flow with unit-flow metric Flow(p) = Freq(p) – We introduce branch-flow metric Flow(p) = Freq(p) x NumBranches(p)

Motivating the branch-flow metric Programs really execute one very long path call return

Motivating the branch-flow metric Programs really execute one very long path call return

Motivating the branch-flow metric Programs really execute one very long path – Ball-Larus path profiling breaks it into multiple acyclic, intraprocedural paths call return call return

Motivating the branch-flow metric Some paths longer than others – We care more about longer paths – Unit-flow metric unfairly rewards edge profiling call return call return

Outline Background – Staged dynamic optimization – Profile-guided profiling – Ball-Larus path profiling Practical path profiling Methodology – Edge profile-guided inlining and unrolling – Measuring accuracy with branch-flow metric Accuracy and overhead

Accuracy

Overhead

Related work Dynamo [Bala et al. ’00] – Successful path-based dynamic optimizer – “Bails out” when no dominant path Instrumentation sampling & dynamic instrumentation [Arnold & Ryder ’01, Hirzel & Chilimbi ’04, Yasue et al. ’04] – Lower overhead by extending profiling time – Orthogonal to practical path profiling Hardware-based path profiling [Vaswani et al. ’05] – High accuracy when hot path table large enough

Summary Edge profiling Ball-Larus path profiling Targeted path profiling Practical path profiling Contributions: Inlining and unrolling Branch-flow metric Practical path profiling

Questions?