Variational Path Profiling
Erez Perelman*, Trishul Chilimbi†, Brad Calder*
* University of California, San Diego
† Microsoft Research, Redmond



Observation: Variation in Paths Exists
Goal: find the paths to focus on for optimization
What is a path?
– An acyclic control-flow trace through the binary (e.g. a loop body)
Variation in path performance is optimization potential

What is variation?
Performance between iterations of a path is not constant
– Underlying architecture effects (e.g. cache misses) can cause the variation
Example of the amount of variation seen
– One common path in gzip was observed to execute in 48,409 cycles on one execution and in 4,004,226 cycles on another

Goal: Optimize Away Variation
Hypothesis:
– Every execution of a path can take the minimum time (if architecture effects are ignored)
Want: reduce the variation of a path to improve program performance
– Ideal time = the fastest execution of a path
– Optimize the path to execute near its ideal time every time
Result
– Balanced path execution time (smaller net variation for the path)

How to Find the Variation
Sample path executions and measure performance variations
– Rank the top varying paths in the program
Highly optimized paths won’t have much variation
– Traditional hot path profilers won’t find the variation
  Optimized paths still execute the same number of times
– VPP focuses on good optimization points that have not been exploited

Outline
Variational Path Profiling
– Profiling
– Analysis
– Measuring Stability
Optimizations
– Apply simple optimizations on top paths
– Speedup results
– Comparison to other path profiling techniques
Future Work
– Discovering structure in variation and its implication

VPP: Profiling
Sample execution of acyclic paths with Bursty Tracing
– Measure time in path
– Unique path signature
  Entry PC and Branch History
  0x F-110
Accurate measurement of performance essential
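A path signature built from an entry PC and a branch history could be represented as below. This is a minimal sketch and not the authors' implementation: the type path_sig, the functions path_begin, path_branch and path_equal, and the 32-branch limit are all hypothetical names and choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical path signature: entry PC plus a bit-vector of branch outcomes. */
typedef struct {
    uint64_t entry_pc;  /* address where the acyclic path starts */
    uint32_t history;   /* branch outcomes, most recent in the lowest bit */
    uint8_t  length;    /* number of branches recorded so far (at most 32) */
} path_sig;

/* Start a new signature at the given entry PC with an empty history. */
path_sig path_begin(uint64_t entry_pc) {
    path_sig s = { entry_pc, 0u, 0u };
    return s;
}

/* Record one conditional branch: shift in 1 for taken, 0 for not taken.
   Branches beyond the assumed 32-bit capacity are ignored. */
void path_branch(path_sig *s, int taken) {
    assert(s != NULL);
    if (s->length < 32) {
        s->history = (s->history << 1) | (taken ? 1u : 0u);
        s->length++;
    }
}

/* Two executions followed the same path if all three fields match. */
int path_equal(const path_sig *a, const path_sig *b) {
    return a->entry_pc == b->entry_pc &&
           a->history == b->history &&
           a->length == b->length;
}
```

Keeping the length alongside the history distinguishes a path with outcomes 0,1 from a path with the single outcome 1, which would otherwise share the same history bits.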

Bursty Tracing
[Figure: Original Procedure (blocks A, B) beside Modified Procedure (Bursty Tracing) (blocks A', A, B', B)]
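The slide shows only the duplicated procedure, so the switching logic is not given here. As a rough sketch of how a bursty-tracing style dispatch check is commonly described (two counters alternating between an uninstrumented copy and an instrumented copy), the following may help. The names burst_state, burst_init and burst_dispatch and the exact counter handling are assumptions, not taken from the talk.

```c
#include <assert.h>

/* Hypothetical counter state for a bursty-tracing dispatch check. */
typedef struct {
    long n_check0;  /* dispatch checks spent in the uninstrumented copy per period */
    long n_instr0;  /* dispatch checks spent in the instrumented copy per burst */
    long n_check;   /* checks remaining in the current uninstrumented phase */
    long n_instr;   /* checks remaining in the current burst */
    int  in_burst;  /* 1 while the instrumented copy is executing */
} burst_state;

void burst_init(burst_state *b, long n_check0, long n_instr0) {
    assert(n_check0 > 0 && n_instr0 > 0);
    b->n_check0 = n_check0;
    b->n_instr0 = n_instr0;
    b->n_check = n_check0;
    b->n_instr = 0;
    b->in_burst = 0;
}

/* Called at each dispatch point (for example a procedure entry or a loop
   back-edge). Returns 1 if the instrumented copy should run next. */
int burst_dispatch(burst_state *b) {
    if (!b->in_burst) {
        if (--b->n_check == 0) {      /* uninstrumented phase used up */
            b->in_burst = 1;
            b->n_instr = b->n_instr0;
        }
    } else {
        if (--b->n_instr == 0) {      /* burst finished */
            b->in_burst = 0;
            b->n_check = b->n_check0;
        }
    }
    return b->in_burst;
}
```

With these counters the fraction of dispatch checks that land in the instrumented copy is n_instr0 / (n_check0 + n_instr0), which is the knob that trades profile detail against overhead.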

Sampling Overhead
Accuracy is critical for time measurement of a path
– Bursty Tracing has less than 5% instrumentation overhead
– Timing of a path has even lower overhead
  Don’t measure time of instrumentation code
  A small bias exists, but it is consistent and can be accounted for
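One way a small, consistent timing bias could be accounted for is to measure it once and subtract it from every sample. The sketch below assumes the bias is estimated as the smallest reading obtained when timing an empty region; that calibration method and the names estimate_bias and corrected_time are illustrative assumptions, since the slides do not say how the correction was done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Estimate the fixed timing bias as the smallest cycle count observed
   when timing an empty region (assumed calibration method). */
uint64_t estimate_bias(const uint64_t *empty_region_cycles, size_t n) {
    assert(empty_region_cycles != NULL && n > 0);
    uint64_t best = empty_region_cycles[0];
    for (size_t i = 1; i < n; i++) {
        if (empty_region_cycles[i] < best) {
            best = empty_region_cycles[i];
        }
    }
    return best;
}

/* Remove the bias from one path measurement, clamping at zero. */
uint64_t corrected_time(uint64_t start, uint64_t end, uint64_t bias) {
    uint64_t raw = end - start;
    return raw > bias ? raw - bias : 0;
}
```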


VPP: Analysis
Compute net variation time for each path
– Basetime(i) = fastest execution time of path i
– Net variation(i) = Total time(i) - [Frequency(i) x Basetime(i)]
Rank paths according to net variation
– Top few paths dominate all program variation
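The net variation formula and the ranking step can be sketched as follows. The struct path_stats and the functions path_init, path_add_sample and rank_paths are hypothetical names for illustration; only the formula itself comes from the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-path accumulators (hypothetical layout). */
typedef struct {
    uint32_t path_id;
    uint64_t frequency;      /* number of sampled executions */
    uint64_t total_time;     /* sum of sampled cycle counts */
    uint64_t base_time;      /* fastest sampled execution */
    uint64_t net_variation;  /* filled in by rank_paths */
} path_stats;

void path_init(path_stats *p, uint32_t id) {
    assert(p != NULL);
    p->path_id = id;
    p->frequency = 0;
    p->total_time = 0;
    p->base_time = UINT64_MAX;
    p->net_variation = 0;
}

void path_add_sample(path_stats *p, uint64_t cycles) {
    p->frequency++;
    p->total_time += cycles;
    if (cycles < p->base_time) {
        p->base_time = cycles;
    }
}

/* qsort comparator: largest net variation first. */
static int by_net_variation_desc(const void *a, const void *b) {
    const path_stats *x = a;
    const path_stats *y = b;
    if (x->net_variation < y->net_variation) return 1;
    if (x->net_variation > y->net_variation) return -1;
    return 0;
}

/* Net variation(i) = Total time(i) - Frequency(i) x Basetime(i), then rank. */
void rank_paths(path_stats *paths, size_t n) {
    for (size_t i = 0; i < n; i++) {
        path_stats *p = &paths[i];
        p->net_variation =
            p->frequency ? p->total_time - p->frequency * p->base_time : 0;
    }
    qsort(paths, n, sizeof paths[0], by_net_variation_desc);
}
```

As a worked case, a path sampled three times at 100 cycles each has net variation 300 - 3 x 100 = 0, while a path sampled twice at 50 and 400 cycles has 450 - 2 x 50 = 350. The second path ranks first even though it executes less often, which is the contrast with frequency-based hotness.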

Structure within Variation Bzip2 Top 5 Varying Paths

VPP: Top 10 Paths


Stability
Do the top varying paths change when system load or program input changes?
– System load measures resource utilization (processor, memory, buses, etc.)
Measure stability of top paths across system loads
– Heavy system load vs. light system load
Across program inputs
– Program execution varies; how does it affect the top paths?
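The slides do not state how stability was quantified. One plausible measure, shown here purely as an assumption, is the fraction of the top-N paths from one run (say, light load) that also appear in the top N of another run (heavy load, or a different input). The function name top_n_overlap is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fraction of the first n path IDs in ranking a that also appear among
   the first n path IDs in ranking b. Assumes IDs are unique per ranking. */
double top_n_overlap(const uint32_t *a, const uint32_t *b, size_t n) {
    assert(a != NULL && b != NULL && n > 0);
    size_t shared = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (a[i] == b[j]) {
                shared++;
                break;
            }
        }
    }
    return (double)shared / (double)n;
}
```

A value near 1.0 would mean the same paths dominate variation in both runs regardless of their exact order.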

Stability: System Load

Stability: Input


VPP: Optimize Top Paths
Simple optimization strategy for top paths to show optimization potential
– Prefetch the loads in a path one or two loop iterations ahead
– Check loop bounds so prefetches stay within the bounds of the data accesses
After optimization, paths lost 41% of their net variation on average
More elaborate optimizations can reduce more variation
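The strategy of prefetching a couple of iterations ahead with a bounds check can be sketched on a generic loop. This example uses the GCC/Clang builtin __builtin_prefetch, not the intrinsic used in the talk, and the node type, sum_costs and PREFETCH_DISTANCE are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record reached through a pointer, so each access may miss. */
typedef struct {
    long cost;
} node;

/* Assumed prefetch distance: two loop iterations ahead. */
#define PREFETCH_DISTANCE 2

long sum_costs(node *const *items, size_t n) {
    assert(items != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Bounds check first: reading items[i + PREFETCH_DISTANCE] is itself
           a load and must not run past the end of the array. */
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&items[i + PREFETCH_DISTANCE]->cost, 0, 1);
        }
        total += items[i]->cost;
    }
    return total;
}
```

The prefetch is only a hint and does not change the result, so the function returns the same sum with or without it; any benefit depends on the data being far enough from the cache for the early request to matter.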

Optimization Example: VPR
    while (ito < heap_tail) {
        if (heap[ito+1]->cost < heap[ito]->cost)
            ito++;
        if (heap[ito]->cost > heap[ifrom]->cost)
            break;
        if (ito*8 < heap_tail)                             /* added: bounds check */
            _mm_prefetch((char*)&heap[ito*8]->cost, 1);    /* added: prefetch */
        temp_ptr = heap[ito];
        heap[ito] = heap[ifrom];
        heap[ifrom] = temp_ptr;
        ifrom = ito;
        ito = 2*ifrom;
    }
The two lines marked "added" are the optimization; it results in a 9% speedup!

VPP: Spec 2K Speedup


Comparing to other Profiling Techniques
Path profiling techniques often base hotness on frequency
– Most executed paths are considered hot
– Once these are optimized
  Still hot based on frequency
  Lower variation, ranking goes down with VPP
VPP dynamically ranks paths
– Once optimized, path ranking can change

Comparing to other Profiling Techniques


Observation: Variation Structure
Is there a pattern in variation?
– If we plot the variation over time we can see interesting structure
Future work:
– Does the context leading up to a path correlate with the path's performance?
– Can specific hardware structures be identified as causes of variation?
– Can specific optimizations be recommended based on variation structure?

Structure within Variation Bzip2 Top 5 Varying Paths

Conclusion
VPP finds the top varying paths with good optimization potential
– A few top paths account for the majority of variation
– Top variational paths are stable
Applying simple optimizations gives an 8.5% speedup on average for Spec 2K on P4
VPP finds hot paths that are not found with other techniques
– Once a path is optimized, its variation (the _hotness_ in VPP) is reduced