Online Performance Auditing: Using Hot Optimizations Without Getting Burned
Jeremy Lau (UCSD, IBM), Matthew Arnold (IBM), Michael Hind (IBM), Brad Calder (UCSD)

Problem
Trend: Increasing complexity of computer systems
–Hardware: more speculation and parallelism
–Software: more abstraction layers and virtualization
Increasing complexity makes it more difficult to reason about performance
–Will optimization X improve performance?

Increasing Complexity
Increasing distance between application and raw performance
–Modern stack (below) vs. the classic Application-OS-Hardware stack
Hard to predict how all layers will react to application-level optimization
[Figure: modern stack — Application, Application Server, Java VM, OS, Hypervisor, Hardware]

Heuristics
When should I use optimization X? Common solution: use heuristics
Example: Apply optimization X if code size < N (illustrated below)
–"We believe X will improve performance when code size < N"
–Determine N by running benchmarks and tuning to maximize average performance
But heuristics will miss opportunities to improve performance
–Because they are tuned for the average case
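As a concrete illustration of the kind of heuristic this slide describes, a size-threshold check might look like the sketch below; the class and method names and the value of N are hypothetical, and the point is that N is tuned offline for the average case:

```java
// Hypothetical shape of the slide's heuristic: apply optimization X only
// to methods smaller than a threshold N tuned offline over benchmarks.
final class InliningHeuristic {
    static final int SIZE_THRESHOLD = 256;   // "N", a hypothetical tuned value

    static boolean shouldApplyOptimizationX(int codeSizeBytes) {
        // Tuned to maximize average benchmark performance, so it can
        // mispredict for any individual method.
        return codeSizeBytes < SIZE_THRESHOLD;
    }
}
```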

Experiment
Aggressive inlining: 4x inlining thresholds
–Allows much larger methods to be inlined
Apply aggressive inlining to one hot method at a time
Calculate per-method speedups vs. the default inlining policy
–Use the cycle counter to measure performance

Experiment Results
Aggressive inlining vs. default inlining, using J9, IBM's high-performance Java VM
[Chart: per-method speedups]

Experiment Analysis
Aggressive inlining: mixed results
–More slowdowns than speedups
–But there are significant speedups!

Wishful Thinking
Dream: A world without slowdowns
–Default inlining heuristics miss these opportunities to improve performance
Goal: Be aggressive only when it produces a speedup

Approach
Determine whether an optimization improves or degrades performance as the program executes
–For general-purpose applications
–Using VM support (dynamic compilation)
Plan:
–Compile two versions of the code: with and without the optimization
–Measure the performance of both versions
–Use the best-performing version

Benefits
Defense: Avoid slowdowns due to poor optimization decisions
–Sometimes O3 is slower than O2; detect and correct
Offense: Find speedups by searching the optimization space
–Try high-risk optimizations without fear of long-term slowdowns

Challenge
Which implementation is fastest?
–Decide online, without stopping and restarting the program
Can't just invoke each version once and compare times
–Changing inputs, global state, etc.
Example: A sorting routine, where the size of the input determines the run time
–SortVersionA(10 entries) vs. SortVersionB(1,000,000 entries)
–Invocation timings don't reflect the performance of A and B
○Unless we know that input size correlates with runtime
○But that requires high-level understanding of program behavior
Solution: Collect multiple timing samples for each version
–Use statistics to determine how many samples to collect

Timing Infrastructure
[Diagram: each invocation of Sort() randomly dispatches to Version A or Version B; a timer starts on entry and stops at method exit, and the timing is recorded for the chosen version]
Can generalize: doesn't have to be method granularity, and can use more than two versions
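A minimal sketch of this dispatch-and-time loop, assuming the two compiled versions are reachable behind a common interface; the names (VersionedMethod, AuditedDispatcher) and the use of System.nanoTime() in place of a hardware cycle counter are illustrative, not from the paper:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative stand-in for one compiled version of the audited method.
interface VersionedMethod {
    void invoke();
}

// On each invocation, randomly pick a version, time it, and record the
// sample in that version's bucket.
class AuditedDispatcher {
    private final VersionedMethod versionA;   // e.g., default inlining
    private final VersionedMethod versionB;   // e.g., aggressive inlining
    final List<Long> timingsA = new ArrayList<>();
    final List<Long> timingsB = new ArrayList<>();

    AuditedDispatcher(VersionedMethod a, VersionedMethod b) {
        versionA = a;
        versionB = b;
    }

    void invoke() {
        boolean pickA = ThreadLocalRandom.current().nextBoolean();
        long start = System.nanoTime();       // the paper uses the cycle counter
        (pickA ? versionA : versionB).invoke();
        long elapsed = System.nanoTime() - start;
        (pickA ? timingsA : timingsB).add(elapsed);
    }
}
```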

Statistical Analysis
Is A faster than B? How confident are we?
–Use standard statistical hypothesis testing (t-test)
If low confidence, collect more timing data
Statistical timing analysis:
–INPUT: Two sets of method timings (Version A timings, Version B timings)
–OUTPUT: A is faster (or slower) than B by X% with Y% confidence
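To make the analysis concrete, here is a sketch of the comparison as Welch's two-sample t-test. The slide says only "standard statistical hypothesis testing (t-test)", so the choice of Welch's variant is an assumption:

```java
// Sketch of the timing comparison as a Welch's t-test. The caller turns
// the t statistic into a confidence level by comparing it against a
// t-distribution critical value (roughly 1.96 for 95% with many samples).
final class TimingAnalysis {
    static double mean(java.util.List<Long> xs) {
        double s = 0;
        for (long x : xs) s += x;
        return s / xs.size();
    }

    // Unbiased sample variance; needs at least two samples.
    static double variance(java.util.List<Long> xs, double mean) {
        double s = 0;
        for (long x : xs) s += (x - mean) * (x - mean);
        return s / (xs.size() - 1);
    }

    // Positive result: the second set's mean time exceeds the first's,
    // i.e., the first version is faster.
    static double welchT(java.util.List<Long> a, java.util.List<Long> b) {
        double ma = mean(a), mb = mean(b);
        double va = variance(a, ma), vb = variance(b, mb);
        return (mb - ma) / Math.sqrt(va / a.size() + vb / b.size());
    }
}
```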

Time to Converge
How long will it take to reach a confident conclusion?
–Any speedup can be detected with enough timing data
Time to converge depends on:
–Variance in the timing data
○Easy to detect a speedup if the method always does the same amount of work
–Speedup due to the optimization
○Easy to detect big speedups
Fastest convergence for low-variance methods with high speedup
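A textbook normal-approximation power calculation makes this dependence concrete: the samples needed per version grow with the timing variance and shrink with the square of the speedup. The formula below is offered as an illustration of the slide's claim, not as the paper's method:

```java
// Rough estimate of timing samples needed per version before a t-test
// can distinguish the versions; standard power analysis, illustrative only.
final class ConvergenceEstimate {
    // sigma: standard deviation of per-invocation timings
    // delta: difference in mean time between the versions (the speedup)
    // zAlpha, zBeta: normal quantiles for significance and power,
    //                e.g., 1.96 (5% false positives) and 0.84 (80% power)
    static long samplesNeeded(double sigma, double delta,
                              double zAlpha, double zBeta) {
        double z = zAlpha + zBeta;
        return (long) Math.ceil(2.0 * z * z * (sigma * sigma) / (delta * delta));
    }
}
```

Halving the speedup quadruples the samples required, which is why big speedups and low-variance methods converge fastest.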

Fixed Number of Samples
Why not just collect 100 samples?
Experiment: Try to detect an X% speedup with 100 samples
–How often do the samples indicate a slowdown?
Each slowdown detected is a false positive
–The samples do not accurately represent the population

Fixed Number of Samples
[Chart: results of the fixed-sample experiment]

Fixed Number of Samples
Number of samples needed depends on the speedup
–More speedup → fewer samples
Fixed sampling is inefficient
–Suppose we want to maintain a 5% false positive rate
–Could always collect 10k samples, but that wastes time
The statistical approach collects only as many samples as needed to reach a confident conclusion

Prototype Implementation
Prototype online performance auditing system implemented in IBM's J9 Java VM
Currently audits a single optimization
–Experiment with aggressive inlining
–The infrastructure is not tied to aggressive inlining; it can evaluate any single optimization
When a method reaches the highest optimization level:
–Compile two versions of the method (with and without aggressive inlining), collect timing data, run the statistical analysis
–If aggressive inlining produces a quickly detectable speedup, use it; otherwise fall back to default inlining
–A timeout occurs when no confident conclusion is reached within 5 seconds (see the sketch below)
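Building on the AuditedDispatcher and TimingAnalysis sketches above, the per-method decision might look like the loop below. The confidence threshold, minimum sample count, and driving of invocations from the loop itself (in the real system, samples arrive as the application calls the method) are all simplifying assumptions:

```java
// Sketch of the audit decision for one method: sample until the t-test
// is confident or the 5-second timeout expires, then choose a version.
final class Auditor {
    static boolean keepAggressiveInlining(AuditedDispatcher d) {
        final double T_CRITICAL = 1.96;       // ~95% confidence, large samples
        final int MIN_SAMPLES = 30;           // hypothetical warm-up floor
        final long deadline = System.nanoTime() + 5_000_000_000L;  // 5 s

        while (System.nanoTime() < deadline) {
            d.invoke();                       // one more timing sample
            if (d.timingsA.size() < MIN_SAMPLES || d.timingsB.size() < MIN_SAMPLES)
                continue;
            // Positive t: default (A) is slower, so aggressive (B) is faster.
            double t = TimingAnalysis.welchT(d.timingsB, d.timingsA);
            if (t > T_CRITICAL) return true;   // confident speedup: keep it
            if (t < -T_CRITICAL) return false; // confident slowdown: revert
        }
        return false;                          // timeout: fall back to default
    }
}
```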

Results
[Charts: auditing results]

Per-Method Accuracy
[Chart: per-method accuracy of the auditing decisions]

Timeouts
Good news: Few incorrect decisions
Timeouts: Only one timing sample is collected per method invocation
–Most methods are not invoked frequently enough to converge before the timeout
Future work: Reduce timeouts by reducing convergence time
–Collect multiple timings per invocation: use loop iteration times instead of invocation times

Future Work
Audit multiple optimizations and settings
–Search the optimization space online, as the program executes
–The exponential search space is both a challenge and an opportunity
–Apply prior work in offline optimization space search
Use the Performance Auditor to tune the optimization strategy for each method

Summary
Not easy to predict performance
–Should I apply optimization X?
Online Performance Auditing
–Measure code performance as the program executes
Detect slowdowns
–Due to poor optimization decisions
Find speedups
–Use high-risk optimizations without long-term slowdowns
Enable online optimization space search