Online Performance Auditing: Using Hot Optimizations Without Getting Burned. Jeremy Lau (UCSD, IBM), Matthew Arnold (IBM), Michael Hind (IBM), Brad Calder (UCSD)
2 Problem Trend: Increasing complexity of computer systems –Hardware: more speculation and parallelism –Software: more abstraction layers and virtualization Increasing complexity makes it more difficult to reason about performance –Will optimization X improve performance?
3 Increasing Complexity Increasing distance between application and raw performance –Stack on right vs. classic Application-OS-Hardware stack Hard to predict how all layers will react to application-level optimization (Diagram: layered stack of Application, Application Server, Java VM, OS, Hypervisor, Hardware)
4 Heuristics When should I use optimization X? Common solution: Use heuristics Example: Apply optimization X if code size < N –“We believe X will improve performance when code size < N” Determine N by running benchmarks and tuning to maximize average performance But heuristics will miss opportunities to improve performance –Because they are tuned for the average case
5 Experiment Aggressive inlining: 4x inlining thresholds –Allows much larger methods to be inlined Apply aggressive inlining to one hot method at a time Calculate per-method speedups vs. default inlining policy –Use cycle counter to measure performance
6 Experiment Results Per-method speedups: aggressive inlining vs. default inlining (chart) –Using J9, IBM’s high-performance Java VM
7 Experiment Analysis Aggressive inlining: mixed results More slowdowns than speedups But there are significant speedups!
8 Wishful Thinking Dream: A world without slowdowns Default inlining heuristics miss these opportunities to improve performance Goal: Be aggressive only when it produces speedup
9 Approach Determine if optimization improves or degrades performance as program executes –For general purpose applications –Using VM support (dynamic compilation) Plan: –Compile two versions of the code: with and without optimization –Measure performance of both versions –Use best performing version
10 Benefits Defense: Avoid slowdowns due to poor optimization decisions –Sometimes O3 is slower than O2. Detect and correct Offense: Find speedups by searching the optimization space –Try high-risk optimizations without fear of long-term slowdowns
11 Challenge Which implementation is fastest? –Decide online, without stopping and restarting the program Can’t just invoke each version once and compare times –Inputs, global state, etc. change between invocations Example: Sorting routine. Size of input determines run time –SortVersionA(10 entries) vs. SortVersionB(1,000,000 entries) –Individual invocation timings don’t reflect relative performance of A and B ○Unless we know that input size correlates with run time ○But that requires high-level understanding of program behavior Solution: Collect multiple timing samples for each version –Use statistics to determine how many samples to collect
12 Timing Infrastructure (Diagram: Invocation of Sort() → Randomly choose Version A or B → Start timer → Sort() Version A or Sort() Version B → Stop timer → Record timing → Method exit) Can generalize: –Doesn’t have to be method granularity –Can use more than two versions
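The dispatch-and-time flow on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the J9 implementation: the `Auditor` class, the two sort functions, and the use of `time.perf_counter` in place of a hardware cycle counter are all assumptions made for the example.

```python
import bisect
import random
import time
from collections import defaultdict


class Auditor:
    """Hypothetical wrapper: sends each call to a randomly chosen version and times it."""

    def __init__(self, versions, rng=None, clock=time.perf_counter):
        self.versions = dict(versions)      # version name -> callable
        self.names = sorted(self.versions)  # fixed order for reproducible choices
        self.timings = defaultdict(list)    # version name -> list of elapsed times
        self.rng = rng or random.Random()
        self.clock = clock

    def __call__(self, *args, **kwargs):
        name = self.rng.choice(self.names)  # randomly choose a version
        start = self.clock()                # start timer
        try:
            return self.versions[name](*args, **kwargs)
        finally:
            # stop timer and record the sample, even if the call raised
            self.timings[name].append(self.clock() - start)


def sort_version_a(data):
    """Stand-in for 'Sort() Version A': the built-in sort."""
    return sorted(data)


def sort_version_b(data):
    """Stand-in for 'Sort() Version B': insertion into a sorted list."""
    out = []
    for x in data:
        bisect.insort(out, x)
    return out


audited_sort = Auditor({"A": sort_version_a, "B": sort_version_b},
                       rng=random.Random(0))

data_rng = random.Random(1)
for _ in range(200):
    # Input size varies per invocation, as it would in a real program.
    size = data_rng.randint(10, 1000)
    audited_sort([data_rng.random() for _ in range(size)])

for name, samples in sorted(audited_sort.timings.items()):
    print(name, len(samples), "samples")
```

Because the version is chosen at random on every call, both versions see inputs drawn from the same distribution, which is what makes their timing samples comparable. Passing more than two entries in the dictionary covers the "more than two versions" generalization.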
13 Statistical Analysis Is A faster than B? –How confident are we? –Use standard statistical hypothesis testing (t-test) If low confidence, collect more timing data Statistical Timing Analysis INPUT: Two sets of method timings (Version A timings, Version B timings) OUTPUT: A is faster (or slower) than B by X% with Y% confidence
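A minimal sketch of the INPUT/OUTPUT contract above, assuming a Welch-style two-sample statistic and a normal approximation to the t distribution (reasonable for dozens of samples or more). The slides name only "t-test"; the function name `compare_timings` and the exact definition of confidence used here are assumptions for illustration.

```python
import math
import statistics


def compare_timings(a, b):
    """Return (percent by which A is faster than B, confidence in the sign).

    Needs at least two samples per version. A negative percentage means
    A is slower. Confidence is one-sided: 0.5 means "no evidence either way".
    """
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    # Standard error of the difference in means, allowing unequal variances.
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    diff_pct = 100.0 * (mean_b - mean_a) / mean_b
    if se == 0.0:
        # No variation at all: certain if the means differ, undecided otherwise.
        return diff_pct, (1.0 if mean_a != mean_b else 0.5)
    t = (mean_b - mean_a) / se
    # Normal approximation to the t distribution (an assumption of this sketch).
    confidence = statistics.NormalDist().cdf(abs(t))
    return diff_pct, confidence


if __name__ == "__main__":
    version_a = [10.0, 10.2, 9.8, 10.1, 9.9] * 10
    version_b = [11.0, 11.2, 10.8, 11.1, 10.9] * 10
    pct, conf = compare_timings(version_a, version_b)
    print(f"A is faster than B by {pct:.1f}% with {100 * conf:.1f}% confidence")
```

If the returned confidence is below the chosen threshold, the caller would keep collecting timings and run the comparison again.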
14 Time to Converge How long will it take to reach a confident conclusion? –Any speedup can be detected with enough timing data Time to converge depends on: –Variance in timing data ○Easy to detect speedup if method always does the same amount of work –Speedup due to optimization ○Easy to detect big speedups Fastest convergence for low variance methods with high speedup
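The dependence on variance and speedup can be made concrete with the textbook sample-size approximation for comparing two means. This formula is not taken from the slides; it assumes roughly normal timings with equal standard deviation $\sigma$ in both versions, a true difference in mean run time of $\delta$, a one-sided false positive rate $\alpha$, and power $1-\beta$.

```latex
n \;\approx\; \frac{2\,\bigl(z_{1-\alpha} + z_{1-\beta}\bigr)^{2}\,\sigma^{2}}{\delta^{2}}
\qquad \text{(samples per version)}
```

Under these assumptions, halving the speedup $\delta$ or doubling the standard deviation $\sigma$ each quadruples the number of samples needed, which matches the slide's claim that low-variance methods with large speedups converge fastest.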
15 Fixed Number of Samples Why not just collect 100 samples? Experiment: Try to detect an X% speedup with 100 samples How often do the samples indicate a slowdown? Each slowdown detected is a false positive –Samples do not accurately represent the population
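The experiment can be imitated with a small simulation. This is a sketch on synthetic data, not the measurement behind the slides: timings are assumed to be log-normally distributed, the optimized version is assumed to scale every timing by `1 - speedup`, and the function name and parameters are hypothetical.

```python
import random
import statistics


def slowdown_rate(speedup, sigma=0.5, samples=100, trials=1000, seed=1):
    """Fraction of trials in which a fixed-size sample makes a real speedup look like a slowdown."""
    rng = random.Random(seed)
    misleading = 0
    for _ in range(trials):
        # Baseline timings: log-normal noise around a nominal run time.
        base = [rng.lognormvariate(0.0, sigma) for _ in range(samples)]
        # Optimized timings: same noise model, scaled down by the true speedup.
        opt = [(1.0 - speedup) * rng.lognormvariate(0.0, sigma)
               for _ in range(samples)]
        if statistics.fmean(opt) >= statistics.fmean(base):
            misleading += 1  # samples say "slower" although the code is faster
    return misleading / trials


if __name__ == "__main__":
    for s in (0.01, 0.10):
        print(f"true speedup {s:.0%}: slowdown observed in "
              f"{slowdown_rate(s):.1%} of trials")
```

With noisy timings, a small true speedup is reported as a slowdown far more often than a large one at the same sample count, which is why a single fixed sample size is either unreliable or wasteful.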
16 Fixed Number of Samples
17 Fixed Number of Samples Number of samples needed depends on speedup –More speedup → Fewer samples Fixed sampling inefficient –Suppose we want to maintain 5% false positive rate –Could always collect 10k samples, but that wastes time Statistical approach collects only as many samples as needed to reach confident conclusion
18 Prototype Implementation Prototype online performance auditing system implemented in IBM’s J9 Java VM Currently audits a single optimization Experiment with aggressive inlining –Infrastructure is not tied to aggressive inlining. Can evaluate any single optimization When a method reaches highest optimization level: –Compile two versions of the method (with and without aggressive inlining), collect timing data, run statistical analysis If aggressive inlining generates quickly detectable speedup, use it, else fall back to default inlining –Timeout can occur when confident conclusion not reached in 5 seconds
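The audit loop described on this slide might look roughly like the sketch below. Several things here are assumptions rather than details from the prototype: a sample budget stands in for the 5 second timeout, the two versions are sampled alternately instead of randomly, the test uses a normal approximation, and every name (`RunningStats`, `audit`, the verdict strings) is hypothetical. Re-testing after every sample is also a simplification, since repeated checks raise the chance of a false positive above the nominal level.

```python
import math
import random
import statistics


class RunningStats:
    """Incremental mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1)


def audit(time_opt, time_base, confidence=0.95, min_samples=10, budget=5000):
    """Sample both versions until one is confidently faster or the budget runs out."""
    threshold = statistics.NormalDist().inv_cdf(confidence)
    opt, base = RunningStats(), RunningStats()
    while opt.n < budget:
        opt.add(time_opt())
        base.add(time_base())
        if opt.n < min_samples:
            continue
        se = math.sqrt(opt.variance / opt.n + base.variance / base.n)
        z = 0.0 if se == 0.0 else (base.mean - opt.mean) / se
        if z >= threshold:
            return "use optimized", opt.n
        if z <= -threshold:
            return "use default", opt.n
    # No confident conclusion within the budget: fall back to the default.
    return "timeout: use default", opt.n


if __name__ == "__main__":
    rng = random.Random(42)
    verdict, n = audit(lambda: rng.gauss(0.8, 0.05), lambda: rng.gauss(1.0, 0.05))
    print(verdict, "after", n, "samples per version")
```

Falling back to the default version on timeout mirrors the slide's policy: the aggressive version is kept only when its speedup is quickly detectable.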
19 Results
20 Results
21 Per-Method Accuracy
22 Timeouts Good news: Few incorrect decisions Timeouts: Only collect one timing sample for each method invocation –Most methods are not invoked frequently enough to converge before timeout Future work: Reduce timeouts by reducing convergence time –Collect multiple timings per invocation: use loop iteration times instead of invocation times
23 Future Work Audit multiple optimizations and settings –Search the optimization space online, as program executes Exponential search space is both challenge and opportunity Apply prior work in offline optimization space search Use Performance Auditor to tune optimization strategy for each method
24 Summary Not easy to predict performance –Should I apply optimization X? Online Performance Auditing –Measure code performance as the program executes Detect slowdowns –Due to poor optimization decisions Find speedups –Use high-risk optimizations without long-term slowdown Enable online optimization space search