Correcting the Dynamic Call Graph Using Control Flow Constraints

Slides:



Advertisements
Similar presentations
P3 / 2004 Register Allocation. Kostis Sagonas 2 Spring 2004 Outline What is register allocation Webs Interference Graphs Graph coloring Spilling Live-Range.
Advertisements

Stanford University CS243 Winter 2006 Wei Li 1 Register Allocation.
Register Allocation CS 671 March 27, CS 671 – Spring Register Allocation - Motivation Consider adding two numbers together: Advantages: Fewer.
Evaluating Heuristics for the Fixed-Predecessor Subproblem of Pm | prec, p j = 1 | C max.
Counting the bits Analysis of Algorithms Will it run on a larger problem? When will it fail?
Zhiguo Ge, Weng-Fai Wong, and Hock-Beng Lim Proceedings of the Design, Automation, and Test in Europe Conference, 2007 (DATE’07) April /4/17.
Fast Algorithms For Hierarchical Range Histogram Constructions
Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
ABCD: Eliminating Array-Bounds Checks on Demand Rastislav Bodík Rajiv Gupta Vivek Sarkar U of Wisconsin U of Arizona IBM TJ Watson recent experiments.
Probabilistic Calling Context Michael D. Bond Kathryn S. McKinley University of Texas at Austin.
Heap Shape Scalability Scalable Garbage Collection on Highly Parallel Platforms Kathy Barabash, Erez Petrank Computer Science Department Technion, Israel.
1 Stochastic Event Capture Using Mobile Sensors Subject to a Quality Metric Nabhendra Bisnik, Alhussein A. Abouzeid, and Volkan Isler Rensselaer Polytechnic.
Online Performance Auditing Using Hot Optimizations Without Getting Burned Jeremy Lau (UCSD, IBM) Matthew Arnold (IBM) Michael Hind (IBM) Brad Calder (UCSD)
The Use of Traces for Inlining in Java Programs Borys J. Bradel Tarek S. Abdelrahman Edward S. Rogers Sr.Department of Electrical and Computer Engineering.
Recap from last time We were trying to do Common Subexpression Elimination Compute expressions that are available at each program point.
Wish Branches A Review of “Wish Branches: Enabling Adaptive and Aggressive Predicated Execution” Russell Dodd - October 24, 2006.
Previous finals up on the web page use them as practice problems look at them early.
Schedule Midterm out tomorrow, due by next Monday Final during finals week Project updates next week.
Adaptive Optimization in the Jalapeño JVM M. Arnold, S. Fink, D. Grove, M. Hind, P. Sweeney Presented by Andrew Cove Spring 2006.
Design Space Exploration
Program Development Life Cycle (PDLC)
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
P ath & E dge P rofiling Michael Bond, UT Austin Kathryn McKinley, UT Austin Continuous Presented by: Yingyi Bu.
Chapter 12 Recursion, Complexity, and Searching and Sorting
1 Fast and Efficient Partial Code Reordering Xianglong Huang (UT Austin, Adverplex) Stephen M. Blackburn (Intel) David Grove (IBM) Kathryn McKinley (UT.
Stochastic DAG Scheduling using Monte Carlo Approach Heterogeneous Computing Workshop (at IPDPS) 2012 Extended version: Elsevier JPDC (accepted July 2013,
Dynamic Object Sampling for Pretenuring Maria Jump Department of Computer Sciences The University of Texas at Austin Stephen M. Blackburn.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
ABCD: Eliminating Array-Bounds Checks on Demand Rastislav Bodík Rajiv Gupta Vivek Sarkar U of Wisconsin U of Arizona IBM TJ Watson recent experiments.
Targeted Path Profiling : Lower Overhead Path Profiling for Staged Dynamic Optimization Systems Rahul Joshi, UIUC Michael Bond*, UT Austin Craig Zilles,
Lecture notes for Stat 231: Pattern Recognition and Machine Learning 1. Stat 231. A.L. Yuille. Fall 2004 AdaBoost.. Binary Classification. Read 9.5 Duda,
380C lecture 19 Where are we & where we are going –Managed languages Dynamic compilation Inlining Garbage collection –Opportunity to improve data locality.
Practical Path Profiling for Dynamic Optimizers Michael Bond, UT Austin Kathryn McKinley, UT Austin.
1 Garbage Collection Advantage: Improving Program Locality Xianglong Huang (UT) Stephen M Blackburn (ANU), Kathryn S McKinley (UT) J Eliot B Moss (UMass),
Targeted Path Profiling : Lower Overhead Path Profiling for Staged Dynamic Optimization Systems Rahul Joshi, UIUC Michael Bond*, UT Austin Craig Zilles,
08120: Programming 2: SoftwareTesting and Debugging Dr Mike Brayshaw.
A Region-Based Compilation Technique for a Java Just-In-Time Compiler Toshio Suganuma, Toshiaki Yasue and Toshio Nakatani Presenter: Ioana Burcea.
U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Method Profiling John Cavazos University.
High Performance Embedded Computing © 2007 Elsevier Lecture 10: Code Generation Embedded Computing Systems Michael Schulte Based on slides and textbook.
Michael J. Voss and Rudolf Eigenmann PPoPP, ‘01 (Presented by Kanad Sinha)
1 The Garbage Collection Advantage: Improving Program Locality Xianglong Huang (UT), Stephen M Blackburn (ANU), Kathryn S McKinley (UT) J Eliot B Moss.
Improving Multi-Core Performance Using Mixed-Cell Cache Architecture
Jacob R. Lorch Microsoft Research
Handouts Software Testing and Quality Assurance Theory and Practice Chapter 4 Control Flow Testing
Cork: Dynamic Memory Leak Detection with Garbage Collection
Compositional Pointer and Escape Analysis for Java Programs
Verification and Testing
Online Subpath Profiling
White-Box Testing.
A Framework for Automatic Resource and Accuracy Management in A Cloud Environment Smita Vijayakumar.
CSCI1600: Embedded and Real Time Software
Algorithm An algorithm is a finite set of steps required to solve a problem. An algorithm must have following properties: Input: An algorithm must have.
Instructor: Shengyu Zhang
White-Box Testing.
Adaptive Code Unloading for Resource-Constrained JVMs
Java Programming Loops
Christophe Dubach, Timothy M. Jones and Michael F.P. O’Boyle
Inlining and Devirtualization Hal Perkins Autumn 2011
Inlining and Devirtualization Hal Perkins Autumn 2009
Smita Vijayakumar Qian Zhu Gagan Agrawal
Adaptive Optimization in the Jalapeño JVM
Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt
Control Structure Testing
Dongyun Jin, Patrick Meredith, Dennis Griffith, Grigore Rosu
Ensemble learning.
Java Programming Loops
Garbage Collection Advantage: Improving Program Locality
CSCI1600: Embedded and Real Time Software
08120: Programming 2: SoftwareTesting and Debugging
Presentation transcript:

Correcting the Dynamic Call Graph Using Control Flow Constraints Byeongcheol (BK) Lee Kevin Resnick Michael Bond Kathryn McKinley UT Austin This is a joint work with three coauthors. Kathryn: Memorize & Say something.

Motivation Complexity of large object oriented programs Decompose the program into small methods Method boundary becomes performance-bottleneck Dynamic interprocedural optimization Solve the method boundary problem Inlining and specialization vary the performance by factor of 2 Dynamic call graph (DCG) is critical input! The motivation for this work is that OOP programs are becoming large and complex. To manage this complexity, programmers tend to decompose the program into smaller modules until each module becomes managable. This increases the number of virtual calls to small methods. This is really good in terms of software maintenance, but this is really bad in terms of performance. Inter-procedural optimizations such as inlining and specialization solve this problem. These optimizations can impact the performance by factor of 2, and the dynamic call graph is critical input. But what if this call graph contains error? Or what if the call graph gives wrong information? This may degrade performance by fact of 2. This is real problem in the online optimization system. Points * In Large and complex OOP, Inter-procedural OPT is important. * Inter-procedural OPT needs DCG. * DCG could be wrong and misleading in online system. b c a w1 w2 Dynamic call graph

Inaccurate call graph b a c DCGSample Error method a call b call c 1,000 500 Error Let me talk about call graph. Suppose there is a method a like here. Suppose that this method a calls a method b 5,000 times and a method c 10,000 times. This method call statistics is represented by labeled graph here, and this is the call graph for the example program. This call graph says that there are three methods a, b, and c. There are 5,000 calls from a to b, and 10.000 calls from a to c. Now, the question is how we get this call graph and how we use this call graph in online optimization system

Timer-based sampling and timing bias Call stack … … c b c c b c c b c c c c b a a t Let’s talk about how call graph sampling loses its accuracy. For each 10-20ms time tick, the program stops its normal execution, and looks at its current call stack. For instance, this timer tick here takes a call sample from the procedure a to the procedure b, and increments a counter in the call graph. The program repeats this sampling continuously until the end of the program. However, since the program spends more time on the procedure b than the procedure c, the call to b has higher chance of being sampled than call to c

Timer-based sampling and timing bias Call stack … … c b c c b c c b c c c c b a a t Let’s talk about how call graph sampling loses its accuracy. For each 10-20ms time tick, the program stops its normal execution, and looks at its current call stack. For instance, this timer tick here takes a call sample from the procedure a to the procedure b, and increments a counter in the call graph. The program repeats this sampling continuously until the end of the program. However, since the program spends more time on the procedure b than the procedure c, the call to b has higher chance of being sampled than call to c

Timer-based sampling and timing bias Call stack … … c b c c b c c b c c c c b a a t Let’s talk about how call graph sampling loses its accuracy. For each 10-20ms time tick, the program stops its normal execution, and looks at its current call stack. For instance, this timer tick here takes a call sample from the procedure a to the procedure b, and increments a counter in the call graph. The program repeats this sampling continuously until the end of the program. However, since the program spends more time on the procedure b than the procedure c, the call to b has higher chance of being sampled than call to c

Timer-based sampling and timing bias Call stack … … c b c c b c c b c c c c b a a t Let’s talk about how call graph sampling loses its accuracy. For each 10-20ms time tick, the program stops its normal execution, and looks at its current call stack. For instance, this timer tick here takes a call sample from the procedure a to the procedure b, and increments a counter in the call graph. The program repeats this sampling continuously until the end of the program. However, since the program spends more time on the procedure b than the procedure c, the call to b has higher chance of being sampled than call to c

Timer-based sampling and timing bias timer tick 1011 timer tick b c a 5 timer tick b c a 11 56 … timer tick b c a 9991000 500 Call stack … … c b c c b c c b c c c c b a a t … DCGSample b c a 910 5 Let’s talk about how call graph sampling loses its accuracy. For each 10-20ms time tick, the program stops its normal execution, and looks at its current call stack. For instance, this timer tick here takes a call sample from the procedure a to the procedure b, and increments a counter in the call graph. The program repeats this sampling continuously until the end of the program. However, since the program spends more time on the procedure b than the procedure c, the call to b has higher chance of being sampled than call to c

Overhead and accuracy in call graph profiling Full instrumentation 25 Arnold-Grove sampling [2005] 20 Overhead (%) 15 10 The full instrumentation has the perfect accuracy, but it’s too expensive for the dynamic compilation. So, dynamic compilation typically uses timer-based sampling to keep the overhead low, but it loses accuracy up to 60%. To improve this accuracy, timer-based sampling was extended to Arnold-Grove sampling. For 70% accuracy, the overhead is minimal, but for more than 80% accuracy, the overhead suddenly increases more than 10%, which is unacceptable in dynamic compilation Our call graph correction improves the accuracies of these samplings with minimal overhead, achieving up to 85% accuracy. Correction [2007] 5 Timer-based sampling [2000] 40 60 80 100 Accuracy (%)

Outline Motivation Call graph correction Evaluation

Timing bias in SPEC JVM98 raytrace Sampling Normalized frequency(%) This is a timing bias in the raytrace benchmark. This graph presents call edge frequencies in the raytrace. The area shows the actual values, which are grouped by their source method with different colors. The diamonds are information from sampling. We can observe that some have too many samples than the actual value and others have less. Also, we can observe that many calls within the source method have the same actual frequencies, perhaps because they are in the same basic block. This implies that the basic block profiles constraint the call graph frequencies. Method calls grouped by source method

Timing bias in SPEC JVM98 raytrace Normalized frequency(%) This is a timing bias in the raytrace benchmark. This graph presents call edge frequencies in the raytrace. The area shows the actual values, which are grouped by their source method with different colors. The diamonds are information from sampling. We can observe that some have too many samples than the actual value and others have less. Also, we can observe that many calls within the source method have the same actual frequencies, perhaps because they are in the same basic block. This implies that the basic block profiles constraint the call graph frequencies. Method calls grouped by source method

Correction algorithms Detect and correct DCG error DCG constraint Static and dynamic approaches Static FDOM (Frequency dominator) correction Static approach Uses static FDOM constraint on DCG Dynamic basic block profile correction Dynamic approach Uses dynamic basic block profile constraint on DCG New All of our correction algorithms modify the call graph frequencies to satisfy CFG constraints. This works because the CFG restricts possible call graph. We have static and dynamic approaches. Static FDOM correction uses our novel FDOM constraint on DCG. FDOM is the abbreviation frequency dominator, which we defined to capture relative basic block frequency. Dynamic basic block profile correction is dynamic approach to exploit dynamic CFG edge profiles.

Static FDOM constraint FDOM constraint on CFG call c is executed at least as many times as call b call c FDOM call b FDOM constraint on DCG f( ) ≥ f( ) call b call c Let me start with the static approach When we look at the control flow graph for the procedure a, we notice that the red basic block executes at least as many times as the yellow block. Intuitively, this is because whenever yellow block executes, the red block executes before or after. This constraint is formally presented as a binary relation on basic blocks or call sites. We have a near-linear algorithm to compute this relation in our technical report. This static control flow graph constraints implies another constraint on the call graph. The call edge count from a to c is at least as many times as the count from a to b. c a b a method a

Static FDOM correction FDOM constraint: f( ) ≥ f( ) c a b a b b Correction 1,000 750 a c a c 500 750 To satisfy the static constraint, We use a simple solution of taking the max frequency if there is a conflict. Then we scale the frequency to reserve the total count before and after the correction. In the call graph from sampling on the left side, the FDOM constraint says that call counts to c is as least the counts to b. So, there is a conflict. Assigning 1,000 to the call to c resolves the conflict. Then, we scale the resulting counts by the scale factor 0.75, which was chosen to preserve total counts before and after the correction. Both calls to b and c have the same frequency, which is 750. Now, the corrected DCG is more similar to the perfect call graph. DCGSample DCGFDOMCorrection Detect error and assign the same average frequency One possible solution to the FDOM constraint Preserve total frequency sum

Dynamic basic block profile constraint Some dynamic optimization systems do edge profiling Baseline compiler in Jikes RVM Dynamic basic block profile constraint on CFG f(call c) = 2 * f(call b) Dynamic basic block profile constraint on DCG f( ) = 2 * f( ) 50% 50% call b call c We also use the edge profiles, and part of these profiles are available in the dynamic optimizers. In the procedure a annotated with the branch probability, the edge profiles say that that the red block executes 2 times as many as the yellow block. This CFG constraint implies that the call to c happened 2 times as many as the call to b. c a b a method a

Dynamic basic block profile correction Constraint: f( ) = 2* f( ) c a b b b Correction 1,000 500 a c 500 a 1,000 c DCGSample DCGEdgeProfileCorrection Now, our goal is to satisfy the dynamic basic block profile constraint and preserve the total count sum. We compute normalized ratio for each call edge from the basic block profile constraint. This ratio for call to c is 2/3, and the call to b has 1/3. Finally, we multiply the normalized ratio and the total frequency and for each call. The corrected call count for the call to c is 1,000, and the call to b has 500. Now, the call graph has perfect match with the perfect call graph. fNew( ) = 1/(1+2) * (1,000+500) = 500 fNew( ) = 2/(1+2) * (1,000+500) = 1,000 b a c a

Best result: raytrace Static FDOM correction Sampling Dynamic basic block profile correction Here is the result with the static and dynamic approaches for raytrace Static approach was effective at correcting errors within the same callers. But the errors across the different callers still exist. Dynamic approach corrected this type of errors, and achieves better call graph. Sampling

Outline Motivation Call graph correction Evaluation

Experimental methodology Jikes RVM 2.4.5 on 3.2G Pentium 4 Replay methodology [Blackburn et al. ‘06] Deterministic run 1st iteration – compilation + application run 2nd iteration – application run Measurement Accuracy Use overlap accuracy [Arnold & Grove ’05] Overhead 1st iteration includes call graph correction Performance 2nd iteration is application-only SPECJVM98 and DaCapo benchmarks To evaluate our correction algorithms in dynamic optimization, we use Jikes RVM 2.4.5 on 3.2 Pentinum 4 machine. we take the overlap accuracy, which compares the resulting call graph with the one from full instrumentation. Then, we use the replay methodology to measure the overhead and performance improvement. This replay methodology eliminates the nodeterminstic optimization decisions in the adaptive system. We use SPEJVM98 and DaCapo benchmarks, which are extensively used in other dynamic optimization researches.

Accuracy Here is the accuracy result. The static approach improves the average accuracy by 6% over the timer based sampling. The dynamic approach adds 13% improvement, resulting in 21% over the baseline.

Overhead This is the overhead of call graph corrections. For both cases, we did not observe any outstanding overhead. Antlr and hsqldb have up to 5% speedup, which is abnormal. We found that this is related to different heap exercising behavior during the correction. The dynamic approach has 1% average overhead, and this is mostly from the additional instrumentation to improve the edge profiles in our framework. Note that the correction happens only during the code optimization not in the normal execution. Therefore, the call graph correction overhead itself is very minial.

Inlining performance Static and dynamic approaches had 2% modest performance improvement. To measure the performance upper bound, we ran the experiment with the perfect call graph, and this improves performance by only 3%. This is because the inlining is the only client in our system and this inlining heustrictics have been developed with the inprecise call graph. The raytrace shows the best improvement. The raytrace contains many small virtual calls, and small variation in the inlining decision has large impact on the performance. Baseline: profile-guided inlining with default call graph sampling

Summary CFG constraint improves the DCG Inlining has been tuned for bad call graph Advantages Can be easily combined with other DCG profiling Minimal overhead only during the compilation Future work More inter-procedural optimizations with high accuracy DCG

Question and comment Thank you! Thank you for your attention

Timing bias misleads optimizer 5,000 times Sampling with timing bias 1,000 samples a c a c 10,000 times 500 samples DCGPerfect DCGSample DCGSample Edge frequencies were reversed! Inlining decision Inliner may inline b instead of c This timing bias may cause wrong optimization decision. In the perfect call graph on the left side, call to c is more frequent than the call to b, but this is reversed by the sampling. With this wrong information, the optimizer would make mistake. For instance, the optimizer may inline a call to b instead of the call to c by relying on the sampling.

Call graph profiling in online optimization system Source program e.g. Java byte code Compile & instrument Machine code Dynamic call graph Online optimization system Point DCG is inaccurate in online system This is because the performance include the DCG profiling overhead Profiling and program run at the same time Minimize profiling overhead Corollary: sacrifice profiling accuracy