Lengthening Traces to Improve Opportunities for Dynamic Optimization
Chuck Zhao, Cristiana Amza, Greg Steffan (University of Toronto)
Youfeng Wu (Intel Research)
Feb. 16, 2007, Interact-12, HPCA
2 Intel’s StarDBT Project
- StarDBT
  - A Dynamic Binary Translation (DBT) framework
  - Operates on traces, optimizes hot traces
- Long-term goal: use StarDBT to allow legacy apps to exploit transactional memory (TM) support
  - NOT by automatically parallelizing legacy apps
  - Allow speculative sequential optimizations
  - Use hardware TM’s checkpoint/restore
- Problem: default traces are too small
  - TM overheads would overwhelm benefits
- Challenge: lengthening traces can be tricky
3 Trace Formation
[Figure: a basic-block profile over blocks A to G next to the trace profile built from it; A, B, D, F, G are the on-trace blocks, and C and E are reached through off-trace stubs]
- Control flow that goes off-trace can be costly
4 Trade-offs when Lengthening Traces
- Trade-offs:
  - longer traces have more optimization opportunities
  - longer traces have more side-exit branches
- Completion ratio: likelihood of execution staying on trace
  - percentage of executions reaching the trace tail
  - completion ratio = 100% - side-exit ratio
- [Figure: trace A B D F G with two 5% side exits has completion ratio 100% - 10% = 90%; a lengthened trace with a 25% side-exit ratio has completion ratio 100% - 25% = 75%]
- A sweet spot exists in between; can we find it?
5 Our Work So Far (i.e., this talk)
1. Lengthening traces while maintaining completion ratios
   - Through unrolling and straightening
   - A characterization of the impact on trace length, completion ratio, unroll factor, …
2. Improving optimization opportunities on longer traces
   - Improve Local Value Numbering (LVN) hits
   - Measurement of impact on performance is pending
3. Performing on-the-fly actions by the DBT system
   - Decisions made by instrumenting/sampling code online
6 Related Work
- Binary translation systems
  - Dynamo, DynamoRIO, PIN
  - StarDBT: transparent translation of x86 legacy code
- Trace collection and optimizations
  - Java JIT
  - Dynamo, DynamoRIO, Mojo
  - StarDBT: x86 binary level, MRET2 to improve trace formation, aggressive trace optimizations
- First full analysis of trace-lengthening issues for DBT systems
7 StarDBT Trace Types
[Figure: three trace types (self type, other-trace type, elsewhere type), illustrated with traces a, b, c, d and the dispatcher]
8 Lengthening Traces Through Unrolling
- Unrolling increases a trace’s length but reduces its completion ratio
- [Figure: trace a with 1, 2, and 3 copies; completion ratio: 90%, 81%, 72.9%]
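The three percentages on this slide are consistent with a simple model in which every unrolled copy of the trace completes with the same probability p, independently of the others. Under that assumption, the completion ratio of a trace unrolled to k copies is:

```latex
\[
  p_k = p^{k}, \qquad
  p = 0.9 \;\Rightarrow\; p_1 = 0.9,\quad p_2 = 0.9^2 = 0.81,\quad p_3 = 0.9^3 = 0.729 .
\]
```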
9 Finding the Sweet-Spot Unroll Factor
Given p_orig = 99% and p_target = 90%:

  Unroll factor    Completion ratio
  1                p     (0.99)
  2                p^2   (0.98)
  3                p^3   (0.97)
  …                …
  N (10)           p^10  (0.904)
  N (11)           p^11  (0.895)

- Pick the largest N with p^N >= p_target: here N = 10, since p^11 falls below 90%
- Traces with 100% completion ratio: set N = 10 (chosen by system designer)
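The selection rule in this table can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `choose_unroll_factor` and the parameter `max_unroll` are hypothetical, and applying the cap of 10 to every trace (not only to traces with a 100% completion ratio) is a simplification made here.

```python
def choose_unroll_factor(p_orig: float, p_target: float, max_unroll: int = 10) -> int:
    """Return the largest N with p_orig**N >= p_target, capped at max_unroll.

    Assumes each unrolled copy completes independently with probability p_orig.
    A trace whose original ratio is already below the target is left at N = 1.
    """
    if not (0.0 < p_orig <= 1.0) or not (0.0 < p_target <= 1.0):
        raise ValueError("completion ratios must be in (0, 1]")
    if max_unroll < 1:
        raise ValueError("max_unroll must be at least 1")

    n = 1
    # Keep adding copies while one more copy still meets the target.
    while n < max_unroll and p_orig ** (n + 1) >= p_target:
        n += 1
    return n


if __name__ == "__main__":
    print(choose_unroll_factor(0.99, 0.90))
```

With the slide's numbers (p_orig = 0.99, p_target = 0.90) the loop stops at N = 10 even without the cap, because 0.99^11 is about 0.895. The cap matters only for traces whose completion ratio is at or very near 100%, where the loop would otherwise not terminate early.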
10 Lengthening Traces Through Straightening
[Figure: straightening example with traces b, c, and d]
- We don’t yet implement/evaluate straightening
11 Evaluation
12 Distribution of Original Completion Ratios
[Figure: distribution of hot traces by original completion ratio]
- The majority of hot traces have completion ratios in 90%-100%
13 Impact of Unrolling on Hot Trace Size
[Figure: average number of instructions per hot trace vs. completion ratio; lengthened traces are 36% longer]
- Lengthening increases hot trace size by more than 36%
- Selected SPECint CPU2000 benchmarks with MinneSPEC inputs
14 How Much are Traces Unrolled?
[Figure: average unroll factor vs. target completion ratio, with a "not unrolled" baseline]
- Hot traces are unrolled on average by 1.38x or more
15 Average Completion Ratio After Lengthening
[Figure: average completion ratio after lengthening; axes labeled completion ratio, ticks from 10% to 90%]
- Lengthening traces reduces completion ratio by < 0.5%
16 Impact of Lengthening on Optimizations
17 Local Value Numbering (LVN)
- No need to build a Control Flow Graph (CFG)
  - Partial info
- No need to perform Data Flow Analysis (DFA)
  - Expensive, relies on CFG
- Can be arranged into a single-pass scan
  - Ease of implementation
  - Relatively lightweight algorithm
- Performs three optimizations:
  - Common Subexpression Elimination (CSE)
  - Copy Propagation (CP)
  - Dead-Code Elimination (DCE)
- LVN is common in JIT optimizers
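18 Ex: LVN On a Lengthened Trace

Original traces:
  Trace 1:   … ; c = a + b ; d = a ; e = b
  Trace 2:   f = d + e ; d = x ; …

Lengthened trace, annotated with value numbers (written ^n):
  …
  c^3 = a^1 + b^2
  d^1 = a^1                          DCE hit (d is overwritten by d = x before any use)
  e^2 = b^2
  f^3 = d^1 + e^2  =>  f^3 = c^3     CSE hit (d + e has the same value number as a + b)
  d^4 = x^4
  …

Optimized trace:
  … ; c = a + b ; e = b ; f = c ; d = x ; …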
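The example can be reproduced with a small Python sketch of value numbering over a straight-line trace. It is a simplified illustration, not StarDBT's implementation: the tuple instruction format and all names (`Instr`, `lvn`, `holder`) are hypothetical, operations are assumed to be pure, and every variable is assumed live at the trace end. It also ignores side exits, which in a real trace would keep more definitions alive. For clarity it uses a forward numbering pass followed by a backward dead-code sweep rather than a single scan.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical three-address form: (dest, op, operands).
# ("c", "+", ("a", "b")) means c = a + b; ("d", "copy", ("a",)) means d = a.
Instr = Tuple[str, str, Tuple[str, ...]]
COMMUTATIVE = {"+", "*"}


def lvn(trace: List[Instr]) -> Tuple[List[Instr], int, int]:
    """Return (optimized trace, CSE hits, DCE hits) for a straight-line trace."""
    var_vn: Dict[str, int] = {}                           # variable -> value number held now
    expr_vn: Dict[Tuple[str, Tuple[int, ...]], int] = {}  # (op, operand VNs) -> value number
    counter = 0
    cse_hits = 0
    numbered: List[Instr] = []

    def fresh() -> int:
        nonlocal counter
        counter += 1
        return counter

    def vn_of(var: str) -> int:
        if var not in var_vn:            # first use of a trace input: new value number
            var_vn[var] = fresh()
        return var_vn[var]

    def holder(vn: int, exclude: str) -> Optional[str]:
        # A variable (other than the one being overwritten) that still holds vn.
        for var, n in var_vn.items():
            if n == vn and var != exclude:
                return var
        return None

    # Forward pass: number values; a copy shares its source's number, which is
    # what lets f = d + e match the earlier a + b.
    for dest, op, args in trace:
        if op == "copy":
            vn = vn_of(args[0])
            numbered.append((dest, op, args))
        else:
            nums = [vn_of(a) for a in args]
            if op in COMMUTATIVE:
                nums.sort()
            key = (op, tuple(nums))
            known = expr_vn.get(key)
            src = holder(known, dest) if known is not None else None
            if src is not None:          # CSE hit: reuse the earlier result
                numbered.append((dest, "copy", (src,)))
                cse_hits += 1
                vn = known
            else:
                vn = known if known is not None else fresh()
                expr_vn[key] = vn
                numbered.append((dest, op, args))
        var_vn[dest] = vn

    # Backward pass: a definition is dead if its destination is overwritten
    # later in the trace before being read.
    overwritten = set()
    kept: List[Instr] = []
    dce_hits = 0
    for dest, op, args in reversed(numbered):
        if dest in overwritten:
            dce_hits += 1
            continue
        kept.append((dest, op, args))
        overwritten.add(dest)
        overwritten.difference_update(args)
    kept.reverse()
    return kept, cse_hits, dce_hits
```

Fed the five instructions of the lengthened trace (`c = a + b`, `d = a`, `e = b`, `f = d + e`, `d = x`), the sketch rewrites `f = d + e` into `f = c` and drops `d = a`, which is one CSE hit and one DCE hit. Neither hit is available when the two original traces are optimized separately.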
19 LVN Hits Improvement (%)
[Figure: % increase in LVN hits vs. target completion ratio; y-axis from 5% to 35%]
- 10+% more LVN hits are available through lengthening
20 Ongoing Work
- Complete DBT optimization framework
- Evaluate speculative optimizations on long hot traces with high completion ratios
- Automatically determine optimal transaction granularity
- Use HTM to support trace-based speculative optimizations
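21 Control Speculation
[Figure: after cmp …, the path through ld x=[y] is taken 90+% of the time and the other path 10-%]
Speculative version:
  ld.s x = [y]
  if (c) {
    chk.s x, recovery
  next:
    …
  }
  recovery:
    ld x = [y]
    jmp next
Source: A Compiler Framework for Speculative Analysis and Optimizations, Lin et al., PLDI 03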
22 Use HTM to Support Trace-based Speculative Optimizations
[Figure: after cmp …, the path through ld x=[y] is taken 90+% of the time and the other path 10-%]
Transactional version:
  start_tx
  ld x = [y]
  if (c) {
    chk x, abort_tx
    …
  }
  commit_tx
- Use longer traces with high completion ratio as tx granularity
- HTM hardware support simplifies speculative optimization
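The checkpoint/abort/commit pattern on this slide can be modeled in software. The toy Python sketch below only illustrates the control flow: real HTM checkpoints and restores state in hardware, and the names `run_transaction`, `AbortTx`, `speculative_trace`, and `original_code`, as well as the dictionary used as machine state, are hypothetical.

```python
import copy
from typing import Callable, Dict

State = Dict[str, int]


class AbortTx(Exception):
    """Raised by speculative code when a check fails (stands in for abort_tx)."""


def run_transaction(state: State,
                    speculative: Callable[[State], None],
                    fallback: Callable[[State], None]) -> bool:
    """Run `speculative`; on abort, restore the checkpoint and run `fallback`.

    Returns True if the speculative version committed.
    """
    checkpoint = copy.deepcopy(state)   # start_tx: hardware would take the checkpoint
    try:
        speculative(state)
    except AbortTx:
        state.clear()
        state.update(checkpoint)        # abort_tx: discard every speculative write
        fallback(state)
        return False
    return True                         # commit_tx: speculative writes become visible


def speculative_trace(s: State) -> None:
    # Optimized for the 90+% path: the load is hoisted above the branch.
    s["x"] = s["mem_y"]
    s["z"] = s["x"] + 1
    if not s["on_trace"]:               # the assumption was wrong: roll back
        raise AbortTx()


def original_code(s: State) -> None:
    # Unoptimized code: the load happens only on the path that needs it.
    if s["on_trace"]:
        s["x"] = s["mem_y"]
        s["z"] = s["x"] + 1


if __name__ == "__main__":
    state = {"mem_y": 41, "on_trace": 1}
    print(run_transaction(state, speculative_trace, original_code), state)
```

The optimizer never has to generate recovery code such as the `recovery:` block needed for `ld.s`/`chk.s`. Because an abort undoes the whole transaction, the translator can fall back to the unoptimized code, which is why a trace with a high completion ratio is a natural transaction boundary.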
23 Conclusion
- Traces can be effectively lengthened
  - increase in trace size by 36+%
  - decrease in completion ratio by less than 0.5%
- Longer traces provide better opportunities for optimization
  - increase in LVN hits by 10+%
24 Q + A
25 Complete StarDBT Optimization Framework
- x86 CISC ISA
  - code patching won’t work
  - really need a code generator and IR
- Design + implement a low-level runtime IR
  - close to hardware
  - capture + represent all necessary low-level info
  - easy to convert from/to machine code
  - easy to implement analysis and optimizations
- Starting point
  - Dynamo IR
  - LLVM IR
  - GCC RTL
  - …
26 StarDBT Overall Structure
27 Trace Formation Heuristics
- MRET: Most Recently Executed Tail
  - originally proposed by Dynamo
  - Trace head
    - loop head (backward branch target)
    - sampling counter reaches a certain threshold
  - Trace tail
    - satisfies certain trace-tail conditions
- MRET2: 2-pass MRET
  - perform 2 independent MRET trace formations
  - intersect traces with a common head
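The slide does not spell out what "intersect" means for two traces that share a head. One plausible reading, assumed here, is to keep the blocks on which both MRET passes agree, starting from the head, which is the longest common prefix. The function name `intersect_traces` and the list-of-block-names representation are hypothetical.

```python
from typing import List, Sequence


def intersect_traces(first: Sequence[str], second: Sequence[str]) -> List[str]:
    """Longest common prefix of two traces, each given as a list of block names.

    Traces with different heads have nothing in common, so the result is empty.
    """
    common: List[str] = []
    for a, b in zip(first, second):
        if a != b:          # the two passes diverged: stop the trace here
            break
        common.append(a)
    return common


if __name__ == "__main__":
    print(intersect_traces(["A", "B", "D", "F", "G"], ["A", "B", "D", "E", "G"]))
```

Under this reading, a block survives only if both independent passes followed the same path to it, which filters out a tail that one pass happened to record on an unrepresentative execution.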
28 Traces and Hot Traces
- Trace
  - MRET2 recognizes trace heads
  - Trace tails satisfy certain conditions
  - Blocks in between become a trace
- Hot trace
  - Based on recognized traces
  - Put in additional software counters
    - head: head counter
    - each early-exit branch: off-trace counters
  - sampling: hot trace’s completion ratio
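Given these counters, a hot trace's completion ratio can be estimated as the fraction of head executions that did not leave through an early exit. The sketch below assumes every execution counted at the head either takes exactly one side exit or reaches the tail. The function name `completion_ratio` and its parameters are hypothetical.

```python
from typing import Sequence


def completion_ratio(head_count: int, off_trace_counts: Sequence[int]) -> float:
    """Fraction of executions entering at the head that reach the trace tail."""
    if head_count <= 0:
        raise ValueError("head counter must be positive")
    exits = sum(off_trace_counts)
    if exits > head_count:
        raise ValueError("side exits cannot exceed head executions")
    return (head_count - exits) / head_count


if __name__ == "__main__":
    # Two side exits, each taken on 5% of 1000 head executions.
    print(completion_ratio(1000, [50, 50]))
```

This matches the earlier definition of completion ratio as 100% minus the side-exit ratio: two exits at 5% each leave a 90% completion ratio.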