1 Fast and Efficient Partial Code Reordering Xianglong Huang (UT Austin, Adverplex) Stephen M. Blackburn (Intel) David Grove (IBM) Kathryn McKinley (UT Austin)
2 Software Trends By 2008, 80% of software will be written in Java or C# [Gartner report]. Java and C# are coming to your OS soon: JNode, Singularity. Advantages of modern programming languages: –Productivity, security, reliability… Performance?
3 Hardware Trends [Figure: performance over time for CPU and DRAM. CPU: 2X / 1.5 yr. DRAM: 2X / 10 yrs.] Processor-Memory Performance Gap: grows 50% / year
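The "grows 50% / year" figure follows from the two doubling periods on this slide. A minimal worked check, assuming both trends are smooth exponentials (the variable and function names are illustrative, not from the talk):

```python
# Annual growth factors implied by the doubling periods on the slide.
CPU_DOUBLING_YEARS = 1.5
DRAM_DOUBLING_YEARS = 10.0

cpu_growth = 2 ** (1 / CPU_DOUBLING_YEARS)    # roughly 1.59x per year
dram_growth = 2 ** (1 / DRAM_DOUBLING_YEARS)  # roughly 1.07x per year

# The gap is the ratio of the two curves, so it grows by their ratio each year.
gap_growth = cpu_growth / dram_growth         # roughly 1.48x, i.e. about 50% per year


def gap_after(years: float) -> float:
    """Relative processor-memory gap after `years`, starting from a gap of 1."""
    return gap_growth ** years


if __name__ == "__main__":
    print(f"gap grows {100 * (gap_growth - 1):.0f}% per year")
    print(f"gap after 10 years: {gap_after(10):.1f}x")
```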
4 Improvement Potential Base case: JikesRVM default with separate code space. Cache configuration: 32K direct-mapped IL1, 512K L2 (small programs on a big cache)
5 New and Better Opportunities Virtual machine monitors application behavior at runtime Dynamic recompilation –With dynamic feedback –Allocates instructions at runtime
6 Previous Work on Instruction Locality Static schemes –Profile call correlation statically and reorder code at compile and link time [Pettis and Hansen 90] –Cache coloring [Hashemi et al. 97] –Profile procedure interleaving [Gloy et al. 99] –Static schemes are not flexible Dynamic scheme –JIT code reordering [Chen et al. 97] –Used as our base case
7 Optimizations in Virtual Machine Static instruction allocation used at runtime –e.g. just-in-time compilation –Invocation order [Diagram: Compiler, Memory Manager, Runtime; Static Optimizations]
8 Optimizations in Virtual Machine Dynamic instruction allocation/reordering adapts to the program behavior with low overhead [Diagram: Compiler, Memory Manager, Runtime; Static Optimization]
9 Opportunity for Instruction Locality Dynamic detection of hot methods, hot basic blocks Dynamic recompilation relocates methods at runtime
10 PCR Optimizations Reduce instruction capacity misses –Code space –Method separation –Code splitting Reduce instruction conflict misses –Code padding
11 PCR System [Diagram. Legend: JikesRVM component, Input/Output. Labels: Source Code, Baseline Compiler, Executing Code, Adaptive Sampler, Hot Methods, Optimizing Compiler; heap: Data, Baseline method, Optimized method]
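The sampler's role in this pipeline, picking out hot methods so the optimizing compiler can recompile them, can be sketched as below. This is a minimal illustration, not Jikes RVM's adaptive system: the class `SamplingProfiler`, its methods, and the 5% hotness threshold are all assumptions made for the example.

```python
from collections import Counter


class SamplingProfiler:
    """Hypothetical sampler: counts which method is running at each timer tick."""

    def __init__(self, hot_fraction: float = 0.05):
        self.samples = Counter()          # method name -> number of samples
        self.hot_fraction = hot_fraction  # assumed share of samples that makes a method hot

    def record_sample(self, method_name: str) -> None:
        self.samples[method_name] += 1

    def hot_methods(self) -> list:
        """Methods whose share of all samples reaches the hot threshold."""
        total = sum(self.samples.values())
        if total == 0:
            return []
        return sorted(
            name
            for name, count in self.samples.items()
            if count / total >= self.hot_fraction
        )


if __name__ == "__main__":
    profiler = SamplingProfiler()
    for name in ["A"] * 90 + ["B"] * 8 + ["C"] * 2:
        profiler.record_sample(name)
    # A and B would be handed to the optimizing compiler; C stays baseline.
    print(profiler.hot_methods())
```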
12 PCR System: Method Separation Hot method (optimized code) Cold method (baseline code) [Diagram: heap layout with regions labeled Data, Code, Hot Methods, Cold Methods]
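Method separation can be pictured as two bump-pointer regions, one for optimized (hot) code and one for baseline (cold) code. The sketch below is an assumption-laden illustration: `CodeSpace`, `install`, and the base addresses are hypothetical names and values, not the allocator used in the talk.

```python
class CodeSpace:
    """Hypothetical code space with separate hot and cold bump-pointer regions."""

    def __init__(self, hot_base: int, cold_base: int):
        self.next_free = {"hot": hot_base, "cold": cold_base}
        self.placement = {}  # method name -> (region, start address)

    def install(self, method: str, size: int, optimized: bool) -> int:
        """Place compiled code in the hot region if optimized, else in the cold region."""
        region = "hot" if optimized else "cold"
        address = self.next_free[region]
        self.next_free[region] += size
        # A recompiled method simply gets a new placement; the old copy becomes garbage.
        self.placement[method] = (region, address)
        return address


if __name__ == "__main__":
    space = CodeSpace(hot_base=0x10000, cold_base=0x80000)
    space.install("A", 100, optimized=False)  # baseline compile: cold region
    space.install("B", 200, optimized=False)
    space.install("A", 80, optimized=True)    # A became hot: recompiled into hot region
    print(space.placement)
```

Keeping hot methods contiguous means the frequently executed code occupies fewer cache lines than it would if interleaved with cold baseline code.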
13 PCR System: Code Splitting Online edge profile identifies hot basic blocks in a method. Code reordering moves hot basic blocks to the beginning of a method. Code splitting separates hot/cold basic blocks inside the heap. [Diagram: Method A divided into hot basic blocks and cold basic blocks]
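The reordering and splitting steps can be sketched as a partition of a method's basic blocks by profile count. This is a simplified illustration under stated assumptions: the function name, the per-block counts, and the threshold are hypothetical, and a real compiler would also have to fix up branches between the hot and cold parts.

```python
def split_blocks(blocks, counts, hot_threshold):
    """Partition basic blocks into (hot, cold) lists.

    blocks: block names in their original layout order.
    counts: block name -> execution count derived from an edge profile (assumed given).
    hot_threshold: minimum count for a block to be treated as hot (assumed policy).
    Relative order inside each part is preserved.
    """
    hot = [b for b in blocks if counts.get(b, 0) >= hot_threshold]
    cold = [b for b in blocks if counts.get(b, 0) < hot_threshold]
    return hot, cold


if __name__ == "__main__":
    blocks = ["entry", "check", "error", "loop", "exit"]
    counts = {"entry": 100, "check": 100, "error": 0, "loop": 5000, "exit": 100}
    hot, cold = split_blocks(blocks, counts, hot_threshold=50)
    print("hot part :", hot)   # laid out together at the start of the method
    print("cold part:", cold)  # placed in a separate region
```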
14 PCR System: Code Splitting [Diagram: heap layouts. Hot methods (optimized code) and cold methods (baseline code) contain hot basic blocks and cold basic blocks; regions labeled Data, Hot Methods, Cold Methods, Hot Blocks, Cold Blocks]
15 PCR Optimizations Reduce instruction capacity misses –Code Space –Method separation –Code splitting Reduce instruction conflict misses –Code padding
16 PCR System: Code Padding [Diagram. Legend: JikesRVM component, Input/Output. Labels: Source Code, Baseline Compiler, Binary Code, Adaptive Sampler, Hot Methods, Optimizing Compiler, Dynamic Call Graph]
17 PCR System: Code Padding Method A() { … classC.B(); … } [Diagram: A and B conflict in the cache; Dynamic Call Graph with nodes A and B]
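The conflict between caller A and callee B can be illustrated by computing which lines of a direct-mapped cache each method occupies, then padding B until the two sets are disjoint. This is a sketch, not the talk's algorithm: the 32K cache size comes from the slides, while the 64-byte line size, the function names, and the linear search for padding are assumptions.

```python
CACHE_SIZE = 32 * 1024  # 32K direct-mapped I-cache, as in the slides
LINE_SIZE = 64          # assumed line size; not stated in the slides


def cache_lines(start, size, cache_size=CACHE_SIZE, line_size=LINE_SIZE):
    """Set of cache line indices touched by code at [start, start + size)."""
    num_lines = cache_size // line_size
    first = start // line_size
    last = (start + size - 1) // line_size
    return {line % num_lines for line in range(first, last + 1)}


def conflicts(a_start, a_size, b_start, b_size):
    """True if the two methods map to at least one common cache line."""
    return bool(cache_lines(a_start, a_size) & cache_lines(b_start, b_size))


def padding_for(a_start, a_size, b_start, b_size):
    """Smallest line-aligned padding before B that removes the conflict, or None."""
    a_lines = cache_lines(a_start, a_size)
    for pad in range(0, CACHE_SIZE, LINE_SIZE):
        if not (a_lines & cache_lines(b_start + pad, b_size)):
            return pad
    return None  # A and B together do not fit in the cache without overlap


if __name__ == "__main__":
    # A at 0x0000 and B at 0x8000 are exactly one cache size apart, so they collide.
    print(conflicts(0x0000, 256, 0x8000, 256))    # True
    print(padding_for(0x0000, 256, 0x8000, 256))  # 256
```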
18 Methodology Java virtual machine: Jikes RVM Various Architectures –x86 (Pentium 4) –PowerPC –Simulator: Dynamic SimpleScalar Use direct-mapped I-cache –Shorter latency –More conflict misses
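The "more conflict misses" trade-off of a direct-mapped I-cache can be seen with a tiny cache model. The sketch below is not Dynamic SimpleScalar; it is a minimal set-associative simulator with LRU replacement, and the 64-byte line size and the address trace are assumptions chosen for illustration.

```python
from collections import OrderedDict


def count_misses(addresses, cache_size=32 * 1024, line_size=64, ways=1):
    """Count misses for an address trace in a set-associative cache with LRU.

    ways=1 models a direct-mapped cache.
    """
    num_sets = cache_size // (line_size * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_size
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache_set) == ways:
                cache_set.popitem(last=False)  # evict least recently used line
            cache_set[line] = True
    return misses


if __name__ == "__main__":
    # Two instruction addresses one cache size apart, executed alternately.
    trace = [0x0000, 0x8000] * 100
    print("direct-mapped misses:", count_misses(trace, ways=1))  # 200
    print("two-way misses:     ", count_misses(trace, ways=2))   # 2
```

In the direct-mapped case the two lines evict each other on every access, which is the kind of conflict that code padding targets.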
19 PCR Results: jess on x86
20 PCR Results: fop on x86
21 Impact of Code Padding Base case: JikesRVM default + a separate code space. Cache configuration: 32K direct-mapped IL1, 512K L2
22 Conclusion A separate code space improves program performance by 6% (up to 30%) on Pentium 4. PCR has negligible overhead. PCR shows no obvious performance improvement: –On Pentium 4, no improvement on average –In simulation, PCR gives a 14% improvement for one program, but the results are not consistent and there is no improvement on average. Potential opportunities for dynamic optimizations
23 Thank you! Questions? [Diagram: Compiler, Garbage collector, Runtime; Static Optimization]
24 Cache: Small vs. Large [Table: Size, Assoc, Latency for each of IL1, DL1, L2. Rows: 8K K25; 16K K28; 64K K210] Cacti, 90nm technology, 3GHz frequency
25 Cache-Size Comparison
26 Direct-Mapped vs. Two-Way [Figure: cycles (×10^6)] Cacti, 90nm technology, 3GHz
27 Improving Performance Classic optimizations not sufficient! Different programming styles –Automatic memory management –Pointer data structures –Many small methods Optimization costs incurred at runtime Virtual Machine (VM) adds complexity –Class loading, memory management, just-in-time compiler…
28 Instruction Locality Instructions have better locality? –More instruction accesses –About same # of data cache misses Penalty in pipelined processor –Create bubbles in the pipeline Instruction locality can be more critical
29 Locality Impact On Performance Locality is key to performance [Figure: Execution Time Distribution, geometric mean of five Java programs; data labels 23.2%, 40.1%, 25.1%, 48.3%]