Link-Time Path-Sensitive Memory Redundancy Elimination
Manel Fernández and Roger Espasa
Computer Architecture Department, Universitat Politècnica de Catalunya
Barcelona, Spain
Motivation
- The memory “gap”
  - Processor speed increases faster than memory speed
  - L1-cache latency continues to increase
  - Memory operations remain a significant bottleneck
- Memory redundancy
  - Instructions that repeatedly access the same location
  - Lots of memory operations are redundant
- Hardware designers exploit memory redundancy
  - E.g., caches take advantage of temporal reuse
- The compiler must be very aggressive in memory optimizations
Memory redundancy
- Memory instructions that repeatedly access the same location
  - Lots of memory operations are redundant
- Sources of redundancy
  - Source code structure
    - Programmers introduce redundancy
  - Traditional compilation
    - Separate compilation units
    - Limitations in the compilation model
    - Code generation introduces redundancy
- What percentage of memory operations are redundant at run time?
Example:
    … = *p;        /* redundancy source */
    if ( … ) {
      *q = …;      /* intervening store */
      … = *p;      /* redundant load */
    }
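The run-time question on this slide can be made concrete by counting redundant loads in an address trace. The sketch below is illustrative only: the trace format and the name `redundant_load_fraction` are assumptions, and it uses one simple definition (a load is redundant when the same address was loaded earlier and has not been stored to since), which is not necessarily the definition behind the measurements in this talk.

```python
def redundant_load_fraction(trace):
    """Return the fraction of loads in `trace` that are redundant.

    `trace` is a hypothetical list of (op, address) pairs, where op is
    "load" or "store". Access sizes and partial overlaps are ignored.
    """
    loaded = set()          # addresses loaded earlier and not stored to since
    loads = redundant = 0
    for op, addr in trace:
        if op == "load":
            loads += 1
            if addr in loaded:
                redundant += 1
            loaded.add(addr)
        elif op == "store":
            loaded.discard(addr)   # an intervening store ends the reuse
    return redundant / loads if loads else 0.0


trace = [("load", 0x10), ("store", 0x20), ("load", 0x10),
         ("store", 0x10), ("load", 0x10), ("load", 0x10)]
print(redundant_load_fraction(trace))
```

In this trace the store to 0x20 does not disturb the reuse of 0x10, while the store to 0x10 does, so two of the four loads count as redundant.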
Dynamic memory redundancy Load redundancy Store redundancy
Eliminating memory redundancy
- Can the compiler reduce the redundancy that appears in binary programs?
- Binary optimizations
  - New opportunities appear on executable code
    - Compiler/language independence
    - Whole program view
    - Object code oriented optimizations
    - Easy collection/use of profiling information
  - Executable code has its own problems
    - Lack of semantic information
    - “Nasty” features
- Redundancy in binary programs can be eliminated by using binary optimizers
Talk outline
- Motivation
- Memory redundancy elimination (MRE)
- Evaluation
- Summary
Memory redundancy elimination (MRE)
- Removal of memory instructions that repeatedly access the same location
- Targeted at redundancy type
  - Load redundancy elimination (LRE) in a path-sensitive fashion
    - Based on path-sensitive memory disambiguation
  - Store redundancy elimination (SRE)
- Targeted at redundancy distance
  - Eliminating close/distant redundancy
- In the context of a binary optimizer
  - Overcome limitations of traditional compilers
  - Need to deal with “executable code” problems
Load redundancy elimination (LRE)
- Fundamental problems
  - Alias analysis for disambiguation
  - Liveness analysis for register bypassing
  - Cost-benefit analysis for applying LRE
    - Profile information is needed
- Eliminating close redundancy
  - Within extended basic blocks (EBBs)
- Eliminating distant redundancy
  - Intraprocedural dataflow analysis [HorspoolHo97]
  - For fully/partially-redundant loads
    - Redundancy on all/some paths
  - Partial-LRE requires insertion of speculative loads
Figure (Hot Path): I1 load (p0), r1, followed by move r1, r0; ...; I2 load (p0), r2, replaced by move r0, r2.
[HorspoolHo97] R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis. CSSE’97.
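The cost-benefit step can be sketched with profile counts. This is a minimal sketch under an assumed cost model, not the equation from [HorspoolHo97]: the function name `lre_is_profitable`, the parameter names, and the default cycle costs are all hypothetical. The idea is only that the inserted move runs as often as the first load, while the saving is earned as often as the second load runs.

```python
def lre_is_profitable(freq_first, freq_second, load_cost=3, move_cost=1):
    """Decide whether bypassing the second load through a register pays off.

    freq_first:  profiled execution count of the first load, where the
                 extra move into the bypass register is inserted.
    freq_second: profiled execution count of the redundant load, which
                 is replaced by a move.
    load_cost, move_cost: assumed per-instruction costs, not measured values.
    """
    cost_before = freq_second * load_cost
    cost_after = freq_first * move_cost + freq_second * move_cost
    return cost_after < cost_before


print(lre_is_profitable(freq_first=10, freq_second=1000))
print(lre_is_profitable(freq_first=1000, freq_second=10))
```

Under these assumed costs, the transformation pays off when the redundant load is on the hot path, and loses when the added move is on the hot path and the redundant load is rarely executed.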
Memory disambiguation
- Register use-def chains
  - Symbolic descriptors for every use
  - Disambiguation by instruction inspection
- Fails on path-sensitive redundancies
  - Need to deal with path-sensitive information
  - Partial-LRE is not sufficient either
Figure: I0 def p0; I1 load (p0),r1; I3 add p0,8,p0; IØ Ø-def p0; I2 load (p0),r2 (marks: √, ?).
Path-sensitive memory disambiguation
- Established for only a subset of all the possible paths
- Subsumes generic disambiguation
Path-sensitive LRE
- Partial-LRE is now adapted for dealing with path-sensitive redundancies
  - Availability on edge (AVEDG_ij)
Figure (path-sensitive redundancy): I0 def p0; I1 load (p0),r1, followed by move r1, r0; I3 add p0,8,p0, followed by load (p0),r0; IØ Ø-def p0; I2 load (p0),r2, replaced by move r0, r2 (marks: √, x).
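The per-edge reasoning in this figure can be sketched as follows. This is an illustration, not the authors' algorithm: the descriptor encoding (defining instruction, offset), the edge names `from_I1` and `from_I3`, and both function names are assumptions, and the control-flow shape is inferred from the figure. The value loaded at I1 counts as available on an edge when the address register reaching the join along that edge has the same symbolic descriptor as at I1.

```python
def availability_on_edges(source_addr, incoming_addrs):
    """Per incoming edge: does the address match the earlier load's address?

    source_addr:    symbolic descriptor (base_definition, offset) of the
                    address used by the earlier load.
    incoming_addrs: edge name -> descriptor of the address register as it
                    reaches the join along that edge.
    """
    return {edge: desc == source_addr for edge, desc in incoming_addrs.items()}


def path_sensitive_lre(available, addr_reg, holder, bypass, dst):
    """Return (instruction to insert per edge, replacement for the load)."""
    inserted = {}
    for edge, ok in available.items():
        if ok:
            inserted[edge] = ("move", holder, bypass)    # reuse the loaded value
        else:
            inserted[edge] = ("load", addr_reg, bypass)  # reload on this path
    return inserted, ("move", bypass, dst)


# p0 from I0 reaches the join unchanged on one edge and as p0+8 (I3) on the other.
incoming = {"from_I1": ("I0", 0), "from_I3": ("I0", 8)}
available = availability_on_edges(("I0", 0), incoming)
inserted, replacement = path_sensitive_lre(available, "p0", "r1", "r0", "r2")
print(available)
print(inserted, replacement)
```

With these inputs the sketch reproduces the code in the figure: a move of r1 into r0 on the path where p0 is unchanged, a load into r0 on the path through I3, and I2 replaced by a move from r0 to r2.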
Store redundancy elimination (SRE)
- Similar approach to LRE
  - SRE on EBBs
  - Full- and Partial-SRE
    - New formulation of the analysis
  - No path-sensitive elimination!
- Elimination of dead stores
  - Other optimizations produce a lot of dead stores
  - Form of dead code elimination
    - Based on heuristics
    - Includes a basic analysis for useless stack locations
Figure (top): I1 store r1, (p0); ...; I2 store r2, (p0).
Figure (bottom): I1 load (p0), r0; ...; I2 store r0, (p0).
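The first figure (two stores to (p0) with nothing reading in between) can be handled by a simple scan over straight-line code. The sketch below is a conservative illustration, not the heuristics used in the talk: the tuple encoding of instructions and the name `eliminate_dead_stores` are assumptions, every load is treated as possibly reading any pending store, and calls, branches, and access sizes are not modelled.

```python
def eliminate_dead_stores(block):
    """Drop stores that are overwritten before any load could read them.

    Instructions are hypothetical tuples:
      ("store", src, addr_reg), ("load", addr_reg, dst), ("def", dst)
    """
    pending = {}   # addr_reg -> index in `out` of the last unread store
    out = []
    for ins in block:
        op = ins[0]
        if op == "store":
            addr = ins[2]
            if addr in pending:
                out[pending[addr]] = None   # earlier store is overwritten
            pending[addr] = len(out)
        elif op == "load":
            pending.clear()                 # the load may read any pending store
        elif op == "def":
            pending.pop(ins[1], None)       # address register was redefined
        out.append(ins)
    return [ins for ins in out if ins is not None]


block = [("store", "r1", "p0"), ("def", "r5"), ("store", "r2", "p0")]
print(eliminate_dead_stores(block))
```

Stores still pending at the end of the block are kept, since later code may read them.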
Talk outline
- Motivation
- Memory redundancy elimination (MRE)
- Evaluation
- Summary
Methodology
- Benchmark suite
  - SPECint95
  - Compiled on an AlphaServer with full optimizations
  - Instrumented using Pixie to get profiling information
  - Aggressively re-optimized using Alto
- Experimental framework
  - Alto executable optimizer
- Evaluation
  - Dynamic number of loads/stores
  - Actual execution time
    - AlphaServer GS-140, Alpha EV
Dynamic number of loads/stores
Execution time Relative execution time on an AlphaServer GS-140, Alpha EV MHz
Dynamic replay traps Relative number of replay traps on the sim-alpha simulator, modeling an Alpha EV
Talk outline
- Motivation
- Memory redundancy elimination (MRE)
- Evaluation
- Summary
- A high percentage of memory operations are redundant
- Memory redundancy elimination (MRE)
  - Removal of redundant memory operations
  - Load redundancy elimination (LRE) in a path-sensitive fashion
    - Based on path-sensitive memory disambiguation
  - Store redundancy elimination (SRE)
    - Including elimination of dead stores
  - For executable code or link-time
    - Overcome limitations of traditional compilers
  - Valuable results on real execution time
- Future directions
  - Explore better alias analysis mechanisms
  - Additional techniques for MRE
Backup slides
Dynamic memory redundancy
Dynamic load redundancy
Dynamic store redundancy
Load redundancy elimination (LRE)
- I1 loads a value from memory into r1
- I2 loads from the same location into r2
- Location (p0) is not modified between I1 and I2
  - r1 can be safely bypassed to r2
Figure: I1 load (p0), r1, followed by move r1, r0; ...; I2 load (p0), r2, replaced by move r0, r2. I2 can be removed!
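The three conditions on this slide can be checked mechanically on straight-line code. The following is a simplified sketch, not the optimizer's implementation: the tuple encoding and the name `eliminate_redundant_loads` are assumptions, every store is conservatively assumed to alias every load, and no separate bypass register is allocated. Where the slide copies r1 into r0, the sketch simply gives up when the register holding the value is overwritten.

```python
def eliminate_redundant_loads(block):
    """Replace a repeated load of the same location by a register move.

    Instructions are hypothetical tuples:
      ("load", addr_reg, dst), ("store", src, addr_reg), ("def", dst)
    """
    available = {}   # addr_reg -> register currently holding the value at (addr_reg)
    out = []

    def kill(reg):
        # `reg` is overwritten: forget facts that use it as address or holder.
        for addr in [a for a, h in available.items() if a == reg or h == reg]:
            del available[addr]

    for ins in block:
        op = ins[0]
        if op == "load":
            _, addr, dst = ins
            holder = available.get(addr)
            out.append(("move", holder, dst) if holder is not None else ins)
            kill(dst)
            if addr != dst:
                available[addr] = dst   # dst now holds the value at (addr)
        elif op == "store":
            _, src, addr = ins
            available.clear()           # conservative: may alias any location
            available[addr] = src       # the stored value is still in src
            out.append(ins)
        else:                           # ("def", dst): dst is overwritten
            kill(ins[1])
            out.append(ins)
    return out


block = [("load", "p0", "r1"), ("def", "r3"), ("load", "p0", "r2")]
print(eliminate_redundant_loads(block))
```

A redefinition of p0, an overwrite of r1, or any store to a different address register between the two loads leaves the second load in place.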
LRE on executable code
- Is (p1) at I1 the same memory location as (p2) at I2?
  - Alias analysis!
- Is there any available register between I1 and I2 that can be used to bypass r1 to r2?
  - Register liveness analysis!
Figure: I1 load (p1), r1, followed by move r1, r0; ...; I2 load (p2), r2, replaced by move r0, r2.
LRE: Eliminating close redundancy
- For extended basic blocks (EBBs)
  - Alias analysis: for disambiguation
  - Register live analysis: for bypassing
- Profile-guided LRE
  - There is not always a benefit in removing a redundant load
  - Need to evaluate cost-benefit of applying LRE!
Figure (Hot Path): I1 load (p0), r1, followed by move r1, r0; ...; I2 load (p0), r2, replaced by move r0, r2.
LRE: Eliminating distant redundancy
- For eliminating fully- and partially-redundant loads
  - Requires insertion of speculative loads
- Dataflow analysis [HorspoolHo97]
  - Extended cost equation
  - Complex search for available registers
Figure: I2 load (p0),r1; I1 store r1,(p0); load (p0), r0; move r0,r; move r1,r0.
[HorspoolHo97] R. N. Horspool and H. C. Ho. Partial redundancy elimination driven by a cost-benefit analysis. CSSE’97.
Path-sensitive LRE
- Path-sensitive redundancy
  - Redundancy occurs only on some execution paths
  - Partial-LRE is not sufficient
- Memory disambiguation
  - Using register use-def chains
  - Symbolic descriptors for every use
- Path-sensitive memory disambiguation is needed!
Figure: I0 def p0; I1 load (p0),r1; I3 add p0,8,p0; IØ Ø-def p0; I2 load (p0),r2.
Path-sensitive information
- Disambiguation is established for only a subset of all the possible paths
  - For detecting path-sensitive exact memory dependencies
- Partial-LRE
  - Algorithm is now adapted for dealing with path-sensitive redundancies
  - Availability on edge (AVEDG_ij)
Figure (path-sensitive memory disambiguation): I0 def p0; I1 load (p0),r1, followed by move r1, r0; I3 add p0,8,p0, followed by load (p0),r0; IØ Ø-def p0; I2 load (p0),r2, replaced by move r0, r2 (marks: √, x).
A combined algorithm
- Short-distance MRE
  - Basic MRE within EBBs
- Long-distance MRE
  - Full: Full-MRE
  - Partial: Partial-MRE
  - Complete: Path-sensitive LRE, Partial SRE, Dead store elimination
- Optimization phases, in order
  1. Easy optimizations (including Basic-MRE)
  2. Function inlining
  3. Long-distance MRE (Full/Partial/Complete)
  4. Easy optimizations (including Basic-MRE)
Dynamic number of loads
Dynamic number of stores
Alpha results