Load-Reuse Analysis: Design and Evaluation. Rastislav Bodík, Rajiv Gupta, Mary Lou Soffa.
Partial Redundancy Elimination (PRE). Partially redundant = computed on some incoming paths. [Figure: control-flow graph containing x := a+b, y := a+b, and a := ..; the expression of interest is a+b.]
Steps: find “reuse” paths, remove redundancy from “reuse” paths.
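A minimal sketch of the "computed on some incoming paths" definition, not the PRE algorithm from the talk. The path encoding and the function name `classify_redundancy` are hypothetical, and only a single commutative operator over the listed operands is modeled.

```python
def classify_redundancy(paths, operands):
    """Return 'full', 'partial', or 'none' for an expression over `operands`
    evaluated at the point where all `paths` meet.

    Each path is a list of (target, used_operands) statements (assumed encoding).
    """
    available_on = []
    for path in paths:
        available = False
        for target, used in path:
            if set(used) == set(operands):
                available = True       # this statement computes the expression
            if target in operands:
                available = False      # an operand is redefined: value is stale
        available_on.append(available)
    if all(available_on):
        return "full"
    return "partial" if any(available_on) else "none"

# Two paths reach y := a+b: one computes x := a+b, the other assigns a := ..
left = [("x", ("a", "b"))]
right = [("a", ())]
result = classify_redundancy([left, right], ("a", "b"))
print(result)  # partial
```

The left path is a "reuse" path; removing the redundancy means making a+b available on the right path as well.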
Register promotion = PRE of loads. Three steps: (1) load-reuse analysis: find loads that can reuse prior loads/stores; (2) alias analysis: which stores may kill reuse? (3) transformation: remove the redundancy using PRE [PLDI ’98]. [Figure: store a1, x; store a3; load a2; load a4.]
Load-reuse analysis. Design goal: completeness (find all reuse). To approach completeness, the analysis is uniform (it analyzes scalar, array, and pointer loads) and path-sensitive (a different source of reuse on each path). Evaluation goal: how complete is it? Compare with an ideal analysis. Detecting all reuse is undecidable, so no ideal algorithm exists; instead, use simulation.
Experimental framework. [Figure: block diagram. Components: load-reuse analysis, simulator, estimator, transformation [PLDI ’98]. Labels: program, input, profile, data-flow solution, weighted solution, reuse level, comparison.]
1. Load-reuse analysis. It is a data-flow analysis on a reuse-aware representation: the Value Name Graph (VNG) [POPL ’98]. What’s new? A sparse version of the VNG, up to 30 times smaller than the non-sparse one. Analysis of indirect loads/stores, which also models killing stores.
Naming the value. Code: y := b+c; a := c-1; x := a+b+1. Names for the value in ‘x’: x, a+b+1, b+c. [Figure: the names x, a+b+1, and b+c, annotated with 1 bits and a GEN label.]
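The renaming on this slide can be sketched as backward substitution over linear expressions. This is an illustration under assumptions, not the VNG construction itself: the canonical form and the helpers `normalize` and `substitute` are hypothetical names.

```python
def normalize(terms, const):
    """Canonical value name: sorted (variable, coefficient) pairs plus a constant."""
    return (tuple(sorted((v, c) for v, c in terms.items() if c != 0)), const)

def substitute(name, var, rhs):
    """Translate value name `name` backward across the assignment var := rhs."""
    terms, const = dict(name[0]), name[1]
    coeff = terms.pop(var, 0)
    if coeff == 0:
        return name                      # the assignment does not affect this name
    rhs_terms, rhs_const = rhs
    for v, c in rhs_terms:
        terms[v] = terms.get(v, 0) + coeff * c
    return normalize(terms, const + coeff * rhs_const)

x_name = normalize({"a": 1, "b": 1}, 1)   # a+b+1, the right-hand side of x
a_rhs = normalize({"c": 1}, -1)           # c-1, the right-hand side of a
y_rhs = normalize({"b": 1, "c": 1}, 0)    # b+c, the right-hand side of y

name_above_a = substitute(x_name, "a", a_rhs)
print(name_above_a == y_rhs)  # True: above a := c-1, the value of x is named b+c
```

Since y := b+c computes a value with the same name, it can serve as the source of reuse for x.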
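Naming the value across loads. Statements in the figure: .. := p->f; .. := p->next->f; *r := ...; p := p->next. Field offsets: f = 0, next = 4. Value names: *p and **(p+4), annotated with 1 bits and a GEN label. KILL: the store *r := ... kills the reuse if r = p+4 or r = *(p+4).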
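A small sketch of how the name *p turns into **(p+4) across p := p->next, and where the two kill conditions come from. The tuple encoding and the helpers `rename` and `kill_addresses` are assumptions made for illustration; the offsets follow the slide (f at 0, next at 4).

```python
def var(n):
    return ("var", n)

def add(a, k):
    return ("add", a, k)

def load(a):
    return ("load", a)

def rename(name, v, rhs):
    """Translate a value name backward across the assignment v := rhs."""
    if name == ("var", v):
        return rhs
    if name[0] == "add":
        return ("add", rename(name[1], v, rhs), name[2])
    if name[0] == "load":
        return ("load", rename(name[1], v, rhs))
    return name

def kill_addresses(name):
    """Addresses dereferenced inside the name; a store to any of them may kill it."""
    if name[0] == "load":
        return [name[1]] + kill_addresses(name[1])
    if name[0] == "add":
        return kill_addresses(name[1])
    return []

p_f = load(var("p"))                      # p->f is *p because f has offset 0
p_next = load(add(var("p"), 4))           # p->next is *(p+4)
name_before = rename(p_f, "p", p_next)    # the name above p := p->next
kills = kill_addresses(name_before)
print(name_before == load(load(add(var("p"), 4))))  # True: **(p+4)
```

Here `kills` holds p+4 and *(p+4), the two addresses named in the KILL condition for the store *r := ....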
Sparse representation. Source loop: for I = 1, N { .. := A[I] + A[I-1] }. Lowered loop body: a1 := A+I; load a1; a2 := A+I-1; load a2; I := I+1. [Figure: sparse VNG with nodes load a1 and load a2, Ø entries, and a GEN label.]
2. The simulator algorithm. [Figure: load a1, load a2, Ø, for the loop for I = 1, N { .. := A[I] + A[I-1] }; a memory access history holding A[I-1] and A[I]; history length = 1 to 4.] The simulator detects all PRE-exploitable reuse (up to the given history length), but also some “noise”, e.g. due to hash table accesses.
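A rough sketch of a history-based reuse simulator, assuming each static instruction remembers the last `history` addresses it accessed and a dynamic load counts as reuse when its address appears in any such history. The trace format and the function `simulate` are hypothetical; the authors' simulator may keep its history differently.

```python
from collections import defaultdict, deque

def simulate(trace, history=1):
    """Fraction of dynamic loads whose address is in some instruction's recent history.

    `trace` is a list of (static_instruction, kind, address); kind is 'load' or 'store'.
    """
    recent = defaultdict(lambda: deque(maxlen=history))
    loads = reused = 0
    for instr, kind, addr in trace:
        if kind == "load":
            loads += 1
            if any(addr in h for h in recent.values()):
                reused += 1
        recent[instr].append(addr)   # loads and stores both become reuse sources
    return reused / loads if loads else 0.0

# The loop from the slide, with A assumed to start at address 100 and I = 1..4.
trace = []
for i in range(1, 5):
    trace.append(("a1", "load", 100 + i))       # A[I]
    trace.append(("a2", "load", 100 + i - 1))   # A[I-1]

print(simulate(trace, history=2))  # 0.375
```

In this simplified bookkeeping, load a1 runs again before load a2 reads A[I-1], so the cross-iteration reuse only shows up with a history of at least 2. That is a property of the sketch, not a claim about the original simulator.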
Ideal amount of load reuse. 65% of executed loads have reuse exploitable by PRE (intra-procedural reuse, history = 1). [Figure: bar chart, % of all dynamic loads, legend: history length 1, 4. Benchmarks: go, m88ksim, gcc, compress, li, ijpeg, vortex, tomcatv, swim, su2cor, hydro.]
3. How frequent is the reuse? Edge profile: + cheap and available; - cannot reconstruct frequencies of reuse paths. [Figure: load x; kill x; load x.] Path profile: + precise; - more expensive. Use the edge profile, but bound its inherent error: compute lower and upper bounds on reuse.
Hierarchy of estimators. PRE, CMP_1, CMP_c, CMP_r, CMP_f: each step gives a smaller error (but is more complex). Hierarchy: a practical approach. Is a simple estimator not precise enough? Use the next better one! Estimator: data-flow solution + edge profile gives a weighted data-flow solution.
The algorithms. 1. The bounds. Generators: points generating reuse. Stealers: points with no reuse. Upper bound: all reuse consumed. Lower bound: all reuse stolen. [Figure: load x; kill x; load x.]
2. Separating uncertainty: use the CMP region defined for PRE [PLDI ’98]. CMP = code-motion preventing. All error is contained in the CMP region!
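The bounds in step 1 can be illustrated at a single merge-then-split point, assuming one incoming edge carries reuse, the other does not, and only edge frequencies are known. The function `reuse_bounds`, its parameter names, and the frequencies below are hypothetical; this is a sketch of the idea, not one of the CMP estimators.

```python
def reuse_bounds(reuse_in, no_reuse_in, to_load, elsewhere):
    """Lower and upper bound on executions of the second load that see reuse.

    reuse_in    : frequency of the incoming edge on which x is available (after load x)
    no_reuse_in : frequency of the incoming edge on which x was killed
    to_load     : frequency of the outgoing edge leading to the second load x
    elsewhere   : frequency of the outgoing edge that bypasses the second load
    """
    if reuse_in + no_reuse_in != to_load + elsewhere:
        raise ValueError("edge profile does not conserve flow")
    upper = min(reuse_in, to_load)          # all available reuse reaches the load
    lower = max(0, to_load - no_reuse_in)   # as much reuse as possible goes elsewhere
    return lower, upper

# Assumed frequencies: load x runs 60 times, kill x 40 times,
# and the second load x runs 70 times while 30 executions bypass it.
bounds = reuse_bounds(60, 40, 70, 30)
print(bounds)  # (30, 60)
```

A path profile would pin down the exact count between 30 and 60; the edge profile alone cannot, which is the inherent error the estimators bound.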
Improving precision. [Figure: four panels: “one” region; connected regions; control flow reachability; network flow reachability.]
Estimators: precision. [Figure: estimator error for PRE, CMP_1, CMP_c, CMP_r, CMP_f on the INT and FP benchmarks; the error gets smaller along the hierarchy.]
4. Analysis: how close to ideal? [Figure: chart where 100% = reuse seen by the simulator. Labels: *p, **p, calls. Reuse killed by: array & pointer stores + calls; all stores + calls; ideal alias info.]
Related Work. Load-reuse analysis makes value numbering path-sensitive: Steffen, Knoop, Rüthing, Value Flow Graph [ESOP ’90]; we show how to analyze indirect loads, via symbolic evaluation. Simulation-based analysis evaluation: Diwan, McKinley, Moss [PLDI ’98], type-based alias analysis: how powerful does it need to be? Estimators: Ramalingam, “Frequency Analysis” [PLDI ’96], returns a single estimate, not its bounds.
Summary. Load-reuse analysis: reuse across indirect memory references; sparse representation. Estimators, three principles: confidence (bound the edge-profile error); separation of uncertainty (inside/outside the CMP region); hierarchy (increasing precision and complexity). Evaluation: about 65% of loads are amenable to PRE; our analysis can find about 80% of those.
Combine three removal methods [PLDI ’98]: code motion (M), control speculation (S), restructuring (R).
Example: a+b. [Figure: removal of a+b using M, S, and R; the values 10 and 50 appear in the figure.]
Relative removal power. [Figure: loads removed (dynamic count, normalized) for M, S, and R, compared with path-insensitive global CSE, on the INT and FP benchmarks.]