CS 7810 Lecture 9: "Effective Hardware-Based Data Prefetching for High-Performance Processors", T-F. Chen and J-L. Baer, IEEE Transactions on Computers, 44(5), May 1995
Memory Hierarchy Bottlenecks
- Caching strategies: victim cache, replacement policies, temporal/spatial caches
- Prefetching: stream buffers, strided predictors, pointer-chasing
- Memory dependences: store barrier cache, store sets
- Latency tolerance: out-of-order execution, runahead
Data Access Patterns
- Scalar: simple variable references
- Zero stride: constant array index throughout a loop
- Constant stride: index is a linear function of the loop count
- Irregular: none of the above (non-linear index, pointer chasing, etc.)
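The stride taxonomy above can be checked mechanically from an address stream. A minimal Python sketch (function name and addresses are illustrative, not from the paper) classifies a stream by its successive differences:

```python
def classify(addrs):
    """Classify an address stream from its successive differences.
    Note: scalar and zero-stride references both produce stride 0;
    telling them apart needs the instruction, not just the stream."""
    strides = {b - a for a, b in zip(addrs, addrs[1:])}
    if strides == {0}:
        return "zero stride"
    if len(strides) == 1:
        return "constant stride"
    return "irregular"

# A[3] touched every iteration (8-byte elements): zero stride
print(classify([0x2000 + 8 * 3] * 4))                 # zero stride
# A[i] with i = 0..3: constant stride of 8
print(classify([0x3000 + 8 * i for i in range(4)]))   # constant stride
```

A hardware stride predictor does the same comparison incrementally, one access at a time, which is what the reference prediction table below implements.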
Prefetch Overheads
- A regular access can be delayed
- Increased contention on buses and L1/L2/memory ports
- Before initiating the prefetch, the L1 tags have to be examined
- Cache pollution and more misses
- Software prefetching increases instruction count
Software Prefetching
- Pros: reduces hardware overhead, can avoid the first miss (software pipelining), can handle complex address equations
- Cons: code bloat, can only handle addresses that are easy to compute, control flow is a problem, unpredictable latencies
Basic Reference Prediction
- For each PC, detect and store a stride and the last fetched address
- Every fetch initiates the next prefetch
- If the stride changes, remain in transient states until a regular stride is observed
- Prefetches are issued in every state except no-pred
Basic Reference Prediction (hardware)
- Each RPT entry holds: tag (PC), prev_addr, stride, state
- States: init, transient (trans), steady, no-pred
- Transitions: init -> steady on correct, -> transient on incorrect (update stride); transient -> steady on correct, -> no-pred on incorrect (update stride); steady -> steady on correct, -> init on incorrect; no-pred -> transient on correct, stays in no-pred on incorrect (update stride)
- The prefetch address is checked against the L1 tags and placed in the Outstanding Request List
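The entry format and state machine above can be sketched in a few lines of Python. This is a simplified model for illustration only: table sizing, eviction, the L1 tag check, and the Outstanding Request List are all omitted.

```python
class RPT:
    """Sketch of the Chen & Baer reference prediction table.
    Each PC maps to [prev_addr, stride, state]."""

    def __init__(self):
        self.table = {}

    def access(self, pc, addr):
        """Update the entry for pc; return a prefetch address or None."""
        if pc not in self.table:
            self.table[pc] = [addr, 0, "init"]
            return None
        entry = self.table[pc]
        prev, stride, state = entry
        correct = (addr == prev + stride)
        if state == "init":
            if correct:
                state = "steady"
            else:
                state, stride = "transient", addr - prev   # update stride
        elif state == "transient":
            if correct:
                state = "steady"
            else:
                state, stride = "no-pred", addr - prev     # update stride
        elif state == "steady":
            state = "steady" if correct else "init"        # stride kept
        else:  # no-pred
            if correct:
                state = "transient"
            else:
                stride = addr - prev                       # update stride
        entry[:] = [addr, stride, state]
        # prefetch one iteration ahead unless confidence is lost
        return addr + stride if state != "no-pred" else None
```

For a stream of A[i] accesses (stride 8), the second access trains the stride, the third reaches steady, and every access from the second onward issues a prefetch one iteration ahead.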
Shortcomings
- Basic technique only prefetches one iteration ahead -> lookahead predictor
- Mispredictions at the end of every inner loop -> correlated prediction
Lookahead Prediction
- Note: these are in-order processors
- Fetch stalls when instructions stall, but the PC can continue to increment
- The lookahead PC (LA-PC) accesses the branch predictor and BTB to make forward progress
- The LA-PC indexes into the RPT and can run many instructions ahead of the real PC
- Note the additional bpred and BTB ports required
Lookahead Prediction (hardware)
- [Figure: the decoupled LA-PC, driven by the branch predictor, looks up the RPT; the real PC updates it; prefetches go to the ORL]
- A new field in each RPT entry keeps track of how many steps ahead the LA-PC is
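A sketch of the idea, assuming the LA-PC runs d iterations ahead of the real PC: the prefetch is issued d strides ahead, and d is chosen so the prefetch completes just before the demand access. The function names and cycle counts below are illustrative, not from the paper.

```python
import math

def lookahead_distance(miss_latency_cycles, cycles_per_iteration):
    """How many iterations ahead the LA-PC should run so that a
    prefetch issued now returns just before the demand access
    (the hardware bounds this with a limit on the LA-PC's lead)."""
    return math.ceil(miss_latency_cycles / cycles_per_iteration)

def lookahead_prefetch(addr, stride, d):
    """Prefetch d strides ahead instead of one."""
    return addr + d * stride

# e.g. a 100-cycle miss latency and 12-cycle loop body:
d = lookahead_distance(100, 12)          # run 9 iterations ahead
print(lookahead_prefetch(0x1000, 64, d)) # prefetch 9 cache lines ahead
```

This matches the later observation that the lookahead distance should be a little more than the memory latency: any shorter and the demand access still stalls; much longer and the prefetched line risks eviction before use.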
Correlated Reference Prediction
for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)
Access pattern for B: (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
Access pattern for W: (1) (1) (2) (2) (2) (3) (3) (3) (3) (4) (4) (4) (4) (4)
Correlated Reference Prediction
- Inner-loop predictions work well, but the first inner-loop prediction always fails
- There is a correlation between the branch outcomes and the reference patterns
for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)
Access pattern for B with inner-loop branch history: (1,0) 1, (1,1) 101, (2,0) 1011, (2,1) ...
Implementation
- Each PC keeps track of multiple access patterns (prev_addr and stride)
- The branch history determines which pattern is relevant ("01" refers to an outer-loop access)
- Other branches in the loop can corrupt the history -- use the compiler to mark loop-termination branches?
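A simplified sketch of the correlated table: the entry is keyed on (PC, recent branch history), so the outer-loop access trains a separate stride from the inner-loop accesses. The per-pattern confidence state machine is omitted and all names are illustrative.

```python
class CorrelatedRPT:
    """Stride prediction keyed on (PC, branch history) so one PC can
    track multiple access patterns (e.g. inner- vs outer-loop strides)."""

    def __init__(self, hist_bits=2):
        self.table = {}                     # (pc, history) -> [prev_addr, stride]
        self.history = 0
        self.mask = (1 << hist_bits) - 1

    def branch(self, taken):
        """Shift the latest branch outcome into the history register."""
        self.history = ((self.history << 1) | int(taken)) & self.mask

    def access(self, pc, addr):
        """Train the entry for this (pc, history); return prefetch addr or None."""
        key = (pc, self.history)
        if key not in self.table:
            self.table[key] = [addr, 0]
            return None
        prev, stride = self.table[key]
        if addr - prev != stride:
            stride = addr - prev            # retrain this pattern's stride
        self.table[key] = [addr, stride]
        return addr + stride
```

With the triangular loop above, the inner-loop accesses to B (history "…11") and the first access after the outer-loop branch (history "…01") fall into different entries, so the inner-loop stride no longer mispredicts the outer-loop step.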
Benchmark Characteristics
Results
Results Summary
- Lookahead is the most cost-effective technique
- The RPT needs 512 entries (4KB capacity)
- The lookahead distance should be a little more than the memory latency
Modern Processors
- Pentium 4 I-cache: uses the bpred and BTB to stay ahead of current execution
- Pentium 4 L2 cache: attempts to stay 256 bytes ahead; monitors history for multiple streams; minimizes fetch of unwanted data (there was no h/w prefetch in the Pentium III)
- Alpha: 16-entry victim caches in L1D and L2D, and a 16-entry L1I stream buffer
- UltraSPARC III: 2KB prefetch cache
Next Week's Paper
"Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors", O. Mutlu, J. Stark, C. Wilkerson, Y.N. Patt, Proceedings of HPCA-9, February 2003
- Useful execution while waiting for a cache miss -- perhaps prefetch the next miss?