CS 7810 Lecture 9
"Effective Hardware-Based Data Prefetching for High-Performance Processors"
T.-F. Chen and J.-L. Baer, IEEE Transactions on Computers, 44(5), May 1995
Memory Hierarchy Bottlenecks
- Caching strategies: victim cache, replacement policies, temporal/spatial caches
- Prefetching: stream buffers, strided predictors, pointer-chasing
- Memory dependences: store barrier cache, store sets
- Latency tolerance: out-of-order execution, runahead
Data Access Patterns
- Scalar: simple variable references
- Zero stride: constant array index throughout a loop
- Constant stride: index is a linear function of the loop count
- Irregular: none of the above (non-linear index, pointer chasing, etc.)
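The regular patterns above can be illustrated with small loops; the arrays here are hypothetical and exist only to show each address sequence:

```c
#include <assert.h>

#define N 8

/* Illustrative loops for the slide's access-pattern taxonomy
 * (a and idx are hypothetical arrays). */
int sum_patterns(const int a[N], const int idx[N]) {
    int s = 0;
    const int k = 3;
    for (int i = 0; i < N; i++)
        s += a[k];      /* zero stride: the same address every iteration */
    for (int i = 0; i < N; i++)
        s += a[i];      /* constant stride: address is linear in the loop count */
    for (int i = 0; i < N; i++)
        s += a[idx[i]]; /* irregular: the index is itself loaded from memory */
    return s;
}
```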
Prefetch Overheads
- A regular access can be delayed
- Increased contention on buses and on L1/L2/memory ports
- Before initiating a prefetch, the L1 tags have to be examined
- Cache pollution and additional misses
- Software prefetching increases instruction count
Software Prefetching
- Pros: reduces hardware overhead; can avoid the first miss (software pipelining); can handle complex address equations
- Cons: code bloat; can only handle addresses that are easy to compute; control flow is a problem; unpredictable latencies
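As a sketch of the software approach, the compiler (or programmer) inserts explicit prefetches a fixed distance ahead of the current access. This uses the GCC/Clang `__builtin_prefetch` intrinsic; the distance of 16 elements is an arbitrary illustrative choice, not a tuned value:

```c
#include <stddef.h>

#define PF_DIST 16  /* illustrative prefetch distance, in elements */

/* Sum an array while prefetching PF_DIST elements ahead of the current
 * access, overlapping the misses with useful work. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}
```

Note the cons from the slide show up directly: every iteration pays an extra instruction, and the `i + PF_DIST < n` guard is more control flow in the hot loop.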
Basic Reference Prediction
- For each PC, detect and store a stride and the last fetched address
- Every fetch initiates a prefetch of the next predicted address
- If the stride changes, remain in transient states until a regular stride is observed again
- Prefetches are suppressed only in the no-pred state
Basic Reference Prediction (state diagram)
Each RPT entry, indexed by PC, holds a tag, prev_addr, stride, and a state (init, transient, steady, no-pred):
- init: correct prediction → steady; incorrect → transient (update stride)
- steady: correct → steady; incorrect → init
- transient: correct → steady; incorrect → no-pred (update stride)
- no-pred: correct → transient; incorrect → no-pred (update stride)
The prefetch address is checked against the L1 tags and the Outstanding Request List before being issued.
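A minimal sketch of one RPT entry's update, following the four-state scheme from the slide (the L1 tag check and ORL are not modeled, and -1 stands in for "no prefetch"):

```c
#include <assert.h>

enum rpt_state { INIT, TRANSIENT, STEADY, NOPRED };

typedef struct {
    long prev_addr, stride;
    enum rpt_state state;
} rpt_entry_t;

/* Update one RPT entry for a new access at addr; returns the address to
 * prefetch, or -1 when the entry is in no-pred (the only state that
 * suppresses prefetching). */
long rpt_access(rpt_entry_t *e, long addr) {
    int correct = (addr == e->prev_addr + e->stride);
    switch (e->state) {
    case INIT:      e->state = correct ? STEADY : TRANSIENT; break;
    case STEADY:    e->state = correct ? STEADY : INIT;      break;
    case TRANSIENT: e->state = correct ? STEADY : NOPRED;    break;
    case NOPRED:    e->state = correct ? TRANSIENT : NOPRED; break;
    }
    if (!correct && e->state != INIT)  /* steady -> init keeps the old stride */
        e->stride = addr - e->prev_addr;
    e->prev_addr = addr;
    return (e->state == NOPRED) ? -1 : e->prev_addr + e->stride;
}
```

For a constant-stride stream (0, 4, 8, 12, ...) the entry reaches steady after two accesses and then keeps predicting the next element.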
Shortcomings
- The basic technique only prefetches one iteration ahead – remedy: lookahead predictor
- Mispredictions at the end of every inner loop
Lookahead Prediction
- Note: these are in-order processors; fetch stalls when instructions stall, but the PC can continue to be incremented
- The lookahead PC (LA-PC) accesses the branch predictor and BTB to make forward progress
- The LA-PC indexes into the RPT and can run a bounded number of instructions ahead of the real PC
- Note the additional branch predictor and BTB ports required
Lookahead Prediction (organization)
The LA-PC, decoupled from the PC, looks up the branch predictor and the RPT, while the PC performs the updates; prefetch addresses go to the Outstanding Request List (ORL). A new field in each RPT entry keeps track of how many steps ahead the LA-PC is.
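The key constraint is that the LA-PC's lead over the architectural PC is capped, so prefetches are issued early but not unboundedly early. A tiny sketch of that bound (the `MAX_LEAD` of 8 instructions is a made-up illustrative value, not the paper's parameter):

```c
#include <assert.h>

#define MAX_LEAD 8  /* hypothetical cap on how far the LA-PC may run ahead */

/* Decoupled front end: pc is the architectural PC, la_pc the lookahead PC. */
typedef struct { int pc, la_pc; } frontend_t;

/* Advance the LA-PC one instruction if its lead permits; returns 1 if it
 * moved (i.e. it could probe the RPT and issue a prefetch this cycle). */
static int la_step(frontend_t *f) {
    if (f->la_pc - f->pc < MAX_LEAD) { f->la_pc++; return 1; }
    return 0;  /* LA-PC waits for the stalled PC to catch up */
}
```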
Correlated Reference Prediction (example)

for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)

Access pattern for B: (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
Access pattern for W: (1) (1) (2) (2) (2) (3) (3) (3) (3) (4) (4) (4) (4) (4)
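The B access pattern above follows directly from the loop nest; a small sketch that records the (i,k) index pairs it generates:

```c
#include <assert.h>

#define MAXPAIRS 64

/* Record the (i,k) index pairs with which B is accessed by the loop nest
 * above, for n iterations of the outer loop. */
int b_access_pattern(int n, int out_i[MAXPAIRS], int out_k[MAXPAIRS]) {
    int cnt = 0;
    for (int i = 1; i <= n; i++)
        for (int k = 0; k <= i; k++) {  /* inner trip count grows with i */
            out_i[cnt] = i;
            out_k[cnt] = k;
            cnt++;
        }
    return cnt;
}
```

Because the inner trip count grows each outer iteration, no single stride describes B's address sequence; this is what motivates correlating the predictor with branch history.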
Correlated Reference Prediction
- Inner-loop predictions work well, but the first prediction of each inner loop always fails
- There is a correlation between the branch outcomes and the reference patterns

for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)

Access pattern for B, with the branch history at each access:
(1,0) 1
(1,1) 101
(2,0) 1011
(2,1) 10111
(2,2) 1011101
(3,0) 10111011
(3,1) 101110111
(3,2) 1011101111
(3,3) 101110111101
(4,0) 1011101111011
(4,1) 10111011110111
(4,2) ...
Implementation
- Each PC keeps track of multiple access patterns (prev_addr and stride)
- The branch history determines which pattern is relevant ("01" refers to an outer-loop access)
- Other branches in the loop can corrupt the history – use the compiler to mark loop-termination branches?
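A hypothetical sketch of such an entry: one (prev_addr, stride) pair per recent branch-history value, so inner-loop and outer-loop accesses from the same PC keep separate strides (the 2-bit history and the update policy are illustrative assumptions, not the paper's exact design):

```c
#include <assert.h>

#define HIST_BITS 2
#define NPAT (1 << HIST_BITS)

/* One stride pattern per branch-history value. */
typedef struct { long prev_addr, stride; } pattern_t;

typedef struct {
    pattern_t pat[NPAT];  /* selected by the last HIST_BITS branch outcomes */
} corr_entry_t;

/* On each access: pick the pattern for the current history, predict the
 * next address from it, then update that pattern only. */
static long corr_access(corr_entry_t *e, unsigned hist, long addr) {
    pattern_t *p = &e->pat[hist & (NPAT - 1)];
    long predicted = p->prev_addr + p->stride;
    p->stride = addr - p->prev_addr;
    p->prev_addr = addr;
    return predicted;
}
```

Interleaving an inner-loop stream (history 11) with an outer-loop stream (history 01) leaves each stride intact, which is exactly what a single-pattern RPT entry cannot do.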
Benchmark Characteristics
Results
Results Summary
- Lookahead is the most cost-effective technique
- The RPT needs 512 entries (4 KB capacity)
- The lookahead distance should be a little more than the memory latency
Modern Processors
- Pentium 4: the I-cache uses the branch predictor and BTB to stay ahead of current execution; the L2 prefetcher attempts to stay 256 bytes ahead, monitors history for multiple streams, and minimizes fetches of unwanted data (the Pentium III had no hardware prefetch)
- Alpha 21364: 16-entry victim caches in the L1D and L2D, and a 16-entry L1I stream buffer
- UltraSPARC III: 2 KB prefetch cache
Next Week's Paper
"Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors", O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, Proceedings of HPCA-9, February 2003
Useful execution while waiting for a cache miss – perhaps prefetching the next miss?