CS 7810 Lecture 9
"Effective Hardware-Based Data Prefetching for High-Performance Processors"
T.-F. Chen and J.-L. Baer, IEEE Transactions on Computers, 44(5), May 1995
Memory Hierarchy Bottlenecks
- Caching strategies: victim cache, replacement policies, temporal/spatial caches
- Prefetching: stream buffers, strided predictors, pointer-chasing
- Memory dependences: store barrier cache, store sets
- Latency tolerance: out-of-order execution, runahead
Data Access Patterns
- Scalar: simple variable references
- Zero stride: constant array index throughout a loop
- Constant stride: index is a linear function of the loop count
- Irregular: none of the above (non-linear index, pointer chasing, etc.)
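The regular patterns above can be illustrated with small loops; the arrays here are hypothetical and exist only to show each address sequence:

```c
#include <assert.h>

#define N 8

/* Illustrative loops for the slide's access-pattern taxonomy
 * (a and idx are hypothetical arrays). */
int sum_patterns(const int a[N], const int idx[N]) {
    int s = 0;
    const int k = 3;
    for (int i = 0; i < N; i++)
        s += a[k];      /* zero stride: the same address every iteration */
    for (int i = 0; i < N; i++)
        s += a[i];      /* constant stride: address is linear in the loop count */
    for (int i = 0; i < N; i++)
        s += a[idx[i]]; /* irregular: the index is itself loaded from memory */
    return s;
}
```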
Prefetch Overheads
- A regular access can be delayed
- Increased contention on buses and on L1/L2/memory ports
- Before initiating a prefetch, the L1 tags have to be examined
- Cache pollution and additional misses
- Software prefetching increases instruction count
Software Prefetching
- Pros: reduces hardware overhead; can avoid the first miss (software pipelining); can handle complex address equations
- Cons: code bloat; can only handle addresses that are easy to compute; control flow is a problem; unpredictable latencies
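As a sketch of the software approach, the compiler (or programmer) inserts explicit prefetches a fixed distance ahead of the current access. This uses the GCC/Clang `__builtin_prefetch` intrinsic; the distance of 16 elements is an arbitrary illustrative choice, not a tuned value:

```c
#include <stddef.h>

#define PF_DIST 16  /* illustrative prefetch distance, in elements */

/* Sum an array while prefetching PF_DIST elements ahead of the current
 * access, overlapping the misses with useful work. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* high locality */);
        s += a[i];
    }
    return s;
}
```

Note the cons from the slide show up directly: every iteration pays an extra instruction, and the `i + PF_DIST < n` guard is more control flow in the hot loop.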
Basic Reference Prediction
- For each PC, detect and store a stride and the last fetched address
- Every fetch initiates a prefetch of the next predicted address
- If the stride changes, remain in transient states until a regular stride is observed again
- Prefetches are suppressed only in the no-pred state
Basic Reference Prediction (state diagram)
Each RPT entry, indexed by PC, holds a tag, prev_addr, stride, and a state (init, transient, steady, no-pred):
- init: correct prediction → steady; incorrect → transient (update stride)
- steady: correct → steady; incorrect → init
- transient: correct → steady; incorrect → no-pred (update stride)
- no-pred: correct → transient; incorrect → no-pred (update stride)
The prefetch address is checked against the L1 tags and the Outstanding Request List before being issued.
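A minimal sketch of one RPT entry's update, following the four-state scheme from the slide (the L1 tag check and ORL are not modeled, and -1 stands in for "no prefetch"):

```c
#include <assert.h>

enum rpt_state { INIT, TRANSIENT, STEADY, NOPRED };

typedef struct {
    long prev_addr, stride;
    enum rpt_state state;
} rpt_entry_t;

/* Update one RPT entry for a new access at addr; returns the address to
 * prefetch, or -1 when the entry is in no-pred (the only state that
 * suppresses prefetching). */
long rpt_access(rpt_entry_t *e, long addr) {
    int correct = (addr == e->prev_addr + e->stride);
    switch (e->state) {
    case INIT:      e->state = correct ? STEADY : TRANSIENT; break;
    case STEADY:    e->state = correct ? STEADY : INIT;      break;
    case TRANSIENT: e->state = correct ? STEADY : NOPRED;    break;
    case NOPRED:    e->state = correct ? TRANSIENT : NOPRED; break;
    }
    if (!correct && e->state != INIT)  /* steady -> init keeps the old stride */
        e->stride = addr - e->prev_addr;
    e->prev_addr = addr;
    return (e->state == NOPRED) ? -1 : e->prev_addr + e->stride;
}
```

For a constant-stride stream (0, 4, 8, 12, ...) the entry reaches steady after two accesses and then keeps predicting the next element.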
Shortcomings
- The basic technique only prefetches one iteration ahead – remedy: lookahead predictor
- Mispredictions at the end of every inner loop
Lookahead Prediction
- Note: these are in-order processors; fetch stalls when instructions stall, but the PC can continue to be incremented
- The lookahead PC (LA-PC) accesses the branch predictor and BTB to make forward progress
- The LA-PC indexes into the RPT and can run a bounded number of instructions ahead of the real PC
- Note the additional branch predictor and BTB ports required
Lookahead Prediction (organization)
The LA-PC, decoupled from the PC, looks up the branch predictor and the RPT, while the PC performs the updates; prefetch addresses go to the Outstanding Request List (ORL). A new field in each RPT entry keeps track of how many steps ahead the LA-PC is.
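The key constraint is that the LA-PC's lead over the architectural PC is capped, so prefetches are issued early but not unboundedly early. A tiny sketch of that bound (the `MAX_LEAD` of 8 instructions is a made-up illustrative value, not the paper's parameter):

```c
#include <assert.h>

#define MAX_LEAD 8  /* hypothetical cap on how far the LA-PC may run ahead */

/* Decoupled front end: pc is the architectural PC, la_pc the lookahead PC. */
typedef struct { int pc, la_pc; } frontend_t;

/* Advance the LA-PC one instruction if its lead permits; returns 1 if it
 * moved (i.e. it could probe the RPT and issue a prefetch this cycle). */
static int la_step(frontend_t *f) {
    if (f->la_pc - f->pc < MAX_LEAD) { f->la_pc++; return 1; }
    return 0;  /* LA-PC waits for the stalled PC to catch up */
}
```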
Correlated Reference Prediction (example)

for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)

Access pattern for B: (1,0) (1,1) (2,0) (2,1) (2,2) (3,0) (3,1) (3,2) (3,3) (4,0) (4,1) (4,2) (4,3) (4,4)
Access pattern for W: (1) (1) (2) (2) (2) (3) (3) (3) (3) (4) (4) (4) (4) (4)
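The B access pattern above follows directly from the loop nest; a small sketch that records the (i,k) index pairs it generates:

```c
#include <assert.h>

#define MAXPAIRS 64

/* Record the (i,k) index pairs with which B is accessed by the loop nest
 * above, for n iterations of the outer loop. */
int b_access_pattern(int n, int out_i[MAXPAIRS], int out_k[MAXPAIRS]) {
    int cnt = 0;
    for (int i = 1; i <= n; i++)
        for (int k = 0; k <= i; k++) {  /* inner trip count grows with i */
            out_i[cnt] = i;
            out_k[cnt] = k;
            cnt++;
        }
    return cnt;
}
```

Because the inner trip count grows each outer iteration, no single stride describes B's address sequence; this is what motivates correlating the predictor with branch history.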
Correlated Reference Prediction
- Inner-loop predictions work well, but the first prediction of each inner loop always fails
- There is a correlation between the branch outcomes and the reference patterns

for i = 1 to N
  for k = 0 to i
    W(i) = W(i) + B(i,k) * W(i-k)

Access pattern for B, with the branch history at each access:
(1,0) 1
(1,1) 101
(2,0) 1011
(2,1) 10111
(2,2) 1011101
(3,0) 10111011
(3,1) 101110111
(3,2) 1011101111
(3,3) 101110111101
(4,0) 1011101111011
(4,1) 10111011110111
(4,2) ...
Implementation
- Each PC keeps track of multiple access patterns (prev_addr and stride)
- The branch history determines which pattern is relevant ("01" refers to an outer-loop access)
- Other branches in the loop can corrupt the history – use the compiler to mark loop-termination branches?
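A hypothetical sketch of such an entry: one (prev_addr, stride) pair per recent branch-history value, so inner-loop and outer-loop accesses from the same PC keep separate strides (the 2-bit history and the update policy are illustrative assumptions, not the paper's exact design):

```c
#include <assert.h>

#define HIST_BITS 2
#define NPAT (1 << HIST_BITS)

/* One stride pattern per branch-history value. */
typedef struct { long prev_addr, stride; } pattern_t;

typedef struct {
    pattern_t pat[NPAT];  /* selected by the last HIST_BITS branch outcomes */
} corr_entry_t;

/* On each access: pick the pattern for the current history, predict the
 * next address from it, then update that pattern only. */
static long corr_access(corr_entry_t *e, unsigned hist, long addr) {
    pattern_t *p = &e->pat[hist & (NPAT - 1)];
    long predicted = p->prev_addr + p->stride;
    p->stride = addr - p->prev_addr;
    p->prev_addr = addr;
    return predicted;
}
```

Interleaving an inner-loop stream (history 11) with an outer-loop stream (history 01) leaves each stride intact, which is exactly what a single-pattern RPT entry cannot do.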
Benchmark Characteristics
Results
Results Summary
- Lookahead is the most cost-effective technique
- The RPT needs 512 entries (4 KB capacity)
- The lookahead distance should be a little more than the memory latency
Modern Processors
- Pentium 4: the I-cache uses the branch predictor and BTB to stay ahead of current execution; the L2 prefetcher attempts to stay 256 bytes ahead, monitors history for multiple streams, and minimizes fetches of unwanted data (the Pentium III had no hardware prefetch)
- Alpha 21364: 16-entry victim caches in the L1D and L2D, and a 16-entry L1I stream buffer
- UltraSPARC III: 2 KB prefetch cache
Next Week's Paper
"Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors", O. Mutlu, J. Stark, C. Wilkerson, and Y. N. Patt, Proceedings of HPCA-9, February 2003
Useful execution while waiting for a cache miss – perhaps prefetching the next miss?