Slide 1: Predictor-Directed Stream Buffers
Timothy Sherwood, Suleyman Sair, Brad Calder
Slide 2: Overview
- Introduction
- Past Stream Buffer Work
- Predictor-Directed Stream Buffers
- Policy Improvements
- Results
- Contributions
Slide 3: Introduction
- The Memory Wall
- Latency reduction through prefetching
  - without consuming too much bandwidth
- Stream buffers are among the most widely used prefetchers
  - simple to implement
  - very efficient
- Pointer-based codes
Slide 4: Past Stream Buffer Work
- Jouppi 1990
  - consecutive cache-line FIFO
- Palacharla and Kessler 1994
  - non-unit stride (based on memory chunk)
  - allocation filters
- Farkas et al. 1997
  - PC-based stride
  - fully associative / non-overlapping
Slide 5: Past Stream Buffer Work
(Diagram: N stride-based stream buffers sit between the data cache/register file/MSHRs and the next lower level of memory. Each buffer holds tag/cache block/comparator entries plus a predicted stride and last address; the predicted stride is stored in the stream buffer on allocation.)
Slide 6: Past Stream Buffer Work
- Past work targeted streaming through arrays
  - either in sequential order
  - or in stride order (multidimensional arrays)
- Could not handle pointer codes
  - repetitive, non-striding references
- Need a more general predictor
Slide 7: Predictor-Directed Stream Buffers
- Goal: simple and efficient hardware-based prefetching of complex but predictable streams
- Approach: take a general predictor and hook it up to the well-established stream buffer front end
- Separates the predictor from the prefetcher
- Can use almost any predictor
  - 2-delta
  - context
  - Markov
Slide 8: PSB Generalized Architecture
(Diagram: an address predictor, indexed by load PC and holding history, stride, confidence, last address, and other prediction info, is updated with load info (PC, address) from the write-back stage. It supplies predicted addresses to N stream buffers, each with tag/cache block/comparator entries plus a subset of the prediction info; the buffers sit between the data cache/register file/MSHRs and the next lower level of memory, and update the prediction information as they run.)
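The decoupling the slide describes can be illustrated as a simple interface: the stream buffer front end only ever calls `update()` (from the write-back stage) and `predict()` (to refill an empty entry), so any address predictor can be dropped in behind it. This is a hedged Python sketch, not the paper's hardware; the class and field names are assumptions.

```python
class AddressPredictor:
    """Generic predictor interface: the stream buffer front end is
    oblivious to which predictor sits behind it."""

    def update(self, load_pc, address):
        # fed load info (PC, address) from the write-back stage
        raise NotImplementedError

    def predict(self, load_pc):
        # asked for the next address when a buffer entry is empty
        raise NotImplementedError


class StridePredictor(AddressPredictor):
    """One possible plug-in: a per-PC stride predictor with a
    simple confidence count (illustrative, not the paper's design)."""

    def __init__(self):
        self.table = {}  # load PC -> (last_address, stride, confidence)

    def update(self, load_pc, address):
        last, stride, conf = self.table.get(load_pc, (address, 0, 0))
        new_stride = address - last
        conf = conf + 1 if new_stride == stride else 0
        self.table[load_pc] = (address, new_stride, conf)

    def predict(self, load_pc):
        last, stride, _ = self.table.get(load_pc, (0, 0, 0))
        return last + stride
```

Swapping in a context or Markov predictor only requires implementing the same two methods.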
Slide 9: PSB Stages
- Allocation
- Prediction
- Probe
- Prefetching
- Lookup
Slide 10: Stage Descriptions
- Allocation
  - a stream buffer is allocated to a particular load
  - the buffer is initialized
  - subject to allocation filters
- Prediction
  - an empty buffer entry asks for an address
  - subject to limited predictor speed
Slide 11: Stage Descriptions (continued)
- Probe
  - if there are free cache ports, remove useless prefetches
  - not mandatory
- Prefetching
  - prefetches are sent to memory, subject to scheduling for ports and priority
- Lookup
  - when a load performs an L1 access, the stream buffers are checked in parallel
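The stages above can be sketched as one simulator cycle. This is an assumed, simplified structure (Probe and Lookup are only noted in comments; buffer size, port counts, and field names are illustrative), not the paper's implementation.

```python
def psb_cycle(buffers, predictor, miss_pc, miss_addr, free_ports):
    """One simplified cycle of the PSB stages: Allocation, Prediction,
    Prefetching (Probe and Lookup elided for brevity)."""
    # Allocation: on an L1 miss, give a free buffer to the missing load
    # (a real PSB would first consult the allocation filter).
    if miss_pc is not None and all(b["pc"] != miss_pc for b in buffers):
        free = next((b for b in buffers if b["pc"] is None), None)
        if free is not None:
            free.update(pc=miss_pc, last=miss_addr, entries=[])

    # Prediction: an empty buffer entry asks the predictor for an address;
    # predictor speed is limited, so only one prediction per cycle here.
    for b in buffers:
        if b["pc"] is not None and len(b["entries"]) < 4:
            addr = predictor.predict(b["pc"], b["last"])
            b["entries"].append({"addr": addr, "fetched": False})
            b["last"] = addr
            break

    # Prefetching: issue unfetched entries, subject to the port limit.
    # (Probe would first drop entries that already hit in the cache;
    # Lookup checks all buffers in parallel on every L1 access.)
    issued = []
    for b in buffers:
        for e in b["entries"]:
            if not e["fetched"] and len(issued) < free_ports:
                e["fetched"] = True
                issued.append(e["addr"])
    return issued
```

A next-line predictor plugged into this loop would allocate a buffer on the first miss and start streaming sequential cache lines on subsequent cycles.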
Slide 12: PSB Implementation
- Tried many different address predictors
- Best is Stride-Filtered Markov (SFM)
  - similar to Joseph and Grunwald's predictor
  - first-order Markov
  - striding behavior is filtered out
- Differences are stored instead of full addresses to reduce table size
Slide 13: Difference Storing
(Figure only.)
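A hedged sketch of the two ideas from the surrounding slides: transitions the stride predictor already gets right are filtered out of the Markov table, and the Markov table stores address differences rather than full next addresses, which is the compression "difference storing" refers to. Table organization and field names here are assumptions, not the paper's exact design.

```python
class StrideFilteredMarkov:
    """Stride-filtered Markov (SFM) sketch: a stride predictor handles
    regular strides, and only non-striding transitions are recorded in
    the first-order Markov table, stored as differences (deltas)."""

    def __init__(self):
        self.stride = {}  # load PC -> (last_address, stride)
        self.markov = {}  # address -> stored difference to the next address

    def update(self, pc, addr):
        last, stride = self.stride.get(pc, (addr, 0))
        delta = addr - last
        if delta != stride:
            # non-striding transition: remember it as a difference
            self.markov[last] = delta
        self.stride[pc] = (addr, delta)

    def predict(self, pc):
        last, stride = self.stride.get(pc, (0, 0))
        if last in self.markov:               # Markov hit wins the MUX
            return last + self.markov[last]
        return last + stride                  # fall back to stride
```

On a striding run the Markov table stays small; a repeated pointer jump (e.g. 108 to 500) is captured once as a stored difference and replayed on the next visit.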
Slide 14: PSB with SFM
(Diagram: load info (PC, address) from the write-back stage updates both a stride predictor (predicted stride, last address) and a Markov predictor, which returns a predicted address on a table hit. A MUX driven by the "markov hit?" signal selects between the predicted Markov address and the predicted stride address. Eight stream buffers, each with tag/cache block/comparator entries and the predicted stride stored on allocation, sit between the data cache/register file/MSHRs and the next lower level of memory.)
Slide 15: Methods
- SimpleScalar 3.0
  - rewrote the memory hierarchy
  - model bandwidth between all levels
  - added perfect store sets
- Ran over a set of pointer-based benchmarks
- 2K-entry predictor table
- 8 stream buffers x 4 entries each
- 32K 4-way set-associative cache
Slide 16: Speedup from PSB
(Figure only.)
Slide 17: Allocation Filtering
- Farkas et al. showed that two-miss filtering
  - prevents too many streams from requesting resources
- Does not work as well for pointer codes
  - irregular miss patterns
- We use priority and accuracy counters
  - track the behavior of loads
  - allocate to loads that are behaving well
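The accuracy-counter idea above can be sketched with per-load saturating counters: useful prefetches bump a load's counter, useless ones decay it, and only loads above a threshold may claim a stream buffer on a miss. Counter width, threshold, and the starting value are all assumptions for illustration.

```python
class AllocationFilter:
    """Per-load saturating accuracy counters: loads must keep predicting
    well to earn stream buffer allocations (widths/thresholds assumed)."""

    def __init__(self, bits=2, threshold=1):
        self.max = (1 << bits) - 1
        self.threshold = threshold
        self.counter = {}  # load PC -> saturating accuracy counter

    def record(self, pc, useful):
        # bump on a useful prefetch, decay on a useless one
        c = self.counter.get(pc, self.threshold)
        self.counter[pc] = min(c + 1, self.max) if useful else max(c - 1, 0)

    def may_allocate(self, pc):
        # new loads start at the threshold so they get a fair first chance
        return self.counter.get(pc, self.threshold) >= self.threshold
```

Unlike two-miss filtering, this keys on whether a load's stream is actually being predicted accurately, which is what matters for irregular pointer-code miss patterns.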
Slide 18: Allocation Filtering Speedup
(Figure only.)
Slide 19: Stream Buffer Priority
- Round robin
  - gives each active buffer equal resources
  - for both prediction and prefetching
- Priority counters
  - a small counter with each buffer
  - use the counters to rank the buffers
  - more resources go to better-performing buffers
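The two policies above can be contrasted in a few lines: round robin rotates ports through the buffers evenly, while priority scheduling ranks buffers by their counter and hands the available predictor/prefetch bandwidth to the best performers first. Field names and the port model are illustrative assumptions.

```python
def round_robin(buffers, ports, start):
    """Baseline: rotate through the buffers, giving each equal resources."""
    n = len(buffers)
    return [buffers[(start + i) % n] for i in range(min(ports, n))]


def priority_schedule(buffers, ports):
    """Priority counters: rank buffers by their counter and give the
    available ports to the better-performing buffers first."""
    ranked = sorted(buffers, key=lambda b: b["priority"], reverse=True)
    return ranked[:ports]
```

With small per-buffer counters incremented on useful prefetches and decremented on useless ones, the ranking naturally shifts resources toward the streams that are currently predicting well.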
Slide 20: Priority Scheduling Speedup
(Figure only.)
Slide 21: Latency Reduction
(Figure only.)
Slide 22: Contributions
- Predictor-Directed Stream Buffers decouple the stream buffer front end from address generation
- Accuracy-based allocation filtering and priority scheduling can make a large difference in performance
- With some simple compression, even small Markov tables can be very effective
Slide 23: Accuracy
(Figure only.)
Slide 24: Bus Results
(Figure only.)