Slide 1: Predictor-Directed Stream Buffers
Timothy Sherwood, Suleyman Sair, Brad Calder
Slide 2: Overview
- Introduction
- Past Stream Buffer Work
- Predictor-Directed Stream Buffers
- Policy Improvements
- Results
- Contributions
Slide 3: Introduction
- The Memory Wall
- Latency reduction through prefetching
  - without consuming too much bandwidth
- Stream buffers are among the most widely used prefetchers
  - simple to implement
  - very efficient
- Pointer-based codes
Slide 4: Past Stream Buffer Work
- Jouppi 1990
  - consecutive cache-line FIFO
- Palacharla and Kessler 1994
  - non-unit stride (based on memory chunk)
  - allocation filters
- Farkas et al. 1997
  - PC-based stride
  - fully associative / non-overlapping
Slide 5: Past Stream Buffer Work
(Diagram: N stride-based stream buffers sit between the data cache/register file/MSHRs and the next lower level of memory. Each buffer holds tag/cache block/comparator entries plus a predicted stride and last address; the predicted stride is stored in the stream buffer on allocation.)
Slide 6: Past Stream Buffer Work
- Past work targeted streaming through arrays
  - either in sequential order
  - or in stride order (multidimensional arrays)
- Could not handle pointer codes
  - repetitive, non-striding references
- Need a more general predictor
Slide 7: Predictor-Directed Stream Buffers
- Goal: simple and efficient hardware-based prefetching of complex but predictable streams
- Approach: take a general predictor and hook it up to the well-established stream buffer front end
- Separates the predictor from the prefetcher
- Can use almost any predictor
  - 2-delta
  - context
  - Markov
Slide 8: PSB Generalized Architecture
(Diagram: an address predictor, indexed by load PC and holding history, stride, confidence, last address, and other prediction info, is updated with load info (PC, address) from the write-back stage. It supplies predicted addresses to N stream buffers, each with tag/cache block/comparator entries plus a subset of the prediction info; the buffers sit between the data cache/register file/MSHRs and the next lower level of memory, and update the prediction information as they run.)
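The decoupling the slide describes can be illustrated as a simple interface: the stream buffer front end only ever calls `update()` (from the write-back stage) and `predict()` (to refill an empty entry), so any address predictor can be dropped in behind it. This is a hedged Python sketch, not the paper's hardware; the class and field names are assumptions.

```python
class AddressPredictor:
    """Generic predictor interface: the stream buffer front end is
    oblivious to which predictor sits behind it."""

    def update(self, load_pc, address):
        # fed load info (PC, address) from the write-back stage
        raise NotImplementedError

    def predict(self, load_pc):
        # asked for the next address when a buffer entry is empty
        raise NotImplementedError


class StridePredictor(AddressPredictor):
    """One possible plug-in: a per-PC stride predictor with a
    simple confidence count (illustrative, not the paper's design)."""

    def __init__(self):
        self.table = {}  # load PC -> (last_address, stride, confidence)

    def update(self, load_pc, address):
        last, stride, conf = self.table.get(load_pc, (address, 0, 0))
        new_stride = address - last
        conf = conf + 1 if new_stride == stride else 0
        self.table[load_pc] = (address, new_stride, conf)

    def predict(self, load_pc):
        last, stride, _ = self.table.get(load_pc, (0, 0, 0))
        return last + stride
```

Swapping in a context or Markov predictor only requires implementing the same two methods.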
Slide 9: PSB Stages
- Allocation
- Prediction
- Probe
- Prefetching
- Lookup
Slide 10: Stage Descriptions
- Allocation
  - a stream buffer is allocated to a particular load
  - the buffer is initialized
  - subject to allocation filters
- Prediction
  - an empty buffer entry asks for an address
  - subject to limited predictor speed
Slide 11: Stage Descriptions (continued)
- Probe
  - if there are free cache ports, remove useless prefetches
  - not mandatory
- Prefetching
  - prefetches are sent to memory, subject to scheduling for ports and priority
- Lookup
  - when a load performs an L1 access, the stream buffers are checked in parallel
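The stages above can be sketched as one simulator cycle. This is an assumed, simplified structure (Probe and Lookup are only noted in comments; buffer size, port counts, and field names are illustrative), not the paper's implementation.

```python
def psb_cycle(buffers, predictor, miss_pc, miss_addr, free_ports):
    """One simplified cycle of the PSB stages: Allocation, Prediction,
    Prefetching (Probe and Lookup elided for brevity)."""
    # Allocation: on an L1 miss, give a free buffer to the missing load
    # (a real PSB would first consult the allocation filter).
    if miss_pc is not None and all(b["pc"] != miss_pc for b in buffers):
        free = next((b for b in buffers if b["pc"] is None), None)
        if free is not None:
            free.update(pc=miss_pc, last=miss_addr, entries=[])

    # Prediction: an empty buffer entry asks the predictor for an address;
    # predictor speed is limited, so only one prediction per cycle here.
    for b in buffers:
        if b["pc"] is not None and len(b["entries"]) < 4:
            addr = predictor.predict(b["pc"], b["last"])
            b["entries"].append({"addr": addr, "fetched": False})
            b["last"] = addr
            break

    # Prefetching: issue unfetched entries, subject to the port limit.
    # (Probe would first drop entries that already hit in the cache;
    # Lookup checks all buffers in parallel on every L1 access.)
    issued = []
    for b in buffers:
        for e in b["entries"]:
            if not e["fetched"] and len(issued) < free_ports:
                e["fetched"] = True
                issued.append(e["addr"])
    return issued
```

A next-line predictor plugged into this loop would allocate a buffer on the first miss and start streaming sequential cache lines on subsequent cycles.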
Slide 12: PSB Implementation
- Tried many different address predictors
- Best is Stride-Filtered Markov (SFM)
  - similar to Joseph and Grunwald's predictor
  - first-order Markov
  - striding behavior is filtered out
- Differences are stored instead of full addresses to reduce table size
Slide 13: Difference Storing
(Figure only.)
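A hedged sketch of the two ideas from the surrounding slides: transitions the stride predictor already gets right are filtered out of the Markov table, and the Markov table stores address differences rather than full next addresses, which is the compression "difference storing" refers to. Table organization and field names here are assumptions, not the paper's exact design.

```python
class StrideFilteredMarkov:
    """Stride-filtered Markov (SFM) sketch: a stride predictor handles
    regular strides, and only non-striding transitions are recorded in
    the first-order Markov table, stored as differences (deltas)."""

    def __init__(self):
        self.stride = {}  # load PC -> (last_address, stride)
        self.markov = {}  # address -> stored difference to the next address

    def update(self, pc, addr):
        last, stride = self.stride.get(pc, (addr, 0))
        delta = addr - last
        if delta != stride:
            # non-striding transition: remember it as a difference
            self.markov[last] = delta
        self.stride[pc] = (addr, delta)

    def predict(self, pc):
        last, stride = self.stride.get(pc, (0, 0))
        if last in self.markov:               # Markov hit wins the MUX
            return last + self.markov[last]
        return last + stride                  # fall back to stride
```

On a striding run the Markov table stays small; a repeated pointer jump (e.g. 108 to 500) is captured once as a stored difference and replayed on the next visit.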
Slide 14: PSB with SFM
(Diagram: load info (PC, address) from the write-back stage updates both a stride predictor (predicted stride, last address) and a Markov predictor, which returns a predicted address on a table hit. A MUX driven by the "markov hit?" signal selects between the predicted Markov address and the predicted stride address. Eight stream buffers, each with tag/cache block/comparator entries and the predicted stride stored on allocation, sit between the data cache/register file/MSHRs and the next lower level of memory.)
Slide 15: Methods
- SimpleScalar 3.0
  - rewrote the memory hierarchy
  - model bandwidth between all levels
  - added perfect store sets
- Ran over a set of pointer-based benchmarks
- 2K-entry predictor table
- 8 stream buffers x 4 entries each
- 32K 4-way set-associative cache
Slide 16: Speedup from PSB
(Figure only.)
Slide 17: Allocation Filtering
- Farkas et al. showed that two-miss filtering
  - prevents too many streams from requesting resources
- Does not work as well for pointer codes
  - irregular miss patterns
- We use priority and accuracy counters
  - track the behavior of loads
  - allocate to loads that are behaving well
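The accuracy-counter idea above can be sketched with per-load saturating counters: useful prefetches bump a load's counter, useless ones decay it, and only loads above a threshold may claim a stream buffer on a miss. Counter width, threshold, and the starting value are all assumptions for illustration.

```python
class AllocationFilter:
    """Per-load saturating accuracy counters: loads must keep predicting
    well to earn stream buffer allocations (widths/thresholds assumed)."""

    def __init__(self, bits=2, threshold=1):
        self.max = (1 << bits) - 1
        self.threshold = threshold
        self.counter = {}  # load PC -> saturating accuracy counter

    def record(self, pc, useful):
        # bump on a useful prefetch, decay on a useless one
        c = self.counter.get(pc, self.threshold)
        self.counter[pc] = min(c + 1, self.max) if useful else max(c - 1, 0)

    def may_allocate(self, pc):
        # new loads start at the threshold so they get a fair first chance
        return self.counter.get(pc, self.threshold) >= self.threshold
```

Unlike two-miss filtering, this keys on whether a load's stream is actually being predicted accurately, which is what matters for irregular pointer-code miss patterns.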
Slide 18: Allocation Filtering Speedup
(Figure only.)
Slide 19: Stream Buffer Priority
- Round robin
  - gives each active buffer equal resources
  - for both prediction and prefetching
- Priority counters
  - a small counter with each buffer
  - use the counters to rank the buffers
  - more resources go to better-performing buffers
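The two policies above can be contrasted in a few lines: round robin rotates ports through the buffers evenly, while priority scheduling ranks buffers by their counter and hands the available predictor/prefetch bandwidth to the best performers first. Field names and the port model are illustrative assumptions.

```python
def round_robin(buffers, ports, start):
    """Baseline: rotate through the buffers, giving each equal resources."""
    n = len(buffers)
    return [buffers[(start + i) % n] for i in range(min(ports, n))]


def priority_schedule(buffers, ports):
    """Priority counters: rank buffers by their counter and give the
    available ports to the better-performing buffers first."""
    ranked = sorted(buffers, key=lambda b: b["priority"], reverse=True)
    return ranked[:ports]
```

With small per-buffer counters incremented on useful prefetches and decremented on useless ones, the ranking naturally shifts resources toward the streams that are currently predicting well.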
Slide 20: Priority Scheduling Speedup
(Figure only.)
Slide 21: Latency Reduction
(Figure only.)
Slide 22: Contributions
- Predictor-Directed Stream Buffers decouple the stream buffer front end from address generation
- Accuracy-based allocation filtering and priority scheduling can make a large difference in performance
- With some simple compression, even small Markov tables can be very effective
Slide 23: Accuracy
(Figure only.)
Slide 24: Bus Results
(Figure only.)