Enhancing Signature Path Prefetching with Perceptron Prefetch Filtering Eshan Bhatia1, Gino Chacon1, Elvira Teran2, Paul V. Gratz1, Daniel A. Jiménez3 1Texas A&M University 2Texas A&M International University 3Texas A&M University / Barcelona Supercomputing Center
Introduction Design Space: Standalone L1D, L2C and LLC Prefetchers Distribution of hardware budget across three prefetchers Interaction among the prefetchers Control over placing the incoming prefetch line (L1D vs L2C vs LLC)
Key Ideas Aggressive L2C Prefetching Optimizing Prefetch Queue Sharing Signature Path Prefetcher (SPP)[Kim, MICRO ‘16] Perceptron-based Prefetch Filtering (PPF)[Bhatia, ISCA ‘19] Optimizing Prefetch Queue Sharing Page based resource sharing Minimal LLC Prefetching Lack of information LLC is a shared resource among cores Coordination between levels Minimizing impact of noisy prefetches on lower level prefetchers
Page Based Resource Sharing Prefetch Queue (PQ) limited in number Valuable resource for L1D / L2C Aggressive (but still accurate) prefetching Takes the current page deep into the speculation path Blocks PQ resources for other pages Timing disparity between multiple pages with interleaved accesses Efficient Resource Utilization Track number of distinct pages in last few memory accesses Divide PQ resource over those pages
L1D Prefetcher: Next-N-Lines Fetches N consecutive lines wrt current demand address N determined through PQ resource availability Page level throttling Tracks per page access pattern for the last two accesses Scores page as +1 delta friendly or averse Throttles prefetching for averse pages
L2C Underlying Prefetcher: SPP Lookahead Prefetcher Uses previous prefetch suggestion to trigger new speculation Recursively iterate and keep compounding the confidence Stop when the confidence falls below a certain threshold Threshold (hyperparameter) is an indication of aggressiveness Less threshold -> more aggressive -> more coverage -> less accuracy Pre-defined trade-off between coverage and accuracy
Enhanced SPP Decoupled coverage and accuracy concerns SPP enhanced to its most aggressive extreme Helps capture complex memory access patterns Increases coverage Perceptron Filtering (PPF) takes care of accuracy
Hashed Perceptron Model Use feature values to index into distinct tables Example: PC, memory address etc Prediction: Lookup, summation, threshold Use xi value to index into table of corresponding Wi Learning occurs when ground truth known Positive Outcome: Increment each feature’s partial prediction weight Negative Outcome: Decrement each feature’s partial prediction weight No multiplication, no division, no complex back-propagation
PPF Architecture Baseline prefetcher: SPP Modified for high coverage Perceptron Weights Tables Tables of 5-bit up-down saturating counters 1 table per feature Variable depth, independent indexing Prefetch and Reject Tables Record prefetches for future training
PPF Design Prefetch suggestions tested using PPF Outcome and indexing metadata recorded in Prefetch / Reject Table Subsequent feedback of a prior prefetch Same perceptron weights re-indexed and updated by +1 / -1
Putting Pieces Together Single Core Configuration L1D: Enhanced Next-N-line L2C: PPF with SPP Triggered on all accesses to L2C Can place prefetches in L2C or LLC LLC: Next Line prefetcher Triggered on demand accesses and only last prefetch from L1D reaching LLC Uses the metadata communication path between the prefetchers Overhead: 49.94 KBs Multi Core Configuration L1D: No Prefetching L2C: PPF with SPP Triggered on all accesses to L2C Can place prefetches in L2C or LLC LLC: SPP (without PPF) Separate tables for each core Modified to be less aggressive than the original SPP (LLC is a shared resource) Overhead: 62.83 KBs
Results Improvement reported over no prefetching Single Core: 40.4% Multi Core: 20.3%
Future Works Better baseline prefetchers for PPF Interaction between the prefetchers Metadata communication path between the levels
Thank you!
Backup Slides
L2C Underlying Prefetcher: SPP