Less is More: Leveraging Belady’s Algorithm with Demand-based Learning Jiajun Wang, Lu Zhang, Reena Panda, Lizy John The University of Texas at Austin
Introduction Why is an efficient LLC replacement policy important? The LLC is shared by multiple cores; LLC accesses have low temporal locality and long data reuse distances; LLC capacity is small compared with big-data application working-set sizes. Goal Ideally, every LLC cache block gets reused before eviction (maximize total reuse count). This requires: bypassing streaming accesses; selecting dead blocks as victims.
Review of Belady’s Optimal Algorithm Gives the optimal cache behavior given knowledge of the future: at the time of a miss, the block with the largest forward distance in the string of future references should be replaced. (Figure: access sequence A B C D … A C B over time, on a 2-way fully associative cache.)
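The rule above can be sketched as a small simulator. This is a minimal illustration, not the paper's implementation; the function names (`belady_victim`, `simulate_opt`) are hypothetical, and the trace below is the 2-way example from the slide.

```python
def belady_victim(cache, future_refs):
    """Belady's OPT victim: the cached block whose next reference is
    furthest in the future (or never referenced again)."""
    furthest, victim = -1, None
    for block in cache:
        try:
            dist = future_refs.index(block)  # forward reuse distance
        except ValueError:
            return block  # never used again: evict immediately
        if dist > furthest:
            furthest, victim = dist, block
    return victim

def simulate_opt(trace, ways):
    """Count hits for a fully associative cache of `ways` blocks
    replaced with Belady's algorithm."""
    cache, hits = [], 0
    for i, addr in enumerate(trace):
        if addr in cache:
            hits += 1
        elif len(cache) < ways:
            cache.append(addr)
        else:
            cache.remove(belady_victim(cache, trace[i + 1:]))
            cache.append(addr)
    return hits
```

On the slide's sequence A B C D A C B with 2 ways, OPT evicts B on the miss to C (B's next use is furthest away), preserving the later hit on A.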
Motivation However… equal miss counts do not imply equal cycle penalty: Miss latency varies (e.g., fetching missed data from another LLC vs. from DRAM); Access types have different priorities (e.g., writebacks and prefetches are not on the critical path). (Figure: timeline of accesses with types LD/ST/WB to addresses A–D.)
Lime Proposal Basic idea: A cache replacement policy that leverages the key idea of Belady’s algorithm but focuses on demand accesses (i.e., loads and stores), which have a direct impact on system performance, and bypasses the training process for writeback and prefetch accesses. Builds on prior work: Caching behavior of past load instructions can guide future caching decisions [1][2]; Leverages Belady’s algorithm on past accesses [3]. [1] W. A. Wong and J.-L. Baer. Modified LRU policies for improving second-level cache behavior. In HPCA 2000. [2] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In MICRO 2011. [3] A. Jain and C. Lin. Back to the future: Leveraging Belady’s algorithm for improved cache replacement. In ISCA 2016.
Background: Hawkeye OPTgen replays past accesses and uses an occupancy vector to decide whether OPT would have cached each block. (Figure: occupancy vector over time for unique addresses A–D, marking cached vs. non-cached intervals.)
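OPTgen's liveness-interval check can be sketched as follows. This is an illustrative model under my reading of Hawkeye, not its hardware implementation (which uses set sampling and limited-length vectors); the function name `optgen` is hypothetical.

```python
def optgen(trace, capacity):
    """For each access, decide whether Belady's OPT would have hit.
    occupancy[t] counts blocks OPT keeps cached at time t; a reuse is
    an OPT hit iff every slot in its reuse interval has spare room."""
    occupancy = [0] * len(trace)
    last_use = {}
    decisions = []
    for t, addr in enumerate(trace):
        if addr in last_use:
            start = last_use[addr]
            if all(occupancy[i] < capacity for i in range(start, t)):
                # Room exists over the whole interval: OPT caches it
                for i in range(start, t):
                    occupancy[i] += 1
                decisions.append('hit')
            else:
                decisions.append('miss')
        else:
            decisions.append('miss')  # cold miss
        last_use[addr] = t
    return decisions
```

Note that this models OPT with bypassing: a block whose interval does not fit is simply never occupied, so it cannot displace a block that would have hit.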
Lime Structure: Overall
Lime Structure: Belady’s Trainer (Figure: trainer entries ordered from oldest to latest access; each entry holds PC, address tag, a cached? bit, and an occupancy vector.)
Handle Writeback and Prefetch

              Load / Store     Writeback       Prefetch
Training:     Belady Trainer   (skipped)       (skipped)
Install:      Cache / Bypass   Cache           Cache
Replacement:  SRRIP            Replace way[0]  Replace way[0]

(All installed blocks go into the data cache.)
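The per-access-type dispatch can be sketched as below. This is a simplified model under my reading of the slide, not the LIME implementation: `CacheSet`, `handle_access`, and the `should_cache` callback (standing in for the PC classifier) are all hypothetical names, and the Belady trainer update is omitted.

```python
class CacheSet:
    """Minimal set model: ways hold addresses; RRPVs drive SRRIP."""
    def __init__(self, ways=4):
        self.blocks = [None] * ways
        self.rrpv = [3] * ways            # 2-bit re-reference values

    def srrip_victim(self):
        # SRRIP: evict a way with max RRPV (3), aging all ways if none
        while 3 not in self.rrpv:
            self.rrpv = [r + 1 for r in self.rrpv]
        return self.rrpv.index(3)

    def insert(self, addr, way):
        self.blocks[way] = addr
        self.rrpv[way] = 2                # SRRIP long re-reference insert

def handle_access(kind, pc, addr, cset, should_cache):
    """Dispatch by access type: demand accesses consult the classifier
    and use SRRIP; writeback/prefetch skip training and replace way[0]."""
    if kind in ('load', 'store'):
        if should_cache(pc):              # PC classifier decision
            cset.insert(addr, cset.srrip_victim())
        # else bypass: do not install
    else:                                 # 'writeback' or 'prefetch'
        cset.insert(addr, 0)
```

Pinning non-demand fills to way[0] confines blocks that never trained the predictor to a single way, so they cannot pollute the rest of the set.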
Lime Structure: PC Classifier Input: PC. Output: Cached (should install data into cache). If PC is not found in PC Classifier: Cached = true; Else if PC is in RANDOM bin: Cached = latest cache decision; Else if PC is in KEEP bin: Cached = true; Else if PC is in BYPASS bin: Cached = false. Bins: KEEP (Bloom filter), BYPASS (Bloom filter), RANDOM (LUT).
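The lookup logic above can be sketched directly. A hedged illustration: plain sets stand in for the slide's Bloom filters, a set plus callback for the RANDOM LUT, and `classify` is a hypothetical name.

```python
def classify(pc, keep, bypass, random_bin, latest_decision):
    """PC classifier lookup (sketch). Returns True when data fetched
    by this PC should be installed in the cache."""
    if pc not in keep and pc not in bypass and pc not in random_bin:
        return True                    # unknown PC: cache by default
    if pc in random_bin:
        return latest_decision(pc)     # follow the latest cache decision
    if pc in keep:
        return True                    # KEEP bin
    return False                       # BYPASS bin
```

Defaulting unknown PCs to "cache" is the safe choice: a wrong bypass loses a reuse outright, while a wrong install only costs one way temporarily.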
Configuration Storage cost Workloads: SimPoints of 200M instructions Single core, 2MB LLC Multicore, multiprogrammed, 8MB LLC Compared against LRU
Results: Single Core. w/o prefetch
Results: Single Core. w/ prefetch
Results: Multicore. w/o prefetch
Results: Multicore. w/ prefetch
Conclusion LIME respects the observation that load/store misses are more likely to cause pipeline stalls than writeback and prefetch misses. LIME achieves significant IPC improvement, even though total misses increase in some cases.
Thank you!