Less is More: Leveraging Belady’s Algorithm with Demand-based Learning


Less is More: Leveraging Belady’s Algorithm with Demand-based Learning
Jiajun Wang, Lu Zhang, Reena Panda, Lizy John
The University of Texas at Austin

Introduction
Why is an efficient LLC replacement policy important?
- The LLC is shared by multiple cores
- LLC accesses have low temporal locality and long data reuse distances
- LLC capacity is small compared with big-data application working-set sizes
Goal
- Ideally, every LLC cache block gets reused before eviction (maximize the total reuse count). This requires:
  - Bypassing streaming accesses
  - Selecting dead blocks as victims

Review of Belady’s Optimal Algorithm
Gives the optimal cache behavior given knowledge of the future: on a miss, the block whose next reference lies furthest ahead in the string of future references is the one replaced.
(Slide animation: access stream A B C D over time on a 2-way fully associative cache.)
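To make the forward-distance rule concrete, here is a minimal Python sketch of Belady’s MIN on a fully associative cache, assuming the full future trace is known (the function name and the example stream are illustrative, not from the slides):

```python
def belady_min(trace, capacity):
    """Count misses under Belady's MIN: on a miss with a full cache,
    evict the resident block whose next reference is furthest away
    (or which is never referenced again)."""
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):
                # Forward distance: index of b's next use, infinite if none.
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

print(belady_min(list("ABCDAB"), capacity=2))  # e.g., a 2-way fully associative cache
```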

Motivation
However… the same miss count does not imply the same cycle penalty:
- Miss latency variance (e.g., servicing a miss from the LLC vs. from DRAM)
- Access type priority (e.g., writebacks and prefetches are not on the critical path)
(Slide animation: accesses of types LD/ST/WB to addresses A–D over time.)

Lime Proposal
Basic idea: a cache replacement policy that leverages the key idea of Belady’s algorithm but focuses on demand accesses (i.e., loads and stores), which have a direct impact on system performance, and bypasses the training process for writeback and prefetch accesses.
Builds on prior work:
- The caching behavior of past load instructions can guide future caching decisions [1][2]
- Leverages Belady’s algorithm on past accesses [3]
[1] W. A. Wong and J.-L. Baer. Modified LRU policies for improving second-level cache behavior. In HPCA 2000.
[2] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer. SHiP: Signature-based hit predictor for high performance caching. In MICRO 2011.
[3] A. Jain and C. Lin. Back to the future: Leveraging Belady’s algorithm for improved cache replacement. In ISCA 2016.

Background: Hawkeye
OPTgen: (figure) an occupancy vector tracked over time for unique addresses A–D, with each reuse interval marked as cached or non-cached.
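For reference, a rough Python sketch of the OPTgen idea that Hawkeye introduced and that the occupancy vector on this slide depicts: a past access counts as an OPT hit only if every time step in its reuse interval still has spare cache capacity. This is a simplified software model (names are mine; Hawkeye’s hardware version quantizes the vector):

```python
def optgen(trace, capacity):
    """Replay past accesses and mark which ones OPT would have cached.
    occupancy[t] = number of lines OPT keeps live across time step t."""
    occupancy = [0] * len(trace)
    last_use = {}                      # block -> time of its previous access
    opt_hits = []
    for t, block in enumerate(trace):
        if block in last_use:
            start = last_use[block]
            # Cached iff the whole reuse interval has room for one more line.
            if all(occupancy[i] < capacity for i in range(start, t)):
                for i in range(start, t):
                    occupancy[i] += 1  # charge the interval with this line
                opt_hits.append(t)
        last_use[block] = t
    return opt_hits
```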

Lime Structure: Overall

Lime Structure: Belady’s Trainer
(Figure: a queue of access entries, ordered from oldest to latest, each holding a PC, an address tag, a cached? bit, and an occupancy vector.)
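A minimal sketch of the per-entry state named on the slide; field widths, the queue depth, and the occupancy-vector bookkeeping are not given here, so all concrete values below are assumptions:

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class TrainerEntry:
    pc: int         # PC of the demand access that touched the line
    addr_tag: int   # address tag, matched when the line is reused
    cached: bool    # Belady's verdict so far: cache-friendly or not

# Entries are kept in access order, oldest first, as on the slide.
trainer = deque(maxlen=64)  # depth of 64 is a placeholder, not from the slides
```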

Handle Writeback and Prefetch
- Load / Store → Belady Trainer → cache or bypass; installed lines are managed with SRRIP replacement in the data cache
- Writeback → installed directly into the data cache, replacing way[0]
- Prefetch → installed directly into the data cache, replacing way[0]
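A self-contained sketch of one cache set under this routing, assuming a 16-way set and 2-bit SRRIP counters (the slide only states that demand insertions use SRRIP while writebacks and prefetches replace way[0]; everything else is illustrative):

```python
WAYS, RRPV_MAX = 16, 3          # 16 ways, 2-bit SRRIP counters (assumed sizes)

class LimeSet:
    def __init__(self):
        self.tags = [None] * WAYS
        self.rrpv = [RRPV_MAX] * WAYS

    def insert_demand(self, tag, should_cache):
        """Loads/stores: honor the classifier's cache/bypass decision."""
        if not should_cache:
            return                          # bypass the LLC entirely
        victim = self._srrip_victim()
        self.tags[victim] = tag
        self.rrpv[victim] = RRPV_MAX - 1    # standard SRRIP insertion

    def insert_non_demand(self, tag):
        """Writebacks/prefetches: no training, fixed victim way."""
        self.tags[0] = tag
        self.rrpv[0] = RRPV_MAX             # first in line for eviction

    def _srrip_victim(self):
        while True:                         # SRRIP victim search with aging
            for w in range(WAYS):
                if self.rrpv[w] == RRPV_MAX:
                    return w
            self.rrpv = [v + 1 for v in self.rrpv]
```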

Lime Structure: PC Classifier
Input: PC. Output: Cached (should the data be installed into the cache?)
- If the PC is not found in the PC Classifier: Cached = true
- Else if the PC is in the RANDOM bin (LUT): Cached = latest cache decision
- Else if the PC is in the KEEP bin (Bloom filter): Cached = true
- Else if the PC is in the BYPASS bin (Bloom filter): Cached = false
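The branch structure above maps directly to code; a sketch in which plain sets and a dict stand in for the Bloom filters and LUT named on the slide (real hardware would use approximate, lossy structures):

```python
class PCClassifier:
    def __init__(self):
        self.keep = set()      # KEEP bin (Bloom filter on the slide)
        self.bypass = set()    # BYPASS bin (Bloom filter on the slide)
        self.random = {}       # RANDOM bin (LUT): pc -> latest cache decision

    def should_cache(self, pc):
        if pc in self.random:
            return self.random[pc]   # follow the latest Belady verdict
        if pc in self.keep:
            return True              # PC has consistently cache-friendly lines
        if pc in self.bypass:
            return False             # PC streams or produces dead blocks
        return True                  # unknown PC: default to caching
```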

Configuration
- Storage cost (table on slide)
- Workloads: SimPoint traces, 200M instructions each
- Single core: 2MB LLC; multicore, multiprogrammed: 8MB LLC
- Compared against LRU

Results: Single Core, w/o Prefetch

Results: Single Core, w/ Prefetch

Results: Multicore, w/o Prefetch

Results: Multicore, w/ Prefetch

Conclusion
LIME respects the observation that load/store misses are more likely to cause pipeline stalls than writeback and prefetch misses.
Significant IPC improvement can be achieved with LIME, even though total misses increase in some cases.

Thank you!