
ReD: A Policy Based on Reuse Detection for Demanding Block Selection in Last-Level Caches
Javier Díaz¹, Pablo Ibáñez¹, Teresa Monreal², Víctor Viñals¹ and José M. Llabería²
¹ Aragón Institute of Engineering Research (I3A), University of Zaragoza, and HiPEAC
² Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, and HiPEAC

Basic ideas
- A block selection / bypass policy: it can be combined with any other insertion, promotion and victim selection algorithm
- Demanding: blocks are classified as dead on arrival and bypassed by default
- Reuse-based: blocks are stored only if reuse is detected, either the second time they are requested, or if their requesting instruction has been shown to request highly reused blocks

A block selection / bypass policy
- Without selection, most blocks are never requested again from the LLC after they are stored
- Selection therefore has major potential
- Approach:
  - Focus on block selection as a separate problem
  - Enable combination with the other components of the replacement policy

Demanding, reuse-based approach
- Most blocks are not requested again from the LLC after they are stored
- By default, blocks are classified as dead on arrival and bypassed
- Blocks accessed at least twice tend to be reused many times
- Main goal: detect the second request to a block
- This requires remembering the addresses of requests that have recently missed in the LLC
- Inspired by the Reuse Cache (Albericio et al.)

Address Reuse Table (ART)
- Remembers addresses that have recently missed in the LLC
  - Miss in ART → first request to a block → bypass the LLC, insert the address into the ART
  - Hit in ART → second or later request to the block → store the block in the LLC
- Each ART is a set-associative buffer
- Separate from the LLC:
  - Unaffected by decisions of the base replacement policy
  - Simpler to implement
- Private to each core:
  - Increases fairness of reuse detection between threads
  - Reduces inter-core thrashing in the LLC
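The ART flow above can be sketched as follows. This is a minimal, hypothetical simulator-style model, not the authors' implementation: it uses full tags and one block per entry, whereas the real ART uses 11-bit partial tags and tracks four consecutive blocks per entry.

```python
class ART:
    """Set-associative buffer of recently missed block addresses (FIFO)."""

    def __init__(self, num_sets=512, ways=16):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each set: FIFO list of tags

    def lookup_and_update(self, block_addr):
        """Return True if the block should be stored in the LLC."""
        idx = block_addr % self.num_sets
        tag = block_addr // self.num_sets
        s = self.sets[idx]
        if tag in s:           # hit: second (or later) request -> store in LLC
            s.remove(tag)
            return True
        s.append(tag)          # miss: first request -> bypass LLC, remember address
        if len(s) > self.ways:
            s.pop(0)           # FIFO replacement within the set
        return False

art = ART()
print(art.lookup_and_update(0x1234))  # False: first request, bypass
print(art.lookup_and_update(0x1234))  # True: reuse detected, store in LLC
```

Note how bypass and storage decisions fall out of a single lookup: the ART only has to remember misses, not the cache contents themselves.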

The need for a secondary mechanism
- Using only the ART, a block with reuse experiences two LLC misses
- To avoid one of them, the reuse pattern must be predicted at the initial request
- Secondary mechanism:
  - Detects instructions that request highly reused blocks
  - Enables storing blocks requested by those instructions at their initial request
  - Requires remembering the past behavior of instructions and blocks → requires the ART

Program Counter Reuse Table (PCRT)
- Tracks the reuse of blocks requested by each instruction (PC)
- Two counters per entry, #reused and #notreused: they count the addresses that a PC inserts into the ART and that are (or are not) eventually reused
- A PC with reuse probability higher than 1/4 sends all its initial requests to the LLC
- The PCRT is also used to reduce the insertion of addresses into the ART: PCs with very high (>1/4) or very low (<1/64) reuse probability insert only 1 in 8 addresses
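A sketch of the per-PC decision logic described above, assuming the thresholds from the slide; the class and method names are hypothetical, and the handling of a PC with no history yet is our assumption (insert every address so the counters can be trained).

```python
class PCRTEntry:
    """Reuse statistics for one PC (one PCRT entry)."""

    def __init__(self):
        self.reused = 0      # addresses inserted in the ART that were reused
        self.not_reused = 0  # addresses that left the ART without reuse

    def reuse_fraction(self):
        total = self.reused + self.not_reused
        return self.reused / total if total else 0.0

    def store_on_first_request(self):
        # A PC with reuse probability above 1/4 sends all its
        # initial requests directly to the LLC.
        return self.reuse_fraction() > 0.25

    def sample_art_insertions(self):
        # PCs with very high (>1/4) or very low (<1/64) reuse probability
        # insert only 1 in 8 of their addresses into the ART.
        if self.reused + self.not_reused == 0:
            return False     # no history yet: keep inserting every address
        f = self.reuse_fraction()
        return f > 0.25 or f < 1 / 64

e = PCRTEntry()
e.reused, e.not_reused = 30, 70   # 30% observed reuse for this PC
print(e.store_on_first_request())  # True: above the 1/4 threshold
```

The point of the two thresholds is that PCs whose behavior is already well predicted (clearly reused or clearly dead) do not need to keep consuming ART capacity to stay trained.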

ART and PCRT entries
- ART: indexed by the block address; one entry tracks 4 consecutive blocks; fields: PAt (partial address tag) and 4 valid bits
- ART with PC indexes: additionally stores 4 PC indexes (one per tracked block)
- PCRT: tagless; indexed by 8 bits of the PC; two 10-bit counters per entry

Example
[Figure: state of ReD's internal tables after two initial requests (1) (2) and a first-reuse request (3); the ART set shown uses PC sampling]

Other details
- Base replacement policy: 2-bit SRRIP
  - On insertion, it is applied only if ReD decides not to bypass
  - We also tried 3p-4p, with similar results
- No distinction between prefetch and demand requests
- Write-back requests:
  - Ignored by ReD
  - If they miss, they are allocated in the LLC, but with minimum priority
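How ReD composes with the base 2-bit SRRIP policy can be summarized in one decision function. This is a sketch under our reading of the slide (names are hypothetical); in SRRIP, the maximum re-reference prediction value (RRPV) marks a block as the first eviction candidate, i.e., minimum priority.

```python
RRPV_BITS = 2
RRPV_MAX = (1 << RRPV_BITS) - 1   # 3: evict-first priority in SRRIP

def on_llc_miss(red_decides_store, is_writeback):
    """Return (allocate, rrpv) for a block arriving at the LLC."""
    if is_writeback:
        # Write-backs are ignored by ReD; on a miss they are allocated,
        # but with minimum priority (maximum RRPV).
        return (True, RRPV_MAX)
    if red_decides_store:
        # Normal SRRIP insertion: long re-reference interval prediction.
        return (True, RRPV_MAX - 1)
    return (False, None)           # ReD bypasses the block
```

Keeping the bypass decision separate from the RRPV assignment is what lets ReD be combined with other insertion, promotion and victim selection algorithms.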

Results: speedup in single-core configs
[Figure: speedup chart; highlighted values 1.044 and 1.024]

Results: speedup in multi-core configs
[Figure: speedup chart; highlighted values 1.056 and 1.036]

Results: bypass rate (c1)

Thank you

Backup

ART details
- One ART per core
- Set-associative buffer with 16 ways and 512 sets
- FIFO replacement policy
- Partial address tags of 11 bits
- One entry tracks four consecutive LLC blocks, with four valid bits per entry
- Storage: 15616 bytes per core
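The 15616-byte figure can be reproduced from the parameters on this slide; the per-set FIFO pointer cost is our assumption (the slide only gives the total), but the arithmetic works out exactly with a 4-bit pointer per 16-way set.

```python
sets, ways = 512, 16
tag_bits, valid_bits = 11, 4

entry_bits = tag_bits + valid_bits        # 15 bits per ART entry
data_bits = sets * ways * entry_bits      # 122880 bits = 15360 bytes
fifo_bits = sets * 4                      # assumed: 4-bit FIFO pointer per set
total_bytes = (data_bits + fifo_bits) // 8

print(total_bytes)  # 15616
```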

PCRT details
- The PCRT is tagless and has 256 entries, indexed by 8 bits of the trigger PC
- Two 10-bit counters per entry
- When a counter reaches its maximum, both counters of the entry are divided by two
- The PC that requests each address must be stored in the ART
- Set sampling in the ART: only 1/4 of the ART entries include PC information
- Storage: 640 bytes per core (PCRT) plus 8192 bytes per core (PC indexes in the ART)
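Both storage figures on this slide can be reconstructed from the entry formats given earlier. Attributing the 8192-byte figure to the PC indexes held in the sampled ART entries is our inference, but it follows from the "ART and PCRT entries" slide (4 PC indexes of 8 bits each, kept in 1/4 of the ART entries).

```python
# PCRT: 256 tagless entries, each with two 10-bit counters.
pcrt_entries, counter_bits = 256, 10
pcrt_bytes = pcrt_entries * 2 * counter_bits // 8     # 640 bytes

# PC indexes in the ART: set sampling keeps them in 1/4 of the entries.
art_entries = 512 * 16
sampled_entries = art_entries // 4                    # 2048 entries
pc_index_bits = 4 * 8                                 # 4 PC indexes, 8 bits each
pc_bytes = sampled_entries * pc_index_bits // 8       # 8192 bytes

print(pcrt_bytes, pc_bytes)  # 640 8192
```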