Cache Replacement Policy Using Map-based Adaptive Insertion Yasuo Ishii 1,2, Mary Inaba 1, and Kei Hiraki 1 1 The University of Tokyo 2 NEC Corporation.


Introduction

Modern computers have a multi-level cache system, and improving LLC performance is key to achieving high overall performance. The LLC stores many dead blocks, so eliminating dead blocks in the LLC improves system performance. (Diagram: CORE, L1, L2, LLC (L3), Memory.)

Introduction

Many multi-core systems adopt a shared LLC, which raises issues: thrashing caused by other threads and fairness of the shared resource. Dead-block elimination is therefore even more effective for multi-core systems. (Diagram: CORE 1 through CORE N, each with private L1 and L2, sharing the LLC (L3) and Memory.)

Trade-offs of Prior Works

Policy             | Replacement Algorithm          | Dead-block Elimination | Additional HW Cost
LRU                | Insert to MRU                  | None                   | None
DIP [2007 Qureshi+] | Partially random insertion    | Some                   | Several counters (light)
LRF [2009 Xiang+]  | Predicts from reference pattern | Strong                | Shadow tag, PHT (heavy)

Problem of dead-block prediction: inefficient use of the data structure (c.f. shadow tag).

Map-based Data Structure

A shadow tag spends a full 40-bit tag on every tracked line (40 bits/line). A map-based history instead keeps one shared tag per zone of the memory address space plus roughly 1 bit of state per line; the illustrated 3-line zone works out to 15.3 bits/line (= (40b + 6b) / 3 lines). The map-based data structure improves cost-efficiency when there is spatial locality.
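A minimal cost sketch of the comparison above, using the figures quoted on the slides: a shadow tag spends a full 40-bit tag per tracked line, while a map amortizes one tag over a whole zone plus per-line state. The 1-bit-per-line state assumed for the best case is an illustration, not a statement of the exact hardware encoding.

```python
SHADOW_TAG_BITS = 40  # bits per tracked line with a shadow tag

def map_bits_per_line(tag_bits, lines_per_zone, state_bits_per_line):
    # One shared tag amortized over the zone, plus per-line state bits.
    return (tag_bits + lines_per_zone * state_bits_per_line) / lines_per_zone

# Best case from the slides: a 256-line (16KB) zone costs ~1.2 bits/line,
# roughly 20x cheaper than the 40 bits/line of a shadow tag.
best_case = map_bits_per_line(40, 256, 1)
print(round(best_case, 2))
```

With these assumed parameters the map's advantage grows with zone size, which is why the approach pays off when accesses have spatial locality.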

Map-based Adaptive Insertion (MAIP)

MAIP modifies the insertion position according to the estimated reuse possibility: cache bypass or LRU position for low reuse possibility, and middle of MRU/LRU or MRU position for high reuse possibility. It adopts a map-based data structure to track many memory accesses and exploits two localities to estimate reuse possibility.

Hardware Implementation

Memory access map: collects memory access history and memory reuse history. Bypass filter table: collects the data-reuse frequency of memory access instructions. Reuse possibility estimation logic: estimates reuse possibility from the information of the other components and produces the insertion position for the last-level cache.

Memory Access Map (1)

Each map entry covers one zone of memory, holding one state per cache line (state diagram: Init, then Access on first touch). The map detects one piece of information, data reuse: was the accessed line previously touched? An access to a line already in the Access state is a data reuse.
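A minimal sketch of the per-line state machine just described: every line in a zone starts in Init, the first touch moves it to Access, and any later access to a line already in Access counts as a data reuse. The 256-line zone size matches the later slides; everything else is illustrative.

```python
INIT, ACCESS = 0, 1

class AccessMapEntry:
    """One memory access map entry: one state per cache line in the zone."""

    def __init__(self, lines_per_zone=256):
        self.state = [INIT] * lines_per_zone

    def touch(self, line_offset):
        """Record an access to one line; return True if it is a data reuse."""
        reused = self.state[line_offset] == ACCESS
        self.state[line_offset] = ACCESS  # first touch: Init -> Access
        return reused

entry = AccessMapEntry()
print(entry.touch(3))  # False: first touch
print(entry.touch(3))  # True: data reuse
```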

Memory Access Map (2)

Each map entry attaches two counters, an access count and a reuse count, to detect one statistic: spatial locality, i.e., how often the neighboring lines are reused.

Data reuse metric = reuse count / access count
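A minimal sketch of the two counters above and the data-reuse metric they produce. Counter widths and saturation are left out for clarity; they are hardware details the slide does not specify.

```python
class ZoneCounters:
    """Access/reuse counters attached to one memory access map entry."""

    def __init__(self):
        self.access_count = 0
        self.reuse_count = 0

    def record(self, is_reuse):
        # Every access bumps the access count; reuses also bump the reuse count.
        self.access_count += 1
        if is_reuse:
            self.reuse_count += 1

    def data_reuse_metric(self):
        # High values mean neighboring lines in this zone are reused often.
        if self.access_count == 0:
            return 0.0
        return self.reuse_count / self.access_count
```

For example, one first touch followed by one reuse yields a metric of 0.5.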

Memory Access Map (3)

Implementation: maps are stored in a cache-like structure, with the address split into a cache offset, map offset, map index, and map tag. Cost-efficiency: each entry has 256 states and tracks 16KB of memory (16KB = 64B x 256 states), requiring only ~1.2 bits to track one cache line in the best case.
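A minimal sketch of the address split implied by the slide: a 64-byte line gives a 6-bit cache offset and a 256-line zone gives an 8-bit map offset, with the remaining bits forming the map index and map tag. The 4-bit index width (192 entries / 12 ways = 16 sets, from the evaluation slide) is an assumption.

```python
LINE_BITS, ZONE_BITS, INDEX_BITS = 6, 8, 4  # 64B line, 256-line zone, 16 sets (assumed)

def split_address(addr):
    # Peel fields off the address from least- to most-significant bits.
    cache_offset = addr & ((1 << LINE_BITS) - 1)
    map_offset = (addr >> LINE_BITS) & ((1 << ZONE_BITS) - 1)
    zone = addr >> (LINE_BITS + ZONE_BITS)
    map_index = zone & ((1 << INDEX_BITS) - 1)
    map_tag = zone >> INDEX_BITS
    return map_tag, map_index, map_offset, cache_offset

def join_address(map_tag, map_index, map_offset, cache_offset):
    # Inverse of split_address, useful for sanity-checking the field widths.
    zone = (map_tag << INDEX_BITS) | map_index
    return (((zone << ZONE_BITS) | map_offset) << LINE_BITS) | cache_offset
```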

Bypass Filter Table

Each entry is a saturating counter: count up on data reuse, count down on first touch. The table (8-bit x 512-entry) is indexed by the program counter and classifies instructions from rarely reused (BYPASS, USELESS) through NORMAL to frequently reused (USEFUL, REUSE). It detects one statistic, temporal locality: how often does the instruction reuse data?
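A minimal sketch of the table above: 512 entries of 8-bit saturating counters indexed by the program counter, incremented on data reuse and decremented on a first touch. The PC hashing and the thresholds that would map counter values onto the BYPASS/USELESS/NORMAL/USEFUL/REUSE classes are assumptions.

```python
class BypassFilterTable:
    ENTRIES = 512
    COUNTER_MAX = 255  # 8-bit saturating counter

    def __init__(self):
        # Start every counter at the midpoint (no bias either way).
        self.table = [self.COUNTER_MAX // 2] * self.ENTRIES

    def _index(self, pc):
        return (pc >> 2) % self.ENTRIES  # drop instruction-alignment bits (assumed)

    def update(self, pc, data_reuse):
        i = self._index(pc)
        if data_reuse:
            self.table[i] = min(self.table[i] + 1, self.COUNTER_MAX)  # count up on reuse
        else:
            self.table[i] = max(self.table[i] - 1, 0)                 # count down on first touch

    def counter(self, pc):
        return self.table[self._index(pc)]
```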

Reuse Possibility Estimation Logic

Uses two localities plus data-reuse information: (1) data reuse, from the hit/miss of the corresponding LLC lookup and the corresponding state in the memory access map; (2) spatial locality of data reuse, from the reuse frequency of neighboring lines; (3) temporal locality of the memory access instruction, from the reuse frequency of the corresponding instruction. The logic combines this information to decide the insertion policy.

Additional Optimization

Adaptive dedicated set reduction (ADSR), an enhancement of set dueling [2007 Qureshi+]: it reduces the number of dedicated sets when PSEL is strongly biased, converting some LRU-dedicated and MAIP-dedicated sets into additional follower sets. (Diagram: sets 0-7 before and after, with dedicated sets becoming followers.)
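A minimal sketch of the set-dueling baseline that ADSR refines: misses in the LRU-dedicated sets push a saturating PSEL counter one way, misses in the MAIP-dedicated sets push it the other, and follower sets adopt the winning policy. The 10-bit PSEL width matches the evaluation slide; the bias margin at which ADSR would convert dedicated sets into extra followers is an assumption, marked in the comments.

```python
class SetDueling:
    PSEL_MAX = (1 << 10) - 1  # 10-bit policy selection counter (from the slides)

    def __init__(self):
        self.psel = self.PSEL_MAX // 2  # start unbiased

    def on_dedicated_miss(self, policy):
        if policy == "LRU":      # LRU-dedicated set missed: evidence for MAIP
            self.psel = min(self.psel + 1, self.PSEL_MAX)
        else:                    # MAIP-dedicated set missed: evidence for LRU
            self.psel = max(self.psel - 1, 0)

    def follower_policy(self):
        return "MAIP" if self.psel > self.PSEL_MAX // 2 else "LRU"

    def strongly_biased(self, margin=256):
        # ADSR would shrink the dedicated sets when this holds (assumed margin).
        return abs(self.psel - self.PSEL_MAX // 2) > margin
```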

Evaluation

Benchmarks: SPEC CPU2006, compiled with GCC 4.2; 100M instructions evaluated after skipping 40G instructions. MAIP configuration (per-core resources): memory access map of 192 entries, 12-way; bypass filter of 512 entries with 8-bit counters; 10-bit policy selection counter. DIP and TADIP-F are evaluated for comparison.

Cache Miss Count (1-core)

MAIP reduces MPKI by 8.3% from LRU; OPT reduces MPKI by 18.2% from LRU.

Speedup (1-core & 4-core)

MAIP improves performance by 2.1% in the 1-core configuration and by 9.1% in the 4-core configuration. (Figure: per-benchmark 1-core and 4-core speedup results.)

Cost Efficiency of Memory Access Map

The map requires 1.9 bits/line on average, roughly 20 times better than a shadow tag. It covers more than 1.00MB (the LLC size) in 9 of 18 benchmarks and more than 0.25MB (the MLC size) in 14 of 18 benchmarks.

Related Work

Uses of spatial/temporal locality: spatial locality [1997, Johnson+]; different types of locality [1995, González+]. Prediction-based dead-block elimination: dead-block prediction [2001, Lai+]; Less Reused Filter [2009, Xiang+]. Modified insertion policies: Dynamic Insertion Policy [2007, Qureshi+]; Thread-Aware DIP [2008, Jaleel+].

Conclusion

Map-based Adaptive Insertion Policy (MAIP): a map-based data structure roughly 20x more cost-effective than a shadow tag, with reuse possibility estimation that exploits both spatial and temporal locality, improving performance over LRU/DIP. In the simulation study, MAIP reduces cache miss count by 8.3% from LRU and improves IPC by 2.1% in the 1-core and 9.1% in the 4-core configuration.

Comparison

Policy             | Replacement Algorithm           | Dead-block Elimination | Additional HW Cost
LRU                | Insert to MRU                   | None                   | None
DIP [2007 Qureshi+] | Partially random insertion     | Some                   | Several counters (light)
LRF [2009 Xiang+]  | Predicts from reference pattern | Strong                 | Shadow tag, PHT (heavy)
MAIP               | Predicts based on two localities | Strong                | Memory access map (medium)

MAIP improves cost-efficiency through the map data structure and prediction accuracy through the two localities.

Q & A

How to Detect Insertion Position

function is_bypass()
    if (Sb = BYPASS) return true
    if (Ca > 16 x Cr) return true
    return false
endfunction

function get_insert_position()
    integer ins_pos = 15
    if (Hm)           ins_pos = ins_pos / 2
    if (Cr > Ca)      ins_pos = ins_pos / 2
    if (Sb = REUSE)   ins_pos = 0
    if (Sb = USEFUL)  ins_pos = ins_pos / 2
    if (Sb = USELESS) ins_pos = 15
    return ins_pos
endfunction

Here Sb is the instruction's bypass filter state, Ca and Cr are the zone's access and reuse counts from the memory access map, and Hm is the hit/miss outcome of the corresponding lookup (per the estimation-logic slide).
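For readers who want to experiment with the decision logic, here is a direct Python transcription of the pseudocode above. The variable names follow the slide (Sb, Ca, Cr, Hm); treating them as plain function parameters and the position range 0-15 as a 16-way set are assumptions for the sketch.

```python
def is_bypass(Sb, Ca, Cr):
    # Bypass the cache when the instruction is classified BYPASS, or when
    # reuses are very rare in this zone (access count dwarfs reuse count).
    if Sb == "BYPASS":
        return True
    if Ca > 16 * Cr:
        return True
    return False

def get_insert_position(Hm, Ca, Cr, Sb):
    ins_pos = 15                  # position 15 = LRU end of a 16-way set
    if Hm:
        ins_pos //= 2             # map hit: promote toward MRU
    if Cr > Ca:
        ins_pos //= 2             # zone reuses outnumber accesses: promote further
    if Sb == "REUSE":
        ins_pos = 0               # frequently reused instruction: insert at MRU
    if Sb == "USEFUL":
        ins_pos //= 2
    if Sb == "USELESS":
        ins_pos = 15              # rarely reused instruction: insert at LRU
    return ins_pos

print(get_insert_position(True, 10, 20, "NORMAL"))  # 3
```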