Adaptive Subset Based Replacement Policy for High Performance Caching. Liqiang He, Yan Sun, Chaozhong Zhang. College of Computer Science, Inner Mongolia University.

Presentation transcript:

Adaptive Subset Based Replacement Policy for High Performance Caching. Liqiang He, Yan Sun, Chaozhong Zhang. College of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia, P. R. China. JWAC-1: Cache Replacement Championship, ISCA-2010.

Background: The cache replacement policy plays an important role in cache design. The LRU policy is widely used in today's microprocessors. The last-level cache (LLC) sees poor locality because the L1 cache already filters out temporal locality. LRU causes thrashing when the working set is larger than the cache.

Possible solution: If the working set is larger than the cache, retain part of the working set [Qureshi, et al, ISCA’07]. Record part of a longer cache access history. How do we do it? Group the blocks of a cache set and keep part of the access history in each group. Inspired by the thread migration paper of Pierre at HPCA’04. [Figure labels: L2; C0, C1, ..., Cn; g0, g1, ..., gn]

Overview: Proposal: Subset Based Replacement Policy (SRP). SRP successfully reduces misses by retaining part of a longer access history in the groups. However, a static SRP is not suitable for all programs. To adapt to the diversity of programs and to behavior changes inside a program, we propose the Adaptive SRP policy (ASRP). ASRP obtains a 4.5% geometric average miss reduction over LRU.

Outline: Introduction; Static Subset Based Replacement Policy; Adaptive Subset Based Replacement Policy; Summary.

Static Subset Based Replacement Policy. [Figure: a cache set divided into subsets, each with a local LRU stack. Active subset: accepts insertion. The other subsets are non-active.]

Insertion scheme in SRP: Insertion occurs only in the active subset. The victim is the block at the LRU position, and the inserted block is NOT promoted to MRU. Example (blocks in the active subset, listed from MRU to LRU): a b c d becomes a b c i after a reference to ‘i’.

Operation on a cache hit in SRP: On a hit in any subset (active or non-active), the block moves to the local MRU position of its subset. Example (listed from MRU to LRU): a b c d becomes c a b d after a reference to ‘c’.

Changing of the active subset: When the number of misses in a set exceeds a threshold X, the active subset changes. Thus:
A. X consecutive misses are forced to replace only blocks in the active subset.
B. With N subsets, a subset can become active again ONLY after (N-1)*X misses.
C. The greater the value of X, the longer the blocks in non-active subsets can stay in the set.
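Point B can be checked with a small worked calculation. This is only a sketch: it assumes the active subset rotates round-robin and that each of the other N-1 subsets absorbs exactly X misses before handing over (the slide does not state the rotation order). The symbol M_wait is introduced here for illustration; N = 4 and X = 6 are the values used in the thrashing example of this talk.

```latex
M_{\text{wait}} = (N-1)\,X, \qquad N = 4,\; X = 6 \;\Rightarrow\; M_{\text{wait}} = 3 \times 6 = 18
```

Under the same assumption, the largest threshold mentioned for ASRP (X = 128) would give 3 x 128 = 384 misses before a subset becomes active again.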

Thrashing access pattern in SRP: Assume the working set is 24 blocks (b1 b2 b3 ... b24), the LLC is 16-way with 4 subsets of 4 blocks each, and X = 6.
Subset 0 (listed from LRU to MRU): b1 b2 b3 b4 after the first four accesses, then b6 b2 b3 b4 after b5 and b6 replace the block at the LRU position.
Subset 1 then receives b7 ... b12 in the same way, and so on for the remaining subsets.
Blocks in the set with SRP: b2 b3 b4 b6 b8 b9 b10 b12 b14 b15 b16 b18 b20 b21 b22 b24.
Blocks in the set with LRU: b9 ... b24.
When b2 b3 b4 b6 b8 are accessed again, SRP hits but LRU misses.
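The example above can be replayed with a short simulation. This is a minimal sketch, not the authors' implementation: the class and function names are hypothetical, and two details that the slides leave open are assumed here and marked in the comments (empty ways are filled first with the newest block at local MRU, and the active subset rotates round-robin after exactly X misses).

```python
from collections import OrderedDict


class SRPSet:
    """One cache set under static SRP (illustrative sketch)."""

    def __init__(self, num_subsets=4, ways_per_subset=4, threshold=6):
        # Each subset is a list ordered from local MRU (index 0) to local LRU.
        self.subsets = [[] for _ in range(num_subsets)]
        self.ways = ways_per_subset
        self.threshold = threshold
        self.active = 0
        self.misses = 0

    def access(self, block):
        """Return True on a hit, False on a miss."""
        for subset in self.subsets:
            if block in subset:
                # Hit in any subset: move the block to its local MRU position.
                subset.remove(block)
                subset.insert(0, block)
                return True
        target = self.subsets[self.active]
        if len(target) < self.ways:
            # Assumption: empty ways are filled first, newest block at MRU.
            target.insert(0, block)
        else:
            # Replace the block at local LRU; the new block is not promoted.
            target[-1] = block
        self.misses += 1
        if self.misses == self.threshold:
            # Assumption: round-robin rotation after exactly X misses.
            self.active = (self.active + 1) % len(self.subsets)
            self.misses = 0
        return False

    def contents(self):
        return sorted(b for s in self.subsets for b in s)


class LRUSet:
    """One cache set under plain LRU, for comparison."""

    def __init__(self, ways=16):
        self.ways = ways
        self.stack = OrderedDict()  # oldest entry first

    def access(self, block):
        if block in self.stack:
            self.stack.move_to_end(block)
            return True
        if len(self.stack) == self.ways:
            self.stack.popitem(last=False)  # evict the LRU block
        self.stack[block] = True
        return False

    def contents(self):
        return sorted(self.stack)


def run_example():
    srp, lru = SRPSet(), LRUSet()
    for b in range(1, 25):  # first pass over b1 ... b24
        srp.access(b)
        lru.access(b)
    first_pass = (srp.contents(), lru.contents())
    reuse = [2, 3, 4, 6, 8]
    srp_hits = sum(srp.access(b) for b in reuse)
    lru_hits = sum(lru.access(b) for b in reuse)
    return first_pass, srp_hits, lru_hits
```

Calling run_example() returns the set contents after the first pass under both policies, followed by the number of hits each policy gets on the re-accessed blocks b2 b3 b4 b6 b8.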

Case study of a thrashing workload: Different static thresholds differ in how much they reduce misses.

Hardware implementation. [Figure not recoverable; labels: MRU, LRU]

Results: SRP reduces misses for thrashing workloads but increases them for LRU-friendly ones. No single threshold is suitable for all benchmarks.

Outline: Introduction; Static Subset Based Replacement Policy; Adaptive Subset Based Replacement Policy; Summary.

Adaptive SRP policy: Different programs prefer different thresholds. Victim selection and insertion policy are the same as in SRP. The ONLY difference: the threshold is selected dynamically from a pool of values, according to which one causes the fewest misses.
In the ASRP policy:
- The maximum threshold is 128.
- Pick eight values: 2^0, 2^1, ..., 2^7.
- Apply the best threshold value to the cache.

ASRP policy via “Set Dueling”: Divide the cache sets into two types:
- Sampling sets (eight thresholds * 4 sets per threshold)
- Follower sets
Eight counters: a miss in one of threshold X's sampling sets increments counter_x. The counters decide the threshold for the follower sets: the one whose counter has the smallest value.
[Figure: sampling sets Thres-2^0-sets, Thres-2^1-sets, ..., Thres-2^7-sets feed the miss counters Cntr_0 ... Cntr_7 (eight thresholds); the follower sets use the selected threshold.]
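The selection step can be sketched as follows. This is an illustration under stated assumptions rather than the championship code: the class name, the mapping of the first 32 set indices to sampling sets, and the tie-break toward the smaller threshold are all hypothetical. It also assumes each sampling set always runs with its own fixed threshold.

```python
THRESHOLDS = [2 ** i for i in range(8)]  # 1, 2, 4, ..., 128
SETS_PER_THRESHOLD = 4


class ThresholdDuel:
    """Pick the SRP threshold for follower sets from sampling-set misses."""

    def __init__(self):
        self.counters = [0] * len(THRESHOLDS)
        # Hypothetical mapping: set indices 0..31 are sampling sets,
        # four consecutive sets per threshold.
        self.sample_owner = {
            t * SETS_PER_THRESHOLD + k: t
            for t in range(len(THRESHOLDS))
            for k in range(SETS_PER_THRESHOLD)
        }

    def on_miss(self, set_index):
        # Only misses in sampling sets update a counter.
        owner = self.sample_owner.get(set_index)
        if owner is not None:
            self.counters[owner] += 1

    def threshold_for(self, set_index):
        owner = self.sample_owner.get(set_index)
        if owner is not None:
            # Sampling sets keep their dedicated threshold.
            return THRESHOLDS[owner]
        # Follower sets use the threshold with the fewest sampled misses
        # (ties resolved toward the smaller threshold, an assumption).
        best = self.counters.index(min(self.counters))
        return THRESHOLDS[best]
```

A cache model would call on_miss(set_index) on every LLC miss and threshold_for(set_index) whenever a set needs its current value of X.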

Resetting mechanism: To avoid the accumulated effect of a large value in a specific Cntr_x, record how many times the same threshold has been selected by the follower sets. When that count exceeds a threshold, reset all the Cntr_x counters.
[Figure: if last_follow = global_follow (Y), the count is incremented (++); otherwise (N) it is decremented (--); when the count exceeds the threshold, Cntr_0 ... Cntr_7 are reset.]
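The flowchart on this slide can be read as the following sketch. The names last_follow and global_follow come from the slide; the class name, the limit value, saturating the count at zero, and clearing the count after a reset are assumptions made only for illustration.

```python
class ResetMonitor:
    """Reset all miss counters when one threshold keeps winning."""

    def __init__(self, limit=1024):  # limit is a hypothetical value
        self.limit = limit
        self.repeat = 0          # how often the same threshold was re-selected
        self.last_follow = None  # threshold selected at the previous check

    def observe(self, global_follow, counters):
        """Update the count; zero `counters` in place and return True on reset."""
        if global_follow == self.last_follow:
            self.repeat += 1
        else:
            # Assumption: the count saturates at zero when decremented.
            self.repeat = max(0, self.repeat - 1)
        self.last_follow = global_follow
        if self.repeat > self.limit:
            for i in range(len(counters)):
                counters[i] = 0
            self.repeat = 0  # assumption: start counting afresh after a reset
            return True
        return False
```

Here observe() would be called each time the follower sets pick a threshold, with counters being the list of Cntr_0 ... Cntr_7 values.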

Budget: 45K bits in total, which is only 70% of the budget used by the LRU policy and 35% of the total budget provided by this championship.

Results: For a 1MB 16-way LLC, ASRP achieves a geometric average speedup of 4.5% over LRU.

Analysis: The sampling mechanism does help ASRP find the best thresholds for different programs. [Figure panels: xalancbmk, GemsFDTD]

Conclusion: Keeping part of the working set in the cache helps reduce misses when the cache suffers from thrashing. Keeping part of a longer access history helps SRP capture the frequently used blocks more accurately. Different programs, and different phases of a program, prefer different thresholds to maximize cache hits. “Set Dueling” helps ASRP dynamically select a suitable threshold. The experimental results show the effectiveness of the ASRP policy.

Thank you! Any questions?

Results on a multi-core processor

Case study of an LRU-friendly workload

Explanation of active subset changing

A simple example of the SRP policy