Characterizing Multi-threaded Applications for Designing Sharing-aware Last-level Cache Replacement Policies. Ragavendra Natarajan¹, Mainak Chaudhuri².

Presentation transcript:

Characterizing Multi-threaded Applications for Designing Sharing-aware Last-level Cache Replacement Policies
Ragavendra Natarajan¹, Mainak Chaudhuri²
¹Department of Computer Science and Engineering, University of Minnesota, Twin Cities
²Department of Computer Science and Engineering, Indian Institute of Technology Kanpur, India
Acknowledgment: Jayesh Gaur, Nithiyanandan Bashyam, Sreenivas Subramoney, Antonia Zhai
IEEE International Symposium on Workload Characterization (IISWC), September 23rd, 2013

Managing shared LLC in multi-threaded applications
• Shared LLC management is crucial in modern CMPs
• Current policies mitigate cross-thread interference and improve intra-thread reuse
• Multi-threaded applications have both intra-thread and cross-thread reuse
• State-of-the-art policies are not optimized for multi-threaded applications
• Can a "sharing-aware" replacement policy improve the LLC performance of multi-threaded applications?

Characterization Infrastructure
• Multi2Sim simulator used to generate LLC access traces from multi-threaded applications
• We model an 8-core CMP architecture:
  - 8-way, 32KB per-core I-L1 and D-L1 caches
  - 8-way, 128KB per-core L2 cache
  - 16-way, inclusive, shared LLC; 4MB and 8MB capacities evaluated (4MB results in paper)
• Applications from the PARSEC, SPLASH, and SPECOMP benchmark suites
• An offline LLC model uses the traces as input to generate statistics
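As a concrete illustration of the offline model, a trace-driven LLC simulator can be as small as the sketch below. This is a simplified stand-in, not the paper's model: it assumes an address-only trace, a single LRU-managed cache, and the 16-way geometry from the slide.

```python
from collections import OrderedDict

class LLCModel:
    """Minimal trace-driven set-associative LLC model with LRU replacement.
    The 16-way geometry follows the slide; the address-only trace format
    and the LRU policy are simplifying assumptions for illustration."""

    def __init__(self, capacity=4 * 1024 * 1024, ways=16, block=64):
        self.ways = ways
        self.block = block
        self.num_sets = capacity // (ways * block)
        # One OrderedDict per set: insertion order tracks LRU -> MRU.
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.block
        s = self.sets[tag % self.num_sets]
        if tag in s:
            self.hits += 1
            s.move_to_end(tag)         # promote to MRU position
        else:
            self.misses += 1
            if len(s) >= self.ways:
                s.popitem(last=False)  # evict the LRU block
            s[tag] = True
```

Replaying a trace through `access` and reading off `hits`/`misses` gives the kind of per-configuration statistics the characterization relies on.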

How important is cross-thread reuse in multi-threaded applications?
• Three categories of LLC fills:
  - No-reuse fills
  - Private-reuse fills → intra-thread reuse
  - Shared fills → cross-thread reuse
• Private-reuse fills and shared fills together constitute the useful LLC fills
• Shared fills form a significant fraction of useful LLC fills

How important is cross-thread reuse? (contd.) Shared LLC fills are more valuable than private fills in multi-threaded applications.

How sharing-aware are current replacement policies?
• All LLC replacement policies have some inherent sharing-awareness
• The LLC replacement policy can significantly affect data sharing in the LLC:
  - Belady's optimal policy → maximum data sharing
  - Realistic policies → less data sharing
• Policies evaluated: LRU, SRRIP & DRRIP (ISCA 2010), SHiP-PC (MICRO 2011), SA-Partition (IPDPS 2009)
• Metrics for quantifying the sharing-awareness of a replacement policy:
  - Fraction of shared LLC fills
  - Average number of distinct sharers per LLC fill
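Both metrics are straightforward to compute offline once each LLC lifetime is annotated with its sharers. In the sketch below, each fill record is assumed to be the set of distinct cores that touched the block during that lifetime (a hypothetical record format, for illustration only).

```python
def sharing_metrics(fills):
    """Compute the two slide metrics from per-lifetime sharer sets.
    `fills` is a list of sets, one per LLC fill; each set holds the
    distinct cores that accessed the block during that LLC lifetime
    (an assumed, illustrative representation).
    Returns (fraction of shared fills, average distinct sharers per fill)."""
    shared = [f for f in fills if len(f) > 1]   # >1 core => shared fill
    frac_shared = len(shared) / len(fills)
    avg_sharers = sum(len(f) for f in fills) / len(fills)
    return frac_shared, avg_sharers
```

Comparing these numbers under Belady's policy against a realistic policy quantifies the sharing-awareness gap the next two slides show.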

How sharing-aware are current replacement policies? (contd.) Fraction of shared fills in existing policies significantly smaller than Belady’s optimal policy

How sharing-aware are current replacement policies? (contd.) Large gap between sharing-awareness of current policies and optimal sharing-awareness

What is the potential improvement with sharing-aware policies?
• Two oracle policies to evaluate the potential improvement: O_one and O_all
• The oracle policies augment a baseline replacement policy with optimal sharing information
• The oracles use an annotated LLC access trace to obtain the sharing information
LLC access trace:
  DR c0 0xabcdef80
  ST c1 0x
  CR c0 0x
  …
Annotated LLC access trace:
  DR c0 0xabcdef80 (LLC miss; evicting 0x)
  ST c1 0x
  CR c0 0x (LLC miss; evicting 0xabcdef80)
  …
The optimal lifetime of 0xabcdef80 spans from its fill in the first access to its eviction in the last access shown
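The optimal lifetimes used to annotate the trace come from Belady's optimal policy; its victim choice can be sketched as below. This is a simplified, single-set illustration: `cached` holds the resident block addresses and `future` is the remaining trace as a plain list of addresses.

```python
def belady_victim(cached, future):
    """Belady's optimal (MIN) replacement choice: evict the resident block
    whose next use lies furthest in the future, or that never recurs.
    Single-set, list-based model -- an illustrative simplification."""
    def next_use(block):
        try:
            return future.index(block)   # distance to next reuse
        except ValueError:
            return float('inf')          # never used again: ideal victim
    return max(cached, key=next_use)
```

Replaying a trace with this victim function pins down, for every fill, the access at which the block is optimally evicted; the span between fill and that eviction is the block's optimal LLC lifetime.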

Sharing-aware oracles: Augmenting sharing information
On an LLC fill: is the block accessed by more than one core before the end of its current optimal LLC lifetime?
• YES → mark the block as shared in the LLC and record the number of sharers
• NO → do not mark the block as shared

Sharing-aware oracles: Updating sharing information
On an LLC hit: is the access from a new sharer?
• NO → don't change the shared bit
• YES → O_one oracle?
  - YES → clear the shared bit
  - NO (O_all oracle) → has the number of sharers seen reached the optimal sharer count?
    • YES → clear the shared bit
    • NO → don't change the shared bit

Sharing-aware oracles: Making sharing-aware replacement decisions
On an LLC eviction: are all cache blocks in the set marked shared?
• YES → clear the shared bits of all blocks, then choose a victim
• NO → choose the victim from the unmarked cache blocks
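The eviction step above can be sketched as follows. Both interfaces are hypothetical: `blocks` maps each way in the set to its shared bit, and `baseline_victim` stands in for the underlying policy's choice among candidate ways.

```python
def choose_victim(blocks, baseline_victim):
    """Sharing-aware eviction step from the slide (illustrative sketch).
    `blocks`: dict mapping way index -> shared bit (hypothetical interface).
    `baseline_victim`: callable picking a victim among candidate ways,
    standing in for the baseline replacement policy (hypothetical)."""
    if all(blocks.values()):
        # All blocks in the set are marked shared: clear every shared bit
        # so the baseline policy can choose freely again.
        for way in blocks:
            blocks[way] = False
    # Restrict the baseline policy's choice to unmarked (non-shared) blocks.
    candidates = [way for way, shared in blocks.items() if not shared]
    return baseline_victim(candidates)
```

The shared bit thus acts as a protection bit: shared blocks survive evictions until every line in the set is protected, at which point protection resets.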

What is the potential improvement with sharing-aware policies? LLC misses incurred by sharing-aware oracles normalized to corresponding baseline policies

What is the potential improvement with sharing-aware policies? O_one and O_all reduce LLC misses across all policies. LLC misses incurred by the sharing-aware oracles are normalized to the corresponding baseline policies.

What are the challenges in realizing sharing-aware replacement policies?
• Realistic implementations of the oracles need fill-time information:
  - Is the cache block likely to be shared during its optimal LLC lifetime?
  - If shared, then how many sharers?
• An LLC fill-time sharing behavior predictor can answer these questions
• We characterize multi-threaded applications to answer the following questions:
  - How is data shared in multi-threaded applications?
  - How predictable is data sharing in multi-threaded applications?

How is data shared in multi-threaded applications?
• Data sharing depends on multiple factors: application characteristics, LLC replacement policy, and LLC capacity
• A cache block is shared at the LLC level if it is shared in at least one of its LLC lifetimes
• LLC lifetime sharing: the amount of data sharing for a given LLC configuration
• Program-level sharing:
  - Maximum possible sharing, with an infinite LLC
  - An application characteristic, independent of LLC size

How is data shared in multi-threaded applications? (contd.) LLC data sharing is sparse. Policies that try to capture program-level sharing will therefore be ineffective (e.g., SA-Partition).

How predictable is data sharing in multi-threaded applications?
• A cache block can be private or shared in each of its LLC lifetimes
• Its sharing behavior can be represented by a binary string (sharing history), e.g., for block 0x: P S P S P S ...
• LLC data sharing is irregular

How predictable is data sharing? (contd.)
• Explore the feasibility of designing sharing behavior predictors
• History-based sharing behavior predictors predict sharing behavior from a history window of the last w LLC lifetimes
• Example pattern table for w = 2 and the history P S P S P S:
  Pattern | # P | # S
  PP      |  0  |  0
  PS      |  2  |  0
  SP      |  0  |  2
  SS      |  0  |  0
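Such a pattern table can be built by sliding a length-w window over a block's sharing history and counting what followed each pattern. The sketch below reproduces the w = 2 counts on the slide for the history P S P S P S.

```python
from collections import Counter

def pattern_table(history, w=2):
    """Build a history-based sharing predictor's pattern table.
    `history` is a string of per-lifetime outcomes, 'P' (private) or
    'S' (shared). For each length-w window, count how often the next
    lifetime was P or S. Keys are (pattern, outcome) pairs."""
    table = Counter()
    for i in range(len(history) - w):
        pattern = history[i:i + w]
        outcome = history[i + w]
        table[(pattern, outcome)] += 1
    return table
```

A predictor would consult the table with a block's last w outcomes and predict whichever outcome dominates; the slide's point is that, for real workloads, neither outcome dominates reliably.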

How predictable is data sharing? (contd.) Sharing behavior does not correlate well with sharing history of shared block addresses

How predictable is data sharing? (contd.)
• Sharing behavior does not correlate well with the sharing history of LLC fill PCs
• Evaluation of sharing predictors with the DRRIP and SHiP-PC policies leads to negligible improvements
• High-level program semantic information may help design sharing-aware policies

Summary
• Cross-thread reuse is critical in multi-threaded applications
• Current policies are not optimized for multi-threaded applications
• Sharing-aware policies can significantly improve the LLC performance of multi-threaded applications
• Sharing-aware policies require a sharing behavior predictor in conjunction with the baseline replacement policy
• Sharing-aware policies must look beyond address- and fill-PC-based predictors
• High-level program semantic information can help design sharing-aware replacement policies

BACKUP

LLC hits to private and shared cache blocks

How sharing-aware are current replacement policies? (contd.) Fraction of shared fills in existing policies significantly smaller than Belady's optimal policy

How sharing-aware are current replacement policies? (contd.) Large gap between sharing-awareness of current policies and optimal sharing-awareness