FLEXclusion: Balancing Cache Capacity and On-chip Bandwidth via Flexible Exclusion
Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, Hyesoon Kim

2/26 Outline
- Motivation
- FLEXclusion
  - Design
  - Monitoring & Operation
  - Extension
- Evaluations
- Conclusion

3/26 Introduction
- Today's processors have multi-level cache hierarchies
  - Design options include the size of each level, the inclusion property, the number of levels, ...
- Design choices for cache inclusion:
  - Inclusion: upper-level cache blocks always exist in the lower-level cache
  - Exclusion: upper-level cache blocks must not exist in the lower-level cache
  - Non-inclusion: the lower-level cache may contain upper-level cache blocks
[Diagram: the three inclusion properties between an upper-level and a lower-level cache]

4/26 Trend of Cache Size Ratio
- Trend of the ratio of total non-LLC capacity to LLC capacity
  - A high ratio means more data duplication under inclusion/non-inclusion
[Figure: ratio of non-LLC to LLC sizes of Intel's processors over the past 10 years; once the multi-core era begins (e.g., L2: 4 x 256KB, L3: 6MB), more than 15% of the L3 is duplicated]
- More duplication. For capacity, exclusion is the better option.

5/26 On-Chip Traffic
- What about on-chip traffic? Each design also has a different impact on on-chip traffic.
[Diagram: fill, clean-victim, dirty-victim, and L3-hit flows between DRAM, L3 (LLC), and L2 for a non-inclusive hierarchy (clean victims silently dropped) vs. an exclusive hierarchy (more traffic)]
- For bandwidth, non-inclusion is the better option.
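To make the traffic difference concrete, here is a minimal C++ sketch of the victim, fill, and L3-hit flows from the diagram above. It is an illustration only; the names Policy, Block, and L3Stub are invented for this sketch, not from the paper:

#include <cstdint>
#include <cstdio>

enum class Policy { NonInclusive, Exclusive };

struct Block { std::uint64_t addr; bool dirty; };

struct L3Stub {  // stand-in for the LLC
    void insert(const Block& b)      { std::printf("L3 insert 0x%llx\n", (unsigned long long)b.addr); }
    void invalidate(std::uint64_t a) { std::printf("L3 invalidate 0x%llx\n", (unsigned long long)a); }
};

// L2 eviction: dirty victims are written back to L3 under both policies;
// the policies differ only on clean victims.
void onL2Evict(Policy p, const Block& v, L3Stub& l3) {
    if (v.dirty || p == Policy::Exclusive)
        l3.insert(v);  // exclusion inserts every L2 victim into L3
    // non-inclusion: a clean victim is silently dropped, costing no traffic
}

// Fill from DRAM: non-inclusion installs into both L2 and L3 (duplication);
// exclusion installs into L2 only, so the block reaches L3 later as a victim.
bool fillAlsoInstallsIntoL3(Policy p) { return p == Policy::NonInclusive; }

// L3 hit: the block is sent up to L2; exclusion also invalidates the L3 copy.
void onL3Hit(Policy p, const Block& b, L3Stub& l3) {
    if (p == Policy::Exclusive)
        l3.invalidate(b.addr);
}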

6/26 Static Inclusion
- Question: which design do we want to choose?
[Figure: per-workload comparison; exclusion gives more performance benefit but also consumes more on-chip bandwidth, so some workloads want to go for non-inclusion and others for exclusion]

7/26 Static Inclusion: Problem
- Each policy has its advantages and disadvantages
  - Non-inclusion provides less capacity but better on-chip traffic efficiency
  - Exclusion provides more capacity but worse on-chip traffic efficiency
- Workloads have diverse capacity/bandwidth requirements
- Problem: no single static cache configuration works best for all workloads

8/26 Our Solution: Flexible Exclusion
- Dynamically change the cache inclusion policy according to the workload's requirements!

9/26 Our Solution: Flexible Exclusion
- Provides both non-inclusion and exclusion
  - Captures the best of both capacity and bandwidth behavior
- Key observation: non-inclusion and exclusion require similar hardware
- Benefits of FLEXclusion:
  - Reduces on-chip traffic compared to exclusion
  - Improves performance compared to non-inclusion

10/26 Outline
- Motivation
- FLEXclusion
  - Design
  - Monitoring & Operation
  - Extension
- Evaluations
- Conclusion

11/26 FLEXclusion Overview
- Goal: adapt the cache inclusion policy between non-inclusion and exclusion
- Overall design:
  - Monitoring logic
  - A few logic blocks in the hardware to control traffic

12/26 Design
- EXCL-REG: controls the L2 clean-victim data flow
- NICL-GATE: controls incoming blocks from memory
- Monitoring & policy decision logic: switches the operating mode
[Diagram: L2 cache and last-level cache with EXCL-REG on the L2 clean-victim path, NICL-GATE on the L3 line-fill path, and the policy decision & information collection logic]
- Monitoring logic is already required by many modern cache mechanisms!
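The three hooks above can be thought of as predicates on a single mode bit driven by the policy decision logic. A minimal sketch with invented names; the real EXCL-REG and NICL-GATE are hardware structures, modeled here as functions:

struct FlexController {
    bool exclusiveMode = false;  // PDL output: false = non-inclusive, true = exclusive

    // EXCL-REG: forward clean L2 victims to L3 only in exclusive mode.
    bool forwardCleanVictimToL3() const { return exclusiveMode; }

    // NICL-GATE: install memory fills into L3 only in non-inclusive mode.
    bool installFillIntoL3() const { return !exclusiveMode; }

    // On an L3 hit, exclusive mode invalidates the L3 copy after moving it to L2.
    bool invalidateOnL3Hit() const { return exclusiveMode; }
};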

13/26 Non-inclusive Mode (PDL signals 0)
- Clean L2 victims are silently dropped
- Incoming blocks are installed into both L2 and L3
- Blocks that hit in L3 keep residing in the cache
[Diagram: same datapath as the design slide, with EXCL-REG blocking clean victims and NICL-GATE passing L3 line fills]
- Non-inclusive mode follows typical non-inclusive behavior

14/26 Exclusive Mode (PDL signals 1)
- Clean L2 victims are inserted into L3
- Incoming blocks are installed into L2 only
- Blocks that hit in L3 are invalidated
[Diagram: same datapath, with EXCL-REG passing clean victims and NICL-GATE blocking L3 line fills]
- Performs similarly to a typical exclusive design, except for the L3 insertions from L2

15/26 Requirement Monitoring
- A set-dueling method is used to capture the performance and traffic behavior of exclusion and non-inclusion
  - Sampling sets follow their original behavior; cache misses and L3 insertions are monitored with counters
  - All other sets follow the winning policy
[Diagram: LLC sets divided into non-inclusive sampling sets, exclusive sampling sets, and following sets, with per-policy cache-miss and insertion counters feeding the PDL]
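A minimal sketch of the monitoring side: two groups of sampling sets are pinned to the two policies and feed miss/insertion counters. The sampling stride (64) and the counter widths are placeholders, not values from the paper:

#include <cstdint>

struct DuelMonitor {
    std::uint32_t missNICL = 0, missEX = 0;  // cache-miss counters per policy
    std::uint32_t insNICL  = 0, insEX  = 0;  // L3-insertion counters per policy

    // Example sampling-set selection: sets 0 mod 64 always behave
    // non-inclusively, sets 32 mod 64 always behave exclusively;
    // all other sets follow the winning policy.
    static bool isNICLSample(std::uint32_t set) { return set % 64 == 0; }
    static bool isEXSample(std::uint32_t set)   { return set % 64 == 32; }

    void recordMiss(std::uint32_t set) {
        if (isNICLSample(set))    ++missNICL;
        else if (isEXSample(set)) ++missEX;
    }
    void recordL3Insertion(std::uint32_t set) {
        if (isNICLSample(set))    ++insNICL;
        else if (isEXSample(set)) ++insEX;
    }
};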

16/26 Operating Region
- The winning policy is decided by the Policy Decision Logic (PDL)
- The basic operating mode is determined by Perf_th
- Extensions of FLEXclusion use Insertion_th for further performance/traffic optimization
[Figure: operating regions plotted over exclusion's performance relative to non-inclusion (cache misses, x-axis) and the L3 insertions-per-kilo-instructions (IPKI) difference (y-axis); Miss(NICL) - Miss(EX) > Perf_th selects the exclusive side, and Ins(EX) - Ins(NICL) > Insertion_th separates the exclusive (bypass) and aggressive non-inclusive regions]
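One way to read the region diagram as code. The quadrant-to-extension mapping below is this sketch's reading of the figure, and the default thresholds are placeholders; the paper's PDL defines the real boundaries. It reuses DuelMonitor from the sketch above:

enum class Mode { NonInclusive, NonInclusiveAggressive, Exclusive, ExclusiveBypass };

Mode decide(const DuelMonitor& m,
            std::uint32_t perfTh = 64, std::uint32_t insTh = 1024) {
    // Miss(NICL) - Miss(EX) > Perf_th: exclusion saves enough misses to be worth it.
    bool exclusionHelps = m.missNICL > m.missEX + perfTh;
    // Ins(EX) - Ins(NICL) > Insertion_th: exclusion would add many L3 insertions.
    bool insertionHeavy = m.insEX > m.insNICL + insTh;
    if (exclusionHelps)
        return insertionHeavy ? Mode::ExclusiveBypass       // trim exclusive traffic
                              : Mode::Exclusive;
    return insertionHeavy ? Mode::NonInclusive              // plain non-inclusion
                          : Mode::NonInclusiveAggressive;   // insertions are cheap here
}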

17/26 Extensions of FLEXclusion
- Per-core policy: isolates each application's behavior
- Aggressive non-inclusion: improves performance in non-inclusive mode
- Bypass on exclusive mode: reduces traffic in exclusive mode
[Diagrams: line fill from DRAM, LLC hit, and clean-victim flows for the bypass-on-exclusive and aggressive non-inclusive modes]
- Detailed explanations are in the paper.

18/26 FLEXclusion Operation
- A FLEXclusive cache changes its operating mode at run time
- FLEXclusion does not require any special actions
  - on a switch from non-inclusive to exclusive mode
  - on a switch from exclusive to non-inclusive mode
[Diagram: a FLEXclusive L2/LLC hierarchy moving from non-inclusive to exclusive and back; across fills, hits, and evictions, dirty victims are written back into the same position in either mode]
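The "no special actions" point is visible in code form: a mode switch is just an update of the PDL bit at the end of a dueling epoch, with no flush or tag walk. A sketch reusing FlexController, DuelMonitor, and decide() from above (onDuelEpochEnd is an invented name):

void onDuelEpochEnd(FlexController& ctl, const DuelMonitor& m) {
    const Mode next = decide(m);
    ctl.exclusiveMode = (next == Mode::Exclusive || next == Mode::ExclusiveBypass);
    // No flush, no tag walk: resident blocks simply see the new fill/victim
    // gating, and dirty victims are written back into the same position in
    // either mode (per the slide).
}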

19/26 Outline
- Motivation
- FLEXclusion
  - Design
  - Monitoring & Operation
  - Extension
- Evaluations
- Conclusion

20/26 Evaluations
- MacSim simulator
  - A cycle-level in-house simulator (now public)
  - Power results with Orion (Wang+ [MICRO'02])
- Baseline processor: 4-core, 4.0GHz, private L1 and L2, shared L3
- Workloads
  - Group A (low MPKI): bzip2, gcc, hmmer, h264, xalancbmk, calculix
  - Group B (high MPKI): mcf, omnetpp, bwaves, soplex, leslie3d, wrf, sphinx3
  - Multi-programmed: 2-MIX-S, 2-MIX-A, 4-MIX-S
- Other results in the paper: multi-programmed workloads, per-core policy, aggressive mode, bypass, threshold sensitivity

21/26 Evaluations: Performance/Traffic
[Performance chart: FLEXclusion performs similarly to exclusion (avg. 6.3% loss for the 1MB configuration) and improves performance by 5.9% over non-inclusion]
[Traffic chart: FLEXclusion reduces L3 insertion traffic by 72.6% relative to exclusion]

22/26 Evaluations: Effective Cache Size
- Running the same benchmark on 1, 2, and 4 cores (4MB L3)
[Chart: with one core, a single thread enjoys the whole cache; with more cores, threads compete for the shared cache and the FLEXclusive cache is configured in exclusive mode more often]
- FLEXclusion adapts the inclusion policy to the effective cache size available to each workload!

23/26 Evaluations: Traffic & Power
- What is the total impact of the L3 insertion traffic reduction?
[Chart: with exclusion, L3 insertions take up more than 40% of on-chip traffic; FLEXclusion reduces this to roughly 10%, yielding a 20% power reduction]
- FLEXclusion effectively reduces the traffic

24/26 Outline
- Motivation
- FLEXclusion
  - Design
  - Monitoring & Operation
  - Extension
- Evaluations
- Conclusion

25/26 Conclusions & Future Work
- FLEXclusion balances performance and on-chip bandwidth consumption
  - depending on the workload requirements
  - with negligible hardware changes
- 5.9% performance improvement over non-inclusion
- 72.6% L3 insertion traffic reduction over exclusion (20% power reduction)
- Future work
  - A more generic FLEXclusion that also covers the inclusion property
  - Impact on the on-chip network

26/26 Q/A
- Thank you!