2012.05.11 Speaker: Kyu Hyun Choi

Presentation transcript:

Speaker: Kyu Hyun Choi

Problem: interference in shared caches
– Lack of isolation → no QoS
– Poor cache utilization → degraded performance
Cache partitioning addresses interference, but current partitioning techniques (e.g. way-partitioning) have serious drawbacks
– Support few coarse-grain partitions → do not scale to many-cores
– Hurt associativity → degraded performance
Vantage solves the deficiencies of previous partitioning techniques
– Supports hundreds of fine-grain partitions
– Maintains high associativity
– Provides strict isolation among partitions
– Enables cache partitioning in many-cores

Outline: Introduction – Vantage Cache Partitioning – Evaluation

Fully shared last-level caches are the norm in multi-cores
– Better cache utilization, faster communication, cheaper coherence
– But interference → performance degradation, no QoS
An increasingly important problem due to more cores per chip, and to virtualization and consolidation (datacenter / cloud)
– Major performance and energy losses due to cache contention (~2x)
– Consolidation opportunities lost in order to maintain SLAs (Service Level Agreements)

Cache partitioning: divide cache space among competing workloads (threads, processes, VMs)
– Eliminates interference, enabling QoS guarantees
– Adjust partition sizes to maximize performance or fairness, or to satisfy SLAs
Previously proposed partitioning schemes have major drawbacks

Cache partitioning consists of a policy (decides partition sizes to achieve a goal, e.g. fairness) and a scheme (enforces those sizes)
This talk focuses on the scheme
For the policy to be effective, the scheme should be:
1. Scalable: can create hundreds of partitions
2. Fine-grain: partition sizes specified in cache lines
3. Strict isolation: partition performance does not depend on other partitions
4. Dynamic: can create, remove, and resize partitions efficiently
5. Maintains associativity (to keep cache performance high)
6. Independent of replacement policy
7. Simple to implement

Schemes based on restricting line placement
Way-partitioning: restrict insertions to specific ways
+ Strict isolation
+ Dynamic
+ Independent of replacement policy
+ Simple
– Supports only a few coarse-grain partitions
– Hurts associativity

Schemes based on tweaking the replacement policy
PIPP [ISCA 2009]: lines are inserted and promoted in the LRU chain depending on the partition they belong to
+ Dynamic
+ Maintains associativity
+ Simple
– Supports only a few coarse-grain partitions
– Weak isolation
– Sacrifices the replacement policy

Outline: Introduction – Vantage Cache Partitioning – Evaluation

Use a highly-associative cache (e.g. a zcache)
Logically divide the cache into managed and unmanaged regions
Logically partition the managed region
– Leverage the unmanaged region to allow many partitions with minimal interference

Simple approach: use a hash function to index the cache
– Simple hashing reduces conflicts
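As a toy illustration (an added sketch, not from the talk; real designs use stronger hardware hash functions such as the H3 family), a set index can be computed by XOR-folding the block address so that strided access patterns spread across sets:

    #include <cstdint>

    // Toy hash index: XOR-fold the block address before taking the set bits,
    // so regular strides do not all collide in the same set.
    uint32_t setIndex(uint64_t blockAddr, uint32_t numSets) {  // numSets: power of two
        uint64_t h = blockAddr;
        h ^= h >> 17;
        h ^= h >> 33;
        return static_cast<uint32_t>(h) & (numSets - 1);
    }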

A zcache is a highly-associative cache with a low number of ways
– Hits take a single lookup
– On a miss, the replacement process provides many replacement candidates
Provides cheap high associativity (e.g. associativity equivalent to 64 ways with a 4-way cache)
Achieves analytical guarantees on associativity

Hits require a single lookup
Misses exploit the multiple hash functions to obtain an arbitrarily large number of replacement candidates
– Phase 1: walk the tag array to gather candidates and pick the best one
– Phase 2: move a few lines to fit everything
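A simplified sketch of the candidate walk (my illustration of the idea, not the actual hardware; all names are assumptions): each way has its own hash function, so every line has one alternative position in each other way, and each candidate therefore expands into W-1 new ones.

    #include <cstdint>
    #include <functional>
    #include <utility>
    #include <vector>

    // Breadth-first candidate walk over a W-way zcache tag array.
    struct ZWalk {
        int W, R;                                              // ways, candidates wanted
        std::vector<std::function<uint32_t(uint64_t)>> hash;   // one hash per way
        std::function<uint64_t(int, uint32_t)> tagAt;          // read tag at (way, set)

        std::vector<std::pair<int, uint32_t>> candidates(uint64_t missTag) {
            std::vector<std::pair<int, uint32_t>> cands;       // (way, set) positions
            for (int w = 0; w < W; ++w)                        // first level: W positions
                cands.emplace_back(w, hash[w](missTag));
            std::size_t next = 0;                              // BFS cursor
            while ((int)cands.size() < R && next < cands.size()) {
                auto [way, set] = cands[next++];
                uint64_t t = tagAt(way, set);                  // line currently there
                for (int w = 0; w < W && (int)cands.size() < R; ++w)
                    if (w != way)                              // its W-1 other positions
                        cands.emplace_back(w, hash[w](t));
            }
            return cands;
        }
    };

With W = 4 ways, two levels of expansion give 4 + 12 + 36 = 52 candidates, matching the Z4/52 configuration used later in the talk.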

Start the replacement process while fetching Y
(figure sequence: step-by-step example of the replacement walk and relocations)

Eviction priority: the rank of a line given by the replacement policy (e.g. LRU), normalized to [0,1]
– Higher is better to evict (e.g. the LRU line has priority 1.0, the MRU line 0.0)
Associativity distribution: the probability distribution of the eviction priorities of evicted lines
In a zcache, the associativity distribution depends only on the number of replacement candidates (R)
– Independent of ways, workload, and replacement policy
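A back-of-the-envelope derivation (added here; it assumes candidate priorities are roughly i.i.d. uniform on $[0,1]$): the evicted line is the best of the $R$ candidates, so its priority is their maximum, and

    $F_R(x) = \Pr[\max(p_1, \dots, p_R) \le x] = x^R, \qquad \mathbb{E}[\max] = \tfrac{R}{R+1}$

which depends only on $R$. For $R = 52$, the expected eviction priority is $52/53 \approx 0.98$: evictions almost always pick lines that the replacement policy ranks near the very bottom.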

Logical division: tag each block as managed / unmanaged
– The unmanaged region is large enough to absorb most evictions
– The unmanaged region is still used: it acts as a victim cache (demotion → eviction)
– Yields a single partition with a guaranteed size

P partitions + the unmanaged region
Each line is tagged with its partition ID (0 to P-1)
On each miss:
– Insert the new line into the corresponding partition
– Demote one of the candidates to the unmanaged region
– Evict from the unmanaged region
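A minimal sketch of this miss path (hypothetical types and helpers; the aperture test that selects which candidate to demote is omitted, and a real implementation would rank candidates by eviction priority rather than take the first):

    #include <cstdint>
    #include <vector>

    constexpr int16_t UNMANAGED = -1;

    struct Line {
        uint64_t tag = 0;
        int16_t  partition = UNMANAGED;   // 0..P-1, or UNMANAGED
        bool     valid = false;
    };

    // Handle a miss on 'tag' for partition 'part', given the replacement
    // candidates returned by the zcache walk.
    void onMiss(std::vector<Line*>& candidates, uint64_t tag, int16_t part) {
        // Demote: move one managed candidate into the unmanaged region.
        // Only its tag state changes; the data stays where it is.
        for (Line* c : candidates)
            if (c->valid && c->partition != UNMANAGED) {
                c->partition = UNMANAGED;
                break;
            }
        // Evict: replace an unmanaged (or invalid) candidate with the new line.
        for (Line* c : candidates)
            if (!c->valid || c->partition == UNMANAGED) {
                *c = Line{tag, part, true};
                return;
            }
    }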

Problem: always demoting from the inserting partition does not scale
– We could demote from partition 0, but only 3 of the candidates belong to it
– With many partitions, we might not even see a candidate from the inserting partition
Instead, demote so that each partition's demotion rate matches its insertion rate (churn)

Aperture: the portion of candidates to demote from each partition

Set each aperture so that partition churn = demotion rate
– Instantaneous partition sizes vary a bit, but target sizes are maintained
– The unmanaged region prevents interference
Each partition requires an aperture proportional to its churn / size ratio
– Higher churn ↔ more frequent insertions (and demotions!)
– Larger size ↔ we see lines from that partition more often
The partition aperture determines partition associativity
– Higher aperture ↔ less selective ↔ lower associativity
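To make the churn/size proportionality explicit (my notation, added for clarity: cache size $C$ lines, $R$ candidates per replacement, partition size $S_i$, churn $c_i$, and $E$ evictions per unit time): a partition contributes about $R \, S_i / C$ of the candidates seen per replacement, so balancing demotions against insertions gives

    $A_i \cdot \frac{R \, S_i}{C} \cdot E = c_i \quad\Rightarrow\quad A_i = \frac{C}{R E} \cdot \frac{c_i}{S_i}$

i.e. the required aperture is proportional to the partition's churn/size ratio, as the slide states.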

Directly implementing these techniques is impractical
– Must constantly compute apertures and estimate churns
– Need to know the eviction priority of every block
Solution: use negative feedback loops to derive the apertures and to find the lines below each aperture
– Practical implementation
– Maintains the analytical guarantees

Adjust the aperture by letting the partition's size (Si) grow over its target (Ti)
Needs a small amount of extra space in the unmanaged region
– e.g. 0.5% of the cache with R=52, Amax=0.4, slack=10%
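A minimal sketch of this feedback rule, assuming (my reading of the slide) a linear ramp in which the aperture grows from 0 to Amax as the partition exceeds its target by up to the slack fraction:

    // Hypothetical helper: derive a partition's aperture from its current
    // size Si and target Ti. The linear ramp and all names are assumptions.
    double aperture(double Si, double Ti,
                    double Amax = 0.4, double slack = 0.10) {
        if (Si <= Ti) return 0.0;                  // at or under target: no demotions
        double frac = (Si - Ti) / (slack * Ti);    // 0..1 across the slack band
        return frac >= 1.0 ? Amax : Amax * frac;   // saturate at Amax
    }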

Derive the portion of lines below the aperture without tracking eviction priorities
Coarse-grain timestamped LRU replacement
– Tag each block with an 8-bit per-partition LRU timestamp
– Increment the timestamp every k accesses
Demote every candidate below the setpoint timestamp
– Compare the (candidates demoted / candidates seen) ratio with the aperture to increase or decrease the setpoint timestamp
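A sketch of that second feedback loop (hypothetical structure; the unit-step update is my simplification):

    #include <cstdint>

    // Per-partition demotion control. The setpoint timestamp chases the
    // aperture: if we demote a smaller fraction of candidates than the
    // aperture asks for, raise the setpoint (demote more), and vice versa.
    struct PartitionCtl {
        uint8_t  setpointTS = 0;
        uint32_t seen = 0, demoted = 0;

        bool shouldDemote(uint8_t candTS) {
            ++seen;
            // 8-bit timestamps wrap around, so compare modulo 256.
            bool below = static_cast<uint8_t>(setpointTS - candTS) < 128;
            if (below) ++demoted;
            return below;
        }

        void adjust(double aperture) {              // run periodically
            if (seen == 0) return;
            double rate = double(demoted) / seen;
            if (rate < aperture) ++setpointTS;      // demoting too few
            else if (rate > aperture) --setpointTS; // demoting too many
            seen = demoted = 0;
        }
    };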

Use a cache with associativity guarantees
Maintain an unmanaged region
Match insertion and demotion rates in each partition
– Partitions help each other evict lines → associativity is maintained
– The unmanaged region guarantees isolation and stability
Use negative feedback to simplify the implementation

Outline: Introduction – Vantage Cache Partitioning – Evaluation

Simulations of small (4-core) and large (32-core) systems
– Private L1s, shared non-inclusive L2, 1 partition per core
Partitioning policy: utility-based cache partitioning (UCP) [ISCA 2006]
– Assigns more space to the threads that can use it better
Partitioning schemes: way-partitioning, PIPP, Vantage
Workloads: 350 multi-programmed mixes from SPEC CPU2006 (full suite)

Each line shows throughput improvement versus an unpartitioned 16-way set-associative cache
Way-partitioning and PIPP degrade throughput for 45% of workloads

Vantage works best on zcaches
We use Vantage on a 4-way zcache with R=52 replacement candidates

Vantage improves throughput for most workloads
– 6.2% throughput improvement (gmean); 26% for the 50 most memory-intensive workloads

Way-partitioning and PIPP use a 64-way set-associative cache
Both degrade throughput for most workloads

Vantage uses the same Z4/52 cache as in the 4-core system
Vantage improves throughput for most workloads → scalable

Vantage maintains strict partition sizes
Vantage maintains high associativity even in the worst case

A larger unmanaged region reduces UCP performance slightly, but gives better isolation

Vantage enables cache partitioning in many-cores
– Tens to hundreds of fine-grain partitions
– High associativity per partition
– Strict isolation among partitions
– Derived from analytical models; bounds independent of the number of partitions and cache ways
– Simple to implement
