
SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support
Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez
Computer Systems Lab, Cornell University

Motivation
IBM Blue Gene/Q shared cache (Source: IBM)
The last-level cache, which occupies almost 50% of the chip area, is an important shared resource in chip multiprocessors. In the IBM Blue Gene/Q processor, 18 cores share the L2 cache, and interference among the cores can be severe enough to hurt system performance. We also observe a trend toward higher core counts, which puts even more pressure on the shared cache.

Motivation
Cavium ThunderX® 48-core CMP (Source: Cavium)
Therefore, intelligently managing the shared last-level cache is important for optimizing system performance.

Motivation
The last-level cache is critical to system performance (~50% of chip area). Performance isolation in the shared cache is desirable to prevent interference among co-running applications:
- Improve system throughput
- Guarantee QoS of latency-critical workloads
- Eliminate timing channels
[Figure: multiple cores, each with a private cache, sharing one last-level cache]

Background
Cache way partitioning: assign different cache ways to different cores.
Pros: perfect isolation; readily available in existing CPUs; low repartition overhead.
Cons: coarse-grained (only 16 cache ways in the 48-core ThunderX); associativity is lost.
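Way partitioning can be pictured as a per-core bitmask over the cache's ways: a core may only allocate lines into ways whose bit is set, and non-overlapping masks give perfect isolation. A minimal sketch (the mask layout and helper names here are illustrative, not the ThunderX's actual control-register format):

```python
# Hypothetical sketch of cache way partitioning via per-core way masks.
# A line fetched by a core may only be allocated into ways whose bit is
# set in that core's mask; disjoint masks give perfect isolation.

NUM_WAYS = 16  # e.g., the 16-way LLC of the Cavium ThunderX

def make_way_mask(first_way, num_ways):
    """Build a contiguous way mask covering [first_way, first_way + num_ways)."""
    assert first_way + num_ways <= NUM_WAYS
    return ((1 << num_ways) - 1) << first_way

def masks_isolated(masks):
    """Partitions are isolated iff no two way masks share a bit."""
    seen = 0
    for m in masks:
        if seen & m:
            return False
        seen |= m
    return True

core0 = make_way_mask(0, 4)   # ways 0-3
core1 = make_way_mask(4, 12)  # ways 4-15
```

The coarseness is visible here: with 16 ways, at most 16 disjoint masks exist, so 48 cores cannot each get a private way partition.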

Background
Page coloring: assign different cache sets to different cores.
Pros: perfect isolation; a purely OS-level software technique.

Background
Page coloring (continued).
Cons: high repartition overhead; coarse-grained, since the number of page colors is limited: 4 color bits, i.e., 16 colors, in the 48-core ThunderX. The color bits sit in the physical address, at the boundary between OS control (the page frame number) and hardware control (the cache set index).
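The color of a physical page is just the set-index bits that lie above the page offset, so the OS controls cache placement by choosing which physical frames to hand out. A minimal sketch, assuming 4 KB pages and the 4 color bits mentioned on the slide (the helper names are illustrative):

```python
# Hypothetical sketch of OS page coloring. The "color" of a physical page
# is taken from the cache set-index bits that lie just above the page
# offset; pinning an application's pages to a subset of colors confines
# it to the matching cache sets.

PAGE_SHIFT = 12   # 4 KB pages (assumed)
COLOR_BITS = 4    # 16 colors, as in the ThunderX example on the slide

def page_color(phys_addr):
    """Color = the low set-index bits just above the page offset."""
    return (phys_addr >> PAGE_SHIFT) & ((1 << COLOR_BITS) - 1)

def pick_frame(free_frames, allowed_colors):
    """Allocate a free physical frame whose color the app may use."""
    for frame in free_frames:
        if page_color(frame << PAGE_SHIFT) in allowed_colors:
            return frame
    return None
```

This also shows why repartitioning is expensive: changing an application's allowed colors means physically copying its pages into frames of the new colors.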

Background
Fine-grained cache partitioning [1, 2, 3, 4]: probabilistically guarantees partition sizes at the granularity of cache lines.
Cons: requires non-trivial hardware changes; no clear boundary across partitions, so isolation is not strict.
[1] Xie and Loh, ISCA'09  [2] Sanchez and Kozyrakis, ISCA'11  [3] Manikantan et al., ISCA'12  [4] Wang and Chen, MICRO'14

Comparison of schemes

                      Way partitioning | Page coloring | Probabilistic partitioning | SWAP
Perfect isolation     Yes              | Yes           | Probabilistic              | Yes
Fine-grain            No               | No            | Yes                        | Yes
Hardware overhead     Low              | None          | High                       | Low
Real system           Yes              | Yes           | No                         | Yes
Repartition overhead  Low              | High          | Low                        | Medium

SWAP: Set and WAy Partitioning
Way partitioning divides the cache vertically: 16 cache ways in the ThunderX for 48 cores. Page coloring divides the cache horizontally: 16 page colors in the ThunderX for 48 cores. SWAP combines way partitioning and page coloring.

SWAP: Set and WAy Partitioning
Combining way partitioning and page coloring divides the cache in a two-dimensional manner: up to 16 × 16 = 256 partitions with 16 cache ways and 16 page colors, fine-grained enough for 48 cores.

SWAP: Set and WAy Partitioning
Contribution: combining way partitioning and page coloring enables fine-grain cache partitioning on real systems.
Challenges:
- What shape should each partition have?
- How are partitions placed relative to each other?
- How can repartition overhead be minimized?

SWAP: Set and WAy Partitioning
Partition shape: given the partition size, how many cache ways and page colors should the partition have? For example, a partition of size 18 can be shaped as 3 ways × 6 colors, 2 × 9, 6 × 3, or 9 × 2.
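The shape question above is a factorization: a partition of S way-color units can take any shape ways × colors = S that fits the hardware bounds. A minimal sketch, assuming the 16-way, 16-color limits from the slides:

```python
# Hypothetical sketch of the partition-shape question: enumerate every
# (ways, colors) factorization of a given partition size that fits the
# hardware bounds (16 ways and 16 page colors, as on the ThunderX).

MAX_WAYS = 16
MAX_COLORS = 16

def candidate_shapes(size):
    """All (ways, colors) pairs with ways * colors == size within bounds."""
    shapes = []
    for ways in range(1, MAX_WAYS + 1):
        if size % ways == 0:
            colors = size // ways
            if colors <= MAX_COLORS:
                shapes.append((ways, colors))
    return shapes
```

For size 18 this yields exactly the four shapes on the slide; note 1 × 18 and 18 × 1 are excluded because they exceed the 16-way/16-color bounds.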

SWAP: Set and WAy Partitioning
Partition placement: partitions must not overlap (interference-free), and no cache space should be wasted.

SWAP: Set and WAy Partitioning
Partition shape: partitions cannot simply expand to occupy unused area.

SWAP: Set and WAy Partitioning
Partition shape: given the partition size, we classify partitions into categories. The number of page colors stays unchanged as long as the partition size stays within a certain range, which keeps partitions aligned with each other. With cache capacity S (in way × color units) and K page colors, each category maps a range of partition sizes to a fixed number of page colors. On the Cavium ThunderX® 48-core processor: cache capacity = 256 units (16 MB), 16 page colors.

SWAP: Set and WAy Partitioning
Partition placement: start with large partitions (those with more colors), and assign each partition the page colors that have the most cache ways left.
[Figure: an 8-way × 8-color cache with partitions P1 (size 40), P2 (size 12), and P3 placed greedily using per-color way-usage counters]
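The placement rule above can be sketched as a greedy loop: place partitions from widest color footprint to narrowest, and give each one the colors with the most ways still free. This is an illustrative reconstruction of the slide's heuristic, not the paper's exact algorithm; the 8 × 8 geometry matches the slide's figure.

```python
# Hypothetical sketch of SWAP's placement heuristic as described on the
# slide: place partitions from most page colors to fewest, and give each
# partition the page colors that still have the most cache ways free,
# so partitions never overlap and no capacity is stranded.

NUM_COLORS = 8
NUM_WAYS = 8

def place(partitions):
    """partitions: list of (name, ways, colors); returns name -> colors used."""
    ways_left = [NUM_WAYS] * NUM_COLORS
    placement = {}
    # Largest color footprint first.
    for name, ways, colors in sorted(partitions, key=lambda p: -p[2]):
        # Prefer colors with the most free ways (ties broken by index).
        order = sorted(range(NUM_COLORS), key=lambda c: (-ways_left[c], c))
        chosen = order[:colors]
        assert all(ways_left[c] >= ways for c in chosen), "does not fit"
        for c in chosen:
            ways_left[c] -= ways
        placement[name] = sorted(chosen)
    return placement
```

For example, a full-width 5 × 8 partition followed by two 3 × 4 partitions tiles the 8 × 8 cache exactly, with no overlap and no waste.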

SWAP: Set and WAy Partitioning
Reducing repartition overhead. Key idea: reduce page re-coloring. Classify the partitions as before:
- Same: keep the original colors
- Downgrade: use a subset of the original colors
- Upgrade: may use any colors
Estimate each partition's cache way usage before placement.

SWAP: Set and WAy Partitioning
Reducing repartition overhead: classify the partitions as before and estimate the cache way usage.
[Figure: repartitioning example on an 8-way × 8-color cache with way-usage counters; P1 shrinks from size 40 to 8 (downgrade), P2 stays at 12, P3 grows to 48 (upgrade)]

SWAP: Set and WAy Partitioning
Reducing repartition overhead: as in the initial placement, start with large partitions (more colors) and assign each partition the page colors with the most cache ways left.
[Figure: the layout after repartitioning; the downgraded P1 keeps a subset of its colors while the upgraded P3 takes new ones]
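The same/downgrade/upgrade classification on the preceding slides can be sketched as follows. This is an illustrative reading of the slide: a shrinking partition keeps a subset of its old colors so none of its pages move into colors it never owned, while only a growing partition may be handed arbitrary colors (and thus may trigger re-coloring).

```python
# Hypothetical sketch of the repartition classification from the slide:
# "same" keeps the original colors, "downgrade" keeps a subset of them,
# and "upgrade" may use any colors (decided later by the placer).
# Only upgrades can force page re-coloring.

def classify(old_colors, new_num_colors):
    if new_num_colors == len(old_colors):
        return "same", list(old_colors)
    if new_num_colors < len(old_colors):
        # Keep a prefix of the original colors: no page needs to move
        # into a color the partition never owned.
        return "downgrade", list(old_colors)[:new_num_colors]
    return "upgrade", None  # colors chosen later by the placer
```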

Other issues
Partition sizes are decided by the lookahead algorithm [1], driven by profiled cache miss-ratio curves. Additional practical concerns: hashed cache indexing and superpages.
[1] Qureshi and Patt, MICRO'06
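The idea of sizing partitions from miss-ratio curves can be sketched with a simple marginal-utility loop: repeatedly give the next unit of cache to the application whose miss count drops the most. Note this greedy version is a simplification of the cited lookahead algorithm, which also evaluates multi-unit allocations to handle non-convex curves.

```python
# Hypothetical sketch of utility-based partition sizing from miss-ratio
# curves, in the spirit of the lookahead algorithm cited on the slide.
# Greedy simplification: each step grants one cache unit to the app with
# the largest marginal reduction in misses.

def allocate(miss_curves, budget):
    """miss_curves[i][k] = misses of app i when given k cache units."""
    alloc = [0] * len(miss_curves)
    for _ in range(budget):
        # Marginal utility of one more unit for each app.
        gains = [curve[a] - curve[a + 1] if a + 1 < len(curve) else 0
                 for curve, a in zip(miss_curves, alloc)]
        winner = max(range(len(gains)), key=lambda i: gains[i])
        alloc[winner] += 1
    return alloc
```

An app whose curve flattens early (its working set fits) stops attracting capacity, which then flows to cache-hungry co-runners.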

Experimental setup
Cavium ThunderX processor: 48-core CMP at 1.9 GHz; 16 MB shared last-level cache; 64 GB DDR4-2133 over 4 channels; Ubuntu Linux, kernel 3.18.
Performance analysis: multi-programmed mixes of SPEC CPU2000 and CPU2006 workloads; memcached as the latency-critical workload.
(Speaker notes: We partition the L2 cache by cache ways. Other cache allocation techniques, such as ZCache and Vantage, essentially mimic high associativity with a limited number of physical ways, so there is no fundamental reason we could not use them; this is not really a limitation. On power: the IBM POWER8 (12 cores) draws 200-250 W at 22 nm and Intel Haswell 155 W at 22 nm; the results do not depend on the absolute numbers. The actual cache partitioning mechanism is orthogonal to our proposal.)

Static partitioning
Running a 48-application bundle, SWAP improves performance by 12.5% over the baseline (an unmanaged, shared cache).

Dynamic partitioning
Running an application sequence: the next application in the sequence replaces the finished one, and cache partitions change dynamically.

Dynamic partitioning
Running an application sequence: the next application in the sequence replaces the finished one, and cache partitions change dynamically.
[Table: speedups of WAY, SET, and SWAP partitioning over the unmanaged baseline for application sequences on 16, 32, and 48 cores, with average injection intervals between 25 s and 46 s; reported speedups range from 0.92x to 1.20x]

Guarantee QoS
The latency-critical workload memcached is co-located with background multi-programmed workloads. Under a shared cache, cache exploration or phase changes of the background workloads disturb memcached; under SWAP it is isolated.
[Figure: memcached behavior under a shared cache vs. under SWAP]

Guarantee QoS
With memcached co-located with a 16-app SPEC bundle, SWAP delivers an 8.1% improvement on the background bundle while guaranteeing memcached's QoS.

Conclusions
SWAP is a real-system implementation of fine-grain cache partitioning for large CMP systems. It combines cache way partitioning and page coloring, delivers superior system throughput, and guarantees QoS for latency-critical workloads.

SWAP: Effective Fine-Grain Management of Shared Last-Level Caches with Minimum Hardware Support
Xiaodong Wang, Shuang Chen, Jeff Setter, and José F. Martínez
Computer Systems Lab, Cornell University

Static partitioning

        16-core | 24-core | 32-core | 48-core
SWAP    13.9%   | 14.1%   | 12.5%   | 12.5%