ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors Mohammad Hammoud, Sangyeun Cho, and Rami Melhem Presenter: Socrates Demetriades Dept. of Computer Science University of Pittsburgh
Tiled CMP Architectures. Tiled CMP architectures have recently been advocated as a scalable design. They replicate identical building blocks (tiles) connected over a switched network-on-chip (NoC). A tile typically incorporates a private L1 cache and an L2 cache bank. A traditional practice in CMP caches is to logically share the physically distributed L2 banks: the shared scheme.
The home tile of a cache block B is designated by the home select (HS) bits of B's physical address. Tile T1 requests B: L2 miss. B is fetched from main memory and mapped at its home tile (together with its directory info). Pros: high capacity utilization; simple coherence enforcement (needed only for the L1 caches).
Shared Scheme: Latency Problem (Cons). Access latencies to L2 banks differ depending on the distances between requester cores and target banks. This design is referred to as a Non-Uniform Cache Architecture (NUCA).
NUCA Solution: Block Migration. Idea: move accessed blocks closer to the requesting cores (block migration). Example (HS of B = 1111, i.e., T15): T0 requests block B, total hops = 14. B is migrated from T15 to T0. T0 requests B again: local hit, total hops = 0.
NUCA Solution: Block Migration. HS of B = 0110 (T6). Before migration: T3 requests B (hops = 6), T0 requests B (hops = 8), T8 requests B (hops = 8); total hops = 22. Assume B is migrated to T3. After migration: T3 requests B (hops = 0), T0 requests B (hops = 11), T8 requests B (hops = 13); total hops = 24. Though T3 saved 6 hops, in total there is a loss of 2 hops.
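To make the hop arithmetic concrete, here is a minimal sketch that reproduces the slide's numbers under assumptions of mine (not stated on the slide): tiles are numbered row-major on the 4x4 mesh, a one-way traversal costs the Manhattan distance plus one hop, a remote hit is a round trip, and an access after migration (with no location mechanism yet) detours through the home tile as a 3-way transfer.

```python
# Worked check of the hop counts on a 4x4 mesh (assumed model: row-major tile
# numbering, one-way cost = Manhattan distance + 1, remote hit = round trip,
# post-migration access = 3-way transfer via the home tile).

def dist(a, b, width=4):
    """Manhattan distance between tiles a and b on a width x width mesh."""
    return abs(a % width - b % width) + abs(a // width - b // width)

def round_trip(requester, bank):
    """Hops for a remote L2 hit; a local hit costs nothing."""
    return 0 if requester == bank else 2 * (dist(requester, bank) + 1)

def three_way(requester, home, host):
    """Hops when the request detours through the home tile to reach the host."""
    if requester == host:
        return 0
    legs = [(requester, home), (home, host), (host, requester)]
    return sum(dist(a, b) + 1 for a, b in legs)

# Before migration, B sits at its home tile T6.
print([round_trip(t, 6) for t in (3, 0, 8)])              # [6, 8, 8]   -> total 22
# After migrating B to T3, every other sharer pays the detour through T6.
print([three_way(t, home=6, host=3) for t in (3, 0, 8)])  # [0, 11, 13] -> total 24
```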
Our work. Collect information about the tiles (sharers) that have accessed a block B. Rely on the past to predict the future: a core that accessed a block in the past is likely to access it again. Migrate B to a tile (host) that minimizes the overall number of NoC hops needed.
Talk roadmap: predicting the optimal host location; locating migratory blocks (the cache-the-cache-tag policy); replacement policy upon migration (the swap-with-the-lru policy); quantitative evaluation; conclusion and future work.
Predicting Optimal Host Location. Keeping a cache block B at its home tile might not be optimal, and the best host location of B is not known until runtime. Adaptive Controlled Migration (ACM): keep a pattern recording which tiles have accessed B; at runtime (after a specific migration frequency level is reached for B), compute the best host for B by finding the tile that minimizes the total latency cost among B's sharers.
ACM: A Working Example. Tiles 0 and 6 are sharers. Case 1: Tile 3 is the host, total latency cost = 14. Case 2: Tile 15 is the host, total latency cost = 22. Case 3: Tile 2 is the host, total latency cost = 10. Case 4: Tile 0 is the host, total latency cost = 8. Select T0.
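Under the same assumed hop model as the earlier sketch, the host-selection step can be viewed as a search over candidate hosts for the minimum total latency cost; choose_host() and the candidate list below are illustrative, not the paper's code.

```python
# Minimal sketch of ACM's host selection: pick the tile that minimizes the
# total round-trip latency cost summed over the sharers of block B.
# Assumes the same hop model as the previous sketch (an inference of mine).

def dist(a, b, width=4):
    return abs(a % width - b % width) + abs(a // width - b // width)

def cost(sharer, host):
    return 0 if sharer == host else 2 * (dist(sharer, host) + 1)

def choose_host(sharers, candidates):
    return min(candidates, key=lambda h: sum(cost(s, h) for s in sharers))

sharers = (0, 6)
for h in (3, 15, 2, 0):                          # the four cases on the slide
    print(h, sum(cost(s, h) for s in sharers))   # 14, 22, 10, 8
print(choose_host(sharers, (3, 15, 2, 0)))       # -> 0, i.e., select T0
```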
Locating Migratory Blocks. After a cache block B is migrated, the HS bits of B's physical address can no longer be used to locate B on a subsequent access. Assume B has been migrated from its home tile T4 (HS of B = 0100) to a new host tile T7. T3 requests B and sees a false L2 miss at T4. A tag can be kept at T4 to point to T7, yielding a 3-way cache-to-cache transfer (T3, T4, and T7). Deficiencies: the migration becomes useless, and the scheme fails to exploit distance locality.
Locating Migratory Blocks: cache-the-cache-tag Policy. Idea: cache the tag of block B at the requester's tile (within a data structure referred to as the MT table). HS of B = 0100 (T4). First access: T3 requests B and looks up its MT table before reaching B's home tile; on an MT miss, a 3-way communication follows and T3 caches B's tag in its MT table. Second and later accesses: T3 requests B, looks up its MT table, gets an MT hit, and fetches B directly from its current host.
Locating Migratory Blocks: cache-the-cache-tag Policy. The MT table of a tile T can now hold two types of tags: a tag for each block B whose home tile is T but which has been migrated to another tile (local entry), and tags that track the locations of migratory blocks recently accessed by T whose home tile is not T (remote entries). The MT table replacement policy: replace an invalid tag first, then the LRU remote entry. The remote and local MT tags of B are kept consistent by extending the local entry of B at B's home tile with a bit mask indicating which tiles have cached corresponding remote entries.
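A minimal sketch of how a per-tile MT table might be modeled; the class name MTTable, its capacity, and the fallback used when no remote entry exists are my assumptions, not details from the paper.

```python
# Sketch of a per-tile MT table: maps a block address to the tile currently
# hosting it. Entries whose home tile is this tile are "local"; others are
# "remote". Replacement uses a free slot if one exists, else the LRU remote entry.

from collections import OrderedDict

class MTTable:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()            # addr -> (host_tile, is_local); LRU first

    def lookup(self, addr):
        """Return the host tile on an MT hit, or None on an MT miss."""
        if addr in self.entries:
            self.entries.move_to_end(addr)      # refresh LRU position
            return self.entries[addr][0]
        return None

    def insert(self, addr, host_tile, is_local):
        if addr not in self.entries and len(self.entries) >= self.capacity:
            victim = next((a for a, (_, loc) in self.entries.items() if not loc), None)
            if victim is None:
                victim = next(iter(self.entries))   # assumption: fall back to global LRU
            del self.entries[victim]
        self.entries[addr] = (host_tile, is_local)
        self.entries.move_to_end(addr)

# First access by T3: MT miss -> 3-way transfer via B's home tile, then cache the tag.
mt_t3 = MTTable(capacity=4)
assert mt_t3.lookup(0xB) is None
mt_t3.insert(0xB, host_tile=7, is_local=False)
assert mt_t3.lookup(0xB) == 7                   # later accesses: MT hit, direct fetch
```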
Replacement Policy Upon Migration: swap-with-the-lru Policy. After the ACM algorithm predicts the optimal host H for a block B, a decision must be made about which block to replace at H when B is migrated there. Idea: swap B with the LRU block at H (the swap-with-the-lru policy). The LRU block at H could be either a migratory or a non-migratory one. The swap-with-the-lru policy is very effective, especially for workloads whose working sets are large relative to the L2 banks (it bears similarity to victim replication but is more robust).
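A minimal sketch of the swap-with-the-lru idea, assuming each L2 bank is modeled as an LRU-ordered map; L2Bank and migrate() are illustrative names, and sending the victim back into B's old slot is my reading of "swap".

```python
# Sketch of swap-with-the-lru: when block B migrates to host bank H, the LRU
# block of H (migratory or not) takes B's old slot instead of being evicted.

from collections import OrderedDict

class L2Bank:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()             # addr -> data, LRU first / MRU last

def migrate(addr, src, host):
    data = src.blocks.pop(addr)                 # remove B from its current bank
    if len(host.blocks) >= host.capacity:
        victim_addr, victim_data = host.blocks.popitem(last=False)  # LRU at host
        src.blocks[victim_addr] = victim_data   # swap: victim moves into B's old slot
    host.blocks[addr] = data                    # B is installed MRU at the host

# Usage: migrate block 0xB from its home bank to the predicted host bank.
home_bank, host_bank = L2Bank(2), L2Bank(2)
home_bank.blocks[0xB] = "data_B"
host_bank.blocks[0x1], host_bank.blocks[0x2] = "d1", "d2"
migrate(0xB, src=home_bank, host=host_bank)
assert 0xB in host_bank.blocks and 0x1 in home_bank.blocks
```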
Quantitative Evaluation: Methodology and Benchmarks. We simulate a 16-way tiled CMP. Simulator: Simics (Solaris OS). Cache line size: 64 bytes. L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle. L2 size/ways/latency: 512KB per bank / 16 ways / 6 cycles. Latency per hop: 5 cycles. Memory latency: 300 cycles. Migration frequency level: 10.
Benchmarks (name: input):
SPECjbb: Java HotSpot(TM) server VM 1.5, 4 warehouses
Lu: 1024x1024 (16 threads)
Ocean: 514x514 (16 threads)
Radix: 2M integers (16 threads)
Barnes: 16K particles (16 threads)
Parser, Art, Equake, Mcf, Ammp, Vortex: reference inputs
MIX1: Vortex, Ammp, Mcf, and Equake
MIX2: Art, Equake, Parser, and Mcf
Quantitative Evaluation: Single-threaded and Multiprogramming Results. VR successfully offsets its L2 miss-rate increase with fast replica hits for all the single-threaded benchmarks, but fails to do so for MIX1 and MIX2 (poor capacity utilization). For single-threaded workloads, ACM generates on average 20.5% and 3.7% better AAL (average L2 access latency) than S and VR, respectively. For multiprogramming workloads, ACM generates on average 2.8% and 31.3% better AAL than S and VR, respectively, while maintaining efficient capacity utilization.
Quantitative Evaluation: Multithreaded Results. An increase in the degree of sharing suggests that the capacity occupied by replicas could increase significantly, leading to a decrease in the effective L2 cache size. ACM exhibits AALs that are on average 27% and 37.1% better than S and VR, respectively.
Quantitative Evaluation: Avg. Memory Access Cycles Per 1K Instr. ACM performs on average 18.6% and 2.6% better than S for the single-threaded and multiprogramming workloads, respectively. ACM performs on average 20.7% better than S for multithreaded workloads. VR performs on average 15.1% better than S and 38.4% worse than S for the single-threaded and multiprogramming workloads, respectively. VR performs on average 19.6% worse than S for multithreaded workloads.
Quantitative Evaluation: ACM Scalability. As the number of tiles on a CMP platform increases, the NUCA problem is exacerbated. ACM is independent of the underlying platform and always selects hosts that minimize AAL; more exposure to the NUCA problem therefore translates into a larger benefit from ACM. For the simulated benchmarks: with a 16-way CMP, ACM improves AAL by 11.6% over S; with a 32-way CMP, ACM improves AAL by 56.6% on average over S.
Quantitative Evaluation: Sensitivity to MT Table Sizes. With half (50%) and quarter (25%) MT table sizes, relative to the regular L2 cache bank size, ACM increases AAL by 5.9% and 11.3%, respectively, over the base configuration (100%, i.e., an MT table identical in size to the L2 cache bank).
Quantitative Evaluation: Sensitivity to L2 Cache Sizes. ACM maintains an AAL improvement of 39.7% over S across the tested L2 cache sizes, whereas VR fails to demonstrate such stability.
Conclusion. This work proposes ACM, a strategy for managing CMP NUCA caches. ACM offers better average L2 access latency than the traditional shared NUCA scheme (20.4% better on average) while maintaining NUCA's L2 miss rate. ACM proposes a robust location strategy (cache-the-cache-tag) that can work with any NUCA migration scheme. ACM demonstrates the usefulness of the migration technique in the CMP context.
Future work. Improve the ACM prediction mechanism. Currently, cores are treated equally: we consider only 0-1 weights, assigning 1 to a core that accessed block B and 0 to one that didn't. Improvement: reflect the non-uniformity in cores' access weights (a trade-off between access weights and storage overhead). Also, propose an adaptive mechanism for selecting migration frequency levels.
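The weighted-cost direction above could look like the following sketch; the per-core access counts, the cost formula, and weighted_choose_host() are hypothetical illustrations of the idea, not something evaluated in the paper.

```python
# Hypothetical sketch of the proposed improvement: weight each sharer's
# distance by how often it accessed B, instead of the current 0/1 weights.

def dist(a, b, width=4):
    return abs(a % width - b % width) + abs(a // width - b // width)

def weighted_choose_host(access_counts, tiles=range(16)):
    """access_counts: {tile_id: number of accesses to block B}."""
    return min(tiles, key=lambda h: sum(w * dist(t, h)
                                        for t, w in access_counts.items()))

# If T0 touched B nine times and T6 only once, the host gravitates toward T0.
print(weighted_choose_host({0: 9, 6: 1}))   # -> 0
```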
ACM: An Efficient Approach for Managing Shared Caches in Chip Multiprocessors. M. Hammoud, S. Cho, and R. Melhem. Special thanks to Socrates Demetriades, Dept. of Computer Science, University of Pittsburgh. Thank you!