Cooperative Caching for Chip Multiprocessors
Jichuan Chang and Guri Sohi
University of Wisconsin-Madison
ISCA-33, June 2006
Motivation
Chip multiprocessors (CMPs) both require and enable innovative on-chip cache designs.
Two important demands for CMP caching:
– Capacity: reduce off-chip accesses
– Latency: reduce remote on-chip references
Need to combine the strengths of both private and shared cache designs.
Yet Another Hybrid CMP Cache – Why?
Private cache based design:
– Lower latency and per-cache associativity
– Lower cross-chip bandwidth requirement
– Self-contained for resource management
– Easier to support QoS, fairness, and priority
Need a unified framework:
– Manage the aggregate on-chip cache resources
– Can be adopted by different coherence protocols
CMP Cooperative Caching
Form an aggregate global cache via cooperative private caches:
– Use private caches to attract data for fast reuse
– Share capacity through cooperative policies
– Throttle cooperation to find an optimal sharing point
Inspired by cooperative file/web caches:
– Similar latency tradeoff
– Similar algorithms
[Figure: four cores, each with a processor, private L1I/L1D, and private L2, connected by an on-chip network]
Outline
– Introduction
– CMP Cooperative Caching
– Hardware Implementation
– Performance Evaluation
– Conclusion
Policies to Reduce Off-chip Accesses
Cooperation policies for capacity sharing:
– (1) Cache-to-cache transfers of clean data
– (2) Replication-aware replacement
– (3) Global replacement of inactive data
Implemented by two unified techniques:
– Policies enforced by cache replacement/placement
– Information/data exchange supported by modifying the coherence protocol
Policy (1) – Make Use of All On-chip Data
Don't go off-chip if on-chip (clean) data exist.
Beneficial and practical for CMPs:
– A peer cache is much closer than next-level storage
– "Clean ownership" has affordable implementations
Important for all workloads:
– Multi-threaded: (mostly) read-only shared data
– Single-threaded: spill into peer caches for later reuse
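A minimal sketch of this miss path, assuming a centralized directory model in Python (the Directory class, its method names, and the latency constants are illustrative, not the paper's implementation):

    MEMORY_LATENCY = 300  # cycles, matching the evaluated configuration
    PEER_LATENCY = 30     # assumed cost of an on-chip cache-to-cache transfer

    class Directory:
        """Tracks which private L2 caches hold each block."""
        def __init__(self):
            self.holders = {}  # block address -> set of cache ids

        def handle_l2_miss(self, addr, requester):
            # Policy (1): serve the miss from a peer cache whenever any
            # on-chip copy exists; "clean ownership" lets the directory
            # forward the request even when no copy is dirty.
            peers = self.holders.get(addr, set()) - {requester}
            self.holders.setdefault(addr, set()).add(requester)
            return PEER_LATENCY if peers else MEMORY_LATENCY

For example, after core 0 fetches a block from memory, a later miss by core 1 on the same block is served on chip: handle_l2_miss(0x1000, 0) returns 300 cycles, then handle_l2_miss(0x1000, 1) returns 30.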
Policy (2) – Control Replication
Intuition: increase the number of unique blocks kept on chip.
Latency/capacity tradeoff:
– Evict singlets (blocks with no other on-chip copy) only when no replicas exist
– Modify the default cache replacement policy
"Spill" an evicted singlet into a peer cache:
– Can further reduce on-chip replication
– Randomly choose a recipient cache for the spilled singlet
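The replacement decision can be sketched as below, assuming each set is kept as an MRU-first list and that an is_singlet() oracle is available (in the proposed hardware this information comes from the directory):

    import random

    def choose_victim(cache_set, is_singlet):
        """Replication-aware replacement: evict the least-recently-used
        replica if the set contains one; evict a singlet only when every
        block in the set is a singlet. cache_set is ordered MRU -> LRU."""
        for block in reversed(cache_set):      # scan from the LRU end
            if not is_singlet(block):
                return block                   # replicas are evicted first
        return cache_set[-1]                   # all singlets: plain local LRU

    def spill(victim, peer_caches):
        """Spill an evicted singlet into a randomly chosen peer cache;
        insert(0, ...) models placing it at the recipient's MRU position."""
        random.choice(peer_caches).insert(0, victim)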
Policy (3) – Global Cache Management
Approximate global-LRU replacement:
– Combine global spill/reuse history with local LRU
Identify and replace globally inactive data:
– A block first becomes the LRU entry in its local cache
– It is set as MRU if spilled into a peer cache
– If it later becomes the LRU entry again without reuse, it is evicted globally
1-chance forwarding (1-Fwd):
– A block can be spilled only once if it is not reused
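A sketch of 1-Fwd, using a hypothetical spilled_once bit per block; spill_fn and evict_fn stand in for the spilling transfer and the off-chip eviction path:

    class Block:
        def __init__(self, addr):
            self.addr = addr
            self.spilled_once = False  # the 1-Fwd state

    def on_local_lru_eviction(block, spill_fn, evict_fn):
        """Approximate global LRU: a singlet gets one chance as a spill.
        If it ages back to LRU without being reused, it is treated as
        globally inactive and leaves the chip."""
        if block.spilled_once:
            evict_fn(block)           # already spilled once: evict globally
        else:
            block.spilled_once = True
            spill_fn(block)           # the recipient inserts it at MRU

    def on_reuse(block):
        block.spilled_once = False    # reuse renews the block's chance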
Cooperation Throttling
Why throttling?
– Enables a finer tradeoff between capacity and latency
Two probabilities to help make decisions:
– Cooperation probability: controls replication
– Spill probability: throttles spilling
This yields a spectrum of design points: CC 100% behaves most like a shared cache, while CC 0% (Policy (1) only) behaves like private caches.
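Throttling can be sketched as coin flips layered on the earlier policies (the helpers below reuse the illustrative set model from the Policy (2) sketch):

    import random

    def pick_victim(cache_set, is_singlet, cooperation_prob):
        """With probability cooperation_prob, apply replication-aware
        replacement; otherwise fall back to plain local LRU.
        cooperation_prob = 1.0 corresponds to CC 100%; 0.0 leaves only
        Policy (1), i.e. the CC 0% design point."""
        if random.random() < cooperation_prob:
            for block in reversed(cache_set):  # MRU -> LRU order
                if not is_singlet(block):
                    return block
        return cache_set[-1]

    def maybe_spill(victim, peer_caches, spill_prob):
        """Throttle spilling: forward an evicted singlet only some of
        the time; otherwise it simply leaves the chip."""
        if random.random() < spill_prob:
            random.choice(peer_caches).insert(0, victim)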
Outline
– Introduction
– CMP Cooperative Caching
– Hardware Implementation
– Performance Evaluation
– Conclusion
Hardware Implementation
Requirements:
– Information: singlet status, spill/reuse history
– Cache replacement policy
– Coherence protocol: clean ownership and spilling
Can modify an existing implementation.
Proposed implementation:
– Central Coherence Engine (CCE)
– On-chip directory formed by duplicating the tag arrays
Information and Data Exchange
Singlet information:
– The directory detects singlets and notifies the block owner
Sharing of clean data:
– PUTS: notify the directory of a clean-data replacement
– The directory sends a forward request to the first sharer
Spilling:
– Currently implemented as a 2-step data transfer
– Could instead be implemented as a recipient-issued prefetch
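These exchanges can be modeled as below, with the directory's sharer set kept as an ordered Python list (the puts/gets method names mirror the message types above, but this is a loose sketch rather than the full protocol):

    class CCEDirectory:
        def __init__(self):
            self.sharers = {}  # block address -> ordered list of cache ids

        def is_singlet(self, addr):
            """The directory detects singlets (exactly one on-chip copy),
            so it can notify that owner."""
            return len(self.sharers.get(addr, [])) == 1

        def puts(self, addr, cache_id):
            """PUTS: a cache announces replacement of a clean block, so
            sharer information stays precise enough to forward clean data."""
            sharer_list = self.sharers.get(addr, [])
            if cache_id in sharer_list:
                sharer_list.remove(cache_id)
            if not sharer_list:
                self.sharers.pop(addr, None)

        def gets(self, addr, requester):
            """On a read miss, forward the request to the first sharer if
            any on-chip copy exists; otherwise fall through to memory."""
            sharer_list = self.sharers.setdefault(addr, [])
            source = sharer_list[0] if sharer_list else "memory"
            if requester not in sharer_list:
                sharer_list.append(requester)
            return source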
Outline
– Introduction
– CMP Cooperative Caching
– Hardware Implementation
– Performance Evaluation
– Conclusion
Performance Evaluation
Full-system simulation:
– Modified GEMS Ruby to simulate the memory hierarchy
– Simics MAI-based out-of-order processor simulator
Workloads:
– Multithreaded commercial benchmarks (8-core): OLTP, Apache, JBB, Zeus
– Multiprogrammed SPEC2000 benchmarks (4-core): 4 heterogeneous mixes, 2 homogeneous mixes
Private / shared / cooperative schemes compared at the same total capacity and associativity.
Multithreaded Workloads – Throughput
– CC throttling evaluated at 0%, 30%, 70%, and 100%
– Ideal: a shared cache with local-bank latency
[Figure: throughput results]
Multithreaded Workloads – Average Latency
– Low off-chip miss rate
– High hit ratio in the local L2
– Lower bandwidth demand than a shared cache
[Figure: average-latency results]
Multiprogrammed Workloads
[Figure: access-latency breakdowns into L1, local L2, remote L2, and off-chip components]
Sensitivity – Varied Configurations
– In-order processor with blocking caches
[Figure: performance normalized to the shared cache]
Comparison with Victim Replication
[Figure: results for SPECOMP and single-threaded workloads]
A Spectrum of Related Research
From best capacity to best latency:
– Shared caches (best capacity)
– Configurable/malleable caches (Liu et al. HPCA’04, Huh et al. ICS’05, cache partitioning proposals, etc.)
– Hybrid schemes (CMP-NUCA proposals, victim replication, Speight et al. ISCA’05, CMP-NuRAPID, Fast & Fair, synergistic caching, etc.)
– Cooperative caching
– Private caches (best latency)
Speight et al. ISCA’05: private caches with writeback into peer caches.
Conclusion
CMP cooperative caching:
– Exploits the benefits of a private cache based design
– Shares capacity through explicit cooperation
– Uses cache replacement/placement policies for replication control and global management
– Achieves robust performance improvements
Thank you!
Backup Slides
Shared vs. Private Caches for CMPs
Shared cache – best capacity:
– No replication, dynamic capacity sharing
Private cache – best latency:
– High local hit ratio, no interference from other cores
Need to combine the strengths of both.
[Figure: shared vs. private cache organizations]
CMP Caching Challenges/Opportunities
Future hardware:
– More cores, limited on-chip cache capacity
– On-chip wire delay, costly off-chip accesses
Future software:
– Diverse sharing patterns, working set sizes
– Interference due to capacity sharing
Opportunities:
– Freedom in designing on-chip structures/interfaces
– High-bandwidth, low-latency on-chip communication
Multi-cluster CMPs (Scalability)
[Figure: a multi-cluster CMP in which each cluster of cores, with private L1I/L1D and L2 caches, shares its own CCE]
Central Coherence Engine (CCE)
[Figure: CCE datapath with duplicated L1 and L2 tag arrays for P1–P8, 4-to-1 muxes with state-vector gather for directory lookups, a spilling buffer, input/output queues, and connections to the network, directory, and memory]
Cache/Network Configurations
Parameters:
– 128-byte block size; 300-cycle memory latency
– L1 I/D split caches: 32KB, 2-way, 2-cycle
– L2 caches: sequential tag/data access, 15-cycle
– On-chip network: 5-cycle per-hop latency
Configurations:
– 4-core: 2x2 mesh network
– 8-core: 3x3 or 4x2 mesh network
– Same total capacity/associativity for private/shared/cooperative caching schemes