Optimizing Shared Caches in Chip Multiprocessors
Samir Sapra, Athula Balachandran, Ravishankar Krishnaswamy


Chip Multiprocessors??
“Mowry is working on the development of single-chip multiprocessors: one large chip capable of performing multiple operations at once, using similar techniques to maximize performance” -- Technology Review, 1999
“Just a few years ago, the idea of putting multiple processors on a chip was farfetched. Now it is accepted and commonplace, and virtually every new high performance processor is a chip multiprocessor of some sort…” -- Center for Electronic System Design, Univ. of California Berkeley
[Images: Core 2 Duo die; Sony's Playstation 3, 2006]

CMP Caches: Design Space
Architecture
– Placement of Cache/Processors
– Interconnects/Routing
Cache Organization & Management
– Private/Shared/Hybrid
– Fully Hardware/OS Interface
“L2 is the last line of defense before hitting the memory wall, and is the focus of our talk”

Private L2 Cache
[Diagram: processors, L1 I$/D$, private L2 caches, interconnect, coherence protocol, off-chip memory]
+ Less interconnect traffic
+ Insulates L2 units
+ Hit latency
– Duplication
– Load imbalance
– Complexity of coherence
– Higher miss rate

Shared-Interleaved L2 Cache
[Diagram: L1 I$/D$, interconnect, coherence protocol, shared L2]
+ No duplication
+ Balance the load
+ Lower miss rate
+ Simplicity of coherence
– Interconnect traffic
– Interference between cores
– Hit latency is higher

Take Home Messages
Leverage on-chip access time
– Better sharing of cache resources
– Isolating performance of processors
– Place data on the chip close to where it is used
– Minimize inter-processor misses (in shared cache)
– Fairness towards processors

On to some solutions…
– Jichuan Chang and Gurindar S. Sohi. Cooperative Caching for Chip Multiprocessors. International Symposium on Computer Architecture.
– Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. International Symposium on Computer Architecture.
– Shekhar Srikantaiah, Mahmut Kandemir, and Mary Jane Irwin. Adaptive Set-Pinning: Managing Shared Caches in Chip Multiprocessors. Architectural Support for Programming Languages and Operating Systems.
Each handles this problem in a different way.

Co-operative Caching (Chang & Sohi)
– Private L2 caches.
– Attract data locally to reduce remote on-chip accesses. Lowers average on-chip misses.
– Co-operation among the private caches for efficient use of resources on the chip.
– Control the extent of co-operation to suit the dynamic workload behavior.

CC Techniques
Cache-to-cache transfer of clean data
– On a miss, transfer “clean” blocks from another L2 cache.
– This is useful for “read only” data (instructions).
Replication-aware data replacement
– Singlet vs. replicate blocks.
– Evict a singlet only when no replicates exist.
– Singlets can be “spilled” to other cache banks.
Global replacement of inactive data
– Global management is needed for managing “spilling”.
– N-Chance Forwarding.
– Set the recirculation count to N when a block is first spilled.
– Decrease the count by 1 each time the block is spilled again; once it reaches 0, the block is no longer forwarded.
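The spilling rule above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the names `Block` and `spill_on_eviction`, the random choice of the receiving cache, and the use of `None` to mean "never spilled" are all assumptions made for the example.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    addr: int
    # None means the block has never been spilled (assumed encoding).
    recirculation: Optional[int] = None


def spill_on_eviction(block, is_singlet, host, num_caches, n=1, rng=random):
    """Decide where an evicted block goes under N-chance forwarding.

    Returns the index of the peer cache that receives the block,
    or None if the block simply leaves the chip.
    """
    if not is_singlet:
        # A replicate has other on-chip copies, so it is just dropped.
        return None
    if block.recirculation is None:
        block.recirculation = n      # first spill: start with N chances
    else:
        block.recirculation -= 1     # spilled again: one chance used up
    if block.recirculation <= 0:
        return None                  # out of chances: stop forwarding
    peers = [c for c in range(num_caches) if c != host]
    if not peers:
        return None
    # Assumption: the receiving peer is picked at random.
    return rng.choice(peers)
```

With `n=2`, a singlet is forwarded on its first two evictions and dropped on the third; a replicate is never forwarded.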

Set “Pinning” -- Setup
[Diagram: processors P1 to P4, each with an L1 cache, connected through an interconnect to a shared L2 cache (Set 0 to Set S-1), which connects through an interconnect to main memory]

Set “Pinning” -- Problem
[Diagram: processors P1 to P4, the shared cache sets (Set 0 to Set S-1), and main memory]

Set “Pinning” -- Types of Cache Misses
Classic classification:
– Compulsory (aka Cold)
– Capacity
– Conflict
– Coherence
versus
– Compulsory
– Inter-processor
– Intra-processor
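To make the second classification concrete, here is a small sketch that labels each miss in one shared cache set. It rests on an assumed definition that the slide does not spell out: a miss is intra-processor if the block was last evicted by a reference from the same processor, and inter-processor if it was evicted by a reference from a different processor. The class name `SharedSet` and the LRU replacement policy are also assumptions for illustration.

```python
from collections import OrderedDict


class SharedSet:
    """One set of a shared cache with LRU replacement (assumed policy)."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # addr -> None, oldest entry first
        self.evicted_by = {}         # addr -> processor whose miss evicted it

    def access(self, proc, addr):
        """Return 'hit' or the kind of miss for this reference."""
        if addr in self.lines:
            self.lines.move_to_end(addr)   # refresh LRU position
            return "hit"
        if addr in self.evicted_by:
            evictor = self.evicted_by.pop(addr)
            kind = "intra-processor" if evictor == proc else "inter-processor"
        else:
            kind = "compulsory"            # first reference to this block
        if len(self.lines) >= self.ways:
            victim, _ = self.lines.popitem(last=False)
            self.evicted_by[victim] = proc  # remember who caused the eviction
        self.lines[addr] = None
        return kind
```

For example, in a 2-way set, if processor 1 loads blocks A and B and processor 2 then loads C, the next reference to A by processor 1 is an inter-processor miss, because processor 2's reference evicted A.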

[Diagram: processors P1 to P4, each with a POP cache (POP 1 to POP 4); each shared cache set entry holds Owner, Other bits, and Data fields; main memory]

R-NUCA: Use Class-Based Strategies
Solve for the common case! Most current (and future) programs have the following types of accesses:
1. Instruction Access – Shared, but Read-Only
2. Private Data Access – Read-Write, but not Shared
3. Shared Data Access – Read-Write (or) Read-Only, but Shared

R-NUCA: Can do this online!
We have information from the OS and TLB.
For each memory block, classify it as
– Instruction
– Private Data
– Shared Data
Handle them differently
– Replicate instructions
– Keep private data locally
– Keep shared data globally
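The classify-then-place idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the R-NUCA mechanism itself: classification is done per page, a page counts as private until a second core touches it, and the names `AccessClass`, `PLACEMENT`, and `PageClassifier` are hypothetical.

```python
from enum import Enum


class AccessClass(Enum):
    INSTRUCTION = 1
    PRIVATE_DATA = 2
    SHARED_DATA = 3


# Placement policy per class, as listed on the slide.
PLACEMENT = {
    AccessClass.INSTRUCTION: "replicate",
    AccessClass.PRIVATE_DATA: "keep locally",
    AccessClass.SHARED_DATA: "keep globally",
}


class PageClassifier:
    """Classifies accesses online, at page granularity (an assumption)."""

    def __init__(self):
        self.first_core = {}   # page -> first core that touched its data
        self.shared = set()    # pages touched by more than one core

    def classify(self, core, page, is_instruction_fetch):
        if is_instruction_fetch:
            return AccessClass.INSTRUCTION
        if page in self.shared:
            return AccessClass.SHARED_DATA
        owner = self.first_core.setdefault(page, core)
        if owner != core:
            # A second core touched the page: it is shared from now on.
            self.shared.add(page)
            return AccessClass.SHARED_DATA
        return AccessClass.PRIVATE_DATA
```

A data page first touched by core 0 is classified as private; once core 1 touches it, every later access to that page, including those from core 0, is classified as shared.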

R-NUCA: Reactive Clustering
Assign clusters based on level of sharing
– Private Data given level-1 clusters (local cache)
– Shared Data given level-16 clusters (16 neighboring machines), etc.
Clusters ≈ Overlapping Sets in Set-Associative Mapping
Within a cluster, “Rotational Interleaving”
– Load-balancing to minimize contention on bus and controller

Future Directions Area has been closed.

Just Kidding…
– Optimize for power consumption
– Assess trade-offs between more caches and more cores
– Minimize usage of OS, but still retain flexibility
– Application adaptation to allocated cache quotas
– Adding hardware-directed thread-level speculation

Questions? THANK YOU!

Backup
Commercial and research prototypes
– Sun MAJC
– Piranha
– IBM Power 4/5
– Stanford Hydra

Backup

Design Space / Tradeoffs
Designs to achieve best of both worlds.
Shared vs. Private
– Shared L2 lowers the miss rate and eliminates coherence issues.
– Private L2 reduces the hit latency and complexity.
On-chip vs. off-chip
– Low on-chip access time
– TB/s of bandwidth
– Coherence effects
– Higher capacity
Hardware vs. Software
– Flexible