DDM – A Cache-Only Memory Architecture
Hagersten, Landin, and Haridi (1991)
Presented by Patrick Eibl


Outline
- Basics of Cache-Only Memory Architectures
- The Data Diffusion Machine (DDM)
- DDM Coherence Protocol
- Examples of Replacement, Reading, Writing
- Memory Overhead
- Simulated Performance
- Strengths and Weaknesses
- Alternatives to DDM Architecture

The Big Idea: UMA → NUMA → COMA
- UMA: Centralized shared memory feeds data through a network to individual caches; uniform access time to all memory
- NUMA: Shared memory is distributed among processors (e.g., DASH); data can move from its home memory to other caches as needed
- COMA: No notion of a "home" for data; it moves to wherever it is needed, and the individual memories behave like caches

COMA: The Basics
- Individual memories are called Attraction Memories (AMs): each processor "attracts" its working data set
- An AM also contains data that has never been accessed (an advantage or a drawback?)
- Uses the shared-memory programming model, but with no pressure to optimize static partitioning
- Limited duplication of shared memory
- The Data Diffusion Machine is the specific COMA presented here

Data Diffusion Machine
- A directory hierarchy allows scaling to an arbitrary number of processors
- Branch factors and bottlenecks are a consideration; the hierarchy can be split into different address domains to improve bandwidth (a node sketch follows)
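
To make the hierarchy concrete, here is a minimal sketch of a directory-tree node. The branch factor, the field names, and the domain tag are illustrative assumptions, not details fixed by the paper.

```c
#define BRANCH 8   /* assumed branch factor; treated here as a tuning knob */

/* A node in the DDM directory tree. Leaves are attraction memories
 * (processor + memory); interior nodes hold only directory state for
 * the items cached somewhere below them. The domain field sketches the
 * address-domain split mentioned above. */
typedef struct DdmNode {
    struct DdmNode *parent;          /* NULL at the top of the hierarchy */
    struct DdmNode *child[BRANCH];   /* all NULL at a leaf (an AM) */
    int domain;                      /* assumed: address domain served near the top */
} DdmNode;
```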

Coherence Protocol
- Transient states support a split-transaction bus (sketched in code below)
- Fairly standard protocol, with the important exception of replacement, which must be managed carefully (example to come)
- Sequential consistency is guaranteed, but at the cost that writes must wait for an acknowledgment before continuing

Item states – I: Invalid, E: Exclusive, S: Shared, R: Reading, W: Waiting, RW: Reading and Waiting
Bus transactions – e: erase, x: exclusive, r: read, d: data, i: inject, o: out
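
A minimal sketch of the slide's state and transaction names in code, with one illustrative split-transaction sequence (a read miss that completes later). The function names and the tiny transition logic are assumptions for illustration, not the paper's full protocol tables.

```c
#include <stdio.h>

/* Item states and transactions from the slide's legend. The two tiny
 * transitions below only illustrate the split-transaction style
 * (request now, data reply later). */
typedef enum { ST_I, ST_E, ST_S, ST_R, ST_W, ST_RW } ItemState;
typedef enum { TR_ERASE, TR_EXCLUSIVE, TR_READ, TR_DATA, TR_INJECT, TR_OUT } Trans;

/* A read miss posts a read transaction and parks in transient Reading. */
ItemState cpu_read(ItemState s, Trans *msg) {
    if (s == ST_I) { *msg = TR_READ; return ST_R; }
    return s;                            /* E or S: local hit, no traffic */
}

/* The later data reply completes the split transaction. */
ItemState data_arrives(ItemState s) {
    return (s == ST_R) ? ST_S : s;
}

int main(void) {
    Trans msg = TR_DATA;
    ItemState s = cpu_read(ST_I, &msg);  /* miss: read now in flight */
    printf("read sent: %s\n", msg == TR_READ ? "yes" : "no");
    s = data_arrives(s);                 /* reply: Reading -> Shared */
    printf("shared now: %s\n", s == ST_S ? "yes" : "no");
    return 0;
}
```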

Replacement Example
(Figure: out and inject transactions moving through processors, caches, and directories; states I: Invalid, S: Shared; transactions o: out, i: inject)
1. A block needs to be brought into a full AM, necessitating a replacement and an out transaction
2. The out propagates up until it finds another copy of the block in state S, R, W, or RW
3. Here the out reaches the top and is converted to an inject, meaning this is the last copy of the data and it needs a new home
4. The inject finds space in a new AM
5. The data is transferred to its new home
6. States change accordingly
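
The walk above can be phrased as a short recursive procedure over the directory tree. This is a minimal sketch under simplifying assumptions: every name is hypothetical, and a real directory would also track which subtree holds the other copy.

```c
#include <stdio.h>
#include <stdbool.h>

#define BRANCH 2                     /* tiny tree, just for illustration */

typedef struct Node {
    struct Node *parent;             /* NULL at the top of the hierarchy */
    struct Node *child[BRANCH];      /* all NULL at a leaf (an AM) */
    bool copy_below;                 /* directory knows of an S/R/W/RW copy below */
    bool free_frame;                 /* leaf only: room to accept an inject */
} Node;

/* "inject": the last copy searches downward for an AM with space. */
static bool inject(Node *n) {
    if (n->child[0] == NULL)
        return n->free_frame;        /* leaf attraction memory */
    for (int i = 0; i < BRANCH; i++)
        if (n->child[i] && inject(n->child[i]))
            return true;
    return false;                    /* no space: would force another replacement */
}

/* "out": climb until some directory already knows of another copy;
 * if the top is reached without finding one, this was the last copy. */
static void out(Node *n) {
    while (n) {
        if (n->copy_below) {
            puts("out absorbed: another copy exists, item can be dropped");
            return;
        }
        if (n->parent == NULL) {
            puts(inject(n) ? "last copy injected into a new AM"
                           : "no free frame found anywhere");
            return;
        }
        n = n->parent;
    }
}

int main(void) {
    Node top = {0}, am0 = {0}, am1 = {0};
    top.child[0] = &am0; top.child[1] = &am1;
    am0.parent = &top;   am1.parent = &top;
    am1.free_frame = true;           /* the other AM can host the block */
    out(&am0);                       /* am0 evicts the last copy of an item */
    return 0;
}
```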

Multilevel Read Example
(Figure: read requests and data replies moving through processors, caches, and directories; states I: Invalid, R: Reading, A: Answering, S: Shared; transactions r: read, d: data)
First read:
1. The first cache issues a read request
2. The read propagates up the hierarchy
3. The read reaches a directory with the block in the shared state
4. Directories change to the answering state while waiting for the data
5. The data moves back along the same path, changing states to shared as it goes
Concurrent second read (sketched in code below):
1. A second cache issues a request for the same block
2. The request encounters the outstanding read for the same block; the directory simply waits for the data reply from the other request
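
The interesting moment on this slide is the second read meeting the first one in flight. The sketch below shows one directory entry combining the two requests onto a single outstanding read; the state names follow the slide's legend, while the fields and functions are assumptions for illustration.

```c
#include <stdio.h>

typedef enum { D_INVALID, D_READING, D_ANSWERING, D_SHARED } DirState;

typedef struct {
    DirState state;
    int waiters;                /* readers combined onto one outstanding request */
} DirEntry;

/* A read request for one item arrives at this directory. */
void dir_read(DirEntry *e) {
    switch (e->state) {
    case D_INVALID:             /* first request: forward it up, go transient */
        e->state = D_READING;
        e->waiters = 1;
        puts("forward read up the hierarchy");
        break;
    case D_READING:             /* same item already requested: combine */
    case D_ANSWERING:
        e->waiters++;
        puts("combined with the outstanding read; just wait for the data");
        break;
    case D_SHARED:              /* a copy exists below: turn the read around */
        e->state = D_ANSWERING;
        e->waiters++;
        puts("read turns around; data will flow back along the path");
        break;
    }
}

/* The single data reply passes back down and satisfies every waiter. */
void dir_data(DirEntry *e) {
    e->state = D_SHARED;
    printf("data delivered to %d waiting reader(s)\n", e->waiters);
    e->waiters = 0;
}

int main(void) {
    DirEntry e = { D_INVALID, 0 };
    dir_read(&e);               /* first cache's read reaches this directory */
    dir_read(&e);               /* second cache's read for the same item */
    dir_data(&e);               /* one reply completes both split transactions */
    return 0;
}
```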

Multilevel Write Example
(Figure: erase and exclusive transactions moving through processors, caches, and directories; states I: Invalid, R: Reading, W: Waiting, E: Exclusive, S: Shared; transactions e: erase, x: exclusive)
First write:
1. A cache issues a write request
2. The erase propagates up the hierarchy and back down, invalidating all other copies
3. The top of the hierarchy responds with an acknowledgment
4. The ACK propagates back down, changing states from Waiting to Exclusive
Competing second write (sketched in code below):
1. A second cache issues a write to the same block
2. The second exclusive request encounters the other write to the same block; the first one won because it arrived first, and the winning erase is propagated back down
3. The state of the second cache is changed to RW, and it will issue a read request before issuing another erase (not shown)
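
The race between the two writers resolves at the top of the hierarchy: exactly one erase is acknowledged. A minimal sketch of the outcome, assuming both caches have already gone to Waiting; the function and field names are hypothetical.

```c
#include <stdio.h>

typedef enum { ST_S, ST_W, ST_RW, ST_E } ItemState;  /* Shared, Waiting,
                                                        Reading+Waiting, Exclusive */
typedef struct { const char *name; ItemState st; } Cache;

/* Both writers sent an erase toward the top; the top acknowledges
 * exactly one. The loser's copy was erased on the way down, so it
 * falls to Reading-and-Waiting and must reread, then erase again,
 * before it can write. */
void resolve_write_race(Cache *winner, Cache *loser) {
    winner->st = ST_E;   /* ack walks back down: Waiting -> Exclusive */
    loser->st  = ST_RW;  /* must issue a read, then another erase (not shown) */
    printf("%s: Exclusive; %s: Reading-and-Waiting, will retry\n",
           winner->name, loser->name);
}

int main(void) {
    Cache a = { "cache A", ST_W }, b = { "cache B", ST_W };
    resolve_write_race(&a, &b);  /* A's erase reached the top first */
    return 0;
}
```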

Memory Overhead
- Inclusion is necessary for the directories, but not for the data
- Directories need only state bits and address tags
- For the two sample configurations given, overhead was 6% for a one-level 32-processor machine and 16% for a two-level 256-processor machine
- A larger item size reduces the overhead (see the arithmetic sketch below)
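
The last bullet is easy to see with arithmetic: each directory level adds only a tag plus state bits per item, so the relative overhead shrinks as the item grows. The bit widths and item sizes below are assumptions for illustration, not the paper's sample configurations.

```c
#include <stdio.h>

int main(void) {
    double tag_bits = 20, state_bits = 4;   /* assumed widths */
    int item_bytes[] = { 16, 64 };          /* assumed item sizes */
    for (int i = 0; i < 2; i++) {
        double item_bits = item_bytes[i] * 8.0;
        /* one entry (tag + state, no data) per item per directory level */
        double per_level = (tag_bits + state_bits) / item_bits;
        printf("%2d-byte items: ~%.1f%% overhead per directory level\n",
               item_bytes[i], 100 * per_level);
    }
    return 0;                               /* quadrupling the item size cuts
                                               the ratio by a factor of four */
}
```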

Simulated Performance
- Minimal success on programs in which each processor operates on the entire shared data set
- MP3D was rewritten to improve performance by exploiting the fact that data has no home
- The OS, hardware, and emulator were still in development at the time
- A different DDM topology was used for each program (a weakness)

Strengths
- Each processor attracts the data it's using into its own memory space
- Data doesn't need to be duplicated at a home node
- Ordinary shared-memory programming model; no need to optimize static partitioning (there is none)
- The directory hierarchy scales reasonably well
- Good when data is moved around in smaller chunks

Weaknesses
- Attraction memories hold data that isn't being used, making them bigger and slower
- A different DDM hierarchy topology was used for each program in the simulations
- Does not fully exploit large spatial locality; software wins in that case (S-COMA)
- The branching hierarchy is prone to bottlenecks and hotspots
- No way to know where data is except by an expensive tree traversal (NUMA wins here)

Alternatives to COMA/DDM
- Flat-COMA: blocks are free to migrate, but have home nodes with directories corresponding to their physical addresses
- Simple-COMA: allocation is managed by the OS and done at page granularity
- Reactive-NUMA: switches between S-COMA and CC-NUMA with a remote cache on a per-page basis
- Good summary of COMAs: