The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor
Computer Systems Laboratory, Stanford University
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy

Designing a low-cost, high-performance multiprocessor
 Message-passing (multicomputer): distributed address space, local access only
 -more scalable
 -but more cumbersome to program
 Shared-memory (multiprocessor): single address space, remote access
 -simpler (data partitioning, dynamic load distribution)
 -but consumes bandwidth and requires cache coherence

DASH (Directory Architecture for Shared memory)
 Main memory is distributed among the processing nodes to provide scalable memory bandwidth
 A distributed directory-based protocol supports cache coherence

DASH architecture
 Processing node (cluster)
 -bus-based multiprocessor with a snoopy protocol
 -amortizes the cost of the directory logic and network interface
 Set of clusters
 -connected by a mesh interconnection network
 -distributed directory-based protocol; the directory keeps summary information for each memory line, specifying the clusters that are caching it

Details
 Cache: private to each processor
 Memory: shared by the processors within the same cluster
 Directory memory: keeps track of all clusters caching a block, and sends point-to-point messages (invalidate/update) instead of broadcasts
 Remote Access Cache (RAC): maintains the state of currently outstanding requests, and buffers replies from the network until the waiting processor is released for bus arbitration

Designing the distributed directory-based protocol
 Correctness issues
 -memory consistency model: strongly constrained or less constrained?
 -deadlock: loops in which servicing one request requires generating another
 -error handling: manage data integrity and fault tolerance
 Performance issues
 -latency: write misses use a write buffer and the release consistency model; read misses minimize inter-cluster messages and per-message delay
 -bandwidth: reduce serialization (queuing delays) and traffic (number of messages); DASH uses caches and distributed memory
 Distributed control and complexity issues
 -distribute control among the components, balancing system performance against the complexity of each component

DASH prototype
 Cluster (node): Silicon Graphics POWER Station 4D/240
 4 processors (MIPS R3000/R3010)
 L1: 64-Kbyte instruction cache and 64-Kbyte write-through data cache
 L2: 256-Kbyte write-back cache; converts the write-through L1 traffic to write-back, holds cache tags for snooping, and maintains consistency using the Illinois MESI protocol

 Memory bus
 -split into a 32-bit address bus and a 64-bit data bus
 -supports memory-to-cache and cache-to-cache transfers
 -16 bytes every 4 bus clocks, with a latency of 6 bus clocks; maximum bandwidth 64 MB/s
 -retry mechanism: when a request requires service from a remote cluster, the request is signaled to retry; the requesting processor is masked and unmasked to avoid unnecessary retries
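As a quick consistency check on these figures (the 16 MHz bus clock is inferred from the other two numbers, not stated on the slide):

\[ \frac{16\ \text{bytes}}{4\ \text{clocks}} \times 16\ \text{MHz} = 64\ \text{MB/s} \]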

Modifications
 Directory controller board: maintains inter-node cache coherence and interfaces to the interconnection network
 Directory controller (DC): contains the directory memory corresponding to the local portion of main memory; initiates outbound network requests
 Pseudo-CPU (PCPU): buffers incoming requests and issues them on the bus
 Reply controller (RC): tracks outstanding requests made by local processors, receives and buffers the corresponding replies from remote clusters, and acts as memory when a retried request is reissued
 Interconnection network: two wormhole-routed meshes (request and reply)
 Hardware monitoring logic with miscellaneous control and status registers: samples directory board and bus events to derive usage and performance statistics

 Directory memory
 -an array of directory entries, one entry per memory block
 -a single state bit (shared/dirty)
 -a bit vector of pointers, one per cluster (16 in the prototype)
 -directory information is combined with the bus operation, the address, and the result of snooping within the cluster
 -the DC generates the resulting network messages and bus controls

Assume N processors (clusters, in DASH). With each cache block in memory: N presence bits (the bit vector) and 1 dirty bit (the state bit).
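A minimal C sketch of one such entry for the 16-cluster prototype (the type and field names are illustrative assumptions, not DASH's actual implementation):

#include <stdint.h>

#define NUM_CLUSTERS 16   /* DASH prototype: one presence bit per cluster */

/* One directory entry per memory block. */
typedef struct {
    uint16_t presence;    /* bit i set => cluster i is caching the block   */
    uint8_t  dirty;       /* 0 = uncached/shared, 1 = dirty in one cluster */
} dir_entry_t;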

 Remote Access Cache (RAC)
 -maintains the state of currently outstanding requests for the RC
 -buffers replies from the network until the waiting processor is released for bus arbitration
 -supplements the functionality of the processors' caches
 -supplies data cache-to-cache when the released processor retries the access
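A hypothetical sketch of the per-request state a RAC entry has to carry, continuing the illustrative types above (all names and field widths here are assumptions):

/* Plausible contents of one RAC entry for an outstanding request. */
typedef struct {
    uint32_t block_addr;       /* address of the memory block being fetched   */
    uint8_t  state;            /* e.g. read-pending, invalidated-read-pending */
    uint8_t  acks_outstanding; /* invalidation acknowledgments still expected */
    uint8_t  data[16];         /* buffered reply data (16-byte block)         */
} rac_entry_t;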

DASH cache coherence protocol: terminology
 Local cluster: the cluster that contains the processor originating a given request
 Home cluster: the cluster that contains the main memory and directory for a given physical memory address
 Remote cluster: any other cluster
 Owning cluster: the cluster that owns a dirty memory block
 Local memory: the main memory associated with the local cluster
 Remote memory: any memory whose home is not the local cluster

DASH cache coherence protocol
 An invalidation-based ownership protocol
 Memory block states
 Uncached-remote: not cached by any remote cluster
 Shared-remote: cached in an unmodified state by one or more remote clusters
 Dirty-remote: cached in a modified state by a single remote cluster
 Cache block states
 Invalid: the copy in this cache is stale
 Shared: other processors may also be caching this location
 Dirty: this cache holds an exclusive copy of the memory block, and the block has been modified
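The same states expressed as C enums, for use in the sketches below (names are mine, not DASH's):

/* Directory (memory-block) states kept at the home cluster. */
typedef enum { UNCACHED_REMOTE, SHARED_REMOTE, DIRTY_REMOTE } mem_state_t;

/* Cache-block states as listed on this slide. */
typedef enum { INVALID, SHARED, DIRTY } cache_state_t;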

3 primitive operations
 Read request (load)
 -L1 hit: L1 simply supplies the data
 -L2 hit: a fill operation brings the required block into L1
 -otherwise, a read request is sent on the bus:
 -shared-local: the data is simply transferred over the bus
 -dirty-local: the RAC takes ownership of the cache line
 -uncached-remote / shared-remote: the home sends the data over the reply network to the requesting cluster
 -dirty-remote: the request is forwarded to the owning cluster, which sends the data directly to the requesting cluster and a sharing write-back request to the home cluster
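A sketch of the home cluster's side of this, using the simplified types above; the message-send helpers are hypothetical, and real DASH folds this logic into the DC hardware rather than software:

/* Hypothetical one-way message primitives over the request/reply meshes. */
void send_reply_data(int dest_cluster, uint32_t addr);
void forward_read(int owner_cluster, int req_cluster, uint32_t addr);

/* Home cluster handling a read request arriving from a remote cluster. */
void home_handle_read(dir_entry_t *e, int req_cluster, uint32_t addr) {
    if (!e->dirty) {
        /* Uncached-remote or shared-remote: memory is up to date,
         * so reply with the data and record the new sharer. */
        send_reply_data(req_cluster, addr);
        e->presence |= (uint16_t)(1u << req_cluster);
    } else {
        /* Dirty-remote: exactly one presence bit is set; forward to
         * that owner, which replies directly to the requester and
         * sends a sharing write-back to this home cluster. */
        int owner = 0;
        while (!((e->presence >> owner) & 1u)) owner++;
        forward_read(owner, req_cluster, addr);
    }
}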

Forwarding strategy
 -reduces latency through direct responses
 -lets the directory process many requests simultaneously (multithreaded), reducing serialization
 Additional latency when simultaneous accesses are made to the same block: the 1st request is satisfied and the dirty cluster loses ownership; the 2nd request returns a negative acknowledgment (NAK) that forces a retry of the access

 Read-exclusive request (store)
 -block in local memory: write the data and invalidate other copies
 -dirty-remote: the owning processor invalidates the block in its own cache, sends the data and a grant of ownership to the requesting cluster, and sends an ownership-update message to the home cluster
 -uncached-remote / shared-remote: the home replies with the data and sends invalidation requests for any copies in the shared state
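Continuing the sketch, the home's side of a read-exclusive might look like this (again with hypothetical helpers; the ack count returned to the requester is what the RAC sketch above would track):

/* More hypothetical message primitives. */
void send_reply_data_ex(int dest_cluster, uint32_t addr, int inval_count);
void send_invalidate(int dest_cluster, uint32_t addr);
void forward_read_ex(int owner_cluster, int req_cluster, uint32_t addr);

/* Home cluster handling a read-exclusive request from a remote cluster. */
void home_handle_read_ex(dir_entry_t *e, int req_cluster, uint32_t addr) {
    if (e->dirty) {
        /* Dirty-remote: forward; the owner transfers data + ownership
         * to the requester and updates ownership at the home. */
        int owner = 0;
        while (!((e->presence >> owner) & 1u)) owner++;
        forward_read_ex(owner, req_cluster, addr);
        return;
    }
    /* Uncached-remote or shared-remote: send point-to-point
     * invalidations to the other sharers, and reply with the data
     * plus the number of invalidation acks the requester expects. */
    uint16_t sharers = e->presence & (uint16_t)~(1u << req_cluster);
    int count = 0;
    for (int c = 0; c < NUM_CLUSTERS; c++)
        if ((sharers >> c) & 1u) { send_invalidate(c, addr); count++; }
    send_reply_data_ex(req_cluster, addr, count);
    e->presence = (uint16_t)(1u << req_cluster);  /* sole owner now */
    e->dirty = 1;
}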

Acknowledgments
 -needed for the requesting processor to know when the store has completed with respect to all processors
 -maintain consistency: guarantee that the new owner will not lose ownership before the directory has been updated

 Write-back request: a dirty cache line that is replaced must be written back to memory
 -home cluster is local: write back to local main memory
 -home cluster is remote: send a message to the remote home cluster, which updates its main memory and marks the block uncached-remote

Bus-initiated cache transactions
 Transactions made by caches snooping the bus
 -read operation: a dirty cache supplies the data and changes to the shared state
 -read-exclusive operation: all other cached copies are invalidated
 -when a line in L2 is invalidated, L1 does the same

Exception conditions
 A request forwarded to a dirty cluster may arrive there to find that the dirty cluster no longer owns the data:
 -a prior access changed ownership
 -the owning cluster performed a write-back
Sol: the requesting cluster is sent a NAK response and is required to reissue the request (the mask is released, and the retry is treated as a new request)

 Ownership bouncing between two remote clusters can leave a requesting cluster receiving NAKs repeatedly
 -the request eventually times out
 -a bus error is returned
Sol: add an additional directory state that queues such accesses, responding to all read requests and granting ownership to each exclusive request on a pseudo-random basis

 With separate request and reply networks, some messages sent between two clusters can be received out of order
Sol: replies are acknowledged, and out-of-order requests receive a NAK response

 An invalidation request may overtake the read reply carrying the very copy it is trying to purge
Sol: when the RAC detects an invalidation request for a pending read, it changes the state of that RAC entry to invalidated-read-pending; the RC then assumes that any read reply is stale and treats the reply as a NAK response

Deadlock
 Hardware
 -2 mesh networks, point-to-point message passing
 -consumption of an incoming message may require the generation of another outgoing message
 Protocol
 -request messages: read, read-exclusive, and invalidation requests
 -reply messages: read and read-exclusive replies, invalidation acknowledgments
 -the two meshes serve separate functions (requests vs. replies), breaking cyclic dependencies

Error handling
 Error-checking hardware
 -ECC on main memory
 -parity checking on directory memory
 -length checking of network messages
 -checking for inconsistent bus and network messages
 Errors are reported to the processor through bus errors and associated error-capture registers
 The issuing processor times out the originating request or fencing operation; the OS can clean up the state of a line using back-door paths that allow direct addressing of the RAC and directory memory

Scalability of the DASH directory
 Amount of directory memory = memory size x number of processors, so it grows with both
 Limited pointers per entry: avoid dedicating space to processors that are not caching the line
 Allow pointers to be shared between directory entries
 Use a cache of directory entries to supplement or replace the full directory
 Sparse directories: limited pointers plus a coarse vector
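For the prototype's full-bit-vector scheme, the overhead is easy to estimate (a back-of-the-envelope sketch, not a figure from the slide): with 16 presence bits plus 1 state bit per 16-byte block,

\[ \text{overhead} = \frac{16 + 1\ \text{bits}}{16 \times 8\ \text{bits}} \approx 13\% \]

of main memory, and the numerator grows linearly with the number of clusters, which is what motivates the alternatives above.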

Validation of the protocol
 2 software-simulator-based testing methods
 -a low-level DASH system simulator that incorporates the coherence protocol, caches, buses, and interconnection network
 -a high-level functional simulator that models the processors and executes parallel programs
 2 schemes for testing the protocol
 -running existing parallel programs and comparing outputs
 -test scripts
 Hardware testing

Comparison with the Scalable Coherent Interface (SCI) protocol
 Similarities
 -both rely on coherent caches maintained by distributed directories
 -both rely on distributed memories to provide scalable memory bandwidth
 Differences
 -in SCI, the directory is a distributed sharing list maintained by the caches
 -in DASH, all directory information is placed with main memory

 SCI advantages
 -the number of directory pointers grows naturally with the number of processors
 -the directory employs the same SRAM technology used by the caches
 -forward progress is guaranteed in all cases
 SCI disadvantages
 -distributed directory entries increase the complexity and latency of the directory protocol; additional update messages must be sent between caches
 -more inter-node communication is required