1
The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor Computer Systems Laboratory, Stanford University Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Anoop Gupta, and John Hennessy
2
Designing low-cost, high-performance multiprocessors Message-passing (multicomputer) - distributed address space, local access only more scalable more cumbersome to program Shared-memory (multiprocessor) - single address space, remote access allowed simpler programming (data partitioning, dynamic load distribution) consumes bandwidth, requires cache coherence
3
DASH (Directory Architecture for SHared memory) Main memory is distributed among the processing nodes to provide scalable memory bandwidth A distributed directory-based protocol supports cache coherence
4
DASH architecture Processing node (cluster) -bus-based multiprocessor -snoopy protocol; amortizes the cost of the directory logic and network interface Set of clusters -mesh interconnection network -distributed directory-based protocol keeps summary information for each memory line specifying the clusters that are caching it
6
Details Cache--individual to each processor Memory--shared by processors within the same cluster Directory memory--keeps track of all clusters caching a block; sends point-to-point messages (invalidate/update) to avoid broadcast Remote Access Cache (RAC)--maintains the state of currently outstanding requests and buffers replies from the network, releasing the waiting processor for bus arbitration
7
Design of the distributed directory-based protocol Correctness issues -memory consistency model: strongly constrained or less constrained? -deadlock: cycles in which servicing one request requires generating another -error handling: managing data integrity and fault tolerance Performance issues -latency: write misses use a write buffer and the release consistency model; read misses minimize inter-cluster messages and message delay -bandwidth: reduce serialization (queuing delays) and traffic (number of messages); the caches and distributed memory in DASH help here Distributed control and complexity issues -distribute control to the components, balancing system performance against component complexity
9
DASH prototype Cluster (node): Silicon Graphics PowerStation 4D/240 4 processors (MIPS R3000/R3010) L1 (64 Kbyte instruction, 64 Kbyte write-through data) L2 (256 Kbyte write-back) converts L1's write-through policy to write-back; its cache tags are used for snooping; consistency is maintained using the Illinois MESI protocol
11
Memory bus Separated into a 32-bit address bus and a 64-bit data bus Supports memory-to-cache and cache-to-cache transfers: 16 bytes every 4 bus clocks, with a latency of 6 bus clocks; maximum bandwidth 64 MB/s Retry mechanism: when a request requires service from a remote cluster, the request is signaled to retry; the requesting processor is masked and unmasked to avoid unnecessary retries
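The quoted peak bandwidth follows from the transfer rate, assuming a 16 MHz bus clock (the clock rate is not stated on this slide, so that figure is an assumption here):

```python
# Peak bandwidth of the DASH memory bus: 16 bytes move every 4 bus clocks.
BYTES_PER_TRANSFER = 16
CLOCKS_PER_TRANSFER = 4
BUS_CLOCK_HZ = 16_000_000  # assumed 16 MHz bus clock (not given on the slide)

bytes_per_clock = BYTES_PER_TRANSFER / CLOCKS_PER_TRANSFER  # 4 bytes/clock
peak_bandwidth = bytes_per_clock * BUS_CLOCK_HZ             # bytes/second

print(peak_bandwidth / 1_000_000)  # -> 64.0 (MB/s)
```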
12
Modifications Directory controller board -maintains inter-node cache coherence; interfaces to the interconnection network Directory controller (DC)--contains the directory memory corresponding to this cluster's portion of main memory; initiates outbound network requests Pseudo-CPU (PCPU)--buffers incoming requests and issues them on the bus Reply controller (RC)--tracks outstanding requests made by local processors, receives and buffers the corresponding replies from remote clusters, and acts as memory when a retried request is reissued Interconnection network--two wormhole-routed meshes (request and reply) Hardware monitoring logic, miscellaneous control and status registers--the logic samples directory board and bus events to derive usage and performance statistics
15
Directory memory -array of directory entries -one entry for each memory block -single state bit (shared/dirty) -a bit vector of pointers, one per each of the 16 clusters -directory information is combined with the bus operation, address, and result of snooping within the cluster -the DC generates network messages and bus controls
16
Assume N processors. With each cache block in memory: N presence bits (bit vector) and 1 dirty bit (state bit)
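A full-bit-vector directory entry can be sketched as below. The class and method names are illustrative, not taken from the DASH implementation; the 16-cluster default matches the prototype described earlier.

```python
# One directory entry per memory block: a presence bit per cluster
# plus a single dirty/shared state bit.
class DirectoryEntry:
    def __init__(self, n_clusters=16):
        self.presence = [False] * n_clusters  # bit vector: who caches the block
        self.dirty = False                    # state bit: shared vs. dirty

    def add_sharer(self, cluster):
        self.presence[cluster] = True

    def sharers(self):
        return [c for c, p in enumerate(self.presence) if p]

entry = DirectoryEntry()
entry.add_sharer(3)
entry.add_sharer(7)
print(entry.sharers())  # -> [3, 7]
```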
17
Remote Access Cache (RAC) Maintains the state of currently outstanding requests for the RC Buffers replies from the network so the waiting processor can be released for bus arbitration Supplements the functionality of the processors' caches Supplies data cache-to-cache when the released processor retries the access
18
DASH cache coherence protocol Local cluster: the cluster that contains the processor originating a given request Home cluster: the cluster that contains the main memory and directory for a given physical memory address Remote cluster: any other cluster Owning cluster: a cluster that owns a dirty memory block Local memory: the main memory associated with the local cluster Remote memory: any memory whose home is not the local cluster
19
DASH cache coherence protocol Invalidation-based ownership protocol Memory block states Uncached-remote--not cached by any remote cluster Shared-remote--cached in an unmodified state by one or more remote clusters Dirty-remote--cached in a modified state by a single remote cluster Cache block states Invalid--the copy in the cache is stale Shared--other processors may also be caching the location Dirty--this cache contains an exclusive copy of the memory block, and the block has been modified
20
3 primitive operations Read request (load) In L1: L1 simply supplies the data In L2: a fill operation finds and brings the required block into L1 Otherwise: a read request is sent on the bus Shared-local: the data is simply transferred over the bus Dirty-local: the RAC takes ownership of the cache line Uncached-remote/shared-remote: the home sends the data over the reply network to the requesting cluster Dirty-remote: the home forwards the request to the owning cluster; the owning cluster sends the data to the requesting cluster and a sharing write-back request to the home cluster
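The remote cases above can be sketched as the home cluster's decision logic. This is a minimal sketch: the state names and message tuples are illustrative, not DASH's actual encoding.

```python
def home_handle_read(state, owner, requester, home):
    """Return the messages the home cluster generates for a remote read."""
    if state in ("uncached-remote", "shared-remote"):
        # Home has a clean copy: reply with the data directly.
        return [("read-reply", home, requester)]
    if state == "dirty-remote":
        # Forward to the owner; the owner then replies to the requester
        # and sends a sharing write-back to the home.
        return [("forward-read", home, owner)]
    raise ValueError(f"unknown state: {state}")

print(home_handle_read("shared-remote", None, 2, 0))
# -> [('read-reply', 0, 2)]
```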
21
Forwarding strategy Reduces latency through direct responses Allows many requests to be processed simultaneously (multithreaded) Reduces serialization Additional latency arises when simultaneous accesses are made to the same block: the first request is satisfied and the dirty cluster loses ownership, so the second request returns a negative acknowledgment (NAK) that forces a retry of the access
22
Read-exclusive request (store) In local memory: write and invalidate other copies Dirty-remote: the owning processor invalidates the block from its cache, sends the data with a grant of ownership to the requesting cluster, and sends an ownership-update message to the home cluster Uncached-remote/shared-remote: the write is granted, and invalidation requests are sent for copies in the shared state
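As with reads, the home cluster's handling of a store can be sketched, including the invalidations sent to each remote sharer. Names are illustrative, not from the DASH implementation.

```python
def home_handle_read_exclusive(state, sharers, owner, requester, home):
    """Return the messages the home generates for a read-exclusive request."""
    msgs = []
    if state == "dirty-remote":
        # Forward to the owner, which grants ownership + data to the
        # requester and sends an ownership update back to the home.
        msgs.append(("forward-read-exclusive", home, owner))
    else:  # uncached-remote or shared-remote
        # Grant ownership, then invalidate every other remote sharer.
        msgs.append(("read-exclusive-reply", home, requester))
        for s in sharers:
            if s != requester:
                msgs.append(("invalidate", home, s))
    return msgs

print(home_handle_read_exclusive("shared-remote", [1, 3], None, 1, 0))
# -> [('read-exclusive-reply', 0, 1), ('invalidate', 0, 3)]
```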
23
Acknowledgments -needed for the requesting processor to know when the store has completed with respect to all processors -maintain consistency by guaranteeing that the new owner does not lose ownership before the directory has been updated
24
Write-back request A dirty cache line that is replaced must be written back to memory Home cluster is local: write back to local main memory Home cluster is remote: send a message to the remote home cluster, which updates its main memory and marks the block uncached-remote
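The two replacement cases can be sketched as a single decision, returning the message the replacing cluster generates (message names are illustrative):

```python
def write_back(home_cluster, local_cluster, addr, data):
    """Return the message generated when a dirty line is replaced."""
    if home_cluster == local_cluster:
        # Home is local: the data goes straight back to local main memory.
        return ("local-write", addr, data)
    # Home is remote: ship the data to the home cluster, which updates
    # its main memory and marks the block uncached-remote.
    return ("remote-write-back", home_cluster, addr, data)

print(write_back(home_cluster=2, local_cluster=0, addr=0x40, data=99))
# -> ('remote-write-back', 2, 64, 99)
```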
25
Bus-initiated cache transactions Transactions observed by caches snooping the bus Read operation: a dirty cache supplies the data and changes to the shared state Read-exclusive operation: all other cached copies are invalidated When a line in L2 is invalidated, L1 does the same
26
Exception conditions A request forwarded to a dirty cluster may arrive there to find that the cluster no longer owns the data: a prior access changed ownership, or the owning cluster performed a write-back Solution: the requesting cluster is sent a NAK response and is required to reissue the request (the mask is released and the retry is treated as a new request)
27
Ownership bouncing between two remote clusters: the requesting cluster receives repeated NAKs, times out, and returns a bus error Solution: add an additional directory state with an access queue; the directory responds to all read-only requests and grants ownership to each exclusive request on a pseudo-random basis
28
With separate request and reply networks, some messages sent between two clusters can be received out of order Solution: acknowledgment replies; out-of-order requests receive a NAK response
29
An invalidation request may overtake the read reply whose copy it is trying to purge Solution: when the RAC detects an invalidation request for a pending read, it changes the state of that RAC entry to invalidated-read-pending; the RC then assumes that any read reply is stale and treats the reply as a NAK response
30
Deadlock Hardware: two mesh networks with point-to-point message passing; consuming an incoming message may require generating an outgoing message Protocol: request messages (read, read-exclusive, and invalidation requests) and reply messages (read and read-exclusive replies, invalidation acknowledgments) travel on separate meshes
31
Error handling Error-checking mechanisms: ECC on main memory; parity checking on directory memory; length checking of network messages; checking for inconsistent bus and network messages Errors are reported to the processor through bus errors and associated error-capture registers The issuing processor times out the originating request or fences the operation The OS can clean up the state of a line by using back-door paths that allow direct addressing of the RAC and directory memory
32
Scalability of the DASH directory Amount of directory memory = memory size x number of processors Alternatives: Limited pointers per entry--no space is spent on processors that are not caching the line Allow pointers to be shared between directory entries Use a cache of directory entries to supplement or replace the normal directory Sparse directories--limited pointers combined with a coarse bit vector
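The linear growth can be made concrete by computing the full-bit-vector overhead per block. The 16-byte block size follows the prototype; the node counts are illustrative, and the slide's "processors" is taken here as the number of tracked nodes.

```python
# Each memory block carries N presence bits + 1 state bit, so the
# directory overhead relative to the 16-byte (128-bit) block grows
# linearly with N.
BLOCK_BITS = 16 * 8  # 16-byte cache line

def overhead_fraction(n_nodes):
    return (n_nodes + 1) / BLOCK_BITS

for n in (16, 64, 256):
    print(n, f"{overhead_fraction(n):.1%}")
# -> 16 13.3%
#    64 50.8%
#    256 200.8%   (the directory outgrows the data itself)
```

The last line is the scalability problem in miniature: at 256 nodes a full bit vector costs more storage than the cached data, which motivates the limited-pointer and sparse-directory alternatives above.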
33
Validation of the protocol Two software simulators formed the base testing method: A low-level DASH system simulator incorporating the coherence protocol, caches, buses, and interconnection network A high-level functional simulator that models the processors and executes parallel programs Two schemes for testing the protocol: Running existing parallel programs and comparing their output Test scripts Hardware testing followed
34
Comparison with the Scalable Coherent Interface (SCI) protocol Similarities -both rely on coherent caches maintained by distributed directories -both rely on distributed memories to provide scalable memory bandwidth Differences -in SCI, the directory is a distributed sharing list maintained by the caches -in DASH, all directory information is placed with main memory
35
SCI advantages -the number of directory pointers grows naturally with the number of processors -the directory employs the same SRAM technology used by the caches -forward progress is guaranteed in all cases SCI disadvantages -distributed directory entries increase the complexity and latency of the directory protocol, since additional update messages must be sent between caches -more inter-node communication is required