Copyright 2004 Daniel J. Sorin

Slides:



Advertisements
Similar presentations
1 Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations (wrap-up of SGI Origin and Sequent NUMA case study)
Advertisements

4/16/2013 CS152, Spring 2013 CS 152 Computer Architecture and Engineering Lecture 19: Directory-Based Cache Protocols Krste Asanovic Electrical Engineering.
1 Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations.
Cache Optimization Summary
CS 213 Commercial Multiprocessors. Origin2000 System – Shared Memory Directory state in same or separate DRAMs, accessed in parallel Upto 512 nodes (1024.
CIS629 Coherence 1 Cache Coherence: Snooping Protocol, Directory Protocol Some of these slides courtesty of David Patterson and David Culler.
Scalable CC-NUMA Design Study - SGI Origin 2000 CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.
1 Lecture 4: Directory-Based Coherence Details of memory-based (SGI Origin) and cache-based (Sequent NUMA-Q) directory protocols.
CS252/Patterson Lec /23/01 CS213 Parallel Processing Architecture Lecture 7: Multiprocessor Cache Coherency Problem.
1 Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations.
CS252/Patterson Lec /28/01 CS 213 Lecture 9: Multiprocessor: Directory Protocol.
1 Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations.
NUMA coherence CSE 471 Aut 011 Cache Coherence in NUMA Machines Snooping is not possible on media other than bus/ring Broadcast / multicast is not that.
1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.
CS252/Patterson Lec /28/01 CS 213 Lecture 10: Multiprocessor 3: Directory Organization.
EECS 570: Fall rev3 1 Chapter 8: Cache Coherents in Scalable Multiprocessors.
1 Shared-memory Architectures Adapted from a lecture by Ian Watson, University of Machester.
Spring 2003CSE P5481 Cache Coherency Cache coherent processors reading processor must get the most current value most current value is the last write Cache.
The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor Computer System Laboratory Stanford University Daniel Lenoski, James Laudon, Kourosh.
1 Cache coherence CEG 4131 Computer Architecture III Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini.
Lecture 3. Directory-based Cache Coherence Prof. Taeweon Suh Computer Science Education Korea University COM503 Parallel Computer Architecture & Programming.
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 March 20, 2008 Session 9.
The University of Adelaide, School of Computer Science
1 Lecture 8: Snooping and Directory Protocols Topics: 4/5-state snooping protocols, split-transaction implementation details, directory implementations.
Lecture 8: Snooping and Directory Protocols
Architecture and Design of AlphaServer GS320
Scalable Cache Coherent Systems
The University of Adelaide, School of Computer Science
The University of Adelaide, School of Computer Science
CS 704 Advanced Computer Architecture
Multiprocessor Cache Coherency
Scalable Cache Coherent Systems
The University of Adelaide, School of Computer Science
CMSC 611: Advanced Computer Architecture
Krste Asanovic Electrical Engineering and Computer Sciences
Example Cache Coherence Problem
The University of Adelaide, School of Computer Science
Lecture 9: Directory-Based Examples II
CS5102 High Performance Computer Systems Distributed Shared Memory
Lecture 2: Snooping-Based Coherence
Scalable Cache Coherent Systems
Lecture 8: Directory-Based Cache Coherence
Lecture 7: Directory-Based Cache Coherence
11 – Snooping Cache and Directory Based Multiprocessors
CS 213 Lecture 11: Multiprocessor 3: Directory Organization
Lecture 9: Directory Protocol Implementations
Lecture 25: Multiprocessors
Lecture 9: Directory-Based Examples
Scalable Cache Coherent Systems
High Performance Computing
Slides developed by Dr. Hesham El-Rewini Copyright Hesham El-Rewini
Lecture 8: Directory-Based Examples
Lecture 25: Multiprocessors
Cache Coherence: Part II Scalable Approaches Todd C
The University of Adelaide, School of Computer Science
Lecture 17 Multiprocessors and Thread-Level Parallelism
Cache coherence CEG 4131 Computer Architecture III
Lecture 24: Virtual Memory, Multiprocessors
Scalable Cache Coherent Systems
Lecture 24: Multiprocessors
Scalable Cache Coherent Systems
Coherent caches Adapted from a lecture by Ian Watson, University of Machester.
Lecture 17 Multiprocessors and Thread-Level Parallelism
The University of Adelaide, School of Computer Science
Scalable Cache Coherent Systems
Lecture 10: Directory-Based Examples II
Scalable Cache Coherent Systems
Lecture 17 Multiprocessors and Thread-Level Parallelism
Interconnection Network and Prefetching
Scalable Cache Coherent Systems
Presentation transcript:

Copyright 2004 Daniel J. Sorin ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Shared Memory MPs - Directories Copyright 2004 Daniel J. Sorin Duke University Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!

Outline Directory-Based Cache Coherence Stanford DASH Case Study Motivation Basic Idea Some Variations Stanford DASH Case Study SGI Origin Case Study Advanced Directory Systems ECE 259 / CPS 221

Why Snooping Doesn’t Scale Limitations of snooping Broadcasting uses lots of bus bandwidth Snooping every transaction uses lots of controller bandwidth Snooping is great for small-medium size machines Largest current snooping system has 128 processors Snooping hits bottleneck for large machines So how do we overcome this bottleneck? Get rid of protocol that requires: Broadcasting Logical bus ECE 259 / CPS 221

Large Scale Shared Memory Multiprocessors 100s to 1000s of nodes (processors) with single shared physical address space Use general purpose interconnection network Still have cache coherence protocol Use messages instead of bus transactions No hardware broadcast Communication Assist Interconnect Cntrl/NI Mem P $ ECE 259 / CPS 221

Directory Based Cache Coherence Avoid broadcast request to all nodes on a miss Maintain directory of which nodes have cached copies of the block (directory controller + directory state) On a miss, send message to directory Communication assist Directory determines what (if any) protocol action is required E.g., invalidation Directory waits for protocol actions to finish and then responds to the original request Directory is the new serialization point ECE 259 / CPS 221

Directory Example Directory Node P1 D P2 S Time st x I Request S Invalidate I M Ack M Response ECE 259 / CPS 221

Centralized Directory Single directory that contains a copy of all nodes’ cache tags Disadvantages Bottleneck (1000s of processors…) Directory changes with number of nodes Advantages Only send invalidates/updates to those nodes that have copy of block ECE 259 / CPS 221

Flat, Memory-Based Distributed Directory Distribute directory among memory modules Maintain directory for each memory block Block’s home node = node with directory info for that block Interconnect Cntrl/NI Dir Mem P $ Mem Block ECE 259 / CPS 221

Directory Nomenclature DiriX -- 2 variables (i and X) Directory of i pointers (i <= total number of nodes) X specifies what to do on Shared to Modified transition B = Broadcast NB = No Broadcast SW = Software DirN = full-map directory Bit vector per memory block Bit per node in system No need to broadcast ECE 259 / CPS 221

Limited Pointer Directory Directory contains < N pointers Not bit vector, but instead pointers to nodes Less overhead than full-map directory What to do when run out of pointers depends on what we want to do on SM transition Broadcast (DiriB) Just give out another copy of block Modify state to indicate broadcast If SM, then broadcast invalidation No Broadcast (DiriNB) Never allow more than i Sharers If another request for shared, then invalidate a current sharer ECE 259 / CPS 221

Replacement Notification Should the directory be notified for blocks that are replaced from Shared state? Reasons to do this Can avoid broadcast, clear bit/pointer when notified Reasons not to do this Read-only data that will never be invalidated Notifications cause unnecessary traffic ECE 259 / CPS 221

Coarse Vector and Sparse Directories Instead of full-map or broadcast, indicate a set of nodes that may have the block Reduces space requirements Many applications have near neighbor sharing Sparse Directory (aka Directory Caching) Not all of memory is in processor caches Size of memory >> sum of cache sizes Cache of directory entries at memory ECE 259 / CPS 221

Software Assistance Trap to Software if we run out of pointers Limitless Directory (MIT Alewife) Dir1SW (Wisconsin Wind Tunnel Group) Why software assistance? Cost (less hardware) Flexibility Actually, can do everything in software = Software DSM Page-based DSM Blizzard-S Shasta ECE 259 / CPS 221

Flat, Cache-Based Distributed Directory (SCI) Also known as “chaining directory” Build linked list of nodes containing cache block Store pointers in cache with block of data Home node points to start of list cache 1 cache 5 home Issues Replacements Time to send invalidations ECE 259 / CPS 221

Chaining (SCI) Scalable Coherent Interface (IEEE standard) Build linked list of nodes containing cache block Store pointers in cache with block of data Home node points to start of list Issues Replacements Time to send invalidations ECE 259 / CPS 221

Outline Directory-Based Cache Coherence Stanford DASH Case Study SGI Origin Case Study Advanced Directory Systems ECE 259 / CPS 221

DASH First system with directory-based cache coherence Academic design (Stanford) that led to SGI Origin Also had follow-on at Stanford called FLASH DISCUSS DASH PAPER ECE 259 / CPS 221

Outline Directory-Based Cache Coherence Stanford DASH Case Study SGI Origin Case Study Overview Directory & Protocol States Detailed Coherence Protocol Examples Advanced Directory Systems ECE 259 / CPS 221

Origin2000 System Overview Single 16”-by-11” PCB (except for Xbow I/O) Directory state in same or separate DRAMs, accessed in parallel Up to 512 nodes (1024 processors) With 195MHz R10K processor, peak 390MFLOPS/780 MIPS Peak SysAD bus b/w is 780 MB/s, same for Hub to Mem b/w Hub to router chip and to Xbow is 1.56 GB/s (both are off-board) ECE 259 / CPS 221

Origin Node Board Hub is 500K-gate in 0.5 um CMOS Has outstanding transaction buffers for each processor (4 each) Has two block transfer engines (memory copy and fill) Interfaces to and connects processor, memory, network and I/O Provides support for synch primitives, and for page migration Two processors within node not snoopy-coherent (cost) ECE 259 / CPS 221

Origin Network Each router has six pairs of 1.56MB/s unidirectional links Two to nodes, up to four to other routers Latency: 41ns pin to pin across a router Flexible cables up to 3 ft long Four “virtual channels”: request, reply, two for priority or I/O ECE 259 / CPS 221

Origin Directory Structure Flat, memory-based: all directory information at home Three directory formats: (1) If exclusive in a cache, entry is pointer to that specific processor (not node) (2) If shared, bit vector: each bit points to a node (Hub), not processor Invalidation sent to a Hub is broadcast to both processors in the node Two sizes, depending on scale 16-bit format (32 procs), kept in main memory DRAM 64-bit format (128 procs), extra bits kept in extension memory (3) For larger machines, coarse vector: each bit corresponds to p/64 nodes Ignore coarse vector in discussion for simplicity ECE 259 / CPS 221

Origin Cache and Directory States Cache states: MESI (like Illinois snooping protocol) Seven directory states Unowned: no cache has a copy, memory copy is valid Shared: one or more caches has a shared copy, memory is valid Exclusive: one cache (pointed to) has block in modified or exclusive state Three pending or busy states, one for each of the above: Indicates directory has received a previous request for the block Couldn’t satisfy it itself, sent it to another node and is waiting Cannot take another request for the block yet Poisoned state, used for efficient page migration (later) Let’s see how it handles read and “write” requests No point-to-point order assumed in network  lots of races! ECE 259 / CPS 221

Races in the Protocol Without point-to-point ordering in the network, messages can bypass each other and arrive at unexpected times Example (all messages involve block B) Initially: all caches in Invalid, directory in unowned P1 sends GETX to Dir Dir receives P1’s GETX, responds with data (msg gets delayed) P2 sends GETX to Dir Dir forwards P2’s GETX to P1 P1 receives Forwarded-GETX … while in state Invalid! ECE 259 / CPS 221

(1) Handling a Read Miss Hub looks at address If remote, sends request to home directory If local, looks up directory entry and memory itself Directory may indicate one of many states If Shared or Unowned State: If shared, directory sets presence bit If unowned, goes to exclusive state and uses pointer format Replies with block to requestor Strict request-reply (no network transactions if home is local) Also looks up memory speculatively to get data If directory is shared or unowned, data already obtained by Hub If not one of these, speculative memory access is wasted If Busy state: not ready to handle NACK, so as not to hold up buffer space for long ECE 259 / CPS 221

Read Miss to Block in Exclusive State Most interesting case is read miss to Exclusive block If owner is not home, need to get data to home and requestor from owner Uses reply forwarding for lowest latency and traffic Not strict request-reply H req forwarded req spec response response ack R L ECE 259 / CPS 221

Actions at Home & Owner At the home: At the owner: Set directory to Busy-Exclusive and NACK subsequent requests General philosophy of protocol Can’t set to shared or exclusive Alternative is to buffer at home until done, but input buffer problem Set and unset appropriate presence bits Assume block is clean-exclusive and send speculative reply At the owner: If block is dirty Send data reply to requestor and send “sharing writeback” (aka “copyback”) with data to home If block is clean exclusive Similar, but don’t send data (msg to home is called “downgrade”) Home changes state to shared when it receives msg ECE 259 / CPS 221

(2) Handling a Write Miss Request to home could be upgrade or read-exclusive If state is busy: NACK If state is unowned: If RdEx, set bit, change state to dirty, reply with data If Upgrade, means block has been replaced from cache and directory already notified, so upgrade is inappropriate request NACKed (will be retried as RdEx) If state is shared or exclusive: Invalidations must be sent Use reply forwarding; i.e. invalidation acks sent to requestor, not home ECE 259 / CPS 221

Write to Block in Shared State At the home: Set directory state to exclusive and set presence bit for requestor Ensures that subsequent requests will be forwarded to requestor If RdEx, send “excl. reply with invals pending” to requestor (w/data) How many sharers to expect invalidations from If Upgrade, similar “upgrade ack with invals pending” reply, no data Send invals to sharers, which will ack requestor At requestor, wait for all acks to come back before “closing” the operation Subsequent request for block to home is forwarded as intervention to requestor For proper serialization, requestor does not handle it until all acks received for its outstanding request ECE 259 / CPS 221

Write to Block in Exclusive State If upgrade, not valid so NACKed Another write has beaten this one to the home, so requestor’s data not valid If RdEx: Like read, set to busy state, set presence bit, send speculative reply Send invalidation to owner with identity of requestor At owner: If block is dirty in cache Send “ownership xfer” revision msg to home (no data) Send response with data to requestor (overrides speculative reply) If block in clean exclusive state Send ack to requestor (no data; got that from speculative reply) ECE 259 / CPS 221

(3) Handling Writeback Requests Directory state cannot be shared or unowned Requestor (owner) has block dirty If another request had come in to set state to shared, would have been forwarded to owner and state would be busy State is exclusive Directory state set to unowned, and ack returned State is busy: interesting race condition Busy because intervention due to request from another node (Y) has been forwarded to the node X that is doing the writeback Intervention and writeback have crossed each other Y’s operation is already in flight and has had its effect on directory Can’t drop writeback (only valid copy) Can’t NACK writeback and retry after Y’s ref completes Y’s cache will have valid copy while a different dirty copy is written back ECE 259 / CPS 221

Solution to Writeback Race Combine the two operations When writeback reaches directory, it changes the state To shared if it was busy-shared (i.e., Y requested a read copy) To exclusive if it was busy-exclusive Home forwards the writeback data to the requestor Y Sends writeback ack to X When X receives the intervention, it ignores it Knows to do this since it has an outstanding writeback for the line Y’s operation completes when it gets the reply X’s writeback completes when it gets the writeback ack ECE 259 / CPS 221

(4) Replacement of Shared Block Could send a replacement hint to the directory To remove the node from the sharing list Can eliminate an invalidation the next time block is written But does not reduce traffic Have to send replacement hint Incurs the traffic at a different time Origin protocol does not use replacement hints Total transaction types: Coherent memory: 9 request transaction types, 6 inval/intervention, 39 reply Noncoherent (I/O, synch, special ops): 19 request, 14 reply (no inval/intervention) ECE 259 / CPS 221

Outline Directory-Based Cache Coherence Stanford DASH Case Study SGI Origin 2000 Case Study Advanced Directory Systems AlphaServer GS320 Compaq Piranha ECE 259 / CPS 221

AlphaServer GS320 PRESENTATION ECE 259 / CPS 221

Compaq Piranha PRESENTATION ECE 259 / CPS 221

Outline Directory-Based Cache Coherence SGI Origin 2000 Case Study Advanced Directory Systems ECE 259 / CPS 221