CSE 586 Computer Architecture Lecture 9


1 CSE 586 Computer Architecture Lecture 9
Jean-Loup Baer CSE 586 Spring 00

2 Highlights from last week
Parallel Processing Flynn’s taxonomy MIMD machines – Shared-memory multiprocessors: UMA, NUMA-cc, DSM MIMD machines – Message-passing systems: Multicomputers Synchronous vs. asynchronous message passing Pros and cons of the shared-memory and message-passing paradigms Amdahl’s law as applied to parallel processing CSE 586 Spring 00

3 Highlights from last week (c’ed)
SMP’s Cache coherence using snoopy protocols Write-update protocols (Dragon) Write-invalidate protocols (Illinois) Cache coherence misses Impact of capacity and block sizes Multilevel inclusion property CSE 586 Spring 00

4 Highlights from last week (c’ed)
Interconnection networks for tightly-coupled systems Centralized vs. decentralized switches Centralized switches Crossbar Perfect shuffle – Omega and Butterfly networks Decentralized switches Meshes and tori Performance metrics Bandwidth; Bisection bandwidth; latency Routing and flow control CSE 586 Spring 00

5 Cache Coherence in NUMA Machines
Snooping is not possible on media other than bus/ring Broadcast / multicast is not that easy In Multistage Interconnection Networks (MINs), potential for blocking is very large In mesh-like networks, broadcast to every node is very inefficient How to enforce cache coherence: Having no caches (Tera MTA) By software: disallow caching of shared variables (Cray T3D) By hardware: having a data structure (a directory) that records the state of each block CSE 586 Spring 00

6 Information Needed for Cache Coherence
What information should the directory contain? At the very least, whether a block is cached or not Whether the cache copy – or copies – is clean or dirty Where the copies of the block are Two possible organizations: a directory structure associated with the block in memory, or a linked list of all copies in the caches, including the one in memory CSE 586 Spring 00

7 Full Directory Full information associated with each block in memory
Entry in the directory: state vector associated with the block For an n-processor system, an (n+1)-bit vector Bit 0: clean/dirty Bits 1-n: “location” vector; bit i set if the ith cache has a copy Protocol is write-invalidate Memory overhead: for a 64-processor system, 65 bits per block If a block is 64 bytes, overhead = 65 / (64 * 8), i.e., over 10% This data structure is not scalable (but see later) CSE 586 Spring 00
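As an illustration (not from the slides), a full-directory entry for a 64-processor machine could be laid out as below; the type and field names are made up for this sketch.

#include <stdint.h>
#include <stdbool.h>

/* Full-directory entry for one memory block, 64-processor system:
   1 clean/dirty bit + 64 "location" bits = 65 bits of state per block. */
typedef struct {
    bool     dirty;      /* set => exactly one cache holds the only up-to-date copy */
    uint64_t presence;   /* bit i set => the ith cache has a copy                   */
} dir_entry_t;

/* For a 64-byte block: 65 / (64 * 8) = 65 / 512 bits, i.e. about 12.7% overhead. */
static inline bool is_cached(const dir_entry_t *e)        { return e->presence != 0; }
static inline bool has_copy (const dir_entry_t *e, int i) { return (e->presence >> i) & 1; }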

8 Home Node Definition Home node: the node that contains the initial value of the block as determined by its physical address The home node contains the directory entry for the block Remote node: any other node On a cache miss (read or write), the request for data will be sent to the home node If a block has to be evicted from a cache, and it is dirty, its value should be written back to the home node CSE 586 Spring 00

9 Basic Protocol – Read Miss on Uncached/clean Block
Cache i has a read miss on an uncached block (state vector full of 0’s) The home node responds with the data Add entry in directory (set clean and ith bit) Cache i has a read miss on a clean block (clean bit on; at least one of the other bits on) Add entry in directory (set ith bit) CSE 586 Spring 00

10 Basic Protocol – Read Miss on Dirty Block
Cache i has a read miss on a dirty block If dirty block is in home node, say node j (dirty and jth bits on) home node: Updates memory (write back from its own cache j) Changes the block encoding (dirty -> clean and set ith bit); Sends data to cache i (1-hop) If dirty block is not in home node but is in cache k (dirty and kth bits on), home node Asks cache k to send the block and updates memory Change entry in directory (dirty -> clean and set ith bit); Sends the data (2-hops) CSE 586 Spring 00
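A sketch of how the home node could handle the read miss above when the owner is a remote cache k, reusing the dir_entry_t sketch from earlier; fetch_block_from, write_back_to_memory and send_data_to are hypothetical messaging helpers, not any real machine's interface.

typedef struct { uint8_t bytes[64]; } block_t;      /* illustrative 64-byte block     */
block_t fetch_block_from(int cache, uint64_t addr); /* hypothetical helpers, declared */
void    write_back_to_memory(uint64_t addr, block_t data);
void    send_data_to(int cache, block_t data);

/* Home node: read miss from cache i on a block that is dirty in some cache k. */
void home_read_miss_dirty(dir_entry_t *e, int i, uint64_t addr) {
    int k = 0;
    while (!((e->presence >> k) & 1)) k++;          /* the single owner of the dirty copy */
    block_t data = fetch_block_from(k, addr);       /* ask cache k to send the block      */
    write_back_to_memory(addr, data);               /* memory is now up to date           */
    e->dirty = false;                               /* dirty -> clean                     */
    e->presence |= 1ULL << i;                       /* cache i now also has a copy        */
    send_data_to(i, data);                          /* 2 hops if k is not the home node   */
}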

11 Basic Protocol – Write Miss on Uncached/clean Block
Cache i has a write miss on an uncached block (state vector full of 0’s) The home node responds with the data Add entry in directory (set dirty and ith bits) Cache i has a write miss on a clean block (clean bit on; at least one of the other bits on) Home node sends an invalidate message to all caches whose bits are on in the state vector (this is a series of messages) Change entry in directory (clean -> dirty; clear the other location bits and set the ith bit) Note: the memory is not up-to-date CSE 586 Spring 00
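Similarly, a sketch of the write miss on a clean, shared block; send_invalidate and read_memory are again hypothetical helpers (send_data_to is reused from the sketch above), and collecting the invalidation acks is omitted.

void    send_invalidate(int cache, uint64_t addr);  /* hypothetical helpers, declared */
block_t read_memory(uint64_t addr);

/* Home node: write miss from cache i on a clean block with one or more sharers. */
void home_write_miss_clean(dir_entry_t *e, int i, uint64_t addr) {
    for (int j = 0; j < 64; j++)                    /* a series of messages, one per sharer */
        if (j != i && ((e->presence >> j) & 1))
            send_invalidate(j, addr);
    e->presence = 1ULL << i;                        /* cache i is now the only holder       */
    e->dirty = true;                                /* note: memory is no longer up to date */
    send_data_to(i, read_memory(addr));
}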

12 Basic Protocol – Write Miss on Dirty Block
Cache i has a write miss on a dirty block If dirty block is in home node, say node j (dirty and jth bits on) home node: Updates memory (write back from its own cache j) Changes the block encoding (clear jth bit and set ith bit); Sends data to cache i (1-hop) If dirty block is not in home node but is in cache k (dirty and kth bits on), home node Asks cache k to send the block and updates memory Change entry in directory (clear kth bit and set ith bit); Sends the data (2-hops) CSE 586 Spring 00

13 Basic Protocol – Request to Write a Clean Block
Cache i wants to write one of its blocks that is clean This implies that clean/dirty bits also exist in the cache metadata Proceed as in a write miss on a clean block, except that the memory does not have to send the data CSE 586 Spring 00

14 Basic Protocol - Replacing a Block
What happens when a block is replaced If dirty, it is of course written back and its state becomes a vector of 0’s If clean, one could either “do nothing” (but then the encoding is wrong, leading to possibly unneeded invalidations and acks) or send a message and modify the state vector accordingly (reset the corresponding bit) Acks are necessary to ensure correctness, mostly if messages can be delivered out of order CSE 586 Spring 00

15 The Most Economical (Memory-wise) Protocol
Recall the minimal number of states needed Not cached anywhere (i.e., valid in home memory) Cached in one or more caches but not modified (clean) Cached in one cache and modified (dirty) Simply encode the states (2-bit protocol) and perform broadcast invalidations (expensive because most often the data is not shared by many processors) Fourth state to enhance performance, say valid-exclusive: Cached in one cache only and still clean: no need to broadcast invalidations on a request to write a clean block but the cache has to know that it is in v-e state (metadata in the cache) CSE 586 Spring 00

16 2-bit Protocol Differences with full directory protocol
Of course no bit setting in “location” vector On a read miss to uncached block go to state valid-exclusive On “request to write a clean block” from a cache that has the block in valid-exclusive state, if the block is still in valid-exclusive state in the directory, no need to broadcast invalidations On a read miss to a valid-exclusive block, change state to clean On a write miss to clean block and to valid-exclusive block from another cache and read/write miss to dirty block, need to send a broadcast invalidate signal to all processors; in the case of dirty, the one with the copy of the block will send it back along with its ack. CSE 586 Spring 00
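For reference, the four directory states of the 2-bit protocol could be written as the enum below (a sketch; the names are illustrative, not the lecture's).

/* Directory state per block in the 2-bit protocol; there is no location
   vector, so invalidations must be broadcast to all caches. */
typedef enum {
    UNCACHED,          /* valid only in the home memory                 */
    VALID_EXCLUSIVE,   /* cached in exactly one cache, still clean      */
    CLEAN,             /* cached in one or more caches, not modified    */
    DIRTY              /* cached and modified in exactly one cache      */
} dir2_state_t;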

17 Need for Partial Directories
Full directory not scalable. Location vector depends on number of processors Might become too much memory overhead 2-bit protocol invalidations are costly Observation: Sharing is often limited to a small number of processors Instead of full directory, have room for a limited number of processor id’s. CSE 586 Spring 00

18 Examples of Partial Directories
Coarse bit-vector Share a “location” bit among 2 or 4 or 8 processors etc. Advantage: scalable since fixed amount of memory/block Dynamic pointer (many variations) Directory for a block has 1 bit for local cache, one or more fields for a limited number of other caches, and possibly a pointer to a linked list in memory for overflow. Need to “reclaim” pointers on clean replacements and/or to invalidate blindly if there is overflow Protocols are DiriB (i pointers and broadcast) or DiriNB (i pointers and No Broadcast, i.e., forced invalidations) CSE 586 Spring 00
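A sketch of the coarse bit-vector idea: one presence bit covers a fixed-size group of processors, keeping the entry size constant; GROUP_SIZE and the helper name are illustrative (reusing <stdint.h> from the earlier sketch).

#define GROUP_SIZE 8                                 /* one "location" bit shared by 8 processors */

static inline void coarse_mark(uint64_t *vec, int cpu) {
    *vec |= 1ULL << (cpu / GROUP_SIZE);              /* record the group, not the individual cpu  */
}
/* An invalidation for a marked group must go to all GROUP_SIZE processors in it,
   whether or not they actually cached the block. */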

19 Directories in the Cache -- The SCI Approach
Copies of blocks residing in various caches are linked via a doubly linked list Doubly linked so that it is easy to insert/delete Header in the block’s home Insertions “between” home node and new cache Economical in memory space Proportional to cache space rather than memory space Invalidations can be lengthy (list traversal) CSE 586 Spring 00

20 A Caveat about Cache Coherence Protocols
They are more complex in the details than they look! Snoopy protocols Writes are not atomic (first detect the write miss and send the request on the bus; then get the block and write the data -- only then should the block become dirty) The cache controller must implement “pending states” for situations in which more than one cache could write data in a block, or replace a dirty block, i.e., write to memory Things become more complex for split-transaction buses Things become even more complex for lock-up free caches (but it’s manageable) CSE 586 Spring 00

21 Subtleties in Directory Protocols
No transaction is atomic. If they were treated as atomic, deadlock could occur Assume block A from home node X is dirty in P1 Assume block B from home node Y is dirty in P2 P1 read-misses on B and P2 read-misses on A Home node Y generates a “purge” for B in P2 and home node X generates a “purge” for A in P1 Both P1 and P2 wait for their read misses and cannot answer the home node purges, hence deadlock. So assume non-atomicity of transactions and allow only one in-flight transaction per block (nak any other while one is in progress) CSE 586 Spring 00

22 Problems with Buffering
Directory and cache controllers might have to send/receive many messages at the same time Protocols must take into account the finite amount of buffering This leads to the possibility of deadlocks This is even more important for the 2-bit protocol with its many broadcasts Solutions involve one or more of the following: separate networks for requests and replies, so that requests don’t block the replies that free buffer space; each request reserves buffer room for its reply; use of naks and retries CSE 586 Spring 00

23 COMA – Cache Only Memory Architecture
Replace memory modules by cache-like structures (attraction memories) Costly since there is a need for tags and state per block Data migration, replication, replacement etc. are all driven by hardware Pros: no need to “write back” a replaced block if it exists somewhere else; data migrates naturally towards the processor that needs it Cons: need to know whether there exists another copy of a block before replacing it; what to do if it is the last copy and its place in an attraction memory is to be taken by another block? CSE 586 Spring 00

24 COMA Implementations Commercial: KSR Research machine
Interconnection: hierarchy of rings. This allows broadcasts and thus facilitates finding blocks Research machine: DDM (Data Diffusion Machine). Tree interconnect with directories at each node of the tree. Variations (still some research on “Efficient COMA”) Flat COMA: fixed home for the directory Summary: elegant solution but cost/performance not good enough CSE 586 Spring 00

25 Some Recent Medium-scale NUMA Multiprocessors (research machines)
DASH (Stanford) multiprocessor. “Cluster” = 4 processors on a shared-bus with a shared L2 Directory cache coherence on a cluster basis Clusters (up to 16) connected through 2 2D-meshes (one for sending messages, one for acks) Alewife (MIT) Dynamic pointer allocation directory (5 pointers) On “overflow”, software takes over Multithreaded. (Fast) Context-switch on a cache miss to a remote node FLASH (Stanford) Use of a programmable protocol processor. Can implement different protocols (including message passing) depending on the application CSE 586 Spring 00

26 Some Recent Medium-scale NUMA Multiprocessors (commercial machines)
SGI Origin (follow-up on DASH) 2 processors/cluster Full directory Hypercube topology up to 32 processors (16 nodes) Then “fat hypercube” with a metarouter (up to 256 processors): vertices of hypercubes connected to switches in the metarouter Sequent NUMA-Q SMP clusters of 4 processors + shared “remote” cache (caches only data not homed in the cluster) Clusters connected in a ring SCI cache coherence via the remote caches CSE 586 Spring 00

27 Extending the range of SMP’s – Sun’s Starfire
Use snooping buses (4 of them) for transmitting requests and addresses One bus for each quarter of the physical address space Up to 16 clusters of 4 processor/memory modules each Data is transmitted via a 16 x 16 crossbar between clusters “Analysis” shows that up to 12 clusters the limitation is on the data part; after that it is on the snooping buses CSE 586 Spring 00

28 Multiprogramming and Multiprocessing Imply Synchronization
Locking Critical sections Mutual exclusion Used for exclusive access to shared resource or shared data for some period of time Efficient update of a shared (work) queue Barriers Process synchronization -- All processes must reach the barrier before any one can proceed (e.g., end of a parallel loop). CSE 586 Spring 00

29 Locking Typical use of a lock:
while (!acquire(lock)) ; /* spin */
/* some computation on shared data */
release(lock)
Acquire based on a Read-Modify-Write primitive Basic principle: “Atomic exchange” Test-and-set Fetch-and-add CSE 586 Spring 00

30 Test-and-set Lock is stored in a memory location that contains 0 or 1
Test-and-set (attempt to acquire) writes a 1 and returns the previous value in memory If that value is 0, the process gets the lock; if the value is 1, another process has the lock. To release, just clear (set to 0) the memory location. CSE 586 Spring 00
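A minimal test-and-set spin lock sketched with C11 atomics (an analogy to the primitive described above, not the lecture's own code):

#include <stdatomic.h>

typedef atomic_flag tas_lock_t;                      /* clear = 0 = free, set = 1 = held        */
/* initialize with: tas_lock_t lock = ATOMIC_FLAG_INIT; */

void tas_acquire(tas_lock_t *l) {
    while (atomic_flag_test_and_set(l))              /* atomically write 1, return the old value */
        ;                                            /* old value was 1: someone holds it, spin  */
}
void tas_release(tas_lock_t *l) {
    atomic_flag_clear(l);                            /* just clear (set to 0) the location       */
}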

31 Atomic Exchanges Test-and-set is one form of atomic exchange
Atomic-swap is a generalization of Test-and-set that allows values besides 0 and 1 Compare-and-swap is a further generalization: the value in memory is not changed unless it is equal to the test value supplied CSE 586 Spring 00
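In C11 terms (reusing <stdatomic.h> from the sketch above), compare-and-swap is directly available: the store happens only when the current value equals the supplied test value. A sketch:

/* Returns true and stores 'desired' only if *loc currently equals 'expected'. */
bool cas_int(atomic_int *loc, int expected, int desired) {
    return atomic_compare_exchange_strong(loc, &expected, desired);
}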

32 Fetch-and-Θ Generic name for fetch-and-add, fetch-and-store etc.
Can be used as test-and-set (since it is an atomic exchange) but is more general. Will be used for barriers Introduced by the designers of the NYU Ultracomputer, where the interconnection network allowed combining: if two fetch-and-adds have the same destination, they can be combined. However, they have to be forked on the return path CSE 586 Spring 00

33 Full/Empty Bits Based on producer-consumer paradigm
Each memory location has a synchronization bit associated with it Bit = 0 indicates the value has not been produced (empty) Bit = 1 indicates the value has been produced (full) A write stalls until the bit is empty (0). After the write the bit is set to full (1). A read stalls until the bit is full and then empties it. Not all load/store instructions need to test the bit; only those needed for synchronization (special opcode) First implemented in the HEP and now in the Tera MTA. CSE 586 Spring 00

34 Faking Atomicity Instead of atomic exchange, have an instruction pair that can be deduced to have operated in an atomic fashion Load locked (ll) + Store conditional (sc) (Alpha) sc detects if the value of the memory location loaded by ll has been modified. If so, it returns 0 (locking fails); otherwise 1 (locking succeeds) Similar to atomic exchange but does not require read-modify-write Implementation Use a special register (link register) to store the address of the memory location addressed by ll. On a context-switch, interrupt, or invalidation of the block corresponding to that address (by another sc), the register is cleared. If, on the sc, the addresses match, the sc succeeds CSE 586 Spring 00

35 Using ll-sc to Implement Test-and-Set
try:  li   R1, 1        # set R1 to 1
      ll   R2, 0(R3)    # set R2 with the value in memory whose address (in R3) is put in the link register
      sc   R1, 0(R3)    # R1 = 1 if the address in the link register has not changed, otherwise R1 = 0
      beqz R1, try      # "test-and-set" has failed: some exception occurred or another processor modified the lock between the ll and the sc
Now test R2 to see if the lock has been obtained (the lock was free if R2 is 0)…. CSE 586 Spring 00

36 Spin Locks Repeatedly: try to acquire the lock
Test-and-Set in a cache coherent environment (invalidation-based): Bus utilized during the whole read-modify-write cycle Since test-and-set writes a location in memory, need to send an invalidate (even if the lock is not acquired) In general loop to test the lock is short, so lots of bus contention Possibility of “exponential back-off” (like in Ethernet protocol to avoid too many collisions) CSE 586 Spring 00

37 Test and Test-and-Set Replace “test-and-set” with “test and test-and-set”. Keep the test (read) local to the cache. First test in the cache (non-atomic). If the lock cannot be acquired, repeatedly test in the cache (no bus transaction) On lock release (write 0 in the memory location) all other cached copies of the lock are invalidated. Still a race condition for acquiring a lock that has just been released (O(n**2) bus transactions for n contending processes). Can use ll+sc but still a race condition when the lock is released CSE 586 Spring 00
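A test-and-test-and-set acquire sketched with C11 atomics (illustrative; the lecture gives no code for this):

/* Spin on an ordinary cached read; only attempt the atomic exchange
   (the bus transaction) when the lock looks free. */
void ttas_acquire(atomic_int *lock) {                /* 0 = free, 1 = held */
    for (;;) {
        while (atomic_load(lock) == 1)
            ;                                        /* local spin in the cache, no bus traffic */
        if (atomic_exchange(lock, 1) == 0)
            return;                                  /* we won the race for the released lock   */
        /* otherwise another processor got it first: go back to local spinning */
    }
}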

38 Queuing Locks Basic idea: a queue of waiting processors is maintained in shared-memory for each lock (best for bus-based machines) Each processor performs an atomic operation to obtain a memory location (element of an array) on which to spin Upon a release, the lock can be directly handed off to the next waiting processor CSE 586 Spring 00

39 Software Implementation
lock struct {int Queue[P]; int Queuelast;} /* for P processors */
ACQUIRE:
  myplace := fetch-and-add(lock->Queuelast);
  while (lock->Queue[myplace mod P] == 1) ; /* spin */
  lock->Queue[myplace mod P] := 1;
RELEASE:
  lock->Queue[(myplace + 1) mod P] := 0;
The Release should invalidate the cached value in the next processor, which can then fetch the new value stored in the array. CSE 586 Spring 00

40 Queuing Locks (hardware implementation)
Can be done several ways via directory controllers Associate a syncbit (aka full/empty bit) with each block in memory (a single lock will be in that block) Test-and-set the syncbit for acquiring the lock Unset it to release Special operation (QOLB): a non-blocking operation that enqueues the processor for that lock if it is not already in the queue. Can be done in advance, like a prefetch operation. Have to be careful if the process is context-switched (possibility of deadlocks) CSE 586 Spring 00

41 Barriers All processes have to wait at a synchronization point
End of parallel do loops Processes don’t progress until they all reach the barrier Low-performance implementation: use a counter initialized with the number of processes When a process reaches the barrier, it decrements the counter (atomically -- fetch-and-add(-1)) and busy waits When the counter is zero, all processes are allowed to progress (broadcast) Lots of possible optimizations (tree, butterfly, etc.) Is it important? Barriers do not occur that often (Amdahl’s law….) CSE 586 Spring 00
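The low-performance counter barrier described above, sketched with C11 atomics; the sense-reversal flag is an addition beyond the slide that lets the same barrier be reused across consecutive parallel loops.

typedef struct { atomic_int count; atomic_int sense; int n; } barrier_t;
/* initialize: count = n, sense = 0; each process keeps a local_sense starting at 0 */

void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;                    /* the sense this episode will release on  */
    if (atomic_fetch_sub(&b->count, 1) == 1) {       /* atomic decrement; am I the last one?    */
        atomic_store(&b->count, b->n);               /* reset the counter for the next barrier  */
        atomic_store(&b->sense, *local_sense);       /* "broadcast": let everyone proceed       */
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;                                        /* busy wait until released                */
    }
}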

42 A Primer on Memory Consistency
The (parallel) programmer model is sequential consistency Result of any execution is the same as if each process accessed memory in program order and processes were interleaved on a single processor
P1:  Write (A) ;  flag = 1 ;
P2:  repeat (noop) until (flag == 1) ;  Read (A) ;
If the write to A takes “longer” from P2’s view than the write to flag (e.g., because of invalidation delays, or A is cached and flag is not) the system is not sequentially consistent CSE 586 Spring 00

43 A (slightly) More Subtle Example
Initially X and Y are 0
P1:  X = 1 ;  If (Y == 0) Kill P2
P2:  Y = 1 ;  If (X == 0) Kill P1
Clearly the intent is to kill at most one of P1 and P2. But if X and Y are put in write buffers and reads are allowed to pass writes, both P1 and P2 could be killed CSE 586 Spring 00
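The same example written with C11 atomics (a sketch): with the default sequentially consistent ordering at most one of the two loads can see 0, while relaxed ordering, which models a write buffer that lets reads pass writes, allows both to see 0.

atomic_int X, Y;                                     /* both initially 0 */

void p1(void) {
    atomic_store(&X, 1);                             /* sequentially consistent by default */
    if (atomic_load(&Y) == 0) { /* kill P2 */ }
}
void p2(void) {
    atomic_store(&Y, 1);
    if (atomic_load(&X) == 0) { /* kill P1 */ }
}
/* With memory_order_relaxed on these stores and loads, both "if" bodies may run. */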

44 Models of Memory Consistency
Sequential consistency imposes a sequential (total) order on memory operations Could lead to huge losses in performance Instead, give a programming model where all possible shared-data race conditions are resolved explicitly via locking Models of consistency then become models of when locked data can be accessed and released Requires fences, i.e., points in the program where memory operations have to be completed before the process can continue CSE 586 Spring 00

45 Processor Consistency
[Diagram: ordering of a Load with respect to an earlier Store. Under processor consistency, loads can bypass (complete before) earlier stores; under sequential consistency, loads and stores complete in program order.] CSE 586 Spring 00

46 Weak Ordering Accesses to global synchronizing variables are totally ordered (i.e., sequentially consistent) No access to a synchronizing variable is issued by a processor before all previous global data accesses have been “performed” A load is performed when the value to be loaded has been set and cannot be changed A store is performed when the value stored by the processor can be seen by all other processors No access to global data is issued by a processor before a previous access to a synchronizing variable has been “performed” CSE 586 Spring 00

47 Weak Ordering (c’ed)
[Diagram: blocks of loads/stores, executable in any order, separated by accesses to synchronizing variables (Acquire, Release); all prior data accesses must be performed before the synchronizing access, and no later data access is issued before it has been performed.] CSE 586 Spring 00

48 Release Consistency Can go even further by relaxing the constraints on what happens on “acquire” and “release”, the two types of accesses to synchronizing variables. In order to “acquire” there is no need for all ordinary memory operations on the same processor to be completed Ordinary memory operations following in program order a “release” do not have to wait for the release to be completed CSE 586 Spring 00

49 Release Consistency (c’ed)
[Diagram: two successive critical sections, each an Acquire followed by loads/stores in any order followed by a Release; ordinary accesses are ordered only with respect to the enclosing Acquire and Release as described above, so successive regions can overlap.] CSE 586 Spring 00
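Release consistency maps closely onto C11 acquire/release operations; the sketch below is an analogy in today's terms, not the lecture's formal definition: ordinary accesses may move into the Acquire…Release region but not out of it.

atomic_int lockvar;                                  /* 0 = free, 1 = held                     */
int shared_counter;                                  /* ordinary (non-synchronizing) data      */

void critical_update(void) {
    while (atomic_exchange_explicit(&lockvar, 1, memory_order_acquire))
        ;                                            /* Acquire: need not wait for earlier ordinary accesses   */
    shared_counter++;                                /* ordinary accesses, freely reordered inside the region  */
    atomic_store_explicit(&lockvar, 0, memory_order_release);
                                                     /* Release: later ordinary accesses need not wait for it  */
}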

