1 Lecture 8: Snooping and Directory Protocols Topics: 4/5-state snooping protocols, split-transaction implementation details, directory implementations.

Presentation transcript:

1 Lecture 8: Snooping and Directory Protocols
Topics: 4/5-state snooping protocols, split-transaction implementation details, directory implementations (memory- and cache-based)

2 Reporting Snoop Results
- In a multiprocessor, memory has to wait for the snoop result before it chooses to respond. This needs 3 wired-OR signals: (i) indicates that a cache has a copy, (ii) indicates that a cache has a modified copy, (iii) indicates that the snoop has not completed
- Ensuring timely snoops: the time to respond could be fixed or variable (with the third wired-OR signal)
- Tags are usually duplicated if they are frequently accessed by the processor (regular ld/sts) and the bus (snoops)
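The three wired-OR signals can be illustrated with a small Python sketch. This is only an illustration under assumptions: the names `SnoopReply`, `combine_snoops`, and `memory_should_respond` are hypothetical, and the rule that memory stays silent when some cache holds a modified copy is the usual convention rather than something the slide spells out.

```python
from dataclasses import dataclass


@dataclass
class SnoopReply:
    has_copy: bool      # signal (i): this cache holds a valid copy
    has_modified: bool  # signal (ii): this cache holds a modified copy
    still_busy: bool    # signal (iii): this cache has not finished its snoop


def combine_snoops(replies):
    """OR each signal across all caches, as a wired-OR bus line would."""
    shared = any(r.has_copy for r in replies)
    dirty = any(r.has_modified for r in replies)
    inhibit = any(r.still_busy for r in replies)
    return shared, dirty, inhibit


def memory_should_respond(replies):
    """Return None while any snoop is pending, else True or False.

    Assumption: memory supplies the data only if no cache has a
    modified copy.
    """
    _shared, dirty, inhibit = combine_snoops(replies)
    if inhibit:
        return None  # variable-latency case: keep waiting
    return not dirty
```

With a fixed snoop latency the third signal is unnecessary, which is why the sketch treats it as a separate "keep waiting" result.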

3 4 and 5 State Protocols
- Multiprocessors execute many single-threaded programs
- A read followed by a write will generate bus transactions to acquire the block in exclusive state even though there are no sharers (leads to MESI protocol)
- Also, to promote cache-to-cache sharing, a cache must be designated as the responder (leads to MOESI protocol)
- Note that we can optimize protocols by adding more states, but this increases design/verification complexity

4 MESI Protocol
- The new state is exclusive-clean: the cache can service read requests and no other cache has the same block
- When the processor attempts a write, the block is upgraded to exclusive-modified without generating a bus transaction
- When a processor makes a read request, it must detect if it has the only cached copy: the interconnect must include an additional signal that is asserted by each cache if it has a valid copy of the block
- When a block is evicted from one cache, the copy left in another cache may in effect be exclusive-clean, but that cache will not realize it (it stays in shared state)
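The MESI transitions described above can be sketched as a few Python functions. This is a simplified sketch, not a full protocol: the bus operation names `BusRd`, `BusRdX`, and `BusUpgr` are conventional textbook names assumed here, the function names are hypothetical, and data flushes from a modified copy are omitted.

```python
def on_processor_read(state, others_have_copy):
    """Return (new_state, bus_op) for a local load."""
    if state == "I":
        # The extra "shared" bus signal tells the reader whether it
        # holds the only cached copy.
        return ("S" if others_have_copy else "E", "BusRd")
    return (state, None)  # M, E, S all hit without a bus transaction


def on_processor_write(state):
    """Return (new_state, bus_op) for a local store."""
    if state in ("M", "E"):
        return ("M", None)       # E -> M is silent: no bus transaction
    if state == "S":
        return ("M", "BusUpgr")  # must invalidate the other sharers
    return ("M", "BusRdX")       # I: fetch the block with intent to write


def on_snoop(state, bus_op):
    """Return the new state after observing another cache's bus operation."""
    if state == "I":
        return "I"
    if bus_op == "BusRd":
        return "S"  # M and E lose exclusivity (flush from M not modeled)
    return "I"      # BusRdX or BusUpgr invalidates this copy
```

For a read followed by a write with no sharers, the block goes I, E, M and only the initial `BusRd` appears on the bus.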

5 MOESI Protocol
- The first reader or the last writer is usually designated as the owner of a block
- The owner is responsible for responding to requests from other caches
- There is no need to update memory when a block transitions from M → S state
- The cache holding the block in O state is responsible for writing back the dirty block when it is evicted

6 Split Transaction Bus
- So far, we have assumed that a coherence operation (request, snoops, responses, update) happens atomically
- What would it take to implement the protocol correctly while assuming a split transaction bus?
- Split transaction bus: a cache puts out a request, releases the bus (so others can use the bus), receives its response much later
- Assumptions:
  - only one request per block can be outstanding
  - separate lines for addr (request) and data (response)

7 Split Transaction Bus
[Figure: Proc 1, Proc 2, and Proc 3, each with a cache and a buffer (Buf), attached to shared request lines and response lines]

8 Design Issues
- Could be a 3-stage pipeline: request/snoop/response or (much simpler) 2-stage pipeline: request-snoop/response (note that the response is slowest and needs to be hidden)
- Buffers track the outstanding transactions; buffers are identical in each core; an entry is freed when the response is seen; the next operation uses any free entry; every bus operation carries the buffer entry number as a tag
- Must check the buffer before broadcasting a new operation; must ensure only one outstanding operation per block
- What determines the write order: requests or responses?
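The buffer bookkeeping above can be sketched as a small request table in Python. The class name `RequestTable`, its methods, and the table size are hypothetical choices for illustration; a real design keeps an identical copy of this table in every core and updates it by watching the bus.

```python
class RequestTable:
    """Tracks outstanding split-transaction requests by entry number (tag)."""

    def __init__(self, num_entries=8):
        # Each entry holds the block address of an outstanding request,
        # or None when the entry is free.
        self.entries = [None] * num_entries

    def allocate(self, block_addr):
        """Return a tag for a new request, or None if it must wait."""
        if block_addr in self.entries:
            return None  # only one outstanding operation per block
        for tag, addr in enumerate(self.entries):
            if addr is None:
                self.entries[tag] = block_addr
                return tag  # the bus operation carries this tag
        return None  # table full: no free entry

    def complete(self, tag):
        """Free the entry when the response carrying this tag is seen."""
        block_addr = self.entries[tag]
        self.entries[tag] = None
        return block_addr
```

Checking the table before broadcasting is what enforces the one-request-per-block assumption; a requester that gets `None` has to retry later.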

9 Design Issues II
- What happens if processor-A is arbitrating for the bus and witnesses another bus transaction for the same address or same buffer entry?
- What if processor-A was trying to do an upgrade?
- What if processor-A was trying to do a read and there is already a matching read in the request table?
- Processor-cache handshake: after acquiring the block in exclusive state, the processor must complete the write before handing the block to other writers; otherwise, there is a livelock

10 Directory-Based Protocol
- For each block, there is a centralized "directory" that maintains the state of the block in different caches
- The directory is co-located with the corresponding memory
- Requests and replies on the interconnect are no longer seen by everyone, so the directory serializes writes
[Figure: two nodes, each with a processor (P), cache (C), memory (Mem), communication assist (CA), and directory (Dir)]

11 Definitions
- Home node: the node that stores memory and directory state for the cache block in question
- Dirty node: the node that has a cache copy in modified state
- Owner node: the node responsible for supplying data (usually either the home or dirty node)
- Also, exclusive node, local node, requesting node, etc.
[Figure: two nodes, each with a processor (P), cache (C), memory (Mem), communication assist (CA), and directory (Dir)]

12 Directory Organizations
- Centralized Directory: one fixed location, which is a bottleneck
- Flat Directories: directory info is in a fixed place, determined by examining the address; can be further categorized as memory-based or cache-based
- Hierarchical Directories: the processors are organized as a logical tree structure and each parent keeps track of which of its immediate children has a copy of the block; more searching, but can exploit locality


14 Flat Memory-Based Directories
- Directory is associated with memory and stores info for all cached copies
- A presence vector stores a bit for every processor, for every memory block; the overhead is a function of memory/block size and #processors
- Reducing directory overhead:
  - Width: pointers (keep track of processor ids of sharers) (need overflow strategy), organize processors into clusters
  - Height: increase block size, track info only for blocks that are cached (note: cache size << memory size)
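A full presence vector can be sketched in Python as one integer bitmask per memory block. The class `DirectoryEntry` and its method names are hypothetical, and the sketch assumes the simplest case: one bit per node and an invalidate-on-write policy.

```python
class DirectoryEntry:
    """Directory state for one memory block, using a full presence vector."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.presence = 0   # bit n is set if node n has a cached copy
        self.dirty = False  # True if exactly one node holds a modified copy

    def add_sharer(self, node):
        self.presence |= 1 << node

    def sharers(self):
        return [n for n in range(self.num_nodes) if (self.presence >> n) & 1]

    def handle_write(self, writer):
        """Record a write by `writer`; return the nodes to invalidate."""
        to_invalidate = [n for n in self.sharers() if n != writer]
        self.presence = 1 << writer
        self.dirty = True
        return to_invalidate
```

The storage cost is `num_nodes` bits per block, which is the "width" that limited pointers or clustering try to reduce.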

15 Flat Cache-Based Directories
- The directory at the memory home node only stores a pointer to the first cached copy; the caches store pointers to the next and previous sharers (a doubly linked list)
[Figure: main memory pointing to a chain of sharers: Cache 7, Cache 3, Cache 26]

16 Flat Cache-Based Directories
- The directory at the memory home node only stores a pointer to the first cached copy; the caches store pointers to the next and previous sharers (a doubly linked list)
- Potentially lower storage, no bottleneck for network traffic
- Invalidates are now serialized (takes longer to acquire exclusive access), replacements must update linked list, must handle race conditions while updating list
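The sharing list can be sketched in Python with the head pointer at the home node and next/previous pointers kept per cache. The class `SharingList` is hypothetical, and inserting new sharers at the head is an assumption made for the sketch. Race handling is not modeled.

```python
class SharingList:
    """Doubly linked list of sharers for one block."""

    def __init__(self):
        self.head = None  # pointer stored in the home node's directory
        self.next = {}    # forward pointer stored in each sharing cache
        self.prev = {}    # backward pointer stored in each sharing cache

    def add_sharer(self, cache):
        """Insert a new sharer at the head of the list."""
        old_head = self.head
        self.next[cache] = old_head
        self.prev[cache] = None
        if old_head is not None:
            self.prev[old_head] = cache
        self.head = cache

    def remove_sharer(self, cache):
        """Unlink a cache on replacement; its neighbors must be updated."""
        p = self.prev.pop(cache)
        n = self.next.pop(cache)
        if p is None:
            self.head = n
        else:
            self.next[p] = n
        if n is not None:
            self.prev[n] = p

    def invalidate_all(self):
        """Walk the list one sharer at a time; return the visit order."""
        order = []
        cache = self.head
        while cache is not None:
            order.append(cache)
            cache = self.next[cache]
        self.head = None
        self.next.clear()
        self.prev.clear()
        return order
```

The loop in `invalidate_all` shows why invalidates are serialized: each sharer is only reachable through the pointer held by the one before it.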

17 Flat Memory-Based Directories
[Figure: main memory and Cache 1, Cache 2, ..., Cache 64]
- Block size = 128 B
- Memory in each node = 1 GB
- Cache in each node = 1 MB
- For 64 nodes and 64-bit directory, directory size = 4 GB
- For 64 nodes and 12-bit directory, directory size = 0.75 GB

18 Flat Cache-Based Directories
[Figure: main memory pointing to a chain of sharers: Cache 7, Cache 3, Cache 26]
- Block size = 128 B
- Memory in each node = 1 GB
- Cache in each node = 1 MB
- 6-bit storage in DRAM for each block; DRAM overhead = 0.375 GB
- 12-bit storage in SRAM for each block; SRAM overhead = 0.75 MB

19 Flat Memory-Based Directories
[Figure: L2 cache and L1 Cache 1, L1 Cache 2, ..., L1 Cache 64]
- Block size = 64 B
- L3 cache in each node = 2 MB
- L2 cache in each node = 256 KB
- For 64 nodes and 64-bit directory, directory size = 16 MB
- For 64 nodes and 12-bit directory, directory size = 3 MB

20 Flat Cache-Based Directories
[Figure: main memory pointing to a chain of sharers: Cache 7, Cache 3, Cache 26]
- Block size = 64 B
- L3 cache in each node = 2 MB
- L2 cache in each node = 256 KB
- 6-bit storage in L3 for each block; L3 overhead = 1.5 MB
- 12-bit storage in L2 for each block; L2 overhead = 384 KB
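The directory sizes quoted in these examples can be reproduced with a short calculation. The Python sketch below assumes the totals are summed over all 64 nodes and that each tracked block carries a fixed number of directory bits; the function name `directory_bytes` is hypothetical.

```python
KB, MB, GB = 2**10, 2**20, 2**30


def directory_bytes(capacity_bytes, block_bytes, bits_per_block, num_nodes):
    """Total directory storage, in bytes, summed over all nodes.

    capacity_bytes is the per-node size of the structure whose blocks
    carry directory bits (memory, or a cache level).
    """
    blocks_per_node = capacity_bytes // block_bytes
    return blocks_per_node * bits_per_block * num_nodes / 8


# Main-memory example: 128 B blocks, 1 GB memory and 1 MB cache per node.
mem_full = directory_bytes(1 * GB, 128, 64, 64)     # 64-bit entry per memory block
mem_small = directory_bytes(1 * GB, 128, 12, 64)    # 12-bit entry per memory block
dram_head = directory_bytes(1 * GB, 128, 6, 64)     # 6-bit head pointer in DRAM
sram_links = directory_bytes(1 * MB, 128, 12, 64)   # 12 bits per cache block in SRAM

# Cache-hierarchy example: 64 B blocks, 2 MB L3 and 256 KB L2 per node.
l3_full = directory_bytes(2 * MB, 64, 64, 64)
l3_head = directory_bytes(2 * MB, 64, 6, 64)
l2_links = directory_bytes(256 * KB, 64, 12, 64)

print(mem_full / GB, mem_small / GB, dram_head / GB, sram_links / MB)
print(l3_full / MB, l3_head / MB, l2_links / KB)
```

The cache-based scheme moves most of the per-block state into the caches, so its large-structure overhead depends on a 6-bit pointer rather than a full vector.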

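21 SGI Origin 2000
- Flat memory-based directory protocol
- Uses a bit vector directory representation
- Two processors per node; combining multiple processors in a node reduces cost
[Figure: a node with two processors (P), each with an L2 cache, sharing a communication assist (CA) and memory/directory (M/D), connected to the interconnect]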

22 Directory Structure
- The system supports either a 16-bit or 64-bit directory (fixed cost); for small systems, the directory works as a full bit vector representation
- Seven states, of which 3 are stable
- For larger systems, a coarse vector is employed: each bit represents p/64 nodes
- State is maintained for each node, not each processor; the communication assist broadcasts requests to both processors
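The coarse-vector idea can be sketched in Python as a mapping between node numbers and directory bits. This is an illustration only: it assumes p is the number of nodes, that p is a multiple of the vector width, and that nodes are grouped contiguously; the function names are hypothetical and this is not the Origin's actual node-to-bit mapping.

```python
def coarse_bit(node, num_nodes, vector_bits=64):
    """Return the directory bit that covers `node`."""
    group_size = max(1, num_nodes // vector_bits)  # p/64 nodes per bit
    return node // group_size


def nodes_for_bit(bit, num_nodes, vector_bits=64):
    """Return every node that must be sent an invalidate when `bit` is set."""
    group_size = max(1, num_nodes // vector_bits)
    start = bit * group_size
    return list(range(start, min(start + group_size, num_nodes)))
```

With 256 nodes and a 64-bit vector, each bit covers 4 nodes, so a write may send invalidates to nodes that never cached the block. That is the price of the fixed directory width.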
