1 Lecture 24: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence.

Slides:



Advertisements
Similar presentations
1 Lecture 18: Transactional Memories II Papers: LogTM: Log-Based Transactional Memory, HPCA’06, Wisconsin LogTM-SE: Decoupling Hardware Transactional Memory.
Advertisements

1 Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations (wrap-up of SGI Origin and Sequent NUMA case study)
1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04 Selective, Accurate,
Lecture 11 Recoverability. 2 Serializability identifies schedules that maintain database consistency, assuming no transaction fails. Could also examine.
1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.
CS 7810 Lecture 19 Coherence Decoupling: Making Use of Incoherence J.Huh, J. Chang, D. Burger, G. Sohi Proceedings of ASPLOS-XI October 2004.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
1 Lecture 4: Directory-Based Coherence Details of memory-based (SGI Origin) and cache-based (Sequent NUMA-Q) directory protocols.
1 Lecture 22: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence.
1 Lecture 7: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
1 Lecture 24: Transactional Memory Topics: transactional memory implementations.
1 Lecture 6: TM – Eager Implementations Topics: Eager conflict detection (LogTM), TM pathologies.
1 Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations.
(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.
1 Lecture 9: TM Implementations Topics: wrap-up of “lazy” implementation (TCC), eager implementation (LogTM)
1 Lecture 3: Directory-Based Coherence Basic operations, memory-based and cache-based directories.
1 Lecture 10: TM Implementations Topics: wrap-up of eager implementation (LogTM), scalable lazy implementation.
1 Lecture 20: Protocols and Synchronization Topics: distributed shared-memory multiprocessors, synchronization (Sections )
Error Checking continued. Network Layers in Action Each layer in the OSI Model will add header information that pertains to that specific protocol. On.
File System. NET+OS 6 File System Architecture Design Goals File System Layer Design Storage Services Layer Design RAM Services Layer Design Flash Services.
Lecture 7: PCM, Cache coherence
Dynamic Verification of Cache Coherence Protocols Jason F. Cantin Mikko H. Lipasti James E. Smith.
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
SafetyNet Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.
1 Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
CS603 Fault Tolerance - Communication April 17, 2002.
Building Dependable Distributed Systems, Copyright Wenbing Zhao
1 Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed shared-memory multiprocessors and synchronization (Sections )
1 Lecture 3: Coherence Protocols Topics: consistency models, coherence protocol examples.
1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04.
1 Lecture 7: PCM Wrap-Up, Cache coherence Topics: handling PCM errors and writes, cache coherence intro.
Mutual Exclusion Algorithms. Topics r Defining mutual exclusion r A centralized approach r A distributed approach r An approach assuming an organization.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Cache Coherence Constructive Computer Architecture Arvind
Lecture 20: Consistency Models, TM
The University of Adelaide, School of Computer Science
Lecture 11: Consistency Models
Lecture 18: Coherence and Synchronization
Ivy Eva Wu.
Lecture 9: Directory-Based Examples II
Lecture 5: Snooping Protocol Design Issues
Lecture 17: Transactional Memories I
Lecture 8: Directory-Based Cache Coherence
Improving Multiple-CMP Systems with Token Coherence
Lecture 7: Directory-Based Cache Coherence
Lecture 22: Consistency Models, TM
Chapter 5 Exploiting Memory Hierarchy : Cache Memory in CMP
EEC 688/788 Secure and Dependable Computing
Lecture 25: Multiprocessors
Lecture 9: Directory-Based Examples
Lecture 10: Consistency Models
Lecture 8: Directory-Based Examples
Lecture 25: Multiprocessors
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Lecture 24: Multiprocessors
Lecture: Coherence Topics: wrap-up of snooping-based coherence,
Distributed Resource Management: Distributed Shared Memory
Lecture 17 Multiprocessors and Thread-Level Parallelism
Lecture 23: Transactional Memory
Lecture: Coherence and Synchronization
Lecture: Consistency Models, TM
Lecture 19: Coherence and Synchronization
Lecture 18: Coherence and Synchronization
Lecture 10: Directory-Based Examples II
Error Checking continued
University of Wisconsin-Madison Presented by: Nick Kirchem
Lecture 11: Consistency Models
Presentation transcript:

1 Lecture 24: Fault Tolerance Papers: Token Coherence: Decoupling Performance and Correctness, ISCA’03, Wisconsin A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures, HPCA’07, Spain Error Detection Via Online Checking of Cache Coherence with Token Coherence Signatures, HPCA’07, Duke

2 Faults Faults can be permanent or transient Transient faults are typically caused by high-energy particles and noise in voltage levels Faults lead to two main errors: silent data corruption, detected unrecoverable error A coherence protocol can yield errors in many ways:  The delivered message contains corrupted data  A message is never delivered  The coherence controller computes incorrectly  Corrupted state/data in caches  Human design error

3 Token Coherence Each memory block has N tokens A cache may read the contents of a block if it has at least one token A cache may write to a block only if it has all N tokens Simplifies the design of a correct protocol: the above abstraction makes it easier to reason about correctness; much easier than reasoning about every corner case as we did for various conventional protocols Some inefficiencies: can’t silently evict the block – must send the token to the memory controller first

4 Token Coherence Invariants Each block has T tokens, of which one is the owner token A processor can write only if it has all T tokens A processor can read a block only if it has at least one token and valid data If a coherence message contains the owner token, it also contains valid data

5 Assumes an underlying token coherence protocol Fault model: only handles faults in the interconnection network; the faults manifest as dropped messages (either the message never arrives or it is discarded because CRC indicates corrupted data) Fault Tolerant Protocol (Paper #1)

6 Potential Problems Loss of a token-less message (invalidation) will eventually cause a timeout and the invocation of a persistent request Loss of a message with a token will eventually cause a deadlock when the next writer attempts to collect all tokens Loss of a message that contains the owner token and data will end up deadlocking and causing loss of valid data

7 Timeouts The token coherence protocol includes a timeout for retry and persistent requests – these will be invoked in case an invalidation message is lost If the above fail, another timeout signals a deadlock (potential loss of token) and invokes a token recreation process It is possible that a token was never lost, so the token recreation process must be careful to not increase the number of tokens

8 Backup Data If a message carrying ownertoken+data is lost, the only valid copy of data may be lost The block is not evicted from the sender’s cache until it receives an ack for the above message (the block may be placed in a small backup victim cache) The OAck and BDAck need not hold up the write, but they will hold up the transfer of the owner token to the next writer – the system avoids multiple backup copies to simplify recovery

9 Summary

10 Token Recreation Process The memory controller ensures that the requesting cache ends up with a valid copy of data with all tokens – slow, but correct Since tokens may be stuck in traffic (deadlock false positive), the token-count invariant must not be violated when this token is eventually delivered – hence, tokens have serial numbers and a new serial number is used by the token recreation process The memory controller first cancels all existing tokens, informs every node of the new serial number, and collects valid copies of data; then creates the new serial number and tokens, and passes data+tokens to requestor

11 Results No injected faults in experiments below Potential performance loss: less cache space because of backup copies; can’t service the next write request until the backup is deleted

12 Results with Message Loss

13 Token Coherence Signatures Errors can be detected by checking each invariant in token coherence: Each block has T tokens, of which one is the owner token  there are initially T tokens for a block in the system  a node can never hold T tokens for a block  if a node sends N t tokens for a block at time t, another node must receive N t tokens for that block at time t (transfers are not instantaneous, but we will assume that the receiver owned the tokens since time t) A processor can write only if it has all T tokens A processor can read a block only if it has at least one token and valid data If a coherence message contains the owner token, it also contains valid data

14 Distributed Logical Time Each node has a time that is incremented on a message send (or receive) – the message carries the sender’s timestamp with it If the receiver’s time is less than the timestamp, the receiver updates its clock to timestamp + 1

15 Token Coherence Signature Checker Confirm that the {tokens received, timestamp} at the sender matches that at the receiver Over a given time interval, maintain a signature to represent all received/sent tokens/timestamps and confirm that they are consistent (allow a grace period as a message with an old timestamp may not have been delivered by the end of the interval – if the message is stuck for a really long time, treat it as a fault) If not, rely on previously proposed checkpoint mechanisms to rollback to valid state and re-execute

16 Signature Positive for a receive, negative for a send; the sum of signatures should total to zero I is the time interval; t is every logical time step in I; N t is the number of tokens received at time t; T is the total number of tokens for a block; can limit the size of the signature by making every computation modulo n Errors can also happen if the token is assigned to the wrong block; an address signature helps detect such errors:

17 Example Results summary: bandwidth overhead (timestamps and signature collection) of less than 7% (negligible performance impact)

18 Title Bullet