(C) 2003 Daniel Sorin, Duke Architecture
Dynamic Verification of End-to-End Multiprocessor Invariants
Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2)
(1) Department of Electrical & Computer Engineering, Duke University
(2) Computer Sciences Department, University of Wisconsin-Madison
DSN 2003 – Daniel Sorin slide 2: My Talk in One Slide
Commercial server availability is important
  – System model: Symmetric Multiprocessor (SMP)
  – Fault model: Mostly transient, some permanent
Recent work developed efficient checkpoint/recovery
  – But we can only recover from hardware errors we detect
  – Many hardware errors are hard to detect
Proposal: Dynamic verification of invariants
  – Online checking of end-to-end system invariants
  – Checking performed with distributed signature analysis
  – Triggers recovery if invariant is violated
Slide 3: Outline
Background
  – SMPs and availability
  – Existing hardware error detection
Invariant checking with distributed signature analysis
Two invariant checkers
Evaluation
Conclusions
Slide 4: Symmetric Multiprocessor (SMP)
System Model: [figure: four processors (P) on a shared wire bus]
Cache Coherence Transaction: [figure: I to M transition: issue request, wait for response, receive response]
Slide 5: Symmetric Multiprocessor (SMP)
System Model: [figure: four processors (P) connected by a switch]
  – Broadcast request not delivered to subset of nodes
  – Broadcast requests delivered out of order to subset of nodes
Cache Coherence Transaction: [figure: I to M transition: issue request, wait for response, receive response]
Slide 6: Symmetric Multiprocessor (SMP)
System Model: [figure: four processors (P) connected by a switch]
  – Broadcast request not delivered to subset of nodes
  – Broadcast requests delivered out of order to subset of nodes
Cache Coherence Transaction: [figure: I to M transition on a timeline t1, t2, t3, with events labeled: issue request, request arrives, response arrives]
  – More chances for incorrect state transitions
Slide 7: Backward Error Recovery
Can improve availability with backward error recovery
If error detected, then recover to pre-fault state
Backward error recovery (BER) requires:
  – Checkpoint/recovery mechanism
  – Error detection mechanisms
Slide 8: SafetyNet Checkpoint/Recovery
SafetyNet: all-hardware scheme [ISCA 2002]
  – Periodically take logical checkpoint of multiprocessor
      MP State: processor registers, caches, memory
  – Incrementally log changes to caches and memory
  – Consistent checkpointing performed in logical time
      E.g., every 3000 broadcast cache coherence requests
  – Can tolerate >100,000 cycles of error detection latency
[figure: timeline of checkpoints CP 1 to CP 4, with regions labeled validated execution, pending validation (still detecting errors), and active execution]
Slide 9: Error Detection
Error model: mostly due to transient faults
Example error detection mechanisms:
  – Parity bit on cache line
  – Checksum on incoming message
  – Timeout on cache coherence transaction
But error detection for servers is still weak
Why?
  – Error detection is often on critical path and must be fast
  – Fast error detection can’t incorporate info from other nodes
Slide 10: Why Local Information Isn’t Sufficient
[figure: P1, P2, P3, P4 connected by a switch; block states shown: Shared, Owned]
Slide 11: Why Local Information Isn’t Sufficient
[figure: P1, P2, P3, P4 connected by a switch; block states shown: Shared, Owned; a Broadcast Request for Exclusive is sent and a fault occurs]
Slide 12: Why Local Information Isn’t Sufficient
[figure: P1, P2, P3, P4 connected by a switch; block states shown: Shared, Owned; messages shown: Broadcast Request for Exclusive (hit by the fault), Invalid, Data Response]
Slide 13: Why Local Information Isn’t Sufficient
[figure: P1, P2, P3, P4 connected by a switch; block states shown: Shared, Modified]
Neither P1 nor P2 can detect that an error has occurred!
Slide 14: Outline
Background
End-to-end invariant checking
Two invariant checkers
Evaluation
Conclusions
Slide 15: Distributed Signature Analysis
Reduces long history of events into small signature
  – Signatures map almost-uniquely to event histories
[figure: Event 1 through Event N at P1 feed P1's signature; Event 1 through Event N at P2 feed P2's signature; both signatures are sent to the Checker]
Check periodically in logical time (every 3000 requests)
Slide 16: Designing Signature Analysis Schemes
Must devise two functions: Update and Check
  Signature(Pi) = Update(Signature(Pi), Event)
  Check(Signature(P1), …, Signature(PN)) = true if error
Simple example: check that message inflow = outflow
  – Assume only unicast messages
  – Update: +1 for receive, -1 for send
  – Check: true if sum of all signatures doesn’t equal 0
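The inflow = outflow example can be sketched in a few lines of Python. This is only an illustration of the Update and Check functions named on the slide; the event strings "send" and "receive" and the helper name `signatures_for` are assumptions made for the sketch, not part of the talk.

```python
from typing import Iterable, List


def update(signature: int, event: str) -> int:
    """Update rule from the slide: +1 for a receive, -1 for a send."""
    if event == "receive":
        return signature + 1
    if event == "send":
        return signature - 1
    raise ValueError(f"unknown event: {event!r}")


def check(signatures: Iterable[int]) -> bool:
    """Check rule from the slide: True (error) if the signatures do not sum to 0."""
    return sum(signatures) != 0


def signatures_for(histories: List[List[str]]) -> List[int]:
    """Hypothetical helper: fold each node's event history into its signature."""
    result = []
    for events in histories:
        sig = 0
        for event in events:
            sig = update(sig, event)
        result.append(sig)
    return result


# Node 0 sends one unicast message and node 1 receives it: no error.
healthy = signatures_for([["send"], ["receive"]])
# Same send, but the message is lost before node 1 receives it: error.
dropped = signatures_for([["send"], []])
print(check(healthy), check(dropped))
```

Neither node can see the lost message locally; only the sum across both signatures reveals it.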
Slide 17: Implementing Distributed Signature Analysis
All components cooperate to perform checking
  – Component = cache controller or memory controller
Each component contains:
  – Local signature register
  – Logic to compute signature updates
System contains:
  – System controller that performs check function
Use distributed signature analysis for dynamic verification
  – Verify end-to-end invariants
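The division of labor on this slide (a signature register plus update logic in each component, and a system controller that runs the check) might be modeled as below. It is a software sketch under stated assumptions: the class names, the `observe` and `on_request` methods, the 16-bit additive update, and the tiny check interval are all hypothetical; the talk checks every 3000 requests and does not say whether registers are reset after a check (this sketch does not reset them).

```python
from typing import Callable, List


class Component:
    """Stands in for a cache or memory controller with a local signature register."""

    def __init__(self, update: Callable[[int, int], int]) -> None:
        self.signature = 0      # local signature register
        self._update = update   # logic that computes signature updates

    def observe(self, event: int) -> None:
        self.signature = self._update(self.signature, event)


class SystemController:
    """Runs the check function once every `interval` requests (logical time)."""

    def __init__(self, components: List[Component],
                 check: Callable[[List[int]], bool], interval: int) -> None:
        self.components = components
        self.check = check
        self.interval = interval
        self.requests_seen = 0

    def on_request(self) -> bool:
        """Advance logical time by one request; return True if a check ran and failed."""
        self.requests_seen += 1
        if self.requests_seen % self.interval != 0:
            return False
        return self.check([c.signature for c in self.components])


def add_address(sig: int, address: int) -> int:
    """Assumed update function: 16-bit additive signature."""
    return (sig + address) & 0xFFFF


def any_mismatch(sigs: List[int]) -> bool:
    """Assumed check function: error if the signatures are not all equal."""
    return len(set(sigs)) > 1
```

A recovery mechanism such as SafetyNet would be triggered whenever `on_request` returns True; that hook is omitted here.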
Slide 18: Outline
Background
End-to-end invariant checking
Two invariant checkers
  – Message invariant
  – Cache coherence invariant
Evaluation
Conclusions
Slide 19: A Message-Level Invariant Checker
Context: symmetric multiprocessor (SMP)
  – Cache coherence with broadcast snooping protocol
Invariant: all nodes see same total order of broadcast cache coherence requests
Update: for each incoming broadcast, “add” Address
  – Not quite this simple (e.g., doesn’t detect reorderings)
Check: error if all signatures aren’t equal
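A minimal sketch of the simplified scheme on this slide follows, assuming a 16-bit signature register (the width and the function names are illustrative assumptions). It deliberately keeps the plain additive update, so it shows both what the slide's version catches (a missed broadcast) and what it misses (a reordering).

```python
from typing import Iterable, List

MASK = 0xFFFF  # assumed 16-bit signature register


def update_broadcast(signature: int, address: int) -> int:
    """Simplified update from the slide: add the broadcast's address."""
    return (signature + address) & MASK


def check_broadcast(signatures: Iterable[int]) -> bool:
    """Check from the slide: True (error) if the signatures are not all equal."""
    return len(set(signatures)) > 1


def node_signature(addresses: List[int]) -> int:
    """Hypothetical helper: signature of one node's observed broadcast stream."""
    sig = 0
    for address in addresses:
        sig = update_broadcast(sig, address)
    return sig


sent = [0x40, 0x80, 0xC0]
same_order = [node_signature(sent), node_signature(sent)]
one_missed = [node_signature(sent), node_signature([0x40, 0xC0])]
reordered = [node_signature(sent), node_signature([0x80, 0x40, 0xC0])]

print(check_broadcast(same_order))  # no error expected
print(check_broadcast(one_missed))  # error: a node missed a broadcast
print(check_broadcast(reordered))   # no error reported: addition ignores order
```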
Slide 20: Aliasing
Aliasing occurs if two histories have same signature
3 possible sources of aliasing
  – Finite resources: b bits can only distinguish 2^b histories
  – Fault in signature analysis hardware itself
  – Inherent flaw in scheme
Examples of inherent aliasing in previous scheme
  – Arrival of message with Address=0 doesn’t change signature
  – Reordering of messages doesn’t change signature
  – We solve aliasing issues in paper
      Tricks: hash more than 1 field of message, use LFSRs, etc.
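The two inherent aliasing cases can be reproduced directly, along with one possible order-sensitive update. The order-sensitive version below is a simple multiply-and-add rolling hash chosen for illustration; it is not the scheme from the paper (which the slide says uses tricks such as LFSRs and hashing several message fields), and the 16-bit width and the multiplier 31 are assumptions.

```python
from typing import List

MASK = 0xFFFF  # b = 16 bits (assumed), so at most 2**16 distinct signatures


def additive(addresses: List[int]) -> int:
    """Plain additive signature from the previous slide."""
    sig = 0
    for address in addresses:
        sig = (sig + address) & MASK
    return sig


def order_sensitive(addresses: List[int]) -> int:
    """Illustrative stand-in: multiply-and-add rolling hash (not the paper's scheme)."""
    sig = 0
    for address in addresses:
        # Multiplying the old signature makes position matter;
        # the +1 makes a message with Address=0 still change the signature.
        sig = (sig * 31 + address + 1) & MASK
    return sig


# Inherent aliasing in the additive scheme:
print(additive([0]) == additive([]))                     # Address=0 is invisible
print(additive([0x40, 0x80]) == additive([0x80, 0x40]))  # reordering is invisible

# The rolling hash separates both pairs:
print(order_sensitive([0]) != order_sensitive([]))
print(order_sensitive([0x40, 0x80]) != order_sensitive([0x80, 0x40]))
```

Aliasing from finite resources remains: with b bits, some distinct histories must still share a signature.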
Slide 21: A Cache Coherence Invariant Checker
Invariant: all coherence upgrades cause downgrades
  – Upgrade: increase permissions to block (e.g., none → read)
  – Downgrade: decrease permissions (e.g., write → read)
Update: add Address for upgrade, subtract Address for downgrade
Check: error if sum of all signatures doesn’t equal 0
Challenges
  – Can be more than one downgrade per upgrade
  – Upgrader doesn’t know how many downgraders exist
  – See paper for solutions to these challenges
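The basic update and check rules can be sketched as follows. The sketch assumes exactly one downgrade per upgrade, which sidesteps the challenges listed on the slide (their solutions are in the paper and are not reproduced here); the 16-bit register width and the function names are also assumptions.

```python
from typing import Iterable

MASK = 0xFFFF  # assumed 16-bit signature register


def on_upgrade(signature: int, address: int) -> int:
    """Update for an upgrade: add the block address."""
    return (signature + address) & MASK


def on_downgrade(signature: int, address: int) -> int:
    """Update for a downgrade: subtract the block address (wraps modulo 2**16)."""
    return (signature - address) & MASK


def coherence_error(signatures: Iterable[int]) -> bool:
    """Check: True (error) if the signatures do not sum to 0 modulo 2**16."""
    return (sum(signatures) & MASK) != 0


# P1 upgrades block 0x100 and P2 performs the matching downgrade.
p1 = on_upgrade(0, 0x100)
p2 = on_downgrade(0, 0x100)
print(coherence_error([p1, p2]))  # sums to zero: no error

# Fault: P2 never sees the request, so it never downgrades.
p2_faulty = 0
print(coherence_error([p1, p2_faulty]))  # nonzero sum: error
```

The faulty case mirrors the earlier "Why Local Information Isn’t Sufficient" example: each node's own state looks legal, and only the combined signatures expose the missing downgrade.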
Slide 22: Outline
Background
End-to-end invariant checking
Two invariant checkers
Evaluation
Conclusions
Slide 23: Methodology
Full-system simulation of 16-processor machine
  – Simics provides functional simulation of everything
  – We added timing simulation for memory system & SafetyNet
Commercial workloads running on Solaris 8
  – Database: IBM’s DB2 running online transaction processing
  – Static web server: Apache
  – Dynamic web server: Slashdot
  – Java middleware
Slide 24: Detection Coverage
How do we know if our checkers work?
Inject errors periodically
  – Corrupt messages
  – Drop messages
  – Reorder messages
  – Improperly process cache coherence messages
Global invariant checkers detected all errors
Slide 25: Performance
[figure: performance results plot]
Error bars represent +/- one standard deviation
Slide 26: Conclusions
Goal: improve multiprocessor availability
How? Dynamic verification of end-to-end invariants
  – Implemented with distributed signature analysis
Results
  – Detects previously undetectable hardware errors
  – Negligible performance overhead for error-free execution
Duke FaultFinder Project – Wisconsin Multifacet Project –