(C) 2003 Daniel Sorin, Duke Architecture. Dynamic Verification of End-to-End Multiprocessor Invariants. Daniel J. Sorin, Mark D. Hill, David A. Wood.

Presentation transcript:

(C) 2003 Daniel Sorin, Duke Architecture
Dynamic Verification of End-to-End Multiprocessor Invariants
Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2)
(1) Department of Electrical & Computer Engineering, Duke University
(2) Computer Sciences Department, University of Wisconsin-Madison

DSN 2003 – Daniel Sorin, slide 2: My Talk in One Slide
Commercial server availability is important
  – System model: Symmetric Multiprocessor (SMP)
  – Fault model: Mostly transient, some permanent
Recent work developed efficient checkpoint/recovery
  – But we can only recover from hardware errors we detect
  – Many hardware errors are hard to detect
Proposal: Dynamic verification of invariants
  – Online checking of end-to-end system invariants
  – Checking performed with distributed signature analysis
  – Triggers recovery if invariant is violated

Slide 3: Outline
Background
  – SMPs and availability
  – Existing hardware error detection
Invariant checking with distributed signature analysis
Two invariant checkers
Evaluation
Conclusions

Slide 4: Symmetric Multiprocessor (SMP)
System Model (figure): four processors (P) connected by a shared wire bus
Cache Coherence Transaction (figure): block moves from state I to state M
  – Issue request
  – Wait for response
  – Receive response

Slide 5: Symmetric Multiprocessor (SMP)
System Model (figure): four processors (P) connected by a switch
  – Broadcast request not delivered to subset of nodes
  – Broadcast requests delivered out of order to subset of nodes
Cache Coherence Transaction (figure): block moves from state I to state M
  – Issue request
  – Wait for response
  – Receive response

Slide 6: Symmetric Multiprocessor (SMP)
System Model (figure): four processors (P) connected by a switch
  – Broadcast request not delivered to subset of nodes
  – Broadcast requests delivered out of order to subset of nodes
Cache Coherence Transaction (figure): states I and M with intermediate points labeled t1, t2, t3; events shown are issue request, request arrives, and response arrives
  – More chances for incorrect state transitions

Slide 7: Backward Error Recovery
Can improve availability with backward error recovery
If error detected, then recover to pre-fault state
Backward error recovery (BER) requires:
  – Checkpoint/recovery mechanism
  – Error detection mechanisms

Slide 8: SafetyNet Checkpoint/Recovery
SafetyNet: all-hardware scheme [ISCA 2002]
  – Periodically take logical checkpoint of multiprocessor
      MP State: processor registers, caches, memory
  – Incrementally log changes to caches and memory
  – Consistent checkpointing performed in logical time
      E.g., every 3000 broadcast cache coherence requests
  – Can tolerate >100,000 cycles of error detection latency
Figure (timeline): checkpoints CP 1 through CP 4 along a time axis, with regions labeled validated execution, pending validation (still detecting errors), and active execution
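The checkpointing idea on this slide can be illustrated with a small software sketch. This is not SafetyNet's hardware design: the class name `CheckpointedMemory`, its methods, and the undo-log layout are all hypothetical, and a Python dict stands in for cache and memory state. It only shows the two ideas named above: checkpoints spaced in logical time (a count of observed requests) and incremental logging of old values so that state can be rolled back.

```python
_ABSENT = object()  # marks "address had no value before this interval"

class CheckpointedMemory:
    """Toy model: one undo log per unvalidated checkpoint interval."""

    def __init__(self, interval=3000):
        self.interval = interval      # requests per checkpoint (logical time)
        self.memory = {}
        self.requests_seen = 0
        self.undo_logs = [{}]         # oldest interval first

    def write(self, address, value):
        log = self.undo_logs[-1]
        # Log the old value only on the first write in this interval.
        if address not in log:
            log[address] = self.memory.get(address, _ABSENT)
        self.memory[address] = value

    def observe_request(self):
        """Advance logical time; open a new interval at each boundary."""
        self.requests_seen += 1
        if self.requests_seen % self.interval == 0:
            self.undo_logs.append({})

    def validate_oldest(self):
        """No error was detected for the oldest interval: drop its log."""
        if len(self.undo_logs) > 1:
            self.undo_logs.pop(0)

    def recover(self):
        """Roll back to the oldest checkpoint that is still unvalidated."""
        for log in reversed(self.undo_logs):
            for address, old in log.items():
                if old is _ABSENT:
                    self.memory.pop(address, None)
                else:
                    self.memory[address] = old
        self.undo_logs = [{}]

m = CheckpointedMemory(interval=2)
m.write(1, "a")
m.observe_request()
m.observe_request()      # second request closes the first interval
m.write(1, "b")
m.write(2, "c")
m.validate_oldest()      # first interval declared error-free
m.recover()              # undo only the writes made after that checkpoint
```

Logging only the first write per interval keeps the log proportional to the number of distinct locations touched, which is one plausible reading of "incrementally log changes".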

Slide 9: Error Detection
Error model: mostly due to transient faults
Example error detection mechanisms:
  – Parity bit on cache line
  – Checksum on incoming message
  – Timeout on cache coherence transaction
But error detection for servers is still weak
Why?
  – Error detection is often on critical path and must be fast
  – Fast error detection can’t incorporate info from other nodes

Slide 10: Why Local Information Isn’t Sufficient
Figure: processors P1 through P4 connected by a switch; cache states shown: Shared, Owned

Slide 11: Why Local Information Isn’t Sufficient
Figure: processors P1 through P4 connected by a switch; cache states shown: Shared, Owned; a Broadcast Request for Exclusive is sent and a fault occurs

Slide 12: Why Local Information Isn’t Sufficient
Figure: processors P1 through P4 connected by a switch; labels shown: Shared, Owned, Invalid, Broadcast Request for Exclusive, Data Response, fault

Slide 13: Why Local Information Isn’t Sufficient
Figure: processors P1 through P4 connected by a switch; cache states shown: Shared, Modified
Neither P1 nor P2 can detect that an error has occurred!

Slide 14: Outline
Background
End-to-end invariant checking
Two invariant checkers
Evaluation
Conclusions

Slide 15: Distributed Signature Analysis
Reduces long history of events into small signature
  – Signatures map almost-uniquely to event histories
Figure: events 1 through N at P1 are reduced to P1's signature, and events 1 through N at P2 to P2's signature; a checker takes P1's signature and P2's signature as inputs
Check periodically in logical time (every 3000 requests)

Slide 16: Designing Signature Analysis Schemes
Must devise two functions: Update and Check
  Signature(Pi) = Update(Signature(Pi), Event)
  Check(Signature(P1), …, Signature(PN)) = true if error
Simple example: check that message inflow = outflow
  – Assume only unicast messages
  – Update: +1 for receive, -1 for send
  – Check: true if sum of all signatures doesn’t equal 0
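The inflow = outflow example can be written out directly. The sketch below is a minimal illustration, not code from the talk: the event names "send" and "recv" and the helper `signatures_for` are hypothetical, and it assumes that no messages are still in flight when the check runs.

```python
from functools import reduce

def update(signature: int, event: str) -> int:
    """Update: +1 for a receive, -1 for a send (unicast messages only)."""
    if event == "recv":
        return signature + 1
    if event == "send":
        return signature - 1
    raise ValueError(f"unknown event: {event!r}")

def check(signatures: list[int]) -> bool:
    """Check: returns True if an error is detected (sum is not 0)."""
    return sum(signatures) != 0

def signatures_for(histories: list[list[str]]) -> list[int]:
    """Fold each node's event history into its local signature."""
    return [reduce(update, events, 0) for events in histories]

# P1 sends two messages and P2 receives both: inflow equals outflow.
fault_free = signatures_for([["send", "send"], ["recv", "recv"]])

# Same traffic, but one message is lost before it reaches P2.
one_dropped = signatures_for([["send", "send"], ["recv"]])

print(check(fault_free), check(one_dropped))
```

Each node only ever touches its own counter; the global sum is formed once per check, which is what keeps the update off the critical path.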

Slide 17: Implementing Distributed Signature Analysis
All components cooperate to perform checking
  – Component = cache controller or memory controller
Each component contains:
  – Local signature register
  – Logic to compute signature updates
System contains:
  – System controller that performs check function
Use distributed signature analysis for dynamic verification
  – Verify end-to-end invariants

Slide 18: Outline
Background
End-to-end invariant checking
Two invariant checkers
  – Message invariant
  – Cache coherence invariant
Evaluation
Conclusions

Slide 19: A Message-Level Invariant Checker
Context: symmetric multiprocessor (SMP)
  – Cache coherence with broadcast snooping protocol
Invariant: all nodes see same total order of broadcast cache coherence requests
Update: for each incoming broadcast, “add” Address
  – Not quite this simple (e.g., doesn’t detect reorderings)
Check: error if all signatures aren’t equal
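A minimal sketch of the simple "add Address" version of this checker follows. The 16-bit signature width (`SIG_BITS`), the example addresses, and the helper `signature_of` are assumptions made for illustration; as the slide itself notes, plain addition does not catch reorderings.

```python
SIG_BITS = 16                 # hypothetical width of the signature register
MASK = (1 << SIG_BITS) - 1

def update(signature: int, address: int) -> int:
    """Update: add the address of each incoming broadcast, modulo 2**SIG_BITS."""
    return (signature + address) & MASK

def check(signatures: list[int]) -> bool:
    """Check: returns True (error) unless every node holds the same signature."""
    return len(set(signatures)) > 1

def signature_of(addresses: list[int]) -> int:
    """Signature a node would hold after observing these broadcasts."""
    signature = 0
    for address in addresses:
        signature = update(signature, address)
    return signature

broadcasts = [0x40, 0x80, 0xC0]

# All four nodes observe every broadcast.
healthy = [signature_of(broadcasts) for _ in range(4)]

# The fourth node never receives the last broadcast.
faulty = healthy[:3] + [signature_of(broadcasts[:2])]

print(check(healthy), check(faulty))
```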

Slide 20: Aliasing
Aliasing occurs if two histories have same signature
3 possible sources of aliasing
  – Finite resources: b bits can only distinguish 2^b histories
  – Fault in signature analysis hardware itself
  – Inherent flaw in scheme
Examples of inherent aliasing in previous scheme
  – Arrival of message with Address=0 doesn’t change signature
  – Reordering of messages doesn’t change signature
  – We solve aliasing issues in paper
      Tricks: hash more than 1 field of message, use LFSRs, etc.
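Both inherent aliasing cases can be reproduced in a few lines. The additive signature below is the simple scheme from the previous slide. The order-sensitive variant is a hypothetical polynomial hash chosen only to show that the two aliases can be removed; it is not the LFSR-based scheme from the paper, and the multiplier 31 and 16-bit width are arbitrary assumptions.

```python
MASK = (1 << 16) - 1   # hypothetical 16-bit signature register

def additive(addresses: list[int]) -> int:
    """Simple scheme: signature is the sum of the addresses."""
    signature = 0
    for address in addresses:
        signature = (signature + address) & MASK
    return signature

def order_sensitive(addresses: list[int]) -> int:
    """Illustrative alternative: mix the old signature before adding.

    The "+ 1" makes a message with Address=0 change the signature, and the
    multiplication makes the result depend on arrival order.
    """
    signature = 0
    for address in addresses:
        signature = (signature * 31 + address + 1) & MASK
    return signature

in_order = [0x40, 0x80]
reordered = [0x80, 0x40]
with_zero = [0x40, 0x00]
zero_dropped = [0x40]

# The additive scheme aliases in both cases.
print(additive(in_order) == additive(reordered))
print(additive(with_zero) == additive(zero_dropped))

# The order-sensitive variant separates them.
print(order_sensitive(in_order) == order_sensitive(reordered))
print(order_sensitive(with_zero) == order_sensitive(zero_dropped))
```

Aliasing from finite resources remains: any b-bit signature still maps many histories to the same value.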

Slide 21: A Cache Coherence Invariant Checker
Invariant: all coherence upgrades cause downgrades
  – Upgrade: increase permissions to block (e.g., none → read)
  – Downgrade: decrease permissions (e.g., write → read)
Update: add Address for upgrade, subtract Address for downgrade
Check: error if sum of all signatures doesn’t equal 0
Challenges
  – Can be more than one downgrade per upgrade
  – Upgrader doesn’t know how many downgraders exist
  – See paper for solutions to these challenges
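The update and check rules above can be sketched as follows. This sketch deliberately sidesteps the challenges listed on the slide by assuming exactly one downgrade per upgrade; the 16-bit width, the event tuples, and the helper `signature_of` are hypothetical.

```python
MASK = (1 << 16) - 1   # hypothetical 16-bit signature register

def update(signature: int, kind: str, address: int) -> int:
    """Update: add Address for an upgrade, subtract it for a downgrade."""
    if kind == "upgrade":
        return (signature + address) & MASK
    if kind == "downgrade":
        return (signature - address) & MASK
    raise ValueError(f"unknown event kind: {kind!r}")

def check(signatures: list[int]) -> bool:
    """Check: returns True (error) if the signatures do not sum to 0."""
    return (sum(signatures) & MASK) != 0

def signature_of(events: list[tuple[str, int]]) -> int:
    """Signature a cache controller would hold after these events."""
    signature = 0
    for kind, address in events:
        signature = update(signature, kind, address)
    return signature

# P1 upgrades block 0x100 and P2 gives up its permission in response.
coherent = [
    signature_of([("upgrade", 0x100)]),
    signature_of([("downgrade", 0x100)]),
]

# P2 never sees the request, so it never downgrades.
incoherent = [
    signature_of([("upgrade", 0x100)]),
    signature_of([]),
]

print(check(coherent), check(incoherent))
```

Working modulo 2^16 means a downgrade is stored as a large positive value, and the masked sum still returns to 0 when every upgrade is matched.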

Slide 22: Outline
Background
End-to-end invariant checking
Two invariant checkers
Evaluation
Conclusions

Slide 23: Methodology
Full-system simulation of 16-processor machine
  – Simics provides functional simulation of everything
  – We added timing simulation for memory system & SafetyNet
Commercial workloads running on Solaris 8
  – Database: IBM’s DB2 running online transaction processing
  – Static web server: Apache
  – Dynamic web server: Slashdot
  – Java middleware

Slide 24: Detection Coverage
How do we know if our checkers work?
Inject errors periodically
  – Corrupt messages
  – Drop messages
  – Reorder messages
  – Improperly process cache coherence messages
Global invariant checkers detected all errors
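A toy version of this kind of experiment is sketched below. It is not the simulation setup used in the talk: the injectors, the four-message stream, and the order-sensitive signature (a hypothetical polynomial hash with an arbitrary multiplier) are assumptions for illustration. It covers only the corrupt, drop, and reorder cases, not improper processing of coherence messages.

```python
MASK = (1 << 16) - 1   # hypothetical 16-bit signature register

def signature_of(addresses: list[int]) -> int:
    """Order-sensitive signature over a stream of broadcast addresses."""
    signature = 0
    for address in addresses:
        signature = (signature * 31 + address + 1) & MASK
    return signature

def drop(stream: list[int], i: int) -> list[int]:
    """Remove the message at position i."""
    return stream[:i] + stream[i + 1:]

def reorder(stream: list[int], i: int) -> list[int]:
    """Swap the messages at positions i and i + 1."""
    out = list(stream)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def corrupt(stream: list[int], i: int) -> list[int]:
    """Flip the low bit of the address at position i."""
    out = list(stream)
    out[i] ^= 1
    return out

def detected(reference: list[int], observed: list[int]) -> bool:
    """True if the faulty node's signature differs from a healthy node's."""
    return signature_of(reference) != signature_of(observed)

stream = [0x40, 0x80, 0xC0, 0x100]
injectors = {"drop": drop, "reorder": reorder, "corrupt": corrupt}
results = {name: detected(stream, inject(stream, 1))
           for name, inject in injectors.items()}
print(results)
```

A fixed stream keeps the run deterministic; a fuller experiment would repeat the injections at many positions and count how often aliasing hides an error.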

Slide 25: Performance
Figure: performance results; error bars represent +/- one standard deviation

Slide 26: Conclusions
Goal: improve multiprocessor availability
How? Dynamic verification of end-to-end invariants
  – Implemented with distributed signature analysis
Results
  – Detects previously undetectable hardware errors
  – Negligible performance overhead for error-free execution
Duke FaultFinder Project
Wisconsin Multifacet Project