2
Concurrent, Distributed Systems
Stock Exchanges, Telecoms, Commuter Rail
3
Concurrent, Distributed Systems
System: Nodes exchanging Messages
Execution:
1. Node gets message event
2. Executes event handler
 - Updates node state
 - Sends new messages
3. Repeat…
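The execution loop on this slide can be sketched roughly as follows. This is only an illustration: the Node class, its on_message handler, the hop-count message, and the FIFO queue are hypothetical names chosen for the example, not anything taken from the slides or from Mace.

```python
from collections import deque

class Node:
    """Hypothetical node: its state is just a count of handled messages."""
    def __init__(self, name):
        self.name = name
        self.handled = 0

    def on_message(self, hops_left, peers):
        """Event handler: update node state, return new (dest, msg) pairs."""
        self.handled += 1                      # updates node state
        if hops_left <= 1:
            return []                          # nothing more to send
        # sends new messages to every other node
        return [(p, hops_left - 1) for p in peers if p != self.name]

def run(nodes, initial_messages):
    """Deliver messages one at a time until none remain; return step count."""
    queue = deque(initial_messages)
    steps = 0
    while queue:                               # 3. repeat
        dest, msg = queue.popleft()            # 1. node gets message event
        queue.extend(nodes[dest].on_message(msg, list(nodes)))  # 2. handler
        steps += 1
    return steps

nodes = {"1": Node("1"), "2": Node("2")}
total_steps = run(nodes, [("1", 3)])
```

A FIFO queue fixes one delivery order; the later slides replace it with a scheduler that may pick any node and event.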
4
Distributed Systems: Challenges
System: Nodes exchanging Messages
Challenges:
 - Nodes: enter, leave, fail
 - Messages: reordered, lost
System must stay available (Liveness Properties):
 - Eventually, all nodes regroup
 - Eventually, all packets delivered
 - Eventually, some good happens
5
The Space of System Executions
[Figure: tree of system states for nodes 1 and 2, rooted at the Initial State; edges labeled event@1, event@2, fail@1, fail@2]
At each state, scheduler picks:
1. Node n
2. Event @n
3. Executes code
6
An Execution = Sequence of Choices
[Figure: the same state tree for nodes 1 and 2; edges labeled event@1, event@2, fail@1, fail@2]
10
Safety Bugs: Execution that drives system to a bad state
Bad States (low-level crash):
 - Null dereferences
 - Buffer overflows
 - Assertion failures
[Figure: execution through event@2, fail@2 ending in the Bad States]
11
How to find Safety Bugs?
Find path from Initial State to Bad States
By systematically exploring executions
(Iterating over sequences of choices)
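A minimal sketch of this kind of systematic exploration, written as a breadth-first search over sequences of choices. The toy system (its state encoding, the choices, step, and is_bad functions) is hypothetical and exists only to make the search runnable; real model checkers use far more elaborate state handling.

```python
from collections import deque

def find_bad_path(initial, choices, step, is_bad, max_depth):
    """Breadth-first search over choice sequences; returns a shortest path
    of choices from `initial` to a bad state, or None within max_depth."""
    frontier = deque([(initial, [])])
    seen = {initial}
    while frontier:
        state, path = frontier.popleft()
        if is_bad(state):
            return path
        if len(path) == max_depth:
            continue
        for choice in choices(state):
            nxt = step(state, choice)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [choice]))
    return None

# Toy system (hypothetical): state = (events handled by node 1, node 2 failed?)
def choices(state):
    handled, failed = state
    opts = []
    if handled < 2:
        opts.append("event@1")
    if not failed:
        opts.append("fail@2")
    return opts

def step(state, choice):
    handled, failed = state
    if choice == "event@1":
        return (handled + 1, failed)
    return (handled, True)

def is_bad(state):
    # Stands in for an assertion that fails only on this interleaving.
    return state == (2, True)

path = find_bad_path((0, False), choices, step, is_bad, max_depth=10)
```

The returned path is itself the error trace: replaying its choices from the initial state reproduces the bad state.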
12
Model Checking for Safety Bugs
Find path from Initial to Bad
By systematically exploring executions
[VeriSoft 97, CMC 04, CHESS 07]
13
Safety Properties are too Low Level
Find path from Initial to Bad
By systematically exploring executions
[VeriSoft 97, CMC 04, CHESS 07]
14
Safety Properties are too Low Level
Distributed Systems: Designed for crashes & failures
Challenge: End-to-end problems, i.e. Liveness bugs
15
Live States
Good States: all nodes regroup, all packets delivered
Live States: Eventually Good Happens
[Figure: Initial State, Bad States, Live States]
16
Live Executions Initial State Live States
17
Liveness Violations
Execution never reaches live state
[Figure: Initial State, Live States]
18
How to Find Liveness Violations?
Explore all executions? Infinitely many...
19
How to Find Liveness Violations?
Explore all executions up to a bound?
Combinatorial explosion (depth < 50)
Liveness at depth >> 50
[VeriSoft 97, CMC 04, CHESS 07]
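To get a feel for the explosion, the number of choice sequences grows exponentially with the depth bound. The figure of 6 choices per step below (2 nodes with 3 enabled events each) is a made-up example, not a number from the talk.

```python
def num_executions(choices_per_step, depth):
    """Number of distinct choice sequences of exactly `depth` steps."""
    return choices_per_step ** depth

# Hypothetical numbers: 2 nodes, 3 enabled events each => 6 choices per step.
for depth in (10, 25, 50):
    print(depth, num_executions(6, depth))
```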
20
How to Find Liveness Violations?
Looks pretty hopeless...
21
Idea 1: Dead States
Dead States: No execution can reach live states
Recovery is impossible
22
Idea 1: Dead States
To find Liveness bugs, look for Dead executions.
How to tell if a state is Dead?
23
Idea 2: Random Walks
How to tell if a state is Dead?
Execute long random walks from state
 - From a dead state: Pr[reaching live] = 0
 - From a recoverable state: Pr[reaching live] = 1
24
Executions and Random Walks
At each execution step:
1. Scheduler picks node n
2. Scheduler picks event @n
3. Executes event code
Random Walk: Scheduler picks randomly (from some probability distribution over nodes, events)
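A random-walk scheduler along these lines might look like the sketch below. It assumes a uniform distribution over enabled (node, event) pairs, which is only one possible choice, and the toy system of pending message counts is hypothetical.

```python
import random

def random_walk(state, enabled, step, is_live, max_steps, rng):
    """Run one random walk; return the steps taken to reach a live state,
    or None if no live state is reached within max_steps."""
    steps = 0
    while not is_live(state):
        options = enabled(state)               # all enabled (node, event) pairs
        if steps == max_steps or not options:
            return None
        node, event = rng.choice(options)      # scheduler picks randomly
        state = step(state, node, event)       # executes event code
        steps += 1
    return steps

# Toy system (hypothetical): state[i] = messages still pending at node i.
def enabled(state):
    return [(i, "deliver") for i, pending in enumerate(state) if pending > 0]

def step(state, node, event):
    return tuple(p - 1 if i == node else p for i, p in enumerate(state))

def is_live(state):
    return all(p == 0 for p in state)          # all packets delivered

rng = random.Random(0)
steps_needed = random_walk((2, 1), enabled, step, is_live, 100, rng)
```

In this toy system every walk from (2, 1) needs exactly three deliveries, whatever order the scheduler picks.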
25
Liveness Bugs = Search + Random Walks
1. Systematic Search: find candidates
2. Random Walk: test if candidate dead
Iterate
26
Liveness Bugs = Search + Random Walks
If walk length >> avg. steps to liveness, then a non-live walk is likely a liveness bug!
[Figure labels: 100k Events, 1k Events]
100,000-step execution (2 GB log file): how to pinpoint the bug?
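The deadness test can be sketched as follows: run a few walks that are much longer than the typical time to liveness, and flag the candidate if none of them gets there. The function name, the number of walks, and the toy system with a trap state are all assumptions made for illustration; the result is evidence of deadness, not a proof.

```python
import random

def looks_dead(state, successors, is_live, walk_length, walks, rng):
    """Return True if none of `walks` random walks of `walk_length` steps
    from `state` reaches a live state (evidence of deadness, not proof)."""
    for _walk in range(walks):
        current = state
        for _step in range(walk_length):
            if is_live(current):
                break
            current = rng.choice(successors(current))
        if is_live(current):
            return False
    return True

# Toy system (hypothetical): 0 is live, -1 is a trap with no way out,
# and any k > 0 either makes progress (k - 1) or stalls (k).
def successors(state):
    return [-1] if state == -1 else [state - 1, state]

def is_live(state):
    return state == 0

rng = random.Random(0)
# From state 3 liveness takes about 6 steps on average, so 1,000 is "long".
recoverable_flagged = looks_dead(3, successors, is_live, 1000, 5, rng)
trap_flagged = looks_dead(-1, successors, is_live, 1000, 5, rng)
```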
27
Idea 3: The Critical Transition
System transitions from a recoverable to a dead state
How to find the Critical Transition without knowing the Dead States?
28
Idea 3: The Critical Transition
Binary Search using Random Walks!
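The binary search works because deadness is monotone along an execution: once a state is dead, every later state is dead too. A minimal sketch, assuming the trace is available as a list of states and that some deadness test (for example, one based on random walks) is passed in as looks_dead; both names and the toy trace are hypothetical.

```python
def critical_transition(trace, looks_dead):
    """trace[0] must be recoverable and trace[-1] dead.
    Returns i such that the step from trace[i-1] to trace[i] is critical."""
    lo, hi = 0, len(trace) - 1        # invariant: trace[lo] recoverable, trace[hi] dead
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if looks_dead(trace[mid]):
            hi = mid                  # critical transition is at or before mid
        else:
            lo = mid                  # critical transition is after mid
    return hi

# Toy trace (hypothetical): states are step indices, dead from step 73,421 on.
probes = []
def looks_dead(state):
    probes.append(state)
    return state >= 73_421

trace = list(range(100_001))          # a 100,000-event execution
index = critical_transition(trace, looks_dead)
```

For a 100,000-event execution this needs at most 17 deadness tests, instead of one per state.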
30
Idea 3: The Critical Transition
Critical Transition: System transitions from a recoverable to a dead state
Pinpoints bug
31
Recap
Liveness Bugs Found: System has shot itself (but doesn't know it)
Systematic Search: Finds candidate dead states
Random Walks: Determine if candidate is dead
Critical Transition: The event that makes recovery impossible
32
Bells and Whistles (1/2)
Random Walk Bias
 - Assign “likely” events higher weight, e.g. application > network > timer > fail
 - Bugs not missed: random walk only tests deadness
 - Live state reached sooner: error traces shorter, simpler
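One way to bias the walk is a weighted random choice over the enabled events. The slide gives only the ordering of event kinds, so the numeric weights and event names below are invented for illustration.

```python
import random

# Hypothetical weights; the slide only gives the ordering of event kinds.
WEIGHTS = {"application": 8, "network": 4, "timer": 2, "fail": 1}

def pick_event(enabled_events, rng):
    """Pick one enabled (kind, name) event, favoring higher-weight kinds."""
    weights = [WEIGHTS[kind] for kind, _name in enabled_events]
    return rng.choices(enabled_events, weights=weights, k=1)[0]

enabled_events = [("application", "join"), ("network", "deliver"),
                  ("timer", "retry"), ("fail", "crash")]
rng = random.Random(1)
counts = {kind: 0 for kind in WEIGHTS}
for _ in range(10_000):
    kind, _name = pick_event(enabled_events, rng)
    counts[kind] += 1
```

Every event kind keeps a nonzero weight, so failures still occur during a walk, only less often than application events.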
33
Bells and Whistles (2/2)
Prefix-Based Search
 - Restart search after reaching liveness
 - Analyzes effect of failures in “steady-state”
34
Evaluation
[Figure: Mace (C++) System and Liveness Properties go into MaceMC; Liveness Bugs and Critical Transition come out]
35
Systems
 - RandTree: Random overlay tree with max degree.
 - MaceTransport: User-level, reliable messaging service.
 - Pastry: Key-based routing, using an overlay ring.
 - Chord: Key-based routing, using an overlay ring.
36
Liveness Properties
 - RandTree (random overlay tree with max degree): Eventually, all nodes form single tree.
 - MaceTransport (user-level, reliable transport service): Eventually, all messages acknowledged.
 - Pastry (key-based routing, using an overlay ring): Eventually, all nodes form a ring.
 - Chord (key-based routing, using an overlay ring): Eventually, all nodes form a ring.
37
Sample Bug: RandTree
Nodes: With Child, Parent pointers
Property: Eventually nodes form tree
38
Sample Bug: RandTree
[Figure: nodes A, B, C]
1. C requests to join under A
2. A sends ack
3. C fails and restarts
4. C ignores ack from A
5. C joins under B
Bug: System stuck as a DAG! C’s failure not propagated to A
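The stuck state can be detected by comparing child pointers against parent pointers. The encoding below is a guess made for illustration: the slide does not say how A and B are related, so B is assumed here to be a child of A, and the dictionary layout and function name are hypothetical.

```python
def inconsistent_links(nodes):
    """nodes: dict name -> {"parent": name or None, "children": set of names}.
    Return (parent, child) pairs where the parent lists the child but the
    child's parent pointer points elsewhere."""
    return sorted((p, c) for p, info in nodes.items()
                  for c in info["children"]
                  if nodes[c]["parent"] != p)

# State after the five steps (assumed encoding; B under A is an assumption).
stuck = {
    "A": {"parent": None, "children": {"B", "C"}},   # A still believes C joined
    "B": {"parent": "A", "children": {"C"}},
    "C": {"parent": "B", "children": set()},         # C restarted, joined under B
}
stale = inconsistent_links(stuck)
```

Here C is reachable from A both directly and through B, which is the DAG shape the slide describes.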
39
Liveness Bugs Yield Safety Assertions
Dead States: Violations of a priori unknown safety properties
Critical Transition: Helps identify dead states; yields new safety properties and bugs
40
New Safety Property: Chord
Nodes: with Fwd, Back pointers
Property: Eventually nodes form a ring
Critical Transition: To Dead State where n.back = n, n.fwd = m
New Safety Property: IF n.back = n THEN n.fwd = n
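Once discovered, the new property can be checked as an ordinary safety assertion on every state. A minimal sketch, assuming node state is held in a dictionary of back and fwd pointers; the representation, the function name, and the pointers given to node m are hypothetical.

```python
def violates_ring_property(nodes):
    """nodes: dict name -> {"back": name, "fwd": name}.
    Return the names that break: IF n.back == n THEN n.fwd == n."""
    return [n for n, ptrs in nodes.items()
            if ptrs["back"] == n and ptrs["fwd"] != n]

# A lone node pointing at itself in both directions satisfies the property.
singleton = {"n": {"back": "n", "fwd": "n"}}

# The dead state from the slide: n.back = n but n.fwd = m
# (m's own pointers are made up for the example).
dead = {"n": {"back": "n", "fwd": "m"},
        "m": {"back": "n", "fwd": "n"}}
```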
41
Scorecard
System         Bugs  Liveness  Safety
MaceTransport    11         5       6
RandTree         17        12       5
Pastry            5         5       0
Chord            19         9      10
Totals           52        31      21
Several “protocol level” bugs
Routinely used by Mace programmers
42
Programming Challenges
 - How to handle unexpected events?
 - How to propagate effects of failures?
 - How to limit impact on performance?
43
Take Away Message
Liveness Bugs Are Very Important
Randomness Helps.