Jan 2003: Slammer Worm Exploits Buffer Overflow

August 2004: North American Blackout Caused by Race Condition

Software is Unreliable. Why? Many Domains, Many Reasons. One Common Factor:

Intended Properties vs. Actual Code: the developer misses corner cases.

And as the code evolves, the properties drift away from it. Key to Reliability: Connect Properties & Code.

How? For each domain: 1. Formalize Properties. 2. Automate Analysis. 3. Build Tools.
Domains: Device Drivers, Distributed Systems, Configuration Management, PL/Databases, Web 2.0 Security.

Concurrent, Distributed Systems: Stock Exchanges, Telecoms, Commuter Rail.

System = Nodes exchanging messages.

Execution:
1. A node gets a message event.
2. It executes the event handler, which updates the node state and sends new messages.
3. Repeat...
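The execution model above can be sketched as a small event loop. This is a minimal illustration, not Mace's API: the `Node` class, the `handle` signature, and the ping/pong payloads are all invented.

```python
import collections

class Node:
    """A node holds local state and an event handler (hypothetical names)."""
    def __init__(self, node_id):
        self.id = node_id
        self.seen = []  # node state: every message handled so far

    def handle(self, src, payload):
        """Step 2: update local state and return new messages to send."""
        self.seen.append((src, payload))
        if payload == "ping":                 # reply to pings
            return [(self.id, src, "pong")]   # (src, dest, payload)
        return []

def run(nodes, events):
    """The event loop: take a message event, run its handler, repeat."""
    queue = collections.deque(events)         # pending (src, dest, payload)
    delivered = 0
    while queue:
        src, dest, payload = queue.popleft()  # 1. node gets a message event
        queue.extend(nodes[dest].handle(src, payload))  # 2. handler runs
        delivered += 1                        # 3. repeat until quiescent
    return delivered

nodes = {1: Node(1), 2: Node(2)}
total = run(nodes, [(1, 2, "ping")])  # node 1 pings node 2
```

Here node 2 handles the ping and replies, node 1 handles the pong, and the system goes quiescent after two events.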
Challenges: Nodes enter, leave, and fail. Messages are reordered and lost. Yet the system must stay available: eventually, all nodes regroup; eventually, all packets are delivered.

Pastry [Rowstron & Druschel '01]: a key-value store distributed across nodes organized in a ring topology.
Nodes Leave and Rejoin:
1. A node leaves.
2. The remaining nodes detect the failure and reconnect the ring.
3. The node returns and asks for its neighbors.
4. It rejoins its neighbors.

But sometimes... the returning node asks for its neighbors and the query bounces back! The node is forever unable to rejoin...

How to find such a bug? How to reproduce it? How to fix it?
Distributed Systems: 1. Formalize Properties. 2. Automate Analysis. 3. Build Tools.

1. Formalize Properties: System, Liveness, Bugs.

System = Nodes exchanging messages = States and Transitions.

A state is a snapshot of the system. At each state, the scheduler chooses:
1. a node n,
2. an event at n,
3. and executes the handler code (C++).

The Space of System Executions: from the initial state, each scheduler choice (event@1, event@2, fail@1, fail@2, ...) leads to a new state, forming a tree of executions.
An Execution = a Sequence of Choices, e.g. event@1, event@2, fail@1, event@1, fail@2, event@1.
1. Formalize Properties: System, Liveness, Bugs.

Desired Properties: the system must stay available; the system must recover from failure.

Liveness Properties: eventually, something good happens (despite failures). Eventually all nodes regroup; eventually all data is delivered; eventually "P is true". For example, "nodes form a ring": eventually, for all nodes n and m there is an i with n.fwd^i = m.
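The ring predicate can be written down directly: follow forward pointers from each node and check that every other node is reached. A sketch, with the `fwd` table as a plain dict standing in for per-node pointers:

```python
def forms_ring(fwd):
    """Liveness predicate "nodes form a ring": for all nodes n, m there
    is an i with n.fwd^i = m, i.e. following fwd pointers from any node
    visits every node."""
    nodes = set(fwd)
    for n in nodes:
        reachable, cur = set(), fwd[n]
        for _ in range(len(nodes)):      # at most |nodes| hops are needed
            reachable.add(cur)
            cur = fwd[cur]
        if reachable != nodes:
            return False
    return True

ring   = {1: 2, 2: 3, 3: 1}   # a healthy 3-node ring
broken = {1: 2, 2: 1, 3: 3}   # node 3 points at itself: not a ring
```

The checker would evaluate this predicate on each state to decide whether the "good thing" has happened yet.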
1. Formalize Properties: System, Liveness, Bugs.

Live States: the states where the good thing has happened, i.e. P is true.

Live Executions: executions that reach a live state.

Liveness Violations: an execution that never reaches a live state. Two kinds:

1. Unlucky Executions: at each step, some choice leads to liveness, but the scheduler keeps making bad choices.

2. Dead Executions: the execution reaches a dead state, from which no execution can reach a live state. Recovery is impossible.

Idea 1: Focus on dead-execution bugs. Because they are very severe (unluckiness may be fixed by the scheduler), because they are very plentiful, and because we can. Yet they are difficult to find, reproduce, and fix.
Distributed Systems: 1. Formalize Properties. 2. Automate Analysis. 3. Build Tools.

2. Automate Checking: systematically search for dead executions ("model checking"). But how to tell whether an execution is dead?

Model Checking: systematically explore executions. An execution is a sequence of choices, so iterate over sequences of choices until you hit a bad state.

Safety bugs (null dereference, buffer overflow, assertion failure) fit this mold [Verisoft 97, CMC 04, CHESS 07]: iterate until you hit a bad state.

How to find liveness bugs? The recipe would be "iterate until you hit a dead state", but how to tell whether a state is dead? The property only says which states are live.
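Safety-style model checking can be sketched as a bounded search over all choice sequences. The toy two-counter system and its "assertion" below are invented for illustration; real checkers explore stateful C++ handlers, not pure functions.

```python
import itertools

def model_check(init, step, choices, is_bad, depth):
    """Bounded systematic search: iterate over every sequence of
    scheduler choices up to `depth`, replaying each from the initial
    state, until a bad state is hit.  Returns the offending choice
    sequence, or None if none is found within the bound."""
    for d in range(1, depth + 1):
        for seq in itertools.product(choices, repeat=d):
            state = init
            for c in seq:
                state = step(state, c)
            if is_bad(state):
                return seq
    return None

# Toy system (hypothetical): the state is a pair of per-node counters;
# "event@n" increments node n's counter, "fail@n" resets it to zero.
def step(state, choice):
    kind, n = choice
    new = list(state)
    new[n] = 0 if kind == "fail" else new[n] + 1
    return tuple(new)

choices = [("event", 0), ("event", 1), ("fail", 0), ("fail", 1)]
bug = model_check((0, 0), step, choices,
                  is_bad=lambda s: s[0] - s[1] >= 2,  # the "assertion"
                  depth=3)
```

The search finds the shortest bad execution: two events at node 0 with none at node 1 already violate the assertion.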
Idea 2: Random Walks. Execute long random walks from the state in question. From a dead state, Pr[reaching a live state] = 0; from a recoverable state, Pr[reaching a live state] = 1.

Executions and Random Walks: at each execution step, the scheduler 1. picks a node n, 2. picks an event at n, 3. executes the event code. In a random walk, the scheduler picks randomly, from some probability distribution over nodes and events.
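The random-walk deadness test can be sketched as follows. The counter system at the bottom is invented: it is live at 0, and once it climbs to 10 it can only grow, so states at or above 10 are genuinely dead.

```python
import random

def probably_dead(state, step, choices, is_live, walks=20, length=50):
    """Idea 2: run long random walks from `state`.  From a dead state
    Pr[reaching a live state] = 0, from a recoverable one it tends to 1,
    so if no walk ever reaches liveness the state is probably dead."""
    rng = random.Random(0)                     # fixed seed: reproducible sketch
    for _ in range(walks):
        s = state
        for _ in range(length):
            s = step(s, rng.choice(choices))   # scheduler picks randomly
            if is_live(s):
                return False                   # a walk recovered: not dead
    return True                                # no walk recovered: likely dead

# Toy system (invented): the state is a counter, live when it hits 0;
# once it reaches 10 it only grows, so recovery is impossible there.
def step(s, c):
    return s + 1 if s >= 10 else s + c

def is_live(s):
    return s == 0
```

Note the asymmetry: a walk that reaches liveness proves the state recoverable, while "no walk recovered" is only probabilistic evidence of deadness.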
Algorithm = Search + Random Walks. Iterate: 1. Systematic search finds candidate states. 2. Random walks test whether a candidate is dead.

If the walk length is much larger than the average number of steps to liveness, then a non-live walk is likely a liveness bug. But how to pinpoint the bug in a 100,000-step execution (a 2 GB log file)?
Idea 3: The Critical Transition, the step at which the system moves from a recoverable state to a dead state. It pinpoints the bug. But how to find the critical transition without knowing the dead states? Binary search over the execution using random walks: walks from states before the transition reach liveness, walks from states after it do not.
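The binary search over the trace can be sketched directly. Here `is_dead` stands in for the random-walk test, and the toy trace (states 0..9, dead from 6 onward) is invented for illustration:

```python
def critical_transition(trace, is_dead):
    """Idea 3: binary search a non-live execution for the step where the
    system goes from recoverable to dead.  `is_dead` is the random-walk
    oracle: True when walks from that state never reach liveness.
    Assumes deadness is monotone along the trace (once dead, always dead)
    and returns the index of the first dead state."""
    lo, hi = 0, len(trace) - 1        # trace[0] recoverable, trace[-1] dead
    while lo < hi:
        mid = (lo + hi) // 2
        if is_dead(trace[mid]):
            hi = mid                  # dead here: the transition is earlier
        else:
            lo = mid + 1              # still recoverable: look later
    return lo

# Toy trace (hypothetical): states 0..9, dead once the state reaches 6.
trace = list(range(10))
step_of_no_return = critical_transition(trace, is_dead=lambda s: s >= 6)
```

With O(log n) oracle calls this narrows a 100,000-step execution to the single event that made recovery impossible.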
Recap. Dead executions: the system has shot itself (but doesn't know it yet). Systematic search finds candidate dead states. Random walks determine whether a candidate is dead. The critical transition is the event that makes recovery impossible.
Distributed Systems: 1. Formalize Properties. 2. Automate Analysis. 3. Build Tools.

MaceMC: the first liveness checker for systems code [NSDI 07]. Input: a Mace (C++) system and its liveness properties. Output: liveness bugs and their critical transitions.

Implementation Details (1/2): Random Walk Bias. Assign "likely" events higher weight, e.g. application > network > timer > fail. Bugs are not missed, because the random walk only tests deadness; live states are reached sooner, so error traces are shorter and simpler.
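The bias amounts to weighted sampling of the next event. A sketch; the event names and weights below are assumptions standing in for MaceMC's event classes:

```python
import random

# Random-walk bias: give "likely" events higher weight.  These names
# and weights are illustrative (application > network > timer > fail).
WEIGHTS = {"app": 8, "net": 4, "timer": 2, "fail": 1}

def biased_choice(rng, events):
    """Pick the next event, favoring higher-weight kinds.  Every event
    keeps a nonzero probability, so no behavior becomes unreachable and
    no bugs are missed; walks merely reach live states sooner."""
    return rng.choices(events, weights=[WEIGHTS[e] for e in events])[0]

rng = random.Random(0)
draws = [biased_choice(rng, ["app", "net", "timer", "fail"])
         for _ in range(1000)]
```

Over many draws, application events dominate and failures stay rare but possible, which is exactly the point.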
Implementation Details (2/2): Prefix-Based Search. Restart the search after reaching liveness; this analyzes the effect of failures in a "steady" state.

Systems Analyzed, and Their Liveness Properties:
- RandTree: random overlay tree with max degree. Eventually, all nodes form a single tree.
- MaceTransport: user-level, reliable messaging service. Eventually, all messages are acknowledged.
- Pastry: key-based routing, using an overlay ring. Eventually, all nodes form a ring.
- Chord: key-based routing, using an overlay ring. Eventually, all nodes form a ring.
The Pastry Bug Understood: why a node could be forever unable to rejoin.
1. B sends C a message about A.
2. A leaves; the ring reforms.
3. A returns.
4. C receives the (stale) message about A and updates its routing information. This is the critical transition!
5. A's rejoin requests are bounced back to A itself; A is forever unable to rejoin...

A "protocol level" bug, present in the original implementation as well: "Dropped JoinRequest on rapid rejoin problem: There was a problem with nodes not being able to quickly rejoin if they used the same NodeId. Didn't find the cause of this bug, but can no longer reproduce." (FreePastry README, "Changes since 1.4.2")
Sample Bug: RandTree. Nodes carry child and parent pointers; property: eventually, the nodes form a tree.
1. C requests to join under A.
2. A sends an ack.
3. C fails and restarts.
4. C ignores the (now stale) ack from A.
5. C joins under B. Bug: the system is stuck as a DAG!

MaceMC finds failure and reordering bugs like these, which are hard to catch via regular testing.
Liveness Bugs Yield Safety Assertions. Dead states violate a priori unknown safety assertions; the critical transition helps identify dead states and thus yields new safety assertions and bugs.

New Safety Property (Chord): nodes carry fwd and back pointers; property: eventually, the nodes form a ring. The critical transition leads to a dead state where n.back = n but n.fwd = m. New safety property: IF n.back = n THEN n.fwd = n.

Low-level safety properties are hard to determine in advance; high-level liveness properties are easy. MaceMC helps find safety assertions using liveness violations.
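The mined Chord assertion is easy to state as code. A sketch, with pointer tables as plain dicts (the node IDs are invented):

```python
def ring_invariant_violations(back, fwd):
    """Safety assertion mined from the Chord liveness violation: if a
    node's back pointer is itself, its fwd pointer must be too.  The
    dead state had n.back = n but n.fwd = m, which this check flags."""
    return [n for n in back if back[n] == n and fwd[n] != n]

ok  = {"back": {1: 3, 2: 1, 3: 2}, "fwd": {1: 2, 2: 3, 3: 1}}
bad = {"back": {1: 1, 2: 3, 3: 2}, "fwd": {1: 2, 2: 3, 3: 2}}
```

Unlike the liveness property, this is a per-state check: it can run as an ordinary assertion on every state, catching the bug the moment it happens.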
Scorecard:

System         Bugs  Liveness  Safety
MaceTransport    11         5       6
RandTree         17        12       5
Pastry            5         5       0
Chord            19         9      10
Total            52        31      21

Several "protocol level" bugs. Routinely used by Mace programmers. Zero false alarms.
For each domain: 1. Formalize Properties. 2. Automate Analysis. 3. Build Tools. From distributed systems to other domains:

Software Verification: how to prove that device drivers use the kernel API properly? Lazy Abstraction [POPL 02, PLDI 04, ICSE 04, ...], Craig Interpolation [POPL 04, TACAS 06], Liquid Types [PLDI 08, PLDI 09, POPL 10].

Multithread Analysis: how to prevent and control thread interference? Multithread Analysis = Sequential Analysis x Race Detection, Scalable Race Detection, Lock Allocation [POPL 07, PLDI 08, FSE 07].

Config Management: how to avoid "DLL hell"? Q: Can I install emacs? A: If you have X11 or Xorg (not both). The problem is NP-complete; encode and solve via SAT [ICSE 07]. SAT solvers now ship in Eclipse and SUSE Linux.
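The emacs example can be encoded as a tiny boolean satisfiability problem. A sketch only: the package universe is hypothetical, and real solvers use CDCL rather than this brute-force enumeration.

```python
import itertools

# Hypothetical package universe: emacs depends on X11 or Xorg,
# but X11 and Xorg conflict with each other.
VARS = ["emacs", "x11", "xorg"]

def satisfiable(constraints, require):
    """Brute-force SAT: try every truth assignment over VARS; return
    one that installs every required package and meets all constraints,
    or None if the request is unsatisfiable."""
    for bits in itertools.product([False, True], repeat=len(VARS)):
        env = dict(zip(VARS, bits))
        if all(env[p] for p in require) and all(c(env) for c in constraints):
            return env
    return None

constraints = [
    lambda e: (not e["emacs"]) or e["x11"] or e["xorg"],  # dependency clause
    lambda e: not (e["x11"] and e["xorg"]),               # conflict clause
]

plan = satisfiable(constraints, require=["emacs"])  # "Can I install emacs?"
```

The answer mirrors the slide: yes, provided exactly one of X11 or Xorg ends up installed.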
Web 2.0 Security: how to prevent JavaScript from doing mischief? Staged Analysis for JavaScript [PLDI 09]; Tracking Info Flow in the Browser [?].

For each domain: 1. Formalize Properties. 2. Automate Analysis. 3. Build Tools. Analysis connects properties and code: it checks the tricky corner cases and re-analyzes as the code evolves, improving software reliability.

Search "ucsd progsys" for people, papers, code, demos, etc.