Jan 2003: Slammer Worm Exploits Buffer Overflow. August 2003: North American Blackout Caused by Race Condition.

2 Jan 2003: Slammer Worm Exploits Buffer Overflow

3 August 2003: North American Blackout Caused by Race Condition

4 Software is Unreliable

5 Why?

6 Many Domains Many Reasons

7 One Common Factor

8 Intended Properties vs. Actual Code: Developer misses corner cases

9 Intended Properties vs. Actual Code: Code evolves, properties drift away. Key to reliability: connect properties & code

10 How?

11 For each domain: 1. Formalize Properties 2. Automate Analysis 3. Build Tools

12 Domains: Device Drivers, Distributed Systems, Configuration Management, PL/Databases, Web 2.0 Security

13 Concurrent, Distributed Systems: Stock Exchanges, Telecoms, Commuter Rail

14 Concurrent, Distributed Systems. System: Nodes exchanging messages

15 Concurrent, Distributed Systems. Execution: 1. Node gets message event 2. Executes event handler (updates node state, sends new messages) 3. Repeat…

16 Challenges: Nodes enter, leave, fail. Messages are reordered, lost. System must stay available: eventually, all nodes regroup; eventually, all packets delivered
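
The execution loop on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, its `seen` state, and the forwarding rule are hypothetical and are not taken from the systems discussed in the talk.

```python
from collections import deque

class Node:
    """Hypothetical node whose state is the list of payloads it has seen."""
    def __init__(self, name, peers):
        self.name = name
        self.peers = peers      # names of the nodes this node forwards to
        self.seen = []          # node state

    def handle(self, payload):
        """Event handler: update local state, return new (dest, payload) messages."""
        if payload in self.seen:
            return []           # already processed, so send nothing
        self.seen.append(payload)
        return [(peer, payload) for peer in self.peers]

def run(nodes, initial_messages):
    """Deliver queued messages one at a time until none remain."""
    queue = deque(initial_messages)
    steps = 0
    while queue:                                   # 3. repeat
        dest, payload = queue.popleft()            # 1. node gets a message event
        queue.extend(nodes[dest].handle(payload))  # 2. handler updates state, sends
        steps += 1
    return steps

nodes = {"n1": Node("n1", ["n2"]), "n2": Node("n2", ["n1"])}
steps = run(nodes, [("n1", "hello")])
```

Here the queue always delivers in FIFO order; the reordering, loss, and failures listed under Challenges are exactly what this simple loop leaves out.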

17 Pastry [Rowstron & Druschel ’01]: Key-value store distributed across nodes, organized in ring topology

18 Nodes Leave and Rejoin: Leaves

19 Nodes Leave and Rejoin: Detect, Reconnect

20 Nodes Leave and Rejoin: Returns

21 Nodes Leave and Rejoin: Asks for Neighbors

22 Nodes Leave and Rejoin: Rejoins Neighbors

23 But Sometimes... Asks for Neighbors: Query bounces back! Node forever unable to rejoin...

24 How to find? How to reproduce? How to fix?

25 For each domain: 1. Formalize Properties 2. Automate Analysis 3. Build Tools Distributed Systems:

26 1. Formalize Properties: System, Liveness, Bugs

27 States and Transitions. System: Nodes exchanging messages

28 State: Snapshot of system. [Figure: two states over nodes 1 and 2, joined by the transition event@1] At each state, scheduler chooses 1. Node n 2. Event @n 3. Executes code (C++)

29 The Space of System Executions. [Figure: states over nodes 1 and 2, starting at the Initial State, with transitions labeled event@1, event@2, fail@1, event@1, fail@2, event@2] At each state, scheduler chooses 1. Node n 2. Event @n 3. Executes code (C++)

30 An Execution = Sequence of Choices. [Figure: states over nodes 1 and 2, with transitions labeled event@1, event@2, fail@1, event@1, fail@2, event@1]
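
Treating an execution as a sequence of choices means any run can be replayed exactly from its list of (node, event) pairs. The sketch below assumes a toy system in which each node's state is a counter, "event" increments it, and "fail" resets it; only the choice labels come from the slide, everything else is hypothetical.

```python
def step(state, choice):
    """Apply one scheduler choice (node, event) to an immutable system state."""
    node, event = choice
    counters = dict(state)
    if event == "event":
        counters[node] += 1      # the handler updates the node's state
    elif event == "fail":
        counters[node] = 0       # a failure wipes the node's state
    else:
        raise ValueError(f"unknown event: {event}")
    return tuple(sorted(counters.items()))

def replay(initial, choices):
    """Return every state visited; the choices fully determine the execution."""
    states = [initial]
    for choice in choices:
        states.append(step(states[-1], choice))
    return states

initial = ((1, 0), (2, 0))   # nodes 1 and 2, both counters at zero
choices = [(1, "event"), (2, "event"), (1, "fail"),
           (1, "event"), (2, "fail"), (1, "event")]
trace = replay(initial, choices)
```

Because `step` is deterministic, storing the choices is enough to reproduce a bug; the states never need to be logged.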

34 1. Formalize Properties: System, Liveness, Bugs

35 Desired Properties: System must stay available. System must recover from failure

36 Liveness Properties: Eventually some good happens, i.e. eventually “P is true” (despite failures, ...). Examples: Eventually all nodes regroup. Eventually all data delivered. Eventually “nodes form ring”: Eventually $\forall n, m\ \exists i:\ n.\mathit{fwd}^{i} = m$
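
The “nodes form ring” predicate can be checked directly on a snapshot of the forward pointers. A minimal sketch, assuming the pointers are given as a hypothetical dictionary `fwd` from node name to successor name and that i ranges over 1..N for N nodes:

```python
def reaches(fwd, n, m):
    """True if following fwd pointers from n arrives at m within len(fwd) hops."""
    cur = n
    for _ in range(len(fwd)):
        cur = fwd[cur]
        if cur == m:
            return True
    return False

def forms_ring(fwd):
    """For all n, m there exists i such that applying fwd i times to n gives m."""
    return all(reaches(fwd, n, m) for n in fwd for m in fwd)

ring = {"a": "b", "b": "c", "c": "a"}
broken = {"a": "b", "b": "a", "c": "a"}   # c points into the ring, nobody points to c
```

The predicate only says which states are live. Whether a given state can still reach such a live state is a separate question, which the later slides address.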

37 1. Formalize Properties: System, Liveness, Bugs

38 Live States: States where some good happened, i.e. P is true

39 Live Executions: Execution goes from the Initial State into the Live States

40 Liveness Violations: Execution from the Initial State never reaches a live state

41 Liveness Violations: Two kinds

42 1. Unlucky Executions: At each step, some choice leads to liveness, but scheduler keeps making bad choices

43 2. Dead Executions: Execution reaches Dead States, from which no execution can reach Live States. Recovery is impossible

44 Idea 1: Focus on dead execution bugs. Because very severe (unluckiness may be fixed by scheduler). Because very plentiful. Because we can. Yet difficult to find, reproduce, fix
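
On a tiny, fully enumerated state graph the two kinds of violation can be told apart by backward reachability from the live states. The graph below is a made-up example, and real systems have far too many states for this explicit computation; it is meant only to make the definitions concrete.

```python
def can_reach_live(transitions, live):
    """Return the states from which at least one execution reaches a live state."""
    preds = {}
    for src, dsts in transitions.items():
        for dst in dsts:
            preds.setdefault(dst, set()).add(src)   # reverse edges
    recoverable = set(live)
    frontier = list(live)
    while frontier:
        state = frontier.pop()
        for pred in preds.get(state, ()):
            if pred not in recoverable:
                recoverable.add(pred)
                frontier.append(pred)
    return recoverable

# Hypothetical system: "a" may loop forever (unlucky) but can always recover;
# "b" and "c" only lead to each other, so they are dead.
transitions = {
    "init": ["a", "b"],
    "a": ["a", "live"],
    "b": ["c"],
    "c": ["b"],
    "live": [],
}
recoverable = can_reach_live(transitions, {"live"})
dead = set(transitions) - recoverable
```

An execution that stays in `a` forever is unlucky; one that enters `b` is dead, because no scheduler choice can bring it back.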

45 Distributed Systems: 1. Formalize Properties (System, Liveness, Bugs) 2. Automate Analysis 3. Build Tools

46 2. Automate Checking: Systematically search for dead executions (“Model Checking”). How to tell if execution is dead?

47 Model Checking: Systematically explore executions from the Initial State. Execution = sequence of choices. Iterate: over sequences of choices

48 Model Checking. Iterate: over sequences of choices. Until: hit a bad state

49 Model Checking [Verisoft 97, CMC 04, Chess 07]. Iterate: over sequences of choices. Until: hit a bad state. Safety Bugs: null dereference, buffer overflow, assertion failure

50 Model Checking [Verisoft 97, CMC 04, Chess 07]. Iterate: over sequences of choices. Until: hit a bad state (Safety Bugs: null dereference, buffer overflow, assertion failure). How to find liveness bugs? Until: hit a dead state. How to tell if state is dead? Property only says which states are live

51 Idea 2: Random Walks. How to tell if state is dead? Property only says which states are live. Execute long random walks from state. From a dead state: Pr[reaching live] = 0. Otherwise: Pr[reaching live] = 1
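
For safety bugs, the “iterate over sequences of choices until a bad state is hit” loop can be sketched as a bounded enumeration. The toy system here (a counter standing in for a two-slot buffer, with `send` and `recv` events) is an assumption made for illustration and is not how the cited checkers are implemented.

```python
from itertools import product

def find_bad_execution(initial, choices, step, is_bad, max_depth):
    """Try choice sequences in order of increasing length; return the first bad one."""
    for depth in range(1, max_depth + 1):
        for seq in product(choices, repeat=depth):
            state = initial
            for i, choice in enumerate(seq):
                state = step(state, choice)
                if is_bad(state):
                    return list(seq[: i + 1])
    return None   # no bad state within the depth bound

CAPACITY = 2   # hypothetical buffer size

def step(queued, choice):
    """State is the number of queued messages."""
    if choice == "send":
        return queued + 1
    return max(0, queued - 1)   # "recv" removes one message if any is queued

def overflow(queued):
    return queued > CAPACITY

trace = find_bad_execution(0, ("send", "recv"), step, overflow, max_depth=4)
shallow = find_bad_execution(0, ("send", "recv"), step, overflow, max_depth=2)
```

The same loop cannot find liveness bugs as written, because there is no `is_bad` predicate for dead states: the property only identifies live ones.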

52 Executions and Random Walks. At each execution step: 1. Scheduler picks node n 2. Scheduler picks event @n 3. Executes event code. Random Walk: Scheduler picks randomly, from some probability distribution over nodes, events

53 Algorithm = Search + Random Walks. Iterate: 1. Systematic Search: find candidates 2. Random Walk: test if candidate dead

54 Algorithm = Search + Random Walks. If walk length >> avg. steps to liveness (e.g. 100k events vs. 1k events), then non-live walk is likely liveness bug! How to pinpoint bug? 100,000-step execution (2 GB log file)

55 Idea 3: The Critical Transition: System transitions from a recoverable to a dead state. How to find Critical Transition without knowing Dead States?

56 Idea 3: The Critical Transition: Binary Search using Random Walks

57 Idea 3: The Critical Transition: Binary Search using Random Walks

58 Idea 3: The Critical Transition: System transitions from a recoverable to a dead state. Pinpoints bug

59 Recap. Dead Executions: System has shot itself (but doesn't know it). Systematic Search: Finds candidate dead states. Random Walks: Determine if candidate is dead. Critical Transition: Event that makes recovery impossible
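
A random walk simply replaces the scheduler's choice with a random draw. The sketch below uses a hypothetical five-state graph and a uniform distribution over successors; a long walk that never becomes live is evidence, not proof, that its starting state is dead.

```python
import random

# Hypothetical toy system: each state maps to the states one event can lead to.
SUCCESSORS = {
    "init": ["a", "b"],
    "a": ["a", "live"],   # recoverable: some choice always leads to liveness
    "b": ["c"],           # dead: b and c only lead to each other
    "c": ["b"],
    "live": [],
}

def walk_reaches_live(state, max_steps, rng):
    """Follow uniformly random choices; report whether a live state was hit."""
    for _ in range(max_steps):
        if state == "live":
            return True
        state = rng.choice(SUCCESSORS[state])
    return state == "live"

rng = random.Random(0)
from_a = walk_reaches_live("a", 200, rng)
from_b = walk_reaches_live("b", 200, rng)
```

From `a`, each step reaches liveness with probability 1/2, so a 200-step walk that stays non-live is vanishingly unlikely; from `b` it is guaranteed.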
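
Combining the two steps gives a small search-plus-walks loop: enumerate states systematically up to a depth bound, then test each one with several long random walks. The graph, the bounds, and the function names are illustrative assumptions, not the actual checker.

```python
import random
from collections import deque

# Hypothetical toy system: b and c are dead, everything else can recover.
SUCCESSORS = {
    "init": ["a", "b"],
    "a": ["a", "live"],
    "b": ["c"],
    "c": ["b"],
    "live": [],
}

def reachable(initial, max_depth):
    """Systematic search: breadth-first enumeration up to max_depth."""
    seen = {initial}
    frontier = deque([(initial, 0)])
    while frontier:
        state, depth = frontier.popleft()
        if depth < max_depth:
            for nxt in SUCCESSORS[state]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return seen

def walk_reaches_live(state, max_steps, rng):
    """One random walk; True if it hits the live state."""
    for _ in range(max_steps):
        if state == "live":
            return True
        state = rng.choice(SUCCESSORS[state])
    return state == "live"

def suspected_dead(initial, max_depth, walk_len, walks, rng):
    """Flag states from which none of the random walks became live."""
    suspects = set()
    for state in reachable(initial, max_depth):
        if not any(walk_reaches_live(state, walk_len, rng) for _ in range(walks)):
            suspects.add(state)
    return suspects

suspects = suspected_dead("init", max_depth=3, walk_len=200, walks=50,
                          rng=random.Random(0))
```

The walk length (200) is far larger than the average number of steps to liveness from a recoverable state (about 2 here), which is the condition the slide gives for treating a non-live walk as a likely bug.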
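
Since every state after a dead state is also dead, and every state before a recoverable state is recoverable, the critical transition in a long trace can be located by binary search with random walks as the deadness test. The chain-shaped system and the constant `DEAD_FROM` below are hypothetical, chosen so the answer is known in advance.

```python
import random

DEAD_FROM = 6   # hypothetical: states 0..5 can still recover, states 6..9 cannot

def successors(state):
    if state == "live":
        return []
    if state < DEAD_FROM:
        return [state + 1, "live"]
    return [min(state + 1, 9)]

def looks_dead(state, rng, walks=60, walk_len=50):
    """Deadness test: no random walk from this state ever becomes live."""
    for _walk in range(walks):
        cur = state
        for _step in range(walk_len):
            if cur == "live":
                break
            cur = rng.choice(successors(cur))
        if cur == "live":
            return False
    return True

def critical_transition(trace, rng):
    """Return k such that trace[k - 1] looks recoverable and trace[k] looks dead."""
    lo, hi = 0, len(trace) - 1   # invariant: trace[lo] recoverable, trace[hi] dead
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if looks_dead(trace[mid], rng):
            hi = mid
        else:
            lo = mid
    return hi

trace = list(range(10))          # an execution that ends in a dead state
k = critical_transition(trace, random.Random(0))
```

For a 100,000-step trace this needs about 17 deadness tests rather than one per step.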

60 For each domain: 1. Formalize Properties 2. Automate Analysis 3. Build Tools Distributed Systems:

61 MaceMC: First Liveness Checker for Systems Code. In: Mace (C++) System, Liveness Properties. Out: Liveness Bugs, Critical Transition

62 MaceMC [NSDI 07]: First Liveness Checker for Systems Code. In: Mace (C++) System, Liveness Properties. Out: Liveness Bugs

63 Implementation Details (1/2): Random Walk Bias. Assign “likely” events higher weight, e.g. application > network > timer > fail. Bugs not missed: random walk only tests deadness. Live state reached sooner: error traces shorter, simpler

64 Implementation Details (2/2): Prefix-Based Search. Restart search after reaching liveness. Analyzes effect of failures in “steady” state

65 Systems Analyzed. RandTree: Random overlay tree with max degree. MaceTransport: User-level, reliable messaging service. Pastry: Key-based routing, using an overlay ring. Chord: Key-based routing, using an overlay ring

66 Liveness Properties. RandTree (random overlay tree with max degree): Eventually, all nodes form single tree. MaceTransport (user-level, reliable transport service): Eventually, all messages acknowledged. Pastry and Chord (key-based routing, using an overlay ring): Eventually, all nodes form a ring

67 Pastry Bug Understood: Node forever unable to rejoin...

68 Pastry Bug Understood: B sends C message about A
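
A biased random walk can be sketched with weighted sampling. The numeric weights below are invented; the slide only gives the ordering application > network > timer > fail.

```python
import random
from collections import Counter

# Hypothetical weights that respect the ordering on the slide.
WEIGHTS = {"application": 8, "network": 4, "timer": 2, "fail": 1}

def pick_event(enabled, rng):
    """Pick one enabled (node, kind) event, favouring the likelier kinds."""
    weights = [WEIGHTS[kind] for _node, kind in enabled]
    return rng.choices(enabled, weights=weights, k=1)[0]

enabled = [(1, "application"), (1, "network"), (2, "timer"), (2, "fail")]
rng = random.Random(0)
counts = Counter(pick_event(enabled, rng)[1] for _ in range(10_000))
```

Every weight stays positive, so no event is ever excluded from a walk; the bias only changes how quickly a typical walk reaches a live state.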

69 Pastry Bug Understood: A leaves. Ring reforms

70 Pastry Bug Understood: A returns

71 Pastry Bug Understood: C receives (stale) message about A. Updates routing information. Critical Transition!

72 Pastry Bug Understood: A’s rejoin requests bounced back. A forever unable to rejoin...

73 Also in Original Implementation: “Dropped JoinRequest on rapid rejoin problem: There was a problem with nodes not being able to quickly rejoin if they used the same NodeId. Didn’t find the cause of this bug, but can no longer reproduce.” (FreePastry README, “Changes since 1.4.2”) A “Protocol Level” Bug

74 Sample Bug: RandTree. Nodes with Child, Parent pointers. Property: Eventually nodes form tree

75 Sample Bug: RandTree. C requests to join under A. A sends ack. C fails and restarts. C ignores ack from A. C joins under B. Bug: System stuck as a DAG!

76 MaceMC: Finds failure, reordering bugs (hard to catch via regular testing)

77 Liveness Bugs Yield Safety Assertions. Dead States: Violate a priori unknown safety assertions. Critical Transition: Helps identify dead states; yields new safety assertions and bugs

78 New Safety Property: Chord. Nodes with Fwd, Back pointers. Property: Eventually nodes form a ring. Critical Transition: To dead state where n.back = n, n.fwd = m. New Safety Property: IF n.back = n THEN n.fwd = n
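
The end state of this scenario can be reproduced with plain child sets. The sketch assumes that A records C as a child when it sends the ack and that B is already a child of A; neither detail is stated on the slide, and the helper names are hypothetical.

```python
def parents_of(children):
    """Map each node to the set of nodes that list it as a child."""
    parents = {node: set() for node in children}
    for parent, kids in children.items():
        for kid in kids:
            parents[kid].add(parent)
    return parents

def is_tree(children, root):
    """Root has no parent, every other node has exactly one, all are reachable."""
    parents = parents_of(children)
    if parents[root]:
        return False
    if any(len(parents[n]) != 1 for n in children if n != root):
        return False
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(children[node])
    return seen == set(children)

children = {"A": {"B"}, "B": set(), "C": set()}
children["A"].add("C")   # assumed: A acks C's join request and records C as child
tree_after_ack = is_tree(children, "A")
# C fails, restarts, ignores the ack from A, and joins under B instead.
children["B"].add("C")
tree_after_rejoin = is_tree(children, "A")
```

After the rejoin, both A and B list C as a child, so the structure is a DAG rather than a tree, and nothing in the sketch ever removes the stale edge.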
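
The new safety property is cheap to check on every state. A minimal sketch, assuming each node's pointers are stored in a hypothetical dictionary with `back` and `fwd` entries:

```python
def invariant_violations(nodes):
    """Return the nodes where back points to self but fwd points elsewhere."""
    return [name for name, ptrs in nodes.items()
            if ptrs["back"] == name and ptrs["fwd"] != name]

# The dead state from the slide: n.back = n while n.fwd = m.
dead_state = {
    "n": {"back": "n", "fwd": "m"},
    "m": {"back": "n", "fwd": "n"},
}
# A single node that points to itself in both directions satisfies the property.
singleton = {"n": {"back": "n", "fwd": "n"}}
```

Once the critical transition has exposed this condition, it can be asserted as an ordinary safety property and caught by a plain safety search.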

79 New Safety Property: Chord. Hard to determine low-level safety properties in advance. Easy to determine high-level liveness properties in advance

80 MaceMC: Helps find safety assertions using liveness violations

81 Scorecard
System          Bugs   Liveness   Safety
MaceTransport   11     5          6
RandTree        17     12         5
Pastry          5      5          0
Chord           19     9          10
Total           52     31         21
Several “protocol level” bugs. Routinely used by Mace programmers. Zero false alarms

82 For each domain: 1. Formalize Properties 2. Automate Analysis 3. Build Tools Distributed Systems Other Domains

83 Software Verification: How to prove device drivers use kernel API properly? Lazy Abstraction [popl02, pldi04, icse04...]. Craig Interpolation [popl04, tacas06]. Liquid Types [pldi08, pldi09, popl10]

84 Multithread Analysis: How to prevent and control thread interference? Scalable Race Detection; Multithread Analysis = Sequential Analysis x Race Detection; Lock Allocation [popl07] [pldi08] [fse07]

85 Config Management: How to avoid “DLL hell”? Q: Can I install emacs? A: If you have X11 or Xorg (not both). NP-Complete: encode and solve via SAT [icse 07]. SAT solvers in Eclipse, Suse-Linux

86 Web 2.0 Security: How to prevent JavaScript from doing mischief? Staged Analysis for JavaScript [pldi 09]. Tracking Info Flow in Browser [?]

87 For each domain: 1. Formalize Properties 2. Automate Analysis 3. Build Tools. Analysis connects properties & code: analyze tricky corner cases + re-analyze as code evolves. Improve software reliability
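
The emacs question can be written as a Boolean constraint and solved by trying every assignment. The slide refers to SAT solvers; this sketch swaps in brute-force enumeration, which is only viable for a handful of packages, and the package names and rules are taken from the slide's example alone.

```python
from itertools import product

PACKAGES = ["emacs", "x11", "xorg"]

def consistent(install):
    """emacs needs X11 or Xorg, and X11 conflicts with Xorg."""
    needs_x = (not install["emacs"]) or install["x11"] or install["xorg"]
    no_conflict = not (install["x11"] and install["xorg"])
    return needs_x and no_conflict

def installations_with(package):
    """List every consistent set of installed packages that contains `package`."""
    result = []
    for bits in product([False, True], repeat=len(PACKAGES)):
        install = dict(zip(PACKAGES, bits))
        if install[package] and consistent(install):
            result.append({p for p in PACKAGES if install[p]})
    return result

options = installations_with("emacs")
```

Both answers include exactly one of the two X packages, matching “X11 or Xorg (not both)”.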

88 “ucsd progsys” (people, papers, code, demos, etc.)

