TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems
Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi
ASPLOS '16 4
- More people develop distributed systems
- Distributed systems are hard
- Hard largely because of concurrency
- Concurrency leads to unexpected timings: X should arrive before Y, but X can arrive after Y
- Unexpected timings lead to distributed concurrency (DC) bugs
ASPLOS '16 5
"... be able to reason about the correctness of increasingly more complex distributed systems that are used in production" - Azure engineers & managers, "Uncovering Bugs in Distributed Storage Systems during Testing (Not in Production!)" [FAST '16]
Understanding distributed-system bugs is important!
ASPLOS '16 6
DC bugs are caused by non-deterministic timing of concurrent events involving more than one node: messages, crashes, reboots, timeouts, computations.
ASPLOS '16 7
(LC bug: local concurrency bug in multi-threaded, single-machine software; the prior LC-bug study is a top-10 most-cited ASPLOS paper)
ASPLOS '16 8
104 bugs from 4 varied distributed systems (Cassandra, HBase, Hadoop MapReduce, ZooKeeper); for each bug in the study: description, source code, and patches
ASPLOS '16 9
TaxDC taxonomy:
- Trigger
  - Timing: order violation, atomicity violation, fault timing, reboot timing
  - Input: fault, reboot, workload
  - Scope: nodes, messages, protocols
- Error & Failure
  - Error: local (memory, semantic, hang, silence), global (wrong, miss, silence)
  - Failure: downtime, data loss, operation failure, performance
- Fix
  - Timing: global synchronization, local synchronization
  - Handling: retry, ignore, accept, others
ASPLOS '16 10
ZooKeeper example:
1. Follower F crashes, reboots, and joins the cluster
2. Leader L syncs a snapshot with F
3. A client requests a new update; F applies it only in memory
4. Sync finishes
5. A client requests another update; F writes this one to disk correctly
6. F crashes, reboots, and joins the cluster again
7. This time L sends only the diff after the update in step 5, so F loses the update from step 3
ASPLOS '16 11
The same ZooKeeper example (steps 1-7 above), classified in TaxDC:
- Timing: atomicity violation + fault timing
- Input: 4 protocols, 2 faults (crashes), 2 reboots
- Error: global
- Failure: data inconsistency
- Fix: delay the message
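To make the failure mode concrete, here is a minimal, hypothetical Python re-enactment of the scenario above (a toy model, not ZooKeeper's actual code; the Follower class, the u1/u2/u3 update names, and the crash_and_reboot helper are all invented):

```python
class Follower:
    """Toy model: self.disk survives crashes; self.mem does not."""
    def __init__(self):
        self.disk, self.mem = [], []

    def crash_and_reboot(self):
        self.mem = list(self.disk)   # steps 1 and 6: memory rebuilt from disk

f = Follower()
f.crash_and_reboot()                       # step 1: F rejoins
f.disk = ["u1"]; f.mem = ["u1"]            # step 2: snapshot sync brings u1
f.mem.append("u2")                         # step 3: u2 applied in memory ONLY
                                           # step 4: sync finishes
f.disk.append("u3"); f.mem.append("u3")    # step 5: u3 persisted correctly
f.crash_and_reboot()                       # step 6: u2 evaporates with memory
# Step 7: L diffs from F's last durable update (u3) and sends nothing more,
# so u2 is lost for good: the data-loss failure in the classification above.
assert "u2" not in f.mem and "u2" not in f.disk
```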
ASPLOS '16 12
(TaxDC taxonomy overview, as on slide 9)
ASPLOS '16 13
(TaxDC taxonomy, Trigger branch highlighted)
Trigger: the conditions that make bugs happen
ASPLOS '16 14
(TaxDC taxonomy, Trigger > Timing highlighted)
What: the untimely moment that makes a bug happen
Why: helps design bug-detection tools
ASPLOS '16 15
Trigger > Timing > Message: "Does the timing involve many messages?" Ex: MapReduce-3274
ASPLOS '16 16
Trigger > Timing > Message: order violation (44%)
Two events X and Y: Y must happen after X, but Y happens before X. Ex: MapReduce-3274
ASPLOS '16 17
Trigger > Timing > Message: order violation (44%), msg-msg race
Ex: MapReduce-3274 (a Kill message races with a Submit message)
ASPLOS '16 18
Msg-msg race sub-patterns: receive-receive race, send-send race, receive-send race
(Diagrams: an HBase example where a new key arrives late after the old key has expired, and MapReduce examples where a kill races with an end report: "kill what job?")
ASPLOS '16 19
Trigger > Timing > Message: order violation (44%): msg-msg race, msg-compute race
Order violation: two events X and Y; Y must happen after X, but Y happens before X. Ex: MapReduce-4157
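As a concrete illustration of an order violation, here is a small, hypothetical Python sketch in the spirit of the MapReduce examples above (TaskTracker, on_register, on_kill, and the task id are invented names, not Hadoop's real API):

```python
class TaskTracker:
    """Invented stand-in for a worker node's task table."""
    def __init__(self):
        self.running = {}

    def on_register(self, task_id):
        # X: the task announces itself and starts running.
        self.running[task_id] = "RUNNING"

    def on_kill(self, task_id):
        # Y: assumed to arrive after X. If Y races ahead of X, there is
        # nothing to kill yet, and the task later runs as a zombie.
        self.running.pop(task_id, None)

t = TaskTracker()
t.on_kill("task-1")        # Y delivered first: the unexpected order
t.on_register("task-1")    # X delivered second
assert t.running["task-1"] == "RUNNING"   # order violation: zombie task
```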
ASPLOS '16 20
Trigger > Timing > Message: atomicity violation (20%)
A message comes in the middle of an atomic operation. Ex: Cassandra-1011, HBase-4729, MapReduce-5009, ZooKeeper-1496
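A minimal, hypothetical sketch of this pattern (the RegionServer, open_region, and split-request names are invented, loosely in the spirit of the HBase example; not any system's real code). The key point is that the "atomic" operation spans two event-loop steps, and a message handler can fire between them:

```python
class RegionServer:
    """Toy event-loop node: handlers run to completion, but a multi-step
    operation may be split across events, leaving a window in between."""
    def __init__(self):
        self.regions = {}

    def open_region_part1(self, name):
        self.regions[name] = "INIT"      # first half of the "atomic" open

    def open_region_part2(self, name):
        self.regions[name] = "OPEN"      # second half

    def on_split_request(self, name):
        # Bug: this handler assumes any region it sees is fully open.
        if self.regions.get(name) != "OPEN":
            raise RuntimeError("split arrived mid-open: inconsistent state")
        # ... split logic would go here ...

rs = RegionServer()
rs.open_region_part1("r1")
try:
    rs.on_split_request("r1")   # message arrives between the two halves
except RuntimeError as e:
    print("atomicity violation:", e)
rs.open_region_part2("r1")
```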
ASPLOS '16 21
Trigger > Timing: fault timing (21%)
A fault (e.g., a crash) at a specific timing. Ex: Cassandra-6415, HBase-5806, MapReduce-3858, ZooKeeper-1653
There is no fault timing in LC bugs; it appears only in DC bugs
ASPLOS '16 22
Trigger > Timing: reboot timing (11%)
A reboot at a specific timing. Ex: Cassandra-2083, Hadoop-3186, MapReduce-5489, ZooKeeper-975
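A tiny, hypothetical sketch of why fault and reboot timing matter: the same crash is harmless at one point and buggy at another (the Node class, commit_pair, and the A/B keys are invented):

```python
class Node:
    """Toy node: self.disk survives a crash; nothing else does."""
    def __init__(self):
        self.disk = {}

    def commit_pair(self, crash_between=False):
        self.disk["A"] = True
        if crash_between:                 # a crash at this exact point...
            raise SystemExit("crash")
        self.disk["B"] = True             # ...leaves A without B

n = Node()
try:
    n.commit_pair(crash_between=True)     # fault injected at the bad timing
except SystemExit:
    pass                                  # reboot: disk survives as-is
# Recovery code that assumes A and B are always written together now
# mis-recovers; the same crash before "A" or after "B" would be harmless.
assert "A" in n.disk and "B" not in n.disk
```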
ASPLOS '16 23
Trigger > Timing: mixed timing (4%)
Ex: the ZooKeeper example above combines an atomicity violation (the update in step 3 arrives in the middle of the snapshot sync) with fault timing (the crash in step 6), leading to the data-loss failure.
ASPLOS '16 24
Trigger > Timing summary: message timing, fault timing, reboot timing
Implication: these simple patterns can inform pattern-based bug-detection tools, etc.
ASPLOS '16 25
(TaxDC taxonomy, Trigger > Input highlighted)
What: the input needed to exercise the buggy code
Why: improve testing coverage
ASPLOS '16 26
Trigger > Input, on the ZooKeeper example (steps 1-7 above): the fault & reboot input is 2 crashes and 2 reboots
ASPLOS '16 27
Trigger > Input: fault
"How many bugs require fault injection?" 37% = no fault, 63% = yes
"What kind of fault, and how many times?" Timeouts: 88% = none, 12% = some. Crashes: 53% = none, 35% = 1 crash, 12% = multiple crashes
Real-world DC bugs are NOT just about message re-ordering, but about faults as well
ASPLOS '16 28
Trigger > Input: reboot
"How many reboots?" 73% = no reboot, 20% = 1 reboot, 7% = multiple reboots
ASPLOS '16 29
Trigger > Input: workload. Ex: a Cassandra Paxos bug (Cassandra-6023) needs 3 concurrent user requests!
"How many protocols must run as input?" 20% = 1 protocol, 80% = 2+ protocols
Implication: DC testing should drive multiple protocols concurrently
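A minimal, hypothetical sketch of that implication: a test driver that releases several client requests at once so their protocol messages can interleave, as in the Cassandra-6023 description above (run_concurrently and the toy put requests are invented):

```python
import threading

def run_concurrently(requests):
    """Release all requests at the same instant so their messages interleave."""
    barrier = threading.Barrier(len(requests))
    results, lock, threads = [], threading.Lock(), []

    def worker(req):
        barrier.wait()                 # all requests start together
        r = req()
        with lock:
            results.append(r)

    for req in requests:
        t = threading.Thread(target=worker, args=(req,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return results

# Toy usage: three racing "user requests" against a fake store.
store = {}
def put(k, v):
    def req():
        store[k] = v
        return (k, v)
    return req

print(run_concurrently([put("a", 1), put("a", 2), put("a", 3)]))
```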
ASPLOS '16 30
(TaxDC taxonomy, Error & Failure branch highlighted)
What: the first effect of the untimely ordering
Why: help failure diagnosis and bug detection
ASPLOS '16 31
Local error: the error can be observed in one triggering node (46%): null pointer, false assertion, etc.
Implication: identifies opportunities for failure diagnosis and bug detection
ASPLOS '16 32
Global error: the error cannot be observed in any single node (54%)
Many are silent errors and hard to diagnose (hidden errors, no error messages, long debugging)
ASPLOS '16 33
(TaxDC taxonomy, Fix branch highlighted)
What: how developers fix these bugs
Why: help design runtime prevention and automatic patch generation
ASPLOS '16 34
Fix: "Are patches complicated? Are patches adding synchronization?"
Complex fixes add global synchronization (similar to fixing LC bugs by adding synchronization, e.g. lock()) or add new states & transitions
ASPLOS '16 35-38
Fix: many patches are simple, following four patterns:
1. Delay: postpone handling the untimely message
2. Ignore/discard: drop the untimely message
3. Retry: re-deliver the message later
4. Accept: make the existing handler (f(msg); g(msg);) tolerate the unexpected order
ASPLOS '16 39
Simple fixes: delay, ignore, retry, accept. 40% of the fixes are easy (no new computation logic)
Implication: many fixes can inform automatic runtime prevention (a sketch of the four patterns follows)
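Here is a minimal, hypothetical sketch of the four patterns applied inside a single message handler (the Node class, epoch field, and ready/stale predicates are all invented; f(msg); g(msg); from the slide is stood in by apply()):

```python
import collections

class TransientError(Exception):
    pass

class Node:
    """Hypothetical node; the four fix patterns are the point, not this class."""
    def __init__(self):
        self.queue = collections.deque()
        self.epoch = 1
        self.state = "INIT"

    def ready_for(self, msg):
        return self.state != "INIT"       # precondition for handling

    def is_stale(self, msg):
        return msg["epoch"] < self.epoch  # message from an old epoch

    def handle(self, msg):
        if not self.ready_for(msg):
            self.queue.append(msg)        # (1) delay until preconditions hold
            return
        if self.is_stale(msg):
            return                        # (2) ignore/discard
        try:
            self.apply(msg)               # existing logic (the slide's f/g)
        except TransientError:
            self.queue.append(msg)        # (3) retry later
        # (4) accept: alternatively, rewrite apply() so the unexpected
        #     order becomes a legal transition (this one needs new logic).

    def apply(self, msg):
        self.state = msg["op"]

n = Node()
n.handle({"op": "update", "epoch": 1})    # not ready: delayed, not dropped
n.state = "READY"
while n.queue:
    n.handle(n.queue.popleft())           # re-delivery drains the delayed msg
assert n.state == "update"
```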
ASPLOS '16 40
Fixing DC bugs vs. LC bugs: the simple DC patterns (delay, ignore/discard, retry, accept) contrasted with adding synchronization
ASPLOS '16 41
Research opportunities: distributed system model checkers, formal verification, DC bug detection, runtime failure prevention
ASPLOS '16 42
Distributed system model checkers. Reality: the events include messages, crashes, multiple crashes, reboots, multiple reboots, timeouts, computation, and disk faults
State of the art: MoDist [NSDI '09], Demeter [SOSP '11], MaceMC [NSDI '07], SAMC [OSDI '14]
Let's find out how to re-order all these events without exploding the state space!
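To ground the idea, here is a minimal, hypothetical sketch of the core loop of such a model checker: enumerate event orderings and check an invariant in each state (the explore function, the toy register/kill events, and the invariant are all invented; real checkers like SAMC add semantic-aware reduction precisely because this brute force explodes factorially):

```python
import itertools

def explore(events, apply_event, init_state, invariant):
    """Brute-force all interleavings of pending events; report orderings
    that violate the invariant. Real dmcks prune this n! space."""
    bugs = []
    for order in itertools.permutations(events):
        state = init_state()
        for ev in order:
            apply_event(state, ev)
            if not invariant(state):
                bugs.append(order)
                break
    return bugs

# Toy use: two messages the code silently assumes arrive in order.
def init(): return {"registered": False, "zombie": False}
def step(s, ev):
    if ev == "register": s["registered"] = True
    if ev == "kill" and not s["registered"]: s["zombie"] = True
def ok(s): return not s["zombie"]

print(explore(["register", "kill"], step, init, ok))  # finds ('kill', 'register')
```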
ASPLOS '16 43
Formal verification, state of the art: Verdi [PLDI '15] (Raft update protocol, ~6,000 lines of proof), IronFleet [SOSP '15] (Paxos update, lease-based read/write, ~5,000-10,000 lines of proof); these verify only foreground protocols
Challenge: foreground & background protocol interactions. 20% of the bugs involve 1 protocol, 80% involve 2+ protocols (19% = foreground, 52% = background, 29% = a mix)
Let's find out how to better verify more protocol interactions!
ASPLOS '16 44
DC bug detection. State of the art in LC bug detection: pattern-based detection, error-based detection, statistical bug detection
Opportunities for DC bug detection: pattern-based (message, fault, and reboot timing patterns) and error-based (53% of errors are explicit, 47% silent)
Let's leverage these timing patterns and explicit errors to do DC bug detection!
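As one concrete direction, here is a small, hypothetical sketch of pattern-based detection of msg-msg races: compute vector clocks over a message trace and flag deliveries to the same node whose sends are causally unordered (the trace format, node names, and helper functions are all invented, not any tool's real API):

```python
from collections import defaultdict

def happens_before(a, b):
    """Vector-clock comparison: a -> b iff a <= b componentwise and a != b."""
    return all(t <= b.get(n, 0) for n, t in a.items()) and a != b

def detect_msg_races(trace, nodes):
    """trace: list of (src, dst, label) in delivery order (invented format).
    Flags deliveries at the same node whose sends are causally unordered:
    the network could have delivered them in either order."""
    clock = {n: defaultdict(int) for n in nodes}
    delivered = []
    for src, dst, label in trace:
        clock[src][src] += 1
        send = dict(clock[src])                   # snapshot at send time
        for n, t in send.items():                 # receive merges the clock
            clock[dst][n] = max(clock[dst][n], t)
        delivered.append((dst, label, send))
    races = []
    for i in range(len(delivered)):
        for j in range(i + 1, len(delivered)):
            (d1, m1, c1), (d2, m2, c2) = delivered[i], delivered[j]
            if d1 == d2 and not happens_before(c1, c2) \
                        and not happens_before(c2, c1):
                races.append((m1, m2))
    return races

# Toy trace: a submit and a kill from different nodes race at the tracker.
trace = [("client", "tracker", "submit"), ("master", "tracker", "kill")]
print(detect_msg_races(trace, {"client", "master", "tracker"}))
```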
ASPLOS '16 45
Runtime failure prevention. State of the art in LC bug prevention: Deadlock Immunity [OSDI '08], Aviso [ASPLOS '13], ConAir [ASPLOS '13], etc.
Opportunity for DC bug prevention: 40% of the fixes are simple, 60% complex
Let's build runtime prevention techniques that leverage this simplicity!
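A minimal, hypothetical sketch of such a technique: an interposition layer that, given a known racy ordering pattern, holds back the untimely message until its precondition holds, automating the "delay" fix that many of the studied patches apply by hand (the Interposer class, message format, and precondition are all invented):

```python
class Interposer:
    """Toy message-delivery shim: delays messages whose precondition
    (a known safe-ordering pattern) does not yet hold."""
    def __init__(self, precondition):
        self.precondition = precondition
        self.held = []
        self.seen = set()

    def deliver(self, msg, handler):
        self.seen.add(msg["type"])
        if self.precondition(msg, self.seen):
            handler(msg)
            self.flush(handler)            # a delivery may unblock held msgs
        else:
            self.held.append(msg)          # delay the untimely message

    def flush(self, handler):
        ready = [m for m in self.held if self.precondition(m, self.seen)]
        self.held = [m for m in self.held if m not in ready]
        for m in ready:
            handler(m)

# Known pattern: a "kill" must not be handled before its "register".
pre = lambda m, seen: m["type"] != "kill" or "register" in seen
ix = Interposer(pre)
log = []
ix.deliver({"type": "kill"}, log.append)       # held back
ix.deliver({"type": "register"}, log.append)   # delivered, then kill flushed
assert [m["type"] for m in log] == ["register", "kill"]
```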
ASPLOS '16 46
"Why seriously address DC bugs now?" Everything is distributed and large-scale; DC bugs are not uncommon!
"Why is tackling DC bugs possible now?"
1. Open access to source code
2. Pervasive documentation
3. Detailed bug descriptions