Effectively Model Checking Real-World Distributed Systems
Junfeng Yang
Joint work with Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Lintao Zhang, Heming Cui, Jingyue Wu, Chia-che Tsai, John Gallagher
One-Slide Summary
Distributed systems: important, but hard to get right
Model checking: finds serious bugs, but is slow
Dynamic Interface Reduction: a new type of state-space reduction technique, the first in 25 years [DeMeter SOSP 11]
– exponentially speeds up model checking
– one data point: 34 years → 18 hours
Stable Multithreading: a radically new approach [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
– what-you-check-is-what-you-run
– billions of years → 7 hours
– https://github.com/columbia/smt-mc
Distributed Systems: Pervasive and Critical
Distributed Systems: Hard to Get Right
No node has a centralized view of the entire system
Code must correctly handle many failures
– link failures, network partitions
– message loss, delay, or reordering
– machine crashes
Worse: as systems grow larger and more geo-distributed, weird failures become more likely
Complex protocols, even more complex code, bugs
Model Checking Distributed Systems Implementations
Choices of actions
– send message
– recv message
– run thread
– delay message
– fail link
– crash machine
– …
Run checkers on states
– e.g., assertions
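The explore-choices-then-check loop above can be sketched as a toy explicit-state model checker. This is hypothetical illustration code, not MoDist's actual engine; the names `explore`, `actions`, and `checker` are made up for this sketch.

```python
def explore(state, actions, checker, trace=()):
    """Depth-first search over all interleavings of enabled actions.

    state:   immutable snapshot of the whole system
    actions: maps a state to a list of (label, next_state) choices
    checker: returns False when an invariant is violated
    Returns every trace (tuple of action labels) that reaches a violation.
    """
    bugs = []
    if not checker(state):
        bugs.append(trace)
    for label, nxt in actions(state):
        bugs.extend(explore(nxt, actions, checker, trace + (label,)))
    return bugs

# Toy system: two nodes each increment a shared counter once; the
# (deliberately wrong) invariant claims the counter never reaches 2.
def actions(state):
    done, counter = state
    return [(f"run node {i}", (done | {i}, counter + 1))
            for i in (0, 1) if i not in done]

violations = explore((frozenset(), 0), actions,
                     checker=lambda s: s[1] < 2)
# Both interleavings (node 0 first, node 1 first) hit the violation.
```

A real checker like MoDist intercepts sends, receives, timers, and failures at the OS interface instead of operating on an abstract `actions` function, but the exploration structure is the same.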
Good Error Detection Results
E.g., [MoDist NSDI 09] [dBug SSV 10]
– Easy: check unmodified, real code in its native environment ("in-situ" [eXplode OSDI 06])
– Comprehensive: check many corner cases
– Deterministic: detected errors can be replayed
MoDist results
– Checked Berkeley DB replication, MPS (a Microsoft production system), PacificA
– Found 35 bugs, 10 of them protocol flaws; flaws found in every system checked
– Transferred to Microsoft product groups
But: The State Explosion Problem
Real-world distributed systems have too many states to explore completely
– even for conceptually small state spaces
– 3-node MPS: 34 years for MoDist!
Incompleteness → low assurance
Prior model checkers explored many redundant states
This Talk: Two Techniques to Effectively Shrink the State Space
Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP 11]
– 34 years → 18 hours, a 10^5 reduction
Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13] to make what-you-check what-you-run (ongoing)
Dynamic Interface Reduction (DIR)
Insight: system builders decompose a system into components with narrow interfaces
– e.g., [Clarke, Long, McMillan 87] [Laster, Grumberg 98]
Distinguish global and local actions
Check local actions via a conceptually local fork()

// main            // ckpt
n = recv()         Log(total)
total += n
Send(n)
Reduction Analysis
N components, each having M local actions
– w/o DIR: M * M * … * M = M^N
– w/ DIR: M + M + … + M = M * N
Exponential reduction
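The M^N versus M*N counting argument can be made concrete by enumerating both explorations for small M and N (a toy sketch; the numbers are illustrative, not from the paper):

```python
from itertools import product

M, N = 4, 3  # M local actions per component, N components

# Without DIR: the global explorer must consider every combination of
# the components' local choices, i.e. the full cross product.
global_runs = list(product(range(M), repeat=N))
assert len(global_runs) == M ** N   # 4^3 = 64 runs

# With DIR: each component's local explorer runs alone against its
# projected message traces, so the work is additive across components.
local_runs = [(comp, act) for comp in range(N) for act in range(M)]
assert len(local_runs) == M * N     # 4 * 3 = 12 runs
```

At realistic sizes the gap is what turns 34 years of exploration into hours: M^N grows exponentially in the number of components while M*N grows linearly.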
Challenge in Implementing DIR
How to automatically compute interfaces from real code without causing false positives or missing bugs?
Manual specs: tedious, costly, error-prone
– required by prior compositional or modular model-checking work
Made-up interfaces: difficult-to-diagnose false positives [Guerraoui and Yabandeh, NSDI 11]
Automatically Discover the Interface by Running the Code
Global explorer: explores global actions
Local explorers: explore local actions
The explorers exchange message traces
Insight: message traces collectively define the interfaces
Example
Client C:
  if (Toss(2) == 0) {
    Send(P, 1); Send(P, 2);
  } else {
    Send(P, 1); Send(P, 3);
  }
Primary P:
  // main               // ckpt
  while (n = recv()) {  Log(total)
    total += n
    Send(S, n)
  }
Secondary S:
  // main               // ckpt
  while (n = recv()) {  Log(total)
    total += n
  }
Global Explorer: Compute Initial Global Trace
  C.Toss(2) = 0
  C.Send(P, 1)
  P.Recv(C, 1)
  P.Log
  P.total += 1
  P.Send(S, 1)
  S.Recv(P, 1)
  S.Log
  S.total += 1
  C.Send(P, 2)
  P.Recv(C, 2)
  P.total += 2
  P.Send(S, 2)
  S.Recv(P, 2)
  S.total += 2
Global Explorer: Project Message Traces
Project the global trace onto each component's message trace:
  C: C.Send(P, 1), C.Send(P, 2)
  P: P.Recv(C, 1), P.Send(S, 1), P.Recv(C, 2), P.Send(S, 2)
  S: S.Recv(P, 1), S.Recv(P, 2)
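The projection step for the C/P/S example can be sketched in a few lines (illustration only; the event encoding is made up for this sketch, and `project` is not a DeMeter API):

```python
# Each event is (component, kind, detail); Send/Recv are message events,
# everything else (Toss, Log, total updates) is local.
GLOBAL_TRACE = [
    ("C", "Toss", 0), ("C", "Send", ("P", 1)), ("P", "Recv", ("C", 1)),
    ("P", "Log", None), ("P", "Local", "total+=1"), ("P", "Send", ("S", 1)),
    ("S", "Recv", ("P", 1)), ("S", "Log", None), ("S", "Local", "total+=1"),
    ("C", "Send", ("P", 2)), ("P", "Recv", ("C", 2)),
    ("P", "Local", "total+=2"), ("P", "Send", ("S", 2)),
    ("S", "Recv", ("P", 2)), ("S", "Local", "total+=2"),
]

def project(trace, component):
    """Keep only one component's message events, in order.

    The projected trace is exactly the interface that component's
    local explorer replays while it varies local actions.
    """
    return [(who, kind, detail) for who, kind, detail in trace
            if who == component and kind in ("Send", "Recv")]

p_trace = project(GLOBAL_TRACE, "P")
# P's interface: Recv(C,1), Send(S,1), Recv(C,2), Send(S,2)
```

Because local actions are filtered out, two global traces that differ only in local interleavings project to the same message traces, which is what lets DIR avoid re-exploring them globally.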
Local Explorers: Explore Local Actions Using Message Traces
Each local explorer replays its component's message trace and explores only that component's local actions
Local Explorer of Primary: Explore Local Trace 1
Holding P's message trace fixed, explore one interleaving of P's local actions: P.Log, P.total += 1, P.total += 2
Local Explorer of Primary: Explore Local Trace 2
Another interleaving of the same local actions against the same message trace
Local Explorer of Primary: Explore Local Trace 3
A third interleaving: the checkpoint thread's P.Log can run before, between, or after the increments
Local Explorer of Client
Replays C's message trace: C.Send(P, 1), C.Send(P, 2)
Local Explorer of Client
Explores C's local action C.Toss(2) = 0, which is consistent with the known message trace
Local Explorer of Client Found a New Message Trace
Trying C.Toss(2) = 1 yields C.Send(P, 1), C.Send(P, 3), which diverges from the known message trace, so it is reported to the global explorer
Global Explorer: Composition
Compose C's new message trace with the behaviors of P and S to form a new global run
Global Explorer: New Global Trace
Replay the common prefix, then follow C's new choice:
  C.Toss(2) = 1
  C.Send(P, 1)
  P.Recv(C, 1)
  P.Log
  P.total += 1
  P.Send(S, 1)
  S.Recv(P, 1)
  S.Log
  S.total += 1
  C.Send(P, 3)
  …
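The local-explorer half of the loop just walked through can be sketched for the client (a hypothetical sketch of the idea, not DeMeter code; `client_sends`, `known_traces`, and `new_traces` are invented names):

```python
def client_sends(toss):
    """Client C's send sequence for a given Toss(2) outcome."""
    return [("P", 1), ("P", 2)] if toss == 0 else [("P", 1), ("P", 3)]

# Message traces the global explorer has already seen for C.
known_traces = {tuple(client_sends(0))}   # from the initial global run
new_traces = set()

# C's local explorer tries every outcome of its local choice Toss(2).
for toss in (0, 1):
    trace = tuple(client_sends(toss))
    if trace not in known_traces:
        # The local run diverged from every known message trace:
        # report it so the global explorer can compose a new global run.
        new_traces.add(trace)
```

This is the fixed-point structure of DIR: local explorers report divergent message traces, the global explorer composes new global traces from them and re-projects, and exploration stops when no explorer discovers anything new.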
Implementation
7,279 lines of C++
Integrated DIR with:
– MoDist [MoDist NSDI 09]: 757 lines
– MaceMC [MaceMC NSDI 07]: 1,114 lines
– easy to integrate
Orthogonal to partial-order reduction, via vector-clock tricks
Verification/Reduction Results
– MPS: Microsoft production system
– BDB: Berkeley DB replication
– Chord: Chord implementation in Mace
– *-n: n nodes
– results for other benchmarks in [DeMeter SOSP 11]
(Table: per-app state-space reduction and speedup, with columns MPS-2, MPS-3, BDB-2, BDB-3 under DIR-MoDist and Chord-2, Chord-3 under DIR-MaceMC)
DIR Summary
Proven sound (introduces no false positives) and complete (introduces no false negatives)
Fully automatic; real, exponential reduction
Works seamlessly with existing model checkers
– integrated into MoDist and MaceMC; easy
Results
– verified instances of real-world systems
– empirically observed large reductions: 34 years → 18 hours (10^5) on MPS
This Talk: Two Techniques to Effectively Reduce the State Space
Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP 11]
– 34 years → 18 hours, a 10^5 reduction
Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13] to make what-you-check what-you-run (ongoing)
Threads: Difficult to Model Check
Many thread interleavings, or schedules
– to verify, a local explorer must explore all schedules
Wide interfaces between threads
– any shared-memory load/store
– tracing loads/stores is costly
– DIR may not work well
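To see why "many schedules" defeats exhaustive checking, the count of distinct interleavings of n threads with k atomic steps each is the multinomial (n*k)! / (k!)^n, which this short sketch computes (the formula is standard; the function name is ours):

```python
from math import factorial

def interleavings(n_threads, steps):
    """Number of distinct interleavings of n threads, each executing
    `steps` atomic actions: (n*steps)! / (steps!)^n."""
    return factorial(n_threads * steps) // factorial(steps) ** n_threads

# Even tiny programs explode:
assert interleavings(2, 10) == 184_756          # 2 threads x 10 steps
big = interleavings(4, 10)                      # 4 threads x 10 steps
assert big > 10 ** 21                           # over a sextillion schedules
```

Since every shared-memory load/store can be a scheduling point, real programs have far more than 10 steps per thread, which is why the slide's "billions of years" figure is not hyperbole.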
What-You-Check Is What-You-Run
Coverage = C/R, where R is the set of all possible runtime schedules and C is the set of model-checked schedules
Reduction: enlarge C by exploiting equivalence
But equivalence is rare and hard to find!
– DIR took us 2-3 years
Can we increase coverage without equivalence?
Shrink R with Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
Stable Multithreading
Reuse well-checked schedules on different inputs
How does it work? See papers [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
So much easier that it feels like cheating
(Spectrum: nondeterministic, stable, deterministic)
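The core idea of reusing one schedule across runs can be illustrated with a toy scheduler that forces a fixed total order on the "racy" operations, so every run follows the same checked schedule regardless of thread timing. This is only a sketch of the concept; it is not Tern, Peregrine, or Parrot's actual runtime, and `StableScheduler` is an invented name:

```python
import threading

class StableScheduler:
    """Enforce one fixed turn order on scheduling points."""
    def __init__(self, order):
        self.order = list(order)      # e.g. [0, 1]: thread 0 goes first
        self.pos = 0
        self.cv = threading.Condition()

    def wait_turn(self, tid):
        with self.cv:
            while self.order[self.pos] != tid:
                self.cv.wait()

    def done_turn(self):
        with self.cv:
            self.pos += 1
            self.cv.notify_all()

results = []
sched = StableScheduler([0, 1])

def worker(tid):
    sched.wait_turn(tid)
    results.append(tid)               # the racy operation, now ordered
    sched.done_turn()

# Start thread 1 first on purpose: the schedule, not OS timing, decides.
threads = [threading.Thread(target=worker, args=(i,)) for i in (1, 0)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# results == [0, 1] on every run, regardless of start order
```

Because the runtime pins the program to a small set of such schedules, a model checker that verifies exactly those schedules has verified exactly what runs in production: what-you-check is what-you-run.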
Conclusion
Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP 11]
– automatic, real, exponential reduction
– proven sound and complete
– 34 years → 18 hours, a 10^5 reduction
Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] to make what-you-check what-you-run (ongoing)
Key Challenge
Make stable multithreading work with real-world distributed systems
– physical time?
– message passing?
– dynamic load balancing?
– overhead?