Effectively Model Checking Real-World Distributed Systems Junfeng Yang Joint work with Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Lintao Zhang, Heming.


1 Effectively Model Checking Real-World Distributed Systems
Junfeng Yang
Joint work with Huayang Guo, Ming Wu, Lidong Zhou, Gang Hu, Lintao Zhang, Heming Cui, Jingyue Wu, Chia-che Tsai, John Gallagher

2 One-slide Summary
Distributed systems: important, but hard to get right
Model checking: finds serious bugs but is slow
Dynamic Interface Reduction: the first new type of state-space reduction technique in 25 years [DeMeter SOSP 11]
– Exponentially speeds up model checking
– One data point: 34 years → 18 hours
Stable Multithreading: a radically new approach [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
– What-you-check-is-what-you-run
– Billions of years → 7 hours
– https://github.com/columbia/smt-mc

3 Distributed Systems: Pervasive and Critical

4 Distributed Systems: Hard to Get Right
A node has no centralized view of the entire system
Code must correctly handle many failures
– Link failures, network partitions
– Message loss, delay, or reordering
– Machine crashes
Worse: geo-distributed, larger systems; weird failures more likely
→ Complex protocols, more complex code, bugs

5 Model Checking Distributed Systems Implementations
Choices of actions
– Send message
– Recv message
– Run thread
– Delay message
– Fail link
– Crash machine
– …
Run checkers on states
– E.g., assertions
[Figure: state graph with transitions labeled send, fail link, thread, crash, …]
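
The loop this slide describes (pick an enabled action, apply it, run checkers on the resulting state) can be sketched as a small depth-first search. This is only an illustration and not how MoDist is implemented: the toy system (at most two messages on one link that can fail) and the names `enabled_actions`, `apply_action`, `check`, and `explore` are all assumptions made up for the example.

```python
from typing import List, Tuple

# Hypothetical toy state: (messages in flight, messages delivered, link up?)
State = Tuple[int, int, bool]

def enabled_actions(state: State) -> List[str]:
    """Return the action choices available in this state."""
    in_flight, delivered, link_up = state
    actions = []
    if link_up and in_flight > 0:
        actions.append("recv")        # deliver one in-flight message
    if link_up and in_flight + delivered < 2:
        actions.append("send")        # put a new message on the wire
    if link_up:
        actions.append("fail_link")   # inject a link failure
    return actions

def apply_action(state: State, action: str) -> State:
    """Compute the successor state for one action."""
    in_flight, delivered, link_up = state
    if action == "send":
        return (in_flight + 1, delivered, link_up)
    if action == "recv":
        return (in_flight - 1, delivered + 1, link_up)
    if action == "fail_link":
        return (0, delivered, False)  # in-flight messages are lost
    raise ValueError(action)

def check(state: State) -> bool:
    """Checker run on every state: counters stay within the toy bounds."""
    in_flight, delivered, _ = state
    return in_flight >= 0 and delivered <= 2

def explore(initial: State):
    """Depth-first exploration; returns (visited states, violating traces)."""
    visited, errors = set(), []
    stack = [(initial, [])]
    while stack:
        state, trace = stack.pop()
        if state in visited:
            continue
        visited.add(state)
        if not check(state):
            errors.append(trace)      # the action trace makes the error replayable
        for action in enabled_actions(state):
            stack.append((apply_action(state, action), trace + [action]))
    return visited, errors

visited, errors = explore((0, 0, True))
print(len(visited), "states explored,", len(errors), "violations")
```

Keeping the action trace next to each state is what makes a detected error replayable: re-applying the recorded actions from the initial state reaches the same state again.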

6 Good Error Detection Results
E.g., [MoDist NSDI 09] [dBug SSV 10]
– Easy: check unmodified, real code in its native environment ("in-situ" [eXplode OSDI 06])
– Comprehensive: check many corner cases
– Deterministic: detected errors can be replayed
MoDist results
– Checked Berkeley DB replication, MPS (Microsoft production system), PacificA
– Found 35 bugs, 10 of them protocol flaws
– Found in every system checked
– Transferred to Microsoft product groups

7 But, the State Explosion Problem
Real-world distributed systems have too many states to explore completely
– Even for conceptually small state spaces
– 3-node MPS: 34 years for MoDist!
Incompleteness → low assurance
Prior model checkers explored many redundant states

8 This Talk: Two Techniques to Effectively Reduce/Shrink State Space
Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP 11]
– 34 years → 18 hours, 10^5 reduction
Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13] to make what-you-check what-you-run (ongoing)

9 Dynamic Interface Reduction (DIR)
Insight: system builders decompose a system into components with narrow interfaces
– e.g., [Clarke, Long, McMillan 87] [Laster, Grumberg 98]
Distinguish global and local actions
Check local actions via a conceptually local fork()
Example component code:
    // main
    // ckpt
    n=recv()
    total+=n
    Send(n)
    Log(total)
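
One way to picture the "conceptually local fork()" is as a copy of a single component's state taken at the checkpoint, so that local actions can be explored on the copy without touching the rest of the system. The sketch below uses `copy.deepcopy` as a stand-in for that fork. The `Component` class and `explore_from_checkpoint` function are hypothetical names invented for this illustration, not part of DeMeter.

```python
import copy

class Component:
    """Hypothetical stand-in for one node's local state."""
    def __init__(self):
        self.total = 0
        self.log = []

    def on_recv(self, n):
        self.total += n               # local action: total+=n

    def write_log(self):
        self.log.append(self.total)   # local action: Log(total)

def explore_from_checkpoint(checkpoint, received):
    """Run local actions on a private copy of the checkpointed state."""
    branch = copy.deepcopy(checkpoint)  # the conceptual local fork
    for n in received:
        branch.on_recv(n)
    branch.write_log()
    return branch

ckpt = Component()                    # state at the "// ckpt" point
branch = explore_from_checkpoint(ckpt, [1, 2])
print(branch.total, branch.log, ckpt.total)
```

Because the branch is a copy, the checkpoint stays untouched and can seed as many further local explorations as needed.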

10 Reduction Analysis
N components, each having M local actions
w/o DIR: M * M * … * M = M^N
w/ DIR: M + M + … + M = M * N
Exponential reduction
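
To make the gap between M^N and M * N concrete, the two formulas from this slide can be evaluated for a few sizes. The specific values of M and N below are made up for illustration and do not come from the talk.

```python
def states_without_dir(m: int, n: int) -> int:
    """Global exploration: every combination of local actions, M^N."""
    return m ** n

def states_with_dir(m: int, n: int) -> int:
    """Component-wise exploration: local actions add up, M * N."""
    return m * n

# Illustrative sizes only (not measurements from the talk).
for m, n in [(10, 3), (100, 3), (100, 5)]:
    without = states_without_dir(m, n)
    with_dir = states_with_dir(m, n)
    print(f"M={m}, N={n}: {without} vs {with_dir} (ratio {without // with_dir})")
```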

11 Challenge in Implementing DIR
How to automatically compute interfaces from real code w/o causing false positives or missing bugs?
Manual spec: tedious, costly, error-prone
– Required by prior compositional or modular model checking work
Made-up interfaces: difficult-to-diagnose false positives [Guerraoui and Yabandeh, NSDI 11]

12 Automatically Discover Interface by Running Code
Global Explorer: explore global actions
Local Explorers: explore local actions
Insight: message traces collectively define interfaces
[Figure: Global Explorer and Local Explorers linked by "Message Traces"]

13 Example
Client C:
    if (Toss(2) == 0) {
        Send(P, 1);
        Send(P, 2);
    } else {
        Send(P, 1);
        Send(P, 3);
    }
Primary P:
    // main
    // ckpt
    While(n=recv()){
        total+=n
        Send(S, n)
    }
    Log(total)
Secondary S:
    // main
    // ckpt
    While(n=recv()){
        total+=n
    }
    Log(total)

14 Global Explorer: Compute Initial Global Trace
Example code for Client C, Primary P, Secondary S: as on slide 13
Global trace:
    C.Toss(2) = 0
    C.Send(P, 1); P.Recv(C, 1); P.total+=1; P.Send(S, 1); S.Recv(P, 1); S.total+=1
    C.Send(P, 2); P.Recv(C, 2); P.total+=2; P.Send(S, 2); S.Recv(P, 2); S.total+=2
    P.Log; S.Log

15 Global Explorer: Project Message Traces
Example code for Client C, Primary P, Secondary S: as on slide 13
Global trace:
    C.Toss(2) = 0
    C.Send(P, 1); P.Recv(C, 1); P.total+=1; P.Send(S, 1); S.Recv(P, 1); S.total+=1
    C.Send(P, 2); P.Recv(C, 2); P.total+=2; P.Send(S, 2); S.Recv(P, 2); S.total+=2
    P.Log; S.Log
Projected message traces:
    C: C.Send(P, 1); C.Send(P, 2)
    P: P.Recv(C, 1); P.Send(S, 1); P.Recv(C, 2); P.Send(S, 2)
    S: S.Recv(P, 1); S.Recv(P, 2)
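
Projection can be read as filtering the global trace down to each component's Send and Recv events, in order. The sketch below does that with the events from this slide written as plain strings. The string encoding and the `is_message_event` and `project` helpers are assumptions for illustration, not DeMeter's data structures. The Log events are placed last because Log(total) follows the receive loop in the example code.

```python
from collections import defaultdict

GLOBAL_TRACE = [
    "C.Toss(2) = 0",
    "C.Send(P, 1)", "P.Recv(C, 1)", "P.total+=1",
    "P.Send(S, 1)", "S.Recv(P, 1)", "S.total+=1",
    "C.Send(P, 2)", "P.Recv(C, 2)", "P.total+=2",
    "P.Send(S, 2)", "S.Recv(P, 2)", "S.total+=2",
    "P.Log", "S.Log",
]

def is_message_event(event: str) -> bool:
    """Global actions in this example are exactly the Send/Recv events."""
    action = event.split(".", 1)[1]
    return action.startswith(("Send(", "Recv("))

def project(trace):
    """Split a global trace into one message trace per component."""
    traces = defaultdict(list)
    for event in trace:
        if is_message_event(event):
            component = event.split(".", 1)[0]
            traces[component].append(event)
    return dict(traces)

for component, message_trace in project(GLOBAL_TRACE).items():
    print(component, message_trace)
```

Local actions such as `P.total+=1` or `C.Toss(2) = 0` drop out of the projection, which is what keeps the interface between components narrow.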

16 Local Explorers: Explore Local Actions Using Message Traces
Example code for Client C, Primary P, Secondary S: as on slide 13
Global trace and projected message traces: as on slide 15

17 Local Explorer of Primary: Explore Local Trace 1
Example code for Client C, Primary P, Secondary S: as on slide 13
Message trace of P: P.Recv(C, 1); P.Send(S, 1); P.Recv(C, 2); P.Send(S, 2)
Local actions of P: P.total+=1; P.total+=2; P.Log

18 Local Explorer of Primary: Explore Local Trace 2
Example code for Client C, Primary P, Secondary S: as on slide 13
Message trace of P: P.Recv(C, 1); P.Send(S, 1); P.Recv(C, 2); P.Send(S, 2)
Local actions of P: P.total+=1; P.total+=2; P.Log

19 Local Explorer of Primary: Explore Local Trace 3
Example code for Client C, Primary P, Secondary S: as on slide 13
Message trace of P: P.Recv(C, 1); P.Send(S, 1); P.Recv(C, 2); P.Send(S, 2)
Local actions of P: P.total+=1; P.total+=2; P.Log

20 Local Explorer of Client
Example code for Client C, Primary P, Secondary S: as on slide 13
Message trace of C: C.Send(P, 1); C.Send(P, 2)

21 Local Explorer of Client
Example code for Client C, Primary P, Secondary S: as on slide 13
Message trace of C: C.Send(P, 1); C.Send(P, 2)
Local action of C: C.Toss(2) = 0

22 Local Explorer of Client Found New Message Trace
Example code for Client C, Primary P, Secondary S: as on slide 13
Known message trace of C: C.Send(P, 1); C.Send(P, 2)
Local action of C: C.Toss(2) = 1
New message trace of C: C.Send(P, 1); C.Send(P, 3)
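
The client's local exploration can be sketched by running the client code from the example under each outcome of Toss(2) and reporting any message trace not seen before. The `client` and `explore_client` functions below are hypothetical, written only to mirror the example. A real local explorer works on the unmodified binary rather than on a re-implementation like this.

```python
def client(toss):
    """Client code from the example; `toss` stands in for Toss()."""
    sent = []
    if toss(2) == 0:
        sent.append("C.Send(P, 1)")
        sent.append("C.Send(P, 2)")
    else:
        sent.append("C.Send(P, 1)")
        sent.append("C.Send(P, 3)")
    return sent

def explore_client(known_traces):
    """Try every Toss(2) outcome and collect message traces not yet known."""
    new_traces = []
    for outcome in range(2):
        # Bind the outcome so the client sees a deterministic Toss result.
        trace = tuple(client(lambda n, outcome=outcome: outcome))
        if trace not in known_traces and trace not in new_traces:
            new_traces.append(trace)
    return new_traces

known = {("C.Send(P, 1)", "C.Send(P, 2)")}   # projected from the first global trace
print(explore_client(known))
```

Only the branch with Toss(2) = 1 yields something new, which corresponds to the trace ending in C.Send(P, 3) on this slide.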

23 Global Explorer: Composition
Example code for Client C, Primary P, Secondary S: as on slide 13
Message traces to compose:
    C (known): C.Send(P, 1); C.Send(P, 2)
    C (new, C.Toss(2) = 1): C.Send(P, 1); C.Send(P, 3)
    P: P.Recv(C, 1); P.Send(S, 1); P.Recv(C, 2); P.Send(S, 2)
    S: S.Recv(P, 1); S.Recv(P, 2)

24 Global Explorer: New Global Trace
Example code for Client C, Primary P, Secondary S: as on slide 13
Global trace:
    C.Toss(2) = 0
    C.Send(P, 1); P.Recv(C, 1); P.total+=1; P.Send(S, 1); S.Recv(P, 1); S.total+=1
    C.Send(P, 3)
    P.Log; S.Log
Message traces:
    C (known): C.Send(P, 1); C.Send(P, 2)
    C (new, C.Toss(2) = 1): C.Send(P, 1); C.Send(P, 3)
    P: P.Recv(C, 1); P.Send(S, 1); P.Recv(C, 2); P.Send(S, 2)
    S: S.Recv(P, 1); S.Recv(P, 2)
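
A simplified way to see why the client's new message trace forces a new global trace is to check whether a set of per-component message traces fits together, meaning every value sent on a channel is received on that channel in the same order. The check below is a sketch under strong assumptions (FIFO channels, every message eventually received) and is not the composition algorithm from the DeMeter paper. `EVENT`, `channels`, and `composable` are names invented here.

```python
import re
from collections import defaultdict

# Matches events such as "C.Send(P, 1)" or "P.Recv(C, 1)".
EVENT = re.compile(r"(\w+)\.(Send|Recv)\((\w+), (\d+)\)")

def channels(traces):
    """Collect sent and received values per (sender, receiver) channel."""
    sent, received = defaultdict(list), defaultdict(list)
    for trace in traces:
        for event in trace:
            node, kind, peer, value = EVENT.fullmatch(event).groups()
            if kind == "Send":
                sent[(node, peer)].append(int(value))
            else:
                received[(peer, node)].append(int(value))
    return dict(sent), dict(received)

def composable(traces) -> bool:
    """True if every channel's receives match its sends, in order."""
    sent, received = channels(traces)
    return sent == received

new_c = ["C.Send(P, 1)", "C.Send(P, 3)"]
old_p = ["P.Recv(C, 1)", "P.Send(S, 1)", "P.Recv(C, 2)", "P.Send(S, 2)"]
old_s = ["S.Recv(P, 1)", "S.Recv(P, 2)"]
print(composable([new_c, old_p, old_s]))
```

The new client trace does not fit the primary's existing trace (P expects 2 where C now sends 3), so the matching prefix is all that can be reused and the rest has to come from running the real code again.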

25 Implementation
7,279 lines of C++
Integrated DIR with
– MoDist [MoDist NSDI 09], 757 lines
– MaceMC [MaceMC NSDI 07], 1,114 lines
– Easy
Orthogonal to partial order reduction through vector clock tricks

26 Verification/Reduction Results
MPS: Microsoft production system
BDB: Berkeley DB Replication
Chord: Chord implementation in Mace
*-n: n nodes
Results of other benchmarks in [DeMeter SOSP 11]
Table columns: App, MPS-2, MPS-3, BDB-2, BDB-3, Chord-2, Chord-3 (column groups DIR-MoDist and DIR-MaceMC)
Reduction row (cell boundaries lost): 488542944277278481191587
Speedup row (cell boundaries lost): 15321717850 442037547

27 DIR Summary
Proven sound (introduces no false positives) and complete (introduces no false negatives)
Fully automatic, real, exponential reduction
Works seamlessly w/ existing model checkers
– Integrated into MoDist and MaceMC; easy
Results
– Verified instances of real-world systems
– Empirically observed large reduction: 34 years → 18 hours (10^5) on MPS

28 This Talk: Two Techniques to Effectively Reduce State Space
Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP 11]
– 34 years → 18 hours, 10^5 reduction
Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13] to make what-you-check what-you-run (ongoing)

29 Threads: Difficult to Model Check
Many thread interleavings, or schedules
– To verify, local explorer must explore all schedules
Wide interfaces between threads
– Any shared-memory load/store
– Tracing load/store is costly
– DIR may not work well
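
To get a feel for "many thread interleavings": if t threads each execute s atomic steps and any thread may be scheduled at any point, the number of distinct schedules is the multinomial coefficient (t*s)! / (s!)^t. The sketch below evaluates it. The assumptions that every step is a scheduling point and that no synchronization rules out any ordering are simplifications for illustration, and the thread and step counts are made up.

```python
from math import factorial

def schedule_count(threads: int, steps: int) -> int:
    """Interleavings of `threads` threads with `steps` atomic steps each."""
    return factorial(threads * steps) // factorial(steps) ** threads

# Illustrative sizes only.
for threads, steps in [(2, 2), (2, 10), (4, 10)]:
    print(threads, "threads x", steps, "steps:", schedule_count(threads, steps))
```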

30 What-you-check is what-you-run
Coverage = C/R
– C: model checked schedules
– R: all possible runtime schedules
Reduction: enlarge C by exploiting equivalence
But equivalence is rare, hard to find!
– DIR took us 2-3 years
Can we increase coverage w/o equivalence?
Shrink R w/ Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
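
The ratio Coverage = C/R can be raised from either side, which a few lines of arithmetic make explicit. All numbers below are invented for illustration and say nothing about real systems.

```python
from fractions import Fraction

def coverage(checked: int, runtime: int) -> Fraction:
    """Coverage = C / R: checked schedules over possible runtime schedules."""
    return Fraction(checked, runtime)

baseline = coverage(1_000, 1_000_000)    # check 1,000 of 1,000,000 schedules
enlarge_c = coverage(10_000, 1_000_000)  # reduction: each check stands for more schedules
shrink_r = coverage(1_000, 2_000)        # restrict the runtime to far fewer schedules

print(baseline, enlarge_c, shrink_r)
```

With the same checking effort, restricting which schedules can occur at runtime moves the ratio much further than a tenfold gain in C.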

31 Stable Multithreading
Reuse well-checked schedules on different inputs
How does it work? See papers [Tern OSDI '10] [Peregrine SOSP '11] [PLDI '12] [Parrot SOSP '13] [CACM '13]
So much easier that it feels like cheating
[Figure labels: Nondeterministic, Stable, Deterministic]

32 Conclusion
Dynamic Interface Reduction: check components separately to avoid costly global exploration [DeMeter SOSP 11]
– Automatic, real, exponential reduction
– Proven sound and complete
– 34 years → 18 hours, 10^5 reduction
Leverage Stable Multithreading [Tern OSDI '10] [Peregrine SOSP '11] to make what-you-check what-you-run (ongoing)

33 Key Challenge
Make stable multithreading work with real-world distributed systems
– Physical time?
– Message passing?
– Dynamic load balancing?
– Overhead?

