Message Passing: Formalization, Dynamic Verification
Ganesh Gopalakrishnan
School of Computing, University of Utah, Salt Lake City, UT 84112, USA
Based on research done by students Sarvani Vakkalanka, Anh Vo, Michael DeLisi, Alan Humphrey, Chris Derrick, Sriram Aananthakrishnan, and faculty colleague Mike Kirby
/ formal_verification
Supported by NSF CNS and Microsoft
Correctness Concerns Will Loom Everywhere… Debug Concurrent Systems, providing rigorous guarantees
Need for help / rigor noted by notable practitioners
– “Sequential programming is really hard, and parallel programming is a step beyond that” (Tanenbaum, USENIX 2008 Lifetime Achievement Award talk)
– “Formal methods provide the only truly scalable approach to developing correct code in this complex programming environment.” (Rusty Lusk, in his EC Invited Talk entitled “Slouching Towards Exascale: Programming Models for High Performance Computing”)
Must Cover BOTH Types of Concurrency
– Shared Memory
  – Enjoys the most attention (esp. from the CS FV community)
– Message Passing
  – Formal aspects of message passing are represented by CCS, CSP, …
  – Many practical message passing libraries exist, but without a rigorous semantics that characterizes their stand-alone behavior and/or their semantics in the context of a standard programming language (e.g. how compiler optimizations work in their presence)
  – The time is now ripe to make progress with respect to a few important message passing libraries (e.g., MPI, MCAPI, …)
Importance of Formalizing High-performance Message Passing Behavior
– Fundamental to dealing with the Message Passing Interface (MPI) API
  – MPI is VERY widely used
– Enables reasoning about the reactive behavior of API calls
  – Out-of-order issue and completion are easily explained through Happens-before (HB)
  – This HB took us a long time to discover, but it is surprisingly easy to explain!
  – It is made up of MATCHES-BEFORE and COMPLETES-BEFORE
  – Happens-before depends on available run-time resources
– Can help characterize compiler optimizations formally
– Handle new correctness-critical message-passing libraries
  – The Multi-core Communications API (MCAPI) for embedded systems use (e.g. cell phones) can be understood using a VERY SIMILAR formalism
– Understanding / pedagogy of message-passing program behavior
  – No need to dismiss this area as “too hairy”
– Enables building formal dynamic verification tools
  – Find bugs, reveal lurking “unexpected behaviors”, …
In general, we must get better at verifying concurrent programs written against a growing number of real APIs
– Code written using mature libraries (MPI, OpenMP, PThreads, …)
– API calls made from real programming languages (C, Fortran, C++)
– Runtime semantics determined by realistic compilers and runtimes
– Model building and model maintenance have HUGE costs (I would assert: “impossible in practice”) and do not ensure confidence!
Importance of MPI Program Analysis / Debugging
– Almost the default choice for large-scale parallel simulations
– Huge support base
– Very mature codes exist in MPI and cannot easily be re-implemented
– Performs critical simulations in Science and Engineering: Weather / Earthquake Prediction, Computational Chemistry, …, Parallel Model Checking, …
[Figures: SiCortex 5832 processor system (courtesy SiCortex); IBM Blue Gene (picture courtesy IBM); LANL’s petascale machine “Roadrunner” (AMD Opteron CPUs and IBM PowerXCell)]
Two Classes of MPI Programs
– Mostly Computational
  – These are sequential programs “pulled apart”
  – One can see higher order functions (map, …)
  – While optimizing these programs, reactive behavior creeps in: non-blocking sends overlapped with computation, probing for computations finishing, and initiating new work early
– Highly Reactive
  – User level libraries written in MPI, e.g. Adaptive Dynamic Load Balancing libraries
– Bottom line: must employ suitable dynamic verification methods for MPI
Our Work
– We have a formal model for MPI
– This formal model succinctly explains the space of all standard-compliant executions of MPI
– What must a standard-compliant MPI library, together with the support infrastructure (runtime, compilers, …), finally amount to?
Practical Contribution of Our Work
– We have built the only push-button dynamic analysis tool for MPI / C programs, called ISP
– Work on MPI / Fortran in progress
– Runs on Mac OS X, Windows, Linux
– Tested against five state-of-the-art MPI libraries: MPICH2, OpenMPI, MSMPI, MVAPICH, IBM MPI (in progress)
– Visual Studio and Eclipse Parallel Tools Platform integration
– 100s of large case studies
– Efficiency is decent (getting better): the 15K LOC ParMETIS Hypergraph Partitioner was analyzed for deadlocks, resource leaks, and assertion violations for a given test harness in < 5 seconds for 2 MPI processes on a laptop
– Being downloaded by many
– Contribution to the Eclipse Consortium underway
– ISP can dynamically execute and reveal the space of all standard-compliant executions of MPI even when running on an arbitrary (standard-compliant) platform
– ISP’s internal scheduling decisions are taken in a fairly general way
One-page Ad on ISP
– Verifies MPI user applications, generating only the relevant process interleavings
– Detects all deadlocks, assert violations, MPI object leaks, and default safety properties
– Works by instrumenting MPI calls, computing relevant interleavings, and replaying
[Figures: BlueGene/L (image courtesy of IBM / LLNL); image courtesy of Steve Parker, U of Utah]
This talk
– Explains the core of MPI using four letters: S, R, B, W
– S starts a DMA send transfer, R starts a DMA receive transfer, W waits for the transfer to finish, and B arranges for efficient global synchronization
– [Hunch] Any attempt to create efficient message passing will result in a similar set of primitives
– We can now explain one-liner MPI programs that can confound even experts!
– This explanation is what ISP’s algorithm also uses
Summary of Some MPI Commands
– MPI_Isend(destination, msg_buf, request_structure, other args)
– This is a non-blocking call
– It initiates copying of msg_buf into the MPI runtime so that a matching MPI receive invoked from process destination will receive the contents of msg_buf
– MPI_Wait(… request_structure …) typically follows MPI_Isend
– When this BLOCKING call returns, the copying is finished
Summary of Some MPI Commands
– MPI_Isend(destination, msg_buf, request_structure, others)
– We will abbreviate this call as Isend(destination, request_structure)
– Example: Isend(2, req)
– And finally as S(2) or S(to:2) or S(to:2, req)
Summary of Some MPI Commands
– MPI_Irecv(source, msg_buf, request_structure, other args)
– This is a non-blocking call
– It initiates receipt into msg_buf from the MPI runtime so that a matching MPI send invoked from process source can provide the contents of msg_buf
– MPI_Wait(… request_structure …) typically follows MPI_Irecv
– When this BLOCKING call returns, the receipt is finished
– Wait is abbreviated W(req) or W or …
Summary of Some MPI Commands
– MPI_Irecv(source, msg_buf, request_structure, other args)
– Abbreviated as Irecv(source, req)
– Example: Irecv(3, req) OR EVEN Irecv(*, req), in case any available source would do
– Finally as R(from:3, req), R(from:3), R(3), …
More MPI Commands
– MPI_Barrier(…) is abbreviated as Barrier() or even Barrier
– All processes must invoke Barrier before any process can return from the Barrier invocation
– Useful high-performance global sync. operation
– Abbreviated as B
Simple MPI Program : ‘lucky.c’
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: Sleep(3); S(to:0, r1); All the Ws…
Process P2: //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
Outcome: deadlock
Simple MPI Program : ‘unlucky.c’
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: // Sleep(3); S(to:0, r1); All the Ws…
Process P2: Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
Outcome: no deadlock
Runs of lucky.c and unlucky.c on mpich using “standard testing” (“lucky” for tester)

mpicc lucky.c -o lucky.out
mpirun -np 3 ./lucky.out
(0) is alive on ganesh-desktop
(1) is alive on ganesh-desktop
(2) is alive on ganesh-desktop
Rank 0 did Irecv
Rank 2 did Send
Sleep over
Rank 1 did Send
[.. hang..]

mpicc unlucky.c -o unlucky.out
mpirun -np 3 ./unlucky.out
(0) is alive on ganesh-desktop
(2) is alive on ganesh-desktop
(1) is alive on ganesh-desktop
Rank 0 did Irecv
Rank 1 did Send
Rank 0 got 11
Sleep over
Rank 2 did Send
(2) Finished normally
(1) Finished normally
(0) Finished normally
[.. OK..]
Runs of lucky.c and unlucky.c using ISP
– ISP will find the deadlock in both cases, unaffected by the “sleep”s
– The tailor-made DPOR that ISP uses, the dynamic instruction rewriting based execution control, … are discussed elsewhere
How many interleavings in lucky.c?
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: Sleep(3); S(to:0, r1); All the Ws…
Process P2: //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
> 500 interleavings without any reductions
How many relevant interleavings?
Process P0: R(from:*, r1); R(from:2, r2); S(to:2, r3); R(from:*, r4); All the Ws…
Process P1: Sleep(3); S(to:0, r1); All the Ws…
Process P2: //Sleep(3); S(to:0, r1); R(from:0, r2); S(to:0, r3); All the Ws…
Just two! One for each Irecv(..) match.
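A back-of-the-envelope count shows why unreduced exploration blows up. The sketch below is in Python and simply counts order-preserving interleavings with a multinomial coefficient. The call counts per process are an assumption read off the abbreviated program (4, 1, and 3 sends/receives, with "All the Ws" taken as one Wait per call), and the count ignores blocking, so it is an upper bound rather than the figure ISP would report.

```python
from math import factorial

def interleavings(lengths):
    """Ways to interleave sequences of the given lengths while keeping each
    sequence's own order (a multinomial coefficient)."""
    total = factorial(sum(lengths))
    for n in lengths:
        total //= factorial(n)
    return total

# Assumed call counts for lucky.c: P0 issues 4 sends/receives, P1 issues 1,
# P2 issues 3; one Wait per call is an assumption about "All the Ws".
print(interleavings([4, 1, 3]))   # sends/receives only
print(interleavings([8, 2, 6]))   # with one Wait per call
```

By contrast, the only choice that changes the outcome is which sender the first wildcard receive matches, which is where the two relevant interleavings come from.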
MPI is tricky… till you see how it really works!
Which send must be allowed to finish first?
P0 --- S(to:1, big-message, h1); … S(to:2, small-message, h2); … W(h2); … W(h1);
P1 --- R(from:0, buf1, h3); … W(h3);
P2 --- R(from:0, buf2, h4); … W(h4);
MPI is tricky… till you see how it really works!
Will this single-process example, called “Auto-send”, deadlock?
P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
The “Crooked Barrier” example
P0 --- S1(to:P2); B
P1 --- B; S2(to:P2)
P2 --- R(from:*); B
Can S2(to:P2) match R(from:*)? Is a match across the barrier possible?
It will be good to explain all these programs without relying upon “bee dances”
MPI HB to the rescue!
These pairs WITHIN A PROCESS are in the MPI HB
– S(to:x); … ; S(to:x)
– R(from:y); … ; R(from:y)
– R(from:*); … ; R(from:any)
– S(to:x, h); … ; W(h)
– R(from:y, h); … ; W(h)
– W(h); … ; any
– B; … ; any
This HB is what makes MPI high-performance!
– S(to:x); … ; S(to:x) -- order only for non-overtaking
– R(from:y); … ; R(from:y) -- ditto
– R(from:*); … ; R(from:any) -- OK, wildcard trumps ordinary-card
– S(to:x, h); … ; W(h) -- Neat! Resource modeling hidden here! (so neat that in our latest work, this HB explains slack inelasticity!)
– R(from:y, h); … ; W(h) -- Neat too
– W(h); … ; any -- One place to truly block
– B; … ; any -- Another place to block!
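The intra-process rules listed above can be sketched as a small Python function that returns the HB edges for one process. The tuple encoding of the calls (`("S", dest, handle)`, `("R", source, handle)`, `("W", handle)`, `("B",)`) and the name `hb_edges` are assumptions made for illustration; only the seven listed pair rules are implemented, and HB is defined on whole instructions rather than on the inner events.

```python
def hb_edges(ops):
    """Return (i, j) index pairs, i < j, where ops[i] is HB-before ops[j]."""
    edges = []
    for i, a in enumerate(ops):
        for j in range(i + 1, len(ops)):
            b = ops[j]
            if a[0] in ("W", "B"):
                edges.append((i, j))          # W; any  and  B; any
            elif a[0] == "S" and b[0] == "S" and a[1] == b[1]:
                edges.append((i, j))          # S(to:x); S(to:x)
            elif a[0] == "R" and b[0] == "R" and a[1] in ("*", b[1]):
                edges.append((i, j))          # R(from:y); R(from:y), R(from:*); R(any)
            elif a[0] in ("S", "R") and b[0] == "W" and a[2] == b[1]:
                edges.append((i, j))          # S/R with handle h; W(h)
    return edges

# The Auto-send program: R(from:0, h1); B; S(to:0, h2); W(h1); W(h2)
auto_send = [("R", 0, "h1"), ("B",), ("S", 0, "h2"), ("W", "h1"), ("W", "h2")]
print(hb_edges(auto_send))
```

Note what is absent from the result: there is no edge from the receive to the barrier or to the send, so neither has to wait for the receive to match.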
Strictly, we must define HB on inner events
– Issued -- >
– Call returned -- <
– Call matched -- <>
– Call completed -- *
– S, R go through all four states
– W has no meaningful <> (take it to be the same as *)
– B has no meaningful * (take it to be the same as <>)
– For this talk, define HB with respect to the higher level instructions themselves (see FM 2009 for details)
HB based state transition semantics
– Fence = an instruction that orders all later program-ordered instructions via HB (for us, these are B and W)
– “Process at a fence” = process that has just issued a fence instruction
– During dynamic verification, each process that is not at a fence is permitted to issue its next instruction, and then extend the HB graph
– Define HB-ancestor, HB-descendant, matched-HB-ancestor
– Match-enabled instruction = one whose HB-ancestors have all matched
– Allow any match-enabled instruction to form a match-set suitably
  – S goes with matching R, B goes with another B
  – For S(to:1), S(to:2), and R(from:*), dynamically rewrite to match sets {S(to:1), R(from:1)} and {S(to:2), R(from:2)}; this is called an R* match-set (actually a set of match-sets)
– Fire match sets; an R* match-set is fired only when there are no non-R* match sets and all processes are at a fence
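The match-set formation and firing priority described above can be sketched in Python as follows. This is an illustrative model, not ISP's code: the function names and the `(process, op)` encoding are assumptions, and the wildcard example is read as two sends issued by processes 1 and 2 to a receiver at process 0, so that rewriting yields R(from:1) and R(from:2).

```python
def form_match_sets(enabled, nprocs):
    """enabled: (process, op) pairs that are match-enabled, where op is
    ("S", dest), ("R", source) with source "*" for a wildcard, or ("B",).
    Returns (plain, rstar): the non-R* match sets, and for each wildcard
    receive the list of alternative rewritten match sets."""
    sends = [(p, op) for p, op in enabled if op[0] == "S"]
    barriers = [(p, op) for p, op in enabled if op[0] == "B"]
    plain, rstar = [], []
    if len(barriers) == nprocs:               # B goes with every other B
        plain.append(barriers)
    for rp, rop in enabled:
        if rop[0] != "R":
            continue
        senders = [(sp, sop) for sp, sop in sends if sop[1] == rp]
        if rop[1] == "*":                     # rewrite R(from:*) once per sender
            alts = [[(sp, sop), (rp, ("R", sp))] for sp, sop in senders]
            if alts:
                rstar.append(alts)
        else:                                 # S goes with its matching R
            plain += [[(sp, sop), (rp, rop)] for sp, sop in senders if sp == rop[1]]
    return plain, rstar

def next_to_fire(plain, rstar, all_at_fence):
    """Non-R* match sets first; an R* match-set only when none remain and
    every process is at a fence."""
    if plain:
        return [plain[0]]
    if rstar and all_at_fence:
        return rstar[0]       # each alternative is explored in its own replay
    return []

enabled = [(1, ("S", 0)), (2, ("S", 0)), (0, ("R", "*"))]
plain, rstar = form_match_sets(enabled, nprocs=3)
print(next_to_fire(plain, rstar, all_at_fence=True))
```

Delaying the R* match-set until every process is at a fence is what lets the scheduler see all senders that could possibly match the wildcard before committing to one.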
How Example Auto-send works
P0 : R(from:0, h1); B; S(to:0, h2); W(h1); W(h2);
[Figure: the HB graph of this program]
– Issue R(from:0, h1), because prior to issuing R, P0 is not at a fence
– Issue B, because after issuing R, P0 is not at a fence
– Form match set; the match-enabled set is {B}
– Fire the match-enabled set {B}
– Issue S(to:0, h2), because now that B is gone, P0 is no longer at a fence
– Issue W(h1), because after S(to:0, h2), P0 is not at a fence
– Can’t form a { W(h1) } match set, because it has an unmatched ancestor (namely R(from:0, h1))
– Form and fire the { R(from:0, h1), S(to:0, h2) } match set
– Now form and fire the match set { W(h1) }
– Now issue W(h2)
– Form match set { W(h2) } and fire it. Done.
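The Auto-send steps can be replayed mechanically. The Python sketch below is a simplified single-run model of the issue/match/fire loop, written for illustration only: the names `hb` and `run` and the tuple encoding of calls are assumptions, wildcard receives are not branched on, and a process counts as "at a fence" while it has an issued but unmatched W or B.

```python
def hb(a, b):
    """True if op a, issued earlier in the same process, is HB-before op b."""
    if a[0] in ("W", "B"):                    # fences order everything after them
        return True
    if a[0] == b[0] == "S":
        return a[1] == b[1]                   # same destination
    if a[0] == b[0] == "R":
        return a[1] in ("*", b[1])            # wildcard, or same source
    return b[0] == "W" and a[2] == b[1]       # S/R before the Wait on its handle

def run(progs):
    """Single run without wildcard branching; returns (events, finished)."""
    pc = {p: 0 for p in progs}                # number of instructions issued
    matched, events = set(), []

    def pending(p):
        return [i for i in range(pc[p]) if (p, i) not in matched]

    def at_fence(p):
        return any(progs[p][i][0] in ("W", "B") for i in pending(p))

    def match_set():
        ready = [(p, i) for p in progs for i in pending(p)
                 if all((p, k) in matched for k in range(i)
                        if hb(progs[p][k], progs[p][i]))]
        for p, i in ready:
            op = progs[p][i]
            if op[0] == "W":
                return [(p, i)]
            if op[0] == "B":
                bs = [(q, j) for q, j in ready if progs[q][j][0] == "B"]
                if len(bs) == len(progs):
                    return bs
            if op[0] == "S":
                for q, j in ready:
                    r = progs[q][j]
                    if r[0] == "R" and q == op[1] and r[1] == p:
                        return [(p, i), (q, j)]
        return None

    while True:
        progress = False
        for p in progs:                       # issue until the process hits a fence
            while pc[p] < len(progs[p]) and not at_fence(p):
                events.append(("issue", progs[p][pc[p]]))
                pc[p] += 1
                progress = True
        ms = match_set()
        if ms:
            matched.update(ms)
            events.append(("fire", [progs[p][i] for p, i in ms]))
        elif not progress:
            finished = len(matched) == sum(len(ops) for ops in progs.values())
            return events, finished

auto_send = {0: [("R", 0, "h1"), ("B",), ("S", 0, "h2"), ("W", "h1"), ("W", "h2")]}
events, finished = run(auto_send)
for e in events:
    print(e)
print("finished:", finished)
```

Moving W(h1) ahead of the send in this model leaves P0 stuck at the fence with nothing to fire, which is the deadlock the original ordering avoids.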
The “Crooked Barrier” example
P0 --- S1(to:P2); B
P1 --- B; S2(to:P2)
P2 --- R(from:*); B
S2(to:P2) can match R(from:*)! Here is how …
MPI program that needs this sort of API-aware dynamic verification (we will see how ISP works on this example)
Process P0: Isend(1, req); Barrier; Wait(req);
Process P1: Irecv(*, req); Barrier; Recv(2); Wait(req);
Process P2: Barrier; Isend(1, req); Wait(req);
Workflow of ISP
[Figure: MPI Program; Interposition Layer; Executable; Proc 1, Proc 2, …, Proc n; Run; Scheduler that generates ALL RELEVANT schedules (Mazurkiewicz Traces); MPI Runtime]
POE
[Figure sequence: the Scheduler sits between processes P0, P1, P2 and the MPI Runtime]
P0: Isend(1, req); Barrier; Wait(req)
P1: Irecv(*, req); Barrier; Recv(2); Wait(req)
P2: Barrier; Isend(1, req); Wait(req)
– Frame 1: the Scheduler has collected Isend(1) and Barrier; sendNext
– Frame 2: the Scheduler has also collected Irecv(*) and Barrier
– Frame 3: the Scheduler has also collected the last Barrier
– Frame 4: the Scheduler has collected Wait(req), Recv(2), Isend(1), and Wait(req); SendNext; Irecv(2), Isend, Wait; No Match-Set; Deadlock!
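The two outcomes of this example can be reproduced with a small Python model that branches on each possible wildcard match. This is a sketch under stated assumptions, not ISP's implementation: the names `hb` and `explore` and the tuple encoding are hypothetical, the blocking Recv(2) is expanded to a receive followed by its Wait, handles `a` to `d` are made up, and branching copies the state in memory where ISP would re-execute the program.

```python
def hb(a, b):
    """True if op a, issued earlier in the same process, is HB-before op b."""
    if a[0] in ("W", "B"):
        return True
    if a[0] == b[0] == "S":
        return a[1] == b[1]
    if a[0] == b[0] == "R":
        return a[1] in ("*", b[1])
    return b[0] == "W" and a[2] == b[1]

def explore(progs):
    """Return [(wildcard_choices, deadlocked), ...], one entry per explored run.
    wildcard_choices lists (receiver, chosen_sender) pairs."""
    results = []

    def run(pc, matched, choices):
        pc, matched = dict(pc), set(matched)

        def pending(p):
            return [i for i in range(pc[p]) if (p, i) not in matched]

        def at_fence(p):
            return any(progs[p][i][0] in ("W", "B") for i in pending(p))

        while True:
            progress = False
            for p in progs:                       # issue up to the next fence
                while pc[p] < len(progs[p]) and not at_fence(p):
                    pc[p] += 1
                    progress = True
            ready = [(p, i) for p in progs for i in pending(p)
                     if all((p, k) in matched for k in range(i)
                            if hb(progs[p][k], progs[p][i]))]
            plain, rstar = None, []
            for p, i in ready:
                op = progs[p][i]
                if op[0] == "W":
                    plain = [(p, i)]
                elif op[0] == "B":
                    bs = [(q, j) for q, j in ready if progs[q][j][0] == "B"]
                    if len(bs) == len(progs):
                        plain = bs
                elif op[0] == "S":
                    for q, j in ready:
                        r = progs[q][j]
                        if r[0] == "R" and q == op[1] and r[1] == p:
                            plain = [(p, i), (q, j)]
                        elif r[0] == "R" and q == op[1] and r[1] == "*":
                            rstar.append(((q, j), (p, i)))
            if plain:                             # non-R* match sets go first
                matched.update(plain)
            elif rstar:                           # branch: one run per sender
                first = rstar[0][0]
                for recv, send in rstar:
                    if recv == first:
                        run(pc, matched | {recv, send},
                            choices + [(recv[0], send[0])])
                return
            elif not progress:
                stuck = any(pc[p] < len(progs[p]) or at_fence(p) for p in progs)
                results.append((choices, stuck))
                return

    run({p: 0 for p in progs}, set(), [])
    return results

example = {
    0: [("S", 1, "a"), ("B",), ("W", "a")],
    1: [("R", "*", "b"), ("B",), ("R", 2, "c"), ("W", "c"), ("W", "b")],
    2: [("B",), ("S", 1, "d"), ("W", "d")],
}
for choices, deadlocked in explore(example):
    print(choices, "deadlock" if deadlocked else "ok")
```

The run in which the wildcard receive takes P2's send leaves Recv(2) with no sender and P0's Wait with an unmatched send, so that run is reported as a deadlock while the other completes.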
Buffering Sensitive Deadlock (deadlocks if buffering is not present in the MPI runtime; the same theory works)
Process P0: Send(to:1, tag:10); Send(to:2, tag:9);
Process P1: Recv(from:*, tag:11); Recv(from:*, tag:10);
Process P2: Recv(from:0, tag:9); Send(to:1, tag:11);
If Send(to:1, tag:10) is provided INSUFFICIENT BUFFERING by the runtime, then the execution will deadlock
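The dependence on buffering can be checked with a direct simulation. The Python sketch below models only two extremes, which is an assumption for illustration: unlimited buffering (a blocking send returns immediately) and zero buffering (a send returns only when the receiver is at a matching receive). It simulates the blocking calls directly instead of using the HB formulation, and the function name `deadlocks` is hypothetical.

```python
def deadlocks(progs, buffered):
    """progs: {rank: [("send" | "recv", peer, tag), ...]}, all calls blocking.
    buffered=True: a send returns at once (unlimited runtime buffering).
    buffered=False: a send returns only when the receiver is at a matching recv.
    Returns True if some process can never finish."""
    pc = {p: 0 for p in progs}
    mailbox = []                              # buffered (src, dst, tag) messages
    while True:
        progress = False
        for p, ops in progs.items():
            if pc[p] == len(ops):
                continue
            kind, peer, tag = ops[pc[p]]
            if kind == "send" and buffered:
                mailbox.append((p, peer, tag))
                pc[p] += 1
                progress = True
            elif kind == "send":              # rendezvous with the receiver
                q = peer
                if pc[q] < len(progs[q]):
                    k, src, t = progs[q][pc[q]]
                    if k == "recv" and src in ("*", p) and t == tag:
                        pc[p] += 1
                        pc[q] += 1
                        progress = True
            else:                             # recv: take the oldest matching message
                for m in mailbox:
                    if m[1] == p and peer in ("*", m[0]) and m[2] == tag:
                        mailbox.remove(m)
                        pc[p] += 1
                        progress = True
                        break
        if not progress:
            return any(pc[p] < len(progs[p]) for p in progs)

program = {
    0: [("send", 1, 10), ("send", 2, 9)],
    1: [("recv", "*", 11), ("recv", "*", 10)],
    2: [("recv", 0, 9), ("send", 1, 11)],
}
print("zero buffering deadlocks:", deadlocks(program, buffered=False))
print("unlimited buffering deadlocks:", deadlocks(program, buffered=True))
```

Without buffering, P0's first send waits for P1's tag-10 receive, which sits behind the tag-11 receive, which waits on P2, which in turn waits for P0's second send.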
Concluding Remarks
Formal Verification for Concurrency serves many purposes
– Helps find bugs
– Helps understand programs
– Helps improve efficiency of code, with FV serving as a safety net
Some of the biggest remaining challenges
– Efficient DEBUGGING
– Safe Design Practices
– Exploitation of Concurrency Patterns to reduce verification complexity
– How to formally downscale systems?
– How to address symmetry?
– How to achieve parameterized verification?
– How to DESIGN well-parameterized systems so that downscaling is easier?
Extra Slides
Summary of Some MPI Commands
– Let Send(2) stand for atomic { Isend(2, req); Wait(req) }
– Let Recv(3) stand for atomic { Irecv(3, req); Wait(req) }
– These are actually BLOCKING MPI operations
How Dynamic Verification using Stateless Search Relies on Replays (a recap…)
P0: lock(y) ………….. unlock(y)
P1: lock(x) ………….. unlock(x)
P2: lock(x) ………….. unlock(x)
[Figure: interleaving diagram over events L0, U0, L1, L2, U1, U2, with the replay L0 U0 L2 U2 L1 U1]
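A stateless search keeps only schedules, never program states, and reaches each new point by re-executing from the start. The Python sketch below illustrates this on the three lock/unlock processes; the function names are hypothetical, and it enumerates every feasible schedule and groups them afterwards by lock acquisition order, whereas a DPOR-style search would avoid generating the redundant ones in the first place.

```python
def replay(progs, schedule):
    """Re-execute from the initial state, following `schedule` (process ids).
    Returns (trace, processes enabled at the end)."""
    pc = {p: 0 for p in progs}
    held, trace = {}, []
    for p in schedule:
        kind, lock = progs[p][pc[p]]
        if kind == "lock":
            held[lock] = p
        else:
            del held[lock]
        trace.append((p, kind, lock))
        pc[p] += 1
    enabled = [p for p in progs if pc[p] < len(progs[p]) and not
               (progs[p][pc[p]][0] == "lock" and progs[p][pc[p]][1] in held)]
    return trace, enabled

def search(progs):
    """Depth-first search over schedules; every extension is reached by replay.
    A schedule with no enabled process is recorded as a completed run."""
    complete, stack = [], [[]]
    while stack:
        schedule = stack.pop()
        trace, enabled = replay(progs, schedule)
        if not enabled:
            complete.append(trace)
        stack.extend(schedule + [p] for p in enabled)
    return complete

def lock_orders(trace):
    """Per-lock acquisition order: runs that agree on it are equivalent here."""
    order = {}
    for p, kind, lock in trace:
        if kind == "lock":
            order.setdefault(lock, []).append(p)
    return tuple(sorted((l, tuple(ps)) for l, ps in order.items()))

progs = {0: [("lock", "y"), ("unlock", "y")],
         1: [("lock", "x"), ("unlock", "x")],
         2: [("lock", "x"), ("unlock", "x")]}
runs = search(progs)
print(len(runs), "schedules,", len({lock_orders(t) for t in runs}), "distinct lock orders")
```

Only P1 and P2 contend for the same lock, so the many schedules collapse to the two orders in which they acquire x; P0's lock on y is independent of both.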