
1 Scalable Formal Dynamic Verification of MPI Programs through Distributed Causality Tracking
Dissertation defense, Anh Vo
Committee: Prof. Ganesh Gopalakrishnan (co-advisor), Prof. Robert M. Kirby (co-advisor), Dr. Bronis R. de Supinski (LLNL), Prof. Mary Hall, and Prof. Matthew Might

2 Our computational ambitions are endless!
Terascale, petascale (where we are now), exascale, zettascale. Correctness is important.
[Images: Jaguar, courtesy ORNL; protein folding, courtesy Wikipedia; computational astrophysics, courtesy LBL]

3 Yet we are falling behind when it comes to correctness
Concurrent software debugging is hard, and it gets harder as the degree of parallelism in applications increases:
– Node level: Message Passing Interface (MPI)
– Core level: threads, OpenMP, CUDA
Hybrid programming will be the future: MPI + threads, MPI + OpenMP, MPI + CUDA.
Yet tools are lagging behind; many cannot operate at scale.
[Figure: the gap between MPI applications and MPI correctness tools]

4 We focus on dynamic verification for MPI
There is a lack of systematic verification tools for MPI, so we need to build verification tools for MPI first:
– Realistic MPI programs run at large scale
– Downscaling might mask bugs
MPI tools can then be expanded to support hybrid programs.

5 We choose MPI because of its ubiquity
MPI was born in 1994, when the world had 600 internet sites, 700 nm lithography, and 68 MHz CPUs. It is still the dominant API for HPC:
– Most widely supported and understood
– High performance, flexible, portable

6 Thesis statement
Scalable, modular, and usable dynamic verification of realistic MPI programs is feasible and novel.

7 Contributions

8 Agenda: Motivation and Contributions; Background; MPI ordering based on Matches-Before; The centralized approach: ISP; The distributed approach: DAMPI; Conclusions

9 Traditional testing is wasteful
Example: deterministic operations are permuted. [Figure: processes P1, P2, ..., Pn all reach an MPI_Barrier.] Exploring all n! permutations of such deterministic operations is wasteful.

10 Testing can also be inconclusive; without nondeterminism coverage, we can miss bugs
P0: MPI_Send(to P1, ...); MPI_Send(to P1, data=22);
P1: MPI_Recv(from P0, ...); MPI_Recv(from P2, ...); MPI_Recv(*, x); if (x == 22) then ERROR else MPI_Recv(*, x);
P2: MPI_Send(to P1, ...); MPI_Send(to P1, data=33);
Unlucky schedule (the wildcard receive matches P2's data=33): the bug is missed.

11 (Same program as the previous slide.)
Lucky schedule (the wildcard receive matches P0's data=22): the bug is caught!

12 Verification: test all possible scenarios
(Same program as slide 10.) Find all possible matches for the wildcard receive; the ability to track causality is a prerequisite. Then replay the program and force the other matches. A concrete C version of the example is sketched below.
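A minimal C sketch of the slide's example; the tags, the payload values other than 22 and 33, and the communicator are assumptions for illustration. Run with exactly 3 ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = 0, data;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {            /* P0: two sends to P1; the second carries 22 */
        data = 11; MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        data = 22; MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {     /* P2: two sends to P1; the second carries 33 */
        data = 11; MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        data = 33; MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* P1: two deterministic receives, then a wildcard */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* Nondeterministic: may match P0's 22 or P2's 33. */
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (x == 22)
            printf("ERROR branch reached\n");   /* as on the slide; one message stays unreceived */
        else
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

Only a verifier that tracks causality and can replay the alternative match is guaranteed to exercise both branches of the if.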

13 Dynamic verification of MPI
Dynamic verification combines the strengths of formal methods and testing:
– It avoids generating false alarms
– It finds bugs with respect to actual binaries
It builds on the familiar approach of testing while guaranteeing coverage over nondeterminism.

14 Overview of the Message Passing Interface (MPI)
MPI is an API specification for communication between processes; it allows developers to write high-performance, portable parallel code. It is rich in features:
– Synchronous operations: easy to use and understand
– Asynchronous operations: high performance
– Nondeterministic constructs: reduce code complexity

15 MPI operations
MPI_Send(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
– send(P,T): send a message with tag T to process P (send(P) when the tag is omitted)
MPI_Recv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status* status)
– recv(P,T): receive a message with tag T from process P
– recv(*,T): receive a message with tag T from any process
– recv(*,*): receive a message with any tag from any process
MPI_Isend(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request* h)
– isend(P,T,h): nonblocking send with communication handle h
MPI_Irecv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Request* h)
– irecv(P,T,h): nonblocking receive with communication handle h; also irecv(*,T,h) and irecv(*,*,h)
MPI_Wait(MPI_Request* h, MPI_Status* status)
– wait(h): wait for the completion of h
MPI_Barrier(MPI_Comm comm)
– barrier: synchronization
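For readers less familiar with MPI, the fragment below shows how this shorthand maps onto actual C calls. It is only an illustration of the mapping (buffer, tag, and communicator arguments are assumed), not a complete communication pattern.

#include <mpi.h>

void shorthand_examples(int *buf, MPI_Comm comm) {
    MPI_Request h;
    MPI_Status  st;
    MPI_Send(buf, 1, MPI_INT, 1, 0, comm);                             /* send(P=1, T=0) */
    MPI_Recv(buf, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, &st);           /* recv(*, T=0)   */
    MPI_Irecv(buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &h); /* irecv(*,*,h)   */
    MPI_Wait(&h, &st);                                                 /* wait(h)        */
    MPI_Barrier(comm);                                                 /* barrier        */
}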

16 The nonovertaking rule facilitates message matching
If a sender sends two messages and both can match a receive operation, the first message matches before the second. If a receiver posts two receives and both can match an incoming message, the first receive matches before the second.
[Figure: three two-process examples; P0 issues send(1) while P1 posts recv(0); recv(*) followed by recv(0); or irecv(*) followed by recv(0).]

17 Happens-before is the basis of causality tracking
e1 happens-before (→) e2 iff:
– e1 occurs before e2 in the same process, or
– e1 is the sending of a message m and e2 is its receipt, or
– e1 → e3 and e3 → e2 for some event e3
[Figure: an example execution with events a, b, c, d, e, f and the happens-before pairs among them.]
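Written out as inference rules (a standard formulation of Lamport's relation; the notation is ours, not the slide's):

\begin{align*}
\text{(program order)}\quad & e_1 \text{ precedes } e_2 \text{ in the same process} \;\Rightarrow\; e_1 \to e_2\\
\text{(message)}\quad       & e_1 = \mathit{send}(m),\ e_2 = \mathit{recv}(m) \;\Rightarrow\; e_1 \to e_2\\
\text{(transitivity)}\quad  & e_1 \to e_3 \ \wedge\ e_3 \to e_2 \;\Rightarrow\; e_1 \to e_2
\end{align*}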

18 Tracking causality with Lamport clocks
– Each process keeps a clock (an integer)
– It increments the clock when it has an event
– It attaches the clock to outgoing messages (piggybacking)
– Upon receiving a piggybacked clock, it sets its clock to a value at least as large as its own clock and strictly larger than the piggybacked clock
[Figure: the example execution with Lamport clocks a=1, b=2, c=3, d=4, e=2, f=5.]
If e1 → e2, then the clock of e1 is less than the clock of e2. The converse does not hold; consider e and d.
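A minimal C sketch of these update rules (the helper names are illustrative, not DAMPI's implementation):

typedef struct { int clock; } lamport_t;

/* Local event, including a send: advance the clock and report its value,
 * which is what gets piggybacked on an outgoing message. */
int lamport_event(lamport_t *lc) {
    lc->clock += 1;
    return lc->clock;
}

/* Receive event: jump past the sender's piggybacked clock if necessary. */
int lamport_receive(lamport_t *lc, int piggybacked) {
    lc->clock = (piggybacked >= lc->clock) ? piggybacked + 1 : lc->clock + 1;
    return lc->clock;
}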

19 Tracking causality with vector clocks
– Each process keeps a vector of clocks (VC)
– It increments its own clock component when it has an event
– It attaches the VC to outgoing messages (piggybacking)
– Upon receiving a piggybacked VC, it updates each component to the maximum of its current VC and the piggybacked VC
[Figure: the example execution with vector clocks a=(1,0,0), b=(1,1,0), c=(1,2,0), d=(1,2,1), e=(2,0,0), f=(2,2,2).]
e1 → e2 iff the VC of e1 is less than the VC of e2. What about e and d? They are concurrent!
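The corresponding vector-clock rules, again as an illustrative C sketch (a fixed process count N is assumed for brevity):

#define N 3   /* number of processes (assumed fixed here) */

/* Local event at process `me`: tick this process's own component. */
void vc_event(int vc[N], int me) { vc[me] += 1; }

/* Receive: component-wise maximum with the piggybacked vector, then tick. */
void vc_receive(int vc[N], const int pb[N], int me) {
    for (int i = 0; i < N; i++)
        if (pb[i] > vc[i]) vc[i] = pb[i];
    vc[me] += 1;
}

/* e1 -> e2 iff vc1 <= vc2 component-wise and vc1 != vc2; if neither vector
 * dominates the other, the events are concurrent (like e and d above). */
int vc_happens_before(const int vc1[N], const int vc2[N]) {
    int strictly_smaller = 0;
    for (int i = 0; i < N; i++) {
        if (vc1[i] > vc2[i]) return 0;
        if (vc1[i] < vc2[i]) strictly_smaller = 1;
    }
    return strictly_smaller;
}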

20 Agenda: Motivation and Contributions; Background; Matches-Before for MPI; The centralized approach: ISP; The distributed approach: DAMPI; Conclusions

21 The necessity for matches-before
For MPI, the notion of an event "happening" does not mean much, and local ordering does not always hold; for example, in P0: send(1); send(2); the two sends go to different processes and need not take effect in program order. The notion of completion also does not work:
P0: send(1)
P1: irecv(*, h); recv(*); wait(h)
The irecv(*) happens before the recv(*) but completes after it.

22 Possible states of an MPI call
An MPI call can be in one of four states:
– Issued
– Matched
– Completed
– Returned
It is always possible to know exactly which state a call is in, except for the Matched state, which has a matching window.
[Figure: P1 executes isend(0,h1); barrier; send(0); wait(h1) while P2 executes irecv(0,h2); barrier; recv(0); wait(h2).]

23 Definition of matches-before
– A recv, barrier, or wait matches before all the calls that follow it in the same process
– Sends and receives are ordered by matches-before according to the nonovertaking rule
– Nonblocking calls match before their corresponding waits
Matches-before is irreflexive, asymmetric, and transitive.
[Figure: matches-before edges over isend(0,h1); barrier; send(0); wait(h1) and irecv(0,h2); barrier; recv(0); wait(h2).]
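One compact way to state these rules (our notation, not the slide's: <_p is program order within a process, and h names a nonblocking request):

\begin{align*}
a \in \{\mathit{recv},\, \mathit{barrier},\, \mathit{wait}\},\ a <_p b \;&\Rightarrow\; a \prec b\\
a, b \text{ related by the nonovertaking rule},\ a <_p b \;&\Rightarrow\; a \prec b\\
a \in \{\mathit{isend}(h),\, \mathit{irecv}(h)\} \;&\Rightarrow\; a \prec \mathit{wait}(h)
\end{align*}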

24 The role of match-sets
[Figure: operations are grouped into match-sets; a send and the receive it matches form one match-set (e1, e2), and a barrier forms another (e3).]

25 Agenda: Motivation and Contributions; Background; Matches-Before for MPI; The centralized approach: ISP; The distributed approach: DAMPI; Conclusions

26 Previous work: the centralized approach (ISP)
[Figure: the MPI program is instrumented through an interposition layer; the resulting executable's processes Proc 1 ... Proc n run on the MPI runtime under a central scheduler that controls each run.]
ISP verifies MPI programs for deadlocks, resource leaks, and assertion violations, and guarantees coverage over the space of MPI nondeterminism. (FM 2010, PPoPP 2009, EuroPVM 2010, EuroPVM 2009)

27 Drawbacks of ISP
ISP scales only up to 32-64 processes, yet large programs (thousands of processes) often exhibit bugs that are not triggered at smaller scales:
– Index out of bounds
– Buffer overflows
– MPI implementation bugs
We need a truly in-situ verification method for codes deployed on large-scale clusters: verify an MPI program as deployed on a cluster.

28 Agenda: Motivation and Contributions; Background; Matches-Before for MPI; The centralized approach: ISP; The distributed approach: DAMPI; Conclusions

29 DAMPI: Distributed Analyzer of MPI Programs
DAMPI performs dynamic verification focusing on coverage over the space of MPI nondeterminism, and verification occurs on the actual deployed code. DAMPI's features:
– Detects and enforces alternative outcomes
– Scalable
– User-configurable coverage

30 DAMPI framework
[Figure: the MPI program is built into an executable linked against the DAMPI PnMPI modules; processes Proc 1 ... Proc n run on the MPI runtime and report alternate matches; a schedule generator turns these into epoch decisions that drive reruns.]

31 Main idea in DAMPI: distributed causality tracking
Perform an initial run of the MPI program and track causalities, i.e., discover which alternative nondeterministic matches could have occurred. Two alternatives:
– Vector clocks: thorough, but not scalable
– Lamport clocks (our choice): scalable; omissions are possible, but only in unrealistic situations

32 DAMPI uses Lamport clocks to maintain matches-before
– Each process keeps a logical clock
– It attaches the clock to each outgoing message
– It increments the clock after a nondeterministic receive has matched
Matches-before lets us infer when irecv's match: compare the incoming piggybacked clock to detect potential matches.
[Figure: P0 and P2 each issue send(1) with piggybacked clock 0; P1 posts irecv(*) at clock 0 and recv(*) at clock 1, all three processes reach barrier, and P1's wait is at clock 2. The second send is highlighted in red because it is a potential alternative match for the irecv(*).]
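A simplified sketch of this clock discipline, with illustrative names and an assumed reading of the match test (not DAMPI's actual code):

static int mb_clock = 0;   /* this process's clock: counts matched wildcard receives */

/* Every outgoing message carries the current clock. */
int clock_to_piggyback(void) { return mb_clock; }

/* Called when a nondeterministic (wildcard) receive has matched. */
void on_wildcard_match(void) { mb_clock += 1; }

/* A later incoming send with piggybacked clock `pb` is recorded as a potential
 * alternative match for a wildcard receive that matched at clock `c` if the
 * send was not causally forced to come after that match, i.e. pb <= c. */
int is_potential_match(int pb, int c) { return pb <= c; }

In the example above, the highlighted send carries piggyback 0 and the irecv(*) matched at clock 0, so that send is flagged as a potential alternative match.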

33 How we use the matches-before relationship to detect alternative matches
[Figure: the same three-process example, with the Lamport clock value shown on each operation.]

34 How we use the matches-before relationship to detect alternative matches
[Figure: the match-set view of the example; a send/receive pair (e1, e2) and a barrier (e3).]

35 How we use the matches-before relationship to detect alternative matches
[Figure: the example again, now labeling R = irecv(*) at clock 0, R' = recv(*) at clock 1, and S = send(1) carrying piggybacked clock 0; S is reported as a potential alternative match for R because its piggybacked clock does not exceed R's clock.]

36 Limitations of the Lamport clock protocol
[Figure: a four-process example (P0-P3) with a chain of wildcard receives R1(*), R2(*), R3(*), R(*) at clocks 0-3 and sends S(P2), S(P3) carrying piggybacked clocks pb(0) and pb(3); the clock comparison alone flags one send as a potential match without being able to resolve it precisely.]
Our protocol guarantees that impossible matches will not be forced (otherwise there could be deadlocks).

37 Lamport clocks vs. vector clocks
DAMPI provides two protocols:
– Lamport clocks: sound and scalable
– Vector clocks: sound and complete
We evaluate scalability (bandwidth, latency, overhead) and accuracy (omissions). In practice the Lamport clock protocol has no omissions, because MPI applications have well-structured communication patterns.

38 Experimental setup
Atlas cluster at LLNL:
– 1152 nodes, 8 cores per node
– 16 GB of memory per node
– MVAPICH2
All experiments were run with 8 tasks per node; results are averaged over five runs.

39 Latency impact

40 Bandwidth impact

41 Application overhead – ParMETIS

42 Application overhead – AMG 2006

43 Application overhead – SMG2000

44 DAMPI implementation detail: using PnMPI
[Figure: the DAMPI framework diagram expanded to show the PnMPI module stack: a DAMPI driver, a core module, a piggyback module, and optional error-checking modules (status, request, communicator, type, and deadlock modules).]

45 Piggyback implementation details
MPI does not provide a built-in mechanism to attach piggyback data to messages. The most common piggyback mechanisms are:
– Attach the piggyback to the message buffer: easy to use but expensive
– Send the piggyback as a separate message: low overhead, but has issues with wildcard receives
– Use a user-defined datatype to transmit the piggyback: low overhead, but difficult to apply to collectives

46 DAMPI uses a mixed piggyback scheme
– Datatype piggyback for point-to-point operations
– Separate-message piggyback for collectives
In the wrapper (piggyback) layer, pb_buf stores the piggyback. Each wrapper builds a new datatype D combining pb_buf and the user buffer, then sends or receives (MPI_BOTTOM, 1, D) instead of (buffer, count, user_type):
int MPI_Send(buf, count, user_type, ...) {
  /* create new datatype D from pb_buf and buf */
  return PMPI_Send(MPI_BOTTOM, 1, D, ...);
}
int MPI_Recv(buf, count, user_type, ...) {
  /* create new datatype D from pb_buf and buf */
  return PMPI_Recv(MPI_BOTTOM, 1, D, ...);
}
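A sketch of how the "create new datatype D" step could be realized with absolute addresses, so that the clock in pb_buf and the user payload travel as one message rooted at MPI_BOTTOM. The name pb_buf comes from the slide; everything else is an assumed, minimal send-side wrapper, not DAMPI's actual code.

#include <mpi.h>

static int pb_buf;   /* the piggybacked clock value */

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    MPI_Datatype D;
    int          blocklens[2] = { 1, count };
    MPI_Datatype types[2]     = { MPI_INT, type };
    MPI_Aint     displs[2];

    /* Absolute addresses become displacements relative to MPI_BOTTOM. */
    MPI_Get_address(&pb_buf, &displs[0]);
    MPI_Get_address((void *)buf, &displs[1]);

    MPI_Type_create_struct(2, blocklens, displs, types, &D);
    MPI_Type_commit(&D);
    int rc = PMPI_Send(MPI_BOTTOM, 1, D, dest, tag, comm);
    MPI_Type_free(&D);
    return rc;
}

The receive-side wrapper mirrors this construction so that the piggybacked clock lands in its own pb_buf while the payload lands in the user's buffer.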

47 Experiments
Comparison with ISP:
– 64-node cluster of Intel Xeon X5550 (8 cores per node, 2.67 GHz), 24 GB RAM per node
– All experiments were run with 8 tasks per node
Measuring the overhead of DAMPI:
– 800-node cluster of AMD Opteron (16 cores per node, 2.3 GHz), 32 GB RAM per node
– All experiments were run with 16 tasks per node

48 DAMPI maintains very good scalability vs. ISP

49 DAMPI is also faster at processing interleavings

50 Results on large applications: SpecMPI2007 and NAS-PB
The slowdown shown is for a single interleaving; no replay was necessary.

51 Heuristics for overcoming search explosion
Full coverage leads to state explosion: given limited time, full coverage biases the search towards the beginning of the state space. DAMPI offers two ways to limit the search space:
– Ignore uninteresting regions: users annotate programs with MPI_Pcontrol (see the sketch below)
– Bounded mixing: limits the impact of each nondeterministic choice; bound = infinity gives the full search
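MPI_Pcontrol is a standard MPI call whose level argument is interpreted by the profiling tool, so the sketch below only illustrates the annotation idea; the region names and the specific level values and their meaning for DAMPI are assumptions.

#include <mpi.h>

void solve_phase(void);        /* hypothetical region worth verifying      */
void checkpoint_phase(void);   /* hypothetical region deemed uninteresting */

void annotated_run(void) {
    MPI_Pcontrol(1);           /* assumed: enable nondeterminism coverage   */
    solve_phase();

    MPI_Pcontrol(0);           /* assumed: tell the tool to skip this region */
    checkpoint_phase();

    MPI_Pcontrol(1);           /* back to full coverage */
    solve_phase();
}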

52-58 Bounded mixing visualization
[Figure: a search tree over choice points A through H ending at MPI_Finalize, explored with increasing bounds. With bound = 1 the search covers 8 interleavings; with bound = 2, 10 interleavings; with bound >= 3, 16 interleavings.]

59 Applying bounded mixing on ADLB

60 How well did we do?
DAMPI achieves scalable verification:
– Coverage over the space of nondeterminism
– Works on realistic MPI programs at large scale
Further correctness-checking capabilities can be added to DAMPI as modules.

61 Questions?

62 Concluding remarks
Scalable dynamic verification for MPI is feasible:
– It combines the strengths of testing and formal methods
– It guarantees coverage over nondeterminism
The matches-before ordering provides the basis for tracking causality in MPI. DAMPI is the first MPI verifier that can scale beyond hundreds of processes.

63 Moore's Law still holds

64 Results on non-trivial applications: SpecMPI2007 and NAS-PB
Benchmark     | Slowdown | Total R* | Communicator Leak | Request Leak
ParMETIS-3.1  | 1.18     | 0        | Yes               | No
104.milc      | 1.55     | 1K       | Yes               | No
107.leslie3d  | 1.14     | 0        | No                | No
113.GemsFDTD  | 1.13     | 0        | Yes               | No
126.lammps    | 1.88     | 0        | No                | No
130.socorro   | 1.25     | 0        | No                | No
137.lu        | 1.04     | 732      | Yes               | No
BT            | 1.28     | 0        | Yes               | No
CG            | 1.09     | 0        | No                | No
DT            | 1.01     | 0        | No                | No
EP            | 1.02     | 0        | No                | No
FT            | 1.01     | 0        | Yes               | No
IS            | 1.09     | 0        | No                | No
LU            | 2.22     | 1K       | No                | No
MG            | 1.15     | 0        | No                | No

65 How ISP works: delayed execution
[Figure: three processes run under the ISP scheduler on top of the MPI runtime. P0: Isend(1, req); Barrier; Wait(req). P1: Irecv(*, req); Barrier; Recv(2); Wait(req). P2: Barrier; Isend(1, req); Wait(req). P0 reports its Isend(1) and Barrier to the scheduler, which answers with sendNext, delaying the actual issue of calls into the MPI runtime.]

66 Delayed execution (continued)
[Figure: the scheduler has now also collected Irecv(*) and Barrier from P1 while continuing to delay their issue into the MPI runtime.]

67 Collect the maximal set of enabled MPI calls
[Figure: the scheduler has collected the Barrier calls from all three processes, giving it the maximal set of enabled calls to match.]

68 Build happens-before edges
[Figure: the scheduler builds happens-before edges among the collected calls. In the interleaving where the wildcard Irecv(*) is rewritten to Irecv(2) and matched with P2's Isend, P1's Recv(2) finds no matching send and P0's Isend(1) finds no matching receive, so a deadlock is reported.]

