
1 Scalable Formal Dynamic Verification of MPI Programs through Distributed Causality Tracking Dissertation Defense Anh Vo Committee: Prof. Ganesh Gopalakrishnan (co-advisor), Prof. Robert M. Kirby (co-advisor), Dr. Bronis R. de Supinski (LLNL), Prof. Mary Hall and Prof. Matthew Might

2 Our computational ambitions are endless! Terascale, Petascale (where we are now), Exascale, Zettascale. Correctness is important. (Images: Jaguar, courtesy ORNL; protein folding, courtesy Wikipedia; computational astrophysics, courtesy LBL.)

3 Yet we are falling behind when it comes to correctness Concurrent software debugging is hard It gets harder as the degree of parallelism in applications increases – Node level: Message Passing Interface (MPI) – Core level: Threads, OpenMP, CUDA Hybrid programming will be the future – MPI + Threads – MPI + OpenMP – MPI + CUDA Yet tools are lagging behind! – Many tools cannot operate at scale (Figure: MPI applications vs. MPI correctness tools.)

4 We focus on dynamic verification for MPI Lack of systematic verification tools for MPI We need to build verification tools for MPI first – Realistic MPI programs run at large scale – Downscaling might mask bugs MPI tools can be expanded to support hybrid programs

5 We choose MPI because of its ubiquity Born in 1994, when the world had 600 internet sites, 700 nm lithography, and 68 MHz CPUs Still the dominant API for HPC – Most widely supported and understood – High performance, flexible, portable

6 Thesis statement Scalable, modular and usable dynamic verification of realistic MPI programs is feasible and novel.

7 Contributions

8 Agenda Motivation and Contributions Background MPI ordering based on Matches-Before The centralized approach: ISP The distributed approach: DAMPI Conclusions

9 Traditional testing is wasteful. Example: deterministic operations get permuted. With n processes P1, P2, …, Pn each calling MPI_Barrier, exploring all n! permutations of these deterministic calls is wasteful.

10 Testing can also be inconclusive; without nondeterminism coverage, we can miss bugs.
P0: MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1: MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2: MPI_Send(to P1…); MPI_Send(to P1, data=33);
Unlucky schedule: the wildcard receive matches P2's send of 33, and the bug is missed.

11 Testing can also be inconclusive; without nondeterminism coverage, we can miss bugs.
P0: MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1: MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2: MPI_Send(to P1…); MPI_Send(to P1, data=33);
Lucky schedule: the wildcard receive matches P0's send of 22, and the bug is caught!

12 Verification: test all possible scenarios.
P0: MPI_Send(to P1…); MPI_Send(to P1, data=22);
P1: MPI_Recv(from P0…); MPI_Recv(from P2…); MPI_Recv(*, x); if (x==22) then ERROR else MPI_Recv(*, x);
P2: MPI_Send(to P1…); MPI_Send(to P1, data=33);
Find all possible matches for the wildcard receive – the ability to track causality is a prerequisite – then replay the program and force the other matches.
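The example above can be written as a small, self-contained C/MPI program; this is a minimal sketch (buffer names, the tag value 0, and the diagnostic message are assumptions), run with 3 processes, where the schedule decides whether the wildcard receive sees 22 or 33:

/* Wildcard-receive race from the slides above: the error branch is reached
   only if the MPI_ANY_SOURCE receive happens to match P0's second send. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, x = -1, dummy = 0, d22 = 22, d33 = 33;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {                      /* P0: two sends to P1, second carries 22 */
        MPI_Send(&dummy, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&d22,   1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {               /* P2: two sends to P1, second carries 33 */
        MPI_Send(&dummy, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&d33,   1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {               /* P1 */
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, 2, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (x == 22)
            printf("ERROR branch reached: wildcard matched P0\n");   /* the bug */
        else
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}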

13 Dynamic verification of MPI Dynamic verification combines the strengths of formal methods and testing – Avoids generating false alarms – Finds bugs with respect to actual binaries Builds on the familiar approach of “testing” Guarantees coverage over nondeterminism

14 Overview of Message Passing Interface (MPI) An API specification for communication protocols between processes Allows developers to write high performance and portable parallel code Rich in features – Synchronous: easy to use and understand – Asynchronous: high performance – Nondeterministic constructs: reduce code complexity

15 MPI operations
MPI_Send(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)
MPI_Recv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Status* status)
  send(P,T) – send a message with tag T to process P
  recv(P,T) – recv a message with tag T from process P
  recv(*,T) – recv a message with tag T from any process
  recv(*,*) – recv a message with any tag from any process
MPI_Isend(void* buffer, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm, MPI_Request* h)
MPI_Irecv(void* buffer, int count, MPI_Datatype type, int src, int tag, MPI_Comm comm, MPI_Request* h)
  isend(P,T,h) – nonblocking send, communication handle h
  irecv(P,T,h) – nonblocking recv, communication handle h; likewise irecv(*,T,h) and irecv(*,*,h)
MPI_Wait(MPI_Request* h, MPI_Status* status)
  wait(h) – wait for the completion of h
MPI_Barrier(MPI_Comm comm)
  barrier – synchronization
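A minimal usage sketch of the nonblocking shorthand above (the tag value 7 and the payload are arbitrary assumptions); run with 2 processes:

/* isend(1,7,h) on rank 0, irecv(*,7,h) on rank 1, each followed by wait(h), then barrier. */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, out = 42, in = -1;
    MPI_Request h;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Isend(&out, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, &h);              /* isend(1,7,h) */
        MPI_Wait(&h, MPI_STATUS_IGNORE);                                     /* wait(h) */
    } else if (rank == 1) {
        MPI_Irecv(&in, 1, MPI_INT, MPI_ANY_SOURCE, 7, MPI_COMM_WORLD, &h);   /* irecv(*,7,h) */
        MPI_Wait(&h, MPI_STATUS_IGNORE);                                     /* wait(h) */
    }
    MPI_Barrier(MPI_COMM_WORLD);                                             /* barrier */
    MPI_Finalize();
    return 0;
}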

16 Non-overtaking rule facilitates message matching Sender sends two messages: – Both can match to a receive operation – First message will match before second message Receiver posts two receives: – Both can match with an incoming message – First receive will match before second receive (Diagrams: P0 sending to P1, with P1 posting recv(0); recv(*) before recv(0); and irecv(*) before recv(0).)
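As a concrete two-process illustration of the receiver-side rule (payload values are assumptions): with a wildcard irecv posted before a recv from rank 0, the wildcard must match P0's first send and the recv matches the second; run with 2 processes:

/* Non-overtaking: both receives could match either message from rank 0, so the
   earlier-posted wildcard irecv gets the first send (x == 1) and recv(0) gets the second (y == 2). */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, a = 1, b = 2, x = -1, y = -1;
    MPI_Request h;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&x, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &h);     /* irecv(*) */
        MPI_Recv (&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);   /* recv(0)  */
        MPI_Wait(&h, MPI_STATUS_IGNORE);                                      /* wait(h)  */
    }
    MPI_Finalize();
    return 0;
}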

17 Happens-before is the basis of causality tracking e1 happens-before (→) e2 iff: – e1 occurs before e2 in the same process – e1 is the sending of a message m and e2 is the receipt of m – e1 → e3 and e3 → e2 (transitivity) Example (events a, e on P0; b, c on P1; d, f on P2; messages a→b, c→d, e→f): process order gives a→e, b→c, d→f; the messages give a→b, c→d, e→f; transitively, a→c, a→d, a→f, b→d, b→f, c→f.

18 Tracking causality with Lamport clocks Each process keeps a clock (an integer) Increase the clock when it has an event Attach the clock to outgoing messages (piggyback) Upon receiving a piggybacked clock, update the clock to a value at least its current clock and strictly greater than the piggybacked clock Example clocks (same events as before): a=1, e=2 on P0; b=2, c=3 on P1; d=4, f=5 on P2 If e1 → e2 then the clock of e1 is less than the clock of e2 The converse does not hold! What about e and d? e's clock (2) is less than d's clock (4), yet they are concurrent
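A minimal sketch of the update rule just described (function and variable names are assumptions; the clock travels as an explicit piggybacked integer):

/* Lamport clock: bump on every local event, piggyback on sends, and on receipt
   jump to a value strictly greater than both the local and the piggybacked clock. */
static int lamport_clock = 0;

int on_local_event(void) {
    return ++lamport_clock;
}

int clock_to_piggyback_on_send(void) {
    return ++lamport_clock;              /* the send itself is an event */
}

void on_receive(int piggybacked_clock) {
    int base = lamport_clock > piggybacked_clock ? lamport_clock : piggybacked_clock;
    lamport_clock = base + 1;            /* >= old local clock, > piggybacked clock */
}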

19 Tracking causality with vector clocks Each process keeps a vector of clocks (VC) Increases its own clock component when it has an event Attach the VC to outgoing messages (piggyback) Upon receiving a piggybacked VC, update each component to the maximum of the current VC and the piggybacked VC Example VCs (same events as before): a=(1,0,0), e=(2,0,0) on P0; b=(1,1,0), c=(1,2,0) on P1; d=(1,2,1), f=(2,2,2) on P2 e1 → e2 iff the VC of e1 is less than the VC of e2 What about e and d? Neither VC is less than the other, so they are concurrent!
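A matching sketch of the vector clock rules (NPROCS, the rank variable me, and the function names are assumptions of this sketch):

/* Vector clock: bump our own entry on every event; on receipt, take the
   component-wise maximum with the piggybacked vector, then bump our entry. */
#define NPROCS 3

static int vc[NPROCS];
static int me;                           /* this process's rank, set at startup */

void on_local_event(void) {
    vc[me]++;                            /* also done right before piggybacking vc on a send */
}

void on_receive(const int piggybacked_vc[NPROCS]) {
    for (int i = 0; i < NPROCS; i++)
        if (piggybacked_vc[i] > vc[i])
            vc[i] = piggybacked_vc[i];   /* component-wise max */
    vc[me]++;                            /* the receive itself is an event */
}

/* e1 -> e2 iff e1's vector is <= e2's in every component and strictly smaller in at least one. */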

20 Agenda Motivation and Contributions Background Matches-Before for MPI The centralized approach: ISP The distributed approach: DAMPI Conclusions

21 The necessity for matches-before The notion of when a call happens does not mean much: local ordering does not always hold – for example, P0: send(1); send(2) The notion of completion also does not work: P0: send(1) P1: irecv(*,h); recv(*); wait(h) The irecv(*) happens before the recv(*) but completes after it

22 Possible states of an MPI call All possible states of an MPI call – Issued – Matched – Completed – Returned It's always possible to know exactly which state the call is in – Except for the Matched state, which has a matching window (Example: P1: isend(0,h1); barrier; send(0); wait(h1). P2: irecv(0,h2); barrier; recv(0); wait(h2).)

23 Definition of matches-before recv, barrier, and wait match before all the calls following them sends and receives have matches-before order according to the non-overtaking rule Nonblocking calls match before their waits Matches-before is irreflexive, asymmetric and transitive (Example: P1: isend(0,h1); barrier; send(0); wait(h1). P2: irecv(0,h2); barrier; recv(0); wait(h2).)

24 The role of match-sets (Figure: a send and its matching receive, labeled e1 and e2, form one match-set; the barrier calls, labeled e3, form another.)

25 Agenda Motivation and Contributions Background Matches-Before for MPI The centralized approach: ISP The distributed approach: DAMPI Conclusions

26 Previous work: the centralized approach (ISP) (Architecture: the MPI program is compiled against an interposition layer; the executable's processes Proc 1 … Proc n run on the MPI runtime under a central scheduler.) Verifies MPI programs for deadlocks, resource leaks, assertion violations Guarantees coverage over the space of MPI non-determinism FM 2010, PPoPP 2009, EuroPVM 2010, EuroPVM 2009

27 Drawbacks of ISP Scales only up to a modest number of processes Large programs (of 1000s of processes) often exhibit bugs that are not triggered at low ends – Index out of bounds – Buffer overflows – MPI implementation bugs Need a truly in-situ verification method for codes deployed on large-scale clusters! – Verify an MPI program as deployed on a cluster

28 Agenda Motivation and Contributions Background Matches-Before for MPI The centralized approach: ISP The distributed approach: DAMPI Conclusions

29 DAMPI Distributed Analyzer of MPI Programs Dynamic verification focusing on coverage over the space of MPI non-determinism Verification occurs on the actual deployed code DAMPI’s features: – Detect and enforce alternative outcomes – Scalable – User-configurable coverage

30 DAMPI Framework Executable Proc 1 Proc 2 …… Proc n Alternate Matches MPI runtime MPI Program DAMPI - PnMPI modules Schedule Generator Epoch Decisions Rerun DAMPI – Distributed Analyzer for MPI

31 Main idea in DAMPI: Distributed Causality Tracking Perform an initial run of the MPI program Track causalities (discover which alternative nondeterministic matches could have occurred) Two alternatives: – Use of Vector Clocks (thorough, but non-scalable) – Use of Lamport Clocks (our choice) Omissions possible – but only in unrealistic situations Scalable!

32 DAMPI uses Lamport clocks to maintain matches-before Use a Lamport clock to track matches-before – Each process keeps a logical clock – Attach the clock to each outgoing message – Increase it after a nondeterministic receive has matched Matches-before allows us to infer when irecv's match – Compare the incoming clock to detect potential matches (Example, clocks in brackets: P0: send(1) [0]; barrier [0]. P1: irecv(*) [0]; barrier [0]; recv(*) [1]; wait [2]. P2: barrier [0]; send(1) [0]. The second send is flagged by the clock comparison as a potential alternative match.)
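A rough sketch of the clock bookkeeping this slide describes (names and the exact comparison are assumptions; the real logic lives in DAMPI's PnMPI modules):

/* Each process counts completed wildcard matches; that count is piggybacked on
   every outgoing message. A send whose piggybacked count does not exceed the
   count recorded when a wildcard receive matched is not causally after that
   match, so it is reported as a potential alternative match for it. */
static int wildcard_matches_so_far = 0;      /* this process's Lamport-style clock */

int clock_to_piggyback(void) {
    return wildcard_matches_so_far;
}

void on_wildcard_match(int *recorded_clock_for_this_recv) {
    *recorded_clock_for_this_recv = wildcard_matches_so_far;
    wildcard_matches_so_far++;               /* bump after the nondeterministic match */
}

int is_potential_alternative(int recorded_clock_of_recv, int piggybacked_clock_of_send) {
    return piggybacked_clock_of_send <= recorded_clock_of_recv;
}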

33 How we use the matches-before relationship to detect alternative matches (Same example as the previous slide: P0: send(1) [0]; barrier [0]. P1: irecv(*) [0]; barrier [0]; recv(*) [1]; wait [2]. P2: barrier [0]; send(1) [0].)

34 How we use the matches-before relationship to detect alternative matches (Figure: the match-sets again – a send and its matching receive, labeled e1 and e2, and the barrier match-set, labeled e3.)

35 How we use the matches-before relationship to detect alternative matches (Same example with labels: P0: S = send(1) [0]; barrier [0]. P1: R = irecv(*) [0]; barrier [0]; R' = recv(*) [1]; wait [2]. P2: barrier [0]; send(1) [0].)

36 Limitations of Lamport clocks (Figure: four processes P0–P3 with wildcard receives R1(*), R2(*), R3(*), sends S(P2) and S(P3), and piggybacked clocks pb(0) and pb(3); one of the S(P3) sends is reported as a potential match.) Our protocol guarantees that impossible matches will not be forced (there could be deadlocks otherwise)

37 Lamport Clocks vs Vector Clocks DAMPI provides two protocols – Lamport clocks: sound and scalable – Vector clocks: sound and complete We evaluate the scalability and accuracy – Scalability: bandwidth, latency, overhead – Accuracy: omissions The Lamport Clocks protocol does not have any omissions in practice – MPI applications have well structured communication patterns

38 Experimental setup Atlas cluster at LLNL – 1152 nodes, 8 cores per node – 16 GB of memory per node – MVAPICH2 All experiments run at 8 tasks per node Results averaged over five runs

39 Latency Impact

40 Bandwidth Impact

41 Application overhead – ParMETIS

42 Application overhead – AMG 2006

43 Application overhead – SMG2000

44 DAMPI’s Implementation Detail: using PnMPI Executable Proc 1 Proc 2 …… Proc n Alternate Matches MPI runtime MPI Program DAMPI - PnMPI modules Schedule Generator Epoch Decisions Status module Request module Communicator module Type module Deadlock module DAMPI - PnMPI modules Core Module Optional Error Checking Module Piggyback module DAMPI driver

45 Piggyback implementation details MPI does not provide a built-in mechanism to attach piggyback data to messages Most common piggyback mechanisms: – Attach the piggyback to the buffer: easy to use but expensive – Send the piggyback as a separate message: low overhead but has issues with wildcard receives – Use a user-defined datatype to transmit the piggyback: low overhead, but difficult to piggyback on collectives

46 DAMPI uses a mixed piggyback scheme Datatype piggyback for point-to-point Separate-message piggyback for collectives In the wrapper (piggyback) layer, pb_buf stores the piggyback, and a derived datatype D glues the piggyback data to the message data, so sends and receives use (MPI_BOTTOM, 1, D) instead of (buffer, count, user_type):
int MPI_Send(buf, count, user_type, …) {
    Create new datatype D from pb_buf and buf
    return PMPI_Send(MPI_BOTTOM, 1, D, …);
}
int MPI_Recv(buf, count, user_type, …) {
    Create new datatype D from pb_buf and buf
    return PMPI_Recv(MPI_BOTTOM, 1, D, …);
}
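A more concrete sketch of the point-to-point wrapper above, assuming a single int of piggyback data and using the PMPI profiling interface (the helper name and the omission of error handling are assumptions):

/* Build a derived datatype with absolute addresses so that PMPI_Send on
   (MPI_BOTTOM, 1, D) transfers the piggyback word followed by the user's buffer. */
#include <mpi.h>

static int pb_buf;                                   /* piggybacked clock value */

static MPI_Datatype glue_piggyback(void *user_buf, int count, MPI_Datatype user_type) {
    MPI_Datatype d;
    int blens[2] = { 1, count };
    MPI_Aint displs[2];
    MPI_Datatype types[2] = { MPI_INT, user_type };
    MPI_Get_address(&pb_buf, &displs[0]);            /* absolute address of the piggyback */
    MPI_Get_address(user_buf, &displs[1]);           /* absolute address of the user data */
    MPI_Type_create_struct(2, blens, displs, types, &d);
    MPI_Type_commit(&d);
    return d;
}

int MPI_Send(const void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm) {
    MPI_Datatype d = glue_piggyback((void *)buf, count, type);
    int rc = PMPI_Send(MPI_BOTTOM, 1, d, dest, tag, comm);
    MPI_Type_free(&d);
    return rc;
}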

47 Experiments Comparison with ISP: – 64-node cluster of Intel Xeon X5550 (8 cores per node, 2.67 GHz), 24 GB RAM per node – All experiments were run with 8 tasks per node Measuring overhead of DAMPI: – 800-node cluster of AMD Opteron (16 cores per node, 2.3 GHz), 32 GB RAM per node – All experiments were run with 16 tasks per node

48 DAMPI maintains very good scalability vs ISP

49 DAMPI is also faster at processing interleavings

50 Results on large applications: SpecMPI2007 and NAS-PB Slowdown is for one interleaving No replay was necessary

51 Heuristics for overcoming search explosion Full coverage leads to state explosion – Given limited time, full coverage biases towards the beginning of the state space DAMPI offers two ways to limit the search space: – Ignore uninteresting regions: users annotate programs with MPI_Pcontrol (see the sketch below) – Bounded mixing: limits the impact of a nondeterministic choice; bound = infinity gives the full search
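For the annotation-based option, a user might bracket an uninteresting region as in the sketch below. MPI_Pcontrol is the MPI standard's hook for profiling control; the meaning of the level arguments (0 to suspend and 1 to resume nondeterminism coverage) is an assumption about DAMPI's convention, not something the MPI standard defines:

/* Assumed DAMPI convention: level 0 pauses coverage, level 1 resumes it. */
MPI_Pcontrol(0);
/* ... an initialization or I/O phase the user does not want the verifier to explore ... */
MPI_Pcontrol(1);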

52 Bounded Mixing visualization (choice points A–H leading to MPI_Finalize; bound = 1)

53 Bounded Mixing visualization (bound = 1, continued)

54 Bounded Mixing visualization (bound = 1; total interleavings: 8)

55 Bounded Mixing visualization (bound = 2)

56 Bounded Mixing visualization (bound = 2, continued)

57 Bounded Mixing visualization (bound = 2, continued)

58 Bounded Mixing visualization (bound = 2; total interleavings: 10. With bound >= 3, total interleavings: 16)

59 Applying Bounded Mixing on ADLB

60 How well did we do? DAMPI achieves scalable verification – Coverage over the space of nondeterminism – Works on realistic MPI programs at large scale Further correctness checking capabilities can be added as modules to DAMPI

61 Questions?

62 Concluding Remarks Scalable dynamic verification for MPI is feasible – Combines the strengths of testing and formal methods – Guarantees coverage over nondeterminism Matches-before ordering for MPI provides the basis for tracking causality in MPI DAMPI is the first MPI verifier that can scale beyond hundreds of processes

63 Moore's Law still holds

64 Results on non-trivial applications: SpecMPI2007 and NAS-PB
Benchmark      Slowdown  Total R*  Communicator Leak  Request Leak
ParMETIS       –         –         Yes                No
104.milc       1.55      1K        Yes                No
107.leslie3d   1.14      0         No                 No
113.GemsFDTD   1.13      0         Yes                No
126.lammps     1.88      0         No                 No
130.socorro    1.25      0         No                 No
137.lu         –         –         Yes                No
BT             1.28      0         Yes                No
CG             1.09      0         No                 No
DT             1.01      0         No                 No
EP             1.02      0         No                 No
FT             1.01      0         Yes                No
IS             1.09      0         No                 No
LU             2.22      1K        No                 No
MG             1.15      0         No                 No

65 How ISP works: delayed execution (Example: P0: Isend(1, req); Barrier; Wait(req). P1: Irecv(*, req); Barrier; Recv(2); Wait(req). P2: Barrier; Isend(1, req); Wait(req). The scheduler sits between the processes and the MPI runtime, intercepting calls and replying with sendNext; so far it has collected P0's Isend(1) and Barrier without issuing them to the MPI runtime.)

66 Delayed execution (Same example: the scheduler has now also collected P1's Irecv(*) and Barrier, still issuing nothing to the MPI runtime.)

67 Collect the maximal set of enabled MPI calls (Same example: the scheduler now holds Isend(1) and Barrier from P0, Irecv(*) and Barrier from P1, and Barrier from P2, so the barrier match-set is complete and can be issued.)

68 Build happens-before edges (Same example, after the barrier: the scheduler builds happens-before edges over the pending Isend(1), Recv(2), and Wait(req) calls and the wildcard's possible matches; when Irecv(*) is rewritten to Irecv(2), i.e., matched with P2's Isend, P1's Recv(2) and P0's Isend(1) are left with no match – deadlock!)