
1 Distributed Dynamic Partial Order Reduction based Verification of Threaded Software Yu Yang (PhD student; summer intern at CBL) Xiaofang Chen (PhD student; summer intern at IBM) Ganesh Gopalakrishnan Robert M. Kirby School of Computing University of Utah SPIN 2007 Workshop Presentation Supported by: Microsoft HPC Institutes NSF CNS 0509379

2 Thread Programming will become more prevalent FV of thread programs will grow in importance

3 Why FV for Threaded Programs > 80% of chips shipped will be multi-core (photo courtesy of Intel Corporation.)

4 Model Checking will Increasingly be thru Dynamic Methods Also known as Runtime or In-Situ methods

5 Why Dynamic Verification Methods
Even after early life-cycle modeling and validation, the final code will have far more details
Early life-cycle modeling is often impossible
- Use of libraries (API) such as MPI, OpenMP, Shmem, …
- Library function semantics can be tricky
- The bug may be in the library function implementation

6 Model Checking will often be “stateless”

7 Why Stateless
One may not be able to access a lot of the state, e.g. the state of the OS
It is expensive to hash states and look up revisits
Stateless search is easier to parallelize

8 Partial Order Reduction is Crucial !

9 Why POR?
Process P0:
0: MPI_Init
1: MPI_Win_lock
2: MPI_Accumulate
3: MPI_Win_unlock
4: MPI_Barrier
5: MPI_Finalize
Process P1:
0: MPI_Init
1: MPI_Win_lock
2: MPI_Accumulate
3: MPI_Win_unlock
4: MPI_Barrier
5: MPI_Finalize
ONLY DEPENDENT OPERATIONS
504 interleavings without POR: (2 * (10!)) / (5!)^2
2 interleavings with POR!
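The count on this slide can be checked mechanically. The sketch below assumes that the 10!/(5!)^2 term counts the ways to interleave two independent five-step sequences (a binomial coefficient) and takes the leading factor of 2 directly from the slide without interpreting it. The function name `interleavings` is illustrative, not something from the deck.

```c
#include <assert.h>
#include <stdint.h>

/* Ways to interleave two sequences of n and m steps while preserving each
 * sequence's internal order: the binomial coefficient C(n + m, n). */
uint64_t interleavings(unsigned n, unsigned m)
{
    assert(n + m <= 60);  /* keeps the running product within 64 bits */
    uint64_t result = 1;
    for (unsigned i = 1; i <= n; i++) {
        /* Before this step result == C(m + i - 1, i - 1); multiplying
         * before dividing keeps every division exact. */
        result = result * (m + i) / i;
    }
    return result;
}
```

With n = m = 5 this gives 252, and doubling it reproduces the slide's 504.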

10 Dynamic POR is almost a “must” ! ( Dynamic POR as in Flanagan and Godefroid, POPL 2005)

11 Why Dynamic POR ? a[ j ]++ a[ k ]-- Ample Set depends on whether j == k Can be very difficult to determine statically Can determine dynamically

12 Why Dynamic POR ? The notion of action dependence (crucial to POR methods) is a function of the execution
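To make the a[ j ]++ versus a[ k ]-- point concrete, here is a minimal sketch of a run-time dependence check. It assumes a simple notion of dependence (same address, at least one write); the `access_t` record and both function names are hypothetical and are not taken from the tool described in these slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one executed memory access. */
typedef struct {
    const void *addr;  /* concrete address observed at run time */
    bool is_write;
} access_t;

/* Assumed dependence rule: same location and at least one writer. */
bool dependent(access_t x, access_t y)
{
    return x.addr == y.addr && (x.is_write || y.is_write);
}

/* a[j]++ and a[k]-- are both read-modify-write accesses, so whether they
 * are dependent is decided by the concrete values of j and k. */
bool updates_dependent(const int *a, size_t j, size_t k)
{
    assert(a != NULL);
    access_t inc = { &a[j], true };
    access_t dec = { &a[k], true };
    return dependent(inc, dec);
}
```

A static analysis has to answer this for all possible j and k; at run time the two addresses are simply compared.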

13 Computation of “ample” sets in Static POR versus in DPOR
Ample determined using “local” criteria
Current State
Next move of Red process
Nearest Dependent Transition Looking Back
Add Red Process to “Backtrack Set”
This builds the Ample set incrementally based on observed dependencies
Blue is in “Done” set
{ BT }, { Done }
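The "look back for the nearest dependent transition" step can be sketched as follows. This is a simplification under stated assumptions: thread sets are bitmasks, only mutex operations exist, and two transitions count as dependent only when they are lock acquisitions of the same mutex by different threads. The full algorithm of Flanagan and Godefroid also checks whether the thread is enabled in the earlier state, which is omitted here. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_LOCK, OP_UNLOCK } op_t;

/* Hypothetical encoding of one executed transition. */
typedef struct {
    int thread;
    op_t op;
    int lock_id;
} transition_t;

/* One stack entry per visited state: the transition taken from it and the
 * { backtrack }, { done } thread sets stored as bitmasks. */
typedef struct {
    transition_t taken;
    unsigned backtrack;
    unsigned done;
} state_t;

/* Assumed dependence rule for this mutex-only sketch. */
static bool dependent(transition_t a, transition_t b)
{
    return a.thread != b.thread && a.lock_id == b.lock_id
        && a.op == OP_LOCK && b.op == OP_LOCK;
}

/* Scan the stack from the top for the nearest transition dependent with
 * `next`, and add next.thread to the backtrack set of the state it was
 * taken from. Returns that state's index, or -1 if none is found. */
int update_backtrack(state_t *stack, int depth, transition_t next)
{
    assert(next.thread >= 0 && next.thread < 32);
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(stack[i].taken, next)) {
            stack[i].backtrack |= 1u << next.thread;
            return i;
        }
    }
    return -1;
}
```

For a stack holding "t0: lock, t0: unlock" and a next move "t1: lock", the scan stops at index 0 and adds t1 to that state's backtrack set.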

14 Putting it all together …
- We target C/C++ PThread programs
- Instrument the given program (largely automated)
- Run the concurrent program “till the end”
- Record interleaving variants while advancing
- When the number of recorded backtrack points reaches a soft limit, spill work to other nodes
- In one larger example, an 11-hour run was finished in 11 minutes using 64 nodes
- A heuristic to avoid recomputations was essential for speed-up
- First known distributed DPOR
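The 11-hour versus 11-minute figure can be turned into a speedup and a parallel efficiency. The sketch below treats the two rounded durations as exact, which is an assumption, so the results are approximate. The function names are illustrative.

```c
#include <assert.h>

/* Speedup: sequential time divided by parallel time (same unit for both). */
double speedup(double t_sequential, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_sequential / t_parallel;
}

/* Parallel efficiency: speedup divided by the number of nodes used. */
double efficiency(double t_sequential, double t_parallel, int nodes)
{
    assert(nodes > 0);
    return speedup(t_sequential, t_parallel) / nodes;
}
```

With 11 hours (39,600 s) against 11 minutes (660 s) on 64 nodes, this gives a speedup of about 60 and an efficiency of about 0.94.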

15 A Simple DPOR Example {}, {} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t)

16 t0: lock {}, {} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

17 t0: lock t0: unlock {}, {} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

18 t0: lock t0: unlock t1: lock {}, {} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

19 t0: lock t0: unlock t1: lock {t1}, {t0} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

20 t0: lock t0: unlock t1: lock t1: unlock t2: lock {t1}, {t0} {}, {} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

21 t0: lock t0: unlock t1: lock t1: unlock t2: lock {t1}, {t0} {t2}, {t1} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

22 t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock {t1}, {t0} {t2}, {t1} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

23 t0: lock t0: unlock t1: lock t1: unlock t2: lock {t1}, {t0} {t2}, {t1} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

24 t0: lock t0: unlock {t1}, {t0} {t2}, {t1} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

25 t0: lock t0: unlock t2: lock {t1,t2}, {t0} {}, {t1, t2} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

26 t0: lock t0: unlock t2: lock t2: unlock {t1,t2}, {t0} {}, {t1, t2} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example …

27 t0: lock t0: unlock {t1,t2}, {t0} {}, {t1, t2} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

28 {t2}, {t0,t1} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example

29 t1: lock t1: unlock {t2}, {t0, t1} t0: lock(t) unlock(t) t1: lock(t) unlock(t) t2: lock(t) unlock(t) A Simple DPOR Example …

30 For this example, all the paths explored during DPOR For others, it will be a proper subset

31 Idea for parallelization: Explore computations from the backtrack set in other processes. “Embarrassingly Parallel” – it seems so, anyway !

32 We first built a sequential DPOR explorer for C / Pthreads programs, called “Inspect” Multithreaded C/C++ program instrumented program instrumentation Thread library wrapper compile executable thread 1 thread n scheduler request/permit

33 We then made the following observations
- Stateless search does not maintain search history
- Different branches of an acyclic space can be explored concurrently
- Simple master-slave scheme can work here – one load balancer + workers

34 Request unloading idle node id work description report result load balancer We then devised a work-distribution scheme…

35 We got zero speedup! Why? Deeper investigation revealed that multiple nodes ended up exploring the same interleavings

36 Illustration of the problem (1 of 5) t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock {t1}, {t0} {t2}, {t1}

37 Illustration of the problem (2 of 5)
t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock
{t1}, {t0} {t2}, {t1}
Heuristic: Hand off the DEEPEST backtrack point for another node to explore
Reason: Largest number of paths emanate from there
To Node 1

38 Detail of (2 of 5) t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock {t1}, {t0} {t2}, {t1} Node 0 t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock { }, {t0,t1} {t2}, {t1}

39 Detail of (2 of 5) t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock {t1}, {t0} {t2}, {t1} Node 1 Node 0 t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock { }, {t0,t1} {t2}, {t1} t0: lock {t1}, {t0}

40 Detail of (2 of 5)
t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock {t1}, {t0} {t2}, {t1}
Node 1 Node 0
t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock { }, { t0,t1 } {t2}, {t1}
t0: lock { t1 }, {t0}
t1 is forced into DONE set before work is handed to Node 1
Node 1 keeps t1 in backtrack set
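The handoff step on this slide can be sketched as below. It assumes, reading the before and after sets shown for Node 0, that the point handed off is the one nearest the root of the search (index 0 of the stack) and that thread sets are bitmasks. The `sets_t` type and `hand_off` function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* { backtrack }, { done } thread sets of one state, as bitmasks. */
typedef struct {
    unsigned backtrack;
    unsigned done;
} sets_t;

/* Find the state nearest the root that still has a pending backtrack
 * thread, move that thread from the local backtrack set into the local
 * done set, and report it so it can be shipped to another node.
 * Returns the state's index, or -1 if there is nothing to hand off. */
int hand_off(sets_t *stack, int depth, int *thread_out)
{
    assert(stack != NULL && thread_out != NULL);
    for (int i = 0; i < depth; i++) {
        if (stack[i].backtrack == 0)
            continue;
        int t = 0;
        while (((stack[i].backtrack >> t) & 1u) == 0)
            t++;  /* lowest-numbered pending thread */
        stack[i].backtrack &= ~(1u << t);  /* no longer pending locally */
        stack[i].done |= 1u << t;          /* forced into DONE locally */
        *thread_out = t;
        return i;
    }
    return -1;
}
```

Starting from {t1}, {t0} at the first state, the call leaves Node 0 with { }, {t0,t1} there, while the receiving node would start from a copy that still has t1 in its backtrack set.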

41 Illustration of the problem (3 of 5) t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock {t1}, {t0} {t2}, {t1} To Node 1 Decide to do THIS work at Node 0 itself…

42 t0: lock t0: unlock {}, {t0,t1} {t2}, {t1} {t1}, {t0} Illustration of the problem (4 of 5) Being expanded by Node 0 Being expanded by Node 1

43 Illustration of the problem (5 of 5) t0: lock t0: unlock {t2}, {t0,t1} {}, {t2} t2: lock t2: unlock

44 Illustration of the problem (5 of 5) t0: lock t0: unlock {t2}, {t0,t1} {}, {t2} {t1}, {t0} t1: lock t1: unlock t2: lock t2: unlock

45 Illustration of the problem (5 of 5) t0: lock t0: unlock {t2}, {t0,t1} {}, {t2} {t2}, {t0, t1} t1: lock t1: unlock t2: lock t2: unlock t2: lock t2: unlock {}, {t2} Redundancy!

46 New Backtrack Set Computation: Aggressively mark up the stack!
t0: lock t0: unlock t1: lock t1: unlock t2: lock t2: unlock
{t1,t2}, {t0} {t2}, {t1}
- Update the backtrack sets of ALL dependent operations!
- Forms a good allocation scheme
- Does not involve any synchronization
- Redundant work may still be performed
- Likelihood is reduced because a node aggressively “owns” one operation and all its dependents
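The aggressive variant differs from the usual DPOR update only in that the backward scan does not stop at the first hit. The sketch below reuses the same simplifying assumptions as a mutex-only model: bitmask thread sets, and dependence only between lock acquisitions of the same mutex by different threads. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_LOCK, OP_UNLOCK } op_t;

/* Hypothetical encoding of one executed transition. */
typedef struct {
    int thread;
    op_t op;
    int lock_id;
} transition_t;

/* One stack entry: the transition taken plus { backtrack }, { done } sets. */
typedef struct {
    transition_t taken;
    unsigned backtrack;
    unsigned done;
} state_t;

/* Assumed dependence rule for this mutex-only sketch. */
static bool dependent(transition_t a, transition_t b)
{
    return a.thread != b.thread && a.lock_id == b.lock_id
        && a.op == OP_LOCK && b.op == OP_LOCK;
}

/* Add next.thread to the backtrack set of EVERY earlier state whose
 * transition is dependent with `next`, not only the nearest one.
 * Returns the number of states that were marked. */
int mark_all_dependent(state_t *stack, int depth, transition_t next)
{
    assert(next.thread >= 0 && next.thread < 32);
    int marked = 0;
    for (int i = depth - 1; i >= 0; i--) {
        if (dependent(stack[i].taken, next)) {
            stack[i].backtrack |= 1u << next.thread;
            marked++;
        }
    }
    return marked;
}
```

On the slide's trace, a pending "t2: lock" marks both the state before "t1: lock" and the state before "t0: lock", which yields the {t2}, {t1} and {t1,t2}, {t0} sets shown.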

47 Implementation and Evaluation
- Using MPI for communication among nodes
- Did experiments on a 72-node cluster
  - 2.4 GHz Intel Xeon processors, 2 GB memory/node
  - Two (small) benchmarks: the indexer and file system benchmarks used in Flanagan and Godefroid’s DPOR paper
  - Aget: a multithreaded ftp client
  - Bbuf: an implementation of a bounded buffer

48 Sequential Checking Time
Benchmark  Threads  Runs       Time (sec)
fsbench    26       8,192      291.32
indexer    16       32,768     1188.73
aget       6        113,400    5662.96
bbuf       8        1,938,816  39710.43

49 Speedup on indexer & fs (small examples, so diminishing returns beyond 40 nodes)

50 Speedup on aget

51 Speedup on bbuf

52 Conclusions and Future Work
- Method described is VERY promising
- We have an in-situ model checker for MPI programs also! (EuroPVM / MPI 2007)
  - Will be parallelized using MPI for work distribution!
- The C/PThread work needs to be pushed a lot more:
  - Automate instrumentation
  - Try many new examples
  - Improve work-distribution heuristic in response to findings
  - Release tool

53 Questions?

54 Answers !
- Properties: Currently
  - Local “assert”s
  - Deadlocks
  - Uninitialized variables
- No plans for liveness
- Tool release likely in 6 months
- That is a very good question. Let’s talk!

55 Extra Slides

56 Concurrent operations on some database
Class A operations:
pthread_mutex_lock(mutex);
a_count++;
if (a_count == 1)
    pthread_mutex_lock(res);
pthread_mutex_unlock(mutex);
…
pthread_mutex_lock(mutex);
a_count--;
if (a_count == 0)
    pthread_mutex_unlock(res);
pthread_mutex_unlock(mutex);
Class B operations:
pthread_mutex_lock(mutex);
b_count++;
if (b_count == 1)
    pthread_mutex_lock(res);
pthread_mutex_unlock(mutex);
…
pthread_mutex_lock(mutex);
b_count--;
if (b_count == 0)
    pthread_mutex_unlock(res);
pthread_mutex_unlock(mutex);

57 Initial random execution
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex; a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1; b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class A operations: pthread_mutex_lock(mutex); a_count++; if (a_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); … pthread_mutex_lock(mutex); a_count--; if (a_count == 0) pthread_mutex_unlock(res); pthread_mutex_unlock(mutex);

63 Initial random execution
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex; a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1; b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class A operations: pthread_mutex_lock(mutex); a_count++; if (a_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); … pthread_mutex_lock(mutex); a_count--; if (a_count == 0) pthread_mutex_unlock(res); pthread_mutex_unlock(mutex);

67 Initial random execution
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex; a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1; b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class B operations: pthread_mutex_lock(mutex); b_count++; if (b_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); … pthread_mutex_lock(mutex); b_count--; if (b_count == 0) pthread_mutex_unlock(res); pthread_mutex_unlock(mutex);

68 Initial random execution
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex; a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1; b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class B operations: pthread_mutex_lock(mutex); b_count++; if (b_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); … pthread_mutex_lock(mutex); b_count--; if (b_count == 0) pthread_mutex_unlock(res); pthread_mutex_unlock(mutex);

69 Initial random execution
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex; a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1; b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class B operations: pthread_mutex_lock(mutex); b_count++; if (b_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); … pthread_mutex_lock(mutex); b_count--; if (b_count == 0) pthread_mutex_unlock(res); pthread_mutex_unlock(mutex);

70 Dependent operations?
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex; a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1; b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class B operations: pthread_mutex_lock(mutex); b_count++; if (b_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); … pthread_mutex_lock(mutex); b_count--; if (b_count == 0) pthread_mutex_unlock(res); pthread_mutex_unlock(mutex);

71 Start an alternative execution
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex; a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1; b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class A operations: pthread_mutex_lock(mutex); a_count++; if (a_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); … pthread_mutex_lock(mutex); a_count--; if (a_count == 0) pthread_mutex_unlock(res); pthread_mutex_unlock(mutex);

72 Get a deadlock!
a1: acquire mutex; a2: a_count++; a3: a_count == 1; a4: acquire res; a5: release mutex
b1: acquire mutex; b2: b_count++; b3: b_count == 1
a6: acquire mutex; a7: a_count--; a8: a_count == 0; a9: release res; a10: release mutex
b4: acquire res; b5: release mutex; b6: acquire mutex; b7: b_count--; b8: b_count == 0; b9: release res; b10: release mutex
Class A operations: pthread_mutex_lock(mutex); a_count++; if (a_count == 1) pthread_mutex_lock(res); pthread_mutex_unlock(mutex); pthread_mutex_lock(mutex);
Class B operations: pthread_mutex_lock(mutex); b_count++; if (b_count == 1) pthread_mutex_lock(res);
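The deadlocking interleaving can be replayed without actually hanging. The sketch below is an illustration only, not how the tool in these slides works: it runs steps a1 to a5 and b1 to b3 in a single thread on default (non-recursive) mutexes, then uses `pthread_mutex_trylock` to ask whether the next step of each class (b4 and a6) could proceed. The function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if, after a1..a5 and b1..b3, both pending acquisitions
 * (b4: res, a6: mutex) would block, i.e. the state is a deadlock. */
bool replay_reaches_deadlock(void)
{
    pthread_mutex_t mutex, res;
    int a_count = 0, b_count = 0;
    int rc = pthread_mutex_init(&mutex, NULL);
    assert(rc == 0);
    rc = pthread_mutex_init(&res, NULL);
    assert(rc == 0);

    /* a1..a5: class A enters and takes res. */
    pthread_mutex_lock(&mutex);
    a_count++;
    if (a_count == 1)
        pthread_mutex_lock(&res);
    pthread_mutex_unlock(&mutex);

    /* b1..b3: class B takes mutex and finds b_count == 1. */
    pthread_mutex_lock(&mutex);
    b_count++;

    /* b4 needs res (held since a4); a6 needs mutex (held since b1). */
    bool b_blocked = (b_count == 1) && pthread_mutex_trylock(&res) == EBUSY;
    bool a_blocked = pthread_mutex_trylock(&mutex) == EBUSY;

    /* Clean up: release both locks that the replay still holds. */
    pthread_mutex_unlock(&mutex);
    pthread_mutex_unlock(&res);
    pthread_mutex_destroy(&mutex);
    pthread_mutex_destroy(&res);
    return a_blocked && b_blocked;
}
```

Both `trylock` calls report `EBUSY`, which matches the slide: class B holds mutex while waiting for res, and class A holds res while waiting for mutex.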
