Lazy Diagnosis of In-Production Concurrency Bugs

1 Lazy Diagnosis of In-Production Concurrency Bugs
Baris Kasikci, Weidong Cui, Xinyang Ge, Ben Niu

2 Why Does In-Production Bug Diagnosis Matter?
Potential to fix bugs that impact users
Short release cycles make in-house testing challenging
Release cycles can be as frequent as a few times a day [1]

3 Concurrency Bug Diagnosis
Atomicity violation (figure: Thread 1 and Thread 2 on a time axis, write marked W)
Thread 1: if (*x) { y = *x; }
Thread 2: free(x); x = NULL;
Concurrency bug diagnosis requires knowing the order of key events (e.g., memory accesses)
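
To make the role of event order concrete, here is a minimal sketch that replays the slide's two-thread example under different interleavings. The step names, the `replay` helper, and the dictionary standing in for the heap object are illustrative assumptions, not part of the talk.

```python
class NullDereference(Exception):
    """Stands in for the crash caused by dereferencing a NULL pointer."""


def deref(ptr):
    # Model of `*x`: crash if the pointer is NULL.
    if ptr is None:
        raise NullDereference()
    return ptr["value"]


def replay(schedule):
    """Replay one interleaving of the slide's example; return 'ok' or 'crash'."""
    x = {"value": 7}          # hypothetical live object that x points to
    branch_taken = False
    try:
        for step in schedule:
            if step == "T1:check":        # Thread 1: if (*x)
                branch_taken = bool(deref(x))
            elif step == "T1:use":        # Thread 1: y = *x;
                if branch_taken:
                    _y = deref(x)
            elif step == "T2:free":       # Thread 2: free(x); x = NULL;
                x = None
            else:
                raise ValueError(f"unknown step: {step}")
    except NullDereference:
        return "crash"
    return "ok"


serial = replay(["T1:check", "T1:use", "T2:free"])
interleaved = replay(["T1:check", "T2:free", "T1:use"])
```

The same three steps succeed or crash depending only on where Thread 2's write lands, which is why a diagnosis has to recover that order.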

4 Challenges of Concurrency Bug Diagnosis
Diagnosis requires reproducing bugs [PBI, ASPLOS’13] [Gist, SOSP’15]
  Practitioners report that they can fix reproducible bugs [PLATEAU’14]
It may not be possible to reproduce in-production concurrency bugs
  Inputs for reproducing bugs may not be available
  Exposing bugs in production may incur high overhead [RaceMob, SOSP’13]

5 Record/Replay
Tracing fine-grained interleavings incurs high overhead
State-of-the-art record/replay has 28% overhead [DoublePlay, ASPLOS’11]
Figure: atomicity violation timeline for Thread 1 and Thread 2, with read (R) and write (W) events separated by ΔT1 and ΔT2
In theory, ΔT can be on the order of a nanosecond

6 Coarse Interleaving Hypothesis
Study with 54 bugs in 13 systems
Smallest ΔT is 91 microseconds
Figure: atomicity violation timeline for Thread 1 and Thread 2, with read (R) and write (W) events separated by ΔT1 and ΔT2
91 µs / ~1 ns ≈ 10^5
A lightweight, coarse-grained time tracking mechanism can help infer ordering
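
The arithmetic behind the hypothesis can be checked with a short sketch. The 91 µs and ~1 ns figures come from the slide; the 50 µs clock granularity and the helper names are assumptions chosen for illustration.

```python
import math

SMALLEST_DELTA_T_NS = 91_000   # 91 microseconds, smallest gap reported on the slide
THEORETICAL_DELTA_T_NS = 1     # ~1 ns, the theoretical lower bound from the slide

# How much coarser the observed gaps are than the theoretical worst case.
ratio = SMALLEST_DELTA_T_NS / THEORETICAL_DELTA_T_NS
order_of_magnitude = round(math.log10(ratio))


def tick(time_ns, granularity_ns):
    """Coarse timestamp: the index of the clock tick that contains time_ns."""
    return time_ns // granularity_ns


def always_orderable(delta_t_ns, granularity_ns):
    """Two events at least one full tick apart always get distinct coarse timestamps."""
    return delta_t_ns >= granularity_ns


ASSUMED_GRANULARITY_NS = 50_000  # hypothetical 50 us clock, not a value from the talk
coarse_ok = always_orderable(SMALLEST_DELTA_T_NS, ASSUMED_GRANULARITY_NS)
fine_ok = always_orderable(THEORETICAL_DELTA_T_NS, ASSUMED_GRANULARITY_NS)
```

Under these assumptions a 50 µs clock separates events that are 91 µs apart, but could not separate events that are only 1 ns apart.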

7 Lazy Diagnosis and Snorlax
Lazy Diagnosis
  Leverages the coarse interleaving hypothesis
  Hybrid dynamic/static root cause diagnosis technique
Snorlax
  Lazy Diagnosis prototype
  Fully accurate concurrency bug diagnosis (11 bugs in 7 systems)
  Low overhead (always below 2%)

8 Outline
Usage model
Design
Evaluation

9 Current Bug Diagnosis Model
Root cause diagnosis

10 Lazy Diagnosis Usage Model
Figure labels: Root cause; Control-flow trace & timing info; Root cause diagnosis
Control-flow trace speeds up static analysis
Coarse-grained timing information helps determine ordering

12 Lazy Diagnosis
Pipeline: Hybrid Points-to Analysis → Type-based Ranking → Bug Pattern Computation → Statistical Diagnosis

14 Hybrid Points-to Analysis
Failing instruction IF (FAILURE, CRASH): load %Queue*, %fifo
Candidate I1: store i32* %21, %bufSize
Candidate I2: store %Queue* %1, %q
Finds instructions with operands pointing to the same location as the failing instruction’s operand

15 Hybrid Points-to Analysis
Uses the control flow traces to limit the scope of static analysis
  Runs fast, scales to large programs (e.g., httpd, MySQL)
Lazy
  Control flow traces trigger the analysis
Interprocedural
  Bug patterns may span multiple functions
Flow-insensitive
  Discards execution order of instructions for scalability
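
The properties listed above can be sketched as a tiny inclusion-based points-to analysis that only looks at functions appearing in the control flow trace. This is a simplified illustration under stated assumptions, not Snorlax's implementation: the statement encoding, function names, and variable names are all hypothetical.

```python
from collections import defaultdict


def points_to(statements, executed_functions):
    """Flow-insensitive points-to sets, restricted to functions seen in the trace.

    Each statement is (function, kind, dst, src) where kind is
    "addr" (dst = &src) or "copy" (dst = src). Statement order is ignored:
    the loop simply iterates to a fixpoint.
    """
    relevant = [s for s in statements if s[0] in executed_functions]
    pts = defaultdict(set)
    changed = True
    while changed:
        changed = False
        for _func, kind, dst, src in relevant:
            new = {src} if kind == "addr" else pts[src]
            if not new <= pts[dst]:
                pts[dst] |= new
                changed = True
    return pts


def may_alias(pts, failing_operand):
    """Pointers that may point to the same location as the failing operand."""
    target = pts.get(failing_operand, set())
    return sorted(p for p, objs in list(pts.items())
                  if p != failing_operand and objs & target)


# Hypothetical program: variables are shared across functions (interprocedural).
statements = [
    ("init", "addr", "%q", "queue_obj"),
    ("consumer", "copy", "%fifo", "%q"),
    ("init", "addr", "%bufSize", "size_obj"),
    ("unused_path", "copy", "%tmp", "%q"),   # never executed in the failing run
]
traced = may_alias(points_to(statements, {"init", "consumer"}), "%fifo")
whole = may_alias(points_to(statements, {"init", "consumer", "unused_path"}), "%fifo")
```

Restricting the statements to traced functions shrinks both the work and the candidate set: `%tmp` only shows up when the unexecuted function is analyzed as well.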

18 Type-Based Ranking
Failing instruction (FAILURE, CRASH): load %Queue*, %fifo
Candidates before ranking: store i32* %21, %bufSize; store %Queue* %1, %q
After type-based ranking: (1) store %Queue* %1, %q (2) store i32* %21, %bufSize
Ranks highly the instructions operating on types that match the failing instruction's operand type
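
A minimal sketch of this ranking step, assuming each candidate carries the type of its operand as a string. The `Instr` record and `rank_by_type` helper are illustrative names, not taken from the talk.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str           # instruction as shown on the slide
    operand_type: str   # type of the pointer operand (assumed to be known)


def rank_by_type(candidates, failing_type):
    """Put candidates whose operand type matches the failing operand's type first.

    sorted() is stable, so candidates keep their relative order within each group.
    """
    return sorted(candidates, key=lambda c: c.operand_type != failing_type)


candidates = [
    Instr("store i32* %21, %bufSize", "i32*"),
    Instr("store %Queue* %1, %q", "%Queue*"),
]
# The failing instruction on the slide is `load %Queue*, %fifo`.
ranked = rank_by_type(candidates, "%Queue*")
```

Nothing is discarded here; mismatched candidates are only moved down the list, so an imprecise points-to result still reaches the later stages.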

21 Bug Pattern Computation
Figure: bug pattern computation combines the ranked stores (store %Queue* %1, %q and store i32* %21, %bufSize) with the failing load (load %Queue*, %fifo, FAILURE) into two candidate patterns across Thread 1 and Thread 2: Bug Pattern I and Bug Pattern II

22 Bug Pattern Computation
Our implementation uses timing packets in Intel Processor Trace
  Granularity of a few tens of microseconds
  We measured the smallest ΔT between key events as 91 microseconds
Leverages the coarse interleaving hypothesis to establish instruction orders
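
One way to picture how coarse timestamps turn candidates into ordered patterns is the sketch below. The `Event` record, the thread assignments, the timestamps, and the 50 µs granularity are all made-up assumptions for illustration; the actual trace decoding is not shown on the slides.

```python
from typing import NamedTuple


class Event(NamedTuple):
    thread: int
    op: str        # "R" for a read, "W" for a write
    instr: str
    time_us: int   # coarse timestamp in microseconds


def happened_before(a, b, granularity_us):
    """a is ordered before b only if they are at least one clock tick apart."""
    return a.time_us + granularity_us <= b.time_us


def bug_patterns(failing, candidates, granularity_us):
    """Ordered cross-thread (first, second) pairs involving the failing event."""
    patterns = []
    for cand in candidates:
        if cand.thread == failing.thread:
            continue                      # only cross-thread pairs are considered
        if happened_before(cand, failing, granularity_us):
            patterns.append((cand, failing))
        elif happened_before(failing, cand, granularity_us):
            patterns.append((failing, cand))
        # otherwise the clock is too coarse to order this pair; skip it
    return patterns


failing = Event(2, "R", "load %Queue*, %fifo", 1_000)
candidates = [
    Event(1, "W", "store %Queue* %1, %q", 900),
    Event(1, "W", "store i32* %21, %bufSize", 990),
]
patterns = bug_patterns(failing, candidates, granularity_us=50)
summary = [f"{a.op}(T{a.thread}) -> {b.op}(T{b.thread})" for a, b in patterns]
```

In this made-up trace only the first store is far enough from the failing load to be ordered, which yields a single write-then-read pattern.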

25 Statistical identification of failure-predicting patterns
Figure: six executions of Thread 1 and Thread 2, each showing store %Queue* %1, %q and load %Queue*, %fifo together with the run's outcome
Outcomes as listed: FAILURE (CRASH), SUCCESS, SUCCESS, SUCCESS, SUCCESS, FAILURE (CRASH)
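
The slides do not name the statistic used. As one plausible illustration, the sketch below scores each pattern by its F1 measure (harmonic mean of precision and recall) as a predictor of failure. The choice of F1, the pattern labels, and the assignment of patterns to runs are assumptions; only the six outcomes (two failures, four successes) mirror the slide.

```python
def f1_scores(runs):
    """Score each pattern by how well its presence predicts a failing run.

    runs is a list of (observed_patterns, failed) pairs.
    """
    total_failures = sum(1 for _, failed in runs if failed)
    all_patterns = set().union(*(observed for observed, _ in runs))
    scores = {}
    for pattern in all_patterns:
        outcomes = [failed for observed, failed in runs if pattern in observed]
        true_pos = sum(outcomes)                   # failing runs that show the pattern
        precision = true_pos / len(outcomes)
        recall = true_pos / total_failures if total_failures else 0.0
        denom = precision + recall
        scores[pattern] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Hypothetical labels: "W->R" is store-before-load, "R->W" is load-before-store,
# and "both-run" is a pattern that appears in every execution.
runs = [
    ({"W->R", "both-run"}, True),
    ({"R->W", "both-run"}, False),
    ({"R->W", "both-run"}, False),
    ({"R->W", "both-run"}, False),
    ({"R->W", "both-run"}, False),
    ({"W->R", "both-run"}, True),
]
scores = f1_scores(runs)
best = max(scores, key=scores.get)
```

A pattern that shows up in every run has perfect recall but poor precision, so it scores lower than the pattern seen only in failing runs.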

27 Evaluation of Snorlax
Is Snorlax effective?
Is Snorlax accurate?
Is Snorlax efficient?
How does Snorlax compare to its competition?

28 Experimental Setup
Real-world C/C++ programs
11 concurrency bugs
Workloads from the programs’ own test cases and from test cases by other researchers

29 Snorlax’s Effectiveness
Snorlax correctly identified the root causes of 11 bugs
  Determined after manual investigation of developer fixes
A single failure recurrence is enough for root cause diagnosis
  In practice, for concurrency bugs, “event orders” = “root cause”
Snorlax can effectively diagnose concurrency bugs

30 Snorlax’s Accuracy
Figure labels: Contribution; Accuracy
All stages of Lazy Diagnosis are necessary for full accuracy

31 Snorlax’s Efficiency
Figure: percentage overhead (labeled value: 0.97%)
Snorlax has low runtime performance overhead (always below 2%)

32 Snorlax vs. Gist
Figure: percentage overhead (labeled values: 39%, 3%, 0.9%, 1.9%)
Snorlax scales better than Gist with the increasing number of application threads

33 Lazy Diagnosis and Snorlax
Lazy Diagnosis
  Leverages the coarse interleaving hypothesis
  Hybrid dynamic/static root cause diagnosis technique
Snorlax
  Lazy Diagnosis prototype
  Fully accurate concurrency bug diagnosis (11 bugs in 7 systems)
  Low overhead (always below 2%)
  Scales well with the increasing number of threads
Michigan is hiring!

